
Big Data Posing Challenges? Data Governance Offers Solutions

Big Data is causing complexity for many organizations, not just because of the volume of data they’re collecting, but because of the variety of data they’re collecting.

Big Data often consists of unstructured data that streams into businesses from social media networks, internet-connected sensors, and more. But the data operations at many organizations were not designed to handle this flood of unstructured data.

Dealing with the volume, velocity and variety of Big Data is causing many organizations to re-think how they store and govern their data. A perfect example is the data warehouse. The people who built and manage the data warehouse at your organization built something that made sense to them at the time. They understood what data was stored where and why, as well as how it was used by business units and applications.

The era of Big Data introduced inexpensive data lakes to some organizations’ data operations, but as vast amounts of data pour into these lakes, many IT departments found themselves managing a data swamp instead.

In a perfect world, your organization would treat Big Data like any other type of data. But, alas, the world is not perfect. In reality, practicality and human nature intervene. Many new technologies, when first adopted, are separated from the rest of the infrastructure.

“New technologies are often looked at in a vacuum, and then built in a silo,” says Danny Sandwell, director of product marketing for erwin, Inc.

That leaves many organizations with parallel collections of data: one for so-called “traditional” data and one for the Big Data.

There are a few problems with this outcome. For one, silos in IT have a long history of keeping organizations from understanding what they have, where it is, why they need it, and whether it’s of any value. They also have a tendency to increase costs because they don’t share common IT resources, leading to redundant infrastructure and complexity. Finally, silos usually mean increased risk.

But there’s another reason why parallel operations for Big Data and traditional data don’t make much sense: The users simply don’t care.

At the end of the day, your users want access to the data they need to do their jobs, and whether IT considers it Big Data, little data, or medium-sized data isn’t important. What’s most important is that the data is the right data – meaning it’s accurate, relevant and can be used to support or oppose a decision.


How Data Governance Turns Big Data into Just Plain Data

According to a November 2017 survey by erwin and UBM, 21 percent of respondents cited Big Data as a driver of their data governance initiatives.

In today’s data-driven world, data governance can help your business understand what data it has, how good it is, where it is, and how it’s used. The erwin/UBM survey found that 52 percent of respondents said data is critically important to their organization and they have a formal data governance strategy in place. But almost as many respondents (46 percent) said they recognize the value of data to their organization but don’t have a formal governance strategy.

A holistic approach to data governance includes these key components.

  • An enterprise architecture component is important because it aligns IT and the business, mapping a company’s applications and the associated technologies and data to the business functions they enable. By integrating data governance with enterprise architecture, businesses can define application capabilities and interdependencies in the context of enterprise strategy, and then prioritize technology investments that align with business goals and produce the desired outcomes.
  • A business process and analysis component defines how the business operates and ensures employees understand and are accountable for carrying out the processes for which they are responsible. Enterprises can clearly define, map and analyze workflows and build models to drive process improvements, as well as identify business practices susceptible to the greatest security, compliance or other risks and where controls are most needed to mitigate exposures.
  • A data modeling component is the best way to design and deploy new databases with high-quality data sources and support application development. Being able to cost-effectively and efficiently discover, visualize and analyze “any data” from “anywhere” underpins large-scale data integration, master data management, Big Data and business intelligence/analytics with the ability to synthesize, standardize and store data sources from a single design, as well as reuse artifacts across projects.

When data governance is done right, and it’s woven into the structure and architecture of your business, it helps your organization accept new technologies and the new sources of data they provide as they come along. This makes it easier to see ROI and ROO from your Big Data initiatives by managing Big Data in the same manner your organization treats all of its data – by understanding its metadata, defining its relationships, and defining its quality.
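To make that a bit more concrete, here is a rough sketch of the kind of metadata a governed data asset might carry. The structure and field names below are purely illustrative assumptions, not an erwin product schema.

```python
# A rough, hypothetical sketch of the metadata a governed data asset might carry;
# field names are illustrative assumptions, not an erwin product schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GovernedDataAsset:
    name: str                       # e.g. "customer_clickstream"
    source_system: str              # where the data originates
    owner: str                      # accountable business owner
    business_definition: str        # what the data means to the business
    upstream_assets: List[str] = field(default_factory=list)  # lineage / relationships
    quality_score: float = 0.0      # e.g. share of records passing validation rules

clickstream = GovernedDataAsset(
    name="customer_clickstream",
    source_system="web_analytics",
    owner="Digital Marketing",
    business_definition="Page-level interactions captured from the public website",
    upstream_assets=["web_event_stream"],
    quality_score=0.97,
)
print(clickstream)
```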

Furthermore, businesses that apply sound data governance will find themselves with a template or roadmap they can use to integrate Big Data throughout their organizations.

If your business isn’t capitalizing on the Big Data it’s collecting, then it’s throwing away dollars spent on data collection, storage and analysis. Just as bad, however, is a situation where all of that data and analysis is leading to the wrong decisions and poor business outcomes because the data isn’t properly governed.


You can determine how effective your current data governance initiative is by taking erwin’s DG RediChek.


Data Modeling in a Jargon-filled World – In-memory Databases

With the volume and velocity of data increasing, in-memory databases provide a way to keep processing times low.

Traditionally, databases have stored their data on mechanical storage media such as hard disks. While this has contributed to durability, it’s also constrained attainable query speeds. Database and software designers have long realized this limitation and sought ways to harness the faster speeds of in-memory processing.

The traditional approach to database design – and to the analytics solutions that access those databases – includes in-memory caching, which retains a subset of recently accessed data in memory for fast access. While caching often worked well for online transaction processing (OLTP), it was not optimal for analytics and business intelligence. In these cases, the most frequently accessed information – rather than the most recently accessed information – is typically of most interest.
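To make that distinction concrete, here is a minimal, vendor-neutral sketch contrasting a recency-based (LRU) cache with a frequency-based (LFU) one. Neither reflects any particular product's implementation.

```python
# A minimal sketch (not any vendor's implementation) contrasting the two caching
# policies discussed above: recency-based (LRU) vs. frequency-based (LFU).
from collections import OrderedDict, Counter

class LRUCache:
    """Keeps the most *recently* accessed items -- a good fit for OLTP."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

class LFUCacheNaive:
    """Keeps the most *frequently* accessed items -- closer to what analytics needs."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = {}
        self.hits = Counter()

    def get(self, key):
        if key not in self.items:
            return None
        self.hits[key] += 1
        return self.items[key]

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            coldest = min(self.items, key=lambda k: self.hits[k])
            del self.items[coldest]          # evict least frequently used
            del self.hits[coldest]
        self.items[key] = value
        self.hits[key] += 1
```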

That said, loading an entire data warehouse or even a large data mart into memory has been challenging until recent years.

In-Memory

There are a few key factors making in-memory databases and analytics offerings relevant for more and more use cases. One has been the shift to 64-bit operating systems, which makes much more addressable memory available. And, as one might assume, the availability of increasingly large and affordable memory has also played a part.

Database and software developers have begun to take advantage of in-memory databases in a myriad of ways. These include the many key-value stores such as Amazon DynamoDB, which provide very low latency for IoT and a host of other use cases.
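As a minimal illustration of that key-value access pattern, a DynamoDB read and write via boto3 might look like the sketch below. The table name, key schema, and region are hypothetical assumptions, and an existing table plus AWS credentials would be required to run it.

```python
# Minimal sketch of key-value access against Amazon DynamoDB via boto3.
# The table name, key schema, and region are hypothetical assumptions;
# an existing table and AWS credentials are required for this to run.
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("device_readings")  # hypothetical table

# Write a single sensor reading keyed by device id and timestamp.
# (DynamoDB numbers are passed as int or Decimal, not float.)
table.put_item(Item={
    "device_id": "sensor-042",
    "ts": 1700000000,
    "temperature_c": Decimal("21.7"),
})

# Low-latency point read by primary key -- the typical IoT access pattern.
response = table.get_item(Key={"device_id": "sensor-042", "ts": 1700000000})
print(response.get("Item"))
```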

Businesses are also taking advantage of in-memory technology in offerings ranging from distributed in-memory NoSQL databases such as Aerospike to in-memory NewSQL databases such as VoltDB. However, for the remainder of this post, we’ll touch in more detail on several solutions with which you might be more familiar.

Some database vendors have chosen to build hybrid solutions that incorporate in-memory technologies. They aim to bridge in-memory with solutions based on tried-and-true, disk-based RDBMS technologies. Such vendors include Microsoft with its incorporation of xVelocity into SQL Server, Analysis Services and PowerPivot, and Teradata with its Intelligent Memory.

Other vendors, like IBM with its dashDB database, have chosen to deploy in-memory technology in the cloud, while capitalizing on previously developed or acquired technologies (in-database analytics from Netezza in the case of dashDB).

However, probably the most high-profile application of in-memory technology has been SAP’s significant bet on its HANA in-memory database, which first shipped in late 2010. SAP has since made it available in the cloud through its SAP HANA Cloud Platform and on Microsoft Azure, and it has released a comprehensive application suite called S/4HANA.

Like most of the analytics-focused in-memory databases and analytics tools, HANA stores data in a column-oriented, in-memory database. The primary rationale for a column-oriented approach to storing data in memory is that in analytic use cases, where data is queried but not updated, it often allows for very impressive compression of the data values in each column. This means much less memory is used, resulting in even higher throughput and less need for expensive memory.
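Here is a toy sketch of why that works. This is not HANA's actual engine, just the general idea: values within a single column tend to repeat, so dictionary and run-length encoding shrink them dramatically.

```python
# A toy sketch (not HANA's actual engine) of why column-oriented, in-memory
# storage compresses well: values in one column tend to repeat, so dictionary
# and run-length encoding shrink them dramatically.
from itertools import groupby

# Row-oriented view of a small orders table (illustrative data).
rows = [
    {"order_id": 1, "country": "DE", "status": "SHIPPED"},
    {"order_id": 2, "country": "DE", "status": "SHIPPED"},
    {"order_id": 3, "country": "US", "status": "SHIPPED"},
    {"order_id": 4, "country": "US", "status": "OPEN"},
]

# Column-oriented view: one list per column.
status_column = [r["status"] for r in rows]

def dictionary_encode(values):
    """Replace repeated strings with small integer codes plus a lookup dictionary."""
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

dictionary, codes = dictionary_encode(status_column)
print(dictionary)                # {'SHIPPED': 0, 'OPEN': 1}
print(run_length_encode(codes))  # [(0, 3), (1, 1)]
```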

So what approach should a data architect adopt? Are Microsoft, Teradata and other “traditional” RDBMS vendors correct with their hybrid approach?

As memory gets cheaper by the day, and the value of rapid insights increases by the minute, should we host the whole data warehouse or data mart in memory, as vendors such as SAP and IBM suggest?

It depends on the specific use case, data volumes, business requirements, budget, etc. One thing that is not in dispute is that all the major vendors recognize that in-memory technology adds value to their solutions. And that extends beyond the database vendors to analytics tool stalwarts like Tableau and newer arrivals like Yellowfin.

It is incumbent upon enterprise architects to learn about the relative merits of the different approaches championed by the various vendors and to select the best fit for their specific situation. This is admittedly not easy, given the pace of adoption of in-memory databases and the variety of approaches being taken.

But there’s a silver lining to the creative disruption caused by the increasing adoption of in-memory technologies. Because of the sheer speed of the various solutions on offer, many organizations are finding that the need to pre-aggregate data to achieve certain performance targets for specific analytics workloads is disappearing. The same goes for the need to de-normalize database designs to achieve specific analytics performance targets.

Instead, organizations are finding that it’s more important to create comprehensive atomic data models that are flexible and independent of any assumed analytics workload.

Perhaps surprisingly to some, third normal form (3NF) is once again not an unreasonable standard of data modeling for modelers who plan to deploy to a pure in-memory or in-memory-augmented platform.

Organizations can forgo the time-consuming effort of modeling and transforming data to support specific analytics workloads, which are likely to change over time anyway. They can also stop worrying about de-normalizing and tuning an RDBMS for those same fickle and variable workloads, and instead focus on creating a logical data model of the business that reflects business information requirements and relationships in a flexible, detailed format that doesn’t assume specific aggregations and transformations.
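Here is a small, illustrative sketch of that shift: the join and aggregation over a normalized model happen at query time rather than being baked into a pre-aggregated, denormalized table. The table and column names are made up for the example.

```python
# A small sketch of querying a normalized (3NF-style) model directly:
# the join and aggregation happen at query time instead of being baked
# into a denormalized, pre-aggregated table. Names are illustrative only.
from collections import defaultdict

customers = {1: {"name": "Acme", "region": "EMEA"},
             2: {"name": "Globex", "region": "AMER"}}

orders = [{"order_id": 10, "customer_id": 1, "amount": 120.0},
          {"order_id": 11, "customer_id": 1, "amount": 80.0},
          {"order_id": 12, "customer_id": 2, "amount": 200.0}]

# "Join" orders to customers and aggregate revenue per region on the fly.
revenue_by_region = defaultdict(float)
for order in orders:
    region = customers[order["customer_id"]]["region"]
    revenue_by_region[region] += order["amount"]

print(dict(revenue_by_region))   # {'EMEA': 200.0, 'AMER': 200.0}
```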

The blinding speed of in-memory technologies provides the aggregations, joins and other transformations on the fly, without the onerous performance penalties we have historically experienced with very large data volumes on disk-only-based solutions. As a long-time data modeler, I like the sound of that. And so far in my experience with many of the solutions mentioned in this post, the business people like the blinding speed and flexibility of these new in-memory technologies!

Please join us next time for the final installment of our series, Data Modeling in a Jargon-filled World – The Logical Data Warehouse. We’ll discuss an approach to data warehousing that uses some of the technologies and approaches we’ve discussed in the previous six installments while embracing “any data, anywhere.”


Data Vault Modeling & the Data Warehouse

The Data Vault method for modeling the data warehouse was born of necessity. Data warehouse projects typically have to contend with long implementation times, which means business requirements are more likely to change during the project, jeopardizing its target timelines and costs.

To improve implementation times, Dan Linstedt introduced the Data Vault method for modeling the core warehouse. The key design principle is to separate the business key, its context, and its relationships into distinct table types: hubs, satellites, and links.
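To make those three building blocks concrete, here is a minimal sketch for a hypothetical "Customer" entity and a "Customer places Order" relationship. The table and column names follow common Data Vault conventions but are illustrative only.

```python
# A minimal sketch of the hub / satellite / link separation for a hypothetical
# "Customer" entity and a "Customer places Order" relationship. Table and
# column names follow common Data Vault conventions but are illustrative only.

hub_customer = """
CREATE TABLE hub_customer (
    hub_customer_key  CHAR(32)     NOT NULL,  -- hash of the business key
    customer_number   VARCHAR(20)  NOT NULL,  -- the business key itself
    load_date         TIMESTAMP    NOT NULL,
    record_source     VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hub_customer_key)
);"""

sat_customer = """
CREATE TABLE sat_customer_details (
    hub_customer_key  CHAR(32)     NOT NULL,  -- reference back to the hub
    load_date         TIMESTAMP    NOT NULL,
    name              VARCHAR(100),           -- descriptive context lives here
    segment           VARCHAR(20),
    record_source     VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hub_customer_key, load_date)
);"""

link_customer_order = """
CREATE TABLE link_customer_order (
    link_customer_order_key  CHAR(32)    NOT NULL,  -- hash of the related keys
    hub_customer_key         CHAR(32)    NOT NULL,
    hub_order_key            CHAR(32)    NOT NULL,
    load_date                TIMESTAMP   NOT NULL,
    record_source            VARCHAR(50) NOT NULL,
    PRIMARY KEY (link_customer_order_key)
);"""

print(hub_customer, sat_customer, link_customer_order)
```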


Data Vault modeling is currently the established standard for modeling the core data warehouse because of the many benefits it offers. These include the following:

Data Vault Benefits

• Easy extensibility enables an agile project approach
• The models created are highly scalable
• The loading processes can be optimally parallelized because there are few synchronization points
• The models are easy to audit

But alongside the many benefits, Data Vault projects also present a number of challenges. These include, but are not limited to, the following:

Data Vault Drawbacks

• A vast increase in the number of data objects (tables, columns) as a result of separating the information types and enriching them with meta information for loading
• A correspondingly greater modeling effort, much of it consisting of repetitive, mechanical tasks

How can these challenges be mastered using a standard data modeling tool?

The highly schematic structure of the models makes them ideal candidates for automated generation. This allows sizable parts of the modeling process to be automated, enabling Data Vault projects to be accelerated dramatically.


Potential for Automating Data Vault

Which specific parts of the model can be automated?

The standard architecture of a data warehouse includes the following layers:

  • Source system: Operational system, such as ERP or CRM systems
  • Staging area: This is where the data is delivered from the operational systems. The structure of the data model generally corresponds to the source system, with enhancements for documenting loading.
  • Core warehouse: The data from various systems is integrated here. This layer is modeled in accordance with Data Vault and is subdivided into the raw vault and business vault areas. This involves implementing all business rules in the business vault so that only very simple transformations are used in the raw vault.
  • Data marts: The structure of the data marts is based on the analysis requirements and is modeled as a star schema.

Standard Architecture of a Data Vault

Both the staging area and the raw vault are very well suited for automation, as clearly defined derivation rules can be established from the preceding layer.
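As a simplified illustration of such a derivation rule (this is not MODGEN itself), a staging table can be derived mechanically from a source table by copying its columns and appending standard load-metadata columns:

```python
# A simplified sketch (not MODGEN itself) of the kind of derivation rule that
# makes the staging layer easy to automate: copy the source table's structure
# and append standard load-metadata columns. All names are illustrative.

SOURCE_TABLE = {
    "name": "crm_customer",
    "columns": [("customer_number", "VARCHAR(20)"),
                ("name", "VARCHAR(100)"),
                ("segment", "VARCHAR(20)")],
}

LOAD_METADATA_COLUMNS = [("load_date", "TIMESTAMP"),
                         ("record_source", "VARCHAR(50)")]

def derive_staging_table(source_table: dict) -> dict:
    """Derive a staging table definition from the source system model."""
    return {
        "name": f"stg_{source_table['name']}",
        "columns": source_table["columns"] + LOAD_METADATA_COLUMNS,
        # Vertical model information: remember where each column came from.
        "lineage": {col: f"{source_table['name']}.{col}"
                    for col, _ in source_table["columns"]},
    }

staging = derive_staging_table(SOURCE_TABLE)
print(staging["name"])      # stg_crm_customer
print(staging["columns"])
print(staging["lineage"])
```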

Should automation be implemented using a standard modeling tool or using a specialized data warehouse automation tool?

Automation potential can generally be leveraged using special automation tools.

What are the arguments in favor of using a standard tool such as the erwin Data Modeler?

Using a standard modeling tool offers many benefits:

  • erwin Data Modeler often already contains models (for example, of source systems) that can continue to be used
  • The modeling functions are highly sophisticated – for example, for comparing models and for standardization within models
  • A wide range of databases are supported as standard
  • A large number of interfaces are available for importing models from other tools
  • Often the tool has already been used to model source systems or other warehouses
  • The model range can be used to model the entire enterprise architecture, not only the data warehouse (erwin Web Portal)
  • Business glossaries enable (existing) semantic information to be integrated

So far so good. But can the erwin Data Modeler generate models?

A special add-in for the erwin Data Modeler has been developed specifically to meet this requirement: MODGEN. This enables the potential for automation in erwin to be exploited to the full.

It integrates seamlessly into the erwin user interface and, in terms of operation, is heavily based on comparing models (complete compare).

MODGEN functionalities

The following specific functionalities are implemented in MODGEN:

  • Generation of staging and raw vault models based on the model of the preceding layer
  • Generation is controlled by enriching the particular preceding model with meta-information, which is stored in user-defined properties (UDPs)
  • Individual objects can be excluded from the generation process permanently or interactively
  • Specifications for meta-columns can be integrated very easily using templates

To support a modeling process that can be repeated multiple times, during which iterative models are created or enhanced, it is essential that generation be round-trip capable.

To achieve this, the generation always compares the source and target models and indicates any differences, which the user can then select and apply as part of the generation.

The generation not only takes all the tables and columns into consideration as a matter of course (horizontal modeling), it also creates vertical model information.

This means the relationship of every generated target column to its source column is documented. Source-to-target mappings can therefore be generated very easily from the model.
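Illustratively (building on the hypothetical staging example above), once every target column carries a reference to its source column, a source-to-target mapping listing can be produced directly from the model:

```python
# Illustrative only: if every generated column carries a reference to its source
# column (the "vertical" model information described above), a source-to-target
# mapping document can be produced directly from the model.
column_lineage = {
    "stg_crm_customer.customer_number": "crm_customer.customer_number",
    "stg_crm_customer.name":            "crm_customer.name",
    "stg_crm_customer.segment":         "crm_customer.segment",
}

for target, source in sorted(column_lineage.items()):
    print(f"{source:40s} -> {target}")
```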

Integrating the source and target model into a web portal automatically makes the full impact and lineage analysis functionality available.

If you are interested in finding out more, or if you would like to experience MODGEN live, please contact our partner heureka.


Author details: Stefan Kausch, heureka e-Business GmbH
Stefan Kausch is the CEO and founder of heureka e-Business GmbH, a company focused on IT consultancy and software development.

Stefan has more than 15 years’ experience as a consultant, trainer, and educator and has developed and delivered data modeling processes and data governance initiatives for many different companies.

He has successfully executed many projects for customers, primarily developing application systems, data warehouse automation solutions and ETL processes. Stefan Kausch has in-depth knowledge of application development based on data models.

Contact:
Stefan Kausch
heureka e-Business GmbH
Untere Burghalde 69
71229 Leonberg

Tel.: 0049 7152 939310
Email: heureka@heureka.com
Web: www.heureka.com