
Data Modeling in a Jargon-filled World – In-memory Databases

With the volume and velocity of data increasing, in-memory databases provide a way to keep query response times low and processing speeds high.

Traditionally, databases have stored their data on mechanical storage media such as hard disks. While this has contributed to durability, it has also constrained attainable query speeds. Database and software designers have long recognized this limitation and sought ways to harness the faster speeds of in-memory processing.

The traditional approach to database design – and analytics solutions to access them – includes in-memory caching, which retains a subset of recently accessed data in memory for fast access. While caching often worked well for online transaction processing (OLTP), it was not optimal for analytics and business intelligence. In these cases, the most frequently accessed information – rather than the most recently accessed information – is typically of most interest.
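To make that caching pattern concrete, here is a minimal sketch of a least-recently-used (LRU) cache sitting in front of a slower disk-based lookup. The fetch_from_disk callback and the capacity of 1,024 entries are illustrative assumptions, not part of any particular product.

```python
from collections import OrderedDict

class LRUCache:
    """Keep the most recently accessed values in memory; evict the oldest."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, fetch_from_disk):
        if key in self._data:
            self._data.move_to_end(key)      # mark as most recently used
            return self._data[key]
        value = fetch_from_disk(key)         # slow path: hit the disk-based store
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value
```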

That said, loading an entire data warehouse or even a large data mart into memory has been challenging until recent years.

In-Memory

There are a few key factors making in-memory databases and analytics offerings relevant for more and more use cases. One has been the shift to 64-bit operating systems, which makes much more addressable memory available. And as one might assume, the availability of increasingly large and affordable memory has also played a part.

Database and software developers have begun to take advantage of in-memory databases in a myriad of ways. These include the many key-value stores such as Amazon DynamoDB, which provide very low latency for IoT and a host of other use cases.
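As a rough illustration of that key-value access pattern, the sketch below writes and reads a single item in DynamoDB using boto3. The table name device_readings and its attributes are hypothetical; only the API calls themselves are real.

```python
from decimal import Decimal
import boto3

# Reference a (hypothetical) table keyed on device_id.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("device_readings")

# Write one reading; the boto3 resource API expects Decimal for numeric values.
table.put_item(Item={"device_id": "sensor-42", "temperature_c": Decimal("21.7")})

# Read it back with a single low-latency key lookup.
response = table.get_item(Key={"device_id": "sensor-42"})
print(response.get("Item"))
```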

Businesses are also taking advantage of in-memory through distributed in-memory NoSQL databases such as Aerospike and in-memory NewSQL databases such as VoltDB. However, for the remainder of this post, we’ll touch in more detail on several solutions with which you might be more familiar.

Some database vendors have chosen to build hybrid solutions that incorporate in-memory technologies. They aim to bridge in-memory with solutions based on tried-and-true, disk-based RDBMS technologies. Such vendors include Microsoft with its incorporation of xVelocity into SQL Server, Analysis Services and PowerPivot, and Teradata with its Intelligent Memory.

Other vendors, like IBM with its dashDB database, have chosen to deploy in-memory technology in the cloud, while capitalizing on previously developed or acquired technologies (in-database analytics from Netezza in the case of dashDB).

However, probably the most high-profile application of in-memory technology has been SAP’s significant bet on its HANA in-memory database, which first shipped in late 2010. SAP has since made it available in the cloud through its SAP HANA Cloud Platform and on Microsoft Azure, and it has released a comprehensive application suite called S/4HANA.

Like most analytics-focused in-memory databases and analytics tools, HANA stores data in column-oriented, in-memory structures. The primary rationale for taking a column-oriented approach to storing data in memory is that in analytic use cases, where data is queried but not updated, it allows for often very impressive compression of the values in each column. This means much less memory is used, resulting in even higher throughput and less need for expensive memory.
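A toy example shows why column stores compress so well: values within a single column repeat heavily, so a simple dictionary encoding can replace them with small integer codes. The country column below is made up purely for illustration.

```python
# A low-cardinality column, as it might appear in an analytic table.
column = ["US", "US", "DE", "US", "DE", "FR", "US", "US"]

dictionary = sorted(set(column))                 # each distinct value stored once
codes = [dictionary.index(v) for v in column]    # one small integer code per row

print(dictionary)   # ['DE', 'FR', 'US']
print(codes)        # [2, 2, 0, 2, 0, 1, 2, 2]
```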

So what approach should a data architect adopt? Are Microsoft, Teradata and other “traditional” RDBMS vendors correct with their hybrid approach?

As memory gets cheaper by the day, and the value of rapid insights increases by the minute, should we host the whole data warehouse or data mart in memory, as vendors such as SAP and IBM advocate?

It depends on the specific use case, data volumes, business requirements, budget, etc. One thing that is not in dispute is that all the major vendors recognize that in-memory technology adds value to their solutions. And that extends beyond the database vendors to analytics tool stalwarts like Tableau and newer arrivals like Yellowfin.

It is incumbent upon enterprise architects to learn about the relative merits of the different approaches championed by the various vendors and to select the best fit for their specific situation. This is admittedly not easy, given the pace of adoption of in-memory databases and the variety of approaches being taken.

But there’s a silver lining to the creative disruption caused by the increasing adoption of in-memory technologies. Because of the sheer speed of the various solutions on offer, many organizations are finding that the need to pre-aggregate data to achieve certain performance targets for specific analytics workloads is disappearing. The same goes for the need to de-normalize database designs to achieve specific analytics performance targets.

Instead, organizations are finding that it’s more important to create comprehensive atomic data models that are flexible and independent of any assumed analytics workload.

Perhaps surprisingly to some, third normal form (3NF) is once again not an unreasonable standard of data modeling for modelers who plan to deploy to a pure in-memory or in-memory-augmented platform.

Organizations can forgo the time-consuming effort of modeling and transforming data to support specific analytics workloads, which are likely to change over time anyway. They can also stop worrying about de-normalizing and tuning an RDBMS for those same fickle and variable workloads, and focus instead on creating a logical data model of the business that reflects business information requirements and relationships in a flexible, detailed format that doesn’t assume specific aggregations and transformations.

The blinding speed of in-memory technologies delivers the aggregations, joins and other transformations on the fly, without the onerous performance penalties we have historically experienced with very large data volumes on disk-only solutions. As a long-time data modeler, I like the sound of that. And so far, in my experience with many of the solutions mentioned in this post, business people like the blinding speed and flexibility of these new in-memory technologies too!
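As a minimal sketch of this idea, the example below uses SQLite’s in-memory mode to stand in for a far more capable in-memory platform: the model stays normalized, and the join and aggregation happen at query time rather than being pre-computed. Table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);
    INSERT INTO customer VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO sale VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# No pre-aggregated summary table: the join and GROUP BY run on the fly.
for row in con.execute("""
        SELECT c.region, SUM(s.amount)
        FROM sale s JOIN customer c ON c.customer_id = s.customer_id
        GROUP BY c.region"""):
    print(row)   # e.g. ('APAC', 75.0) and ('EMEA', 150.0)
```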

Please join us next time for the final installment of our series, Data Modeling in a Jargon-filled World – The Logical Data Warehouse. We’ll discuss an approach to data warehousing that uses some of the technologies and approaches we’ve discussed in the previous six installments while embracing “any data, anywhere.”


Data Modeling in a Jargon-filled World – Managed Data Lakes

More and more businesses are adopting managed data lakes.

Earlier in this blog series, we established that leading organizations are adopting a variety of approaches to manage data, including data that may be sourced from a wide range of NoSQL, NewSQL, RDBMS and unstructured sources.

In this post, we’ll discuss managed data lakes and their applications as a hybrid of less structured data and more traditionally structured relational data. We’ll also talk about whether there’s still a need for data modeling and metadata management.

The term Data Lake was first coined by James Dixon of Pentaho in a blog entry in which he said:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Use of the term quickly took on a life of its own, with often divergent meanings. So much so that four years later, Mr. Dixon felt compelled to refute some criticisms from the analyst community by pointing out that they were objecting to things he had never actually said about data lakes.

However, in my experience and despite Mr. Dixon’s objections, the notion that a data lake can contain data from more than one source is now widely accepted.

Similarly, while most early data lake implementations used Hadoop with many vendors pitching the idea that a data lake had to be implemented as a Hadoop data store, the notion that data lakes can be implemented on non-Hadoop platforms, such as Azure Blob storage or Amazon S3, has become increasingly widespread.

So a data lake – as the term is widely used in 2017 – is a detailed (non-aggregated) data store that can contain structured and/or non-structured data from more than one source implemented on some kind of inexpensive, massively scalable storage platform.

But what are “managed data lakes?”

To answer that question, let’s first touch on why many early data lake projects failed or significantly missed expectations. Criticisms were quick to arise, many of them aimed at data lakes that had strayed from the original vision described above.

Vendors seized on data lakes as a marketing tool and, as often happens in our industry, promised they could do almost anything. As long as you poured your data into the lake, people in the organization would somehow magically find exactly the data they needed just when they needed it. As is usually the case, the reality for most organizations turned out to be quite different, for three important reasons:

  1. Most large organizations’ analysts didn’t have the skillsets to wade through the rapidly accumulating pool of information in Hadoop (or whichever new platform had been chosen to implement their data lake) to locate the data they needed.
  2. Not enough attention was paid to the need to provide metadata to help people find the data they needed.
  3. Most interesting analytics are a result of integrating disparate data points to draw conclusions, and integration had not been an area of focus in most data lake implementations.

In the face of growing disenchantment with data lake implementations, some organizations and vendors pivoted to address these drawbacks. They did so by embracing what is most commonly called a managed data lake, though some prefer the label “curated data lake” or “modern data warehouse.”

The idea is to address the three criticisms mentioned above by developing an architectural approach that allows for the use of SQL, making data more accessible and providing more metadata about the data available in the data lake. It also takes on some of the challenging work of integration and transformation that earlier data lake implementations had hoped to kick down the road or avoid entirely.

The result in most implementations of a managed data lake is a hybrid that tries to blend the strengths of the original data lake concept with the strengths of traditional large-scale data warehousing (as opposed to the narrow data mart approach Mr. Dixon used as a foil when originally describing data lakes).

Incoming data, either structured or unstructured, can be easily and quickly loaded from many different sources (e.g., applications, IoT, third parties, etc.). The data can be accumulated with minimal processing at reasonable cost using a bulk storage platform such as Hadoop, Azure Blob storage or Amazon S3.
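For illustration, here is one hedged sketch of that landing step using Amazon S3 via boto3 (Azure Blob storage or HDFS would follow the same pattern). The bucket name and key prefix are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

# A raw event as it might arrive from an IoT source (made-up payload).
raw_event = {"device_id": "sensor-42", "reading": 21.7, "ts": "2017-06-01T12:00:00Z"}

# Land the event as-is, with minimal processing, under a date-partitioned prefix.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/iot/2017/06/01/sensor-42.json",
    Body=json.dumps(raw_event).encode("utf-8"),
)
```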

Then the data that is widely used within the organization can be integrated and made available through a SQL or SQL-like interface, ranging from Hive to Postgres to a tried-and-true commercial relational database such as SQL Server (or its cloud-based cousin, Azure SQL Data Warehouse).
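As one illustration of such a SQL-like interface, the sketch below uses Spark SQL to expose Parquet files in the lake as a queryable view; Hive, another SQL-on-Hadoop engine or the relational core itself could play the same role. The file path and schema are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("managed-lake-sql").getOrCreate()

# Expose curated Parquet files in the lake as a view...
orders = spark.read.parquet("s3a://example-data-lake/curated/orders/")
orders.createOrReplaceTempView("orders")

# ...so analysts can query them with plain SQL.
spark.sql("""
    SELECT customer_id, SUM(order_total) AS lifetime_value
    FROM orders
    GROUP BY customer_id
""").show()
```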

In this scenario, a handful of self-sufficient data scientists may wade (or swim or dive) in the surrounding data lake. However, most analysts in most organizations still spend most of their time using familiar SQL-capable tools to analyze data stored in the core of the managed data lake – an island in the lake if we really want to torture the analogy – which is typically implemented either using an RDBMS or a relational layer like Hive on top of the bulk-storage layer.

It’s important to note that these are not two discrete silos. Most major vendors have added capabilities to their database and BI offerings to enable analysis of both RDBMS-based and bulk-storage layer data through a familiar SQL interface.

This enables a much larger percentage of an organization’s analysts to access data both in the core and the less structured surrounding lake, using tools with which they’re already familiar.

As this hybrid managed data lake approach incorporates a relational core, robust data modeling capabilities are as important as ever. The same goes for data governance and a thorough focus on metadata to provide clear naming and definitions to assist in finding and linking with the most appropriate data.

This is true whether inside the structured relational core of the managed data lake or in the surrounding, more fluid data lake.

As you probably guessed from some of the links in this post, more and more managed data lakes are being implemented in the cloud. Please join us next time for the fifth installment in our series: Data Modeling in a Jargon-filled World – The Cloud.


Data Vault Modeling & the Data Warehouse

The data vault method for modeling the data warehouse was born of necessity. Data warehouse projects classically have to contend with long implementation times. This means that business requirements are more likely to change in the course of the project, jeopardizing the achievement of target implementation times and costs for the project.

To improve implementation times, Dan Linstedt introduced the Data Vault method for modeling the core warehouse. The key design principle is to separate the business key, context, and relationships into distinct tables: hubs, satellites, and links.
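To make that separation concrete, here is a minimal sketch of hub, satellite and link tables expressed as DDL against an in-memory SQLite database. The names follow common Data Vault conventions but are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hub: just the business key plus load metadata
    CREATE TABLE hub_customer (
        hub_customer_key TEXT PRIMARY KEY,   -- hash of the business key
        customer_number  TEXT NOT NULL,      -- business key
        load_date        TEXT NOT NULL,
        record_source    TEXT NOT NULL
    );

    -- Satellite: descriptive context, historized by load date
    CREATE TABLE sat_customer_details (
        hub_customer_key TEXT NOT NULL REFERENCES hub_customer(hub_customer_key),
        load_date        TEXT NOT NULL,
        name             TEXT,
        city             TEXT,
        PRIMARY KEY (hub_customer_key, load_date)
    );

    -- Link: relationship between two hubs
    CREATE TABLE link_customer_order (
        link_customer_order_key TEXT PRIMARY KEY,
        hub_customer_key        TEXT NOT NULL,
        hub_order_key           TEXT NOT NULL,
        load_date               TEXT NOT NULL,
        record_source           TEXT NOT NULL
    );
""")
```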

Data vault

Data Vault modeling is currently the established standard for modeling the core data warehouse because of the many benefits it offers. These include the following:

Data Warehouse Pros & Cons

Data Warehouse Benefits

• Easy extensibility enables an agile project approach
• The models created are highly scalable
• The loading processes can be optimally parallelized because there are few synchronization points
• The models are easy to audit

But alongside the many benefits, Data Vault projects also present a number of challenges. These include, but are not limited to, the following:

Data Vault Drawbacks

• A vast increase in the number of data objects (tables, columns) as a result of separating the information types and enriching them with meta-information for loading
• Correspondingly greater modeling effort, comprising numerous unsophisticated, mechanical tasks

How can these challenges be mastered using a standard data modeling tool?

The highly schematic structure of the models offers ideal prerequisites for generating models. This allows sizable parts of the modeling process to be automated, enabling Data Vault projects to be accelerated dramatically.
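As a simplified illustration of what such generation can look like (not a description of any specific tool), the sketch below derives hub and satellite DDL mechanically from a small metadata description of a source table. The metadata format and naming rules are assumptions.

```python
# Hypothetical metadata describing one source table.
source_table = {
    "name": "customer",
    "business_key": "customer_number",
    "attributes": ["name", "city", "segment"],
}

def generate_hub_ddl(meta):
    return (
        f"CREATE TABLE hub_{meta['name']} (\n"
        f"    hub_{meta['name']}_key TEXT PRIMARY KEY,\n"
        f"    {meta['business_key']} TEXT NOT NULL,\n"
        f"    load_date TEXT NOT NULL,\n"
        f"    record_source TEXT NOT NULL\n)"
    )

def generate_sat_ddl(meta):
    cols = ",\n    ".join(f"{a} TEXT" for a in meta["attributes"])
    return (
        f"CREATE TABLE sat_{meta['name']} (\n"
        f"    hub_{meta['name']}_key TEXT NOT NULL,\n"
        f"    load_date TEXT NOT NULL,\n"
        f"    {cols},\n"
        f"    PRIMARY KEY (hub_{meta['name']}_key, load_date)\n)"
    )

print(generate_hub_ddl(source_table))
print(generate_sat_ddl(source_table))
```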


Potential for Automating Data Vault

Which specific parts of the model can be automated?

The standard architecture of a data warehouse includes the following layers:

  • Source system: An operational system, such as an ERP or CRM system
  • Staging area: This is where the data is delivered from the operational systems. The structure of the data model generally corresponds to the source system, with enhancements for documenting loading.
  • Core warehouse: The data from various systems is integrated here. This layer is modeled in accordance with Data Vault and is subdivided into the raw vault and business vault areas. This involves implementing all business rules in the business vault so that only very simple transformations are used in the raw vault.
  • Data marts: The structure of the data marts is based on the analysis requirements and is modeled as a star schema.

Standard Architecture of a Data Vault

Both the staging area and the raw vault are very well suited for automation, as clearly defined derivation rules can be established from the preceding layer.
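A small sketch of one such derivation rule: the raw vault hub row is derived from the staged business key by hashing it and adding load metadata. The MD5 hashing convention and column names are common Data Vault practice, assumed here purely for illustration.

```python
import hashlib
from datetime import datetime, timezone

def derive_hub_row(business_key, record_source):
    # Normalize the business key, then hash it to form the hub key.
    hub_key = hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()
    return {
        "hub_customer_key": hub_key,
        "customer_number": business_key,
        "load_date": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
    }

print(derive_hub_row("C-1001", "CRM"))
```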

Should automation be implemented using a standard modeling tool or using a specialized data warehouse automation tool?

Automation potential can generally be leveraged using special automation tools.

What are the arguments in favor of using a standard tool such as the erwin Data Modeler?

Using a standard modeling tool offers many benefits:

  • The erwin Data Modeler generally already contains models (for example, of source systems) that can continue to be used
  • The modeling functions are highly sophisticated – for example, for comparing models and for standardization within models
  • A wide range of databases are supported as standard
  • A large number of interfaces are available for importing models from other tools
  • Often the tool has already been used to model source systems or other warehouses
  • The model range can be used to model the entire enterprise architecture, not only the data warehouse (erwin Web Portal)
  • Business glossaries enable (existing) semantic information to be integrated

So far so good. But can the erwin Data Modeler generate models?

A special add-in for the erwin Data Modeler has been developed specifically to meet this requirement: MODGEN. This enables the potential for automation in erwin to be exploited to the full.

It integrates seamlessly into the erwin user interface and, in terms of operation, is heavily based on comparing models (complete compare).

MODGEN functionalities

The following specific functionalities are implemented in MODGEN:

  • Generation of staging and raw vault models based on the model of the preceding layer
  • Generation is controlled by enriching the particular preceding model with meta-information, which is stored in UDPs (user-defined properties)
  • Individual objects can be excluded from the generation process permanently or interactively
  • Specifications for meta-columns can be integrated very easily using templates

To support a modeling process that can be repeated multiple times, during which models are created or enhanced iteratively, it is essential that generation be round-trip capable.

To achieve this, the generation always performs a comparison between the source and target models and indicates any differences. These can be selected by the user and copied during generation.

The generation not only takes all the tables and columns into consideration as a matter of course (horizontal modeling), but also creates vertical model information.

This means that the relationship of every generated target column to its source column is documented as its data source. Source-to-target mappings can therefore be generated very easily from the model.

Integrating the source and target model into a web portal automatically makes the full impact and lineage analysis functionality available.

If you are interested in finding out more, or if you would like to experience MODGEN live, please contact our partner heureka.


Author details: Stefan Kausch, heureka e-Business GmbH
Stefan Kausch is the CEO and founder of heureka e-Business GmbH, a company focused on IT consultancy and software development.

Stefan has more than 15 years’ experience as a consultant, trainer, and educator and has developed and delivered data modeling processes and data governance initiatives for many different companies.

He has successfully executed many projects for customers, primarily developing application systems, data warehouse automation solutions and ETL processes. Stefan Kausch has in-depth knowledge of application development based on data models.

Contact:
Stefan Kausch
heureka e-Business GmbH
Untere Burghalde 69
71229 Leonberg

Tel.: 0049 7152 939310
Email: heureka@heureka.com
Web: www.heureka.com