erwin Expert Blog

Data Modeling in a Jargon-filled World – In-memory Databases

With the volume and velocity of data increasing, in-memory databases provide a way to keep processing speeds high.

Traditionally, databases have stored their data on mechanical storage media such as hard disks. While this has contributed to durability, it’s also constrained attainable query speeds. Database and software designers have long recognized this limitation and sought ways to harness the faster speeds of in-memory processing.

The traditional approach to database design – and analytics solutions to access them – includes in-memory caching, which retains a subset of recently accessed data in memory for fast access. While caching often worked well for online transaction processing (OLTP), it was not optimal for analytics and business intelligence. In these cases, the most frequently accessed information – rather than the most recently accessed information – is typically of most interest.
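To make the recency-versus-frequency distinction concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any particular database implements its cache: the function names, the table-like keys and the access trace are all hypothetical, and the frequency-based policy uses a deliberately simple admission rule.

```python
from collections import Counter, OrderedDict


def lru_contents(trace, capacity):
    """Return the keys left in a least-recently-used cache after the trace."""
    cache = OrderedDict()
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return set(cache)


def frequency_contents(trace, capacity):
    """Return the keys left in a simple frequency-based cache after the trace.

    Simplification: a new key replaces the least frequently accessed cached
    key only if the new key has been accessed more often than that key.
    """
    counts = Counter()
    cache = set()
    for key in trace:
        counts[key] += 1
        if key in cache:
            continue
        if len(cache) < capacity:
            cache.add(key)
            continue
        victim = min(cache, key=lambda k: counts[k])
        if counts[key] > counts[victim]:
            cache.remove(victim)
            cache.add(key)
    return set(cache)


# Hypothetical trace: two "hot" tables, followed by a burst of one-off lookups.
trace = ["sales", "sales", "sales", "customers", "customers",
         "tmp1", "tmp2", "tmp3"]

print("recency-based:  ", sorted(lru_contents(trace, capacity=2)))
print("frequency-based:", sorted(frequency_contents(trace, capacity=2)))
```

With this trace, the recency-based cache ends up holding only the one-off lookups, while the frequency-based cache still holds the two hot tables, which is the behavior an analytics workload usually wants.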

That said, loading an entire data warehouse or even a large data mart into memory has been challenging until recent years.

In-Memory

A few key factors have made in-memory databases and analytics offerings relevant for more and more use cases. One is the shift to 64-bit operating systems, which makes much more addressable memory available. Another, as one might assume, is the availability of increasingly large and affordable memory.
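The arithmetic behind the 64-bit point is quick to check. The sketch below computes only the theoretical address space for a given pointer width. The helper name is hypothetical, and real limits are lower: most current x86-64 processors, for example, implement 48-bit virtual addresses, and operating systems and hardware impose their own caps.

```python
GIB = 2**30  # bytes in one gibibyte
TIB = 2**40  # bytes in one tebibyte


def addressable_bytes(pointer_bits):
    """Theoretical number of byte addresses for a given pointer width."""
    return 2**pointer_bits


print(addressable_bytes(32) // GIB, "GiB with 32-bit addresses")
print(addressable_bytes(48) // TIB, "TiB with 48-bit addresses")
print(addressable_bytes(64) // GIB, "GiB with full 64-bit addresses")
```

A 32-bit address space tops out at 4 GiB, which is far too small for a data mart of any size. Even the 48-bit case allows 256 TiB.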

Database and software developers have begun to take advantage of in-memory databases in a myriad of ways. These include the many key-value stores such as Amazon DynamoDB, which provide very low latency for IoT and a host of other use cases.

Businesses are also taking advantage of in-memory technology through offerings that range from distributed in-memory NoSQL databases such as Aerospike to in-memory NewSQL databases such as VoltDB. However, for the remainder of this post, we’ll touch in more detail on several solutions with which you might be more familiar.

Some database vendors have chosen to build hybrid solutions that incorporate in-memory technologies. They aim to bridge in-memory with solutions based on tried-and-true, disk-based RDBMS technologies. Such vendors include Microsoft with its incorporation of xVelocity into SQL Server, Analysis Services and PowerPivot, and Teradata with its Intelligent Memory.

Other vendors, like IBM with its dashDB database, have chosen to deploy in-memory technology in the cloud, while capitalizing on previously developed or acquired technologies (in-database analytics from Netezza in the case of dashDB).

However, probably the most high-profile application of in-memory technology has been SAP’s significant bet on its HANA in-memory database, which first shipped in late 2010. SAP has since made it available in the cloud through its SAP HANA Cloud Platform and on Microsoft Azure, and it has released a comprehensive application suite called S/4HANA.

Like most of the analytics-focused in-memory databases and analytics tools, HANA stores data in a column-oriented, in-memory database. The primary rationale for taking a column-oriented approach to storing data in memory is that in analytic use cases, where data is queried but not updated, the values in each column can often be compressed very effectively. This means much less memory is used, resulting in even higher throughput and less need for expensive memory.
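To see why columns compress so well, consider a minimal Python sketch of two common techniques, dictionary encoding and run-length encoding. This is an illustration only, not a description of how HANA or any other product stores data internally; the sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical row-oriented data: (order_id, country, product).
rows = [
    (1, "DE", "widget"),
    (2, "DE", "widget"),
    (3, "DE", "gadget"),
    (4, "FR", "gadget"),
    (5, "FR", "gadget"),
    (6, "US", "widget"),
    (7, "US", "widget"),
    (8, "US", "widget"),
]


def to_columns(rows, names):
    """Pivot a list of row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def dictionary_encode(values):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


columns = to_columns(rows, ["order_id", "country", "product"])
dictionary, codes = dictionary_encode(columns["country"])
runs = run_length_encode(codes)

print("dictionary:", dictionary)
print("codes:     ", codes)
print("runs:      ", runs)
```

Here the eight country values shrink to a three-entry dictionary plus three runs. A row-oriented layout interleaves order IDs, countries and products, so neither technique finds much to work with.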

So what approach should a data architect adopt? Are Microsoft, Teradata and other “traditional” RDBMS vendors correct with their hybrid approach?

As memory gets cheaper by the day, and the value of rapid insights increases by the minute, should we host the whole data warehouse or data mart in memory, as vendors such as SAP and IBM do?

It depends on the specific use case, data volumes, business requirements, budget, etc. One thing that is not in dispute is that all the major vendors recognize that in-memory technology adds value to their solutions. And that extends beyond the database vendors to analytics tool stalwarts like Tableau and newer arrivals like Yellowfin.

It is incumbent upon enterprise architects to learn about the relative merits of the different approaches championed by the various vendors and to select the best fit for their specific situation. Admittedly, this is not easy given the pace of adoption of in-memory databases and the variety of approaches being taken.

But there’s a silver lining to the creative disruption caused by the increasing adoption of in-memory technologies. Because of the sheer speed the various solutions offer, many organizations are finding that the need to pre-aggregate data to achieve certain performance targets for specific analytics workloads is disappearing. The same goes for the need to de-normalize database designs to achieve specific analytics performance targets.

Instead, organizations are finding that it’s more important to create comprehensive atomic data models that are flexible and independent of any assumed analytics workload.

Perhaps surprisingly to some, third normal form (3NF) is once again not an unreasonable standard of data modeling for modelers who plan to deploy to a pure in-memory or in-memory-augmented platform.

Organizations can forgo the time-consuming effort to model and transform data to support specific analytics workloads, which are likely to change over time anyway. They can also stop worrying about de-normalizing and tuning an RDBMS for those same fickle and variable analytics workloads. Instead, they can focus on creating a logical data model of the business that reflects its information requirements and relationships in a flexible and detailed format that doesn’t assume specific aggregations and transformations.
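As a small illustration of modeling atomically and aggregating at query time, the sketch below uses Python's built-in sqlite3 module with an in-memory database. SQLite is a row store and merely stands in for the platforms discussed in this post; the tables, column names and sample data are hypothetical. The point is only that a normalized model with no pre-aggregated tables can answer different questions with different joins.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A small normalized (3NF-style) model with atomic order lines.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL
);
CREATE TABLE product (
    product_id       INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    unit_price_cents INTEGER NOT NULL
);
CREATE TABLE order_line (
    order_line_id INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL REFERENCES customer(customer_id),
    product_id    INTEGER NOT NULL REFERENCES product(product_id),
    quantity      INTEGER NOT NULL
);
""")

conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Acme", "EMEA"), (2, "Globex", "AMER"), (3, "Initech", "EMEA"),
])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, "widget", 500), (2, "gadget", 1200),
])
conn.executemany("INSERT INTO order_line VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 10), (2, 1, 2, 1), (3, 2, 1, 4), (4, 3, 2, 2),
])

# Two different questions, both answered by joining and aggregating on the fly.
revenue_by_region = conn.execute("""
    SELECT c.region, SUM(ol.quantity * p.unit_price_cents)
    FROM order_line AS ol
    JOIN customer AS c ON c.customer_id = ol.customer_id
    JOIN product  AS p ON p.product_id  = ol.product_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

revenue_by_product = conn.execute("""
    SELECT p.name, SUM(ol.quantity * p.unit_price_cents)
    FROM order_line AS ol
    JOIN product AS p ON p.product_id = ol.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()

print(revenue_by_region)
print(revenue_by_product)
```

If the business later wants revenue by customer, or by region and product together, only the query changes. The model stays as it is.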

The blinding speed of in-memory technologies provides the aggregations, joins and other transformations on the fly, without the onerous performance penalties we have historically experienced with very large data volumes on disk-only-based solutions. As a long-time data modeler, I like the sound of that. And so far in my experience with many of the solutions mentioned in this post, the business people like the blinding speed and flexibility of these new in-memory technologies!

Please join us next time for the final installment of our series, Data Modeling in a Jargon-filled World – The Logical Data Warehouse. We’ll discuss an approach to data warehousing that uses some of the technologies and approaches we’ve discussed in the previous six installments while embracing “any data, anywhere.”