By now, you’ve likely heard a lot about Big Data. You may have even heard about “the three Vs” of Big Data. As originally defined by Gartner, Big Data is “high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision-making, insight discovery and process optimization.”
In this post, we’ll focus on “volume,” deferring a discussion on “velocity” and “variety” and their impacts on data modeling until future posts in this series.
As we all know, the volume of data in the world is growing quickly, with terabytes and petabytes now sounding mundane in a world of exabytes, zettabytes and even yottabytes!
All large organizations are confronted with these rapidly growing data volumes, and those that strive to be data-driven must find ways to harness the opportunity this growing resource represents.
Traditional relational database management systems (RDBMS) have improved in performance and scalability, yet data volumes are growing so quickly that performance remains a challenge.
Furthermore, even with the significantly improved price/performance offered by traditional RDBMS vendors, costs continue to escalate, particularly for large-scale analytics and business intelligence workloads.
The original approach to the challenge of Big Data, even before Gartner labeled it Big Data, was to take advantage of massively parallel processing (MPP) technology.
Teradata and Britton Lee were pioneers in the 1980s, but in the past 15 years, a number of large RDBMS vendors have also created or acquired MPP database offerings, notably including the Microsoft Analytics Platform System (APS) and the IBM PureData System for Analytics (previously known as Netezza).
As you may know, erwin Data Modeler has supported modeling and Data Definition Language (DDL) generation for Teradata for years. It now also supports distribution and sort keys, the constructs that tell an MPP database how to distribute and sort data across the nodes of the parallel system.
The erwin DM product already supports SQL Server, the RDBMS at the heart of the Microsoft APS system. And through a recent enhancement via ODBC and a custom Forward Engineering Template, erwin DM now also supports AWS Redshift, a cloud-based MPP database that we’ll discuss in a future post.
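To make distribution and sort keys a little more concrete, here’s a toy sketch (not erwin’s actual template language) of what forward engineering to Redshift amounts to: turning model metadata into DDL that carries a DISTKEY and a SORTKEY. The sales_fact table and its columns are hypothetical, invented purely for illustration.

```python
# Toy illustration (not erwin's actual template language): generating Redshift
# DDL with a distribution key and a sort key from simple, hypothetical model
# metadata. Table and column names are made up for the example.

def redshift_ddl(table, columns, dist_key, sort_key):
    """Build a CREATE TABLE statement with DISTKEY and SORTKEY clauses."""
    col_lines = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} (\n    {col_lines}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"  # rows sharing this value land on the same node
        f"SORTKEY ({sort_key});"   # rows are stored in this order on each node
    )

print(redshift_ddl(
    table="sales_fact",
    columns=[("sale_id", "BIGINT"), ("customer_id", "INTEGER"),
             ("sale_date", "DATE"), ("amount", "DECIMAL(12,2)")],
    dist_key="customer_id",
    sort_key="sale_date",
))
```

The distribution key determines which node each row lands on, while the sort key determines the physical ordering of rows on that node; choosing them well is exactly the kind of design decision a data model should capture.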
Driven by Microsoft’s entry into the market, price/performance for MPP analytics solutions has continued to improve. Even so, these solutions can still be expensive at the scale of the enterprise data warehouse a large organization typically implements.
They do, however, have the advantage of being essentially parallel implementations of an RDBMS, enabling large organizations to capitalize on existing skill sets and on their employees’ familiarity with architecting, designing, deploying, supporting and working with that type of commercial product.
At roughly the same time that many of the recent MPP entrants or their predecessors began to emerge, some organizations took a different path. They opted to source less expensive solutions to handle analytics workloads on very large data sets.
Based on papers published by Google employees in 2003 and 2004, and driven by an open source project that began in 2006, Hadoop rapidly emerged as an alternative approach, largely because of its Hadoop Distributed File System (HDFS) and its MapReduce programming model for applying parallel techniques to the distributed processing of big data sets on a cluster.
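To give a feel for the MapReduce model itself, here’s a minimal word-count sketch written in the style of a Hadoop Streaming job, with the shuffle-and-sort phase simulated locally; it’s purely illustrative, not a production job.

```python
# Minimal word-count sketch in the Hadoop Streaming style. The mapper emits
# (word, 1) pairs, the framework's shuffle sorts them by key across the
# cluster, and the reducer sums the counts per word. The shuffle is simulated
# locally here with sorted().
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Pairs arrive grouped by key, as the framework guarantees after the shuffle.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    shuffled = sorted(mapper(sys.stdin))  # local stand-in for the shuffle/sort phase
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")
```

On a real cluster, the mapper and reducer would run as separate tasks across many nodes, with HDFS providing the distributed storage and the framework handling the shuffle between them.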
Unfortunately, many organizations have been driven by the credo that storage is cheap, so store everything. In these cases, the value of data modeling and data integration is underestimated.
Organizations get caught up in the false perception that people will somehow find what they need when they need it. As a result, many early adopters in search of a “magic bullet” found that even with really cheap storage, if you have enough data, you can still end up spending a lot of money.
Additionally, if you don’t properly organize and catalog your data, you can end up with a “data swamp” or a “Hadump.” This can make it extremely difficult for data consumers in the organization to find the right information at the right time.
The point is not to denigrate Hadoop and Data Lakes; Managed Data Lakes can play a useful role, as we’ll see later in this series. The point is that data architecture and data modeling remain important regardless of the deployment technology and the jargon that surrounds it, as we’ll discuss further…
Please join us next time for Data Modeling in a Jargon-filled World – Internet of Things (IoT).