
How Digital Transformation is Changing the World

Technological revolutions of the past usually came with a bang. However, that’s not the case today as industry quietly transitions systems and processes to incorporate more advanced technology. Still, the outcomes tend to be dramatic with market disruptions and societal advances.

With the rise of the IoT and data – or Big Data to be precise – various industries now operate at increased levels of efficiency and productivity.

This blog post looks at some of the sectors undergoing digital transformation and the impacts.

Digital Transformation in Manufacturing and Retail

Due to the availability of digital tools and automation, manufacturers now have more streamlined and efficient operations without sacrificing quality.

Automobile makers are excellent examples in this regard, with their production plants fabricating parts with meticulous detail and precise measurements – a process achieved consistently and with fewer errors through the use of machines and computers.

Retail is another sector progressing rapidly through digital transformation. For instance, Amazon’s recent acquisition of Whole Foods gives consumers easier and quicker access to products through a series of developments. Entire sections of various items have been added to Amazon’s online shopping portal.

Retail spaces are also being upgraded with interactive and/or virtual displays, giving customers new ways to test items before purchasing. Amazon also recently launched its first Amazon Go store, which implements a ‘checkout-free’ system – shoppers simply take what they need, and in-store sensors signal the system to charge their accounts as they walk out.

Digital Transformation in Transportation


Uber is the perfect example of effective digital transformation in transportation. Hailing a ride is the underlying principle, but passengers can now procure transport more easily, more quickly and at reasonable fares.

Of course, the transition didn’t stop there. Some companies are working to make large-scale formats available, such as shuttles and car-sharing, which you could think of as the Airbnb of the transportation sector.

Digital Transformation in Entertainment and Media

Forbes says the entertainment industry is becoming the testing ground for more digital content from retailers. For example, corporations that typically sell consumer packaged goods are venturing into original content and “actually producing edgy original series.”

PepsiCo is at the forefront of this next frontier, aiming to establish a greater presence in the age of Netflix, YouTube and other fresh streaming content channels.

The F&B giant even constructed its own massive studio that can relay real-time content to the entire globe.

Digital Transformation in Logistics and Supply Chains

In terms of logistics, ‘blockchain’ technology is revolutionizing entire processes that bring products from suppliers to consumers. For instance, the world’s leading shipping company, Maersk, recently conducted a test to track its cargo with blockchain, according to Fortune.

The key takeaway here is that each container was monitored remotely through digital connections protected by cryptographic signatures.
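To make the “protected by cryptographic signatures” idea more concrete, here’s a minimal Python sketch – an illustration only, not Maersk’s actual system – that signs a container tracking event with an HMAC so any downstream party holding the key can verify the record wasn’t altered. The key, container ID and event fields are all hypothetical.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"demo-signing-key"  # illustrative; production systems use managed keys

def sign_event(event: dict) -> dict:
    """Attach an HMAC-SHA256 signature to a container tracking event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    signature = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return {"event": event, "signature": signature}

def verify_event(signed: dict) -> bool:
    """Recompute the signature to confirm the event was not tampered with."""
    payload = json.dumps(signed["event"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])

signed = sign_event({"container_id": "MSKU1234567", "port": "Rotterdam", "status": "departed"})
print(verify_event(signed))  # True unless the event payload is modified
```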

Digitalization extends to the people and vehicles delivering that cargo. The U.S. now requires fleet operators to use electronic logging devices, which Fleetmatics describes as devices used mainly to track hours of service, with compliance required by the end of 2017.

As a result, truckers have a digital tool for recording and receiving data, saving time and resources otherwise spent on paperwork.

Digital Transformation in Finance and Banking

The finance and banking sector was quick to adopt mobile technology. ‘Digital finance’ is giving account holders the power to conduct transactions through handheld devices.

Emerging economies benefit greatly from this trend, with many financial institutions embracing the change from traditional to digital accounts. Moreover, mobility is being used in other related processes, such as loan applications, tax collection and fund allocation.

Digital Transformation in Medicine and Healthcare


Digital equipment and devices are taking over healthcare. Dr. GSK Velu, who contributes to the Economic Times’ Healthworld, even went as far as to say that the human being is now ‘digitized.’

Vital signs and other medical data now can be monitored even when the physician and patient are not in the same room. Surgery also can be performed with the surgeon a world away from the operating table.

Digital Transformation in Government

Several government applications and processes are also undergoing digital transformation. In many cases, such processes are already conducted digitally today. And to further improve such capabilities, Forbes relayed that nations like the U.S. are deploying AI support in select tasks.

Helpdesks and contact centers have the assistance of chatbots, and more advanced cybertechnologies are used for sustainability as well as national security.

Despite this, the U.S. Bureau of Labor Statistics projected that there will be minimal loss of government jobs within the foreseeable future, as most of the manpower will be transferred to more public-centric services. It’s a win-win situation for both government and its citizens.

Digital Transformation is Everywhere

It’s clear that data makes the world go ‘round, and more developments are to come. As data collection, flow and management are further enhanced, innovations such as smart cities and connected cars will be fully realized. Futurism suggests that consumers will even begin to profit from the data they share online. It’s “the new oil.”

And with this increase in value, the volume of data has been skyrocketing. Managing all of this data is becoming ever harder, with disparate data sources being just one of many headaches for data professionals. That’s why much of our blog focuses on best practices in managing Any² – any data, anywhere.

To join the experts and stay up to date with the erwin blog, click here.

 



Using Enterprise Architecture to Improve Security

The personal data of more than 143 million people – nearly half the United States’ entire population – may have been compromised in the recent Equifax data breach. With every major data breach come post-mortems and lessons learned, but one area we haven’t seen discussed is how enterprise architecture might aid in the prevention of data breaches.

For Equifax, the fallout – reputational damage, lost profits and market value, and potential lawsuits – is really bad news. For other organizations that have yet to suffer a breach, be warned: the clock is ticking for the General Data Protection Regulation (GDPR), which takes effect in May 2018. GDPR changes everything, and it’s just around the corner.

Organizations of all sizes must take greater steps to protect consumer data or pay significant penalties. Negligent data governance and data management could cost up to 4 percent of an organization’s annual worldwide turnover or up to 20 million euros, whichever is greater.

With this in mind, the Equifax data breach – and subsequent lessons – is a discussion potentially worth millions.


Proactive Data Protection and Cybersecurity

Given that data security has long been considered paramount, it’s surprising that enterprise architecture has been largely overlooked as an approach to improving data protection.

It’s a surprise because when you consider enterprise architecture use cases and just how much of an organization it permeates (which is really all of it), EA should be commonplace in data security planning.

So, the Equifax breach provides a great opportunity to explore how enterprise architecture could be used for improving cybersecurity.

Security should be proactive, not reactive, which is why EA should be a huge part of security planning. And while we hope the Equifax incident isn’t the catalyst for an initial security assessment and improvements, it certainly should prompt a re-evaluation of data security policies, procedures and technologies.

By using well-built enterprise architecture for the foundation of data security, organizations can help mitigate risk. EA’s comprehensive view of the organization means security can be involved in the planning stages, reducing risks involved in new implementations. When it comes to security, EA should get a seat at the table.

Enterprise architecture also goes a long way toward nullifying threats born of shadow IT, outdated applications and other IT faux pas. Well-documented, well-maintained EA gives an organization the best possible view of its current tech assets.

This is especially relevant in Equifax’s case as the breach has been attributed to the company’s failure to update a web application although it had sufficient warning to do so.

By leveraging EA, organizations can shore up data security by ensuring updates and patches are implemented proactively.

Enterprise Architecture, Security and Risk Management

But what about existing security flaws? Implementing enterprise architecture in security planning now won’t solve them.

An organization can never eliminate security risks completely. The constantly evolving IT landscape would require businesses to spend an infinite amount of time, resources and money to achieve zero risk. Instead, businesses must opt to mitigate and manage risk to the best of their abilities.

Therefore, EA has a role in risk management too.

In fact, EA’s risk management applications are more widely appreciated than its role in security. But effective risk management is a fundamental part of how EA supports security.

Enterprise architecture’s comprehensive accounting of business assets (both technological and human) means it’s best placed to align security and risk management with business goals and objectives. This can give an organization insight into where time and money can best be spent in improving security, as well as the resources available to do so.

This is because of the objective view enterprise architecture analysis provides for an organization.

To use a crude but applicable analogy, consider the risks of travel. A fear of flying is more common than a fear of driving. In a business sense, this could unwarrantedly encourage more spending on mitigating the risks of flying. However, an objective enterprise architecture analysis would reveal that, despite the fear, the risk of travelling by car is much greater.

Applying the same logic to security spending, enterprise architecture analysis would give an organization an indication of how to prioritize security improvements.


Enterprise Architecture for the Cloud – Get Ahead in the Cloud

It’s almost 2018, so there’s a good chance a portion of your business relies on the cloud. You’ve either moved some legacy systems there already, or you’re adopting entirely new, cloud-based systems – or both. But what about enterprise architecture for the cloud? After all, it’s the glue that helps tie all your disparate IT threads together.

The transition to the cloud owes heavily to improvements in its technology foundation, especially increased security and safeguards. For the likes of governments, banks and defense organizations, these enhancements have been paramount.

Additionally, organizations across industries increasingly turn to the cloud to keep costs low and maximize profits. Cloud options are usually easier and cheaper to deploy than their on-premises counterparts.

A 2016 RightScale study found that 31 percent of the 1,060 IT professionals surveyed said their companies run more than 1,000 virtual machines. This demonstrates a sizeable uptick compared to 2015, when only 22 percent of participants answered the same.

The rate of adoption is even more impressive when you consider how forecasts have been outpaced. In 2014, Forrester predicted the cloud market would be worth $191 billion by 2020. In 2016, this estimate was revised to $236 billion.

The cloud is big business.


Why Enterprise Architecture for the Cloud?

We’ve established the case for the growing cloud market. So why is enterprise architecture for the cloud so important? To answer that question, you have to consider why the cloud market is so expansive.

In short, the answer is competition. Although there are some colossal players in the cloud market – Amazon Web Services (AWS), Azure, Google Cloud Platform and IBM account for an estimated 60 percent or more of it – they act as hosts to smaller cloud businesses spanning copious industries.

Unlike the hosts, this layer of the cloud market is incredibly and increasingly saturated. Such saturation results in an even more complex web of disparate systems for which a business must account.

And said business must account for these systems to establish what it has right now, what it needs to reach the organization’s future-state objectives, and what systems can be integrated for the sake of efficiency.

If enterprise architecture (EA) isn’t used to bridge the gap between an organization’s current state and its desired future state, then it’s fundamentally underperforming.

Enterprise Architecture for Introducing Cloud Systems

The above details how EA benefits a business in managing its current cloud-based systems. The primary benefit is the ability to see how and where newer, disparate cloud systems fit with legacy systems.

But EA’s usefulness doesn’t start once cloud systems are already in place. As established, a key objective of EA is to bridge an organization’s current and future state objectives.

Another key objective is to better align the business with IT so the organization is prepared for change. In this way, EA enables organizations to make or face enterprise changes with minimal disruption and cost.

Cloud systems have disrupted how businesses operate already.

And with competition in the cloud space as populated as it is today, more disruption is coming. Therefore, having well-executed EA will make it easier for businesses to manage such inevitable disruption. It will also enable organizations undergoing digital transformation to implement new cloud systems with less friction.



Data Modeling in a Jargon-filled World – The Logical Data Warehouse

There’s debate surrounding the term “logical data warehouse.” Some argue that it is a new concept, while others argue that all well-designed data warehouses are logical and so the term is meaningless. This is a key point I’ll address in this post.

I’ll also discuss data warehousing that incorporates some of the technologies and approaches we’ve covered in previous installments of this series (1, 2, 3, 4, 5, 6 ) but with a different architecture that embraces “any data, anywhere.”

So what is a “logical data warehouse?”

Bill Inmon and Barry Devlin provide two oft-quoted definitions of a “data warehouse.” Inmon says “a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”

Devlin stripped down the definition, saying “a data warehouse is simply a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.”

Although these definitions are widely adopted, there is some disparity in their interpretation. Some insist that such definitions imply a single repository, and thus a limitation.

On the other hand, some argue that a “collection of data” or a “single, complete and consistent store” could just as easily be virtual and therefore not inherently singular. They argue that the language simply reflects the fact that most early implementations were single, physical data stores because of technology limitations.

Mark Beyer of Gartner is a prominent name in the former, singular-repository camp. In 2011, he said “the logical data warehouse (LDW) is a new data management architecture for analytics which combines the strengths of traditional repository warehouses with alternative data management and access strategy,” and the work has since been widely circulated.

So proponents of the “logical data warehouse,” as defined by Mark Beyer, don’t disagree with the value of an integrated collection of data. They just feel that if said collection is managed and accessed as something other than a monolithic, single physical database, then it is something different and should be called a “logical data warehouse” instead of just a “data warehouse.”

As the author of a series of posts about a jargon-filled [data] world, who am I to argue with the introduction of more new jargon?

In fact, I’d be remiss if I didn’t point out that the notion of a logical data warehouse has numerous jargon-rich enabling technologies and synonyms, including Service Oriented Architecture (SOA), Enterprise Services Bus (ESB), Virtualization Layer, and Data Fabric, though the latter term also has other unrelated uses.

So the essence of a logical data warehouse approach is to integrate diverse data assets into a single, integrated virtual data warehouse, without the traditional batch ETL or ELT processes required to copy data into a single, integrated physical data warehouse.

One of the key attractions for proponents of the approach is the avoidance of recurring batch extraction, transformation and loading activities that, it is typically argued, cause delays and lead to decisions being made on data that is not as current as it could be.

The idea is to use caching and other technologies to create a virtualization layer that enables information consumers to ask a question as though they were interrogating a single, integrated physical data warehouse. The virtualization layer – which, together with the data resident in some combination of underlying application systems, IoT data streams, external data sources, blockchains, data lakes, data warehouses and data marts, constitutes the logical data warehouse – responds correctly with more current data, without having to extract, transform and load data into a centralized physical store.
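To make that more tangible, here’s a purely illustrative Python sketch of a tiny “virtualization layer” – not any vendor’s implementation – that answers one logical question by pulling from two hypothetical physical sources (a relational table and a set of order documents) and synthesizing the result, with no batch ETL copy into a central store.

```python
import sqlite3

# Physical source 1: a relational customer table (stand-in for an operational RDBMS).
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
rdbms.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                  [(1, "Acme", "EMEA"), (2, "Globex", "APAC")])

# Physical source 2: recent orders arriving as documents (stand-in for a NoSQL store or stream).
orders = [
    {"customer_id": 1, "amount": 1200.0},
    {"customer_id": 1, "amount": 300.0},
    {"customer_id": 2, "amount": 950.0},
]

def revenue_by_region():
    """One 'logical' query: total order value per region, assembled across both sources."""
    region_of = {cid: region
                 for cid, region in rdbms.execute("SELECT id, region FROM customers")}
    totals = {}
    for order in orders:
        region = region_of.get(order["customer_id"], "UNKNOWN")
        totals[region] = totals.get(region, 0.0) + order["amount"]
    return totals

print(revenue_by_region())  # {'EMEA': 1500.0, 'APAC': 950.0}
```

A real virtualization layer adds caching, query optimization and a governing data model, but the shape of the work – decompose the request, query each source, synthesize the results – is the same.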


While the moniker may be new, the idea of bringing the query to the data set(s) and then assembling an integrated result is not a new idea. There have been numerous successful implementations in the past, though they often required custom coding and rigorous governance to ensure response times and logical correctness.

Some would argue that such previous implementations were also not at the leading edge of data warehousing in terms of data volume or scope.

What is generating renewed interest in this approach is the continued frustration on the part of numerous stakeholders with delays attributed to ETL/ELT in traditional data warehouse implementations.

When you compound this with the often high costs of large (physical) data warehouse implementations – especially those based on MPP hardware – and juxtapose them against the promise of new solutions from vendors like Denodo and Cisco that capitalize on the increasing prevalence of technologies such as the cloud and in-memory, it’s not hard to see why.

One topic that quickly becomes clear as one learns more about the various logical data warehouse vendor solutions is that metadata is a very important component. However, this shouldn’t be a surprise, as the objective is still to present a single, integrated view to the information consumer.

So a well-architected, comprehensive and easily understood data model is as important as ever. It ensures that information consumers can easily access properly integrated data, and the virtualization technology itself depends on a properly architected data model to accurately transform an information request into queries against multiple data sources and then correctly synthesize the result sets into an appropriate response to the original request.

We hope you’ve enjoyed our series, Data Modeling in a Jargon-filled World, learning something from this post or one of the previous posts in the series (1, 2, 3, 4, 5, 6 ).

The underlying theme, as you’ve probably deduced, is that data modeling remains critical in a world in which the volume, variety and velocity of data continue to grow while information consumers find it difficult to synthesize the right data in the right context to help them draw the right conclusions.

We encourage you to read other blog posts on this site by erwin staff members and other guest bloggers and to participate in ongoing events and webinars.

If you’d like to know more about accelerating your data modeling efforts for specific industries, while reducing risk and benefiting from best practices and lessons learned by other similar organizations in your industry, please visit erwin partner ADRM Software.



SQL, NoSQL or NewSQL: Evaluating Your Database Options

A common question in the modern data management space involves database technology: SQL, NoSQL or NewSQL?

But there isn’t a one-size-fits-all answer. What’s “right” must be evaluated on a case-by-case basis and is dependent on data maturity.

For example, a large bookstore chain with a big-data initiative would be stifled by a SQL database. The advantages that could be gained from analyzing social media data (for popular books, consumer buying habits) couldn’t be realized effectively through sequential analysis. There’s too much data involved in this approach, with too many threads to follow.

However, an independent bookstore isn’t necessarily bound to a big-data approach because it may not have a mature data strategy. It might not have ventured beyond digitizing customer records, and a SQL database is sufficient for that work.

Having said that, the “SQL, NoSQL or NewSQL” question is gaining prominence because businesses are becoming increasingly data-driven.

A Progress study found that 85 percent of enterprise decision-makers feel they have only two years to make significant inroads into digital transformation before they suffer financially and/or fall behind their competitors.

With that statistic in mind, what better time than now to evaluate your database technology? The “SQL, NoSQL or NewSQL” question is especially important if you intend to become more data-driven.

SQL, NoSQL or NewSQL: Advantages and Disadvantages

SQL

SQL databases are tried and tested, proven to work on disks using interfaces with which businesses are already familiar.

As the longest-standing type of database, plenty of SQL options are available. This competitive market means you’ll likely find what you’re looking for at affordable prices.

Additionally, businesses in the earlier stages of data maturity are more likely to have a SQL database at work already, meaning no new investments need to be made.

However, in the modern digital business context, SQL databases weren’t made to support the three Vs of data. The volume is too high, the variety of sources is too vast, and the velocity (the speed at which the data must be processed) is too great for sequential analysis.

Furthermore, the foundational, legacy IT world they were purpose-built to serve has evolved. Now, corporate IT departments must be agile, and their databases must be agile and scalable to match.

NoSQL

Despite its name, “NoSQL” doesn’t mean the complete absence of the SQL database approach. Rather, it works as more of a hybrid. The term is a contraction of “not only SQL.”

So, in addition to the advantage of continuity that staying with SQL offers, NoSQL enjoys many of the benefits of SQL databases.

The key difference is that NoSQL databases were developed with modern IT in mind. They are scalable, agile and purpose-built to deal with disparate, high-volume data.

Hence, data is typically more readily available, and new data can be inserted, changed or stored more easily.

For example, MongoDB, one of the key players in the NoSQL world, uses JavaScript Object Notation (JSON). As the company explains, “A JSON database returns query results that can be easily parsed, with little or no transformation.” The open, human- and machine-readable standard facilitates data interchange and can store records, “just as tables and rows store records in a relational database.”

Generally, NoSQL databases are better equipped to deal with other non-relational data too. As well as JSON, NoSQL supports log messages, XML and unstructured documents. This support avoids the lethargic “schema-on-write” approach, opting for “schema-on-read” instead.
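As a small, hedged illustration of what the document model and schema-on-read look like in practice, here’s a Python sketch using pymongo; it assumes a MongoDB instance is running locally, and the collection and field names are invented for the example. Note that the two documents don’t share an identical structure, and no schema migration is needed before the second one is stored.

```python
from pymongo import MongoClient  # pip install pymongo; assumes MongoDB running locally

client = MongoClient("mongodb://localhost:27017")
books = client["bookstore"]["books"]

# Documents need not share an identical structure ("schema-on-read"):
books.insert_many([
    {"title": "Dune", "genre": "sci-fi", "social_mentions": 1843},
    {"title": "Local Poetry Anthology", "genre": "poetry",
     "contributors": ["A. Writer", "B. Poet"]},  # extra field, no migration required
])

# Query results come back as JSON-like dicts that need little or no transformation:
for doc in books.find({"genre": "sci-fi"}):
    print(doc["title"], doc.get("social_mentions", 0))
```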

NewSQL

NewSQL refers to databases based on the relational (SQL) database and SQL query language. In an attempt to solve some of the problems of SQL, the likes of VoltDB and others take a best-of-both-worlds approach, marrying the familiarity of SQL with the scalability and agile enablement of NoSQL.

However, as with most seemingly win-win opportunities, NewSQL isn’t without its caveats. These vary from vendor to vendor, but in essence, you have to sacrifice either some familiarity or some scalability.

If you’d like to speak with someone at erwin about SQL, NoSQL or NewSQL in more detail, click here.

For more industry advice, subscribe to the erwin Expert Blog.



Data Modeling in a Jargon-filled World – In-memory Databases

With the volume and velocity of data increasing, in-memory databases provide a way to keep processing times low.

Traditionally, databases have stored their data on mechanical storage media such as hard disks. While this has contributed to durability, it’s also constrained attainable query speeds. Database and software designers have long realized this limitation and sought ways to harness the faster speeds of in-memory processing.

The traditional approach to database design – and analytics solutions to access them – includes in-memory caching, which retains a subset of recently accessed data in memory for fast access. While caching often worked well for online transaction processing (OLTP), it was not optimal for analytics and business intelligence. In these cases, the most frequently accessed information – rather than the most recently accessed information – is typically of most interest.
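The distinction between “most recently accessed” and “most frequently accessed” is easy to see in code. The following self-contained Python sketch – illustrative only, not how any particular database implements its cache – contrasts an LRU-style cache (which suits OLTP access patterns) with a simple frequency count of queries (what an analytics workload cares about).

```python
from collections import Counter, OrderedDict

class LRUCache:
    """Keeps only the most *recently* accessed items - the classic OLTP-style cache."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)              # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)       # evict the least recently used item

cache = LRUCache(capacity=2)
frequency = Counter()                             # tracks how *often* each query runs

for query in ["q_daily_sales", "q_daily_sales", "q_inventory", "q_ad_hoc_1", "q_ad_hoc_2"]:
    frequency[query] += 1
    cache.put(query, f"result of {query}")

print(list(cache.items))         # ['q_ad_hoc_1', 'q_ad_hoc_2'] - only the most recent survive
print(frequency.most_common(1))  # [('q_daily_sales', 2)] - but this is the most *frequent* query
```

The most frequently run query has already been evicted, which is exactly why recency-based caching underserves analytics and BI.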

That said, loading an entire data warehouse or even a large data mart into memory has been challenging until recent years.


A few key factors have made in-memory databases and analytics offerings relevant for more and more use cases. One such factor has been the shift to 64-bit operating systems, which makes much more memory addressable. And as one might assume, the availability of increasingly large and affordable memory solutions has also played a part.

Database and software developers have begun to take advantage of in-memory databases in a myriad of ways. These include the many key-value stores such as Amazon DynamoDB, which provide very low latency for IoT and a host of other use cases.
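For a flavor of that key-value access pattern, here’s a hedged Python/boto3 sketch against DynamoDB. It assumes AWS credentials are configured and that a table named iot_readings with a device_id partition key already exists – the table, region and field names are assumptions for illustration, not part of the original post.

```python
import boto3  # pip install boto3; assumes AWS credentials are configured

# Assumes a DynamoDB table "iot_readings" with partition key "device_id" already exists.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("iot_readings")

# Write a reading keyed by device - a simple put, no joins or schema changes involved.
table.put_item(Item={
    "device_id": "sensor-42",
    "reported_at": "2017-11-07T10:15:00Z",
    "temperature_c": 21,
})

# Read it straight back by key - the low-latency lookup these stores are built for.
response = table.get_item(Key={"device_id": "sensor-42"})
print(response.get("Item"))
```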

Businesses are also taking advantage of in-memory technology in everything from distributed in-memory NoSQL databases such as Aerospike to in-memory NewSQL databases such as VoltDB. However, for the remainder of this post, we’ll touch in more detail on several solutions with which you might be more familiar.

Some database vendors have chosen to build hybrid solutions that incorporate in-memory technologies. They aim to bridge in-memory with solutions based on tried-and-true, disk-based RDBMS technologies. Such vendors include Microsoft with its incorporation of xVelocity into SQL Server, Analysis Services and PowerPivot, and Teradata with its Intelligent Memory.

Other vendors, like IBM with its dashDB database, have chosen to deploy in-memory technology in the cloud, while capitalizing on previously developed or acquired technologies (in-database analytics from Netezza in the case of dashDB).

However, probably the most high-profile application of in-memory technology has been SAP’s significant bet on its HANA in-memory database, which first shipped in late 2010. SAP has since made it available in the cloud through its SAP HANA Cloud Platform and on Microsoft Azure, and it has released a comprehensive application suite called S/4HANA.

Like most of the analytics-focused in-memory databases and analytics tools, HANA stores data in a column-oriented, in-memory database. The primary rationale for taking a column-oriented approach to storing data in memory is that in analytic use cases, where data is queried but not updated, it allows for often very impressive compression of data values in each column. This means much less memory is used, resulting in even higher throughput and less need for expensive memory.
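The compression benefit of column orientation is easy to see with a toy example. The following Python sketch – illustrative only, not how HANA or any other product physically stores data – lays the same rows out column by column and run-length encodes a low-cardinality column.

```python
from itertools import groupby

# Row-oriented view of a small order table ...
rows = [
    ("2017-11-01", "DE", 120),
    ("2017-11-01", "DE", 80),
    ("2017-11-01", "DE", 95),
    ("2017-11-02", "US", 200),
    ("2017-11-02", "US", 40),
]

# ... versus a column-oriented layout: one array per column.
columns = {name: [row[i] for row in rows]
           for i, name in enumerate(["order_date", "country", "amount"])}

def run_length_encode(values):
    """Collapse runs of repeated values - very effective on sorted, low-cardinality columns."""
    return [(value, len(list(group))) for value, group in groupby(values)]

print(run_length_encode(columns["country"]))
# [('DE', 3), ('US', 2)] - five stored values shrink to two (value, count) pairs
```

Because analytic queries typically scan a handful of columns rather than whole rows, this layout both compresses well and keeps scans fast.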

So what approach should a data architect adopt? Are Microsoft, Teradata and other “traditional” RDBMS vendors correct with their hybrid approach?

As memory gets cheaper by the day, and the value of rapid insights increases by the minute, should we host the whole data warehouse or data mart in memory, as vendors such as SAP and IBM suggest?

It depends on the specific use case, data volumes, business requirements, budget, etc. One thing that is not in dispute is that all the major vendors recognize that in-memory technology adds value to their solutions. And that extends beyond the database vendors to analytics tool stalwarts like Tableau and newer arrivals like Yellowfin.

It is incumbent upon enterprise architects to learn about the relative merits of the different approaches championed by the various vendors and to select the best fit for their specific situation. Admittedly, that’s not easy given the pace of adoption of in-memory databases and the variety of approaches being taken.

But there’s a silver lining to the creative disruption caused by the increasing adoption of in-memory technologies. Because of the sheer speed the various solutions offer, many organizations are finding that the need to pre-aggregate data to achieve certain performance targets for specific analytics workloads is disappearing. The same goes for the need to de-normalize database designs to achieve specific analytics performance targets.

Instead, organizations are finding that it’s more important to create comprehensive atomic data models that are flexible and independent of any assumed analytics workload.

Perhaps surprisingly to some, third normal form (3NF) is once again not an unreasonable standard of data modeling for modelers who plan to deploy to a pure in-memory or in-memory-augmented platform.

Organizations can forgo the time-consuming effort to model and transform data to support specific analytics workloads, which are likely to change over time anyway. They also can stop worrying about de-normalizing and tuning an RDBMS for those same fickle and variable analytics workloads, and instead focus on creating a logical data model of the business that reflects the business information requirements and relationships in a flexible and detailed format that doesn’t assume specific aggregations and transformations.

The blinding speed of in-memory technologies provides the aggregations, joins and other transformations on the fly, without the onerous performance penalties we have historically experienced with very large data volumes on disk-only-based solutions. As a long-time data modeler, I like the sound of that. And so far in my experience with many of the solutions mentioned in this post, the business people like the blinding speed and flexibility of these new in-memory technologies!

Please join us next time for the final installment of our series, Data Modeling in a Jargon-filled World – The Logical Data Warehouse. We’ll discuss an approach to data warehousing that uses some of the technologies and approaches we’ve discussed in the previous six installments while embracing “any data, anywhere.”


Data Modeling in a Jargon-filled World – The Cloud

There’s no escaping data’s role in the cloud, and so it’s crucial that we analyze the cloud’s impact on data modeling. 


Every Company Requires Data Governance and Here’s Why

With GDPR regulations imminent, businesses need to ensure they have a handle on data governance.


Data Modeling is Changing – Time to Make NoSQL Technology a Priority

As the amount of data enterprises are tasked with managing increases, the benefits of NoSQL technology are becoming more apparent. 


Data Modeling in a Jargon-filled World – Managed Data Lakes

More and more businesses are adopting managed data lakes.

Earlier in this blog series, we established that leading organizations are adopting a variety of approaches to manage data, including data that may be sourced from a wide range of NoSQL, NewSQL, RDBMS and unstructured sources.

In this post, we’ll discuss managed data lakes and their applications as a hybrid of less structured data and more traditionally structured relational data. We’ll also talk about whether there’s still a need for data modeling and metadata management.

The term Data Lake was first coined by James Dixon of Pentaho in a blog entry in which he said:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Use of the term quickly took on a life of its own with often divergent meanings. So much so that four years later Mr. Dixon felt compelled to refute some criticisms by the analyst community by pointing out that they were objecting to things he actually never said about data lakes.

However, in my experience and despite Mr. Dixon’s objections, the notion that a data lake can contain data from more than one source is now widely accepted.

Similarly, while most early data lake implementations used Hadoop with many vendors pitching the idea that a data lake had to be implemented as a Hadoop data store, the notion that data lakes can be implemented on non-Hadoop platforms, such as Azure Blob storage or Amazon S3, has become increasingly widespread.

So a data lake – as the term is widely used in 2017 – is a detailed (non-aggregated) data store that can contain structured and/or non-structured data from more than one source implemented on some kind of inexpensive, massively scalable storage platform.

But what are “managed data lakes?”

To answer that question, let’s first touch on why many early data lake projects failed or significantly missed expectations. Criticisms arose quickly, many of them aimed at data lake implementations that strayed from the original vision established earlier.

Vendors seized on data lakes as a marketing tool, and as often happens in our industry, they promised it could do almost anything. As long as you poured your data into the lake, people in the organization would somehow magically find exactly the data they needed just when they needed it. As is usually the case, it turned out that for most organizations, their reality was quite different. And for three important reasons:

  1. Most large organizations’ analysts didn’t have the skillsets to wade through the rapidly accumulating pool of information in Hadoop (or whichever new platform had been chosen to implement their data lakes) to locate the data they needed.
  2. Not enough attention was paid to the need to provide metadata to help people find the data they needed.
  3. Most interesting analytics are a result of integrating disparate data points to draw conclusions, and integration had not been an area of focus in most data lake implementations.

In the face of growing disenchantment with data lake implementations, some organizations and vendors pivoted to address these drawbacks. They did so by embracing what is most commonly called a managed data lake, though some prefer the label “curated data lake” or “modern data warehouse.”

The idea is to address the three criticisms mentioned above by developing an architectural approach that allows for the use of SQL, making data more accessible and providing more metadata about the data available in the data lake. It also takes on some of the challenging work of integration and transformation that earlier data lake implementations had hoped to kick down the road or avoid entirely.

The result in most implementations of a managed data lake is a hybrid that tries to blend the strengths of the original data lake concept with the strengths of traditional large-scale data warehousing (as opposed to the narrow data mart approach Mr. Dixon used as a foil when originally describing data lakes).

Incoming data, either structured or unstructured, can be easily and quickly loaded from many different sources (e.g., applications, IoT, third parties, etc.). The data can be accumulated with minimal processing at reasonable cost using a bulk storage platform such as Hadoop, Azure Blob storage or Amazon S3.

Then the data that is widely used within the organization can be integrated and made available through a SQL or SQL-like interface, ranging from Hive to Postgres to a tried-and-true commercial relational database such as SQL Server (or its cloud-based cousin, Azure SQL Data Warehouse).
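As a rough sketch of those two layers – the paths, field names and choice of Spark SQL are assumptions for illustration, not a prescribed stack – the following Python/PySpark snippet reads raw JSON that has landed in a bulk-storage path and exposes it through a familiar SQL interface.

```python
from pyspark.sql import SparkSession  # pip install pyspark

spark = SparkSession.builder.appName("managed-data-lake-sketch").getOrCreate()

# 1. Raw, minimally processed events land in cheap bulk storage
#    (an S3 or Azure Blob path in practice; a hypothetical local path here).
raw_orders = spark.read.json("/data/lake/raw/orders/")

# 2. The widely used, integrated core is exposed through a familiar SQL interface.
raw_orders.createOrReplaceTempView("orders")
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```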

In this scenario, a handful of self-sufficient data scientists may wade (or swim or dive) in the surrounding data lake. However, most analysts in most organizations still spend most of their time using familiar SQL-capable tools to analyze data stored in the core of the managed data lake – an island in the lake if we really want to torture the analogy – which is typically implemented either using an RDBMS or a relational layer like Hive on top of the bulk-storage layer.

It’s important to note that these are not two discrete silos. Most major vendors have added capabilities to their database and BI offerings to enable analysis of both RDBMS-based and bulk-storage layer data through a familiar SQL interface.

This enables a much larger percentage of an organization’s analysts to access data both in the core and the less structured surrounding lake, using tools with which they’re already familiar.

As this hybrid managed data lake approach incorporates a relational core, robust data modeling capabilities are as important as ever. The same goes for data governance and a thorough focus on metadata to provide clear naming and definitions to assist in finding and linking with the most appropriate data.

This is true whether inside the structured relational core of the managed data lake or in the surrounding, more fluid data lake.

As you probably guessed from some of the links in this post, more and more managed data lakes are being implemented in the cloud. Please join us next time for the fifth installment in our series: Data Modeling in a Jargon-filled World – The Cloud.