
Data Governance Frameworks: The Key to Successful Data Governance Implementation

A strong data governance framework is central to successful data governance implementation in any data-driven organization because it ensures that data is properly maintained, protected and maximized.

Despite this, enterprises often face pushback when implementing a new data governance initiative or trying to mature an existing one.

Let’s assume you have some form of informal data governance operation with some strengths to build on and some weaknesses to correct. Some parts of the organization are engaged and behind the initiative, while others are skeptical about its relevance or benefits.

Some other common data governance implementation obstacles include:

  • Questions about where to begin and how to prioritize which data streams to govern first
  • Issues regarding data quality and ownership
  • Concerns about data lineage
  • Competing projects and resources (time, people and funding)

By using a data governance framework, organizations can formalize their data governance implementation and ensure ongoing adherence to it. This addresses common concerns, including data quality and data lineage, and provides a clear path to successful data governance implementation.

In this blog, we will cover three key steps to successful data governance implementation. We will also look at how to expand the scope and depth of a data governance framework to ensure data governance standards remain high.

Data Governance Implementation in 3 Steps

When maturing or implementing data governance and/or a data governance framework, an accurate assessment of the ‘here and now’ is key. Then you can rethink the path forward, identifying any current policies or business processes that should be incorporated and being careful not to repeat the mistakes of prior iterations.

With this in mind, here are three steps we recommend for implementing data governance and a data governance framework.


Step 1: Shift the culture toward data governance

Data governance isn’t something to set and forget; it’s a strategic approach that needs to evolve over time in response to new opportunities and challenges. Therefore, a successful data governance framework has to become part of the organization’s culture, but such a shift requires listening – and remembering that it’s about people, empowerment and accountability.

In most cases, a new data governance framework requires people – those in IT and across the business, including risk management and information security – to change how they work. Any concerns they raise or recommendations they make should be considered. You can encourage feedback through surveys, workshops and open dialog.

Once input has been discussed and a plan agreed upon, it is critical to update roles and responsibilities, provide training and ensure ongoing communication. Many organizations now have internal certifications for different data governance roles, and those who earn them wear these badges with pride.

A top-down management approach will get a data governance initiative off the ground, but only bottom-up cultural adoption will sustain it.

Step 2: Refine the data governance framework

The right capabilities and tools are important for fueling an accurate, real-time data pipeline and governing it for maximum security, quality and value. For example:

Data cataloging: Organizations implementing a data governance framework will benefit from automated metadata harvesting, data mapping, code generation and data lineage with reference data management, lifecycle management and data quality. With these capabilities, you can efficiently integrate and activate enterprise data within a single, unified catalog in accordance with business requirements.
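
To make the lineage capability concrete, here is a minimal sketch in plain Python. The lineage map and element names are hypothetical; in practice, a data catalog harvests and maintains this information automatically:

```python
# Illustrative only: each governed element points at the upstream elements
# it is derived from, so an impact analysis can walk the chain to its roots.
lineage = {
    "report.monthly_revenue": ["warehouse.orders.amount"],
    "warehouse.orders.amount": ["crm.orders.amount_usd"],
}

def trace_to_sources(element):
    """Follow lineage upstream until reaching root sources."""
    upstream = lineage.get(element, [])
    if not upstream:
        return [element]  # no recorded upstream: this is a root source
    sources = []
    for parent in upstream:
        sources.extend(trace_to_sources(parent))
    return sources

print(trace_to_sources("report.monthly_revenue"))  # ['crm.orders.amount_usd']
```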

Data literacy: Being able to discover what data is available and understand what it means in common, standardized terms is important because data elements may mean different things to different parts of the organization. A business glossary answers this need, as does the ability for stakeholders to view data relevant to their roles and understand it within a business context through a role-based portal.
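
As a rough sketch of how a business glossary and role-based visibility might fit together (the term, roles and steward below are made up, not any product’s schema):

```python
# Illustrative only: one standardized term with a definition, a steward
# and the roles allowed to see it through a role-based portal.
glossary = {
    "churn_rate": {
        "definition": "Share of customers lost during a reporting period",
        "steward": "marketing_ops",
        "visible_to_roles": ["analyst", "marketing", "executive"],
    },
}

def terms_for_role(role):
    """Return the glossary terms a given role can see."""
    return [term for term, meta in glossary.items()
            if role in meta["visible_to_roles"]]

print(terms_for_role("analyst"))  # ['churn_rate']
```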

Such tools are further enhanced if they can be integrated across data and business architectures and when they promote self-service and collaboration, which are also important to the cultural shift.

 


Step 3: Prioritize then scale the data governance framework

Because data governance is ongoing, it’s important to prioritize the initial areas of focus and scale from there. Organizations that start with 30 to 50 data items are generally more successful than those that attempt more than 1,000 in the early stages.

Find some representative (familiar) data items and create examples for data ownership, quality, lineage and definition so stakeholders can see the data governance framework in action (a rough sketch follows the list below). For example:

  • Data ownership model showing a data item, its definition, producers, consumers, stewards and quality rules (for profiling)
  • Workflow showing the creation, enrichment and approval of the above data item to demonstrate collaboration
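
Here is one way such an ownership model and workflow could be sketched in Python; the field names and status values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataItem:
    """Ownership model for one governed data item (illustrative)."""
    name: str
    definition: str
    producers: List[str]
    consumers: List[str]
    steward: str
    quality_rules: List[str]      # rules applied during profiling
    status: str = "created"       # workflow: created -> enriched -> approved

item = DataItem(
    name="customer_id",
    definition="Unique identifier assigned to a customer at account creation",
    producers=["CRM"],
    consumers=["Billing", "Analytics"],
    steward="jane.doe",
    quality_rules=["not_null", "unique"],
)

# Minimal collaboration workflow: the steward enriches the item, then a
# reviewer approves it, mirroring the second bullet above.
for step in ("enriched", "approved"):
    item.status = step
print(item.name, item.status)  # customer_id approved
```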

Whether your organization is just adopting data governance or the goal is to refine an existing data governance framework, the erwin DG RediChek will provide helpful insights to guide you on the journey.


Data Modeling in a Jargon-filled World – Managed Data Lakes

More and more businesses are adopting managed data lakes.

Earlier in this blog series, we established that leading organizations are adopting a variety of approaches to manage data, including data that may be sourced from a wide range of NoSQL, NewSQL, RDBMS and unstructured sources.

In this post, we’ll discuss managed data lakes and their applications as a hybrid of less structured data and more traditionally structured relational data. We’ll also talk about whether there’s still a need for data modeling and metadata management.

The term “data lake” was coined by James Dixon of Pentaho in a blog entry in which he said:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Use of the term quickly took on a life of its own, with often divergent meanings – so much so that four years later Mr. Dixon felt compelled to rebut criticisms from the analyst community by pointing out that they were objecting to things he never actually said about data lakes.

However, in my experience and despite Mr. Dixon’s objections, the notion that a data lake can contain data from more than one source is now widely accepted.

Similarly, while most early data lake implementations used Hadoop with many vendors pitching the idea that a data lake had to be implemented as a Hadoop data store, the notion that data lakes can be implemented on non-Hadoop platforms, such as Azure Blob storage or Amazon S3, has become increasingly widespread.

So a data lake – as the term is widely used in 2017 – is a detailed (non-aggregated) data store that can contain structured and/or unstructured data from more than one source, implemented on some kind of inexpensive, massively scalable storage platform.

But what are “managed data lakes?”

To answer that question, let’s first touch on why many early data lake projects failed or significantly missed expectations. Criticisms arose quickly, many of them aimed at data lake implementations that strayed from the original vision described above.

Vendors seized on data lakes as a marketing tool and, as often happens in our industry, promised they could do almost anything. As long as you poured your data into the lake, people in the organization would somehow magically find exactly the data they needed just when they needed it. As is usually the case, the reality for most organizations turned out to be quite different, for three important reasons:

  1. Most analysts at large organizations didn’t have the skillsets to wade through the rapidly accumulating pool of information in Hadoop, or whichever new platform had been chosen to implement the data lake, to locate the data they needed.
  2. Not enough attention was paid to the need to provide metadata that would help people find the data they needed.
  3. Most interesting analytics are a result of integrating disparate data points to draw conclusions, and integration had not been an area of focus in most data lake implementations.

In the face of growing disenchantment with data lake implementations, some organizations and vendors pivoted to address these drawbacks. They did so by embracing what is most commonly called a managed data lake, though some prefer the label “curated data lake” or “modern data warehouse.”

The idea is to address the three criticisms mentioned above by developing an architectural approach that allows for the use of SQL, making data more accessible and providing more metadata about the data available in the data lake. It also takes on some of the challenging work of integration and transformation that earlier data lake implementations had hoped to kick down the road or avoid entirely.

The result in most implementations of a managed data lake is a hybrid that tries to blend the strengths of the original data lake concept with the strengths of traditional large-scale data warehousing (as opposed to the narrow data mart approach Mr. Dixon used as a foil when originally describing data lakes).

Incoming data, either structured or unstructured, can be easily and quickly loaded from many different sources (applications, IoT, third parties, etc.). The data can be accumulated with minimal processing at reasonable cost using a bulk storage platform such as Hadoop, Azure Blob storage or Amazon S3.

Then the data that is widely used within the organization can be integrated and made available through a SQL or SQL-like interface, ranging from Hive to Postgres to a tried-and-true commercial relational database such as SQL Server (or its cloud-based cousin, Azure SQL Data Warehouse).
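
As a hedged sketch of this pattern, assuming a PySpark environment with access to bulk storage (the bucket, paths and column names below are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("managed-data-lake").getOrCreate()

# Raw events land in inexpensive bulk storage with minimal up-front processing...
events = spark.read.json("s3a://example-lake/raw/events/")

# ...and the widely used subset is exposed through a familiar SQL interface.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""").show()
```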

In this scenario, a handful of self-sufficient data scientists may wade (or swim or dive) in the surrounding data lake. However, most analysts in most organizations still spend most of their time using familiar SQL-capable tools to analyze data stored in the core of the managed data lake – an island in the lake if we really want to torture the analogy – which is typically implemented either using an RDBMS or a relational layer like Hive on top of the bulk-storage layer.

It’s important to note that these are not two discrete silos. Most major vendors have added capabilities to their database and BI offerings to enable analysis of both RDBMS-based and bulk-storage layer data through a familiar SQL interface.

This enables a much larger percentage of an organization’s analysts to access data both in the core and the less structured surrounding lake, using tools with which they’re already familiar.
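
One way to picture that bridging, again assuming PySpark (the JDBC connection details, table and columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-plus-lake").getOrCreate()

# The curated relational core, read over JDBC (details are made up).
customers = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example-host;databaseName=core")
    .option("dbtable", "dbo.customers")
    .load())

# Less structured clickstream data from the surrounding lake.
clicks = spark.read.json("s3a://example-lake/raw/clicks/")

# One join spans both stores through the same familiar DataFrame/SQL interface.
customers.join(clicks, "customer_id").groupBy("segment").count().show()
```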

As this hybrid managed data lake approach incorporates a relational core, robust data modeling capabilities are as important as ever. The same goes for data governance and a thorough focus on metadata to provide clear naming and definitions to assist in finding and linking with the most appropriate data.

This is true whether inside the structured relational core of the managed data lake or in the surrounding, more fluid data lake.

As you probably guessed from some of the links in this post, more and more managed data lakes are being implemented in the cloud. Please join us next time for the fifth installment in our series: Data Modeling in a Jargon-filled World – The Cloud.