
Data Preparation and Data Mapping: The Glue Between Data Management and Data Governance to Accelerate Insights and Reduce Risks

Organizations have spent a lot of time and money trying to harmonize data across diverse platforms, including cleansing, uploading metadata, converting code, defining business glossaries, tracking data transformations and so on. But the attempts to standardize data across the entire enterprise haven’t produced the desired results.

A company can’t effectively implement data governance – documenting and applying business rules and processes, analyzing the impact of changes and conducting audits – if it fails at data management.

The problem usually starts by relying on manual integration methods for data preparation and mapping. It’s only when companies take their first stab at manually cataloging and documenting operational systems, processes and the associated data, both at rest and in motion, that they realize how time-consuming the entire data prepping and mapping effort is, and why that work is sure to be compounded by human error and data quality issues.

To effectively promote business transformation, as well as fulfill regulatory and compliance mandates, there can’t be any mishaps.

It’s obvious that taking the manual road makes it very challenging to discover and synthesize data that resides in different formats across thousands of unharvested, undocumented databases, applications, ETL processes and procedural code.

Consider the problematic issue of manually mapping source system fields (typically source files or database tables) to target system fields (such as different tables in target data warehouses or data marts).

These source mappings generally are documented across a slew of unwieldy spreadsheets in their “pre-ETL” stage as the input for ETL development and testing. However, the ETL design process often suffers as it evolves because spreadsheet mapping data isn’t updated or may be incorrectly updated thanks to human error. So questions linger about whether transformed data can be trusted.
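
To make the contrast concrete, here is a minimal sketch in Python – with hypothetical table, column and transformation names – of what a single spreadsheet mapping row looks like once it is captured as structured, versioned data instead:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class FieldMapping:
    """One source-to-target mapping rule – the unit of work normally buried in a spreadsheet row."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformation: str = ""          # e.g. "TRIM(src)" or "CAST(src AS DATE)"
    updated_by: str = "unknown"
    updated_at: datetime = field(default_factory=datetime.utcnow)

# A few hypothetical mapping rules that would otherwise live in "pre-ETL" spreadsheets.
mappings = [
    FieldMapping("crm.customer", "cust_nm", "dw.dim_customer", "customer_name", "TRIM(cust_nm)"),
    FieldMapping("crm.customer", "cust_dob", "dw.dim_customer", "birth_date", "CAST(cust_dob AS DATE)"),
]

# Because the mappings are structured data, they can be validated and versioned
# instead of silently drifting out of date as the ETL design evolves.
for m in mappings:
    assert m.source_column and m.target_column, "incomplete mapping row"
```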

Data Quality Obstacles

The sad truth is that high-paid knowledge workers like data scientists spend up to 80 percent of their time finding and understanding source data and resolving errors or inconsistencies, rather than analyzing it for real value.

Statistics are similar when looking at major data integration projects, such as data warehousing and master data management with data stewards challenged to identify and document data lineage and sensitive data elements.

So how can businesses produce value from their data when errors are introduced through manual integration processes? How can enterprise stakeholders gain accurate and actionable insights when data can’t be easily and correctly translated into business-friendly terms?

How can organizations master seamless data discovery, movement, transformation and IT and business collaboration to reverse the ratio of preparation to value delivered?

What’s needed to overcome these obstacles is establishing an automated, real-time, high-quality and metadata-driven pipeline useful for everyone, from data scientists to enterprise architects to business analysts to C-level execs.

Doing so will require a hearty data management strategy and technology for automating the timely delivery of quality data that measures up to business demands.

From there, organizations need a sturdy data governance strategy and technology to automatically link and sync well-managed data with core capabilities for auditing, statutory reporting and compliance requirements, as well as to drive business insights.

Creating a High-Quality Data Pipeline

Working hand-in-hand, data management and data governance provide a real-time, accurate picture of the data landscape, including “data at rest” in databases, data lakes and data warehouses and “data in motion” as it is integrated with and used by key applications. And there’s control of that landscape to facilitate insight and collaboration and limit risk.

With a metadata-driven, automated, real-time, high-quality data pipeline, all stakeholders can access data they understand, trust and are authorized to use. At last they can base strategic decisions on a full inventory of reliable information.

The integration of data management and governance also supports industry needs to fulfill regulatory and compliance mandates, ensuring that audits are not compromised by the inability to discover key data or by failing to tag sensitive data as part of integration processes.

Data-driven insights, agile innovation, business transformation and regulatory compliance are the fruits of data preparation/mapping and enterprise modeling (business process, enterprise architecture and data modeling) that revolves around a data governance hub.

erwin Mapping Manager (MM) combines data management and data governance processes in an automated flow through the integration lifecycle – from data mapping for harmonization and aggregation to generating the physical embodiment of data lineage, that is, the creation, movement and transformation of transactional and operational data.

Its hallmark is a consistent approach to data delivery (business glossaries connect physical metadata to specific business terms and definitions) and metadata management (via data mappings).



Top 10 Reasons to Automate Data Mapping and Data Preparation

Data preparation is notorious for being the most time-consuming area of data management. It’s also expensive.

“Surveys show the vast majority of time is spent on this repetitive task, with some estimates showing it takes up as much as 80% of a data professional’s time,” according to Information Week. And a Trifacta study notes that overreliance on IT resources for data preparation costs organizations billions.

Data collection can take a variety of forms, but most often in IT shops around the world it comes in a spreadsheet – or rather a collection of spreadsheets, often numbering in the hundreds or thousands.

Most organizations, especially those competing in the digital economy, don’t have enough time or money for data management using manual processes. And outsourcing is also expensive, with inevitable delays because these vendors are dependent on manual processes too.


Taking the Time and Pain Out of Data Preparation: 10 Reasons to Automate Data Preparation/Data Mapping

  1. Governance and Infrastructure

Data governance and a strong IT infrastructure are critical in the valuation, creation, storage, use, archival and deletion of data. Beyond the simple ability to know where the data came from and whether or not it can be trusted, there is an element of statutory reporting and compliance that often requires a knowledge of how that same data (known or unknown, governed or not) has changed over time.

A design platform that allows for insights like data lineage, impact analysis, full history capture, and other data management features can provide a central hub from which everything can be learned and discovered about the data – whether a data lake, a data vault, or a traditional warehouse.
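
As a rough illustration of the kind of question such a hub answers, the sketch below builds a tiny, hand-written lineage graph (hypothetical asset names) and walks it downstream to perform a simple impact analysis:

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream asset -> downstream asset), as they might be
# harvested from mappings, ETL jobs and procedural code.
lineage_edges = [
    ("crm.customer.cust_nm", "dw.dim_customer.customer_name"),
    ("dw.dim_customer.customer_name", "mart.sales_report.customer"),
]

downstream = defaultdict(list)
for src, tgt in lineage_edges:
    downstream[src].append(tgt)

def impact_of(asset: str) -> set[str]:
    """Breadth-first walk of the lineage graph: everything affected if `asset` changes."""
    affected, queue = set(), deque([asset])
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

print(impact_of("crm.customer.cust_nm"))
# e.g. {'dw.dim_customer.customer_name', 'mart.sales_report.customer'}
```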

  2. Eliminating Human Error

In the traditional data management organization, Excel spreadsheets are used to manage the incoming data design, or what is known as the “pre-ETL” mapping documentation – which does not lend itself to any sort of visibility or auditability. In fact, each unit of work represented in these ‘mapping documents’ becomes an independent variable in the overall system development lifecycle, and therefore nearly impossible to learn from, much less standardize.

The key to creating accuracy and integrity in any exercise is to eliminate the opportunity for human error – which does not mean eliminating humans from the process, but rather incorporating the right tools to reduce the likelihood of error as humans apply their thought processes to the work.

  3. Completeness

The ability to scan and import from a broad range of sources and formats, as well as automated change tracking, means that you will always be able to import your data from wherever it lives and track all of the changes to that data over time.
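
As a toy example of what automated change tracking boils down to, the sketch below compares today’s harvested metadata with yesterday’s snapshot and records every column that appeared or disappeared (both snapshots are hand-written here; in practice they would be scanned from the source systems):

```python
# Hypothetical metadata snapshots: {table: set of column names}.
old_snapshot = {"crm.customer": {"cust_nm", "cust_dob"}}
new_snapshot = {
    "crm.customer": {"cust_nm", "cust_dob", "cust_email"},
    "crm.address": {"addr_line1", "city"},
}

def diff_snapshots(old: dict, new: dict) -> list[str]:
    """List every column added to or dropped from the source since the last scan."""
    changes = []
    for table in sorted(set(old) | set(new)):
        added = new.get(table, set()) - old.get(table, set())
        dropped = old.get(table, set()) - new.get(table, set())
        changes += [f"ADDED   {table}.{col}" for col in sorted(added)]
        changes += [f"DROPPED {table}.{col}" for col in sorted(dropped)]
    return changes

for change in diff_snapshots(old_snapshot, new_snapshot):
    print(change)
```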

  4. Adaptability

Centralized design, immediate lineage and impact analysis, and change activity logging mean that you will always have the answer readily available, or at most a few clicks away. Subsets of data can be identified and generated via predefined templates, generic designs can be generated from standard mapping documents, and changes can be pushed through the ETL process via automation templates for faster processing.

  5. Accuracy

Out-of-the-box capabilities to map your data from source to report make reconciliation and validation a snap, with auditability and traceability built in. Build a full array of validation rules that can be cross-checked against the design mappings in a centralized repository.
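
The cross-check itself can be quite simple once mappings and rules live in one repository. The sketch below, using hypothetical column names and rules, flags target columns that have no validation rule and rules that reference nothing in the design mappings:

```python
# Target columns that appear in the design mappings (hypothetical names).
mapped_targets = {"dw.dim_customer.customer_name", "dw.dim_customer.birth_date"}

# Validation rules keyed by target column; the second rule points at an unmapped column.
validation_rules = {
    "dw.dim_customer.customer_name": "NOT NULL AND LENGTH <= 200",
    "dw.dim_customer.ssn": "MATCHES '\\d{3}-\\d{2}-\\d{4}'",
}

unvalidated = mapped_targets - validation_rules.keys()
orphaned = validation_rules.keys() - mapped_targets

print("Targets with no validation rule:", sorted(unvalidated))
print("Rules with no mapping behind them:", sorted(orphaned))
```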

  6. Timeliness

The ability to be agile and reactive is important – being good at being reactive doesn’t sound like a quality that deserves a pat on the back, but in the case of regulatory requirements, it is paramount.

  7. Comprehensiveness

With access to all of the underlying metadata, source-to-report design mappings, and source and target repositories, you have the power to create reports within your reporting layer that have a traceable origin and can be easily explained to IT, business and regulatory stakeholders.

  8. Clarity

The requirements inform the design, the design platform puts those to action, and the reporting structures are fed the right data to create the right information at the right time via nearly any reporting platform, whether mainstream commercial or homegrown.

  9. Frequency

Adaptation is the key to meeting any frequency interval. Centralized designs and automated ETL patterns that feed your database schemas and reporting structures allow cyclical changes to be made and implemented in half the time of conventional means. Getting beyond the spreadsheet, enabling pattern-based ETL, and automating schema population are ways to ensure you will be ready whenever the need arises to show an audit trail of the change process and clearly articulate who did what and when throughout the system development lifecycle.
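
To illustrate what pattern-based ETL means in practice, the sketch below renders hypothetical mapping records into a boilerplate load statement, so a cyclical schema change only requires regenerating the code rather than re-editing it by hand:

```python
# Hypothetical mapping records: (source table, transformation, target table, target column).
mappings = [
    ("crm.customer", "TRIM(cust_nm)", "dw.dim_customer", "customer_name"),
    ("crm.customer", "CAST(cust_dob AS DATE)", "dw.dim_customer", "birth_date"),
]

def generate_load_sql(target_table: str) -> str:
    """Render all mappings for one target table into a single INSERT ... SELECT pattern."""
    rows = [m for m in mappings if m[2] == target_table]
    columns = ", ".join(m[3] for m in rows)
    expressions = ", ".join(m[1] for m in rows)
    source_table = rows[0][0]
    return f"INSERT INTO {target_table} ({columns})\nSELECT {expressions}\nFROM {source_table};"

print(generate_load_sql("dw.dim_customer"))
```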

  10. Business-Friendly

A user interface designed to be business-friendly means there’s no need to be a data integration specialist to review the common practices outlined and “passively enforced” throughout the tool. Once a process is defined, rules implemented, and templates established, there is little opportunity for error or deviation from the overall process. A diverse set of role-based security options means that everyone can collaborate, learn and audit while maintaining the integrity of the underlying process components.

Faster, More Accurate Analysis with Fewer People

What if you could get more accurate data preparation 50% faster and double your analysis with fewer people?

erwin Mapping Manager (MM) is a patented solution that automates data mapping throughout the enterprise data integration lifecycle, providing data visibility, lineage and governance – freeing up that 80% of a data professional’s time to put that data to work.

With erwin MM, data integration engineers can design and reverse-engineer the movement of data implemented as ETL/ELT operations and stored procedures, building mappings between source and target data assets and designing the transformation logic between them. These designs then can be exported to most ETL and data asset technologies for implementation.

erwin MM is 100% metadata-driven and used to define and drive standards across enterprise integration projects, enable data and process audits, improve data quality, streamline downstream workflows, increase productivity (especially across geographically dispersed teams) and give project teams, IT leadership and management visibility into the ‘real’ status of integration and ETL migration projects.

If an automated data preparation/mapping solution sounds good to you, please check out erwin MM here.



Solving the Enterprise Data Dilemma

Due to the adoption of data-driven business, organizations across the board are facing their own enterprise data dilemmas.

This week erwin announced its acquisition of metadata management and data governance provider AnalytiX DS. The combined company touches every piece of the data management and governance lifecycle, enabling enterprises to fuel automated, high-quality data pipelines for faster speed to accurate, actionable insights.

Why Is This a Big Deal?

From digital transformation to AI, and everything in between, organizations are flooded with data. So, companies are investing heavily in initiatives to use all the data at their disposal, but they face some challenges. Chiefly, deriving meaningful insights from their data – and turning them into actions that improve the bottom line.

This enterprise data dilemma stems from three important but difficult questions to answer: What data do we have? Where is it? And how do we get value from it?

Large enterprises use thousands of unharvested, undocumented databases, applications, ETL processes and procedural code that make it difficult to gather business intelligence, conduct IT audits, and ensure regulatory compliance – not to mention accomplish other objectives around customer satisfaction, revenue growth and overall efficiency and decision-making.

The lack of visibility and control around “data at rest” combined with “data in motion,” as well as difficulties with legacy architectures, means these organizations spend more time finding the data they need than using it to produce meaningful business outcomes.

To remedy this, enterprises need smarter and faster data management and data governance capabilities, including the ability to efficiently catalog and document their systems, processes and the associated data without errors. In addition, business and IT must collaborate outside their traditional operational silos.

But this coveted state of data nirvana isn’t possible without the right approach and technology platform.

Enterprise Data: Making the Data Management-Data Governance Love Connection

Bringing together data management and data governance delivers greater efficiencies to technical users and better analytics to business users. It’s like two sides of the same coin:

  • Data management drives the design, deployment and operation of systems that deliver operational and analytical data assets.
  • Data governance delivers these data assets within a business context, tracks their physical existence and lineage, and maximizes their security, quality and value.

Although these disciplines approach data from different perspectives and are used to produce different outcomes, they have a lot in common. Both require a real-time, accurate picture of an organization’s data landscape, including data at rest in data warehouses and data lakes and data in motion as it is integrated with and used by key applications.

However, creating and maintaining this metadata landscape is challenging because this data in its various forms and from numerous sources was never designed to work in concert. Data infrastructures have been cobbled together over time with disparate technologies, poor documentation and little thought for downstream integration, so the applications and initiatives that depend on data infrastructure are often out-of-date and inaccurate, rendering faulty insights and analyses.

Organizations need to know what data they have and where it’s located, where it came from and how it got there, and what it means in common, standardized business terms – and they must be able to transform it into useful information they can act on, all while controlling its access.

To support the total enterprise data management and governance lifecycle, they need an automated, real-time, high-quality data pipeline. Then every stakeholder – data scientist, ETL developer, enterprise architect, business analyst, compliance officer, CDO and CEO – can fuel the desired outcomes with reliable information on which to base strategic decisions.

Enterprise Data: Creating Your “EDGE”

At the end of the day, all industries are in the data business and all employees are data people. The success of an organization is not measured by how much data it has, but by how well it’s used.

Data governance enables organizations to use their data to fuel compliance, innovation and transformation initiatives with greater agility, efficiency and cost-effectiveness.

Organizations need to understand their data from different perspectives, identify how it flows through and impacts the business, align this business view with a technical view of the data management infrastructure, and synchronize efforts across both disciplines for accuracy, agility and efficiency in building a data capability that impacts the business in a meaningful and sustainable fashion.

The persona-based erwin EDGE creates an “enterprise data governance experience” that facilitates collaboration between both IT and the business to discover, understand and unlock the value of data both at rest and in motion.

By bringing together enterprise architecture, business process, data mapping and data modeling, erwin’s approach to data governance enables organizations to get a handle on how they handle their data. With the broadest set of metadata connectors and automated code generation, data mapping and cataloging tools, the erwin EDGE Platform simplifies the total data management and data governance lifecycle.

This single, integrated solution makes it possible to gather business intelligence, conduct IT audits, ensure regulatory compliance and accomplish any other organizational objective by fueling an automated, high-quality and real-time data pipeline.

With the erwin EDGE, data management and data governance are unified and mutually supportive, with one hand aware and informed by the efforts of the other to:

  • Discover data: Identify and integrate metadata from various data management silos.
  • Harvest data: Automate the collection of metadata from various data management silos and consolidate it into a single source.
  • Structure data: Connect physical metadata to specific business terms and definitions and reusable design standards.
  • Analyze data: Understand how data relates to the business and what attributes it has.
  • Map data flows: Identify where to integrate data and track how it moves and transforms.
  • Govern data: Develop a governance model to manage standards and policies and set best practices.
  • Socialize data: Enable stakeholders to see data in one place and in the context of their roles.
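
As a rough sketch of the “Structure data” and “Socialize data” steps, the example below ties a hypothetical business glossary entry to the physical columns that implement it, so any stakeholder can ask where a business term actually lives:

```python
# Hypothetical business glossary: each term points at the physical assets that implement it.
glossary = {
    "Customer Name": {
        "definition": "Full legal name of the customer as captured at onboarding.",
        "physical_assets": ["crm.customer.cust_nm", "dw.dim_customer.customer_name"],
        "steward": "Data Governance Office",
    },
}

def where_is(term: str) -> list[str]:
    """Let any stakeholder look up which physical columns carry a business term."""
    return glossary.get(term, {}).get("physical_assets", [])

print(where_is("Customer Name"))
# ['crm.customer.cust_nm', 'dw.dim_customer.customer_name']
```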

An integrated solution with data preparation, modeling and governance helps businesses reach data governance maturity – which equals a role-based, collaborative data governance system that serves both IT and business users equally. Such maturity may not happen overnight, but it will ultimately deliver the accurate and actionable insights your organization needs to compete and win.

Your journey to data nirvana begins with a demo of the enhanced erwin Data Governance solution. Register now.



Data Modeling in a Jargon-filled World – Managed Data Lakes

More and more businesses are adopting managed data lakes.

Earlier in this blog series, we established that leading organizations are adopting a variety of approaches to manage data, including data that may be sourced from a wide range of NoSQL, NewSQL, RDBMS and unstructured sources.

In this post, we’ll discuss managed data lakes and their applications as a hybrid of less structured data and more traditionally structured relational data. We’ll also talk about whether there’s still a need for data modeling and metadata management.

The term Data Lake was first coined by James Dixon of Pentaho in a blog entry in which he said:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Use of the term quickly took on a life of its own with often divergent meanings. So much so that four years later Mr. Dixon felt compelled to refute some criticisms by the analyst community by pointing out that they were objecting to things he actually never said about data lakes.

However, in my experience and despite Mr. Dixon’s objections, the notion that a data lake can contain data from more than one source is now widely accepted.

Similarly, while most early data lake implementations used Hadoop with many vendors pitching the idea that a data lake had to be implemented as a Hadoop data store, the notion that data lakes can be implemented on non-Hadoop platforms, such as Azure Blob storage or Amazon S3, has become increasingly widespread.

So a data lake – as the term is widely used in 2017 – is a detailed (non-aggregated) data store that can contain structured and/or non-structured data from more than one source, implemented on some kind of inexpensive, massively scalable storage platform.

But what are “managed data lakes?”

To answer that question, let’s first touch on why many early data lake projects failed or significantly missed expectations. Criticisms were quick to arise, though many of them targeted implementations that had strayed from the original vision established earlier.

Vendors seized on data lakes as a marketing tool, and as often happens in our industry, they promised it could do almost anything. As long as you poured your data into the lake, people in the organization would somehow magically find exactly the data they needed just when they needed it. As is usually the case, it turned out that for most organizations, their reality was quite different. And for three important reasons:

  1. Most large organizations’ analysts didn’t have the skillsets to wade through the rapidly accumulating pool of information in Hadoop, or whichever new platform had been chosen to implement their data lake, to locate the data they needed.
  2. Not enough attention was paid to providing metadata to help people find the data they needed.
  3. Most interesting analytics are a result of integrating disparate data points to draw conclusions, and integration had not been an area of focus in most data lake implementations.

In the face of growing disenchantment with data lake implementations, some organizations and vendors pivoted to address these drawbacks. They did so by embracing what is most commonly called a managed data lake, though some prefer the label “curated data lake” or “modern data warehouse.”

The idea is to address the three criticisms mentioned above by developing an architectural approach that allows for the use of SQL, making data more accessible and providing more metadata about the data available in the data lake. It also takes on some of the challenging work of integration and transformation that earlier data lake implementations had hoped to kick down the road or avoid entirely.

The result in most implementations of a managed data lake is a hybrid that tries to blend the strengths of the original data lake concept with the strengths of traditional large-scale data warehousing (as opposed to the narrow data mart approach Mr. Dixon used as a foil when originally describing data lakes).

Incoming data, either structured or unstructured, can be easily and quickly loaded from many different sources (e.g., applications, IoT, third parties, etc.). The data can be accumulated with minimal processing at reasonable cost using a bulk storage platform such as Hadoop, Azure Blob storage or Amazon S3.

Then the data that is widely used within the organization can be integrated and made available through a SQL or SQL-like interface – anything from Hive to Postgres to a tried-and-true commercial relational database such as SQL Server (or its cloud-based cousin, Azure SQL Data Warehouse).
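
As a rough sketch of that relational layer over bulk storage – assuming a Hive endpoint is available and using the PyHive client, with hypothetical table and bucket names – an external table can expose raw files in object storage to plain SQL:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Define an external table over Parquet files sitting in object storage (hypothetical path).
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS lake.web_events (
    event_time  TIMESTAMP,
    user_id     STRING,
    event_type  STRING
)
STORED AS PARQUET
LOCATION 's3a://example-bucket/raw/web_events/'
"""

conn = hive.connect(host="hive.example.internal", port=10000)  # hypothetical endpoint
cursor = conn.cursor()
cursor.execute(ddl)

# Analysts can now use familiar SQL against data that still lives on cheap bulk storage.
cursor.execute("SELECT event_type, COUNT(*) FROM lake.web_events GROUP BY event_type")
print(cursor.fetchall())
```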

In this scenario, a handful of self-sufficient data scientists may wade (or swim or dive) in the surrounding data lake. However, most analysts in most organizations still spend most of their time using familiar SQL-capable tools to analyze data stored in the core of the managed data lake – an island in the lake if we really want to torture the analogy – which is typically implemented either using an RDBMS or a relational layer like Hive on top of the bulk-storage layer.

It’s important to note that these are not two discrete silos. Most major vendors have added capabilities to their database and BI offerings to enable analysis of both RDBMS-based and bulk-storage layer data through a familiar SQL interface.

This enables a much larger percentage of an organization’s analysts to access data both in the core and the less structured surrounding lake, using tools with which they’re already familiar.

As this hybrid managed data lake approach incorporates a relational core, robust data modeling capabilities are as important as ever. The same goes for data governance and a thorough focus on metadata to provide clear naming and definitions to assist in finding and linking with the most appropriate data.

This is true whether inside the structured relational core of the managed data lake or in the surrounding, more fluid data lake.

As you probably guessed from some of the links in this post, more and more managed data lakes are being implemented in the cloud. Please join us next time for the fifth installment in our series: Data Modeling in a Jargon-filled World – The Cloud.