Categories
erwin Expert Blog Data Governance

What is Data Lineage? Top 5 Benefits of Data Lineage

What is Data Lineage and Why is it Important?

Data lineage is the journey data takes from its creation through its transformations over time. It describes a certain dataset’s origin, movement, characteristics and quality.

Tracing the source of data is an arduous task.

Many large organizations, in their desire to modernize with technology, have acquired several different systems with various data entry points and transformation rules for data as it moves into and across the organization.

data lineage

These tools range from enterprise service bus (ESB) products, data integration tools; extract, transform and load (ETL) tools, procedural code, application program interfaces (API)s, file transfer protocol (FTP) processes, and even business intelligence (BI) reports that further aggregate and transform data.

With all these diverse data sources, and if systems are integrated, it is difficult to understand the complicated data web they form much less get a simple visual flow. This is why data’s lineage must be tracked and why its role is so vital to business operations, providing the ability to understand where data originates, how it is transformed, and how it moves into, across and outside a given organization.

Data Lineage Use Case: From Tracing COVID-19’s Origins to Data-Driven Business

A lot of theories have emerged about the origin of the coronavirus. A recent University of California San Francisco (UCSF) study conducted a genetic analysis of COVID-19 to determine how the virus was introduced specifically to California’s Bay Area.

It detected at least eight different viral lineages in 29 patients in February and early March, suggesting no regional patient zero but rather multiple independent introductions of the pathogen. The professor who directed the study said, “it’s like sparks entering California from various sources, causing multiple wildfires.”

Much like understanding viral lineage is key to stopping this and other potential pandemics, understanding the origin of data, is key to a successful data-driven business.

Top Five Data Lineage Benefits

From my perspective in working with customers of various sizes across multiple industries, I’d like to highlight five data lineage benefits:

1. Business Impact

Data is crucial to every organization’s survival. For that reason, businesses must think about the flow of data across multiple systems that fuel organizational decision-making.

For example, the marketing department uses demographics and customer behavior to forecast sales. The CEO also makes decisions based on performance and growth statistics. An understanding of the data’s origins and history helps answer questions about the origin of data in a Key Performance Indicator (KPI) reports, including:

  • How the report tables and columns are defined in the metadata?
  • Who are the data owners?
  • What are the transformation rules?

Without data lineage, these functions are irrelevant, so it makes sense for a business to have a clear understanding of where data comes from, who uses it, and how it transforms. Also, when there is a change to the environment, it is valuable to assess the impacts to the enterprise application landscape.

In the event of a change in data expectations, data lineage provides a way to determine which downstream applications and processes are affected by the change and helps in planning for application updates.

2. Compliance & Auditability

Business terms and data policies should be implemented through standardized and documented business rules. Compliance with these business rules can be tracked through data lineage, incorporating auditability and validation controls across data transformations and pipelines to generate alerts when there are non-compliant data instances.

Regulatory compliance places greater transparency demands on firms when it comes to tracing and auditing data. For example, capital markets trading firms must understand their data’s origins and history to support risk management, data governance and reporting for various regulations such as BCBS 239 and MiFID II.

Also, different organizational stakeholders (customers, employees and auditors) need to be able to understand and trust reported data. Data lineage offers proof that the data provided is reflected accurately.

3. Data Governance

An automated data lineage solution stitches together metadata for understanding and validating data usage, as well as mitigating the associated risks.

It can auto-document end-to-end upstream and downstream data lineage, revealing any changes that have been made, by whom and when.

This data ownership, accountability and traceability is foundational to a sound data governance program.

See: The Benefits of Data Governance

4. Collaboration

Analytics and reporting are data-dependent, making collaboration among different business groups and/or departments crucial.

The visualization of data lineage can help business users spot the inherent connections of data flows and thus provide greater transparency and auditability.

Seeing data pipelines and information flows further supports compliance efforts.

5. Data Quality

Data quality is affected by data’s movement, transformation, interpretation and selection through people, process and technology.

Root-cause analysis is the first step in repairing data quality. Once a data steward determines where a data flaw was introduced, the reason for the error can be determined.

With data lineage and mapping, the data steward can trace the information flow backward to examine the standardizations and transformations applied to confirm whether they were performed correctly.

See Data Lineage in Action

Data lineage tools document the flow of data into and out of an organization’s systems. They capture end-to-end lineage and ensure proper impact analysis can be performed in the event of problems or changes to data assets as they move across pipelines.

The erwin Data Intelligence Suite (erwin DI) automatically generates end-to-end data lineage, down to the column level and between repositories. You can view data flows from source systems to the reporting layers, including intermediate transformation and business logic.

Join us for the next live demo of erwin Data Intelligence (DI) to see metadata-driven, automated data lineage in action.

erwin data intelligence

Subscribe to the erwin Expert Blog

Once you submit the trial request form, an erwin representative will be in touch to verify your request and help you start data modeling.

Categories
erwin Expert Blog Data Intelligence

The Top 8 Benefits of Data Lineage

It’s important we recognize the benefits of data lineage.

As corporate data governance programs have matured, the inventory of agreed-to data policies has grown rapidly. These include guidelines for data quality assurance, regulatory compliance and data democratization, among other information utilization initiatives.

Organizations that are challenged by translating their defined data policies into implemented processes and procedures are starting to identify tools and technologies that can supplement the ways organizational data policies can be implemented and practiced.

One such technique, data lineage, is gaining prominence as a core operational business component of the data governance technology architecture. Data lineage encompasses processes and technology to provide full-spectrum visibility into the ways that data flow across the enterprise.

To data-driven businesses, the benefits of data lineage are significant. Data lineage tools are used to survey, document and enable data stewards to query and visualize the end-to-end flow of information units from their origination points through the series of transformation and processing stages to their final destination.

Benefits of Data Lineage

The Benefits of Data Lineage

Data stewards are attracted to data lineage because the benefits of data lineage help in a number of different governance practices, including:

1. Operational intelligence

At its core, data lineage captures the mappings of the rapidly growing number of data pipelines in the organization. Visualizing the information flow landscape provides insight into the “demographics” of data consumption and use, answering questions such as “what data sources feed the greatest number of downstream sources” or “which data analysts use data that is ingested from a specific data source.” Collecting this intelligence about the data landscape better positions the data stewards for enforcing governance policies.

2. Business terminology consistency

One of the most confounding data governance challenges is understanding the semantics of business terminology within data management contexts. Because application development was traditionally isolated within each business function, the same (or similar) terms are used in different data models, even though the designers did not take the time to align definitions and meanings. Data lineage allows the data stewards to find common business terms, review their definitions, and determine where there are inconsistencies in the ways the terms are used.

3. Data incident root cause analysis

It has long been asserted that when a data consumer finds a data error, the error most likely was introduced into the environment at an earlier stage of processing. Yet without a “roadmap” that indicates the processing stages through which the data were processed, it is difficult to speculate where the error was actually introduced. Using data lineage, though, a data steward can insert validation probes within the information flow to validate data values and determine the stage in the data pipeline where an error originated.

4. Data quality remediation assessment

Root cause analysis is just the first part of the data quality process. Once the data steward has determined where the data flaw was introduced, the next step is to determine why the error occurred. Again, using a data lineage mapping, the steward can trace backward through the information flow to examine the standardizations and transformations applied to the data, validate that transformations were correctly performed, or identify one (or more) performed incorrectly, resulting in the data flaw.

5. Impact analysis

The enterprise is always subject to changes; externally-imposed requirements (such as regulatory compliance) evolve, internal business directives may affect user expectations, and ingested data source models may change unexpectedly. When there is a change to the environment, it is valuable to assess the impacts to the enterprise application landscape. In the event of a change in data expectations, data lineage provides a way to determine which downstream applications and processes are affected by the change and helps in planning for application updates.

6. Performance assessment

Not only does lineage provide a collection of mappings of data pipelines, it allows for the identification of potential performance bottlenecks. Data pipeline stages with many incoming paths are candidate bottlenecks. Using a set of data lineage mappings, the performance analyst can profile execution times across different pipelines and redistribute processing to eliminate bottlenecks.

7. Policy compliance

Data policies can be implemented through the specification of business rules. Compliance with these business rules can be facilitated using data lineage by embedding business rule validation controls across the data pipelines. These controls can generate alerts when there are noncompliant data instances.

8. Auditability of data pipelines

In many cases, regulatory compliance is a combination of enforcing a set of defined data policies along with a capability for demonstrating that the overall process is compliant. Data lineage provides visibility into the data pipelines and information flows that can be audited thereby supporting the compliance process.

Evaluating Enterprise Data Lineage Tools

While data lineage benefits are obvious, large organizations with complex data pipelines and data flows do face challenges in embracing the technology to document the enterprise data pipelines. These include:

  • Surveying the enterprise – Gathering information about the sources, flows and configurations of data pipelines.
  • Maintenance – Configuring a means to maintain an up-to-date view of the data pipelines.
  • Deliverability – Providing a way to give data consumers visibility to the lineage maps.
  • Sustainability – Ensuring sustainability of the processes for producing data lineage mappings.

Producing a collection of up-to-date data lineage mappings that are easily reviewed by different data consumers depends on addressing these challenges. When considering data lineage tools, keep these issues in mind when evaluating how well the tools can meet your data governance needs.

erwin Data Intelligence (erwin DI) helps organizations automate their data lineage initiatives. Learn more about data lineage with erwin DI.

Value of Data Intelligence IDC Report