It’s important we recognize the benefits of data lineage.
As corporate data governance programs have matured, the inventory of agreed-to data policies has grown rapidly. These include guidelines for data quality assurance, regulatory compliance and data democratization, among other information utilization initiatives.
Organizations that are challenged by translating their defined data policies into implemented processes and procedures are starting to identify tools and technologies that can supplement the ways organizational data policies can be implemented and practiced.
One such technique, data lineage, is gaining prominence as a core operational business component of the data governance technology architecture. Data lineage encompasses processes and technology to provide full-spectrum visibility into the ways that data flow across the enterprise.
To data-driven businesses, the benefits of data lineage are significant. Data lineage tools are used to survey, document and enable data stewards to query and visualize the end-to-end flow of information units from their origination points through the series of transformation and processing stages to their final destination.
The Benefits of Data Lineage
Data stewards are attracted to data lineage because the benefits of data lineage help in a number of different governance practices, including:
1. Operational intelligence
At its core, data lineage captures the mappings of the rapidly growing number of data pipelines in the organization. Visualizing the information flow landscape provides insight into the “demographics” of data consumption and use, answering questions such as “what data sources feed the greatest number of downstream sources” or “which data analysts use data that is ingested from a specific data source.” Collecting this intelligence about the data landscape better positions the data stewards for enforcing governance policies.
2. Business terminology consistency
One of the most confounding data governance challenges is understanding the semantics of business terminology within data management contexts. Because application development was traditionally isolated within each business function, the same (or similar) terms are used in different data models, even though the designers did not take the time to align definitions and meanings. Data lineage allows the data stewards to find common business terms, review their definitions, and determine where there are inconsistencies in the ways the terms are used.
3. Data incident root cause analysis
It has long been asserted that when a data consumer finds a data error, the error most likely was introduced into the environment at an earlier stage of processing. Yet without a “roadmap” that indicates the processing stages through which the data were processed, it is difficult to speculate where the error was actually introduced. Using data lineage, though, a data steward can insert validation probes within the information flow to validate data values and determine the stage in the data pipeline where an error originated.
4. Data quality remediation assessment
Root cause analysis is just the first part of the data quality process. Once the data steward has determined where the data flaw was introduced, the next step is to determine why the error occurred. Again, using a data lineage mapping, the steward can trace backward through the information flow to examine the standardizations and transformations applied to the data, validate that transformations were correctly performed, or identify one (or more) performed incorrectly, resulting in the data flaw.
5. Impact analysis
The enterprise is always subject to changes; externally-imposed requirements (such as regulatory compliance) evolve, internal business directives may affect user expectations, and ingested data source models may change unexpectedly. When there is a change to the environment, it is valuable to assess the impacts to the enterprise application landscape. In the event of a change in data expectations, data lineage provides a way to determine which downstream applications and processes are affected by the change and helps in planning for application updates.
6. Performance assessment
Not only does lineage provide a collection of mappings of data pipelines, it allows for the identification of potential performance bottlenecks. Data pipeline stages with many incoming paths are candidate bottlenecks. Using a set of data lineage mappings, the performance analyst can profile execution times across different pipelines and redistribute processing to eliminate bottlenecks.
7. Policy compliance
Data policies can be implemented through the specification of business rules. Compliance with these business rules can be facilitated using data lineage by embedding business rule validation controls across the data pipelines. These controls can generate alerts when there are noncompliant data instances.
8. Auditability of data pipelines
In many cases, regulatory compliance is a combination of enforcing a set of defined data policies along with a capability for demonstrating that the overall process is compliant. Data lineage provides visibility into the data pipelines and information flows that can be audited thereby supporting the compliance process.
Evaluating Enterprise Data Lineage Tools
While data lineage benefits are obvious, large organizations with complex data pipelines and data flows do face challenges in embracing the technology to document the enterprise data pipelines. These include:
- Surveying the enterprise – Gathering information about the sources, flows and configurations of data pipelines.
- Maintenance – Configuring a means to maintain an up-to-date view of the data pipelines.
- Deliverability – Providing a way to give data consumers visibility to the lineage maps.
- Sustainability – Ensuring sustainability of the processes for producing data lineage mappings.
Producing a collection of up-to-date data lineage mappings that are easily reviewed by different data consumers depends on addressing these challenges. When considering data lineage tools, keep these issues in mind when evaluating how well the tools can meet your data governance needs.
erwin Data Intelligence (erwin DI) helps organizations automate their data lineage initiatives. Learn more about data lineage with erwin DI.