ETL code Archives - erwin, Inc.

The benefits of Data Vault automation from the more abstract – like improving data integrity – to the tangible – such as clearly identifiable savings in cost and time.

So Seriously … You Should Automate Your Data Vault

By Danny Sandwell

Data Vault is a methodology for architecting and managing data warehouses in complex data environments where new data types and structures are constantly introduced.

Without Data Vault, data warehouses are difficult and time consuming to change causing latency issues and slowing time to value. In addition, the queries required to maintain historical integrity are complex to design and run slow causing performance issues and potentially incorrect results because the ability to understand relationships between historical snap shots of data is lacking.

In his blog, Dan Linstedt, the creator of Data Vault methodology, explains that Data Vaults “are extremely scalable, flexible architectures” enabling the business to grow and change without “the agony and pain of high costs, long implementation and test cycles, and long lists of impacts across the enterprise warehouse.”

With a Data Vault, new functional areas typically are added quickly and easily, with changes to existing architecture taking less than half the traditional time with much less impact on the downstream systems, he notes.

Astonishingly, nearly 20 years since the methodology’s creation, most Data Vault design, development and deployment phases are still handled manually. But why?

Traditional manual efforts to define the Data Vault population and create ETL code from scratch can take weeks or even months. The entire process is time consuming slowing down the data pipeline and often riddled with human errors.

On the flipside, automating the development and deployment of design changes and the resulting data movement processing code ensures companies can accelerate dev and deployment in a timely and cost-effective manner.

Benefits of Data Vault Automation – A Case Study …

Global Pharma Company Saves Considerable Time and Money with Data Vault Automation

Let’s take a look at a large global pharmaceutical company that switched to Data Vault automation with staggering results.

Like many pharmaceutical companies, it manages a massive data warehouse combining clinical trial, supply chain and other mission-critical data. They had chosen a Data Vault schema for its flexibility in handling change but found creating the hubs and satellite structure incredibly laborious.

They needed to accelerate development, as well as aggregate data from different systems for internal customers to access and share. Additionally, the company needed lineage and traceability for regulatory compliance efforts.

With this ability, they can identify data sources, transformations and usage to safeguard protected health information (PHI) for clinical trials.

After an initial proof of concept, they deployed erwin Data Vault Automation and generated more than 200 tables, jobs and processes with 10 to 12 scripts. The highly schematic structure of the models enabled large portions of the modeling process to be automated, dramatically accelerating Data Vault projects and optimizing data warehouse management.

erwin Data Vault Automation helped this pharma customer automate the complete lifecycle – accelerating development while increasing consistency, simplicity and flexibility – to save considerable time and money.

For this customer the benefits of data vault automation were as such:

Saving an estimated 70% of the costs of manual development
Generating 95% of the production code with “zero touch,” improving the time to business value and significantly reduced costly re-work associated with error-prone manual processes
Increasing data integrity, including for new requirements and use cases regardless of changes to the warehouse structure because legacy source data doesn’t degrade
Creating a sustainable approach to Data Vault deployment, ensuring the agile, adaptable and timely delivery of actionable insights to the business in a well-governed facility for regulatory compliance, including full transparency and ease of auditability

Homegrown Tools Never Provide True Data Vault Automation

Many organizations use some form of homegrown tool or standalone applications. However, they don’t integrate with other tools and components of the architecture, they’re expensive, and quite frankly, they make it difficult to derive any meaningful results.

erwin Data Vault Automation centralizes the specification and deployment of Data Vault architectures for better control and visibility of the software development lifecycle. erwin Data Catalog makes it easy to discover, organize, curate and govern data being sourced for and managed in the warehouse.

With this solution, users select data sets to be included in the warehouse and fully automate the loading of Data Vault structures and ETL operations.

erwin Data Vault Smart Connectors eliminate the need for a business analyst and ETL developers to repeat mundane tasks, so they can focus on choosing and using the desired data instead. This saves considerable development time and effort plus delivers a high level of standardization and reuse.

After the Data Vault processes have been automated, the warehouse is well documented with traceability from the marts back to the operational data to speed the investigation of issues and analyze the impact of changes.

Bottom line: if your Data Vault integration is not automated, you’re already behind.

If you’d like to get started with erwin Data Vault Automation or request a quote, you can email consulting@erwin.com.

A data catalog benefits organizations in a myriad of ways. With the right data catalog tool, organizations can automate enterprise metadata management – including data cataloging, data mapping, data quality and code generation for faster time to value and greater accuracy for data movement and/or deployment projects.

Data cataloging helps curate internal and external datasets for a range of content authors. Gartner says this doubles business benefits and ensures effective management and monetization of data assets in the long-term if linked to broader data governance, data quality and metadata management initiatives.

But even with this in mind, the importance of data cataloging is growing. In the regulated data world (GDPR, HIPAA etc) organizations need to have a good understanding of their data lineage – and the data catalog benefits to data lineage are substantial.

Data lineage is a core operational business component of data governance technology architecture, encompassing the processes and technology to provide full-spectrum visibility into the ways data flows across an enterprise.

There are a number of different approaches to data lineage. Here, I outline the common approach, and the approach incorporating data cataloging – including the top 5 data catalog benefits for understanding your organization’s data lineage.

Data Lineage – The Common Approach

The most common approach for assembling a collection of data lineage mappings traces data flows in a reverse manner. The process begins with the target or data end-point, and then traversing the processes, applications, and ETL tasks in reverse from the target.

For example, to determine the mappings for the data pipelines populating a data warehouse, a data lineage tool might begin with the data warehouse and examine the ETL tasks that immediately proceed the loading of the data into the target warehouse.

The data sources that feed the ETL process are added to a “task list,” and the process is repeated for each of those sources. At each stage, the discovered pieces of lineage are documented. At the end of the sequence, the process will have reverse-mapped the pipelines for populating that warehouse.

While this approach does produce a collection of data lineage maps for selected target systems, there are some drawbacks.

First, this approach focuses only on assembling the data pipelines populating the selected target system but does not necessarily provide a comprehensive view of all the information flows and how they interact.
Second, this process produces the information that can be used for a static view of the data pipelines, but the process needs to be executed on a regular basis to account for changes to the environment or data sources.
Third, and probably most important, this process produces a technical view of the information flow, but it does not necessarily provide any deeper insights into the semantic lineage, or how the data assets map to the corresponding business usage models.

A Data Catalog Offers an Alternate Data Lineage Approach

An alternate approach to data lineage combines data discovery and the use of a data catalog that captures data asset metadata with a data mapping framework that documents connections between the data assets.

This data catalog approach also takes advantage of automation, but in a different way: using platform-specific data connectors, the tool scans the environment for storing each data asset and imports data asset metadata into the data catalog.

When data asset structures are similar, the tool can compare data element domains and value sets, and automatically create the data mapping.

In turn, the data catalog approach performs data discovery using the same data connectors to parse the code involved in data movement, such as major ETL environments and procedural code – basically any executable task that moves data.

The information collected through this process is reverse engineered to create mappings from source data sets to target data sets based on what was discovered.

For example, you can map the databases used for transaction processing, determine that subsets of the transaction processing database are extracted and moved to a staging area, and then parse the ETL code to infer the mappings.

These direct mappings also are documented in the data catalog. In cases where the mappings are not obvious, a tool can help a data steward manually map data assets into the catalog.

The result is a data catalog that incorporates the structural and semantic metadata associated with each data asset as well as the direct mappings for how that data set is populated.

Learn more about data cataloging.

And this is a powerful representative paradigm – instead of capturing a static view of specific data pipelines, it allows a data consumer to request a dynamically-assembled lineage from the documented mappings.

By interrogating the catalog, the current view of any specific data lineage can be rendered on the fly that shows all points of the data lineage: the origination points, the processing stages, the sequences of transformations, and the final destination.

Materializing the “current active lineage” dynamically reduces the risk of having an older version of the lineage that is no longer relevant or correct. When new information is added to the data catalog (such as a newly-added data source of a modification to the ETL code), dynamically-generated views of the lineage will be kept up-to-date automatically.

Top 5 Data Catalog Benefits for Understanding Data Lineage

A data catalog benefits data lineage in the following five distinct ways:

1. Accessibility

The data catalog approach allows the data consumer to query the tool to materialize specific data lineage mappings on demand.

2. Currency

The data lineage is rendered from the most current data in the data catalog.

3. Breadth

As the number of data assets documented in the data catalog increases, the scope of the materializable lineage expands accordingly. With all corporate data assets cataloged, any (or all!) data lineage mappings can be produced on demand.

4. Maintainability and Sustainability

Since the data lineage mappings are not managed as distinct artifacts, there are no additional requirements for maintenance. As long as the data catalog is kept up to date, the data lineage mappings can be materialized.

5. Semantic Visibility

In addition to visualizing the physical movement of data across the enterprise, the data catalog approach allows the data steward to associate business glossary terms, data element definitions, data models, and other semantic details with the different mappings. Additional visualization methods can demonstrate where business terms are used, how they are mapped to different data elements in different systems, and the relationships among these different usage points.

One can impose additional data governance controls with project management oversight, which allows you to designate data lineage mappings in terms of the project life cycle (such as development, test or production).

Aside from these data catalog benefits, this approach allows you to reduce the amount of manual effort for accumulating the information for data lineage and continually reviewing the data landscape to maintain consistency, thus providing a greater return on investment for your data intelligence budget.

Learn more about data cataloging.