
Five Pillars of Data Governance Readiness: Delivery Capability

The five pillars of data governance readiness should be the starting point for implementing or revamping any DG initiative.

In a recent CSO Magazine article, “Why data governance should be corporate policy,” the author states: “Data is like water, and water is a fundamental resource for life, so data is an essential resource for the business. Data governance ensures this resource is protected and managed correctly enabling us to meet our customer’s expectations.”

Over the past few weeks, we’ve been exploring the five pillars of data governance (DG) readiness, and this week we turn our attention to the fifth and final pillar, delivery capability.

Together, the five pillars of data governance readiness work as a step-by-step guide to a successful DG implementation and ongoing initiative.

As a refresher, the first four pillars are:

  1. The starting point is garnering initiative sponsorship from executives, before fostering support from the wider organization.

  2. Organizations should then appoint a dedicated team to oversee and manage the initiative. Although DG is an organization-wide strategic initiative, it needs experience and leadership to guide it.

  3. Once the above pillars are accounted for, the next step is to understand how data governance fits with the wider data management suite so that all components of a data strategy work together for maximum benefits.

  4. Organizations then need an enterprise data management methodology as a plan of action to assemble the necessary tools.

Once you’ve completed these steps, how do you go about picking the right solution for enterprise-wide data governance?

Five Pillars of Data Governance: Delivery Capability – What’s the Right Solution?

Many organizations don’t think about enterprise data governance technologies when they begin a data governance initiative. They believe that a general-purpose tool suite, like those from Microsoft, can support their DG initiative. That’s simply not the case.

Selecting the proper data governance solution should be part of developing the data governance initiative’s technical requirements. However, the first thing to understand is that the “right” solution is subjective.

Data stewards work with metadata rather than data 80 percent of the time. As a result, successful and sustainable data governance initiatives are supported by a full-scale, enterprise-grade metadata management tool.
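To make that concrete, here is a minimal Python sketch of the kind of record a data steward might curate in such a tool. The field names and structure are illustrative assumptions only, not erwin's actual metadata model.

    # A minimal sketch of the kind of metadata record a data steward curates.
    # Field names here are illustrative only, not any product's format.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BusinessTerm:
        name: str                      # business-friendly term
        definition: str                # agreed, cross-departmental definition
        steward: str                   # accountable data steward
        source_columns: List[str] = field(default_factory=list)  # physical columns the term maps to
        quality_rules: List[str] = field(default_factory=list)   # plain-language data quality rules

    clv = BusinessTerm(
        name="Customer Lifetime Value",
        definition="Projected net revenue attributed to a customer over the relationship.",
        steward="jane.doe@example.com",
        source_columns=["crm.customers.clv_usd", "dw.fact_revenue.customer_id"],
        quality_rules=["Must be non-negative", "Recalculated at least monthly"],
    )
    print(clv.name, "->", clv.source_columns)

Notice that the steward never touches the underlying values; the record describes where the data lives, what it means and who answers for it, which is exactly the metadata-heavy work described above.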

Additionally, many organizations haven’t implemented data quality products when they begin a DG initiative. Product selections, including those for data quality management, should be based on the organization’s business goals, its current state of data quality and enterprise data management, and best practices as promoted by the data quality management team.

If your organization doesn’t have an existing data quality management product, a data governance initiative can support the need for data quality and the eventual evaluation and selection of the proper data quality management product.

Enterprise data modeling is also important. A component of enterprise data architecture, it’s an enabling force for effective data management and successful data governance. Having the capability to manage data architecture and data modeling with the optimal products can have a positive effect on DG by giving the initiative architectural support for the policies, practices, standards and processes that data governance creates.

Finally, and perhaps most important, the lack of a formal data governance team/unit has been cited as a leading cause of DG failure. Having the capability to manage all data governance and data stewardship activities has a positive effect.

Shopping for Data Governance Technology

DG is part of a larger data puzzle. Although it’s a key enabler of data-driven business, it’s only effective in the context of the data management suite in which it belongs.

Therefore, when shopping for a data governance solution, organizations should look for DG tools that:

  • Unify critical data governance domains
  • Leverage role-appropriate interfaces to bring together stakeholders and processes, supporting a culture committed to acknowledging data as the mission-critical asset it is
  • Orchestrate the key mechanisms required to discover, fully understand, actively govern and effectively socialize and align data to the business

Data Governance Readiness: Delivery Capability

Here’s an initial checklist of questions to ask in your evaluation of a DG solution. Does it support:

  • Relational, unstructured, on-premises and cloud data?
  • A business-friendly environment to build business glossaries with taxonomies of data standards?
  • Unified capabilities to integrate business glossaries, data dictionaries and reference data, data quality metrics, business rules and data usage policies?
  • Regulating data and managing data collaboration through assigned roles, business rules and responsibilities, and defined governance processes and workflows?
  • Viewing data dashboards, KPIs and more via configurable role-based interfaces?
  • Providing key integrations with enterprise architecture, business process modeling/management and data modeling?
  • A SaaS model for rapid deployment and low TCO?

To assess your data governance readiness, especially with the General Data Protection Regulation about to take effect, click here.

You also can try erwin DG for free. Click here to start your free trial.



A New Wave in Application Development

Application development is new again.

The ever-changing business landscape – fueled by digital transformation initiatives across industries – demands that businesses deliver innovative customer- and partner-facing solutions, not just tactical apps to support internal functions.

Therefore, application developers are playing an increasingly important role in achieving business goals. The financial services sector is a notable example, with companies like JPMorgan Chase spending millions on emerging fintech like online and mobile tools for opening accounts and completing transactions, real-time stock portfolio values, and electronic trading and cash management services.

But businesses are finding that creating market-differentiating applications to improve the customer experience, and subsequently customer satisfaction, requires some significant adjustments. For example, using non-relational database technologies, building another level of development expertise, and driving optimal data performance will be on their agendas.

Of course, all of this must be done with a focus on data governance – backed by data modeling – as the guiding principle for accurate, real-time analytics and business intelligence (BI).

Evolving Application Development Requirements

The development organization must identify which systems, processes and even jobs must evolve to meet demand. The factors it will consider include agile development, skills transformation and faster querying.

Rapid delivery is the rule, with products released in usable increments in sprints as part of ongoing, iterative development. Developers can move from conceptual models for defining high-level requirements to creating low-level physical data models to be incorporated directly into the application logic. This route facilitates dynamic change support to drive speedy baselining, fast-track sprint development cycles and quick application scaling. Logical modeling then follows.


Agile application development usually goes hand in hand with using NoSQL databases, so developers can take advantage of more pliable data models. This technology has a more dynamic and flexible schema design than relational databases, supports whatever data types and query options an application requires, and delivers the processing efficiency, scalability and performance that Big Data and new-age apps’ real-time requirements demand. However, NoSQL skills aren’t widespread, so specific tools for modeling unstructured data in NoSQL databases can help staff accustomed to RDBMSs ramp up.
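As a rough illustration of that schema flexibility, the Python sketch below shows two documents in the same logical collection carrying different fields. Plain dictionaries stand in for a real document database, and all field names are invented.

    # Two "customer" documents in the same logical collection. The second carries
    # fields the first does not, with no schema migration required. Plain dicts
    # keep this vendor-neutral; a document database driver would accept the same shapes.
    customers = [
        {"_id": 1, "name": "Acme Corp", "country": "US"},
        {"_id": 2, "name": "Globex", "country": "DE",
         "contacts": [{"name": "H. Farnsworth", "role": "CTO"}],  # nested array, no join table
         "loyalty_tier": "gold"},                                  # new attribute, added on the fly
    ]

    # A relational table would need an ALTER TABLE (or a separate contacts table) first;
    # here the application simply starts writing the richer shape.
    for c in customers:
        print(c["name"], c.get("loyalty_tier", "n/a"))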

Finally, the shift to agile development and NoSQL technology as part of more complex data architectures is driving another shift. Storage-optimized models are moving to the sidelines because a new approach is available to support real-time app development – one that understands what’s being asked of the data and enables schemas to be structured around application data access requirements for speedy responses to complex queries.
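The sketch below contrasts the two shapes in plain Python. It illustrates only the general idea of structuring a schema around the query rather than around storage; every name in it is hypothetical.

    # Storage-optimized (normalized) shape: answering "show an order with its lines
    # and customer name" requires joining three structures at query time.
    customers = {"C1": {"name": "Acme Corp"}}
    orders = {"O1": {"customer_id": "C1", "date": "2018-03-01"}}
    order_lines = [{"order_id": "O1", "sku": "SKU-9", "qty": 2}]

    # Query-optimized shape: the document is structured around the access pattern,
    # so the same question becomes a single key lookup. (Illustrative only.)
    orders_by_id = {
        "O1": {
            "date": "2018-03-01",
            "customer": {"id": "C1", "name": "Acme Corp"},   # embedded, denormalized
            "lines": [{"sku": "SKU-9", "qty": 2}],            # embedded array
        }
    }
    print(orders_by_id["O1"]["customer"]["name"], len(orders_by_id["O1"]["lines"]))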

The NoSQL Paradigm

erwin DM NoSQL takes into account all the requirements for the new application development era. In addition to its modeling tools, the solution includes patent-pending Query-Optimized Modeling™ that replaces storage-optimized modeling, giving users guidance to build schemas for optimal performance for NoSQL applications.

erwin DM NoSQL also embraces an “any-squared” approach to data management, so “any data” from “anywhere” can be visualized for greater understanding. And the solution now supports the Couchbase Data Platform in addition to MongoDB. Used in conjunction with erwin DG, businesses also can be assured that agility, speed and flexibility will not take precedence over the equally important need to stringently manage data.

With all this in place, enterprises will be positioned to deliver unique, real-time and responsive apps to enhance the customer experience and support new digital-transformation opportunities. At the same time, they’ll be able to preserve and extend the work they’ve already done in terms of maintaining well-governed data assets.

For more information about how to realize value from app development in the age of digital transformation with the help of data modeling and data governance, you can download our new e-book: Application Development Is New Again.


Pillars of Data Governance Readiness: Enterprise Data Management Methodology

Facebook’s data woes continue to dominate the headlines and further highlight the importance of having an enterprise-wide view of data assets. The high-profile case is somewhat different than other prominent data scandals as it wasn’t a “breach,” per se. But questions of negligence persist, and in all cases, data governance is an issue.

This week, the Wall Street Journal ran a story titled “Companies Should Beware Public’s Rising Anxiety Over Data.” It discusses an IBM poll of 10,000 consumers in which 78% of U.S. respondents say a company’s ability to keep their data private is extremely important, yet only 20% completely trust organizations they interact with to maintain data privacy. In fact, 60% indicate they’re more concerned about cybersecurity than a potential war.

The piece concludes with a clear lesson for CIOs: “they must make data governance and compliance with regulations such as the EU’s General Data Protection Regulation [GDPR] an even greater priority, keeping track of data and making sure that the corporation has the ability to monitor its use, and should the need arise, delete it.”

With a more thorough data governance initiative and a better understanding of data assets, their lineage and useful shelf-life, and the privileges behind their access, Facebook likely could have gotten ahead of the problem and quelled it before it became an issue. Sometimes erasure is the best approach if the reward from keeping data onboard is outweighed by the risk.

But perhaps Facebook is lucky the issue arose when it did. Once the GDPR goes into effect, this type of data snare would make the company non-compliant, as the regulation requires direct consent from the data subject (as well as notification within 72 hours if there is an actual breach).
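As a loose illustration of the kind of policy logic a data governance initiative can drive, here is a hedged Python sketch that flags records lacking consent or held past an assumed retention window. The record layout and threshold are hypothetical; this is not legal guidance or a description of any product feature.

    # A minimal consent/retention check of the kind a governance policy might drive.
    # Record layout and retention window are invented for illustration.
    from datetime import date, timedelta

    records = [
        {"subject": "user-17", "consented": True,  "collected": date(2018, 1, 10), "purpose": "billing"},
        {"subject": "user-42", "consented": False, "collected": date(2016, 5, 2),  "purpose": "analytics"},
    ]

    RETENTION = timedelta(days=365)  # illustrative shelf-life

    def flag_for_erasure(rec, today=date(2018, 4, 1)):
        # No valid consent, or data held past its useful shelf-life: candidate for erasure.
        return (not rec["consented"]) or (today - rec["collected"] > RETENTION)

    for rec in records:
        print(rec["subject"], "erase?", flag_for_erasure(rec))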

Five Pillars of DG: Enterprise Data Management Methodology

Considering GDPR, as well as the gargantuan PR fallout and governmental inquiries Facebook faced, companies can’t afford such data governance mistakes.

During the past few weeks, we’ve been exploring each of the five pillars of data governance readiness in detail and how they come together to provide a full view of an organization’s data assets. In this blog, we’ll look at enterprise data management methodology as the fourth key pillar.

Enterprise Data Management in Four Steps

Enterprise data management methodology addresses the need for data governance within the wider data management suite, with all components and solutions working together for maximum benefits.

A successful data governance initiative should both improve a business’ understanding of data lineage/history and install a working system of permissions to prevent access by the wrong people. On the flip side, successful data governance makes data more discoverable, with better context so the right people can make better use of it.

This is the nature of Data Governance 2.0 – helping organizations better understand their data assets and making them easier to manage and capitalize on – and it succeeds where Data Governance 1.0 stumbled.

Enterprise Data Management: So where do you start?

  1. Metadata management provides the organization with the contextual information concerning its data assets. Without it, data governance essentially runs blind.

The value of metadata management is the ability to govern common and reference data used across the organization with cross-departmental standards and definitions, allowing data sharing and reuse, reducing data redundancy and storage, avoiding data errors due to incorrect choices or duplications, and supporting data quality and analytics capabilities.

  2. Your organization also needs to understand enterprise data architecture and enterprise data modeling. Without them, enterprise data governance will be hard to support.

Enterprise data architecture supports data governance through concepts such as data movement, data transformation and data integration – since data governance develops policies and standards for these activities.

Data modeling, a vital component of data architecture, is also critical to data governance. By providing insights into the use cases satisfied by the data, organizations can do a better job of proactively analyzing the required shelf-life and better measure the risk/reward of keeping that data around.

Data stewards serve as SMEs in the development and refinement of data models and assist in the creation of data standards that are represented by data models. These artifacts allow your organization to achieve its business goals using enterprise data architecture.

  3. Let’s face it, most organizations implement data governance because they want high quality data. Enterprise data governance is foundational for the success of data quality management.

Data governance supports data quality efforts through the development of standard policies, practices, data standards, common definitions, etc. Data stewards implement these data standards and policies, supporting the data quality professionals.

These standards, policies, and practices lead to effective and sustainable data governance.

  4. Finally, without business intelligence (BI) and analytics, data governance will not add any value. The value of data governance to BI and analytics is the ability to govern data from its sources to destinations in warehouses/marts, define standards for data across those stages, and promote common algorithms and calculations where appropriate. These benefits allow the organization to achieve its business goals with BI and analytics.

Gaining an EDGE on the Competition

Old-school data governance is one-sided, mainly concerned with cataloging data to support search and discovery. The lack of short-term value here often caused executive support to dwindle, so the task of DG was siloed within IT.

These issues are circumvented by using the collaborative Data Governance 2.0 approach, spreading the responsibility of DG among those who use the data. This means that data assets are recorded with more context and are of greater use to an organization.

It also means executive-level employees are more aware of how data governance is working because they’re involved in it, and they can see the revenue potential in optimized data analysis and the resulting improvements in time to market.

We refer to this enterprise-wide, collaborative, 2.0 take on data governance as the enterprise data governance experience (EDGE). But organizational collaboration aside, the real EDGE is arguably the collaboration it facilitates between solutions. The EDGE platform recognizes the fundamental reliance data governance has on the enterprise data management methodology suite and unifies them.

By existing on one platform, and sharing one repository, organizations can guarantee their data is uniform across the organization, regardless of department.

Additionally, it drastically improves workflows by allowing for real-time updates across the platform. For example, a change to a term in the data dictionary (data governance) will be automatically reflected in all connected data models (data modeling).
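The toy Python sketch below illustrates that single-repository idea with a simple publish/subscribe pattern. It is a conceptual sketch, not how the erwin platform is actually implemented, and the class and term names are invented.

    # Data models register an interest in glossary terms, so renaming a term
    # updates every bound attribute. A sketch of the concept only.
    class Glossary:
        def __init__(self):
            self.terms = {}          # term id -> display name
            self.subscribers = []    # data models sharing the repository

        def rename(self, term_id, new_name):
            self.terms[term_id] = new_name
            for model in self.subscribers:
                model.on_term_renamed(term_id, new_name)

    class DataModel:
        def __init__(self, name, glossary):
            self.name = name
            self.bindings = {}       # attribute -> term id
            glossary.subscribers.append(self)

        def on_term_renamed(self, term_id, new_name):
            for attr, bound in self.bindings.items():
                if bound == term_id:
                    print(f"{self.name}: attribute '{attr}' now labeled '{new_name}'")

    glossary = Glossary()
    glossary.terms["T1"] = "Customer"
    crm_model = DataModel("CRM logical model", glossary)
    crm_model.bindings["party_nm"] = "T1"
    glossary.rename("T1", "Client")   # propagates to the connected model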

Further, the EDGE integrates enterprise architecture to define application capabilities and interdependencies within the context of their connection to enterprise strategy, enabling technology investments to be prioritized in line with business goals.

Business process also is included so enterprises can clearly define, map and analyze workflows and build models to drive process improvement, as well as identify business practices susceptible to the greatest security, compliance or other risks and where controls are most needed to mitigate exposures.

Essentially, it’s the approach data governance needs to become a value-adding strategic initiative instead of an isolated effort that peters out.

To learn more about enterprise data management and getting an EDGE on GDPR and the competition, click here.

To assess your data governance readiness ahead of the GDPR, click here.



An Agile Data Governance Foundation for Building the Data-Driven Enterprise

The data-driven enterprise is the cornerstone of modern business, and good data governance is a key enabler.

In recent years, we’ve seen startups leverage data to catapult themselves ahead of legacy competitors. Companies such as Airbnb, Netflix and Uber have become household names. Although the service each offers differs vastly, all three identify as ‘technology’ organizations because data is integral to their operations.


As with any standard-setting revolution, businesses across the spectrum are now following these examples. But what these organizations need to understand is that simply deciding to be data-driven, or to “do Big Data,” isn’t enough.

As with any strategy or business model, it’s advisable to apply best practices to ensure the endeavor is worthwhile and that it operates as efficiently as possible. In fact, it’s especially important with data, as poorly governed data will lead to slower times to market and oversights in security. Additionally, poorly managed data fosters inaccurate analysis and poor decision-making, further hampering times to market due to inaccuracy in the planning stages, false starts and wasted cycles.

Essentially garbage in, garbage out – so it’s important for businesses to get their foundations right. To build something, you need to know exactly what you’re building and why to understand the best way to progress.

Data Governance 2.0 Is the Underlying Factor

Good data governance (DG) enables every relevant stakeholder – from executives to frontline employees – to discover, understand, govern and socialize data. Then the right people have access to the right data, so the right decisions are easier to make.

Traditionally, DG encompassed governance goals such as maintaining a business glossary of data terms, a data dictionary and catalog. It also enabled lineage mapping and policy authoring.

However, Data Governance 1.0 was siloed with IT left to handle it. Often there were gaps in context, the chain of accountability and the analysis itself.

Data Governance 2.0 remedies this by taking into account the fact that data now permeates all levels of a business. And it allows for greater collaboration.

It gives people interacting with data the required context to make good decisions, and documents the data’s journey, ensuring accountability and compliance with existing and upcoming data regulations.

But beyond the greater collaboration it fosters between people, it also allows for better collaboration between departments and integration with other technology.

By integrating data governance with data modeling (DM), enterprise architecture (EA) and business process (BP), organizations can break down inter-departmental and technical silos for greater visibility and control across domains.

By leveraging a common metadata repository and intuitive role-based and highly configurable user interfaces, organizations can guarantee everyone is singing off the same sheet of music.

Data Governance Enables Better Data Management

The collaborative nature of Data Governance 2.0 is a key enabler for strong data management. Without it, the differing data management initiatives can and often do pull in different directions.

These silos are usually born out of the use of disparate tools that don’t enable collaboration between the relevant roles responsible for the individual data management initiative. This stifles the potential of data analysis, something organizations can’t afford given today’s market conditions.

Businesses operating in highly competitive markets need every advantage: growth, innovation and differentiation. Organizations also need a complete data platform as the rise of data’s involvement in business and subsequent frequent tech advancements mean market landscapes are changing faster than ever before.

By integrating DM, EA and BP, organizations ensure all three initiatives are in sync. Then historically common issues born of siloed data management initiatives don’t arise.

A unified approach, with Data Governance 2.0 at its core, allows organizations to:

  • Enable data fluency and accountability across diverse stakeholders
  • Standardize and harmonize diverse data management platforms and technologies
  • Satisfy compliance and legislative requirements
  • Reduce risks associated with data-driven business transformation
  • Enable enterprise agility and efficiency in data usage.



Data Modeling in a Jargon-filled World – The Logical Data Warehouse

There’s debate surrounding the term “logical data warehouse.” Some argue that it is a new concept, while others argue that all well-designed data warehouses are logical and so the term is meaningless. This is a key point I’ll address in this post.

I’ll also discuss data warehousing that incorporates some of the technologies and approaches we’ve covered in previous installments of this series (1, 2, 3, 4, 5, 6) but with a different architecture that embraces “any data, anywhere.”

So what is a “logical data warehouse?”

Bill Inmon and Barry Devlin provide two oft-quoted definitions of a “data warehouse.” Inmon says “a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”

Devlin stripped down the definition, saying “a data warehouse is simply a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.”

Although these definitions are widely adopted, there is some disparity in their interpretation. Some insist that such definitions imply a single repository, and thus a limitation.

On the other hand, some argue that a “collection of data” or a “single, complete and consistent store” could just as easily be virtual and therefore not inherently singular. They argue that the language simply reflects the fact that most early implementations were single, physical data stores due to technology limitations.

Mark Beyer of Gartner is a prominent name in the former, singular repository camp. In 2011, he said “the logical data warehouse (LDW) is a new data management architecture for analytics which combines the strengths of traditional repository warehouses with alternative data management and access strategy,” and the work has since been widely circulated.

So proponents of the “logical data warehouse,” as defined by Mark Beyer, don’t disagree with the value of an integrated collection of data. They just feel that if said collection is managed and accessed as something other than a monolithic, single physical database, then it is something different and should be called a “logical data warehouse” instead of just a “data warehouse.”

As the author of a series of posts about a jargon-filled [data] world, who am I to argue with the introduction of more new jargon?

In fact, I’d be remiss if I didn’t point out that the notion of a logical data warehouse has numerous jargon-rich enabling technologies and synonyms, including Service Oriented Architecture (SOA), Enterprise Services Bus (ESB), Virtualization Layer, and Data Fabric, though the latter term also has other unrelated uses.

So the essence of a logical data warehouse approach is to integrate diverse data assets into a single, integrated virtual data warehouse, without the traditional batch ETL or ELT processes required to copy data into a single, integrated physical data warehouse.

One of the key attractions to proponents of the approach is the avoidance of recurring batch extraction, transformation and loading activities that, typically argued, cause delays and lead to decisions being made based on data that is not as current as it could be.

The idea is to use caching and other technologies to create a virtualization layer that enables information consumers to ask a question as though they were interrogating a single, integrated physical data warehouse, and to have the virtualization layer respond correctly with more current data, without having to extract, transform and load data into a centralized physical store. That virtualization layer, together with the data resident in some combination of underlying application systems, IoT data streams, external data sources, blockchains, data lakes, data warehouses and data marts, constitutes the logical data warehouse.
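Here is a deliberately tiny Python sketch of that idea: one question is answered by querying two live sources at request time and merging the results, rather than copying everything into a central store first. The sources, table names and values are all invented for illustration.

    # A toy "virtualization layer": the answer is assembled from live sources
    # at request time instead of from a pre-loaded central store.
    import sqlite3

    # Source 1: an operational system exposed through SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("Acme Corp", 120.0), ("Globex", 80.0), ("Acme Corp", 40.0)])

    # Source 2: a stream/API feed, represented here as a list of recent events.
    recent_events = [{"customer": "Acme Corp", "amount": 15.0}]

    def revenue_by_customer():
        totals = {}
        for customer, amount in conn.execute(
                "SELECT customer, SUM(amount) FROM orders GROUP BY customer"):
            totals[customer] = totals.get(customer, 0.0) + amount
        for event in recent_events:              # fold in the most current data
            totals[event["customer"]] = totals.get(event["customer"], 0.0) + event["amount"]
        return totals

    print(revenue_by_customer())   # {'Acme Corp': 175.0, 'Globex': 80.0}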


While the moniker may be new, the idea of bringing the query to the data set(s) and then assembling an integrated result is not a new idea. There have been numerous successful implementations in the past, though they often required custom coding and rigorous governance to ensure response times and logical correctness.

Some would argue that such previous implementations were also not at the leading edge of data warehousing in terms of data volume or scope.

What is generating renewed interest in this approach is the continued frustration on the part of numerous stakeholders with delays attributed to ETL/ELT in traditional data warehouse implementations.

When you compound this with the often high costs of large (physical) data warehouse implementations – especially those based on MPP hardware – and juxtapose them against the promise of new solutions from vendors like Denodo and Cisco that capitalize on increasingly prevalent technologies such as the cloud and in-memory, it’s not hard to see why.

One topic that quickly becomes clear as one learns more about the various logical data warehouse vendor solutions is that metadata is a very important component. However, this shouldn’t be a surprise, as the objective is still to present a single, integrated view to the information consumer.

So a well-architected, comprehensive and easily understood data model is as important as ever. It ensures that information consumers can easily access properly integrated data, and the virtualization technology itself depends on a properly architected data model to accurately transform an information request into queries against multiple data sources and then correctly synthesize the result sets into an appropriate response to the original information request.
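As a minimal, hypothetical illustration of that dependency, the Python sketch below maps logical attributes to the physical column each source actually uses and generates source-specific queries from that mapping. The attribute and source names are assumptions, not any vendor's metadata format.

    # A shared logical attribute is mapped to the physical column each source uses,
    # and per-source queries are generated from that mapping. All names are invented.
    LOGICAL_MODEL = {
        "customer_name": {"crm_db": "cust_nm", "billing_api": "customerName"},
        "order_total":   {"crm_db": "ord_total_usd", "billing_api": "grandTotal"},
    }

    def source_query(source, logical_attrs, table="orders"):
        cols = ", ".join(LOGICAL_MODEL[a][source] for a in logical_attrs)
        return f"SELECT {cols} FROM {table}"

    print(source_query("crm_db", ["customer_name", "order_total"]))
    # SELECT cust_nm, ord_total_usd FROM orders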

We hope you’ve enjoyed our series, Data Modeling in a Jargon-filled World, learning something from this post or one of the previous posts in the series (1, 2, 3, 4, 5, 6).

The underlying theme, as you’ve probably deduced, is that data modeling remains critical in a world in which the volume, variety and velocity of data continue to grow while information consumers find it difficult to synthesize the right data in the right context to help them draw the right conclusions.

We encourage you to read other blog posts on this site by erwin staff members and other guest bloggers and to participate in ongoing events and webinars.

If you’d like to know more about accelerating your data modeling efforts for specific industries, while reducing risk and benefiting from best practices and lessons learned by other similar organizations in your industry, please visit erwin partner ADRM Software.



Data Modeling in a Jargon-filled World – In-memory Databases

With the volume and velocity of data increasing, in-memory databases provide a way to keep query response times low.

Traditionally, databases have stored their data on mechanical storage media such as hard disks. While this has contributed to durability, it’s also constrained attainable query speeds. Database and software designers have long realized this limitation and sought ways to harness the faster speeds of in-memory processing.

The traditional approach to database design – and analytics solutions to access them – includes in-memory caching, which retains a subset of recently accessed data in memory for fast access. While caching often worked well for online transaction processing (OLTP), it was not optimal for analytics and business intelligence. In these cases, the most frequently accessed information – rather than the most recently accessed information – is typically of most interest.
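To see the distinction, here is a minimal recency-based (LRU) cache in Python; an analytics-oriented cache would more likely evict by frequency of access. This is a teaching sketch, not any vendor's caching implementation.

    # A minimal recency-based (LRU) cache of the kind OLTP-oriented caching relies on.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity=3):
            self.capacity = capacity
            self.items = OrderedDict()

        def get(self, key):
            if key not in self.items:
                return None
            self.items.move_to_end(key)          # mark as most recently used
            return self.items[key]

        def put(self, key, value):
            self.items[key] = value
            self.items.move_to_end(key)
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)   # evict the least recently used entry

    cache = LRUCache(capacity=2)
    cache.put("q1", "result set 1")
    cache.put("q2", "result set 2")
    cache.get("q1")                 # touch q1, so q2 becomes the eviction candidate
    cache.put("q3", "result set 3") # evicts q2, even if q2 were the most frequently used
    print(list(cache.items))        # ['q1', 'q3']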

That said, loading an entire data warehouse or even a large data mart into memory has been challenging until recent years.


There are a few key factors making in-memory databases and analytics offerings relevant for more and more use cases. One has been the shift to 64-bit operating systems, which makes much more addressable memory available. And as one might assume, the availability of increasingly large and affordable memory solutions has also played a part.

Database and software developers have begun to take advantage of in-memory databases in a myriad of ways. These include the many key-value stores such as Amazon DynamoDB, which provide very low latency for IoT and a host of other use cases.

Businesses are also taking advantage of in-memory through a range of other databases, from distributed in-memory NoSQL databases such as Aerospike to in-memory NewSQL databases such as VoltDB. However, for the remainder of this post, we’ll touch in more detail on several solutions with which you might be more familiar.

Some database vendors have chosen to build hybrid solutions that incorporate in-memory technologies. They aim to bridge in-memory with solutions based on tried-and-true, disk-based RDBMS technologies. Such vendors include Microsoft with its incorporation of xVelocity into SQL Server, Analysis Services and PowerPivot, and Teradata with its Intelligent Memory.

Other vendors, like IBM with its dashDB database, have chosen to deploy in-memory technology in the cloud, while capitalizing on previously developed or acquired technologies (in-database analytics from Netezza in the case of dashDB).

However, probably the most high-profile application of in-memory technology has been SAP’s significant bet on its HANA in-memory database, which first shipped in late 2010. SAP has since made it available in the cloud through its SAP HANA Cloud Platform and on Microsoft Azure, and it has released a comprehensive application suite called S/4HANA.

Like most of the analytics-focused in-memory databases and analytics tools, HANA stores data in a column-oriented, in-memory database. The primary rationale for taking a column-oriented approach to storing data in memory is that in analytic use cases, where data is queried but not updated, it allows for often very impressive compression of data values in each column. This means much less memory is used, resulting in even higher throughput and less need for expensive memory.
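A simple way to see why column orientation compresses so well: values within a single column repeat heavily, so dictionary and run-length encoding shrink them dramatically. The Python sketch below is illustrative only; engines such as HANA use far more sophisticated schemes.

    # Values within one column repeat heavily, so dictionary + run-length encoding
    # shrinks them dramatically. Toy example only.
    from itertools import groupby

    country_column = ["DE", "DE", "DE", "US", "US", "DE", "DE", "DE", "DE", "FR"]

    # Dictionary encoding: store each distinct value once, keep small integer codes.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(country_column))}
    codes = [dictionary[v] for v in country_column]

    # Run-length encoding of the codes: (code, run length) pairs.
    rle = [(code, len(list(run))) for code, run in groupby(codes)]

    print(dictionary)   # {'DE': 0, 'US': 1, 'FR': 2}
    print(rle)          # [(0, 3), (1, 2), (0, 4), (2, 1)] -- 4 pairs instead of 10 values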

So what approach should a data architect adopt? Are Microsoft, Teradata and other “traditional” RDBMS vendors correct with their hybrid approach?

As memory gets cheaper by the day, and the value of rapid insights increases by the minute, should we host the whole data warehouse or data mart in-memory as with vendors SAP and IBM?

It depends on the specific use case, data volumes, business requirements, budget, etc. One thing that is not in dispute is that all the major vendors recognize that in-memory technology adds value to their solutions. And that extends beyond the database vendors to analytics tool stalwarts like Tableau and newer arrivals like Yellowfin.

It is incumbent upon enterprise architects to learn about the relative merits of the different approaches championed by the various vendors and to select the best fit for their specific situation. This is admittedly not easy, given the pace of adoption of in-memory databases and the variety of approaches being taken.

But there’s a silver lining to the creative disruption caused by the increasing adoption of in-memory technologies. Because of the sheer speed these solutions offer, many organizations are finding that the need to pre-aggregate data to achieve certain performance targets for specific analytics workloads is disappearing. The same goes for the need to de-normalize database designs to achieve specific analytics performance targets.

Instead, organizations are finding that it’s more important to create comprehensive atomic data models that are flexible and independent of any assumed analytics workload.

Perhaps surprisingly to some, third normal form (3NF) is once again not an unreasonable standard of data modeling for modelers who plan to deploy to a pure in-memory or in-memory-augmented platform.

Organizations can forgo the time-consuming effort to model and transform data to support specific analytics workloads, which are likely to change over time anyway. They also can stop worrying about de-normalizing and tuning an RDBMS for those same fickle and variable analytics workloads, focusing instead on creating a logical data model of the business that reflects business information requirements and relationships in a flexible and detailed format – one that doesn’t assume specific aggregations and transformations.

The blinding speed of in-memory technologies provides the aggregations, joins and other transformations on the fly, without the onerous performance penalties we have historically experienced with very large data volumes on disk-only-based solutions. As a long-time data modeler, I like the sound of that. And so far in my experience with many of the solutions mentioned in this post, the business people like the blinding speed and flexibility of these new in-memory technologies!
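Here is a minimal sketch of that on-the-fly approach, using a small normalized schema in SQLite. The tables and figures are invented; the point is only that joins and aggregations are expressed at query time rather than baked into the model.

    # Keep the model normalized and let the engine aggregate at query time,
    # instead of maintaining a pre-aggregated summary table for one assumed workload.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customer VALUES (1, 'EMEA'), (2, 'AMER');
        INSERT INTO sale VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
    """)

    # The join and aggregation happen on the fly; if tomorrow the question changes
    # to "by customer" instead of "by region", no remodeling is needed.
    for region, total in conn.execute("""
            SELECT c.region, SUM(s.amount)
            FROM sale s JOIN customer c ON c.customer_id = s.customer_id
            GROUP BY c.region"""):
        print(region, total)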

Please join us next time for the final installment of our series, Data Modeling in a Jargon-filled World – The Logical Data Warehouse. We’ll discuss an approach to data warehousing that uses some of the technologies and approaches we’ve discussed in the previous six installments while embracing “any data, anywhere.”


Data Modeling in a Jargon-filled World – The Cloud

There’s no escaping data’s role in the cloud, and so it’s crucial that we analyze the cloud’s impact on data modeling. 


Data Modeling is Changing – Time to Make NoSQL Technology a Priority

As the amount of data enterprises are tasked with managing increases, the benefits of NoSQL technology are becoming more apparent. 


Data Modeling in a Jargon-filled World – Managed Data Lakes

More and more businesses are adopting managed data lakes.

Earlier in this blog series, we established that leading organizations are adopting a variety of approaches to manage data, including data that may be sourced from a wide range of NoSQL, NewSQL, RDBMS and unstructured sources.

In this post, we’ll discuss managed data lakes and their applications as a hybrid of less structured data and more traditionally structured relational data. We’ll also talk about whether there’s still a need for data modeling and metadata management.

The term Data Lake was first coined by James Dixon of Pentaho in a blog entry in which he said:

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Use of the term quickly took on a life of its own with often divergent meanings. So much so that four years later Mr. Dixon felt compelled to refute some criticisms by the analyst community by pointing out that they were objecting to things he actually never said about data lakes.

However, in my experience and despite Mr. Dixon’s objections, the notion that a data lake can contain data from more than one source is now widely accepted.

Similarly, while most early data lake implementations used Hadoop with many vendors pitching the idea that a data lake had to be implemented as a Hadoop data store, the notion that data lakes can be implemented on non-Hadoop platforms, such as Azure Blob storage or Amazon S3, has become increasingly widespread.

So a data lake – as the term is widely used in 2017 – is a detailed (non-aggregated) data store that can contain structured and/or non-structured data from more than one source implemented on some kind of inexpensive, massively scalable storage platform.

But what are “managed data lakes?”

To answer that question, let’s first touch on why many early data lake projects failed or significantly missed expectations. Criticisms were quick to arise, many of which were critiques of data lakes when they strayed from the original vision, as established earlier.

Vendors seized on data lakes as a marketing tool, and as often happens in our industry, they promised it could do almost anything. As long as you poured your data into the lake, people in the organization would somehow magically find exactly the data they needed just when they needed it. As is usually the case, it turned out that for most organizations, their reality was quite different. And for three important reasons:

  1. Most large organizations’ analysts didn’t have the skillsets to wade through the rapidly accumulating pool of information in Hadoop or whichever new platforms had been chosen to implement their data lakes to locate the data they needed.
  2. Not enough attention was paid to the need to provide metadata to help people find the data they needed.
  3. Most interesting analytics are a result of integrating disparate data points to draw conclusions, and integration had not been an area of focus in most data lake implementations.

In the face of growing disenchantment with data lake implementations, some organizations and vendors pivoted to address these drawbacks. They did so by embracing what is most commonly called a managed data lake, though some prefer the label “curated data lake” or “modern data warehouse.”

The idea is to address the three criticisms mentioned above by developing an architectural approach that allows for the use of SQL, making data more accessible and providing more metadata about the data available in the data lake. It also takes on some of the challenging work of integration and transformation that earlier data lake implementations had hoped to kick down the road or avoid entirely.

The result in most implementations of a managed data lake is a hybrid that tries to blend the strengths of the original data lake concept with the strengths of traditional large-scale data warehousing (as opposed to the narrow data mart approach Mr. Dixon used as a foil when originally describing data lakes).

Incoming data, either structured or unstructured, can be easily and quickly loaded from many different sources (e.g., applications, IoT, third parties, etc.). The data can be accumulated with minimal processing at reasonable cost using a bulk storage platform such as Hadoop, Azure Blob storage or Amazon S3.

Then the data that is widely used within the organization can be integrated and made available through a SQL or SQL-like interface, ranging from Hive to Postgres to a tried-and-true commercial relational database such as SQL Server (or its cloud-based cousin, Azure SQL Data Warehouse).

In this scenario, a handful of self-sufficient data scientists may wade (or swim or dive) in the surrounding data lake. However, most analysts in most organizations still spend most of their time using familiar SQL-capable tools to analyze data stored in the core of the managed data lake – an island in the lake if we really want to torture the analogy – which is typically implemented either using an RDBMS or a relational layer like Hive on top of the bulk-storage layer.
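The toy Python sketch below walks through that hybrid flow using standard-library stand-ins – a plain list for the raw zone and SQLite for the relational core – rather than any vendor's actual architecture. The event shapes are invented.

    # A toy managed-data-lake flow: raw events land with minimal processing, then a
    # curated subset is integrated into a relational core and queried through SQL.
    import json, sqlite3

    # 1. Raw zone: events accumulate as-is, schema-on-read.
    raw_zone = [
        json.dumps({"type": "pageview", "user": "u1", "ts": "2017-11-01T10:00:00"}),
        json.dumps({"type": "purchase", "user": "u1", "ts": "2017-11-01T10:05:00", "amount": 19.99}),
        json.dumps({"type": "purchase", "user": "u2", "ts": "2017-11-01T11:00:00", "amount": 5.00}),
    ]

    # 2. Curated core: only the widely used, integrated slice is modeled relationally.
    core = sqlite3.connect(":memory:")
    core.execute("CREATE TABLE purchases (user_id TEXT, ts TEXT, amount REAL)")
    for line in raw_zone:
        event = json.loads(line)
        if event["type"] == "purchase":
            core.execute("INSERT INTO purchases VALUES (?, ?, ?)",
                         (event["user"], event["ts"], event["amount"]))

    # 3. Analysts keep using familiar SQL against the core...
    print(core.execute("SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id").fetchall())
    # ...while data scientists can still wade through the raw zone directly.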

It’s important to note that these are not two discrete silos. Most major vendors have added capabilities to their database and BI offerings to enable analysis of both RDBMS-based and bulk-storage layer data through a familiar SQL interface.

This enables a much larger percentage of an organization’s analysts to access data both in the core and the less structured surrounding lake, using tools with which they’re already familiar.

As this hybrid managed data lake approach incorporates a relational core, robust data modeling capabilities are as important as ever. The same goes for data governance and a thorough focus on metadata to provide clear naming and definitions to assist in finding and linking with the most appropriate data.

This is true whether inside the structured relational core of the managed data lake or in the surrounding, more fluid data lake.

As you probably guessed from some of the links in this post, more and more managed data lakes are being implemented in the cloud. Please join us next time for the fifth installment in our series: Data Modeling in a Jargon-filled World – The Cloud.


NoSQL Database Adoption Is Poised to Explode

NoSQL database technology is gaining a lot of traction across industry. So what is it, and why is it increasing in use?

Techopedia defines NoSQL as “a class of database management systems (DBMS) that do not follow all of the rules of a relational DBMS and cannot use traditional SQL to query data.”

The rise of the NoSQL database

The rise of NoSQL can be attributed to the limitations of its predecessor. SQL databases were not conceived with today’s vast amount of data and storage requirements in mind.

Businesses, especially those with digital business models, are choosing to adopt NoSQL to help manage “the three Vs” of Big Data: increased volume, variety and velocity. Velocity in particular is driving NoSQL adoption because of the inevitable bottlenecks of SQL’s sequential data processing.

MongoDB, the fastest-growing supplier of NoSQL databases, notes this when comparing the traditional SQL relational database with the NoSQL database, saying “relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today.”

With all this in mind, we can see why the NoSQL database market is expected to reach $4.2 billion in value by 2020.

What’s next and why?

We can expect the adoption of NoSQL databases to continue growing, in large part because of Big Data’s continued growth.

And analysis indicates that data-driven decision-making improves productivity and profitability by 6%.

Businesses across industry appear to be picking up on this fact. An EY/Nimbus Ninety study found that 81% of companies understand the importance of data for improving efficiency and business performance.

However, understanding the importance of data to modern business isn’t enough. What 100% of organizations need to grasp is that strategic data analysis that produces useful insights has to start from a stable data management platform.

Gartner indicates that 90% of all data is unstructured, highlighting the need for dedicated data modeling efforts, and at a wider level, data management. Businesses can’t leave that 90% on the table because they don’t have the tools to properly manage it.

This is the crux of the Any² data management approach – being able to manage “any data” from “anywhere.” NoSQL plays an important role in end-to-end data management by helping to accelerate the retrieval and analysis of Big Data.

The improved handling of data velocity is vital to becoming a successful digital business, one that can effectively respond in real time to customers, partners, suppliers and other parties, and profit from these efforts.

In fact, the velocity with which businesses are able to harness and query large volumes of unstructured, structured and semi-structured data in NoSQL databases makes them a critical asset for supporting modern cloud applications and their scale, speed and agile development demands.

For more data advice and best practices, follow us on Twitter and LinkedIn to stay up to date with the blog.

For a deeper dive into Taking Control of NoSQL Databases, get the FREE eBook below.

Benefits of NoSQL