12 December 2020
Data Vault 2.0: the good, the bad and the downright confusing
The Data Vault 2.0 methodology was designed to support the notion of an “agile” data warehouse. This accommodates a constantly evolving view of enterprise data by supporting easier change, enabling high-volume load processes, and providing audit capabilities out of the box.
Taken in its entirety, Data Vault 2.0 offers a wide-ranging architecture that encompasses metadata management, data audit, data load, and master data management. At its core is a modelling technique that separates structural information from attributes by arranging data into one of three types of table: hubs (business entities), links (relationships) and satellites (attributes).
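To make the three table types concrete, here is a minimal sketch using an entirely hypothetical customer/order example (the table and column names are illustrative, not prescribed by the methodology; real implementations carry more audit columns and are usually generated by tooling):

```python
import sqlite3

# Hypothetical sketch of the three Data Vault table types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per business entity, keyed on a hash of the business key
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,   -- hash key
    customer_id   TEXT NOT NULL,      -- business key from the source
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_id      TEXT NOT NULL,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: a relationship between hubs, again hash-keyed
CREATE TABLE link_customer_order (
    customer_order_hk TEXT PRIMARY KEY,
    customer_hk       TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    order_hk          TEXT NOT NULL REFERENCES hub_order(order_hk),
    load_date         TEXT NOT NULL,
    record_source     TEXT NOT NULL
);

-- Satellite: descriptive attributes, versioned by load date
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note how structure (hubs and links) is kept entirely separate from attributes (satellites): adding a new attribute or source later means adding a new satellite rather than altering existing tables.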
This model provides a malleable layer that sits between staging and reporting layers in a warehouse architecture. It replaces the “enterprise data warehouse” as the primary data store for an organisation’s highly valued, strongly governed data.
Deferring design decisions
This convention-driven approach to modelling makes it easier to perform common warehouse design tasks such as adding attributes, refactoring entity relationships, incorporating new sources and deprecating older ones. In a vault model these all become additive tasks that build on existing structures rather than requiring changes to them.
A vault model also maintains an important separation between “raw” ingested data and the “business” data that has domain specific rules applied to it, such as naming conventions or common taxonomies. This abstraction allows you to defer many modelling decisions until after data has been ingested as the structure of new data can largely be derived from the source system.
This can be very powerful, especially in an environment where you frequently need to ingest new data sources. This type of change can be very expensive in an enterprise data warehouse and the process of analysing new sources, relating them to an existing model and figuring out the impact on downstream transformations can be protracted.
This does not completely negate the need for modelling decisions, but it gives you a lot more flexibility over when and how you make them. It allows for more iteration and experimentation, as you can make changes to entities and relationships without them cascading across the entire model.
Audit, performance, and automation
Apart from design flexibility, data vault modelling also offers solutions for practical concerns such as auditing, data provenance and loading speed.
The approach bakes record sources into every entity, relationship and attribute, allowing you to track where each data point came from. The additive nature of the model means you should be able to retrieve the state of your data as it was understood at any point in time. This is extremely powerful in regulated environments or where you have onerous data retention requirements.
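Because satellites are insert-only and versioned by load date, recovering an "as at" view is a matter of filtering to the newest row on or before the date of interest. A minimal sketch, using hypothetical table and column names:

```python
import sqlite3

# Hypothetical sketch: a satellite accumulates versions of an attribute,
# so historical state is recovered by filtering on load_date.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_date TEXT, email TEXT, record_source TEXT)""")
conn.executemany(
    "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)",
    [("hc1", "2020-01-01", "old@example.com", "crm"),
     ("hc1", "2020-06-01", "new@example.com", "crm")])

def email_as_at(as_of: str) -> str:
    """Return the email as it was understood on the given date."""
    row = conn.execute("""
        SELECT email FROM sat_customer_details
        WHERE customer_hk = 'hc1' AND load_date <= ?
        ORDER BY load_date DESC LIMIT 1""", (as_of,)).fetchone()
    return row[0]

print(email_as_at("2020-03-15"))  # old@example.com
print(email_as_at("2020-07-01"))  # new@example.com
```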
The use of hash-based keys for every table allows you to load data without dependencies, so you can support high-volume or parallel loads if required. Audit information can also be included in the design of every item, allowing you to trace information such as load time and load duration.
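The hash keys that make this possible are typically a digest of the normalised business key, so any load process can derive the same key independently without a lookup. A sketch in Python, assuming an MD5 digest and an upper-case/trim normalisation convention (conventions vary between implementations, which is exactly why they need enforcing):

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys.

    Keys are trimmed and upper-cased before hashing so the same logical
    key always maps to the same hash, whichever system it arrived from.
    The delimiter guards against collisions such as ("AB","C") vs ("A","BC").
    """
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# The same logical key hashes identically regardless of formatting, so
# independent, parallel load processes need no coordination or lookups.
print(hash_key("cust-001 ") == hash_key("CUST-001"))  # True
```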
If you want to leverage these benefits, then you will need to apply iron discipline to ensure that table design and hashing conventions are properly enforced. This requires a high level of automation to ensure a consistent implementation across the many hundreds of vault tables that you will end up creating. Without this you will be extremely vulnerable to human error or development memory loss over time.
Note that the tooling options here are relatively limited. The most widely known tooling is dbtvault, a set of extensions built on top of dbt (data build tool). This remains a work in progress so you will inevitably need to roll your own automation stack to some degree.
Although Data Vault lowers the price of change, it does not make it free. One of the trade-offs with vault modelling is that you will have to learn to deal with escalating internal complexity.
The basic concepts of hubs, satellites and links are easy enough, but there is a lot more to wrap your head around. You need to appreciate the difference between the “raw” vault of “hard” rules and the “business” vault of “soft” rules. The vault model can be augmented with reference tables or query assistant tables, while the architecture can be extended with operational and error marts.
The sheer number of tables involved in a vault database can be intimidating – there will be at least three times as many tables in a vault as in an equivalent relational model. This can give rise to some unwieldy and complex join conditions that can make a vault difficult to query. Structural data changes can exacerbate join complexity over time as you navigate an increasingly arcane collection of tables.
In operational terms, this is where Data Vault 2.0 starts to suffer in comparison to relational or dimensional models. It is difficult to understand and awkward to query. It offers some patterns to address query complexity, such as “bridge” or “point-in-time” tables, but a model will always be impenetrable to some audiences.
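To make the join overhead concrete: retrieving even a simple "customer with order total" result means traversing hubs, a link and satellites, where a relational model would join two tables. A hypothetical sketch (table names are illustrative):

```python
import sqlite3

# Hypothetical illustration of vault join depth. Fetching one customer's
# name alongside an order total walks hub -> link -> hub -> satellites.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_id TEXT);
CREATE TABLE link_customer_order (customer_hk TEXT, order_hk TEXT);
CREATE TABLE sat_customer_details (customer_hk TEXT, load_date TEXT, name TEXT);
CREATE TABLE sat_order_details    (order_hk TEXT, load_date TEXT, total REAL);

INSERT INTO hub_customer VALUES ('hc1', 'CUST-001');
INSERT INTO hub_order    VALUES ('ho1', 'ORD-9');
INSERT INTO link_customer_order  VALUES ('hc1', 'ho1');
INSERT INTO sat_customer_details VALUES ('hc1', '2020-01-01', 'Ada');
INSERT INTO sat_order_details    VALUES ('ho1', '2020-01-01', 42.0);
""")

# Five tables joined to produce one row; an equivalent relational model
# would typically express this as customers JOIN orders.
rows = conn.execute("""
    SELECT sc.name, so.total
    FROM hub_customer hc
    JOIN link_customer_order  l  ON l.customer_hk  = hc.customer_hk
    JOIN hub_order            ho ON ho.order_hk    = l.order_hk
    JOIN sat_customer_details sc ON sc.customer_hk = hc.customer_hk
    JOIN sat_order_details    so ON so.order_hk    = ho.order_hk
""").fetchall()
print(rows)  # [('Ada', 42.0)]
```

Bridge and point-in-time tables reduce this burden by pre-computing the joins, but they are themselves more tables to build, automate and explain.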
Data Vault 2.0 should be treated with caution. You should not try to implement it without training and consultancy support from people with some practical experience. There are many traps for the unwary that are not immediately obvious to the newcomer. Having an experienced modeller to hold your hand, at least for the early stages, will be invaluable.
Complexity and the “explanation tax”
Something that may not be immediately obvious is the sheer “explanation tax” involved in adopting Data Vault 2.0. You will spend a lot of time explaining the nuances to stakeholders, both technical and non-technical. It is easy to underestimate just how much of a burden this on-going explanation and justification can be.
Data Vault 2.0 is regarded as an outlier compared with more established methodologies such as Kimball and Inmon. The community around it is relatively small (though helpful). It lacks the wide acceptance enjoyed by relational and dimensional modelling. It may be met with suspicion by any colleagues who are accustomed to more conventional data warehouse architecture.
The inherent complexity in data vault modelling can also be a fatal flaw. It makes the data in the warehouse difficult to understand for “outsiders” who have not been trained in the technique. If a key requirement of enterprise data is that it is easy to access, understand and query then data vault modelling can fall short.
This means that Data Vault 2.0 may not be a good fit for enterprise-wide data governance initiatives that need to engage with a wide range of stakeholders. It can create a fog that undermines attempts to collectively reason about data and the way it is structured.
Where to use it?
Data Vault 2.0 is ultimately about providing scalability and managing change. This comes at the cost of considerable complexity. If you only have a handful of predictable source systems, then it may not be for you.
It is a better fit for scenarios where you have multiple, frequently changing source systems that must be integrated into your enterprise warehouse. By “multiple” I mean at least a few dozen of them. You also need a decent-sized team of data engineers on hand to build the automation infrastructure required to keep the show on the road. This isn’t something a small handful of enthusiasts can implement.
Most importantly, perhaps, you need commitment. You need to ensure everybody is fully on board before you start constructing your vault. It can take time for the real benefits of the accumulated data in a vault to become apparent. The unfamiliarity of the approach can unnerve more risk-averse stakeholders who may prefer the warm embrace of a traditional enterprise data warehouse.