4 July 2019

Azure Data Factory and the myth of the “code-free” data warehouse

Azure Data Factory has been much improved with the addition of Data Flows to enable data transformations. It now resembles more established platforms in a crowded market that includes the likes of Informatica Cloud, Dell Boomi, SnapLogic and MuleSoft.

All these platforms promise a code-free integration environment where data can be extracted, transformed and mapped between systems using generic connectors. Like many of the platforms in this space, Azure Data Factory demos beautifully and can sound compelling. A “code free” data warehouse? Where do I sign!

The problem is that it suffers from some familiar shortcomings that affect similar tools in the cloud-based integration space.

There’s always code lurking somewhere

Integration tools are often scripting platforms at heart and Azure Data Factory is no exception. The UI is a wrapper around a JSON-based instruction format that is interpreted by an engine. The tooling is immature in places and there are functional gaps that force you to dive in and edit the JSON directly. This is encouraged by the large number of examples that illustrate features with JSON rather than explaining how to achieve something through the UI.

Support for variables, expressions and flow statements makes it easy to conceal processing logic within the configuration. What you’re really doing is building a set of JSON scripts that are interpreted by a run-time engine. It’s not as expressive as code and you can lose days of your life hunting down tiny errors that prevent a job from running.
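
To make that concrete, here is a trimmed, hypothetical sketch of the kind of pipeline definition that sits behind the UI. The names are invented and the source and sink settings have been left out, but it shows how a branching decision ends up buried in configuration rather than code:

  {
    "name": "ExamplePipeline",
    "properties": {
      "activities": [
        {
          "name": "CheckForNewRows",
          "type": "IfCondition",
          "typeProperties": {
            "expression": {
              "value": "@greater(int(activity('LookupRowCount').output.firstRow.cnt), 0)",
              "type": "Expression"
            },
            "ifTrueActivities": [
              {
                "name": "CopyNewRows",
                "type": "Copy"
              }
            ]
          }
        }
      ]
    }
  }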

Where these platforms add value is in eliminating the need to write loads of connectivity boilerplate code. They don’t eliminate the complexity involved in data integration. The transformation logic must still be written and maintained somewhere. In this case it will sit in hard-to-read JSON files.

The black hole of integration logic

Over time, this complexity can build up and create a black hole in your enterprise from which no business logic can ever escape. It’s difficult to enforce any meaningful naming conventions on the numerous connections, tasks and settings that are created for each pipeline. The platform ends up as a repository for undocumented logic that is hidden away in all sorts of nooks and crannies.

Also, bear in mind that an integration platform involves the mother of all vendor lock-ins. Your integrations are never going to run anywhere else, not when they have been written in a custom integration format. In this sense, the cost of change can be extortionate.

Given the learning curve associated with these platforms, they are often operated by relatively isolated integration teams. Ideally, transformation logic should sit with development teams who understand the domain and can provide greater redundancy. It should also be subject to some rigor in terms of a software development lifecycle.

Azure Data Factory is more mature than most in this regard, providing integration with source code control, convenient development isolation and support for a meaningful disaster recovery strategy. Working with platforms such as Informatica Cloud and Dell Boomi can be a bit of a white-knuckle ride in this respect – get careless with a delete button and it’s gone forever.

Data Flows

Earlier iterations of Azure Data Factory were more of an extract-and-load tool than something that could perform complex transformations in flight. This changed this year with the introduction of Data Flows, which provide a doodle-ware interface that compiles transformation logic into code for Apache Spark running on a Databricks cluster. The intention is to provide the high-capacity scaling of Apache Spark without having to worry about infrastructural concerns.

This kind of abstraction can be unsettling, particularly if you prefer to know what code is really being executed. As with all doodle-ware systems, you are limited to the transformations provided by the templated components on a point-and-click UI. Any more complex transformation inevitably involves entering scripts into hard-to-find configuration windows.
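
As a rough illustration, a simple Data Flow aggregation ultimately resolves into an ordinary Spark job. The sketch below is written in PySpark with hypothetical paths and column names; it is not the code Data Factory actually generates, just the shape of what ends up running on the cluster:

  # Read raw data, aggregate it and write the result - roughly what a
  # point-and-click aggregate transformation amounts to once it hits Spark.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("example-data-flow").getOrCreate()

  # Hypothetical source: order records landed in a data lake
  orders = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/orders/")

  # The "transformation": group by date and sum the order amounts
  daily_totals = (
      orders.groupBy("order_date")
            .agg(F.sum("amount").alias("total_amount"))
  )

  # Hypothetical sink: a curated zone in the same lake
  daily_totals.write.mode("overwrite").parquet(
      "abfss://curated@examplelake.dfs.core.windows.net/daily_totals/"
  )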

Not all connectors are made equal

ETL tools live and die by the breadth of connectivity they offer. They all promise a similar range of connectivity, but each one tends to come with its own foibles, particularly for the more generic protocols. My general experience of integration products has taught me never to assume that they can connect to anything at all.

Azure Data Factory is no exception as it has quite a few annoying gaps (at the time of writing). Azure Search can only be used as a target rather than a source. AWS support is as patchy as you’d expect. The generic REST connector is fussy about the APIs that it is prepared to do business with. As you might expect, it seems mostly concerned with connecting Azure data stores together, i.e. blob storage, SQL, Cosmos and Lake Storage.
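
The fussiness of the REST connector shows up in how much you have to spell out by hand. The snippet below is a hypothetical sketch of a copy activity source describing an API’s pagination scheme (the $.nextLink path is invented); if an API doesn’t paginate in a way these rules can express, you are back to writing code:

  {
    "source": {
      "type": "RestSource",
      "paginationRules": {
        "AbsoluteUrl": "$.nextLink"
      }
    }
  }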

ELT rather than ETL?

The sweet spot for this kind of tooling isn’t necessarily ETL (extract, transform and load). The transformation logic is inevitably hidden within complex configuration files. This may be fine for simple aggregations and sorts, but it can quickly escalate into complex logic that is hard to trace and impossible to test in isolation. Over the long term it builds into a byzantine well of complexity that you will never be able to change safely. These tools are arguably better suited to ELT (extract, load and transform), where they simply move data between stores and the transformation logic lives in the destination, where it can be written as code and tested properly.

It’s a dangerous myth to suggest that integration can somehow be easy. One thing that integration platforms do is contain the mess of integration and prevent it from leaking into domain applications. This is useful, but it’s certainly never easy. A modern “code free” data warehouse? There’s always code lurking somewhere under the hood and it’s not always wise to try and abstract it away.

Filed under Architecture, Azure, Integration.