Azure Data Factory and the myth of the code-free data warehouse
4 July 2019
Azure Data Factory promises a "low-code" environment for orchestrating data pipelines. You can ingest, transform and move data between environments without having to write time-consuming boilerplate code. A "code free" data warehouse? Where do I sign!
In this sense it resembles established platforms in the enterprise integration space such as Informatica Cloud, Dell Boomi, SnapLogic and MuleSoft. The problem is that Azure Data Factory suffers from some familiar shortcomings that tend to affect all the platforms in this space.
There’s always code lurking somewhere
Integration tools are often scripting platforms at heart and Azure Data Factory is no exception. The UI is a wrapper around a JSON-based instruction format that is interpreted by a run-time engine. The tooling is immature in places and there are functional gaps that force you to dive in and edit the JSON directly. That said, it does provide an abstraction that saves you from having to write code to connect, copy and transform data.
This kind of "low-code" approach makes it very difficult to test, debug and maintain data pipelines. Support for variables, expressions and flow statements makes it easy to conceal processing logic within the configuration. There are limited facilities for debugging pipelines, the exception reporting is patchy and the monitoring is something of a black box. You can lose days of your life hunting down tiny errors that prevent a job from running.
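To make the point about concealed logic concrete, here is a minimal sketch that walks a pipeline definition and lists every expression buried in it. It assumes the usual convention that expression values are strings beginning with "@" (with "@@" as the escape for a literal "@"). The pipeline itself, its activity names and the find_expressions helper are all made up for illustration.

```python
import json

# Hypothetical pipeline definition, trimmed down to the parts that matter here.
PIPELINE_JSON = """
{
  "name": "CopySalesData",
  "properties": {
    "parameters": {
      "runDate": {"type": "String"},
      "region": {"type": "String"}
    },
    "activities": [
      {
        "name": "CheckRegion",
        "type": "IfCondition",
        "typeProperties": {
          "expression": {
            "value": "@equals(pipeline().parameters.region, 'EU')",
            "type": "Expression"
          }
        }
      },
      {
        "name": "CopyToLake",
        "type": "Copy",
        "typeProperties": {
          "sink": {
            "folderPath": "@concat('sales/', pipeline().parameters.runDate)"
          }
        }
      }
    ]
  }
}
"""


def find_expressions(node, path=""):
    """Yield (path, text) for each string value that looks like an expression."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_expressions(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_expressions(item, f"{path}[{index}]")
    elif isinstance(node, str):
        # Assumption: a leading "@" marks an expression, "@@" is an escaped literal.
        if node.startswith("@") and not node.startswith("@@"):
            yield path, node


expressions = list(find_expressions(json.loads(PIPELINE_JSON)))
for location, text in expressions:
    print(location, "->", text)
```

Even in this tiny example the branching rule and the output path live in two different corners of the document. The sketch only catches values that start with an expression, so anything interpolated into the middle of a string would still slip through.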
Doing anything beyond basic lift-and-shift work in Azure Data Factory can be challenging. There’s limited support for building workflows based on multiple pipelines. You can schedule execution or trigger a run remotely through an API, but there are few options for event-based triggering beyond responding to changes in storage files.
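For what it's worth, remote triggering looks roughly like the sketch below. It builds a request against the management REST API's createRun operation without sending it. The subscription, resource group, factory and pipeline names are placeholders, the token is assumed to have been obtained separately from Azure Active Directory, and build_create_run_request is a hypothetical helper rather than anything the platform provides.

```python
import json
import urllib.request

API_VERSION = "2018-06-01"


def build_create_run_request(subscription_id, resource_group, factory,
                             pipeline, token, parameters=None):
    """Build (but do not send) a request that starts a pipeline run."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun"
        f"?api-version={API_VERSION}"
    )
    # The request body carries the pipeline parameters as a JSON object.
    body = json.dumps(parameters or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Every value below is a placeholder for illustration only.
request = build_create_run_request(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="example-rg",
    factory="example-factory",
    pipeline="CopySalesData",
    token="<access token goes here>",
    parameters={"runDate": "2019-07-04"},
)
print(request.get_method(), request.full_url)

# Sending it requires a real token and a real factory:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

This covers the "somebody else kicks it off" case, but the caller still has to poll for the outcome, which is some way short of proper event-driven orchestration.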
The main promise of "low-code" integration platforms is enabling greater productivity. However, they don’t do anything to address the complexity involved in data integration. The transformation logic still has to be written and maintained somewhere. In the case of Azure Data Factory, it sits in hard-to-read and impossible-to-maintain JSON files.
The black hole of integration logic
Over time, this complexity can build up and create a black hole in your enterprise from which no business logic can ever escape. It’s difficult to enforce any meaningful conventions on the numerous connections, tasks and settings that are created for each pipeline. The platform ends up as a repository for undocumented logic that can be hidden away in all sorts of nooks and crannies.
Bear in mind that an integration platform involves the mother of all vendor lock-ins. Your integrations are never going to run anywhere else, not when they have been written in a custom integration format. In this sense, the cost of change can be extortionate.
Given the learning curve associated with these platforms, they are often operated by relatively isolated integration teams. Ideally, transformation logic should sit with development teams who understand the domain and can provide some redundancy. It should also be subject to the same application development lifecycle as any other code.
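As a rough illustration of what "the same lifecycle as any other code" buys you, here is a transformation written as an ordinary function with a check sitting right next to it. The record shape, the field names and normalise_order are all invented for the example and are not taken from any real pipeline.

```python
from datetime import date
from decimal import Decimal


def normalise_order(raw):
    """Convert one raw order record into the shape a warehouse table might expect."""
    return {
        "order_id": int(raw["OrderID"]),
        "order_date": date.fromisoformat(raw["OrderDate"]),
        # Store money as whole pence to avoid floating point drift.
        "total_pence": int(Decimal(raw["Total"]) * 100),
        # Fall back to a marker value when the country is missing or blank.
        "country": raw.get("Country", "").strip().upper() or "UNKNOWN",
    }


# A hypothetical source record, as it might arrive from an upstream system.
sample = {"OrderID": "1042", "OrderDate": "2019-07-04", "Total": "19.99", "Country": " gb "}
result = normalise_order(sample)

# The kind of check that can run on every commit.
assert result == {
    "order_id": 1042,
    "order_date": date(2019, 7, 4),
    "total_pence": 1999,
    "country": "GB",
}
print(result)
```

Nothing clever is going on here, which is rather the point: the rule is readable, reviewable and testable by anybody on the team.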
Azure Data Factory is more mature than some in this regard, as it does at least provide an integration with source code control. This allows versioning of pipelines, development isolation and backup of pipelines. Working with platforms such as Informatica Cloud and Dell Boomi can be a bit of a white-knuckle ride in this respect - get careless with a delete button and it’s gone forever.
Earlier iterations of Azure Data Factory were more of an extract and load tool than something that could perform complex transformations in flight. That has changed this year with the introduction of Data Flows. These provide a doodle-ware interface that compiles transformation logic into code for Apache Spark running on a Databricks cluster.
This abstraction of transformation logic can be unsettling, particularly if you prefer to know what code is really being executed under the hood. The implementation also feels somewhat bolted on. You plunge into a very different interface to write data flows. You need to explicitly spin up a cluster before you can run anything. You even need to use a separate scripting syntax.
Pricing can be a little murky and it’s difficult to plan costs. The unit prices for orchestration and runtime execution may appear small, but they can really add up for large numbers of complex pipelines. Working with data flows exacerbates this as it’s not always obvious when you’re getting charged for a cluster to execute or debug pipelines.
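A back-of-envelope sketch shows how small unit prices compound. The three-part cost model is a simplification and every rate below is a made-up placeholder in no particular currency, not a published price, so substitute the figures from the current price list before trusting any of the output.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Placeholder unit prices. These are NOT real Azure prices."""
    per_1000_activity_runs: float
    per_execution_hour: float
    per_cluster_vcore_hour: float


def monthly_cost(rates, activity_runs, execution_hours, cluster_vcores, cluster_hours):
    """Split an estimated monthly bill into its three main components."""
    orchestration = activity_runs / 1000 * rates.per_1000_activity_runs
    execution = execution_hours * rates.per_execution_hour
    data_flows = cluster_vcores * cluster_hours * rates.per_cluster_vcore_hour
    return {
        "orchestration": orchestration,
        "execution": execution,
        "data_flows": data_flows,
        "total": orchestration + execution + data_flows,
    }


# Hypothetical estate: 200 pipelines of 10 activities each, run hourly for 30 days.
activity_runs = 200 * 10 * 24 * 30

estimate = monthly_cost(
    Rates(per_1000_activity_runs=1.0, per_execution_hour=0.25, per_cluster_vcore_hour=0.25),
    activity_runs=activity_runs,
    execution_hours=500,
    cluster_vcores=8,
    cluster_hours=120,  # includes time the cluster sits idle during debugging
)
for component, cost in estimate.items():
    print(f"{component}: {cost:.2f}")
```

With these invented numbers the orchestration charge dwarfs everything else, simply because an hourly schedule turns a modest set of pipelines into well over a million activity runs a month.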
As with all doodle-ware implementations, you are also limited to the transformations provided by the templated components on a point-and-click UI. Any more complex transformation inevitably involves entering scripts into scattered configuration windows.
Not all connectors are made equal
ETL tools live and die by the breadth of connectivity they offer. They all promise a similar range of connectivity but each one tends to come with its own foibles, particularly for more generic protocols. My general experience of integration products is to never assume that they can connect to anything at all.
Azure Data Factory is no exception. As you might expect, its connectivity seems mostly concerned with Azure data stores, i.e. blob storage, SQL, Cosmos and Lake Storage. Azure Search can only be used as a target rather than a source. AWS support is as patchy as you’d expect. The generic REST connector is fussy about the APIs that it is prepared to do business with and can’t even work with OAuth 2.0.
This does reinforce the notion of a tool that is better suited to lifting and shifting data around the Azure ecosystem as opposed to enabling widespread integration.
Integration is never easy...
Integration platforms can serve a useful purpose. They can contain the mess of integration and prevent it from leaking into domain applications. They can also cut down the amount of repetitive boilerplate code needed to wire things together.
Despite this, it's a dangerous myth to suggest that integration can somehow be easy. A "code free" data warehouse? There’s always code lurking somewhere under the hood and it’s not always wise to try and abstract it away from development teams.