Machine learning models in production: the maths is the easy part

There can often be too much focus on the research aspects of machine learning. When you’re putting models into production, the maths is only a small part of the overall system. You also need infrastructure that lets you prepare data, build models, test outputs and serve predictions in a robust, automated and scalable way.

Machine learning systems tend to have all the challenges of “traditional” code, plus an additional set of very specific issues. These largely stem from the fact that the data is as important as the code in a machine learning system. Many architectural and engineering practices struggle to take this into account.

Detecting problems

Because the behaviour of a machine learning system is shaped by its data inputs, problems can be hard to detect: they often don’t exist at the code level. Things can go wrong in machine learning models that traditional unit or automation tests will not pick up. You can deploy the wrong version of a model, use the wrong training data or forget a specific feature.

Configuration management can be a real challenge for machine learning systems. It can be hard to keep track of models that are being constantly iterated on and subtly changed. Managing the different streams of data that feed them is equally difficult.
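
As a rough illustration of one piece of that puzzle, a trained model can be written out alongside a small metadata record capturing its version, its training data (and a hash of it) and the feature list it expects. This is only a sketch; the field names and layout are invented:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file so you can tell later exactly which data a model was trained on."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_model_card(model_path: Path, training_data: Path,
                     features: list[str], model_version: str) -> Path:
    """Record the version, training data and feature list next to the model artefact."""
    card = {
        "model_version": model_version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "training_data": str(training_data),
        "training_data_sha256": file_sha256(training_data),
        "features": features,
    }
    card_path = model_path.with_suffix(".metadata.json")
    card_path.write_text(json.dumps(card, indent=2))
    return card_path
```

A deployment script can then refuse to serve any model whose recorded feature list or data hash doesn’t match what the serving code expects.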

Ultimately you need to implement some form of differential testing where you compare results between old and new models against a standard data set. Benchmarks can also tell you the time it takes to serve up a consistent set of predictions so you can guard against deteriorating performance. Load and stress tests become particularly important given the resource-heavy nature of the workloads.
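
As a sketch of what such a differential test might look like, assuming both models expose a predict method and a fixed benchmark dataset is kept around (the function and parameter names here are illustrative):

```python
import time

import numpy as np


def differential_test(old_model, new_model, benchmark_features, tolerance=0.02):
    """Compare old and new predictions on a fixed benchmark dataset.

    Returns the fraction of predictions that disagree beyond the tolerance,
    plus the time each model took, so latency regressions show up too.
    """
    start = time.perf_counter()
    old_preds = np.asarray(old_model.predict(benchmark_features))
    old_elapsed = time.perf_counter() - start

    start = time.perf_counter()
    new_preds = np.asarray(new_model.predict(benchmark_features))
    new_elapsed = time.perf_counter() - start

    disagreement = float(np.mean(np.abs(old_preds - new_preds) > tolerance))
    return {
        "disagreement_rate": disagreement,
        "old_latency_s": old_elapsed,
        "new_latency_s": new_elapsed,
    }
```

A CI job can then fail the build if the disagreement rate or the latency crosses an agreed threshold, which is about as close as you can get to a regression test for a model.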

The first one is always free

There’s a useful Google paper, "Hidden Technical Debt in Machine Learning Systems", that explores the kinds of technical debt that can build up in machine learning systems. Many of these problems only become apparent over time. Shipping the first version of a model is relatively easy, but managing subsequent change is much harder.

One of the bigger problems concerns entanglement, where the complexity of models means that simple changes can have cascading and unpredictable effects. There can also be hidden feedback loops, particularly when a model relies on historical data. Any change in the model can have unpredictable effects on these inputs that take a while to manifest.

Further problems start to appear when you chain models together, particularly if this happens unintentionally. "Undeclared consumers" appear when the outputs of a model are made freely available and quietly become inputs to other models. These become informal and unregulated dependencies between systems.

Over the long term, models tend to suffer from "concept drift": their accuracy naturally deteriorates because the scenarios they were trained to predict change. For example, if you’re predicting sales then you must take changes to competitors or current events into account. This implies that models and the data used to train them need constant review to avoid an inevitable process of entropy.
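
One simple way of keeping an eye on this is to track the error rate over a rolling window of recent predictions once the true outcomes arrive. A minimal sketch, assuming labelled outcomes come back with some delay and that the baseline and thresholds would be tuned to the problem at hand:

```python
from collections import deque


class DriftMonitor:
    """Track prediction error over a rolling window and flag deterioration."""

    def __init__(self, window_size=1000, baseline_error=0.10, alert_ratio=1.5):
        self.errors = deque(maxlen=window_size)
        self.baseline_error = baseline_error  # error rate measured at deployment
        self.alert_ratio = alert_ratio        # how much worse we are willing to tolerate

    def record(self, predicted, actual):
        """Call this once the true outcome for a prediction is known."""
        self.errors.append(0.0 if predicted == actual else 1.0)

    def drifting(self):
        """True if recent error has deteriorated well beyond the deployment baseline."""
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough recent outcomes yet
        current_error = sum(self.errors) / len(self.errors)
        return current_error > self.baseline_error * self.alert_ratio
```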

Data is harder than code

Data dependencies are harder to manage than code. They can be unstable, either because of the nature of the data or because of unplanned changes in the source system.

Only a very small amount of code in a machine learning system has anything to do with learning or prediction. Much of the work tends to be in sourcing, extracting and moulding data into the right format. This tends to give rise to vast numbers of scripts, transforms, joins and intermediate outputs, all managed by ugly data mangling code.

This code tends to be fragile and very hard to test. It leads to the kind of bloated complexity that will be familiar to anybody who has worked in enterprise data integration. Platforms such as Informatica Cloud or Mulesoft are often enlisted to try to manage this complexity. They can reduce the amount of boilerplate you need to write, though they don’t help to reduce the complexity that is inherent in all this data transformation.
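
One mitigation, whatever the tooling, is to pull each transformation into a small, pure function that can be unit tested with a handful of rows. A sketch using pandas, with invented column names for a raw orders extract:

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalise a raw orders extract into the shape the model expects."""
    orders = raw.rename(columns={"prod": "product_id", "qty": "quantity"})
    orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
    orders = orders.dropna(subset=["order_date", "product_id"])
    orders["quantity"] = orders["quantity"].clip(lower=0)
    return orders


def test_clean_orders_drops_unparseable_dates():
    raw = pd.DataFrame({
        "prod": ["a", "b"],
        "qty": [3, -1],
        "order_date": ["2023-01-05", "not a date"],
    })
    cleaned = clean_orders(raw)
    assert list(cleaned["product_id"]) == ["a"]
    assert cleaned["quantity"].min() >= 0
```

It doesn’t make the complexity disappear, but it does mean each step can be exercised in isolation rather than only as part of an end-to-end run.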

Separation of expertise

There tends to be a separation between researchers who build the models and developers who implement them. This separation of expertise can have knock-on effects, mostly caused by no single person or team fully understanding how the system works.

One problem is with tooling. The popularity of Jupyter Notebooks among data scientists is understandable, as they provide a nicely interactive tool for exploring data. When it comes to production code, they are not everybody's cup of tea. Cells can be executed in a non-linear order, so notebooks tend to accumulate hidden state that is easy to get wrong. Dependencies can be a problem, as it’s not always clear which version of a library was used during the exploration. Their error handling isn’t exactly elegant.

Jupyter Notebooks require a very fastidious developer to be at all suitable for production. They can often become a mess of code snippets, jagged markdown notes and cryptic graphs. Many development workflows require these to be re-worked into separate, testable and predictable scripts before they can be used in anger.
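
A typical re-working is to move the notebook’s logic into an importable module with a thin command-line entry point, so the training step can be tested and re-run repeatably. A sketch of the shape this often takes, with illustrative names throughout:

```python
"""train.py -- the notebook's training logic, pulled out into a plain script."""
import argparse

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def train_model(data: pd.DataFrame, target: str = "sales") -> RandomForestRegressor:
    """Pure training step: easy to call from a test with a small DataFrame."""
    features = data.drop(columns=[target])
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(features, data[target])
    return model


def main() -> None:
    parser = argparse.ArgumentParser(description="Train the sales model")
    parser.add_argument("training_csv")
    parser.add_argument("model_out")
    args = parser.parse_args()

    data = pd.read_csv(args.training_csv)
    joblib.dump(train_model(data), args.model_out)


if __name__ == "__main__":
    main()
```

Pinning the dependencies in a requirements file alongside the script also answers the "which version of the library was this?" question that notebooks tend to leave open.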

Another approach is to build cross-skilled teams who can handle both the research and the production aspects. This is easier said than done as they are very different disciplines. Data science is concerned with experimenting with models rather than following the rules of good software engineering. On the other hand, some of the maths is hard. Just because you understand how to write testable and modular code, it doesn’t mean you can bluff your way through machine learning models.

Knowing where to stop

Many "meat and drink" machine prediction applications are based on standard regression or categorisation algorithms. There may even be such a thing as "commodity" machine learning, where there’s a limit to how much you need to invest in the accuracy of a model before diminishing returns sets in.

For example, if you’re estimating sales to predict stock ordering then you only really need to be more accurate than the average human to add value. If you’re categorising data to speed up data entry, an educated approximation is fine if a manual checking process is still involved.

We may be starting to see the emergence of libraries that make machine learning more accessible to developers. For example, ML.Net provides a simple set of abstractions that let you automate the comparison and training of models. It is a total black box that confines you to a set of pre-packaged models, but this may be fine for “commodity” machine learning scenarios.
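
To illustrate the general idea rather than the ML.Net API itself, here is a rough Python equivalent using scikit-learn: a handful of candidate models compared by cross-validation, with the best one kept. The dataset here is synthetic, purely for the sake of a runnable example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice this would be the prepared training set.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Score each candidate the same way and keep whichever does best.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(f"Best candidate: {best_name} (mean R^2 = {scores[best_name]:.3f})")
```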

These tools could also serve as a simple starting point that allows design thinking to focus on providing scalable, maintainable and long-lived production code. After all, there’s much more than just the model to consider here.