What would machine learning look like if you mixed in DevOps? Wonder no more, we lift the lid on MLOps

And why do such a thing? Well, how else will you push your artificially intelligent software into production?

Achieving production-level governance with machine-learning projects currently presents unique challenges. A new space of tools and practices is emerging under the name MLOps. The space is analogous to DevOps but tailored to the practices and workflows of machine learning.

Why MLOps is Needed

Machine learning models make predictions for new data based on the data they have been trained on. Managing this data in a way that can be safely used in live environments is challenging, and one of the key reasons why 80 per cent of data science projects never make it to production – an estimate from Gartner.

It is essential that the data is clean, correct, and safe to use without any privacy or bias issues. Real-world data can also continuously change, so inputs and predictions have to be monitored for any shifts that may be problematic for the model. These are complex challenges that are distinct from those found in traditional DevOps.

MLOps Not Just DevOps

DevOps practices are centred on the “build and release” process and continuous integration. Traditional development builds are packages of executable artifacts compiled from source code. Non-code supporting data in these builds tends to be limited to relatively small static config files. In essence, traditional DevOps is geared to building programs consisting of sets of explicitly defined rules that give specific outputs in response to specific inputs.

In contrast, machine-learning models make predictions by indirectly capturing patterns from data, not by formulating all the rules. A characteristic machine-learning problem involves making new predictions based on known data, such as predicting the price of a house using known house prices and details such as the number of bedrooms, square footage, and location. Machine-learning builds run a pipeline that extracts patterns from data and creates a weighted machine-learning model artifact. This makes these builds far more complex and the whole data science workflow more experimental. As a result, a key part of the MLOps challenge is supporting multi-step machine learning model builds that involve large data volumes and varying parameters.

To run projects safely in live environments, we need to be able to monitor for problem situations and see how to fix things when they go wrong. There are pretty standard DevOps practices for how to record code builds in order to go back to old versions. But MLOps does not yet have standardisation on how to record and go back to the data that was used to train a version of a model.

There are also special MLOps challenges to face in the live environment. There are largely agreed DevOps approaches for monitoring for error codes or an increase in latency. But it’s a different challenge to monitor for bad predictions. You may not have any direct way of knowing whether a prediction is good, and may have to instead monitor indirect signals such as customer behaviour (conversions, rate of customers leaving the site, any feedback submitted). It can also be hard to know in advance how well your training data represents your live data. For example, it might match well at a general level but there could be specific kinds of exceptions. This risk can be mitigated with careful monitoring and cautious management of the rollout of new versions.

The MLOps Tool Scene

The effort involved in solving MLOps challenges can be reduced by leveraging a platform and applying it to the particular case. Many organisations face a choice of whether to use an off-the-shelf machine-learning platform or try to put an in-house platform together themselves by assembling open-source components.

Some machine-learning platforms are part of a cloud provider’s offering, such as AWS SageMaker or AzureML. This may or may not appeal, depending on the cloud strategy of the organisation. Other platforms are not cloud-specific and instead offer self-install or a custom hosted solution (eg, Databricks MLflow).

Instead of choosing a platform, organisations can instead choose to assemble their own. This may be a preferred route when requirements are too niche to fit a current platform, such as needing integrations to other in-house systems or if data has to be stored in a particular location or format. Choosing to assemble an in-house platform requires learning to navigate the ML tool landscape. This landscape is complex with different tools specialising in different niches and in some cases there are competing tools approaching similar problems in different ways (see the Linux Foundation’s LF AI project for a visualization or categorised lists from the Institute for Ethical AI).

ML Ops diagram

The Linux Foundation’s diagram of MLOps tools ... Click for full detail

For organisations using Kubernetes, the kubeflow project presents an interesting option as it aims to curate a set of open-source tools and make them work well together on kubernetes. The project is led by Google, and top contributors (as listed by IBM) include IBM, Cisco, Caicloud, Amazon, and Microsoft, as well as ML tooling provider Seldon, Chinese tech giant NetEase, Japanese tech conglomerate NTT, and hardware giant Intel.


Challenges around reproducibility and monitoring of machine learning systems are governance problems. They need to be addressed in order to be confident that a production system can be maintained and that any challenges from auditors or customers can be answered. For many projects these are not the only challenges as customers might reasonably expect to be able to ask why a prediction concerning them was made. In some cases this may also be a legal requirement as the European Union’s General Data Protection Regulation states that a "data subject" has a right to "meaningful information about the logic involved" in any automated decision that relates to them.

Explainability is a data science problem in itself. Modelling techniques can be divided into “black-box” and “white-box”, depending on whether the method can naturally be inspected to provide insight into the reasons for particular predictions. With black-box models, such as proprietary neural networks, the options for interpreting results are more restricted and more difficult to use than the options for interpreting a white-box linear model. In highly regulated industries, it can be impossible for AI projects to move forward without supporting explainability. For example, medical diagnosis systems may need to be highly interpretable so that they can be investigated when things go wrong or so that the model can aid a human doctor. This can mean that projects are restricted to working with models that admit of acceptable interpretability. Making black-box models more interpretable is a fast-growth area, with new techniques rapidly becoming available.

The MLOps scene is evolving as machine-learning becomes more widely adopted, and we learn more about what counts as best practice for different use cases. Different organisations have different machine learning use cases and therefore differing needs. As the field evolves we’ll likely see greater standardisation, and even the more challenging use cases will become better supported. ®

Want to learn more?

Ryan Dawson is a core member of the Seldon open-source team, providing tooling for machine-learning deployments to Kubernetes. He has spent 10 years working in the Java development scene in London across a variety of industries.

Bringing DevOps principles to machine learning throws up some unique challenges, not least very different workflows and artifacts. Ryan will dive into this topic in May at Continuous Lifecycle London 2020 – a conference organized by The Register's mothership, Situation Publishing.

You can find out more, and book tickets, right here.

Broader topics

Other stories you might like

  • Cerebras sets record for 'largest AI model' on a single chip
    Plus: Yandex releases 100-billion-parameter language model for free, and more

    In brief US hardware startup Cerebras claims to have trained the largest AI model on a single device powered by the world's largest Wafer Scale Engine 2 chip the size of a plate.

    "Using the Cerebras Software Platform (CSoft), our customers can easily train state-of-the-art GPT language models (such as GPT-3 and GPT-J) with up to 20 billion parameters on a single CS-2 system," the company claimed this week. "Running on a single CS-2, these models take minutes to set up and users can quickly move between models with just a few keystrokes."

    The CS-2 packs a whopping 850,000 cores, and has 40GB of on-chip memory capable of reaching 20 PB/sec memory bandwidth. The specs on other types of AI accelerators and GPUs pale in comparison, meaning machine learning engineers have to train huge AI models with billions of parameters across more servers.

    Continue reading
  • Meta: We need 5x more GPUs to combat TikTok, stat
    And 30% fewer new engineers this year

    Comment Facebook parent Meta has reportedly said it needs to increase its fleet of datacenter GPUs fivefold to help it compete against short-form video app and perennial security concern TikTok.

    The oft-controversial tech giant needs these hardware accelerators in its servers by the end of the year to power its so-called discovery engine that will become the center of future social media efforts, according to an internal memo seen by Reuters that was written by Meta Chief Product Officer Chris Cox.

    Separately, CEO Mark Zuckerberg told Meta staff on Thursday in a weekly Q&A the biz had planned to hire 10,000 engineers this year, and this has now been cut to between 6,000 and 7,000 in the shadow of an economic downturn. He also said some open positions would be removed, and pressure will be placed on the performance of those staying at the corporation.

    Continue reading
  • Jenkins warns of security holes in these 25 plugins
    Relax, most of the vulnerabilities so far have, er, no fix

    Jenkins, an open-source automation server for continuous integration and delivery (CI/CD), has published 34 security advisories covering 25 plugins used to extend the software.

    Eleven of the advisories are rated high severity, 14 are medium, and 9 are said to be low.

    The vulnerabilities described include: cross-site scripting (XSS); passwords, API keys, secrets, and tokens stored in plaintext; cross-site request forgery (CSRF); and missing and incorrect permission checks.

    Continue reading

Biting the hand that feeds IT © 1998–2022