We often think of ‘deployment’ as packaging software into an artifact and moving it to an environment to run on. For Machine Learning it can be better to think of deployment as “the action of bringing resources into effective action” (one of Oxford’s definitions of ‘deployment’).
There are a range of patterns for using ML models to make business decisions, and deploying machine learning models can mean different things depending on the context. Understanding the key prediction-making patterns can help you decide which tools apply to your use case.
Different situations lend themselves to different deployment patterns. But sometimes the same model might be deployed in more than one way. The use cases for the patterns are distinguished more by how a model is to be consumed than the type or meaning of the predictions.
Many projects currently assume they have to build their own ML deployment infrastructure. Armed with a knowledge of the deployment patterns, we’ll see that there are tools available to help.
Offline prediction means making predictions on an ad-hoc or one-off basis. This could be done on a local machine, depending on the scale of the predictions and the compute resources required.
Offline predictions might be generated directly from Python code (e.g. a Jupyter notebook) and given to the intended audience (e.g. emailed as a CSV). This pattern fits best where the predictions are for a single event and a new model would be required for any new event. Example cases could be predicting the outcome of an election or a sporting event.
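As a sketch of the offline pattern, the snippet below scores a handful of ad-hoc records and writes the results to a CSV ready to email on. The election features, coefficients and scoring formula are hypothetical stand-ins for whatever the notebook actually trained:

```python
import csv

# Hypothetical fitted weights standing in for a trained model artifact.
COEFFS = {"polls_lead": 0.08, "incumbent": 0.05}
INTERCEPT = 0.5

def predict_win_probability(row):
    """Score one candidate record with the toy linear model."""
    score = INTERCEPT + sum(COEFFS[k] * row[k] for k in COEFFS)
    return max(0.0, min(1.0, score))  # clamp to a valid probability

candidates = [
    {"name": "A", "polls_lead": 3.0, "incumbent": 1},
    {"name": "B", "polls_lead": -3.0, "incumbent": 0},
]

# Hand the one-off predictions over as a plain CSV file.
with open("election_predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "win_probability"])
    for row in candidates:
        writer.writerow([row["name"], round(predict_win_probability(row), 3)])
```

The key point is that nothing here is long-running: the code runs once, produces its artifact, and is done.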
Batch prediction is when a set of predictions need to be made from a file or store, typically on a regular time cycle (e.g. weekly). Imagine a model being used to predict income for the next quarter. Or to predict how much water a crop will need next week.
With batch use cases there's a regular prediction cycle, and it might be that the same training data or the same model is used each cycle. Or it might be that new training data becomes available each cycle and a new model needs to be produced each time. Either way, the use case lends itself to trigger-activated processing jobs that work their way through the store of input data.
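The cycle above can be sketched as a single job function that a scheduler triggers each cycle. The input columns and the toy scoring formula below are illustrative stand-ins, not a real model:

```python
import csv
import io

def run_batch_job(input_csv: str) -> str:
    """One batch cycle: read every record from the input store, score it,
    and return the predictions as CSV text. A real job would load a
    serialized model artifact rather than hard-code a formula."""
    def predict_water_need(crop_area, rainfall):
        return max(0.0, crop_area * 2.5 - rainfall)  # toy stand-in formula

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["field_id", "litres_needed"])
    for row in csv.DictReader(io.StringIO(input_csv)):
        litres = predict_water_need(float(row["crop_area"]), float(row["rainfall"]))
        writer.writerow([row["field_id"], litres])
    return out.getvalue()

# A scheduler (cron, Argo Workflows, Kubeflow Pipelines) would trigger this
# each cycle against the latest input file:
predictions = run_batch_job("field_id,crop_area,rainfall\nf1,10,5\nf2,4,20\n")
```

The job itself stays simple; the scheduling, retries and data movement are exactly the concerns the batch tooling discussed later takes care of.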
Real-time serving is making predictions on-demand, usually via an HTTP call. These online use-cases include (but are not limited to) e-commerce applications for recommending content such as products or advertising.
Commonly a model is trained in a training environment and packaged for real-time serving. The packaged model then goes through a promotion process to the live serving environment. The train-package-promote process starts again when new training data becomes available.
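At its core, real-time serving means wrapping a model behind an HTTP endpoint. The sketch below uses only Python's standard library to show the shape of it; the scoring function is a hypothetical stand-in, and a real deployment would use a dedicated serving framework rather than `http.server`:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    """Hypothetical model stand-in: score a feature vector."""
    return sum(features) / (len(features) or 1)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the request body, score it, and return JSON.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        payload = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port in a background thread, then call it once.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0, 3.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urlopen(req).read())
server.shutdown()
```

Everything beyond this toy endpoint, such as scaling, versioned rollouts and monitoring, is what dedicated serving tools provide.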
There are other less common serving use cases such as online learning. With online learning training happens continuously and every new prediction request is also a new piece of training data. Most serving tools centre on separate training and serving, not online learning.
Real-time serving can be intimately tied to roll-out and monitoring. The served model becomes a microservice and new versions of it need to be rolled out safely and their performance monitored.
Streaming use cases are marked out by the throughput of predictions required, which can be high and variable. Streaming can be similar to batch or real-time, but with the addition of a messaging system that queues up predictions to be handled at the rate of processing rather than the rate of arrival.
What distinguishes streaming from other real-time cases is the use of a queue to smooth out the processing of a variable workload of predictions.
Once generated, the live predictions might go to a file, to a database or to an HTTP service.
An example of streaming might be online fraud detection, where perhaps transactions are allowed to go through and are then asynchronously queued up for post-transaction verification by the fraud detection model.
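The queue-smoothing idea can be sketched with an in-process queue standing in for a real message broker such as Kafka. The fraud-scoring rule is a hypothetical stand-in:

```python
import queue
import threading

def fraud_score(transaction):
    """Hypothetical stand-in for a fraud detection model."""
    return 1.0 if transaction["amount"] > 1000 else 0.1

requests = queue.Queue()  # stands in for a message broker topic
results = []

def worker():
    # Drain the queue at the worker's own processing rate,
    # regardless of how fast transactions arrive.
    while True:
        txn = requests.get()
        if txn is None:  # sentinel: no more work
            break
        results.append((txn["id"], fraud_score(txn)))
        requests.task_done()

t = threading.Thread(target=worker)
t.start()

# Transactions are allowed through immediately and queued for
# asynchronous post-transaction verification:
for txn in [{"id": "t1", "amount": 50}, {"id": "t2", "amount": 5000}]:
    requests.put(txn)
requests.put(None)
t.join()
```

In production the queue would be a durable broker and the worker a separately scaled consumer, but the decoupling of arrival rate from processing rate is the same.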
Leveraging ML Deployment Tools
Machine Learning Deployment becomes easier to navigate if we break it down into categories. Here we’ve suggested the spaces of Offline Prediction, Batch Prediction, Real-time Serving and Streaming. There are open source tools in each of these spaces. Established tools are designed from the beginning to handle scaling and operating prediction systems in production.
A great place to find out about tools is the Linux Foundation for AI’s (LFAI) Landscape Diagram. Here’s part of the LFAI diagram:
Featured in the middle of the diagram is Seldon, where I work. Seldon Core is an open source real-time serving solution that has integrations for batch or streaming use cases. Seldon has integrations for progressive rollouts, metrics, audit logging, explanations and data drift detection. Seldon supports a range of different frameworks and allows different types of models to be seamlessly swapped or mixed (for example, with a multi-armed bandit router).
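As a rough sketch, deploying a model with Seldon Core comes down to applying a SeldonDeployment resource to a Kubernetes cluster. The names and model URI below are illustrative placeholders:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model                   # illustrative name
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER      # prepackaged scikit-learn server
        modelUri: gs://my-bucket/sklearn/iris  # placeholder model location
```

Once applied with kubectl, Seldon Core exposes the model behind HTTP and gRPC endpoints.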
Our philosophy at Seldon is to focus on key use cases and work with the open source community, as well as enterprise customers, to form flexible and scalable solutions. This has led to certain technology choices: Kubernetes as the target platform; Knative, with its pluggable message broker options (including a Kafka option), for the streaming architecture; and either Argo Workflows or Kubeflow Pipelines for batch.
To delve deeper into the range of tools available for each area, I’d suggest the ‘Awesome Production Machine Learning’ GitHub repo. It has 7k stars at the time of writing and has dedicated sections on Model Serving (Real-time), Data Pipelines (Batch) and Data Stream Processing (Streaming).
Summary: Navigating ML Deployment
Deciding how best to get predictions from ML models can be confusing. We can approach decisions with more clarity if we keep in mind the key deployment patterns: Offline, Batch, Real-time and Streaming.
Many projects currently assume they have to build their own ML Deployment infrastructure. But there are technologies out there for each of the four deployment patterns. Knowledge of the patterns also brings reassurance that you are not alone.