Seldon Blog


Seldon Deploy Advanced Released

The Next Generation Data-Centric MLOps Platform

Seldon today announced Seldon Deploy Advanced, a game-changing data-centric MLOps platform to power next-generation production machine learning systems. Seldon Deploy Advanced is an enterprise-grade MLOps platform capable of supporting organisation-wide AI services and applications at massive scale. As of today, the core engine of Seldon Deploy Advanced powers over 3 million machine learning models across 10,000+ Kubernetes clusters, with enterprise users running organisation-wide ML capabilities across Pharma, Manufacturing, Insurance, Finance and more.

Organisations continue to unlock unprecedented value from a growing number of machine learning use cases that span one or many departments and domains. In 2021, worldwide software revenue for all Data Science and AI platforms grew by $1bn, a 20% increase, with businesses applying AI across all industries. However, this growth in adoption has also uncovered new challenges, including duplicated effort, significant cost overheads and unmanaged compliance risk. These challenges have restricted the full potential of organisation-wide ML capabilities, until now.

This is a major milestone for the MLOps ecosystem as a whole, as Seldon Deploy Advanced puts state-of-the-art concepts from the emerging field of data-centric machine learning into practice. The platform tackles the organisational pain points that grow as ML reaches scale. The data-centric features announced as part of this release push the boundaries of what production machine learning systems can currently do, and include:

  • Substantial cost savings through Multi-model Serving
  • Infrastructure optimisation through Overcommit Functionality
  • Full control of your ML Systems through Data-centric Pipelines
  • Deeper User Insights with extended Usage Metrics
  • Discoverability of Production MLOps Assets through Model Catalogue

These features are covered in more detail below.

Substantial cost savings through Multi-model Serving

Machine learning models can introduce significant overhead when deployed in production. These costs can be attributed to multiple factors, including infrastructure overheads and the allocation of specialised resources (high CPU / GPU / memory). For serving ML models in production, a standard pattern is to package the ML model inside a container image, which is then deployed and managed by a service orchestration framework (e.g. Kubernetes). While this pattern works well for organisations deploying a couple of models, it does not scale well, as there is a one-to-one mapping between a deployed container and an ML model being served.

The single-model-serving scenario also runs into technological limits. For example, Kubernetes currently limits the number of pods per node to 110 by default. Deploying 20,000 single-pod ML models would therefore need at least 182 nodes (20,000 / 110, rounded up), or roughly 200 in round numbers. The smallest node in Google Cloud is e2-micro (1 GB memory), so the total system would require on the order of 200 GB of memory. In practice the memory requirement is likely to be far greater, as it is not possible to run hundreds of model inference server pods on such a small node.
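
The node count quoted above comes from a ceiling division, which can be checked in a few lines. This is a minimal sketch of the arithmetic only, assuming one pod per model and ignoring system pods that also consume pod slots; the function and constant names are illustrative. The exact minimum is 182 nodes, which is where the round figure of 200 comes from.

```python
import math

PODS_PER_NODE = 110   # per-node pod limit quoted in the text
NODE_MEMORY_GB = 1    # e2-micro memory, as quoted in the text


def nodes_required(num_models: int, pods_per_node: int = PODS_PER_NODE) -> int:
    """Minimum number of nodes when every model runs in its own pod."""
    return math.ceil(num_models / pods_per_node)


nodes = nodes_required(20_000)
min_memory_gb = nodes * NODE_MEMORY_GB  # lower bound: smallest node type everywhere
print(nodes, min_memory_gb)
```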

With multi-model serving, however, the memory footprint of the system is expected to be one order of magnitude smaller by design, as resources are shared at the model level. Multi-model serving has additional benefits, too: it allows for better CPU/GPU sharing, and it does not suffer from the cold-start problem, where the container image has to be downloaded before each ML model can start. This delay is usually in the order of tens of minutes. Multi-model serving also reduces the risk of having to allocate new cloud resources (e.g. GPUs) on demand, as model inference servers are long-lived by design.

Multi-model serving can yield substantial cost savings, as illustrated in the example in the image above. In the single-model-serving example, each model runs in a different container, so the container overhead is paid once per model. In the multi-model serving example, multiple models are deployed into a single container, so that overhead is paid once and shared. The reduction matters because machine learning containers often reserve large amounts of memory, numerous CPUs and, in some cases, expensive GPU / TPU processors.
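
The overhead argument can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the per-server overhead, model size, model count and server count are hypothetical numbers chosen for the example, not measurements from Seldon Deploy Advanced.

```python
def single_model_footprint_mb(num_models: int, server_overhead_mb: int, model_mb: int) -> int:
    """One inference server (container) per model: overhead is paid per model."""
    return num_models * (server_overhead_mb + model_mb)


def multi_model_footprint_mb(
    num_models: int, num_servers: int, server_overhead_mb: int, model_mb: int
) -> int:
    """A few shared inference servers: overhead is paid per server, not per model."""
    return num_servers * server_overhead_mb + num_models * model_mb


# Hypothetical workload: 1,000 small models, 500 MB of overhead per server.
single = single_model_footprint_mb(num_models=1_000, server_overhead_mb=500, model_mb=50)
multi = multi_model_footprint_mb(
    num_models=1_000, num_servers=10, server_overhead_mb=500, model_mb=50
)
print(single, multi, single / multi)
```

With these assumed numbers the shared setup needs a tenth of the memory. The ratio depends entirely on how large the per-server overhead is relative to the models themselves.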

Infrastructure optimisation through Overcommit Functionality

It is usually hard to predict inference usage patterns, and therefore to provision the right infrastructure in advance. The challenge grows with scale, when thousands of machine learning models have to be operationalised. The usual outcome is a setup where resources are over-provisioned, which unnecessarily increases the size and cost of the infrastructure. Instead, by design, Seldon Deploy Advanced scales resources intelligently according to demand while fully accounting for service level objectives, even when models have heterogeneous compute demand.

While multi-model serving and autoscaling help organisations manage infrastructure according to demand, in many cases demand patterns allow for further optimisation, such as “Overcommit” of resources. In other words, ML systems can register more models than the provisioned infrastructure could traditionally serve at once. However, this assumes inference traffic patterns that are complementary in nature, such as different ML models for weekends and weekdays. In these cases we can provision the infrastructure to accommodate one set of models and swap them as required. This is particularly important in resource-constrained environments, such as edge device ML deployments.

With Overcommit, we keep a cache of active models and evict models that have not been used recently to warm storage. When an inference request arrives for an evicted model, the system reactivates the model in order to serve the request. Given that the evicted model lives in warm storage, loading it back is relatively fast and typically imperceptible to end users.
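
The caching behaviour described above resembles a least-recently-used (LRU) cache. The following is a minimal sketch of that idea, not Seldon's implementation: the `OvercommitCache` class, its methods and the simulated warm-storage loader are all hypothetical names introduced for illustration.

```python
from collections import OrderedDict
from typing import Callable, List


class OvercommitCache:
    """Registers more models than fit in memory; keeps only the most recent active."""

    def __init__(self, capacity: int, load_from_warm_storage: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_warm_storage
        self._active: "OrderedDict[str, object]" = OrderedDict()
        self.registered = set()

    def register(self, name: str) -> None:
        # Registering does not load the model; it only makes it servable.
        self.registered.add(name)

    def get(self, name: str) -> object:
        if name not in self.registered:
            raise KeyError(name)
        if name in self._active:
            self._active.move_to_end(name)        # mark as most recently used
        else:
            if len(self._active) >= self.capacity:
                self._active.popitem(last=False)  # evict the least recently used model
            self._active[name] = self._load(name)  # reactivate from warm storage
        return self._active[name]

    def active_models(self) -> List[str]:
        return list(self._active)


loads = []  # records every load from the (simulated) warm storage


def load_from_warm_storage(name: str) -> str:
    loads.append(name)
    return f"<model {name}>"


cache = OvercommitCache(capacity=2, load_from_warm_storage=load_from_warm_storage)
for model_name in ("weekday", "weekend", "holiday"):
    cache.register(model_name)

for request in ("weekday", "weekend", "weekday", "holiday"):
    cache.get(request)
print(cache.active_models(), loads)
```

Three models are registered against a capacity of two. The request for `holiday` evicts `weekend`, the least recently used model, and a later request for `weekend` would trigger a reload from warm storage.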

Full Control of your ML Systems through Data-centric Pipelines

The industry has been moving from model-centric to data-centric machine learning. The primary driver for this is that any machine learning powered application or service consists of non-trivial pipelines with multiple stages and complex dependencies. When something goes wrong at a particular point, the issue propagates throughout the rest of the machine learning system. It is also increasingly common for organisations to reuse similar or identical machine learning components across multiple use cases. For example, NLP pipelines may use the exact same preprocessing or tokenization steps. Running duplicate instances of such components, which can be large models in their own right, for every use case can become prohibitively expensive.

The Seldon Deploy Advanced platform tackles these issues by introducing a data-centric pipeline infrastructure that enables ML and MLOps practitioners to define the flow of data throughout and across their productionised ML systems. This data-centric infrastructure provides asynchronous stream-processing-based data flow, which ensures fully reproducible audits of data across every stage of the pipeline. This also allows complex configurations where inference pipelines can be configured to reuse the same ML models deployed across highly scalable infrastructure. These data-centric features not only provide value by reducing costs through re-usability, but also introduce a higher degree of reproducibility and traceability, whilst providing flexible and rich interfaces to extend, version and manage complex machine learning systems.
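
To illustrate the two ideas above, component reuse and a per-stage audit trail, here is a toy sketch using `asyncio`. It is not the Seldon Deploy Advanced pipeline API: the step registry, `run_pipeline` and the step functions are hypothetical, and a production system would use a stream-processing backend rather than in-process coroutines.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Step = Callable[[Any], Awaitable[Any]]
AuditRecord = Tuple[str, str, Any, Any]  # (pipeline, step, input, output)


async def tokenize(text: str) -> List[str]:
    return text.lower().split()


async def count_tokens(tokens: List[str]) -> int:
    return len(tokens)


async def longest_token(tokens: List[str]) -> str:
    return max(tokens, key=len)


# Each step is deployed once and referenced by name from any pipeline.
STEPS: Dict[str, Step] = {
    "tokenize": tokenize,
    "count_tokens": count_tokens,
    "longest_token": longest_token,
}


async def run_pipeline(
    name: str, step_names: List[str], payload: Any, audit_log: List[AuditRecord]
) -> Any:
    for step_name in step_names:
        result = await STEPS[step_name](payload)
        audit_log.append((name, step_name, payload, result))  # record every stage
        payload = result
    return payload


async def main() -> Tuple[int, str, List[AuditRecord]]:
    audit_log: List[AuditRecord] = []
    text = "Seldon Deploy Advanced"
    # Two pipelines share the same "tokenize" step.
    length, longest = await asyncio.gather(
        run_pipeline("length", ["tokenize", "count_tokens"], text, audit_log),
        run_pipeline("longest", ["tokenize", "longest_token"], text, audit_log),
    )
    return length, longest, audit_log


length, longest, audit_log = asyncio.run(main())
print(length, longest, len(audit_log))
```

Because every stage writes its input and output to the audit log, the data that flowed through either pipeline can be replayed or inspected afterwards, which is the property the text refers to as reproducible audits.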

Discoverability of Production MLOps Assets through Model Catalogue

As MLOps systems scale, organisations may lose sight of the risks and opportunities in their production machine learning systems. This can result in issues such as duplicated effort, lack of visibility of operational ML services, unmanaged risks and many more unknown unknowns. The Seldon Deploy Advanced platform addresses these challenges by making the metadata of production ML systems discoverable. This provides full visibility of the “ML assets” that have been operationalised, together with metadata such as the source where the model was trained, unique identifiers and extra tags, which supplies the lineage and auditability needed for complex compliance requirements.

The Seldon Deploy Advanced platform adds extended capabilities to provide visibility not only on machine learning models deployed, but also on the advanced monitoring components, such as drift detectors, outlier detectors and AI explainers. This provides flexible metadata functionality with rich programmatic interfaces that enable organisations to synchronise with their internal “centralised” metadata management systems, which allows them to build their respective taxonomies, ensure accountability, and manage risk at scale.
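
As a rough picture of what such a catalogue holds, the sketch below models entries with the metadata fields mentioned above (unique identifier, training source, tags) plus a simple lookup. The `CatalogueEntry` and `Catalogue` classes, the field names and the example values are hypothetical and do not reflect the actual Seldon Deploy Advanced interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogueEntry:
    uid: str
    name: str
    kind: str              # e.g. "model", "drift-detector", "outlier-detector", "explainer"
    training_source: str   # where the asset was trained (hypothetical URIs below)
    tags: Dict[str, str] = field(default_factory=dict)


class Catalogue:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogueEntry] = {}

    def register(self, entry: CatalogueEntry) -> None:
        if entry.uid in self._entries:
            raise ValueError(f"duplicate uid: {entry.uid}")
        self._entries[entry.uid] = entry

    def find(self, kind: Optional[str] = None, **tags: str) -> List[CatalogueEntry]:
        """Return entries matching the kind (if given) and all given tags."""
        return [
            entry
            for entry in self._entries.values()
            if (kind is None or entry.kind == kind)
            and all(entry.tags.get(key) == value for key, value in tags.items())
        ]


catalogue = Catalogue()
catalogue.register(CatalogueEntry(
    "a1", "fraud-classifier", "model", "s3://example-bucket/fraud/run-12", {"team": "fraud"}))
catalogue.register(CatalogueEntry(
    "a2", "fraud-drift", "drift-detector", "s3://example-bucket/fraud/run-12", {"team": "fraud"}))
catalogue.register(CatalogueEntry(
    "b1", "churn-model", "model", "s3://example-bucket/churn/run-3", {"team": "growth"}))

fraud_assets = [entry.name for entry in catalogue.find(team="fraud")]
detectors = [entry.name for entry in catalogue.find(kind="drift-detector")]
print(fraud_assets, detectors)
```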

Try Out Seldon Deploy Advanced!

Set up a free trial today to test the new Seldon Deploy Advanced capabilities for yourself.
