Anomaly Detection in Machine Learning

Anomaly detection plays an important role at every stage of the machine learning lifecycle. Developing a machine learning model usually requires a large volume of high quality training data. Generally, the more high quality data available, the more accurate the model will be. Anomaly detection is used early in the machine learning process to help clean and refine the training data used by the model. Outliers may skew the training data and affect the overall accuracy of the model, so once detected these deviations can be resolved.

Beyond the model production phase, anomaly detection is often a key part of the deployed machine learning system itself. It is an integral part of machine learning solutions across many different sectors, whether detecting fraudulent activity in the financial sector or monitoring product quality. Machine learning anomaly detection goes beyond what is manually possible, as the model will usually process vast ranges of data. Models can also take into account complex features and behaviours when performing anomaly detection. This way, models can be trained to monitor for anomalous behaviour or trends.

Anomaly detection in machine learning includes different approaches to model development, depending on the type of data. Models will either be trained on labelled data or more commonly unlabelled, raw data sets. When trained on labelled data, models will monitor for outliers outside of the defined threshold for normal data. When trained on unlabelled data, a model will cluster the raw data into categories, and identify outliers which sit outside of the clusters. In both circumstances, the model understands what is within a normal threshold of behaviour, and will identify anomalous behaviour or data that is different. 

This guide explores the basics of anomaly detection in machine learning, including what it is, how it’s used, and techniques for anomaly detection. 

What is anomaly detection in machine learning? 

Anomaly detection in machine learning is the process of identifying anomalies or outliers in a dataset. Anomalies are unusual data points which are significantly different to the wider trends in the rest of the data set. They are unexpected deviations from the expected outcome. Anomaly detection will usually lead to an intervention. This could be an action to clean the dataset or troubleshooting the cause of the anomaly. In the case of fraud detection models, anomaly detection may trigger a bank account freeze and human intervention and escalation. 

Anomaly detection in machine learning is an important topic because models are so reliant on high quality data. Anomalies or outliers can skew the quality of this training data, as machine learning models are developed to understand the relationship between data points. Outliers may affect the accuracy of the model by altering patterns learned by the model. 

Sometimes models can be overfit to training data too, which lowers the model’s ability to generalise when facing new or unseen data. An anomaly in this case may be a sign that the model itself should be retrained, or that a data scientist must intervene. For example, if a model was trained without a specific subset of demographic data, a relatively normal data point may be flagged as an anomaly when the model encounters a group unrepresented in the training data. In this case the model would need to be retrained to account for the bias.

Machine learning is increasingly being utilised to automate anomaly detection in a range of sectors. Models can screen huge arrays of data for outliers, flagging any anomalies for intervention. An example would be detection of suspicious account behaviour in the banking sector. In this case, anomaly detection in machine learning will flag unusual account behaviour which goes beyond the expected thresholds of normal behaviour.

Benefits of machine learning anomaly detection

Anomaly detection has historically been performed manually, but machine learning techniques are increasingly making anomaly detection more efficient and effective. Machine learning is being used to monitor datasets to identify anomalies and resolve issues with data quality. As the use of digital tools and apps increases, so too does the amount of data that is processed and stored. Although outliers are rare, any large array of data may include anomalies. This makes processes for anomaly detection important. Anomaly detection and resolution can be leveraged to improve the quality of datasets.

Manually checking for anomalies in this wealth of data would be impossible at scale. Although algorithms designed by human coders may streamline this manual process, this approach would have limitations. The nature of live data is always evolving, even if only gradually, because of complex external factors. A static anomaly detection algorithm would become ineffective if the definition of an anomaly changes over time, as the behaviours that shape a dataset change. Machine learning anomaly detection is therefore a powerful solution, as it can adapt and evolve from the data itself. By keeping track of types of model drift like concept drift, models can be refit and realigned to stay accurate.
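The difference between a static rule and one that adapts to the data can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production drift monitor: the function names, the window size of 20, the cut-off of three standard deviations, and the synthetic signal are all invented for the example.

```python
from collections import deque
from statistics import mean, stdev


def static_anomalies(values, baseline_size=20, z_threshold=3.0):
    """Flag points using a baseline that is computed once and never updated."""
    baseline = values[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    return [
        i for i, x in enumerate(values[baseline_size:], start=baseline_size)
        if abs(x - mu) / sigma > z_threshold
    ]


def rolling_anomalies(values, window=20, z_threshold=3.0, min_history=5):
    """Flag points using a baseline recomputed from the most recent normal values."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) >= min_history:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > z_threshold:
                flagged.append(i)
                continue  # keep the flagged point out of the baseline
        history.append(x)
    return flagged


# Synthetic signal (an assumption): slow upward drift, a small oscillation,
# and one genuine spike at index 40.
values = [100 + 0.5 * i + (1 if i % 2 else -1) for i in range(60)]
values[40] += 50

static_flags = static_anomalies(values)
rolling_flags = rolling_anomalies(values)
print(len(static_flags), rolling_flags)
```

On this signal the static rule ends up flagging a large share of perfectly ordinary later points, because the data has drifted away from the original baseline, while the rolling rule isolates only the spike. A rolling window like this copes with gradual drift; a sudden, permanent shift in level would need separate handling, such as refitting the baseline.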

The benefits of machine learning anomaly detection include: 

  • The ability to process a huge array of data.
  • Scalability beyond what could ever be possible with manual anomaly detection methods.
  • Automation that makes processes more efficient, especially in the case of unsupervised anomaly detection.
  • The flexibility to be adapted and refined depending on the data.
  • Use as a tool to detect model drift or training bias.

How is anomaly detection in machine learning used?

Machine learning anomaly detection has a range of applications in different settings. In most cases anomaly detection will lead to an intervention. At its core, anomaly detection must define what is within the thresholds of normal data or behaviour. Anomaly detection techniques can be used in the discovery phase to cluster unlabelled data into groupings to define the threshold of ‘normal’ data distribution. Any data points that sit outside these boundaries can be flagged as an anomaly. 
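As a very simple illustration of defining a boundary for "normal" data and flagging whatever falls outside it, the sketch below uses the interquartile range (Tukey fences). This is a classical statistical rule rather than a trained model, and the function name, the multiplier of 1.5, and the sample figures are assumptions made for the example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Hypothetical daily transaction counts with one unusual day.
daily_counts = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers = iqr_outliers(daily_counts)
print(outliers)
```

The boundaries here come entirely from the data itself, which is the same principle a learned model applies with far richer features.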

Machine learning anomaly detection can also be developed through labelled data in a range of formats. Labels will include what is normal and what are examples of outliers. The model can then identify defects or issues in new data based on these defined features.   

Anomaly detection in machine learning is often used to: 

  • Clean and prepare datasets. 
  • Detect fraud in banking and financial settings. 
  • Identify defects in products. 

Clean and prepare datasets

A common task for unsupervised machine learning models is to cluster or categorise unlabelled datasets. Hyperparameters such as the number of clusters will be set by the data scientist, but features that make up the cluster points will be learned by the model from the data. Anomaly detection will be a natural part of the process, where any data points that sit too far beyond the clusters can be identified and resolved.  
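A minimal sketch of this idea, assuming two-dimensional points and a hand-written k-means, is shown below. The function names, the evenly spaced choice of starting centroids, and the rule of flagging anything further than three times the median distance from its centroid are all simplifications chosen for illustration; a real project would more likely use an established library implementation.

```python
from math import dist
from statistics import median


def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, cluster index for each point)."""
    # Simplified, deterministic start: pick k points spread evenly through the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def far_from_clusters(points, k, factor=3.0):
    """Return points whose distance to their centroid is unusually large."""
    centroids, labels = kmeans(points, k)
    distances = [dist(p, centroids[label]) for p, label in zip(points, labels)]
    cutoff = factor * median(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]


# Two tight groups and one stray point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
stray = far_from_clusters(points, k=2)
print(stray)
```

The number of clusters, k, is the hyperparameter set by the data scientist, while the centroids are learned from the data. Note that the stray point pulls its centroid towards itself, which is one reason flagged points are often removed and the clustering refitted.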

Detecting fraud in banking and financial settings

Flagging irregular behaviour within live data is a key use of anomaly detection in machine learning. Within machine learning in finance, models are utilised to automatically detect fraudulent or suspicious account activity so that effective action can be taken. This is achieved by firstly understanding and learning the boundaries of normal account behaviour. Anomalies in geolocation of payments or spending behaviour are all elements that could trigger intervention. The same approach to anomaly detection is taken in different sectors too. For example a cybersecurity solution powered by machine learning may monitor for suspicious network activity using the same concepts. 

Identifying defects in products

Anomaly detection techniques can also be used to identify predetermined anomalies or outliers in file types such as images. The use of machine learning anomaly detection to identify product defects in this way will rely on labelled training data. Through a process of supervised anomaly detection, the model learns what constitutes normal data points and outliers. A system can then use computer vision to monitor a production line and send an alert if a design anomaly is identified. 

Three techniques for machine learning anomaly detection

There is a range of techniques and approaches for anomaly detection in machine learning. These techniques can be grouped into three general approaches. Each approach includes specific outlier detection and analysis algorithms and methods, depending on factors like the type of data. Overall, every technique shares the same assumption about anomalies: that they are rare and significantly different to the features of normal data points.

The three most common techniques for machine learning anomaly detection are:

  • Unsupervised anomaly detection in machine learning 
  • Supervised anomaly detection in machine learning 
  • Semi-supervised anomaly detection in machine learning 

Unsupervised anomaly detection

Much like unsupervised machine learning techniques, unsupervised anomaly detection deals with unlabelled data. Anomalies are identified by exploring the trends or patterns within the dataset itself, then detecting data points that sit outside these patterns. For example, a model may cluster unlabelled data into a specific number of groupings or categorisations based on the relationships between data points. Individual data points that sit beyond a threshold of a cluster are identified as anomalies or outliers. Unsupervised anomaly detection is generally the most used type of anomaly detection technique, because unlabelled data is much more common.
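Clustering is one way to find structure in unlabelled data. A closely related unsupervised idea, sketched below, needs no cluster count at all: each point is scored by the distance to its k-th nearest neighbour, and points with unusually large scores are treated as isolated. The function names, the choice of k=2, and the cut-off of three times the median score are assumptions made for this example.

```python
from math import dist
from statistics import median


def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        neighbour_distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(neighbour_distances[k - 1])
    return scores


def isolated_points(points, k=2, factor=3.0):
    """Return points whose neighbour-distance score is far above the median."""
    scores = knn_distance_scores(points, k)
    cutoff = factor * median(scores)
    return [p for p, s in zip(points, scores) if s > cutoff]


# Unlabelled, made-up data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
isolated = isolated_points(points)
print(isolated)
```

No labels are used anywhere: the notion of "normal" comes only from how densely the points sit together.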

Supervised anomaly detection

Supervised anomaly detection relies on labelled datasets which highlight examples of normal data and examples of outliers. The model can then learn how to identify anomalies in new and unseen data. Examples include detecting anomalies in images within diagnostic tools for machine learning in the healthcare sector. Models can be trained to identify outliers in x-rays or other patient data.
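The supervised approach can be sketched with a tiny nearest-neighbour classifier. Everything in the example is hypothetical: the two numeric features stand in for whatever measurements a real system would extract, the labels "normal" and "defect" are placeholders, and k=3 is an arbitrary choice. Real image-based systems would use far richer features and models.

```python
from collections import Counter
from math import dist


def knn_predict(training, features, k=3):
    """Label a new item by majority vote among its k nearest labelled examples."""
    nearest = sorted(training, key=lambda example: dist(example[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled measurements: (width_mm, weight_g) -> label.
training = [
    ((10.0, 5.0), "normal"),
    ((10.1, 5.1), "normal"),
    ((9.9, 4.9), "normal"),
    ((10.2, 5.0), "normal"),
    ((12.5, 6.8), "defect"),
    ((12.8, 7.1), "defect"),
    ((7.2, 3.1), "defect"),
]

print(knn_predict(training, (10.05, 5.0)))
print(knn_predict(training, (12.6, 7.0)))
```

Because outliers are rare by definition, labelled anomaly examples are usually heavily outnumbered by normal ones, which is worth keeping in mind when choosing and evaluating a supervised model.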

Semi-supervised anomaly detection

Semi-supervised anomaly detection is a blend of the unsupervised and supervised approaches. It is usually applied when labelled examples of normal data are available but there are no labelled outliers. The model will learn the trends of the normal data from the labelled training data, and identify anomalies that sit beyond this threshold in the unlabelled data.
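A minimal sketch of this pattern follows, assuming numeric features: a simple profile (mean and standard deviation per feature) is learned only from rows known to be normal, then used to screen unlabelled rows. The function names, the threshold of three standard deviations, and the sample readings are assumptions for illustration, and the sketch assumes every feature varies at least slightly in the normal data.

```python
from statistics import mean, stdev


def fit_normal_profile(normal_rows):
    """Learn a per-feature mean and standard deviation from normal-only data."""
    columns = list(zip(*normal_rows))
    return [(mean(col), stdev(col)) for col in columns]


def is_anomalous(row, profile, z_threshold=3.0):
    """True if any feature lies more than z_threshold deviations from normal."""
    return any(abs(x - mu) / sigma > z_threshold for x, (mu, sigma) in zip(row, profile))


# Hypothetical sensor readings labelled as normal: (temperature, pressure).
normal_rows = [(50.0, 1.0), (52.0, 1.1), (48.0, 0.9), (51.0, 1.0), (49.0, 1.0)]
profile = fit_normal_profile(normal_rows)

# Unlabelled readings to screen.
unlabelled = [(50.5, 1.05), (58.0, 1.0), (49.0, 0.95), (50.0, 1.6)]
flagged = [row for row in unlabelled if is_anomalous(row, profile)]
print(flagged)
```

No example of an outlier was needed to build the profile; anything sufficiently unlike the normal data is flagged.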

Machine learning deployment and anomaly detection for every organisation

Seldon moves machine learning from POC to production to scale, reducing time-to-value so models can get to work up to 85% quicker. In this rapidly changing environment, Seldon can give you the edge you need to supercharge your performance.

With Seldon Deploy, your business can efficiently manage and monitor machine learning, minimise risk, and understand how machine learning models impact decisions and business processes. This means you know your team has done its due diligence in creating a more equitable system while boosting performance.

Deploy machine learning in your organisation effectively and efficiently. Talk to our team about machine learning solutions today.
