Decision trees in machine learning are a common way of representing a decision-making process through a branching, tree-like structure. They are often used to plan and plot business and operational decisions as a visual flowchart. Decisions branch out and end at outcomes, which gives the model its tree-like shape.
In machine learning, a decision tree gives the model its structure. A decision tree algorithm splits the dataset on feature values, choosing each split with a cost function. The tree is grown first and then optimised by removing branches that rely on irrelevant features, a process called pruning. Hyperparameters such as the maximum depth of the tree can also be set, to lower the risk of overfitting or an overly complex tree.
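As a rough illustration of splitting on a cost function, the sketch below scores every candidate split with Gini impurity, one common choice of cost function, and keeps the cheapest. The function names (`gini`, `split_cost`, `best_split`) and the toy customer data are assumptions made for this example, not part of any particular library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_cost(left, right):
    """Size-weighted Gini impurity of the two groups produced by a split."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

def best_split(rows, labels):
    """Try every feature/threshold pair; return (cost, feature_index, threshold)."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples down both branches
            cost = split_cost(left, right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

# Hypothetical toy data: [age, income in thousands] -> did the customer buy?
rows = [[25, 30], [32, 60], [47, 35], [51, 80], [62, 40]]
labels = ["no", "yes", "no", "yes", "no"]

print(best_split(rows, labels))  # -> (0.0, 1, 40)
```

On this toy data the cheapest split is on income (feature index 1) at a threshold of 40, which separates the two classes perfectly and so has a cost of 0.0.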
Most decision trees in machine learning are used for classification problems, to categorise objects against learned features. The approach can also be used for regression problems, as a method of predicting continuous outcomes from unseen data. The main benefit of using a decision tree in machine learning is its simplicity, as the decision-making process is easy to visualise and understand. However, decision trees can become overly complex by generating very granular branches, so pruning of the tree structure is often a necessity.
This guide explores decision trees in machine learning, including the benefits and drawbacks of the approach, and the different types of decision trees in machine learning.
What is a decision tree in machine learning?
Decision trees are a way of modelling decisions and outcomes, mapping decisions in a branching structure. They are used to estimate the potential success of different series of decisions made to achieve a specific goal. The concept of a decision tree existed long before machine learning, as it can be used to manually model operational decisions like a flowchart. Decision trees are commonly taught and used in business, economics and operations management as an approach to analysing organisational decision making.
Decision trees are a form of predictive modelling, helping to map the different decisions or solutions to a given outcome. Decision trees are made up of different nodes. The root node is the start of the decision tree, which usually represents the whole dataset in machine learning. Leaf nodes are the endpoints of branches, or the final output of a series of decisions. The decision tree won’t branch any further from a leaf node. With decision trees in machine learning, the internal nodes test features of the data and the leaf nodes hold the outcome.
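The node types described above can be sketched as a small data structure. This is a minimal, hand-built example rather than a trained model: the class names `Leaf` and `Internal`, the `predict` helper, and the income and age features are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    """Endpoint of a branch: holds the final output and never splits further."""
    prediction: str

@dataclass
class Internal:
    """Internal node: tests one feature and routes the sample to a child node."""
    feature: str
    threshold: float
    left: Union["Internal", Leaf]   # taken when sample[feature] <= threshold
    right: Union["Internal", Leaf]  # taken when sample[feature] > threshold

def predict(node, sample):
    """Walk from the root down to a leaf and return that leaf's output."""
    while isinstance(node, Internal):
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction

# Hypothetical hand-built tree; the root is simply the first node visited.
root = Internal(
    feature="income", threshold=40,
    left=Leaf("no"),
    right=Internal(feature="age", threshold=60, left=Leaf("yes"), right=Leaf("no")),
)

print(predict(root, {"income": 55, "age": 30}))  # -> yes
```

A prediction is a single walk from the root to one leaf, so the cost of classifying a sample depends on the depth of the tree rather than on the size of the training data.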
Decision trees are an approach used in supervised machine learning, a technique which uses labelled input and output datasets to train models. The approach is used mainly to solve classification problems, which is the use of a model to categorise or classify an object. Decision trees in machine learning are also used in regression problems, an approach used in predictive analytics to forecast outputs from unseen data.
Decision trees are popular in machine learning as they are a simple way of structuring a model. The tree-like structure also makes it simple to understand the decision-making process of the model. Explainability, the process of explaining a model’s output to a human, is an important consideration in machine learning. The strength of machine learning is the optimisation of a task without direct human control, which often makes it difficult to explain a given model’s output. The reasoning behind a model’s decision-making process is clearer when the model uses a decision tree structure, because each decision branch can be observed.
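To show what observing each decision branch might look like in practice, the sketch below records every test a sample passes through on its way to a leaf. The nested-dictionary tree layout, the `explain` function and the feature names are assumptions for this example only.

```python
def explain(tree, sample):
    """Return the prediction plus the list of tests that led to it."""
    steps = []
    node = tree
    while "prediction" not in node:
        value = sample[node["feature"]]
        went_left = value <= node["threshold"]
        op = "<=" if went_left else ">"
        steps.append(f"{node['feature']} = {value} {op} {node['threshold']}")
        node = node["left"] if went_left else node["right"]
    return node["prediction"], steps

# Hypothetical hand-built tree stored as nested dictionaries.
tree = {
    "feature": "income", "threshold": 40,
    "left": {"prediction": "no"},
    "right": {
        "feature": "age", "threshold": 60,
        "left": {"prediction": "yes"},
        "right": {"prediction": "no"},
    },
}

prediction, steps = explain(tree, {"income": 55, "age": 30})
print(prediction)
for step in steps:
    print("  because", step)
```

For this sample the recorded path is "income = 55 > 40" followed by "age = 30 <= 60", ending at the "yes" leaf. The full reasoning behind the output is just that short list of comparisons.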
Different types of decision tree in machine learning
Most models are part of the two main approaches to machine learning, supervised or unsupervised machine learning. The main difference between these approaches is in the condition of the training data and the problem the model is deployed to solve. Supervised machine learning models will generally be used to classify objects or data points as in facial recognition software, or to predict continuous outcomes as in stock forecasting tools. Unsupervised machine learning models are mainly used to cluster data into groupings of similar data points, or to discover association rules between variables as in automated recommendation systems.
Decision trees are used in supervised machine learning. The approach can be used to solve both regression and classification problems, so the two main types of decision trees in machine learning are known as classification trees and regression trees. Classification trees are the more common of the two. The main difference is in the type of problem and data. Classification trees are used for decisions such as yes or no, with a categorical decision variable. Regression trees are used for a continuous outcome variable such as a number.
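The difference between the two types shows up most clearly at the leaves. As a minimal sketch, assuming the common convention that a classification leaf predicts the most frequent class among its training samples and a regression leaf predicts their mean, the two cases look like this (the function names are hypothetical):

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Categorical target: predict the most common class in the leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Continuous target: predict the mean of the values in the leaf."""
    return mean(values)

print(classification_leaf(["yes", "no", "yes"]))  # -> yes
print(regression_leaf([200.0, 220.0, 240.0]))     # -> 220.0
```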
What is a classification tree?
Classification problems are the most common use of decision trees in machine learning. Classification is a supervised machine learning problem, in which the model is trained to decide whether data belongs to a known class. Models are trained to assign class labels to processed data. The classes are learned by the model through processing labelled training data in the training part of the machine learning model lifecycle.
To solve a classification problem, a model must understand the features that categorise a data point into the different class labels. In practice, a classification problem can occur in a range of settings. Examples may include the classification of documents, image recognition software, or email spam detection.
A classification tree is a way of structuring a model to classify objects or data. The leaves, or endpoints of the branches, are the class labels, the point at which the branches stop splitting. The classification tree is generated incrementally, with the overall dataset being broken down into smaller subsets. It is used when the target variable is discrete or categorical, such as yes or no, and branching usually happens through binary partitioning. For example, each node may branch on a yes or no answer. The endpoint of each branch is one category.
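The incremental, binary partitioning described above can be sketched as a short recursive function. This is an illustrative toy rather than a production algorithm: it assumes Gini impurity as the split criterion, numeric features, and a made-up spam dataset, and the names `build` and `classify` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (cost, feature, threshold, left_indices, right_indices) or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if not left or not right:
                continue
            cost = sum(len(part) * gini([labels[i] for i in part])
                       for part in (left, right)) / len(rows)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    return best

def build(rows, labels):
    """Recursively partition the data until each subset holds a single class."""
    if len(set(labels)) == 1:
        return {"label": labels[0]}  # pure subset: stop and create a leaf
    split = best_split(rows, labels)
    if split is None:  # identical rows with mixed labels: fall back to majority
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, feature, threshold, left, right = split
    return {
        "feature": feature, "threshold": threshold,
        "left": build([rows[i] for i in left], [labels[i] for i in left]),
        "right": build([rows[i] for i in right], [labels[i] for i in right]),
    }

def classify(tree, row):
    """Follow the binary tests down to a leaf and return its class label."""
    while "label" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["label"]

# Hypothetical emails described by [number of links, number of exclamation marks].
rows = [[0, 0], [1, 0], [5, 1], [7, 4], [0, 3], [6, 0]]
labels = ["ham", "ham", "spam", "spam", "ham", "spam"]

tree = build(rows, labels)
print(tree)
print(classify(tree, [4, 2]))  # -> spam
```

On this toy data a single split on the number of links (at most 1 versus more than 1) is enough to separate the classes, so the tree has one internal node and two leaves.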
What is a regression tree?
Regression problems are those in which models are designed to forecast or predict a continuous value, such as predicting house prices or stock price changes. Regression is a technique used to train a model to understand the relationship between independent variables and an outcome. Regression models are trained on labelled training data, so sit within the supervised type of machine learning algorithm.
Machine learning regression models are trained to learn the relationship between output and input data. Once the relationship is understood, the model can be used to forecast outcomes from unseen input data. The use case for these models is to predict future or emerging trends in a range of settings, but also to fill gaps in historic data. Examples of a regression model may include forecasting house prices, future retail sales, or portfolio performance in machine learning for finance.
Decision trees in machine learning which deal with continuous outputs or values are regression trees. Much like a classification tree, the dataset is incrementally broken down into smaller subsets. Each leaf of a regression tree groups similar training data points and holds a predicted value, typically the average outcome of that group, which is then applied to new and unseen data points that reach the leaf. It’s worth noting that regression trees can be less accurate than other techniques for predicting continuous numerical outputs.
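A single regression split can be sketched as follows. The example assumes the sum of squared errors around the mean as the cost of a group, a common choice for regression trees, and uses made-up house price data. The names `sse`, `best_threshold` and `predict` are hypothetical.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors when the group's mean is used as the prediction."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)

def best_threshold(xs, ys):
    """Pick the threshold on one feature that minimises the children's total SSE."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best

# Hypothetical data: floor area in square metres -> price in thousands.
areas = [50, 60, 70, 120, 130, 140]
prices = [150.0, 160.0, 170.0, 300.0, 310.0, 320.0]

cost, threshold, left_mean, right_mean = best_threshold(areas, prices)

def predict(area):
    """A one-split regression tree: each side predicts its training mean."""
    return left_mean if area <= threshold else right_mean

print(threshold, left_mean, right_mean)  # -> 70 160.0 310.0
print(predict(65), predict(125))         # -> 160.0 310.0
```

Every area on the same side of the threshold receives the same prediction, so the output is a step function rather than a smooth curve. This is one reason a regression tree can be less accurate than other techniques on continuous targets.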
Benefits of decision trees in machine learning
Decision trees are a popular approach in machine learning for good reason. The resulting decision tree is straightforward to understand because of its visualisation of the decision process. This streamlines the process of explaining a model’s output to stakeholders without specialised knowledge of data analytics. Non-specialist stakeholders can access and understand the visualisation of the model and data, so the data is accessible to diverse business teams. The reasoning or logic of the model can therefore be understood clearly. Explainability can be a barrier to adopting machine learning within organisations, so this is a clear benefit for the use of decision trees in machine learning.
Another benefit is in the data preparation phase for decision tree machine learning models. Decision tree models require less data cleaning in comparison to other approaches to machine learning models. Namely, decision trees avoid the need for data normalisation in the early phase of the machine learning process. Decision tree models can process both categorical and numerical data, so qualitative variables won’t need to be transformed as in other techniques.
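One way to see why normalisation is unnecessary is that a threshold split depends only on the ordering of the feature values, not on their scale. The sketch below, which assumes Gini impurity as the split criterion and uses made-up income data, checks that the best split groups the samples identically before and after min-max scaling. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_partition(values, labels):
    """Return the sample indices sent left by the best threshold split."""
    best = None
    for threshold in sorted(set(values))[:-1]:
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        cost = (len(left) * gini([labels[i] for i in left])
                + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or cost < best[0]:
            best = (cost, left)
    return best[1]

# Hypothetical raw incomes and the same feature after min-max scaling to [0, 1].
incomes = [30_000, 60_000, 35_000, 80_000, 40_000]
labels = ["no", "yes", "no", "yes", "no"]
low, high = min(incomes), max(incomes)
scaled = [(v - low) / (high - low) for v in incomes]

print(best_partition(incomes, labels))  # -> [0, 2, 4]
print(best_partition(incomes, labels) == best_partition(scaled, labels))  # -> True
```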
The main benefits of decision trees in machine learning include:
- Straightforward to understand, even by stakeholders without technical data knowledge.
- Intrinsically explainable, as any decision the model makes can be traced through its branches. This is in contrast to black box algorithms, where explainability becomes difficult.
- Data doesn’t need normalisation, and the technique can process both numerical and categorical variables.
- Can be used to understand the hierarchy of features within a dataset, which can be pruned and refined in future modelling.
Drawbacks of decision trees in machine learning
One of the main drawbacks of using decision trees in machine learning is the issue of overfitting. An aim of machine learning models is to achieve a reliable degree of generalisation, so the model can accurately process unseen data once deployed. Overfitting is when a model is fit too closely to the training data, so may become less accurate when encountering new data or predicting future outcomes.
Overfitting can be a major issue with decision trees, which can often become very complex and oversized. The process of pruning is needed to refine decision trees and reduce the potential for overfitting. Pruning removes branches and nodes of the tree that are irrelevant to the model’s aims, or those that provide no extra information. Any pruning should be measured through the process of cross-validation in machine learning, which can evaluate the model’s ability to function or its accuracy in a live environment.
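As a rough sketch of pruning, the example below grows a full tree on noisy training data and then applies reduced-error pruning, one of several pruning strategies: a subtree is collapsed into a leaf whenever doing so does not increase the error on held-out data. For brevity it uses a single validation set rather than full cross-validation, and the data, the Gini split criterion and all function names are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build(data):
    """Grow a full tree from (features, label) pairs until the leaves are pure."""
    labels = [y for _, y in data]
    # Every node remembers its majority class so it can later become a leaf.
    node = {"label": Counter(labels).most_common(1)[0][0]}
    if len(set(labels)) == 1:
        return node
    best = None
    for f in range(len(data[0][0])):
        for t in sorted({x[f] for x, _ in data})[:-1]:
            left = [d for d in data if d[0][f] <= t]
            right = [d for d in data if d[0][f] > t]
            cost = (len(left) * gini([y for _, y in left])
                    + len(right) * gini([y for _, y in right])) / len(data)
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:  # identical features with mixed labels: cannot split
        return node
    _, f, t, left, right = best
    node.update(feature=f, threshold=t, left=build(left), right=build(right))
    return node

def predict(node, x):
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def errors(node, data):
    """Number of samples in data that the (sub)tree misclassifies."""
    return sum(predict(node, x) != y for x, y in data)

def prune(node, validation):
    """Bottom-up reduced-error pruning. Modifies the tree in place."""
    if "feature" not in node:
        return node
    f, t = node["feature"], node["threshold"]
    node["left"] = prune(node["left"], [d for d in validation if d[0][f] <= t])
    node["right"] = prune(node["right"], [d for d in validation if d[0][f] > t])
    leaf = {"label": node["label"]}
    # Keep the simpler leaf unless the subtree is strictly better on held-out data.
    return leaf if errors(leaf, validation) <= errors(node, validation) else node

def count_leaves(node):
    if "feature" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Hypothetical one-feature data; the training point at 3 is label noise.
train = [([1], "a"), ([2], "a"), ([3], "b"), ([4], "a"),
         ([6], "b"), ([7], "b"), ([8], "b")]
validation = [([1.5], "a"), ([2.5], "a"), ([3.5], "a"), ([5], "b"), ([7.5], "b")]

full = build(train)
leaves_before, errors_before = count_leaves(full), errors(full, validation)
pruned = prune(full, validation)
print(leaves_before, errors_before)                             # -> 4 1
print(count_leaves(pruned), errors(pruned, validation))         # -> 2 0
```

The unpruned tree creates extra branches just to fit the noisy point at 3, which costs it one mistake on the held-out data. After pruning, only the split that generalises remains.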
Some of the main disadvantages of decision trees in machine learning that need to be considered include:
- Decision trees can grow to be very complex, so they require pruning and optimisation.
- Overfitting is a common issue, again requiring pruning and optimisation of hyperparameters.
- Small tweaks to training data can have a big impact on the decision tree, often resulting in different decision trees being created.
- The approach can be less accurate with regression problems than other machine learning techniques, specifically with continuous numerical outputs.
- Prone to bias if the training dataset isn’t balanced or representative.