Decision Tree in Machine Learning – A Complete Overview

Updated on October 24, 2024


A decision tree is a widely used machine learning approach, particularly for prediction tasks. It breaks a complex decision into a series of stages that are easy to grasp. Thanks to this straightforwardness, it is applied in fields such as healthcare, finance and marketing.

 

In this blog, we will discuss how a decision tree works, its basics and how to use it step by step. We will also cover practical examples of creating decision trees, their types, and a Python implementation.

What is a Decision Tree?

A decision tree is a visual, tree-shaped structure used in machine learning to solve problems based on the given input. It can be thought of as a flowchart where every internal node asks a question about a data feature and every branch represents an answer to that question. Like a tree, it has a root that branches out into other nodes, making it possible to follow the series of decisions made.

 

In machine learning, decision trees are used for both classification and regression tasks. They partition data into branches based on conditions set at each node. For example, in a classification tree, splits may be based on yes/no questions about the data features, and the final prediction is determined by the answer at the last node. In regression trees, the splits are based on numerical values, and the goal is to predict an output that is continuous in nature.

 

One of the most useful qualities of decision trees is their simplicity, which makes them interpretable and understandable. The process behind a particular decision can be followed from the root of the tree down through its branches, so they are well suited for explaining results to non-experts. They can also handle both numerical and categorical variables, which makes them versatile across different use cases.


Important Terms and Concepts in Decision Trees

Several concepts and terms come up repeatedly when working with decision trees. Here are the most important ones:

 

  1. Node: A node is a point in the tree where the data is branched out based on one feature. A decision tree contains different types of nodes: the root node, decision nodes and leaf nodes.
  2. Root Node: This is the very first node in a decision tree. It holds the complete dataset, and from this point the feature providing the best split is chosen.
  3. Leaf Node (Terminal Node): These are the last nodes in the tree, and no further splitting is done on them. They give the output, also known as the target variable, which could be a class label or a numeric prediction.
  4. Decision Node: These are the internal nodes, other than the root, where the data is split further. Each decision node tests one feature and divides the incoming subset of data into smaller portions.
  5. Splitting: This is the process of splitting a dataset at every decision node, based on certain conditions.
  6. Pruning: This is a method used in decision tree construction to cut down the number of branches in the tree. It helps avoid overfitting by eliminating branches that contribute little to predicting the outcome.
  7. Tree Depth: The depth of a tree is the number of levels between the root node and the furthest leaf node. Deeper trees can represent more complex structures, but greater complexity also increases the risk of overfitting.
  8. Entropy: Entropy measures the amount of disorder or impurity in the data. Decision tree algorithms use it to find the best split by measuring how much the entropy changes after the split.
  9. Information Gain: This is the difference in entropy before and after a split. It helps in selecting the best feature for splitting at each decision node by maximising the reduction in entropy.
  10. Gini Index: The Gini index is another metric used to evaluate the quality of a split. It measures the impurity in the data, with lower values indicating better splits. (A short computational sketch of these three measures follows this list.)
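To make these measures concrete, here is a minimal, illustrative sketch of how entropy, the Gini index, and information gain could be computed for a candidate split. The toy labels and the split point are assumptions made only for illustration.

import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index = 1 - sum(p^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting the parent into two children
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# Toy example: class labels before and after a hypothetical split
parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print("Parent entropy:", entropy(parent))
print("Parent Gini:", gini(parent))
print("Information gain of the split:", information_gain(parent, left, right))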

 

Also Read: Introduction to Tree Data Structure

 

Understanding these terms will help you follow the decision-making process in decision trees and grasp how they build predictions step by step.

How Does a Decision Tree Work?

A decision tree works by breaking down a dataset into smaller subsets through a series of decision points, eventually leading to a final prediction. Here’s a structured explanation of how it operates:

 

1. Start at the Root Node:
  • The process begins with the root node, which contains the entire dataset. At this point, the algorithm decides which feature will be used for the first split.

2. Selecting the Best Feature for Splitting:

  • The algorithm evaluates different features to determine the best one for splitting the data. It uses metrics like entropy, information gain, or Gini index to measure the quality of each potential split.
  • The feature that results in the highest information gain or lowest Gini index is chosen for the split.

3. Splitting the Data:

  • Once the best feature is selected, the dataset is divided into smaller subsets based on the values of that feature. Each subset forms a new branch in the tree.
  • For continuous features, conditions like “greater than” or “less than” are used for splitting. For categorical features, the data is divided based on distinct categories.

4. Repeat the Process for Each Subset:

  • The splitting process continues recursively for each subset, with each new subset becoming a new decision node. At each decision node, the algorithm again selects the best feature for further splitting.

5. Reaching the Leaf Nodes:

  • The process continues until one of the stopping conditions is met. This could be when all the data points in a subset belong to the same class, a maximum tree depth is reached, or further splitting does not improve the model significantly.
  • Once a stopping condition is met, the node is marked as a leaf node, providing the final prediction.

6. Pruning (if required):

  • After the tree is built, pruning may be used to remove branches that do not contribute much to the accuracy. This helps in reducing overfitting and improves the model’s generalisation.
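To make this recursive splitting concrete, here is a simplified, illustrative sketch of how a classification tree could be grown by choosing, at each node, the split with the lowest weighted Gini impurity, and stopping at a maximum depth. It is a teaching sketch under those assumptions, not a production implementation.

import numpy as np
from collections import Counter

def gini(y):
    # Impurity of a set of integer class labels
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every feature and threshold; keep the split with the lowest weighted Gini impurity
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left, right = y[X[:, feature] <= threshold], y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    # Stopping conditions: the node is pure or the maximum depth is reached -> leaf node
    if len(set(y)) == 1 or depth == max_depth:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    split = best_split(X, y)
    if split is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, feature, threshold = split
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

Calling build_tree on a small dataset returns a nested dictionary describing the splits. In practice, libraries such as scikit-learn (used in the next section) perform this search far more efficiently and add options such as pruning.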

Python Implementation of a Decision Tree

Follow the steps below to implement a decision tree in Python.

 

  1. Import the Necessary Libraries:
  • Start by importing the required libraries for data handling and the decision tree model.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import datasets

  2. Load the Dataset:

  • For demonstration, we will use the Iris dataset, which is built into scikit-learn.
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

  3. Split the Data into Training and Testing Sets:

  • Split the data into training and testing sets to evaluate the model’s performance.
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

  4. Initialise the Decision Tree Classifier:

  • Create an instance of the DecisionTreeClassifier with default parameters.
# Initialise the Decision Tree Classifier
classifier = DecisionTreeClassifier()

  5. Train the Model:

  • Fit the classifier on the training data.
# Train the model
classifier.fit(X_train, y_train)

  6. Make Predictions on the Test Data:

  • Use the trained model to predict the outcomes for the test data.
# Make predictions
y_pred = classifier.predict(X_test)

  7. Evaluate the Model’s Performance:

  • Calculate the accuracy of the model to see how well it performs on the test data.
# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

  8. Visualise the Decision Tree (Optional):

  • You can visualise the decision tree to understand its structure.
from sklearn import tree
import matplotlib.pyplot as plt

# Plot the decision tree
plt.figure(figsize=(12, 8))
tree.plot_tree(classifier, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

With these steps, you can implement a decision tree in Python and evaluate its accuracy.

 

Here is the full code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn import tree
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialise the Decision Tree Classifier
classifier = DecisionTreeClassifier()

# Train the model
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Plot the decision tree
plt.figure(figsize=(12, 8))
tree.plot_tree(classifier, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()

Types of Decision Trees

Depending on the target variable, decision trees can be broadly classified into two categories: Classification Trees and Regression Trees. Both are explained below:

1. Classification Trees

  • What They Do: Classification trees are used when the target variable is categorical, meaning the output belongs to one of several distinct classes. They are appropriate when the goal is to assign a class label to a particular data point.
  • How They Work: The tree splits the data based on features so as to maximise the separation between classes at each decision node. To evaluate a split, the tree uses measures such as the Gini index or entropy. For example, a classification tree might determine whether someone has a disease based on age, the presence or absence of particular symptoms, and test results.
  • Applications: Classification trees are used in medical diagnosis, email filtering, and market research, wherever data points need to be assigned to the correct class.

2. Regression Trees

  • What They Do: Regression trees are used when the target variable is continuous, i.e., the output is a numeric value. These trees predict a real number rather than assigning a category.
  • How They Work: A regression tree divides the data so as to minimise the difference between predicted and actual values, rather than separating categories. It usually optimises the mean squared error (MSE) or a similar metric when finding splits. For example, when predicting house prices, the tree could split the data on the number of bedrooms, location, and square footage.
  • Applications: Regression trees are widely used in econometrics and forecasting, for example to predict stock prices, house prices or consumer expenditure, where the response is continuous. A short sketch follows below.
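As a minimal, illustrative sketch of a regression tree, here is scikit-learn's DecisionTreeRegressor applied to the built-in diabetes dataset; the dataset and the max_depth value are assumptions chosen only for demonstration.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Continuous target: a disease progression score
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Splits are chosen to minimise the squared error; max_depth limits complexity
regressor = DecisionTreeRegressor(max_depth=4, random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
print(f"Test MSE: {mean_squared_error(y_test, y_pred):.2f}")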

 

Learn more about: Regression Testing – Meaning, Types and Tools

Key Differences Between Classification and Regression Trees:

  • Target Variable: Classification trees are used when the output is categorical (e.g., Yes or No), while regression trees are used when the output is continuous (e.g., predicting a numeric value).
  • Splitting Criteria: Classification trees use the Gini index or entropy, while regression trees generally use MSE or variance reduction.

Advantages and Disadvantages of Decision Trees

Advantages

  1. Simple to Grasp, Simple to Interpret: Decision trees can be understood by someone with no machine-learning experience. The model’s decision-making process can be easily visualised as a flowchart, showing the steps taken to reach a conclusion.
  2. Handles Both Categorical and Numerical Data: Decision trees work with both data types. Numerical features are split using threshold values, while categorical features are split by their distinct categories.
  3. No Feature Scaling Required: Unlike many other algorithms, decision trees do not need input data to be normalised or standardised. They use the feature values directly, so scaling is unnecessary.
  4. Capable of Modelling Nonlinear Relationships: Non-linear relationships between features and the target can be modelled effectively, because repeated splitting lets the tree adapt to different conditions in the data.
  5. Less Data Pre-Processing Involved: Decision trees generally need less pre-processing than other algorithms; steps such as missing-value handling, outlier treatment and feature scaling are less critical.

Disadvantages

  1. Less Accurate Compared to Other Algorithms: In some cases, decision trees may not achieve the same level of accuracy as more advanced algorithms like Random Forests, Gradient Boosting, or Neural Networks. They are often used as a base model or combined with other techniques.
  2. Sensitive to Data Changes: Small changes in the data can significantly affect the structure of the tree. A small change in the training dataset can lead to an entirely different tree being generated.
  3. Can Be Biased Toward Features with More Levels: Features with more unique values can dominate the splits, leading to biased models. For example, a feature with many categories might get more attention than it deserves compared to features with fewer categories.
  4. Complex Trees Are Hard to Interpret: While simple decision trees are easy to understand, complex trees with many branches can become difficult to interpret, reducing their usefulness for explaining model behaviour.

Common Use Cases of Decision Trees

The following are the use cases of decision trees that are often encountered:

 

  • Fraud Detection: Prevents fraudulent schemes by categorising transaction attributes to distinguish anomalous behaviour.
  • Sales Prediction: Enables estimation of expected sales by interpreting various factors such as season, business promotion and recorded sales in the previous years.
  • Recommendation Systems: Systems that help to offer products or services to the users depending on their history of actions and preferences.
  • Employee Performance Evaluation: Assesses employee performance based on various factors like task completion, feedback scores, and productivity.
  • Medical Diagnosis: Used to classify patients based on their symptoms and medical records, and even to predict certain diseases or conditions.
  • Customer Segmentation: Assists in clustering clients into groups based on buying behaviour, customer attributes, or their interaction with the firm.
  • Sentiment Analysis: A text classification task where customer feedback is analysed and classified as positive, negative or neutral.
  • Manufacturing Process Control: Applied to enhance quality measures by providing an estimate of defective products in a batch based on relevant production parameters.
  • Market Trend Analysis: Makes use of trends and data from the past to identify the market strategies that will benefit the business in future enterprise planning.

 

These examples highlight the respective strengths of Decision Trees when used in various industries for both classification and regression tasks.

Ensemble Methods Based on Decision Trees

Ensemble methods combine multiple decision trees to enhance predictive ability. Let’s look at some common techniques built on decision trees.

1.   Random Forest

  • How It Works: A random forest builds multiple decision trees, each trained on a different random subset of the data and features. Each tree makes its own prediction; the final output is the average of those predictions for regression, or the majority vote for classification.
  • Benefits: Compared with a single decision tree, the risk of overfitting is much smaller and predictive accuracy improves, because the outputs of many trees are combined.
  • Examples: Successfully applied in classification tasks such as spam or fraud detection and in regression tasks like predicting house and stock prices. A short sketch follows below.
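A minimal, illustrative sketch of a random forest on the Iris data used earlier in this article (the parameter values are assumptions, not tuned choices):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees, each trained on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, forest.predict(X_test)) * 100:.2f}%")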

2.   Gradient Boosting

  • How It Works: Gradient boosting builds decision trees sequentially, where every new tree attempts to correct the errors of its predecessors by fitting to the residual errors of the current model.
  • Benefits: Achieves high predictive accuracy and performs well on complex patterns. Provided the parameters are well tuned, the chance of overfitting is lower.
  • Typical Applications: Generally employed in ranking tasks e.g., search engine algorithms, in classification tasks, for instance, medical diagnosis, and in regression tasks such as financial forecasting.
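A minimal, illustrative sketch of gradient boosting with scikit-learn on the same Iris split (the parameter values are assumptions for demonstration):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Shallow trees are added one at a time; learning_rate scales each tree's contribution
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, gb.predict(X_test)) * 100:.2f}%")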

3.   AdaBoost (Adaptive Boosting)

  • How It Works: AdaBoost builds a sequence of shallow decision trees, where each new tree gives more weight to the examples misclassified by the previous trees. The final output is decided by a weighted vote, with each tree’s vote weighted by its accuracy.
  • Advantages: It can raise the performance of weak classifiers, although it can be sensitive to noisy data and outliers.
  • Use Cases: Used in image processing, text categorization and prediction of customer churn.
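A minimal, illustrative sketch of AdaBoost using scikit-learn; by default each weak learner is a one-level decision tree (a "stump"), and the parameter values here are assumptions:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Misclassified samples are re-weighted before the next weak learner is trained
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, ada.predict(X_test)) * 100:.2f}%")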

4.   XGBoost (Extreme Gradient Boosting)

  • How It Works: XGBoost is an optimised implementation of gradient boosting designed for speed and performance. It adds regularisation techniques to prevent overfitting and supports parallel computation to speed up training.
  • Advantages: Recognised for its efficiency and high accuracy in predictive tasks. It is also versatile, with a built-in mechanism for handling missing values.
  • Use Cases: Popular in data science competitions (like Kaggle), as well as real-world applications in finance, marketing, and healthcare.
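A minimal, illustrative sketch using the separate xgboost package (installed with pip install xgboost); the parameter values are assumptions for demonstration:

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Regularised gradient boosting; reg_lambda adds L2 regularisation on the leaf weights
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, reg_lambda=1.0)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)) * 100:.2f}%")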

5.   Bagging (Bootstrap Aggregating)

  • How It Works: Bagging creates multiple decision trees using different random samples of the dataset. Each tree is trained independently, and the final output is determined by averaging (regression) or majority voting (classification).
  • Advantages: Reduces variance and helps prevent overfitting, especially when used with unstable models like decision trees.
  • Use Cases: Suitable for problems where the model tends to overfit, such as complex datasets with noisy data.
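A minimal, illustrative sketch of bagging with scikit-learn's BaggingClassifier, whose default base estimator is a decision tree (the parameter values are assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each of the 50 trees is trained independently on a bootstrap sample; predictions are combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, bagging.predict(X_test)) * 100:.2f}%")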

 

Learn more about: What is Bagging vs. Boosting in Machine Learning?

Pruning in Decision Trees

Pruning is employed to enhance the model by removing branches that contribute little to its predictions. This helps avoid overfitting and makes the tree generalise better to new data. There are mainly two types of pruning:

 

1.   Pre-Pruning (Early Stopping)

  • This approach stops the tree from growing once a certain condition is met. Common conditions include setting a maximum tree depth, a minimum number of samples required at a leaf node, or a minimum number of samples required to split a node.
  • The goal is to prevent the tree from becoming too complex by limiting its growth during the building phase. This helps to avoid overfitting but may risk underfitting if the stopping criteria are too strict.

 

2.   Post-Pruning (Pruning After Tree Construction)

  • In post-pruning, the tree is allowed to grow fully, and then branches that contribute little to the model’s accuracy are pruned. The process evaluates the impact of removing a branch on the model’s overall accuracy, and if it improves or does not significantly degrade the performance, the branch is removed.
  • A common technique is cost-complexity pruning, which uses a complexity parameter to trade off the accuracy of the tree against the number of nodes: branches whose contribution falls below the resulting threshold are removed (see the sketch below).
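As a hedged sketch, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of its tree classes; the alpha value used below is an assumption and would normally be chosen by cross-validation.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# cost_complexity_pruning_path lists the candidate alpha values for this training set
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print("Candidate alphas:", path.ccp_alphas)

# A larger ccp_alpha prunes more aggressively, trading training accuracy for a smaller tree
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42)
pruned.fit(X_train, y_train)
print("Leaves after pruning:", pruned.get_n_leaves())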

 

Pruning keeps a decision tree simple and compact, makes it easier to comprehend, and gives higher confidence in its accuracy on unseen data.

Hyperparameter Tuning for Decision Trees

Hyperparameter tuning involves optimising the settings of a decision tree to improve its performance. Some key hyperparameters to tune include:

 

  1. Max Depth: Limits the depth of the tree to prevent overfitting. Setting a maximum depth helps in controlling the tree’s complexity. A smaller depth may cause underfitting, while a larger depth may lead to overfitting.
  2. Min Samples Split: Specifies the minimum number of samples required to split an internal node. Increasing this value can prevent the model from learning overly specific patterns in the data, thereby reducing overfitting.
  3. Min Samples Leaf: Sets the minimum number of samples required at a leaf node. Higher values can help in smoothing the model by creating larger leaf nodes, which reduces the risk of capturing noise.
  4. Max Features: Limits the number of features to consider when looking for the best split. This can improve model performance by reducing variance and making the model less sensitive to noisy features.
  5. Criterion: The splitting criterion, such as “gini” for the Gini index or “entropy” for information gain, affects how the quality of a split is measured. Testing different criteria can help find the one that provides better results for a specific dataset.
  6. Max Leaf Nodes: Limits the number of leaf nodes in the tree. Restricting the number of leaf nodes can simplify the tree and reduce overfitting.
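The sketch below shows one way these hyperparameters might be tuned with scikit-learn's GridSearchCV; the grid values are illustrative assumptions rather than recommendations.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate values for the hyperparameters described above
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.3f}")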

Conclusion

Decision trees are very useful in machine learning for classification and regression tasks. Their structure is simple, which makes it easy to follow and interpret the model’s decision-making process. By splitting the data into smaller segments, decision trees can uncover patterns that help in making accurate predictions.

 

Nonetheless, if not handled appropriately, decision trees can be prone to overfitting. Pruning and hyperparameter tuning are important techniques for improving the performance and generalisation of the model. In addition, ensemble methods such as Random Forest and Gradient Boosting often extend the abilities of decision trees further. Handled correctly, decision trees can be a very effective means of solving a wide range of problems. Consider pursuing the Accelerator Program in AI and Machine Learning offered by Hero Vired if you want to master machine learning.

FAQs
What is a decision tree?
A decision tree can be defined as a model which classifies and predicts based on input variables and is structured in the form of a tree.

What are decision trees used for?
They are used for classification and regression tasks to predict outcomes.

What are the main types of decision trees?
The main types are classification trees (for categorical output) and regression trees (for continuous output).

What is pruning in decision trees?
Pruning is the process of removing unnecessary branches to prevent overfitting and simplify the model.

What does hyperparameter tuning do?
It optimises the model's hyperparameters to improve accuracy and reduce overfitting.

What is an ensemble method?
An ensemble method, such as Random Forests, combines multiple decision trees to achieve higher accuracy.

Can decision trees handle both numerical and categorical data?
Yes, they can process both types of data, making them versatile for different tasks.
