A deep dive into the random forest classification algorithm
Machine learning is progressing quickly, and every AI enthusiast has to keep up with the growing number of methods it offers. Machine learning engineers have developed many methods for regression and classification problems. One of them is the random forest.
Random forest is a supervised machine learning algorithm used to solve both classification and regression problems. It extends the well-known decision tree algorithm, which is also used for regression and classification.
To understand the random forest algorithm properly, we first have to look at the basics of supervised learning and the difference between classification and regression. This will help you understand the basic idea behind the random forest algorithm and how it works.
We will also look at Python implementations of the algorithm for both classification and regression, using the Sklearn library. Demonstrating both is meant to help you understand why random forest is often preferred for classification rather than regression.
Supervised vs. unsupervised learning
Supervised and unsupervised learning are the two main and most widely adopted machine learning methods.
In supervised learning, the model is given a dataset that contains both the features of the training examples and the output label or value for each one. The aim is to enable the underlying algorithm to ‘learn’ a hypothesis function that produces an output from the input features. This is done with a loss function, which compares the provided output labels with the estimated labels and is used to optimize the hypothesis function.
The algorithm is designed so as to minimize the loss function and give the best approximation for the output labels based on the input data. You can understand this with a simple example. Let’s suppose we provide our algorithm with pictures of cats, each with a corresponding label ‘cat’, and we also provide the algorithm with images of dogs, each with the corresponding label ‘dog’.
Now we will train our machine learning model on this data so that, in future, when the model encounters the picture of a cat it gives the output label ‘cat’ and not ‘dog’.
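As a minimal sketch of the idea above, the following Python snippet "learns" a hypothesis by minimizing a loss on labeled examples. The weight feature, the data values, and the helper names (predict, zero_one_loss) are invented for illustration and are not part of the original cat-and-dog example.

```python
# Toy labeled data: (weight in kg, label). The values are invented for illustration.
training_data = [
    (3.5, "cat"), (4.0, "cat"), (4.8, "cat"),
    (9.0, "dog"), (12.5, "dog"), (20.0, "dog"),
]


def predict(weight, threshold):
    """Hypothesis function: anything heavier than the threshold is called a dog."""
    return "dog" if weight > threshold else "cat"


def zero_one_loss(threshold, data):
    """Loss function: number of examples the hypothesis labels incorrectly."""
    return sum(predict(weight, threshold) != label for weight, label in data)


# "Training": try each observed weight as a threshold and keep the one with the lowest loss.
candidates = [weight for weight, _ in training_data]
best_threshold = min(candidates, key=lambda t: zero_one_loss(t, training_data))

print("Learned threshold:", best_threshold)
print("Prediction for a 4.2 kg animal:", predict(4.2, best_threshold))
```

Real models use many features and far richer hypothesis functions, but the loop of proposing a hypothesis and scoring it against the labels is the same.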
Unlike supervised learning, unsupervised learning is the method where the available dataset does not have labels. The model is left to find the commonalities in the data by itself.
Unlabeled data is more abundant than labeled data. The main goal of this method is to discover hidden patterns in the dataset provided. Transactional data is commonly analyzed using unsupervised learning methods. Let’s suppose you have a huge dataset of customers and what they have purchased from your store.
A human is unlikely to extract meaningful patterns from such data by hand. An unsupervised learning algorithm can structure the data in a way that helps your store and its sales. It may determine which age groups of men and women are most likely to buy which kinds of products, and so on.
Without the need for labels, unsupervised learning approaches can handle complex data, which may be seemingly unrelated, and output potentially meaningful patterns.
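One common unsupervised technique is clustering. The sketch below is a deliberately tiny one-dimensional k-means, written under the assumption that customers are described only by their age. The ages, the starting centers, and the function name kmeans_1d are hypothetical; the article itself does not prescribe a particular algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: alternate between assigning points and updating centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign each point to the nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer ages; no labels are given to the algorithm.
ages = [19, 21, 22, 24, 45, 47, 50, 52]
centers, clusters = kmeans_1d(ages, centers=[20.0, 30.0])

print("Cluster centers:", centers)
print("Clusters:", clusters)
```

The algorithm is never told which customers are "young" or "older"; the two groups emerge from the data alone.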
Classification vs. regression
Both regression and classification are supervised learning tasks used for predictive analysis in machine learning. The main difference is that classification algorithms predict discrete values (classes), whereas regression algorithms predict continuous values.
Regression models the relationship between independent and dependent variables. It is used to predict continuous values, for example house prices or market trends.
It computes a mapping function from the input variables to a continuous output. Here, we try to find the best-fit line, or more generally the best-fit function, that predicts the output as accurately as possible.
Regression can be linear or non-linear. In linear regression, we fit the best-fit line to the available dataset, while in non-linear regression we fit a non-linear function to the data, such as a polynomial curve of any degree.
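To make the "best-fit line" concrete, here is a small sketch of ordinary least squares for a single input variable. The house sizes and prices are invented and lie exactly on a line so that the result is easy to check; fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: size in square metres and price in thousands.
sizes = [50, 60, 80, 100]
prices = [150, 180, 240, 300]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70 + intercept

print("slope:", slope, "intercept:", intercept)
print("Predicted price for 70 square metres:", predicted_price)
```

With real, noisy data the points do not fall on the line, and the fit is the line that minimizes the squared prediction errors.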
Classification is the method of computing a hypothesis function that is able to divide the dataset into multiple discrete classes based on the input features.
A classification model categorizes the input data into one of the classes it was trained on. The simplest example of classification is predicting whether the input image is that of a cat or not. The model should output 1 for a cat and 0 for anything else. In this way, the classification model maps the input variables to discrete output variables or classes.
A classification model tries to find the decision boundary which separates the dataset into various classes efficiently. Classification can either be binary, where we only classify between two input classes, or it can be multi-class, where there are more than two target classes.
The random forest algorithm
The random forest algorithm is a machine learning technique which, as stated before, is used to solve both classification and regression problems. It is built on a technique called ensemble learning, which helps in solving some extremely complex problems.
Ensemble learning refers to combining multiple classification or regression models to produce a single output. This often performs better than a single model and can handle more complex problems.
Random forest, as the name suggests, is a collection of several decision trees, each trained on a subset of the original dataset. For classification, majority voting over the outputs of all the decision trees yields the single most probable prediction; for regression, the tree outputs are averaged.
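The voting step can be sketched in a few lines. The tree outputs below are made up, and the helper names majority_vote and average_prediction are hypothetical; the point is only to show how several individual predictions are combined into one.

```python
from collections import Counter


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the mean of the values predicted by the trees."""
    return sum(predictions) / len(predictions)


# Made-up outputs of five trees for a single input.
class_votes = ["cat", "dog", "cat", "cat", "dog"]
value_votes = [10.0, 12.0, 11.0, 13.0, 9.0]

print("Forest class prediction:", majority_vote(class_votes))
print("Forest value prediction:", average_prediction(value_votes))
```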
A greater number of trees generally leads to higher accuracy, up to a point, and helps reduce overfitting compared with a single decision tree. A random forest algorithm has the following features:
- A reasonable prediction can be obtained without the use of hyperparameter tuning.
- The issue of overfitting is handled effectively.
- Training is often faster than for many other algorithms, since the trees can be built independently of each other.
- Even large datasets are handled efficiently and accurately.
- Some random forest implementations can also handle missing data on their own.
To understand the complete working of a random forest algorithm, we first need to learn how decision trees work their way around datasets.
How does a decision tree work?
Decision trees are the building blocks of a random forest and one of the most popular tools for classification in machine learning. A decision tree has a tree-like structure in which each node denotes a specific test on an input attribute of the data. Each branch emerging from a node represents an outcome of that test.
A decision tree usually starts with a single node. This first node produces several branches based on the test performed at the node. Each outcome or branch leads to another node, which produces further branches along the way. This is what gives the algorithm its tree-like structure. The nodes fall into two categories: decision nodes and leaf nodes. Decision nodes evaluate the input data and make a decision based on a specific test, whereas a leaf node represents an outcome of the tree.
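The node structure described above can be sketched with nested dictionaries. The tree below is written by hand rather than learned, and its features (weight_kg, barks), thresholds, and the classify helper are all hypothetical.

```python
# Decision nodes hold a test (feature and threshold); leaf nodes hold an outcome.
tree = {
    "feature": "weight_kg",
    "threshold": 6.0,
    "left": {"leaf": "cat"},
    "right": {
        "feature": "barks",
        "threshold": 0.5,
        "left": {"leaf": "cat"},
        "right": {"leaf": "dog"},
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "leaf" not in node:
        passes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if passes_left else node["right"]
    return node["leaf"]


print(classify(tree, {"weight_kg": 4.0, "barks": 0}))
print(classify(tree, {"weight_kg": 10.0, "barks": 1}))
```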
Building a decision tree comes down to deciding the best attribute for the root node and for the other decision nodes. The technique used to decide this is called an Attribute Selection Measure, or ASM. The two most popular ASMs are:
- Information Gain
- Gini index
Information Gain measures the change in entropy of the dataset after it is split on a single attribute. It determines how much information an attribute provides about a class. We split the node on the attribute with the highest information gain.
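$$\text{Information Gain} = \text{Entropy}(S) - \left[\text{(Weighted Avg)} \times \text{Entropy(each feature)}\right]$$
where $S$ is the set of all samples at the node, and the weighted average is taken over the entropies of the subsets produced by splitting on the feature. For two classes, the entropy is
$$\text{Entropy}(S) = -P(a)\log_2 P(a) - P(b)\log_2 P(b)$$
Here $P(a)$ and $P(b)$ denote the probability of class $a$ and class $b$, respectively.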
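As a worked illustration of the two formulas, the sketch below computes the entropy of a set of labels and the information gain of a split. The label lists are invented, and the function names are hypothetical; the weighting used is each subset's share of the samples.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted average entropy of the subsets."""
    total = len(parent)
    weighted_average = sum(len(subset) / total * entropy(subset) for subset in subsets)
    return entropy(parent) - weighted_average


# Invented example: ten samples, five of class "a" and five of class "b".
parent = ["a"] * 5 + ["b"] * 5
perfect_split = [["a"] * 5, ["b"] * 5]
poor_split = [["a", "a", "b", "b", "a"], ["b", "b", "a", "a", "b"]]

print("Parent entropy:", entropy(parent))
print("Gain of the perfect split:", information_gain(parent, perfect_split))
print("Gain of the poor split:", information_gain(parent, poor_split))
```

A split that separates the classes cleanly removes all the entropy, while a split that leaves both subsets mixed gains almost nothing.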
The Gini index measures the impurity of a node when building a decision tree. An attribute with a lower Gini index is preferred over one with a higher Gini index. The Gini index is calculated as follows, where Pj is the proportion of samples belonging to class j:
$$\text{Gini index} = 1 - \sum_j P_j^2$$
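The Gini formula translates directly into code. This is a small sketch with invented label lists; gini_index is a hypothetical helper name.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]
three_class_node = ["a", "b", "c"]

print("Pure node:", gini_index(pure_node))
print("Evenly mixed two-class node:", gini_index(mixed_node))
print("Evenly mixed three-class node:", gini_index(three_class_node))
```

A pure node scores 0, and the score rises as the classes become more evenly mixed.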
The most basic difference between a decision tree and a random forest is that in a random forest the selection of root nodes and splitting nodes involves randomness. Each tree is built from a different sample of the data instead of a single big sample. The decision trees output different leaf node classes, and the outputs are then ranked by how many trees voted for them. The class with the highest ranking is selected as the final output.
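The two sources of randomness can be sketched with the standard library alone. The snippet assumes the commonly used scheme of sampling rows with replacement (a bootstrap sample) and restricting the candidate features to a random subset. The helper names, the dataset size, and the seed are illustrative, and in a real random forest the feature subset is typically redrawn at every split rather than once per tree.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw as many rows as the dataset has, with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for the row indices of a dataset
n_trees = 3

samples = [bootstrap_sample(rows, rng) for _ in range(n_trees)]
feature_subsets = [random_feature_subset(8, 3, rng) for _ in range(n_trees)]

for index, (sample, features) in enumerate(zip(samples, feature_subsets)):
    print("Tree", index, "rows:", sample, "features:", features)
```

Because every tree sees slightly different rows and features, the trees make different mistakes, which is what makes voting over them useful.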
Although random forests can be used for both classification and regression tasks, they often perform less well on regression tasks and are frequently passed over in favor of other regression models. Both classification and regression models using random forests can be implemented in Python using Sklearn.
A basic introduction to Sklearn
Scikit-learn, or Sklearn, is one of the most widely used and robust open-source machine learning libraries in Python.
It provides a large number of ready-made tools and modules for developing machine learning models. These include supervised learning methods for classification and regression, such as support vector machines, and unsupervised learning methods such as factor analysis, Principal Component Analysis (PCA), and clustering, along with several tools for handling data. It also provides functions and modules for evaluating the trained models.
Before moving on to the Python implementation of random forests using Scikit-Learn, we have to install the Scikit-Learn library:
Using the pip command, install as follows:
pip install -U scikit-learn
Using conda environment, install using:
conda install scikit-learn
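After installing, one way to confirm that the package can be found is to look it up from Python. This is a small sketch; is_installed is a hypothetical helper, and note that the import name is sklearn even though the package is installed as scikit-learn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


print("scikit-learn available:", is_installed("sklearn"))
print("matplotlib available:", is_installed("matplotlib"))
```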
Random Forest for classification in Sklearn
Now let us look at the implementation of the random forest algorithm for classification. For simplicity, we will use a dataset that ships with the Sklearn library, called the digits dataset.
This dataset contains images of handwritten digits and their corresponding labels. Our task is to develop a classifier that predicts the handwritten digit from the input image.
Though we provide the full code with explanations in the article, you can also access the code for this section on Google Colab here.
We start by importing the required modules for the code.
# Importing the libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
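from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.datasets import load_digits
# plot_confusion_matrix was removed in scikit-learn 1.2. Its replacement,
# ConfusionMatrixDisplay.from_estimator, accepts the same arguments used later on.
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator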
The matplotlib library is used to visualize the image data. All the other tools we imported from Sklearn are for model development and fitting. The train_test_split function splits the dataset into two subsets, one for training and the other for testing.
By default it also shuffles the data before splitting, so that both subsets are representative of the whole dataset. The StandardScaler class is used to preprocess the data: it standardizes each feature by removing the mean and scaling to unit variance. The RandomForestClassifier class provides the random forest classification model.
We can use multiple metrics to evaluate the model. We will use accuracy and a confusion matrix. You can also choose other evaluation metrics, for example precision, recall, F1-score, and Area Under Curve (AUC) score. The load_digits function loads the digits dataset from the Sklearn dataset collection.
# Importing the dataset
dataset = load_digits()
Here, we load the digits dataset. This dataset contains both the images of the handwritten digits and also the target labels corresponding to the images.
# Visualizing the dataset
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, dataset.images, dataset.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r)
    ax.set_title("Training: %i" % label)
We visualize the data by creating a plot using the matplotlib library having one row and 4 columns which means we will be able to display 4 images in a single row. The for loop iterates through the dataset using the axes variable and shows the first four entries of the dataset along with their labels. The images are shown as grayscale images.
# Unpacking the data into X(input) and Y(target output)
n_samples = len(dataset.images)
X = dataset.images.reshape((n_samples, -1))
Y = dataset.target
Now we unpack the data from the dataset variable and store it into two variables: namely X and Y, where X refers to the input images and Y contains the corresponding target labels.
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, we preprocess the data before developing the model. We first split the whole dataset into two subsets: the training data and the test data.
The test_size parameter is 0.25, which means that 25% of the data goes into the test set while the remaining 75% goes into the training set. After that, we use the StandardScaler class to transform the training data.
The fit_transform() function computes the mean and standard deviation of the training data and standardizes the training data with them. For the test set we use the transform() function, which applies the same training mean and standard deviation without recomputing them on the test data. This keeps the test data consistent with the model and prevents information about the test set from leaking into training.
# Fitting Random Forest Classification to the Training set
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Computing Accuracy
accuracy = accuracy_score(y_test, y_pred)
Then we finally create the random forest model. The RandomForestClassifier class is used to create an instance of the model.
The n_estimators parameter sets the number of decision trees in the random forest, and we set the criterion used to evaluate splits in the decision trees to 'entropy'. It can also be set to 'gini'.
We then fit the model to the training data, which builds the trees that we will test. We then use the images from the test set to output predictions from the model.
These predictions are then compared with the actual target labels of the test data and the accuracy is computed.
print("Accuracy of model: ", accuracy)
plot_confusion_matrix(classifier,X_test,y_test,cmap='Blues',display_labels=dataset.target_names)
plt.tight_layout()
plt.show()
Finally, we display the accuracy and the confusion matrix of the model.
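As we can see, the random forest classifier reaches a good accuracy of about 93% without any hyperparameter tuning. The exact figure varies slightly between runs because the train/test split and the trees are randomized.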
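To make the two evaluation metrics concrete, the sketch below computes an accuracy and the confusion-matrix counts by hand. The label lists are invented and are not the output of the model above; accuracy and confusion_counts are hypothetical helper names.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(true == pred for true, pred in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Invented labels for eight test samples.
true_labels = [0, 1, 2, 2, 1, 0, 2, 1]
predicted_labels = [0, 1, 2, 1, 1, 0, 2, 2]

counts = confusion_counts(true_labels, predicted_labels)
print("Accuracy:", accuracy(true_labels, predicted_labels))
print("True 2 predicted as 1:", counts[(2, 1)])
print("True 1 predicted as 2:", counts[(1, 2)])
```

Accuracy summarizes performance in one number, while the confusion counts show which classes are being mistaken for which.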
Random Forest for regression in Sklearn
The code for the regression model is not much different from the classification one. The only difference is that we are going to use another dataset from the Sklearn collection called the diabetes dataset.
Also, since this is a regression task, we cannot apply accuracy as the evaluation metric so we will be using R2 score as our evaluation metric, and we will be using Random Forest Regressor instead of Random Forest Classifier.
Since the dataset has several input attributes, we cannot plot the data against every attribute. Hence, for the sake of visualization, we will plot the targets against the first attribute only.
As before, you can follow the code here, or you can access a Google Colab notebook with the code for this section here.
# Importing the libraries
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
# Importing the dataset
X, Y = load_diabetes(return_X_y=True)
# Plotting the data
plt.scatter(X[:,0], Y)
plt.title("Targets corresponding to a single attribute of X")
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)
# Fitting Random Forest Regression to the Training set
regressor = RandomForestRegressor(n_estimators=100)
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Computing the R2 score
r2 = r2_score(y_test, y_pred)
print("R2 score of model: ", r2)
As the code above shows, the only changes are the ones described earlier. We also noticed that the standard scaler did not really improve the performance of the model for regression, so we omit it for this data. The R2 score is computed to be 0.4396, which is a low score for a regression model.
The closer the R2 score is to 1, the better the performance of the model. This result illustrates the drawback of using random forests for regression purposes.
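For readers who want to see what the R2 score measures, here is a hand-written version of the standard definition: one minus the ratio of the squared prediction errors to the squared deviations from the mean. The values are invented and unrelated to the diabetes results; r2_manual is a hypothetical helper name.

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    residual_sum = sum((true - pred) ** 2 for true, pred in zip(y_true, y_pred))
    total_sum = sum((true - mean_true) ** 2 for true in y_true)
    return 1 - residual_sum / total_sum


# Invented targets and three sets of predictions.
targets = [3.0, 5.0, 7.0, 9.0]
perfect_predictions = [3.0, 5.0, 7.0, 9.0]
mean_predictions = [6.0, 6.0, 6.0, 6.0]
close_predictions = [2.5, 5.5, 7.0, 9.5]

print("Perfect predictions:", r2_manual(targets, perfect_predictions))
print("Always predicting the mean:", r2_manual(targets, mean_predictions))
print("Close predictions:", r2_manual(targets, close_predictions))
```

A score of 0 means the model does no better than always predicting the mean of the targets, which gives a sense of scale for a score such as the one reported above.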
Conclusion
The random forest algorithm is one of the most powerful machine learning algorithms as far as classification is concerned, and it is an efficient choice for classification tasks.
However, regression tasks are less commonly solved with the random forest algorithm, since it tends to perform less well on them. Overall, the random forest algorithm compares well in computational cost and speed with other machine learning classification techniques, except for the single decision tree, which is simpler than a random forest.
Source: livecodestream