Random Forest Algorithm

Sneha Bose

a day ago

Random Forest is an ensemble model that uses "bagging" as the ensemble method and the decision tree as the individual model. It works by constructing multiple decision trees during training; for classification, the final prediction is the class chosen by the majority of the trees.
Random forest is a supervised learning method and can be used for both classification and regression problems, although it is mostly used for classification.
A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
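To make the idea of branches concrete, here is a minimal sketch that fits one shallow decision tree and prints its branches as text. It assumes scikit-learn is installed and uses the copy of the Iris dataset bundled with it; the depth limit of 2 and the seed are arbitrary choices made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so that the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line is one branch: a test on a feature, or a predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

A random forest builds many such trees, each on a slightly different view of the data, and combines their answers.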

Why do we use the Random Forest algorithm?

One of the main advantages of the Random Forest algorithm is that it reduces the risk of overfitting compared with a single decision tree. Because the trees are trained independently of one another, training can also be run in parallel, which helps keep the required training time down. Additionally, it generally provides a high level of accuracy. The algorithm runs efficiently on large datasets, and as originally described it includes a method for estimating missing data so that predictions stay accurate.

How does Random Forest work?

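·      Step 1 - Select n (e.g. 1500) random subsets from the training set, sampling with replacement (bootstrap samples).
·      Step 2 - Train n decision trees, one on each subset (here, 1500 trees), with each tree considering only a random subset of the features at each split.
·      Step 3 - Each individual tree independently predicts the records/candidates in the test set.
·      Step 4 - Make the final prediction for each record by majority voting across the trees.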
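The four steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions, not how any particular library implements the algorithm: it uses scikit-learn's bundled Iris data, 25 trees instead of 1500, a 70:30 split, and fixed seeds, all chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative; the text uses 1500 as its example

# Steps 1 and 2: draw one bootstrap sample per tree and fit a tree on it.
trees = []
for i in range(n_trees):
    # Row indices drawn with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split look at a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: every tree predicts every test record independently.
all_votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)

# Step 4: majority vote per test record.
def majority_vote(votes):
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

y_pred = np.apply_along_axis(majority_vote, 0, all_votes)
accuracy = float(np.mean(y_pred == y_test))
print("Hand-rolled forest accuracy:", accuracy)
```

In practice you would use `RandomForestClassifier` rather than a loop like this. Note that scikit-learn's classifier averages the trees' predicted class probabilities instead of counting hard votes, so its results can differ slightly from this sketch.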
[Figure: Random Forest simplified]

Advantages of Random Forest:

1. Random forest can solve both classification and regression problems and gives decent estimates on both fronts.
2. One of the benefits of Random Forest that excites me most is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is sometimes treated as a dimensionality reduction method. Furthermore, the model outputs the importance of each variable, which is very handy for feature selection.
3. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
4. It has methods for balancing errors in data sets where classes are imbalanced.
5. The capabilities above can be extended to unlabeled data, leading to unsupervised clustering, data views, and outlier detection.
6. Random forest samples the input data with replacement, which is called bootstrap sampling. As a result, about one-third of the data is not used to train a given tree and can be used for testing. These are called the out-of-bag (OOB) samples, and the error estimated on them is known as the out-of-bag error. Studies of out-of-bag error estimates provide evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
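Two of the points above, variable importance and the out-of-bag estimate, can be read directly from a fitted scikit-learn model. The sketch below assumes scikit-learn is installed and uses its bundled Iris data; the 100 trees and the seed are illustrative choices. It also computes the chance that a given row is left out of one bootstrap sample, (1 - 1/n)^n, which approaches 1/e (roughly 37%) as n grows and is where the "about one-third" figure comes from.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
n = iris.data.shape[0]

# Probability that one particular row is never drawn in n draws with replacement.
oob_fraction = (1 - 1 / n) ** n
print("Expected out-of-bag fraction per tree:", round(oob_fraction, 3))

# oob_score=True asks the forest to score each row using only the trees
# that did not see that row during training.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(iris.data, iris.target)

print("Out-of-bag accuracy:", forest.oob_score_)
print("Out-of-bag error:", 1 - forest.oob_score_)

# Impurity-based importance of each input variable; the values sum to 1.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Because the out-of-bag score is computed on the training data itself, no separate test split is needed for this estimate.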

Disadvantages of Random Forest:

1. It does a good job at classification but is less suited to regression, because it does not give precise continuous predictions. In regression, it cannot predict beyond the range of target values seen in the training data, and it may overfit data sets that are particularly noisy.
2. Random forest can feel like a black-box approach: as statistical modelers, we have very little control over what the model does. At best, you can try different parameters and random seeds.
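The regression limitation in the first point is easy to see on a toy problem. The sketch below uses hypothetical data, y = 2x for x between 0 and 10, and asks a `RandomForestRegressor` to predict far outside that range. Each tree predicts an average of training targets, so the forest cannot return a value above the largest target it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: a perfectly linear relationship y = 2x on [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

# One point inside the training range and two far outside it.
X_new = np.array([[5.0], [20.0], [100.0]])
preds = reg.predict(X_new)

print("Largest training target:", y_train.max())
print("Predictions at x = 5, 20, 100:", preds)
```

The prediction at x = 5 lands close to 10, while the predictions at x = 20 and x = 100 are identical and stay at or below the largest training target, even though the true values would be 40 and 200.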

Implementation in Python

Let’s see how we can implement the Random forest algorithm in Python. For this, we will be taking the Iris dataset.
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Dataset path link
data_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Providing dataset headers (the file itself has no header row)
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
# Loading the dataframe
dataset = pd.read_csv(data_path, names = headers)

# Assigning independent and dependent variables (target variable)
X = dataset.iloc[:, :-1].values   #(independent variable)
y = dataset.iloc[:, 4].values     #(dependent variable)

# Splitting the dataset into training and test sets in the ratio 70:30
# (the output below reports 45 test samples, i.e. 30% of the 150 rows)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

# Training the Random forest model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50)
classifier.fit(X_train, y_train)

#Making predictions
y_pred = classifier.predict(X_test)

# Printing the result
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)


Output:
Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12

      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

Accuracy: 0.9777777777777777
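Both the train/test split and the forest itself involve randomness, so your numbers will usually differ a little from the ones shown here and from one run to the next. If you want repeatable results, one option is to fix the `random_state` parameter in both places. The sketch below is an assumption-laden variant of the code above: it loads Iris from scikit-learn's bundled copy instead of the URL, and the helper name `run_once`, the seed 42, and the 70:30 stratified split are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_once(seed):
    """Train and score a forest with every source of randomness seeded."""
    X, y = load_iris(return_X_y=True)
    # stratify=y keeps the class proportions equal in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

first = run_once(seed=42)
second = run_once(seed=42)
print(first, second)  # the two values match because the seed is fixed
```

Trying a few different seeds is also a quick way to see how much the accuracy depends on the particular split.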
