# Random Forest Algorithm

Sneha Bose

3 years ago

Table of Contents
• Why do we use a Random Forest Algorithm?
• How does Random Forest work?
• Implementation in Python
• Summary
Random Forest is an ensemble model that uses bagging (bootstrap aggregating) as the ensemble method and the decision tree as the individual model. It builds multiple decision trees during training, and the final prediction is the one chosen by the majority of the trees.
Random Forest is a supervised learning method and can be used for both classification and regression problems, though it is mostly used for classification.
A decision tree is a tree-shaped model used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.
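To make the idea of branches concrete, here is a minimal sketch of a single decision tree, assuming scikit-learn is installed. It uses scikit-learn's bundled copy of the Iris data, and the `max_depth=2` limit is an arbitrary choice to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules readable; max_depth=2 is an arbitrary choice.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a branch: a test on one feature, ending in a predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

A random forest trains many such trees on different samples of the data and combines their answers.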

## Why do we use a Random Forest Algorithm?

One of the main advantages of the Random Forest algorithm is that it reduces the risk of overfitting compared with a single decision tree. Because the trees are built independently of each other, they can be trained in parallel, which helps keep training time down. The algorithm also runs efficiently on large datasets and generally provides a high level of accuracy, and some implementations can estimate missing data.

## How does Random Forest work?

• Step 1 - Draw n (e.g. 1500) random samples from the training set, with replacement.
• Step 2 - Train n decision trees, one on each sample (here, 1500 trees).
• Step 3 - Each individual tree independently predicts the records/candidates in the test set.
• Step 4 - Make the final prediction by majority voting across the trees.
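The four steps can be sketched by hand as follows. This is an illustrative sketch, not how `RandomForestClassifier` is implemented internally: it assumes scikit-learn and NumPy, uses scikit-learn's bundled Iris data, and picks 25 trees and a 70:30 split only to keep the example fast. The name `majority_vote` is a helper made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative value; the steps above use 1500 as their example
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, same size as the training set, drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree per sample; max_features="sqrt" also randomizes
    # which features each split may consider
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: every tree predicts the test records independently
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)


def majority_vote(votes):
    """Return the class label that received the most votes (hypothetical helper)."""
    return np.bincount(votes, minlength=3).argmax()


# Step 4: majority vote per test record (one column of all_preds per record)
y_pred = np.apply_along_axis(majority_vote, 0, all_preds)
accuracy = (y_pred == y_test).mean()
print("Accuracy of the hand-built forest:", accuracy)
```

For regression, the vote in Step 4 would be replaced by averaging the trees' numeric predictions.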

Advantages of Random Forest:
1. Random Forest can solve both classification and regression problems and gives a decent estimate on both fronts.
2. One of the benefits of Random Forest that excites me most is its ability to handle large datasets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is sometimes used as a dimensionality reduction method. Furthermore, the model outputs the importance of each variable, which is very handy for feature selection.
3. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
4. It has methods for balancing errors in datasets where the classes are imbalanced.
5. These capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views, and outlier detection.
6. Random Forest samples the input data with replacement, which is called bootstrap sampling. As a result, about one-third of the data is not used to train any given tree and can be used for testing instead. These are called the out-of-bag (OOB) samples, and the error estimated on them is known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the estimate is as accurate as using a test set of the same size as the training set, so using it can remove the need for a set-aside test set.
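Two of the points above, variable importance and the out-of-bag error, can be read directly from a fitted scikit-learn model. The sketch below assumes scikit-learn and NumPy and uses the bundled Iris data; `n_estimators=200` and the seed values are arbitrary choices. It also checks the "about one-third" figure by counting how many records a single bootstrap sample leaves out.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fraction of records left out of one bootstrap sample (in theory close to 1/e, about 0.37)
rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)
left_out = 1 - len(np.unique(sample)) / n
print("Fraction left out of one bootstrap sample:", left_out)

iris = load_iris()

# oob_score=True asks scikit-learn to score each record using only the trees
# that did not see it during training
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(iris.data, iris.target)

oob_error = 1.0 - forest.oob_score_
print("Out-of-bag error:", oob_error)

# Importances are normalized to sum to 1; larger means more useful for splitting
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))
```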

Disadvantages of Random Forest:
1. It does a good job at classification but is not as good for regression, because it does not give precise continuous predictions. In regression, it cannot predict beyond the range of the target values seen in the training data, and it may overfit datasets that are particularly noisy.
2. Random Forest can feel like a black-box approach: as a statistical modeler, you have very little control over what the model does. At best, you can try different parameters and random seeds.
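The regression limitation is easy to see on toy data. In this sketch (assuming scikit-learn and NumPy; the linear data `y = 2x` is made up for illustration), the forest is trained on inputs between 0 and 10 and then asked about inputs far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data made up for illustration: y = 2x for x in [0, 10], so y is in [0, 20]
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)

# One input inside the training range and two far outside it
X_new = np.array([[5.0], [20.0], [100.0]])
preds = reg.predict(X_new)
print(preds)
```

Each tree predicts the average of the training targets in a leaf, so no prediction can exceed the largest target seen in training. The inputs 20 and 100 both land in the rightmost leaf of every tree and receive the same prediction, even though the true values under `y = 2x` would be 40 and 200.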

## Implementation in Python

Let’s see how we can implement the Random Forest algorithm in Python. For this, we will use the Iris dataset.
``````
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading the Iris dataset (the file has no header row, so column names are supplied)
data_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(data_path, names=headers)

# Assigning independent and dependent variables (target variable)
X = dataset.iloc[:, :-1].values   # independent variables (the four measurements)
y = dataset.iloc[:, 4].values     # dependent variable (the class label)

# Splitting the dataset in the ratio 70:30 (150 rows give the 45 test rows seen in the output)
# The split is random, so the numbers vary from run to run
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Training the Random Forest model with 50 trees
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)

# Making predictions
y_pred = classifier.predict(X_test)

# Printing the result
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
``````
##### Output
``````
Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12

      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

Accuracy: 0.9777777777777777
``````