Shashank Shanu

2 years ago

- Problem Statement:

- Importing libraries

- Importing data and displaying it

- Checking shape of the dataset

- Checking dataset information

- Checking for null values in the dataset

- Data visualization

- Applying Encoding

- Building model

- Which model gives the best accuracy and why?

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

```
data = pd.read_csv("People Charm case.csv")
data
```

`data.shape`

`(14999, 10)`

**From the above, we can see that our dataset contains 14999 observations and 10 variables. Out of the 10 variables, 9 are independent variables and 1 is the dependent or target variable.**

`data.info()`

```
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfactoryLevel      14999 non-null  float64
 1   lastEvaluation         14999 non-null  float64
 2   numberOfProjects       14999 non-null  int64
 3   avgMonthlyHours        14999 non-null  int64
 4   timeSpent.company      14999 non-null  int64
 5   workAccident           14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotionInLast5years  14999 non-null  int64
 8   dept                   14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

**From the above, we can see that most of the variables in our dataset are integers, while some are of float and object type.**

`data.isnull().sum()`

```
satisfactoryLevel 0
lastEvaluation 0
numberOfProjects 0
avgMonthlyHours 0
timeSpent.company 0
workAccident 0
left 0
promotionInLast5years 0
dept 0
salary 0
dtype: int64
```

**We can see that there are no missing values in our dataset, so here we do not have to apply any data imputation techniques. But in real-world problems, data imputation is one of the most important steps to be performed.**
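This dataset needs no imputation, but as a minimal sketch of what that step could look like when values are missing, the toy frame below fills numeric gaps with the column median and categorical gaps with the most frequent value. The values are hypothetical and not taken from the People Charm data; the column names are reused only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not from the People Charm dataset
toy = pd.DataFrame({
    "avgMonthlyHours": [150.0, np.nan, 250.0],
    "salary": ["low", None, "low"],
})

# Numeric column: replace missing values with the median of the observed values
toy["avgMonthlyHours"] = toy["avgMonthlyHours"].fillna(toy["avgMonthlyHours"].median())

# Categorical column: replace missing values with the most frequent category
toy["salary"] = toy["salary"].fillna(toy["salary"].mode()[0])

print(toy)
```

Median and mode are only one possible choice; which strategy fits depends on the column and on why the values are missing.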

`sns.heatmap(data.isnull())`

**As we can see from the above heatmap, there are no missing or null values in our dataset.**

`data['dept'].unique()`

```
array(['sales', 'accounting', 'hr', 'technical', 'support', 'IT',
'product_mng', 'marketing', 'management', 'RandD'], dtype=object)
```

`data['dept'].nunique()`

`10`

**From the above, we can see that the dept column has 10 unique departments.**

`data['dept'].value_counts()`

```
sales 4140
technical 2720
support 2229
IT 1227
product_mng 902
marketing 858
RandD 787
accounting 767
hr 739
management 630
Name: dept, dtype: int64
```

**From the above, we can observe that the sales category of the dept column has the most data points (4140), followed by the technical category and so on.**

`data['salary'].unique()`

`array(['low', 'medium', 'high'], dtype=object)`

**From the above, we can see that the salary column contains 3 unique categories which are "low", "medium" and "high".**

`data['salary'].value_counts()`

```
low 7316
medium 6446
high 1237
Name: salary, dtype: int64
```

**From the above, we can see that the largest group of employees (7316) falls under the low salary category.**

`data['satisfactoryLevel'].value_counts()`

```
0.10 358
0.11 335
0.74 257
0.77 252
0.84 247
...
0.25 34
0.28 31
0.27 30
0.12 30
0.26 30
Name: satisfactoryLevel, Length: 92, dtype: int64
```

**From the above, we can see that the most common satisfactory level is 0.10, which 358 employees have.**

`data['numberOfProjects'].value_counts()`

```
4 4365
3 4055
5 2761
2 2388
6 1174
7 256
Name: numberOfProjects, dtype: int64
```

**From the above, we can observe that the most common number of projects is 4 (4365 employees), and only 256 employees completed 7 projects.**

`sns.boxplot(x=data['avgMonthlyHours'])`

`sns.boxplot(x=data['satisfactoryLevel'])`

`sns.boxplot(x=data['lastEvaluation'])`

**From the above, we can see that there are no outliers in the columns checked.**
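By default a boxplot draws its whiskers at 1.5 times the interquartile range (IQR), so the same check can be approximated numerically. The helper below is a hypothetical sketch (`iqr_outliers` is not a pandas or seaborn function), shown on toy values rather than on the real columns.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy values: 100 sits far above the rest
toy = pd.Series([1, 2, 3, 4, 100])
print(iqr_outliers(toy).tolist())
```

On the real data this could be called as `iqr_outliers(data['avgMonthlyHours'])`; an empty result would agree with what the boxplots show.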

`sns.histplot(data["avgMonthlyHours"], kde=True)`

`sns.histplot(data["lastEvaluation"], kde=True)`

**From the above, we can observe that our data is not normally distributed for the checked columns/variables.**

```
numerical_features = ['satisfactoryLevel','lastEvaluation','numberOfProjects','avgMonthlyHours','timeSpent.company']
categorical_features = ['dept','salary','workAccident','promotionInLast5years']
# Histogram of every numerical feature on a 2x4 grid
data[numerical_features].hist(bins=15, figsize=(15, 6), layout=(2, 4))
plt.show()
```

`sns.countplot(x=data['dept'])`

`sns.countplot(x=data['salary'])`

```
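import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 4, figsize=(20, 8))
# One count plot per categorical feature, each on its own subplot
for variable, subplot in zip(categorical_features, ax.flatten()):
    sns.countplot(x=data[variable], ax=subplot)
    # Rotate the tick labels so longer category names stay readable
    for label in subplot.get_xticklabels():
        label.set_rotation(90)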
```

```
plt.figure(figsize = (15,10))
sns.boxplot(x="salary",y="timeSpent.company",data=data) #boxplot
plt.xticks(rotation=90)
```

**From the above, we can see that there are some outliers in the timeSpent.company column.**

```
plt.figure(figsize = (15,10))
sns.boxplot(x="salary",y="avgMonthlyHours",data=data) #boxplot
plt.xticks(rotation=90)
```

**From the above, we can see that there are no outliers in the avgMonthlyHours column for any salary category.**

`data.head()`

**As we can observe, some of the variables are categorical. So before building our machine learning model, we have to convert them into numerical form. This can be done by applying encoding techniques.**

```
from sklearn.preprocessing import LabelEncoder
x1= LabelEncoder()
data['salary'] = x1.fit_transform(data['salary'])
data.head()
```

`data['salary'].nunique()`

`3`

```
data['dept'] = x1.fit_transform(data['dept'])
data.head(3)
```
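`LabelEncoder` numbers the classes in sorted order, so `salary` ends up as high = 0, low = 1, medium = 2, which does not follow the natural low < medium < high ranking. Tree-based models can usually cope with this, but if the order should be preserved, an explicit mapping is one option. A minimal sketch on a toy Series (the `salary_order` dictionary is an assumption for illustration, not part of the original project):

```python
import pandas as pd

# Toy column with the same three categories as the salary column
toy_salary = pd.Series(["low", "medium", "high", "low"])

# Hypothetical mapping that keeps the natural ranking of the categories
salary_order = {"low": 0, "medium": 1, "high": 2}

encoded = toy_salary.map(salary_order)
print(encoded.tolist())
```

For `dept` there is no natural order, so the arbitrary codes from `LabelEncoder` carry no ranking either way.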

**Our dataset is now ready for building a machine learning model, so we could apply different algorithms to build it.**

**For this project, however, I will only be using the Random Forest Classifier algorithm.**

```
# importing libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
```

```
X = data.drop(['left'],axis=1) # independent variables
X.head(3)
```

```
Y = data["left"] # dependent (target) variable
Y.head()
```

```
0 1
1 1
2 1
3 1
4 1
Name: left, dtype: int64
```

```
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=3)
y_test.shape
```

`(3000,)`
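The split above is purely random. Since the two classes of `left` are not equally frequent, one optional refinement is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in the training and test parts. A small sketch on hypothetical toy labels (15 of class 0 and 5 of class 1):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 75% class 0 and 25% class 1
X_toy = [[i] for i in range(20)]
y_toy = [0] * 15 + [1] * 5

# stratify=y_toy keeps the 75/25 ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)

print(Counter(y_tr), Counter(y_te))
```

Applied to this project, that would mean passing `stratify=Y` in the call above; the reported results were obtained without it.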

```
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
print("Confusion Matrix: ",confusion_matrix(y_test,y_pred),sep='\n')
print("Accuracy Score: ",accuracy_score(y_test, y_pred)*100)
```

```
Confusion Matrix:
[[2274 3]
[ 21 702]]
Accuracy Score: 99.2
```

**From the above, we can see that our model gives an accuracy of 99.2%, which is very good.**

```
from sklearn import metrics
probs = rf.predict_proba(x_test)
prob_positive = probs[:,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,prob_positive)
roc_auc = metrics.auc(fpr,tpr)
print('Area under the curve:',roc_auc)
```

`Area under the curve: 0.990623050518414`

- From the above, we can see that our model gives an area under the ROC curve of about 0.99, which is a very good score.
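One way to read this number: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four toy samples; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities for the positive class
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75. The pairwise loop is fine for a handful of samples but too slow for large test sets, where `metrics.roc_curve` and `metrics.auc` are the practical route.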

```
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, color='orange', label='AUC = %0.2f' % roc_auc)
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
```

As per the above, the "Random Forest Classifier", one of the classification algorithms most widely used in real-world problems, gives an accuracy of 99.2%.

- The random forest algorithm is a supervised classification algorithm. As the name suggests, it builds a forest out of a number of trees.
- A random forest classifier builds a set of decision trees, each from a randomly selected subset of the training set. It then aggregates the votes from the different decision trees to decide the final class of the test object.
- In general, the more trees in the forest, the more robust the forest. Likewise, in a random forest classifier a larger number of trees tends to give more stable and often more accurate results, although the gain levels off while training time keeps growing.
- Being an ensemble algorithm, the random forest classifier tends to give a more accurate result than a single tree. This is because it works on the principle that "a number of weak estimators, when combined, form a strong estimator".
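The two ideas in the points above, random subsets of the training set and vote aggregation, can be sketched in a few lines of plain Python. This is an illustration only: `bootstrap_indices` and `majority_vote` are hypothetical helpers, and scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one random subset per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(tree_predictions):
    """Return the class predicted by most trees for a single sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)  # some rows repeat, others are left out
print(sample)

# Hypothetical predictions of five trees for one employee (1 = left)
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```

Because every tree sees a slightly different sample, the individual errors are less correlated, which is what makes the combined vote more reliable than any single tree.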

Do you feel accuracy is a good performance evaluation metric for the given data? If yes, justify your answer. If no, justify your answer and suggest alternative metric(s).

Classification accuracy alone is typically not enough information to make this decision.

- Classification accuracy is our starting point. It is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
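As a quick check of that definition, the sketch below recomputes the accuracy from the confusion matrix printed earlier and compares it with a trivial model that always predicts the majority class. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that 0 means the employee stayed.

```python
# Counts from the confusion matrix above (rows = true class, columns = predicted class)
tn, fp, fn, tp = 2274, 3, 21, 702
total = tn + fp + fn + tp

# Correct predictions divided by all predictions, as a percentage
accuracy = (tn + tp) / total * 100

# A model that always predicts class 0 gets every true 0 right and every true 1 wrong
baseline = (tn + fp) / total * 100

print(round(accuracy, 1), round(baseline, 1))
```

The always-0 baseline already reaches about 75.9% on this test split, so accuracy has to be read against that figure rather than against 0%.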

So we can also use some of the evaluation metrics below:

**Confusion Matrix:** A clean and unambiguous way to present the prediction results of a classifier is to use a confusion matrix (also called a contingency table). For a binary classification problem, the table has 2 rows and 2 columns. In the layout described here, the observed class labels run across the top and the predicted class labels run down the side; note that scikit-learn's `confusion_matrix`, used above, puts the true classes in the rows and the predicted classes in the columns. Each cell contains the number of predictions made by the classifier that fall into that cell. The confusion matrix also helps to expose the "Accuracy Paradox".

**Accuracy Paradox:** Accuracy can be misleading. Sometimes it may be desirable to select a model with a lower accuracy because it has greater predictive power on the problem. For example, in a problem with a large class imbalance, a model can predict the majority class for every sample and still achieve a high classification accuracy, even though such a model is not useful in the problem domain. This is called the Accuracy Paradox. For problems like these, additional measures are required to evaluate a classifier.

**Precision:** Precision is the number of true positives divided by the sum of true positives and false positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions. It is also called the Positive Predictive Value (PPV). Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of false positives.

**Recall:** Recall is the number of true positives divided by the sum of true positives and false negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate. Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many false negatives.

**F1 Score:** The F1 score is `2 * (precision * recall) / (precision + recall)`, the harmonic mean of precision and recall. It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between precision and recall.
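Applied to the confusion matrix reported earlier, these definitions can be worked out by hand. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1, the employees who left, as the positive class.

```python
# Counts from the confusion matrix above: [[2274, 3], [21, 702]]
tn, fp, fn, tp = 2274, 3, 21, 702

precision = tp / (tp + fp)   # share of predicted leavers who really left
recall = tp / (tp + fn)      # share of actual leavers the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Recall (about 0.971) is slightly lower than precision (about 0.996), meaning the model misses a few leavers more often than it raises false alarms. scikit-learn's `classification_report(y_test, y_pred)` reports the same quantities per class.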

I hope you enjoyed this project and learned how the Random Forest algorithm can be used and implemented. For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies, do visit us at https://insideaiml.com/home.

Thanks for reading…

Happy Learning…