
Project 2: Employee Retention

Shashank Shanu

2 years ago

Employee Retention
Table of Contents
  • Problem Statement:
  • Importing libraries
  • Importing data and displaying it
  • Checking shape of the dataset
  • Checking dataset information
  • Checking for null values in the dataset
  • Data visualization
  • Applying Encoding
  • Building model
  • Which model gives the best accuracy and why?

Problem Statement:

         With the help of the given data from company XYZ, your task is to classify whether an employee will leave the company or not. Also display the confusion matrix, calculate the accuracy, and plot the ROC curve of the model.

Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Importing data and displaying it

data = pd.read_csv("People Charm case.csv")
data
Output:
Output

Let's do some exploratory data analysis first, so that we can get some insights using visualization

Checking shape of the dataset

data.shape
Output:
(14999, 10)
  • From the above, we can see that our dataset contains 14999 observations and 10 variables. Of the 10 variables, 9 are independent variables and 1 (left) is the dependent or target variable.

Checking dataset information

data.info()
Output:

RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfactoryLevel      14999 non-null  float64
 1   lastEvaluation         14999 non-null  float64
 2   numberOfProjects       14999 non-null  int64  
 3   avgMonthlyHours        14999 non-null  int64  
 4   timeSpent.company      14999 non-null  int64  
 5   workAccident           14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotionInLast5years  14999 non-null  int64  
 8   dept                   14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
  • From the above, we can see that most of the variables in our dataset are of integer type (6), while 2 are of float type and 2 are of object type.

Checking for null values in the dataset

data.isnull().sum()
Output:
satisfactoryLevel        0
lastEvaluation           0
numberOfProjects         0
avgMonthlyHours          0
timeSpent.company        0
workAccident             0
left                     0
promotionInLast5years    0
dept                     0
salary                   0
dtype: int64
  • We can see that there are no missing values in our dataset, so here we do not have to apply any data imputation techniques. In real-world problems, however, data imputation is often one of the most important steps.

Let's plot and check null values with the help of a heatmap

sns.heatmap(data.isnull())
Output:
Output
  • As we can see from the above heatmap, there are no missing or null values in our dataset.
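This dataset needs no imputation, but the step mentioned above can be sketched on a small example. The following is a minimal illustration, assuming a hypothetical toy DataFrame (the values are made up, not taken from the dataset) and two common choices: the median for a numeric column and the most frequent value for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real dataset has no missing values.
toy = pd.DataFrame({
    "avgMonthlyHours": [157.0, np.nan, 272.0, 223.0],
    "salary": ["low", "medium", None, "low"],
})

# Count missing values per column, the same check used on the real data.
missing_before = toy.isnull().sum()

# Numeric column: fill with the median. Categorical column: fill with the mode.
toy["avgMonthlyHours"] = toy["avgMonthlyHours"].fillna(toy["avgMonthlyHours"].median())
toy["salary"] = toy["salary"].fillna(toy["salary"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before, missing_after, sep="\n\n")
```

Which fill value is appropriate depends on the column and the problem, so treat the median and mode here as one option among several.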

Let's check some of the variables in our dataset to see how many unique values each contains

Checking "dept" column

data['dept'].unique()
Output:
array(['sales', 'accounting', 'hr', 'technical', 'support', 'IT',
       'product_mng', 'marketing', 'management', 'RandD'], dtype=object)
data['dept'].nunique()
Output:
10
  • From the above, we can see that the dept column has 10 unique departments

Let's check how many values are present in each category of department column

data['dept'].value_counts()
Output:
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: dept, dtype: int64
  • From the above, we can observe that the sales category of the dept column has the most data points (4140), followed by the technical category (2720), and so on.

Checking the "salary" column

data['salary'].unique()
Output:
array(['low', 'medium', 'high'], dtype=object)
  • From the above, we can see that the salary column contains 3 unique categories which are "low", "medium" and "high".

Let's check how many values are present in each category of salary column

data['salary'].value_counts()
Output:
low       7316
medium    6446
high      1237
Name: salary, dtype: int64
  • From the above, we can see that the largest group of employees (7316) falls under the low salary category.

Checking "satisfactoryLevel" column

data['satisfactoryLevel'].value_counts()
Output:
0.10    358
0.11    335
0.74    257
0.77    252
0.84    247
       ... 
0.25     34
0.28     31
0.27     30
0.12     30
0.26     30
Name: satisfactoryLevel, Length: 92, dtype: int64
  • From the above, we can see that 358 employees have a satisfaction level of 0.10, which is the most frequent value

Checking "numberOfProjects" column

data['numberOfProjects'].value_counts()
Output:
4    4365
3    4055
5    2761
2    2388
6    1174
7     256
Name: numberOfProjects, dtype: int64
  • From the above, we can observe that the most common number of projects is 4 (4365 employees), and only 256 employees completed 7 projects

Data visualization

Checking outliers

sns.boxplot(x=data['avgMonthlyHours'])
Output:
Output
sns.boxplot(x=data['satisfactoryLevel'])
Output:
Output
sns.boxplot(x=data['lastEvaluation'])
Output:
Output
  • From the above, we can see that there are no outliers in the checked columns
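The boxplots flag points that lie beyond the whiskers, which seaborn draws at 1.5 times the interquartile range (IQR) by default. The same rule can be checked numerically. Here is a minimal sketch, assuming a hypothetical helper named iqr_outliers and made-up sample values:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Hypothetical toy values, not taken from the dataset.
hours = pd.Series([150, 160, 155, 158, 162, 400])
flagged = iqr_outliers(hours)
print(flagged.tolist())
```

On the real data, a call such as iqr_outliers(data['avgMonthlyHours']) should return an empty Series if the column really has no outliers under this rule.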

Checking data distribution

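# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(x=data["avgMonthlyHours"], kde=True)
Output:
Output
sns.histplot(x=data["lastEvaluation"], kde=True)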
Output:
Output
  • From the above, we can observe that our data is not normally distributed for the checked columns/variables.

Let's plot histograms for the numerical data

numerical_features = ['satisfactoryLevel','lastEvaluation','numberOfProjects','avgMonthlyHours','timeSpent.company']

categorical_features = ['dept','salary','workAccident','promotionInLast5years']

data[numerical_features].hist(bins=15, figsize=(15, 6), layout=(2, 4))
Output:
histogram for Numerical data

Let's plot count plots for the categorical data

sns.countplot(x=data['dept'])
Output:
plot for categorical data
sns.countplot(x=data['salary'])
Output:
Output
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 4, figsize=(20, 8))
for variable, subplot in zip(categorical_features, ax.flatten()):
    sns.countplot(x=data[variable], ax=subplot)
    for label in subplot.get_xticklabels():
        label.set_rotation(90)
Output:
Output
plt.figure(figsize = (15,10))
sns.boxplot(x="salary",y="timeSpent.company",data=data)   #boxplot
plt.xticks(rotation=90)
Output:
Output
  • From the above, we can see that there are some outliers present in the timeSpent.company column.
plt.figure(figsize = (15,10))
sns.boxplot(x="salary",y="avgMonthlyHours",data=data)   #boxplot
plt.xticks(rotation=90)
Output:
Output
  • From the above, we can see that there are no outliers in the avgMonthlyHours column for any salary category.
data.head()
Output:
Output
  • As we can observe, some of the variables (dept and salary) are categorical. Before building our machine learning model, we have to convert them into numerical form. This can be done by applying encoding techniques.

Applying Encoding

from sklearn.preprocessing import LabelEncoder
x1= LabelEncoder()
data['salary'] = x1.fit_transform(data['salary'])
data.head()
Output:
Output
data['salary'].nunique()
Output:
3
data['dept'] = x1.fit_transform(data['dept'])
data.head(3)
Output:
Output
  • Now our dataset is ready for building a machine learning model, so we could apply different algorithms to build it.
  • For this project, I will only be using the Random Forest Classifier algorithm.
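One detail about the encoding step: LabelEncoder assigns integer codes in alphabetical order of the labels, so for salary it yields high=0, low=1, medium=2, which does not follow the natural order of the categories. If that order matters for the model, an explicit mapping is one alternative. A minimal sketch, assuming the salary column still holds its original string values (the sample Series and the salary_order dictionary below are hypothetical):

```python
import pandas as pd

# Hypothetical sample of the salary column before any encoding.
salary = pd.Series(["low", "medium", "high", "low"])

# Explicit mapping that preserves the natural order low < medium < high.
salary_order = {"low": 0, "medium": 1, "high": 2}
salary_encoded = salary.map(salary_order)
print(salary_encoded.tolist())
```

Tree-based models such as random forest split on thresholds and are usually less sensitive to this choice than linear models, so the label-encoded values used in this project remain workable.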

Building model

# importing libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score

Splitting the independent variables from the dataset

X = data.drop(['left'],axis=1)   # independent variables
X.head(3)
Output:
Output

Splitting the dependent variable from the dataset

Y = data["left"]          # dependent variable
Y.head()
Output:
0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64

Splitting data into training and testing part in the ratio of 80:20

x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=3)
y_test.shape
Output:
(3000,)
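
When the target classes are imbalanced, passing stratify to train_test_split keeps the class ratio the same in the training and testing parts. The split above does not use it, so the following is only an illustration on hypothetical labels (X_demo and y_demo are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 760 stayed (0) and 240 left (1).
y_demo = np.array([0] * 760 + [1] * 240)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=3, stratify=y_demo
)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```

For the project data, the equivalent call would add stratify=Y to the existing train_test_split line.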

Applying Random forest Classifier Algorithm

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
print("Confusion Matrix: ",confusion_matrix(y_test,y_pred),sep='\n')
print("Accuracy Score: ",accuracy_score(y_test, y_pred)*100)
Output:
Confusion Matrix: 
[[2274    3]
 [  21  702]]
Accuracy Score:  99.2
  • From the above, we can see that our model gives an accuracy of 99.2% on the test set ((2274 + 702) / 3000 correct predictions), which is very good

Plotting ROC Curve

from sklearn import metrics
probs = rf.predict_proba(x_test)
prob_positive = probs[:,1]
fpr,tpr,threshold = metrics.roc_curve(y_test,prob_positive)
roc_auc = metrics.auc(fpr,tpr)
print('Area under the curve:',roc_auc)
Output:
Area under the curve: 0.990623050518414
  • From the above, we can see that our model gives an area under the curve of about 0.99, which is very good
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, color='orange', label='AUC = %0.2f' % roc_auc)
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1], 'r--')   # diagonal: performance of a random classifier

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()
Output:
Output

CONCLUSION:

As per the above, the "Random Forest Classifier", one of the classification algorithms widely used in real-world problems, gives an accuracy of 99.2% on the test set.

Which model gives the best accuracy and why?

  • Random forest is a supervised classification algorithm. As the name suggests, it builds a forest out of a number of decision trees.
  • Random forest classifier creates a set of decision trees from a randomly selected subset of the training set. It then aggregates the votes from different decision trees to decide the final class of the test object.
  • In general, the more trees in the forest, the more robust the forest is. In the random forest classifier, adding trees tends to make predictions more stable and usually improves accuracy up to a point, after which the gains level off while training time keeps growing.
  • Being an ensemble algorithm, the Random Forest Classifier tends to give a more accurate result than a single decision tree. This is because it works on the principle that "a number of weak estimators, when combined, form a strong estimator".
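The idea described in the points above (train each tree on a random subset, then aggregate the votes) can be sketched by hand. This is a minimal illustration on synthetic data, not the employee dataset, and not how RandomForestClassifier is implemented internally: scikit-learn averages the trees' predicted probabilities instead of counting hard votes. All variable names below are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in; this is not the employee dataset.
X_syn, y_syn = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a 0/1 majority vote cannot tie
votes = []
for i in range(n_trees):
    # Bootstrap: draw training rows with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Each split considers a random subset of features, as in a random forest.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    votes.append(tree.predict(X_te))

# Majority vote across trees (labels are 0/1, so the mean is the vote share).
votes = np.array(votes)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (forest_pred == y_te).mean())
```

Changing n_trees and re-running gives a rough feel for how the ensemble's accuracy settles as trees are added.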
Do you feel accuracy is a good performance evaluation metric for the given data? If yes, justify your answer. If no, justify your answer and suggest alternative metric(s).
Classification accuracy alone is typically not enough information to make this decision.
  • Classification accuracy is our starting point. It is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
So we can also use some of the evaluation metrics below
  • Confusion Matrix: A clean and unambiguous way to present the prediction results of a classifier is to use a confusion matrix (also called a contingency table). For a binary classification problem, the table has 2 rows and 2 columns. In the scikit-learn output shown earlier, the rows are the actual class labels and the columns are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell. Accuracy computed from these counts can still suffer from the "Accuracy Paradox".
  • Accuracy Paradox: Accuracy can be misleading. Sometimes it may be desirable to select a model with a lower accuracy because it has greater predictive power on the problem. For example, in a problem with a large class imbalance, a model can predict the majority class for every observation and achieve a high classification accuracy, yet this model is not useful in the problem domain. This is called the Accuracy Paradox. For problems like these, additional measures are required to evaluate a classifier.
  • Precision: Precision is the number of true positives divided by the sum of true positives and false positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions. It is also called the Positive Predictive Value (PPV). Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of false positives.
  • Recall: Recall is the number of true positives divided by the sum of true positives and false negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate. Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many false negatives.
  • F1 Score: The F1 score is 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall. It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between precision and recall.
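These metrics can be worked out directly from the confusion matrix reported for the random forest model. The sketch below uses only the four counts printed earlier and assumes the scikit-learn layout (rows are actual classes, columns are predicted classes). It also computes the accuracy of a trivial model that always predicts "stayed", to make the accuracy paradox concrete.

```python
# Counts taken from the confusion matrix reported earlier
# (scikit-learn layout: rows = actual class, columns = predicted class).
tn, fp = 2274, 3
fn, tp = 21, 702
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "stayed" (class 0).
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={baseline_accuracy:.3f}")
```

The always-"stayed" baseline already reaches about 75.9% accuracy while finding none of the employees who left, which is why recall (about 0.971 here) and precision (about 0.996) are worth reporting alongside the 99.2% accuracy.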
I hope you enjoyed this project and learned how to use and implement the Random Forest algorithm. For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies, do visit us at https://insideaiml.com/home.
Thanks for reading…
Happy Learning…
