IBM HR Analytics Employee Attrition

Abdulellah Alhudaithy
8 min read · Jun 24, 2021

Project Overview

This project analyzes IBM's HR Analytics Employee Attrition and Performance dataset, provided and licensed by IBM and hosted on Kaggle. We will explore the employee data and predict whether each employee will attrite or not.

Introduction

Employee attrition and candidates absconding are significant business concerns in today’s knowledge-driven marketplace, where employees are the most important human capital assets.
The World Future Society predicted that the greatest test of durability for companies in the next five years would be the ability to attract and retain top performers.

Problem Statement

No company wants its employees to leave, because attrition costs a lot of money and time and can also hurt performance. So we will build a machine learning model to predict how likely each employee is to attrite.

Metrics

For our metrics, we will use both accuracy and the F1 score to measure the performance of our model.

Accuracy is “the ability of the instrument to measure the accurate value... In other words, the closeness of the measured value to a standard or true value” (Source). For our classifier, this is simply the fraction of employees whose attrition label is predicted correctly.

F1 score: “It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive”.(Wikipedia)
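As a quick illustration, here is a minimal sketch of how both metrics can be computed with scikit-learn; the label arrays below are made-up placeholders, not the project's data.

from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 1 = the employee attrited, 0 = stayed
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1]

print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall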

Data Exploration and Data Visualization

Here are the columns in the dataset:

[Figure: dataframe features]

We removed some attributes because they have no clear explanation and we will not use them in our analysis or our model.

# Drop columns that we will not use in the analysis or the model
df.drop(columns=['DailyRate', 'EmployeeCount', 'HourlyRate', 'MonthlyRate',
                 'Over18', 'StandardHours', 'StockOptionLevel'],
        inplace=True)

We changed the data type of some attributes to categorical.

# Cast the text columns to pandas' category dtype
categorical_column = ['Department', 'Attrition', 'EducationField',
                      'BusinessTravel', 'Gender', 'MaritalStatus']
for column in categorical_column:
    df[column] = df[column].astype("category")

We formulated some questions to guide our analysis.

The Questions

Is there a field that has attrited more than the others at IBM?

Are there differences between male and female attrition rates?

How does business travel affect employee attrition?

Which age range has the highest attrition?

What is the age range that has the highest average monthly income?

What is the age range that has the lowest average job satisfaction rate?

What is the most common marital status among employees who left IBM?

Is there a correlation between the monthly income and the number of years at IBM?

1. Is there a field that has attrited more than the others at IBM?

There is no big difference between the fields, but employees whose education field is Technical Degree have a higher attrition percentage, around 20%. This was expected, since there are a lot of opportunities in this field and many companies need these skills.
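A minimal sketch of how this breakdown could be computed with pandas, assuming the Kaggle column names EducationField and Attrition (with values 'Yes'/'No'):

# Attrition rate per education field (a sketch; column names are
# assumed to match the Kaggle dataset).
attrition_by_field = (
    df.assign(attrited=df['Attrition'].eq('Yes'))
      .groupby('EducationField')['attrited']
      .mean()
      .sort_values(ascending=False)
)
print(attrition_by_field)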

2. Are there differences between male and female attrition rates?

17% of the males at IBM attrited, compared to only 14.8% of the females. That could be because males take more risks, or because females are more patient and give the opportunity enough time before deciding to leave. We cannot determine the reason here because it depends on the individual.
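The same comparison by gender could look like the sketch below, using pd.crosstab with row normalization (again assuming the Kaggle column names):

import pandas as pd

# Share of 'Yes'/'No' attrition within each gender; each row sums to 1.
gender_rates = pd.crosstab(df['Gender'], df['Attrition'], normalize='index')
print(gender_rates)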

3. How does business travel affect employee attrition?

We can see that almost 25% of the employees who travel frequently have attrited, compared to 14.95% of those who travel rarely. Many factors affect this, especially when the employee is married, because family can drive many decisions in a person's life; even if the income is high, a lot of people will accept a lower income in order to spend more time with their family.

4. Which age range has the highest attrition?

The 18–29 age range is the most attrited. That could be because there are a lot of opportunities for them: they have a bigger chance to change their career path, or they are still trying to find the best field. Also, many of them are not married yet, so they can take the risk of moving to another job or starting a new business. We notice that as you get older you look for more stability, and in the 60+ group there is no attrition at all, which makes sense because many people at this age are thinking about retirement rather than attrition.
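The Age_range buckets used in this analysis (and later as model features) could be built from the Age column with pd.cut; the exact bin edges and labels below are assumptions:

import pandas as pd

# Hypothetical age buckets; the original notebook may use different edges.
bins   = [17, 29, 39, 49, 59, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']
df['Age_range'] = pd.cut(df['Age'], bins=bins, labels=labels)

# Attrition rate per age bucket
print(df.groupby('Age_range')['Attrition'].apply(lambda s: s.eq('Yes').mean()))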

5. What is the age range that has the highest average monthly income?

The average monthly income is highest in the 50–59 group, where many employees have reached manager, director, and C-level positions. In real life as well, as you get older your income increases, and your expenses increase with it.

6. What is the age range that has the lowest average job satisfaction rate?

Employees who are 60+ are not that satisfied with their jobs, and that could be because many of them are getting tired of work and want to retire.

7. What is the most common marital status among employees who left IBM?

As we have mentioned, many of the employees who attrite are single, because it is difficult to change your job or start a new business when you are married, especially when you have children.

8. Is there a correlation between the monthly income and the number of years at IBM?

We found a correlation between the number of years an employee spends at IBM and their monthly income. Since the overall average percent salary hike is 15.2%, we checked the average percent salary hike for the employees who left the company and found it to be 15.09%, which is very close to the overall figure, so salary does not seem to be the driver. That led us to look at Environment Satisfaction, and we found that 25.35% of employees are not satisfied with the working environment, which also strongly affects employee performance. We also found that Research Directors have the lowest average Environment Satisfaction rating, which is worth investigating further with those employees.
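A sketch of the checks described above, assuming the Kaggle column names MonthlyIncome, YearsAtCompany, and PercentSalaryHike (the exact figures come from the original analysis):

# Correlation between income and tenure
corr = df['MonthlyIncome'].corr(df['YearsAtCompany'])
print(f"MonthlyIncome vs. YearsAtCompany correlation: {corr:.2f}")

# Average salary hike overall vs. for employees who left
overall_hike = df['PercentSalaryHike'].mean()
leavers_hike = df.loc[df['Attrition'].eq('Yes'), 'PercentSalaryHike'].mean()
print(f"Average salary hike - overall: {overall_hike:.2f}%, leavers: {leavers_hike:.2f}%")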

Methodology

Data Preprocessing

We created dummy variables to one-hot encode all the categorical columns as 0/1 indicators.

ml_df = pd.get_dummies(ml_df, columns=['Attrition', 'BusinessTravel', 'Department',
                                       'EducationField', 'JobRole', 'Gender',
                                       'MaritalStatus', 'OverTime', 'Age_range'])

We dropped the Attrition_No column so that we have only one target column, and dropped EmployeeNumber because we don't need it in our model.

ml_df.drop(columns=['Attrition_No', 'EmployeeNumber'], inplace=True)
ml_df.rename(columns={'Attrition_Yes': 'Attrition'}, inplace=True)
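The modeling steps below use X_train, X_test, y_train, and y_test, which are not shown in the snippets; here is a minimal sketch of how they could be produced (the 80/20 split and stratification are assumptions):

from sklearn.model_selection import train_test_split

# Separate the features from the target, then hold out a test set.
X = ml_df.drop(columns=['Attrition'])
y = ml_df['Attrition']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)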

Implementation

Since we are trying to predict employee attrition, we are dealing with a binary classification problem. So I tested three algorithms and then picked one of them to hyperparameter-tune.

The three algorithms were RandomForestClassifier, DecisionTreeClassifier, and SVC (Support Vector Classifier). I calculated their accuracy and F1 score, and I chose RandomForestClassifier for hyperparameter tuning.

RandomForestClassifier and SVC achieved the same accuracy, but the F1 score is higher for RandomForestClassifier, so that is the model I chose to tune.

RandomForestClassifier
Accuracy: 0.8673
F1-score: 0.2817
DecisionTreeClassifier
Accuracy: 0.7789
F1-score: 0.2242
SVC
Accuracy: 0.8673
F1-score: 0.0000
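A sketch of how this comparison could be run (the numbers above come from the original experiment; the default hyperparameters and random_state below are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

models = {
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'SVC': SVC(random_state=42),
}

# Fit each model and report both metrics on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.4f}, "
          f"F1={f1_score(y_test, preds):.4f}")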

Refinement

The Grid Search Code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(random_state=42)

# Create the parameters list
parameters = {'max_depth': [10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [1, 4, 5, 6, 7, 8, 9, 10],
              'min_samples_leaf': [1, 4, 5, 6, 7, 8, 9, 10]}

# Perform grid search on the classifier
grid_obj = GridSearchCV(clf, parameters)
grid_fit = grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_fit.best_estimator_

After tuning, the model reaches 87.76% accuracy. The accuracy did not change much, but the F1 score increased by almost 10 percentage points.

Accuracy: 0.8776
F1-score: 0.3731
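Presumably these numbers come from scoring the best estimator on the held-out test set; a minimal sketch of that step (best_clf comes from the grid search above):

from sklearn.metrics import accuracy_score, f1_score

# Score the tuned model on the test set
best_preds = best_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, best_preds):.4f}")
print(f"F1-score: {f1_score(y_test, best_preds):.4f}")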

Results

Model Evaluation and Validation

I chose this hyperparameter grid for further tuning:

parameters = {'max_depth': [10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [1, 4, 5, 6, 7, 8, 9, 10],
              'min_samples_leaf': [1, 4, 5, 6, 7, 8, 9, 10]}

and the best parameters after cross-validation were:

RandomForestClassifier(max_depth=20, min_samples_split=5, random_state=42)

Justification

Comparing the three algorithms (RandomForestClassifier, DecisionTreeClassifier, and SVC), I found that RandomForestClassifier was the best all around, so I used it, performed the hyperparameter tuning, and got good results considering the small dataset.

The best algorithm all around for me was RandomForestClassifier with these parameters:

RandomForestClassifier(max_depth=20, min_samples_split=5, random_state=42)

The per-class precision and recall for it:

class    precision    recall
  0         0.88        0.99
  1         0.71        0.13
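The per-class table above looks like the output of scikit-learn's classification_report; a sketch of how it (and an actual confusion matrix) could be produced, reusing best_preds from the earlier sketch:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, best_preds))  # per-class precision/recall/F1
print(confusion_matrix(y_test, best_preds))       # raw counts of predictions vs. truth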

Conclusion

This project helped me dive deeper into the HR domain and analyze HR data. By predicting attrition for each employee, a company can save a lot of money by retaining employees it would otherwise lose.

In this article, we have analyzed IBM's employee attrition data, which yielded several useful insights:

  1. Employees whose education field is Technical Degree have the highest attrition rate, around 20%.
  2. Most of the employees who attrited at IBM are male, with an attrition rate about 2 percentage points higher than that of females (17% vs. 14.8%).
  3. Employees who travel frequently have an attrition rate of almost 25%, compared to 14.95% for those who travel rarely.
  4. Employees aged 18–29 are the most likely to attrite, possibly because they have a lot of opportunities.
  5. Employees aged 50–59 earn a higher average monthly income than the other age groups.
  6. Employees aged 60+ are the least satisfied with their jobs, possibly because many of them are getting tired of work.
  7. Many of the employees who attrited are single, because it is difficult to change jobs when you are married, especially when you have children.
  8. Most of the employees who left did not do so for financial reasons, since their average percent salary hike (15.09%) is close to the overall average, so IBM should focus on the working environment, especially for Research Directors.
  9. We built three classification models using the scikit-learn library to predict who will leave IBM, reaching 87.76% accuracy with the tuned RandomForestClassifier.

Finally, employees are the most important assets a company has, and companies should focus on them because it is the employees who bring in the revenue.

Improvement

I think there is a lot of room to improve the features we used, but I am happy with what I came up with. With the help of domain expertise, I think we could engineer better features and test on a larger dataset.

To see more about this analysis and the model, you can find it in detail on my GitHub.

References

  1. https://www.investopedia.com/terms/c/churnrate.asp
  2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  3. https://byjus.com/physics/accuracy-precision-measurement/
  4. https://en.wikipedia.org/wiki/F-score
