Journal of Information and Computational Science ISSN: 1548-7741

An Analysis of the Use of Machine Learning for
Employee Attrition Prediction – A Literature
Review
Usha.P.M1,
Research Scholar, Department of CA, CS & IT,
Karpagam Academy of Higher Education, Coimbatore.
E mail Id – pmusha.72@gmail.com

Dr. N.V.Balaji2,
Dean, Faculty of Arts, Science and Humanities,
Karpagam Academy of Higher Education, Coimbatore.
E mail Id – balajinv@karpagam.com

Abstract
Machine learning is an application of artificial intelligence that enables systems
to learn and act from experience without being explicitly programmed [1]. Future
events can be predicted by running machine learning algorithms on the available
past data [2]. Machine learning can produce labeled classifications of data, and it
can also be used to discover hidden structures in unlabeled data.
The top-level management of organizations can leverage this predictive potential
of machine learning algorithms to foretell the likelihood of an employee leaving
the organization. This in turn helps in controlling the factors that lead to
attrition and in preventing it. Employee turnover is a grave challenge
faced by employers, and retaining talent is crucial for every organization. Hence,
if the management can obtain a predicted probability of separation for employees,
as well as the factors influencing the separation, it can be instrumental in making
decisions to mitigate the risk of attrition. This is where machine learning has a role
to play. The predictions made by machine learning algorithms indicate the
proactive steps the management should take to retain employees.
This paper reviews the studies conducted in this area to explore the
machine learning algorithms that can be used for such predictions and the
effectiveness of those predictions.

Keywords: Attrition, Machine learning, Predictive analytics, Classification

1. Introduction
“Take care of your employees and they will take care of your business”, opined Sir
Richard Charles Nicholas Branson, a British business tycoon [3]. In the current competitive
business environment, it is essential for an organization to invest time and effort in reducing
employee attrition and retaining its employees, who are its most valuable assets.
High employee turnover results in considerable financial stress for the
organization [4].
Since the operations of all departments in today's organizations are digitized,
abundant data is available. Human resource details such as job,

Volume 10 Issue 3 - 2020 1429 www.joics.org

change in job, time in company and other demographic features can be extracted from the
Human Resource Information System. This data can be used to identify the
drivers of attrition and to predict the chance of attrition. HR analytics is gaining
importance in organizations. Descriptive analytics has long been used in
organizations, but current predictive analytics uses
machine learning techniques to predict the future from the available data [5].
Machine learning is an application of artificial intelligence that enables
systems to learn and act from experience without being explicitly
programmed [1]. Future events can be predicted by running machine learning
algorithms on the available past data [2]. Machine learning can produce labeled classifications
of data, and it can also be used to discover hidden structures in unlabeled data.
The top-level management of organizations can leverage this predictive potential of
machine learning algorithms to foretell the likelihood of an employee leaving the
organization. This in turn helps in controlling the factors that lead to attrition and
in preventing it. Employee turnover is a grave challenge faced by employers, and
retaining talent is crucial for every organization. Hence, if the management can obtain a
predicted probability of separation for employees, as well as the factors influencing the
separation, it can be instrumental in making decisions that mitigate the risk of attrition.
This is where machine learning has a role to play. The predictions produced by machine
learning algorithms can prompt proactive steps by top management to retain employees.

2. Literature review
Rohit Punnoose and Pankaj Ajit (2016), in the article “Prediction of Employee Turnover in
Organizations using Machine Learning Algorithms”, compare Extreme
Gradient Boosting with other selected classification algorithms for predicting employee
turnover. The study used a data set collected from the Human Resource Information System
of a retailer with global operations, together with data from the Bureau
of Labor Statistics [6]. Since it was a labeled data set, supervised learning was carried out.
The data set contained employee records spanning 18 years, with data
for every quarter. In total, 73,115 records labeled ‘active’ or ‘terminated’ formed the
source data set. The data were preprocessed by replacing missing values with the
mean, with the median where outliers were found, or with values chosen on the basis of
domain expertise. Eighty percent of the data were used for training the model, and tenfold
cross-validation was performed for each chosen algorithm. The trained model was then
tested on the remaining twenty percent of the data.
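The preprocessing and evaluation protocol described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names (`impute`, `train_test_split`, `k_folds`), the toy values and the random seed are hypothetical and are not taken from the reviewed study.

```python
import random
import statistics


def impute(values, use_median=False):
    """Replace None entries with the mean, or the median for columns with outliers."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present) if use_median else statistics.mean(present)
    return [fill if v is None else v for v in values]


def train_test_split(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Toy column with one missing value and one outlier (made-up numbers).
salary = [30.0, 32.0, None, 31.0, 500.0]
salary_filled = impute(salary, use_median=True)

# 80/20 split of 100 hypothetical rows, then tenfold cross-validation on the training part.
train_idx, test_idx = train_test_split(n_rows=100)
folds = list(k_folds(train_idx, k=10))
```

The median is used for the toy column because the outlier would pull the mean far away from typical values, which matches the rationale given in the study.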
The classification algorithms implemented in this study are Extreme Gradient
Boosting, Logistic Regression, Naïve Bayesian, Random Forest, K-Nearest Neighbor,
Linear Discriminant Analysis and SVM. The researchers adopted the AUC-ROC curve as the
tool for measuring the performance of the classification algorithms used in the
study. An ROC curve is drawn by plotting the True Positive Rate (TPR)
on the y-axis against the False Positive Rate (FPR) on the x-axis. The AUC (area under
the curve) measures how well the model distinguishes between the classes in
the data set. Other measures of comparison are the memory consumed by the process and
the run time of the algorithm. The results after running the algorithms on a
MacBook are given in the table below:

Table 1: Comparison of performance of classification algorithms

Reprinted from “Prediction of Employee Turnover in Organizations using Machine
Learning Algorithms” by Rohit Punnoose and Pankaj Ajit, 2016, (IJARAI) International
Journal of Advanced Research in Artificial Intelligence, Vol. 5.
From the table it is clear that XGBoost delivers the best performance in terms of
accuracy and memory utilization.
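The AUC-ROC measure used in this study can be illustrated with a minimal sketch that sweeps a decision threshold over predicted scores and integrates the resulting curve with the trapezoidal rule. The function names and the toy labels and scores are assumptions made for illustration; they do not reproduce the study's data or tooling.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Visit examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: 1 marks an employee who left, scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(roc_points(labels, scores))
```

An area of 1.0 corresponds to a model that ranks every leaver above every stayer, while 0.5 corresponds to random ranking.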
Ibrahim Onuralp Yiğit and Hamed Shourabizadeh (2017), in their article “An Approach for
Predicting Employee Churn by Using Data Mining”, compared the performance of various
classification techniques in predicting employee churn [7]. The researchers
worked on a fictional dataset prepared by IBM data scientists, which had 1470 records and
35 features. Features that had no significance for attrition prediction were
removed. An attribute such as employee id cannot be considered a variable influencing
employee churn. The dataset also had a field over18, in which all employees
had the value “Yes”. Such attributes were removed from the dataset.
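The removal of attributes such as employee id and over18 can be sketched as a simple screening step. The heuristic below flags columns that hold a single value for every row or a different value for every row; it is an assumption made for illustration rather than the authors' procedure, and the rows and column names are made up. Continuous numeric columns can also be unique per row, so the flagged columns would need a manual review before being dropped.

```python
def uninformative_columns(rows):
    """Return column names that are constant or unique for every row."""
    drop = []
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) == 1 or len(distinct) == len(rows):
            drop.append(name)
    return drop


# Made-up rows in the spirit of the IBM dataset; column names are illustrative.
rows = [
    {"EmployeeNumber": 1, "Over18": "Yes", "OverTime": "Yes", "JobLevel": 2},
    {"EmployeeNumber": 2, "Over18": "Yes", "OverTime": "No", "JobLevel": 1},
    {"EmployeeNumber": 3, "Over18": "Yes", "OverTime": "No", "JobLevel": 2},
]
to_drop = uninformative_columns(rows)
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
```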
             The comparison made by the researchers with respect to the accuracy metric is as follows:
             Table II: Comparison of accuracy of classification algorithms
               Algorithm                     Accuracy metric
               Decision tree                 0.813
               Naïve Bayes                   0.839
               Logistic Regression           0.855
               SVM                           0.887
               KNN                           0.867
               Random forest                 0.879

             Table III: The comparison with respect to precision metric
               Algorithm                    Precision
               Decision tree                0.29
               Naïve Bayes                  0.31
               Logistic Regression          0.35
               SVM                          0.51
               KNN                          0.38
               Random forest                0.45

             Table IV: The comparison with respect to recall
               Algorithm                    Recall
               Decision tree                0.43
               Naïve Bayes                  0.32
               Logistic Regression          0.31
               SVM                          0.31
               KNN                          0.23
               Random forest                0.22

From the above tables it is evident that SVM achieved the highest accuracy (0.887) and
precision (0.51), although Decision tree had the highest recall (0.43).
The authors also employed feature selection methods, including recursive
feature elimination, to extract the relevant variables. The algorithms were then
applied to the dataset after feature selection. A slight improvement was observed
in accuracy, precision, recall and F-measure, and SVM remained the best
performer after recursive feature elimination.
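Because precision and recall point to different algorithms in Tables III and IV, a combined measure helps in reading them together. The sketch below derives the F-measure (the harmonic mean of precision and recall) from the values reported in those two tables. The resulting numbers are computed here for illustration and are not figures quoted from the reviewed article.

```python
# Precision and recall as reported in Tables III and IV.
precision = {"Decision tree": 0.29, "Naive Bayes": 0.31, "Logistic Regression": 0.35,
             "SVM": 0.51, "KNN": 0.38, "Random forest": 0.45}
recall = {"Decision tree": 0.43, "Naive Bayes": 0.32, "Logistic Regression": 0.31,
          "SVM": 0.31, "KNN": 0.23, "Random forest": 0.22}


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


f1 = {name: f_measure(precision[name], recall[name]) for name in precision}
best = max(f1, key=f1.get)
```

With these inputs SVM has the highest derived F-measure (about 0.386), followed by Decision tree (about 0.346).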
Rachna Jain and Anand Nayyar (2018), in their article “Predicting Employee Attrition
using XGBoost Machine Learning Approach”, tried to predict attrition using XGBoost, a
popular machine learning technique [8]. The project was designed around the OSEMN
framework, which stands for obtaining, scrubbing, exploring, modeling
and interpreting the data. The dataset used was prepared by IBM data scientists. Apart from
selecting important variables from the dataset, the researchers created new
attributes such as tenure per job, derived from the number of companies worked, and
compa ratio, the ratio between monthly income and the midpoint of the salary range.
A small compa ratio suggests that the employee is unlikely to be satisfied
with the salary, which may lead to attrition. The researchers found that the attributes
with the greatest influence on attrition in the data set are age, gender, marital
status, years at company, job satisfaction and distance from home. Glm boost and XGBoost
techniques were used, giving an accuracy of 89% and an error rate of less than 30%.
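The compa ratio described above is simple arithmetic, sketched below. The function name, its parameters and the salary figures are hypothetical; the review does not state how the salary range was obtained for each employee.

```python
def compa_ratio(monthly_income, range_min, range_max):
    """Monthly income divided by the midpoint of the salary range."""
    midpoint = (range_min + range_max) / 2
    return monthly_income / midpoint


# Hypothetical employee paid at the bottom of a 4000-6000 salary range.
ratio = compa_ratio(monthly_income=4000, range_min=4000, range_max=6000)
```

A ratio below 1 means the employee earns less than the midpoint of the range, which is the situation the authors associate with a higher risk of attrition.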
Jeel Sukhadiya, Harshal Kapadia and Prof. Mitchell D’silva (2018), in their article
“Employee Attrition Prediction using Data Mining Techniques”, applied various algorithms
to the same IBM data set used in the studies above. The algorithms used in this study
were Logistic regression, Gradient boosted
classifier, Support Vector Machine and Random Forest [9]. Using the Random Forest
classifier, fifteen features were found to be the most influential in deciding the retention of
employees, with overtime and monthly income the most prominent. One drawback
is that employee number appeared among these fifteen features, although it is
an identifier that should have been excluded from the list. The researchers
also used Extreme Gradient Boosting to find the most influential factors of attrition.
Here monthly income was the most important. As with Random Forest, employee number again
appeared in the list, although it should not have been considered at all. The algorithms
SVM, Extreme Gradient Boosting, Logistic regression, Random Forest and an ensemble
average were compared. Among these, Extreme Gradient Boosting was found to be
the best performer. Performance was measured using Area Under Curve (0.845960).
Dilip Singh Sisodia, Somdutta Vishwakarma and Abinash Pujahari (2017), in their study
“Evaluation of Machine Learning Models for Employee Churn Prediction” [10], statistically
analyzed data to find the factors that affect employee attrition. The researchers
also used machine learning algorithms to build models for predicting attrition. A data set
from Kaggle with 10 attributes and 15000 records was used for the study. An analysis of the
data revealed that promotion, workload and the time spent with the company were major factors
affecting attrition.
The study used KNN, Lagrangian Support Vector Machine (LSVM), Decision Tree, Naïve
Bayes and Random Forest for building models. The models were compared on accuracy,
precision, recall, F-measure, false positive rate, specificity and false negative rate.
In terms of accuracy, Random Forest performs best, followed by Decision Tree, K Nearest
Neighbour, Naïve Bayes and lastly LSVM.
In terms of precision, the order of performance is Random Forest, Decision Tree, K Nearest
Neighbour, LSVM and then Naïve Bayes.
In terms of recall, the order of performance is Decision Tree, Random Forest, K Nearest
Neighbour, Naïve Bayes and lastly LSVM.
In terms of F-measure, the order of performance is Random Forest, Decision Tree, K
Nearest Neighbour, LSVM and then Naïve Bayes.
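All seven measures used in this comparison can be derived from the four cells of a binary confusion matrix. The following sketch shows the standard definitions; the function name and the counts are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common evaluation measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts: "positive" means the employee left the organization.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Specificity and false positive rate always sum to 1, as do recall and false negative rate, so the seven measures carry less independent information than their number suggests.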
Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao and Xueqing Zhao, in their article
“Analysis and prediction of employee turnover characteristics based on Machine
Learning” [11], tried to identify the factors leading to employee attrition and
used a machine learning algorithm to predict attrition. This article also used the IBM
data set prepared by data scientists. The correlations between data items were
checked, and a high correlation was found between department and work
role. The attributes “Standard Hours”, “Over 18”, “age”, “employee number” and
“relationship satisfaction” made no significant contribution. Scaling was used to
reduce the differences in magnitude between attribute values. Logistic regression was
used for predicting attrition, and 87.2% accuracy was achieved. The authors also observed
that frequent business travel contributed to attrition. Similarly, employees
with a technical degree had a high probability of attrition. The gender of employees was also
an important factor in attrition. Model fusion using a regressor function in Python
resulted in an accuracy of 89.32%, an improvement over the 87.2% obtained earlier.
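The review does not say which scaling method the authors applied. As one common possibility, the sketch below uses min-max scaling, which maps every attribute to the range 0 to 1 so that attributes with large magnitudes do not dominate the model. The function name and the sample values are assumptions.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up attribute values with very different magnitudes.
monthly_income = [2000, 5000, 20000]
distance_from_home = [1, 10, 29]
scaled_income = min_max_scale(monthly_income)
scaled_distance = min_max_scale(distance_from_home)
```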

In the article “Early Prediction of Employee Attrition using Data Mining Techniques” [12],
Sandeep Yadav, Aman Jain and Deepti Singh used various classification techniques for
predicting employee attrition. The dataset was taken from the Kaggle website. It stored
attributes about each employee, including name and particulars such as project details,
department and promotion. The majority of the fields were numeric; only name,
salary and department were stored as categorical variables.

To reduce the number of attributes, the authors used a brute force approach, one hot
encoding and feature selection. In the brute force approach, departments were divided into
technical and non-technical and assigned the values 0 and 1.
In one hot encoding, salary, which was stored as a categorical variable with the
values low, medium and high, was replaced by three attributes, salary-low,
salary-medium and salary-high, each a numeric field with values 0 and 1.
Department was represented as ten separate features, such as Department IT and Department
R and D, which were numeric variables with values 0 and 1.
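The two encodings described above can be sketched as follows. The set of departments treated as technical, the choice of 1 for technical, and the output field names are assumptions made for illustration; the review does not specify them.

```python
# Hypothetical grouping of departments for the brute force approach.
TECHNICAL = {"IT", "R and D"}


def one_hot(value, categories, prefix):
    """Replace one categorical value with one 0/1 field per category."""
    return {f"{prefix}-{category}": int(value == category) for category in categories}


row = {"salary": "medium", "department": "IT"}

# One hot encoding: salary becomes salary-low, salary-medium and salary-high.
encoded = one_hot(row["salary"], ["low", "medium", "high"], "salary")

# Brute force approach: a single 0/1 flag (1 is assumed to mean technical).
encoded["department_technical"] = int(row["department"] in TECHNICAL)
```

One hot encoding keeps every category distinguishable at the cost of more columns, whereas the brute force grouping keeps a single column but discards the differences between departments within each group.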
Recursive Feature Elimination with cross validation was used to reduce the features.
This method identified and eliminated redundant features, so that only the minimum number
of features with a considerable impact on the output was retained.
A fourth approach was used to explore the causes of attrition among more experienced
employees. Only employees whose “time_spent_in_company” was above 4,
“number_of_projects” more than 5 and “last_evaluation” higher than 0.74 were considered.
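The selection of experienced employees amounts to a row filter on the three thresholds quoted above. In this sketch the field names and thresholds follow the text, while the helper name and the sample records are made up.

```python
def is_experienced(row):
    """Apply the three thresholds quoted for the experienced-employee subset."""
    return (row["time_spent_in_company"] > 4
            and row["number_of_projects"] > 5
            and row["last_evaluation"] > 0.74)


# Made-up records for illustration.
employees = [
    {"time_spent_in_company": 6, "number_of_projects": 7, "last_evaluation": 0.91},
    {"time_spent_in_company": 3, "number_of_projects": 6, "last_evaluation": 0.80},
    {"time_spent_in_company": 5, "number_of_projects": 6, "last_evaluation": 0.74},
]
experienced = [row for row in employees if is_experienced(row)]
```

Only the first record passes: the second has too short a tenure, and the third has an evaluation equal to, not higher than, 0.74.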
The classification techniques SVM, Random Forest, Logistic Regression, Decision Tree
and AdaBoost were compared on the attributes obtained through the brute force approach,
one hot encoding and feature selection, and on the data of experienced employees.
Classification using the above-mentioned algorithms on features
engineered with the brute force method showed that Random Forest gave the highest
accuracy, 0.9863.
Classification using the above-mentioned algorithms on features
engineered with the one hot encoding approach showed that Decision Tree had the
highest accuracy, 0.9817.
The classification techniques Random Forest and AdaBoost were applied to
the set of features reduced using Recursive Feature Elimination with
cross validation. Random Forest gave an accuracy of 0.9863 and AdaBoost
0.9583.
When the algorithms were applied to the data of experienced employees,
Decision Tree showed an accuracy of 0.9927.

3. An analysis of the reviews
Various studies that employ classification techniques to
forecast employee attrition have been reviewed. In each study, the accuracy of the
attrition predictions is measured for all tested algorithms. It is observed that
the accuracy attained by an algorithm varies across data sets. The methods
adopted for preprocessing the data also change the accuracy level. The major
constraint on research in the application of data analytics is the
availability of data. Real data from organizations are not available, as they are highly
confidential. Researchers are therefore constrained to use the ready-made datasets
available on Kaggle.
The best performing classification method in each study is identified and shown in Table V.

Table V: A comparison of classification algorithms

1. Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in Organizations using
   Machine Learning Algorithms”
   Data set used: A data set collected from the Human Resource Information system of a
   retailer and also data from Bureau of Labor Statistics
   Algorithms tested: Extreme Gradient Boosting, Logistic Regression, Naïve Bayesian,
   Random Forest, Linear Discriminant Analysis, Support Vector Machine and K-Nearest Neighbor
   Measuring tool: AUC-ROC curve
   Result: Best performance is put up by XGBoost (AUC 0.88) and maximum memory
   utilization is 12%.

2. Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for Predicting Employee Churn
   by Using Data Mining”
   Data set used: A dataset prepared by IBM
   Algorithms tested: Decision tree, Naïve Bayes, Logistic regression, SVM, KNN, Random forest
   Measuring tool: Confusion matrix
   Result: Best performance is put up by SVM (0.897) and precision (0.98)

3. Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using XGBoost Machine
   Learning Approach”
   Data set used: A dataset prepared by IBM
   Algorithms tested: XGBoost
   Measuring tool: Confusion matrix
   Result: Accuracy - 89.1%

4. Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee Attrition Prediction
   using Data Mining Techniques”
   Data set used: A dataset prepared by IBM
   Algorithms tested: Logistic regression, Support Vector Machine, Gradient boosted
   classifier and Random Forest
   Measuring tool: AUC-ROC curve
   Result: Best performance is put up by Extreme Gradient Boosting (AUC - 0.845960)

5. Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari, “Evaluation of Machine
   Learning Models for Employee Churn Prediction”
   Data set used: Data set from Kaggle
   Algorithms tested: KNN, LSVM, Naïve Bayes, Decision Tree and Random Forest
   Measuring tool: Confusion matrix
   Result: Best performance is put up by Random Forest in terms of accuracy (0.9897),
   precision (0.9981) and F-measure (0.9933).

6. Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis and prediction of
   employee turnover characteristics based on Machine Learning”
   Data set used: A dataset prepared by IBM
   Algorithms tested: Logistic Regression; model fusion using regressor function in Python
   Measuring tool: Confusion matrix
   Result: 87.2% accuracy with Logistic Regression; 89.32% accuracy after applying model fusion

7. Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee Attrition using
   Data Mining Techniques”
   Data set used: Data set from Kaggle
   Algorithms tested: Logistic Regression, SVM, Random Forest, Decision Tree and AdaBoost
   Measuring tool: Confusion matrix
   Result: Random Forest (0.9863) when applied on data preprocessed by the brute force
   method; Decision Tree (0.9817) when applied on the data set preprocessed using the one
   hot encoding approach

It is observed in the above studies that Extreme Gradient Boosting, SVM and Random
Forest exhibit a very high level of performance. The selected literature also shows that
the algorithms applied to the dataset prepared by IBM Watson, which has 1470 records and
35 attributes, seven of them categorical, show accuracy below 90%.

           4. Conclusion
HR analytics using predictive techniques is an area that organizations do not yet use
to its full potential. The studies reviewed in this article demonstrate the use of various
classification techniques for predicting attrition. Similar techniques can be tried for
analyzing and predicting the performance of human resources in an organization. More
research is needed in HR analytics to explore its untapped areas. This can help
organizations select human resources better and so increase productivity for the benefit
of the organization.

References

1. https://expertsystem.com/machine-learning-definition/
2. Tom Mitchell, 1997, Machine Learning, McGraw Hill.
3. https://gordontredgold.com/take-care-of-your-staff-and-they-will-take-care-of-business/
4. Angelo S. DeNisi and Ricky Griffin, 2005, Human Resource Management.
5. https://competency.aicpa.org/media_resources/212508-using-predictive-analytics-in-employee-retention/detail
6. Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in Organizations
using Machine Learning Algorithms”, (IJARAI) International Journal of Advanced
Research in Artificial Intelligence, Vol. 5, No. 9, 2016.
7. Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for Predicting
Employee Churn by Using Data Mining”, IEEE, 2017 International Artificial
Intelligence and Data Processing Symposium (IDAP).
8. Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using XGBoost
Machine Learning Approach”, IEEE, Proceedings of the SMART–2018, IEEE
Conference ID: 44078, 2018 International Conference on System Modeling &
Advancement in Research Trends, 23rd–24th November, 2018.
9. Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee Attrition
Prediction using Data Mining Techniques”, International Journal of Management,
Technology And Engineering, ISSN Online Number: 2249-7455.
10. Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari, “Evaluation of
Machine Learning Models for Employee Churn Prediction”, IEEE, Proceedings of
the International Conference on Inventive Computing and Informatics (ICICI
2017), IEEE Xplore Compliant - Part Number: CFP17L34-ART, ISBN: 978-1-5386-4031-9.
11. Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis and
prediction of employee turnover characteristics based on Machine Learning”,
IEEE, Proceedings of The 18th International Symposium on Communications and
Information Technologies (ISCIT 2018).
12. Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee Attrition
using Data Mining Techniques”, IEEE, Proceedings of 2018 IEEE 8th
International Advance Computing Conference (IACC).
