An Analysis of the Use of Machine Learning for Employee Attrition Prediction - A Literature Review
Journal of Information and Computational Science, ISSN: 1548-7741, Volume 10 Issue 3 - 2020, www.joics.org

Usha P. M., Research Scholar, Department of CA, CS & IT, Karpagam Academy of Higher Education, Coimbatore. E-mail: pmusha.72@gmail.com
Dr. N. V. Balaji, Dean, Faculty of Arts, Science and Humanities, Karpagam Academy of Higher Education, Coimbatore. E-mail: balajinv@karpagam.com

Abstract

Machine learning is an application of artificial intelligence that enables systems to learn and act from experience without being explicitly programmed [1]. By running machine learning algorithms on available past data, future events can be predicted [2]. Machine learning can assign labels to data through classification, and it can also uncover hidden structure in unlabeled data. Top-level management can use this predictive capability to estimate the likelihood that an employee will leave the organization, which in turn helps in controlling the factors that lead to attrition. Employee turnover is a grave challenge for employers, and retaining talent is crucial for every organization. If management can obtain the probability that an employee will leave, together with the factors influencing that separation, it can make decisions that mitigate the risk of attrition. The predictions made by machine learning algorithms indicate the proactive steps management should take to retain employees. This paper reviews the studies conducted in this area to explore the machine learning algorithms that can be used for such predictions and the effectiveness of those predictions.

Key words: Attrition, Machine learning, Predictive analytics, Classification

1. Introduction

“Take care of your employees and they will take care of your business”, opined Sir Richard Charles Nicholas Branson, a British business tycoon [3]. In the current competitive business environment, an organization must invest time and effort to reduce attrition and retain its employees, who are its most valuable assets. High employee turnover puts considerable financial stress on the organization [4]. Since the operations of all departments in present-day organizations are digitalized, data is abundantly available. Human resource details such as job, change in job, time in the company and other demographic features can be extracted from the Human Resource Information System. This data can be used to identify the drivers of attrition and to predict the chance of attrition.

HR analytics is gaining importance in organizations. Descriptive analytics has been used in organizations for some time, but predictive analytics now uses machine learning techniques to predict the future from the available data [5].

Machine learning is an application of artificial intelligence that enables systems to learn and act from experience without being explicitly programmed [1]. By running machine learning algorithms on available past data, future events can be predicted [2]. Machine learning can assign labels to data through classification, or it can uncover hidden structure in unlabeled data. Top-level management can use this predictive capability to estimate the likelihood that an employee will leave the organization, which in turn helps in controlling the factors that lead to attrition. Employee turnover is a grave challenge for employers, and retaining talent is crucial for every organization. If management can obtain the probability that an employee will leave, together with the factors influencing that separation, it can make decisions that mitigate the risk of attrition. This is where machine learning has a role to play: its predictions prompt top management to take proactive steps to retain employees.

2. Literature review

Rohit Punnoose and Pankaj Ajit (2016), in the article “Prediction of Employee Turnover in Organizations using Machine Learning Algorithms”, compare Extreme Gradient Boosting with other selected classification algorithms for predicting employee turnover. The study used a data set collected from the Human Resource Information System of a retailer with global operations, together with data from the Bureau of Labor Statistics [6]. Because the data set was labeled, supervised learning was carried out. The data set contained employee records spread over 18 years, with data for every quarter; 73,115 records labeled ‘active’ or ‘terminated’ formed the source data set. During preprocessing, missing values were replaced by the mean, by the median where outliers were found, or by values chosen on the basis of domain expertise. Eighty percent of the data was used to train the models, with tenfold cross validation performed for each chosen algorithm, and the trained models were then tested on the remaining twenty percent. The classification algorithms implemented in the study are Extreme Gradient Boosting, Logistic Regression, Naïve Bayes, Random Forest, K-Nearest Neighbor, Linear Discriminant Analysis and SVM.

The authors adopted the area under the ROC curve (AUC-ROC) to measure the performance of the classification algorithms. A ROC curve plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis, and the AUC (area under the curve) measures how well the model distinguishes between the classes in the data set. The other measures of comparison are the memory consumed by the process and the run time of the algorithm. The results of running the algorithms on a MacBook are given in Table 1.
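To make the AUC-ROC measure concrete, the following is a minimal sketch in plain Python; it is not taken from the reviewed study. It sweeps a threshold over predicted scores to obtain (FPR, TPR) points and integrates them with the trapezoidal rule. The function names `roc_points` and `auc`, and the six labels and scores, are illustrative assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points found by lowering a threshold over the scores.

    labels: 1 for the positive class (e.g. 'terminated'), 0 otherwise.
    scores: predicted probability of the positive class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    # Highest score first; records with equal scores are handled together.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Illustrative data only: three leavers (1) and three stayers (0).
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(example_labels, example_scores)), 4))  # 0.8889
```

An AUC of 1.0 means every leaver is scored above every stayer, while 0.5 corresponds to random ranking; here 8 of the 9 leaver/stayer pairs are ordered correctly.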
Table 1: Comparison of performance of classification algorithms. Reprinted from “Prediction of Employee Turnover in Organizations using Machine Learning Algorithms” by Rohit Punnoose and Pankaj Ajit, 2016, (IJARAI) International Journal of Advanced Research in Artificial Intelligence, Vol. 5.

From the table it is clear that XGBoost performs best in terms of accuracy and memory utilization.

Ibrahim Onuralp Yiğit and Hamed Shourabizadeh (2017), in their article “An Approach for Predicting Employee Churn by Using Data Mining”, compared the performance of various classification techniques in predicting employee churn [7]. The authors worked on a fictional dataset prepared by IBM data scientists, which has 1470 records and 35 features. Features with no bearing on attrition prediction were removed: an attribute such as employee id cannot be treated as a variable influencing employee churn, and the field over18 has the value “Yes” for every employee. The accuracy obtained by each algorithm is as follows.

Table II: Comparison of accuracy of classification algorithms

| Algorithm | Accuracy |
|---|---|
| Decision tree | 0.813 |
| Naïve Bayes | 0.839 |
| Logistic Regression | 0.855 |
| SVM | 0.887 |
| KNN | 0.867 |
| Random forest | 0.879 |
Table III: The comparison with respect to precision metric

| Algorithm | Precision |
|---|---|
| Decision tree | 0.29 |
| Naïve Bayes | 0.31 |
| Logistic Regression | 0.35 |
| SVM | 0.51 |
| KNN | 0.38 |
| Random forest | 0.45 |

Table IV: The comparison with respect to recall

| Algorithm | Recall |
|---|---|
| Decision tree | 0.43 |
| Naïve Bayes | 0.32 |
| Logistic Regression | 0.31 |
| SVM | 0.31 |
| KNN | 0.23 |
| Random forest | 0.22 |

From the above tables, SVM achieved the highest accuracy (0.887) and precision (0.51), although its recall (0.31) was lower than that of the decision tree (0.43). The authors also applied feature selection methods, including recursive feature elimination, to extract the relevant variables, and then ran the algorithms again on the reduced dataset. Accuracy, precision, recall and F-measure improved slightly, and SVM remained the best performer after recursive feature elimination.

Rachna Jain and Anand Nayyar (2018), in their article “Predicting Employee Attrition using XGBoost Machine Learning Approach”, predicted attrition using XGBoost, a popular machine learning technique [8]. The project was designed around the OSEMN framework, which stands for obtaining, scrubbing, exploring, modeling and interpreting the data. The dataset used was the one prepared by IBM data scientists. Apart from selecting important variables from the dataset, the authors created new attributes such as tenure per job, derived from the number of companies worked, and compa ratio, the ratio between monthly income and the midpoint of the salary range. A small compa ratio suggests that the employee is unlikely to be satisfied with the salary, which may lead to attrition. The authors found that the attributes in the data set that most strongly influence attrition are age, gender, marital status, years at company, job satisfaction and distance from home.
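The two engineered attributes described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the review gives the compa ratio as monthly income divided by the midpoint of the salary range, but it does not state the numerator of tenure per job, so total working years is assumed here. The function and parameter names, and the sample figures, are hypothetical.

```python
def compa_ratio(monthly_income, range_min, range_max):
    """Monthly income divided by the midpoint of the salary range."""
    midpoint = (range_min + range_max) / 2
    if midpoint <= 0:
        raise ValueError("salary range midpoint must be positive")
    return monthly_income / midpoint


def tenure_per_job(total_working_years, num_companies_worked):
    """Average years spent per employer.

    Assumption: the numerator is total working years, and an employee
    recorded with zero previous companies is treated as having held one job.
    """
    return total_working_years / max(num_companies_worked, 1)


# Hypothetical employee: paid 4500 in a band running from 4000 to 6000.
print(compa_ratio(4500, 4000, 6000))  # 0.9
print(tenure_per_job(10, 4))          # 2.5
```

A compa ratio below 1 means the employee is paid less than the midpoint of the range, which is the situation the study associates with a higher risk of attrition.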
Jain and Nayyar used the GLMBoost and XGBoost techniques, obtaining an accuracy of 89% and an error rate of less than 30%.

Jeel Sukhadiya, Harshal Kapadia and Prof. Mitchell D’silva (2018), in their article “Employee Attrition Prediction using Data Mining Techniques”, applied various algorithms to the same IBM data set used in the previous articles. The algorithms used in this study were Logistic Regression, Gradient Boosted Classifier, Support Vector Machine and Random Forest [9]. With the Random Forest classifier, fifteen features were found to be the most influential in deciding the retention of employees, with overtime and monthly income the most prominent. One drawback is that employee number also figured among these fifteen features, although it is only an identifier and should have been excluded. The authors also used Extreme Gradient Boosting to find the most influential factors of attrition; here monthly income was the most important, and employee number again appeared in the list. SVM, Extreme Gradient Boosting, Logistic Regression, Random Forest and an ensemble average were then compared, and Extreme Gradient Boosting was found to be the best performer, with an Area Under Curve of 0.845960.

Dilip Singh Sisodia, Somdutta Vishwakarma and Abinash Pujahari (2017), in their study “Evaluation of Machine Learning Models for Employee Churn Prediction”, statistically analyzed data to find the factors that affect employee attrition and used machine learning algorithms to build models that predict it [10]. A Kaggle data set with 10 attributes and 15000 records was used for the study. Analysis of the data revealed that promotion, workload and the time spent with the company were major factors affecting attrition. The study used KNN, Lagrangian Support Vector Machine (LSVM), Decision Tree, Naïve Bayes and Random Forest to build the models, which were compared on accuracy, precision, recall, F-measure, false positive rate, specificity and false negative rate.
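All seven measures listed above can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the counts are hypothetical and are not results from any reviewed study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common evaluation measures from confusion-matrix counts.

    tp: leavers predicted as leavers      fp: stayers predicted as leavers
    tn: stayers predicted as stayers      fn: leavers predicted as stayers
    """
    def ratio(numerator, denominator):
        # Return 0.0 rather than failing when a class or prediction is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "specificity": ratio(tn, tn + fp),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Hypothetical imbalanced data: 100 leavers among 1000 employees.
metrics = classification_metrics(tp=30, fp=30, tn=870, fn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.900 while recall is only 0.300, which shows why accuracy alone can be misleading when leavers are a small minority of the records.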
In terms of accuracy, Random Forest performed best, followed by Decision Tree, K-Nearest Neighbour, Naïve Bayes and finally LSVM. In terms of precision, the order of performance was Random Forest, Decision Tree, K-Nearest Neighbour, LSVM and then Naïve Bayes. In terms of recall, the order was Decision Tree, Random Forest, K-Nearest Neighbour, Naïve Bayes and finally LSVM. In terms of F-measure, the order was Random Forest, Decision Tree, K-Nearest Neighbour, LSVM and then Naïve Bayes.

Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao and Xueqing Zhao, in their article “Analysis and prediction of employee turnover characteristics based on Machine Learning”, set out to find the factors leading to employee attrition and to predict attrition with a machine learning algorithm [11]. This article also used the data set prepared by IBM data scientists. The correlations between data items were checked, and a high correlation was found between department and work role. The attributes “Standard Hours”, “Over 18”, “age”, “employee number” and “relationship satisfaction” made no significant contribution. Scaling was used to reduce the differences in magnitude between attribute values. Logistic regression was used to predict attrition, and 87.2% accuracy was achieved. The authors also observed that frequent business travel contributed to attrition, that employees with a technical degree had a high probability of attrition, and that gender was an important factor. Model fusion using a regressor function in Python raised the accuracy to 89.32%.

In the article “Early Prediction of Employee Attrition using Data Mining Techniques”, Sandeep Yadav, Aman Jain and Deepti Singh used various classification techniques to predict employee attrition [12].
The dataset used was taken from the Kaggle website. It stored attributes of each employee, including name and other particulars such as project details, department and promotion. The majority of the fields were numeric; only name, salary and department were stored as categorical variables.
To reduce the number of attributes, the authors used a brute force approach, one hot encoding and feature selection. In the brute force approach, departments were divided into technical and non-technical and assigned the values 0 and 1. In one hot encoding, salary, which was stored as a categorical variable with the values low, medium and high, was replaced by three attributes, salary-low, salary-medium and salary-high, each a numeric field with the value 0 or 1. Department was likewise represented as ten separate features, such as Department IT and Department R and D, each a numeric variable with the value 0 or 1. Recursive Feature Elimination with cross validation was used to find and eliminate redundant features, so that only the minimum number of features with considerable impact on the output was retained. A fourth approach was used to explore the causes of attrition among more experienced employees: only the data of employees whose “time_spent_in_company” was above 4, “number_of_projects” was more than 5 and “last_evaluation” was higher than 0.74 were considered.

The classification techniques SVM, Random Forest, Logistic Regression, Decision Tree and AdaBoost were compared on the attributes selected using brute force, one hot encoding and feature selection, and on the data of experienced employees. On features engineered using the brute force method, Random Forest gave the highest accuracy, 0.9863. On features engineered using one hot encoding, Decision Tree gave the highest accuracy, 0.9817. Random Forest and AdaBoost were run on the set of features reduced using Recursive Feature Elimination with cross validation.
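The encoding and filtering steps described above can be sketched in plain Python as follows. This is an illustration only, not the authors' implementation. The field names `time_spent_in_company`, `number_of_projects` and `last_evaluation` and the salary levels come from the description above; the set of departments treated as technical, the choice of 1 for technical, and the sample record are assumptions.

```python
# Assumption: the review does not list which departments count as technical.
TECHNICAL_DEPARTMENTS = {"IT", "RandD", "technical", "support"}
SALARY_LEVELS = ("low", "medium", "high")


def department_is_technical(department):
    """Brute force encoding: 1 for technical, 0 otherwise (direction assumed)."""
    return 1 if department in TECHNICAL_DEPARTMENTS else 0


def one_hot(value, categories, prefix):
    """Replace one categorical value by one 0/1 field per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix}-{category}": int(value == category) for category in categories}


def is_experienced(record):
    """Filter used for the fourth approach, with the thresholds quoted above."""
    return (
        record["time_spent_in_company"] > 4
        and record["number_of_projects"] > 5
        and record["last_evaluation"] > 0.74
    )


# Hypothetical employee record.
employee = {
    "department": "IT",
    "salary": "medium",
    "time_spent_in_company": 5,
    "number_of_projects": 6,
    "last_evaluation": 0.8,
}

print(department_is_technical(employee["department"]))
print(one_hot(employee["salary"], SALARY_LEVELS, "salary"))
print(is_experienced(employee))
```

The brute force encoding keeps a single department column, whereas one hot encoding adds one column per category, so the two approaches trade information against the number of features.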
On the reduced feature set, Random Forest gave an accuracy of 0.9863 and AdaBoost 0.9583. When the algorithms were applied to the data of experienced employees, Decision Tree showed an accuracy of 0.9927.

3. An analysis of the reviews

Various studies that employ classification techniques to forecast employee attrition have been reviewed. In each study, the accuracy of the attrition predictions is measured for all tested algorithms. The accuracy attained by a given algorithm changes from one data set to another, and the methods adopted for preprocessing the data also change the accuracy level. The major constraint on research in this application of data analytics is the availability of data: real data from organizations is not available because it is highly confidential, so researchers are constrained to use ready-made datasets available on Kaggle. The best performing classification method in each study is shown in Table V.
Table V: A comparison of classification algorithms

| Name of the article and author | Data set used | Algorithms tested | Measuring tool | Result |
|---|---|---|---|---|
| Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in Organizations using Machine Learning Algorithms” | A data set collected from the Human Resource Information System of a retailer and also data from Bureau of Labor Statistics | Extreme Gradient Boosting, Logistic Regression, Naïve Bayes, Random Forest, Linear Discriminant Analysis, Support Vector Machine and K-Nearest Neighbor | AUC-ROC curve | Best performance is put up by XGBoost (AUC 0.88) and maximum memory utilization is 12% |
| Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for Predicting Employee Churn by Using Data Mining” | A dataset prepared by IBM | Decision tree, Naïve Bayes, Logistic regression, SVM, KNN, Random forest | Confusion matrix | Best performance is put up by SVM (0.897) and precision (0.98) |
| Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using XGBoost Machine Learning Approach” | A dataset prepared by IBM | XGBoost | Confusion matrix | Accuracy 89.1% |
| Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee Attrition Prediction using Data Mining Techniques” | A dataset prepared by IBM | Logistic regression, Support Vector Machine, Gradient boosted classifier and Random Forest | AUC-ROC curve | Best performance is put up by Extreme Gradient Boosting (AUC 0.845960) |
| Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari, “Evaluation of Machine Learning Models for Employee Churn Prediction” | Data set from Kaggle | KNN, LSVM, Naïve Bayes, Decision Tree and Random Forest | Confusion matrix | Best performance is put up by Random Forest in terms of accuracy (0.9897), precision (0.9981) and F-measure (0.9933) |
| Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis and prediction of employee turnover characteristics based on Machine Learning” | A dataset prepared by IBM | Logistic Regression; model fusion using a regressor function in Python | Confusion matrix | 87.2% accuracy with Logistic Regression; 89.32% accuracy after applying model fusion |
| Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee Attrition using Data Mining Techniques” | Data set from Kaggle | Logistic Regression, SVM, Random Forest, Decision Tree and AdaBoost | Confusion matrix | Random Forest (0.9863) when applied on data preprocessed by the brute force method; Decision Tree (0.9817) when applied on data preprocessed using the one hot encoding approach |

It is observed in the above studies that Extreme Gradient Boosting, SVM and Random Forest exhibit a very high level of performance. The selected literature also shows that the algorithms applied to the dataset prepared by IBM Watson, which has 1470 records and 35 attributes, seven of them categorical, show accuracy below 90%.

4. Conclusion

HR analytics using predictive techniques is an area that organizations do not yet use optimally. The studies reviewed in this article demonstrate the use of various classification techniques for predicting attrition. Similar techniques can be tried for analyzing and predicting the performance of human resources in an organization. More research is needed in HR analytics to explore its untapped areas. This can help organizations select human resources better and so increase productivity for the benefit of the organization.
References

1. https://expertsystem.com/machine-learning-definition/
2. Tom Mitchell, 1997, Machine Learning, McGraw Hill.
3. https://gordontredgold.com/take-care-of-your-staff-and-they-will-take-care-of-business/
4. Angelo S. DeNisi, Ricky Griffin, 2005, Human Resource Management.
5. https://competency.aicpa.org/media_resources/212508-using-predictive-analytics-in-employee-retention/detail
6. Rohit Punnoose, Pankaj Ajit, “Prediction of Employee Turnover in Organizations using Machine Learning Algorithms”, (IJARAI) International Journal of Advanced Research in Artificial Intelligence, Vol. 5, No. 9, 2016.
7. Ibrahim Onuralp Yiğit, Hamed Shourabizadeh, “An Approach for Predicting Employee Churn by Using Data Mining”, IEEE, 2017 International Artificial Intelligence and Data Processing Symposium (IDAP).
8. Rachna Jain and Anand Nayyar, “Predicting Employee Attrition using XGBoost Machine Learning Approach”, IEEE, Proceedings of the SMART-2018, IEEE Conference ID: 44078, 2018 International Conference on System Modeling & Advancement in Research Trends, 23rd-24th November, 2018.
9. Jeel Sukhadiya, Harshal Kapadia, Prof. Mitchell D’silva, “Employee Attrition Prediction using Data Mining Techniques”, International Journal of Management, Technology And Engineering, ISSN Online Number: 2249-7455.
10. Dilip Singh Sisodia, Somdutta Vishwakarma, Abinash Pujahari, “Evaluation of Machine Learning Models for Employee Churn Prediction”, IEEE, Proceedings of the International Conference on Inventive Computing and Informatics (ICICI 2017), IEEE Xplore Compliant - Part Number: CFP17L34-ART, ISBN: 978-1-5386-4031-9.
11. Heng Zhang, Lexi Xu, Xinzhou Cheng, Kun Chao, Xueqing Zhao, “Analysis and prediction of employee turnover characteristics based on Machine Learning”, IEEE, Proceedings of The 18th International Symposium on Communications and Information Technologies (ISCIT 2018).
12. Sandeep Yadav, Aman Jain, Deepti Singh, “Early Prediction of Employee Attrition using Data Mining Techniques”, IEEE, Proceedings of 2018 IEEE 8th International Advance Computing Conference (IACC).