Fuel efficiency and safety in Coca-Cola FEMSA last-mile logistics
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Fuel efficiency and safety in Coca-Cola FEMSA last-mile logistics by Arturo Torres Arpi Acero Industrial and Systems Engineer, ITESM CSF and Fernando González Gil Industrial and Systems Engineer, ITESM CSF SUBMITTED TO THE PROGRAM IN SUPPLY CHAIN MANAGEMENT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE IN SUPPLY CHAIN MANAGEMENT AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2021 © 2021 Arturo Torres Arpi Acero and Fernando González Gil. All rights reserved. The authors hereby grant to MIT permission to reproduce and to distribute publicly paper and electronic copies of this capstone document in whole or in part in any medium now known or hereafter created. Signature of Author: ____________________________________________________________________ Department of Supply Chain Management May 14, 2021 Signature of Author: ____________________________________________________________________ Department of Supply Chain Management May 14, 2021 Certified by: __________________________________________________________________________ Dr. María Jesús Saenz Gil de Gómez Executive Director, Supply Chain Management Blended Program Capstone Advisor Accepted by: __________________________________________________________________________ Prof. Yossi Sheffi Director, Center for Transportation and Logistics Elisha Gray II Professor of Engineering Systems Professor, Civil and Environmental Engineering
Fuel efficiency and safety in Coca-Cola FEMSA last-mile logistics by Arturo Torres Arpi Acero and Fernando González Gil Submitted to the Program in Supply Chain Management on May 14, 2021 in Partial Fulfillment of the Requirements for the Degree of Master of Applied Science in Supply Chain Management ABSTRACT Across industries and supply chains, the safety of drivers and efficient use of fuel by truck fleets are an increasing concern. This project focused on understanding driving styles, understanding the tradeoffs between safe and efficient driving styles, and finding the highest levels of safety and fuel efficiency. We worked with Coca-Cola FEMSA to analyze one year of telematics data from over 3,000 vehicles. To analyze the data, we employed a methodology that involved multiple machine learning and analytical techniques, including multiple regressions, a random forest classification algorithm, Bayesian Gaussian Mixture Model for clustering, what-if simulations, and the use of interactive data visualization tools. These techniques were used first to understand the main fuel efficiency drivers, then to understand the drivers of safety, and finally to understand the trade-offs between fuel efficiency and safety with respect to different driving styles. Our results show that significant gains can be achieved in terms of fuel efficiency by changing driving behaviors. Results from the regression and simulator show that average speed, acceleration events and maximum RPM are the 3 most important variables for fuel efficiency. With small changes like increasing speed by 1km/h, reduce acceleration events in 5% and reduces maximum RPM by 5% fuel efficiency can be increased by 6%. We also demonstrate the main factors defining safety and their relative importance. Finally, we cluster driving styles and suggest good practices to replicate the best driving styles between different driving style clusters. Through a change management framework, we propose how some drivers could improve Coca-Cola FEMSA’s safety proxy by 34% without sacrificing fuel efficiency. Capstone Advisor: Dr. María Jesús Saenz Gil de Gómez Title: Executive Director, Supply Chain Management Blended Program ACKNOWLEDGMENTS 2
We would like to thank our advisor, Dr. Maria Jesus Saenz, for providing guidance and support throughout this project. Next, we want to thank Pamela Siska for reviewing our reports and providing detailed feedback on areas for improvement. Lastly, we both would like to thank our families and partners, for always being supportive throughout our master’s program. 3
TABLE OF CONTENTS LIST OF FIGURES ............................................................................................................................................................5 LIST OF TABLES ..............................................................................................................................................................6 1 INTRODUCTION .....................................................................................................................................................7 2 LITERATURE REVIEW .............................................................................................................................................9 2.1 Safety .........................................................................................................................................................10 2.2 Fuel Efficiency and Costs ............................................................................................................................14 2.3 Fuel Efficiency and Sustainability ...............................................................................................................16 2.4 Conclusions ................................................................................................................................................18 3 METHODOLOGY ..................................................................................................................................................20 3.1 Business understanding .............................................................................................................................21 3.2 Data understanding....................................................................................................................................22 3.3 Data preparation ........................................................................................................................................23 3.4 Modeling ....................................................................................................................................................24 3.4.1 Fuel Efficiency ....................................................................................................................................24 3.4.2 Safety .................................................................................................................................................26 3.4.3 Cluster Analysis..................................................................................................................................28 3.4.4 Individual Cluster Analysis .................................................................................................................29 3.5 Conclusions ................................................................................................................................................30 4 RESULTS ..............................................................................................................................................................30 4.1 Fuel Efficiency ............................................................................................................................................30 4.1.1 Regression Model ..............................................................................................................................30 4.1.2 Fuel Efficiency Scenario Analysis .......................................................................................................34 4.1.3 Anomaly Detection ............................................................................................................................36 4.2 Safety .........................................................................................................................................................37 4.3 Fuel Efficiency and Safety ..........................................................................................................................39 5 DISCUSSION ........................................................................................................................................................43 5.1 Fuel Efficiency ............................................................................................................................................43 5.2 Safety .........................................................................................................................................................44 5.3 Fuel Efficiency and Safety ..........................................................................................................................44 6 INSIGHTS AND MANAGEMENT RECOMMENDATIONS........................................................................................46 6.1 Fuel Efficiency ............................................................................................................................................46 6.2 Fuel Efficiency and Safety ..........................................................................................................................47 7 FUTURE RESEARCH..............................................................................................................................................50 7.1 Fuel Efficiency ............................................................................................................................................50 7.2 Safety and Fuel Efficiency ..........................................................................................................................51 8 CONCLUSION.......................................................................................................................................................52 REFERENCES.................................................................................................................................................................53 4
LIST OF FIGURES Figure 1: Characteristic turn maneuver and lane change patterns ............................................................ 11 Figure 2: Forces Influencing Driver Safety (Douglas and Swartz, 2016) ..................................................... 13 Figure 3: Factors Influencing Fuel Efficiency .............................................................................................. 14 Figure 4: Factors Influencing Sustainable Supply Chains ............................................................................ 18 Figure 5: Driving Styles Independent Variables and Dependent Variables ................................................ 19 Figure 6: Methodology ............................................................................................................................... 20 Figure 7: Cross Industry Standard Process for Data Mining CRISP-DM (Shearer, 2000) ............................ 21 Figure 8: Histogram of Fuel Efficiency ........................................................................................................ 31 Figure 9: Pareto Chart of the Standardized Effects .................................................................................... 31 Figure 10: Residuals for Linear Regression Model...................................................................................... 32 Figure 11: Prediction Error for Linear Regression....................................................................................... 32 Figure 12: Telematics Parameters Correlation Matrix ............................................................................... 33 Figure 13: Scenario Analysis ....................................................................................................................... 35 Figure 14: Anomaly Detection Example ..................................................................................................... 36 Figure 15: Distribution of Safety Score Classes .......................................................................................... 37 Figure 16: ROC Analysis for Safety Classification for Random Forest......................................................... 38 Figure 17: Relative Feature Importance ..................................................................................................... 39 Figure 18: Driving Style Clusters’ Mean Safety Score and Fuel Efficiency .................................................. 40 Figure 19: Condensed View of Driving Styles ............................................................................................. 41 5
LIST OF TABLES Table 1: Classification Model Comparison.................................................................................................. 38 Table 2: Features used for Cluster Analysis ................................................................................................ 29 Table 3: Fuel Efficiency Linear Regression Results ..................................................................................... 34 Table 4: Change Motivator Nudges for Driving Styles ................................................................................ 48 6
1 INTRODUCTION Across industries and supply chains, the safety of drivers and efficient use of fuel of truck fleets is an increasing concern. This project focused on the research of driving styles that promote the highest levels of safety and fuel efficiency, as well as the tradeoffs that can happen between the two. The hypothesis of the project was that statistically significant different driving styles could be uncovered that showed the most fuel efficient and safest driving styles and that we would also find inherent trade-offs between safety and fuel efficiency. This introduction discusses six main points: the impact of fuel efficiency in terms of cost and carbon dioxide emissions, how driving style is related to fuel efficiency and safety, how telematics data can be used to track driving styles, how analytics can be used to analyze this problem and the data that we will be using for this project. Across industries and supply chains, fuel consumption presents a two-fold problem, as it is an issue that directly affects both companies’ profits and the environment. Proof of this comes from a study carried out by Chainalytics (2020) in which they estimated that, on average, transportation costs make up to 50 to 60% of all supply chain operating costs. A similar study, done in Mexico, found that fuel contributes approximately 38.5% of the total direct costs of road transportation in Mexico (Moreno, 2014). Besides, transportation was responsible in 2010 for approximately 23% of worldwide energy-related CO2 emissions (IPCC, 2014). Therefore, finding ways to reduce fuel consumption across industries is a great win-win solution to both reducing climate impact and helping companies´ bottom lines. One of the main factors affecting fuel efficiency is driving style. The difference in fuel consumption between the most and least efficient drivers can be as high as 35%, according to a report by the American Trucking Association’s Technology and Maintenance Council (Hooper & Murray, 2018). Driving style is also crucial to ensure the safety of the drivers and their communities; the German Federal Statistical Office 7
(2010) presented in their accidents report that 69% of the accidents in Germany happen because of drivers’ mistakes. Also, according to the Dutch Eco-Drive initiative (www.ecodrive.org, 2001), a safer driving style is more efficient and reduces pollution. Fleet driving companies are dealing with a big gap in driving styles, hurting them in safety, environmental impact, and cost. There are many ways to monitor driving styles. One of them is through vehicle telematics and the way that it works is that each vehicle contains multiple sensors that gather data on various metrics and events like CO2 emissions, fuel consumption, RPM, tire pressure, speed, etc. The data provided by vehicle telematics is extremely valuable, McKinsey (Gao, Kaas, Wee, 2018) claims in its report automotive revolution perspective towards 2030 that recurring revenue from data-driven and on-demand mobility services could increase by $1.5 trillion in 2030. This is 30% of the overall automotive revenue pools. Important developments in telematics are coming in the next few years. However, there is still a huge challenge with vehicle telematics: each vehicle can generate extensive data streams, which cannot be analyzed with traditional methods and/or spreadsheets. Telematics providers already have standard safety and cost-related web reports, but usually, but these reports tend to be descriptive in nature and only provide data from past events without being able to advise on the best actions forward. To provide predictive and prescriptive analytics a more advanced approach is needed. This approach is nowadays possible thanks to increase in computer processing power and the decrease in database storage capacity costs. Our approach leverages on this by integrating various machine learning and analytics techniques into a multi-methodological approach to answer our research question. The data analyzed came from Coca-Cola FEMSA based out of Mexico. This company operates a fleet of around 3,000 telematics-enabled delivery trucks in Mexico alone. These trucks are focused on the last- mile delivery from regional distribution centers to both large grocery stores and nano stores all around 8
the country. The data from the telematics sensors is sent continuously to the cloud to give real-time insights to the company and can also be exported to analyze historical datasets. The company provided us with an entire year´s worth of data from those 3,000 trucks which accounts for around 600,000 driving days. In summary, our project focused on finding the safest and most fuel-efficient driving styles and the inherent tradeoffs between safety and fuel efficiency. We are using data from the truck fleet of Coca-Cola FEMSA Mexico. To answer our research question, we will be using a multimethodological approach that includes various machine learning and analytics techniques. We hope that this research leads to data- driven improvements in efficiency and safety for this company and the last-mile transportation industry. 2 LITERATURE REVIEW Numerous articles have been written on transportation safety, fuel efficiency, sustainable supply chains, driving styles, and telematics analytics. The objective of this project is to find out which driving styles are the most fuel-efficient and safest and analyzing the inherent tradeoffs between safety and efficiency. Different approaches for understanding driving behaviors have been researched throughout history, all the way from manual surveys to advanced analytics on telematic data. Three main modeling scales used to estimate and understand fuel consumption were defined by Chen et al. (2017) as: First, the microscopic approach takes near real time values and builds a model given the second-to-second decisions of the driver. Second, the macroscopic approaches look at cumulative data from long periods and uses one trip or day as a measure of unit. Third, the mesoscopic approaches, which makes a hybrid combination of microscopic and macroscopic data to model fuel consumption. Our approach will focus on the macroscopic level by analyzing the summary of daily metrics per truck. With the advent of big data and advances in machine learning practices, a new branch of analytics was created that is commonly referred to as telematics analytics. Several approaches can be taken to analyze 9
the massive amounts of data coming from the telematic devices. Carlos et al. (2020) analyzed vehicle telematics around aggressive behavior and the relation to road accidents worldwide. Their approach used first and second order representations to model accelerometer data for classifying driving behavior. Following this approach, we are looking into incorporating accelerometer data, among other variables, into our model. Various projects analyzed the data that can be collected from smartphone sensors, for example, the article published by Kang and Banerjee (2017) in which they showed how modern smartphones can be used widely to collected data on accelerations, brakes, turns and lane changes. This can serve as encouragement for smaller firms that lack access to advanced telematic devices but want to tap into the advantages of analyzing their drivers’ style to improve on their safety, sustainability, and fuel efficiency. By leveraging the framework and insights provided in this project, firms can even use readily available data as shown in the paper by Kang and Banerjee. In this project we not only extracted telematics data but also further analyzed it to propose data-driven best practices, taking into considerations different impacts of a driving style. 2.1 Safety Fleet management’s top priority is generally safety. A car accident can be very hard to model or predict given the chaotic nature of an accident and the high number of external factors that can influence an accident. Driver behavior is certainly one of the biggest factors. Johnson et al. (2009) found that “as many as 56% of deadly crashes involve one or more unsafe driving behaviors typically associated with aggressive driving”. An aggressive driving style is a behavioral pattern or classification of a driver which is associated with risky speeding profiles (irregular, instantaneous and abrupt changes in vehicle speed). Toledo et al. (2008) also found that even after controlling for the larger distances they drive, company car drivers are 50% more likely to be involved in car crashes compared to other drivers. 10
Some studies focus on understanding driving styles from data in multiple applications, such as autonomous driving, insurance applications, and driver distraction detection. Meiring et al. (2015) reviewed the ongoing research on driving style analysis systems and their applications and synthetized the updated research in their article “A Review of Intelligent Driving Style Analysis Systems and Related Artificial Intelligence Algorithms”. According to them, one of the most traditional ways to rate driving styles is through surveys. For example, the Driver Behavior Questionnaire (DBQ) used by Richard Rowe et al. tested repeatedly with DBQ around 12,000 drivers six months after they passed their driver’s license test and confirmed the integrity and validity of DBQ as a driver behavior measure in traffic accident prediction, however this approach requires a manual input of each driver and relies in the driver’s integrity and may vary with time. Similar approaches were proposed by Houston (2003) with the Aggressive Driving Behavior Scale (ADBS) and by Harris (2014) with the Prosocial and Aggressive Driving Inventory (PADI). Houston and Harris have shown that there is a statistical correlation between driver behavior and crash involvement. Their study focused mainly on individual variability associated with numerous parameters such as age, gender, and geographic locations. Another approach to measuring driving safety is by analyzing in-vehicle data (Toledo et al., 2008). used IVDR (In Vehicle Data Recorder) to measure different factors related to safety. For example, the acceleration and direction of the vehicle, both in the lateral and longitudinal directions are measured by accelerometers at a sampling rate of 40 measurements per second. The vehicle speed is derived from the GPS receiver data or from the vehicle speed sensor (VSS). Then they apply pattern recognition algorithms to the raw measurements to detect maneuvers that the vehicle performs. In Error! Reference source not found. we can appreciate how a turn maneuver (left) and a lane change (right) can be differentiated by detecting the changes in longitudinal and lateral acceleration. Figure 1: Characteristic turn maneuver and lane change patterns 11
Note: In their research, Toledo et al. (2008) use this information to calculate risk indices that indicate on the overall trip safety. Drivers receive feedback through various summary reports, real-time text messages or an in-vehicle display unit. Reductions in crash rates and the risk indices are observed in the short-term. Another more recent study by Amarasinghe et al. (2015) proposed a cloud-based driver monitoring and vehicle diagnostic app with OBD2 Telematics. They design an architecture in which an OBD2 (On Board Diagnostics) sensor reads the data generated in real time from the vehicle computer and monitor different parameters, such as speed, acceleration, and cooler temperature. These inputs are then processed and analyzed automatically by algorithms that detect reckless driving from high lateral and longitudinal acceleration changes, proving the possibility of developing an application that could give feedback. The study shows different metrics and the way to measure them but does not go in the detail of the interactions between different parameters or their relations with the driving styles. Lack of attention is another important factor defining driver safety. The “100-Car Naturalistic Study” was a study designed by the National Highway Safety Administration (NHTSA) in collaboration with the Virginia Tech Transportation Institute (VTTI) to provide insight on the influence and contribution of driver behavior immediately preceding an accident. This study was performed on 100 vehicles fitted out with surveillance and other sensor devices for a duration of a year, driven collectively for nearly 2 million miles, and 12
accumulated 42,000 h of data from the 241 drivers. It revealed that 78% of the 82 accidents recorded, and 65% of the 761 near accidents, were the direct effect of driver inattention. Douglas and Swartz (2016) proposed that three main factors that affect driver safety: external forces, organizational forces, and regulatory forces. Regulatory forces include all the regulations and policies coming from governmental agencies. Organizational forces are the ones set by a company such as dispatching policies, safety priorities and the climate risk of acceptance. External forces, such as road and weather conditions, also affect a driver safety in a variety of ways. Figure 2: Forces Influencing Driver Safety (Douglas and Swartz, 2016) Organizational forces can be adjusted to improve driver´s safety, as demonstrated by Rodriguez, Targar and Belzer (2006). They proved that driver safety, as measured by crash incidence, can be improved by two factors. The first one is by increasing retention of employees, as more experienced drivers have fewer accidents. The second one is by increasing the pay regime, as their data showed how better paid drivers had also fewer accidents. 13
2.2 Fuel Efficiency and Costs The optimization of fuel use while driving is also affected by various factors including the inherent efficiency of the truck, the optimization of the route and the maintenance of the truck. One way companies can improve their fuel efficiency is by incentivizing drivers to reduce their fuel use. Adamidis, Mantouka and Vlahogianni (2020) also showed that adopting smooth driving can have a statistically significant impact on fuel efficiency and emissions. Some of the behaviors that they observed related to the acceleration and the braking speeds of the vehicles. Figure 3 shows some of the various factors that influence fuel efficiency. Figure 3: Factors Influencing Fuel Efficiency In this project, the main variables to be analyzed are the Vehicle Make and Model and the Driving Styles, as there is no readily available data on fuel quality, and routing optimization is out of the scope of this project. The expected fuel efficiencies for each Vehicle, Make and Model can be found on the specific car manufacturer’s websites. To ensure the applicability for Coca-Cola FEMSA and avoid any outside effect, the project team ran a regression model to understand the impact in fuel efficiency of the mentioned parameters. 14
The term Eco-driving was credited by the UK government as the adoption of a driving behavior that maximizes the efficiency of the vehicle’s engine. Xu et al. (2014) revealed that eco-driving can reduce fuel consumption by an amount ranging from 15% to 25% and GHG emissions by at least 30%. A recent study by Panagiotis Fafoutellis (2014) performs an in-depth overview of existing research regarding eco-driving, in which he concludes that ICT (Information and Communications Technology) systems to generate and store data are crucial for the quantification and understanding of the effects that different driving styles can have. Driving style is also remarked in existing research as one of five components of fuel consumption, with road geometry, vehicle specifications, traffic and weather conditions being among the most influential (Gilman, et. al, 2015). Another conclusion from their study is that a big data approach is needed to jointly consider data from different sources of information. As stated by Fafoutellis (2014), Linear models can be considered more useful in assessing the influence and importance of each factor in fuel consumption rather than predicting it, while machine learning and deep learning algorithms, such as AdaBoost and neural networks perform better to predict the fuel efficiency, but do not offer much insight since they are not explainable models. Ping et al. (2014) explained that modeling driving behavior under inherently dynamic driving conditions is complex. They also showed how making a quantitative analysis of the relationship between the driving behavior and the fuel consumption is difficult. Nevertheless, in their study, they applied machine learning algorithms to smartphone data to implement driving style identification. Several studies have used smartphone data to mimic vehicle telematics data. They showed that speed and acceleration are discretized by the smartphone which increases the error margin, and that smartphone data does not provide the same number of parameters that can be extracted directly from the vehicle’s computer. In their study, they developed a deep learning framework (LSTM) and used K-Means clustering to separate drivers into different profiles and then estimate fuel consumption. Although some of the parameters were 15
discretized and they did not have all possible parameters, their model achieves an accuracy greater than 80%. To interpret the data from telematics, Yao et al. (2020) developed various Machine Learning models focusing on driving behavior (speed, acceleration, constant speed duration and braking). The algorithms exploited were Neural Networks, Random Forest and Support Vector Regression. All models achieved RMSE values of 0.87, 0.89 and 0.78 respectively, which correspond to a MAPE of less than 10%. Vittorio Astarita et al. (2013) managed to develop an app called EcoSmart, which replies driving behavior describing apps for a limited set of parameters without the need of connecting with OBDII or any vehicle data. EcoSmart generates fuel consumption simulations based on smartphone GPS and a set of tuned parameters that vary in under 5% with data reported by vehicle telematics. While this approach is practical and easier than the previously mentioned OBDII connected app, it is limited since the simulations could bring less accurate results for shorter and more chaotic routes, e.g., with variable traffic, slopes, stops and load, which is this study’s focus. Common approaches to improve driver styles mentioned in research include targeted pricing policies. For example, Fafoutellis et al. (2020) suggested new regulations for alternative fuel vehicles and a systematic upgrade of the transport infrastructure towards a more connected and cooperative city environment. On the other hand, Scania, C. (2014) mentioned that “by gaining knowledge of the impact of their actions on fuel consumption, drivers are more likely to adopt more environmentally friendly practices”. Another impactful approach is designing a proper driving reward system. Lai (2015) showed how through a proper reward system, a 10% improvement in fuel consumption efficiency was achieved. 2.3 Fuel Efficiency and Sustainability With the increased attention around climate change and corporate responsibility, a growing number of companies are looking into becoming more sustainable for the environment. Companies, such as Amazon, 16
have signed Climate Pledges to build sustainable business which they translate as becoming net zero carbon. The idea behind net zero carbon is to eliminate or offset all CO2 emissions that are produced in any point of their supply chains. Our project helps companies in identifying fuel efficient driving styles that can lessen the amount of CO2 produced. Fuel consumption and CO2 emissions go hand in hand, and several factors influence how efficient fuel consumption can be. Demir et al. (2011) considered four different factors: the vehicle, the environment, the traffic, and the operations. Demir et al. (2014) published another article that narrowed down the factors that most influenced the number of emissions to total mass, speed, and road gradient. Another approach that was proposed by Jaller et al. (2015) was to move the transportation of goods and materials to hours of the day in which there is less congestion. Their study showed that freight deliveries that were done in hours with less traffic led do reduce fuel consumption. This last study supports the observations from previously mentioned studies in which constant changes in speed lead to higher consumption of fuel. Optimizing truck allocation by truck type is another approach that has been taken to reduce fuel consumption. Velázquez et al. (2016) standardized this approach into a methodology that uses K-means clustering and Tukey´s method to cluster trucks into certain types. These types can then be optimally assigned to environments in which they would perform at their best and in this way reducing the overall emissions produced by the fleet. Many of the factors that impact fuel efficiency positively are the same ones that impact the environment positively. As the main offender to sustainability in transportation is the burning of fossil fuels, we see no inherent tradeoff between these two topics. 17
Figure 4: Factors Influencing Sustainable Supply Chains Note: The factors considered for this project are the analysis of fuel efficiency in hours that have low traffic, the changes in environment such as constant changes in altitude, the information around the vehicles and the general constraints of the operation. 2.4 Conclusions From data gathering to problem modelling, various challenges have been found for understanding driving styles through data and proposing strategies for improving safety and fuel efficiency. Many articles have been created given the timely importance of safety, cost, and sustainability, as well as the increasing technological development and general interest for data science and machine learning. Specific solutions for specific problems related to driving behavior have been designed; some traditional methods like DBQ and some other more complex like the diagnostic app with OBD2 connected with vehicle telematics. To provide insights into how all these different factors relate to and affect each other, this project analyzes the interactions around vehicles and their characteristics, the environment in which the drivers conduct their day-to-day business, organizational forces such company policies and regulations, driving styles such as constant changes in velocity, vehicle cargo as measured by the gross weight of the cargo and traffic. All 18
these different variables are analyzed in the context of the two main dimensions of this project which are safety and fuel efficiency. The articles that have been cited center on specific aspects of either safety or fuel efficiency. Our project goes further by also analyze the trade-off situations between the different variables and to also find situations in which there are clear win-win scenarios. Figure 5: Driving Styles Independent Variables and Dependent Variables Note: The variables in squares are a fraction of the independent variables that are used to describe the driving styles, the variables in the center are the dependent variables that are the focus of the project. 19
3 METHODOLOGY To understand which driving styles can help fleet owners improve safety and fuel efficiency, we followed an approach that involved multiple machine learning and analytical techniques. Figure 6 shows our overall approach. The first step consisted in Business Understanding, Data Understanding and Data Preparation, all of which is explained in detail in sections 3.1, 3.2 and 3.3. The second part of our methodology involved doing regression analysis to understand how our different independent variables impact fuel efficiency (section 3.4.1) and a classification model to see how variables affect safety (section 3.4.2). The third part involved factor analysis to reduce the dimensionality of our dataset to create clusters that represented different driving styles (section 3.4.3). The fourth part involved analyzing each of the clusters to understand what differentiates each driving style and how this impacts safety and fuel efficiency (section 3.4.4). The fifth and final part was drafting the conclusions to propose best practices to increase safety and fuel efficiency (3.4.5). Figure 6: Methodology Note: numbers on the diagram represent the section in this document where more information can be found. 20
Throughout the project, we followed the Cross Industry Standard Process (CRISP) for Data Mining framework (Shearer, 2000). This framework constitutes a general framework that emphasizes the iterative nature of data mining problems. Figure 7 shows the general framework; its application to this project is detailed in the following sections. The Evaluation steps are discussed in parts throughout this section and in the Results section. The Deployment was not part of the scope of this project, but several recommendations for deployment are given in section 6, Insights and Management Recommendations. Figure 7: Cross Industry Standard Process for Data Mining CRISP-DM (Shearer, 2000) 3.1 Business understanding Our initial research helped us understand the Coca-Cola FEMSA´s business needs. This research consisted of reading academic documents, industry reviews, and annual reports shared by organizations such as the Intergovernmental Panel on Climate Change (IPCC) and the American Transportation Research Institute 21
(ATRI) for a high-level overview. This research showed that drivers’ decisions and driving style is one of the main factors defining the safety and fuel efficiency of any company’s fleet. Coca-Cola FEMSA also shared documentation regarding their telematics approach and objectives, to focus the project and validate the company´s approach to using Telematics Data by comparing them with the reviewed literature. Based on the initial research, the company’s main priorities are safety and fuel efficiency (which affect equally cost savings and CO2 emissions). A series of weekly interviews and discussions were held with Coca-Cola FEMSA’s secondary distribution stakeholders. In these discussions, insights were shared from visualizations in Power BI, receiving feedback and interpretation from Coca-Cola FEMSA´s experts. During these sessions we interviewed the Director of Distribution, the Telematics Managers, and the Digital Analytics teams. To gain a better understanding of the day-to-day of the truck drivers of Coca-Cola FEMSA, a field visit was arranged. By accompanying truck drivers during their daily routes to deliver beverages, interesting insights beyond the data from telematics emerged. Important insights from the field trip include that some important information not reflected in telematics can impact the drivers’ decisions, for example, traffic, street conditions, weather, and other vehicles’ driving behavior. 3.2 Data understanding The amount of available data in this project was an important challenge. A telematics supplier integrates with over 3,000 trucks to generate over 40 different tables that are updated every day or some even every minute, each table having different parameters with different aggregation levels, therefore a clean data set is fundamental for the project. A data dictionary was created to better understand each of the different parameters shown in the reports, as well as the aggregation level of each report. Weekly calls with the telematics team helped to clarify questions from the team. 22
With the data dictionary ready, different visualizations in Microsoft Power BI helped us to get an initial feel for the data. These visualizations were iteratively validated and discussed with Coca-Cola FEMSA´s stakeholders to clarify the expected ranges for important parameters and the expected relations between them. 3.3 Data preparation The following criteria were followed to clean the data: • For outlier treatment we followed two approaches depending on the attribute. For most attributes, we trimmed the values to what the users found to be real minimums or maximums. As an example, for the attribute “Hours Driven”, there is no way a driver could have driven for more than 24 hours in one day. For attributes in which there was no knowledge of the limits, a conservative approach was followed, trimming only the values that went beyond 3 times the interquartile range. • Trucks that do not report fuel usage or full telematics data were removed. This removed a large part of the trucks that the company uses but still left us with 360 trucks with data from 325 days of delivery. Data was also further processed in the following manner: • One-hot encoding of variables are used to transform categorical variables (e.g., truck type and model) into dummy variables to be used in regression models. • Several features had to be engineered to be used in the system. Mainly features that were a ratio of the time an activity took in respect of the total operating time, for example, the total time a truck spent accelerating with respect to the total time of operation. • Feature scaling was used to normalize the range of independent variables. The feature scaling method used was min-max normalization. What this method does is that the minimum value of 23
an attribute becomes 0 and the maximum value becomes 1 and all the other values are adjusted on that 0 to 1 scale. 3.4 Modeling Our approach for modelling involved three different machine learning models. The first one was a regression model to understand the how different driving behaviors impact fuel efficiency, the second one was a machine learning model to understand how different driving behaviors impact safety and the third model was a clustering analysis of driving behaviors to create different driving styles clusters to analyze how these driving styles affect safety and fuel efficiency simultaneously. 3.4.1 Fuel Efficiency For our first regression model, the dependent variable to be analyzed is fuel efficiency as measured by kilometers per liter, where a higher fuel efficiency is better for the economy of the company and produces a lesser amount of carbon dioxide per kilometer driven. To quantify the monetary impact of any change, the average of the price per liter in Mexican pesos is used. To obtain the average price per liter, a dataset of all refuels of 2020 is used, which was 18.16 MXN per liter. To obtain the average carbon dioxide produced per liter of diesel we used a constant of 2.68 kilograms of carbon dioxide per liter. Different regression algorithms were tested to explain the main forces impacting fuel consumption. Some of the models considered were multiple linear regression (i.e., polynomial regression), support vector regression and simple decision trees. Other ensemble methods were tested like boosting (e.g. AdaBoost, XGBoost and LighGBMs), bagging (e.g. random forests) and stacking of various ensemble and simple methods. Although the bagging and boosting models explained the variance in observations measurements better (As reference, the adjusted R2 for the AdaBoost Regressor was 0.72 while for the Linear Regression it was 0.67), we decided to use multiple linear regressions because they are fully explainable. These types of models allow us to explain how the model interprets the inputs to produce outputs as opposed to a black model that only produces outputs that are not explainable. To make sure 24
the results of the linear regression are reproducible, the main four assumptions behind linear regression were tested using residuals plots. Here is a list of the main four assumptions: 1. Independence of observations 2. Linearity of Response 3. Normality of Residuals 4. Homogeneity of Variance (i.e., homoscedasticity) Multicollinearity issues were addressed in three different ways: • Feature Selection was carried out by understanding the meaning behind each of the attributes to discard metrics that were proxies of each other. • A Pearson Correlation Map was created to discard attributes with correlation greater than 0.6 to at least one of the attributes that a Pearson correlation coefficient, which indicates a high correlation with another feature that increases the effect of multicollinearity. A correlation analysis only checks the probability of a correlation problem between two attributes. • Variance Inflation Factor (VIF) was obtained for each of the attributes, and we discarded attributes that had a factor greater than 5, which would indicate highly correlated attributes. A Pearson Correlation Map helps with identifying pairs of attributes that are correlated, while the VIF approach helps to identify multicollinearity among the interactions between the variables, not just between two of them. 25
Once the regression was validated, we developed a Microsoft Power BI-based simulation tool that allows the users to simulate what would be the fuel efficiency gains if any of the dependent variables are modified. Afterwards, we use the results of our regression and validated them against samples of data. The samples used came from drivers where we had detected abrupt changes in their fuel efficiency. To detect the abrupt changes in fuel efficiency behavior for each driver we calculated a rolling 7-day average of fuel efficiency to smooth out the daily noise and only kept those drivers in which we saw a change that remain constant, an example is shown in Figure 13. 3.4.2 Safety Safety is not a straightforward concept to measure. Some of the proxies for measuring safety include number of accidents or proprietary Safety Scores given by telematics data providers. Our initial approach to understand which driving behaviors affect safety was to use accidents as our dependent variable on a regression model. Predicting crash rates is hard to measure given that accidents are stochastic events that not always follow the same pattern and depend on a wider variety of directly controllable factors like driving style and external factors like the weather, external traffic and highly uncertain events like people crossing streets or other drivers’ reckless driving. An econometric model was used to analyze the driving behavior of the drivers that had accidents. The econometric model was based on logistic regressions to predict the probability of an accident occurring. Econometric models allow the model to incorporate past events (by using lag features) that have led to an accident, such as a driver incurring in unsafe practices for several days in a row. Another option includes using several proxies for safety: events such as reducing the velocity of the vehicle too quickly or events in which vehicles make hard turns at considerable speeds. Events like these could be used as proxies for safety as they are considered unsafe behaviors. 26
Our econometric model to predict accidents did not produce statistically relevant results. This was seen by an adjusted R2 that was less than 0.2 and attributes with p-values greater than 0.05. Therefore, this part of the process was not integrated into the results. Our hypothesis of how to make this model work would be to structure how the data is collected and analyze more data related to status of the driver as suggested by Houston, J. (2003) and Harris, P. (2014). Given that our first proxy for safety failed to work, we decided to use Coca-Cola FEMSA’s telematics provider Safety Score. This score uses various events to calculate a proxy to the probability to have an accident. The calculations used are Intellectual Property of the supplier, but they are based on a micro modeling approach similar method to Toledo et al. (2008) mentioned in the literature review (i.e., real time analytics of telematics data). For example: a sudden longitudinal and lateral acceleration change measured by the vehicle’s computer may indicate an abrupt turn. This way, the previously mentioned independent variables were generated (e.g., abrupt lane change, abrupt turns acceleration or braking events while turning, etc.). To understand the relative importance of our independent variables regarding the Safety Score we discretized the Safety Score variable into 6 equal-frequency categories. The reason for discretizing the variable instead of treating it as continuous numerical feature is that the Safety Score is bounded by an upper limit at 100. So, a linear regression model would produce results with heteroscedasticity problems. Therefore, we ran a classification model using 20 independent variables to predict one of the 6 Safety Score classes. Figure 15 shows the ranges and number of observations for each of the 6 classes. The independent variables used are listed in Table 2. We tested among various machine learning models to decide on which machine learning model would best fit the data. Among the options that we tried were Random Forest, AdaBoost, Naïve Bayes and Support Vector Machines (SVM). Table 1 shows the comparison between the different machine learning 27
models. The machine learning algorithm that produced the best results in terms of Area Under the Curve (AUC) was Random Forest Regressor. The AUC is a common metric to evaluate the results of multiclass classification problems as it provides an aggregate measure of performance across all possible classification thresholds. A common way to interpret this metric is as the probability that the model ranks a random positive example more highly than a random negative example. 3.4.3 Cluster Analysis Our third and final model was clustering. Clustering is a family of machine learning that allow for unsupervised learning. The intention of using clustering was to group the different driving behavior characteristics that are gathered at the daily and truck level to identify clusters of driving styles. As input variables we used twenty different variables shown in Table 2.3. The clustering approach that we decided to use was a Bayesian Gaussian Mixture Model. The reason for using a probabilistic Gaussian Mixture model is that it allowed to us to better understand the properties of input examples. Many clustering algorithms like K-Means simply give a cluster representative that shows nothing about how the points are spread. The Gaussian properties of this approach gives us not only the mean of the cluster but also the variance which can be used to estimate the likelihood that a point belongs to a certain cluster. The reason for choosing a Bayesian Gaussian Mixture Model instead of the traditional Gaussian Mixture was to take a probabilistic approach to choosing the number of clusters. With a traditional Gaussian Mixture Model, a Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) techniques must be used to select an optimal number of clusters. While with a Bayesian one, the algorithm takes the cluster parameters as latent random variables, not as fixed model parameters. In other words, with this algorithm you can set an initial maximum number of clusters and the algorithm will decide the optimal number of clusters to reward models that fit the data well while minimizing a theoretical information criterion. The possible range of number of clusters would be between 1 and the maximum number of clusters that was set. For our problem, we chose a maximum 28
number of clusters as 10 as this would allow us to separate the driving styles into business-relatable information but the algorithm suggested 5 clusters as the optimal number for clusters. Table 1: Features used for Cluster Analysis Feature ID Variable name 1 Life mileage 2 Max. engine t(°C) 3 Max. RPM 4 Top Speed 5 Operation time 6 Time in DC 7 Avg. Speed 8 % route under min. t(°C) 9 Over Revolution Time % 10 Idling time % 11 Acceleration Route time % 12 Overspeed events (%) 13 Number of stops (%) 14 Abrupt Acceleration 15 Abrupt Braking 16 Abrupt turns 17 Abrupt Lane Changes 18 Acceleration while turning 19 Braking while turning 20 OverAcceleration events Note: All variables were measured at a vehicle-day disaggregation level. 3.4.4 Individual Cluster Analysis Each cluster generated by our model had a weight from independent variables that impact fuel efficiency, for example idling times and excessive acceleration events, as well as independent variables related to safety, for example abrupt lane changes, abrupt turns and acceleration or braking events while turning. Each cluster was generated based on the independent variables that represent the driving behavior, to explain which patterns each driver follows. Then we used the created clusters to see how they in terms of fuel efficiency and safety score with the purpose of explaining the tradeoffs between the different types of driving styles. 29
To increase each cluster’s interpretability and applicability to business daily practices, a persona was defined for each cluster. A persona is term borrowed from the marketing industry which is described as “the aspect of someone’s character”. Our intention in using these personas was to create fictitious but relatable characters so that the driving style of any driver could be identified and easily recognized. Our Gaussian Mixture Model approach also allows for driving styles to be, probabilistically speaking, part of many of driving styles. 3.5 Conclusions To answer our research question of which driving styles can help fleet owners increase safety and fuel efficiency we used multiple machine learning and analytics techniques. We used data from over 3,000 trucks to come up with a fuel efficiency regression model, we had an unsuccessful attempt at predicting crash rate safety with econometric regression so we ended up using a proprietary Safety Score from the telematics provider as a proxy for Safety and we developed a clustering analysis to drive the business recommendations, actionable insights, and recommendations. Given the amount of data we were dealing with we also followed the CRISP-DM methodology to guide us through the iterative process of data mining. In the next section we will describe the results of each model. 4 RESULTS 4.1 Fuel Efficiency 4.1.1 Regression Model Our polynomial regression model to understand the main drivers behind fuel efficiency (kilometers per liter of diesel) contained 13 independent variables plus the bias term. The linear regression model had an R2 of 0.67, MAPE of 8.7%, MAE of 0.23 and a RMSE of 0.28. As context, the mean fuel efficiency was 2.72 kilometers per liter with a standard deviation of 0.52. Figure 8 shows a histogram of the distribution of fuel efficiency. 30
You can also read