Degree Project Level: Master's in Business Intelligence

WRF-Chem vs machine learning approach to predict air quality in urban complex terrains: a comparative study

Author: Andrey Kudryashov
Supervisor: Yves Rybarczyk
Examiner: Moudud Alam
Subject/main field of study: Microdata Analysis
Course code: MI4002
Credits: 15 ECTS
Date of examination: 08.06.2020

At Dalarna University it is possible to publish the student thesis in full text in DiVA. The publishing is open access, which means the work will be freely accessible to read and download on the internet. This will significantly increase the dissemination and visibility of the student thesis. Open access is becoming the standard route for spreading scientific and academic information on the internet. Dalarna University recommends that both researchers and students publish their work open access.

I give my consent for full text publishing (freely accessible on the internet, open access): Yes ☒ No ☐
Abstract: Air pollution is the main environmental health issue: it affects all regions and causes millions of premature deaths every year. In order to take any preventive measures, we need the ability to predict pollution levels and air quality. This task is conventionally solved using deterministic models. However, those models fail to capture complex non-linear dependencies in erratic data. Lately, machine learning models have gained popularity as a very promising alternative to deterministic models. The purpose of this thesis is to conduct a comparative study between a Chemical Transport Model (WRF-Chem) and a statistical model built from machine learning algorithms, in order to understand which one is advantageous in predicting the air quality and the meteorological conditions, using data from Cuenca, Ecuador. The study aims to compare the two methods and conclude which of them is better at forecasting the concentration of fine particulate matter (PM2.5) in an urban complex terrain. I concluded that even though WRF-Chem has the major advantage of forecasting all the data of interest over a broader time horizon, machine learning algorithms provide better accuracy for the middle-term period. Machine learning models also require much less computational power, but lack the ability to predict meteorological conditions along with the pollution level.

Keywords: Machine learning, WRF-Chem, comparative study, air quality
Table of Contents

1. Introduction
   1.1 Background
   1.2 Relevance
   1.3 Purpose
   1.4 Scientific novelty
   1.5 Structure of the research
2. Overview of pollution level modeling
   2.1 Deterministic methods
   2.2 Non-deterministic methods
3. Machine learning algorithms
   3.1 Time series models
       3.1.1 Univariate analysis
       3.1.2 Multivariate analysis
   3.2 Classical machine learning methods
       3.2.1 Regularized linear regression
       3.2.2 Support Vector Regression
       3.2.3 Decision tree
   3.3 Ensemble learning methods
       3.3.1 Horizontal ensemble
       3.3.2 Vertical ensemble
   3.4 Artificial Neural Networks
       3.4.1 Multilayer perceptron
       3.4.2 LSTM neural network
       3.4.3 CNN neural network
4. Modeling pollution level
   4.1 Data
   4.2 Methodology of modeling
   4.3 Modeling
5. Discussion and Conclusion
6. References
7. Appendix
1. Introduction

1.1 Background

The global population, currently around 7.8 billion, has increased by 100% over the last 40 years and is estimated to increase by 50% over the next 40 years, reaching 9 billion by 2037 (Ahmadov, 2016). Most of the growth occurs in urban areas of the developing parts of the world and results in the overuse and shortage of natural resources, deforestation, climate change and, especially, environmental pollution (Ritter et al., 1992).

According to the World Health Organization (WHO), air pollution is the main environmental health issue: it affects all regions of the world and caused 4.2 million premature deaths worldwide during 2016. The inhabitants of low-income cities are the most impacted. This fact is supported by the latest air quality database, which indicates that 97% of cities in low- and middle-income countries with more than 100,000 residents do not meet WHO air quality guidelines (Rybarczyk & Zalakeviciute, 2018). Outdoor air pollution affects large cities as well as rural areas and is caused by multiple factors like industry and energy supply, waste management, transport, dust, agricultural practices and household energy (Zalakeviciute et al., 2018).

The pollutants proven to be the most dangerous for public health include particulate matter (PM), ozone (O3), nitrogen dioxide (NO2) and sulphur dioxide (SO2). Most registered health risks are related to particulate matter of less than 10 and 2.5 microns in diameter (PM10 and PM2.5). PM is capable of penetrating deep into lung passageways and entering the bloodstream, causing cardiovascular, cerebrovascular and respiratory impacts. Additional serious health issues induced by air pollution are, according to WHO, heart disease, stroke, chronic obstructive pulmonary disease and lung cancer (WHO, 2014).

It is not only human health that is critically impacted by air pollutants, but also the earth's climate and ecosystems globally (WHO, 2014). Air quality can impact climate change, and climate change can in turn impact air quality. Emissions of pollutants into the air can result in climate changes.
Ozone in the atmosphere warms the climate, while different components of particulate matter (PM) can have either warming or cooling effects on the climate. On the other hand, changes in climate can affect local air quality. Atmospheric warming related to climate change potentially increases ground-level ozone in many regions, and due to this fact it may be challenging to comply with the ozone standards in the future. The impact of climate change on other air pollutants is still uncertain, but many studies are in progress to manage this uncertainty (Brunelli et al., 2007).

1.2 Relevance

Given the information mentioned above, it is an indisputable fact that the prediction and the monitoring of air quality is of the utmost importance both for human health and for the climate. The present comparative study between the Weather Research and Forecasting with Chemistry model (WRF-Chem) and machine learning (statistical) air quality prediction (Carnevale et al., 2009) is based on the available data from the meteorological station of the city of Cuenca in Ecuador.

1.3 Purpose

The purpose of this study is to compare the accuracy of prediction between a WRF-Chem model and a statistical model built from machine learning algorithms, and to investigate which of the two methods is better at forecasting the concentration of fine particulate matter (PM2.5) in an urban complex terrain, as well as the meteorological conditions. In order to reach this goal, we need to determine which machine learning algorithms might be used to predict air quality, build those models, and conduct a final comparison regarding accuracy, complexity and time costs.

Our methodology is to compare a benchmark with methods developed throughout the process. We use WRF-Chem's prediction error as a benchmark to compare with the results of different statistical methods and machine learning algorithms.
1.4 Scientific novelty

Current studies show that traditional deterministic models tend to struggle to capture the non-linear relationship between the concentration of air pollutants and their sources of emission and dispersion (Shimadera et al., 2016). To tackle such a limitation, a very promising approach is to use statistical models based on machine learning techniques (Chen et al., 2017). We try a broad variety of statistical approaches to overcome the issue, including ensemble learning and sequence-to-sequence neural network models. The related literature demonstrates the use of machine learning models to predict the air pollution level for the next day. We will create and evaluate a module allowing for multistep prediction.

1.5 Structure of the research

The paper consists of an introduction, three chapters, a discussion and conclusion, a reference list and an appendix. The first chapter is an overview of the best practices used in the field to predict air pollution. We compare the deterministic and non-deterministic approaches and discuss the advantages of each. The second chapter explains the statistical methods used in the study. We also discuss their advantages, disadvantages and suitability for the paper's goal. The third chapter contains the empirical part of the present research. It describes the data used and its preprocessing. Then we build the selected machine learning algorithms and test them against the benchmark.
2. Overview of pollution level modeling

In the related literature, forecasting of the pollution level is usually performed using one of two approaches: deterministic or statistical. This logically leads to the structure of the present chapter. In the deterministic approach, prediction is made based on field-specific knowledge about the data, e.g. laws of physics and chemistry. In the non-deterministic approach, the researcher uses statistical models and algorithms to extract rules from the data with no or little prior knowledge (Armstrong, 2002).

2.1 Deterministic methods

Deterministic models are usually represented by systems of models that work together to simulate emission, transport, diffusion, transformation, and removal of air pollutants: namely, meteorological models, emissions models and air quality models. Pollutant concentration forecasts can be performed using simple one-dimensional air quality models, but three-dimensional models are used to simulate complex interactions of physical and chemical processes (U.S. Environmental Protection Agency, 2003).

One of the most widely used meteorological models is the Penn State/NCAR Mesoscale Model version 5 (MM5), a regional mesoscale model used for weather forecasting and climate projections, maintained by Penn State University (Grell et al., 1994). Another prime example is the Regional Atmospheric Modeling System (RAMS), a comprehensive mesoscale meteorological modeling system (Pielke et al., 1992).

In the process of emission modeling, estimated emissions with spatial, temporal and chemical resolution are used to model air quality (Pielke et al., 1992). Data on emissions include mobile sources, stationary sources, area sources and natural sources. The most used emission modeling systems are the Emission Processing System (EPS 2.0) (U.S. Environmental Protection Agency, 1992), the Emissions Modeling System (EMS-95 – EMS-2002) (Bruckman, 1993) and the Sparse Matrix Operator Kernel Emissions (SMOKE) modeling system (Coats, 1996).
There are two types of three-dimensional models, Lagrangian and Eulerian, depending on the method used to simulate the time-varying distribution of pollution concentrations. Lagrangian models trace individual parcels of air over time, using meteorological data to transport and diffuse the pollutants, which is why they are also called trajectory models. However, the fact that the model traces each individual parcel of air makes it computationally inefficient when the interaction of a large number of individual sources with nonlinear chemistry is involved, and these models have limited usefulness in forecasting secondary pollutants (Pielke et al., 1992). Eulerian models use a grid of cells (vertical and horizontal) where the chemical transformation equations are solved in each cell and pollutants are exchanged between cells. These models can produce three-dimensional concentration fields for several pollutants but require significant computational power. Typically, the computational requirements are reduced using nested grids, with a coarse grid used over rural areas and a finer grid over urban areas, where concentration gradients tend to be more pronounced (Pielke et al., 1992).

The Hybrid Single-Particle Lagrangian Integrated Trajectories with a generalized nonlinear Chemistry Module (HY-SPLIT CheM) model is an example of a Lagrangian model used to forecast air quality on a regional scale (Stein et al., 2000). However, these models struggle to work with a large number of emission sources, so Eulerian models are used more often at the urban scale. Popular Eulerian models include the Multiscale Air Quality Simulation Platform (MAQSIP) (Odman & Ingram, 1996), the SARMAP Air Quality Model (SAQM) (Chang et al., 1996) and the Urban Airshed Model with Aerosols (UAM-AERO) (Lurmann, 2000).

A very popular deterministic model is Weather Research and Forecasting with Chemistry (WRF-Chem V3.2) (WRF, 2017). WRF is a 3-D latest-generation non-hydrostatic model used for meteorological forecasting and weather research. It is a fully compressible model that solves the equations of atmospheric motion, with applicability to global, mesoscale, regional and local scales. WRF also has the configuration WRF-Chem for modeling the interactions between meteorology and the transport of pollutants.
It is not rare that deterministic models are developed for specific regions. Finardi et al. developed a deterministic module to forecast air quality in the city of Torino (Finardi et al., 2008). The modeling system is based on prognostic downscaling of weather forecasts and on multi-scale chemical transport model simulation, in order to describe atmospheric circulation in a complex topographic environment, the space/time variation of emissions and pollutant import from neighboring regions.

2.2 Non-deterministic methods

Quite often authors use a broad variety of machine learning models and conduct a comparative analysis of the results. Saniya et al. use the level of precipitation, wind speed and wind direction to predict the concentration of PM2.5. The authors use Linear Regression, Multilayer Perceptron, Support Vector Machine and M5P Model Trees. A collaborative filtering algorithm plays a major role by making automatic and accurate predictions based on previous trends of pollutant levels and the database in the server (Saniya et al., 2018).

Sayegh et al. also employ a number of machine learning models, including Linear Regression, Quantile Regression, a Generalized Additive Model and a Boosted Decision Trees model, to compare their performance in predicting PM10. Meteorological factors including wind speed, wind direction, temperature and humidity, and chemical species including CO, NOx, SO2 and the PM10 value for the previous time step, covering one year of data from Makkah, Saudi Arabia, are used. Quantile Regression shows better results due to the fact that covariates affect quantiles heterogeneously, which is lost in the central tendency prediction framework (linear regression) (Sayegh et al., 2014).

Singh et al. in their paper identify sources of pollution and forecast the air pollution level using various machine learning models: a Hybrid Model with Principal Components Analysis, Support Vector Machine and ensemble learning models – Random Forest and Boosted Decision Tree. The authors use five years of pollution level and meteorological data for Lucknow, India. The models are used to predict the Air Quality Index and Combined AQI. They also research the importance of predictors and their influence on the forecast.
The Boosted Decision Tree in that paper shows the best result, closely followed by Random Forest (Singh et al., 2013).

Philibert et al. use Random Forest and Linear and Nonlinear Regression to predict the N2O emission level. They use data on environmental and crop variables, including fertilization, type of crop, experiment duration, country, etc., on the global scale. The authors use variable selection to rank variables by importance and include only the most informative ones, which results in increased accuracy. The Random Forest model shows the best result (Philibert et al., 2013).

In the paper by Nieto et al., the authors aim to predict the levels of various pollutants, including NO2, SO2 and PM10, in Oviedo, Spain, based on a number of meteorological factors. They use a Multivariate Adaptive Regression Splines model and a Multilayer Perceptron on three years of historical data (Nieto et al., 2015).

Kleine Deters et al. use six years of meteorological data, including wind speed and precipitation, for Quito, Ecuador, to identify the effects of meteorology on PM2.5. They use Linear Regression, as this statistical method offers excellent interpretability and allows for easy analysis of the statistical significance of independent variables (Kleine et al., 2017).

Carnevale et al. aim to estimate the relationship between PM10 emissions and pollutants from the Air Quality Index for the Lombardy region, Italy, using hourly data on SO2, NOx, CO, PM10 and NH3 for a year. The Dijkstra algorithm is deployed in the large-scale data processing system. The model's performance was then compared against a deterministic model simulation. The performance of the model is close to the Transport Chemical Aerosol Model, which is computationally much more expensive (Carnevale et al., 2018).

Suárez Sánchez et al. investigate the dependence between primary and secondary pollutants and the most significant contributors to the air pollution level. The data include three years of observations of NOx, CO, SO2, O3 and PM10 in Aviles, Spain. The authors use various Support Vector Machine kernels, including radial, linear, quadratic and Pearson VII Universal Kernels, and a Multilayer Perceptron model to predict NOx, CO, SO2, O3 and PM10. The best quality was achieved using the Pearson VII Universal Kernel (Suárez et al., 2011).
Liu et al. also use SVM to predict the Air Quality Index, training models on two years of observations from three cities in China (Beijing, Tianjin, and Shijiazhuang). The data include AQI values, various pollutants' concentrations (PM2.5, PM10, SO2, CO, NO2, and O3), meteorological factors (temperature, wind direction and velocity), and weather descriptions (e.g. cloudy/sunny, rainy/snowy, etc.). The model performance was significantly improved after including the surrounding cities' air quality levels (Liu et al., 2017). Another paper, by Vong et al., uses SVM to forecast pollutant (NO2, SO2, O3, SPM) levels from historical and meteorological data from Macau, China. The authors use three years of data to train the model and one year to evaluate the performance. The Pearson correlation is used to identify the best predictors for each pollutant, and different kernels are used to test which of the predictors or models get the best results. They also use the Pearson correlation as a metric to determine the optimal number of days for forecasting. They achieve a good fit and conclude that SVM's performance crucially depends on the choice of kernel (Vong et al., 2012).

A study by Zhan et al. uses a Random Forest model to build a spatiotemporal model to predict the O3 concentration across China. They use RF with 500 estimators (decision trees). The dataset includes one year of observations of meteorological variables, planetary boundary height, elevation, anthropogenic emission inventory, land use, vegetation index, road density, population density, and time from 1601 stations located all across China. The performance of the model is evaluated against Chemical Transport model simulations using RMSE and R squared as metrics. The machine learning models show better accuracy while at the same time being less demanding in terms of computational resources. They also conclude that the accuracy of prediction relies heavily on the quality of coverage by the monitoring network (Zhan et al., 2018).

Martínez-España et al. aim to find the most robust machine learning algorithms to preserve accuracy in case of O3 monitoring failure. The authors use a Decision Tree, a k-Nearest Neighbours model, a Bagging model, Random Committee and Random Forest. They compare the performance of the selected models and then use hierarchical clustering to determine the optimal number of models to predict the O3 level in the region of Murcia, Spain. Random Forest slightly outperforms the other models.
The best predictors turn out to be NOx, temperature, wind direction, wind speed, relative humidity, SO2, NO, and PM10. They also conclude that two models are enough for the chosen data (Martínez-España et al., 2018).

In the paper by Bougoudis et al., the authors identify the conditions under which high pollution emerges. They use a hybrid system based on the combination of clustering, Artificial Neural Networks, Random Forest and fuzzy logic. Twelve years of hourly observations of CO, NO, NO2, SO2, temperature, relative humidity, pressure, solar radiation, wind speed and direction from Athens, Greece are used. The optimization of the modeling performance is done with a Mamdani rule-based fuzzy inference system that exploits relations between the parameters affecting air quality. Specifically, self-organizing maps are used to perform dataset re-sampling, then ensembles of feedforward artificial neural networks and random forests are trained on the clustered data vectors (Athanasopoulos et al., 2017).

Elangasinghe et al. is one of the earlier papers using neural networks to predict the concentration of NO2. They use a genetic algorithm to optimize the inputs for the neural network. The variable set includes wind speed, wind direction, solar radiation, temperature, relative humidity and time features accounting for hour, day and month (Elangasinghe et al., 2014). Gardner and Dorling concluded that neural networks outperform other linear statistical methods with regard to non-linear dependencies (Gardner & Dorling, 1999). Perez conducted a comparison between the persistence method, linear regression and a neural network using data from Santiago, Chile. He concluded that the best error on the hourly prediction of the pollution level was obtained using neural networks (Pérez et al., 2000). Brunelli et al. used recurrent neural networks to predict the concentrations of various pollutants two days ahead using meteorological data (Brunelli et al., 2015).

Some authors have been improving neural networks' accuracy using other methods. Grivas et al. use a neural network capable of combining meteorological and time-scale inputs to predict the hourly pollution level over the Greater Athens Area using data collected in 2001-2002. Their model greatly outperformed the linear regression used for comparison (Finardi et al., 2008).
3. Machine learning algorithms

Machine learning methods are only gradually infiltrating time series analysis and pollution level modeling. However, properly configured, they hold powerful potential. In this chapter, we do a quick recap of the time series models and discuss machine learning models.

3.1 Time series models

For the univariate time series analysis, we are going to use two models: SARIMA and Holt-Winters Exponential Smoothing. For the multivariate time series analysis, we are going to use the vector autoregressive model (VAR).

3.1.1 Univariate analysis

The autoregressive integrated moving average (ARIMA) is a classical time series model designed to analyze and forecast time series data (Zhang, 2001). It is a generalization of the ARMA model in which the data is allowed to be non-stationary. Equation 1 shows the ARMA model with an autoregressive component of order p and a moving average component of order q:

y_t = c + \varphi_1 y_{t-1} + \dots + \varphi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}    (1)

To use the ARIMA model we need to be sure that our data is stationary, meaning that it has a constant mean and variance regardless of the time step. ARIMA assures stationarity using differencing, as the differenced series in practice has a high chance of being stationary. Equation 2 shows the differencing process:

y'_t = y_t - y_{t-1}    (2)

In the case of seasonal data, we apply seasonal differencing as shown in equation 3, in which m denotes the assumed seasonality:

y''_t = y'_t - y'_{t-m}    (3)
Once we have treated non-stationarity and seasonality in our data using differencing, we can write the high-level representation of the SARIMA model shown in equation 4:

\mathrm{SARIMA}(p, d, q)(P, D, Q, m)    (4)

where:
• p is the order of the autoregressive component (AR)
• d is the order of non-seasonal differencing
• q is the order of the moving average component (MA)
• P is the order of the seasonal AR
• D is the order of seasonal differencing
• Q is the order of the seasonal MA
• m is the number of periods in the season

Holt-Winters Exponential Smoothing is an extension of Holt's method to capture seasonality (Winters, 1960). The model consists of a forecast equation (equation 5) and three smoothing equations: for the level (equation 6), for the trend (equation 7) and for the seasonal component (equation 8). The corresponding smoothing parameters α, β and γ are estimated using error minimization. The parameter m accounts for the frequency of seasonality.

\hat{y}_{t+h|t} = \ell_t + h b_t + s_{t+h-m(k+1)}    (5)
\ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})    (6)
b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1}    (7)
s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m}    (8)

The method has two variations: the additive method is preferred when the seasonal variations are roughly constant throughout the series, while the multiplicative method is preferred when the seasonal variations change proportionally to the level of the series. Due to the nature of our data, we use the additive model.
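To make the two univariate models concrete, the sketch below fits both with the statsmodels library used in chapter 4. This is a minimal illustration under stated assumptions, not the exact thesis code: the file and column names are placeholders, the SARIMA orders are the ones reported later in table 4, and weekly seasonality for Holt-Winters is an assumption.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical file: daily PM2.5 series indexed by date.
series = pd.read_csv("pm25_daily.csv", index_col=0, parse_dates=True)["pm"]

# SARIMA(p,d,q)(P,D,Q,m) as in equation 4; orders taken from table 4.
sarima = SARIMAX(series, order=(2, 0, 1),
                 seasonal_order=(0, 0, 0, 12)).fit(disp=False)
print(sarima.forecast(steps=30))

# Additive Holt-Winters (equations 5-8); seasonal_periods=7 is an assumption.
hw = ExponentialSmoothing(series, trend="add", seasonal="add",
                          seasonal_periods=7).fit()
print(hw.forecast(30))
```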
3.1.2 Multivariate analysis

The vector autoregressive model is a generalization of the univariate autoregressive model which allows for forecasting a vector of time series (Athanasopoulos, 2017). All the variables affect each other and are treated equally. For example, a three-dimensional VAR of order 1 is described by the system of equations shown in equation 9:

y_{1,t} = c_1 + a_{11,1} y_{1,t-1} + a_{12,1} y_{2,t-1} + a_{13,1} y_{3,t-1} + e_{1,t}
y_{2,t} = c_2 + a_{21,1} y_{1,t-1} + a_{22,1} y_{2,t-1} + a_{23,1} y_{3,t-1} + e_{2,t}    (9)
y_{3,t} = c_3 + a_{31,1} y_{1,t-1} + a_{32,1} y_{2,t-1} + a_{33,1} y_{3,t-1} + e_{3,t}

where e_{1,t}, e_{2,t} and e_{3,t} are white noise processes that may be contemporaneously correlated.

3.2 Classical machine learning methods

The most widely used machine learning model is classical linear regression, which uses the least squares method to estimate the coefficients. The linear regression model can be written as equation 10, where y is the target variable, x_i is an explanatory variable, w_i is the weight for explanatory variable x_i, and ε is the error between predicted and observed values:

y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n + \varepsilon    (10)

The vector of weights is found by solving the minimization problem shown in equation 11:

\min_w \left( \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \right)    (11)

Whether our coefficients are correct and represent reality well depends on compliance with a set of assumptions: linear dependency between the target variable and the predictors; the target variable should be normally distributed; homoskedasticity, meaning that the variance of the error is assumed to be constant throughout the data; each observation is supposed to be independent; and absence of multicollinearity, meaning that the explanatory variables are independent of each other.
3.2.1 Regularized linear regression

In the era of big data, the researcher may find himself in a situation where the number of variables exceeds the number of observations; in the case of the classical least squares method, this leads to overfitting and zero predictive ability. The potential multicollinearity of variables, and the need to discard a number of them during the analysis, are also big problems (Zou & Hastie, 2005). In order to combat these problems, regularized least squares models were introduced.

The two most popular models – ridge and lasso – are very similar and differ only in the specification of the penalty component (the form of regularization). Let's take a closer look at the lasso model. Lasso is an autonomous and convenient way to introduce sparseness into a linear regression model. The lasso abbreviation stands for 'least absolute shrinkage and selection operator' and, when applied to the linear regression model, performs feature selection and regularization of the weights of the selected features. Lasso adds a penalty component to the OLS minimization problem, as shown in equation 12:

\min_w \left( \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \lVert w \rVert_1 \right)    (12)

The component \lVert w \rVert_1 is the l1 norm of the weight vector, which leads to a penalty for large weights. Since the l1 norm is used, many weights get a value of exactly 0 (in the case of the ridge model, the l2 norm is used, which leads to weights that can be arbitrarily small, but not zero) and the rest are shrunk. The lambda parameter controls the degree of the regularizing effect and is usually tuned by cross-validation. When lambda is large, many weights become equal to 0.
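A minimal sketch of both regularized regressions with scikit-learn (the sklearn library used in chapter 4); alpha plays the role of λ in equation 12, and the synthetic data is a placeholder:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

X, y = np.random.randn(200, 20), np.random.randn(200)  # placeholder data

# Tune lambda (called alpha in sklearn) by cross-validation, as described above.
grid = GridSearchCV(Lasso(max_iter=10000),
                    {"alpha": [0.001, 0.01, 0.1, 1.0]},
                    scoring="neg_mean_absolute_error", cv=5).fit(X, y)
lasso = grid.best_estimator_
print((lasso.coef_ == 0).sum())     # l1 penalty: many weights are exactly zero

ridge = Ridge(alpha=1.0).fit(X, y)  # l2 penalty: weights shrink but stay non-zero
```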
3.2.2 Support Vector Regression

The support vector machine (SVM) is a classical machine learning algorithm often used as a benchmark to measure more complex models' efficiency, due to its speed and accuracy (Basak et al., 2007). The basic idea of SVM is to find a hyperplane separating the classes (in the case of classification). In the case of regression analysis, the task looks similar to constructing a linear regression (minimizing the error), with the difference that in the support vector model the task is to keep the error within a certain threshold. The optimization problem is formulated as the system shown in equation 13:

\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \to \min
\text{s.t. } y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i    (13)
\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*
\xi_i, \xi_i^* \ge 0

where C is the penalty for the estimation error; ε is the error threshold; ξ_i, ξ_i^* are slack variables; w is the vector of weights; x_i is the vector of independent variables; and y_i is the dependent variable.

In the case when the set of objects is linearly inseparable, it is necessary to move from the original space to a space of higher dimension in which the classes are linearly separable. Examples of the most popular kernels:

• Linear: K(x, x') = \langle x, x' \rangle
• Polynomial: K(x, x') = (1 + \langle x, x' \rangle)^d
• Gaussian: K(x, x') = \exp(-\lVert x - x' \rVert^2 / 2\sigma^2)
• Sigmoid: K(x, x') = \tanh(\gamma_0 + \gamma_1 \langle x, x' \rangle)
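The sketch below runs ε-SVR with each of the listed kernels via sklearn; C and epsilon correspond to the penalty C and threshold ε of equation 13, and the synthetic data is a placeholder (in sklearn the Gaussian kernel is named 'rbf'):

```python
import numpy as np
from sklearn.svm import SVR

X = np.random.randn(300, 6)                        # placeholder: 6 predictors
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(300)   # non-linear target

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svr = SVR(kernel=kernel, C=2.0, epsilon=0.1).fit(X, y)
    print(kernel, svr.score(X, y))                 # in-sample R^2, illustration only
```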
3.2.3 Decision tree

Decision trees are a family of algorithms that play an important role in machine learning (Thomas, 2000). Due to the simple method of generating decision trees, decision tree learning is quick and easy compared to more complex algorithms (Cruz & Wishart, 2006). The tree structure consists of branches of several edges connected by internal vertices, with leaves at the end of each branch. Each leaf makes a prediction.

For partitioning, the simplest condition is used, which checks whether the value of some attribute x_j lies to the left of a specified threshold t: [x_j ≤ t]. Let a set of objects from the training set be at vertex v. The parameters j and t in the condition are chosen to minimize an error criterion (e.g. in a classification problem the Gini impurity index can be used; a regression problem can use the mean absolute error). The parameters j and t can be selected by enumeration: there is a finite number of features, and of all the possible values of the threshold t we need to consider only those for which different partitions are obtained. After the parameters have been selected, the set of objects from the training set is divided into two subsets, each of which corresponds to its own child vertex. The procedure is repeated until the desired accuracy or a stopping criterion is met.

The accuracy of decision trees increases with their depth. The deeper the tree, the more complex, non-monotonous dependencies it can catch. However, increasing depth leads to unwanted consequences:

• Loss of interpretability;
• Severe overfitting, as a deep enough tree can reach 100% accuracy on the training data while being unable to perform well enough on the test data.

The main way to combat overfitting is to regularize the tree by selecting model hyperparameters that, on the one hand, show good results on the training data and, on the other hand, produce good accuracy of predictions on the validation data. The main hyperparameters used to regularize decision trees are the maximum depth of the tree (i.e., the maximum number of splits down the tree) and the minimum number of observations at a terminal vertex (i.e., the minimum number of observations in a tree leaf needed for a split to happen).
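A short sketch of the two regularizing hyperparameters just discussed, using sklearn's regression tree on placeholder data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X, y = np.random.randn(300, 6), np.random.randn(300)  # placeholder data

# max_depth and min_samples_leaf are the two regularizers described above;
# the "mae" criterion is named "absolute_error" in newer sklearn versions.
tree = DecisionTreeRegressor(criterion="mae", max_depth=10,
                             min_samples_leaf=10).fit(X, y)
print(tree.get_depth())
```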
3.3 Ensemble learning methods

Another way to solve the overfitting problem is ensemble learning. The idea of an ensemble is to average the predictions of several weak predictors and combine them into one model which will have high predictive ability (high accuracy). The prediction is conducted by combining the results of all the weak predictors: for classification, the simple majority voting rule can be used; for regression, averaging. In the modeling process, it is important to obtain weak predictors that are as different (minimally correlated) as possible. The main methods used to achieve this goal are the bootstrap and random selection of a limited number of variables for each weak predictor. A bootstrap consists of selecting random observations from the common sample to train each weak predictor. With a bootstrap with repetition, the same observation can enter a model's training dataset several times.

3.3.1 Horizontal ensemble

In a horizontal ensemble, we train several weak predictors independently of each other. One of the most popular examples of a parallel ensemble is the random forest (RF). A random forest is an ensemble of decision trees, each operating independently and making a prediction as to where an example data entry belongs. The forest aggregates the results and chooses the strongest prediction (Andy & Matthew, 2002). The random forest algorithm can be described as follows (a code sketch follows the list):

1. We draw bootstrap samples from the dataset;
2. For each sample we create an unpruned decision tree based on the features in the dataset;
3. We get predictions from the trees, which are then combined by voting (classification) or averaging (regression).
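A minimal sklearn sketch of the three steps above; max_features limits the variables seen by each tree, which is the variable-subsetting trick mentioned earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X, y = np.random.randn(300, 6), np.random.randn(300)  # placeholder data

# bootstrap=True draws the samples of step 1; each tree sees a random
# subset of features, decorrelating the weak predictors.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True,
                           max_features="sqrt", min_samples_leaf=10).fit(X, y)
print(rf.predict(X[:3]))  # step 3: predictions averaged over the trees
```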
3.3.2 Vertical ensemble

In a vertical ensemble, we train several weak learners sequentially. In general, a sequential ensemble allows us to obtain a higher accuracy of predictions than parallel-trained models. However, such a model loses in terms of speed: because of the sequential fit, it is impossible to parallelize the computation. This model is also even more prone to overfitting, which requires the use of regularization. One of the most popular algorithms, gradient boosting, can be described as follows:

1. We get the initial model's error (the model being e.g. a decision tree or linear regression): e_1 = y - ŷ_1;
2. Fit a model in which the error from the first step is used as the dependent variable, obtaining the prediction ê_1;
3. Sum the obtained prediction with the original one: ŷ_2 = ŷ_1 + ê_1;
4. Get the new error: e_2 = y - ŷ_2;
5. Repeat steps 2-4 until we overfit or until the model's error becomes constant.

The most popular algorithms are the Gradient Boosting Machine (GBM) described above and Extreme Gradient Boosting (XGBoost), which uses shortcuts in the conventional algorithm to achieve faster computational speed at the expense of potential accuracy loss.

3.4 Artificial Neural Networks

3.4.1 Multilayer perceptron

The simplest version of a neural network is called the multilayer feed-forward perceptron. It is simply defined as an input layer, an output layer, and several hidden layers. Each layer consists of multiple artificial neurons, which are tasked with feeding data forward to the next layer (Svozil et al., 1997). Figure 1 represents a simple neural network schematically.

Figure 1. Example of a feed-forward neural network with one hidden layer (Svozil et al., 1997)
Each node of the network consists of an artificial neuron, a mathematical model intended to emulate the role of a neuron in a physical brain. Each neuron consists of a set of inputs, some type of activation or transfer function, and an output (Svozil et al., 1997). The inputs, multiplied by weights and added to biases, are passed on to the further layers (forward propagation). An example of an artificial neuron is shown in figure 2.

Figure 2. Artificial neuron schema (Svozil et al., 1997)

Several activation functions are used in practice; the most popular is the sigmoid. The process of training a neural network can be described by the following steps:

1. The network receives training data as its input, which through feed-forward propagation becomes a set of outputs;
2. The error is calculated (for a regression problem it is usually the mean square error);
3. Partial derivatives of the loss function are calculated with respect to the model's parameters;
4. The model parameters are tuned with respect to the mentioned derivatives (backpropagation).

3.4.2 LSTM neural network

A classical neural network is poorly suitable for time series prediction, as it analyzes each datapoint separately and is not able to carry information over time or any other sequence (e.g. language).
This problem is resolved by the recurrent neural network (RNN), as it uses the previous state as another input. Figure 3 represents a simple recurrent neural network schematically.

Figure 3. Recurrent neural network (3 units) (Olah)

However, if the input sequence is long, an RNN gives more attention to the later datapoints, while the memory of the old ones vanishes. This problem is overcome by Long Short-Term Memory neural networks (LSTM), as they are able to learn long-term dependencies (Olah). In order to have such an ability, an LSTM has three gating layers inside a unit. The most important idea behind the model is the cell state, an information 'conveyor' running through the entire network. Information goes to the cell state through three gates. The forget gate decides what part of the information needs to be withdrawn from the cell state. The input gate decides which part of the information is going to be passed into the cell state. The output gate decides which part of the data is going to be output. An example of an LSTM network's unit is shown in figure 4.

Figure 4. LSTM neural network (3 units) (Olah)
An LSTM can be used to map a sequence to a scalar or a vector, i.e. to a single or multiple time steps. We train it using classical backpropagation, taking derivatives of the loss function once the actual and expected outputs have been compared. Those derivatives are used to update the weights inside the neural network's layers.

3.4.3 CNN neural network

The convolutional neural network is most popular as an image processing model (or at least as the most popular building block of such models). The key idea behind the CNN is using a set of filters to gradually learn more and more complex features (Stewart). A close analogy would be a flashlight gliding over an image. Using this 'flashlight' within convolutional layers, we significantly reduce the number of trainable parameters. An example of a convolutional layer is given in figure 5.

Figure 5. The convolution operation (Stewart)

The latter is a major improvement over the classical neural network, which uses one input per pixel, so that processing even a moderate-resolution picture results in hundreds of thousands of trainable parameters. So, at first the CNN learns simple shapes or even shades; then it gradually learns more and more complicated features, until by the last layer it can recognize a nose in the picture and even tell which animal it belongs to.
Figure 6. Simple CNN example (Stewart)

Figure 6 represents a simple CNN architecture. A CNN can also be applied to time series data: we glide a 1D filter over the sequence of observations, mapping it to an output which can be in scalar or vector form, the latter representing a single or multiple time steps. As any other neural network, a CNN is trained using backpropagation.
4. Modeling pollution level

This is the empirical part of the present research. We start by describing the data used and the preprocessing steps taken. We also discuss the metric chosen for model comparison and the feature engineering required to account for the time nature of our data. In the end, we build the models and conduct a comparison with the benchmark. We use the programming language Python 3.5 for our analysis; the code for modeling is available in the appendix. For time series analysis we use the statsmodels library. Machine learning models are built using the sklearn library. To build neural networks we use the Keras library operating on top of the Tensorflow library.

4.1 Data

We are using 5 years of hourly data on chemical (PM2.5) and meteorological (temperature, relative humidity, solar radiation, wind speed and wind direction) variables collected from a monitoring station located in Cuenca, Ecuador. As WRF-Chem provides daily observations, we downsample the data to daily observations using the mean as the aggregation function.

Unfortunately, our data has a lot of missing observations. Even worse, the dates for which observations are missing are not consistent across variables. E.g., wind speed and wind direction don't have observations for the majority of 2016, while temperature is missing for the second half of 2015 and 2 months of 2017. Given that, we cannot just drop missing observations, as this would reduce our dataset from 1518 to ~350 observations. Hence, we use interpolation. For some variables (e.g. temperature) the best interpolation technique proved to be a spline of order 5; others (e.g. solar radiation) were best approximated by simple linear interpolation. The interpolation method was chosen by the criterion of the best fit to the existing data. Possible shortcomings of this approach are discussed in the discussion section of the present paper.

Prior to modeling we need to clean our data of outliers. Some data can be just false (for example, a negative pollution level); some days show extremely high values of some variables, which can adversely affect the training process and result in a loss of accuracy.
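A sketch of the downsampling and gap-filling just described, assuming pandas; the file name and column names ('temp', 'sol') are illustrative:

```python
import pandas as pd

# Hypothetical file with the hourly Cuenca series, indexed by timestamp.
hourly = pd.read_csv("cuenca_hourly.csv", index_col="date", parse_dates=True)
daily = hourly.resample("D").mean()   # hourly -> daily, mean aggregation

# Spline of order 5 for temperature, simple linear fill for solar radiation,
# as chosen by best fit to the existing data (splines require scipy).
daily["temp"] = daily["temp"].interpolate(method="spline", order=5)
daily["sol"] = daily["sol"].interpolate(method="linear")
```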
To detect outliers, we are going to use the interquartile range. The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the 75th and 25th percentiles (a percentile is a measure indicating the value below which a given percentage of observations falls). It is represented by the formula IQR = Q3 - Q1. After calculating the IQR for each variable, we limit the variable to the interval between Q1 - 1.5·IQR and Q3 + 1.5·IQR. Table 1 shows the statistical description of the data.

Table 1. Statistical description of data

        pm      temp    hum     sol      wind_dir  vel_ms
count   1583    1583    1583    1583     1583      1583
mean    9.15    15.25   64.40   191.89   161.79    1.72
std     3.81    1.12    8.06    70.96    50.66     0.34
min     0.00    11.30   25.59   0.00     11.08     0.46
25%     6.57    14.42   59.45   130.56   129.70    1.60
50%     8.99    15.23   64.67   186.47   157.12    1.64
75%     11.51   15.98   69.24   244.39   189.01    1.92
max     21.40   19.11   91.12   472.04   307.62    3.02

Some statistical models can be sensitive to differences in the magnitude of variables. For example, linear regression performs better if all the variables are scaled (or normalized); tree-based algorithms are in general less sensitive; for neural networks, normalization is a must, as it allows for faster convergence and better accuracy (Ali & Faraj, 2014). We are going to use a wide range of machine learning models, so it is better to perform some transformation on our data. We use min-max normalization to assure that all the variables have the same magnitude (contained within 0 and 1). Normalization is performed as shown in equation 14:

x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}    (14)
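Continuing the sketch above, the outlier rule and the min-max normalization of equation 14 can be expressed as:

```python
# 'daily' is the interpolated DataFrame from the previous sketch.
q1, q3 = daily.quantile(0.25), daily.quantile(0.75)
iqr = q3 - q1

# Clip each variable to [Q1 - 1.5*IQR, Q3 + 1.5*IQR], then rescale to [0, 1];
# axis=1 aligns the per-column bounds with the DataFrame columns.
clipped = daily.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
normalized = (clipped - clipped.min()) / (clipped.max() - clipped.min())
```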
It is worth mentioning that the data is quite erratic, without a clear trend and/or seasonality. This further nudges us towards machine learning, as those models are in general more suitable for modeling complicated non-linear dependencies. Figure 7 shows the erratic nature of the pollution level time series.

Figure 7. Pollution level time series (normalized PM2.5 level, September 2014 – September 2018)

4.2 Methodology of modeling

Evaluation metric

To evaluate the models' results we use the mean absolute error (MAE), a metric widely used for assessing regression problems. I have chosen the mean absolute error over the root mean squared error because our data is erratic and I do not want to inflict additional punishment for outliers. The formula to calculate MAE is shown in equation 15:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (15)

where y_i is an observed value of the dependent variable, ŷ_i is the predicted one and n is the size of the testing sample.
Time features

For time series modeling we just use our six time series, but to use machine learning algorithms we need to add time features explicitly. I add hour, day_of_week and week_of_year features (whole numbers) to account for the weekly and annual trends in the data. Then, for each variable I add its values at 6am, 1pm and 3pm, as those hours provide us with the most representative concentrations of pollutants over the day: 6am is the beginning of the morning peak, 1pm corresponds to the midday baseline and 3pm is the beginning of the evening peak. Then, for all the variables excluding hour, day_of_week and week_of_year, I add lagged values up to the 5th lag (e.g. the 2nd lag is the value observed two days ago). This way we account for the time nature of the data.

Walk forward validation

In order to make our evaluation robust, we use cross-validation. Using 5-fold cross-validation, we split the dataset into 5 parts. Then we train our model on the first four of them. We use the last part of our dataset, not used in the training process, for validation. Once we get the MAE, we save it and repeat the process using the first, second, third and fifth parts for training and the fourth one for evaluation, getting another MAE (the process is repeated 5 times). This way we guarantee that our model has been evaluated on all the available data.

Unfortunately, this is not feasible in the case of time series models. We could use such cross-validation for the machine learning models that do not depend on the sequential structure; for time series models, however, the time structure is a requirement. So we need a better approach to evaluation, applicable both to time series models and to machine learning models. We are going to use walk forward cross-validation. This approach requires two sliding windows, for the training set and the test set; it is depicted schematically in figure 8 and sketched in code below. For each test set we calculate a separate MAE and then take the average of them. Sliding over the dataset allows for robustness in evaluation and training.
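A minimal sketch of the procedure; the window sizes follow section 4.3 (a two-year training window sliding over 30-day test sets), and the model can be any object with fit/predict methods:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def walk_forward_mae(model, X, y, train_size=730, test_size=30, n_splits=5):
    """Slide a fixed-length training window over the data; average test MAEs."""
    maes = []
    for i in range(n_splits):
        start = i * test_size
        train = slice(start, start + train_size)
        test = slice(start + train_size, start + train_size + test_size)
        model.fit(X[train], y[train])
        maes.append(mean_absolute_error(y[test], model.predict(X[test])))
    return np.mean(maes)
```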
Figure 8. 4-fold walk forward cross-validation (Moudiki, 2020)

4.3 Modeling

The biggest problem with our data is the fact that we only have the WRF-Chem prediction for September 2014, while our dataset stretches from 2014 to 2018. We overcome this issue by reversing our data, so that the order is preserved, but reversed. At the same time, models fit and tuned to predict specifically September 2014 risk lacking robustness. So, at first we evaluate our models using walk forward cross-validation with a training window of two years. Then we take the best models and compare their forecast for September 2014 to the benchmark.

Time series analysis models

Prior to building our time series models, we conduct the Augmented Dickey-Fuller test with constant and trend for each sequence. The null hypothesis of the test states that the series is not stationary. The results are shown in table 3.

Table 3. Augmented Dickey-Fuller test results

Time series       p-value      Conclusion
pollution level   8.8306e-06   series is stationary
temperature       0.0002       series is stationary
humidity          0.0          series is stationary
solar radiance    0.0306       series is stationary
wind speed        1.1e-09      series is stationary
wind direction    2.441e-06    series is stationary
As we can see, all our series are stationary, and we can proceed. We are using a test window of 30 observations with 5-fold walk forward cross-validation. For univariate analysis we use the SARIMA model, as changing its parameters allows us to test a broad scope of models, from AR to full SARIMA. For multivariate analysis we use the classical VAR model. The best configuration was chosen based on the best Akaike information criterion after a simple iteration over different orders of lags. The results are shown in table 4.

Table 4. Time series models results

Model                        MAE
SARIMA (2,0,1) (0,0,0,12)    0.1475
VAR (9)                      0.1237
Holt-Winters                 0.1368

The vector autoregressive model of order 9 shows the best quality of fit, with an average MAE of 0.12.

LSTM models

Next, we fit neural network models with 5-fold walk forward cross-validation. We start with long short-term memory neural networks. First, we try different configurations to predict the pollution level one day ahead. The single channel model uses only historical data on the pollution level. Multichannel models use historical data on all the available time series (pollution level, humidity, solar radiance, temperature, wind direction and wind speed). Multichannel output models allow us to predict not only the target variable but all the series, like the VAR model. Our architecture consists of LSTM layers with 50 units followed by a fully connected layer. The results are shown in table 5.

Table 5. LSTM model results for different configurations, forecast 1 step ahead

Model                                               MAE
LSTM single channel input, single channel output    0.0973
LSTM multichannel input, single channel output      0.0846
LSTM multichannel input, multichannel output        0.0939
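For reference, a sketch of the multichannel input, single channel output configuration in Keras (50 LSTM units and a dense head, as described above); the 7-step input window and training hyperparameters are illustrative assumptions:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Input: chunks of 7 time steps x 6 series; output: PM2.5 one step ahead.
model = Sequential([
    LSTM(50, input_shape=(7, 6)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mae")
# model.fit(X_train, y_train, epochs=50, batch_size=32)  # X_train: [n, 7, 6]
```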
As we can see, the LSTM using multichannel input to predict the pollution level one step forward has the lowest MAE. To predict more than one step ahead, we reshape our data. For example, to build a model making a prediction for a week, we reshape our data into [1583, 6, 7], adding a time step dimension. This means that we feed our model chunks of data, each containing 7 time steps of 6 time series. The results are available in table 6.

Table 6. LSTM model results for the broader forecast horizon

Model                                                                MAE
LSTM multichannel input, single channel output, 5 days prediction    0.0851
LSTM multichannel input, single channel output, 7 days prediction    0.0883
LSTM multichannel input, single channel output, 10 days prediction   0.0989
LSTM multichannel input, single channel output, 30 days prediction   0.1483

Unfortunately, the MAE rises rather rapidly when predicting more than 10 steps forward, but a one-week prediction is handled relatively well.

CNN models

Next, we fit convolutional neural networks with 5-fold walk forward cross-validation. First, we try different configurations to predict the pollution level one step forward. The single channel model uses only historical data on the pollution level. Multichannel models use historical data on all the available time series. Our CNN consists of a convolutional layer followed by a max pooling layer followed by a fully connected layer of 50 neurons. The results are presented in table 7.

Table 7. CNN model results for different configurations, forecast 1 step ahead

Model                                             MAE
CNN single channel input, single channel output   0.0943
CNN multichannel input, single channel output     0.0721
CNN multichannel input, multichannel output       0.0947
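A sketch of the corresponding CNN (convolution, max pooling, fully connected layer of 50 neurons, as described above); the filter count and kernel size are assumptions:

```python
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    # 1D filters glide over the 7-step window of 6 series.
    Conv1D(64, kernel_size=3, activation="relu", input_shape=(7, 6)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(1),   # single channel output: PM2.5
])
model.compile(optimizer="adam", loss="mae")
```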
The multichannel input, single channel output CNN shows a surprisingly good result. The next step in our analysis is to test this model's ability to predict over a broader time horizon. We use the same trick we used for the LSTM models. The results are available in table 8.

Table 8. CNN model results for the broader forecast horizon

Model                                                               MAE
CNN multichannel input, single channel output, 5 days prediction    0.0781
CNN multichannel input, single channel output, 7 days prediction    0.0832
CNN multichannel input, single channel output, 10 days prediction   0.0894
CNN multichannel input, single channel output, 30 days prediction   0.1238

As we can see, the CNN outperforms the LSTM also in the case of an extended prediction horizon. The MAE is systematically lower and degrades more slowly over time.

Machine learning models and artificial neural network

Machine learning models and the artificial neural network do not have a natural ability to predict several steps ahead. As our final goal is a model predicting a month ahead, we build a module containing 30 models, each predicting one more day forward. So, linear regression (1) will predict the pollution level one step ahead and linear regression (23) will predict the pollution level 23 steps ahead. In order to train a model predicting k steps forward, we shift our target variable, so that today's value of the features is mapped to the value of the target on the kth step. This trick allows us to build machine learning models for forecasting. It has its limitations, though: as we cannot shift the data endlessly, we need enough data to train our models on the intersection of the features and the shifted target. We are using all the available data (time features, time series and their lags) in the following analysis. For each machine learning model used, we conduct hyperparameter tuning using simple iteration over a grid of all possible combinations. A sketch of the target-shifting module is given below.
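This sketch assumes a pandas feature matrix X and target series y; the GBM regressor stands in for any of the models compared in table 9:

```python
from sklearn.ensemble import GradientBoostingRegressor

def fit_horizon_module(X, y, horizon=30):
    """Train one model per forecast horizon by shifting the target k steps."""
    models = []
    for k in range(1, horizon + 1):
        y_k = y.shift(-k)        # today's features -> target k days ahead
        mask = y_k.notna()       # shifting truncates the end of the data
        m = GradientBoostingRegressor(max_depth=10, n_estimators=100,
                                      learning_rate=0.1)
        models.append(m.fit(X[mask], y_k[mask]))
    return models

# Day-k forecast: apply model k to the most recent feature row, e.g.
# forecast = [m.predict(X.tail(1))[0] for m in fit_horizon_module(X, y)]
```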
Table 9 contains the average MAE of the 30 instances of each machine learning model, calculated using 5-fold walk forward cross-validation.

Table 9. Machine learning model results on the 30-steps-ahead forecast

Model                                                             MAE
Linear regression                                                 0.1348
Ridge regression (alpha = 1)                                      0.1221
Lasso regression (alpha = 0)                                      0.1147
SVM regressor (C = 2, kernel = linear)                            0.0929
RF regressor (max_depth = 10, n_estimators = 100,
  min_samples_leaf = 10)                                          0.0875
GBM regressor (max_depth = 10, n_estimators = 100,
  learning_rate = 0.1)                                            0.0872
XGB regressor (max_depth = 5, n_estimators = 100)                 0.0958
ANN (input layer (144), hidden layer (200), hidden layer (100),
  hidden layer (50), output layer (1))                            0.1182

Table 10 contains information on all the MAEs, using the gradient boosting machine as an example. We can see that machine learning models are in general less prone than time series models and neural networks to degradation over an extended prediction horizon. Ensemble learning shows the best error, with the gradient boosting machine regressor dominating. Figure 9 demonstrates the GBM regressor prediction for September. The normalization has been reversed; WRF-Chem's MAE equals 2.05 and the GBM module's MAE equals 1.89. So, we gained better accuracy predicting the pollution level for 27 time steps. We also experienced lower computational costs: the WRF-Chem model may take a month to simulate a month of observations, while the GBM module took a little under 4 minutes to train on more than 4 years of observations and a third of a second to predict a month of data. Table 11 shows the fitting and testing time for all the models used. For both groups of models, the train/test split is roughly 720/30 observations.
Figure 9. Observed pollution level vs. WRF's prediction vs. GBM's prediction (PM2.5 level over the 27 forecast days of September 2014)

Table 11. Fitting and forecasting time of various models

Model               Fit time   Prediction time
SARIMA              0.41s      0.12s
Holt-Winters        0.32s      0.09s
VAR                 0.49s      0.13s
LSTM                8.41s      0.41s
CNN                 4.05s      0.32s
Linear regression   2.53s      0.15s
Lasso regression    0.51s      0.07s
Ridge regression    0.32s      0.02s
SVM regressor       12.61s     0.14s
RF regressor        7.32m      0.73s
GBM regressor       3.53m      0.32s
LGBM regressor      2.12m      0.26s
XGB regressor       2.15m      0.21s
ANN model           16.17m     0.76s