
Degree Project

Level: Master’s in Business Intelligence

WRF-Chem vs machine learning approach to predict air

quality in urban complex terrains: a comparative study

Authors: Andrey Kudryashov

Supervisor: Yves Rybarczyk

Examiner: Moudud Alam

Subject/main field of study: Microdata Analysis

Course code: MI4002

Credits: 15 ECTS

Date of examination: 08.06.2020

At Dalarna University it is possible to publish the student thesis in full text in DiVA.

The publishing is open access, which means the work will be freely accessible to read

and download on the internet. This will significantly increase the dissemination and

visibility of the student thesis.

Open access is becoming the standard route for spreading scientific and academic

information on the internet. Dalarna University recommends that both researchers as

well as students publish their work open access.

I give my/we give our consent for full text publishing (freely accessible on the internet,

open access):

Yes ☒ No ☐

Dalarna University – SE-791 88 Falun – Phone +4623-77 80 00


Abstract:

Air pollution is the main environmental health issue; it affects all regions

and causes millions of premature deaths every year.

measures, we need the ability to predict pollution level and air quality. This task is

conventionally solved using deterministic models. However, those models fail to

capture complex non-linear dependencies in erratic data. Lately machine learning

models gained popularity as a very promising alternative to deterministic models.

The purpose of this thesis is to conduct a comparative study between Chemical-

Transport Model (WRF-Chem) and a Statistical Model built from machine

learning algorithms in order to understand which one is more advantageous in predicting

the air quality and the meteorological conditions, using data from Cuenca, Ecuador.

The study aims to compare the two methods and conclude on which of them is

better in forecasting the concentration of fine particulate matter (PM2.5) in an

urban complex terrain. I concluded that even though WRF-Chem has the major

advantage of forecasting all the variables of interest over a broader time horizon,

machine learning algorithms provide better accuracy for the medium-term period.

Machine learning models also require much less computational power, but they lack

the ability to predict meteorological conditions along with the pollution level.

Keywords: Machine learning, WRF-Chem, comparative study, air quality


Table of Contents

1. Introduction ...................................................................................................... 4

1.1 Background ............................................................................................... 4

1.2 Relevance .................................................................................................. 5

1.3 Purpose ...................................................................................................... 5

1.4 Scientific novelty ....................................................................................... 6

1.5 Structure of the research ............................................................................ 6

2. Overview of pollution level modeling ............................................................. 7

2.1 Deterministic methods ............................................................................... 7

2.2 Non-deterministic methods ....................................................................... 9

3. Machine learning algorithms ......................................................................... 13

3.1 Time series models .................................................................................. 13

3.1.1 Univariate analysis ........................................................................... 13

3.1.2 Multivariate analysis ........................................................................ 15

3.2 Classical machine learning methods ....................................................... 15

3.2.1 Regularized linear regression ........................................................... 16

3.2.2 Support Vector Regression .............................................................. 16

3.2.3 Decision tree..................................................................................... 17

3.3 Ensemble learning methods ..................................................................... 19

3.3.1 Horizontal ensemble......................................................................... 19

3.3.2 Vertical ensemble ............................................................................. 19

3.4 Artificial Neural Networks ...................................................................... 20

3.4.1 Multilayer perceptron ....................................................................... 20

3.4.2 LSTM neural network ...................................................................... 21

3.4.3 CNN neural network ........................................................................ 23

4. Modeling pollution level ................................................................................ 25

4.1 Data ......................................................................................................... 25

4.2 Methodology of modeling ....................................................................... 27

4.3 Modeling ................................................................................................. 29

5. Discussion and Conclusion ............................................................................ 35

6. References ...................................................................................................... 36

7. Appendix ........................................................................................................ 40


1. Introduction

1.1 Background

The global population, which is currently around 7.8 billion, has increased by 100%

over the last 40 years and is estimated to increase by 50% during the next

40 years, reaching 9 billion by 2037 (Ahmadov, 2016). Most of the growth occurs

in urban areas of the developing parts of the world and results in the overuse

and shortage of natural resources, deforestation, climate change and especially

environmental pollution (Ritter et al., 1992).

According to the World Health Organization (WHO), air pollution is the main

environmental health issue that affects all regions of the world and has caused 4.2

million premature deaths all over the world during 2016. However, the inhabitants

of low-income cities are the most impacted ones. This fact is supported by the

latest air quality database, which indicates that 97% of cities in low- and middle-

income countries with more than 100,000 residents do not meet WHO air

quality guidelines (Rybarczyk & Zalakeviciute, 2018).

Outdoor air pollution affects large cities as well as rural areas and is caused by

multiple factors like industry and energy supply, waste management, transport,

dust, agricultural practices and household energy (Zalakeviciute et al., 2018).

Pollutants that have been proved to be the most dangerous to public health

include particulate matter (PM), ozone (O3), nitrogen dioxide (NO2) and

sulphur dioxide (SO2). The most registered health risks are related to particulate

matter of less than 10 and 2.5 microns in diameter (PM10 and PM2.5). PM is

capable of penetrating deep into lung passageways and entering the bloodstream

causing cardiovascular, cerebrovascular and respiratory impacts. Additional

serious health issues induced by air pollution are, according to WHO, heart

disease, stroke, chronic obstructive pulmonary disease and lung cancer (WHO, 2014).

It is not only the human health that is critically impacted by the air pollutants but

also the earth’s climate and ecosystems globally (WHO, 2014). Air quality can

impact climate change and climate change can respectively impact air quality.

Emissions of pollutants into the air can result in climate change. Ozone


in the atmosphere warms the climate, while different components of particulate

matter (PM) can have either warming or cooling effects on the climate. On the

other hand, changes in climate can affect the local air quality. Atmospheric

warming related to climate change potentially increases ground-level ozone in

many regions and, due to this fact, it may be challenging to comply with

ozone standards in the future. The impact of climate change on other air pollutants

is still uncertain but many studies are in progress to manage this uncertainty

(Brunelli et al., 2007).

1.2 Relevance

Due to the information mentioned above, it is an indisputable fact that the prediction

and monitoring of air quality are of the utmost importance both for human

health and for the climate. The present comparative study between the Weather

Research and Forecast Chemistry model (WRF-Chem) and machine learning

(statistical method) air quality prediction (Carnevale et al., 2009), will be based on

the available data from the meteorological station of the Cuenca city in Ecuador.

1.3 Purpose

The purpose of this study is to compare the accuracy of the prediction between a

WRF-Chem model and a Statistical Model built from machine learning algorithms

and investigate which of the two methods is better in the forecasting of the

concentration of fine particulate matter (PM2.5) in an urban complex terrain as

well as the meteorological conditions.

In order to reach our goal, we need to determine which machine learning

algorithms might be used to predict air quality, build those models and conduct a

final comparison regarding accuracy, complexity and time costs.

Our methodology is to compare a benchmark with the methods developed throughout

the process. We use WRF-Chem’s prediction error as a benchmark to compare

with results of different statistical methods and machine learning algorithms.


1.4 Scientific novelty

Current studies show that traditional deterministic models tend to struggle to

capture the non-linear relationship between the concentration of air pollutants and

their sources of emission and dispersion (Shimadera et al., 2016). To tackle such a

limitation, a very promising approach is to use statistical models based on machine

learning techniques (Chen et al., 2017). We try a broad variety of different statistical

approaches to overcome the issue, including ensemble learning and sequence-to-

sequence neural network models. The related literature demonstrates the use of machine

learning models to predict the air pollution level for the next day. We will create and

evaluate a module allowing for multistep prediction.

1.5 Structure of the research

The paper consists of an introduction, three chapters, a discussion, a conclusion, a

reference list and an appendix. The first chapter is an overview of best practices used in

the field to predict air pollution. We will compare the deterministic and non-

deterministic approaches and discuss the advantages of each. The second chapter

explains the statistical methods used in the study. We also discuss their advantages,

disadvantages and suitability for the paper’s goal. The third chapter contains the empirical

part of the present research. It describes the data used and its preprocessing. Then we

build the selected machine learning algorithms and test them against the benchmark.


2. Overview of pollution level modeling

In the related literature, forecasting of the pollution level is usually performed

using one of two approaches: deterministic and statistical. This logically leads to

the structure of the present chapter. In the deterministic approach, prediction is made

based on field-specific knowledge about the data, e.g. the laws of physics and chemistry.

In the non-deterministic approach, the researcher uses statistical models and

algorithms to extract rules from the data with little or no prior knowledge

(Armstrong, 2002).

2.1 Deterministic methods

Deterministic models are usually represented by systems of models that work

together to simulate emission, transport, diffusion, transformation, and removal of

air pollutants. These are namely meteorological models, emission models and

air quality models. A pollutant concentration forecast can be performed using simple

one-dimensional air quality models, but three-dimensional models are used to

simulate complex interactions of physical and chemical processes (U.S.

Environmental Protection Agency, 2003).

One of the most widely used meteorological models is the Penn State/NCAR

Mesoscale Model version 5 (MM5), which is a regional mesoscale model used for

weather forecasting and climate projections, maintained by Penn State University

(Grell et al., 1994). Another prime example is the Regional Atmospheric Modeling

System – RAMS which is a comprehensive mesoscale meteorological modeling

system (Pielke et al., 1992).

In the process of emission modeling, estimated emissions with spatial, temporal

and chemical resolution are used to model air quality (Pielke et al., 1992). Data on

emission includes mobile sources, stationary sources, area sources and natural

sources. The most used emission modeling systems are the Emission Processing System

(EPS 2.0) (U.S. Environmental Protection Agency, 1992), Emissions Modeling

System (EMS-95 – EMS-2002) (Bruckman, 1993) and Sparse Matrix Operator

Kernel Emissions (SMOKE) modeling system (Coats, 1996).


There are two types of three-dimensional models, Lagrangian and Eulerian,

depending on the method used to simulate the time-varying distribution of

pollution concentrations. Lagrangian models trace individual air parcels over

time, using meteorological data to transport and diffuse the pollutants, which is why

they are also called trajectory models. However, because the model traces each

individual parcel of air, it is computationally inefficient when a large

number of individual sources interact and nonlinear chemistry is involved, and these

models have limited usefulness in forecasting secondary pollutants (Pielke et al.,

1992).

Eulerian models use a grid of cells (vertical and horizontal) where the chemical

transformation equations are solved in each cell and pollutants are exchanged

between cells. These models can produce three-dimensional concentration fields

for several pollutants but require significant computational power. Typically, the

computational requirements are reduced using nested grids, with a coarse grid used

over rural areas and a finer grid used over urban areas where concentration

gradients tend to be more pronounced (Pielke et al., 1992).

The Hybrid Single-Particle Lagrangian Integrated Trajectories with a generalized

nonlinear Chemistry Module (HY-SPLIT CheM) model is an example of a

Lagrangian model used to forecast air quality on a regional scale (Stein et al.,

2000). However, these models struggle to work with a large number of emission

sources, so Eulerian models are used more often for the urban scale. Popular

Eulerian models include multiscale Air Quality Simulation Platform (MAQSIP)

(Odman & Ingram, 1996), SARMAP Air Quality Model (SAQM) (Chang et al.,

1996) and Urban Airshed Model with Aerosols (UAM-AERO) (Lurmann, 2000).

A very popular deterministic model is Weather Research and Forecasting with

Chemistry (WRF-Chem V3.2) (WRF, 2017). WRF is a 3-D latest-generation non-hydrostatic

model used for meteorological forecasting and weather research. It is a

fully compressible model that solves the equations of atmospheric motion, with

applicability to global, mesoscale, regional and local scales. WRF also has the

configuration WRF-Chem for modeling the interactions between meteorology and

transport of pollutants.


It is not rare that deterministic models are developed for specific regions.

Finardi et al. developed a deterministic module to forecast air quality in the city of Torino

(Finardi et al., 2008). The modeling system is based on prognostic downscaling of

weather forecasts and on multi-scale chemical transport model simulation, in order

to describe atmospheric circulation in a complex topographic environment, the

space/time variation of emissions and pollutant import from neighboring regions.

2.2 Non-deterministic methods

Quite often, authors use a broad variety of machine learning models and conduct a

comparative analysis of the results. Saniya et al. use the level of precipitation, wind

speed and wind direction to predict the concentration of PM2.5. The authors use Linear

Regression, Multilayer Perceptron, Support Vector Machine and M5P Model

Trees. A collaborative filtering algorithm has played a major role by making

automatic and accurate predictions based on previous trends of pollutant levels and

database in the server (Saniya et al., 2018).

Sayegh et al. also employ a number of machine learning models including Linear

Regression, Quantile Regression, Generalized Additive model and Boosted

Decision Trees model, comparing their performance in predicting PM10.

Meteorological factors, including wind speed, wind direction, temperature and

humidity, and chemical species, including CO, NOx, SO2 and the PM10 value for the

previous time step, from one year of data from Makkah, Saudi Arabia are used. Quantile

Linear Regression shows better results due to the fact that covariates affect the

quantiles heterogeneously, which is lost in the central tendency prediction

framework (linear regression) (Sayegh et al., 2014).

Singh et al. in their paper identify sources of pollution and forecast the air

pollution level using various machine learning models: a Hybrid Model with

Principal Components Analysis, Support Vector Machine and ensemble learning

models – Random Forest and Boosted Decision Tree. The authors use five years of

pollution level and meteorological variables data for Lucknow, India. The models are

used to predict the Air Quality Index and the Combined AQI. They also research the

importance of predictors and their influence on the forecast. Boosted Decision


Tree in that paper shows the best result closely followed by Random Forest (Singh

et al., 2013).

Philibert et al. use Random Forest and Linear and Nonlinear Regression to predict

N2O emission level. They use data on environmental and crop variables including

fertilization, type of crop, experiment duration, country, etc. on the global scale.

The authors use variable selection to rank variables by importance and include only the

most informative ones, which results in increased accuracy. The Random Forest

model shows the best result (Philibert et al., 2013).

In the paper by Nieto et al., the authors aim to predict various pollutants’ levels,

including NO2, SO2 and PM10 in Oviedo, Spain based on a number of

meteorological factors. They use Multivariate Adaptive Regression Splines and

Multilayer Perceptron model on three years of historical data (Nieto et al., 2015).

Kleine Deters et al. use six years of meteorological data including wind speed and

precipitation for Quito, Ecuador to identify the meteorology effects on PM2.5.

They use Linear Regression as this statistical method offers excellent

interpretability and allows for easy analysis of the statistical significance of

independent variables (Kleine et al., 2017).

Carnevale et al. aim to estimate the relationship between PM10 emission and

pollutants from the Air Quality Index for Lombard region, Italy using hourly data

on SO2, NOx, CO, PM10 and NH3 for a year. The Dijkstra algorithm is deployed

in the large-scale data processing system. The model’s performance was then

compared against a deterministic model simulation. The performance of the model is

close to the Transport Chemical Aerosol Model which is computationally much

more expensive (Carnevale et al., 2018).

Suárez Sánchez et al. investigate the dependence between primary and secondary

pollutants and the most significant contributors to the air pollution level. Data include

three years of observations of NOx, CO, SO2, O3 and PM10 in Aviles, Spain.

Authors use various Support Vector Machine kernels including radial, linear,

quadratic and Pearson VII Universal Kernels, and a Multilayer Perceptron Model to

predict NOx, CO, SO2, O3, and PM10. The best quality was achieved

using the Pearson VII Universal Kernel (Suárez et al., 2011).


Liu et al. also use SVM to predict the Air Quality Index, training models on two years

of observations from three cities in China (Beijing, Tianjin, and Shijiazhuang).

Data includes AQI values, various pollutants’ concentrations (PM2.5, PM10, SO2,

CO, NO2, and O3), meteorological factors (temperature, wind direction and

velocity), and weather descriptions (e.g. cloudy/sunny or rainy/snowy). The

model performance was significantly improved after including the surrounding

cities’ air quality levels (Liu et al., 2017).

Another paper, by Vong et al., uses SVM to forecast pollutant (NO2, SO2, O3, SPM)

levels from historical and meteorological data from Macau, China. The authors use

three years of data to train the model and one year to evaluate the performance. The

Pearson correlation is used to identify the best predictors for each pollutant and

different kernels are used to test which of the predictors or models get the best

results. They also use the Pearson correlation as a metric to determine the optimal number

of days for forecasting. They achieve a good fit and conclude that SVM’s

performance crucially depends on the choice of kernel (Vong et al., 2012).

A study by Zhan et al. uses a Random Forest model to build a spatiotemporal model to

predict O3 concentration across China. They use RF with 500 estimators (decision

trees). Dataset includes one year of observations for meteorology variables,

planetary boundary height, elevation, anthropogenic emission inventory, land use,

vegetation index, road density, population density, and time from 1601 stations

located all across China. Performance of the model is evaluated against Chemical

Transport models’ simulations using RMSE and R squared as metrics. Machine

learning models show better accuracy while at the same time consuming less in

terms of computational resources. They also conclude that the accuracy of prediction

relies heavily on the quality of coverage by the monitoring network (Zhan et al.,

2018).

Martínez-España et al. aim to find the most robust machine learning algorithms to

preserve accuracy in case of O3 monitoring failure. The authors use Decision Tree, k-

Nearest Neighbours, Bagging, Random Committee and Random Forest models.

They compare the performance of the selected models and then use hierarchical clustering

to determine optimal number of models to predict the O3 level in the region of

Murcia, Spain. Random Forest slightly outperforms the other models. The best


predictors turn out to be NOx, temperature, wind direction, wind speed, relative

humidity, SO2, NO, and PM10. They also conclude that two models are enough

for the chosen data (Martínez-España et al., 2018).

In the paper by Bougoudis et al., the authors identify the conditions under which high

pollution emerges. They use a hybrid system based on the combination of

clustering, Artificial Neural Networks, Random Forest and fuzzy logic. Twelve

years of hourly observations of CO, NO, NO2, SO2, temperature, relative

humidity, pressure, solar radiation, wind speed and direction from Athens, Greece

are used. The optimization of the modeling performance is done with Mamdani

rule-based fuzzy inference system that exploits relations between the parameters

affecting air quality. Specifically, self-organizing maps are used to perform dataset

re-sampling, then ensembles of feedforward artificial neural networks and random

forests are trained on the clustered data vectors (Athanasopoulos et al., 2017).

Elangasinghe et al. is one of the earlier papers using neural networks to predict the

concentration of NO2. They use a genetic algorithm to optimize the inputs for the neural

network. The variable set includes wind speed, wind direction, solar radiation,

temperature, relative humidity and time features accounting for hour, day and

month (Elangasinghe et al., 2014).

Gardner and Dorling concluded that neural networks outperform linear

statistical methods when non-linear dependencies are present (Gardner & Dorling, 1999).

Perez conducted a comparison between the persistence method, linear regression and

neural network using data from Santiago, Chile. He concluded that the best error

on the hourly prediction of pollution level was obtained using neural networks

(Pérez et al., 2000). Brunelli et al. used recurrent neural networks to predict

concentration of various pollutants for two days ahead using meteorological data

(Brunelli et al., 2015).

Some authors have been improving neural networks’ accuracy using other

methods. Grivas et al. use a neural network capable of combining meteorological

and time-scale input to predict hourly pollution level over the Greater Athens Area

using data collected in 2001-2002. Their model greatly outperformed linear

regression used for comparison (Finardi et al., 2008).


3. Machine learning algorithms

Machine learning methods are gradually infiltrating time series analysis and pollution

level modeling. However, properly configured, they hold powerful potential. In

this chapter, we are going to do a quick recap of the time series models and discuss

machine learning models.

3.1 Time series models

For the univariate time series analysis, we are going to use two models: SARIMA

and Holt-Winters Exponential Smoothing. For the multivariate time series

analysis, we are going to use the vector autoregressive model (VAR).

3.1.1 Univariate analysis

The autoregressive integrated moving average (ARIMA) is a classical time series

model designed to analyze and forecast time series data (Zhang, 2001). It is a

generalization of the ARMA model in which the data is allowed to be non-stationary.

Equation 1 shows the ARMA model with an autoregressive component of order 𝑝 and

a moving average component of order 𝑞.

𝑦𝑡 = 𝜃0 + 𝜙1𝑦𝑡−1 +⋯+ 𝜙𝑝𝑦𝑡−𝑝 + 𝜀𝑡 + 𝜃1𝜀𝑡−1 +⋯+ 𝜃𝑞𝜀𝑡−𝑞 (1)

To use an ARIMA model, we need to be sure that our data is stationary, meaning that

it has a constant mean and variance regardless of the time step. ARIMA models

ensure stationarity using differencing, as the differenced series in practice has a high chance

of being stationary. Equation 2 shows the differencing process.

𝑦𝑡′ = 𝑦𝑡 − 𝑦𝑡−1 (2)

In the case of seasonal data, we apply the seasonal differencing shown in equation 3,

in which 𝑚 depicts the assumed seasonality:

𝑦𝑡′′ = 𝑦𝑡′ − 𝑦𝑡−𝑚′ (3)
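The differencing in equations 2 and 3 can be sketched in a few lines; the series and the seasonality 𝑚 = 4 below are hypothetical and serve only to illustrate the two operations:

```python
import pandas as pd

# Hypothetical short series used only to illustrate equations 2 and 3.
y = pd.Series([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0])

# First-order differencing (equation 2): y'_t = y_t - y_{t-1}
y_diff = y.diff()

# Seasonal differencing (equation 3) applied to y', with assumed seasonality m = 4:
# y''_t = y'_t - y'_{t-m}
y_seasonal_diff = y_diff.diff(4)
```

Each differencing pass shortens the usable series by its lag, which is one reason the orders 𝑑 and 𝐷 are kept as small as possible.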


Once we have treated non-stationarity and seasonality in our data using

differencing, we can write a high-level representation of the SARIMA model, shown in

equation 4.

𝑆𝐴𝑅𝐼𝑀𝐴 (𝑝, 𝑑, 𝑞) (𝑃, 𝐷, 𝑄,𝑚) (4)

• 𝑝 is the order of the autoregressive component (AR)

• 𝑑 is the order of non-seasonal differencing

• 𝑞 is the order of the moving average component (MA)

• 𝑃 is the order of the seasonal AR component

• 𝐷 is the order of seasonal differencing

• 𝑄 is the order of the seasonal MA component

• 𝑚 is the number of periods in the season

Holt-Winters Exponential Smoothing is an extension of Holt’s method to capture

seasonality (Winters, 1960). The model consists of a forecast equation (equation 5) and

three smoothing equations: for the level 𝑙𝑡 (equation 6), for the trend 𝑏𝑡 (equation

7) and for the seasonal component 𝑠𝑡 (equation 8). The corresponding smoothing

parameters 𝛼, 𝛽 and 𝛾 are estimated using error minimization. The parameter 𝑚

accounts for the frequency of seasonality.

�̂�𝑡+ℎ|𝑡 = 𝑙𝑡 + ℎ𝑏𝑡 + 𝑠𝑡+ℎ−𝑚(𝑘+1) (5)

𝑙𝑡 = 𝛼(𝑦𝑡 − 𝑠𝑡−𝑚) + (1 − 𝛼)(𝑙𝑡−1 + 𝑏𝑡−1) (6)

𝑏𝑡 = 𝛽(𝑙𝑡 − 𝑙𝑡−1) + (1 − 𝛽)𝑏𝑡−1 (7)

𝑠𝑡 = 𝛾(𝑦𝑡 − 𝑙𝑡−1 − 𝑏𝑡−1) + (1 − 𝛾)𝑠𝑡−𝑚 (8)

The method has two variations: the additive method is preferred when the seasonal variations are

roughly constant throughout the series, while the multiplicative method is

preferred when the seasonal variations change proportionally to the level of the

series. Due to the nature of our data, we use the additive model.
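As an illustrative sketch, the additive recursions in equations 5-8 can be implemented directly (with naive initialization and an illustrative toy series; in practice statsmodels estimates the smoothing parameters by error minimization):

```python
import numpy as np

def holt_winters_additive(y, m, alpha, beta, gamma, h):
    """Additive Holt-Winters following equations 5-8, with naive initialization."""
    l = float(np.mean(y[:m]))                              # initial level
    b = float(np.mean(y[m:2 * m]) - np.mean(y[:m])) / m    # initial trend
    s = list(y[:m] - l)                                    # initial seasonal terms
    for t in range(m, len(y)):
        s_tm = s[t - m]
        l_new = alpha * (y[t] - s_tm) + (1 - alpha) * (l + b)     # equation 6
        b_new = beta * (l_new - l) + (1 - beta) * b               # equation 7
        s.append(gamma * (y[t] - l - b) + (1 - gamma) * s_tm)     # equation 8
        l, b = l_new, b_new
    # Equation 5: level + h * trend + the seasonal term from the last full season
    return [l + (i + 1) * b + s[len(s) - m + (i % m)] for i in range(h)]

t = np.arange(60)
y = 5 + 0.1 * t + 2 * np.sin(2 * np.pi * t / 12)
fc = holt_winters_additive(y, m=12, alpha=0.3, beta=0.1, gamma=0.1, h=6)
```

On this deterministic toy series the forecast tracks the trend plus the repeating seasonal pattern.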


3.1.2 Multivariate analysis

The vector autoregressive (VAR) model is a generalization of the univariate autoregressive model that allows forecasting a vector of time series (Athanasopoulos, 2017). All the variables affect each other and are treated equally. For example, a three-dimensional VAR of order 1 is described by the system of equations shown in equation 9.

𝑦1,𝑡 = 𝑐1 + 𝜙11,1𝑦1,𝑡−1 + 𝜙12,1𝑦2,𝑡−1 + 𝜙13,1𝑦3,𝑡−1 + 𝑒1,𝑡
𝑦2,𝑡 = 𝑐2 + 𝜙21,1𝑦1,𝑡−1 + 𝜙22,1𝑦2,𝑡−1 + 𝜙23,1𝑦3,𝑡−1 + 𝑒2,𝑡
𝑦3,𝑡 = 𝑐3 + 𝜙31,1𝑦1,𝑡−1 + 𝜙32,1𝑦2,𝑡−1 + 𝜙33,1𝑦3,𝑡−1 + 𝑒3,𝑡 (9)

where 𝑒1,𝑡, 𝑒2,𝑡 and 𝑒3,𝑡 are white noise processes that may be contemporaneously correlated.
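A minimal sketch of VAR estimation, fitting a bivariate VAR(1) by least squares on simulated data (the thesis uses the statsmodels VAR implementation; this numpy version only illustrates the structure of equation 9):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary bivariate VAR(1): y_t = c + A @ y_{t-1} + e_t
A_true = np.array([[0.5, 0.1], [0.2, 0.4]])
c_true = np.array([1.0, -0.5])
y = np.zeros((200, 2))
for t in range(1, 200):
    y[t] = c_true + A_true @ y[t - 1] + rng.normal(scale=0.1, size=2)

# Stack lagged values with an intercept column and solve by least squares,
# one regression per variable (one line per equation in the system above)
X = np.hstack([np.ones((199, 1)), y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
c_hat, A_hat = coef[0], coef[1:].T
```

With 200 observations and small noise, the estimated intercepts and lag coefficients recover the simulated values closely.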

3.2 Classical machine learning methods

The most widely used machine learning method is classical linear regression, which uses the least squares method to estimate the coefficients. The linear regression model can be written as equation 10, where 𝑦 is the target variable; 𝑥𝑖 are the explanatory variables; 𝛽𝑖 are the weights of the explanatory variables; 𝜖 is the error between predicted and observed values.

𝑦 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 +⋯+ 𝛽𝑛𝑥𝑛 + 𝜖 (10)

The vector of weights is found by solving the minimization problem shown in equation 11:

min𝛽 (1/𝑛 ∑𝑖=1..𝑛 (𝑦(𝑖) − 𝑥𝑖𝑇𝛽)²) (11)

Whether our coefficients are correct and represent reality well depends on compliance with a set of assumptions: a linear dependency between the target variable and the predictors; normally distributed errors; homoskedasticity, meaning the variance of the error is assumed constant throughout the data; independence of the observations; and absence of multicollinearity, meaning the explanatory variables are not correlated with each other.


3.2.1 Regularized linear regression

In the era of big data, the researcher may face a situation where the number of variables exceeds the number of observations; with the classical least squares method, this leads to overfitting and zero predictive ability. The potential multicollinearity of variables and the need to eliminate a number of them during the analysis are also major problems (Zou & Hastie, 2005).

To combat these problems, regularized least squares models were introduced. The two most popular models, ridge and lasso, are very similar and differ only in the specification of the penalty term (the form of regularization). Let's take a closer look at the lasso model.

Lasso is a straightforward and convenient way to introduce sparsity into a linear regression model. The acronym stands for "least absolute shrinkage and selection operator"; applied to the linear regression model, it performs feature selection and regularization of the weights of the selected features. Lasso adds a penalty term to the OLS minimization problem, as shown in equation 12.

min𝛽 (1/𝑛 ∑𝑖=1..𝑛 (𝑦(𝑖) − 𝑥𝑖𝑇𝛽)² + 𝝀‖𝜷‖𝟏) (12)

The term ‖𝛽‖₁ is the 𝐿1 norm of the weight vector, which penalizes large weights. Because the 𝐿1 norm is used, many weights are driven exactly to 0 (the ridge model uses the 𝐿2 norm instead, so the weights can become arbitrarily small but never zero) and the rest are shrunk. The 𝜆 parameter controls the degree of the regularizing effect and is usually tuned by cross-validation. When 𝜆 is large, many weights become equal to 0.
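A minimal sklearn sketch of this selection effect on synthetic data (the feature count and the penalty strength, called alpha in sklearn, are illustrative choices, not the thesis configuration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# A large-ish penalty drives the weights of irrelevant features exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0))
```

Most of the eight irrelevant weights end up exactly at zero, while the two informative weights are only slightly shrunk.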

3.2.2 Support Vector Regression

Support vector machine (SVM) is a classical machine learning algorithm often

used as a benchmark to measure more complex models’ efficiency due to its speed

and accuracy (Basak et al., 2007).


The basic idea of the SVM is to find a hyperplane separating the classes (in the case of classification). In the case of regression analysis, the task looks similar to constructing a linear regression (minimizing the error), with the difference that the support vector model aims to keep the error within a certain threshold. The optimization problem is formulated as the system of equations shown in equation 13.

(1/2)‖𝜔‖² + 𝐶 ∑𝑖=1..𝑙 (ξ𝑖 + ξ𝑖∗) → min
𝑦𝑖 − ⟨𝜔, 𝑥𝑖⟩ − 𝑏 ≤ 𝜀 + ξ𝑖
⟨𝜔, 𝑥𝑖⟩ + 𝑏 − 𝑦𝑖 ≤ 𝜀 + ξ𝑖∗
ξ𝑖, ξ𝑖∗ ≥ 0 (13)

where 𝐶 is the penalty for the estimation error;
𝜀 is the error tolerance;
ξ𝑖, ξ𝑖∗ are slack variables;
𝜔 is the vector of weights;
𝑥 is the vector of independent variables;
𝑦𝑖 is the dependent variable.

When the set of objects is linearly inseparable, it is necessary to move from the original space to a higher-dimensional space in which the classes are linearly separable. The most popular mappings are:

• Linear: 𝐾(𝑥𝑖, 𝑥𝑗) = 𝑥𝑖𝑇𝑥𝑗;

• Polynomial: 𝐾(𝑥𝑖, 𝑥𝑗) = (1 + 𝑥𝑖𝑇𝑥𝑗)𝑝;

• Gaussian: 𝐾(𝑥𝑖, 𝑥𝑗) = exp(−‖𝑥𝑖 − 𝑥𝑗‖² / 2𝜎²);

• Sigmoid: 𝐾(𝑥𝑖, 𝑥𝑗) = tanh(𝛽0𝑥𝑖𝑇𝑥𝑗 + 𝛽1).
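A minimal sklearn sketch of support vector regression with the Gaussian (RBF) kernel on synthetic data (the hyperparameters are illustrative, not the thesis configuration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=200)

# The RBF kernel maps the data into a space where a linear fit suffices;
# C penalizes errors falling outside the epsilon tube
model = SVR(kernel="rbf", C=2.0, epsilon=0.1).fit(X, y)
pred = model.predict(np.array([[0.0], [1.5]]))
```

The fitted model approximates the underlying sine curve despite the noise, thanks to the non-linear kernel mapping.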

3.2.3 Decision tree

Decision trees are a family of algorithms that play an important role in machine

learning (Thomas, 2000). Due to the simple method of generating decision trees,

decision tree learning is quick and easy compared to more complex algorithms

(Cruz & Wishart, 2006). The tree structure consists of branches of edges connected by internal vertices, with leaves at the end of each branch. Each leaf makes a prediction.

For partitioning, the simplest condition is used: it checks whether the value of some attribute 𝑥𝑗 lies to the left of a specified threshold 𝑡: [𝑥𝑗 ≤ 𝑡]. Let 𝑋𝑚 be the set of objects from the training set at vertex 𝑚. The parameters of the condition are chosen to minimize an error criterion (e.g., the Gini impurity index for a classification problem; mean absolute error for a regression problem).

The parameters 𝑗 and 𝑡 can be selected by enumeration: there is a finite number of features, and of all possible values of the threshold 𝑡 we only need to consider those that produce distinct partitions. After the parameters have been selected, the set 𝑋𝑚 is divided into two subsets, each corresponding to its own child vertex.

The procedure is repeated until the desired accuracy or a stopping criterion is met. The accuracy of decision trees increases with their depth: the deeper the tree, the more complex, non-monotonic dependencies it can capture. However, increasing depth leads to unwanted consequences:

• Loss of interpretability;

• Severe overfitting, as a sufficiently deep tree can reach 100% accuracy on training data while being unable to perform well enough on test data.

The main way to combat overfitting is to regularize the model and select hyperparameters that, on the one hand, show good results on the training data and, on the other hand, produce accurate predictions on the validation data.

The main hyperparameters used to regularize decision trees are the maximum depth of the tree (i.e., the maximum number of splits down the tree) and the minimum number of observations at a terminal vertex (i.e., the minimum number of observations a leaf must contain for a split to happen).
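A minimal sklearn sketch of these regularizing hyperparameters on a synthetic single-threshold target (all values here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 2))
# Target depends on a single threshold of the first feature, plus noise
y = np.where(X[:, 0] <= 5, 1.0, 4.0) + rng.normal(scale=0.1, size=300)

# max_depth and min_samples_leaf are the two regularizers discussed above;
# a shallow tree is enough for a single-threshold target
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10).fit(X, y)
pred = tree.predict(np.array([[2.0, 5.0], [8.0, 5.0]]))
```

The fitted tree recovers the split [𝑥1 ≤ 5] and predicts the two leaf means, roughly 1 and 4.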


3.3 Ensemble learning methods

Another way to address the overfitting problem is ensemble learning. The idea of an ensemble is to combine the predictions of several weak predictors into one model with high predictive ability (high accuracy). Prediction is then conducted by combining the results of all the weak predictors: for classification, the simple majority voting rule can be used; for regression, averaging.

In the modeling process, it is important to obtain weak predictors that are as different (minimally correlated) as possible. The main methods used to achieve this are the bootstrap and the random selection of a limited number of variables for each weak predictor. The bootstrap consists of selecting random observations from the common sample to train each weak predictor. With a bootstrap with replacement, the same observation can enter a model's training dataset several times.

3.3.1 Horizontal ensemble

During the horizontal ensemble, we train several weak predictors independently of

each other. One of the most popular examples of parallel ensemble is a random

forest (RF). A random forest is an ensemble of decision trees, each operating

independently, making a prediction as to where an example data entry belongs.

The forest aggregates the results and chooses the strongest prediction (Andy &

Matthew, 2002). The random forest algorithm, can be described as follows:

1. We draw 𝑛 bootstrap samples from the dataset;

2. For each sample we grow an unpruned decision tree based on 𝑘 features of the dataset;

3. We obtain predictions from the trees, which are then combined by voting (classification) or averaging (regression).

3.3.2 Vertical ensemble

During vertical ensembling, we train several weak learners consequently. In

general, sequential ensemble allows to obtain higher accuracy of predictions than

parallel trained models. However, that model loses in terms of speed, as due to the

fact of sequential fit it is impossible to parallel computation. This model is also

Page 20: Degree Project1451144/FULLTEXT01.pdfalgorithms might be used predict air quality, build those models and conduct a final comparison regarding accuracy, complexity and time costs Our

20

even more prone to overfitting, which requires use of regularization. One of the

most popular algorithms, gradient boosting, can be described as follows:

1. We get the initial model's error (the model can be, e.g., a decision tree or a linear regression): 𝑒1 = 𝑦 − ŷ1;

2. We estimate the error ê1 of a model in which the error from the first step is used as the dependent variable;

3. We sum the obtained prediction with the original one: ŷ2 = ŷ1 + ê1;

4. We get the new error: 𝑒2 = 𝑦 − ŷ2;

5. We repeat steps 2-4 until we overfit or the model's error becomes constant.

The most popular algorithms are the Gradient Boosting Machine (GBM) described above and Extreme Gradient Boosting (XGBoost), which uses shortcuts in the conventional algorithm to achieve faster computation at the expense of potential accuracy loss.
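Steps 1-5 can be sketched directly with shallow sklearn trees as the weak learners (synthetic data; the thesis uses the library GBM implementation rather than this hand-rolled loop):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=300)

# Step 1: initial weak model and its training error e1 = y - y1_hat
pred = DecisionTreeRegressor(max_depth=2).fit(X, y).predict(X)
mae = [np.mean(np.abs(y - pred))]

# Steps 2-4, repeated: fit the current residuals and add the correction
for _ in range(10):
    resid = y - pred
    pred = pred + DecisionTreeRegressor(max_depth=2).fit(X, resid).predict(X)
    mae.append(np.mean(np.abs(y - pred)))
```

Each round of residual fitting lowers the training error, which is exactly why a stopping rule (step 5) is needed to avoid overfitting.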

3.4 Artificial Neural Networks

3.4.1 Multilayer perceptron

The simplest version of a neural network is the multilayer feed-forward perceptron. It is simply defined as an input layer, an output layer, and several hidden layers. Each layer consists of multiple artificial neurons, which are tasked with feeding data forward to the next layer (Svozil et al., 1997). Figure 1 represents a simple neural network schematically.

Figure 1. Example of a feed-forward neural network with one hidden layer (Svozil et al., 1997)


Each node of the network is an artificial neuron, a mathematical model intended to emulate the role of a neuron in a physical brain. Each neuron consists of a set of inputs, some type of activation or transfer function, and an output (Svozil et al., 1997). The inputs, multiplied by weights and added to biases, are passed to the further layers (forward propagation). An example of an artificial neuron is shown in figure 2.

Figure 2. Artificial neuron schema (Svozil et al., 1997)

Several activation functions are used in practice; the most popular is the sigmoid. The process of training a neural network can be described by the following steps:

1. The network receives training data as its input, which through feed-forward propagation becomes a set of outputs;

2. The error is calculated (for a regression problem, usually the mean squared error);

3. Partial derivatives of the loss function are calculated with respect to the model's parameters;

4. The model parameters are tuned according to these derivatives (backpropagation).
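The forward pass and backpropagation steps above can be sketched for a single sigmoid neuron with numpy (the data, targets and learning rate are illustrative; real networks stack many such neurons and use a framework like Keras):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
X = rng.normal(size=(64, 3))
# Targets produced by a "true" neuron with known weights and bias
y = sigmoid(X @ np.array([1.0, -2.0, 0.5]) + 0.3)

w, b = np.zeros(3), 0.0
for _ in range(3000):
    out = sigmoid(X @ w + b)            # step 1: forward propagation
    err = out - y                       # step 2: error (drives the MSE gradient)
    grad = err * out * (1 - out)        # step 3: chain rule through the sigmoid
    w -= 0.5 * grad @ X / len(X)        # step 4: update weights...
    b -= 0.5 * grad.mean()              # ...and bias (backpropagation)

mse = np.mean((sigmoid(X @ w + b) - y) ** 2)
```

Gradient descent recovers the sign and rough scale of the generating weights, driving the mean squared error close to zero.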

3.4.2 LSTM neural network

A classical neural network is poorly suited to time series prediction, as it analyzes each datapoint separately and cannot carry information over time or any other sequence (e.g., language). This problem is resolved by recurrent neural networks (RNN), as they use the previous state as an additional input. Figure 3 represents a simple recurrent neural network schematically.

Figure 3. Recurrent neural network (3 units) (Olah)

However, if the input sequence is long, an RNN pays more attention to the later datapoints, while the memory of the older ones vanishes. This problem is overcome using Long Short-Term Memory (LSTM) neural networks, as they are able to learn long-term dependencies (Olah).

To gain this ability, an LSTM has three gates inside each unit. The most important idea behind the model is the cell state, an information "conveyor" running through the entire network. Information reaches the cell state through three gates. The forget gate decides what part of the information needs to be withdrawn from the cell state. The input gate decides which part of the information is going to enter the cell state. The output gate decides which part of the data is going to be output. An example of an LSTM unit is shown in figure 4.

Figure 4. LSTM neural network (3 units) (Olah)


An LSTM can be used to map a sequence to a scalar or a vector, i.e., to a single or multiple time steps. We train it using classical backpropagation, taking derivatives of the loss function once the actual and expected outputs have been compared; those derivatives are used to update the parameters inside the network's layers.

3.4.3 CNN neural network

The convolutional neural network (CNN) is the most popular image processing model (or at least the most popular building block of one). The key idea behind the CNN is using a set of filters to gradually learn more and more complex features (Stewart). A close analogy would be a flashlight gliding over an image. Using this "flashlight" in convolutional layers, we significantly reduce the number of trainable parameters. An example of a convolutional layer is given in figure 5.

Figure 5. The convolution operation (Stewart)

The latter is a major improvement over a classical neural network, which uses one input per pixel, so that processing even a moderate-resolution picture results in hundreds of thousands of trainable parameters. So, at first the CNN learns simple shapes or even shades; then it gradually learns more and more complicated features until, by the last layer, it can recognize a nose in the picture and even tell which animal it belongs to.


Figure 6. Simple CNN example (Stewart)

Figure 6 represents a simple CNN architecture. A CNN can also be applied to time series data: we glide a 1D filter over the sequence of observations, mapping it to an output in scalar or vector form, where the latter can represent a single or multiple time steps. Like any other neural network, a CNN is trained using backpropagation.


4. Modeling pollution level

This is the empirical part of the present research. We start by describing the data used and the preprocessing steps taken. We also discuss the metric chosen for model comparison and the feature engineering required to account for the temporal nature of our data. Finally, we build the models and compare them with the benchmark. We use Python 3.5 for our analysis; the modeling code is available in the appendix. For time series analysis we use the statsmodels library, machine learning models are built using the sklearn library, and neural networks are built with the Keras library operating on top of TensorFlow.

4.1 Data

We are using 5 years of hourly data on chemical (PM2.5) and meteorological

(temperature, relative humidity, solar radiation, wind speed and wind direction)

variables collected from a monitoring station located in Cuenca, Ecuador. As WRF-Chem provides daily observations, we downsample the data to daily observations using the mean as the aggregation function.

Unfortunately, our data has a lot of missing observations. Even worse, the dates for which observations are missing are not consistent across variables: e.g., wind speed and wind direction have no observations for the majority of 2016, while temperature is missing for the second half of 2015 and 2 months of 2017. Consequently, we cannot just drop missing observations, as this would reduce our dataset from 1518 to ~350 observations. Hence, we use interpolation.

For some variables (e.g., temperature) the best interpolation technique proved to be a spline of order 5; others (e.g., solar radiation) were best approximated by simple linear interpolation. The interpolation method was chosen by the criterion of best fit to the existing data. Possible shortcomings of this approach are discussed in the discussion section of the present paper.

Prior to modeling we need to clean outliers from our data. Some values can simply be false (for example, a negative pollution level), while some days show extremely high values of some variables, which can adversely affect the training process and result in a loss of accuracy.


To detect outliers, we use the interquartile range. The interquartile range (𝐼𝑄𝑅) is a measure of statistical dispersion calculated as the difference between the 75th and 25th percentiles (a percentile indicates the value below which a given percentage of observations falls). It is represented by the formula 𝐼𝑄𝑅 = 𝑄3 − 𝑄1. After calculating the 𝐼𝑄𝑅 for each variable, we limit the variable to the interval between 𝑄1 − 1.5𝐼𝑄𝑅 and 𝑄3 + 1.5𝐼𝑄𝑅. Table 1 shows the statistical description of the data.

Table 1. Statistical description of data

pm temp hum sol wind_dir vel_ms

count 1583 1583 1583 1583 1583 1583

mean 9.15 15.25 64.40 191.89 161.79 1.72

std 3.81 1.12 8.06 70.96 50.66 0.34

min 0.00 11.30 25.59 0.00 11.08 0.46

25% 6.57 14.42 59.45 130.56 129.70 1.60

50% 8.99 15.23 64.67 186.47 157.12 1.64

75% 11.51 15.98 69.24 244.39 189.01 1.92

max 21.40 19.11 91.12 472.04 307.62 3.02
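The clipping step described above can be sketched with numpy (the synthetic series only mimics the pm column's mean and standard deviation and is not the thesis data):

```python
import numpy as np

rng = np.random.default_rng(6)
pm = rng.normal(9.15, 3.81, size=1583)        # mimic the pm column's statistics
pm = np.append(pm, [120.0, -50.0])            # inject two implausible outliers

q1, q3 = np.percentile(pm, [25, 75])
iqr = q3 - q1
# Limit every value to the interval [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
pm_clipped = np.clip(pm, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```

After clipping, the injected extremes are pulled back to the interval boundaries while typical values are left untouched.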

Some statistical models are sensitive to differences in the magnitude of variables. For example, linear regression performs better if all the variables are scaled (or normalized); tree-based algorithms are in general less sensitive; for neural networks normalization is a must, as it allows faster convergence and better accuracy (Ali & Faraj, 2014). Since we are going to use a wide range of machine learning models, it is better to perform such a transformation on our data.

We use min-max normalization to ensure that all the variables have the same magnitude (contained between 0 and 1). Normalization is performed as shown in equation 14.


𝑥𝑛𝑜𝑟𝑚𝑎𝑙𝑖𝑧𝑒𝑑 = (𝑥𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 − 𝑥𝑚𝑖𝑛) / (𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛) (14)

It is worth mentioning that the data is quite erratic, without a clear trend and/or seasonality. This further nudges us toward machine learning, as those models are in general more suitable for modeling complicated non-linear dependencies. Figure 7 shows the erratic nature of the pollution level time series.

Figure 7. Pollution level time series

4.2 Methodology of modeling

Evaluation metric

To evaluate models’ results we are using mean absolute error (MAE) metric, used

a lot assessing regression problem. I have chosen absolute over root squared mean

error as our data is erratic and thereby, I do not want to inflict additional

punishment for outliers. Formula to calculate MAE is shown in equation 15.

𝑀𝐴𝐸 = (1/𝑛) ∑𝑖=1..𝑛 |𝑦𝑖 − ŷ𝑖| (15)

where 𝑦𝑖 is the observed value of the dependent variable, ŷ𝑖 is the predicted one, and 𝑛 is the size of the testing sample.


Time features

For time series modeling we just use our six time series, but to use machine

learning algorithms we need to add time features explicitly. I add 𝑚𝑜𝑛𝑡ℎ ,

day_of_week and week_of_year features (whole numbers) to account for weekly

and annual trend in data.

Then, for each variable I add values at 6am, 1pm and 3pm as those hours provide u

with the most representative concentrations of pollutants over the day. 6am is the

beginning of the morning peak, 1pm corresponds to the midday baseline and 3pm

is the beginning of evening peak. Then, for all the variables excluding 𝑚𝑜𝑛𝑡ℎ,

𝑤𝑒𝑒_𝑜𝑓_𝑦𝑒𝑎𝑟 and 𝑑𝑎𝑦_𝑜𝑓_𝑤𝑒𝑒𝑘 I add lagged values up to 5th lag (e.g. 2nd lag is a

value observed two days ago). This way we account for time nature of the data.
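A pandas sketch of this feature construction on a toy frame (the column names mirror those in the text; the 6am/1pm/3pm snapshot features are omitted for brevity):

```python
import numpy as np
import pandas as pd

# Toy daily frame; the real data has six series and 1500+ rows
idx = pd.date_range("2014-09-01", periods=10, freq="D")
df = pd.DataFrame({"pm": np.arange(10, dtype=float)}, index=idx)

# Calendar features as whole numbers
df["month"] = df.index.month
df["day_of_week"] = df.index.dayofweek
df["week_of_year"] = df.index.isocalendar().week.astype(int)

# Lagged values up to the 5th lag: pm_lag2 is the value observed two days ago
for lag in range(1, 6):
    df[f"pm_lag{lag}"] = df["pm"].shift(lag)
```

The first rows contain NaN lags, since no earlier observations exist; those rows are dropped before training.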

Walk forward validation

In order to make our evaluation robust, we use cross-validation. With 5-n cross-validation we split the dataset into 5 parts. We train our model on four of them and use the remaining part, not used in the training process, for validation. Once we get the MAE, we save it and repeat the process using the first, second, third and fifth parts for training and the fourth one for evaluation, getting another MAE (the process is repeated 5 times). This way we guarantee that our model has been evaluated on all the available data.

Unfortunately, this is not feasible for time series models. We could use ordinary cross-validation for the machine learning models that do not depend on the sequential structure, but for time series models the time structure is a requirement. So, we need a better evaluation approach, applicable both to time series models and to machine learning models.

We are going to use walk forward cross-validation. This approach requires two sliding windows, for the training and the test set; the approach is depicted schematically in figure 8. For each test set we calculate a separate MAE and then take the average of them. Sliding over the dataset provides robustness in evaluation and training.


Figure 8. 4-n walk forward cross-validation (Moudiki, 2020)
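One plausible sketch of the sliding windows in figure 8 (the window sizes follow the text's two-year training and 30-day test windows; the exact offsets used in the thesis code may differ):

```python
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size, n_folds):
    """Yield sliding (train, test) index windows, oldest fold first."""
    for k in range(n_folds):
        start = k * test_size
        train = np.arange(start, start + train_size)
        test = np.arange(start + train_size, start + train_size + test_size)
        if test[-1] >= n_obs:           # stop once the test window runs off the data
            break
        yield train, test

# e.g. a ~2-year (730-day) training window and a 30-day test window, 5 folds
splits = list(walk_forward_splits(n_obs=1583, train_size=730, test_size=30, n_folds=5))
```

Each fold's test window starts right where the training window ends, and both windows slide forward together by the test size.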

4.3 Modeling

The biggest problem with our data is that we only have the WRF-Chem prediction for September 2014, while our dataset stretches from 2014 to 2018. We overcome this issue by reversing our data, so the order is preserved but inverted. At the same time, models fit and tuned to predict specifically September 2014 risk lacking robustness. So, at first, we evaluate our models using walk forward cross-validation with a training window of two years. Then, we take the best models and compare their forecasts for September 2014 to the benchmark.

Time series analysis models

Prior to build our time series models, we conduct Augmented Dickey-Fuller with

constant and trend test for each sequence. Null hypothesis for the test states that

series are not stationary. Results are shown in table 3.

Table 3. Augmented Dickey-Fuller test results

Time series p-value conclusion

pollution level 8.8306e-06 series is stationary

temperature 0.0002 series is stationary

humidity 0.0 series is stationary

solar radiance 0.0306 series is stationary

wind speed 1.1e-09 series is stationary

wind direction 2.441e-06 series is stationary


As we can see, all our series are stationary, and we can proceed. We use a test window of 30 observations with 5-n walk forward cross-validation. For univariate analysis we use the SARIMA model, as changing its parameters allows us to test a broad scope of models, from AR to full SARIMA. For multivariate analysis we use the classical VAR model. The best configuration was chosen based on the Akaike information criterion after simple iteration over different lag orders. Results are shown in table 4.

Table 4. Time series models results

Model MAE

SARIMA (2,0,1) (0,0,0,12) 0.1475

VAR (9) 0.1237

Holt-Winters 0.1368

The vector autoregressive model of order 9 shows the best quality of fit, with an average MAE of 0.12.

LSTM models

Next, we fit the neural network models with 5-n walk forward cross-validation. We start with long short-term memory neural networks. First, we try different configurations to predict the pollution level one day ahead. The single channel model uses only historical data on the pollution level. Multichannel models use historical data on all the available time series (pollution level, humidity, solar radiance, temperature, wind direction and wind speed). Multichannel output models allow us to predict not only the target variable but all the series, like the VAR model does. Our architecture consists of LSTM layers with 50 units followed by a fully connected layer. Results are shown in table 5.

Table 5. LSTM model results for different configurations forecast 1 step ahead

Model MAE

LSTM single channel input, single channel output 0.0973

LSTM multichannel input, single channel output 0.0846

LSTM multichannel input, multichannel output 0.0939


As we can see, the LSTM using multichannel input to predict the pollution level one step forward has the lowest MAE. To predict more than one step, we reshape our data. For example, to build a model making predictions for a week, we reshape our data into [1583, 6, 7], adding a timestep dimension. This means that we feed our models chunks of data, each containing 7 time steps of the 6 time series. Results are available in table 6.

Table 6. LSTM model results for the broader forecast horizon

Model MAE

LSTM multichannel input, single channel output 5 days prediction 0.0851

LSTM multichannel input single channel output 7 days prediction 0.0883

LSTM multichannel input single channel output 10 days prediction 0.0989

LSTM multichannel input single channel output 30 days prediction 0.1483

Unfortunately, the MAE rises rather rapidly when predicting more than 10 steps forward, but one-week prediction is handled relatively well.

CNN models

Next, we fit convolutional neural networks with 5-n walk forward cross-validation.

First, we try different configurations to predict pollution level one step forward.

Single channel model uses only historical data on pollution level. Multichannel

models use historical data on all the available time series. Our CNN consists of a convolutional layer, followed by a max pooling layer, followed by a fully connected layer of 50 neurons. Results are presented in table 7.

Table 7. CNN model results for different configurations forecast 1 step ahead

Model MAE

CNN single channel input, single channel output prediction 0.0943

CNN multichannel input, single channel output prediction 0.0721

CNN multichannel input, multichannel output prediction 0.0947


The multichannel input, single channel output CNN shows a surprisingly good result. The next step in our analysis is to test this model's ability to predict over a broader time horizon. We use the same trick we used for the LSTM models. Results are available in table 8.

Table 8. CNN model results for the broader forecast horizon

Model MAE

CNN multichannel input, single channel output 5 days prediction 0.0781

CNN multichannel input single channel output 7 days prediction 0.0832

CNN multichannel input single channel output 10 days prediction 0.0894

CNN multichannel input single channel output 30 days prediction 0.1238

As we can see, the CNN outperforms the LSTM in the case of an extended prediction horizon as well: the MAE is systematically lower and degrades more slowly over time.

Machine learning models and artificial neural network

Machine learning models and the artificial neural network have no natural ability to predict several steps ahead. As our final goal is a model predicting a month ahead, we build a module containing 30 models, each predicting +1 day further forward. So, linear regression (1) predicts the pollution level one step ahead and linear regression (23) predicts the pollution level 23 steps ahead.

In order to train a model predicting 𝑡 + 𝑛 steps forward, we shift our target variable so that today's value of 𝑋 is mapped to the value of 𝑦 at the 𝑛th step. This trick allows us to build machine learning models for forecasting. It has its limitations, though: as we cannot shift the data endlessly, we need enough data to train our models on the intersection of 𝑋 and 𝑦.
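The shifting trick can be sketched in pandas (a toy frame; n = 3 stands in for the n-steps-ahead model):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10, dtype=float),
                   "y": np.arange(10, dtype=float) * 2})

# To train the model predicting n = 3 steps ahead, map today's features
# to the target observed 3 days later, then drop the rows with no target
n = 3
df["y_future"] = df["y"].shift(-n)
train = df.dropna(subset=["y_future"])
```

The last n rows are lost, which is exactly the limitation noted above: the farther ahead the model predicts, the fewer training rows remain.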

We use all available data (time features, time series and their lags) in the following analysis. For each machine learning model we conduct hyperparameter tuning by simple iteration over a grid of all possible combinations. Table 9 contains the average MAE of the 30 instances of each machine learning model. The MAE is calculated using 5-n walk forward cross-validation.


Table 9. Machine learning module results on the 30-steps-ahead forecast

Model MAE

Linear regression 0.1348

Ridge regression (alpha = 1) 0.1221

Lasso regression (alpha = 0) 0.1147

SVM regressor (C = 2, kernel = linear) 0.0929

RF regressor (max_depth = 10, n_estimators = 100, min_samples_leaf = 10) 0.0875

GBM regressor (max_depth = 10, n_estimators = 100, learning_rate = 0.1) 0.0872

XGB regressor (max_depth = 5, n_estimators = 100) 0.0958

ANN (input layer (144), hidden layer (200), hidden layer (100), hidden layer (50), output layer (1)) 0.1182

Table 10 contains information on all the MAEs, using the gradient boosting machine as an example. We can see that machine learning models are in general less prone than time series models and neural networks to degradation over an extended prediction horizon. Ensemble learning shows the best error, with the gradient boosting machine regressor dominating.

Figure 9 demonstrates the GBM regressor prediction for September. Normalization has been reversed; WRF-Chem’s MAE equals 2.05 and the GBM module’s MAE equals 1.89. So, we gained better accuracy predicting the pollution level for 27 time steps. We also incurred much lower computational costs: a WRF-Chem model may take a month to simulate a month of observations, while the GBM module took a little under 4 minutes to train on more than 4 years of observations and a third of a second to predict a month of data. Table 11 shows fitting and testing time for all the models used. For both groups of models the train/test split is roughly 720/30 observations.
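The fit and prediction times reported in Table 11 can be measured with a simple wall-clock timer. A minimal sketch (the model settings and random data below are stand-ins, not the thesis configuration):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(720, 10))      # roughly the 720-observation training window
y = rng.normal(size=720)
X_new = rng.normal(size=(30, 10))   # one month of daily steps to predict

model = GradientBoostingRegressor(n_estimators=100, max_depth=3)

t0 = time.perf_counter()
model.fit(X, y)
fit_time = time.perf_counter() - t0

t0 = time.perf_counter()
pred = model.predict(X_new)
predict_time = time.perf_counter() - t0

print(f"fit: {fit_time:.2f}s, predict: {predict_time:.4f}s")
```

Using `time.perf_counter` rather than `time.time` avoids clock-resolution artefacts when the prediction step takes only fractions of a second.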


Figure 9. Observed pollution level vs. WRF prediction vs. GBM prediction

Table 11. Fitting and forecasting time of various models

Model               Fit time    Prediction time
SARIMA              0.41s       0.12s
Holt-Winters        0.32s       0.09s
VAR                 0.49s       0.13s
LSTM                8.41s       0.41s
CNN                 4.05s       0.32s
Linear regression   2.53s       0.15s
Lasso regression    0.51s       0.07s
Ridge regression    0.32s       0.02s
SVM regressor       12.61s      0.14s
RF regressor        7.32m       0.73s
GBM regressor       3.53m       0.32s
LGBM regressor      2.12m       0.26s
XGB regressor       2.15m       0.21s
ANN model           16.17m      0.76s

[Figure 9: line chart of the level of PM2.5 (y-axis, 0–14) over time steps 1–27; series: Real, WRF, GBM regressor.]


5. Discussion and Conclusion

One of the biggest flaws of the present analysis is the quality of the data. Variables contain missing observations at different time steps, which makes the treatment of missing values even harder. The treatment we used (interpolation) can reduce (or increase) model performance, even though we tried to mitigate this using walk-forward cross-validation. With better data, we could achieve a fairer comparison between the various models and more robust results.
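The interpolation treatment mentioned above can be illustrated with pandas on a toy daily series (this is an illustration of the technique, not the Cuenca data):

```python
import numpy as np
import pandas as pd

# a daily series with gaps at different time steps
idx = pd.date_range("2016-01-01", periods=6, freq="D")
pm = pd.Series([3.0, np.nan, np.nan, 6.0, np.nan, 8.0], index=idx)

# linear interpolation fills each gap from its neighbouring observations
filled = pm.interpolate(method="linear")
print(filled.tolist())  # → [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

The filled values are smooth by construction, which is one reason interpolation can flatter (or hurt) a model's measured performance.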

We conducted a comprehensive comparison of various statistical models against the deterministic benchmark represented by the WRF-Chem simulation. We used five years of daily observations of meteorological variables, wind speed, wind direction and PM2.5 level from Cuenca, Ecuador. We conclude that the best accuracy for next-day prediction was achieved by the Convolutional Neural Network model, but with a broader forecast horizon its accuracy drops rapidly. So, we also value the ability to predict for at least a month. With that goal we built a module consisting of thirty GBM models and beat the benchmark, making a whole-month prediction using much less computational resources. Nevertheless, the comparison cannot be considered entirely fair, as WRF-Chem still has two advantages that our models lack: the ability to predict over a much broader time horizon and the ability to predict not only the pollutant level but also meteorological conditions.


6. References

Ahmadov, R. (2016), WRF-Chem: Online vs Offline Atmospheric Chemistry

Modeling. In ASP Colloquium; National Center for Atmospheric Research

(NCAR): Boulder, CO, USA.

Ali, P. & Faraj, R. (2014), Data Normalization and Standardization: A Technical

Report. Machine Learning Technical Reports, 1(1), 1-6.

Liaw, A., and Wiener, M., (2002). Classification and Regression by randomForest. R News, 2/3, 18-22.

Armstrong, J., (2002). Principles of Forecasting: A Handbook for Researchers and Practitioners. Kluwer Academic Publishers.

Athanasopoulos, G., Hyndman, R. J., Kourentzes, N., & Petropoulos, F. (2017).

Forecasting with temporal hierarchies. European Journal of Operational

Research, 262(1), 60–74.

Basak, D., Pal, S., Patranabis, D. (2007). Support Vector Regression. Neural

Information Processing – Letters and Reviews. Vol. 11, No. 10, 203-224.

Bougoudis, I.; Demertzis, K.; Iliadis, L. (2016) HISYCOL a hybrid computational

intelligence system for combined machine learning: The case of air

pollution modeling in Athens. Neural Comput. 27(5). DOI:

10.1007/s00521-015-1927-7

Bruckman L. (1993) Overview of the Enhanced Geocoded Emissions Modeling

and Projection (Enhanced GEMAP) System. In proceeding of the Air &

Waste Management Association's Regional Photochemical Measurements

and Modeling Studies Conference, p 562, San Diego, CA.

Brunelli, U., V. Piazza, L. Pignato, F. Sorbello, and S. Vitabile. (2007). Two-days

Ahead Prediction of Daily Maximum Concentrations of SO2, O3, PM10,

NO2, CO in the Urban Area of Palermo, Italy. Atmospheric Environment

41.14, 2967-995.

Carnevale, C., Finzi, G., Pederzoli, A., Turrini, E., Volta, M. (2016). Lazy

Learning based surrogate models for air quality planning. Environ. Model.

Softw. 83, 47–57.

Chang J., Jin S., Li Y., Beauharnois M., Chang K., Huang H., Lu C., and Wojcik

G. (1996) The SARMAP Air Quality Model, SARMAP Final Report.

Chen, J., Chen, H., Wu, Z., Hu, D., Pan, J. (2017). Forecasting smog-related health

hazard based on social media and physical sensor. Inf. Syst., 64, 281–291.

Coats C.J. (1996) High performance algorithms in the sparse matrix operator

kernel emissions modeling system. Proceedings of the Ninth Joint

Conference on Applications of Air Pollution Meteorology of the American

Meteorological Society and the Air and Waste Management Association,

Atlanta, GA.

Deep Learning 4 Java. A Beginner’s Guide to Recurrent Networks and LSTMs.

available online at: https://deeplearning4j.org/lstm.html [Visited 20-5-

2020]


Elangasinghe, M.A.; Singhal, N.; Dirks, K.N.; Salmond, J.A. (2014). Development

of an ANN–based air pollution forecasting system with explicit knowledge

through sensitivity analysis. Atmos. Pollut. Res.5, 696–708.

Finardi, Sandro & De Maria, Roberta & D'Allura, Alessio & Cascone, C. &

Calori, G. & Lollobrigida, Francesco. (2008). A deterministic air quality

forecasting system for Torino urban area, Italy. Environmental Modelling

& Software. 23. 344-355.

Gardner, M., and S. R. Dorling. (1999). Neural Network Modelling and Prediction

of Hourly NOx and NO2 Concentrations in Urban Air in London.

Atmospheric Environment 33.5, 709-19.

Grell G., Dudhia J., and Stauffer D. (1994) A description of the fifth-generation

Penn State/NCAR mesocale model (MM5). Prepared by National Center

for Atmospheric Research, Boulder, CO, NCAR Technical Note-398

Guidelines for Developing an Air Quality (Ozone and PM2.5). (2003) Forecasting

Program. U.S. Environmental Protection Agency.

Holt, C. (1957). Forecasting seasonals and trends by exponentially weighted

averages (O.N.R. Memorandum No. 52). Carnegie Institute of Technology,

Pittsburgh USA.

Joseph, A., & David, S. (2006). Applications of Machine Learning in Cancer

Prediction and Prognosis. Departments of Biological Science and

Computing Science, University of Alberta Edmonton.

Kleine, J., Zalakeviciute, R., Gonzalez, M., Rybarczyk, Y., (2017). Modeling

PM2.5 urban pollution using machine learning and selected meteorological

parameters. J. Electr. Comput. Eng, 2017(5), 1-14.

LexPredict, LLC. A decision tree case study. Available online at:

https://www.lexpredict.com/2017/02/case-study-decision-tree/ [Visited 20-

5-2020]

Liu, B., Binaykia, A., Chang, P., Tiwari, M., Tsao, C., (2017) Urban air quality

forecasting based on multidimensional collaborative Support Vector

Regression (SVR): A case study of Beijing-Tianjin-Shijiazhuang. PloS

ONE, 12(7), e0179763.

Lurmann F.W. (2000) Simplification of the UAMAERO Model for seasonal and

annual modeling: the UAMAERO-LT Model. Report prepared for South

Coast Air Quality Management District, Diamond Bar, CA by Sonoma

Technology, Inc., Petaluma.

Martínez-España, R.; Bueno-Crespo, A.; Timón, I.; Soto, J.; Muñoz, A. Cecilia,

J.M. (2018). Air-pollution prediction in smart cities through machine

learning methods: A case of study in Murcia, Spain. J. Univ. Comput. Sci.,

24, 261–276.

Monteiro, A.; Lopes, M.; Miranda, A.I.; Borrego, C.; Vautard, R. (2005). Air

pollution forecast in Portugal: A demand from the new air quality

framework directive. Int. J. Environ. Pollut. 25, 4–15

Moudiki, T. (2020). Time series cross-validation using crossval. R-bloggers


Nieto, P.J.G.; Antón, J.C.Á.; Vilán, J.A.V.; García-Gonzalo, E. (2015). Air quality

modeling in the Oviedo urban area (NW Spain) by using multivariate

adaptive regression splines. Environ. Sci. Pollut. Res. 22, 6642–6659

Odman T. and Ingram C.L. (1996) Multiscale Air Quality Simulation Platform

(MAQSIP): source code documentation and validation. MCNC Technical

Report.

Olah, C. Understanding LSTM Networks. Colah’s blog. available online at:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/ [Visited 20-

5-2020]

Pérez, P., Trier, A., & Reyes, J. (2000). Prediction of PM2.5 Concentrations

Several Hours in Advance Using Neural Networks in Santiago, Chile.

Atmospheric Environment 34.8 : 1189-196.

Philibert, A.; Loyce, C.; Makowski, D. (2013). Prediction of N2O emission from

local information with Random Forest. Environ. Pollut. 177, 156–163.

Pielke R.A., Cotton W.R., Walko R.L., Tremback C.J., Lyons W.A., Grasso L.,

Nicholls M.E., Moran M.D., Wesley D.A., Lee T.J., and Copeland J.H.

(1992) A comprehensive meteorological modeling system - RAMS.

Meteor. Atmos. Phys. 49, pp. 69-91.

Ritter, M., Müller, M.D., Tsai, M.-Y., Parlow, E. (2013). Air pollution modeling

over very complex terrain: An evaluation of WRF-Chem over Switzerland

for two 1-year periods. Atmos. Res. 132–133, 209–222.

Rybarczyk Y., Zalakeviciute R. (2018) Machine learning approaches for outdoor

air quality modelling: a systematic review. Applied Sciences, 8(12), 2570.

Saniya, A, Nithin S, Venkata S., Aditya Pai H. (2018) Prediction of urban air

pollution by a machine learning method. International journal of advance

research, ideas and innovations in technology. 4(3), 1072-1077.

Sayegh, A.S.; Munir, S.; Habeebullah, T.M. (2014). Comparing the Performance

of Statistical Models for Predicting PM10 Concentrations. Aerosol Air

Qual. Res., 10, 653–665.

Shimadera, H.; Kojima, T.; Kondo, A. (2016). Evaluation of Air Quality Model

Performance for Simulating Long-Range Transport and Local Pollution of

PM2.5 in Japan. Adv. Meteorol. 2016(2), 1-13.

Singh, K.P.; Gupta, S.; Rai, P. (2013). Identifying pollution sources and predicting

urban air quality using ensemble learning methods. Atmos. Environ. 80,

426–437.

Stein A.F., Lamb D., and Draxler R.R. (2000) Incorporation of detailed chemistry

into a threedimensional Lagrangian-Eulerian hybrid model: Application to

regional tropospheric ozone. Atmos. Environ. 34, pp. 4361-4372.

Stewart, M. Simple Introduction to Convolutional Neural Networks. available

online at: https://towardsdatascience.com/simple-introduction-to-

convolutional-neural-networks-cdf8d3077bac [Visited 20-5-2020]


Suárez, S.; García Nieto, P., Riesgo Fernández, P., del Coz Díaz, J.J.; Iglesias-

Rodríguez, F., (2011) Application of an SVM-based regression model to

the air quality study at local scale in the Avilés urban area (Spain). Math.

Comput. Model. Vol 54, 1453-1466.

Svozil, D., Kvasnicka, V., and Pospichal, J. (1997). Introduction to multi-layer

feed-forward neural networks. Chemometrics and Intelligent Laboratory

Systems(39):43-62.

Dietterich, T. G. (2000). An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning(40):139-157.

U.S. Environmental Protection Agency (1992) User's guide for the urban airshed

model. Volume IV: User's manual for the emissions preprocessor system

2.0. Office of Air Quality Planning and Standards, Research Triangle

Park, NC.

Vong, C., Ip, W., Wong, P., Yang, J. (2012). Short-term prediction of air pollution

in Macau using support vector machines. J. Control Sci.

https://doi.org/10.1155/2012/518032

WHO. (2014). Million Premature Deaths Annually Linked to Air Pollution.

Available online at:

https://www.who.int/mediacentre/news/releases/2014/air-

pollution/en/#.WqBfue47NRQ.mendeley [Visited 20-5-2020]

Winters, P. (1960). Forecasting sales by exponentially weighted moving averages.

Management Science, 6, 324–342

WRF, (2017). Weather research and forecasting model. http://wrf-model.org

Zalakeviciute R., Gonzalez M., Rybarczyk Y. (2017) Modeling PM2.5 urban

pollution using machine learning and selected meteorological parameters.

Journal of Electrical and Computer Engineering.

Zalakeviciute R., López-Villada J., Rybarczyk Y. (2018) Contrasted effects of

relative humidity and precipitation on urban PM2.5 pollution in high

elevation urban areas. Sustainability, 10(6), 2064.

Zhan, Y., Luo, Y., Deng, X., Grieneisen, M., Zhang, M., Di, B. (2018)

Spatiotemporal prediction of daily ambient ozone levels across China using

random forest for human exposure assessment. Environ. Pollut. 233, 464-

473.

Zhang, P. (2003) Time series forecasting using a hybrid ARIMA and neural

network model. Neurocomputing. 2003(1), 159-175.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic

net. Journal of the Royal Statistical Society, Series B. 67, 301–320.


7. Appendix

This section contains the code, written in the Python 3.5 programming language, used for the empirical part of the present research.

def fix_outliers(ts_data):
    # clip values outside the [Q1 - 2*IQR, Q3 + 2*IQR] range
    Q1, Q2, Q3 = ts_data.quantile([0.25, 0.5, 0.75])
    IQR = Q3 - Q1
    min_val = Q1 - 2*IQR
    max_val = Q3 + 2*IQR
    return ts_data.clip(lower=min_val, upper=max_val)

def add_lags(df, start_column, n):
    # add lags 1..n-1 for every column from start_column onwards
    for col in df.iloc[:, start_column:]:
        for i in range(1, n):
            df[col + '_lag_' + str(i)] = df[col].shift(i)
    return df

def normalize_table(data):
    # min-max normalization of every column to [0, 1]
    for col in data:
        data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())
    return data

def X_y_split(data):
    X = data.drop(['pm', 'pm_6', 'pm_13', 'pm_15'], axis=1)
    y = data['pm']
    return X, y

def ts_data_shift(X, y, i):
    # align features at time t with the target at time t + i
    X_shifted = X.iloc[:-i]
    y_shifted = y.shift(-i).dropna(axis=0)
    return X_shifted, y_shifted

def deseason_ts(ts):
    result = {}
    decomp = seasonal_decompose(ts, model="additive", period=365)
    result['resid'] = decomp.trend + decomp.resid
    result['seasonal'] = decomp.seasonal['2016-01-01':'2016-12-31']
    return result

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform ADFuller to test for Stationarity of given series and print report"""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic': round(r[0], 4), 'pvalue': round(r[1], 10),
              'n_lags': round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    def adjust(val, length=6): return str(val).ljust(length)
    # Print Summary
    print(' Augmented Dickey-Fuller Test on "{}"'.format(name), "\n ", '-'*47)
    print(' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(' Significance Level = {}'.format(signif))
    print(' Test Statistic = {}'.format(output["test_statistic"]))
    print(' No. Lags Chosen = {}'.format(output["n_lags"]))
    for key, val in r[4].items():
        print(' Critical value {} = {}'.format(adjust(key), round(val, 3)))
    if p_value <= signif:
        print(" => P-Value = {}. Rejecting Null Hypothesis.".format(p_value))
        print(" => Series is Stationary.")
    else:
        print(" => P-Value = {}. Weak evidence to reject the Null Hypothesis.".format(p_value))
        print(" => Series is Non-Stationary.")

# shift the target for each horizon and train one model per step
for i in range(1, 30):
    X_train, y_train = X_y_split(data)
    X_train, y_train = ts_data_shift(X_train, y_train, i)

models = [("LinearRegression", LinearRegression()),
          ("Ridge", Ridge(random_state=seed)),
          ("Lasso", Lasso(random_state=seed)),
          ("SVR", SVR()),
          ("RF", RandomForestRegressor(random_state=seed)),
          ("ET", ExtraTreesRegressor(random_state=seed)),
          ("BR", GradientBoostingRegressor(random_state=seed)),
          ("LGBM", LGBMRegressor(random_state=seed)),
          ("XGB", XGBRegressor(seed=seed))]

scoring = 'neg_mean_absolute_error'
n_folds = 5
results, names = [], []

# walk-forward split: expand the training window one observation at a time
for i in range(n_train, n_records):
    train, test = X[0:i], X[i:i+1]
    print('train=%d, test=%d' % (len(train), len(test)))

# score each model (the cross-validation call was missing from the original
# listing; it is reconstructed here with scikit-learn's cross_val_score)
for name, model in models:
    cv_results = cross_val_score(model, X_train, y_train,
                                 cv=TimeSeriesSplit(n_splits=n_folds),
                                 scoring=scoring)
    names.append(name)
    results.append(cv_results)
    print('{}: {} (+/- {})'.format(name, cv_results.mean(), cv_results.std()))


class DirectForecastModel:
    def __init__(self, model, n_steps):
        self.model = model
        self.n_steps = n_steps

    def fit_predict(self, data):
        self.prediction = {}
        self.X, self.y = X_y_split(data)
        X_pred = self.X.iloc[0, :]
        self.dict_of_models = {}
        # one direct model per forecast horizon, each trained on a shifted target
        for i in range(1, self.n_steps + 1):
            X_train, y_train = ts_data_shift(self.X, self.y, i)
            model_fit = self.model.fit(X_train, y_train)
            self.prediction[i] = model_fit.predict(X_pred.values.reshape(1, 144))
        return [i[0] for i in self.prediction.values()]

model = VAR(data_var_train, freq='D')

var = model.fit(maxlags=20, ic='aic')

y_hat = var.forecast(y=data_var_test.values, steps=data_var_test.shape[0])

model = ExponentialSmoothing(train, seasonal='add', seasonal_periods=365).fit()

pred = model.predict(start=test.index[0], end=test.index[-1])

stepwise_fit = pmdar.auto_arima(sol0_train,
                                start_p=1, max_p=10,
                                start_q=1, max_q=10,
                                start_P=0, max_P=10,
                                start_Q=0, max_Q=10,
                                m=12, d=None, D=None,
                                seasonal=True, trace=True,
                                error_action='ignore',
                                suppress_warnings=True, stepwise=True)
stepwise_fit.summary()

X = data.drop(['pm', 'pm_6', 'pm_13', 'pm_15'], axis=1).values

Y = data['pm'].values

def deeper_model():
    # 144 -> 200 -> 100 -> 50 -> 1 fully connected network
    model = Sequential()
    model.add(Dense(144, input_dim=144, kernel_initializer='normal', activation='relu'))
    model.add(Dense(200, kernel_initializer='normal', activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(50, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_absolute_error', optimizer='adam')
    return model


estimator = KerasRegressor(build_fn=deeper_model, epochs=100, batch_size=5,
                           verbose=0)
kfold = KFold(n_splits=10)
results = cross_val_score(estimator, X, Y, cv=kfold)
print("Deeper: %.2f (%.2f) MAE" % (results.mean(), results.std()))

def split_sequences(sequences, n_steps_in, n_steps_out):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps_in
        out_end_ix = end_ix + n_steps_out
        if out_end_ix > len(sequences):
            break
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix:out_end_ix, :]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

model = Sequential()
model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
model.fit(X_train, y_train, epochs=50, batch_size=100,
          validation_data=(X_test, y_test), verbose=0, shuffle=False)

n_steps_in, n_steps_out = 30, 10
data_cnn = data[['temp', 'hum', 'sol', 'pm']]
X, y = split_sequences(data_cnn.values, n_steps_in, n_steps_out)
n_train_days = 1000
n_output = y.shape[1] * y.shape[2]
y = y.reshape((y.shape[0], n_output))
n_features = X.shape[2]
X_train, X_test = X[:n_train_days], X[n_train_days:]
y_train, y_test = y[:n_train_days], y[n_train_days:]
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=2, activation='relu',
                 input_shape=(n_steps_in, n_features)))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(n_output))
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, batch_size=100, verbose=0)