
Time Series Forecasting of House Prices: An evaluation of a Support Vector Machine and a Recurrent Neural Network with LSTM cells

BACHELOR’S THESIS IN STATISTICS

Uppsala University
Department of Statistics

Authors: Fredrik Hansson and Jako Rostami

Supervisor: Sebastian Ankargren

Examiner: Professor Johan Lyhagen

24 May 2019


Abstract

In this thesis, we examine the performance of different forecasting methods. We use data on monthly house prices from the larger Stockholm area and the municipality of Uppsala between 2005 and early 2019 as the time series to be forecast. First, we compare the performance of two machine learning methods, the Long Short-Term Memory and the Support Vector Machine. The two methods' forecasts are compared, and the model with the lowest forecasting error, measured by three metrics, is chosen to be compared with a classic seasonal ARIMA model. We find that the Long Short-Term Memory method is the better performing machine learning method for a twelve-month forecast, but that it still does not forecast as well as the ARIMA model over the same forecast period.

Keywords: machine learning, cross-validation, seasonality, sliding window, sequential model, supervised learning


Table of Contents

Abstract
List of Tables
List of Figures
Terminology
1 Introduction
2 Research question
   2.1 Research question
   2.2 Data
   2.3 Restrictions from a macroeconomic point of view and statistical motivation
3 Literature Review
4 Methodology
   4.1 Support Vector Machine
   4.2 Recurrent Neural Network
      4.2.1 Background
      4.2.2 A basic Recurrent Neural Network
      4.2.3 LSTM
   4.3 SARIMA
   4.4 Cross-validation and evaluation metrics
   4.5 Hyperparameter tuning
   4.6 Model building approach
   4.7 Software and computational setup
5 Empirical findings
   5.1 Data exploration
   5.2 Results
      5.2.1 Support Vector Machine
      5.2.2 Long Short-Term Memory
      5.2.3 SARIMA
      5.2.4 Machine learning model comparison and model choice
      5.2.5 Best machine learning model vs. SARIMA
6 Discussion & Conclusion
   6.1 Discussion
   6.2 Conclusion
7 References
Appendices
   A Figures
   B Kernels


List of Tables

Table 5.1: Hyperparameters and respective kernel for Uppsala.
Table 5.2: Hyperparameters and respective kernel for the larger Stockholm area.
Table 5.3: LSTM hyperparameters and RMSE for both areas.
Table 5.4: Optimal hyperparameters for seasonal ARIMA and their respective cross-validation RMSE and overall AIC.
Table 5.5: Model comparison of LSTM and SVM.
Table 5.6: Model comparison of LSTM and SARIMA.


List of Figures

Figure 4.1: A graph of a simple SVM with slack variables.
Figure 4.2: A flowchart of a basic artificial neural network with a single hidden layer and binary outputs.
Figure 4.3: A simple RNN with an input layer, a single hidden layer, and an output layer and its recurrent architecture.
Figure 4.4: An LSTM cell with visual representations of its internal structure and the recurrent connections.
Figure 5.1: House prices of the larger Stockholm area and Uppsala from January 2014 till April 2019. Source: Svensk Mäklarstatistik.
Figure 5.2: Seasonal plot of houses in the larger Stockholm area.
Figure 5.3: Seasonal plot of houses in Uppsala.
Figure 5.4: Cross-validation with out-of-sample predictions for the larger Stockholm area.
Figure 5.5: Cross-validation with out-of-sample predictions for Uppsala.
Figure 5.6: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of the larger Stockholm area.
Figure 5.7: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of Uppsala.


Terminology

Abbreviations

NN      Neural Network
ANN     Artificial Neural Network
CNN     Convolutional Neural Network
DNN     Deep Neural Network
RNN     Recurrent Neural Network
LSTM    Long Short-Term Memory
ARIMA   Autoregressive Integrated Moving Average
SARIMA  Seasonal Autoregressive Integrated Moving Average
RBF     Radial Basis Function
MAPE    Mean Absolute Percentage Error
SMAPE   Symmetric Mean Absolute Percentage Error
RMSE    Root Mean Square Error
SVM     Support Vector Machine
SVR     Support Vector Regression

Vocabulary in machine learning

The words that occur frequently in this paper and in the machine learning community are given a terminology translation in Table 1 below. For statistical terms, the reader is expected to be familiar with them prior to reading this paper.

Table 1: Terminology glossary

Machine learning    Statistics
Back-propagation    Chain rule of partial derivatives
Epochs              Number of times the regression coefficients get updated using one or more observations
Hyperparameter      Model parameter independent of the data
Input/Feature       Independent variable
Output              Dependent variable
Training            Fitting
Training set        In-sample set
Test set            Out-of-sample set


1 Introduction

When looking at a time series, the question of what happens next often comes up: it is of interest to be able to predict the future of the series. There are many ways to make predictions and many models to choose from when making forecasts. A time series is made up of quantitative observations of one or more measurable characteristics of an individual entity, taken at multiple points in time. It is often characterized by trend, seasonality, stationarity, and auto-correlation (Avishek and Prakash, 2017). Interest rates, stock prices, weather indices, and population over time are all examples of possible time series to analyse. In this paper, forecasting future house prices is of interest. Since a lot of money is involved in the housing market, being able to predict future prices is of great interest to many, and how to do this well is an active research question (Yu et al, 2018).

A common method for forecasting time series is the ARIMA model, and it will be used in this paper. In addition to the ARIMA model, we will also employ two machine learning methods to make forecasts. Machine learning is an interdisciplinary field that shares common threads with the mathematical fields of statistics, information theory, game theory, and optimization. Given the emerging research in the machine learning field over the last two decades, machine learning models have established themselves as serious contenders to classical statistical models in the forecasting community (Bontempi et al, 2013). Time series data is generated by a process that is, most of the time, unknown to the analyst. According to Alpaydin (2010), the niche of machine learning is detecting patterns and regularities under the assumption that identifying the complete process may not be possible. Machine learning is considered to be born out of the idea that instead of programming machines for specific tasks, they should be able to learn by themselves (Khan et al, 2019).

Machine learning methods can be categorized into methods using supervised and unsupervised learning techniques (Ghatak, 2017). Supervised learning techniques make predictions on the data, where a supervisor compares the correct answers to the predicted answers. Unsupervised learning techniques gather information about the structure of the data: they explore the structure of the data and shed light on it. This paper explores two machine learning methods, both of which use supervised learning. The two methods are the Support Vector Machine (SVM) and the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). The RNN with LSTM will be referred to as LSTM in this paper.


In this paper, we will present our research question and the data we are going to use, followed by the previous literature that shaped the paper. We will explain how the SVM, LSTM, and ARIMA models work, how we build the models, and how we find the best hyperparameters to use. Our results will then be presented, first comparing the SVM and LSTM models and then the better of those two models with the ARIMA model. We will then discuss the results and the strengths and weaknesses of the models.


2 Research question

2.1 Research question

For this paper, we have real-world data on house prices in Sweden coming from Svensk Mäklarstatistik. The structure of the thesis is as follows: the data provided is first analyzed visually, and thereafter the two supervised machine learning models are applied to the time series data. The forecasting performance of the two machine learning models is then evaluated and compared to see which one performs better. Once the best performing machine learning model has been chosen, it is compared to a seasonal ARIMA model, and the forecasting performance of both final models is evaluated and compared to determine whether the classic model or the machine learning model is better.

• Which of the chosen machine learning techniques is better at forecasting house prices, and is it better than the classical ARIMA time series model?

2.2 Data

Svensk Mäklarstatistik has provided the dataset for this paper. Svensk Mäklarstatistik is a company that collects data on estate sales through realtors. Their services provide realtors, media, firms, government agencies, and universities with data and statistics (Svensk Mäklarstatistik, 2019). The dataset we have received contains the average final prices that detached houses were sold for. We have data for the larger Stockholm area and the Uppsala municipality. The data is monthly and ranges from January 2005 to March 2019 with no missing values.

2.3 Restrictions from a macroeconomic point of view and statistical motivation

Many different variables could be useful for predicting future house prices in addition to time and previous values. However, this paper is not concerned with finding the best predictor of future house prices. In 2018, Yu et al compared an LSTM with 13 characteristic variables and time variables to an LSTM based only on time. The percentage error for the LSTM with characteristic and time variables


was 10.35%, while for the LSTM based on time only it was 2%. This motivates us to drop the macroeconomic part of house prices and focus solely on the temporal part of the univariate time series data. For these reasons, we chose not to add any variables other than time and past prices to the models.

3 Literature Review

When the data is complex, neural networks are useful because they are able to learn any function with a single hidden layer, where the hidden layer can be considered a black box that performs calculations. Because of this, they are known as universal function approximators (Maini and Sabri, 2017). RNNs, moreover, have a built-in memory such that they can keep track of what happened historically even when it is not all visible at once. Any neural network that involves recurrence can be considered an RNN (Goodfellow et al, 2016).

Yu et al (2018) designed several models to evaluate their forecasting performance using monthly housing price trends to predict future housing prices in Beijing. They found that a Convolutional Neural Network (CNN) had a lower percentage error than an LSTM when accounting for house characteristic variables, and that an LSTM was better when applied to the time series depending on time only. Noticeable was their seasonal ARMA with a percentage error of 3.1% and their LSTM-1 model with a percentage error of 2%. Continuing with China, the identification of real estate cycles can be seen as a recurrent cycle but with irregular fluctuations in the rate of the total return for all properties of real estate, according to Zhang et al (2015). When the authors introduced neural networks into identifying real estate cycles in China, they used an Artificial Neural Network (ANN) to determine cycles on the Chinese real estate market. They trained on data from 1993-2008 and forecast the years 2009-2011. In 2009 China's real estate market reached its peak, went into recession in 2010, and reached its trough in 2011. They found that the neural network's performance is generally consistent with market reality. China has cyclical characteristics and fluctuates more frequently after 2008, with economic events in 2009 and 2011 explaining the cyclical fluctuations. In their conclusion, they suggest that applications of neural networks can be enriched by international comparisons of real estate cycles, and that whether a neural network can examine global real estate cycles is a potential research area.

Farman, Khan, and Jan (2018) provide an insight into how deep learning in neural networks can be used for finance, given the irregular and complex behaviour of


the economic domain with a large number of factors. The authors mention that one type of neural network that is suitable for detecting patterns in time series is the RNN, which can retain information over a longer time period, enabling it to recognize patterns in sequential data; specifically, it was developed to recognize patterns in time series data. Other neural networks are not suitable for sequential data where the sequence is time-dependent. This makes the RNN a preferable choice for time series compared to CNNs and DNNs, as the latter are not suitable for sequential data. One example of neural networks applied to time series models is the paper by Kohzadi et al (1996). The authors present a neural network that can account for non-linear relationships and compare it to ARIMA models, which cannot. They used monthly live cattle and wheat prices from 1950 through 1990 and repeated the experiment seven times for successive three-year periods. The neural network model showed 27 percent and 56 percent lower mean squared error than the ARIMA model. The absolute mean error and the mean absolute percent error were also lower for the neural networks compared to the ARIMA models. In their conclusion, the authors mention that one of the reasons the neural networks performed better may be non-linear or chaotic behaviour in the data, which cannot be fully captured by the ARIMA model. Further, they claim that because they used past prices, the methods can be applied to other forecasting problems such as stocks or other financial prices.

In a paper by Amrani and Zaytar (2016), the LSTM model was used for time series weather prediction with the final goal of producing two types of models per city in Morocco. In total, nine cities in Morocco were used for forecasting 24 and 72 hours of weather, with about 15 years of hourly meteorological data to train the neural network. While the authors mention the chaotic nature of the atmosphere and the massive computational power required, their results showed that LSTM neural networks can be considered a better alternative for forecasting general weather conditions. In their conclusion, they claim that the success of their model suggests it could be used on other weather-related problems.

In addition to considering the nature of the data, Walid and Alamsyah (2017), in their work with RNNs for forecasting time series, mention that RNNs are often used to predict and estimate issues related to electricity and can describe the cause of the swelling of electrical load experienced by the Indonesian state-owned electricity company. The authors highlight the necessity of electricity to humans and the issues the Indonesian state-owned electricity company has with not being able to provide continuous power to its customers. While they used RNNs for forecasting time series, the nature of their data differed from that of the previously mentioned papers. The study was a comparison between different RNN models and not a comparison of neural networks to other methods. The previously mentioned studies have a few things in common: irregular patterns in their data, time-dependent data, and NNs that can compute the


underlying structure of their data. This tells us that NNs can learn any function.

So far we have mentioned a few papers regarding neural networks and their applications in fields with random to cyclical behaviour, such as electricity consumption, economics and finance, and meteorological data. These articles are cited to highlight the complexity that neural networks can handle and how they perform in forecasting. In a paper by Ahmed et al (2010), the authors compared eight machine learning models for time series forecasting on monthly time series competition data. The competition data is used as a benchmark for testing and comparing forecasting methods. One model they used is the Support Vector Machine Regression, also called Support Vector Regression, which is also used in this thesis. They used lagged values, differencing, and moving averages to compare overall performance among the eight different models, each considered in its basic form. According to the authors, their findings suggest that some models are better than others, but the performance results vary: some time series favor some models more than others, and preprocessing the data can also have a significant impact on performance. The Support Vector Machine is one of the top models in classification but did not perform as well on regression for time series; it is at the top for some time series, while for others it has a high percentage error.

Levis and Papageorgiou (2005) presented a paper on customer demand forecasting using Support Vector Regression on time series data. They use a three-step algorithm with the last step being recursive, constructed from semi-deterministic past data. The three-step algorithm predicts one future value to be used as the most recent value for the following prediction. That is, each predicted value is only used for one following time period, but it also replaces actual past demand in a recursive manner for the next predicted time period, because there is no actual past demand data point for the first predicted time period. Their model performs well on non-linear data and their prediction accuracy, measured as 100 - MAPE, is over 93% for all their examples. According to the authors, their model not only avoids overfitting but also captures the underlying pattern of their customer demand data.

Vapnik et al (1999) used Support Vector Regression for time series prediction, comparing it to a Radial Basis Function network on generated data from the Santa Fe competition. The SVR performed better than the RBF network on stationary data but not on non-stationary data. The authors mention that non-stationarities should be handled before the actual prediction. In the paper by Hao and Yu (2006), non-stationary financial time series were used to predict the next 40 time periods based on 100 training examples from a dataset of the Shanghai Stock Exchange. They used a modified SVR that penalized recent errors more than distant errors, and by doing so they obtained lower RMSE and MAE. The authors claim that their modified SVR performs better


than a standard SVR and traditional time series forecasting models when it comes to forecasting the stock composite. However, they do not present a comparison between their model and the models they claim superiority over, and they ignore the non-stationarity issue mentioned by Vapnik et al (1999).

Having presented several research papers applying neural networks and Support Vector Machines, our expectation is that they will perform well on time series. Depending on the preprocessing for the LSTM and the SVR, the outcomes may vary if we do not account for their intrinsic differences. Both handle non-linear data well, depending on how the data is preprocessed. Ultimately, both models are complex in nature, and claiming that one is superior is not possible given how differently they can perform when their unique structure is taken into consideration. This means that they have to be adapted with respect to their complexity and their differences when approximating non-linearity.


4 Methodology

4.1 Support Vector Machine

A support vector machine is, in its simplest form, a linear classifier that separates data into two classes. It does this by drawing a line through the data, where data points on one side of the line belong to one class and the data points on the other side belong to another class. However, this line can of course be drawn in many ways and can separate the data arbitrarily, which would make for a poor method. The support vector machine separates the data by maximizing the distance between the line and the nearest data points on either side of it. The result is that the two resulting classes are as different from each other as possible. This produces a tube that separates the data, where the data points at the edge of the tube are called support vectors (Flach, 7.3, 2012).

We are not, however, looking to separate the data into two groups; we want to be able to make accurate time series forecasts. To do this we need to modify the model a little. If the data has more than one dimension, or if it is not linear, we have to map it into a high-dimensional feature space and then use that feature space in our further calculations. We represent this mapping with the function ϕ(x), where x has been mapped into a high-dimensional feature space. The feature space is unknown for now but will eventually be replaced with a so-called kernel function. When x is in the correct form we can use the SVM to perform a linear regression to make predictions (Okasha, 2014).

Figure 4.1: A graph of a simple SVM with slack variables.

© F.Hansson & J.Rostami

Page 16: Time Series Forecasting of House Prices An evaluation of a ... › smash › get › diva2:1325965 › FULLTEXT01.pdf · Machine learning methods can be categorized into methods using

Methodology 9

Below, we see the equation we want to solve to be able to make predictions:

$y(x) = W^{T}\phi(x) + b$

where the coefficient W is the vector perpendicular to our regression line and the coefficient b is the intercept. The ϕ indicates a high-dimensional feature space. Notice that this equation is similar to a simple linear regression (Okasha, 2014). To solve the above equation we introduce a set of so-called slack variables. The slack variables measure the distance from a data point to the boundary of the tube (see Figure 4.1). The slack variables are non-negative for every data point, and are zero for data points on the rim of the tube and for correctly classified data points. In SVM regression we allow data points to lie within the tube. The consequence of this is that maximizing the tube size, as we do in the ordinary SVM case, results in a tube of infinite size, since the data points are allowed to be inside the tube. Because of this, we introduce slack variables into the model. The slack variables are minimized at the same time as the tube size is maximized. When the tube grows larger, more data points are misclassified and more slack variables are needed. A balance therefore needs to be struck between tube size and slack variable minimization. To do this we introduce a hyperparameter called cost that determines how important it is to minimize the slack variables (Flach, 7.3, 2012).

The final SVM equation becomes:

$y(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\,(\phi(x_i), \phi(x)) + b \qquad (4.1)$

where α and α* are Lagrange multipliers and where we can express (ϕ(x_i), ϕ(x)) = K(x_i, x). This is called a kernel function. The kernel function can take many forms depending on the dimensionality of the input data (Okasha, 2014). The function ϕ is unknown to us, but we can use the kernel function to find a suitable feature space to map our data onto. A kernel function calculates the inner product of the kernel variables in the feature space (Ruping, 2001). The kernel function is not a fixed function and looks different for different data. Choosing the right kernel function is important to ensure the accuracy of the forecast (Okasha, 2014). There are many kernels to choose from, and our kernel will be chosen on the basis of which one produces the best forecast, through trial and error. The kernel function is, therefore, another hyperparameter that needs to be determined. The kernel functions that will be tested in this report are the linear, radial-basis, sigmoid, and polynomial kernels. Their equations can be seen in the appendix.
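To make the kernel choice concrete, the sketch below writes out the four candidate kernels in the common libsvm-style parameterisation with the gamma, coef, and degree hyperparameters mentioned in Section 4.5. The exact forms used in the thesis are the ones listed in Appendix B, so treat these as illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return x @ y

def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # K(x, y) = (gamma * <x, y> + coef0)^degree
    return (gamma * (x @ y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # K(x, y) = tanh(gamma * <x, y> + coef0)
    return np.tanh(gamma * (x @ y) + coef0)
```

Each function plays the role of K(x_i, x) in Equation 4.1, so the SVR prediction is a weighted sum of kernel evaluations between the new point and the support vectors.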


4.2 Recurrent Neural Network

4.2.1 Background

Artificial neural networks are inspired by the design of the brain. They are a model of computation that can be viewed as a formal model, with equations and statements of which parts are to be used, and so on. These models consist of a large number of basic computing neurons connected to each other in a complex communication network (Shalev-Shwartz and Ben-David, 2014, Chapter 20). Hence, they mimic the learning behaviour of the brain without being programmed for a specific task, as mentioned earlier. This is where a supervisor enters a teaching position and interacts with the learner, i.e. the network, and the environment. This is known as supervised learning.

Figure 4.2: A flowchart of a basic artificial neural network with a single hidden layer and binary outputs.

As seen in Figure 4.2 above, the NN is feed-forward and is composed of an input layer, a hidden layer with three cells and an activation function in each cell, and an output layer with an activation function that determines the output of the cell. An activation function works as a transformer in the network: it takes the input values and transforms them into an output. In the figure above it is a sigmoid function, which is a special case of the logistic function used in logistic regression, transforming the outputs into values between 0 and 1. The hidden layer works as a learning box and as a filter, depending on the nature of the data and the structure of the hidden layer.
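As a minimal sketch of the forward pass just described, the snippet below pushes an input vector through one hidden layer and a sigmoid output; the layer sizes and weight values are arbitrary placeholders, not values used anywhere in the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a feed-forward network with one hidden layer (cf. Figure 4.2)."""
    hidden = sigmoid(W1 @ x + b1)        # hidden layer: three cells with sigmoid activations
    return sigmoid(W2 @ hidden + b2)     # output layer: value squashed to (0, 1)

# Example with an input of size 2 and a hidden layer of 3 cells.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.2])
y = forward(x, rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1))
```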

4.2.2 A basic Recurrent Neural Network

RNNs are a part of the family of artificial neural networks with a recurrent architecture. The recurrence involved is the usage of previous outputs as new inputs, such that they


are recurring, meaning that they occur one or more times in the calculation of new outputs. They are similar to feed-forward networks such as the one explained in Figure 4.2. A simple RNN takes not only the current input but also previous information as inputs, through recurring connections. This gives importance to previous events, meaning that the events at t-2 and t-1 will influence decisions taken at time t. See Figure 4.3 for the intuition behind an RNN. The basic RNN is known as a vanilla RNN or a simple RNN and has been applied in several disciplines, such as natural language processing and tourism forecasting (Bianchi et al, 2017). RNNs are sequential models that process a sequential input or a sequence of variables (Goodfellow et al, 2016). Their main signifier is that they can use all of the previous inputs for each output, giving the notion of memory in the neural network (Graves, 2012, Chapter 3).

Figure 4.3: A simple RNN with an input layer, a single hidden layer, and an output layer and its recurrent architecture.
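The recurrence can be summarised in one line of code: the hidden state from the previous step is fed back in alongside the current input. The tanh activation and this exact parameterisation are the textbook formulation and are assumptions here, since the thesis does not write out the vanilla RNN equations.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: the previous hidden state recurs as an extra input."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)
```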

4.2.3 LSTM

For an RNN with LSTM architecture, the subnets, i.e. the memory blocks, are recurrently connected. These memory blocks consist of cells with four gates, where the gates are representations of activation functions (Graves, 2012). The cells in the memory block, also known as neurons, can simply be viewed as a small house with an entrance and an exit. The doors in the house represent the four gates in the memory block; one cell represents one passage through the house. Two cells represent passing through the same house, entering again just to exit again: it is a recurring passage. The gates are called the forget gate, the input gate, the output gate, and a fourth one, which we will call the gate


gate. They are multiplicative gates applied element-wise, known as Hadamard products. The original work on LSTMs and the details of the algorithm by Hochreiter and Schmidhuber (1997) are simplified here for the forward propagation. Figure 4.4 shows that the memory block has three inputs: x_t, h_{t-1}, and c_{t-1}.

We start by defining the input vector x_t and the hidden state at time t-1, h_{t-1}. Let x_t and h_{t-1} be column vectors of size $m \times 1$ and $n \times 1$. Concatenate them to obtain a column vector v of size $u = (m + n) \times 1$, where $v = [x_t, h_{t-1}] = [v_1, v_2, \ldots, v_u]$. Continuing to the gates in Figure 4.4, the long-term memory at the previous time step, defined as c_{t-1}, is used as the long-term memory at time t before it is updated by multiplication with the forget gate (i). The output value from the previous time step is sent into the current time step through the forget gate (i), where it tells the long-term memory at time t what to remember and what to forget through a sigmoid function. The sigmoid function outputs numbers between 0 and 1, where 1 means "fully store this" and 0 means "fully forget this". To avoid exploding numbers when the network undergoes many computations, the tanh function keeps values in the range [-1, 1] to make sure the network is steady and stable.

Other multiplications done inside the gates involve weight matrices and the concatenated vector. Every gate consists of a set of parameters called weights. Similar to regression coefficients, they decide how important values at the current time step and past time steps are. For this, a weight matrix is introduced at each gate and multiplied with the concatenated column vector v. The weight matrices are also updated through back-propagation to lower the error they produce from training, until they can no longer reduce their training error.

Figure 4.4: An LSTM cell with visual representations of its internal structure and the recurrent connections.


Now compute the weight matrix multiplied by the concatenated vector, add the bias by element-wise summation, and finally run the result through the sigmoid function.

$F_t = \sigma(W_F v + b_F)$

(i) The forget gate output.

We proceed similarly for the input gate, which learns which parts of the hidden state at t-1 and the input x_t are worth using and saving.

$I_t = \sigma(W_I v + b_I)$

(ii) The input gate output.

The gate gate decides which other candidate values are suitable by transforming them to the range [-1, 1]. Once the input gate and the gate gate are multiplied with each other, the sigmoid output of the input gate determines which information to keep from the tanh output. This also protects the network from what is known as exploding gradients during back-propagation in LSTMs, ensuring network stability over multiple time steps.

$G_t = \tanh(W_G v + b_G)$

(iii) The gate gate output.

The old long-term memory is updated by multiplying it with the forget gate and adding the input gate multiplied by the gate gate.

$c_t = F_t \odot c_{t-1} + I_t \odot G_t$

(iv) The new updated long-term memory, also called the cell state.

The output gate decides what to output and is based on the previous memory block and the given input at the current time step. It does not use the updated long-term memory, which means that the new output gate contains past information.

$O_t = \sigma(W_O v + b_O)$

(v) The output gate output.

Passing the updated cell state through a tanh function and multiplying it with the sigmoid output tells the hidden state what information to carry on. Our memory block at time t has now been updated and is used for predictions at time t + 1 and for the final output at t. This gives the new hidden state at time t, which also works as the output at time t.


$h_t = O_t \odot \tanh(c_t)$

(vi) The new hidden state output.

Finally, for regression output, our predicted value is run through a linear transformation so that the output values become unbounded. The memory block process explained in this section gives the reader a notion of what the Long Short-Term Memory stands for. As demonstrated, c_t is the long-term memory, which remembers all previous operations, whilst O_t is the short-term memory, which remembers the previous outputs. Hence, Long Short-Term Memory is the process of the memory block with recurring connections through time. The recurrence involved is through c_t and h_t being carried on to the next time step, where past hidden states and past information recur in the next memory block.
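Collecting equations (i)-(vi), a single forward pass through one memory block can be sketched as below; the weight matrices and biases are placeholders, and the sketch omits the final linear output layer mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_F, W_I, W_G, W_O, b_F, b_I, b_G, b_O):
    """One forward pass through an LSTM memory block, following equations (i)-(vi)."""
    v = np.concatenate([x_t, h_prev])       # concatenated input v = [x_t, h_{t-1}]
    F_t = sigmoid(W_F @ v + b_F)            # (i)   forget gate
    I_t = sigmoid(W_I @ v + b_I)            # (ii)  input gate
    G_t = np.tanh(W_G @ v + b_G)            # (iii) gate gate (candidate values)
    c_t = F_t * c_prev + I_t * G_t          # (iv)  updated cell state (element-wise)
    O_t = sigmoid(W_O @ v + b_O)            # (v)   output gate
    h_t = O_t * np.tanh(c_t)                # (vi)  new hidden state / output
    return h_t, c_t
```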

4.3 SARIMA

A commonly used way to forecast time series is to use a SARIMA model or one of its many variants. SARIMA stands for Seasonal Autoregressive Integrated Moving-Average process. The SARIMA model consists of lags from previous time periods. These lags are past values of the dependent variable as well as past values of independent, identically distributed random variables with mean 0 and variance $\sigma^2_{\varepsilon}$ (Cryer, Chan, 4.2-4.3, 2008). The lags are subject to a set of weights that determines their importance to the value being predicted. Some time series are not stationary, which causes problems when we want to build the SARIMA model. To remedy this we difference the series until it becomes stationary (Cryer, Chan, 10.1, 2009). Some dependent variables depend not only on variables from the previous time period but also on variables at recurring past time intervals. This is called a seasonal element. A common seasonal element is what happened in the current month a year ago (Cryer, Chan, 5.2, 2008). In Equation 4.2 we see a SARIMA model.

$(1 - \phi_1 B)(1 - \Phi_1 B^{12})(1 - B)(1 - B^{12})Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})e_t \qquad (4.2)$

Equation 4.2 is a SARIMA model with a seasonal element of lag 12. B is the backshift operator. φ is the weight for past lagged values of the dependent variable and Φ is the weight for past seasonal lagged values of the dependent variable. (1 - B) is the first difference and (1 - B^12) is the first seasonal difference. θ is the weight for past random variables and Θ is the weight for past seasonal random variables. Y_t is the dependent variable value we want to predict, and the e_t are independent, identically distributed random variables with mean 0 and variance $\sigma^2_{\varepsilon}$.
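For reference, a model of the form in Equation 4.2 can be fitted in Python with statsmodels as sketched below. The series name y_train is a placeholder, and the (1, 1, 1)(1, 1, 1)[12] orders are illustrative, mirroring the single regular and seasonal AR/MA terms and differences written out above rather than the orders actually selected in Section 5.2.3.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# One regular and one seasonal AR/MA term with regular and seasonal differencing,
# matching the structure of Equation 4.2 (seasonal period 12 for monthly data).
model = SARIMAX(y_train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)
forecast = result.forecast(steps=12)  # twelve-month forecast
```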


4.4 Cross-validation and evaluation metrics

Hastie et al (2017) describe the method of k-fold cross-validation as randomly dividing the dataset into k groups of approximately equal size; each group is in turn treated as a test set while the model is trained on the remaining k - 1 groups. However, because we are dealing with time series data, we have to approach the problem with a slightly different method. We do not have the possibility of dividing the dataset into a training set, a test set, and a validation set because of insufficient data, and given our models it is more appropriate to use the walk-forward/sliding-window testing routine described by Kaastra and Boyd (1996) with a training and a test set (Hastie et al, Ch. 7, 2013). Hyndman and Athanasopoulos (2018) suggest that a one-step or multi-step rolling forecast origin may also be used. One must consider the size and nature of the dataset: using monthly time series data suggests that a sliding window or rolling forecast origin are the methods to use when cross-validating. Given the machine learning models, sliding windows try to simulate real-life data and test model robustness by retraining on a large out-of-sample dataset. They also allow quicker adaptation to changing market conditions when using neural networks (Kaastra and Boyd, 1996).

The sliding window cross-validation is done by splitting the dataset into a training period and a test period. The length and number of windows are defined by the periods we train and test on. This makes the sliding window method a sort of k-fold method for time series. In order to evaluate the sliding windows for the global cross-validation, the overall RMSE of the forecast errors is given in Equation 4.3.

$\mathrm{RMSE}_{\mathrm{overall}} = \sqrt{\dfrac{MSE_1 + \cdots + MSE_k}{k}} \qquad (4.3)$

where $MSE_j$ is the mean of the squared forecast errors in sliding window $j$.
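A sliding-window evaluation following Equation 4.3 can be sketched as below. The fit_predict argument is a placeholder for whichever model is being evaluated, and the default window lengths (four years of training, one year of testing, windows offset by two years) are taken from the setup reported in Section 5.2.1.

```python
import numpy as np

def sliding_windows(n_obs, train_len, test_len, offset):
    """Yield (train_idx, test_idx) index arrays for each sliding window."""
    start = 0
    while start + train_len + test_len <= n_obs:
        yield (np.arange(start, start + train_len),
               np.arange(start + train_len, start + train_len + test_len))
        start += offset

def overall_rmse(y, fit_predict, train_len=48, test_len=12, offset=24):
    """Overall RMSE across windows, as in Equation 4.3."""
    mses = []
    for train_idx, test_idx in sliding_windows(len(y), train_len, test_len, offset):
        preds = fit_predict(y[train_idx], len(test_idx))  # train on the window, forecast the test period
        mses.append(np.mean((y[test_idx] - preds) ** 2))  # MSE_j for window j
    return np.sqrt(np.mean(mses))
```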

However, Hyndman and Koehler (2005) bring up the sensitivity of the RMSE to outliers, so two further metrics are added to ensure a good comparison between different methods; the added metrics are scale-independent. The MAPE, Mean Absolute Percentage Error,

$\mathrm{MAPE} = \dfrac{1}{T - T_0 + 1}\sum_{t=T_0}^{T}\left|\dfrac{Y_t - \hat{Y}_t}{Y_t}\right| \times 100, \qquad 1 < T_0 < T \qquad (4.4)$


and the Symmetric Mean Absolute Percentage Error (SMAPE),

$\mathrm{SMAPE} = \dfrac{1}{T - T_0 + 1}\sum_{t=T_0}^{T}\dfrac{200\left|Y_t - \hat{Y}_t\right|}{|Y_t| + |\hat{Y}_t|}, \qquad 1 < T_0 < T \qquad (4.5)$

will be used in this paper. In Equations 4.4 and 4.5, T_0 is the first forecast month and T is the final forecast month; in our case T is always the twelfth month we forecast. These metrics are not ideal measurements of forecast accuracy: the MAPE puts a heavier penalty on positive errors than on negative errors, and the SMAPE is not symmetric as the name suggests, since for the same value of Y_t the quantity $200\,|Y_t - \hat{Y}_t| / (|Y_t| + |\hat{Y}_t|)$ penalizes low forecasts more heavily than high ones (Hyndman and Koehler, 2005). Therefore, using three different metrics to measure the forecast errors should guide us better than using only one. The RMSE acts in this case primarily as a within-model forecast error metric and secondarily as a between-model metric, while the MAPE and the SMAPE are primarily between-model forecast error metrics for model comparison and secondarily clarify the overall comparison together with the RMSE.
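As a sketch, Equations 4.4 and 4.5 translate directly into two small helper functions; actual and forecast are placeholder arrays covering the forecast months T_0 through T.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, Equation 4.4 (in percent)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error, Equation 4.5."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(200 * np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast)))
```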

4.5 Hyperparameter tuning

Searching for the optimal model is a time-consuming process, which leads to the search for methods that find the configuration performing best on the evaluation metrics chosen for the training and testing sets. This is logical, since it reduces the time spent on manual trial and error and instead creates an efficient computational mechanism for obtaining optimal configurations.

Hyperparameters are the parameters that are independent of the data and can be considered the buttons on a black box. They are used to control the model and are defined as those parameters that cannot be learned directly from the data (Ghatak, 2017). Tuning them means tuning the model with respect to its performance on the metrics chosen for training and testing. Three methods of hyperparameter tuning will be considered in this paper: manual configuration, random hyperparameter search, and grid search. The manual configuration might seem unnecessary or contradictory since we have mentioned how time-consuming it is to search for the optimal model. Regardless, it is used to build any model and is the foundation for how models are built in general: build the model first, then consider finding the best configuration through other methods.

Grid search is a method that finds the hyperparameters of interest. It searches over all possible configurations and returns the best one. However, grid search can be computationally demanding and sometimes requires a lot of time and computational resources. The more parameters tested by the grid search, the more demand-


ing it becomes. Because of this, a random hyperparameter search can be preferable when the computation is heavy. Even though random search is unlikely to find the best configuration in the end, it gives better configurations using fewer iterations: perhaps after 10 iterations random search gives the better configuration, but grid search will give the best configuration after a full search. This is where one has to consider the computational cost and the time available. In this paper we will use the following hyperparameters for the LSTM; a Keras sketch using these settings follows the list.

• Batch size: the number of training examples to pass through one epoch; this will be set to 1.
• Hidden layers: the number of memory blocks in the network.
• Time-steps: the number of steps in time run through the RNN, i.e. how many steps it memorizes; for this paper this is set to 1.
• Features: the number of variables in every time step; in a univariate time series this is 1.
• Recurrent dropout: a fraction, defined between 0 and 1, of the hidden states to drop.
• Cells: the number of cells in which the procedure of the LSTM memory block is repeatedly executed.
• Optimizer: an algorithm that optimizes the weights to reduce the mean square error; AdaDelta will be used with default settings in Keras.
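A sketch of such a network in Keras, assuming the two-layer Stockholm configuration later reported in Table 5.3 (24 and 12 cells with recurrent dropout 0.1 and 0.5) together with the fixed settings above (one time step, one feature, batch size 1, AdaDelta); X_train and y_train are placeholders shaped (samples, 1, 1) and (samples,).

```python
from tensorflow import keras

timesteps, features = 1, 1
model = keras.Sequential([
    keras.layers.LSTM(24, recurrent_dropout=0.1, return_sequences=True,
                      input_shape=(timesteps, features)),  # first hidden layer
    keras.layers.LSTM(12, recurrent_dropout=0.5),          # second hidden layer
    keras.layers.Dense(1),                                 # linear output for the regression
])
model.compile(optimizer="adadelta", loss="mse")
# model.fit(X_train, y_train, epochs=100, batch_size=1)
```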

For the SVR,

• C: a constant that determines the importance of slack variable minimization versus tube diameter maximization.
• Gamma: a constant used in the radial-basis, sigmoid, and polynomial kernels.
• Coef: an intercept constant used in the sigmoid and polynomial kernels.
• Degree: a constant used in the polynomial kernel.

In the case of the SVR we will use grid search as an initial way of finding good values for the hyperparameters and then try to improve on the grid search result by searching for better hyperparameters. This is done by trying to lower the RMSE by changing the hyperparameters one by one. Each hyperparameter is increased and/or decreased until the RMSE is no longer lowered, after which we move on to the next hyperparameter and do the same thing. When all hyperparameters have gone through this process, each is increased and decreased again to see if a lower RMSE can be found. This process continues until a better value cannot be found.
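A minimal sketch of this two-stage procedure is given below. The evaluate function is a placeholder assumed to fit an SVR with the given hyperparameters and return the overall sliding-window RMSE; the grid, the step factor, and the number of refinement rounds are illustrative choices, not the ones used in the thesis.

```python
from itertools import product

def tune_svr(evaluate, grid, step=1.5, max_rounds=10):
    """Coarse grid search followed by one-at-a-time refinement of each hyperparameter."""
    names = list(grid)
    best_params, best_rmse = None, float("inf")

    # Stage 1: try every combination in the coarse grid.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        rmse = evaluate(params)
        if rmse < best_rmse:
            best_rmse, best_params = rmse, params

    # Stage 2: nudge one hyperparameter at a time up or down until the RMSE
    # no longer improves, then move on to the next hyperparameter.
    for _ in range(max_rounds):
        improved = False
        for name in names:
            for factor in (step, 1.0 / step):
                trial = dict(best_params)
                trial[name] = best_params[name] * factor
                rmse = evaluate(trial)
                if rmse < best_rmse:
                    best_rmse, best_params, improved = rmse, trial, True
        if not improved:
            break
    return best_params, best_rmse

# Example call (evaluate is defined elsewhere):
# params, rmse = tune_svr(evaluate, {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]})
```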

4.6 Model building approach

Designing a machine learning model for time series forecasting can be a time-consuming procedure and may demand a structured building approach. An eight-step design methodology for a neural network forecasting model on financial time series data has


been presented by Kaastra and Boyd (1996); it may require revisiting previous steps rather than being a single pass. Utilizing this methodology for the LSTM and the SVR in this thesis is beneficial in terms of time but also in tuning the models, since we can look back at previous steps.

Step one: Variable selection
The first step of building the model will be selecting which variables to use in the model. The simplest neural network model uses lagged values of the dependent variable(s) or its first differences (Kaastra and Boyd, 1996). The SVR will use time as the only independent variable while the LSTM will use lagged values.

Step two: Data collection
Data on monthly house prices for detached houses have been provided by the company Svensk Mäklarstatistik.

Step three: Data preprocessing
The data provided will go through preprocessing, considering the most suitable procedure for the LSTM and the SVR. This preprocessing might involve procedures such as taking the logarithm of the process, normalizing values, and/or a square root transformation. We will use the square root transformation and the normalization procedure.
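A small sketch of that preprocessing is shown below, assuming min-max normalization to [0, 1] after the square root transform; the thesis does not state which normalization it uses, so that choice is an assumption. Inverting the two steps recovers forecasts on the original price scale.

```python
import numpy as np

def preprocess(prices):
    """Square-root transform followed by min-max normalization to [0, 1] (assumed)."""
    root = np.sqrt(np.asarray(prices, float))
    lo, hi = root.min(), root.max()
    return (root - lo) / (hi - lo), (lo, hi)

def invert(scaled, lo, hi):
    """Undo the normalization and the square-root transform."""
    return (np.asarray(scaled, float) * (hi - lo) + lo) ** 2
```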

Step four: Training, testing, and validation sets
Ideally, for a large dataset it is better to divide the data into three parts: a training set, a test set, and a validation set. But in situations with insufficient data, as with the dataset for this thesis, the data will only be split into a training set and a test set. We will use a walk-forward sliding-window testing routine as mentioned by Kaastra and Boyd (1996), also called a rolling forecast by Hyndman and Athanasopoulos (2018, Chapter 3).

Step five: Neural network paradigm
In machine learning, hyperparameters are the parameters that are independent of the learning process and are configured before learning begins. The LSTM can be viewed as an artificial neural network that produces an output at each time step. Some RNNs also have hidden layers that do not produce output but are instead used to refine the learning process by sending data to the processes that give the output. RNN models with more hidden layers are called deeper, and they are advantageous according to Goodfellow et al (2016). However, Kaastra and Boyd (1996) mention the danger of overfitting when increasing the depth, i.e. increasing the number of hidden layers, and point out that one or two hidden layers are widely used and have performed well. The SVM has a number of hyperparameters that need to be determined, and they are chosen based on model performance. The hyperparameters of both models will be determined in the results


section of this report.

Step six: Evaluation criteria
To decide which model is the better one, we need to be able to evaluate their performance. We will look at the overall RMSE of the sliding-window forecasts. The model with the lowest average forecast error will be judged the better model.

Step seven: Model training
The supervised learning takes place by passing training data to the LSTM and the SVR. Optimization of the models by trial and error and with different tuning methods will be carried out to find the most suitable models.

Step eight: Implementation
Once steps 1 through 7 are finished, the final LSTM and SVR models are implemented. These are the definitive models used for forecasting.

4.7 Software and computational setup

R, a statistical programming tool, has been used for the work in this paper. Python, another programming language, has been integrated into the R environment through the Anaconda platform for the purpose of utilizing two libraries that are written in Python code.

The main libraries used are:

• Keras - a machine learning library written in Python code, runs on TensorFlow.

• TensorFlow GPU - a machine learning library written in Python code, utilizes GPU instead of CPU.

• caret - a statistical learning package written in R for classification and regression.

• e1071 - a statistical learning package written in R for Support Vector Machines.

• tfruns - a hyperparameter tuning package written in R for Keras and TensorFlow.

• forecast - a time series library written in R.


Computer setup for the software and libraries used.

Specification   PC-1                    PC-2
CPU             i5-9600K                i5-7500
GPU             GeForce RTX 2060 6GB    GeForce GTX 1060 6GB
RAM             16 gigabytes            8 gigabytes


5 Empirical findings

5.1 Data exploration

An interesting aspect of the house prices is how they change over a long period of time. By visualizing the house prices for the larger Stockholm area and the municipality of Uppsala, one can get a glimpse of any recurring pattern. Figure A.1 (see appendix) shows a stable increase in house prices over the period from 2005 to 2019, where Stockholm saw a jump in prices around 2015. Uppsala saw a jump in prices just before 2015 and has, like Stockholm, had steadily increasing prices over the entire period. When viewing the period 2014 to 2019 for Stockholm and Uppsala, a pattern is revealed (see Figure 5.1). There is an increasing seasonal trend for houses in Stockholm, with troughs as well; the seasonal element persists throughout this period. Looking at Uppsala, a seasonal trend can be seen in the first two years, eventually becoming cyclical between 2016 and 2019. The deepest trough for Stockholm occurs in the middle of 2016, after which prices recovered by increasing all the way into 2017 before going into a trough again, this time not as deep as the previous year. For Uppsala, similar behavior is detected but it was short-lived: it goes into a seasonal pattern after 2016 that continues until 2018, before returning to behaviour similar to that before 2016.

Figure 5.1: House prices of the larger Stockholm area and Uppsala from January 2014 till April 2019. Source: Svensk Mäklarstatistik.


Visually plotting the preprocessed data and looking at the autocorrelation for Stockholm and Uppsala, we see that both show some non-negligible correlation with past periods. As seen in A.2 and A.3, the autocorrelations both decrease until lag 12, where the autocorrelation increases before decreasing again. A 12-month lag therefore seems to be an appropriate past value to use for the LSTM. We will look into the data further to try to visualize and detect its underlying elements. Observing more closely where the troughs appear (see 5.2 and 5.3) reveals that the month of July is responsible for them.

Figure 5.2: Seasonal plot of houses in the larger Stockholm area.

Figure 5.3: Seasonal plot of houses in Uppsala.

The seasonal plots also tell us how the values behave monthly over a longer time period. The monthly values for Stockholm behave similarly from year to year, whereas for Uppsala the monthly values behave differently over a longer time period, though July still remains the significant trough over the measured time. Decomposing the preprocessed data of


Decomposing the preprocessed data of Stockholm and Uppsala shows that both have a seasonal and a trend element (see Figures A.4 and A.5). The trend component shows a pattern of increasing prices, while the seasonal component shows 12-month seasonal fluctuations with troughs in the middle of the year for both Stockholm and Uppsala.
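The thesis does not reproduce the code behind these plots. As a minimal sketch, assuming the monthly prices are available as a pandas Series indexed by month (the file name house_prices.csv and its column names are hypothetical), the autocorrelation and a classical seasonal decomposition of the kind shown in Figures A.2-A.5 could be produced as follows:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical file layout: one row per month with a 'price' column.
prices = pd.read_csv("house_prices.csv", parse_dates=["month"], index_col="month")["price"]

# Autocorrelation over three years of lags; a bump at lag 12 suggests
# a 12-month seasonal dependence (cf. Figures A.2 and A.3).
plot_acf(prices, lags=36)

# Classical additive decomposition into trend, seasonal, and residual
# components with a 12-month period (cf. Figures A.4 and A.5).
decomposition = seasonal_decompose(prices, model="additive", period=12)
decomposition.plot()
plt.show()
```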

5.2 Results

5.2.1 Support Vector Machine

The tables below show the tests of the linear, polynomial, radial-basis, and sigmoid kernels for the SVM. The tests were run on houses in Uppsala and Stockholm using four sliding windows. Each sliding window had a training period of four years and a test period of one year, and the next window was offset by two years. The grid-search method did not yield the lowest-RMSE model; it suggested a radial-basis kernel with a higher RMSE than any of the kernels presented below. To find these kernels we used the alternative method presented earlier, where the hyperparameters are tuned manually. As can be seen, the radial-basis kernel performs best for both areas (a sketch of the sliding-window procedure is given after Table 5.2).

Table 5.1: Hyperparameters and respective kernel for Uppsala.

Hyperparameter   Linear   Polynomial   Radial-basis   Sigmoid
RMSE             292072   289668       274222         385975
Cost             0.075    0.77         6              100
Gamma            -        0.016        1.02           10000
Degree           -        1            -              -
Coef             -        1            -              1

Table 5.2: Hyperparameters and respective kernel for the larger Stockholm area.

Hyperparameter   Linear   Polynomial   Radial-basis   Sigmoid
RMSE             300854   300853.7     280994         378205
Cost             0.51     0.12         44             1
Gamma            -        5            0.08           3000
Degree           -        1            -              -
Coef             -        100          -              1
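The exact implementation behind Tables 5.1 and 5.2 is not shown in the thesis. As an illustrative sketch of the sliding-window procedure described above (four windows, 48 training months, 12 test months, windows offset by 24 months), the code below uses scikit-learn's SVR with a radial-basis kernel; the cost and gamma values are taken from Table 5.2 purely for illustration, the prices Series follows the earlier sketch, and regressing on the time index is a simplification rather than the thesis's exact setup.

```python
import numpy as np
from sklearn.svm import SVR

def sliding_window_rmse(y, cost, gamma, train_len=48, test_len=12, offset=24, n_windows=4):
    """Evaluate an RBF-kernel SVR over sliding windows and return the RMSE per window."""
    rmses = []
    for w in range(n_windows):
        start = w * offset
        train_idx = np.arange(start, start + train_len)
        test_idx = np.arange(start + train_len, start + train_len + test_len)

        # Fit on the training window and predict the following unseen year.
        model = SVR(kernel="rbf", C=cost, gamma=gamma)
        model.fit(train_idx.reshape(-1, 1), y[train_idx])
        pred = model.predict(test_idx.reshape(-1, 1))

        rmses.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return rmses

# Example with the radial-basis hyperparameters reported for Stockholm in Table 5.2.
window_rmses = sliding_window_rmse(prices.to_numpy(), cost=44, gamma=0.08)
print(window_rmses)
```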


5.2.2 Long Short-Term Memory

The LSTM hyperparameters were tuned by trial and error. One of the manually configured settings outperformed the random hyperparameter search in terms of RMSE. Table 5.3 below shows the optimal hyperparameter configurations, obtained from a random sample of 20 percent of all combinations together with a manual tuning configuration. In this case, 45 out of 225 combinations were searched for both Stockholm and Uppsala; however, a manual configuration was found to be optimal for Uppsala.

Table 5.3: LSTM hyperparameters and RMSE for both areas.

Hyperparameter          Metro-Stockholm   Uppsala
Optimizer               Adadelta          Adadelta
RMSE                    286234            331505
Hidden layers           2                 2
Cells (1)               24                24
Cells (2)               12                24
Recurrent dropout (1)   0.1               0.1
Recurrent dropout (2)   0.5               0.4
Tuning                  Random search     Manual

For Stockholm, the random hyperparameter search yielded a better configuration than the manual search did. The proposed configuration has two hidden layers with 24 units in the first layer and 12 units in the second, and a recurrent dropout of 10 percent in the first layer and 50 percent in the second. For Uppsala, the optimal configuration was not found by the random hyperparameter search; instead, manual testing gave the best configuration, which has two hidden layers, 24 units in each layer, and a recurrent dropout of 10 percent in the first layer and 40 percent in the second. The LSTM for Stockholm was trained for 100 epochs, as this was found to be optimal for the current sampling plan. By visually inspecting the cross-validation for Stockholm in Figure 5.4, it can be seen that the LSTM captures a few of the seasonal occurrences in terms of troughs and peaks. The result is visually appealing since the model seems to capture the structure of the data with a low RMSE on the window sets. For Uppsala, the LSTM has a hard time learning from the data, and the underlying structure is difficult for it to retain. On all the sliding windows, the predictions on the unseen one-year windows look poor, but the RMSE is not bad enough for the configuration to be disregarded. This LSTM trained optimally at 50 epochs, which is seen especially in the fourth sliding window, where the model performed badly compared to the previous three samples, as shown in Figure 5.5 below.
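To make the Stockholm configuration in Table 5.3 concrete, the following is a minimal sketch using the Keras sequential API (the library choice, the make_supervised helper, the batch size, and the scaling step are assumptions, not details taken from the thesis). The prices Series is the one from the earlier sketch and is assumed to have been scaled before training.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_supervised(series, n_lags=12):
    """Turn a 1-D array into (samples, n_lags, 1) inputs and next-month targets.
    Hypothetical helper, not taken from the thesis."""
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X).reshape(-1, n_lags, 1), np.array(y)

# Stockholm configuration from Table 5.3: two hidden LSTM layers with
# 24 and 12 cells, recurrent dropout of 0.1 and 0.5, Adadelta optimizer.
model = Sequential([
    LSTM(24, recurrent_dropout=0.1, return_sequences=True, input_shape=(12, 1)),
    LSTM(12, recurrent_dropout=0.5),
    Dense(1),
])
model.compile(optimizer="adadelta", loss="mse")

# 100 epochs matched the sampling plan for Stockholm according to the text above;
# the batch size is a hypothetical choice.
X, y = make_supervised(prices.to_numpy())
model.fit(X, y, epochs=100, batch_size=12, verbose=0)
```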


Figure 5.4: Cross-validation with out-of-sample predictions for the larger Stockholm area.

Figure 5.5: Cross-validation with out-of-sample predictions for Uppsala.


5.2.3 SARIMA

For the seasonal ARIMA model, the hyperparameters were chosen manually by examining the autocorrelation and partial autocorrelation functions. By testing several configurations, the optimal model for Stockholm was found to have 2 autoregressive lags, 1st-order differencing, 0 moving-average lags, 0 seasonal autoregressive lags, 1st-order seasonal differencing, and 2 seasonal moving-average lags. Uppsala follows the same configuration except that it does not have a first-order non-seasonal difference.

Table 5.4: Optimal hyperparameters for seasonal ARIMA and their respective cross-validation RMSE and overall AIC.

Hyperparameter      Metro-Stockholm   Uppsala
Order (p,d,q)       (2,1,0)           (2,0,0)
Seasonal (P,D,Q)    (0,1,2)[12]       (0,1,2)[12]
RMSE                154933            163600
Average AIC         938.3             987.9

Both models have been differenced so that the slowly decaying autocorrelation function of the residuals decays faster and the partial autocorrelation function of the residuals is uncorrelated at lag 12; the plots of these functions can be found in the appendix, see Figures A.10-A.13.
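As an illustration of the orders in Table 5.4, the sketch below fits the Stockholm model with statsmodels (the library choice is an assumption; the prices object follows the earlier sketches) and reports the AIC together with a 12-month forecast.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Stockholm: SARIMA(2,1,0)(0,1,2)[12]; for Uppsala, order=(2,0,0) as in Table 5.4.
sarima = SARIMAX(prices, order=(2, 1, 0), seasonal_order=(0, 1, 2, 12))
sarima_fit = sarima.fit(disp=False)

print(sarima_fit.aic)                      # compare with the average AIC in Table 5.4
forecast = sarima_fit.forecast(steps=12)   # 12-month-ahead forecast
print(forecast)
```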

As seen in Figures A.6 and A.7, the SARIMA models have a low overall RMSE for the sliding-windows cross-validation sampling plan. The model for Uppsala performs poorly when inspected visually although its RMSE is still low, whilst the model for Stockholm performs well both visually and in terms of RMSE.

5.2.4 Machine learning model comparison and model choice

Since the radial-basis kernel performed best in the sliding-windows test for the SVM, it is also applied for the final forecasts seen below. The forecast is made with the first 159 months of our data as training data, forecasting 12 months ahead; the RMSE shown is the RMSE over these 12 forecast months. The forecast for the LSTM is done in the same way, using the optimal configuration: the first 159 months are used as training data and a forecast is made 12 months ahead. To be clear, the RMSE, MAPE, and SMAPE shown in Table 5.5 are the forecast error metrics for the 12-month forecast.
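For reference, the three error metrics reported in Tables 5.5 and 5.6 can be computed as below; this is a sketch of standard definitions (MAPE and one common variant of SMAPE, both in percent), not code taken from the thesis.

```python
import numpy as np

def rmse(actual, forecast):
    # Root mean squared error.
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    # Mean absolute percentage error, in percent.
    return 100 * np.mean(np.abs((actual - forecast) / actual))

def smape(actual, forecast):
    # Symmetric MAPE, in percent (mean-of-absolute-values denominator).
    return 100 * np.mean(np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2))
```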

As can be seen in Table 5.5 below, the LSTM has a lower RMSE than the SVM for both Uppsala and Stockholm. The LSTM model will therefore be compared to the SARIMA model. The SVM performs poorly in both areas when inspected visually in Figures A.8 and A.9, compared to the LSTM, which has both lower metric values and looks visually satisfactory.


Table 5.5: Model comparison of LSTM and SVM (Support Vector Machine vs Long Short-Term Memory).

Model   Area              MAPE    SMAPE   RMSE
LSTM    Metro-Stockholm   4.29%   4.18%   286124
LSTM    Uppsala           7.52%   7.11%   340303
SVM     Metro-Stockholm   7.05%   6.72%   434589
SVM     Uppsala           8.32%   8.05%   373270

5.2.5 Best machine learning model vs. SARIMA

Direct comparison of the RMSE in the table below shows that the SARIMA models have a lower RMSE than the LSTM models. Table 5.6 below shows the evaluation metrics for both models. In Figures 5.6 and 5.7 below we also see that the SARIMA model forecasts the unseen 12-month data better than the LSTM.

Table 5.6: Model comparison of LSTM and SARIMA (SARIMA vs Long Short-Term Memory).

Model    Area              MAPE    SMAPE   RMSE
LSTM     Metro-Stockholm   4.29%   4.18%   286124
LSTM     Uppsala           7.52%   7.11%   340303
SARIMA   Metro-Stockholm   2.70%   2.68%   158438
SARIMA   Uppsala           6.04%   5.79%   280297


Figure 5.6: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of the larger Stockholm area.

Figure 5.7: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of Uppsala.


6 Discussion & Conclusion

6.1 Discussion

As seen in the results section, the SARIMA models perform better than both the LSTM and SVM models. The LSTM models, in turn, perform better than the SVM models; the LSTM especially outperforms the SVM on the Stockholm time series, where the difference in RMSE, MAPE, and SMAPE is very large: the SVM error measurements are almost twice as large as those of the LSTM. Let us look at the differences between the LSTM and SVM models more closely.

The graphs A.8 and A.9, which show the forecasts of the LSTM and SVM models, give more insight into why the measurements are worse for the SVM. It is clearly seen that the LSTM models try to fit the data; the LSTM follows the data reasonably well but stumbles slightly in the later part of the Uppsala forecast. The SVM models, on the other hand, make an almost linear forecast. When predicting the Uppsala data, the SVM seems able to find a downward trend during the forecast period and therefore comes close to the LSTM prediction. It is, however, limited by its linear nature, cannot make predictions beyond the trend, and for this reason suffers a high error. In the Stockholm forecast, the LSTM performs far better than the SVM, as previously mentioned. The SVM again tries to find a linear trend, but this time it misjudges the trend and moves in a different direction than the actual time series does, which is why the error measurements for that forecast are so poor. The LSTM forecast does not do this and instead somewhat captures the behavior of the time series.

What are the problems with the SVM model, and how can it be improved? A likely problem is that the optimal hyperparameters have not been found. The grid-search method used to find the hyperparameters failed to find optimal parameters and was very easy to beat by manually searching for better ones. A problem with manual search is that it is easy to find a local minimum and mistake it for the global minimum, that is, the optimal set of hyperparameters. This is especially true for kernels that use more than one hyperparameter, since the number of possible combinations of parameter values increases manyfold. A computer-based method should be able to beat the manual method used in this paper, since it can try out many more combinations faster than any human could. Unfortunately, the grid-search method found in the packages we used was not very good.


A solution could be to build our own computer-based method for finding the best set of hyperparameters; however, time constraints prevented us from doing this.

Another problem with the SVM model is that the hyperparameters suggested by the sliding-windows process are not necessarily the best for the final forecast. The sliding-windows process suggested using the radial-basis kernel to forecast both time series, yet when making the final forecast with a different kernel than the radial-basis kernel suggested by the sliding-windows process, we sometimes got more accurate forecasts; in some cases these forecasts vastly outperformed the radial-basis kernel. This means that the sliding-windows process used is not finding the best model for the final forecast. It is possible that we are using too few windows and need more of them to find a set of hyperparameters that fits the time series better than our current parameters. The downside is that this could become more computationally demanding, especially if a computer-based method is used to find the hyperparameters; however, the SVM was not very demanding with four windows, so the number of windows could probably be increased without much issue. A third way to increase forecast accuracy could be to include not only the time series itself as input but also lagged values of it. We know from the data visualization part of this report that there is a seasonal dependence in the time series; if that dependence is provided to the model as input, it could improve the model (a sketch of this idea follows below). This was also not done due to time constraints. Lastly, the LSTM has an advantage over the SVM since it uses lagged values as input and not only the current value of the series. This makes it easier for the LSTM to take seasonality into account, and since we know that there is a seasonal element, this gives the LSTM model an edge over the SVM.
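To make the lagged-input idea concrete, the sketch below builds a small design matrix with 1-month and 12-month lags and fits an SVR on it. This is an illustration of the suggested improvement, not something done in the thesis; the prices Series follows the earlier sketches and the cost value 44 is borrowed from Table 5.2 only as a placeholder.

```python
import pandas as pd
from sklearn.svm import SVR

# Build lagged features; the lag-12 column lets the model see last year's seasonal value.
frame = pd.DataFrame({
    "price": prices,
    "lag_1": prices.shift(1),
    "lag_12": prices.shift(12),
}).dropna()

X = frame[["lag_1", "lag_12"]].to_numpy()
y = frame["price"].to_numpy()

# Train on all but the last 12 months, test on the final year.
svr = SVR(kernel="rbf", C=44, gamma="scale")
svr.fit(X[:-12], y[:-12])
predictions = svr.predict(X[-12:])
```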

The LSTM clearly outperforms the SVM, and it comes with hyperparameters of a different kind. The model needs to know for how long past information should be retained or dropped in future predictions and, at the same time, how much of that past information should be retained or dropped. Further, the model needs an appropriate number of hidden layers to avoid overfitting. That two hidden layers were used for both areas is in line with Kaastra and Boyd (1996), and the models perform well in cross-validation. Comparing the cross-validation predictions for the LSTM with the forecasting performance on the 12-month period from April 2018 to March 2019, the model performs unexpectedly well given that it underperformed on the fourth sliding-window forecast for Uppsala, which also covers the most recent data in the sampling plan. This creates confusion when evaluating the optimal hyperparameters and revisiting the model-building steps: on one hand the model seems reliable for multi-step forecasting, on the other hand this can create a confirmation bias, where it seems like the optimal hyperparameters have been found but it may be a case of beginner's luck.


This is a problem when computational resources are insufficient and a grid search for the best configuration cannot be run: it is easy to fall into the trap of believing one has created the optimal forecasting model. One way to overcome this problem would be to preprocess the data in the same way as for the SARIMA model. How would the model behave under stationarity? This would, for example, create a time series whose properties do not change over time, which would make the model easier to interpret since those properties would be predictable over time.
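One simple way to explore the stationarity question raised above is to difference the series and test it, for example with an augmented Dickey-Fuller test; the sketch below is illustrative and not part of the thesis, and again assumes the prices Series from the earlier sketches.

```python
from statsmodels.tsa.stattools import adfuller

# Remove the 12-month seasonal pattern and the trend before feeding the model.
stationary = prices.diff(12).diff(1).dropna()

# Augmented Dickey-Fuller test: a small p-value suggests the differenced series is stationary.
adf_stat, p_value, *_ = adfuller(stationary)
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3f}")
```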

6.2 Conclusion

While the SVM model performs poorly, the LSTM model shows much greater promise as a forecasting tool for time series. The LSTM model-building approach involves a sensitive balance between bias-variance tradeoffs and the time it takes to configure the model, and it may easily fail to give accurate forecasts. This was shown in the cross-validation plan for Uppsala, where the out-of-sample predictions exhibited high bias and low variance on the fourth sliding window. Having no assumptions about the data is a big advantage of the LSTM model over the SARIMA model, but it also makes the LSTM more vulnerable to difficult data. Our data preprocessing might not have been enough to curtail difficulties with the data, with reduced forecast accuracy as a consequence. Another reason for the lower accuracy may be the choice of hyperparameters: it is likely that the hyperparameters chosen by the random search are not the best possible, all the more so since manually chosen parameters were the best parameters found. As in the SVM case, a better method for finding hyperparameters would likely yield higher forecast accuracy. The model might also be unable to detect strong seasonal patterns or handle heavy noise, or it may need statistical assumptions about the data before being configured. Comparing the LSTM to a classical time series model on a small dataset shows how robust the SARIMA is: the LSTM is beaten on all three metrics, RMSE, MAPE, and SMAPE. The SARIMA model was also far less labor-intensive to construct, though this is probably in part due to our previous lack of experience with machine learning methods. The best machine learning method used in this paper was thus not better than the SARIMA model. This contradicts the results found in the study by Yu et al. (2018), whose best LSTM model had a MAPE of 2% while their seasonal ARMA had a MAPE of 3.1%; this paper shows a seasonal ARIMA with a percentage error of 2.70% (MAPE) and 2.68% (SMAPE) beating an LSTM with a percentage error of 4.29% (MAPE) and 4.18% (SMAPE). The difference between this paper and that study seems to depend on which ARIMA model is employed for comparison, the structure of the data, and the model-building approach. This does not, however, imply that the LSTM can be disregarded. It has shown itself to be a valid model under no assumptions, but it might need further research to make it more accurate for forecasting time series.


7 References

Ahmed, N. K., Atiya, A. F., Gayar, N. E., El-Shishiny, H. (2010). An empirical comparison of machine learning models for time series forecasting. Econometric Reviews, 29(5), 594-621.

Alpaydin, E. (2010). Introduction to machine learning (2nd ed.). Cambridge, Mass: MIT Press.

Amiri, S., von Rosen, D., Zwanzig, S. (2009). The SVM approach for B.J models. Revstat-Statistical Journal, 7(1), 23-36.

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., Jenssen, R. (2017). Recurrent neural networks for short-term load forecasting: An overview and comparative analysis. New York: Springer.

Bontempi, G., Taieb, S. B., Borgne, Y. (2013). Machine learning strategies for time series forecasting. Business Intelligence, 62-77.

Cryer, J. D., Chan, K. (2008). Time series analysis: With applications in R (2nd ed.). New York: Springer.

Flach, P. (2012). Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.

Ghatak, A. (2017). Machine learning with R. Singapore: Springer Singapore.

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Heidelberg; New York: Springer.

Hao, W., Yu, S. (2006). Support vector regression for financial time series forecasting. In: Wang, K., Kovacs, G. L., Wozny, M., Fang, M. (eds) Knowledge Enterprise: Intelligent Strategies in Product Design, Manufacturing, and Management. PROLAMAT 2006. IFIP International Federation for Information Processing, vol 207. Springer, Boston, MA.


Hastie, T., James, G., Witten, D., Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. New York, NY: Springer.

Hastie, T., Tibshirani, R., Friedman, J. (2017). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., corrected at 12th printing). New York: Springer.

Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

Hyndman, R. J., Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.

Hyndman, R. J., Athanasopoulos, G. (2018). Forecasting: principles and practice (2nd ed.). OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 15 April 2019.

Kaastra, I., Boyd, M. (1996). Designing a neural network for forecasting financial and economic time series. Neurocomputing, 10(3), 215-236.

Khan, M., Jan, B., Farman, H. (2018). Deep learning: Convergence to big data analytics. Singapore: Springer.

Kohzadi, N., Boyd, M. S., Kermanshahi, B., Kaastra, I. (1996). A comparison of artificial neural network and time series models for forecasting commodity prices. Neurocomputing, 10(2), 169-181.

Levis, A. A., Papageorgiou, L. G. (2005). Customer demand forecasting via support vector regression analysis. Chemical Engineering Research and Design, 83(8), 1009-1018.

Maini, V., Sabri, S. (2017). Machine Learning for Humans. Accessed on 15 April 2019 from: https://everythingcomputerscience.com/books/Machine%20Learning%20for%20Humans.pdf

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Okasha, M. (2014). Using support vector machines in financial time series forecasting. International Journal of Statistics and Applications, 4, 28-39.

Pal, A., Prakash, P. K. S. (2017). Practical Time Series Analysis. Birmingham: Packt Publishing.


Rüping, S. (2001). SVM kernels for time series analysis. Technical Report, SFB 475: Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund, 43.

Shalev-Shwartz, S., Ben-David, S. (2014). Understanding machine learning. West Nyack: Cambridge University Press.

Smola, A. J., Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.

Svensk Mäklarstatistik. (2019). Om statistiken. Accessed on 14 April 2019 from: https://www.maklarstatistik.se/om-oss/om-statistiken

Trafalis, T. B., Ince, H. (2000). Support vector machine for regression and applications to financial forecasting. Paper presented at the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Neural Computing: New Challenges and Perspectives for the New Millennium, 6, 348-353. doi: 10.1109/IJCNN.2000.859420

Vapnik, V., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Müller, K.-R. (1999). Using support vector machines for time series prediction. ISBN: 0-262-19416-3.

Walid, Alamsyah. (2017). Recurrent neural network for forecasting time series with long memory pattern. Journal of Physics: Conference Series, 824(1). doi: 10.1088/1742-6596/824/1/012038

Yu, L., Jiao, C., Xin, H., Wang, Y., Wang, K. (2018). Prediction on housing price based on deep learning. World Academy of Science, Engineering and Technology, International Science Index 134, International Journal of Computer, Electrical, Automation, Control and Information Engineering, 12(2), 90-99.

Zaytar, M. A., El, C. (2016). Sequence to sequence weather forecasting with long short-term memory recurrent neural networks. International Journal of Computer Applications, 143(11), 7-11.

Zhang, H., Gao, S., Seiler, M. J., Zhang, Y. (2015). Identification of real estate cycles in China based on artificial neural networks. Journal of Real Estate Literature, 23(1), 67-84.


Appendices

A Figures

Figure A.1: House prices of the larger Stockholm area and Uppsala from January 2005 till April 2019. Source: Svensk Mäklarstatistik.


Figure A.2: Autocorrelation function plot of houses in the larger Stockholm area.

Figure A.3: Autocorrelation function plot of houses in Uppsala.


Figure A.4: Decomposition plot for the larger Stockholm area.

Figure A.5: Decomposition plot for Uppsala.


Figure A.6: Cross-validation out-of-sample predictions for the larger Stockholm area.

Figure A.7: Cross-validation out-of-sample predictions for Uppsala.


Figure A.8: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of the larger Stockholm area.

Figure A.9: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of Uppsala.


Figure A.10: ACF plot of the final SARIMA model for the larger Stockholm area.

Figure A.11: PACF plot of the final SARIMA model for the larger Stockholm area.


Figure A.12: ACF plot of the final SARIMA model for Uppsala.

Figure A.13: PACF plot of the final SARIMA model for Uppsala.


B Kernels

Linear kernel:
$$K(x_i, x) = x_i^{T} x$$

Radial-basis kernel:
$$K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^{2}\right)$$

Sigmoid kernel:
$$K(x_i, x) = \tanh\left(\gamma\, x_i^{T} x + \mathrm{coef}\right)$$

Polynomial kernel:
$$K(x_i, x) = \left(\gamma\, x_i^{T} x + \mathrm{coef}\right)^{d}$$
