
Aalto University

School of Science

Master’s Programme in Machine Learning, Data Science and Artificial Intelligence

Maxim Afteniy

Predicting time series with Transformer

Master's Thesis
Espoo, April 10, 2021

Supervisors: Assistant Professor Alexander Jung, Aalto University
Advisor: Aditya Kaushik M.Sc. (Tech.)


Aalto University
School of Science
Master's Programme in Machine Learning, Data Science and Artificial Intelligence

ABSTRACT OF MASTER'S THESIS

Author: Maxim Afteniy

Title: Predicting time series with Transformer

Date: April 10, 2021 Pages: 44

Major: Machine Learning, Data Science and Artificial Intelligence

Code: SCI3044

Supervisors: Assistant Professor Alexander Jung

Advisor: Aditya Kaushik M.Sc. (Tech.)

In this thesis, I investigated the problem of forecasting non-stationary, high-frequency Bitcoin market price data with current state-of-the-art Deep Learning models.

The study's objective was to apply the Transformer model and compare its applicability with a popular LSTM approach on high-frequency time series data. Upon identifying the difficulty of making regressive predictions, I modified the predictive model to output binned targets, which are particularly beneficial for time series problems exhibiting non-stationarity.

The major contribution of the thesis is showing that it is possible to predict the largely unobservable dynamics of the Bitcoin market data.

In summary, the results are promising, and further directions include the use of a much larger look-back history window and external data resources such as news corpora.

Keywords: Deep Learning, Time series, Machine Learning, Transformer

Language: English

2

Page 3: Predicting time series with Transformer

Aalto University
School of Science
Master's Programme in Computer, Communication and Information Sciences

ABSTRACT OF MASTER'S THESIS

Author: Maxim Afteniy

Title: Predicting time series with a Transformer

Date: April 10, 2021 Pages: 44

Major: Machine Learning, Data Science and Artificial Intelligence

Code: SCI3044

Supervisors: Assistant Professor Alexander Jung

Advisor: Aditya Kaushik M.Sc. (Tech.)

In this thesis, I studied the predictability of the Bitcoin cryptocurrency for ultra-high-frequency trading purposes using state-of-the-art Deep Learning models. The objective of the study was to apply the Transformer model and evaluate it by comparing it against a recurrent neural network. The Transformer has shown promising results in language modeling, but its evaluation on time series problems has remained very limited.

I first observed that making regressive predictions is very difficult; based on this observation, I modified the predictive model to produce discrete predictions, which is particularly beneficial for time series problems with high volatility.

The main finding of the thesis is that it is indeed possible to predict Bitcoin's dynamics in an ultra-high-frequency trading setting despite the high volatility.

In summary, the results are promising; future work could investigate a much larger window and the inclusion of external data resources.

Keywords: Machine Learning, Data Analysis, Artificial Intelligence, Transformer, Neural Network

Language: English


Acknowledgements

I would like to thank everyone who supported me during my thesis, especially my mom and my motivating friends. Thanks to Aditya Kaushik and Alexander Jung for the productive discussions that helped move the thesis forward.

Espoo, April 10, 2021

Maxim Afteniy


Glossary

Sequence Modeling: An approach that learns to forecast sequence elements by capturing temporal dependencies
Explanatory variable: A variable that explains variation in the response variable
Response variable: A variable whose variation might be explained by explanatory variables
Level: The value which is computed along with the predictive window
Hyperparameters: High-level parameters that identify the predictive model
Fitted value: A one-step forecast for the given data
NLP: Natural Language Processing; usually it is assumed that NLP models the conditional distribution of the next token given the preceding tokens (token = word)
Outlier: A datapoint or feature of interest that differs significantly from other observations; usually the difference is unexplained
Datapoint: An observation which is usually explained by a set of features
Feature Map: A mapping function of a label given a datapoint
Feature: A numerical value that explains the corresponding datapoint
Label: A corresponding ground-truth value related to the datapoint
Dataset: A data representation consisting of datapoint and label pairs
Kernel: A sliding window of a Convolutional Neural Network contributing to its corresponding feature map values
Volume: The amount of the item of interest that changed hands in a particular time window


Sequence element: An ordered element in a sequence, usually referred to by its index t


Contents

Glossary

1 Introduction

2 Data and Problem statement
  2.1 Problem statement
  2.2 Description, source, and frequency
  2.3 Data behavior
  2.4 Feature Engineering
    2.4.1 Exponential Smoothing features
    2.4.2 Econometric features
  2.5 Data normalization
  2.6 From regression to classification

3 Methods
  3.1 Methods
    3.1.1 Recurrent Neural Networks
  3.2 LSTM to improve gating
  3.3 LSTM in forecasting
  3.4 Attention based models
    3.4.1 Attention mechanism
    3.4.2 Multi-Head Self Attention
    3.4.3 Positional Encoding
    3.4.4 Masking
  3.5 Implementation
    3.5.1 Feature generation
    3.5.2 Data Loading pipeline
    3.5.3 K-Fold Cross Validation
    3.5.4 Binning

4 Evaluation
  4.1 Random target class sampling as a baseline model
  4.2 LSTM and Transformer results
  4.3 Final evaluation over K-folds

5 Conclusions and outlook
  5.1 Relation to efficient market theory


Chapter 1

Introduction

In recent years, advances in Deep Learning and Natural Language Processing (NLP) have led to cutting-edge performance in language modeling tasks. Additionally, models that are mainly applied in NLP have shown promising results in time-series (TS) forecasting tasks [27], [7].

Sequence modeling is the process through which a model learns to capture dependencies across time. Extensive Deep Learning research in recent years has resulted in better sequence modeling techniques for the NLP domain. A frequent task in NLP is to predict a particular word in the dictionary given its preceding words. This forecasting ability is beneficial because it translates to other problems with time-dependent data, such as stock time series (TS) data. To apply an NLP algorithm to TS forecasting, only a minor change to the architecture is needed. Previously successful Recurrent Neural Network (RNN) [15] based models were particularly inaccurate at modeling longer sequences, as they had to summarize the preceding sequence elements into a single vector. Recent advances in attention-based predictive modeling, especially the "Attention is all you need" (Transformer) models [28], provided enormous improvements to the NLP language modeling task and achieved state-of-the-art accuracy. Despite modeling longer sequences more accurately than RNNs, vanilla Transformer-based models have some limitations: they are memory inefficient, and they become extremely slow when modeling longer sequences. The Transformer nevertheless has potential, as recent research [31] [20] [19] has made progress in mitigating these efficiency problems.

Despite the success of these novel advancements in language modeling, it is still not evident whether the above Transformer methods can outperform existing state-of-the-art time-series models, and little research has been done on the applicability of the Transformer to forecasting noisy market time-series data.

Moreover, past Bitcoin forecasting works [30], [3], [12] have used only a small amount of TS data without evaluating models on high-frequency datasets.

Therefore, the objective of this thesis is to evaluate the Transformer model in the TS setting and compare its performance to the existing state-of-the-art time series LSTM model on Bitcoin market data.

Structure of the thesis

First, in the chapter "Data and Problem statement", I analyze the dataset of Bitcoin market data. The analysis covers the stationarity of the data, its behavior, and possible enhancements. The second chapter, "Methods", goes through the data loading pipeline and focuses on the explanation of the models and the various design choices used in the Evaluation chapter. In the "Evaluation" chapter I go through the evaluation and analysis of the results of the proposed Transformer model, comparing its performance against the LSTM model and a simpler baseline model. Lastly, the "Conclusions and outlook" chapter summarizes the results and gives final remarks regarding the work.


Chapter 2

Data and Problem statement

In this chapter, I go through the problem statement and the data pre-processing and normalization techniques that have to be taken into account when handling high-frequency Bitcoin data. Additionally, I analyze what additional features can be included in the models in order to make training more feasible.

2.1 Problem statement

Time series differ substantially from the other data representation formats usually used in Machine Learning. They introduce the time domain and require samples to be measured at predefined time intervals. The time-step parameter t describes time and gives information about the sequential order of events. For example, if our data represents information in one-year intervals starting from 1970, the first explanatory variable y_0 indicates the information about the year 1970. The representation of timed events can be univariate, also known as a univariate time series, which captures information about only one variable at each time step. Naturally, there are also multivariate TS, which describe the changes of the variable of interest with multiple variables. For example, a multivariate time series model for the precipitation levels of one city may also incorporate information about the past precipitation levels of the neighboring cities.

The goal of any TS model is to predict the "future" conditioned on past observations. For the sake of simplicity, the following example shows how the notation is used for univariate point forecasting; the notation is taken from [23]. In TS prediction we have a predictive horizon denoted by h and an observed horizon of length t: given the observed values $y_{T-t-1}, \ldots, y_T$, the task is to predict the response variable values at time steps $T+1, \ldots, T+h$. The predicted values are denoted by $\hat{y}_{T+1}, \ldots, \hat{y}_{T+h}$. In this thesis, I use the notation $y$ for observed target values and $\hat{y}$ for predicted values. Since most time series models are unable to predict the whole horizon h at once, autoregressive prediction is required, which predicts only one value at a time, reusing past forecasted values as input.
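As a minimal illustration of this autoregressive loop (not the thesis implementation; `model` stands for any one-step forecaster and all names are placeholders):

```python
import numpy as np

def autoregressive_forecast(model, history: np.ndarray, horizon: int) -> np.ndarray:
    """Forecast `horizon` steps by predicting one value at a time and
    feeding each prediction back into the input window."""
    window = list(history)
    predictions = []
    for _ in range(horizon):
        y_hat = model(np.asarray(window))      # one-step forecast from the current window
        predictions.append(y_hat)
        window = window[1:] + [y_hat]          # slide the window forward with the forecast
    return np.asarray(predictions)
```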

2.2 Description, source, and frequency

Bitcoin (BTC) is the first and most popular cryptocurrency. It was developed in 2009, after the financial crisis, by an unknown group of people, and it is designed to be more robust than traditional currencies like USD and EUR. One benefit of the currency is that only a finite amount of it is available (obtaining a coin through mining becomes harder and harder). Additionally, it is nearly impossible to falsify transactions between users, as the currency was by design made resistant to fraud and scams.

Bitcoin opens opportunities for forecasting the price: the data is very accessible and hosted by a large cryptocurrency trading platform known as Binance 1. Due to the very small amounts of data used in previous time-series research, my obvious choice was to model the time series at a higher sampling frequency. For this, I chose minutely Bitcoin data. Additionally, the higher sampling frequency can be converted back to less informative lower-frequency data by aggregation. I believe there could be further benefits if the data were sampled at 5- or 1-second intervals, but this is not open-sourced by Binance and would require its own implementation of data gathering and saving to a database. This might be beneficial in the future if one wants to implement an own high-frequency trading (HFT) algorithm.

For accurate forecasts it is essential to know what phenomenon is causing the price changes. In the case of Bitcoin the price is influenced by thousands of active users, and therefore the influence on the movement is not directly observable to the model. However, my hypothesis is that the local momentum in the data can be captured. The demand might be described by the rises and falls seen directly in the price TS, which can additionally be weighted by volume.

1 Binance: a cryptocurrency trading platform.


2.3 Data behavior

The data values are ordered by timestamp, ranging from 17-08-2017 to 08-01-2021, and are obtained in CSV format. A sample of the data is shown in Figure 2.1 and the whole time series is shown in Figure 2.2. I use only open-source data in the thesis and make predictions only from a predefined number of past observations obtained from Binance at a minutely sampling frequency.

timestamp    open      high      low       close     volume
17-08-2017   4261.48   4261.48   4261.48   4261.48   1.775183
17-08-2017   4261.48   4261.48   4261.48   4261.48   0.000000
17-08-2017   4280.56   4280.56   4280.56   4280.56   0.261074
17-08-2017   4261.48   4261.48   4261.48   4261.48   0.012008
17-08-2017   4261.48   4261.48   4261.48   4261.48   0.140796

Figure 2.1: An illustration of the minutely BTC multivariate TS data represented in the CSV format. Each row contains the information of one 1-minute interval; some features are left out of this figure for the sake of simplicity.


Figure 2.2: The TS dynamics of the BTC (close) price. The series appears to be highly non-stationary with non-linear dynamics.

Technical indicators of the price changes

An Open-high-low-close (OHLC) chart [21] is a set of technical indicators over time that describe the movements of the price in a given time interval. The indicators are quite simple: the open and close indicators show the starting and ending price for a given interval, and the high and low indicators give the highest and lowest prices for that interval. The difference between high and low is usually very useful, as it shows momentum and volatility information to the chart viewer [21]. In the case of my dataset (see Figure 2.1) the OHLC chart information is represented in a multivariate TS manner, additionally including the volume information indicating the amount of BTC traded in the given interval of time. Additional technical components that were not included in Figure 2.1 but are present in the data are the trades. Trades is a discrete number indicating how many trades happened, while volume simply indicates how much Bitcoin was traded in a specific time interval. These indicators give great opportunities for engineering additional features such as the Volume Weighted Average Price, volatility, volume per trade, and other econometric indicators. An illustration of the BTC minutely data is shown in the histogram plot in Figure 2.3.

Figure 2.3: Histograms of the time series, showing the features present at each TS data timestep.

2.4 Feature Engineering

Feature Engineering (FE) is a way of making the features more meaningful for the model. Even though Deep Learning on its own learns to extract the relevant features, it is still dependent on the data input, and in some cases it is impossible or infeasible to extract useful features from the data. One such example is when the added features cannot be inferred from the basic input window in TS analysis. Such features are all kinds of moving-average or Exponential Smoothing (ES) values [17, Chapter 7.1] that average across a much longer window than the model sees. This approximation of past values is intractable for a model that can see only part of the input window of the TS at steps $T-t-1, \ldots, T$.

2.4.1 Exponential Smoothing features

To model dependencies of the time series across different time scales, Simple Exponential Smoothing can be used, and it has to be applied to the data only once. This gives the model a view of the TS at different time scales, providing an abstract view of the data and how it changes. For a more detailed description of how ES is applied to the Bitcoin TS, refer to the ES-related section in the Methods chapter (Chapter 3). The ES features that model smoothing at different scales are shown in Figure 2.4.

Figure 2.4: The close column of the data is fit with multiple different Simple Exponential Smoothing models, each with a single hyperparameter α.


2.4.2 Econometric features

In the stock market there are additional technical indicators, such as the Volume Weighted Average Price (VWAP) [9]. It is a strong indicator for day traders to see how the price changes over time while taking into account the volume of the traded product. Computing the VWAP over the time series requires the traded volume and the price. In our case, since the exact price is not known, the average price is taken by averaging the high and low price of the time period. The VWAP is computed as shown in Equation 2.1.

$$\mathrm{VWAP} = \frac{\sum \text{Price} \cdot \text{Volume}}{\sum \text{Volume}} \tag{2.1}$$

Equation 2.1 weights each price by its traded volume and divides by the total volume traded. The VWAP is computed over a certain period of time. In intraday trading (when trading starts at the beginning of the day) the VWAP is computed for each day separately. In our case, since the data is continuous, a rolling-window approach is applied, and the resulting VWAP varies with the chosen sliding-window length. Similarly to Section 2.4.1, the VWAP can be used to create multiple new features by varying the sliding-window length, but in the experiments only one additional feature is used, as shown in Figure 2.5.


Figure 2.5: An example of how the continuous VWAP is applied to the data, requiring the high, low and volume features. A sliding window over a predefined time frame is used; in our case the window is set to 3600.
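A minimal pandas sketch of such a rolling VWAP, assuming a DataFrame `df` with `high`, `low` and `volume` columns (the names and window length are illustrative, not the exact thesis code):

```python
import pandas as pd

def rolling_vwap(df: pd.DataFrame, window: int = 3600) -> pd.Series:
    """Continuous (rolling-window) VWAP of Equation 2.1, using (high + low) / 2 as the price."""
    price = (df["high"] + df["low"]) / 2.0
    traded_value = (price * df["volume"]).rolling(window).sum()
    return traded_value / df["volume"].rolling(window).sum()

# Usage: df["vwap_3600"] = rolling_vwap(df)
```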

2.5 Data normalization

Data normalization or standardization is a crucial component in Deep Learning [10] [18]; it makes the model less sensitive to learning-rate choices and makes the loss surface smoother [18]. Naturally, having the inputs to the model come from a fixed distribution, preferably with zero mean and unit variance for each consecutive layer, brings great benefits to the training process [18]. Since our future model will use past observations for the prediction task, the model will compute the statistics over some predefined input window for each feature. An evident example of the non-stationarity of the raw data is shown in Figure 2.6: the data exhibits a highly varying mean and standard deviation (volatility). To make the forecasting task easier for the TS model, the data usually has to be made stationary (constant mean) and has to be properly normalized for the model input. The problem in the TS setting is that the input and output windows might have different means and standard deviations, and thus re-normalization of the predictions becomes non-trivial. Approaches like [25], [22] have researched handling non-stationary time series with custom data normalization techniques applied to the time window of interest. Additionally, the ES-RNN approach applies a normalization to the input window that divides the values by the level values and reduces the effect of outliers by transforming the data to logarithmic space.

Figure 2.6: Non-stationary behavior of the close price of the BTC data. The number of time steps shown in the figure is N = 1000. Each shaded area corresponds to a distinct time window over which the statistics are computed. The mean is denoted by µ and the standard deviation by σ.

Standardization (z-score normalization) is chosen for normalizing the time-series input windows; the formula for normalizing a vector x is shown in Equation 2.2.


$$\tilde{x}_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x) + \epsilon} \tag{2.2}$$
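A short NumPy sketch of Equation 2.2 applied per feature of an input window (names are illustrative):

```python
import numpy as np

def standardize_window(window: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Apply Equation 2.2 per feature to an input window of shape (time_steps, features)."""
    return (window - window.mean(axis=0)) / (window.std(axis=0) + eps)
```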

2.6 From regression to classification

Due to the problems discussed in Section 2.5, the predictions shown in green in Figure 2.6 have to be normalized by the statistics of the input window, which becomes problematic as the normalization assumes predefined bounds for the values. Additionally, the BTC data may simply not be predictable from the input history alone, so making regressive predictions may not be a good idea. The benefits of a classification model over a regression model are: (1) the re-normalization of predictions is no longer a problem; (2) the model may give a rough approximation of how the price will change (in the simplest case, whether the price goes up or down); (3) the model has an easier task to learn and predict; (4) there is not much information gain from knowing absolute values; (5) predicting accurate values is impossible with only partial observability of the stock market dynamics. Based on these reasons, the models used in the experiments are used in a classification setup. The implementation details of the method are given in the next chapter.


Chapter 3

Methods

This chapter will delve into the methods used in this thesis.

Exponential Smoothing for feature generation

Statistical "classic" models are powerful, simple, non-Deep Learning models which have shown great results in econometric and other time series forecasting tasks; they can model seasonality and trends adequately. In this section, I qualitatively review the Exponential Smoothing [4] method that has been used to generate new data features for each unique time step.

Exponential smoothing

Consider a simple baseline model whose predictions are the averages of the past k input elements. However, we often want to give prioritized weight to the most recent values. Simple Exponential Smoothing (SES) does exactly that by giving the most dominant weight to the closest values in history, exponentially decaying the weight, or importance, towards earlier values. The speed of the exponential decay is controlled by the single hyperparameter α. Exponential Smoothing works exactly according to the formulation in [17, Chapter 7.1]. Simple Exponential Smoothing is illustrated in Equation 3.1.

SES can be considered a transformation of a time series into a smoothed variant, parametrized by l_0 and α. Each value y_t is transformed into a smoothed value $\hat{y}_t$ after Exponential Smoothing:

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t, \qquad 0 \le \alpha \le 1 \tag{3.1}$$


Since the predicted value $\hat{y}$ depends on past values, in the case t = 0 the value $\hat{y}_0 = l_0$, which is the second hyperparameter. Simple Exponential Smoothing thus requires only two hyperparameters to make predictions, which makes it very easy to find an optimal set of parameters l_0 and α with gradient descent.

Component form

Additionally, the smoothed values can be represented in the component form, where each component is modeled separately; in our case it is useful and convenient to use a single level component.

$$\hat{y}_{t+1} = l_t, \qquad l_t = \alpha y_t + (1 - \alpha) l_{t-1} \tag{3.2}$$

Just like the level component modeled in Equation 3.2, the method can be extended to additionally model seasonality and trend [17, Chapter 5] using Holt-Winters' method. However, the dataset studied in this thesis exhibits a highly non-stationary trend and unknown non-linear seasonality, and for that reason a review of Holt-Winters' method is out of the scope of this research. As briefly mentioned in Chapter 2 and shown in Figure 2.4, I first find the optimal SES parameters with an optimization strategy such as gradient descent. In the next step, after obtaining the optimal α, fractions of it are used to generate additional, smoother features. The other smoothing parameters are generated directly from the optimal α as {α/2, . . . , α/6}. The lower the α, the higher the importance placed on past values, as shown in Equation 3.2. This strategy yields a nice set of features that can be extracted only once before the actual training. It also helps with noise removal, which in our case is beneficial due to the chaotic nature of the TS.
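A small sketch of this feature-generation step, assuming the base α has already been found by a separate optimization (column and function names are illustrative; pandas' `ewm(adjust=False)` applies exactly the recursion of Equation 3.2):

```python
import pandas as pd

def add_ses_features(df: pd.DataFrame, base_alpha: float, column: str = "close") -> pd.DataFrame:
    """Append SES features at several smoothing scales derived from a base alpha."""
    out = df.copy()
    for divisor in range(2, 7):                      # alpha/2 ... alpha/6
        alpha = base_alpha / divisor
        # adjust=False gives the recursion l_t = alpha * y_t + (1 - alpha) * l_{t-1}
        out[f"ses_{column}_div{divisor}"] = out[column].ewm(alpha=alpha, adjust=False).mean()
    return out
```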

3.1 Methods

3.1.1 Recurrent Neural Networks

One problem becomes evident when the input datapoint consists of a sequence of varying length: training a model with a fixed input dimension becomes problematic. To solve such problems, the Recurrent Neural Network (RNN) [29] [16] approach was introduced. Consider sequence elements $(x_1, x_2, x_3, \ldots, x_K)$ that form the input sequence for each datapoint, where the parameter K can vary. The RNN usually processes each element one at a time and outputs a hidden value for each input element x_i, also referred to as a hidden state. The hidden state's task is to represent all the information of the previous elements and to propagate that information to the following inputs. An illustrated example of an RNN is presented in Figure 3.1.

RNNs are extensively used in NLP tasks for neural machine translation. Translation models include an "Encoder", which encodes the sequence of words into a single context vector, and a "Decoder", which transforms that context vector back into a sequence in the target language. In this way, the Encoder-Decoder models the target language words based on the actual context or "meaning" of the encoded words.

Since there is only one hidden state h_i that the RNN has to keep in order to process the next input element x_{i+1}, the model can process input sequences of arbitrary length while preserving the "memory" in a single vector.

While this technique appears very promising and flexible for sequence inputs, it has a couple of limitations. Firstly, the context of the RNN has to be kept in a single finite-length vector, which becomes problematic if the past information to remember becomes infeasibly large to summarize in a single vector. Moreover, RNN training exhibits vanishing and exploding gradient problems [24]. These problems are directly tied to the gradient signal propagating through all of the processed sequence elements of the RNN. To improve the behavior of the gradients during RNN training, past research introduced techniques such as Long Short-Term Memory (LSTM) [16] and the Gated Recurrent Unit (GRU) [7]. Additionally, RNNs can be enhanced with dilated connections across time, improving stability on longer sequences [6]. There exist many further improvements, such as an improved gating mechanism [11], that drastically improve the remembering capability of the LSTM.


Figure 3.1: An illustrated example of how a sequence of elements x = {x_1, . . . , x_5} is processed by an RNN and summarized into a single hidden state vector h_5. In the case of the BTC data, the elements x_i correspond to multivariate TS points at timestep i.

3.2 LSTM to improve gating

The key aspects and drawbacks mentioned here are directly tied to the nature of the recurrence introduced in RNNs. Recall the basic RNN architecture in Figure 3.1: besides the hidden representation h_i, the LSTM introduces a cell state component c_i; see Figure 3.2 for details. Incorporating the cell state and a forget gate further improves the gradient flow in the network, allowing it to dynamically discard irrelevant information and partially mitigating the gradient problems exhibited by the basic RNN.

The input sequence (in our case a time series) can be passed to the LSTM, which encodes the sequence consecutively into the memory vector h and the state vector c. The individual input time-series elements do not depend on the time parameter t; rather, the context and state vectors define, or "summarize", the preceding elements. An abstract computational graph of an atomic LSTM cell, which produces the next states from the input sequence element x_i and the previous states, is shown in Figure 3.2. Moreover, the cells can be stacked vertically to increase the complexity of the representation, although there is again a trade-off with very long computational graphs. For more in-depth information about the computation, see the PyTorch implementation of the LSTM cell 1. This cell operation is performed on each sequence element.

1 LSTM cell, as implemented by the open-source Deep Learning library PyTorch.


3.3 LSTM in forecasting

In contrast to the "Encoder-Decoder" architecture, in the implementation I use only the Encoder part, because there is no need to extract the actual sentiment of a language sequence. The only useful information encoded into the hidden representation by the LSTM is a meaningful representation of the past values, which might or might not be useful with respect to the optimized loss function. The LSTM encodes a subset of the source sequence of length n ≤ I into the hidden representation h_t, and afterwards the task is to predict the discretized values corresponding to the TS values in the output window. Further details on the selection of the labels from an output window are given in Section 3.5.4.

Figure 3.2: Abstract computational graph showing how each input element in the sequence x_1, . . . , x_N is processed in an LSTM cell. The cell keeps the context information in the cell states c_i and hidden states h_i.
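A minimal PyTorch sketch of such an encoder-only LSTM classifier (layer sizes and names are illustrative assumptions, not the exact architecture used in the experiments):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Encode a multivariate input window with an LSTM and map the last hidden state to class logits."""
    def __init__(self, n_features: int, hidden_size: int = 128, n_layers: int = 2, n_classes: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features); h_n[-1] summarizes the whole window
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

# Example: 8 windows of length 1000 with 10 features, 4 binned classes
logits = LSTMClassifier(n_features=10)(torch.randn(8, 1000, 10))
```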

3.4 Attention based models

Attention is a recent Deep Learning approach that implements the corresponding biological concept of "attention". Humans and other animals usually use attention to filter out irrelevant information during any activity. Similarly, the Attention-related contributions in Deep Learning research [2] [28] try to feasibly model an attention mechanism that is parametrized by the weights of the neural network. The prior works on attention turned out to be a perfect fit for the NLP domain, where there were evident problems with LSTM training on longer sequences. Attention first appeared in the work on bidirectional LSTMs [2], which computed attention scores based on the predicted word and the context [2]. This technique enhanced the existing state-of-the-art RNN-based encoder-decoder approach with Attention. However, the method still relied on the RNN architecture, which has the drawback of the gradient flowing through time, which is not optimal when handling larger sequences. The next major Attention-based contribution, which completely got rid of the RNN-based architecture, was the architecture called the Transformer [28]. This approach became the state of the art for NLP, and other currently well-known models such as BERT [8], GPT-2 [26] and GPT-3 [5] are directly extended from the Transformer.

3.4.1 Attention mechanism

The original Transformer contains both an Encoder and a Decoder, but for the TS forecasting task only the Encoder is needed. The motivation for using only the Encoder is similar to that in the previous LSTM section: there is no need for language-agnostic "sentiment" extraction. The Encoder-only Transformer architecture is shown in Figure 3.3.

The Transformer introduced new concepts to the Deep Learning community, namely Positional Encoding and Multi-Head Attention (MHA). Other relevant concepts reused from past Deep Learning research are residual connections [14] and Layer Normalization [1]. Positional Encoding is an essential concept and is needed to encode the ordering into the representation, as the MHA computes attention in an unordered manner.

3.4.2 Multi-Head Self Attention

To explain the concept of Multi-Head Attention, I will first go through a high-level description and afterwards a more formal mathematical representation.

Multi-Head Self Attention is the central contribution of the Transformer. As can be seen from the Encoder architecture in Figure 3.3, if the MHA block is removed from the module, it becomes a fully-connected neural network with residual connections. So why is MHA so powerful?


Figure 3.3: The abstract computational graph of the Transformer used in this thesis. The inputs correspond to the sequence input elements; N stands for the number of repeated layers.

Consider a simple example of input sequence elements x_1, . . . , x_5. The MHA module takes that sequence as input and returns an output sequence $\hat{x}_1, \ldots, \hat{x}_5$ in which each element is transformed to be aware of its surrounding sequence. This is similar to the LSTM case, where the hidden state vector carries the past information of the sequence, but in the Attention case it is a representation over all of the sequence elements. In the MHA, each sequence entry has information about all the elements given as input to the MHA: each element $\hat{x}_i$ has a representation that includes information from the other sequence elements $x_j$, $j \neq i$. Abstractly speaking, in the Transformer model the "attention" for a sequence element x_i is represented via a normalized categorical distribution over the input sequence. This is a very useful property, because at test time the model outputs can be inspected to show which sequence elements contribute the most to the prediction for a particular element.

MHA is modeled as a weighted product of projected "Values" and a probability distribution. The probability is modeled by applying a softmax to the inferred weights, yielding attention scores. The MHA in the Transformer layer is formulated as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{F}} + \mathrm{MASK}\right) V \tag{3.3}$$

Query, Key, and Value (Q, K, V) are projections of the input elements. Consider a time window of length N; the input elements with embedding dimension D form $X \in \mathbb{R}^{N \times D}$. The masking technique is explained in Section 3.4.4. To acquire the corresponding inputs for the MHA, the elements are projected with vectorized matrix multiplications (or, equivalently, feed-forward layers):

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \qquad W_Q \in \mathbb{R}^{D \times F}, \; W_K \in \mathbb{R}^{D \times F}, \; W_V \in \mathbb{R}^{D \times M}$$

As can be seen from Equation 3.3, the multivariate time series window X is projected (after positional encoding) to the corresponding Q, K, V values, after which the attention scores are computed. Multi-Headed Attention stands for computing the attention scores in parallel with different heads, which are afterwards concatenated and further projected back to the original space. An illustration of Multi-Headed Attention and the Scaled Dot-Product Attention is shown in Figure 3.4.


Figure 3.4: The Attention mechanism, also represented in Equation 3.3 (on the left), and the Multi-Headed Attention (on the right). The MHA consists of multiple parallel computations of the Attention operation explained in Equation 3.3.
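A small PyTorch sketch of the scaled dot-product attention of Equation 3.3 (a simplified single-head version; function and variable names are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(F) + MASK) V, as in Equation 3.3."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores + mask            # masked positions carry -inf logits
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights           # output and the attention distribution
```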

3.4.3 Positional Encoding

As can be seen from the Transformer figure above, the model also incorporates Positional Encoding (PE). While the LSTM inherently introduces order, since each timestep is processed sequentially, the Transformer has no sequential computation and everything can be processed in parallel. The price of this parallelism is that the $QK^T$ operation in Attention loses the information about the order of the elements. For this reason, the positional encoding adds a "timestep" feature to every input element x_t. The encoding is defined in Equation 3.4:

$$PE_{(t,\,2i)} = \sin\!\left(t / 10000^{2i/F}\right), \qquad PE_{(t,\,2i+1)} = \cos\!\left(t / 10000^{2i/F}\right) \tag{3.4}$$

The PE is computed for each input vector; i denotes the dimension index and t the position in the time-series window.
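A sketch of this sinusoidal encoding (assuming an even embedding dimension; names are illustrative):

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding of Equation 3.4, returned as a (seq_len, d_model) tensor."""
    t = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # positions t
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimension indices 2i
    angles = t / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```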


3.4.4 Masking

In some applications the future elements of the sequence must not be attended to, and there is a simple trick to incorporate this into the Attention of Equation 3.3. "No attention" means that the attention weight, i.e. the softmax probability, equals zero. To obtain zeros in the softmax output, logits of −∞ are assigned to the masked positions. This is equivalent to adding a 2D matrix to the logits; the matrix used in our case is shown in Figure 3.5.

Figure 3.5: The masking matrix added to the logits of the Attention computation described in Figure 3.4. The green color indicates values of zero and the gray color −∞. In our case, the TS values at time t correspond to rows, and the columns indicate whether attention can be applied to the element at that column index.
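As a rough illustration, such an additive causal mask can be built as follows (a sketch, not the thesis code):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive mask as in Figure 3.5: 0 where attention is allowed, -inf strictly above the diagonal."""
    full = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(full, diagonal=1)
```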

3.5 Implementation

For clarification purposes, the whole pipeline is shown in Figure 3.6.


Figure 3.6: Overview of how the whole training pipeline is implemented. The whole series is processed before batching and is served by a custom DataLoader during training. Each batch element is a sliding window; the input is normalized, the target is binned, and the resulting metrics are computed with the Cross-Entropy Loss.

3.5.1 Feature generation

The new features are generated by applying transformation functions to the whole TS variable of interest, namely the close price of the BTC dataset. The Exponential Smoothing is fit on the whole TS, and the new SES features are then generated by smoothing the close column with different smoothing parameters. Additionally, the Volume Weighted Average Price is fit on the same variable of interest; but since the classical VWAP assumes day trading and is used exclusively on single-day price changes, our version uses a moving average over the whole series. The moving-average VWAP removes the fixed-day-trading assumption, which is justified especially for such an international currency as Bitcoin.

3.5.2 Data Loading pipeline

The length of the time series is in our case N = 1778411. The input and target "window" notion is illustrated in Figure 3.9. Data loading is performed in the usual windowing way: consider a dataset of size N formulated as $\mathcal{D} = \{(X_1, Y_1), \ldots, (X_t, Y_t)\}$, with $X_t \in \mathbb{R}^{I \times D}$, $Y_t \in \mathbb{R}^{O \times D}$, where D is the dimension of a single multivariate TS variable. The dataset contains N − K − 1 unique windows. Each epoch is made of a set of non-overlapping windows, which is a performance-motivated design choice. To simulate all possible TS windows, the beginning index of the dataset is shifted at the start of each epoch, or alternatively a random offset is added when acquiring the window datapoint (the beginning time index of the series window). This strategy results in a reasonable data-augmentation approach for a non-overlapping set of windows.

Each pair $(X_t, Y_t)$ defines a window of length K = I + O, with $X_t = \{X_{t,1}, \ldots, X_{t,I}\}$ and $Y_t = \{Y_{t,1}, \ldots, Y_{t,O}\}$. The process of generating the discrete labels is further shown in Figure 3.8.
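A simplified PyTorch sketch of this windowed loading with a random offset (an illustration under the stated window lengths, not the exact DataLoader used in the experiments):

```python
import torch
from torch.utils.data import Dataset

class WindowDataset(Dataset):
    """Serve non-overlapping (input, target) windows with a random start offset per sample."""
    def __init__(self, series: torch.Tensor, input_len: int = 1000, output_len: int = 250):
        self.series, self.I, self.O = series, input_len, output_len
        self.K = input_len + output_len                    # total window length

    def __len__(self) -> int:
        return (len(self.series) - self.K) // self.K       # non-overlapping windows per epoch

    def __getitem__(self, idx: int):
        start = idx * self.K + torch.randint(0, self.K, (1,)).item()
        start = min(start, len(self.series) - self.K)      # keep the window inside the series
        window = self.series[start:start + self.K]
        return window[:self.I], window[self.I:]            # input window, target window
```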

3.5.3 K-Fold Cross Validation

K-Fold cross-validation (K-Fold CV) splits for time series are used to evaluate the model's final performance. Since the test-set location changes across the K folds, K-Fold CV provides a less biased estimate of the final accuracy of the model. An illustrated example of how train and test sets are created in the TS dataset setting is shown in Figure 3.7.

Figure 3.7: Example of how to perform K-Fold CV on a time series dataset. If required, the test set can be further split in a similar way into validation and test sets.
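A minimal sketch of such expanding-window splits, using the train/validation fractions reported later in the evaluation chapter (function and variable names are illustrative):

```python
import numpy as np

def expanding_window_folds(n_samples: int, train_fracs=(0.4, 0.6, 0.8), val_frac=0.2):
    """Yield (train_idx, val_idx) pairs where each fold trains on an earlier prefix
    of the series and validates on the block that follows it."""
    for train_frac in train_fracs:
        train_end = int(n_samples * train_frac)
        val_end = int(n_samples * (train_frac + val_frac))
        yield np.arange(0, train_end), np.arange(train_end, val_end)

for fold, (train_idx, val_idx) in enumerate(expanding_window_folds(1_778_411), start=1):
    print(fold, len(train_idx), len(val_idx))
```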


Figure 3.8: The process of generating discrete labels: each element in the red window is compared with the green (target) window to generate the corresponding labels for the model. The attention mask of Figure 3.5 used in the Transformer comes directly from the assumption that the elements of the input window (in red) X_{t,I−z}, . . . , X_{t,I} are not supposed to attend to future values.

Figure 3.9: An example of the windowed datapoint representation. The model's input is on the left, inside the input window, and the target values are on the right (red box). The level represents an exponential moving-average feature, which can furthermore be used for normalization, as is done in [27].


3.5.4 Binning

Recall the motivation for using a classification model mentioned in Section 2.6. The problem is relaxed by mapping the continuous outputs to a discrete space. This is done with a simple strategy known as binning, which has been used extensively in Deep Learning research. Binning maps continuous values to a discrete space by assigning input ranges for the regressed variable. Abstractly speaking, the binning tells in which percentage range the price changed from time t to t + offset. After the windowed version of the data loading is implemented, it is trivial to add a label-generation technique. In our case, the DataLoader outputs two windows, an input vector X_i and a target vector Y_i, with lengths I and O respectively. Consider that the model's task is to predict the price change at timestep t with an offset t + offset; in our case the selection of the offset is simply offset = O. The discrete target for each atomic output window is then obtained via the binning function in Equation 3.5, where for each dataset sample t the ratio is $y_i = X_{t,i} / Y_{t,i}$.

$$y_i^B = f(y_i), \quad y_i \in \mathbb{R}, \qquad
f(y_i) =
\begin{cases}
0, & \text{if } y_i < 0.95 \\
1, & \text{if } 0.95 \le y_i < 1.0 \\
2, & \text{if } 1.0 \le y_i < 1.05 \\
3, & \text{if } y_i \ge 1.05
\end{cases} \tag{3.5}$$

The target labels generated by Equation 3.5 describe how the price changed from the source window at timestep t to timestep t + z. In this case, I have chosen four labels, as shown in the equation.
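A short NumPy sketch of this binning, assuming the thresholds of Equation 3.5 (the helper name is illustrative):

```python
import numpy as np

def bin_price_ratio(ratio: np.ndarray) -> np.ndarray:
    """Map price ratios to the four classes of Equation 3.5."""
    thresholds = np.array([0.95, 1.0, 1.05])
    return np.digitize(ratio, thresholds)   # 0: <0.95, 1: [0.95, 1.0), 2: [1.0, 1.05), 3: >=1.05
```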


Chapter 4

Evaluation

In this chapter, I present the results for the two models introduced in the previous chapter, the Transformer model and the LSTM model. The evaluation reports accuracies on the train and validation datasets and additionally provides the loss over the training period (1000 epochs).

4.1 Random target class sampling as a baseline model

Since predicting random labels would obviously give poor results as the number of distinct target classes increases, I employ a strategy that samples categorical classes based on the real counts of the target values. Say the label counts in the dataset are {150, 700, 150, 0}: the second class, with label y = 1, will then be picked most frequently, with probability 0.7. This is a much more sensible strategy than picking uniformly random target values, as it ensures that the random guessing follows the actual distribution of the labels. The resulting Random Target Class Sampling (RTCS) strategy based on counts is illustrated in Figure 4.1. One drawback of the method is that it requires prior knowledge of the label distribution of the dataset. The accuracy of the baseline model can be seen in Table 4.1.


Figure 4.1: The number of target classes in the predicted validation dataset, with the corresponding counts for each class: [561, 33993, 40533, 413].
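A small sketch of this baseline (names are illustrative; the counts below are the example from the text):

```python
import numpy as np

def rtcs_predict(label_counts: np.ndarray, n_predictions: int, seed: int = 0) -> np.ndarray:
    """Random Target Class Sampling: draw labels according to the empirical class distribution."""
    rng = np.random.default_rng(seed)
    probs = label_counts / label_counts.sum()
    return rng.choice(len(label_counts), size=n_predictions, p=probs)

# With counts {150, 700, 150, 0}, class 1 is drawn with probability 0.7
baseline_preds = rtcs_predict(np.array([150, 700, 150, 0]), n_predictions=10)
```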

4.2 LSTM and Transformer results

The LSTM and Transformer models are evaluated with the proposed methods on different K-Fold splits. The results provided below are for the last K-Fold split (which covers most of the data). Model performance is measured as the accuracy of correct labels. The final average K-fold accuracies for all models are reported in Table 4.1. Both models were trained for 1000 epochs with the same input and output windows, I = 1000 and O = 250. The targets for each datapoint are computed via Equation 3.5.


LSTM

The Cross-Entropy Loss and accuracy over the course of training are reported below in Figures 4.2 and 4.3, respectively.

Figure 4.2: The mean Cross-Entropy Loss over the training period for the LSTM model.

Figure 4.3: The accuracy of the LSTM model over the largest K-Fold split.


Transformer

The Transformer has exactly the same setup as the LSTM model and is used with the masking technique described in Section 3.4.4.

Figure 4.4: The mean Cross-Entropy Loss over the training period for the Transformer model.

Figure 4.5: The accuracy of the Transformer model over the largest K-Fold split.


4.3 Final evaluation over K-folds

The three models, Random Target Class Sampling (RTCS), LSTM, and Transformer, are evaluated on three different K-Fold splits. The time-series-specific K-Fold splits follow the strategy of Figure 3.7. The split-specific accuracies over the validation datasets are shown in the table below, and the final average performance over the three K-folds is reported in Table 4.1. The comparison of accuracy over training time is reported in Figure 4.6.

N-Fold   Train fraction   Validation fraction   RTCS    LSTM    Transformer
1        0.4              0.2                   0.496   0.567   0.818
2        0.6              0.2                   0.502   0.576   0.88
3        0.8              0.2                   0.499   0.583   0.815

Figure 4.6: The comparison of the LSTM and Transformer models over the training period (K-Fold = 3).

Baseline (RTCS)   0.499
LSTM              0.670
Transformer       0.834

Table 4.1: Final accuracies of the models. The LSTM and Transformer models are trained for 1000 epochs. Each reported accuracy is an average over three different K-Fold splits on validation data.


Chapter 5

Conclusions and outlook

In this thesis I have introduced a Transformer framework for time-series forecasting in a classification setting, which has shown superior results compared to a strong LSTM baseline for multivariate TS prediction. The contribution of this thesis was to apply the Transformer to time-series forecasting on a highly volatile and chaotic dataset, purely from the historical information of the series. The findings are quite promising, and indeed some knowledge gain is evident and possible in the High-Frequency Trading setting.

To further improve the model, I believe additional features can be created for each unique timestep, for example an embedding of news or other sources of information such as social media. An enormous amount of such data can be gathered, and perhaps the Transformer can be further used to "attend" only to the relevant information in the feature space. A similar approach has already been applied successfully in image-based tasks [13]. Additionally, the attention over the feature space needs to be dynamic, simply because the sentiment of news may change over time. Dynamic adaptation and emphasis on recent events is relevant given the 2021 GameStop short squeeze, which would not have been possible without the online subreddit WallStreetBets. I believe that incorporating news data into the model would further improve its performance.

Since training the Transformer took considerably more time than the LSTM (due to the input sequence length of 1000), another direction of research could focus on further evaluation of TS datasets with more efficient versions of the Transformer.


5.1 Relation to efficient market theory

The efficient market hypothesis (EMH) states that there is no gain from knowledge of external information, as all of the information is already reflected in the price. The findings of this thesis, on the other hand, directly contradict this hypothesis: the model manages to produce plausible predictions from past observations, and the predictions are empirically more accurate than the proposed random guessing technique.


Bibliography

[1] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization, 2016.

[2] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate, 2016.

[3] Batista, J. F. Forecasting bitcoin prices: Arima vs lstm, 2019.

[4] Box, G. E. P., and Jenkins, G. Time Series Analysis, Forecasting and Control. Holden-Day, Inc., USA, 1990.

[5] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

[6] Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., Hasegawa-Johnson, M., and Huang, T. S. Dilated recurrent neural networks, 2017.

[7] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation, 2014.

[8] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.


[9] Fernando, J. Volume weighted average price (vwap) definition, Feb 2021.

[10] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[11] Gu, A., Gulcehre, C., Paine, T., Hoffman, M., and Pascanu, R. Improving the gating mechanism of recurrent neural networks. In International Conference on Machine Learning (2020), PMLR, pp. 3800–3809.

[12] Gyamerah, S. A. Are bitcoins price predictable? evidence from machine learning techniques using technical indicators, 2019.

[13] Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., and Wang, Y. Transformer in transformer, 2021.

[14] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.

[15] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural computation 9 (12 1997), 1735–80.

[16] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

[17] Hyndman, R. J., and Athanasopoulos, G. Forecasting: principles and practice. OTexts, 2018.

[18] Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

[19] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention, 2020.

[20] Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer, 2020.

[21] Mitchell, C. Ohlc chart definition and uses, Sep 2020.

[22] Ogasawara, E., Martinez, L. C., De Oliveira, D., Zimbrao, G., Pappa, G. L., and Mattoso, M. Adaptive normalization: A novel data normalization approach for non-stationary time series. In The 2010 International Joint Conference on Neural Networks (IJCNN) (2010), IEEE, pp. 1–8.


[23] Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. CoRR abs/1905.10437 (2019).

[24] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International conference on machine learning (2013), PMLR, pp. 1310–1318.

[25] Passalis, N., Tefas, A., Kanniainen, J., Gabbouj, M., and Iosifidis, A. Deep adaptive input normalization for time series forecasting. IEEE transactions on neural networks and learning systems 31, 9 (2019), 3760–3765.

[26] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners.

[27] Redd, A., Khin, K., and Marini, A. Fast es-rnn: A gpu implementation of the es-rnn algorithm, 2019.

[28] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2017.

[29] Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78, 10 (1990), 1550–1560.

[30] Yenidogan, I., Cayir, A., Kozan, O., Dag, T., and Arslan. Bitcoin forecasting using arima and prophet. In 2018 3rd International Conference on Computer Science and Engineering (UBMK) (2018), pp. 621–624.

[31] Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. Big bird: Transformers for longer sequences, 2020.