
Department of Science and Technology (Institutionen för teknik och naturvetenskap), Linköping University (Linköpings universitet)

SE-601 74 Norrköping, Sweden

LIU-ITN-TEK-A--15/051--SE

Implementation och utvärdering av Historical Consistent Neural Networks med parallella beräkningar
(Implementation and evaluation of Historical Consistent Neural Networks with parallel computations)

Johan Bjarnle

Elias Holmström

2015-06-15


Master's thesis in Media Technology, carried out at the Institute of Technology at Linköping University (Examensarbete utfört i Medieteknik vid Tekniska högskolan vid Linköpings universitet)

Examiner: Pierangelo Dell'Acqua

Norrköping 2015-06-15


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Johan Bjarnle, Elias Holmström


Abstract

Forecasting the stock market is well known to be a very complex and difficult task, and is even considered by many to be impossible. A new model, Historical Consistent Neural Networks (HCNN), has recently been successfully applied for prediction and risk estimation on the energy markets.

HCNN was developed by Dr. Hans Georg Zimmermann, Siemens AG, Corporate Technology Dpt., Munich, and implemented in the SENN (Simulation Environment for Neural Networks) package, distributed by Siemens. The evaluation is made by tests on a large database of historical price data for global indices, currencies, commodities and interest rates. Tests have been done using the Linux version of the SENN package, provided by Dr. Zimmermann and his research team.

This thesis takes on the task, given by Eturn Fonder AB, to develop a sound basis for evaluating and using HCNN in a fast and easy manner. An important part of our work has been to develop a rapid and improved implementation of HCNN, as an interactive software package. Our approach has been to take advantage of the parallelization capabilities of the graphics card, using the CUDA library together with an intuitive and flexible interface for HCNN built in MATLAB. We can show that our CUDA implementation (using a cheap graphics device) is about 33 times faster than SENN.

With our new optimized implementation of HCNN, we have been able to test the model on large data sets consisting of multidimensional financial time series. We present the results with respect to some common statistical measures, evaluate the prediction qualities and performance of HCNN, and give our analysis of how to move forward and do further testing.


Contents

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Questions
  1.4 Limitations

2 Background
  2.1 Introduction to Neural Networks
  2.2 Back-propagation

3 The HCNN Model
  3.1 Introduction
  3.2 Model
  3.3 Learning
  3.4 Initialization
  3.5 Forecasting
  3.6 Dynamics

4 Implementation
  4.1 SENN
  4.2 HCNNLab
    4.2.1 C Mex
    4.2.2 CUDA Mex
  4.3 Modified Back-propagation

5 Configuration
  5.1 Data Selection
  5.2 Data Pre-processing
  5.3 Model Configuration
  5.4 Computing

6 Results
  6.1 Performance
  6.2 Tests
    6.2.1 Comparison Models
    6.2.2 Error measurements
    6.2.3 Forecast Hit Rate
    6.2.4 Local Hit Rate
    6.2.5 Theil Coefficient
    6.2.6 MASE
    6.2.7 Examples

7 Discussion
  7.1 Answers to Thesis Questions
  7.2 Further Work
    7.2.1 Performance
    7.2.2 Tests


1 Introduction

1.1 Motivation

Forecasting the stock market is well known to be a very complex and difficult task, and is even considered by many to be impossible. Neural networks are a popular approach for simulating market dynamics, but they can be very computationally heavy. A new model, Historical Consistent Neural Networks (HCNN), has recently been successfully applied for prediction and risk estimation on the energy markets.

Nowadays, with the rise in GPU power, more and more compute-intensive operations are performed on a modern graphics device instead of the traditionally used CPU. This opens up huge possibilities to parallelize and do the calculations simultaneously.

The basic structure of neural networks, with nodes and connections, can be implemented as matrix and vector operations, which are very well suited for GPU calculations.

1.2 Purpose

Eturn Fonder AB is a small fund company based in Stockholm, founded in 2004. The company's founders and managers have a long combined experience of model development and over 18 years of experience in successful model-based trading. This thesis takes on the task, given by Eturn Fonder AB, to develop a sound basis for evaluating and using HCNN in a fast and easy manner. To meet Eturn's goals and needs, we should deliver an easy-to-use software solution suited for daily analysis, as well as an evaluation of the HCNN model.

1.3 Questions

Our work aims to answer the following questions:

• Can we create a faster implementation of the HCNN model on our own?

• Can the GPU be utilized in an efficient way to speed up the learning phase?

• Is it possible to use HCNN for predictions on large financial data sets on a daily basis?

• Is it possible to predict the financial markets using HCNN?

1.4 Limitations

In order to fit the thesis within reasonable boundaries, the following limitations have been set up:


• The analysis has been focused on two data sets.

• All available data for the sets have been used – no correlation analysis has been made.

• Only weekly data and predictions.

• No variations of the HCNN model have been evaluated (e.g. RHCNN, CRCNN).


2 Background

2.1 Introduction to Neural Networks

The human brain is the most complex and sophisticated system that we know of in the universe. It is no wonder that scientists and engineers have tried for decades to replicate its structure and features to imitate human intelligence. To be able to create an artificial version of our brains would surely open up huge possibilities of artificial learning and computation.

To replicate the behaviour of the human brain, we would need to take into consideration its main characteristics: self-organization and learning capability, generalization capability and fault tolerance [1]. Even though the human brain theoretically is slower than a modern computer, it still outclasses computers in several areas. When processing, a huge portion of the brain is active, with a massive number of neurons working simultaneously together. Combine this with its ability to store and access data, as well as its noise filtering, and we can begin to understand how complex it is to turn this biological system into an artificial one.

In 1943, Warren McCulloch and Walter Pitts introduced models of neurological networks. They created threshold switches based on neurons, and showed that even simple networks could calculate nearly any logic or arithmetic function. The first computer implementations of neurological networks were made, among others, by Konrad Zuse, who was tired of calculating ballistic trajectories by hand.

Figure 1: Complete neuron cell diagram.


The fundamental idea of creating a neurological network is to model neurons and their connections to each other. A neuron is nothing more than a switch with information input and output. The switch is activated if there are enough stimuli from other neurons at its input. The neuron output will then send out a pulse to, for example, other neurons. The information is received through the neuron's dendrites in special connections called synapses, see figure 1. Through these tree-like connections, the information is transmitted into the nucleus of the cell. When enough stimulation has been achieved through multiple synapses, the cell nucleus of the neuron activates an electrical pulse, which is then transmitted through the axon to the neurons connected to the current one.

Not all stimulating signals are of equal strength. The synapses can transfer both strong and weak signals, and over time they can form both stronger and weaker connections. This adjustability varies a lot and is one of the central points in the examination of the learning abilities of the human brain.

Output

Hidden

Input

Figure 2: Artificial neural network.

To imitate the learning process in the human brain, we need to model an artificial neural network consisting of the key elements of the human brain. Here we have the neurons with their cell nuclei, and the connections between them that vary in strength due to the synapses. Each neuron can be modelled as three serialized functions that give us the proper behaviour. First, we have the propagation function, which sums all the inputs from other neurons. Then we apply an activation function, which decides if the stimulation of the nucleus was enough and a pulse to connected neurons should be transmitted. Last, we use an output function that transforms the activation into output for other neurons.

There are many ways to structure the nodes into different neural network topologies. One of the most common topologies is the feed-forward network. This network consists of input, hidden and output layers. Each layer consists of multiple neurons modeled as nodes, with weighted connections between them, representing the synapses. The hidden layer is invisible from the outside, which is why the neurons in this layer are referred to as hidden neurons.

When discussing neural networks, they are often referred to as black boxes. This means that even though the structure might be unknown, we can still observe the output generated from the fed input. This emphasizes the complexity of the network and its structure. One often-told anecdote from the world of neural networks refers to the US Army. There is no confirmed source that we know of, but it illustrates really well the powers, as well as the problems, that come with neural networks:

In the 1980s, the US army wanted to detect camouflaged enemy tanks and decided to use neural networks for this task. They trained their networks by using 50 pictures of tanks camouflaged in trees, and 50 pictures of trees without tanks. By adjusting the parameters and the network settings, the researchers managed to make the network separate pictures with and without tanks. This did not, however, guarantee that new images would be classified correctly at all. The network might only work for these 100 images. But the wise researchers had thought of this problem, so they had originally taken 200 images, leaving another 100 images to test the network on. They ran the tests and concluded that the network classified all these pictures correctly as well. Success confirmed!

The finished work was handed to the Pentagon, but they soon handed it back. They complained that in their own tests the neural network did no better than chance at detecting tanks.

It turns out that the photos of camouflaged tanks had been taken on cloudy days, while the photos of plain forest had been taken on sunny days. The neural network had, instead of recognizing camouflaged tanks, learned how to distinguish between cloudy and sunny days.


2.2 Back-propagation

Error back-propagation is a supervised learning algorithm that was first introduced in 1974 by Paul Werbos [1]. It calculates the first partial derivatives of the network error function, and updates the weights according to the gradient descent algorithm. It is an effective method to train supervised neural networks, where we want to minimize the deviation between the network output and the target values in the training pattern. This can be expressed as adjusting the weights so that the mean-square error function of the network is minimized:

MSE = ∑_{k=1}^{n} ½ (out_k − tar_k)² → min_{w_ij}    (1)

where n is the number of nodes, out_k the network output and tar_k the target values. More details, with an example of back-propagation for a three-layer feed-forward neural network, can be found in [2].
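To make the weight update concrete, the following MATLAB sketch performs one gradient-descent step for a single linear neuron under the error function in equation (1). All variable names and values are illustrative, not taken from the thesis code:

% One gradient-descent step for a single linear neuron under the
% MSE of equation (1). All values are illustrative.
x   = [0.5; -1.2; 0.3];       % inputs to the neuron
w   = [0.1; 0.4; -0.2];       % weights to be learned
tar = 0.7;                    % target value
eta = 0.05;                   % step length

out  = w' * x;                % propagation function (identity activation)
E    = 0.5 * (out - tar)^2;   % this neuron's contribution to equation (1)
grad = (out - tar) * x;       % dE/dw for the linear neuron
w    = w - eta * grad;        % step in the steepest-descent direction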

Figure 3: The principle of the gradient descent algorithm illustrated in a two-dimensional error space. The algorithm helps us to minimize the error function step-by-step: by travelling in the direction of the steepest descent, the error is minimized.


3 The HCNN Model

3.1 Introduction

Neural networks offer significant benefits for dealing with the typical challenges associated with forecasting. With their universal approximation properties, neural networks make it possible to describe non-linear relationships between a large number of factors and multiple time scales [2].

As said in the previous chapter, a neural network can be expressed as individual layers in the form of nodes, and the connections between the layers in the form of links. Many real-world technical and economic applications can be seen in the context of large systems, in which various non-linear dynamics interact with one another through time.

The Recurrent Neural Network (RNN) is one common type of neural network. RNNs are universal approximators of dynamical systems, and can be used to model the behavior of a wide range of complex systems. Unlike feed-forward neural networks, an RNN has an internal memory to process sequences of inputs, because of its recurrent nature. An RNN consists of input, hidden and output layers. The following set of equations is a general description of an RNN:

sτ = f(sτ−1, uτ)   state transition, (2a)

yτ = g(sτ)   output equation. (2b)

The state sτ is influenced by the inputs uτ as well as the previous state sτ−1. Each hidden state generates an output yτ.

Figure 4: Illustration of a recurrent neural network.


The RNN is used to model and forecast an open dynamic system using a non-linear regression approach. Every hidden state node is influenced by the previous state and the network input. This means that the network might be heavily dependent on the network inputs during the learning phase, and will have to make predictions without them in the future, where no inputs are available. To make predictions consistent with the training, further improvements in the network model are needed.

Figure 5: A recurrent neural network unfolded through time. A, B and C are transition matrices for the model.

In 2010, Zimmermann, Grothmann, Tietz and Jouanne-Diedrich [5] introduced a new type of RNN called Historical Consistent Neural Networks (HCNN). HCNN allows the modeling of highly-interacting non-linear dynamical systems across multiple time scales. HCNN is a closed system and does not draw any distinction between inputs and outputs, but models observables embedded in the dynamics of a large state space, see figure 6. The fundamental idea of the HCNN is to explain the joint dynamics of the observables in a causal manner, i.e. with an information flow from the past to the future.

3.2 Model

The HCNN describes the dynamics of all observables by the sequence of states sτ, using a single state transition matrix A [5]. The state transition matrix contains the only free parameters in the system. When the network unfolds, the hidden state vector sτ represents each hidden state through time, until the last time step in the training, t. The model is defined as:

sτ = tanh(A sτ−1)   state transition (3a)

yτ = Bᵀ sτ   output equation (3b)


where B is the matrix defined in equation 5 and tanh is the activation function¹, illustrated in figure 7. The purpose of Bᵀ is to filter out the observables, stored as the N first neurons in the state sτ. The subsequent neurons in the state represent the hidden variables, which are the underlying dynamics of the observables. As previously stated, the model makes no distinction between observables and hidden variables.

The identification task for the HCNN model can be expressed as a function that minimizes the square error between the network outputs and the target values, by adjusting the parameters in the state transition matrix. This is expressed in the following equation:

E = ∑_{τ=t−m}^{t} (yτ − y^d_τ)² → min_A   system identification (4)

where E represents the total residual error in the network, yτ is the network output, y^d_τ are the target values and m is the number of time steps.
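As an illustration of equations (3a)-(3b) and the error in equation (4), the following MATLAB sketch unfolds an HCNN forward in time. Dimensions and data are placeholders, not the thesis configuration:

% Unfold the HCNN (equations 3a-3b) and accumulate the error of
% equation (4). Sizes and data are placeholders.
n  = 20;                            % state dimension
N  = 3;                             % number of observables
m  = 50;                            % time steps in the training window
A  = 0.2*(2*rand(n) - 1);           % state transition matrix (free parameters)
B  = [eye(N); zeros(n-N, N)];       % B' filters out the first N neurons
yd = randn(N, m);                   % target observables (placeholder data)

s = tanh(0.2*(2*rand(n,1) - 1));    % some initial state
E = 0;
for tau = 2:m
    s = tanh(A * s);                % state transition, equation (3a)
    y = B' * s;                     % output equation, equation (3b)
    E = E + sum((y - yd(:,tau)).^2);  % residual error, equation (4)
end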

Figure 6: Architecture of the HCNN model.

Figure 7: The tanh activation function.

1. The purpose of the activation function is to convert a neuron's weighted input to its output activation. Nonlinear activation functions are what give neural networks their nonlinear capabilities [7]. Symmetric sigmoids such as the hyperbolic tangent often converge faster than the standard logistic function, and are commonly used in neural networks.


3.3 Learning

A technique that is frequently used in the learning task of neural networks is teacher forcing [8]. With this technique the actual output yτ is replaced with the teacher signal y^d_τ in all time steps. [5] introduces a new approach to teacher forcing as an integrated part of the neural network architecture:

sτ = tanh(A · rτ−1),   τ ≤ t   state transition (5)

where rτ−1 = C sτ−1 + B y^d_{τ−1}, with B = [Id; 0] and C = [0 0; 0 Id] in block form.

The identity part of B, Id, has the same dimension as the number of time series. C has the same dimension as A and is constructed so that C · B = 0. E.g. for one time series, this would give:

B = (1, 0, …, 0)ᵀ,   C = diag(0, 1, …, 1)

This allows us to use the standard learning algorithm back-propagation through time [2, pp. 113-135] to solve the identification task (equation 4).

The optimization of the network parameters can be achieved by different methods, e.g. SNOPT [9] or Genetic Algorithms [10]. For the learning of feed-forward neural networks, standard error back-propagation is still an effective method, especially with teacher-forced learning [6].
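A small MATLAB sketch of how the matrices in equation (5) can be constructed and applied, for N observables in a state of dimension n (illustrative sizes and placeholder data):

% Teacher forcing as in equation (5): the first N state neurons are
% replaced by the teacher signal before the transition.
n = 20; N = 3;
B = [eye(N); zeros(n-N, N)];        % identity part on the observables
C = blkdiag(zeros(N), eye(n-N));    % complements B, so that C*B = 0
A = 0.2*(2*rand(n) - 1);            % some state transition matrix
s = tanh(0.2*(2*rand(n,1) - 1));    % current state
yd = randn(N, 1);                   % teacher signal for this time step

r = C*s + B*yd;                     % observables replaced by targets
s = tanh(A * r);                    % teacher-forced state transition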

Figure 8: HCNN model architecture with teacher forcing (equation 5).


3.4 Initialization

In large recurrent neural networks, such as HCNN, the sparsity of the state transition matrix is essential [6]. It improves long-term memory and is a natural way to limit the operations and prevent overflow in the system. A large network with nonlinear tanh connectivities would not be stable if initialized with full connectivity. Thus, the state transition matrix A is defined as a uniformly randomized matrix, with a square dimension of n × n, and with a sparsity λ. All non-zero elements in A can be seen as weights wij in the network, and must satisfy the following condition:

{wij ∈ R | − α < wij < α} (6)

The first hidden state is initialized from the network bias¹ [6]. The bias consists of a vector of all ones, with a randomized vector s0 of dimension n as weights (see figure 8). In order to stabilize the network against uncertainties, s0 is adjusted throughout the learning phase. All weights wi in s0 must satisfy the following condition:

{wi ∈ R | − β < wi < β} (7)

After the initialization, the network unfolds forward in time from t − N to t, where N is the number of time steps in the past (see equation 3a). The first unfolded time step, t − N, is calculated as:

st−N = tanh(s0 ◦ bias) (8)
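The initialization described above can be sketched in MATLAB as follows (the dimension, sparsity and bounds are illustrative; compare the configurations in tables 3 and 4):

% Initialization of A and s0 according to section 3.4.
n      = 160;  lambda = 0.12;       % dimension and sparsity of A
alpha  = 0.2;  beta   = 0.2;        % weight bounds, conditions (6) and (7)

A  = alpha*(2*rand(n) - 1);         % uniform weights in (-alpha, alpha)
A  = A .* (rand(n) < lambda);       % keep only a fraction lambda alive
s0 = beta*(2*rand(n,1) - 1);        % randomized bias weights, condition (7)

bias = ones(n,1);                   % bias vector of all ones
s    = tanh(s0 .* bias);            % first unfolded state, equation (8)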

3.5 Forecasting

The solution of the system identification task will depend on the initialization of the network weights (see section 3.4). The HCNN model is over-parameterized to fit the data perfectly, and it is not possible to know if a solution represents the true dynamics; each individual solution is a reasonable forecast scenario. Thus, an ensemble of scenarios can together be used as a reasonable risk estimation.

By calculating the average (or median) of an ensemble, we can define the forecast as:

forecast(τ) = (1/n) ∑_{i=1}^{n} scenario_i(τ),   τ > t   (9)

where n is the number of scenarios and τ denotes future time steps.
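In MATLAB, combining an ensemble into a forecast according to equation (9) is a one-line operation per statistic (the scenario matrix below is placeholder data):

% Ensemble forecast, equation (9): one row per scenario, one column
% per future time step. Placeholder scenario paths.
n = 100; T = 20;
scenarios = cumsum(0.1*randn(n, T), 2);

forecastAvg = mean(scenarios, 1);    % ensemble average, equation (9)
forecastMed = median(scenarios, 1);  % median, robust against outliers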

3.6 Dynamics

To demonstrate HCNN's ability to adapt, we show forecast examples on sine data. The data consists of two sine periods over one hundred data points, and has an additional random part added that varies in size for each example. Each forecast consists of 100 scenarios. The charts show the original data, the individual scenarios, as well as the forecast median and average of the scenarios.

1. The bias shifts the activation function horizontally.

Figure 9: A two-period forecast of sine data with an additional random part of maximum 10% of the magnitude.

Figure 10: A two-period forecast of sine data with an additional random part of maximum 20% of the magnitude.


Figure 11: A two-period forecast of sine data with an additional random part of maximum 40% of the magnitude.

Figure 12: An eighteen-period forecast of sine data with an additional random part of maximum 10% of the magnitude.

As seen in the figures, HCNN easily manages to describe the dynamics of a sine curve with various amounts of random noise. However, we can see tendencies of undershooting related to the amount of noise and the prediction length.


4 Implementation

4.1 SENN

SENN (Simulation Environment for Neural Networks) is a development environment for the creation of artificial neural network based forecasting and classification models [11]. It is developed by Siemens in Munich, Germany, who use it in various applications, e.g. to improve their timings for electricity and worldwide copper purchases [12].

SENN contains tools for a finance-analytical interpretation of the network outputs and for the solution of general, complex problems. It supports various network types and optimization algorithms (such as the back-propagation method). The networks are described in topology files which contain the network structure. The application can be controlled through a graphical interface or by running TCL (Tool Command Language) scripts [13].

TCL scripts are robust and flexible, and with the use of the TCL reference and SENN coding examples, we created our own library for SENN. Using our library, we could automate the whole process, including data selection, learning and charting, for our HCNN tests.

Figure 13: Neural network node structure and charting component within the SENN environment.

Even though we kept our tests simple and only used a few time series, the network learning time was rather high. We felt the need to solve the identification task in a shorter time span than we managed with SENN.

4.2 HCNNLab

To be able to run daily forecasts, or forecasts with large amounts of data, using SENN, something similar to a computer farm would be required. For this thesis, as well as for real-world usage by Eturn, a faster and more powerful implementation would be desirable; one where the computers we had available would suffice. We thought that by utilizing the graphics card and implementing the HCNN model within MATLAB, we should be able to achieve a faster and more flexible solution. This is where the fundamentals of our own software were born.

HCNNLab stands for Historical Consistent Neural Networks Lab, and is our own neural network environment built within MATLAB. It covers the whole pipeline, including automatic data transformation and adjustments, network modelling, network weight initialization, network learning, forecast charting and statistical measurements. The components and the overall structure can be seen in figure 14 below.

The user can implement general neural network models in HCNNLab, and make faster and more optimized code implementations using MATLAB MEX files. These are written in C code and compiled for faster execution. Using MEX files with the CUDA library [4] enables the utilization of the GPU. We have built both CPU and GPU MEX files for the HCNN and RHCNN models in HCNNLab. Multiple scenarios can be queued or run in parallel, both on the CPU and the GPU. This means that the user is able to run a whole ensemble of scenarios simultaneously on a home computer equipped with a graphics device with CUDA support. By allowing the user to pause and resume during the network learning, the flexibility and usability are increased.

By using the powerful built-in plotting tools in MATLAB, the forecast results can easily be analyzed. HCNNLab automatically adds the interesting elements to the charts, including the original data and the forecast ensemble with averages, medians and standard deviations.

Figure 14: A schematic overview of the HCNNLab environment.


% Load data into variable D
load('data/world_data.mat');

% Settings
data = D;
dataCols = [2, 3];
dataLabels = {'DAX close', 'EU close'};
nScenarios = 100;
m = 100;
lastEpoch = 400;

% Create hcnn object
hcnn = HCNN('DAX');
hcnn.setDescription('A HCNN Test.');

% Set data
hcnn.setData(data, dataCols, dataLabels);
hcnn.setTrainingInterval(lastEpoch, m);

% Set nr of scenarios
hcnn.setScenarios(nScenarios);

% Run on GPU with CUDA
hcnn.setMethod(Model.CUDA_MEX);

% Set model
hcnn.setModel(HCNN.MODEL_HCNN);

% Learn
hcnn.learn();

% Plot
hcnn.plot();

Figure 15: Simple code example from HCNNLab with two time series, where scenarios are run in parallel on the GPU.


4.2.1 C Mex

Since the HCNN learning phase is computationally heavy, with its complex algorithm and iterative process, we benefit from rewriting the code in the C language instead of using the simpler MATLAB language. The programming language C can handle different kinds of data structures more efficiently. Read more about using MEX files in MATLAB in the online documentation [14].

4.2.2 CUDA Mex

Because the gradient information for each dimension can be calculated in parallel, a suitable approach to minimize the calculation costs is to introduce GPU acceleration. This also makes it possible to multiply each element in the hidden network state with its transition matrix in parallel. We used CUDA [4] from Nvidia to extend our C MEX to fully support GPU acceleration. CUDA is a platform and programming model that makes it possible to fully communicate with the GPU and utilize the highly parallel computational power that it holds.

Figure 16: Blocks and threads in CUDA.

The main topics that we needed to consider were the maximum number of threads running in parallel, the size of the shared memory, and the synchronization of threads. On Nvidia cards supporting CUDA compute capability 2.x, the maximum number of threads per block is 1024, and the maximum amount of shared memory per multiprocessor is 48 KB (approximately 2000 double values). When calculating the gradients for HCNN, the time steps for each gradient are recursive in time, and therefore not parallelizable. However, within each time step we are able to parallelize the calculations. Since all the scenarios in the ensemble are independent of each other, they can also be parallelized. Maximum performance is reached when all the GPU threads are active simultaneously, which is achieved when the batch of scenarios is large enough. The maximum size of the batch is limited by the size of the specific graphics device's memory.

We introduce a new approach with only one network identification task for all scenarios in an ensemble. This makes it possible to solve the HCNN in a computationally efficient way on the GPU. The new model definition is similar to the HCNN definitions in section 3.2; the difference is that we combine all the scenarios into one single network. For this new ultra-large network, the combined variables are defined as:

sτ = (s^i_τ; s^ii_τ; …; s^n_τ),   yτ = (y^i_τ; y^ii_τ; …; y^n_τ),   A = diag(A^i, A^ii, …, A^n)   (10)

where n is the number of scenarios, and A is block-diagonal. The dimension of the problem is huge, and varies depending on the problem task. We have taken advantage of the CUDA library CUSPARSE to handle the matrix-vector calculations; it provides a robust and flexible implementation that handles the memory allocation and matrix pointers very well, using the HYB format¹ [15].

1. "The HYB format, a combination of the ELL and COO sparse matrix data structures, offers the speed of ELL and the flexibility of COO. Often, unstructured matrices that are not efficiently stored in ELL format alone are readily handled by HYB. In such matrices, rows with exceptional lengths contain a relatively small number of the total matrix entries. As a result, HYB is generally the fastest format for a broad class of unstructured matrices." [16]

4.3 Modified Back-propagation

The learning of neural networks using back-propagation (see section 2.2) is divided into training epochs. Each epoch contains the network evaluation, one complete back-propagation iteration, and the weight update process. The purpose of each epoch is to move towards a better position in the weight space. How far we travel in each epoch is determined by the step length, the parameter η (Greek: eta). To minimize the time it takes to fully train a network, the number of training epochs needs to be reduced.

One way to achieve this, without affecting other network parameters, is by increasing the η value. However, an increased step length will also lead to instability in the learning process. By introducing two learning phases with two different step lengths, combined with a limit on the weight update value, we have drastically improved the learning curve, and thereby the time it takes to train the network.

The network error is larger in the beginning of the learning phase. To prevent an overload of the error flow, a constant η value is used to restrict the error to a smaller magnitude when updating the weights. As η becomes smaller, the training process becomes more robust, but at the same time it takes more epochs to complete the identification task. This is a common trade-off in back-propagation. To solve this dilemma, we have, as previously mentioned, a first phase of the learning where we limit the weight updates, combined with a relatively high η.

This works well in the case of an over-parameterized network, such as HCNN. The new approach equalizes the error distribution among the weights in a short number of epochs, and enables us to use a stable and comparably large η for the rest of the training. We think that, for the case of HCNN, this is a better approach than standard back-propagation. We define it as:

∆A = η · (1/N) ∑_{τ=1}^{N} ∆Aτ,   −L < w_∆A < L   (11)

where ∆A is the total weight update for A in the current epoch, N is the number of time steps, L is the update limit for each weight w, and ∆Aτ is the weight update in each time step. The same approach is used for updating s0.

Note that it is critical how η, L and the number of epochs are configured. The learning is sensitive to small changes that significantly influence the method's effectiveness.
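A minimal MATLAB sketch of the clamped update in equation (11). The gradients are placeholders; the values of η and L correspond to phase one of the world data configuration in table 3:

% Modified back-propagation update, equation (11): average the
% per-time-step updates, scale by eta, and clamp each element.
eta = 10; L = 0.01;                 % step length and update limit (table 3)
n = 160; N = 200;
dA_tau = 0.001*randn(n, n, N);      % per-time-step updates (placeholders)

dA = eta * mean(dA_tau, 3);         % eta * (1/N) * sum over time steps
dA = max(min(dA, L), -L);           % limit each weight update to (-L, L)
A  = 0.2*(2*rand(n) - 1);           % current weights (placeholder)
A  = A + dA;                        % apply the clamped update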

Figure 17: Average residuals during HCNN learning, comparing the modified and the standard back-propagation algorithm. The Y-axis represents the average residual error, and the X-axis the number of trained epochs. The compared runs are: modified back-propagation with (dimension, sparsity) = (120, 0.6), (160, 0.12) and (300, 0.12), and standard back-propagation with dimension 160, sparsity 0.12 and η = 0.08, 0.2, 0.5 and 1.0.

As seen in the figure above, the modified back-propagation can drastically improve the residual errors in the beginning of the learning phase. It also allows for fast convergence in the later stages, heavily reducing the total number of epochs.


5 Configuration

We have chosen to use HCNNLab for our test runs. The reason for this is that we have a full understanding of the software, and it performs well, as shown in section 6.1.

5.1 Data Selection

We have chosen to test HCNN on two different data sets. Some of the data is provided by Eturn, but most is downloaded from Reuters. The first data set is constructed from large world indices, currencies, commodities and financial rates. The second data set consists of some of the most traded currencies. Our belief and understanding is that HCNN works well with a large number of time series and with high degrees of correlation. We wanted to construct the data sets from as many time series as possible, limited by the amount and quality of the collected data. We also thought of this as a very interesting challenge for our HCNN implementation. The final time series used in the sharp runs for the two data sets are listed in tables 1 and 2.

Currency data time series

US Dollar / British Pound
US Dollar / Canadian Dollar
US Dollar / Euro
US Dollar / Hungarian Forint
US Dollar / Japanese Yen
US Dollar / Norwegian Krone
US Dollar / Polish Zloty
US Dollar / Russian Rubel
US Dollar / Singapore Dollar
US Dollar / South African Rand
US Dollar / Swedish Krona
US Dollar / Swiss Franc
US Dollar / Thai Bath

Table 1: The 13 time series included in the currency data set.


World data time series

AEX (Holland)                                NASDAQ Tran
Aluminium 3-Month (Composite)                New Zealand .NZ50
Australia .AORD                              Nickel
Baltic .OMXBPI                               Nikkei 225
Brazil .BVSP                                 NYSE Composite Index
Brent Blend                                  Oats CBT
Brussels BEL20 Index                         OMX Copenhagen 20
BSE 30 Bombay                                OMX S30
Copper Cash (Composite)                      Oslo Børs All-share Index
Corn CBT                                     Pakistan .TRXFLDPKP
Crude Oil                                    Palladium Spot XPD EQUAL S
Czech Republic .TRXFLDCZP                    Paris CAC40 Index
D J Composite                                Platinum Spot XPT EQ S
D J Industrials                              Pork Bellies
D J Transports                               RTS (Ryssland)
D J Utilities                                Russel 2000
DAX IBIS Tyskl                               S&P 100
DJ Wilshire 5000 .Wil5                       S&P 400 Midcap
DJI Yahoo                                    S&P 500
Dow Jones Developed Markets ex-U.S. Index    Shanghai SE Composite Index
Dow Jones Global Basic Materials Index       Silver SPot XAG EQ S
Dow Jones Global Consumer Goods Index        Singapore
Dow Jones Global Consumer Services Index     Soybean Oil
Dow Jones Global Financials Index            Spain .SMSI
Dow Jones Global Health Care Index           SSV 30
Dow Jones Global Index                       Swiss M. Index
Dow Jones Global Industrials Index           Taiwan Weigthed
Dow Jones Global Oil & Gas Index             Tin 3-Month (Composite)
Dow Jones Global Technology Index            US Dollar / British Pound
Dow Jones Global Telecommunications Index    US Dollar / Canadian Dollar
Dow Jones Global Utilities Index             US Dollar / Euro
Gas oil                                      US Dollar / Hungarian Forint
Gold Spot XAU EQUAL S                        US Dollar / Japanese Yen
HANG SENG INDEX                              US Dollar / Norwegian Krone
IPC G. (Mexico)                              US Dollar / Polish Zloty
Ireland .TRXFLDIEP                           US Dollar / Russian Rubel
Jakarta Comp                                 US Dollar / Singapore Dollar
Korea Composite                              US Dollar / South African Rand
Kuala Lu (Malay)                             US Dollar / Swedish Krona
Lead                                         US Dollar / Swiss Franc
Lean Hogs                                    US Dollar / Thai Bath
London FTSE 100                              USD 10-Years Obl
Marocco .TRXFLDMAP                           USD 30-Years Obl
Merval (Argent.)                             Value Line Index
NASDAQ 100                                   Wheat CBT
NASDAQ Comp                                  Zink
NASDAQ Tele

Table 2: The 93 time series included in the world data set.

5.2 Data Pre-processing

When it comes to simulations, forecasting and other types of real-world applications, the most fundamental building block is to have good data. The old saying garbage in, garbage out pretty much sums it up. Therefore, it is very important to acquire good data, and to treat it well. We need to properly process it before we can feed it to our networks. Our pre-processing consists of the following steps, in order:

• Data cleanup

Remove all data and time series that are insufficient or of bad quality.

• Remove outliers

Data points with erroneous values, and values that deviate significantly, will create problems and drastically slow down the learning process. They are therefore deleted.

• Extract weekly data

There are many ways to extract weekly data. We chose to average all available data points for each week to get the weekly data values.

• Repair data

To get data with continuous values, we need to repair missing data. This is done by replacing a missing value with the previous existing value. We had very few occurrences of this in our weekly data.

All data pre-processing was performed in MATLAB, where the final data sets are stored and used as .mat files.
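A MATLAB sketch of the weekly extraction and repair steps above, assuming a vector of daily values v and matching week numbers wk (hypothetical variable names and placeholder data):

% Weekly extraction and repair of a daily series.
v  = [100.2 101.0 NaN 99.8 100.5 101.3]';  % placeholder daily values
wk = [1 1 1 2 2 2]';                       % week index per data point

% Extract weekly data: average all available points in each week
weekly = accumarray(wk, v, [], @(x) mean(x(~isnan(x))));

% Repair data: replace a missing value with the previous existing one
for i = 2:numel(weekly)
    if isnan(weekly(i)), weekly(i) = weekly(i-1); end
end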

5.3 Model Configuration

Parameter       Description                                                        Value
Nm              Number of time series in the training set.                         93
m               Number of time steps in the training set.                          200
dim(A)          Dimension of the state transition matrix A.                        800
sparsity(A)     Sparsity of A (the share of alive weights).                        12%
max(|A_init|)   Max abs value of each randomized weight in A_init.                 0.2
max(|s0_init|)  Max abs value of each randomized weight in s0_init.                0.2
max(|∆A|)       Max abs weight update contribution value for A.                    0.01
max(|∆s0|)      Max abs weight update contribution value for s0.                   0.01
max(|A|)        Max abs weight value allowed in A.                                 2
max(|s0|)       Max abs weight value allowed in s0.                                2.5
P1 epochs       Number of epochs in learning phase one.                            8000
η_P1            Step length in phase one.                                          10
η_P2            Step length in phase two.                                          1.5
E_max           The maximum residual error allowed.                                1E−4

Table 3: Model configuration for the world data set.


Parameter       Description                                                        Value
Nm              Number of time series in the training set.                         13
m               Number of time steps in the training set.                          250
dim(A)          Dimension of the state transition matrix A.                        160
sparsity(A)     Sparsity of A (the share of alive weights).                        60%
max(|A_init|)   Max abs value of each randomized weight in A_init.                 0.2
max(|s0_init|)  Max abs value of each randomized weight in s0_init.                0.2
max(|∆A|)       Max abs weight update contribution value for A.                    0.01
max(|∆s0|)      Max abs weight update contribution value for s0.                   0.01
max(|A|)        Max abs weight value allowed in A.                                 2
max(|s0|)       Max abs weight value allowed in s0.                                2.5
P1 epochs       Number of epochs in learning phase one.                            2000
η_P1            Step length in phase one.                                          5
η_P2            Step length in phase two.                                          1.5
E_max           The maximum residual error allowed.                                1E−4

Table 4: Model configuration for the currency data set.

5.4 Computing

We knew that even with our own software and its increased performance, the time to complete many runs would be very high. Therefore, we needed access to dedicated computers, allowing us to keep neural network calculations running 24/7. We were given access to the following two computers, on which almost all network training was performed:

• Computer 1

CPU: Quad-Core Intel Xeon E5410 2.33 GHz

• Computer 2

CPU: Hexa-Core Intel Xeon E5-2620 2.0 GHz (hyper-threading enabled)
GPU: Geforce GTX 670

We have performed large GPU tests on the high-end graphics device in Computer 2, the Geforce GTX 670. The tests include both of the two data sets. The performance is stated in the next two tables, where epochs means the total number of epochs reached for the scenarios with the slowest convergence, and time per scenario epoch means the effective time it takes to complete one epoch in each scenario. Each test is set up according to the parameters in tables 3 and 4.

26

Page 30: Implementation och utvärdering av Historical Consistent Neural …838846/FULLTEXT01.pdf · 2015-07-01 · The fundamental idea of creating a neurological network is to model neu-rons

Description                   Value
Number of scenarios           50
Memory allocation on GPU      614 MB of 2048 MB
Total time                    268 min
Time per scenario             5.4 min
Epochs                        39200
Time per scenario epoch       8.2 ms

Table 5: GPU performance for the world data set.

Description                   Value
Number of scenarios           500
Memory allocation on GPU      1241 MB of 2048 MB
Total time                    289 min
Time per scenario             0.6 min
Epochs                        21200
Time per scenario epoch       1.6 ms

Table 6: GPU performance for the currency data set.

The statistics for the sharp runs, on both the world data set and the currency data set, are listed in the tables below. Note that since the runs were performed with multiple scenarios running simultaneously, the times are not equivalent to the total time it took to complete the training of all scenarios.

Description                     Value
Total time                      696 days
Total epochs                    49398700
Average epochs                  24298
Total number of scenarios       23118
Scenarios / batch               11.4
Average time / scenario         43 min
Average time / scenario epoch   107.3 ms

Table 7: Computational statistics for the world data set.

Description                     Value
Total time                      254 days
Total epochs                    19838900
Average epochs                  33625
Total number of scenarios       14692
Scenarios / batch               24.9
Average time / scenario         25 min
Average time / scenario epoch   44.6 ms

Table 8: Computational statistics for the currency data set.


Even though the runs comprise enormous amounts of data, with a huge number of total scenarios and epochs, we have been able to configure the networks to achieve reasonable run times: 43 and 25 minutes per scenario, respectively (see tables 7 and 8). If we had had access to more GPUs, and had not been forced to run the majority of the computations on the CPU, the run times would have been significantly shorter. See the example runs in tables 5 and 6, where the run times are much shorter.

Without our own software, our own model implementation, and the possibility to run the networks in parallel, it would have been hard to get such a good ratio between run time and calendar time. As we can see in table 6, when we try to maximize the utilization of the GPU in HCNNLab, the interesting measure time per scenario epoch is drastically improved. The effective time is a remarkable 1.6 ms, compared to 44.6 ms for the mixed runs (with both CPU and GPU computations). This can be explained by the superior effectiveness of a modern high-end GPU compared to a mid-end GPU, as well as to a CPU.


6 Results

6.1 Performance

To be able to compare the different software solutions in section 4 in terms of performance and speed, we have performed measurements in test runs. To get a fair performance comparison, we have averaged the time it takes to complete 1000 training epochs. The runs were performed using the standard back-propagation algorithm on a computer with an Intel i7 920 CPU @ 3.20 GHz (running one core, hyper-threading off) and a mid-end graphics device, a Geforce GTX 460. On the CPU we ran one scenario, while on the GPU we ran 300 scenarios in parallel.

Processor   Scenarios   Time per scenario   Software
CPU         1           54.43 s             SENN
CPU         1           12.59 s             HCNNLab CMex
GPU         300         1.65 s              HCNNLab CUDAMex

Table 9: Times for completing 1000 epochs with HCNN using the standard back-propagation algorithm. The tests average the time to complete 1000 epochs, with a state transition matrix of dimension 300 and sparsity 12%. The data set includes 200 time steps, and comes from the tutorial package in SENN.

As seen in table 9, the times are greatly improved by using HCNNLab. Note that the results only reflect computational power when using the same learning algorithm. One key difference between SENN and HCNNLab is that our calculations are performed with doubles instead of floats, which allows for higher accuracy and greater network memory. The trade-off for using doubles is a slight decrease in performance.

For an easier comparison, we have put together the results expressed as speedup factors, see figure 18. The chart shows that HCNNLab is around 4 times faster than SENN running on the CPU, and 33 times faster running on the GPU. This allows us, as discussed earlier in section 4.1, to run more data simultaneously and keep the duration of the learning phase at a satisfactory level.


Figure 18: Computational performance comparison between SENN and HCNNLab, shown as speedup factors: SENN 1.0, HCNNLab C 4.3, HCNNLab CUDA 33.0.

6.2 Tests

6.2.1 Comparison Models

HCNN is a neural network model that adapts according to historical data. This is done by adjusting a huge state transition matrix, consisting of weights, to the data, and acquiring a prediction in the feed-forward process. To see how good these predictions are, we need to measure them somehow and put them into context. Measurement values by themselves won't give us the bigger picture if we can't relate them to values from other models. Therefore, we need to collect prediction data and relevant error measurements not only from HCNN, but from other models as well. This creates the need for us to find or develop robust, yet simple enough, models to relate to.

A commonly used model when it comes to predictions and stock market forecasting is the naïve predictor. The idea behind it is that today's stock price is the best estimate of tomorrow's price [17], which is a consequence of the efficient-market hypothesis [18]. We think it's a good idea to put the HCNN model in relation to trivial predictors, which is why we have chosen to include the Theil coefficient (see section 6.2.5) and MASE (see section 6.2.6) in our tests. Our implementation of the predictor is to use the latest known value as the prediction for all future time steps. This gives the prediction the same conditions as the predictions of the HCNN ensemble. The formula for our implementation is very simple, and is defined as

x̂_t = x_0,   t > 0   (12)

where x̂_t is the prediction for each future time step t, based on the last actual value x_0.
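In MATLAB, this predictor is a one-liner (placeholder data, x̂ written as xhat):

% Naive predictor, equation (12): repeat the last known value.
history = [1.02 1.05 1.04 1.08];     % known values up to t = 0
p = 5;                               % number of prediction steps
xhat = repmat(history(end), 1, p);   % xhat_t = x_0 for all t > 0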


Because neural networks, and HCNN in particular, are complex systems with advanced dynamics, it seems fair to include a more sophisticated model as well. Naïve prediction is easy to relate to, and might even prove useful in some areas, but it is fundamentally far from HCNN. It totally ignores the history, and basically says that the weather of tomorrow will be the same as today's. It would be nice to find a system that adapts to the history, yet is simple enough. Our choice for such a model is linear regression [19]. It fits a straight line through a set of n points, while keeping the sum of the squared residuals of the model as small as possible. It is defined as:

y = ax + b   (13)

where a is the gradient constant, defined as

a = cov(x, y) / var(x) = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^{n} (x_i − x̄)²   (14)

and b is the y-value when x = 0, which in our case is the last value in the historical data. The functions cov and var stand for covariance¹ and variance² respectively.

Since linear regression adapts the predictions to the historical data, the predictions are heavily dependent on the only parameter of the system – the length of the data set. If we use a data set with a large number of historical time steps, we get a slow system. If we decrease the amount of data and the historical time span, we get a more aggressive and probably steeper system. Each point in time will be different, and when the trends turn, the valuation of the historical data will vary. We will therefore run our tests against multiple linear regression models, each with a different time span. We have chosen to use three separate systems, with historical data memory varying between long and short. This is achieved by adjusting the number of time steps to include previous to the prediction date. The different parameter settings can be seen in the following table, and a small sketch of the predictor follows after it.

Setting    Time steps included
Slow       200
Medium     100
Short      30

Table 10: Different parameter settings used for the linear regression model.
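To illustrate how the settings in table 10 enter the model, a minimal MATLAB sketch of the windowed linear regression predictor follows; the function name linreg_predict and its argument names are our own, and the anchoring of b at the last historical value follows the description above.

% Sketch of the windowed linear regression predictor
% (equations 13 and 14). hist is the historical series,
% w the window length (200, 100 or 30 in table 10),
% p the number of future steps to predict.
function yhat = linreg_predict(hist, w, p)
    y = hist(end-w+1:end); y = y(:);   % last w observations
    x = (1:w)';                        % time index in the window
    a = sum((x - mean(x)) .* (y - mean(y))) / sum((x - mean(x)).^2);
    b = y(end);                        % line anchored at the last value
    yhat = b + a * (1:p)';             % extrapolate p steps ahead
end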

6.2.2 Error Measurements

The most robust way to create a prediction from an HCNN ensemble is to compute the median or average of all scenarios. We have chosen to use the median, since the influence from outliers is then minimized.
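As a minimal sketch, assuming the ensemble is stored as a matrix with one scenario per row:

% Ensemble forecast as the column-wise median over all scenarios.
% Hypothetical dimensions: 150 scenarios, 30 prediction steps.
ensemble = randn(150, 30);        % stand-in for real scenario data
forecast = median(ensemble, 1);   % robust against outlier scenarios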


As standard, all predictions are included in our error measurement calculations. This means that a total of 14 229 (153×93) predictions for the world data set, and 1 235 (93×13) predictions for the currency data set, were used.

6.2.3 Forecast Hit Rate

Forecast hit rate is the average of the binary outcomes given by comparing the forecast gradient to the actual gradient over the prediction time frame. The gradient is derived from the straight line between the last prediction step and the last value in the historical data. It can point either upwards or downwards. We can look upon this hit rate as a predictor of bear and bull markets [20]. We can define an outcome h as:

h_t = \begin{cases} 1, & a > 0 \\ 0, & a < 0 \end{cases}, \quad \text{where } a = \frac{\hat{x}_t - x_0}{x_t - x_0} \qquad (15)

where t represents a prediction step (t > 0) and \hat{x} denotes the prediction. By taking the average of all outcomes, we get the forecast hit rate value. The hit rate varies when we gradually increase the number of prediction steps. In the graph below, we let the number of time steps, p, go from 1 to 20 for both the world and the currency data set.

[Figure: line chart; y-axis: Forecast Hit Rate [%] (48–60); x-axis: Prediction Length [weeks] (1–20); series: Currencies, World]

Figure 19: Forecast hit rate when p goes from 1 to 20.
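As a concrete illustration of equation (15), a minimal MATLAB sketch of the forecast hit rate; the function name and argument names are our own assumptions.

% Forecast hit rate (equation 15) for one prediction.
% xhat, x: predicted and actual values for steps 1..p (column
% vectors), x0: last historical value. A hit means the forecast
% gradient has the same sign as the actual gradient.
function rate = forecast_hit_rate(xhat, x, x0)
    a = (xhat - x0) ./ (x - x0);   % ratio of the two gradients
    rate = mean(a > 0) * 100;      % percentage of hits
end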

These results must be put in relation to our other prediction models. Since our version of the naïve predictor is a straight line for all future time steps t > 1, we need to look at the derivative from the previous historical value to be able to extract its hit rate. The tables below show the results when we compare the HCNN forecast hit rates with the naïve and linear regression (LinReg) predictors, for p = 1.


HCNN    Naïve    LinReg-30    LinReg-100    LinReg-200

58.5% 61.5% 55.0% 55.4% 55.1%

Table 11: Forecast hit rate when t = 1 for the currency data set.

HCNN    Naïve    LinReg-30    LinReg-100    LinReg-200

53.9% 59.2% 54.0% 53.3% 52.7%

Table 12: Forecast hit rate when t = 1 for the world data set.

For HCNN and LinReg we can compare the forecast hit rate over time. The results of this comparison are shown in the following graphs.

[Figure: line chart; y-axis: Forecast Hit Rate [%] (46–64); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 20: Forecast hit rate for the currency data set.


[Figure: line chart; y-axis: Forecast Hit Rate [%] (46–60); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 21: Forecast hit rate for the world data set.

6.2.4 Local Hit Rate

Local hit rate is the average of the binary outcomes given by comparing the gradient in each prediction step to the actual gradient over the prediction time frame. We can define the binary outcome as:

h_t = \begin{cases} 1, & a > 0 \\ 0, & a < 0 \end{cases}, \quad \text{where } a = \frac{\hat{x}_t - \hat{x}_{t-1}}{x_t - x_{t-1}} \qquad (16)

where t represents a prediction step (t > 0). By taking the average of all outcomes, we get the local hit rate value. The hit rate varies when we gradually increase the number of prediction steps. In the graph below we let the number of time steps go from 1 to 20 for both the world and currency data sets.


[Figure: line chart; y-axis: Local Hit Rate [%] (48–60); x-axis: Prediction Length [weeks] (1–20); series: Currencies, World]

Figure 22: Local hit rate when t goes from 1 to 20.
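Analogously, a minimal sketch of equation (16); compared to the forecast hit rate, only the gradients change, now taken between consecutive steps. The variable names are our own.

% Local hit rate (equation 16). x0 is the last historical value;
% xhat and x are column vectors with steps 1..p. Prepending x0
% makes diff produce the gradients x_t - x_(t-1) for t = 1..p.
a = diff([x0; xhat]) ./ diff([x0; x]);
rate = mean(a > 0) * 100;   % percentage of local hits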

These results must be put in relation to our other prediction models. The forecast and local hit rates are the same when t = 1, which makes tables 11 and 12 relevant for this measurement as well. Just as with the forecast hit rate, we can't use the naïve predictor for time steps t > 1. For HCNN and LinReg, we can compare the local hit rate over time. The results of this comparison are shown in the following graphs.

[Figure: line chart; y-axis: Local Hit Rate [%] (46–60); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 23: Local hit rate for the currency data set.


[Figure: line chart; y-axis: Local Hit Rate [%] (46–60); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 24: Local hit rate for the world data set.

6.2.5 Theil Coefficient

The Theil Coefficient [17, p. 20] was proposed by the Dutch econometrician Henri Theil. It is expressed as the squared error sum between the prediction and the original, in relation to that of the naïve predictor, over every time step. This makes it very useful for measuring the effectiveness of a predictor, since it is more or less independent of the scaling of the time series. As the naïve predictor, we use our previously mentioned straight-line implementation. The Theil Coefficient T is defined as:

T = \frac{\sum_{t=1}^{N} \left( y(t) - \hat{y}(t) \right)^2}{\sum_{t=1}^{N} \left( y(t) - y(0) \right)^2} \qquad (17)

Values below one indicate that the error sum for our predictor is lower than the error sum for the naïve predictor, which is desirable. By comparing with LinReg while setting the number of time steps p to 1, we get the following tables:

Theil coefficient    HCNN    LinReg-30    LinReg-100    LinReg-200
Median               1.03    1.00         0.98         0.98
MAD                  0.51    0.29         0.20         0.12

Table 13: Theil coefficient for the currency data set.


Theil coefficient    HCNN    LinReg-30    LinReg-100    LinReg-200
Median               1.12    1.01         0.99         1.00
MAD                  0.58    0.29         0.18         0.12

Table 14: Theil coefficient for the world data set.

Median refers to the median of all Theil coefficients over all predictions. MAD¹ is the median absolute deviation, and measures how volatile the values are.

¹ http://en.wikipedia.org/wiki/Median_absolute_deviation
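A minimal MATLAB sketch of equation (17) and the statistics in the tables; the MAD is computed directly from its definition, and the variable names are our own.

% Theil Coefficient (equation 17) for one prediction. y, yhat:
% actual and predicted values for steps 1..N; y0: last historical
% value, i.e. the naive predictor's constant forecast.
T = sum((y - yhat).^2) / sum((y - y0).^2);

% Median and MAD over the Theil coefficients of all predictions,
% collected in a vector Ts (as reported in tables 13 and 14).
medT = median(Ts);
madT = median(abs(Ts - medT));   % median absolute deviation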

By varying p from 1 to 20, we can plot the results as charts for both the currency and world data sets.

[Figure: line chart; y-axis: Theil Coefficient (0.95–1.20); x-axis: Prediction Length [weeks] (1–20); series: Currencies, World]

Figure 25: Theil coefficient when t goes from 1 to 20.

The following charts compare HCNN and LinReg for both data sets:


[Figure: line chart; y-axis: Theil Coefficient (0.90–1.10); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 26: Theil coefficient for the currency data set.

[Figure: line chart; y-axis: Theil Coefficient (0.95–1.20); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 27: Theil coefficient for the world data set.

6.2.6 MASE

MASE, or Mean Absolute Scaled Error [21], is an error measurement that takes the different scales of different time series into consideration. This makes it useful as an error measurement for predictors. MASE is defined as:

q(t) = \frac{y(t) - \hat{y}(t)}{\frac{1}{n} \sum_{i=1}^{n} \left| y(i) - y(i-1) \right|} \qquad (18)

\mathrm{MASE} = \frac{1}{n} \sum_{t=1}^{n} \left| q(t) \right| \qquad (19)


where y(t) denotes the observed values and \hat{y}(t) the prediction. All values below one are to be considered good. Due to the definitions of MASE and the Theil Coefficient, they are equal when t = 1. Therefore, no tables specifically for MASE are presented; instead, see the tables in section 6.2.5, as they remain relevant. The following charts show MASE for all future time steps, for both the world and the currency data set.

[Figure: line chart; y-axis: MASE (1.0–5.0); x-axis: Prediction Length [weeks] (1–20); series: Currencies, World]

Figure 28: MASE when t goes from 1 to 20.
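A minimal MATLAB sketch of equations (18) and (19), under the same naming assumptions as the earlier snippets:

% MASE (equations 18 and 19). y and yhat hold the observed values
% and the predictions; the scaling term is the mean absolute
% one-step change of the observed series.
scale = mean(abs(diff(y)));   % approximates (1/n) * sum |y(i) - y(i-1)|
q = (y - yhat) / scale;       % scaled errors, equation 18
mase = mean(abs(q));          % equation 19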

The following charts compare HCNN and LinReg for both data sets:

[Figure: line chart; y-axis: MASE (0.0–5.0); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 29: MASE for the currency data set.


[Figure: line chart; y-axis: MASE (0.0–5.0); x-axis: Prediction Length [weeks] (1–20); series: HCNN, LinReg-30, LinReg-100, LinReg-200]

Figure 30: MASE for the world data set.

6.2.7 Examples

[Figure: line chart of USD/SEK, approx. 0.70–0.80, weekly dates from 2006-12-18 to 2007-07-16; series: Original, Scenarios, Forecast median, Forecast average]

Figure 31: Forecast example taken from the runs for the currency data set. The chart shows a USD/SEK prediction, with 150 scenarios in the ensemble over 30 future prediction steps.


7 Discussion

Our observations are that the currency market is trending and slow. If we look at the Theil Coefficient, see figure 26, linear regression beats the naïve prediction (by having values lower than 1). This shows the strength of using predictors that acknowledge market trends. In general, financial markets, including our data sets, contain a high level of trending data. This gives linear predictors, such as linear regression, an advantage during the trending phases. But it is also their downfall when the trend shifts, i.e. from a bull to a bear market.

Regardless of results and accuracy, HCNN is a more consistent and flexible predictor, since it uses a large number of nonlinear parameters to adapt to the history. This, to a greater extent, creates a more independent analytical model. With the right data and good conditions, it is most likely possible to benefit from HCNN and its features. Siemens in Germany has in recent years used HCNN in various predictions, analytics and other applications [23]. They have, for example, used it in copper price predictions and in weather forecasts.

[Figure: line chart; y-axis: index level (0–1400); x-axis: Mar-02 to Mar-10]

Figure 32: The Swedish index OMXS30 from our world data set.

In figure 32, we see the Swedish stock exchange index OMXS30 over the time period used in our runs, with clear trends in the price.



Figure 33: Hit rate over time for HCNN and the world data set.


Figure 34: Hit rate over time for linear regression and the world data set.

In figures 33 and 34, the hit rate in the first prediction step for all time series is plotted in time consecutive order, for both HCNN and LinReg-200. The time series are sorted by best average hit rate. A white dot indicates a hit, and a black dot a miss. These figures show that HCNN isn't as trend sensitive as the linear regression. The dots are more equally distributed


in the HCNN plot. In the plot for the linear regression, the hits are clearly more periodical, and the hits and misses sometimes appear visually as vertical stripes. The total averages of the hits and misses are shown in table 12 for the forecast hit rates.

Point 80 represents the first value in the results for the year 2008, a year when the financial markets crashed. In October 2008, the Black Week [22] lasted for five trading sessions, during which the Dow Jones Industrial Average fell 18.1%. If we look at both charts (figures 33 and 34), we can see that HCNN has higher hit rates during the crash than the linear regression.

7.1 Answers to Thesis Questions

Can we create a faster implementation of the HCNN model on our own?

At first, it seemed like an impossible task, but with great effort and extensive research, we managed to recreate both the HCNN model and the back-propagation algorithm, implemented as optimized C code. As seen in figure 18, our implementation is 4.3 times faster than SENN.

Can the GPU be utilized in an efficient way to speed up the learning phase?

We have presented a unique solution that maximizes GPU utilization by creating one single network identification task for all scenarios in an ensemble, see page 21. This, together with efficient calculations and CUDA's HYB sparse matrix format, has given us an effective and modern implementation of this complex high-dimensional neural network. We have been able to increase the performance by a speedup factor of 33, using our software HCNNLab, as seen in figure 18.

Is it possible to use HCNN for predictions on large financial data sets on a daily basis?

With HCNNLab, the whole process, from data pre-processing to final prediction, is made very intuitive and user-friendly. The complete HCNN configuration can be set up with only a few lines of code, see figure 15. We have presented a technique that stabilizes the learning faster than traditional back-propagation (section 4.3). This is implemented in HCNNLab, and combined with the efficient GPU implementation, it makes HCNNLab a competent and powerful tool for achieving the goal of complex market analysis on a daily basis.

Is it possible to predict the financial markets using HCNN?

For our test data and time span, HCNN doesn't significantly beat the simpler prediction models. When it comes to hit rate, HCNN is better than


linear regression for one time step forward, but a naïve predictor based on the slope between the previous and current value gives a higher accuracy.

None of the gathered statistics and error measurements in section 6.2 (hit rates, Theil coefficient and MASE) indicate that HCNN has a strong edge in any of our data sets.

7.2 Further Work

7.2.1 Performance

As modern GPUs start to outperform CPUs, our implementation with CUDA is a suitable approach. With a combination of back-propagation and gradient limit, we can take on more challenging and complex data sets. It is still a bit of a trial-and-error approach to find the optimal configuration for each problem, and this could be further improved with a process that identifies the configuration within HCNNLab. We have limited our tests to weekly data, and it would be an idea to explore other time horizons as well.

As computational power increases, we could also try more complex problems and use more frequently sampled data.

7.2.2 Tests

One interpretation we can make from the results of the test runs is that the data needs to be more specific and chosen more carefully. We believe that the data selection process could be improved by profound data dynamics analysis, e.g. choosing an index and including specific time series that explain the underlying causes of the index price movements. In our tests, we approached the creation of our data sets with the fundamental idea that all time series influence each other. The other approach, to more specifically predict only a subset of the data set, could turn out to be more favourable.

The easiest way to find correlations between time series would be to use linear correlation [24]. It basically matches time series with similar linear dynamics with each other. The obvious weakness of this method is that more complex dynamics, as seen in the financial markets, with many different types of time series affecting each other, will not be found.
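As an illustration, a minimal MATLAB sketch of pairwise linear correlation between candidate time series; the matrix name, the 0.8 threshold and the selection step are our own assumptions.

% Pairwise linear (Pearson) correlation between candidate series.
% series holds one time series per column; corrcoef returns the
% full correlation matrix, from which strongly correlated pairs
% can be picked when composing a data set.
R = corrcoef(series);
[i, j] = find(triu(abs(R), 1) > 0.8);   % hypothetical 0.8 threshold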

Another way of building good data sets is to use a nonlinear method called sensitivity analysis, developed by Hans-Georg Zimmermann, Siemens [6, p. 57]. This method helps find complex nonlinear relations between the time series inside a given network, and ranks them depending on their influence.

There are also possibilities to further analyze the ensemble to estimate forecast risk. By, for example, looking at the standard deviation around the ensemble median, the network uncertainty could be interpreted as a risk measurement.
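A minimal sketch of such a risk proxy, assuming the ensemble is stored with one scenario per row; the normalization by the median is our own choice of illustration.

% Hypothetical risk proxy: spread of the ensemble around its
% per-step median (columns are future time steps).
spread = std(ensemble, 0, 1);                 % per-step std deviation
risk = spread ./ abs(median(ensemble, 1));    % relative uncertainty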


References

[1] D. Kriesel, "A Brief Introduction to Neural Networks". [Online]. Available: http://www.dkriesel.com/media/science/neuronalenetze-en-zeta2-2col-dkrieselcom.pdf. [Accessed Dec 30, 2014].

[2] R. Grothmann, "Multi-Agent Market Modeling Based on Neural Networks," Ph.D. Thesis, University of Bremen, Germany, 2002.

[3] MATLAB, MathWorks, Software. [Online]. Available: http://www.mathworks.com

[4] CUDA, NVIDIA Corporation, Software. [Online]. Available: http://www.nvidia.com

[5] H-G. Zimmermann, R. Grothmann, C. Tietz and H. Jouanne-Diedrich, "Market Modeling, Forecasting and Risk Analysis with Historical Consistent Neural Networks," Operations Research Proceedings 2010, Siemens AG, Corporate Technology, Munich, Germany, 2010. [E-book]. Available: SpringerLink

[6] Hans-Georg Zimmermann, "Neural Networks in System Identification, Forecasting & Control," in MEAFA workshop, 15-17 Feb, 2010, The University of Sydney, 2010.

[7] Y. LeCun, L. Bottou, G. B. Orr and K-R. Müller, "Efficient BackProp," Image Processing Research Department, AT&T Labs, Wilmette University, USA, 1998. [Online]. Available: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf. [Accessed Dec 28, 2014].

[8] R. J. Williams and David Zipser, "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," in Neural Computation, 1, pp. 270-280, 1989. [Online]. Available: ftp://ftp.ccs.neu.edu/pub/people/rjw/rtrl-nc-89.ps. [Accessed Dec 28, 2014].

[9] P. E. Gill, W. Murray and M. A. Saunders, "SNOPT: An SQP Algorithm for Large-Scale Constrained Optimization," in SIAM Journal on Optimization, volume 12, number 4, pp. 979-1006, 2002. [Online]. Available: http://web.stanford.edu/group/SOL/papers/SNOPT-SIGEST.pdf. [Accessed Dec 30, 2014].

[10] E. Alba and J. F. Chicano, "Training Neural Networks with GA Hybrid Algorithms," Departamento de Lenguajes y Ciencias de la Computación, University of Malaga, Spain, 2004. [Online]. Available: http://www.lcc.uma.es/~eat/pdf/gecco04f.pdf. [Accessed Dec 30, 2014].


[11] C. Tietz, "SENN V3.1 User Manual," Siemens AG, CT T, 2010.

[12] A. Pease, "The Science of Prediction," in Pictures of the Future, Siemens AG, 2011. [Online]. Available: http://www.siemens.com/innovation/pool/en/publikationen/publications_pof/pof_fall_2011/machine_learning/pof0211_ml_prognosen_en.pdf. [Accessed Jan 25, 2015].

[13] Wikipedia contributors, "Tcl," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Tcl. [Accessed Jan 25, 2015].

[14] MATLAB MEX, MathWorks, Software. [Online]. Available: http://se.mathworks.com/help/matlab/ref/mex.html

[15] CUSPARSE Hybrid Format (HYB), NVIDIA Corporation. [Online]. Available: http://docs.nvidia.com/cuda/cusparse/#hybrid-format-hyb

[16] Nathan Bell and Michael Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," December 11, 2008. [Online]. Available: http://sbel.wisc.edu/Courses/ME964/Literature/techReportGarlandBell.pdf. [Accessed March 8, 2015].

[17] Thomas Hellström and Kenneth Holmström, "Predicting the Stock Market," Department of Mathematics and Physics, Mälardalen University, Sweden, 1998. [Online]. Available: http://www.e-m-h.org/HeHo98.pdf. [Accessed Apr 6, 2015].

[18] Wikipedia contributors, "Efficient-market Hypothesis," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Efficient-market_hypothesis. [Accessed Apr 6, 2015].

[19] Wikipedia contributors, "Simple Linear Regression," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Simple_linear_regression. [Accessed Apr 6, 2015].

[20] Wikipedia contributors, "Market trend," in Wikipedia, The Free Encyclopedia. [Online]. Available: http://en.wikipedia.org/wiki/Market_trend. [Accessed Apr 7, 2015].

[21] Wikipedia contributors, "Mean absolute scaled error," in Wikipedia, The Free Encyclopedia. [Online].


Available: http://en.wikipedia.org/wiki/Mean_absolute_scaled_error. [Accessed Apr 12, 2015].

[22] Money-Zine, "The Stock Market Crash of 2008". [Online]. Available: http://www.money-zine.com/investing/stocks/stock-market-crash-of-2008/

[23] Pictures of the Future - Fall 2011, Siemens Technology Press and Innovation Communications, Munich, Germany. [Online]. Available: http://www.siemens.com/innovation/pool/en/publikationen/publications_pof/pof_fall_2011/machine_learning/pof0211_ml_prognosen_en.pdf

[24] Statlect, "Linear correlation". [Online]. Available: http://www.statlect.com/linear_correlation.htm. [Accessed Apr 25, 2015].
