
IN DEGREE PROJECT MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Optimal Order Execution using Stochastic Control and Reinforcement Learning

ROBERT HU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Optimal Order Execution using Stochastic Control and Reinforcement Learning

ROBERT HU

Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, year 2016

Supervisors at Lynx Asset Management: Tobias Rydén, Per Hallberg
Supervisor at KTH: Tatjana Pavlenko
Examiner: Tatjana Pavlenko

TRITA-MAT-E 2016:56
ISRN-KTH/MAT/E--16/56-SE
Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI, SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

In this thesis an attempt is made to find the optimal order execution policy that maximizes the reward from trading financial instruments. The optimal policies are found using a Markov Decision Process that is built from a state space model and the Bellman equation. Since there is no explicit formula for the state space dynamics, simulations on historical data are made instead to find the state transition probabilities and the rewards associated with each state and control. The optimal policy is then generated from the Bellman equation and tested against naive policies on out-of-sample data. This thesis also attempts to model the notion of market impact and tests whether the Markov Decision Process is still viable under the imposed assumptions. Lastly, there is also an attempt to estimate the value function using various techniques from Reinforcement Learning.

It turns out that naive strategies are superior when market impact is not present and when market impact is modeled as a direct penalty on reward. The Markov Decision Process is superior with market impact when it is modeled as having an impact on the simulations, although some results suggest that the market impact model is not consistent for all types of instruments. Further, approximating the value function yields results that are inferior to the Markov Decision Process, but interestingly the method exhibits an improvement in performance if the estimated value function is trained before it is tested.


Sammanfattning (Swedish abstract)

In this thesis an attempt is made to find the optimal order execution strategy that maximizes the profit from trading financial instruments. The optimal strategy is found using a Markov decision process built on a state space model and the Bellman equation. Since there is no explicit formula for the state dynamics, simulations on historical data are instead used to estimate the transition probabilities and the reward associated with each state and control signal. The optimal strategy is then generated from the Bellman equation and tested against naive strategies on test data. An attempt is also made to model market impact in order to test whether Markov decision processes remain viable under the assumptions made. Lastly, an attempt is also made to estimate the value function with various techniques from Reinforcement Learning.

It turns out that naive strategies are superior when market impact is not incorporated and when market impact is modeled as a penalty on the profit. Markov decision processes are superior when market impact is modeled as a direct effect on the simulations, although some of the results indicate that the model is not consistent for all types of instruments. Finally, approximating the value function gives worse results than Markov decision processes, but interestingly the method shows an improvement in performance if the estimated value function is trained before it is tested.


Acknowledgements

I would like to thank Tobias Rydén and Per Hallberg at Lynx Asset Management for giving me the opportunity to do this project and providing guidance and invaluable feedback throughout my time there. I enjoyed every minute of it and greatly appreciate that you took in an impatient 23-year-old with very little experience in the field.

I would also like to thank professor Tatjana Pavlenko for thoroughly guiding me through the process and inspiring me to come up with ideas when I was getting stuck. I greatly appreciate that you were supportive of my slightly overzealous ideas but also were firm when I was simply being unrealistic.

I would also like to thank KTH PDC for helping me in completing various heavy computations.

I would also like to thank Mom and Dad, for always telling me what I need to hear and not what I want to hear. Mom and Dad, I finally made it. Now I can start paying off the mortgage for my apartment.

Lastly, I would like to thank my friends for providing support and inspiration when I needed it most and couldn't take what Mom and Dad were saying. For some, our time as students together may end here, but our time as friends probably won't, ever.


Contents

1 Introduction

I   The Basics
2 Theory
    2.1 Terminology, Definitions and Theorems
3 Model
    3.1 Core Model
        3.1.1 State representation
        3.1.2 Control Representation
        3.1.3 State dynamics
        3.1.4 Objective function and reward definitions
        3.1.5 Building the MDP
    3.2 Previous Research
    3.3 Methodology
    3.4 Assumptions
        3.4.1 The Markov Property
4 Implementation
    4.1 Preprocessing
        4.1.1 Order book data
        4.1.2 Trades data
        4.1.3 Garman-Klass Yang-Zhang volatility
        4.1.4 Final output form
    4.2 Descriptives
    4.3 Simulation Algorithms
5 The Vanilla Model
    5.1 Benchmark algorithms
    5.2 Results and Comments

II  Market Impact
6 Model
    6.1 Assumptions
        6.1.1 Markov property
        6.1.2 Market Impact
    6.2 "MI" model
    6.3 "Lynx model"
7 Implementation
    7.1 Benchmark algorithms
    7.2 Results and Comments
        7.2.1 "MI" model
        7.2.2 "Lynx" model

III Reinforcement Learning
8 Theory
    8.1 A slight reformulation
    8.2 Reinforcement learning
9 Implementation
    9.1 Estimating the value function
        9.1.1 Transition probability
        9.1.2 Reward function
        9.1.3 Value function
    9.2 Simulation algorithm
    9.3 Results and Comments

IV  Discussion and Conclusions
10 Discussion
    10.1 Conclusions
        10.1.1 Further research

References
A Additional Plots
B Tables of parameter estimations


Chapter 1

Introduction

With rapidly evolving technology and the increased availability of information to the public, hedge funds are continuously pressured to develop new innovative methods to retain an advantage over the market.

One crucial problem is to determine how to execute trade orders, specifically when the orders are very large. When large orders for an asset are executed in bulk, this behavior will be observed by the public market and cause a rapid increase or decrease in the price of the traded asset, depending on the order executed.

The rapid change will, in either scenario, affect the profit of the hedge fund negatively in the sense that the profit margin of the order will decrease.

In this thesis we will explore methods using Markov decision processes and reinforcement learning to find the policy that optimizes order execution. We will first implement established methods and later expand around these models with new methods and concepts to test robustness and performance under new assumptions and circumstances. Further, we also implement "Reinforcement learning" by estimating the value function, using the discrete Markov models as data for the function estimation, and test its application on this type of problem.

The goal of this thesis is to model and quantify the aforementioned behavior of market impact, evaluate potential new methods and their feasibility, and present a framework to extract and evaluate the optimal trading strategy given a set of historical financial data.


Part I

The Basics


Chapter 2

Theory

In this introductory part we go through the theory and definitions needed to understand the core concepts of this thesis. The reader will find all terms and definitions used throughout the paper in the chapter below.

2.1 Terminology, Definitions and Theorems

Definition 1 (Order book). An order book is a list containing all the information about the current orders of an asset. It is of the form:

"Buy side" "Sell side"Volume Demanded Price bid Price asked Volume offered

$100 500$99 1000

Ask Price: $98 125400 $97 :Bid Price1000 $96600 $95

Table 2.1: Order book

Here "Ask Price" is the lowest price the sell side is willing to accept and "Bid price"is highest price the buy side is willing to offer. These two terms will be referredto as "ask" and "bid". We say that one "joins" the bid or ask price, when onesubmits an order at those price levels.

Definition 2 (Limit order). A limit order is an order placed with a brokerage tobuy or sell a set number of shares at a specified price or better during a specifiedtime frame. It may not be executed if the price set by the investor cannot be metduring the period of time in which the order is left open.


Definition 3 (Garman-Klass Yang-Zhang volatility estimator). The Garman-Klass Yang-Zhang volatility estimator is defined as follows:

$$\sigma_{YZ} = \sqrt{\frac{F}{N}}\,\sqrt{\sum_{i=1}^{N}\Big[(o_i - c_{i-1})^2 + 0.5\,(h_i - l_i)^2 - (2\ln 2 - 1)\,(o_i - c_i)^2\Big]} \qquad (2.1)$$

Here F denotes the number of closing prices in a year, N the number of historical prices used for the estimate, and i denotes the index of the historical price used. Further, o_i, h_i, l_i and c_i denote the Open, High, Low and Close prices for index i.
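As a concrete illustration, Eq. (2.1) can be computed directly from arrays of daily open/high/low/close prices. The Python sketch below is a minimal example, not the thesis code; the NumPy representation and the choice F = 252 are assumptions made only for this example.

    import numpy as np

    def gkyz_volatility(o, h, l, c, F=252):
        """Garman-Klass Yang-Zhang estimator, Eq. (2.1).

        o, h, l, c: arrays of open, high, low and close prices.
        F: number of closing prices per year (252 assumed here).
        The first close is only used as the previous close c_{i-1},
        so N = len(o) - 1 terms enter the sum.
        """
        o, h, l, c = map(np.asarray, (o, h, l, c))
        terms = ((o[1:] - c[:-1]) ** 2
                 + 0.5 * (h[1:] - l[1:]) ** 2
                 - (2.0 * np.log(2.0) - 1.0) * (o[1:] - c[1:]) ** 2)
        N = terms.size
        return np.sqrt(F / N) * np.sqrt(terms.sum())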

Definition 4 ("Submit and leave"). "Submit and leave" is a strategy used in ourmodel when executing an order. This means that the entire order of volume V assetsis submitted in the order book at price P, and remains unchanged until a certainamount of time has elapsed or the order has been fully executed.

Definition 5 (State). We say that xt is the state at time t if xt ∈ Rn for somefinite n. Further xt is defined:

xt =

x1...

xn−1t

Definition 6 (Transition probability). Given a control signal u_t, we say that the probability to transition from X_t = x_t to X_{t+1} = x_{t+1} is defined:

$$P_{x_t \to x_{t+1}}(u_t) = P\left[X_{t+1} = x_{t+1} \mid u_t,\, X_t = x_t\right] \qquad (2.2)$$

Further, given the state space S and control set U, the transition probabilities have the property:

$$\sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t) = 1 \qquad (2.3)$$

for any given control u_t ∈ U.

Definition 7 (Policy). A policy π is a collection of rules to choose control u_t given previous states {x_0, x_1, ..., x_t}. In a sense it can be seen as a mapping:

$$\pi(x_t): S \to U \qquad (2.4)$$

In the context of Markov models, it should be stressed that u_t can only be chosen from x_t alone due to the Markov property.

Definition 8 (Markov chain). A stochastic process {X_t, t = 0, 1, 2, ...} with a finite state space is a Markov chain if for all states {x_t | t ≥ 0} the following property is satisfied:

$$P[X_{t+1} = x_{t+1} \mid X_0 = x_0, \ldots, X_t = x_t] = P[X_{t+1} = x_{t+1} \mid X_t = x_t] \qquad (2.5)$$


This means that the probability of the stochastic process transitioning between states depends only on the current state.

Definition 9 (Reward function). A reward function is a mapping:

$$c(x_t, u_t): S \times U \to \mathbb{R} \qquad (2.6)$$

where t denotes the time index, S is the state space and U is the control space.

In the context of order execution, the reward can be seen as a relative economical loss from unfavorable trades, meaning that it will most likely have a negative sign.

Definition 10 (Value function). Assume S and U are finite. Let V_π(x_t, u_t) be the total reward in state x_t given control u_t up to time t, and let c(x_t, u_t) be some reward at time t. Then given a finite T ≥ t and a policy π, V_π(x_T, u_T) is expressed as:

$$V_\pi(x_T, u_T) = E_\pi\left[\sum_{t=0}^{T} c(x_t, u_t) \,\Big|\, x_0 = x\right], \qquad (2.7)$$

where x is some given initial state and E_π is the conditional expectation given that π is used as a policy, which is a rule to select a control u_t given x_t.

Theorem 1 (Bellman equation). The optimal control u_t at time t that maximizes V(x_t, u_t) is given by the following equation for the optimal total reward V(x_t, u_t):

$$V(x_t, u_t) = \max_{u_t \in U}\Big\{c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V(x_{t+1}, u_{t+1})\Big\} \qquad (2.8)$$

Further, the optimal policy can be obtained from this as $\pi(x_t) = \arg\max_{u_t \in U} V(x_t, u_t)$.

It should be noted that the value function is calculated recursively, starting from the terminal state t = T and going back to t = 0.

Proof. Assume V(x_t, u_t) is the optimal total reward with corresponding optimal policy π, and define

$$Z(x_t, u_t) := \max_{u_t \in U}\Big\{c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V(x_{t+1}, u_{t+1})\Big\}.$$

The equality V(x_t, u_t) = Z(x_t, u_t) is proven by establishing that:

$$V(x_t, u_t) \le Z(x_t, u_t), \qquad V(x_t, u_t) \ge Z(x_t, u_t).$$

We start with V(x_t, u_t) ≤ Z(x_t, u_t). It is enough to show that V_π(x_t, u_t) ≤ Z(x_t, u_t) for any policy π. Assume that the probability of policy π selecting control u_t is P[u_t | π]. Then we have that:

$$V_\pi(x_t, u_t) = \sum_{u_t \in U} P[u_t \mid \pi]\Big\{c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V_\pi(x_{t+1}, u_{t+1})\Big\}.$$

But V(x_{t+1}, u_{t+1}) is optimal and satisfies V(x_{t+1}, u_{t+1}) ≥ V_π(x_{t+1}, u_{t+1}). Then it follows that:

$$V_\pi(x_t, u_t) \le \sum_{u_t \in U} P[u_t \mid \pi]\Big\{c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V(x_{t+1}, u_{t+1})\Big\} \le Z(x_t, u_t).$$

We now show that V(x_t, u_t) ≥ Z(x_t, u_t), which is done by showing that there exists a policy π such that V_π(x_t, u_t) ≥ Z(x_t, u_t). Let each control u_t given by this policy π be defined by:

$$u_t \in \arg\max_{u_t \in U}\Big\{c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V(x_{t+1}, u_{t+1})\Big\}.$$

Then using this policy at every stage yields:

$$V_\pi(x_t, u_t) = c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V_\pi(x_{t+1}, u_{t+1}) = c(x_t, u_t) + \sum_{x_{t+1} \in S} P_{x_t \to x_{t+1}}(u_t)\, V(x_{t+1}, u_{t+1}) = Z(x_t, u_t).$$

Hence we have constructed a policy π that satisfies V(x_t, u_t) ≥ Z(x_t, u_t).
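To make the backward recursion of Theorem 1 concrete, the Python sketch below runs a finite-horizon value iteration over tabulated rewards and transition probabilities. It is a minimal sketch under assumed data structures (dictionaries `P` and `C` as hypothetical stand-ins for the containers produced later by simulation), not the thesis implementation; in particular, treating the terminal stage as reward-only is an assumption of this example.

    def backward_value_iteration(states, controls, P, C, T):
        """Finite-horizon Bellman recursion, Eq. (2.8).

        P[(x, u)] is a dict {x_next: probability}; C[(x, u)] is the reward
        c(x, u); both are assumed given (e.g. estimated from simulations).
        Returns the value V[t][x] and the greedy policy pi[t][x].
        """
        V = [dict() for _ in range(T + 1)]
        pi = [dict() for _ in range(T + 1)]
        for x in states:                     # terminal stage: no future term
            pi[T][x] = max(controls, key=lambda u: C[(x, u)])
            V[T][x] = C[(x, pi[T][x])]
        for t in range(T - 1, -1, -1):       # recurse from t = T-1 down to 0
            for x in states:
                def q(u):
                    future = sum(p * V[t + 1][x1]
                                 for x1, p in P[(x, u)].items())
                    return C[(x, u)] + future
                best_u = max(controls, key=q)
                pi[t][x] = best_u
                V[t][x] = q(best_u)
        return V, pi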

Theorem 2 (Glivenko-Cantelli). Let X_1, ..., X_n be a collection of i.i.d. random variables with cdf F_X, and let F_n(x) denote the empirical distribution

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, x)}(X_i), \qquad (2.9)$$

where I_C is the indicator function of the set C. Then, as n → ∞,

$$P\left[\sup_{x \in \mathbb{R}} |F_n(x) - F_X(x)| \to 0\right] = 1 \qquad (2.10)$$


Chapter 3

Model

In this chapter we present the basics of our model and all its key characteristics.

3.1 Core Model

Our core model is a Markov Decision Process (MDP) built on a discrete state space. In this section we will describe the state and control spaces used to build the MDP.

3.1.1 State representation

In order to capture the dynamic behavior of the order book we use a discrete state space model:

$$x_t \in S \subset \mathbb{Z}^3$$

Here our three state variables are:

1. x_{t,1} := Volume state i ∈ {0, 1, ..., I}, I ∈ N. Assuming we are given an initial volume of V units of stock to process, state x_{t,1} = i means that we have i · V/I stocks left to process.

2. x_{t,2} := Time state t ∈ {0, 1, ..., τ}, τ ∈ N. Assuming we are given an initial time frame of H minutes to conduct our operations, state x_{t,2} = t means that we have (τ − t) · H/τ minutes left to process the remaining volume x_{t,1} · V/I. The terminal state is thus x_{t,2} = τ, where we have 0 minutes left to execute an order. In this scenario, we must finish the remainder of the unexecuted volume by buying or selling whatever is left, "eating through" the levels of the order book at the specific data point in the simulation where this occurs. It should be noted that the subindex t in x_t denotes this time state.

For the remainder of this thesis we will refer to the length in minutes of one unit of time state (H/τ) as the time period and to H as the session time.


3. x_{t,3} := Trend state θ ∈ {−1, 1}. Here 1 indicates that the trend is moving in a favorable direction given our direction of trading. As an example, if we are selling instruments, then a price trend moving higher counts as favorable. Likewise, −1 indicates that the trend is moving in an unfavorable direction. We say that this particular state is a market variable, as it depends only on market data, while the previous two variables are private and also depend on the simulation outcome.

The explicit expression for the trend is:

$$f\!\left(p_i - p_{i-\frac{H}{\tau}}\right) = \begin{cases} \operatorname{sgn}\!\left(p_i - p_{i-\frac{H}{\tau}}\right), & \text{if we are selling} \\ -\operatorname{sgn}\!\left(p_i - p_{i-\frac{H}{\tau}}\right), & \text{if we are buying} \end{cases}$$

where p_i is the mean of bid and ask at data point i, H/τ is the same as mentioned earlier, and sgn(x) denotes the sign function.

We denote the terminal state as x_τ, where x_τ := [i, τ, θ]^T. The state space thus becomes:

$$S := \{0, 1, \ldots, I\} \times \{0, 1, \ldots, \tau\} \times \{-1, 1\} \qquad (3.1)$$

3.1.2 Control Representation

In order to determine the optimal strategy for processing an order, we define our control space as follows:

$$u_t \in U \subset \mathbb{Z}^2$$

Here our control variables are:

1. u_{t,1} := Price increments above the "join" price submitted, a ∈ A, where A = {−6, −5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5, 6}. The value of one price increment is chosen to be the smallest allowed change in the submission price of the instrument, known as the "tick size". We will refer to A as the "action space". As an example, the most aggressive order when buying would be a = 6 and the most passive order would be a = −6.

2. u_{t,2} := Volume submitted ϑ ∈ {0, 1, ..., I}, I ∈ N. The volume we submit in the order book is bounded above by our current volume state x_{t,1}, since we cannot submit a larger order than we have left.

As an example when buying, using control u_t = [a, ϑ]^T at time state t would imply submitting an order of size ϑ · V/I at price level p_join + a · γ, where γ is the tick size in the order book. It should be noted that when selling, the price level becomes p_join − a · γ. An aggressive order (a > 0) would here mean submitting an order at a lower price than the ask.


It should be noted that the subindex t in u_t denotes the time state in which the control acts. Thus our control space is defined:

$$U := A \times \{0, 1, \ldots, I\} \qquad (3.2)$$
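Since S and U in Eqs. (3.1) and (3.2) are small finite grids, they can be enumerated directly. The Python sketch below does so for illustrative sizes I = 5 and τ = 4; these concrete values are assumptions for the example only, not the settings used in the thesis.

    from itertools import product

    I, tau = 5, 4                  # example sizes, not the thesis settings
    A = list(range(-6, 7))         # price increments in ticks

    # S = {0,...,I} x {0,...,tau} x {-1, 1}, Eq. (3.1)
    S = list(product(range(I + 1), range(tau + 1), (-1, 1)))
    # U = A x {0,...,I}, Eq. (3.2)
    U = list(product(A, range(I + 1)))

    print(len(S), len(U))          # (I+1)(tau+1)*2 states, 13*(I+1) controls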

3.1.3 State dynamics

When modeling the state transition, we use a Markov chain that we extend to an MDP. We thus assume that the transition between states occurs with a probability that depends on the current state x_t and the control used u_t. With this model, no explicit representation of the state dynamics of the form:

$$\dot{x}_t = f(x_t, u_t) \qquad (3.3)$$

where f: Z^5 → Z^3, is needed. Here, \dot{x}_t denotes the time derivative of the state x_t.

What we need instead are the transition probabilities P_{x_t→x_{t+1}}(u_t) between every pair of states, given some control u_t. These probabilities are estimated from simulations using historical order book and trade data.

The simulations estimate how a control signal u_t changes x_t during H/τ minutes forward in time, using the order book data sets OB and trade data sets TD.

3.1.4 Objective function and reward definitions

In order to measure what really is the optimal trading strategy, we need to quantify each state. This is done by the following three functions:

1. Execution reward function, c_ex(x_t, u_t): S × U → R, where c_ex(x_t, u_t) is explicitly defined and calculated as:

$$c_{ex}(x_t, u_t, \alpha) := \frac{(-1)^{\alpha}}{N(x_t, u_t)}\sum_{i=1}^{N}\frac{p_i - p_i^{mid}}{\sigma_i^{YZ}} \qquad (3.4)$$

Here index i denotes the i:th data point, N(x_t, u_t) is the total number of times we have simulated state x_t with control u_t over all data points, and α ∈ {0, 1}. If α = 1, we are considering the case of buying assets, and if α = 0 we are selling assets.

The variable p_i is the average price per share obtained/sold off by simulating state x_t with control u_t in data point i, p_i^{mid} is defined as (bid + ask)/2, where the bid and ask are taken from the order book at the beginning of our simulation, and σ_i^{YZ} is the standard deviation in price corresponding to data point i.


Intuitively, this function measures the relative loss in average price per volatility when submitting an order, which can be seen as a control signal u_t.

2. Unexecuted reward function, c_un(x_t, u_t): S × U → R, where c_un(x_t, u_t) is explicitly defined:

$$c_{un}(x_t, u_t, \alpha) := \frac{(-1)^{\alpha}}{N(x_t, u_t)}\sum_{i=1}^{N}\frac{p_i - p_i^{mid}}{\sigma_i^{YZ}} \qquad (3.5)$$

Here everything is the same as in the execution reward function, except that we in this case consider p_i, which denotes the price per share left to buy/sell after simulating state x_t with control u_t in data point i.

Intuitively, this function measures the relative loss in average price of not getting a certain fraction of V executed. The reason for imposing this reward is to capture the behavior of price movements over longer periods when building up the optimal value function. This notion will be discussed more thoroughly in the section on previous research.

3. Total reward function, V_π(x_t, u_t): S × U → R, where V_π(x_t, u_t) is explicitly defined:

$$V_\pi(x_t, u_t) = E_\pi\left[\sum_{j=0}^{t}\frac{c_{ex}(x_j, u_j) + c_{un}(x_j, u_j)}{x_{j,1}}\,\Big|\, x_0 = x\right] \qquad (3.6)$$

where π is some given policy.

For the remainder of this thesis, we will use the notion of slippage to describe the reward of buying/selling at a more expensive/cheaper price than the reference price. In the context above, it is the sum of the unexecuted reward and the executed reward, where the reference price is always taken only one time period back. In a later chapter, slippage will refer only to the executed reward, since we there take a reference several time periods back.
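Written out for a single simulated data point, the contribution to the execution reward in Eq. (3.4) is just the signed, volatility-normalised difference between the obtained average price and the mid price. The helper below is a hypothetical Python illustration of that single term (the function name and arguments are assumptions for the example); averaging it over the N(x_t, u_t) simulations gives c_ex.

    def execution_reward_term(avg_price, bid, ask, sigma_yz, buying):
        """One summand of Eq. (3.4): (-1)^alpha * (p_i - p_i^mid) / sigma_i^YZ.

        avg_price: average price per share obtained (or sold) in the simulation.
        buying: True for alpha = 1 (buying), False for alpha = 0 (selling).
        """
        p_mid = 0.5 * (bid + ask)
        sign = -1.0 if buying else 1.0   # (-1)^alpha with alpha = 1 when buying
        return sign * (avg_price - p_mid) / sigma_yz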

3.1.5 Building the MDP

In order to build the MDP we need to estimate the following through simulation:

1. The transition probabilities P_{x_t→x_{t+1}}(u_t) for all x_t ∈ S and u_t ∈ U. The transition probabilities from x_t to x_{t+1} given control u_t are estimated in the following way:

a) For every data point in the data set, count the total number of times N(x_t, u_t) that state x_t is simulated with control u_t.

b) Count the number of transitions N(x_{t+1}, u_t) to state x_{t+1} obtained by simulating state x_t with control u_t.


c) The transition probability P_{x_t→x_{t+1}}(u_t) is then estimated as:

$$P_{x_t \to x_{t+1}}(u_t) = \frac{N(x_{t+1}, u_t)}{N(x_t, u_t)} \qquad (3.7)$$

which essentially means taking the empirical probability as an estimate.

2. The executed reward c_ex(x_t, u_t) for all x_t ∈ S and u_t ∈ U. This reward is simulated over all given data points and estimated by taking the average of all simulations over all data points.

3. The unexecuted reward c_un(x_t, u_t) for all x_t ∈ S and u_t ∈ U. This reward is simulated over all given data points and estimated by taking the average of all simulations over all data points.
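The counting estimates above amount to maintaining a few tallies per (state, control) pair. The Python sketch below shows the bookkeeping under the assumption that each simulation returns the next state and the two rewards; the container names are hypothetical.

    from collections import defaultdict

    counts = defaultdict(int)          # N(x_t, u_t): times (x, u) was simulated
    trans_counts = defaultdict(int)    # N(x_{t+1}, u_t): observed transitions
    reward_sums = defaultdict(float)   # running sum of c_ex + c_un

    def record(x, u, x_next, c_ex, c_un):
        """Accumulate one simulated outcome for the pair (x, u)."""
        counts[(x, u)] += 1
        trans_counts[(x, u, x_next)] += 1
        reward_sums[(x, u)] += c_ex + c_un

    def estimates(x, u, x_next):
        """Empirical P_{x -> x_next}(u), Eq. (3.7), and the average reward."""
        n = counts[(x, u)]
        p_hat = trans_counts[(x, u, x_next)] / n
        c_hat = reward_sums[(x, u)] / n
        return p_hat, c_hat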

3.2 Previous Research

A large part of this thesis serves as an extension to Nevmyvaka et al.'s paper [3]. In their paper they also use a state-based execution and only consider the "submit and leave" strategy when executing orders.

In this thesis we have extended the dimensions of the control space used in the simulation and added two models of market impact to see how the optimal strategy behaves and performs under market impact. Besides the theoretical extension, we also take a different approach in implementation, specifically in generating the optimal strategy. In Nevmyvaka et al., an implicit recursive method is used for every time state, with a look-back in the data for a reference price matching the number of time periods progressed in the simulation. In this thesis we use the Markov assumption more "liberally" and impose an "unexecuted reward" in order to capture the behavior of "looking back", while only simulating one time period and using this one-period simulation in a Bellman equation, obtaining the same optimal policy as Nevmyvaka et al. with fewer computations.

Wilcox and Hendricks' paper [4] also extends Nevmyvaka et al.'s model, but instead looks at different market variables and uses an action based solely on the submission volume obtained from Chriss and Almgren's model [1]. In this thesis we likewise take inspiration from [1] but use it in a different fashion than suggested by Wilcox and Hendricks.

In addition to the further exploration of discrete modeling of optimal order execution strategies, we also make an attempt to apply some concepts of reinforcement learning to our problem. These concepts are largely based on [10] and Silver et al.'s paper [2]. Silver also mentions the use of Neural Networks as function approximators, but in this context we choose differentiable functions instead due to the small amount of data used.


Lastly, we have chosen a discrete Markov model due to its flexibility in modeling state space dynamics without an explicit representation, and its computational advantage over the otherwise well-renowned SDE approach suggested by Pham and Kharroubi in [7].

3.3 Methodology

As mentioned, the goal of this thesis is both to extract and to evaluate the optimal trading strategy from historical financial data. This section focuses on describing how the extracted strategies are evaluated, as we have already gone through how the optimal trading strategies are extracted.

In order to quantify each obtained strategy, the reward generated from a simulation using that strategy on out-of-sample data becomes our basis for comparison. The comparison between optimal trading strategies aims to determine which strategies work better in which contexts, for example:

1. Does having a trend state really improve performance?

2. Does setting I = 20 yield better performance than setting I = 5?

Since the obtained strategies give no context as to whether the entire method in itself is viable or not, the generated costs are compared to the costs obtained from running simulations with "naive strategies". Besides the naive strategies, we also compare the model with extended control space to Nevmyvaka et al.'s model [3], which only utilizes the "submit and leave" strategy. We give a description of these "naive strategies" and of how the simulation is run on out-of-sample data at the end of this part.

Additionally, the data we use for our simulations is a year's worth of data, and the data we use to test our model is half a year's worth of data. Since the simulations are computationally heavy, all computations are carried out on large multi-processor high-performance computers.¹

3.4 Assumptions

3.4.1 The Markov Property

Per definition, our model assumes that each state transition depends only on the current state. This assumption limits the model's capability to capture longer trends in the price evolution.

¹ See Tegner at KTH PDC.


We still attempt to incorporate the effects of a trend by introducing a trend state, which represents whether the price has moved in a beneficial direction during the last T minutes, given that we are buying or selling an order. As a consequence of the Markov property, the probability that the trend changes state depends only on the current state.


Chapter 4

Implementation

4.1 Preprocessing

When running our simulations, we need to transform the supplied raw data into a usable form and extract certain variables from it. In our case the data is not necessarily very large, but it is very complex in its structure. In this section we describe the preprocessing of the order book data and the trading data.

4.1.1 Order book data

When the order book data is loaded, we first compare it to the internal trading schedule to check whether there are any missing data points on a minute-wise basis. Any missing order book data point is replaced with a "Zero Book":

        "Buy side"                          "Sell side"
  Volume demanded   Price bid        Price asked       Volume offered
                                     $0                0
                                     $0                0
                                     Ask price: $0     0
  0                 $0 (Bid price)
  0                 $0
  0                 $0

Table 4.1: "Zero Book"

The idea of this specific replacement is to keep consistent indexing during simulation, since omitting a missing point would make the translation from time to index more difficult during simulation.

When there are missing entries in the order book, we make the bold assumption that the missing value takes the value of the previous entry. As an example, if an order book data entry looks like the following:


[Figure 4.1: Missing data points in the order book]

Here we see that the order book data is arranged in minute increments between consecutive data points. Hence our assumption that missing values are replaced by the latest values in each column is equivalent to assuming that no new order with a new price or volume at the specific level was placed when a data point is missing.
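In practice, this "carry the previous entry forward" rule is a column-wise forward fill. The Python sketch below is a minimal example assuming the minute-indexed order book is held in a pandas DataFrame; the column names are hypothetical and not taken from the thesis.

    import pandas as pd

    def fill_order_book(ob: pd.DataFrame) -> pd.DataFrame:
        """Replace missing price/volume entries with the previous minute's values.

        ob: DataFrame indexed by minute with (hypothetical) columns such as
        'bid_price_1', 'bid_vol_1', ..., 'ask_price_1', 'ask_vol_1', ...
        Minutes missing entirely are assumed to have been inserted as Zero Books.
        """
        return ob.ffill()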

4.1.2 Trades data

Our trading data after preprocessing is of the following form, for some minute in the data:

  Price     Volume
  $127      200 units
  $126.5    5000 units
  $126      200 units
  $125.5    2500 units

Table 4.2: Trading data, for a given date and time

Before we achieve this form, the raw data is of the following form:


[Figure 4.2: Raw trades data]

In order to get to our desired form, we replace every missing data point on a minute-wise basis with a so-called "Zero Trade". It is heuristically the same thing as the "Zero Book". It looks like the following:

  Price   Volume
  $0      0 units

Table 4.3: "Zero Trade"

This is the only data replacement we make on the trading data.

4.1.3 Garman-Klass Yang-Zhang volatility

The Garman-Klass Yang-Zhang volatility is calculated according to Definition 3. It is calculated for each day, so for all minute-wise data within one day the Garman-Klass Yang-Zhang volatility is the same.


4.1.4 Final output form

After the order book and trades data have been preprocessed into the forms above, each order book and trades table represents, respectively, the order book and the trades that occurred during one minute. We then group all minute-wise data into day-wise data, in the sense that a day-wise collection contains several hundred minute-wise data points in chronological order. The day-wise data is then grouped into a year. It should thus be noted that when we write "For every data point OB_i in OrderBookData", we mean "for every minute in every day in every year".

4.2 Descriptives

The data consists of three forward contracts with various properties, described in the table below. Due to compliance reasons, the three instruments are given generic names:

  Property                                           Instrument A   Instrument B   Instrument C
  Price tick size                                    $0.015625      $0.1           $0.005
  Average YZ volatility (2015)                       0.4269         48.0084        0.0290
  Price tick to average YZ volatility ratio (2015)   0.0365         0.0021         0.1724
  Average price (2015)                               $127.6018      $4428.5        $111.3
  Total volume traded in units (2015)                176 567 869    51 299 383     44 452 223
  Average trades per minute (2015)                   1782           512            297

Table 4.4: Instrument characteristics

In addition to the above table, we also present some illustrative plots of the average daily behavior and the yearly evolution of a few metrics described below:


[Figure 4.3: Average daily traded volume for all three instruments. Panels (a)-(c): instruments A, B and C; x-axis: time of day; y-axis: volume traded, daily average.]

In the above plots, each red dot represents a trade in the trades data that occurred at the time of day indicated on the x-axis. The blue line among the red dots is the average of all trades for each time point on the x-axis.


[Figure 4.4: Accumulated volume available in the order book. Panels (a)-(c): instruments A, B and C; panel title: "Volume OB accumulated sold and bought, total daily basis"; x-axis: time of day; curves for order book depths 1 to 10.]

Here, each plot describes how much volume is available in the order book at the time indicated on the x-axis. The volume at each level in the order book is accumulated from the bid/ask to the lowest/highest level. Note that we are taking the average of the buy and sell cases.


4.3 Simulation Algorithms

In this section, we give a top-down explanation of our simulations and algorithms. All the algorithms are expressed in "pseudo-code". Our first algorithm is our "main loop", which estimates everything needed in order to build the MDP.

Algorithm 1 Main Loop
1:  procedure MainLoop(OrderBookData, TradesData, H, σYZ_Year, τ, I, V, TrendData, A)
2:    Initialize c_ex container C_EX
3:    Initialize c_un container C_UN
4:    Initialize transition counter container TCounter
5:    Initialize simulation counter container Counter
6:    for t = 0 to τ do                                 ▷ Starting from no time left
7:      for every data point OB_i in OrderBookData do   ▷ Take OB_i as the starting point of our simulation
8:        σYZ = GetCorrectYZ(i, σYZ_Year)               ▷ σYZ is calculated per day; this gets the correct σYZ given the minute index i
9:        Define the current trend state θ from data point i
10:       for every x_t ∈ S do
11:         for every u_t ∈ U do
12:           (x_{t+1}, c_ex, c_un) = BasicSimulation(x_t, u_t, OB_i, OrderBookData, TradesData, σYZ, TrendData, H/τ)
13:           Update C_EX, C_UN, Counter, TCounter
14:   Define transition probability container P := TCounter/Counter   ▷ Element-wise division
15:   Define reward container C := (C_EX + C_UN)/Counter              ▷ Element-wise division
16:
17:   OptimalCost := BellmanIterator(C, P)
18:   OptimalCostPerUnit := NormalizeCost(OptimalCost, V, I)           ▷ Normalize costs against volume
19:   Generate optimal policy π := GetOptimalPolicy(OptimalCostPerUnit)
20:   return OptimalCostPerUnit, P, C

The above procedure is the core script that builds our MDP. It covers all the operations needed to build the MDP that will generate the optimal policy. It essentially "tests" every possible action for every possible state for every available data point and extracts the necessary results. In this context, every data point can be seen as a starting point for our simulations. It is important to note that when we run our simulations, the costs have the unit Price difference × Volume / Volatility for practical reasons; we specifically normalize against the volume because of this. It should also be noted that we run the simulations for both the buy and the sell case, and base our optimal policy on an average of the slippage costs from both cases.

Algorithm 2 Basic Simulation
1:  procedure BasicSimulation(x_t, u_t, OB_i, OrderBookData, TradesData, σYZ, TrendData, H/τ)
2:    if t = τ then
3:      if Volume state = 0 then
4:        Executed reward = 0
5:        Unexecuted reward = 0
6:        Get trend state θ from current data point i
7:        Set x_{t+1} = [0, t+1, θ]
8:        return x_{t+1}, Executed reward and Unexecuted reward
9:      else
10:       Bid-Ask = GetBidAsk(OB_i)                     ▷ GetBidAsk(OB_i) returns the bid or ask price
11:       Reference Price = (bid_i + ask_i)/2
12:       Get x_{t+1} and Executed reward from EatThroughOrderBook(x_t, OB_i, σYZ, Reference Price)   ▷ Simulate "eating through" the order book in the terminal state
13:       Unexecuted reward = 0
14:       Set x_{t+1} = [0, t+1, θ]
15:       return x_{t+1}, Executed reward and Unexecuted reward
16:   else
17:     Bid-Ask = GetBidAsk(OB_i)                       ▷ GetBidAsk(OB_i) returns the bid or ask price
18:     Reference Price = (bid_i + ask_i)/2
19:     Get x_{t+1}, Executed reward and Unexecuted reward from
20:     SimulateTrades(x_t, u_t, OB_i, OrderBookData, TradesData, σYZ, TrendData, H/τ, Reference Price)   ▷ Simulate trades when not in the terminal state
21:     return x_{t+1}, Executed reward and Unexecuted reward

Our algorithm consists of two stages: the first when we are at the terminal state (t = τ) and the other when we are not (t < τ). Depending on which stage we are in, the simulation algorithms differ. In the first stage, we must finish the remainder of our order and thus buy/sell everything available in the order book; this is referred to as "eating through the order book" in the algorithm above. In the second stage we are allowed to choose what order to submit, which requires a simulation of how the order is executed given the historical data.

Algorithm 3 "Eating through the order book"1: procedure EatThroughOrderBook(xt, OBi, σY Z ,Reference Price)

24

Page 37: Optimal Order Execution using Stochastic Control and ...kth.diva-portal.org/smash/get/diva2:963057/FULLTEXT01.pdfOptimal Order Execution using Stochastic Control and Reinforcement

4.3. SIMULATION ALGORITHMS

2: Available Volume=GetOrderBookVolume(OBi) .GetOrderBookVolume(OBi), returns the current total available volume in theorder book

3: if xt,1 > Available Volume then4: Total Price=GetTotalPrice(OBi) . The price of buying everything

available in the order book5: End Price=EndPrice(OBi) . Get the lowest/highest price in OBi,

depending on if we are selling or buying6: End Volume=EndVolume(OBi, cvol) . Get the volume available at that

level7: Total reward All= SumBeyondOrderBook(End Price,End Volume,xt,1 −Available Volume) +Total Price . reward of buying entire order andgoing outside of the order book

8: return Executed reward= Total reward All−xt,1·Reference PriceσY Z

9: else10: Bought Volume=0 . Initialize variable11: Total reward All=0 . Initialize variable12: index=1 . Let index denote the position of the bid/ask price in the

order book13: while Bought Volume<xt,1 do14: Bought Volume=Bought Volume+OBV

i (index) . Start by buyingfrom the bid/ask

15: Total reward All=Total reward All+OBVi (index) ·OBP

i (index)16: index=index+117:18:19: return Executed reward= Total reward All−xt,1·Reference Price

σY Z

20:

When "Eating through the order book", there are two scenarios. The first scenariobeing that given all the available volume in the order book, this is not enough tocover our remaining order and the other being that it is enough. In the secondscenario, the procedure is very straightforward in the sense that we only need tobuy/sell everything available at each level in the order book until we have finishedour order. The first scenario is slightly more complicated since we must estimatethe volume "beyond" the order book. This involves making an assumption on thevolume available. In our case, we have assumed that volume available is half thevolume available in the deepest part in the order book thus imposing a strongerpunishment on having too large volumes unfinished.

Algorithm 4 SumBeyondOrderBook
1:  procedure SumBeyondOrderBook(End Price, End Volume, Rest Volume)
2:    Levels outside order book = ⌈ Rest Volume / (End Volume / c_V) ⌉   ▷ Calculate how many levels we must go outside of the order book
3:    reward = 0
4:    for i = 1 to Levels outside order book do
5:      reward = reward + (End Volume / c_V) · (i · c_P · Price tick + End Price)
6:    return reward

In the algorithm above, we have that cP = 1 and cV = 2 given our assumptions.

Algorithm 5 Trading simulation
1:  procedure SimulateTrades(x_t, u_t, OB_i, OrderBookData, TradesData, σYZ, TrendData, H/τ, Reference Price)
2:    Forward Simulation Price = (bid_{i+H/τ} + ask_{i+H/τ})/2   ▷ Mid price H/τ minutes into the future
3:    if u_{t,2} = 0 then
4:      Update x_t to x_{t+1}
5:      Executed reward = 0
6:      Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
        return Executed reward, Unexecuted reward and x_{t+1}
7:    Total obtained volume = 0
8:    if u_{t,1} ≤ 0 then                                ▷ If we submit an order with lower/higher price than bid/ask
9:      for j = i to i − 1 + H/τ do
10:       Trades data = TradesData(j)
11:       if queue then                                  ▷ If someone has placed an order before us in the order book
12:         Queue Volume = GetQueueVolume(u_t, OB_i)     ▷ Get the size of the orders placed before us
13:         Acquired Volume = GetVolume(u_t, TradesData) ▷ Given a control u_t, how much did we get according to the trades data
14:         if Queue Volume − Acquired Volume < 0 then
15:           Actually obtained volume = −(Queue Volume − Acquired Volume)
16:           queue = false
17:           Total obtained volume = Total obtained volume + Actually obtained volume
18:           if Total obtained volume = u_{t,2} then
19:             Update x_t to x_{t+1}
20:             Executed reward = (bid/ask + u_{t,1} · price tick − Reference Price) · u_{t,2}/σYZ
21:             Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
22:             return Executed reward, Unexecuted reward and x_{t+1}
23:         else
24:           Queue Volume = Queue Volume − Acquired Volume
25:       else
26:         Queue Volume = GetQueueVolume(u_t, OB_i)     ▷ Get the size of the order placed before us
27:         Acquired Volume = GetVolume(u_t, TradesData) ▷ Given a control u_t, how much did we get according to the trades data
28:         Total obtained volume = Total obtained volume + Acquired Volume
29:         if Total obtained volume = u_{t,2} then
30:           Update x_t to x_{t+1}
31:           Executed reward = (bid/ask + u_{t,1} · price tick − Reference Price) · u_{t,2}/σYZ
32:           Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
33:     Update x_t to x_{t+1}
34:     Executed reward = (bid/ask + u_{t,1} · price tick − Reference Price) · u_{t,2}/σYZ
35:     Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
36:     return Executed reward, Unexecuted reward and x_{t+1}
37:   else                                               ▷ If we submit an order with higher/lower price than bid/ask
38:     Immediately obtained volume = GetImmediateVol(OB_i)   ▷ In this case, we immediately obtain what is offered in the order book
39:     if Immediately obtained volume ≥ u_{t,2} then
40:       Update x_t to x_{t+1}
41:       Total price = EatThroughOrderBook(u_{t,2}, OB_i, σYZ)   ▷ If there is enough to cover our order, then we are better off eating through the order book than paying too much
42:       Executed reward = (Immediately obtained volume − Reference Price · u_{t,2})/σYZ
43:       Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
44:       return Executed reward, Unexecuted reward and x_{t+1}
45:     Penalty Volume = 0                               ▷ Since we have already obtained the immediately available volume
46:     Total obtained volume = 0
47:     for j = i to i − 1 + H/τ do
48:       Trades data = TradesData(j)
49:       Acquired Volume = GetVolume(u_t, TradesData)   ▷ Given a control u_t, how much did we get according to the trades data
50:       Penalty Volume = Penalty Volume + Acquired Volume
51:       if Penalty Volume ≥ Immediately obtained volume then
52:         Total obtained volume = Penalty Volume − Immediately obtained volume
53:         if Total obtained volume + Immediately obtained volume = u_{t,2} then
54:           Total price = EatThroughOrderBook(Immediately obtained volume, OB_i, σYZ)
55:           Update x_t to x_{t+1}
56:           Executed reward = ((bid/ask + u_{t,1} · price tick) · Total obtained volume + Total price − Reference Price · u_{t,2})/σYZ
57:           Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
58:           return Executed reward, Unexecuted reward and x_{t+1}
59:     Total price = EatThroughOrderBook(Immediately obtained volume, OB_i, σYZ)
60:     Update x_t to x_{t+1}
61:     Executed reward = ((bid/ask + u_{t,1} · price tick) · Total obtained volume + Total price − Reference Price · u_{t,2})/σYZ
62:     Unexecuted reward = (Forward Simulation Price − Reference Price) · x_{t+1,1}/σYZ
63:     return Executed reward, Unexecuted reward and x_{t+1}

In the case when we are allowed to choose which order to submit, we must simulate the outcome of the order. In this context, an order is synonymous with a control u_t. The procedure consists of two cases: the first when the submitted order is aggressive, meaning that we submit a price at least one tick above/under the bid/ask, and the second when we submit an order that is not aggressive. We start by describing the latter, as it appears first in the algorithm above.

Since we have placed an order that is either at the bid/ask or below/above the bid/ask, there are other participants who have submitted orders before us. This implies that they will also have their orders executed before us, and hence we must wait for a volume equivalent to Queue Volume to be traded before our own order starts executing. When it becomes our turn, we calculate which of the trades in the historical data can be counted as ours. We iterate this procedure either until we have executed our submitted order or until the trading time has ended.
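Summarised, the queueing logic for a passive order is: accumulate the traded volume at (or through) our price level, first depleting the queue ahead of us, then counting the remainder as our own fills. The Python sketch below is a minimal illustration under that reading, with hypothetical inputs; it is not the thesis implementation.

    def simulate_passive_fill(queue_ahead, our_volume, traded_per_minute):
        """Fill a passive order sitting behind `queue_ahead` units in the book.

        traded_per_minute: volume traded at (or better than) our price level
        in each minute of the time period.  Returns the volume we executed.
        """
        filled = 0
        for traded in traded_per_minute:
            if queue_ahead > 0:                # others ahead of us execute first
                used = min(queue_ahead, traded)
                queue_ahead -= used
                traded -= used
            filled += min(traded, our_volume - filled)
            if filled == our_volume:
                break
        return filled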

In the case when we are aggressive, there is no notion of waiting for participants who have placed orders before us, since we are immediately offering a price that is being bid/asked for. We thus directly obtain all the volume at the price level we submitted. Since the trade data itself does not reflect this phenomenon, we have implemented it as having to wait for a certain volume worth of trades before being able to receive new volume at the submission price. In the algorithm this is referred to as Penalty Volume.

When we write "Update xt to xt+1", it means simply means xt,1 and xt,2 are updatedaccording to the simulation results and xt,3 is updated by finding the trend state ofthe data point our simulation takes us to.


Chapter 5

The Vanilla Model

Here we present the results for the model with no market impact. When benchmarking the model, we use the following three "naive strategies" as reference points:

1. "Buy all at once": Here, we "eat through" the order book immediately with our entire order.

2. "Submit and Leave at bid/ask": Here we submit our entire order at the bid/ask and "eat through" the order book at the end of our session if there is volume left to execute.

3. "Portion out evenly at bid/ask": Here we divide our order into several smaller orders, corresponding to the number of time periods we divide our session into. As a more descriptive example, say we want to buy 18 000 units of instrument A in 180 minutes. Given that we want to do this in 18 time periods (τ = 18), we then submit an order of 1000 units at the bid/ask every 10 minutes. It should be noted that the bid/ask is taken from the order book corresponding to how far into the 180 minutes we have simulated. Everything that is not executed within 10 minutes is added to the next submission. If we have not finished by the end of the 180 minutes, we "eat through" the order book. A small sketch of this schedule is given below.

5.1 Benchmark algorithms

Here we give a more detailed description of how the benchmarking algorithm works. The overall idea is to take a given policy and simulate a certain order according to that policy on out-of-sample data, obtaining a mean slippage for each policy, which is used for assessment and comparison.

It should be noted that some of the state space input requirements have been relaxed, in the sense that the algorithms below also allow for the absolute amount of stock left to buy, in contrast to only allowing a certain volume state. This is done solely in order to make simulating the naive strategies more straightforward.

Algorithm 6 Benchmarking Main Loop
1:  procedure BenchmarkingMainLoop(OrderBookDataYear_oos, TradesDataYear_oos, σYZ_Year, τ, V, I, TrendData, trading_session_time, optimal_policies)
2:    timeperiod length = trading_session_time/τ
3:    order size = V/τ
4:    Naive slippage container "Eat through" = ()
5:    Naive slippage container "Submit and Leave at bid" = ()
6:    Naive slippage container "Portion out" = ()
7:    Optimal policy slippage container = ()
8:    Optimal "Submit and Leave" container = ()
9:    for OrderBookDataDay and TradesDataDay in OrderBookDataYear_oos and TradesDataYear_oos do
10:     σYZ = GetCorrectYZ(Day, σYZ_Year)
11:     "Eat through" slippage = EatThroughSimulation(V, OrderBookDataDay, σYZ)
12:     "Naive submit and leave" slippage = NaiveSubmitandLeave(V, OrderBookDataDay, TradesDataDay, σYZ, trading_session_time)
13:     "Portion out" slippage = PortionOut(V, OrderBookDataDay, TradesDataDay, σYZ, trading_session_time, timeperiod length, order size, τ, OrderBookData_oos)
14:     "Optimal policy" slippage = simulatePolicy(V, I, OrderBookDataDay, TradesDataDay, σYZ, optimal_policies.optimal, trading_session_time, timeperiod length, τ, OrderBookData_oos, TrendData)
15:     "Optimal submit and leave" slippage = simulatePolicy(V, I, OrderBookDataDay, TradesDataDay, σYZ, optimal_policies.submitandleave, trading_session_time, timeperiod length, τ, OrderBookData_oos, TrendData)
16:     Add the slippage values for each policy to their respective containers
17:   Get the mean daily slippage for each policy by taking the average over every slippage container
18:   return Average slippage for every policy

In the algorithm above, we essentially simulate each policy for every data point and calculate the average slippage associated with the policy in question. In practice it is implemented by first passing each day-wise data set as an argument. From this day-wise data, we consider each minute-wise data point as a starting point for our simulation, and obtain the slippage that results from using that data point as a starting point.
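A compressed sketch of this evaluation loop is shown below; simulate_policy here is a stand-in for whichever policy simulator is being benchmarked, and the nested day/minute structure mirrors the description above rather than the exact code used in the thesis.

```python
import statistics

def benchmark(simulate_policy, days):
    """Average slippage of one policy over all starting points.

    days: iterable of day-wise data, each an iterable of minute-wise data
    points. simulate_policy(day, minute_index) -> slippage for that start.
    """
    slippages = []
    for day in days:
        for start in range(len(day)):   # every minute is a starting point
            slippages.append(simulate_policy(day, start))
    return statistics.mean(slippages)


# Toy usage with a dummy simulator that returns a constant slippage.
dummy_days = [list(range(5)), list(range(5))]
print(benchmark(lambda day, start: -0.01, dummy_days))
```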


Algorithm 7 EatThroughSimulation
1: procedure EatThroughSimulation(V, OrderBookDataDay, σYZ)
2:   slippage container = ()
3:   for OrderBookData_minute in OrderBookDataDay do
4:     Bid-Ask = GetBidAsk(OB_i)   ▷ GetBidAsk(OB_i) returns the bid or ask price
5:     Reference Price = (bid_i + ask_i)/2
6:     slippage = EatThroughOrderBook(V, OrderBookData_minute, σYZ, Reference Price)
7:     Add slippage to slippage container
8:   return mean(slippage container)

This algorithm uses the "EatThroughOrderBook" function. The idea is to calculatethe slippage for the "Eat through all" policy for each data point. There is no notionof starting point here, since this assumes that we are always in the terminal stateand must execute everything immediately.

Algorithm 8 Naive Submit and Leave
1: procedure NaiveSubmitandLeave(V, OrderBookDataDay, TradesDataDay, σYZ, trading_session_time)
2:   slippage container = ()
3:   for OrderBookData_minute in OrderBookDataDay do
4:     Bid-Ask = GetBidAsk(OrderBookData_minute)   ▷ GetBidAsk(OrderBookData_minute) returns the bid or ask price
5:     Reference Price = (bid_i + ask_i)/2
6:     xt = [V 1 0]   ▷ Relaxed state space requirement
7:     [Simulated slippage, Volume Left] = SimulateTrades(xt, u_bidask, OrderBookData_minute, OrderBookData_oos, TradesDataDay, σYZ, [], trading_session_time, Reference Price)   ▷ Here it should be noted that slippage refers only to executed reward
8:     if Volume Left > 0 then
9:       Rest slippage = EatThroughOrderBook(Volume Left, OrderBookData_minute, σYZ, Reference Price)
10:    else
11:      Rest slippage = 0
12:    Add Simulated slippage + Rest slippage to slippage container
13:  return mean(slippage container)

As mentioned, we simulate the "naive submit and leave" policy using every minute-wise data point as a starting point. An important detail is that in this context, the notion of unexecuted reward is not needed since we are not making a Markov assumption here. Instead, we take a reference price at the beginning of each simulation for each data point, which results in a reference price that "looks back" over a longer time.


Algorithm 9 Portion Out
1: procedure PortionOut(V, OrderBookDataDay, TradesDataDay, σYZ, trading_session_time, time period length, order size, τ, OrderBookData_oos)
2:   slippage container = ()
3:   for OrderBookData_minute in OrderBookDataDay do
4:     temporary container = ()
5:     Bid-Ask = GetBidAsk(OrderBookData_minute)   ▷ GetBidAsk returns the bid or ask price
6:     Reference Price = (bid_i + ask_i)/2
7:     Initialize xt = [V, 0]   ▷ Relaxed state space requirement
8:     while xt,2 < τ do
9:       u_bidask = GetSubmissionPrice(OrderBookData_oos, xt)
10:      [Slippage, Volume Left, xt+1] = SimulateTrades(xt, u_bidask, OrderBookData_minute, OrderBookData_oos, TradesDataDay, σYZ, [], trading_session_time, Reference Price)   ▷ Here it should be noted that slippage refers only to executed reward
11:      Update xt to xt+1
12:      Add Slippage to temporary container
13:      if xt,1 = 0 then
14:        Add sum(temporary container) to temporary container
15:        Break
16:    if Volume Left > 0 then
17:      Rest slippage = EatThroughOrderBook(Volume Left, OrderBookData_minute, σYZ, Reference Price)
18:    else
19:      Rest slippage = 0
20:    Add sum(temporary container) + Rest slippage to slippage container
21:  return mean(slippage container)

This algorithm is similar in structure to the "naive submit and leave". An important difference is that the "portion out" policy must update its control signal every time period length minutes. This imposes a while loop for every starting point, which ensures that the trading session is simulated for the entire trading_session_time. It should also be mentioned that we relax the input requirements on the state space variable.

Algorithm 10 Simulate Policy
1: procedure SimulatePolicy(V, I, OrderBookDataDay, TradesDataDay, σYZ, optimal_policies, trading_session_time, time period length, τ, OrderBookData_oos, TrendData)
2:   slippage container = ()
3:   for OrderBookData_minute in OrderBookDataDay do
4:     temporary container = ()
5:     Bid-Ask = GetBidAsk(OrderBookData_minute)   ▷ GetBidAsk returns the bid or ask price
6:     Reference Price = (bid_i + ask_i)/2
7:     Initialize xt = [I, 0, θ]
8:     while xt,2 < τ do
9:       u_optimal = GetOptimalPolicy(optimal_policies, xt)
10:      [Slippage, Volume Left, xt+1] = SimulateTrades(xt, u_optimal, OrderBookData_minute, OrderBookData_oos, TradesDataDay, σYZ, [], trading_session_time, Reference Price)   ▷ Here it should be noted that slippage refers only to executed reward
11:      Update xt to xt+1
12:      Add Slippage to temporary container
13:      if xt,1 = 0 then
14:        Add sum(temporary container) to temporary container
15:        Break
16:    if Volume Left > 0 then
17:      Rest slippage = EatThroughOrderBook(Volume Left, OrderBookData_minute, σYZ, Reference Price)
18:    else
19:      Rest slippage = 0
20:    Add sum(temporary container) + Rest slippage to slippage container
21:  return mean(slippage container)

This algorithm is essentially the same as "naive portion out". The only major difference is that we have added the function GetOptimalPolicy, which selects the optimal control given our current state and TrendData. It should be mentioned that the optimal policy is implemented as a "look up" table containing each possible state xt ∈ S and its corresponding optimal control ut ∈ U.
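A look-up-table policy of this kind can be represented very simply; the sketch below assumes the state has been discretized to a tuple (volume state, time index, trend state) and maps it to a control index. The state encoding, the control values and the fallback behaviour are illustrative assumptions, not the thesis' exact representation.

```python
# Hypothetical discretized states: (volume state, time index, trend state).
optimal_policy = {
    (5, 0, "up"):   2,   # control index, e.g. how aggressively to submit (assumed meaning)
    (5, 0, "down"): -1,
    (3, 2, "flat"):  0,
}

def get_optimal_policy(policy_table, state, default_control=0):
    """Return the stored control for a state, or a default if the state is unseen."""
    return policy_table.get(state, default_control)

print(get_optimal_policy(optimal_policy, (5, 0, "up")))   # -> 2
print(get_optimal_policy(optimal_policy, (9, 9, "up")))   # -> 0 (fallback)
```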

5.2 Results and Comments

When simulating each model, we have chosen combinations of the parameters in each row of the following table to conduct simulations for:

Instrument Name   V            I           Market variable   Time period   Session time
Instrument A      15000, 5000  5,10,15,20  true, false       5,10,15,20    60,120,180
Instrument B      3600, 1200   5,10,15,20  true, false       5,10,15,20    60,120,180
Instrument C      2100, 700    5,10,15,20  true, false       5,10,15,20    60,120,180

Table 5.1: Parameters used for simulation without market impact

Note that Session time and V are characteristics of the simulation rather than parameters of the model we use. After simulating, we get the following results for the parameters used:

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {15000, 5000}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.1: Instrument A overall results. We see that the naive strategy "Portion out evenly at bid" is overall better in this case.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {3600, 1200}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.2: Instrument B overall results. We see that the naive strategy is overall better in this case.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {2100, 700}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.3: Instrument C overall results. Here our approach is overall superior.

The plots above are mainly presented to illustrate the general results. Now looking more specifically at the best model given order volume and Session time, we get the following results and tables:

[Line plots of Slippage against Session length (minutes) for instrument A, V = 15000 and V = 5000, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.4: Instrument A best results obtained for each scenario. We see that the naive strategy "Portion out evenly" indeed yields the best results.


[Line plots of Slippage against Session length (minutes) for instrument B, V = 3600 and V = 1200, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.5: Instrument B best results obtained for each scenario. We see that the naive strategy "Portion out evenly" indeed yields the best results.

[Line plots of Slippage against Session length (minutes) for instrument C, V = 2100 and V = 700, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 5.6: Instrument C best results obtained for each scenario. Here our model without any market variable does the best.

We also present a table with more detailed information regarding the best results for each type of simulation:


Instrument Name   V      I    M.V  T.P  S.T  Strategy               Slippage
'Instrument A'    15000  NaN  NaN  15   180  'Portion out evenly'    0.0218
'Instrument A'    15000  NaN  NaN  20   120  'Portion out evenly'    0.0281
'Instrument A'    15000  NaN  NaN  15   60   'Portion out evenly'    0.0199
'Instrument A'    5000   NaN  NaN  20   180  'Portion out evenly'    0.031
'Instrument A'    5000   NaN  NaN  20   120  'Portion out evenly'    0.0301
'Instrument A'    5000   NaN  NaN  15   60   'Portion out evenly'    0.0237
'Instrument B'    3600   NaN  NaN  20   180  'Portion out evenly'    0.0225
'Instrument B'    3600   NaN  NaN  20   120  'Portion out evenly'    0.019
'Instrument B'    3600   NaN  NaN  10   60   'Portion out evenly'    0.0119
'Instrument B'    1200   NaN  NaN  20   180  'Portion out evenly'    0.0226
'Instrument B'    1200   NaN  NaN  15   120  'Portion out evenly'    0.0197
'Instrument B'    1200   NaN  NaN  10   60   'Portion out evenly'    0.0152
'Instrument C'    2100   20   0    20   180  'Calculated Policy'     0.0707
'Instrument C'    2100   20   0    20   120  'Calculated Policy'     0.0391
'Instrument C'    2100   NaN  NaN  5    60   'S&L at bid'           -0.0027
'Instrument C'    700    20   0    15   180  'Calculated Policy'     0.0903
'Instrument C'    700    20   0    15   120  'Calculated Policy'     0.0547
'Instrument C'    700    15   1    20   60   'Calculated Policy'     0.0094

Table 5.2: No market impact results

Here "NaN", means that the strategy used does not have the notion of usingmarket variables or volume states. We see that in general the naive strategy "Portionout evenly" yields the best results, consistently yielding a positive slippage. Since thestrategy always submits a small amount at the current bid/ask, it is very plausiblethat a lot of the order is executed at a favorable price. For Instrument C, our policyseems to do better. This could be due to the large price tick to volatility ratio,implying that market participants in general are reluctant to changing price tickswhich results in a very peculiar behavior. The MDP might perform better due toits ability to capitalize on such peculiar behavior by taking mean estimates.


Part II

Market Impact


Chapter 6

Model

In the previous part, we talked about the basics of a Markovian model without the notion of market impact added to our model. In this part we explore this concept and attempt to construct a viable model that is somewhat consistent with reality.

In order to model the market impact we take two approaches:

1. A model where the price and the transition probabilities are changed due to market impact. This model will be referred to as the "MI" model.

2. A model where a slippage depending on factors relevant to market impact is added to the simulated slippage. This model will be referred to as the "Lynx" model.

6.1 Assumptions

When market impact is modeled, there are of course assumptions to be made regarding how it affects the MDP.

6.1.1 Markov property

Clearly, there are no changes to the Markov assumption, since we still have to rely on transition probabilities.

6.1.2 Market Impact

Since there is no empirical way of estimating market impact, we can only model assumed consequences of market impact. These assumptions are:

1. The amount by which the price is driven in an unfavorable direction is proportional to the size of an order, in terms of the volume submitted.


2. The amount by which the price is driven in an unfavorable direction is proportional to the aggressiveness of an order, in terms of the number of price ticks above or below bid/ask at which it is submitted.

The consequences thus become:

1. When the price is driven in an unfavorable direction, the price of the incoming trades will move in an unfavorable direction, resulting in less of our order being executed. This implies that any control ut will always result in a transition to a less favorable state.

2. Further, the "incremental price" is pushed to an unfavorable direction, result-ing in larger unexecuted slippage.

6.2 "MI" modelThis model is inspired by the Chriss-Almgren model of market impact. It looks likethe following:

$$Z_{t+1} = Z_t + \sigma_{YZ}\,\sqrt{\frac{T}{60}}\cdot\frac{u_{t,2}}{V}\cdot X_t + s\cdot \operatorname{sgn}(\alpha)\cdot\left(1+\frac{u_{t,1}}{|A|}\right) \tag{6.1}$$

where Zt denotes the price at time index t, T is the amount of time (in minutes) that has elapsed of the time period, V is the estimated total volume traded during the time period, Xt is a standard normally distributed (N(0, 1)) random variable, s is the price tick, α ∈ {−1, 1} is a buy/sell indicator and |A| is the cardinality of the action space.
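To make the roles of the symbols concrete, here is a minimal Python sketch of one application of equation (6.1); the √(T/60) time scaling and the unit conventions are read off the reconstructed formula above and should be treated as an interpretation rather than the exact implementation used in the thesis.

```python
import math
import random

def mi_price_update(z_t, sigma_yz, elapsed_minutes, u_volume, market_volume,
                    tick, alpha, action_space_size, u_price_level):
    """One step of the "MI" price update in equation (6.1).

    z_t: current price, sigma_yz: Yang-Zhang volatility estimate,
    u_volume = u_{t,2}: submitted volume, u_price_level = u_{t,1}: price level of the control,
    alpha: +1 for a buy order, -1 for a sell order.
    """
    x_t = random.gauss(0.0, 1.0)  # N(0, 1) noise
    diffusion = sigma_yz * math.sqrt(elapsed_minutes / 60.0) * (u_volume / market_volume) * x_t
    impact = tick * math.copysign(1.0, alpha) * (1.0 + u_price_level / action_space_size)
    return z_t + diffusion + impact


print(mi_price_update(100.0, 0.2, 10, 3000, 150000, 0.01, +1, 9, 2))
```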

This model has no direct intuitive interpretation, but it achieves the desired properties of the market impact assumptions. It was arrived at through trial and error, by trying different models and plugging them into the code to see which one resulted in the desired properties. In order to visualize the effects of market impact, we look at the probability of a difference in the state transition:

[Surface plots of the transition probability differences: probability of a given volume difference (units) as a function of the action, for submitted volumes 3000, 6000, 9000, 12000 and 15000, colored by current state volume (units).]

Figure 6.1: Here we have taken instrument A as an example with model parameters V=15000, I=5 and T=5.


What the above graph shows is the probability of getting less volume executed given a state xt and control ut. Since the graphs are rather similar in shape for different values of submitted volume, they are shown from different angles in order to give a better picture of how the probabilities change.
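One straightforward way to estimate such a probability is by Monte Carlo over historical starting points, comparing the executed volume with and without the impact adjustment; the sketch below is only a schematic of that idea with a dummy simulator, not the code used to produce Figure 6.1.

```python
import random

def prob_less_volume(simulate_executed, starting_points, state, control, n=1000):
    """Estimate P(executed volume with impact < executed volume without impact)."""
    hits = 0
    for _ in range(n):
        start = random.choice(starting_points)
        with_impact = simulate_executed(start, state, control, impact=True)
        without_impact = simulate_executed(start, state, control, impact=False)
        hits += with_impact < without_impact
    return hits / n


# Dummy simulator: impact removes a random share of the fills.
def dummy_sim(start, state, control, impact):
    base = 1000.0
    return base * (random.uniform(0.6, 1.0) if impact else 1.0)

print(prob_less_volume(dummy_sim, starting_points=[0, 1, 2], state=(5, 0), control=3))
```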

6.3 "Lynx model"Due to compliance reasons, no explicit representation of this model will be printedout. This model uses a coefficient that is correlated with the control signal, whichwill be denoted γ. In this thesis, will test the effects of market impact for threedifferent γ’s.


Chapter 7

Implementation

When implementing the above models, not much modification is necessary. In fact, we only need to modify the "SimulateTrades" function by adding a market impact price calculator for the "MI" case and a reward penalty function for the "Lynx" case.

In the "MI" case, an additional function is called when calculating the "incremen-tal price" and when iterating forwards on trade data. This function looks like thefollowing:

Algorithm 11 "MI" model market impact1: procedure MIcalculator(Zt, s, ut, T radesDataDay, time period,σY Z ,Elapsed Time)

2: V = 03: for i= to time period do . Get market volume for the time period relative

to which day it is.4: V = V + sum(TradesDataDay(i)) . Here "sum()" sums the total

volume traded5: Zt+1 = Zt + σY Z ·

√Elapsed Time

60 · ut,2V ·Xt + s · sgn(α) · (1 + ut,1|A| )

6: return Zt+1

Here, we first calculate the market volume and plug this value into our equation, which is then applied given a control ut and a price Zt, where Zt can be either the incremental price or the price of the trades. When calculating how the incremental price changes, we simply set the elapsed time to the time period length, while in the other case we keep track of how many minutes of data we have progressed through since the simulation started.
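Reusing the mi_price_update sketch from Section 6.2, the two use cases described above would differ only in which elapsed time is passed in; this is an illustrative assumption about how the calculator is called, not a transcript of the actual code.

```python
# Incremental price: the whole time period length is treated as elapsed (assumption).
incremental_price = mi_price_update(z_t=100.0, sigma_yz=0.2, elapsed_minutes=10,
                                    u_volume=3000, market_volume=150000,
                                    tick=0.01, alpha=+1, action_space_size=9,
                                    u_price_level=2)

# Trade prices: pass the minutes actually progressed since the simulation started.
minutes_progressed = 4
trade_price = mi_price_update(z_t=100.0, sigma_yz=0.2,
                              elapsed_minutes=minutes_progressed,
                              u_volume=3000, market_volume=150000,
                              tick=0.01, alpha=+1, action_space_size=9,
                              u_price_level=2)
```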

For the Lynx model, the values of the three γ's tested are 0.01, 0.03 and 0.05. Similarly to the "MI" model, it also has a market volume component. It is implemented in the following form:

Algorithm 12 "Lynx" model market impact

45

Page 58: Optimal Order Execution using Stochastic Control and ...kth.diva-portal.org/smash/get/diva2:963057/FULLTEXT01.pdfOptimal Order Execution using Stochastic Control and Reinforcement

CHAPTER 7. IMPLEMENTATION

1: procedure Lynxcalculator(Zt, s, ut, T radesDataDay, time period,σY Z ,Elapsed Time, γ)

2: V = 03: for i= to time period do . Get market volume for the time period relative

to which day it is.4: V = V + sum(TradesDataDay(i)) . Here "sum()" sums the total

volume traded5: Additional slippage=LynxModel(V, γ, time period, ut)6: return Zt+1

7.1 Benchmark algorithms

We use the same algorithms as in the case without market impact, with the exception of increasing the price of trades depending on the order, and of a reward penalty due to market impact.

7.2 Results and Comments

We divide the results of this section into the "MI" model and "Lynx" model.

7.2.1 "MI" model

In this case, the simulations run are the same as in the case without market impact:

Instrument Name   V            I           Market variable   Time period   Session time
Instrument A      15000, 5000  5,10,15,20  true, false       5,10,15,20    60,120,180
Instrument B      3600, 1200   5,10,15,20  true, false       5,10,15,20    60,120,180
Instrument C      2100, 700    5,10,15,20  true, false       5,10,15,20    60,120,180

Table 7.1: Parameters used for simulation with market impact

Here the variables denote the same things as before. We obtain the following results after the simulations:


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {15000, 5000}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.1: Instrument A overall results with market impact. We see that there is no obvious superior strategy.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {3600, 1200}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.2: Instrument B overall results with market impact. It seems that our approach is superior.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I), one panel per combination of V ∈ {2100, 700}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.3: Instrument C overall results with market impact. The optimized submit and leave looks to be superior.

Now, looking at the best model given the simulation variables Volume and Session time:

[Line plots of Slippage against Session length (minutes) for instrument A, V = 15000 and V = 5000, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.4: Instrument A best results obtained for each scenario. We see that optimized submit and leave together with our calculated policy with a trend variable is superior.


[Line plots of Slippage against Session length (minutes) for instrument B, V = 3600 and V = 1200, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.5: Instrument B best results obtained for each scenario. We see that optimized submit and leave together with our calculated policy with a trend variable is superior.

[Line plots of Slippage against Session length (minutes) for instrument C, V = 2100 and V = 700, comparing Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.6: Instrument C best results obtained for each scenario. Here our calculated policy without any market variable together with optimized submit and leave does the best.

We also present a table with more detailed information regarding the best results for each type of simulation:


Instrument Name   V      I    M.V  T.P  S.T  Strategy               Slippage
'Instrument A'    15000  20   1    15   180  'Optimized S&L'         0.0032
'Instrument A'    15000  20   1    15   120  'Optimized S&L'        -0.003
'Instrument A'    15000  20   1    15   60   'Calculated Policy'    -0.0241
'Instrument A'    5000   20   1    5    180  'Calculated Policy'    -0.0319
'Instrument A'    5000   20   1    5    120  'Calculated Policy'    -0.034
'Instrument A'    5000   NaN  NaN  5    60   'Portion out evenly'   -0.0332
'Instrument B'    3600   20   1    5    180  'Optimized S&L'         0.0058
'Instrument B'    3600   20   0    5    120  'Calculated Policy'     0.0043
'Instrument B'    3600   20   0    5    60   'Calculated Policy'     0.001
'Instrument B'    1200   10   1    5    180  'Calculated Policy'     0.0071
'Instrument B'    1200   20   1    5    120  'Optimized S&L'         0.0053
'Instrument B'    1200   20   1    5    60   'Optimized S&L'         0.0053
'Instrument C'    2100   20   0    5    180  'Optimized S&L'         0.173
'Instrument C'    2100   20   0    5    120  'Optimized S&L'         0.1007
'Instrument C'    2100   20   0    5    60   'Optimized S&L'         0.0138
'Instrument C'    700    20   0    5    180  'Optimized S&L'         0.1271
'Instrument C'    700    20   0    5    120  'Optimized S&L'         0.0655
'Instrument C'    700    20   1    5    60   'Optimized S&L'        -0.0002

Table 7.2: Results for "MI" model market impact

When imposing a punishment on submission volume and action, we see that the more advanced models perform better. This result is not very surprising: since the advanced models are simulated on large sets of data, they tend to converge to an "average" that better captures behavior that depends on what kind of control is submitted. A remarkable thing is that for instrument C, an advanced model performs better than in the case without market impact. Considering that instrument C has the lowest number of trades per minute, which allows for lower values of V, the stochastic component in the market impact could unintentionally allow for transitions to favorable states instead of unfavorable ones, which is a flaw in our model.

7.2.2 "Lynx" model

Since the model we use depends on a γ coefficient, we must include it as a simulation variable. Hence the variables used for each simulation become:

Instrument Name   V            I           M.V          T.P         S.T         γ
Instrument A      15000, 5000  5,10,15,20  true, false  5,10,15,20  60,120,180  0.01, 0.03, 0.05
Instrument B      3600, 1200   5,10,15,20  true, false  5,10,15,20  60,120,180  0.01, 0.03, 0.05
Instrument C      2100, 700    5,10,15,20  true, false  5,10,15,20  60,120,180  0.01, 0.03, 0.05

Table 7.3: Lynx model simulation parameters

After simulating for different γ’s, we get the following results:


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument A with γ = 0.01, panels for V ∈ {15000, 5000}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.7: Instrument A overall results with market impact. The naive strategy portion out evenly seems superior.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument B with γ = 0.01, panels for V ∈ {3600, 1200}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.8: Instrument B overall results with market impact. The naive strategy portion out evenly seems superior.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument C with γ = 0.01, panels for V ∈ {2100, 700}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.9: Instrument C overall results with market impact. Our calculated policy seems superior.

The above results are for γ = 0.01.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument A with γ = 0.03, panels for V ∈ {15000, 5000}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.10: Instrument A overall results with market impact. The naive strategy portion out evenly seems superior.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument B with γ = 0.03, panels for V ∈ {3600, 1200}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.11: Instrument B overall results with market impact. The naive strategy portion out evenly seems superior.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument C with γ = 0.03, panels for V ∈ {2100, 700}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes.]

Figure 7.12: Instrument C overall results with market impact. The naive strategy submit and leave at bid seems superior.

The above results are for γ = 0.03.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument A with γ = 0.05, panels for V ∈ {15000, 5000}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.13: Instrument A overall results with market impact. The naive strategy portion out evenly seems superior.

[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument B with γ = 0.05, panels for V ∈ {3600, 1200}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.14: Instrument B overall results with market impact. The naive strategy portion out evenly seems superior.


[Surface plots of Average slippage (SD) against Length of episode (T) and Granularity (I) for instrument C with γ = 0.05, panels for V ∈ {2100, 700}, MV ∈ {0, 1} and Session time ∈ {60, 120, 180} minutes, comparing Optimized Submit and Leave, Calculated Policy, Eat-all at once, Submit and Leave at bid and Portion out evenly.]

Figure 7.15: Instrument C overall results with market impact. The naive strategy submit and leave at bid seems superior.

The above results are for γ = 0.05. Now, looking at the best models we have the following results:

[Figure 7.16 panels: slippage against session length (minutes), instrument A, V = 15000 and V = 5000, γ = 0.01. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.16: Instrument A best results with market impact. The naive strategy portion out performs the best.


[Figure 7.17 panels: slippage against session length (minutes), instrument B, V = 3600 and V = 1200, γ = 0.01. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.17: Instrument B best results with market impact. The naive strategy portion out performs the best.

[Figure 7.18 panels: slippage against session length (minutes), instrument C, V = 2100 and V = 700, γ = 0.01. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.18: Instrument C best results with market impact. Our calculated policy without market variable performs the best.

Taking the best results for γ = 0.01 we get the following table:


Instrument Name  | Gamma | V     | I   | M.V | T.P | S.T | Strategy             | Slippage
'Instrument A'   | 0.01  | 15000 | NaN | NaN | 15  | 180 | 'Portion out evenly' | 0.0199
'Instrument A'   | 0.01  | 15000 | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0249
'Instrument A'   | 0.01  | 15000 | NaN | NaN | 15  | 60  | 'Portion out evenly' | 0.0148
'Instrument A'   | 0.01  | 5000  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0302
'Instrument A'   | 0.01  | 5000  | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0289
'Instrument A'   | 0.01  | 5000  | NaN | NaN | 15  | 60  | 'Portion out evenly' | 0.0217
'Instrument B'   | 0.01  | 3600  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0201
'Instrument B'   | 0.01  | 3600  | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0156
'Instrument B'   | 0.01  | 3600  | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0077
'Instrument B'   | 0.01  | 1200  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0218
'Instrument B'   | 0.01  | 1200  | NaN | NaN | 15  | 120 | 'Portion out evenly' | 0.0187
'Instrument B'   | 0.01  | 1200  | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0136
'Instrument C'   | 0.01  | 2100  | 20  | 0   | 20  | 180 | 'Calculated Policy'  | 0.0373
'Instrument C'   | 0.01  | 2100  | NaN | NaN | 5   | 120 | 'S&L at bid'         | 0.0088
'Instrument C'   | 0.01  | 2100  | NaN | NaN | 20  | 60  | 'Portion out evenly' | -0.005
'Instrument C'   | 0.01  | 700   | 20  | 0   | 20  | 180 | 'Calculated Policy'  | 0.0882
'Instrument C'   | 0.01  | 700   | 20  | 0   | 20  | 120 | 'Calculated Policy'  | 0.0534
'Instrument C'   | 0.01  | 700   | 20  | 0   | 20  | 60  | 'Calculated Policy'  | 0.0059

Table 7.4: Lynx model γ = 0.01 results

[Figure 7.19 panels: slippage against session length (minutes), instrument A, V = 15000 and V = 5000, γ = 0.03. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.19: Instrument A best results with market impact. The naive strategy portion out performs the best.


[Figure 7.20 panels: slippage against session length (minutes), instrument B, V = 3600 and V = 1200, γ = 0.03. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.20: Instrument B best results with market impact. The naive strategy portion out performs the best.

[Figure 7.21 panels: slippage against session length (minutes), instrument C, V = 2100 and V = 700, γ = 0.03. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.21: Instrument C best results with market impact. It looks like a tie between submit and leave at bid and our calculated policy without market variables.

Taking the best results for γ = 0.03 we get the following table:


Instrument Name  | Gamma | V     | I   | M.V | T.P | S.T | Strategy             | Slippage
'Instrument A'   | 0.03  | 15000 | NaN | NaN | 15  | 180 | 'Portion out evenly' | 0.0163
'Instrument A'   | 0.03  | 15000 | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0185
'Instrument A'   | 0.03  | 15000 | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0052
'Instrument A'   | 0.03  | 5000  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0287
'Instrument A'   | 0.03  | 5000  | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0267
'Instrument A'   | 0.03  | 5000  | NaN | NaN | 15  | 60  | 'Portion out evenly' | 0.0179
'Instrument B'   | 0.03  | 3600  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0153
'Instrument B'   | 0.03  | 3600  | NaN | NaN | 15  | 120 | 'Portion out evenly' | 0.0092
'Instrument B'   | 0.03  | 3600  | NaN | NaN | 10  | 60  | 'Portion out evenly' | -0.0007
'Instrument B'   | 0.03  | 1200  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.02
'Instrument B'   | 0.03  | 1200  | NaN | NaN | 15  | 120 | 'Portion out evenly' | 0.0165
'Instrument B'   | 0.03  | 1200  | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0104
'Instrument C'   | 0.03  | 2100  | NaN | NaN | 5   | 180 | 'S&L at bid'         | 0.0078
'Instrument C'   | 0.03  | 2100  | NaN | NaN | 5   | 120 | 'S&L at bid'         | -0.0023
'Instrument C'   | 0.03  | 2100  | NaN | NaN | 20  | 60  | 'Portion out evenly' | -0.0089
'Instrument C'   | 0.03  | 700   | 20  | 0   | 20  | 180 | 'Calculated Policy'  | 0.0728
'Instrument C'   | 0.03  | 700   | 20  | 0   | 20  | 120 | 'Calculated Policy'  | 0.0411
'Instrument C'   | 0.03  | 700   | 20  | 0   | 20  | 60  | 'Calculated Policy'  | 0.0007

Table 7.5: Lynx model γ = 0.03 results

[Figure 7.22 panels: slippage against session length (minutes), instrument A, V = 15000 and V = 5000, γ = 0.05. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.22: Instrument A best results with market impact. The naive strategy portion out performs the best.


[Figure 7.23 panels: slippage against session length (minutes), instrument B, V = 3600 and V = 1200, γ = 0.05. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.23: Instrument B best results with market impact. The naive strategy portion out performs the best.

[Figure 7.24 panels: slippage against session length (minutes), instrument C, V = 2100 and V = 700, γ = 0.05. Legend: Optimized Submit and Leave (with/without trend), Calculated Policy (with/without trend), Eat-all at once, Submit and Leave at bid, Portion out evenly.]

Figure 7.24: Instrument C best results with market impact. It looks like a tie between submit and leave at bid and our calculated policy with market variables.

Taking the best results for γ = 0.05 we get the following table:


Instrument Name  | Gamma | V     | I   | M.V | T.P | S.T | Strategy             | Slippage
'Instrument A'   | 0.05  | 15000 | NaN | NaN | 15  | 180 | 'Portion out evenly' | 0.0126
'Instrument A'   | 0.05  | 15000 | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0121
'Instrument A'   | 0.05  | 15000 | NaN | NaN | 5   | 60  | 'Portion out evenly' | -0.0022
'Instrument A'   | 0.05  | 5000  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0272
'Instrument A'   | 0.05  | 5000  | NaN | NaN | 20  | 120 | 'Portion out evenly' | 0.0244
'Instrument A'   | 0.05  | 5000  | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0144
'Instrument B'   | 0.05  | 3600  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0104
'Instrument B'   | 0.05  | 3600  | NaN | NaN | 15  | 120 | 'Portion out evenly' | 0.0033
'Instrument B'   | 0.05  | 3600  | NaN | NaN | 5   | 60  | 'Portion out evenly' | -0.0086
'Instrument B'   | 0.05  | 1200  | NaN | NaN | 20  | 180 | 'Portion out evenly' | 0.0183
'Instrument B'   | 0.05  | 1200  | NaN | NaN | 15  | 120 | 'Portion out evenly' | 0.0144
'Instrument B'   | 0.05  | 1200  | NaN | NaN | 10  | 60  | 'Portion out evenly' | 0.0072
'Instrument C'   | 0.05  | 2100  | NaN | NaN | 5   | 180 | 'S&L at bid'         | -0.0023
'Instrument C'   | 0.05  | 2100  | NaN | NaN | 5   | 120 | 'S&L at bid'         | -0.0135
'Instrument C'   | 0.05  | 2100  | NaN | NaN | 5   | 60  | 'Portion out evenly' | -0.0115
'Instrument C'   | 0.05  | 700   | 15  | 1   | 20  | 180 | 'Calculated Policy'  | 0.0378
'Instrument C'   | 0.05  | 700   | 15  | 1   | 20  | 120 | 'Calculated Policy'  | 0.0132
'Instrument C'   | 0.05  | 700   | NaN | NaN | 5   | 60  | 'S&L at bid'         | -0.0044

Table 7.6: Lynx model γ = 0.05 results

For the Lynx model we see that there is not a great change in which policies are the most efficient for γ = 0.01, γ = 0.03 and γ = 0.05. This suggests that this model may be too lenient in "punishing" large volume submissions and aggressive actions, and hence a larger γ might have been more suitable to model market impact.


Part III

Reinforcement Learning


Chapter 8

Theory

This part serves as an exploratory "exercise" in implementing several concepts of "Reinforcement learning" on the given problem and testing their feasibility. Due to the lack of time and resources, we only present a case study of instrument A without any market impact and without the trend state.

8.1 A slight reformulation

Usually when using "Reinforcement learning", we are still considering a state space model. The problem arises when the model has a state space so large that it becomes computationally impractical to conduct simulations. In our case, we could have state and control spaces with a total of 10^7 combinations if we were to choose I ≈ 10^3 very large. Since choosing I large increases the precision of the model, we face a trade-off between accuracy and computational practicality.

The idea of "Reinforcement learning"¹ is then to approximate the value function V(x_t, u_t) associated with some state x_t and control u_t with differentiable functions, thus unlocking gradient methods. Given our situation where the amount of data does not qualify to be in the range of "big data", we omit the "neural network" approach and estimate the value function using continuous probability distributions and continuously differentiable functions.

When approximating discrete data points with continuous functions, we need to reformulate the state space, the control space and the Bellman equation. First off, the state space takes values in the following set:

¹Technically, the discrete models are also included in the definition, but we make a distinction here to accentuate the value function approximation part, which is specifically a reinforcement learning technique.


S = [0, 1]× [0, H] (8.1)

where H ∈ R. The major change is that the volume state, which previously indicated how many "batches" are left, is now a percentage value between 0 and 1. Further, the time state is no longer the number of time periods progressed but rather how many minutes of the total session time have passed. The control set thus becomes:

U = A× [0, 1]× [5, 20] (8.2)

Since the state space changed to a percentage-based value, the volume submission component of the control space must also change in correspondence with the volume state in the state space representation. Since there is no possibility to submit half a price tick, the action space will be estimated as a continuous variable, but in implementation it will always be rounded to the closest integer. The extra dimension in control space represents how many minutes we want the control to be active before we resubmit a new one. This extra dimension is added through parameterizing the time period, which in the discrete case would be fixed for each simulation. It is bounded by 5 and 20 since those are the bounds of the values for which we empirically tested and estimated the probability transition function. Thus we are only confident of the estimated values within the empirical bounds. It should be noted that this change is motivated by testing whether adding this dimension to the control space yields a significant improvement in performance compared to the discrete case without this dimension. With the state space and control space reformulated, given x_0 and T, the Bellman equation becomes, for some 0 ≤ t ≤ T and u_{t,1} ≤ x_{t,1}:

V^*_t(x_t) = \max_{u_t \in U} \left\{ c_{im}(u_t) + p_{u_{t,1}}(u_t)\, V^*_{t+1}(x_{t,1} - u_{t,1}) + \int_0^{u_{t,1}} P_{x_t \to x_{t+1}}(u_t)\, V^*_{t+1}(x_{t+1}) \, dx_{t+1} \right\} \qquad (8.3)

where x_{t+1} = x_t − u_{t,1}. Further we have that 0 ≤ x_{t+1} ≤ x_t, since in this formulation x_{t,1} ∈ [0, 1] and x_{t+1} ≤ x_t for all t ≥ 0.

In the formulation above we use that the transition probability has the form of a point distribution together with a continuous distribution.
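As a rough illustration of how the bracket in (8.3) can be evaluated numerically for one state and one candidate control, a minimal Python sketch is given below. The beta-mixture transition, the grid resolution and the interpretation x_{t+1} = x_t − e for an executed fraction e ∈ [0, u_{t,1}] are illustrative assumptions, not the exact implementation used in this thesis.

```python
import numpy as np
from scipy.stats import beta

def bellman_rhs(x_t, u, c_im, v_next, p_point, a, b, n_grid=201):
    """One-step value of (8.3) for volume state x_t and candidate control u.

    c_im    : immediate reward c_im(u), assumed precomputed
    v_next  : vectorized callable V*_{t+1}(x), the next-stage value function
    p_point : point mass p_{u_{t,1}}(u_t) of the whole submission being filled
    a, b    : beta parameters of the continuous part of the transition
    """
    u1 = u[1]                                   # submitted fraction, 0 < u_{t,1} <= x_t
    value = c_im + p_point * v_next(x_t - u1)   # full execution (point-mass term)
    # continuous part: executed fraction e in [0, u1], so x_{t+1} = x_t - e
    e = np.linspace(0.0, u1, n_grid)
    z = np.clip(1.0 - e / u1, 1e-9, 1.0 - 1e-9)
    dens = (1.0 - p_point) * beta.pdf(z, a, b) / u1
    value += np.trapz(dens * v_next(x_t - e), e)
    return value
```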

8.2 Reinforcement learning

According to David Silver, the notion of "Reinforcement learning" means estimating the value function of the control problem presented and gradually improving the accuracy of this value function by training it on new data. In our case, we then separate the out-of-sample data such that if the data is half a year long, we take the first three months as training data and the latter three months as test data.


We train our estimated value function by minimizing the squared error using gradient descent:

[c_1, \ldots, c_n]_{j+1} = [c_1, \ldots, c_n]_j - \alpha \cdot \nabla_{c_i} L(c_1, \ldots, c_n, x_t) \qquad (8.4)

Here L(c_1, ..., c_n, x_t) denotes the quadratic error function and j denotes the iteration index. Since we want to minimize the error with respect to the coefficients c_1, ..., c_n used to estimate our value function, we take the gradient of the error function L(c_1, ..., c_n, x_t) and "walk" against the direction of the gradient with step size α. Further, ∇_{c_i}L(c_1, ..., c_n, x_t) is defined as:

\nabla_{c_i} L(c_1, \ldots, c_n, x_t) = \left(y_i - V^*_{\tau-1}(c_1, \ldots, c_n, x_t)\right) \cdot \nabla_{c_i} V^*_{\tau-1}(c_1, \ldots, c_n, x_t) \qquad (8.5)

where V*_{τ−1}(c_1, ..., c_n, x_t) denotes the optimal value function with one time period left before the terminal state. Since the optimal policies for different time states are built recursively, we specifically chose this time state since it contains every coefficient needed to build the rest of the policies. When we say one time period left in this context, we refer to the case of having anything between 5 and 20 minutes of time left. This is because when we estimate the value function, we make the assumption that each recursive step is valid for a time period of 20 minutes, since the submission time component of the control signal varies between 5 and 20 minutes. More explicitly, the algorithm we use is given below:

Algorithm 13 "GradientDescent"
1: procedure GradientDescent(model parameters, function handle, reward data)
2:    Take a random subset X of the reward data
3:    for x_t, u_t and reward_j in X do                          ▷ iteration index j starts at 1
4:       ∇_{c_i}L(c_1, ..., c_n, x_t) = (reward_j − V*_{τ−1}(c_1, ..., c_n, x_t)) · ∇_{c_i}V*_{τ−1}(c_1, ..., c_n, x_t)
5:       [c_1, ..., c_n]_{j+1} = [c_1, ..., c_n]_j − α · ∇_{c_i}L(c_1, ..., c_n, x_t)
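Written out as code, one pass of Algorithm 13 could look roughly as follows. The routine below is a minimal sketch: value_fn, the batch structure and the default step sizes are illustrative placeholders rather than the thesis implementation, and the gradient is taken with a central difference as described next.

```python
import numpy as np

def numerical_gradient(f, c, h=1e-6):
    """Central-difference gradient of f with respect to the coefficient vector c."""
    grad = np.zeros_like(c, dtype=float)
    for i in range(len(c)):
        c_plus, c_minus = c.copy(), c.copy()
        c_plus[i] += h
        c_minus[i] -= h
        grad[i] = (f(c_plus) - f(c_minus)) / (2.0 * h)
    return grad

def gradient_descent_pass(coeffs, batch, value_fn, alpha=1e-3, h=1e-6):
    """One pass of Algorithm 13 over a batch of (x_t, u_t, reward) samples.

    value_fn(c, x_t) is assumed to return V*_{tau-1}(c_1, ..., c_n, x_t).
    The step sizes are illustrative defaults, not the tuned values used later.
    """
    c = np.asarray(coeffs, dtype=float).copy()
    for x_t, u_t, reward in batch:
        grad_v = numerical_gradient(lambda cc: value_fn(cc, x_t), c, h)
        grad_loss = (reward - value_fn(c, x_t)) * grad_v   # as in (8.5)
        c = c - alpha * grad_loss                          # as in (8.4)
    return c
```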

It is worth mentioning that when calculating the gradient we take the numerical derivative using a central difference. This is mainly due to the fact that we have 38 coefficients of which we would otherwise need to take the analytical derivative, which we omitted due to time constraints. In this thesis we try three approaches to train our fitted model:

1. Fitting the model to data and testing the model on out-of-sample data without any training

2. Training the fitted model and then testing the model on out-of-sample data

3. Updating the model during actual testing on out-of-sample data

Do note that the out-of-sample test data set remains the same in all three cases.


Chapter 9

Implementation

9.1 Estimating the value function

When estimating the value function we need the following components: the transition probability, the reward function for x_{t,2} < τ and the reward function for x_{t,2} = τ. We start by describing how we estimate the transition probability.

9.1.1 Transition probability

When estimating the transition probability, we estimate the pdf by taking the derivative of a cumulative distribution fitted against the empirical cumulative distribution. Since our model consists of a Markov process, each data point used to estimate the empirical cumulative distribution is assumed to be i.i.d., which implies that the Glivenko-Cantelli theorem holds. The question is thus whether the number of data points n is big enough to achieve convergence to the true distribution. In order to assess this we conduct Kolmogorov-Smirnov tests at the 95% confidence level. The Kolmogorov-Smirnov test is conducted by testing the hypothesis:

1. H0: The data follows the specified distribution

2. H1: The data does not follow the specified distribution

We reject H0 if the test statistic

D_n = \sup_{x \in \mathbb{R}} |F_n(x) - F_X(x)| \qquad (9.1)

is higher than the critical value α_{95%,n} ≈ 1.358/√n, valid for n ≥ 35. In our case n = 194352.
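For reference, the statistic (9.1) and the critical value can be computed directly; a minimal sketch, assuming the fitted CDF is available as a callable (for instance the point-mass/beta mixture described later in this section):

```python
import numpy as np

def ks_statistic(sample, fitted_cdf):
    """D_n = sup_x |F_n(x) - F_X(x)| as in (9.1), evaluated at the sample points."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    cdf = fitted_cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF just after each point
    d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF just before each point
    return max(d_plus, d_minus)

def ks_critical_value(n):
    """Asymptotic 95% critical value, 1.358 / sqrt(n), valid for n >= 35."""
    return 1.358 / np.sqrt(n)

# With n = 194352 this gives roughly 1.358 / sqrt(194352) ≈ 0.0031,
# matching the critical value quoted in Appendix B.
```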

Further, we also find the 95% confidence intervals for all estimated coefficients by using a non-parametric bootstrap¹.

1See Appendix B
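A non-parametric percentile bootstrap of the coefficient confidence intervals could look roughly as follows; fit_coefficients is a placeholder for whatever fitting routine produces the α, β and p estimates, so this is only a sketch of the resampling scheme.

```python
import numpy as np

def bootstrap_ci(data, fit_coefficients, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CIs; fit_coefficients(sample) -> 1-D coefficient array."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.array([fit_coefficients(data[rng.integers(0, n, size=n)])
                          for _ in range(n_boot)])
    lo = np.percentile(estimates, 100 * (1 - level) / 2, axis=0)
    hi = np.percentile(estimates, 100 * (1 + level) / 2, axis=0)
    return lo, hi
```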


We estimate the transition probability of instrument A by using the following functions:

P_{x_t \to x_{t+1}}(u_t) = P_{x_t \to x_{t+1}}(u_{t,1}, u_{t,2}, u_{t,3}) := f\!\left(1 - u_{t,1};\, \alpha(u_{t,2}, u_{t,3}), \beta(u_{t,2}, u_{t,3})\right) \qquad (9.2)

The above function denotes the transition probabilities. Below is the point distribution and how it depends on u_t.

p_{u_{t,1}}(u_{t,1}, u_{t,2}, u_{t,3}) = p(u_{t,2}, u_{t,3}) + \left(1 - p(u_{t,2}, u_{t,3})\right) \cdot I_{1 - u_{t,1}}\!\left(\alpha(u_{t,2}, u_{t,3}), \beta(u_{t,2}, u_{t,3})\right) \qquad (9.3)

where I_x(a, b) is the cumulative beta distribution function for parameters a and b, and f(x; a, b) is the probability density function of the beta distribution. Further, p(u_{t,2}, u_{t,3}) is the empirically estimated function for the point distribution at u_{t,1} = 1². As a first step, we only fit α, β and p for u_{t,1}:

Figure 9.1: Here we look at the cumulative transition probabilities for different time periods and actions for instrument A.

In the above graph we see that the cumulative beta distribution with a point mass fits quite well. Do note that we only need to fit the outermost edge of the distribution, since the curves do not change shape along the volume submission axis. Looking at the Kolmogorov-Smirnov test results³ we obtain the following:

²Since p is estimated directly as a parameter depending on α and β, a confidence interval is not given for p.

³Test statistics can be found in Appendix B.


Time period\Action | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6
5                  |  0 |  0 |  0 |  1 |  1 |  1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
10                 |  0 |  1 |  1 |  1 |  1 |  1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
15                 |  1 |  1 |  1 |  1 |  1 |  1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
20                 |  1 |  1 |  1 |  1 |  1 |  1 | 1 | 0 | 1 | 1 | 1 | 1 | 0

Table 9.1: Here 0 means that we do not reject H0 and 1 means that we reject H0.

We see that the majority of the beta fits for different action parameters are rejected by the Kolmogorov-Smirnov test. This suggests that the beta distribution with a point probability is not the true distribution.
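To make (9.2)–(9.3) concrete, a small sketch of evaluating the fitted point-mass/beta mixture with scipy is given below. The parameter values are taken for illustration from the T = 5, action = 0 row of Table B.2; they are not meant as the definitive model.

```python
from scipy.stats import beta

# Illustrative parameters from the T = 5, action = 0 row of Table B.2
ALPHA, BETA_, P0 = 1.6296, 0.3616, 0.0408

def transition_density(u1, a=ALPHA, b=BETA_):
    """Continuous part of the transition, f(1 - u_{t,1}; alpha, beta), cf. (9.2)."""
    return beta.pdf(1.0 - u1, a, b)

def full_execution_probability(u1, a=ALPHA, b=BETA_, p=P0):
    """Point mass p_{u_{t,1}}(u_t) = p + (1 - p) * I_{1-u1}(alpha, beta), cf. (9.3)."""
    return p + (1.0 - p) * beta.cdf(1.0 - u1, a, b)

print(full_execution_probability(0.2))  # chance a 20% submission is filled completely
```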

Having parametrized the distribution for volume submission, we also seek to parametrize it for time period and action. Thus we use the following functions to parametrize p(u_{t,2}, u_{t,3}), α(u_{t,2}, u_{t,3}) and β(u_{t,2}, u_{t,3}).

I(u_{t,1}; \alpha(u_{t,2}, u_{t,3}), \beta(u_{t,2}, u_{t,3})) := \int_0^{u_{t,1}} t^{\alpha(u_{t,2}, u_{t,3})} (1 - t)^{\beta(u_{t,2}, u_{t,3})} \, dt

f(x; \alpha(u, v), \beta(u, v)) = \frac{\partial I(x; \alpha(u, v), \beta(u, v))}{\partial x}

p(x, y) = c_1 \cdot \frac{x}{c_2 + e^{-c_3 \cdot y}}

\alpha(x, y) = (c_1 + c_2 \cdot xy)\left[ c_3 + c_4 \cdot e^{c_5 x} \cdot \frac{ \dfrac{e^{-0.5\,(y - (c_7 + c_8 \cdot x))^2}}{2\pi c_6} }{ \dfrac{1}{2\pi}\left( \displaystyle\int_{-\infty}^{\frac{6 - c_7}{c_6}} e^{-0.5 t^2} \, dt - \int_{-\infty}^{\frac{-6 - c_7}{c_6}} e^{-0.5 t^2} \, dt \right) } \right]

\beta(x, y) = c_1 \cdot \frac{ \dfrac{e^{-0.5\,(y - (c_2 + c_3 \cdot x))^2}}{2\pi c_4} }{ \dfrac{1}{2\pi}\left( \displaystyle\int_{-\infty}^{\frac{6 - c_2}{c_4}} e^{-0.5 t^2} \, dt - \int_{-\infty}^{\frac{-6 - c_2}{c_4}} e^{-0.5 t^2} \, dt \right) }

In the above equations, we have denoted every constant c_i, i ∈ N. The idea is that we conduct a least squares fit on the estimates generated from the previous simulations. In this context, we will later attempt to train our model by using gradient descent on the coefficients c_i. These function shapes were obtained by iterating over different shapes until we achieved a fit that looked like the following:


[Figure 9.2 panels: "p0, alpha and beta fit" — estimated and fitted values of p, α and β against time period (T) and action.]

Figure 9.2: Here we parametrize p0, α and β for the variables time period and action.
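As an example of how such a functional form can be fitted, a sketch of a least-squares fit of p(x, y) = c_1·x/(c_2 + e^{−c_3·y}) over a (time period, action) grid is shown below. The data array p_hat is a synthetic placeholder and the initial guess is loosely based on Table B.6; the actual fit was done on the empirical estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

def p_model(xy, c1, c2, c3):
    """p(x, y) = c1 * x / (c2 + exp(-c3 * y)); x = time period, y = action."""
    x, y = xy
    return c1 * x / (c2 + np.exp(-c3 * y))

# Placeholder grid of point-mass estimates p_hat over (T, action)
T, action = np.meshgrid([5.0, 10.0, 15.0, 20.0], np.arange(-6.0, 7.0), indexing="ij")
p_hat = 0.03 * T / (0.65 + np.exp(-0.6 * action))          # stand-in data

popt, _ = curve_fit(p_model, (T.ravel(), action.ravel()), p_hat.ravel(),
                    p0=(0.03, 0.6, 0.6), maxfev=20000)
```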

9.1.2 Reward function

When estimating the reward function there are two components that are necessary: the reward function for x_{t,2} < τ and the reward function for x_{t,2} = τ. In the latter case, we are in the terminal state and hence have probability 1 of transitioning to x_{τ,1} = 0. Thus we denote the reward function for this case as a value function V*_N(x_τ):

c_{im}(u_t) = -\left(I(u_{t,1}; \alpha(u_{t,3}), \beta(u_{t,3})) - 1\right) \cdot \left( \sum_{i=0}^{4} c_i(u_{t,3})\, u_{t,2}^{\,i} \right)

\alpha(x) = C_\alpha - E_\alpha e^{D_\alpha (x - 5)}

\beta(x) = C_\beta - E_\beta e^{D_\beta (x - 5)}

c_i(x) = C_i - E_i e^{D_i (x - 5)}

V^*_N(x_\tau) = B_N \left(1 - e^{C_N \cdot x_\tau}\right)

Here I also denotes the cumulative beta distribution. Further, B•, C•, D• and E• denote the coefficients that are being fitted to data. Likewise to how we fitted the transition probability function, we conducted the fitting in two steps. We do the first fit by parameterizing volume submission and action:


[Figure 9.3 panels: observed and fitted costs (volume·price/SD) against submitted share (%) and action, for time periods 5, 10, 15 and 20.]

Figure 9.3: Reward function fit for the variables volume submission and action, shown for time periods 5, 10, 15 and 20.

We now parametrize the time period dependence of α(t), β(t) and c_i(t).

[Figure 9.4 panels: "Fitting coefficients against time period" — observed and fitted values of α, β and c0–c4 against time period (T).]

Figure 9.4: Reward function coefficients fitted against time period.

Finally, we also make a fit for V*_N(x_τ), which is a lot simpler since it only depends on the terminal volume state:


[Figure 9.5 panel: "Value function costs in terminal state" — observed and fitted cost (volume·price/SD) against volume state (%).]

Figure 9.5: Fitting the terminal value function.
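The terminal fit V*_N(x) = B_N(1 − e^{C_N·x}) is a one-dimensional curve fit; a minimal sketch is given below, where the observed costs are synthetic placeholders with roughly the shape of Figure 9.5 rather than the actual simulation output.

```python
import numpy as np
from scipy.optimize import curve_fit

def terminal_value(x, b_n, c_n):
    """V*_N(x) = B_N * (1 - exp(C_N * x)) for the remaining volume fraction x."""
    return b_n * (1.0 - np.exp(c_n * x))

# Placeholders for the simulated terminal costs at different remaining-volume fractions
x_obs = np.linspace(0.0, 1.0, 21)
cost_obs = 400.0 * (1.0 - np.exp(2.0 * x_obs))      # stand-in data, roughly the right shape

(b_n, c_n), _ = curve_fit(terminal_value, x_obs, cost_obs, p0=(300.0, 1.5))
```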

9.1.3 Value function

Since we have changed the original discrete formulation of the Bellman equation to a continuous one, the method of solving changes from simply finding, by comparison, the control in the finite control set that maximizes V_t(x_t) to using gradient methods depending on the value of x_t ∈ [0, 1]. The Bellman equation is recursive, and thus for every iteration we need to take the derivative of a max_{u_t ∈ U} operator. This procedure becomes rather difficult to handle, and we work around it by estimating the value function for each time period by interpolating the optimal value corresponding to every x_t ∈ [0, 1] with step size 0.01. Since the time period is now a control within the interval [5, 20], we make the assumption that each time period is 20 minutes, meaning that if we consider a session time of 180 minutes we would have 10 time states, and if there were 165 minutes left we would still be in time state 0 and only switch to 1 when we have less than 160 minutes left. In a sense, we estimate the value function piecewise in time.

Below, we have fitted the optimal value functions for the cases when we have trained the model and not trained it:


[Figure 9.6 panel: "Trained estimation" — optimal value against volume state (% left), calculated points and fitted values.]

Figure 9.6: Fitted value function for when training has been conducted.

[Figure 9.7 panel: "No training estimation" — optimal value against volume state (% left), calculated points and fitted values.]

Figure 9.7: Fitted value function for when training has not been conducted.

In this case we see that the value function in both cases seems to converge to a curve. The curve type we have used to fit the optimal value function is the quadratic polynomial:

V^*_\bullet(x) = a x^2 + b x \qquad (9.4)
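Fitting (9.4) to the interpolated optimal values on the x-grid with step 0.01 amounts to a linear least-squares problem; a sketch follows, where the grid values stand in for the computed optima.

```python
import numpy as np

x_grid = np.arange(0.0, 1.0 + 1e-9, 0.01)              # volume states, step size 0.01
v_opt = -1100.0 * x_grid**2 + 50.0 * x_grid             # placeholder for the computed optima

# linear least squares for V*(x) = a*x^2 + b*x (no constant term, so V*(0) = 0)
A = np.column_stack([x_grid**2, x_grid])
(a, b), *_ = np.linalg.lstsq(A, v_opt, rcond=None)
```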


9.2 Simulation algorithm

The simulation algorithm largely remains the same as in the discrete case, the only exception being how the optimal policy is obtained. The algorithm below illustrates this:

Algorithm 14 "GetOptimalPolicyRL"
1: procedure GetOptimalPolicyRL(model parameters, function handle, x_t, value function handle, session time)
2:    Calculate which value function handle V*_{t+1}(x_t) to use depending on time state and session time
3:    Find u_t such that V*_t(x_t) = max_{u_t ∈ U} { c_{im}(u_t) + p_{u_{t,1}}(u_t) V*_{t+1}(x_{t,1} − u_{t,1}) + ∫_0^{u_{t,1}} P_{x_t→x_{t+1}}(u_t) V*_{t+1}(x_{t+1}) dx_{t+1} }
4:    return u_t
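Step 3 of Algorithm 14 is a continuous optimization over the control; a rough sketch using a bounded numerical optimizer is given below. The helper one_step_value and the discrete action loop are assumptions about how the maximization could be organized (for instance via a routine like the bellman_rhs sketch in Section 8.1), not the exact implementation used here.

```python
import numpy as np
from scipy.optimize import minimize

def get_optimal_policy_rl(x_t, actions, one_step_value):
    """Maximize the one-step value over (volume submission, time period) per discrete action.

    one_step_value(action, u1, t_period, x_t) should return the bracket in (8.3).
    """
    best_u, best_v = None, -np.inf
    for a in actions:                                   # discrete price-level component
        res = minimize(lambda z: -one_step_value(a, z[0], z[1], x_t),
                       x0=np.array([0.5 * x_t, 10.0]),
                       bounds=[(0.0, x_t), (5.0, 20.0)],
                       method="L-BFGS-B")
        if -res.fun > best_v:
            best_v, best_u = -res.fun, (a, res.x[0], res.x[1])
    return best_u
```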

In the case where we train the model online, there is a slight change in estimation method due to limitations in computational power. Since the optimal policy requires solving a rather advanced optimization problem for each iteration, it takes quite some time to loop through every data point as a starting point for simulation.

The online learning approach does not allow parallel computation, which slows down the process even further. To work around this, we only sample 6 data points from every day in the data and use the simulations from these data points as an average. This specific number of data points was chosen so that computation did not take an unreasonable amount of time before yielding results. Further, it should be noted that we also train the estimated optimal value function V*_t(x) associated with every time state t when conducting online training.

When using gradient descent, we have chosen step size α = 10^{-11} and a numerical accuracy of h = 10^{-15} when calculating the numerical gradient.

9.3 Results and Comments

We obtain the following results:

Session Time\Learning style | Not trained | Trained   | Trained online | Best discrete model
180                         | -0.017209   | -0.016605 | -0.020039      | -0.012640
120                         | -0.018457   | -0.017651 | -0.018425      | -0.014560
60                          | -0.030754   | -0.030493 | -0.03079       | -0.020513

Table 9.2: Reinforcement learning results

We see that the slippage yielded by this model is indeed quite low, but not as good as the naive strategy "Portion out" or the best discrete-case models. In a sense, this is a little disappointing, but we see that there is actually a consistent improvement from training the model before testing, which suggests that this approach might be viable for larger data sets when more rigorous training is available. Further, we see that conducting online training yields mixed results, which may partially be credited to using a much smaller data set.


Part IV

Discussion and Conclusions


Chapter 10

Discussion

In this chapter we review the results we have obtained and discuss their implications and why we obtained these results. A large part of this section will also be committed to analyzing the models used and the implementation shortfalls, and how these could have affected the results.

The obtained results in the discrete case of the model suggest that naive strategies do better when there is no punishment on large volume submissions and aggressive orders. Since the policies from the MDP might suggest very passive orders when there is a lot of time left, these orders might end up unexecuted during testing, thus yielding "wasted" time. Hence the naive strategy of "portion out evenly at bid/ask" puts a smaller order at a price that is much more likely to be executed, thus not "wasting" any time. In the case of instrument C, the MDP strategy is superior, which most likely is due to properties instrument C exhibits, such as a low number of trades per minute and a large price tick to volatility ratio, implying that there are certain quirks to the price movement of the instrument.

Looking at the results involving market impact, the results suggest that the MDP approach is superior in performance compared to a naive strategy. This may be credited to the fact that the MDP takes the average from all the simulations and hence also incorporates the effect of market impact in its estimated average. From the slippages on instrument C, this model clearly also has flaws when the instrument exhibits certain properties that might render the model invalid due to its stochastic component. When one assumes that transition probabilities and price are affected by market impact, the results suggest that the MDP is superior to naive strategies. When directly imposing an additional slippage, the results suggest that the naive strategies are still superior. As mentioned, this may be due to a too small γ imposing too lenient punishments.

Looking at the Reinforcement learning methods, the overall results are inferior to the discrete results. But the method shows potential in the sense that training the estimated model improves the test results, suggesting that the estimated model holds some truth value, although the Kolmogorov-Smirnov test rejects the hypothesis that the beta distribution with a point probability is the true distribution. When implementing this model, there have been several issues with numerical precision, causing the optimization function to only find a suboptimal control and various probabilistic expressions to be erroneously evaluated. Another aspect is the numerical gradient, which we chose instead of diligently taking the analytical derivative of all 38 coefficients used for estimation. This part might also have yielded some inaccuracies when conducting gradient descent. Further, looking at the online training, we see that a slight improvement is achieved in one of the cases. Since the data set used for testing was much smaller than in the other cases, it can be speculated whether the improvement is coincidental or whether the model actually improved during simulations.

10.1 Conclusions

From the results, it can be concluded that:

1. In the case without market impact, the naive strategy "portion out evenly at bid/ask" seems superior when instruments have a lower price tick to volatility ratio.

2. In the case with market impact ("MI" model), strategies obtained from the MDP approach seem superior due to their ability to incorporate the behavior imposed by the market impact. The "MI" model is also slightly misspecified for instrument C, which exhibits a lower number of trades per minute.

3. In the case with market impact ("Lynx" model), the optimal strategies exhibit the same behavior as without market impact, mainly suggesting that the market impact factor γ has possibly been set too low for the simulations.

4. Reinforcement learning seems like a viable option in solving optimal order execution problems. This statement is strengthened by the fact that the estimated model consistently improves with training. Further, estimating the components needed to build the value function is a viable option as opposed to directly estimating the value function from the discrete recursions of the Bellman equation.

10.1.1 Further research

Looking at the discrete model and its results, more simulations on different instruments with different properties could be a further topic, to see which kinds of instruments exhibit behavior where the MDP is a good approach. Another topic is to rework or replace the "MI" model with a new model that consistently exhibits the correct properties for any instrument.


For the Reinforcement learning part, further research on cases where market impact is involved and on larger sets of data would give further insight and validate whether this approach is truly viable or whether the results presented in this thesis are merely coincidental. For larger data sets it is also interesting to see whether a neural network would do better at estimating the value function than parametrized functions, thus utilizing the notion of "Deep reinforcement learning".


References

[1] Almgren, R. and N. Chriss, Optimal execution of portfolio transactions, Journal of Risk, 2000

[2] Antonoglou, I., Graves, A., Kavukcuoglu, K., Mnih, V., Riedmiller, M., Silver, D. and D. Wierstra, Playing Atari with Deep Reinforcement Learning, NIPS Deep Learning Workshop, Lake Tahoe, CA, 2013

[3] Feng, Y., Kearns, M. and Y. Nevmyvaka, Reinforcement Learning for Optimized Trade Execution, Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006

[4] Hendricks, D. and D. Wilcox, A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution, University of the Witwatersrand, 2014

[5] Jönsson, U., Lecture notes in Optimal Control, KTH Royal Institute of Technology, 2010

[6] Kearns, M. and Y. Nevmyvaka, Machine Learning for Market Microstructure and High Frequency Trading, High-Frequency Trading - New Realities for Traders, Markets and Regulators, Risk Books, 2013

[7] Kharroubi, I. and H. Pham, Optimal Portfolio Liquidation with Execution Cost and Risk, SIAM Journal on Financial Mathematics, 2010

[8] Mohri, M., Rostamizadeh, A. and A. Talwalkar, Foundations of Machine Learning, The MIT Press, Cambridge, MA, 2012

[9] Ross, S. M., Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, CA, 1970

[10] Silver, D., Lecture notes in Reinforcement Learning, University College London, 2015

[11] Åström, K. J., Stochastic Control Problems, Lund Institute of Technology, 1977


Appendix A

Additional Plots

Here we add some additional plots with more information about the financial instruments used in this thesis:

[Figure A.1 panels (a)–(c): sweep-to-fill buy and sell yearly evolution of price (standard deviations) for instruments A, B and C, for total order volumes of 0.01, 0.02 and 0.03 times the mean transaction volume.]

Figure A.1: "Sweep to fill reward" yearly evolution


[Figure A.2 panels (a)–(c): sweep-to-fill buy and sell daily average price (standard deviations) for instruments A, B and C, for total order volumes of 0.01, 0.02 and 0.03 times the mean transaction volume.]

Figure A.2: "Sweep to fill reward" daily evolution

[Figure A.3 panels (a)–(c): yearly price movement in standard deviations per volume unit for instruments A, B and C.]

Figure A.3: Price movement yearly evolution, SD per volume unit


[Figure A.4 panels (a)–(c): daily average price movement in standard deviations per volume unit for instruments A, B and C.]

Figure A.4: Price movement daily evolution, SD per volume unit

[Figure A.5 panels (a)–(c): accumulated order book volume bought (+) and sold (−), yearly evolution, for instruments A, B and C and order book depths 1–10.]

Figure A.5: Accumulated volume in order book, yearly evolution


[Figure A.6 panels (a)–(c): average accumulated order book volume sold (−) and bought (+) on a daily basis, for instruments A, B and C and order book depths 1–10.]

Figure A.6: Accumulated volume in order book, daily evolution

[Figure A.7 panels (a)–(c): yearly average price (bid+ask)/2.]

Figure A.7: Average price, yearly evolution


[Figure A.8 panels (a)–(c): daily average price (bid+ask)/2.]

Figure A.8: Average price, daily evolution

[Figure A.9 panels (a)–(c): Yang-Zhang (YZ) volatility, yearly evolution.]

Figure A.9: YZ volatility yearly evolution


Appendix B

Tables of parameter estimations

Action\Time period | 5      | 10     | 15     | 20
-6                 | 0.0008 | 0.0026 | 0.0049 | 0.0063
-5                 | 0.0014 | 0.0042 | 0.0071 | 0.0092
-4                 | 0.0021 | 0.0065 | 0.0099 | 0.0131
-3                 | 0.0038 | 0.0114 | 0.0166 | 0.0193
-2                 | 0.0078 | 0.0196 | 0.0260 | 0.0277
-1                 | 0.0153 | 0.0319 | 0.0376 | 0.0367
0                  | 0.0320 | 0.0425 | 0.0381 | 0.0316
1                  | 0.0173 | 0.0049 | 0.0031 | 0.0022
2                  | 0.0380 | 0.0132 | 0.0057 | 0.0050
3                  | 0.0388 | 0.0199 | 0.0063 | 0.0031
4                  | 0.0294 | 0.0191 | 0.0088 | 0.0036
5                  | 0.0196 | 0.0145 | 0.0078 | 0.0035
6                  | 0.0193 | 0.0112 | 0.0042 | 0.0020

Table B.1: Test statistics obtained for the Kolmogorov-Smirnov test. Do note that the critical value is 0.0031.


Table B.2: Transition probability T=5: Coefficient estimates and 95% confidence interval

Action/Coeff | α      | β      | p      | α lb   | α ub   | β lb   | β ub
-6           | 0.7688 | 0.0028 | 0.0033 | 0.7699 | 0.7699 | 0.0027 | 0.0031
-5           | 0.8113 | 0.0048 | 0.0044 | 0.7563 | 0.9248 | 0.0045 | 0.0055
-4           | 0.9062 | 0.0093 | 0.0062 | 0.8152 | 1.0363 | 0.0089 | 0.0106
-3           | 0.9742 | 0.0191 | 0.0087 | 0.9250 | 1.0006 | 0.0182 | 0.0200
-2           | 1.0678 | 0.0414 | 0.0131 | 0.9689 | 1.1520 | 0.0389 | 0.0443
-1           | 1.2582 | 0.1023 | 0.0213 | 1.1756 | 1.3888 | 0.0945 | 0.1168
0            | 1.6296 | 0.3616 | 0.0408 | 1.5206 | 1.6680 | 0.3437 | 0.3714
1            | 2.5994 | 1.2572 | 0.0840 | 2.5085 | 2.6712 | 1.2283 | 1.2825
2            | 3.5069 | 2.2725 | 0.1165 | 3.3886 | 3.9768 | 2.1482 | 2.5388
3            | 4.7435 | 4.0001 | 0.1407 | 4.2331 | 5.2577 | 3.6635 | 4.3712
4            | 4.7718 | 5.8365 | 0.1451 | 4.6285 | 4.8916 | 5.6533 | 5.9570
5            | 3.1472 | 5.9933 | 0.1531 | 2.9521 | 3.3006 | 5.6399 | 6.2890
6            | 1.9728 | 5.7189 | 0.2335 | 1.8967 | 1.9995 | 5.4520 | 5.9057

Table B.3: Transition probability T=10: Coefficient estimates and 95% confidence interval

Action/Coeff | α      | β      | p      | α lb   | α ub   | β lb   | β ub
-6           | 0.7099 | 0.0078 | 0.0126 | 0.6668 | 0.7440 | 0.0076 | 0.0082
-5           | 0.7237 | 0.0130 | 0.0175 | 0.6333 | 0.7438 | 0.0123 | 0.0134
-4           | 0.7359 | 0.0218 | 0.0251 | 0.6713 | 0.7840 | 0.0206 | 0.0242
-3           | 0.7876 | 0.0405 | 0.0370 | 0.7328 | 0.8913 | 0.0386 | 0.0460
-2           | 0.8011 | 0.0751 | 0.0551 | 0.7449 | 0.8835 | 0.0699 | 0.0823
-1           | 0.8576 | 0.1522 | 0.0853 | 0.8357 | 0.9432 | 0.1444 | 0.1659
0            | 0.9863 | 0.3934 | 0.1502 | 0.9302 | 1.0173 | 0.3708 | 0.4138
1            | 1.3621 | 1.1265 | 0.2585 | 1.3471 | 1.3777 | 1.1149 | 1.1324
2            | 1.6330 | 1.9629 | 0.3369 | 1.5632 | 1.7345 | 1.8832 | 2.0399
3            | 1.8915 | 2.8536 | 0.3996 | 1.7887 | 2.0386 | 2.7854 | 2.9919
4            | 2.2852 | 4.2868 | 0.4415 | 2.1154 | 2.4246 | 4.1114 | 4.7107
5            | 2.0986 | 5.4886 | 0.4624 | 1.9064 | 2.1726 | 5.1648 | 5.6841
6            | 1.5802 | 5.8886 | 0.5247 | 1.5533 | 1.5891 | 5.7687 | 5.9584


Table B.4: Transition probability T=15: Coefficient estimates and 95% confidence interval

Action/Coeff | α      | β      | p      | α lb   | α ub   | β lb   | β ub
-6           | 0.6663 | 0.0129 | 0.0271 | 0.6126 | 0.7128 | 0.0121 | 0.0139
-5           | 0.6661 | 0.0199 | 0.0376 | 0.5068 | 0.7035 | 0.0187 | 0.0208
-4           | 0.6661 | 0.0314 | 0.0531 | 0.6292 | 0.7096 | 0.0298 | 0.0329
-3           | 0.7025 | 0.0543 | 0.0770 | 0.6427 | 0.7494 | 0.0506 | 0.0578
-2           | 0.7028 | 0.0932 | 0.1121 | 0.5871 | 0.7875 | 0.0828 | 0.1040
-1           | 0.7282 | 0.1698 | 0.1679 | 0.5056 | 0.8270 | 0.1560 | 0.1931
0            | 0.8167 | 0.3835 | 0.2745 | 0.7766 | 0.8364 | 0.3671 | 0.3890
1            | 1.0639 | 1.0086 | 0.4248 | 1.0600 | 1.0688 | 1.0016 | 1.0142
2            | 1.2160 | 1.8105 | 0.5333 | 1.1964 | 1.2478 | 1.7863 | 1.8291
3            | 1.3785 | 2.7654 | 0.6187 | 1.3579 | 1.4120 | 2.7279 | 2.7982
4            | 1.5923 | 3.9147 | 0.6743 | 1.5798 | 1.6287 | 3.8468 | 3.9520
5            | 1.6590 | 5.2489 | 0.7049 | 1.5889 | 1.7845 | 5.1165 | 5.5341
6            | 1.3645 | 5.8160 | 0.7468 | 1.3247 | 1.3941 | 5.5792 | 5.9320

Table B.5: Transition probability T=20: Coefficient estimates and 95% confidence interval

Action    α        β        p        α lb     α ub     β lb     β ub
-6        0.6287   0.0167   0.0453   0.5559   0.7117   0.0159   0.0184
-5        0.6485   0.0257   0.0616   0.6011   0.6889   0.0246   0.0290
-4        0.6362   0.0392   0.0848   0.6159   0.6555   0.0381   0.0436
-3        0.6497   0.0629   0.1201   0.5857   0.6811   0.0578   0.0663
-2        0.6557   0.1015   0.1715   0.5921   0.6868   0.0980   0.1042
-1        0.6794   0.1740   0.2471   0.6279   0.7516   0.1642   0.1908
 0        0.7605   0.3683   0.3757   0.6935   0.8150   0.3501   0.3922
 1        0.9816   0.9252   0.5433   0.9767   0.9868   0.9235   0.9289
 2        1.0747   1.6284   0.6648   1.0595   1.1047   1.6044   1.6563
 3        1.1505   2.5421   0.7544   1.1486   1.1531   2.5292   2.6109
 4        1.2559   3.5770   0.8131   1.2179   1.5673   3.4969   4.0962
 5        1.3588   4.8084   0.8440   1.3238   1.3966   4.6854   4.9344
 6        1.2334   5.5706   0.8700   1.1956   1.2677   5.4226   5.7095


Table B.6: Transition probability function: time and action parameter estimates with 95% confidence intervals

Coefficient   Value      Lower bound   Upper bound
p : c1        0.0311     0.0274        0.0365
p : c2        0.6469     0.5620        0.8079
p : c3        0.6048     0.5521        0.6764
α : c7        2.7842     2.6091        3.0897
α : c6        1.8137     1.5943        2.0610
α : c4        38.8880    30.4599       47.8247
α : c3        0.7937     0.7586        0.8396
α : c5        -0.1659    -0.1928       -0.1286
α : c8        0.1067     0.0698        0.1337
α : c1        0.9797     0.9643        0.9925
α : c2        0.0011     0.0009        0.0014
β : c2        5.1435     4.9679        8.3095
β : c4        2.6508     2.5530        3.9359
β : c1        24.5163    11.4258       26.2723
β : c3        0.0727     -0.1202       0.0850

Table B.7: Reward T=5: time and action parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
c4            0.0843      0.0598        0.1299
c3            -0.8700     -1.0010       -0.7349
c2            -19.4521    -21.1520      -17.4346
c1            -90.0394    -92.2358      -87.5980
c0            -145.1442   -149.3477     -140.9711
α             1.9681      1.9335        2.0782
β             1.7095      1.6759        1.7826


Table B.8: Reward T=10: time and action parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
c4            0.2708      0.2468        0.2919
c3            0.1763      -0.0362       0.2851
c2            -26.3131    -27.0717      -25.4377
c1            -151.8706   -154.6408     -148.5554
c0            -297.7239   -301.6426     -295.5811
α             1.4131      1.3910        1.4420
β             1.5936      1.5783        1.6179

Table B.9: Reward T=15: time and action parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
c4            0.3493      0.3277        0.4281
c3            0.4947      0.4105        0.6021
c2            -29.1332    -31.1326      -28.2380
c1            -177.3983   -180.3701     -175.2524
c0            -393.7561   -401.0657     -386.0641
α             1.2177      1.2021        1.2354
β             1.5570      1.5313        1.5775

Table B.10: Reward T=20: time and action parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
c4            0.3762      0.3581        0.4475
c3            0.5124      0.3904        0.6166
c2            -29.9487    -32.8265      -28.7662
c1            -184.4126   -187.3292     -180.3502
c0            -451.7511   -465.5944     -439.3724
α             1.1385      1.1207        1.1555
β             1.5403      1.5263        1.5599


Table B.11: Reward: time parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
Cα            1.0988      1.0895        1.2311
Eα            0.8690      -7.9269       4.6309
Dα            -0.2022     -0.2044       -0.1925
Cβ            1.5343      1.5087        1.5751
Eβ            0.1751      -2.6905       3.0356
Dβ            -0.2132     -0.9594       0.0016
C4            0.3976      0.3922        0.4420
E4            -0.3135     -0.3222       -0.3057
D4            -0.1828     -0.1971       -0.1818
C3            0.5582      -2.7104       0.5582
E3            -1.4296     -1.4296       2.0566
D3            -0.2705     -0.2705       -0.2705
C2            -30.6337    -31.4318      -29.8975
E2            11.1915     4.8917        13.7193
D2            -0.1932     -1.0139       -0.0876
C1            -190.5588   -195.3487     -190.4260
E1            100.6199    100.5353      105.3093
D1            -0.1945     -1.3664       -0.1238
C0            -548.1662   -556.8428     -449.1081
E0            403.1283    303.3633      411.6986
D0            -0.0956     -1.0115       -0.0926

Table B.12: Value function: Volume left parameter estimates with 95% confidence intervals

Coefficient   Value       Lower bound   Upper bound
BN            -492.3597   -503.2909     -482.1639
CN            1.9178      1.8977        1.9376

