FE 582 Project Report

STATISTICAL ARBITRAGE PAIRS FOR THE UNIVERSE OF SECTORAL ETFs USING COINTEGRATION

MANOJ SHENOY

ZENGHUI LIU

YANGXI LENG

TABLE OF CONTENTS

1. Introduction
2. Central Theme and Motivation
3. Project Focus and Objective
4. Current Work
5. Methodology and Technology
6. Training Data
7. Performance Analysis for the Complete Co-integrated Portfolio
8. ETF Spread Prediction through Supervised Machine Learning

APPENDIX 1
A. WEKA Analysis with 5 Lagged Variables
B. WEKA Analysis with 10 Lagged Variables

APPENDIX 2
Complete R Code

LIST OF FIGURES & TABLES

Fig 1. Spread Chart for the Pair CRBQ-GRES
Fig 2. Scatterplot of Log Prices : CRBQ-GRES
Fig 3. Equity Curve for Co-integrated Portfolio
Fig 4. Equity Curve for CRBQ-GRES
Fig 5. Cumulative Profit at various levels of Delta
Fig 6. Predicted Spread with Lagged Variables - 5
Fig 8. Weka Analysis with Spread Analysis for 5 Lags
Fig 9. Weka Analysis with Spread Analysis for 5 Lags with Actual & Predicted Values
Fig 10. Weka Analysis with Spread Analysis for 5 Lags with Evaluation Summary
Fig 11. Weka Analysis with Spread Analysis for 10 Lags
Fig 12. Weka Analysis with Spread Analysis for 10 Lags with Actual & Predicted Values
Fig 13. Weka Analysis with Spread Analysis for 10 Lags with Evaluation Summary

Table 1. Worst Drawdowns for Co-integrated Pairs Portfolio
Table 2. Co-integrated Portfolio Stats
Table 3. Summary Statistics for 5 Lagged Variables


1. INTRODUCTION

The risk in an asset's return has two main components: systematic or market risk, summarized by what market parlance calls the ‘beta’ of the asset, and unsystematic (asset-specific) risk. Beta measures how much an asset moves in relation to the market. It may therefore be estimated as the slope of the regression line of asset returns on market returns.
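As a minimal sketch of this estimate, the R snippet below regresses simulated asset returns on simulated market returns and reads beta off the slope. The return series, the "true" beta of 1.3 and all variable names are illustrative assumptions, not data from this project.

```r
# Illustrative only: simulated returns, not data from this report
set.seed(42)
n <- 1000
market_ret <- rnorm(n, mean = 0.0004, sd = 0.01)              # hypothetical market returns
true_beta  <- 1.3                                             # assumed for the simulation
asset_ret  <- true_beta * market_ret + rnorm(n, sd = 0.005)   # systematic + asset-specific part

fit      <- lm(asset_ret ~ market_ret)                        # regress asset on market
beta_hat <- unname(coef(fit)["market_ret"])                   # slope = estimated beta

# Equivalent closed form: cov(asset, market) / var(market)
beta_cov <- cov(asset_ret, market_ret) / var(market_ret)
cat("Estimated beta:", round(beta_hat, 3), "\n")
```

The regression slope and the covariance ratio are the same number, which is why beta is often quoted either way.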

A market neutral strategy is one whose returns are largely uncorrelated with market returns. Whether the market moves up or down, in good times and bad, a market neutral strategy is designed to perform steadily.

The strategy gained prominence in the mid-1980s, when the Morgan Stanley quant Nunzio Tartaglia developed the idea of statistical pairs trading. The idea has since become widely used because of its low sector-specific and market risk, and hedge funds now deploy many different kinds of pairs strategies.

Pairs trading is a market neutral strategy in its most primitive form. The market neutral portfolio is constructed from just two securities: a long position in one and a short position in the other, in a predetermined ratio. At any given time, the portfolio is associated with a quantity called the spread, which is computed from the quoted prices of the two securities and forms a time series. The spread is related to the residual component of the return. Pairs trading involves putting on positions when the spread is substantially away from its mean value, in the expectation that the spread will revert to its mean. The positions are then unwound upon convergence.

2. CENTRAL THEME AND MOTIVATION

From a valuation point of view, the general theme of investing is to sell overvalued securities and buy undervalued ones. However, a security can be judged overvalued or undervalued only if its true value is known in absolute terms, and that value is very hard to determine.

Pairs trading attempts to resolve this problem through relative pricing: if two securities have similar characteristics, their prices must be more or less the same in relative terms. If the prices differ, then one security is overpriced, the other is underpriced, or the mispricing is a combination of both.

Pairs trading involves selling the higher-priced security and buying the lower-priced security, with the idea that the mispricing will correct itself in the future. The mutual mispricing between the two securities is captured by the notion of the spread: the greater the spread, the larger the mispricing and the greater the profit potential. The long-short position in the two securities is constructed so that it has negligible beta and therefore minimal exposure to the market. Hence the returns from the trade are largely uncorrelated with market returns, which is a typical feature of market neutral strategies.

Suppose we have the prices of both securities at the current time. If the two securities are priced consistently relative to each other, their returns are expected to be the same over all time frames. In other words, the increment to the logarithm of the price from the current time must be about the same for both securities at every future time. This means that the time series of the logarithms of the two prices must move together, and the spread is therefore calculated from the difference in the logarithms of the prices. What is required of a pair, then, is that the price series or log price series of the two assets move together. Correlation is one statistical idea that might capture such co-movement, but correlation is time-dependent: it can vary significantly from one period to another, so it is of little use here. That leads us to a statistical property known as co-integration.

Why Co-integration: Judging a regression between two price series by its R-squared statistic can give misleading results, because time series with trends tend to produce what has come to be known as ‘spurious regression’: a high R-squared between series that are in fact unrelated. A more reliable criterion is needed, and co-integration provides it.
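The effect is easy to reproduce. The sketch below, built entirely on simulated data with hypothetical variable names, repeatedly regresses one random walk on another, independent random walk. The R-squared in levels is often sizeable even though the series are unrelated, while the same regression on the differenced series shows essentially none.

```r
# Spurious regression demo on simulated, independent random walks
set.seed(1)
n_obs  <- 500
n_sims <- 200
r2_levels <- numeric(n_sims)
r2_diffs  <- numeric(n_sims)

for (s in seq_len(n_sims)) {
  x <- cumsum(rnorm(n_obs))                                  # random walk 1
  y <- cumsum(rnorm(n_obs))                                  # independent random walk 2
  r2_levels[s] <- summary(lm(y ~ x))$r.squared               # regression in levels
  r2_diffs[s]  <- summary(lm(diff(y) ~ diff(x)))$r.squared   # regression in differences
}

cat("Median R-squared, levels:     ", round(median(r2_levels), 3), "\n")
cat("Median R-squared, differences:", round(median(r2_diffs), 3), "\n")
```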

Co-integration is a statistical property of time series variables. Two series are co-integrated when the error term of the regression of one on the other is stationary. Stationarity, in simple terms, means that the mean and variance remain constant over time, which in effect makes the series mean-reverting.

More formally, if two or more series are individually integrated (in the time series sense) but some linear combination of them has a lower order of integration, then the series are said to be co-integrated. For instance, a stock market index and the price of its associated futures contract each move through time roughly as a random walk, yet the difference between them stays bounded, so the two series are co-integrated.
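Stated compactly for the two-series case used in this report (a standard textbook formulation; the symbols $x_t$, $y_t$, $\mu$ and $\varepsilon_t$ are chosen here purely for illustration):

```latex
x_t \sim I(1), \qquad y_t \sim I(1), \qquad
y_t - \gamma x_t = \mu + \varepsilon_t, \qquad \varepsilon_t \sim I(0)
```

Here $I(1)$ denotes a series that becomes stationary after differencing once, $I(0)$ denotes a stationary series, and $\gamma$ is the co-integration coefficient.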

3. PROJECT FOCUS AND OBJECTIVE

The objective of this report is to outline a pairs trading strategy for the universe of sectoral ETFs, with co-integration as the principal statistical idea. The main aim is to assess the sectoral ETFs, bucket them into sectors using an existing industry-wide classification, and set out trading strategies for ETF pairs within each sector, depending on whether the pair is co-integrated.

We also use a machine learning algorithm to train on the data and predict the ETF spread, using the co-integrated Natural Resources pair CRBQ-GRES as an example.

4. CURRENT WORK

A. Developed a statistical model for pairs trading using co-integration, back-tested over a period of 5 years.

B. Back-tested the entire universe of sectoral ETFs to arrive at the optimal portfolio of ETF pairs within the same sector.

C. Chose one co-integrated pair from the Natural Resources industry, CRBQ-GRES, to show visualizations of spreads, equity curves, scatterplots, etc.

D. Determined the optimal buy and sell threshold levels through P&L optimization.

E. Used the machine learning tool Weka to train on part of the data and predict the future spread with supervised learning methods.

5. METHODOLOGY & TECHNOLOGY

In a nutshell, the methodology and technology used to achieve our objective were as follows:

R for the code of the statistical model

fUnitRoots and tseries packages for testing the co-integration property between ETFs

quantmod and PerformanceAnalytics packages for portfolio statistics

R to generate the visualizations using ggplot and quantmod

Weka, a machine learning tool, for ETF spread prediction; a classifier model called Multi-Layer Perceptron was trained on the data and used to predict the spread

The Detailed Methodology is as follows:

Defining the Co-Integration Model

A. Two ETFs A and B are co-integrated, with corresponding non-stationary time series $\log P_t^A$ and $\log P_t^B$ respectively.

B. We have two equations equating the return of each ETF in the current time period to the scaled difference of log prices in the previous period. We can write

$$\log P_t^A - \log P_{t-1}^A = \alpha_A \left[ \log P_{t-1}^A - \gamma \log P_{t-1}^B \right] + \varepsilon_A$$

$$\log P_t^B - \log P_{t-1}^B = \alpha_B \left[ \log P_{t-1}^A - \gamma \log P_{t-1}^B \right] + \varepsilon_B$$

where $\gamma$ is the co-integration coefficient, $\alpha_A$ and $\alpha_B$ are the error-correction rates, and $\varepsilon_A$ and $\varepsilon_B$ are the error terms. The scaled difference of log prices, $\log P_t^A - \gamma \log P_t^B$, is termed the spread in our model.

C. Consider a portfolio that is long one share of ETF A and short $\gamma$ shares of ETF B. The return of the portfolio over the period from $t$ to $t+i$ is given as:

$$\left[ \log P_{t+i}^A - \log P_t^A \right] - \gamma \left[ \log P_{t+i}^B - \log P_t^B \right]$$

$$= \left[ \log P_{t+i}^A - \gamma \log P_{t+i}^B \right] - \left[ \log P_t^A - \gamma \log P_t^B \right] = \text{Spread}_{t+i} - \text{Spread}_t$$

D. Consider the trading strategy in which trades are put on and unwound on a deviation of $\Delta$ in either direction from the spread mean. Buy the portfolio (long ETF A and short ETF B) when the current spread is $\Delta$ below the mean. Similarly, sell the portfolio (short ETF A and long ETF B) when the current spread is $\Delta$ above the mean.
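To make the model concrete, the following sketch simulates two log price series from the error-correction equations in step B and checks the identity in step C numerically. All parameter values (the co-integration coefficient 0.8, the error-correction rates, the noise level and the starting price) are assumptions chosen only for illustration.

```r
# Simulate the error-correction model with assumed parameters
set.seed(7)
n_obs   <- 1000
gamma   <- 0.8     # co-integration coefficient (assumed)
alpha_A <- -0.10   # error-correction rate of A (assumed)
alpha_B <-  0.10   # error-correction rate of B (assumed)

log_A <- numeric(n_obs)
log_B <- numeric(n_obs)
log_A[1] <- log(50)
log_B[1] <- log(50) / gamma          # start with a spread of zero

for (k in 2:n_obs) {
  spread_prev <- log_A[k - 1] - gamma * log_B[k - 1]
  log_A[k] <- log_A[k - 1] + alpha_A * spread_prev + rnorm(1, sd = 0.01)
  log_B[k] <- log_B[k - 1] + alpha_B * spread_prev + rnorm(1, sd = 0.01)
}
spread <- log_A - gamma * log_B

# Return of "long 1 share of A, short gamma shares of B" held from t0 to t0 + i
t0 <- 100
i  <- 20
portfolio_ret <- (log_A[t0 + i] - log_A[t0]) - gamma * (log_B[t0 + i] - log_B[t0])
spread_change <- spread[t0 + i] - spread[t0]
cat("Portfolio return:", portfolio_ret, " Spread change:", spread_change, "\n")
```

With these signs for the error-correction rates, the simulated spread behaves like a stationary autoregressive series even though each log price wanders like a random walk.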

Road Map for Strategy Design & Implementation

A. Data is downloaded directly from Yahoo Finance using R code.

B. ETF pairs from the same sector are tested for co-integration using the Augmented Dickey-Fuller (ADF) test. This involves determining the co-integration coefficient and examining the spread time series to ensure that it is stationary and mean-reverting.

C. The co-integration coefficient is obtained by regressing the log price series of one ETF on that of the other; the regression coefficient is also known as the hedge ratio.

D. If the p-value obtained from the ADF test on the spread is less than or equal to 0.01, we conclude that the spread is stationary.

E. The entire universe of ETF pairs is run through the code to determine the co-integrated pairs.

F. The data is then used to train the value of delta that optimizes the profit function. Delta is the optimal threshold at which the pair is bought or sold so as to maximize profit.

G. Visualizations are generated for the co-integrated ETF pair from Natural Resources, CRBQ-GRES.

H. ETF spread prediction for the pair is implemented through supervised machine learning, using the classifier algorithm ‘Multi-Layer Perceptron’ in the tool Weka.
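Steps B to D can be sketched in base R as follows. This is a simplified illustration on simulated data, not the project code: it computes a plain Dickey-Fuller t-statistic by hand rather than calling the ADF routines of the tseries or fUnitRoots packages, and the cut-off `crit_5pct` is an assumed approximate 5% critical value for a two-series residual-based test, used here only as a rough screen.

```r
# Simulated co-integrated pair (all numbers are assumptions for illustration)
set.seed(123)
n_obs  <- 1200
common <- cumsum(rnorm(n_obs, sd = 0.01))                                # shared random-walk trend
noise  <- as.numeric(arima.sim(list(ar = 0.5), n = n_obs, sd = 0.005))   # stationary deviation
log_b  <- 3 + common
log_a  <- 0.9 * log_b + noise                                            # true hedge ratio 0.9

# Step C: hedge ratio from a regression of log prices through the origin
hedge_ratio <- unname(coef(lm(log_a ~ log_b + 0))[1])
spread      <- log_a - hedge_ratio * log_b

# Steps B and D: Dickey-Fuller regression on the demeaned spread,
# diff(e_t) = rho * e_{t-1} + error; a strongly negative t-statistic
# on rho points to a stationary, mean-reverting spread
e       <- spread - mean(spread)
de      <- diff(e)
e_lag   <- e[-length(e)]
df_fit  <- lm(de ~ e_lag + 0)
df_stat <- summary(df_fit)$coefficients["e_lag", "t value"]

crit_5pct <- -3.34   # assumed approximate critical value, see lead-in
cat("Hedge ratio:", round(hedge_ratio, 4), " DF statistic:", round(df_stat, 2), "\n")
cat("Spread looks stationary:", df_stat < crit_5pct, "\n")
```

In the actual workflow the decision is taken from the p-value reported by the ADF test, as described in step D.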

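For the strategy in step D of the co-integration model, a position bought at time $t$ and unwound at time $t+i$ satisfies

$$\left[ \log P_t^A - \gamma \log P_t^B \right] = \mu - \Delta \quad \text{and} \quad \left[ \log P_{t+i}^A - \gamma \log P_{t+i}^B \right] = \mu + \Delta$$

where $\mu$ is the mean of the spread, so that the return on the round trip is $2\Delta$.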

We present visualizations for the pair from Natural Resources sector CRBQ - GRES

Fig 1. Spread Chart for the Pair CRBQ-GRES

The figure above depicts the spread for our chosen co-integrated pair (in magenta). The red upper line and the green lower line are the thresholds for selling and buying the portfolio pair respectively. Each threshold is one standard deviation away from the mean.
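The thresholds described above translate directly into trading signals. The sketch below applies them to a simulated mean-reverting series standing in for the spread; the series, the seed and the variable names are assumptions for illustration only.

```r
# Simulated stand-in for the spread (not the CRBQ-GRES data)
set.seed(2024)
spread <- as.numeric(arima.sim(list(ar = 0.9), n = 500, sd = 0.01))

delta <- 1                                     # threshold in standard deviations
lower <- mean(spread) - delta * sd(spread)     # buy level (green line)
upper <- mean(spread) + delta * sd(spread)     # sell level (red line)

# +1 = buy the spread, -1 = sell the spread, 0 = no signal
signal <- ifelse(spread < lower, 1, ifelse(spread > upper, -1, 0))

# Act on the next period: the position held over period k uses the signal from k - 1
position <- c(0, head(signal, -1))
pnl      <- position * c(0, diff(spread))      # P&L in spread units, before costs
print(table(signal))
```

Shifting the signal by one period avoids trading on information that is only known at the end of the period.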


Fig 2. Scatterplot of Log Prices : CRBQ-GRES

Fig 3. Equity Curve for Co-integrated Portfolio

The co-integrated portfolio grew from 100 to 144 over the back-tested period of 5 years, which is roughly an 8% compounded annual return, with a maximum drawdown of only 1.5%. The Sharpe ratio calculated for the portfolio was 1.51.
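As a quick check of the annualized figure, the compounded annual growth rate implied by growth from 100 to 144 over 5 years can be computed as below. The exact 5-year horizon is taken from the text; the variable names are ours. The result is about 7.6% per year, which rounds to the 8% quoted.

```r
start_equity <- 100
end_equity   <- 144
years        <- 5    # length of the back-test as stated in the text

# Compounded annual growth rate: (end / start)^(1 / years) - 1
cagr <- (end_equity / start_equity)^(1 / years) - 1
cat(sprintf("Compounded annual return: %.2f%%\n", 100 * cagr))
```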

Fig 4. Equity Curve for CRBQ-GRES


6. TRAINING DATA

We buy (go long) the spread whenever it has a value less than or equal to -Δ. Similarly, we sell the spread when we observe a value greater than or equal to Δ. Here the spread is measured as its deviation from the mean in units of its standard deviation and is modeled as Gaussian white noise. The probability that such a process at any time instant deviates by an amount greater than or equal to Δ (Δ being positive) is given by the integral of the Gaussian density, which is 1 - N(Δ), where N is the standard normal cumulative distribution function. Over T time steps we therefore expect the spread to be greater than or equal to Δ on T(1 - N(Δ)) occasions. Similarly, the probability of the value being less than or equal to -Δ is N(-Δ). Owing to the symmetry of the Gaussian density, N(-Δ) = 1 - N(Δ), so the number of instances in which we expect the spread to be less than or equal to -Δ is also T(1 - N(Δ)). The profit on each buy and sell is 2Δ. A measure of profitability for trading over the time period T is therefore (profit per trade × number of trades), that is, 2TΔ(1 - N(Δ)). The problem of band design thus boils down to determining the value of Δ that maximizes Δ(1 - N(Δ)).
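Under the Gaussian assumption this maximization can be solved numerically in one line. The sketch below uses base R's `optimize` and `pnorm`; the search interval of 0 to 3 standard deviations is our assumption. It gives an optimum of about 0.75 standard deviations.

```r
# Profit measure up to the constant factor 2T: delta * (1 - N(delta))
profit_rate <- function(delta) delta * (1 - pnorm(delta))

# Search for the maximizing threshold on an assumed interval of 0 to 3 sigma
opt        <- optimize(profit_rate, interval = c(0, 3), maximum = TRUE)
delta_star <- opt$maximum
cat(sprintf("Optimal delta: %.3f standard deviations\n", delta_star))
```

This is the theoretical optimum for white noise; the empirical search over Delta on the actual spread, described in the road map, need not coincide with it.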

[Figure: cumulative profit (Cum. Profit, y-axis from 90 to 230) plotted against Delta (x-axis from 0.2 to 2.4)]

Fig 5. Cumulative Profit at various levels of Delta

7. PERFORMANCE ANALYSIS FOR THE COMPLETE CO-INTEGRATED PORTFOLIO

The ‘PerformanceAnalytics’ package was used to analyze the statistical performance of the whole portfolio. The portfolio consists of 176 pairs, whose worst drawdowns are shown in Table 1.

CO-INTEGRATED PAIRS PORTFOLIO (DD = worst drawdown of each pair)

Pair   DD   Pair   DD   Pair   DD   Pair   DD   Pair   DD   Pair   DD   Pair   DD   Pair   DD
X1     9%   X23   11%   X45    6%   X67    7%   X89    4%   X111  10%   X133   3%   X155   8%
X2    13%   X24    6%   X46    0%   X68    7%   X90    7%   X112   8%   X134   5%   X156   4%
X3     8%   X25    5%   X47    3%   X69    6%   X91    4%   X113   5%   X135   3%   X157   7%
X4     3%   X26    6%   X48    5%   X70    8%   X92    7%   X114   3%   X136   5%   X158   9%
X5     6%   X27    4%   X49    4%   X71    6%   X93    6%   X115   6%   X137   1%   X159   6%
X6     8%   X28    1%   X50   15%   X72    3%   X94    6%   X116   9%   X138   3%   X160   7%
X7     5%   X29    4%   X51   10%   X73    8%   X95    7%   X117   9%   X139   8%   X161   3%
X8     3%   X30    6%   X52    6%   X74    9%   X96    9%   X118   3%   X140   4%   X162   4%
X9     3%   X31   15%   X53    8%   X75    7%   X97    9%   X119   8%   X141   3%   X163  10%
X10    9%   X32    5%   X54    8%   X76    7%   X98   10%   X120   4%   X142   3%   X164   7%
X11    2%   X33    6%   X55    9%   X77    6%   X99   10%   X121   4%   X143   3%   X165   3%
X12    6%   X34    6%   X56    7%   X78    5%   X100   9%   X122   4%   X144   6%   X166   3%
X13    8%   X35    6%   X57    5%   X79    5%   X101   6%   X123   6%   X145   7%   X167   2%
X14    6%   X36   11%   X58   10%   X80    8%   X102   9%   X124   6%   X146   8%   X168   6%
X15   10%   X37    7%   X59   34%   X81    8%   X103   7%   X125   3%   X147   4%   X169   4%
X16    5%   X38    2%   X60    6%   X82    9%   X104   8%   X126   6%   X148   8%   X170   3%
X17    5%   X39    2%   X61    5%   X83    8%   X105  12%   X127   3%   X149   5%   X171   7%
X18    8%   X40    5%   X62    5%   X84    9%   X106  10%   X128   5%   X150   4%   X172   5%
X19    4%   X41    2%   X63    8%   X85    5%   X107   4%   X129   4%   X151  10%   X173   5%
X20    4%   X42    6%   X64    8%   X86    3%   X108   4%   X130   2%   X152   9%   X174   4%
X21    9%   X43    7%   X65    6%   X87    4%   X109   8%   X131   3%   X153   8%   X175   6%
X22    3%   X44    7%   X66    9%   X88    7%   X110   4%   X132   6%   X154   3%   X176   7%

Average 2%

Table 1. Worst Drawdowns for Co-integrated Pairs Portfolio
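For reference, the worst drawdown of a return series can be computed in a few lines of base R. The function below is a simplified sketch of the quantity that drawdown routines such as those in PerformanceAnalytics report; the function name and the example returns are made up for illustration.

```r
# Largest peak-to-trough loss of an equity curve built from periodic returns
max_drawdown <- function(returns) {
  equity <- c(1, cumprod(1 + returns))   # growth of one unit, starting at 1
  peak   <- cummax(equity)               # running high-water mark
  max(1 - equity / peak)                 # worst fractional loss from a peak
}

example_returns <- c(0.10, -0.05, -0.10, 0.20, -0.02)   # made-up returns
worst_dd <- max_drawdown(example_returns)
cat(sprintf("Worst drawdown: %.1f%%\n", 100 * worst_dd))
```

In this example the equity peaks at 1.10 and falls to 0.9405 before recovering, a drawdown of 14.5%.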


Particulars         Portfolio
Observations        1206
NAs                 0
Minimum             -0.0114
Quartile 1          0
Median              0.0003
Arithmetic Mean     0.0019
Geometric Mean      0.0019
Quartile 3          0
Maximum             0.2203
SE                  0.0003
LCL Mean            0
UCL Mean            0.0006
Variance            0.0001
Stdev               0.0114
Skewness            9.869
Kurtosis            134.0417

Table 2. Co-integrated Portfolio Stats

8. ETF SPREAD PREDICTION THROUGH SUPERVISED MACHINE LEARNING

A machine learning tool called Weka has been used for spread prediction. The objective is to show an application of supervised machine learning to the ETF spread.

The same Natural Resources pair, CRBQ-GRES, is used as the example. Two sample data sets are used: one with 5 lagged (embedded dimension) variables of the spread and the other with 10 lagged variables. The lagged values serve as inputs, so that the algorithm is trained on recurring patterns in the data, which is the premise on which machine learning is based. Without such recurring patterns, the algorithm cannot be trained effectively to minimize the error between the actual and the predicted values.

A classifier algorithm called Multi-Layer Perceptron is used to train on the data and develop the prediction model.
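A lagged-variable data set of this kind can be built in R with the base function `embed`. The sketch below uses a simulated series in place of the actual spread, and the column names and the output file name in the commented line are hypothetical.

```r
# Simulated stand-in for the spread series
set.seed(11)
spread <- as.numeric(arima.sim(list(ar = 0.9), n = 200, sd = 0.01))

n_lags <- 5
# embed() puts the current value in column 1 and lags 1..n_lags in the following columns
lagged <- embed(spread, n_lags + 1)
colnames(lagged) <- c("spread", paste0("lag", 1:n_lags))
train_data <- as.data.frame(lagged)

# Hypothetical export for loading into Weka:
# write.csv(train_data, "spread_lags5.csv", row.names = FALSE)
print(head(train_data, 3))
```

Each row then holds the value to be predicted together with its 5 preceding values; setting `n_lags` to 10 gives the second data set.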

Weka Analysis for the CRBQ – GRES Pair with 5 lagged variables

[Figure: actual spread (SPREAD) and predicted spread (Predicted) for 5 lagged variables; y-axis from -0.03 to 0.05]

Fig 6. Predicted Spread with Lagged Variables - 5

The figure above shows the actual spread versus the predicted spread for the CRBQ-GRES pair with 5 lagged (embedded dimension) variables. This first case, with 5 lags, gives better results than the second case, which uses 10 lagged variables.

The correlation coefficient obtained is 0.9347 (93.47%), which is very good; however, the most striking feature is the root mean squared error, which is only 0.0041 (0.41%). This result is obtained by training on 66% of the ETF spread data and evaluating on the remaining test split of 412 instances. The relative absolute error and the root relative squared error are 22.64% and 27.12% respectively.

=== Evaluation on test split ===
=== Summary ===

Lagged Variables                5
Correlation coefficient         0.9347
Mean absolute error             0.0028
Root mean squared error         0.0041
Relative absolute error         22.64 %
Root relative squared error     27.12 %
Total Number of Instances       412
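The measures in this summary can be computed from a vector of actual values and a vector of predictions as sketched below. The function name and the sample numbers are made up, and the relative errors are taken against a naive predictor equal to the mean of the actual values, which is an assumption; Weka's own baseline may be derived from the training data, so its figures can differ slightly.

```r
# Regression evaluation measures for actual vs predicted values
regression_summary <- function(actual, predicted) {
  err      <- predicted - actual
  baseline <- actual - mean(actual)    # assumption: naive predictor = mean of actual values
  c(correlation = cor(actual, predicted),
    mae         = mean(abs(err)),                               # mean absolute error
    rmse        = sqrt(mean(err^2)),                            # root mean squared error
    rae_pct     = 100 * sum(abs(err)) / sum(abs(baseline)),     # relative absolute error
    rrse_pct    = 100 * sqrt(sum(err^2) / sum(baseline^2)))     # root relative squared error
}

actual    <- c(0.010, -0.020, 0.030, 0.000, -0.010)   # made-up spread values
predicted <- c(0.012, -0.015, 0.025, 0.002, -0.012)   # made-up predictions
metrics   <- regression_summary(actual, predicted)
print(round(metrics, 4))
```

The relative measures compare the model with the naive predictor, so values well below 100% indicate that the lagged inputs carry real predictive information.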



Weka Analysis for the CRBQ – GRES Pair with 10 lagged variables

[Figure: actual spread (SPREAD) and predicted spread (Predicted) for 10 lagged variables; y-axis from -0.03 to 0.05]


The figure above shows the actual spread versus the predicted spread for the CRBQ-GRES pair with 10 lagged (embedded dimension) variables. As noted earlier, the first case with 5 lags gives better results than this second case with 10 lags.

The correlation coefficient obtained is 0.9328 (93.28%), which is very good; however, the important parameter to look at is the root mean squared error, which is 0.0053 (0.53%), slightly worse than in the first case with 5 lags.

This is an interesting result: in spite of training with a higher number of lagged variables, and in effect more recurring data, which should have improved the results, the results deteriorate slightly. As in the first case, this result is obtained by training on 66% of the ETF spread data and evaluating on the remaining test split of 412 instances. The relative absolute error and the root relative squared error are 32.89% and 35.32% respectively, which are significantly worse than those with 5 lags.

=== Evaluation on test split ===
=== Summary ===

Lagged Variables                10
Correlation coefficient         0.9328
Mean absolute error             0.0041
Root mean squared error         0.0053
Relative absolute error         32.89 %
Root relative squared error     35.32 %
Total Number of Instances       412


APPENDIX 1

A. WEKA ANALYSIS WITH 5 LAGGED OR EMBEDDED DIMENSION VARIABLES


Fig 9. Weka Analysis with Spread Analysis for 5 Lags with Actual & Predicted Values


Fig 10. Weka Analysis with Spread Analysis for 5 Lags with Evaluation Summary

B. WEKA ANALYSIS WITH 10 LAGGED OR EMBEDDED DIMENSION VARIABLES

Fig 11. Weka with Spread Analysis for 10 Lags


Fig 12. Weka with Spread Analysis for 10 Lags - Actual & Predicted Values

Fig 13. Weka with Spread Analysis for 10 Lags - Evaluation Summary


APPENDIX 2: COMPLETE R CODE


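# Install the required packages once, then load them
# install.packages(c("zoo", "tseries", "quantmod", "PerformanceAnalytics", "fUnitRoots"))
library(zoo)
library(tseries)
library(quantmod)
library(PerformanceAnalytics)
library(fUnitRoots)

# Download daily prices for the two ETFs. The ichart.finance.yahoo.com CSV
# endpoint used originally has been retired, so quantmod's getSymbols() is used
# instead. Tickers and date range follow the original request (months in the old
# URL were zero-based); set the tickers to "CRBQ" and "GRES" to reproduce the
# charts titled CRBQ-GRES below. Delisted tickers may no longer be served.
a.raw <- getSymbols("IPN", src = "yahoo", from = "2011-02-01", to = "2015-11-30", auto.assign = FALSE)
b.raw <- getSymbols("VIS", src = "yahoo", from = "2011-02-01", to = "2015-11-30", auto.assign = FALSE)

# Keep the adjusted close (column 7 of the old CSV) as zoo series indexed by date
a <- zoo(as.numeric(Ad(a.raw)), index(a.raw))
b <- zoo(as.numeric(Ad(b.raw)), index(b.raw))

# Align the two series on their common dates
t.zoo <- merge(a, b, all = FALSE)
t <- as.data.frame(t.zoo)
cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")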

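# Estimate the co-integration coefficient (hedge ratio) by regressing the log
# prices through the origin in both directions, and keep the larger coefficient
m <- lm(log(a) ~ log(b) + 0, data = t)
beta1 <- coef(m)[1]
n <- lm(log(b) ~ log(a) + 0, data = t)
beta2 <- coef(n)[1]
beta <- max(beta1, beta2)
cat("Assumed Co-integration Coefficient is", beta, "\n")

# Spread = dependent log price minus beta times the other log price
if (beta1 >= beta2) {
  sprd <- log(t$a) - beta * log(t$b)
} else {
  sprd <- log(t$b) - beta * log(t$a)
}

# ADF tests on the spread, using tseries::adf.test and fUnitRoots::adfTest
ht <- adf.test(sprd, alternative = "stationary", k = 0)
ht.unitroots <- adfTest(sprd, lags = 0, type = "nc")
cat("ADF p-value is", ht$p.value, "\n")

# 10% significance level here; the written methodology applies the stricter 0.01 cut-off
if (ht$p.value < 0.10) {
  cat("The spread is likely mean-reverting\n")
} else {
  cat("The spread is not mean-reverting.\n")
}
ht.unitroots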

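# Trading thresholds: Delta standard deviations either side of the spread mean
Delta <- 1
d <- mean(sprd) - Delta * sd(sprd)   # lower (buy) threshold
u <- mean(sprd) + Delta * sd(sprd)   # upper (sell) threshold

scatter.smooth(log(t$a), log(t$b), xlab = "log A", ylab = "log B", col = "darkred", lwd = 2,
               main = "Scatterplot of Log Prices")
plot(sprd, col = "violet", type = "l", lwd = 2, xlab = "Period", main = "Spread Chart for the Pair: CRBQ-GRES")
abline(h = u, col = "red", lwd = 3)
abline(h = d, col = "green", lwd = 3)

# Signals: +1 (long the spread) below d, -1 (short the spread) above u.
# Each signal is applied to the next period. stats::lag() does not shift a
# plain vector, so the shift is done explicitly.
sigup <- c(0, head(ifelse(sprd < d, 1, 0), -1))
sigdn <- c(0, head(ifelse(sprd > u, -1, 0), -1))
sig <- sigup + sigdn

# Per-period return of the position, net of slippage for each period in the market
slippage <- 0.0010
ret <- c(0, diff(sprd, lag = 1)) * sig - slippage * abs(sig)
ret[1] <- 0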


# Equity curves for the long trades, the short trades and the combined strategy
initial_equity <- 100
eq_up <- initial_equity * cumprod(1 + ret * sigup)
eq_dn <- initial_equity * cumprod(1 + ret * sigdn * -1)
eq_all <- initial_equity * cumprod(1 + ret)

plot(zoo(cbind(Long = eq_up, Short = eq_dn)), xlab = "Period", col = c("green", "red"), lwd = 3,
     main = "Long & Short Equity curves: \n Co-integration based Pair Trading Strategy")
plot(zoo(eq_all), xlab = "Period", ylab = "Equity", col = "blue", lwd = 3,
     main = "Cum. returns with Initial Equity of 100: \n Co-integration based Pair-Strategy for CRBQ-GRES")

# Training the data: search for the Delta that maximizes final equity #
k.min <- min(abs(sprd - mean(sprd)) / sd(sprd))
k.max <- max(abs(sprd - mean(sprd)) / sd(sprd))
delta_train <- seq(k.min, k.max, length.out = length(sprd))

# Final equity of the strategy for one candidate threshold
training_delta <- function(delta) {
  d_train <- mean(sprd) - delta * sd(sprd)
  u_train <- mean(sprd) + delta * sd(sprd)
  sigup_train <- ifelse(sprd < d_train, 1, 0)
  sigdn_train <- ifelse(sprd > u_train, -1, 0)
  sig_train <- c(0, head(sigup_train + sigdn_train, -1))   # trade on the next period
  ret_train <- c(0, diff(sprd, lag = 1)) * sig_train
  initial_equity_train <- 100
  fin_equity <- initial_equity_train * cumprod(1 + ret_train)
  tail(fin_equity, 1)
}

# Evaluate every candidate Delta and pick the most profitable one
profit1 <- sapply(delta_train, training_delta)
optidelta <- delta_train[which.max(profit1)]
cat("Optimal Delta is", optidelta, "\n")

plot(delta_train, profit1, xlab = "Delta", ylab = "profit", col = "orange", lwd = 3,
     main = "Profit at various levels of Delta", type = "l")
abline(v = optidelta, lty = 2)

# Portfolio-level statistics from the saved return files
Portfolio <- read.zoo("Portfolio.csv", sep = ",", header = TRUE)
Portfolio_Copy <- read.zoo("Portfolio_Copy.csv", sep = ",", header = TRUE)
table.Drawdowns(Portfolio_Copy, top = 10)
maxDrawdown(Portfolio_Copy)
table.DownsideRisk(Portfolio)
Worstdrawdown <- maxDrawdown(Portfolio)   # computed on the zoo object, not on the data frame read below

# Equity curve of the co-integrated portfolio
Portfolio.df <- read.csv("Portfolio.csv", header = TRUE)
portfolio <- Portfolio.df$Portfolio
plot(portfolio, xlab = "Period", ylab = "P&L", col = "firebrick", lwd = 5,
     main = "Equity Curve for Cointegrated Portfolio", type = "l")

