STATISTICAL ARBITRAGE PAIRS FOR THE UNIVERSE OF SECTORAL ETFs USING COINTEGRATION
MANOJ SHENOY
ZENGHUI LIU
YANGXI LENG
TABLE OF CONTENTS
1. Introduction
2. Central Theme and Motivation
3. Project Focus and Objective
4. Current Work
5. Methodology and Technology
6. Training Data
7. Performance Analytics for Complete Co-Integrated Portfolio
8. ETF Spread Prediction through Supervised Machine Learning
APPENDIX 1
A. WEKA Analysis with 5 Lagged Variables
B. WEKA Analysis with 10 Lagged Variables
APPENDIX 2: Complete R Code
LIST OF FIGURES & TABLES
Fig 1. Spread Chart for the Pair CRBQ-GRES
Fig 2. Scatterplot of Log Prices : CRBQ-GRES
Fig 3. Equity Curve for Co-integrated Portfolio
Fig 4. Equity Curve for CRBQ-GRES
Fig 5. Cumulative Profit at various levels of Delta
Fig 6. Predicted Spread with Lagged Variables - 5
Fig 7. Predicted Spread with Lagged Variables - 10
Fig 8. Weka Analysis with Spread Analysis for 5 Lags
Fig 9. Weka Analysis with Spread Analysis for 5 Lags with Actual & Predicted Values
Fig 10. Weka Analysis with Spread Analysis for 5 Lags with Evaluation Summary
Fig 11. Weka Analysis with Spread Analysis for 10 Lags
Fig 12. Weka Analysis with Spread Analysis for 10 Lags with Actual & Predicted Values
Fig 13. Weka Analysis with Spread Analysis for 10 Lags with Evaluation Summary
Table 1. Worst Drawdowns for Co-integrated Pairs Portfolio
Table 2. Co-integrated Portfolio Stats
Table 3. Summary Statistics for 5 Lagged Variables
Table 4. Summary Statistics for 10 Lagged Variables
1. INTRODUCTION
The risk of an asset's return has two main components: systematic or market risk, commonly referred to in market parlance as the 'beta' of the asset, and unsystematic risk. Beta measures how much an asset moves in relation to the market. Beta may therefore be estimated as the slope of the regression line of asset returns on market returns.
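As a minimal sketch of this estimate, the R snippet below simulates market and asset returns and recovers beta as the regression slope. The return series and the true beta of 1.3 are made-up assumptions for illustration, not data from this project.

```r
# Illustrative only: simulated returns, not data from this project.
set.seed(42)
n <- 1000
market_ret <- rnorm(n, mean = 0.0004, sd = 0.01)       # hypothetical market returns
asset_ret  <- 1.3 * market_ret + rnorm(n, sd = 0.005)  # asset with an assumed true beta of 1.3

# Beta is the slope of the regression of asset returns on market returns.
fit <- lm(asset_ret ~ market_ret)
beta_hat <- unname(coef(fit)["market_ret"])

# The same slope written as covariance over variance.
beta_cov <- cov(asset_ret, market_ret) / var(market_ret)

cat("Estimated beta:", round(beta_hat, 3), "\n")
```

The residual of this regression is the unsystematic part of the return, the component that is not explained by the market.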
A market neutral strategy is one whose returns are largely uncorrelated with market returns. Whether the market moves up or down, in good times and bad, a market neutral strategy is expected to perform in a steady manner.
In the mid-1980s, the strategy gained prominence when Morgan Stanley quant Nunzio Tartaglia developed the idea of statistical pairs trading. The idea has since gained great prominence owing to its low sector-specific and market risk, with hedge funds deploying a multitude of different pair strategies.
Pair trading is a market neutral strategy in its most primitive form. The market neutral portfolio is constructed using just two securities: a long position in one security and a short position in the other, in a predetermined ratio. At any given time, the portfolio is associated with a quantity called the spread. This quantity is computed from the quoted prices of the two securities and forms a time series. The spread is related to the residual component of the return. Pair trading involves putting on positions when the spread is substantially away from its mean value, with the expectation that the spread will revert to the mean. The positions are then unwound upon convergence.
2. CENTRAL THEME AND MOTIVATION
From a valuation point of view, the general theme for investing in the marketplace is to sell overvalued securities and buy undervalued ones. However, it is possible to determine that a security is overvalued or undervalued only if we also know its true value in absolute terms, and this is very hard to do.
Pairs trading attempts to resolve this using the idea of relative pricing: if two securities have similar characteristics, then their prices must be more or less the same in relative terms. If the prices happen to be different, it could be that one of the securities is overpriced, the other is underpriced, or the mispricing is a combination of both.
Pairs trading involves selling the higher-priced security and buying the lower-priced security, with the idea that the mispricing will correct itself in the future. The mutual mispricing between the two securities is captured by the notion of the spread. The greater the spread, the higher the magnitude of mispricing and the greater the profit potential. A long-short position in the two securities is constructed such that it has a negligible beta and therefore minimal exposure to the market. Hence, the returns from the trade are uncorrelated with market returns, which is a typical feature of market neutral strategies.
Suppose we have the prices of both securities at the current time. The return on both securities is expected to be the same over all time frames. In other words, the increment to the logarithm of the price from the current time must be about the same for both securities at all future time instances. This means that the time series of the logarithms of the two prices must move together, and the spread is therefore calculated from the difference in the logarithms of the prices. What is required of a pair is therefore that the price series, or the log price series, of the two assets move together. One statistical idea that could explain co-movement is the correlation between the two assets, but correlation is time-dependent: it can vary significantly from one time period to another, so it is of little use. That leads us to the statistical property known as co-integration.
Why Co-integration: Judging a regression between two price series by its R-squared statistic can give misleading results, because time series with trends tend to produce what has come to be known as 'spurious regression': a high R-squared between series that are in fact unrelated. To eliminate this inconsistency, we need a more consistent method. Hence the need arises for co-integration.
Co-integration is a statistical property of time series variables. Two series are co-integrated when the error term of the regression of one on the other is stationary. Stationarity, in simple terms, means that the mean and variance remain constant over time, which in effect ensures that the series is mean-reverting.
More generally, if two or more series are individually integrated (in the time series sense) but some linear combination of them has a lower order of integration, then the series are said to be co-integrated. For instance, a stock market index and the price of its associated futures contract move through time, each roughly following a random walk, yet the two stay close to each other, so a linear combination of them can be stationary.
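The definition can be illustrated with a short simulation. The sketch below builds two hypothetical log price series that share one random-walk trend, then compares a Dickey-Fuller t-statistic for one series on its own with the statistic for the regression residual. The series, the coefficients and the helper `df_stat` are assumptions made for illustration; the statistics are compared only qualitatively, since residual-based tests need their own critical values.

```r
# Illustrative only: simulated log prices, not ETF data.
set.seed(1)
n <- 1000
common <- cumsum(rnorm(n, sd = 0.01))                 # shared random-walk trend
log_a  <- 1.0 + common + rnorm(n, sd = 0.005)         # hypothetical series A
log_b  <- 0.5 + 0.8 * common + rnorm(n, sd = 0.005)   # hypothetical series B

# Dickey-Fuller t-statistic (constant, no lags):
# regress the change of the series on its lagged level.
df_stat <- function(x) {
  dx   <- diff(x)
  xlag <- x[-length(x)]
  summary(lm(dx ~ xlag))$coefficients["xlag", "t value"]
}

# Co-integrating regression; its residual is the candidate stationary combination.
fit <- lm(log_a ~ log_b)
resid_spread <- residuals(fit)

cat("DF statistic, log_a alone:", round(df_stat(log_a), 2), "\n")
cat("DF statistic, residual:   ", round(df_stat(resid_spread), 2), "\n")
```

A strongly negative statistic for the residual, next to a much weaker one for the individual series, is the pattern expected of a co-integrated pair.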
3. PROJECT FOCUS AND OBJECTIVE
The objective of this report is to outline a pair trading strategy for the universe of sectoral ETFs. The strategy uses co-integration as its principal statistical idea. The main aim is to thoroughly assess the sectoral ETFs, bucket them into sectors using an already defined industry-wide classification, and outline detailed trading strategies for the ETF pairs in every sector, based on whether or not the pair is co-integrated. Machine learning algorithms are also used in our work to train on the data and predict the ETF spread, using the co-integrated Natural Resources pair CRBQ-GRES as an example.
4. CURRENT WORK
A. Developed a statistical model for pair trading using co-integration, back-tested over a period of 5 years.
B. Back-tested the entire universe of sectoral ETFs to arrive at the optimal portfolio of ETF pairs within the same sector.
C. Chose one co-integrated pair from the Natural Resources industry, CRBQ-GRES, to show visualizations of spreads, equity curves, scatterplots, etc.
D. Determined the optimal buy and sell threshold levels based on P&L optimization.
E. Used the machine learning tool Weka to train on part of the data and predict the future spread with supervised learning methods.
5. METHODOLOGY & TECHNOLOGY
In a nutshell, the methodology and technology used to achieve our objective were as follows:
R for writing the code of the statistical model
fUnitRoots and tseries packages for determining the co-integration property between ETFs
quantmod and PerformanceAnalytics packages for portfolio statistics
R to generate the visualizations using ggplot and quantmod
Weka, a machine learning tool, for ETF spread prediction. A classifier model called Multi-Layer Perceptron is used for training on the data and predicting the spread.
The Detailed Methodology is as follows:
Defining the Co-Integration Model
A. Two ETFs A and B are co-integrated, with the corresponding non-stationary time series being $\log P_t^A$ and $\log P_t^B$ respectively.
B. Two equations relate the return of each ETF in the current time period to the scaled difference of log prices in the previous period. We can write

$$\log P_t^A - \log P_{t-1}^A = \alpha_A \left( \log P_{t-1}^A - \gamma \log P_{t-1}^B \right) + \varepsilon_A$$

$$\log P_t^B - \log P_{t-1}^B = \alpha_B \left( \log P_{t-1}^A - \gamma \log P_{t-1}^B \right) + \varepsilon_B$$

where $\gamma$ is the co-integration coefficient, $\alpha_A$ and $\alpha_B$ are the error-correction rates, and $\varepsilon_A$ and $\varepsilon_B$ are the error terms of the two error-correction equations. The scaled difference of log prices, $\log P_t^A - \gamma \log P_t^B$, is termed the spread in our model.
C. Consider a portfolio that is long one share of ETF A and short $\gamma$ shares of ETF B. The return of the portfolio over the period from $t$ to $t+i$ is given as:

$$\left[ \log P_{t+i}^A - \log P_t^A \right] - \gamma \left[ \log P_{t+i}^B - \log P_t^B \right] = \left[ \log P_{t+i}^A - \gamma \log P_{t+i}^B \right] - \left[ \log P_t^A - \gamma \log P_t^B \right] = \text{Spread}_{t+i} - \text{Spread}_t$$

D. Consider the trading strategy where trades are put on and unwound on a deviation of Δ in either direction from the spread mean. Buy the portfolio (long ETF A and short ETF B) when the current spread is Δ below the mean. Similarly, sell the portfolio (short ETF A and long ETF B) when the current spread is Δ above the mean.
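The identity in step C and the rule in step D can be checked numerically. The sketch below is a minimal illustration on simulated log prices; the value of `gamma`, the series, the holding window and the choice of Delta as one standard deviation are all assumptions, not values from this project.

```r
# Illustrative only: simulated log prices and an assumed co-integration coefficient.
set.seed(7)
n <- 250
gamma  <- 0.9                                          # assumed co-integration coefficient
log_pb <- 3 + cumsum(rnorm(n, sd = 0.01))              # hypothetical log price of ETF B
log_pa <- 0.2 + gamma * log_pb + rnorm(n, sd = 0.01)   # hypothetical log price of ETF A

# Spread: the scaled difference of log prices.
spread <- log_pa - gamma * log_pb

# Log return of "long 1 share of A, short gamma shares of B" held from t0 to t0 + i.
t0 <- 100
i  <- 20
portfolio_ret <- (log_pa[t0 + i] - log_pa[t0]) - gamma * (log_pb[t0 + i] - log_pb[t0])
spread_change <- spread[t0 + i] - spread[t0]
cat("Portfolio return:", portfolio_ret, " Spread change:", spread_change, "\n")

# Step D: +1 = buy the portfolio, -1 = sell the portfolio, 0 = no trade.
mu    <- mean(spread)
Delta <- sd(spread)                                    # hypothetical threshold
signal <- ifelse(spread <= mu - Delta, 1, ifelse(spread >= mu + Delta, -1, 0))
table(signal)
```

The two printed numbers agree, which is why the strategy can be analyzed entirely in terms of the spread series.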
Road Map for Strategy Design & Implementation
A. Data is downloaded directly from Yahoo using R code.
B. ETF pairs from the same sector are tested for co-integration using the Augmented Dickey-Fuller (ADF) test. This involves determining the co-integration coefficient and examining the spread time series to ensure that it is stationary and mean-reverting.
C. The co-integration coefficient is obtained by regressing the log price series of one ETF on the other; the regression coefficient is also known as the hedge ratio.
D. If the p-value obtained from the ADF test is less than or equal to 0.01, we conclude that the spread series is stationary.
E. The entire universe of ETF pairs is run through the code to determine the co-integrated pairs.
F. The data is then used to train the value of delta that optimizes the profit function. Delta is the optimal threshold value at which the pair is bought or sold so as to maximize the profit.
G. Visualizations are generated for the co-integrated ETF pair from Natural Resources: CRBQ-GRES.
H. ETF spread prediction for the pair is implemented through supervised machine learning, using the classifier algorithm 'Multi-Layer Perceptron' in the tool Weka.
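Steps B, C and E of the road map can be sketched as a screening loop. The example below uses four simulated series with hypothetical ticker names, and a base-R Dickey-Fuller t-statistic as a stand-in for the ADF test of the tseries package; the p-value cut-off of step D is therefore not reproduced here. The helper `test_pair` is an illustrative name, not a function from the project code.

```r
# Illustrative only: simulated log prices for four hypothetical tickers.
set.seed(11)
n <- 750
trend <- cumsum(rnorm(n, sd = 0.01))
log_prices <- cbind(
  AAA = 3.0 + trend + rnorm(n, sd = 0.004),
  BBB = 2.5 + 0.9 * trend + rnorm(n, sd = 0.004),
  CCC = 3.2 + cumsum(rnorm(n, sd = 0.01)),
  DDD = 2.8 + cumsum(rnorm(n, sd = 0.01))
)

# Steps B-C: hedge ratio from a regression of one log price on the other,
# then a Dickey-Fuller t-statistic (no lags) on the resulting spread.
test_pair <- function(x, y) {
  hedge  <- unname(coef(lm(x ~ y))[2])
  spread <- x - hedge * y
  ds     <- diff(spread)
  s_lag  <- spread[-length(spread)]
  stat   <- summary(lm(ds ~ s_lag))$coefficients["s_lag", "t value"]
  c(hedge_ratio = hedge, df_stat = stat)
}

# Step E: run every pair in the universe through the same test.
pairs <- t(combn(colnames(log_prices), 2))
results <- data.frame(
  a = pairs[, 1],
  b = pairs[, 2],
  t(apply(pairs, 1, function(p) test_pair(log_prices[, p[1]], log_prices[, p[2]])))
)

# Most negative statistic first: the strongest candidates for co-integration.
results <- results[order(results$df_stat), ]
print(results)
```

In this simulation only AAA and BBB share a trend, so that pair is expected at the top of the ranking.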
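For a portfolio bought at time $t$ and unwound at time $t+i$, the spread at entry and at exit is:

$$\log P_t^A - \gamma \log P_t^B = \mu - \Delta \quad \text{and} \quad \log P_{t+i}^A - \gamma \log P_{t+i}^B = \mu + \Delta$$

where $\mu$ is the mean of the spread.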
We present visualizations for the pair from Natural Resources sector CRBQ - GRES
Fig 1. Spread Chart for the Pair CRBQ-GRES
The figure above depicts the spread for our chosen co-integrated pair (in magenta). The red and green lines are the upper and lower thresholds for selling and buying the portfolio pair respectively. Each threshold is one standard deviation away from the mean.
Fig 2. Scatterplot of Log Prices : CRBQ-GRES
Fig 3. Equity Curve for Co-integrated Portfolio
The co-integrated portfolio grew from 100 to 144 over the back-tested period of 5 years, which is a compounded annual return of roughly 8%, with a maximum drawdown of only 1.5%. The Sharpe ratio calculated for the portfolio was 1.51.
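The compounded annual return follows directly from the start and end equity. A minimal worked example in R, using only the figures quoted above:

```r
initial_equity <- 100
final_equity   <- 144
years          <- 5

# Compound annual growth rate implied by the start and end equity.
cagr <- (final_equity / initial_equity)^(1 / years) - 1
cat(sprintf("Compounded annual return: %.2f%%\n", 100 * cagr))
```

The result is about 7.6% per year, which the report rounds to 8%.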
Fig 4. Equity Curve for CRBQ-GRES
6. TRAINING DATA
We go long the spread whenever it has a value less than or equal to −Δ, and we sell the spread whenever it has a value greater than or equal to Δ, with the spread measured as a deviation from its mean in units of its standard deviation. The probability that a Gaussian white-noise process at any time instant deviates by an amount greater than or equal to Δ (Δ being positive) is determined by the integral of the Gaussian density, which is 1 − N(Δ), where N is the standard normal cumulative distribution function. Over T time steps, we therefore expect the spread to be greater than or equal to Δ in T(1 − N(Δ)) instances. Similarly, the probability of the value being less than or equal to −Δ is N(−Δ). Owing to the symmetry of the Gaussian distribution, N(−Δ) = 1 − N(Δ), and therefore the number of instances in which we expect the spread to be less than or equal to −Δ is also T(1 − N(Δ)). The profit on each buy and sell is 2Δ. A measure of profitability for trading in the time period T is therefore (profit per trade × number of trades), that is, 2TΔ(1 − N(Δ)). The problem of band design thus boils down to determining the value of Δ that maximizes Δ(1 − N(Δ)).
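Under the Gaussian assumption above, this maximization can be done numerically. The sketch below is a minimal illustration using base R; the function name `profit_measure` and the search interval of 0 to 3 standard deviations are assumptions. It applies to the idealized noise model only, not to the empirical profit curve of the back-test.

```r
# Profit measure per unit of time, up to the constant factor 2T:
# Delta * (1 - N(Delta)), with N the standard normal CDF.
profit_measure <- function(delta) delta * (1 - pnorm(delta))

# Maximize over a plausible range of thresholds (in standard deviations).
opt <- optimize(profit_measure, interval = c(0, 3), maximum = TRUE)
cat("Optimal Delta:", round(opt$maximum, 3), "\n")

# Cross-check on a coarse grid.
grid <- seq(0, 3, by = 0.05)
grid_best <- grid[which.max(profit_measure(grid))]
cat("Best grid value:", grid_best, "\n")
```

The optimum lies at about 0.75 standard deviations from the mean. An empirical search over delta on real spread data can give a different value, because real spreads are not pure white noise.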
[Chart: cumulative profit (Cum.Profit, vertical axis from 90 to 230) against Delta (horizontal axis from 0.2 to 2.4)]
Fig 5. Cumulative Profit at various levels of Delta
7. PERFORMANCE ANALYSIS FOR THE COMPLETE CO-INTEGRATED PORTFOLIO
The package 'PerformanceAnalytics' was used to analyze the statistical performance of the whole portfolio. The portfolio consists of 176 pairs, whose worst drawdowns are shown in Table 1.
X1 9% X23 11% X45 6% X67 7% X89 4% X111 10% X133 3% X155 8%
X2 13% X24 6% X46 0% X68 7% X90 7% X112 8% X134 5% X156 4%
X3 8% X25 5% X47 3% X69 6% X91 4% X113 5% X135 3% X157 7%
X4 3% X26 6% X48 5% X70 8% X92 7% X114 3% X136 5% X158 9%
X5 6% X27 4% X49 4% X71 6% X93 6% X115 6% X137 1% X159 6%
X6 8% X28 1% X50 15% X72 3% X94 6% X116 9% X138 3% X160 7%
X7 5% X29 4% X51 10% X73 8% X95 7% X117 9% X139 8% X161 3%
X8 3% X30 6% X52 6% X74 9% X96 9% X118 3% X140 4% X162 4%
X9 3% X31 15% X53 8% X75 7% X97 9% X119 8% X141 3% X163 10%
X10 9% X32 5% X54 8% X76 7% X98 10% X120 4% X142 3% X164 7%
X11 2% X33 6% X55 9% X77 6% X99 10% X121 4% X143 3% X165 3%
X12 6% X34 6% X56 7% X78 5% X100 9% X122 4% X144 6% X166 3%
X13 8% X35 6% X57 5% X79 5% X101 6% X123 6% X145 7% X167 2%
X14 6% X36 11% X58 10% X80 8% X102 9% X124 6% X146 8% X168 6%
X15 10% X37 7% X59 34% X81 8% X103 7% X125 3% X147 4% X169 4%
X16 5% X38 2% X60 6% X82 9% X104 8% X126 6% X148 8% X170 3%
X17 5% X39 2% X61 5% X83 8% X105 12% X127 3% X149 5% X171 7%
X18 8% X40 5% X62 5% X84 9% X106 10% X128 5% X150 4% X172 5%
X19 4% X41 2% X63 8% X85 5% X107 4% X129 4% X151 10% X173 5%
X20 4% X42 6% X64 8% X86 3% X108 4% X130 2% X152 9% X174 4%
X21 9% X43 7% X65 6% X87 4% X109 8% X131 3% X153 8% X175 6%
X22 3% X44 7% X66 9% X88 7% X110 4% X132 6% X154 3% X176 7%
Average 2%
CO-INTEGRATED PAIRS PORTFOLIO
Table 1. Worst Drawdowns for Co-integrated Pairs Portfolio
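For reference, a worst drawdown of the kind reported in Table 1 can be computed from a return series in a few lines of base R. This is a sketch on a made-up return series, with a hypothetical helper `worst_drawdown`; the project itself relies on the PerformanceAnalytics package for these figures.

```r
# Illustrative only: a hypothetical return series for one pair.
returns <- c(0.010, -0.020, 0.005, -0.015, 0.030, -0.005, 0.012)

# Worst drawdown: the largest peak-to-trough fall of the cumulative equity curve.
worst_drawdown <- function(r) {
  equity <- cumprod(1 + r)
  peak   <- cummax(c(1, equity))[-1]   # running peak, including the starting equity of 1
  max(1 - equity / peak)
}

cat(sprintf("Worst drawdown: %.2f%%\n", 100 * worst_drawdown(returns)))
```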
Particulars        Portfolio
Observations       1206
NAs                0
Minimum            -0.0114
Quartile           0
Median             0.0003
Arithmetic Mean    0.0019
Geometric Mean     0.0019
Quartile           0
Maximum            0.2203
SE                 0.0003
LCL Mean           0
UCL Mean           0.0006
Variance           0.0001
Stdev              0.0114
Skewness           9.869
Kurtosis           134.0417
Table 2. Co-integrated Portfolio Stats
8. ETF SPREAD PREDICTION THROUGH SUPERVISED MACHINE LEARNING
A machine learning tool called Weka has been used for spread prediction. The objective here is to show an application of supervised machine learning.
The same Natural Resources pair, CRBQ-GRES, is used as an example. Two datasets are used: one with 5 period-lagged (embedding-dimension) variables and the other with 10 period-lagged variables. The lagged variables give the algorithm repetitive data to train on, which is the premise on which machine learning is based. Without repetitive data, the algorithm cannot be trained effectively to minimize the error between the actual and the predicted values.
A Weka classifier algorithm called Multi-Layer Perceptron is used for training on the data and developing the model.
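The lagged datasets can be prepared in R before they are loaded into Weka. The sketch below is a minimal illustration on a simulated mean-reverting series rather than the CRBQ-GRES spread; the helper `make_lagged`, the column names, the time-ordered split and the output file name are all assumptions.

```r
# Illustrative only: a simulated mean-reverting spread (AR(1)), not the CRBQ-GRES data.
set.seed(3)
spread <- as.numeric(arima.sim(list(ar = 0.9), n = 300, sd = 0.005))

# Build a data set whose columns are the current spread and its previous n_lags values.
make_lagged <- function(x, n_lags) {
  m <- embed(x, n_lags + 1)   # column 1 = x[t], column k + 1 = x[t - k]
  colnames(m) <- c("spread", paste0("lag", seq_len(n_lags)))
  as.data.frame(m)
}

lag5  <- make_lagged(spread, 5)
lag10 <- make_lagged(spread, 10)

# 66% / 34% split in time order (an assumption about how the split is taken).
n_train <- round(0.66 * nrow(lag5))
train <- lag5[seq_len(n_train), ]
test  <- lag5[-seq_len(n_train), ]

# Hypothetical file name for import into Weka:
# write.csv(lag5, "spread_lag5.csv", row.names = FALSE)
```

Each row then holds the target value in the first column and the lagged inputs in the remaining columns, which is the layout a supervised learner expects.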
Weka Analysis for the CRBQ – GRES Pair with 5 lagged variables
[Chart: actual spread (SPREAD) and predicted spread (Predicted); vertical axis from -0.03 to 0.05]
Fig 6. Predicted Spread with Lagged Variables - 5
The figure above shows the actual spread v/s the predicted spread for the CRBQ-GRES pair with 5 lagged or embedding-dimension variables. This first case, with 5 lags, gives better results than the second case, which uses 10 lagged variables.
The correlation coefficient obtained is 0.9347 (93.47%), which is very good; however, the most striking feature is the root mean squared error, which is 0.0041 (0.41%). This result is obtained by training on 66% of the ETF spread data; the evaluation on the test split covers 412 instances. The relative absolute error and the root relative squared error are 22.64% and 27.12% respectively.
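The evaluation measures quoted here can be reproduced for any pair of actual and predicted series. The sketch below uses made-up values; the relative errors are computed against a baseline that always predicts the mean of the evaluated values, which is an assumption, and Weka's exact baseline may differ.

```r
# Illustrative only: hypothetical actual and predicted spread values.
actual    <- c(0.012, -0.004, 0.020, 0.007, -0.015, 0.003)
predicted <- c(0.010, -0.001, 0.017, 0.009, -0.012, 0.001)

err <- predicted - actual
metrics <- c(
  correlation = cor(actual, predicted),
  mae  = mean(abs(err)),
  rmse = sqrt(mean(err^2)),
  # Relative errors compare the model with a baseline that always predicts the mean.
  rae  = 100 * sum(abs(err)) / sum(abs(actual - mean(actual))),
  rrse = 100 * sqrt(sum(err^2) / sum((actual - mean(actual))^2))
)
print(round(metrics, 4))
```

Relative errors below 100% mean that the model beats the mean-only baseline, which makes them easier to compare across datasets than the absolute errors.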
=== Evaluation on test split ===
=== Summary ===
Lagged Variables               5
Correlation coefficient        0.9347
Mean absolute error            0.0028
Root mean squared error        0.0041
Relative absolute error        22.64 %
Root relative squared error    27.12 %
Total Number of Instances      412
Table 3. Summary Statistics for 5 Lagged Variables
Weka Analysis for the CRBQ – GRES Pair with 10 lagged variables
[Chart: actual spread (SPREAD) and predicted spread (Predicted); vertical axis from -0.03 to 0.05]
Fig 7. Predicted Spread with Lagged Variables - 10
The figure above shows the actual spread v/s the predicted spread for the CRBQ-GRES pair with 10 lagged or embedding-dimension variables. The first case, with 5 lags, gives better results than this second case, with 10 lags.
The correlation coefficient obtained is 0.9328 (93.28%), which is very good; however, the important parameter to look at is the root mean squared error, which is 0.0053 (0.53%), slightly worse than in the first case with 5 lags.
This is an interesting result: training with a higher number of lagged variables, and in effect more repetitive data, should have improved the results, yet they deteriorate slightly. As in the first case, the result is obtained by training on 66% of the ETF spread data; the evaluation on the test split covers 412 instances. The relative absolute error and the root relative squared error are 32.89% and 35.32% respectively, which are significantly worse than those with 5 lags.
=== Evaluation on test split ===
=== Summary ===
Lagged Variables               10
Correlation coefficient        0.9328
Mean absolute error            0.0041
Root mean squared error        0.0053
Relative absolute error        32.89 %
Root relative squared error    35.32 %
Total Number of Instances      412
Table 4. Summary Statistics for 10 Lagged Variables
APPENDIX 1
A. WEKA ANALYSIS WITH 5 LAGGED OR EMBEDDED DIMENSION VARIABLES
Fig 8. Weka Analysis with Spread Analysis for 5 Lags
Fig 9. Weka Analysis with Spread Analysis for 5 Lags with Actual & Predicted Values
Fig 10. Weka Analysis with Spread Analysis for 5 Lags with Evaluation Summary
B. WEKA ANALYSIS WITH 10 LAGGED OR EMBEDDED DIMENSION VARIABLES
Fig 11. Weka with Spread Analysis for 10 Lags
Fig 12. Weka with Spread Analysis for 10 Lags - Actual & Predicted Values
Fig 13. Weka with Spread Analysis for 10 Lags - Evaluation Summary
APPENDIX 2: COMPLETE R CODE
# Install the required packages once if they are not already available:
# install.packages(c("zoo", "tseries", "quantmod", "PerformanceAnalytics", "fUnitRoots"))
library(zoo)
library(tseries)
library(quantmod)
library(PerformanceAnalytics)
library(fUnitRoots)
# Download the daily price history of the two ETFs in the pair from Yahoo.
# Note: this legacy CSV endpoint may no longer respond.
a <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=ipn&a=01&b=01&c=2011&d=10&e=31&f=2015&ignore=.csv", stringsAsFactors=FALSE)
b <- read.csv("http://ichart.finance.yahoo.com/table.csv?s=vis&a=01&b=01&c=2011&d=10&e=31&f=2015&ignore=.csv", stringsAsFactors=FALSE)
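# Column 7 of the Yahoo CSV is the adjusted close; index each series by its date.
a <- zoo(a[,7], as.Date(a[,1]))
b <- zoo(b[,7], as.Date(b[,1]))
# Keep only the dates on which both ETFs have a price.
t.zoo <- merge(a, b, all=FALSE)
t <- as.data.frame(t.zoo)
cat("Date range is", format(start(t.zoo)), "to", format(end(t.zoo)), "\n")
# Regress the log prices through the origin in both directions and keep
# the larger slope as the co-integration coefficient (hedge ratio).
m <- lm(log(a) ~ log(b) + 0, data=t)
beta1 <- coef(m)[1]
n <- lm(log(b) ~ log(a) + 0, data=t)
beta2 <- coef(n)[1]
beta <- max(beta1, beta2)
cat("Assumed Co-integration Coefficient is", beta, "\n")
# Spread = dependent log price minus beta times the other log price.
if (beta1 >= beta2) {
  sprd <- log(t$a) - beta*log(t$b)
} else {
  sprd <- log(t$b) - beta*log(t$a)
}
# ADF tests on the spread: tseries::adf.test and fUnitRoots::adfTest.
ht <- adf.test(sprd, alternative="stationary", k=0)
ht.unitroots <- adfTest(sprd, lags = 0, type = "nc")
cat("ADF p-value is", ht$p.value, "\n")
if (ht$p.value < 0.10) {
  cat("The spread is likely mean-reverting\n")
} else {
  cat("The spread is not mean-reverting.\n")
}
print(ht.unitroots)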
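# Trading bands: Delta standard deviations on either side of the spread mean.
Delta <- 1
d <- mean(sprd) - Delta * sd(sprd)
u <- mean(sprd) + Delta * sd(sprd)
scatter.smooth(log(t$a), log(t$b), xlab = "log A", col = "darkred", lwd = 2,
               ylab = "log B", main = "Scatterplot of Log Prices")
plot(sprd, col = "violet", type = "l", lwd = 2, xlab = "Period", main = "Spread Chart for the Pair: CRBQ-GRES")
abline(h = u, col = "red", lwd = 3)
abline(h = d, col = "green", lwd = 3)
# Signals: +1 (buy the spread) below the lower band, -1 (sell the spread) above the upper band.
sigup <- ifelse(sprd < d, 1, 0)
sigdn <- ifelse(sprd > u, -1, 0)
# Trade on the next bar: shift each signal forward by one period, padding with 0.
# (lag() does not shift the values of a plain numeric vector, so the shift is done by hand.)
sigup <- c(0, head(sigup, -1))
sigdn <- c(0, head(sigdn, -1))
sig <- sigup + sigdn
# Per-period return of the position; slippage is charged as a cost on long and short positions alike.
slippage <- 0.0010
ret <- c(0, diff(sprd, lag = 1)) * sig - slippage * abs(sig)
ret[1] <- 0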
# Equity curves starting from 100: long trades only, short trades only, and all trades.
initial_equity <- 100
eq_up <- initial_equity * cumprod(1 + ret*sigup)
eq_dn <- initial_equity * cumprod(1 + ret*sigdn*-1)
eq_all <- initial_equity * cumprod(1 + ret)
plot.zoo(cbind(eq_up, eq_dn), xlab = "Period", ylab = c("Long","Short"), col = c("green","red"),
         lwd = 3, main = "Long & Short Equity curves: \n Co-integration based Pair Trading Strategy", xylabels = TRUE)
plot.zoo(eq_all, xlab = "Period", ylab = "Equity", col = "blue", lwd = 3,
         main = "Cum. returns with Initial Equity of 100: \n Co-integration based Pair-Strategy for CRBQ-GRES")
# Training the data: final equity of the strategy as a function of the threshold Delta #
k.min <- min(abs(sprd - mean(sprd)) / sd(sprd))
k.max <- max(abs(sprd - mean(sprd)) / sd(sprd))
delta_train <- seq(k.min, k.max, length.out = length(sprd))
training_delta <- function(delta) {
  d_train <- mean(sprd) - delta * sd(sprd)
  u_train <- mean(sprd) + delta * sd(sprd)
  sigup_train <- ifelse(sprd < d_train, 1, 0)
  sigdn_train <- ifelse(sprd > u_train, -1, 0)
  # Shift by one period so that each signal is traded on the next bar.
  sig_train <- c(0, head(sigup_train + sigdn_train, -1))
  ret_train <- c(0, diff(sprd, lag = 1)) * sig_train
  initial_equity_train <- 100
  fin_equity <- initial_equity_train * cumprod(1 + ret_train)
  tail(fin_equity, 1)
}
# Evaluate the final equity at every candidate threshold.
profit1 <- sapply(delta_train, training_delta)
# Threshold with the highest final equity on the training data.
optidelta <- delta_train[which.max(profit1)]
plot(delta_train, profit1, xlab = "Delta", ylab = "profit", col = "orange",
     lwd = 3, main = "Profit at various levels of Delta", type = "l")
abline(v = optidelta, lty = 2)
# Portfolio-level statistics with PerformanceAnalytics.
Portfolio <- read.zoo("Portfolio.csv", sep = ",", header = TRUE)
Portfolio_Copy <- read.zoo("Portfolio_Copy.csv", sep = ",", header = TRUE)
table.Drawdowns(Portfolio_Copy, top = 10)
maxDrawdown(Portfolio_Copy)
table.DownsideRisk(Portfolio)
Worstdrawdown <- maxDrawdown(Portfolio)
# Re-read the file as a data frame to plot the "Portfolio" column as the equity curve.
Portfolio_df <- read.csv("Portfolio.csv", header = TRUE)
portfolio <- Portfolio_df$Portfolio
plot.zoo(portfolio, xlab = "Period", ylab = "P&L", col = "firebrick",
         lwd = 5, main = "Equity Curve for Cointegrated Portfolio", type = "l")