
Multiple Forecast Comparison

Andrew Patton

Department of Economics, Duke University

OMI-SoFiE Summer School, 2013


Papers to be covered

* White, H., 2000, A Reality Check for Data Snooping, Econometrica 68, 1097-1126.

* Romano, J.P. and Wolf, M., 2005, Stepwise Multiple Testing as Formalized Data Snooping, Econometrica 73, 1237-1282.

* Hansen, P.R., A. Lunde and J.M. Nason, 2011, Model Confidence Sets for Forecasting Models, Econometrica 79, 453-497.

* Sullivan, R., A. Timmermann, and H. White, 1999, Data Snooping, Technical Trading Rule Performance, and the Bootstrap, Journal of Finance 54, 1647-1692.

* Hansen, P.R., 2005, A Test for Superior Predictive Ability, Journal of Business and Economic Statistics 23, 365-380.


Comparing multiple forecasts

Consider the problem of comparing $S$ forecasts with some benchmark forecast ($\hat{y}_{t,0}$) using some loss function $L$:

Let $\hat{\mu}_i \equiv \frac{1}{T}\sum_{t=1}^{T} L\left(y_t, \hat{y}_{t,i}\right)$, for $i = 0, 1, 2, \ldots, S$.

We can compare each individual forecast $\hat{y}_{t,i}$ with $\hat{y}_{t,0}$ using a Diebold-Mariano-West test:

$$H_0^i : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i}\right)\right] = 0$$

But say we wanted to test the best competing forecast, $\hat{y}_{t,i^*}$:

$$i^* \equiv \arg\min_{i=1,2,\ldots,S} \hat{\mu}_i$$

and then test $H_0^* : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i^*}\right)\right] = 0$.

How do we test $H_0^*$, controlling for the fact that we chose $\hat{y}_{t,i^*}$ by searching over $S$ individual forecasts?


Data snooping, data mining, and data re-use

In economics and finance, there is often only a single history of data available for analysis. Thus looking for relationships between variables in that data will inevitably involve searching over the same data set.

Even if no relationship exists, an apparently significant one may be found purely by chance.

“Those of us who study phenomena generated once and for all by a system outside our control lack the inferential luxuries afforded to the experimental sciences.” White (2000, p. 1117).

If we could re-run the economy we could avoid data re-use, and thus avoid the problem of data snooping. In the absence of that “luxury”, we must instead employ inference methods that account for this re-use.


Data snooping, data mining, and data re-use

From the Ethical Guidelines for Statistical Practice at the American Statistical Association website (http://www.amstat.org/about/ethicalguidelines.cfm), Section II-A-8:

Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present.

Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result.

Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.


White (2000)

White provides a formal yet (relatively) simple method of controlling for data snooping. The null hypothesis of interest is:

$$\hat{\mathbf{y}}_t \equiv \left[\hat{y}_{t,1}, \hat{y}_{t,2}, \ldots, \hat{y}_{t,S}\right]'$$

$$H_0 : E\Big[\underbrace{L\left(y_t, \hat{y}_{t,0}\right)}_{(1\times 1)} - \underbrace{L\left(y_t, \hat{\mathbf{y}}_t\right)}_{(S\times 1)}\Big] \leq \underbrace{\mathbf{0}}_{(S\times 1)}$$

which is equivalent to $H_0 : \max_{i=1,2,\ldots,S} E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i}\right)\right] \leq 0$.

In theory, the results of West (1996) could be used here, though West notes that his approach is unlikely to work well when the number of models being compared is much greater than 2.

White (2000) follows West and assumes that the forecasts in the null are based on $\beta^*$ rather than $\hat{\beta}_t$, though he notes that we might be interested in a null based directly on $\hat{\beta}_t$.

Foreshadowing the work in Giacomini and White (2006)...


Testing multiple hypotheses - Bonferroni bounds

The null of interest to White is a multiple hypothesis:

$$H_0^i : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i}\right)\right] \leq 0, \quad \text{for all } i = 1, 2, \ldots, S$$

A simple, standard way of testing these $S$ individual hypotheses is to use Bonferroni's method:

Test each individual hypothesis $H_0^i$ with level $\alpha_i$ such that $\sum_{i=1}^{S} \alpha_i = \alpha$, the desired level for the joint test (e.g., $\alpha_i = \alpha/S$).

Reject the joint null hypothesis, at level $\alpha$, if any of the individual null hypotheses $H_0^i$ is rejected at level $\alpha_i$.

This method is known to be very conservative (i.e., it rejects a true null less often than the nominal level $\alpha$) and thus has low power. The degree of conservatism increases with $S$.

White provides a test that rejects the null with the correct probability asymptotically, and thus has greater power than Bonferroni's method.


Motivating the bootstrap approach

Under the conditions of West (1996), and using his notation, we know:

Define $f_{t,i} \equiv L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i}\right)$ and $\mathbf{f}_t \equiv \left[f_{t,1}, \ldots, f_{t,S}\right]'$,

then $\sqrt{n}\left(\bar{\mathbf{f}} - E\left[\mathbf{f}^*\right]\right) \to N\left(0, \Omega\right)$, as $n, R \to \infty$,

thus $\max_{i=1,2,\ldots,S} \sqrt{n}\left(\bar{f}_i - E\left[f_i^*\right]\right) \to V_S \equiv \max_{i=1,2,\ldots,S} Z_i$, where $\mathbf{Z} \equiv \left[Z_1, Z_2, \ldots, Z_S\right]' \sim N\left(0, \Omega\right)$.

Given some consistent estimator of $\Omega$, $\hat{\Omega}_n$, this distributional result is feasible.

The distribution of the maximum of a vector of correlated Normals is not known in closed form, but can be obtained by simulating from the $N\left(0, \Omega\right)$ distribution (the “Monte Carlo” approach in White 2000).
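As an illustration, here is a minimal sketch of this Monte Carlo approach in Python, assuming numpy; the function name and inputs (a vector of mean loss differentials fbar and a covariance estimate Omega_hat) are illustrative, not from the paper:

```python
# Sketch of the "Monte Carlo" approach: simulate the distribution of
# V_S = max_i Z_i with Z ~ N(0, Omega_hat), then compare the observed
# statistic sqrt(n) * max_i fbar_i against the simulated draws.
import numpy as np

def mc_reality_check_pvalue(fbar, Omega_hat, n, n_sims=10_000, seed=0):
    rng = np.random.default_rng(seed)
    S = len(fbar)
    Z = rng.multivariate_normal(np.zeros(S), Omega_hat, size=n_sims)
    V_sim = Z.max(axis=1)               # simulated draws of V_S
    V_obs = np.sqrt(n) * np.max(fbar)   # observed test statistic
    return np.mean(V_sim > V_obs)       # simulated p-value
```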


Motivating the bootstrap approach, cont’d

White suggests that the Monte Carlo approach has some practical drawbacks:

Requires storage of $\hat{\Omega}_n$. This matrix is of order $S^2$, which can be large when $S$ is large (in White's application $S = 3{,}654$).

Requires computation of $\hat{\Omega}_n$. This can be done analytically (as in West) or by the bootstrap, but requires knowledge of $f_{t,i}$ for all $t, i$. This is no good when looking at the results of competitors.

To overcome these, White suggests using a bootstrap. This suggestion, and its justification, is the main contribution of this paper.


Brief introduction to the “bootstrap”
See Efron and Tibshirani, 1993, "An Introduction to the Bootstrap" for more.

The bootstrap is a computationally-intensive method of assigning measures of accuracy to statistical estimators.

Often used as an alternative to the usual asymptotic Normal distribution.

Suppose we observe a sample of data $\mathbf{y} = \left(y_1, y_2, \ldots, y_T\right)$ and some statistic $s\left(\mathbf{y}\right)$ (e.g., a mean, standard deviation, etc.).

A “bootstrap” sample $\mathbf{y}^* = \left(y_1^*, y_2^*, \ldots, y_T^*\right)$ is obtained by sampling $T$ times, with replacement, from the original data points $\left(y_1, y_2, \ldots, y_T\right)$. The bootstrap distribution of the statistic is obtained by examining the distribution of $s\left(\mathbf{y}^*\right)$ across many (e.g., 1000 or 10000) bootstrap samples.

E.g., one bootstrap sample of $\left(y_1, y_2, y_3, y_4, y_5\right)$ might be $\left(y_3, y_1, y_5, y_3, y_2\right)$.

The true sampling distribution of the statistic $s$ is a function of the (unknown) underlying distribution of the data, $F$. The bootstrap estimates the sampling distribution of $s$ by using the empirical distribution function of the observed data, $\hat{F}$.
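A minimal sketch of the iid bootstrap in Python, assuming numpy and using the standard error of the sample mean as the statistic of interest:

```python
# Sketch of the iid bootstrap: resample T observations with replacement,
# recompute the statistic on each bootstrap sample, and use the spread
# of the bootstrap statistics as a measure of sampling variability.
import numpy as np

def iid_bootstrap_se(y, B=10_000, seed=0):
    rng = np.random.default_rng(seed)
    T = len(y)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, T, size=T)   # sample T indices with replacement
        stats[b] = y[idx].mean()           # statistic on the bootstrap sample
    return stats.std(ddof=1)               # bootstrap standard error
```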


“Block bootstraps” for time series data

For iid data a bootstrap sample can be obtained by sampling $T$ times with replacement from the original data.

For time series data such an approach is not appropriate: the bootstrap samples will be iid (by construction) while the original data is not. Thus the bootstrap will not estimate the sampling distribution of the statistic accurately.

Instead, use a “block bootstrap”:

Break the data into blocks of length $q$, then sample with replacement from those blocks of data until you have a sample of length $T$.

Serial dependence in the bootstrap sample is maintained within the blocks, and is only broken across block boundaries (and there are only $T/q$ of those).

The “circular block bootstrap” uses fixed block sizes; the “stationary bootstrap” uses block sizes that are random.


The “stationary bootstrap” of Politis and Romano (1994)

The stationary bootstrap is a form of “block bootstrap”, where block lengths have the Geometric distribution with average length $b$.

For a general time series $Y_t$, $t = 1, 2, \ldots, T$, it is implemented as follows:

1. Randomly choose a starting point from $\{1, 2, \ldots, T\}$. Call this $\tau(1)$.

2. Draw a $U \sim \text{Unif}(0,1)$, independent of everything else.
   1. If $U > q = 1/b$ then set $\tau(2) = \tau(1) + 1$ if $\tau(1) < T$, and $\tau(2) = 1$ if $\tau(1) = T$.
   2. If $U \leq q$, then randomly choose from $\{1, 2, \ldots, T\}$ and call this $\tau(2)$.

3. Repeat step 2 until you have $T$ time indices, $\left[\tau(1), \tau(2), \ldots, \tau(T)\right]'$.

The series $\left[Y_{\tau(1)}, Y_{\tau(2)}, \ldots, Y_{\tau(T)}\right]'$ is one bootstrap sample of $\left[Y_1, Y_2, \ldots, Y_T\right]'$. Repeat this $B$ times.
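A minimal sketch of this index-generating algorithm in Python, assuming numpy; zero-based indices stand in for the time indices $1, \ldots, T$:

```python
# Sketch of the Politis-Romano stationary bootstrap: with probability
# 1 - q continue the current block (wrapping around at T), and with
# probability q = 1/b jump to a new random starting point.
import numpy as np

def stationary_bootstrap_indices(T, avg_block, rng):
    q = 1.0 / avg_block
    tau = np.empty(T, dtype=int)
    tau[0] = rng.integers(0, T)            # step 1: random starting point
    for t in range(1, T):
        if rng.random() > q:               # continue the current block,
            tau[t] = (tau[t - 1] + 1) % T  # wrapping from the end back to the start
        else:                              # or start a new block
            tau[t] = rng.integers(0, T)
    return tau

# One bootstrap sample of a series y is then y[stationary_bootstrap_indices(len(y), b, rng)]
```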


The bootstrap reality check

Let $\bar{f}_i$ be the individual average loss differences, and define $\bar{V} \equiv \sqrt{n} \cdot \max_{i=1,2,\ldots,S} \bar{f}_i$.

The bootstrap version of White's “reality check” uses the stationary bootstrap to generate bootstrap samples $\left\{\left[f_{t,1}^{(b)}, \ldots, f_{t,S}^{(b)}\right]\right\}_{b=1}^{B}$.

From each bootstrap sample we compute $\bar{V}^{(b)} = \sqrt{n} \cdot \max_{i=1,2,\ldots,S} \left(\bar{f}_i^{(b)} - \bar{f}_i\right)$. (Notice the “re-centering” step.)

The bootstrap reality check p-value is computed as

$$p_{RC} = \frac{1}{B} \sum_{b=1}^{B} 1\left\{\bar{V}^{(b)} > \bar{V}\right\}$$

and we reject the null hypothesis if $p_{RC} < \alpha$ (e.g., $p_{RC} < 0.05$).
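Putting the pieces together, a minimal sketch of the bootstrap reality check in Python, reusing the stationary_bootstrap_indices helper sketched earlier; the T x S array f of loss differentials $f_{t,i}$ is an assumed input:

```python
# Sketch of the bootstrap reality check: bootstrap the T x S array of
# loss differentials, re-center each bootstrap mean at the sample mean,
# and compare the max statistic against its bootstrap distribution.
import numpy as np

def reality_check_pvalue(f, avg_block=10, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    T, S = f.shape
    fbar = f.mean(axis=0)
    V = np.sqrt(T) * fbar.max()                         # observed statistic
    V_boot = np.empty(B)
    for b in range(B):
        idx = stationary_bootstrap_indices(T, avg_block, rng)
        fbar_b = f[idx].mean(axis=0)
        V_boot[b] = np.sqrt(T) * (fbar_b - fbar).max()  # re-centered max
    return np.mean(V_boot > V)                          # reject if < alpha
```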


Composite null hypotheses

The null hypothesis of interest in White (2000) is

$$H_0 : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{\mathbf{y}}_t\right)\right] \leq 0$$

This is a “composite” hypothesis. In contrast, a “simple” hypothesis (loosely stated) is one that can be satisfied in only a single way.

The presence of a composite null hypothesis makes it hard to obtain critical values, because there are many DGPs that satisfy the null.

White obtains the distribution of the reality check test statistic under the (simple) “null least favourable to the alternative”:

$$H_0 : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{\mathbf{y}}_t\right)\right] = 0$$

Hansen (2005) shows that this null is conservative, and provides an alternative way of imposing the null hypothesis (involving the trimming of “really bad” forecasts).


Extension of the reality check to other performance metrics

The main case treated in White (2000) is the one where performance is measured as the average of some loss (or gain) function.

Many popular performance measures in economics and finance do not fit in this framework:

$$R_i^2 = 1 - \frac{V\left[y_t - \hat{\alpha}_{in} - \hat{\beta}_{in} \hat{y}_{t,i}\right]}{V\left[y_t\right]}$$

$$\text{Sharpe Ratio}_i = \frac{E\left[r_{t,i} - r_{ft}\right]}{\sqrt{V\left[r_{t,i} - r_{ft}\right]}}$$

Corollary 2.6 of White (2000) shows that so long as the performance measure is a continuously differentiable function of some moment of some function of the data and forecasts, then the bootstrap approach can be applied.


The reality check and parameter estimation error (PEE)

White assumes that the null hypothesis involves the forecasts evaluated at the pseudo-true parameters

$$H_0 : E\left[L\left(y_t, \hat{y}_{t,0}\left(\beta_0^*\right)\right) - L\left(y_t, \hat{y}_{t,i}\left(\beta_i^*\right)\right)\right] \leq 0, \quad \text{for } i = 1, 2, \ldots, S$$

rather than at the estimated values (as in Giacomini and White).

Thus the finite-sample differences between $\beta_i^*$ and $\hat{\beta}_{t,i}$ need to be considered.

Recall that West (1996) showed that if $n/R \to 0$ as $n, R \to \infty$ then PEE has no impact on the asymptotic distribution of his test statistic. White needs:

$$\frac{n}{R} \log \log R \to 0 \text{ as } T \to \infty$$

White needs $R$ to grow faster than West does in order for the bootstrap theory to go through. (Note that $\log \log R$ grows very slowly, though...)

One very useful result of this assumption is that the models used to generate the forecasts do not need to be re-estimated for each bootstrap sample.

We can simply bootstrap the performance measures, $f_{t,i}$, conditional on the original estimated parameters.


Empirical application

White considers the problem of forecasting daily returns on the S&P 500 index over the period March 1988 to May 1994. $R = 803$, $n = 758$ and so $T = 1560$. (Is $(n/R) \log \log R = 1.79$ small??)

The set of forecasting models considered are all possible linear combinations of three technical trading indicators:

- First lag of returns
- 10 measures of price momentum (returns over the past 1, 2, ..., 10 days)
- 4 measures of “local trends” (regressing past returns on a constant and a time trend)
- 5 “relative strength” indicators (number of positive returns in the last $n$ days)
- 10 “moving average oscillators” (differences in moving averages of returns over two different sample lengths)

This yields $S = \binom{29}{3} = 3{,}654$ combinations of predictor variables. The benchmark model includes only a constant.

Note these models are nested: problem?


Empirical application, cont’d

White considers two performance metrics:

$$f_{t,i} \equiv -\left(y_t - \hat{\beta}_{t,i}' X_{t,i}\right)^2 + \left(y_t - \hat{\beta}_{t,0}' X_{t,0}\right)^2$$

$$f_{t,i} \equiv 1\left\{y_t \cdot \hat{\beta}_{t,i}' X_{t,i} > 0\right\} - 1\left\{y_t \cdot \hat{\beta}_{t,0}' X_{t,0} > 0\right\}$$

Under MSE loss, the bootstrap reality check p-value is 0.3674, while the “naïve” p-value is 0.1068. (The naïve p-value is obtained by doing a DM test on the model that performed the best over the period, with no control for data snooping.)

Under the directional accuracy measure, the bootstrap reality check p-value is 0.2040, while the “naïve” p-value is 0.0036.

The second performance measure reveals the importance of controlling for the number of models that were included in the search.


Romano and Wolf (2005)

Romano and Wolf extend the work of White (2000) in three main directions:

1. Use the ‘reality check’ (or a minor variant thereof) to identify all models that beat the benchmark, rather than just to test whether the best model beats the benchmark.

2. Use a step-wise approach to potentially identify even more models than are identified in the first stage.

3. Motivate using “studentized” statistics in the analysis (Hansen (2005) also emphasises the importance of this).

Like White (2000), the implementation of the test uses a bootstrap re-sampling scheme.


Re-thinking the hypothesis(es) to be tested

White's null hypothesis was:

$$H_0^* : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i^*}\right)\right] \leq 0$$
$$\text{vs. } H_a^* : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i^*}\right)\right] > 0, \quad \text{where } i^* \equiv \arg\min_{i=1,2,\ldots,S} \hat{\mu}_i$$

RW consider the more general problem of:

$$H_s : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,s}\right)\right] \leq 0$$
$$\text{vs. } H_s' : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,s}\right)\right] > 0$$

for $s = 1, 2, \ldots, S$.

Identifying all models that beat the benchmark model corresponds to identifying all the null hypotheses that are false.


Definitions of the ‘size’ of multiple testing methods

In hypothesis testing we seek to maximise the probability that we can reject false null hypotheses (power), whilst limiting the probability that we reject true null hypotheses (Type I errors).

When testing a single hypothesis, the usual quantity to control is the (asymptotic) probability of a Type I error:

$$\lim_{T \to \infty} \Pr\left[W_T \in R \mid H_0 \text{ true}\right]$$

where $W_T$ is the test statistic and $R$ is the ‘rejection region’ for the test statistic. For example, $W_T$ might be a t-statistic, and the rejection region for a two-sided test with size 0.05 would be $R = (-\infty, -1.96] \cup [1.96, \infty)$.

When testing multiple hypotheses there are many ways to think about the “error” in a hypothesis test, and this has attracted a lot of attention in statistics and econometrics in the last 10-15 years.


1) The ‘family-wise error rate’ (FWE or FWER)

One widely-used measure of error in multiple hypothesis testing is the FWE:

$$\text{FWE} \equiv \Pr\left[\sum_{s \in S^*} 1\left\{W_{s,T} \in R_s\right\} \geq 1\right]$$

where $S^*$ is the set of true null hypotheses. That is, the FWE is the probability that we reject at least one true null hypothesis.

Recently, some authors have suggested that this measure leads to tests that are ‘too strict’ when the number of hypotheses is large. One looser measure of error is the k-FWE:

$$k\text{-FWE} \equiv \Pr\left[\sum_{s \in S^*} 1\left\{W_{s,T} \in R_s\right\} \geq k\right]$$

where $k$ is an integer weakly greater than 1.


2) The ‘false discovery proportion’ (FDP) or ‘rate’ (FDR)

Another popular measure of error is the FDP:

$$\text{FDP} \equiv \left(\sum_{s \in S^*} 1\left\{W_{s,T} \in R_s\right\}\right) \Big/ \left(\sum_{j=1}^{S} 1\left\{W_{j,T} \in R_j\right\}\right)$$

That is, the number of false rejections divided by the total number of rejections (and defined to be zero if there are no rejections). The quantity to be controlled is usually:

$$\Pr\left[\text{FDP} > \gamma\right]$$

where $\gamma$ is some small number.

The FDR is simply the expected FDP,

$$\text{FDR} \equiv E\left[\text{FDP}\right],$$

which would also be controlled to be smaller than some small number.

RW note that theory exists for tests that control the FWE, k-FWE and the FDP, but not yet for the FDR.
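To make these definitions concrete, here is a toy numerical illustration (not from the paper) of the FWE event, the k-FWE event, and the FDP for one realised set of rejections:

```python
# Toy example: 5 hypotheses, of which 3 are true nulls (in S*).
import numpy as np

rejected  = np.array([True, True, False, True, False])   # tests we rejected
true_null = np.array([True, False, False, True, True])   # hypotheses in S*

false_rej = np.sum(rejected & true_null)   # rejections of true nulls: 2
total_rej = np.sum(rejected)               # total rejections: 3

fwe_event   = false_rej >= 1               # at least one false rejection: True
k_fwe_event = false_rej >= 2               # at least k = 2 false rejections: True
fdp = false_rej / total_rej if total_rej else 0.0        # FDP = 2/3
```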


Controlling the Type I error or the FWE

White (2000) frames his problem as testing a single hypothesis:

$$H_0^* : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,i^*}\right)\right] \leq 0, \quad \text{where } i^* \equiv \arg\min_{i=1,2,\ldots,S} \hat{\mu}_i$$

and focuses on controlling the Type I error probability (Theorem 2.3 and associated discussion).

Romano and Wolf (2005) consider the problem of:

$$H_s : E\left[L\left(y_t, \hat{y}_{t,0}\right) - L\left(y_t, \hat{y}_{t,s}\right)\right] \leq 0, \quad \text{for } s = 1, 2, \ldots, S$$

and instead focus on controlling the FWE.

RW note that White's test can be interpreted as providing ‘weak’ control of the FWE for their full set of hypotheses. (That is, control when all nulls are true, vs. when some are true and some are false.)


Holm’s method for conducting multiple hypothesis tests

In addition to the Bonferroni method (discussed above) RW recall Holm's (1979) method, which is a step-wise refinement of Bonferroni:

1. Sort the individual p-values from the tests into ascending order: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(S)}$, with null hypotheses correspondingly labelled $H_0^{(1)}, \ldots, H_0^{(S)}$.

2. Reject all null hypotheses from $H_0^{(1)}$ up to $H_0^{(j)}$ for which $p_{(j)} \leq \alpha / (S + 1 - j)$.

Note that $\alpha / (S + 1 - j) \geq \alpha / S$, and so this method will reject at least as many null hypotheses as Bonferroni's method.
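A minimal sketch of Holm's step-down procedure in Python, assuming numpy; the function and argument names are illustrative:

```python
# Sketch of Holm's method: walk through the sorted p-values, comparing
# p_(j) against alpha / (S + 1 - j), and stop at the first failure.
import numpy as np

def holm(pvals, alpha=0.05):
    S = len(pvals)
    order = np.argsort(pvals)          # indices sorted by ascending p-value
    reject = np.zeros(S, dtype=bool)
    for j, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (S + 1 - j):
            reject[idx] = True         # reject and move to the next hypothesis
        else:
            break                      # stop at the first non-rejection
    return reject
```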


Bonferroni vs. Holm, and White vs. RW

The Holm and Bonferroni methods are both conservative, because they control the FWE by considering the worst-case dependence structure for the individual p-values.

The methods proposed by White and RW use the bootstrap to estimate the dependence between the p-values, thereby maintaining control of the FWE but increasing the power of the test.

The step-wise nature of RW increases power relative to White in a similar way that Holm improves upon Bonferroni.


Romano and Wolf’s stepwise testing method

Denote the individual test statistics as $w_{T,s}$, which limit to $\theta_s$, and order them so that $w_{T,1} \geq w_{T,2} \geq \cdots \geq w_{T,S}$.

First step: construct a (rectangular) confidence region for $\left[\theta_1, \ldots, \theta_S\right]'$ with coverage probability of $1 - \alpha$, of the form

$$\left[w_{T,1} - c_1, \infty\right) \times \left[w_{T,2} - c_1, \infty\right) \times \cdots \times \left[w_{T,S} - c_1, \infty\right)$$

where $c_1$ is the critical value for this step.

If an individual confidence interval does not contain 0, then reject that null. Let $R_1$ be the number of null hypotheses rejected at this stage.

Second step: construct a confidence region for $\left[\theta_{R_1+1}, \ldots, \theta_S\right]'$ with coverage probability of $1 - \alpha$, of the same form as above, but now with a new critical value, denoted $c_2$.

If an individual confidence interval does not contain 0, then reject that null. Let $R_2$ be the number of null hypotheses rejected at this stage.

Continue this algorithm until no more hypotheses are rejected.


Romano and Wolf’s stepwise testing method, cont’d

An alternative way to think about this test is:

1. Compute the critical value $c_1$ using the ‘reality check’ of White (2000) on all $S$ performance measures.

2. Reject all nulls where $w_{T,j} > c_1$. Let $R_1$ be the number of rejected nulls.

3. Compute the critical value $c_2$ using the ‘reality check’ of White (2000) on the $S - R_1$ remaining (non-rejected) performance measures.

4. Reject all nulls where $w_{T,j} > c_2$. Let $R_2$ be the number of rejected nulls at this stage.

5. Compute the critical value $c_3$ using the ‘reality check’ of White (2000) on the $S - R_1 - R_2$ remaining (non-rejected) performance measures, etc.

This algorithm allows us to say not only whether the best performing method significantly beats the benchmark (corresponding to rejecting $H_0^{(1)}$), which is White's test, but to obtain the set of all methods that significantly beat the benchmark, controlling the overall FWE.
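A minimal sketch of this stepwise loop in Python, assuming a helper rc_critical_value(f_subset, alpha) that returns the bootstrap reality-check critical value for a T x k array of loss differentials (e.g., built from the stationary bootstrap sketched earlier); all names here are illustrative:

```python
# Sketch of the Romano-Wolf stepwise loop: recompute the reality-check
# critical value on the surviving hypotheses, and stop at the first step
# with no new rejections.
import numpy as np

def stepm(w, f, alpha=0.05):
    """w: S-vector of test statistics; f: T x S array of loss differentials."""
    S = len(w)
    active = np.arange(S)                  # hypotheses not yet rejected
    rejected = np.zeros(S, dtype=bool)
    while active.size > 0:
        c = rc_critical_value(f[:, active], alpha)   # critical value for this step
        new = active[w[active] > c]        # nulls rejected at this step
        if new.size == 0:
            break                          # no further rejections: stop
        rejected[new] = True
        active = active[w[active] <= c]    # keep the non-rejected nulls
    return rejected
```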


Example: comparing 32 methods against a benchmark

[Figure: test statistics and critical values from various methods, for 32 models compared against a benchmark. The x-axis shows the model number (in decreasing order); the y-axis shows the t-statistic (a positive value indicates beating the benchmark). Plotted series: individual t-stats, RW $c_1$, RW $c_2$, Holm, Bonferroni, and the Normal 10% critical value.]

Using ‘studentised’ test statistics I

Section 4 of RW argues for the use of ‘studentised’ test statistics rather than ‘raw’ test statistics. That is, using:

$$z_{T,j} \equiv \frac{w_{T,j}}{\sqrt{\hat{V}\left[w_{T,j}\right]}} \quad \text{rather than} \quad w_{T,j}$$

where $\hat{V}\left[w_{T,j}\right]$ is some consistent estimator of $V\left[w_{T,j}\right]$, e.g.:

1. kernel-based “robust” estimators, such as Newey-West (1987);

2. a block bootstrap estimator. This would then be a ‘nested’ bootstrap:
   - $\left[w_{T,j}^{*,1}, w_{T,j}^{*,2}, \ldots, w_{T,j}^{*,M}\right]'$ approximates the distribution of $w_{T,j}$;
   - $\left[w_{T,j}^{**,m,1}, w_{T,j}^{**,m,2}, \ldots, w_{T,j}^{**,m,M_2}\right]'$ approximates the distribution of $w_{T,j}^{*,m}$.


Using ‘studentised’ test statistics II

The latter set can be used to compute $\hat{V}\left[w_{T,j}^{*,m}\right]$, and thus $z_{T,j}^{*,m} \equiv w_{T,j}^{*,m} \big/ \sqrt{\hat{V}\left[w_{T,j}^{*,m}\right]}$, $m = 1, 2, \ldots, M$, which approximates the distribution of $z_{T,j}$.

Nested bootstraps can be very computationally intensive.
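A minimal sketch of such a nested bootstrap in Python, reusing the stationary_bootstrap_indices helper sketched earlier; the statistic function stat and all parameter names are illustrative assumptions:

```python
# Sketch of a nested block bootstrap for studentisation: the outer loop
# approximates the distribution of the statistic, and each inner loop
# estimates the variance of the statistic on that outer sample.
import numpy as np

def nested_studentised_draws(x, stat, avg_block=10, M=500, M2=100, seed=0):
    """x: data array; stat: function mapping a sample to a scalar statistic."""
    rng = np.random.default_rng(seed)
    T = len(x)
    z_star = np.empty(M)
    for m in range(M):
        xs = x[stationary_bootstrap_indices(T, avg_block, rng)]  # outer sample
        inner = np.array([stat(xs[stationary_bootstrap_indices(T, avg_block, rng)])
                          for _ in range(M2)])                   # inner samples
        z_star[m] = stat(xs) / inner.std(ddof=1)  # studentised outer statistic
    return z_star  # approximates the distribution of the studentised statistic
```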


Using ‘studentised’ test statistics III

RW show that using studentised test statistics also ‘works’, in that it controls the FWE.

Further, they argue that this method should work better in practice:

1. Power (somehow defined) should be better for the studentised version.

2. Impact of heteroskedasticity: if the different test statistics, $w_{T,j}$, have different variances then studentising brings them all to the same ‘scale’. Without this, the test stats with the largest variance may have undue influence on the test. (This is a compelling reason, particularly in economics and finance.)


Using ‘studentised’ test statistics IV

3. Error probability:

   1. In standard univariate hypothesis testing problems, bootstrapping studentised test statistics leads to ‘asymptotic refinements’ (namely, that the errors in confidence intervals vanish at rate $n^{-1}$ rather than $n^{-1/2}$). This comes from the fact that the studentised test stat is ‘pivotal’ in standard cases (i.e., its distribution does not depend on the unknown parameters of the DGP; e.g., a t-stat is pivotal but an OLS parameter estimate is not).

   2. The studentised test stats in RW are not pivotal (individual means and variances are removed, but the correlations between the $w_{T,j}$'s remain), but it is hoped that they are ‘closer to pivotal’.


Simulation design

Simulation design for the time series application:

$T = 200$, $S = 40$, $M = 200$, simulation reps = 2000.

DGPs:

- Each $\left\{X_{t,j}\right\}_{t=1}^{T}$ series is an AR(1) process with AR coefficient equal to 0.6.
- All innovations to the $X_{t,j}$'s are multivariate Normal, with correlations either set all to 0, or all to 0.5.

Mean strategy performance (where the benchmark has mean performance = 1 and std dev = 1):

1. (40, 0), (34, 6), (20, 20), (0, 40) of the means equal 1, and standard deviations equal 1 or 2 in equal proportions.

2. The 0, 6, 20, 40 means that differ from 1 either all equal 1.6 (Table VI), or are evenly spread over the interval [1, 7] (Table VII), and all standard deviations equal twice the mean value.

Uses the circular bootstrap with block length equal to 15 or 20 (studentised vs. non-studentised test), and uses Andrews and Monahan (1992) to estimate $\hat{V}\left[w_{T,j}\right]$.


Simulation results (from Table VI). Nominal level is 0.10


Simulation results (from Table VII). Nominal level is 0.10


Empirical application: hedge fund performance evaluation

RW study the performance of $S = 105$ funds, over $T = 147$ months (1992-2004), relative to a risk-free rate (the US T-bill rate??), with FWE = 0.05.

There are ten funds with a t-stat greater than 4.65 (see Table VIII).

The univariate critical value for a t-test is 1.645, but this does not control size.

The Bonferroni critical value is $\Phi^{-1}\left(1 - 0.05/105\right) = 3.30$, where $\Phi^{-1}$ is the inverse of the $N(0,1)$ cdf, so it would indicate that at least the top ten are significant.

The Holm critical values for the top ten funds range from 3.30 to $\Phi^{-1}\left(1 - 0.05/(105 + 1 - 10)\right) = 3.28$, so would again indicate that at least the top ten are significant.

The non-studentised RW test does not find ANY funds to significantly out-perform the benchmark.

The studentised test identifies 6 funds in the first step, and 1 more fund in a second step.

So this application was not such a great motivation for this paper...


The Model Confidence Set, HLN (2011)

The “model confidence set” (MCS) is another method for comparing a large number of competing models or forecasts.

Its fundamental difference from the reality check and Romano-Wolf's stepwise procedure is that it does not require specifying a benchmark model.

Given a set of models $\mathcal{M}$, the goal of the MCS is to find the subset $\mathcal{M}^*$ that contains the best model(s) with a given level of confidence.

“Best” depends on the choice of loss function.

There may be more than one “best” model $\Leftrightarrow$ $\mathcal{M}^*$ does not have to be a singleton.


Constructing the MCS: the (true) set of best models

Define the difference in performance for models $i$ and $j$ as

$$d_{ij,t} \equiv L_{i,t} - L_{j,t}, \quad \text{and let } \mu_{ij} \equiv E\left[d_{ij,t}\right]$$

where $L_{i,t}$ is the loss incurred by forecast $i$ at time $t$.

The set of superior forecasts is defined as

$$\mathcal{M}^* \equiv \left\{i \in \mathcal{M} : \mu_{ij} \leq 0 \ \forall \ j \in \mathcal{M}\right\}$$

That is, it is the set of all models that are never “beaten” by a competing model.

This is the population quantity to be estimated. It is estimated by the MCS $\hat{\mathcal{M}}^*$, which satisfies

$$\liminf_{T \to \infty} \Pr\left[\mathcal{M}^* \subseteq \hat{\mathcal{M}}^*\right] \geq 1 - \alpha$$


Constructing the MCS: Eliminating bad models I

Define the set of all pair-wise average loss differences:

$$\bar{d}_{ij} \equiv \frac{1}{T} \sum_{t=1}^{T} d_{ij,t} \quad \text{and} \quad \bar{d}_i \equiv \frac{1}{\#\left(\mathcal{M}\right)} \sum_{j \in \mathcal{M}} \bar{d}_{ij}$$

Then define the corresponding t-statistic:

$$t_i \equiv \frac{\bar{d}_i}{\sqrt{\hat{V}\left[\bar{d}_i\right]}}$$

The variance in the denominator can be obtained by a simple bootstrap, similar to that used in White (2000) and Romano and Wolf (2005).


Constructing the MCS: Eliminating bad models II

Define the max over these test statistics:

$$T_{\max} \equiv \max_{i \in \mathcal{M}} t_i$$

Now we consider the testing step: consider the null hypothesis

$$H_0 : \mu_{ij} = 0 \ \forall \ i, j$$

The distribution of $T_{\max}$ under $H_0$ can be obtained using a simple bootstrap.

If $T_{\max}$ is greater than the bootstrap critical value then we move to the elimination step.


Constructing the MCS: Eliminating bad models III

HLN propose simply removing the model with the largest test stat. That is, remove model $i^*$, where:

$$i^* = \arg\max_{i \in \mathcal{M}} t_i$$

The p-value corresponding to this removal is computed as

$$p\text{-val} = \frac{1}{B} \sum_{b=1}^{B} 1\left\{T_{\max}^{(b)} > T_{\max}\right\}$$
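A minimal sketch of the test-and-eliminate loop (this step plus the repetition described on the next slide) in Python, assuming a helper bootstrap_Tmax(L_sub, B) that returns the observed $T_{\max}$, its bootstrap draws, and the within-subset index of the worst model for a T x k loss matrix (e.g., via the stationary bootstrap sketched earlier); all names are illustrative:

```python
# Sketch of the MCS algorithm: repeatedly test H0 of equal performance
# on the surviving models, eliminate the worst model while the null is
# rejected, and stop at the first non-rejection.
import numpy as np

def model_confidence_set(L, alpha=0.10, B=1000):
    """L: T x S array of losses L_{i,t}; returns the surviving model set."""
    models = list(range(L.shape[1]))
    removal_pvals = {}
    while len(models) > 1:
        Tmax, Tmax_boot, worst = bootstrap_Tmax(L[:, models], B)
        pval = np.mean(Tmax_boot > Tmax)   # p-value for removing the worst model
        if pval >= alpha:
            break                          # first non-rejection: stop
        removal_pvals[models[worst]] = pval
        models.pop(worst)                  # eliminate the worst model
    return models, removal_pvals           # surviving set is the MCS
```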


Constructing the MCS: Eliminating bad models IV

The steps above are then repeated for the subset of remaining models, removing the “worst” model at each step.

The algorithm stops the first time that the null hypothesis is not rejected.

Stopping at the first failure to reject prevents the accumulation of Type I errors (which can cause problems in other sequential testing methods).

Note that the set of models that remain are getting “more similar” at each iteration, and the p-values associated with each removal will rise (as in the illustration on the next slide).

The size of the MCS (i.e., the number of models in the MCS) tells us about the informativeness of the data for comparing these models.

Similar to a confidence interval for a parameter: when it is wide, we are less certain about the location of the true parameter.


Illustration of the MCS (Table 1, HLN)

For $\alpha = 0.01, 0.05, 0.10$ the MCS contains $N-1$, $N-4$, $N-6$ models.

Step    MCS p-val
1       0.01
2       0.04
3       0.04
4       0.04
5       0.07
6       0.07
7       0.11
8       0.25
...     ...
N       1.00


Summary I

White (2000) and Romano-Wolf (2005) provide methods for comparing many (even thousands or more) forecasting models,

while still controlling the error rate of the test (Type I or FWE),

and (hopefully!) improving power relative to the Bonferroni or Holm methods.

Hansen, Lunde and Nason (2011) provide a related method for finding the subset of models that contains the truly best models with some level of confidence.

These tests are implemented using a bootstrap, which is somewhat computationally intensive, though simple to program and implement.

These tests provide a way of controlling for the “data re-use” problem that is faced in non-experimental sciences, such as economics and finance.


Summary II

These three papers provide related, but distinct, tools for comparing many models:

White enables us to determine whether the best performing method truly beats the benchmark, controlling for the number of other methods considered.

Romano and Wolf enable us to separate the competing models into three groups: those that are (i) significantly better than, (ii) significantly worse than, and (iii) not significantly different from the benchmark.

Hansen, Lunde and Nason enable us to find the subset of models that contains the best model(s), without the need to specify a benchmark.
