
Supplementary materials for this article are available online. Please go to http://tandfonline.com/r/JBES

Testing Predictive Ability and Power Robustification

Kyungchul SONG
Department of Economics, University of British Columbia, Vancouver BC V6T 1Z1, Canada ([email protected])

One of the approaches to compare forecasting methods is to test whether the risk from a benchmark prediction is smaller than the others. The test can be embedded into a general problem of testing inequality constraints using a one-sided sup functional. Hansen showed that such tests suffer from asymptotic bias. This article generalizes this observation, and proposes a hybrid method to robustify the power properties by coupling a one-sided sup test with a complementary test. The method can also be applied to testing stochastic dominance or moment inequalities. Simulation studies demonstrate that the new test performs well relative to the existing methods. For illustration, the new test was applied to analyze the forecastability of stock returns using technical indicators employed by White.

KEY WORDS: Data snooping; Inequality restrictions; One-sided sup tests; Power robustification; Reality check.

1. INTRODUCTION

Comparing multiple forecasting methods is important in practice. Diebold and Mariano (1995) proposed tests comparing two forecasting methods, and West (1996) offered a formal analysis of inference based on out-of-sample predictions. White (2000) developed a general testing framework for multiple forecasting models, and Hansen (2005) developed a way to improve the power of the tests in White (2000). Giacomini and White (2006) introduced out-of-sample predictive ability tests that can be applied to conditional evaluation objectives. There also has been interest in the evaluation of density forecasts in the literature. For earlier contributions, see Diebold, Gunther, and Tay (1998); Christoffersen (1998); and Diebold, Hahn, and Tay (1999); and for more recent research, see Amisano and Giacomini (2007) and Bao, Lee, and Saltoğlu (2007), among others. For a general survey of forecast evaluation, see West (2006) and references therein.

This article’s main focus is on one-sided sup tests of predictive ability developed by White (2000) and applied by Sullivan, Timmermann, and White (1999). A one-sided sup test is based on the maximal difference between the benchmark and the candidate forecast performances. Hansen (2005) demonstrated that the one-sided sup tests are asymptotically biased, and suggested a way to improve their power. His approach is general and, in fact, related to some later literature on testing moment inequalities such as Andrews and Soares (2010); Bugni (2010); Andrews and Shi (2011); and Linton, Song, and Whang (2010).

This article generalizes the observation by Hansen (2005), and proposes a method to robustify the local asymptotic power behavior. The main idea is to couple the one-sided sup test with a complementary test that shows better power properties against alternative hypotheses under which the one-sided sup test performs poorly. For a complementary test, this article adopts a symmetrized test statistic used by Linton, Maasoumi, and Whang (2005) for their stochastic dominance tests. This article calls this coupled test a hybrid test, as its power properties are intermediate between those of the one-sided sup test and the symmetrized test. As this article demonstrates, one can easily apply a bootstrap procedure to obtain approximate critical values for the hybrid test by using the existing bootstrap procedures.

Many recent researchers have focused on the problem of testing inequality restrictions that hold simultaneously under the null hypothesis. See, for example, Hansen (2005), Andrews and Soares (2010), Bugni (2010), Canay (2010), Linton, Song, and Whang (2010), Andrews and Shi (2011), and references therein. This article’s approach contrasts with the proposals mentioned above. In the context of testing predictive ability, these proposals improve the finite sample power properties of the test by eliminating forecasting methods that perform poorly beyond a threshold when computing a critical value. Since using a fixed threshold makes the test asymptotically invalid, the threshold is chosen to be less stringent as the sample size becomes larger, satisfying certain rate conditions. On the other hand, this article’s approach modifies the sup test to have better local power against alternatives against which the original test is known to have weak power, and hence using a sequence of thresholds is not required.

A test of such inequality restrictions is said to be asymptotically similar on the boundary if the asymptotic rejection probability remains the same whenever any of the inequality restrictions is binding under the null hypothesis. A recent article by Andrews (2011) showed an interesting result that a test of such inequality restrictions that is asymptotically similar on the boundary has poor power properties under general conditions. Like the work mentioned previously, the hybrid test of this article improves power properties by alleviating asymptotic bias of the one-sided test against certain alternatives, but does not eliminate entirely the asymptotic nonsimilarity of the one-sided test. Hence the test is not subject to the poor power problem pointed out by Andrews (2011).

© 2012 American Statistical Association
Journal of Business & Economic Statistics
April 2012, Vol. 30, No. 2
DOI: 10.1080/07350015.2012.663245



Here, the performance of the hybrid test is investigated through Monte Carlo simulation studies. Overall, the new test performs as well as the tests of White (2000) and Hansen (2005), and in some cases, performs conspicuously better.

This article applies the hybrid test to investigate the forecastability of S&P500 stock returns by technical indicators in a spirit similar to the empirical application in White (2000). Considering the periods from March 28, 2003, through July 1, 2008, the empirical application tests the null hypothesis that no method among the 3654 candidate forecasting methods considered by White (2000) outperforms the benchmark method based on sample means. The hybrid test has conspicuously lower p-values than the tests of Hansen (2005) and White (2000). A brief explanation behind this finding is provided in the article.

The article is organized as follows. The next section discusses poor power properties of one-sided sup tests. Section 3 introduces a general method of coupling the one-sided test with a complementary one. Sections 4 and 5 present and discuss results from Monte Carlo simulation studies and an empirical application on stock return forecastability. Section 6 concludes.

2. TESTING PREDICTIVE ABILITY AND ASYMPTOTIC BIAS

In producing a forecast, one typically adopts a forecasting model, estimates the unknown parameter, and then produces a forecast using the estimated forecasting model. Since a forecasting model and an estimation method eventually constitute a single map that assigns past observations to a forecast, we follow Giacomini and White (2006) and refer to this map generically as a forecasting method. Given information F_T at time T, multiple forecasting methods are generically described by maps ϕ_m, m ∈ M, from F_T to a forecast, where M ⊂ R denotes the set of indices for the forecasting methods. The set M can be a finite set or an infinite set that is either countable or uncountable. Let R(m) denote the risk of prediction based on the mth candidate forecasting method, and R(0) the risk of prediction based on a benchmark method. Then the difference in performance between the two methods is measured by

d(m) ≡ R(0) − R(m).

We are interested in testing whether there is a candidate forecasting method that strictly dominates the benchmark method. The null and the alternative hypotheses are written as

H0 : d(m) ≤ 0 for all m ∈ M, and    (1)
H1 : d(m) > 0 for some m ∈ M.

Let us consider some examples of R(m).

Example 1 [Point Forecast Evaluated With the Mean Squared Prediction Error (MSPE)]. Suppose that there is a time series {(Y_t, X_t′)}_{t=1}^∞, of which we observe a part, say, F_T ≡ {(Y_t, X_t′)}_{t=1}^T. The object of forecast is a τ-ahead quantity Y_{T+τ}. There are M candidate forecasts Ŷ^{(m)}_{T+τ}, m = 1, ..., M. Each forecast is generated by Ŷ^{(m)}_{T+τ} = f_m(F_T; β̂_{m,T}), where β̂_{m,T} is a quantity estimated using F_T, and f_m(·; β) is the mth forecasting model known up to β. Since β̂_{m,T} is estimated using F_T, we can write β̂_{m,T} = τ_m(F_T) for some map τ_m. When we write ϕ_m(F_T) = f_m(F_T; τ_m(F_T)), the forecasting method is represented as a single map Ŷ^{(m)}_{T+τ} = ϕ_m(F_T). One way to define the risk R(m) is to adopt the MSPE:

R(m) = E[{Y_{T+τ} − Ŷ^{(m)}_{T+τ}}²].

The expectation above is with respect to the joint distribution of the variables constituting the information F_T and Y_{T+τ}.
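As a concrete illustration of this risk comparison, the sketch below computes hold-out-sample MSPEs for a benchmark and a candidate method and takes their difference. The data, the two forecasting rules (a rolling sample mean and a rolling AR(1)), and the split point R are hypothetical stand-ins, not the methods studied later in the article.

```python
import numpy as np

def sample_mspe(y, forecast_rule, R, tau=1):
    """Average squared tau-step-ahead forecast error over the hold-out period.

    `forecast_rule(history)` maps the observations available at time t to a
    forecast of y[t + tau]; it is a hypothetical stand-in for one method phi_m.
    """
    errors = [y[t + tau] - forecast_rule(y[: t + 1]) for t in range(R, len(y) - tau)]
    return np.mean(np.square(errors))

# Hypothetical data and two forecasting methods (benchmark: sample mean; candidate: AR(1)).
rng = np.random.default_rng(0)
y = rng.normal(size=500)

benchmark = lambda hist: hist.mean()            # method m = 0

def ar1(hist):                                  # method m = 1, rolling OLS AR(1)
    x, z = hist[:-1], hist[1:]
    beta = np.dot(x - x.mean(), z - z.mean()) / np.dot(x - x.mean(), x - x.mean())
    return z.mean() + beta * (hist[-1] - x.mean())

R = 300                                         # first forecast origin
d_hat = sample_mspe(y, benchmark, R) - sample_mspe(y, ar1, R)   # estimate of d(1) = R(0) - R(1)
print(d_hat)
```

A positive value of the estimated difference would point toward the candidate method; the formal test of the hypotheses in (1) is developed below.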

Example 2 (Density Forecast Evaluated With the Expected Kullback–Leibler Divergence). Let {(Y_t, X_t′)}_{t=1}^∞ and F_T ≡ {(Y_t, X_t′)}_{t=1}^T be as in Example 1. The object of forecast in this example is the density f_{T+τ} of a τ-ahead quantity Y_{T+τ}. Suppose that f_{m,T+τ}(·; F_T) is the density forecast obtained using the mth forecasting method and information F_T. Following Bao et al. (2007), we may take the expected Kullback–Leibler divergence as a measure of discrepancy between the forecast density f_{m,T+τ}(·; F_T) and the actual density f_{T+τ}(·):

KL(m) = ∫ log(f_{T+τ}(y)) f_{T+τ}(y) dy − E[ ∫ log(f_{m,T+τ}(y; F_T)) f_{T+τ}(y) dy ],

where f_{T+τ} is the true density of Y_{T+τ}. Since the first integral does not depend on the choice of a forecasting method, we focus only on the second part in comparing the methods. Hence we take

R(m) = −E[ ∫ log(f_{m,T+τ}(y; F_T)) f_{T+τ}(y) dy ]

and define d(m) = R(0) − R(m).
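A sample analog of this risk is minus the average log predictive density over a hold-out period. The sketch below scores a hypothetical Gaussian density forecast built from the rolling mean and variance of the history; the forecast rule and the data are purely illustrative.

```python
import numpy as np

def neg_log_predictive_score(y, log_density_forecast, R, tau=1):
    """Estimate of R(m): minus the average log predictive density at the realized value.

    `log_density_forecast(history, value)` returns log f_{m,t+tau}(value; F_t); hypothetical.
    """
    scores = [log_density_forecast(y[: t + 1], y[t + tau]) for t in range(R, len(y) - tau)]
    return -np.mean(scores)

def gaussian_log_density(hist, value):
    # Hypothetical density forecast: Gaussian with the rolling mean and variance of the history.
    mu, sig2 = hist.mean(), hist.var(ddof=1)
    return -0.5 * (np.log(2 * np.pi * sig2) + (value - mu) ** 2 / sig2)

rng = np.random.default_rng(1)
y = rng.normal(size=500)
print(neg_log_predictive_score(y, gaussian_log_density, R=300))
```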

Example 3 (Conditional Forecast Evaluated With the MSPE). Giacomini and White (2006) proposed a general framework for testing conditional predictive abilities. Let {(Y_t, X_t′)}_{t=1}^∞ and F_T ≡ {(Y_t, X_t′)}_{t=1}^T be as in Example 1. Let Ŷ^{(m)}_{T+τ} = ϕ_m(F_T) be a point forecast of Y_{T+τ} using the mth method ϕ_m, and define the conditional MSPE

R(m | G_T) = E[{Y_{T+τ} − ϕ_m(F_T)}² | G_T],

where G_T is part of the information F_T. For example, the forecast is generated from Ŷ^{(m)}_{T+τ} = f_m(F_T; β̂_m) as in Example 1. Then the null hypothesis of interest is stated as d(m | G_T) ≤ 0 for all m ∈ M, where d(m | G_T) = R(0 | G_T) − R(m | G_T), with R(0 | G_T) denoting the benchmark method’s conditional MSPE. The null hypothesis states that regardless of how the information G_T realizes, the performance of the benchmark method dominates all the candidate methods. As in Giacomini and White (2006), we choose an appropriate test function h(G_T), and focus on testing (1) with d(m) = R(0) − R(m), where R(m) ≡ E[h(G_T){Y_{T+τ} − ϕ_m(F_T)}²]. A remarkable feature of Giacomini and White’s (2006) study is that their testing procedure is designed to capture the effect of estimation uncertainty when a fixed sample size is used for estimation even as T → ∞. This feature is also accommodated in this article’s framework.

The usual method of testing (1) involves replacing R(m) by an estimator R̂(m) and constructing an appropriate test using d̂(m) = R̂(0) − R̂(m). For Examples 1–3 above, we can construct the following:

R̂(m) = (1/(T − R + 1)) Σ_{t=R}^{T} {Y_{t+τ} − f_m(F_t; β̂_{m,t})}²    (Example 1),

R̂(m) = −(1/(T − R + 1)) Σ_{t=R}^{T} log(f_{m,t+τ}(Y_{t+τ}; F_t))    (Example 2), and

R̂(m) = (1/(T − R + 1)) Σ_{t=R}^{T} h(G_t){Y_{t+τ} − f_m(F_t; β̂_{m,t})}²    (Example 3),

where the periods R + τ, ..., T + τ are the target periods of forecast. [For obtaining f_{m,t+τ}(Y_{t+τ}; F_t) in Example 2, see Bao et al. (2007) and Amisano and Giacomini (2007) for details.] Let n denote the number of time series observations used to produce d̂(m). The random quantity d̂(m) is viewed as a stochastic process indexed by m ∈ M, or briefly a random function d̂(·) on M. The main assumption for this article is the following:

Assumption 1. There exists a Gaussian process Z with a continuous sample path on M such that

√n{d̂(·) − d(·)} =⇒ Z(·), as n → ∞,    (2)

where =⇒ denotes weak convergence of stochastic processes on M.

When M = {1, 2, ..., M}, Assumption 1 is satisfied if, for d̂ = [d̂(1), ..., d̂(M)]′ and d = [d(1), ..., d(M)]′,

√n(d̂ − d) →_d Z ≡ [Z(1), ..., Z(M)]′ ∼ N(0, Σ)    (3)

for a positive semidefinite matrix Σ (i.e., for all t ∈ R^M, t′Z is zero if t′Σt = 0, and t′Z ∼ N(0, t′Σt) if t′Σt > 0). Therefore, the predictive models are allowed to be nested as in White (2000) and Giacomini and White (2006). In the situation where the forecast sample is small relative to the estimation sample, the estimation error in β̂_{m,t} becomes irrelevant. On the other hand, in the situation where parameter estimation uses a rolling window of observations with a fixed window length, as in Giacomini and White (2006), the estimation error in β̂_{m,t} remains relevant in the asymptotics. Assumption 1 accommodates both situations.

Assumption 1 also admits an infinite M. This case arises, for example, when one tests the conditional MSPE as in Example 3 using a class of test functions instead of a single choice of h [e.g., Stinchcombe and White (1998)]. In this case, M also includes the indices of such a class of test functions.

However, Assumption 1 excludes the case where R(·) is a regression function of a continuous random variable and R̂(·) is its nonparametric estimator. While such a case arises rarely in the context of testing predictive abilities, it does arise in the context of testing certain conditional moment inequalities [e.g., Lee, Song, and Whang (2011)].

We first show that tests based on the one-sided sup test statistic,

T^K ≡ √n sup_{m∈M} d̂(m),    (4)

are asymptotically biased. To define a local power function, we introduce Pitman local alternatives in the direction a:

d(m) = a(m)/√n.    (5)

The direction a represents how far and in which direction the alternative hypothesis is from the null hypothesis. For example, suppose that M = {1, 2, ..., M}, that is, we have M candidate forecasting methods, and consider an alternative hypothesis with a such that a(1) = c > 0 and a(m) = 0 for all m = 2, ..., M. The alternative hypothesis in this case is such that the first forecasting method (m = 1) has a risk smaller than that of the benchmark method by c/√n.

Proposition 1. Suppose that Assumption 1 holds. Then for any c_α > 0 with lim_{n→∞} P{T^K > c_α} ≤ α under H0 and any ε > 0, there exists a map a : M → R such that under the local alternatives of type (5),

lim_{n→∞} P_a{T^K > c_α} ≤ α − Δ + ε,

where P_a denotes the sequence of probabilities under (5) and Δ = P{sup_{m∈M} Z(m) > c_α} − inf_{m∈M} P{Z(m) > c_α}.

Proposition 1 shows that the sup test of (1) has a severe bias when Δ is large. Proposition 1 relies only on generic features of the testing setup such as (1) and (2), and hence also applies to many inequality tests in contexts beyond those of testing predictive abilities. A general version of Proposition 1 and its proof are found in the supplemental note.

The intuition behind Proposition 1 is simple. Suppose that P{sup_{m∈M} Z(m) > c_α} = α. Then the asymptotic power under the local alternatives with a is given by P{sup_{m∈M} [Z(m) + a(m)] > c_α}. Suppose we take a(m) of the form in Figure 1, with a(m_0) positive but close to zero, whereas for the other m’s, a(m) is very negative. Then sup_{m∈M} [Z(m) + a(m)] is close to Z(m_0) + a(m_0) with high probability. Since a(m_0) is close to zero,

P{sup_{m∈M} [Z(m) + a(m)] > c_α} ≤ P{Z(m_0) > c_α} + ε

for small ε > 0. Since typically P{Z(m_0) > c_α} < P{sup_{m∈M} Z(m) > c_α} = α, we obtain the asymptotic bias result.

Figure 1. Illustration of an alternative hypothesis associated with low power. Recall that when a(m) takes a positive value for some m, the situation corresponds to an alternative hypothesis. When a(m) is highly negative at m’s away from m_0 and a(m_0) is positive but close to zero, sup_{m∈M} [Z(m) + a(m)] is close to Z(m_0) with high probability, making it likely that the rejection probability is below α.

3. POWER ROBUSTIFICATION VIA COUPLING

3.1 A Complementary Test

The previous section showed that the sup test has very poor power against certain local alternatives. This section proposes a hybrid test that improves power against such local alternatives. Given d̂ as before, we construct another test statistic:

T^S = min{ max_{m∈M} d̂(m), max_{m∈M} (−d̂(m)) }.    (6)

This type of test statistic was introduced by Linton et al. (2005) for testing stochastic dominance. For testing (1), T^S is complementary to T^K in the sense that using T^S results in greater power against local alternatives against which the test T^K performs very poorly.

To illustrate this point, let M = {1, 2} and let X_1 and X_2 be given observations that are positively correlated and jointly normal with mean vector d = [d(1), d(2)]′. We are interested in testing

H0 : d(1) ≤ 0 and d(2) ≤ 0, against
H1 : d(1) > 0 or d(2) > 0.

Consider T^K = max{X_1, X_2} and T^S = min{max{X_1, X_2}, max{−X_1, −X_2}}. Complementarity between T^K and T^S is illustrated in Figure 2 in a form borrowed from Hansen (2005). The ellipses in Figure 2 indicate representative contours of the joint density of X_1 and X_2, each corresponding to different distributions denoted by A, B, and C. While A represents the null hypothesis under a least favorable configuration (LFC), that is, d(1) = d(2) = 0, B and C represent alternative hypotheses. Under B, the rejection probability of the test T^K is lower than that under A, implying the biasedness of the test. (This is illustrated by the dark area in ellipse B in the left panel, which is smaller than the dark area in ellipse A in the same panel.) However, the rejection probability of the test T^S against B is higher than that of the test T^K, as indicated by a larger dark area in ellipse B on the right panel than on the left panel. (This contrast may be less stark when X_1 and X_2 are negatively correlated.) Hence against B, the test T^S has better power than the test T^K. This order of performance is reversed in the case of an alternative C, where the test T^S has power close to zero, while the test T^K has power close to 1.

Figure 2. Complementarity of T^K and T^S: Both panels depict three ellipses that represent contours of the joint density of (X_1, X_2) under different probabilities. Contour A corresponds to the least favorable configuration under the null hypothesis, and contours B and C to different probabilities under the alternative hypothesis. The lighter gray areas on both panels represent the rejection regions of the tests. Under the alternative hypothesis of type C, the test T^K has better power than the test T^S, while under the alternative hypothesis of type B, the test T^S has better power than the test T^K (as shown by a larger dark area within contour B for the test T^S than for the test T^K).
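The complementarity in this two-dimensional example can be checked numerically. The sketch below simulates rejection probabilities of T^K and T^S at level 5%, calibrating the critical values under the least favorable configuration A; the correlation of 0.5 and the mean vectors chosen for B and C are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# Monte Carlo illustration of the complementarity of T^K and T^S when M = {1, 2}:
# (X1, X2) jointly normal with correlation 0.5 (illustrative choice).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])

def draw(mean, size=200_000):
    return rng.multivariate_normal(mean, cov, size=size)

def t_k(x):                     # one-sided sup (max) statistic
    return x.max(axis=1)

def t_s(x):                     # symmetrized min-max statistic
    return np.minimum(x.max(axis=1), (-x).max(axis=1))

alpha = 0.05
lfc = draw([0.0, 0.0])          # least favorable configuration A: d(1) = d(2) = 0
c_k = np.quantile(t_k(lfc), 1 - alpha)
c_s = np.quantile(t_s(lfc), 1 - alpha)

for label, mean in [("B", [0.2, -3.0]), ("C", [2.0, 2.0])]:
    x = draw(mean)
    print(label,
          "power of T^K:", (t_k(x) > c_k).mean(),
          "power of T^S:", (t_s(x) > c_s).mean())
# Under the B-type alternative, T^K rejects with probability below alpha while T^S does better;
# under the C-type alternative, the ranking is reversed.
```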

3.2 Coupling

We construct a hybrid test by coupling T^K and T^S. For a given level α ∈ [0, 1] and γ ∈ [0, 1], we define a hybrid test of (1) as follows:

Reject H0 if T^S > c^S_α(γ), or    (7)
if T^S ≤ c^S_α(γ) and T^K > c^K_α(γ),

where c^S_α(γ) and c^K_α(γ) are threshold values such that

lim_{n→∞} P{T^S > c^S_α(γ)} = αγ

and

lim_{n→∞} P{T^S ≤ c^S_α(γ), T^K > c^K_α(γ)} = α(1 − γ).

The hybrid test runs along a locus between T^K and T^S as we move γ between 0 and 1. When γ is close to 1, the hybrid test becomes close to T^S, and when γ is close to zero, it becomes close to T^K. The power-reducing effect on the test T^K of the negativity of d(m) for most m’s is counteracted by the power-enhancing effect on the test T^S of the positivity of −d(m) for most m’s. By coupling with T^S, the hybrid test shares this counteracting effect. Without reasons to do otherwise, this article proposes using γ = 1/2.

Critical values can be computed using the bootstrap. First, we simulate the bootstrap distribution P* of (T^S, T^K) by generating (T^{S∗}_b, T^{K∗}_b), b = 1, ..., B, where B denotes the number of bootstrap replications. [When the observations are a stationary series, this can be done using the stationary bootstrap method of Politis and Romano (1994); see White (2000) and Hansen (2005) for details.] Using the empirical distribution of {T^{S∗}_b}_{b=1}^B, we first compute c^{S∗}_α(γ) such that

(1/B) Σ_{b=1}^{B} 1{T^{S∗}_b > c^{S∗}_α(γ)} = αγ.    (8)

Given c^{S∗}_α(γ), we can take c^{K∗}_α(γ) to be the (1 − α(1 − γ))-percentile of the bootstrap series T^{K∗}_b · 1{T^{S∗}_b ≤ c^{S∗}_α(γ)}, b = 1, ..., B.

The method of coupling hardly entails any additional computational cost. The computational cost in most cases arises when one computes d̂*(m) using the bootstrap samples, which is a step common to other bootstrap-based tests. Once d̂*(m) is computed, finding T^{K∗}_b and T^{S∗}_b and obtaining the bootstrap critical values is straightforward.


We define the p-value for the test as follows:

p = sup{α ∈ [0, 1] : T^S ≤ c^{S∗}_α(γ) and T^K ≤ c^{K∗}_α(γ)},    (9)

where c^{S∗}_α(γ) and c^{K∗}_α(γ) are the critical values defined in (8). The event that T^S ≤ c^{S∗}_α(γ) and T^K ≤ c^{K∗}_α(γ) arises if and only if the hybrid test does not reject the null hypothesis. In practice, one starts from α = 0 and increases α along a grid until T^S > c^{S∗}_α(γ) or T^K > c^{K∗}_α(γ). Since the bootstrap statistics T^{K∗}_b and T^{S∗}_b have already been computed, the grid search can be done very fast.
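A minimal sketch of this calibration in terms of the observed statistics and their bootstrap draws is given below; the array names are illustrative, and the bootstrap draws themselves would come from an existing bootstrap scheme such as the stationary bootstrap.

```python
import numpy as np

def hybrid_decision(t_s, t_k, t_s_boot, t_k_boot, alpha, gamma=0.5):
    """Level-alpha hybrid test: bootstrap analog of (7)-(8)."""
    # c_S*: (1 - alpha * gamma)-quantile of the bootstrap T^S* draws, as in (8)
    c_s = np.quantile(t_s_boot, 1 - alpha * gamma)
    # c_K*: (1 - alpha * (1 - gamma))-percentile of T^K* draws censored at {T^S* <= c_S*}
    c_k = np.quantile(t_k_boot * (t_s_boot <= c_s), 1 - alpha * (1 - gamma))
    return (t_s > c_s) or (t_s <= c_s and t_k > c_k)

def hybrid_pvalue(t_s, t_k, t_s_boot, t_k_boot, gamma=0.5, grid_size=1000):
    """Bootstrap p-value as in (9): grid search starting from alpha near 0."""
    for alpha in np.linspace(0.0, 1.0, grid_size + 1)[1:]:
        if hybrid_decision(t_s, t_k, t_s_boot, t_k_boot, alpha, gamma):
            return alpha          # smallest grid point at which the test rejects
    return 1.0
```

With γ = 1/2 the two quantile levels coincide at 1 − α/2, which is the version used in the simulations of Section 4.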

3.3 A Recursive Search for a Better Forecasting Method

When the search for a better forecasting method is an ongoing process, with candidate models continuing to expand at each search, it is convenient to have a search algorithm that properly takes account of the past searches. White (2000) proposed such an algorithm for practitioners. In this section, we similarly offer a method of recursive search based on the hybrid test.

Given bootstrap versions {d̂*_b(m)}_{b=1}^B, m = 1, ..., M, and a consistent asymptotic variance estimator ω²(m) such that √n{d̂(m) − d(m)}/ω(m) →_d N(0, 1), we define

d̄*_b(m) = d̂*_b(m) − d̂(m).    (10)

[One may construct ω²(m) using an HAC (heteroscedasticity and autocorrelation consistent) type estimator, as done in Hansen (2005), p. 372.] Now, the recursive search that this article suggests proceeds as follows:

Step 1. For model 1, compute d̂(1) = R̂(0) − R̂(1), its asymptotic variance estimator ω²(1), and the bootstrap version {d̂*_b(1)}_{b=1}^B. Set T^K_{1,+} = √n d̂(1)/ω(1), T^K_{1,−} = −√n d̂(1)/ω(1), T^S_1 = min{T^K_{1,+}, T^K_{1,−}}, and the bootstrap versions T^{K∗}_{1,+,b} = √n d̄*_b(1)/ω(1), T^{K∗}_{1,−,b} = −√n d̄*_b(1)/ω(1), and T^{S∗}_{1,b} = min{T^{K∗}_{1,+,b}, T^{K∗}_{1,−,b}}.

Step m. For model m, we compute d̂(m) = R̂(0) − R̂(m), its asymptotic variance estimator ω²(m), and the bootstrap version {d̂*_b(m)}_{b=1}^B. Set T^K_{m,+} = max{√n d̂(m)/ω(m), T^K_{m−1,+}}, T^K_{m,−} = max{−√n d̂(m)/ω(m), T^K_{m−1,−}}, T^S_m = min{T^K_{m,+}, T^K_{m,−}}, and the bootstrap versions

T^{K∗}_{m,+,b} = max{√n d̄*_b(m)/ω(m), T^{K∗}_{m−1,+,b}},
T^{K∗}_{m,−,b} = max{−√n d̄*_b(m)/ω(m), T^{K∗}_{m−1,−,b}}, and
T^{S∗}_{m,b} = min{T^{K∗}_{m,+,b}, T^{K∗}_{m,−,b}}.

At each Step m, the bootstrap p-value can be computed as in (9), where we replace T^{S∗}_b, T^{K∗}_b and T^S, T^K by T^{S∗}_{m,b}, T^{K∗}_{m,+,b} and T^S_m, T^K_{m,+}, respectively.

The recursive search at Step m carries along the history of the previous searches done using the same data. Similarly as in the spirit of White (2000), we emphasize that, as for the previous searches, the recursive search at any Step m requires only knowledge of T^K_{m−1,+}, T^K_{m−1,−}, {T^{K∗}_{m−1,+,b}}_{b=1}^B, and {T^{K∗}_{m−1,−,b}}_{b=1}^B. In other words, one does not need to know the entire history of searches and performances before one turns to the next search.
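A sketch of one update of this recursion follows. The only state carried over from the previous step consists of the four running maxima named below, and the p-value at each step would be obtained by plugging the returned quantities into (9). The function and variable names are illustrative.

```python
import numpy as np

def recursive_step(d_hat_m, omega_m, d_bar_star_m, n, state=None):
    """One step of the recursive search (Step m).

    d_hat_m, omega_m: estimate d-hat(m) and its standard error for the new model;
    d_bar_star_m: length-B array of recentered bootstrap draws d-bar*_b(m) from (10);
    state: dict with running maxima from Step m-1, or None at Step 1.
    """
    s = np.sqrt(n) * d_hat_m / omega_m
    s_star = np.sqrt(n) * d_bar_star_m / omega_m
    if state is None:                                   # Step 1
        state = {"TK+": s, "TK-": -s, "TK+*": s_star, "TK-*": -s_star}
    else:                                               # Step m: update running maxima only
        state = {"TK+": max(s, state["TK+"]),
                 "TK-": max(-s, state["TK-"]),
                 "TK+*": np.maximum(s_star, state["TK+*"]),
                 "TK-*": np.maximum(-s_star, state["TK-*"])}
    t_s = min(state["TK+"], state["TK-"])               # T^S_m
    t_s_star = np.minimum(state["TK+*"], state["TK-*"]) # T^S*_{m,b}
    # Feed (t_s, state["TK+"]) and (t_s_star, state["TK+*"]) into the p-value routine of (9).
    return state, t_s, state["TK+"], t_s_star, state["TK+*"]
```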

4. MONTE CARLO SIMULATIONS

4.1 Data-Generating Processes and Three Tests

The first part of the simulations focuses on the simulation design considered by Hansen (2005) and compares three types of tests: a test of White (2000) (Reality Check: RC), a test of Hansen (2005) (Superior Predictive Ability: SPA), and this article’s proposal (Hybrid Test: Hyb). The second part considers local alternatives that are different from those of Hansen (2005).

Suppose that Ŷ^{(m)}_{T+τ} is a τ-step-ahead forecast of Y_{T+τ} using the mth method. The relative performance is represented by L(Y_{T+τ}, Ŷ^{(m)}_{T+τ}) for some loss function L, and we simply write L_{m,T} = L(Y_{T+τ}, Ŷ^{(m)}_{T+τ}). Suppose that Ŷ^{(0)}_{T+τ} is a forecast from a benchmark method. The risk difference is given by d(m) = E[L_{0,T} − L_{m,T}].

In the simulation studies, we drew, for m = 1, 2, ..., M and t = 1, 2, ..., n,

L_{m,t} ∼ iid N(λ(m)/√n, σ²_m)

for constants λ(m) and σ²_m = (1/2) exp(arctan(λ(m))). We set λ(0) = 0. As for λ(m), we considered two different schemes: alternatives with local positivity, and alternatives with both local positivity and local negativity. These two schemes are specified in Sections 4.2.1 and 4.2.2 below.

First, consider two test statistics, one according to White (2000) and the other according to Hansen (2005):

T^{RC} = √n max_{m∈M} d̂(m) and T^{SPA} = √n max_{m∈M} d̂(m)/ω(m),    (11)

where ω²(m) is taken to be the sample variance of {L_{0,t} − L_{m,t}}_{t=1}^n.

To construct critical values, we generated the bootstrap versions {d̂*_b(m)}_{b=1}^B of d̂(m) by resampling from the observations {L_{0,t} − L_{m,t}}_{t=1}^n with replacement, and let d̄*_b(m) be as defined in (10). We constructed

Reality Check Test (RC): Reject H0 if T^{RC} > c^{RC∗}_α, and
Superior Predictive Ability Test (SPA): Reject H0 if T^{SPA} > c^{SPA∗}_α,

where c^{RC∗}_α is the (1 − α)-quantile of {T^{RC∗}_b}_{b=1}^B, with T^{RC∗}_b = √n max_{m∈M} d̄*_b(m), and c^{SPA∗}_α is the (1 − α)-quantile of {T^{SPA∗}_b}_{b=1}^B, with T^{SPA∗}_b = max_{m∈M} √n d̃*_b(m)/ω(m), where

d̃*_b(m) = d̂*_b(m) − d̂(m) × 1{√n d̂(m)/ω(m) ≥ −√(2 ln ln n)}.

Note that d̃*_b(m) involves recentering d̂*_b(m) selectively, depending on whether d̂(m) is close to the boundary of the inequalities or not. The selective recentering is done to improve the power of the test by weeding out the forecasting methods that perform badly.
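A sketch of the selective recentering and the resulting bootstrap SPA and RC statistics is given below; d_hat (length M), d_boot (a B × M array of raw bootstrap draws of the risk differences), and omega (length M) are illustrative array names.

```python
import numpy as np

def spa_bootstrap_stats(d_hat, d_boot, omega, n):
    """Bootstrap SPA statistics with Hansen's (2005) selective recentering."""
    threshold = -np.sqrt(2.0 * np.log(np.log(n)))
    keep = (np.sqrt(n) * d_hat / omega) >= threshold     # methods not weeded out
    d_tilde = d_boot - d_hat * keep                      # recenter only the kept methods
    return (np.sqrt(n) * d_tilde / omega).max(axis=1)    # T^SPA*_b, b = 1, ..., B

def rc_bootstrap_stats(d_hat, d_boot, n):
    """Bootstrap Reality Check statistics: always recentered, not studentized."""
    return (np.sqrt(n) * (d_boot - d_hat)).max(axis=1)   # T^RC*_b
```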

For the hybrid test, which is the main proposal of this article, first define the complementary test statistic:

T^S = √n min{ max_{m∈M} d̂(m)/ω(m), max_{m∈M} (−d̂(m)/ω(m)) }.


As for T^K, we take T^K = T^{SPA} defined in (11). As for critical values, construct

T^{S∗}_b = √n min{ max_{m∈M} d̄*_b(m)/ω(m), max_{m∈M} (−d̄*_b(m)/ω(m)) }.

Let c^{S∗}_α be the (1 − α/2)-quantile of {T^{S∗}_b}_{b=1}^B, and take c^{K∗}_α to be the (1 − α/2)-quantile of {T^{K∗}_b}_{b=1}^B, where T^{K∗}_b = T^{SPA∗}_b · 1{T^{S∗}_b ≤ c^{S∗}_α}. Choosing c^{S∗}_α and c^{K∗}_α in this way reflects the choice of γ = 1/2. Then, the hybrid test is defined as

Hybrid Test (Hyb): Reject H0 if T^S > c^{S∗}_α, or if T^S ≤ c^{S∗}_α and T^{SPA} > c^{K∗}_α.

In the simulation studies, the sample size n was 200, and the number of Monte Carlo replications and of bootstrap replications was 2000. The number M of candidate forecasting methods was chosen from {50, 100}.
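A sketch of one Monte Carlo draw under this design, producing the inputs d̂(m) and ω(m) used by the three tests, is given below; the vector of λ(m) values passed in is illustrative and would in practice come from DGP A or DGP B defined in the next subsection.

```python
import numpy as np

def simulate_losses(lam, n, rng):
    """Draw L_{m,t} ~ N(lambda(m)/sqrt(n), sigma_m^2) for m = 0, ..., M and t = 1, ..., n,
    with sigma_m^2 = (1/2) exp(arctan(lambda(m))) as in the design above."""
    sigma2 = 0.5 * np.exp(np.arctan(lam))
    return rng.normal(lam / np.sqrt(n), np.sqrt(sigma2), size=(n, lam.size))

def loss_differentials(losses):
    """d-hat(m) and omega(m) from the loss differences L_{0,t} - L_{m,t}, m = 1, ..., M."""
    diff = losses[:, [0]] - losses[:, 1:]               # shape (n, M)
    return diff.mean(axis=0), diff.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
n, M = 200, 50
lam = np.concatenate(([0.0], np.full(M, 2.0)))          # illustrative lambda; lambda(0) = 0
d_hat, omega = loss_differentials(simulate_losses(lam, n, rng))
```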

4.2 Alternative Hypotheses

4.2.1 Alternatives With Local Positivity. Following Hansen (2005), we first consider the following alternatives for λ(m):

DGP A: λ(m) = 0 if m = 0,
       λ(m) = λ(1) if m = 1,
       λ(m) = ρ · (m − 1)/(M − 2) if m = 2, ..., M,

where ρ and −λ(1) were chosen from {0, 1, 2, 3, 4}. The M − 1 methods with m = 2, ..., M are inferior to the benchmark method (m = 0). Their relative performance is ordered as M ≺ M − 1 ≺ ··· ≺ 2.

When λ(1) = 0, no alternative forecasting method strictly dominates the benchmark method, representing the null hypothesis. When λ(1) < λ(0) = 0, method 1 performs better than the benchmark method, representing the alternative hypothesis. The magnitude ρ controls the extent to which the inequalities λ(m) ≥ λ(0) = 0, m = 2, ..., M, lie away from binding. When ρ = 0, the remaining inequalities for methods 2 through M are binding, that is, d(m) = 0 for all m = 2, ..., M.

Tables 1 and 2 show the empirical size of the tests under DGP A. The results show that the test RC has a lower Type I error as the design parameter ρ increases. For example, when ρ = 2, the rejection probability of the test RC is 0.0005 when the nominal size is 5% and M = 50. This extremely conservative size of the test RC is significantly improved by the test SPA of Hansen (2005), which shows a Type I error of 0.0180. This improvement is made through two channels: the normalization of the test statistic by ω(m) and the trimming of poorly performing forecast methods. The hybrid approach shows a further improvement over the test SPA, yielding a Type I error of 0.0365 in this case.

Table 1. Empirical size of tests of predictive abilities under DGP A (M = 50, n = 200)

              α = 0.05                  α = 0.10
ρ   λ(1)   RC      SPA     Hyb       RC      SPA     Hyb
0   0      0.0520  0.0615  0.0620    0.1005  0.1180  0.1160
2   0      0.0005  0.0180  0.0365    0.0045  0.0330  0.0530
3   0      0.0005  0.0125  0.0230    0.0005  0.0210  0.0425
4   0      0.0000  0.0105  0.0220    0.0000  0.0230  0.0350

Table 2. Empirical size of tests of predictive abilities under DGP A (M = 100, n = 200)

              α = 0.05                  α = 0.10
ρ   λ(1)   RC      SPA     Hyb       RC      SPA     Hyb
0   0      0.0505  0.0710  0.0610    0.1060  0.1290  0.1210
2   0      0.0010  0.0125  0.0280    0.0040  0.0245  0.0485
3   0      0.0005  0.0080  0.0205    0.0015  0.0200  0.0355
4   0      0.0000  0.0120  0.0220    0.0015  0.0230  0.0315

Tables 3 and 4 show the power of the three tests. As for Hyb, the rejection probability is slightly lower than that of SPA in the case of ρ = 0. It is interesting to see that the rejection probability of Hyb is still better than that of RC when λ(1) = −2, −3 with ρ = 0. As the inequalities move farther away from binding while maintaining the violation of the null hypothesis (i.e., as ρ increases while λ(1) < 0), the performance of Hyb becomes prominently better than both RC and SPA.

To see how the power of Hyb can be better than that of SPA, recall that when SPA performs better than RC in finite samples, it is mainly because, in computing critical values, SPA weeds out candidates that perform poorly. Given the same sample size n, the proportion of candidates weeded out tends to become larger as ρ increases. This explains the better performance of SPA over RC in Tables 3 and 4. When n increases, so that √(2 ln ln n) increases slowly yet √n d̂(m)/ω(m) is stable for many m’s (as is the case with the simulation design with √n-converging Pitman local alternatives), the power improvement by SPA is attenuated because there are many methods that perform badly yet survive the truncation. In this situation, the power-improving effect of coupling by Hyb is still in force, because the power reduction due to many bad forecasting methods that survive the truncation in SPA continues to be counteracted by the complementary test coupled in Hyb.

Table 3. Empirical power of tests of predictive abilities under DGP A (M = 50, n = 200)

              α = 0.05                  α = 0.10
ρ   λ(1)   RC      SPA     Hyb       RC      SPA     Hyb
0   −2     0.1315  0.2995  0.2830    0.2180  0.4200  0.3950
    −3     0.4895  0.7785  0.7435    0.6210  0.8450  0.8265
2   −2     0.0115  0.2945  0.3770    0.0255  0.3810  0.4520
    −3     0.1135  0.7855  0.8395    0.2185  0.8490  0.8890
3   −2     0.0025  0.3055  0.4085    0.0095  0.3900  0.4690
    −3     0.0655  0.7745  0.8390    0.1360  0.8345  0.8770
4   −2     0.0030  0.3295  0.4125    0.0085  0.4185  0.4700
    −3     0.0475  0.8035  0.8625    0.0935  0.8655  0.8925

Table 4. Empirical power of tests of predictive abilities under DGP A (M = 100, n = 200)

              α = 0.05                  α = 0.10
ρ   λ(1)   RC      SPA     Hyb       RC      SPA     Hyb
0   −2     0.0945  0.2675  0.2475    0.1805  0.3765  0.3535
    −3     0.3945  0.7080  0.6755    0.5270  0.7885  0.7755
2   −2     0.0025  0.2315  0.3185    0.0115  0.3040  0.3830
    −3     0.0620  0.7040  0.7750    0.1230  0.7710  0.8220
3   −2     0.0000  0.2345  0.3245    0.0030  0.3110  0.3810
    −3     0.0350  0.7115  0.7865    0.0725  0.7770  0.8230
4   −2     0.0005  0.2385  0.3125    0.0035  0.3155  0.3615
    −3     0.0190  0.7220  0.7875    0.0435  0.7925  0.8250

4.2.2 Alternatives With Local Positivity and Local Negativity. The hybrid test was shown to perform well relative to the other two tests under DGP A. However, DGP A mainly focuses on alternatives such that RC tends to have weak power. In this section, we consider the following alternative scheme: for each m = 1, ..., M,

DGP B1: λ(m) = r × {Φ(−8m/M + 1/5) − 1/2},
DGP B2: λ(m) = r × {Φ(−8m/M + 2/5) − 1/2}, and
DGP B3: λ(m) = r × {Φ(−8m/M + 4/5) − 1/2},

where Φ is the standard normal distribution function and r is a positive constant running over an equally spaced grid in [0, 5]. This scheme is depicted in Figure 3. In DGP B1, only a small portion of the methods perform better than the benchmark method, and in DGP B3, a large portion of the methods perform better than the benchmark method. The general discussion of this article predicts that the hybrid test has relatively strong power against the alternatives under DGP B1, while it has relatively weak power against the alternatives under DGP B3.
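The three DGP B designs are direct transcriptions of the formulas above into code; the sketch below uses scipy's normal CDF for Φ, and the particular value of r shown is illustrative.

```python
import numpy as np
from scipy.stats import norm

def lambda_dgp_b(M, r, shift):
    """lambda(m) = r * (Phi(-8 m / M + shift) - 1/2); shift = 1/5, 2/5, 4/5 for DGP B1, B2, B3."""
    m = np.arange(1, M + 1)
    return r * (norm.cdf(-8.0 * m / M + shift) - 0.5)

lam_b1 = lambda_dgp_b(M=100, r=3.0, shift=1 / 5)   # r = 3.0 is illustrative; the article varies r over [0, 5]
```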

Figure 3. Three designs of λ(m) with M = 100. All three designs represent different types of alternative hypotheses. In DGP B1, the forecasting methods with m = 1 through m = 20 outperform the benchmark method, and in DGP B3, the forecasting methods with m = 1 through m = 80 outperform the benchmark method.

Figure 4. Finite sample rejection probabilities at nominal size 5% for three tests of predictive ability: the Reality Check (RC) test of White (2000), the Superior Predictive Ability (SPA) test of Hansen (2005), and the Hybrid (Hyb) test of this article, against the alternatives depicted in Figure 3. The result shows that Hyb performs conspicuously better than SPA and RC under DGP B1, which is generally associated with low power for the tests, and it performs slightly worse than SPA under DGP B3, which is generally associated with high power for the tests. Hence the result illustrates the robustified power behavior of the hybrid approach.

Only the results for the cases DGP B1 and DGP B3 are shown in Figure 4 to save space. Under DGP B1, Hyb is shown to outperform the other tests. However, it shows a slight reduction in power (relative to SPA) under DGP B3. This result suggests that, as far as the simulation designs considered here are concerned, the power gain from adopting the hybrid approach can be considerable under certain alternatives, while its cost as a reduction in power under the other alternatives is only marginal.

5. EMPIRICAL APPLICATION: REALITY CHECK REVISITED

5.1 Testing Framework and Data

The empirical section of White (2000) investigates the forecastability of excess returns using technical indicators. He demonstrated that unless the problem of data snooping is properly addressed, the best performing candidate forecasting method appears spuriously to perform better than the benchmark forecast based on a simple efficient market hypothesis. This section revisits his empirical study using recent S&P500 stock returns.

Similarly as in White (2000), this study considered 3654 forecasts using technical indicators and adopted the MSPE defined as follows: for m = 1, ..., 3654,

MSPE: R_MSPE(m) = E[(Y_{T+1} − Ŷ^{(m)}_{T+1})²],

where Y_T denotes the S&P500 return on day T, and Ŷ^{(m)}_{T+1} its one-day-ahead forecast by the mth method. (Given the stock price P_t at time t, the stock return is defined to be Y_t = (P_t − P_{t−1})/P_{t−1}.) See White (2000) for details about the construction of the forecasts.

S&P500 Stock Index closing prices were obtained from the Wharton Research Data Services (WRDS). The stock index returns data range from March 28, 2003, to July 1, 2008. For each forecast method, we obtain 187 one-day-ahead forecasts from October 4, 2007, to July 1, 2008. The data used for the estimation of the forecast models begin with the stock index return on March 28, 2003, and the sample size for estimation is 1138. Hence the sample size for estimation is much larger than the sample used to produce forecasts, and it is expected that the normalized sum of the forecast error differences will be approximately normally distributed [see Clark and McCracken (2001) for details].

Table 5. Bootstrap p-values for testing predictive ability in terms of MSPE. The number q represents the tuning parameter that determines the random block sizes in the stationary bootstrap. See Politis and Romano (1994) and White (2000) for details.

                                   q      RC      SPA     Hyb
Data snooping taken into account   0.10   0.1910  0.1598  0.0670
                                   0.25   0.2916  0.2736  0.0980
                                   0.50   0.3378  0.3206  0.1250
Data snooping ignored              0.10   0.0094  0.0094  0.0200
                                   0.25   0.0188  0.0188  0.0390
                                   0.50   0.0384  0.0384  0.0780

5.2 Results

The results are shown in Table 5. The number q in the table represents the tuning parameter that determines the random block sizes in the stationary bootstrap of Politis and Romano (1994) [see White (2000) for details]. Note that in White (2000), q = 0.5 was used.

Table 5 presents p-values for the tests with data snooping taken into account, and for the tests with data snooping ignored. When data snooping is ignored, all the tests spuriously reject the null hypothesis at 10%. (Note that the results of RC and SPA are identical, because there is only one candidate forecast when data snooping is ignored, and this forecast is not trimmed out by the truncation involved in SPA.) This attests to one of the main messages of White (2000): without proper consideration of data snooping, the best performing candidate forecasting method will appear to outperform the benchmark method by a wide margin.

Interestingly, the p-values from Hyb are conspicuously lower than those obtained from RC and SPA. For example, when q = 0.10, the p-values for RC and SPA are 0.1910 and 0.1598, respectively, but the p-value for Hyb is 0.0670. Hence the null hypothesis is rejected by Hyb but not by RC and SPA at 10% in this case. This illustrates the distinctive power behavior of Hyb. Note that Hyb can outperform RC even when SPA does not outperform it. Such a case may arise when, for all m, √n d̂(m)/ω(m) is greater than −√(2 ln ln n) with large probability, but for some m, √n d̂(m)/ω(m) still tends to take a fairly negative value relative to the critical value c_α of RC.

For example, see Figure 5, which plots the normalized MSPE differences, that is, √n d̂(m)/ω(m), for m = 1, ..., 3654. It is interesting to see that no forecasting method is truncated by the truncation scheme of SPA. This is indicated by the fact that all the normalized forecast error differences lie above the threshold value (the lower dashed line). This perhaps explains the similar p-values for RC and SPA. The difference between the results from RC and SPA is solely due to the fact that SPA involves normalization by ω(m) while RC does not. Even in this case, Hyb continues to counteract this power-reducing effect by coupling with the complementary test statistic.

Figure 5. Normalized estimated mean forecast error differences in terms of MSPE [i.e., √n d̂(m)/ω(m)]. The threshold is −√(2 ln ln n). The x-axis is the forecasting method index m running from 1 to 3654. The forecast error differences lie above the threshold value (represented by the lower dashed line). This means that no forecast method was truncated by SPA in this case. Therefore, the difference between RC and SPA is solely due to the normalization by ω(m) in SPA. Even in this case, Hyb can still improve the power of the sup test through the power-enhancing effect from the complementary test.

6. CLOSING REMARKS

This article has shown that the one-sided sup tests of predictive ability can be severely asymptotically biased in a general setup. To alleviate this problem, this article proposes the approach of hybrid tests, where we couple the one-sided sup test with a symmetrized complementary test. Through simulations, it is shown that this approach yields a test with robust power behavior. The hybrid approach can be applied to numerous other tests of inequalities beyond predictive ability tests. The question of which modification or extension is suitable often depends on the context of application.

ACKNOWLEDGMENT

I thank Werner Ploberger and Frank Schorfheide for their valuable comments and advice. Part of this article began with a draft titled “Testing Distributional Inequalities and Asymptotic Bias.” The comments of a Co-Editor, an Associate Editor, and a referee were valuable and helped improve the article substantially. I thank them for their comments.

[Received November 2009. Revised October 2011.]


REFERENCES

Amisano, G., and Giacomini, R. (2007), “Comparing Density Forecasts via Weighted Likelihood Ratio Tests,” Journal of Business and Economic Statistics, 25, 177–190.

Andrews, D. W. K. (2011), “Similar-on-the-Boundary Tests for Moment Inequalities Exist but Have Poor Power,” CFDP 1815. http://cowles.econ.yale.edu/P/cd/d18a/d1815.pdf

Andrews, D. W. K., and Shi, X. (2011), “Inference Based on Conditional Moment Inequalities,” CFDP 1761R. http://cowles.econ.yale.edu/P/cd/d17b/d1761-r.pdf

Andrews, D. W. K., and Soares, G. (2010), “Inference for Parameters Defined by Moment Inequalities Using Generalized Moment Selection,” Econometrica, 78, 119–157.

Bao, Y., Lee, T.-H., and Saltoğlu, B. (2007), “Comparing Density Forecast Models,” Journal of Forecasting, 26, 203–225.

Bugni, F. (2010), “Bootstrap Inference in Partially Identified Models Defined by Moment Inequalities: Coverage of the Identified Set,” Econometrica, 78, 735–753.

Canay, I. A. (2010), “EL Inference for Partially Identified Models: Large Deviations Optimality and Bootstrap Validity,” Journal of Econometrics, 156, 408–425.

Christoffersen, P. F. (1998), “Evaluating Interval Forecasts,” International Economic Review, 39, 841–861.

Clark, T. E., and McCracken, M. W. (2001), “Tests of Equal Forecast Accuracy and Encompassing for Nested Models,” Journal of Econometrics, 105, 85–110.

Diebold, F. X., Gunther, T. A., and Tay, A. S. (1998), “Evaluating Density Forecasts With Applications to Financial Risk Management,” International Economic Review, 39, 863–883.

Diebold, F. X., Hahn, J., and Tay, A. S. (1999), “Multivariate Density Forecast Evaluation and Calibration in Financial Risk Management: High-Frequency Returns on Foreign Exchange,” Review of Economics and Statistics, 81, 661–673.

Diebold, F. X., and Mariano, R. S. (1995), “Comparing Predictive Accuracy,” Journal of Business and Economic Statistics, 13, 253–263.

Giacomini, R., and White, H. (2006), “Tests of Conditional Predictive Ability,” Econometrica, 74, 1545–1578.

Hansen, P. R. (2005), “Testing Superior Predictive Ability,” Journal of Business and Economic Statistics, 23, 365–379.

Lee, S., Song, K., and Whang, Y.-J. (2011), “Testing Functional Inequalities,” Cemmap Working Paper CWP12/11, Department of Economics, University College London.

Linton, O., Maasoumi, E., and Whang, Y.-J. (2005), “Consistent Testing for Stochastic Dominance Under General Sampling Schemes,” Review of Economic Studies, 72, 735–765.

Linton, O., Song, K., and Whang, Y.-J. (2010), “An Improved Bootstrap Test of Stochastic Dominance,” Journal of Econometrics, 154, 186–202.

Politis, D. N., and Romano, J. P. (1994), “The Stationary Bootstrap,” Journal of the American Statistical Association, 89, 1303–1313.

Stinchcombe, M. B., and White, H. (1998), “Consistent Specification Testing With Nuisance Parameters Present Only Under the Alternative,” Econometric Theory, 14, 295–325.

Sullivan, R., Timmermann, A., and White, H. (1999), “Data Snooping, Technical Trading Rule Performance, and the Bootstrap,” Journal of Finance, 54, 1647–1692.

West, K. D. (1996), “Asymptotic Inference About Predictive Ability,” Econometrica, 64, 1067–1084.

West, K. D. (2006), “Forecast Evaluation,” in Handbook of Economic Forecasting, Chapter 3, eds. G. Elliott, C. W. J. Granger, and A. Timmermann, North-Holland.

White, H. (2000), “A Reality Check for Data Snooping,” Econometrica, 68, 1097–1126.