Forecasting Binary Outcomes∗
Kajal Lahiri†and Liu Yang‡
Department of Economics, University at Albany, SUNY
NY 12222, USA
Forthcoming in Handbook of Economic Forecasting, Vol. 2 (Eds. G. Elliott and A. Timmermann)
Abstract
Binary events are involved in many economic decision problems. In recent years, consid-
erable progress has been made in diverse disciplines in developing models for forecasting
binary outcomes. We distinguish between two types of forecasts for binary events that
are generally obtained as the output of regression models: probability forecasts and point
forecasts. We summarize specification, estimation, and evaluation of binary response
models for the purpose of forecasting in a unified framework which is characterized by
the joint distribution of forecasts and actuals, and a general loss function. Analysis of
both the skill and the value of probability and point forecasts can be carried out within
this framework. Parametric, semiparametric, nonparametric, and Bayesian approaches
are covered. The emphasis is on the basic intuitions underlying each methodology, ab-
stracting away from the mathematical details.
JEL Classifications: C40, C50, C53, C80
Key words: Probability Prediction, Point Prediction, Skill, Value, Joint Distribution,
Loss Function.
∗We are indebted to the editors, two anonymous referees and the participants of the ‘Handbook’ Conference at St. Louis Fed for their constructive comments on an earlier version of this chapter. We are also grateful to Antony Davies, Arturo Estrella, Terry Kinal, Massimiliano Marcellino, and Yongchen Zhao for their help. Much of the revision of this chapter was completed when Kajal Lahiri was visiting the European University Institute as a Fernand Braudel Senior Fellow during 2012. The responsibility for all remaining errors and omissions is ours.
†Corresponding author. Tel.: +1 518 442 4758. E-mail address: [email protected].
‡Tel.: +1 518 779 3190. E-mail address: [email protected].
1 Introduction
The need for accurate prediction of events with binary outcomes, like loan defaults, occurrence of
recessions, or the passage of specific legislation, arises often in economics and numerous other areas of
decision making. For example, a firm may base its production decisions on macroeconomic prospects;
a bank manager may decide whether to extend a loan to an individual depending on the risk of default;
and the propensity of a worker to apply for disability benefits is partially determined by the probability
of being approved.
How should one characterize a good forecast in these situations? Take the loan offer as an ex-
ample: a skilled bank manager with professional experience, after observing all relevant personal
characteristics of the applicant, is probably able to guess the odds that an applicant will default. How-
ever, this ability does not necessarily translate into a good decision because the ultimate payoff also
depends on the accurate assessment of the cost and benefit associated with a decision. The cost of
an incorrect approval of the loan can be larger than that of an incorrect denial, so that an optimal
decision will depend on how large this cost differential is. A manager, who may otherwise be a skillful
forecaster, is unable to make an optimal decision unless he is aware of the costs and benefits associated
with each of the binary outcomes. The value of a forecast can only be evaluated in a decision making
context.
It is useful to distinguish between two types of forecasts for binary outcomes: probability fore-
casts and point forecasts. The former is a member of the broader category of density forecasts, since
knowing the probability of a binary event is equivalent to knowing the entire density for the binary
variable. Growing interest in probability forecasts has mainly been dictated by the desire of the profes-
sional forecasting community to quantify forecast uncertainty, which is often ignored in making point
forecasts. After all, a primary purpose of forecasting is to reduce uncertainty. In practice, a set of co-
variates is available for predicting the binary outcome under consideration. In this setting, probability
forecasts only describe the objective statistical properties of the joint distribution between the event
and covariates, and thus can be analyzed first without considering forecast value. By contrast, a
binary point forecast, always being either 0 or 1, cannot logically be issued in isolation from the loss
function implicit in the underlying decision making problem. In this sense, probability forecasts are
more fundamental in nature. Because a point forecast is a mixture of the objective joint distribution
between the event and the covariates, and the loss function, we will defer an in-depth discussion of
binary point forecasts until some important concepts regarding forecast value have been introduced.
Given the importance of density and point forecasts for other types of target variables such as GDP
growth and inflation rates, one may wonder what feature of a binary outcome necessitates a separate
analysis and evaluation of its forecasts. It is the discrete support space of the dependent variable that
makes forecasting binary outcomes distinctive, and this restriction should be taken into account in the
specification, estimation, and evaluation exercises. For probability forecasts, any hypothesized model
ignoring this feature may lead to serious bias in forecasts. This, however, is not necessarily the case
in making binary point forecasts where the working model may violate this restriction, cf. Elliott and
Lieli (2010). Due to the nature of a binary event, its joint distribution and loss function are of special
forms, which can be used to design a wide array of tools for forecast evaluation and combination. For
most of these procedures, it is hard to find comparable counterparts in forecasting other types of target
variables.
This chapter summarizes a substantial body of literature on forecasting binary outcomes in a uni-
fied framework that has been developed in a number of disciplines such as biostatistics, computer
science, econometrics, mathematics, medical imaging, meteorology, and psychology. We cover only
those models and techniques that are common across these disciplines, with a focus on their appli-
cations in economic forecasting. Nevertheless, we give references to some of the methods excluded
from this analysis.
The outline of this chapter is as follows. In Section 2, we present methods for forecasting binary
outcomes that have been developed primarily by econometricians in the framework of binary regres-
sions. Section 3 is concerned with the evaluation methodologies for assessing binary forecast skill and
forecast value, most of which have been developed in meteorology and psychology. Section 4 is built
upon the previous two sections; it consists of models especially designed for binary point predictions.
We discuss two alternative methodologies to improve binary forecasts in Section 5. Section 6 closes
this chapter by underscoring the unified framework at the core of the literature, which provides
coherence to the diversity of issues and generic solutions.
2 Probability Predictions
This section addresses the issue of modeling the conditional probability of a binary event given an
information set available at the time of prediction. It is a special form of density prediction since, for a
Bernoulli distribution, knowing the conditional probability is equivalent to knowing the density. Four
classical binary response models developed in econometrics along with an empirical illustration will
come first, followed by generalizations to panel data forecasting. Sometimes, forecasts are not derived
from any estimated econometric model, but are completely subjective or judgemental. These will be
introduced briefly in Section 2.2.
2.1 Model-based probability predictions
For the purpose of probability predictions, the forecaster often has an information set (denoted by Ω)
that includes all variables relevant to the occurrence of a binary event. Incorporation of a particular
variable into Ω is justified either by economic theory or by the variable’s historical forecasting per-
formance. Suppose the dependent variable Y equals 1 when the target event occurs and 0 otherwise.
The question to be answered in this section is how to model the conditional probability of Y = 1 given
Ω, viz., P(Y = 1|Ω). The formulation of binary probability prediction in this manner is sufficiently
general to nest nearly all specific models that follow. For instance, if Ω contains lagged dependent
variables, then we have a dynamic model commonly used in macroeconomic forecasting. When it
comes to the functional form of the conditional probability, we can identify three broad approaches:
(i) a parametric model, which imposes a very strong assumption on P(Y = 1|Ω) so that the only unknown is a
finite-dimensional parameter vector; (ii) a nonparametric model, which does not constrain P(Y = 1|Ω)
beyond certain regular properties such as smoothness; and (iii) a semiparametric model which lies
between these two extremes in that it does restrict some elements of P(Y = 1|Ω), and yet allows flex-
ible specification of other elements. If Ω contains prior knowledge on the parameters, P(Y = 1|Ω)
is a Bayesian model that integrates the prior with sample information to yield the posterior predic-
tive probability. Before examining each specific model in detail, we will offer motivations as to why
special care must be taken when the dependent variable is binary.
For modeling a binary event, a natural question is whether we can treat it as an ordinary dependent
variable and assume a linear structure for P(Y = 1|Ω). In a linear probability model, for example, the
conditional probability of Y = 1 depends on a k-dimensional vector X in a linear way, that is,
P(Y = 1|Ω) = Xβ (1)
where Ω = X and β is a parameter vector conforming in dimension with X . However, this model
may not be suitable for the binary response case. As noted by Maddala (1983), for some range of
covariates X , Xβ may fall outside of [0,1]. This is not permissible given that conditional probability
must be a number between zero and one. Consequently, discreteness of binary dependent variables
calls for nonlinear econometric models, and the selected specification must tackle this issue properly.
The common approach to overcome the drawback associated with the linear model involves a
nonlinear link function taking values within [0,1]. One well-known example is the cumulative distri-
bution function for any random variable. Often, restrictions on P(Y = 1|Ω) are imposed within the
framework of the following latent dependent variable form (with Ω = X):
Y∗ = G(X) + ε,   ε distributed as F(·),
Y = 1 if Y∗ > 0, otherwise Y = 0.  (2)
Here, Y ∗ is a hypothesized latent variable with conditional expectation G(X), called the index function.
ε is a random error with cumulative distribution function F(·) and is independent of X . The observed
binary variable Y is generated according to (2). By design, the conditional probability of Y = 1 given
X must be a number between zero and one, as shown below:
E(Y|X) = P(Y = 1|X) = P(Y∗ > 0|X)
= P(ε > −G(X)|X)
= 1 − F(−G(X)).  (3)
Regardless of X, F(−G(X)) always lies inside [0,1], and so does the conditional expectation itself. In a
parametric model, the functional form of F(·) is known whereas the index G(·) is specified up to a
finite dimensional parameter vector β, that is, G(·) = G0(·,β) and the functional form of G0(·, ·) is
known. As mentioned earlier, a nonparametric model does not impose stringent restrictions on the
functional form of F(·) and G(·) besides some regular smoothness conditions. If either F(·) or G(·)
is flexible but the other is subject to specification, a semiparametric model results.
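As a quick illustration, the latent-variable mechanism in (2)-(3) can be simulated directly. In the sketch below, the linear index, coefficient values, error distribution (logistic), and sample size are all our own illustrative choices, not taken from the chapter.

```python
import numpy as np

# A minimal simulation of the latent-variable formulation (2)-(3),
# assuming a linear index G(X) = beta0 + beta1*X and logistic errors.
# All numeric choices here are illustrative.
rng = np.random.default_rng(0)

n = 100_000
beta0, beta1 = -0.5, 1.0
x = rng.normal(size=n)

eps = rng.logistic(size=n)            # error with CDF F
y_star = beta0 + beta1 * x + eps      # latent variable Y*
y = (y_star > 0).astype(int)          # observed binary outcome

# Implied probability: P(Y=1|X) = 1 - F(-G(X)), which equals F(G(X))
# here because the logistic density is symmetric around zero.
p_true = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Empirical frequency near x = 0 should match the implied probability.
mask = np.abs(x) < 0.05
```

By construction, the simulated frequencies of Y = 1 track the implied conditional probabilities, which always lie in [0,1].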
2.1.1 Parametric approach
Two primary parametric binary response models assume the index function to be linear, that is,
G0(X, β) = Xβ. If F is the distribution function of a standard normal variate, that is,

F(u) = ∫_{−∞}^{u} (1/√(2π)) e^{−t²/2} dt,  (4)
then we have the probit model. Alternatively, if F is the logistic distribution function, that is,

F(u) = e^u/(1 + e^u),  (5)
we have the logit model.
These are two popular parametric binary response models in econometrics. By symmetry of their
density functions around zero, conditional probability of Y = 1 reduces to the simple form F(Xβ).
Note that the index function does not have to be linear; it could be any nonlinear function of β. In
addition, the link function F(·) need not be (4) or (5); it could be any other distribution function. One
of the possibilities is the extreme value distribution:

F(u) = exp(−e^{−u}).  (6)
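For concreteness, the three links (4)-(6) can be coded directly; the function names below are our own.

```python
import numpy as np
from scipy.stats import norm

# The three link functions (4)-(6); function names are our own.
def probit_link(u):
    return norm.cdf(u)                    # (4): standard normal CDF

def logit_link(u):
    return np.exp(u) / (1.0 + np.exp(u))  # (5): logistic CDF

def extreme_value_link(u):
    return np.exp(-np.exp(-u))            # (6): type-I extreme value CDF

# Each maps the whole real line into [0, 1], as a link must.
u = np.linspace(-5.0, 5.0, 101)
for F in (probit_link, logit_link, extreme_value_link):
    p = F(u)
    assert np.all((p >= 0.0) & (p <= 1.0))
    assert np.all(np.diff(p) > 0)         # strictly increasing
```

Whatever the link, the fitted conditional probability is guaranteed to be a valid probability, which is precisely what the linear model (1) fails to deliver.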
Nevertheless, the key point in parametric models is that the functional forms for the link and index,
irrespective of how complex they are, should be specified up to a finite dimensional parameter vector.
Koenker and Yoon (2009) introduced two wider classes of parametric link functions for binary
response models: the Gosset link based on the Student t-distribution for ε, and the Pregibon link
based on the generalized Tukey λ family. The probit and logit links are nested within Gosset and
Pregibon classes, respectively. For example, when the degrees of freedom of the Student t-distribution
are large, it can be very close to the standard normal distribution. For the generalized Tukey λ link with
two parameters controlling tail behavior and skewness, the logit link is obtained by setting these two
parameters to zero. Based on these observations, Koenker and Yoon (2009) compared and contrasted
the Bayesian and asymptotic chi-squared tests for the suitability of probit or logit link within these
more general families. One primary objective of their paper was to correct the misperception that all
links are essentially indistinguishable. They argued that the misspecification of the link function may
lead to a severe estimation bias, even when the index is correctly specified. The binary response model
with Gosset or Pregibon as link offers a relatively simple compromise between the conventional probit
or logit specification and the semiparametric counterpart to be introduced in Section 2.1.3.
Train (2003) discussed various identification issues in parametric binary response models. For the
purpose of prediction, we care about the predicted probabilities instead of parameters, implying that
we have no preference over two models generating identical predicted probabilities, even though one
of them is not fully identified. For this reason, identification is often not an issue, and unidentified or
partially identified models may be valuable in forecasting.
Once the parametric model is specified and identification conditions are recognized, the remaining
job is to estimate β, given a sample. Amongst a number of methods, maximum likelihood (ML) yields
an asymptotically efficient estimator, provided the model is correctly specified. Suppose the index is
linear. The logarithm of the conditional likelihood function, given a sample {Yt, Xt} with t = 1, ..., T, is

l(β|{Yt, Xt}) ≡ Σ_{t=1}^{T} [Yt ln F(Xtβ) + (1 − Yt) ln(1 − F(Xtβ))],  (7)
and ML maximizes (7) over the parameter space. Amemiya (1985) derived consistency and asymptotic
normality of the maximum likelihood estimator for this model, and established the global concavity
of the likelihood function in the logit and probit cases. This means that the Newton-Raphson iterative
procedure will converge to the unique maximizer of (7), no matter what the starting values are. For de-
tails regarding the iterative procedure to calculate ML estimator in these models, see Amemiya (1985).
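To make the estimation step concrete, here is a minimal Newton-Raphson maximizer of (7) for the logit case, exploiting the global concavity noted above. Variable names, the simulated data, and the convergence tolerances are our own choices.

```python
import numpy as np

# A minimal Newton-Raphson maximizer of the logit log-likelihood (7).
# Global concavity means any starting value works; names are ours.
def logit_ml(y, X, tol=1e-10, max_iter=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # fitted probabilities
        grad = X.T @ (y - p)                          # score of (7)
        hess = -(X * (p * (1 - p))[:, None]).T @ X    # negative definite Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta - step                            # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated check that the estimator recovers the true coefficients
rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.3, 0.8])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)
beta_hat = logit_ml(y, X)
```

In practice one would rely on a canned routine, but the sketch shows why the iterations are well behaved regardless of starting values.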
Statistical inference on the parameters, predicted probabilities, marginal effects, and interaction effects
can be conducted in a straightforward way, provided the sample is independently and identically dis-
tributed (i.i.d.) or stationary and ergodic (in addition to satisfying certain moment conditions). These,
however, may not always hold. Park and Phillips (2000) developed the limiting distribution theory of
ML estimator in parametric binary choice models with nonstationary integrated explanatory variables,
which was extended further to multinomial responses by Hu and Phillips (2004a,b).
In dynamic binary response models, the information set Ω may include unobserved variables.
Chauvet and Potter (2005) incorporated the lagged latent variable, together with exogenous regressors,
in Ω. A practical difficulty with these models is that the likelihood function involves an intractable
multiple integral over the latent variable. One way to circumvent this problem is to use a Bayesian
computational technique based on a Markov chain Monte Carlo algorithm. See the technical appendix
in Monokroussos (2011) for implementation details. Kauppi and Saikkonen (2008) examined the
predictive performance of various dynamic probit models in which the lagged indicator of economic
recession, or the conditional mean of the latent variable, is used to forecast recessions. Their dynamic
formulations are much easier to implement by applying standard numerical methods, and iterated
multi-period forecasts can be generated. For a general treatment of multiple forecasts over multiple
horizons in dynamic models, see Terasvirta et al. (2010), where four iterative procedures are outlined
and assessed in terms of their forecast accuracy. Hao and Ng (2011) evaluated the predictive ability
of four probit model specifications proposed by Kauppi and Saikkonen (2008) to forecast Canadian
recessions, and found that dynamic models with the actual recession indicator as an explanatory variable
were better in predicting the duration of recessions, whereas the addition of the lagged latent variable
helped in forecasting the peaks of business cycles.
In macroeconomic and financial time series, the probability law underlying the whole sequence of
0’s and 1’s is often not fixed, but characterized by long repetitive cycles with different periodicities.
Exogenous shocks and sudden policy changes can lead to a sudden or gradual change in regime. If
the model ignores this possibility, chances are high that the resulting forecasts will be off the mark.
Hamilton (1989, 1990) developed a flexible Markov switching model to analyse a time series subject
to changes in regime, where an underlying unobserved binary state variable st governed the behaviour
of observed time series Yt . The change of regime in Yt is simply due to the change of st from one
state to the other. It is called Markov regime-switching model because the probability law of st is
hypothesized to be a discrete time two-state Markov chain. The advantage of this model is that it
does not require prior knowledge of regime separation at each time. Instead, such information can be
inferred from observed data Yt . For this reason, one can take advantage of this model to get predicted
probability of a binary state even if it cannot be observed directly. For a comprehensive survey of
this model, see Hamilton (1993, 1994). Lahiri and Wang (1994) utilized this model for estimating
recession probabilities using the index of leading indicators (LEI), circumventing the use of ad hoc
filter rules such as three consecutive declines in LEI as the recession predictor.
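The filtered state probabilities at the heart of such applications come from a simple recursion. The sketch below implements a stylized two-state filter with Gaussian state-dependent densities; all parameter values are hypothetical (in practice they are estimated by maximum likelihood), and it illustrates only the filtering recursion, not Hamilton's full estimation and smoothing algorithm.

```python
import numpy as np

# A stylized two-state filter: given transition probabilities and
# state-dependent Gaussian densities for Y_t, recursively compute the
# filtered probability P(s_t = 1 | Y_1, ..., Y_t). Parameter values
# are hypothetical; names are ours.
def filter_state_probs(y, p00, p11, mu, sigma):
    P = np.array([[p00, 1 - p00], [1 - p11, p11]])    # row: from, col: to
    xi = np.array([1 - p11, 1 - p00])                 # ergodic distribution
    xi = xi / xi.sum()
    out = np.empty(len(y))
    for t, yt in enumerate(y):
        pred = P.T @ xi                               # one-step-ahead state probs
        dens = np.exp(-0.5 * ((yt - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        joint = pred * dens
        xi = joint / joint.sum()                      # Bayes update
        out[t] = xi[1]
    return out

# Regime 0: mean 1 ("expansion"); regime 1: mean -1 ("recession")
rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0]); sigma = np.array([0.5, 0.5])
states = np.array([0] * 50 + [1] * 50)
y = mu[states] + sigma[states] * rng.normal(size=100)
prob_recession = filter_state_probs(y, p00=0.95, p11=0.95, mu=mu, sigma=sigma)
```

The output is a time series of predicted probabilities for the unobserved binary state, which is exactly what the regime-switching approach delivers without requiring prior knowledge of the regime dates.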
Unlike benchmark probit and logit models, a number of parametric binary response models may
be derived from other target objects. The autoregressive conditional hazard (ACH) model in Hamil-
ton and Jorda (2002) serves as a good example. The original target to be predicted is the length of
time between events, such as the duration between two successive changes of the federal funds rate
in the United States. For this purpose, Engle (2000) and Engle and Russell (1997, 1998) developed
an autoregressive conditional duration (ACD) model where the conditional expectation of the present
duration was specified to be a linear function of past observed durations and their conditional expec-
tations. Hamilton and Jorda (2002) considered the hazard rate defined as the conditional probability
of a change in the federal funds rate, given the latest information Ω. The ACH model is implied by
the ACD model since the expected duration between two successive changes is the inverse of the haz-
ard rate. They also generalized this simple specification by adding a vector of exogenous variables
to represent new information relevant for predicting the probability of the next target change. The
discreteness of observed target rate changes along with potential dynamic structure are dealt with si-
multaneously in this framework. See Grammig and Kehrle (2008), Scotti (2011), and Kauppi (2012)
for further applications and extensions.
Instead of predicting a single binary event, it is often useful to forecast multiple binary responses
jointly. For instance, we may like to predict the direction-of-change in several financial markets at a
future date given current information. A special issue arises in this context as these multiple binary
dependent variables may be intercorrelated, even after controlling for all independent variables. One
way to model this contemporaneous correlation is based on copulas, which decomposes the joint
modeling approach into two separate steps. The power of a copula is that for multivariate distributions,
the univariate marginals and the dependence structure can be isolated, and all dependence information
is contained in the copula. While modeling the marginal, one can proceed as if the current binary
event is the only concern, which means that all previously discussed methodologies including dynamic
models can be directly applied. After this step, we may consider modeling the dependence structure
by using a copula.1 Patton (2006) and Scotti (2011) used this approach in forecasting. Anatolyev
(2009) suggested a more interpretable measure, called dependence ratios, for the purpose of directional
forecasts in a number of financial markets. Both marginal Bernoulli distributions and dependence
ratios are parameterized as functions of the direction of past changes. By exploiting the information
contained in this contemporaneous dependence structure, it is expected that this multivariate model
will produce higher quality out-of-sample directional forecasts than its univariate counterparts.
Cramer (1999) considered the predictive performance of the logit model in unbalanced samples
in which one event is more prevalent than the other. Denote the in-sample estimated probabilities of
Yt = 1 and Yt = 0 by Pt and 1−Pt , respectively. By the property of logit models, the sample average
of Pt always equals the in-sample proportion of Yt = 1, which is denoted by α. Cramer proved that
the average of Pt over the subsample of Yt = 1 cannot be less than the average of 1−Pt over the
subsample of Yt = 0, if α ≥ 0.5. Thus, in unbalanced samples, the average predicted probability of
Yt = 1 when Yt = 1 is greater than or equal to the average predicted probability of Yt = 0 when Yt = 0.
1 In the binary case, the copula is characterized by a few parameters and thus is simple to model; see Tajar et al. (2001).
As a result, Cramer pointed out that estimated probabilities are a poor measure of in-sample predictive
performance. Using estimated probabilities leads to the absurd conclusion that success is predicted
more accurately than failure even though the two outcomes are complementary.
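Cramer's point is easy to reproduce by simulation. Below we fit a logit by Newton-Raphson on an artificially unbalanced sample (all numeric choices are our own) and compare the average fitted probabilities over the two outcome subsamples.

```python
import numpy as np

# Simulation illustrating Cramer's (1999) inequality in an unbalanced
# sample: when alpha >= 0.5, the average fitted P(Y=1) over the Y=1
# subsample is at least the average fitted P(Y=0) over the Y=0 subsample.
rng = np.random.default_rng(3)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# intercept chosen so that Y = 1 is the prevalent outcome (alpha > 0.5)
p = 1.0 / (1.0 + np.exp(-(1.5 + 0.5 * X[:, 1])))
y = (rng.random(n) < p).astype(int)

beta = np.zeros(2)
for _ in range(50):                       # Newton-Raphson on the logit likelihood
    fit = 1.0 / (1.0 + np.exp(-X @ beta))
    W = fit * (1.0 - fit)
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - fit))

fit = 1.0 / (1.0 + np.exp(-X @ beta))
alpha = y.mean()                          # in-sample proportion of ones
avg_success = fit[y == 1].mean()          # avg fitted P(Y=1) when Y = 1
avg_failure = (1.0 - fit)[y == 0].mean()  # avg fitted P(Y=0) when Y = 0
```

The simulation also confirms the logit property used in the argument: the sample average of the fitted probabilities equals the in-sample proportion of ones.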
King and Zeng (2001) investigated the use of a logit model in situations where the event of interest
is rare. With the typical sample proportion of the event less than 5%, they showed that the logit model
performs well asymptotically provided it is correctly specified. However, in small samples, the logit
estimator is biased. In these cases, efficient competing estimators with smaller mean squared errors do
exist. This point has been noticed by statisticians but has not attracted much attention in the applied
literature, see Bull et al. (1997).
The estimated asymptotic covariance matrix of the logit estimators is the inverse of the estimated
information matrix, that is,
V(β̂) = [Σ_{t=1}^{T} P̂t(1 − P̂t) x′t xt]^{−1},  (8)

where β̂ is the logit ML estimator, and P̂t is the fitted conditional probability for observation t, which
is 1/(1 + e^{−xtβ̂}). King and Zeng (2001) pointed out that in logit models, P̂t for the subsample for
which the rare event occurred would usually be large and close to 0.5. This is because probabilities
reported in studies of rare events are generally very small compared to those in balanced samples.
Consequently, the contribution of this value to the information matrix would also be relatively large.
This argument implies that for rare event data, observations with Y = 1 have more information content
than those with Y = 0. In this situation, random samples that are often used in microeconometrics
no longer provide efficient estimates. Drawing more observations from Y = 1, relative to what can
be obtained in a random sampling scheme, could effectively yield variance reduction. This is called
choice-based, or more generally, endogenously stratified sampling in which a random sample of pre-
assigned size is drawn from each stratum based on the values of Y . This nonrandom design tends to
deliberately oversample from the subpopulation (that is, Y = 1) that leads to variance reduction. King
and Zeng (2001) suggested a sequential procedure to determine the sample size for Y = 0 based on
the estimation accuracy of each previously selected sample.
The statistical procedures valid for random samples need to be adjusted as well in order to accom-
modate this choice-based sampling scheme. Maddala and Lahiri (2009) included some preliminary
discussions on this issue. Manski and Lerman (1977) proposed two modifications of the usual max-
imum likelihood estimation. The first one involves computing a logistic estimate and correcting it
according to prior information about the fraction of ones in the population, say τ, and the observed
fraction of ones in the sample, say Ȳ. For the logit model, the estimator of the slope coefficient β1 is
consistent under both sampling designs. The estimator of the intercept βo in a choice-based sample should
be corrected as:

β̂o − ln[((1 − τ)/τ)(Ȳ/(1 − Ȳ))],  (9)

where β̂o is the ML estimate of βo. For a random sample, τ = Ȳ, and thus there is no need to adjust
β̂o. However, in a choice-based sample with more observations on 1's, we must have τ < Ȳ, and the
corrected estimate is less than β̂o accordingly. The prior correction is easy to implement and only
requires the knowledge of τ, which is often available from census data. However, in the case of a mis-
specified parametric model, prior correction may not work. Given the prevalence of misspecification
in economic applications, more robust correction procedures are called for. Another limitation of this
prior correction procedure is that it may not be applicable for other parametric specifications, such as
the probit model, for which the inconsistency of the ML estimator may take a more complex form
(unlike in the logit case).
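The prior correction (9) is a one-line adjustment. The sketch below implements it; the numeric values passed in are purely illustrative.

```python
import numpy as np

# The prior correction (9) for a logit intercept estimated on a
# choice-based sample: tau is the population fraction of ones and
# y_bar the sample fraction. Input values below are illustrative.
def corrected_intercept(beta0_hat, tau, y_bar):
    return beta0_hat - np.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))

# With a random sample (tau == y_bar) no correction is needed:
assert abs(corrected_intercept(0.7, 0.2, 0.2) - 0.7) < 1e-12
# Oversampling ones (y_bar > tau) pulls the intercept down:
assert corrected_intercept(0.7, 0.05, 0.5) < 0.7
```

The slope estimates are left untouched, consistent with their being estimable under both sampling designs.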
Manski and Lerman (1977)’s second approach – the weighted exogenous sampling maximum-
likelihood estimator – is robust even when the functional form of logit model is incorrect, see Xie
and Manski (1989). Instead of maximizing the logarithm of likelihood function of the usual form, it
maximizes the following weighted version:
lw(β|{Yt, Xt}) ≡ −Σ_{t=1}^{T} wt ln(1 + e^{(1−2Yt)Xtβ}).  (10)

The weight function is wt = w1Yt + wo(1 − Yt), where w1 = τ/Ȳ and wo = (1 − τ)/(1 − Ȳ). As noted
by Scott and Wild (1986) and Amemiya and Vuong (1987), in the case of correct specification, the
weighting approach is asymptotically less efficient than prior correction, but the difference is not very
large. However, if model misspecification is suspected, weighting is a robust alternative. Unlike
prior correction, the weighted estimator can be applied equally well to other parametric specifications.
The only knowledge required for its implementation is τ, the population probability of the rare event.
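A minimal implementation of the weighted objective (10) might look as follows; the simulated choice-based sample, the assumed population rate τ, and the use of a generic numerical optimizer are all our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the weighted exogenous sampling ML objective (10): each
# observation's contribution is weighted by w1 = tau/y_bar (if Y = 1)
# or w0 = (1 - tau)/(1 - y_bar) (if Y = 0). Names are ours.
def weighted_logit_negll(beta, y, X, tau):
    y_bar = y.mean()
    w = np.where(y == 1, tau / y_bar, (1 - tau) / (1 - y_bar))
    z = (1 - 2 * y) * (X @ beta)          # ln(1 + e^z) is the per-obs neg. ll
    return np.sum(w * np.logaddexp(0.0, z))

# Choice-based sample: ones are heavily oversampled relative to tau
rng = np.random.default_rng(4)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.random(n) < 0.5).astype(int)     # roughly 50/50 sample by design
tau = 0.1                                 # assumed population event rate

res = minimize(weighted_logit_negll, x0=np.zeros(2), args=(y, X, tau))
```

In this toy design the outcome is independent of the covariate, so the weighted fit drives the slope toward zero while the fitted intercept implies an event probability close to the population rate τ rather than the oversampled in-sample rate.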
Manski and Lerman (1977) proved that the weighted estimator for any correctly specified model is
consistent given the true τ. However, this estimator may not be asymptotically efficient. The intuition
behind the lack of efficiency is that, unlike in a random sample, knowledge of τ imposes
additional restrictions on the unknown parameters β in a choice-based sample. Failure to exploit
this additional information makes the resulting estimator inefficient. Imbens (1992) and Imbens and
Lancaster (1996) examined how to efficiently estimate β in an endogenously stratified sample. Their
estimator based on the generalized-method-of-moment (GMM) reformulation does not require prior
knowledge of τ and the marginal distribution of regressors. Instead, τ can be treated as an additional
parameter that is estimated by GMM jointly with β. They have shown that this estimator achieves the
semiparametric efficiency bound given all available information. For an excellent survey on estimation
in endogenously stratified samples, see Cosslett (1993).
One interesting point in the context of choice-based sampling is that the logit model could some-
times be consistently estimated when the original data comes exclusively from one of the strata. This
problem has been investigated by Steinberg and Cardell (1992). In this paper, they have shown how
to pool an appropriate supplementary sample that can often be found in general purpose public use
surveys, such as the U.S. Census, with original data to estimate the parameters of interest. The sup-
plementary sample can be drawn from the marginal distribution of the covariates without having any
information on Y. This estimator is algebraically similar to the above weighted MLE, and hence can be
implemented in conventional statistical packages. Only the logit model is analyzed in this paper due to
the existence of an analytic solution. In principle, the analysis can be generalized to other parametric
binary response models.
In finite samples, however, all of the above statistical procedures are subject to bias even when the
model is correctly specified. King and Zeng (2001) pointed out that such bias may be amplified in the
case of rare events. They proposed two methods to correct for the finite sample bias in the estimation
of parameters and the probabilities. For the parameters, they derived an approximate expression of the
bias in the usual ML estimator, viz., (X′WX)^{−1}(X′Wξ), where ξt = 0.5Qtt[(1 + w1)P̂t − w1], Qtt is
the diagonal element of Q = X(X′WX)^{−1}X′, and W = diag{P̂t(1 − P̂t)wt}. This bias term is easy
to estimate since it is just the weighted least squares estimate of regressing ξ on X with W as the
weight. The bias-corrected estimator of β is β̃ = β̂ − (X′WX)^{−1}(X′Wξ), with the approximate variance
V(β̃) = (T/(T + k))²V(β̂), where k is the dimension of β. Observe that T/(T + k) < 1 for all sample
sizes. The bias-corrected estimator is not only unbiased but has smaller variance, and thus has a
smaller mean squared error than the usual ML estimator in finite samples. When it comes to the
predicted probabilities, a possible solution is to replace the unknown parameter β in 1/(1 + e^{−xtβ})
with the bias-corrected estimator β̃. The problem is that a nonlinear function of β̃ may not be unbiased.
King and Zeng (2001) developed the approximate Bayesian estimator based on the approximation of
the following estimator after averaging out the uncertainty due to estimation of β:
P(Y = 1|X = xo) = ∫ 1/(1 + e^{−xoβ∗}) P(β∗) dβ∗.  (11)
They stated that ignoring the estimation uncertainty of β would lead to underestimation of the true
probability in a rare-event situation. From a Bayesian viewpoint, P(β∗), which summarizes such uncertainty,
is interpreted as the posterior density of β, that is, N(β̃, V(β̃)). Computation of this approximate
Bayesian estimator and its associated standard deviation can be carried out in a straightforward way.
The pitfall of this estimator is that it is not unbiased in general, even though it often has small mean
squared error in finite samples. King and Zeng (2001) therefore proposed another competing estima-
tor, viz., “the approximate unbiased estimator”, which, as its name suggests, is unbiased.
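To make the algebra concrete, the following is a minimal Python sketch of the bias correction and of the approximate Bayesian predicted probability (11) for the unweighted logit case (so w_1 = 1). The function names are ours, and a production implementation would add convergence checks to the Newton iterations.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    """Plain ML logit fit via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: solve (X'WX) d = X'(y - p)
        beta = beta + np.linalg.solve(X.T * W @ X, X.T @ (y - p))
    return beta

def king_zeng_correct(X, y):
    """Approximate finite-sample bias correction (unweighted case, w1 = 1)."""
    beta_ml = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta_ml))
    W = p * (1.0 - p)                          # diagonal of W with w_t = 1
    XtWX_inv = np.linalg.inv(X.T * W @ X)
    Q = X @ XtWX_inv @ X.T
    xi = 0.5 * np.diag(Q) * (2.0 * p - 1.0)    # (1 + w1)p - w1 with w1 = 1
    bias = XtWX_inv @ (X.T @ (W * xi))         # WLS fit of xi on X
    return beta_ml - bias

def approx_bayes_prob(X, y, x0, draws=2000, seed=0):
    """Approximate Bayesian predicted probability (11) at x0:
    average the logistic over draws from N(beta_tilde, V(beta_tilde))."""
    T, k = X.shape
    beta_ml = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta_ml))
    W = p * (1.0 - p)
    V = (T / (T + k)) ** 2 * np.linalg.inv(X.T * W @ X)
    beta_bc = king_zeng_correct(X, y)
    bs = np.random.default_rng(seed).multivariate_normal(beta_bc, V, size=draws)
    return float(np.mean(1.0 / (1.0 + np.exp(-bs @ x0))))
```

Since the correction term is O(1/T), the bias-corrected coefficients will typically differ only slightly from the ML estimates unless the sample is small or events are rare.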
2.1.2 Nonparametric approach
As mentioned at the beginning of Section 2.1, the nonparametric approach is the most robust way
to model the conditional probability, in that both the link and the index can be rather flexible. Non-
parametric regression often deals with continuous responses with well behaved density functions, but
the theory does not explicitly rule out other possibilities like a binary dependent variable. All extant
nonparametric regression methods, after minor modifications, can be used to model binary dependent
variables as well.
The most well-known nonparametric regression estimator of conditional expectation is the so-
called local polynomial estimator. For the univariate case, the pth local polynomial estimator solves
the following weighted least squares problem, given a sample {Y_t, X_t} with t = 1, ..., T:

min_{b_0, b_1, ..., b_p} ∑_{t=1}^{T} (Y_t − b_0 − b_1(X_t − x) − ... − b_p(X_t − x)^p)² K((x − X_t)/h_T)   (12)
where hT is the selected bandwidth, possibly depending on the sample, and K(·) is the kernel function.
When p = 0, it reduces to the local constant or Nadaraya-Watson estimator; when p = 1, it is the local
linear estimator. In either case, the conditional probability P(Y = 1|X = x) can be estimated by b̂_0,
the solution to (12). However, this fitted probability may fall outside the feasible range [0,1] for some values of x, since no such constraint is built into the model. An immediate practical solution would be to cap the estimates at 0 and 1 whenever the fitted values fall beyond this range. The problem is that there is little theoretical support for doing so, and the capped fitted probability is then likely to sit at these boundary values for a large number of values of x, at which the estimated marginal effect must be zero as well. As with the probit or logit transformations in the parametric model, we can make use of the same device here. The only difference is that we fit the model
locally by kernel smoothing. Specifically, let g(x,βx) be such a transformation function with unknown
coefficient vector βx. The conditional probability is modeled as:
P(Y = 1|X = x) = g(x,βx). (13)
In contrast to a parametric model, the coefficient βx is allowed to vary with the evaluation point x. In
the present context, the local logit is a sensible choice, in which g(x, β_x) = 1/(1 + e^{−xβ_x}). Generally
speaking, any distribution function can be taken as g. Currently, there are three approaches to estimate
βx and thus P(Y = 1|X = x) in (13); see Gozalo and Linton (2000), Tibshirani and Hastie (1987), and
Carroll et al. (1998).
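To fix ideas, here is a minimal sketch of the local constant estimator and of a local logit fitted by kernel-weighted maximum likelihood, in the spirit of the local likelihood approach; the function names and the small ridge term (added for numerical stability) are ours, and bandwidth choice is left unaddressed.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Local constant (p = 0) fit: kernel-weighted average of the binary Y."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)          # Gaussian kernel K((x - X_t)/h)
    return float(np.sum(w * Y) / np.sum(w))

def local_logit(x, X, Y, h, iters=30, ridge=1e-6):
    """Local logit: kernel-weighted logistic fit around x; the fitted value
    at x automatically lies in (0, 1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)
    Z = np.column_stack([np.ones_like(X), X - x])  # local intercept and slope
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        # weighted Newton step with a tiny ridge for stability
        H = Z.T * (w * p * (1.0 - p)) @ Z + ridge * np.eye(2)
        b = b + np.linalg.solve(H, Z.T @ (w * (Y - p)))
    return float(1.0 / (1.0 + np.exp(-b[0])))      # fitted probability at x
```

Unlike the raw local polynomial fit, the local logit never needs to be capped at 0 or 1, at the cost of an iterative fit at every evaluation point.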
Another way to get the fitted probabilities within [0,1] nonparametrically is simply by noting that
p(y|x) = p(y, x)/p(x)   (14)
where p(y|x), p(y,x) and p(x) are the conditional, joint, and marginal densities, respectively. A non-
parametric conditional density estimator is obtained by replacing p(y,x) and p(x) in (14) by their
kernel estimates. When Y is a binary variable, p(1|x) = P(Y = 1|X = x). A technical difficulty is that
the ordinary kernel smoothing implicitly assumes that the underlying density function is continuous,
which is not true for a binary variable. Li and Racine (2006) provide a comprehensive treatment of
several ways to cope with this problem based on generalized kernels.
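A sketch of the density-ratio estimator (14) for binary Y, using a Gaussian kernel on X and an Aitchison-Aitken-type discrete kernel on Y in the spirit of Li and Racine (2006); bandwidth and smoothing-parameter selection, which that treatment covers in detail, are omitted here.

```python
import numpy as np

def cond_prob_generalized(x, X, Y, h, lam):
    """Estimate P(Y = 1 | X = x) as p(1, x)/p(x), with a Gaussian kernel on X
    and a discrete kernel l(Y_t, 1, lam) on the binary Y.
    lam in [0, 0.5]; lam = 0 recovers the Nadaraya-Watson estimate."""
    Kx = np.exp(-0.5 * ((x - X) / h) ** 2)     # continuous kernel on X
    Ly = np.where(Y == 1, 1.0 - lam, lam)      # discrete kernel on Y
    return float(np.sum(Kx * Ly) / np.sum(Kx))
```

Because the discrete kernel takes values in {lam, 1 − lam}, the fitted probability is automatically confined to [lam, 1 − lam], so no ad hoc capping is needed.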
A number of papers have compared nonparametric binary models with the familiar parametric
benchmarks. Frolich (2006) applied local logit regression to analyze the dependence of Portuguese
women’s labor supply on family size, especially on the number of children. For the parametric logit
estimator, the estimated employment effects of children never changed sign in the population. How-
ever, the nonparametric estimator was able to detect a larger heterogeneity of marginal effects in that
the estimated effects were negative for some women but positive for others. Bontemps et al. (2009)
compared nonparametric conditional density estimation with a conventional parametric probit model
in terms of their out-of-sample binary forecast performances by bootstrap resampling. They found that
the nonparametric method was significantly better behaved according to the “revealed performance”
test proposed by Racine and Parmeter (2009). Harding and Pagan (2011) considered a nonparametric
regression model using constructed binary time series. They argued that due to the complex scheme
of transformation, the true data generating process governing an observed binary sequence is often
not described well by a parametric specification, say, the static or dynamic probit model. Their dy-
namic nonparametric model was then applied to U.S. recession data using the lagged yield spread to
predict recessions. They compared the fitted probabilities from the probit model and those based on
the Nadaraya-Watson estimator, and concluded that the parametric probit specification could not char-
acterize the true relationship between recessions and yield spread over some range. The gap between
these two specifications was statistically significant and economically substantial.
2.1.3 Semiparametric approach
The semiparametric model consists of both parametric and nonparametric components. Compared
with the two extremes, a semiparametric model has its own strengths: it is more robust
than a parametric one because of the flexibility of its nonparametric part, while reducing the risk
of the "curse of dimensionality" and data "sparseness" associated with a fully nonparametric counterpart.
Various semiparametric models for binary responses have emerged in the last few decades. We will
briefly review some of the important developments in this area.
Recall that the link function is assumed to be known in the parametric model. Suppose this
assumption is relaxed while keeping the index unchanged. We have then the following single-index
model:
E(Y |X) = P(Y = 1|X) = F(G(X)). (15)
Generally speaking, the index G(X) does not have to be linear, as in the parametric model. We
only consider the case where G(X) = Xβ for the sake of simplicity. The only difference from the
parametric model is that the functional form for F(·) is unknown here and thus needs to be estimated.
By allowing for a flexible link function, greater robustness is achieved, provided the index has been
correctly specified. Horowitz (2009) discussed the identification issues for various sub-cases of (15).
Generally speaking, the simplest identified specification can be used without worrying about other
possibilities, provided that the alternative models are observationally equivalent from the standpoint
of forecasting.
For the single-index model, once a consistent estimator of β is available, F could be estimated
using a nonparametric regression with β replaced by its estimator. There are three suggested estimators for β. Horowitz (2009) categorized them according to whether a nonlinear optimization problem
has to be solved. Two estimators obtained as the solution of a nonlinear optimization problem are
the semiparametric weighted nonlinear least square estimator due to Ichimura (1993), and the semi-
parametric maximum likelihood estimator proposed by Klein and Spady (1993). A direct estimator
not involving optimization is the average derivative estimator; see Stoker (1986, 1991a,b), Hardle and
Stoker (1989), Powell et al. (1989), and Hristache et al. (2001).
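Ichimura's semiparametric least squares idea can be sketched in a few lines for a two-regressor index (function names are ours; the bandwidth and the grid are illustrative, and a real implementation would use trimming and data-driven bandwidth selection, or a general-purpose optimizer instead of a grid search):

```python
import numpy as np

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson fit of y on a candidate index."""
    K = np.exp(-0.5 * ((index[:, None] - index[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)                          # leave one out
    return (K @ y) / np.maximum(K.sum(axis=1), 1e-12)

def ichimura_sls(X, y, h, grid):
    """Semiparametric least squares for the index X1 + b*X2, with the first
    coefficient normalized to one for identification; b found by grid search."""
    losses = [np.mean((y - loo_nw(X[:, 0] + b * X[:, 1], y, h)) ** 2)
              for b in grid]
    return float(grid[int(np.argmin(losses))])
```

Once the index coefficient is estimated, F is recovered by a nonparametric regression of Y on the fitted index, exactly as described in the text.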
Another semiparametric model suitable for binary responses is the nonparametric additive model
where the link is given, but the index contains nonparametric additive elements:
P(Y = 1|X = x) = F(μ + m_1(x_1) + ... + m_k(x_k)).   (16)
Here, X is a k-dimensional random vector and the function F(·) is known prior to estimation, al-
though the univariate function m j(·) for each j needs to be estimated. The model is semiparametric
in nature as it contains both the parametric component F(·), along with the additive structure, and
the nonparametric component m_j(·). Note that the nonparametric additive model and the single-index model do not nest each other: there is at least one single-index model that cannot be rewritten in nonparametric additive form, and vice versa. Like the single-index model,
the nonparametric additive model relaxes restrictions on model specification to some extent, thereby
reducing the risk of misspecification as compared with the parametric approach. Furthermore, it over-
comes the “curse of dimensionality” associated with a typical multivariate nonparametric regression
by assuming each additive component to be a univariate function. Often, a cumulative distribution
function with range between 0 and 1 is a sensible choice for F(·). To ensure consistency of estimation
methodology, F(·) has to be correctly specified. Horowitz and Mammen (2004) described estimation
of this additive model. The basic idea is to estimate each m j(·) by series approximation. A natu-
ral generalization is to allow for unknown F(·). This more general specification nests (15) and (16)
as two special cases. Horowitz and Mammen (2007) developed a penalized-least-squares estimator
for this model, which does not suffer from the “curse of dimensionality” and achieves the optimal
one-dimensional nonparametric rate of convergence.
2.1.4 Bayesian approach
In contrast to the frequentist approach, the Bayesian approach takes the probability of a binary event as
a random variable instead of a fixed value. Combining prior information with likelihood using Bayes’
rule, it obtains the posterior distribution of parameters of interest. By the property of a binary variable,
each 0/1-valued Yt must be distributed as Bernoulli with probability p. The likelihood function for a
random sample would take the following form:
[T!/(T_1! T_0!)] p^{T_1} (1 − p)^{T_0}   (17)
where T1 and T0 are the total number of observations with Yt = 1 and Yt = 0, respectively, and T =
T1 +T0. A conjugate prior for parameter p is Beta (α, β) where both α and β are nonnegative real
numbers. According to Bayes’ rule, the posterior is Beta (α+T1, β+T0) with mean:
E(p|Y) = λ p_o + (1 − λ)(T_1/T)   (18)
where po = α/(α+ β) is the prior mean, T1/T is the sample mean, and λ = (α+ β)/(α+ β+ T )
is the weight assigned to the prior mean. If α = β = 1 in the above Beta-Binomial model, that is,
when a noninformative prior is used, the posterior distribution is then dominated by the likelihood,
and (18) approaches the sample mean provided T is sufficiently large. In other words, the Bayesian approach nests
the frequentist approach as a special case. However, this flexibility comes at the cost of robustness,
as the posterior relies on the prior, which, to some extent, is thought of as arbitrary and subject to
choice by the analyst. This deficiency can be alleviated by checking the sensitivity of the posterior to
multiple priors, or using empirical Bayes methods. For the former, if different priors produce similar
posteriors, the result obtained under a particular prior is robust. In the latter approach, the prior is
determined by other data sets such as those examined in previous studies. For instance, we can match
the prior mean and variance with sample counterparts to determine two parameters α and β in the
above Beta-Binomial model. This is a natural way to update the information from previous studies.
Once the posterior density is known, the predicted probability can be obtained under a suitable loss
function. For example, the posterior mean is the optimal choice under quadratic loss.
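The Beta-Binomial update can be written in a few lines (a sketch; variable names are ours):

```python
import numpy as np

def beta_binomial_posterior(alpha, beta, y):
    """Posterior Beta(alpha + T1, beta + T0) and posterior mean (18)
    for a Bernoulli sample y with a Beta(alpha, beta) prior."""
    T = len(y)
    T1 = int(np.sum(y))
    T0 = T - T1
    a_post, b_post = alpha + T1, beta + T0
    lam = (alpha + beta) / (alpha + beta + T)       # weight on the prior mean
    post_mean = lam * alpha / (alpha + beta) + (1.0 - lam) * T1 / T
    return a_post, b_post, post_mean
```

With the noninformative prior alpha = beta = 1 and a long sample, lam shrinks to zero and the posterior mean converges to the sample frequency T1/T, illustrating the nesting of the frequentist estimate discussed above.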
Up to this point, only the information contained in the prior distribution and past Y are utilized
for generating probability forecasts. Usually in practice, a set of covariates X is available for use. In
line with our general formulation at the beginning of this section, only the prior distribution and past
Y are incorporated into the information set Ω in the Beta-Binomial model. Let us now consider how
to incorporate X into Ω within the framework of (2). There are two approaches to do this. The first
one is conceptually simple in that only Bayes’ rule is involved. The prior density of parameters π(β)
multiplied by the conditional sampling density of Y given X generates the posterior in the following
way:
p(β|Y,X) = C π(β) ∏_{t=1}^{T} F(G_0(X_t, β))^{Y_t} [1 − F(G_0(X_t, β))]^{1−Y_t}   (19)

where C is a normalizing constant whose inverse equals

∫ π(β) ∏_{t=1}^{T} F(G_0(X_t, β))^{Y_t} [1 − F(G_0(X_t, β))]^{1−Y_t} dβ.   (20)
The Metropolis-Hastings algorithm can draw samples from this distribution directly. Alternatively,
we can use Monte Carlo integration to approximate the constant C. Albert and Chib (1993) developed
the second method using the idea of data augmentation. The parametric model F(G0(Xt ,β)) is seen
to have an underlying regression structure on the latent continuous data; see (2). Without loss of gen-
erality, we only consider the case where G0(Xt ,β) = Xtβ, and ε has the standard normal distribution,
that is, F(·) = Φ(·) where Φ(·) is the standard normal distribution function with φ(·) as its density.
If the latent data Y ∗t is known, then the posterior distribution of the parameters can be computed
using standard results for normal linear models; see Koop (2003) for more details. Values of the latent
variable are drawn from the following truncated normal distributions:
p(Y*_t | Y_t, X_t, β) ∝ φ(Y*_t − X_t β) I(Y*_t > 0) if Y_t = 1;   φ(Y*_t − X_t β) I(Y*_t ≤ 0) otherwise,   (21)
where ∝ means “is proportional to”. Draws from the posterior distribution are then used to sample
new latent data, and the process is iterated with Gibbs sampling, given all conditional densities. The
distribution of the predicted probability can be obtained as follows. Given an evaluation point x, the
conditional probability is Φ(xβ), which is random in the Bayesian framework. When a sufficiently
large sample is generated from the posterior p(β|Y,X), the distribution of Φ(xβ) can be approximated
arbitrarily well by evaluating Φ(xβ) at each sample point. As before, when only a point estimate is
desired, we can derive it given a specified loss function.
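A minimal sketch of the Albert-Chib sampler under a flat prior on β (function names are ours; an informative normal prior would simply modify the mean and variance of the conditional posterior of β):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(X, y, n_draws=2000, burn=500, seed=0):
    """Albert-Chib data augmentation for the probit model with a flat prior.
    Returns posterior draws of beta."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    C = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(k)
    draws = []
    for it in range(n_draws + burn):
        mu = X @ beta
        # draw latent Y* from normals truncated at 0, as in (21)
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        ystar = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # beta | Y* ~ N((X'X)^{-1} X'Y*, (X'X)^{-1}) under the flat prior
        beta = XtX_inv @ (X.T @ ystar) + C @ rng.standard_normal(k)
        if it >= burn:
            draws.append(beta.copy())
    return np.asarray(draws)
```

The distribution of the predicted probability at an evaluation point x is then approximated by evaluating Φ(xβ) at each retained draw of β.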
Albert and Chib (1993) also pointed out a number of advantages of the Bayesian estimation over
a frequentist approach. First, frequentist ML relies on asymptotic theory and its estimator may not
perform satisfactorily in finite samples. Indeed, Griffiths et al. (1987) found that the ML estimator can
have significant bias in small samples, whereas the Bayesian estimator permits exact inference
even in these cases. Second, the Bayesian approach based on the latent variable formulation is computationally attractive. Third, Gibbs sampling draws samples mainly from several standard
distributions, and therefore is simple to implement. Finally, the model extends easily to sampling densities for the latent variables other than the present multivariate normal density. As a cautionary note, diagnostic methods should be used to ensure that the generated
Markov chain has reached its equilibrium distribution. For applications of this general approach in
other binary response models, see Koenker and Yoon (2009), Lieli and Springborn (2012), and Scotti
(2011).
2.1.5 An empirical example
In this part, we will present an empirical example that illustrates the application of the methodologies
covered so far. The task is to generate the probabilities of future U.S. economic recessions. The
monthly data we use consists of 624 observations on the difference between 10-year and 3-month
Treasury rates, and NBER dated recession indicators from January 1960 to December 2011.²
binary target event is the recession indicator that is one, if the recession occurred, and zero otherwise.
The sample proportion of months that were in recession is about 14.9%, indicating that it is a relatively
uncommon event. The independent variables are the yield spread, i.e., difference between 10-year and
3-month Treasury rates, and the lagged recession indicator. Estrella and Mishkin (1998) found that
the best fit occurred when the yield spread is lagged 12 months. We maintain this assumption here.
Figure 1 shows the frequency distribution of the yield spread in our sample periods. The three tallest
bars show that the value of the spread was between 0 and 1.5 percentage points in about 42.6% of
the cases. The distribution is heavily skewed toward the positive values. All our fitted models with
the yield spread as the explanatory variable reveal a very strong serial correlation in residuals. As a
result, the dynamic specification involving one month lagged indicator as an additional regressor is
used here. We implement parametric, semiparametric, and nonparametric approaches on this dataset,
and summarize the fitted curves in a single graph. For the Bayesian approach, we use the R code
provided by Albert (2009) to simulate the posterior distributions under different priors.
Figure 2 presents three fitted curves generated using a parametric probit model, a semiparametric
single-index model, and the nonparametric conditional density estimator of Section 2.1.2, given the
value of the lagged indicator. Both the probit and the single-index models contain the linear index.³

²Downloaded from http://www.financeecon.com/ycestimates1.html.
³The single-index model is estimated by the Klein-Spady approach with a carefully selected bandwidth; see Section 2.1.3.
Figure 1: Frequency distribution of the yield spread
In the top panel in Figure 2, which is conditional on being in recession in the last month, we find
the estimated conditional probabilities to be very close to each other, except for values of the yield
spread larger than 2.5%. Despite the divergence between them on the right end, both are downward-
sloping. In contrast, the relationship as estimated by the nonparametric model is not monotonic, in
that the probability surprisingly rises as the spread increases from −1% to 0. This finding
is hard to explain given the prototypical negative correlation between the two. We ascribe it to the
data "sparseness" exhibited in Figure 1, namely that the nonparametric estimates at these values are
not reliable. In the bottom panel, which is conditional on not being in recession in the last month,
there is no substantial difference among these three models, and all of them are decreasing over the
entire range. Again, the precision of the nonparametric estimates at both ends is relatively low for
the same reason as before. An interesting issue that arises as one compares both the panels is that the
estimated probabilities when the lagged recession occurs are uniformly larger than those when it does
not. Actually, the probabilities in the bottom panel are nearly zero in magnitude no matter how small
the spread is. This could be true if there is a strong serial correlation in recessions identified by NBER,
as shown in our probit model that has a highly significant coefficient estimate for the lagged indicator.
For this reason, the information contained in the current macroeconomic state, which is related to the
occurrence of future recessions, is far more important than that given by the spread. At first sight, this
example seems to be evidence against the predictive power of the yield spread. However, that is
not the case, because the one-month-lagged recession indicator is unavailable at the date of
forecasting. The autocorrelation among recession indicators shrinks toward zero as the forecast horizon
Figure 2: Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel)
increases. The yield spread stands out only in these longer horizon forecasts where few competing
predictors with good quality exist.
To apply the Bayesian approach, we need some prior information. Suppose the coefficient vector
β is assigned a multivariate normal prior with mean βo and covariance matrix Vo. For βo, we assume
the prior means of the intercept, the coefficient of the spread and the lagged indicator to be -1, -1
and 1, respectively. As for Vo, three cases are examined: the noninformative prior corresponding to
infinitely large V_o, and a variation of Zellner's g informative priors⁴ with large and small precisions.
Figure 3 summarizes the simulated posterior means for the conditional probabilities as well as the
probit curves from Figure 2. For comparison purpose, we also plot a curve replacing unknown β by
its prior mean βo. In both panels, the Bayesian fitted curves are sensitive to the prior involved. For
⁴See Albert (2009) for an explanation of the g informative prior.
Figure 3: Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel)
noninformative and informative priors with small precision, these curves are almost identical to the
probit curves, reflecting the dominance of the sample information over priors. The reversed pattern
appears in the other two curves. When the prior precision is extremely large, the forecasters’ beliefs
about the true relationship between the spread and future recessions are so firm that they are unlikely
to be affected by the observed sample. That is the reason why the simulated curves under this sharp
prior almost overlap with the curves implied by βo alone. To summarize, the Bayesian approach is a
compromise between prior and sample information, and the degree of compromise crucially depends
on the relative informativeness.
2.1.6 Probability predictions in panel data models
Panel data consists of repeated observations for a given sample of cross-sectional units, such as in-
dividuals, households, companies, and countries. In empirical microeconomics, a typical panel has
a small number of observations along the time dimension but a very large number of cross-sectional
units. The opposite scenario is generally true in macroeconomics. In this section, we consider a micro
panel environment with small or moderate T and large N. Many estimation and inference methods
developed for micro panels can be adapted to binary probability prediction. For ease of exposition,
only balanced panels with an equal number of repeated observations for each unit will be discussed.
The basic linear static panel data model can be written in the following form:
Yit = Xitβ+ ci + εit , i = 1, ...,N, t = 1, ...,T (22)
where Yit and Xit are the dependent and k-dimensional independent variables, respectively, for unit i
and period t. One of the crucial features that distinguishes panel data models from cross-sectional and
univariate time series models is the presence of unobserved ci, the time-invariant individual effects. In
more general unobserved effects models, time effects λt are also included. εit is the idiosyncratic error
varying with i and t, and is often assumed to be i.i.d. and independent from other model components.
The benefits of using panel data mainly come from its larger flexibility in specification as it allows
the unobserved effect to be correlated with regressors. In a cross-sectional context without further
information (such as the availability of valid instruments), parameters such as β cannot be identified.
Even if ci is uncorrelated with regressors, the panel data estimator is generally more efficient relative
to those obtained in cross-sectional models. Baltagi (2012) covers many aspects of forecasting in
panel data models with continuous response variables.
When Yit is binary, the linear panel data model, like the linear probability model, is no longer
adequate. Again, we rewrite it in the latent variable form. The unobserved latent dependent variable
Y ∗it satisfies:
Y ∗it = Xitβ+ ci + εit , i = 1, ...,N, t = 1, ...,T. (23)
Instead of knowing Y ∗it , only its sign Yit = I(Y ∗it > 0) is observed. In order to get the conditional
probability of Yit = 1, certain distributional assumptions concerning εit and ci have to be made. For
example, when εit is i.i.d. with distribution function F(·) and ci has G(·) as its marginal distribution,
the conditional probability of Y_it = 1 given X_i = (X'_i1, X'_i2, ..., X'_iT)' and c_i is
P(Yit = 1|Xi,ci) = 1−F(−Xitβ− ci). (24)
The problem with this conditional probability is that ci is unobserved and P(Yit = 1|Xi,ci) cannot
be estimated directly except for large T . In a micro panel, the solution, without estimating ci, is to
compute P(Yit = 1|Xi), that is, integrating out ci from P(Yit = 1|Xi,ci). If the conditional density of ci
given Xi is denoted by g(·|·), then the conditional probability is:
P(Yit = 1|Xi) =∫(1−F(−Xitβ− c))g(c|Xi)dc, (25)
which is a function of Xi alone, and thus can be estimated by replacing β with its estimate, provided
that the functional forms of F(·) and g(·|·) are known.
In general, the function g(·|·) is unknown. The usual practice is to make some assumptions about
it. One such assumption is that ci is independent of Xi, so
g(c|X_i) = g(c) ≡ dG(c)/dc.   (26)
This leads to the random effects model. Given this specification, β and other parameters in g(·) and
F(·) can be efficiently jointly estimated by maximum likelihood. For some parametric specifications
of g(·) and F(·), such as normal distributions, identification often requires further restrictions on their
parameters; see Lechner et al. (2008). In general, the conditional likelihood function for each unit i is
computed as below by noting that the idiosyncratic error is i.i.d. across t:
L_i(Y_i|X_i) = ∫ ∏_{t=1}^{T} [1 − F(−X_it β − c)]^{Y_it} F(−X_it β − c)^{1−Y_it} g(c) dc.   (27)
If both G(·) and F(·) are zero-mean normal distributions with variances σ_c² and σ_ε², respectively, then σ_c² + σ_ε² = 1 is often needed to identify all parameters. In general, G(·) or F(·) may be any cumulative
distribution function. Multiplying conditional likelihood functions Li(Yi|Xi) for each i and taking
logarithms, we get the conditional log-likelihood function for the whole sample:
l(Y|X) = ∑_{i=1}^{N} ln L_i(Y_i|X_i).   (28)
The ML estimate is defined as the global maximizer of l(Y |X) over the parameter space, and the
estimated conditional probability is thus
P̂(Y = 1|x) = ∫ (1 − F̂(−xβ̂ − c)) ĝ(c) dc   (29)

where β̂ is the ML estimate of β, and ĝ(·) and F̂(·) are the density of c and the distribution of ε with their unknown parameters replaced by ML estimates. The predicted probability is evaluated at the point x.
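The integrals in (27) and (29) have no closed form, but for the normal case (a random-effects probit, with σ_ε normalized to one) they are cheaply approximated by Gauss-Hermite quadrature. The sketch below is illustrative (function names are ours); estimation would maximize the log-likelihood numerically over (β, σ_c).

```python
import numpy as np
from scipy.stats import norm

def re_probit_loglik(beta, sigma_c, Y, X, n_nodes=20):
    """Log-likelihood (28) of a random-effects probit, integrating c out of (27)
    by Gauss-Hermite quadrature. Y: (N, T) binary; X: (N, T, k)."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    nodes = np.sqrt(2.0) * sigma_c * z               # quadrature points for c
    wq = w / np.sqrt(np.pi)                          # normalized weights
    ll = 0.0
    for Yi, Xi in zip(Y, X):
        idx = Xi @ beta                              # (T,)
        # P(Y_it = 1 | c) = 1 - Phi(-x*beta - c) = Phi(x*beta + c) at each node
        p = norm.cdf(idx[:, None] + nodes[None, :])  # (T, n_nodes)
        like_t = np.where(Yi[:, None] == 1, p, 1.0 - p)
        ll += np.log(np.maximum(like_t.prod(axis=0) @ wq, 1e-300))
    return ll

def re_probit_prob(beta, sigma_c, x, n_nodes=20):
    """Predicted probability (29) at evaluation point x."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    return float(norm.cdf(x @ beta + np.sqrt(2.0) * sigma_c * z) @ (w / np.sqrt(np.pi)))
```

Note that integrating out the individual effect pulls the predicted probability toward one half relative to the probability evaluated at c = 0, since averaging Φ over a symmetric distribution of c flattens the curve.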
The above framework can be extended to a general case where the covariance matrix of errors
is not restricted to have the conventional component structure. Let Y*_i = (Y*_i1, Y*_i2, ..., Y*_iT)' and u_i = (u_i1, u_i2, ..., u_iT)' be the stacked vectors of Y* and u for unit i. The latent variable linear panel data
model can be rewritten in the following compact form:
Y ∗i = Xiβ+ui. (30)
We consider the case where Xi is independent of ui, with the latter having a T -dimensional multivariate
joint distribution Fu. Note that when uit = ci + εit for each t, (30) reduces to the random effects model
discussed above. Given data (Yi,Xi) for i = 1, ...,N, the likelihood function for unit i is
L_i(Y_i|X_i) = ∫_{D_i} dF_u   (31)

where

D_i = {u ∈ R^T : I(X_it β + u_t > 0) = Y_it for t = 1, ..., T}.   (32)
The log-likelihood for the whole sample is thus l(Y|X) = ∑_{i=1}^{N} ln L_i(Y_i|X_i). Denote the ML estimate by β̂. The predicted probability at point x is then

P̂(Y = 1|x) = P(xβ̂ + u_x > 0 | x) = P(u_x > −xβ̂ | x) = ∫_{u_x > −xβ̂} dF̂_o   (33)

where F̂_o is the estimated joint distribution function of (u_i, u_x). Here, u_x is the latent error term corresponding to the point x, and (33) is for unit i. In general, it is hard to specify a particular form for F_o without further knowledge of the serial dependence among the u_i. Additional conditions, such as
serial independence, are needed to make (33) tractable.
In practice, this general framework is hard to implement due to the presence of the multiple inte-
gral in the likelihood function. Numerous methods of overcoming this technical difficulty have been
developed in the last few decades. Most of them are based on a stochastic approximation of the multi-
ple integral by simulation; see Lee (1992), Gourieroux and Monfort (1993), and Train (2003) for more
details on these simulation-based estimators and their asymptotic properties.
We can generalize the above model further to deal with the case where ui depends on Xi in a
known form. Similar to the linear panel data model, Chamberlain (1984) relaxed the assumption that
the individual effect ci is independent of the regressors. Let the linear projection of ci on Xi be in the
following form:
c_i = X_i γ + η_i.   (34)

For simplicity, η_i is assumed to be independent of X_i. After plugging X_i γ + η_i into (23), we get the following equation free of c_i:

Y*_it = X_i γ_t + η_i + ε_it   (35)

where γ_t = γ + β ⊗ e_t, and e_t is a T-dimensional column vector with one in the t-th element and zero
for the others. The composite error ηi + εit is independent of Xi. If we know the distributions of ηi
and εit , the above likelihood-based framework can be applied here in the same manner. Note that
for making probability predictions, we are not interested in β in (23); the reduced-form parameter
γ_t in (35) is sufficient. To summarize, in parametric panel data models, as long as the conditional
distribution of error given Xi is correctly specified, the predicted probability at evaluation point x is
obtained by replacing unknown parameters by their maximum likelihood estimates. The parametric
approach is efficient but not robust. In the panel data context, it is hard to ensure that all stochastic
components of the model are correctly specified. If one of them is misspecified, the resulting estimator
is in general not consistent. More robust estimation approaches, that do not require full specification of
the random components, have been proposed, such as the well-known conditional logit model which
allows for an arbitrary relationship between the individual effect and the regressors, see Andersen
(1970), Chamberlain (1980, 1984), and Hsiao (1996). Unfortunately, these approaches cannot be used
to get probability forecasts. Given that the conditional probability P(Y = 1|x) depends on both β and
the distribution function that maps the index into a number between zero and one, consistency of the
parameter estimator is not enough. When parametric models fail, the semiparametric or nonparametric
approach may be an obvious choice; see Ai and Li (2008). However, most of the semiparametric and
nonparametric panel data models focus on how to estimate β, instead of the predicted probabilities.
In a dynamic binary panel data model, the latent variable in period t depends on the lagged ob-
served binary event as shown below:
Y ∗it = Yit−1α+Xitβ+ ci + εit . (36)
The dynamic model is useful in some cases as it accounts for the state dependence of the binary choice
explicitly. Consider consumers’ brand choice as an example. The unobserved indirect utility over a
brand is likely to be correlated with past purchasing behavior, as most consumers tend to buy the same
brand if it has been tried before and was satisfactory. Presence of the lagged endogenous variable Yit−1
on the right hand side of (36) complicates the estimation due to the correlation between ci and Yit−1.
In dynamic panel data models, the initial value Y_i0 is not observed by the econometrician. Therefore,
another issue is how to deal with this initial condition in order to obtain a valid likelihood function for
estimation and inference; see Heckman (1981), Wooldridge (2005), and Arellano and Carrasco (2003)
for alternative solutions. Lechner et al. (2008) provided an outstanding overview of several dynamic
binary panel data models.
The Bayesian approach in the panel data context shares much similarity with its counterpart in the
single equation case. Chib (2008) considered a general latent variable model in which both slope and
intercept exhibit heterogeneity. This random coefficient model is shown below:
Y ∗it = Xitβ+Witbi + εit (37)
where Wit is the subvector of Xit whose marginal effects on Y∗it, captured by bi, are unit specific, and
where εit follows a standard normal distribution. The probability of the binary response given this
formulation is P(Yit = 1|Xit, bi) = Φ(Xitβ + Witbi), where bi is assumed to be a multivariate normal
random vector, N(0, D). Again, data augmentation with the latent continuous response is suggested to
facilitate computation of the posterior distribution; see Chib (2008) for more details.
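The data augmentation idea can be sketched for the simplest case, a cross-section probit, in the classic scheme of Albert and Chib (1993); the random coefficient model (37) adds a sampling block for bi but follows the same logic. The Python sketch below (sample size, prior, and chain length are our own illustrative choices) alternates between drawing the latent Y∗ from truncated normals and drawing β from its normal full conditional:

```python
import numpy as np
from scipy import stats

# A stylized data-augmentation Gibbs sampler for a cross-section probit;
# all settings (n, beta_true, flat prior, chain length) are assumptions.
rng = np.random.default_rng(1)
n, k = 300, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.3, 1.2])
Y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)   # flat prior on beta, for simplicity
beta = np.zeros(k)
draws = []
for it in range(600):
    # Step 1: draw latent Y*_i | beta, Y_i from the appropriate truncated normal
    mu = X @ beta
    u = rng.uniform(size=n)
    lo = np.where(Y == 1, stats.norm.cdf(-mu), 0.0)
    hi = np.where(Y == 1, 1.0, stats.norm.cdf(-mu))
    y_star = mu + stats.norm.ppf(lo + u * (hi - lo))
    # Step 2: draw beta | Y* from its normal full conditional
    beta_hat = XtX_inv @ (X.T @ y_star)
    beta = rng.multivariate_normal(beta_hat, XtX_inv)
    if it >= 100:                   # discard burn-in draws
        draws.append(beta)

post_mean = np.mean(draws, axis=0)  # posterior mean of beta
```

The posterior mean should land near the data-generating coefficients; predicted probabilities then follow by averaging Φ(Xβ) over the retained draws.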
2.2 Non-model-based probability predictions
The methodologies covered so far rely crucially on alternative econometric binary response models. In
practice, researchers sometimes are confronted with binary probability predictions which may or may
not come from any econometric model. Instead, the predicted probabilities are issued by a number
of market experts following their professional judgements and experiences. These are non-model-
based probability predictions, or judgemental forecasts in psychological parlance; see, for instance,
Lawrence et al. (2006). The Survey of Professional Forecasters (SPF) conducted by the Federal
Reserve Bank of Philadelphia and its counterpart conducted by the European Central Bank (ECB) are
leading examples of non-model-based probability predictions in economics. Other forecasting
organizations like the Blue Chip Surveys, Bloomberg, and many central banks also report probability forecasts from time to time.
Given the high reputation and widespread use of the U.S. SPF data in academia and industry, this
section will give a brief introduction to this survey focusing on probability forecasts for real GDP
declines. See Croushore (1993) for a general introduction to SPF, and Lahiri and Wang (2012) for
these probability forecasts.
The Survey of Professional Forecasters is the oldest quarterly survey of macroeconomic forecasts
in the United States. It began in 1968 and was conducted by the American Statistical Association
and the National Bureau of Economic Research. The Federal Reserve Bank of Philadelphia took over
the survey in 1990. Currently, the dataset contains over thirty economic variables. In every quarter,
the questionnaire is distributed to selected individual forecasters and they are asked for their expecta-
tions about a number of economic and business indicators, such as real GDP, CPI, and employment
rate in the current and next few quarters. For real GDP, GDP Price Deflator, and Unemployment,
density forecasts are also collected, viz., the predicted probability of annual percent change in each
prescribed interval for current and the next four quarters. Furthermore, the survey asks forecasters for
their predicted probabilities of declines in real GDP in the quarter in which the survey is conducted
and each of the following four quarters. For any target year, there are five forecasts from an indi-
vidual forecaster, each corresponding to a different quarterly forecast horizon. By investigating the
time series of individual forecasts for a given target, we can study how their subjective judgements
evolve over time and their usefulness. SPF also reports aggregate data summarizing responses from all
forecasters, including their mean, median, and cross-sectional dispersion. Note that the dataset is not
balanced, and individual forecasters enter or exit the survey in any quarter for a number of
reasons. Also, some forecasters may not report their predictions for some variables or horizons. Given
the novelty and quality of this dataset, SPF is extensively used in macroeconomics. For our purpose,
probability forecasts of a binary economic event can also be easily constructed from the subjective
density forecasts. Galbraith and van Norden (2012) used the Bank of England’s forecast densities to
calculate the forecast probability that the annual rate of change of inflation and output growth exceed
given threshold values. For instance, if the target event is GDP decline in the current year, then the
constructed probability of this event is the sum of probabilities in each interval with negative values.
For quarterly GDP declines, however, this probability is readily available in the U.S. SPF, and can be
analyzed for their properties. Clements (2006) has found some internal inconsistency between these
probability and density forecasts, whereas Lahiri and Wang (2006) found that the probability forecasts
for real GDP declines have no significant skill beyond the second quarter.
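The construction from a density forecast is a simple sum over the histogram bins with negative growth. A toy numeric example (the bins and probabilities below are made up for illustration, not actual SPF data):

```python
# Probability of a GDP decline implied by a histogram-type density forecast:
# total mass assigned to intervals of negative annual percent change.
# The bins and probabilities are illustrative assumptions.
bins = [(-3, -2), (-2, -1), (-1, 0), (0, 1), (1, 2), (2, 3)]   # annual % change
probs = [0.02, 0.05, 0.13, 0.35, 0.30, 0.15]                   # sums to one

p_decline = sum(p for (lo, hi), p in zip(bins, probs) if hi <= 0)
print(round(p_decline, 2))   # 0.2
```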
A commonly cited SPF indicator is the anxious index. It is defined as the probability of a decline
in real GDP in the next quarter. For example, in the survey taken in the fourth quarter of 2011, the
anxious index is 16.6 percent, which means that forecasters on average believed that there was a 16.6
percent chance that real GDP will decline during the first quarter of 2012. Figure 4 illustrates the
path of anxious index over time, beginning in the fourth quarter of 1968, along with the shaded NBER
dated recessions. The fluctuations in the probabilities seem roughly coincident with the NBER defined
peaks and troughs of the U.S. business cycle since 1968.

[Figure 4: The Anxious Index from 1968:Q4 to 2011:Q4 (source: SPF website). The vertical axis measures probability (percent); the horizontal axis is the survey date, with NBER recessions shaded.]

Rudebusch and Williams (2009) compared
the economic downturn forecast accuracy of SPF and a simple binary probit model using yield spread
as regressor, finding that in terms of alternative measures of forecasting performance, the former wins
for the current quarter but the difference is not statistically significant. Its advantage over the latter
deteriorates as forecast horizon increases. Given the widespread recognition of the enduring role of
yield spread in predicting contractions during the past 20 years, this result that professional forecasters
do not seem to incorporate this readily available information on yield spread in forecasting real GDP
downturns appears to be a puzzle; see Lahiri et al. (2012a) for further analysis of the issue. A number
of papers have studied the properties of the SPF data. See for example, Braun and Yaniv (1992),
Clements (2008, 2011), Lahiri et al. (1988) and, Lahiri and Wang (2012), just to name a few.
Engelberg et al. (2011) called attention to the problem of changing panel composition in surveys
of forecasters and illustrated this problem using SPF data. They warned that the traditional aggregate
analysis of SPF time series conflates changes in the expectations of individual forecasters with changes
in the composition of the panel. Instead of aggregating individual forecasts by mean or median as
reported by the Federal Reserve Bank of Philadelphia, they suggested putting more emphasis on the
analysis of time series of predictions made by each individual forecaster. Aggregation, as a simplifying
device, should only be applied to subpanels with fixed composition.
3 Evaluation of Binary Event Predictions
Given a sequence of predicted values for a binary event that may come from an estimated model or
subjective judgements by individual forecasters like SPF, we can evaluate their accuracy empirically.
For example, it is desirable to verify whether the predictions accord well with the realized events. An important
issue here is how to compare the performance of two or more forecasting systems predicting the same
event, and whether a particular forecasting system is valuable from the perspective of end users. In
this section, we shall summarize many important and useful evaluation methodologies developed in
diverse fields in a coherent fashion. There are two types of binary predictions: probability prediction
discussed thoroughly in Section 2 and point prediction, which will be covered in the next section. The
evaluation of probability predictions is discussed first.
3.1 Evaluation of Probability Predictions
We can roughly classify the extant methodologies on binary forecast evaluation into two categories.
The first one measures forecast skill, which describes how the forecast is related to the actual, while the
second one measures forecast value, which emphasizes the usefulness of a forecast from the viewpoint
of an end user. Skill and value are two facets of a forecasting system; a skillful forecast may or may
not be valuable. We will first review the evaluation of forecast skill and then move to forecast value
where the optimal forecasts are defined in the context of a two-state, two-action decision problem.
3.1.1 Evaluation of forecast skill
The econometric literature contains many alternative measures of goodness of fit analogous to the
R2 in conventional regressions, which can be related to various re-scalings of functions of the likelihood ratio statistics for testing that all slope coefficients of the model are zero.5 These measures,
though useful in many situations, are not directly oriented towards measuring forecast skill, and are
often unsatisfactory in gauging the usefulness of the fitted model in either identifying a relatively
uncommon or rare event in the sample or forecasting out-of-sample. Most methods for skill evalua-
tion for binary probability predictions were developed in meteorology without emphasizing model fit.
Murphy and Winkler (1984) provide a historical review of probability predictions in meteorology from
both theoretical and practical perspectives. Given the prevalence of binary events in economics such
as economic recessions and stock market crashes, existing economic probability forecasts should be
evaluated carefully, whether they are generated by models or judgements.
Murphy and Winkler (1987) described a general framework of forecast skill evaluation with bi-
nary probability forecasts as a special case. The basis for their framework is the joint distribution
of forecasts and observations, which contains all of the relevant statistical information. Let Y be the
binary event to be predicted and P be the predicted probability of Y = 1 based on a forecasting system.
The joint distribution of (Y,P) is denoted by f (Y,P), a bivariate distribution when only one forecast-
ing system is involved. Murphy and Winkler (1987) suggested two alternative factorizations of the
joint distribution. Consider the calibration-refinement factorization first. f (Y,P) can be decomposed
into the product of two distributions: the marginal distribution of P and the conditional distribution of
Y given P, that is, f (Y,P) = f (P) f (Y |P). For perfect forecasts, f (1|P = 1) = 1 and f (1|P = 0) = 0,
i.e., the conditional probability of Y = 1 given the forecast is exactly equal to the predicted value. In
general, it is natural to require f (1|P) = P almost surely over P and this property is called calibration
in the statistics literature; see Dawid (1984). A well-calibrated probability forecast implies that the actual
frequency of the event given each forecast value should be close to the forecast itself, and the user will
not commit a large error by taking the face value of the probability forecast as the true value. Given
a sample {Yt, Pt} of actuals and forecasts, we can plot the observed sample fraction of Y = 1 against
P, the so-called attribute diagram, to check calibration graphically. The ideal situation is that all pairs
5 Estrella (1998) and Windmeijer (1995) contain critical analyses and comparisons of most of these goodness-of-fit measures.
of (Yt, Pt) concentrate around the diagonal line; this corresponds to the so-called Mincer–Zarnowitz
regression in a rational expectations framework, cf. Lovell (1986). Seillier-Moiseiwitsch and Dawid
(1993) proposed a test to determine if in finite samples the difference between the actual and the
probability forecasts is purely due to the sampling uncertainty. This test is based on the asymptotic
approximation using the martingale central limit theorem, and is consistent in spirit with the prequen-
tial principle of Dawid (1984), which states that any assessment of a series of probability forecasts
should not depend on the way the forecast is generated. The strength of the prequential principle is
that it allows for a unified test for calibration regardless of the probability law underlying a particular
forecasting system.
The Seillier-Moiseiwitsch and Dawid (1993) calibration test groups a sequence of probability forecasts
into a small number of cells, say J cells, with the midpoint Pj taken as the estimate of the probability in each
cell. Given a sample {Yt, Pt}, the number of events Yt = 1 in the jth cell is counted and denoted by
N j. The corresponding expected count under the predicted probability is PjTj where Tj is the number
of observations in the jth cell. The calibration test for cell j becomes straightforward by constructing
the test statistic Zj = (Nj − PjTj)/√wj, where wj = TjPj(1−Pj) is the weight for cell j. Under the
null hypothesis of calibration for cell j, Zj is asymptotically normally distributed with zero mean and
unit variance, and should not lie too far out in the tails of this distribution. The overall calibration
test for all cells is then conducted using the statistic Σ_{j=1}^J Zj², which asymptotically has a χ² distribution
with J degrees of freedom; there is strong evidence against overall calibration if the statistic exceeds the
critical value at a given significance level. As an example, Lahiri and Wang (2012) find that for the current
quarter aggregate SPF forecasts of GDP declines introduced in Section 2.2, the calculated χ2 value is
8.01, which is significant at the 5% level. Thus, even at this short horizon, recorded forecasts are not
calibrated.
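A minimal sketch of this cell-based test on simulated data that is calibrated by construction (the sample size, number of cells, and midpoint approximation below are our own illustrative choices):

```python
import numpy as np
from scipy import stats

# Cell-based calibration test on simulated data; forecasts are calibrated by
# construction, so the chi-square statistic should be unremarkable.
rng = np.random.default_rng(2)
T = 400
P = rng.uniform(0, 1, T)        # probability forecasts
Y = rng.binomial(1, P)          # outcomes generated so that P is calibrated

J = 5                           # number of cells (a tuning choice)
edges = np.linspace(0, 1, J + 1)
mid = (edges[:-1] + edges[1:]) / 2           # midpoint P_j of each cell
cell = np.clip(np.digitize(P, edges) - 1, 0, J - 1)

Z2 = 0.0
for j in range(J):
    Tj = np.sum(cell == j)                   # observations in cell j
    Nj = np.sum(Y[cell == j])                # events Y_t = 1 in cell j
    wj = Tj * mid[j] * (1 - mid[j])          # weight w_j = T_j P_j (1 - P_j)
    Zj = (Nj - mid[j] * Tj) / np.sqrt(wj)    # cell-level statistic
    Z2 += Zj ** 2                            # overall chi-square statistic

p_value = 1 - stats.chi2.cdf(Z2, df=J)
```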
Calibration measures the predictive performance of probability forecasts with observed binary
outcomes. However, it is not the only criterion of primary concern in practice. Consider the naive
forecast which always predicts the marginal probability P(Y = 1). Since f (1|P) = P(Y = 1|P(Y =
1)) = P(Y = 1), it is necessarily calibrated. Generally speaking, any conditional probability forecast
P(Y = 1|Ω) for some information set Ω has to be calibrated since
P(Y = 1|P(Y = 1|Ω)) = E(E(Y |Ω)|P(Y = 1|Ω)) = P(Y = 1|Ω), (38)
by applying the law of iterated expectations. The naive forecast P(Y = 1) is a special case of this
conditional probability forecast with Ω containing only the constant term. However, forecasting with
the long run probability P(Y = 1) is typically not a good option as it does not distinguish those observations when Y = 1 from those when Y = 0. This latter property is better characterized by the marginal
distribution f (P) that is a measure of the refinement for probability forecasts and indicates how often
different forecast values are used. For the naive forecast, f (P) is a degenerate distribution with all
probability mass at P = P(Y = 1), and the forecast is said to be not refined, or not sharp. A perfectly
refined forecasting system predicts only the values 0 and 1. According to these definitions, the
aforementioned perfect forecast is not only perfectly calibrated but also refined. In contrast, the naive
forecast is perfectly calibrated but not refined at all. Any forecasting system that predicts 1 when Y = 0
and 0 when Y = 1 is still perfectly refined but not calibrated at all. Given that perfect forecasts do not
exist in reality, Gneiting et al. (2007) developed a paradigm of maximizing the sharpness subject to
calibration, see also Murphy and Winkler (1987).
The second way of factorizing f (Y,P) is to write it as the product of f (P|Y ) and f (Y ), called
the likelihood-base rate factorization, which corresponds to Edwin Mills’ Implicit Expectations hy-
pothesis; see Lovell (1986). Given a binary event Y , we have two conditional distributions, namely,
f (P|Y = 1) and f (P|Y = 0). The former is the conditional distribution of predicted probabilities in
the case of Y = 1 while the latter is the distribution for Y = 0. We would hope that f (P|Y = 1) puts
more density on higher values of P, and the opposite for f (P|Y = 0). These two distributions are the
conditional likelihoods associated with the forecast P. For perfect forecasts, f (P|Y = 1) and f (P|Y = 0)
degenerate at P = 1 and P = 0, respectively. Conversely, if f (P|Y = 0) = f (P|Y = 1) for all P, the
forecasts are said not to be discriminatory at all between the two events and provide no useful infor-
mation about the occurrence of the event. The forecast is perfectly discriminatory if f (P|Y = 1) and
f (P|Y = 0) are two distinct degenerate densities, in which case, after observing the value of P, we are
sure which event will occur. Based on this idea, Cramer (1999) suggested the use of the difference in
the means of these two conditional densities as a measure of goodness of fit. Since each mean is taken
over respective sub-samples, this measure is not unduly influenced by the success rate in the more
prevalent outcome group.
Figure 5 shows these two empirical likelihoods for the current quarter forecasts based on SPF
data; cf. Lahiri and Wang (2012). This diagram shows that the current quarter probability forecasts
discriminate between the two events fairly well, and f (P|Y = 0) puts more weight on the lower proba-
bility values than f (P|Y = 1) does. However, not enough weight is associated with higher probability
values when GDP does decline, and so the SPF forecasters appear to be somewhat conservative in this
sense.

[Figure 5: Likelihoods for Quarter 0 (source: Lahiri and Wang (2012)). The empirical likelihoods f(P|Y=0) and f(P|Y=1) are plotted against the forecast probability P.]
In the likelihood-base rate factorization, f (Y ) is the unconditional probability of each event. In
weather forecasting, this is called the base rate or sample climatology and represents the long run
frequency of the target event. Since it is only a description of the forecasting situation, it is fully
independent of the forecasting system. Murphy and Winkler (1987) took f (Y ) as the probability
forecast in the absence of any forecasting system and f (P|Y ) as the new information beyond the base
rate contributed by a forecasting system P. They emphasized the central role of joint distribution
of forecasts and observations in any forecast evaluation, and discussed the close link between their
general framework and some popular evaluation procedures widely used in practice. For example,
Brier (1950)’s score can be calculated as the sample mean squared error of forecasts and actuals or
1/T ∑Tt=1(Yt −Pt)
2 which has a range between zero and one. Perfect forecasts have zero Brier score,
and a smaller value of Brier score indicates better predictive performance. The population mean
squared error is E(Yt − Pt)² = Var(Yt − Pt) + [E(Yt) − E(Pt)]², where the first term is the variance of the
forecast errors and the second is the square of the forecast bias. Murphy and Winkler (1987) expressed
this score in terms of population moments as follows:

E(Yt − Pt)² = Var(Pt) + Var(Yt) − 2Cov(Yt, Pt) + [E(Yt) − E(Pt)]².  (39)
This decomposition reaffirms the previous statement that all evaluation procedures are based on the
joint distribution of Y and P. It shows that the performance, as measured by the mean squared error,
is not only affected by the covariance Cov(Yt ,Pt) (larger value means better performance), but also by
the marginal moments of forecasts and actuals. Suppose Y is a relatively rare event with E(Yt) close
to zero. The optimal forecast minimizing (39) is then close to the constant E(Yt), which is the naive forecast
having no skill at all. In practice, the skill score defined below, which measures the relative skill over
the naive forecast, is often used in this context:
skill score ≡ 1 − [Σ_{t=1}^T (Yt − Pt)²] / [Σ_{t=1}^T (Yt − E(Yt))²].  (40)
The reference naive forecast has no skill in that its skill score is zero, whereas a skillful forecast is
rewarded by a positive skill score. The larger the skill score, the more skillful the forecast. For the
current quarter forecasts from SPF, Lahiri and Wang (2012) calculated Brier score and skill score as
0.0668 and 0.45, respectively, which seem impressive.
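The sample analogues of the Brier score and the skill score in (40) are straightforward to compute; a toy example (the data below are illustrative, not the SPF numbers quoted in the text):

```python
import numpy as np

# Sample Brier score and skill score (40) on toy data (illustrative only).
Y = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)
P = np.array([0.1, 0.2, 0.7, 0.1, 0.6, 0.3, 0.2, 0.8, 0.1, 0.2])

brier = np.mean((Y - P) ** 2)            # Brier score
naive = np.mean((Y - Y.mean()) ** 2)     # score of the naive constant forecast E(Y)
skill = 1 - brier / naive                # skill score (40)
```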
Murphy (1973) decomposed the Brier score in terms of two factorizations of f (Y,P). In light of
the calibration-refinement factorization, it can be rewritten as:
E(Yt − Pt)² = Var(Yt) + EP[Pt − E(Yt|Pt)]² − EP[E(Yt|Pt) − E(Yt)]²  (41)
where EP(·) is the expectation operator with respect to the marginal distribution of P. This decomposi-
tion summarizes the features in two marginal distributions and f (Y |P). The second term is a measure
of calibration as it is a weighted average of the discrepancy between the face value of the probability
forecast and the actual probability of the realization given the forecast. The third term is a measure of
the difference between conditional and unconditional probabilities of Y = 1. This attribute is called
resolution by Murphy and Daan (1985). In terms of the likelihood-base rate factorization, the Brier
score can be alternatively decomposed as
E(Yt − Pt)² = Var(Pt) + EY[Yt − E(Pt|Yt)]² − EY[E(Pt|Yt) − E(Pt)]²  (42)
where EY (·) is the expectation operator with respect to the marginal distribution of Y . Instead of using
information in f (Y |P), (42) exploits information in the likelihood f (P|Y ) in addition to two marginal
distributions. The second term is a weighted average of the squared difference between the observation
and the mean forecast given observation and is supposed to be small for a good forecast. The third
term is a weighted average of the squared difference between the mean forecast given the observation
and the overall mean forecast, and measures the discriminatory power of forecasts against two events.
These two decompositions summarize different aspects of f (Y,P), and their sample analogues can be
computed straightforwardly.
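When the forecasts take a small number of distinct values, the sample version of decomposition (41) can be checked directly, since the identity holds exactly with within-group sample means; a toy verification (the data are illustrative):

```python
import numpy as np

# Sample calibration-refinement decomposition of the Brier score, eq. (41):
# Brier = Var(Y) + calibration term - resolution term. Toy data only.
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], dtype=float)
P = np.array([0.2, 0.8, 0.2, 0.2, 0.8, 0.5, 0.5, 0.8, 0.2, 0.5])

brier = np.mean((Y - P) ** 2)
ybar = Y.mean()
calibration = resolution = 0.0
for p in np.unique(P):
    m = P == p
    ybar_p = Y[m].mean()                        # sample E(Y | P = p)
    calibration += m.mean() * (p - ybar_p) ** 2
    resolution += m.mean() * (ybar_p - ybar) ** 2

reconstructed = ybar * (1 - ybar) + calibration - resolution
```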
Yates (1982) suggested an alternative decomposition of the Brier score which isolated individual
components capturing distinct features of f (Y,P) in the same spirit as Murphy and Winkler's general
framework. Yates’ decomposition, popular in psychology, is derived from the usual interpretation of
the mean squared error (39) in terms of variance and squared bias. Note that Var(Yt) = E(Yt)[1−
E(Yt)] and Cov(Yt ,Pt) = [E(Pt |Yt = 1)− E(Pt |Yt = 0)]E(Yt)[1− E(Yt)]. We get Yates’ covariance
decomposition by plugging these into (39), using the definition VarP,min(Pt) ≡ [E(Pt|Yt = 1) − E(Pt|Yt = 0)]² E(Yt)[1 − E(Yt)], and obtain

E(Yt − Pt)² = E(Yt)[1 − E(Yt)] + ∆Var(Pt) + VarP,min(Pt) − 2Cov(Yt, Pt) + [E(Yt) − E(Pt)]²,  (43)
where ∆Var(Pt)≡Var(Pt)−VarP,min(Pt) by definition. The first term E(Yt)[1−E(Yt)] is the variance
of the binary event and thus is independent of forecasts. It is close to zero when either E(Yt) or
1−E(Yt) is very small. Given this property, a comparison across several forecasts with different targets
based on the overall Brier score may be misleading, because two target events tend to have different
marginal distributions, and the discrepancy of the scores is likely to solely reflect the differential of the
marginal distributions, thus saying nothing about the real skill. Yates regarded E(Yt)[1−E(Yt)] as the
Brier score of the naive forecast mentioned before, and showed that it is the minimal achievable value
for a constant probability forecast. It is the remaining part, that is, E(Yt−Pt)2−E(Yt)[1−E(Yt)], that
matters for evaluation purposes.
The term [E(Yt)− E(Pt)]2 measures the magnitude of the global forecast bias and is zero for
unbiased forecasts. In contrast to perfect calibration, which requires the conditional probability to
be equal to the face value almost surely, Yates called this calibration-in-the-large. It says that the
unconditional probability of Y = 1 should match the average predicted values. Cov(Yt ,Pt) describes
how responsive a forecast is to the occurrence of the target event, both in terms of the direction and the
magnitude. A skillful forecast ought to identify and explore this information in a sensitive and correct
manner. It is apparent that small Var(Pt) is desired, but this is not everything. A typical example is
the naive forecast with zero variance but no skill as well. VarP,min(Pt) is the minimum variance of Pt
given any value of the covariance Cov(Yt ,Pt), and ∆Var(Pt) is the excess variance which should be
minimized. The minimal variance VarP,min(Pt) is achieved only when ∆Var(Pt) = 0 for which Pt = P1
on all occasions of Yt = 1, and Pt = P0 on other occasions and the variation of forecasts is due to the
event’s occurrence. In this sense, Yates called ∆Var(Pt) the excess variability of forecasts and it is not
zero when the forecast is responsive to information that is not related to the event's occurrence. Using
the current quarter SPF forecasts, Lahiri and Wang (2012) found that the excess variability was 53%
of the total forecast variance of 0.569. For longer horizons, excess variability increases rapidly and
indicates an interesting characteristic of these forecasts. Overall, Yates' decomposition stipulates
that a skillful forecast is expected to be unbiased and highly sensitive to relevant information, but
insensitive to irrelevant information. Yates (1982) emphasized the importance of resolution rather than
the conventional focus on calibration in probability forecast evaluation; see also Toth et al. (2003).
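The components of Yates' decomposition (43) can likewise be computed from sample moments; the following toy check (illustrative data) verifies that they reassemble the Brier score exactly:

```python
import numpy as np

# Sample analogue of Yates' covariance decomposition (43); with
# population-form sample moments the identity holds exactly. Toy data only.
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], dtype=float)
P = np.array([0.2, 0.7, 0.3, 0.1, 0.8, 0.5, 0.4, 0.9, 0.2, 0.3])

ybar, pbar = Y.mean(), P.mean()
var_y = ybar * (1 - ybar)                        # E(Y)[1 - E(Y)]
cov = np.mean((Y - ybar) * (P - pbar))           # Cov(Y, P)
slope = P[Y == 1].mean() - P[Y == 0].mean()      # E(P|Y=1) - E(P|Y=0)
var_p_min = slope ** 2 * var_y                   # Var_{P,min}(P)
dvar_p = np.mean((P - pbar) ** 2) - var_p_min    # excess variability
bias2 = (ybar - pbar) ** 2                       # calibration-in-the-large

brier = np.mean((Y - P) ** 2)
reconstructed = var_y + dvar_p + var_p_min - 2 * cov + bias2
```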
Although the Brier score is extensively used in probability forecast evaluation, it is not the only
choice. Alternative scores characterizing other features of the joint distribution exist. Two lead-
ing examples are the average absolute deviation, E(|Yt − Pt|), and the logarithmic score,
−E[Yt log(Pt) + (1 − Yt) log(1 − Pt)]. In general, any function with (Yt, Pt) as arguments can be taken
as a score. In the theoretical literature, a subclass called proper scoring rules is comprised of functions
satisfying
E[S(Yt, P∗t)] ≤ E[S(Yt, Pt)],  ∀Pt ∈ [0,1],  (44)
where S(·, ·) is the score function with the observation as the first argument and the forecast as the
second, and P∗t is the underlying true conditional probability. If P∗t is the unique minimizer of the
expected score, S(·, ·) is called a strictly proper scoring rule. It can be easily shown that the Brier
score and the logarithmic score are proper, while the absolute deviation is not. Gneiting and Raftery
(2007) pointed out the importance of using proper scores for evaluation purposes and provided an
example to demonstrate the problem associated with improper scores. Schervish (1989) developed an
intuitive way of constructing a proper scoring rule that has a natural economic interpretation in terms
of the loss associated with a decision problem based on forecasts. He also generated a proper scoring
rule that is equal to the integral of the expected loss function evaluated at the threshold value with
respect to a measure defined on the unit interval, and discussed the connection between calibration and a
proper scoring rule. Gneiting (2011) argued that a consistent scoring function or an elicitable target
functional (the mean in our context) ought to be specified ex ante if forecasts are to be issued and
evaluated. Thus, it does not make sense to evaluate probability forecasts using the absolute deviation,
which is not consistent for the mean.
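A quick numeric check of properness in the sense of (44): with a true conditional probability of, say, p∗ = 0.3 (an illustrative value), the expected Brier score is minimized by reporting p∗ itself, while the expected absolute deviation pushes the reported probability to a corner:

```python
import numpy as np

# Expected score of reporting P when the true probability is p* = 0.3:
# the Brier score is proper (minimized at P = p*), the absolute deviation
# is not (minimized at P = 0).
p_true = 0.3
grid = np.linspace(0, 1, 101)            # candidate reported probabilities

exp_brier = p_true * (1 - grid) ** 2 + (1 - p_true) * grid ** 2
exp_absdev = p_true * (1 - grid) + (1 - p_true) * grid

best_brier = grid[np.argmin(exp_brier)]      # honest report, equals p*
best_absdev = grid[np.argmin(exp_absdev)]    # corner report, equals 0
```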
Up to this point, all evaluations are carried out through a number of proper scoring rules. If we
have more than one competing forecasting model targeting the same event and a large sample track-
ing the forecasts, scores can be calculated and compared. For example, in terms of the Brier score
(1/T)Σ_{t=1}^T (Yt − Pt)², model A with a larger score is considered to be a worse performer than model B.
Lopez (2001), based on Diebold and Mariano (1995), proposed a new test constructed from the sam-
ple difference between two scores, allowing for asymmetric scores, non-Gaussian and nonzero mean
forecast errors, serial correlation among observations, and contemporaneous correlation between fore-
casts. Here we replace the objective function of Diebold and Mariano (1995) by a generic proper
scoring rule. Let S(Yt ,Pti) be the score value of the ith (i = 1 or 2) model for observation t. It is often
assumed to be a function of the forecast error defined by eti ≡ Yt −Pti, that is, S(Yt ,Pti) = f (eti). The
method works equally well for more general cases where the functional form of S(·, ·) is not restricted
in this way. In addition, let dt = f (et1)− f (et2) be the score differential between 1 and 2. The null
hypothesis of no skill differential is stated as E(dt) = 0.
Suppose the score differential series dt is covariance stationary and has short memory. The
standard central limit theorem for dependent data can be used to establish the asymptotic distribution
of the test statistic under E(dt) = 0 as

√T (d̄ − E(dt)) →d N(0, 2π fd(0)),  (45)

where

d̄ = (1/T) Σ_{t=1}^T dt  (46)

is the sample mean of score differentials,

fd(0) = (1/2π) Σ_{τ=−∞}^{∞} γd(τ)  (47)

is the spectral density of dt at frequency zero, and γd(τ) = E[(dt − E(dt))(dt−τ − E(dt))] is the autocovariance of dt at lag τ. The t statistic is thus

t = d̄ / √(2π f̂d(0)/T),  (48)

where f̂d(0) is a consistent estimator of fd(0). Estimation of fd(0) based on lag truncation methods is
quite standard in time series econometrics, see Diebold and Mariano (1995) for more details. The key
idea is that only very weak assumptions about the data generating process are imposed and neither
serial nor contemporaneous correlation is ruled out by these assumptions. Implementation of this
procedure is quite easy as it is simply the standard t test of a zero mean for a single population after
adjusting for serial correlation. Thus, while comparing the current quarter SPF forecasts with the naive
constant forecast given by the sample proportion, Lahiri and Wang (2012) found the Lopez t statistic
to be -2.564, suggesting the former to have significantly lower Brier score than the naive forecast at
the usual 5% level.
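A bare-bones sketch of this score-differential t test with a Brier score differential, using a Bartlett (Newey–West) lag truncation to estimate 2π fd(0) (the simulated data and the lag length are our own illustrative choices, not the SPF application):

```python
import numpy as np

# Score-differential t test (48) on simulated data; forecast 1 is constructed
# to be more accurate than forecast 2, so t_stat should be clearly negative.
rng = np.random.default_rng(3)
T = 300
p_true = rng.uniform(0.1, 0.9, T)
Y = rng.binomial(1, p_true).astype(float)
P1 = np.clip(p_true + rng.normal(0, 0.05, T), 0, 1)   # accurate forecast
P2 = np.clip(p_true + rng.normal(0, 0.30, T), 0, 1)   # noisy forecast

d = (Y - P1) ** 2 - (Y - P2) ** 2        # Brier score differential d_t
dbar = d.mean()

L = 5                                    # lag truncation, a tuning choice
lrv = np.var(d)                          # gamma_d(0)
for tau in range(1, L + 1):
    gamma = np.mean((d[tau:] - dbar) * (d[:-tau] - dbar))
    lrv += 2 * (1 - tau / (L + 1)) * gamma   # Bartlett-weighted autocovariances

t_stat = dbar / np.sqrt(lrv / T)         # negative values favor forecast 1
```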
West (1996) developed procedures for asymptotic inference about the moments of a smooth score
based on out-of-sample prediction errors. If predictions are generated by econometric models, these
procedures adjust for errors in the estimation of the model parameters. The conditions are also given,
under which ignoring this estimation error would not affect out-of-sample inference. This framework
is neither more general nor a special case of the Diebold-Mariano approach and thus should be viewed
as complementary. Note that the Diebold-Mariano test is not applicable when two competing forecasts
cannot be treated as coming from two nonnested models. However, if we think of the null hypothesis
as the two forecast series having equal finite sample forecast accuracy, then, the Diebold-Mariano test
statistic as a standard normal approximation gives a reasonably-sized test of the null in both nested and
non-nested cases, provided that the long run variances are estimated properly and the small-sample
adjustment of Harvey et al. (1997) is employed; see Clark and McCracken (2012).
Another useful tool for probability forecast evaluation, popular in medical imaging, meteorology
and psychology, that has not received much attention in economics is the Receiver Operating Charac-
teristic (ROC) analysis; see Berge and Jorda (2011) for a recent exception. Given the joint distribution
f (Y,P) and a threshold value which is a number between zero and one, we can calculate two condi-
tional probabilities: the hit rate and the false alarm rate. Let P∗ be a threshold; Ŷt = 1 is predicted if and
only if Pt ≥ P∗, that is, P∗ transforms a continuous probability forecast into a binary point forecast.
Table 1 presents the joint distribution of this forecast and the realization under a generic P∗. In this 2×2
contingency table, πij is the joint probability of (Ŷ = i, Y = j), while πi. and π.j are the marginal probabilities of Ŷ = i and Y = j, respectively. The hit rate (H) is the conditional probability of Ŷ = 1 given
Y = 1, that is, H ≡ π_{Ŷ=1|Y=1} = π11/π.1, and it tells the chance that Y = 1 is correctly predicted when it
does happen.
In contrast, the false alarm rate (F) is the conditional probability of Ŷ = 1 given Y = 0, that is, F ≡
π_{Ŷ=1|Y=0} = π10/π.0, and it measures the fraction of incorrect forecasts when Y = 1 does not occur.
Although these two probabilities appear to be constant for a given sample, they are actually functions
of P∗. If P∗ = 0 ≤ Pt for all t, then Ŷ = 1 would always be predicted. As a result, both the hit and false alarm rates equal one. Conversely, only Ŷ = 0 would be issued, and both probabilities are zero, when
P∗ = 1. For interior values of P∗, H and F fall within [0,1]. Their relationship due to the variation
of P∗ can be depicted by tracing out all possible pairs of (F(P∗),H(P∗)) for P∗ ∈ [0,1]. This graph
plotted with the false alarm rate on the horizontal axis and the hit rate on the vertical axis is called the
Receiver Operating Characteristic curve. Its typical shape for a skillful probability forecast is shown
in Figure 6.
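The mapping from a threshold P∗ to the pair (F, H), and hence an empirical ROC curve, can be traced out in a few lines. The sketch below uses made-up forecasts and outcomes, not data from this chapter:

```python
# Sketch: tracing an empirical ROC curve from probability forecasts.
# The forecast/outcome sample below is illustrative only.

def hit_false_alarm(probs, actuals, p_star):
    """Return (F, H): false alarm rate and hit rate at threshold p_star."""
    # Predict Yhat = 1 whenever the probability forecast reaches p_star.
    hits = sum(1 for p, y in zip(probs, actuals) if y == 1 and p >= p_star)
    misses = sum(1 for p, y in zip(probs, actuals) if y == 1 and p < p_star)
    fas = sum(1 for p, y in zip(probs, actuals) if y == 0 and p >= p_star)
    rejs = sum(1 for p, y in zip(probs, actuals) if y == 0 and p < p_star)
    return fas / (fas + rejs), hits / (hits + misses)

probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1, 1, 0, 1, 0, 1, 0, 0]

# Sweep thresholds to trace the curve (F on the x-axis, H on the y-axis).
roc = [hit_false_alarm(probs, actuals, t) for t in (0.0, 0.25, 0.5, 0.75, 1.01)]
# P* = 0 yields (F, H) = (1, 1); a P* above every forecast yields (0, 0).
```

Plotting the pairs in `roc` reproduces the shape of Figure 6 for a skillful forecast; the last threshold is set above one so that no event is ever predicted.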
Figure 6: A typical ROC curve
In categorical data analysis, H is often called the sensitivity and 1−F = πŶ=0|Y=0 the specificity. Both measure the fraction of correct forecasts and are expected to be high for skillful forecasts.
(F(P∗),H(P∗)), corresponding to a particular threshold P∗, is only one point on the ROC curve which
consists of all such points for possible values of P∗.
The ROC curve can be constructed in an alternative way based on the likelihood-base rate factor-
ization f (Y,P) = f (P|Y ) f (Y ). Given a threshold P∗, H is the integral of f (P|Y = 1),
H = ∫_{P∗}^{1} f (P|Y = 1)dP, (49)

and F is the integral of f (P|Y = 0) over the same domain,

F = ∫_{P∗}^{1} f (P|Y = 0)dP. (50)
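Equations (49) and (50) can be checked numerically. The sketch below takes Beta-shaped likelihoods as purely illustrative choices for f (P|Y = 1) and f (P|Y = 0) and computes the two upper-tail integrals by a midpoint Riemann sum:

```python
# Sketch of Eqs. (49)-(50): H and F as upper-tail integrals of the two
# likelihoods. The Beta(4,2) and Beta(2,4) shapes are illustrative only.

def beta_kernel(p, a, b):
    # Unnormalized Beta(a, b) density; normalization is handled below.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def tail_prob(a, b, p_star, n=10000):
    # P(P >= p_star) under Beta(a, b), via a midpoint Riemann sum.
    grid = [(i + 0.5) / n for i in range(n)]
    total = sum(beta_kernel(p, a, b) for p in grid)
    upper = sum(beta_kernel(p, a, b) for p in grid if p >= p_star)
    return upper / total

for p_star in (0.25, 0.5, 0.75):
    H = tail_prob(4, 2, p_star)  # f(P|Y=1): mass concentrated near one
    F = tail_prob(2, 4, p_star)  # f(P|Y=0): mass concentrated near zero
    print(p_star, round(H, 3), round(F, 3))
# Both H and F fall as P* rises, with H > F throughout: a skillful forecast.
```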
Table 1: Joint distribution of binary point forecast Ŷ and observation Y

                Y = 1   Y = 0   Row total
Ŷ = 1            π11     π10     π1.
Ŷ = 0            π01     π00     π0.
Column total     π.1     π.0     1
Figure 7 illustrates these two densities along with three values of P∗; the three thresholds shown correspond to (hit rate, false alarm rate) pairs of (97.5%, 84%), (84%, 50%), and (50%, 16%).
Figure 7: f (P|Y = 1)(right), f (P|Y = 0)(left) and three values of P∗
In this graph, H is the area under f (P|Y = 1) to the right of P∗, while F is the corresponding area under f (P|Y = 0). As the vertical threshold line shifts rightward, both areas shrink, and both H and F decline. In one extreme, where P∗ = 0, both areas equal one. In the other extreme, where P∗ = 1, they
equal zero. Figure 7 reveals the tradeoff between H and F : they move together in the same direction
as P∗ varies and the scenario (H = 1,F = 0) is generally unobtainable unless the forecast is perfect.
This relationship is also apparent from the upward-sloping ROC curve in Figure 6. Deriving the ROC curve from the likelihood-base rate factorization is in the same spirit as Murphy and Winkler's general
framework. To see this, consider the likelihoods of two systems (A and B) predicting the same event,
see Figure 8 below.
Let us assume that the likelihoods when Y = 1 are exactly the same for both A and B, while
the likelihoods when Y = 0 share the same shape but center at different locations. The likelihood
f (P|Y = 0) for A is symmetric around a value that is less than the corresponding value for B. In the
terminology of the likelihood-base rate factorization, A is said to have a higher discriminatory ability
than B because its f (P|Y = 0) is farther apart from f (P|Y = 1) and is thus more likely to distinguish
the two cases. Consequently, A has a higher forecast skill, which should be reflected by its ROC curve
as well. This result is supported by considering any threshold value represented by a vertical line in
this graph. As discussed before, the area of f (P|Y = 0) for A lying on the right of the threshold (A’s
false alarm rate) is always smaller than that for B, and this is true for any threshold. On the other
Figure 8: Likelihoods for forecasts A and B with a common threshold
hand, since f (P|Y = 1) is identical for both A and B, hit rates defined as the area of f (P|Y = 1) on the
right of the vertical line are the same for both. Therefore, A is more skillful than B, which is shown in
Figure 9 where the ROC curve of A always lies to the left of B for any fixed H.
Figure 9: ROC curves for A and B with different skills
The ROC curve is a convenient graphical tool to evaluate forecast skill and can be used to facilitate
comparison among competing forecasting systems. To see this, consider three special curves in the
unit box. The first one is the 45 degree diagonal line on which H = F . The probability forecast, which
has an ROC curve of this type, is a random forecast that is statistically independent of the observation. As a result, H and F are identical, and both equal the integral of the marginal density of the probability forecast over the domain [P∗,1]. One example is the naive forecast. Probability forecasts whose ROC curve is the diagonal line have no skill and are often taken as the benchmark to be compared with
other forecasts of interest. For a perfect forecast, the corresponding ROC curve is the left and upper
boundaries of the unit box. Most probability forecasts in real life situations fall in between, and their
ROC curves lie in the upper triangle, like the one shown in Figure 6. Since higher hit rate and lower
false alarm rate are always desired, the ROC curve lying farther from the diagonal line indicates higher
skill. A curve in the lower triangle appears to be even worse than the random forecast at first sight, but
it can potentially be relabeled to be useful.
Given a sample, there are two methods of plotting the ROC curve: parametric and nonparametric. In the parametric approach, some distributional assumptions about the likelihoods f (P|Y = 1) and f (P|Y = 0) are necessary. A typical example is the normal distribution. However, it is not a sensible choice on its own, given that the range of P is limited to the unit interval. Nevertheless, we can always transform P into a variable with unlimited range; for instance, the inverse of any normal distribution function (such as the probit transformation) suffices for this purpose. The parameters of this distribution are estimated from a sample, and the fitted ROC curve
can be plotted by varying the threshold in the same way as when deriving the population curve. This
approach, however, is subject to misspecification like any parametric method. In contrast, nonpara-
metric estimation does not need such stringent assumptions and can be carried out based on data alone.
Fawcett (2006) provides an illustrative example with computational details. Fortunately, most current
commercial statistical packages like Stata have built-in procedures for generating ROC graphs.
Sometimes, a single statistic summarizing information contained in an ROC curve is warranted.
There are two alternatives: one measures the local skill for a threshold of primary interest, while the
other measures global skill over all thresholds. For the former, there are two statistics most commonly
used. The first one is the smallest Euclidean distance between the point (0,1) and the ROC curve. This is motivated by observing that the ROC curve of a more skillful probability forecast is often closer to (0,1). The second statistic, the Youden index, is the maximal vertical gap between the diagonal and the ROC curve (that is, the hit rate minus the false alarm rate). The global measure is the area
under the ROC curve (AUC). For random forecasts, the AUC is one half while it is one for perfect
forecasts. The larger AUC thus implies higher forecast skill. Calculation of the AUC proceeds in two
ways depending on the approach used to estimate the ROC curve. For parametric estimation, the AUC
is the integral of a smooth curve over the domain [0,1]. For nonparametric estimation, the empirical
ROC curve is a step function and its integral is obtained by summing areas of a finite number of
trapezia. If the underlying ROC curve is smooth and concave, the AUC computed in this way is bound
to underestimate the true value in a finite sample. Note that these two measures may not concord with
each other in the sense that they may give conflicting judgements regarding forecast skill. Figure 10
illustrates a situation like this.
Figure 10: ROC curves for two forecasts: A and B
In Figure 10, dA and dB denote the smallest Euclidean distances from the point (0,1) to the ROC curves of A and B, respectively (with dA > dB), so A is slightly less skillful in terms of this local criterion. However, the AUC of A is larger than that of B. Conflict between
these two raises a question in practice as to which one should be used. Often, there is no universal
answer and it depends on the adopted loss function. Mason and Graham (2002), Mason (2003),
Cortes and Mohri (2005), Faraggi and Reiser (2002), Liu et al. (2005), among others, proposed and
compared estimation and inference methods concerning AUC in large data sets. These include, but
are not limited to, the traditional test based on the Mann-Whitney U-statistic, an asymptotic t-test, and
bootstrap-based tests. Using these procedures in large samples, we can answer questions like: “Does
a forecasting system have any skill?”, “Is its AUC significantly larger than 1/2?”, or “Is the AUC of forecast A significantly larger than that of B in the population?”
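Both summary statistics are easy to compute from a set of empirical ROC points. A minimal sketch, using illustrative (F, H) pairs rather than any data set from this chapter:

```python
# Sketch: the local (Youden index) and global (trapezoidal AUC) summaries
# of an ROC curve, from illustrative (F, H) points sorted by F.

roc_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

# Global skill: area under the ROC curve by the trapezoidal rule. For a
# smooth concave curve this underestimates the true AUC in finite samples.
auc = sum((f2 - f1) * (h1 + h2) / 2
          for (f1, h1), (f2, h2) in zip(roc_points, roc_points[1:]))

# Local skill: the Youden index, the maximal vertical gap H - F.
youden = max(h - f for f, h in roc_points)

print(auc, youden)
```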
ROC analysis was initially developed in the field of signal detection theory, where it was used to
evaluate the discriminatory ability for a binary detection system to distinguish between two clearly-
defined possibilities: signal plus noise and noise only. Thereafter, it has gained increasing popularity
in many other related fields. For a general treatment of ROC analysis, the readers are referred to Egan
(1975), Swets (1996), Zhou et al. (2002), Wickens (2001), and Krzanowski and Hand (2009), just to
name a few. For economic forecasts, Lahiri and Wang (2012) evaluated the SPF probability forecasts
of real GDP declines for the U.S. economy using the ROC curve. Figure 11, taken from this paper for
the current quarter forecasts, shows that at least for the current quarter, the SPF is skillful.
Figure 11: ROC curve with 95% confidence band for Quarter 0 (source: Lahiri and Wang (2012))
3.1.2 Evaluation of forecast value
For calculating the forecast value, one needs more information than what is contained in the measures
of association between forecasts and realizations. Let L(a,Y ) be the loss of a decision maker when
(s)he takes the action a and the event Y is realized in the future. Here, like in the banker’s problem,
only the scenario with two possible actions (e.g. making a loan or not) coupled with a binary event
(e.g. default or not) is considered. This setup is simple, yet it fits a large number of real-life decision-making scenarios in economics.
First, we need to show that a separate analysis of forecast value is necessary. The following
example suffices to this end. Suppose A and B are two forecasts targeting the same binary event Y. Tables 2 and 3 summarize the predictive performance of each.
Table 2: Contingency table cross-classifying forecasts of A and observations Y

                Y = 1   Y = 0   Row total
Ŷ = 1             20     100      120
Ŷ = 0             23     997     1020
Column total      43    1097     1140
Here A and B are 0/1 binary point forecasts. If forecast skill is measured by the Brier score, then A performs better than B, since its score of about 10.79% is less than B's 17.54%. Does the
same conclusion hold in terms of forecast value? To answer this question, we have to specify the loss
function L(a,Y ) first. Without loss of generality, suppose the decision rule is given by a = 1 if Ŷ = 1 is predicted and a = 0 otherwise. The loss is described in Table 4.
This loss function has some special features: it is zero when the event is correctly predicted; the
losses associated with incorrect forecasts are not symmetric in that the loss for a = 0 when the event
Y = 1 occurs is much larger than that when a = 1 and the event Y = 1 does not occur. Loss functions
of this type are typical when the target event Y = 1 is rare but people incur a substantial loss once it
takes place, such as a dam collapse or financial crisis. The overall loss of A is 10×100+5000×23 =
116000 which is much larger than that of B (10×197+5000×3 = 16970). This example shows that
the superiority of A in terms of skill does not imply its usefulness from the standpoint of a forecast
user. An evaluation of forecast value needs to be carried out separately.
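The whole example can be reproduced from the counts in Tables 2-4; in the sketch below, cells are indexed (forecast, outcome):

```python
# Sketch of the skill-versus-value example: A beats B on the Brier score
# but loses badly once the asymmetric losses of Table 4 are applied.

T = 1140
counts_A = {(1, 1): 20, (1, 0): 100, (0, 1): 23, (0, 0): 997}   # Table 2
counts_B = {(1, 1): 40, (1, 0): 197, (0, 1): 3, (0, 0): 900}    # Table 3
loss = {(1, 1): 0, (1, 0): 10, (0, 1): 5000, (0, 0): 0}         # Table 4

def brier(counts):
    # For 0/1 point forecasts the Brier score is the misclassification rate.
    return (counts[(1, 0)] + counts[(0, 1)]) / T

def total_loss(counts):
    return sum(counts[cell] * loss[cell] for cell in counts)

print(brier(counts_A), brier(counts_B))            # about 0.1079 vs 0.1754
print(total_loss(counts_A), total_loss(counts_B))  # 116000 vs 16970
```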
Thompson and Brier (1955) and Mylne (1999) examined forecast values in the simple cost/loss
decision context in which L(1,1) = L(1,0) = C > 0, L(0,1) = L > 0, and L(0,0) = 0. C is cost and
L is loss. This model simplifies the analysis by summarizing the loss function into two values: cost
and loss; and its result can be conveyed visually as a consequence. Loss functions of this type are
suitable in a context such as the decision to purchase insurance by a consumer, where two actions are
“buy insurance” or “do not buy insurance”, which lead to different losses depending on whether the
adverse event occurs in the future. If one buys the insurance (a = 1), (s)he is able to protect against
the effects of adverse event by paying a cost C, whereas occurrence of adverse event without benefit
of this protection results in a loss L. If the consumer knows the marginal probability that the adverse
event would occur at the moment of decision, the problem boils down to comparing expected losses
by two actions.

Table 3: Contingency table cross-classifying forecasts of B and observations Y

                Y = 1   Y = 0   Row total
Ŷ = 1             40     197      237
Ŷ = 0              3     900      903
Column total      43    1097     1140

Table 4: Loss function associated with the 2×2 decision problem

          Y = 1   Y = 0
a = 1         0      10
a = 0      5000       0

On the one hand, (s)he has to pay C irrespective of the event if (s)he decides to buy the insurance, and her/his expected loss would equal PL if (s)he does not do so, where P is the
marginal probability of Y = 1 perceived by the consumer. The optimal decision rule is thus a = 1 if
and only if P ≥C/L, and the lowest expected loss resulting from this rule is min(PL,C) denoted by
ELclim. Now, suppose the consumer has access to perfect forecasts. Then, the minimum expected loss
would be ELperf ≡ PC, which is no larger than ELclim, given that P ∈ [0,1] and C ≤ L. The difference ELclim − ELperf measures the gain of a perfect forecast relative to the naive forecast. The
more realistic situation is that the probability forecast under consideration improves upon the naive
forecast, but is not perfect. Wilks (2001) suggested the value score (VS) to measure the value of a
forecasting system where
VS = (ELclim − ELP)/(ELclim − ELperf), (51)
and ELP denotes the expected loss of the forecasting system P. The value score defined in this way
can be interpreted as the expected economic value of the forecasts of interest as a fraction of the value
of perfect forecasts relative to naive forecasts. Its value lies in (−∞,1] and it is positively oriented in
the sense that higher VS means larger forecast value. Naive forecasts and perfect forecasts have VS 0
and 1, respectively. Note that VS may be negative, indicating that it is better to use the naive forecast
of no skill in these cases. However, Murphy (1977) demonstrated that VS must be nonnegative for
any forecasting system with perfect calibration; thus any perfectly calibrated probability forecast is at
least as useful as the naive forecast. This illustrates the interplay between forecast skill and forecast
value.
Given a probability forecast Pt , VS can be calculated from f (Pt ,Yt), the joint distribution of fore-
casts and observations, and the loss function. To accomplish this, the joint distribution of (a,Y ) must
be derived first where the optimal action depends on consumer’s knowledge of f (Pt ,Yt). Given the
forecast Pt , the conditional probability of the event is f (Yt = 1|Pt) which corresponds to the second
element in the calibration-refinement factorization of f (Pt ,Yt), and the optimal decision rule takes the
form specified above: a = 1 if and only if P(Yt = 1|Pt)≥C/L. Therefore, the cost/loss ratio C/L is the
optimal threshold for translating a continuous probability P(Yt = 1|Pt) into a binary action. Given C/L,
the joint probability of (a = 1, Y = 1) is thus equal to π11 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L) f (Pt, Yt = 1)dPt, where I(·) is the indicator function which is one only when the condition in (·) is met. Likewise, we
can calculate the other three joint probabilities, listed as follows:

π10 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L) f (Pt, Yt = 0)dPt;
π01 ≡ ∫ I(P(Yt = 1|Pt) < C/L) f (Pt, Yt = 1)dPt;
π00 ≡ ∫ I(P(Yt = 1|Pt) < C/L) f (Pt, Yt = 0)dPt. (52)
Based on these results, the expected loss ELP is the weighted average of L(a,Y ) with the above
probabilities πi j as weights:
ELP = (π11 +π10)C+π01L (53)
which is then plugged into (51) to get VS. Note that in this derivation, not only is the information
contained in f (Pt ,Yt) used, but the cost/loss ratio, which is user-specific, plays a role as well. This
observation reconfirms our previous argument that the forecast value is a mixture of objective skill
and subjective loss. If f (Pt ,Yt) is fixed, ELP is a function of C and L. Wilks (2001) proved a stronger
result that VS is only a function of C/L, so that only the ratio matters. For this reason, we can plot
VS against cost/loss ratio in a simple 2-dimensional diagram. In other decision problems, where the
loss function takes a more general rather than the current cost/loss form, VS can be calculated in the
same fashion as before, but the resulting VS as a function of four loss values cannot be shown by a 2
or 3-dimensional diagram.
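For a perfectly calibrated forecast with a discrete support, the whole VS curve follows from Eqs. (51)-(53) in a few lines. The support and weights below are invented for illustration:

```python
# Sketch of Eqs. (51)-(53): the value score as a function of the cost/loss
# ratio c = C/L, for a calibrated forecast (P(Y=1|P_t) = P_t) whose
# marginal distribution puts the weights below on a small support.

support = [0.1, 0.4, 0.8]   # possible forecast values P_t (illustrative)
weights = [0.5, 0.3, 0.2]   # their marginal probabilities
pi = sum(p * w for p, w in zip(support, weights))   # base rate P(Y=1)

def value_score(c, L=1.0):
    C = c * L
    el_clim = min(pi * L, C)   # naive forecast: act on the base rate alone
    el_perf = pi * C           # perfect foresight: pay C only when Y = 1
    # Forecast user: choose a = 1 iff P(Y=1|P_t) = P_t >= C/L.
    el_p = sum(w * (C if p >= c else p * L) for p, w in zip(support, weights))
    if el_clim == el_perf:     # endpoints c = 0 and c = 1 carry no value
        return 0.0
    return (el_clim - el_p) / (el_clim - el_perf)

for c in (0.05, 0.2, 0.5, 0.9):
    print(c, round(value_score(c), 3))
# Calibration keeps VS nonnegative for every c, as Murphy (1977) implies.
```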
Figure 12 plots VS against the cost/loss ratio of a probability forecast. Note that the domain of
interest is the unit interval between zero and one, as the nonnegative cost C is assumed to be less
than the loss L. The two points (0,0) and (1,0) must lie on the VS curve because, when C/L = 0, a = 1 is adopted, resulting in ELclim = ELP = VS = 0; on the other hand, when C/L = 1, a = 0 is adopted, with ELclim = ELP = PC, which again implies a zero VS. In this graph, the probability forecast is not
calibrated, as the VS curve lies beneath zero for some cost/loss ratios. Krzysztofowicz (1992) and
Krzysztofowicz and Long (1990) showed that recalibration (i.e., relabeling) of such forecasts will not
change the refinement but can improve the value score over the entire range of cost/loss ratios, which
again is evidence that forecast skill would affect the forecast value. For the ROC curve, however, Wilks
(2001) demonstrated that even with such recalibration, the recalibrated ROC curve will not change.
Wilks (2001) hence concluded that “the ROC curve is best interpreted as reflecting potential rather
than actual skill” and it is insensitive to calibration improvement. Further details on the interaction of
Figure 12: An artificial value score curve
skill and value measured by other criteria are available in Richardson (2003).
The value score curve lends support for the use of probability forecasts instead of binary point
forecasts. For the latter, only 0/1 values are issued without any uncertainty measurement. Suppose
there is a community populated by more than one forecast user, and each one has his own cost/loss
ratio. Initially, the single forecaster serving the community produces a probability forecast Pt , and
then changes it into a 0/1 prediction by using a threshold P∗, which is announced to the community.
The threshold P∗ determines a unique 2×2 contingency table, and the value score for any given C and
L can be calculated. As a result, the value score curve as a function of the cost/loss ratio can be plotted
as well. Richardson (2003) pointed out that this VS curve is never located higher than that generated
by probability forecasts Pt for any cost/loss ratio on [0,1]. This result is obvious since the optimal P∗
for the community as a whole may not be optimal for all users. If the forecaster provides a probability
forecast Pt instead of a binary point forecast, each user has larger flexibility to choose her/his action
according to his/her own cost/loss ratio, and this would minimize the individual expected loss. A
single forecaster without knowing the distribution of cost/loss ratios across individuals is likely to
give a sub-optimal 0/1 forecast for the whole community.
Similar to the ROC analysis, we often need a single quantity, like the AUC, to measure the overall value of a probability forecast. A natural choice is the integral of the VS curve over [0,1]. This may be justified
by a uniform distribution of cost/loss ratios, which means that forecast values are equally weighted
for all cost/loss ratios. Wilks (2001) proved that this integral is equivalent to the Brier score. This is
a special case where forecast value is completely determined by forecast skill. This may not be true
generally. Wilks (2001) suggested using a beta distribution on the domain [0,1], with two parameters
(α,β), to describe the distribution of cost/loss ratios, as it allows for a very flexible representation
of how C/L spreads across individuals by specifying only two parameters. For example, α = β = 1
yields the uniform distribution with equal weights. The weighted average of value scores (WVS) is
WVS ≡ ∫_0^1 VS(C/L) b(C/L; α, β) d(C/L), (54)

where VS(C/L) is the value score as a function of the cost/loss ratio and b(C/L; α, β) is the beta density with parameters α and β. Wilks (2001) found that this overall measure of forecast value is very sensitive to
the choice of parameters. In practice, it is impossible for a forecaster to know this distribution exactly
since the cost/loss ratio is user-dependent and may involve cost and loss in some mental or utility unit.
Therefore the application of WVS in forecast evaluation practice calls for extra caution. However,
even if one has a perfect awareness of the cost/loss distribution and ranks a collection of competing
forecasts by WVS, this rank cannot be interpreted from the perspective of a particular end user. After
all, WVS is only an overall measure; and the good forecasts identified by WVS may not be equally
good in the eyes of a particular user who will re-evaluate each forecast according to his own cost/loss
ratio.
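Equation (54) is straightforward to evaluate numerically once a VS curve and a Beta weighting are fixed. In the sketch below, the VS curve itself is a hypothetical inverted parabola standing in for the curve of some particular forecast:

```python
# Sketch of Eq. (54): averaging a value-score curve over a Beta(a, b)
# distribution of cost/loss ratios. The VS curve here is hypothetical.
import math

def value_score(c):
    # Stand-in VS curve peaking at c = 0.5 (illustration only).
    return max(0.0, 1 - 4 * (c - 0.5) ** 2) * 0.6

def beta_pdf(c, a, b):
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * c ** (a - 1) * (1 - c) ** (b - 1)

def wvs(a, b, n=2000):
    # Midpoint Riemann sum for the integral in Eq. (54).
    grid = [(i + 0.5) / n for i in range(n)]
    return sum(value_score(c) * beta_pdf(c, a, b) for c in grid) / n

print(round(wvs(1, 1), 3))   # a = b = 1: uniform weights over c
print(round(wvs(6, 2), 3))   # users concentrated at high cost/loss ratios
# As Wilks (2001) notes, the resulting ranking can be sensitive to (a, b).
```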
Although the value score provides a general framework to evaluate the usefulness of probability
forecasts in terms of economic cost and loss, it has its own drawbacks. In the derivation of value score,
we have used the conditional probability P(Yt = 1|Pt) which is unknown in practice and needs to be
estimated from a sample. For a user without much statistical expertise, this estimation is hardly feasible.
Richardson (2003) simplified the derivation by assuming the forecast is perfectly calibrated (P(Yt =
1|Pt) = Pt) and thus a user can take the face value Pt as the truth. All empirical value score curves
presented in Richardson (2003) are generated under this assumption. However, the assumption may
not hold for any probability forecast, and deriving the VS curve and conducting statistical inference
in such a situation become much more challenging.
3.2 Evaluation of Point Predictions
Compared to probability forecasts, only 0/1 values are issued in binary point predictions, which will
be discussed in depth in Section 4. For binary forecasts of this type, the 2× 2 contingency tables,
cross-classifying forecasts and actuals, completely characterize the joint distribution, and thus are
convenient tools from which a variety of evaluation measures about skill and value can be constructed.
We will introduce the usual skill measures based on contingency tables; see also Stephenson (2000) and Mason (2003). Statistical inference on a contingency table, especially the independence test
under two sampling designs, and the measure of forecast value are then briefly reviewed.
3.2.1 Skill measures for point forecasts
Although there are four cells in a contingency table (Table 1), only three quantities are sufficient for
describing it completely. The first one is the bias (B) which is defined to be the ratio of two marginal
probabilities, π1./π.1. For an unbiased forecasting system, B is one and E(Ŷ) = E(Y). Note that B summarizes the marginal distributions of forecasts and observations, and thus does not tell us anything about the association between them. For example, independence of Ŷ and Y is possible for any value
of the bias. The unbiased random forecasts are often taken as having no skill in this context, and all
other forecasts are assessed relative to this benchmark. Two other measures necessary to characterize the forecast errors are the hit rate (H) and the false alarm rate (F), which are the two basic building blocks of an ROC curve. Note that for random forecasts of no skill, both H and F are equal to the marginal probability P(Ŷ = 1) due to independence. For forecasts of positive skill, H is expected to exceed
F. Given B, H and F, any joint probability πi j in Table 1 is uniquely determined, verifying that only
three degrees of freedom are needed for a 2×2 contingency table. The false alarm ratio is defined through 1−H′ ≡ P(Y = 0|Ŷ = 1), while the conditional miss rate is F′ ≡ P(Y = 1|Ŷ = 0). Using Bayes' rule connecting the two factorizations, Stephenson (2000) derived the following relationship between these four conditional measures:

H′ = H/B,
F′ = F(1−H)/(F − H + B(1−F)). (55)
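These identities are easy to verify numerically from any set of joint probabilities; the figures below are illustrative:

```python
# Sketch: recovering B, H, F from Table 1's joint probabilities and
# checking the Stephenson (2000) identities in Eq. (55). Cells (forecast,
# outcome) carry illustrative probabilities summing to one.

p11, p10, p01, p00 = 0.15, 0.10, 0.05, 0.70

B = (p11 + p10) / (p11 + p01)   # bias: P(Yhat = 1) / P(Y = 1)
H = p11 / (p11 + p01)           # hit rate
F = p10 / (p10 + p00)           # false alarm rate

H_prime = p11 / (p11 + p10)     # P(Y = 1 | Yhat = 1)
F_prime = p01 / (p01 + p00)     # P(Y = 1 | Yhat = 0), the miss rate

# Eq. (55):
assert abs(H_prime - H / B) < 1e-12
assert abs(F_prime - F * (1 - H) / (F - H + B * (1 - F))) < 1e-12
print(B, H, F)
```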
Other measures of forecast skill can be constructed using the above three elementary but sufficient
statistics. The first one is the odds ratio defined as the ratio of two odds
OR ≡ [H/(1−H)] / [F/(1−F)], (56)
which is positively oriented in that it equals 1 for random forecasts and is greater than 1 for forecasts
of positive skill. Actually, OR is often taken as a measure of association between the rows and columns in any contingency table, and it equals one if and only if they are independent; see Agresti (2007). Note that
OR is just a function of H and F, both of which are summaries of the conditional distributions. As
a result, OR does not rely on the marginal information. Another measure that is parallel to the Brier
score is the probability of correct forecasts, defined as

πcorr ≡ 1 − E(Y − Ŷ)² = π11 + π00 = [FH + (1−F)(B−H)]/(B − H + F), (57)
which depends on B and the marginal information as well. In rare event cases where the unconditional
probability of Y = 1 is close to zero, πcorr would be very high for the random forecasts of no skill.
This is easily seen by observing that, for unbiased random forecasts, H = F = P(Ŷ = 1) = P(Y = 1) and B = 1. Substituting these into πcorr, we get

πcorr = [FH + (1−F)(B−H)]/(B − H + F) = 2P(Y = 1)² − 2P(Y = 1) + 1, (58)
and the minimum is obtained when P(Y = 1) = 0.5, that is, the event is balanced. In contrast, it
achieves its maximum when P(Y = 1) = 1 or P(Y = 1) = 0. For rare events where P(Y = 1) is
close to zero, πcorr is near one and this leads to the misconception that the random forecasts perform
exceptionally well, as nearly 100% cases are correctly predicted. Even if there is no association
between forecasts and observations, this score could be very high. For this reason, Gandin and Murphy
(1992) regarded πcorr to be “inequitable” in the sense of encouraging hedging. In contrast, the odds
ratio which is not dependent on B does not have this flaw and hence is a reliable measure in rare event
cases. Often, we take the logarithm of OR to transform its range into the whole real line, and statistical inference based on the log odds ratio is much simpler to conduct than that based on the odds ratio itself, as shown in Section 3.2.2.6 Alternatively, we can use the improvement of πcorr relative to the random forecasts
of no skill to measure the forecast skill. This is the Heidke skill score (HSS):
HSS = (πcorr − πocorr)/(1 − πocorr), (59)
6Another transformation of OR is the so-called Yule's Q or Odds Ratio Skill Score (ORSS), which is defined as (OR−1)/(OR+1). Unlike OR, ORSS ranges from −1 to 1 and is recognized conventionally as a measure of association in contingency tables.
where πocorr is πcorr for random forecasts. According to Stephenson (2000), HSS is a more reliable score to use than πcorr, although it also depends on B.
The second widely used skill score that gets rid of the marginal information is the Peirce skill score
(PSS) or Kuipers score, which is defined as the hit rate minus the false alarm rate, cf. Peirce (1884).
Like OR, PSS rewards forecasts of higher skill with a larger score. One of the advantages of PSS over OR is that it is a linear function of H and F, and thus is well-defined for virtually all contingency tables, whereas OR is not defined when H and F are both zero. Stephenson (2000) evaluated the performance of
these scores in terms of complement and transpose symmetry properties, and their encouragement of hedging behaviour. His conclusion is that the odds ratio is generally a useful measure of skill for binary
point forecasts. It is easy to compute and construct inference built on it; moreover, it is independent
of the marginal totals and is both complement and transpose symmetric. Mason (2003) provided a
more comprehensive survey on various scores that are built on contingency tables and established
five criteria for screening these measures, namely, equitability, propriety, consistency, sufficiency and
regularity.
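The scores of this subsection can all be read off one empirical contingency table. A sketch with illustrative counts, where the random-forecast benchmark for the HSS is taken as the expected fraction correct under independence with the same marginals:

```python
# Sketch: odds ratio, Peirce skill score, fraction correct, and Heidke
# skill score from one 2x2 table of counts (forecast by outcome).

n11, n10, n01, n00 = 30, 10, 20, 140   # illustrative counts
T = n11 + n10 + n01 + n00

H = n11 / (n11 + n01)   # hit rate
F = n10 / (n10 + n00)   # false alarm rate

odds_ratio = (H / (1 - H)) / (F / (1 - F))   # Eq. (56)
pss = H - F                                  # Peirce skill score
pi_corr = (n11 + n00) / T                    # fraction of correct forecasts

# HSS, Eq. (59): improvement of pi_corr over independent forecasts that
# share the same marginal frequencies.
p_hat1, p_y1 = (n11 + n10) / T, (n11 + n01) / T
pi_corr_o = p_hat1 * p_y1 + (1 - p_hat1) * (1 - p_y1)
hss = (pi_corr - pi_corr_o) / (1 - pi_corr_o)

print(odds_ratio, pss, pi_corr, hss)
```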
3.2.2 Statistical inference based on contingency tables
So far, all scores are calculated using population contingency tables, and nearly all of them are functions of the four joint probabilities. In practice, only a sample {Ŷt ,Yt}, t = 1, ...,T, is available, which may or may not be generated from the models in Section 4. We have to use this sample to construct the score estimates. This is made simple by noticing that any score, denoted by f (π11,π10,π01), is
a function of three probabilities πi j. The estimator is obtained by replacing each πi j by the sample
proportion pi j. The statistical inference is therefore based on the maximum likelihood theory if the
sample size is sufficiently large. For simplicity, let us consider the random sampling scheme where
{Ŷt ,Yt} is i.i.d. The objective is to find the asymptotic distribution of an empirical score, which is a
function of the sample proportions, denoted by f (p11, p10, p01).
Taking each (Ŷt ,Yt) as a random draw from the joint distribution of forecasts and observations, we have four possible outcomes for each draw: (1,1), (1,0), (0,1) and (0,0), with corresponding probabilities π11, π10, π01, and π00, respectively. Under the assumption of independence, the sampling distribution of {Ŷt ,Yt} is multinomial with four outcomes, each with probability πij. The
likelihood as a function of πi j is thus
L(πij | {Ŷt ,Yt}) = [T !/(n11! n10! n01! n00!)] π11^n11 π10^n10 π01^n01 π00^n00, (60)
where nij is the number of observations in cell (i, j) and T = ∑i ∑j nij. The maximum likelihood estimator is obtained by maximizing (60) over πij, subject to the natural constraint ∑i ∑j πij = 1. Agresti (2007) showed that the ML estimator is simply pij = nij/T, the sample proportion of outcomes (i, j). By maximum likelihood theory, pij is consistent and asymptotically
normally distributed, that is,
√T (p − π) →d N(0, V), (61)

where p = (p11, p10, p01)′, π = (π11, π10, π01)′, and V is the 3×3 asymptotic covariance matrix, which
can be estimated by the inverse of the negative Hessian of the log-likelihood evaluated at p. The asymptotic distribution of f (p11, p10, p01) can be derived by the delta method, provided f is differentiable in a neighborhood of π, obtaining

√T ( f (p11, p10, p01) − f (π11, π10, π01)) →d N(0, (∂f/∂π) V (∂f/∂π)′), (62)

where ∂f/∂π is the gradient vector of f evaluated at π, which can be estimated by replacing π with p. Asymptotic confidence intervals for any score defined above can be obtained based on (62); see Stephenson (2000) and Mason (2003).
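As a concrete special case of (62), the delta method applied to the log odds ratio yields the familiar variance approximation 1/n11 + 1/n10 + 1/n01 + 1/n00; a sketch with illustrative counts:

```python
# Sketch: delta-method 95% confidence interval for the log odds ratio,
# a special case of Eq. (62). The counts are illustrative.
import math

n11, n10, n01, n00 = 30, 10, 20, 140

log_or = math.log((n11 * n00) / (n10 * n01))
se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)

z = 1.96   # asymptotic standard normal critical value
ci = (log_or - z * se, log_or + z * se)
print(log_or, ci)
# An interval excluding 0 (odds ratio of one) rejects independence, i.e.
# the forecasts show significant skill.
```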
In small samples, the above asymptotic approximation is no longer reliable. A rule of thumb is that the number of observations in each cell should be at least 5 in order for the approximation to work well. For samples in real life, one or more cells may not contain any observation, and some
measures, such as OR, cannot be calculated. The Bayesian approach with a reasonable prior could
work in these situations. As shown above, the sample is drawn from a multinomial distribution. Albert
(2009) showed that the conjugate prior for π is the so-called Dirichlet distribution with four parameters
(α11,α10,α01,α00) with density
p(π) = [Γ(∑i ∑j αij) / (∏i ∏j Γ(αij))] π11^(α11−1) π10^(α10−1) π01^(α01−1) π00^(α00−1),   (63)
where ∑i ∑j πij = 1 and Γ(·) is the Gamma function. A natural choice is the noninformative prior,
in which all the αij's equal one and all values of π are a priori equally likely. Albert (2009) showed that the posterior
distribution is also Dirichlet, with updated parameters (α11 + n11, α10 + n10, α01 + n01, α00 + n00).
A random sample of size M from this posterior distribution, denoted by π(m) for m = 1, ..., M, can
then be used to obtain a sequence of scores f(π(m)). For the purpose of inference, the resulting highest
posterior density (HPD) credible set Cα at a given significance level α can be treated in the same way as a
confidence interval in non-Bayesian analysis. Note that the strength of the Bayesian approach in
the present situation is that the score can be calculated even when some of the nij's are zero.
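The Dirichlet-multinomial machinery above can be sketched with nothing but the standard library: a Dirichlet draw is a vector of independent Gamma draws normalized to sum to one. The counts, the flat prior, and the use of an equal-tailed interval as a simple stand-in for the HPD set are all illustrative assumptions.

```python
import random

random.seed(0)

def dirichlet_draw(alphas):
    """One draw from Dirichlet(alphas) via normalized Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

# Hypothetical counts (n11, n10, n01, n00); note n10 = 0, so the sample
# odds ratio is undefined, yet the posterior odds ratio is well behaved
counts = [12, 0, 5, 30]
posterior = [1 + n for n in counts]   # flat Dirichlet(1,1,1,1) prior

M = 5000
odds_ratios = []
for _ in range(M):
    p11, p10, p01, p00 = dirichlet_draw(posterior)
    odds_ratios.append((p11 * p00) / (p10 * p01))

odds_ratios.sort()
lo, hi = odds_ratios[int(0.025 * M)], odds_ratios[int(0.975 * M)]
median_or = odds_ratios[M // 2]
```

Even with an empty cell, every posterior draw gives a finite odds ratio, so the score and its credible interval are always computable.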
Testing independence between the rows and columns of a contingency table is very important for forecast
evaluation. As shown above, independent forecasts would not be credited with a high value by any
score. Merton (1981) proposed a statistic to measure the market timing skill of directional forecasts
(DF). According to Merton (1981), a DF has no value if, and only if,

HM ≡ P(Ŷt = 1|Yt = 1) + P(Ŷt = 0|Yt = 0) = 1,   (64)

where Yt = 1 means the variable has moved upward. In our terminology, this means that

P(Ŷt = 1|Yt = 1) − P(Ŷt = 1|Yt = 0) = 0.   (65)
Note that P(Ŷt = 1|Yt = 1) is the hit rate and P(Ŷt = 1|Yt = 0) is the false alarm rate. As a result, the
DF under consideration has no market timing skill in the sense of Merton (1981) if, and only if, the
Peirce skill score is zero. Blaskowitz and Herwartz (2008) derived an alternative expression for the
HM statistic in terms of the covariance of realized and forecasted directions:

HM − 1 = Cov(Ŷt, Yt) / Var(Yt).   (66)

HM = 1 if, and only if, Cov(Ŷt, Yt) is zero, which is equivalent to independence between Ŷt and Yt in
the case of binary variables. Interestingly, a large number of papers investigating DF use symmetric
loss functions of various forms, which amounts to taking the percentage of correct forecasts as the
score; see Leitch and Tanner (1995), Greer (2005), Blaskowitz and Herwartz (2009), Swanson and
White (1995, 1997a,b), Gradojevic and Yang (2006), and Diebold (2006), to name a few. Pesaran and
Skouras (2002) linked the HM statistic with a loss function in a decision-based forecast evaluation
framework.
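A quick numerical check of the identity in (66), with made-up 0/1 series standing in for the forecasts Ŷt and outcomes Yt (the values carry no empirical content):

```python
# Toy 0/1 forecast and outcome series; purely illustrative values
yhat = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
y    = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
T = len(y)

def mean(v):
    return sum(v) / len(v)

ones, zeros = sum(y), T - sum(y)
hits   = sum(1 for f, o in zip(yhat, y) if f == 1 and o == 1)
rejs   = sum(1 for f, o in zip(yhat, y) if f == 0 and o == 0)
alarms = sum(1 for f, o in zip(yhat, y) if f == 1 and o == 0)

hm  = hits / ones + rejs / zeros     # equation (64)
pss = hits / ones - alarms / zeros   # hit rate minus false alarm rate

cov_fy = mean([f * o for f, o in zip(yhat, y)]) - mean(yhat) * mean(y)
var_y  = mean(y) * (1 - mean(y))

# Equation (66): HM - 1 = Cov(Yhat, Y) / Var(Y), which also equals the PSS
assert abs((hm - 1) - cov_fy / var_y) < 1e-12
assert abs((hm - 1) - pss) < 1e-12
```

For these toy series HM = 1.4, so HM − 1 = PSS = 0.4, and the covariance ratio reproduces the same number, as the identity requires for any binary pair.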
Since testing market timing skill is equivalent to the independence test in contingency tables, let
us look at this test a bit more closely. The independence test under random sampling is much simpler than the
test in the presence of serial correlation. As a matter of fact, all of the above frequentist and Bayesian
tests are applicable in this situation. Take the Peirce skill score as an example. We can construct
an asymptotic confidence interval for PSS based on a large sample and then check whether zero is
included in the confidence interval. Besides these, two additional asymptotic tests exist, namely, the
likelihood ratio and the Pearson chi-squared tests. The former is constructed as
LR ≡ 2(ln L(π̂*ij | {(Ŷt, Yt)}) − ln L(π̂ij | {(Ŷt, Yt)})),   (67)

where π̂*ij is the unrestricted ML estimate, whereas π̂ij is the restricted one under the restrictions
πij = πi·π·j for all i and j. Given the null hypothesis of independence, LR follows a chi-squared
distribution with one degree of freedom asymptotically, and the null should be rejected if and only if
LR is larger than the critical value at a preassigned significance level. The Pearson chi-squared statistic
is
χ² ≡ ∑i ∑j (nij − n̂ij)² / n̂ij,   (68)

where nij is the observed cell count, n̂ij = T pi· p·j is the expected cell count under independence,
pi· is the marginal sample proportion of the ith row, and p·j is that of the jth column. If the rows
and the columns are independent, this statistic is expected to be small. It also has an asymptotic
chi-squared distribution with one degree of freedom and the same rejection area. Both tests are valid
and equivalent in large samples. In finite samples, where one or more cell counts are smaller than 5,
Fisher’s exact test is preferred under the assumption that the total row and column counts are fixed.
The null distribution of the Fisher test statistic is not valid if these marginal counts are not fixed, as is
often the case in random sampling. Specifically, the probability of the first count n11 given marginal
totals and independence is
P(n11) = [n1·! / (n11! n10!)] [n0·! / (n01! n00!)] / [T! / (n·1! n·0!)],   (69)
which has the hypergeometric distribution for any sample size. This test was proposed by Fisher in
1934, and is widely used to test independence in I × J contingency tables under the random sampling
design. Here only the simple case with I = J = 2 is considered, and readers are referred to Agresti
(2007) for further discussion of this exact test. Another way of testing independence in general I × J
contingency tables is the asymptotic test of the ANOVA coefficients of ln(πij), that is, the significance
test of the relevant coefficients in the log-linear model, which is popular in statistics and biostatistics
but rarely used by econometricians. This test exploits the fact that the ANOVA coefficients of
ln(πij) must satisfy certain conditions under independence. One of these conditions is that the coefficient
of any interaction term must be zero. The test proceeds by checking whether the maximum likelihood
estimates support these implied values using the three standard procedures, that is, the Wald, likelihood
ratio, and Lagrangian multiplier tests. In econometrics, Pesaran and Timmermann (1992) proposed
an asymptotic test (PT92) based on the difference between P(Ŷ = 1, Y = 1) + P(Ŷ = 0, Y = 0) and
P(Ŷ = 1)P(Y = 1) + P(Ŷ = 0)P(Y = 0), which should be close to zero under independence. A large
deviation of the sample estimate from zero is thus a signal of rejection. In 2× 2 contingency tables,
ANOVA and PT92 tests are asymptotically equivalent to the classical χ2 test.
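The three tests just described can be sketched from first principles for a single hypothetical 2×2 table; every count below is invented for illustration.

```python
import math

# Hypothetical 2x2 counts n_ij = #{Yhat = i, Y = j}
n11, n10, n01, n00 = 30, 10, 15, 45
T = n11 + n10 + n01 + n00
r1, r0 = n11 + n10, n01 + n00      # forecast (row) margins n_1., n_0.
c1, c0 = n11 + n01, n10 + n00      # outcome (column) margins n_.1, n_.0

cells = [(n11, r1 * c1), (n10, r1 * c0), (n01, r0 * c1), (n00, r0 * c0)]

# Pearson chi-squared statistic (68) with expected counts r_i * c_j / T
chi2 = sum((obs - rc / T) ** 2 / (rc / T) for obs, rc in cells)

# Likelihood ratio statistic (67): 2 * sum n_ij * log(n_ij / expected)
lr = 2 * sum(obs * math.log(obs * T / rc) for obs, rc in cells if obs > 0)

# Fisher's exact point probability (69) of n11 given the margins
p_n11 = math.comb(r1, n11) * math.comb(r0, n01) / math.comb(T, c1)
```

For this table both asymptotic statistics are far above the 5% chi-squared(1) critical value of 3.84, and the exact hypergeometric probability of the observed n11 is tiny, so all three approaches reject independence.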
In reality, especially for macroeconomic forecasts, Ŷt and Yt are likely to be serially correlated. All
of the above test statistics can nevertheless still be used, but their null distributions will change.
For example, Tavaré and Altham (1983) examined the performance of the usual χ² test when both
the row and column variables are characterized by two-state Markov chains, and concluded that the χ² statistic
does not then have the χ² distribution with one degree of freedom that applies in the case of random samples. Before
drawing any meaningful conclusions from these classical tests, serial correlation needs to be handled
properly.
Blaskowitz and Herwartz (2008) provided a summary of the testing methodologies in the presence
of serial correlation in Ŷt and Yt. These include a covariance test based on the covariance of observations
and events, a static/dynamic regression approach adjusted for serial correlation by calculating
Newey-West corrected t-statistics, and the Pesaran and Timmermann (2009) test based on the canonical
correlations from dynamically augmented reduced rank regressions, specialized to the binary case.
They found that all of these tests based on asymptotic approximations tend to produce incorrect
empirical size in finite samples, and suggested a circular bootstrap approach to improve their finite-sample
performance. Bootstrap-based tests are found to have smaller size distortions in small samples
without much sacrifice of power, while tests that do not take care of serial correlation tend to generate
inflated size in finite samples.
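A minimal sketch of a circular block bootstrap in this spirit: each series is resampled independently (imposing the independence null) from blocks read off a circular wrap of the data, which preserves each series' own serial correlation. The toy series, block length, and number of replicates are arbitrary illustrative choices, not the Blaskowitz and Herwartz implementation.

```python
import random

random.seed(1)

def circular_blocks(series, block_len):
    """One circular block bootstrap replicate: wrap the series around a
    circle and paste together randomly started blocks of fixed length."""
    T = len(series)
    out = []
    while len(out) < T:
        s = random.randrange(T)
        out.extend(series[(s + k) % T] for k in range(block_len))
    return out[:T]

def cov(a, b):
    T = len(a)
    ma, mb = sum(a) / T, sum(b) / T
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / T

# Toy serially correlated 0/1 series standing in for forecasts and outcomes
yhat = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
y    = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]

obs = cov(yhat, y)
# Null distribution: resample the two series *independently*, imposing
# independence while each block preserves within-series serial correlation
null = sorted(abs(cov(circular_blocks(yhat, 4), circular_blocks(y, 4)))
              for _ in range(999))
p_value = sum(1 for c in null if c >= abs(obs)) / len(null)
```

The bootstrap p-value is then compared with the nominal level in the usual way; in applications the block length would be tuned to the persistence of the data rather than fixed at 4.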
Dependence of forecasts and observations is necessary for a forecasting system to have positive
skill. However, it is only a minimal requirement for good forecasts. It is not unusual for the performance
of a forecasting system to be worse, in terms of some specific criterion, than that of random forecasts with no
skill. Donkers and Melenberg (2002) proposed a test of relative forecasting performance over this
benchmark by comparing the difference in the percentage of correct forecasts. In a real-life example,
they found that their proposed test and the PT92 test differ dramatically between the estimation and
evaluation samples.
3.2.3 Evaluation of forecast value
Most evaluation methodologies focus on the skill of binary point forecasts. As argued by Diebold and
Mariano (1995) and Granger and Pesaran (2000a,b), however, the end user often finds measures of
economic value to be more useful than the usual mean squared error or other statistical scores. We have
emphasized this point in the context of probability forecasts in which the cost/loss ratio is important
for value-evaluation in a forecast-based decision problem. In a 2× 2 payoff matrix (e.g. Table 4),
each cell corresponds to the loss associated with a possible combination of action and realization, and
is not limited to the specific cost/loss structure. Blaskowitz and Herwartz (2011) proposed a general
loss function suitable for directional forecasts in economics and finance, which takes into account
the realized sign and the magnitude of directional movement for the target economic variable. They
regarded this general loss function as an alternative to the commonly used mean squared error for
forecast evaluation.
As indicated before, Richardson (2003) analyzed the relationship between skill and value in the
context of the cost/loss decision problems. Note that for probability forecasts, any user, faced with a
probability value, decides whether or not to take some action according to his optimal threshold. For
binary point predictions, we can also calculate the value score, defined as a function of the cost/loss
ratio. The resulting VS curve would lie below the one generated by probability forecasts. Richardson
(2003) proved that the particular cost/loss ratio which maximizes VS is equal to the marginal
probability of Y = 1, and that the highest achievable value score is simply the Peirce skill score (PSS).
Granger and Pesaran (2000b) derived a very similar result. Consequently, the maximum economic
value is related to the forecast skill, and PSS is taken as a measure of the potential forecast value as
well as skill. However, for a specific user with a cost/loss ratio different from the marginal probability
P(Y = 1), this maximum value is not attainable. Thus PSS only gives the possible maximum rather
than the actual value achievable for any user. On the other hand, Stephenson (2000) argued that in
order to have a positive value score for at least one cost/loss ratio, the odds ratio (OR) has to exceed
one. That is, forecasts and observations have to depend on each other, otherwise, nobody benefits from
the forecasts and one would rather use the random forecasts with no skill. This observation provides
another example, where forecast value is influenced by forecast skill. Only those forecasts satisfying
the minimal skill requirements can be economically valuable.
4 Binary Point Predictions
In some circumstances, especially in two-state, two-action decision problems, one has to make a
binary decision according to the predicted probability of a future event. This can be done by trans-
forming a continuous probability into a 0/1 point prediction, as we will discuss in this section. Unlike
probability forecasts, binary point forecasts cannot be isolated from an underlying loss function. For
this reason, we deferred a detailed examination of the topic until after forecast evaluation under a
general loss function was reviewed in Section 3. The plan of this section is as follows: Section 4.1
considers ways to transform predicted probabilities into point forecasts – the so-called “two-step
approach”. Manski (1975, 1985) generalized this transformation procedure to other cases where no
probability prediction is given as prior knowledge, and the optimal forecasting rule is obtained
through a one-step approach. This is addressed in Section 4.2, followed by an empirical illustration
in Section 4.3. A set of binary classification techniques primarily used in the statistical learning
literature is briefly introduced in Section 4.4. These include discriminant analysis, classification
trees, and neural networks.
4.1 Two-step approach
In the two-step approach, the first step consists of generating binary probability predictions, as re-
viewed in Section 2, while a threshold is employed to translate these probabilities into 0/1 point
predictions in the second step. In the cost/loss decision problem, the optimal threshold of doing so is
based on the cost/loss ratio. For a general loss function L(Y ,Y ), the optimal threshold minimizing the
expected loss can be solved by comparing two quantities, namely, the expected loss of Y = 1 and that
of Y = 0. Denote the former by EL1 = P(Y = 1|P)L(1,1)+(1−P(Y = 1|P))L(1,0) and the latter by
EL0 = P(Y = 1|P)L(0,1)+(1−P(Y = 1|P))L(0,0). Y = 1 is optimal if and only if EL1 ≤ EL0, or,
P(Y = 1|P)≥ L(1,0)−L(0,0)L(1,0)−L(0,0)+L(0,1)−L(1,1)
≡ P∗. (70)
Here we assume that making a correct forecast is beneficial and making a false forecast is costly,
that is, L(0,0) < L(1,0) and L(1,1) < L(0,1). P* defined above is the optimal threshold; it is a
function of the losses, and is interpreted as the fraction of the gain from getting the forecast right when
Y = 0 over the total gain from correct forecasts. Given P*, the optimal decision (or forecasting) rule is
Ŷ = I(P(Y = 1|P) ≥ P*). In general, P(Y = 1|P) is unknown, and this rule is infeasible. However,
suppose P is generated by one of the models in Section 2 that is correctly specified in the sense that
P = P(Y = 1|Ω). The law of iterated expectations then implies that P(Y = 1|P) = P, that is, P is perfectly
calibrated, and the decision rule reduces to Ŷ = I(P ≥ P*). Given a sequence of this type of
probability forecasts {Pt}, this rule says that we can generate another sequence of 0/1 point forecasts
{Ŷt} by simply comparing each Pt with P*. In reality, rather than P, what we know is its estimate P̂
from a particular binary response model, say a probit or single-index model, evaluated at a particular
covariate value x. If this model is correctly specified, the decision rule using P̂ in place of P is
asymptotically optimal as well, and both yield the same expected loss as the sample size approaches
infinity. Figure 13 illustrates a decision rule based on the probit model with threshold 0.4.
Figure 13: Probit and linear probability models with threshold 0.4
From this figure, Ŷ = 1 is predicted for any observation with Φ(Xβ̂) ≥ 0.4, that is, for those on the
right-hand side of the vertical line.
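The second step can be sketched as follows; the loss numbers and probability values are hypothetical and serve only to show how (70) maps a loss matrix into a cutoff:

```python
# Hedged sketch of the second step: turn fitted probabilities into 0/1
# forecasts using the loss-based threshold P* from equation (70)
def optimal_threshold(L11, L10, L01, L00):
    """P* = (L(1,0)-L(0,0)) / ((L(1,0)-L(0,0)) + (L(0,1)-L(1,1)))."""
    return (L10 - L00) / ((L10 - L00) + (L01 - L11))

# Hypothetical loss matrix: a false alarm costs 1, a miss costs 4,
# correct forecasts cost nothing
p_star = optimal_threshold(L11=0.0, L10=1.0, L01=4.0, L00=0.0)   # 0.2

probs = [0.05, 0.15, 0.25, 0.60, 0.90]   # fitted P(Y = 1 | X) values
point_forecasts = [int(p >= p_star) for p in probs]
# -> [0, 0, 1, 1, 1]
```

Because misses are four times as costly as false alarms here, the cutoff drops from the naive 0.5 to 0.2, and borderline probabilities such as 0.25 are already converted into Ŷ = 1.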
4.2 One-step approach
Manski (1975, 1985) developed a semiparametric estimator for the binary response model, the so-
called maximum score estimator (MSCORE). This is different from other semiparametric estimators
in Section 2.1.3 in terms of the imposed assumptions. Both single-index and nonparametric additive
models assume that the error in (2) is stochastically independent of X . In contrast, MSCORE only
assumes the conditional median of this error is zero, that is, med(ε|X) = 0, or median independence,
which is much weaker. Manski assumed the index function to be linear in unknown parameters β,
so the full specification is akin to the parametric model in Section 2.1.1, but he relaxed the inde-
pendence and distributional assumptions. Compared with other binary response models, the salient
feature of Manski’s semiparametric estimator is its weak distributional assumptions. However, as a
result, the conditional probability P(Y = 1|X) cannot be estimated—the price one has to pay with less
information. This is the reason why we did not discuss this model in Section 2 under “Probability
Predictions”.
The maximum score estimator β̂ solves the following maximization problem based on a sample
{(Yt, Xt)}:

max_{β∈B, |β1|=1} Sms(β) ≡ (1/T) ∑_{t=1}^{T} (2Yt − 1)(2I(Xtβ ≥ 0) − 1),   (71)
where B is the permissible parameter space, |β1| is normalized to one for identification (β is identified
only up to scale), and Sms(·) is the score function. Note that when Yt = 1 and Xtβ ≥ 0, or
Yt = 0 and Xtβ < 0, we have (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = 1; otherwise, (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = −1.
Interpreting this as the problem of using X to predict Y, the rule says that Ŷ = 1 is predicted if, and only if,
the linear predictor Xβ is at least zero. Whenever the predicted and observed values agree,
the score rises by 1/T; otherwise, it falls by the same amount. In this light, MSCORE
estimates the optimal linear forecasting rule of the form Xβ that maximizes the percentage
of correct forecasts.
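For intuition, the score function in (71) can be maximized by brute force in a toy one-regressor example; the data, the grid, and the ±1 slope normalization are illustrative simplifications, standing in for the specialized algorithms discussed in the literature.

```python
# Toy data: Y = 1 tends to occur for larger x; all values are made up
data = [(-2.0, 0), (-1.0, 0), (0.0, 0), (0.5, 0),
        (0.8, 1), (1.2, 1), (2.0, 1), (3.0, 1)]

def score(b0, b1, sample):
    """Sms(beta) from (71): mean of (2Y - 1)(2 I(Xb >= 0) - 1)."""
    return sum((2 * y - 1) * (2 * int(b0 + b1 * x >= 0) - 1)
               for x, y in sample) / len(sample)

# Grid search over the intercept with the slope normalized to +/-1
best = max(((score(b0, s, data), b0, s)
            for s in (-1.0, 1.0)
            for b0 in (k / 100 for k in range(-300, 301))),
           key=lambda t: t[0])
best_score, b0_hat, b1_hat = best
```

Because the toy data are perfectly separable at a cutoff between 0.5 and 0.8, the grid search attains the maximum score of 1, and any intercept in that interval is equally good, illustrating why the estimator is set-valued and converges slowly.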
Manski (1985) established strong consistency of the maximum score estimator. The rate of convergence
and the asymptotic distribution were analyzed by Cavanagh (1987) and Kim and Pollard (1990),
respectively. However, the score function is not continuous in the parameters, and thus the limiting
distribution is too complex for carrying out statistical inference. Manski and Thompson (1986) suggested
using the bootstrap to conduct inference for MSCORE, a suggestion critically evaluated by Abrevaya and
Huang (2005). Delgado et al. (2001) discussed the use of nonreplacement subsampling to approximate
the distribution of MSCORE. Furthermore, the convergence rate of MSCORE is T^(1/3), which is slower
than the usual √T. All of these issues restrict the application of MSCORE in empirical studies. To
overcome the problem resulting from discontinuity, Horowitz (1992) proposed a smoothed version of
the score function using a differentiable kernel. The resulting smoothed MSCORE is consistent and
asymptotically normal, with a convergence rate of at least T^(2/5) that can be made arbitrarily close to √T
under some assumptions. Horowitz (2009) also discussed extensions of MSCORE to choice-based
samples, panel data and ordered-response models. Caudill (2003) illustrated the use of MSCORE in
forecasting where seeding is taken as a predictor of winning in the men’s NCAA basketball tourna-
ment. He found that MSCORE tends to outperform parametric probit models for both in-sample and
out-of-sample forecasts.
Manski and Thompson (1989) investigated a one-step analog estimation of optimal predictors
of binary response with much relaxed parametric assumptions on the response process. The loss
functions they considered are quite general. The first is the class of asymmetric absolute loss functions
under which the optimal forecasting rule takes the same form as Ŷ = I(P ≥ P*). The second is the
class of asymmetric square loss functions, and the last is the logarithmic loss function. Under these
last two losses, however, the optimal forecasts are not 0/1-valued and thus are omitted here. A natural
estimation strategy is to estimate P first, and then to get the point forecasts using the optimal rule,
as explained in Section 4.1. Manski and Thompson (1989) suggested estimating the optimal binary
point forecasts directly by the analogy principle, viz., the estimates of best predictors are obtained by
solving sample analogs of the prediction problem without the need to estimate P first. The potential
benefit of this one-step procedure is that it allows for a certain degree of misspecification for P. They
discussed this issue in two specific binary response models, “isotonic” and “single-crossing”, finding
that the analog estimators for a large class of predictors are algebraically equivalent to MSCORE, and
so are consistent.
Elliott and Lieli (2010) followed the same one-step approach under a general loss function. They
extended Manski and Thompson's analog estimator by allowing the best predictor to be nonlinear in β.
In MSCORE, the “rule of thumb” threshold for transforming Xβ̂ into 0/1 binary point forecasts is 0.
Note that Xβ̂ is not the conditional probability of Y = 1 given X. However, this threshold may not
be optimal for the particular decision problem under consideration. Elliott and Lieli (2010) derived an
optimal threshold based on a general utility function, which may depend on the covariates X as well.
Their motivation can be explained in terms of Figure 13.
Suppose the true model is the probit model, but a linear probability model is fitted instead, with
the fitted line shown in Figure 13. According to the analysis in Section 2.1.1, the estimated β is
generally not consistent and so the linear probability model will be viewed as a bad choice. Elliott
and Lieli (2010) argued, however, that this may not be the case, at least in this example. Rather than
concentrating on β, what is important is the optimal forecasting rule; two different models may yield
the same forecasting rule. In Figure 13, the optimal forecasting rule determined by the true model
is: Ŷ = 1 is predicted if, and only if, X lies on the right-hand side of the vertical line – the very rule
we get by using the linear predictor Xβ. This finding highlights the point that we do not require the
model to be correctly specified in order to obtain an optimal forecasting rule. As a result, modeling
binary responses for point predictions becomes much more flexible than for probability predictions.
However, this gain in specification flexibility should not be overstated, since not every misspecified
model will work. The key requirement is that both the working model and the true model have to cross
the optimal threshold level at exactly the same cutoff point. The working model can behave arbitrarily
elsewhere, where the predictions can even go beyond [0,1].7 Therefore, a good working model may
not be the real conditional probability model and need not have any structural interpretation. For
example, β in the linear probability model in Figure 13 does not give the marginal effect of X on the
probability of Y = 1. Elliott and Lieli concluded that the usual two-step estimation procedures, such
as maximum likelihood estimation, fit the working model globally, and thus the fitted model is close
to the true model over the whole range of covariate values. However, this is not necessary since the
goodness of fit in the neighborhood of the cutoff point is all that is necessary. In other words, all we
need is a potentially misspecified working model that fits well locally instead of globally.
To overcome the problem of the two-step estimation approach, Elliott and Lieli (2010) incor-
porated utility into the estimation stage – the one-step approach initially proposed by Manski and
Thompson (1989). The population problem involves maximizing expected utility by choosing a binary
optimal action as a function of X, namely,

max_{a(·)} E(U(a(X), Y, X)),   (72)
where U(a, Y, X) is the utility function depending on the binary action a (itself a function of
X), the realized event Y, and the covariates X.8 After some algebraic manipulation, (72) can be rewritten
as

max_{g∈G} E(b(X)[Y + 1 − 2c(X)] sign[g(X)]),   (73)
7Another nontrivial requirement is that the working model must be above (below) the cutoff whenever thetrue model is above (below) it.
8Elliott and Lieli suggested empirical examples where X enters into the utility function.
where b(X) = U(1,1,X) − U(−1,1,X) + U(−1,−1,X) − U(1,−1,X) > 0, c(X) is the optimal threshold
expressed as a function of utility, a(X) = sign[g(X)], and G is the collection of all measurable functions
from R^k to R (note that X is k-dimensional). The so-called maximum utility estimator (MUE) is
then obtained by solving the sample version of (73):

max_{g∈G} (1/T) ∑_{t=1}^{T} b(Xt)[Yt + 1 − 2c(Xt)] sign[g(Xt)].   (74)
For implementation, g needs to be parameterized; that is, only a subclass of G is considered in order to reduce
the estimation dimension. The estimator β̂ which maximizes the objective function

max_{β∈B} (1/T) ∑_{t=1}^{T} b(Xt)[Yt + 1 − 2c(Xt)] sign[h(Xt, β)]   (75)
produces the empirical forecasting rule sign[h(Xt, β̂)].9 Under weak conditions, this empirical forecasting
rule converges to the theoretically optimal rule given the model specification h(x, β). If, in
addition, the model h(x, β) satisfies the stated condition for correct specification, the constrained optimal
forecast is also the globally optimal forecast for all possible values of the predictors. They
recommended a finite-order polynomial for use in practice.
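A stripped-down illustration of (75): constant benefit b(X) = 1, a constant threshold c(X) = 0.2 (so misses are four times as costly as false alarms), a linear working model with the slope normalized to one, and a grid search standing in for the simulated annealing or mixed integer programming discussed in the literature. All data and tuning choices are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1

def mue_objective(b0, sample, c=0.2):
    """Sample objective (75) with b(X)=1, h(X, b0) = b0 + x, Y in {-1, +1}."""
    return sum((y + 1 - 2 * c) * sign(b0 + x) for x, y in sample) / len(sample)

# Toy data with Y coded -1/+1; the lone (0.5, -1) point gets "sacrificed"
# because, at c = 0.2, missing a +1 outcome costs four times a false alarm
data = [(-2.0, -1), (-1.0, -1), (-0.5, 1), (0.5, -1),
        (1.0, 1), (2.0, 1), (3.0, 1)]

grid = [k / 100 for k in range(-300, 301)]
b0_hat = max(grid, key=lambda b0: mue_objective(b0, data))
in_sample = mue_objective(b0_hat, data)
```

The maximizing intercepts form an interval, [0.5, 1.0) on this grid, and any of them induces the same forecasting rule, which is exactly the point made above: only the cutoff matters, not the global fit of the working model.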
The identification issues in the approaches of Manski and of Elliott and Lieli are less important for
prediction purposes than for structural analysis. The estimation proceeds without much worry about
identification, provided alternative identification restrictions yield the same forecasting rules. Statistical
inference is built on the optimand function instead of the usual focus on β. One difficulty
comes from the discontinuity of the objective function, which means that the maximization in practice
cannot be undertaken by the usual gradient-based numerical optimization techniques. Elliott and Lieli
employed the simulated annealing algorithm in their Monte Carlo studies, while mixed integer programming
was suggested by Florios and Skouras (2007) to solve the optimization problem.
Lieli and Springborn (2012) assessed the predictive ability of three procedures (two-step maxi-
mum likelihood, two-step Bayesian and one-step maximum utility estimation) in deciding whether to
allow novel imported goods which may be accompanied by undesirable side effects, such as biological
invasion. They used Australian data to demonstrate that a maximum utility method is likely to offer
significant incremental gains relative to the other alternatives, and estimated this annual value to be
$34-$49 million (AU$) under their specific loss function. This paper also extends the maximum utility
9Note that (75) with constant b(Xt), c(Xt) = 0.5 and h(Xt ,β) = Xtβ, is equivalent to the maximum scoreproblem. Therefore, MSCORE is a special case of this general estimator.
model to address an endogenously stratified sample where the uncommon event is over-represented in
the sample relative to the population rate, as discussed in Section 2.1.1.
Lieli and Nieto-Barthaburu (2010) generalized the above approach with a single decision maker to
a more complex context where a group of decision makers has heterogeneous utility functions. They
considered a public forecaster serving all decision makers by maximizing a weighted sum of individ-
ual (expected) utilities. The maximum welfare estimator was then defined through the forecaster’s
maximization problem, and its properties were explored. The conditions under which the traditional
binary prediction methods can be interpreted asymptotically as socially optimal were given, even when
the estimated model was misspecified.
4.3 An empirical illustration
To illustrate the difference between the one-step and two-step approaches in terms of their forecasting
performance, the data in Section 2.1.5 involving yield spreads and recession indicators, are used here.
For simplicity, the lagged indicator is removed, that is, only static models with yield spread as the only
regressor are fitted. It is well known that the best model for fitting the data is not always the best model
for forecasting. The whole sample is, therefore, split into two groups. The first group, covering the
period from January 1960 to December 1979, is for estimation use, while the second one, including
all remaining observations, is for out-of-sample evaluation. For the conventional two-step approach,
we fit a parametric probit model with a linear index. A recession in month t is predicted if and
only if

Φ(β̂0 + β̂1 YSt−12) ≥ optimal threshold,   (76)
where Φ(·) is the standard normal distribution function, YSt−12 is the 12-month lagged yield spread,
and β̂j, for j = 0 and 1, are the maximum likelihood estimates. For the purpose of comparison, the
same model specification (76) is fitted by Elliott and Lieli's approach under alternative loss functions.
In this case, we use the same forecasting rule (76) with the β̂j replaced by the maximum utility
estimates. Two particular loss functions are analyzed here: the percentage of correct forecasts and
the Peirce skill score, with 0.5 and the population probability of recession as the respective optimal thresholds.
We take the sample proportion as the estimate of the population probability. Note that
these are also the two most commonly used thresholds for translating a probability into a 0/1 value in
empirical studies; see Greene (2011). The maximum utility estimates are computed using the OPTMODEL
procedure in SAS 9.2.
Figure 14 presents these fitted curves using the estimation sample, together with two optimal
thresholds.10 In contrast to the two-step maximum likelihood approach, one-step estimates depend on
the loss function of interest. When the Peirce skill score is maximized, instead of the percentage of
correct forecasts, both intercept and slope estimates change, making the fitted curve shift rightward.
One noteworthy result is that both the one-step and two-step fitted curves of maximizing Peirce skill
score touch the optimal threshold (0.15) in roughly the same region, despite their large gap when the
yield spread is negative. According to Elliott and Lieli (2010), this implies that both are expected
to yield the same forecasting rule, and thus yield the same value for the Peirce skill score. For the
percentage of correct forecasts, the fitted curves from these two approaches are also very close to
each other in the critical region, where the curves touch the optimal threshold (0.5). Their results are
confirmed in Table 5 where we summarize the in-sample goodness of fit for all fitted models. As
expected, it makes no difference in terms of the objectives they attempt to maximize. For instance,
the maximized Peirce skill score is 0.4882 for both the probit and MPSS. One possible reason for
their equivalence in this particular example could be due to the correct specification in (76), i.e., the
true data generating process can be represented correctly by the probit model.11 Note that in Table 5,
the Peirce skill score of MPC is significantly lower than that of the other two models; so is the percentage
of correct forecasts for MPSS. This is not surprising, as neither one-step semiparametric model is
designed to maximize the other criterion.
Table 5: In-sample goodness of fit for one-step vs. two-step models

            PC       PSS
  Probit    0.8625   0.4882
  MPC       0.8625   0.1744
  MPSS      0.7167   0.4882
10In Figure 14, MPC is the fitted curve for the maximum percentage of correct forecasts, while MPSS is the maximum Peirce skill score fitted curve.
11In fact, a nonparametric specification test shows that the functional form in (76) cannot be rejected by the sample. Thus, the fitted probit model serves as a proxy for the unknown data generating process.

Figure 14: One-step vs. two-step fitted curves

To correct for possible in-sample overfitting, we evaluate the fitted models using the second sample,
with the results summarized in Table 6. Both tables convey similar information about the
forecasting performance of the one-step and two-step models. In Table 6, the probit model still
performs admirably well. In terms of the percentage of correct forecasts, it even outperforms MPC, which is
constructed to maximize this criterion. Given that the probit model is correctly specified, the slight
superiority of the two-step approach may be due to sampling variability or to structural differences
between the estimation and evaluation samples.
Table 6: Out-of-sample evaluation for one-step vs. two-step models

            PC       PSS
  Probit    0.8672   0.5854
  MPC       0.8542   0.1333
  MPSS      0.8229   0.5854
The relative advantage of the one-step approach, as emphasized in Section 4.2, is that it is robust to
some types of misspecification that the two-step approach cannot accommodate. In order to highlight
this point, we fit the linear probability model (1) instead of the probit model (76). For the two-step
approach, the recession for month t is predicted if and only if

\hat{\beta}_0^{OLS} + \hat{\beta}_1^{OLS} YS_{t-12} ≥ optimal threshold,  (77)

where \hat{\beta}_j^{OLS}, for j = 0 and 1, are the OLS estimates. For the one-step approach, these parameters are
estimated by the Elliott and Lieli method. Figure 15 illustrates some interesting results in this setting.
Compared with the probit fitted curve, the OLS fitted line is dramatically different. However, the MUE
fitted lines, based on PC and PSS, intersect the probit fitted curve (76) at their associated threshold
values (0.5 and 0.15, respectively). Thus, MUE produces the same binary point forecasts even when
the working model (77) is misspecified. Figure 15 shows that the lines estimated by MUE do not fit
the data generating process globally very well, yet are capable of producing correct point predictions.
Given that a global fit is less important than the localized problem of identifying the cutoff in the
present binary point forecast context, the one-step approach with better local fit should be preferred.12
Figure 15: One-step vs. two-step linear fitted lines
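The two-step procedure discussed above can be sketched in a few lines: fit the linear probability model by OLS, then search a grid for the cutoff maximizing the Peirce skill score. The simulated data, functional form, and variable names below are hypothetical stand-ins, not the chapter's yield-spread series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the recession indicator and the lagged spread.
T = 400
spread = rng.normal(1.0, 1.5, T)               # plays the role of YS_{t-12}
prob = 1 / (1 + np.exp(2.0 * spread))          # P(recession) falls as spread rises
y = (rng.uniform(size=T) < prob).astype(int)

# Step 1: fit the linear probability model by OLS.
X = np.column_stack([np.ones(T), spread])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = b0 + b1 * spread

# Step 2: choose the threshold maximizing the Peirce skill score
# PSS = hit rate - false alarm rate, over a grid of candidate cutoffs.
def peirce(y, yhat):
    hit = yhat[y == 1].mean() if (y == 1).any() else 0.0
    false_alarm = yhat[y == 0].mean() if (y == 0).any() else 0.0
    return hit - false_alarm

grid = np.linspace(fitted.min(), fitted.max(), 200)
scores = [peirce(y, (fitted >= c).astype(int)) for c in grid]
c_star = grid[int(np.argmax(scores))]
forecast = (fitted >= c_star).astype(int)      # recession called iff fitted >= c*
```

Replacing the OLS step by the Elliott and Lieli estimator would turn this into the one-step approach, which picks the parameters and the implied cutoff jointly.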
4.4 Classification models in statistical learning
Supervised statistical learning theory is mainly concerned with predicting the value of a response vari-
able using a few input variables (or covariates), which is similar to forecasting models in econometrics.
Many binary point prediction models have been proposed in the supervised learning literature, where
they are called binary classification models. This section serves as a brief introduction to a few
classical models among them.
4.4.1 Linear discriminant analysis
As stated above, an optimal threshold is needed to transform the conditional probability P(Y = 1|X)
into a 0/1 point prediction. The most widely used threshold is 1/2, which corresponds to a symmetric loss function, as given by Mason (2003).

Footnote 12: When we implemented the in-sample and out-of-sample evaluation exercises for the linear specification (77), we found that the linear model fitted by OLS performed worse than its MUE counterparts.

Given this threshold, classification simply involves comparison
of two conditional probabilities, that is, P(Y = 1|X) and P(Y = 0|X), and the event with larger proba-
bility is predicted accordingly. Linear discriminant analysis follows this rule but obtains P(Y = 1|X)
in a different way than the usual regression-based approach. The analysis assumes that we know the
marginal probability P(Y = 1) and the conditional density f (X |Y ). By Bayes’ rule, the conditional
probability is given by
P(Y = 1|X) = P(Y = 1) f(X|Y = 1) / [P(Y = 1) f(X|Y = 1) + P(Y = 0) f(X|Y = 0)].  (78)
To simplify the analysis, hereafter a parametric assumption is imposed on the conditional density
f (X |Y ). The usual practice, when X is continuous, is to assume both f (X |Y = 1) and f (X |Y = 0) are
multivariate normal with different means but a common covariance matrix Σ, that is,
f(x|Y = j) = [1 / ((2π)^{k/2} |Σ|^{1/2})] exp(−(1/2)(x − µ_j) Σ^{−1} (x − µ_j)′),  (79)
where j = 1 or 0. Under this assumption, the log odds in terms of the conditional probabilities is
ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = ln [f(x|Y = 1) / f(x|Y = 0)] + ln [P(Y = 1) / P(Y = 0)]

= ln [P(Y = 1) / P(Y = 0)] − (1/2)(µ_1 + µ_0) Σ^{−1} (µ_1 − µ_0)′ + x Σ^{−1} (µ_1 − µ_0)′,  (80)
which is an equation linear in x. The assumption of equal covariance matrices causes the normalization
factors to cancel, as well as the quadratic parts in the exponents. The previous classification rule amounts to
determining whether (80) is positive for a given x. The decision boundary that is given by setting (80)
to be zero is a hyperplane in Rk, dividing the whole space into two disjoint subsets. For any given x
in Rk, it must exclusively fall into one subset; and the classification follows in a straightforward way.
To make this rule work in practice, four blocks of parameters have to be estimated using samples:
P(Y = 1), µ1, µ0 and Σ. This can be done easily by using their sample counterparts. To be specific,
P̂(Y = 1) = T_1/T, µ̂_j = ∑_j X_t / T_j for j = 0, 1, and Σ̂ = (∑_1 (X_t − µ̂_1)′(X_t − µ̂_1) + ∑_0 (X_t − µ̂_0)′(X_t − µ̂_0)) / (T − 2), where P̂ is the estimate of P, T_1 is the number of observations with Y_t = 1, and ∑_j is the summation over those observations with Y_t = j. Substituting parameters with their estimates in
the decision boundary yields the empirical classification rule. It is called linear discriminant analysis
simply because the resulting decision boundary is a hyperplane in the input vector space, which again
is the consequence of the imposed assumptions. Hastie et al. (2001) derived a decision boundary
described by a quadratic equation under the normality assumption with distinct covariance matrices,
that is, Σ1 6= Σ0. They also extended this simplest case by considering other distributional assumptions
leading to more complex decision boundaries.
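The plug-in rule described above can be sketched as follows, under the stated normality assumptions; the simulated data, class means, and sample sizes are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes with a common covariance (the LDA assumption).
mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
X1 = rng.multivariate_normal(mu1, np.eye(2), 150)
X0 = rng.multivariate_normal(mu0, np.eye(2), 150)
X = np.vstack([X1, X0])
y = np.array([1] * 150 + [0] * 150)

# Sample counterparts of P(Y = 1), mu_1, mu_0 and the pooled covariance.
T, T1 = len(y), int(y.sum())
p1 = T1 / T
m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
S = ((X[y == 1] - m1).T @ (X[y == 1] - m1)
     + (X[y == 0] - m0).T @ (X[y == 0] - m0)) / (T - 2)
S_inv = np.linalg.inv(S)

# The log odds in (80): an intercept plus a term linear in x.
const = np.log(p1 / (1 - p1)) - 0.5 * (m1 + m0) @ S_inv @ (m1 - m0)
slope = S_inv @ (m1 - m0)

def predict(x):
    # Predict Y = 1 iff the estimated log odds at x is positive.
    return int(const + x @ slope > 0)
```

The pair (const, slope) is exactly the plug-in version of (82) and (83) below, so the same rule could equivalently be read as a fitted logistic-type boundary.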
Another point worth mentioning is that the log odds generated by linear discriminant analysis
takes the form of a logistic specification. Specifically, the linear logistic model by construction has
linear logit
ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = β_0 + x β_1,  (81)
which is akin to (80) if
β_0 ≡ ln [P(Y = 1) / P(Y = 0)] − (1/2)(µ_1 + µ_0) Σ^{−1} (µ_1 − µ_0)′  (82)
and
β_1 ≡ Σ^{−1} (µ_1 − µ_0)′.  (83)
Therefore, the assumptions in linear discriminant analysis induce the logistic regression model, which
can be estimated by maximum likelihood to get estimates for β_0 and β_1. In this sense, both models
generate the same classification rules asymptotically, in spite of the difference in their estimation
methods. However, the joint distribution of Y and X is used in discriminant analysis, whereas logistic
regression only uses the conditional distribution of Y given X , leaving the marginal distribution of X
not explicitly specified. As a consequence, linear discriminant analysis, by relying on the additional
model assumptions, is more efficient but less robust when the assumed conditional density of X given
Y is not true. In the situation where some of the components of X are discrete, logistic regression is a
safer, more robust choice.
Maddala (1983) followed an alternative way to derive the linear discriminant boundary, which
provides a deep insight into what discriminant analysis actually does. Suppose that only a linear
boundary is considered for simplicity. Without loss of generality, denote it by Xλ = 0, and Y = 1
is predicted if and only if Xλ ≥ 0. What discriminant analysis does is to find the optimal value
for λ according to a certain criterion. Fisher initially posed this problem as finding the λ such that
the between-class variance is maximized relative to the within-class variance. The between-class
variance measures how far apart the means of Xλ are for the two classes (Y = 1 and
Y = 0, where Y is the binary point prediction), and it should be maximized subject to the constraint
that the variance of Xλ within each class is fixed. This does make intuitive sense in the context of
classification. If the two means are close together or the two distributions of Xλ overlap to a large
extent, it is hard to distinguish one class from the other. In other words, a large proportion of observations
could be misclassified. Conversely, even if the means of the two distributions are far away from each
other, the classes cannot be sharply distinguished unless both distributions have small variances. The optimal
λ solving Fisher's problem gives the best linear decision boundary, whose analytical form is given
in Maddala (1983). Mardia et al. (1979) offered a concise discussion of linear discriminant analysis.
Michie et al. (1994) compared a large number of popular classifiers on benchmark datasets. Linear
discriminant analysis is a simple classification model with a linear decision boundary, and subsequent
developments have extended it in various directions; see Hastie et al. (2001) for details.
4.4.2 Classification trees
As with discriminant analysis, methods based on classification trees partition the input vector space
into a number of subsets on which 0/1 binary point predictions are made. Consider the case with
two input variables: X1 and X2, both of which take values in the unit interval. Figure 16 presents a
particular partition of the unit box.
Figure 16: Partition of the unit box
First, subset R1 is obtained if X1 < t1. For the remaining part, check whether X2 < t2; if so, we get
R2. Otherwise, check whether X1 < t3; if so, we get R3. Otherwise, check whether X2 < t4; if so, we
get R4. Otherwise, we take what remains as R5. This process can be represented by a classification
tree in Figure 17.
Figure 17: The classification tree associated with Figure 16
Each node on the tree represents a stage in the partition; and the number of final subsets equals that
of terminal nodes. The branch connecting two nodes gives the condition under which the upper node
transits to the lower one. For example, condition X1 < t1 must be satisfied in order to get R1 from
the initial node. The tree shown in Figure 17 can be expanded further to incorporate more terminal
nodes when the partition ends up with more final subsets. In general, suppose we have M subsets R_1,
R_2, ..., R_M, to each of which we assign a probability denoted by p_j for j = 1, ..., M.
Using the optimal threshold 1/2, Y = 1 should be predicted on subset j if and only if p_j ≥ 0.5. Hence,
the classification boils down to how to divide the input vector space into disjoint subsets as shown
in Figure 16 (or how to generate a classification tree like the one in Figure 17), and how to assign
probabilities to them.
To introduce an algorithm to grow a classification tree, we define X as a k-dimensional input
vector, with X_j as its jth element, and

R_1(j, s) ≡ {X | X_j ≤ s}  and  R_2(j, s) ≡ {X | X_j > s}.  (84)

Given a sample {Y_t, X_t}, the optimal splitting variable j and split point s solve the following problem:

min_{j,s} [ min_{c_1} ∑_{X_t ∈ R_1(j,s)} (Y_t − c_1)² + min_{c_2} ∑_{X_t ∈ R_2(j,s)} (Y_t − c_2)² ].  (85)
71
For any fixed j and s, the optimal c_i (for i = 1 or 2) that minimizes the sum of squared errors is the
sample proportion of Y_t = 1 within the class {X_t : X_t ∈ R_i(j, s)}. Computation of the optimal j and
s can be carried out in most statistical packages without much difficulty. Having found the best split,
the whole input space is divided into two subsets according to whether X_{j*} ≤ s*, where j* and s* are
the optimal solutions to (85). The whole procedure is then iterated on each subset to get finer subsets
which can be partitioned further as before. In principle, this process can be repeated infinitely many
times, but we have to stop it when a certain criterion is met. To this end, we define the cost complexity
criterion function
C_α(T) ≡ ∑_{m=1}^{|T|} ∑_{X_t ∈ R_m} (Y_t − Ŷ_m)² + α|T|,  (86)

where T is a subtree of a very large initial tree T_0, |T| is the number of terminal nodes of T, each of which is indexed by R_m for m = 1, ..., |T|, and Ŷ_m is the sample proportion of Y_t = 1 within subset R_m.
The criterion is a function of α, a nonnegative tuning parameter to be specified by the user. The
optimal subtree T, which depends on α, should minimize C_α(T). If α = 0, the optimal T is as large
as possible and equals the upper bound T_0. Conversely, an infinitely large α forces T to be very small.
This result is intuitive. When the partition gets finer and finer, fewer and fewer observations fall
into each subset. In the limit, each subset would contain at most one observation, so that Ŷ_m = Y_t for
each m, and the first term of Cα(T ) would vanish. This also shows that without any other constraint,
the optimal partition rule tends to overfit the in-sample data. Such a rule is unstable and inaccurate in the
sense that it is sensitive to even a slight change in the sample. The optimal subtree should balance
the tradeoff between stability and in-sample goodness of fit. This balance is controlled by parameter
α. Breiman et al. (1984) and Ripley (1996) outlined details to obtain the optimal subtree for a given α
that is determined by cross-validation.
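The split search in (85) can be sketched as an exhaustive scan over variables and observed split points, with the class proportion serving as the optimal constant on each side. This is an illustrative implementation, not an excerpt from any particular package.

```python
import numpy as np

def best_split(X, y):
    """Solve (85): scan every variable j and every observed split point s,
    using the sample proportion of Y = 1 as the optimal constant per side."""
    T, k = X.shape
    best = (None, None, np.inf)
    for j in range(k):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = 0.0
            for part in (left, right):
                if len(part):
                    sse += ((part - part.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best  # (j*, s*, minimized sum of squared errors)
```

Applying the function recursively to each of the two resulting subsets grows the tree; the cost complexity criterion (86) then decides how far back to prune it.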
Hastie et al. (2001) recommended using other measures of goodness of fit in the complexity cri-
terion function instead of the sample mean squared error in (86) for binary classification purposes,
including the misclassification error, the Gini index, and cross-entropy. They compared them in terms
of their sensitivity to changes in the node probabilities. They also discussed cases with categorical
predictors and asymmetric loss functions. For an early treatment of classification trees, see Morgan
and Sonquist (1963). Breiman et al. (1984) and Quinlan (1992) contain a general treatment of this
topic.
4.4.3 Neural networks
A neural network is a highly nonlinear supervised learning model, which seeks to approximate
the regression function by combining a k-dimensional input vector in a hierarchical way
via multiple hidden layers. To outline the basic idea, only a neural network with a single hidden layer is
considered here.
As before, Y is a binary response, and X is a k-dimensional input vector to be used for classification. Let Z_1, ..., Z_M be unobserved hidden units that depend on X through Z_m = σ(α_{0m} + X α_m), for m = 1, ..., M, where σ(·) is a known link function. A typical choice is σ(v) = 1/(1 + e^{−v}). Then the
neural network, with Z_1, ..., Z_M as the only hidden layer, can be written as

T_k = β_{0k} + Z β_k,  k = 0, 1,
P(Y = 1|X) = g(T),  (87)
where T = (T_0, T_1), Z = (Z_1, ..., Z_M), P(Y = 1|X) is the conditional probability of Y = 1 given X, and
g is a known function of two arguments. For a binary response, g(T) = e^{T_1}/(e^{T_0} + e^{T_1}) is often used. The
above model structure is depicted in Figure 18.
Figure 18: Neural networks with a single hidden layer
In general, there may be more than one hidden layer, and so Y will depend on X in a more
complex way. The model therefore allows for enhanced specification flexibility and reduced risk of
misspecification. Note that there are M(k+ 1)+ 2(M + 1) parameters in this model that need to be
estimated, and some of them may not be identified when both M and k are large. In other words,
the specification is too rich to be identified. For this reason, instead of fitting the full model, only a
nested model, with some parameters fixed, is estimated given a sample Yt ,Xt. Despite its complex
structure, it is still a parametric model because the functional forms of g and σ are known a priori
and only a finite set of parameters are estimated. The usual nonlinear least squares, or maximum
likelihood, method is used to get a consistent estimator. For the former, the objective function that
should be minimized is the forecast mean squared error
R(θ) = ∑_{t=1}^{T} (Y_t − P(Y = 1|X_t))²,  (88)
whereas the likelihood function for the latter is
R(θ) = ∏_{t=1}^{T} P(Y = 1|X_t)^{Y_t} (1 − P(Y = 1|X_t))^{1−Y_t},  (89)
where θ is the vector of all parameters. The classification rule is that Y = 1 is predicted if, and only
if, the fitted probability P(Y = 1|X) is no less than 0.5. The global solutions of the above
problems are often undesirable in that they tend to overfit the model in-sample but perform poorly
out-of-sample. So, one can obtain a suboptimal solution either directly through a penalty term added
in any of the above objective functions, or indirectly by early stopping. For computational details
on neural networks, see Hastie et al. (2001), Parker (1985), and Rumelhart et al. (1986). A general
introduction of neural networks is given by Ripley (1996), Hertz et al. (1991), and Bishop (1995). For
a useful review of neural networks from an econometric point of view, see Kuan and White (1994).
Refenes and White (1998), Stock and Watson (1999), Abu-Mostafa et al. (2001), Marcellino (2004)
and Terasvirta et al. (2005) applied neural networks in time series econometrics and forecasting.
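A minimal sketch of a single-hidden-layer network fit by gradient descent on the squared-error objective (88) follows; with two output scores the softmax g(T) reduces to a sigmoid of T_1 − T_0, which the code exploits. The toy data and all tuning choices (M, learning rate, iteration count) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Toy data: one input variable and a binary response (hypothetical).
T = 400
X = rng.uniform(-2, 2, (T, 1))
y = (X[:, 0] > 0.3).astype(float)

# Single hidden layer with M units.
M = 4
A, a0 = rng.normal(0, 1, (1, M)), np.zeros(M)    # hidden-layer parameters (alpha)
B, b0 = rng.normal(0, 0.1, (M, 2)), np.zeros(2)  # output parameters (beta)

def forward(X):
    Z = sigmoid(X @ A + a0)          # hidden units Z_m = sigma(alpha_0m + X alpha_m)
    T0, T1 = (Z @ B + b0).T          # the two scores in (87)
    return Z, sigmoid(T1 - T0)       # g(T) = e^{T1}/(e^{T0} + e^{T1}) = P(Y = 1|X)

# Gradient descent on the forecast mean squared error, as in (88).
lr = 1.0
for _ in range(3000):
    Z, p = forward(X)
    d = 2.0 * (p - y) * p * (1 - p) / T          # dLoss/dT1 (and -dLoss/dT0)
    gT = np.column_stack([-d, d])                # gradients w.r.t. (T0, T1)
    gpre = (gT @ B.T) * Z * (1 - Z)              # back through the hidden layer
    B -= lr * (Z.T @ gT);  b0 -= lr * gT.sum(0)
    A -= lr * (X.T @ gpre); a0 -= lr * gpre.sum(0)
```

In practice one would add a penalty term or stop early rather than run the loop to a global optimum, for exactly the overfitting reason discussed above.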
5 Improving Binary Predictions
Until now, all binary probability and point predictions have been constructed from a single training
sample Yt ,Xt, and the resulting predictions are thus subject to sampling variability. We say a binary
probability/point prediction Q(x) evaluated at x is unstable if its value is sensitive to even a slight
change of the training sample from which it is derived. The lack of stability is especially severe in
cases of small training samples and highly nonlinear forecasting models. If Q(x) varies a lot, it is
hardly reliable as one may get a completely different predicted value when a different training sample
is used. In other words, the variance of the forecast error would be extremely large for an unstable
prediction. To improve forecast performance and reduce the uncertainty associated with an unstable
binary forecast, combining multiple individual forecasts for the same event was suggested; see Bates
and Granger (1969), Deutsch et al. (1994), Granger and Jeon (2004), Stock and Watson (1999, 2005),
Yang (2004), and Timmermann (2006). The motivation for forecast combination is analogous to
the use of the sample mean instead of a single observation as an unbiased estimator of the population
mean: averaging reduces the variance without affecting unbiasedness. Let us consider using the
usual criterion of mean squared error for forecast evaluation. Denote an individual binary forecast by
Q(x,L) where x is the evaluation point of interest and L is the training sample Yt ,Xt (for t = 1, ...,T )
by which Q(x,L) is constructed. The mean squared error of an individual forecast is
e_l ≡ E_L E_{Y,X} (Y − Q(X, L))².  (90)
Suppose we can draw N random samples {L_i}, each of size T, from the joint distribution
f(Y, X). Then the combined forecast Q_A(x) ≡ (1/N) ∑_{i=1}^{N} Q(x, L_i) approaches the population average
when N is very large, that is,

Q_A(x) ≈ E_L Q(x, L).  (91)
The mean squared error associated with this combined forecast is thus
e_a ≡ E_{Y,X} (Y − Q_A(X))².  (92)
Now using Jensen’s inequality, we have
e_l = E_{Y,X} Y² − 2 E_{Y,X} [Y Q_A(X)] + E_{Y,X} E_L (Q(X, L))²

≥ E_{Y,X} Y² − 2 E_{Y,X} [Y Q_A(X)] + E_{Y,X} (Q_A(X))²

= E_{Y,X} (Y − Q_A(X))²

= e_a.  (93)

Thus, the combined forecast has a lower mean squared error than any individual forecast, and the
magnitude of improvement depends on E_L (Q(X, L))² − (E_L Q(X, L))² = Var_L(Q(X, L)), which is the
variance of the individual forecasts due to the uncertainty of the training sample and measures forecast
stability. Substantial instability leaves more space for improvement induced by forecast combination.
Generally speaking, small training samples and high nonlinearity in forecasting models are two main
sources of instability. Forecast combination can help a lot under these circumstances. Section 5.1 deals
with the case where multiple binary forecasts for the same event are available and the combination to
be carried out is straightforward. The bootstrap aggregating technique is followed when we only have
a single training set.
5.1 Combining binary predictions
Sometimes more than one binary prediction is available for the same target. A typical example is the
SPF probability forecasts of real GDP declines where approximately 40−50 individual forecasters is-
sue their subjective probability judgements in each survey about real GDP declines in the current and
each of the next four quarters. In these instances, individual forecasters might give diverse probability
assessments of a future event but none of them makes effective use of all available information. Be-
sides, the forecasts are likely to fluctuate over time and across individuals. Stimulated by concerns of
instability, a number of combination methods have been suggested. However, the combination meth-
ods should not be arbitrary and simplistic. Cases of combined forecasts that have performed worse
than individual forecasts have been documented in the literature; see Ranjan and Gneiting (2010) for
a good example. In this light, an effort to search for the optimal combination method is desired. Here,
the main focus is to combine probability forecasts instead of point forecasts. As for the latter, there
are already a large number of articles in computer science under the title of multiple classifier systems
(MCS), see Kuncheva (2004) for a textbook treatment.
The optimal combination of probability forecasts is discussed in a probabilistic context where the
joint distribution of observation and multiple individual forecasts is
f (Y,P1,P2, ...,PM), (94)
where Pm for m = 1, ...,M is the mth individual probability forecast of the binary event Y . The deriva-
tion of the optimal combination in the framework of the joint distribution unifies various separate
combination techniques in that it allows for more general assumptions on observations and forecasts.
For example, the Pm may be contemporaneously correlated with each other, which is very common as
individual forecasts are often based on similar information sets. Series correlation of observations and
forecasts is also allowed. Moreover, individual forecasts may come from either econometric models,
subjective judgements, or both. As shown in Section 3, there are many competing criteria or scores to
measure the skill or accuracy for probability forecasts. As a consequence, one may expect that optimal
combination rules may rely on adopted scores and thereby no universal combination rule will exist.
Fortunately, the situation is not as hopeless as it seems, as long as the score is proper. Denote the
proper score by S(Y,P) which is a function of the realized event and the probability forecasts, and the
conditional probability of Y = 1, given all individual forecasts, by P ≡ P(Y = 1|P1,P2, ...,PM). Ran-
jan and Gneiting (2010) proved that P, as a function of individual forecasts, is the optimal combined
forecast in the sense that its expected score is the smallest among all candidates provided the score is
proper. To see this, note that the expected score of P is given by
E(S(Y, P)) = E(E(S(Y, P)|P_1, P_2, ..., P_M))

= E(P S(1, P) + (1 − P) S(0, P))

≤ E(P S(1, f(P_1, P_2, ..., P_M)) + (1 − P) S(0, f(P_1, P_2, ..., P_M)))

= E(E(S(Y, f(P_1, P_2, ..., P_M))|P_1, P_2, ..., P_M))

= E(S(Y, f(P_1, P_2, ..., P_M))),  (95)
where f (P1,P2, ...,PM) is any measurable function of (P1,P2, ...,PM), an alternative combined forecast.
The inequality above uses the fact that S(Y,P) is a negatively oriented proper scoring rule. This result
says that taking P as the combined forecast always wins, which is true irrespective of the possible de-
pendence structures. A specific combination rule, such as the widely used linear opinion pool (OLP)
in which f(P_1, P_2, ..., P_M) = ∑_{m=1}^{M} w_m P_m and w_m is a nonnegative weight satisfying ∑_{m=1}^{M} w_m = 1,13
performs well only if it is close to the optimal P. A large number of specific rules have been devel-
oped, each of which is valid under its own assumptions. As a result, a specific rule may succeed if its
assumptions roughly hold in practice, but fail when the data generating process violates these assump-
tions. For example, the rule ignoring dependence structure among individual forecasts may perform
poorly if they are highly correlated with each other. For details of various specific combination rules,
see Genest and Zidek (1986), Clemen (1989), Diebold and Lopez (1997), Graham (1996), Wallsten
et al. (1997), Clemen and Winkler (1986, 1999, 2007), Timmermann (2006), and Primo et al. (2009).
Footnote 13: That is, f(P_1, P_2, ..., P_M) is a convex combination of the individual forecasts. Note that the linearity of P is permissible because each P_m lies in the unit interval, and so does any convex combination of them.
In general, the functional form of this conditional probability P is unknown and needs to be esti-
mated from the sample Yt ,P1t ,P2t , ...,PMt for t = 1, ...,T , which is the usual practice in econometrics,
by noting that P is nothing more than a conditional probability. All methods covered in Section 2 will
work here. The most robust way of estimation is nonparametric regression, even though it is subject to
the “curse of dimensionality” when a large number of individual forecasts need to be combined. Ran-
jan and Gneiting (2010) recommended the beta-transformed linear opinion pool (BLP) to reduce the
estimation dimension, yet reserve certain flexibility in the specification. BLP is akin to the parametric
model (2) with linear index and beta distribution as its link function, that is,
P(Y = 1|P_1, P_2, ..., P_M) = B_{α,β}(∑_{m=1}^{M} w_m P_m),  (96)
where Bα,β(·) is the distribution function of the beta density with two parameters α > 0 and β > 0.
The number of unknown parameters including α and β is M + 2. They showed that BLP reduces to
OLP when α = β = 1. All parameters can be estimated by maximum likelihood given a sample, and
validity of OLP can thus be verified by a likelihood ratio test. Ranjan and Gneiting examined the
properties of BLP, compared it with OLP and each individual forecast in terms of their calibration
and refinement. They found that correctly specified BLP, necessarily calibrated by construction, is a
recalibration of OLP, which may not be calibrated even if the individual forecasts are. The empirical
version of BLP, based on a sample, performs comparably to the optimal P. Using SPF
forecasts, Lahiri et al. (2012b) find that the procedure works reasonably well in practice.
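A sketch of the BLP specification (96) for M = 2 forecasters follows, using numerical integration for the beta distribution function and a crude grid search standing in for full maximum likelihood; the data and grids are hypothetical illustrations.

```python
import numpy as np
from math import lgamma, exp

def beta_cdf(x, a, b, n=4000):
    # distribution function of the beta density, by trapezoidal integration
    x = min(max(x, 1e-9), 1 - 1e-9)
    t = np.linspace(1e-9, x, n)
    dens = t ** (a - 1) * (1 - t) ** (b - 1)
    integral = (dens[:-1] + dens[1:]).sum() * (t[1] - t[0]) / 2
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)) * integral

def blp(p1, p2, w, a, b):
    # equation (96) with M = 2: a beta-transformed convex combination
    return beta_cdf(w * p1 + (1 - w) * p2, a, b)

def neg_loglik(theta, y, p1, p2):
    # negative Bernoulli log-likelihood of the combined forecast
    w, a, b = theta
    ll = 0.0
    for yi, q1, q2 in zip(y, p1, p2):
        p = min(max(blp(q1, q2, w, a, b), 1e-9), 1 - 1e-9)
        ll += yi * np.log(p) + (1 - yi) * np.log(1 - p)
    return -ll

# Hypothetical outcomes and two individual probability forecasts.
y  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p1 = np.array([.8, .3, .7, .9, .2, .4, .6, .1])
p2 = np.array([.7, .2, .8, .8, .3, .3, .7, .2])

# Crude grid search in place of maximum likelihood estimation.
grids = [(w, a, b) for w in (0.25, 0.5, 0.75)
                   for a in (0.5, 1.0, 2.0)
                   for b in (0.5, 1.0, 2.0)]
w_hat, a_hat, b_hat = min(grids, key=lambda th: neg_loglik(th, y, p1, p2))
```

Setting α = β = 1 makes the beta distribution function the identity, so the combination collapses to OLP; comparing the likelihood at the fitted (α̂, β̂) with the likelihood at (1, 1) is the basis of the likelihood ratio test mentioned above.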
5.2 Bootstrap aggregating
Bootstrap aggregating, or bagging, is a forecast combination approach proposed by Breiman (1996)
in the machine learning literature, when only a single training sample is available. The basic intuition
is to average the individual predictions generated by each bootstrap sample to reduce the variance of
the unbagged prediction without affecting its bias. Like the usual forecast combination approach, bagging
is useful only if the sample size is not large and the forecasting model is highly nonlinear. Typical
examples where forecasts can be improved significantly by bagging include classification trees and
neural networks. But bagging does not seem to work well in linear discriminant analysis and k-nearest
neighbor methods; see Friedman and Hall (2007), Buja and Stuetzle (2006), and Buhlmann and Yu
(2002) for further discussion of this issue. A striking result is that bagged predictors can perform
even worse than unbagged predictors in terms of certain criteria, as shown in Hastie et al. (2001).
Though it is not useful for all problems at hand, its ability to stabilize a binary classifier has been
supported in the machine learning literature, as documented by Bauer and Kohavi (1999), Kuncheva
and Whitaker (2003), and Evgeniou et al. (2004). Lee and Yang (2006) demonstrated that bagged
predictors outperform unbagged predictors even under asymmetric loss functions, instead of the usual
mean squared error. They also established the conditions under which bagging is successful.
Bootstrap aggregating starts by resampling {Y_t, X_t} via the bootstrap to get B bootstrap samples. The
binary forecasts, with fixed evaluation point x, are then constructed from each bootstrap sample to get
a set {Q(x, L_i)}, where L_i is the ith bootstrap sample. The bagged predictor is calculated as the
weighted average of the Q(x, L_i), where
Q_b(x, L) ≡ (1/B) ∑_{i=1}^{B} w_i Q(x, L_i)  (97)
and w_i is the nonnegative weight attached to the ith bootstrap sample L_i, satisfying the usual constraint ∑_{i=1}^{B} w_i = 1. The bagged predictor Q_b(x, L) depends on the original sample L, as resampling
is based on the empirical distribution of L. There are a few points to be clarified for its implemen-
tation. First, appropriate bootstrap methods should be used depending on the context. For example,
nonparametric bootstrap is the natural choice for independent data, and parametric bootstrap is more
efficient when the data generating process of L is known up to a finite dimensional parameter vector.
For time series or other dependent data, block bootstrap can provide a sound simulation sample, as
illustrated by Lee and Yang (2006). Second, for probability prediction, the predictor Qb(x,L) is di-
rectly usable as its value must be between zero and one if each Q(x,Li) is. However, this is not the
case for binary point prediction, as Qb(x,L) is not 0/1-valued even if each Q(x,Li) is. In this context,
a usual rule is the so-called majority voting, where the bagged predictor predicts whichever outcome is predicted more
often in {Q(x, L_i)}. This is equivalent to taking 1/2 as the threshold, that is, using I(Q_b(x, L) ≥ 1/2) as
the bagged predictor.14 Third, the BLP combination method in Section 5.1 can be used here, provided
its parameters can be estimated from bootstrap samples. Finally, the choice of B depends on the orig-
inal sample size, computational capacity and model structure in a complex way. Lee and Yang (2006)
showed that B = 50 is more than sufficient to get a stable bagged predictor, and even B = 20 is good
enough in some cases in their empirical example.

Footnote 14: Hastie et al. (2001) suggested another way to make a binary point prediction when a probability prediction at evaluation point x is available. The bagged probability predictor is derived by (97) and then transformed to a 0/1 value according to the threshold. They argued that, compared to the first procedure, this approach ends up with a bagged predictor having lower variance, especially for small B.

For other applications of bootstrap aggregating in
econometrics, interested readers are referred to Kitamura (2001), Inoue and Kilian (2008), and Stock
and Watson (2005).
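A bare-bones sketch of bagging with equal weights and majority voting follows, using a simple threshold classifier as the unstable base learner; the data-generating process and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(x, y):
    """Base binary classifier: pick the cutoff c minimizing in-sample
    misclassification of the rule I(x >= c)."""
    cands = np.unique(x)
    errs = [np.mean((x >= c).astype(int) != y) for c in cands]
    return cands[int(np.argmin(errs))]

# Training sample L (hypothetical noisy threshold event).
T = 60
x = rng.normal(0, 1, T)
y = (x + rng.normal(0, 0.8, T) > 0).astype(int)

# Bagging: B nonparametric bootstrap resamples, equal weights, and
# majority voting, i.e., the rule I(Q_b >= 1/2).
B = 50
def bagged_predict(x_new):
    votes = []
    for _ in range(B):
        idx = rng.integers(0, T, T)        # nonparametric bootstrap draw
        c = fit_stump(x[idx], y[idx])
        votes.append(int(x_new >= c))
    return int(np.mean(votes) >= 0.5)
```

Because the stump's estimated cutoff jumps around with each resample, the individual predictions near the boundary are unstable, which is precisely the situation in which averaging the votes helps.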
6 Conclusion
In this chapter, we discussed the specification, estimation and evaluation of binary response models in
a unified framework from the standpoint of forecasting. In a stochastic setting, generating the probabil-
ity of the occurrence of an event with binary outcomes boils down to the specification and estimation
of the conditional expectation or the regression function. In this process, the conventional nonlinear
econometric modeling approaches play a dominant role. Specification designed for the limited range
of the response distinguishes models for binary dependent variables from those for continuous pre-
dictands. Therefore, the validity of transformations like the probit link function becomes an issue in
modeling binary events for forecasting.
Two types of forecasts for binary events are distinguished in this chapter: probability forecasts
and point forecasts. There is no universal answer as to which one is better. The value score analysis
in section 3.1.2 justifies the use of probability forecasts, as they allow for heterogeneity in the loss
functions of the end users in decision making. However, if the working model is misspecified, the
point forecast based on a one-step approach that integrates estimation and forecasting may be superior,
provided a loss function has been properly chosen. Moreover, in many regulatory environments, there
are mandates for the issuance of only binary forecasts.
The joint distribution of forecasts and actuals embodies the basic ingredients required for the
evaluation of forecast skill. All existing scoring rules and graphical approaches essentially reflect
certain attributes of this joint distribution. Since no single evaluation tool provides a complete measure
of skill for forecasting binary events, the use of a battery of such measures is recommended to assess
the skill more comprehensively. As a general rule, those not influenced by the marginal information
regarding the actuals are preferred. Many examples fall into this category, such as the odds ratio, Peirce
skill score, or ROC. Compared with those commonly used in practice, the tools within this category
are more likely to capture the true forecast skill. In circumstances where the event under consideration
is rare or relatively uncommon, the marginal probability of the occurrence of the event may confound
the true skill if it is not isolated from the score. The usual methods for assessing the goodness of fit
of a binary regression model, such as the pseudo R2 or the percentage of correct predictions, do not
adjust for the asymmetry of the response variable. We have also emphasized the need for reporting
sampling errors of these statistics. In this regard, there is substantial room for improvement in current
econometric practice.
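To fix ideas, two of the base-rate-insensitive measures mentioned above can be computed directly from the 2×2 contingency table of forecasts against actuals. The following minimal sketch (in Python; the cell labels and function names are ours, not standard notation) illustrates the calculation:

```python
# Hypothetical 2x2 contingency table of binary forecasts vs. actuals:
#                 actual = 1   actual = 0
# forecast = 1        a            b        (hits, false alarms)
# forecast = 0        c            d        (misses, correct rejections)

def peirce_skill_score(a, b, c, d):
    """Hit rate minus false-alarm rate; unchanged if the marginal
    frequency of the event (a + c) / n is varied while the conditional
    distributions are held fixed."""
    hit_rate = a / (a + c)
    false_alarm_rate = b / (b + d)
    return hit_rate - false_alarm_rate

def odds_ratio(a, b, c, d):
    """(a * d) / (b * c); also insensitive to the base rate of the event."""
    return (a * d) / (b * c)
```

For a balanced table with 40 hits, 10 false alarms, 10 misses, and 40 correct rejections, the Peirce skill score is 0.8 − 0.2 = 0.6 and the odds ratio is 16, regardless of how common the event is in the sample.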
Given that we have introduced a wide range of models and methods for forecasting binary out-
comes, a natural question is which ones should be used in a particular situation. It appears that complex
models that fit better in-sample often tend not to do well out-of-sample. The three classification models
in section 4.4 illustrate this point well. Simple models, such as discriminant analysis with a
linear boundary or neural networks with a single hidden layer, often perform very well in out-of-sample
forecasting exercises. This also explains why forecast combination usually works when
the individual forecasts come from complex nonlinear models. When multiple forecasts of the same
binary event are available, the skill performance of any single forecast can potentially be improved
when it is combined with other individual forecasts efficiently. Here again, the optimal combination
scheme should be derived from the joint distribution of forecasts and actuals. When only a single
training sample is available and the individual forecasts based on it are highly unstable, bagging is an
attractive way to reduce the forecast variance and improve the forecast skill.
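As an illustration of the bagging idea of Breiman (1996), the sketch below (in Python; the stump-based classifier and all names are our own simplification, chosen only to keep the example self-contained) refits an unstable base classifier on bootstrap resamples of a single training sample and aggregates the resulting 0/1 forecasts by majority vote:

```python
import random

def fit_stump(xs, ys):
    """Pick the cutoff c minimizing in-sample misclassifications for the
    rule 'forecast 1 if x > c' (a one-split classification tree)."""
    best_c, best_err = None, float("inf")
    # min(xs) - 1 is the degenerate cutoff 'always forecast 1'
    for c in [min(xs) - 1] + sorted(set(xs)):
        err = sum((x > c) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_c, best_err = c, err
    return best_c

def bagged_forecast(xs, ys, x_new, n_boot=200, seed=42):
    """Majority vote over stumps fit on bootstrap resamples of the
    training data -- the bootstrap aggregation ('bagging') scheme."""
    rng = random.Random(seed)
    n, votes = len(xs), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes += int(x_new > c)
    return int(votes > n_boot / 2)
```

Because each bootstrap cutoff varies with the resample, averaging the votes smooths the hard threshold rule and reduces the variance of the final 0/1 forecast.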
A forecast with extremely low skill is unlikely to satisfy the needs of any forecast user; only
forecasts with at least a moderate amount of skill can be of value in guiding the decision-making
process. Conversely, a forecast that is skillful by one criterion may be of no use at all in another
decision-making context. Knowing the joint distribution
is not enough for the purpose of evaluating the usefulness of a forecast from the perspective of a user
– the loss function connecting forecasts and realizations needs to be considered as well. The binary
point prediction discussed in Section 4 is a prime example where a 0/1 forecast is made by implicitly or
explicitly relying on a threshold value that is determined by a presumed loss function. In some specific
contexts, certain skill scores are directly linked to the value of the end user. One such example is that,
under certain circumstances, the highest achievable value score is the Peirce skill score, as shown in
section 3.2.3. Without any knowledge about the joint distribution of forecasts and realizations, we do
not know the nature of uncertainty facing us. However, even with knowledge of the joint distribution,
without information regarding the loss function, we would not know how to balance the expected
gains and losses under different forecasting scenarios for making decisions under uncertainty. For a
truly successful forecasting system, we need both.
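The classic cost-loss problem analyzed by Murphy (1977) makes the role of the loss function concrete: if taking protective action costs C and an unforecast event inflicts a loss L > C, the expected loss of acting is C while that of not acting is pL, so issuing a 0/1 forecast of the event is optimal exactly when the predicted probability p exceeds the threshold C/L. A minimal sketch (function name ours):

```python
def optimal_point_forecast(p, C, L):
    """Cost-loss decision rule: forecast the event (and act) iff the
    expected loss of inaction, p * L, exceeds the cost of action, C --
    equivalently, iff p exceeds the threshold C / L."""
    return int(p * L > C)
```

With C = 10 and L = 100 the threshold is 0.1, so a probability forecast of 0.3 translates into a 0/1 forecast of 1 even though the event is deemed unlikely; a different user with a higher cost-loss ratio would convert the same probability into a 0. This is precisely why the probability forecast, combined with each user's own loss function, dominates a single preset point forecast.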
References
Abrevaya, J. and Huang, J. (2005), ‘On the Bootstrap of the Maximum Score Estimator’, Economet-
rica 73, 1175–1204.
Abu-Mostafa, Y. S., Atiya, A. F., Magdon-Ismail, M. and White, H. (2001), ‘Introduction to the
Special Issue on Neural Networks in Financial Engineering’, IEEE Transactions on Neural Networks
12, 653–656.
Agresti, A. (2007), An Introduction to Categorical Data Analysis, John Wiley & Sons.
Ai, C. and Li, Q. (2008), Semi-parametric and Non-parametric Methods in Panel Data Models, in
L. Matyas and P. Sevestre, eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Devel-
opments in Theory and Practice’, Springer, pp. 451–478.
Albert, J. (2009), Bayesian Computation with R, Springer.
Albert, J. H. and Chib, S. (1993), ‘Bayesian Analysis of Binary and Polychotomous Response Data’,
Journal of the American Statistical Association 88, 669–679.
Amemiya, T. (1985), Advanced Econometrics, Harvard University Press.
Amemiya, T. and Vuong, Q. H. (1987), ‘A Comparison of Two Consistent Estimators in the Choice-
Based Sampling Qualitative Response Model’, Econometrica 55, 699–702.
Anatolyev, S. (2009), ‘Multi-Market Direction-of-Change Modeling Using Dependence Ratios’, Stud-
ies in Nonlinear Dynamics & Econometrics 13, Article 5.
Andersen, E. B. (1970), ‘Asymptotic Properties of Conditional Maximum-Likelihood Estimators’,
Journal of the Royal Statistical Society, Series B 32, 283–301.
Arellano, M. and Carrasco, R. (2003), ‘Binary Choice Panel Data Models with Predetermined Vari-
ables’, Journal of Econometrics 115, 125–157.
Baltagi, B. H. (2012), Panel Data Forecasting, in A. Timmermann and G. Elliott, eds, ‘Handbook of
Economic Forecasting (forthcoming)’, North-Holland Amsterdam.
Bates, J. M. and Granger, C. W. J. (1969), ‘The Combination of Forecasts’, Operational Research
Quarterly 20, 451–468.
Bauer, E. and Kohavi, R. (1999), ‘An Empirical Comparison of Voting Classification Algorithms:
Bagging, Boosting, and Variants’, Machine Learning 36, 105–139.
Berge, T. J. and Jorda, O. (2011), ‘Evaluating the Classification of Economic Activity into Recessions
and Expansions’, American Economic Journal: Macroeconomics 3, 246–277.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press.
Blaskowitz, O. and Herwartz, H. (2008), Testing Directional Forecast Value in the Presence of Serial
Correlation. Humboldt University, Collaborative Research Center 649, SFB 649, Discussion Papers.
Blaskowitz, O. and Herwartz, H. (2009), ‘Adaptive Forecasting of the EURIBOR Swap Term Struc-
ture’, Journal of Forecasting 28, 575–594.
Blaskowitz, O. and Herwartz, H. (2011), ‘On Economic Evaluation of Directional Forecasts’, Inter-
national Journal of Forecasting 27, 1058–1065.
Bontemps, C., Racine, J. S. and Simioni, M. (2009), Nonparametric vs Parametric Binary Choice
Models: An Empirical Investigation. Toulouse School of Economics TSE Working Papers with num-
ber 09-126.
Braun, P. A. and Yaniv, I. (1992), ‘A Case Study of Expert Judgment: Economists’ Probabilities
Versus Base-Rate Model Forecasts’, Journal of Behavioral Decision Making 5, 217–231.
Breiman, L. (1996), ‘Bagging Predictors’, Machine Learning 24, 123–140.
Breiman, L., Friedman, J., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees,
Chapman & Hall.
Brier, G. W. (1950), ‘Verification of Forecasts Expressed in Terms of Probability’, Monthly Weather
Review 78, 1–3.
Buhlmann, P. and Yu, B. (2002), ‘Analyzing Bagging’, Annals of Statistics 30, 927–961.
Buja, A. and Stuetzle, W. (2006), ‘Observations on Bagging’, Statistica Sinica 16, 323–351.
Bull, S. B., Greenwood, C. M. T. and Hauck, W. W. (1997), ‘Jackknife Bias Reduction for Polychoto-
mous Logistic Regression’, Statistics in Medicine 16, 545–560.
Carroll, R. J., Ruppert, D. and Welsh, A. H. (1998), ‘Local Estimating Equations’, Journal of the
American Statistical Association 93, 214–227.
Caudill, S. B. (2003), ‘Predicting Discrete Outcomes with the Maximum Score Estimator: the Case
of the NCAA Men’s Basketball Tournament’, International Journal of Forecasting 19, 313–317.
Cavanagh, C. L. (1987), Limiting Behavior of Estimators Defined by Optimization. Unpublished
Manuscript, Department of Economics, Harvard University.
Chamberlain, G. (1980), ‘Analysis of Covariance with Qualitative Data’, Review of Economic Studies
47, 225–238.
Chamberlain, G. (1984), Panel Data, in Z. Griliches and M. D. Intriligator, eds, ‘Handbook of Econometrics’, North-Holland Amsterdam, pp. 1248–1318.
Chauvet, M. and Potter, S. (2005), ‘Forecasting Recessions using the Yield Curve’, Journal of Fore-
casting 24, 77–103.
Chib, S. (2008), Panel Data Modeling and Inference: A Bayesian Primer, in L. Matyas and P. Sevestre,
eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Prac-
tice’, Springer, pp. 479–515.
Clark, T. E. and McCracken, M. W. (2012), Advances in Forecast Evaluation, in A. Timmermann and
G. Elliott, eds, ‘Handbook of Economic Forecasting (forthcoming)’, North-Holland Amsterdam.
Clemen, R. T. (1989), ‘Combining Forecasts: A Review and Annotated Bibliography’, International
Journal of Forecasting 5, 559–583.
Clemen, R. T. and Winkler, R. L. (1986), ‘Combining Economic Forecasts’, Journal of Business &
Economic Statistics 4, 39–46.
Clemen, R. T. and Winkler, R. L. (1999), ‘Combining Probability Distributions From Experts in Risk
Analysis’, Risk Analysis 19, 187–203.
Clemen, R. T. and Winkler, R. L. (2007), Aggregating Probability Distributions, in W. Edwards, R.
F. Miles and D. von Winterfeldt, eds, ‘Advances in Decision Analysis: From Foundations to Applica-
tions’, Cambridge University Press, pp. 154–176.
Clements, M. P. (2006), ‘Evaluating the Survey of Professional Forecasters Probability Distributions
of Expected Inflation Based on Derived Event Probability Forecasts’, Empirical Economics 31, 49–64.
Clements, M. P. (2008), ‘Consensus and Uncertainty: Using Forecast Probabilities of Output De-
clines’, International Journal of Forecasting 24, 76–86.
Clements, M. P. (2011), ‘An Empirical Investigation of the Effects of Rounding on the SPF Probabili-
ties of Decline and Output Growth Histograms’, Journal of Money, Credit and Banking 43, 207–220.
Cortes, C. and Mohri, M. (2005), Confidence Intervals for the Area under the ROC Curve. Advances
in Neural Information Processing Systems (NIPS 2004).
Cosslett, S. R. (1993), Estimation from Endogenously Stratified Samples, in G. S. Maddala, C. R. Rao
and H. D. Vinod, eds, ‘Handbook of Statistics 11 (Econometrics)’, North-Holland Amsterdam, pp. 1–
44.
Cramer, J. S. (1999), ‘Predictive Performance of the Binary Logit Model in Unbalanced Samples’,
Journal of the Royal Statistical Society, Series D 48, 85–94.
Croushore, D. (1993), Introducing: The Survey of Professional Forecasters. Federal Reserve Bank of
Philadelphia Business Review, November/December, 3-13.
Dawid, A. P. (1984), ‘Present Position and Potential Developments: Some Personal Views: Statistical
Theory: The Prequential Approach’, Journal of the Royal Statistical Society, Series A 147, 278–292.
Delgado, M. A., Rodríguez-Poo, J. M. and Wolf, M. (2001), ‘Subsampling Inference in Cube
Root Asymptotics with an Application to Manski’s Maximum Score Estimator’, Economics Letters
73, 241–250.
Deutsch, M., Granger, C. W. J. and Terasvirta, T. (1994), ‘The Combination of Forecasts Using Chang-
ing Weights’, International Journal of Forecasting 10, 47–57.
Diebold, F. X. (2006), Elements of Forecasting, South-Western College.
Diebold, F. X. and Lopez, J. A. (1997), Forecast Evaluation and Combination, in G.S. Maddala and
C.R. Rao, eds, ‘Handbook of Statistics 14 (Statistical Methods in Finance)’, North-Holland Amster-
dam, pp. 241–268.
Diebold, F. X. and Mariano, R. S. (1995), ‘Comparing Predictive Accuracy’, Journal of Business &
Economic Statistics 13, 253–263.
Donkers, B. and Melenberg, B. (2002), Testing Predictive Performance of Binary Choice Models.
Erasmus School of Economics, Econometric Institute Research Papers.
Egan, J. P. (1975), Signal Detection Theory and ROC Analysis, Academic Press.
Elliott, G. and Lieli, R. P. (2010), Predicting Binary Outcomes. Working paper, Department of Eco-
nomics, University of California, San Diego.
Engelberg, J., Manski, C. F. and Williams, J. (2011), ‘Assessing the Temporal Variation of Macroeco-
nomic Forecasts by a Panel of Changing Composition’, Journal of Applied Econometrics 26, 1059–
1078.
Engle, R. F. (2000), ‘The Econometrics of Ultra-High-Frequency Data’, Econometrica 68, 1–22.
Engle, R. F. and Russell, J. R. (1997), ‘Forecasting the Frequency of Changes in Quoted Foreign
Exchange Prices with the ACD Model’, Journal of Empirical Finance 12, 187–212.
Engle, R. F. and Russell, J. R. (1998), ‘Autoregressive Conditional Duration: A New Model for Irreg-
ularly Spaced Transaction Data’, Econometrica 66, 1127–1162.
Estrella, A. and Mishkin, F. S. (1996), ‘The Yield Curve as a Predictor of U.S. Recessions’, Current
Issues in Economics and Finance 2, 41–51.
Estrella, A. (1998), ‘A New Measure of Fit for Equations with Dichotomous Dependent Variables’,
Journal of Business & Economic Statistics 16, 198–205.
Estrella, A. and Mishkin, F. S. (1998), ‘Predicting U.S. Recessions: Financial Variables as Leading
Indicators’, The Review of Economics and Statistics 80, 45–61.
Evgeniou, T., Pontil, M. and Elisseeff, A. (2004), ‘Leave One Out Error, Stability, and Generalization
of Voting Combinations of Classifiers’, Machine Learning 55, 71–97.
Faraggi, D. and Reiser, B. (2002), ‘Estimation of the Area Under the ROC Curve’, Statistics in
Medicine 21, 3093–3106.
Fawcett, T. (2006), ‘An Introduction to ROC Analysis’, Pattern Recognition Letters 27, 861–874.
Florios, K. and Skouras, S. (2007), Computation of Maximum Score Type Estimators by Mixed In-
teger Programming. Working paper, Department of International and European Economic Studies,
Athens University of Economics and Business.
Friedman, J. H. and Hall, P. (2007), ‘On Bagging and Nonlinear Estimation’, Journal of Statistical
Planning and Inference 137, 669–683.
Frolich, M. (2006), ‘Non-parametric Regression for Binary Dependent Variables’, Econometrics Jour-
nal 9, 511–540.
Galbraith, J. W. and van Norden, S. (2007), ‘Assessing Gross Domestic Product and Inflation Proba-
bility Forecasts Derived from Bank of England Fan Charts’, Journal of the Royal Statistical Society,
Series A 175, 1–15.
Gandin, L. S. and Murphy, A. H. (1992), ‘Equitable Skill Scores for Categorical Forecasts’, Monthly
Weather Review 120, 361–370.
Genest, C. and Zidek, J. V. (1986), ‘Combining Probability Distributions: A Critique and an Annotated
Bibliography’, Statistical Science 1, 114–135.
Gneiting, T. (2011), ‘Making and Evaluating Point Forecasts’, Journal of the American Statistical
Association 106, 746–762.
Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007), ‘Probabilistic Forecasts, Calibration and Sharp-
ness’, Journal of the Royal Statistical Society, Series B 69, 243–268.
Gneiting, T. and Raftery, A. E. (2007), ‘Strictly Proper Scoring Rules, Prediction, and Estimation’,
Journal of the American Statistical Association 102, 359–378.
Gourieroux, C. and Monfort, A. (1993), ‘Simulation-based Inference: A Survey with Special Refer-
ence to Panel Data Models’, Journal of Econometrics 59, 5–33.
Gozalo, P. and Linton, O. (2000), ‘Local Nonlinear Least Squares: Using Parametric Information in
Nonparametric Regression’, Journal of Econometrics 99, 63–106.
Gradojevic, N. and Yang, J. (2006), ‘Non-linear, Non-parametric, Non-fundamental Exchange Rate
Forecasting’, Journal of Forecasting 25, 227–245.
Graham, J. R. (1996), ‘Is a Group of Economists Better Than One? Than None?’, Journal of Business
69, 193–232.
Grammig, J. and Kehrle, K. (2008), ‘A New Marked Point Process Model for the Federal Funds
Rate Target: Methodology and Forecast Evaluation’, Journal of Economic Dynamics and Control
32, 2370–2396.
Granger, C. W. J. and Jeon, Y. (2004), ‘Thick Modeling’, Economic Modelling 21, 323–343.
Granger, C. W. J. and Newbold, P. (1986), Forecasting Economic Time Series, Academic Press.
Granger, C. W. J. and Pesaran, M. H. (2000a), A Decision-Theoretic Approach to Forecast Evaluation,
in W. S. Chan, W. K. Li and H. Tong, eds, ‘Statistics and Finance: An Interface’, Imperial College
Press, pp. 261–278.
Granger, C. W. J. and Pesaran, M. H. (2000b), ‘Economic and Statistical Measures of Forecast Accu-
racy’, Journal of Forecasting 19, 537–560.
Greene, W. H. (2011), Econometric Analysis, Prentice Hall.
Greer, M. R. (2005), ‘Combination Forecasting for Directional Accuracy: An Application to Survey
Interest Rate Forecasts’, Journal of Applied Statistics 32, 607–615.
Griffiths, W. E., Hill, R. C. and Pope, P. J. (1987), ‘Small Sample Properties of Probit Model Estima-
tors’, Journal of the American Statistical Association 82, 929–937.
Hamilton, J. D. (1989), ‘A New Approach to the Economic Analysis of Nonstationary Time Series
and the Business Cycle’, Econometrica 57, 357–384.
Hamilton, J. D. (1990), ‘Analysis of Time Series Subject to Changes in Regime’, Journal of Econo-
metrics 45, 39–70.
Hamilton, J. D. (1993), Estimation, Inference and Forecasting of Time Series Subject to Changes in
Regime, in G. S. Maddala, C. R. Rao and H. D. Vinod, eds, ‘Handbook of Statistics 11 (Economet-
rics)’, North-Holland Amsterdam, pp. 231–260.
Hamilton, J. D. (1994), Time Series Analysis, Princeton University Press.
Hamilton, J. D. and Jorda, O. (2002), ‘A Model of the Federal Funds Rate Target’, Journal of Political
Economy 110, 1135–1167.
Hao, L. and Ng, E. C. Y. (2011), ‘Predicting Canadian Recessions using Dynamic Probit Modelling
Approaches’, Canadian Journal of Economics 44, 1297–1330.
Harding, D. and Pagan, A. (2011), ‘An Econometric Analysis of Some Models for Constructed Binary
Time Series’, Journal of Business & Economic Statistics 29, 86–95.
Hardle, W. and Stoker, T. M. (1989), ‘Investigating Smooth Multiple Regression by the Method of
Average Derivatives’, Journal of the American Statistical Association 84, 986–995.
Harvey, D., Leybourne, S. and Newbold, P. (1997), ‘Testing the Equality of Prediction Mean Squared
Errors’, International Journal of Forecasting 13, 281–291.
Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.
Heckman, J. J. (1981), The Incidental Parameters Problem and the Problem of Initial Conditions in
Estimating a Discrete Time-Discrete Data Stochastic Process and some Monte-Carlo Evidence, in C.
F. Manski and D. McFadden, eds, ‘Structural Analysis of Discrete Data’, MIT Press, pp. 179–195.
Hertz, J., Krogh, A. and Palmer, R. G. (1991), Introduction to the Theory of Neural Computation,
Westview Press.
Horowitz, J. L. (1992), ‘A Smoothed Maximum Score Estimator for the Binary Response Model’,
Econometrica 60, 505–531.
Horowitz, J. L. (2009), Semiparametric and Nonparametric Methods in Econometrics, Springer.
Horowitz, J. L. and Mammen, E. (2004), ‘Nonparametric Estimation of an Additive Model with a
Link Function’, Annals of Statistics 32, 2412–2443.
Horowitz, J. L. and Mammen, E. (2007), ‘Rate-Optimal Estimation for a General Class of Nonpara-
metric Regression Models with Unknown Link Functions’, Annals of Statistics 35, 2589–2619.
Hristache, M., Juditsky, A. and Spokoiny, V. (2001), ‘Direct Estimation of the Index Coefficient in a
Single-Index Model’, Annals of Statistics 29, 595–623.
Hsiao, C. (1996), Logit and Probit Models, in L. Matyas and P. Sevestre, eds, ‘The Econometrics of
Panel Data: Handbook of Theory and Applications’, Kluwer Academic Publishers, pp. 410–428.
Hu, L. and Phillips, P. C. B. (2004a), ‘Dynamics of the Federal Funds Target Rate: A Nonstationary
Discrete Choice Approach’, Journal of Applied Econometrics 19, 851–867.
Hu, L. and Phillips, P. C. B. (2004b), ‘Nonstationary Discrete Choice’, Journal of Econometrics
120, 103–138.
Ichimura, H. (1993), ‘Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-
Index Models’, Journal of Econometrics 58, 71–120.
Imbens, G. W. (1992), ‘An Efficient Method of Moments Estimator for Discrete Choice Models With
Choice-Based Sampling’, Econometrica 60, 1187–1214.
Imbens, G. W. and Lancaster, T. (1996), ‘Efficient Estimation and Stratified Sampling’, Journal of
Econometrics 74, 289–318.
Inoue, A. and Kilian, L. (2008), ‘How Useful is Bagging in Forecasting Economic Time Series? A
Case Study of U.S. CPI Inflation’, Journal of the American Statistical Association 103, 511–522.
Kauppi, H. (2012), ‘Predicting the Direction of the Fed’s Target Rate’, Journal of Forecasting 31, 47–
67.
Kauppi, H. and Saikkonen, P. (2008), ‘Predicting U.S. Recessions with Dynamic Binary Response
Models’, The Review of Economics and Statistics 90, 777–791.
Kim, J. and Pollard, D. (1990), ‘Cube Root Asymptotics’, Annals of Statistics 18, 191–219.
King, G. and Zeng, L. (2001), ‘Logistic Regression in Rare Events Data’, Political Analysis 9, 137–
163.
Kitamura, Y. (2001), Predictive Inference and the Bootstrap. Working paper, Yale University.
Klein, R. W. and Spady, R. H. (1993), ‘An Efficient Semiparametric Estimator for Binary Response
Models’, Econometrica 61, 387–421.
Koenker, R. and Yoon, J. (2009), ‘Parametric Links for Binary Choice Models: A Fisherian-Bayesian
Colloquy’, Journal of Econometrics 152, 120–130.
Koop, G. (2003), Bayesian Econometrics, John Wiley & Sons.
Krzanowski, W. J. and Hand, D. J. (2009), ROC Curves for Continuous Data, Chapman & Hall.
Krzysztofowicz, R. (1992), ‘Bayesian Correlation Score: A Utilitarian Measure of Forecast Skill’,
Monthly Weather Review 120, 208–219.
Krzysztofowicz, R. and Long, D. (1990), ‘Fusion of Detection Probabilities and Comparison of Mul-
tisensor Systems’, IEEE Transactions on Systems, Man, and Cybernetics 20, 665–677.
Kuan, C. M. and White, H. (1994), ‘Artificial Neural Networks: An Econometric Perspective’, Econo-
metrics Reviews 13, 1–91.
Kuncheva, L. I. (2004), Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons.
Kuncheva, L. I. and Whitaker, C. J. (2003), ‘Measures of Diversity in Classifier Ensembles and Their
Relationship with the Ensemble Accuracy’, Machine Learning 51, 181–207.
Lahiri, K., Monokroussos, G. and Zhao, Y. (2012a), The Yield Spread Puzzle and the Information
Content of SPF Forecasts. CESifo Working Paper Series No. 3949.
Lahiri, K., Peng, H. and Zhao, Y. (2012b), Evaluating the Value of Probability Forecasts in the Sense
of Merton. Paper presented at the 7th New York Camp Econometrics.
Lahiri, K., Teigland, C. and Zaporowski, M. (1988), ‘Interest Rates and the Subjective Probability
Distribution of Inflation Forecasts’, Journal of Money, Credit and Banking 20, 233–248.
Lahiri, K. and Wang, J. G. (1994), ‘Predicting Cyclical Turning Points with Leading Index in a Markov
Switching Model’, Journal of Forecasting 13, 245–263.
Lahiri, K. and Wang, J. G. (2006), ‘Subjective Probability Forecasts for Recessions: Evaluation and
Guidelines for Use’, Business Economics 41, 26–37.
Lahiri, K. and Wang, J. G. (2012), Evaluating Probability Forecasts for GDP Declines using Alter-
native Methodologies. Working paper, Department of Economics, State University of New York at
Albany.
Lawrence, M., Goodwin, P., O’Connor, M. and Onkal, D. (2006), ‘Judgmental Forecasting: A Review
of Progress over the Last 25 Years’, International Journal of Forecasting 22, 493–518.
Lechner, M., Lollivier, S. and Magnac, T. (2008), Parametric Binary Choice Models, in L. Matyas
and P. Sevestre, eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Developments in
Theory and Practice’, Springer, pp. 215–245.
Lee, L. F. (1992), ‘On Efficiency of Methods of Simulated Moments and Maximum Simulated Like-
lihood Estimation of Discrete Response Models’, Econometric Theory 8, 518–552.
Lee, T. H. and Yang, Y. (2006), ‘Bagging Binary and Quantile Predictors for Time Series’, Journal of
Econometrics 135, 465–497.
Leitch, G. and Tanner, J. (1995), ‘Professional Economic Forecasts: Are They Worth Their Costs?’,
Journal of Forecasting 14, 143–157.
Li, Q. and Racine, J. S. (2006), Nonparametric Econometrics: Theory and Practice, Princeton Uni-
versity Press.
Lieli, R. P. and Nieto-Barthaburu, A. (2010), ‘Optimal Binary Prediction for Group Decision Making’,
Journal of Business & Economic Statistics 28, 308–319.
Lieli, R. P. and Springborn, M. (2012), ‘Closing the Gap Between Risk Estimation and Decision-
Making: Efficient Management of Trade-Related Invasive Species Risk’, Review of Economics and
Statistics (forthcoming) .
Liu, H., Li, G., Cumberland, W. G. and Wu, T. (2005), ‘Testing Statistical Significance of the Area Un-
der a Receiving Operating Characteristics Curve for Repeated Measures Design with Bootstrapping’,
Journal of Data Science 3, 257–278.
Lopez, J. A. (2001), ‘Evaluating the Predictive Accuracy of Volatility Models’, Journal of Forecasting
20, 87–109.
Lovell, M. C. (1986), ‘Tests of the Rational Expectations Hypothesis’, The American Economic Review
76, 110–124.
Maddala, G. S. (1983), Limited-dependent and Qualitative Variables in Econometrics, Cambridge
University Press.
Maddala, G. S. and Lahiri, K. (2009), Introduction to Econometrics, John Wiley & Sons.
Manski, C. F. (1975), ‘Maximum Score Estimation of the Stochastic Utility Model of Choice’, Journal
of Econometrics 3, 205–228.
Manski, C. F. (1985), ‘Semiparametric Analysis of Discrete Response: Asymptotic Properties of the
Maximum Score Estimator’, Journal of Econometrics 27, 313–333.
Manski, C. F. (1988), ‘Identification of Binary Response Models’, Journal of the American Statistical
Association 83, 729–738.
Manski, C. F. and Lerman, S. R. (1977), ‘The Estimation of Choice Probabilities from Choice Based
Samples’, Econometrica 45, 1977–1988.
Manski, C. F. and Thompson, T. S. (1986), ‘Operational Characteristics of Maximum Score Estima-
tion’, Journal of Econometrics 32, 85–108.
Manski, C. F. and Thompson, T. S. (1989), ‘Estimation of Best Predictors of Binary Response’, Jour-
nal of Econometrics 40, 97–123.
Manzato, A. (2007), ‘A Note On the Maximum Peirce Skill Score’, Weather and Forecasting
22, 1148–1154.
Marcellino, M. (2004), ‘Forecasting EMU Macroeconomic Variables’, International Journal of Fore-
casting 20, 359–372.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979), Multivariate Analysis, Academic Press.
Mason, I. B. (2003), Binary Events, in I. T. Jolliffe and D. B. Stephenson, eds, ‘Forecast Verification:
A Practitioner’s Guide in Atmospheric Science’, John Wiley & Sons, pp. 37–76.
Mason, S. J. and Graham, N. E. (2002), ‘Areas Beneath the Relative Operating Characteristics (ROC)
and Relative Operating Levels (ROL) Curves: Statistical Significance and Interpretation’, Quarterly
Journal of the Royal Meteorological Society 128, 2145–2166.
Meese, R. and Rogoff, K. (1988), ‘Was it Real? The Exchange Rate-Interest Differential Relation
Over the Modern Floating-Rate Period’, Journal of Finance 43, 933–948.
Merton, R. C. (1981), ‘On Market Timing and Investment Performance. I. An Equilibrium Theory of Value for Market Forecasts’, Journal of Business 54, 363–406.
Michie, D., Spiegelhalter, D. J. and Taylor, C. C. (1994), Machine Learning, Neural and Statistical
Classification, Prentice Hall.
Monokroussos, G. (2011), ‘Dynamic Limited Dependent Variable Modeling and U.S. Monetary Pol-
icy’, Journal of Money, Credit and Banking 43, 519–534.
Morgan, J. N. and Sonquist, J. A. (1963), ‘Problems in the Analysis of Survey Data, and a Proposal’,
Journal of the American Statistical Association 58, 415–434.
Murphy, A. H. (1973), ‘A New Vector Partition of the Probability Score’, Journal of Applied Meteo-
rology 12, 595–600.
Murphy, A. H. (1977), ‘The Value of Climatological, Categorical and Probabilistic Forecasts in the
Cost-Loss Situation’, Monthly Weather Review 105, 803–816.
Murphy, A. H. and Daan, H. (1985), Forecast Evaluation, in A. H. Murphy and R. W. Katz, eds,
‘Probability, Statistics, and Decision Making in the Atmospheric Sciences’, Westview Press, pp. 379–
437.
Murphy, A. H. and Winkler, R. L. (1984), ‘Probability Forecasting in Meteorology’, Journal of the
American Statistical Association 79, 489–500.
Murphy, A. H. and Winkler, R. L. (1987), ‘A General Framework for Forecast Verification’, Monthly
Weather Review 115, 1330–1338.
Mylne, K. R. (1999), The Use of Forecast Value Calculations for Optimal Decision-making Using
Probability Forecasts, in ‘17th Conference on Weather Analysis and Forecasting’, American Meteo-
rological Society, Boston, Massachusetts, pp. 235–239.
Park, J. Y. and Phillips, P. C. B. (2000), ‘Nonstationary Binary Choice’, Econometrica 68, 1249–1280.
Parker, D. B. (1985), Learning Logic. Technical Report TR-47, Cambridge MA: MIT Center for
Research in Computational Economics and Management Science.
Patton, A. J. (2006), ‘Modelling Asymmetric Exchange Rate Dependence’, International Economic
Review 47, 527–556.
Patton, A. J. and Timmermann, A. (2012), ‘Forecast Rationality Tests Based on Multi-Horizon
Bounds’, Journal of Business & Economic Statistics 30, 1–17.
Peirce, C. S. (1884), ‘The Numerical Measure of the Success of Predictions’, Science 4, 453–454.
Pesaran, M. H. and Skouras, S. (2002), Decision-Based Methods for Forecast Evaluation, in M.
P. Clements and D. F. Hendry, eds, ‘A companion to Economic Forecasting’, Wiley-Blackwell,
pp. 241–267.
Pesaran, M. H. and Timmermann, A. (1992), ‘A Simple Nonparametric Test of Predictive Perfor-
mance’, Journal of Business & Economic Statistics 10, 461–465.
Pesaran, M. H. and Timmermann, A. (2009), ‘Testing Dependence among Serially Correlated Multi-
Category Variables’, Journal of the American Statistical Association 104, 325–337.
Powell, J. L., Stock, J. H. and Stoker, T. M. (1989), ‘Semiparametric Estimation of Index Coefficients’,
Econometrica 57, 1403–1430.
Primo, C., Ferro, C. A. T., Jolliffe, I. T. and Stephenson, D. B. (2009), ‘Combination and Calibration
Methods for Probabilistic Forecasts of Binary Events’, Monthly Weather Review 137, 1142–1149.
Quinlan, J. R. (1992), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Racine, J. S. and Parmeter, C. F. (2009), Data-driven Model Evaluation: a Test for Revealed Perfor-
mance. McMaster University Working Papers.
Ranjan, R. and Gneiting, T. (2010), ‘Combining Probability Forecasts’, Journal of the Royal Statistical
Society, Series B 72, 71–91.
Refenes, A. P. and White, H. (1998), ‘Neural Networks and Financial Economics’, International
Journal of Forecasting 17, 347–495.
Richardson, D. S. (2003), Economic Value and Skill, in I. T. Jolliffe and D. B. Stephenson, eds,
‘Forecast Verification: A Practitioner’s Guide in Atmospheric Science’, John Wiley & Sons, pp. 165–
187.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Rudebusch, G. D. and Williams, J. C. (2009), ‘Forecasting Recessions: The Puzzle of the Enduring
Power of the Yield Curve’, Journal of Business & Economic Statistics 27, 492–503.
Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986), Learning Internal Representations by
Error Propagation, in D. E. Rumelhart, J. L. McClelland and the PDP Research Group, eds, ‘Parallel
Distributed Processing: Explorations in the Microstructure of Cognition’, MIT Press, pp. 318–362.
Schervish, M. J. (1989), ‘A General Method for Comparing Probability Assessors’, Annals of Statistics
17, 1856–1879.
Scott, A. J. and Wild, C. J. (1986), ‘Fitting Logistic Models Under Case-Control or Choice Based
Sampling’, Journal of the Royal Statistical Society, Series B 48, 170–182.
Scotti, C. (2011), ‘A Bivariate Model of Federal Reserve and ECB Main Policy Rates’, International
Journal of Central Banking 7, 37–78.
Seillier-Moiseiwitsch, F. and Dawid, A. P. (1993), ‘On Testing the Validity of Sequential Probability
Forecasts’, Journal of the American Statistical Association 88, 355–359.
Steinberg, D. and Cardell, N. S. (1992), ‘Estimating Logistic Regression Models When the Dependent
Variable Has no Variance’, Communications in Statistics-Theory and Methods 21, 423–450.
Stephenson, D. B. (2000), ‘Use of the ‘Odds Ratio’ for Diagnosing Forecast Skill’, Weather and Forecasting 15, 221–232.
Stock, J. H. and Watson, M. W. (1999), A Comparison of Linear and Nonlinear Univariate Models for
Forecasting Macroeconomic Time Series, in R. F. Engle and H. White, eds, ‘Cointegration, Causality,
and Forecasting, A Festschrift in Honor of Clive W. J. Granger’, Oxford University Press, pp. 1–44.
Stock, J. H. and Watson, M. W. (2005), An Empirical Comparison of Methods for Forecasting Using
Many Predictors. Working paper, Harvard University and Princeton University.
Stoker, T. M. (1986), ‘Consistent Estimation of Scaled Coefficients’, Econometrica 54, 1461–1481.
Stoker, T. M. (1991a), Equivalence of Direct, Indirect and Slope Estimators of Average Derivatives,
in W. A. Barnett, J. Powell and G. Tauchen, eds, ‘Nonparametric and Semiparametric Methods in
Econometrics and Statistics’, Cambridge University Press, pp. 99–118.
Stoker, T. M. (1991b), Lectures on Semiparametric Econometrics, Louvain-la-Neuve, Belgium:
CORE Foundation.
Swanson, N. R. and White, H. (1995), ‘A Model Selection Approach to Assessing the Information
in the Term Structure Using Linear Models and Artificial Neural Networks’, Journal of Business &
Economic Statistics 13, 265–275.
Swanson, N. R. and White, H. (1997a), ‘Forecasting Economic Time Series Using Flexible Ver-
sus Fixed Specification and Linear Versus Nonlinear Econometric Models’, International Journal of
Forecasting 13, 439–461.
Swanson, N. R. and White, H. (1997b), ‘A Model Selection Approach to Real-Time Macroeconomic
Forecasting Using Linear Models and Artificial Neural Networks’, The Review of Economics and
Statistics 79, 540–550.
Swets, J. A. (1996), Signal Detection Theory and ROC Analysis in Psychology and Diagnostics:
Collected Papers, Lawrence Erlbaum Associates.
Tajar, A., Denuit, M. and Lambert, P. (2001), Copula-Type Representation for Random Couples with
Bernoulli Margins. Discussion paper 0118, Université Catholique de Louvain.
Tavaré, S. and Altham, P. M. E. (1983), ‘Dependence in Goodness of Fit Tests and Contingency
Tables’, Biometrika 70, 139–144.
Teräsvirta, T., Tjøstheim, D. and Granger, C. W. J. (2010), Modelling Nonlinear Economic Time
Series, Oxford University Press.
Teräsvirta, T., van Dijk, D. and Medeiros, M. C. (2005), ‘Smooth Transition Autoregressions, Neural
Networks, and Linear Models in Forecasting Macroeconomic Time Series: A Re-examination’,
International Journal of Forecasting 21, 755–774.
Thompson, J. C. and Brier, G. W. (1955), ‘The Economic Utility of Weather Forecasts’, Monthly
Weather Review 83, 249–254.
Tibshirani, R. and Hastie, T. (1987), ‘Local Likelihood Estimation’, Journal of the American Statisti-
cal Association 82, 559–567.
Timmermann, A. (2006), Forecast Combinations, in G. Elliott, C. W. J. Granger and A. Timmermann,
eds, ‘Handbook of Economic Forecasting’, North-Holland, Amsterdam, pp. 135–196.
Toth, Z., Talagrand, O., Candille, G. and Zhu, Y. (2003), Probability and Ensemble Forecasts, in I.
T. Jolliffe and D. B. Stephenson, eds, ‘Forecast Verification: A Practitioner’s Guide in Atmospheric
Science’, John Wiley & Sons, pp. 137–163.
Train, K. E. (2003), Discrete Choice Methods with Simulation, Cambridge University Press.
Wallsten, T. S., Budescu, D. V., Erev, I. and Diederich, A. (1997), ‘Evaluating and Combining Sub-
jective Probability Estimates’, Journal of Behavioral Decision Making 10, 243–268.
West, K. D. (1996), ‘Asymptotic Inference about Predictive Ability’, Econometrica 64, 1067–1084.
Wickens, T. D. (2001), Elementary Signal Detection Theory, Oxford University Press.
Wilks, D. S. (2001), ‘A Skill Score Based on Economic Value for Probability Forecasts’, Meteorolog-
ical Applications 8, 209–219.
Windmeijer, F. A. G. (1995), ‘Goodness-of-Fit Measures in Binary Choice Models’, Econometric
Reviews 14, 101–116.
Wooldridge, J. M. (2005), ‘Simple Solutions to the Initial Conditions Problem in Dynamic, Nonlinear
Panel Data Models with Unobserved Heterogeneity’, Journal of Applied Econometrics 20, 39–54.
Xie, Y. and Manski, C. F. (1989), ‘The Logit Model and Response-Based Samples’, Sociological
Methods and Research 17, 283–302.
Yang, Y. (2004), ‘Combining Forecasting Procedures: Some Theoretical Results’, Econometric The-
ory 20, 176–222.
Yates, J. F. (1982), ‘External Correspondence: Decompositions of the Mean Probability Score’, Or-
ganizational Behavior and Human Performance 30, 132–156.
Zhou, X. H., Obuchowski, N. A. and McClish, D. K. (2002), Statistical Methods in Diagnostic
Medicine, John Wiley & Sons.