Forecasting Binary Outcomes∗
Kajal Lahiri†and Liu Yang‡
Department of Economics, University at Albany, SUNY
NY 12222, USA
Forthcoming in Handbook of Economic Forecasting, Vol. 2 (Eds. G. Elliott and A. Timmermann)
Abstract
Binary events are involved in many economic decision problems. In recent years, consid-
erable progress has been made in diverse disciplines in developing models for forecasting
binary outcomes. We distinguish between two types of forecasts for binary events that
are generally obtained as the output of regression models: probability forecasts and point
forecasts. We summarize specification, estimation, and evaluation of binary response
models for the purpose of forecasting in a unified framework which is characterized by
the joint distribution of forecasts and actuals, and a general loss function. Analysis of
both the skill and the value of probability and point forecasts can be carried out within
this framework. Parametric, semiparametric, nonparametric, and Bayesian approaches
are covered. The emphasis is on the basic intuitions underlying each methodology, ab-
stracting away from the mathematical details.
JEL Classifications: C40, C50, C53, C80
Key words: Probability Prediction, Point Prediction, Skill, Value, Joint Distribution,
Loss Function.
∗We are indebted to the editors, two anonymous referees and the participants of the ‘Handbook’ Conference at St. Louis Fed for their constructive comments on an earlier version of this chapter. We are also grateful to Antony Davies, Arturo Estrella, Terry Kinal, Massimiliano Marcellino, and Yongchen Zhao for their help. Much of the revision of this chapter was completed when Kajal Lahiri was visiting the European University Institute as a Fernand Braudel Senior Fellow during 2012. The responsibility for all remaining errors and omissions is ours.
†Corresponding author. Tel.: +1 518 442 4758. E-mail address: [email protected].
‡Tel.: +1 518 779 3190. E-mail address: [email protected].
1 Introduction
The need for accurate prediction of events with binary outcomes, like loan defaults, occurrence of
recessions, or the passage of specific legislation, arises often in economics and numerous other areas of
decision making. For example, a firm may base its production decisions on macroeconomic prospects;
a bank manager may decide whether to extend a loan to an individual depending on the risk of default;
and the propensity of a worker to apply for disability benefits is partially determined by the probability
of being approved.
How should one characterize a good forecast in these situations? Take the loan offer as an ex-
ample: a skilled bank manager with professional experience, after observing all relevant personal
characteristics of the applicant, is probably able to guess the odds that an applicant will default. How-
ever, this ability does not necessarily translate into a good decision because the ultimate payoff also
depends on the accurate assessment of the cost and benefit associated with a decision. The cost of
an incorrect approval of the loan can be larger than that of an incorrect denial, so that an optimal
decision will depend on how large this cost differential is. A manager, who may otherwise be a skillful
forecaster, is unable to make an optimal decision unless he is aware of the costs and benefits associated
with each of the binary outcomes. The value of a forecast can only be evaluated in a decision making
context.
It is useful to distinguish between two types of forecasts for binary outcomes: probability fore-
casts and point forecasts. The former is a member of the broader category of density forecasts, since
knowing the probability of a binary event is equivalent to knowing the entire density for the binary
variable. Growing interest in probability forecasts has mainly been dictated by the desire of the profes-
sional forecasting community to quantify forecast uncertainty, which is often ignored in making point
forecasts. After all, a primary purpose of forecasting is to reduce uncertainty. In practice, a set of co-
variates is available for predicting the binary outcome under consideration. In this setting, probability
forecasts only describe the objective statistical properties of the joint distribution between the event
and covariates, and thus can be analyzed first without considering forecast value. By contrast, a
binary point forecast, always being either 0 or 1, cannot logically be issued in isolation from the loss
function implicit in the underlying decision making problem. In this sense, probability forecasts are
more fundamental in nature. Because a point forecast is a mixture of the objective joint distribution
between the event and the covariates, and the loss function, we will defer an in-depth discussion of
binary point forecasts until some important concepts regarding forecast value have been introduced.
Given the importance of density and point forecasts for other types of target variables such as GDP
growth and inflation rates, one may wonder what feature of a binary outcome necessitates a separate
analysis and evaluation of its forecasts. It is the discrete support space of the dependent variable that
makes forecasting binary outcomes distinctive, and this restriction should be taken into account in the
specification, estimation, and evaluation exercises. For probability forecasts, any hypothesized model
ignoring this feature may lead to serious bias in forecasts. This, however, is not necessarily the case
in making binary point forecasts where the working model may violate this restriction, cf. Elliott and
Lieli (2010). Due to the nature of a binary event, its joint distribution and loss function are of special
forms, which can be used to design a wide array of tools for forecast evaluation and combination. For
most of these procedures, it is hard to find comparable counterparts in forecasting other types of target
variables.
This chapter summarizes a substantial body of literature on forecasting binary outcomes in a uni-
fied framework that has been developed in a number of disciplines such as biostatistics, computer
science, econometrics, mathematics, medical imaging, meteorology, and psychology. We cover only
those models and techniques that are common across these disciplines, with a focus on their appli-
cations in economic forecasting. Nevertheless, we give references to some of the methods excluded
from this analysis.
The outline of this chapter is as follows. In Section 2, we present methods for forecasting binary
outcomes that have been developed primarily by econometricians in the framework of binary regres-
sions. Section 3 is concerned with the evaluation methodologies for assessing binary forecast skill and
forecast value, most of which have been developed in meteorology and psychology. Section 4 is built
upon the previous two sections; it consists of models especially designed for binary point predictions.
We discuss two alternative methodologies to improve binary forecasts in Section 5. Section 6 closes
this chapter by underscoring the unified framework at the core of the literature, which provides
coherence to the diversity of issues and generic solutions.
2 Probability Predictions
This section addresses the issue of modeling the conditional probability of a binary event given an
information set available at the time of prediction. It is a special form of density prediction since, for a
Bernoulli distribution, knowing the conditional probability is equivalent to knowing the density. Four
classical binary response models developed in econometrics along with an empirical illustration will
come first, followed by generalizations to panel data forecasting. Sometimes, forecasts are not derived
from any estimated econometric model, but are completely subjective or judgemental. These will be
introduced briefly in Section 2.2.
2.1 Model-based probability predictions
For the purpose of probability predictions, the forecaster often has an information set (denoted by Ω)
that includes all variables relevant to the occurrence of a binary event. Incorporation of a particular
variable into Ω is justified either by economic theory or by the variable’s historical forecasting per-
formance. Suppose the dependent variable Y equals 1 when the target event occurs and 0 otherwise.
The question to be answered in this section is how to model the conditional probability of Y = 1 given
Ω, viz., P(Y = 1|Ω). The formulation of binary probability prediction in this manner is sufficiently
general to nest nearly all specific models that follow. For instance, if Ω contains lagged dependent
variables, then we have a dynamic model commonly used in macroeconomic forecasting. When it
comes to the functional form of the conditional probability, we can identify three broad approaches:
(i) a parametric model, which imposes a very strong assumption on P(Y = 1|Ω) so that the only unknown is a
finite-dimensional parameter vector; (ii) a nonparametric model, which does not constrain P(Y = 1|Ω)
beyond certain regular properties such as smoothness; and (iii) a semiparametric model which lies
between these two extremes in that it does restrict some elements of P(Y = 1|Ω), and yet allows flex-
ible specification of other elements. If Ω contains prior knowledge on the parameters, P(Y = 1|Ω)
is a Bayesian model that integrates the prior with sample information to yield the posterior predic-
tive probability. Before examining each specific model in detail, we will offer motivations as to why
special care must be taken when the dependent variable is binary.
For modeling a binary event, a natural question is whether we can treat it as an ordinary dependent
variable and assume a linear structure for P(Y = 1|Ω). In a linear probability model, for example, the
conditional probability of Y = 1 depends on a k-dimensional vector X in a linear way, that is,
P(Y = 1|Ω) = Xβ (1)
where Ω = X and β is a parameter vector conforming in dimension with X . However, this model
may not be suitable for the binary response case. As noted by Maddala (1983), for some range of
covariates X , Xβ may fall outside of [0,1]. This is not permissible given that conditional probability
must be a number between zero and one. Consequently, discreteness of binary dependent variables
calls for nonlinear econometric models, and the selected specification must tackle this issue properly.
The common approach to overcome the drawback associated with the linear model involves a
nonlinear link function taking values within [0,1]. One well-known example is the cumulative distri-
bution function for any random variable. Often, restrictions on P(Y = 1|Ω) are imposed within the
framework of the following latent dependent variable form (with Ω = X):
Y∗ = G(X) + ε,   ε distributed as F(·),
Y = 1 if Y∗ > 0, otherwise Y = 0.  (2)
Here, Y ∗ is a hypothesized latent variable with conditional expectation G(X), called the index function.
ε is a random error with cumulative distribution function F(·) and is independent of X . The observed
binary variable Y is generated according to (2). By design, the conditional probability of Y = 1 given
X must be a number between zero and one, as shown below:
E(Y|X) = P(Y = 1|X) = P(Y∗ > 0|X)
= P(ε > −G(X)|X)
= 1 − F(−G(X)).  (3)
Regardless of X, F(−G(X)) always lies inside [0,1], and so does the conditional expectation itself. In a
parametric model, the functional form of F(·) is known whereas the index G(·) is specified up to a
finite dimensional parameter vector β, that is, G(·) = G0(·,β) and the functional form of G0(·, ·) is
known. As mentioned earlier, a nonparametric model does not impose stringent restrictions on the
functional form of F(·) and G(·) besides some regular smoothness conditions. If either F(·) or G(·)
is flexible but the other is subject to specification, a semiparametric model results.
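As a quick illustration, the latent-variable mechanism in (2)-(3) can be simulated directly. In the sketch below, the linear index, coefficient values, error distribution (logistic), and sample size are all our own illustrative choices, not taken from the chapter.

```python
import numpy as np

# A minimal simulation of the latent-variable formulation (2)-(3),
# assuming a linear index G(X) = beta0 + beta1*X and logistic errors.
# All numeric choices here are illustrative.
rng = np.random.default_rng(0)

n = 100_000
beta0, beta1 = -0.5, 1.0
x = rng.normal(size=n)

eps = rng.logistic(size=n)            # error with CDF F
y_star = beta0 + beta1 * x + eps      # latent variable Y*
y = (y_star > 0).astype(int)          # observed binary outcome

# Implied probability: P(Y=1|X) = 1 - F(-G(X)), which equals F(G(X))
# here because the logistic density is symmetric around zero.
p_true = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Empirical frequency near x = 0 should match the implied probability.
mask = np.abs(x) < 0.05
```

By construction, the simulated frequencies of Y = 1 track the implied conditional probabilities, which always lie in [0,1].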
2.1.1 Parametric approach
Two primary parametric binary response models assume the index function to be linear, that is,
G0(X, β) = Xβ. If F is the distribution function of a standard normal variate, that is,

F(u) = ∫_{−∞}^{u} (1/√(2π)) e^{−t²/2} dt,  (4)
then we have the probit model. Alternatively, if F is the logistic distribution function, that is,

F(u) = e^u/(1 + e^u),  (5)
we have the logit model.
These are two popular parametric binary response models in econometrics. By symmetry of their
density functions around zero, conditional probability of Y = 1 reduces to the simple form F(Xβ).
Note that the index function does not have to be linear; it could be any nonlinear function of β. In
addition, the link function F(·) need not be (4) or (5); it could be any other distribution function. One
of the possibilities is the extreme value distribution:

F(u) = exp(−e^{−u}).  (6)
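For concreteness, the three links (4)-(6) can be coded directly; the function names below are our own.

```python
import numpy as np
from scipy.stats import norm

# The three link functions (4)-(6); function names are our own.
def probit_link(u):
    return norm.cdf(u)                    # (4): standard normal CDF

def logit_link(u):
    return np.exp(u) / (1.0 + np.exp(u))  # (5): logistic CDF

def extreme_value_link(u):
    return np.exp(-np.exp(-u))            # (6): type-I extreme value CDF

# Each maps the whole real line into [0, 1], as a link must.
u = np.linspace(-5.0, 5.0, 101)
for F in (probit_link, logit_link, extreme_value_link):
    p = F(u)
    assert np.all((p >= 0.0) & (p <= 1.0))
    assert np.all(np.diff(p) > 0)         # strictly increasing
```

Whatever the link, the fitted conditional probability is guaranteed to be a valid probability, which is precisely what the linear model (1) fails to deliver.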
Nevertheless, the key point in parametric models is that the functional forms for the link and index,
irrespective of how complex they are, should be specified up to a finite dimensional parameter vector.
Koenker and Yoon (2009) introduced two wider classes of parametric link functions for binary
response models: the Gosset link based on the Student t-distribution for ε, and the Pregibon link
based on the generalized Tukey λ family. The probit and logit links are nested within Gosset and
Pregibon classes, respectively. For example, when the degrees of freedom of the Student t-distribution
are large, it can be very close to the standard normal distribution. For the generalized Tukey λ link with
two parameters controlling tail behavior and skewness, the logit link is obtained by setting these two
parameters to zero. Based on these observations, Koenker and Yoon (2009) compared and contrasted
the Bayesian and asymptotic chi-squared tests for the suitability of probit or logit link within these
more general families. One primary objective of their paper was to correct the misperception that all
links are essentially indistinguishable. They argued that the misspecification of the link function may
lead to a severe estimation bias, even when the index is correctly specified. The binary response model
with Gosset or Pregibon as link offers a relatively simple compromise between the conventional probit
or logit specification and the semiparametric counterpart to be introduced in Section 2.1.3.
Train (2003) discussed various identification issues in parametric binary response models. For the
purpose of prediction, we care about the predicted probabilities instead of parameters, implying that
we have no preference over two models generating identical predicted probabilities, even though one
of them is not fully identified. For this reason, identification is often not an issue, and unidentified or
partially identified models may be valuable in forecasting.
Once the parametric model is specified and identification conditions are recognized, the remaining
job is to estimate β, given a sample. Amongst a number of methods, maximum likelihood (ML) yields
an asymptotically efficient estimator, provided the model is correctly specified. Suppose the index is
linear. The logarithm of the conditional likelihood function, given a sample {Yt, Xt} with t = 1, ..., T, is

l(β|{Yt, Xt}) ≡ Σ_{t=1}^{T} [Yt ln F(Xtβ) + (1 − Yt) ln(1 − F(Xtβ))],  (7)
and ML maximizes (7) over the parameter space. Amemiya (1985) derived consistency and asymptotic
normality of the maximum likelihood estimator for this model, and established the global concavity
of the likelihood function in the logit and probit cases. This means that the Newton-Raphson iterative
procedure will converge to the unique maximizer of (7), no matter what the starting values are. For de-
tails regarding the iterative procedure to calculate ML estimator in these models, see Amemiya (1985).
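To make the estimation step concrete, here is a minimal Newton-Raphson maximizer of (7) for the logit case, exploiting the global concavity noted above. Variable names, the simulated data, and the convergence tolerances are our own choices.

```python
import numpy as np

# A minimal Newton-Raphson maximizer of the logit log-likelihood (7).
# Global concavity means any starting value works; names are ours.
def logit_ml(y, X, tol=1e-10, max_iter=100):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # fitted probabilities
        grad = X.T @ (y - p)                          # score of (7)
        hess = -(X * (p * (1 - p))[:, None]).T @ X    # negative definite Hessian
        step = np.linalg.solve(hess, grad)
        beta = beta - step                            # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated check that the estimator recovers the true coefficients
rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.3, 0.8])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)
beta_hat = logit_ml(y, X)
```

In practice one would rely on a canned routine, but the sketch shows why the iterations are well behaved regardless of starting values.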
Statistical inference on the parameters, predicted probabilities, marginal effects, and interaction effects
can be conducted in a straightforward way, provided the sample is independently and identically dis-
tributed (i.i.d.) or stationary and ergodic (in addition to satisfying certain moment conditions). These,
however, may not always hold. Park and Phillips (2000) developed the limiting distribution theory of
ML estimator in parametric binary choice models with nonstationary integrated explanatory variables,
which was extended further to multinomial responses by Hu and Phillips (2004a,b).
In dynamic binary response models, the information set Ω may include unobserved variables.
Chauvet and Potter (2005) incorporated the lagged latent variable, together with exogenous regressors,
in Ω. A practical difficulty with these models is that the likelihood function involves an intractable
multiple integral over the latent variable. One way to circumvent this problem is to use a Bayesian
computational technique based on a Markov chain Monte Carlo algorithm. See the technical appendix
in Monokroussos (2011) for implementation details. Kauppi and Saikkonen (2008) examined the
predictive performance of various dynamic probit models in which the lagged indicator of economic
recession, or the conditional mean of the latent variable, is used to forecast recessions. Their dynamic
formulations are much easier to implement by applying standard numerical methods, and iterated
multi-period forecasts can be generated. For a general treatment of multiple forecasts over multiple
horizons in dynamic models, see Terasvirta et al. (2010), where four iterative procedures are outlined
and assessed in terms of their forecast accuracy. Hao and Ng (2011) evaluated the predictive ability
of four probit model specifications proposed by Kauppi and Saikkonen (2008) to forecast Canadian
recessions, and found that dynamic models with the actual recession indicator as an explanatory variable
were better in predicting the duration of recessions, whereas the addition of the lagged latent variable
helped in forecasting the peaks of business cycles.
In macroeconomic and financial time series, the probability law underlying the whole sequence of
0’s and 1’s is often not fixed, but characterized by long repetitive cycles with different periodicities.
Exogenous shocks and sudden policy changes can lead to a sudden or gradual change in regime. If
the model ignores this possibility, chances are high that the resulting forecasts will be off the mark.
Hamilton (1989, 1990) developed a flexible Markov switching model to analyse a time series subject
to changes in regime, where an underlying unobserved binary state variable st governed the behaviour
of observed time series Yt . The change of regime in Yt is simply due to the change of st from one
state to the other. It is called Markov regime-switching model because the probability law of st is
hypothesized to be a discrete time two-state Markov chain. The advantage of this model is that it
does not require prior knowledge of regime separation at each time. Instead, such information can be
inferred from observed data Yt . For this reason, one can take advantage of this model to get predicted
probability of a binary state even if it cannot be observed directly. For a comprehensive survey of
this model, see Hamilton (1993, 1994). Lahiri and Wang (1994) utilized this model for estimating
recession probabilities using the index of leading indicators (LEI), circumventing the use of ad hoc
filter rules such as three consecutive declines in LEI as the recession predictor.
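The filtered state probabilities at the heart of such applications come from a simple recursion. The sketch below implements a stylized two-state filter with Gaussian state-dependent densities; all parameter values are hypothetical (in practice they are estimated by maximum likelihood), and it illustrates only the filtering recursion, not Hamilton's full estimation and smoothing algorithm.

```python
import numpy as np

# A stylized two-state filter: given transition probabilities and
# state-dependent Gaussian densities for Y_t, recursively compute the
# filtered probability P(s_t = 1 | Y_1, ..., Y_t). Parameter values
# are hypothetical; names are ours.
def filter_state_probs(y, p00, p11, mu, sigma):
    P = np.array([[p00, 1 - p00], [1 - p11, p11]])    # row: from, col: to
    xi = np.array([1 - p11, 1 - p00])                 # ergodic distribution
    xi = xi / xi.sum()
    out = np.empty(len(y))
    for t, yt in enumerate(y):
        pred = P.T @ xi                               # one-step-ahead state probs
        dens = np.exp(-0.5 * ((yt - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        joint = pred * dens
        xi = joint / joint.sum()                      # Bayes update
        out[t] = xi[1]
    return out

# Regime 0: mean 1 ("expansion"); regime 1: mean -1 ("recession")
rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0]); sigma = np.array([0.5, 0.5])
states = np.array([0] * 50 + [1] * 50)
y = mu[states] + sigma[states] * rng.normal(size=100)
prob_recession = filter_state_probs(y, p00=0.95, p11=0.95, mu=mu, sigma=sigma)
```

The output is a time series of predicted probabilities for the unobserved binary state, which is exactly what the regime-switching approach delivers without requiring prior knowledge of the regime dates.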
Unlike benchmark probit and logit models, a number of parametric binary response models may
be derived from other target objects. The autoregressive conditional hazard (ACH) model in Hamil-
ton and Jorda (2002) serves as a good example. The original target to be predicted is the length of
time between events, such as the duration between two successive changes of the federal funds rate
in the United States. For this purpose, Engle (2000) and Engle and Russell (1997, 1998) developed
an autoregressive conditional duration (ACD) model where the conditional expectation of the present
duration was specified to be a linear function of past observed durations and their conditional expec-
tations. Hamilton and Jorda (2002) considered the hazard rate defined as the conditional probability
of a change in the federal funds rate, given the latest information Ω. The ACH model is implied by
the ACD model since the expected duration between two successive changes is the inverse of the haz-
ard rate. They also generalized this simple specification by adding a vector of exogenous variables
to represent new information relevant for predicting the probability of the next target change. The
discreteness of observed target rate changes along with potential dynamic structure are dealt with si-
multaneously in this framework. See Grammig and Kehrle (2008), Scotti (2011), and Kauppi (2012)
for further applications and extensions.
Instead of predicting a single binary event, it is often useful to forecast multiple binary responses
jointly. For instance, we may like to predict the direction-of-change in several financial markets at a
future date given current information. A special issue arises in this context as these multiple binary
dependent variables may be intercorrelated, even after controlling for all independent variables. One
way to model this contemporaneous correlation is based on copulas, which decomposes the joint
modeling approach into two separate steps. The power of a copula is that for multivariate distributions,
the univariate marginals and the dependence structure can be isolated, and all dependence information
is contained in the copula. While modeling the marginal, one can proceed as if the current binary
event is the only concern, which means that all previously discussed methodologies including dynamic
models can be directly applied. After this step, we may consider modeling the dependence structure
by using a copula.1 Patton (2006) and Scotti (2011) used this approach in forecasting. Anatolyev
(2009) suggested a more interpretable measure, called dependence ratios, for the purpose of directional
forecasts in a number of financial markets. Both marginal Bernoulli distributions and dependence
ratios are parameterized as functions of the direction of past changes. By exploiting the information
contained in this contemporaneous dependence structure, it is expected that this multivariate model
will produce higher quality out-of-sample directional forecasts than its univariate counterparts.
Cramer (1999) considered the predictive performance of the logit model in unbalanced samples
in which one event is more prevalent than the other. Denote the in-sample estimated probabilities of
Yt = 1 and Yt = 0 by Pt and 1−Pt , respectively. By the property of logit models, the sample average
of Pt always equals the in-sample proportion of Yt = 1, which is denoted by α. Cramer proved that
the average of Pt over the subsample of Yt = 1 cannot be less than the average of 1−Pt over the
subsample of Yt = 0, if α ≥ 0.5. Thus, in unbalanced samples, the average predicted probability of
Yt = 1 when Yt = 1 is greater than or equal to the average predicted probability of Yt = 0 when Yt = 0.
1 In the binary case, the copula is characterized by a few parameters and thus is simple to model; see Tajar et al. (2001).
As a result, Cramer pointed out that estimated probabilities are a poor measure of in-sample predictive
performance. Using estimated probabilities leads to the absurd conclusion that success is predicted
more accurately than failure even though the two outcomes are complementary.
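Cramer's point is easy to reproduce by simulation. Below we fit a logit by Newton-Raphson on an artificially unbalanced sample (all numeric choices are our own) and compare the average fitted probabilities over the two outcome subsamples.

```python
import numpy as np

# Simulation illustrating Cramer's (1999) inequality in an unbalanced
# sample: when alpha >= 0.5, the average fitted P(Y=1) over the Y=1
# subsample is at least the average fitted P(Y=0) over the Y=0 subsample.
rng = np.random.default_rng(3)
n = 20_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# intercept chosen so that Y = 1 is the prevalent outcome (alpha > 0.5)
p = 1.0 / (1.0 + np.exp(-(1.5 + 0.5 * X[:, 1])))
y = (rng.random(n) < p).astype(int)

beta = np.zeros(2)
for _ in range(50):                       # Newton-Raphson on the logit likelihood
    fit = 1.0 / (1.0 + np.exp(-X @ beta))
    W = fit * (1.0 - fit)
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - fit))

fit = 1.0 / (1.0 + np.exp(-X @ beta))
alpha = y.mean()                          # in-sample proportion of ones
avg_success = fit[y == 1].mean()          # avg fitted P(Y=1) when Y = 1
avg_failure = (1.0 - fit)[y == 0].mean()  # avg fitted P(Y=0) when Y = 0
```

The simulation also confirms the logit property used in the argument: the sample average of the fitted probabilities equals the in-sample proportion of ones.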
King and Zeng (2001) investigated the use of a logit model in situations where the event of interest
is rare. With the typical sample proportion of the event less than 5%, they showed that the logit model
performs well asymptotically provided it is correctly specified. However, in small samples, the logit
estimator is biased. In these cases, efficient competing estimators with smaller mean squared errors do
exist. This point has been noticed by statisticians but has not attracted much attention in the applied
literature, see Bull et al. (1997).
The estimated asymptotic covariance matrix of the logit estimators is the inverse of the estimated
information matrix, that is,
V(β̂) = [Σ_{t=1}^{T} P̂t(1 − P̂t) x′t xt]^{−1},  (8)

where β̂ is the logit ML estimator, and P̂t is the fitted conditional probability for observation t, which
is 1/(1 + e^{−xtβ̂}). King and Zeng (2001) pointed out that in logit models, P̂t for the subsample for
which the rare event occurred would usually be large and close to 0.5. This is because probabilities
reported in studies of rare events are generally very small compared to those in balanced samples.
Consequently, the contribution of this value to the information matrix would also be relatively large.
This argument implies that for rare event data, observations with Y = 1 have more information content
than those with Y = 0. In this situation, random samples that are often used in microeconometrics
no longer provide efficient estimates. Drawing more observations from Y = 1, relative to what can
be obtained in a random sampling scheme, could effectively yield variance reduction. This is called
choice-based, or more generally, endogenously stratified sampling in which a random sample of pre-
assigned size is drawn from each stratum based on the values of Y . This nonrandom design tends to
deliberately oversample from the subpopulation (that is, Y = 1) that leads to variance reduction. King
and Zeng (2001) suggested a sequential procedure to determine the sample size for Y = 0 based on
the estimation accuracy of each previously selected sample.
The statistical procedures valid for random samples need to be adjusted as well in order to accom-
modate this choice-based sampling scheme. Maddala and Lahiri (2009) included some preliminary
discussions on this issue. Manski and Lerman (1977) proposed two modifications of the usual max-
imum likelihood estimation. The first one involves computing a logistic estimate and correcting it
according to prior information about the fraction of ones in the population, say τ, and the observed
fraction of ones in the sample, say Ȳ. For the logit model, the estimator of the slope coefficient β1 is
consistent under both sampling designs. The estimator of the intercept βo in a choice-based sample should
be corrected as:

β̂o − ln[((1 − τ)/τ)(Ȳ/(1 − Ȳ))],  (9)

where β̂o is the ML estimate of βo. For a random sample, τ = Ȳ, and thus there is no need to adjust
β̂o. However, in a choice-based sample with more observations on 1's, we must have τ < Ȳ, and the
corrected estimate is less than β̂o accordingly. The prior correction is easy to implement and only
requires the knowledge of τ, which is often available from census data. However, in the case of a mis-
specified parametric model, prior correction may not work. Given the prevalence of misspecification
in economic applications, more robust correction procedures are called for. Another limitation of this
prior correction procedure is that it may not be applicable for other parametric specifications, such as
the probit model, for which the inconsistency of the ML estimator may take a more complex form
(unlike in the logit case).
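The prior correction (9) is a one-line adjustment. The sketch below implements it; the numeric values passed in are purely illustrative.

```python
import numpy as np

# The prior correction (9) for a logit intercept estimated on a
# choice-based sample: tau is the population fraction of ones and
# y_bar the sample fraction. Input values below are illustrative.
def corrected_intercept(beta0_hat, tau, y_bar):
    return beta0_hat - np.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))

# With a random sample (tau == y_bar) no correction is needed:
assert abs(corrected_intercept(0.7, 0.2, 0.2) - 0.7) < 1e-12
# Oversampling ones (y_bar > tau) pulls the intercept down:
assert corrected_intercept(0.7, 0.05, 0.5) < 0.7
```

The slope estimates are left untouched, consistent with their being estimable under both sampling designs.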
Manski and Lerman (1977)’s second approach – the weighted exogenous sampling maximum-
likelihood estimator – is robust even when the functional form of logit model is incorrect, see Xie
and Manski (1989). Instead of maximizing the logarithm of likelihood function of the usual form, it
maximizes the following weighted version:
lw(β|{Yt, Xt}) ≡ −Σ_{t=1}^{T} wt ln(1 + e^{(1−2Yt)Xtβ}).  (10)

The weight function is wt = w1Yt + wo(1 − Yt), where w1 = τ/Ȳ and wo = (1 − τ)/(1 − Ȳ). As noted
by Scott and Wild (1986) and Amemiya and Vuong (1987), in the case of correct specification, the
weighting approach is asymptotically less efficient than prior correction, but the difference is not very
large. However, if model misspecification is suspected, weighting is a robust alternative. Unlike
prior correction, the weighted estimator can be applied equally well to other parametric specifications.
The only knowledge required for its implementation is τ, the population probability of the rare event.
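A minimal implementation of the weighted objective (10) might look as follows; the simulated choice-based sample, the assumed population rate τ, and the use of a generic numerical optimizer are all our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the weighted exogenous sampling ML objective (10): each
# observation's contribution is weighted by w1 = tau/y_bar (if Y = 1)
# or w0 = (1 - tau)/(1 - y_bar) (if Y = 0). Names are ours.
def weighted_logit_negll(beta, y, X, tau):
    y_bar = y.mean()
    w = np.where(y == 1, tau / y_bar, (1 - tau) / (1 - y_bar))
    z = (1 - 2 * y) * (X @ beta)          # ln(1 + e^z) is the per-obs neg. ll
    return np.sum(w * np.logaddexp(0.0, z))

# Choice-based sample: ones are heavily oversampled relative to tau
rng = np.random.default_rng(4)
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.random(n) < 0.5).astype(int)     # roughly 50/50 sample by design
tau = 0.1                                 # assumed population event rate

res = minimize(weighted_logit_negll, x0=np.zeros(2), args=(y, X, tau))
```

In this toy design the outcome is independent of the covariate, so the weighted fit drives the slope toward zero while the fitted intercept implies an event probability close to the population rate τ rather than the oversampled in-sample rate.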
Manski and Lerman (1977) proved that the weighted estimator for any correctly specified model is
consistent given the true τ. However, this estimator may not be asymptotically efficient. The intuition
behind the lack of efficiency is that, unlike in a random sample, knowledge of τ imposes
additional restrictions on the unknown parameters β in a choice-based sample. Failure to exploit
this additional information makes the resulting estimator inefficient. Imbens (1992) and Imbens and
Lancaster (1996) examined how to efficiently estimate β in an endogenously stratified sample. Their
estimator based on the generalized-method-of-moment (GMM) reformulation does not require prior
knowledge of τ and the marginal distribution of regressors. Instead, τ can be treated as an additional
parameter that is estimated by GMM jointly with β. They have shown that this estimator achieves the
semiparametric efficiency bound given all available information. For an excellent survey on estimation
in endogenously stratified samples, see Cosslett (1993).
One interesting point in the context of choice-based sampling is that the logit model could some-
times be consistently estimated when the original data comes exclusively from one of the strata. This
problem has been investigated by Steinberg and Cardell (1992). In this paper, they have shown how
to pool an appropriate supplementary sample that can often be found in general purpose public use
surveys, such as the U.S. Census, with original data to estimate the parameters of interest. The sup-
plementary sample can be drawn from the marginal distribution of the covariates without having any
information on Y. This estimator is algebraically similar to the above weighted MLE, and hence can be
implemented in conventional statistical packages. Only the logit model is analyzed in this paper due to
the existence of an analytic solution. In principle, the analysis can be generalized to other parametric
binary response models.
In finite samples, however, all of the above statistical procedures are subject to bias even when the
model is correctly specified. King and Zeng (2001) pointed out that such bias may be amplified in the
case of rare events. They proposed two methods to correct for the finite sample bias in the estimation
of parameters and the probabilities. For the parameters, they derived an approximate expression of the
bias in the usual ML estimator, viz., (X′WX)^{−1}(X′Wξ), where ξt = 0.5Qtt[(1 + w1)P̂t − w1], Qtt is
the diagonal element of Q = X(X′WX)^{−1}X′, and W = diag{P̂t(1 − P̂t)wt}. This bias term is easy
to estimate since it is just the weighted least squares estimate of regressing ξ on X with W as the
weight. The bias-corrected estimator of β is β̃ = β̂ − (X′WX)^{−1}(X′Wξ), with the approximate variance
V(β̃) = (T/(T + k))²V(β̂), where k is the dimension of β. Observe that T/(T + k) < 1 for all sample
sizes. The bias-corrected estimator is not only unbiased but has smaller variance, and thus has a
smaller mean squared error than the usual ML estimator in finite samples. When it comes to the
predicted probabilities, a possible solution is to replace the unknown parameter β in 1/(1 + e^{−xtβ})
with the bias-corrected estimator β̃. The problem is that a nonlinear function of β̃ may not be unbiased.
King and Zeng (2001) developed the approximate Bayesian estimator based on the approximation of
the following estimator after averaging out the uncertainty due to estimation of β:
P(Y = 1|X = xo) = ∫ 1/(1 + e^{−xoβ∗}) P(β∗) dβ∗.  (11)
They stated that ignoring the estimation uncertainty of β would lead to underestimation of the true
probability in a rare-event situation. From a Bayesian viewpoint, P(β∗), which summarizes such uncertainty,
is interpreted as the posterior density of β, that is, N(β̃, V(β̃)). Computation of this approximate
Bayesian estimator and its associated standard deviation can be carried out in a straightforward way.
The pitfall of this estimator is that it is not unbiased in general, even though it often has small mean
squared error in finite samples. King and Zeng (2001) therefore proposed another competing estima-
tor, viz., “the approximate unbiased estimator”, which, as its name suggests, is unbiased.
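To make the algebra concrete, the following is a minimal Python sketch of the bias correction and of the approximate Bayesian predicted probability (11) for the unweighted logit case (so w_1 = 1). The function names are ours, and a production implementation would add convergence checks to the Newton iterations.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    """Plain ML logit fit via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: solve (X'WX) d = X'(y - p)
        beta = beta + np.linalg.solve(X.T * W @ X, X.T @ (y - p))
    return beta

def king_zeng_correct(X, y):
    """Approximate finite-sample bias correction (unweighted case, w1 = 1)."""
    beta_ml = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta_ml))
    W = p * (1.0 - p)                          # diagonal of W with w_t = 1
    XtWX_inv = np.linalg.inv(X.T * W @ X)
    Q = X @ XtWX_inv @ X.T
    xi = 0.5 * np.diag(Q) * (2.0 * p - 1.0)    # (1 + w1)p - w1 with w1 = 1
    bias = XtWX_inv @ (X.T @ (W * xi))         # WLS fit of xi on X
    return beta_ml - bias

def approx_bayes_prob(X, y, x0, draws=2000, seed=0):
    """Approximate Bayesian predicted probability (11) at x0:
    average the logistic over draws from N(beta_tilde, V(beta_tilde))."""
    T, k = X.shape
    beta_ml = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta_ml))
    W = p * (1.0 - p)
    V = (T / (T + k)) ** 2 * np.linalg.inv(X.T * W @ X)
    beta_bc = king_zeng_correct(X, y)
    bs = np.random.default_rng(seed).multivariate_normal(beta_bc, V, size=draws)
    return float(np.mean(1.0 / (1.0 + np.exp(-bs @ x0))))
```

Since the correction term is O(1/T), the bias-corrected coefficients will typically differ only slightly from the ML estimates unless the sample is small or events are rare.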
2.1.2 Nonparametric approach
As mentioned at the beginning of Section 2.1, the nonparametric approach is the most robust way
to model the conditional probability, in that both the link and the index can be rather flexible. Non-
parametric regression often deals with continuous responses with well behaved density functions, but
the theory does not explicitly rule out other possibilities like a binary dependent variable. All extant
nonparametric regression methods, after minor modifications, can be used to model binary dependent
variables as well.
The most well-known nonparametric regression estimator of conditional expectation is the so-
called local polynomial estimator. For the univariate case, the pth local polynomial estimator solves
the following weighted least squares problem, given a sample {Y_t, X_t} with t = 1, ..., T:

min_{b_0, b_1, ..., b_p} ∑_{t=1}^{T} (Y_t − b_0 − b_1(X_t − x) − ... − b_p(X_t − x)^p)² K((x − X_t)/h_T)   (12)
where hT is the selected bandwidth, possibly depending on the sample, and K(·) is the kernel function.
When p = 0, it reduces to the local constant or Nadaraya-Watson estimator; when p = 1, it is the local
linear estimator. In either case, the conditional probability P(Y = 1|X = x) can be estimated by b̂_0,
the solution to (12). However, this fitted probability may fall outside the feasible range [0,1] for some values of x, since no such constraint is built into the model. An immediate practical solution would be to cap the estimates at 0 and 1 whenever the fitted values fall beyond this range. The problem is that there is little theoretical support for doing so, and the capped fitted probability is then likely to sit at these boundary values for a large number of values of x, at which the estimated marginal effect must be zero as well. As with the probit or logit transformations in the parametric model, we can make use of the same device here. The only difference is that we fit the model
locally by kernel smoothing. Specifically, let g(x,βx) be such a transformation function with unknown
coefficient vector βx. The conditional probability is modeled as:
P(Y = 1|X = x) = g(x,βx). (13)
In contrast to a parametric model, the coefficient βx is allowed to vary with the evaluation point x. In
the present context, the local logit is a sensible choice, in which g(x, β_x) = 1/(1 + e^{−xβ_x}). Generally
speaking, any distribution function can be taken as g. Currently, there are three approaches to estimate
βx and thus P(Y = 1|X = x) in (13); see Gozalo and Linton (2000), Tibshirani and Hastie (1987), and
Carroll et al. (1998).
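To fix ideas, here is a minimal sketch of the local constant estimator and of a local logit fitted by kernel-weighted maximum likelihood, in the spirit of the local likelihood approach; the function names and the small ridge term (added for numerical stability) are ours, and bandwidth choice is left unaddressed.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Local constant (p = 0) fit: kernel-weighted average of the binary Y."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)          # Gaussian kernel K((x - X_t)/h)
    return float(np.sum(w * Y) / np.sum(w))

def local_logit(x, X, Y, h, iters=30, ridge=1e-6):
    """Local logit: kernel-weighted logistic fit around x; the fitted value
    at x automatically lies in (0, 1)."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)
    Z = np.column_stack([np.ones_like(X), X - x])  # local intercept and slope
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        # weighted Newton step with a tiny ridge for stability
        H = Z.T * (w * p * (1.0 - p)) @ Z + ridge * np.eye(2)
        b = b + np.linalg.solve(H, Z.T @ (w * (Y - p)))
    return float(1.0 / (1.0 + np.exp(-b[0])))      # fitted probability at x
```

Unlike the raw local polynomial fit, the local logit never needs to be capped at 0 or 1, at the cost of an iterative fit at every evaluation point.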
Another way to get the fitted probabilities within [0,1] nonparametrically is simply by noting that
p(y|x) = p(y, x)/p(x)   (14)
where p(y|x), p(y,x) and p(x) are the conditional, joint, and marginal densities, respectively. A non-
parametric conditional density estimator is obtained by replacing p(y,x) and p(x) in (14) by their
kernel estimates. When Y is a binary variable, p(1|x) = P(Y = 1|X = x). A technical difficulty is that
the ordinary kernel smoothing implicitly assumes that the underlying density function is continuous,
which is not true for a binary variable. Li and Racine (2006) provide a comprehensive treatment of
several ways to cope with this problem based on generalized kernels.
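A sketch of the density-ratio estimator (14) for binary Y, using a Gaussian kernel on X and an Aitchison-Aitken-type discrete kernel on Y in the spirit of Li and Racine (2006); bandwidth and smoothing-parameter selection, which that treatment covers in detail, are omitted here.

```python
import numpy as np

def cond_prob_generalized(x, X, Y, h, lam):
    """Estimate P(Y = 1 | X = x) as p(1, x)/p(x), with a Gaussian kernel on X
    and a discrete kernel l(Y_t, 1, lam) on the binary Y.
    lam in [0, 0.5]; lam = 0 recovers the Nadaraya-Watson estimate."""
    Kx = np.exp(-0.5 * ((x - X) / h) ** 2)     # continuous kernel on X
    Ly = np.where(Y == 1, 1.0 - lam, lam)      # discrete kernel on Y
    return float(np.sum(Kx * Ly) / np.sum(Kx))
```

Because the discrete kernel takes values in {lam, 1 − lam}, the fitted probability is automatically confined to [lam, 1 − lam], so no ad hoc capping is needed.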
A number of papers have compared nonparametric binary models with the familiar parametric
benchmarks. Frolich (2006) applied local logit regression to analyze the dependence of Portuguese
women’s labor supply on family size, especially on the number of children. For the parametric logit
estimator, the estimated employment effects of children never changed sign in the population. How-
ever, the nonparametric estimator was able to detect a larger heterogeneity of marginal effects in that
the estimated effects were negative for some women but positive for others. Bontemps et al. (2009)
compared nonparametric conditional density estimation with a conventional parametric probit model
in terms of their out-of-sample binary forecast performances by bootstrap resampling. They found that
the nonparametric method was significantly better behaved according to the “revealed performance”
test proposed by Racine and Parmeter (2009). Harding and Pagan (2011) considered a nonparametric
regression model using constructed binary time series. They argued that due to the complex scheme
of transformation, the true data generating process governing an observed binary sequence is often
not described well by a parametric specification, say, the static or dynamic probit model. Their dy-
namic nonparametric model was then applied to U.S. recession data using the lagged yield spread to
predict recessions. They compared the fitted probabilities from the probit model and those based on
the Nadaraya-Watson estimator, and concluded that the parametric probit specification could not char-
acterize the true relationship between recessions and yield spread over some range. The gap between
these two specifications was statistically significant and economically substantial.
2.1.3 Semiparametric approach
The semiparametric model consists of both parametric and nonparametric components. Compared
with the two extremes, a semiparametric model has its own strengths: it is more robust
than a parametric one because of the flexibility of its nonparametric part, while reducing the risk
of the "curse of dimensionality" and data "sparseness" associated with a fully nonparametric counterpart.
Various semiparametric models for binary responses have emerged in the last few decades. We will
briefly review some of the important developments in this area.
Recall that the link function is assumed to be known in the parametric model. Suppose this
assumption is relaxed while keeping the index unchanged. We have then the following single-index
model:
E(Y |X) = P(Y = 1|X) = F(G(X)). (15)
Generally speaking, the index G(X) does not have to be linear, as in the parametric model. We
only consider the case where G(X) = Xβ for the sake of simplicity. The only difference from the
parametric model is that the functional form for F(·) is unknown here and thus needs to be estimated.
By allowing for a flexible link function, greater robustness is achieved, provided the index has been
correctly specified. Horowitz (2009) discussed the identification issues for various sub-cases of (15).
Generally speaking, the simplest identified specification can be used without worrying about other
possibilities, provided that the alternative models are observationally equivalent from the standpoint
of forecasting.
For the single-index model, once a consistent estimator of β is available, F could be estimated
using a nonparametric regression with β replaced by its estimator. There are three suggested estimators for β. Horowitz (2009) categorized them according to whether a nonlinear optimization problem
has to be solved. Two estimators obtained as the solution of a nonlinear optimization problem are
the semiparametric weighted nonlinear least square estimator due to Ichimura (1993), and the semi-
parametric maximum likelihood estimator proposed by Klein and Spady (1993). A direct estimator
not involving optimization is the average derivative estimator; see Stoker (1986, 1991a,b), Hardle and
Stoker (1989), Powell et al. (1989), and Hristache et al. (2001).
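Ichimura's semiparametric least squares idea can be sketched in a few lines for a two-regressor index (function names are ours; the bandwidth and the grid are illustrative, and a real implementation would use trimming and data-driven bandwidth selection, or a general-purpose optimizer instead of a grid search):

```python
import numpy as np

def loo_nw(index, y, h):
    """Leave-one-out Nadaraya-Watson fit of y on a candidate index."""
    K = np.exp(-0.5 * ((index[:, None] - index[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)                          # leave one out
    return (K @ y) / np.maximum(K.sum(axis=1), 1e-12)

def ichimura_sls(X, y, h, grid):
    """Semiparametric least squares for the index X1 + b*X2, with the first
    coefficient normalized to one for identification; b found by grid search."""
    losses = [np.mean((y - loo_nw(X[:, 0] + b * X[:, 1], y, h)) ** 2)
              for b in grid]
    return float(grid[int(np.argmin(losses))])
```

Once the index coefficient is estimated, F is recovered by a nonparametric regression of Y on the fitted index, exactly as described in the text.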
Another semiparametric model suitable for binary responses is the nonparametric additive model
where the link is given, but the index contains nonparametric additive elements:
P(Y = 1|X = x) = F(μ + m_1(x_1) + ... + m_k(x_k)).   (16)
Here, X is a k-dimensional random vector and the function F(·) is known prior to estimation, al-
though the univariate function m j(·) for each j needs to be estimated. The model is semiparametric
in nature as it contains both the parametric component F(·), along with the additive structure, and
the nonparametric component m_j(·). Note that the nonparametric additive model and the single-index model do not nest each other: there is at least one single-index model that cannot be rewritten in nonparametric additive form, and vice versa. Like the single-index model,
the nonparametric additive model relaxes restrictions on model specification to some extent, thereby
reducing the risk of misspecification as compared with the parametric approach. Furthermore, it over-
comes the “curse of dimensionality” associated with a typical multivariate nonparametric regression
by assuming each additive component to be a univariate function. Often, a cumulative distribution
function with range between 0 and 1 is a sensible choice for F(·). To ensure consistency of estimation
methodology, F(·) has to be correctly specified. Horowitz and Mammen (2004) described estimation
of this additive model. The basic idea is to estimate each m j(·) by series approximation. A natu-
ral generalization is to allow for unknown F(·). This more general specification nests (15) and (16)
as two special cases. Horowitz and Mammen (2007) developed a penalized-least-squares estimator
for this model, which does not suffer from the “curse of dimensionality” and achieves the optimal
one-dimensional nonparametric rate of convergence.
2.1.4 Bayesian approach
In contrast to the frequentist approach, the Bayesian approach takes the probability of a binary event as
a random variable instead of a fixed value. Combining prior information with likelihood using Bayes’
rule, it obtains the posterior distribution of parameters of interest. By the property of a binary variable,
each 0/1-valued Yt must be distributed as Bernoulli with probability p. The likelihood function for a
random sample would take the following form:
[T!/(T_1! T_0!)] p^{T_1} (1 − p)^{T_0}   (17)
where T1 and T0 are the total number of observations with Yt = 1 and Yt = 0, respectively, and T =
T1 +T0. A conjugate prior for parameter p is Beta (α, β) where both α and β are nonnegative real
numbers. According to Bayes’ rule, the posterior is Beta (α+T1, β+T0) with mean:
E(p|Y) = λ p_o + (1 − λ)(T_1/T)   (18)
where po = α/(α+ β) is the prior mean, T1/T is the sample mean, and λ = (α+ β)/(α+ β+ T )
is the weight assigned to the prior mean. If α = β = 1 in the above Beta-Binomial model, that is,
when a noninformative prior is used, the posterior distribution is then dominated by the likelihood,
and (18) approaches the sample mean provided T is sufficiently large. In other words, the Bayesian approach nests
the frequentist approach as a special case. However, this flexibility comes at the cost of robustness,
as the posterior relies on the prior, which, to some extent, is thought of as arbitrary and subject to
choice by the analyst. This deficiency can be alleviated by checking the sensitivity of the posterior to
multiple priors, or using empirical Bayes methods. For the former, if different priors produce similar
posteriors, the result obtained under a particular prior is robust. In the latter approach, the prior is
determined by other data sets such as those examined in previous studies. For instance, we can match
the prior mean and variance with sample counterparts to determine two parameters α and β in the
above Beta-Binomial model. This is a natural way to update the information from previous studies.
Once the posterior density is known, the predicted probability can be obtained under a suitable loss
function. For example, the posterior mean is the optimal choice under quadratic loss.
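The Beta-Binomial update can be written in a few lines (a sketch; variable names are ours):

```python
import numpy as np

def beta_binomial_posterior(alpha, beta, y):
    """Posterior Beta(alpha + T1, beta + T0) and posterior mean (18)
    for a Bernoulli sample y with a Beta(alpha, beta) prior."""
    T = len(y)
    T1 = int(np.sum(y))
    T0 = T - T1
    a_post, b_post = alpha + T1, beta + T0
    lam = (alpha + beta) / (alpha + beta + T)       # weight on the prior mean
    post_mean = lam * alpha / (alpha + beta) + (1.0 - lam) * T1 / T
    return a_post, b_post, post_mean
```

With the noninformative prior alpha = beta = 1 and a long sample, lam shrinks to zero and the posterior mean converges to the sample frequency T1/T, illustrating the nesting of the frequentist estimate discussed above.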
Up to this point, only the information contained in the prior distribution and past Y are utilized
for generating probability forecasts. Usually in practice, a set of covariates X is available for use. In
line with our general formulation at the beginning of this section, only the prior distribution and past
Y are incorporated into the information set Ω in the Beta-Binomial model. Let us now consider how
to incorporate X into Ω within the framework of (2). There are two approaches to do this. The first
one is conceptually simple in that only Bayes’ rule is involved. The prior density of parameters π(β)
multiplied by the conditional sampling density of Y given X generates the posterior in the following
way:
p(β|Y,X) = C π(β) ∏_{t=1}^{T} F(G_0(X_t, β))^{Y_t} [1 − F(G_0(X_t, β))]^{1−Y_t}   (19)

where C is a normalizing constant whose inverse equals

∫ π(β) ∏_{t=1}^{T} F(G_0(X_t, β))^{Y_t} [1 − F(G_0(X_t, β))]^{1−Y_t} dβ.   (20)
The Metropolis-Hastings algorithm can draw samples from this distribution directly. Alternatively,
we can use Monte Carlo integration to approximate the constant C. Albert and Chib (1993) developed
the second method using the idea of data augmentation. The parametric model F(G0(Xt ,β)) is seen
to have an underlying regression structure on the latent continuous data; see (2). Without loss of gen-
erality, we only consider the case where G0(Xt ,β) = Xtβ, and ε has the standard normal distribution,
that is, F(·) = Φ(·) where Φ(·) is the standard normal distribution function with φ(·) as its density.
If the latent data Y ∗t is known, then the posterior distribution of the parameters can be computed
using standard results for normal linear models; see Koop (2003) for more details. Values of the latent
variable are drawn from the following truncated normal distributions:
p(Y*_t | Y_t, X_t, β) ∝ φ(Y*_t − X_t β) I(Y*_t > 0) if Y_t = 1;   φ(Y*_t − X_t β) I(Y*_t ≤ 0) otherwise,   (21)
where ∝ means “is proportional to”. Draws from the posterior distribution are then used to sample
new latent data, and the process is iterated with Gibbs sampling, given all conditional densities. The
distribution of the predicted probability can be obtained as follows. Given an evaluation point x, the
conditional probability is Φ(xβ), which is random in the Bayesian framework. When a sufficiently
large sample is generated from the posterior p(β|Y,X), the distribution of Φ(xβ) can be approximated
arbitrarily well by evaluating Φ(xβ) at each sample point. As before, when only a point estimate is
desired, we can derive it given a specified loss function.
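A minimal sketch of the Albert-Chib sampler under a flat prior on β (function names are ours; an informative normal prior would simply modify the mean and variance of the conditional posterior of β):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(X, y, n_draws=2000, burn=500, seed=0):
    """Albert-Chib data augmentation for the probit model with a flat prior.
    Returns posterior draws of beta."""
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    C = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(k)
    draws = []
    for it in range(n_draws + burn):
        mu = X @ beta
        # draw latent Y* from normals truncated at 0, as in (21)
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        ystar = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # beta | Y* ~ N((X'X)^{-1} X'Y*, (X'X)^{-1}) under the flat prior
        beta = XtX_inv @ (X.T @ ystar) + C @ rng.standard_normal(k)
        if it >= burn:
            draws.append(beta.copy())
    return np.asarray(draws)
```

The distribution of the predicted probability at an evaluation point x is then approximated by evaluating Φ(xβ) at each retained draw of β.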
Albert and Chib (1993) also pointed out a number of advantages of the Bayesian estimation over
a frequentist approach. First, frequentist ML relies on asymptotic theory and its estimator may not
perform satisfactorily in finite samples. Indeed, Griffiths et al. (1987) found that the ML estimator can
have significant bias in small samples, whereas the Bayesian estimator permits exact inference
even in these cases. Second, the Bayesian approach based on the latent variable formulation is computationally attractive. Third, Gibbs sampling draws samples mainly from several standard
distributions, and therefore is simple to implement. Finally, the model extends easily to sampling densities for the latent variables other than the present multivariate normal density. As a cautionary note, diagnostic methods should be used to ensure that the generated
Markov chain has reached its equilibrium distribution. For applications of this general approach in
other binary response models, see Koenker and Yoon (2009), Lieli and Springborn (2012), and Scotti
(2011).
2.1.5 An empirical example
In this part, we will present an empirical example that illustrates the application of the methodologies
covered so far. The task is to generate the probabilities of future U.S. economic recessions. The
monthly data we use consists of 624 observations on the difference between 10-year and 3-month
Treasury rates, and NBER dated recession indicators from January 1960 to December 2011.²
binary target event is the recession indicator that is one, if the recession occurred, and zero otherwise.
The sample proportion of months that were in recession is about 14.9%, indicating that it is a relatively
uncommon event. The independent variables are the yield spread, i.e., difference between 10-year and
3-month Treasury rates, and the lagged recession indicator. Estrella and Mishkin (1998) found that
the best fit occurred when the yield spread is lagged 12 months. We maintain this assumption here.
Figure 1 shows the frequency distribution of the yield spread in our sample periods. The three tallest
bars show that the value of the spread was between 0 and 1.5 percentage points in about 42.6% of
the cases. The distribution is heavily skewed toward the positive values. All our fitted models with
the yield spread as the explanatory variable reveal a very strong serial correlation in residuals. As a
result, the dynamic specification involving one month lagged indicator as an additional regressor is
used here. We implement parametric, semiparametric, and nonparametric approaches on this dataset,
and summarize the fitted curves in a single graph. For the Bayesian approach, we use the R code
provided by Albert (2009) to simulate the posterior distributions under different priors.
Figure 2 presents three fitted curves generated using a parametric probit model, a semiparametric
single-index model, and the nonparametric conditional density estimator of Section 2.1.2, given the
value of the lagged indicator. Both the probit and the single-index models contain the linear index.³

²Downloaded from http://www.financeecon.com/ycestimates1.html.
³The single-index model is estimated by the Klein-Spady approach with a carefully selected bandwidth; see Section 2.1.3.
Figure 1: Frequency distribution of the yield spread
In the top panel in Figure 2, which is conditional on being in recession in the last month, we find
the estimated conditional probabilities to be very close to each other, except for values of the yield
spread larger than 2.5%. Despite the divergence between them on the right end, both are downward-
sloping. In contrast, the relationship as estimated by the nonparametric model is not monotonic, in
that the probability surprisingly rises as the spread increases from −1% to 0. This finding
is hard to explain given the prototypical negative correlation between the two. We ascribe it to the
data "sparseness" exhibited in Figure 1, namely that the nonparametric estimates at these values are
not reliable. In the bottom panel, which is conditional on not being in recession in the last month,
there is no substantial difference among these three models, and all of them are decreasing over the
entire range. Again, the precision of the nonparametric estimates at both ends is relatively low for
the same reason as before. An interesting issue that arises as one compares both the panels is that the
estimated probabilities when the lagged recession occurs are uniformly larger than those when it does
not. Actually, the probabilities in the bottom panel are nearly zero in magnitude no matter how small
the spread is. This could be true if there is a strong serial correlation in recessions identified by NBER,
as shown in our probit model that has a highly significant coefficient estimate for the lagged indicator.
For this reason, the information contained in the current macroeconomic state, which is related to the
occurrence of future recessions, is far more important than that given by the spread. At first sight, this
example seems to be evidence against the predictive power of the yield spread. However, that is
not the case, because the one-month-lagged recession indicator is unavailable at the date of
forecasting. The autocorrelation among recession indicators shrinks toward zero as the forecast horizon
Figure 2: Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel)
increases. The yield spread stands out only in these longer horizon forecasts where few competing
predictors with good quality exist.
To apply the Bayesian approach, we need some prior information. Suppose the coefficient vector
β is assigned a multivariate normal prior with mean βo and covariance matrix Vo. For βo, we assume
the prior means of the intercept, the coefficient of the spread and the lagged indicator to be -1, -1
and 1, respectively. As for Vo, three cases are examined: the noninformative prior corresponding to
infinitely large V_o, and a variation of Zellner's g informative priors⁴ with large and small precisions.
Figure 3 summarizes the simulated posterior means for the conditional probabilities as well as the
probit curves from Figure 2. For comparison purpose, we also plot a curve replacing unknown β by
its prior mean βo. In both panels, the Bayesian fitted curves are sensitive to the prior involved. For
⁴See Albert (2009) for an explanation of the g informative prior.
Figure 3: Probability of a recession given its lagged value (1 for the top panel; 0 for the bottom panel)
noninformative and informative priors with small precision, these curves are almost identical to the
probit curves, reflecting the dominance of the sample information over priors. The reversed pattern
appears in the other two curves. When the prior precision is extremely large, the forecasters’ beliefs
about the true relationship between the spread and future recessions are so firm that they are unlikely
to be affected by the observed sample. That is the reason why the simulated curves under this sharp
prior almost overlap with the curves implied by βo alone. To summarize, the Bayesian approach is a
compromise between prior and sample information, and the degree of compromise crucially depends
on the relative informativeness.
2.1.6 Probability predictions in panel data models
Panel data consists of repeated observations for a given sample of cross-sectional units, such as in-
dividuals, households, companies, and countries. In empirical microeconomics, a typical panel has
a small number of observations along the time dimension but a very large number of cross-sectional
units. The opposite scenario is generally true in macroeconomics. In this section, we consider a micro
panel environment with small or moderate T and large N. Many estimation and inference methods
developed for micro panels can be adapted to binary probability prediction. For ease of exposition,
only balanced panels with an equal number of repeated observations for each unit will be discussed.
The basic linear static panel data model can be written in the following form:
Yit = Xitβ+ ci + εit , i = 1, ...,N, t = 1, ...,T (22)
where Yit and Xit are the dependent and k-dimensional independent variables, respectively, for unit i
and period t. One of the crucial features that distinguishes panel data models from cross-sectional and
univariate time series models is the presence of unobserved ci, the time-invariant individual effects. In
more general unobserved effects models, time effects λt are also included. εit is the idiosyncratic error
varying with i and t, and is often assumed to be i.i.d. and independent from other model components.
The benefits of using panel data mainly come from its larger flexibility in specification as it allows
the unobserved effect to be correlated with regressors. In a cross-sectional context without further
information (such as the availability of valid instruments), parameters such as β cannot be identified.
Even if ci is uncorrelated with regressors, the panel data estimator is generally more efficient relative
to those obtained in cross-sectional models. Baltagi (2012) covers many aspects of forecasting in
panel data models with continuous response variables.
When Yit is binary, the linear panel data model, like the linear probability model, is no longer
adequate. Again, we rewrite it in the latent variable form. The unobserved latent dependent variable
Y ∗it satisfies:
Y ∗it = Xitβ+ ci + εit , i = 1, ...,N, t = 1, ...,T. (23)
Instead of knowing Y ∗it , only its sign Yit = I(Y ∗it > 0) is observed. In order to get the conditional
probability of Yit = 1, certain distributional assumptions concerning εit and ci have to be made. For
example, when εit is i.i.d. with distribution function F(·) and ci has G(·) as its marginal distribution,
the conditional probability of Y_it = 1 given X_i = (X'_i1, X'_i2, ..., X'_iT)' and c_i is
P(Yit = 1|Xi,ci) = 1−F(−Xitβ− ci). (24)
The problem with this conditional probability is that ci is unobserved and P(Yit = 1|Xi,ci) cannot
be estimated directly except for large T . In a micro panel, the solution, without estimating ci, is to
compute P(Yit = 1|Xi), that is, integrating out ci from P(Yit = 1|Xi,ci). If the conditional density of ci
given Xi is denoted by g(·|·), then the conditional probability is:
P(Yit = 1|Xi) =∫(1−F(−Xitβ− c))g(c|Xi)dc, (25)
which is a function of Xi alone, and thus can be estimated by replacing β with its estimate, provided
that the functional forms of F(·) and g(·|·) are known.
In general, the function g(·|·) is unknown. The usual practice is to make some assumptions about
it. One such assumption is that ci is independent of Xi, so
g(c|X_i) = g(c) ≡ dG(c)/dc.   (26)
This leads to the random effects model. Given this specification, β and other parameters in g(·) and
F(·) can be efficiently jointly estimated by maximum likelihood. For some parametric specifications
of g(·) and F(·), such as normal distributions, identification often requires further restrictions on their
parameters; see Lechner et al. (2008). In general, the conditional likelihood function for each unit i is
computed as below by noting that the idiosyncratic error is i.i.d. across t:
L_i(Y_i|X_i) = ∫ ∏_{t=1}^{T} [1 − F(−X_it β − c)]^{Y_it} F(−X_it β − c)^{1−Y_it} g(c) dc.   (27)
If both G(·) and F(·) are zero-mean normal distributions with variances σ_c² and σ_ε², respectively, then σ_c² + σ_ε² = 1 is often needed to identify all parameters. In general, G(·) or F(·) may be any cumulative
distribution function. Multiplying conditional likelihood functions Li(Yi|Xi) for each i and taking
logarithms, we get the conditional log-likelihood function for the whole sample:
l(Y|X) = ∑_{i=1}^{N} ln L_i(Y_i|X_i).   (28)
The ML estimate is defined as the global maximizer of l(Y |X) over the parameter space, and the
estimated conditional probability is thus
P̂(Y = 1|x) = ∫ (1 − F̂(−xβ̂ − c)) ĝ(c) dc   (29)

where β̂ is the ML estimate of β, and ĝ(·) and F̂(·) are the density of c and the distribution of ε with their unknown parameters replaced by ML estimates. The predicted probability is evaluated at the point x.
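The integrals in (27) and (29) have no closed form, but for the normal case (a random-effects probit, with σ_ε normalized to one) they are cheaply approximated by Gauss-Hermite quadrature. The sketch below is illustrative (function names are ours); estimation would maximize the log-likelihood numerically over (β, σ_c).

```python
import numpy as np
from scipy.stats import norm

def re_probit_loglik(beta, sigma_c, Y, X, n_nodes=20):
    """Log-likelihood (28) of a random-effects probit, integrating c out of (27)
    by Gauss-Hermite quadrature. Y: (N, T) binary; X: (N, T, k)."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    nodes = np.sqrt(2.0) * sigma_c * z               # quadrature points for c
    wq = w / np.sqrt(np.pi)                          # normalized weights
    ll = 0.0
    for Yi, Xi in zip(Y, X):
        idx = Xi @ beta                              # (T,)
        # P(Y_it = 1 | c) = 1 - Phi(-x*beta - c) = Phi(x*beta + c) at each node
        p = norm.cdf(idx[:, None] + nodes[None, :])  # (T, n_nodes)
        like_t = np.where(Yi[:, None] == 1, p, 1.0 - p)
        ll += np.log(np.maximum(like_t.prod(axis=0) @ wq, 1e-300))
    return ll

def re_probit_prob(beta, sigma_c, x, n_nodes=20):
    """Predicted probability (29) at evaluation point x."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)
    return float(norm.cdf(x @ beta + np.sqrt(2.0) * sigma_c * z) @ (w / np.sqrt(np.pi)))
```

Note that integrating out the individual effect pulls the predicted probability toward one half relative to the probability evaluated at c = 0, since averaging Φ over a symmetric distribution of c flattens the curve.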
The above framework can be extended to a general case where the covariance matrix of errors
is not restricted to have the conventional component structure. Let Y*_i = (Y*_i1, Y*_i2, ..., Y*_iT)' and u_i = (u_i1, u_i2, ..., u_iT)' be the stacked vectors of Y* and u for unit i. The latent variable linear panel data
model can be rewritten in the following compact form:
Y ∗i = Xiβ+ui. (30)
We consider the case where Xi is independent of ui, with the latter having a T -dimensional multivariate
joint distribution Fu. Note that when uit = ci + εit for each t, (30) reduces to the random effects model
discussed above. Given data (Yi,Xi) for i = 1, ...,N, the likelihood function for unit i is
L_i(Y_i|X_i) = ∫_{D_i} dF_u   (31)

where

D_i = {u ∈ R^T : I(X_it β + u_t > 0) = Y_it for t = 1, ..., T}.   (32)
The log-likelihood for the whole sample is thus l(Y|X) = ∑_{i=1}^{N} ln L_i(Y_i|X_i). Denote the ML estimate by β̂. The predicted probability at point x is then

P̂(Y = 1|x) = P(xβ̂ + u_x > 0 | x) = P(u_x > −xβ̂ | x) = ∫_{u_x > −xβ̂} dF̂_o   (33)

where F̂_o is the estimated joint distribution function of (u_i, u_x). Here, u_x is the latent error term corresponding to the point x, and (33) is for unit i. In general, it is hard to specify a particular form for F_o without further knowledge of the serial dependence among the u_i. Additional conditions, such as
serial independence, are needed to make (33) tractable.
In practice, this general framework is hard to implement due to the presence of the multiple inte-
gral in the likelihood function. Numerous methods of overcoming this technical difficulty have been
developed in the last few decades. Most of them are based on a stochastic approximation of the multi-
ple integral by simulation; see Lee (1992), Gourieroux and Monfort (1993), and Train (2003) for more
details on these simulation-based estimators and their asymptotic properties.
We can generalize the above model further to deal with the case where ui depends on Xi in a
known form. Similar to the linear panel data model, Chamberlain (1984) relaxed the assumption that
the individual effect ci is independent of the regressors. Let the linear projection of ci on Xi be in the
following form:
c_i = X_i γ + η_i.   (34)

For simplicity, η_i is assumed to be independent of X_i. After plugging X_i γ + η_i into (23), we get the following equation free of c_i:

Y*_it = X_i γ_t + η_i + ε_it   (35)

where γ_t = γ + β ⊗ e_t, and e_t is a T-dimensional column vector with one in the t-th element and zero
for the others. The composite error ηi + εit is independent of Xi. If we know the distributions of ηi
and εit , the above likelihood-based framework can be applied here in the same manner. Note that
for making probability predictions, we are not interested in β in (23); the reduced-form parameter
γ_t in (35) is sufficient. To summarize, in parametric panel data models, as long as the conditional
distribution of error given Xi is correctly specified, the predicted probability at evaluation point x is
obtained by replacing unknown parameters by their maximum likelihood estimates. The parametric
approach is efficient but not robust. In the panel data context, it is hard to ensure that all stochastic
components of the model are correctly specified. If one of them is misspecified, the resulting estimator
is in general not consistent. More robust estimation approaches, that do not require full specification of
the random components, have been proposed, such as the well-known conditional logit model which
allows for an arbitrary relationship between the individual effect and the regressors, see Andersen
(1970), Chamberlain (1980, 1984), and Hsiao (1996). Unfortunately, these approaches cannot be used
to get probability forecasts. Given that the conditional probability P(Y = 1|x) depends on both β and
the distribution function that maps the index into a number between zero and one, consistency of the
parameter estimator is not enough. When parametric models fail, the semiparametric or nonparametric
approach may be an obvious choice; see Ai and Li (2008). However, most of the semiparametric and
nonparametric panel data models focus on how to estimate β, instead of the predicted probabilities.
In a dynamic binary panel data model, the latent variable in period t depends on the lagged ob-
served binary event as shown below:
Y ∗it = Yit−1α+Xitβ+ ci + εit . (36)
The dynamic model is useful in some cases as it accounts for the state dependence of the binary choice
explicitly. Consider consumers’ brand choice as an example. The unobserved indirect utility over a
brand is likely to be correlated with past purchasing behavior, as most consumers tend to buy the same
brand if it has been tried before and was satisfactory. Presence of the lagged endogenous variable Yit−1
on the right hand side of (36) complicates the estimation due to the correlation between ci and Yit−1.
In dynamic panel data models, the initial value Y_i0 is not observed by the econometrician. Therefore,
another issue is how to deal with this initial condition in order to obtain a valid likelihood function for
estimation and inference; see Heckman (1981), Wooldridge (2005), and Arellano and Carrasco (2003)
for alternative solutions. Lechner et al. (2008) provided an outstanding overview of several dynamic
binary panel data models.
The Bayesian approach in the panel data context shares much similarity with its counterpart in the
single equation case. Chib (2008) considered a general latent variable model in which both slope and
intercept exhibit heterogeneity. This random coefficient model is shown below:
Y ∗it = Xitβ+Witbi + εit (37)
where Wit is the subvector of Xit whose marginal effects on Y∗it, captured by bi, are unit specific, and
where εit follows a standard normal distribution. The probability of the binary response given this
formulation is P(Yit = 1|Xit, bi) = Φ(Xitβ + Witbi), where bi is assumed to be a multivariate normal
random vector, N(0, D). Again, data augmentation with the latent continuous response is suggested to
facilitate computation of the posterior distribution; see Chib (2008) for more details.
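The data augmentation idea can be sketched for the simplest case, a cross-section probit, in the classic scheme of Albert and Chib (1993); the random coefficient model (37) adds a sampling block for bi but follows the same logic. The Python sketch below (sample size, prior, and chain length are our own illustrative choices) alternates between drawing the latent Y∗ from truncated normals and drawing β from its normal full conditional:

```python
import numpy as np
from scipy import stats

# A stylized data-augmentation Gibbs sampler for a cross-section probit;
# all settings (n, beta_true, flat prior, chain length) are assumptions.
rng = np.random.default_rng(1)
n, k = 300, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.3, 1.2])
Y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)   # flat prior on beta, for simplicity
beta = np.zeros(k)
draws = []
for it in range(600):
    # Step 1: draw latent Y*_i | beta, Y_i from the appropriate truncated normal
    mu = X @ beta
    u = rng.uniform(size=n)
    lo = np.where(Y == 1, stats.norm.cdf(-mu), 0.0)
    hi = np.where(Y == 1, 1.0, stats.norm.cdf(-mu))
    y_star = mu + stats.norm.ppf(lo + u * (hi - lo))
    # Step 2: draw beta | Y* from its normal full conditional
    beta_hat = XtX_inv @ (X.T @ y_star)
    beta = rng.multivariate_normal(beta_hat, XtX_inv)
    if it >= 100:                   # discard burn-in draws
        draws.append(beta)

post_mean = np.mean(draws, axis=0)  # posterior mean of beta
```

The posterior mean should land near the data-generating coefficients; predicted probabilities then follow by averaging Φ(Xβ) over the retained draws.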
2.2 Non-model-based probability predictions
The methodologies covered so far rely crucially on alternative econometric binary response models. In
practice, researchers sometimes are confronted with binary probability predictions which may or may
not come from any econometric model. Instead, the predicted probabilities are issued by a number
of market experts following their professional judgements and experiences. These are non-model-
based probability predictions, or judgemental forecasts in psychological parlance; see, for instance,
Lawrence et al. (2006). The Survey of Professional Forecasters (SPF) conducted by the Federal
Reserve Bank of Philadelphia and its counterpart conducted by the European Central Bank (ECB) are
leading examples of non-model-based probability predictions in economics. Other forecasting
organizations like the Blue Chip Surveys, Bloomberg, and many central banks also report probability forecasts from time to time.
Given the high reputation and widespread use of the U.S. SPF data in academia and industry, this
section will give a brief introduction to this survey focusing on probability forecasts for real GDP
declines. See Croushore (1993) for a general introduction to SPF, and Lahiri and Wang (2012) for
these probability forecasts.
The Survey of Professional Forecasters is the oldest quarterly survey of macroeconomic forecasts
in the United States. It began in 1968 and was conducted by the American Statistical Association
and the National Bureau of Economic Research. The Federal Reserve Bank of Philadelphia took over
the survey in 1990. Currently, the dataset contains over thirty economic variables. In every quarter,
the questionnaire is distributed to selected individual forecasters and they are asked for their expecta-
tions about a number of economic and business indicators, such as real GDP, CPI, and employment
rate in the current and next few quarters. For real GDP, GDP Price Deflator, and Unemployment,
density forecasts are also collected, viz., the predicted probability of annual percent change in each
prescribed interval for current and the next four quarters. Furthermore, the survey asks forecasters for
their predicted probabilities of declines in real GDP in the quarter in which the survey is conducted
and each of the following four quarters. For any target year, there are five forecasts from an indi-
vidual forecaster, each corresponding to a different quarterly forecast horizon. By investigating the
time series of individual forecasts for a given target, we can study how their subjective judgements
evolve over time and their usefulness. SPF also reports aggregate data summarizing responses from all
forecasters, including their mean, median, and cross-sectional dispersion. Note that the dataset is not
balanced, and individual forecasters enter or exit the survey in any quarter for a number of
reasons. Also, some forecasters may not report their predictions for some variables or horizons. Given
the novelty and quality of this dataset, SPF is extensively used in macroeconomics. For our purpose,
probability forecasts of a binary economic event can also be easily constructed from the subjective
density forecasts. Galbraith and van Norden (2012) used the Bank of England’s forecast densities to
calculate the forecast probability that the annual rate of change of inflation and output growth exceed
given threshold values. For instance, if the target event is GDP decline in the current year, then the
constructed probability of this event is the sum of probabilities in each interval with negative values.
For quarterly GDP declines, however, this probability is readily available in the U.S. SPF, and can be
analyzed for their properties. Clements (2006) has found some internal inconsistency between these
probability and density forecasts, whereas Lahiri and Wang (2006) found that the probability forecasts
for real GDP declines have no significant skill beyond the second quarter.
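The construction from a density forecast is a simple sum over the histogram bins with negative growth. A toy numeric example (the bins and probabilities below are made up for illustration, not actual SPF data):

```python
# Probability of a GDP decline implied by a histogram-type density forecast:
# total mass assigned to intervals of negative annual percent change.
# The bins and probabilities are illustrative assumptions.
bins = [(-3, -2), (-2, -1), (-1, 0), (0, 1), (1, 2), (2, 3)]   # annual % change
probs = [0.02, 0.05, 0.13, 0.35, 0.30, 0.15]                   # sums to one

p_decline = sum(p for (lo, hi), p in zip(bins, probs) if hi <= 0)
print(round(p_decline, 2))   # 0.2
```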
A commonly cited SPF indicator is the anxious index. It is defined as the probability of a decline
in real GDP in the next quarter. For example, in the survey taken in the fourth quarter of 2011, the
anxious index is 16.6 percent, which means that forecasters on average believed that there was a 16.6
percent chance that real GDP will decline during the first quarter of 2012. Figure 4 illustrates the
path of anxious index over time, beginning in the fourth quarter of 1968, along with the shaded NBER
dated recessions. The fluctuations in the probabilities seem roughly coincident with the NBER defined
peaks and troughs of the U.S. business cycle since 1968.

[Figure 4: The Anxious Index from 1968:Q4 to 2011:Q4 (source: SPF website). The vertical axis measures probability (percent); the horizontal axis is the survey date, with NBER recessions shaded.]

Rudebusch and Williams (2009) compared
the economic downturn forecast accuracy of SPF and a simple binary probit model using yield spread
as regressor, finding that in terms of alternative measures of forecasting performance, the former wins
for the current quarter but the difference is not statistically significant. Its advantage over the latter
deteriorates as forecast horizon increases. Given the widespread recognition of the enduring role of
yield spread in predicting contractions during the past 20 years, this result that professional forecasters
do not seem to incorporate this readily available information on yield spread in forecasting real GDP
downturns appears to be a puzzle; see Lahiri et al. (2012a) for further analysis of the issue. A number
of papers have studied the properties of the SPF data. See for example, Braun and Yaniv (1992),
Clements (2008, 2011), Lahiri et al. (1988) and, Lahiri and Wang (2012), just to name a few.
Engelberg et al. (2011) called attention to the problem of changing panel composition in surveys
of forecasters and illustrated this problem using SPF data. They warned that the traditional aggregate
analysis of SPF time series conflates changes in the expectations of individual forecasters with changes
in the composition of the panel. Instead of aggregating individual forecasts by mean or median as
reported by the Federal Reserve Bank of Philadelphia, they suggested putting more emphasis on the
analysis of time series of predictions made by each individual forecaster. Aggregation, as a simplifying
device, should only be applied to subpanels with fixed composition.
3 Evaluation of Binary Event Predictions
Given a sequence of predicted values for a binary event that may come from an estimated model or
subjective judgements by individual forecasters like SPF, we can evaluate their accuracy empirically.
For example, it is desirable to verify whether the predictions accord well with the realized events. An important
issue here is how to compare the performance of two or more forecasting systems predicting the same
event, and whether a particular forecasting system is valuable from the perspective of end users. In
this section, we shall summarize many important and useful evaluation methodologies developed in
diverse fields in a coherent fashion. There are two types of binary predictions: probability prediction
discussed thoroughly in Section 2 and point prediction, which will be covered in the next section. The
evaluation of probability predictions is discussed first.
3.1 Evaluation of Probability Predictions
We can roughly classify the extant methodologies on binary forecast evaluation into two categories.
The first one measures forecast skill, which describes how the forecast is related to the actual, while the
second one measures forecast value, which emphasizes the usefulness of a forecast from the viewpoint
of an end user. Skill and value are two facets of a forecasting system; a skillful forecast may or may
not be valuable. We will first review the evaluation of forecast skill and then move to forecast value
where the optimal forecasts are defined in the context of a two-state, two-action decision problem.
3.1.1 Evaluation of forecast skill
The econometric literature contains many alternative measures of goodness of fit analogous to the
R2 in conventional regressions, which can be related to various re-scalings of functions of the likelihood ratio statistics for testing that all slope coefficients of the model are zero.5 These measures,
though useful in many situations, are not directly oriented towards measuring forecast skill, and are
often unsatisfactory in gauging the usefulness of the fitted model in either identifying a relatively
uncommon or rare event in the sample or forecasting out-of-sample. Most methods for skill evalua-
tion for binary probability predictions were developed in meteorology without emphasizing model fit.
Murphy and Winkler (1984) provide a historical review of probability predictions in meteorology from
both theoretical and practical perspectives. Given the prevalence of binary events in economics such
as economic recessions and stock market crashes, existing economic probability forecasts should be
evaluated carefully, whether they are generated by models or judgements.
Murphy and Winkler (1987) described a general framework of forecast skill evaluation with bi-
nary probability forecasts as a special case. The basis for their framework is the joint distribution
of forecasts and observations, which contains all of the relevant statistical information. Let Y be the
binary event to be predicted and P be the predicted probability of Y = 1 based on a forecasting system.
The joint distribution of (Y,P) is denoted by f (Y,P), a bivariate distribution when only one forecast-
ing system is involved. Murphy and Winkler (1987) suggested two alternative factorizations of the
joint distribution. Consider the calibration-refinement factorization first. f (Y,P) can be decomposed
into the product of two distributions: the marginal distribution of P and the conditional distribution of
Y given P, that is, f (Y,P) = f (P) f (Y |P). For perfect forecasts, f (1|P = 1) = 1 and f (1|P = 0) = 0,
i.e., the conditional probability of Y = 1 given the forecast is exactly equal to the predicted value. In
general, it is natural to require f (1|P) = P almost surely over P and this property is called calibration
in the statistics literature; see Dawid (1984). A well-calibrated probability forecast implies that the actual
frequency of the event given each forecast value should be close to the forecast itself, and the user will
not commit a large error by taking the face value of the probability forecast as the true value. Given
a sample {Yt, Pt} of actuals and forecasts, we can plot the observed sample fraction of Y = 1 against
P, the so-called attribute diagram, to check calibration graphically. The ideal situation is that all pairs
5 Estrella (1998) and Windmeijer (1995) contain critical analyses and comparisons of most of these goodness-of-fit measures.
of (Yt, Pt) concentrate around the diagonal line; this corresponds to the so-called Mincer–Zarnowitz
regression in a rational expectations framework, cf. Lovell (1986). Seillier-Moiseiwitsch and Dawid
(1993) proposed a test to determine if in finite samples the difference between the actual and the
probability forecasts is purely due to the sampling uncertainty. This test is based on the asymptotic
approximation using the martingale central limit theorem, and is consistent in spirit with the prequen-
tial principle of Dawid (1984), which states that any assessment of a series of probability forecasts
should not depend on the way the forecast is generated. The strength of the prequential principle is
that it allows for a unified test for calibration regardless of the probability law underlying a particular
forecasting system.
The Seillier-Moiseiwitsch and Dawid (1993) calibration test groups a sequence of probability forecasts
into a small number of cells, say J cells, with the midpoint Pj taken as the estimate of the probability in each
cell. Given a sample {Yt, Pt}, the number of events Yt = 1 in the jth cell is counted and denoted by
N j. The corresponding expected count under the predicted probability is PjTj where Tj is the number
of observations in the jth cell. The calibration test for cell j becomes straightforward by constructing
the test statistic Zj = (Nj − PjTj)/√wj, where wj = TjPj(1−Pj) is the weight for cell j. Under the
null hypothesis of calibration for cell j, Zj is asymptotically normally distributed with zero mean and
unit variance, and should not lie too far out in the tails of this distribution. The overall calibration
test for all cells is then conducted using the statistic Σ_{j=1}^J Zj², which asymptotically has a χ² distribution
with J degrees of freedom; there is strong evidence against overall calibration if the statistic exceeds the
critical value at a given significance level. As an example, Lahiri and Wang (2012) find that for the current
quarter aggregate SPF forecasts of GDP declines introduced in Section 2.2, the calculated χ2 value is
8.01, which is significant at the 5% level. Thus, even at this short horizon, recorded forecasts are not
calibrated.
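A minimal sketch of this cell-based test on simulated data that is calibrated by construction (the sample size, number of cells, and midpoint approximation below are our own illustrative choices):

```python
import numpy as np
from scipy import stats

# Cell-based calibration test on simulated data; forecasts are calibrated by
# construction, so the chi-square statistic should be unremarkable.
rng = np.random.default_rng(2)
T = 400
P = rng.uniform(0, 1, T)        # probability forecasts
Y = rng.binomial(1, P)          # outcomes generated so that P is calibrated

J = 5                           # number of cells (a tuning choice)
edges = np.linspace(0, 1, J + 1)
mid = (edges[:-1] + edges[1:]) / 2           # midpoint P_j of each cell
cell = np.clip(np.digitize(P, edges) - 1, 0, J - 1)

Z2 = 0.0
for j in range(J):
    Tj = np.sum(cell == j)                   # observations in cell j
    Nj = np.sum(Y[cell == j])                # events Y_t = 1 in cell j
    wj = Tj * mid[j] * (1 - mid[j])          # weight w_j = T_j P_j (1 - P_j)
    Zj = (Nj - mid[j] * Tj) / np.sqrt(wj)    # cell-level statistic
    Z2 += Zj ** 2                            # overall chi-square statistic

p_value = 1 - stats.chi2.cdf(Z2, df=J)
```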
Calibration measures the predictive performance of probability forecasts with observed binary
outcomes. However, it is not the only criterion of primary concern in practice. Consider the naive
forecast which always predicts the marginal probability P(Y = 1). Since f (1|P) = P(Y = 1|P(Y =
1)) = P(Y = 1), it is necessarily calibrated. Generally speaking, any conditional probability forecast
P(Y = 1|Ω) for some information set Ω has to be calibrated since
P(Y = 1|P(Y = 1|Ω)) = E(E(Y |Ω)|P(Y = 1|Ω)) = P(Y = 1|Ω), (38)
by applying the law of iterated expectations. The naive forecast P(Y = 1) is a special case of this
conditional probability forecast with Ω containing only the constant term. However, forecasting with
the long run probability P(Y = 1) is typically not a good option as it does not distinguish those observations when Y = 1 from those when Y = 0. This latter property is better characterized by the marginal
distribution f (P) that is a measure of the refinement for probability forecasts and indicates how often
different forecast values are used. For the naive forecast, f (P) is a degenerate distribution with all
probability mass at P = P(Y = 1), and the forecast is said to be not refined, or not sharp. A perfectly
refined forecasting system predicts only the values 0 and 1. According to these definitions, the
aforementioned perfect forecast is not only perfectly calibrated but also refined. In contrast, the naive
forecast is perfectly calibrated but not refined at all. Any forecasting system that predicts 1 when Y = 0
and 0 when Y = 1 is still perfectly refined but not calibrated at all. Given that perfect forecasts do not
exist in reality, Gneiting et al. (2007) developed a paradigm of maximizing the sharpness subject to
calibration, see also Murphy and Winkler (1987).
The second way of factorizing f (Y,P) is to write it as the product of f (P|Y ) and f (Y ), called
the likelihood-base rate factorization, which corresponds to Edwin Mills’ Implicit Expectations hy-
pothesis; see Lovell (1986). Given a binary event Y , we have two conditional distributions, namely,
f (P|Y = 1) and f (P|Y = 0). The former is the conditional distribution of predicted probabilities in
the case of Y = 1 while the latter is the distribution for Y = 0. We would hope that f (P|Y = 1) puts
more density on higher values of P, and the opposite for f (P|Y = 0). These two distributions are the
conditional likelihoods associated with the forecast P. For perfect forecasts, f (P|Y = 1) and f (P|Y = 0)
degenerate at P = 1 and P = 0, respectively. Conversely, if f (P|Y = 0) = f (P|Y = 1) for all P, the
forecasts are said not to be discriminatory at all between the two events and provide no useful infor-
mation about the occurrence of the event. The forecast is perfectly discriminatory if f (P|Y = 1) and
f (P|Y = 0) are two distinct degenerate densities, in which case, after observing the value of P, we are
sure which event will occur. Based on this idea, Cramer (1999) suggested the use of the difference in
the means of these two conditional densities as a measure of goodness of fit. Since each mean is taken
over respective sub-samples, this measure is not unduly influenced by the success rate in the more
prevalent outcome group.
Figure 5 shows these two empirical likelihoods for the current quarter forecasts based on SPF
data; cf. Lahiri and Wang (2012). This diagram shows that the current quarter probability forecasts
discriminate between the two events fairly well, and f (P|Y = 0) puts more weight on the lower proba-
bility values than f (P|Y = 1) does. However, not enough weight is associated with higher probability
values when GDP does decline, and so the SPF forecasters appear to be somewhat conservative in this
sense.

[Figure 5: Likelihoods for Quarter 0 (source: Lahiri and Wang (2012)). The empirical likelihoods f(P|Y=0) and f(P|Y=1) are plotted against the forecast probability P.]
In the likelihood-base rate factorization, f (Y ) is the unconditional probability of each event. In
weather forecasting, this is called the base rate or sample climatology and represents the long run
frequency of the target event. Since it is only a description of the forecasting situation, it is fully
independent of the forecasting system. Murphy and Winkler (1987) took f (Y ) as the probability
forecast in the absence of any forecasting system and f (P|Y ) as the new information beyond the base
rate contributed by a forecasting system P. They emphasized the central role of joint distribution
of forecasts and observations in any forecast evaluation, and discussed the close link between their
general framework and some popular evaluation procedures widely used in practice. For example,
Brier (1950)’s score can be calculated as the sample mean squared error of forecasts and actuals or
1/T ∑Tt=1(Yt −Pt)
2 which has a range between zero and one. Perfect forecasts have zero Brier score,
and a smaller value of Brier score indicates better predictive performance. The population mean
squared error is E(Yt − Pt)² = Var(Yt − Pt) + [E(Yt) − E(Pt)]², where the first term is the variance of the
forecast errors and the second is the square of the forecast bias. Murphy and Winkler (1987) expressed
this score in terms of population moments as follows:

E(Yt − Pt)² = Var(Pt) + Var(Yt) − 2Cov(Yt, Pt) + [E(Yt) − E(Pt)]².  (39)
This decomposition reaffirms the previous statement that all evaluation procedures are based on the
joint distribution of Y and P. It shows that the performance, as measured by the mean squared error,
is not only affected by the covariance Cov(Yt ,Pt) (larger value means better performance), but also by
the marginal moments of forecasts and actuals. Suppose Y is a relatively rare event with E(Yt) close
to zero. The optimal forecast minimizing (39) is then close to the constant E(Yt), which is the naive forecast
having no skill at all. In practice, the skill score defined below, which measures the relative skill over
the naive forecast, is often used in this context:
skill score ≡ 1 − [Σ_{t=1}^T (Yt − Pt)²] / [Σ_{t=1}^T (Yt − E(Yt))²].  (40)
The reference naive forecast has no skill in that its skill score is zero, whereas a skillful forecast is
rewarded by a positive skill score. The larger the skill score, the more skillful the forecast. For the
current quarter forecasts from SPF, Lahiri and Wang (2012) calculated Brier score and skill score as
0.0668 and 0.45, respectively, which seem impressive.
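The sample analogues of the Brier score and the skill score in (40) are straightforward to compute; a toy example (the data below are illustrative, not the SPF numbers quoted in the text):

```python
import numpy as np

# Sample Brier score and skill score (40) on toy data (illustrative only).
Y = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], dtype=float)
P = np.array([0.1, 0.2, 0.7, 0.1, 0.6, 0.3, 0.2, 0.8, 0.1, 0.2])

brier = np.mean((Y - P) ** 2)            # Brier score
naive = np.mean((Y - Y.mean()) ** 2)     # score of the naive constant forecast E(Y)
skill = 1 - brier / naive                # skill score (40)
```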
Murphy (1973) decomposed the Brier score in terms of two factorizations of f (Y,P). In light of
the calibration-refinement factorization, it can be rewritten as:
E(Yt − Pt)² = Var(Yt) + EP[Pt − E(Yt|Pt)]² − EP[E(Yt|Pt) − E(Yt)]²  (41)
where EP(·) is the expectation operator with respect to the marginal distribution of P. This decomposi-
tion summarizes the features in two marginal distributions and f (Y |P). The second term is a measure
of calibration as it is a weighted average of the discrepancy between the face value of the probability
forecast and the actual probability of the realization given the forecast. The third term is a measure of
the difference between conditional and unconditional probabilities of Y = 1. This attribute is called
resolution by Murphy and Daan (1985). In terms of the likelihood-base rate factorization, the Brier
score can be alternatively decomposed as
E(Yt − Pt)² = Var(Pt) + EY[Yt − E(Pt|Yt)]² − EY[E(Pt|Yt) − E(Pt)]²  (42)
where EY (·) is the expectation operator with respect to the marginal distribution of Y . Instead of using
information in f (Y |P), (42) exploits information in the likelihood f (P|Y ) in addition to two marginal
distributions. The second term is a weighted average of the squared difference between the observation
and the mean forecast given observation and is supposed to be small for a good forecast. The third
term is a weighted average of the squared difference between the mean forecast given the observation
and the overall mean forecast, and measures the discriminatory power of forecasts against two events.
These two decompositions summarize different aspects of f (Y,P), and their sample analogues can be
computed straightforwardly.
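When the forecasts take a small number of distinct values, the sample version of decomposition (41) can be checked directly, since the identity holds exactly with within-group sample means; a toy verification (the data are illustrative):

```python
import numpy as np

# Sample calibration-refinement decomposition of the Brier score, eq. (41):
# Brier = Var(Y) + calibration term - resolution term. Toy data only.
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], dtype=float)
P = np.array([0.2, 0.8, 0.2, 0.2, 0.8, 0.5, 0.5, 0.8, 0.2, 0.5])

brier = np.mean((Y - P) ** 2)
ybar = Y.mean()
calibration = resolution = 0.0
for p in np.unique(P):
    m = P == p
    ybar_p = Y[m].mean()                        # sample E(Y | P = p)
    calibration += m.mean() * (p - ybar_p) ** 2
    resolution += m.mean() * (ybar_p - ybar) ** 2

reconstructed = ybar * (1 - ybar) + calibration - resolution
```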
Yates (1982) suggested an alternative decomposition of the Brier score which isolated individual
components capturing distinct features of f (Y,P) in the same spirit as Murphy and Winkler's general
framework. Yates’ decomposition, popular in psychology, is derived from the usual interpretation of
the mean squared error (39) in terms of variance and squared bias. Note that Var(Yt) = E(Yt)[1−
E(Yt)] and Cov(Yt ,Pt) = [E(Pt |Yt = 1)− E(Pt |Yt = 0)]E(Yt)[1− E(Yt)]. We get Yates’ covariance
decomposition by plugging these into (39), using the definition VarP,min(Pt) ≡ [E(Pt|Yt = 1) − E(Pt|Yt = 0)]² E(Yt)[1 − E(Yt)], and obtain

E(Yt − Pt)² = E(Yt)[1 − E(Yt)] + ∆Var(Pt) + VarP,min(Pt) − 2Cov(Yt, Pt) + [E(Yt) − E(Pt)]²,  (43)
where ∆Var(Pt)≡Var(Pt)−VarP,min(Pt) by definition. The first term E(Yt)[1−E(Yt)] is the variance
of the binary event and thus is independent of forecasts. It is close to zero when either E(Yt) or
1−E(Yt) is very small. Given this property, a comparison across several forecasts with different targets
based on the overall Brier score may be misleading, because two target events tend to have different
marginal distributions, and the discrepancy of the scores is likely to solely reflect the differential of the
marginal distributions, thus saying nothing about the real skill. Yates regarded E(Yt)[1−E(Yt)] as the
Brier score of the naive forecast mentioned before, and showed that it is the minimal achievable value
for a constant probability forecast. It is the remaining part, that is, E(Yt−Pt)2−E(Yt)[1−E(Yt)], that
matters for evaluation purposes.
The term [E(Yt)− E(Pt)]2 measures the magnitude of the global forecast bias and is zero for
unbiased forecasts. In contrast to perfect calibration, which requires the conditional probability to
be equal to the face value almost surely, Yates called this calibration-in-the-large. It says that the
unconditional probability of Y = 1 should match the average predicted values. Cov(Yt ,Pt) describes
how responsive a forecast is to the occurrence of the target event, both in terms of the direction and the
magnitude. A skillful forecast ought to identify and explore this information in a sensitive and correct
manner. It is apparent that small Var(Pt) is desired, but this is not everything. A typical example is
the naive forecast with zero variance but no skill as well. VarP,min(Pt) is the minimum variance of Pt
given any value of the covariance Cov(Yt ,Pt), and ∆Var(Pt) is the excess variance which should be
minimized. The minimal variance VarP,min(Pt) is achieved only when ∆Var(Pt) = 0 for which Pt = P1
on all occasions of Yt = 1, and Pt = P0 on other occasions and the variation of forecasts is due to the
event’s occurrence. In this sense, Yates called ∆Var(Pt) the excess variability of forecasts and it is not
zero when the forecast is responsive to information that is not related to the event's occurrence. Using
the current quarter SPF forecasts, Lahiri and Wang (2012) found that the excess variability was 53%
of the total forecast variance of 0.569. For longer horizons, excess variability increases rapidly and
indicates an interesting characteristic of these forecasts. Overall, Yates' decomposition stipulates
that a skillful forecast is expected to be unbiased and highly sensitive to relevant information, but
insensitive to irrelevant information. Yates (1982) emphasized the importance of resolution rather than
the conventional focus on calibration in probability forecast evaluation; see also Toth et al. (2003).
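The components of Yates' decomposition (43) can likewise be computed from sample moments; the following toy check (illustrative data) verifies that they reassemble the Brier score exactly:

```python
import numpy as np

# Sample analogue of Yates' covariance decomposition (43); with
# population-form sample moments the identity holds exactly. Toy data only.
Y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], dtype=float)
P = np.array([0.2, 0.7, 0.3, 0.1, 0.8, 0.5, 0.4, 0.9, 0.2, 0.3])

ybar, pbar = Y.mean(), P.mean()
var_y = ybar * (1 - ybar)                        # E(Y)[1 - E(Y)]
cov = np.mean((Y - ybar) * (P - pbar))           # Cov(Y, P)
slope = P[Y == 1].mean() - P[Y == 0].mean()      # E(P|Y=1) - E(P|Y=0)
var_p_min = slope ** 2 * var_y                   # Var_{P,min}(P)
dvar_p = np.mean((P - pbar) ** 2) - var_p_min    # excess variability
bias2 = (ybar - pbar) ** 2                       # calibration-in-the-large

brier = np.mean((Y - P) ** 2)
reconstructed = var_y + dvar_p + var_p_min - 2 * cov + bias2
```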
Although the Brier score is extensively used in probability forecast evaluation, it is not the only
choice. Alternative scores characterizing other features of the joint distribution exist. Two lead-
ing examples are the average absolute deviation, E(|Yt − Pt|), and the logarithmic score,
−E[Yt log(Pt) + (1 − Yt) log(1 − Pt)]. In general, any function with (Yt, Pt) as arguments can be taken
as a score. In the theoretical literature, a subclass called proper scoring rules is comprised of functions
satisfying
E[S(Yt, P∗t)] ≤ E[S(Yt, Pt)],  ∀Pt ∈ [0,1],  (44)
where S(·, ·) is the score function with the observation as the first argument and the forecast as the
second, and P∗t is the underlying true conditional probability. If P∗t is the unique minimizer of the
expected score, S(·, ·) is called a strictly proper scoring rule. It can be easily shown that the Brier
score and the logarithmic score are proper, while the absolute deviation is not. Gneiting and Raftery
(2007) pointed out the importance of using proper scores for evaluation purposes and provided an
example to demonstrate the problem associated with improper scores. Schervish (1989) developed an
intuitive way of constructing a proper scoring rule that has a natural economic interpretation in terms
of the loss associated with a decision problem based on forecasts. He also generated a proper scoring
rule that is equal to the integral of the expected loss function evaluated at the threshold value with
respect to a measure defined on the unit interval, and discussed the connection between calibration and a
proper scoring rule. Gneiting (2011) argued that a consistent scoring function or an elicitable target
functional (the mean in our context) ought to be specified ex ante if forecasts are to be issued and
evaluated. Thus, it does not make sense to evaluate probability forecasts using the absolute deviation,
which is not consistent for the mean.
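A quick numeric check of properness in the sense of (44): with a true conditional probability of, say, p∗ = 0.3 (an illustrative value), the expected Brier score is minimized by reporting p∗ itself, while the expected absolute deviation pushes the reported probability to a corner:

```python
import numpy as np

# Expected score of reporting P when the true probability is p* = 0.3:
# the Brier score is proper (minimized at P = p*), the absolute deviation
# is not (minimized at P = 0).
p_true = 0.3
grid = np.linspace(0, 1, 101)            # candidate reported probabilities

exp_brier = p_true * (1 - grid) ** 2 + (1 - p_true) * grid ** 2
exp_absdev = p_true * (1 - grid) + (1 - p_true) * grid

best_brier = grid[np.argmin(exp_brier)]      # honest report, equals p*
best_absdev = grid[np.argmin(exp_absdev)]    # corner report, equals 0
```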
Up to this point, all evaluations are carried out through a number of proper scoring rules. If we
have more than one competing forecasting model targeting the same event and a large sample track-
ing the forecasts, scores can be calculated and compared. For example, in terms of the Brier score
(1/T)Σ_{t=1}^T (Yt − Pt)², model A with a larger score is considered to be a worse performer than model B.
Lopez (2001), based on Diebold and Mariano (1995), proposed a new test constructed from the sam-
ple difference between two scores, allowing for asymmetric scores, non-Gaussian and nonzero mean
forecast errors, serial correlation among observations, and contemporaneous correlation between fore-
casts. Here we replace the objective function of Diebold and Mariano (1995) by a generic proper
scoring rule. Let S(Yt ,Pti) be the score value of the ith (i = 1 or 2) model for observation t. It is often
assumed to be a function of the forecast error defined by eti ≡ Yt −Pti, that is, S(Yt ,Pti) = f (eti). The
method works equally well for more general cases where the functional form of S(·, ·) is not restricted
in this way. In addition, let dt = f (et1)− f (et2) be the score differential between 1 and 2. The null
hypothesis of no skill differential is stated as E(dt) = 0.
Suppose the score differential series dt is covariance stationary and has short memory. The
standard central limit theorem for dependent data can be used to establish the asymptotic distribution
of the test statistic under E(dt) = 0 as

√T (d̄ − E(dt)) →d N(0, 2π fd(0)),  (45)

where

d̄ = (1/T) Σ_{t=1}^T dt  (46)

is the sample mean of score differentials,

fd(0) = (1/2π) Σ_{τ=−∞}^{∞} γd(τ)  (47)

is the spectral density of dt at frequency zero, and γd(τ) = E[(dt − E(dt))(dt−τ − E(dt))] is the autocovariance of dt at lag τ. The t statistic is thus

t = d̄ / √(2π f̂d(0)/T),  (48)

where f̂d(0) is a consistent estimator of fd(0). Estimation of fd(0) based on lag truncation methods is
quite standard in time series econometrics, see Diebold and Mariano (1995) for more details. The key
idea is that only very weak assumptions about the data generating process are imposed and neither
serial nor contemporaneous correlation is ruled out by these assumptions. Implementation of this
procedure is quite easy as it is simply the standard t test of a zero mean for a single population after
adjusting for serial correlation. Thus, while comparing the current quarter SPF forecasts with the naive
constant forecast given by the sample proportion, Lahiri and Wang (2012) found the Lopez t statistic
to be -2.564, suggesting the former to have significantly lower Brier score than the naive forecast at
the usual 5% level.
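A bare-bones sketch of this score-differential t test with a Brier score differential, using a Bartlett (Newey–West) lag truncation to estimate 2π fd(0) (the simulated data and the lag length are our own illustrative choices, not the SPF application):

```python
import numpy as np

# Score-differential t test (48) on simulated data; forecast 1 is constructed
# to be more accurate than forecast 2, so t_stat should be clearly negative.
rng = np.random.default_rng(3)
T = 300
p_true = rng.uniform(0.1, 0.9, T)
Y = rng.binomial(1, p_true).astype(float)
P1 = np.clip(p_true + rng.normal(0, 0.05, T), 0, 1)   # accurate forecast
P2 = np.clip(p_true + rng.normal(0, 0.30, T), 0, 1)   # noisy forecast

d = (Y - P1) ** 2 - (Y - P2) ** 2        # Brier score differential d_t
dbar = d.mean()

L = 5                                    # lag truncation, a tuning choice
lrv = np.var(d)                          # gamma_d(0)
for tau in range(1, L + 1):
    gamma = np.mean((d[tau:] - dbar) * (d[:-tau] - dbar))
    lrv += 2 * (1 - tau / (L + 1)) * gamma   # Bartlett-weighted autocovariances

t_stat = dbar / np.sqrt(lrv / T)         # negative values favor forecast 1
```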
West (1996) developed procedures for asymptotic inference about the moments of a smooth score
based on out-of-sample prediction errors. If predictions are generated by econometric models, these
procedures adjust for errors in the estimation of the model parameters. The conditions are also given,
under which ignoring this estimation error would not affect out-of-sample inference. This framework
is neither more general nor a special case of the Diebold-Mariano approach and thus should be viewed
as complementary. Note that the Diebold-Mariano test is not applicable when two competing forecasts
cannot be treated as coming from two nonnested models. However, if we think of the null hypothesis
as the two forecast series having equal finite sample forecast accuracy, then, the Diebold-Mariano test
statistic as a standard normal approximation gives a reasonably-sized test of the null in both nested and
non-nested cases, provided that the long run variances are estimated properly and the small-sample
adjustment of Harvey et al. (1997) is employed; see Clark and McCracken (2012).
Another useful tool for probability forecast evaluation, popular in medical imaging, meteorology
and psychology, that has not received much attention in economics is the Receiver Operating Charac-
teristic (ROC) analysis; see Berge and Jorda (2011) for a recent exception. Given the joint distribution
f (Y,P) and a threshold value which is a number between zero and one, we can calculate two condi-
tional probabilities: the hit rate and the false alarm rate. Let P∗ be a threshold; Ŷt = 1 is predicted if and
only if Pt ≥ P∗, that is, P∗ transforms a continuous probability forecast into a binary point forecast.
Table 1 presents the joint distribution of this forecast and the realization under a generic P∗. In this 2×2
contingency table, πij is the joint probability of (Ŷ = i, Y = j), while πi. and π.j are the marginal probabilities of Ŷ = i and Y = j, respectively. The hit rate (H) is the conditional probability of Ŷ = 1 given
Y = 1, that is, H ≡ π_{Ŷ=1|Y=1} = π11/π.1, and it tells the chance that Y = 1 is correctly predicted when it
does happen.
In contrast, the false alarm rate (F) is the conditional probability of Ŷ = 1 given Y = 0, that is, F ≡
π_{Ŷ=1|Y=0} = π10/π.0, and it measures the fraction of incorrect forecasts when Y = 1 does not occur.
Although these two probabilities appear to be constant for a given sample, they are actually functions
of P∗. If P∗ = 0 ≤ Pt for all t, then Ŷ = 1 would always be predicted. As a result, both the hit and false alarm rates equal one. Conversely, only Ŷ = 0 would be issued, and both probabilities are zero, when
P∗ = 1. For interior values of P∗, H and F fall within [0,1]. Their relationship due to the variation
of P∗ can be depicted by tracing out all possible pairs of (F(P∗),H(P∗)) for P∗ ∈ [0,1]. This graph
plotted with the false alarm rate on the horizontal axis and the hit rate on the vertical axis is called the
Receiver Operating Characteristic curve. Its typical shape for a skillful probability forecast is shown
in Figure 6.
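The mapping from a threshold P∗ to the pair (F, H), and hence an empirical ROC curve, can be traced out in a few lines. The sketch below uses made-up forecasts and outcomes, not data from this chapter:

```python
# Sketch: tracing an empirical ROC curve from probability forecasts.
# The forecast/outcome sample below is illustrative only.

def hit_false_alarm(probs, actuals, p_star):
    """Return (F, H): false alarm rate and hit rate at threshold p_star."""
    # Predict Yhat = 1 whenever the probability forecast reaches p_star.
    hits = sum(1 for p, y in zip(probs, actuals) if y == 1 and p >= p_star)
    misses = sum(1 for p, y in zip(probs, actuals) if y == 1 and p < p_star)
    fas = sum(1 for p, y in zip(probs, actuals) if y == 0 and p >= p_star)
    rejs = sum(1 for p, y in zip(probs, actuals) if y == 0 and p < p_star)
    return fas / (fas + rejs), hits / (hits + misses)

probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1, 1, 0, 1, 0, 1, 0, 0]

# Sweep thresholds to trace the curve (F on the x-axis, H on the y-axis).
roc = [hit_false_alarm(probs, actuals, t) for t in (0.0, 0.25, 0.5, 0.75, 1.01)]
# P* = 0 yields (F, H) = (1, 1); a P* above every forecast yields (0, 0).
```

Plotting the pairs in `roc` reproduces the shape of Figure 6 for a skillful forecast; the last threshold is set above one so that no event is ever predicted.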
Figure 6: A typical ROC curve
In categorical data analysis, H is often called the sensitivity and 1−F = πŶ=0|Y=0 the specificity. Both measure the fraction of correct forecasts and are expected to be high for skillful forecasts.
(F(P∗),H(P∗)), corresponding to a particular threshold P∗, is only one point on the ROC curve which
consists of all such points for possible values of P∗.
The ROC curve can be constructed in an alternative way based on the likelihood-base rate factor-
ization f (Y,P) = f (P|Y ) f (Y ). Given a threshold P∗, H is the integral of f (P|Y = 1),
H = ∫_{P∗}^{1} f (P|Y = 1)dP, (49)

and F is the integral of f (P|Y = 0) over the same domain,

F = ∫_{P∗}^{1} f (P|Y = 0)dP. (50)
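Equations (49) and (50) can be checked numerically. The sketch below takes Beta-shaped likelihoods as purely illustrative choices for f (P|Y = 1) and f (P|Y = 0) and computes the two upper-tail integrals by a midpoint Riemann sum:

```python
# Sketch of Eqs. (49)-(50): H and F as upper-tail integrals of the two
# likelihoods. The Beta(4,2) and Beta(2,4) shapes are illustrative only.

def beta_kernel(p, a, b):
    # Unnormalized Beta(a, b) density; normalization is handled below.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def tail_prob(a, b, p_star, n=10000):
    # P(P >= p_star) under Beta(a, b), via a midpoint Riemann sum.
    grid = [(i + 0.5) / n for i in range(n)]
    total = sum(beta_kernel(p, a, b) for p in grid)
    upper = sum(beta_kernel(p, a, b) for p in grid if p >= p_star)
    return upper / total

for p_star in (0.25, 0.5, 0.75):
    H = tail_prob(4, 2, p_star)  # f(P|Y=1): mass concentrated near one
    F = tail_prob(2, 4, p_star)  # f(P|Y=0): mass concentrated near zero
    print(p_star, round(H, 3), round(F, 3))
# Both H and F fall as P* rises, with H > F throughout: a skillful forecast.
```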
Table 1: Joint distribution of binary point forecast Ŷ and observation Y

                Y = 1   Y = 0   Row total
Ŷ = 1            π11     π10     π1.
Ŷ = 0            π01     π00     π0.
Column total     π.1     π.0     1
Figure 7 illustrates these two densities along with three values of P∗; the three thresholds shown correspond to (hit rate, false alarm rate) pairs of (97.5%, 84%), (84%, 50%), and (50%, 16%).
Figure 7: f (P|Y = 1)(right), f (P|Y = 0)(left) and three values of P∗
In this graph, H is the area under f (P|Y = 1) to the right of P∗, while F is the corresponding area under f (P|Y = 0). As the vertical threshold line shifts rightward, both areas shrink, and both H and F decline. In one extreme, where P∗ = 0, both areas equal one. In the other extreme, where P∗ = 1, they
equal zero. Figure 7 reveals the tradeoff between H and F : they move together in the same direction
as P∗ varies and the scenario (H = 1,F = 0) is generally unobtainable unless the forecast is perfect.
This relationship is also apparent from the upward-sloping ROC curve in Figure 6. Deriving the ROC curve from the likelihood-base rate factorization is in the same spirit as Murphy and Winkler's general
framework. To see this, consider the likelihoods of two systems (A and B) predicting the same event,
see Figure 8 below.
Let us assume that the likelihoods when Y = 1 are exactly the same for both A and B, while
the likelihoods when Y = 0 share the same shape but center at different locations. The likelihood
f (P|Y = 0) for A is symmetric around a value that is less than the corresponding value for B. In the
terminology of the likelihood-base rate factorization, A is said to have a higher discriminatory ability
than B because its f (P|Y = 0) is farther apart from f (P|Y = 1) and is thus more likely to distinguish
the two cases. Consequently, A has a higher forecast skill, which should be reflected by its ROC curve
as well. This result is supported by considering any threshold value represented by a vertical line in
this graph. As discussed before, the area of f (P|Y = 0) for A lying on the right of the threshold (A’s
false alarm rate) is always smaller than that for B, and this is true for any threshold. On the other
Figure 8: Likelihoods for forecasts A and B with a common threshold
hand, since f (P|Y = 1) is identical for both A and B, hit rates defined as the area of f (P|Y = 1) on the
right of the vertical line are the same for both. Therefore, A is more skillful than B, which is shown in
Figure 9 where the ROC curve of A always lies to the left of B for any fixed H.
Figure 9: ROC curves for A and B with different skills
The ROC curve is a convenient graphical tool to evaluate forecast skill and can be used to facilitate
comparison among competing forecasting systems. To see this, consider three special curves in the
unit box. The first one is the 45 degree diagonal line on which H = F . The probability forecast, which
has an ROC curve of this type, is a random forecast that is statistically independent of the observation. As a result, H and F are identical, and both equal the integral of the marginal density of the probability forecast over the domain [P∗,1]. One example is the naive forecast. Probability forecasts whose ROC curve is the diagonal line have no skill and are often taken as the benchmark to be compared with
other forecasts of interest. For a perfect forecast, the corresponding ROC curve is the left and upper
boundaries of the unit box. Most probability forecasts in real life situations fall in between, and their
ROC curves lie in the upper triangle, like the one shown in Figure 6. Since higher hit rate and lower
false alarm rate are always desired, the ROC curve lying farther from the diagonal line indicates higher
skill. A curve in the lower triangle appears to be even worse than the random forecast at first sight, but
it can potentially be relabeled to be useful.
Given a sample, there are two methods of plotting the ROC curve: parametric and nonparametric. In the parametric approach, some distributional assumptions about the likelihoods f (P|Y = 1) and f (P|Y = 0) are necessary. A typical example is the normal distribution. However, it is not a sensible choice on its own, given that the range of P is limited to the unit interval. Nevertheless, we can always transform P into a variable with unlimited range; for instance, the inverse of any normal distribution function (such as the probit transformation) suffices for this purpose. The parameters of this distribution are estimated from a sample, and the fitted ROC curve
can be plotted by varying the threshold in the same way as when deriving the population curve. This
approach, however, is subject to misspecification like any parametric method. In contrast, nonpara-
metric estimation does not need such stringent assumptions and can be carried out based on data alone.
Fawcett (2006) provides an illustrative example with computational details. Fortunately, most current
commercial statistical packages like Stata have built-in procedures for generating ROC graphs.
Sometimes, a single statistic summarizing information contained in an ROC curve is warranted.
There are two alternatives: one measures the local skill for a threshold of primary interest, while the
other measures global skill over all thresholds. For the former, there are two statistics most commonly
used. The first one is the smallest Euclidean distance between the point (0,1) and the ROC curve. This is motivated by observing that the ROC curve of a more skillful probability forecast is often closer to (0,1). The second statistic, the Youden index, is the maximal vertical gap between the diagonal and the ROC curve (that is, the hit rate minus the false alarm rate). The global measure is the area
under the ROC curve (AUC). For random forecasts, the AUC is one half while it is one for perfect
forecasts. The larger AUC thus implies higher forecast skill. Calculation of the AUC proceeds in two
ways depending on the approach used to estimate the ROC curve. For parametric estimation, the AUC
is the integral of a smooth curve over the domain [0,1]. For nonparametric estimation, the empirical
ROC curve is a step function and its integral is obtained by summing areas of a finite number of
trapezia. If the underlying ROC curve is smooth and concave, the AUC computed in this way is bound
to underestimate the true value in a finite sample. Note that these two measures may not concord with
each other in the sense that they may give conflicting judgements regarding forecast skill. Figure 10
illustrates a situation like this.
Figure 10: ROC curves for two forecasts: A and B
In Figure 10, dA and dB denote the smallest Euclidean distances from the point (0,1) to the ROC curves of A and B, respectively (with dA > dB), so A is slightly less skillful in terms of this local criterion. However, the AUC of A is larger than that of B. Conflict between
these two raises a question in practice as to which one should be used. Often, there is no universal
answer and it depends on the adopted loss function. Mason and Graham (2002), Mason (2003),
Cortes and Mohri (2005), Faraggi and Reiser (2002), Liu et al. (2005), among others, proposed and
compared estimation and inference methods concerning AUC in large data sets. These include, but
are not limited to, the traditional test based on the Mann-Whitney U-statistic, an asymptotic t-test, and
bootstrap-based tests. Using these procedures in large samples, we can answer questions like: “Does
a forecasting system have any skill?”, “Is its AUC significantly larger than 1/2?”, or “Is the AUC of forecast A significantly larger than that of B in the population?”
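Both summary statistics are easy to compute from a set of empirical ROC points. A minimal sketch, using illustrative (F, H) pairs rather than any data set from this chapter:

```python
# Sketch: the local (Youden index) and global (trapezoidal AUC) summaries
# of an ROC curve, from illustrative (F, H) points sorted by F.

roc_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

# Global skill: area under the ROC curve by the trapezoidal rule. For a
# smooth concave curve this underestimates the true AUC in finite samples.
auc = sum((f2 - f1) * (h1 + h2) / 2
          for (f1, h1), (f2, h2) in zip(roc_points, roc_points[1:]))

# Local skill: the Youden index, the maximal vertical gap H - F.
youden = max(h - f for f, h in roc_points)

print(auc, youden)
```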
ROC analysis was initially developed in the field of signal detection theory, where it was used to
evaluate the discriminatory ability for a binary detection system to distinguish between two clearly-
defined possibilities: signal plus noise and noise only. Thereafter, it has gained increasing popularity
in many other related fields. For a general treatment of ROC analysis, the readers are referred to Egan
(1975), Swets (1996), Zhou et al. (2002), Wickens (2001), and Krzanowski and Hand (2009), just to
name a few. For economic forecasts, Lahiri and Wang (2012) evaluated the SPF probability forecasts
of real GDP declines for the U.S. economy using the ROC curve. Figure 11, taken from this paper for
the current quarter forecasts, shows that at least for the current quarter, the SPF is skillful.
Figure 11: ROC curve with 95% confidence band for Quarter 0 (source: Lahiri and Wang (2012))
3.1.2 Evaluation of forecast value
For calculating the forecast value, one needs more information than what is contained in the measures
of association between forecasts and realizations. Let L(a,Y ) be the loss of a decision maker when
(s)he takes the action a and the event Y is realized in the future. Here, like in the banker’s problem,
only the scenario with two possible actions (e.g. making a loan or not) coupled with a binary event
(e.g. default or not) is considered. This setup is simple, yet it fits a large number of real-life decision-making scenarios in economics.
First, we need to show that a separate analysis of forecast value is necessary. The following
example suffices to this end. Suppose A and B are two forecasts targeting the same binary event Y. Tables 2 and 3 summarize the predictive performance of each.
Table 2: Contingency table cross-classifying forecasts of A and observations Y

                Y = 1   Y = 0   Row total
Ŷ = 1             20     100      120
Ŷ = 0             23     997     1020
Column total      43    1097     1140
Here A and B are 0/1 binary point forecasts. If forecast skill is measured by the Brier score, then A performs better than B, since its score of about 10.79% is less than B's 17.54%. Does the
same conclusion hold in terms of forecast value? To answer this question, we have to specify the loss
function L(a,Y ) first. Without loss of generality, suppose the decision rule is given by a = 1 if Ŷ = 1 is predicted and a = 0 otherwise. The loss is described in Table 4.
This loss function has some special features: it is zero when the event is correctly predicted; the
losses associated with incorrect forecasts are not symmetric in that the loss for a = 0 when the event
Y = 1 occurs is much larger than that when a = 1 and the event Y = 1 does not occur. Loss functions
of this type are typical when the target event Y = 1 is rare but people incur a substantial loss once it
takes place, such as a dam collapse or financial crisis. The overall loss of A is 10×100+5000×23 =
116000 which is much larger than that of B (10×197+5000×3 = 16970). This example shows that
the superiority of A in terms of skill does not imply its usefulness from the standpoint of a forecast
user. An evaluation of forecast value needs to be carried out separately.
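The whole example can be reproduced from the counts in Tables 2-4; in the sketch below, cells are indexed (forecast, outcome):

```python
# Sketch of the skill-versus-value example: A beats B on the Brier score
# but loses badly once the asymmetric losses of Table 4 are applied.

T = 1140
counts_A = {(1, 1): 20, (1, 0): 100, (0, 1): 23, (0, 0): 997}   # Table 2
counts_B = {(1, 1): 40, (1, 0): 197, (0, 1): 3, (0, 0): 900}    # Table 3
loss = {(1, 1): 0, (1, 0): 10, (0, 1): 5000, (0, 0): 0}         # Table 4

def brier(counts):
    # For 0/1 point forecasts the Brier score is the misclassification rate.
    return (counts[(1, 0)] + counts[(0, 1)]) / T

def total_loss(counts):
    return sum(counts[cell] * loss[cell] for cell in counts)

print(brier(counts_A), brier(counts_B))            # about 0.1079 vs 0.1754
print(total_loss(counts_A), total_loss(counts_B))  # 116000 vs 16970
```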
Thompson and Brier (1955) and Mylne (1999) examined forecast values in the simple cost/loss
decision context in which L(1,1) = L(1,0) = C > 0, L(0,1) = L > 0, and L(0,0) = 0. C is cost and
L is loss. This model simplifies the analysis by summarizing the loss function into two values: cost
and loss; and its result can be conveyed visually as a consequence. Loss functions of this type are
suitable in a context such as the decision to purchase insurance by a consumer, where two actions are
“buy insurance” or “do not buy insurance”, which lead to different losses depending on whether the
adverse event occurs in the future. If one buys the insurance (a = 1), (s)he is able to protect against
the effects of adverse event by paying a cost C, whereas occurrence of adverse event without benefit
of this protection results in a loss L. If the consumer knows the marginal probability that the adverse
event would occur at the moment of decision, the problem boils down to comparing expected losses
by two actions.

Table 3: Contingency table cross-classifying forecasts of B and observations Y

                Y = 1   Y = 0   Row total
Ŷ = 1             40     197      237
Ŷ = 0              3     900      903
Column total      43    1097     1140

Table 4: Loss function associated with the 2×2 decision problem

          Y = 1   Y = 0
a = 1         0      10
a = 0      5000       0

On the one hand, (s)he has to pay C irrespective of the event if (s)he decides to buy the insurance, and her/his expected loss would equal PL if (s)he does not do so, where P is the
marginal probability of Y = 1 perceived by the consumer. The optimal decision rule is thus a = 1 if
and only if P ≥C/L, and the lowest expected loss resulting from this rule is min(PL,C) denoted by
ELclim. Now, suppose the consumer has access to perfect forecasts. Then, the minimum expected loss
would be ELperf ≡ PC, which is no larger than ELclim, given that P ∈ [0,1] and C ≤ L. The difference ELclim − ELperf measures the gain of a perfect forecast relative to the naive forecast. The
more realistic situation is that the probability forecast under consideration improves upon the naive
forecast, but is not perfect. Wilks (2001) suggested the value score (VS) to measure the value of a
forecasting system where
VS = (ELclim − ELP)/(ELclim − ELperf), (51)
and ELP denotes the expected loss of the forecasting system P. The value score defined in this way
can be interpreted as the expected economic value of the forecasts of interest as a fraction of the value
of perfect forecasts relative to naive forecasts. Its value lies in (−∞,1] and it is positively oriented in
the sense that higher VS means larger forecast value. Naive forecasts and perfect forecasts have VS 0
and 1, respectively. Note that VS may be negative, indicating that it is better to use the naive forecast
of no skill in these cases. However, Murphy (1977) demonstrated that VS must be nonnegative for
any forecasting system with perfect calibration; thus any perfectly calibrated probability forecast is at
least as useful as the naive forecast. This illustrates the interplay between forecast skill and forecast
value.
Given a probability forecast Pt , VS can be calculated from f (Pt ,Yt), the joint distribution of fore-
casts and observations, and the loss function. To accomplish this, the joint distribution of (a,Y ) must
be derived first where the optimal action depends on consumer’s knowledge of f (Pt ,Yt). Given the
forecast Pt , the conditional probability of the event is f (Yt = 1|Pt) which corresponds to the second
element in the calibration-refinement factorization of f (Pt ,Yt), and the optimal decision rule takes the
form specified above: a = 1 if and only if P(Yt = 1|Pt)≥C/L. Therefore, the cost/loss ratio C/L is the
optimal threshold for translating a continuous probability P(Yt = 1|Pt) into a binary action. Given C/L,
the joint probability of (a = 1, Y = 1) is thus equal to π11 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L) f (Pt, Yt = 1)dPt, where I(·) is the indicator function which is one only when the condition in (·) is met. Likewise, we
can calculate the other three joint probabilities, listed as follows:

π10 ≡ ∫ I(P(Yt = 1|Pt) ≥ C/L) f (Pt, Yt = 0)dPt;
π01 ≡ ∫ I(P(Yt = 1|Pt) < C/L) f (Pt, Yt = 1)dPt;
π00 ≡ ∫ I(P(Yt = 1|Pt) < C/L) f (Pt, Yt = 0)dPt. (52)
Based on these results, the expected loss ELP is the weighted average of L(a,Y ) with the above
probabilities πi j as weights:
ELP = (π11 +π10)C+π01L (53)
which is then plugged into (51) to get VS. Note that in this derivation, not only is the information
contained in f (Pt ,Yt) used, but the cost/loss ratio, which is user-specific, plays a role as well. This
observation reconfirms our previous argument that the forecast value is a mixture of objective skill
and subjective loss. If f (Pt ,Yt) is fixed, ELP is a function of C and L. Wilks (2001) proved a stronger
result that VS is only a function of C/L, so that only the ratio matters. For this reason, we can plot
VS against cost/loss ratio in a simple 2-dimensional diagram. In other decision problems, where the
loss function takes a more general rather than the current cost/loss form, VS can be calculated in the
same fashion as before, but the resulting VS as a function of four loss values cannot be shown by a 2
or 3-dimensional diagram.
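For a perfectly calibrated forecast with a discrete support, the whole VS curve follows from Eqs. (51)-(53) in a few lines. The support and weights below are invented for illustration:

```python
# Sketch of Eqs. (51)-(53): the value score as a function of the cost/loss
# ratio c = C/L, for a calibrated forecast (P(Y=1|P_t) = P_t) whose
# marginal distribution puts the weights below on a small support.

support = [0.1, 0.4, 0.8]   # possible forecast values P_t (illustrative)
weights = [0.5, 0.3, 0.2]   # their marginal probabilities
pi = sum(p * w for p, w in zip(support, weights))   # base rate P(Y=1)

def value_score(c, L=1.0):
    C = c * L
    el_clim = min(pi * L, C)   # naive forecast: act on the base rate alone
    el_perf = pi * C           # perfect foresight: pay C only when Y = 1
    # Forecast user: choose a = 1 iff P(Y=1|P_t) = P_t >= C/L.
    el_p = sum(w * (C if p >= c else p * L) for p, w in zip(support, weights))
    if el_clim == el_perf:     # endpoints c = 0 and c = 1 carry no value
        return 0.0
    return (el_clim - el_p) / (el_clim - el_perf)

for c in (0.05, 0.2, 0.5, 0.9):
    print(c, round(value_score(c), 3))
# Calibration keeps VS nonnegative for every c, as Murphy (1977) implies.
```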
Figure 12 plots VS against the cost/loss ratio of a probability forecast. Note that the domain of
interest is the unit interval between zero and one, as the nonnegative cost C is assumed to be less
than the loss L. The two points (0,0) and (1,0) must lie on the VS curve because, when C/L = 0, a = 1 is adopted, resulting in ELclim = ELP = VS = 0; on the other hand, when C/L = 1, a = 0 is adopted, with ELclim = ELP = PC, which again implies a zero VS. In this graph, the probability forecast is not
calibrated, as the VS curve lies beneath zero for some cost/loss ratios. Krzysztofowicz (1992) and
Krzysztofowicz and Long (1990) showed that recalibration (i.e., relabeling) of such forecasts will not
change the refinement but can improve the value score over the entire range of cost/loss ratios, which
again is evidence that forecast skill would affect the forecast value. For the ROC curve, however, Wilks
(2001) demonstrated that even with such recalibration, the recalibrated ROC curve will not change.
Wilks (2001) hence concluded that “the ROC curve is best interpreted as reflecting potential rather
than actual skill” and it is insensitive to calibration improvement. Further details on the interaction of
Figure 12: An artificial value score curve
skill and value measured by other criteria are available in Richardson (2003).
The value score curve lends support for the use of probability forecasts instead of binary point
forecasts. For the latter, only 0/1 values are issued without any uncertainty measurement. Suppose
there is a community populated by more than one forecast user, and each one has his own cost/loss
ratio. Initially, the single forecaster serving the community produces a probability forecast Pt , and
then changes it into a 0/1 prediction by using a threshold P∗, which is announced to the community.
The threshold P∗ determines a unique 2×2 contingency table, and the value score for any given C and
L can be calculated. As a result, the value score curve as a function of the cost/loss ratio can be plotted
as well. Richardson (2003) pointed out that this VS curve is never located higher than that generated
by probability forecasts Pt for any cost/loss ratio on [0,1]. This result is obvious since the optimal P∗
for the community as a whole may not be optimal for all users. If the forecaster provides a probability
forecast Pt instead of a binary point forecast, each user has larger flexibility to choose her/his action
according to his/her own cost/loss ratio, and this would minimize the individual expected loss. A
single forecaster without knowing the distribution of cost/loss ratios across individuals is likely to
give a sub-optimal 0/1 forecast for the whole community.
Similar to the ROC analysis, we often need a single quantity, like the AUC, to measure the overall value of a probability forecast. A natural choice is the integral of the VS curve over [0,1]. This may be justified
by a uniform distribution of cost/loss ratios, which means that forecast values are equally weighted
for all cost/loss ratios. Wilks (2001) proved that this integral is equivalent to the Brier score. This is
a special case where forecast value is completely determined by forecast skill. This may not be true
generally. Wilks (2001) suggested using a beta distribution on the domain [0,1], with two parameters
(α,β), to describe the distribution of cost/loss ratios, as it allows for a very flexible representation
of how C/L spreads across individuals by specifying only two parameters. For example, α = β = 1
yields the uniform distribution with equal weights. The weighted average of value scores (WVS) is
WVS ≡ ∫_0^1 VS(C/L) b(C/L; α, β) d(C/L), (54)

where VS(C/L) is the value score as a function of the cost/loss ratio and b(C/L; α, β) is the beta density with parameters α and β. Wilks (2001) found that this overall measure of forecast value is very sensitive to
the choice of parameters. In practice, it is impossible for a forecaster to know this distribution exactly
since the cost/loss ratio is user-dependent and may involve cost and loss in some mental or utility unit.
Therefore the application of WVS in forecast evaluation practice calls for extra caution. However,
even if one has a perfect awareness of the cost/loss distribution and ranks a collection of competing
forecasts by WVS, this rank cannot be interpreted from the perspective of a particular end user. After
all, WVS is only an overall measure; and the good forecasts identified by WVS may not be equally
good in the eyes of a particular user who will re-evaluate each forecast according to his own cost/loss
ratio.
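Equation (54) is straightforward to evaluate numerically once a VS curve and a Beta weighting are fixed. In the sketch below, the VS curve itself is a hypothetical inverted parabola standing in for the curve of some particular forecast:

```python
# Sketch of Eq. (54): averaging a value-score curve over a Beta(a, b)
# distribution of cost/loss ratios. The VS curve here is hypothetical.
import math

def value_score(c):
    # Stand-in VS curve peaking at c = 0.5 (illustration only).
    return max(0.0, 1 - 4 * (c - 0.5) ** 2) * 0.6

def beta_pdf(c, a, b):
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * c ** (a - 1) * (1 - c) ** (b - 1)

def wvs(a, b, n=2000):
    # Midpoint Riemann sum for the integral in Eq. (54).
    grid = [(i + 0.5) / n for i in range(n)]
    return sum(value_score(c) * beta_pdf(c, a, b) for c in grid) / n

print(round(wvs(1, 1), 3))   # a = b = 1: uniform weights over c
print(round(wvs(6, 2), 3))   # users concentrated at high cost/loss ratios
# As Wilks (2001) notes, the resulting ranking can be sensitive to (a, b).
```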
Although the value score provides a general framework to evaluate the usefulness of probability
forecasts in terms of economic cost and loss, it has its own drawbacks. In the derivation of value score,
we have used the conditional probability P(Yt = 1|Pt) which is unknown in practice and needs to be
estimated from a sample. For a user without much statistical expertise, this estimation is hardly feasible.
Richardson (2003) simplified the derivation by assuming the forecast is perfectly calibrated (P(Yt =
1|Pt) = Pt) and thus a user can take the face value Pt as the truth. All empirical value score curves
presented in Richardson (2003) are generated under this assumption. However, the assumption may
not hold for any probability forecast, and deriving the VS curve and conducting statistical inference
in such a situation become much more challenging.
3.2 Evaluation of Point Predictions
Compared to probability forecasts, only 0/1 values are issued in binary point predictions, which will
be discussed in depth in Section 4. For binary forecasts of this type, the 2× 2 contingency tables,
cross-classifying forecasts and actuals, completely characterize the joint distribution, and thus are
convenient tools from which a variety of evaluation measures about skill and value can be constructed.
We will introduce the usual skill measures based on contingency tables; see also Stephenson (2000) and Mason (2003). Statistical inference on a contingency table, especially the independence test
under two sampling designs, and the measure of forecast value are then briefly reviewed.
3.2.1 Skill measures for point forecasts
Although there are four cells in a contingency table (Table 1), only three quantities are sufficient for
describing it completely. The first one is the bias (B) which is defined to be the ratio of two marginal
probabilities, π1./π.1. For an unbiased forecasting system, B is one and E(Ŷ) = E(Y). Note that B summarizes the marginal distributions of forecasts and observations, and thus does not tell us anything about the association between them. For example, independence of Ŷ and Y is possible for any value
of the bias. The unbiased random forecasts are often taken as having no skill in this context, and all
other forecasts are assessed relative to this benchmark. Two other measures necessary to characterize the forecast errors are the hit rate (H) and the false alarm rate (F), which are the two basic building blocks of an ROC curve. Note that for random forecasts of no skill, both H and F are equal to the marginal probability P(Ŷ = 1) due to independence. For forecasts of positive skill, H is expected to exceed
F. Given B, H and F, any joint probability πi j in Table 1 is uniquely determined, verifying that only
three degrees of freedom are needed for a 2×2 contingency table. The false alarm ratio is defined through 1−H′ ≡ P(Y = 0|Ŷ = 1), while the conditional miss rate is F′ ≡ P(Y = 1|Ŷ = 0). Using Bayes' rule connecting the two factorizations, Stephenson (2000) derived the following relationship between these four conditional measures:

H′ = H/B,
F′ = F(1−H)/(F − H + B(1−F)). (55)
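These identities are easy to verify numerically from any set of joint probabilities; the figures below are illustrative:

```python
# Sketch: recovering B, H, F from Table 1's joint probabilities and
# checking the Stephenson (2000) identities in Eq. (55). Cells (forecast,
# outcome) carry illustrative probabilities summing to one.

p11, p10, p01, p00 = 0.15, 0.10, 0.05, 0.70

B = (p11 + p10) / (p11 + p01)   # bias: P(Yhat = 1) / P(Y = 1)
H = p11 / (p11 + p01)           # hit rate
F = p10 / (p10 + p00)           # false alarm rate

H_prime = p11 / (p11 + p10)     # P(Y = 1 | Yhat = 1)
F_prime = p01 / (p01 + p00)     # P(Y = 1 | Yhat = 0), the miss rate

# Eq. (55):
assert abs(H_prime - H / B) < 1e-12
assert abs(F_prime - F * (1 - H) / (F - H + B * (1 - F))) < 1e-12
print(B, H, F)
```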
Other measures of forecast skill can be constructed using the above three elementary but sufficient
statistics. The first one is the odds ratio defined as the ratio of two odds
OR ≡ [H/(1−H)] / [F/(1−F)], (56)
which is positively oriented in that it equals 1 for random forecasts and is greater than 1 for forecasts
of positive skill. Actually, OR is often taken as a measure of association between the rows and columns in any contingency table, and it equals one if and only if they are independent; see Agresti (2007). Note that
OR is just a function of H and F, both of which are summaries of the conditional distributions. As
a result, OR does not rely on the marginal information. Another measure that is parallel to the Brier
score is the probability of correct forecasts, defined as

πcorr ≡ 1 − E(Y − Ŷ)² = π11 + π00 = [FH + (1−F)(B−H)]/(B − H + F), (57)
which depends on B and the marginal information as well. In rare event cases where the unconditional
probability of Y = 1 is close to zero, πcorr would be very high for the random forecasts of no skill.
This is easily seen by observing that, for unbiased random forecasts, H = F = P(Ŷ = 1) = P(Y = 1) and B = 1. Substituting these into πcorr, we get

πcorr = [FH + (1−F)(B−H)]/(B − H + F) = 2P(Y = 1)² − 2P(Y = 1) + 1, (58)
and the minimum is obtained when P(Y = 1) = 0.5, that is, the event is balanced. In contrast, it
achieves its maximum when P(Y = 1) = 1 or P(Y = 1) = 0. For rare events where P(Y = 1) is
close to zero, πcorr is near one and this leads to the misconception that the random forecasts perform
exceptionally well, as nearly 100% cases are correctly predicted. Even if there is no association
between forecasts and observations, this score could be very high. For this reason, Gandin and Murphy
(1992) regarded πcorr to be “inequitable” in the sense of encouraging hedging. In contrast, the odds
ratio which is not dependent on B does not have this flaw and hence is a reliable measure in rare event
cases. Often, we take the logarithm of OR to transform its range into the whole real line, and statistical inference based on the log odds ratio is much simpler to conduct than that based on the odds ratio itself, as shown in Section 3.2.2.6 Alternatively, we can use the improvement of πcorr relative to the random forecasts
of no skill to measure the forecast skill. This is the Heidke skill score (HSS):
HSS = (πcorr − πocorr)/(1 − πocorr), (59)
6Another transformation of OR is the so-called Yule's Q or Odds Ratio Skill Score (ORSS), which is defined as (OR−1)/(OR+1). Unlike OR, ORSS ranges from −1 to 1 and is recognized conventionally as a measure of association in contingency tables.
where πocorr is πcorr for random forecasts. According to Stephenson (2000), HSS is a more reliable score to use than πcorr, although it also depends on B.
The second widely used skill score that gets rid of the marginal information is the Peirce skill score
(PSS) or Kuipers score, which is defined as the hit rate minus the false alarm rate, cf. Peirce (1884).
Like OR, PSS rewards forecasts of higher skill with a larger score. One of the advantages of PSS over OR is that it is a linear function of H and F, and thus is well-defined for virtually all contingency tables, whereas OR is not defined when H and F are both zero. Stephenson (2000) evaluated the performance of
these scores in terms of complement and transpose symmetry properties, and their encouragement of hedging behaviour. His conclusion is that the odds ratio is generally a useful measure of skill for binary
point forecasts. It is easy to compute and construct inference built on it; moreover, it is independent
of the marginal totals and is both complement and transpose symmetric. Mason (2003) provided a
more comprehensive survey on various scores that are built on contingency tables and established
five criteria for screening these measures, namely, equitability, propriety, consistency, sufficiency and
regularity.
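The scores of this subsection can all be read off one empirical contingency table. A sketch with illustrative counts, where the random-forecast benchmark for the HSS is taken as the expected fraction correct under independence with the same marginals:

```python
# Sketch: odds ratio, Peirce skill score, fraction correct, and Heidke
# skill score from one 2x2 table of counts (forecast by outcome).

n11, n10, n01, n00 = 30, 10, 20, 140   # illustrative counts
T = n11 + n10 + n01 + n00

H = n11 / (n11 + n01)   # hit rate
F = n10 / (n10 + n00)   # false alarm rate

odds_ratio = (H / (1 - H)) / (F / (1 - F))   # Eq. (56)
pss = H - F                                  # Peirce skill score
pi_corr = (n11 + n00) / T                    # fraction of correct forecasts

# HSS, Eq. (59): improvement of pi_corr over independent forecasts that
# share the same marginal frequencies.
p_hat1, p_y1 = (n11 + n10) / T, (n11 + n01) / T
pi_corr_o = p_hat1 * p_y1 + (1 - p_hat1) * (1 - p_y1)
hss = (pi_corr - pi_corr_o) / (1 - pi_corr_o)

print(odds_ratio, pss, pi_corr, hss)
```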
3.2.2 Statistical inference based on contingency tables
So far, all scores are calculated using population contingency tables, and nearly all of them are functions of the four joint probabilities. In practice, only a sample {Ŷt ,Yt}, t = 1, ...,T, is available, which may or may not be generated from the models in Section 4. We have to use this sample to construct the score estimates. This is made simple by noticing that any score, denoted by f (π11,π10,π01), is
a function of three probabilities πi j. The estimator is obtained by replacing each πi j by the sample
proportion pi j. The statistical inference is therefore based on the maximum likelihood theory if the
sample size is sufficiently large. For simplicity, let us consider the random sampling scheme where
{Ŷt ,Yt} is i.i.d. The objective is to find the asymptotic distribution of an empirical score, which is a
function of the sample proportions, denoted by f (p11, p10, p01).
Taking each (Ŷt ,Yt) as a random draw from the joint distribution of forecasts and observations, we have four possible outcomes for each draw: (1,1), (1,0), (0,1) and (0,0), with corresponding probabilities π11, π10, π01, and π00, respectively. Under the assumption of independence, the sampling distribution of {Ŷt ,Yt} is multinomial with four outcomes, each with probability πij. The
likelihood as a function of πi j is thus
L(πij | {Ŷt ,Yt}) = [T !/(n11! n10! n01! n00!)] π11^n11 π10^n10 π01^n01 π00^n00, (60)
where nij is the number of observations in cell (i, j) and T = ∑i ∑j nij. The maximum likelihood estimator is obtained by maximizing (60) over πij, subject to the natural constraint ∑i ∑j πij = 1. Agresti (2007) showed that the ML estimator is simply pij = nij/T, the sample proportion of outcomes (i, j). By maximum likelihood theory, pij is consistent and asymptotically
normally distributed, that is,
√T (p − π) →d N(0, V), (61)

where p = (p11, p10, p01)′, π = (π11, π10, π01)′, and V is the 3×3 asymptotic covariance matrix, which
can be estimated by the inverse of the negative Hessian of the log-likelihood evaluated at p. The asymptotic distribution of f (p11, p10, p01) can be derived by the delta method, provided f is differentiable in a neighborhood of π, obtaining

√T ( f (p11, p10, p01) − f (π11, π10, π01)) →d N(0, (∂f/∂π) V (∂f/∂π)′), (62)

where ∂f/∂π is the gradient vector of f evaluated at π, which can be estimated by replacing π with p. Asymptotic confidence intervals for any score defined above can be obtained based on (62); see Stephenson (2000) and Mason (2003).
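As a concrete special case of (62), the delta method applied to the log odds ratio yields the familiar variance approximation 1/n11 + 1/n10 + 1/n01 + 1/n00; a sketch with illustrative counts:

```python
# Sketch: delta-method 95% confidence interval for the log odds ratio,
# a special case of Eq. (62). The counts are illustrative.
import math

n11, n10, n01, n00 = 30, 10, 20, 140

log_or = math.log((n11 * n00) / (n10 * n01))
se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)

z = 1.96   # asymptotic standard normal critical value
ci = (log_or - z * se, log_or + z * se)
print(log_or, ci)
# An interval excluding 0 (odds ratio of one) rejects independence, i.e.
# the forecasts show significant skill.
```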
In small samples, the above asymptotic approximation is no longer reliable. A rule of thumb is that the number of observations in each cell should be at least 5 in order for the approximation to work well. For samples in real life, one or more cells may not contain any observation, and some
measures, such as OR, cannot be calculated. The Bayesian approach with a reasonable prior could
work in these situations. As shown above, the sample is drawn from a multinomial distribution. Albert
(2009) showed that the conjugate prior for π is the so-called Dirichlet distribution with four parameters
(α11,α10,α01,α00) with density
p(π) = [Γ(∑i ∑j αij) / (∏i ∏j Γ(αij))] π11^(α11−1) π10^(α10−1) π01^(α01−1) π00^(α00−1),   (63)
where ∑i ∑j πij = 1 and Γ(·) is the Gamma function. A natural choice is the noninformative prior,
in which all the αij's equal one and all values of π are a priori equally likely. Albert (2009) showed that the posterior
distribution is also Dirichlet, with updated parameters (α11 + n11, α10 + n10, α01 + n01, α00 + n00).
A random sample of size M from this posterior distribution, denoted by π(m) for m = 1, ..., M, can
then be used to obtain a sequence of scores f(π(m)). For the purpose of inference, the resulting highest
posterior density (HPD) credible set Cα at a given significance level α can be treated in the same way as a
confidence interval in non-Bayesian analysis. Note that the strength of the Bayesian approach in
the present situation is that the score can be calculated even when some of the nij's are zero.
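The Dirichlet-multinomial machinery above can be sketched with nothing but the standard library: a Dirichlet draw is a vector of independent Gamma draws normalized to sum to one. The counts, the flat prior, and the use of an equal-tailed interval as a simple stand-in for the HPD set are all illustrative assumptions.

```python
import random

random.seed(0)

def dirichlet_draw(alphas):
    """One draw from Dirichlet(alphas) via normalized Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

# Hypothetical counts (n11, n10, n01, n00); note n10 = 0, so the sample
# odds ratio is undefined, yet the posterior odds ratio is well behaved
counts = [12, 0, 5, 30]
posterior = [1 + n for n in counts]   # flat Dirichlet(1,1,1,1) prior

M = 5000
odds_ratios = []
for _ in range(M):
    p11, p10, p01, p00 = dirichlet_draw(posterior)
    odds_ratios.append((p11 * p00) / (p10 * p01))

odds_ratios.sort()
lo, hi = odds_ratios[int(0.025 * M)], odds_ratios[int(0.975 * M)]
median_or = odds_ratios[M // 2]
```

Even with an empty cell, every posterior draw gives a finite odds ratio, so the score and its credible interval are always computable.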
Testing independence between the rows and columns of a contingency table is very important for forecast
evaluation. As shown above, independent forecasts would not be credited with a high value by any
score. Merton (1981) proposed a statistic to measure the market timing skill of directional forecasts
(DF). According to Merton (1981), a DF has no value if, and only if,

HM ≡ P(Ŷt = 1|Yt = 1) + P(Ŷt = 0|Yt = 0) = 1,   (64)

where Yt = 1 means the variable has moved upward. In our terminology, this means that

P(Ŷt = 1|Yt = 1) − P(Ŷt = 1|Yt = 0) = 0.   (65)
Note that P(Ŷt = 1|Yt = 1) is the hit rate and P(Ŷt = 1|Yt = 0) is the false alarm rate. As a result, the
DF under consideration has no market timing skill in the sense of Merton (1981) if, and only if, the
Peirce skill score is zero. Blaskowitz and Herwartz (2008) derived an alternative expression for the
HM statistic in terms of the covariance of realized and forecasted directions:

HM − 1 = Cov(Ŷt, Yt) / Var(Yt).   (66)

HM = 1 if, and only if, Cov(Ŷt, Yt) is zero, which is equivalent to independence between Ŷt and Yt in
the case of binary variables. Interestingly, a large number of papers investigating DF use symmetric
loss functions of various forms, which amounts to taking the percentage of correct forecasts as the
score; see Leitch and Tanner (1995), Greer (2005), Blaskowitz and Herwartz (2009), Swanson and
White (1995, 1997a,b), Gradojevic and Yang (2006), and Diebold (2006), to name a few. Pesaran and
Skouras (2002) linked the HM statistic with a loss function in a decision-based forecast evaluation
framework.
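A quick numerical check of the identity in (66), with made-up 0/1 series standing in for the forecasts Ŷt and outcomes Yt (the values carry no empirical content):

```python
# Toy 0/1 forecast and outcome series; purely illustrative values
yhat = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
y    = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
T = len(y)

def mean(v):
    return sum(v) / len(v)

ones, zeros = sum(y), T - sum(y)
hits   = sum(1 for f, o in zip(yhat, y) if f == 1 and o == 1)
rejs   = sum(1 for f, o in zip(yhat, y) if f == 0 and o == 0)
alarms = sum(1 for f, o in zip(yhat, y) if f == 1 and o == 0)

hm  = hits / ones + rejs / zeros     # equation (64)
pss = hits / ones - alarms / zeros   # hit rate minus false alarm rate

cov_fy = mean([f * o for f, o in zip(yhat, y)]) - mean(yhat) * mean(y)
var_y  = mean(y) * (1 - mean(y))

# Equation (66): HM - 1 = Cov(Yhat, Y) / Var(Y), which also equals the PSS
assert abs((hm - 1) - cov_fy / var_y) < 1e-12
assert abs((hm - 1) - pss) < 1e-12
```

For these toy series HM = 1.4, so HM − 1 = PSS = 0.4, and the covariance ratio reproduces the same number, as the identity requires for any binary pair.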
Since testing market timing skill is equivalent to the independence test in contingency tables, let
us look at this test a bit more closely. The independence test under random sampling is much simpler than the
test in the presence of serial correlation. As a matter of fact, all of the above frequentist and Bayesian
tests are applicable in this situation. Take the Peirce skill score as an example. We can construct
an asymptotic confidence interval for PSS based on a large sample and then check whether zero is
included in the confidence interval. Besides these, two additional asymptotic tests exist, namely, the
likelihood ratio and the Pearson chi-squared tests. The former is constructed as
LR ≡ 2(ln L(π̂*ij | {(Ŷt, Yt)}) − ln L(π̂ij | {(Ŷt, Yt)})),   (67)

where π̂*ij is the unrestricted ML estimate, whereas π̂ij is the restricted one under the restrictions
πij = πi·π·j for all i and j. Given the null hypothesis of independence, LR follows a chi-squared
distribution with one degree of freedom asymptotically, and the null should be rejected if and only if
LR is larger than the critical value at a preassigned significance level. The Pearson chi-squared statistic
is
χ² ≡ ∑i ∑j (nij − n̂ij)² / n̂ij,   (68)

where nij is the observed cell count, n̂ij = T pi· p·j is the expected cell count under independence,
pi· is the marginal sample proportion of the ith row, and p·j is that of the jth column. If the rows
and the columns are independent, this statistic is expected to be small. It also has an asymptotic
chi-squared distribution with one degree of freedom and the same rejection area. Both tests are valid
and equivalent in large samples. In finite samples, where one or more cell counts are smaller than 5,
Fisher’s exact test is preferred under the assumption that the total row and column counts are fixed.
The null distribution of the Fisher test statistic is not valid if these marginal counts are not fixed, as is
often the case in random sampling. Specifically, the probability of the first count n11 given marginal
totals and independence is
P(n11) = [n1·! / (n11! n10!)] [n0·! / (n01! n00!)] / [T! / (n·1! n·0!)],   (69)
which has the hypergeometric distribution for any sample size. This test was proposed by Fisher in
1934, and is widely used to test independence in I × J contingency tables under the random sampling
design. Here only the simple case with I = J = 2 is considered, and readers are referred to Agresti
(2007) for further discussion of this exact test. Another way of testing independence in general I × J
contingency tables is the asymptotic test of the ANOVA coefficients of ln(πij), that is, the significance
test of the relevant coefficients in the log-linear model, which is popular in statistics and biostatistics
but rarely used by econometricians. This test exploits the fact that the ANOVA coefficients of
ln(πij) must satisfy certain conditions under independence. One of these conditions is that the coefficient
of any interaction term must be zero. The test proceeds by checking whether the maximum likelihood
estimates support these implied values using the three standard procedures, that is, the Wald, likelihood
ratio, and Lagrangian multiplier tests. In econometrics, Pesaran and Timmermann (1992) proposed
an asymptotic test (PT92) based on the difference between P(Ŷ = 1, Y = 1) + P(Ŷ = 0, Y = 0) and
P(Ŷ = 1)P(Y = 1) + P(Ŷ = 0)P(Y = 0), which should be close to zero under independence. A large
deviation of the sample estimate from zero is thus a signal of rejection. In 2× 2 contingency tables,
ANOVA and PT92 tests are asymptotically equivalent to the classical χ2 test.
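The three tests just described can be sketched from first principles for a single hypothetical 2×2 table; every count below is invented for illustration.

```python
import math

# Hypothetical 2x2 counts n_ij = #{Yhat = i, Y = j}
n11, n10, n01, n00 = 30, 10, 15, 45
T = n11 + n10 + n01 + n00
r1, r0 = n11 + n10, n01 + n00      # forecast (row) margins n_1., n_0.
c1, c0 = n11 + n01, n10 + n00      # outcome (column) margins n_.1, n_.0

cells = [(n11, r1 * c1), (n10, r1 * c0), (n01, r0 * c1), (n00, r0 * c0)]

# Pearson chi-squared statistic (68) with expected counts r_i * c_j / T
chi2 = sum((obs - rc / T) ** 2 / (rc / T) for obs, rc in cells)

# Likelihood ratio statistic (67): 2 * sum n_ij * log(n_ij / expected)
lr = 2 * sum(obs * math.log(obs * T / rc) for obs, rc in cells if obs > 0)

# Fisher's exact point probability (69) of n11 given the margins
p_n11 = math.comb(r1, n11) * math.comb(r0, n01) / math.comb(T, c1)
```

For this table both asymptotic statistics are far above the 5% chi-squared(1) critical value of 3.84, and the exact hypergeometric probability of the observed n11 is tiny, so all three approaches reject independence.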
In reality, especially for macroeconomic forecasts, Ŷt and Yt are likely to be serially correlated. All
of the above test statistics can nevertheless still be used, but their null distributions will change.
For example, Tavaré and Altham (1983) examined the performance of the usual χ² test when both
the row and column variables are characterized by two-state Markov chains, and concluded that the χ² statistic
does not then have the χ² distribution with one degree of freedom that applies in the case of random samples. Before
drawing any meaningful conclusions from these classical tests, serial correlation needs to be handled
properly.
Blaskowitz and Herwartz (2008) provided a summary of the testing methodologies in the presence
of serial correlation in Ŷt and Yt. These include a covariance test based on the covariance of observations
and events, a static/dynamic regression approach adjusted for serial correlation by calculating
Newey-West corrected t-statistics, and the Pesaran and Timmermann (2009) test based on the canonical
correlations from dynamically augmented reduced rank regressions, specialized to the binary case.
They found that all of these tests based on asymptotic approximations tend to produce incorrect
empirical size in finite samples, and suggested a circular bootstrap approach to improve their finite-sample
performance. Bootstrap-based tests are found to have smaller size distortions in small samples
without much sacrifice of power, while tests that do not take care of serial correlation tend to generate
inflated size in finite samples.
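A minimal sketch of a circular block bootstrap in this spirit: each series is resampled independently (imposing the independence null) from blocks read off a circular wrap of the data, which preserves each series' own serial correlation. The toy series, block length, and number of replicates are arbitrary illustrative choices, not the Blaskowitz and Herwartz implementation.

```python
import random

random.seed(1)

def circular_blocks(series, block_len):
    """One circular block bootstrap replicate: wrap the series around a
    circle and paste together randomly started blocks of fixed length."""
    T = len(series)
    out = []
    while len(out) < T:
        s = random.randrange(T)
        out.extend(series[(s + k) % T] for k in range(block_len))
    return out[:T]

def cov(a, b):
    T = len(a)
    ma, mb = sum(a) / T, sum(b) / T
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / T

# Toy serially correlated 0/1 series standing in for forecasts and outcomes
yhat = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]
y    = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]

obs = cov(yhat, y)
# Null distribution: resample the two series *independently*, imposing
# independence while each block preserves within-series serial correlation
null = sorted(abs(cov(circular_blocks(yhat, 4), circular_blocks(y, 4)))
              for _ in range(999))
p_value = sum(1 for c in null if c >= abs(obs)) / len(null)
```

The bootstrap p-value is then compared with the nominal level in the usual way; in applications the block length would be tuned to the persistence of the data rather than fixed at 4.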
Dependence of forecasts and observations is necessary for a forecasting system to have positive
skill. However, it is only a minimal requirement for good forecasts. It is not unusual for the performance
of a forecasting system to be worse, in terms of some specific criterion, than that of random forecasts with no
skill. Donkers and Melenberg (2002) proposed a test of relative forecasting performance over this
benchmark by comparing the difference in the percentage of correct forecasts. In a real-life example,
they found that their proposed test and the PT92 test differ dramatically between the estimation and
evaluation samples.
3.2.3 Evaluation of forecast value
Most evaluation methodologies focus on the skill of binary point forecasts. As argued by Diebold and
Mariano (1995) and Granger and Pesaran (2000a,b), however, the end user often finds measures of
economic value to be more useful than the usual mean squared error or other statistical scores. We have
emphasized this point in the context of probability forecasts in which the cost/loss ratio is important
for value-evaluation in a forecast-based decision problem. In a 2× 2 payoff matrix (e.g. Table 4),
each cell corresponds to the loss associated with a possible combination of action and realization, and
is not limited to the specific cost/loss structure. Blaskowitz and Herwartz (2011) proposed a general
loss function suitable for directional forecasts in economics and finance, which takes into account
the realized sign and the magnitude of directional movement for the target economic variable. They
regarded this general loss function as an alternative to the commonly used mean squared error for
forecast evaluation.
As indicated before, Richardson (2003) analyzed the relationship between skill and value in the
context of the cost/loss decision problems. Note that for probability forecasts, any user, faced with a
probability value, decides whether or not to take some action according to his optimal threshold. For
binary point predictions, we can also calculate the value score, defined as a function of the cost/loss
ratio. The resulting VS curve would lie below the one generated by probability forecasts. Richardson
(2003) proved that the particular cost/loss ratio which maximizes VS is equal to the marginal
probability of Y = 1, and that the highest achievable value score is simply the Peirce skill score (PSS).
Granger and Pesaran (2000b) derived a very similar result. Consequently, the maximum economic
value is related to the forecast skill, and PSS is taken as a measure of the potential forecast value as
well as skill. However, for a specific user with a cost/loss ratio different from the marginal probability
P(Y = 1), this maximum value is not attainable. Thus PSS only gives the possible maximum rather
than the actual value achievable for any user. On the other hand, Stephenson (2000) argued that in
order to have a positive value score for at least one cost/loss ratio, the odds ratio (OR) has to exceed
one. That is, forecasts and observations have to depend on each other, otherwise, nobody benefits from
the forecasts and one would rather use the random forecasts with no skill. This observation provides
another example, where forecast value is influenced by forecast skill. Only those forecasts satisfying
the minimal skill requirements can be economically valuable.
4 Binary Point Predictions
In some circumstances, especially in two-state, two-action decision problems, one has to make a
binary decision according to the predicted probability of a future event. This can be done by trans-
forming a continuous probability into a 0/1 point prediction, as we will discuss in this section. Unlike
probability forecasts, binary point forecasts cannot be isolated from an underlying loss function. For
this reason, we deferred a detailed examination of the topic until after forecast evaluation under a
general loss function was reviewed in Section 3. The plan of this section is as follows: Section 4.1
considers ways to transform predicted probabilities into point forecasts – the so-called “two-step
approach”. Manski (1975, 1985) generalized this transformation procedure to other cases where no
probability prediction is given as prior knowledge, and the optimal forecasting rule is obtained
through a one-step approach. This is addressed in Section 4.2, followed by an empirical illustration
in Section 4.3. A set of binary classification techniques primarily used in the statistical learning
literature is briefly introduced in Section 4.4. These include discriminant analysis, classification
trees, and neural networks.
4.1 Two-step approach
In the two-step approach, the first step consists of generating binary probability predictions, as re-
viewed in Section 2, while a threshold is employed to translate these probabilities into 0/1 point
predictions in the second step. In the cost/loss decision problem, the optimal threshold of doing so is
based on the cost/loss ratio. For a general loss function L(Y ,Y ), the optimal threshold minimizing the
expected loss can be solved by comparing two quantities, namely, the expected loss of Y = 1 and that
of Y = 0. Denote the former by EL1 = P(Y = 1|P)L(1,1)+(1−P(Y = 1|P))L(1,0) and the latter by
EL0 = P(Y = 1|P)L(0,1)+(1−P(Y = 1|P))L(0,0). Y = 1 is optimal if and only if EL1 ≤ EL0, or,
P(Y = 1|P)≥ L(1,0)−L(0,0)L(1,0)−L(0,0)+L(0,1)−L(1,1)
≡ P∗. (70)
Here we assume that making a correct forecast is beneficial and making a false forecast is costly,
that is, L(0,0) < L(1,0) and L(1,1) < L(0,1). P* defined above is the optimal threshold; it is a
function of the losses, and is interpreted as the fraction of the gain from getting the forecast right when
Y = 0 over the total gain from correct forecasts. Given P*, the optimal decision (or forecasting) rule is
Ŷ = I(P(Y = 1|P) ≥ P*). In general, P(Y = 1|P) is unknown, and this rule is infeasible. However,
suppose P is generated by one of the models in Section 2 that is correctly specified in the sense that
P = P(Y = 1|Ω). The law of iterated expectations then implies that P(Y = 1|P) = P, that is, P is perfectly
calibrated, and the decision rule reduces to Ŷ = I(P ≥ P*). Given a sequence of this type of
probability forecasts {Pt}, this rule says that we can generate another sequence of 0/1 point forecasts
{Ŷt} by simply comparing each Pt with P*. In reality, rather than P, what we know is its estimate P̂
from a particular binary response model, say a probit or single-index model, evaluated at a particular
covariate value x. If this model is correctly specified, the decision rule using P̂ in place of P is
asymptotically optimal as well, and both yield the same expected loss as the sample size approaches
infinity. Figure 13 illustrates a decision rule based on the probit model with threshold 0.4.
Figure 13: Probit and linear probability models with threshold 0.4
From this figure, Ŷ = 1 is predicted for any observation with Φ(Xβ̂) ≥ 0.4, that is, for those on the
right-hand side of the vertical line.
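The second step can be sketched as follows; the loss numbers and probability values are hypothetical and serve only to show how (70) maps a loss matrix into a cutoff:

```python
# Hedged sketch of the second step: turn fitted probabilities into 0/1
# forecasts using the loss-based threshold P* from equation (70)
def optimal_threshold(L11, L10, L01, L00):
    """P* = (L(1,0)-L(0,0)) / ((L(1,0)-L(0,0)) + (L(0,1)-L(1,1)))."""
    return (L10 - L00) / ((L10 - L00) + (L01 - L11))

# Hypothetical loss matrix: a false alarm costs 1, a miss costs 4,
# correct forecasts cost nothing
p_star = optimal_threshold(L11=0.0, L10=1.0, L01=4.0, L00=0.0)   # 0.2

probs = [0.05, 0.15, 0.25, 0.60, 0.90]   # fitted P(Y = 1 | X) values
point_forecasts = [int(p >= p_star) for p in probs]
# -> [0, 0, 1, 1, 1]
```

Because misses are four times as costly as false alarms here, the cutoff drops from the naive 0.5 to 0.2, and borderline probabilities such as 0.25 are already converted into Ŷ = 1.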
4.2 One-step approach
Manski (1975, 1985) developed a semiparametric estimator for the binary response model, the so-
called maximum score estimator (MSCORE). This is different from other semiparametric estimators
in Section 2.1.3 in terms of the imposed assumptions. Both single-index and nonparametric additive
models assume that the error in (2) is stochastically independent of X . In contrast, MSCORE only
assumes the conditional median of this error is zero, that is, med(ε|X) = 0, or median independence,
which is much weaker. Manski assumed the index function to be linear in unknown parameters β,
so the full specification is akin to the parametric model in Section 2.1.1, but he relaxed the inde-
pendence and distributional assumptions. Compared with other binary response models, the salient
feature of Manski’s semiparametric estimator is its weak distributional assumptions. However, as a
result, the conditional probability P(Y = 1|X) cannot be estimated—the price one has to pay with less
information. This is the reason why we did not discuss this model in Section 2 under “Probability
Predictions”.
The maximum score estimator β̂ solves the following maximization problem based on a sample
{(Yt, Xt)}:

max_{β∈B, |β1|=1} Sms(β) ≡ (1/T) ∑_{t=1}^{T} (2Yt − 1)(2I(Xtβ ≥ 0) − 1),   (71)
where B is the permissible parameter space, |β1| is normalized to one for identification (β is identified
only up to scale), and Sms(·) is the score function. Note that when Yt = 1 and Xtβ ≥ 0, or
Yt = 0 and Xtβ < 0, we have (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = 1; otherwise, (2Yt − 1)(2I(Xtβ ≥ 0) − 1) = −1.
Interpreting this as the problem of using X to predict Y, the rule says that Ŷ = 1 is predicted if, and only if,
the linear predictor Xβ is at least zero. Whenever the predicted and observed values agree,
the score rises by 1/T; otherwise, it falls by the same amount. In this light, MSCORE
estimates the optimal linear forecasting rule of the form Xβ that maximizes the percentage
of correct forecasts.
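For intuition, the score function in (71) can be maximized by brute force in a toy one-regressor example; the data, the grid, and the ±1 slope normalization are illustrative simplifications, standing in for the specialized algorithms discussed in the literature.

```python
# Toy data: Y = 1 tends to occur for larger x; all values are made up
data = [(-2.0, 0), (-1.0, 0), (0.0, 0), (0.5, 0),
        (0.8, 1), (1.2, 1), (2.0, 1), (3.0, 1)]

def score(b0, b1, sample):
    """Sms(beta) from (71): mean of (2Y - 1)(2 I(Xb >= 0) - 1)."""
    return sum((2 * y - 1) * (2 * int(b0 + b1 * x >= 0) - 1)
               for x, y in sample) / len(sample)

# Grid search over the intercept with the slope normalized to +/-1
best = max(((score(b0, s, data), b0, s)
            for s in (-1.0, 1.0)
            for b0 in (k / 100 for k in range(-300, 301))),
           key=lambda t: t[0])
best_score, b0_hat, b1_hat = best
```

Because the toy data are perfectly separable at a cutoff between 0.5 and 0.8, the grid search attains the maximum score of 1, and any intercept in that interval is equally good, illustrating why the estimator is set-valued and converges slowly.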
Manski (1985) established strong consistency of the maximum score estimator. The rate of convergence
and the asymptotic distribution were analyzed by Cavanagh (1987) and Kim and Pollard (1990),
respectively. However, the score function is not continuous in the parameters, and thus the limiting
distribution is too complex for carrying out statistical inference. Manski and Thompson (1986) suggested
using the bootstrap to conduct inference for MSCORE, a suggestion critically evaluated by Abrevaya and
Huang (2005). Delgado et al. (2001) discussed the use of nonreplacement subsampling to approximate
the distribution of MSCORE. Furthermore, the convergence rate of MSCORE is T^(1/3), which is slower
than the usual √T. All of these issues restrict the application of MSCORE in empirical studies. To
overcome the problem resulting from discontinuity, Horowitz (1992) proposed a smoothed version of
the score function using a differentiable kernel. The resulting smoothed MSCORE is consistent and
asymptotically normal, with a convergence rate of at least T^(2/5) that can be made arbitrarily close to √T
under some assumptions. Horowitz (2009) also discussed extensions of MSCORE to choice-based
samples, panel data and ordered-response models. Caudill (2003) illustrated the use of MSCORE in
forecasting where seeding is taken as a predictor of winning in the men’s NCAA basketball tourna-
ment. He found that MSCORE tends to outperform parametric probit models for both in-sample and
out-of-sample forecasts.
Manski and Thompson (1989) investigated a one-step analog estimation of optimal predictors
of binary response with much relaxed parametric assumptions on the response process. The loss
functions they considered are quite general. The first is the class of asymmetric absolute loss functions
under which the optimal forecasting rule takes the same form as Ŷ = I(P ≥ P*). The second is the
class of asymmetric square loss functions, and the last is the logarithmic loss function. Under these
last two losses, however, the optimal forecasts are not 0/1-valued and thus are omitted here. A natural
estimation strategy is to estimate P first, and then to get the point forecasts using the optimal rule,
as explained in Section 4.1. Manski and Thompson (1989) suggested estimating the optimal binary
point forecasts directly by the analogy principle, viz., the estimates of best predictors are obtained by
solving sample analogs of the prediction problem without the need to estimate P first. The potential
benefit of this one-step procedure is that it allows for a certain degree of misspecification for P. They
discussed this issue in two specific binary response models, “isotonic” and “single-crossing”, finding
that the analog estimators for a large class of predictors are algebraically equivalent to MSCORE, and
so are consistent.
Elliott and Lieli (2010) followed the same one-step approach under a general loss function. They
extended Manski and Thompson's analog estimator by allowing the best predictor to be nonlinear in β.
In MSCORE, the “rule of thumb” threshold for transforming Xβ̂ into 0/1 binary point forecasts is 0.
Note that Xβ̂ is not the conditional probability of Y = 1 given X. However, this threshold may not
be optimal for the particular decision problem under consideration. Elliott and Lieli (2010) derived an
optimal threshold based on a general utility function, which may depend on the covariates X as well.
Their motivation can be explained in terms of Figure 13.
Suppose the true model is the probit model, but a linear probability model is fitted instead, with
the fitted line shown in Figure 13. According to the analysis in Section 2.1.1, the estimated β is
generally not consistent and so the linear probability model will be viewed as a bad choice. Elliott
and Lieli (2010) argued, however, that this may not be the case, at least in this example. Rather than
concentrating on β, what is important is the optimal forecasting rule; two different models may yield
the same forecasting rule. In Figure 13, the optimal forecasting rule determined by the true model
is: Ŷ = 1 is predicted if, and only if, X lies on the right-hand side of the vertical line – the very rule
we get by using the linear predictor Xβ. This finding highlights the point that we do not require the
model to be correctly specified in order to obtain an optimal forecasting rule. As a result, modeling
binary responses for point predictions becomes much more flexible than for probability predictions.
However, this gain in specification flexibility should not be overstated, since not every misspecified
model will work. The key requirement is that both the working model and the true model have to cross
the optimal threshold level at exactly the same cutoff point. The working model can behave arbitrarily
elsewhere, where the predictions can even go beyond [0,1].7 Therefore, a good working model may
not be the real conditional probability model and need not have any structural interpretation. For
example, β in the linear probability model in Figure 13 does not give the marginal effect of X on the
probability of Y = 1. Elliott and Lieli concluded that the usual two-step estimation procedures, such
as maximum likelihood estimation, fit the working model globally, and thus the fitted model is close
to the true model over the whole range of covariate values. However, this is not necessary since the
goodness of fit in the neighborhood of the cutoff point is all that is necessary. In other words, all we
need is a potentially misspecified working model that fits well locally instead of globally.
To overcome the problem of the two-step estimation approach, Elliott and Lieli (2010) incor-
porated utility into the estimation stage – the one-step approach initially proposed by Manski and
Thompson (1989). The population problem involves maximizing expected utility by choosing a binary
optimal action as a function of X, namely,

max_{a(·)} E(U(a(X), Y, X)),   (72)
where U(a, Y, X) is the utility function depending on the binary action a (itself a function of
X), the realized event Y, and the covariates X.8 After some algebraic manipulation, (72) can be rewritten
as

max_{g∈G} E(b(X)[Y + 1 − 2c(X)] sign[g(X)]),   (73)
7Another nontrivial requirement is that the working model must be above (below) the cutoff whenever thetrue model is above (below) it.
8Elliott and Lieli suggested empirical examples where X enters into the utility function.
where b(X) = U(1,1,X) − U(−1,1,X) + U(−1,−1,X) − U(1,−1,X) > 0, c(X) is the optimal threshold
expressed as a function of utility, a(X) = sign[g(X)], and G is the collection of all measurable functions
from R^k to R (note that X is k-dimensional). The so-called maximum utility estimator (MUE) is
then obtained by solving the sample version of (73):

max_{g∈G} (1/T) ∑_{t=1}^{T} b(Xt)[Yt + 1 − 2c(Xt)] sign[g(Xt)].   (74)
For implementation, g needs to be parameterized; that is, only a subclass of G is considered in order to reduce
the estimation dimension. The estimator β̂ which maximizes the objective function

max_{β∈B} (1/T) ∑_{t=1}^{T} b(Xt)[Yt + 1 − 2c(Xt)] sign[h(Xt, β)]   (75)
produces the empirical forecasting rule sign[h(Xt, β̂)].9 Under weak conditions, this empirical forecasting
rule converges to the theoretically optimal rule given the model specification h(x, β). If, in
addition, the model h(x, β) satisfies the stated condition for correct specification, the constrained optimal
forecast is also the globally optimal forecast for all possible values of the predictors. They
recommended a finite-order polynomial for use in practice.
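A stripped-down illustration of (75): constant benefit b(X) = 1, a constant threshold c(X) = 0.2 (so misses are four times as costly as false alarms), a linear working model with the slope normalized to one, and a grid search standing in for the simulated annealing or mixed integer programming discussed in the literature. All data and tuning choices are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1

def mue_objective(b0, sample, c=0.2):
    """Sample objective (75) with b(X)=1, h(X, b0) = b0 + x, Y in {-1, +1}."""
    return sum((y + 1 - 2 * c) * sign(b0 + x) for x, y in sample) / len(sample)

# Toy data with Y coded -1/+1; the lone (0.5, -1) point gets "sacrificed"
# because, at c = 0.2, missing a +1 outcome costs four times a false alarm
data = [(-2.0, -1), (-1.0, -1), (-0.5, 1), (0.5, -1),
        (1.0, 1), (2.0, 1), (3.0, 1)]

grid = [k / 100 for k in range(-300, 301)]
b0_hat = max(grid, key=lambda b0: mue_objective(b0, data))
in_sample = mue_objective(b0_hat, data)
```

The maximizing intercepts form an interval, [0.5, 1.0) on this grid, and any of them induces the same forecasting rule, which is exactly the point made above: only the cutoff matters, not the global fit of the working model.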
The identification issues in the approaches of Manski and of Elliott and Lieli are less important for
prediction purposes than for structural analysis. The estimation proceeds without much worry about
identification, provided alternative identification restrictions yield the same forecasting rules. Statistical
inference is built on the optimand function instead of the usual focus on β. One difficulty
comes from the discontinuity of the objective function, which means that the maximization in practice
cannot be undertaken by the usual gradient-based numerical optimization techniques. Elliott and Lieli
employed the simulated annealing algorithm in their Monte Carlo studies, while mixed integer programming
was suggested by Florios and Skouras (2007) to solve the optimization problem.
Lieli and Springborn (2012) assessed the predictive ability of three procedures (two-step maxi-
mum likelihood, two-step Bayesian and one-step maximum utility estimation) in deciding whether to
allow novel imported goods which may be accompanied by undesirable side effects, such as biological
invasion. They used Australian data to demonstrate that a maximum utility method is likely to offer
significant incremental gains relative to the other alternatives, and estimated this annual value to be
$34-$49 million (AU$) under their specific loss function. This paper also extends the maximum utility
9Note that (75) with constant b(Xt), c(Xt) = 0.5 and h(Xt ,β) = Xtβ, is equivalent to the maximum scoreproblem. Therefore, MSCORE is a special case of this general estimator.
model to address an endogenously stratified sample where the uncommon event is over-represented in
the sample relative to the population rate, as discussed in Section 2.1.1.
Lieli and Nieto-Barthaburu (2010) generalized the above approach with a single decision maker to
a more complex context where a group of decision makers has heterogeneous utility functions. They
considered a public forecaster serving all decision makers by maximizing a weighted sum of individ-
ual (expected) utilities. The maximum welfare estimator was then defined through the forecaster’s
maximization problem, and its properties were explored. The conditions under which the traditional
binary prediction methods can be interpreted asymptotically as socially optimal were given, even when
the estimated model was misspecified.
4.3 An empirical illustration
To illustrate the difference between the one-step and two-step approaches in terms of their forecasting
performance, the data in Section 2.1.5 involving yield spreads and recession indicators, are used here.
For simplicity, the lagged indicator is removed, that is, only static models with yield spread as the only
regressor are fitted. It is well known that the best model for fitting the data is not always the best model
for forecasting. The whole sample is, therefore, split into two groups. The first group, covering the
period from January 1960 to December 1979, is for estimation use, while the second one, including
all remaining observations, is for out-of-sample evaluation. For the conventional two-step approach,
we fit a parametric probit model with a linear index. A recession in month t is predicted if and
only if

Φ(β̂0 + β̂1 YSt−12) ≥ optimal threshold,   (76)
where Φ(·) is the standard normal distribution function, YSt−12 is the 12-month lagged yield spread,
and β̂j, for j = 0 and 1, are the maximum likelihood estimates. For the purpose of comparison, the
same model specification (76) is fitted by Elliott and Lieli's approach under alternative loss functions.
In this case, we use the same forecasting rule (76) with the β̂j replaced by the maximum utility
estimates. Two particular loss functions are analyzed here: the percentage of correct forecasts and
the Peirce skill score, with 0.5 and the population probability of recession as the respective optimal thresholds.
We take the sample proportion as the estimate of the population probability. Note that
these are also the two most commonly used thresholds for translating a probability into a 0/1 value in
empirical studies; see Greene (2011). The maximum utility estimates are computed using the OPTMODEL
procedure in SAS 9.2.
Figure 14 presents these fitted curves using the estimation sample, together with two optimal
thresholds.10 In contrast to the two-step maximum likelihood approach, one-step estimates depend on
the loss function of interest. When the Peirce skill score is maximized, instead of the percentage of
correct forecasts, both intercept and slope estimates change, making the fitted curve shift rightward.
One noteworthy result is that both the one-step and two-step fitted curves of maximizing Peirce skill
score touch the optimal threshold (0.15) in roughly the same region, despite their large gap when the
yield spread is negative. According to Elliott and Lieli (2010), this implies that both are expected
to yield the same forecasting rule, and thus yield the same value for the Peirce skill score. For the
percentage of correct forecasts, the fitted curves from these two approaches are also very close to
each other in the critical region, where the curves touch the optimal threshold (0.5). Their results are
confirmed in Table 5 where we summarize the in-sample goodness of fit for all fitted models. As
expected, it makes no difference in terms of the objectives they attempt to maximize. For instance,
the maximized Peirce skill score is 0.4882 for both the probit and MPSS. One possible reason for
their equivalence in this particular example could be due to the correct specification in (76), i.e., the
true data generating process can be represented correctly by the probit model.11 Note that in Table 5,
the Peirce skill score of MPC is significantly lower than that of the other two models; so is the percentage
of correct forecasts for MPSS. This is not surprising, as neither one-step semiparametric model is
designed to maximize the other criterion.
Table 5: In-sample goodness of fit for one-step vs. two-step models

            PC       PSS
  Probit    0.8625   0.4882
  MPC       0.8625   0.1744
  MPSS      0.7167   0.4882
10In Figure 14, MPC is the fitted curve for the maximum percentage of correct forecasts, while MPSS is the maximum Peirce skill score fitted curve.
11In fact, a nonparametric specification test shows that the functional form in (76) cannot be rejected by the sample. Thus, the fitted probit model serves as a proxy for the unknown data generating process.

Figure 14: One-step vs. two-step fitted curves

To correct for possible in-sample overfitting, we evaluate the fitted models using the second sample,
with the results summarized in Table 6. Both tables convey similar information about the
forecasting performance of the one-step and two-step models. In Table 6, the probit model still
performs admirably well. In terms of the percentage of correct forecasts, it even outperforms MPC, which is
constructed to maximize this criterion. Given that the probit model is correctly specified, the slight
superiority of the two-step approach may be due to sampling variability or to structural differences
between the estimation and evaluation samples.
Table 6: Out-of-sample evaluation for one-step vs. two-step models

            PC       PSS
  Probit    0.8672   0.5854
  MPC       0.8542   0.1333
  MPSS      0.8229   0.5854
The relative advantage of the one-step approach, as emphasized in Section 4.2, is that it is robust to
some types of misspecification that the two-step approach cannot accommodate. In order to highlight
this point, we fit the linear probability model (1) instead of the probit model (76). For the two-step
approach, the recession for month t is predicted if and only if

\hat{\beta}_0^{OLS} + \hat{\beta}_1^{OLS} YS_{t-12} ≥ optimal threshold,  (77)

where \hat{\beta}_j^{OLS}, for j = 0 and 1, are the OLS estimates. For the one-step approach, these parameters are
estimated by the Elliott and Lieli method. Figure 15 illustrates some interesting results in this setting.
Compared with the probit fitted curve, the OLS fitted line is dramatically different. However, the MUE
fitted lines, based on PC and PSS, intersect the probit fitted curve (76) at their associated threshold
values (0.5 and 0.15, respectively). Thus, MUE produces the same binary point forecasts even when
the working model (77) is misspecified. Figure 15 shows that the lines estimated by MUE do not fit
the data generating process globally very well, yet are capable of producing correct point predictions.
Given that a global fit is less important than the localized problem of identifying the cutoff in the
present binary point forecast context, the one-step approach with better local fit should be preferred.12
Figure 15: One-step vs. two-step linear fitted lines
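The two-step procedure discussed above can be sketched in a few lines: fit the linear probability model by OLS, then search a grid for the cutoff maximizing the Peirce skill score. The simulated data, functional form, and variable names below are hypothetical stand-ins, not the chapter's yield-spread series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the recession indicator and the lagged spread.
T = 400
spread = rng.normal(1.0, 1.5, T)               # plays the role of YS_{t-12}
prob = 1 / (1 + np.exp(2.0 * spread))          # P(recession) falls as spread rises
y = (rng.uniform(size=T) < prob).astype(int)

# Step 1: fit the linear probability model by OLS.
X = np.column_stack([np.ones(T), spread])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = b0 + b1 * spread

# Step 2: choose the threshold maximizing the Peirce skill score
# PSS = hit rate - false alarm rate, over a grid of candidate cutoffs.
def peirce(y, yhat):
    hit = yhat[y == 1].mean() if (y == 1).any() else 0.0
    false_alarm = yhat[y == 0].mean() if (y == 0).any() else 0.0
    return hit - false_alarm

grid = np.linspace(fitted.min(), fitted.max(), 200)
scores = [peirce(y, (fitted >= c).astype(int)) for c in grid]
c_star = grid[int(np.argmax(scores))]
forecast = (fitted >= c_star).astype(int)      # recession called iff fitted >= c*
```

Replacing the OLS step by the Elliott and Lieli estimator would turn this into the one-step approach, which picks the parameters and the implied cutoff jointly.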
4.4 Classification models in statistical learning
Supervised statistical learning theory is mainly concerned with predicting the value of a response vari-
able using a few input variables (or covariates), which is similar to forecasting models in econometrics.
Many binary point prediction models have been proposed in the supervised learning literature, where
they are called binary classification models. This section serves as a brief introduction to a few
classical models among them.
4.4.1 Linear discriminant analysis
As stated above, an optimal threshold is needed to transform the conditional probability P(Y = 1|X)
into a 0/1 point prediction. The most widely used threshold is 1/2, which corresponds to a symmetric loss function, as given by Mason (2003).

Footnote 12: When we implemented the in-sample and out-of-sample evaluation exercises for the linear specification (77), we found that the linear model fitted by OLS performed worse than its MUE counterparts.

Given this threshold, classification simply involves comparison
of two conditional probabilities, that is, P(Y = 1|X) and P(Y = 0|X), and the event with larger proba-
bility is predicted accordingly. Linear discriminant analysis follows this rule but obtains P(Y = 1|X)
in a different way than the usual regression-based approach. The analysis assumes that we know the
marginal probability P(Y = 1) and the conditional density f (X |Y ). By Bayes’ rule, the conditional
probability is given by
P(Y = 1|X) = P(Y = 1) f(X|Y = 1) / [P(Y = 1) f(X|Y = 1) + P(Y = 0) f(X|Y = 0)].  (78)
To simplify the analysis, hereafter a parametric assumption is imposed on the conditional density
f (X |Y ). The usual practice, when X is continuous, is to assume both f (X |Y = 1) and f (X |Y = 0) are
multivariate normal with different means but a common covariance matrix Σ, that is,
f(x|Y = j) = [1 / ((2π)^{k/2} |Σ|^{1/2})] exp(−(1/2)(x − µ_j) Σ^{−1} (x − µ_j)′),  (79)
where j = 1 or 0. Under this assumption, the log odds in terms of the conditional probabilities is
ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = ln [f(x|Y = 1) / f(x|Y = 0)] + ln [P(Y = 1) / P(Y = 0)]

= ln [P(Y = 1) / P(Y = 0)] − (1/2)(µ_1 + µ_0) Σ^{−1} (µ_1 − µ_0)′ + x Σ^{−1} (µ_1 − µ_0)′,  (80)
which is an equation linear in x. The assumption of equal covariance matrices causes the normalization
factors to cancel, as well as the quadratic parts in the exponents. The previous classification rule amounts to
determining whether (80) is positive for a given x. The decision boundary that is given by setting (80)
to be zero is a hyperplane in Rk, dividing the whole space into two disjoint subsets. For any given x
in Rk, it must exclusively fall into one subset; and the classification follows in a straightforward way.
To make this rule work in practice, four blocks of parameters have to be estimated using samples:
P(Y = 1), µ1, µ0 and Σ. This can be done easily by using their sample counterparts. To be specific,
P̂(Y = 1) = T_1/T, µ̂_j = ∑_j X_t / T_j for j = 0, 1, and Σ̂ = (∑_1 (X_t − µ̂_1)′(X_t − µ̂_1) + ∑_0 (X_t − µ̂_0)′(X_t − µ̂_0)) / (T − 2), where P̂ is the estimate of P, T_1 is the number of observations with Y_t = 1, and ∑_j is the summation over those observations with Y_t = j. Substituting parameters with their estimates in
the decision boundary yields the empirical classification rule. It is called linear discriminant analysis
simply because the resulting decision boundary is a hyperplane in the input vector space, which again
is the consequence of the imposed assumptions. Hastie et al. (2001) derived a decision boundary
described by a quadratic equation under the normality assumption with distinct covariance matrices,
that is, Σ1 6= Σ0. They also extended this simplest case by considering other distributional assumptions
leading to more complex decision boundaries.
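The plug-in rule described above can be sketched as follows, under the stated normality assumptions; the simulated data, class means, and sample sizes are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes with a common covariance (the LDA assumption).
mu1, mu0 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
X1 = rng.multivariate_normal(mu1, np.eye(2), 150)
X0 = rng.multivariate_normal(mu0, np.eye(2), 150)
X = np.vstack([X1, X0])
y = np.array([1] * 150 + [0] * 150)

# Sample counterparts of P(Y = 1), mu_1, mu_0 and the pooled covariance.
T, T1 = len(y), int(y.sum())
p1 = T1 / T
m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
S = ((X[y == 1] - m1).T @ (X[y == 1] - m1)
     + (X[y == 0] - m0).T @ (X[y == 0] - m0)) / (T - 2)
S_inv = np.linalg.inv(S)

# The log odds in (80): an intercept plus a term linear in x.
const = np.log(p1 / (1 - p1)) - 0.5 * (m1 + m0) @ S_inv @ (m1 - m0)
slope = S_inv @ (m1 - m0)

def predict(x):
    # Predict Y = 1 iff the estimated log odds at x is positive.
    return int(const + x @ slope > 0)
```

The pair (const, slope) is exactly the plug-in version of (82) and (83) below, so the same rule could equivalently be read as a fitted logistic-type boundary.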
Another point worth mentioning is that the log odds generated by linear discriminant analysis
takes the form of a logistic specification. Specifically, the linear logistic model by construction has
linear logit
ln [P(Y = 1|X = x) / P(Y = 0|X = x)] = β_0 + x β_1,  (81)
which is akin to (80) if
β_0 ≡ ln [P(Y = 1) / P(Y = 0)] − (1/2)(µ_1 + µ_0) Σ^{−1} (µ_1 − µ_0)′  (82)
and
β_1 ≡ Σ^{−1} (µ_1 − µ_0)′.  (83)
Therefore, the assumptions in linear discriminant analysis induce the logistic regression model, which
can be estimated by maximum likelihood to get estimates for β_0 and β_1. In this sense, both models
generate the same classification rules asymptotically, in spite of the difference in their estimation
methods. However, the joint distribution of Y and X is used in discriminant analysis, whereas logistic
regression only uses the conditional distribution of Y given X , leaving the marginal distribution of X
not explicitly specified. As a consequence, linear discriminant analysis, by relying on the additional
model assumptions, is more efficient but less robust when the assumed conditional density of X given
Y is not true. In the situation where some of the components of X are discrete, logistic regression is a
safer, more robust choice.
Maddala (1983) followed an alternative way to derive the linear discriminant boundary, which
provides a deep insight into what discriminant analysis actually does. Suppose that only a linear
boundary is considered for simplicity. Without loss of generality, denote it by Xλ = 0, and Y = 1
is predicted if and only if Xλ ≥ 0. What discriminant analysis does is to find the optimal value
for λ according to a certain criterion. Fisher initially posed this problem as finding the λ such that
the between-class variance is maximized relative to the within-class variance. The between-class
variance measures how far apart the means of Xλ are for the two classes (Y = 1 and
Y = 0, where Y is the binary point prediction), and it should be maximized subject to the constraint
that the variance of Xλ within each class is fixed. This does make intuitive sense in the context of
classification. If the two means are close together or the two distributions of Xλ overlap to a large
extent, it is hard to distinguish one class from the other. In other words, a large proportion of observations
could be misclassified. Conversely, even if the means of the two distributions are far away from each
other, the classes cannot be sharply distinguished unless both distributions have small variances. The optimal
λ solving Fisher's problem gives the best linear decision boundary, whose analytical form is given
in Maddala (1983). Mardia et al. (1979) offered a concise discussion of linear discriminant analysis.
Michie et al. (1994) compared a large number of popular classifiers on benchmark datasets. Linear
discriminant analysis is a simple classification model with a linear decision boundary, and subsequent
developments have extended it in various directions; see Hastie et al. (2001) for details.
4.4.2 Classification trees
As with discriminant analysis, methods based on classification trees partition the input vector space
into a number of subsets on which 0/1 binary point predictions are made. Consider the case with
two input variables: X1 and X2, both of which take values in the unit interval. Figure 16 presents a
particular partition of the unit box.
Figure 16: Partition of the unit box
First, subset R1 is obtained if X1 < t1. For the remaining part, check whether X2 < t2; if so, we get
R2. Otherwise, check whether X1 < t3; if so, we get R3. Otherwise, check whether X2 < t4; if so, we
get R4. Otherwise, we take what remains as R5. This process can be represented by a classification
tree in Figure 17.
Figure 17: The classification tree associated with Figure 16
Each node on the tree represents a stage in the partition; and the number of final subsets equals that
of terminal nodes. The branch connecting two nodes gives the condition under which the upper node
transits to the lower one. For example, condition X1 < t1 must be satisfied in order to get R1 from
the initial node. The tree shown in Figure 17 can be expanded further to incorporate more terminal
nodes when the partition ends up with more final subsets. In general, suppose we have M subsets R_1,
R_2, ..., R_M, to each of which we assign a probability denoted by p_j for j = 1, ..., M.
Using the optimal threshold 1/2, Y = 1 should be predicted on subset j if and only if p_j ≥ 0.5. Hence,
the classification boils down to how to divide the input vector space into disjoint subsets as shown
in Figure 16 (or how to generate a classification tree like the one in Figure 17), and how to assign
probabilities to them.
To introduce an algorithm to grow a classification tree, we define X as a k-dimensional input
vector, with X_j as its jth element, and

R_1(j, s) ≡ {X | X_j ≤ s}  and  R_2(j, s) ≡ {X | X_j > s}.  (84)

Given a sample {Y_t, X_t}, the optimal splitting variable j and split point s solve the following problem:

min_{j,s} [ min_{c_1} ∑_{X_t ∈ R_1(j,s)} (Y_t − c_1)² + min_{c_2} ∑_{X_t ∈ R_2(j,s)} (Y_t − c_2)² ].  (85)
71
For any fixed j and s, the optimal c_i (for i = 1 or 2) that minimizes the sum of squared errors is the
sample proportion of Y_t = 1 within the class {X_t : X_t ∈ R_i(j, s)}. Computation of the optimal j and
s can be carried out in most statistical packages without much difficulty. Having found the best split,
the whole input space is divided into two subsets according to whether X_{j*} ≤ s*, where j* and s* are
the optimal solutions to (85). The whole procedure is then iterated on each subset to get finer subsets
which can be partitioned further as before. In principle, this process can be repeated infinitely many
times, but we have to stop it when a certain criterion is met. To this end, we define the cost complexity
criterion function
C_α(T) ≡ ∑_{m=1}^{|T|} ∑_{X_t ∈ R_m} (Y_t − Ŷ_m)² + α|T|,  (86)

where T is a subtree of a very large initial tree T_0, |T| is the number of terminal nodes of T, each of which is indexed by R_m for m = 1, ..., |T|, and Ŷ_m is the sample proportion of Y_t = 1 within subset R_m.
The criterion is a function of α, a nonnegative tuning parameter to be specified by the user. The
optimal subtree T, which depends on α, should minimize C_α(T). If α = 0, the optimal T is as large
as possible and equals the upper bound T_0. Conversely, an infinitely large α forces T to be very small.
This result is intuitive. When the partition gets finer and finer, fewer and fewer observations fall
into each subset. In the limit, each subset would contain at most one observation, so that Ŷ_m = Y_t for
each m, and the first term of Cα(T ) would vanish. This also shows that without any other constraint,
the optimal partition rule tends to overfit the in-sample data. Such a rule is unstable and inaccurate in the
sense that it is sensitive to even a slight change in the sample. The optimal subtree should balance
the tradeoff between stability and in-sample goodness of fit. This balance is controlled by parameter
α. Breiman et al. (1984) and Ripley (1996) outlined details to obtain the optimal subtree for a given α
that is determined by cross-validation.
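The split search in (85) can be sketched as an exhaustive scan over variables and observed split points, with the class proportion serving as the optimal constant on each side. This is an illustrative implementation, not an excerpt from any particular package.

```python
import numpy as np

def best_split(X, y):
    """Solve (85): scan every variable j and every observed split point s,
    using the sample proportion of Y = 1 as the optimal constant per side."""
    T, k = X.shape
    best = (None, None, np.inf)
    for j in range(k):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = 0.0
            for part in (left, right):
                if len(part):
                    sse += ((part - part.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best  # (j*, s*, minimized sum of squared errors)
```

Applying the function recursively to each of the two resulting subsets grows the tree; the cost complexity criterion (86) then decides how far back to prune it.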
Hastie et al. (2001) recommended using other measures of goodness of fit in the complexity cri-
terion function instead of the sample mean squared error in (86) for binary classification purposes,
including the misclassification error, the Gini index, and cross-entropy. They compared them in terms
of their sensitivity to changes in the node probabilities. They also discussed cases with categorical
predictors and asymmetric loss functions. For an early treatment of classification trees, see Morgan
and Sonquist (1963). Breiman et al. (1984) and Quinlan (1992) contain a general treatment of this
topic.
4.4.3 Neural networks
A neural network is a highly nonlinear supervised learning model, which seeks to approximate
the regression function by combining a k-dimensional input vector in a hierarchical way
via multiple hidden layers. To outline the basic idea, only a neural network with a single hidden layer is
considered here.
As before, Y is a binary response, and X is a k-dimensional input vector to be used for classification. Let Z_1, ..., Z_M be unobserved hidden units that depend on X through Z_m = σ(α_{0m} + X α_m), for m = 1, ..., M, where σ(·) is a known link function. A typical choice is σ(v) = 1/(1 + e^{−v}). Then the
neural network, with Z_1, ..., Z_M as the only hidden layer, can be written as

T_k = β_{0k} + Z β_k,  k = 0, 1,
P(Y = 1|X) = g(T),  (87)
where T = (T_0, T_1), Z = (Z_1, ..., Z_M), P(Y = 1|X) is the conditional probability of Y = 1 given X, and
g is a known function of two arguments. For a binary response, g(T) = e^{T_1}/(e^{T_0} + e^{T_1}) is often used. The
above model structure is depicted in Figure 18.
Figure 18: Neural networks with a single hidden layer
In general, there may be more than one hidden layer, and so Y will depend on X in a more
complex way. The model therefore allows for enhanced specification flexibility and reduced risk of
misspecification. Note that there are M(k+ 1)+ 2(M + 1) parameters in this model that need to be
estimated, and some of them may not be identified when both M and k are large. In other words,
the specification is too rich to be identified. For this reason, instead of fitting the full model, only a
nested model, with some parameters fixed, is estimated given a sample Yt ,Xt. Despite its complex
structure, it is still a parametric model because the functional forms of g and σ are known a priori
and only a finite set of parameters are estimated. The usual nonlinear least squares, or maximum
likelihood, method is used to get a consistent estimator. For the former, the objective function that
should be minimized is the forecast mean squared error
R(θ) = ∑_{t=1}^{T} (Y_t − P(Y = 1|X_t))²,  (88)
whereas the likelihood function for the latter is
R(θ) = ∏_{t=1}^{T} P(Y = 1|X_t)^{Y_t} (1 − P(Y = 1|X_t))^{1−Y_t},  (89)
where θ is the vector of all parameters. The classification rule is that Y = 1 is predicted if, and only
if, the fitted probability P(Y = 1|X) is no less than 0.5. The global solutions of the above
problems are often undesirable in that they tend to overfit the model in-sample but perform poorly
out-of-sample. So, one can obtain a suboptimal solution either directly through a penalty term added
in any of the above objective functions, or indirectly by early stopping. For computational details
on neural networks, see Hastie et al. (2001), Parker (1985), and Rumelhart et al. (1986). A general
introduction of neural networks is given by Ripley (1996), Hertz et al. (1991), and Bishop (1995). For
a useful review of neural networks from an econometric point of view, see Kuan and White (1994).
Refenes and White (1998), Stock and Watson (1999), Abu-Mostafa et al. (2001), Marcellino (2004)
and Terasvirta et al. (2005) applied neural networks in time series econometrics and forecasting.
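A minimal sketch of a single-hidden-layer network fit by gradient descent on the squared-error objective (88) follows; with two output scores the softmax g(T) reduces to a sigmoid of T_1 − T_0, which the code exploits. The toy data and all tuning choices (M, learning rate, iteration count) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# Toy data: one input variable and a binary response (hypothetical).
T = 400
X = rng.uniform(-2, 2, (T, 1))
y = (X[:, 0] > 0.3).astype(float)

# Single hidden layer with M units.
M = 4
A, a0 = rng.normal(0, 1, (1, M)), np.zeros(M)    # hidden-layer parameters (alpha)
B, b0 = rng.normal(0, 0.1, (M, 2)), np.zeros(2)  # output parameters (beta)

def forward(X):
    Z = sigmoid(X @ A + a0)          # hidden units Z_m = sigma(alpha_0m + X alpha_m)
    T0, T1 = (Z @ B + b0).T          # the two scores in (87)
    return Z, sigmoid(T1 - T0)       # g(T) = e^{T1}/(e^{T0} + e^{T1}) = P(Y = 1|X)

# Gradient descent on the forecast mean squared error, as in (88).
lr = 1.0
for _ in range(3000):
    Z, p = forward(X)
    d = 2.0 * (p - y) * p * (1 - p) / T          # dLoss/dT1 (and -dLoss/dT0)
    gT = np.column_stack([-d, d])                # gradients w.r.t. (T0, T1)
    gpre = (gT @ B.T) * Z * (1 - Z)              # back through the hidden layer
    B -= lr * (Z.T @ gT);  b0 -= lr * gT.sum(0)
    A -= lr * (X.T @ gpre); a0 -= lr * gpre.sum(0)
```

In practice one would add a penalty term or stop early rather than run the loop to a global optimum, for exactly the overfitting reason discussed above.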
5 Improving Binary Predictions
Until now, all binary probability and point predictions have been constructed from a single training
sample Yt ,Xt, and the resulting predictions are thus subject to sampling variability. We say a binary
probability/point prediction Q(x) evaluated at x is unstable if its value is sensitive to even a slight
change of the training sample from which it is derived. The lack of stability is especially severe in
cases of small training samples and highly nonlinear forecasting models. If Q(x) varies a lot, it is
hardly reliable as one may get a completely different predicted value when a different training sample
is used. In other words, the variance of the forecast error would be extremely large for an unstable
prediction. To improve forecast performance and reduce the uncertainty associated with an unstable
binary forecast, combining multiple individual forecasts for the same event was suggested; see Bates
and Granger (1969), Deutsch et al. (1994), Granger and Jeon (2004), Stock and Watson (1999, 2005),
Yang (2004), and Timmermann (2006). The motivation for forecast combination is analogous to
the use of the sample mean instead of a single observation as an unbiased estimator of the population
mean: averaging reduces the variance without affecting unbiasedness. Let us consider using the
usual criterion of mean squared error for forecast evaluation. Denote an individual binary forecast by
Q(x,L) where x is the evaluation point of interest and L is the training sample Yt ,Xt (for t = 1, ...,T )
by which Q(x,L) is constructed. The mean squared error of an individual forecast is
e_l ≡ E_L E_{Y,X} (Y − Q(X, L))².  (90)
Suppose we can draw N random samples {L_i}, each of size T, from the joint distribution
f(Y, X). Then the combined forecast Q_A(x) ≡ (1/N) ∑_{i=1}^{N} Q(x, L_i) approaches the population average
when N is very large, that is,

Q_A(x) ≈ E_L Q(x, L).  (91)
The mean squared error associated with this combined forecast is thus
e_a ≡ E_{Y,X} (Y − Q_A(X))².  (92)
Now using Jensen’s inequality, we have
e_l = E_{Y,X} Y² − 2 E_{Y,X} [Y Q_A(X)] + E_{Y,X} E_L (Q(X, L))²

≥ E_{Y,X} Y² − 2 E_{Y,X} [Y Q_A(X)] + E_{Y,X} (Q_A(X))²

= E_{Y,X} (Y − Q_A(X))²

= e_a.  (93)

Thus, the combined forecast has a lower mean squared error than any individual forecast, and the
magnitude of improvement depends on E_L (Q(X, L))² − (E_L Q(X, L))² = Var_L(Q(X, L)), which is the
variance of the individual forecasts due to the uncertainty of the training sample and measures forecast
stability. Substantial instability leaves more space for improvement induced by forecast combination.
Generally speaking, small training samples and high nonlinearity in forecasting models are two main
sources of instability. Forecast combination can help a lot under these circumstances. Section 5.1 deals
with the case where multiple binary forecasts for the same event are available and the combination to
be carried out is straightforward. The bootstrap aggregating technique is followed when we only have
a single training set.
5.1 Combining binary predictions
Sometimes more than one binary prediction is available for the same target. A typical example is the
SPF probability forecasts of real GDP declines where approximately 40−50 individual forecasters is-
sue their subjective probability judgements in each survey about real GDP declines in the current and
each of the next four quarters. In these instances, individual forecasters might give diverse probability
assessments of a future event but none of them makes effective use of all available information. Be-
sides, the forecasts are likely to fluctuate over time and across individuals. Stimulated by concerns of
instability, a number of combination methods have been suggested. However, the combination meth-
ods should not be arbitrary and simplistic. Cases of combined forecasts that have performed worse
than individual forecasts have been documented in the literature; see Ranjan and Gneiting (2010) for
a good example. In this light, an effort to search for the optimal combination method is desired. Here,
the main focus is to combine probability forecasts instead of point forecasts. As for the latter, there
are already a large number of articles in computer science under the title of multiple classifier systems
(MCS), see Kuncheva (2004) for a textbook treatment.
The optimal combination of probability forecasts is discussed in a probabilistic context where the
joint distribution of observation and multiple individual forecasts is
f (Y,P1,P2, ...,PM), (94)
where Pm for m = 1, ...,M is the mth individual probability forecast of the binary event Y . The deriva-
tion of the optimal combination in the framework of the joint distribution unifies various separate
combination techniques in that it allows for more general assumptions on observations and forecasts.
For example, the Pm may be contemporaneously correlated with each other, which is very common as
individual forecasts are often based on similar information sets. Series correlation of observations and
forecasts is also allowed. Moreover, individual forecasts may come from either econometric models,
subjective judgements, or both. As shown in Section 3, there are many competing criteria or scores to
measure the skill or accuracy for probability forecasts. As a consequence, one may expect that optimal
combination rules may rely on adopted scores and thereby no universal combination rule will exist.
Fortunately, the situation is not as hopeless as it seems, as long as the score is proper. Denote the
proper score by S(Y,P) which is a function of the realized event and the probability forecasts, and the
conditional probability of Y = 1, given all individual forecasts, by P ≡ P(Y = 1|P1,P2, ...,PM). Ran-
jan and Gneiting (2010) proved that P, as a function of individual forecasts, is the optimal combined
forecast in the sense that its expected score is the smallest among all candidates provided the score is
proper. To see this, note that the expected score of P is given by
E(S(Y, P)) = E(E(S(Y, P)|P_1, P_2, ..., P_M))

= E(P S(1, P) + (1 − P) S(0, P))

≤ E(P S(1, f(P_1, P_2, ..., P_M)) + (1 − P) S(0, f(P_1, P_2, ..., P_M)))

= E(E(S(Y, f(P_1, P_2, ..., P_M))|P_1, P_2, ..., P_M))

= E(S(Y, f(P_1, P_2, ..., P_M))),  (95)
where f (P1,P2, ...,PM) is any measurable function of (P1,P2, ...,PM), an alternative combined forecast.
The inequality above uses the fact that S(Y,P) is a negatively oriented proper scoring rule. This result
says that taking P as the combined forecast always wins, which is true irrespective of the possible de-
pendence structures. A specific combination rule, such as the widely used linear opinion pool (OLP)
in which f(P_1, P_2, ..., P_M) = ∑_{m=1}^{M} w_m P_m and w_m is a nonnegative weight satisfying ∑_{m=1}^{M} w_m = 1,13
performs well only if it is close to the optimal P. A large number of specific rules have been devel-
oped, each of which is valid under its own assumptions. As a result, a specific rule may succeed if its
assumptions roughly hold in practice, but fail when the data generating process violates these assump-
tions. For example, the rule ignoring dependence structure among individual forecasts may perform
poorly if they are highly correlated with each other. For details of various specific combination rules,
see Genest and Zidek (1986), Clemen (1989), Diebold and Lopez (1997), Graham (1996), Wallsten
et al. (1997), Clemen and Winkler (1986, 1999, 2007), Timmermann (2006), and Primo et al. (2009).
Footnote 13: That is, f(P_1, P_2, ..., P_M) is a convex combination of the individual forecasts. Note that the linearity of P is permissible because each P_m lies in the unit interval, and so does any convex combination of them.
In general, the functional form of this conditional probability P is unknown and needs to be esti-
mated from the sample Yt ,P1t ,P2t , ...,PMt for t = 1, ...,T , which is the usual practice in econometrics,
by noting that P is nothing more than a conditional probability. All methods covered in Section 2 will
work here. The most robust way of estimation is nonparametric regression, even though it is subject to
the “curse of dimensionality” when a large number of individual forecasts need to be combined. Ran-
jan and Gneiting (2010) recommended the beta-transformed linear opinion pool (BLP) to reduce the
estimation dimension, yet reserve certain flexibility in the specification. BLP is akin to the parametric
model (2) with linear index and beta distribution as its link function, that is,
P(Y = 1|P_1, P_2, ..., P_M) = B_{α,β}(∑_{m=1}^{M} w_m P_m),  (96)
where Bα,β(·) is the distribution function of the beta density with two parameters α > 0 and β > 0.
The number of unknown parameters including α and β is M + 2. They showed that BLP reduces to
OLP when α = β = 1. All parameters can be estimated by maximum likelihood given a sample, and
validity of OLP can thus be verified by a likelihood ratio test. Ranjan and Gneiting examined the
properties of BLP, compared it with OLP and each individual forecast in terms of their calibration
and refinement. They found that correctly specified BLP, necessarily calibrated by construction, is a
recalibration of OLP, which may not be calibrated even if the individual forecasts are. The empirical
version of BLP, based on a sample, performs comparably to the optimal P. Using SPF
forecasts, Lahiri et al. (2012b) find that the procedure works reasonably well in practice.
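A sketch of the BLP specification (96) for M = 2 forecasters follows, using numerical integration for the beta distribution function and a crude grid search standing in for full maximum likelihood; the data and grids are hypothetical illustrations.

```python
import numpy as np
from math import lgamma, exp

def beta_cdf(x, a, b, n=4000):
    # distribution function of the beta density, by trapezoidal integration
    x = min(max(x, 1e-9), 1 - 1e-9)
    t = np.linspace(1e-9, x, n)
    dens = t ** (a - 1) * (1 - t) ** (b - 1)
    integral = (dens[:-1] + dens[1:]).sum() * (t[1] - t[0]) / 2
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)) * integral

def blp(p1, p2, w, a, b):
    # equation (96) with M = 2: a beta-transformed convex combination
    return beta_cdf(w * p1 + (1 - w) * p2, a, b)

def neg_loglik(theta, y, p1, p2):
    # negative Bernoulli log-likelihood of the combined forecast
    w, a, b = theta
    ll = 0.0
    for yi, q1, q2 in zip(y, p1, p2):
        p = min(max(blp(q1, q2, w, a, b), 1e-9), 1 - 1e-9)
        ll += yi * np.log(p) + (1 - yi) * np.log(1 - p)
    return -ll

# Hypothetical outcomes and two individual probability forecasts.
y  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p1 = np.array([.8, .3, .7, .9, .2, .4, .6, .1])
p2 = np.array([.7, .2, .8, .8, .3, .3, .7, .2])

# Crude grid search in place of maximum likelihood estimation.
grids = [(w, a, b) for w in (0.25, 0.5, 0.75)
                   for a in (0.5, 1.0, 2.0)
                   for b in (0.5, 1.0, 2.0)]
w_hat, a_hat, b_hat = min(grids, key=lambda th: neg_loglik(th, y, p1, p2))
```

Setting α = β = 1 makes the beta distribution function the identity, so the combination collapses to OLP; comparing the likelihood at the fitted (α̂, β̂) with the likelihood at (1, 1) is the basis of the likelihood ratio test mentioned above.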
5.2 Bootstrap aggregating
Bootstrap aggregating, or bagging, is a forecast combination approach proposed by Breiman (1996)
in the machine learning literature, when only a single training sample is available. The basic intuition
is to average the individual predictions generated by each bootstrap sample to reduce the variance of
the unbagged prediction without affecting its bias. Like the usual forecast combination approach, bagging
is useful only if the sample size is not large and the forecasting model is highly nonlinear. Typical
examples where forecasts can be improved significantly by bagging include classification trees and
neural networks. But bagging does not seem to work well in linear discriminant analysis and k-nearest
neighbor methods; see Friedman and Hall (2007), Buja and Stuetzle (2006), and Buhlmann and Yu
(2002) for further discussion of this issue. A striking result is that bagged predictors can perform
even worse than unbagged predictors in terms of certain criteria, as shown in Hastie et al. (2001).
Though it is not useful for all problems at hand, its ability to stabilize a binary classifier has been
supported in the machine learning literature, as documented by Bauer and Kohavi (1999), Kuncheva
and Whitaker (2003), and Evgeniou et al. (2004). Lee and Yang (2006) demonstrated that bagged
predictors outperform unbagged predictors even under asymmetric loss functions, instead of the usual
mean squared error. They also established the conditions under which bagging is successful.
Bootstrap aggregating starts by resampling {Y_t, X_t} via the bootstrap to get B bootstrap samples. The
binary forecasts, with fixed evaluation point x, are then constructed from each bootstrap sample to get
a set {Q(x, L_i)}, where L_i is the ith bootstrap sample. The bagged predictor is calculated as the
weighted average of the Q(x, L_i), where
Q_b(x, L) ≡ (1/B) ∑_{i=1}^{B} w_i Q(x, L_i)  (97)
and w_i is the nonnegative weight attached to the ith bootstrap sample L_i, satisfying the usual constraint ∑_{i=1}^{B} w_i = 1. The bagged predictor Q_b(x, L) depends on the original sample L, as resampling
is based on the empirical distribution of L. There are a few points to be clarified for its implemen-
tation. First, appropriate bootstrap methods should be used depending on the context. For example,
nonparametric bootstrap is the natural choice for independent data, and parametric bootstrap is more
efficient when the data generating process of L is known up to a finite dimensional parameter vector.
For time series or other dependent data, block bootstrap can provide a sound simulation sample, as
illustrated by Lee and Yang (2006). Second, for probability prediction, the predictor Qb(x,L) is di-
rectly usable as its value must be between zero and one if each Q(x,Li) is. However, this is not the
case for binary point prediction, as Qb(x,L) is not 0/1-valued even if each Q(x,Li) is. In this context,
a usual rule is the so-called majority voting, where the bagged predictor predicts whichever outcome is predicted more
often in {Q(x, L_i)}. This is equivalent to taking 1/2 as the threshold, that is, using I(Q_b(x, L) ≥ 1/2) as
the bagged predictor.14 Third, the BLP combination method in Section 5.1 can be used here, provided
its parameters can be estimated from bootstrap samples. Finally, the choice of B depends on the orig-
inal sample size, computational capacity and model structure in a complex way. Lee and Yang (2006)
showed that B = 50 is more than sufficient to get a stable bagged predictor, and even B = 20 is good
enough in some cases in their empirical example.

Footnote 14: Hastie et al. (2001) suggested another way to make a binary point prediction when a probability prediction at evaluation point x is available. The bagged probability predictor is derived by (97) and then transformed to a 0/1 value according to the threshold. They argued that, compared to the first procedure, this approach ends up with a bagged predictor having lower variance, especially for small B.

For other applications of bootstrap aggregating in
econometrics, interested readers are referred to Kitamura (2001), Inoue and Kilian (2008), and Stock
and Watson (2005).
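A bare-bones sketch of bagging with equal weights and majority voting follows, using a simple threshold classifier as the unstable base learner; the data-generating process and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(x, y):
    """Base binary classifier: pick the cutoff c minimizing in-sample
    misclassification of the rule I(x >= c)."""
    cands = np.unique(x)
    errs = [np.mean((x >= c).astype(int) != y) for c in cands]
    return cands[int(np.argmin(errs))]

# Training sample L (hypothetical noisy threshold event).
T = 60
x = rng.normal(0, 1, T)
y = (x + rng.normal(0, 0.8, T) > 0).astype(int)

# Bagging: B nonparametric bootstrap resamples, equal weights, and
# majority voting, i.e., the rule I(Q_b >= 1/2).
B = 50
def bagged_predict(x_new):
    votes = []
    for _ in range(B):
        idx = rng.integers(0, T, T)        # nonparametric bootstrap draw
        c = fit_stump(x[idx], y[idx])
        votes.append(int(x_new >= c))
    return int(np.mean(votes) >= 0.5)
```

Because the stump's estimated cutoff jumps around with each resample, the individual predictions near the boundary are unstable, which is precisely the situation in which averaging the votes helps.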
6 Conclusion
In this chapter, we discussed the specification, estimation and evaluation of binary response models in
a unified framework from the standpoint of forecasting. In a stochastic setting, generating the probabil-
ity of the occurrence of an event with binary outcomes boils down to the specification and estimation
of the conditional expectation or the regression function. In this process, the conventional nonlinear
econometric modeling approaches play a dominant role. Specification designed for the limited range
of the response distinguishes models for binary dependent variables from those for continuous pre-
dictands. Therefore, the validity of transformations like the probit link function becomes an issue in
modeling binary events for forecasting.
Two types of forecasts for binary events are distinguished in this chapter: probability forecasts
and point forecasts. There is no universal answer as to which one is better. The value score analysis
in section 3.1.2 justifies the use of probability forecasts, as they allow for heterogeneity in the loss
functions of the end users in decision making. However, if the working model is misspecified, the
point forecast based on a one-step approach that integrates estimation and forecasting may be superior,
provided a loss function has been properly chosen. Moreover, in many regulatory environments, there
are mandates for the issuance of only binary forecasts.
The joint distribution of forecasts and actuals embodies the basic ingredients required for the
evaluation of forecast skill. All existing scoring rules and graphical approaches essentially reflect
certain attributes of this joint distribution. Since no single evaluation tool provides a complete measure
of skill for forecasting binary events, the use of a battery of such measures is recommended to assess
the skill more comprehensively. As a general rule, those not influenced by the marginal information
regarding the actuals are preferred. Many examples fall into this category, such as the odds ratio, Peirce
skill score, or ROC. Compared with those commonly used in practice, the tools within this category
are more likely to capture the true forecast skill. In circumstances where the event under consideration
is rare or relatively uncommon, the marginal probability of the occurrence of the event may confound
the true skill if it is not isolated from the score. The usual methods for assessing the goodness of fit
of a binary regression model, such as the pseudo R2 or the percentage of correct predictions, do not
adjust for the asymmetry of the response variable. We have also emphasized the need for reporting
sampling errors of these statistics. In this regard, there is substantial room for improvement in current
econometric practice.
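To fix ideas, two of the base-rate-insensitive measures mentioned above can be computed directly from the 2×2 contingency table of forecasts against actuals. The following minimal sketch (in Python; the cell labels and function names are ours, not standard notation) illustrates the calculation:

```python
# Hypothetical 2x2 contingency table of binary forecasts vs. actuals:
#                 actual = 1   actual = 0
# forecast = 1        a            b        (hits, false alarms)
# forecast = 0        c            d        (misses, correct rejections)

def peirce_skill_score(a, b, c, d):
    """Hit rate minus false-alarm rate; unchanged if the marginal
    frequency of the event (a + c) / n is varied while the conditional
    distributions are held fixed."""
    hit_rate = a / (a + c)
    false_alarm_rate = b / (b + d)
    return hit_rate - false_alarm_rate

def odds_ratio(a, b, c, d):
    """(a * d) / (b * c); also insensitive to the base rate of the event."""
    return (a * d) / (b * c)
```

For a balanced table with 40 hits, 10 false alarms, 10 misses, and 40 correct rejections, the Peirce skill score is 0.8 − 0.2 = 0.6 and the odds ratio is 16, regardless of how common the event is in the sample.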
Given that we have introduced a wide range of models and methods for forecasting binary out-
comes, a natural question is which ones should be used in a particular situation. It appears that complex
models that fit better in-sample often tend not to do well out-of-sample. The three classification models
in section 4.4 illustrate this point well. Simple models, such as discriminant analysis with a
linear boundary or neural networks with a single hidden layer, often perform very well in out-of-sample
forecasting exercises. This also explains why forecast combination usually works when
the individual forecasts come from complex nonlinear models. When multiple forecasts of the same
binary event are available, the skill performance of any single forecast can potentially be improved
when it is combined with other individual forecasts efficiently. Here again, the optimal combination
scheme should be derived from the joint distribution of forecasts and actuals. When only a single
training sample is available and the individual forecasts based on it are highly unstable, bagging is an
attractive way to reduce the forecast variance and improve the forecast skill.
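As an illustration of the bagging idea of Breiman (1996), the sketch below (in Python; the stump-based classifier and all names are our own simplification, chosen only to keep the example self-contained) refits an unstable base classifier on bootstrap resamples of a single training sample and aggregates the resulting 0/1 forecasts by majority vote:

```python
import random

def fit_stump(xs, ys):
    """Pick the cutoff c minimizing in-sample misclassifications for the
    rule 'forecast 1 if x > c' (a one-split classification tree)."""
    best_c, best_err = None, float("inf")
    # min(xs) - 1 is the degenerate cutoff 'always forecast 1'
    for c in [min(xs) - 1] + sorted(set(xs)):
        err = sum((x > c) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_c, best_err = c, err
    return best_c

def bagged_forecast(xs, ys, x_new, n_boot=200, seed=42):
    """Majority vote over stumps fit on bootstrap resamples of the
    training data -- the bootstrap aggregation ('bagging') scheme."""
    rng = random.Random(seed)
    n, votes = len(xs), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes += int(x_new > c)
    return int(votes > n_boot / 2)
```

Because each bootstrap cutoff varies with the resample, averaging the votes smooths the hard threshold rule and reduces the variance of the final 0/1 forecast.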
A forecast with extremely low skill is unlikely to satisfy the needs of any forecast user; only
forecasts with at least a moderate amount of skill can be of value in guiding the decision-making
process. Conversely, a forecast that is skillful by one criterion may be of no use at all in another
decision-making context. Knowing the joint distribution
is not enough for the purpose of evaluating the usefulness of a forecast from the perspective of a user
– the loss function connecting forecasts and realizations needs to be considered as well. The binary
point prediction discussed in Section 4 is a prime example where a 0/1 forecast is made by implicitly or
explicitly relying on a threshold value that is determined by a presumed loss function. In some specific
contexts, certain skill scores are directly linked to the value of the end user. One such example is that,
under certain circumstances, the highest achievable value score is the Peirce skill score, as shown in
section 3.2.3. Without any knowledge about the joint distribution of forecasts and realizations, we do
not know the nature of uncertainty facing us. However, even with knowledge of the joint distribution,
without information regarding the loss function, we would not know how to balance the expected
gains and losses under different forecasting scenarios for making decisions under uncertainty. For a
truly successful forecasting system, we need both.
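The classic cost-loss problem analyzed by Murphy (1977) makes the role of the loss function concrete: if taking protective action costs C and an unforecast event inflicts a loss L > C, the expected loss of acting is C while that of not acting is pL, so issuing a 0/1 forecast of the event is optimal exactly when the predicted probability p exceeds the threshold C/L. A minimal sketch (function name ours):

```python
def optimal_point_forecast(p, C, L):
    """Cost-loss decision rule: forecast the event (and act) iff the
    expected loss of inaction, p * L, exceeds the cost of action, C --
    equivalently, iff p exceeds the threshold C / L."""
    return int(p * L > C)
```

With C = 10 and L = 100 the threshold is 0.1, so a probability forecast of 0.3 translates into a 0/1 forecast of 1 even though the event is deemed unlikely; a different user with a higher cost-loss ratio would convert the same probability into a 0. This is precisely why the probability forecast, combined with each user's own loss function, dominates a single preset point forecast.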
References
Abrevaya, J. and Huang, J. (2005), ‘On the Bootstrap of the Maximum Score Estimator’, Economet-
rica 73, 1175–1204.
Abu-Mostafa, Y. S., Atiya, A. F., Magdon-Ismail, M. and White, H. (2001), ‘Introduction to the
Special Issue on Neural Networks in Financial Engineering’, IEEE Transactions on Neural Networks
12, 653–656.
Agresti, A. (2007), An Introduction to Categorical Data Analysis, John Wiley & Sons.
Ai, C. and Li, Q. (2008), Semi-parametric and Non-parametric Methods in Panel Data Models, in
L. Matyas and P. Sevestre, eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Devel-
opments in Theory and Practice’, Springer, pp. 451–478.
Albert, J. (2009), Bayesian Computation with R, Springer.
Albert, J. H. and Chib, S. (1993), ‘Bayesian Analysis of Binary and Polychotomous Response Data’,
Journal of the American Statistical Association 88, 669–679.
Amemiya, T. (1985), Advanced Econometrics, Harvard University Press.
Amemiya, T. and Vuong, Q. H. (1987), ‘A Comparison of Two Consistent Estimators in the Choice-
Based Sampling Qualitative Response Model’, Econometrica 55, 699–702.
Anatolyev, S. (2009), ‘Multi-Market Direction-of-Change Modeling Using Dependence Ratios’, Stud-
ies in Nonlinear Dynamics & Econometrics 13, Article 5.
Andersen, E. B. (1970), ‘Asymptotic Properties of Conditional Maximum-Likelihood Estimators’,
Journal of the Royal Statistical Society, Series B 32, 283–301.
Arellano, M. and Carrasco, R. (2003), ‘Binary Choice Panel Data Models with Predetermined Vari-
ables’, Journal of Econometrics 115, 125–157.
Baltagi, B. H. (2012), Panel Data Forecasting, in A. Timmermann and G. Elliott, eds, ‘Handbook of
Economic Forecasting (forthcoming)’, North-Holland Amsterdam.
Bates, J. M. and Granger, C. W. J. (1969), ‘The Combination of Forecasts’, Operational Research
Quarterly 20, 451–468.
Bauer, E. and Kohavi, R. (1999), ‘An Empirical Comparison of Voting Classification Algorithms:
Bagging, Boosting, and Variants’, Machine Learning 36, 105–139.
Berge, T. J. and Jorda, O. (2011), ‘Evaluating the Classification of Economic Activity into Recessions
and Expansions’, American Economic Journal: Macroeconomics 3, 246–277.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press.
Blaskowitz, O. and Herwartz, H. (2008), Testing Directional Forecast Value in the Presence of Serial
Correlation. Humboldt University, Collaborative Research Center 649, SFB 649, Discussion Papers.
Blaskowitz, O. and Herwartz, H. (2009), ‘Adaptive Forecasting of the EURIBOR Swap Term Struc-
ture’, Journal of Forecasting 28, 575–594.
Blaskowitz, O. and Herwartz, H. (2011), ‘On Economic Evaluation of Directional Forecasts’, Inter-
national Journal of Forecasting 27, 1058–1065.
Bontemps, C., Racine, J. S. and Simioni, M. (2009), Nonparametric vs Parametric Binary Choice
Models: An Empirical Investigation. Toulouse School of Economics TSE Working Papers with num-
ber 09-126.
Braun, P. A. and Yaniv, I. (1992), ‘A Case Study of Expert Judgment: Economists’ Probabilities
Versus Base-Rate Model Forecasts’, Journal of Behavioral Decision Making 5, 217–231.
Breiman, L. (1996), ‘Bagging Predictors’, Machine Learning 24, 123–140.
Breiman, L., Friedman, J., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees,
Chapman & Hall.
Brier, G. W. (1950), ‘Verification of Forecasts Expressed in Terms of Probability’, Monthly Weather
Review 78, 1–3.
Buhlmann, P. and Yu, B. (2002), ‘Analyzing Bagging’, Annals of Statistics 30, 927–961.
Buja, A. and Stuetzle, W. (2006), ‘Observations on Bagging’, Statistica Sinica 16, 323–351.
Bull, S. B., Greenwood, C. M. T. and Hauck, W. W. (1997), ‘Jackknife Bias Reduction for Polychoto-
mous Logistic Regression’, Statistics in Medicine 16, 545–560.
Carroll, R. J., Ruppert, D. and Welsh, A. H. (1998), ‘Local Estimating Equations’, Journal of the
American Statistical Association 93, 214–227.
Caudill, S. B. (2003), ‘Predicting Discrete Outcomes with the Maximum Score Estimator: the Case
of the NCAA Men’s Basketball Tournament’, International Journal of Forecasting 19, 313–317.
Cavanagh, C. L. (1987), Limiting Behavior of Estimators Defined by Optimization. Unpublished
Manuscript, Department of Economics, Harvard University.
Chamberlain, G. (1980), ‘Analysis of Covariance with Qualitative Data’, Review of Economic Studies
47, 225–238.
Chamberlain, G. (1984), Panel Data, in Z. Griliches and M. D. Intriligator, eds, ‘Handbook of Econometrics’, North-Holland Amsterdam, pp. 1248–1318.
Chauvet, M. and Potter, S. (2005), ‘Forecasting Recessions using the Yield Curve’, Journal of Fore-
casting 24, 77–103.
Chib, S. (2008), Panel Data Modeling and Inference: A Bayesian Primer, in L. Matyas and P. Sevestre,
eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Prac-
tice’, Springer, pp. 479–515.
Clark, T. E. and McCracken, M. W. (2012), Advances in Forecast Evaluation, in A. Timmermann and
G. Elliott, eds, ‘Handbook of Economic Forecasting (forthcoming)’, North-Holland Amsterdam.
Clemen, R. T. (1989), ‘Combining Forecasts: A Review and Annotated Bibliography’, International
Journal of Forecasting 5, 559–583.
Clemen, R. T. and Winkler, R. L. (1986), ‘Combining Economic Forecasts’, Journal of Business &
Economic Statistics 4, 39–46.
Clemen, R. T. and Winkler, R. L. (1999), ‘Combining Probability Distributions From Experts in Risk
Analysis’, Risk Analysis 19, 187–203.
Clemen, R. T. and Winkler, R. L. (2007), Aggregating Probability Distributions, in W. Edwards, R.
F. Miles and D. von Winterfeldt, eds, ‘Advances in Decision Analysis: From Foundations to Applica-
tions’, Cambridge University Press, pp. 154–176.
Clements, M. P. (2006), ‘Evaluating the Survey of Professional Forecasters Probability Distributions
of Expected Inflation Based on Derived Event Probability Forecasts’, Empirical Economics 31, 49–64.
Clements, M. P. (2008), ‘Consensus and Uncertainty: Using Forecast Probabilities of Output De-
clines’, International Journal of Forecasting 24, 76–86.
Clements, M. P. (2011), ‘An Empirical Investigation of the Effects of Rounding on the SPF Probabili-
ties of Decline and Output Growth Histograms’, Journal of Money, Credit and Banking 43, 207–220.
Cortes, C. and Mohri, M. (2005), Confidence Intervals for the Area under the ROC Curve. Advances
in Neural Information Processing Systems (NIPS 2004).
Cosslett, S. R. (1993), Estimation from Endogenously Stratified Samples, in G. S. Maddala, C. R. Rao
and H. D. Vinod, eds, ‘Handbook of Statistics 11 (Econometrics)’, North-Holland Amsterdam, pp. 1–
44.
Cramer, J. S. (1999), ‘Predictive Performance of the Binary Logit Model in Unbalanced Samples’,
Journal of the Royal Statistical Society, Series D 48, 85–94.
Croushore, D. (1993), Introducing: The Survey of Professional Forecasters. Federal Reserve Bank of
Philadelphia Business Review, November/December, 3-13.
Dawid, A. P. (1984), ‘Present Position and Potential Developments: Some Personal Views: Statistical
Theory: The Prequential Approach’, Journal of the Royal Statistical Society, Series A 147, 278–292.
Delgado, M. A., Rodríguez-Poo, J. M. and Wolf, M. (2001), ‘Subsampling Inference in Cube
Root Asymptotics with an Application to Manski’s Maximum Score Estimator’, Economics Letters
73, 241–250.
Deutsch, M., Granger, C. W. J. and Terasvirta, T. (1994), ‘The Combination of Forecasts Using Chang-
ing Weights’, International Journal of Forecasting 10, 47–57.
Diebold, F. X. (2006), Elements of Forecasting, South-Western College.
Diebold, F. X. and Lopez, J. A. (1997), Forecast Evaluation and Combination, in G.S. Maddala and
C.R. Rao, eds, ‘Handbook of Statistics 14 (Statistical Methods in Finance)’, North-Holland Amster-
dam, pp. 241–268.
Diebold, F. X. and Mariano, R. S. (1995), ‘Comparing Predictive Accuracy’, Journal of Business &
Economic Statistics 13, 253–263.
Donkers, B. and Melenberg, B. (2002), Testing Predictive Performance of Binary Choice Models.
Erasmus School of Economics, Econometric Institute Research Papers.
Egan, J. P. (1975), Signal Detection Theory and ROC Analysis, Academic Press.
Elliott, G. and Lieli, R. P. (2010), Predicting Binary Outcomes. Working paper, Department of Eco-
nomics, University of California, San Diego.
Engelberg, J., Manski, C. F. and Williams, J. (2011), ‘Assessing the Temporal Variation of Macroeco-
nomic Forecasts by a Panel of Changing Composition’, Journal of Applied Econometrics 26, 1059–
1078.
Engle, R. F. (2000), ‘The Econometrics of Ultra-High-Frequency Data’, Econometrica 68, 1–22.
Engle, R. F. and Russell, J. R. (1997), ‘Forecasting the Frequency of Changes in Quoted Foreign
Exchange Prices with the ACD Model’, Journal of Empirical Finance 12, 187–212.
Engle, R. F. and Russell, J. R. (1998), ‘Autoregressive Conditional Duration: A New Model for Irreg-
ularly Spaced Transaction Data’, Econometrica 66, 1127–1162.
Estrella, A. and Mishkin, F. S. (1996), ‘The Yield Curve as a Predictor of U.S. Recessions’, Current
Issues in Economics and Finance 2, 41–51.
Estrella, A. (1998), ‘A New Measure of Fit for Equations with Dichotomous Dependent Variables’,
Journal of Business & Economic Statistics 16, 198–205.
Estrella, A. and Mishkin, F. S. (1998), ‘Predicting U.S. Recessions: Financial Variables as Leading
Indicators’, The Review of Economics and Statistics 80, 45–61.
Evgeniou, T., Pontil, M. and Elisseeff, A. (2004), ‘Leave One Out Error, Stability, and Generalization
of Voting Combinations of Classifiers’, Machine Learning 55, 71–97.
Faraggi, D. and Reiser, B. (2002), ‘Estimation of the Area Under the ROC Curve’, Statistics in
Medicine 21, 3093–3106.
Fawcett, T. (2006), ‘An Introduction to ROC Analysis’, Pattern Recognition Letters 27, 861–874.
Florios, K. and Skouras, S. (2007), Computation of Maximum Score Type Estimators by Mixed In-
teger Programming. Working paper, Department of International and European Economic Studies,
Athens University of Economics and Business.
Friedman, J. H. and Hall, P. (2007), ‘On Bagging and Nonlinear Estimation’, Journal of Statistical
Planning and Inference 137, 669–683.
Frolich, M. (2006), ‘Non-parametric Regression for Binary Dependent Variables’, Econometrics Jour-
nal 9, 511–540.
Galbraith, J. W. and van Norden, S. (2007), ‘Assessing Gross Domestic Product and Inflation Proba-
bility Forecasts Derived from Bank of England Fan Charts’, Journal of the Royal Statistical Society,
Series A 175, 1–15.
Gandin, L. S. and Murphy, A. H. (1992), ‘Equitable Skill Scores for Categorical Forecasts’, Monthly
Weather Review 120, 361–370.
Genest, C. and Zidek, J. V. (1986), ‘Combining Probability Distributions: A Critique and an Annotated
Bibliography’, Statistical Science 1, 114–135.
Gneiting, T. (2011), ‘Making and Evaluating Point Forecasts’, Journal of the American Statistical
Association 106, 746–762.
Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007), ‘Probabilistic Forecasts, Calibration and Sharp-
ness’, Journal of the Royal Statistical Society, Series B 69, 243–268.
Gneiting, T. and Raftery, A. E. (2007), ‘Strictly Proper Scoring Rules, Prediction, and Estimation’,
Journal of the American Statistical Association 102, 359–378.
Gourieroux, C. and Monfort, A. (1993), ‘Simulation-based Inference: A Survey with Special Refer-
ence to Panel Data Models’, Journal of Econometrics 59, 5–33.
Gozalo, P. and Linton, O. (2000), ‘Local Nonlinear Least Squares: Using Parametric Information in
Nonparametric Regression’, Journal of Econometrics 99, 63–106.
Gradojevic, N. and Yang, J. (2006), ‘Non-linear, Non-parametric, Non-fundamental Exchange Rate
Forecasting’, Journal of Forecasting 25, 227–245.
Graham, J. R. (1996), ‘Is a Group of Economists Better Than One? Than None?’, Journal of Business
69, 193–232.
Grammig, J. and Kehrle, K. (2008), ‘A New Marked Point Process Model for the Federal Funds
Rate Target: Methodology and Forecast Evaluation’, Journal of Economic Dynamics and Control
32, 2370–2396.
Granger, C. W. J. and Jeon, Y. (2004), ‘Thick Modeling’, Economic Modelling 21, 323–343.
Granger, C. W. J. and Newbold, P. (1986), Forecasting Economic Time Series, Academic Press.
Granger, C. W. J. and Pesaran, M. H. (2000a), A Decision-Theoretic Approach to Forecast Evaluation,
in W. S. Chan, W. K. Li and H. Tong, eds, ‘Statistics and Finance: An Interface’, Imperial College
Press, pp. 261–278.
Granger, C. W. J. and Pesaran, M. H. (2000b), ‘Economic and Statistical Measures of Forecast Accu-
racy’, Journal of Forecasting 19, 537–560.
Greene, W. H. (2011), Econometric Analysis, Prentice Hall.
Greer, M. R. (2005), ‘Combination Forecasting for Directional Accuracy: An Application to Survey
Interest Rate Forecasts’, Journal of Applied Statistics 32, 607–615.
Griffiths, W. E., Hill, R. C. and Pope, P. J. (1987), ‘Small Sample Properties of Probit Model Estima-
tors’, Journal of the American Statistical Association 82, 929–937.
Hamilton, J. D. (1989), ‘A New Approach to the Economic Analysis of Nonstationary Time Series
and the Business Cycle’, Econometrica 57, 357–384.
Hamilton, J. D. (1990), ‘Analysis of Time Series Subject to Changes in Regime’, Journal of Econo-
metrics 45, 39–70.
Hamilton, J. D. (1993), Estimation, Inference and Forecasting of Time Series Subject to Changes in
Regime, in G. S. Maddala, C. R. Rao and H. D. Vinod, eds, ‘Handbook of Statistics 11 (Economet-
rics)’, North-Holland Amsterdam, pp. 231–260.
Hamilton, J. D. (1994), Time Series Analysis, Princeton University Press.
Hamilton, J. D. and Jorda, O. (2002), ‘A Model of the Federal Funds Rate Target’, Journal of Political
Economy 110, 1135–1167.
Hao, L. and Ng, E. C. Y. (2011), ‘Predicting Canadian Recessions using Dynamic Probit Modelling
Approaches’, Canadian Journal of Economics 44, 1297–1330.
Harding, D. and Pagan, A. (2011), ‘An Econometric Analysis of Some Models for Constructed Binary
Time Series’, Journal of Business & Economic Statistics 29, 86–95.
Hardle, W. and Stoker, T. M. (1989), ‘Investigating Smooth Multiple Regression by the Method of
Average Derivatives’, Journal of the American Statistical Association 84, 986–995.
Harvey, D., Leybourne, S. and Newbold, P. (1997), ‘Testing the Equality of Prediction Mean Squared
Errors’, International Journal of Forecasting 13, 281–291.
Hastie, T., Tibshirani, R. and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.
Heckman, J. J. (1981), The Incidental Parameters Problem and the Problem of Initial Conditions in
Estimating a Discrete Time-Discrete Data Stochastic Process and some Monte-Carlo Evidence, in C.
F. Manski and D. McFadden, eds, ‘Structural Analysis of Discrete Data’, MIT Press, pp. 179–195.
Hertz, J., Krogh, A. and Palmer, R. G. (1991), Introduction to the Theory of Neural Computation,
Westview Press.
Horowitz, J. L. (1992), ‘A Smoothed Maximum Score Estimator for the Binary Response Model’,
Econometrica 60, 505–531.
Horowitz, J. L. (2009), Semiparametric and Nonparametric Methods in Econometrics, Springer.
Horowitz, J. L. and Mammen, E. (2004), ‘Nonparametric Estimation of an Additive Model with a
Link Function’, Annals of Statistics 32, 2412–2443.
Horowitz, J. L. and Mammen, E. (2007), ‘Rate-Optimal Estimation for a General Class of Nonpara-
metric Regression Models with Unknown Link Functions’, Annals of Statistics 35, 2589–2619.
Hristache, M., Juditsky, A. and Spokoiny, V. (2001), ‘Direct Estimation of the Index Coefficient in a
Single-Index Model’, Annals of Statistics 29, 595–623.
Hsiao, C. (1996), Logit and Probit Models, in L. Matyas and P. Sevestre, eds, ‘The Econometrics of
Panel Data: Handbook of Theory and Applications’, Kluwer Academic Publishers, pp. 410–428.
Hu, L. and Phillips, P. C. B. (2004a), ‘Dynamics of the Federal Funds Target Rate: A Nonstationary
Discrete Choice Approach’, Journal of Applied Econometrics 19, 851–867.
Hu, L. and Phillips, P. C. B. (2004b), ‘Nonstationary Discrete Choice’, Journal of Econometrics
120, 103–138.
Ichimura, H. (1993), ‘Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single-
Index Models’, Journal of Econometrics 58, 71–120.
Imbens, G. W. (1992), ‘An Efficient Method of Moments Estimator for Discrete Choice Models With
Choice-Based Sampling’, Econometrica 60, 1187–1214.
Imbens, G. W. and Lancaster, T. (1996), ‘Efficient Estimation and Stratified Sampling’, Journal of
Econometrics 74, 289–318.
Inoue, A. and Kilian, L. (2008), ‘How Useful is Bagging in Forecasting Economic Time Series? A
Case Study of U.S. CPI Inflation’, Journal of the American Statistical Association 103, 511–522.
Kauppi, H. (2012), ‘Predicting the Direction of the Fed’s Target Rate’, Journal of Forecasting 31, 47–
67.
Kauppi, H. and Saikkonen, P. (2008), ‘Predicting U.S. Recessions with Dynamic Binary Response
Models’, The Review of Economics and Statistics 90, 777–791.
Kim, J. and Pollard, D. (1990), ‘Cube Root Asymptotics’, Annals of Statistics 18, 191–219.
King, G. and Zeng, L. (2001), ‘Logistic Regression in Rare Events Data’, Political Analysis 9, 137–
163.
Kitamura, Y. (2001), Predictive Inference and the Bootstrap. Working paper, Yale University.
Klein, R. W. and Spady, R. H. (1993), ‘An Efficient Semiparametric Estimator for Binary Response
Models’, Econometrica 61, 387–421.
Koenker, R. and Yoon, J. (2009), ‘Parametric Links for Binary Choice Models: A Fisherian-Bayesian
Colloquy’, Journal of Econometrics 152, 120–130.
Koop, G. (2003), Bayesian Econometrics, John Wiley & Sons.
Krzanowski, W. J. and Hand, D. J. (2009), ROC Curves for Continuous Data, Chapman & Hall.
Krzysztofowicz, R. (1992), ‘Bayesian Correlation Score: A Utilitarian Measure of Forecast Skill’,
Monthly Weather Review 120, 208–219.
Krzysztofowicz, R. and Long, D. (1990), ‘Fusion of Detection Probabilities and Comparison of Mul-
tisensor Systems’, IEEE Transactions on Systems, Man, and Cybernetics 20, 665–677.
Kuan, C. M. and White, H. (1994), ‘Artificial Neural Networks: An Econometric Perspective’, Econo-
metrics Reviews 13, 1–91.
Kuncheva, L. I. (2004), Combining Pattern Classifiers: Methods and Algorithms, John Wiley & Sons.
Kuncheva, L. I. and Whitaker, C. J. (2003), ‘Measures of Diversity in Classifier Ensembles and Their
Relationship with the Ensemble Accuracy’, Machine Learning 51, 181–207.
Lahiri, K., Monokroussos, G. and Zhao, Y. (2012a), The Yield Spread Puzzle and the Information
Content of SPF Forecasts. CESifo Working Paper Series No. 3949.
Lahiri, K., Peng, H. and Zhao, Y. (2012b), Evaluating the Value of Probability Forecasts in the Sense
of Merton. Paper presented at the 7th New York Camp Econometrics.
Lahiri, K., Teigland, C. and Zaporowski, M. (1988), ‘Interest Rates and the Subjective Probability
Distribution of Inflation Forecasts’, Journal of Money, Credit and Banking 20, 233–248.
Lahiri, K. and Wang, J. G. (1994), ‘Predicting Cyclical Turning Points with Leading Index in a Markov
Switching Model’, Journal of Forecasting 13, 245–263.
Lahiri, K. and Wang, J. G. (2006), ‘Subjective Probability Forecasts for Recessions: Evaluation and
Guidelines for Use’, Business Economics 41, 26–37.
Lahiri, K. and Wang, J. G. (2012), Evaluating Probability Forecasts for GDP Declines using Alter-
native Methodologies. Working paper, Department of Economics, State University of New York at
Albany.
Lawrence, M., Goodwin, P., O’Connor, M. and Onkal, D. (2006), ‘Judgmental Forecasting: A Review
of Progress over the Last 25 Years’, International Journal of Forecasting 22, 493–518.
Lechner, M., Lollivier, S. and Magnac, T. (2008), Parametric Binary Choice Models, in L. Matyas
and P. Sevestre, eds, ‘The Econometrics of Panel Data: Fundamentals and Recent Developments in
Theory and Practice’, Springer, pp. 215–245.
Lee, L. F. (1992), ‘On Efficiency of Methods of Simulated Moments and Maximum Simulated Like-
lihood Estimation of Discrete Response Models’, Econometric Theory 8, 518–552.
Lee, T. H. and Yang, Y. (2006), ‘Bagging Binary and Quantile Predictors for Time Series’, Journal of
Econometrics 135, 465–497.
Leitch, G. and Tanner, J. (1995), ‘Professional Economic Forecasts: Are They Worth Their Costs?’,
Journal of Forecasting 14, 143–157.
Li, Q. and Racine, J. S. (2006), Nonparametric Econometrics: Theory and Practice, Princeton Uni-
versity Press.
Lieli, R. P. and Nieto-Barthaburu, A. (2010), ‘Optimal Binary Prediction for Group Decision Making’,
Journal of Business & Economic Statistics 28, 308–319.
Lieli, R. P. and Springborn, M. (2012), ‘Closing the Gap Between Risk Estimation and Decision-
Making: Efficient Management of Trade-Related Invasive Species Risk’, Review of Economics and
Statistics (forthcoming) .
Liu, H., Li, G., Cumberland, W. G. and Wu, T. (2005), ‘Testing Statistical Significance of the Area Un-
der a Receiving Operating Characteristics Curve for Repeated Measures Design with Bootstrapping’,
Journal of Data Science 3, 257–278.
Lopez, J. A. (2001), ‘Evaluating the Predictive Accuracy of Volatility Models’, Journal of Forecasting
20, 87–109.
Lovell, M. C. (1986), ‘Tests of the Rational Expectations Hypothesis’, The American Economic Review
76, 110–124.
Maddala, G. S. (1983), Limited-dependent and Qualitative Variables in Econometrics, Cambridge
University Press.
Maddala, G. S. and Lahiri, K. (2009), Introduction to Econometrics, John Wiley & Sons.
Manski, C. F. (1975), ‘Maximum Score Estimation of the Stochastic Utility Model of Choice’, Journal
of Econometrics 3, 205–228.
Manski, C. F. (1985), ‘Semiparametric Analysis of Discrete Response: Asymptotic Properties of the
Maximum Score Estimator’, Journal of Econometrics 27, 313–333.
Manski, C. F. (1988), ‘Identification of Binary Response Models’, Journal of the American Statistical
Association 83, 729–738.
Manski, C. F. and Lerman, S. R. (1977), ‘The Estimation of Choice Probabilities from Choice Based
Samples’, Econometrica 45, 1977–1988.
Manski, C. F. and Thompson, T. S. (1986), ‘Operational Characteristics of Maximum Score Estima-
tion’, Journal of Econometrics 32, 85–108.
Manski, C. F. and Thompson, T. S. (1989), ‘Estimation of Best Predictors of Binary Response’, Jour-
nal of Econometrics 40, 97–123.
Manzato, A. (2007), ‘A Note On the Maximum Peirce Skill Score’, Weather and Forecasting
22, 1148–1154.
Marcellino, M. (2004), ‘Forecasting EMU Macroeconomic Variables’, International Journal of Fore-
casting 20, 359–372.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979), Multivariate Analysis, Academic Press.
Mason, I. B. (2003), Binary Events, in I. T. Jolliffe and D. B. Stephenson, eds, ‘Forecast Verification:
A Practitioner’s Guide in Atmospheric Science’, John Wiley & Sons, pp. 37–76.
Mason, S. J. and Graham, N. E. (2002), ‘Areas Beneath the Relative Operating Characteristics (ROC)
and Relative Operating Levels (ROL) Curves: Statistical Significance and Interpretation’, Quarterly
Journal of the Royal Meteorological Society 128, 2145–2166.
Meese, R. and Rogoff, K. (1988), ‘Was it Real? The Exchange Rate-Interest Differential Relation
Over the Modern Floating-Rate Period’, Journal of Finance 43, 933–948.
Merton, R. C. (1981), ‘On Market Timing and Investment Performance. I. An Equilibrium Theory of Value for Market Forecasts’, Journal of Business 54, 363–406.
Michie, D., Spiegelhalter, D. J. and Taylor, C. C. (1994), Machine Learning, Neural and Statistical
Classification, Prentice Hall.
Monokroussos, G. (2011), ‘Dynamic Limited Dependent Variable Modeling and U.S. Monetary Pol-
icy’, Journal of Money, Credit and Banking 43, 519–534.
Morgan, J. N. and Sonquist, J. A. (1963), ‘Problems in the Analysis of Survey Data, and a Proposal’,
Journal of the American Statistical Association 58, 415–434.
Murphy, A. H. (1973), ‘A New Vector Partition of the Probability Score’, Journal of Applied Meteo-
rology 12, 595–600.
Murphy, A. H. (1977), ‘The Value of Climatological, Categorical and Probabilistic Forecasts in the
Cost-Loss Situation’, Monthly Weather Review 105, 803–816.
Murphy, A. H. and Daan, H. (1985), Forecast Evaluation, in A. H. Murphy and R. W. Katz, eds,
‘Probability, Statistics, and Decision Making in the Atmospheric Sciences’, Westview Press, pp. 379–
437.
Murphy, A. H. and Winkler, R. L. (1984), ‘Probability Forecasting in Meteorology’, Journal of the
American Statistical Association 79, 489–500.
Murphy, A. H. and Winkler, R. L. (1987), ‘A General Framework for Forecast Verification’, Monthly
Weather Review 115, 1330–1338.
Mylne, K. R. (1999), The Use of Forecast Value Calculations for Optimal Decision-making Using
Probability Forecasts, in ‘17th Conference on Weather Analysis and Forecasting’, American Meteo-
rological Society, Boston, Massachusetts, pp. 235–239.
Park, J. Y. and Phillips, P. C. B. (2000), ‘Nonstationary Binary Choice’, Econometrica 68, 1249–1280.
Parker, D. B. (1985), Learning Logic. Technical Report TR-47, Cambridge MA: MIT Center for
Research in Computational Economics and Management Science.
Patton, A. J. (2006), ‘Modelling Asymmetric Exchange Rate Dependence’, International Economic
Review 47, 527–556.
Patton, A. J. and Timmermann, A. (2012), ‘Forecast Rationality Tests Based on Multi-Horizon
Bounds’, Journal of Business & Economic Statistics 30, 1–17.
Peirce, C. S. (1884), ‘The Numerical Measure of the Success of Predictions’, Science 4, 453–454.
Pesaran, M. H. and Skouras, S. (2002), Decision-Based Methods for Forecast Evaluation, in M.
P. Clements and D. F. Hendry, eds, ‘A companion to Economic Forecasting’, Wiley-Blackwell,
pp. 241–267.
Pesaran, M. H. and Timmermann, A. (1992), ‘A Simple Nonparametric Test of Predictive Perfor-
mance’, Journal of Business & Economic Statistics 10, 461–465.
Pesaran, M. H. and Timmermann, A. (2009), ‘Testing Dependence among Serially Correlated Multi-
Category Variables’, Journal of the American Statistical Association 104, 325–337.
Powell, J. L., Stock, J. H. and Stoker, T. M. (1989), ‘Semiparametric Estimation of Index Coefficients’,
Econometrica 57, 1403–1430.
Primo, C., Ferro, C. A. T., Jolliffe, I. T. and Stephenson, D. B. (2009), ‘Combination and Calibration
Methods for Probabilistic Forecasts of Binary Events’, Monthly Weather Review 137, 1142–1149.
Quinlan, J. R. (1992), C4.5: Programs for Machine Learning, Morgan Kaufmann.
Racine, J. S. and Parmeter, C. F. (2009), Data-driven Model Evaluation: a Test for Revealed Perfor-
mance. McMaster University Working Papers.
Ranjan, R. and Gneiting, T. (2010), ‘Combining Probability Forecasts’, Journal of the Royal Statistical
Society, Series B 72, 71–91.
Refenes, A. P. and White, H. (1998), ‘Neural Networks and Financial Economics’, International
Journal of Forecasting 17, 347–495.
Richardson, D. S. (2003), Economic Value and Skill, in I. T. Jolliffe and D. B. Stephenson, eds,
‘Forecast Verification: A Practitioner’s Guide in Atmospheric Science’, John Wiley & Sons, pp. 165–
187.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Rudebusch, G. D. and Williams, J. C. (2009), ‘Forecasting Recessions: The Puzzle of the Enduring
Power of the Yield Curve’, Journal of Business & Economic Statistics 27, 492–503.
Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986), Learning Internal Representations by
Error Propagation, in D. E. Rumelhart, J. L. McClelland and the PDP Research Group, eds, ‘Parallel
Distributed Processing: Explorations in the Microstructure of Cognition’, MIT Press, pp. 318–362.
Schervish, M. J. (1989), ‘A General Method for Comparing Probability Assessors’, Annals of Statistics
17, 1856–1879.
Scott, A. J. and Wild, C. J. (1986), ‘Fitting Logistic Models Under Case-Control or Choice Based
Sampling’, Journal of the Royal Statistical Society, Series B 48, 170–182.
Scotti, C. (2011), ‘A Bivariate Model of Federal Reserve and ECB Main Policy Rates’, International
Journal of Central Banking 7, 37–78.
Seillier-Moiseiwitsch, F. and Dawid, A. P. (1993), ‘On Testing the Validity of Sequential Probability
Forecasts’, Journal of the American Statistical Association 88, 355–359.
Steinberg, D. and Cardell, N. S. (1992), ‘Estimating Logistic Regression Models When the Dependent
Variable Has no Variance’, Communications in Statistics-Theory and Methods 21, 423–450.
Stephenson, D. B. (2000), ‘Use of the ‘Odds Ratio’ for Diagnosing Forecast Skill’, Weather and Forecasting 15, 221–232.
Stock, J. H. and Watson, M. W. (1999), A Comparison of Linear and Nonlinear Univariate Models for
Forecasting Macroeconomic Time Series, in R. F. Engle and H. White, eds, ‘Cointegration, Causality,
and Forecasting, A Festschrift in Honor of Clive W. J. Granger’, Oxford University Press, pp. 1–44.
Stock, J. H. and Watson, M. W. (2005), An Empirical Comparison of Methods for Forecasting Using
Many Predictors. Working paper, Harvard University and Princeton University.
Stoker, T. M. (1986), ‘Consistent Estimation of Scaled Coefficients’, Econometrica 54, 1461–1481.
Stoker, T. M. (1991a), Equivalence of Direct, Indirect and Slope Estimators of Average Derivatives,
in W. A. Barnett, J. Powell and G. Tauchen, eds, ‘Nonparametric and Semiparametric Methods in
Econometrics and Statistics’, Cambridge University Press, pp. 99–118.
Stoker, T. M. (1991b), Lectures on Semiparametric Econometrics, Louvain-la-Neuve, Belgium:
CORE Foundation.
Swanson, N. R. and White, H. (1995), ‘A Model Selection Approach to Assessing the Information
in the Term Structure Using Linear Models and Artificial Neural Networks’, Journal of Business &
Economic Statistics 13, 265–275.
Swanson, N. R. and White, H. (1997a), ‘Forecasting Economic Time Series Using Flexible Ver-
sus Fixed Specification and Linear Versus Nonlinear Econometric Models’, International Journal of
Forecasting 13, 439–461.
Swanson, N. R. and White, H. (1997b), ‘A Model Selection Approach to Real-Time Macroeconomic
Forecasting Using Linear Models and Artificial Neural Networks’, The Review of Economics and
Statistics 79, 540–550.
Swets, J. A. (1996), Signal Detection Theory and ROC Analysis in Psychology and Diagnostics:
Collected Papers, Lawrence Erlbaum Associates.
Tajar, A., Denuit, M. and Lambert, P. (2001), Copula-Type Representation for Random Couples with
Bernoulli Margins. Discussion paper 0118, Université Catholique de Louvain.
Tavaré, S. and Altham, P. M. E. (1983), ‘Dependence in Goodness of Fit Tests and Contingency
Tables’, Biometrika 70, 139–144.
Teräsvirta, T., Tjøstheim, D. and Granger, C. W. J. (2010), Modelling Nonlinear Economic Time
Series, Oxford University Press.
Teräsvirta, T., van Dijk, D. and Medeiros, M. C. (2005), ‘Smooth Transition Autoregressions, Neural
Networks, and Linear Models in Forecasting Macroeconomic Time Series: A Re-examination’,
International Journal of Forecasting 21, 755–774.
Thompson, J. C. and Brier, G. W. (1955), ‘The Economic Utility of Weather Forecasts’, Monthly
Weather Review 83, 249–254.
Tibshirani, R. and Hastie, T. (1987), ‘Local Likelihood Estimation’, Journal of the American Statisti-
cal Association 82, 559–567.
Timmermann, A. (2006), Forecast Combinations, in G. Elliott, C. W. J. Granger and A. Timmermann,
eds, ‘Handbook of Economic Forecasting’, North-Holland, Amsterdam, pp. 135–196.
Toth, Z., Talagrand, O., Candille, G. and Zhu, Y. (2003), Probability and Ensemble Forecasts, in I.
T. Jolliffe and D. B. Stephenson, eds, ‘Forecast Verification: A Practitioner’s Guide in Atmospheric
Science’, John Wiley & Sons, pp. 137–163.
Train, K. E. (2003), Discrete Choice Methods with Simulation, Cambridge University Press.
Wallsten, T. S., Budescu, D. V., Erev, I. and Diederich, A. (1997), ‘Evaluating and Combining Sub-
jective Probability Estimates’, Journal of Behavioral Decision Making 10, 243–268.
West, K. D. (1996), ‘Asymptotic Inference about Predictive Ability’, Econometrica 64, 1067–1084.
Wickens, T. D. (2001), Elementary Signal Detection Theory, Oxford University Press.
Wilks, D. S. (2001), ‘A Skill Score Based on Economic Value for Probability Forecasts’, Meteorolog-
ical Applications 8, 209–219.
Windmeijer, F. A. G. (1995), ‘Goodness-of-Fit Measures in Binary Choice Models’, Econometric
Reviews 14, 101–116.
Wooldridge, J. M. (2005), ‘Simple Solutions to the Initial Conditions Problem in Dynamic, Nonlinear
Panel Data Models with Unobserved Heterogeneity’, Journal of Applied Econometrics 20, 39–54.
Xie, Y. and Manski, C. F. (1989), ‘The Logit Model and Response-Based Samples’, Sociological
Methods and Research 17, 283–302.
Yang, Y. (2004), ‘Combining Forecasting Procedures: Some Theoretical Results’, Econometric The-
ory 20, 176–222.
Yates, J. F. (1982), ‘External Correspondence: Decompositions of the Mean Probability Score’, Or-
ganizational Behavior and Human Performance 30, 132–156.
Zhou, X. H., Obuchowski, N. A. and McClish, D. K. (2002), Statistical Methods in Diagnostic
Medicine, John Wiley & Sons.