
Autumn, 2015

ECON 852
Supplementary Material

James G. MacKinnon
Dept. of Economics, Queen’s University

1. Wald Tests

Section 5.4 of ETM introduces a very special case of the Wald statistic. In general, a Wald statistic is a quadratic form in a vector that is a function of the unrestricted estimates and the inverse of an estimate of the covariance matrix of that vector. Wald tests are much more general than least squares. We can perform a Wald test whenever we can obtain asymptotically normal parameter estimates and estimate their covariance matrix consistently.

An ordinary t statistic is really a Wald test, or at least it is the square root of a Wald test. Consider a single restriction on the parameter β. For example, the restriction might be that β2 + β3 = 1. Then the vector of restrictions as a function of the unrestricted estimates is β̂2 + β̂3 − 1. From an important result proved in Section 3.4, we know that

Var(w⊤β̂) = w⊤Var(β̂)w. (1)

In this case, w is just a vector with 1 as the second element, 1 as the third element, and 0 everywhere else. Thus

Var(w⊤β̂) = Var(β̂2) + Var(β̂3) + 2 Cov(β̂2, β̂3). (2)

The t statistic for this restriction is just

(β̂2 + β̂3 − 1) / (Var(β̂2) + Var(β̂3) + 2 Cov(β̂2, β̂3))¹⁄², (3)

and the square of this t statistic is

(β̂2 + β̂3 − 1)² / (Var(β̂2) + Var(β̂3) + 2 Cov(β̂2, β̂3)) (4)

= (β̂2 + β̂3 − 1)(Var(β̂2) + Var(β̂3) + 2 Cov(β̂2, β̂3))⁻¹(β̂2 + β̂3 − 1). (5)

Equation (5) looks like a quadratic form, which may seem strange, because everything in it is a scalar. When there are two or more restrictions, however, every Wald statistic must be written as a quadratic form.
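To make the computation concrete, here is a small sketch in Python/NumPy that forms the t statistic (3) and its square, the Wald statistic (5), on simulated data. The simulation and all variable names are mine, not part of the text; only the formulas come from equations (1)-(5).

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a regression with three coefficients; the restriction
# beta_2 + beta_3 = 1 holds in the DGP.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 0.4, 0.6]) + rng.normal(size=n)

# OLS estimates and the usual estimate of their covariance matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta_hat
cov = (u @ u / (n - 3)) * np.linalg.inv(X.T @ X)

# w picks out beta_2 + beta_3, as in equations (1) and (2)
w = np.array([0.0, 1.0, 1.0])
t_stat = (w @ beta_hat - 1.0) / np.sqrt(w @ cov @ w)   # equation (3)
print(t_stat, t_stat ** 2)                             # square is (5)
```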


We now consider the general case in which there are r linear restrictions on a parameter vector β. These restrictions can always be written in the form

Rβ = r, (6)

where R is an r × k matrix and r is an r-vector. For example, suppose that k = 3 and there are two restrictions, that β1 = 0 and β2 + β3 = 1. Then equations (6) would be

⎡ 1 0 0 ⎤       ⎡ 0 ⎤
⎣ 0 1 1 ⎦ β  =  ⎣ 1 ⎦ .  (7)

The elements of the matrix R and the vector r must be known. They are not functions of the data, and, as in this example, they are very often integers.

Now suppose that we have obtained a k-vector of unrestricted parameter estimates β̂, of which the (true) covariance matrix is Var(β̂). From the standard result for the covariance matrix of a linear function of parameter estimates, which generalizes equation (3.33) in ETM,

Var(Rβ̂ − r) = R Var(β̂)R⊤. (8)

Then, provided V̂ar(β̂) estimates Var(β̂) consistently, the Wald statistic for testing the restrictions (6) is

W(β̂) = (Rβ̂ − r)⊤(R V̂ar(β̂)R⊤)⁻¹(Rβ̂ − r). (9)

This is just a quadratic form in the vector Rβ̂ − r and the matrix (R V̂ar(β̂)R⊤)⁻¹.

In order to prove that W(β̂) is asymptotically distributed as χ²(r), we need to insert the appropriate factors of n. This yields

W(β̂) = (n¹⁄²(Rβ̂ − r))⊤(R nV̂ar(β̂)R⊤)⁻¹(n¹⁄²(Rβ̂ − r)). (10)

If the vector n¹⁄²(Rβ̂ − r) is asymptotically multivariate normal and the matrix nV̂ar(β̂) consistently estimates the covariance matrix of n¹⁄²(β̂ − β0), it follows from (8) and Theorem 4.1 that the Wald statistic W(β̂) must be asymptotically distributed as χ²(r) under the null hypothesis.

It may be more intuitive to think of the Wald statistic as being computed in two steps, even though it is actually computed in one. In the first step, we take the vector Rβ̂ − r and transform it into an r-vector z of random variables that are, asymptotically, independent standard normals:

z = A⁻¹(Rβ̂ − r), (11)

where AA⊤ = R V̂ar(β̂)R⊤, so that

(A⊤)⁻¹A⁻¹ = (R V̂ar(β̂)R⊤)⁻¹. (12)


In the second step, we compute the Wald statistic as

W(β̂) = z⊤z = ∑_{j=1}^{r} zj². (13)

In this form, it is clear that the Wald statistic is just the sum of r squared random variables, each of which is (asymptotically) standard normal and independent of all the others.
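The sketch below, again on simulated data of my own devising, computes the Wald statistic for the two restrictions in (7) both directly from (9) and in the two-step form (11)-(13), taking A to be a Cholesky factor of R V̂ar(β̂)R⊤ (one valid choice among many):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.0, 0.3, 0.7]) + rng.normal(size=n)  # both restrictions true

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta_hat
cov = (u @ u / (n - 3)) * np.linalg.inv(X.T @ X)   # estimate of Var(beta_hat)

# Restrictions (7): beta_1 = 0 and beta_2 + beta_3 = 1
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
r = np.array([0.0, 1.0])

# Direct computation, equation (9)
diff = R @ beta_hat - r
W = diff @ np.linalg.solve(R @ cov @ R.T, diff)

# Two-step computation, equations (11)-(13)
A = np.linalg.cholesky(R @ cov @ R.T)              # A A' = R cov R'
z = np.linalg.solve(A, diff)                       # equation (11)
print(W, z @ z, stats.chi2.sf(W, df=2))            # z'z equals W
```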

For the specific case of a linear regression model with IID disturbances and zero restrictions on some of the parameters, Wald tests are very closely related to t tests and F tests. In fact, the square of the t statistic is a Wald statistic, and r times the F statistic is a Wald statistic. However, Wald tests are asymptotically valid for a very wide class of models that is far more general than the class of linear regression models with IID disturbances. Equation (9) would still define a Wald statistic for the hypothesis (6) if the vector β̂ were any root-n consistent estimator and V̂ar(β̂) were any consistent estimator of its covariance matrix.

Asymptotic tests cannot be exact in finite samples, because they are necessarily based on P values, or critical values, that are approximate. By itself, asymptotic theory cannot tell us just how accurate such tests are. If we decide to use a nominal level of α for a test, we reject whenever the asymptotic P value is less than α. In many cases, but certainly not all, asymptotic tests are probably quite accurate, committing Type I errors with probability reasonably close to α. They may either overreject, that is, reject the null hypothesis more than 100α% of the time when it is true, or underreject, that is, reject the null hypothesis less than 100α% of the time. Whether they overreject or underreject, and how severely, depends on many things, including the sample size, the distribution of the disturbances, the number of regressors and their properties, the number of restrictions, and the relationship between the disturbances and the regressors.

In many cases, bootstrap tests are more accurate than asymptotic tests. It is therefore often a good idea to compute both bootstrap and asymptotic P values for a given test statistic. Ideally, both P values will be very similar. In that case, we can be pretty confident that both tests are reliable.

2. Cluster-Robust Inference

In many areas of applied econometrics, it has become obligatory to allow for disturbances that may be “clustered” at the city, province, or state level. The first edition of ETM fails to discuss this important topic.

Consider the linear regression model

     ⎡ y1 ⎤            ⎡ X1 ⎤     ⎡ u1 ⎤
y ≡  ⎢ y2 ⎥ = Xβ + u ≡ ⎢ X2 ⎥ β + ⎢ u2 ⎥ ,   (14)
     ⎢ ⋮  ⎥            ⎢ ⋮  ⎥     ⎢ ⋮  ⎥
     ⎣ yG ⎦            ⎣ XG ⎦     ⎣ uG ⎦


where the data are divided into G clusters, indexed by g. The gth cluster has ng observations, and the entire sample has n = ∑_{g=1}^{G} ng observations. The matrix X and the vectors y and u have n rows, X has k columns, and the parameter vector β has k elements. Least squares estimation of equation (14) yields OLS estimates β̂ and residuals û. The disturbances are assumed to be uncorrelated across clusters but potentially correlated and heteroskedastic within clusters, so that

E(ug ug⊤) = Ωg, g = 1, . . . , G, (15)

where the ng × ng covariance matrices Ωg are unknown. Thus the covariance matrix Ω of the entire vector u is assumed to be block diagonal, with the matrices Ωg forming the diagonal blocks.

For the model (14), the covariance matrix of β̂ is

(X⊤X)⁻¹X⊤ΩX(X⊤X)⁻¹ = (X⊤X)⁻¹(∑_{g=1}^{G} Xg⊤ΩgXg)(X⊤X)⁻¹. (16)

Notice that the matrix in the middle of the sandwich is actually the sum of G matrices, one for each cluster.

2.1 Why Clustering Matters

The matrix (16) can be very different from both the classical covariance matrix σ²(X⊤X)⁻¹ and sandwich covariance matrices that allow for heteroskedasticity but not clustering. Just how different it is depends on the regressors, the cluster sizes, and the intra-cluster correlations. When the clusters are large, (16) can be very much larger than the conventional covariance matrix, even when the intra-cluster correlations are very small.

The simplest and most popular way to model intra-cluster correlation is the random effects model

ugi = vg + εgi, vg ∼ IID(0, σv²), εgi ∼ IID(0, σε²), (17)

where vg is a random variable that affects every observation in cluster g and no observation in any other cluster. This model implies that

Var(ugi) = σv² + σε² and Cov(ugi, ugj) = σv², (18)

so that

ρg ≡ Corr(ugi, ugj) = σv²/(σv² + σε²). (19)

Thus all the intra-cluster correlations are equal to ρg.

There has been a good deal of analysis of this special case; see Kloek (1981), Moulton (1986, 1990), and Angrist and Pischke (2008). Suppose for simplicity that the model contains only a constant and one regressor, with coefficient β2, where the value of the regressor is fixed within each cluster. If ng is the same for every cluster, then it can be shown that

Var(β̂2)/Varc(β̂2) = 1 + (ng − 1)ρg, (20)

where Var(β̂2) is the true variance of β̂2 based on the matrix (16), and Varc(β̂2) is the incorrect variance based on the conventional OLS covariance matrix σ²(X⊤X)⁻¹; see Angrist and Pischke (2008, page 310). The square root of the right-hand side of equation (20) is sometimes called the Moulton factor. A very simple, but not very accurate, way to “correct” conventional standard errors is to multiply them by an estimate of the Moulton factor.

More generally, the ratio of the correct variance to the conventional one looks like (20), but with ng − 1 replaced by a more complicated function of the cluster sizes and the intra-cluster correlations of the regressors. For a given sample size and set of regressors, the ratio is greatest when all the ng are the same.

It is clear from (20) that the true variance of β̂2 can be very much greater than the incorrect, conventional one when ng is large, even if ρg is quite small. For example, if ρg = 0.05, the true variance will be twice the conventional one when ng = 21, four times the conventional one when ng = 61, and 25 times the conventional one when ng = 481. In practice, clusters often have thousands or even tens of thousands of observations, so that conventional standard errors may easily understate the true ones by factors of ten or more.
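The arithmetic behind these numbers is just equation (20); a two-line check:

```python
# Variance inflation from equation (20): 1 + (n_g - 1) * rho
rho = 0.05
for n_g in (21, 61, 481):
    print(n_g, 1 + (n_g - 1) * rho)   # prints 2.0, 4.0, 25.0
```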

The most obvious way to solve this problem is to include group fixed effects so as to remove the vg from the disturbances, leaving only the εgi. However, that is impossible if any of the regressors of interest does not vary within clusters. In that case, the fixed effects will explain all the variation in the regressor(s) of interest, so that we cannot identify the coefficient(s) we are interested in. Unfortunately, this is a very common situation. It occurs whenever certain regressors, such as macroeconomic variables, are measured at the group level, and it occurs whenever we are interested in the effects of laws or policies that affect entire groups. This was precisely the situation that motivated Kloek (1981).

Even when it is possible to include group fixed effects, they may not solve the problem, because there is no reason to believe that intra-cluster correlations actually arise from the random effects model (17). In practice, they are probably far more complicated. Building on an approach pioneered by Bertrand, Duflo, and Mullainathan (2004), MacKinnon and Webb (2015) provides strong evidence for this. They perform “placebo laws” experiments, in which they use real data to estimate wage equations for the United States, clustering by state and including state fixed effects. Other regressors are age, age squared, education dummies, and time fixed effects.

The regressor of interest is a dummy variable that takes the value 1 for a number of randomly chosen states after a randomly chosen year (which differs by state), as if those states had introduced laws in various years. A different dummy variable is used for each replication. On average, since these dummies are totally random, they should have no effect.

There are 547,518 observations and 51 clusters, since the District of Columbia is treated as a state. Cluster sizes vary from 5,418 to 42,625. If state fixed effects fully accounted for intra-cluster correlation, then it would be valid to make inferences based on a covariance matrix that is robust to heteroskedasticity of unknown form but not to clustering. We should reject the null hypothesis about 5% of the time. It is evident from Figure 1 that this is not the case! The horizontal axis shows G1, the number of “treated” states, and the vertical axis shows rejection frequencies. Except for G1 = 1, where there are only 51 × 11 = 561 possible cases, results are based on 100,000 replications.

[Figure 1: Rejection frequencies for tests of placebo laws. The horizontal axis shows G1, the number of treated states; the vertical axis shows the rejection rate. Curves plot HCCME and CRVE rejection frequencies at the .10, .05, and .01 levels.]

2.2 Cluster-Robust Covariance Matrix Estimators

When we cannot include fixed effects, or including them does not solve the problem of intra-cluster correlations, we need to use a cluster-robust variance estimator, or CRVE, which is an estimator V̂ar(β̂) that is robust to heteroskedasticity and clustering of unknown form. The idea dates back to Liang and Zeger (1986), but it did not become popular in econometrics until Stata added the “cluster” option in the 1990s, and econometricians did not start to study it until after that.


The most widely-used CRVE is

CV1 ≡ (G(n − 1)/((G − 1)(n − k))) (X⊤X)⁻¹(∑_{g=1}^{G} Xg⊤ûgûg⊤Xg)(X⊤X)⁻¹, (21)

where ûg is the vector of OLS residuals for cluster g. Notice that each of the k × k matrices within the summation has rank one, because it is equal to the column vector Xg⊤ûg times its transpose. This implies that the rank of the CRVE itself cannot exceed G, making it impossible to test more than G restrictions at once.

Notice that, when ng = 1 for all g and the factor of G/(G − 1) is omitted, expression (21) reduces to the familiar HC1 matrix. So this CRVE is actually a generalization of that HCCME. There exist CV2 and CV3 matrices which similarly generalize HC2 and HC3, although at present these are rarely used, in part because they can be difficult or impossible to compute when some of the ng are large.

The degrees-of-freedom adjustment in (21) may seem odd. That is because it is really the product of two adjustments, one based on the numbers of observations and coefficients, and one based on the number of clusters. It is customary to base inferences on the t distribution with G − 1 degrees of freedom, because what matters for asymptotic analysis is the number of clusters, not the number of observations; see Donald and Lang (2007) and Bester, Conley, and Hansen (2011). When G is small and n is large, this can result in substantially larger critical values than using n − k degrees of freedom.

Conventional inference based on (21) seems to work well if

1. the number of clusters is reasonably large,

2. cluster sizes do not vary too much,

3. the disturbances are approximately homoskedastic across clusters, and

4. the number of “treated” clusters is not too small.

The last point applies whenever the regressor of interest is a dummy variable. If such a variable only takes the value 1 for some or all of the observations within a few clusters, then CRVE t statistics can overreject severely; see Figure 1. It is very common to estimate regressions in which the regressor of interest is a dummy variable; this is always the case for “difference-in-differences” regressions (see below). If we run a DiD regression with only a few treated clusters and use a CRVE, we are very likely to obtain a test statistic that is misleadingly large.

The reasons why cluster-robust inference fails when the number of treated clusters is small are explained in MacKinnon and Webb (2015). Trying to estimate the effect of a dummy variable that affects only a few clusters is a bit like trying to estimate the effect of a dummy variable that affects only a few observations when using an HCCME. In both cases, the residuals for the treated clusters or observations do a very poor job of estimating the disturbances for those clusters or observations.


2.3 The Wild Bootstrap

The wild bootstrap was originally designed for regression models with heteroskedasticity of unknown form. Consider the linear regression model

yi = Xiβ + ui = X1iβ1 + β2x2i + ui, (22)

where

Var(ui) = σi², i = 1, . . . , n. (23)

We wish to test the hypothesis that β2 = 0. We can base inference on the HCCME

(X⊤X)⁻¹(∑_{i=1}^{n} f²(ûi)Xi⊤Xi)(X⊤X)⁻¹, (24)

where f(ûi) is one of several transformations designed to compensate for the fact that OLS residuals are too small; see ETM, page 200. The test statistic can be written as

τ(β̂2) = β̂2 / ([(X⊤X)⁻¹X⊤Ω̂X(X⊤X)⁻¹]22)¹⁄², (25)

in which β̂2, the OLS estimate of β2, is divided by the square root of the appropriate diagonal element of any suitable HCCME, such as HC1, HC2, or HC3, depending on precisely how Ω̂ is defined. Note that, unless X1 has just one column, this is not actually the second diagonal element. It is whatever diagonal element corresponds to the position of x2 within the matrix X.

Although HCCME-based inference is generally quite reliable, it can be improved, sometimes substantially, by using the wild bootstrap. To calculate a wild bootstrap P value for a t statistic based on (24), we first estimate the model (22) under the null hypothesis to obtain restricted estimates β̃1 and restricted residuals ũ. We then generate B bootstrap samples, using the DGP

y∗i = X1iβ̃1 + f(ũi)v∗i, (26)

which imposes the null hypothesis.

In (26), f(ũi) is a transformation of the ith residual ũi, and v∗i is a random variable with mean 0 and variance 1. The best choice for v∗i seems to be the Rademacher distribution

v∗i = −1 with probability 1/2, and v∗i = 1 with probability 1/2.

This distribution imposes symmetry on the bootstrap disturbances, which it is good to do if the disturbances actually are symmetric. Perhaps surprisingly, it seems to work well even when the disturbances are not symmetric. A natural choice for the transformation f(·) is

f(ũi) = ũi/(1 − hi)¹⁄², (27)

where the hi are now the diagonals of the hat matrix for the restricted model. Since this looks like the transformation used by HC2, we will refer to it as w2. Using (27) ensures that the f(ũi) must have constant variance whenever the disturbances are homoskedastic.

For each bootstrap sample, indexed by j, we calculate the bootstrap analog of the test statistic (25), which is

τ∗j(β̂∗2j) = β̂∗2j / ([(X⊤X)⁻¹X⊤Ω̂∗jX(X⊤X)⁻¹]22)¹⁄². (28)

Here β̂∗2j is the OLS estimate for the jth bootstrap sample, and X⊤Ω̂∗jX is computed in exactly the same way as X⊤Ω̂X in (25), except that it uses the residuals from the bootstrap regression.

After we have calculated the B values of τ∗j, we can calculate either a symmetric or an equal-tail P value in the usual way. The symmetric bootstrap P value is

p∗(τ) = (1/B) ∑_{j=1}^{B} I(|τ∗j| > |τ|). (29)

As usual, we reject whenever p∗(τ) < α. This procedure generally works very well. See MacKinnon (2012).
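Here is a self-contained sketch of the whole procedure (26)-(29), using HC1 in the denominators of (25) and (28) and the w2 transformation (27) with Rademacher draws. The function name and interface are mine; this is one reasonable implementation, not the only one.

```python
import numpy as np

def wild_bootstrap_pvalue(y, X1, x2, B=999, seed=None):
    """Symmetric wild bootstrap P value, equations (26)-(29), for beta_2 = 0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([X1, x2])
    k = X.shape[1]

    def hc1_t(yy):
        # HC1-based t statistic for the coefficient on x2, equation (25)
        b, *_ = np.linalg.lstsq(X, yy, rcond=None)
        u = yy - X @ b
        XtX_inv = np.linalg.inv(X.T @ X)
        meat = (X * ((n / (n - k)) * u ** 2)[:, None]).T @ X
        cov = XtX_inv @ meat @ XtX_inv
        return b[-1] / np.sqrt(cov[-1, -1])

    tau = hc1_t(y)

    # Restricted estimates under beta_2 = 0, and the w2 transformation (27)
    b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
    u_tilde = y - X1 @ b1
    h = np.einsum('ij,ij->i', X1 @ np.linalg.inv(X1.T @ X1), X1)
    f_u = u_tilde / np.sqrt(1.0 - h)

    taus = np.empty(B)
    for j in range(B):
        v = rng.choice([-1.0, 1.0], size=n)    # Rademacher draws
        taus[j] = hc1_t(X1 @ b1 + f_u * v)     # DGP (26)

    return np.mean(np.abs(taus) > np.abs(tau))  # equation (29)
```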

2.4 The Wild Cluster Bootstrap

The wild cluster bootstrap, proposed in Cameron, Gelbach, and Miller (2008), is an ingenious generalization of the ordinary wild bootstrap to handle clustered disturbances. For a test based on restricted estimates, the bootstrap DGP is

y∗jgi = X1giβ̃1 + ũgiv∗jg, g = 1, . . . , G, (30)

where y∗jgi is the observation on the bootstrap dependent variable for observation i in cluster g, and v∗jg is a random variable drawn from the Rademacher distribution or some other auxiliary distribution with mean 0 and variance 1.

The idea of the wild cluster bootstrap is very simple. For the ordinary wild bootstrap, the residual associated with each observation is multiplied by an auxiliary random variable that has mean 0 and variance 1. For the wild cluster bootstrap, the residuals associated with all the observations in a given cluster are multiplied by the same auxiliary random variable. This ensures that the bootstrap DGP mimics both the intra-cluster correlations and the heteroskedasticity of the residuals.

The bootstrap test statistic based on (30), called the wild cluster restricted, or WCR, bootstrap, is

t∗j = (β̂∗j2 − β20)/se(β̂∗j2), (31)

where β̂∗j2 is the estimate for the jth bootstrap sample, and se(β̂∗j2) is a cluster-robust standard error based on the same formula as the one for the original CRVE t statistic.

It is also possible to use a wild cluster bootstrap DGP that does not impose any restrictions:

y∗jgi = Xgiβ̂ + ûgiv∗jg, g = 1, . . . , G. (32)

In this case, the wild cluster unrestricted, or WCU, bootstrap test statistic is

t∗j = (β̂∗j2 − β̂2)/se(β̂∗j2). (33)

[Figure 2: Rejection frequencies for WC bootstrap tests of placebo laws. The horizontal axis shows G1; the vertical axis shows the rejection rate. Curves plot WCR and WCU bootstrap rejection frequencies at the .10, .05, and .01 levels.]

Figure 2 shows rejection frequencies for the two varieties of wild cluster bootstrap test, WCR and WCU, for the same placebo laws data as Figure 1. They are based on 10,000 replications.


Both procedures work well for sufficiently large values of G1. WCR works well for G1 ≥ 6, but it underrejects very severely for very small values of G1. WCU works quite well for G1 ≥ 15, although it always tends to overreject a bit. It overrejects very severely in the cases where WCR underrejects, and it overrejects fairly substantially for moderate values of G1 where WCR performs well.

2.5 Difference in Differences

Suppose that a new policy comes into effect in one or more jurisdictions (such as countries, states, provinces, or cities) at one or more points in time. Economists may be interested in seeing what effect, if any, the policy had on some variable of interest. The problem is to disentangle the effects of the policy from other changes across time or across jurisdictions. One method that is commonly used is difference-in-differences, or DiD, regression.

To begin with, suppose there are only two jurisdictions, indexed by g, and two time periods, indexed by t, so that ygti denotes the dependent variable for the ith unit (for example, an individual, a household, or a firm) within jurisdiction g at time t. If E(ygti) could vary arbitrarily across both jurisdictions and time periods, there would be no way to identify the effects of the policy. Therefore, we make the (arguably quite strong) assumption that, in the absence of the policy,

ygti = ηg + λt + ugti, (34)

where ηg is a jurisdiction fixed effect and λt is a time fixed effect. This assumption is by no means innocuous, since it imposes a common jurisdiction fixed effect ηg on all time periods and a common time fixed effect, or common trend, λt on all jurisdictions.

Suppose further that the effect of the policy is to shift E(ygti) by a constant δ. Let the two jurisdictions be denoted a and b and the two time periods 1 and 2, and suppose that the policy is imposed only in jurisdiction b in period 2. Then we have

ya1i = ηa + λ1 + ua1i,    ya2i = ηa + λ2 + ua2i,
yb1i = ηb + λ1 + ub1i,    yb2i = ηb + λ2 + δ + ub2i. (35)

Let ȳgt and ūgt denote the average values of the ygti and the ugti, respectively, for g = a, b and t = 1, 2. Then equation (34) and our assumption about the effect of the policy imply that

ȳa2 − ȳa1 = λ2 − λ1 + (ūa2 − ūa1), (36)

and

ȳb2 − ȳb1 = δ + λ2 − λ1 + (ūb2 − ūb1). (37)

Therefore,

(ȳb2 − ȳb1) − (ȳa2 − ȳa1) = δ + (ūb2 − ūb1) − (ūa2 − ūa1). (38)


The quantity on the left of this equation is the difference between two differences, ȳb2 − ȳb1 and ȳa2 − ȳa1. Thus it is a difference in differences. The quantity on the right is the parameter we want to estimate, δ, plus a linear combination of the disturbances.

Instead of actually computing the four averages that appear on the left-hand side of equation (38) and using them to calculate the difference in differences, we can simply estimate a regression model. In this case, the model is just

ygti = β1 + β2Dᵇgti + β3D²gti + δDᵇgtiD²gti + ugti, (39)

where Dᵇgti is a dummy variable that equals 1 if g = b and 0 otherwise, and D²gti is a dummy variable that equals 1 if t = 2 and 0 otherwise. The first three coefficients are related to the coefficients in equations (35) as follows:

β1 = ηa + λ1, β2 = ηb − ηa, β3 = λ2 − λ1. (40)

The coefficient of interest is, of course, δ, which measures the effect of treatment on jurisdiction b in period 2.
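It is easy to verify numerically that the interaction coefficient in regression (39) reproduces the difference in differences of cell means in (38). The simulated data below, and all names, are mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated 2x2 example: jurisdictions a and b, periods 1 and 2, delta = 2
g_arr = np.repeat(['a', 'a', 'b', 'b'], 100)
t_arr = np.tile(np.repeat([1, 2], 100), 2)
eta = np.where(g_arr == 'b', 3.0, 1.0)
lam = np.where(t_arr == 2, 0.5, 0.0)
treated = (g_arr == 'b') & (t_arr == 2)
y = eta + lam + 2.0 * treated + rng.normal(size=len(g_arr))

# Difference in differences of cell means, equation (38)
m = {(g, t): y[(g_arr == g) & (t_arr == t)].mean()
     for g in ('a', 'b') for t in (1, 2)}
did = (m['b', 2] - m['b', 1]) - (m['a', 2] - m['a', 1])

# Regression (39): constant, D^b, D^2, and their product
Db = (g_arr == 'b').astype(float)
D2 = (t_arr == 2).astype(float)
X = np.column_stack([np.ones(len(y)), Db, D2, Db * D2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(did, beta[3])   # identical estimates of delta
```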

Although there are studies that use the difference-in-differences methodology with just two jurisdictions and two time periods (Card and Krueger (1994) is a very famous one), it seems to be impossible to allow for clustered disturbances in that case. If we are to make valid inferences that allow for clustering, it is essential to have at least a moderate number of jurisdictions, both treated and untreated.

In the general case, there are G ≥ 2 clusters, of which G1 are treated in at least some of the T time periods. It is then possible to estimate a regression like

ygti = β1 + ∑_{j=2}^{G} βjDJʲgti + ∑_{k=1}^{T−1} βG+kDTᵏgti + δTRgti + ugti, (41)

where the DJʲgti are jurisdiction dummies that equal 1 when g = j, the DTᵏgti are time dummies that equal 1 when t = k, and TRgti is a treatment dummy that equals 1 when g and t correspond to a jurisdiction that is treated in that time period. Note that it would be impossible to estimate equation (41) if any jurisdiction were treated in every period, because there would be perfect collinearity between the jurisdiction dummies and the treatment dummy. If every jurisdiction were either treated in every period or not treated in every period, all of the jurisdiction dummies would have to be dropped.

For a more detailed discussion of the DiD methodology, see Angrist and Pischke (2008, Chapter 5).


3. Nonlinear Least Squares

There are some aspects of the treatment of NLS in ETM that are a bit more complicated than necessary. These notes supplement and replace some of the material in Chapter 6.

3.1 Properties of NLS Estimates

In order to obtain NLS estimates, we minimize the SSR

(y − x(β))⊤(y − x(β)). (42)

Doing so yields the first-order conditions

X⊤(β̂)(y − x(β̂)) = 0. (43)

Of course, we have omitted a factor of −2 here, so that these first-order conditions look like moment conditions. They tell us that X(β̂), the matrix of the derivatives of x(β) with respect to β, evaluated at β̂, must be orthogonal to the vector of NLS residuals.

The left-hand side of the moment conditions (43), multiplied by 1/n and viewed as a function of β, is

n⁻¹X⊤(β)(y − x(β)). (44)

Since y − x(β0) = u, a first-order Taylor expansion around β0 yields

n⁻¹X0⊤u − n⁻¹X0⊤X0(β − β0) + (n⁻¹ ∑_{t=1}^{n} At(β0)ut)(β − β0), (45)

where, as usual, X0 ≡ X(β0), and At(β) is a k × k matrix with typical element ∂²xt(β)/∂βi∂βj.

The third term in (45) arises when we differentiate X⊤(β) with respect to β. Since xt(β0) and its derivatives belong to Ωt (the information set at time t), they must be independent of ut. Therefore, because E(ut) = 0, each element of At(β0)ut must have expectation zero. It follows that, if the At satisfy reasonable regularity conditions which allow a law of large numbers to apply to the matrix in large parentheses in the third term of (45), that matrix must tend to zero as n → ∞. In contrast, the matrix n⁻¹X0⊤X0 in the second term tends to a finite, positive definite matrix. Therefore, the third term in (45) must be asymptotically negligible relative to the second term.

Because we can ignore the third term in (45), the moment conditions (43) are asymptotically equivalent to

n⁻¹⁄²X0⊤u − n⁻¹X0⊤X0 n¹⁄²(β̂ − β0) ᵃ= 0. (46)


Note that we have multiplied the first two terms in (45) by n¹⁄² here. Equation (46) is analogous to equation (6.21) in ETM, but the latter applies to a general class of method of moments estimators rather than to NLS.

Solving (46), we find that

n¹⁄²(β̂ − β0) ᵃ= (n⁻¹X0⊤X0)⁻¹n⁻¹⁄²X0⊤u. (47)

The asymptotic normality of the vector n¹⁄²(β̂ − β0) follows directly from this, and the asymptotic covariance matrix can easily be obtained by taking the plim of the right-hand side of (47) times itself transposed. We find that

Var(n¹⁄²(β̂ − β0)) ᵃ= (plim_{n→∞} n⁻¹X0⊤X0)⁻¹(plim_{n→∞} n⁻¹X0⊤uu⊤X0)(plim_{n→∞} n⁻¹X0⊤X0)⁻¹. (48)

What the middle factor here is equal to depends on the covariance matrix of the error vector u. When E(uu⊤) = σ²I, it is easy to see that

plim_{n→∞} n⁻¹X0⊤uu⊤X0 = σ²(plim_{n→∞} n⁻¹X0⊤X0). (49)

Thus we find that

Var(n¹⁄²(β̂ − β0)) ᵃ= σ²(plim_{n→∞} n⁻¹X0⊤X0)⁻¹. (50)

More generally, when E(uu⊤) = Ω,

plim_{n→∞} n⁻¹X0⊤uu⊤X0 = plim_{n→∞} n⁻¹X0⊤ΩX0, (51)

and we conclude that

Var(n¹⁄²(β̂ − β0)) ᵃ= (plim_{n→∞} n⁻¹X0⊤X0)⁻¹(plim_{n→∞} n⁻¹X0⊤ΩX0)(plim_{n→∞} n⁻¹X0⊤X0)⁻¹. (52)

Of course, we cannot actually evaluate (50) and (52), but we can estimate them after removing the factors of 1/n. We conclude that, when the error terms are IID,

V̂ar(β̂) = s²(X̂⊤X̂)⁻¹, (53)

where X̂ ≡ X(β̂). When the error terms are not IID,

V̂ar(β̂) = (X̂⊤X̂)⁻¹X̂⊤Ω̂X̂(X̂⊤X̂)⁻¹. (54)


Here Ω̂ is an estimate of Ω. It may even be an inconsistent estimator, as in the case of heteroskedasticity of unknown form. In that case, it is an n × n diagonal matrix with possibly rescaled squared residuals on the diagonal.
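To make Section 3.1 concrete, here is a sketch that estimates an illustrative nonlinear model (the model, data, and all names are mine, purely for illustration) with SciPy and then forms both covariance estimates (53) and (54):

```python
import numpy as np
from scipy.optimize import least_squares

# Illustrative nonlinear model x_t(beta) = beta_1 (1 - exp(-beta_2 t))
rng = np.random.default_rng(3)
t = np.linspace(0.1, 10.0, 200)
y = 2.0 * (1.0 - np.exp(-0.7 * t)) + 0.1 * rng.normal(size=len(t))

def x(beta):   # regression function
    return beta[0] * (1.0 - np.exp(-beta[1] * t))

def X(beta):   # matrix of derivatives of x(beta) with respect to beta
    return np.column_stack([1.0 - np.exp(-beta[1] * t),
                            beta[0] * t * np.exp(-beta[1] * t)])

fit = least_squares(lambda b: y - x(b), x0=[1.0, 1.0])  # NLS minimizes (42)
beta_hat = fit.x
u_hat = y - x(beta_hat)
Xh = X(beta_hat)

# Equation (53): IID case
s2 = u_hat @ u_hat / (len(t) - 2)
var_iid = s2 * np.linalg.inv(Xh.T @ Xh)

# Equation (54) with a diagonal Omega-hat of squared residuals
XtX_inv = np.linalg.inv(Xh.T @ Xh)
var_robust = XtX_inv @ (Xh * (u_hat ** 2)[:, None]).T @ Xh @ XtX_inv
print(np.sqrt(np.diag(var_iid)), np.sqrt(np.diag(var_robust)))
```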

3.2 Tests Based on the Gauss-Newton Regression

In the vast majority of cases, tests based on the GNR involve evaluating it at restricted estimates. The test regression can therefore be written as

y − x̃ = X̃b + residuals, (55)

where x̃ ≡ x(β̃) and X̃ ≡ X(β̃). If we partition the parameters into two groups, a k1-vector β1 that is estimated under the null and a k2-vector β2 that is set to zero under the null, and partition the matrix X̃ in the same way, then (55) can be rewritten as

y − x̃ = X̃1b1 + X̃2b2 + residuals. (56)

The ordinary F statistic for b2 = 0 is a perfectly valid test statistic. It will be asymptotically distributed as F(k2, n − k). Another valid test statistic is nR² from regression (56), which will be asymptotically distributed as χ²(k2).
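A sketch of the nR² version, assuming the restricted NLS estimates are already in hand; the function name is mine. The uncentered R² is used here, which coincides with the centered one whenever the restricted model effectively contains a constant, because the restricted first-order conditions make the regressand orthogonal to X̃1.

```python
import numpy as np

def gnr_nR2(y, x_tilde, X_tilde):
    """n times the uncentered R^2 from the GNR (56).

    y       : dependent variable
    x_tilde : x(beta) evaluated at the restricted NLS estimates
    X_tilde : the full derivative matrix [X1, X2] at those estimates
    """
    e = y - x_tilde                                # regressand of (55)
    b, *_ = np.linalg.lstsq(X_tilde, e, rcond=None)
    fitted = X_tilde @ b
    return len(y) * (fitted @ fitted) / (e @ e)    # compare to chi2(k2)
```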

The heteroskedasticity-robust Gauss-Newton regression, or HRGNR, discussed in the text is unnecessarily complicated. A simpler one is

ι = Ũ MX̃1 X̃2 b2 + residuals, (57)

where Ũ is an n × n diagonal matrix with ũt as its tth diagonal element, and MX̃1 is the matrix that takes residuals from a regression on X̃1. The test statistic for the null hypothesis that β2 = 0 is n minus the SSR from regression (57).

In many cases, we can compute heteroskedasticity-robust, or cluster-robust, test statistics without using the HRGNR. We simply need to run the ordinary GNR, either equation (55) or its unrestricted equivalent, and ask the regression package to compute heteroskedasticity-robust or cluster-robust standard errors. To test a single restriction, we just use the reported t statistic. To test two or more restrictions, we need to compute the appropriate Wald test.

4. Instrumental Variables

Suppose there is just one endogenous right-hand-side variable. In this case, we can write the complete model as a two-equation system:

y1 = βy2 + Zγ + u, and (58)

y2 = Zπ1 + W2π2 + v. (59)

Here y1 and y2 are n-vectors of observations on endogenous variables, Z is an n × k matrix of observations on exogenous or predetermined variables, and W = [Z W2] is an n × l matrix of instruments. In the context of simultaneous equations models, the first equation here is called a structural equation, and the second is called a reduced-form equation.

What we are interested in is equation (58), the structural equation. In particular, we are interested in β, the coefficient on y2. It must be the case that l ≥ k + 1, since otherwise the IV estimator would not exist. When l = k + 1, equation (58) is said to be exactly identified. When l > k + 1, it is said to be overidentified. The number of overidentifying restrictions is l − k − 1.

The reduced-form equation (59) is simply used to obtain the fitted values

ŷ2 ≡ PWy2 ≡ Zπ̂1 + W2π̂2, (60)

which are then used to compute the IV (or two-stage least squares) estimates as

⎡ β̂IV ⎤
⎣ γ̂IV ⎦ = (X⊤PWX)⁻¹X⊤PWy1, (61)

where

X ≡ [y2 Z] and PWX = [PWy2 Z]. (62)

As in Chapter 8, the estimated covariance matrix of the IV estimates is given by

V̂ar([β̂IV γ̂IV]⊤) = σ̂u²(X⊤PWX)⁻¹. (63)

Thus we could test hypotheses about β by using the t statistic

(β̂IV − β0) / (σ̂u([(X⊤PWX)⁻¹]11)¹⁄²), (64)

where the denominator of (64) is just the standard error of β̂IV that a regression package will print. Similarly, we could construct confidence intervals for β as β̂IV plus or minus some number of standard errors. But these tests and confidence intervals will only be useful if the asymptotic distribution of β̂IV provides a good approximation in finite samples.

Whether or not the distribution of the IV estimator β̂IV is well approximated by its asymptotic distribution depends on several things:

• the number of overidentifying restrictions, which is l − k − 1, or, equivalently, the number of excluded instruments, which is l − k;

• the correlation ρ between the elements of u and v; and

• the weakness of the instruments.

One way to see how weak or strong the instruments are is to calculate the F statistic for π2 = 0 in equation (59). This F statistic has l − k and n − l degrees of freedom.
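The following sketch computes the IV estimates (61), the covariance matrix (63), and this first-stage F statistic. The function name and interface are mine, and σ̂u² is computed here by dividing by n, one common convention.

```python
import numpy as np

def iv_estimates(y1, y2, Z, W2):
    """2SLS estimates (61), their covariance (63), and the first-stage F.

    y1: dependent variable; y2: endogenous regressor;
    Z : included exogenous variables; W2: excluded instruments.
    """
    n = len(y1)
    W = np.column_stack([Z, W2])
    X = np.column_stack([y2, Z])
    PW_X = W @ np.linalg.lstsq(W, X, rcond=None)[0]     # P_W X
    coef = np.linalg.solve(PW_X.T @ X, PW_X.T @ y1)     # equation (61)
    u = y1 - X @ coef
    sigma2 = u @ u / n
    cov = sigma2 * np.linalg.inv(X.T @ PW_X)            # equation (63)

    # First-stage F statistic for pi_2 = 0 in equation (59)
    u_r = y2 - Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]
    u_u = y2 - W @ np.linalg.lstsq(W, y2, rcond=None)[0]
    q = W2.shape[1]                                     # q = l - k
    F = ((u_r @ u_r - u_u @ u_u) / q) / (u_u @ u_u / (n - W.shape[1]))
    return coef, cov, F
```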


We are not really interested in the hypothesis that π2 = 0, although if we cannot reject that hypothesis, the instruments must be very weak, and asymptotic analysis is likely to be very unreliable. Instead, we are using the F statistic to estimate something called the concentration parameter, which depends on both the strength of the instruments and the size of the sample. In order for asymptotic analysis to be reliable, the concentration parameter needs to be quite large.

Just how big the F statistic needs to be is unclear. Staiger and Stock (1997) suggests that it should be greater than 10. Stock and Yogo (2005) provides more detailed but more complicated advice, which depends on l − k and on whether we are concerned about the bias of β̂IV or about the size of t tests for β = 0. To keep relative bias below 10%, the F statistic needs to exceed a number between roughly 9.0 and 11.5. To avoid t tests rejecting more than 10% of the time at the .05 level, the F statistic needs to be larger, roughly 16.4 for l − k = 1 and 26.9 for l − k = 5.

I suspect that the Stock-Yogo numbers are too conservative, because they are designed for the worst possible case and do not take account of how much correlation there is between the elements of u and v.

As the number of overidentifying restrictions increases, the finite-sample distribution of β̂IV becomes more biased but less dispersed. The fundamental problem is that, as more and more columns are added to W, the latter does a better and better job of explaining y2, and PWy2 comes to resemble y2 itself more and more closely. This is an inevitable consequence of the tendency of OLS to fit too well. As a result, the IV estimator tends to become more and more biased as l − k increases.

[Figure 3: Distributions of OLS and IV estimates, n = 25. The figure plots the distribution functions of the OLS estimator and the IV0, IV3, and IV6 estimators against the true value of 1.]


Figure 3 provides an illustration of this. It shows the distribution functions of the OLS estimator and three different IV estimators in a simple case. The three IV estimators, which we will refer to as IV0, IV3, and IV6, have l − k equal to 0, 3, and 6, respectively. The quantity being estimated is the slope parameter from an equation with one endogenous regressor and a constant term; its true value is 1. The sample size is only 25, so as to make finite-sample biases very apparent.

The left-most curve in Figure 3 is the distribution function for OLS, which is severely biased downward in this case. The right-most curve is the one for IV0, which has approximately the right median but also has much more dispersion (we cannot say variance, since this estimator does not have a second moment) and much thicker tails than the OLS estimator. Indeed, among the 50,000 replications performed, there were several IV0 estimates larger than 1000 in absolute value! The distribution functions for IV3 and IV6 mostly lie between those for OLS and IV0 and have much thinner tails than the latter, with IV6 being closer to OLS than IV3, as the above argument predicts. Both these estimators are quite severely biased (remember that n = 25 here), although not nearly as much so as OLS, and both evidently have more variance than OLS, as evidenced by the less steep slopes of their distribution functions.

Which estimator is best depends on what criterion is used to choose among them. If one looks only at the median, IV0 is clearly the best. On the other hand, if one uses mean squared error as a criterion, IV0 is clearly the worst: because it has no first or higher moments, its mean squared error is infinite. Based on most criteria, the choice would be between IV3 and IV6. For a large enough sample size, the latter would of course be preferable, since its greater bias vanishes as n increases, while its smaller variance relative to IV3 persists.

[Plot omitted in this extraction: empirical distribution functions of the IV6 estimates for n = 25, n = 100, and n = 500, with the true value of 1 marked; horizontal axis from 0.00 to 2.00, vertical axis from 0.0 to 1.0.]

Figure 4. Distributions of IV6 estimates for several sample sizes


The effect of increasing the sample size is shown clearly in Figure 4, which shows the distribution of IV6 for n = 25, n = 100, and n = 500. As n increases, both the variance and the bias of the estimator decrease, as expected. However, even for n = 500 the bias is noticeable: about 58% of the estimates are greater than the true value of 1.

It should be stressed that Figures 3 and 4 apply only to a particular example, in which the instruments happen to be quite good ones. In other cases, especially when the instruments are weak, so that they have little ability to explain the endogenous regressor(s), IV estimators may be extremely inefficient, and their finite-sample distributions may be very different from their asymptotic ones.
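Experiments of this kind are easy to replicate. The following minimal sketch, in Python, mimics the structure of Figures 3 and 4. The data-generating process (error correlation of 0.8, one relevant instrument plus irrelevant-but-valid extra instruments) and all function names are our assumptions for illustration; this is not the design actually used to draw the figures.

import numpy as np

rng = np.random.default_rng(852)

def simulate_iv(n=25, n_over=0, n_rep=10000, beta=1.0, rho=0.8):
    """Return OLS and generalized IV slope estimates; n_over = l - k."""
    ols = np.empty(n_rep)
    iv = np.empty(n_rep)
    for r in range(n_rep):
        Z = rng.standard_normal((n, 1 + n_over))   # exogenous instruments
        u2 = rng.standard_normal(n)
        u1 = rho * u2 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
        y2 = Z[:, 0] + u2                # reduced form: only Z[:, 0] matters
        y1 = 1.0 + beta * y2 + u1        # structural equation, true slope 1
        X = np.column_stack([np.ones(n), y2])   # regressors, k = 2
        W = np.column_stack([np.ones(n), Z])    # instruments, l = 2 + n_over
        ols[r] = np.linalg.lstsq(X, y1, rcond=None)[0][1]
        PWX = W @ np.linalg.lstsq(W, X, rcond=None)[0]       # P_W X
        iv[r] = np.linalg.lstsq(PWX, y1, rcond=None)[0][1]   # generalized IV
    return ols, iv

# IV0, IV3, and IV6 at n = 25; rerun with n = 100 or 500 to mimic Figure 4.
for n_over in (0, 3, 6):
    _, iv = simulate_iv(n=25, n_over=n_over)
    print(n_over, np.median(iv))

Plotting the empirical distribution functions of the ols and iv draws for each case should produce pictures qualitatively similar to Figures 3 and 4.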

5. The Generalized Method of Moments

There is more than one way to derive the efficient GMM estimator for a linear regression model with heteroskedasticity and/or serial correlation of unknown form. Here we discuss an approach that is a bit simpler than the one in ETM. The model we wish to estimate is

y = Xβ + u, E(W⊤u) = 0, E(uu⊤) = Ω, (65)

where X is an n × k matrix of regressors, and W is an n × l matrix of instruments, with l ≥ k.

Suppose we start with the moment conditions

E(W⊤(y − Xβ)) = 0. (66)

Unless l = k, there are too many of these, so we cannot set all their sample analogs to zero. Since the covariance matrix of the vector W⊤u is W⊤ΩW, it makes sense to minimize the GMM criterion function

Q(β, y) ≡ (y − Xβ)⊤W(W⊤ΩW)−1W⊤(y − Xβ). (67)

For the moment, we treat Ω as known, but that unrealistic assumption will be relaxed later. The matrix (W⊤ΩW)−1 in (67) is simply a way of weighting the various moment conditions. Observe that (67) would reduce to the IV criterion function if Ω were proportional to an identity matrix.
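To make the weighting concrete, here is a direct transcription of (67) into Python; the function name and argument names are ours, and Ω is treated as known, as in the text.

import numpy as np

def gmm_criterion(beta, y, X, W, Omega):
    """GMM criterion (67): a quadratic form in the l sample moments."""
    u = y - X @ beta                   # residuals at this trial beta
    g = W.T @ u                        # sample moments W'(y - X beta)
    A = W.T @ Omega @ W                # Var(W'u), the weighting matrix
    return g @ np.linalg.solve(A, g)   # g' A^{-1} g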

Differentiating (67) with respect to β yields the first-order conditions

X⊤W(W⊤ΩW)−1W⊤(y − Xβ) = 0.

These can be solved to yield

β̂GMM = (X⊤W(W⊤ΩW)−1W⊤X)−1X⊤W(W⊤ΩW)−1W⊤y. (68)
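Equation (68) can be computed directly. The sketch below solves linear systems instead of forming explicit inverses; again, the naming is ours and Ω is assumed known.

import numpy as np

def gmm_estimator(y, X, W, Omega):
    """Efficient GMM estimator (68) for known Omega."""
    A = W.T @ Omega @ W                        # W' Omega W  (l by l)
    M = (X.T @ W) @ np.linalg.solve(A, W.T)    # X'W (W'Omega W)^{-1} W'
    return np.linalg.solve(M @ X, M @ y)       # solves the FOCs M(y - X beta) = 0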


This is called the efficient GMM estimator, although it is not actually fully efficient. It is efficient among estimators based on the moment conditions (66). Observe that, if Ω is proportional to an identity matrix, β̂GMM reduces to (X⊤PWX)−1X⊤PWy, which is the generalized IV estimator.

In practice, we almost never know Ω, so we need to replace it by something else in order to obtain a feasible efficient GMM estimator. In the case of heteroskedasticity of unknown form, Ω̂ is just a diagonal matrix with squared (possibly rescaled) residuals on the diagonal, where the residuals typically come from a preliminary consistent estimator such as generalized IV. More generally, we can use a HAC estimator. There are many of these. What is required is that

plim (1/n)W⊤Ω̂W = plim (1/n)W⊤ΩW.

Whether W⊤Ω̂W will provide a reliable estimate of W⊤ΩW depends on many things, including the sample size, the particular HAC estimator that is used, and the matrix Ω. There is some discussion of HAC estimators in Section 9.3 of ETM. The classic papers of Andrews (1991) and Andrews and Monahan (1992) go well beyond this, however.
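As one concrete member of the HAC family, a Bartlett-kernel (Newey-West) estimate of (1/n)W⊤ΩW can be built from the sample autocovariances of the vectors utwt. The sketch below assumes a fixed, user-chosen bandwidth p; automatic bandwidth selection, as in Andrews (1991), is beyond this illustration.

import numpy as np

def hac_newey_west(W, u, p=4):
    """Bartlett-kernel HAC estimate of (1/n) W' Omega W."""
    n = len(u)
    Z = W * u[:, None]                  # row t is u_t * w_t
    S = Z.T @ Z / n                     # lag-0 term
    for j in range(1, p + 1):
        Gamma = Z[j:].T @ Z[:-j] / n    # lag-j sample autocovariance
        S += (1.0 - j / (p + 1.0)) * (Gamma + Gamma.T)
    return S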

In this way, we obtain the feasible efficient GMM estimator

β̂FGMM = (X⊤W(W⊤Ω̂W)−1W⊤X)−1X⊤W(W⊤Ω̂W)−1W⊤y, (69)

which also appears as equation (9.15) in ETM.
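In the pure heteroskedasticity case, the two-step version of (69) is straightforward to code. The sketch below obtains first-step residuals from generalized IV and uses their squares, unrescaled (an HC0-style choice that is our assumption), as the diagonal of Ω̂.

import numpy as np

def feasible_gmm(y, X, W):
    """Two-step feasible efficient GMM under heteroskedasticity; eq. (69)."""
    # Step 1: generalized IV, i.e. GMM with Omega proportional to I.
    PW = W @ np.linalg.solve(W.T @ W, W.T)           # projection onto S(W)
    beta1 = np.linalg.solve(X.T @ PW @ X, X.T @ PW @ y)
    u = y - X @ beta1                                # first-step residuals
    # Step 2: Omega-hat = diag(u_t^2).  W' Omega-hat W is computed
    # directly, without ever forming the n by n matrix Omega-hat.
    A = (W * (u**2)[:, None]).T @ W
    M = (X.T @ W) @ np.linalg.solve(A, W.T)
    return np.linalg.solve(M @ X, M @ y)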

5.1 A Regression Interpretation

The empirical analog of the moment conditions (66) is

W⊤(y − Xβ) = W⊤u,

or, equivalently,

W⊤y = W⊤Xβ + W⊤u. (70)

This looks like a linear regression. It is simply the regression y = Xβ + u with everything premultiplied by W⊤. Observe that this regression has l observations, and there are k coefficients to estimate.

If k = l, it is obvious that we can find a value of β that makes the residuals for regression (70) equal zero. It is just

β̂IV = (W⊤X)−1W⊤y, (71)

which is the simple IV estimator. Thus, when there are no overidentifying restrictions, there is only one GMM estimator, and it is the simple IV estimator. Because the residuals for regression (70) are zero,

W⊤y − W⊤Xβ̂IV = 0, (72)


which implies that the minimized value of the GMM criterion function (67) is also equal to zero.
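A quick numerical check, using arbitrary simulated data (all numbers below are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n = 200
W = np.column_stack([np.ones(n), rng.standard_normal(n)])             # l = 2
X = np.column_stack([np.ones(n), W[:, 1] + rng.standard_normal(n)])   # k = 2
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(n)

beta_iv = np.linalg.solve(W.T @ X, W.T @ y)   # simple IV estimator (71)
print(W.T @ (y - X @ beta_iv))                # (72): zero up to rounding error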

When l > k, we can simply estimate regression (70) by generalized least squares. We need to use GLS because the error terms W⊤u are not IID, even if the ut are. As we saw previously, W⊤ΩW is the covariance matrix of the vector W⊤u. Thus the GLS estimator is

β̂GLS = (X⊤W(W⊤ΩW)−1W⊤X)−1X⊤W(W⊤ΩW)−1W⊤y,

which is, of course, identical to β̂GMM. So we can interpret β̂GMM as the GLS estimator for equation (70), and we can interpret β̂FGMM from (69) as the feasible GLS estimator for that equation.
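This equivalence is easy to verify numerically. The sketch below, with an arbitrary known diagonal Ω, runs GLS on the l "observations" of regression (70); up to rounding error, the result reproduces the GMM estimator (68).

import numpy as np

rng = np.random.default_rng(1)
n, l = 200, 4
W = np.column_stack([np.ones(n), rng.standard_normal((n, l - 1))])
X = np.column_stack([np.ones(n), W[:, 1] + rng.standard_normal(n)])   # k = 2
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(n)
Omega = np.diag(rng.uniform(0.5, 2.0, n))      # an arbitrary known Omega

yt, Xt = W.T @ y, W.T @ X                      # regression (70): l observations
V = W.T @ Omega @ W                            # Var(W'u)
beta_gls = np.linalg.solve(Xt.T @ np.linalg.solve(V, Xt),
                           Xt.T @ np.linalg.solve(V, yt))
print(beta_gls)                                # matches beta-hat-GMM from (68)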

As noted, the number of overidentifying restrictions is l − k. These can be tested by using the minimized value of the GMM criterion function as a test statistic. It is asymptotically distributed as χ2(l − k) when the model is correctly specified and the instruments are valid. This is analogous to a Sargan test and is sometimes called a Hansen-Sargan test or, quite often, a J test. See Section 9.4 of ETM.
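A minimal sketch of the test, assuming the feasible efficient estimate β̂ and a consistent Ω̂ are already in hand (the helper name j_test is ours):

import numpy as np
from scipy.stats import chi2

def j_test(y, X, W, beta_hat, Omega_hat):
    """Hansen-Sargan (J) statistic and its asymptotic p-value."""
    g = W.T @ (y - X @ beta_hat)        # sample moments at beta-hat
    A = W.T @ Omega_hat @ W             # estimated Var(W'u)
    J = g @ np.linalg.solve(A, g)       # minimized criterion (67)
    df = W.shape[1] - X.shape[1]        # l - k overidentifying restrictions
    return J, chi2.sf(J, df)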

References

Andrews, D. W. K. (1991), “Heteroskedasticity and autocorrelation consistent covariance matrix estimation,” Econometrica, 59, 817–858.

Andrews, D. W. K., and J. C. Monahan (1992), “An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator,” Econometrica, 60, 953–966.

Angrist, J. D., and J.-S. Pischke (2008), Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton, Princeton University Press.

Bertrand, M., E. Duflo, and S. Mullainathan (2004), “How much should we trust differences-in-differences estimates?,” Quarterly Journal of Economics, 119, 249–275.

Bester, C. A., T. G. Conley, and C. B. Hansen (2011), “Inference with dependent data using cluster covariance estimators,” Journal of Econometrics, 165, 137–151.

Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008), “Bootstrap-based improvements for inference with clustered errors,” Review of Economics and Statistics, 90, 414–427.

Card, D., and A. B. Krueger (1994), “Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania,” American Economic Review, 84, 772–793.

Donald, S. G., and K. Lang (2007), “Inference with difference-in-differences and other panel data,” Review of Economics and Statistics, 89, 221–233.


Kloek, T. (1981), “OLS estimation in a model where a microvariable is explained by aggregates and contemporaneous disturbances are equicorrelated,” Econometrica, 49, 205–207.

Liang, K.-Y., and S. L. Zeger (1986), “Longitudinal data analysis using generalized linear models,” Biometrika, 73, 13–22.

MacKinnon, J. G. (2012), “Thirty years of heteroskedasticity-robust inference,” in Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis, ed. Xiaohong Chen and Norman R. Swanson, New York, Springer, 437–461.

MacKinnon, J. G., and M. D. Webb (2015), “Wild bootstrap inference for wildly different cluster sizes,” Queen’s Economics Department Working Paper No. 1314 (revised).

Moulton, B. R. (1986), “Random group effects and the precision of regression estimates,” Journal of Econometrics, 32, 385–397.

Moulton, B. R. (1990), “An illustration of a pitfall in estimating the effects of aggregate variables on micro units,” Review of Economics and Statistics, 72, 334–338.

Staiger, D., and J. H. Stock (1997), “Instrumental variables regression with weak instruments,” Econometrica, 65, 557–586.

Stock, J. H., and M. Yogo (2005), “Testing for weak instruments in linear IV regression,” in D. W. K. Andrews and J. H. Stock (eds.), Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, Cambridge, Cambridge University Press, 80–108.
