

Alternative Asymptotics and the Partially Linear Model with Many Regressors¹

Matias D. Cattaneo, Department of Economics, University of Michigan, [email protected].

Michael Jansson, Department of Economics, UC Berkeley, [email protected].

Whitney K. Newey, Department of Economics, MIT, [email protected].

July, 2010

Revised November, 2010

JEL classification: C13, C31.

Keywords: partially linear model, many terms, adjusted variance.

¹ The authors thank Alfonso Flores-Lagunes for comments, as well as seminar participants at Indiana, Michigan and MSU, and conference participants at the 2010 Joint Statistical Meetings and the 2010 LACEA Impact Evaluation Network Conference.

Proposed Running Head: Semiparametric Asymptotics

Corresponding Author:

Whitney K. Newey

Department of Economics

MIT, E52-262D

Cambridge, MA 02142-1347

Abstract


1 Introduction

Many instrument asymptotics, where the number of instruments grows as fast as the sample size, has proven useful for instrumental variable estimators. Kunitomo (1980) and Morimune (1983) derived asymptotic variances that are larger than the usual formulae when the number of instruments and sample size grow at the same rate, and Bekker (1994) and others provided consistent estimators of these larger variances. Hansen, Hausman, and Newey (2008) showed that using many instrument standard errors provides a theoretical improvement for a range of numbers of instruments and a practical improvement for estimating the returns to schooling. Thus, many instrument asymptotics and the associated standard errors have been demonstrated to be a useful alternative to the usual asymptotics for instrumental variables.

Instrumental variable estimators implicitly depend on a nonparametric series estimator. Many instrument asymptotics has the number of series terms growing so fast that the series estimator is not consistent. Analogous asymptotics for kernel density weighted average derivative estimators has been considered by Cattaneo, Crump, and Jansson (2010). They find that when the bandwidth shrinks faster than needed for consistency of the kernel estimator, the variance of the estimator is larger than the usual formula. They also find that correcting the variance provides an improvement over standard asymptotics for a range of bandwidths.

The purpose of this paper is to show that these results share a common structure and that this structure can be used to derive new results. The common structure is that the object determining the limiting distribution is a V-statistic, with a remainder that is an asymptotically normal degenerate U-statistic. Asymptotic normality of the remainder distinguishes this setting from other ones with V-statistics. Here the asymptotically normal remainder comes from the number of series terms going to infinity (or the bandwidth shrinking to zero), while the behavior of a degenerate U-statistic is more complicated in other settings. When the number of terms grows as fast as the sample size, the remainder has the same magnitude as the leading term, resulting in an asymptotic variance larger than just the variance of the leading term. The many instrument and small bandwidth results share this structure. In keeping with this common structure, we will henceforth refer to such results under the general heading of alternative asymptotics.

Applying the common structure to a series estimator of the partially linear model leads to new results. These results allow the number of terms in the series approximation to grow as fast as the sample size. The asymptotic distribution of the estimator is derived and it is shown to have a larger asymptotic variance than the usual formula. When the disturbance is homoskedastic, this larger variance is consistently estimated by the usual homoskedasticity-consistent estimator with the proper degrees of freedom. This provides a large sample justification for the use of a degrees of freedom correction without normality of disturbances. It is also found that the White (1980) variance estimator is inconsistent with many regressors, being too small when the disturbance is homoskedastic. We give a variance estimator that is heteroskedasticity consistent when the number of series terms grows as fast as the sample size. We also show that the new variance estimator provides an improvement when the number of terms grows slower than the sample size. These results suggest that the new standard errors should be useful for inference in a partially linear model with many regressors.

In Section 2 we describe the common structure of many instrument and small bandwidth asymptotics. In Section 3 we show how the structure leads to new results for the partially linear model. Section 4 describes standard asymptotics for the series estimator and Section 5 gives the new results. Section 6 discusses standard errors and Section 7 reports Monte Carlo evidence. Proofs are collected in a technical appendix.

2 A Common Structure

To describe the common structure of many instrument and small bandwidth asymptotics, let $W_1, \ldots, W_n$ denote independent data observations. Also, let $\hat{\theta}$ denote an estimator and $\theta_0$ a corresponding true value. We consider estimators that satisfy

\sqrt{n}(\hat{\theta} - \theta_0) = \hat{\Gamma}^{-1} S_n, \qquad S_n = \sum_{i,j=1}^n u_{ij}^n(W_i, W_j), \qquad (1)

where $u_{ij}^n(w_i, w_j)$ is a function of a pair of observations that can depend on $i$, $j$, and $n$. We allow $u_{ij}^n$ to depend on $n$ to account for numbers of terms or bandwidths that change with the sample size. Also, we allow $u_{ij}^n$ to vary with $i$ and $j$ to account for dependence on variables that are being conditioned on in the asymptotics, and so treated as nonrandom. We will illustrate in examples to follow.

We will assume throughout that $\hat{\Gamma} \to_p \Gamma$ non-singular and focus on the V-statistic $S_n$. That V-statistic has a well known decomposition that we describe here because it is an essential feature of the common structure. For notational simplicity we will drop the $W_i$ and $W_j$ arguments and let $u_{ij}^n = u_{ij}^n(W_i, W_j)$ and $\tilde{u}_{ij}^n = u_{ij}^n + u_{ji}^n - E[u_{ij}^n + u_{ji}^n]$. We have the following result.

Proposition 1: If $E[\|u_{ij}^n\|^2] < \infty$ for all $i, j, n$, then

S_n = \psi_n + U_n + B_n, \qquad (2)

where

\psi_n = \sum_{i=1}^n \psi_i^n(W_i), \qquad \psi_i^n(W_i) = u_{ii}^n - E[u_{ii}^n] + \sum_{j \neq i} E[\tilde{u}_{ij}^n \mid W_i],

U_n = \sum_{i=2}^n D_i^n(W_i, W_{i-1}, \ldots, W_1), \qquad D_i^n(W_i, \ldots, W_1) = \sum_{j<i} \big( \tilde{u}_{ij}^n - E[\tilde{u}_{ij}^n \mid W_i] - E[\tilde{u}_{ij}^n \mid W_j] \big),

B_n = E[S_n],

and

E[\psi_i^n(W_i)] = 0, \qquad E[D_i^n(W_i, \ldots, W_1) \mid W_{i-1}, \ldots, W_1] = 0, \qquad E[\psi_n U_n'] = 0.

This result shows that $S_n$ can be decomposed into a sum of independent terms $\psi_n$, a U-statistic remainder $U_n$ that is a martingale difference sum and uncorrelated with $\psi_n$, and a pure bias term $B_n$. This decomposition of a V-statistic is well known (see, e.g., van der Vaart (1998, Chapter 11)) and is included here for exposition. It is important in many of the proofs of asymptotic normality of semiparametric estimators, including Powell, Stock, and Stoker (1989), with the limiting distribution being determined by $\psi_n$, and $U_n$ being treated as a remainder that is of smaller order when the bandwidth shrinks slowly enough.
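To make the decomposition concrete, here is a minimal numerical check of Proposition 1 (our illustration, not the paper's) for the toy kernel $u_{ij}^n = W_i W_j/n$ with standard normal data, where $u_{ii}^n - E[u_{ii}^n] = (W_i^2 - 1)/n$, $E[\tilde{u}_{ij}^n \mid W_i] = 0$, and $B_n = 1$, so all three terms are available in closed form:

```python
import numpy as np

# Numeric check of S_n = psi_n + U_n + B_n (Proposition 1) for the toy
# kernel u_ij = W_i * W_j / n with W_i ~ N(0,1).  Here E[u_ii] = 1/n and
# E[u_tilde_ij | W_i] = 0, so psi_n, U_n, and B_n are explicit.
# This is a hypothetical example, not one of the paper's estimators.
rng = np.random.default_rng(0)
n = 200
W = rng.standard_normal(n)

S_n = np.sum(np.outer(W, W)) / n                 # V-statistic sum_{i,j} W_i W_j / n
psi_n = np.sum(W**2 - 1.0) / n                   # sum_i (u_ii - E[u_ii])
U_n = sum(2.0 * W[i] * W[:i].sum() / n for i in range(1, n))  # martingale part
B_n = 1.0                                        # E[S_n] = sum_i E[W_i^2] / n

print(S_n, psi_n + U_n + B_n)                    # the two numbers coincide
```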

An interesting feature of this decomposition in semiparametric settings is that $U_n$ is asymptotically normal at some rate when the number of series terms grows or the bandwidth shrinks to zero. In other settings the asymptotic behavior of $U_n$ is more complicated. It is a degenerate U-statistic, which in general converges to a weighted sum of chi-squares; see, e.g., van der Vaart (1998, Chapter 12). Apparently what occurs in semiparametric settings as the number of instruments grows or the bandwidth shrinks is that the individual contributions $D_i^n(W_i, \ldots, W_1)$ to $U_n$ are small enough to satisfy a Lindeberg-Feller condition. Combined with the martingale property of $U_n$ from Proposition 1, this leads, as is well known, to asymptotic normality of $U_n$. This asymptotic normality property of $U_n$ has been shown for both series and kernel estimators, as further explained below.

Alternative asymptotics occurs when the number of series terms grows or the bandwidth shrinks fast enough that $\psi_n$ and $U_n$ have the same magnitude in the limit. Because of the uncorrelatedness of $\psi_n$ and $U_n$, the asymptotic variance will be larger than the usual formula, which is $\lim_{n \to \infty} V[\psi_n]$. Thus, consistent variance estimation under alternative asymptotics requires accounting for the presence of $U_n$.

Accounting for the presence of $U_n$ should also yield improvements when numbers of series terms and bandwidths do not satisfy the knife-edge conditions of alternative asymptotics. Intuitively, if the number of series terms grows just slightly slower than the sample size, or the bandwidth shrinks slightly slower than $1/n$, then accounting for the presence of $U_n$ should still give a better large sample approximation. Hansen, Hausman, and Newey (2008) show such an improvement for many instrument asymptotics. It would be good to consider such improved approximations more generally, though it is beyond the scope of this paper to do so.

We can show that many instrument asymptotics has the structure we have outlined. To keep the exposition simple we focus on the jackknife instrumental variables estimator JIVE2 of Angrist, Imbens, and Krueger (1999). The limited information maximum likelihood estimator could also be considered, for which many instrument asymptotics was carried out by Kunitomo (1980) and Morimune (1983) some time ago, but the analysis is more complicated. Consider a linear structural equation

y_i = x_i'\beta_0 + \varepsilon_i, \qquad E[\varepsilon_i] = 0, \qquad i = 1, \ldots, n,

where $x_i$ is a vector and $y_i$ and $\varepsilon_i$ are a scalar dependent variable and disturbance, respectively. Let $Z_i$ be a $K \times 1$ vector of instrumental variables that we treat as constants. It is also equivalent to allow the instrumental variables to be random but condition on the matrix of observations $Z = [Z_1, \ldots, Z_n]'$ and replace unconditional moment conditions with conditional ones.

To describe the estimator let $Q = Z(Z'Z)^{-}Z'$ denote the projection matrix onto the column space of $Z$. The JIVE2 estimator takes the form

\hat{\beta} = \Big( \sum_{i \neq j} Q_{ij} x_i x_j' \Big)^{-1} \sum_{i \neq j} Q_{ij} x_i y_j.
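In matrix form this amounts to zeroing out the diagonal of $Q$ in the usual 2SLS quadratic forms. A minimal numpy sketch on simulated data (an illustration under an assumed design, not the authors' code):

```python
import numpy as np

# JIVE2 (Angrist, Imbens, and Krueger 1999): drop the i = j terms from
# the 2SLS quadratic forms.  One endogenous regressor, simulated design.
rng = np.random.default_rng(1)
n, K = 400, 50
Z = rng.standard_normal((n, K))                  # instruments, treated as fixed
eps = rng.standard_normal(n)
x = Z @ (np.ones(K) / np.sqrt(K)) + 0.5 * eps + rng.standard_normal(n)
y = x * 1.0 + eps                                # beta_0 = 1

Q = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T            # Q = Z (Z'Z)^- Z'
Q0 = Q - np.diag(np.diag(Q))                     # zero out the Q_ii terms
beta_jive2 = (x @ Q0 @ y) / (x @ Q0 @ x)         # sum_{i != j} Q_ij x_i y_j, etc.
print(beta_jive2)
```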


Substituting for $y_j$ and collecting terms gives

\sqrt{n}(\hat{\beta} - \beta_0) = \Big( \sum_{i \neq j} Q_{ij} x_i x_j'/n \Big)^{-1} \sum_{i \neq j} Q_{ij} x_i \varepsilon_j/\sqrt{n}.

Here we can see that JIVE2 is a special case of equation (1) with

\theta_0 = \beta_0, \quad \hat{\Gamma} = \sum_{i \neq j} Q_{ij} x_i x_j'/n, \quad u_{ii}^n(W_i, W_i) = 0, \quad u_{ij}^n(W_i, W_j) = Q_{ij} x_i \varepsilon_j/\sqrt{n}, \quad i \neq j.

Note that $E[u_{ij}^n(W_i, W_j)] = 0$ and, for $\Upsilon_i = E[x_i]$,

E[u_{ij}^n(W_i, W_j) \mid W_i] = Q_{ij} x_i E[\varepsilon_j]/\sqrt{n} = 0, \qquad E[u_{ij}^n(W_i, W_j) \mid W_j] = Q_{ij} \Upsilon_i \varepsilon_j/\sqrt{n}.

Here $\Upsilon_i$ can be interpreted as the reduced form for observation $i$. Let $v_i = x_i - \Upsilon_i$. Applying Proposition 1, and using symmetry of the matrix $Q$, we find that equation (2) is satisfied with

\psi_i^n(W_i) = \Big( \sum_{j \neq i} Q_{ij} \Upsilon_j \Big) \varepsilon_i/\sqrt{n} = \Upsilon_i (1 - Q_{ii}) \varepsilon_i/\sqrt{n} - \Big( \Upsilon_i - \sum_j Q_{ij} \Upsilon_j \Big) \varepsilon_i/\sqrt{n},

D_i^n(W_i, \ldots, W_1) = \sum_{j<i} Q_{ij} (v_i \varepsilon_j + v_j \varepsilon_i)/\sqrt{n}, \qquad B_n = 0.

Note that $\Upsilon_i - \sum_j Q_{ij} \Upsilon_j$ is the $i$-th residual from regressing the reduced form observations on $Z$, so that by appropriate definition of the reduced form this can generally be assumed to go to zero as the sample size grows. In that case

\psi_n = \sum_i \Upsilon_i (1 - Q_{ii}) \varepsilon_i/\sqrt{n} + o_p(1).

Furthermore, under standard asymptotics $Q_{ii}$ will go to zero, so the variance of this does indeed correspond to the usual asymptotic variance for IV.

The degenerate U-statistic term is

U_n = \sum_{i=1}^n \sum_{j<i} Q_{ij} (v_i \varepsilon_j + v_j \varepsilon_i)/\sqrt{n}.

Chao, Swanson, Hausman, Newey, and Woutersen (2010) apply the martingale central limit theorem to show that this $U_n$ will be asymptotically normal when $x_i$ and $\varepsilon_i$ have uniformly bounded fourth moments, $\mathrm{rank}(Q) = \dim(Z_i) = K \to \infty$, and $Q_{ii}$ is bounded away from 1 uniformly in $i$ and $n$. The conditions of the martingale central limit theorem are verified by showing that certain linear combinations with coefficients depending on the elements of $Q$ go to zero as $K \to \infty$. In the proof, this makes individual terms asymptotically negligible, with a Lindeberg-Feller condition being satisfied. Alternative asymptotics occurs when $K$ grows as fast as $n$, resulting in $\psi_n$ and $U_n$ having the same magnitude in the limit.

As mentioned above, small bandwidth asymptotics for kernel estimators also has the structure outlined above. To illustrate this we consider kernel estimation of the integrated squared density. We use this example to keep the exposition relatively simple and because it shares the common structure with the more interesting density weighted average derivative estimator of Powell, Stock, and Stoker (1989) treated in Cattaneo, Crump, and Jansson (2010). In this example the parameter of interest is

\theta_0 = \int f_0(w)^2 \, dw = E[f_0(W_i)],

where $W_i$ denotes a continuously distributed random variable with pdf $f_0$. A "leave one out" estimator, analogous to that of Powell, Stock, and Stoker (1989), is

\hat{\theta} = \sum_{i \neq j} K_h(W_i - W_j)/[n(n-1)],

where $K(u)$ is a symmetric kernel with $K(-u) = K(u)$ and $K_h(u) = h^{-d} K(u/h)$. This estimator has the V-statistic form of equation (1) with

\theta_0 = \int f_0(w)^2 \, dw, \quad \hat{\Gamma} = 1, \quad u_{ii}^n(W_i, W_i) = 0, \quad u_{ij}^n(W_i, W_j) = K_h(W_i - W_j)/[\sqrt{n}(n-1)], \quad i \neq j.
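A minimal sketch of this estimator for scalar data, with a Gaussian kernel chosen for concreteness (the paper does not prescribe a particular kernel):

```python
import numpy as np

# Leave-one-out estimator of the integrated squared density:
# theta_hat = sum_{i != j} K_h(W_i - W_j) / (n(n-1)), scalar W, Gaussian K.
rng = np.random.default_rng(2)
n, h = 500, 0.2
W = rng.standard_normal(n)            # f_0 = N(0,1), so theta_0 = 1/(2 sqrt(pi))

D = W[:, None] - W[None, :]           # all pairwise differences W_i - W_j
Kh = np.exp(-0.5 * (D / h)**2) / (h * np.sqrt(2 * np.pi))   # K_h(u) = K(u/h)/h
np.fill_diagonal(Kh, 0.0)             # leave one out: drop the i = j terms
theta_hat = Kh.sum() / (n * (n - 1))
print(theta_hat, 1 / (2 * np.sqrt(np.pi)))   # estimate vs. theta_0 = 0.2821...
```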

Define

f_h(w) = \int K(u) f_0(w + hu) \, du.

Note that by symmetry of $K(u)$,

E[u_{ij}^n(W_i, W_j) \mid W_i] = f_h(W_i)/[\sqrt{n}(n-1)], \qquad E[u_{ij}^n(W_i, W_j) \mid W_j] = f_h(W_j)/[\sqrt{n}(n-1)],

E[u_{ij}^n(W_i, W_j)] = E[f_h(W_i)]/[\sqrt{n}(n-1)].

Applying Proposition 1, and using symmetry of $K(u)$, we find that equation (2) is satisfied with

\psi_i^n(W_i) = 2\{f_h(W_i) - E[f_h(W_i)]\}/\sqrt{n},

D_i^n(W_i, \ldots, W_1) = 2 \sum_{j<i} \{K_h(W_i - W_j) - f_h(W_i) - f_h(W_j) + E[f_h(W_i)]\}/[\sqrt{n}(n-1)],

B_n = \sqrt{n}\,\{E[f_h(W_i)] - \theta_0\}.

Note that $2\{f_h(W_i) - E[f_h(W_i)]\}$ is an approximation to the well known influence function $2[f_0(W_i) - \theta_0]$ for estimators of the integrated squared density. As $h \to 0$ we will have $f_h(W_i)$ converging to $f_0(W_i)$ in mean square, so that

\psi_n = \sum_i 2[f_0(W_i) - \theta_0]/\sqrt{n} + o_p(1).

The asymptotic variance of this leading term does correspond to the usual asymptotic variance for estimators of the integrated squared density.

The degenerate U-statistic term is

U_n = 2 \sum_{i=1}^n \sum_{j<i} \{K_h(W_i - W_j) - f_h(W_i) - f_h(W_j) + E[f_h(W_i)]\}/[\sqrt{n}(n-1)].

The martingale central limit theorem can be applied as in Cattaneo, Crump, and Jansson (2010) to show that this $U_n$ will be asymptotically normal as $h \to 0$ and $n \to \infty$. Intuitively, the shrinking bandwidth leads to $K_h(W_i - W_j)$ being small except when $W_i$ and $W_j$ are close, so that the tail of the distribution of $\sqrt{n} D_i^n(W_i, \ldots, W_1)$ becomes thin and a Lindeberg-Feller condition is satisfied. Alternative asymptotics occurs when $h^d$ shrinks as fast as $1/n$, resulting in $\psi_n$ and $U_n$ having the same magnitude in the limit.

This pair of examples shows how several estimators share the common structure outlined

above. We now show how that structure can be applied to derive new results.

3 Series Estimation of the Partially Linear Model

To describe the partially linear model let $(y_i, x_i', z_i')'$, $i = 1, \ldots, n$, be a random sample of the random vector $(y, x', z')'$, where $y \in \mathbb{R}$ is a dependent variable, and $x \in \mathbb{R}^{d_x \times 1}$ and $z \in \mathbb{R}^{d_z \times 1}$ are explanatory variables. The model is given by

y_i = x_i'\beta + g(z_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid x_i, z_i] = 0, \qquad \sigma^2(x_i, z_i) = E[\varepsilon_i^2 \mid x_i, z_i],

where $v_i = x_i - h(z_i)$, with $h(z_i) = E[x_i \mid z_i]$.²

² See Robinson (1988) for the analysis of this model when using kernel regression, and Linton (1995) for the corresponding higher-order properties.

A series estimator of $\beta$ is obtained by regressing $y_i$ on $x_i$ and approximating functions of $z_i$. To describe the estimator let $p^K(z) = (p_{1K}(z), \ldots, p_{KK}(z))'$ be a vector of approximating functions, such as polynomials or splines, where $K$ denotes the number of terms in the regression. Here the unknown function $g(z)$ will be approximated by a linear combination of the $p_{kK}(z)$. Also, let $Y = [y_1, \cdots, y_n]' \in \mathbb{R}^{n \times 1}$, $X = [x_1, \cdots, x_n]' \in \mathbb{R}^{n \times d_x}$, and $P = [p^K(z_1), \ldots, p^K(z_n)]'$. A series estimator of $\beta$ is given by

\hat{\beta} = (X'MX)^{-1} X'MY, \qquad M = I - Q, \qquad Q = P(P'P)^{-}P',

where $A^{-}$ denotes a generalized inverse of a matrix $A$ (satisfying $AA^{-}A = A$) and $X'MX$ will be non-singular with probability approaching one under conditions given below. Donald and Newey (1994) gave conditions for the asymptotic normality of this estimator.
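A minimal numpy sketch of the estimator (the simulated design, scalar $z$, and polynomial basis are illustrative assumptions, not the paper's):

```python
import numpy as np

# Series estimator of beta in the partially linear model: partial a K-term
# series in z out of both y and x, then run least squares.
rng = np.random.default_rng(3)
n, K = 500, 8
z = rng.uniform(-1, 1, n)
x = np.sin(np.pi * z) + rng.standard_normal(n)   # x = h(z) + v
y = x * 1.0 + np.exp(z) + rng.standard_normal(n) # beta = 1, g(z) = exp(z)

P = z[:, None] ** np.arange(K)                   # p^K(z) = (1, z, ..., z^{K-1})'
Q = P @ np.linalg.pinv(P.T @ P) @ P.T            # Q = P (P'P)^- P'
M = np.eye(n) - Q
beta_hat = (x @ M @ y) / (x @ M @ x)             # (X'MX)^{-1} X'MY with d_x = 1
print(beta_hat)
```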

Conditional on $Z = [z_1, \ldots, z_n]'$ this estimator has the structure outlined earlier. For this reason we will condition on $Z$ throughout the following discussion, so that expectations are implicitly conditional on $Z$. To explain how $\hat{\beta}$ fits within the common structure, let $h_i = E[x_i]$ and $v_i = x_i - h_i$. Also let $g_i = g(z_i)$,

G = [g_1, \cdots, g_n]', \qquad \varepsilon = [\varepsilon_1, \cdots, \varepsilon_n]', \qquad H = [h_1, \cdots, h_n]', \qquad V = [v_1, \cdots, v_n]'.

By $Y = X\beta + G + \varepsilon$ we have

\sqrt{n}(\hat{\beta} - \beta) = (X'MX/n)^{-1} X'M(\varepsilon + G)/\sqrt{n} = \hat{\Gamma}^{-1} S_n,

\hat{\Gamma} = X'MX/n, \qquad S_n = \sum_{i,j} x_i M_{ij}(g_j + \varepsilon_j)/\sqrt{n}.

This estimator has the V-statistic form of equation (1) with $W_i = (y_i, x_i)$ and

\theta_0 = \beta, \qquad \hat{\Gamma} = X'MX/n, \qquad u_{ij}^n(W_i, W_j) = x_i M_{ij}(g_j + \varepsilon_j)/\sqrt{n}.

By $E[\varepsilon_i \mid x_i] = 0$ we have $E[x_i \varepsilon_i] = 0$. Therefore, letting $u_{ij}^n = u_{ij}^n(W_i, W_j)$ as we have done previously, we have

E[u_{ij}^n] = h_i M_{ij} g_j/\sqrt{n}, \qquad u_{ij}^n - E[u_{ij}^n] = M_{ij}(v_i g_j + x_i \varepsilon_j)/\sqrt{n},

\tilde{u}_{ij}^n = M_{ij}(v_j g_i + v_i g_j + x_j \varepsilon_i + x_i \varepsilon_j)/\sqrt{n}, \qquad E[\tilde{u}_{ij}^n \mid W_i] = M_{ij}(v_i g_j + h_j \varepsilon_i)/\sqrt{n}.

Plugging these expressions into the formulas of Proposition 1 we obtain

\psi_i^n(W_i) = M_{ii}(v_i g_i + h_i \varepsilon_i + v_i \varepsilon_i)/\sqrt{n} + \sum_{j \neq i} M_{ij}(v_i g_j + \varepsilon_i h_j)/\sqrt{n} = \Big[ v_i M_{ii} \varepsilon_i + v_i \sum_j M_{ij} g_j + \varepsilon_i \sum_j M_{ij} h_j \Big]/\sqrt{n},

D_i^n(W_i, \ldots, W_1) = \sum_{j<i} M_{ij}(v_i \varepsilon_j + v_j \varepsilon_i)/\sqrt{n}, \qquad B_n = H'MG/\sqrt{n}.

Summing up $\psi_i^n(W_i)$ it follows that

\psi_n = \sum_i M_{ii} v_i \varepsilon_i/\sqrt{n} + (V'MG + H'M\varepsilon)/\sqrt{n}.

The last term in this expression has mean zero and will converge to zero in mean square as $K$ grows, as further discussed below. Therefore, by $M_{ii} = 1 - Q_{ii}$,

\psi_n = \sum_i (1 - Q_{ii}) v_i \varepsilon_i/\sqrt{n} + o_p(1). \qquad (3)

Under standard asymptotics $Q_{ii}$ will go to zero and hence the variance of this corresponds to the usual asymptotic variance.

Noting that $M_{ij} = -Q_{ij}$ for $j < i$, plugging in we find that the degenerate U-statistic term is

U_n = -\sum_{i=1}^n \sum_{j<i} Q_{ij}(v_i \varepsilon_j + v_j \varepsilon_i)/\sqrt{n}.

Remarkably, this term is essentially the same as the degenerate U-statistic term for JIVE2 that was shown above. Consequently, the central limit theorem of Chao, Swanson, Hausman, Newey, and Woutersen (2010) applies immediately to show that $U_n$ is asymptotically normal as $K \to \infty$. This leads to many regressor asymptotics for series estimators of the partially linear model. It also illustrates the usefulness of the common structure, showing how that structure can be used to derive new results.

4 Standard Asymptotics for the Series Estimator

The estimator $\hat{\beta}$ may be intuitively interpreted as a two-step semiparametric estimator with smoothing parameter $p^K(\cdot)$ and tuning parameter $K$, because the unknown (regression) functions $g(\cdot)$ and $h(\cdot)$ are nonparametrically estimated in a preliminary step by the series estimator. In particular, the following assumption characterizes the rate at which the approximation error of the series estimator should vanish.

Assumption B. (i) For some $\alpha_h > 0$, there exists $\beta_h$ so that

\frac{1}{n} H'MH = \frac{1}{n} \min_\beta \|H - P\beta\|^2 = \frac{1}{n} \sum_{i=1}^n \|h(z_i) - p^K(z_i)'\beta_h\|^2 = O_{a.s.}(K^{-2\alpha_h}).

(ii) For some $\alpha_g > 0$, there exists $\beta_g$ so that

\frac{1}{n} G'MG = \frac{1}{n} \min_\beta \|G - P\beta\|^2 = \frac{1}{n} \sum_{i=1}^n \big( g(z_i) - p^K(z_i)'\beta_g \big)^2 = O_{a.s.}(K^{-2\alpha_g}).

The conditions required in Assumption B are implied by conventional assumptions from the series-based nonparametric literature (e.g., Newey (1997, Assumption 3)). Thus, under appropriate assumptions, commonly used bases of approximation such as polynomials or splines will satisfy Assumption B with $\alpha_h = s_h/d_z$ and $\alpha_g = s_g/d_z$, where $s_h$ and $s_g$ denote the number of continuous derivatives of $h$ and $g$, respectively.

Under regularity conditions (given below) and Assumption B, Donald and Newey (1994) obtained the following (infeasible) classical asymptotic approximation for $\hat{\beta}$ corresponding to equation (3): if

n K^{-2(\alpha_h + \alpha_g)} \to 0 \quad \text{and} \quad K/n \to 0, \qquad (4)

then

\sqrt{n}(\hat{\beta} - \beta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_i + o_p(1) \to_d N(0, \Omega), \qquad \Omega = \Gamma^{-1} \Sigma \Gamma^{-1}, \qquad (5)

where

\psi_i = \Gamma^{-1} v_i \varepsilon_i, \qquad \Gamma = E[v_i v_i'], \qquad \Sigma = E[v_i v_i' \varepsilon_i^2].

The classical asymptotic linear representation of $\hat{\beta}$, given in (5), is established by analyzing the first-order stochastic properties of the "numerator" $X'M(Y - X\beta)/\sqrt{n}$ and "denominator" $\hat{\Gamma}$ of the estimator. Specifically, for the "denominator", it can be shown (see Lemma 1 below) that if Condition (4) holds, then the (Hessian) matrix satisfies (recall that $M = I - Q$ and $QP = P$)

\hat{\Gamma}_n = \frac{1}{n} X'MX = \frac{1}{n} V'MV + o_p(1) = \frac{1}{n} V'V - \frac{1}{n} V'QV + o_p(1) = \frac{1}{n} \sum_{i=1}^n v_i v_i' + o_p(1) \to_p \Gamma,

where the second equality captures the bias introduced by the series estimator (Assumption B(i)), and the fourth equality requires the contribution of $V'QV$ to vanish asymptotically.

Similarly, it follows from equation (3) that the "numerator" of $\hat{\beta}$ satisfies

\frac{1}{\sqrt{n}} X'M(Y - X\beta) = \sum_i v_i (1 - Q_{ii}) \varepsilon_i/\sqrt{n} + o_p(1) = \frac{1}{\sqrt{n}} \sum_{i=1}^n v_i \varepsilon_i + o_p(1) \to_d N(0, \Sigma),

where the first equality is again related to the bias introduced by the nonparametric estimator and the second equality will follow by $Q_{ii} \to 0$.

In both cases, the approximation error associated with the bias is controlled by the condition $n K^{-2(\alpha_h + \alpha_g)} \to 0$, which requires $K$ to be "large". On the other hand, the condition $K/n \to 0$ guarantees that both $V'QV$ and $V'Q\varepsilon$ are asymptotically negligible, as required for the classical, asymptotically linear approximation to be valid. The latter condition controls the variance of the estimator, and it is directly related to the behavior of the nonparametric estimator, which in this case is described by $Q$.

The classical approach to forming a confidence interval for $\beta_0$ is to use the asymptotic distributional result coupled with a consistent standard error estimator. A plug-in approach employs the (asymptotically) pivotal test statistic $T_{0,n}(K) = \Omega_0^{-1/2}\sqrt{n}(\hat{\beta} - \beta)$, together with a plug-in consistent estimator of $\Omega_0$. Under heteroskedasticity, a feasible test statistic is given by

\hat{T}_{0,n}(K_n) = \hat{\Omega}_n^{-1/2} \sqrt{n}(\hat{\beta} - \beta), \qquad \hat{\Omega}_n = \hat{\Gamma}_n^{-1} \hat{\Sigma}_n \hat{\Gamma}_n^{-1},

where

\hat{\Gamma}_n = \frac{X'MX}{n}, \qquad \hat{\Sigma}_n = \frac{1}{n - K - d_x} \sum_{i=1}^n \tilde{x}_i \tilde{x}_i' \hat{\varepsilon}_i^2, \qquad \hat{\varepsilon} = \tilde{Y} - \tilde{X}\hat{\beta} = [\hat{\varepsilon}_1, \cdots, \hat{\varepsilon}_n]',

with $\tilde{Y} = MY$ and $\tilde{X} = MX = [\tilde{x}_1, \cdots, \tilde{x}_n]'$. In this case, the standard error estimator is the classical heteroskedasticity-robust (HR) standard error estimator commonly used in regression analysis. Under Condition (4) it is not difficult to show that $\hat{\Omega}_n^{-1} \Omega_0 \to_p I_{d_x}$.
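A sketch of the feasible statistic in the scalar-$x$ case (same kind of simulated design as above; illustrative only):

```python
import numpy as np

# Feasible HR test statistic T_hat = Omega_hat^{-1/2} sqrt(n) (beta_hat - beta),
# with the degrees-of-freedom weight 1/(n - K - d_x) in Sigma_hat.
rng = np.random.default_rng(4)
n, K, dx = 500, 8, 1
z = rng.uniform(-1, 1, n)
x = np.sin(np.pi * z) + rng.standard_normal(n)
y = x * 1.0 + np.exp(z) + rng.standard_normal(n)

P = z[:, None] ** np.arange(K)
M = np.eye(n) - P @ np.linalg.pinv(P.T @ P) @ P.T
xt, yt = M @ x, M @ y                            # x_tilde = MX, y_tilde = MY
beta_hat = (xt @ yt) / (xt @ xt)
ehat = yt - xt * beta_hat                        # eps_hat = y_tilde - x_tilde beta_hat
Gamma = (xt @ xt) / n                            # Gamma_hat_n = X'MX/n
Sigma = np.sum(xt**2 * ehat**2) / (n - K - dx)   # Sigma_hat_n
Omega = Sigma / Gamma**2                         # Gamma^{-1} Sigma Gamma^{-1}
t_stat = np.sqrt(n) * (beta_hat - 1.0) / np.sqrt(Omega)
print(beta_hat, t_stat)                          # t_stat approx N(0,1) under beta = 1
```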

5 Many Regressor Asymptotics

This section derives a generalized asymptotic distribution for $\sqrt{n}(\hat{\beta} - \beta)$ which relaxes the condition $K_n/n \to 0$. This non-standard asymptotic theory encompasses the classical result discussed in the previous section, and also captures the effect of the degenerate U-statistic term of the expansion, which is negligible when $K/n \to 0$. Intuitively, this asymptotic experiment captures the effect of $K$ "large" (relative to $n$) by breaking down the asymptotic linearity of the estimator.

To characterize the generalized central limit theory, it is natural to study the stochastic behavior of $\sqrt{n}(\hat{\beta} - \beta)$ as a "ratio" of two bilinear forms:

\sqrt{n}(\hat{\beta} - \beta) = \Big( \frac{1}{n} X'MX \Big)^{-1} \frac{1}{\sqrt{n}} X'M(Y - X\beta) = \hat{\Gamma}_n^{-1} \frac{1}{\sqrt{n}} X'M(Y - X\beta).

The following lemma characterizes the behavior of the Hessian matrix $\hat{\Gamma}_n$ under quite weak conditions.

Lemma 1. Suppose that $E[\|v_i\|^4] \leq C_v < \infty$ (a.s.) and Assumption B(i) holds. Then,

\hat{\Gamma}_n = \frac{1}{n} X'MX = \Gamma_n + o_p(1), \qquad \Gamma_n = \frac{1}{n} \sum_{i=1}^n M_{ii} E[v_i v_i' \mid z_i] = O_{a.s.}(1 + K/n).

This lemma characterizes the stochastic behavior of the Hessian matrix under conditions that are weaker than those entertained by the classical, asymptotically linear, distribution theory. Specifically, because $M = I - Q$ is an idempotent symmetric matrix, $M_{ii} \in [0, 1]$ and $\sum_{i=1}^n M_{ii} \geq n - K$, Lemma 1 implies that $\Gamma_n$ remains asymptotically bounded even when $K/n \not\to 0$. In particular, $\Gamma_n = E[v_i v_i'] + o_p(1)$ when $K/n \to 0$. Moreover, in the case of homoskedasticity of $v_i$, that is, if $E[v_i v_i' \mid z_i] = E[v_i v_i']$ (and $\mathrm{rank}(Q) = K$), then $\Gamma_n = (1 - K/n) E[v_i v_i']$. Finally, if $\lambda_{\min}(E[v_i v_i' \mid z_i]) \geq F_V > 0$ and $M_{ii} = 1 - Q_{ii} \geq F_Q > 0$ (a.s.), then $\lambda_{\min}(\Gamma_n) \geq F_Q F_V > 0$. ($\lambda_{\min}(A)$ denotes the minimum eigenvalue of a matrix $A$.)
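The two facts about the diagonal of $M$ used here are easy to confirm numerically (a quick check, assuming $P$ has full column rank $K$):

```python
import numpy as np

# For M = I - P(P'P)^{-1}P' with rank(P) = K: M_ii lies in [0, 1] and
# sum_i M_ii = trace(M) = n - K.
rng = np.random.default_rng(5)
n, K = 200, 30
P = rng.standard_normal((n, K))
M = np.eye(n) - P @ np.linalg.solve(P.T @ P, P.T)
d = np.diag(M)
print(d.min(), d.max(), d.sum(), n - K)   # diagonal inside [0,1]; trace = n - K
```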


To fully characterize the asymptotic behavior of the "numerator" of $\sqrt{n}(\hat{\beta} - \beta)$ it is convenient to proceed in two steps. First, under appropriate bias assumptions, it is possible to show that the numerator is asymptotically equivalent to a quadratic form based on mean-zero random variables.

Lemma 2. Suppose the assumptions of Lemma 1 hold, and $E[\varepsilon_i^4] \leq C_\varepsilon < \infty$ (a.s.) and Assumption B(ii) holds. Then,

(V'MG + H'M\varepsilon + H'MG)/\sqrt{n} = O_p(\sqrt{n} K^{-(\alpha_h + \alpha_g)} + K^{-\alpha_h} + K^{-\alpha_g}).

As in Lemma 1, this result only requires bounded moments and a bias condition. In this case, the bias arises from the approximation of both unknown functions $h$ and $g$. As mentioned above, the high-level Assumption B is implied by the standard assumption of best approximation from the sieve literature. Interestingly, in this model there is a trade-off in terms of the curse of dimensionality: provided that $\min\{\alpha_h, \alpha_g\} > 0$, the bias condition is given by $\sqrt{n} K^{-(\alpha_h + \alpha_g)} \to 0$, which implies a trade-off between the smoothness and dimensionality of $h$ and $g$.

Lemma 2 justifies focusing on the bilinear form

\Lambda_n = \frac{1}{\sqrt{n}} V'M\varepsilon = \frac{1}{\sqrt{n}} \sum_{i=1}^n \sum_{j=1}^n M_{ij} v_i \varepsilon_j,

where $E[\Lambda_n \mid Z, X] = 0$. Moreover, under the assumptions imposed in Lemma 2, a simple variance calculation yields

\Sigma_n = V[\Lambda_n \mid Z] = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n M_{ij}^2 E[v_i v_i' \varepsilon_j^2 \mid z_i, z_j] = O_{a.s.}(1 + K/n).

In particular, if $K/n \to 0$ then

\Sigma_n = \frac{1}{n} \sum_{i=1}^n M_{ii}^2 E[v_i v_i' \varepsilon_i^2 \mid z_i] + o_p(1) = E[v_i v_i' \varepsilon_i^2] + o_p(1) = \Sigma + o_p(1),

as given in Section 4. Moreover, under homoskedasticity, that is, if $E[\varepsilon_i^2 \mid x_i, z_i] = \sigma^2$, then

V\Big[ \frac{1}{\sqrt{n}} V'M\varepsilon \,\Big|\, X, Z \Big] = \frac{\sigma^2}{n} V'MV,

and

\Sigma_n = \frac{\sigma^2}{n} \sum_{i=1}^n \sum_{j=1}^n M_{ij}^2 E[v_i v_i' \mid z_i] = \sigma^2 \Gamma_n,

because $\sum_{j=1}^n M_{ij}^2 = M_{ii}$. Furthermore, if $E[\varepsilon_i^2 \mid x_i, z_i] = \sigma^2$ and $E[v_i v_i' \mid z_i] = E[v_i v_i']$ (and $\mathrm{rank}(Q) = K$ (a.s.)), then $\Sigma_n = \sigma^2 (1 - K/n) E[v_i v_i']$. Finally, if $\lambda_{\min}(E[v_i v_i' \varepsilon_i^2 \mid z_i]) \geq F_{V\varepsilon} > 0$ and $M_{ii} = 1 - Q_{ii} \geq F_Q > 0$ (a.s.), then $\lambda_{\min}(\Sigma_n) \geq F_Q^2 F_{V\varepsilon} > 0$.

The following lemma characterizes the asymptotic distribution of the bilinear form $\Lambda_n$.

Lemma 3. Suppose the assumptions of Lemma 2 hold, and $M_{ii} \geq F_M > 0$ (a.s.) and $\lambda_{\min}(\Sigma_n) \geq F_\Sigma > 0$ (a.s.). Then,

\Sigma_n^{-1/2} \frac{1}{\sqrt{n}} V'M\varepsilon \to_d N(0, I_{d_x}).

The following theorem is a direct consequence of the previous lemmas and the Slutsky theorem, and constitutes the main result of this section.

Theorem 1. Suppose the assumptions of Lemma 3 hold and suppose $\lambda_{\min}(\Gamma_n) \geq F_H > 0$. If

n K^{-2(\alpha_h + \alpha_g)} \to 0 \quad \text{and} \quad K/n \to \kappa \in [0, 1), \qquad (6)

then

\Omega_n^{-1/2} \sqrt{n}(\hat{\beta} - \beta) \to_d N(0, I_{d_x}), \qquad \Omega_n = \Gamma_n^{-1} \Sigma_n \Gamma_n^{-1}. \qquad (7)

If, moreover, $(\Gamma_n, \Sigma_n) = (\Gamma_\infty, \Sigma_\infty) + o_p(1)$ for some $(\Gamma_\infty, \Sigma_\infty)$, then

\sqrt{n}(\hat{\beta} - \beta) \to_d N(0, \Omega_\infty), \qquad \Omega_\infty = \Gamma_\infty^{-1} \Sigma_\infty \Gamma_\infty^{-1}.

If, moreover, $E[\varepsilon_i^2 \mid x_i, z_i] = \sigma^2$ for some $\sigma^2$, then

\sqrt{n}(\hat{\beta} - \beta) \to_d N(0, \sigma^2 \Gamma_\infty^{-1}).

Theorem 1 shows that the central limit theorem for $\hat{\beta}$ holds under the weaker condition (6). (Compare with Condition (4).) This result does not rely on asymptotic linearity, nor on the actual convergence of the matrices $\Gamma_n$ and $\Sigma_n$. However, if $K/n \to 0$, then $(\Gamma_n, \Sigma_n) = (\Gamma, \Sigma) + o_p(1)$ with $\Gamma = E[v_i v_i']$ and $\Sigma = E[v_i v_i' \varepsilon_i^2]$, and the resulting large sample distribution theory does rely on the asymptotically linear representation of $\hat{\beta}$, as given in (5).

Importantly, if $K/n \not\to 0$ and $(\Gamma_n, \Sigma_n) = (\Gamma, \Sigma) + o_p(1)$, then $\Gamma \neq E[v_i v_i']$ and $\Sigma \neq E[v_i v_i' \varepsilon_i^2]$ in general. For instance, if $(v_i, \varepsilon_i)$ is independent of $z_i$, then $\Gamma_n = (1 - K/n) E[v_i v_i']$ and

\Sigma_n = \Big( 1 - \frac{K}{n} \Big) E[v_i v_i' \varepsilon_i^2] + \Big( \frac{1}{n} \sum_{i=1}^n Q_{ii}^2 - \frac{K}{n} \Big) \big( E[v_i v_i' \varepsilon_i^2] - E[v_i v_i'] E[\varepsilon_i^2] \big).

6 Standard Errors

This section discusses different homoskedasticity- and heteroskedasticity-robust standard error estimators, and their properties under the generalized asymptotics studied in this paper.

6.1 Homoskedasticity

If $E[\varepsilon_i^2 \mid x_i, z_i] = \sigma^2$ for all $i = 1, 2, \cdots, n$, then $\Sigma_n = \sigma^2 \Gamma_n$ and hence $\Omega_n = \sigma^2 \Gamma_n^{-1}$. Thus, a natural plug-in estimator is given by $\hat{V}_{HOM} = \hat{\sigma}^2 \hat{\Gamma}_n^{-1}$, where $\hat{\sigma}^2$ is chosen so that $\hat{\sigma}^2 = \sigma^2 + o_p(1)$. The usual OLS estimator is

\hat{\sigma}_n^2 = \frac{1}{n - K - d_x} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \frac{1}{n - K - d_x} \hat{\varepsilon}'\hat{\varepsilon}.

As shown in Lemma 1, $\hat{\Gamma}_n = \Gamma_n + o_p(1)$ under the many terms asymptotics discussed in this paper (Condition (6)), and therefore it remains to verify that $\hat{\sigma}_n^2$ is also consistent under this generalized asymptotic experiment. This estimator is consistent because

\hat{\varepsilon}'\hat{\varepsilon}/n = (Y - X\hat{\beta})'M(Y - X\hat{\beta})/n = \varepsilon'M\varepsilon/n + o_p(1) = \frac{n - K}{n} \sigma^2 + o_p(1),

where the first approximation is based on the $\sqrt{n}$-consistency of $\hat{\beta}$ and the approximation bias of the series estimator, while the second approximation is based on the fact that the bilinear form $\varepsilon'M\varepsilon = \varepsilon'(I - Q)\varepsilon$ is dominated by its diagonal.

Theorem 2. Suppose the assumptions of Theorem 1 hold. Then, $\hat{\sigma}_n^2 = \sigma^2 + o_p(1)$.

It follows by Lemma 1, Theorem 2 and the Slutsky theorem that

\hat{V}_{HOM} = \hat{\sigma}^2 \hat{\Gamma}_n^{-1} + o_p(1) = \Omega_n + o_p(1),

and therefore

\hat{V}_{HOM}^{-1/2} \sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, I_{d_x})

under Condition (6). Thus, under known homoskedasticity, the usual finite sample standard error estimator from least squares theory turns out to be valid even when $K$ is large. However, the "consistent" but biased standard error estimator $(\hat{\varepsilon}'\hat{\varepsilon}/n) \hat{\Gamma}_n^{-1}$ will not be consistent unless $K/n \to 0$, which implies that using the finite sample, unbiased standard error estimator under homoskedasticity is important even in large samples when the number of degrees of freedom is small (i.e., when $K$ is large).
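A minimal sketch contrasting the two estimators in a many-regressors design (illustrative; the series terms are replaced by generic covariates so that there is no approximation bias):

```python
import numpy as np

# Homoskedastic variance estimation with many terms: eps'eps/(n - K - d_x)
# remains consistent for sigma^2 when K/n -> kappa > 0, while eps'eps/n
# is too small by roughly the factor (n - K)/n.
rng = np.random.default_rng(6)
n, K, dx = 500, 200, 1
P = rng.standard_normal((n, K))                  # stand-in for the series terms
x = P[:, 0] + rng.standard_normal(n)
y = x * 1.0 + P @ (np.ones(K) / K) + rng.standard_normal(n)   # sigma^2 = 1

M = np.eye(n) - P @ np.linalg.solve(P.T @ P, P.T)
xt = M @ x
beta_hat = (xt @ (M @ y)) / (xt @ xt)
ehat = M @ y - xt * beta_hat
print((ehat @ ehat) / (n - K - dx))              # approx 1, the true sigma^2
print((ehat @ ehat) / n)                         # approx (n - K)/n = 0.6, biased down
```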

6.2 Heteroskedasticity

Under heteroskedasticity of unknown form, a natural candidate for a standard error estimator is the (family of) Eicker-Huber-White estimators given by

\hat{V}_{HR} = (\tilde{X}'\tilde{X})^{-} \tilde{X}'\hat{\Sigma}\tilde{X} (\tilde{X}'\tilde{X})^{-} = \frac{1}{n} \hat{\Gamma}_n^{-1} \hat{\Sigma}_n \hat{\Gamma}_n^{-1}, \qquad \hat{\Sigma}_n = \frac{1}{n} X'M\hat{\Sigma}MX, \qquad \hat{\Sigma} = \mathrm{diag}(\omega_1 \hat{\varepsilon}_1^2, \cdots, \omega_n \hat{\varepsilon}_n^2),

where $\{\omega_i : i = 1, \cdots, n\}$ are appropriate weights. Classical choices of weights include $\omega_i = 1$, $\omega_i = n/(n - K - d_x)$, $\omega_i = M_{ii}^{-1}$, etc. Since $\hat{\Gamma}_n = \Gamma_n + o_p(1)$ according to Lemma 1, it only remains to characterize the middle matrix of this classical sandwich formula. The asymptotic properties of $\hat{\Sigma}_n$ are given by

\hat{\Sigma}_n = \frac{1}{n} \sum_{i=1}^n \omega_i \tilde{x}_i \tilde{x}_i' \hat{\varepsilon}_i^2 = \frac{1}{n} \sum_{i=1}^n \omega_i \tilde{v}_i \tilde{v}_i' \tilde{\varepsilon}_i^2 + o_p(1) = \frac{1}{n} \sum_{i=1}^n E[\omega_i \tilde{v}_i \tilde{v}_i' \tilde{\varepsilon}_i^2 \mid Z] + o_p(1),

where, as before, the first approximation captures the bias of the series estimator and removes the estimation error of $\hat{\beta}$, and the second approximation shows that a (conditional) law of large numbers holds in this case (i.e., a variance condition). This result is summarized in the following theorem.

Theorem 3. Suppose the assumptions of Theorem 1 hold, $n K^{-2(\alpha_g + \alpha_h) + 1} \to 0$ with $\min\{\alpha_g, \alpha_h\} > 1/2$, and $\omega_i \in \sigma(Z)$ for all $i$ with $|\omega_i| \leq C_\omega$. Then, $\hat{\Sigma}_n = \tilde{\Sigma}_n + o_p(1)$, where

\tilde{\Sigma}_n = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \Big( \sum_{\ell=1}^n \omega_\ell M_{\ell i}^2 M_{\ell j}^2 \Big) E[v_i v_i' \varepsilon_j^2 \mid z_i, z_j].

Recall that the population asymptotic middle matrix $\Sigma_n$ is given by

\Sigma_n = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n M_{ij}^2 E[v_i v_i' \varepsilon_j^2 \mid z_i, z_j],

which implies that the classical HR standard errors will not be consistent in general under Condition (6). For example, the bias may be characterized under homoskedasticity: assuming $E[\varepsilon_i^2 \mid x_i, z_i] = \sigma^2$ and $\omega_i = 1$, a simple calculation yields

\Sigma_n = \frac{\sigma^2}{n} \sum_{i=1}^n M_{ii} E[v_i v_i' \mid z_i],

while, using basic properties of idempotent matrices, it is easy to verify that

\tilde{\Sigma}_n = \Sigma_n - \frac{\sigma^2}{n} \sum_{i=1}^n \sum_{j=1}^n M_{ij}^2 (1 - M_{jj}) E[v_i v_i' \mid z_i] < \Sigma_n.

As a consequence, the classical Eicker-Huber-White estimator is downward biased when $K$ is "large". It is important to note that this result continues to hold in a simple linear model where the number of regressors is "large" when compared to the sample size. Therefore, there is an important sense in which the classical HR standard error estimator is not robust, even in a simple linear model.

On the other hand, if $K/n \to 0$, then the asymptotic results presented above imply that $\tilde{\Sigma}_n = \Sigma_n + o_p(1)$, which verifies that the classical HR standard error estimator is indeed consistent under heteroskedasticity of unknown form.
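The mechanics of the downward bias are visible already in the residuals: with $\hat{\varepsilon} = M\varepsilon$ and homoskedastic errors, $E[\hat{\varepsilon}_i^2] = \sigma^2 M_{ii} < \sigma^2$, so plugging $\hat{\varepsilon}_i^2$ into the sandwich understates the middle matrix. A small simulation check (illustrative):

```python
import numpy as np

# With eps_hat = M eps and V[eps_i] = 1, E[eps_hat_i^2] = M_ii, whose average
# is 1 - K/n.  The classical EHW middle matrix inherits this shrinkage.
rng = np.random.default_rng(7)
n, K, reps = 200, 80, 2000
P = rng.standard_normal((n, K))
M = np.eye(n) - P @ np.linalg.solve(P.T @ P, P.T)

avg_sq = np.zeros(n)
for _ in range(reps):
    ehat = M @ rng.standard_normal(n)            # eps_hat = M eps, sigma^2 = 1
    avg_sq += ehat**2 / reps
print(np.abs(avg_sq - np.diag(M)).max())         # small: E[eps_hat_i^2] = M_ii
print(avg_sq.mean(), 1 - K / n)                  # both approx 0.6, not 1
```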

6.3 HR and Many Terms Robust Standard Errors

Intuitively, the failure of the classical HR standard error estimator is due to the fact that both $\tilde{x}_i$ and $\hat{\varepsilon}_i$ are estimated with too much error when $K/n \not\to 0$. Thus, it is possible to fix this problem by considering alternative (consistent) estimators of either $\tilde{x}_i$ or $\hat{\varepsilon}_i$. To describe the new estimators, let $K_g$ and $K_h$ be two choices of truncation for the approximating series, and let

\tilde{X} = (I - Q_h)X, \qquad Q_h = P_{K_h}(P_{K_h}'P_{K_h})^{-}P_{K_h}', \qquad M_h = I - Q_h,

\hat{\varepsilon} = (I - Q_g)(Y - X\hat{\beta}), \qquad Q_g = P_{K_g}(P_{K_g}'P_{K_g})^{-}P_{K_g}', \qquad M_g = I - Q_g.

Using this notation, Theorem 3 may be extended to the following result.

Theorem 4. Suppose the assumptions of Theorem 1 hold, $n K_g^{1/2} K_g^{-2\alpha_g} K_h^{1/2} K_h^{-2\alpha_h} \to 0$ with $\min\{\alpha_g, \alpha_h\} > 1/2$, and $\omega_i \in \sigma(Z)$ for all $i$ with $|\omega_i| \leq C_\omega$. Then, $\hat{\Sigma}_n = \bar{\Sigma}_n + o_p(1)$, where

\bar{\Sigma}_n = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \Big( \sum_{\ell=1}^n \omega_\ell M_{h,\ell i}^2 M_{g,\ell j}^2 \Big) E[v_i v_i' \varepsilon_j^2 \mid z_i, z_j].

This theorem leads naturally to two alternative recipes for heteroskedasticity and many terms robust estimators. Specifically, for $\omega_i = 1$, if $\min\{K_h, K_g\} = o(n)$ and $\max\{K_h, K_g\} = K$, then $\bar{\Sigma}_n = \Sigma_n + o_p(1)$. This result implies that

\hat{V}_{HET}^{-1/2} (\hat{\beta} - \beta_0) \to_d N(0, I_{d_x}), \qquad \hat{V}_{HET} = \frac{1}{n} \hat{\Gamma}_n^{-1} \hat{\Sigma}_n \hat{\Gamma}_n^{-1}.
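A minimal sketch of the two recipes (illustrative; the helper cjn_variance below is ours, and a fixed small $K_{cv}$ stands in for an actual cross-validated choice):

```python
import numpy as np

# Heteroskedasticity- and many-terms-robust variance (Theorem 4 recipes):
# different truncations K_h (for x_tilde) and K_g (for the residuals).
def cjn_variance(y, x, P, Kh, Kg):
    n = len(y)
    def annihilator(B):                          # M = I - B(B'B)^- B'
        return np.eye(n) - B @ np.linalg.pinv(B.T @ B) @ B.T
    M = annihilator(P)                           # all K terms for beta_hat itself
    xt_full = M @ x
    beta_hat = (xt_full @ (M @ y)) / (xt_full @ xt_full)
    xt = annihilator(P[:, :Kh]) @ x              # x_tilde from K_h terms
    ehat = annihilator(P[:, :Kg]) @ (y - x * beta_hat)  # eps_hat from K_g terms
    Gamma = (xt_full @ xt_full) / n
    Sigma = np.mean(xt**2 * ehat**2)             # omega_i = 1
    return beta_hat, Sigma / (n * Gamma**2)      # (1/n) Gamma^{-1} Sigma Gamma^{-1}

rng = np.random.default_rng(8)
n, K, K_cv = 400, 60, 10
P = rng.standard_normal((n, K))
x = P[:, 0] + rng.standard_normal(n)
y = x + P @ (np.ones(K) / K) + rng.standard_normal(n) * (1 + 0.5 * np.abs(x))
b, v1 = cjn_variance(y, x, P, Kh=K_cv, Kg=K)     # V_CJN1
_, v2 = cjn_variance(y, x, P, Kh=K, Kg=K_cv)     # V_CJN2
print(b, np.sqrt(v1), np.sqrt(v2))               # beta_hat and the two std. errors
```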

�n .

7 Simulations

To explore the consequences of using many terms in the partially linear model, or alternatively of using many covariates in a linear model, this section reports preliminary results from a Monte Carlo experiment. Specifically, the simulations consider the following model:

y_i = x_i'\beta + g(z_i) + \varepsilon_i, \qquad \varepsilon_i = \sigma_\varepsilon(x_i, z_i) u_{1i},

x_i = h(z_i) + v_i, \qquad v_i = \sigma_v(z_i) u_{2i},

with $d_x = 1$, $d_z = 10$, $g(z) = 1$, $h(z) = 0$, and $u_i = (u_{1i}, u_{2i})' \sim N(0, I_2)$ and $z_i \sim U(-1, 1)$ independently of $u_i$. Note that this data generating process does not have smoothing bias. Four models of heteroskedasticity are considered, as given in Table 1 (with $\iota = (1, 1, \cdots, 1)' \in \mathbb{R}^{d_z}$).

Table 1: Simulation Models

                                                \sigma_v^2(z_i) = 1    \sigma_v^2(z_i) = (z_i'\iota)^2
\sigma_\varepsilon^2(x_i, z_i) = 1              Model 1                Model 3
\sigma_\varepsilon^2(x_i, z_i) = (z_i'\iota + x_i)^2    Model 2        Model 4

For simplicity, the simulations consider additive-separable power series; that is, the unknown function $g(z_i)$ is assumed to satisfy $g(z_i) = 1 + g_1(z_{1i}) + \cdots + g_{d_z}(z_{d_z i})$, and each component is estimated by $g_j(z_{ji}) \approx p^K(z_{ji})'\gamma_j$, $j = 1, 2, \cdots, d_z$, with $p^K(z_{ji}) = (0, z_{ji}, z_{ji}^2, \cdots, z_{ji}^{K-1})'$.

We consider the classical least squares homoskedasticity-consistent standard error estimators

\hat{V}_{HO1} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n} \hat{\Gamma}^{-} \quad \text{and} \quad \hat{V}_{HO2} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n - K - d_x} \hat{\Gamma}^{-},

and the classical heteroskedasticity-consistent standard error estimators

\hat{V}_{HR1} = \hat{\Gamma}^{-} \tilde{X}'\hat{\Sigma}\tilde{X} \hat{\Gamma}^{-}, \qquad \hat{\Sigma} = \mathrm{diag}(\hat{\varepsilon}_1^2, \cdots, \hat{\varepsilon}_n^2)/n,

\hat{V}_{HR2} = \hat{\Gamma}^{-} \tilde{X}'\hat{\Sigma}\tilde{X} \hat{\Gamma}^{-}, \qquad \hat{\Sigma} = \mathrm{diag}(\hat{\varepsilon}_1^2, \cdots, \hat{\varepsilon}_n^2)/(n - K - d_x).

Also, we report the two alternative heteroskedasticity- and many terms robust standard error estimators proposed in Theorem 4. These estimators are given by

\hat{V}_{CJN1} = \hat{\Gamma}^{-} \tilde{X}'\hat{\Sigma}\tilde{X} \hat{\Gamma}^{-}, \qquad \hat{\Sigma} = \mathrm{diag}(\hat{\varepsilon}_1^2, \cdots, \hat{\varepsilon}_n^2)/n, \qquad K_h = K_{CV}, \quad K_g = K,

\hat{V}_{CJN2} = \hat{\Gamma}^{-} \tilde{X}'\hat{\Sigma}\tilde{X} \hat{\Gamma}^{-}, \qquad \hat{\Sigma} = \mathrm{diag}(\hat{\varepsilon}_1^2, \cdots, \hat{\varepsilon}_n^2)/n, \qquad K_h = K, \quad K_g = K_{CV},

where $K_{CV}$ represents a cross-validation estimate of the optimal $K$.

The results are presented in Figure 1, for a grid of $K = 0, 1, 2, \cdots, 20$. The effective number of degrees of freedom is determined by the choice of $K$ and $d_z$.
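For reference, a condensed sketch of one simulation cell (Model 2) under the stated design; the coverage rate it prints is illustrative, not one of the paper's reported numbers:

```python
import numpy as np

# One Monte Carlo cell: d_x = 1, d_z = 10, g(z) = 1, h(z) = 0, Model 2
# heteroskedasticity sigma_eps = |z'iota + x|, additive power series in z,
# coverage of the nominal 95% interval based on V_HR1-type standard errors.
rng = np.random.default_rng(9)
n, dz, K, beta0, reps = 500, 10, 4, 1.0, 200
cover = 0
for _ in range(reps):
    z = rng.uniform(-1, 1, (n, dz))
    x = rng.standard_normal(n)                        # h(z) = 0, sigma_v = 1
    eps = (z.sum(axis=1) + x) * rng.standard_normal(n)  # Model 2
    y = x * beta0 + 1.0 + eps                         # g(z) = 1
    P = np.hstack([np.ones((n, 1))] + [z**p for p in range(1, K + 1)])
    M = np.eye(n) - P @ np.linalg.pinv(P.T @ P) @ P.T
    xt = M @ x
    b = (xt @ (M @ y)) / (xt @ xt)
    ehat = M @ y - xt * b
    se = np.sqrt(np.sum(xt**2 * ehat**2)) / (xt @ xt)
    cover += abs(b - beta0) <= 1.96 * se
print(cover / reps)                                   # empirical coverage
```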


8 Technical Appendix

All statements involving conditional expectations are understood to hold almost surely. Recall that $M = I - Q$ is symmetric and idempotent, and therefore $|M_{ii}| \leq 1$, $n - K = \sum_{i=1}^n M_{ii}$, and $M_{ij} = \sum_{\ell=1}^n M_{i\ell} M_{\ell j}$.

Proof of Lemma 1. It follows from $H'MH/n = o_p(1)$ and the Cauchy-Schwarz inequality that $X'MX/n = (V + H)'M(V + H)/n = V'MV/n + H'MH/n + 2H'MV/n = V'MV/n + o_p(1)$, provided that

V'MV/n = \sum_{i=1}^n M_{ii} v_i v_i'/n + \sum_{i=1}^n \sum_{j \neq i} M_{ij} v_i v_j'/n = O_p(1).

Using $|M_{ii}| \leq 1$ and the Markov inequality,

\sum_{i=1}^n M_{ii} v_i v_i'/n = \sum_{i=1}^n M_{ii} E[v_i v_i' \mid z_i]/n + o_p(1),

because

V\Big[ \sum_{i=1}^n M_{ii} \|v_i\|^2/n \,\Big|\, Z \Big] = \sum_{i=1}^n M_{ii}^2 V[\|v_i\|^2 \mid z_i]/n^2 = O_{a.s.}(n^{-1}),

while

\sum_{i=1}^n \sum_{j \neq i} M_{ij} v_i v_j'/n = O_p(n^{-1} K^{1/2}) = o_p(1)

by Lemma A1 of Chao, Swanson, Hausman, Newey, and Woutersen (2010). ∎

Proof of Lemma 2. First note that

X'M(Y - X\beta_0)/\sqrt{n} = V'M\varepsilon/\sqrt{n} + H'M\varepsilon/\sqrt{n} + X'MG/\sqrt{n},

where $H'M\varepsilon/\sqrt{n} = O_p(K^{-\alpha_h})$ because

E[\|H'M\varepsilon/\sqrt{n}\|^2 \mid Z] = \mathrm{tr}(H'M E[\varepsilon\varepsilon' \mid Z] MH)/n \leq C_\varepsilon \, \mathrm{tr}(H'MH)/n = O_{a.s.}(K^{-2\alpha_h}).

Next,

X'MG/\sqrt{n} = V'MG/\sqrt{n} + H'MG/\sqrt{n} = O_p(K^{-\alpha_g} + \sqrt{n} K^{-(\alpha_h + \alpha_g)}),

because

E[\|V'MG/\sqrt{n}\|^2 \mid Z] = G'M E[VV' \mid Z] MG/n \leq C_v G'MG/n = O_{a.s.}(K^{-2\alpha_g}),

and

\|H'MG/\sqrt{n}\| \leq \sqrt{n} \sqrt{H'MH/n} \sqrt{G'MG/n} = O_{a.s.}(\sqrt{n} K^{-(\alpha_h + \alpha_g)}),

which completes the proof. ∎

Proof of Lemma 3. Follows from Lemma A2 in Chao, Swanson, Hausman, Newey, and Woutersen (2010). ∎

Proof of Theorem 1. Follows by standard arguments from Lemmas 1-3. ∎

Proof of Theorem 2. First, it follows from $G'MG/n = o_p(1)$ and the Cauchy-Schwarz inequality that

\hat{\varepsilon}'\hat{\varepsilon}/n = (Y - X\hat{\beta})'M(Y - X\hat{\beta})/n = (Y - X\hat{\beta} - G)'M(Y - X\hat{\beta} - G)/n + G'MG/n + 2(Y - X\hat{\beta} - G)'MG/n = (Y - X\hat{\beta} - G)'M(Y - X\hat{\beta} - G)/n + o_p(1),

provided $(Y - X\hat{\beta} - G)'M(Y - X\hat{\beta} - G)/n = O_p(1)$. Next, note that Lemma 1 and $\hat{\beta} - \beta = o_p(1)$ imply $(\hat{\beta} - \beta)'X'MX(\hat{\beta} - \beta)/n = o_p(1)$, which together with the Cauchy-Schwarz inequality gives

(Y - X\hat{\beta} - G)'M(Y - X\hat{\beta} - G)/n = \varepsilon'M\varepsilon/n + (\hat{\beta} - \beta)'X'MX(\hat{\beta} - \beta)/n - 2\varepsilon'MX(\hat{\beta} - \beta)/n = \varepsilon'M\varepsilon/n + o_p(1) = \tilde{\varepsilon}'\tilde{\varepsilon}/n + o_p(1).

Finally, consider the bilinear form

\tilde{\varepsilon}'\tilde{\varepsilon}/n = \varepsilon'M\varepsilon/n = \sum_i M_{ii}\varepsilon_i^2/n + \sum_i \sum_{j \neq i} \varepsilon_i M_{ij} \varepsilon_j/n.

Using $|M_{ii}| \leq 1$ and the Markov inequality,

\sum_i M_{ii}\varepsilon_i^2/n = \sum_i M_{ii} E[\varepsilon_i^2 \mid z_i]/n + o_p(1) = \frac{n - K}{n}\sigma^2 + o_p(1)

because

E\Big[ V\Big[ \sum_i M_{ii}\varepsilon_i^2/n \,\Big|\, Z \Big] \Big] = E\Big[ \sum_i M_{ii}^2 V[\varepsilon_i^2 \mid z_i]/n^2 \Big] \leq \sum_i E[V[\varepsilon_i^2 \mid z_i]]/n^2 \leq C_\varepsilon/n = o(1),

while

\sum_i \sum_{j \neq i} M_{ij}\varepsilon_i\varepsilon_j/n = O_p(n^{-1} K^{1/2}) = o_p(1)

by Lemma A1 of Chao, Swanson, Hausman, Newey, and Woutersen (2010). Therefore, since $(n - K)/(n - K - d_x) \to 1$, $\hat{\sigma}^2 = \hat{\varepsilon}'\hat{\varepsilon}/(n - K - d_x) = \sigma^2 + o_p(1)$, which completes the proof. ∎

Proof of Theorem 3. Apply Theorem 4 with $K_g = K_h = K$. ∎

Proof of Theorem 4. First note that $\hat{\varepsilon} = \tilde{Y} - \tilde{X}\hat{\beta} = \tilde{\varepsilon} - \tilde{X}(\hat{\beta} - \beta) + \tilde{g}$, for $\tilde{X} = (I - Q_g)X$, $\tilde{\varepsilon} = (I - Q_g)\varepsilon$, and $\tilde{g} = (I - Q_g)G$. Thus, $\hat{\Sigma}_n = T_{1,n} + T_{2,n} + T_{3,n} + T_{4,n} + T_{5,n} + T_{6,n}$, where

T_{1,n} = \frac{1}{n}\sum_i \omega_i \tilde{\varepsilon}_i^2 \tilde{v}_i\tilde{v}_i', \qquad T_{2,n} = \frac{1}{n}\sum_i \omega_i \tilde{\varepsilon}_i^2 \tilde{v}_i(\tilde{x}_i - \tilde{v}_i)', \qquad T_{3,n} = \frac{1}{n}\sum_i \omega_i \tilde{\varepsilon}_i^2 (\tilde{x}_i - \tilde{v}_i)\tilde{v}_i',

T_{4,n} = \frac{1}{n}\sum_i \omega_i \tilde{\varepsilon}_i^2 (\tilde{x}_i - \tilde{v}_i)(\tilde{x}_i - \tilde{v}_i)', \qquad T_{5,n} = \frac{1}{n}\sum_i \omega_i \big( \tilde{x}_i'(\hat{\beta} - \beta) + \tilde{g}_i \big)^2 \tilde{x}_i\tilde{x}_i',

T_{6,n} = \frac{1}{n}\sum_i \omega_i \tilde{\varepsilon}_i \big( \tilde{x}_i'(\hat{\beta} - \beta) + \tilde{g}_i \big) \tilde{x}_i\tilde{x}_i',

with $\tilde{X} = (I - Q_h)X = [\tilde{x}_1, \cdots, \tilde{x}_n]'$ and $\tilde{V} = (I - Q_h)V = [\tilde{v}_1, \cdots, \tilde{v}_n]'$ in these displays.

First consider the leading term $T_{1,n}$. Note that $T_{1,n} = T_{11,n} + T_{12,n} + T_{13,n}$, where

T_{11,n} = \frac{1}{n}\sum_i \omega_i \Big( \sum_j M_{g,ij}^2 \varepsilon_j^2 \Big)\Big( \sum_k M_{h,ik}^2 v_k v_k' \Big),

T_{12,n} = \frac{1}{n}\sum_i \omega_i \Big( \sum_j M_{g,ij}^2 \varepsilon_j^2 \Big)\Big( \sum_{k_1} \sum_{k_2 \neq k_1} M_{h,ik_1} M_{h,ik_2} v_{k_1} v_{k_2}' \Big),

T_{13,n} = \frac{1}{n}\sum_i \omega_i \Big( \sum_{j_1} \sum_{j_2 \neq j_1} M_{g,ij_1} M_{g,ij_2} \varepsilon_{j_1} \varepsilon_{j_2} \Big)\Big( \sum_{k_1,k_2} M_{h,ik_1} M_{h,ik_2} v_{k_1} v_{k_2}' \Big),

with $E[T_{12,n} \mid Z] = 0 = E[T_{13,n} \mid Z]$. Moreover,

T_{11,n} = E[T_{11,n} \mid Z] + o_p(1), \qquad E[T_{11,n} \mid Z] = \frac{1}{n}\sum_{i,j} \Big( \sum_\ell \omega_\ell M_{g,\ell i}^2 M_{h,\ell j}^2 \Big) E[\varepsilon_i^2 v_j v_j' \mid z_i, z_j],

because $V[T_{11,n} \mid Z] = o_{a.s.}(1)$. To see this last result, let $a_{1,ij} = \sum_\ell \omega_\ell M_{g,\ell i}^2 M_{h,\ell j}^2$ and $A_{1,ij} = \varepsilon_i^2 v_j v_j' - E[\varepsilon_i^2 v_j v_j' \mid Z]$, and note, after expanding the sums and collecting terms, that

V[T_{11,n} \mid Z] = \frac{1}{n^2} E\Big[ \Big\| \sum_{i,j} a_{1,ij} A_{1,ij} \Big\|^2 \,\Big|\, Z \Big] \leq \frac{C}{n^2} \sum_{i,j,k} [a_{1,ij}a_{1,ik} + a_{1,ik}a_{1,ji} + a_{1,ij}a_{1,jk} + a_{1,ik}a_{1,jk}] \leq C n^{-1},

because

\Big| \sum_{i,j,k} a_{1,ij}a_{1,jk} \Big| = \Big| \sum_{i,j,k} \Big( \sum_{\ell_1} \omega_{\ell_1} M_{g,\ell_1 i}^2 M_{h,\ell_1 j}^2 \Big)\Big( \sum_{\ell_2} \omega_{\ell_2} M_{g,\ell_2 j}^2 M_{h,\ell_2 k}^2 \Big) \Big| \leq C \sum_{\ell_1} \Big( \sum_i M_{g,\ell_1 i}^2 \Big) \sum_j M_{h,\ell_1 j}^2 \sum_{\ell_2} M_{g,\ell_2 j}^2 \sum_k M_{h,\ell_2 k}^2 \leq C n,

and similarly for the other terms.

Next, $T_{12,n} = o_p(1)$ because $E[T_{12,n} \mid Z] = 0$ and $E[\|T_{12,n}\|^2 \mid Z] \leq C n^{-1/2}$. To see the last conclusion, let $a_{2,ijk} = \mathbf{1}(j \neq k) \sum_\ell \omega_\ell M_{g,\ell i}^2 M_{h,\ell j} M_{h,\ell k}$ and $A_{2,ijk} = \varepsilon_i^2 v_j v_k'$, and observe that $a_{2,ijk} = a_{2,ikj}$. Expanding the sums, using the mean-zero property of $A_{2,ijk}$, and collecting terms gives

E[\|T_{12,n}\|^2 \mid Z] \leq \frac{C}{n^2} E\Big[ \Big\| \sum_i \sum_{j \neq i} a_{2,iji} \varepsilon_i^2 v_i v_j' \Big\|^2 \,\Big|\, Z \Big] + \frac{C}{n^2} E\Big[ \Big\| \sum_i \sum_{j \neq i} \sum_{k \neq i, k \neq j} a_{2,ijk} A_{2,ijk} \Big\|^2 \,\Big|\, Z \Big].

Now, for the first term, define $a_{2,ij} = \mathbf{1}(i \neq j) a_{2,iji}$ and $A_{2,ij} = \varepsilon_i^2 v_i v_j'$; expanding the sums and collecting non-mean-zero terms,

\frac{1}{n^2} E\Big[ \Big\| \sum_i \sum_{j \neq i} a_{2,ij} \varepsilon_i^2 v_i v_j' \Big\|^2 \,\Big|\, Z \Big] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} [a_{2,ij}^2 + a_{2,ij}a_{2,ji}] + \frac{C}{n^2} \sum_i \sum_{j \neq i} \sum_{k \neq i, k \neq j} a_{2,ik}a_{2,jk} \leq C n^{-1} + C n^{-1/2},

while for the second term

\frac{1}{n^2} E\Big[ \Big\| \sum_i \sum_{j \neq i} \sum_{k \neq i, k \neq j} a_{2,ijk} A_{2,ijk} \Big\|^2 \,\Big|\, Z \Big] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} \sum_{k \neq i, k \neq j} [a_{2,ijk}^2 + a_{2,ijk}a_{2,jik}] + \frac{C}{n^2} \sum_i \sum_{j \neq i} \sum_{k \neq i, k \neq j} \sum_{l \neq i, l \neq j, l \neq k} a_{2,ijk}a_{2,ljk} \leq C n^{-1/2} + C n^{-1},

where the final bounds as a function of $n$ are obtained by properties of the idempotent matrix $M$ as above.

Finally, $T_{13,n} = o_p(1)$ because $E[T_{13,n} \mid Z] = 0$ and $E[\|T_{13,n}\|^2 \mid Z] \leq C n^{-1}$. To see the last conclusion, first let $a_{3,ij} = \varepsilon_i \varepsilon_j$ and $A_{3,ij} = \mathbf{1}(i \neq j) \sum_\ell \omega_\ell M_{g,\ell i} M_{g,\ell j} \big( \sum_{k_1} \sum_{k_2} M_{h,\ell k_1} M_{h,\ell k_2} v_{k_1} v_{k_2}' \big)$, and observe that $a_{3,ij} = a_{3,ji}$ and $A_{3,ij} = A_{3,ji} = A_{3,ij}' = A_{3,ji}'$. Then, expanding the sums and collecting non-mean-zero terms,

E[\|T_{13,n}\|^2 \mid Z] = E\Big[ \Big\| \frac{1}{n} \sum_i \sum_j a_{3,ij} A_{3,ij} \Big\|^2 \,\Big|\, Z \Big] = \frac{2}{n^2} \sum_i \sum_j E[a_{3,ij}^2 \|A_{3,ij}\|^2 \mid Z] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} E\Big[ \Big\| \sum_\ell \omega_\ell M_{g,\ell i} M_{g,\ell j} \sum_{k_1} \sum_{k_2} M_{h,\ell k_1} M_{h,\ell k_2} v_{k_1} v_{k_2}' \Big\|^2 \,\Big|\, Z \Big].

Next, for each $(i, j)$ let $a_{3,ijkl} = \sum_\ell \omega_\ell M_{g,\ell i} M_{g,\ell j} M_{h,\ell k} M_{h,\ell l}$ and $A_{3,kl} = v_k v_l'$, and note that $a_{3,ijkl} = a_{3,ijlk}$ and $A_{3,kl} = A_{3,lk}'$. Thus, expanding the sums and collecting non-mean-zero terms,

E[\|T_{13,n}\|^2 \mid Z] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} E\Big[ \Big\| \sum_{k=1}^n \sum_{l=1}^n a_{3,ijkl} A_{3,kl} \Big\|^2 \,\Big|\, Z \Big] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} E\Big[ \Big\| \sum_k a_{3,ijkk} A_{3,kk} \Big\|^2 \,\Big|\, Z \Big] + \frac{C}{n^2} \sum_i \sum_{j \neq i} E\Big[ \Big\| \sum_k \sum_{l \neq k} a_{3,ijkl} A_{3,kl} \Big\|^2 \,\Big|\, Z \Big] \leq \frac{C}{n^2} \sum_i \sum_{j \neq i} \sum_k a_{3,ijkk}^2 + \frac{C}{n^2} \sum_i \sum_{j \neq i} \sum_k \sum_{l \neq k} a_{3,ijkl}^2 \leq C n^{-1},

where the last inequality follows by properties of the idempotent matrix $M$ as above.

Next, consider the smaller order terms $T_{k,n}$, $k = 2, \cdots, 6$. By $\tilde{\varepsilon}_i = M_{g,ii}\varepsilon_i + \sum_{j \neq i} M_{g,ij}\varepsilon_j$ with $E[\varepsilon_i \mid z_i] = 0$,

E[\tilde{\varepsilon}_i^4 \mid Z] \leq C \sum_j \sum_k M_{g,ij}^2 M_{g,ik}^2 E[\varepsilon_j^2 \varepsilon_k^2 \mid z_j, z_k] \leq C,

by properties of the idempotent matrix $M$. Similarly, $E[\|\tilde{v}_i\|^4 \mid Z] \leq C$. For terms $T_{2,n}$ and $T_{3,n}$, using $\tilde{x}_i - \tilde{v}_i = M_{h,i}'H = M_{h,i}'(H - P_{K_h}\beta_h)$, it follows that for $k = 2, 3$,

E[\|T_{k,n}\| \mid Z] \leq \frac{1}{n} \sum_i |\omega_i| E[\tilde{\varepsilon}_i^2 \|\tilde{v}_i\| \mid Z] \, \|M_{h,i}'(H - P_{K_h}\beta_h)\| \leq \frac{C}{n} \sum_i \sum_k |M_{h,ik}| \, \|h(z_k) - p^{K_h}(z_k)'\beta_h\| \leq C K_h^{1/2} K_h^{-\alpha_h}

by the Cauchy-Schwarz inequality, and therefore $T_{2,n} = o_p(1)$ and $T_{3,n} = o_p(1)$. Also, by the same arguments,

E[\|T_{4,n}\| \mid Z] \leq \frac{C}{n} \sum_i E[\tilde{\varepsilon}_i^2 \mid Z] \, \|M_{h,i}'(H - P_{K_h}\beta_h)\|^2 \leq \frac{C}{n} \sum_i M_{h,ii} \sum_k \|h(z_k) - p^{K_h}(z_k)'\beta_h\|^2 \leq C K_h K_h^{-2\alpha_h},

which implies that $T_{4,n} = o_p(1)$.

Therefore, it remains to show that $T_{5,n} = o_p(1)$, which then implies by the Cauchy-Schwarz inequality that $T_{6,n} = o_p(1)$. To this end, note that

T_{5,n} = \frac{1}{n} \sum_i \omega_i \big( \tilde{x}_i'(\hat{\beta} - \beta) + \tilde{g}_i \big)^2 \tilde{x}_i \tilde{x}_i' = O_p(n^{-1} + K_h K_h^{-4\alpha_h} + K_g^{1/2} K_g^{-2\alpha_g} + n K_g^{1/2} K_g^{-2\alpha_g} K_h^{1/2} K_h^{-2\alpha_h}),

because, using $\hat{\beta} - \beta_0 = O_p(n^{-1/2})$ and $E[\|\tilde{v}_i\|^4 \mid Z] \leq C$,

\frac{1}{n} \sum_i |\omega_i| \, \|\tilde{x}_i'(\hat{\beta} - \beta_0)\|^2 \|\tilde{x}_i\|^2 \leq C \|\hat{\beta} - \beta_0\|^2 \frac{1}{n} \sum_i \|\tilde{v}_i\|^4 + C \|\hat{\beta} - \beta_0\|^2 \frac{1}{n} \sum_i \|M_{h,i}'(H - P_{K_h}\beta_h)\|^4 \leq O_p(n^{-1}) + O_p(1) \frac{1}{n^2} \sum_i M_{h,ii}^2 \Big( \sum_k \|h(z_k) - p^{K_h}(z_k)'\beta_h\|^2 \Big)^2 = O_p(n^{-1} + K_h K_h^{-4\alpha_h}),

and

\frac{1}{n} \sum_i |\omega_i| \, |\tilde{g}_i|^2 \|\tilde{x}_i\|^2 \leq C \sqrt{ \sum_i M_{g,ii}^2 \Big( \sum_k \|g(z_k) - p^{K_g}(z_k)'\beta_g\|^2 \Big)^2 } \sqrt{ \frac{1}{n^2} \sum_i \|\tilde{x}_i\|^4 } \leq C \sqrt{K_g} \Big( \sum_k \|g(z_k) - p^{K_g}(z_k)'\beta_g\|^2 \Big) O_p(n^{-1} + K_h^{1/2} K_h^{-2\alpha_h}) = O_p(K_g^{1/2} K_g^{-2\alpha_g} + n K_g^{1/2} K_g^{-2\alpha_g} K_h^{1/2} K_h^{-2\alpha_h}).

This concludes the proof. ∎

References

Angrist, J., G. W. Imbens, and A. Krueger (1999): "Jackknife Instrumental Variables Estimation," Journal of Applied Econometrics, 14(1), 57-67.

Bekker, P. A. (1994): "Alternative Approximations to the Distributions of Instrumental Variables Estimators," Econometrica, 62, 657-681.

Cattaneo, M. D., R. K. Crump, and M. Jansson (2010): "Small Bandwidth Asymptotics for Density-Weighted Average Derivatives," working paper.

Chao, J. C., N. R. Swanson, J. A. Hausman, W. K. Newey, and T. Woutersen (2010): "Asymptotic Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments," forthcoming in Econometric Theory.

Donald, S. G., and W. K. Newey (1994): "Series Estimation of Semilinear Models," Journal of Multivariate Analysis, 50(1), 30-40.

Hansen, C., J. Hausman, and W. K. Newey (2008): "Estimation with Many Instrumental Variables," Journal of Business and Economic Statistics, 26(4), 398-422.

Kunitomo, N. (1980): "Asymptotic Expansions of the Distributions of Estimators in a Linear Functional Relationship and Simultaneous Equations," Journal of the American Statistical Association, 75(371), 693-700.

Linton, O. (1995): "Second Order Approximation in the Partially Linear Regression Model," Econometrica, 63(5), 1079-1112.

Morimune, K. (1983): "Approximate Distributions of k-Class Estimators when the Degree of Overidentifiability is Large Compared with the Sample Size," Econometrica, 51(3), 821-841.

Newey, W. K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79, 147-168.

Powell, J. L., J. H. Stock, and T. M. Stoker (1989): "Semiparametric Estimation of Index Coefficients," Econometrica, 57(6), 1403-1430.

Robinson, P. M. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica, 56(4), 931-954.

van der Vaart, A. W. (1998): Asymptotic Statistics. Cambridge University Press, New York.

White, H. (1980): "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48(4), 817-838.

[Figure 1: Empirical Coverage Rates for 95% Confidence Intervals (n = 500, S = 3,000 replications). Four panels, one per Model 1-4; each plots empirical coverage against K_n for the estimators V_HO1, V_HO2, V_HR1, V_HR2, V_CJN1, and V_CJN2, with the nominal 0.95 level marked.]