59
Select the Valid and Relevant Moments: An Information-Based LASSO for GMM with Many Moments Xu Cheng y Zhipeng Liao z First Version: June, 2011 This Version: October, 2014 Abstract This paper studies the selection of valid and relevant moments for the generalized method of moments (GMM) estimation. For applications with many candidate moments, our asymptotic analysis accommodates a diverging number of moments as the sample size increases. The pro- posed procedure achieves three objectives in one-step: (i) the valid and relevant moments are distinguished from the invalid or irrelevant ones; (ii) all desired moments are selected in one step instead of in a stepwise manner; (iii) the parameters of interest are automatically estimated with all selected moments as opposed to a post-selection estimation. The new method performs mo- ment selection and e¢ cient estimation simultaneously via an information-based adaptive GMM shrinkage estimation, where an appropriate penalty is attached to the standard GMM crite- rion to link moment selection to shrinkage estimation. The penalty is designed to signal both moment validity and relevance for consistent moment selection. We develop asymptotic results for the high-dimensional GMM shrinkage estimator, allowing for non-smooth sample moments and weakly dependent observations. For practical implementation, this one-step procedure is computationally attractive. JEL Classication: C12, C13, C36 Keywords: Adaptive Penalty; GMM; Many Moments; Moment Selection; Oracle Properties; Shrinkage Estimation We appreciate the insightful suggestions from the co-editor and three anonymous referees. We also thank Xi- aohong Chen, Jinyong Hahn, Bruce Hansen, Michael Jansson, Frank Kleibergen, Adam McCloskey, Peter C. B. Phillips, Jack Porter, Eric Renault, and participants in the 2012 Great New York Area Econometrics Colloquium, 2013 North American Summer Meeting of the Econometric Society at USC, 2013 CEME Stanford/UCLA Conference at Stanford University, econometrics workshops at Brown University, University of Montreal, and the University of Wisconsin-Madison. y Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, USA. Email: [email protected] z Department of Economics, UC Los Angeles, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095. Email: [email protected] 1

Select the Valid and Relevant Moments: An Information-Based … · Hall and Peixe (2003) study the selection of relevant IVs through canonical correlations and conduct simulations

  • Upload
    hakhue

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Select the Valid and Relevant Moments:

An Information-Based LASSO for GMM with Many Moments�

Xu Chengy Zhipeng Liaoz

First Version: June, 2011

This Version: October, 2014

Abstract

This paper studies the selection of valid and relevant moments for the generalized method of

moments (GMM) estimation. For applications with many candidate moments, our asymptotic

analysis accommodates a diverging number of moments as the sample size increases. The pro-

posed procedure achieves three objectives in one-step: (i) the valid and relevant moments are

distinguished from the invalid or irrelevant ones; (ii) all desired moments are selected in one step

instead of in a stepwise manner; (iii) the parameters of interest are automatically estimated with

all selected moments as opposed to a post-selection estimation. The new method performs mo-

ment selection and e¢ cient estimation simultaneously via an information-based adaptive GMM

shrinkage estimation, where an appropriate penalty is attached to the standard GMM crite-

rion to link moment selection to shrinkage estimation. The penalty is designed to signal both

moment validity and relevance for consistent moment selection. We develop asymptotic results

for the high-dimensional GMM shrinkage estimator, allowing for non-smooth sample moments

and weakly dependent observations. For practical implementation, this one-step procedure is

computationally attractive.

JEL Classi�cation: C12, C13, C36

Keywords: Adaptive Penalty; GMM; Many Moments; Moment Selection; Oracle Properties;

Shrinkage Estimation�We appreciate the insightful suggestions from the co-editor and three anonymous referees. We also thank Xi-

aohong Chen, Jinyong Hahn, Bruce Hansen, Michael Jansson, Frank Kleibergen, Adam McCloskey, Peter C. B.Phillips, Jack Porter, Eric Renault, and participants in the 2012 Great New York Area Econometrics Colloquium,2013 North American Summer Meeting of the Econometric Society at USC, 2013 CEME Stanford/UCLA Conferenceat Stanford University, econometrics workshops at Brown University, University of Montreal, and the University ofWisconsin-Madison.

yDepartment of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, USA. Email:[email protected]

zDepartment of Economics, UC Los Angeles, 8379 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095.Email: [email protected]

1

1 Introduction

In many applications of the generalized method of moments (GMM), the number of candidate

moment conditions is much larger than that of the parameters of interest. However, one typically

does not employ all candidate moment conditions due to two concerns. First, some moments may

be invalid, which cause inconsistent estimation if included. Second, some moment conditions may

be redundant. A redundant moment condition does not contain additional information to improve

estimation e¢ ciency and results in additional �nite-sample bias. Therefore, it is important to

identify the valid and relevant (non-redundant) moment conditions, especially when both concerns

are elevated in the presence of many candidate moments. This paper proposes a procedure that

consistently selects all valid and relevant moments in econometric models where the number of

candidate moments is allowed to increase with the sample size. This type of asymptotic framework

re�ects the complexity of the problem and the computation demand associated with a large number

of candidate moments.

Our method achieves consistent moment selection via an information-based adaptive GMM

shrinkage estimation. Assuming there exists a conservative set of moment conditions that iden-

ti�es the parameter of interest, the moment selection problem is embedded in a penalized GMM

(P-GMM) estimation and a novel penalty is designed to incorporate information on both moment

validity and relevance for adaptive estimation. This penalized GMM estimation not only consis-

tently selects all valid and relevant moment conditions in one step, but also simultaneously and

e¢ ciently estimates the parameters of interest by incorporating all valid and relevant moments

and leaving out all invalid or redundant ones automatically. Asymptotic results provide bounds on

the penalty level to ensure consistent moment selection and e¢ cient estimation. We analyze these

bounds as a function of the sample size and the number of moments, and provide an algorithm for

practical implementation of our procedure.

This paper develops asymptotic results for the high-dimensional GMM shrinkage estimator in a

general framework, allowing for: (i) an increasing number of candidate moments; (ii) an increasing

number of nuisance parameters; (iii) non-smooth moment functions; and (iv) weakly dependent

observations. High-level assumptions are �rst provided to capture the main characteristics of the

problem, followed by primitive su¢ cient assumptions. We develop results on consistency, rate

of convergence, super e¢ ciency, and the asymptotic distribution of the high-dimensional GMM

shrinkage estimator. A linear instrumental variable (IV) model with independent and identically

distributed (i.i.d.) observations is studied in detail to illustrate the general results.

Our paper contributes to the study of moment validity and relevance, and extends it to a high-

dimensional framework. There is a long history on the study of moment validity, starting from

2

Sargan (1958), Hansen (1982), Eichenbaum, Hansen, and Singleton (1988). More recent papers

include Berkowitz, Caner, and Fang (2012), Conley, Hansen, and Rossi (2012), Doko Tchatoka,

and Dufour (2012), Guggenberger (2012), Nevo and Rosen (2012), and DiTraglia (2012), among

others. There are moment selection methods in the literature which select moment conditions based

on their validity. In a seminal paper, Andrews (1999) proposes a moment selection criterion, based

on a trade-o¤ between the J statistic and the number of moment conditions, and downward and

upward testing procedures. Andrews and Lu (2001) generalize these methods and study applications

to dynamic panel models. Hong, Preston, and Shum (2003) study moment selection based on the

generalized empirical likelihood estimation. Liao (2013) proposes a GMM shrinkage procedure for

selection of valid moment conditions. These moment selection methods only take the moment

validity into account and they assume that the number of the candidate moments is �xed.

On the moment relevance, Breusch, Qian, Schmidt, and Whyhowski (1999) discuss that, even

though a moment is valid and useful by itself, it becomes redundant if its residual after projecting

onto an existing set of moment conditions does not contain additional information. For example, in

the linear IV model, an IV is redundant if it does not improve the �rst-stage regression. Im, Ahn,

Schmidt, and Wooldridge (1999) study e¢ cient estimation in dynamic panel models in the presence

of redundant moments. Hall and Peixe (2003) study the selection of relevant IVs through canonical

correlations and conduct simulations to demonstrate the importance of excluding redundant IVs

in �nite samples. Hall, Inoue, Jana, and Shin (2007) propose a moment selection criterion that

balances the information content and the number of moments. This procedure can be applied to

select relevant moments, after all invalid moments are left out in the �rst step. For applications

to DSGE models, Hall, Inoue, Nason, and Rossi (2010) propose two moment selection criteria of

this sort to select all valid and relevant impulse response functions for matching estimation. There

are moment selection methods in the literature which select IVs or moments via di¤erent criteria,

such as the mean square error or the coverage of con�dence region of the estimators of structural

parameters, see, e.g., Donald and Newey (2001), Donald, Imbens, and Newey (2009), Kuersteiner

(2002), and Inoue (2006). These selection procedures assume that all candidate moments are valid.

This paper also complements a growing literature on the application of high-dimensional meth-

ods to the IV and moment based econometric models. Most papers in this literature investigate

e¢ cient estimation in the presence of many valid IVs. Belloni, Chernozhukov, and Hansen (2010)

and Belloni, Chen, Chernozhukov, and Hansen (2012) apply Lasso-type estimation to linear mod-

els with many IVs and show that the optimal IV is well approximated by the �rst stage shrinkage

estimation. The boosting method is suggested for IV selection by Bai and Ng (2009). Carrasco

(2012) studies e¢ cient estimation with many IVs by regularization techniques. Shrinkage esti-

3

mation for homoskedastic linear IV models is considered by Chamberlain and Imbens (2004) and

Okui (2011). Gautier and Tsybakov (2011) propose a Danzig selector based IV estimator in high

dimensional models. Kuersteiner and Okui (2010) recommend using the model averaging methods

to approximate the optimal IV in the �rst-stage regression. Lee and Zhou (2011) consider averaged

instrumental variable estimator. Caner and Zhang (2012) study adaptive elastic net GMM esti-

mation with an increasing number of parameters. Caner, Han, and Lee (2013) study the valid IV

selection and variable selection in linear IV models. Fan and Liao (2011) investigate P-GMM and

penalized empirical likelihood estimation in ultra high dimensional models where the number of

parameters increases faster than the sample size and provide a di¤erent type of asymptotic results.

Our paper contributes to the literature by combining the selection of valid and relevant moments

with e¢ cient estimation, proposing a new information-based adaptive penalty, and considering a

general nonlinear GMM estimation with possible non-smooth moment conditions and temporally

dependent observations.

The rest of the paper is organized as follows. Section 2 describes the three categories of moment

conditions, provides heuristic arguments on how shrinkage estimation distinguishes moments in dif-

ferent categories, and introduces the P-GMM estimator and its information-based penalty. Section

3 derives asymptotic results for the P-GMM estimator, including consistency, rate of convergence,

super e¢ ciency, and asymptotic distribution, and discusses their implications on consistent mo-

ment selection. Section 4 applies the main theory of this paper to a linear IV model. Section 5

analyzes the asymptotic magnitudes of the information-based penalty and provides suggestions for

practical implementation of the procedure. Section 6 provides �nite-sample results through simula-

tion. Section 7 concludes and discusses related topics under investigation. The Appendix contains

all the technical proofs. A separate online Supplemental Appendix contains additional supporting

materials and is available on the authors�websites.

The notations are standard. Throughout the paper, C denotes some generic �nite positive

constant; k�k denotes the Euclidean norm; A0 denotes the transpose of a matrix A; �max(A) and�min(A) denote the largest and smallest eigenvalues of a matrix A respectively; for d1 � 1 vectorfunction f(x): Rd2 ! Rd1 we use @f(x)

@x0 to denote the d1 � d2 matrix whose i-th row and j-th

column element is @fi(x)@xjwhere fi(�) and xj are the i-th component in f(�) and j-th component in

x respectively; for any square matrix A, A � 0 means that A is a positive semi-de�nite matrix;

for any positive integers k1 and k2, Ik1 denotes the k1 � k1 identity matrix and 0k1�k2 denotes thek1�k2 zero matrix; A � B means that A is de�ned as B; an = op(bn) means that for any constants�1; �2 > 0, there is Pr (jan=bnj � �1) < �2 eventually; an = Op(bn) means that for any � > 0,

there is a �nite constant C� such that Pr (jan=bnj � C�) < � eventually; �!p�and �!d�denote

4

convergence in probability and convergence in distribution, respectively; and w.p.a.1 abbreviates

with probability approaching 1.

2 An Information-Based GMM Shrinkage Estimator

2.1 Three categories of moment conditions

There exists a vector of moment functions g(Z; �): Rdz � � ! Rkn for the estimation of �o 2� � Rd� , where fZi : i = 1; :::; ng is stationary and ergodic, Z is used generically for Zi and d�

is a �xed positive integer. We allow the number of moments kn to increase with the sample size.

In particular, we are interested in applications where kn is much larger than d�. In this case, it

is not restrictive to assume that there exists a relatively small sub-vector of g(Z; �), denoted by

gS(Z; �) 2 Rk0 , for the identi�cation of �o: Let E [�] denote the expectation operator taken withrespect to the distribution of Z. The true value of � is identi�ed by E [gS(Z; �o)] = 0: We assume

that k0 is a �xed positive integer with k0 � d�. Typically, these are the moment conditions one

would use without further exploring the validity and relevance of other candidate moments. They

are a �conservative�set of moment conditions to ensure identi�cation.1 Given the identi�cation of

�o, this paper proposes a moment selection procedure that explores all other candidate moments

and yields the largest set of valid and relevant moment conditions.

Let gD(Z; �) denote all of the moments not used for identi�cation, where �D� indicates the

�doubt�on the validity and/or relevance of these moments. Without loss of generality, write

g(Z; �) =

24 gS(Z; �)

gD(Z; �)

35 : (2.1)

We also use S and D to denote the sets of indices of all moments in gS(Z; �) and gD(Z; �) respec-

tively. Let g`(Z; �) denote an element of g(Z; �) indexed by `. Given the order of the moment

conditions in (2.1), we know that S = f1; :::; k0g and D = fk0 + 1; :::; kng: A moment is valid if

E [g`(Z; �o)] = 0 for ` 2 D. Given its validity, a moment is considered to be relevant if adding ityields a more e¢ cient estimator than the one based on E [gS(Z; �o)] = 0.

1Because �o is unknown, the conservative set of moment conditions is needed not only for the identi�cation andconsistent estimation of �o but also for de�ning the valid and invalid moment conditions. For example, we may haveE [g`(Z; �1)] = 0 and E [g`(Z; �o)] 6= 0 for some moment function g`(Z; �). In this case, E [g`(Z; �)] = 0 is a validmoment for �1 but invalid for �o. The conservative set of moment conditions uniquely identi�es �o and hence de�nevalid and invalid moment conditions in g(Z; �). The consistent estimation of �o makes it possible to select the validmoments when the moment function is evaluated at �o.

5

By the criteria of validity and relevance, the set D is divided into three mutually disjoint sets

D = A [B1 [B0; (2.2)

where A indexes the set of valid and relevant moments, B1 indexes the set of invalid moments and

B0 indexes the set of valid but redundant moments. Moments in sets A and B0 are both valid,

but only those in A are relevant. Our objective is to consistently estimate the set A, leaving out

all moments indexed by the set B � B1 [B0. We use dA, dB; dB1 ; and dB0 to denote the numberof moment conditions indexed by the sets A, B, B1 and B0; respectively. Our general theory on

consistent moment selection allows dA and dB1 to increase with the sample size n and dB0 to be

bounded from above by some large but �xed integer. The theory only restricts the relative rates

of kn and n and this condition is discussed as the theory progresses.

2.2 Heuristic arguments for shrinkage-based moment selection

For the purpose of moment selection, a slackness parameter � and its true value �o are introduced:

� � E [gD(Z; �)] and �o � E [gD(Z; �o)] . (2.3)

By the de�nition of �, all candidate moments, regardless of their validity, can be transformed to

moment equalities and stacked into

E

24 gS(Z; �o)

gD(Z; �o)� �o

35 = 0: (2.4)

This set of moment conditions identi�es both �o and �o and enables their joint estimation. Our

moment selection strategy is based on the estimation of �o. For any ` = 1; :::; kn, we assume that���o;`�� � C. Below we �rst list all desired properties of the estimator for consistent moment selection,then propose an estimator of �o that satis�es all of these properties.

Let b�n denote an estimator of �o with sample size n. Let b�n;` and �o;` denote the estimatorand true value of the slackness parameter associated with moment ` 2 D. We estimate the desiredset A based on the zero elements of b�n, i.e.,

bAn � f` : b�n;` = 0g: (2.5)

6

For consistent selection of all valid and relevant moments in D, the estimator b�n has to satisfyPr(b�n;` = 0;8` 2 A)! 1 and Pr(b�n;` = 0;8` 2 B)! 0

as the sample size n!1.

Table 2.1 Moment Selection Based on Shrinkage Estimation

Category True Value Estimator Desired PropertyA �valid and relevant �o;` = 0 Pr(b�n;` = 0)! 1 super e¢ ciencyB1 �invalid �o;` 6= 0 Pr(b�n;` = 0)! 0 consistencyB0 �valid but redundant �o;` = 0 Pr(b�n;` = 0)! 0 no shrinkage e¤ect

Table 2.1 summarizes the properties of the slackness parameters and their estimators for all

three categories. First, for the valid and relevant moment (in A), �o;` is 0 and we need its estimator

to be 0 w.p.a.1. This super e¢ ciency type of property can be achieved by shrinking the estimator

of �o;` to be 0 for ` 2 A. Second, for the invalid moment (in B1), the estimator of �o;` di¤ersfrom 0 w.p.a.1 provided that it is consistent, because �o;` is di¤erent from 0 in this case. Heavy

shrinkage of �o;` toward 0 for ` 2 B1 causes estimation bias not only to �o but also to �o. Toensure consistent estimation of �o and �o, the shrinkage e¤ect on the estimator of �o;` has to be

controlled for ` 2 B1. Third, for the redundant moment (in B0), �o;` is 0 because the momentis valid. However, its estimator is required to be di¤erent from 0 in order to leave out redundant

moments. This is completely opposite to the requirement for set A, although �o;` = 0 in both cases.

For ` 2 B0, the shrinkage e¤ect has to be controlled to prevent the estimator of �o;` from having

point mass at 0.

To sum up, consistent moment selection requires a sparse estimation2 of the slackness parame-

ters, however, the shrinkage e¤ect has to be reduced when the moment is either invalid or redundant.

Such requirements motivate the information-based adaptive shrinkage estimation proposed in this

paper. We propose a P-GMM estimation that incorporates the measure of validity and relevance

for each moment. The resulting P-GMM estimator is shown to satisfy all the requirements above

and yields consistent moment selection.

2The sparse estimation means that the resulting estimator may have sparse solutions. That is, when the trueparameter �o has zero elements, the estimator of �o may contain components which are identically zero in �nitesamples.

7

2.3 Information measure and penalized GMM estimation

For the ease of exposition, we de�ne �0 � (�0; �0) and introduce the following notations

m(Z; �) �

24 gS(Z; �)

gD(Z; �)

35 g(Z;�) �

24 gS(Z; �)

gD(Z; �)� �

35m(�) � E [m(Z; �)] g(�) � E [g(Z;�)]mn(�) � 1

n

Pni=1m(Zi; �) gn(�) � 1

n

Pni=1 g(Zi; �)

mS(�) � E [gS(Z; �)] �S(�) � @mS(�)/ @�0 2 Rk0�d�

mD(�) � E [gD(Z; �)] �D(�) � @mD(�)/ @�0 2 R(kn�k0)�d�

mS;n(�) � 1n

Pni=1 gS(Zi; �) mD;n(�) � 1

n

Pni=1 gD(Zi; �)

�(�) �

24 �S(�) 0k0�(kn�k0)

�D(�) �Ikn�k0

35 mS+A(�) � E

24 gS(Z; �)

gA(Z; �)

35 :

(2.6)

By de�nition, the parameter space of � is �n � ��B1�� � ��Bkn�k0 , where Bj � f�j : �j = mj+k0(�)

and � 2 �g for j = 1; : : : ; kn�k0. The e¢ cient estimation and moment selection are simultaneouslyachieved in the P-GMM estimation

b�n = argmin�2�n

"gn(�)

0Wngn(�) + �nX`2D

!n;` j�`j#; (2.7)

where Wn is a kn � kn symmetric weighting matrix, �n 2 R+ is a tuning parameter that controlsthe general penalty level, and !n;` is an information-based adaptive adjustment for each moment

` 2 D. This is a LASSO type estimator that penalizes each individual slackness parameter �` usingits `1-norm. The `1-penalty is particularly attractive in our framework because both the GMM

criterion and the `1-penalty function are convex in �, which makes the computation of the P-GMM

estimator easy in practice.

The novelty of the P-GMM estimation in (2.7) lies in the individual adaptive adjustment !n;`

which incorporates information on both validity and relevance. This individual adjustment is crucial

because consistent moment selection requires di¤erent degrees of penalty for moment conditions in

di¤erent categories, as listed in Table 2.1. To this end, de�ne

!n;` = _�r1n;` j _�n;`j�r2 ; (2.8)

where _�n;` � 0 is an empirical measure of the information in moment `, _�n;` is a preliminary

consistent estimator of �o;`, and r1, r2 (with r1 � r2) are user-selected positive constants. Be-

fore discussing the construction of _�n;` and _�n;`, we �rst list the implications of this individual

8

adjustment on consistent selection of valid and relevant moments.

First, when data suggest the moment ` is relevant, the empirical information measure will be

large, which leads to a heavy shrinkage of �o;` toward 0. In contrast, redundant moments (B0)

are subject to small shrinkage because _�n;` is asymptotically 0 for ` 2 B0. This information-basedadjustment _�n;` di¤erentiates the relevant moments from redundant ones.

Second, when data suggest the moment ` is likely to be valid, the magnitude of the preliminary

estimator j _�n;`j will be small as _�n;` is consistent, which leads to a large penalty !n;` and hence, aheavy shrinkage of �o;` toward 0. In contrast, invalid moments (B1) are subject to small shrinkage

toward 0, avoiding estimation bias. This validity-based adjustment j _�n;`j di¤erentiates the validmoments from invalid ones. The application of j _�n;`j for adaptive shrinkage resembles the adaptiveLASSO penalty proposed in Zou (2006).

Combining _�n;` and _�n;`, !n;` provides a data-driven adjustment that separates the valid and

relevant moments (A) from the rest. The constants r1 and r2 (with r1 � r2) are introduced to ensurethat !n;` is small when the moment ` is redundant. Roughly speaking, the individual adjustment

!n;` is large only when the corresponding moment condition is valid and relevant. In consequence,

�o;` is estimated as 0 w.p.a.1 only for ` 2 A, yielding a consistent moment selection procedure.Next, we discuss the construction of the empirical information measure _�n;`. For this purpose,

we �rst de�ne its population counterpart �`, which is associated with the degree of e¢ ciency im-

provement by adding the moment condition indexed by `. When the moment conditionsmS(�o) = 0

are used for GMM estimation of �o, the asymptotic variance of the optimal GMM estimator is

V �1S � �S(�o)0�1S (�o)�S(�o), where

S(�o) � limn!1

Var

"n�

12

nXi=1

gS(Zi; �o)

#: (2.9)

Let �`(�) =@E[g`(Z;�)]

@�0for any ` and any �. When another moment ` 2 D is added, we can de�ne

a new variance VS+` analogously to VS but with E [gS(Z; �)] replaced by E [gS+`(Z; �)], where

gS+`(Z; �) is a vector that stacks gS(Z; �) and g`(Z; �) together, i.e.,

V �1S+` �

24 �S(�o)�`(�o)

350�1S+`(�o)24 �S(�o)�`(�o)

35 , whereS+`(�o) � lim

n!1Var

"n�

12

nXi=1

gS+`(Zi; �o)

#: (2.10)

It is well-known that adding a valid (but possibly irrelevant) moment condition will not decrease

9

the e¢ ciency of the GMM estimator. We next show that even if a moment condition is invalid,

including this moment condition in calculating the asymptotic variance of the GMM estimator does

not decrease the "e¢ ciency" either.3

Lemma 2.1 Suppose that S(�o) is positive de�nite and S+`(�o) is an invertible matrix for any

` 2 D. Then VS � VS+` for all ` 2 D.

Because the matrix VS � VS+` is positive semi-de�nite, its eigenvalues are always non-negative.Relevance requires that at least one of its eigenvalues is strictly larger than zero. Thus, we de�ne

�` � �max(VS � VS+`) (2.11)

as the measure of information in the moment condition indexed by ` 2 D. When �` > 0, the

moment ` is considered to be relevant. A suitable consistent estimator _�n;` is

_�n;` = �max( _Vn;S � _Vn;S+`); (2.12)

where _Vn;S and _Vn;S+` are consistent estimators of VS and VS+`, respectively.

To obtain _�n;` and _�n;` (for ` 2 D), we construct an initial GMM estimator _�0n = (_�0n;_�0n),

de�ned as

_�n = argmin�2�n

hg0n(�)fWngn(�)

i; (2.13)

where fWn denotes a preliminary weighting matrix (e.g., kn � kn identity matrix) which satis�esAssumption 1(iii) in the next section. The preliminary estimator _�n;` can be constructed using the

formula (2.12) and the preliminary estimator _�n (see Appendix C for more details). It is clear that

this initial estimator _�n can be viewed as a special P-GMM estimator by setting �n = 0 in (2.7) for

all n. Hence, as long as the tuning parameter �n = 0 satis�es the su¢ cient conditions provided in

the next section, the properties of the P-GMM estimator, e.g., consistency and rate of convergence,

also hold for the initial GMM estimator _�n.

In the online Supplemental Appendix, we provide primitive su¢ cient conditions under which the

convergence rates of _�n;` and _�n;` are derived. These stochastic properties are listed in Assumption

5.1 below as high-level assumptions. Under these high-level assumptions, the stochastic properties

of !n;` are studied in Lemma 5.1. For the theoretical analysis in Section 3 below, we derive general

bounds on the tuning parameter �n as an implicit function of !n;`. Given the stochastic order

3Note that we do not suggest estimating �o using possibly invalid moments here. In practice, the variance matrixV �1S+` is calculated as if the moment ` was valid, but the estimator of �o used for this calculation is based on theconservative moment conditions S.

10

of !n;`, these bounds for �n only depend on the sample size, the number of moments, and some

constants. For readers who would like to see the practical choice of �n directly, the rate of �n is

provided in (5.5) and a practical choice is suggested in (5.8).

3 Asymptotic Theory

3.1 Consistency and rate of convergence

We �rst state and discuss the assumptions for the consistency of the P-GMM estimator b�n.Assumption 3.1 (i) mS(�) is continuous in � and for any " > 0, there exists some �" > 0 such

that

inff�2�: k���ok�"g

kmS(�)k > �";

(ii) sup�2� kmn(�)�m(�)k = op(1);(iii) Wn is a real matrix with C�1 � �min(Wn) � �max(Wn) � C w.p.a.1;

(iv) the tuning parameter �n satis�es �nP`2B1 !n;` = op(1).

Assumption 3.1(i) is a standard identi�able uniqueness condition for �o. Assumption 3.1(ii)

is essentially a uniform law of large numbers (ULLN) and it requires uniform convergence of the

sample moments to the population moments. Assumptions 3.1(iii) imposes regularity conditions

on the weighting matrix. Assumption 3.1(iv) imposes an upper bound on �n, which ensures that

the penalty is small enough such that it does not cause inconsistency of the estimator b�n. Byconstruction, the P-GMM criterion has two parts, where the former is a quadratic form minimized

by the true value of the parameter asymptotically and the latter is minimized by � = 0. When the

penalty is too large, it shifts the estimator of �o;` towards 0 for all ` and causes estimation bias

for �o;` 6= 0 and hence for �o. For this reason, the upper bound required by �nP`2B1 !n;` = op(1)

only involves the invalid moments in B1.

Lemma 3.1 Under Assumption 3.1, we have b�n !p �o.

Next, we derive the rate of convergence of the P-GMM estimator b�n, whose dimension increaseswith the sample size. Let !n;B1 denote a vector that collects !n;` for all ` 2 B1 and

bn � �n k!n;B1k : (3.1)

11

Assumption 3.2 (i) There exist a sequence of constants �n ! 0 with ��1n = O(n12 ) and a �xed

constant �1 > 0 such that

supjj���ojj��1

kmn(�)�m(�)k = Op(�n);

(ii) m(�) is continuously di¤erentiable for any � in the local neighborhood of �o;

(iii) C�1 � �min [�S(�o)0�S(�o)] and �max [�(�o)0�(�o)] � C;(iv) max`�kn supjj���ojj��2 k�`(�)� �`(�o)k � C�2 for some �2 > 0;(v)

pknbn = op(1) and

pkn�n = o(1).

Assumption 3.2(i) is a high level condition on the convergence rate of the empirical process

indexed by moment functions. When the number of moment conditions is �xed, Assumption 3.2(i)

holds with �n = n�12 , following standard empirical process results; see e.g., Andrews (1994). Here,

the sequence of constants �n is introduced to allow for an increasing number of moments. Lemma

D.1 in the Appendix D provides su¢ cient conditions under which Assumption 3.2(i) holds with

�n =pkn=n. Assumptions 3.2(ii), 3.2(iii) and 3.2(iv) impose standard regularity conditions on

the �rst order derivative of the population moments. Assumption 3.2(v) imposes restrictions on

the dimension of the moment functions kn and the tuning parameter �n. Assumption 3.2(v) is a

su¢ cient condition for Assumption 3.1(iv) because

�nX`2B1

!n;`���o;`�� � Cpkn�n k!n;B1k = Cpknbn; (3.2)

by the Cauchy-Schwarz inequality and j�o;`j � C for any ` 2 D.For some models, it is easier to verify Assumption 3.3 below, which is a high level assumption

that can replace Assumptions 3.1 and 3.2, in conjunction with Assumption 3.1(iii).

Assumption 3.3 (i) There exists a sequence of constants �n ! 0 with ��1n = O(n12 ) such that

kmn(�o)�m(�o)k = Op(�n);

(ii) for any � 2 �n, C�1 k�� �ok � kgn(�)� gn(�o)k � C k�� �ok w.p.a.1.

If the data are i.i.d. and the second moment of g`(Z; �o) is bounded from above by some �nite

constant uniformly over `, Assumption 3.3(i) is satis�ed with �n =pkn=n. When the number kn of

moment conditions is �xed, Assumption 3.3(i) holds by the central limit theorem with �n = n�12 .

Assumption 3.3(ii) essentially requires that the GMM criterion has a quadratic approximation.

Assumption 3.3 is easy to verify in the linear IV model, as illustrated in Section 4.

12

Lemma 3.2 (a) Suppose Assumptions 3.1 and 3.2 hold. Then,

kb�n � �ok = Op(�n + bn);(b) Part (a) holds with Assumptions 3.1 and 3.2 replaced by Assumptions 3.1(iii) and 3.3.

Remark 3.1 If �n = 0 for all n, thenpknbn = 0 for all n. Hence, if Assumptions 3.1(i)-(iii), 3.2.(i)-

(iv) and kn�2n = o(1) hold, Lemma 3.2(a) immediately implies that the initial GMM estimator _�n

de�ned in (2.13) satis�es that

k _�n � �ok = Op(�n): (3.3)

If alternative Assumptions 3.1(iii) and 3.3 hold, Lemma 3.2(b) implies the same result. The con-

vergence rate of the initial GMM estimator _�n is useful to construct the adaptive penalty and the

tuning parameter, as illustrated in Section 5.

Applying Lemma 3.2, we can show that the invalid moment conditions are not selected w.p.a.1

when the slackness parameters �o;` for any ` 2 B1 are bounded away from 0 or converge to zero

at a rate slower than �n. We consider the possibility that �o;` converges to 0 as the sample size

increases because the number of moments in B1 can diverge. To see this, �rst de�ne

dn � min`2B1

���o;`�� : (3.4)

We have dn > 0 by de�nition. If dn � C > 0, i.e., slackness parameters for invalid moments do notconverge to 0, then using Lemma 3.2, we deduce that

Pr

�min`2B1

jb�n;`j > 0� � Pr

�min`2B1

hj�o;`j � jb�n;` � �o;`ji > 0�

� Pr

�dn �max

`2B1jb�n;` � �o;`j > 0�

� Pr (C � jjb�n � �ojj > 0)! 1, as n!1 (3.5)

which immediately implies that our method does not select the invalid moment conditions w.p.a.1.

From the last inequality in (3.5), we see that the lower bound restriction min`2B1���o;`�� � C can

be relaxed by applying the convergence rate of b�n. Speci�cally, if jjb�n � �ojj = Op(�n) and the

slackness parameters �o;` for any ` 2 B1 satisfy �n = o(dn), using the same arguments in (3.5), wehave

Pr

�min`2B1

jb�n;`j > 0� � Pr�dn�n > jjb�n � �ojj�n

�! 1; as n!1: (3.6)

Results in (3.5) and (3.6) immediately yield the following corollary.

13

Corollary 3.1 (Invalid Moments) (a) Suppose Assumptions 3.1 and 3.2 hold. If we further have

dn � C for all n, then

Pr�[`2B1

nb�n;` = 0o�! 0 as n!1;

(b) Part (a) holds under Assumptions 3.1, 3.2, bn = Op(�n) and �n = o(dn);

(c) Parts (a) and (b) hold with Assumptions 3.1 and 3.2 replaced by Assumptions 3.1(iii) and 3.3.

Remark 3.2 Corollary 3.1 implies that the probability that the P-GMM estimation selects any

invalid moment condition goes to zero. Part (a) is implied by the consistency of the P-GMM

estimator when the magnitudes of the slackness parameters �o;` for any ` 2 B1 are uniformly

bounded from below. Part (b) indicates that the invalid moment conditions will not be selected

w.p.a.1 even when the magnitudes of the slackness parameters �o;` for any ` 2 B1 converge to zeroat certain rate.

3.2 Super e¢ ciency

In this section, we show that the valid and relevant moment conditions are selected w.p.a.1. The

following restrictions on �n are needed.

Assumption 3.4 �n satis�es that (i) bn = Op(�n); (ii) ��1n �nmax`2A !�1n;` = op(1).

Assumption 3.4(i) ensures kb�n � �ok = Op(�n). Assumption 3.4(ii) imposes a lower bound on�n. Assumption 3.4(ii) only involves the valid and relevant moment conditions because only �` for

` 2 A is desired to be penalized heavily. This is a key condition to achieve the super e¢ ciency onmoment selection.

Theorem 3.2 (a) Suppose Assumptions 3.1, 3.2 and 3.4 hold. Then,

Pr�\`2A

nb�n;` = 0o�! 1 as n!1;

(b) Part (a) holds with Assumptions 3.1, 3.2 replaced by Assumptions 3.1(iii) and 3.3.

Theorem 3.2 shows that all valid and relevant moments are simultaneously selected w.p.a.1,

allowing for an increasing number of moments in A as n!1. Corollary 3.1 and Theorem 3.2 are

necessary but not su¢ cient to show that the set A is consistently estimated. For this purpose, it

remains to show that the redundant moments in B0 are not selected w.p.a.1.

14

3.3 Asymptotic normality

In this subsection, we establish the asymptotic distribution of the P-GMM estimator. Without loss

of generality for the asymptotic results below, write �0 = (�0A; �0B), where �A and �B denote the

sub-vector of � that collects �` for ` 2 A and ` 2 B, respectively. Let b�A;n and b�B;n denote theP-GMM estimators of �A and �B, respectively. Theorem 3.2 shows b�A;n = 0 w.p.a.1. It remainsto develop the asymptotic distribution of b�B;n, together with the distribution of b�n. To this end,de�ne �0B � (�0; �0B) which is a d� + dB dimensional vector. Now we stack all moment conditionsand de�ne

g(Z;�B) �

2664gS(Z; �)

gA(Z; �)

gB(Z; �)� �B

3775 ; (3.7)

where gA(Z; �) denotes the valid and relevant moments and gB(Z; �) denotes the invalid or redun-

dant moments. Because g(Z;�B) is linear in �B, the partial derivative of E [g(Z;�B)] with respect

to �B only depends on �. De�ne

��(�)0 �

24 �@E[gS(Z;�)]@�0

�0 �@E[gA(Z;�)]

@�0

�0 �@E[gB(Z;�)]

@�0

�00dB�k0 0dB�dA �IdB

35 and �� � ��(�o): (3.8)

Note that the link between g(Z;�B) and g(Z;�) is that if we treat �A in � to be zero, then we

have g(Z;�) = g(Z;�B). Because the true value of �A is 0, g(Z;�o) = g(Z;�B;o) by de�nition.

Hence, the sample average of g(Z;�B;o) can be written as gn(�o).

Assumption 3.5 Let �n(�) � mn(�) �m(�). There exists a sequence of constants &n ! 0 such

that

sup�1;�22f�2�:jj���ojj��ng

kvn(�1)� vn(�2)kn�

12 + k�1 � �2k

= Op(&n) (3.9)

for some sequence �n which converges to 0 slower than �n.

Assumption 3.5 is a stochastic equicontinuity condition that accommodates non-smooth moment

conditions. Similar stochastic equicontinuity conditions are employed in Pakes and Pollard (1989),

Andrews (2002), and Chen, Linton, van Keilegom (2003), among others. Empirical process results

in Pollard (1984), Andrews (1994), and van der Vaart and Wellner (1996) can be used for the

veri�cation. When the number of moments is �xed, to ensure the root-n consistency of the GMM

estimator, it is su¢ cient to show Assumption 3.5 holds with op(1) on the right hand side.

A speci�c convergence rate &n associated with the empirical process �n(�) has to be derived

in (3.9) to accommodate an increasing number of moments. Lemma D.2 in Appendix D provides

15

primitive su¢ cient conditions under which Assumption 3.5 holds with &n =pkn=n.

De�ne the variance of the sample moments as

n � nVar [gn(�o)] : (3.10)

For i.i.d. observations, this variance matrix is simpli�ed to E[g(Z;�o)g(Z;�o)0] for all n.

Assumption 3.6 (i) For any n 2 Rkn and k nk = 1;

pn 0n

� 12

n gn(�o)!d N(0; 1);

(ii) C�1 � �min(n) � �max(n) � C for all n;

(iii) C�1 � �min(�0���) for all n.

Assumption 3.6(i) assumes a triangular array central limit theorem (CLT) for scalar random

sequences. Assumption 3.6(ii) requires that the variance matrix n is positive de�nite and bounded

for all n. Assumption 3.6(iii) imposes similar regularity condition on �0���.

Assumption 3.7 The tuning parameter �n satis�es that �n k!n;Bk = op(n�12 ).

Assumption 3.7 imposes an upper bound on �n, which ensures that a weighted linear com-

bination of the P-GMM estimator has a mean-zero asymptotic normal distribution. Because

n�12 = O(�n); we see that Assumption 3.7 implies Assumption 3.4(i). Moreover, when kn = o(n),

Assumption 3.7 impliespknbn = op(1) in Assumption 3.2(v).

Assumptions 3.5, 3.6, and 3.7 can be combined with Assumptions 3.1 and 3.2 to show the

asymptotic normality of the P-GMM estimator. Alternatively, they can be used together with

Assumptions 3.1(iii), 3.3, and 3.8 below for the same result.

Assumption 3.8 E [g(Z;�1;B)� g(Z;�2;B)] = ��(�1;B � �2;B) for any �1;B and �2;B.

When the moment functions are linear in �, Assumption 3.8 holds automatically.

De�ne

�n ���0�Wn��

��1(�0�WnnWn��)

��0�Wn��

��1: (3.11)

Theorem 3.3 (a) Suppose Assumptions 3.1-3.2, 3.4-3.7 hold. If we further havepkn�

2n = o(n

� 12 )

and &n�n = o(n�12 ), then

pn 0n�

� 12

n (b�B;n � �B;o)!d N(0; 1)

16

for any n 2 Rd�+dB with k nk = 1;(b) Part (a) holds under Assumptions 3.1(iii), 3.3-3.8, and &n�n = o(n�

12 ).

Remark 3.3 The asymptotic distribution of the P-GMM estimator is derived by a perturbation

technique on a local parameter space (see, e.g., Shen, 1997), which allows for non-smooth moment

functions and an increasing number of parameters.

Remark 3.4 Theorem 3.3(a) requirespkn�

2n = o(n�

12 ); which restricts the rate at which kn

diverges to in�nity. When �n = &n =pkn=n, which holds under the su¢ cient conditions in

Lemmas D.1 and D.2 in the Appendix, this condition holds provided kn = o(n13 ), i.e., the number

of moment conditions increases at a rate slower than n13 . On the other hand, Theorem 3.3(b)

only needs &n�n = o(n�12 ), which is satis�ed if kn = o(n

12 ), i.e., the number of moment conditions

increases at a rate slower than n12 .

Remark 3.5 Theorem 3.3, in conjunction with the Cramér-Wold device, yields the asymptotic

distribution of b�n. To see this, let 0n;d� = ( 0d� ;00dB) where d� 2 R

d� . We have 0n;d��B = 0d��

because �0B = (�0; �0B). Let

�n;d�

= �1=2n n;d� jj�

1=2n n;d� jj

�1, which satis�es jj �n;d� jj = 1. ApplyingTheorem 3.3, we get

( �n;d�)0pn��1=2n (b�B;n � �B;o)!d N(0; 1)

which implies that

jj�1=2n n;d� jj�1pn 0d�(b�n � �o)!d N(0; 1): (3.12)

This asymptotic distribution can be applied to conduct inference for the parameter of interest �o.

Let

S+A;n � Var

24n� 12

nXi=1

0@ gS(Zi; �o)

gA(Zi; �o)

1A35 ; (3.13)

be a variance matrix that involves all valid and relevant moments. Suppose that Wn satis�es

W�1n � n

= op(1); (3.14)

then we can use Assumptions 3.1(iii) and 3.6(iii) to show that4

��� 0n;d� h�n � ��0��1n ����1i n;d� ��� = op(1); (3.15)

4The formal proof is in the Supplemental Appendix of the paper.

17

where

0n;d���0�

�1n ��

��1 n;d� =

0d�

��@mS+A(�o)

@�0

�0�1S+A;n

�@mS+A(�o)

@�0

���1 d� : (3.16)

The matrix in the middle of the right hand side of (3.16) is the variance of the infeasible �oracle�

estimator one would get with the complete knowledge of which moments are valid and relevant.

Hence, the P-GMM estimator of �o is as e¢ cient as the oracle estimator asymptotically.

Remark 3.6 When �n = 0, by the same arguments used to show Theorem 3.3, we obtain

pn 0�;n�

� 12

�;n( _�n � �o)!d N(0; 1) (3.17)

for any �;n 2 Rd�+kn with �;n = 1, where

��;n ���(�o)

0Wn�(�o)��1 �

�(�o)0WnnWn�(�o)

� ��(�o)

0Wn�(�o)��1 . (3.18)

This together with Assumptions 3.1(iii), 3.2(iii) and C�1 � �min [�(�o)0�(�o)] immediately impliesthe root-n normality of the preliminary estimator _�n.

Let n;` 2 Rd�+dB be a selection vector such that 0n;`�B = �n;` for any ` 2 B0. Using argumentssimilar to those employed to derive (3.12), we can show

jj�1=2n n;`jj�1pnb�n;` !d N(0; 1) for any ` 2 B0: (3.19)

Because b�n;` has an asymptotic normal distribution for any individual ` 2 B0, the probability thatb�n;` = 0 approaches 0 for any individual ` 2 B0. This result is particularly important for leavingout valid but redundant moments, which is not covered by Corollary 3.1. Corollary 3.4 states that

all redundant moments are left out w.p.a.1 by the moment selection procedure.

Corollary 3.4 (Redundant Moments) (a) Under the conditions of Theorem 3.3(a),

Pr�[`2B0

nb�n;` = 0o�! 0 as n!1;

(b) Part (a) holds under the conditions of Theorem 3.3(b).

Remark 3.7 Combining Corollary 3.1, Theorem 3.2, and Corollary 3.4, we conclude that

Pr( bAn = A)! 1 (3.20)

18

as n!1; where the estimator bAn is de�ned in (2.5). The P-GMM estimation achieves consistent

moment selection under assumptions and conditions speci�ed above.

Remark 3.8 The assumption that B0 is a �nite set is important for our argument to show Corollary

3.4. By the Bonferroni inequality, we have

Pr�[`2B0

nb�n;` = 0o� � X`2B0

Pr�b�n;` = 0� : (3.21)

As there are only �nite many elements in B0, to prove the result in Corollary 3.4, it is su¢ cient to

have for any ` 2 B0,Pr�b�n;` = 0�! 0 as n!1, (3.22)

which follows from (3.19) because a random variable with an asymptotic normal distribution put

zero mass at any given point w.p.a.1.5

Remark 3.9 Although the procedure can leave out moment conditions that are redundant to S,

one potential limitation is that some moments in A might be redundant to the rest of A combined

with S. That is, there could be a subset A0 � A which contains the valid and relevant momentsto S, while the rest of the moments in A, de�ned as Ac0, are redundant to A0 [ S.6 To deal withthis potential redundancy problem, we can use a two step procedure that selects A in the �rst step

and selects A0 out of A in the second step. The second step is similar to the �rst step but with a

sequential information measure that takes into account the potential redundancy to A0 [ S. Thissequential information measure is de�ned as

��` � �max(VS+A` � VS+A`�`) (3.23)

for ` = 1; :::; dA, where

A` � fj 2 A : j < ` and ��j > 0g [ fj 2 A : j � `g; (3.24)

VS+A` is de�ned similarly to VS+` in (2.10) but with g`(Z; �) replaced by gA`(Z; �), and VS+A`�`

is de�ned similarly to VS+A` but with g`(Z; �) taken out of gA`(Z; �). It is key that the set A`5The proposed method can be extended to consistent moment selection with an increasing number of moments in

B0 if one can derive the rate at which Pr(b�n;` = 0)! 0 uniformly over ` 2 B0:6 It is clear that in the general scenario, the non-redundant set A0 in A may not be unique. However, any non-

redundant set in A can be uniquely determined by an order among the moments in A. That is, if we order themoments in A and delete the redundant moments by investigating the moments from the �rst to the last, we geta non-redundant set A0 determined by the order. We implicitly impose an order on the moments in A and ourcalculation of the empirical information measure below follows this order.

19

excludes those moments that are redundant to their predecessors. By de�nition, we set ��` = 0 for

any ` 2 Ac0. Using the empirical analog of ��` , the P-GMM estimation can be used to consistently

select A0 provided that Ac0 is a �nite set.7

4 Example: A Linear IV Model

In this section, we study a linear IV model to illustrate the general assumptions of the previous

section. Consider the model

Yi = Xi�o + ui;

Xi =

k0Xj=1

�jZ1;i(j) +

1Xj=k0+1

�jZ1;i(j) + vi; (4.1)

where Yi, Xi are scalar endogenous variables and Z1;i(j) for j 2 Z+ � f1; 2; :::g are the excludedexogenous variables. For any vector Z, we use Z(j) to denote the j-th component of Z. We assume

that

E[uiZ1;i(j)] = 0 and E[viZ1;i(j)] = 0 for all j: (4.2)

For example, equation (4.1) can be obtained from the following conditional mean model

Xi = h(!i) + vi with E [ (ui; vi)j!i] = (0; 0): (4.3)

In this example, Z1;i(j) are the basis functions in the series expansion h(!i) =P1j=1 �jpj(!i), i.e.,

Z1;i(j) = pj(!i) for j 2 Z+. Moment conditions in (4.2) are implied by the conditional momentrestrictions in model (4.3).

We assume that an econometrician has the �rst k0 IVs Z�i � (Z1;i(1); :::; Z1;i(k0))0 to con-

struct moment conditions for identi�cation of �o. The valid and relevant IVs Z1;i � (Z1;i(k0 +

1); :::; Z1;i(k0 + dA))0 are mixed with invalid IVs Z2;i � (Z2;i(1); :::; Z2;i(dB1))

0 and irrelevant IVs

Z3;i � (Z3;i(1); :::; Z3;i(dB0))0. For this linear IV model, we have moment functions

gS(Zi; �) = (Yi �Xi�o)Z�i (4.4)

7This is a heuristic argument based on proofs in this paper. A formal proof for this modi�ed sequential procedureis beyond the scope of this paper and left for future research. Also note that the two step procedure is needed toensure that moment conditions in A0 which may be redundant to the moments in B1 will not be ruled out.

20

for consistent estimation of �o and the following moment conditions for selection

gA(Zi; �) = (Yi �Xi�o)Z1;i;

gB1(Zi; �) = (Yi �Xi�o)Z2;i;

gB0(Zi; �) = (Yi �Xi�o)Z3;i: (4.5)

De�ne Z 0i � (Z 01;i; Z 02;i; Z 03;i). Let kn = k0 + dA + dB1 + dB0 denote the number of all availableIVs when the sample size is n. In this example, we assume kn = o(n

12 ). We next provide su¢ cient

conditions for Assumptions 3.3, 3.5, 3.6, and 3.8, when the moment conditions are constructed from

this linear IV model.

Condition 4.1 (i) fYi; Xi; Z�i ; Zigi�n is a triangular array of i.i.d. process;(ii) E[u2i

��Z�i ] � C and E[u2i��Zi] � C for all n;

(iii) E[jjXijj4] � C, E[jjZ�i (j)jj4] � C; and E[jjZi(j)jj4] � C for all n;

(iv) E[Z1;i(j)Z1;i(k)] = �j;k where �j;k denotes the Kronecker�s delta;

(v)Pk0j=1 �

2j > 0, E[jjXiZ2;ijj] � C; and E[XiZ3;i] = 0 for all n.

The triangular array assumption is imposed because the number of IVs may increase with the

sample size. Condition 4.1(ii) requires that the conditional second moments of the error term ui

given the IVs are uniformly bounded. Condition 4.1(iii) requires that the fourth moments of the

endogenous variable Xi and the IVs are bounded from above uniformly. Condition 4.1(iv) implies

that the valid and relevant IVs are orthonormalized, which is a normalization condition (see, e.g.,

Newey, 1997). Condition 4.1(v) contains three restrictions. The �rst restrictionPk0j=1 �

2j > 0 en-

sures that the conservative IVs Z�i can be used to identify and consistently estimate �o. The second

restriction implies that the aggregated information contained in the invalid IVs Z2;i is bounded

from above. The last restriction indicates that the irrelevant IVs Z3;i contains no information

about the endogenous variable Xi.

Lemma 4.1 Under Condition 4.1,

(a) kmn(�o)�m(�o)k2 = Op(kn=n);(b) kgn(�)� g(�o)k2 � C[1 +Op(kn=n) +Op(

pkn=n)] k�� �ok2;

(c) kgn(�)� g(�o)k2 � C[1 +Op(kn=n) +Op(pkn=n)] k�� �ok2;

(d) Assumption 3.5 is satis�ed with &n =pkn=n.

Lemma 4.1(a) implies Assumption 3.3(i) holds with �n =pkn=n. Lemmas 4.1 (b) and 4.1 (c)

and kn = o(n12 ) imply Assumption 3.3(ii). Because �n =

pkn=n and kn = o(n

12 ), Lemma 4.1(d)

implies that &n�n = knn�1 = o(n�12 ), which is required by Theorem 3.3(b).

21

By de�nition, we write

n = E

240@ uiZ�i

uiZi � �o

1A0@ uiZ�i

uiZi � �o

1A035 �0@ 0;n 1;n

01;n 2;n

1A (4.6)

where

0;n �

0@ E�u2iZ

�i Z

�0i

�E�u2iZ

�i Z

01;i

�E�u2iZ1;iZ

�0i

�E�u2iZ1;iZ

01;i

�1A ,

01;n ��E�u2iZ�1;iZ

�0i

�E�u2iZ�1;iZ

01;i

� �,

2;n � E�u2iZ�1;iZ

0�1;i

�� �B;o�0B;o; and Z 0�1;i � (Z 02;i; Z 03;i). (4.7)

Let 11n � 0;n � 1;n�12;n01;n and 22n � 2;n � 01;n�10;n1;n.

Condition 4.2 (i) E[u4i��Z�i ] � C and E[u4i

��Zi] � C for all n;

(ii) �max(E[Z�1;iZ 0�1;i]) � C for all n;

(iii) �min(11n ) � C�1 and �min(22n ) � C�1 for all n.

Condition 4.2(i) is stronger than Condition 4.1(i). The �nite conditional fourth moment of the

error term is a regularity condition for showing asymptotic normality of plug-in series estimator

in conditional mean models, see, e.g., Newey (1997). Condition 4.2(ii) imposes a upper bound for

the largest eigenvalue of E[Z�1;iZ 0�1;i] which is also a mild condition. To see the intuition behind

Condition 4.2(iii), we write

0;n � 1;n�12;n01;n = Ehu2i (Z1;i � 1;n�12;nZ�1;i)(Z1;i � 1;n

�12;nZ�1;i)

0i

(4.8)

where uiZ1;i � 1;n�12;nuiZ�1;i is the residual of the projection of uiZ1;i on the space spanned byuiZ�1;i. Hence, �min(

11n ) � C�1 implies that for any random variable 0Z1;iui generated by some

linear combination of Z1;iui, the optimal mean square linear prediction error based on the set of

variables uiZ�1;i is bounded from below by some �xed constant for all n. The similar intuition

applies to the restriction �min(22n ) � C�1. Hence, Condition 4.2(iii) requires that the distance

between the Hilbert spaces generated by uiZ1;i and uiZ�1;i is bounded away from zero for all n.

Lemma 4.2 Under Conditions 4.1 and 4.2, Assumptions 3.6 and 3.8 are satis�ed.

22

5 Selection of the Tuning Parameter

The asymptotic results established in previous sections provide restrictions on the tuning parameter

�n. These restrictions are implicit in the sense that they depend on the individual information-

based adaptive penalties !n;` de�ned in (2.8), whose asymptotic magnitudes rely on the validity

as well as relevance of the moment condition ` by construction. In this section, we analyze these

individual penalties and provide an explicit formula for the tuning parameter �n.

Speci�cally, we consider the case where

�n = &n = k12nn

� 12 . (5.1)

These speci�c rates apply under the su¢ cient conditions in Lemma D.1 and Lemma D.2 in the

Appendix. Our goal is to choose �n that satis�es the upper bound

�n

�n12 k!n;Bk

�= op(1); (5.2)

and the lower bound

��1n

�k12nn

�1=2max`2A

!�1n;`

�= op(1); (5.3)

Under (5.1), conditions in (5.2) and (5.3) imply Assumptions 3.1(iv), 3.2(v), 3.4, and 3.7.

Assumption 5.1 (i) max`2D�� _�n;` � �o;`�� = Op(�n);

(ii) max`2D j _�n;` � �o;`j = Op(�n);(iii) For any ` 2 B0,

�� _�n;` � �o;`�� = Op(n� 12 ) and

pn( _�n;` � �o;`)!d N(0; �

2`) with �

2` > 0;

(iv) max`2A j��1o;` j � C, max`2B1���o;`�� � C; and max`2B1 j��1o;` j � C.

Assumption 5.1(i) imposes a restriction on the convergence rate of the empirical information

measure. Assumption 5.1(ii) is implied by (3.3), because

max`2D

��� _�n;` � �o;`��� � _�n � �o � k _�n � �ok = Op(�n): (5.4)

Assumption 5.1(iii) is standard because the assumptions are on individual moments. The root-n

normality of _�n;` is implied by (3.17) and the Cramer-Wold device. Assumption 5.1(iv) imposes

bounds on���o;`�� for ` 2 A and ���o;`�� for ` 2 B1.

Su¢ cient conditions for Assumptions 5.1(i) and 5.1(iii) are provided in the Supplemental Ap-

pendix.

Remark 5.1 In Appendix C, we show that the tuning parameter speci�ed in (5.5) below also allows

23

min`2A���o;`�� and min`2B1 ���o;`�� to go to zero at certain rate as kn !1, which relaxes Assumption

5.1(iv).

Lemma 5.1 Suppose Assumption 5.1 holds.

(a) The upper bound in (5.2) is satis�ed if �nk12nn

12 = o(1) and �nn

1+r2�r12 = o(1);

(b) The lower bound in (5.3) is satis�ed if ��1n k1+r22

n n�1+r22 = o(1):

Remark 5.2 By choosing r1 � r2, Lemmas 5.1(a) only requires that �nk12nn

12 = o(1). On the other

hand, Lemma 5.1(b) requires that ��1n k1+r22

n n�1+r22 = o(1). To balance these two rates, we set

�n = ckr24n n

� 12� r2

4 (5.5)

where c is some �nite positive constant.

Remark 5.3 Given r1 and r2 and r1 > r2, we propose a plug-in loading constant based on the

argument in Liao (2013): bc`;n = 2jjW 12n (`)b�njj for any ` 2 D (5.6)

whereW12n (`) denotes the `-th row of the matrixWn, and b�n is an estimator of the following matrix

�n = Ikn �W12n ��

��0�Wn��

��1�0�W

12n (5.7)

based on a preliminary P-GMM estimation with �n = 2kr24n n

� 12� r2

4 . To sum up, we propose using

the tuning parameter b�n;` = 2jjW 12n (`)b�njjk r24n n� 1

2� r2

4 (5.8)

for the `-th moment condition in D.

6 Simulation

For �nite-sample investigation, we consider a simple linear regression model

Y1 = Y2�o + u; (6.1)

where Y1; Y2 2 R are endogenous and �o 2 R is the parameter of interest. Valid and relevant

IVs ZS 2 R2 are available for the identi�cation of �o. In addition, a vector of candidate IVs

ZD = (ZA; ZB0;ZB1) 2 RK are considered, where ZA 2 RdA (dA = 2) are valid and relevant,

ZB0 2 RdB0 (dB0 = K=2 � 1) are redundant, and ZB1 2 RdB1 (dB1 = K=2 � 1) are invalid. We

24

consider K = 10, 30, and 50 in the experiments. The relationship between Y2 and (ZS ; ZA) is

Y2 = �0SZS + �

0AZA + v (6.2)

where �S and �A are dS � 1 and dA � 1 real vectors respectively.The simulated samples are generated in the following way. First we generate

$$(Z_S, Z_A, Z_{B_0}, Z_{B_1}^*, u, v) \sim N(0, \Sigma), \quad\text{where } \Sigma = \mathrm{diag}(\Sigma_{AS}, \Sigma_B, \Sigma_{uv}), \tag{6.3}$$

where $\Sigma_{AS}$, $\Sigma_B$ and $\Sigma_{uv}$ are $4\times 4$, $(K-2)\times(K-2)$ and $2\times 2$ variance-covariance matrices, respectively. By construction, $(Z_S, Z_A, Z_{B_0}, Z_{B_1}^*)$ are all valid, but only $Z_S$ and $Z_A$ are relevant based on (6.2). The invalid IVs $Z_{B_1}$ are obtained by contaminating $Z_{B_1}^*$ with the structural error $u$. Specifically,

$$Z_{B_1}(\ell) = Z_{B_1}^*(\ell) + c_\ell \cdot u, \tag{6.4}$$

where $c_\ell$ is a real constant, and $Z_{B_1}(\ell)$ and $Z_{B_1}^*(\ell)$ are the $\ell$-th elements of $Z_{B_1}$ and $Z_{B_1}^*$, respectively. The structure of (6.4) indicates that the degree of endogeneity of an invalid IV varies with the coefficient $c_\ell$, which is given below.

Parameters in the data generating process are as follows: (i) $\theta_o = 0.5$; (ii) $\pi_S = (\pi_o, 0.1)'$, where $\pi_o = 0.1$ or $0.3$ to experiment with different identification strengths; (iii) $\pi_A = (0.5, 0.5)'$; (iv) $\Sigma_{AS}$ is a $4\times 4$ matrix with $(i,j)$-th element $0.2^{|i-j|}$; (v) $\Sigma_B$ is the $(K-2)\times(K-2)$ identity matrix; (vi) $\Sigma_{uv}$ is a $2\times 2$ matrix with diagonal elements $(0.5, 1)$ and off-diagonal elements $(0.6, 0.6)$; (vii) for $c_o = 0.2$ or $0.5$ and $\ell = 1, \ldots, K/2-1$, the coefficients in (6.4) are

$$c_\ell = c_o + \frac{(\ell-1)(\bar{c}-c_o)}{K/2-1}, \tag{6.5}$$

where $\bar{c} = 2.4$ sets a large upper bound. A larger value of $c_o$ is associated with stronger endogeneity of the invalid IVs.
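
As a concrete illustration of the design in (6.1)-(6.5), the following sketch draws one simulated sample. It is a minimal reconstruction of the data generating process described above; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def simulate(n, K=10, theta_o=0.5, pi_o=0.3, c_o=0.5, c_bar=2.4, rng=None):
    """One draw from the design (6.1)-(6.5): Z_S, Z_A valid and relevant,
    Z_B0 redundant, Z_B1 invalid (contaminated by the structural error u)."""
    rng = np.random.default_rng(rng)
    d_B = K // 2 - 1
    # Sigma_AS: 4x4 with (i,j) entry 0.2^{|i-j|}; Sigma_B: identity;
    # Sigma_uv: diagonal (0.5, 1), off-diagonal 0.6
    idx = np.arange(4)
    Sigma_AS = 0.2 ** np.abs(idx[:, None] - idx[None, :])
    Sigma_uv = np.array([[0.5, 0.6], [0.6, 1.0]])
    ZSA = rng.multivariate_normal(np.zeros(4), Sigma_AS, size=n)
    Z_S, Z_A = ZSA[:, :2], ZSA[:, 2:]
    Z_B0 = rng.standard_normal((n, d_B))
    Z_B1_star = rng.standard_normal((n, d_B))
    uv = rng.multivariate_normal(np.zeros(2), Sigma_uv, size=n)
    u, v = uv[:, 0], uv[:, 1]
    # invalid IVs per (6.4)-(6.5): Z_B1(l) = Z_B1*(l) + c_l * u
    c = c_o + np.arange(d_B) * (c_bar - c_o) / d_B
    Z_B1 = Z_B1_star + np.outer(u, c)
    pi_S, pi_A = np.array([pi_o, 0.1]), np.array([0.5, 0.5])
    Y2 = Z_S @ pi_S + Z_A @ pi_A + v   # first stage (6.2)
    Y1 = Y2 * theta_o + u              # structural equation (6.1)
    return Y1, Y2, Z_S, Z_A, Z_B0, Z_B1
```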

For each specification of $(\pi_o, c_o, K)$, we generate i.i.d. observations with sample sizes $n = 250$, $n = 2{,}500$ and $n = 5{,}000$. To construct the information-based penalty in (2.8), the user-selected constants are $r_1 = 3$ and $r_2 = 2$. The preliminary estimator $\dot{\mu}_{n,\ell}$ is constructed from sample analogs of the variance matrices, and the preliminary estimator $\dot{\beta}_{n,\ell}$ follows from (2.13). The weighting matrix $W_n$ is defined as

$$W_n^{-1} = \frac{1}{n}\sum_{i=1}^n g(Z_i, \dot{\beta}_n)\, g(Z_i, \dot{\beta}_n)', \tag{6.6}$$

where $\dot{\beta}_n$ is defined in (2.13) with identity weighting matrix and $g(Z, \beta)$ is constructed using the IVs $(Z_S, Z_A, Z_{B_0}, Z_{B_1})$. The number of simulation repetitions is 5,000. The projected scaled sub-gradient method (active-set variant) proposed in Schmidt (2010) is employed to solve the minimization problem in the GMM shrinkage estimation.
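
For readers who wish to replicate the estimator without Schmidt's (2010) solver, note that in this linear design the shrinkage problem is a quadratic GMM criterion in the stacked parameter plus a weighted $\ell_1$ penalty on the slackness coordinates, so a simple proximal-gradient (ISTA-type) loop with soft-thresholding suffices. The sketch below is our own minimal implementation under that observation, not the paper's code; `g_bar`, `G`, `W` and `lam` are assumed inputs (the stacked sample moments at zero, their constant Jacobian, the weighting matrix, and the per-coordinate penalties, with zeros for unpenalized coordinates such as $\theta$).

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator: the prox of t * |.|."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pgmm(g_bar, G, W, lam, n_iter=5000):
    """Minimize (g_bar + G b)' W (g_bar + G b) + sum_l lam[l] * |b[l]|
    by proximal gradient. In the linear IV model g_n(beta) is affine in
    beta, so the smooth part is an exact quadratic."""
    L = 2.0 * np.linalg.norm(G.T @ W @ G, 2)    # Lipschitz constant of the gradient
    b = np.zeros(G.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * G.T @ W @ (g_bar + G @ b)  # gradient of the quadratic part
        b = soft(b - grad / L, lam / L)         # prox step; lam[l] = 0 leaves b[l] free
    return b
```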

Table 6.1. Performance of Moment Selection by GMM Shrinkage Estimation

                           n = 250                  n = 2,500                n = 5,000
                    P1    P2    P3    P4      P1    P2    P3    P4      P1    P2    P3    P4
$\pi_o = 0.1$, K = 10
  $c_o = 0.2$     .011  .495  .482  .012    .000  .906  .092  .002    .000  .951  .049  .000
  $c_o = 0.5$     .002  .497  .489  .012    .000  .906  .092  .002    .000  .951  .049  .000
$\pi_o = 0.3$, K = 10
  $c_o = 0.2$     .001  .687  .184  .128    .000  .961  .027  .012    .000  .982  .014  .004
  $c_o = 0.5$     .000  .687  .184  .129    .000  .961  .027  .012    .000  .982  .014  .004
$\pi_o = 0.1$, K = 30
  $c_o = 0.2$     .021  .132  .834  .013    .000  .701  .297  .002    .000  .828  .171  .001
  $c_o = 0.5$     .007  .133  .847  .013    .000  .701  .297  .002    .000  .828  .171  .001
$\pi_o = 0.3$, K = 30
  $c_o = 0.2$     .001  .398  .467  .135    .000  .902  .086  .011    .000  .945  .051  .004
  $c_o = 0.5$     .000  .398  .468  .135    .000  .902  .086  .011    .000  .945  .051  .004
$\pi_o = 0.1$, K = 50
  $c_o = 0.2$     .025  .052  .910  .013    .000  .558  .440  .002    .000  .731  .268  .001
  $c_o = 0.5$     .009  .052  .927  .013    .000  .558  .440  .002    .000  .731  .268  .001
$\pi_o = 0.3$, K = 50
  $c_o = 0.2$     .001  .241  .622  .136    .000  .849  .141  .010    .000  .918  .076  .006
  $c_o = 0.5$     .000  .241  .623  .136    .000  .849  .141  .010    .000  .918  .076  .006

Note: For each parameter combination, four numbers (P1-P4) are reported. P1 is the probability of selecting any invalid IVs; P2 is the probability of selecting exactly the valid and relevant IVs; P3 is the probability of selecting all valid and relevant IVs plus some redundant IVs; P4 is the probability of all other events.

Table 6.1 presents the finite-sample performance of moment selection by the GMM shrinkage estimation. We first look at the case with strong identification ($\pi_o = 0.3$), strong endogeneity of the invalid IVs ($c_o = 0.5$), and small sample size ($n = 250$). In this case, the probabilities of any invalid IVs being selected are small for $K = 10$, $30$, and $50$. Hence, the shrinkage procedure succeeds in selecting only the valid IVs. The number of moment conditions affects the probability that the valid and/or relevant moment conditions are selected. When $K = 10$, with probability $0.69$, $Z_A$ is exactly the set of IVs selected, and with probability $0.19$, $Z_A$ plus some elements of $Z_{B_0}$ are selected. This implies that with probability $0.88$, the shrinkage procedure selects all of the valid and relevant IVs. When $K$ increases, the probability of selecting $Z_A$ alone decreases and the probability of selecting $Z_A$ plus some elements of $Z_{B_0}$ increases. For example, when $K = 50$, the probability of selecting $Z_A$ alone drops to $0.24$, while the probability of selecting $Z_A$ plus some elements of $Z_{B_0}$ increases to $0.62$. When the sample size is $n = 2{,}500$, the probabilities of selecting $Z_A$ are $0.96$ with $K = 10$, $0.90$ with $K = 30$ and $0.85$ with $K = 50$, whereas the probabilities of selecting invalid IVs are $0$ and the probabilities of selecting redundant IVs are as low as $0.14$ even with $K = 50$. When the sample size is $n = 5{,}000$, the probabilities of selecting $Z_A$ are larger than $0.90$ and the probabilities of selecting invalid or redundant IVs are close to zero. Reducing the degree of identification and reducing the degree of endogeneity of the invalid IVs both make moment selection more challenging. In the extreme case with relatively weak identification ($\pi_o = 0.1$) and weak endogeneity ($c_o = 0.2$), the procedure is robust at not including any invalid IVs but tends to include some redundant ones. The probability of including redundant IVs is reduced significantly when the sample size increases.

The P-GMM estimator proposed in this paper produces an automatic estimate of $\theta_o$ in the shrinkage estimation. Table 6.2 summarizes the finite-sample properties of this estimator, denoted "automatic" in Table 6.2, and compares it with several alternative estimators.8 Some of the alternative estimators are infeasible, but serve as good benchmarks. To show the efficiency improvement from using more relevant and valid IVs, we compare the "automatic" estimator with a "conservative" estimator, which only uses $Z_S$ without further exploring the information in other candidate IVs. This comparison shows that the "automatic" estimator enjoys smaller standard deviation and root mean square error (RMSE) than the "conservative" estimator in all scenarios considered. To show the finite-sample improvement from excluding redundant IVs, the "automatic" estimator is compared to a "pooled" estimator, which uses all valid IVs $Z_S$, $Z_A$, and $Z_{B_0}$. This comparison indicates that the "automatic" estimator has smaller finite-sample bias. Note that this "pooled" estimator is actually infeasible because it excludes all invalid IVs and includes all valid IVs. Table 6.1 suggests that there is a non-negligible probability that some valid and relevant IVs are not selected when the sample size is moderate, which is why the standard deviation of the "automatic" estimator is slightly larger than that of the "pooled" estimator for $n = 250$. This difference disappears for $n = 2{,}500$. To show the importance of excluding invalid IVs, the "automatic" estimator is compared to an "aggressive" estimator, which uses all candidate IVs regardless of their validity. This comparison suggests that including invalid IVs increases finite-sample bias, as expected.

8 We only present finite-sample properties of the various GMM estimators with K = 50 here. More simulation results are available in the Supplemental Appendix of the paper.

Table 6.2. Finite-Sample Bias (BS), Standard Deviation (SD) and RMSE (RE) with K = 50

                         Automatic                            Conservative
                 n = 250            n = 2,500           n = 250            n = 2,500
($\pi_o$, $c_o$)   BS    SD    RE     BS    SD    RE      BS    SD    RE     BS    SD    RE
(.1, .2)       .0067 .0855 .0858  .0004 .0254 .0254   .0044 .2581 .2581 -.0018 .0770 .0770
(.1, .5)       .0063 .0858 .0860  .0004 .0254 .0254   .0044 .2581 .2581 -.0018 .0770 .0770
(.3, .2)       .0045 .0811 .0812  .0003 .0237 .0237   .0004 .1608 .1608 -.0012 .0490 .0490
(.3, .5)       .0045 .0811 .0812  .0003 .0237 .0237   .0004 .1608 .1608 -.0012 .0490 .0490

                         Pooled                               Aggressive
                 n = 250            n = 2,500           n = 250            n = 2,500
($\pi_o$, $c_o$)   BS    SD    RE     BS    SD    RE      BS    SD    RE     BS    SD    RE
(.1, .2)       .0231 .0766 .0800  .0026 .0253 .0254   .1957 .0770 .2103  .1936 .0253 .1952
(.1, .5)       .0231 .0766 .0800  .0026 .0253 .0254   .2034 .0770 .2175  .2030 .0253 .2046
(.3, .2)       .0199 .0717 .0744  .0021 .0236 .0237   .1711 .0721 .1856  .1689 .0236 .1706
(.3, .5)       .0199 .0717 .0744  .0021 .0236 .0237   .1778 .0721 .1918  .1773 .0236 .1788

                         Post-Shrinkage                       Oracle
                 n = 250            n = 2,500           n = 250            n = 2,500
($\pi_o$, $c_o$)   BS    SD    RE     BS    SD    RE      BS    SD    RE     BS    SD    RE
(.1, .2)       .0048 .0828 .0830  .0003 .0253 .0254   .0012 .0773 .0773  .0004 .0253 .0253
(.1, .5)       .0048 .0833 .0835  .0003 .0253 .0254   .0012 .0773 .0773  .0004 .0253 .0253
(.3, .2)       .0061 .0900 .0902  .0003 .0240 .0240   .0008 .0722 .0723  .0002 .0236 .0236
(.3, .5)       .0061 .0900 .0902  .0003 .0240 .0240   .0008 .0722 .0723  .0002 .0236 .0236

Note: (i) The "automatic" estimate is obtained simultaneously with moment selection. (ii) The "conservative" estimation uses $Z_S$. (iii) The "pooled" estimation uses all valid IVs, including $Z_S$, $Z_A$, and $Z_{B_0}$. (iv) The "aggressive" estimation uses all available IVs, including invalid ones. (v) The "post-shrinkage" estimation uses $Z_S$ plus the IVs selected by the shrinkage procedure. (vi) The "oracle" estimation uses $Z_S$ and $Z_A$.

The "post-shrinkage" estimator is the GMM estimator that uses all IVs selected by the shrinkage procedure. The difference between the "automatic" estimator and the "post-shrinkage" estimator is small. Finally, an important comparison is between the "automatic" estimator and the infeasible "oracle" estimator, which uses the desirable IVs $Z_S$ and $Z_A$. This comparison indicates that the finite-sample properties of the "automatic" estimator are comparable to those of the "oracle" estimator, even for a small sample size, and the two are essentially the same when the sample size is large.

In sum, the GMM shrinkage estimator proposed in this paper not only produces consistent moment selection, as indicated in Table 6.1, but also automatically estimates the parameter of interest. Table 6.2 shows that this "automatic" estimator dominates all other feasible estimators and is comparable to the ideal but infeasible "oracle" estimator in terms of finite-sample bias and variance.

7 Conclusion

This paper studies moment selection when the number of moments diverges with the sample size, allowing for both invalid and redundant moments in the candidate set. We show that the moment selection problem can be transformed into a P-GMM estimation problem, which consistently selects the subset of valid and relevant moments and automatically estimates the parameter of interest. In consequence, the P-GMM estimator is not only robust to the potential mis-specification introduced by invalid moments but also robust to the possible finite-sample bias introduced by redundant moments.

An interesting and challenging question related to this paper is inference on the parameter of interest $\theta_o$ when moment selection is necessary. Although the asymptotic distribution developed in this paper can be used to conduct inference on $\theta_o$, this limiting distribution ignores the moment selection error in finite samples. As a result, a robust inference procedure with correct asymptotic size is an important issue for the P-GMM estimator. This is related to the post-model-selection inference problem investigated by Leeb and Pötscher (2005, 2008), Andrews and Guggenberger (2009, 2010), Guggenberger (2010), Belloni, Chernozhukov, and Hansen (2011), and McCloskey (2012), among others. Robust inference on the parameter of interest is beyond the scope of this paper and will be investigated in future research.

Appendix

In this appendix, for any two sequences $a_n$ and $b_n$, we use $a_n \lesssim b_n$ to denote $a_n \le C b_n$, where $C$ is some fixed finite positive constant.

A Proofs of Main Results in Sections 2 and 3

Proof of Lemma 2.1. For ease of notation, we write

$$\Gamma_S \equiv \Gamma_S(\theta_o), \quad \Gamma_\ell \equiv \Gamma_\ell(\theta_o), \quad \Omega_S \equiv \Omega_S(\theta_o) \quad\text{and}\quad \Omega_{S+\ell} \equiv \Omega_{S+\ell}(\theta_o). \tag{A.1}$$

By definition, we write

$$\Omega_{S+\ell} \equiv \begin{pmatrix} \Omega_S & \Omega_{S,\ell} \\ \Omega_{\ell,S} & \Omega_\ell \end{pmatrix}, \tag{A.2}$$

where $\Omega_S$ is the leading $k_0 \times k_0$ sub-matrix of $\Omega_{S+\ell}$, $\Omega_\ell$ is the last diagonal element of $\Omega_{S+\ell}$, and $\Omega_{S,\ell} = \Omega_{\ell,S}'$ are the corresponding sub-matrices of $\Omega_{S+\ell}$.

By the inverse formula for a block matrix, we have

$$\Omega_{S+\ell}^{-1}\begin{pmatrix} \Omega_S & \Omega_{S,\ell} \\ \Omega_{\ell,S} & \Omega_{\ell,S}\Omega_S^{-1}\Omega_{S,\ell} \end{pmatrix}\Omega_{S+\ell}^{-1} = \begin{pmatrix} \Omega_S^{-1} & 0_{k_0\times 1} \\ 0_{1\times k_0} & 0 \end{pmatrix}, \tag{A.3}$$

which further implies that

$$V_{S+\ell}^{-1} - V_S^{-1} = \begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}'\left(\Omega_{S+\ell}^{-1} - \begin{pmatrix}\Omega_S^{-1} & 0_{k_0\times 1} \\ 0_{1\times k_0} & 0\end{pmatrix}\right)\begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}
= \begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}'\Omega_{S+\ell}^{-1}\left(\Omega_{S+\ell} - \begin{pmatrix}\Omega_S & \Omega_{S,\ell} \\ \Omega_{\ell,S} & \Omega_{\ell,S}\Omega_S^{-1}\Omega_{S,\ell}\end{pmatrix}\right)\Omega_{S+\ell}^{-1}\begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}
= \begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}'\Omega_{S+\ell}^{-1}\begin{pmatrix}0_{k_0\times k_0} & 0_{k_0\times 1} \\ 0_{1\times k_0} & \Omega_\ell - \Omega_{\ell,S}\Omega_S^{-1}\Omega_{S,\ell}\end{pmatrix}\Omega_{S+\ell}^{-1}\begin{pmatrix}\Gamma_S \\ \Gamma_\ell\end{pmatrix}, \tag{A.4}$$

where

$$\Omega_\ell - \Omega_{\ell,S}\Omega_S^{-1}\Omega_{S,\ell} = \lim_{n\to\infty}\mathrm{Var}\left[n^{-\frac{1}{2}}\sum_{i=1}^n\left(g_\ell(Z_i,\theta_o) - g_S'(Z_i,\theta_o)\Omega_S^{-1}\Omega_{S,\ell}\right)\right] \ge 0. \tag{A.5}$$

This implies that $V_{S+\ell}^{-1} - V_S^{-1} \ge 0$ and hence $V_S \ge V_{S+\ell}$. The second result is an immediate implication of the first.
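
As a quick sanity check of this ordering, one can verify $V_S \ge V_{S+\ell}$ numerically on a randomly generated design, where $V = (\Gamma'\Omega^{-1}\Gamma)^{-1}$. The snippet below is purely illustrative: the matrices are made-up toy inputs, not part of the paper, and it simply confirms that adding a valid moment cannot increase the asymptotic variance.

```python
import numpy as np

rng = np.random.default_rng(0)
k0, d_theta = 4, 2
# a random positive definite Omega_{S+l} and a random Jacobian for S plus one extra moment
A = rng.standard_normal((k0 + 1, k0 + 1))
Omega_Sl = A @ A.T + (k0 + 1) * np.eye(k0 + 1)
Gamma_Sl = rng.standard_normal((k0 + 1, d_theta))
Omega_S, Gamma_S = Omega_Sl[:k0, :k0], Gamma_Sl[:k0]

V = lambda G, O: np.linalg.inv(G.T @ np.linalg.solve(O, G))  # (Gamma' Omega^{-1} Gamma)^{-1}
V_S, V_Sl = V(Gamma_S, Omega_S), V(Gamma_Sl, Omega_Sl)
# Lemma 2.1: V_S - V_{S+l} is positive semi-definite
print(np.linalg.eigvalsh(V_S - V_Sl))  # all eigenvalues >= 0 (up to rounding)
```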

Proof of Lemma 3.1. Recall $\nu_n(\theta) = m_n(\theta) - m(\theta)$. Note that $g_n(\beta) - g(\beta) = \nu_n(\theta)$ for any $\beta$, and $g_n(\beta_o) = \nu_n(\theta_o)$ because $g(\beta_o) = 0$. Hence, by Assumptions 3.1(ii) and 3.1(iii), we get

$$g_n(\beta_o)'W_n g_n(\beta_o) = \nu_n(\theta_o)'W_n\nu_n(\theta_o) = o_p(1). \tag{A.6}$$

The definition of $\hat{\beta}_n$ implies that

$$g_n(\hat{\beta}_n)'W_n g_n(\hat{\beta}_n) + \lambda_n\sum_{\ell\in D}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| \le g_n(\beta_o)'W_n g_n(\beta_o) + \lambda_n\sum_{\ell\in D}\omega_{n,\ell}|\beta_{o,\ell}|. \tag{A.7}$$

Let $m_{S,n}(\hat{\theta}_n)$ and $m_{D,n}(\hat{\theta}_n)$ denote the subvectors of $m_n(\hat{\theta}_n)$ associated with the moments in $S$ and $D$, respectively. The inequality in (A.7) implies

$$\|m_{S,n}(\hat{\theta}_n)\|^2 + \|m_{D,n}(\hat{\theta}_n) - \hat{\beta}_n\|^2 = \|g_n(\hat{\beta}_n)\|^2 = o_p(1), \tag{A.8}$$

because (i) $\lambda_n\sum_{\ell\in D}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| \ge 0$; (ii) $g_n(\beta_o)'W_n g_n(\beta_o) = o_p(1)$ by (A.6); (iii) $\beta_{o,\ell} = 0$ for $\ell\notin B_1$; (iv) $\lambda_n\sum_{\ell\in B_1}\omega_{n,\ell}|\beta_{o,\ell}| = o_p(1)$ by Assumption 3.1(iv) and $|\beta_{o,\ell}| < C$ for all $\ell$; and (v) $\lambda_{\min}(W_n) \ge C^{-1}$ w.p.a.1 by Assumption 3.1(iii).

Using the triangle inequality and the result in (A.8), we have

$$o_p(1) = \|m_{S,n}(\hat{\theta}_n)\| \ge \left|\,\|m_S(\hat{\theta}_n)\| - \|m_{S,n}(\hat{\theta}_n) - m_S(\hat{\theta}_n)\|\,\right|, \tag{A.9}$$

which combined with Assumption 3.1(ii) implies that $\|m_S(\hat{\theta}_n)\| = o_p(1)$. Under Assumption 3.1(i), $\|m_S(\hat{\theta}_n)\| = o_p(1)$ implies that $\hat{\theta}_n \to_p \theta_o$.

Proof of Lemma 3.2. We first prove part (a). Because $\lambda_n\omega_{n,\ell} \ge 0$ for all $\ell$, the triangle inequality and the Cauchy-Schwarz inequality imply that

$$\lambda_n\sum_{\ell\in B_1}\omega_{n,\ell}|\beta_{o,\ell}| - \lambda_n\sum_{\ell\in B_1}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| \le b_n\|\hat{\beta}_n - \beta_o\|. \tag{A.10}$$

Combining the inequalities in (A.7) and (A.10) and using the fact that $\lambda_n\omega_{n,\ell} \ge 0$ and $\beta_{o,\ell} = 0$ for $\ell\notin B_1$, we obtain

$$g_n(\hat{\beta}_n)'W_n g_n(\hat{\beta}_n) \le b_n\|\hat{\beta}_n - \beta_o\| + g_n(\beta_o)'W_n g_n(\beta_o). \tag{A.11}$$

By Assumption 3.1(iii) and $g_n(\beta_o)'W_n g_n(\beta_o) = O_p(\delta_n^2)$ under Assumption 3.2(i), we have

$$\|m_{S,n}(\hat{\theta}_n)\|^2 + \|m_{D,n}(\hat{\theta}_n) - \hat{\beta}_n\|^2 \lesssim b_n\|\hat{\beta}_n - \beta_o\| + O_p(\delta_n^2) \tag{A.12}$$

w.p.a.1.

To derive the rate of convergence of $\hat{\beta}_n$, we next study the two terms on the left hand side of (A.12) and link them to $\|\hat{\beta}_n - \beta_o\|$. First, note that

$$\|m_{S,n}(\hat{\theta}_n)\|^2 = \|m_{S,n}(\hat{\theta}_n) - m_S(\hat{\theta}_n) + m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2
\ge \|m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2 - 2\|m_S(\hat{\theta}_n) - m_S(\theta_o)\|\,\|m_{S,n}(\hat{\theta}_n) - m_S(\hat{\theta}_n)\|
= \|m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2 - O_p(\delta_n)\|m_S(\hat{\theta}_n) - m_S(\theta_o)\|, \tag{A.13}$$

where the first equality holds because $m_S(\theta_o) = 0$; the inequality follows from an expansion of the quadratic term and the Cauchy-Schwarz inequality; and the $O_p(\delta_n)$ term in the last equality follows from Assumption 3.2(i) and the consistency of $\hat{\theta}_n$. By the mean value theorem,

$$m_S(\hat{\theta}_n) - m_S(\theta_o) = \Gamma_S(\theta_o)(\hat{\theta}_n - \theta_o) + \left[\Gamma_S(\tilde{\theta}_n) - \Gamma_S(\theta_o)\right](\hat{\theta}_n - \theta_o), \tag{A.14}$$

where $\Gamma_S(\tilde{\theta}_n)' = [\Gamma_1(\tilde{\theta}_{1,n})', \ldots, \Gamma_{k_0}(\tilde{\theta}_{k_0,n})']$ and $\tilde{\theta}_{\ell,n}$ is some value between $\hat{\theta}_n$ and $\theta_o$ for any $\ell\in S$. By the Cauchy-Schwarz inequality, the consistency of $\hat{\theta}_n$ and Assumption 3.2(iv),

$$\left\|\left[\Gamma_S(\tilde{\theta}_n) - \Gamma_S(\theta_o)\right](\hat{\theta}_n - \theta_o)\right\| \le \|\Gamma_S(\tilde{\theta}_n) - \Gamma_S(\theta_o)\|\,\|\hat{\theta}_n - \theta_o\| \lesssim \|\hat{\theta}_n - \theta_o\|^2 \tag{A.15}$$

w.p.a.1. Using Assumption 3.2(iii), we have

$$\|\Gamma_S(\theta_o)(\hat{\theta}_n - \theta_o)\| \lesssim \|\hat{\theta}_n - \theta_o\|, \tag{A.16}$$

which together with (A.14), (A.15), the consistency of $\hat{\theta}_n$, and the Cauchy-Schwarz inequality implies that

$$\|m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2 = (\hat{\theta}_n - \theta_o)'\Gamma_S(\theta_o)'\Gamma_S(\theta_o)(\hat{\theta}_n - \theta_o) + o_p(1)\|\hat{\theta}_n - \theta_o\|^2. \tag{A.17}$$

The above equality combined with Assumption 3.2(iii) further implies that

$$\|\hat{\theta}_n - \theta_o\| \lesssim \|m_S(\hat{\theta}_n) - m_S(\theta_o)\| \lesssim \|\hat{\theta}_n - \theta_o\| \tag{A.18}$$

w.p.a.1. Combining the results in (A.13) and (A.18), we have w.p.a.1,

$$\|m_{S,n}(\hat{\theta}_n)\|^2 \gtrsim \|\hat{\theta}_n - \theta_o\|^2 - O_p(\delta_n)\|\hat{\theta}_n - \theta_o\|. \tag{A.19}$$

To study the second term on the left hand side of (A.12), we can write

$$m_{D,n}(\hat{\theta}_n) - \hat{\beta}_n = \left[m_{D,n}(\hat{\theta}_n) - m_D(\hat{\theta}_n)\right] + \left[m_D(\hat{\theta}_n) - \hat{\beta}_n\right] = O_p(\delta_n) + m_D(\hat{\theta}_n) - \hat{\beta}_n = Q_n - (\hat{\beta}_n - \beta_o), \quad\text{where } Q_n \equiv \left[m_D(\hat{\theta}_n) - \beta_o\right] + O_p(\delta_n), \tag{A.20}$$

following the consistency of $\hat{\theta}_n$ and Assumption 3.2(i). Then,

$$b_n\|\hat{\beta}_n - \beta_o\| + O_p(\delta_n^2) \ge \|m_{D,n}(\hat{\theta}_n) - \hat{\beta}_n\|^2 \ge \|\hat{\beta}_n - \beta_o\|^2 + \|Q_n\|^2 - 2\|Q_n\|\,\|\hat{\beta}_n - \beta_o\| \tag{A.21}$$

w.p.a.1, where the first inequality follows from (A.12) and the second inequality follows from (A.20) and the Cauchy-Schwarz inequality. Reorganizing (A.21), we obtain

$$\|\hat{\beta}_n - \beta_o\|^2 - (2\|Q_n\| + b_n)\|\hat{\beta}_n - \beta_o\| + \|Q_n\|^2 - O_p(\delta_n^2) \le 0, \tag{A.22}$$

which implies

$$\|\hat{\beta}_n - \beta_o\| \lesssim \|m_D(\hat{\theta}_n) - m_D(\theta_o)\| + b_n + O_p(\delta_n), \tag{A.23}$$

using the definition of $Q_n$ in (A.20) and $\beta_o = m_D(\theta_o)$.

Combining the inequalities in (A.12), (A.19) and (A.23), we get

$$\|\hat{\theta}_n - \theta_o\|^2 - O_p(\delta_n)\|\hat{\theta}_n - \theta_o\| \lesssim b_n\|m_D(\hat{\theta}_n) - m_D(\theta_o)\| + O_p(\delta_n^2 + b_n^2) \tag{A.24}$$

w.p.a.1. By the mean value theorem,

$$m_D(\hat{\theta}_n) - m_D(\theta_o) = \left[\Gamma_D(\tilde{\theta}_n) - \Gamma_D(\theta_o)\right](\hat{\theta}_n - \theta_o) + \Gamma_D(\theta_o)(\hat{\theta}_n - \theta_o), \tag{A.25}$$

where $\Gamma_D(\tilde{\theta}_n)' = [\Gamma_{k_0+1}(\tilde{\theta}_{k_0+1,n})', \ldots, \Gamma_{k_n}(\tilde{\theta}_{k_n,n})']$ and $\tilde{\theta}_{\ell,n}$ is some value between $\hat{\theta}_n$ and $\theta_o$ for any $\ell\in D$. Note that Assumption 3.2(iv) implies that

$$\left\|\left[\Gamma_D(\tilde{\theta}_n) - \Gamma_D(\theta_o)\right](\hat{\theta}_n - \theta_o)\right\| \lesssim \sqrt{k_n}\,\|\hat{\theta}_n - \theta_o\|^2 \tag{A.26}$$

w.p.a.1. Under Assumption 3.2(iii), we have $\|\Gamma_D(\theta_o)(\hat{\theta}_n - \theta_o)\|^2 \lesssim \|\hat{\theta}_n - \theta_o\|^2$. Therefore,

$$\|m_D(\hat{\theta}_n) - m_D(\theta_o)\| \lesssim \sqrt{k_n}\,\|\hat{\theta}_n - \theta_o\|^2 + \|\hat{\theta}_n - \theta_o\| \tag{A.27}$$

w.p.a.1 by the triangle inequality and the Cauchy-Schwarz inequality. Combining (A.24) and (A.27) yields

$$\left[1 - O_p(\sqrt{k_n}\,b_n)\right]\|\hat{\theta}_n - \theta_o\|^2 \le O_p(b_n + \delta_n)\|\hat{\theta}_n - \theta_o\| + O_p(\delta_n^2 + b_n^2) \tag{A.28}$$

w.p.a.1. As $\sqrt{k_n}\,b_n = o_p(1)$, the inequality above implies

$$\|\hat{\theta}_n - \theta_o\| = O_p(b_n + \delta_n). \tag{A.29}$$

Applying the results in (A.27) and (A.29) to (A.23), we obtain

$$\|\hat{\beta}_n - \beta_o\| \lesssim O_p(b_n + \delta_n) + \sqrt{k_n}\,O_p(b_n^2 + \delta_n^2) = O_p(b_n + \delta_n), \tag{A.30}$$

where the last equality follows from $\sqrt{k_n}(b_n + \delta_n) = o(1)$ under Assumption 3.2(v). Combining the results in (A.29) and (A.30), we get the result in part (a).

We next prove part (b). We first note that

$$g_n(\hat{\beta}_n)'W_n g_n(\hat{\beta}_n) - g_n(\beta_o)'W_n g_n(\beta_o)
= \left[g_n(\hat{\beta}_n) - g_n(\beta_o)\right]'W_n\left[g_n(\hat{\beta}_n) - g_n(\beta_o)\right] + 2\left[g_n(\hat{\beta}_n) - g_n(\beta_o)\right]'W_n g_n(\beta_o)
\gtrsim \|g_n(\hat{\beta}_n) - g_n(\beta_o)\|^2 - \|g_n(\hat{\beta}_n) - g_n(\beta_o)\|\,\|g_n(\beta_o)\|
\gtrsim \|\hat{\beta}_n - \beta_o\|^2 - \|g_n(\beta_o)\|\,\|\hat{\beta}_n - \beta_o\| \tag{A.31}$$

w.p.a.1, where the first inequality follows from Assumption 3.1(iii) and the Cauchy-Schwarz inequality, and the second inequality holds by Assumption 3.3(ii). Combining the inequalities in (A.11) and (A.31), we obtain

$$\|\hat{\beta}_n - \beta_o\|^2 - \|g_n(\beta_o)\|\,\|\hat{\beta}_n - \beta_o\| \lesssim b_n\|\hat{\beta}_n - \beta_o\|, \tag{A.32}$$

which together with Assumption 3.3(i) implies $\|\hat{\beta}_n - \beta_o\| = O_p(\delta_n + b_n)$.

Proof of Theorem 3.2. Let $e_\ell$ be a $k_n$-dimensional vector with the $\ell$-th entry being 1 and the others being 0. By the Karush-Kuhn-Tucker (KKT) optimality condition, $\hat{\beta}_{n,\ell} = 0$ if

$$\left|e_\ell' W_n g_n(\hat{\beta}_n)\right| < \left|\frac{\lambda_n\omega_{n,\ell}}{2}\right|. \tag{A.33}$$

Hence,

$$\Pr\left(\hat{\beta}_{n,\ell} = 0,\ \forall\,\ell\in A\right) \ge \Pr\left(\max_{\ell\in A}\left|\frac{e_\ell' W_n g_n(\hat{\beta}_n)}{\lambda_n\omega_{n,\ell}}\right| < \frac{1}{2}\right). \tag{A.34}$$

To obtain the desired result, it remains to show

$$\max_{\ell\in A}\left|\frac{e_\ell' W_n g_n(\hat{\beta}_n)}{\lambda_n\omega_{n,\ell}}\right| = o_p(1). \tag{A.35}$$

Following Assumption 3.1(iii),

$$0 < C^{-1} \le e_\ell' W_n W_n e_\ell \le C < \infty \tag{A.36}$$

for any $\ell$ w.p.a.1. By the Cauchy-Schwarz inequality and the inequalities in (A.36),

$$\max_{\ell\in A}\left|\frac{e_\ell' W_n g_n(\hat{\beta}_n)}{\lambda_n\omega_{n,\ell}}\right| \le \max_{\ell\in A}\frac{\|e_\ell' W_n\|}{\lambda_n\omega_{n,\ell}}\,\|g_n(\hat{\beta}_n)\| \lesssim \frac{\|g_n(\hat{\beta}_n)\|}{\lambda_n}\max_{\ell\in A}\omega_{n,\ell}^{-1} \tag{A.37}$$

w.p.a.1. By the triangle inequality,

$$\|g_n(\hat{\beta}_n)\| \le \|g(\hat{\beta}_n)\| + \|\nu_n(\hat{\theta}_n)\| = \|g(\hat{\beta}_n)\| + O_p(\delta_n), \tag{A.38}$$

where the equality follows from Assumption 3.2(i). Note that

$$\|g(\hat{\beta}_n)\|^2 = \|m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2 + \|m_D(\hat{\theta}_n) - \hat{\beta}_n\|^2 \lesssim \|m_S(\hat{\theta}_n) - m_S(\theta_o)\|^2 + \|m_D(\hat{\theta}_n) - m_D(\theta_o)\|^2 + \|\hat{\beta}_n - \beta_o\|^2, \tag{A.39}$$

which together with (A.18), (A.27), Lemma 3.2, $\sqrt{k_n}(\delta_n + b_n) = o(1)$ and $b_n = O_p(\delta_n)$ implies that $\|g_n(\hat{\beta}_n)\| = O_p(\delta_n)$. This combined with Assumption 3.4(ii) and (A.37) implies that (A.35) holds.

Next, we prove part (b). Under Assumption 3.3, we have

$$\|g_n(\hat{\beta}_n)\| \le \|g_n(\hat{\beta}_n) - g_n(\beta_o)\| + \|g_n(\beta_o)\| \lesssim \|\hat{\beta}_n - \beta_o\| + \|g_n(\beta_o)\| = O_p(\delta_n), \tag{A.40}$$

where the first inequality follows from the triangle inequality, the second inequality is by Assumption 3.3(ii) and holds w.p.a.1, and the last equality is by Assumptions 3.3(i), 3.4(i), and Lemma 3.2(b). This combined with Assumption 3.4(ii) and (A.37) implies that (A.35) holds.

Proof of Theorem 3.3. Let $\varepsilon_n$ be a sequence of constants such that (i) $\varepsilon_n = o(n^{-1/2})$; (ii) $\lambda_n\|\omega_{n,B}\| = O_p(\varepsilon_n)$; (iii) $\varsigma_n\delta_n = o(\varepsilon_n)$ (and (iv) $\sqrt{k_n}\,\delta_n^2 = O(\varepsilon_n)$ for the proof of part (a)). Such a sequence can be constructed because $\lambda_n\|\omega_{n,B}\| = o_p(n^{-1/2})$ and $\varsigma_n\delta_n = o(n^{-1/2})$ (and $\sqrt{k_n}\,\delta_n^2 = o(n^{-1/2})$ for part (a)). Define

$$\hat{\beta}^*_{B,n} = \hat{\beta}_{B,n} + \varepsilon_n u_n^*, \tag{A.41}$$

where $u_n^* = (\Gamma_\beta' W_n \Gamma_\beta)^{-1}\gamma_n^*$, $\gamma_n^* \in \mathbb{R}^{d_\theta + d_B}$ and $\|\gamma_n^*\| \le C$. Because the smallest eigenvalues of $W_n$ and $\Gamma_\beta'\Gamma_\beta$ are bounded from below by Assumptions 3.1(iii) and 3.6(iii), $\|u_n^*\| \le C$. Hence,

$$\|\varepsilon_n u_n^*\|^2 = \varepsilon_n^2\|u_n^*\|^2 = O_p(\varepsilon_n^2) = o_p(n^{-1}). \tag{A.42}$$

Write $\hat{\beta}^{*\prime}_{B,n} = (\hat{\theta}^{*\prime}_n, \hat{\beta}^{*\prime}_{B,n})$; then

$$\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\| = \|\varepsilon_n u_n^*\| = O_p(\varepsilon_n) \tag{A.43}$$

and

$$\|\hat{\beta}^*_{B,n} - \beta_{B,o}\| \le \|\hat{\beta}_{B,n} - \beta_{B,o}\| + \|\varepsilon_n u_n^*\| = O_p(\delta_n). \tag{A.44}$$

For any $\beta_B' = (\theta', \beta_B')$, we define

$$g_n(\beta_B) = \frac{1}{n}\sum_{i=1}^n g(Z_i,\beta_B) \quad\text{and}\quad g(\beta_B) = E[g(Z,\beta_B)], \tag{A.45}$$

where $g(Z,\beta_B)$ is defined in (3.7). By the definition of $\hat{\beta}_n$,

$$g_n(\hat{\beta}_n)'W_n g_n(\hat{\beta}_n) + \lambda_n\sum_{\ell\in D}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| \le g_n(\hat{\beta}^*_{B,n})'W_n g_n(\hat{\beta}^*_{B,n}) + \lambda_n\sum_{\ell\in B}\omega_{n,\ell}|\hat{\beta}^*_{n,\ell}|, \tag{A.46}$$

where $\hat{\beta}^*_{n,\ell}$ is the element of $\hat{\beta}^*_{B,n}$ corresponding to $\hat{\beta}_{n,\ell}$ for any $\ell\in B$. By Theorem 3.2, the left hand side of (A.46) satisfies

$$g_n(\hat{\beta}_n)'W_n g_n(\hat{\beta}_n) + \lambda_n\sum_{\ell\in D}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| = g_n(\hat{\beta}_{B,n})'W_n g_n(\hat{\beta}_{B,n}) + \lambda_n\sum_{\ell\in B}\omega_{n,\ell}|\hat{\beta}_{n,\ell}| \tag{A.47}$$

w.p.a.1. The triangle inequality and the Cauchy-Schwarz inequality imply that

$$\left|\lambda_n\sum_{\ell\in B}\omega_{n,\ell}\left(|\hat{\beta}^*_{n,\ell}| - |\hat{\beta}_{n,\ell}|\right)\right| \le \lambda_n\sum_{\ell\in B}\omega_{n,\ell}|\hat{\beta}^*_{n,\ell} - \hat{\beta}_{n,\ell}| = \lambda_n\varepsilon_n\sum_{\ell\in B}\omega_{n,\ell}|u^*_{n,\ell}| \le \lambda_n\varepsilon_n\|\omega_{n,B}\|\,\|u^*_{n,B}\| = O_p(\varepsilon_n^2), \tag{A.48}$$

where $u^*_{n,B} \equiv (u^*_{n,d_\theta+1}, \ldots, u^*_{n,d_\theta+d_B})'$ is the vector of perturbations on $\beta_B$ and the $O_p(\varepsilon_n^2)$ follows from $\|u^*_{n,B}\| \le C$ and $\lambda_n\|\omega_{n,B}\| = O_p(\varepsilon_n)$. Combining (A.46), (A.47) and (A.48) yields

$$g_n(\hat{\beta}^*_{B,n})'W_n g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})'W_n g_n(\hat{\beta}_{B,n}) \ge O_p(\varepsilon_n^2). \tag{A.49}$$

We next prove part (a). Define

$$I_{1,n} = \nu_n(\hat{\theta}^*_n) - \nu_n(\hat{\theta}_n). \tag{A.50}$$

Because $g(Z,\beta_B)$ is linear in $\beta$,

$$g_n(\beta_B) = g(\beta_B) + \nu_n(\theta) \tag{A.51}$$

for any $\beta_B' = (\theta', \beta_B')$. Applying the equality above, we obtain

$$g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n}) = g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n}) + I_{1,n}, \tag{A.52}$$

which implies that

$$\|g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\|^2 \le 2\|g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\|^2 + 2\|I_{1,n}\|^2. \tag{A.53}$$

By the mean value theorem,

$$\|g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\|^2 = (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\tilde{\theta}^*_n)'\Gamma_\beta(\tilde{\theta}^*_n)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})
= (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'\Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})
+ (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right]'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right](\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})
+ 2(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right]'\Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}), \tag{A.54}$$

where the first equality is a mean value expansion with $\tilde{\theta}^*_n$ between the first $d_\theta$ elements of $\hat{\beta}^*_{B,n}$ and $\hat{\beta}_{B,n}$ ($\tilde{\theta}^*_n$ may vary element by element for the vector). Next, we show that each term on the right hand side of the second equality of (A.54) is $O_p(\varepsilon_n^2)$. Under Assumption 3.2(iii), we see that

$$\psi'\Gamma_\beta'\Gamma_\beta\psi = (\psi_{d_\theta}', 0_{d_A}', \psi_{d_B}')\,\Gamma_\beta(\theta_o)'\Gamma_\beta(\theta_o)\,(\psi_{d_\theta}', 0_{d_A}', \psi_{d_B}')' \le C\|\psi\|^2 \tag{A.55}$$

for any $\psi = (\psi_{d_\theta}', \psi_{d_B}')' \in \mathbb{R}^{d_\theta + d_B}$, which implies that $\lambda_{\max}(\Gamma_\beta'\Gamma_\beta) \le C$. Under (A.55) and $\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\| = O_p(\varepsilon_n)$, we have

$$(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'\Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) \lesssim \|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|^2 = O_p(\varepsilon_n^2). \tag{A.56}$$

By the Cauchy-Schwarz inequality and $k_n\delta_n^2 = o(1)$, we have

$$(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right]'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right](\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) \le \|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|^2\,\|\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\|^2 = k_n O_p(\delta_n^2)\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|^2 = o_p(\varepsilon_n^2), \tag{A.57}$$

where the first equality is by $\|\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\| = O_p(\sqrt{k_n}\,\delta_n)$, which in turn holds by Assumption 3.2(iv) and $\|\tilde{\theta}^*_n - \theta_o\| = O_p(\delta_n)$. Using (A.55), the Cauchy-Schwarz inequality, Assumption 3.2(iv), and $k_n\delta_n^2 = o(1)$, we have

$$\left|(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right]'\Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})\right|
\le \|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|\,\|\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\|\,\|\Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})\|
\lesssim \|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|^2\,\|\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\| = \sqrt{k_n}\,O_p(\delta_n)\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|^2 = o_p(\varepsilon_n^2). \tag{A.58}$$

Combining the results in (A.54), (A.56), (A.57), and (A.58), we deduce that

$$\|g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\|^2 = O_p(\varepsilon_n^2). \tag{A.59}$$

Using Assumption 3.5, $\|\hat{\theta}^*_n - \hat{\theta}_n\| = O_p(\varepsilon_n)$, $\delta_n^{-1} = O(n^{1/2})$, and $\varsigma_n\delta_n = o(\varepsilon_n)$, we have

$$\|I_{1,n}\| = O_p(n^{-\frac{1}{2}}\varsigma_n + \varepsilon_n\varsigma_n) = O_p(n^{-\frac{1}{2}}\varsigma_n) = o_p(\varepsilon_n), \tag{A.60}$$

which together with (A.52), (A.59), and Assumption 3.1(iii) implies that

$$\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right] \lesssim \|g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\|^2 = O_p(\varepsilon_n^2), \tag{A.61}$$

where the inequality holds w.p.a.1. We can rewrite the inequality in (A.49) to obtain

$$O_p(\varepsilon_n^2) \le g_n(\hat{\beta}^*_{B,n})'W_n g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})'W_n g_n(\hat{\beta}_{B,n})
= \left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right] + 2\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n g_n(\hat{\beta}_{B,n}). \tag{A.62}$$

Applying the results in (A.61) and (A.62), we obtain

$$\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n g_n(\hat{\beta}_{B,n}) \ge O_p(\varepsilon_n^2). \tag{A.63}$$

Define

$$I_{0,n} = \nu_n(\hat{\theta}_n) - \nu_n(\theta_o). \tag{A.64}$$

Then

$$g_n(\hat{\beta}_{B,n}) = g_n(\beta_o) + g(\hat{\beta}_{B,n}) - g(\beta_o) + I_{0,n}. \tag{A.65}$$

Plugging (A.65) into (A.63) and using the definition of $I_{1,n}$ in (A.50) yields

$$O_p(\varepsilon_n^2) \le \left[g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\right]'W_n\left[g_n(\beta_o) + g(\hat{\beta}_{B,n}) - g(\beta_o)\right] + A_n + B_n + C_n, \quad\text{where}$$
$$A_n = I_{1,n}'W_n\left[g_n(\beta_o) + g(\hat{\beta}_{B,n}) - g(\beta_o)\right], \quad B_n = \left[g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\right]'W_n I_{0,n}, \quad C_n = I_{1,n}'W_n I_{0,n}. \tag{A.66}$$

We next study the terms $A_n$, $B_n$ and $C_n$ one by one. By the triangle inequality and the Cauchy-Schwarz inequality,

$$|A_n| \le \left|I_{1,n}'W_n g_n(\beta_o)\right| + \left|I_{1,n}'W_n\left[g(\hat{\beta}_{B,n}) - g(\beta_o)\right]\right|
\le \left|I_{1,n}'W_n I_{1,n}\right|^{\frac{1}{2}}\left|g_n(\beta_o)'W_n g_n(\beta_o)\right|^{\frac{1}{2}} + \left|I_{1,n}'W_n I_{1,n}\right|^{\frac{1}{2}}\left|\left[g(\hat{\beta}_{B,n}) - g(\beta_o)\right]'W_n\left[g(\hat{\beta}_{B,n}) - g(\beta_o)\right]\right|^{\frac{1}{2}}. \tag{A.67}$$

By arguments similar to those used to show (A.59), we can show

$$\|g(\hat{\beta}_{B,n}) - g(\beta_o)\|^2 = O_p(\delta_n^2). \tag{A.68}$$

Using Assumptions 3.1(iii) and 3.2(i), the results in (A.60), (A.67), (A.68), and $\varsigma_n\delta_n = o(\varepsilon_n)$, we have

$$|A_n| = O_p(n^{-\frac{1}{2}}\varsigma_n\delta_n) = o_p(n^{-\frac{1}{2}}\varepsilon_n). \tag{A.69}$$

To study $B_n$ and $C_n$, first note that

$$\|I_{0,n}\| = O_p(n^{-\frac{1}{2}}\varsigma_n + \delta_n\varsigma_n) = o_p(\varepsilon_n), \tag{A.70}$$

where the first equality follows from Assumption 3.5 and $\|\hat{\theta}_n - \theta_o\| = O_p(\delta_n)$, and the second equality follows from $\delta_n^{-1} = O(n^{1/2})$ and $\varsigma_n\delta_n = o(\varepsilon_n)$. By the Cauchy-Schwarz inequality, (A.59) and (A.70),

$$|B_n| \le \left|\left[g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\right]'W_n\left[g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\right]\right|^{\frac{1}{2}}\left|I_{0,n}'W_n I_{0,n}\right|^{\frac{1}{2}} = o_p(\varepsilon_n^2). \tag{A.71}$$

By the Cauchy-Schwarz inequality, (A.60) and (A.70),

$$|C_n| \le \left|I_{1,n}'W_n I_{1,n}\right|^{\frac{1}{2}}\left|I_{0,n}'W_n I_{0,n}\right|^{\frac{1}{2}} = o_p(\varepsilon_n^2). \tag{A.72}$$

Putting together (A.66), (A.69), (A.71), and (A.72) and using $\varepsilon_n = o(n^{-1/2})$, we obtain

$$\left[g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})\right]'W_n\left[g_n(\beta_o) + g(\hat{\beta}_{B,n}) - g(\beta_o)\right] \ge o_p(n^{-\frac{1}{2}}\varepsilon_n). \tag{A.73}$$

Next, we consider the mean value expansions of $g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n})$ and $g(\hat{\beta}_{B,n}) - g(\beta_o)$ in (A.73). First,

$$g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n}) = \Gamma_\beta(\tilde{\theta}^*_n)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) = \Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) + \left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right](\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) = \Gamma_\beta(\theta_o)(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) + O_p(\sqrt{k_n}\,\delta_n\varepsilon_n), \tag{A.74}$$

where the first equality is a mean value expansion with $\tilde{\theta}^*_n$ between the first $d_\theta$ elements of $\hat{\beta}^*_{B,n}$ and $\hat{\beta}_{B,n}$ ($\tilde{\theta}^*_n$ may vary element by element for the vector), and the third equality follows from the Cauchy-Schwarz inequality, $\|\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\| = O_p(\sqrt{k_n}\,\delta_n)$ (which in turn holds by Assumption 3.2(iv) and $\|\tilde{\theta}^*_n - \theta_o\| = O_p(\delta_n)$), and $\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\| = O_p(\varepsilon_n)$. Similarly, we have

$$g(\hat{\beta}_{B,n}) - g(\beta_o) = \Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o}) + O_p(\sqrt{k_n}\,\delta_n^2). \tag{A.75}$$

Plugging (A.74) and (A.75) into (A.73), we obtain

$$o_p(n^{-1/2}\varepsilon_n) \le (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'W_n\left[g_n(\beta_o) + \Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right] + D_1 + D_2
= (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'W_n\left[g_n(\beta_o) + \Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right] + o_p(n^{-\frac{1}{2}}\varepsilon_n), \tag{A.76}$$

because

$$|D_1| \le \left|(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\left[\Gamma_\beta(\tilde{\theta}^*_n) - \Gamma_\beta(\theta_o)\right]'W_n\left[g_n(\beta_o) + g(\hat{\beta}_{B,n}) - g(\beta_o)\right]\right|
\le O_p(\sqrt{k_n}\,\delta_n\varepsilon_n)\left[\|g_n(\beta_o)\| + \|g(\hat{\beta}_{B,n}) - g(\beta_o)\|\right] = O_p(\sqrt{k_n}\,\delta_n^2\varepsilon_n) = o_p(n^{-\frac{1}{2}}\varepsilon_n), \tag{A.77}$$

and

$$|D_2| \le \left|(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'W_n\left[\Gamma_\beta(\tilde{\theta}_n) - \Gamma_\beta(\theta_o)\right](\hat{\beta}_{B,n} - \beta_{B,o})\right| = O_p(\varepsilon_n)O_p(\sqrt{k_n}\,\delta_n^2) = o_p(n^{-\frac{1}{2}}\varepsilon_n) \tag{A.78}$$

for some $\tilde{\theta}_n$ between the first $d_\theta$ elements of $\hat{\beta}_{B,n}$ and $\theta_o$, where $\tilde{\theta}_n$ may vary element by element for the vector. The second inequality in (A.77) holds by the Cauchy-Schwarz inequality, the third equality in (A.74), and Assumption 3.1(iii); the first equality in (A.77) follows from Assumption 3.2(i) and (A.68); and the second equality in (A.77) holds because $\sqrt{k_n}\,\delta_n^2 = O(\varepsilon_n)$ and $\varepsilon_n = o(n^{-1/2})$, as stated at the beginning of the proof. The result in (A.78) holds by similar arguments, using $\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\| = O_p(\varepsilon_n)$, $\|\hat{\beta}_{B,n} - \beta_{B,o}\| = O_p(\delta_n)$ and $\|\Gamma_\beta(\tilde{\theta}_n) - \Gamma_\beta(\theta_o)\| = O_p(\sqrt{k_n}\,\delta_n)$.

The result in (A.76) immediately gives

$$o_p(n^{-\frac{1}{2}}\varepsilon_n) \le (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta(\theta_o)'W_n\left[g_n(\beta_o) + \Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right], \tag{A.79}$$

which together with $\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n} = \varepsilon_n u_n^* = \varepsilon_n(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\gamma_n^*$ implies that

$$\gamma_n^{*\prime}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\left[\sqrt{n}\,g_n(\beta_o) + \sqrt{n}\,\Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right] \ge o_p(1), \tag{A.80}$$

where $\gamma_n^* \in \mathbb{R}^{d_\theta + d_B}$ with $\|\gamma_n^*\| \le C$. Next, define $\hat{\beta}^{**}_{B,n} = \hat{\beta}_{B,n} - \varepsilon_n u_n^*$; using the same arguments as in deriving (A.80), we deduce that

$$-\gamma_n^{*\prime}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\left[\sqrt{n}\,g_n(\beta_o) + \sqrt{n}\,\Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right] \ge o_p(1), \tag{A.81}$$

which, combined with (A.80), implies that

$$\left|\gamma_n^{*\prime}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\left[\sqrt{n}\,g_n(\beta_o) + \sqrt{n}\,\Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right]\right| = o_p(1). \tag{A.82}$$

The approximation in (A.82) can be rewritten as

$$\sqrt{n}\,\gamma_n^{*\prime}(\hat{\beta}_{B,n} - \beta_{B,o}) = -\gamma_n^{*\prime}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\Omega_n^{\frac{1}{2}}\left(\sqrt{n}\,\Omega_n^{-\frac{1}{2}}g_n(\beta_o)\right) + o_p(1). \tag{A.83}$$

Let $\gamma_n \in \mathbb{R}^{d_\theta + d_B}$ be an arbitrary vector with $\|\gamma_n\| = 1$. Let

$$\gamma_n^{*\prime} \equiv \gamma_n'\Sigma_n^{-\frac{1}{2}} \quad\text{and}\quad \psi_n' \equiv \gamma_n'\Sigma_n^{-\frac{1}{2}}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\Omega_n^{\frac{1}{2}}. \tag{A.84}$$

It is clear that

$$\|\psi_n\|^2 = \psi_n'\psi_n = \gamma_n'\Sigma_n^{-\frac{1}{2}}\Sigma_n\Sigma_n^{-\frac{1}{2}}\gamma_n = 1 \tag{A.85}$$

and

$$\|\gamma_n^*\|^2 = \gamma_n^{*\prime}\gamma_n^* = \gamma_n'\Sigma_n^{-1}\gamma_n \le C, \tag{A.86}$$

where the last inequality is by $\|\gamma_n\| = 1$, (A.55), and Assumptions 3.1(iii), 3.6(ii), and 3.6(iii). Hence, we deduce that

$$\sqrt{n}\,\gamma_n'\Sigma_n^{-\frac{1}{2}}(\hat{\beta}_{B,n} - \beta_{B,o}) = -\psi_n'\left(\sqrt{n}\,\Omega_n^{-\frac{1}{2}}g_n(\beta_o)\right) + o_p(1) \to_d N(0,1), \tag{A.87}$$

where the weak convergence is by Assumption 3.6(i).

We next prove part (b). Using Assumption 3.3(ii) and (A.43), we get

$$\|g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\| = O_p(\|\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}\|) = O_p(\varepsilon_n), \tag{A.88}$$

which together with the expression in (A.62) implies that

$$g_n(\hat{\beta}^*_{B,n})'W_n g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})'W_n g_n(\hat{\beta}_{B,n}) = 2\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n g_n(\hat{\beta}_{B,n}) + O_p(\varepsilon_n^2). \tag{A.89}$$

By Assumption 3.8,

$$g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n}) = g(\hat{\beta}^*_{B,n}) - g(\hat{\beta}_{B,n}) + I_{1,n} = \Gamma_\beta(\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n}) + I_{1,n}, \tag{A.90}$$

and similarly,

$$g_n(\hat{\beta}_{B,n}) - g_n(\beta_o) = g(\hat{\beta}_{B,n}) - g(\beta_{B,o}) + I_{0,n} = \Gamma_\beta(\hat{\beta}_{B,n} - \beta_{B,o}) + I_{0,n}. \tag{A.91}$$

By Assumption 3.5, $\|\hat{\theta}^*_n - \hat{\theta}_n\| = O_p(\varepsilon_n)$, $\|\hat{\theta}_n - \theta_o\| = O_p(\delta_n)$, and $\varsigma_n\delta_n = o(\varepsilon_n)$, we have

$$\|I_{1,n}\| = O_p(n^{-\frac{1}{2}}\varsigma_n) \quad\text{and}\quad \|I_{0,n}\| = O_p(\varsigma_n\delta_n) = o_p(\varepsilon_n). \tag{A.92}$$

Applying (A.90), (A.91), and (A.92), we have

$$\left[g_n(\hat{\beta}^*_{B,n}) - g_n(\hat{\beta}_{B,n})\right]'W_n g_n(\hat{\beta}_{B,n})
= (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta' W_n\left[g_n(\beta_o) + \Gamma_\beta(\hat{\beta}_{B,n} - \beta_{B,o})\right] + (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta' W_n I_{0,n}
+ I_{1,n}'W_n\left[g_n(\beta_o) + \Gamma_\beta(\hat{\beta}_{B,n} - \beta_{B,o})\right] + I_{1,n}'W_n I_{0,n}
= (\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n})'\Gamma_\beta' W_n\left[g_n(\beta_o) + \Gamma_\beta(\hat{\beta}_{B,n} - \beta_{B,o})\right] + o_p(n^{-\frac{1}{2}}\varepsilon_n), \tag{A.93}$$

following (A.43), (A.55), Assumption 3.1(iii), $\varepsilon_n = o(n^{-1/2})$ and $O_p(n^{-1/2}\varsigma_n\delta_n) = o_p(n^{-1/2}\varepsilon_n)$. Applying (A.89) and (A.93) in (A.49) and using $\hat{\beta}^*_{B,n} - \hat{\beta}_{B,n} = \varepsilon_n u_n^* = \varepsilon_n(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\gamma_n^*$, we obtain

$$\gamma_n^{*\prime}(\Gamma_\beta' W_n \Gamma_\beta)^{-1}\Gamma_\beta' W_n\left[\sqrt{n}\,g_n(\beta_o) + \sqrt{n}\,\Gamma_\beta(\theta_o)(\hat{\beta}_{B,n} - \beta_{B,o})\right] \ge o_p(1) \tag{A.94}$$

as in (A.80). The rest of the proof is the same as that of part (a) and hence is omitted.

B Proofs on the Linear IV Model

Proof of Lemma 4.1. By definition,

$$g_S(Z_i, \theta) = (Y_i - X_i\theta)Z_i^* \quad\text{and}\quad g_D(Z_i, \theta) = (Y_i - X_i\theta)Z_i. \tag{B.1}$$

We write

$$\|m_n(\theta_o) - m(\theta_o)\|^2 = \begin{pmatrix}\frac{1}{n}\sum_{i=1}^n u_iZ_i^* \\ \frac{1}{n}\sum_{i=1}^n u_iZ_i - \beta_o\end{pmatrix}'\begin{pmatrix}\frac{1}{n}\sum_{i=1}^n u_iZ_i^* \\ \frac{1}{n}\sum_{i=1}^n u_iZ_i - \beta_o\end{pmatrix} = \left\|\frac{1}{n}\sum_{i=1}^n u_iZ_i^*\right\|^2 + \left\|\frac{1}{n}\sum_{i=1}^n u_iZ_i - \beta_o\right\|^2. \tag{B.2}$$

Using Conditions 4.1(i), 4.1(ii), and 4.1(iii), we get

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n u_iZ_i^*\right\|^2\right] = \frac{1}{n}\sum_{j=1}^{k_0}E\left[|u_iZ_i^*(j)|^2\right] \le \frac{k_0 C}{n} \tag{B.3}$$

and

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n u_iZ_i - \beta_o\right\|^2\right] \le \frac{1}{n}\sum_{j\in D}E\left[|u_iZ_i(j)|^2\right] \le \frac{k_n C}{n}. \tag{B.4}$$

Using the upper bounds in (B.3) and (B.4) and the expression in (B.2), we can invoke the Markov inequality to deduce that

$$\|m_n(\theta_o) - m(\theta_o)\|^2 = O_p(k_n/n). \tag{B.5}$$

This proves part (a).

For any $\beta$, we can write

$$g_n(\beta) - g_n(\beta_o) = \begin{pmatrix}-\frac{1}{n}\sum_{i=1}^n X_iZ_i^* & 0_{k_0\times(k_n-k_0)} \\ -\frac{1}{n}\sum_{i=1}^n X_iZ_i & -I_{k_n-k_0}\end{pmatrix}(\beta - \beta_o) \equiv \hat{\Gamma}_n(\beta - \beta_o). \tag{B.6}$$

Moreover, we get

$$\|g_n(\beta) - g_n(\beta_o)\|^2 = (\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o). \tag{B.7}$$

Define $\Gamma_n \equiv E[\hat{\Gamma}_n]$. Then

$$\hat{\Gamma}_n - \Gamma_n = \begin{pmatrix}-\frac{1}{n}\sum_{i=1}^n X_iZ_i^* + E[X_iZ_i^*] & 0_{k_0\times(k_n-k_0)} \\ -\frac{1}{n}\sum_{i=1}^n X_iZ_i + E[X_iZ_i] & 0_{(k_n-k_0)\times(k_n-k_0)}\end{pmatrix}. \tag{B.8}$$

Using the Hölder inequality, the Cauchy-Schwarz inequality and Conditions 4.1(i) and 4.1(iii), we get

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n X_iZ_i^* - E[X_iZ_i^*]\right\|^2\right] = \frac{1}{n}\sum_{j=1}^{k_0}E\left[|X_iZ_i^*(j) - E[X_iZ_i^*(j)]|^2\right] \le \frac{1}{n}\sum_{j=1}^{k_0}E\left[|X_iZ_i^*(j)|^2\right] \le \frac{\sqrt{E[X_i^4]}\sum_{j=1}^{k_0}\sqrt{E[|Z_i^*(j)|^4]}}{n} \le \frac{Ck_0}{n}. \tag{B.9}$$

Similarly, we have

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n X_iZ_i' - E[X_iZ_i']\right\|^2\right] = \frac{1}{n}\sum_{j\in D}E\left[|X_iZ_i(j) - E[X_iZ_i(j)]|^2\right] \le \frac{1}{n}\sum_{j\in D}E\left[|X_iZ_i(j)|^2\right] \le \frac{\sqrt{E[X_i^4]}\sum_{j\in D}\sqrt{E[|Z_i(j)|^4]}}{n} \le \frac{Ck_n}{n}. \tag{B.10}$$

Using the upper bounds in (B.9) and (B.10) and the expression in (B.8), we invoke the Markov inequality to deduce that

$$\|\hat{\Gamma}_n - \Gamma_n\|^2 = O_p(k_n/n). \tag{B.11}$$

Define

$$\pi_{1,n} \equiv E[X_iZ_i^*] \quad\text{and}\quad \pi_{2,n} \equiv E[X_iZ_i]. \tag{B.12}$$

Then,

$$\Gamma_n'\Gamma_n = \begin{pmatrix}\pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} & \pi_{2,n}' \\ \pi_{2,n} & I_{k_n-k_0}\end{pmatrix}
= \begin{pmatrix}1 & \pi_{2,n}' \\ 0_{(k_n-k_0)\times 1} & I_{k_n-k_0}\end{pmatrix}\begin{pmatrix}\pi_{1,n}'\pi_{1,n} & 0_{1\times(k_n-k_0)} \\ 0_{(k_n-k_0)\times 1} & I_{k_n-k_0}\end{pmatrix}\begin{pmatrix}1 & 0_{1\times(k_n-k_0)} \\ \pi_{2,n} & I_{k_n-k_0}\end{pmatrix}. \tag{B.13}$$

By Conditions 4.1(iv) and 4.1(v), we know that $\pi_{1,n}'\pi_{1,n} > 0$, and hence $\Gamma_n'\Gamma_n$ is positive definite. Let $\lambda_{n,\pi}$ be an eigenvalue of $\Gamma_n'\Gamma_n$ with $\lambda_{n,\pi} \ne 1$. Then

$$0 = \det\begin{pmatrix}\pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} - \lambda_{n,\pi} & \pi_{2,n}' \\ \pi_{2,n} & (1-\lambda_{n,\pi})I_{k_n-k_0}\end{pmatrix}
= \det\begin{pmatrix}\frac{\pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} - \lambda_{n,\pi}}{1-\lambda_{n,\pi}} & \frac{\pi_{2,n}'}{1-\lambda_{n,\pi}} \\ \frac{\pi_{2,n}}{1-\lambda_{n,\pi}} & I_{k_n-k_0}\end{pmatrix}(1-\lambda_{n,\pi})^{k_n-k_0+1}
= \left[\pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} - \lambda_{n,\pi} - \frac{\pi_{2,n}'\pi_{2,n}}{1-\lambda_{n,\pi}}\right](1-\lambda_{n,\pi})^{k_n-k_0}, \tag{B.14}$$

where the last equality holds by $\det(ABC) = \det(A)\det(B)\det(C)$ with

$$A = \begin{pmatrix}1 & \frac{\pi_{2,n}'}{1-\lambda_{n,\pi}} \\ 0_{(k_n-k_0)\times 1} & I_{k_n-k_0}\end{pmatrix} = C' \quad\text{and}\quad B = \begin{pmatrix}\frac{\pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} - \lambda_{n,\pi}}{1-\lambda_{n,\pi}} - \frac{\pi_{2,n}'\pi_{2,n}}{(1-\lambda_{n,\pi})^2} & 0_{1\times(k_n-k_0)} \\ 0_{(k_n-k_0)\times 1} & I_{k_n-k_0}\end{pmatrix}. \tag{B.15}$$

From (B.14), we know that $\lambda_{n,\pi}$ satisfies

$$\lambda_{n,\pi}^2 - (1 + \pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n})\lambda_{n,\pi} + \pi_{1,n}'\pi_{1,n} = 0. \tag{B.16}$$

This equation has the following two solutions:

$$\lambda_{n,\pi} = \frac{\kappa_n \pm \sqrt{\kappa_n^2 - 4\pi_{1,n}'\pi_{1,n}}}{2}, \tag{B.17}$$

where $\kappa_n \equiv 1 + \pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n}$. This implies that the eigenvalues of $\Gamma_n'\Gamma_n$ are bounded from below by $\min(\lambda_{n,\pi,-}, 1)$ and from above by $\max(\lambda_{n,\pi,+}, 1)$. Under Conditions 4.1(iv) and 4.1(v),

$$\pi_{1,n}'\pi_{1,n} = E[X_iZ_i^{*\prime}]E[X_iZ_i^*] = \sum_{j=1}^{k_0}\pi_j^2 \tag{B.18}$$

and

$$\pi_{2,n}'\pi_{2,n} = E[X_iZ_i']E[X_iZ_i] = \sum_{j=k_0+1}^{k_0+d_A}\pi_j^2 + E[X_iZ_{2,i}']E[X_iZ_{2,i}] \to \sum_{j=k_0+1}^{\infty}\pi_j^2 + \lim_{n\to\infty}\|E[X_iZ_{2,i}]\|^2. \tag{B.19}$$

Using the results in (B.18) and (B.19) and the inequality

$$\kappa_n + \sqrt{\kappa_n^2 - 4\pi_{1,n}'\pi_{1,n}} \le \kappa_n + \sqrt{\kappa_n^2} = 2\kappa_n, \tag{B.20}$$

we deduce that

$$\lambda_{n,\pi,-} = \frac{2\pi_{1,n}'\pi_{1,n}}{\kappa_n + \sqrt{\kappa_n^2 - 4\pi_{1,n}'\pi_{1,n}}} \ge \frac{\pi_{1,n}'\pi_{1,n}}{1 + \pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n}} \to \frac{\sum_{j=1}^{k_0}\pi_j^2}{1 + \sum_{j=1}^{\infty}\pi_j^2 + \lim_{n\to\infty}\|E[X_iZ_{2,i}]\|^2} \quad\text{as } n\to\infty. \tag{B.21}$$

By $E[X_i^2] \le C$ and Condition 4.1(iv), we see that $\sum_{j=1}^{\infty}\pi_j^2 \le C$, which together with $\sum_{j=1}^{k_0}\pi_j^2 > 0$ and $\lim_{n\to\infty}\|E[X_iZ_{2,i}]\|^2 < C$ implies that the smallest eigenvalue of $\Gamma_n'\Gamma_n$ is bounded from below uniformly in $n$. Next, using the results in (B.18), (B.19) and the inequality in (B.20), we have

$$\lambda_{n,\pi,+} \le \frac{\kappa_n + \sqrt{\kappa_n^2 - 4\pi_{1,n}'\pi_{1,n}}}{2} \le 1 + \pi_{1,n}'\pi_{1,n} + \pi_{2,n}'\pi_{2,n} \to 1 + \sum_{j=1}^{\infty}\pi_j^2 + \lim_{n\to\infty}\|E[X_iZ_{2,i}]\|^2, \tag{B.22}$$

which together with $\sum_{j=1}^{\infty}\pi_j^2 \le C$ and $\lim_{n\to\infty}\|E[X_iZ_{2,i}]\|^2 < C$ implies that the largest eigenvalue of $\Gamma_n'\Gamma_n$ is bounded from above uniformly in $n$. Hence, for some constant $C$,

$$C^{-1} \le \lambda_{\min}(\Gamma_n'\Gamma_n) \le \lambda_{\max}(\Gamma_n'\Gamma_n) \le C \quad\text{for all } n. \tag{B.23}$$

Using the inequalities in (B.23) and

$$\|\Gamma_n(\beta - \beta_o)\| = \left[(\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o)\right]^{\frac{1}{2}}, \tag{B.24}$$

we deduce that

$$\|\beta - \beta_o\| \lesssim \|\Gamma_n(\beta - \beta_o)\| \lesssim \|\beta - \beta_o\|. \tag{B.25}$$

Note that

$$(\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o) - (\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o) = (\beta - \beta_o)'(\hat{\Gamma}_n - \Gamma_n)'(\hat{\Gamma}_n - \Gamma_n)(\beta - \beta_o) + 2(\beta - \beta_o)'(\hat{\Gamma}_n - \Gamma_n)'\Gamma_n(\beta - \beta_o). \tag{B.26}$$

Using the triangle inequality, the Cauchy-Schwarz inequality, (B.11), (B.23), and (B.25), we obtain

$$(\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o)
\le (\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o) + \left|(\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o) - (\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o)\right|
\lesssim \|\beta - \beta_o\|^2 + \|\hat{\Gamma}_n - \Gamma_n\|^2\|\beta - \beta_o\|^2 + \|\beta - \beta_o\|\,\|\hat{\Gamma}_n - \Gamma_n\|\,\|\Gamma_n(\beta - \beta_o)\|
\lesssim \left[1 + O_p(k_n/n) + O_p(\sqrt{k_n/n})\right]\|\beta - \beta_o\|^2, \tag{B.27}$$

which proves the result claimed in (b). Similarly,

$$(\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o)
\ge (\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o) - \left|(\beta - \beta_o)'\hat{\Gamma}_n'\hat{\Gamma}_n(\beta - \beta_o) - (\beta - \beta_o)'\Gamma_n'\Gamma_n(\beta - \beta_o)\right|
\gtrsim \|\beta - \beta_o\|^2 - \|\hat{\Gamma}_n - \Gamma_n\|^2\|\beta - \beta_o\|^2 - \|\beta - \beta_o\|\,\|\hat{\Gamma}_n - \Gamma_n\|\,\|\Gamma_n(\beta - \beta_o)\|
\gtrsim \left[1 - O_p(k_n/n) - O_p(\sqrt{k_n/n})\right]\|\beta - \beta_o\|^2, \tag{B.28}$$

which proves the result claimed in (c).

Finally, we verify the claim in part (d). For any $\theta_1$, $\theta_2$ with $\|\theta_1 - \theta_2\| \le \delta$, we can use the Cauchy-Schwarz inequality to deduce that

$$\|\nu_n(\theta_1) - \nu_n(\theta_2)\| = \left\|\begin{pmatrix}\frac{1}{n}\sum_{i=1}^n X_iZ_i^* - E[X_iZ_i^*] \\ \frac{1}{n}\sum_{i=1}^n X_iZ_i - E[X_iZ_i]\end{pmatrix}(\theta_2 - \theta_1)\right\| \le \|\hat{\Gamma}_n - \Gamma_n\|\,\|\theta_1 - \theta_2\|, \tag{B.29}$$

which together with (B.11) implies that

$$\sup_{\theta_1,\theta_2\in\{\theta\in\Theta:\,\|\theta-\theta_o\|\le\delta_n\}}\frac{\|\nu_n(\theta_1) - \nu_n(\theta_2)\|}{n^{-\frac{1}{2}} + \|\theta_1 - \theta_2\|} \le \sup_{\theta_1,\theta_2\in\{\theta\in\Theta:\,\|\theta-\theta_o\|\le\delta_n\}}\frac{\|\hat{\Gamma}_n - \Gamma_n\|\,\|\theta_1 - \theta_2\|}{\|\theta_1 - \theta_2\|} = \|\hat{\Gamma}_n - \Gamma_n\| = O_p(\sqrt{k_n/n}). \tag{B.30}$$

This finishes the proof.

Proof of Lemma 4.2. We first show that Assumption 3.6(ii) is satisfied. Under Conditions 4.1(ii) and 4.2(ii), for any $\gamma\in\mathbb{R}^{d_{B_1}+d_{B_0}}$ we have

$$\gamma'\Omega_{2,n}\gamma = E\left[u_i^2\,\gamma'Z_{-1,i}Z_{-1,i}'\gamma\right] - (\gamma'\beta_{B,o})^2 \le C\,\gamma'E\left[Z_{-1,i}Z_{-1,i}'\right]\gamma \le C\gamma'\gamma,$$

which implies that $\lambda_{\max}(\Omega_{2,n}) \le C$. Similarly, using Conditions 4.1(ii) and 4.1(iv), we have $\lambda_{\max}(\Omega_{0,n}) \le C$. Using Aronszajn's inequality (see, e.g., Theorem III.2.9 in Bhatia, 1997, or Theorem 1.2 in the Supplemental Appendix of this paper) and the fact that $\Omega_n$ is positive definite, we know that

$$\lambda_{\max}(\Omega_n) \le \lambda_{\max}(\Omega_{0,n}) + \lambda_{\max}(\Omega_{2,n}) \le C. \tag{B.31}$$

By the inverse formula for a partitioned matrix,

$$\Omega_n^{-1} = \begin{pmatrix}(\Omega_n^{11})^{-1} & -\Omega_{0,n}^{-1}\Omega_{1,n}(\Omega_n^{22})^{-1} \\ -(\Omega_n^{22})^{-1}\Omega_{1,n}'\Omega_{0,n}^{-1} & (\Omega_n^{22})^{-1}\end{pmatrix}. \tag{B.32}$$

Using Aronszajn's inequality, Condition 4.2(iii), and the fact that $\Omega_n$ is positive definite, we have

$$\lambda_{\max}(\Omega_n^{-1}) \le \lambda_{\max}((\Omega_n^{11})^{-1}) + \lambda_{\max}((\Omega_n^{22})^{-1}) \le C, \tag{B.33}$$

which implies that $\lambda_{\min}(\Omega_n) \ge C^{-1}$. This together with (B.31) yields

$$C^{-1} \le \lambda_{\min}(\Omega_n) \le \lambda_{\max}(\Omega_n) \le C, \tag{B.34}$$

which shows Assumption 3.6(ii).

To verify Assumption 3.6(i), we only need to check the Lindeberg condition of the triangular array CLT. For this purpose, we define

$$\xi_{i,n} \equiv n^{-\frac{1}{2}}\psi_n'\Omega_n^{-\frac{1}{2}}g(Z_i,\beta_o), \tag{B.35}$$

where $\psi_n\in\mathbb{R}^{k_n}$ with $\|\psi_n\| = 1$, and

$$g(Z_i,\beta_o) \equiv \begin{pmatrix}u_iZ_i^* \\ u_iZ_i - E[u_iZ_i]\end{pmatrix}. \tag{B.36}$$

For any random variable $U$ with finite fourth moment, we can use the monotonicity of the expectation operator and the inequality $U^2 1\{U > 1\} \le U^4$ to deduce that

$$E\left[U^2 1\{U > 1\}\right] \le E\left[U^4\right]. \tag{B.37}$$

Using the inequality above and Condition 4.1(i),

$$\sum_{i=1}^n E\left[\xi_{i,n}^2 1\{\xi_{i,n} > \epsilon\}\right] = n\epsilon^2 E\left[(\xi_{i,n}/\epsilon)^2 1\{\xi_{i,n}/\epsilon > 1\}\right] \le \frac{1}{n\epsilon^2}E\left[\left|\psi_n'\Omega_n^{-\frac{1}{2}}g(Z_i,\beta_o)g(Z_i,\beta_o)'\Omega_n^{-\frac{1}{2}}\psi_n\right|^2\right]. \tag{B.38}$$

As $\psi_n\psi_n'$ is positive semi-definite with eigenvalues 0 and 1, we know that for any symmetric matrix $A$,

$$(\psi_n'A\psi_n)^2 = (\psi_n'A)\psi_n\psi_n'(\psi_n'A)' \le \psi_n'A^2\psi_n, \tag{B.39}$$

which can be applied to show that

$$E\left[\left|\psi_n'\Omega_n^{-\frac{1}{2}}g(Z_i,\beta_o)g(Z_i,\beta_o)'\Omega_n^{-\frac{1}{2}}\psi_n\right|^2\right] \le \psi_n'\Omega_n^{-\frac{1}{2}}E\left[\left(g(Z_i,\beta_o)g(Z_i,\beta_o)'\right)^2\right]\Omega_n^{-\frac{1}{2}}\psi_n\,\psi_n'\Omega_n^{-1}\psi_n. \tag{B.40}$$

Under Assumption 3.6(ii) (which has been verified above), we have

$$E\left[\left|\psi_n'\Omega_n^{-\frac{1}{2}}g(Z_i,\beta_o)g(Z_i,\beta_o)'\Omega_n^{-\frac{1}{2}}\psi_n\right|^2\right] \lesssim \left\|E\left[\left(g(Z_i,\beta_o)g(Z_i,\beta_o)'\right)^2\right]\right\| \le E\left[\|g(Z_i,\beta_o)\|^4\right], \tag{B.41}$$

where the first inequality is by (B.40), $\psi_n'\Omega_n^{-1}\psi_n \le C$, and the Cauchy-Schwarz inequality, and the second inequality is by Jensen's inequality and the Cauchy-Schwarz inequality. Next, note that

$$E\left[\|g(Z_i,\beta_o)\|^4\right] = E\left[\left(g(Z_i,\beta_o)'g(Z_i,\beta_o)\right)^2\right] = E\left[\left(u_i^2Z_i^{*\prime}Z_i^* + (u_iZ_i - \beta_o)'(u_iZ_i - \beta_o)\right)^2\right] \lesssim E\left[\left(u_i^2Z_i^{*\prime}Z_i^*\right)^2\right] + E\left[\left((u_iZ_i - \beta_o)'(u_iZ_i - \beta_o)\right)^2\right], \tag{B.42}$$

where

$$E\left[\left((u_iZ_i - \beta_o)'(u_iZ_i - \beta_o)\right)^2\right] = E\left[\left(u_i^2(Z_{1,i}'Z_{1,i} + Z_{3,i}'Z_{3,i}) + (u_iZ_{2,i} - \beta_{B_1,o})'(u_iZ_{2,i} - \beta_{B_1,o})\right)^2\right]
\lesssim E\left[\left(u_i^2Z_{1,i}'Z_{1,i}\right)^2\right] + E\left[\left(u_i^2Z_{3,i}'Z_{3,i}\right)^2\right] + E\left[\left((u_iZ_{2,i} - \beta_{B_1,o})'(u_iZ_{2,i} - \beta_{B_1,o})\right)^2\right]. \tag{B.43}$$

Combining the inequalities in (B.42) and (B.43) and applying the Cauchy-Schwarz inequality, we get

$$E\left[\|g(Z_i,\beta_o)\|^4\right] \lesssim k_n\sum_{j=1}^{k_0+d_A}E\left[u_i^4Z_{1,i}^4(j)\right] + k_n\sum_{j=1}^{d_{B_0}}E\left[u_i^4Z_{3,i}^4(j)\right] + k_n\sum_{j=1}^{d_{B_1}}E\left[|u_iZ_{2,i}(j) - \beta_{B_1,o}(j)|^4\right] \lesssim k_n(k_0 + d_A + d_{B_0} + d_{B_1}) = k_n^2, \tag{B.44}$$

where the second inequality follows from Condition 4.2(i). The result in (B.44), together with (B.38) and (B.41), implies that

$$\sum_{i=1}^n E\left[\xi_{i,n}^2 1\{\xi_{i,n} > \epsilon\}\right] \lesssim \frac{k_n^2}{n\epsilon^2} = o(1). \tag{B.45}$$

Hence the Lindeberg condition holds in the linear IV model, which verifies Assumption 3.6(i). Using the eigenvalue bounds in (B.23) and the arguments used to show (A.55), we deduce that Assumption 3.6(iii) is also satisfied.

Finally, Assumption 3.8 holds automatically in this linear model. This finishes the proof.

C Proofs of Results in Section 5

When the data are i.i.d., we use the preliminary estimator $\dot{\theta}_n$ in (2.13) to define

$$\dot{\Gamma}_{S,n} \equiv \frac{1}{n}\sum_{i=1}^n\frac{\partial g_S(Z_i,\dot{\theta}_n)}{\partial\theta'}, \quad \dot{\Omega}_{S,n} \equiv \frac{1}{n}\sum_{i=1}^n g_S(Z_i,\dot{\theta}_n)g_S(Z_i,\dot{\theta}_n)',$$
$$\dot{\Gamma}_{S+\ell,n} \equiv \frac{1}{n}\sum_{i=1}^n\frac{\partial g_{S+\ell}(Z_i,\dot{\theta}_n)}{\partial\theta'}, \quad \dot{\Omega}_{S+\ell,n} \equiv \frac{1}{n}\sum_{i=1}^n g_{S+\ell}(Z_i,\dot{\theta}_n)g_{S+\ell}(Z_i,\dot{\theta}_n)'. \tag{C.1}$$

The estimators of the variance matrices $V_S$ and $V_{S+\ell}$ are then constructed as

$$\dot{V}_{n,S} \equiv \dot{\Gamma}_{S,n}'\dot{\Omega}_{S,n}^{-1}\dot{\Gamma}_{S,n} \quad\text{and}\quad \dot{V}_{n,S+\ell} \equiv \dot{\Gamma}_{S+\ell,n}'\dot{\Omega}_{S+\ell,n}^{-1}\dot{\Gamma}_{S+\ell,n}. \tag{C.2}$$

The empirical information measure is defined using the above variance estimators and formula (2.12).

Proof of Lemma 5.1. (a) By the definition of $\|\omega_{n,B_1}\|$, the triangle inequality and the $C_r$-inequality,

$$n\lambda_n^2\|\omega_{n,B_1}\|^2 = n\lambda_n^2\sum_{\ell\in B_1}\dot{\mu}_{n,\ell}^{2r_1}|\dot{\beta}_{n,\ell}|^{-2r_2} \lesssim n\lambda_n^2\sum_{\ell\in B_1}\frac{\mu_{o,\ell}^{2r_1} + |\dot{\mu}_{n,\ell} - \mu_{o,\ell}|^{2r_1}}{\left[|\beta_{o,\ell}| - |\dot{\beta}_{n,\ell} - \beta_{o,\ell}|\right]^{2r_2}}, \tag{C.3}$$

which together with $\max_{\ell\in B_1}|\dot{\beta}_{n,\ell} - \beta_{o,\ell}| = O_p(\delta_n)$ and $|\beta_{o,\ell}| \ge C^{-1}$ for all $\ell\in B_1$ implies that

$$n\lambda_n^2\|\omega_{n,B_1}\|^2 \lesssim n\lambda_n^2\sum_{\ell\in B_1}|\beta_{o,\ell}|^{-2r_2}\left[\mu_{o,\ell}^{2r_1} + |\dot{\mu}_{n,\ell} - \mu_{o,\ell}|^{2r_1}\right] \tag{C.4}$$

w.p.a.1. Using $\max_{\ell\in B_1}|\dot{\mu}_{n,\ell} - \mu_{o,\ell}| = O_p(\delta_n)$ and the inequality above, we deduce that

$$n\lambda_n^2\|\omega_{n,B_1}\|^2 \lesssim n\lambda_n^2\sum_{\ell\in B_1}|\beta_{o,\ell}|^{-2r_2}\mu_{o,\ell}^{2r_1} + n\lambda_n^2\sum_{\ell\in B_1}|\beta_{o,\ell}|^{-2r_2}|\dot{\mu}_{n,\ell} - \mu_{o,\ell}|^{2r_1} \lesssim n\lambda_n^2 k_n + n\lambda_n^2 k_n O_p(\delta_n^{2r_1}) = o_p(1), \tag{C.5}$$

where the last equality follows from $r_1 > 0$ and $n^{\frac{1}{2}}k_n^{\frac{1}{2}}\lambda_n = o(1)$. By the definition of $\|\omega_{n,B_0}\|$,

$$n\lambda_n^2\|\omega_{n,B_0}\|^2 = n\lambda_n^2\sum_{\ell\in B_0}\dot{\mu}_{n,\ell}^{2r_1}|\dot{\beta}_{n,\ell}|^{-2r_2} = n^{1+r_2-r_1}\lambda_n^2\sum_{\ell\in B_0}\frac{\left|\sqrt{n}(\dot{\mu}_{n,\ell} - \mu_{o,\ell})\right|^{2r_1}}{\left|\sqrt{n}(\dot{\beta}_{n,\ell} - \beta_{o,\ell})\right|^{2r_2}}, \tag{C.6}$$

which together with $\max_{\ell\in B_0}|\dot{\mu}_{n,\ell} - \mu_{o,\ell}| = O_p(n^{-\frac{1}{2}})$ and $\sqrt{n}(\dot{\beta}_{n,\ell} - \beta_{o,\ell}) \to_d N(0,\sigma_\ell^2)$ implies that

$$n\lambda_n^2\|\omega_{n,B_0}\|^2 = n^{1+r_2-r_1}\lambda_n^2 O_p(1) = o_p(1), \tag{C.7}$$

where the last equality is by $n^{\frac{1+r_2-r_1}{2}}\lambda_n = o(1)$. The desired result holds because $\|\omega_{n,B}\|^2 = \|\omega_{n,B_0}\|^2 + \|\omega_{n,B_1}\|^2$.

(b) By the definition of $\omega_{n,\ell}$ and the triangle inequality,

$$\frac{k_n\max_{\ell\in A}\omega_{n,\ell}^{-2}}{n\lambda_n^2} = \frac{k_n}{n\lambda_n^2}\max_{\ell\in A}\frac{|\dot{\beta}_{n,\ell}|^{2r_2}}{\dot{\mu}_{n,\ell}^{2r_1}} \le \frac{k_n}{n\lambda_n^2}\max_{\ell\in A}\frac{|\dot{\beta}_{n,\ell} - \beta_{o,\ell}|^{2r_2}}{\left[|\mu_{o,\ell}| - |\dot{\mu}_{n,\ell} - \mu_{o,\ell}|\right]^{2r_1}} \lesssim \frac{k_n}{n\lambda_n^2}\max_{\ell\in A}|\mu_{o,\ell}|^{-2r_1}|\dot{\beta}_{n,\ell} - \beta_{o,\ell}|^{2r_2} \tag{C.8}$$

w.p.a.1, where the last inequality follows from $\max_{\ell\in A}|\dot{\mu}_{n,\ell} - \mu_{o,\ell}| = O_p(\delta_n)$ and $\max_{\ell\in A}|\mu_{o,\ell}^{-1}| \le C$. Using the inequality above, $\max_{\ell\in A}|\mu_{o,\ell}^{-1}| \le C$ and $\|\dot{\beta}_n - \beta_o\| = O_p(\delta_n)$, we have

$$\frac{k_n\max_{\ell\in A}\omega_{n,\ell}^{-2}}{n\lambda_n^2} \lesssim \frac{k_n}{n\lambda_n^2}\|\dot{\beta}_n - \beta_o\|^{2r_2} = \frac{k_n}{n\lambda_n^2}O_p(\delta_n^{2r_2}) = O_p(n^{-1}\lambda_n^{-2}k_n\delta_n^{2r_2}), \tag{C.9}$$

which together with $\delta_n = \sqrt{k_n/n}$ and $k_n^{1+r_2}n^{-1-r_2}\lambda_n^{-2} = o(1)$ implies that $k_n\max_{\ell\in A}\omega_{n,\ell}^{-2}/(n\lambda_n^2) = o_p(1)$.

As discussed in Remark 5.1, the tuning parameter specified in (5.5) allows $\min_{\ell\in A}|\mu_{o,\ell}|$ and $\min_{\ell\in B_1}|\beta_{o,\ell}|$ to go to zero as $k_n\to\infty$. To see this, suppose that Assumptions 5.1(i), (ii), (iii) hold and that the following restrictions

$$\max_{\ell\in A}\left|\mu_{o,\ell}^{-1}\right| = o\left(n^{\frac{r_2}{4r_1}}k_n^{-\frac{1}{2r_1}-\frac{r_2}{4r_1}}\right) \quad\text{and}\quad \max_{\ell\in B_1}\left|\beta_{o,\ell}^{-1}\right| = o\left(n^{\frac{1}{4}}k_n^{-\frac{1}{2r_2}-\frac{1}{4}}\right) \tag{C.10}$$

are satisfied. Using these restrictions and the arguments in the proof of Lemma 5.1, we deduce that

$$n\lambda_n^2\|\omega_{n,B_1}\|^2 \lesssim n^{-\frac{r_2}{2}}k_n^{1+\frac{r_2}{2}}\max_{\ell\in B_1}|\beta_{o,\ell}|^{-2r_2} = o(1) \tag{C.11}$$

when $\lambda_n$ is as specified in (5.5); moreover,

$$\frac{k_n\max_{\ell\in A}\omega_{n,\ell}^{-2}}{n\lambda_n^2} \lesssim \frac{k_n}{n\lambda_n^2}\max_{\ell\in A}\frac{|\dot{\beta}_{n,\ell} - \beta_{o,\ell}|^{2r_2}}{|\mu_{o,\ell}|^{2r_1}} = O_p\left(n^{-\frac{r_2}{2}}k_n^{1+\frac{r_2}{2}}\right)\max_{\ell\in A}|\mu_{o,\ell}|^{-2r_1} = o_p(1), \tag{C.12}$$

which implies that the tuning parameter in (5.5) also ensures that the lower bound is satisfied.

D Some Sufficient Conditions

Let a.e. abbreviate "almost everywhere." For any moment function $g_\ell(Z,\theta)$ and any $\theta\in\Theta$, we define $g_{\theta,\ell}(Z,\theta) = \partial g_\ell(Z,\theta)/\partial\theta'$. By definition, $g_{\theta,\ell}(Z,\theta)$ is a $1\times d_\theta$ vector whose $i$-th element is $\partial g_\ell(Z,\theta)/\partial\theta(i)$, where $\theta(i)$ denotes the $i$-th element of $\theta$ for any $i = 1,\ldots,d_\theta$. We use $g_{\theta\theta,\ell}(Z,\theta)$ to denote the $d_\theta\times d_\theta$ matrix whose $(i,j)$-th element is $\partial^2 g_\ell(Z,\theta)/\partial\theta(i)\partial\theta(j)$, where $\theta(i)$ and $\theta(j)$ ($i,j = 1,\ldots,d_\theta$) denote the $i$-th and $j$-th elements of $\theta$, respectively.

Lemma D.1 Suppose (i) the observations are i.i.d.; (ii) $g_\ell(Z,\theta)$ is differentiable in $\theta$ a.e. for $\ell = 1,\ldots,k_n$; (iii)

$$\max_{\ell\le k_n}E\left[\sup_{\theta\in\Theta}\|g_\ell(Z,\theta)\|^2 + \sup_{\theta\in\Theta}\|g_{\theta,\ell}(Z,\theta)\|^2\right] \le C; \tag{D.1}$$

and (iv) $\Theta$ is compact. Then Assumption 3.2(i) holds with $\delta_n = \sqrt{k_n/n}$.

Proof of Lemma D.1. Define $\mathcal{F} = \{g_\ell(Z,\theta) : \theta\in\Theta\}$. By Lemma 2.13 of Pakes and Pollard (1989), $\mathcal{F}$ is a Euclidean class with envelope $F = \sup_{\theta\in\Theta}|g_\ell(Z,\theta)| + C\sup_{\theta\in\Theta}\|g_{\theta,\ell}(Z,\theta)\|$ for some positive constant $C$ under assumptions (ii), (iii) and (iv) of Lemma D.1. For the definition of a Euclidean class, see (2.7) of Pakes and Pollard (1989). By the maximal inequality (Section 4.3 of Pollard, 1989), for any $\ell$ and any $n$,

$$E\left[\sup_{\theta\in\Theta}\left|n^{-1}\sum_{i=1}^n g_\ell(Z_i,\theta) - E[g_\ell(Z,\theta)]\right|^2\right] \le Cn^{-1}. \tag{D.2}$$

Hence,

$$E\left[\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g(Z_i,\theta) - E[g(Z,\theta)]\right\|^2\right] \le \frac{Ck_n}{n}, \tag{D.3}$$

which implies

$$\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g(Z_i,\theta) - E[g(Z,\theta)]\right\| = O_p(\sqrt{k_n/n}) \tag{D.4}$$

by the Markov inequality.

Lemma D.2 Suppose (i) the conditions of Lemma D.1 hold; (ii) $g_\ell(Z,\theta)$ is twice differentiable in $\theta$ a.e. for $\ell = 1,\ldots,k_n$; (iii)

$$\max_{\ell\le k_n}E\left[\sup_{\theta\in\Theta}\|g_{\theta\theta,\ell}(Z_i,\theta)\|^2\right] \le C. \tag{D.5}$$

Then, (a) Assumption 3.5 holds with $\varsigma_n = \sqrt{k_n/n}$; (b) $\varsigma_n\delta_n = o(n^{-\frac{1}{2}})$ holds if $k_n = o(n^{\frac{1}{2}})$.

Proof of Lemma D.2. Define

$$\nu_{\ell,n}(\theta) = n^{-1}\sum_{i=1}^n g_\ell(Z_i,\theta) - E[g_\ell(Z,\theta)], \tag{D.6}$$

which is continuously differentiable with respect to $\theta$ a.e. Let $\theta_1,\theta_2\in\Theta$ denote two different points in $\Theta$. By the mean value expansion,

$$\nu_{\ell,n}(\theta_1) - \nu_{\ell,n}(\theta_2) = \left[n^{-1}\sum_{i=1}^n\frac{\partial g_\ell(Z_i,\tilde{\theta}_{n,\ell})}{\partial\theta'} - \frac{\partial E[g_\ell(Z,\tilde{\theta}_{n,\ell})]}{\partial\theta'}\right](\theta_1 - \theta_2), \tag{D.7}$$

where $\tilde{\theta}_{n,\ell}$ is some value between $\theta_1$ and $\theta_2$. We have $\max_{\ell\le k_n}E[\sup_{\theta\in\Theta}|\partial g_\ell(Z,\theta)/\partial\theta(j)|] \le C$ for $j = 1,\ldots,d_\theta$ by assumption (iii) of Lemma D.1. Hence, by the dominated convergence theorem, we can exchange "E" and "$\partial$" to obtain

$$\frac{\partial E[g_\ell(Z,\theta)]}{\partial\theta'} = E\left[\frac{\partial g_\ell(Z,\theta)}{\partial\theta'}\right] \quad\text{for any } \theta\in\Theta, \tag{D.8}$$

which together with (D.7) implies that

$$\nu_{\ell,n}(\theta_1) - \nu_{\ell,n}(\theta_2) = \left[n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell}) - E[g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell})]\right](\theta_1 - \theta_2). \tag{D.9}$$

It follows that

$$\|\nu_n(\theta_1) - \nu_n(\theta_2)\|^2 = \sum_{\ell\le k_n}\|\nu_{\ell,n}(\theta_1) - \nu_{\ell,n}(\theta_2)\|^2 \le \left(\sum_{\ell\le k_n}\left\|n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell}) - E[g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell})]\right\|^2\right)\|\theta_1 - \theta_2\|^2 \tag{D.10}$$

by the Cauchy-Schwarz inequality.

Applying the proof of Lemma D.1 with $g_\ell(Z,\theta)$ replaced by $g_{\theta,\ell}(Z,\theta)$, under assumptions (i), (iii), (iv) of Lemma D.1 and assumption (iii) of Lemma D.2, we obtain

$$E\left\|n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell}) - E[g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell})]\right\|^2 \le E\left[\sup_{\theta\in\Theta}\left\|n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\theta) - E[g_{\theta,\ell}(Z,\theta)]\right\|^2\right] \le \frac{C}{n} \tag{D.11}$$

for all $n$ and $\ell$. It follows that

$$E\left[\sum_{\ell\le k_n}\left\|n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell}) - E[g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell})]\right\|^2\right] \le \frac{Ck_n}{n}, \tag{D.12}$$

which in turn implies

$$\sum_{\ell\le k_n}\left\|n^{-1}\sum_{i=1}^n g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell}) - E[g_{\theta,\ell}(Z_i,\tilde{\theta}_{n,\ell})]\right\|^2 = O_p(k_n/n) \tag{D.13}$$

by the Markov inequality.

Combining (D.10) and (D.13), we obtain

$$\sup_{\theta_1,\theta_2\in\{\theta\in\Theta:\,\|\theta-\theta_o\|\le\delta_n\}}\frac{\|\nu_n(\theta_1) - \nu_n(\theta_2)\|}{n^{-\frac{1}{2}} + \|\theta_1 - \theta_2\|} \le \sup_{\theta_1,\theta_2\in\Theta}\frac{\|\nu_n(\theta_1) - \nu_n(\theta_2)\|}{\|\theta_1 - \theta_2\|} = O_p(\sqrt{k_n/n}). \tag{D.14}$$

This verifies Assumption 3.5 with $\varsigma_n = \sqrt{k_n/n}$. Part (b) holds because $\delta_n\varsigma_n = k_n/n = o(n^{-\frac{1}{2}})$ when $k_n = o(n^{\frac{1}{2}})$.

REFERENCES

Andrews, D. W. K. (1994): "Empirical Process Methods in Econometrics," in Handbook of Econometrics, Vol. IV, ed. by R. F. Engle and D. McFadden. Amsterdam: North-Holland.

Andrews, D. W. K. (1999): "Consistent Moment Selection Procedures for Generalized Method of Moments Estimation," Econometrica, 67(3), 543-563.

Andrews, D. W. K. (2002): "Generalized Method of Moments Estimation When a Parameter is on a Boundary," Journal of Business and Economic Statistics, 20(4), 530-544.

Andrews, D. W. K. and P. Guggenberger (2009): "Hybrid and Size-Corrected Subsampling Methods," Econometrica, 77(3), 721-762.

Andrews, D. W. K. and P. Guggenberger (2010): "Asymptotic Size and a Problem with Subsampling and with the m out of n Bootstrap," Econometric Theory, 26, 426-468.

Andrews, D. W. K. and B. Lu (2001): "Consistent Model and Moment Selection Procedures for GMM Estimation with Application to Dynamic Panel Data Models," Journal of Econometrics, 101(1), 123-164.

Bai, J. and S. Ng (2009): "Selecting Instrumental Variables in a Data Rich Environment," Journal of Time Series Econometrics, 1(1).

Bhatia, R. (1997): Matrix Analysis. New York: Springer.

Belloni, A., D. Chen, V. Chernozhukov and C. Hansen (2012): "Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain," Econometrica, forthcoming.

Belloni, A., V. Chernozhukov and C. Hansen (2010): "LASSO Methods for Gaussian Instrumental Variables Models," Preprint, arXiv:1012.1297.

Belloni, A., V. Chernozhukov and C. Hansen (2011): "Inference on Treatment Effects After Selection amongst High-dimensional Controls," Preprint, arXiv:1201.0224.

Berkowitz, D., M. Caner and Y. Fang (2012): "The Validity of Instruments Revisited," Journal of Econometrics, 166(2), 255-266.

Breusch, T., H. Qian, P. Schmidt and D. Wyhowski (1999): "Redundancy of Moment Conditions," Journal of Econometrics, 91, 89-111.

Caner, M. (2009): "Lasso-type GMM Estimator," Econometric Theory, 25(1), 270-290.

Caner, M., X. Han and Y. Lee (2013): "Bias-Corrected Semiparametrically Efficient High Dimensional GMM Estimator with Many Invalid Moment Conditions: An Application to Dynamic Panel Data Models," mimeo.

Caner, M. and H. Zhang (2012): "Adaptive Elastic Net for Generalized Methods of Moments," unpublished manuscript.

Carrasco, M. (2012): "A Regularization Approach to the Many Instruments Problem," Journal of Econometrics, 170(2), 383-398.

Chamberlain, G. and G. W. Imbens (2004): "Random Effects Estimators with Many Instrumental Variables," Econometrica, 72, 295-306.

Chen, X., O. Linton and I. van Keilegom (2003): "Estimation of Semiparametric Models when the Criterion Function is not Smooth," Econometrica, 71(5), 1591-1608.

Conley, T. G., C. B. Hansen and P. E. Rossi (2012): "Plausibly Exogenous," Review of Economics and Statistics, 94(1), 260-272.

Doko Tchatoka, F. and J.-M. Dufour (2012): "Identification-robust Inference for Endogeneity Parameters in Linear Structural Models," MPRA Paper 40695, University Library of Munich, Germany.

Donald, S. G., G. W. Imbens and W. K. Newey (2009): "Choosing Instrumental Variables in Conditional Moment Restriction Models," Journal of Econometrics, 152(1), 28-36.

Donald, S. G. and W. K. Newey (2001): "Choosing the Number of Instruments," Econometrica, 69(5), 1161-1191.

DiTraglia, F. (2012): "Using Invalid Instruments on Purpose: Focused Moment Selection and Averaging for GMM," University of Pennsylvania, Working Paper.

Eichenbaum, M. S., L. P. Hansen and K. J. Singleton (1988): "A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice under Uncertainty," Quarterly Journal of Economics, 103(1), 51-78.

Fan, J. and Y. Liao (2011): "Ultra High Dimensional Variable Selection with Endogenous Covariates," Princeton University and University of Maryland, Working Paper.

Gautier, E. and A. B. Tsybakov (2011): "High-dimensional Instrumental Variables Regression and Confidence Sets," Preprint, arXiv:1105.2454v2.

Guggenberger, P. (2010): "The Impact of a Hausman Pretest on the Asymptotic Size of a Hypothesis Test," Econometric Theory, 26, 369-382.

Guggenberger, P. (2012): "On the Asymptotic Size Distortion of Tests when Instruments Locally Violate the Exogeneity Assumption," Econometric Theory, 28(2), 387-421.

Hall, A. R., A. Inoue, K. Jana and C. Shin (2007): "Information in Generalized Method of Moments Estimation and Entropy-based Moment Selection," Journal of Econometrics, 138(2), 488-512.

Hall, A. R., A. Inoue, J. M. Nason and B. Rossi (2010): "Information Criteria for Impulse Response Function Matching Estimation of DSGE Models," Journal of Econometrics, forthcoming.

Hall, A. R. and F. P. M. Peixe (2003): "A Consistent Method for the Selection of Relevant Instruments," Econometric Reviews, 22(3), 269-288.

Hansen, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50(4), 1029-1054.

Hong, H., B. Preston and M. Shum (2003): "Generalized Empirical Likelihood-based Model Selection Criteria for Moment Condition Models," Econometric Theory, 19(6), 923-943.

Im, K. S., S. C. Ahn, P. Schmidt and J. M. Wooldridge (1999): "Efficient Estimation of Panel Data Models with Strictly Exogenous Explanatory Variables," Journal of Econometrics, 93(1), 177-201.

Inoue, A. (2006): "A Bootstrap Approach to Moment Selection," Econometrics Journal, 9(1), 48-75.

Kuersteiner, G. M. (2002): "Mean Square Error Reduction for GMM Estimators of Linear Time Series Models," UC Davis, Working Paper.

Kuersteiner, G. M. and R. Okui (2010): "Constructing Optimal Instruments by First-stage Prediction Averaging," Econometrica, 78(2), 697-718.

Lee, Y. and Y. Zhou (2011): "Averaged Instrumental Variable Estimator," University of Michigan, Working Paper.

Leeb, H. and B. M. Pötscher (2005): "Model Selection and Inference: Facts and Fiction," Econometric Theory, 21(1), 21-59.

Leeb, H. and B. M. Pötscher (2008): "Sparse Estimators and the Oracle Property, or the Return of the Hodges Estimator," Journal of Econometrics, 142(1), 201-211.

Liao, Z. (2013): "Adaptive GMM Shrinkage Estimation with Consistent Moment Selection," Econometric Theory, 29, 1-48.

McCloskey, A. (2012): "Bonferroni-based Size-correction for Nonstandard Testing Problems," Brown University, Working Paper.

Newey, W. K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79, 147-168.

Nevo, A. and A. Rosen (2012): "Identification with Imperfect Instruments," Review of Economics and Statistics, 93(3), 659-671.

Okui, R. (2011): "Instrumental Variable Estimation in the Presence of Many Moment Conditions," Journal of Econometrics, 165(1), 70-86.

Pakes, A. and D. Pollard (1989): "Simulation and the Asymptotics of Optimization Estimators," Econometrica, 57, 1027-1057.

Pollard, D. (1984): Convergence of Stochastic Processes. New York: Springer-Verlag.

Pollard, D. (1989): "Asymptotics via Empirical Processes," Statistical Science, 4(4), 341-354.

Sargan, J. (1958): "The Estimation of Economic Relationships using Instrumental Variables," Econometrica, 26(3), 393-415.

Schmidt, M. (2010): "Graphical Model Structure Learning with L1-regularization," PhD Thesis, University of British Columbia.

Shen, X. (1997): "On Methods of Sieves and Penalization," Annals of Statistics, 25, 2555-2591.

van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes. New York: Springer.

Zou, H. (2006): "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, 101(476), 1418-1429.
