
Pairwise Difference Estimators for Nonlinear Models∗

Bo E. Honoré
Princeton University

James L. Powell
University of California, Berkeley

November, 1997

Revised September, 2001

Abstract

This paper uses insights from the literature on estimation of nonlinear panel data models

to construct estimators of a number of semiparametric models with a partially linear index,

including the partially linear logit model, the partially linear censored regression model, and

the censored regression model with selection. We develop the relevant asymptotic theory for

these estimators and we apply the theory to derive the asymptotic distribution of the estimator

for the partially linear logit model. We evaluate the finite sample behavior of this estimator

using a Monte Carlo study.

∗This research is supported by the National Science Foundation. We are grateful to Michael Jansson, Richard

Juster, Ekaterini Kyriazidou, and Thomas Rothenberg for helpful comments on earlier versions. Part of this research

was done while Honoré visited the Economics Department at the University of Copenhagen.


1 Introduction.

For the linear panel data regression model with fixed effects,

yit = αi + xitβ + εit, (1)

in which the individual-specific intercept ("fixed effect") αi can be arbitrarily related to the regressors xit, a standard estimation approach is based upon "pairwise differencing" of the dependent variable yit across time for a given individual to eliminate the fixed effect:

yit − yis = (xit − xis)β + (εit − εis), (2)

a form which eliminates the nuisance parameters {αi} and is amenable to the usual estimation methods for linear regression models under suitable conditions on the error terms {εit}. For nonlinear models – i.e., models which are not additively separable in the fixed effect αi – this pairwise differencing approach is generally not applicable, and identification and consistent estimation of the β coefficients is problematic at best. Still, for certain nonlinear panel data models, variations of the "pairwise comparison" or "matching" approach can be used to construct estimators which circumvent the incidental-parameters problem caused by the presence of fixed effects; such models include the binary logit model (Rasch 1960, Chamberlain 1984), the censored regression model (Honoré 1992), and the Poisson regression model (Hausman, Hall and Griliches 1984).
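The differencing in (2) can be made concrete with a small numerical sketch. The function name and simulated design below are illustrative assumptions, not part of the paper: with two periods, OLS on the within-pair differences recovers β even though the fixed effects are correlated with the regressors.

```python
import numpy as np

def pairwise_difference_ols(y, x):
    """Estimate beta in y_it = alpha_i + x_it*beta + e_it (T = 2) by OLS on
    the pairwise differences (2), which eliminate the fixed effects alpha_i.
    y : (n, 2) outcomes; x : (n, 2, k) regressors."""
    dy = y[:, 0] - y[:, 1]            # y_i1 - y_i2
    dx = x[:, 0, :] - x[:, 1, :]      # x_i1 - x_i2
    beta, *_ = np.linalg.lstsq(dx, dy, rcond=None)
    return beta

# Simulated check: fixed effects deliberately correlated with the regressors.
rng = np.random.default_rng(0)
n, k = 500, 2
x = rng.normal(size=(n, 2, k))
alpha = x[:, :, 0].mean(axis=1)       # alpha_i depends on x_i
beta0 = np.array([1.0, -0.5])
y = alpha[:, None] + x @ beta0 + rng.normal(size=(n, 2))
beta_est = pairwise_difference_ols(y, x)
print(beta_est)                       # close to beta0 = [1.0, -0.5]
```

Pooled OLS on the levels would be biased here because alpha_i is correlated with x; the differenced regression is not.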

Powell (1987, 2001) and Ahn and Powell (1993) exploited an analogy between the linear panel data model (1) and semiparametric linear selection models for cross-section data to derive consistent estimators for the latter. These estimators treat the additive "selection correction term" as analogous to the fixed effect in the linear panel data model, and eliminate selectivity bias by differencing observations with approximately-equal selection correction terms. The object of this paper is to extend this analogy between linear panel data models and linear selection models to those nonlinear panel data models, cited above, for which pairwise comparisons can be used to eliminate the fixed effects. This extension will yield consistent and asymptotically normal estimators for the linear regression coefficients in binary logit, censored regression, and Poisson regression models with additively-separable nonparametric components – i.e., nonlinear extensions of the "semilinear regression model" – and also for the censored regression model with sample selection.

In the next section, more details of the analogy between linear panel data models and semiparametric regression and selection models are provided, and the resulting pairwise difference estimators for the various nonlinear models are precisely defined. These estimators are all defined


as minimizers of "kernel-weighted U-statistics"; some general results for consistency and asymptotic normality of such estimators are provided in Section 3. One novel feature of the general asymptotic theory is a "generalized jackknife" method for direct bias reduction for the estimator, which is a computationally-convenient alternative to the usual requirement that the kernel weights be of "higher-order bias-reducing" form. The paper then specializes these general results to the pairwise-difference estimator for the partially linear logit model, and presents the results of a Monte Carlo study to evaluate the finite-sample performance of this estimator.

2 Motivation for the Proposed Estimators.

In order to motivate the estimation approach proposed here, it is useful to first consider the partially linear model¹

yi = xiβ + g(wi) + εi i = 1, . . . , n, (3)

where (yi, xi, wi) are observed (and randomly sampled over i), β is the parameter of interest and g(·) is an unknown function which is assumed to be "sufficiently smooth". The term g(wi) in (3) can represent "true nonlinearity" or may be the result of sample selection. For example, in the sample selection model (Type 2 Tobit model, in the terminology of Amemiya (1985)), the data is generated from

y∗i = xiβ + εi (4)

di = 1{wiγ + ηi > 0}, (5)

and the data consists of yi = diy∗i , di, xi and wi. If it is assumed that (εi, ηi) is independent of (xi, wi), then we can write

yi = xiβ + g(wiγ) + νi, E[νi|xi, wi, di = 1] = 0,

where g(wiγ) = E[εi|wiγ + ηi > 0] and νi = εi − g(wiγ).

Powell (1987, 2001) proposed estimation of (4) (and implicitly also of (3)) using the idea that if wiγ equals wjγ, then for observations i and j the terms g(wiγ) and g(wjγ) are analogous to the fixed effect αi in (1), which can be differenced away as in (2). Since γ is typically unknown, and wiγ typically continuously distributed, a feasible version of this idea uses all pairs of observations and gives bigger weight to pairs for which wiγ̂ is close to wjγ̂, where γ̂ is an estimator of γ. The weights are chosen in such a way that only pairs with wiγ − wjγ in a shrinking neighborhood of zero will matter asymptotically.

1 This model is also called the semiparametric regression model and the semilinear regression model, and has been considered by Engle, Granger, Rice, and Weiss (1986) and Robinson (1988), among others.


The insight in this paper is to observe that this pairwise difference idea can be applied to any model for which it is possible to "difference out" a fixed effect. Below we outline some examples of models in which the idea can be used.

2.1 Partially Linear Logit Model.

The logit model with fixed effects is given by

yit = 1{αi + xitβ + εit ≥ 0} t = 1, 2 i = 1, . . . , n,

where {εit : t = 1, 2; i = 1, . . . , n} are i.i.d. logistically distributed random variables. In this model, Rasch (1960) observed that β can be estimated by maximizing the conditional log-likelihood (see also Chamberlain (1984), page 1274)

Ln(b) ≡ Σ_{i: yi1 ≠ yi2} [−yi1 ln(1 + exp((xi2 − xi1)b)) − yi2 ln(1 + exp((xi1 − xi2)b))].

Now consider the partially linear logit model

yi = 1{xiβ + g(wi) + εi ≥ 0} i = 1, . . . , n (6)

For observations with wi close to wj, the terms g(wi) and g(wj) are almost like fixed effects, provided that g is smooth. This suggests estimating β by maximizing

Ln(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j: yi≠yj} −K((wi − wj)/hn) (yi ln(1 + exp((xj − xi)b)) + yj ln(1 + exp((xi − xj)b))), (7)

where K(·) is a kernel which gives the appropriate weight to the pair (i, j), and hn is a bandwidth which shrinks as n increases. Here L denotes the dimensionality of wi, and the factor (n choose 2)^{-1} (1/hn^L) in front of the sum will ensure that the objective function converges to a nondegenerate function under appropriate regularity conditions. The asymptotic theory will require that K(·) is chosen so that a number of regularity conditions, such as K(u) → 0 as |u| → ∞, are satisfied. The effect of the term K((wi − wj)/hn) is to give more weight to comparisons of observations (i, j) for which wi is close to wj.
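A minimal numerical sketch of maximizing (7) follows, for a scalar regressor and a scalar w. The Gaussian kernel, the rule-of-thumb bandwidth, the simulated design, and the function name are all illustrative assumptions; constant factors such as (n choose 2)^{-1} h^{-L} are dropped since they do not affect the maximizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def plogit_neg_objective(b, y, x, w, h):
    """Negative of the kernel-weighted conditional-logit objective (7):
    only discordant pairs (y_i != y_j) enter, weighted by a Gaussian
    kernel in (w_i - w_j)/h."""
    i, j = np.triu_indices(len(y), k=1)
    keep = y[i] != y[j]
    i, j = i[keep], j[keep]
    kern = np.exp(-0.5 * ((w[i] - w[j]) / h) ** 2)   # kernel weight
    dx = x[j] - x[i]
    return np.sum(kern * (y[i] * np.log1p(np.exp(dx * b))
                          + y[j] * np.log1p(np.exp(-dx * b))))

# Simulated partially linear logit: y = 1{x*b0 + sin(w) + e >= 0}, e logistic.
rng = np.random.default_rng(1)
n, b0 = 1000, 1.0
x, w = rng.normal(size=n), rng.uniform(-2, 2, size=n)
y = (b0 * x + np.sin(w) + rng.logistic(size=n) >= 0).astype(float)
h = 1.06 * w.std() * n ** (-0.2)      # rule-of-thumb bandwidth (assumption)
beta_hat = minimize_scalar(plogit_neg_objective, bounds=(-5, 5),
                           args=(y, x, w, h), method="bounded").x
print(beta_hat)                       # roughly b0 = 1.0
```

Within a discordant pair with wi ≈ wj, the g terms nearly cancel, so the pair-level likelihood depends (approximately) only on (xi − xj)b, which is what the kernel weighting exploits.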

2.2 Partially Linear Tobit Models.

The fixed effects censored regression model is given by

yit = max{0, αi + xitβ + εit}


Honoré (1992) showed that with two observations for each individual, and with the error terms being i.i.d. for a given individual²,

β = arg min_b E[q(yi1, yi2, (xi1 − xi2)b)] (8)

where

q(y1, y2, δ) =

  Ξ(y1) − (y2 + δ)ξ(y1)      if δ ≤ −y2;
  Ξ(y1 − y2 − δ)             if −y2 < δ < y1;
  Ξ(−y2) − (δ − y1)ξ(−y2)    if y1 ≤ δ;

and Ξ(d) is given by³ either Ξ(d) = |d| or Ξ(d) = d², with ξ denoting the derivative of Ξ. The estimators for the fixed effects censored regression model presented in Honoré (1992) are defined as minimizers of the sample analog of the minimand in (8),

Sn(b) ≡ (1/n) Σ_i q(yi1, yi2, (xi1 − xi2)b).

Applying the logic above, this suggests estimating β in the partially linear censored regression model

yi = max{0, xiβ + g(wi) + εi},

by minimization of

Sn(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j} K((wi − wj)/hn) q(yi, yj, (xi − xj)b).
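The pairwise loss q with the quadratic choice Ξ(d) = d² (so ξ(d) = 2d) is convex in δ and straightforward to code. The sketch below is an illustrative implementation under those assumptions; the simulated design, Gaussian kernel, bandwidth rule, and function names are not from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def q_loss(y1, y2, delta):
    """Honore's (1992) pairwise censored loss with Xi(d) = d^2, xi(d) = 2d.
    Branches follow the piecewise display after (8)."""
    left = y1 ** 2 - 2.0 * y1 * (y2 + delta)         # delta <= -y2
    mid = (y1 - y2 - delta) ** 2                     # -y2 < delta < y1
    right = y2 ** 2 + 2.0 * y2 * (delta - y1)        # y1 <= delta
    return np.where(delta <= -y2, left, np.where(delta < y1, mid, right))

def pairwise_censored_objective(b, y, x, w, h):
    """Kernel-weighted S_n(b) for the partially linear censored model
    (scalar x and w); constant factors dropped."""
    i, j = np.triu_indices(len(y), k=1)
    kern = np.exp(-0.5 * ((w[i] - w[j]) / h) ** 2)
    return np.sum(kern * q_loss(y[i], y[j], (x[i] - x[j]) * b))

# Simulated censored outcome: y = max{0, x*b0 + cos(w) + e}.
rng = np.random.default_rng(2)
n, b0 = 800, 1.0
x, w = rng.normal(size=n), rng.uniform(-2, 2, size=n)
y = np.maximum(0.0, b0 * x + np.cos(w) + rng.normal(size=n))
h = 1.06 * w.std() * n ** (-0.2)
beta_hat = minimize_scalar(pairwise_censored_objective, bounds=(-5, 5),
                           args=(y, x, w, h), method="bounded").x
print(beta_hat)     # roughly b0 = 1.0
```

One can check that the three branches of q agree at δ = −y2 and δ = y1, so the loss is continuous and (for convex Ξ) convex in δ, which is what makes the minimization well behaved.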

Honoré (1992) also proposed estimators of the truncated regression model with fixed effects. In the simplest version of the truncated regression model, (y, x) is observed only when y > 0, where y = xβ + ε.

The idea in Honoré (1992) is that if yit = αi + xitβ + εit and if εit satisfies certain regularity conditions, then

E[r(yi1, yi2, (xi1 − xi2)b) | yi1 > 0, yi2 > 0] (9)

is uniquely minimized at b = β, where

r(y1, y2, δ) =

  Ξ(y1)             if δ ≤ −y2;
  Ξ(y1 − y2 − δ)    if −y2 < δ < y1;
  Ξ(−y2)            if y1 ≤ δ;

and Ξ(·) is as above.

2 The assumption on the error terms made in Honoré (1992) allowed for very general serial correlation. However, for the discussion in this paper we will restrict ourselves to the i.i.d. assumption.

3 Other convex loss functions Ξ(·) could be used as well.


This suggests that the partially linear truncated regression model, yi = xiβ + g(wi) + εi with (yi, xi, wi) observed only when yi > 0, can be estimated by minimizing

Tn(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j} K((wi − wj)/hn) r(yi, yj, (xi − xj)b). (10)

2.3 Partially Linear Poisson Regression Models.

As a third example, consider the Poisson regression model with fixed effects:

yit ∼ po (exp(αi + xitβ)) t = 1, 2 i = 1, . . . , n.

The coefficients β for this model can be estimated by maximizing (see, for example, Hausman, Hall and Griliches (1984)):

Ln(b) = Σ_i [−yi1 ln(1 + exp((xi2 − xi1)b)) − yi2 ln(1 + exp((xi1 − xi2)b))].

Following the same logic as before, the coefficients for a partially linear Poisson regression model

yi ∼ po(exp(xiβ + g(wi))) i = 1, . . . , n,

can then be estimated by maximizing

Ln(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j} K((wi − wj)/hn) (−yi ln(1 + exp((xj − xi)b)) − yj ln(1 + exp((xi − xj)b))).
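The pair-level loss here has the same algebraic form as in the logit case (conditionally on yi + yj, the pair is binomial with a logit index), but the yi are counts and every pair enters the sum. A short illustrative sketch under assumed choices (Gaussian kernel, rule-of-thumb bandwidth, simulated design):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def poisson_pairwise_neg_objective(b, y, x, w, h):
    """Negative of the Section 2.3 kernel-weighted objective (scalar x, w);
    constant factors dropped since they do not affect the minimizer."""
    i, j = np.triu_indices(len(y), k=1)
    kern = np.exp(-0.5 * ((w[i] - w[j]) / h) ** 2)
    dx = x[j] - x[i]
    return np.sum(kern * (y[i] * np.log1p(np.exp(dx * b))
                          + y[j] * np.log1p(np.exp(-dx * b))))

# Simulated counts: y ~ Poisson(exp(x*b0 + 0.5*cos(w))).
rng = np.random.default_rng(4)
n, b0 = 800, 0.5
x, w = rng.normal(size=n), rng.uniform(-2, 2, size=n)
y = rng.poisson(np.exp(b0 * x + 0.5 * np.cos(w))).astype(float)
h = 1.06 * w.std() * n ** (-0.2)
beta_hat = minimize_scalar(poisson_pairwise_neg_objective, bounds=(-5, 5),
                           args=(y, x, w, h), method="bounded").x
print(beta_hat)     # roughly b0 = 0.5
```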

2.4 Partially Linear Duration Models.

Finally, Chamberlain (1985) has shown how to estimate a variety of duration models with fixed effects. Using the objective functions that he suggested, one can derive objective functions whose minimization will result in estimators of the linear part of partially linear versions of the same models.

2.5 Tobit Models with Selection.

As mentioned above, the nonlinear term can sometimes result from sample selection. Consider for example a modification of the model defined by (4) and (5), in which y∗i is censored. One example of this would be a model of earnings, where the variable of interest is often censored from above by topcoding at some observable constant ci, as well as subject to sample selection, because not everybody works. This model can be written as

y∗i = min{xiβ + εi, ci} (11)

di = 1{wiγ + ηi > 0} (12)


where (εi, ηi) is independent of (xi, wi) and the data consists of yi = diy∗i , di, xi and wi. As usual, we can translate this model, which has right-censoring at ci, to the Tobit framework with left-censoring at zero by taking ỹi ≡ dici − yi as the dependent variable and x̃i ≡ −xi as the regressor.

For the observations for which di = 1, the distribution of εi is conditional on ηi > −wiγ. For two observations i and j with wiγ = wjγ and di = dj = 1, εi and εj will be independent and identically distributed (conditionally on (xi, xj, wi, wj)). Therefore

β = arg min_b E[q(yi, yj, (xi − xj)b) | di = dj = 1, wiγ = wjγ].

This suggests estimating β by minimization of

Qn(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j: di=dj=1} K((wiγ̂ − wjγ̂)/hn) q(yi, yj, (xi − xj)b), (13)

where γ̂ is a root-n-consistent preliminary estimator of γ in (12)⁴, and L denotes the dimensionality of wiγ. If there is no censoring, and if quadratic loss (Ξ(d) = d²) is used, then the minimizer of (13) is the estimator suggested by Powell (1987, 2001).
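In the special case just noted (no censoring, quadratic loss), the minimizer of (13) has a closed form: a kernel-weighted regression of pairwise outcome differences on pairwise regressor differences among selected pairs. The two-step sketch below is illustrative only: the logit first stage stands in for any root-n-consistent estimator of γ in (12), and the design, kernel, and bandwidth are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated design: selection depends on w = (x, w2) through a logit, and the
# outcome error eps is correlated with the selection error eta, so OLS on the
# selected sample would be biased.
rng = np.random.default_rng(3)
n, beta0, gamma0 = 1500, 1.0, np.array([0.5, 1.0])
x = rng.normal(size=n)
w = np.column_stack([x, rng.normal(size=n)])   # includes an excluded regressor
eta = rng.logistic(size=n)
eps = 0.4 * eta + rng.normal(size=n)           # correlated errors
d = (w @ gamma0 + eta > 0).astype(float)
y = d * (beta0 * x + eps)                      # y* observed only when d = 1

# Step 1: first-stage estimate of gamma in (12); here a logit MLE.
def logit_nll(g):
    idx = w @ g
    return np.sum(np.log1p(np.exp(idx)) - d * idx)
gamma_hat = minimize(logit_nll, np.zeros(2)).x

# Step 2: weighted pairwise-difference estimator over selected pairs,
# the closed-form minimizer of (13) with no censoring and quadratic loss.
sel = np.flatnonzero(d == 1)
i, j = np.triu_indices(len(sel), k=1)
i, j = sel[i], sel[j]
h = (w @ gamma_hat).std() * n ** (-0.2)        # illustrative bandwidth choice
kern = np.exp(-0.5 * ((w[i] @ gamma_hat - w[j] @ gamma_hat) / h) ** 2)
dx, dy = x[i] - x[j], y[i] - y[j]
beta_hat = np.sum(kern * dx * dy) / np.sum(kern * dx * dx)
print(beta_hat)                                 # roughly beta0 = 1.0
```

Because the kernel downweights pairs whose estimated selection indices differ, the selection-correction terms g(wiγ) and g(wjγ) approximately cancel in the matched differences.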

Consistency of the estimator for the truncated regression model defined in (10) would require that the error εi has a log-concave density. Whether it is possible to define an estimator for the truncated regression model with selection by

Rn(b) = (n choose 2)^{-1} (1/hn^L) Σ_{i<j: di=dj=1} K((wiγ̂ − wjγ̂)/hn) r(yi, yj, (xi − xj)b), (14)

depends on whether one is willing to assume that the conditional density of εi given ηi > k is log-concave for all k. The estimators for the partially linear logit and partially linear Poisson regression models do not generalize in a straightforward manner to the case of selection, because the error terms after selection will have non-logistic and non-Poisson distributions.

3 Asymptotic Properties of Estimators Defined by Minimizing Kernel-Weighted U-Statistics.

The estimators defined in the previous section are all defined by minimizing objective functions of the form

Qn(γ, b) = (n choose 2)^{-1} Σ_{i<j} qn(zi, zj; γ, b), (15)

4 Numerous root-n-consistent semiparametric estimators for γ have been proposed, some of which are described in Powell (1994, section 3.1).


with

qn(zi, zj; γ, b) = (1/hn^L) K((wi − wj)γ/hn) s(vi, vj; b), (16)

zi = (yi, xi, wi), and vi = (yi, xi). Note that, for the estimators of the partially linear models, γ = I (the identity matrix). Let θ = (γ′, β′)′ and L = dim(wγ). In the next two subsections we give conditions under which such estimators are consistent, and asymptotically normal around some pseudo-true value. The third subsection will then show how to "jackknife" the estimator to recenter its asymptotically normal distribution around the true parameter value.

Throughout, we define ∆wij ≡ wi − wj.

3.1 Consistency.

We will present two sets of assumptions under which the estimators defined by minimization of (15) are consistent. One set of assumptions will require a compact parameter space, whereas the other will assume a convex objective function. In both cases we will use the theorems found in Newey and McFadden (1994) to prove consistency.

Let m be a function of zi, zj and b. It is useful to define two functions km and ℓm by

km(a1, a2, b) = E[m(zi, zj; b) | zi = a1, w′jγ0 = a2]

and

ℓm(a1, a2, b) = E[m(zi, zj; b) | zi = a1, w′jγ0 = a2] f_{w′iγ0}(a2),

where f_{w′iγ0} is the density function of w′iγ0 (assumed continuously distributed below). When m depends only on vi, vj and b, we will write

km(a1, a2, b) = E[m(vi, vj; b) | vi = a1, w′jγ0 = a2]

and

ℓm(a1, a2, b) = E[m(vi, vj; b) | vi = a1, w′jγ0 = a2] f_{w′iγ0}(a2).

Assumption 1. All of the following assumptions are made on the distribution of the data:

1. E[s(vi, vj; b)²] < ∞;

2. E[‖∆wij‖²] < ∞;

3. w′iγ0 is continuously distributed with bounded density f_{w′iγ0}(·), and ks(·) defined above exists and is a continuous function of each of its arguments;

4. for all b, |ℓs(a1, a2, b)| ≤ c1(a1, b) with E[c1(vi, b)] < ∞; and

5. {zi, i = 1, . . . , n} is an i.i.d. sample.


Assumption 2. One of the following assumptions is made on the nonrandom bandwidth sequence hn:

1. hn > 0, hn = o(1) and hn^{-1} = O(n^{1/(2L)}); or

2. hn > 0, hn = o(1) and hn^{-1} = o(n^{1/(2(L+1))}).

Assumption 3. K is bounded, differentiable with bounded derivative K′, and of bounded variation. Furthermore, ∫ K(u) du = 1, ∫ |K(u)| du < ∞ and ∫ |K(η)| ‖η‖ dη < ∞.

The assumptions made on the kernel are satisfied for many commonly used kernels. In some of the applications (to selection models) the argument γ of Qn(·) will be estimated.

Assumption 4. One of the following assumptions is made on γ̂:

1. γ̂ = γ0; or

2. γ̂ = γ0 + Op(1/√n).

The following Lipschitz continuity assumption will be imposed when the objective function Qn is not convex in b:

Assumption 5. |s(vi, vj; b1) − s(vi, vj; b2)| ≤ Bij |b1 − b2|^α for some α > 0, where E[Bij²] < ∞.

3.1.1 Limiting Objective Function.

Consistency of extremum estimators is usually proved by studying the limiting objective function and the exact manner in which the objective function approaches its limit. For this problem the limiting objective function will be

Q(γ0, b) = E[ℓs(vi, w′iγ0, b)],

which is well-defined under Assumption 1.

3.1.2 Pointwise Convergence to Limiting Objective Function.

In this section we will state conditions under which the objective function converges pointwise to its limit. It is useful to distinguish between the cases with and without a preliminary estimator (i.e., between Assumptions 4(1) and 4(2)).


In the case of no preliminary estimator (i.e., γ = γ0 is known), we have

E[Qn(γ0, b)] = E[(1/hn^L) K((w′iγ0 − w′jγ0)/hn) s(vi, vj; b)] (17)

= E[E[(1/hn^L) K((w′iγ0 − w′jγ0)/hn) ks(vi, w′jγ0, b) | wi, vi]]

= ∫∫ K(η) ℓs(vi, w′iγ0 − hnη, b) dη dF(wi, vi)

→ Q(γ0, b)

by dominated convergence. Note that the first expectation exists because of Assumptions 1(1) and 3.

Imposing Assumptions 1, 2(1) and 3,

E[{(1/hn^L) K(∆wijγ0/hn) s(vi, vj; b)}²] = O(n)

and therefore (Ahn and Powell (1993), Lemma A.3)

Qn(γ0, b) − E[Qn(γ0, b)] = op(1).

Combining these results,

Qn(γ0, b) → Q(γ0, b)

when γ0 is known.

In the case where γ̂ = γ0 + Op(1/√n) is estimated, we have

Qn(γ̂, b) = (Qn(γ̂, b) − Qn(γ0, b)) + Qn(γ0, b).

Pointwise convergence of Qn(γ̂, b) to Q(γ0, b) then follows from

|Qn(γ̂, b) − Qn(γ0, b)|

= |(n choose 2)^{-1} Σ_{i<j} (1/hn^L) (K(∆wijγ̂/hn) − K(∆wijγ0/hn)) s(vi, vj; b)|

= |(n choose 2)^{-1} Σ_{i<j} (1/hn^L) K′(c∗ij) (∆wij(γ̂ − γ0)/hn) s(vi, vj; b)|

≤ (n choose 2)^{-1} Σ_{i<j} (1/hn^L) |K′(c∗ij)| (‖∆wij‖ ‖γ̂ − γ0‖/hn) |s(vi, vj; b)|

≤ ‖γ̂ − γ0‖ (1/hn^{L+1}) C (n choose 2)^{-1} Σ_{i<j} ‖∆wij‖ |s(vi, vj; b)|

= Op(1/(n^{1/2} hn^{L+1})) = op(1),

where the last equality follows from imposing the stronger Assumption 2(2).


3.1.3 Uniform Convergence to Limiting Objective Function.

With an objective function that is convex in b, the pointwise convergence suffices to establish consistency of β̂, regardless of whether γ is estimated. Without a convex objective function, uniform convergence of the minimand is the key ingredient in the proof of consistency of extremum estimators. Invoking Assumption 1(1) and imposing compactness of the parameter space for β, uniform convergence will follow from Lemma 2.9 of Newey and McFadden (1994).

When there is no preliminary estimation of γ and the objective function is not convex, imposition of Assumption 5 yields

|Qn(γ̂, b1) − Qn(γ̂, b2)| = |Qn(γ0, b1) − Qn(γ0, b2)|

= |(n choose 2)^{-1} Σ_{i<j} (1/hn^L) K(∆wijγ0/hn) (s(vi, vj; b1) − s(vi, vj; b2))|

≤ (n choose 2)^{-1} Σ_{i<j} (1/hn^L) |K(∆wijγ0/hn)| Bij |b1 − b2|^α

= Op(1) |b1 − b2|^α.

When γ̂ is estimated,

|Qn(γ̂, b1) − Qn(γ̂, b2)| ≤ |Qn(γ0, b1) − Qn(γ0, b2)| + |(Qn(γ̂, b1) − Qn(γ̂, b2)) − (Qn(γ0, b1) − Qn(γ0, b2))|

and

|(Qn(γ̂, b1) − Qn(γ̂, b2)) − (Qn(γ0, b1) − Qn(γ0, b2))|

= |(n choose 2)^{-1} Σ_{i<j} (1/hn^L) (K(∆wijγ̂/hn) − K(∆wijγ0/hn)) (s(vi, vj; b1) − s(vi, vj; b2))|

≤ (n choose 2)^{-1} Σ_{i<j} (1/hn^L) |K′(c∗ij)| (‖∆wij‖ ‖γ̂ − γ0‖/hn) |s(vi, vj; b1) − s(vi, vj; b2)|

≤ (1/hn^{L+1}) ‖γ̂ − γ0‖ (n choose 2)^{-1} Σ_{i<j} |K′(c∗ij)| ‖∆wij‖ Bij |b1 − b2|^α

= Op(1/(hn^{L+1} n^{1/2})) |b1 − b2|^α.

3.1.4 Identification.

The limiting objective function is uniquely minimized at β0 provided that the following condition holds:

Assumption 6. E[s(vi, vj; b) | (wi − wj)γ0 = 0] is uniquely minimized at b = β0.


3.1.5 Consistency Theorem.

Combining these results, and referring to Newey and McFadden (1994) Theorem 2.7 and to Theorem 2.1 and Lemma 2.9, respectively, we have

Theorem 1 If K((wi − wj)γ̂/hn) s(vi, vj; b) is a continuous and convex function of b and the parameter space for β is a convex set with the true value, β0, in its interior, then the minimizer of (15) over the parameter space is consistent under random sampling and Assumptions 1, 3, 6 and either 2(1) and 4(1) or 2(2) and 4(2).

Theorem 2 If K((wi − wj)γ̂/hn) s(vi, vj; b) is a continuous function of b and the parameter space for β is compact and includes the true value, β0, then the minimizer of (15) over the parameter space is consistent under random sampling and Assumptions 1, 3, 5, 6 and either 2(1) and 4(1) or 2(2) and 4(2).

3.2 Asymptotic Normality.

In this section we derive the asymptotic distribution of the estimator defined by minimizing (15). Specifically, we will derive the limiting distribution of √n(β̂ − βh), where βh is the minimizer of E[(1/h^L) K((wi − wj)′γ0/h) s(vi, vj; β)]. Note that the argument that led to consistency of β̂ implies that βh → β0. Also note that βh is non-stochastic. In Section 3.3 we will discuss conditions under which

βh = β0 + Σ_{l=1}^{p} bl h^l + o(h^p).

Here the estimator will have an asymptotic bias term, which we will eliminate via a jackknife approach. The advantage of the approach taken in this case is that it is not necessary to employ a bias-reducing kernel. This means that if s in (16) is convex, then so is the objective function Qn in (15). At first sight, it seems that the disadvantage is that it is necessary to calculate the estimator a number of times. However, as we will see in Section 3.3, it is often possible to do the optimization only once. Estimators that are asymptotically equivalent to the remaining estimators can then be defined as the result of performing a finite number of Newton-Raphson steps from the original estimator.
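The jackknife combination can be illustrated abstractly. Assuming an expansion βh = β0 + b1 h + · · · + bp h^p + o(h^p) as above, evaluating the estimator at bandwidths h, 2h, . . . , (p+1)h and taking the linear combination whose weights sum to one while annihilating the h, . . . , h^p terms removes the leading bias. The function below is a sketch of that arithmetic (names and the toy bias expansion are illustrative, not from the paper):

```python
import numpy as np

def jackknife_bias_reduction(estimate, h, p=2):
    """Generalized-jackknife combination: `estimate` maps a bandwidth to the
    estimator's value; combine estimates at bandwidths h, 2h, ..., (p+1)h
    with weights a solving sum(a) = 1 and sum(a * c**l) = 0 for l = 1..p."""
    c = np.arange(1, p + 2, dtype=float)                 # bandwidth multipliers
    A = np.vstack([np.ones(p + 1)] + [c ** l for l in range(1, p + 1)])
    rhs = np.zeros(p + 1)
    rhs[0] = 1.0
    a = np.linalg.solve(A, rhs)                          # combination weights
    return sum(ai * estimate(ci * h) for ai, ci in zip(a, c))

# Toy check with a known bias expansion: beta_h = 2 + 3h - h^2 + 0.1 h^3.
beta_of_h = lambda h: 2.0 + 3.0 * h - h ** 2 + 0.1 * h ** 3
naive = beta_of_h(0.3)
jack = jackknife_bias_reduction(beta_of_h, 0.3, p=2)
print(naive, jack)   # naive is off by O(h); jack is off only by O(h^3)
```

With p = 2 the weights work out to 3, −3, 1 on the estimates at h, 2h, 3h; the constant term survives while the h and h² terms cancel exactly.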

The following assumption is standard.

Assumption 7. The true parameter, β0, is an interior point of the parameter space.

In all the applications considered here, the objective function will be left- and right-differentiable. We therefore define

Gn(γ, β) = (n choose 2)^{-1} Σ_{i<j} pn(zi, zj; γ, β)


where

pn(zi, zj; γ, β) = (1/hn^L) K((wi − wj)′γ/hn) t(vi, vj; β)

and t(vi, vj; β) is a convex combination of the left- and right-derivatives of s(vi, vj; β) with respect to β.

It is useful to define the "projection functions"

p1n(zi; γ, β) = E[pn(zi, zj; γ, β) | zi] − E[pn(zi, zj; γ, β)],

p0n(γ, β) = E[pn(zi, zj; γ, β)],

where Assumption 8 below will guarantee that these expectations exist. We can then write

(n choose 2)^{-1} Σ_{i<j} pn(zi, zj; γ, β) = p0n(γ, β) + (2/n) Σ_{i=1}^n p1n(zi; γ, β) + (n choose 2)^{-1} Σ_{i<j} p2n(zi, zj; γ, β), (18)

where p1n and p2n are P-degenerate (with P denoting the distribution of zi). The function p2n is defined implicitly by (18). We will assume

Assumption 8. The derivative function {t(·, ·; β) : β ∈ B} is Euclidean for an envelope F, i.e.

sup_{n,β} |t(zi, zj; β)| ≤ F(zi, zj),

satisfying E[F²] < ∞. The set B need not be the whole parameter space, but could be some other set with β0 in its interior.

Assumption 8 is satisfied for all of the examples considered in Section 2. Assumptions 3 and 8 imply that hn^L pn is Euclidean (for some envelope CF with E[(CF)²] < ∞). This, in turn, implies that hn^L p2n is Euclidean for an envelope with finite second moments (see Sherman (1994b), Lemma 6). This will be important for the derivation below, because it allows us to ignore the "error" when we approximate the U-statistic, (n choose 2)^{-1} Σ_{i<j} pn(zi, zj; γ, β), by its projection, p0n(γ, β) + (2/n) Σ_{i=1}^n p1n(zi; γ, β).

We also define

pn(zi; γ, β) = p0n(γ, β) + 2p1n(zi; γ, β).

Note that this implies that pn(zi; γ0, βh) = 2p1n(zi; γ0, βh). It is convenient to impose:

Assumption 9.


1. pn(zi; γ, β) is continuously differentiable in (γ, β) with a derivative ∆pn(zi; γ, β) with the property that for any sequence (γ∗, β∗) that converges in probability to (γ0, β0), ∆pn(zi; γ∗, β∗) converges to a matrix ∆p0(γ0, β0), the lower part ∆p0^β(γ0, β0) of which (i.e., the part that corresponds to differentiation with respect to β) is non-singular; and

2. for some function p1(zi; γ0, β0) with E[‖p1(zi; γ0, β0)‖²] < ∞,

(1/√n) Σ_{i=1}^n pn(zi; γ0, βh) − (1/√n) Σ_{i=1}^n pn(zi; γ0, β0) = op(1).

In the next subsection, we give results that are useful for verifying Assumption 9. We can write

Gn(γ, β) = (n choose 2)^{-1} Σ_{i<j} pn(zi, zj; γ, β)

= p0n(γ, β) + (2/n) Σ_i p1n(zi; γ, β) + (n choose 2)^{-1} Σ_{i<j} p2n(zi, zj; γ, β)

= (1/n) Σ_i pn(zi; γ, β) + (n choose 2)^{-1} Σ_{i<j} p2n(zi, zj; γ, β)

= ((1/n) Σ_i ∆pn(zi; θ∗)) ((γ − γ0)′, (β − βh)′)′ + (1/n) Σ_i pn(zi; γ0, βh) + (n choose 2)^{-1} Σ_{i<j} p2n(zi, zj; γ, β),

where, as usual, ∆pn(zi; θ∗) should be interpreted as the derivative of pn(zi; ·) evaluated at a point (which may be different for different rows of ∆pn) between (γ, β) and (γ0, βh).

Since {p2n(zi, zj; γ, β)} is Euclidean, Sherman (1994), Theorem 3, can be applied to the function h^L p2n(zi, zj; g, b) with Θn = {θ : ‖θ − θ0‖ ≤ c} for some constant c (with γn = 1 and k = 2 in Sherman's notation) to get

sup_{Θn} (n choose 2)^{-1} Σ_{i<j} h^L p2n(zi, zj; g, b) = Op(1/n),

or

sup_{Θn} (n choose 2)^{-1} Σ_{i<j} p2n(zi, zj; g, b) = Op(1/(h^L n)),

where the assumption on the envelope guarantees that E[sup_{Θn} p2n(zi, zj; γ, β)²] < ∞.⁵

5This is condition (ii) for Sherman’s Theorem 3.


This yields

√n(β̂ − βh) = (−(1/n) Σ_i ∆pn^β(zi; θ∗))^{-1} [((1/n) Σ_i ∆pn^γ(zi; θ∗)) √n(γ̂ − γ0) + (2/√n) Σ_i p1n(zi; γ0, βh) + Op(1/(hn^L √n)) − √n Gn(γ̂, β̂)].

We therefore have

Theorem 3 If β̂ is a consistent estimator of β0, Gn(γ̂, β̂) = op(n^{-1/2}), 1/hn = op(n^{1/(2L)}), √n(γ̂ − γ0) = (1/√n) Σ_{i=1}^n ωi + op(1), and Assumptions 7, 8 and 9 are satisfied, then

√n(β̂ − βh) = (1/√n) Σ_{i=1}^n ψi + op(1),

where

ψi = −∆p0^β(γ0, β0)^{-1} ∆p0^γ(γ0, β0) ωi − 2∆p0^β(γ0, β0)^{-1} p1(zi; γ0, β0).

Furthermore, assuming ωi and p1(zi; γ0, β0) are jointly i.i.d. with E[ωi] = 0 and E[‖ωi‖²] < ∞,

√n(β̂ − βh) →d N(0, E[ψiψ′i]).

3.2.1 Verifying some of the conditions.

Theorem 3 makes some high-level assumptions. In this section we will present some results which will be useful in verifying these assumptions.

The following lemma, which follows immediately from Lemma 1 in Honoré and Powell (1994), is useful for verifying that Gn(γ̂, β̂) = op(1/√n).

Lemma 4 If the true parameter value, β0, is an interior point in the parameter space, and

1. s(vi, vj; β) is left and right differentiable in each component of β in some open neighborhood of the true parameter β0;

2. in an open neighborhood B0 of β0,

sup_{β∈B0} Σ_{i<j} 1{∂−s(vi, vj; β)/∂βℓ ≠ ∂+s(vi, vj; β)/∂βℓ} = Op(1);

3. in an open neighborhood of β0,

|∂−s(vi, vj; β)/∂βℓ − ∂+s(vi, vj; β)/∂βℓ| ≤ h(vi, vj)

for some function h with E[h(vi, vj)^{1+δ}] < ∞ for some δ; and

4. K is bounded,

then

Gn(γ̂, β̂) = op(n^{-2+2/(1+δ)} hn^{-L}).

We next turn to some assumptions under which the conditions of Assumption 9 are satisfied. Recall that by the definitions of km and ℓm above,

kt(zi, a, b) = E[t(vi, vj; b) | zi, w′jγ0 = a],

ℓt(zi, a, b) = E[t(vi, vj; b) | zi, w′jγ0 = a] f_{w′jγ0}(a).

In addition, define

t1(zi, zj; β) = (wi − wj) t(vi, vj; β);

then kt1 and ℓt1, evaluated at β = β0, become

kt1(zi, a2, β0) = E[(wi − wj) t(vi, vj; β0) | zi, w′jγ0 = a2],

ℓt1(zi, a2, β0) = E[(wi − wj) t(vi, vj; β0) | zi, w′jγ0 = a2] f_{w′iγ0}(a2).

The following restrictions will be imposed on these functions:

Assumption 10.

1. ℓt is differentiable with respect to its third argument, and there is a function g with E[g(zi)] < ∞, such that |ℓt^{(3)}(vi, w′iγ0 − hη, β0)| ≤ g(zi).

2. ℓt1 is differentiable with respect to its second argument, and there is a function g with E[g(zi)²] < ∞, such that |ℓt1^{(2)}(zi, w′iγ0 − hη, β0)| ≤ g(zi). Furthermore, K(η) ℓt1^{(2)}(zi, w′iγ0 − hη, β0) → 0 as η → ±∞. Finally, E[(wi − wj) t(vi, vj; β)] < ∞.

3. ℓt is differentiable with respect to its second argument, and there is a function g with E[g(zi)] < ∞, such that |ℓt^{(2)}(vi, w′iγ0 − hη, β0)| ≤ g(zi).

A number of results can be used to verify the convergence in Assumption 9.1. For example, Theorem 4.1.4 of Amemiya (1985) gives conditions that can be used to verify that $p_n'(z_i,\gamma^*,\beta^*)$ converges to $\lim p_n'(z_i,\gamma_0,\beta_0)$. The following two lemmata give the expressions for the $p_1'$ that appear in Assumption 9.1 and in Theorem 3.

Lemma 5 Let
\[
p_0^{\beta}(\gamma_0,\beta_0) = E\left[\ell_t^{(3)}\left(v_i,\, w_i'\gamma_0,\, \beta_0\right)\right].
\]
Then under Assumptions 3 and 10(1),
\[
p_{0n}^{\beta}(\gamma_0,\beta_0) \to p_0^{\beta}(\gamma_0,\beta_0).
\]

Lemma 6 Let
\[
p_0^{\gamma}(\gamma_0,\beta_0) = -E\left[\ell_{t_1}^{(2)}\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right].
\]
Then under Assumptions 3 and 10(2),
\[
p_{0n}^{\gamma}(\gamma_0,\beta_0) \to p_0^{\gamma}(\gamma_0,\beta_0).
\]

Combined, the next two lemmata will give conditions under which Assumption 9.2 is satisfied.

Lemma 7 Suppose that $p_{1n}(z_i;\gamma_0,\cdot)$ is continuously differentiable in a neighborhood $N(\beta_0)$ of $\beta_0$, and that there is a function $h(z_i)$ with $E\left[\|h(z_i)\|^2\right]<\infty$ such that $\|p_{1n}(z_i;\gamma_0,b)\|\leq h(z_i)$ for all $b$ in $N(\beta_0)$. Then
\[
\frac{1}{\sqrt{n}}\sum \left(p_{1n}(z_i;\gamma_0,\beta_h) - p_{1n}(z_i;\gamma_0,\beta_0)\right) = o_p(1).
\]

Lemma 8 If
\[
p_1(z_i;\gamma_0,\beta_0) = E\left[\left. t(v_i,v_j;\beta)\right| v_i, w_i,\, w_i'\gamma_0 = w_j'\gamma_0\right] f_{w_j'\gamma_0}\left(w_i'\gamma_0\right) = \ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right),
\]
then under Assumptions 3 and 10.3,
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n} p_{1n}(z_i;\gamma_0,\beta_0) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} p_n(z_i;\gamma_0,\beta_0) = o_p(1).
\]

3.3 Bias-reduction.

The asymptotic normality result in Theorem 3 centers the asymptotic distribution of $\hat\beta$ at the pseudo-true value $\beta_h$; however, if the bandwidth sequence $h=h_n$ declines to zero slowly with $n$, as required by Assumption 2, the asymptotic bias term $\beta_h-\beta_0$ need not be $o(n^{-1/2})$, and $\hat\beta$ will not be root-$n$-consistent. While semiparametric methods typically ensure asymptotically negligible bias by assuming the relevant kernel function (here, $K(\cdot)$) is of "higher-order bias-reducing" form, such a requirement would be unattractive for many of the estimators proposed here, since the resulting negativity of the kernel function for some data points could compromise the convexity of the corresponding minimand, complicating both the asymptotic theory (through an additional compactness restriction) and computation of the estimator. As an alternative to the use of higher-order kernels, we instead obtain a root-$n$-consistent estimator by using the familiar jack-knife approach: assuming the pseudo-true value $\beta_h$ is a sufficiently smooth function of the bandwidth $h$, we construct a linear combination $\bar\beta$ of different estimators of $\beta_0$ (involving different bandwidth choices) for which $\sqrt{n}\left(\bar\beta-\beta_0\right)$ has the same asymptotic distribution as $\sqrt{n}\left(\hat\beta-\beta_h\right)$ and will thus be root-$n$-consistent.


For this jack-knife approach to be applicable, we require the existence of a Taylor series expansion of $\beta_h$ around $\beta_0$ as a function of $h$ (in a neighborhood of $h=0$). That is, we assume that
\[
\beta_h = \beta_0 + \sum_{l=1}^{p} b_l h^l + o(h^p) = \beta_0 + \sum_{l=1}^{p} b_l h^l + o(n^{-1/2}), \qquad (19)
\]
where the last equality presumes that $p$ can be chosen large enough that $h^p = O\left(n^{-1/2}\right)$ while preserving the conditions of Theorem 3. Such a series approximation will exist if the differentiability conditions of Assumption 10 are strengthened to require $(p+2)$-order differentiability of the $\ell_t$ term, with similar dominance conditions as in Assumption 10(2). To see why this would suffice, note that $\beta_h$ is defined by the relationship
\[
0 = E[Q_n(\gamma_0,\beta_h)] \equiv p_{0n}(\gamma_0,\beta_h) = E\left[\int K(\eta)\,\ell_t\left(v_i,\, w_i'\gamma_0 - h\eta,\, \beta_h\right) d\eta\right] \equiv F(h,\beta_h),
\]
where the definition of $F$ exploits the fact that $p_{0n}$ depends on $n$ only through $h$. Viewing the pseudo-true value $\beta_h$ as an implicit function of the bandwidth $h$, it will have $(p+1)$ derivatives with respect to $h$ in a neighborhood of zero if Assumption 9(1) holds and the function $F(h,\beta)$ has $(p+2)$ derivatives in both arguments in a neighborhood of $h=0$ and $\beta=\beta_0$, by a higher-order version of the implicit function theorem (Magnus and Neudecker 1988, chapter 7, Theorem A.3). This, in turn, would follow from $(p+2)$-order differentiability of $\ell_t$ in its second and third arguments, plus conditions permitting interchange of differentiation and integration (over $\eta$). In fact, if the kernel function $K(\eta)$ is chosen to be symmetric ($K(u)=K(-u)$), then it is straightforward to show that the coefficient $b_1$ of the first-order term in (19) is identically zero, so that we obtain the stronger condition
\[
\beta_h = \beta_0 + \sum_{l=2}^{p} b_l h^l + o(n^{-1/2}). \qquad (20)
\]

Assuming this condition is satisfied, consider $p$ estimators $\hat\beta_{c_1 h},\ldots,\hat\beta_{c_p h}$ based on the bandwidths $c_1 h,\ldots,c_p h$, where $c_1,c_2,\ldots,c_p$ is any sequence of distinct positive numbers with $c_1\equiv 1$. These estimators will have corresponding pseudo-true values $\beta_{c_1 h},\beta_{c_2 h},\ldots,\beta_{c_p h}$, each of which will satisfy condition (20) for $h$ sufficiently close to zero. Then, defining the jack-knifed estimator $\bar\beta$ as
\[
\bar\beta \equiv \sum_{k=1}^{p} a_k \hat\beta_{c_k h},
\]

where the coefficients $a_1,\ldots,a_p$ are defined as the solution to the linear equations
\[
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
c_1^2 & c_2^2 & \cdots & c_p^2 \\
\vdots & \vdots & \ddots & \vdots \\
c_1^p & c_2^p & \cdots & c_p^p
\end{pmatrix}
\begin{pmatrix}
a_1 \\ a_2 \\ \vdots \\ a_p
\end{pmatrix}
=
\begin{pmatrix}
1 \\ 0 \\ \vdots \\ 0
\end{pmatrix},
\]
the corresponding linear combination of pseudo-true values is easily shown to satisfy
\[
\sum_{k=1}^{p} a_k \beta_{c_k h} = \beta_0 + o\left(n^{-1/2}\right) \qquad (21)
\]
when (20) applies. Since the coefficients $a_1,\ldots,a_p$ must sum to one, and the asymptotic linearity expression for $\sqrt{n}\left(\hat\beta-\beta_h\right)$ in Theorem 3 does not depend explicitly on the bandwidth parameter $h$, the jack-knifed estimator $\bar\beta$ is asymptotically normal when centered at the true value $\beta_0$, as the following theorem states.
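As an aside, the weight system above is straightforward to solve numerically for any choice of the constants $c_k$. The following sketch is ours, not the paper's (pure Python, with the illustrative choice $c=(1,3,9)$, which matches the relative bandwidth ratios $0.3:0.9:2.7$ used in the Monte Carlo section):

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def jackknife_weights(c):
    """Weights a_1..a_p with sum(a_k) = 1 and sum(a_k c_k^l) = 0 for l = 2..p,
    i.e. the linear system displayed above."""
    p = len(c)
    A = [[1.0] * p] + [[ck ** l for ck in c] for l in range(2, p + 1)]
    return solve_linear(A, [1.0] + [0.0] * (p - 1))

a = jackknife_weights([1.0, 3.0, 9.0])
```

By construction the weights sum to one while annihilating the $h^2,\ldots,h^p$ bias terms in (20), which is exactly what delivers (21).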

Theorem 9 Under the conditions of Theorem 3, if condition (20) also holds, then
\[
\sqrt{n}\left(\bar\beta-\beta_0\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_i + o_p(1),
\]
where $\psi_i$ is defined in the statement of that theorem, and
\[
\sqrt{n}\left(\bar\beta-\beta_0\right) \rightarrow_d N\left(0,\,E[\psi_i\psi_i']\right).
\]

3.3.1 Estimation of the Asymptotic Variance.

In order to construct asymptotically valid test statistics and confidence regions using the results of Theorem 9, a consistent estimator of the asymptotic covariance matrix $C\equiv E[\psi_i\psi_i']$ of the jack-knifed estimator $\bar\beta$ is needed. Assuming the influence function $\omega_i$ of the preliminary estimator $\hat\gamma$ of $\gamma_0$ has a consistent estimator $\hat\omega_i$ which satisfies
\[
\frac{1}{n}\sum_{i=1}^{n}\|\hat\omega_i-\omega_i\|^2 = o_p(1), \qquad (22)
\]
a consistent estimator of $C$ can be constructed if the remaining components of $\psi_i$, namely $\Delta p_0^{\beta}(\gamma_0,\beta_0)$, $\Delta p_0^{\gamma}(\gamma_0,\beta_0)$, and $p_1(z_i;\gamma_0,\beta_0)\equiv p_{1i}$, can similarly be consistently estimated. Of these three terms, estimation of the latter two is most straightforward; defining
\[
\hat p_{1i} \equiv \frac{1}{(n-1)}\sum_{j\neq i}\frac{1}{h_n^{L}}\, K\left(\frac{(w_i-w_j)'\hat\gamma}{h_n}\right) t\left(v_i,v_j;\hat\beta\right)
\]

and
\[
\hat\Gamma_\gamma \equiv \frac{2}{n(n-1)}\sum_{i<j}\frac{1}{h_n^{L+1}}\, t\left(v_i,v_j;\hat\beta\right) \frac{\partial K\left(\frac{(w_i-w_j)'\hat\gamma}{h_n}\right)}{\partial u'}\,(w_i-w_j)',
\]
the same argument as for Lemma 6.2 of Powell (1987, 2001) yields
\[
\frac{1}{n}\sum_{i=1}^{n}\left\|\hat p_{1i}-p_{1i}\right\|^2 = o_p(1) \qquad (23)
\]
and
\[
\hat\Gamma_\gamma \rightarrow_p \Delta p_0^{\gamma}(\gamma_0,\beta_0) \qquad (24)
\]
when the conditions of Theorem ?? are satisfied. Estimation of the derivative $\Delta p_0^{\beta}(\gamma_0,\beta_0)$ is also straightforward if the function $t(v_i,v_j,\beta)$ is differentiable in $\beta$ (as for the partially linear logit estimator). In this case,
\[
\hat\Gamma_\beta \equiv \frac{2}{n(n-1)}\sum_{i<j}\frac{1}{h_n^{L}}\, K\left(\frac{(w_i-w_j)'\hat\gamma}{h_n}\right) \frac{\partial t\left(v_i,v_j;\hat\beta\right)}{\partial\beta'}
\]
will have
\[
\hat\Gamma_\beta \rightarrow_p \Delta p_0^{\beta}(\gamma_0,\beta_0), \qquad (25)
\]
as for $\hat\Gamma_\gamma$. If the function $t(v_i,v_j,\beta)$ is not differentiable in $\beta$, the derivative matrix $\Delta p_0^{\beta}$ must be estimated using some "smoothing" method, such as the numerical derivative approach described in Newey and McFadden (1994, section 7.3). In either case, the resulting consistent estimator of the asymptotic covariance matrix $C$ would be
\[
\hat C \equiv [\hat\Gamma_\beta]^{-1}\hat V[\hat\Gamma_\beta]^{-1},
\]
where $\hat V$ is the joint sample covariance matrix of $\hat\Gamma_\gamma\hat\omega_i$ and $\hat p_{1i}$. Consistency of this estimator follows directly from (22) through (25).
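For intuition, once $\hat\Gamma_\beta$ and $\hat V$ are in hand, $\hat C$ is a routine sandwich computation. A minimal sketch (ours, not the paper's; hypothetical $2\times 2$ inputs, pure Python):

```python
def mat_inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_mul(A, B):
    """Plain dense matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def sandwich(Gamma, V):
    """C = Gamma^{-1} V Gamma^{-1}; any scale factor (such as the 4 appearing
    in the limiting variance of the pairwise estimators) is left to the caller."""
    Gi = mat_inv2(Gamma)
    return mat_mul(mat_mul(Gi, V), Gi)

# Hypothetical Hessian-type and score-variance matrices:
C = sandwich([[2.0, 0.0], [0.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]])
```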

3.4 Asymptotic Properties of the Partially Linear Logit Estimator.

In this section we discuss how the general results of section 3 might be used to derive the asymptotic properties of one of the estimators defined in section 2, the partially linear logit model. For this model, the terms in the objective function (7) are convex if $K$ is positive. We can therefore use Theorem 1 to prove consistency. With the notation of section 3, and with $\Lambda(\eta)=\frac{\exp(\eta)}{1+\exp(\eta)}$, we have
\[
s((y_i,x_i),(y_j,x_j);b) = -1\{y_i\neq y_j\}\left(y_i\ln\Lambda\left((x_i-x_j)'b\right) + y_j\ln\Lambda\left(-(x_i-x_j)'b\right)\right)
\]
and $\gamma = I$. Also,
\[
t((y_i,x_i),(y_j,x_j);b) = 1\{y_i\neq y_j\}\left(y_i\Lambda\left(-(x_i-x_j)'b\right) - y_j\Lambda\left((x_i-x_j)'b\right)\right)(x_i-x_j).
\]
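As a sanity check on these formulas, one can verify numerically that $t$ is the negative gradient of $s$ with respect to $b$. The following sketch is ours (stdlib only; the evaluation point is arbitrary):

```python
import math

def Lambda(eta):
    """Logistic cdf, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def s_pair(yi, xi, yj, xj, b):
    """Pairwise conditional-logit loss s; zero unless yi != yj."""
    if yi == yj:
        return 0.0
    d = sum((xik - xjk) * bk for xik, xjk, bk in zip(xi, xj, b))
    return -(yi * math.log(Lambda(d)) + yj * math.log(Lambda(-d)))

def t_pair(yi, xi, yj, xj, b):
    """t, the negative gradient of s_pair with respect to b."""
    if yi == yj:
        return [0.0] * len(b)
    d = sum((xik - xjk) * bk for xik, xjk, bk in zip(xi, xj, b))
    scale = yi * Lambda(-d) - yj * Lambda(d)
    return [scale * (xik - xjk) for xik, xjk in zip(xi, xj)]

# Finite-difference check of t = -ds/db at an arbitrary point.
yi, xi, yj, xj, b = 1, [0.5, -1.0], 0, [1.5, 0.3], [0.7, -0.2]
eps = 1e-6
grad = []
for k in range(2):
    bp = b[:]; bp[k] += eps
    bm = b[:]; bm[k] -= eps
    grad.append((s_pair(yi, xi, yj, xj, bp) - s_pair(yi, xi, yj, xj, bm)) / (2 * eps))
t_val = t_pair(yi, xi, yj, xj, b)
```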


Theorem 10 Assume a random sample $\{(y_i,x_i,w_i)\}_{i=1}^{n}$ from (6) with $\varepsilon_i$ logistically distributed. The estimator defined by minimizing (7), where $h_n^{-1}=o\left(n^{-L/2}\right)$ and $K$ satisfies Assumption 3, is consistent and asymptotically normal with
\[
\sqrt{n}\left(\hat\beta-\beta_h\right)\longrightarrow N\left(0,\,4\Gamma^{-1}V\Gamma^{-1}\right)
\]
with
\[
V = V[r(y_i,x_i,w_i)],
\]
where
\[
r(y_i,x_i,w_i) = E\left[\left. 1\{y_i\neq y_j\}\left(y_i-\frac{\exp((x_i-x_j)'\beta)}{1+\exp((x_i-x_j)'\beta)}\right)(x_i-x_j)\,\right|\, y_i,x_i,w_i,\, w_j=w_i\right] f_w(w_i)
\]
and
\[
\Gamma = E\left[E\left[\left. 1\{y_i\neq y_j\}\,\frac{\exp((x_i-x_j)'\beta)}{(1+\exp((x_i-x_j)'\beta))^2}(x_i-x_j)(x_i-x_j)'\,\right|\, y_i,x_i,w_i,\, w_j=w_i\right] f_{w_i}(w_i)\right],
\]
provided that

1. $E\left[\|x_i\|^2\right]<\infty$.

2. $w_i$ is continuously distributed with a bounded density $f_{w_i}$. Also, $E\left[\|w_i\|^2\right]<\infty$.

3. $E\left[\|x_i\|\,|\,w_i=a\right] f_{w_i}(a)$ is a bounded function of $a$.

4. $(x_i-x_j)$ has full rank conditional on $z_i=z_j$.

5. The remaining regularity conditions of Assumptions 1 through 10 apply.

Proof: First note that $|\log\Lambda(\eta)|\leq\log(2)+|\eta|$. Therefore $|s((y_i,x_i),(y_j,x_j);b)|\leq\log(2)+|(x_i-x_j)'b|\leq\log(2)+(\|x_i\|+\|x_j\|)\|b\|$. Assumptions 1.1, 1.2 and 1.3 are therefore satisfied. To verify 1.4, let $a_1=(a_1^y,a_1^x)$; then $|E[s(a_1,v_j,b)|w_j=a_2]|\leq E[\,|s(a_1,v_j,b)|\,|\,w_j=a_2]\leq E[\log(2)+(\|a_1^x\|+\|x_j\|)\|b\|\,|\,w_j=a_2]$, from which 1.4 follows. Assumption 6 then follows from consistency of the maximizer of the conditional likelihood for logit models with fixed effects. The remaining assumptions for consistency hold by assumption. $\blacksquare$

The asymptotic normality of the jackknifed estimator follows:

Corollary 11 Under the assumptions of Theorem 10, the jackknifed estimator is consistent and asymptotically normal:
\[
\sqrt{n}\left(\bar\beta-\beta_0\right)\longrightarrow N\left(0,\,4\Gamma^{-1}V\Gamma^{-1}\right)
\]
with $V$ and $\Gamma$ defined as in Theorem 10.

4 A Monte Carlo Illustration.

To get an idea about the small sample properties of the estimators of the partially linear models described above, we have performed a Monte Carlo investigation for the partially linear logit model for a particular design. The design is chosen to illustrate the method and is not meant to mimic a design that one would expect in a particular data set. The model is
\[
y_i = 1\{x_{1i}\beta_1 + x_{2i}\beta_2 + g(z_i) + \varepsilon_i > 0\}, \qquad i=1,2,\ldots,n, \qquad (26)
\]
where $(\beta_1,\beta_2)=(1,1)$, $g(z)=z^2-2$, $x_{2i}$ has a discrete distribution with $P(x_{2i}=-1)=P(x_{2i}=1)=\frac{1}{2}$, $z_i\sim N(0,1)$, and $x_{1i}=v_i+z_i^2$ where $v_i\sim N(0,1)$. With this design, $P(y_i=1)\simeq 0.44$. For the design used here, ignoring the non-linearity of $g(z)$ is expected to result in a bias in the estimators of both $\beta_1$ and $\beta_2$, although we expect the bias to be bigger for $\beta_1$, because $g(z_i)$ is correlated with $x_{1i}$ but uncorrelated with $x_{2i}$.
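A quick simulation of this design (our sketch, stdlib only, not part of the original study) reproduces the success probability of roughly 0.44:

```python
import math, random

def simulate_design(n, seed=0):
    """Draw from the Monte Carlo design in (26) with (beta1, beta2) = (1, 1)
    and return the empirical frequency of y = 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x1 = v + z * z
        x2 = -1.0 if rng.random() < 0.5 else 1.0
        u = rng.random()
        eps = math.log(u / (1.0 - u))  # standard logistic error
        y = 1 if x1 + x2 + (z * z - 2.0) + eps > 0 else 0
        hits += y
    return hits / n

p_hat = simulate_design(200_000)
```

With 200,000 draws the Monte Carlo error is about 0.001, so the estimate should sit close to the 0.44 reported in the text.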

For each replication of the model, we calculate a number of estimators. First, we calculate the logit maximum likelihood estimator using a constant, $x_{1i}$, $x_{2i}$ and $g(z_i)$ as regressors. This estimator would be asymptotically efficient if one knew $g(\cdot)$; comparing the estimators proposed here to that estimator will therefore give a measure of the cost of not knowing $g$ (and using the estimators proposed here). Second, we calculate three estimators, $\hat\beta_1$, $\hat\beta_2$, and $\hat\beta_3$, based on (7) with $K$ being the biweight (quartic)6 kernel and $h_n = c\cdot\mathrm{std}(z)\cdot n^{-1/5}$, where $c$ takes the values 0.3, 0.9 and 2.7. These bandwidths are somewhat arbitrary; the middle one is motivated by the rule of thumb suggested by Silverman (1986, page 48) for estimation of densities (using a normal kernel). That bandwidth is supposed to illustrate what happens if one uses a "reasonable" bandwidth. The two other bandwidths are supposed to be "small" and "big". We also calculate four jackknifed estimators. The first, $\hat\beta_{123}$, combines the three estimators according to (21). This ignores the fact that $\int uK(u)\,du = 0$, and we therefore also consider the three jack-knifed estimators based on combining two of the three estimators according to (??). These three estimators are denoted $\hat\beta_{12}$, $\hat\beta_{13}$, and $\hat\beta_{23}$.
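Under our reading of (20), with the first-order term absent, each two-estimator combination has the closed form $\hat\beta_{jk} = \left(c_k^2\hat\beta_j - c_j^2\hat\beta_k\right)/\left(c_k^2 - c_j^2\right)$, obtained by solving the two-equation analogue of the weight system in section 3.3. A small sketch with synthetic inputs (ours, purely illustrative):

```python
def jackknife_pair(beta_j, beta_k, c_j, c_k):
    """Combine two estimates computed at bandwidths c_j*h and c_k*h so that the
    O(h^2) bias term cancels; the two weights sum to one."""
    w = c_k ** 2 / (c_k ** 2 - c_j ** 2)
    return [w * bj + (1.0 - w) * bk for bj, bk in zip(beta_j, beta_k)]

# Check on a synthetic quadratic-bias expansion beta_h = beta0 + b2 * h^2:
beta0, b2, h = 1.0, 0.5, 0.2
est = lambda c: [beta0 + b2 * (c * h) ** 2]
combined = jackknife_pair(est(1.0), est(3.0), 1.0, 3.0)
```

On this synthetic expansion the combination recovers $\beta_0$ exactly, since the only bias term is the $h^2$ term that the weights are built to remove.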

The results from 1000 replications with sample sizes 100, 200, 400, 800, 1600 and 3200 are given in Tables 1A through 1F. In addition to the true parameter values, each table also reports the bias, standard deviation and root mean square error of the estimator, as well as the corresponding robust measures: the median bias, the median absolute deviation from the median, and the median absolute error. Since all the estimators discussed here are likely to have fat tails (they are not even finite with probability 1), the discussion below will focus on these robust measures. The sample sizes are not chosen because we think that they are realistic given the small number of explanatory variables, but rather because we want to confirm that for large samples the estimator behaves as predicted by the asymptotic theory. As expected, the (correctly specified) maximum likelihood estimator that uses $x_{1i}$, $x_{2i}$ and $g(z_i)$ as regressors outperforms the semiparametric estimators. However, the jackknifed estimators perform almost as well. For example, the median absolute error of the estimator based on jack-knifing using $\hat\beta_2$ and $\hat\beta_3$ is within 10% of the median absolute error of the maximum likelihood estimator (and often closer).

The patterns of the bias and the dispersion of the three estimators based on (7) are as expected: lower values of the bandwidth $h$ give less bias but higher dispersion.

The proposed jack-knife procedure generally succeeds in removing the bias of the proposed estimators. For example, focusing on the coefficient on $x_{1i}$ (which has the bigger bias), the estimator that removes bias by comparing $\hat\beta_2$ and $\hat\beta_3$ has lower bias than either $\hat\beta_2$ or $\hat\beta_3$ for all sample sizes. Finally, for the largest sample sizes, there is almost no difference between the four bias-reduced estimators, which corresponds to the predictions of the asymptotic theory.

6Throughout, the kernel was normalized to have mean 0 and variance 1.

Table 2 presents evidence about the effect of the bias term (and of the bias reduction) on the performance of the test statistics calculated on the basis of the estimators discussed here. For each sample and for each of the semiparametric estimators, we calculated 80, 90 and 95 percent confidence intervals. In order to do this, we estimated the asymptotic variance of the three non-bias-reduced estimators by $4\hat\Gamma_k^{-1}\hat V_k\hat\Gamma_k^{-1}$, where $k=1,2,3$ denotes the estimator, $\hat V_k$ is the sample variance of $\hat r_i^k$ defined by
\[
\hat r_i^k = \frac{1}{(n-1)h_n}\sum_{j\neq i} 1\{y_i\neq y_j\}\cdot K\left(\frac{w_i-w_j}{h_n}\right)\left(y_i - \frac{\exp\left((x_i-x_j)'\hat\beta\right)}{1+\exp\left((x_i-x_j)'\hat\beta\right)}\right)(x_i-x_j),
\]
and
\[
\hat\Gamma_k = \frac{2}{n(n-1)h_n}\sum_{i<j} 1\{y_i\neq y_j\}\cdot K\left(\frac{w_i-w_j}{h_n}\right)\frac{\exp\left((x_i-x_j)'\hat\beta\right)}{\left(1+\exp\left((x_i-x_j)'\hat\beta\right)\right)^2}(x_i-x_j)(x_i-x_j)'.
\]

The estimated variance of any of the three estimators could be used to estimate the asymptotic variance of the jack-knifed estimators. However, in order to avoid arbitrarily choosing one variance estimator over another, we estimated the joint asymptotic variance of $(\hat\beta_1',\hat\beta_2',\hat\beta_3')'$ by $4\hat\Gamma^{-1}\hat V\hat\Gamma^{-1}$, where $\hat V$ is the sample variance of $(\hat r_i^{1\prime},\hat r_i^{2\prime},\hat r_i^{3\prime})'$ and
\[
\hat\Gamma = \begin{pmatrix} \hat\Gamma_1 & 0 & 0 \\ 0 & \hat\Gamma_2 & 0 \\ 0 & 0 & \hat\Gamma_3 \end{pmatrix}.
\]

Table 2 gives the fraction of the replications for which these confidence intervals covered the true parameter. Because the biases are more dramatic for $\beta_1$, we only present the results for that parameter. For all three sample sizes we see that the confidence interval based on the estimator $\hat\beta_1$, which uses a very small bandwidth, has coverage probabilities that are close to the 80, 90 and 95 percent nominal levels, whereas the coverage probabilities are smaller for the two other non-bias-reduced estimators, $\hat\beta_2$ and $\hat\beta_3$. While the discrepancies are not enormous (except for $\hat\beta_3$ in large samples), it is interesting to note that all the bias-corrected estimators perform better than both $\hat\beta_2$ and $\hat\beta_3$ for all sample sizes.

The sample sizes discussed so far are unrealistically large relative to the number of parameters, and there is little reason to think that the design mimics designs that one might encounter in applications. In order to investigate whether the good performance of the proposed estimators is an artifact of the very simple Monte Carlo design, we performed an additional experiment using the labor force participation data given in Berndt (1991, see also Mroz, 1987). Using a constant, log hourly earnings7, number of kids under 6, number of kids between 6 and 18, age, age-squared, age-cubed, education, local unemployment rate, a dummy variable for whether the person lived in a large city, and other family income as explanatory variables, we estimated a logit for whether a woman worked in 1975. The sample size was 753 (of whom 428 worked). Using the original 753 vectors of explanatory variables, we generated 1000 data sets from this model. We then estimated the parameters using the correctly specified logit maximum likelihood estimator and the semiparametric estimator that treats the functional form for the effect of age as unknown. We calculate the bandwidths as for design 1, and the results are presented in Table 3.

5 Possible Extensions.

Ahn and Powell (1993) extended the model given in (4) and (5) by allowing the nonparametric component to depend upon an unknown function $p(w_i)$, which itself must be estimated via nonparametric regression, rather than the parametric (linear) form $w_i'\gamma$, as assumed above. Making the same extension in (11) and (12) would lead to an estimator that minimizes a function of the form
\[
Q_n(b) = \binom{n}{2}^{-1}\frac{1}{h_n}\sum_{\substack{i<j \\ d_i=d_j=1}} K\left(\frac{\hat p(w_i)-\hat p(w_j)}{h_n}\right) s(y_i,y_j,(x_i-x_j)b). \qquad (27)
\]
The estimator proposed by Ahn and Powell (1993) minimizes $Q_n$ in (27) if there is no censoring and if quadratic loss ($\Xi(d)=d^2$) is used. In future work, we plan to investigate the properties of the estimator defined by (27). We also intend to consider extension of the asymptotic theory to discontinuous (in $\beta$) objective functions, so that the rank-based estimators for binary response and single index models proposed by Han (1987) and Cavanagh and Sherman (1998) might be extended to permit partially-linear and selection models.
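To fix ideas, the objective in (27) can be sketched directly. This is our illustration, not the paper's code: `p_hat` stands in for a first-stage nonparametric estimate of $p(w_i)$, `s_pair` for the chosen loss function, and the biweight kernel matches the Monte Carlo section; none of these choices is dictated by (27) itself.

```python
def biweight(u):
    """Biweight (quartic) kernel supported on [-1, 1]."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def Qn(b, data, p_hat, h, s_pair):
    """Pairwise objective (27): kernel-match observations whose estimated
    selection indices are close, summing over selected pairs (d_i = d_j = 1)."""
    n = len(data)
    total, pairs = 0.0, n * (n - 1) / 2
    for i in range(n):
        yi, xi, di = data[i]
        for j in range(i + 1, n):
            yj, xj, dj = data[j]
            if di == 1 and dj == 1:
                k = biweight((p_hat[i] - p_hat[j]) / h)
                dx = sum((a - c) * bk for a, c, bk in zip(xi, xj, b))
                total += k * s_pair(yi, yj, dx)
    return total / (pairs * h)

# Toy call with a squared-error loss (the Ahn-Powell case mentioned above):
data = [(1.0, [0.5], 1), (0.2, [0.1], 1), (0.9, [0.4], 0)]
q = Qn([1.0], data, p_hat=[0.3, 0.35, 0.9], h=0.5,
       s_pair=lambda yi, yj, d: (yi - yj - d) ** 2)
```

Only the pair of selected observations with nearby values of `p_hat` contributes appreciably, which is the sense in which the kernel weight substitutes for the parametric index match in (11) and (12).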

7This variable was imputed for the individuals who did not work.


6 References.

Ahn, H. and J.L. Powell (1993): “Semiparametric Estimation of Censored Selection Models witha Nonparametric Selection Mechanism,” Journal of Econometrics, 58: 3-29.

Amemiya, T. (1973): “Regression Analysis when the Dependent Variable is Truncated Normal,”Econometrica, 41(6): 997–1016.

Amemiya, T. (1985): Advanced Econometrics. Harvard University Press.

Barrodale, I. and F. D. K. Roberts (1974): "Solution of an Overdetermined System of Equations in the l1 Norm," Communications of the ACM, 17: 319–320.

Berndt, E. R. (1991): The Practice of Econometrics: Classic and Contemporary. Reading, Mass.: Addison-Wesley.

Cavanagh, C. and R. Sherman (1998): "Rank Estimators for Monotonic Regression Models," Journal of Econometrics, 84: 351–381.

Chamberlain, G. (1984): “Panel Data,” in Handbook of Econometrics, Vol. II, edited by Z.Griliches and M. Intriligator, North Holland.

Chamberlain, G. (1985): "Heterogeneity, Omitted Variable Bias, and Duration Dependence," in Longitudinal Analysis of Labor Market Data, edited by J. J. Heckman and B. Singer. Cambridge University Press.

Cramer, H. (1946): Mathematical Methods of Statistics. Princeton: Princeton University Press.

Engle, R. F., C. W. J. Granger, J. Rice and A. Weiss (1986): "Semiparametric Estimates of the Relation Between Weather and Electricity Sales," Journal of the American Statistical Association, 81: 310–320.

Han, A. K. (1987): “Non-parametric Analysis of a Generalized Regression Model: The MaximumRank Correlation Estimator,” Journal of Econometrics, 35(2/3): 303-316.

Hausman, J., Hall, B. and Z. Griliches (1984): “Econometric Models for Count Data with anApplication to the Patents-R&D Relationship,” Econometrica, 52(4), pp. 909–38.

Honore, B. E. (1992): “Trimmed LAD and Least Squares Estimation of Truncated and CensoredRegression Models with Fixed Effects,” Econometrica, 60(3): 533–565.

Honore, B. E., and J. L. Powell (1994): “Pairwise Difference Estimators of Censored and Trun-cated Regression Models,” Journal of Econometrics, 64(2): 241–278.

Magnus, J. R. and H. Neudecker (1988): Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: John Wiley and Sons.

Mroz, T. A. (1987): "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions," Econometrica, 55(4): 765–799.


Newey, W. K. and D. McFadden (1994): “Large Sample Estimation and Hypothesis Testing,”in Handbook of Econometrics, Vol. IV, edited by R. F. Engle and D. L. McFadden, NorthHolland.

Powell, J. L. (1987): "Semiparametric Estimation of Bivariate Latent Variable Models," Department of Economics, University of Wisconsin-Madison, SSRI Working Paper No. 8704.

Powell, J. L. (1994): "Estimation of Semiparametric Models," in Handbook of Econometrics, Vol. IV, edited by R. F. Engle and D. L. McFadden, North Holland.

Powell, J. L. (2001): "Semiparametric Estimation of Censored Selection Models," in Nonlinear Statistical Modeling, edited by C. Hsiao, K. Morimune, and J. L. Powell, Cambridge University Press.

Powell, J. L., J. Stock and T. Stoker (1989): “Semiparametric Estimation of Weighted AverageDerivatives”, Econometrica, 57, pp. 1403-1430.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling (1986): Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge University Press.

Rasch, G. (1960): Probabilistic Models for Some Intelligence and Attainment Tests, DenmarksPædagogiske Institut, Copenhagen.

Robinson, P. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica, 56: 931–954.

Serfling, R. J. (1980): Approximation Theorems of Mathematical Statistics, John Wiley andSons.

Sherman, R. (1993): "The Limiting Distribution of the Maximum Rank Correlation Estimator," Econometrica, 61(1): 123–137.

Sherman, R. (1994a): "U-Processes in the Analysis of a Generalized Semiparametric Regression Estimator," Econometric Theory, 10: 372–395.

Sherman, R. (1994b): “Maximal Inequalities for Degenerate U–Processes with Applications toOptimization Estimators,” Annals of Statistics, 22(1), pp. 439–459.

Silverman, B. W. (1986): Density Estimation for Statistics and Data Analysis, London: Chap-man and Hall.


7 Appendix: The Most Dull Derivations.

Proof of Lemma 5:
\[
\begin{aligned}
p_{0n}(\gamma_0,\beta) &= E\left[\frac{1}{h^L}\, K\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) t(v_i,v_j,\beta)\right] \\
&= E\left[\frac{1}{h^L}\, K\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) E\left[\left. t(v_i,v_j,\beta)\right|(v_i,w_i),\, w_j'\gamma_0\right]\right] \\
&= E\left[\frac{1}{h^L}\, K\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) k_t\left(v_i,\, w_j'\gamma_0,\, \beta\right)\right] \\
&= E\left[\int \frac{1}{h^L}\, K\left(\frac{w_i'\gamma_0-\omega}{h}\right) k_t(v_i,\omega,\beta)\, f_{w_i'\gamma_0}(\omega)\, d\omega\right] \\
&= E\left[\int K(\eta)\, \ell_t\left(v_i,\, w_i'\gamma_0-h\eta,\, \beta\right) d\eta\right].
\end{aligned}
\]
By Assumption 10.1, we can differentiate under the expectation and integral (see, e.g., Cramer (1946, page 68)):
\[
p_{0n}^{\beta}(\gamma_0,\beta_0) = E\left[\int K(\eta)\, \ell_t^{(3)}\left(v_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right) d\eta\right] \to E\left[\ell_t^{(3)}\left(v_i,\, w_i'\gamma_0,\, \beta_0\right)\right],
\]
where the limit follows from dominated convergence. $\blacksquare$

Proof of Lemma 6: Recall that

\[
p_{0n}(\gamma,\beta) = E\left[\frac{1}{h^L}\, K\left(\frac{w_i'\gamma-w_j'\gamma}{h}\right) t(v_i,v_j,\beta)\right].
\]
By Assumptions 3 and 10.2, we can differentiate under the expectation:
\[
p_{0n}^{\gamma}(\gamma,\beta) = E\left[\frac{1}{h^L}\, K'\left(\frac{w_i'\gamma-w_j'\gamma}{h}\right) \frac{w_i-w_j}{h}\, t(v_i,v_j,\beta)\right].
\]
Evaluating this at $(\gamma_0,\beta_0)$, we get
\[
\begin{aligned}
p_{0n}^{\gamma}(\gamma_0,\beta_0) &= E\left[\frac{1}{h^L}\, K'\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) \frac{w_i-w_j}{h}\, t(v_i,v_j,\beta_0)\right] \\
&= E\left[E\left[\left. E\left[\left.\frac{1}{h^L}\, K'\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) \frac{w_i-w_j}{h}\, t(v_i,v_j,\beta_0)\right| v_i,w_i,\, w_j'\gamma_0\right]\right| v_i,w_i\right]\right] \\
&= E\left[E\left[\left.\frac{1}{h^{L+1}}\, K'\left(\frac{w_i'\gamma_0-w_j'\gamma_0}{h}\right) E\left[\left.(w_i-w_j)\, t(v_i,v_j,\beta_0)\right| v_i,w_i,\, w_j'\gamma_0\right]\right| v_i,w_i\right]\right].
\end{aligned}
\]
With the definitions of $k_{t_1}$ and $\ell_{t_1}$, and using integration by parts, we have
\[
\begin{aligned}
p_{0n}^{\gamma}(\gamma_0,\beta_0) &= E\left[\int \frac{1}{h^{L+1}}\, K'\left(\frac{w_i'\gamma_0-\omega}{h}\right) k_{t_1}(z_i,\omega,\beta_0)\, f_{w_i'\gamma_0}(\omega)\, d\omega\right] \\
&= E\left[\int \frac{1}{h^{L+1}}\, K'\left(\frac{w_i'\gamma_0-\omega}{h}\right) \ell_{t_1}(z_i,\omega,\beta_0)\, d\omega\right] \\
&= E\left[\int \frac{1}{h}\, K'(\eta)\, \ell_{t_1}\left(z_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right) d\eta\right] \\
&= -E\left[\int K(\eta)\, \ell_{t_1}^{(2)}\left(z_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right) d\eta\right] \\
&\to -E\left[\ell_{t_1}^{(2)}\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right],
\end{aligned}
\]
where the limit follows from dominated convergence. $\blacksquare$

Proof of Lemma 8: Write
\[
p_{1n}(z_i;\gamma_0,\beta_0) = r_n(z_i) - E[r_n(z_i)]
\]
and
\[
p_1(z_i) = \ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right) - E\left[\ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right],
\]
where
\[
r_n(z_i) = E\left[\left. p_n(z_i,z_j;\gamma_0,\beta_0)\right| z_i\right] = \int \frac{1}{h^L}\, K\left(\frac{w_i'\gamma_0-\omega}{h}\right) \ell_t(v_i,\omega,\beta_0)\, d\omega = \int K(\eta)\, \ell_t\left(v_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right) d\eta
\]
and $E\left[\ell_t(z_i,w_i'\gamma_0,\beta_0)\right]=0$. With $\delta_n(z_i) = r_n(z_i) - \ell_t(z_i,w_i'\gamma_0,\beta_0)$, we then have
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n} p_{1n}(z_i;\gamma_0,\beta_0) - \frac{1}{\sqrt{n}}\sum_{i=1}^{n} p_n(z_i;\gamma_0,\beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\delta_n(z_i) - E[\delta_n(z_i)]\right).
\]
The right hand side has mean 0 and variance
\[
\begin{aligned}
V\left[\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\delta_n(z_i)-E[\delta_n(z_i)]\right)\right] &= V\left[\delta_n(z_i)-E[\delta_n(z_i)]\right] \\
&\leq E\left[\delta_n(z_i)^2\right] = E\left[\left\{r_n(z_i)-\ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right\}^2\right] \\
&= E\left[\left\{\int K(\eta)\,\ell_t\left(v_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right)d\eta - \ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right\}^2\right] \\
&= E\left[\left\{\int K(\eta)\left(\ell_t\left(v_i,\, w_i'\gamma_0-h\eta,\, \beta_0\right)-\ell_t\left(z_i,\, w_i'\gamma_0,\, \beta_0\right)\right)d\eta\right\}^2\right] \\
&\leq E\left[g(z_i)^2\, h^2\left\{\int |K(\eta)|\,\|\eta\|\, d\eta\right\}^2\right] = O\left(h^2\right) \to 0,
\end{aligned}
\]

where $g$ is the function in Assumption 10.3. $\blacksquare$

Proof of Lemma 7: For the duration of this proof, let $p_{1n}$ denote one of the elements of $p_{1n}$. By the definition of $p_{1n}$ and by random sampling, the mean of the left hand side is 0, while the variance is
\[
E\left[\left(p_{1n}(z_i;\gamma_0,\beta_h)-p_{1n}(z_i;\gamma_0,\beta_0)\right)^2\right] \leq E\left[\left\|p_{1n}^{\beta}(z_i;\gamma_0,\beta_i^*)\right\|^2\right]\|\beta_h-\beta_0\|^2,
\]
where $\beta_i^*$ is between $\beta_h$ and $\beta_0$, but may depend on $z_i$ (hence the subscript $i$). The result now follows from $\beta_h\to\beta_0$. $\blacksquare$


TABLE 1A: Design 1. Sample size=100.

value bias st.dev. RMSE m.bias MAD MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors
β1 1.000 0.138 0.406 0.428 0.083 0.253 0.245
β2 1.000 0.107 0.376 0.390 0.063 0.222 0.213

The estimator based on (7) with hn = 0.3 · std(z) · n−1/5
β1 1.000 0.175 0.450 0.482 0.103 0.279 0.271
β2 1.000 0.117 0.500 0.513 0.047 0.241 0.246

The estimator based on (7) with hn = 0.9 · std(z) · n−1/5
β1 1.000 0.249 0.402 0.473 0.198 0.248 0.251
β2 1.000 0.065 0.372 0.377 0.026 0.228 0.223

The estimator based on (7) with hn = 2.7 · std(z) · n−1/5
β1 1.000 0.427 0.370 0.565 0.373 0.233 0.373
β2 1.000 0.010 0.333 0.333 -0.021 0.202 0.206

Jack-knife using β1, β2 and β3
β1 1.000 0.136 0.504 0.521 0.057 0.309 0.307
β2 1.000 0.150 0.621 0.638 0.071 0.265 0.271

Jack-knife using β1 and β2
β1 1.000 0.166 0.459 0.488 0.091 0.283 0.275
β2 1.000 0.124 0.521 0.535 0.049 0.247 0.251

Jack-knife using β1 and β3
β1 1.000 0.172 0.451 0.483 0.099 0.281 0.272
β2 1.000 0.119 0.503 0.516 0.049 0.243 0.246

Jack-knife using β2 and β3
β1 1.000 0.227 0.408 0.466 0.172 0.255 0.247
β2 1.000 0.072 0.379 0.385 0.033 0.227 0.226


TABLE 1B: Design 1. Sample size=200.

value bias st.dev. RMSE m.bias MAD MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors
β1 1.000 0.038 0.253 0.256 0.016 0.161 0.157
β2 1.000 0.034 0.238 0.241 0.012 0.157 0.158

The estimator based on (7) with hn = 0.3 · std(z) · n−1/5
β1 1.000 0.050 0.276 0.280 0.013 0.177 0.174
β2 1.000 0.031 0.264 0.265 0.019 0.176 0.174

The estimator based on (7) with hn = 0.9 · std(z) · n−1/5
β1 1.000 0.132 0.257 0.288 0.098 0.163 0.166
β2 1.000 0.008 0.243 0.243 -0.001 0.153 0.154

The estimator based on (7) with hn = 2.7 · std(z) · n−1/5
β1 1.000 0.324 0.235 0.400 0.290 0.146 0.290
β2 1.000 -0.041 0.227 0.230 -0.050 0.145 0.153

Jack-knife using β1, β2 and β3
β1 1.000 0.005 0.296 0.296 -0.021 0.191 0.193
β2 1.000 0.043 0.285 0.288 0.031 0.186 0.189

Jack-knife using β1 and β2
β1 1.000 0.039 0.279 0.282 0.006 0.179 0.180
β2 1.000 0.033 0.268 0.269 0.024 0.178 0.176

Jack-knife using β1 and β3
β1 1.000 0.046 0.276 0.280 0.010 0.177 0.177
β2 1.000 0.031 0.265 0.266 0.020 0.177 0.175

Jack-knife using β2 and β3
β1 1.000 0.108 0.260 0.282 0.074 0.168 0.161
β2 1.000 0.014 0.246 0.246 0.004 0.159 0.157


TABLE 1C: Design 1. Sample size=400.

value bias st.dev. RMSE m.bias MAD MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors
β1 1.000 0.018 0.164 0.165 0.012 0.115 0.110
β2 1.000 0.020 0.158 0.159 0.015 0.101 0.102

The estimator based on (7) with hn = 0.3 · std(z) · n−1/5
β1 1.000 0.030 0.181 0.184 0.024 0.123 0.122
β2 1.000 0.017 0.170 0.170 0.005 0.109 0.110

The estimator based on (7) with hn = 0.9 · std(z) · n−1/5
β1 1.000 0.090 0.172 0.194 0.078 0.115 0.126
β2 1.000 0.000 0.162 0.162 -0.008 0.106 0.107

The estimator based on (7) with hn = 2.7 · std(z) · n−1/5
β1 1.000 0.279 0.156 0.319 0.265 0.104 0.265
β2 1.000 -0.041 0.151 0.157 -0.043 0.097 0.105

Jack-knife using β1, β2 and β3
β1 1.000 0.001 0.190 0.189 -0.006 0.129 0.130
β2 1.000 0.026 0.176 0.178 0.017 0.112 0.117

Jack-knife using β1 and β2
β1 1.000 0.023 0.183 0.184 0.016 0.125 0.124
β2 1.000 0.019 0.171 0.172 0.006 0.110 0.111

Jack-knife using β1 and β3
β1 1.000 0.027 0.182 0.184 0.022 0.123 0.123
β2 1.000 0.018 0.170 0.171 0.006 0.109 0.110

Jack-knife using β2 and β3
β1 1.000 0.066 0.175 0.187 0.058 0.118 0.120
β2 1.000 0.006 0.164 0.164 -0.002 0.106 0.107


TABLE 1D: Design 1. Sample size=800.

       value     bias  st.dev.    RMSE   m.bias      MAD      MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors

β1     1.000    0.010    0.123    0.124    0.004    0.082    0.084
β2     1.000    0.014    0.114    0.115    0.012    0.075    0.075

The estimator based on (7) with h_n = 0.3 · std(z) · n^(-1/5)

β1     1.000    0.019    0.131    0.132    0.012    0.088    0.088
β2     1.000    0.012    0.123    0.124    0.006    0.082    0.082

The estimator based on (7) with h_n = 0.9 · std(z) · n^(-1/5)

β1     1.000    0.067    0.128    0.144    0.063    0.089    0.092
β2     1.000    0.001    0.120    0.120   -0.002    0.079    0.078

The estimator based on (7) with h_n = 2.7 · std(z) · n^(-1/5)

β1     1.000    0.246    0.118    0.273    0.241    0.083    0.241
β2     1.000   -0.039    0.113    0.120   -0.041    0.076    0.082

Jack-knife using β1, β2 and β3

β1     1.000   -0.002    0.134    0.134   -0.008    0.088    0.088
β2     1.000    0.018    0.126    0.127    0.012    0.085    0.085

Jack-knife using β1 and β2

β1     1.000    0.014    0.132    0.132    0.007    0.088    0.088
β2     1.000    0.014    0.124    0.124    0.009    0.082    0.082

Jack-knife using β1 and β3

β1     1.000    0.017    0.131    0.132    0.009    0.088    0.088
β2     1.000    0.013    0.123    0.124    0.007    0.083    0.082

Jack-knife using β2 and β3

β1     1.000    0.045    0.129    0.137    0.040    0.089    0.090
β2     1.000    0.006    0.121    0.121    0.003    0.080    0.080
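The rows labeled β123, β12, β13 and β23 are jack-knife combinations of the estimators β1, β2 and β3 computed at bandwidths h, 3h and 9h. The exact weights are not shown in this excerpt; the sketch below uses the textbook bias-reducing combination of a pair of estimates, assuming the leading bias of each estimator is proportional to the bandwidth raised to a known power:

```python
def jackknife_pair(b_small, b_large, ratio=3.0, order=2):
    """Combine estimates computed at bandwidths h and ratio*h so that
    a leading bias term B * h**order cancels. The weights and the
    bias order are the textbook choice, not taken from the paper."""
    r = ratio ** order
    return (r * b_small - b_large) / (r - 1.0)

# If both estimates share the same quadratic bias coefficient, the
# leading bias term is removed exactly:
truth, B, h = 1.0, 0.5, 0.2
b1 = truth + B * h ** 2          # estimate at bandwidth h
b2 = truth + B * (3 * h) ** 2    # estimate at bandwidth 3h
print(jackknife_pair(b1, b2))    # recovers the truth, 1.0
```

Combining all three estimates (as β123 does) removes a further bias term by solving the analogous two-equation system.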


TABLE 1E: Design 1. Sample size=1600.

       value     bias  st.dev.    RMSE   m.bias      MAD      MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors

β1     1.000    0.004    0.078    0.079    0.005    0.053    0.053
β2     1.000    0.006    0.079    0.079    0.009    0.052    0.053

The estimator based on (7) with h_n = 0.3 · std(z) · n^(-1/5)

β1     1.000    0.010    0.085    0.085    0.009    0.056    0.058
β2     1.000    0.005    0.083    0.084    0.007    0.054    0.054

The estimator based on (7) with h_n = 0.9 · std(z) · n^(-1/5)

β1     1.000    0.048    0.082    0.095    0.047    0.055    0.067
β2     1.000   -0.004    0.081    0.081   -0.003    0.053    0.053

The estimator based on (7) with h_n = 2.7 · std(z) · n^(-1/5)

β1     1.000    0.213    0.076    0.226    0.214    0.051    0.214
β2     1.000   -0.040    0.077    0.087   -0.038    0.051    0.057

Jack-knife using β1, β2 and β3

β1     1.000   -0.005    0.086    0.086   -0.006    0.057    0.057
β2     1.000    0.009    0.085    0.085    0.010    0.055    0.056

Jack-knife using β1 and β2

β1     1.000    0.006    0.085    0.085    0.005    0.056    0.057
β2     1.000    0.006    0.084    0.084    0.008    0.053    0.054

Jack-knife using β1 and β3

β1     1.000    0.008    0.085    0.085    0.007    0.056    0.057
β2     1.000    0.006    0.084    0.084    0.007    0.054    0.054

Jack-knife using β2 and β3

β1     1.000    0.027    0.084    0.088    0.027    0.056    0.060
β2     1.000    0.001    0.082    0.082    0.001    0.053    0.053


TABLE 1F: Design 1. Sample size=3200.

       value     bias  st.dev.    RMSE   m.bias      MAD      MAE

Logit MLE using constant, x1i, x2i and g(zi) as regressors

β1     1.000    0.002    0.059    0.059    0.003    0.040    0.040
β2     1.000    0.002    0.055    0.055    0.003    0.037    0.038

The estimator based on (7) with h_n = 0.3 · std(z) · n^(-1/5)

β1     1.000    0.006    0.063    0.063    0.006    0.040    0.041
β2     1.000   -0.000    0.059    0.059   -0.004    0.040    0.041

The estimator based on (7) with h_n = 0.9 · std(z) · n^(-1/5)

β1     1.000    0.035    0.062    0.071    0.035    0.040    0.048
β2     1.000   -0.007    0.059    0.059   -0.011    0.040    0.040

The estimator based on (7) with h_n = 2.7 · std(z) · n^(-1/5)

β1     1.000    0.182    0.058    0.191    0.183    0.037    0.183
β2     1.000   -0.039    0.056    0.068   -0.042    0.039    0.051

Jack-knife using β1, β2 and β3

β1     1.000   -0.004    0.063    0.063   -0.005    0.041    0.041
β2     1.000    0.002    0.060    0.060   -0.000    0.040    0.041

Jack-knife using β1 and β2

β1     1.000    0.003    0.063    0.063    0.002    0.040    0.040
β2     1.000    0.001    0.060    0.060   -0.002    0.040    0.041

Jack-knife using β1 and β3

β1     1.000    0.004    0.063    0.063    0.004    0.040    0.041
β2     1.000    0.000    0.060    0.060   -0.003    0.040    0.041

Jack-knife using β2 and β3

β1     1.000    0.017    0.062    0.064    0.017    0.041    0.041
β2     1.000   -0.003    0.059    0.059   -0.006    0.040    0.041


TABLE 2: Design 1. Coverage Probabilities.

95% Confidence Interval

        n = 100   n = 200   n = 400   n = 800   n = 1600   n = 3200
β1        0.948     0.938     0.962     0.933      0.958      0.940
β2        0.905     0.927     0.927     0.917      0.931      0.912
β3        0.801     0.726     0.573     0.391      0.202      0.080
β123      0.968     0.943     0.957     0.931      0.960      0.940
β12       0.950     0.937     0.965     0.934      0.961      0.940
β13       0.948     0.938     0.963     0.933      0.961      0.941
β23       0.910     0.932     0.945     0.929      0.948      0.940

90% Confidence Interval

        n = 100   n = 200   n = 400   n = 800   n = 1600   n = 3200
β1        0.894     0.888     0.907     0.885      0.915      0.898
β2        0.849     0.850     0.864     0.841      0.866      0.856
β3        0.684     0.624     0.465     0.290      0.123      0.044
β123      0.925     0.897     0.918     0.897      0.909      0.892
β12       0.903     0.889     0.908     0.885      0.914      0.899
β13       0.896     0.887     0.906     0.886      0.913      0.901
β23       0.860     0.866     0.884     0.866      0.903      0.885

80% Confidence Interval

        n = 100   n = 200   n = 400   n = 800   n = 1600   n = 3200
β1        0.782     0.786     0.785     0.786      0.817      0.798
β2        0.732     0.744     0.751     0.712      0.758      0.740
β3        0.526     0.453     0.303     0.174      0.059      0.023
β123      0.827     0.800     0.814     0.803      0.810      0.808
β12       0.791     0.789     0.801     0.794      0.810      0.799
β13       0.784     0.787     0.789     0.789      0.814      0.796
β23       0.740     0.757     0.773     0.746      0.805      0.794


TABLE 3: Monte Carlo Based on Real Data.

Median Absolute Error relative to MLE

β1 β2 β3 β123 β12 β13 β23

Wage rate               1.0229   1.0161   0.9796   1.0531   1.0228   1.0260   1.0206
Kids less than 6        1.0342   0.9860   1.1623   1.0443   1.0273   1.0352   1.0065
Kids between 6 and 18   1.0372   0.9928   1.7117   1.0560   1.0358   1.0416   1.0346
Education               1.0648   1.0303   1.0156   1.0711   1.0625   1.0640   1.0354
Local Unemployment      1.0511   1.0439   1.0386   1.0876   1.0643   1.0518   1.0467
City                    1.0454   1.0016   0.9609   1.0371   1.0336   1.0439   0.9909
Other Income            1.0423   1.0181   1.0243   1.0207   1.0395   1.0432   1.0149

Coverage Probability for 80% Confidence Interval

β1 β2 β3 β123 β12 β13 β23

Wage rate               0.786    0.778    0.785    0.792    0.786    0.786    0.777
Kids less than 6        0.808    0.799    0.715    0.817    0.807    0.809    0.802
Kids between 6 and 18   0.795    0.780    0.480    0.801    0.802    0.798    0.790
Education               0.780    0.778    0.770    0.788    0.781    0.779    0.775
Local Unemployment      0.802    0.793    0.800    0.807    0.806    0.802    0.794
City                    0.825    0.816    0.817    0.822    0.822    0.822    0.815
Other Income            0.790    0.781    0.778    0.797    0.789    0.790    0.784

Coverage Probability for 90% Confidence Interval

β1 β2 β3 β123 β12 β13 β23

Wage rate               0.889    0.886    0.883    0.893    0.889    0.889    0.887
Kids less than 6        0.900    0.904    0.832    0.909    0.908    0.901    0.900
Kids between 6 and 18   0.885    0.877    0.624    0.897    0.887    0.885    0.885
Education               0.889    0.890    0.883    0.890    0.891    0.890    0.890
Local Unemployment      0.894    0.895    0.889    0.900    0.896    0.894    0.893
City                    0.912    0.912    0.909    0.918    0.916    0.913    0.911
Other Income            0.899    0.905    0.899    0.903    0.898    0.898    0.901

Coverage Probability for 95% Confidence Interval

β1 β2 β3 β123 β12 β13 β23

Wage rate               0.947    0.946    0.943    0.946    0.948    0.947    0.946
Kids less than 6        0.960    0.956    0.902    0.959    0.959    0.960    0.959
Kids between 6 and 18   0.942    0.934    0.740    0.950    0.944    0.943    0.944
Education               0.944    0.936    0.935    0.945    0.943    0.944    0.942
Local Unemployment      0.941    0.938    0.943    0.943    0.942    0.941    0.937
City                    0.963    0.962    0.954    0.963    0.964    0.964    0.964
Other Income            0.953    0.955    0.958    0.959    0.954    0.952    0.954
