TECHNICAL REPORT 08011

GOODNESS-OF-FIT TESTS FOR PARAMETRIC REGRESSION WITH SELECTION BIASED DATA

OJEDA CABRERA, J.L. and I. VAN KEILEGOM

IAP STATISTICS NETWORK
INTERUNIVERSITY ATTRACTION POLE
http://www.stat.ucl.ac.be/IAP
Goodness-of-fit tests for parametric regression with
selection biased data
Jorge L. OJEDA CABRERA
Universidad de Zaragoza
Ingrid VAN KEILEGOM
Université catholique de Louvain
January 24, 2008
Abstract
Consider the nonparametric location-scale regression model Y = m(X) + σ(X)ε,
where the error ε is independent of the covariate X, and m and σ are smooth but
unknown functions. The pair (X,Y ) is allowed to be subject to selection bias. We
construct tests for the hypothesis that m(·) belongs to some parametric family of
regression functions. The proposed tests compare the nonparametric maximum like-
lihood estimator (NPMLE) based on the residuals obtained under the assumed para-
metric model, with the NPMLE based on the residuals obtained without using the
parametric model assumption. The asymptotic distribution of the test statistics is
obtained. A bootstrap procedure is proposed to approximate the critical values of
the tests. Finally, the finite sample performance of the proposed tests is studied in a
simulation study, and the developed tests are applied on environmental data.
Key words and phrases. Biased sampling, bootstrap, empirical process, goodness-of-fit test, het-
eroscedastic model, location-scale regression, model diagnostics, nonparametric regression, weak
convergence.
1 Introduction
Consider the nonparametric location-scale regression model
Y = m(X) + σ(X)ε, (1)
where Y is the variable of interest, X is a covariate, the error ε is independent of X,
E(ε) = 0, Var(ε) = 1, and m(·) and σ2(·) are smooth but unknown regression and variance
curves, respectively.
Suppose that we cannot observe the pair $(X, Y)$ directly, but that our sample comes from $(X^w, Y^w)$, a bivariate random vector whose distribution is given by
\[
dF^w(x, y) = \frac{w(x, y)}{\mu_w}\, dF(x, y),
\]
where $F(x, y)$ is the bivariate distribution of $(X, Y)$, and $\mu_w$ is the mean value of $w(X, Y)$. The weight function $w(X, Y)$ drives the relationship between the observed random vector $(X^w, Y^w)$ and the unobserved random vector $(X, Y)$. Let $(X^w_1, Y^w_1), \ldots, (X^w_n, Y^w_n)$ be $n$ independent replications of $(X^w, Y^w)$.
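To make the sampling mechanism concrete, the weighted law $dF^w = w\, dF / \mu_w$ can be simulated by rejection: draw from $F$ and accept with probability proportional to $w$. The sketch below is our own illustration (the design, error law and numerical choices are assumptions, not taken from the paper); it uses length bias $w(x, y) = y$, under which large responses are over-represented.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_xy():
    # hypothetical design: X ~ U[0, 5], Y = 2 + X + 0.2 * eps, eps ~ U[-1, 1]
    x = rng.uniform(0.0, 5.0)
    return x, 2.0 + x + 0.2 * rng.uniform(-1.0, 1.0)

def biased_sample(n, w, w_max):
    """Draw n replications of (X^w, Y^w), whose law is
    dF^w(x, y) = w(x, y) / mu_w dF(x, y), by rejection sampling:
    accept a draw (X, Y) from F with probability w(X, Y) / w_max,
    where w_max bounds w on the support."""
    out = []
    while len(out) < n:
        x, y = draw_xy()
        if rng.uniform() < w(x, y) / w_max:
            out.append((x, y))
    return np.array(out)

# length bias w(x, y) = y: units with large responses are over-represented,
# so the sample mean of Y^w overshoots E(Y) = 4.5
sample = biased_sample(2000, lambda x, y: y, w_max=7.2)
```

Here the biased mean estimates $E(Y^2)/E(Y) \approx 4.97$ rather than $E(Y) = 4.5$, which is exactly the distortion the weighting in the paper corrects for.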
A nice description of practical situations where selection bias is encountered can be found in Rao (1997). While Rao (1997) considers discrete populations, Mahfoud and Patil (1982) and Patil and Taillie (1989) also deal with the continuous and multivariate framework (e.g. $w(x, y) = y^\alpha$, $w(x, y) = \max(x, y)$, $w(x, y) = \min(x, y)$, $w(x, y) = x + y$).
A more theoretical treatment of the problem can be found in a series of papers by Vardi, see
for example Vardi (1982). In that paper, the nonparametric maximum likelihood estimator
(NPMLE) of the distribution function is obtained for the special case where the data
are univariate and are subject to length–bias sampling (i.e. when w(x, y) = y). See also
Vardi (1985) and Gill et al. (1988), where the general case of selection biased sampling
is considered for univariate data. They also develop a number of examples (including
stratification and length–bias) and give a detailed treatment of the large sample properties
of the NPMLE.
The aim of this paper is twofold. We will first develop the asymptotic properties of the NPMLE of the error distribution, based on the nonparametric residuals $\hat\varepsilon^w_i = \{Y^w_i - \hat m_n(X^w_i)\}/\hat\sigma_n(X^w_i)$ $(i = 1, \ldots, n)$, where $\hat m_n(\cdot)$ and $\hat\sigma_n(\cdot)$ are appropriate kernel estimators of $m(\cdot)$ and $\sigma(\cdot)$ respectively. The proofs of these properties rely heavily on modern empirical process techniques, needed to take care of the difference $I(\hat\varepsilon^w_i \le y) - I(\varepsilon^w_i \le y)$ $(i = 1, \ldots, n)$.
See also Akritas and Van Keilegom (2001) and Van Keilegom and Akritas (1999), where
the NPMLE of the error distribution has been studied for completely observed and right
censored observations, respectively. Related estimation problems can be found in Cheng
(2004) and Muller et al. (2004a,b).
Secondly, we will develop appropriate test statistics for the hypothesis
\[
H_0 : m(\cdot) \in \mathcal{M}, \tag{2}
\]
when the data are subject to selection bias. Here, $\mathcal{M} = \{m_\theta(\cdot) : \theta \in \Theta\}$ is a class of parametric regression functions, including (but not restricted to) the class of linear regression functions. The set $\Theta$ is supposed to be a closed subset of $\mathbb{R}^d$. We are interested in developing omnibus tests, which have power against any alternative hypothesis.
The test statistics proposed in this paper are based on the following idea. When the null hypothesis $H_0$ is true, the NPMLE based on the 'parametric' residuals $\hat\varepsilon^w_{0i} = \{Y^w_i - m_{\hat\theta}(X^w_i)\}/\hat\sigma_n(X^w_i)$, where $\hat\theta$ is an appropriate estimator of $\theta$ under $H_0$, will be close to the NPMLE based on the nonparametric residuals $\hat\varepsilon^w_i$ $(i = 1, \ldots, n)$. It is worth noting
that when the data suffer from selection–bias, not only the sampled data are biased, but
also the residuals. The two estimators will be compared through Kolmogorov-Smirnov
and Cramer-von Mises type statistics. Similar test statistics have been considered by Van
Keilegom et al. (2007) and Pardo-Fernandez et al. (2007a) for directly observed and right
censored observations, respectively. See also Neumeyer et al. (2006), Pardo-Fernandez
and Van Keilegom (2006), Einmahl and Van Keilegom (2007) and Pardo-Fernandez et al.
(2007b) for other testing procedures under model (1). However, the above papers use local
constant weights, whereas in this paper we will use local linear smoothing.
The paper is organized as follows. In the next section, we propose an estimator of
the error distribution, and study its asymptotic properties. Section 3 is devoted to the
construction and study of test statistics for the hypothesis (2). A bootstrap approximation
is defined in Section 4, which is a useful alternative for the normal approximations obtained
in Sections 2 and 3. Some simulation results are summarized in Section 5, while Section 6
shows the results of the analysis of data on the Acid Neutralizing capacity of lakes in the
Northeastern states of the U.S. Finally, the proofs of the asymptotic results are given in
the Appendix.
2 Estimation of the error distribution
We start with introducing a number of notations. Let $F_\varepsilon(e) = P(\varepsilon \le e)$ and $F_X(x) = P(X \le x)$. The probability density functions of $F_\varepsilon(e)$ and $F_X(x)$ will be denoted respectively by $f_\varepsilon(e)$ and $f_X(x)$. The support of $X$ is denoted by $R_X$ and is supposed to be a compact subset of $\mathbb{R}$.
In order to estimate the distribution of ε, we first need to estimate m(x) and σ(x).
For the regression function m(x), following Cristobal et al. (2004), we use a local linear
estimator, adjusted for the selection bias in the following way:
\[
\hat m_n(x) = \sum_{i=1}^n W^w_i(x, h_n)\, Y^w_i, \tag{3}
\]
where
\[
W^w_i(x, h_n) = \frac{w^w_i(x, h_n)}{\sum_{j=1}^n w^w_j(x, h_n)},
\]
\[
w^w_i(x, h_n) = \frac{1}{w_i}\left\{ s^w_2(x, h_n)\, K\!\left(\frac{X^w_i - x}{h_n}\right) - s^w_1(x, h_n)\, K\!\left(\frac{X^w_i - x}{h_n}\right)\frac{X^w_i - x}{h_n} \right\},
\]
\[
s^w_k(x, h_n) = \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right)\left(\frac{X^w_i - x}{h_n}\right)^k
\]
$(k = 1, 2)$, $w_i = w(X^w_i, Y^w_i)$, $K$ is a known probability density function (kernel) and $h_n$ is a bandwidth sequence. For estimating $\sigma^2(x)$, we use a local constant estimator (we prefer not to use local linear weights here, as this might lead to negative estimators of the variance for certain values of $x$):
\[
\hat\sigma^2_n(x) = \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right)\left(Y^w_i - \hat m_n(X^w_i)\right)^2. \tag{4}
\]
Next, define $\hat\varepsilon^w_i = \{Y^w_i - \hat m_n(X^w_i)\}/\hat\sigma_n(X^w_i)$ for the resulting residuals, and let
\[
\hat F_\varepsilon(e) = n^{-1} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, I(\hat\varepsilon^w_i \le e) \tag{5}
\]
denote the proposed estimator of $F_\varepsilon(e)$, where
\[
\hat w_H = \left( \frac{1}{n} \sum_{i=1}^n \frac{1}{w_i} \right)^{-1}.
\]
In order to show the weak convergence of the estimator $\hat F_\varepsilon(e)$, we will first develop an asymptotic i.i.d. expansion for $\hat F_\varepsilon(e)$. This asymptotic representation, given in Theorem 2.1 below, shows that $\hat F_\varepsilon(e)$ can be written as the NPMLE based on the (unobserved) errors $\varepsilon^w_i = \{Y^w_i - m(X^w_i)\}/\sigma(X^w_i)$, plus a second term which comes from the difference between $\hat\varepsilon^w_i$ and $\varepsilon^w_i$, plus a remainder term whose order is asymptotically negligible.
The assumptions under which this result holds true, are given in the Appendix.
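Computationally, (5) is just a weighted empirical distribution function of the residuals. The following sketch (our own, for illustration) makes the normalization $\hat w_H$, the harmonic mean of the $w_i$, explicit; by construction the estimator equals 1 at and beyond the largest residual.

```python
import numpy as np

def F_eps_hat(e, resid, w):
    """Estimator (5): weighted e.d.f. of the residuals, each indicator
    weighted by w_H / w_i, with w_H = (n^{-1} sum 1/w_i)^{-1}."""
    resid, w = np.asarray(resid), np.asarray(w)
    inv = 1.0 / w
    wH = 1.0 / inv.mean()
    return float(np.mean(wH * inv * (resid <= e)))

resid = np.array([-1.0, 0.0, 1.0])
w = np.array([1.0, 2.0, 4.0])
```

With these numbers $\hat w_H = 12/7$, so the jump sizes are $(\hat w_H / w_i)/n$ and sum to 1 exactly.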
Theorem 2.1 Assume A1-A4. Then,
\begin{align*}
\hat F_\varepsilon(e) - F_\varepsilon(e)
&= n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\{I(\varepsilon^w_i \le e) - F_\varepsilon(e)\} + f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x) + o_P(n^{-1/2}) \\
&= n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\{I(\varepsilon^w_i \le e) - F_\varepsilon(e) + \varphi(X^w_i, Y^w_i, e)\} + o_P(n^{-1/2}),
\end{align*}
uniformly in $-\infty < e < \infty$, where
\[
\Delta_n(x, e) = \frac{1}{\hat\sigma_n(x)}\left\{ \hat m_n(x) - m(x) + e\,(\hat\sigma_n(x) - \sigma(x)) \right\},
\]
and
\[
\varphi(x, y, e) = f_\varepsilon(e)\left[ \frac{y - m(x)}{\sigma(x)} + \frac{e}{2}\left\{ \left(\frac{y - m(x)}{\sigma(x)}\right)^2 - 1 \right\} \right].
\]
Theorem 2.2 Assume the conditions of Theorem 2.1 are satisfied. Then, the process $n^{1/2}(\hat F_\varepsilon(e) - F_\varepsilon(e))$, $-\infty < e < \infty$, converges weakly to a zero-mean Gaussian process $W(e)$ with covariance function
\[
\operatorname{Cov}(W(e), W(e')) = \mu_w\, E\Big( \frac{1}{w(X, Y)}\, [I(\varepsilon \le e) - F_\varepsilon(e) + \varphi(X, Y, e)]\, [I(\varepsilon \le e') - F_\varepsilon(e') + \varphi(X, Y, e')] \Big).
\]
3 Goodness-of-fit tests for the regression function
Suppose that the null hypothesis $H_0 : m \in \mathcal{M}$ is satisfied, where $\mathcal{M} = \{m_\theta : \theta \in \Theta\}$, and let $\theta_0$ be the true value of $\theta$ under $H_0$. Let $\varepsilon^w_0 = \{Y^w - m_{\theta_0}(X^w)\}/\sigma(X^w)$.
In order to estimate $\theta_0$, define the function
\[
S^w_n(\theta) = n^{-1} \sum_{i=1}^n \frac{1}{w_i}\, \{Y^w_i - m_\theta(X^w_i)\}^2. \tag{6}
\]
This is the sample version of the quadratic expression
\begin{align*}
S^w_0(\theta) &= E\left[ \frac{1}{w(X^w, Y^w)}\, \{Y^w - m_\theta(X^w)\}^2 \right] = \frac{1}{\mu_w} E\left[ \{Y - m_\theta(X)\}^2 \right] \\
&= \frac{1}{\mu_w} E\left[ \{m_\theta(X) - m_{\theta_0}(X)\}^2 \right] + \frac{1}{\mu_w} E\left[ \sigma^2(X) \right].
\end{align*}
Let
\[
\hat\theta_n = \arg\min_\theta S^w_n(\theta),
\]
and define the smoothed parametric estimator $\hat m_0(x)$, similarly as in Härdle and Mammen (1993), by
\[
\hat m_0(x) = \sum_{i=1}^n W^w_i(x, h_n)\, m_{\hat\theta_n}(X^w_i),
\]
where we have used the same kernel and bandwidth as for the nonparametric estimator $\hat m_n(x)$, in order to have the same bias as that of $\hat m_n(x)$.
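When the family $\mathcal{M}$ is linear in the parameters, the minimizer of (6) has a closed form: weighted least squares with per-observation weights $1/w_i$. A minimal sketch, assuming the illustrative family $m_\theta(x) = a + bx$ (our choice, not the paper's):

```python
import numpy as np

def theta_hat(Xw, Yw, w):
    """argmin of S_n^w(theta) in (6) for m_theta(x) = a + b x:
    rescale each row by sqrt(1 / w_i), then solve ordinary least squares."""
    s = 1.0 / np.sqrt(np.asarray(w))
    D = np.column_stack([np.ones_like(Xw), Xw]) * s[:, None]
    theta, *_ = np.linalg.lstsq(D, np.asarray(Yw) * s, rcond=None)
    return theta

# noiseless linear data: the minimizer recovers (a, b) exactly
X = np.linspace(0.0, 5.0, 50)
Y = 3.0 + 0.5 * X
w = 1.0 + Y
```

For a nonlinear family (e.g. the logistic model of Section 5), the same criterion would be minimized numerically instead.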
Lemma 3.1 Assume A1, A5. Then, under $H_0$,
\[
\hat\theta_n - \theta_0 \xrightarrow{P} 0,
\]
\[
\hat\theta_n - \theta_0 = n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\, \eta_{\theta_0}(X^w_i, Y^w_i) + o_P(n^{-1/2}),
\]
and
\[
n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, \Sigma),
\]
where
\[
\eta_\theta(x, y) = \Omega^{-1}\, \frac{\partial m_\theta(x)}{\partial \theta}\, (y - m_\theta(x)),
\qquad
\Omega = E\left[ \frac{\partial m_{\theta_0}(X)}{\partial \theta}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}^T \right],
\]
\[
\Sigma = \mu_w\, \Omega^{-1} E\left[ \frac{\{Y - m_{\theta_0}(X)\}^2}{w(X, Y)}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}^T \right] \Omega^{-1}.
\]
Now, define
\[
\hat F_{\varepsilon 0}(e) = n^{-1} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, I(\hat\varepsilon^w_{0i} \le e), \tag{7}
\]
where $\hat\varepsilon^w_{0i} = \{Y^w_i - \hat m_0(X^w_i)\}/\hat\sigma_n(X^w_i)$ are the residuals estimated under $H_0$.
Before defining the test statistics that we will use for testing $H_0$, it is worth mentioning that, as in the case of complete observations (see Van Keilegom et al. (2007)), the comparison of $\hat F_\varepsilon(e)$ with its parametric counterpart $\hat F_{\varepsilon 0}(e)$ leads to a valid procedure to test the null hypothesis $H_0$.
Theorem 3.1 The null hypothesis $H_0 : m \in \mathcal{M}$ holds if and only if the random variables $\varepsilon$ and $\{Y - m_{\theta_0}(X)\}/\sigma(X)$ have the same distribution, where $\theta_0$ is the value of $\theta$ that minimizes $E[\{m(X) - m_\theta(X)\}^2]$.
The proof of this result can be found in Van Keilegom et al. (2007). Note that $\theta_0$ can also be regarded as the value of $\theta$ that minimizes $S^w_0(\theta) = \mu_w^{-1} E[\{Y - m_\theta(X)\}^2]$.
We are now ready to define the following Kolmogorov-Smirnov and Cramér-von Mises type test statistics for $H_0$:
\[
T_{KS} = n^{1/2} \sup_{-\infty < e < \infty} |\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)|
\]
and
\[
T_{CM} = n \int [\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)]^2\, d\hat F_{\varepsilon 0}(e).
\]
In order to obtain the asymptotic law of these test statistics, we first need the following
two results :
Theorem 3.2 Assume A1-A5. Then, under $H_0$,
\[
\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)
= f_\varepsilon(e)\, n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ \varepsilon^w_i - \int \sigma^{-1}(x) \left( \frac{\partial m_{\theta_0}(x)}{\partial \theta} \right)^T dF_X(x)\; \eta_{\theta_0}(X^w_i, Y^w_i) \right] + o_P(n^{-1/2}),
\]
uniformly in $-\infty < e < \infty$.
As an immediate consequence of this asymptotic expansion, we have the following weak
convergence result. Note that the limiting process has an extremely simple structure,
as it factorizes in a deterministic function only depending on e and a random variable
independent of e.
Corollary 3.1 Assume A1-A5. Then, under $H_0$, the process $n^{1/2}(\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e))$ $(-\infty < e < \infty)$ converges weakly to $f_\varepsilon(e) V$, where $V$ is a zero-mean normal random variable with variance
\[
\operatorname{Var}(V) = \mu_w\, E\left( \frac{1}{w(X, Y)} \left[ \varepsilon - \int \sigma^{-1}(x) \left( \frac{\partial m_{\theta_0}(x)}{\partial \theta} \right)^T dF_X(x)\; \eta_{\theta_0}(X, Y) \right]^2 \right).
\]
It can now be easily shown that (see Van Keilegom et al. (2007) for a similar proof)

Corollary 3.2 Assume A1-A5. Then, under $H_0$,
\[
T_{KS} \xrightarrow{d} \sup_{-\infty < e < \infty} |f_\varepsilon(e)|\, |V|,
\qquad
T_{CM} \xrightarrow{d} \int f_\varepsilon^2(e)\, dF_\varepsilon(e)\; V^2.
\]
4 Bootstrap approximation
The asymptotic distribution of the test statistics TKS and TCM depends on the unknown
error density fε(e). In order to avoid the estimation of this unknown function, we will
approximate the distribution of TKS and TCM by means of a bootstrap procedure. Note
that since the asymptotic representation of Fε(e)−Fε(e) depends on the density fε(e) (see
Theorem 2.1), the use of a smoothed bootstrap procedure is required.
While in the unbiased case, when our sample comes directly from $(X, Y)$, the error $\varepsilon$ and the covariate $X$ are independent, in the selection biased framework the observed errors and covariates are no longer independent. In fact, the conditional distribution of the random variable $\varepsilon^w = \{Y^w - m(X^w)\}/\sigma(X^w)$ given that $X^w = x$ is given by
\[
dF^w(e|x) = \frac{w(x, m(x) + \sigma(x)e)}{\mu_w(x)}\, dF_\varepsilon(e),
\]
where $\mu_w(x) = E[w(X, Y)|X = x]$. Therefore, bootstrap errors will need to be drawn from an estimator of the conditional distribution of $\varepsilon^w$ given $X^w$.
This bootstrap procedure, which to the best of our knowledge is new, can be described in the following way.
1. Define standardized residuals
\[
\hat\varepsilon^{ws}_i = \frac{\hat\varepsilon^w_i - \hat\mu^w_n}{\hat\sigma^w_n}
\]
$(i = 1, \ldots, n)$, where
\[
\hat\mu^w_n = \frac{1}{n} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, \hat\varepsilon^w_i,
\qquad
(\hat\sigma^w_n)^2 = \frac{1}{n} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, (\hat\varepsilon^w_i - \hat\mu^w_n)^2.
\]
2. Let $B$ be the number of bootstrap samples. Then, for $b = 1, \ldots, B$, let
\[
X^*_{i,b} = X^w_i, \qquad Y^*_{i,b} = m_{\hat\theta_n}(X^*_{i,b}) + \hat\sigma_n(X^*_{i,b})\, \varepsilon^*_{i,b}
\]
$(i = 1, \ldots, n)$, where $\varepsilon^*_{i,b}$ are smoothed bootstrap residuals defined as
\[
\varepsilon^*_{i,b} = \sqrt{1 - \kappa_n^2}\; \hat\varepsilon^{ws}_{V(X^*_{i,b})} + \kappa_n Z_{i,b},
\]
where $\kappa_n$ is a bandwidth parameter taken equal to the standard error of the rescaled nonparametric residuals $\hat\varepsilon^{ws}_i$, and $V(X^*_{1,b}), \ldots, V(X^*_{n,b})$ are independent random variables, defined as (for an arbitrary $x$)
\[
P(V(x) = i) = \frac{L\!\left(\frac{X^*_{i,b} - x}{h'_n}\right)}{\sum_{j=1}^n L\!\left(\frac{X^*_{j,b} - x}{h'_n}\right)}
\]
$(i = 1, \ldots, n)$, where $L$ is a probability density function (kernel) and $h'_n$ a bandwidth sequence. Finally, $Z_{1,b}, \ldots, Z_{n,b}$ are i.i.d. random variables with zero mean and unit variance, independent of the data $(X^w_i, Y^w_i)$ $(i = 1, \ldots, n)$.
Let $T^*_{KS,b}$ and $T^*_{CM,b}$ be the test statistics obtained from the bootstrap sample $(X^*_{i,b}, Y^*_{i,b})$ $(i = 1, \ldots, n)$.
3. The critical values of the test statistics $T_{KS}$ and $T_{CM}$ can now be approximated by (for some $0 < \alpha < 1$)
\[
T^*_{KS,([(1-\alpha)B])} \quad \text{and} \quad T^*_{CM,([(1-\alpha)B])},
\]
where for any $b = 1, \ldots, B$, $T^*_{KS,(b)}$ denotes the $b$-th order statistic of $T^*_{KS,1}, \ldots, T^*_{KS,B}$ (and similarly for $T^*_{CM,(b)}$).
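The residual-drawing mechanism of step 2 can be sketched as follows (our own illustration; the kernel choice, the values of $\kappa_n$ and $h'_n$, and the synthetic residuals are assumptions): for each design point, an index $V(x)$ is drawn with kernel probabilities, and the selected standardized residual is perturbed with independent noise, which produces the required smoothing.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_errors(Xs, resid_std, kappa, h):
    """Step 2 of the bootstrap: draw V(x) with probabilities
    proportional to L((X_j - x)/h'), then set
    eps* = sqrt(1 - kappa^2) * resid_std[V(x)] + kappa * Z."""
    def L(u):  # Epanechnikov kernel
        return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    n = len(Xs)
    eps = np.empty(n)
    for i, x in enumerate(Xs):
        p = L((Xs - x) / h)
        p = p / p.sum()              # p.sum() > 0 since L(0) > 0
        V = rng.choice(n, p=p)
        eps[i] = np.sqrt(1.0 - kappa**2) * resid_std[V] \
                 + kappa * rng.standard_normal()
    return eps

Xs = np.linspace(0.0, 5.0, 500)
resid = rng.standard_normal(500)
resid_std = (resid - resid.mean()) / resid.std()   # mean 0, variance 1
eps_star = bootstrap_errors(Xs, resid_std, kappa=0.3, h=0.8)
```

Because the residuals are standardized, the scaling $\sqrt{1 - \kappa_n^2}$ keeps the bootstrap errors at (approximately) unit variance.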
5 Simulations
In order to evaluate the finite sample performance of the proposed tests, we present the
results of two simulation studies, in which both test statistics TKS and TCM are considered
for different models.
The analysis will be focused on the size $\alpha$ and the power of the tests, where $\alpha$ is equal to 0.01 and 0.05. We consider regression functions $m(x)$ of the form $m_\theta(x) + A\delta(x)$, where $A = 0$ corresponds to the null hypothesis. The variance function $\sigma^2(x)$ will be equal to $0.2^2$ and $0.4^2$. We have carried out 500 simulations for each of the tables that will be presented
and 200 simulations for each of the figures, while the number of bootstrap replications was
always taken equal to 500. The results are presented for samples of size 50, 100 and 200 in
the case of the tables and of size 200 for the figures.
For the second model, see below; for the first model, we consider $\mathcal{M} = \{a + bx + cx^2 : -10 \le a, b, c \le 10\}$, while the true regression function is given by
\[
m_{\theta_0}(x) = 8 - 1.25x + 2(x - 1.5)^2
\]
and the perturbation under the alternative is chosen to be δ(x) = (x − 1.5)3. The error
ε is distributed as a uniform random variable on [−5, 5]. The covariate X is uniformly
distributed on [0, 5], and we assume that the response Y is subject to length bias, i.e.
w(x, y) = y. Note that w(x, y) is positive and bounded away from zero for all x, y.
As we can see from the results in Table 1, the level is well respected (take $A = 0$) and the rejection rate increases when $|A|$ increases. We can also see that when $\sigma(x) = 0.4$, the ability of the procedure to discriminate between the null hypothesis and the alternatives decreases in a noticeable manner, which shows that the effect of the variance is in line with what we expected. Table 1 also shows that the Cramér-von Mises test is in general more powerful than the Kolmogorov-Smirnov test. Similar conclusions can be drawn from Figure 1.
α n σ A TKS TCM α n σ A TKS TCM
0.01 50 0.20 -0.250 0.984 0.994 0.05 50 0.20 -0.250 1.000 1.000
-0.125 0.388 0.510 -0.125 0.752 0.794
0.000 0.002 0.010 0.000 0.028 0.056
0.125 0.234 0.344 0.125 0.496 0.632
0.250 0.844 0.952 0.250 0.970 0.992
0.40 -0.250 0.482 0.622 0.40 -0.250 0.810 0.870
-0.125 0.068 0.088 -0.125 0.210 0.246
0.000 0.012 0.012 0.000 0.056 0.072
0.125 0.052 0.078 0.125 0.164 0.240
0.250 0.204 0.312 0.250 0.474 0.574
100 0.20 -0.250 1.000 1.000 100 0.20 -0.250 1.000 1.000
-0.125 0.842 0.916 -0.125 0.978 0.998
0.000 0.004 0.006 0.000 0.038 0.046
0.125 0.574 0.674 0.125 0.856 0.934
0.250 0.998 1.000 0.250 1.000 1.000
0.40 -0.250 0.908 0.980 0.40 -0.250 0.994 0.994
-0.125 0.162 0.236 -0.125 0.410 0.566
0.000 0.010 0.012 0.000 0.060 0.056
0.125 0.068 0.104 0.125 0.214 0.352
0.250 0.534 0.670 0.250 0.814 0.898
200 0.20 -0.250 1.000 1.000 200 0.20 -0.250 1.000 1.000
-0.125 0.998 1.000 -0.125 1.000 1.000
0.000 0.006 0.010 0.000 0.034 0.052
0.125 0.976 0.994 0.125 1.000 0.996
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 0.998 1.000 0.40 -0.250 1.000 1.000
-0.125 0.512 0.678 -0.125 0.828 0.906
0.000 0.000 0.008 0.000 0.038 0.056
0.125 0.178 0.354 0.125 0.538 0.674
0.250 0.964 0.990 0.250 0.998 1.000
Table 1: Rejection rates of H0 for TKS and TCM under the first model.
Figure 1: Power function for $T_{KS}$ and $T_{CM}$ under the first model ($\alpha = 0.05$, $n = 200$; $\sigma = 0.1$ and $\sigma = 0.25$).

For the second model, we consider a model that is sometimes used to address population growth issues. Let
\[
\mathcal{M} = \left\{ \frac{6}{1 + e^{a + bx}} + c : -10 \le a, b, c \le 10 \right\}
\]
and the true regression function
\[
m_{\theta_0}(x) = \frac{6}{1 + e^{1 - 1.5x}} + 2.
\]
In this case, the perturbation is given by
\[
\delta(x) = 2(x - 3)^2,
\]
the error $\varepsilon$ is distributed as a truncated normal that takes values in $(-1.5, 1.5)$, while the covariate $X$ is uniformly distributed on $[0, 6]$. Although in this example we have taken $w(x, y) = \max(x, y)$, we do not see significant discrepancies from the behavior exhibited by the procedure under the previous model. In fact, from Table 2 and Figure 2, we can see again that the rejection proportion agrees with the nominal level of the tests when $A = 0$, and otherwise grows to values close to 1. On the other hand, we can see again that large values of $\sigma$ deteriorate the performance of the tests.
Figure 2: Power function for $T_{KS}$ and $T_{CM}$ under the second model ($\alpha = 0.05$, $n = 200$; $\sigma = 0.1$ and $\sigma = 0.25$).
α n σ A TKS TCM α n σ A TKS TCM
0.01 50 0.20 -0.250 0.994 1.000 0.05 50 0.20 -0.250 0.996 1.000
-0.125 0.774 0.934 -0.125 0.914 0.978
0.000 0.012 0.006 0.000 0.040 0.056
0.125 0.836 0.916 0.125 0.956 0.974
0.250 0.976 0.996 0.250 0.998 1.000
0.40 -0.250 0.734 0.916 0.40 -0.250 0.884 0.972
-0.125 0.244 0.404 -0.125 0.470 0.620
0.000 0.018 0.018 0.000 0.066 0.064
0.125 0.194 0.300 0.125 0.444 0.606
0.250 0.338 0.568 0.250 0.666 0.818
100 0.20 -0.250 1.000 1.000 100 0.20 -0.250 1.000 1.000
-0.125 0.990 0.998 -0.125 0.998 1.000
0.000 0.004 0.012 0.000 0.042 0.060
0.125 0.990 1.000 0.125 1.000 1.000
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 0.992 1.000 0.40 -0.250 1.000 1.000
-0.125 0.510 0.782 -0.125 0.772 0.940
0.000 0.008 0.012 0.000 0.038 0.050
0.125 0.400 0.652 0.125 0.710 0.898
0.250 0.808 0.938 0.250 0.956 0.992
200 0.20 -0.250 1.000 1.000 200 0.20 -0.250 1.000 1.000
-0.125 1.000 1.000 -0.125 1.000 1.000
0.000 0.008 0.008 0.000 0.040 0.040
0.125 1.000 1.000 0.125 1.000 1.000
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 1.000 1.000 0.40 -0.250 1.000 1.000
-0.125 0.894 0.990 -0.125 0.990 1.000
0.000 0.008 0.008 0.000 0.046 0.054
0.125 0.850 0.982 0.125 0.978 0.998
0.250 0.996 1.000 0.250 1.000 1.000
Table 2: Rejection rates of H0 for TKS and TCM under the second model.
6 Data analysis
During the period 1991–1996 the Environmental Monitoring and Assessment Program
(EMAP) of the U.S. Environmental Protection Agency has conducted a survey of lakes
in the Northeastern states of the U.S. with the purpose of detecting acidic waters and
monitoring the water conditions of these lakes; see Breidt et al. (2007) for more details.
13
In short, the survey is based on a total population of 21,026 lakes, from which 334 lakes were surveyed, obtaining one single measurement for each of them. To select the lakes to be included in the survey, a relatively complex sampling design based on a hexagonal grid frame was used; see Larsen et al. (1993). Apart from the acid neutralizing capacity (ANC), the survey also recorded the concentration of different chemicals in the water, such as magnesium (MG) and sulphuric acid and/or sulphate (SO4).
As a result of the complex survey design, not all the lakes have the same chance of being visited, and hence we can no longer think of the measurements of the lakes' chemicals as an i.i.d. sample. It is common in all these grid-based surveys that information other than what is recorded, such as geographic information, is used to determine whether an object, in this case a lake, becomes part of the sample. In those cases, we can think of a multivariate random phenomenon $(X, Z)$ and a selection procedure $\psi$ such that whenever $\psi(Z)$ is equal to 1, we have observations of $X$. Therefore, we observe the random phenomenon $X^w$ defined as $X|\psi(Z) = 1$, whose distribution is given by
\[
dF^w_X(x) = \frac{p(x)}{E[p(X)]}\, dF_X(x),
\]
where $p(x) = P(\psi(Z) = 1|X = x)$. As a consequence we need to know, as is common in surveys, the inclusion probability $p(x_i)$ of the $i$-th lake in the sample.
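The compensation implied by the formula above amounts to reweighting each sampled unit by $1/p(x_i)$, as in a Hájek-type estimator. A small illustration with entirely synthetic numbers (not the EMAP data): when inclusion probabilities grow with the measurement, the naive sample mean overshoots, while the reweighted mean recovers the population value.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic population: measurement x, inclusion probability p(x)
x = rng.uniform(0.0, 1.0, 100000)
p = 0.1 + 0.8 * x                        # larger values sampled more often
sampled = rng.uniform(size=x.size) < p   # selection psi(Z) = 1
xs, ps = x[sampled], p[sampled]

naive = xs.mean()                            # biased toward large x
hajek = np.sum(xs / ps) / np.sum(1.0 / ps)   # compensated by 1 / p(x_i)
```

Here $E(X) = 0.5$, the naive mean converges to $E(Xp(X))/E(p(X)) \approx 0.63$, and the compensated mean stays close to 0.5.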
The aim of this example is to find an appropriate parametric regression function for
the model regressing ANC on the chemical measurement SO4. In order to do so, we will
consider different regression models, exploring in this way the practical performance of
our technique. For each of the models we calculate the p–value corresponding to both the
statistics TKS and TCM , by using the bootstrap approach outlined in Section 4.
Model TKS TCM
ANC ~ 1 0.000 0.000
ANC ~ SO4 0.012 0.040
ANC ~ SO4 + 1 0.314 0.480
ANC ~ SO4 + SO4^2 0.664 0.530
ANC ~ SO4^2 + 1 0.000 0.000
ANC ~ log(SO4) 0.004 0.004
ANC ~ exp(SO4) + 1 0.000 0.000
ANC ~ sin(SO4) 0.002 0.000
ANC ~ sin(SO4) + 1 0.000 0.000
Table 3: p–values for TKS and TCM
We have chosen a bandwidth of 90 for estimating the regression function mn(·) and a
bandwidth of 108 to estimate the variance function σ2n(·). From Table 3 we can see that
both the models ANC ~ SO4 + 1 and ANC ~ SO4 + SO4^2 seem to be suitable to explain
the data behavior. In Figure 3 we show the scatterplot of the observed data jointly with the
compensated local linear fit and the compensated quadratic fit (i.e. ANC ~ SO4 + SO4^2).
The two curves are close to each other, which confirms the results of the goodness-of-fit
test. The parameter estimates of this quadratic model are given in Table 4.
Figure 3: Scatterplot of the data together with the estimated regression curves. The
dashed curve represents the compensated local linear estimator, while the solid curve is the
compensated quadratic estimator.
Estimate
Intercept 65.9538
SO4 0.6514
SO4^2 0.0035
Table 4: Parameter estimates obtained for ANC versus SO4.
Finally, we compare the estimators of the residual density obtained with and without compensating for the biased sampling. Figure 4 suggests that the skewness of the residuals computed without taking the sampling probabilities into account is much more pronounced than when the sampling mechanism is accounted for.
Figure 4: Estimation of the error density for ANC versus SO4. The solid curve represents the
unbiased (compensated) estimator, the dashed curve is the biased (unweighted) estimator.
Appendix : Proofs
Assumptions
A1 The function w satisfies w (X, Y ) > C with probability one, for some C > 0.
A2 The bandwidth sequence $h_n$ satisfies $nh_n^4 \to 0$ and $nh_n^{3+2\delta}/\log h_n^{-1} \to \infty$ for some $\delta > 0$, and the kernel $K$ is a symmetric probability density function with bounded support which is twice continuously differentiable.
A3 The functions $F_X$ and $m$ are three times continuously differentiable, and $\sigma$ is twice continuously differentiable on the interior of the support $R_X$. Moreover, $f_X(x), \sigma(x) > C_1$ for all $x$ in $R_X$ and for some $C_1 > 0$.
A4 The random error $\varepsilon$ satisfies
(i) $E[\varepsilon^4] < \infty$.
(ii) $F_\varepsilon$ is twice continuously differentiable, and both $e f_\varepsilon(e)$ and $e^2 f'_\varepsilon(e)$ are uniformly bounded.
A5 (i) $\Theta$ is a compact subspace of $\mathbb{R}^d$.
(ii) $\inf_{\|\theta - \theta_0\| > \epsilon} E\big( (m_\theta(X) - m_{\theta_0}(X))^2 \big) > 0$ for all $\epsilon > 0$.
(iii) $\theta_0$ is an interior point of $\Theta$.
(iv) $\Omega$ is non-singular.
(v) The function $m_\theta(x)$ is twice continuously differentiable with respect to $\theta$ for all $x$, and the derivatives are uniformly bounded in $x$ and $\theta$.
Proposition A.1 If assumptions A1, A2, A3 and A4(i) are satisfied, then
\[
\sup_x |\hat m_n(x) - m(x)| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big),
\qquad
\sup_x |\hat\sigma_n(x) - \sigma(x)| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. The first result is a direct consequence of Proposition 3.1 in Ojeda (2008). Note that the conditions in that theorem are fulfilled, because of assumptions A2 and A3 and because
\[
E\left[ \left( \frac{Y^w - m(X^w)}{w(X^w, Y^w)} \right)^2 \right]
= \frac{1}{\mu_w}\, E\left[ \frac{(Y - m(X))^2}{w(X, Y)} \right]
\le \frac{1}{C \mu_w}\, \sup_x |\sigma(x)|^2 < \infty. \tag{8}
\]
For the second result, first note that
\[
\hat\sigma_n(x) - \sigma(x) = \frac{\hat\sigma_n^2(x) - \sigma^2(x)}{\hat\sigma_n(x) + \sigma(x)},
\]
and hence it suffices to show that $\sup_x |\hat\sigma_n^2(x) - \sigma^2(x)| = O((nh_n)^{-1/2} (\log n)^{1/2})$ a.s. Next,
\[
\hat\sigma_n^2(x) = \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(X^w_i)\}^2 + o\big( (nh_n)^{-1/2} (\log n)^{1/2} \big),
\]
almost surely and uniformly in $x$, which can be shown using standard lines of proof. Hence, it suffices to show that
\[
\sup_x \left| \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(X^w_i)\}^2 - \sigma^2(x) \right| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely, and this follows again from Proposition 3.1 in Ojeda (2008) (with $p = 0$), since
\[
E\left[ \left( \frac{\{Y^w - m(X^w)\}^2 - \sigma^2(X^w)}{w(X^w, Y^w)} \right)^2 \right] < \infty,
\]
which follows from assumption A4(i). $\Box$
Proposition A.2 If assumptions A1, A2 and A3 are satisfied, then
\[
\hat m_n(x) - m(x) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(x)\} + O\big( (nh_n)^{-1} \log n \big),
\]
uniformly in $x$ and almost surely. If in addition assumption A4(i) is satisfied, then
\[
\hat\sigma_n(x) - \sigma(x) = \frac{\mu_w}{2 n h_n f_X(x) \sigma(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \big( \{Y^w_i - m(x)\}^2 - \sigma^2(x) \big) + O\big( (nh_n)^{-1} \log n \big),
\]
uniformly in $x$ and almost surely.
Proof. For the first expansion, first note that from Proposition A.1 in Ojeda et al. (2008) we have that
\[
\sum_{i=1}^n w^w_i(x, h_n) = (n h_n)^2 \left( \frac{1}{\mu_w^2}\, \mu_2 f_X^2(x) + O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big) \right)
\]
uniformly in $x$ and almost surely. Furthermore, in the same way we can show that
\[
\sum_{i=1}^n w^w_i(x, h_n)\, Y^w_i = (n h_n)^2 \left( \frac{1}{\mu_w^2}\, \mu_2\, m(x) f_X^2(x) + O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big) \right)
\]
uniformly in $x$ and almost surely. Hence, from the definition of $\hat m_n(x)$ in equation (3) we obtain that
\[
\hat m_n(x) - m(x) = \frac{\mu_w^2}{(n h_n)^2 \mu_2 f_X^2(x)} \sum_{i=1}^n w^w_i(x, h_n)\, (Y^w_i - m(x)) + O\big( (nh_n)^{-1} \log n \big)
\]
uniformly in $x$ and almost surely. From the definition of $w^w_i(x, h_n)$ and the almost sure and uniform expansions for $s^w_1(x; h_n)$ and $s^w_2(x; h_n)$ given in Proposition A.1 in Ojeda et al. (2008), it now follows that
\[
\hat m_n(x) - m(x) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) (Y^w_i - m(x)) + O\big( (nh_n)^{-1} \log n \big)
\]
uniformly in $x$ and almost surely.
The proof for the expansion of $\hat\sigma_n(x) - \sigma(x)$ follows in a similar way, taking into account the expansion for $s^w_0(x; h_n)$ and the fact that $Y^w_i - \hat m_n(X^w_i)$ can be written as $Y^w_i - m(X^w_i) + m(X^w_i) - \hat m_n(X^w_i)$. See also Liero (2003) for the case where $w_i = 1$ $(i = 1, \ldots, n)$. $\Box$
18
Proposition A.3 If assumptions A1, A2, A3 and A4(i) are satisfied, then
\[
\sup_x |\hat m'_n(x) - m'(x)| = O\big( (n h_n^3)^{-1/2} (\log n)^{1/2} \big),
\qquad
\sup_x |\hat\sigma'_n(x) - \sigma'(x)| = O\big( (n h_n^3)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. This follows along the same lines as in the proof of Proposition A.1, but using Proposition 3.2 in Ojeda (2008) instead of Proposition 3.1. $\Box$
Proposition A.4 If assumptions A1, A2, A3 and A4(i) are satisfied, then for $0 < \delta < 1$,
\[
\sup_{x, x'} \frac{|\hat m'_n(x) - m'(x) - \hat m'_n(x') + m'(x')|}{|x - x'|^\delta} = O\big( (n h_n^{3+2\delta})^{-1/2} (\log n)^{1/2} \big),
\]
\[
\sup_{x, x'} \frac{|\hat\sigma'_n(x) - \sigma'(x) - \hat\sigma'_n(x') + \sigma'(x')|}{|x - x'|^\delta} = O\big( (n h_n^{3+2\delta})^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. This follows along the same lines as in the proof of Proposition A.1, but using Theorem 3.2 in Ojeda (2008) instead of Proposition 3.1. $\Box$
For the next lemma, we need to introduce the following expression ($-\infty < e < \infty$):
\[
\Psi_n(e) = \frac{1}{n} \sum_{i=1}^n \left( \frac{\mu_w}{w_i}\, I(\hat\varepsilon^w_i \le e) - \frac{\mu_w}{w_i}\, I(\varepsilon^w_i \le e) - P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) + P(\varepsilon \le e) \right).
\]
Lemma A.1 If assumptions A1, A2 and A3 are satisfied, then
\[
\sup_{-\infty < e < \infty} |\Psi_n(e)| = o_P\left( \frac{1}{\sqrt{n}} \right),
\]
where $P(\hat\varepsilon \le e \,|\, \mathcal{X}_n)$ stands for the distribution of $\hat\varepsilon = \hat\sigma_n(X)^{-1}(Y - \hat m_n(X))$ conditional on the given values of the sample.
Proof. The proof is heavily based on Lemma B1 in Akritas and Van Keilegom (2001), which considers the unweighted case, i.e. $w_i = 1$ for all $i$. Let $d_{n1}$ and $d_{n2}$ be defined as in that paper, but with $\hat m_n(x)$ and $\hat\sigma_n(x)$ as in equations (3) and (4), so we use respectively local linear and local constant estimators for the regression and variance functions.
Because of Propositions A.1, A.3 and A.4, we have that $P(d_{n1} \in C_1^{1+\delta}(R_X))$ and $P(d_{n2} \in C_2^{1+\delta}(R_X))$ tend to 1, where $C_1^{1+\delta}(R_X)$ and $C_2^{1+\delta}(R_X)$ are defined as in Akritas and Van Keilegom (2001). Next, define the class
\[
\mathcal{F} = \left\{ (x, \varepsilon) \to \frac{\mu_w}{\omega(x, \varepsilon)}\, I(\varepsilon \le e\, d_2(x) + d_1(x)) \;:\; -\infty < e < \infty,\; d_1 \in C_1^{1+\delta}(R_X),\; d_2 \in C_2^{1+\delta}(R_X) \right\},
\]
where $\omega(x, \varepsilon) = w(x, y)$ and $y = m(x) + \sigma(x)\varepsilon$.
The computation of the bracketing number of the class $\mathcal{F}$ can be done in almost the same way as in Akritas and Van Keilegom (2001), and is therefore omitted. In fact, the weights $\omega(x, \varepsilon)$ do not have any significant effect on the computation of this bracketing number. This implies that the class $\mathcal{F}$ is Donsker. Next, let $V_n(x, e) = e\, d_{n2}(x) + d_{n1}(x)$ and note that
\begin{align*}
& \operatorname{Var}\left[ \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, I(\varepsilon^w \le V_n(X^w, e)) - \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, I(\varepsilon^w \le e) - P(\varepsilon \le V_n(X, e)) + P(\varepsilon \le e) \right] \\
&= \operatorname{Var}\left[ \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, \{ I(\varepsilon^w \le V_n(X^w, e)) - I(\varepsilon^w \le e) \} \right] \\
&\le E\left[ \left( \frac{\mu_w}{\omega(X^w, \varepsilon^w)} \right)^2 \{ I(\varepsilon^w \le V_n(X^w, e)) - I(\varepsilon^w \le e) \}^2 \right] \\
&\le \frac{\mu_w}{C}\, E\big[ I(\varepsilon \le V_n(X, e)) - I(\varepsilon \le e) \big]^2,
\end{align*}
and this can be easily seen to converge to zero almost surely by using the same arguments as in the proof of Lemma B1 in Akritas and Van Keilegom (2001), and by using Proposition A.1. The result now follows from Corollary 2.3.12 in van der Vaart and Wellner (1996). $\Box$
Proof of Theorem 2.1. From the definition of $\hat F_\varepsilon(e)$ we obtain that
\begin{align*}
\hat F_\varepsilon(e) - F_\varepsilon(e)
&= \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right]
+ \left( \frac{\hat w_H}{\mu_w} - 1 \right) \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right] \\
&= \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right] (1 + o_P(1))
\end{align*}
uniformly in $e$, because $\mu_w^{-1} \hat w_H = 1 + o_P(1)$. Therefore, it follows from Lemma A.1 that
\[
\hat F_\varepsilon(e) - F_\varepsilon(e)
= \left[ \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i}\, \{ I(\varepsilon^w_i \le e) - F_\varepsilon(e) \} + P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) - P(\varepsilon \le e) \right] (1 + o_P(1)) + o_P(n^{-1/2}),
\]
uniformly in $e$. Moreover, taking into account that $P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) = P(\varepsilon \le e + \Delta_n(X, e) \,|\, \mathcal{X}_n)$ (where $\Delta_n(X, e)$ is defined in the statement of the theorem), we obtain that
\[
P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) - P(\varepsilon \le e) = f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x) + o_P(n^{-1/2})
\]
uniformly in $e$, because of assumption A4. Using Proposition A.2 we find that
\[
f_\varepsilon(e)\, \Delta_n(x, e) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \varphi(x, Y^w_i, e) + o_P(n^{-1/2})
\]
uniformly in $x$ and $e$, where the function $\varphi$ is given in the statement of the theorem. Now we proceed as in the proof of Theorem 3.1 in Akritas and Van Keilegom (2001), which leads to the given asymptotic expansion for $f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x)$. $\Box$
Proof of Theorem 2.2. Following similar ideas to those in the proof of Theorem 3.2 in Akritas and Van Keilegom (2001), we consider the class of functions
\[
\mathcal{F} = \left\{ (x, y) \to \frac{\mu_w}{w(x, y)} \left[ I\left( \frac{y - m(x)}{\sigma(x)} \le e \right) - F_\varepsilon(e) + \varphi(x, y, e) \right] : e \in \mathbb{R} \right\}.
\]
Using the notation $\omega(x, \varepsilon) = w(x, y)$ for $y = m(x) + \sigma(x)\varepsilon$, we can reparametrize $\mathcal{F}$ as
\[
\mathcal{F}' = \left\{ (x, \varepsilon) \to \frac{\mu_w}{\omega(x, \varepsilon)} \left[ I(\varepsilon \le e) - F_\varepsilon(e) + f_\varepsilon(e)\{A(\varepsilon) + B(\varepsilon)\, e\} \right] : e \in \mathbb{R} \right\},
\]
where $A$ and $B$ are the fixed functions $A(\varepsilon) = \varepsilon$ and $B(\varepsilon) = (\varepsilon^2 - 1)/2$. In order to show the weak convergence of the process given in the statement of the theorem, we need to show that the class $\mathcal{F}'$ is Donsker, which we will establish by calculating the bracketing number $N_{[\,]}(\epsilon, \mathcal{F}', L_2)$ of $\mathcal{F}'$ with respect to the joint probability measure of $(X^w, \varepsilon^w)$.
The bracketing number of the class $\{\varepsilon \to I(\varepsilon \le e) : e \in \mathbb{R}\}$ is $O(\exp(K\epsilon^{-1}))$ because of Theorem 2.7.5 in van der Vaart and Wellner (1996). Next, note that $A(\varepsilon)$, $B(\varepsilon)$ and $\omega(x, \varepsilon)$ are independent of $e$, while the functions $e \to F_\varepsilon(e)$, $e \to f_\varepsilon(e)$ and $e \to e f_\varepsilon(e)$ are deterministic bounded functions, whose bracketing number is $O(\epsilon^{-1})$. Therefore, Theorems 2.5.6 and 2.10.6 in van der Vaart and Wellner (1996) entail that the class $\mathcal{F}'$ is Donsker. $\Box$
Proof of Lemma 3.1. The first part of the lemma will be proved by verifying the two
conditions of Theorem 5.7 in van der Vaart (1998) with M_n(θ) = −S^w_n(θ) and M(θ) =
−S^w_0(θ). As m_θ(x) is a continuous function of θ for all x, w(X^w, Y^w)^{-1}(Y^w − m_θ(X^w))²
is continuous as well. Hence, as a consequence of Theorem 2 in Jennrich (1969), we have
that sup_θ |S^w_n(θ) − S^w_0(θ)| = o_p(1), so the first condition in van der Vaart (1998) is
satisfied. The second condition is a consequence of assumption A5 (ii).
Next, using assumption A5 (v) and a Taylor series expansion, we find that

θ̂_n − θ_0 = −(∂²S^w_n(θ̃_n)/∂θ²)^{-1} ∂S^w_n(θ_0)/∂θ,

where θ̃_n lies in the open segment (θ_0, θ̂_n) and ∂²S^w_n(θ̃_n)/∂θ² is the matrix with entries
∂²S^w_n(θ̃_n)/∂θ_k ∂θ_l (k, l = 1, …, d). Next, note that

∂S^w_n(θ_0)/∂θ = −(2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ_0}(X^w_i)/∂θ] (Y^w_i − m_{θ_0}(X^w_i)),

while

∂²S^w_n(θ̃_n)/∂θ² = (2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ̃_n}(X^w_i)/∂θ] [∂m_{θ̃_n}(X^w_i)/∂θ]^T
                  − (2/n) Σ_{i=1}^n (1/w_i) [∂²m_{θ̃_n}(X^w_i)/∂θ²] (Y^w_i − m_{θ̃_n}(X^w_i))
                = (2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ_0}(X^w_i)/∂θ] [∂m_{θ_0}(X^w_i)/∂θ]^T
                  − (2/n) Σ_{i=1}^n (1/w_i) [∂²m_{θ_0}(X^w_i)/∂θ²] (Y^w_i − m_{θ_0}(X^w_i)) + o_p(1),   (9)

which follows easily from the fact that θ̃_n − θ_0 = o_p(1). The first term on the right-hand
side of (9) equals

2 E[(1/w(X^w, Y^w)) (∂m_{θ_0}(X^w)/∂θ)(∂m_{θ_0}(X^w)/∂θ)^T] + o_p(1) = (2/µ_w) Ω + o_p(1),

because of assumption A5 (v). The second term of (9) is o_p(1), since its expected value
equals zero. Hence, taking into account that ∂S^w_n(θ_0)/∂θ = O_p(n^{-1/2}) and using assumption
A5 (iv), we have that

θ̂_n − θ_0 = (1/n) Σ_{i=1}^n (µ_w/w_i) Ω^{-1} [∂m_{θ_0}(X^w_i)/∂θ] (Y^w_i − m_{θ_0}(X^w_i)) + o_p(n^{-1/2})
           = (1/n) Σ_{i=1}^n (µ_w/w_i) η_{θ_0}(X^w_i, Y^w_i) + o_p(n^{-1/2}). □
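To make the criterion concrete, here is a minimal sketch (our own illustration, not code from the paper) of the weighted least-squares estimator θ̂_n minimising S^w_n(θ) = n^{-1} Σ_i w_i^{-1}(Y^w_i − m_θ(X^w_i))², for the hypothetical linear family m_θ(x) = θ_1 + θ_2 x and the weight function w(x, y) = y:

```python
# Sketch of the weighted least-squares criterion S_n^w used in Lemma 3.1,
# for the hypothetical linear family m_theta(x) = theta_1 + theta_2 * x and
# the weight function w(x, y) = y (both our own illustrative choices).
import numpy as np

rng = np.random.default_rng(1)

# Draw from the y-biased distribution by rejection sampling.
xs, ys = [], []
while len(xs) < 10000:
    x = rng.uniform(1.0, 2.0)
    y = 2.0 * x + 0.25 * rng.standard_normal()   # true theta_0 = (0, 2)
    if rng.uniform() < y / 6.0:
        xs.append(x)
        ys.append(y)
x, y = np.array(xs), np.array(ys)

# Minimizing S_n^w(theta) = n^{-1} sum_i w_i^{-1} (Y_i - theta_1 - theta_2 X_i)^2
# is a weighted least-squares problem with weights 1/w_i = 1/Y_i; for a linear
# family it reduces to the weighted normal equations solved below.
W = 1.0 / y
X = np.column_stack([np.ones_like(x), x])
theta_hat = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

print(theta_hat)   # close to (0, 2) despite the biased sampling
```

Because each squared residual is reweighted by 1/w_i, the estimating equations are unbiased under the biased sampling and θ̂_n recovers θ_0 up to sampling error; these 1/w_i factors are exactly the correction that Lemma 3.1 builds into η_{θ_0}.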
Proof of Theorem 3.2. Following the same arguments as in the proof of Theorem 2.1,
we find that

F̂_{ε0}(e) − F_ε(e) = (1/n) Σ_{i=1}^n (µ_w/w_i) {I(ε^w_i ≤ e) − F_ε(e)} + f_ε(e) ∫ Δ_{n0}(x, e) dF_X(x) + o_p(n^{-1/2})

uniformly in e, where

Δ_{n0}(x, e) = (1/σ(x)) {m_{θ̂_n}(x) − m(x) + e (σ̂_n(x) − σ(x))}.

Combining this with the asymptotic expansion of F̂_ε(e) − F_ε(e) given in Theorem 2.1, we
have

F̂_ε(e) − F̂_{ε0}(e) = f_ε(e) ∫ D_n(x) dF_X(x) + o_p(n^{-1/2})

uniformly in e, where

D_n(x) = (1/σ(x)) {m̂_n(x) − m(x) − (m_{θ̂_n}(x) − m(x))}.

Therefore, the expansions given in Proposition A.2 and Lemma 3.1 lead to

F̂_ε(e) − F̂_{ε0}(e) = f_ε(e) (1/n) Σ_{i=1}^n (µ_w/w_i) {ε^w_i − M^T_{θ_0} η_{θ_0}(X^w_i, Y^w_i)} + o_p(n^{-1/2}),

where

M_{θ_0} = ∫ (1/σ(x)) [∂m_{θ_0}(x)/∂θ] dF_X(x),

and this proves the result. □
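The expansion above suggests comparing the two weighted residual distributions through a Kolmogorov-Smirnov-type distance. The sketch below is our own construction: the Gaussian kernel, the ad hoc bandwidth h = 0.15 and the data-generating design are illustrative assumptions, and for simplicity the true scale σ = 0.25 is plugged in. It computes sup_e |F̂_ε(e) − F̂_{ε0}(e)| on simulated selection-biased data for which the parametric model holds:

```python
# Illustrative sketch (our own construction, with hypothetical kernel,
# bandwidth and model choices): the sup-distance between the weighted
# residual distributions compared in Theorem 3.2, computed on simulated
# y-biased data for which m_theta(x) = theta_1 + theta_2 * x holds.
import numpy as np

rng = np.random.default_rng(2)

# y-biased sample from Y = 2X + 0.25 * eps (rejection sampling).
xs, ys = [], []
while len(xs) < 2000:
    u = rng.uniform(1.0, 2.0)
    v = 2.0 * u + 0.25 * rng.standard_normal()
    if rng.uniform() < v / 6.0:
        xs.append(u)
        ys.append(v)
x, y = np.array(xs), np.array(ys)
w = y                # known weight function w(x, y) = y
sigma = 0.25         # the true scale is plugged in for simplicity

# Weighted parametric fit (cf. Lemma 3.1) for the linear family.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ ((1.0 / w)[:, None] * X), X.T @ (y / w))

# Weighted Nadaraya-Watson estimate of m (Gaussian kernel, ad hoc bandwidth).
h = 0.15
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
m_hat = (K @ (y / w)) / (K @ (1.0 / w))

eps_np = (y - m_hat) / sigma       # residuals from the nonparametric fit
eps_p = (y - X @ theta) / sigma    # residuals from the parametric fit

# Weighted empirical distributions and their sup-distance.
def wecdf(res, grid, wts):
    return np.array([np.sum((res <= e) / wts) for e in grid]) / np.sum(1.0 / wts)

grid = np.sort(np.concatenate([eps_np, eps_p]))
T = np.max(np.abs(wecdf(eps_np, grid, w) - wecdf(eps_p, grid, w)))
print(T)   # small when the parametric model holds
```

Under the null hypothesis the statistic is of order n^{-1/2} (up to smoothing effects), so small values are expected here; in practice the paper approximates the critical values of such statistics by bootstrap.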
Acknowledgements. The authors would like to thank Jean Opsomer (Colorado State
University) for helpful discussions and for providing the data set analyzed in this paper.
The first author acknowledges financial support from the D.G.A. (CONSI+D) and CAI
program ‘Europa’ and DGI Grant MTM2005-01464. The second author acknowledges fi-
nancial support from IAP research network P6/03 of the Belgian Government (Belgian
Science Policy).
References
Akritas, M.G. and Van Keilegom, I. (2001). Nonparametric estimation of the residual
distribution. Scand. J. Statist., 28, 549–568.
Breidt, F.J., Opsomer, J.D., Johnson, A.A. and Ranalli, M.G. (2007). Semiparametric
model-assisted estimation for natural resource surveys. Survey Methodology, 33, 35–44.
Cheng, F. (2004). Weak and strong uniform consistency of a kernel error density estimator
in nonparametric regression. J. Statist. Plann. Inference, 119, 95–107.
Cristóbal, J.A., Ojeda, J.L. and Alcalá, J.T. (2004). Confidence bands in nonparametric
regression with length biased data. Ann. Inst. Statist. Math., 56, 475–496.
Einmahl, J.H.J. and Van Keilegom, I. (2007). Tests for independence in nonparametric
regression. Statist. Sinica, to appear.
Gill, R.D., Vardi, Y. and Wellner, J.A. (1988). Large sample theory of empirical distribu-
tions in biased sampling models. Ann. Statist., 16, 1069–1112.
Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regres-
sion fits. Ann. Statist., 21, 1926–1947.
Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators. Ann.
Math. Statist., 40, 633–643.
Larsen, D.P., Thornton K.W., Urquhart, N.S. and Paulsen, S.G. (1993). Overview of
survey design and lake selection. EMAP - Surface Waters 1991 Pilot Report. U.S.
Environmental Protection Agency. Edited by Larsen, D.P. and Christie, S.J. Technical
Report EPA/620/R-93/003.
Liero, H. (2003). Testing homoscedasticity in nonparametric regression. J. Nonpar. Statist.,
15, 31–51.
Mahfoud, M. and Patil, G.P. (1982). On weighted distributions. In: Statistics and Proba-
bility: Essays in Honor of C.R. Rao, G. Kallianpur et al. (eds.), pp. 479–492. North
Holland Publishing Co., Amsterdam.
Müller, U.U., Schick, A. and Wefelmeyer, W. (2004a). Estimating functionals of the error
distribution in parametric and nonparametric regression. J. Nonparam. Statist., 16,
525–548.

Müller, U.U., Schick, A. and Wefelmeyer, W. (2004b). Estimating linear functionals of the
error distribution in nonparametric regression. J. Statist. Plann. Infer., 119, 75–93.
Neumeyer, N., Dette, H. and Nagel, E.-R. (2006). Bootstrap tests for the error distribution
in linear and nonparametric regression models. Austr. N. Zeal. J. Statist., 48, 129–156.
Ojeda, J.L. (2008). Hölder continuity properties of the local polynomial estimator (sub-
mitted) (see http://www.unizar.es/galdeano/preprints/lista.html).
Ojeda, J.L., Cristóbal, J.A. and Alcalá, J.T. (2008). A comparison of two bootstrap schemes
on goodness-of-fit tests in the presence of selection bias (submitted) (see
http://www.unizar.es/galdeano/preprints/lista.html).
Pardo-Fernández, J.C. and Van Keilegom, I. (2006). Comparison of regression curves with
censored responses. Scand. J. Statist., 33, 409–434.
Pardo-Fernández, J.C., Van Keilegom, I. and González-Manteiga, W. (2007a). Goodness-
of-fit tests for parametric models in censored regression. Canad. J. Statist., 35, 249–264.

Pardo-Fernández, J.C., Van Keilegom, I. and González-Manteiga, W. (2007b). Testing for
the equality of k regression curves. Statist. Sinica, 17, 1115–1137.
Patil, G.P. and Taillie, C. (1989). Probing encountered data, meta analysis and weighted
distribution methods. In: Statistical data analysis and inference, Y. Dodge (ed.), pp.
317–345. Elsevier, Amsterdam.
Rao, C.R. (1997). Statistics and Truth. Putting chance to work. World Scientific, Singa-
pore.
van der Vaart, A.W. (1998). Asymptotic statistics. Cambridge University Press.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak convergence and empirical processes.
Springer, New York.
Van Keilegom, I. and Akritas, M.G. (1999). Transfer of tail information in censored re-
gression models. Ann. Statist., 27, 1745–1784.
Van Keilegom, I., González-Manteiga, W. and Sánchez-Sellero, C. (2007). Goodness-of-fit
tests in parametric regression based on the estimation of the error distribution. TEST,
to appear.
Vardi, Y. (1982). Nonparametric estimation in the presence of length bias. Ann. Statist.,
10, 616–620.
Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist., 13,
178–205.
Address for correspondence:

Fac. de Ciencias, Edif. Matemáticas, Universidad de Zaragoza, Pedro Cerbuna, núm. 12,
50009 Zaragoza, Spain. Email: [email protected]

Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays, 20,
B-1348 Louvain-la-Neuve, Belgium. Email: [email protected]