TECHNICAL REPORT 08011

GOODNESS-OF-FIT TESTS FOR PARAMETRIC REGRESSION WITH SELECTION BIASED DATA

OJEDA CABRERA, J.L. and I. VAN KEILEGOM

IAP STATISTICS NETWORK
INTERUNIVERSITY ATTRACTION POLE
http://www.stat.ucl.ac.be/IAP
Goodness-of-fit tests for parametric regression with
selection biased data
Jorge L. OJEDA CABRERA
Universidad de Zaragoza
Ingrid VAN KEILEGOM
Université catholique de Louvain
January 24, 2008
Abstract
Consider the nonparametric location-scale regression model Y = m(X) + σ(X)ε,
where the error ε is independent of the covariate X, and m and σ are smooth but
unknown functions. The pair (X,Y ) is allowed to be subject to selection bias. We
construct tests for the hypothesis that m(·) belongs to some parametric family of
regression functions. The proposed tests compare the nonparametric maximum like-
lihood estimator (NPMLE) based on the residuals obtained under the assumed para-
metric model, with the NPMLE based on the residuals obtained without using the
parametric model assumption. The asymptotic distribution of the test statistics is
obtained. A bootstrap procedure is proposed to approximate the critical values of
the tests. Finally, the finite sample performance of the proposed tests is studied in a
simulation study, and the developed tests are applied on environmental data.
Key words and phrases. Biased sampling, bootstrap, empirical process, goodness-of-fit test, het-
eroscedastic model, location-scale regression, model diagnostics, nonparametric regression, weak
convergence.
1 Introduction
Consider the nonparametric location-scale regression model
Y = m(X) + σ(X)ε, (1)
where Y is the variable of interest, X is a covariate, the error ε is independent of X,
E(ε) = 0, Var(ε) = 1, and m(·) and σ2(·) are smooth but unknown regression and variance
curves, respectively.
Suppose that we cannot observe the pair $(X, Y)$ directly, but that our sample comes from $(X^w, Y^w)$, a bivariate random vector whose distribution is given by
\[
dF^w(x, y) = \frac{w(x, y)}{\mu_w}\, dF(x, y),
\]
where $F(x, y)$ is the bivariate distribution of $(X, Y)$, and $\mu_w$ is the mean value of $w(X, Y)$. The weight function $w(X, Y)$ drives the relationship between the observed random vector $(X^w, Y^w)$ and the unobserved random vector $(X, Y)$. Let $(X^w_1, Y^w_1), \ldots, (X^w_n, Y^w_n)$ be $n$ independent replications of $(X^w, Y^w)$.
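To make the sampling mechanism concrete, the weighted law $dF^w = w\, dF / \mu_w$ can be simulated by rejection: draw from $F$ and accept with probability proportional to $w$. The sketch below is our own illustration (the design, error law and numerical choices are assumptions, not taken from the paper); it uses length bias $w(x, y) = y$, under which large responses are over-represented.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_xy():
    # hypothetical design: X ~ U[0, 5], Y = 2 + X + 0.2 * eps, eps ~ U[-1, 1]
    x = rng.uniform(0.0, 5.0)
    return x, 2.0 + x + 0.2 * rng.uniform(-1.0, 1.0)

def biased_sample(n, w, w_max):
    """Draw n replications of (X^w, Y^w), whose law is
    dF^w(x, y) = w(x, y) / mu_w dF(x, y), by rejection sampling:
    accept a draw (X, Y) from F with probability w(X, Y) / w_max,
    where w_max bounds w on the support."""
    out = []
    while len(out) < n:
        x, y = draw_xy()
        if rng.uniform() < w(x, y) / w_max:
            out.append((x, y))
    return np.array(out)

# length bias w(x, y) = y: units with large responses are over-represented,
# so the sample mean of Y^w overshoots E(Y) = 4.5
sample = biased_sample(2000, lambda x, y: y, w_max=7.2)
```

Here the biased mean estimates $E(Y^2)/E(Y) \approx 4.97$ rather than $E(Y) = 4.5$, which is exactly the distortion the weighting in the paper corrects for.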
A nice description of practical situations where selection bias is encountered can be found in Rao (1997). While Rao (1997) considers discrete populations, Mahfoud and Patil (1982) and Patil and Taillie (1989) also deal with the continuous and multivariate framework (e.g. $w(x, y) = y^\alpha$, $w(x, y) = \max(x, y)$, $w(x, y) = \min(x, y)$, $w(x, y) = x + y$).
A more theoretical treatment of the problem can be found in a series of papers by Vardi, see
for example Vardi (1982). In that paper, the nonparametric maximum likelihood estimator
(NPMLE) of the distribution function is obtained for the special case where the data
are univariate and are subject to length–bias sampling (i.e. when w(x, y) = y). See also
Vardi (1985) and Gill et al. (1988), where the general case of selection biased sampling
is considered for univariate data. They also develop a number of examples (including
stratification and length–bias) and give a detailed treatment of the large sample properties
of the NPMLE.
The aim of this paper is twofold. We will first develop the asymptotic properties of the NPMLE of the error distribution, based on the nonparametric residuals $\hat\varepsilon^w_i = \{Y^w_i - \hat m_n(X^w_i)\}/\hat\sigma_n(X^w_i)$ $(i = 1, \ldots, n)$, where $\hat m_n(\cdot)$ and $\hat\sigma_n(\cdot)$ are appropriate kernel estimators of $m(\cdot)$ and $\sigma(\cdot)$ respectively. The proofs of these properties rely heavily on modern empirical process techniques, needed to take care of the difference $I(\hat\varepsilon^w_i \le y) - I(\varepsilon^w_i \le y)$ $(i = 1, \ldots, n)$.
See also Akritas and Van Keilegom (2001) and Van Keilegom and Akritas (1999), where
the NPMLE of the error distribution has been studied for completely observed and right
censored observations, respectively. Related estimation problems can be found in Cheng
(2004) and Muller et al. (2004a,b).
Secondly, we will develop appropriate test statistics for the hypothesis
\[
H_0 : m(\cdot) \in \mathcal{M}, \tag{2}
\]
when the data are subject to selection bias. Here, $\mathcal{M} = \{m_\theta(\cdot) : \theta \in \Theta\}$ is a class of parametric regression functions, including (but not restricted to) the class of linear regression functions. The set $\Theta$ is supposed to be a closed subset of $\mathbb{R}^d$. We are interested in developing omnibus tests, which have power against any alternative hypothesis.
The test statistics proposed in this paper are based on the following idea. When the null hypothesis $H_0$ is true, the NPMLE based on the 'parametric' residuals $\hat\varepsilon^w_{0i} = \{Y^w_i - m_{\hat\theta}(X^w_i)\}/\hat\sigma_n(X^w_i)$, where $\hat\theta$ is an appropriate estimator of $\theta$ under $H_0$, will be close to the NPMLE based on the nonparametric residuals $\hat\varepsilon^w_i$ $(i = 1, \ldots, n)$. It is worth noting
that when the data suffer from selection–bias, not only the sampled data are biased, but
also the residuals. The two estimators will be compared through Kolmogorov-Smirnov
and Cramer-von Mises type statistics. Similar test statistics have been considered by Van
Keilegom et al. (2007) and Pardo-Fernandez et al. (2007a) for directly observed and right
censored observations, respectively. See also Neumeyer et al. (2006), Pardo-Fernandez
and Van Keilegom (2006), Einmahl and Van Keilegom (2007) and Pardo-Fernandez et al.
(2007b) for other testing procedures under model (1). However, the above papers use local
constant weights, whereas in this paper we will use local linear smoothing.
The paper is organized as follows. In the next section, we propose an estimator of
the error distribution, and study its asymptotic properties. Section 3 is devoted to the
construction and study of test statistics for the hypothesis (2). A bootstrap approximation
is defined in Section 4, which is a useful alternative for the normal approximations obtained
in Sections 2 and 3. Some simulation results are summarized in Section 5, while Section 6
shows the results of the analysis of data on the Acid Neutralizing capacity of lakes in the
Northeastern states of the U.S. Finally, the proofs of the asymptotic results are given in
the Appendix.
2 Estimation of the error distribution
We start with introducing a number of notations. Let $F_\varepsilon(e) = P(\varepsilon \le e)$ and $F_X(x) = P(X \le x)$. The probability density functions of $F_\varepsilon(e)$ and $F_X(x)$ will be denoted respectively by $f_\varepsilon(e)$ and $f_X(x)$. The support of $X$ is denoted by $R_X$ and is supposed to be a compact subset of $\mathbb{R}$.
In order to estimate the distribution of ε, we first need to estimate m(x) and σ(x).
For the regression function m(x), following Cristobal et al. (2004), we use a local linear
estimator, adjusted for the selection bias in the following way:
\[
\hat m_n(x) = \sum_{i=1}^n W^w_i(x, h_n)\, Y^w_i, \tag{3}
\]
where
\[
W^w_i(x, h_n) = \frac{w^w_i(x, h_n)}{\sum_{j=1}^n w^w_j(x, h_n)},
\]
\[
w^w_i(x, h_n) = \frac{1}{w_i}\left\{ s^w_2(x, h_n)\, K\!\left(\frac{X^w_i - x}{h_n}\right) - s^w_1(x, h_n)\, K\!\left(\frac{X^w_i - x}{h_n}\right)\frac{X^w_i - x}{h_n} \right\},
\]
\[
s^w_k(x, h_n) = \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right)\left(\frac{X^w_i - x}{h_n}\right)^k
\]
$(k = 1, 2)$, $w_i = w(X^w_i, Y^w_i)$, $K$ is a known probability density function (kernel) and $h_n$ is a bandwidth sequence. For estimating $\sigma^2(x)$, we use a local constant estimator (we prefer not to use local linear weights here, as this might lead to negative estimators of the variance for certain values of $x$):
\[
\hat\sigma^2_n(x) = \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right)\left(Y^w_i - \hat m_n(X^w_i)\right)^2. \tag{4}
\]
Next, define $\hat\varepsilon^w_i = \{Y^w_i - \hat m_n(X^w_i)\}/\hat\sigma_n(X^w_i)$ for the resulting residuals, and let
\[
\hat F_\varepsilon(e) = n^{-1} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, I(\hat\varepsilon^w_i \le e) \tag{5}
\]
denote the proposed estimator of $F_\varepsilon(e)$, where
\[
\hat w_H = \left( \frac{1}{n} \sum_{i=1}^n \frac{1}{w_i} \right)^{-1}.
\]
In order to show the weak convergence of the estimator $\hat F_\varepsilon(e)$, we will first develop an asymptotic i.i.d. expansion for $\hat F_\varepsilon(e)$. This asymptotic representation, given in Theorem 2.1 below, shows that $\hat F_\varepsilon(e)$ can be written as the NPMLE based on the (unobserved) errors $\varepsilon^w_i = \{Y^w_i - m(X^w_i)\}/\sigma(X^w_i)$, plus a second term which comes from the difference between $\hat\varepsilon^w_i$ and $\varepsilon^w_i$, plus a remainder term whose order is asymptotically negligible.
The assumptions under which this result holds true, are given in the Appendix.
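Computationally, (5) is just a weighted empirical distribution function of the residuals. The following sketch (our own, for illustration) makes the normalization $\hat w_H$, the harmonic mean of the $w_i$, explicit; by construction the estimator equals 1 at and beyond the largest residual.

```python
import numpy as np

def F_eps_hat(e, resid, w):
    """Estimator (5): weighted e.d.f. of the residuals, each indicator
    weighted by w_H / w_i, with w_H = (n^{-1} sum 1/w_i)^{-1}."""
    resid, w = np.asarray(resid), np.asarray(w)
    inv = 1.0 / w
    wH = 1.0 / inv.mean()
    return float(np.mean(wH * inv * (resid <= e)))

resid = np.array([-1.0, 0.0, 1.0])
w = np.array([1.0, 2.0, 4.0])
```

With these numbers $\hat w_H = 12/7$, so the jump sizes are $(\hat w_H / w_i)/n$ and sum to 1 exactly.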
Theorem 2.1 Assume A1-A4. Then,
\begin{align*}
\hat F_\varepsilon(e) - F_\varepsilon(e)
&= n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\{I(\varepsilon^w_i \le e) - F_\varepsilon(e)\} + f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x) + o_P(n^{-1/2}) \\
&= n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\{I(\varepsilon^w_i \le e) - F_\varepsilon(e) + \varphi(X^w_i, Y^w_i, e)\} + o_P(n^{-1/2}),
\end{align*}
uniformly in $-\infty < e < \infty$, where
\[
\Delta_n(x, e) = \frac{1}{\hat\sigma_n(x)}\left\{ \hat m_n(x) - m(x) + e\,(\hat\sigma_n(x) - \sigma(x)) \right\},
\]
and
\[
\varphi(x, y, e) = f_\varepsilon(e)\left[ \frac{y - m(x)}{\sigma(x)} + \frac{e}{2}\left\{ \left(\frac{y - m(x)}{\sigma(x)}\right)^2 - 1 \right\} \right].
\]
Theorem 2.2 Assume the conditions of Theorem 2.1 are satisfied. Then, the process $n^{1/2}(\hat F_\varepsilon(e) - F_\varepsilon(e))$, $-\infty < e < \infty$, converges weakly to a zero-mean Gaussian process $W(e)$ with covariance function
\[
\operatorname{Cov}(W(e), W(e')) = \mu_w\, E\Big( \frac{1}{w(X, Y)}\, [I(\varepsilon \le e) - F_\varepsilon(e) + \varphi(X, Y, e)]\, [I(\varepsilon \le e') - F_\varepsilon(e') + \varphi(X, Y, e')] \Big).
\]
3 Goodness-of-fit tests for the regression function
Suppose that the null hypothesis $H_0 : m \in \mathcal{M}$ is satisfied, where $\mathcal{M} = \{m_\theta : \theta \in \Theta\}$, and let $\theta_0$ be the true value of $\theta$ under $H_0$. Let $\varepsilon^w_0 = \{Y^w - m_{\theta_0}(X^w)\}/\sigma(X^w)$.
In order to estimate $\theta_0$, define the function
\[
S^w_n(\theta) = n^{-1} \sum_{i=1}^n \frac{1}{w_i}\, \{Y^w_i - m_\theta(X^w_i)\}^2. \tag{6}
\]
This is the sample version of the quadratic expression
\begin{align*}
S^w_0(\theta) &= E\left[ \frac{1}{w(X^w, Y^w)}\, \{Y^w - m_\theta(X^w)\}^2 \right] = \frac{1}{\mu_w} E\left[ \{Y - m_\theta(X)\}^2 \right] \\
&= \frac{1}{\mu_w} E\left[ \{m_\theta(X) - m_{\theta_0}(X)\}^2 \right] + \frac{1}{\mu_w} E\left[ \sigma^2(X) \right].
\end{align*}
Let
\[
\hat\theta_n = \arg\min_\theta S^w_n(\theta),
\]
and define the smoothed parametric estimator $\hat m_0(x)$, similarly as in Härdle and Mammen (1993), by
\[
\hat m_0(x) = \sum_{i=1}^n W^w_i(x, h_n)\, m_{\hat\theta_n}(X^w_i),
\]
where we have used the same kernel and bandwidth as for the nonparametric estimator $\hat m_n(x)$, in order to have the same bias as that of $\hat m_n(x)$.
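When the family $\mathcal{M}$ is linear in the parameters, the minimizer of (6) has a closed form: weighted least squares with per-observation weights $1/w_i$. A minimal sketch, assuming the illustrative family $m_\theta(x) = a + bx$ (our choice, not the paper's):

```python
import numpy as np

def theta_hat(Xw, Yw, w):
    """argmin of S_n^w(theta) in (6) for m_theta(x) = a + b x:
    rescale each row by sqrt(1 / w_i), then solve ordinary least squares."""
    s = 1.0 / np.sqrt(np.asarray(w))
    D = np.column_stack([np.ones_like(Xw), Xw]) * s[:, None]
    theta, *_ = np.linalg.lstsq(D, np.asarray(Yw) * s, rcond=None)
    return theta

# noiseless linear data: the minimizer recovers (a, b) exactly
X = np.linspace(0.0, 5.0, 50)
Y = 3.0 + 0.5 * X
w = 1.0 + Y
```

For a nonlinear family (e.g. the logistic model of Section 5), the same criterion would be minimized numerically instead.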
Lemma 3.1 Assume A1, A5. Then, under $H_0$,
\[
\hat\theta_n - \theta_0 \xrightarrow{P} 0,
\]
\[
\hat\theta_n - \theta_0 = n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i}\, \eta_{\theta_0}(X^w_i, Y^w_i) + o_P(n^{-1/2}),
\]
and
\[
n^{1/2}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, \Sigma),
\]
where
\[
\eta_\theta(x, y) = \Omega^{-1}\, \frac{\partial m_\theta(x)}{\partial \theta}\, (y - m_\theta(x)),
\qquad
\Omega = E\left[ \frac{\partial m_{\theta_0}(X)}{\partial \theta}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}^T \right],
\]
\[
\Sigma = \mu_w\, \Omega^{-1} E\left[ \frac{\{Y - m_{\theta_0}(X)\}^2}{w(X, Y)}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}\, \frac{\partial m_{\theta_0}(X)}{\partial \theta}^T \right] \Omega^{-1}.
\]
Now, define
\[
\hat F_{\varepsilon 0}(e) = n^{-1} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, I(\hat\varepsilon^w_{0i} \le e), \tag{7}
\]
where $\hat\varepsilon^w_{0i} = \{Y^w_i - \hat m_0(X^w_i)\}/\hat\sigma_n(X^w_i)$ are the residuals estimated under $H_0$.
Before defining the test statistics that we will use for testing $H_0$, it is worth mentioning that, as in the case of complete observations (see Van Keilegom et al. (2007)), the comparison of $\hat F_\varepsilon(e)$ with its parametric counterpart $\hat F_{\varepsilon 0}(e)$ leads to a valid procedure to test the null hypothesis $H_0$.
Theorem 3.1 The null hypothesis $H_0 : m \in \mathcal{M}$ holds if and only if the random variables $\varepsilon$ and $\{Y - m_{\theta_0}(X)\}/\sigma(X)$ have the same distribution, where $\theta_0$ is the value of $\theta$ that minimizes $E[\{m(X) - m_\theta(X)\}^2]$.
The proof of this result can be found in Van Keilegom et al. (2007). Note that $\theta_0$ can also be regarded as the value of $\theta$ that minimizes $S^w_0(\theta) = \mu_w^{-1} E[\{Y - m_\theta(X)\}^2]$.
We are now ready to define the following Kolmogorov-Smirnov and Cramér-von Mises type test statistics for $H_0$:
\[
T_{KS} = n^{1/2} \sup_{-\infty < e < \infty} |\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)|
\]
and
\[
T_{CM} = n \int [\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)]^2\, d\hat F_{\varepsilon 0}(e).
\]
In order to obtain the asymptotic law of these test statistics, we first need the following
two results :
Theorem 3.2 Assume A1-A5. Then, under $H_0$,
\[
\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e)
= f_\varepsilon(e)\, n^{-1} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ \varepsilon^w_i - \int \sigma^{-1}(x) \left( \frac{\partial m_{\theta_0}(x)}{\partial \theta} \right)^T dF_X(x)\; \eta_{\theta_0}(X^w_i, Y^w_i) \right] + o_P(n^{-1/2}),
\]
uniformly in $-\infty < e < \infty$.
As an immediate consequence of this asymptotic expansion, we have the following weak
convergence result. Note that the limiting process has an extremely simple structure,
as it factorizes in a deterministic function only depending on e and a random variable
independent of e.
Corollary 3.1 Assume A1-A5. Then, under $H_0$, the process $n^{1/2}(\hat F_\varepsilon(e) - \hat F_{\varepsilon 0}(e))$ $(-\infty < e < \infty)$ converges weakly to $f_\varepsilon(e) V$, where $V$ is a zero-mean normal random variable with variance
\[
\operatorname{Var}(V) = \mu_w\, E\left( \frac{1}{w(X, Y)} \left[ \varepsilon - \int \sigma^{-1}(x) \left( \frac{\partial m_{\theta_0}(x)}{\partial \theta} \right)^T dF_X(x)\; \eta_{\theta_0}(X, Y) \right]^2 \right).
\]
It can now be easily shown that (see Van Keilegom et al. (2007) for a similar proof)

Corollary 3.2 Assume A1-A5. Then, under $H_0$,
\[
T_{KS} \xrightarrow{d} \sup_{-\infty < e < \infty} |f_\varepsilon(e)|\, |V|,
\qquad
T_{CM} \xrightarrow{d} \int f_\varepsilon^2(e)\, dF_\varepsilon(e)\; V^2.
\]
4 Bootstrap approximation
The asymptotic distribution of the test statistics TKS and TCM depends on the unknown
error density fε(e). In order to avoid the estimation of this unknown function, we will
approximate the distribution of TKS and TCM by means of a bootstrap procedure. Note
that since the asymptotic representation of Fε(e)−Fε(e) depends on the density fε(e) (see
Theorem 2.1), the use of a smoothed bootstrap procedure is required.
While in the unbiased case, when our sample comes directly from $(X, Y)$, the error $\varepsilon$ and the covariate $X$ are independent, in the selection biased framework the observed errors and covariates are no longer independent. In fact, the conditional distribution of the random variable $\varepsilon^w = \{Y^w - m(X^w)\}/\sigma(X^w)$ given that $X^w = x$ is given by
\[
dF^w(e|x) = \frac{w(x, m(x) + \sigma(x)e)}{\mu_w(x)}\, dF_\varepsilon(e),
\]
where $\mu_w(x) = E[w(X, Y)|X = x]$. Therefore, bootstrap errors will need to be drawn from an estimator of the conditional distribution of $\varepsilon^w$ given $X^w$.
This bootstrap procedure, which to the best of our knowledge is new, can be described in the following way.
1. Define standardized residuals
\[
\hat\varepsilon^{ws}_i = \frac{\hat\varepsilon^w_i - \hat\mu^w_n}{\hat\sigma^w_n}
\]
$(i = 1, \ldots, n)$, where
\[
\hat\mu^w_n = \frac{1}{n} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, \hat\varepsilon^w_i,
\qquad
(\hat\sigma^w_n)^2 = \frac{1}{n} \sum_{i=1}^n \frac{\hat w_H}{w_i}\, (\hat\varepsilon^w_i - \hat\mu^w_n)^2.
\]
2. Let $B$ be the number of bootstrap samples. Then, for $b = 1, \ldots, B$, let
\[
X^*_{i,b} = X^w_i, \qquad Y^*_{i,b} = m_{\hat\theta_n}(X^*_{i,b}) + \hat\sigma_n(X^*_{i,b})\, \varepsilon^*_{i,b}
\]
$(i = 1, \ldots, n)$, where $\varepsilon^*_{i,b}$ are smoothed bootstrap residuals defined as
\[
\varepsilon^*_{i,b} = \sqrt{1 - \kappa_n^2}\; \hat\varepsilon^{ws}_{V(X^*_{i,b})} + \kappa_n Z_{i,b},
\]
where $\kappa_n$ is a bandwidth parameter taken equal to the standard error of the rescaled nonparametric residuals $\hat\varepsilon^{ws}_i$, and $V(X^*_{1,b}), \ldots, V(X^*_{n,b})$ are independent random variables, defined as (for an arbitrary $x$)
\[
P(V(x) = i) = \frac{L\!\left(\frac{X^*_{i,b} - x}{h'_n}\right)}{\sum_{j=1}^n L\!\left(\frac{X^*_{j,b} - x}{h'_n}\right)}
\]
$(i = 1, \ldots, n)$, where $L$ is a probability density function (kernel) and $h'_n$ a bandwidth sequence. Finally, $Z_{1,b}, \ldots, Z_{n,b}$ are i.i.d. random variables with zero mean and unit variance, independent of the data $(X^w_i, Y^w_i)$ $(i = 1, \ldots, n)$.
Let $T^*_{KS,b}$ and $T^*_{CM,b}$ be the test statistics obtained from the bootstrap sample $(X^*_{i,b}, Y^*_{i,b})$ $(i = 1, \ldots, n)$.
3. The critical values of the test statistics $T_{KS}$ and $T_{CM}$ can now be approximated by (for some $0 < \alpha < 1$)
\[
T^*_{KS,([(1-\alpha)B])} \quad \text{and} \quad T^*_{CM,([(1-\alpha)B])},
\]
where for any $b = 1, \ldots, B$, $T^*_{KS,(b)}$ denotes the $b$-th order statistic of $T^*_{KS,1}, \ldots, T^*_{KS,B}$ (and similarly for $T^*_{CM,(b)}$).
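The residual-drawing mechanism of step 2 can be sketched as follows (our own illustration; the kernel choice, the values of $\kappa_n$ and $h'_n$, and the synthetic residuals are assumptions): for each design point, an index $V(x)$ is drawn with kernel probabilities, and the selected standardized residual is perturbed with independent noise, which produces the required smoothing.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_errors(Xs, resid_std, kappa, h):
    """Step 2 of the bootstrap: draw V(x) with probabilities
    proportional to L((X_j - x)/h'), then set
    eps* = sqrt(1 - kappa^2) * resid_std[V(x)] + kappa * Z."""
    def L(u):  # Epanechnikov kernel
        return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    n = len(Xs)
    eps = np.empty(n)
    for i, x in enumerate(Xs):
        p = L((Xs - x) / h)
        p = p / p.sum()              # p.sum() > 0 since L(0) > 0
        V = rng.choice(n, p=p)
        eps[i] = np.sqrt(1.0 - kappa**2) * resid_std[V] \
                 + kappa * rng.standard_normal()
    return eps

Xs = np.linspace(0.0, 5.0, 500)
resid = rng.standard_normal(500)
resid_std = (resid - resid.mean()) / resid.std()   # mean 0, variance 1
eps_star = bootstrap_errors(Xs, resid_std, kappa=0.3, h=0.8)
```

Because the residuals are standardized, the scaling $\sqrt{1 - \kappa_n^2}$ keeps the bootstrap errors at (approximately) unit variance.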
5 Simulations
In order to evaluate the finite sample performance of the proposed tests, we present the
results of two simulation studies, in which both test statistics TKS and TCM are considered
for different models.
The analysis will be focused on the size $\alpha$ and the power of the tests, where $\alpha$ is equal to 0.01 and 0.05. We consider regression functions $m(x)$ of the form $m_\theta(x) + A\delta(x)$, where $A = 0$ corresponds to the null hypothesis. The variance function $\sigma^2(x)$ will be equal to $0.2^2$ and $0.4^2$. We have carried out 500 simulations for each of the tables that will be presented
and 200 simulations for each of the figures, while the number of bootstrap replications was
always taken equal to 500. The results are presented for samples of size 50, 100 and 200 in
the case of the tables and of size 200 for the figures.
For the second model, see below; for the first model, we consider $\mathcal{M} = \{a + bx + cx^2 : -10 \le a, b, c \le 10\}$, while the true regression function is given by
\[
m_{\theta_0}(x) = 8 - 1.25x + 2(x - 1.5)^2
\]
and the perturbation under the alternative is chosen to be δ(x) = (x − 1.5)3. The error
ε is distributed as a uniform random variable on [−5, 5]. The covariate X is uniformly
distributed on [0, 5], and we assume that the response Y is subject to length bias, i.e.
w(x, y) = y. Note that w(x, y) is positive and bounded away from zero for all x, y.
As we can see from the results in Table 1, the level is well respected (take $A = 0$) and the rejection rate increases when $|A|$ increases. We can also see that when $\sigma(x) = 0.4$, the ability of the procedure to discriminate between the null hypothesis and the alternatives decreases in a noticeable manner, which shows that the effect of the variance is in line with what we expected. Table 1 also shows that the Cramér-von Mises test is in general more powerful than the Kolmogorov-Smirnov test. Similar conclusions can be drawn from Figure 1.
α n σ A TKS TCM α n σ A TKS TCM
0.01 50 0.20 -0.250 0.984 0.994 0.05 50 0.20 -0.250 1.000 1.000
-0.125 0.388 0.510 -0.125 0.752 0.794
0.000 0.002 0.010 0.000 0.028 0.056
0.125 0.234 0.344 0.125 0.496 0.632
0.250 0.844 0.952 0.250 0.970 0.992
0.40 -0.250 0.482 0.622 0.40 -0.250 0.810 0.870
-0.125 0.068 0.088 -0.125 0.210 0.246
0.000 0.012 0.012 0.000 0.056 0.072
0.125 0.052 0.078 0.125 0.164 0.240
0.250 0.204 0.312 0.250 0.474 0.574
100 0.20 -0.250 1.000 1.000 100 0.20 -0.250 1.000 1.000
-0.125 0.842 0.916 -0.125 0.978 0.998
0.000 0.004 0.006 0.000 0.038 0.046
0.125 0.574 0.674 0.125 0.856 0.934
0.250 0.998 1.000 0.250 1.000 1.000
0.40 -0.250 0.908 0.980 0.40 -0.250 0.994 0.994
-0.125 0.162 0.236 -0.125 0.410 0.566
0.000 0.010 0.012 0.000 0.060 0.056
0.125 0.068 0.104 0.125 0.214 0.352
0.250 0.534 0.670 0.250 0.814 0.898
200 0.20 -0.250 1.000 1.000 200 0.20 -0.250 1.000 1.000
-0.125 0.998 1.000 -0.125 1.000 1.000
0.000 0.006 0.010 0.000 0.034 0.052
0.125 0.976 0.994 0.125 1.000 0.996
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 0.998 1.000 0.40 -0.250 1.000 1.000
-0.125 0.512 0.678 -0.125 0.828 0.906
0.000 0.000 0.008 0.000 0.038 0.056
0.125 0.178 0.354 0.125 0.538 0.674
0.250 0.964 0.990 0.250 0.998 1.000
Table 1: Rejection rates of H0 for TKS and TCM under the first model.
Figure 1: Power function for $T_{KS}$ and $T_{CM}$ under the first model ($\alpha = 0.05$, $n = 200$; $\sigma = 0.1$ and $\sigma = 0.25$).

For the second model, we consider a model that is sometimes used to address population growth issues. Let
\[
\mathcal{M} = \left\{ \frac{6}{1 + e^{a + bx}} + c : -10 \le a, b, c \le 10 \right\}
\]
and the true regression function
\[
m_{\theta_0}(x) = \frac{6}{1 + e^{1 - 1.5x}} + 2.
\]
In this case, the perturbation is given by
\[
\delta(x) = 2(x - 3)^2,
\]
the error $\varepsilon$ is distributed as a truncated normal that takes values in $(-1.5, 1.5)$, while the covariate $X$ is uniformly distributed on $[0, 6]$. Although in this example we have taken $w(x, y) = \max(x, y)$, we do not see significant discrepancies from the behavior exhibited by the procedure under the previous model. In fact, from Table 2 and Figure 2, we can see again that the rejection proportion agrees with the nominal level of the tests when $A = 0$, and otherwise grows to values close to 1. On the other hand, we can see again that large values of $\sigma$ deteriorate the performance of the tests.
Figure 2: Power function for $T_{KS}$ and $T_{CM}$ under the second model ($\alpha = 0.05$, $n = 200$; $\sigma = 0.1$ and $\sigma = 0.25$).
α n σ A TKS TCM α n σ A TKS TCM
0.01 50 0.20 -0.250 0.994 1.000 0.05 50 0.20 -0.250 0.996 1.000
-0.125 0.774 0.934 -0.125 0.914 0.978
0.000 0.012 0.006 0.000 0.040 0.056
0.125 0.836 0.916 0.125 0.956 0.974
0.250 0.976 0.996 0.250 0.998 1.000
0.40 -0.250 0.734 0.916 0.40 -0.250 0.884 0.972
-0.125 0.244 0.404 -0.125 0.470 0.620
0.000 0.018 0.018 0.000 0.066 0.064
0.125 0.194 0.300 0.125 0.444 0.606
0.250 0.338 0.568 0.250 0.666 0.818
100 0.20 -0.250 1.000 1.000 100 0.20 -0.250 1.000 1.000
-0.125 0.990 0.998 -0.125 0.998 1.000
0.000 0.004 0.012 0.000 0.042 0.060
0.125 0.990 1.000 0.125 1.000 1.000
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 0.992 1.000 0.40 -0.250 1.000 1.000
-0.125 0.510 0.782 -0.125 0.772 0.940
0.000 0.008 0.012 0.000 0.038 0.050
0.125 0.400 0.652 0.125 0.710 0.898
0.250 0.808 0.938 0.250 0.956 0.992
200 0.20 -0.250 1.000 1.000 200 0.20 -0.250 1.000 1.000
-0.125 1.000 1.000 -0.125 1.000 1.000
0.000 0.008 0.008 0.000 0.040 0.040
0.125 1.000 1.000 0.125 1.000 1.000
0.250 1.000 1.000 0.250 1.000 1.000
0.40 -0.250 1.000 1.000 0.40 -0.250 1.000 1.000
-0.125 0.894 0.990 -0.125 0.990 1.000
0.000 0.008 0.008 0.000 0.046 0.054
0.125 0.850 0.982 0.125 0.978 0.998
0.250 0.996 1.000 0.250 1.000 1.000
Table 2: Rejection rates of H0 for TKS and TCM under the second model.
6 Data analysis
During the period 1991–1996 the Environmental Monitoring and Assessment Program
(EMAP) of the U.S. Environmental Protection Agency has conducted a survey of lakes
in the Northeastern states of the U.S. with the purpose of detecting acidic waters and
monitoring the water conditions of these lakes; see Breidt et al. (2007) for more details.
13
In short, the survey is based on a total population of 21,026 lakes, from which 334 lakes were surveyed, obtaining one single measurement for each of them. To select the lakes to be included in the survey, a relatively complex sampling design based on a hexagonal grid frame was used; see Larsen et al. (1993). Apart from the acid neutralizing capacity (ANC), the survey also recorded the concentration of different chemicals in the water, such as magnesium (MG) and sulphuric acid and/or sulphate (SO4).
As a result of the complex survey design, not all the lakes have the same chance of being visited, and hence we can no longer think of the measurements of the lakes' chemicals as an i.i.d. sample. It is common in all these grid-based surveys that information other than what is recorded, such as geographic information, is used to determine whether an object, in this case a lake, becomes part of the sample. In those cases, we can think of a multivariate random phenomenon $(X, Z)$ and a selection procedure $\psi$ such that whenever $\psi(Z)$ is equal to 1, we have observations of $X$. Therefore, we observe the random phenomenon $X^w$ defined as $X|\psi(Z) = 1$, whose distribution is given by
\[
dF^w_X(x) = \frac{p(x)}{E[p(X)]}\, dF_X(x),
\]
where $p(x) = P(\psi(Z) = 1|X = x)$. As a consequence we need to know, as is common in surveys, the inclusion probability $p(x_i)$ of the $i$-th lake in the sample.
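The compensation implied by the formula above amounts to reweighting each sampled unit by $1/p(x_i)$, as in a Hájek-type estimator. A small illustration with entirely synthetic numbers (not the EMAP data): when inclusion probabilities grow with the measurement, the naive sample mean overshoots, while the reweighted mean recovers the population value.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic population: measurement x, inclusion probability p(x)
x = rng.uniform(0.0, 1.0, 100000)
p = 0.1 + 0.8 * x                        # larger values sampled more often
sampled = rng.uniform(size=x.size) < p   # selection psi(Z) = 1
xs, ps = x[sampled], p[sampled]

naive = xs.mean()                            # biased toward large x
hajek = np.sum(xs / ps) / np.sum(1.0 / ps)   # compensated by 1 / p(x_i)
```

Here $E(X) = 0.5$, the naive mean converges to $E(Xp(X))/E(p(X)) \approx 0.63$, and the compensated mean stays close to 0.5.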
The aim of this example is to find an appropriate parametric regression function for
the model regressing ANC on the chemical measurement SO4. In order to do so, we will
consider different regression models, exploring in this way the practical performance of
our technique. For each of the models we calculate the p–value corresponding to both the
statistics TKS and TCM , by using the bootstrap approach outlined in Section 4.
Model TKS TCM
ANC ~ 1 0.000 0.000
ANC ~ SO4 0.012 0.040
ANC ~ SO4 + 1 0.314 0.480
ANC ~ SO4 + SO4^2 0.664 0.530
ANC ~ SO4^2 + 1 0.000 0.000
ANC ~ log(SO4) 0.004 0.004
ANC ~ exp(SO4) + 1 0.000 0.000
ANC ~ sin(SO4) 0.002 0.000
ANC ~ sin(SO4) + 1 0.000 0.000
Table 3: p–values for TKS and TCM
We have chosen a bandwidth of 90 for estimating the regression function mn(·) and a
bandwidth of 108 to estimate the variance function σ2n(·). From Table 3 we can see that
both the models ANC ~ SO4 + 1 and ANC ~ SO4 + SO4^2 seem to be suitable to explain
the data behavior. In Figure 3 we show the scatterplot of the observed data jointly with the
compensated local linear fit and the compensated quadratic fit (i.e. ANC ~ SO4 + SO4^2).
The two curves are close to each other, which confirms the results of the goodness-of-fit
test. The parameter estimates of this quadratic model are given in Table 4.
Figure 3: Scatterplot of the data together with the estimated regression curves. The
dashed curve represents the compensated local linear estimator, while the solid curve is the
compensated quadratic estimator.
Estimate
Intercept 65.9538
SO4 0.6514
SO4^2 0.0035
Table 4: Parameter estimates obtained for ANC versus SO4.
Finally, we compare the estimators of the residual density obtained with and without compensating for the biased sampling. Figure 4 suggests that the skewness of the residuals computed without taking the sampling probabilities into account is much more pronounced than when the sampling mechanism is accounted for.
Figure 4: Estimation of the error density for ANC versus SO4. The solid curve represents the
unbiased (compensated) estimator, the dashed curve is the biased (unweighted) estimator.
Appendix : Proofs
Assumptions
A1 The function w satisfies w (X, Y ) > C with probability one, for some C > 0.
A2 The bandwidth sequence $h_n$ satisfies $nh_n^4 \to 0$ and $nh_n^{3+2\delta}/\log h_n^{-1} \to \infty$ for some $\delta > 0$, and the kernel $K$ is a symmetric probability density function with bounded support which is twice continuously differentiable.
A3 The functions $F_X$ and $m$ are three times continuously differentiable, and $\sigma$ is twice continuously differentiable on the interior of the support $R_X$. Moreover, $f_X(x), \sigma(x) > C_1$ for all $x$ in $R_X$ and for some $C_1 > 0$.
A4 The random error $\varepsilon$ satisfies
(i) $E[\varepsilon^4] < \infty$.
(ii) $F_\varepsilon$ is twice continuously differentiable, and both $e f_\varepsilon(e)$ and $e^2 f'_\varepsilon(e)$ are uniformly bounded.
A5 (i) $\Theta$ is a compact subspace of $\mathbb{R}^d$.
(ii) $\inf_{\|\theta - \theta_0\| > \epsilon} E\big( (m_\theta(X) - m_{\theta_0}(X))^2 \big) > 0$ for all $\epsilon > 0$.
(iii) $\theta_0$ is an interior point of $\Theta$.
(iv) $\Omega$ is non-singular.
(v) The function $m_\theta(x)$ is twice continuously differentiable with respect to $\theta$ for all $x$, and the derivatives are uniformly bounded in $x$ and $\theta$.
Proposition A.1 If assumptions A1, A2, A3 and A4(i) are satisfied, then
\[
\sup_x |\hat m_n(x) - m(x)| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big),
\qquad
\sup_x |\hat\sigma_n(x) - \sigma(x)| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. The first result is a direct consequence of Proposition 3.1 in Ojeda (2008). Note that the conditions in that theorem are fulfilled, because of assumptions A2 and A3 and because
\[
E\left[ \left( \frac{Y^w - m(X^w)}{w(X^w, Y^w)} \right)^2 \right]
= \frac{1}{\mu_w}\, E\left[ \frac{(Y - m(X))^2}{w(X, Y)} \right]
\le \frac{1}{C \mu_w}\, \sup_x |\sigma(x)|^2 < \infty. \tag{8}
\]
For the second result, first note that
\[
\hat\sigma_n(x) - \sigma(x) = \frac{\hat\sigma_n^2(x) - \sigma^2(x)}{\hat\sigma_n(x) + \sigma(x)},
\]
and hence it suffices to show that $\sup_x |\hat\sigma_n^2(x) - \sigma^2(x)| = O((nh_n)^{-1/2} (\log n)^{1/2})$ a.s. Next,
\[
\hat\sigma_n^2(x) = \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(X^w_i)\}^2 + o\big( (nh_n)^{-1/2} (\log n)^{1/2} \big),
\]
almost surely and uniformly in $x$, which can be shown using standard lines of proof. Hence, it suffices to show that
\[
\sup_x \left| \frac{1}{\sum_{i=1}^n \frac{1}{w_i} K\!\left(\frac{X^w_i - x}{h_n}\right)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(X^w_i)\}^2 - \sigma^2(x) \right| = O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely, and this follows again from Proposition 3.1 in Ojeda (2008) (with $p = 0$), since
\[
E\left[ \left( \frac{\{Y^w - m(X^w)\}^2 - \sigma^2(X^w)}{w(X^w, Y^w)} \right)^2 \right] < \infty,
\]
which follows from assumption A4(i). $\Box$
Proposition A.2 If assumptions A1, A2 and A3 are satisfied, then
\[
\hat m_n(x) - m(x) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \{Y^w_i - m(x)\} + O\big( (nh_n)^{-1} \log n \big),
\]
uniformly in $x$ and almost surely. If in addition assumption A4(i) is satisfied, then
\[
\hat\sigma_n(x) - \sigma(x) = \frac{\mu_w}{2 n h_n f_X(x) \sigma(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \big( \{Y^w_i - m(x)\}^2 - \sigma^2(x) \big) + O\big( (nh_n)^{-1} \log n \big),
\]
uniformly in $x$ and almost surely.
Proof. For the first expansion, first note that from Proposition A.1 in Ojeda et al. (2008) we have that
\[
\sum_{i=1}^n w^w_i(x, h_n) = (n h_n)^2 \left( \frac{1}{\mu_w^2}\, \mu_2 f_X^2(x) + O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big) \right)
\]
uniformly in $x$ and almost surely. Furthermore, in the same way we can show that
\[
\sum_{i=1}^n w^w_i(x, h_n)\, Y^w_i = (n h_n)^2 \left( \frac{1}{\mu_w^2}\, \mu_2\, m(x) f_X^2(x) + O\big( (nh_n)^{-1/2} (\log n)^{1/2} \big) \right)
\]
uniformly in $x$ and almost surely. Hence, from the definition of $\hat m_n(x)$ in equation (3) we obtain that
\[
\hat m_n(x) - m(x) = \frac{\mu_w^2}{(n h_n)^2 \mu_2 f_X^2(x)} \sum_{i=1}^n w^w_i(x, h_n)\, (Y^w_i - m(x)) + O\big( (nh_n)^{-1} \log n \big)
\]
uniformly in $x$ and almost surely. From the definition of $w^w_i(x, h_n)$ and the almost sure and uniform expansions for $s^w_1(x; h_n)$ and $s^w_2(x; h_n)$ given in Proposition A.1 in Ojeda et al. (2008), it now follows that
\[
\hat m_n(x) - m(x) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) (Y^w_i - m(x)) + O\big( (nh_n)^{-1} \log n \big)
\]
uniformly in $x$ and almost surely.
The proof for the expansion of $\hat\sigma_n(x) - \sigma(x)$ follows in a similar way, taking into account the expansion for $s^w_0(x; h_n)$ and the fact that $Y^w_i - \hat m_n(X^w_i)$ can be written as $Y^w_i - m(X^w_i) + m(X^w_i) - \hat m_n(X^w_i)$. See also Liero (2003) for the case where $w_i = 1$ $(i = 1, \ldots, n)$. $\Box$
18
Proposition A.3 If assumptions A1, A2, A3 and A4(i) are satisfied, then
\[
\sup_x |\hat m'_n(x) - m'(x)| = O\big( (n h_n^3)^{-1/2} (\log n)^{1/2} \big),
\qquad
\sup_x |\hat\sigma'_n(x) - \sigma'(x)| = O\big( (n h_n^3)^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. This follows along the same lines as in the proof of Proposition A.1, but using Proposition 3.2 in Ojeda (2008) instead of Proposition 3.1. $\Box$
Proposition A.4 If assumptions A1, A2, A3 and A4(i) are satisfied, then for $0 < \delta < 1$,
\[
\sup_{x, x'} \frac{|\hat m'_n(x) - m'(x) - \hat m'_n(x') + m'(x')|}{|x - x'|^\delta} = O\big( (n h_n^{3+2\delta})^{-1/2} (\log n)^{1/2} \big),
\]
\[
\sup_{x, x'} \frac{|\hat\sigma'_n(x) - \sigma'(x) - \hat\sigma'_n(x') + \sigma'(x')|}{|x - x'|^\delta} = O\big( (n h_n^{3+2\delta})^{-1/2} (\log n)^{1/2} \big)
\]
almost surely.
Proof. This follows along the same lines as in the proof of Proposition A.1, but using Theorem 3.2 in Ojeda (2008) instead of Proposition 3.1. $\Box$
For the next lemma, we need to introduce the following expression ($-\infty < e < \infty$):
\[
\Psi_n(e) = \frac{1}{n} \sum_{i=1}^n \left( \frac{\mu_w}{w_i}\, I(\hat\varepsilon^w_i \le e) - \frac{\mu_w}{w_i}\, I(\varepsilon^w_i \le e) - P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) + P(\varepsilon \le e) \right).
\]
Lemma A.1 If assumptions A1, A2 and A3 are satisfied, then
\[
\sup_{-\infty < e < \infty} |\Psi_n(e)| = o_P\left( \frac{1}{\sqrt{n}} \right),
\]
where $P(\hat\varepsilon \le e \,|\, \mathcal{X}_n)$ stands for the distribution of $\hat\varepsilon = \hat\sigma_n(X)^{-1}(Y - \hat m_n(X))$ conditional on the given values of the sample.
Proof. The proof is heavily based on Lemma B1 in Akritas and Van Keilegom (2001), which considers the unweighted case, i.e. $w_i = 1$ for all $i$. Let $d_{n1}$ and $d_{n2}$ be defined as in that paper, but with $\hat m_n(x)$ and $\hat\sigma_n(x)$ as in equations (3) and (4), so we use respectively local linear and local constant estimators for the regression and variance functions.
Because of Propositions A.1, A.3 and A.4, we have that $P(d_{n1} \in C_1^{1+\delta}(R_X))$ and $P(d_{n2} \in C_2^{1+\delta}(R_X))$ tend to 1, where $C_1^{1+\delta}(R_X)$ and $C_2^{1+\delta}(R_X)$ are defined as in Akritas and Van Keilegom (2001). Next, define the class
\[
\mathcal{F} = \left\{ (x, \varepsilon) \to \frac{\mu_w}{\omega(x, \varepsilon)}\, I(\varepsilon \le e\, d_2(x) + d_1(x)) \;:\; -\infty < e < \infty,\; d_1 \in C_1^{1+\delta}(R_X),\; d_2 \in C_2^{1+\delta}(R_X) \right\},
\]
where $\omega(x, \varepsilon) = w(x, y)$ and $y = m(x) + \sigma(x)\varepsilon$.
The computation of the bracketing number of the class $\mathcal{F}$ can be done in almost the same way as in Akritas and Van Keilegom (2001), and is therefore omitted. In fact, the weights $\omega(x, \varepsilon)$ do not have any significant effect on the computation of this bracketing number. This implies that the class $\mathcal{F}$ is Donsker. Next, let $V_n(x, e) = e\, d_{n2}(x) + d_{n1}(x)$ and note that
\begin{align*}
& \operatorname{Var}\left[ \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, I(\varepsilon^w \le V_n(X^w, e)) - \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, I(\varepsilon^w \le e) - P(\varepsilon \le V_n(X, e)) + P(\varepsilon \le e) \right] \\
&= \operatorname{Var}\left[ \frac{\mu_w}{\omega(X^w, \varepsilon^w)}\, \{ I(\varepsilon^w \le V_n(X^w, e)) - I(\varepsilon^w \le e) \} \right] \\
&\le E\left[ \left( \frac{\mu_w}{\omega(X^w, \varepsilon^w)} \right)^2 \{ I(\varepsilon^w \le V_n(X^w, e)) - I(\varepsilon^w \le e) \}^2 \right] \\
&\le \frac{\mu_w}{C}\, E\big[ I(\varepsilon \le V_n(X, e)) - I(\varepsilon \le e) \big]^2,
\end{align*}
and this can be easily seen to converge to zero almost surely by using the same arguments as in the proof of Lemma B1 in Akritas and Van Keilegom (2001), and by using Proposition A.1. The result now follows from Corollary 2.3.12 in van der Vaart and Wellner (1996). $\Box$
Proof of Theorem 2.1. From the definition of $\hat F_\varepsilon(e)$ we obtain that
\begin{align*}
\hat F_\varepsilon(e) - F_\varepsilon(e)
&= \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right]
+ \left( \frac{\hat w_H}{\mu_w} - 1 \right) \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right] \\
&= \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i} \left[ I(\hat\varepsilon^w_i \le e) - F_\varepsilon(e) \right] (1 + o_P(1))
\end{align*}
uniformly in $e$, because $\mu_w^{-1} \hat w_H = 1 + o_P(1)$. Therefore, it follows from Lemma A.1 that
\[
\hat F_\varepsilon(e) - F_\varepsilon(e)
= \left[ \frac{1}{n} \sum_{i=1}^n \frac{\mu_w}{w_i}\, \{ I(\varepsilon^w_i \le e) - F_\varepsilon(e) \} + P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) - P(\varepsilon \le e) \right] (1 + o_P(1)) + o_P(n^{-1/2}),
\]
uniformly in $e$. Moreover, taking into account that $P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) = P(\varepsilon \le e + \Delta_n(X, e) \,|\, \mathcal{X}_n)$ (where $\Delta_n(X, e)$ is defined in the statement of the theorem), we obtain that
\[
P(\hat\varepsilon \le e \,|\, \mathcal{X}_n) - P(\varepsilon \le e) = f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x) + o_P(n^{-1/2})
\]
uniformly in $e$, because of assumption A4. Using Proposition A.2 we find that
\[
f_\varepsilon(e)\, \Delta_n(x, e) = \frac{\mu_w}{n h_n f_X(x)} \sum_{i=1}^n \frac{1}{w_i}\, K\!\left(\frac{X^w_i - x}{h_n}\right) \varphi(x, Y^w_i, e) + o_P(n^{-1/2})
\]
uniformly in $x$ and $e$, where the function $\varphi$ is given in the statement of the theorem. Now we proceed as in the proof of Theorem 3.1 in Akritas and Van Keilegom (2001), which leads to the given asymptotic expansion for $f_\varepsilon(e) \int \Delta_n(x, e)\, dF_X(x)$. $\Box$
Proof of Theorem 2.2. Following similar ideas to those in the proof of Theorem 3.2 in Akritas and Van Keilegom (2001), we consider the class of functions
\[
\mathcal{F} = \left\{ (x, y) \to \frac{\mu_w}{w(x, y)} \left[ I\left( \frac{y - m(x)}{\sigma(x)} \le e \right) - F_\varepsilon(e) + \varphi(x, y, e) \right] : e \in \mathbb{R} \right\}.
\]
Using the notation $\omega(x, \varepsilon) = w(x, y)$ for $y = m(x) + \sigma(x)\varepsilon$, we can reparametrize $\mathcal{F}$ as
\[
\mathcal{F}' = \left\{ (x, \varepsilon) \to \frac{\mu_w}{\omega(x, \varepsilon)} \left[ I(\varepsilon \le e) - F_\varepsilon(e) + f_\varepsilon(e)\{A(\varepsilon) + B(\varepsilon)\, e\} \right] : e \in \mathbb{R} \right\},
\]
where $A$ and $B$ are the fixed functions $A(\varepsilon) = \varepsilon$ and $B(\varepsilon) = (\varepsilon^2 - 1)/2$. In order to show the weak convergence of the process given in the statement of the theorem, we need to show that the class $\mathcal{F}'$ is Donsker, which we will establish by calculating the bracketing number $N_{[\,]}(\epsilon, \mathcal{F}', L_2)$ of $\mathcal{F}'$ with respect to the joint probability measure of $(X^w, \varepsilon^w)$.
The bracketing number of the class $\{\varepsilon \to I(\varepsilon \le e) : e \in \mathbb{R}\}$ is $O(\exp(K\epsilon^{-1}))$ because of Theorem 2.7.5 in van der Vaart and Wellner (1996). Next, note that $A(\varepsilon)$, $B(\varepsilon)$ and $\omega(x, \varepsilon)$ are independent of $e$, while the functions $e \to F_\varepsilon(e)$, $e \to f_\varepsilon(e)$ and $e \to e f_\varepsilon(e)$ are deterministic bounded functions, whose bracketing number is $O(\epsilon^{-1})$. Therefore, Theorems 2.5.6 and 2.10.6 in van der Vaart and Wellner (1996) entail that the class $\mathcal{F}'$ is Donsker. $\Box$
Proof of Lemma 3.1. The first part of the lemma will be proved by verifying the two
conditions of Theorem 5.7 in van der Vaart (1998) with M_n(θ) = −S^w_n(θ) and M(θ) =
−S^w_0(θ). As m_θ(x) is a continuous function of θ for all x, w(X^w, Y^w)^{-1}(Y^w − m_θ(X^w))²
is continuous as well. Hence, as a consequence of Theorem 2 in Jennrich (1969), we have
that sup_θ |S^w_n(θ) − S^w_0(θ)| = o_p(1), so the first condition in van der Vaart (1998) is
satisfied. The second condition is a consequence of assumption A5 (ii).
Next, using assumption A5 (v) and a Taylor series expansion, we find that

θ̂_n − θ_0 = −(∂²S^w_n(θ̃_n)/∂θ²)^{-1} ∂S^w_n(θ_0)/∂θ,

where θ̃_n lies in the open segment (θ_0, θ̂_n) and ∂²S^w_n(θ̃_n)/∂θ² is the matrix with entries
∂²S^w_n(θ̃_n)/∂θ_k ∂θ_l (k, l = 1, …, d). Next, note that

∂S^w_n(θ_0)/∂θ = −(2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ_0}(X^w_i)/∂θ] (Y^w_i − m_{θ_0}(X^w_i)),

while

∂²S^w_n(θ̃_n)/∂θ² = (2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ̃_n}(X^w_i)/∂θ] [∂m_{θ̃_n}(X^w_i)/∂θ]^T
                  − (2/n) Σ_{i=1}^n (1/w_i) [∂²m_{θ̃_n}(X^w_i)/∂θ²] (Y^w_i − m_{θ̃_n}(X^w_i))
                = (2/n) Σ_{i=1}^n (1/w_i) [∂m_{θ_0}(X^w_i)/∂θ] [∂m_{θ_0}(X^w_i)/∂θ]^T
                  − (2/n) Σ_{i=1}^n (1/w_i) [∂²m_{θ_0}(X^w_i)/∂θ²] (Y^w_i − m_{θ_0}(X^w_i)) + o_p(1),   (9)

which follows easily from the fact that θ̃_n − θ_0 = o_p(1). The first term on the right-hand
side of (9) equals

2 E[(1/w(X^w, Y^w)) (∂m_{θ_0}(X^w)/∂θ)(∂m_{θ_0}(X^w)/∂θ)^T] + o_p(1) = (2/µ_w) Ω + o_p(1),

because of assumption A5 (v). The second term of (9) is o_p(1), since its expected value
equals zero. Hence, taking into account that ∂S^w_n(θ_0)/∂θ = O_p(n^{-1/2}) and using assumption
A5 (iv), we have that

θ̂_n − θ_0 = (1/n) Σ_{i=1}^n (µ_w/w_i) Ω^{-1} [∂m_{θ_0}(X^w_i)/∂θ] (Y^w_i − m_{θ_0}(X^w_i)) + o_p(n^{-1/2})
           = (1/n) Σ_{i=1}^n (µ_w/w_i) η_{θ_0}(X^w_i, Y^w_i) + o_p(n^{-1/2}). □
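To make the criterion concrete, here is a minimal sketch (our own illustration, not code from the paper) of the weighted least-squares estimator θ̂_n minimising S^w_n(θ) = n^{-1} Σ_i w_i^{-1}(Y^w_i − m_θ(X^w_i))², for the hypothetical linear family m_θ(x) = θ_1 + θ_2 x and the weight function w(x, y) = y:

```python
# Sketch of the weighted least-squares criterion S_n^w used in Lemma 3.1,
# for the hypothetical linear family m_theta(x) = theta_1 + theta_2 * x and
# the weight function w(x, y) = y (both our own illustrative choices).
import numpy as np

rng = np.random.default_rng(1)

# Draw from the y-biased distribution by rejection sampling.
xs, ys = [], []
while len(xs) < 10000:
    x = rng.uniform(1.0, 2.0)
    y = 2.0 * x + 0.25 * rng.standard_normal()   # true theta_0 = (0, 2)
    if rng.uniform() < y / 6.0:
        xs.append(x)
        ys.append(y)
x, y = np.array(xs), np.array(ys)

# Minimizing S_n^w(theta) = n^{-1} sum_i w_i^{-1} (Y_i - theta_1 - theta_2 X_i)^2
# is a weighted least-squares problem with weights 1/w_i = 1/Y_i; for a linear
# family it reduces to the weighted normal equations solved below.
W = 1.0 / y
X = np.column_stack([np.ones_like(x), x])
theta_hat = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))

print(theta_hat)   # close to (0, 2) despite the biased sampling
```

Because each squared residual is reweighted by 1/w_i, the estimating equations are unbiased under the biased sampling and θ̂_n recovers θ_0 up to sampling error; these 1/w_i factors are exactly the correction that Lemma 3.1 builds into η_{θ_0}.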
Proof of Theorem 3.2. Following the same arguments as in the proof of Theorem 2.1,
we find that

F̂_{ε0}(e) − F_ε(e) = (1/n) Σ_{i=1}^n (µ_w/w_i) {I(ε^w_i ≤ e) − F_ε(e)} + f_ε(e) ∫ Δ_{n0}(x, e) dF_X(x) + o_p(n^{-1/2})

uniformly in e, where

Δ_{n0}(x, e) = (1/σ(x)) {m_{θ̂_n}(x) − m(x) + e (σ̂_n(x) − σ(x))}.

Combining this with the asymptotic expansion of F̂_ε(e) − F_ε(e) given in Theorem 2.1, we
have

F̂_ε(e) − F̂_{ε0}(e) = f_ε(e) ∫ D_n(x) dF_X(x) + o_p(n^{-1/2})

uniformly in e, where

D_n(x) = (1/σ(x)) {m̂_n(x) − m(x) − (m_{θ̂_n}(x) − m(x))}.

Therefore, the expansions given in Proposition A.2 and Lemma 3.1 lead to

F̂_ε(e) − F̂_{ε0}(e) = f_ε(e) (1/n) Σ_{i=1}^n (µ_w/w_i) {ε^w_i − M^T_{θ_0} η_{θ_0}(X^w_i, Y^w_i)} + o_p(n^{-1/2}),

where

M_{θ_0} = ∫ (1/σ(x)) [∂m_{θ_0}(x)/∂θ] dF_X(x),

and this proves the result. □
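The expansion above suggests comparing the two weighted residual distributions through a Kolmogorov-Smirnov-type distance. The sketch below is our own construction: the Gaussian kernel, the ad hoc bandwidth h = 0.15 and the data-generating design are illustrative assumptions, and for simplicity the true scale σ = 0.25 is plugged in. It computes sup_e |F̂_ε(e) − F̂_{ε0}(e)| on simulated selection-biased data for which the parametric model holds:

```python
# Illustrative sketch (our own construction, with hypothetical kernel,
# bandwidth and model choices): the sup-distance between the weighted
# residual distributions compared in Theorem 3.2, computed on simulated
# y-biased data for which m_theta(x) = theta_1 + theta_2 * x holds.
import numpy as np

rng = np.random.default_rng(2)

# y-biased sample from Y = 2X + 0.25 * eps (rejection sampling).
xs, ys = [], []
while len(xs) < 2000:
    u = rng.uniform(1.0, 2.0)
    v = 2.0 * u + 0.25 * rng.standard_normal()
    if rng.uniform() < v / 6.0:
        xs.append(u)
        ys.append(v)
x, y = np.array(xs), np.array(ys)
w = y                # known weight function w(x, y) = y
sigma = 0.25         # the true scale is plugged in for simplicity

# Weighted parametric fit (cf. Lemma 3.1) for the linear family.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ ((1.0 / w)[:, None] * X), X.T @ (y / w))

# Weighted Nadaraya-Watson estimate of m (Gaussian kernel, ad hoc bandwidth).
h = 0.15
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
m_hat = (K @ (y / w)) / (K @ (1.0 / w))

eps_np = (y - m_hat) / sigma       # residuals from the nonparametric fit
eps_p = (y - X @ theta) / sigma    # residuals from the parametric fit

# Weighted empirical distributions and their sup-distance.
def wecdf(res, grid, wts):
    return np.array([np.sum((res <= e) / wts) for e in grid]) / np.sum(1.0 / wts)

grid = np.sort(np.concatenate([eps_np, eps_p]))
T = np.max(np.abs(wecdf(eps_np, grid, w) - wecdf(eps_p, grid, w)))
print(T)   # small when the parametric model holds
```

Under the null hypothesis the statistic is of order n^{-1/2} (up to smoothing effects), so small values are expected here; in practice the paper approximates the critical values of such statistics by bootstrap.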
Acknowledgements. The authors would like to thank Jean Opsomer (Colorado State
University) for helpful discussions and for providing the data set analyzed in this paper.
The first author acknowledges financial support from the D.G.A. (CONSI+D) and CAI
program ‘Europa’ and DGI Grant MTM2005-01464. The second author acknowledges fi-
nancial support from IAP research network P6/03 of the Belgian Government (Belgian
Science Policy).
References
Akritas, M.G. and Van Keilegom, I. (2001). Nonparametric estimation of the residual
distribution. Scand. J. Statist., 28, 549–568.
Breidt, F.J., Opsomer, J.D., Johnson, A.A. and Ranalli, M.G. (2007). Semiparametric
model-assisted estimation for natural resource surveys. Survey Methodology, 33, 35–44.
Cheng, F. (2004). Weak and strong uniform consistency of a kernel error density estimator
in nonparametric regression. J. Statist. Plann. Inference, 119, 95–107.
Cristóbal, J.A., Ojeda, J.L. and Alcalá, J.T. (2004). Confidence bands in nonparametric
regression with length biased data. Ann. Inst. Statist. Math., 56, 475–496.
Einmahl, J.H.J. and Van Keilegom, I. (2007). Tests for independence in nonparametric
regression. Statist. Sinica, to appear.
Gill, R.D., Vardi, Y. and Wellner, J.A. (1988). Large sample theory of empirical distribu-
tions in biased sampling models. Ann. Statist., 16, 1069–1112.
Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regres-
sion fits. Ann. Statist., 21, 1926–1947.
Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators. Ann.
Math. Statist., 40, 633–643.
Larsen, D.P., Thornton K.W., Urquhart, N.S. and Paulsen, S.G. (1993). Overview of
survey design and lake selection. EMAP - Surface Waters 1991 Pilot Report. U.S.
Environmental Protection Agency. Edited by Larsen, D.P. and Christie, S.J. Technical
Report EPA/620/R-93/003.
Liero, H. (2003). Testing homoscedasticity in nonparametric regression. J. Nonpar. Statist.,
15, 31–51.
Mahfoud, M. and Patil, G.P. (1982). On weighted distributions. In: Statistics and Proba-
bility: Essays in Honor of C.R. Rao, G. Kallianpur et al. (eds.), pp. 479–492. North
Holland Publishing Co., Amsterdam.
Müller, U.U., Schick, A. and Wefelmeyer, W. (2004a). Estimating functionals of the error
distribution in parametric and nonparametric regression. J. Nonparam. Statist., 16,
525–548.

Müller, U.U., Schick, A. and Wefelmeyer, W. (2004b). Estimating linear functionals of the
error distribution in nonparametric regression. J. Statist. Plann. Infer., 119, 75–93.
Neumeyer, N., Dette, H. and Nagel, E.-R. (2006). Bootstrap tests for the error distribution
in linear and nonparametric regression models. Austr. N. Zeal. J. Statist., 48, 129–156.
Ojeda, J.L. (2008). Hölder continuity properties of the local polynomial estimator (sub-
mitted) (see http://www.unizar.es/galdeano/preprints/lista.html).
Ojeda, J.L., Cristóbal, J.A. and Alcalá, J.T. (2008). A comparison of two bootstrap schemes
on goodness-of-fit tests in the presence of selection bias (submitted) (see
http://www.unizar.es/galdeano/preprints/lista.html).
Pardo-Fernández, J.C. and Van Keilegom, I. (2006). Comparison of regression curves with
censored responses. Scand. J. Statist., 33, 409–434.
Pardo-Fernández, J.C., Van Keilegom, I. and González-Manteiga, W. (2007a). Goodness-
of-fit tests for parametric models in censored regression. Canad. J. Statist., 35, 249–264.

Pardo-Fernández, J.C., Van Keilegom, I. and González-Manteiga, W. (2007b). Testing for
the equality of k regression curves. Statist. Sinica, 17, 1115–1137.
Patil, G.P. and Taillie, C. (1989). Probing encountered data, meta analysis and weighted
distribution methods. In: Statistical data analysis and inference, Y. Dodge (ed.), pp.
317–345. Elsevier, Amsterdam.
Rao, C.R. (1997). Statistics and Truth. Putting chance to work. World Scientific, Singa-
pore.
van der Vaart, A.W. (1998). Asymptotic statistics. Cambridge University Press.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak convergence and empirical processes.
Springer, New York.
Van Keilegom, I. and Akritas, M.G. (1999). Transfer of tail information in censored re-
gression models. Ann. Statist., 27, 1745–1784.
Van Keilegom, I., González-Manteiga, W. and Sánchez-Sellero, C. (2007). Goodness-of-fit
tests in parametric regression based on the estimation of the error distribution. TEST,
to appear.
Vardi, Y. (1982). Nonparametric estimation in the presence of length bias. Ann. Statist.,
10, 616–620.
Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist., 13,
178–205.
Address for correspondence:

Fac. de Ciencias, Edif. Matemáticas, Universidad de Zaragoza, Pedro Cerbuna, núm. 12,
50009 Zaragoza, Spain. Email: [email protected]

Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays, 20,
B-1348 Louvain-la-Neuve, Belgium. Email: [email protected]