Error Variance Estimation in Ultrahigh Dimensional Additive
Models
Zhao Chen, Jianqing Fan and Runze Li
August 1, 2016
Abstract
Error variance estimation plays an important role in statistical inference for high dimensional
regression models. This paper concerns error variance estimation in high dimensional sparse
additive models. We study the asymptotic behavior of the traditional mean squared error, the
naive estimate of the error variance, and show that it may significantly underestimate the error
variance due to spurious correlations, which are even higher in nonparametric models than in linear
models. We further propose an accurate estimate of the error variance in ultrahigh dimensional
sparse additive models by effectively integrating sure independence screening and refitted cross-
validation techniques (Fan, Guo and Hao, 2012). The root n consistency and the asymptotic
normality of the resulting estimate are established. We conduct Monte Carlo simulation studies
to examine the finite sample performance of the newly proposed estimate. A real data example
is used to illustrate the proposed methodology.
Key words: Feature screening, Refitted cross-validation, Sparse additive model, Variance estimation
∗Zhao Chen is Research Associate, Department of Statistics, The Pennsylvania State University at University Park, PA 16802-2111, USA. Email: [email protected]. Chen's research was supported by NSF grant DMS-1206464 and NIH grant R01-GM072611. Jianqing Fan is Frederick L. Moore '18 Professor of Finance, Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544, USA and Honorary Professor, School of Data Science, Fudan University, and Academy of Mathematics and System Science, Chinese Academy of Science, Beijing, China (E-mail: [email protected]). Fan's research was supported by NSF grant DMS-1206464 and NIH grants R01-GM072611 and R01GM100474-01. Runze Li is Verne M. Willaman Professor, Department of Statistics and The Methodology Center, The Pennsylvania State University, University Park, PA 16802-2111. Email: [email protected]. Li's research was supported by NSF grant DMS 1512422 and National Institute on Drug Abuse (NIDA) grants P50 DA039838, P50 DA036107, and R01 DA039854. The authors thank the Editor, the AE and reviewers for their constructive comments, which have led to a dramatic improvement of the earlier version of this paper. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, NIH and NIDA.
1 Introduction
Statistical inference on regression models typically involves the estimation of the variance of its
random error. Hypothesis testing on regression functions, confidence/prediction interval construc-
tion and variable selection all require an accurate estimate of the error variance. In the classical
linear regression analysis, the adjusted mean squared error is an unbiased estimate of the error
variance, and it performs well when the sample size is much larger than the number of predictors,
or more accurately when the degrees of freedom are large. It has been empirically observed that
the mean squared error estimator leads to an underestimation of the error variance when the model is
significantly over-fitted. This has been further confirmed by the theoretical analysis of Fan, Guo
and Hao (2012), in which the authors demonstrated the challenges of error variance estimation in
the high-dimensional linear regression analysis, and further developed an accurate error variance
estimator by introducing refitted cross-validation techniques.
Fueled by the demand in the analysis of genomic, financial, health, and image data, the analysis
of high dimensional data has become one of the most important research topics during the last two
decades (Donoho, 2000; Fan and Li, 2006). There have been a huge number of research papers on
high dimensional data analysis in the literature. It is impossible for us to give a comprehensive
review here. Readers are referred to Fan and Lv (2010), Bühlmann and Van de Geer (2011) and
references therein. Due to the complex structure of high dimensional data, the high dimensional
linear regression analysis may be a good start, but it may not be powerful enough to explore nonlinear
features inherent in the data. Nonparametric regression modeling provides valuable analysis for high
dimensional data (Ravikumar, et al. 2009; Hall and Miller, 2009; Fan, Feng and Song, 2011). This
is particularly the case for error variance estimation, as nonparametric modeling reduces modeling
biases in the estimate, but creates stronger spurious correlations. This paper aims to study issues
of error variance estimation in ultrahigh dimensional nonparametric regression settings.
In this paper, we focus on sparse additive models. Our primary interest is to develop an accurate
estimator for the error variance in ultrahigh dimensional additive models. The techniques developed in
this paper are applicable to other nonparametric regression models such as sparse varying coefficient
models and some commonly-used semiparametric regression models such as sparse partial linear
additive models and sparse semi-varying coefficient partial linear models. Since its introduction by
Friedman and Stuetzle (1981), the additive model has been popular, and many statistical procedures
have been developed for sparse additive models in the recent literature. Lin and Zhang (2006)
proposed COSSO method to identify significant variables in multivariate nonparametric models.
Bach (2008) studied penalized least squares regression with group Lasso-type penalty for linear
predictors and regularization on reproducing kernel Hilbert space norms, which is referred to as
multiple kernel learning. Xue (2009) studied variable selection problem in additive models by
integrating a group-SCAD penalized least squares method (Fan and Li, 2001) and the regression
spline technique. Ravikumar et al. (2009) modified the backfitting algorithm for sparse additive
models, and further established the model selection consistency of their procedure. Meier, Van de
Geer and Bühlmann (2009) studied the model selection and estimation of additive models with a
diverging number of significant predictors. They proposed a new sparsity and smoothness penalty
and proved that their method can select all nonzero components with probability approaching
1 as the sample size tends to infinity. With the ordinary group Lasso estimator as the initial
estimator, Huang, Horowitz and Wei (2010) applied the adaptive group Lasso to additive models under
the setting in which there is only a finite fixed number of significant predictors. Fan, Feng and Song
(2011) proposed a nonparametric independent screening procedure for sparse ultrahigh dimensional
data, and established its sure screening property in the terminology of Fan and Lv (2008).
In this paper, we propose an error variance estimate in ultrahigh dimensional additive models.
It is typical to assume sparsity in ultrahigh dimensional data analysis. By sparsity, we mean that
the regression function depends only on a few significant predictors, and the number of significant
predictors is assumed to be much smaller than the sample size. Because of the basis expansion
in nonparametric fitting, the actual number of terms increases significantly in additive models.
Therefore, the spurious correlation documented in Fan, Guo and Hao (2012) increases significantly.
This is indeed demonstrated in Lemma 1, which shows that the spurious correlation with the
response increases from √(n^{-1} log p), using the single most correlated predictor among p variables, to
√(d_n n^{-1} log(p d_n)), using the single most correlated predictor with d_n basis functions. If s variables are
[Figure 1: Distributions of the maximum "linear" and "nonparametric" spurious correlations for s = 1 and s = 2 (left panel, n = 50 and p = 1000) and their consequences on the estimation of noise variances (right panel). The legend 'LM' stands for linear model, and 'AM' stands for additive model, i.e., nonparametric model.]
used, the spurious correlation may increase to its upper bound at an exponential rate of s.
To quantify this increase and to explain the concept and the problem more clearly, we simulate
n = 50 data points with independent normal covariates {X_j}_{j=1}^p (with p = 1000) and an
independent normal response Y. In this null model, all covariates {X_j}_{j=1}^p and the response Y
are independent and follow the standard normal distribution. As in Fan, Guo and Hao (2012), we
compute the maximum "linear" spurious correlation ζ_n^L = max_{1≤j≤p} |ĉorr(X_j, Y)| and the
maximum "nonparametric" spurious correlation ζ_n^N = max_{1≤j≤p} |ĉorr(f̂_j(X_j), Y)|, where f̂_j(X_j) is the
best cubic spline fit of variable X_j to the response Y, using 3 equally spaced knots in the range
of the variable X_j, which creates d_n = 6 B-spline bases for X_j. The concept of the maximum spurious
"linear" and spurious "nonparametric" (additive) correlations can easily be extended to s
variables, as the correlation between the response and the fitted values using the best subset
of s variables. Based on 500 simulated data sets, Figure 1 depicts the results, which show the big
increase in spurious correlations from linear to nonparametric fits. As a result, the noise variance
is significantly underestimated.
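The null-model experiment above is easy to reproduce. The sketch below (Python with numpy only; all function names are ours, and a truncated-power cubic basis with 3 interior knots stands in for the B-spline construction, giving d_n = 6 columns per variable) computes the maximum "linear" and "nonparametric" spurious correlations for s = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 1000

def spline_basis(x, n_knots=3):
    # Truncated-power cubic basis (stand-in for B-splines):
    # x, x^2, x^3 plus one truncated cubic per interior knot
    # -> 3 + n_knots = 6 columns, mirroring d_n = 6.
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    return np.column_stack([x, x**2, x**3] +
                           [np.maximum(x - k, 0.0)**3 for k in knots])

def fitted_corr(B, y):
    # Correlation between y and its least-squares fit on basis B
    # (with intercept), i.e., the multiple correlation coefficient.
    M = np.column_stack([np.ones(len(y)), B])
    yhat = M @ np.linalg.lstsq(M, y, rcond=None)[0]
    return abs(np.corrcoef(yhat, y)[0, 1])

# Null model: covariates and response are all independent N(0, 1).
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

zeta_L = max(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p))
zeta_N = max(fitted_corr(spline_basis(X[:, j]), y) for j in range(p))
print(f"linear: {zeta_L:.3f}  nonparametric: {zeta_N:.3f}")
```

Because the spline basis of each variable contains the variable itself, the nonparametric spurious correlation dominates the linear one for every j, so its maximum over j is strictly larger, which is the pattern Figure 1 displays.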
The above reasoning and evidence show that the naive estimate of the error variance is seriously
biased. This is indeed shown in Theorem 1. This prompts us to propose a two-stage refitted cross-
validation procedure to reduce spurious correlation. In the first stage, we apply a sure independence
screening procedure to reduce the ultrahigh dimensional problem to a relatively large dimensional
regression problem. In the second stage, we apply the refitted cross-validation technique, which was
proposed for linear regression models by Fan, Guo and Hao (2012), to the dimension-reduced additive
models obtained from the first stage. The implementation of the newly proposed procedure is not difficult.
However, establishing its sampling properties is challenging, because the dimensionality of ultrahigh
dimensional sparse additive models becomes even higher.
We propose using B-splines to approximate the nonparametric functions, and first study the
asymptotic properties of the traditional mean squared error, a naive estimator of the error variance.
Under some mild conditions, we show that the mean squared error leads to a significant underestimate
of the error variance. We then study the sampling properties of the proposed refitted
cross-validation estimate, and establish its asymptotic normality. From our theoretical analysis, it
can be found that the refitted cross-validation technique eliminates the side effects due to overfitting.
We also conduct Monte Carlo simulation studies to examine the finite sample performance
of the proposed procedure. Our simulation results show that the newly proposed error variance
estimate may perform significantly better than the mean squared error.
This paper makes the following major contributions. (a) We show that the traditional mean squared
error, as a naive estimate of the error variance, is seriously biased. Although this is expected, the
rigorous theoretical development is indeed challenging rather than straightforward. (b) We propose a
refitted cross-validation error variance estimator for ultrahigh dimensional nonparametric additive
models, and further establish the asymptotic normality of the proposed estimator. The asymptotic
normality implies that the proposed estimator is asymptotically unbiased and root n consistent. The
extension of refitted cross-validation error variance estimation from linear models to nonparametric
models is interesting, and not straightforward in terms of theoretical development, because the
bias due to approximation error calls for new techniques to establish the theory. Furthermore, the
related techniques developed in this paper may be further applied to refitted cross-validation error
variance estimation in other ultrahigh-dimensional nonparametric regression models such as vary-
ing coefficient models and ultrahigh dimensional semiparametric regression models such as partially
linear additive models and semiparametric partially linear varying coefficient models.
This paper is organized as follows. In Section 2, we propose a new error variance estimation pro-
cedure, and further study its sampling properties. In Section 3, we conduct Monte Carlo simulation
studies to examine the finite sample performance of the proposed estimator, and demonstrate the
new estimation procedure by a real data example. Some concluding remarks are given in Section
4. Technical conditions and proofs are given in the Appendix.
2 New procedures for error variance estimation
Let Y be a response variable, and x = (X1, · · · , Xp)T be a predictor vector. The additive model
assumes that
Y = µ + Σ_{j=1}^p f_j(X_j) + ε,   (2.1)

where µ is the intercept term, {f_j(·), j = 1, · · · , p} are the unknown functions, and ε is the random
error with E(ε) = 0 and var(ε) = σ2. Following the convention in the literature, it is assumed
throughout this paper that Efj(Xj) = 0 for j = 1, · · · , p so that model (2.1) is identifiable. This
assumption implies that µ = E(Y). Thus, a natural estimator for µ is the sample average of the Y_i's.
This estimator is root n consistent, and its rate of convergence is faster than that of the estimators
of the nonparametric functions f_j. Without loss of generality, we further assume µ = 0 for ease of
notation. The goal of this section is to develop an estimation procedure for σ2 for additive models.
2.1 Refitted cross-validation
In this section, we propose a strategy to estimate the error variance when the predictor vector
is ultrahigh dimensional. Since the f_j's are nonparametric smooth functions, it is natural to use
smoothing techniques to estimate them. We employ the B-spline method throughout this paper;
readers are referred to De Boor (1978) for the detailed procedure of B-spline construction. Let
{B_{jk}(x), k = 1, · · · , d_j, a ≤ x ≤ b} be a B-spline basis of the space S^l_j([a, b]), the polynomial
spline space defined on the finite interval [a, b] with degree l ≥ 1, with knots depending on j.
Approximate f_j by its spline expansion
f_j(x) ≈ Σ_{k=1}^{d_j} γ_{jk} B_{jk}(x)   (2.2)
for some d_j ≥ 1. In practice, d_j is allowed to grow with the sample size n, and is therefore denoted
by d_{jn} to emphasize the dependence on n. With a slight abuse of notation, we use d_n to stand for
d_{jn} for ease of notation. Thus, model (2.1) can be written as
Y ≈ Σ_{j=1}^p Σ_{k=1}^{d_n} γ_{jk} B_{jk}(X_j) + ε.   (2.3)
Suppose that {(x_i, Y_i), i = 1, · · · , n} is a random sample from the additive model (2.1). Model
(2.3) is not estimable when p d_n > n. It is common to assume sparsity in ultrahigh-dimensional
data analysis. By sparsity in the additive model, we mean that only a few of the f_j satisfy
‖f_j‖^2 = E f_j^2(X_j) ≠ 0 while the others satisfy ‖f_j‖ = 0. A general strategy to reduce ultrahigh
dimensionality is sure independence feature screening, which enables one to reduce the ultrahigh
dimension to a large or high dimension. Some existing feature screening procedures can be directly
applied to ultrahigh dimensional sparse additive models. Fan, Feng and Song (2011) proposed the
nonparametric independence screening (NIS) method and further showed that the NIS method
possesses the sure screening property for ultrahigh dimensional additive models. That is, under
some regularity conditions, with an overwhelming probability, the NIS is able to retain all active
predictors after feature screening. Li, Zhong and Zhu (2012) proposed a model-free feature screening
procedure based on distance correlation, sure independence screening via distance correlation
(DC-SIS). The DC-SIS is also shown to have the sure screening property. Both NIS and DC-SIS can
be used for feature screening in ultrahigh dimensional sparse additive models, although we will use
DC-SIS in our numerical implementation due to its intuitive and simple implementation.
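To make the DC-SIS ranking concrete, the empirical distance correlation of Székely, Rizzo and Bakirov (2007) can be computed by double-centering the pairwise distance matrices. The sketch below (Python with numpy only; function names are ours, and the O(n²) centering is the textbook formula rather than a fast implementation) ranks predictors and keeps the top ŝ:

```python
import numpy as np

def dist_corr(x, y):
    # Empirical distance correlation of two univariate samples:
    # double-center each pairwise distance matrix and correlate them.
    def centered(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                      # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def dc_screen(X, y, s_hat):
    # DC-SIS-style screening: keep the s_hat predictors with the
    # largest distance correlation with the response.
    scores = np.array([dist_corr(X[:, j], y) for j in range(X.shape[1])])
    return np.sort(np.argsort(scores)[-s_hat:])

# Toy check: a purely nonlinear signal in X_2 (Pearson correlation near 0)
# is still ranked highly by the distance correlation.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 40))
y = X[:, 1] ** 2 + 0.5 * rng.standard_normal(300)
print(1 in dc_screen(X, y, 5))
```

This model-free behavior on nonlinear signals is what makes distance correlation attractive for screening in additive models.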
Hereafter we always assume that all important variables have been selected by the screening
procedure. Even under this assumption, we will overfit the response variable Y and underestimate
the error variance σ^2, because extra variables are actually selected to predict the realized noises
(Fan, Guo and Hao, 2012). After feature screening, a direct estimate of σ^2 is the mean squared
error of the least squares fit. That is, we apply a feature screening procedure such as DC-SIS or
NIS to screen the x-variables and fit the data to the corresponding selected spline regression model.
Denote by D^* the index set of all true predictors and by D̂^* the index set of the selected predictors,
satisfying the sure screening property D^* ⊂ D̂^*. Then, we minimize the
following least squares function with respect to γ:
Σ_{i=1}^n ( Y_i − Σ_{j∈D̂^*} Σ_{k=1}^{d_n} γ_{jk} B_{jk}(X_{ij}) )^2.   (2.4)
Denote by γ̂jk the resulting least squares estimate. Then, the nonparametric residual variance
estimator is
σ̂_{D̂^*}^2 = (n − |D̂^*| · d_n)^{−1} Σ_{i=1}^n ( Y_i − Σ_{j∈D̂^*} Σ_{k=1}^{d_n} γ̂_{jk} B_{jk}(X_{ij}) )^2.
Hereafter |D| stands for the cardinality of a set D, and we have implicitly assumed that the choice
of D̂^* and d_n is such that n ≫ |D̂^*| · d_n. It will be shown in Theorem 1 below that σ̂_{D̂^*}^2 significantly
underestimates σ^2, due to the spurious correlation between the realized but unobserved noises and
the spline bases. Indeed, we will show that σ̂_{D̂^*}^2 is an inconsistent estimate when |D̂^*| · d_n is large.
Specifically, let P_{D̂^*} be the projection matrix of model (2.4) based on the entire sample, and denote
γ̂_n^2 = ε^T P_{D̂^*} ε / ε^T ε, where ε = (ε_1, · · · , ε_n)^T. We will show that σ̂_{D̂^*}^2/(1 − γ̂_n^2) converges
to σ^2 at the root n rate, yet the spurious correlation γ̂_n^2 is of order

γ̂_n^2 = O_p( (2/(1 − δ))^{|D̂^*|} d_n log(p d_n)/n ), for some δ ∈ (0, 1).   (2.5)
See Lemma 1 and Theorem 1 in Section 2.2 for details. Our first aim is to propose a new estimation
procedure for σ^2 using the refitted cross-validation technique (Fan, Guo and Hao, 2012).
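The shrinkage factor 1 − γ̂_n^2 is easy to exhibit numerically. The sketch below (Python with numpy only; all names are ours, marginal Pearson screening stands in for DC-SIS, and a truncated-power cubic basis stands in for B-splines) fits a null model after screening and compares the naive estimate σ̂²_{D̂*} with the corrected ratio σ̂²_{D̂*}/(1 − γ̂_n^2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s_hat = 500, 1000, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # null model: true sigma^2 = 1

def basis(x):                       # truncated-power cubic, 6 columns
    knots = np.linspace(x.min(), x.max(), 5)[1:-1]
    return np.column_stack([x, x**2, x**3] +
                           [np.maximum(x - k, 0.0)**3 for k in knots])

# Screen: keep the s_hat predictors most (spuriously) correlated with y.
r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
D = np.argsort(r)[-s_hat:]

B = np.column_stack([np.ones(n)] + [basis(X[:, j]) for j in D])
P = B @ np.linalg.pinv(B)           # projection onto the fitted space
gamma2 = (y @ P @ y) / (y @ y)      # spurious correlation gamma_n^2
naive = y @ (np.eye(n) - P) @ y / (n - B.shape[1])
print(f"naive: {naive:.3f}  corrected: {naive / (1 - gamma2):.3f}")
```

In runs of this sketch the naive estimate falls well below the true value 1, while dividing by 1 − γ̂_n^2 removes the shrinkage; the catch, as noted above, is that γ̂_n^2 depends on the unobserved ε and cannot be computed from real data.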
The refitted cross-validation procedure randomly splits the sample into two data sets, denoted
by I_1 and I_2, of approximately equal size. Without loss of generality, assume throughout this
paper that I_1 and I_2 both have sample size n/2. We apply a feature screening procedure
(e.g., DC-SIS or NIS) to each set, and obtain two index sets of selected x-variables, denoted by
D̂_1 and D̂_2. Both of them retain all important predictors. The refitted cross-validation procedure
consists of three steps. In the first step, we fit the data in I_l to the selected additive model D̂_{3−l} for
l = 1 and 2 by the least squares method. This results in two least squares estimates γ̂^{(3−l)} based
on I_l, respectively. In the second step, we calculate the mean squared errors for each fit:
σ̂_l^2 = (n/2 − |D̂_{3−l}| · d_n)^{−1} Σ_{i∈I_l} ( Y_i − Σ_{j∈D̂_{3−l}} Σ_{k=1}^{d_n} γ̂_{jk}^{(3−l)} B_{jk}(X_{ij}) )^2

for l = 1 and 2. In the third step, the refitted cross-validation estimate of σ^2 is defined by

σ̂_{RCV}^2 = (σ̂_1^2 + σ̂_2^2)/2.
This estimator is adapted from the one proposed in Fan, Guo and Hao (2012) for linear regression
models; however, it is much more challenging to establish the asymptotic properties of σ̂_{RCV}^2 for
large dimensional additive models than for linear regression models. The major hurdle is to
deal with the approximation error in nonparametric modeling as well as the correlation structure
induced by the B-spline bases. The refitted cross-validation procedure is illustrated schematically
in Figure 2.
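Putting the three steps together, a self-contained sketch of the RCV estimator might look as follows (Python with numpy only; for brevity a marginal Pearson-correlation ranking stands in for DC-SIS/NIS, a truncated-power cubic basis stands in for B-splines, and all function names are ours):

```python
import numpy as np

def spline_basis(x, n_knots=3):
    # Truncated-power cubic basis (stand-in for the paper's B-splines):
    # 3 + n_knots = 6 columns per variable.
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    return np.column_stack([x, x**2, x**3] +
                           [np.maximum(x - k, 0.0)**3 for k in knots])

def screen(X, y, s_hat):
    # Stand-in screener: rank by |Pearson correlation| with y.
    # Any sure screening procedure (DC-SIS, NIS) can be used here.
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(r)[-s_hat:]

def mse_fit(X, y, idx):
    # Additive spline least squares on the variables in idx; return the
    # degrees-of-freedom-adjusted mean squared error.
    B = np.column_stack([np.ones(len(y))] +
                        [spline_basis(X[:, j]) for j in idx])
    resid = y - B @ np.linalg.lstsq(B, y, rcond=None)[0]
    return resid @ resid / (len(y) - B.shape[1])

def rcv_variance(X, y, s_hat, seed=0):
    # Step 1: split; Step 2: screen on one half, refit on the other;
    # Step 3: average the two mean squared errors.
    n = len(y)
    perm = np.random.default_rng(seed).permutation(n)
    I1, I2 = perm[: n // 2], perm[n // 2:]
    D1, D2 = screen(X[I1], y[I1], s_hat), screen(X[I2], y[I2], s_hat)
    return 0.5 * (mse_fit(X[I1], y[I1], D2) + mse_fit(X[I2], y[I2], D1))
```

The key design point is the cross-fitting: the redundant variables chosen on one half are independent of the other half's noise, so they cannot predict it, and the downward bias of the naive estimator disappears.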
2.2 Sampling properties
We next study the asymptotic properties of σ̂_{D̂^*}^2 and σ̂_{RCV}^2. The following technical conditions are
needed to facilitate the proofs, although they may not be the weakest.

(C1) There exist two positive constants A_1 and A_2 such that E{ exp(A_1|ε|) | x } ≤ A_2.
(C2) For all j, f_j(·) ∈ C^d([a, b]), the class of functions whose r-th derivative f_j^{(r)} exists and
satisfies

|f_j^{(r)}(s) − f_j^{(r)}(t)| ≤ L|s − t|^α, for s, t ∈ [a, b], j = 1, · · · , p,   (2.6)

for a given constant L > 0, where r ≤ l is the "integer part" of d and α ∈ (0, 1] is such that
d = r + α ≥ 2. Furthermore, it is assumed that d_n = O(n^{1/(2d+1)}), the optimal nonparametric
rate (Stone, 1985).

[Figure 2: Refitted Cross Validation Procedure. Schematic: the sample (y_i, X_i), i = 1, · · · , n, is split into halves I_1 and I_2; screening on each half yields index sets D̂_1 and D̂_2; then y_i^{(1)} is fitted on X_{ji}^{(1)}, i ∈ I_1, j ∈ D̂_2, and y_i^{(2)} on X_{ji}^{(2)}, i ∈ I_2, j ∈ D̂_1, giving variance estimates σ̂_1^2 and σ̂_2^2, which are averaged: σ̂_{RCV}^2 = (σ̂_1^2 + σ̂_2^2)/2.]
(C3) The joint distribution of the predictors x is absolutely continuous and its density g is bounded
by two positive numbers b and B satisfying b ≤ g ≤ B. Each predictor X_j, j = 1, · · · , p,
has a continuous density function g_j, which satisfies 0 < A_3 ≤ g_j(x) ≤ A_4 for any x ∈ [a, b].

Hereafter, we let p_n stand for p to emphasize that the dimension p of the predictor vector may depend on n. Since
the DC-SIS and the NIS possess the sure screening property, the subset of predictors selected
by the utilized screening procedure contains all active predictors with probability tending to one.
Thus, in the following two theorems we assume that all active predictors are retained in the feature
screening stage. This can be achieved by imposing the conditions in Li, Zhong and Zhu
(2012) for the DC-SIS and the conditions in Fan, Feng and Song (2011) for the NIS. We first derive
the orders of ε^T P_{D̂^*} ε and γ̂_n^2 in the next lemma, which plays a critical role in the proofs of
Theorems 1 and 2 below. The proofs of Lemma 1 and Theorems 1 and 2 are given in the Appendix.
Lemma 1. Under Conditions (C1)–(C3), it follows that

ε^T P_{D̂^*} ε = O_p{ (2/(1 − δ))^{ŝ} d_n log(p d_n) }, and γ̂_n^2 = ε^T P_{D̂^*} ε / ε^T ε = O_p{ (2/(1 − δ))^{ŝ} d_n log(p d_n)/n },

where δ ∈ (√(1 − b^2 ζ_0/B^2), 1) for some constant ζ_0 ∈ (0, 1), with b and B given in Condition (C3).
Lemma 1 clearly shows that the spurious correlation γ̂_n^2 increases to its upper bound at an
exponential rate in ŝ, since δ ∈ (0, 1) and 2/(1 − δ) > 2.
Theorem 1. Assume that lim sup_{n→∞} γ̂_n^2 < 1. Let ŝ = |D̂^*| be the number of elements in the
estimated active index set D̂^*. Assume that all active predictors are retained in the stage of feature
screening; that is, D̂^* contains all active predictors. Under Conditions (C1)–(C3), the following
statements hold:

(i) If log(p_n) = O(n^ζ), 0 ≤ ζ < 1, and ŝ = O_p(log(n)), then σ̂_{D̂^*}^2/(1 − γ̂_n^2) converges to σ^2 in
probability as n → ∞;

(ii) If log(p_n) = O(n^ζ), 0 ≤ ζ < 3/(2d + 1), and ŝ = O_p(log(n)), then it follows that

√n( σ̂_{D̂^*}^2/(1 − γ̂_n^2) − σ^2 ) →_L N(0, Eε_1^4 − σ^4),   (2.7)

where →_L stands for convergence in law.
Theorem 1 (i) clearly indicates that the naive error variance estimator σ̂_{D̂^*}^2 underestimates σ^2
by a factor of (1 − γ̂_n^2); by Lemma 1, γ̂_n^2 is of the order given in (2.5) and is not small. Since γ̂_n^2
cannot be estimated directly from the data, it is challenging to derive an adjusted error variance
estimate by modifying the commonly-used mean squared error. On the other hand, the refitted
cross-validation method provides an automatic bias correction via refitting and hence a consistent
estimator, as we now show.
Theorem 2. Assume that D̂_j^* contains all active predictors for j = 1 and 2. Let ŝ_j = |D̂_j^*|
be the number of elements in D̂_j^*. Under Conditions (C1)–(C3), if ŝ_1 = o(n^{(2d−1)/(4(2d+1))}) and
ŝ_2 = o(n^{(2d−1)/(4(2d+1))}), then

√n( σ̂_{RCV}^2 − σ^2 ) →_L N(0, Eε_1^4 − σ^4).   (2.8)
Compared with the result in Theorem 1, the refitted cross-validation method eliminates the
side effect of the selected redundant variables, correcting the bias of the naive variance estimator
through the refitting step. This bias factor can be non-trivial.
Remark 1. This remark provides some implications and limitations of Theorems 1 and 2 and
some clarification of conditions implicitly required by Theorem 2.
(a) From the proofs of Theorems 1 and 2, it has been shown that σ̂_{D̂^*}^2/(1 − γ̂_n^2) = σ^2 + O_p(1/√n)
and σ̂_{RCV}^2 = σ^2 + O_p(1/√n). As a result, the ratio of the naive estimator to the RCV estimate
may be used to provide an estimate of the shrinkage factor 1 − γ̂_n^2.
(b) Theorem 2 is applicable provided that the active index sets D̂_j^*, j = 1 and 2, include all active
predictor variables. Here we emphasize that the RCV method can be integrated with any
dimension reduction procedure to effectively correct the bias of the naive error variance estimate,
and does not directly impose conditions on the dimension p_n. In practical implementation, the
assumption that both active index sets include all important variables implies further
conditions on p_n. In particular, the condition log(p_n) = o(n) is necessary for the DC-SIS (Li,
Zhong and Zhu, 2012) to achieve the sure screening property. This condition is also necessary for
other sure screening procedures such as the NIS (Fan, Feng and Song, 2011) to achieve sure
screening property. In Theorems 1 and 2, we have imposed conditions on ŝ, ŝ1 and ŝ2. These
conditions may implicitly require extra conditions on the DC-SIS to ensure that the size of
the subset selected by DC-SIS is of order required by the conditions. For NIS, by Theorem
2 of Fan, Feng and Song (2011), we need to impose some explicit conditions on the signal
strength as well as the growth of the operator norm of the covariance matrix of covariates.
(c) The RCV method can be combined with any feature screening methods such as DC-SIS and
NIS and variable selection methods such as grouped LASSO and grouped SCAD (Xue, 2009)
for ultrahigh dimensional additive models. The NIS method needs to choose a smoothing
parameter for each predictor. The grouped LASSO and the grouped SCAD methods are
expensive in terms of computational cost. We focus only on DC-SIS in the numerical studies
to save space.
(d) For sure independence screening procedures such as the SIS and DC-SIS, the authors recom-
mended setting ŝ = n/log(n). The diverging rates of ŝ, ŝ_1 and ŝ_2 required in Theorems 1 and
2 are slower than this due to the nonparametric nature of the problem. It seems difficult to further
relax the conditions in Theorems 1 and 2. This can be viewed as a limitation of our theoretical
results. From our simulation studies and real data examples, the performance of the naive
method certainly relies on the choice of ŝ, while the RCV method performs well for a wide
range of ŝ1 and ŝ2. As shown in Tables 1 and 2, the resulting estimate of the RCV method is
very close to the oracle estimate across all scenarios in the tables. Theoretical studies on how
to determine ŝ1 and ŝ2 are more related to the topic of feature screening than the variance
estimation, and we do not intend to pursue this further in this paper. In practical implementation,
the choices of these parameters should take into account the degrees of freedom in the
refitting stage so that the residual variance can be estimated with reasonable accuracy. We
would recommend considering several possible choices of ŝ1 and ŝ2 to examine whether the
resulting variance estimate is relatively stable to the choices of ŝ1 and ŝ2. This is implemented
in the real data example in Section 3.2.
3 Numerical studies
In this section, we investigate the finite sample performance of the newly proposed procedures.
We further illustrate the proposed procedure by an empirical analysis of a real data example. In
our numerical studies, we report only results of the proposed RCV method with DC-SIS to save
space, although the NIS method, the grouped LASSO and the grouped SCAD (Xue, 2009) can be
used to screen or select variables. All numerical studies are conducted using Matlab code.
3.1 Monte Carlo simulation
Since there is little work on variance estimation for ultrahigh dimensional nonparametric
additive models, this simulation study is designed to compare the finite sample performance of the
two-stage naive variance estimate and the refitted cross-validation variance estimate. In our simulation
study, data were generated from the following sparse additive model
Y = a( X_1 + 0.75 X_2^2 + 2.25 cos(X_5) ) + ε,   (3.1)

where ε ∼ N(0, 1), and (X_1, · · · , X_p)^T ∼ N_p(0, Σ) with Σ = {ρ_{ij}}_{i,j=1}^p, where ρ_{ii} = 1 and
ρ_{ij} = 0.2 for i ≠ j. We set p = 2000 and n = 600. We take a = 0, 1/√3, and 2/√3 in order to examine
the impact of the signal-to-noise ratio on error variance estimation. When a = 0, the DC-SIS always
picks up the active set and the challenge is to reduce spurious correlation, while when a = 2/√3 the
signal is strong enough to pick up the active set, so the DC-SIS performs very well. The case a = 1/√3
corresponds to a signal-to-noise ratio equal to 1. This is a difficult case for distinguishing signals
from noise and is the most challenging of the three cases considered: the first and third cases easily
achieve sure screening with relatively few selected variables, which reduces the bias of the RCV
method and leaves more degrees of freedom for estimating the residual variance. We deliberately
designed this case to challenge the proposed procedure, as sure screening is harder to achieve.
As a benchmark, we include the oracle estimator in our simulation. Here the oracle estimator
corresponds to the mean squared errors for the fitting of the oracle model that includes only X1,
X_2 and X_5 for a ≠ 0, and includes no predictors when a = 0. In our simulation, we employ
the distance correlation to rank the importance of predictors, and screen out the p − ŝ predictors
with the lowest distance correlations. Thus, the resulting model includes ŝ predictors. We consider
ŝ = 20, 30, 40 and 50 in order to illustrate the impact of the choice of ŝ on the performance of the
naive estimator and the refitted cross-validation estimator.
In our simulation, each function f_j(·) is approximated by a linear combination of an intercept
and 5 cubic B-spline bases with 3 knots equally spaced between the minimum and maximum of
the jth variable. Thus, when ŝ = 50, the reduced model actually has 251 terms, which is nearly half
the sample size. Table 1 reports the average and the standard error of the estimates over the
150 simulations. To get an overall picture of how the error variance estimates change with ŝ, Figure 3
depicts the overall average of the 150 estimates. In Table 1 and Figure 3, 'Oracle' stands for the
oracle estimate based on nonparametric additive models using only the active variables, 'Naive' for
the naive estimate, and 'RCV' for the refitted cross-validation estimate.
Table 1: Simulation results for different ŝ (σ^2 = 1)

  Method     ŝ = 20             ŝ = 30            ŝ = 40            ŝ = 50
  a = 0
  Oracle   1.0042 (0.0618)*   1.0042 (0.0618)   1.0042 (0.0618)   1.0042 (0.0618)
  Naive    0.8048 (0.0558)    0.7549 (0.0589)   0.7138 (0.0584)   0.6771 (0.0584)
  RCV      1.0022 (0.0656)    0.9994 (0.0666)   0.9990 (0.0698)   0.9967 (0.0705)
  a = 1/√3
  Oracle   1.0049 (0.0617)    1.0049 (0.0617)   1.0049 (0.0617)   1.0049 (0.0617)
  Naive    0.9054 (0.0572)    0.8683 (0.0592)   0.8387 (0.0615)   0.8143 (0.0644)
  RCV      1.0704 (0.1300)    1.0493 (0.1187)   1.0374 (0.1095)   1.0273 (0.1106)
  a = 2/√3
  Oracle   1.0072 (0.0618)    1.0072 (0.0618)   1.0072 (0.0618)   1.0072 (0.0618)
  Naive    0.9618 (0.0647)    0.9618 (0.0647)   0.9306 (0.0687)   0.9194 (0.0780)
  RCV      1.0026 (0.0657)    1.0026 (0.0657)   1.0020 (0.0735)   1.0013 (0.0779)

*Values in parentheses are standard errors.
Table 1 and Figure 3 clearly show that the naive two-stage estimator significantly underestimates
the error variance in the presence of many redundant variables. The larger the value of ŝ, the bigger
the spurious correlation γ̂_n^2, and hence the larger the bias of the naive estimate. The performance of
the naive estimate also depends on the signal-to-noise ratio. In general, it performs better when the
[Figure 3: Variance estimators for different signal-to-noise ratios. Panels: (a) a = 0, (b) a = 1/√3, (c) a = 2/√3; each panel plots the Naive, RCV, Oracle and true values of σ̂^2 against the number of selected components ŝ from 20 to 50.]
Table 2: Simulation results with different n (σ² = 1)

  Method    n = 400             n = 600
  a = 0
  Oracle    1.0044 (0.0646)*    0.9924 (0.0575)
  Naive     0.6969 (0.0610)     0.7340 (0.0542)
  RCV       0.9905 (0.0837)     0.9845 (0.0729)
  a = 1/√3
  Oracle    1.0047 (0.0737)     0.9970 (0.0552)
  Naive     0.8390 (0.0815)     0.8533 (0.0555)
  RCV       1.1273 (0.1528)     1.0144 (0.0954)
  a = 2/√3
  Oracle    0.9903 (0.0687)     1.0075 (0.0643)
  Naive     0.9013 (0.0785)     0.9340 (0.0691)
  RCV       1.0241 (0.1886)     1.0031 (0.0780)

  *Values in parentheses are standard errors.
signal-to-noise ratio is large. The RCV estimator performs much better than the naive estimator;
its performance is very close to that of the oracle estimator in all cases listed in Table 1.
In practice, we have to choose a value of ŝ in data analysis. Fan and Lv (2008) suggested ŝ =
[n/log(n)] for their sure independence screening procedure based on Pearson correlation ranking.
We modify their proposal and set ŝ = [n^{4/5}/log(n^{4/5})] to take into account the effective sample size in
nonparametric regression. Table 2 reports the average and the standard error of the estimates over
the 150 simulations when the sample size is n = 400 and 600. The notation in Table 2 is the same
as in Table 1. The results in Table 2 clearly show that RCV performs as well as the oracle
procedure, and outperforms the naive estimate.
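The screening-size rule above is easy to compute. In the sketch below we read [·] as rounding to the nearest integer — an assumption on our part, but one under which n = 464 gives ŝ = 28, the value used in Section 3.2.

```python
import math

def screening_size(n):
    """Reduced model size s_hat = [n^{4/5} / log(n^{4/5})], reading [.]
    as rounding to the nearest integer (an assumption)."""
    m = n ** 0.8          # effective sample size for nonparametric fits
    return round(m / math.log(m))

print(screening_size(400), screening_size(464))
```

With this reading the rule gives 25 for n = 400 and 28 for n = 464.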
We further studied the impact of the random data-splitting strategy on the resulting estimate. As
an alternative, one may repeat the proposed procedure several times, each time randomly splitting the data
into two parts, and then take the average as the estimate of σ². Our findings from the simulation
study are consistent with the discussion in Fan, Guo and Hao (2012): (a) the estimates of σ² for
different numbers of repetitions are almost the same; and (b) as the number of repetitions increases,
the variation reduces slightly, at the price of computational cost. This implies that it is unnecessary
to repeat the proposed procedure several times. As another alternative, one may randomly split
the sample into k groups; the case k = 2 is the RCV method proposed in this
paper. Similarly, we can use the data in one group to select useful predictors, and the data in the other groups to
fit the additive model. We refer to this splitting strategy as multi-fold splitting. Our simulation
results imply that multi-fold splitting leads to (a) less accurate estimates of the coefficients
and (b) increased variation of the σ̂²ₗ used to construct the RCV estimate. This is because this strategy
splits the data into many subsets with even smaller sample sizes. If the sample size n is large, as
is common with Big Data nowadays, it may be worthwhile to try multiple random splits; otherwise we do not recommend
it.
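The two-fold RCV procedure discussed above can be sketched compactly. The sketch below is a simplified stand-in, not the authors' implementation: ordinary least squares replaces the B-spline additive fit, and marginal correlation screening replaces DC-SIS.

```python
import numpy as np

def rcv_variance(X, y, s_hat, rng):
    """Refitted cross-validation (RCV) estimate of the error variance.
    Sketch: OLS stands in for the paper's B-spline additive fit, and
    marginal-correlation screening stands in for DC-SIS."""
    n = X.shape[0]
    idx = rng.permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    sigma2 = []
    for screen, refit in (halves, halves[::-1]):
        # Stage 1: screen s_hat variables on one half of the data.
        corr = np.abs(np.array([np.corrcoef(X[screen, j], y[screen])[0, 1]
                                for j in range(X.shape[1])]))
        keep = np.argsort(corr)[-s_hat:]
        # Stage 2: refit on the *other* half; variables picked spuriously
        # in stage 1 behave like noise here, so the residual variance
        # is not deflated by spurious correlation.
        Z = np.column_stack([np.ones(len(refit)), X[refit][:, keep]])
        beta, *_ = np.linalg.lstsq(Z, y[refit], rcond=None)
        resid = y[refit] - Z @ beta
        sigma2.append(float(resid @ resid) / (len(refit) - Z.shape[1]))
    return 0.5 * (sigma2[0] + sigma2[1])

rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)  # true sigma^2 = 1
est = rcv_variance(X, y, s_hat=25, rng=rng)
```

Averaging the two fold estimates, as in the last line of `rcv_variance`, mirrors the definition of σ̂²_RCV in the paper.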
3.2 A real data example
In this section, we illustrate the proposed procedure with an empirical analysis of a supermarket
data set (Wang, 2009). The data set contains a total of n = 464 daily records of the number of
customers (Yᵢ) and the sale amounts of p = 6,398 products, denoted by Xᵢ₁, · · · , Xᵢₚ, which are
used as predictors. Both the response and the predictors are standardized to have zero
sample mean and unit sample variance. We fit the following additive model in our illustration:
Yᵢ = µ + f₁(Xᵢ₁) + · · · + fₚ(Xᵢₚ) + εᵢ,
where εᵢ is a random error with E(εᵢ) = 0 and var(εᵢ|xᵢ) = σ².
Table 3: Error variance estimates for the supermarket data

  ŝ        40       35       30       28       25
  Naive    0.0866   0.0872   0.0910   0.0938   0.0990
  RCV      0.1245   0.1104   0.1277   0.1340   0.1271
Since the sample size is n = 464, we set ŝ = [n^{4/5}/log(n^{4/5})] = 28. The naive error variance
estimate equals 0.0938, while the RCV error variance estimate equals 0.1340, a 43% increase
in the estimated value once the spurious correlation is reduced. Table 3 reports the resulting
estimates of the error variance for different values of ŝ, and clearly shows that the RCV estimate
of the error variance is stable across the choices of ŝ, while the naive estimate of the error variance
decreases as ŝ increases. This is consistent with our theoretical and simulation results.
Regarding the selected model with ŝ predictors as a correct model and ignoring the approximation
errors (if any) due to the B-spline basis, we further employ the Wald χ²-test of the hypothesis that
(γⱼ₁, · · · , γⱼ_{dⱼ})ᵀ equals zero, namely whether the jth variable is active in the presence of the remaining
variables. Such Wald χ² statistics offer a rough picture of whether Xⱼ is significant or not. The
Wald χ²-test with the naive error variance estimate identifies 12 significant predictors at signifi-
cance level 0.05, while the Wald χ²-test with the RCV error variance estimate identifies seven
significant predictors at the same level. Figure 4 depicts the Q-Q plot of the values of the
χ²-test statistic for the insignificant predictors identified by the Wald test. Figure 4 clearly
shows that the χ²-test values using the naive error variance estimate systematically deviate from the
45-degree line. This implies that the naive method underestimates the error variance,
while the RCV method estimates the error variance well.
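The group Wald statistic described above has a simple generic form. The sketch below illustrates it in a plain linear model — a simplifying assumption; in the paper the coefficient block for variable j is its B-spline coefficient vector (γⱼ₁, · · · , γⱼ_{dⱼ})ᵀ and `sigma2` is the naive or RCV estimate.

```python
import numpy as np

def group_wald(Z, y, groups, sigma2):
    """Wald chi-square statistic for each coefficient block:
    W_j = g_j^T {sigma2 * [(Z^T Z)^{-1}]_jj}^{-1} g_j,
    which is chi-square with dim(g_j) degrees of freedom under
    H0: g_j = 0 (linear-model sketch of the test in the text)."""
    G = np.linalg.inv(Z.T @ Z)
    beta = G @ Z.T @ y
    stats = {}
    for name, idx in groups.items():
        cov = sigma2 * G[np.ix_(idx, idx)]
        g = beta[idx]
        stats[name] = float(g @ np.linalg.solve(cov, g))
    return stats

rng = np.random.default_rng(2)
n = 200
Z = rng.standard_normal((n, 10))
y = Z[:, :5] @ np.ones(5) + rng.standard_normal(n)   # second block inactive
W = group_wald(Z, y, {"active": list(range(5)), "null": list(range(5, 10))},
               sigma2=1.0)
```

Plugging in an underestimated `sigma2` inflates every W_j by the same factor, which is exactly why the naive estimate declares more predictors significant than RCV does.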
Figure 4: Quantile-quantile plot of χ²-test values (theoretical versus observed quantiles). “o” stands for the χ²-test using the naive error variance estimate; “+” stands for the χ²-test using the RCV error variance estimate.
The Wald test at level 0.05 indicates that seven predictors, X₁₁, X₁₃₉, X₃, X₃₉, X₆, X₆₂
and X₄₂, are significant. We refit the data with the additive model in these 7 predictors. The
corresponding mean squared error is 0.1207, which is close to σ̂²_RCV = 0.1340. Note that σ²
is the minimum possible prediction error. It provides a benchmark for other methods to compare
with, and is achievable when modeling bias and estimation errors are negligible.
To see how the selected variables perform in terms of prediction, we further use
leave-one-out cross-validation (CV) and five-fold CV to estimate the mean squared prediction error
(MSPE). The leave-one-out CV yields MSPE = 0.1414, and the average MSPE from
five-fold CV based on 400 random splits of the data is 0.1488, with the 2.5th and
97.5th percentiles being 0.1411 and 0.1626, respectively. The MSPE is slightly greater than σ̂²_RCV.
Figure 5: Estimated functions for the 7 variables (X₁₁, X₁₃₉, X₃, X₃₉, X₆, X₆₂, X₄₂) selected from the 28 variables that survive DC-SIS screening, by the χ²-test with the RCV error variance estimator.
This is expected, as the uncertainty of parameter estimation has not been accounted for. This bias can
be corrected using the theory of linear regression analysis.
Suppose that {xᵢ, Yᵢ}, i = 1, · · · , n, is an independent and identically distributed random sample
from a linear regression model Y = xᵀβ + ε. The linear predictor Ŷ = xᵀβ̂, where β̂ is the least
squares estimate of β, has prediction error at a new observation {x∗, y∗}:
E{(y∗ − x∗ᵀβ̂)² | X} = σ²(1 + x∗ᵀ(XᵀX)⁻¹x∗),
where σ² is the error variance and X is the corresponding design matrix.
This explains why the MSPE is slightly greater than σ̂²_RCV. To further gauge the accuracy of the
RCV estimate of σ², define the weighted prediction error |y∗ − x∗ᵀβ̂|/√(1 + x∗ᵀ(XᵀX)⁻¹x∗). Then the
leave-one-out method leads to a mean squared weighted prediction error (MSWPE) of 0.1289, and
the average over five-fold CV based on 400 random splits of the data yields MSWPE 0.1305, with the
2.5th and 97.5th percentiles being 0.1254 and 0.1366, respectively. These results imply that (a)
the seven selected variables achieve the benchmark prediction; (b) the modeling biases of the
additive model in these seven variables are negligible; and (c) σ̂²_RCV provides a very good estimate for
Figure 6: Estimated functions for the 2 variables (X₁₁ and X₆) selected from the 28 variables that survive DC-SIS screening, by the χ²-test with the RCV error variance estimator and the Bonferroni adjustment.
σ2.
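The leave-one-out quantities above can be computed without any refitting in the linear-model sketch, using the standard identities e_{(i)} = eᵢ/(1 − hᵢᵢ) and xᵢᵀ(X₋ᵢᵀX₋ᵢ)⁻¹xᵢ = hᵢᵢ/(1 − hᵢᵢ), so the weight 1/√(1 + xᵢᵀ(X₋ᵢᵀX₋ᵢ)⁻¹xᵢ) equals √(1 − hᵢᵢ). This is an illustration of the bias-correction idea, not the paper's additive-model computation.

```python
import numpy as np

def loo_errors(X, y):
    """Leave-one-out MSPE and the weighted version (MSWPE) for least
    squares, via the hat-matrix shortcut (no explicit refitting)."""
    Z = np.column_stack([np.ones(len(y)), X])   # intercept + predictors
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)       # hat matrix
    h = np.diag(H)
    loo = (y - H @ y) / (1.0 - h)               # exact LOO residuals
    # weighting by 1/sqrt(1 + x_i^T (X_{-i}^T X_{-i})^{-1} x_i)
    # multiplies the squared LOO residual by (1 - h_ii)
    return np.mean(loo ** 2), np.mean(loo ** 2 * (1.0 - h))

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 5))
y = X @ np.arange(1.0, 6.0) + rng.standard_normal(300)   # sigma^2 = 1
mspe, mswpe = loo_errors(X, y)
```

As in the data analysis, the MSPE exceeds the weighted version, and the MSWPE tracks the error variance itself.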
Their estimated functions f̂ⱼ(xⱼ) are depicted in Figure 5, from which it appears that none of the predictors
shown in Figure 5 is significant, since zero is contained in the entire confidence band. This can be
due to the fact that we have used too many variables, which inflates the variance of the estimates.
We further employ the Wald test with the Bonferroni correction for the 28 null hypotheses. This leads
to only two significant predictors, X₁₁ and X₆, at level 0.05. We refit the data with the two selected
predictors. Figure 6 depicts the plots of f̂₁₁(x₁₁) and f̂₆(x₆).
4 Discussions
In this paper, we proposed an error variance estimator for ultrahigh dimensional additive models
using the refitted cross-validation technique. This is particularly important given the high level
of spurious correlation induced by nonparametric models (see Figure 1 and Lemma 1). We
established the root-n consistency and asymptotic normality of the resulting estimator, and examined
the empirical performance of the proposed estimator by Monte Carlo simulation. We further
demonstrated the proposed methodology via an empirical analysis of supermarket data. The proposed
estimator performs well with a moderate sample size. However, when the sample size is very
small, the refitted cross-validation procedure may be unstable. How to construct an accurate error
variance estimate with a very small sample size is challenging and will be an interesting topic for
future research.
Appendix: Proofs
A.1 Proofs of Lemma 1 and Theorem 1
Let Ψ be the design matrix corresponding to model (2.3). Specifically, Ψ is an n × (p dₙ) matrix
whose ith row is (B₁₁(Xᵢ₁), · · · , B₁dₙ(Xᵢ₁), B₂₁(Xᵢ₂), · · · , B_{p dₙ}(Xᵢₚ)). Denote by Ψ(D̂∗) the
design matrix corresponding to model D̂∗, and by P_{D̂∗} the corresponding projection matrix, that is,
P_{D̂∗} = Ψ(D̂∗)(Ψ(D̂∗)ᵀΨ(D̂∗))⁻¹Ψ(D̂∗)ᵀ. Denote Pᶜ_{D̂∗} = Iₙ − P_{D̂∗}. Without loss of generality, assume
that the first s nonparametric components are nonzero and the others are all zero. By assumption,
all active predictors are retained by the DC-SIS screening procedure. For ease of notation and
without loss of generality, assume that D̂∗ = {1, 2, · · · , ŝ}, where ŝ = |D̂∗|.
Proof of Lemma 1. Note that
\[
\varepsilon^T P_{\widehat D^*}\varepsilon = \varepsilon^T \Psi(\widehat D^*)\big(\Psi(\widehat D^*)^T\Psi(\widehat D^*)\big)^{-1}\Psi(\widehat D^*)^T\varepsilon
\le \lambda_{\min}^{-1}\big(\Psi(\widehat D^*)^T\Psi(\widehat D^*)\big)\,\big\|\Psi(\widehat D^*)^T\varepsilon\big\|_2^2, \tag{A.1}
\]
where λmin(A) stands for the minimal eigenvalue of a matrix A. To show Lemma 1, we need to
derive a bound on the eigenvalues of the matrix Ψ(D̂∗)ᵀΨ(D̂∗). Note that Ψ(D̂∗) = (Ψ₁, · · · , Ψ_ŝ) with
\[
\Psi_j = \begin{pmatrix} B_{j1}(X_{j1}) & \cdots & B_{jd_n}(X_{j1}) \\ \vdots & & \vdots \\ B_{j1}(X_{jn}) & \cdots & B_{jd_n}(X_{jn}) \end{pmatrix},
\qquad j = 1, \cdots, \hat s. \tag{A.2}
\]
Let b = (b₁ᵀ, · · · , b_ŝᵀ)ᵀ with ‖b‖₂² = bᵀb = 1. Then we have Ψ(D̂∗)b = Ψ₁b₁ + · · · + Ψ_ŝb_ŝ. As
shown in Lemma S.5 in the supplemental material of this paper, it follows that
\[
\Big(\frac{1-\delta}{2}\Big)^{\hat s-1}\big(\|\Psi_1 b_1\|_2 + \cdots + \|\Psi_{\hat s} b_{\hat s}\|_2\big)^2
\le \|\Psi_1 b_1 + \cdots + \Psi_{\hat s} b_{\hat s}\|_2^2 = b^T\Psi(\widehat D^*)^T\Psi(\widehat D^*)\, b. \tag{A.3}
\]
This yields
\[
\Big(\frac{1-\delta}{2}\Big)^{\hat s-1}\Big(\sum_{i=1}^{\hat s} b_i^T\Psi_i^T\Psi_i b_i\Big) \le b^T\Psi(\widehat D^*)^T\Psi(\widehat D^*)\, b, \tag{A.4}
\]
since ‖Ψᵢbᵢ‖₂ ≥ 0. Furthermore,
\[
\Big(\frac{1-\delta}{2}\Big)^{\hat s-1}\Big(\sum_{i=1}^{\hat s} b_i^T\Psi_i^T\Psi_i b_i\Big)
\ge \Big(\frac{1-\delta}{2}\Big)^{\hat s-1}\sum_{b_i^T b_i \neq 0} \lambda_{\min}(\Psi_i^T\Psi_i)\, b_i^T b_i.
\]
Recalling Lemma 6.2 of Zhou, Shen and Wolfe (1998), there exist two positive constants C₁ and
C₂ such that, for any 1 ≤ i ≤ ŝ,
\[
C_1 d_n^{-1} n \le \lambda_{\min}(\Psi_i^T\Psi_i) \le \lambda_{\max}(\Psi_i^T\Psi_i) \le C_2 d_n^{-1} n. \tag{A.5}
\]
Thus,
\[
\Big(\frac{1-\delta}{2}\Big)^{\hat s-1}\sum_{b_i^T b_i \neq 0} \lambda_{\min}(\Psi_i^T\Psi_i)\, b_i^T b_i
\ge C_1\Big(\frac{1-\delta}{2}\Big)^{\hat s-1} d_n^{-1} n \sum_{b_i^T b_i \neq 0} b_i^T b_i
= C_1\Big(\frac{1-\delta}{2}\Big)^{\hat s-1} d_n^{-1} n. \tag{A.6}
\]
The last equality is valid because ‖b‖₂² = bᵀb = 1. Combining (A.4) and (A.6), we
have
\[
\lambda_{\min}\big(\Psi(\widehat D^*)^T\Psi(\widehat D^*)\big) \ge C_1\Big(\frac{1-\delta}{2}\Big)^{\hat s-1} d_n^{-1} n. \tag{A.7}
\]
Thus, it follows by using (A.1) that
\[
\varepsilon^T P_{\widehat D^*}\varepsilon \le C_1^{-1}\Big(\frac{2}{1-\delta}\Big)^{\hat s-1} d_n n^{-1}\,\big\|\Psi(\widehat D^*)^T\varepsilon\big\|_2^2. \tag{A.8}
\]
By the notation in (A.2), we have
\[
\Psi_i^T\varepsilon = \Big(\sum_{k=1}^n B_{i1}(X_{ik})\varepsilon_k,\ \sum_{k=1}^n B_{i2}(X_{ik})\varepsilon_k,\ \cdots,\ \sum_{k=1}^n B_{id_n}(X_{ik})\varepsilon_k\Big)^T. \tag{A.9}
\]
Recalling that 0 ≤ Bᵢⱼ(·) ≤ 1 for any i, j, and that E|Bᵢⱼ(Xᵢₖ)|² ≤ C₄dₙ⁻¹ (Stone, 1985), we note
that for m ≥ 2, E|Bᵢⱼ(Xᵢₖ)|ᵐ ≤ E|Bᵢⱼ(Xᵢₖ)|² ≤ C₄dₙ⁻¹. Observe that, using Condition (C1),
for any integers i and j,
\[
E|B_{ij}(X_{ik})\varepsilon_k|^m = E|B_{ij}(X_{ik})|^m \cdot E|\varepsilon_k|^m \le E|B_{ij}(X_{ik})|^m\, E\big(m!\, a^m \exp\{|\varepsilon_1|/a\}\big). \tag{A.10}
\]
Taking A₁ = 1/a and A₂ = b in Condition (C1), the right-hand side of the above
inequality does not exceed
\[
C_4\, m!\, a^m d_n^{-1}\, E\big(\exp\{|\varepsilon_1|/a\}\big) \le \frac{m!}{2}\big(2 C_4 d_n^{-1} b\, a^2\big)\, a^{m-2}. \tag{A.11}
\]
Using Bernstein’s inequality (see Lemma 2.2.11 of Van der Vaart and Wellner, 1996), we have
\[
P\Big(\max_{1\le i\le p,\,1\le j\le d_n}\Big|\sum_{k=1}^n B_{ij}(X_{ik})\varepsilon_k\Big| \ge M\Big)
\le \sum_{i=1}^p\sum_{j=1}^{d_n} P\Big(\Big|\sum_{k=1}^n B_{ij}(X_{ik})\varepsilon_k\Big| \ge M\Big)
\le 2\, p\, d_n \exp\Big\{-\frac{M^2}{2(2 d_n^{-1} b\, a^2 n + a M)}\Big\}
\]
\[
= 2\exp\Big\{\log(p\, d_n)\Big(1 - \frac{1}{4\log(p\, d_n)\, n\, d_n^{-1} b\, a^2 M^{-2} + 2\log(p\, d_n)\, a M^{-1}}\Big)\Big\}.
\]
When we take M = C₅√(n log(p dₙ)/dₙ), with dₙ log(p dₙ)/n → 0 and sufficiently large C₅, the exponent
in the last expression goes to negative infinity. Thus, with probability approaching one, we have
\[
\max_{1\le i\le p,\,1\le j\le d_n}\Big|\sum_{k=1}^n B_{ij}(X_{ik})\varepsilon_k\Big| \le C_5\sqrt{n\log(p\, d_n)/d_n}
\]
and
\[
\varepsilon^T P_{\widehat D^*}\varepsilon \le C_1^{-1}\Big(\frac{2}{1-\delta}\Big)^{\hat s-1} d_n n^{-1}\big\|\Psi(\widehat D^*)^T\varepsilon\big\|_2^2
\le C_5^2 C_1^{-1}\Big(\frac{2}{1-\delta}\Big)^{\hat s-1}\hat s\, d_n \log(p\, d_n). \tag{A.12}
\]
Since the random errors are independent and identically distributed with mean 0 and variance σ²,
the law of large numbers gives
\[
\frac{1}{n}\sum_{i=1}^n \varepsilon_i \xrightarrow{a.s.} 0, \qquad \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 \xrightarrow{a.s.} \sigma^2. \tag{A.13}
\]
Thus, we obtain that
\[
\hat\gamma_n^2 = \frac{\varepsilon^T P_{\widehat D^*}\varepsilon}{\varepsilon^T\varepsilon}
= O_p\Big\{\Big(\frac{2}{1-\delta}\Big)^{\hat s-1}\frac{\hat s\, d_n \log(p\, d_n)}{n}\Big\}. \tag{A.14}
\]
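The size of the spurious correlation γ̂²ₙ bounded above is easy to reproduce numerically. The sketch below is illustrative only: linear columns stand in for the spline bases and marginal-correlation screening stands in for DC-SIS.

```python
import numpy as np

# Numerical illustration of the spurious correlation gamma_n^2:
# regress pure noise on the s_hat most (spuriously) correlated columns.
rng = np.random.default_rng(4)
n, p, s_hat = 100, 2000, 20
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)                  # no signal at all
keep = np.argsort(np.abs(X.T @ eps))[-s_hat:]
Z = X[:, keep]
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)         # projection onto selection
gamma2 = float(eps @ P @ eps) / float(eps @ eps)   # spurious R^2
print(f"spurious gamma^2 = {gamma2:.2f}")
```

Even though the response is pure noise, the selected columns explain a substantial fraction of its variance, which is exactly the mechanism that deflates the naive variance estimate.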
Proof of Theorem 1. Note that
\[
\hat\sigma^2_{\widehat D^*} = \frac{1}{n-\hat s d_n}\Big\{\sum_{j=1}^{\hat s} f_j^T(X_j)\, P^c_{\widehat D^*}\sum_{j=1}^{\hat s} f_j(X_j)
+ 2\,\varepsilon^T P^c_{\widehat D^*}\sum_{j=1}^{\hat s} f_j(X_j) + \varepsilon^T P^c_{\widehat D^*}\varepsilon\Big\},
\]
where fⱼ(Xⱼ) = (fⱼ(Xⱼ₁), · · · , fⱼ(Xⱼₙ))ᵀ, j = 1, · · · , p. To simplify the first term in σ̂²_{D̂∗}, let
∆₁ = Σ_{j=1}^{ŝ} fⱼᵀ(Xⱼ) Pᶜ_{D̂∗} Σ_{j=1}^{ŝ} fⱼ(Xⱼ). Then
\[
\Delta_1 = \Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j) + \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}^T
P^c_{\widehat D^*}\Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j) + \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\},
\]
where f_{nj}(Xⱼ) = (f_{nj}(Xⱼ₁), · · · , f_{nj}(Xⱼₙ))ᵀ = (Bⱼ(Xⱼ₁)ᵀΓⱼ, · · · , Bⱼ(Xⱼₙ)ᵀΓⱼ)ᵀ, j = 1, · · · , p.
Define
\[
\Delta_{11} = \Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}^T P^c_{\widehat D^*}\Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\},
\]
\[
\Delta_{12} = \Big\{\sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}^T P^c_{\widehat D^*}\Big\{\sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\},
\]
\[
\Delta_{13} = 2\Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}^T P^c_{\widehat D^*}\Big\{\sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}.
\]
Then ∆₁ = ∆₁₁ + ∆₁₂ + ∆₁₃. Note that Pᶜ_{D̂∗} is the projection matrix onto the orthogonal complement of the
column space of Ψ(D̂∗), and therefore Pᶜ_{D̂∗}{Σ_{j=1}^{ŝ} f_{nj}(Xⱼ)} = 0. Thus, both ∆₁₂ and ∆₁₃ equal 0.
We next calculate the order of ∆₁₁. By the approximation property of B-splines (Stone, 1985), there exists a
constant c₁ > 0 such that ‖fⱼ − f_{nj}‖² ≤ c₁dₙ⁻²ᵈ. Since Pᶜ_{D̂∗} is a projection matrix, its eigenvalues
equal either 0 or 1. By the Cauchy–Schwarz inequality and some straightforward calculation, it
follows that ∆₁₁ = Oₚ(ŝ²ndₙ⁻²ᵈ). Therefore ∆₁ = Oₚ(ŝ²ndₙ⁻²ᵈ). Under the conditions of Theorem
1(i), Oₚ(ŝ²dₙ⁻²ᵈ) = oₚ(1), and as a result ∆₁ = oₚ(n). Under the conditions of Theorem 1(ii), ŝ =
o(n^{(2d−1)/4(2d+1)}) and therefore ∆₁ = oₚ(√n).
Now we deal with the second term in σ̂²_{D̂∗}. Denote ∆₂ = 2εᵀPᶜ_{D̂∗} Σ_{j=1}^{ŝ} fⱼ(Xⱼ). Since
Pᶜ_{D̂∗}{Σ_{j=1}^{ŝ} f_{nj}(Xⱼ)} = 0, it follows that
\[
\Delta_2 = 2\,\varepsilon^T P^c_{\widehat D^*}\Big\{\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\}.
\]
Denote ∆₂₁ = Σ_{j=1}^{ŝ} Σ_{i=1}^{n} (fⱼ(Xⱼᵢ) − f_{nj}(Xⱼᵢ))εᵢ and ∆₂₂ = (εᵀP_{D̂∗})(Σ_{j=1}^{ŝ} fⱼ(Xⱼ) − Σ_{j=1}^{ŝ} f_{nj}(Xⱼ)).
Thus, ∆₂ = 2(∆₂₁ − ∆₂₂). To deal with ∆₂₁, we bound the tails of (fⱼ(Xⱼᵢ) − f_{nj}(Xⱼᵢ))εᵢ, i =
1, · · · , n, j = 1, · · · , ŝ. For any m ≥ 2, because fⱼ ∈ C^d([a, b]) and f_{nj} belongs to the spline space
S^l([a, b]), we have
\[
E\big|(f_j(X_{ji}) - f_{nj}(X_{ji}))\varepsilon_i\big|^m = E\Big(|f_j(X_{ji}) - f_{nj}(X_{ji})|^m\, E\big(|\varepsilon_i|^m \mid x_i\big)\Big),
\]
which is bounded by C₆^{m−2} E(|fⱼ(Xⱼᵢ) − f_{nj}(Xⱼᵢ)|² E(|εᵢ|ᵐ | xᵢ)) for some constant C₆, by the
approximation property of B-splines. There exists a constant c₁ > 0 such that ‖fⱼ − f_{nj}‖² ≤ c₁dₙ⁻²ᵈ
by the property of B-splines (Stone, 1985). Applying Condition (C1) to E{exp(A₁|εᵢ|) | xᵢ}, it
follows that
\[
E\big|(f_j(X_{ji}) - f_{nj}(X_{ji}))\varepsilon_i\big|^m \le m!\Big(\frac{C_6}{A_1}\Big)^{m-2}\frac{A_2}{A_1^2}\, c_1 d_n^{-2d}.
\]
Denote C₇ = c₁A₂/A₁² and C₈ = C₆/A₁. Using Bernstein’s inequality, for some M we have
\[
P\Big(\max_{1\le j\le p}\Big|\sum_{i=1}^n (f_j(X_{ji}) - f_{nj}(X_{ji}))\varepsilon_i\Big| > M\Big)
\le 2\, p \exp\Big\{-\frac{1}{2}\,\frac{M^2}{2 C_7\, n d_n^{-2d} + C_8 M}\Big\}. \tag{A.15}
\]
If we take M = C₉√(log(p) n dₙ⁻²ᵈ) with sufficiently large C₉, then the tail probability in (A.15)
goes to zero. Thus,
\[
\Delta_{21} = O_p\Big(\hat s\sqrt{\log(p)\, n\, d_n^{-2d}}\Big). \tag{A.16}
\]
Under the condition of Theorem 1(i), ŝ = o(n^{(4d+1)/2(2d+1)}) with ζ < 1, so Oₚ(ŝdₙ⁻ᵈ√(log(p dₙ))) =
oₚ(√n). Following arguments similar to those for ∆₁₁, it follows that ∆₂₁ = oₚ(n). Under the
condition of Theorem 1(ii), ŝ = o(n^{d/(2d+1)−ζ/2}) with ζ < 3/(2d + 1), and thus ∆₂₁ = oₚ(√n). By the
Cauchy–Schwarz inequality, it follows from Lemma 1 that
\[
\Delta_{22} \le \big\|\varepsilon^T P_{\widehat D^*}\big\|_2 \cdot \Big\|\sum_{j=1}^{\hat s} f_j(X_j) - \sum_{j=1}^{\hat s} f_{nj}(X_j)\Big\|_2
= O_p\Big(\Big(\frac{2}{1-\delta}\Big)^{\hat s}\sqrt{d_n\log(p\, d_n)}\Big)\cdot O_p\big(\hat s\, n^{1/2} d_n^{-d}\big)
= O_p\Big(\Big(\frac{2}{1-\delta}\Big)^{\hat s}\hat s\, n^{1/2}\sqrt{\log(p\, d_n)}\, d_n^{-d+1/2}\Big).
\]
When ζ < 4d/(2d + 1) and ŝ = Oₚ(log(n^α)), α ≤ 4d/(2d + 1) − ζ, it follows that ∆₂₂ = oₚ(n) under
the condition of Theorem 1(i). When ζ < (2d − 1)/(2(2d + 1)) and ŝ = log(n^α), α ≤ (2d − 1)/(2(2d +
1)) − ζ, we have (2/(1 − δ))^{ŝ} ŝ n^{1/2}√(log(p dₙ)) dₙ^{−d+1/2} = oₚ(√n); thus ∆₂₂ = oₚ(√n) under the
condition of Theorem 1(ii). Comparing the orders of ∆₁₁, ∆₂₁ and ∆₂₂, we obtain the order of ŝ in Theorem 1.
Therefore, we have
\[
Y^T\Big(I_n - \Psi(\widehat D^*)\big(\Psi(\widehat D^*)^T\Psi(\widehat D^*)\big)^{-1}\Psi(\widehat D^*)^T\Big)Y
= \varepsilon^T(I_n - P_{\widehat D^*})\varepsilon + \Delta_1 + \Delta_2
= \varepsilon^T(I_n - P_{\widehat D^*})\varepsilon + O_p\big(\hat s^2 n d_n^{-2d}\big) + O_p\Big(\hat s\sqrt{\log(p)\, n\, d_n^{-2d}}\Big) + \Delta_{22},
\]
and it follows by the definition of γ̂²ₙ that
\[
\hat\sigma^2_{\widehat D^*} = \frac{1}{n-\hat s d_n}\, Y^T\Big(I_n - \Psi(\widehat D^*)\big(\Psi(\widehat D^*)^T\Psi(\widehat D^*)\big)^{-1}\Psi(\widehat D^*)^T\Big)Y
= \frac{1}{n-\hat s d_n}\,\varepsilon^T\varepsilon\,\big(1-\hat\gamma_n^2\big)
+ O_p\Big(\frac{\hat s^2 n d_n^{-2d}}{n-\hat s d_n}\Big)
+ O_p\Big(\frac{\sqrt{\log(p)\,\hat s^2\, n\, d_n^{-2d}}}{n-\hat s d_n}\Big)
+ \frac{\Delta_{22}}{n-\hat s d_n}.
\]
Since ŝdₙ = oₚ(n) and lim sup γ̂²ₙ < 1, we have
\[
\frac{\hat\sigma^2_{\widehat D^*}}{1-\hat\gamma_n^2}
= \frac{1}{n-\hat s d_n}\,\varepsilon^T\varepsilon
+ O_p\big(\hat s^2 d_n^{-2d}\big)
+ O_p\Big(\sqrt{\log(p)}\,\hat s\, n^{-1/2} d_n^{-d}\Big)
+ O_p\Big(\frac{\Delta_{22}}{n}\Big). \tag{A.17}
\]
Under the conditions of Theorem 1(i), the smaller-order terms in (A.17) are bounded by oₚ(1). We have
\[
\frac{\hat\sigma^2_{\widehat D^*}}{1-\hat\gamma_n^2} \xrightarrow{p} \sigma^2. \tag{A.18}
\]
To establish the asymptotic normality, we study the asymptotic bias of the estimator. By
the central limit theorem, it follows that
\[
\frac{1}{\sqrt n}\sum_{i=1}^n \big(\varepsilon_i^2 - \sigma^2\big) \xrightarrow{L} N\big(0,\ E\varepsilon_1^4 - \sigma^4\big). \tag{A.19}
\]
Note that under the conditions of Theorem 1(ii), the smaller-order terms in (A.17) are bounded by oₚ(n^{−1/2}).
Therefore, the asymptotic normality holds.
A.2 Proof of Theorem 2.
Define the events A_{n1} = {D∗ ⊂ D̂∗₁}, A_{n2} = {D∗ ⊂ D̂∗₂} and Aₙ = A_{n1} ∩ A_{n2}. Unless specifically
mentioned, our analysis and calculations are carried out on the event Aₙ.
Let Ψ(D̂∗₁) be the design matrix corresponding to D̂∗₁, and let P_{D̂∗₁} = Ψ(D̂∗₁)(Ψ(D̂∗₁)ᵀΨ(D̂∗₁))⁻¹Ψ(D̂∗₁)ᵀ,
and Pᶜ_{D̂∗₁} = I − P_{D̂∗₁}. Note that Pᶜ_{D̂∗₁}(Σ_{j=1}^{ŝ₁} f_{nj}(Xⱼ⁽²⁾)) = 0. Thus,
\[
(n/2 - \hat s_1 d_n)\,\hat\sigma^2_{\widehat D^*_1}
= \varepsilon^{(2)T} P^c_{\widehat D^*_1}\varepsilon^{(2)}
+ \Big\{\sum_{j=1}^{\hat s_1} f_j(X_j^{(2)}) - \sum_{j=1}^{\hat s_1} f_{nj}(X_j^{(2)})\Big\}^T
P^c_{\widehat D^*_1}\Big\{\sum_{j=1}^{\hat s_1} f_j(X_j^{(2)}) - \sum_{j=1}^{\hat s_1} f_{nj}(X_j^{(2)})\Big\}.
\]
By the same argument as in the proof of Theorem 1, the second term in the above equation
is of the order Oₚ(ŝ₁²ndₙ⁻²ᵈ). Thus,
\[
(n/2 - \hat s_1 d_n)\big(\hat\sigma^2_{\widehat D^*_1} - \sigma^2\big)
= \Big(\varepsilon^{(2)T}\varepsilon^{(2)} - \frac n2\sigma^2\Big)
- \Big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)} - \hat s_1 d_n\sigma^2\Big)
+ O_p\big(\hat s_1^2\, n d_n^{-2d}\big).
\]
We next calculate the order of (ε⁽²⁾ᵀP_{D̂∗₁}ε⁽²⁾ − ŝ₁dₙσ²). Note that
\[
E\Big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)} - \sigma^2\,\mathrm{tr}(P_{\widehat D^*_1})\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big) = 0.
\]
We now calculate its variance:
\[
\mathrm{Var}\Big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)} - \sigma^2\,\mathrm{tr}(P_{\widehat D^*_1})\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big)
= E\Big(\big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)}\big)^2\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big) - \sigma^4\,\mathrm{tr}^2(P_{\widehat D^*_1}). \tag{A.20}
\]
Denote by Pᵢⱼ the (i, j)th entry of the matrix P_{D̂∗₁}. The first term on the right-hand side of the last
equation can be written as
\[
E\Big(\sum_{i,j,k,l}\varepsilon_i\varepsilon_j\varepsilon_k\varepsilon_l\, P_{ij}P_{kl}\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big).
\]
It follows by the independence between X and ε that
\[
E\Big(\big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)}\big)^2\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big)
= E\varepsilon_1^4\sum_{i=1}^{n/2} P_{ii}^2 + \sigma^4\sum_{i\ne j} P_{ii}P_{jj} + 2\sigma^4\sum_{i\ne j} P_{ij}^2.
\]
Therefore, it follows that (A.20) equals
\[
E\varepsilon_1^4\sum_{i=1}^{n/2} P_{ii}^2 + \sigma^4\sum_{i\ne j} P_{ii}P_{jj} + 2\sigma^4\sum_{i\ne j} P_{ij}^2 - \sigma^4\Big(\sum_{i=1}^{n/2} P_{ii}\Big)^2
= E\varepsilon_1^4\sum_{i=1}^{n/2} P_{ii}^2 + 2\sigma^4\sum_{i\ne j} P_{ij}^2 - \sigma^4\sum_{i=1}^{n/2} P_{ii}^2.
\]
Noting the fact that σ⁴ = (Eε²)² ≤ Eε⁴, the last expression is bounded by
\[
E\varepsilon_1^4\sum_{i=1}^{n/2} P_{ii}^2 - \sigma^4\sum_{i=1}^{n/2} P_{ii}^2 + \sigma^4\sum_{i\ne j} P_{ij}^2 + E\varepsilon_1^4\sum_{i\ne j} P_{ij}^2
= \big(E\varepsilon_1^4 + \sigma^4\big)\sum_{i=1}^{n/2}\sum_{j=1}^{n/2} P_{ij}^2 - 2\sigma^4\sum_{i=1}^{n/2} P_{ii}^2. \tag{A.21}
\]
Note that
\[
\mathrm{tr}\big(P_{\widehat D^*_1}^2\big) = \mathrm{tr}\big(P_{\widehat D^*_1}^T P_{\widehat D^*_1}\big) = \sum_{i=1}^{n/2}\sum_{j=1}^{n/2} P_{ij}^2,
\qquad
\mathrm{tr}\big(P_{\widehat D^*_1}\big) = \mathrm{tr}\big(P_{\widehat D^*_1}^2\big) = \sum_{i=1}^{n/2} P_{ii},
\]
\[
\mathrm{tr}^2\big(P_{\widehat D^*_1}\big) = \Big(\mathrm{tr}\big(P_{\widehat D^*_1}\big)\Big)^2 = \sum_{i=1}^{n/2} P_{ii}^2 + \sum_{i\ne j} P_{ii}P_{jj},
\]
and that tr²(P_{D̂∗₁}) = (Σ_{i=1}^{n/2} Pᵢᵢ)² ≤ n Σ_{i=1}^{n/2} Pᵢᵢ². It follows that
\[
\mathrm{Var}\Big(\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)} - \sigma^2\,\mathrm{tr}(P_{\widehat D^*_1})\ \Big|\ X^{(2)}_{\widehat D^*_1}\Big)
\le \big(E\varepsilon_1^4 + \sigma^4\big)\,\mathrm{tr}\big(P_{\widehat D^*_1}\big) - \frac{2\sigma^4}{n}\,\mathrm{tr}^2\big(P_{\widehat D^*_1}\big)
\le \big(E\varepsilon_1^4 + \sigma^4\big)\,\hat s_1 d_n,
\]
since for the projection matrix P_{D̂∗₁}, tr(P_{D̂∗₁}) = ŝ₁dₙ. Consequently, by Markov’s inequality, we
obtain
\[
\varepsilon^{(2)T} P_{\widehat D^*_1}\varepsilon^{(2)} - \sigma^2\,\hat s_1 d_n = O_p\big(\sqrt{\hat s_1 d_n}\big). \tag{A.22}
\]
Therefore, we have
\[
\Big(\frac n2 - \hat s_1 d_n\Big)\big(\hat\sigma^2_{\widehat D^*_1} - \sigma^2\big)
= \varepsilon^{(2)T}\varepsilon^{(2)} - \frac n2\sigma^2 + O_p\big(\sqrt{\hat s_1 d_n}\big) + O_p\big(\hat s_1^2\, n d_n^{-2d}\big).
\]
Similarly, it follows that
\[
\Big(\frac n2 - \hat s_2 d_n\Big)\big(\hat\sigma^2_{\widehat D^*_2} - \sigma^2\big)
= \varepsilon^{(1)T}\varepsilon^{(1)} - \frac n2\sigma^2 + O_p\big(\sqrt{\hat s_2 d_n}\big) + O_p\big(\hat s_2^2\, n d_n^{-2d}\big).
\]
Finally, we deal with √n(σ̂²_RCV − σ²). Take ŝ₁ = o(n^{(2d−1)/4(2d+1)}) and ŝ₂ = o(n^{(2d−1)/4(2d+1)}) so
that n/(n − 2ŝ₁dₙ) = 1 + oₚ(1) and n/(n − 2ŝ₂dₙ) = 1 + oₚ(1). Then
\[
\sqrt n\big(\hat\sigma^2_{\mathrm{RCV}} - \sigma^2\big)
= \frac{\sqrt n}{n - 2\hat s_1 d_n}\Big(\varepsilon^{(2)T}\varepsilon^{(2)} - \frac n2\sigma^2 + O_p\big(\sqrt{\hat s_1 d_n}\big) + O_p\big(\hat s_1^2\, n d_n^{-2d}\big)\Big)
\]
\[
\quad + \frac{\sqrt n}{n - 2\hat s_2 d_n}\Big(\varepsilon^{(1)T}\varepsilon^{(1)} - \frac n2\sigma^2 + O_p\big(\sqrt{\hat s_2 d_n}\big) + O_p\big(\hat s_2^2\, n d_n^{-2d}\big)\Big)
\]
\[
= \Big\{\frac{1}{\sqrt n}\sum_{i=1}^n\big(\varepsilon_i^2 - \sigma^2\big)\Big\}\{1 + o_p(1)\} + o_p(1)
\xrightarrow{L} N\big(0,\ E\varepsilon_1^4 - \sigma^4\big), \quad \text{as } n\to\infty.
\]
This completes the proof of Theorem 2.
References
[1] Bach, F. R. (2008), “Consistency of the group lasso and multiple kernel learning,” The
Journal of Machine Learning Research, 9: 1179-1225.
[2] Bühlmann, P. and Van de Geer, S. (2011). Statistics for High-Dimensional Data, Springer,
Berlin.
[3] De Boor, C. (1978), A Practical Guide to Splines, Vol. 27, New York: Springer-Verlag.
[4] Donoho, D. (2000), “High-dimensional data analysis: The curses and blessings of dimen-
sionality,” AMS Math Challenges Lecture, 1-32.
[5] Fan, J., Feng, Y. and Song, R. (2011), “Nonparametric independence screening in sparse
ultra-high-dimensional additive models,” Journal of the American Statistical Association,
106, 544 - 557.
[6] Fan, J., Guo, S. and Hao, N. (2012), “Variance estimation using refitted cross-validation
in ultrahigh dimensional regression,” Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 74(1), 37-65.
[7] Fan, J. and Li, R. (2001), “Variable selection via nonconcave penalized likelihood and its
oracle properties,” Journal of the American Statistical Association, 96, 1348-1360.
[8] Fan, J. and Li, R. (2006), “Statistical Challenges with High Dimensionality: Feature Selec-
tion in Knowledge Discovery,” Proceedings of the International Congress of Mathematicians
(M. Sanz-Sole, J. Soria, J.L. Varona, J. Verdera, eds.), Vol. III, European Mathematical
Society, Zurich, 595-622.
[9] Fan, J. and Lv, J. (2008), “Sure independence screening for ultrahigh dimensional feature
space,” Journal of the Royal Statistical Society: Series B, 70, 849-911.
[10] Fan, J. and Lv, J. (2010), “A selective overview of variable selection in high dimensional
feature space,” Statistica Sinica, 20, 101-148.
[11] Friedman, J. and Stuetzle, W. (1981), “Projection pursuit regression,” Journal of the Amer-
ican Statistical Association, 76, 817-823.
[12] Hall, P. and Miller, H., (2009), “Using generalized correlation to effect variable selection in
very high dimensional problems,” Journal of Computational and Graphical Statistics, 18,
533-550.
[13] Huang, J., Horowitz, J. L. and Wei, F. (2010), “Variable selection in nonparametric additive
models,” Annals of Statistics, 38, 2282 - 2313.
[14] Li, R., Zhong, W. and Zhu, L. (2012), “Feature screening via distance correlation learning,”
Journal of the American Statistical Association, 107, 1129-1139.
[15] Lin, Y. and Zhang, H. H. (2006), “Component selection and smoothing in multivariate
nonparametric regression,” The Annals of Statistics, 34, 2272-2297.
[16] Meier, L., Van de Geer, S., and Bühlmann, P. (2009), “High-dimensional additive modeling,”
The Annals of Statistics, 37, 3779-3821.
[17] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009), “Sparse additive models,”
Journal of the Royal Statistical Society: Series B, 71, 1009-1030.
[18] Stone, C. J. (1985), “Additive regression and other nonparametric models,” The Annals of
Statistics, 13, 689-705.
[19] Van Der Vaart, A. W. and Wellner, J. A. (1996), Weak Convergence and Empirical Pro-
cesses, Springer.
[20] Wang, H. (2009), “Forward regression for ultra-high dimensional variable screening,” Jour-
nal of the American Statistical Association, 104(488).
[21] Xue, L., (2009) “Consistent variable selection in additive models,” Statistica Sinica, 19,
1281 - 1296.
[22] Zhou, S., Shen, X., and Wolfe, D. A., (1998), “Local asymptotics for regression splines and
confidence regions,” The Annals of Statistics, 26, 1760-1782.
Supplemental Material
In order to give a rigorous proof of (A.3), in this supplemental material we extend Lemmas 1 and 3 of Stone (1985)
from centered second moments to (uncentered) second moments. It is worth
pointing out that (A.3) is a direct application of Lemma S.5 below. Notation for constants in this
supplement may have a different meaning from that in the main text.
Lemma S.1. Let Vⱼ = h(Xⱼ), j = 1, · · · , p, be random variables with finite second moments.
Then it follows that
\[
E\Big(\sum_{j=1}^k V_j\Big)^2 \ge \Big(\frac{1-\delta}{2}\Big)^{k-1}\Big(\sum_{j=1}^k\sqrt{E V_j^2}\Big)^2, \tag{S.1}
\]
where δ = (1 − b²B⁻²ζ)^{1/2} for some constant 0 < ζ < 1.
Let SD stand for the standard deviation. Lemma 1 of Stone (1985) shows that
\[
\mathrm{SD}\Big(\sum_{j=1}^k V_j\Big) \ge \Big(\frac{1-\delta^*}{2}\Big)^{(k-1)/2}\Big(\sum_{j=1}^k \mathrm{SD}(V_j)\Big), \tag{S.2}
\]
where δ∗ = (1 − bB⁻¹)^{1/2} < 1. Thus, Lemma S.1 is an extension of Lemma 1 of Stone (1985) in
the sense that SD² (i.e., the centered second moment) is replaced by the second moment. This
extension is needed because we cannot assume that the average of the elements in every column of Ψ
equals 0, due to the construction of the B-spline bases.
Proof of Lemma S.1. Similar to the proof of Lemma 1 in Stone (1985), we prove Lemma S.1
by induction. When k = 1, (S.1) is valid. Suppose that (S.1) holds for k; we show that the
inequality holds for k + 1. Let τ²ₖ = E(Σ_{j=1}^k Vⱼ)² and m²ₖ = EV²ₖ. If τ²ₖ = 0, then m²₁ = · · · = m²ₖ = 0
and (S.1) is valid. If m²_{k+1} = 0, (S.1) follows from the case k. Next consider the case that τ²ₖ > 0 and
m²_{k+1} > 0. Set X = 1ᵀV = 1ᵀ(V₁, · · · , Vₖ)ᵀ and Y = V_{k+1}. Consider the optimization problem
\[
\min_\beta E(Y - \beta X)^2 = E Y^2\Big(1 - \frac{(E XY)^2}{E X^2\, E Y^2}\Big) = m_{k+1}^2\big(1 - \eta^2\big), \tag{S.3}
\]
where η = EXY/√(EX²EY²). By Condition (C3), b² < b < g_{V₁,··· ,V_{k+1}}(v₁, · · · , v_{k+1}) < B when
b < 1, and g_V(v) < B, g_{V_{k+1}}(v_{k+1}) < B. On the other hand, we also have
\[
\min_\beta E(Y - \beta X)^2
= \min_\beta \int (y - \beta x)^2 g_{V_1,\cdots,V_{k+1}}(v_1,\cdots,v_{k+1})\, dv_1\cdots dv_{k+1}
> b^2 B^{-2}\min_\beta \int\!\!\int (y - \beta x)^2 g_{V}(v)\, g_{V_{k+1}}(v_{k+1})\, dv\, dv_{k+1}
\ge b^2 B^{-2} m_{k+1}^2\Big(1 - \frac{(E X)^2 (E Y)^2}{E X^2\, E Y^2}\Big). \tag{S.4}
\]
Because 0 ≤ (EX)²(EY)²/(EX²EY²) ≤ 1 and both X and Y are not degenerate, there exists a positive constant ζ₀ < 1 such that
\[
1 - \frac{(E X)^2 (E Y)^2}{E X^2\, E Y^2} \ge \zeta_0 > 0.
\]
Thus, η² ≤ 1 − b²B⁻²ζ₀ and hence η ≥ −δ. Consequently,
\[
\begin{aligned}
\tau_{k+1}^2 &= \tau_k^2 + 2\eta\,\tau_k m_{k+1} + m_{k+1}^2\\
&= \Big(\frac{1+\eta}{2} + \frac{1-\eta}{2}\Big)\big(\tau_k^2 + m_{k+1}^2\big) + 2\eta\,\tau_k m_{k+1}\\
&\ge \frac{1+\eta}{2}\big(\tau_k^2 + m_{k+1}^2\big) + \frac{1-\eta}{2}\cdot 2\tau_k m_{k+1} + 2\eta\,\tau_k m_{k+1}\\
&= \frac{1+\eta}{2}\big(\tau_k + m_{k+1}\big)^2\\
&\ge \frac{1-\delta}{2}\Big(\Big(\frac{1-\delta}{2}\Big)^{(k-1)/2}\sum_{j=1}^k m_j + m_{k+1}\Big)^2\\
&\ge \Big(\frac{1-\delta}{2}\Big)^{k}\Big(\sum_{j=1}^{k+1} m_j\Big)^2. \tag{S.5}
\end{aligned}
\]
Thus, (S.1) holds for k + 1, and by induction the result holds for any integer k. If m²ₖ is bounded,
the inequality still holds as k goes to infinity.
The following two lemmas due to Stone (1985) will be used in the proof of Lemma S.5.
Lemma S.2. Let X₁, · · · , Xₙ be independent and identically distributed samples from a population
satisfying Condition (C3), and let Pₙ be the empirical distribution of X₁, · · · , Xₙ. Suppose dₙ²ᵈ = o(n^{1−a})
for some a > 0, and let ε > 0. Then, except on an event with probability tending to zero,
\[
|P_n(X_j \in I_{n\nu}) - P(X_j \in I_{n\nu})| \le \varepsilon\, P(X_j \in I_{n\nu}), \quad 1\le j\le p,\ 1\le \nu\le d_n,
\]
\[
|P_n(X_{j_1}\in I_{n\nu_1}, X_{j_2}\in I_{n\nu_2}) - P(X_{j_1}\in I_{n\nu_1}, X_{j_2}\in I_{n\nu_2})| \le \varepsilon\, P(X_{j_1}\in I_{n\nu_1}, X_{j_2}\in I_{n\nu_2}),
\]
for 1 ≤ j₁, j₂ ≤ p, j₁ ≠ j₂, 1 ≤ ν₁, ν₂ ≤ dₙ.
Proof. This is indeed Lemma 2 of Stone (1985).
Lemma S.3. Let I be an interval of finite positive length d_I, and let (T, U) be a pair of
I-valued random variables with an absolutely continuous distribution. Let Eₙ be the expectation
operator corresponding to the empirical distribution based on a random sample of size n. Suppose
that the marginal densities of T and U are bounded below by β/d_I on I, where β > 0. For t > 0,
except on an event with probability at most dₖ exp{−2nt²}, the following inequalities hold for all
polynomials Q and R of degree k:
\[
|E_n Q(T) - E Q(T)| \le t\beta^{-1/2} c_k\sqrt{E Q^2(T)};
\]
\[
|E_n Q^2(T) - E Q^2(T)| \le t\beta^{-1} c_k^2\, E Q^2(T);
\]
\[
|E_n R(U) - E R(U)| \le t\beta^{-1/2} c_k\sqrt{E R^2(U)};
\]
\[
|E_n Q(T) R(U) - E Q(T) R(U)| \le t(t+3)\beta^{-1} c_k^2\sqrt{E Q^2(T)\, E R^2(U)}.
\]
Here dₖ is a positive constant depending only on k.
Proof. This is Lemma 12 of Stone (1985).
Lemma S.4. Let (Ω, P) be a probability space, and let A₁, · · · , A_Λ be a finite partition of Ω such
that p_λ = P(A_λ) > 0 for 1 ≤ λ ≤ Λ. Let V₁ and V₂ be a pair of random variables with finite
second moments, and let E be the expectation operator. Let (Ω′, P′) be another probability space,
and define E′, A′_λ and V′ⱼ analogously. For any small fixed constant ε > 0, there exists a δ > 0 such that
if |P′(A′_λ) − P(A_λ)| ≤ δP(A_λ) and
\[
\big|E'(V_1' V_2' \mid A_\lambda') - E(V_1 V_2 \mid A_\lambda)\big| \le \delta\sqrt{E(V_1^2 \mid A_\lambda)\, E(V_2^2 \mid A_\lambda)}, \quad 1\le\lambda\le\Lambda,
\]
then it follows that
\[
\big|E'(V_1' V_2') - E(V_1 V_2)\big| \le \varepsilon\sqrt{E V_1^2\, E V_2^2}. \tag{S.6}
\]
Proof. Observe that E(V₁V₂) = Σ_λ p_λ E(V₁V₂ | A_λ) and EV₁² = Σ_λ p_λ E(V₁² | A_λ). It follows by using
the Cauchy–Schwarz inequality twice that
\[
E(V_1 V_2) \le \sum_\lambda p_\lambda\sqrt{E(V_1^2 \mid A_\lambda)\, E(V_2^2 \mid A_\lambda)}
\le \sqrt{\sum_\lambda p_\lambda E(V_1^2 \mid A_\lambda)}\,\sqrt{\sum_\lambda p_\lambda E(V_2^2 \mid A_\lambda)}.
\]
Thus, (S.6) holds.
Lemma S.5. Suppose that Condition (C3) holds, and let δ₁ ∈ (δ, 1). Let Eₙ be the expectation
corresponding to the empirical distribution based on a random sample of size n. Then, except on
an event with probability tending to zero, it follows that
\[
E_n\Big(\sum_{j=1}^k V_j\Big)^2 \ge \Big(\frac{1-\delta_1}{2}\Big)^{k-1}\Big(\sum_{j=1}^k\sqrt{E_n V_j^2}\Big)^2. \tag{S.7}
\]
Proof. Choose a constant 0 < ε < {(1 − δ)/2}^{k−1}. By Lemmas S.2, S.3 and S.4, it follows that,
except on an event with probability tending to zero,
\[
|E_n(V_i V_j) - E(V_i V_j)| \le \varepsilon\sqrt{E V_i^2\, E V_j^2}, \quad 1\le i, j\le k.
\]
When i = j, it follows that
\[
(1-\varepsilon)\, E V_j^2 \le E_n V_j^2 \le (1+\varepsilon)\, E V_j^2, \quad 1\le j\le k. \tag{S.8}
\]
Furthermore, we obtain that
\[
E_n\Big(\sum_j V_j\Big)^2 = E\Big(\sum_j V_j\Big)^2 + E_n\Big(\sum_j V_j\Big)^2 - E\Big(\sum_j V_j\Big)^2
\ge E\Big(\sum_j V_j\Big)^2 - \varepsilon\, E\Big(\sum_j V_j\Big)^2
\ge E\Big(\sum_j V_j\Big)^2 - \varepsilon\Big(\sum_j\sqrt{E V_j^2}\Big)^2.
\]
The last inequality holds by Minkowski’s inequality. By Lemma S.1 and (S.8), it follows that
\[
E_n\Big(\sum_j V_j\Big)^2
\ge \Big(\Big(\frac{1-\delta}{2}\Big)^{k-1} - \varepsilon\Big)\Big(\sum_j\sqrt{E V_j^2}\Big)^2
\ge \Big(\Big(\frac{1-\delta}{2}\Big)^{k-1} - \varepsilon\Big)(1+\varepsilon)^{-1}\Big(\sum_j\sqrt{E_n V_j^2}\Big)^2.
\]
Since δ₁ > δ and ε can be made arbitrarily small, (S.7) holds.