
Computational Statistics and Data Analysis 55 (2011) 1656–1664


Estimation of inverse mean: An orthogonal series approach

Qin Wang a,*, Xiangrong Yin b

a Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, VA 23284, United States
b Department of Statistics, 204 Statistics Building, The University of Georgia, Athens, GA 30602, United States

Article info

Article history:
Received 13 November 2009
Received in revised form 3 September 2010
Accepted 23 October 2010
Available online 30 October 2010

Keywords:
Sufficient dimension reduction
Central subspace
Sliced inverse regression
Orthogonal series

Abstract

In this article, we propose the use of orthogonal series to estimate the inverse mean space. Compared to the original slicing scheme, it significantly improves the estimation accuracy without losing computational efficiency, especially for heteroscedastic models. Compared to the local smoothing approach, it is more computationally efficient. The new approach also has the advantage of robustness in selecting the tuning parameter. A permutation test is used to determine the structural dimension. Moreover, a variable selection procedure is incorporated into this new approach, which is particularly useful when the model is sparse. The efficacy of the proposed method is demonstrated through simulations and a real data analysis.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Sufficient dimension reduction (Li, 1991; Cook, 1998) has recently received much attention as an efficient tool to tackle the challenging problem of high dimensional data analysis. In full generality, the goal of regression is to elicit information on the conditional distribution of a univariate response Y given a p-dimensional predictor vector X. Sufficient dimension reduction aims to find a k-dimensional projection subspace S = Span{B = (β_1, β_2, ..., β_k)} with k ≤ p such that

Y ⊥⊥ X | P_S X,     (1)

where the β_i's are unknown p × 1 vectors, ⊥⊥ indicates independence and P_S stands for a projection operator in the standard inner product. The subspace S is then called a dimension reduction subspace for Y|X. When the intersection of all subspaces satisfying (1) also satisfies (1), it is called the central subspace (CS) and is denoted by S_{Y|X}. Its dimension d_{Y|X} = dim(S_{Y|X}) is defined as the structural dimension of the regression. Under some mild conditions, the CS exists (Cook, 1998; Yin et al., 2008). The CS, which represents the minimal subspace preserving the original information of Y|X, is unique and is the main focus of dimension reduction. Let Z = Σ_X^{-1/2}(X − E(X)), where Σ_X is the covariance matrix of X, assumed to be positive definite. Then Σ_X^{-1/2} S_{Y|Z} = S_{Y|X}. Hence, without loss of generality, we may work on either the Z- or the X-scale.

Sliced inverse regression (SIR; Li, 1991) is the first and most well-known method for sufficient dimension reduction. It investigates the trajectory of the inverse mean curve E(Z|Y). Under the so-called linearity condition that E(Z|B^T Z) is linear in B^T Z, S_{E(Z|Y)} ⊆ S_{Y|Z}. Since then, many related studies have been carried out in both theory and applications. Hsing and Carroll (1992) established the asymptotic properties of SIR estimates when each slice contains only 2 observations. Zhu and Ng (1995) extended this idea to allow for a fixed number of observations per slice. Zhu and Fang (1996) bypassed the slicing step and used kernel smoothing to estimate cov[E(Z|Y)]. Bura and Cook (2001a) suggested a parametric approach called parametric inverse regression. Fung et al. (2002) developed a variant of SIR, CANCOR, in which B-spline basis functions replace simple slicing. Xia et al. (2002) proposed an alternative derivation of SIR through the combination of local linear expansion and projection pursuit, known as inverse minimum average variance estimation (IMAVE). Bura (2003) also used a local linear smoother to estimate the inverse mean function. On the other hand, Schott (1994), Velilla (1998), Bura and Cook (2001b) and Zhu et al. (2006) developed different methods to estimate the structural dimension d_{Y|X} under different scenarios. SIR is a powerful method due to its simplicity. However, it still has limitations. One of the issues is estimation efficiency. The finite-sample performance of SIR is not very satisfactory when the dimension is more than 2, and can be poor for heteroscedastic models.

* Corresponding author. E-mail addresses: [email protected] (Q. Wang), [email protected] (X. Yin).
0167-9473/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2010.10.022

A new direction for sufficient dimension reduction that deserves serious consideration is functional data analysis. See Ferraty and Vieu (2006) for an extensive review of functional data. Because functional data are infinite dimensional, one technical difficulty is inverting the ill-conditioned covariance matrix. To overcome this issue, Ferré and Yao (2003, 2005) respectively replaced the matrix by a sequence of finite rank operators, with bounded inverses and converging to the covariance matrix, and used an equivalent eigen-space combined with a generalized inverse to avoid the inversion of the functional covariance matrix. Ait-Saidi et al. (2008) investigated dimension reduction methods assuming a single index functional model. Amato et al. (2006) extended SIR and other methods to functional data through appropriate wavelet decompositions. Recently, Hsing and Ren (2009) provided a platform for the inverse regression dimension reduction problem, from finite to infinite dimensional settings, using an RKHS formulation.

In this article, we propose the use of orthogonal series to estimate the inverse mean function. As a useful nonparametric method, orthogonal series estimation is computationally efficient and can improve the estimation accuracy of SIR significantly, especially for the heteroscedastic models. Adopting the covariance matrix estimation techniques proposed by Ferré and Yao (2003, 2005), our method could be applied to functional data as well. The rest of the article is organized as follows. Section 2 gives a brief review of the estimation of SIR. The new approach based on orthogonal series estimation is detailed in Section 3. Section 4 introduces a Lasso type procedure to select informative variables. Section 5 discusses the permutation procedure used to choose the structural dimension. Simulation studies and a real data example are in Section 6. Section 7 concludes our discussion.

2. A brief review

Let {(X_i^T, Y_i), i = 1, ..., n} be a random sample from (X^T, Y), where X = (X_1, ..., X_p)^T ∈ R^p and Y ∈ R, and assume that d_{Y|X} is known. The SIR algorithm proposed by Li (1991) can be summarized as follows:

1. Standardize the X_i: Z_i = Σ_X^{-1/2}(X_i − X̄), where X̄ and Σ_X are the sample mean and sample covariance matrix, respectively. Then divide the Y_i, i = 1, ..., n, into H slices and let p_h be the proportion of the Y_i that fall in slice h ∈ {1, 2, ..., H};
2. Within each slice h, compute the sample mean of Z and denote it by Z̄_h. Form the sample SIR matrix V = Σ_{h=1}^{H} p_h Z̄_h Z̄_h^T, and find the eigen-structure of V;
3. The d_{Y|X} eigenvectors (η_i, i = 1, ..., d_{Y|X}) corresponding to the d_{Y|X} largest eigenvalues are the estimated directions of S_{E(Z|Y)}. Back to the X scale, β_i = Σ_X^{-1/2} η_i, i = 1, ..., d_{Y|X}.

In this article, we use orthogonal series, a more flexible nonparametric tool, to estimate the inverse mean function. Our approach can be regarded as an alternative to the Principal Fitted Component model proposed by Cook (2007) and Cook and Forzani (2008). In particular, one model they proposed is

X_y = E(X) + Γ α f_y + σ ε,

where X_y denotes a random vector distributed as X|Y = y, Γ ∈ R^{p×d}, d < p, Γ^T Γ = I_d, α ∈ R^{d×r} and d ≤ r. Here f_y ∈ R^r is a known vector-valued function of the response with Σ_y f_y = 0, σ > 0, and the error vector ε ∈ R^p. Inverse regression plots of X_y versus y can be used to find suitable choices of f_y; the sufficient dimension reduction subspace S(Γ) is then estimated by maximum likelihood. The authors also mentioned other possibilities for basis functions to be used for f_y. Rather than just estimating f_y, we use orthogonal series to estimate the inverse mean function without any particular model assumption.
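For readers who prefer code, the slicing-based SIR estimator in steps 1-3 above can be summarized in a short numpy sketch. This is only an illustration under choices of our own (equal-count slices formed by sorting Y, a symmetric inverse square root of the sample covariance, and a hypothetical function name); it is not the authors' Matlab implementation.

```python
import numpy as np

def sir_directions(X, Y, d, H=10):
    """Illustrative sketch of SIR steps 1-3: slice, average, eigen-decompose."""
    n, p = X.shape
    # Step 1: standardize X with a symmetric inverse square root of the sample covariance
    Xbar = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - Xbar) @ Sigma_inv_sqrt
    # Step 1 (cont.): H slices of roughly equal size along the ordered response
    slices = np.array_split(np.argsort(Y), H)
    # Step 2: sample SIR matrix V = sum_h p_h * Zbar_h Zbar_h^T
    V = np.zeros((p, p))
    for idx in slices:
        Zbar_h = Z[idx].mean(axis=0)
        V += (len(idx) / n) * np.outer(Zbar_h, Zbar_h)
    # Step 3: top-d eigenvectors of V, mapped back to the X scale
    w, U = np.linalg.eigh(V)
    eta = U[:, np.argsort(w)[::-1][:d]]
    return Sigma_inv_sqrt @ eta          # columns estimate beta_1, ..., beta_d
```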

3. Alternative estimation for SIR

3.1. Orthogonal series estimation

Suppose that a regression function of y given t can be represented as y = μ(t) + ε, where μ(t) is the mean function and ε is the random error. If it is reasonable to assume that μ(t) is a smooth function, many classes of functions can then be used to approximate μ(t). In general,

y = Σ_{j=0}^{∞} θ_j φ_j(t) + ε,     (2)

where {φ_j} is a set of basis functions and the θ_j's are the unknown Fourier coefficients. Once a basis is chosen, the estimation of μ(t) is equivalent to the estimation of those Fourier coefficients. In practice, not all of them are estimable since only a finite number of observations are available. The approximation μ̂(t) = Σ_{j=0}^{J} θ̂_j φ_j(t) is often used and is known as a series estimator. More details on the properties of series estimators and the choice of the smoothing parameter J can be found in Härdle (1990) and Eubank (1999).


A convenient choice of the class of basis functions is called a complete orthonormal sequence (CONS). For t ∈ [−1, 1], {φ_j} constitute an orthonormal basis if

∫_{-1}^{1} φ_i(t) φ_j(t) dt = δ_{ij} = { 0 if i ≠ j; 1 if i = j }.     (3)

Given data (t_i, y_i) for i = 1, ..., n, let y = (y_1, ..., y_n)^T and define the n × J matrix L_J = {φ_j(t_i)}, i = 1, ..., n; j = 1, ..., J. A natural estimator of the Fourier coefficients based on the idea of standard linear regression will be

Θ̂_J = (θ̂_1, ..., θ̂_J)^T = (L_J^T L_J)^{-1} L_J^T y,

and the estimator for the response is ŷ = L_J (L_J^T L_J)^{-1} L_J^T y.

Well-known choices of the CONS include the Legendre polynomials and trigonometric series. In this article, we will use Legendre polynomials to demonstrate our methodology, for their computational convenience and close connection with polynomial regression. The Legendre polynomials

P_0(t) = 1/√2,  P_1(t) = t/√(2/3),  P_2(t) = (1/2)(3t² − 1)/√(2/5),  P_3(t) = (1/2)(5t³ − 3t)/√(2/7),  ...

constitute an orthonormal series on [−1, 1]. In general, a term of the Legendre polynomials can be computed easily from the following recurrence relation:

(s + 1) P_{s+1}(t) = (2s + 1) t P_s(t) − s P_{s−1}(t),  for s ≥ 2.
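As a computational aside, the orthonormal basis above can be generated by applying the recurrence to the unnormalized Legendre polynomials and then rescaling by √((2j + 1)/2). The numpy sketch below is our own illustration (the function names are not from the paper; numpy.polynomial.legendre.legvander provides an equivalent unnormalized design matrix).

```python
import numpy as np

def legendre_basis(t, J):
    """Orthonormal Legendre basis P_0, ..., P_J on [-1, 1], evaluated at the points t.

    The standard recurrence (s + 1) P_{s+1} = (2s + 1) t P_s - s P_{s-1} is applied to
    the unnormalized polynomials, and the factor sqrt((2j + 1)/2) supplies the
    normalization so that the columns are orthonormal on [-1, 1].
    """
    t = np.asarray(t, dtype=float)
    L = np.empty((t.size, J + 1))
    L[:, 0] = 1.0
    if J >= 1:
        L[:, 1] = t
    for s in range(1, J):
        L[:, s + 1] = ((2 * s + 1) * t * L[:, s] - s * L[:, s - 1]) / (s + 1)
    return L * np.sqrt((2 * np.arange(J + 1) + 1) / 2.0)

def series_fit(t, y, J):
    """Series estimator of Section 3.1: Fourier coefficients and fitted values."""
    L = legendre_basis(t, J)
    theta, *_ = np.linalg.lstsq(L, y, rcond=None)      # (L^T L)^{-1} L^T y
    return theta, L @ theta
```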

3.2. Orthogonal series estimator of inverse regression

Our goal is to estimate the inverse mean curve μ = E(X|Y) via a particular CONS, the Legendre polynomials. We scale Y so that Y_i ∈ [−1, 1] for i = 1, ..., n and still denote it by Y_i. The model can then be written as

X|Y = m(Y) + E,  or  X_j|Y = m_j(Y) + ε_j,  j = 1, 2, ..., p,

where the link function m(Y) = (m_1(Y), ..., m_p(Y))^T is unknown and the error vector E = (ε_1, ..., ε_p)^T ∈ R^p has mean 0.

More specifically, we have

[X_1^T]   [m_1(Y_1)  m_2(Y_1)  ...  m_p(Y_1)]   [E_1^T]
[X_2^T] = [m_1(Y_2)  m_2(Y_2)  ...  m_p(Y_2)] + [E_2^T]
[ ...  ]   [  ...       ...           ...   ]   [ ... ]
[X_n^T]   [m_1(Y_n)  m_2(Y_n)  ...  m_p(Y_n)]   [E_n^T]     (4)

Using an orthonormal series expansion, we can get

x_ij|Y_i = m_j(Y_i) = Σ_{l=0}^{J} θ_{jl} P_l(Y_i).

If we denote

X = [X_1, X_2, ..., X_n]^T,    W = [P_0(Y_1)  P_1(Y_1)  ...  P_J(Y_1)]
                                   [P_0(Y_2)  P_1(Y_2)  ...  P_J(Y_2)]
                                   [  ...        ...           ...  ]
                                   [P_0(Y_n)  P_1(Y_n)  ...  P_J(Y_n)],     (5)

then the estimate of the inverse mean function will be X̂ = W(W^T W)^{-1} W^T X.

Zhu and Fang (1996) obtained √n-consistency and asymptotic normality for kernel estimates of SIR. Since X̂ can be treated as a special type of kernel estimate (Härdle, 1990; Eubank, 1999), we believe that these properties are also valid for our approach under conditions similar to those in Zhu and Fang (1996). In fact, Fig. 1 in Section 6 numerically supports this conclusion.
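In matrix form, the fit above is simply a column-by-column series regression of the predictors on the scaled response. A minimal sketch is given below, assuming the hypothetical legendre_basis helper from the Section 3.1 sketch is in scope; the function names are our own.

```python
import numpy as np

def scale_response(Y):
    """Rescale Y linearly so that the scaled values lie in [-1, 1]."""
    Y = np.asarray(Y, dtype=float)
    return 2.0 * (Y - Y.min()) / (Y.max() - Y.min()) - 1.0

def inverse_mean_fit(X, Y, J):
    """Fitted inverse mean Xhat = W (W^T W)^{-1} W^T X, with W as in Eq. (5)."""
    W = legendre_basis(scale_response(Y), J)         # n x (J + 1) Legendre design matrix
    coef, *_ = np.linalg.lstsq(W, X, rcond=None)     # Fourier coefficients, one column per predictor
    return W @ coef                                  # n x p matrix of fitted E(X|Y) values
```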

3.3. Algorithm of orthogonal series estimation

Assuming d_{Y|X} is known, we describe our algorithm as follows.

1. Scale Y_i so that Y_i ∈ [−1, 1] and set Z_i = Σ_X^{-1/2}(X_i − X̄);
2. For a given truncation point J, form the design matrix W described in the previous section;
3. Compute an estimate of the inverse mean function Ẑ = W(W^T W)^{-1} W^T Z, where Z = (Z_1, Z_2, ..., Z_n)^T;
4. Form the dimension reduction matrix M = (1/n) Σ_{i=1}^{n} Ẑ_i Ẑ_i^T, where Ẑ_i^T is the ith row of Ẑ;
5. Conduct a spectral decomposition of M to find its eigen-structure;
6. The d_{Y|X} eigenvectors {η_i, i = 1, ..., d_{Y|X}} corresponding to the d_{Y|X} largest eigenvalues are the estimated directions of S_{E(Z|Y)}. Back to the X scale, β_i = Σ_X^{-1/2} η_i, i = 1, ..., d_{Y|X}.

Fig. 1. √n-consistency check.

The choice of J typically governs the smoothness of the fit. Since our focus is on the estimation of the β's rather than the regression fit, the choice of J is less essential. Although a little over-smoothing might be preferred to increase the estimation accuracy, in practice a rough choice of J will suffice. We use J = 1.5√n in the simulation study and it seems to work well. Finer methods such as the cross-validation or generalized cross-validation criteria discussed in Eubank (1999) can be adopted here, which may deserve further study.

Note that the algorithm requires the calculation of the inverse of the matrix W^T W. In our simulations, the Moore–Penrose generalized inverse was used when the matrix was singular.
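Putting the pieces together, steps 1-6, the default J = 1.5√n and the pseudoinverse fallback can be sketched as follows. This is only an illustration, not the authors' Matlab code: it reuses the hypothetical legendre_basis helper sketched in Section 3.1, and numpy's SVD-based least-squares routine plays the role of the Moore–Penrose generalized inverse when W^T W is singular.

```python
import numpy as np

def orsir(X, Y, d, J=None):
    """Illustrative sketch of the orSIR algorithm (steps 1-6 above)."""
    n, p = X.shape
    if J is None:
        J = int(round(1.5 * np.sqrt(n)))               # rough default used in the simulations
    # Step 1: scale Y to [-1, 1] and standardize X
    Ys = 2.0 * (Y - Y.min()) / (Y.max() - Y.min()) - 1.0
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ Sigma_inv_sqrt
    # Steps 2-3: Legendre design matrix W and fitted inverse mean Zhat = W (W^T W)^+ W^T Z
    W = legendre_basis(Ys, J)                          # helper from the Section 3.1 sketch
    coef, *_ = np.linalg.lstsq(W, Z, rcond=None)       # SVD-based, handles a singular W^T W
    Zhat = W @ coef
    # Steps 4-5: dimension reduction matrix M and its eigen-structure
    M = Zhat.T @ Zhat / n
    w, U = np.linalg.eigh(M)
    # Step 6: top-d eigenvectors, mapped back to the X scale
    eta = U[:, np.argsort(w)[::-1][:d]]
    return Sigma_inv_sqrt @ eta                        # columns estimate beta_1, ..., beta_d
```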

4. Sufficient variable selection

In some applications the regression model has an intrinsic sparse structure. That is, only a few components of X affect the response. Effectively selecting informative predictors in the reduced directions can improve both the estimation accuracy and the interpretability. In this section, we incorporate the shrinkage estimation procedure proposed by Li and Yin (2008) into our method.

Note that the estimation of the CS is equivalent in either the X or the Z scale; however, sparseness is generally not transferable from one scale to the other, depending on Σ_X. Hence we work directly in the X scale for the sparse solution. From step 4 in the previous section, we have M = (1/n) Ẑ^T Ẑ. Thus the dimension reduction matrix in the X scale is M_X = (1/n) Σ_X^{-1/2} Ẑ^T Ẑ Σ_X^{-1/2}. Define X̃ = X − 1_n X̄^T, where 1_n is an n × 1 vector of 1's, and let Ẑ_X = W(W^T W)^{-1} W^T X̃ Σ_X^{-1}. Then M_X = (1/n) Ẑ_X^T Ẑ_X. Let Ẑ_ix^T be the ith row of Ẑ_X. Instead of using a spectral decomposition to find the central subspace, an alternative least-squares type estimation can be formed as follows:

B̂ = argmin_{B, C_i, i=1,...,n} Σ_{i=1}^{n} ‖Ẑ_ix − B C_i‖²,     (6)

where B = {β_i, i = 1, ..., d_{Y|X}}, C_i = B^T Ẑ_ix and B^T B = I_{d_{Y|X}}. The objective function in (6) is then alternately minimized between B and the C_i's until convergence. To select informative predictors, a shrinkage index vector α can be incorporated into (6) with the solution (B̂, Ĉ_i) for i = 1, ..., n:

α̂ = argmin_α Σ_{i=1}^{n} ‖Ẑ_ix − diag(α) B̂ Ĉ_i‖²,     (7)

where α ∈ {α ∈ R^p : Σ_{i=1}^{p} |α_i| ≤ λ} for some λ > 0. This constrained optimization can be solved by a standard Lasso algorithm. Then diag(α̂)B̂ forms a basis of the estimated sparse central subspace S_{Y|X}. To select the tuning parameter λ, we use the K-fold cross-validation method in a Matlab routine written by Karl Sköglund; 10-fold cross-validation is used in our simulation study and it seems very effective. More details can be found at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3897.
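To make the least-squares-plus-shrinkage idea concrete, one possible sketch is given below. It is not the authors' Matlab routine: scikit-learn's Lasso (penalized rather than constrained form) stands in for the standard Lasso algorithm, a direct eigen-decomposition replaces the alternating minimization of (6) (the two give the same minimizer under the constraints C_i = B^T Ẑ_ix and B^T B = I), and the function name and default penalty are our own.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_shrinkage(ZX, d, lam=0.1):
    """Sketch of the shrinkage estimation of Section 4.

    ZX  : n x p matrix whose rows are the fitted Zhat_ix on the X scale.
    d   : structural dimension d_{Y|X}.
    lam : Lasso penalty; stands in for the constraint sum_i |alpha_i| <= lambda
          and would be chosen by 10-fold cross-validation in practice.
    """
    n, p = ZX.shape
    # Least-squares step (6): with C_i = B^T Zhat_ix and B^T B = I, the minimizing B
    # spans the top-d eigenvectors of ZX^T ZX, so the alternating minimization is
    # replaced here by a direct eigen-decomposition.
    w, U = np.linalg.eigh(ZX.T @ ZX)
    B = U[:, np.argsort(w)[::-1][:d]]
    F = ZX @ B @ B.T                       # row i holds (B C_i)^T
    # Shrinkage step (7): rewrite sum_i ||Zhat_ix - diag(alpha) B C_i||^2 as a single
    # Lasso regression of vec(ZX) on a column-wise design built from F.
    design = np.zeros((n * p, p))
    for k in range(p):
        design[k::p, k] = F[:, k]          # alpha_k only multiplies coordinate k
    alpha_hat = Lasso(alpha=lam, fit_intercept=False).fit(design, ZX.reshape(-1)).coef_
    return np.diag(alpha_hat) @ B          # basis of the estimated sparse central subspace
```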


Table 1
Estimation accuracy comparison of Example 1.

n     SIR            orSIR          KSIR           IMAVE
100   0.192 (0.088)  0.203 (0.080)  0.209 (0.087)  0.175 (0.074)
200   0.090 (0.050)  0.092 (0.048)  0.113 (0.057)  0.090 (0.052)
400   0.041 (0.020)  0.040 (0.019)  0.049 (0.026)  0.036 (0.015)
800   0.020 (0.008)  0.019 (0.009)  0.023 (0.008)  0.018 (0.008)

5. Determination of dimensionality

In practice, the structural dimension d_{Y|X} is often unknown. Several estimation procedures have been proposed in the literature. The asymptotic sequential testing approach was initially used in Li (1991) and has been elaborated and extended by Schott (1994), Velilla (1998), Ferré (1998) and Bura and Cook (2001b). Recently, Zhu et al. (2006) proposed a modified BIC criterion to estimate d_{Y|X} which can handle models with high-dimensional covariates. In this section, we adopt a permutation test procedure based on the results in Cook and Yin (2001). For simplicity we work in the Z-scale.

Let U = {u_i} be the p × p matrix of eigenvectors of the kernel dimension reduction matrix M as indicated above. Partition U = (U_1, U_2), where U_1 is p × m; then we have

Proposition 1. If (Y, U_1^T Z) ⊥⊥ U_2^T Z, then S(U_1) is a dimension reduction subspace of Y|Z and therefore dim(S(U_1)) ≥ d_{Y|Z}.

This proposition suggests a sequential permutation testing procedure to estimate the structural dimension d_{Y|Z}. Generally, consider testing H_0: d_{Y|Z} = m vs. H_a: d_{Y|Z} ≥ (m + 1). Let B_m = (β_1, ..., β_m) and A_m = (β_{m+1}, ..., β_p). Thus the following procedure can be used to test d_{Y|Z} = m.

• Obtain M from the data (Z, Y). Use a spectral decomposition to obtain the corresponding eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p and then calculate the test statistic:

  f_0(λ) = λ_{m+1} − [1/(p − (m + 1))] Σ_{i=m+2}^{p} λ_i.

• Re-arrange the data as {(B_m^T Z, A_m^T Z, Y)}, and permute the rows of the matrix A_m^T Z N times to form permuted data sets. For each permuted data set, obtain its M^(j), and calculate the test statistic denoted by f_j(λ^(j)), where j = 1, ..., N and λ^(j) are the eigenvalues of M^(j);
• Compute the permutation p-value as p_perm = N^{-1} Σ_{j=1}^{N} I(f_j(λ^(j)) > f_0(λ)), and reject H_0 if p_perm < α, where I is the indicator function and α is a pre-specified significance level;
• Repeat the previous 3 steps for m = 0, 1, 2, ... until H_0 cannot be rejected. Take this m as the estimated d_{Y|Z}.
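A sketch of the resulting sequential test is given below. The kernel_matrix argument stands for any routine returning the matrix M of Section 3.3 (for instance the orSIR sketch given there); the defaults N = 500 and seed are illustrative choices of our own, while α = 0.05 follows the simulations.

```python
import numpy as np

def permutation_dimension(Z, Y, kernel_matrix, alpha=0.05, N=500, seed=0):
    """Sketch of the sequential permutation test of Section 5 (Z-scale data).

    kernel_matrix(Z, Y) should return the p x p matrix M of Section 3.3.
    """
    n, p = Z.shape
    rng = np.random.default_rng(seed)

    def f_stat(M, m):
        lam = np.sort(np.linalg.eigvalsh(M))[::-1]     # lambda_1 >= ... >= lambda_p
        return lam[m] - lam[m + 1:].mean()             # f_0(lambda) for H_0: d = m

    for m in range(p - 1):
        M = kernel_matrix(Z, Y)
        U = np.linalg.eigh(M)[1][:, ::-1]              # eigenvectors, leading ones first
        B, A = U[:, :m], U[:, m:]                      # spans of B_m and A_m
        f0 = f_stat(M, m)
        exceed = 0
        for _ in range(N):
            ZA_perm = (Z @ A)[rng.permutation(n)]      # permute the rows of A_m^T Z
            Z_perm = Z @ B @ B.T + ZA_perm @ A.T       # reassemble the permuted data
            exceed += f_stat(kernel_matrix(Z_perm, Y), m) > f0
        if exceed / N >= alpha:                        # p_perm >= alpha: cannot reject H_0
            return m
    return p - 1
```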

6. Simulation studies and data analysis

In this section, we evaluate the performance of our method via a simulation study and a real data analysis. For measuring the accuracy of the estimates, we use the trace correlation r defined by Ye and Weiss (2003) and Zhu and Zeng (2006). Let S(A) and S(B) denote the column spaces spanned by two p × q matrices A and B of full column rank, and let P_A = A(A^T A)^{-1} A^T and P_B = B(B^T B)^{-1} B^T be the projection matrices onto S(A) and S(B). The trace correlation is defined as r = (1/q) tr(P_A P_B). Clearly, 0 ≤ r ≤ 1, and the larger r is, the closer S(A) is to S(B). Following previous papers, 1 − r is used to measure the distance between the two spaces. To measure the effectiveness of variable selection, we use the true positive rate (TPR), defined as the ratio of the number of predictors correctly identified as active to the number of active predictors, and the false positive rate (FPR), defined as the ratio of the number of predictors falsely identified as active to the number of inactive predictors. We compared our method with the original SIR (Li, 1991), kernel SIR (Zhu and Fang, 1996) and a local approach, IMAVE (Xia et al., 2002). 500 data replicates were conducted for each parameter setting. The number of slices was chosen to be 10 in SIR. A Gaussian kernel and the optimal bandwidth in the sense of mean integrated squared error (Silverman, 1986) were used in kernel SIR and IMAVE. The Matlab code of the estimation is available from the authors.
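For reference, the accuracy measure can be computed directly from its definition; a small sketch (the helper name is our own) is:

```python
import numpy as np

def trace_correlation(A, B):
    """r = (1/q) tr(P_A P_B) for two p x q full-column-rank matrices A and B."""
    PA = A @ np.linalg.solve(A.T @ A, A.T)
    PB = B @ np.linalg.solve(B.T @ B, B.T)
    return np.trace(PA @ PB) / A.shape[1]

# 1 - trace_correlation(B_hat, B_true) gives the distance reported in the tables.
```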

Example 1. This is a model originally used by Li (1991) for demonstrating SIR.

Y = β_1^T X / [0.5 + (β_2^T X + 1.5)²] + 0.5 ε,

where X = (X_1, X_2, ..., X_10)', the X_i's and ε are independent and identically distributed as N(0, 1), β_1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)' and β_2 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)'. Table 1 indicates that all methods perform well and are very comparable. Following Wang and Xia (2008), we check the √n-consistency by plotting 1 − r versus 1/√n for n = 100, 200, 300, 400 and 500 in Fig. 1. If √n-consistency is correct, then approximately a straight line is expected. This seems to be the case for this example.
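For reproducibility of the setting, the model of Example 1 can be simulated as below; example1_data is our own helper name, and the generated (X, Y) can be fed to the orSIR sketch of Section 3.3 and scored with 1 − r.

```python
import numpy as np

def example1_data(n, p=10, seed=None):
    """Generate (X, Y) from the model of Example 1 (illustrative helper)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                    # X_i i.i.d. N(0, 1)
    eps = rng.standard_normal(n)
    Y = X[:, 0] / (0.5 + (X[:, 1] + 1.5) ** 2) + 0.5 * eps
    return X, Y

X, Y = example1_data(400, seed=1)
# X, Y can now be passed to the orsir sketch of Section 3.3 with d = 2 and the result
# compared with span{beta_1, beta_2} via 1 - trace_correlation(...).
```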


Fig. 2. Estimation accuracy for Example 2.

Fig. 3. The choice of J for Example 2.

Example 2 (Heteroscedasticity). Consider the following model

Y = (β^T X) ε,

where X = (X_1, X_2, ..., X_10)', the X_i's and ε are independent and identically distributed as N(0, 1) and β = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)'.

The box-plot in Fig. 2 shows the comparisons for 3 different sample sizes, 200, 400 and 800. This model does not favor SIR. Even as the sample size increases, no significant improvement is obtained in the estimation accuracy. On the contrary, orthogonal series estimation shows the highest accuracy in all cases, including with small sample sizes n = 50 and 100 (not reported here). The advantage over the usual ‘‘slicing’’ scheme becomes more significant when the sample size increases.

We also study the sensitivity of the choice of the truncation point J. Fig. 3 shows the change of the average distance 1 − r as J ranges from 10 to 40, for 3 different sample sizes, n = 200, 400 and 800, and two different predictor dimensions, p = 10 and 30. The red lines are the results for p = 10, and the blue lines for p = 30. The right panel is for the sparse solution. Increasing p does reduce the accuracy, while increasing the sample size increases the accuracy. Since this model has a sparse structure, a sparse solution gives better estimates. Clearly, the choice of J is less essential in estimating the directions of the central subspace, since the accuracy curve flattens out after a certain value of J for each combination of p and n.

In addition, we investigated the effect of the structure of β on the estimation accuracy. Table 2 shows 3 different β structures when n = 200 and p = 10. Based on the simulation, we can see little effect on the estimation accuracy from the structure of β.


Table 2
Example 2 with different structures on β, n = 200 and p = 10.

β                                        SIR            orSIR          KSIR           IMAVE
(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)           0.738 (0.201)  0.237 (0.159)  0.489 (0.245)  0.302 (0.201)
(0.6, 0.8, 0, 0, 0, 0, 0, 0, 0, 0)       0.747 (0.196)  0.249 (0.171)  0.490 (0.273)  0.307 (0.237)
(0.5, 0.5, 0.5, 0.5, 0, 0, 0, 0, 0, 0)   0.720 (0.197)  0.234 (0.163)  0.483 (0.251)  0.294 (0.210)

Table 3
Effectiveness of variable selection: sparse orSIR.

      Example 2           Example 3
n     TPR      FPR        TPR      FPR
200   0.9100   0.0106     0.8217   0.0393
400   0.9900   0.0061     0.9233   0.0171
800   1.0000   0.0010     0.9833   0.0021

Fig. 4. Estimation accuracy for Example 3.

Example 3 (A 3-Dimensional Heteroscedastic Model).

Y = β_1^T X / [0.5 + (β_2^T X + 1.5)²] + (β_3^T X)³ ε,

where X = (X_1, X_2, ..., X_10)', the X_i's and ε are independent and identically distributed as N(0, 1), β_1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)', β_2 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)' and β_3 = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)'.

Fig. 4 shows the box-plot for sample sizes n = 200, 400 and 800 (similar conclusions for n = 50 and n = 100, not reported here). Because of the heteroscedasticity, SIR and kernel SIR do not perform well, even with large sample sizes. The IMAVE estimation takes more than 4 times the CPU time of our orthogonal series estimation, although it performs slightly better. The running time for 500 data samples is 7 s for both SIR and orSIR when the sample size is n = 200. When the sample size increases to 400, the running times are 11 and 14 s for SIR and orSIR, respectively.

Example 4 (Variable Selection). In this example, we verify the performance of shrinkage estimation in selecting informative predictors. The tuning parameter is selected by 10-fold cross-validation. Table 3 shows that our sparse estimates are pretty accurate.

Example 5 (Determination of Dimensionality). Table 4 summarizes the estimated d_{Y|X} from the asymptotic χ² test for SIR and the permutation test for orSIR. The correct estimates correspond to d = 1 for Example 2 and d = 3 for Example 3. For both methods, we choose the significance level α = 0.05. The results show that the determination of the structural dimension for heteroscedastic models seems more difficult than for the mean structure, despite the significant gain from the permutation test.

Example 6 (Ozone Data). In this example, we consider a data set for studying the atmospheric ozone concentration in the Los Angeles basin. This data set has been studied by Li (1992) and Cook and Li (2004). The response Y is the daily measurement of ozone concentration. There are p = 8 predictors: the Sanburg Air Force Base temperature, Inverse base height, Dagget pressure gradient, Visibility, Vandenburg 500 millibar height, Humidity, Inverse base temperature and Wind speed. The data set contains 330 observations.

Sliced inverse regression identifies one significant direction. The scatter plot of the response Y vs. the first SIR direction clearly shows a quadratic pattern. After a closer investigation of the residuals from the quadratic fit, Li (1992) argued that a second significant component is necessary and that Principal Hessian Directions (PHD) can recover this direction, concluding a CS with d_{Y|X} = 2. Cook and Li (2004) also identify the first direction using the Iterative Hessian Transformation (IHT) methodology, but the estimates of the dimension differ across different testing methods.

Table 4
Proportion (in %) of estimated dimensions of the central subspace.

                 χ² test (SIR)                     Permutation test (orSIR)
Example   n      d=0   d=1   d=2   d=3   d>3       d=0   d=1   d=2   d=3   d>3
2         200    94    5     1     0     0         51    46    3     0     0
          400    94    6     0     0     0         21    76    3     0     0
          800    93    7     0     0     0         1     99    0     0     0
3         200    0     37    62    1     0         2     6     70    22    0
          400    0     3     92    5     0         0     2     49    46    3
          800    0     0     94    6     0         0     0     9     84    7

Fig. 5. 3-dimensional plot from the study of Ozone data.

We apply our method to this data set. The permutation p-values for m = 0, 1, and 2 are 0, 0.004 and 0.278, respectively, which clearly suggests 2 significant directions. Fig. 5 shows a good 3-dimensional view of the data summary. A quadratic surface along the first direction can be observed, while the second direction indicates some variation along the surface which may be due to the heteroscedasticity.

The estimates by our method are β_1 = (−0.50, 0.39, −0.17, 0.30, −0.39, −0.31, −0.48, 0.00) and β_2 = (−0.16, 0.09, 0.68, −0.15, −0.28, 0.57, −0.22, 0.17). However, it is not easy to interpret these coefficients without fitting a model. Nevertheless, we compare our results with those of Li (1992), since both conclude d_{Y|X} = 2. Indeed, the distance between the two estimated spaces is 1 − r = 0.17, which indicates that the two analyses largely agree with each other. However, Li (1992) arrived at this conclusion by combining both SIR and PHD.

7. Discussion

In this article, we propose an alternative estimation procedure to sliced inverse regression based on orthogonal series estimation. A sparse solution is also introduced. Empirical studies show the efficacy of the new approach, especially for heteroscedastic models. Although our method and the local estimation procedure, IMAVE, are comparable in estimation accuracy, orSIR is more computationally efficient for large sample sizes. Unlike the choice of the number of slices H in the original SIR, the only tuning parameter, the truncation point J in the orthogonal series, is more robust in estimating both the directions and the structural dimension. In this article, we simply adopt the permutation procedure to estimate the dimensionality. Some other approaches, such as the bootstrap (Ye and Weiss, 2003) and the modified BIC (Zhu et al., 2006), may also be used here.

We believe the better performance of orSIR compared with other inverse approaches in finding nonlinear structure comes from the fact that the orthonormal sequence is a high-order polynomial expansion. Note that the original SIR uses a step function, KSIR is a (local) constant approximation, while IMAVE uses a local linear approximation. Hence, IMAVE generally outperforms KSIR, while KSIR gives better results than SIR. The complete orthonormal sequence approximation in our approach uses even higher-order polynomials. It should perform better as long as the degree J is not too small (which leads to, or shrinks toward, the null space) or too large (which leads to PCA, where no information from the response Y is used).


Nevertheless, we do not claim that the orthogonal series approach is the best among the smoothing class, such as the kernel (Zhu and Fang, 1996), B-spline (Fung et al., 2002), local linear approximation (Xia et al., 2002) and local linear smoothers (Bura, 2003), to name a few. But it does provide a viable alternative estimation procedure. The idea of the orthogonal series approach can also be extended to other inverse regression based dimension reduction methods, such as sliced average variance estimation. The investigation is under way.

Acknowledgements

The authors would like to thank the editor, an associate editor, and two referees for their constructive comments that led to substantial improvements in the manuscript. Yin's research was supported in part by National Science Foundation Grant DMS-0806120.

References

Ait-Saidi, A., Ferraty, F., Kassa, R., Vieu, P., 2008. Cross-validated estimations in the single-functional index model. Statistics 42, 475–494.
Amato, U., Antoniadis, A., De Feis, I., 2006. Dimension reduction in functional regression with applications. Computational Statistics and Data Analysis 50, 2422–2446.
Bura, E., 2003. Using linear smoothers to assess the structural dimension of regressions. Statistica Sinica 13, 143–162.
Bura, E., Cook, R.D., 2001a. Estimating the structural dimension of regressions via parametric inverse regression. Journal of the Royal Statistical Society. Series B 63, 393–410.
Bura, E., Cook, R.D., 2001b. Extending sliced inverse regression: the weighted chi-squared test. Journal of the American Statistical Association 96, 996–1003.
Cook, R.D., 1998. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York.
Cook, R.D., 2007. Fisher lecture: dimension reduction in regression. Statistical Science 22, 1–26.
Cook, R.D., Forzani, L., 2008. Principal fitted components for dimension reduction in regression. Statistical Science 23, 485–501.
Cook, R.D., Li, B., 2004. Determining the dimension of iterative Hessian transformation. The Annals of Statistics 32, 2501–2531.
Cook, R.D., Yin, X., 2001. Dimension reduction and visualization in discriminant analysis (with discussion). Australian & New Zealand Journal of Statistics 43, 147–199.
Eubank, R.L., 1999. Nonparametric Regression and Spline Smoothing, 2nd ed. Marcel Dekker, Inc.
Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York.
Ferré, L., 1998. Determining the dimension in sliced inverse regression and related methods. Journal of the American Statistical Association 93, 132–140.
Ferré, L., Yao, A., 2003. Functional sliced inverse regression analysis. Statistics 37, 475–488.
Ferré, L., Yao, A., 2005. Smoothed functional inverse regression. Statistica Sinica 15, 665–683.
Fung, W.K., He, X., Liu, L., Shi, P., 2002. Dimension reduction based on canonical correlation. Statistica Sinica 12, 1093–1113.
Härdle, W., 1990. Applied Nonparametric Regression. Cambridge University Press, New York.
Hsing, T., Carroll, R.J., 1992. An asymptotic theory of sliced inverse regression. The Annals of Statistics 20, 1040–1061.
Hsing, T., Ren, H., 2009. An RKHS formulation of the inverse regression dimension-reduction problem. The Annals of Statistics 37, 726–755.
Li, K.C., 1991. Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316–342.
Li, K.C., 1992. On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. Journal of the American Statistical Association 87, 1025–1039.
Li, L., Yin, X., 2008. Sliced inverse regression with regulations. Biometrics 64, 124–131.
Schott, J.R., 1994. Determining the dimensionality in sliced inverse regression. Journal of the American Statistical Association 89, 141–148.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.
Velilla, S., 1998. Assessing the number of linear components in a general regression problem. Journal of the American Statistical Association 93, 1088–1098.
Wang, H., Xia, Y., 2008. Sliced regression for dimension reduction. Journal of the American Statistical Association 103, 811–821.
Xia, Y., Tong, H., Li, W., Zhu, L., 2002. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B 64, 363–410.
Ye, Z., Weiss, R.E., 2003. Using the bootstrap to select one of a new class of dimension reduction methods. Journal of the American Statistical Association 98, 968–979.
Yin, X., Li, B., Cook, R.D., 2008. Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis 99, 1733–1757.
Zhu, L.X., Fang, K.T., 1996. Asymptotics for kernel estimates of sliced inverse regression. The Annals of Statistics 24, 1053–1068.
Zhu, L.X., Miao, B.Q., Peng, H., 2006. On sliced inverse regression with high-dimensional covariates. Journal of the American Statistical Association 101, 630–643.
Zhu, L.X., Ng, K.W., 1995. Asymptotics of sliced inverse regression. Statistica Sinica 5, 727–736.
Zhu, Y., Zeng, P., 2006. Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association 101, 1638–1651.