
Multi-task Quantile Regression under the Transnormal Model

Jianqing Fan, Lingzhou Xue and Hui Zou

Princeton University, Pennsylvania State University and University of Minnesota

Abstract

We consider estimating multi-task quantile regression under the transnormal model, with a focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose the rank-based $\ell_1$ penalization with a positive definite constraint for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of the alternating direction method of multipliers, a nearest correlation matrix projection is introduced that inherits the sampling properties of the unprojected matrix. Our work combines the strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality in high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves the "oracle"-like convergence rate, and provides provable prediction intervals under the high-dimensional setting. The finite-sample performance of the proposed method is also examined, and the performance of the proposed rank-based method is demonstrated in a real application to protein mass spectroscopy data.

Key Words: Copula model; Optimal transformation; Rank correlation; Cholesky decomposition; Quantile regression; Prediction interval; Alternating direction method of multipliers.

1 Introduction

Consider a multi-task high-dimensional learning paradigm in which independent variables $z = (z_1,\dots,z_{p_0})'$ are used to simultaneously predict multiple response variables $y = (y_1,\dots,y_{q_0})'$, where both dimensions $p_0$ and $q_0$ can be of a larger order of magnitude than the sample size $n$. In this work, we are interested in estimating optimal transformations $t = (t_1,\dots,t_{q_0})'$ such that $t(z) = (t_1(z),\dots,t_{q_0}(z))'$ optimally predicts $y$. Namely, we solve for the optimal transformations from
$$\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E[L(y_j - t_j(z))],$$
where $L(\cdot)$ is a convex loss function. If $z$ and $y$ have a joint normal distribution, it is appropriate to specify $L(\cdot)$ as the squared loss, i.e., $L(u) = u^2$. With the squared loss, normality makes a neat connection between optimal transformations and ordinary least squares: nice properties such as linearity and homoscedasticity hold for the optimal transformations, which can therefore be solved easily by ordinary least squares.

However, in real-world applications observed data are often skewed or heavy-tailed, and rarely normally distributed. Transformations, such as the celebrated Box-Cox transformation, are commonly used to achieve normality in regression analysis. Under the classical low-dimensional setting, estimating transformations for regression has received considerable attention in the statistical literature. On one hand, parametric methods focus on parametric families of transformations, for example Box & Tidwell (1962), Box & Cox (1964), and others. On the other hand, nonparametric estimation of regression transformations has also been studied, for instance projection pursuit regression (Friedman & Stuetzle 1981), alternating conditional expectation (Breiman & Friedman 1985), and additivity and variance stabilization (Stone 1985, Tibshirani 1988), among others. Although these methods work well under the classical setting, it is nontrivial to extend them to estimate optimal transformations in high dimensions, where such parametric and nonparametric methods suffer from the curse of dimensionality. There is therefore significant demand for relaxing normality when estimating optimal transformations for high-dimensional regression.

As a nice combination of flexibility and interpretability, Gaussian copulas have generated much interest in statistics and econometrics. The semiparametric Gaussian copula model is deemed a favorable alternative to the Gaussian model in several high-dimensional statistical problems, including linear discriminant analysis (Lin & Jeon 2003, Mai & Zou 2012), quadratic discriminant analysis (Fan et al. 2013), graphical modeling (Liu et al. 2009, 2012, Xue & Zou 2012), covariance matrix estimation (Xue & Zou 2014), and principal component analysis (Han & Liu 2012). The semiparametric Gaussian copula model provides a semiparametric generalization of the Gaussian model by assuming the existence of univariate monotone transformations $f = (f_1,\dots,f_p)$ for $x = (x_1,\dots,x_p)'$ such that $f(x) = (f_1(x_1),\dots,f_p(x_p))' \sim N_p(\mu^\star,\Sigma^\star)$. Throughout this paper, we follow Lin & Jeon (2003) in calling this copula model the transnormal model; it is also called the nonparanormal model in Liu et al. (2009). In this work, we show the power of the transnormal model in estimating optimal transformations for high-dimensional regression modeling.

We now suppose that $x = (z', y')'$ consists of $z = (x_1,\dots,x_{p_0})'$ and $y = (x_{p_0+1},\dots,x_{p_0+q_0})'$ with $p = p_0+q_0$. The transnormal model entails the existence of monotone transformations $f = (g,h) = (f_1,\dots,f_{p_0},f_{p_0+1},\dots,f_{p_0+q_0})$ such that
$$f(x) = \begin{pmatrix} g(z)\\ h(y)\end{pmatrix} \sim N_p\left(\mu^\star = \begin{pmatrix}\mu^\star_z\\ \mu^\star_y\end{pmatrix},\ \Sigma^\star = \begin{pmatrix}\Sigma^\star_{zz} & \Sigma^\star_{zy}\\ \Sigma^\star_{yz} & \Sigma^\star_{yy}\end{pmatrix}\right), \tag{1}$$
where we may assume that $\mu^\star = 0$ and $\Sigma^\star$ is a correlation matrix. Note that marginal normality is achieved by the transformations, so the transnormal model essentially assumes joint normality of the marginally normal-transformed variables. The transnormal model strikes a better balance between model robustness and interpretability than the normal model. In this work, our aim is to estimate transformations of the predictors, $t(z): \mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}$, that optimally predict the multiple response variables $y$ under the transnormal model (1).

Given the underlying transformations $g$ and $h$, we can easily derive the explicit conditional distribution of $h(y)$ given $g(z)$, namely
$$h(y) \mid g(z) \sim N_{q_0}\Big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z),\ \Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\Big). \tag{2}$$
However, since $g$ and $h$ are unknown, it is challenging to obtain the explicit conditional distribution of $y$ given $z$. Unlike the optimal transformations under the Gaussian model, nice properties such as linearity and homoscedasticity do not hold for the optimal transformations under the transnormal model. In the presence of the nonlinear transformations $g$ and $h$, it is important to take into account the coordinatewise nonlinearity of the conditional distributions of $y$ given $z$. We follow the spirit of quantile regression (Koenker 2005) to deal effectively with such nonlinearity.

Quantile regression was first introduced in the seminal paper of Koenker & Bassett (1978), and it has since received much attention in many areas such as survival analysis (Koenker & Geling 2001), time series analysis (Koenker & Xiao 2006), growth chart analysis (Wei & He 2006, Wei et al. 2006), microarray analysis (Wang & He 2007), and variable selection (Zou & Yuan 2008, Bradic et al. 2011), among others. Denote by $\rho_\tau(u) = u\,(\tau - I_{\{u\le 0\}})$ the check loss function (Koenker & Bassett 1978). Then we consider multi-task quantile regression:
$$\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E[\rho_\tau(y_j - t_j(z))]. \tag{3}$$

The use of the $\ell_1$ loss in prediction was recommended by Friedman (2001, 2002). Our work shares a similar philosophy with Friedman (2001, 2002) and includes the $\ell_1$ loss as a special case. In fact, problem (3) with $\tau = \frac12$ reduces to median regression, namely,
$$\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E\big[\rho_{\tau=\frac12}(y_j - t_j(z))\big] = \frac12 \min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} E\big[\|y - t(z)\|_{\ell_1}\big].$$
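For concreteness, here is a minimal Python sketch (ours, not part of the original methodology) of the check loss, verifying numerically that $\tau=\frac12$ recovers half the absolute loss:

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check loss: rho_tau(u) = u * (tau - 1{u <= 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u <= 0))

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# At tau = 1/2 the check loss equals half the absolute (ell_1) loss.
assert np.allclose(check_loss(u, 0.5), 0.5 * np.abs(u))
```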

In this paper, we show that, with the aid of rank-based covariance regularization, the optimal transformations in multi-task quantile regression can be efficiently estimated under the transnormal model. Although estimating optimal transformations is in general very difficult, we show, perhaps surprisingly, that the closed-form solution for the optimal transformations in (2) can be derived without any smoothing techniques as in Friedman & Stuetzle (1981), Breiman & Friedman (1985), Stone (1985) or Tibshirani (1988). The key ingredient of our proposed method is the positive definite regularized estimation of large covariance matrices under the transnormal model. We introduce two novel rank-based covariance regularization methods to deal with two popular covariance structures: the rank-based positive definite $\ell_1$ penalization for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded inverse covariance matrices. Our proposed rank-based covariance regularization critically depends on a correlation matrix that retains the desired sampling properties of the adjusted Spearman's or Kendall's rank correlation matrix (Kendall 1948, Liu et al. 2012, Xue & Zou 2012).

The aforementioned rank correlation matrix is not necessarily positive definite (Devlin et al. 1975). We therefore propose a new nearest correlation matrix projection that inherits the required sampling properties of the adjusted Spearman's or Kendall's rank correlation matrix, and that can be solved by an efficient alternating direction method of multipliers. By combining the strengths of quantile regression modeling and rank-based covariance regularization, we can simultaneously address the issues of nonlinearity, non-normality and high dimensionality in estimating optimal transformations for high-dimensional regression. In particular, our proposed method achieves the "oracle"-like convergence rate and provides a provable prediction interval under the high-dimensional setting where the dimension is of a nearly exponential order of the sample size.

The rest of this paper is organized as follows. Section 2 presents the methodological details of optimal prediction in multi-task quantile regression. Section 3 establishes the theoretical properties of our proposed method under the transnormal model. Section 4 contains simulation results and a real application to the protein mass spectroscopy data. Technical proofs are presented in the appendices.

2 Multi-task quantile regression: model and method

This section presents the complete methodological details for solving the optimal predictions in multi-task quantile regression, i.e., $\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E[\rho_\tau(y_j-t_j(z))]$, where $y$ and $z$ jointly follow a transnormal model. The transnormal family of distributions allows us to obtain the closed-form solution of the optimal predictions for multi-task quantile regression, which is very appealing and powerful for simultaneously dealing with non-normality and high dimensionality. The transnormal model retains the nice interpretation of the normal model and enables us to make good use of normal theory. Moreover, the monotone transformations are easy to handle for quantile estimation.

2.1 The closed-form solution

Let $Q_\tau(y_j|z)$ be the $\tau$-th quantile of the conditional distribution of $y_j$ given $z$, which is the analytical solution to $\min_{t_j:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}} E[\rho_\tau(y_j-t_j(z))]$. Denote by $Q_\tau(y|z)$ the $\tau$-th equicoordinate quantile of the conditional distribution of $y$ given $z$, namely $Q_\tau(y|z) = (Q_\tau(y_1|z),\dots,Q_\tau(y_{q_0}|z))'$. Thus the $\tau$-th equicoordinate conditional quantile $Q_\tau(y|z)$ is the exact solution to (3), i.e.,
$$Q_\tau(y|z) = \arg\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E[\rho_\tau(y_j - t_j(z))].$$
Using the fact that $h$ and $g$ are monotone under the transnormal model, we have
$$Q_\tau(y|z) = Q_\tau(y|g(z)) = h^{-1}\big(Q_\tau(h(y)|g(z))\big).$$
Since $h$ is monotonically nondecreasing, it now follows from (2) that
$$Q_\tau(y|z) = h^{-1}\Big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}^{1/2}\big(\Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\big)\Big), \tag{4}$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution and $\mathrm{vdiag}^{1/2}(A)$ denotes the vector formed by the square roots of the diagonal elements of $A$. Therefore, the semiparametric Gaussian copula model enables us to obtain the closed-form solution to multi-task quantile regression.

Moreover, the closed-form solution (4) can be used to construct prediction intervals for predicting $y$ given $z$ in high dimensions. To be more specific, we can obtain the closed-form $100(1-\tau)\%$ prediction interval as
$$\big[Q_{\tau/2}(y|z),\ Q_{1-\tau/2}(y|z)\big].$$
For different values of $\tau$, we only need to adjust the value $\Phi^{-1}(\tau)$. This is also an appealing feature of the transnormal model.

The closed-form solution $Q_\tau(y|z)$ uses the true covariance matrix and transformations, and thus it is not a feasible estimator. To utilize (4), we need to estimate the covariance matrix $\Sigma^\star$ and the transformation functions $f$. It turns out that both tasks are relatively easy under the transnormal model. Section 2.2 describes how to estimate the transformation functions $f$, and Section 2.3 shows how to estimate a structured covariance matrix under the high-dimensional setting. With the estimated transformations $\hat f = (\hat g, \hat h)$ and a structured covariance matrix estimator $\hat\Sigma$, we can derive the following plug-in estimator:
$$\hat Q_\tau(y|z) = \hat h^{-1}\Big(\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}^{1/2}\big(\hat\Sigma_{yy}-\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}\big)\Big). \tag{5}$$
Given the plug-in estimator (5), we further estimate the $100(1-\tau)\%$ prediction interval as $\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$.
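For illustration, the following Python sketch (the function name and arguments are ours, introduced only for this example) evaluates the plug-in estimator (5); calling it at $\tau/2$ and $1-\tau/2$ gives the two endpoints of the estimated prediction interval.

```python
import numpy as np
from scipy.stats import norm

def plugin_quantile(Sigma_hat, p0, g_z, h_inv, tau):
    """Plug-in estimator (5): estimated tau-th conditional quantile of y
    given z under the transnormal model. Sigma_hat is the estimated
    (p0+q0) x (p0+q0) correlation matrix with the z-block first; g_z is
    the estimated g(z); h_inv applies the coordinatewise inverses of h."""
    Szz = Sigma_hat[:p0, :p0]
    Szy = Sigma_hat[:p0, p0:]
    Syz = Sigma_hat[p0:, :p0]
    Syy = Sigma_hat[p0:, p0:]
    cond_mean = Syz @ np.linalg.solve(Szz, g_z)
    cond_cov = Syy - Syz @ np.linalg.solve(Szz, Szy)
    L_hat = cond_mean + norm.ppf(tau) * np.sqrt(np.diag(cond_cov))
    return h_inv(L_hat)
```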

Remark 1. When $\tau = \frac12$, the problem reduces to estimating optimal transformations from multi-task median regression. Using the simple fact that $\Phi^{-1}(\frac12) = 0$, the solution can be further simplified as
$$Q_{\tau=\frac12}(y|z) = h^{-1}\big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)\big).$$

Remark 2. Compared to multi-task mean regression, multi-task quantile regression (including median regression) is much more robust against outliers in the measurements, in addition to delivering the closed-form solution (4) under the transnormal model. In contrast, solving ordinary least squares, multi-task mean regression yields
$$E(y|z) = E(y|g(z)) = E\big(h^{-1}(h(y))\,\big|\,g(z)\big),$$
which, unlike the quantile regression, cannot be simplified further unless $h$ is linear.

Remark 3. The nonlinearity of the transformations $h(\cdot)$ in (4) makes the difference of conditional quantiles at different values of $\tau$ depend on $z$, and thus models the effect of heteroscedasticity. In contrast, Wu et al. (2010), Zhu et al. (2012) and Fan et al. (2013) model the heteroscedastic effect using single-index or semiparametric quantile regression by imposing the model $y = \mu(z'\beta) + \sigma(z'\beta)\cdot\varepsilon$, where $\sigma(z'\beta)\cdot\varepsilon$ is a heteroscedastic error and $\varepsilon$ is independent of $z$. Unlike the closed-form expression (4), it is difficult to employ semiparametric quantile regression to simultaneously deal with nonlinearity and high dimensionality.

2.2 Estimation of transformation

Note that $f_j(x_j) \sim N(0,1)$ under the transnormal model for every $j$. Hence the cumulative distribution function of $x_j$ admits the form
$$F_j(x_j) = \Phi(f_j(x_j)), \quad\text{or equivalently}\quad f_j(x_j) = \Phi^{-1}(F_j(x_j)). \tag{6}$$
Equation (6) suggests a simple estimator of the transformation function $f_j$. Let $\tilde F_j(\cdot)$ be the empirical estimator of $F_j(\cdot)$, i.e., $\tilde F_j(u) = \frac1n\sum_{i=1}^n I_{\{x_{ij}\le u\}}$. Define the Winsorized empirical distribution function $\hat F_j(\cdot)$ as
$$\hat F_j(u) = \delta_n\, I_{\{\tilde F_j(u)<\delta_n\}} + \tilde F_j(u)\, I_{\{\delta_n\le\tilde F_j(u)\le 1-\delta_n\}} + (1-\delta_n)\, I_{\{\tilde F_j(u)>1-\delta_n\}}, \tag{7}$$
where $\delta_n$ is the Winsorization parameter, introduced to avoid infinite values and to achieve a better bias-variance tradeoff. Following Mai & Zou (2012), we specify the Winsorization parameter as $\delta_n = \frac{1}{n^2}$, which facilitates both the theoretical analysis and the practical performance. We then estimate the transformation functions $f$ in the transnormal model as
$$\hat f = (\hat f_1,\dots,\hat f_p) = (\Phi^{-1}\circ\hat F_1,\dots,\Phi^{-1}\circ\hat F_p).$$
Note that these estimators are nondecreasing and can be substituted into (4) to estimate the multi-task quantile regression function.
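A minimal Python sketch of this transformation estimator, assuming the data matrix has observations in rows (the helper name is ours):

```python
import numpy as np
from scipy.stats import norm, rankdata

def estimate_transforms(X):
    """Estimate f_j(x_ij) = Phi^{-1}(F_j(x_ij)) columnwise, using the
    Winsorized empirical CDF (7) with delta_n = 1/n^2 (Mai & Zou 2012)."""
    n = X.shape[0]
    delta = 1.0 / n ** 2
    F_tilde = rankdata(X, method="max", axis=0) / n   # empirical CDF at data
    F_hat = np.clip(F_tilde, delta, 1.0 - delta)      # Winsorization (7)
    return norm.ppf(F_hat)                            # transformed data
```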

2.3 Estimation of correlation matrix

We present two covariance regularization methods based on i.i.d. transnormal data $x_1,\dots,x_n$: the rank-based positive definite $\ell_1$ penalization for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. Both estimators achieve the critical positive definiteness, and both can be used in (5) to estimate the optimal transformations in multi-task quantile regression.

2.3.1 Positive definite sparse correlation matrix

Sparse covariance matrices are widely used in applications where the variables are permutation invariant. By truncating small entries to zero, thresholding (Bickel & Levina 2008b, Rothman et al. 2009, Fan et al. 2013) is a powerful approach to encourage (conditional) sparsity in estimating large covariance matrices. However, the resulting estimator may not be positive definite in practice. Xue et al. (2012) proposed a computationally efficient positive-definite $\ell_1$-penalized covariance estimator to address the indefiniteness issue of thresholding.

Now we extend Xue et al. (2012) to the transnormal model. First, we introduce the "oracle" positive-definite $\ell_1$-penalized estimator based on the "oracle" transformations $f$:
$$\hat\Sigma^o_{\ell_1} = \arg\min_{\Sigma}\ \frac12\|\Sigma-\hat R^o\|_F^2 + \lambda\|\Sigma\|_{1,\mathrm{off}} \quad\text{subject to}\quad \mathrm{diag}(\Sigma)=1;\ \Sigma\succeq\varepsilon I,$$
where $\hat R^o$ is the "oracle" sample correlation matrix of the "oracle" data $f(x_1),\dots,f(x_n)$, and $\|\Sigma\|_{1,\mathrm{off}}$ is the $\ell_1$ norm of all off-diagonal elements of $\Sigma$.

Motivated by $\hat\Sigma^o_{\ell_1}$, we can use the adjusted Spearman's or Kendall's rank correlation matrix (Kendall 1948) to derive a correlation matrix that is comparable with $\hat R^o$. For ease of presentation, we focus on the adjusted Spearman's rank correlation matrix throughout this paper; the same analysis can be adapted to the adjusted Kendall's rank correlation matrix. Let $r_j = (r_{1j},r_{2j},\dots,r_{nj})'$ be the ranks of $(x_{1j},x_{2j},\dots,x_{nj})'$. Denote by $\tilde r_{jl} = \mathrm{corr}(r_j,r_l)$ the Spearman's rank correlation, and by $\hat r^s_{jl} = 2\sin(\frac{\pi}{6}\tilde r_{jl})$ the adjusted Spearman's rank correlation. It is well known that $\hat r^s_{jl}$ corrects the bias of $\tilde r_{jl}$ (Kendall 1948). We now consider the rank correlation matrix $\hat R^s = (\hat r^s_{jl})_{p\times p}$, which does not require estimating any transformation function. The rank-based positive definite $\ell_1$ penalization is then
$$\hat\Sigma^s_{\ell_1} = \arg\min_{\Sigma}\ \frac12\|\Sigma-\hat R^s\|_F^2 + \lambda\|\Sigma\|_{1,\mathrm{off}} \quad\text{subject to}\quad \mathrm{diag}(\Sigma)=1;\ \Sigma\succeq\varepsilon I.$$
Note that $\hat\Sigma^s_{\ell_1}$ guarantees the critical positive definiteness, and it can be efficiently solved by the alternating direction method of multipliers of Xue et al. (2012).
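The adjusted Spearman's rank correlation matrix is simple to compute; the following Python sketch (helper name ours) does so columnwise:

```python
import numpy as np
from scipy.stats import rankdata

def adjusted_spearman(X):
    """Adjusted Spearman rank correlation matrix: 2 sin(pi/6 * r_jl)."""
    ranks = rankdata(X, axis=0)            # columnwise ranks r_j
    r = np.corrcoef(ranks, rowvar=False)   # Spearman correlations r_jl
    Rs = 2.0 * np.sin(np.pi / 6.0 * r)
    np.fill_diagonal(Rs, 1.0)              # 2 sin(pi/6) = 1 on the diagonal
    return Rs
```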


2.3.2 Banded Cholesky decomposition regularization

When an ordering exists among the variables, the bandable structure is commonly used in estimating large covariance matrices. Banding (Bickel & Levina 2008a) and tapering (Cai et al. 2010) were proposed to estimate bandable covariance matrices, but they carry no guarantee of positive definiteness in practice. With its appealing positive definiteness, banded Cholesky decomposition regularization has received much attention, for example in Wu & Pourahmadi (2003), Huang et al. (2006), Bickel & Levina (2008a), Levina et al. (2008) and Rothman et al. (2010). In the sequel we propose the rank-based banded Cholesky decomposition regularization under the transnormal model.

First we introduce the "oracle" estimator to motivate our proposal. Using the "oracle" data $f(x_1),\dots,f(x_n)$, the "oracle" banded Cholesky decomposition regularization estimates the covariance matrix $\Sigma^\star$ by banding the Cholesky factor of its inverse $\Theta^\star$. Suppose that $\Theta^\star$ has the Cholesky decomposition $\Theta^\star = (I-A)'D^{-1}(I-A)$, where $D$ is a $p\times p$ diagonal matrix and $A = (a_{jl})_{p\times p}$ is a lower triangular matrix with $a_{11}=\cdots=a_{pp}=0$. Due to the fact that $f(x)\sim N_p(0,\Sigma^\star)$, it is easy to obtain that $(I-A)\cdot f(x) \sim N_p(0,D)$. Let $A = (a_1,\dots,a_p)'$. As in Bickel & Levina (2008a), the "oracle" estimator is derived by regressing $f_j(x_j)$ on its closest $\min\{k, j-1\}$ predecessors, i.e., $\hat a^o_1 = 0$ and, for $j=2,\dots,p$,
$$\hat a^o_j = \arg\min_{a_j\in\mathcal{A}_j(k)}\ \frac1n\sum_{i=1}^n\big(f_j(x_{ij}) - a_j' f(x_i)\big)^2, \tag{8}$$
where $\mathcal{A}_j(k) = \{(\alpha_1,\dots,\alpha_p)': \alpha_l = 0 \text{ if } l<j-k \text{ or } l\ge j\}$. Then $A$ is estimated by the $k$-banded lower triangular matrix $\hat A^o = (\hat a^o_1,\dots,\hat a^o_p)'$, and $D$ is estimated by the diagonal matrix $\hat D^o = \mathrm{diag}(\hat d^o_1,\dots,\hat d^o_p)$ with $\hat d^o_j$ being the residual variance, i.e.,
$$\hat d^o_j = \frac1n\sum_{i=1}^n\big(f_j(x_{ij}) - (\hat a^o_j)' f(x_i)\big)^2. \tag{9}$$
Therefore, the "oracle" estimator ends up with the positive-definite estimator
$$\hat\Sigma^o_{\mathrm{chol}} = (I-\hat A^o)^{-1}\hat D^o\big[(I-\hat A^o)'\big]^{-1},$$
which has the $k$-banded precision matrix $\hat\Theta^o_{\mathrm{chol}} = (I-\hat A^o)'(\hat D^o)^{-1}(I-\hat A^o)$.

To mimic this "oracle" estimator, the key observation is that the "oracle" sample correlation matrix $\hat R^o = \frac1n\sum_{i=1}^n f(x_i)(f(x_i))'$ plays the central role. To see this point, notice that estimating $\hat a^o_j$ and $\hat d^o_j$ depends only on the quadratic term
$$\frac1n\sum_{i=1}^n\big(f_j(x_{ij}) - a_j' f(x_i)\big)^2 = a_j'\hat R^o a_j - 2a_j'\hat r^o_j + \hat r^o_{jj}, \tag{10}$$
where $\hat R^o = (\hat r^o_{jl})_{p\times p}$ and $\hat r^o_j = (\hat r^o_{j1},\dots,\hat r^o_{jp})'$ is its $j$-th row. Thus, we only need a positive definite correlation matrix estimator that is comparable with $\hat R^o$. The adjusted Spearman's or Kendall's rank correlation matrix achieves the "oracle"-like exponential rate of convergence, but it is not guaranteed to be positive definite (Devlin et al. 1975). We employ the nearest correlation matrix projection, which inherits the desired sampling properties of the adjusted Spearman's or Kendall's rank correlation matrix, namely
$$\bar R^s = \arg\min_{R}\ \|R-\hat R^s\|_{\max} \quad\text{subject to}\quad R\succeq\varepsilon I;\ \mathrm{diag}(R)=1, \tag{11}$$
where $\|\cdot\|_{\max}$ is the entrywise $\ell_\infty$ norm and $\varepsilon>0$ is an arbitrarily small constant satisfying $\lambda_{\min}(\Sigma^\star)\ge\varepsilon$, say $\varepsilon = 10^{-4}$. Zhao et al. (2012) considered a related matrix projection, but they used the smooth surrogate function $\|R-\hat R^s\|^\nu_{\max} = \max_{\|U\|_1\le 1}\langle U, R-\hat R^s\rangle - \frac{\nu}{2}\|U\|_F^2$, which inevitably introduces unnecessary approximation error. Qi & Sun (2006) solved a related nearest correlation matrix projection under the Frobenius norm. We use the entrywise $\ell_\infty$ norm for theoretical considerations, as we now demonstrate. Notice that $\Sigma^\star$ is a feasible solution to the nearest correlation matrix projection (11). Hence $\|\bar R^s - \hat R^s\|_{\max} \le \|\Sigma^\star - \hat R^s\|_{\max}$ holds by definition. By the triangle inequality, $\bar R^s$ retains almost the same sampling properties as $\hat R^s$ in terms of the estimation bound, i.e.,
$$\|\bar R^s - \Sigma^\star\|_{\max} \le \|\bar R^s - \hat R^s\|_{\max} + \|\hat R^s - \Sigma^\star\|_{\max} \le 2\|\hat R^s - \Sigma^\star\|_{\max}.$$
The details of the nearest correlation matrix projection are presented in Appendix A.

Now we propose a feasible regularized rank estimator based on $\bar R^s$. Let $\bar R^s = (\bar r^s_{jl})_{p\times p}$ and $\bar r^s_j = (\bar r^s_{j1},\dots,\bar r^s_{jp})'$. Substituting $\bar R^s$ into the quadratic term (10) gives
$$a_j'\bar R^s a_j - 2a_j'\bar r^s_j + \bar r^s_{jj} = a_j'\bar R^s a_j - 2a_j'\bar r^s_j + 1,$$
where we use the fact that $\bar r^s_{jj} = 1$. Accordingly, we estimate $A$ by
$$\hat A^s = (\hat a^s_1,\dots,\hat a^s_p)',$$
with $\hat a^s_1 = 0$ and, for $j = 2,\dots,p$,
$$\hat a^s_j = \arg\min_{a_j\in\mathcal{A}_j(k)}\ a_j'\bar R^s a_j - 2a_j'\bar r^s_j + 1,$$
where $\mathcal{A}_j(k) = \{(\alpha_1,\dots,\alpha_p)': \alpha_l = 0 \text{ if } l<j-k \text{ or } l\ge j\}$. In addition, we estimate $D$ by
$$\hat D^s = \mathrm{diag}(\hat d^s_1,\dots,\hat d^s_p),$$
with
$$\hat d^s_j = (\hat a^s_j)'\bar R^s\hat a^s_j - 2(\hat a^s_j)'\bar r^s_j + 1.$$
Thus, the rank-based banded Cholesky decomposition regularization yields the estimator
$$\hat\Sigma^s_{\mathrm{chol}} = (I-\hat A^s)^{-1}\hat D^s\big[(I-\hat A^s)'\big]^{-1},$$
which has the $k$-banded precision matrix $\hat\Theta^s_{\mathrm{chol}} = (I-\hat A^s)'(\hat D^s)^{-1}(I-\hat A^s)$.

3 Theoretical properties

This section presents the theoretical properties of our proposed methods. We use several matrix norms: the matrix $\ell_1$ norm $\|U\|_{\ell_1} = \max_j\sum_i|u_{ij}|$, the spectral norm $\|U\|_{\ell_2} = \lambda^{1/2}_{\max}(U'U)$, and the matrix $\ell_\infty$ norm $\|U\|_{\ell_\infty} = \max_i\sum_j|u_{ij}|$. For a symmetric matrix, the matrix $\ell_1$ norm coincides with the matrix $\ell_\infty$ norm. We use $c$ or $C$ to denote generic constants that do not depend on $n$ or $p$. Throughout this section, we follow Bickel & Levina (2008a,b) in assuming that
$$\varepsilon_0 \le \lambda_{\min}(\Sigma^\star) \le \lambda_{\max}(\Sigma^\star) \le \frac{1}{\varepsilon_0}.$$

3.1 A general theory

First of all, we consider any regularized estimator $\hat\Sigma$ satisfying the following condition:

(C1) There exists a regularized estimator $\hat\Sigma$ satisfying the concentration bound $\|\hat\Sigma-\Sigma^\star\|_{\ell_1} = O_P(\xi_{n,p})$, with $\xi_{n,p} = o\big((\log n)^{-1/2}\big)$.

To simplify notation, we partition $\hat\Sigma = \begin{pmatrix}\hat\Sigma_{zz} & \hat\Sigma_{zy}\\ \hat\Sigma_{yz} & \hat\Sigma_{yy}\end{pmatrix}$. In the following theorem, we show that the conditional distribution of $h(y)$ given $g(z)$ can be well estimated under Condition (C1).

Theorem 1. Assume the data follow the transnormal model. Suppose that there is $\kappa\in(0,1)$ such that $n^\kappa \gg \log n\,\log p_0$. Given a regularized estimator satisfying Condition (C1), we have the following error bounds concerning the estimated conditional distribution of $h(y)$ given $g(z)$ in (2):

(i) (on the conditional mean vector)
$$\big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)\big\|_{\max} = O_P\left(\sqrt{\frac{\log n\log p_0}{n^\kappa}} + \xi_{n,p}\sqrt{\log n}\right);$$

(ii) (on the conditional variance matrix)
$$\big\|(\hat\Sigma_{yy}-\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}) - (\Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2} = O_P(\xi_{n,p}).$$

Next, we show that the plug-in estimator $\hat Q_\tau(y|z)$ is asymptotically as good as the closed-form solution $Q_\tau(y|z)$ under mild regularity conditions. Let $\psi_j(\cdot)$ be the probability density function of $x_j$. Define $L = \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}^{1/2}\big(\Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\big)$ and $\hat L = \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}^{1/2}\big(\hat\Sigma_{yy}-\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}\big)$. Denote $L = (L_1,\dots,L_{q_0})'$.

Theorem 2. Under the conditions of Theorem 1, we have the following error bound for the plug-in estimator in (5):
$$\|\hat Q_\tau(y|z)-Q_\tau(y|z)\|_{\max} = O_P\left(\frac1M\sqrt{\frac{\log q_0}{n}} + \frac1M\sqrt{\frac{\log n\log p_0}{n^\kappa}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\right), \tag{12}$$
where $M$ is the minimum of $\psi_{p_0+j}(x)$ over $x\in I_j = \big[-|2L_j|,\,|2L_j|\big]\cup\big[-f^{-1}_{p_0+j}(|2L_j|),\, f^{-1}_{p_0+j}(|2L_j|)\big]$ for any $j = 1,\dots,q_0$, i.e., $M = \min_{j=1,\dots,q_0}\min_{x\in I_j}\psi_{p_0+j}(x)$.

Theorem 2 immediately implies that the plug-in estimator can be used to construct provable prediction intervals in high dimensions. For instance, we can use $\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$ to construct the $100(1-\tau)\%$ prediction interval for predicting $y$ given $z$.

In what follows, we consider two parameter spaces for $\Sigma^\star$ in the transnormal model:
$$\mathcal{G}_q = \Big\{\Sigma: \max_j\sum_{i\ne j}|\sigma_{ij}|^q \le s_0\Big\} \quad\text{and}\quad \mathcal{H}_\alpha = \Big\{\Sigma: \max_j\sum_{j<i-k}|a_{ij}| \le c_0 k^{-\alpha}\ \text{for all } k\Big\},$$
where the $a_{ij}$ are the entries of the Cholesky factor $A$. These two parameter spaces were studied in Bickel & Levina (2008a,b) and Cai et al. (2010). In the sequel we show that the proposed rank-based covariance regularization achieves the "oracle"-like rate of convergence over $\mathcal{G}_q$ and over $\mathcal{H}_\alpha$ under the matrix $\ell_1$ norm. We can then obtain the optimal prediction results on the basis of these intermediate theoretical results about rank-based covariance regularization.

3.2 Positive definite sparse correlation matrix

First we derive the estimation bound for the rank-based positive-definite $\ell_1$ penalization and its application to estimating the conditional distribution of $h(y)$ given $g(z)$.

Theorem 3. Assume the data follow the transnormal model with $\Sigma^\star\in\mathcal{G}_q$, and also assume that $n \gg \log p$. Suppose that $\lambda_{\min}(\Sigma^\star) \ge \varepsilon_0 \gg s_0(\log p/n)^{\frac{1-q}{2}} + \varepsilon$. Then, with probability tending to one, the rank-based positive-definite $\ell_1$ penalization with $\lambda = c(\log p/n)^{1/2}$ achieves the following upper bound under the matrix $\ell_1$ norm:
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\big\|\hat\Sigma^s_{\ell_1}-\Sigma^\star\big\|_{\ell_1} \le C\, s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.$$

Corollary 1. Under the conditions of Theorems 1, 2 and 3, with $\hat\Sigma = \hat\Sigma^s_{\ell_1}$ we have
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z)-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)\big\|_{\max} = O_P\left(\sqrt{\frac{\log n\log p_0}{n^\kappa}} + s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}\sqrt{\log n}\right)$$
and
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\big\|(\hat\Sigma_{yy}-\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy})-(\Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2} = O_P\left(s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}\right).$$

In light of Theorem 3 and Corollary 1, we can derive the convergence rate for the proposed optimal prediction in the following theorem.

Theorem 4. Under the same conditions as Theorems 1, 2 and 3, with $\hat\Sigma = \hat\Sigma^s_{\ell_1}$ we have
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\|\hat Q_\tau(y|z)-Q_\tau(y|z)\|_{\max} = O_P\left(\frac1M\sqrt{\frac{\log q_0}{n}} + \frac1M\sqrt{\frac{\log n\log p_0}{n^\kappa}} + \frac{s_0}{M}\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}\sqrt{\log n}\right).$$

3.3 Banded Cholesky decomposition regularization

Next we derive the estimation bound for the rank-based banded Cholesky decomposition regularization, with applications to estimating the conditional distribution of $h(y)$ given $g(z)$.

Theorem 5. Assume the data follow the transnormal model with $\Sigma^\star\in\mathcal{H}_\alpha$, and also assume that $n \gg k^2\log p$. With probability tending to one, the rank-based banded Cholesky decomposition regularization achieves the following upper bound under the matrix $\ell_1$ norm:
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|\hat\Sigma^s_{\mathrm{chol}}-\Sigma^\star\big\|_{\ell_1} \le Ck\Big(\frac{\log p}{n}\Big)^{1/2} + Ck^{-\alpha}.$$
If $k = c\big(\frac{\log p}{n}\big)^{-\frac{1}{2(\alpha+1)}}$ in the rank-based banded Cholesky decomposition regularization, we have
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|\hat\Sigma^s_{\mathrm{chol}}-\Sigma^\star\big\|_{\ell_1} = O_P\left(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\right).$$

Remark 5. Bickel & Levina (2008a) studied the banded Cholesky decomposition regularization under the normal model. Their analysis directly applies to the "oracle" banded Cholesky decomposition regularization. By Theorem 3 of Bickel & Levina (2008a), we have
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|\hat\Sigma^o_{\mathrm{chol}}-\Sigma^\star\big\|_{\ell_1} = O_P\left(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\right).$$
Therefore, the rank-based banded Cholesky decomposition regularization achieves the same convergence rate as its "oracle" counterpart.

Corollary 2. Under the conditions of Theorems 1, 2 and 5, with $\hat\Sigma = \hat\Sigma^s_{\mathrm{chol}}$ we have
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z)-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)\big\|_{\max} = O_P\left(\sqrt{\frac{\log n\log p_0}{n^\kappa}} + \Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\sqrt{\log n}\right)$$
and
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|(\hat\Sigma_{yy}-\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy})-(\Sigma^\star_{yy}-\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2} = O_P\left(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\right).$$

In light of Theorem 5 and Corollary 2, we can derive the convergence rate for the proposed optimal prediction in the following theorem.

Theorem 6. Under the conditions of Theorems 1, 2 and 5, with $\hat\Sigma = \hat\Sigma^s_{\mathrm{chol}}$ we have
$$\sup_{\Sigma^\star\in\mathcal{H}_\alpha}\big\|\hat Q_\tau(y|z)-Q_\tau(y|z)\big\|_{\max} = O_P\left(\frac1M\sqrt{\frac{\log q_0}{n}} + \frac1M\sqrt{\frac{\log n\log p_0}{n^\kappa}} + \frac{\sqrt{\log n}}{M}\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\right).$$

4 Numerical properties

This section examines the finite-sample performance of the methods proposed in Sections 2 and 3. For space considerations, we focus in our numerical studies only on the rank-based banded Cholesky decomposition regularization and its application to multi-task median regression.

4.1 Simulation studies

In this simulation study, we numerically investigate the "oracle" estimator, the proposed rank-based estimator, and the "naïve" estimator. Notation and details of the three regularized estimators are summarized in Table 1. The "oracle" estimator serves as a benchmark in the numerical comparison, and the "naïve" estimator directly regularizes the covariance of the original data.

Table 1: The three regularized estimators in the simulation study.

  Method                                                         Details
  $(\hat\Sigma^o_{chol},\ \hat\Theta^o_{chol},\ \hat t^o_1(z))$   regularizes the "oracle" sample correlation matrix
  $(\hat\Sigma^s_{chol},\ \hat\Theta^s_{chol},\ \hat t^s_1(z))$   regularizes the adjusted Spearman's rank correlation matrix
  $(\hat\Sigma^n_{chol},\ \hat\Theta^n_{chol},\ \hat t^n_1(z))$   regularizes the usual sample correlation matrix

In Models 1-3, we consider three different designs for the inverse covariance matrix $\Omega^\star$:

Model 1: $\Omega^\star = (I-A)'D^{-1}(I-A)$ with $d_{ii} = 0.01$, $a_{i+1,i} = 0.8$, and $d_{ij} = a_{ij} = 0$ otherwise;

Model 2: $\omega^\star_{ii} = 1$, $\omega^\star_{i,i\pm1} = 0.5$, $\omega^\star_{i,i\pm2} = 0.25$, and $\omega^\star_{ij} = 0$ otherwise;

Model 3: $\omega^\star_{ii} = 1$, $\omega^\star_{i,i\pm1} = 0.4$, $\omega^\star_{i,i\pm2} = \omega^\star_{i,i\pm3} = 0.2$, $\omega^\star_{i,i\pm4} = 0.1$, and $\omega^\star_{ij} = 0$ otherwise.

Models 1-3 are autoregressive (AR) models widely used in time series analysis; they were considered by Huang et al. (2006) and Xue & Zou (2012). Based on the covariance matrix $\Gamma^\star = (\Omega^\star)^{-1} = (\gamma^\star_{ij})_{p\times p}$, we calculate the true correlation matrix $\Sigma^\star = (\sigma^\star_{ij})_{p\times p} = \big(\gamma^\star_{ij}/\sqrt{\gamma^\star_{ii}\gamma^\star_{jj}}\big)_{p\times p}$ and also its inverse $\Theta^\star = (\Sigma^\star)^{-1}$. We then generate $n$ independent normal observations from $N_p(0,\Sigma^\star)$ and transform them to the desired transnormal data $(x_1,\dots,x_n)$ using the transformation function
$$[f_1^{-1},\ f_2^{-1},\ f_3^{-1},\ f_4^{-1},\ f_1^{-1},\ f_2^{-1},\ f_3^{-1},\ f_4^{-1},\ \dots], \tag{13}$$
where four monotone transformations are considered: $f_1(x) = x^{1/3}$ (power transformation), $f_2(x) = \log(\frac{x}{1-x})$ (logit transformation), $f_3(x) = \log(x)$ (logarithmic transformation), and $f_4(x) = f_1(x)I_{\{x<-1\}} + f_2(x)I_{\{-1\le x\le 1\}} + (f_3(x-1)+1)I_{\{x>1\}}$.

In all cases we set $n = 200$ and $p = 100$, $200$ and $500$. For each estimator, the tuning parameter is chosen by cross-validation. Estimation accuracy is measured by the matrix $\ell_1$ and $\ell_2$ norms, averaged over 100 independent replications.
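As a minimal illustration of this data-generating process (a sketch of ours; for brevity it cycles through only the first three inverse transformations of (13)):

```python
import numpy as np

def make_transnormal_model1(n, p, rng):
    """Generate transnormal data under Model 1, cycling through the
    inverses of f1, f2 and f3 from (13)."""
    A = np.diag(0.8 * np.ones(p - 1), k=-1)   # a_{i+1,i} = 0.8
    IA = np.eye(p) - A
    Omega = IA.T @ IA / 0.01                  # (I-A)' D^{-1} (I-A), d_ii = 0.01
    Gamma = np.linalg.inv(Omega)
    s = np.sqrt(np.diag(Gamma))
    Sigma = Gamma / np.outer(s, s)            # true correlation matrix Sigma*
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    f_inv = (lambda z: z ** 3,                # inverse of f1(x) = x^{1/3}
             lambda z: 1 / (1 + np.exp(-z)),  # inverse of the logit f2
             np.exp)                          # inverse of f3(x) = log(x)
    X = np.column_stack([f_inv[j % 3](Z[:, j]) for j in range(p)])
    return X, Sigma

X, Sigma_star = make_transnormal_model1(200, 100, np.random.default_rng(0))
```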

Table 2: Performance of estimating the correlation matrix with $\hat\Sigma^o_{chol}$, $\hat\Sigma^s_{chol}$ and $\hat\Sigma^n_{chol}$. Estimation accuracy is measured by the matrix $\ell_1$ and $\ell_2$ norms; each entry is averaged over 100 independent replications, with standard errors in parentheses.

                               Model 1                                    Model 2                                    Model 3
  Method             p=100        p=200        p=500         p=100        p=200        p=500         p=100        p=200        p=500
  matrix $\ell_1$ norm
  $\hat\Sigma^o_{chol}$  1.34 (0.05)  1.60 (0.13)  1.95 (0.27)   0.89 (0.03)  0.91 (0.04)  0.99 (0.08)   1.00 (0.02)  1.04 (0.05)  1.16 (0.12)
  $\hat\Sigma^s_{chol}$  1.49 (0.07)  1.68 (0.11)  2.08 (0.21)   0.92 (0.03)  0.99 (0.05)  1.10 (0.14)   1.03 (0.02)  1.06 (0.04)  1.21 (0.12)
  $\hat\Sigma^n_{chol}$  3.21 (0.06)  3.49 (0.09)  3.78 (0.24)   1.45 (0.02)  1.48 (0.04)  1.63 (0.11)   1.11 (0.03)  1.22 (0.05)  1.28 (0.11)
  matrix $\ell_2$ norm
  $\hat\Sigma^o_{chol}$  0.78 (0.03)  0.95 (0.06)  1.10 (0.13)   0.48 (0.01)  0.49 (0.02)  0.54 (0.04)   0.52 (0.01)  0.54 (0.02)  0.59 (0.05)
  $\hat\Sigma^s_{chol}$  0.84 (0.03)  1.00 (0.06)  1.26 (0.11)   0.49 (0.01)  0.53 (0.02)  0.58 (0.05)   0.53 (0.01)  0.55 (0.02)  0.63 (0.06)
  $\hat\Sigma^n_{chol}$  1.67 (0.05)  1.73 (0.08)  1.89 (0.14)   0.77 (0.01)  0.80 (0.02)  0.84 (0.04)   0.55 (0.01)  0.60 (0.02)  0.68 (0.05)

Tables 2 and 3 summarize the numerical performance of estimating the correlation matrix and its inverse for the three banded Cholesky decomposition regularization methods. The "naïve" covariance regularization performs the worst in the presence of non-normality. The rank-based covariance regularization effectively handles the transnormal data and performs comparably with the "oracle" estimator. This numerical evidence is consistent with the theoretical results presented in Section 3.

Table 3: Performance of estimating the inverse correlation matrix with $\hat\Theta^o_{chol}$, $\hat\Theta^s_{chol}$ and $\hat\Theta^n_{chol}$. Estimation accuracy is measured by the matrix $\ell_1$ and $\ell_2$ norms; each entry is averaged over 100 independent replications, with standard errors in parentheses.

                               Model 1                                    Model 2                                    Model 3
  Method             p=100        p=200        p=500         p=100        p=200        p=500         p=100        p=200        p=500
  matrix $\ell_1$ norm
  $\hat\Theta^o_{chol}$  4.07 (0.21)  4.72 (0.37)  5.14 (0.57)   1.83 (0.08)  1.95 (0.12)  1.99 (0.25)   2.21 (0.04)  2.29 (0.09)  2.37 (0.18)
  $\hat\Theta^s_{chol}$  4.21 (0.24)  4.80 (0.32)  5.28 (0.55)   1.91 (0.07)  1.97 (0.14)  2.02 (0.25)   2.27 (0.04)  2.33 (0.07)  2.41 (0.16)
  $\hat\Theta^n_{chol}$  7.18 (0.22)  7.55 (0.35)  8.02 (0.54)   3.29 (0.06)  3.48 (0.14)  3.70 (0.29)   2.81 (0.03)  2.93 (0.10)  3.08 (0.19)
  matrix $\ell_2$ norm
  $\hat\Theta^o_{chol}$  3.18 (0.10)  3.41 (0.27)  4.07 (0.35)   1.26 (0.04)  1.36 (0.07)  1.41 (0.13)   1.68 (0.04)  1.74 (0.08)  1.75 (0.10)
  $\hat\Theta^s_{chol}$  3.29 (0.16)  3.59 (0.25)  4.10 (0.30)   1.35 (0.04)  1.39 (0.07)  1.45 (0.08)   1.69 (0.04)  1.75 (0.08)  1.77 (0.11)
  $\hat\Theta^n_{chol}$  4.98 (0.18)  5.15 (0.25)  5.33 (0.32)   2.64 (0.04)  2.81 (0.11)  2.89 (0.18)   2.29 (0.03)  2.32 (0.07)  2.39 (0.18)

Next, we compare the performance of estimating optimal transformations for multi-task median regression. In each of the 100 replications, we simulate another $4n$ independent transnormal observations $(x_{n+1},\dots,x_{5n})$ as a testing dataset. We take the first $q_0 = p/2$ variables of $x$ as the response $y$ and the remaining $p_0 = p/2$ variables as the predictor $z$. Given the three regularized covariance estimators $\hat\Sigma^o_{chol}$, $\hat\Sigma^s_{chol}$ and $\hat\Sigma^n_{chol}$, we examine their performance in predicting $y_i$ from $z_i$ for $i = n+1,\dots,5n$. Recall that $t_1(z) = h^{-1}\big(\Sigma_{yz}\Sigma_{zz}^{-1} g(z)\big)$ provides the closed-form solution for multi-task median regression. We can then derive the closed-form solutions with respect to the different regularized covariance estimators as follows:

• the "oracle" estimator: $\hat t^o_1(z) = h^{-1}\big(\hat\Sigma^o_{yz}(\hat\Sigma^o_{zz})^{-1} g(z)\big)$;

• the rank-based estimator: $\hat t^s_1(z) = \hat h^{-1}\big(\hat\Sigma^s_{yz}(\hat\Sigma^s_{zz})^{-1}\hat g(z)\big)$;

• the "naïve" estimator: $\hat t^n_1(z) = \hat\mu_y + \hat\Sigma^n_{yz}(\hat\Sigma^n_{zz})^{-1}(z-\hat\mu_z)$.

Table 4 shows their numerical performance. Prediction accuracy is measured by the difference of prediction errors, defined by subtracting the "oracle" prediction error, i.e.,
$$\mathrm{DPE}(\hat t_1(z)) = \frac{1}{4n}\sum_{i=n+1}^{5n}\|\hat t_1(z_i)-y_i\|_{\ell_1} - \frac{1}{4n}\sum_{i=n+1}^{5n}\|\hat t^o_1(z_i)-y_i\|_{\ell_1}.$$

Table 4: Differences of prediction errors for $\hat t^o_1(z)$, $\hat t^s_1(z)$ and $\hat t^n_1(z)$. Prediction accuracy is measured in the $\ell_1$ norm and averaged over 100 independent replications, with standard errors in parentheses.

                                            Model 1                                   Model 2                                       Model 3
  Comparison                      p=100        p=200        p=500        p=100         p=200         p=500          p=100        p=200         p=500
  $\hat t^s_1(z)$ vs. $\hat t^o_1(z)$   0.02 (0.00)  0.03 (0.00)  0.08 (0.01)  0.16 (0.02)   0.26 (0.03)   0.66 (0.04)    0.12 (0.01)  0.22 (0.01)   0.56 (0.04)
  $\hat t^n_1(z)$ vs. $\hat t^o_1(z)$   0.11 (0.00)  0.19 (0.01)  0.21 (0.03)  13.50 (0.22)  26.07 (0.58)  63.80 (2.16)   7.38 (0.14)  14.70 (0.37)  35.68 (1.28)

As summarized in Table 4, the proposed rank-based estimator performs very similarly to its oracle counterpart and greatly outperforms the "naïve" estimator.

4.2 Application to the protein mass spectroscopy data

Recent advances in high-throughput mass spectroscopy technology have enabled biomedical researchers to simultaneously analyze thousands of proteins. This subsection illustrates the power of the proposed rank-based method in an application to prostate cancer using the protein mass spectroscopy data (Adam et al. 2002), previously analyzed by Levina et al. (2008). The dataset consists of protein mass spectroscopy measurements for blood serum samples from 157 healthy people and 167 prostate cancer patients. For each blood serum sample, the protein mass spectroscopy measures the intensity at ordered time-of-flight values, which are related to the mass over charge ratio of proteins. In our analysis, we exclude the measurements with mass over charge ratio below 2000 to avoid chemical artifacts, and perform the same preprocessing as in Levina et al. (2008) to smooth the intensity profiles. This gives a total of $p = 218$ ordered mass over charge ratio indices for each sample. We thus have the control data $x^{co}_i = (x^{co}_{i,1},\dots,x^{co}_{i,218})$ for $i = 1,\dots,157$, and the cancer data $x^{ca}_j = (x^{ca}_{j,1},\dots,x^{ca}_{j,218})$ for $j = 1,\dots,167$. We refer readers to Adam et al. (2002) and Levina et al. (2008) for more details about the dataset and the preprocessing procedure. Our aim is to use the more stable intensity measurements at the last 168 mass over charge ratio indices to optimally predict the more volatile intensity measurements at the first 50 indices.

The analysis of Levina et al. (2008) is based on the multivariate normal covariance matrix. Hence, we perform normality tests for both the control data and the cancer data. Table 5 shows the number of rejections of the null hypothesis among the 218 normality tests. At least 50% of the mass spectroscopy measurements are non-normal at the significance level of 0.05. Even after the strict Bonferroni correction, the null hypothesis is rejected at no fewer than 30 indices in the control data and 116 indices in the cancer data. Figure 1 further illustrates the non-normality (e.g., heavy tails and skewness) at two indices (109 and 218).

Table 5: Testing for normality. The table shows the number of rejections of the normality hypothesis at different significance levels among the 218 mass over charge ratio indices.

                 significance level   Anderson-Darling   Cramér-von Mises   Kolmogorov-Smirnov
  control data   0.05                 158                138                119
                 0.05/218             71                 52                 30
  cancer data    0.05                 196                177                156
                 0.05/218             136                134                116

Figure 1: Illustration of the non-normality of the protein mass spectroscopy data via intensity histograms at indices 109 and 218: (A1) and (A2) for the control data; (B1) and (B2) for the cancer data.

The non-normality of the measurements suggests using the transnormal model. Let $z^{co}$ ($z^{ca}$) and $y^{co}$ ($y^{ca}$) denote the intensity measurements at the last 168 and the first 50 indices of the control (cancer) data, respectively. We divide the dataset into training sets $(x^{co}_1,\dots,x^{co}_{120})$ and $(x^{ca}_1,\dots,x^{ca}_{120})$ and testing sets $(x^{co}_{121},\dots,x^{co}_{157})$ and $(x^{ca}_{121},\dots,x^{ca}_{167})$. Note that the sample estimator $\hat t^{sample}_1(z) = \hat\mu_y + \hat\Sigma^{sample}_{yz}(\hat\Sigma^{sample}_{zz})^{-1}(z-\hat\mu_z)$ is infeasible, since the usual sample correlation matrix $\hat\Sigma^{sample}_{zz}$ is not invertible. We therefore consider two methods to predict $y$: the proposed rank-based estimator $\hat t^s_1(z) = \hat h^{-1}\big(\hat\Sigma^s_{yz}(\hat\Sigma^s_{zz})^{-1}\hat g(z)\big)$, which uses the proposed rank-based banded Cholesky decomposition regularization, and the "naïve" estimator $\hat t^n_1(z) = \hat\mu_y + \hat\Sigma^n_{yz}(\hat\Sigma^n_{zz})^{-1}(z-\hat\mu_z)$, which uses the banded Cholesky decomposition regularization of Bickel & Levina (2008a). Following Levina et al. (2008), tuning parameters are chosen via cross-validation on the training data. The difference of prediction errors at the $j$-th index is computed as
$$\mathrm{DPE}_j(\hat t_1(z)) = \mathrm{PE}_j(\hat t^s_1(z)) - \mathrm{PE}_j(\hat t^n_1(z)),$$
where $\mathrm{PE}_j(\hat t_1(z))$ is the prediction error at the $j$-th index, computed by averaging the absolute prediction error $|(\hat t_1(z_i))_j - y_{ij}|$ over $i = 121,\dots,157$ for the control data and over $i = 121,\dots,167$ for the cancer data. The differences of prediction errors are shown in Figure 2: the rank-based method outperforms the "naïve" estimator at 38 out of 50 indices for the control data and at 46 out of 50 indices for the cancer data.

Moreover, we further use $\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$ to construct the $100(1-\tau)\%$ prediction interval. The prediction target and the prediction intervals are averaged over the testing set. Specifically, we estimate the $100(1-\tau)\%$ prediction interval as
$$100(1-\tau)\%\ \mathrm{PI} = \left[\frac{1}{34}\sum_{i=206}^{239}\hat Q_{\tau/2}(y_i|z_i),\ \frac{1}{34}\sum_{i=206}^{239}\hat Q_{1-\tau/2}(y_i|z_i)\right].$$
We plot the 95% prediction intervals for both the control data and the cancer data in Figure 3, which shows that the estimated prediction intervals cover most of the prediction target. This suggests that our proposal estimates both endpoints of the prediction interval well in practice, consistent with the asymptotic theory in Theorem 2.

Figure 2: The differences of prediction errors between the existing Cholesky-based method (Bickel & Levina 2008a) and the proposed rank-based estimator with $\tau = 0.5$: (A) the control data; (B) the cancer data. The solid dots mark the mass/charge ratio indices at which the proposed method outperforms the BL method: 38 out of 50 indices for the control data and 46 out of 50 for the cancer data.

Figure 3: Prediction intervals for predicting protein mass spectroscopy intensities using the proposed method: (A) the control data; (B) the cancer data. The prediction target, the proposed point estimate ($\tau = 0.5$) and the 95% prediction intervals are averaged over the testing set.

Appendix A: ADMM for nearest matrix projection

The alternating direction method of multipliers (Boyd et al. 2011) has been widely applied to solve large-scale optimization problems in high-dimensional statistical learning, for example, covariance matrix estimation (Bien & Tibshirani 2011, Xue et al. 2012) and graphical model selection (Ma et al. 2013, Danaher et al. 2013). We now design a new alternating direction method of multipliers to solve the nearest correlation matrix projection (11).


To begin with, we introduce a new variable $S = R - \hat R^s$ to split the positive definite constraint and the entrywise $\ell_\infty$ norm. We then obtain the equivalent convex optimization problem
$$(\bar R^s, \hat S) = \arg\min_{(R,S)}\ \|S\|_{\max} \quad\text{subject to}\quad R \succeq \varepsilon I;\ S = R - \hat R^s;\ \mathrm{diag}(S) = 0;\ S = S'. \tag{14}$$

Let $\Lambda$ be the Lagrange multiplier associated with the constraint $S = R - \hat R^s$ in (14). Next, we introduce the augmented Lagrangian function
$$L_\rho(R,S;\Lambda) = \|S\|_{\max} - \langle\Lambda,\ R-S-\hat R^s\rangle + \frac{1}{2\rho}\|R-S-\hat R^s\|_F^2 \tag{15}$$
for any given constant $\rho > 0$. For $i = 0,1,2,\dots$, we iteratively solve $(R^{i+1},S^{i+1})$ through alternating minimization with respect to $R$ and $S$, i.e.,
$$(R^{i+1},S^{i+1}) = \arg\min_{(R,S)} L_\rho(R,S;\Lambda^i),$$
and then update the Lagrange multiplier $\Lambda$ by
$$\Lambda^{i+1} = \Lambda^i - \frac{1}{\rho}\big(R^{i+1}-S^{i+1}-\hat R^s\big).$$
To sum up, the entire algorithm proceeds sequentially as follows until convergence:
$$\text{R step}:\quad R^{i+1} = \arg\min_{R\succeq\varepsilon I} L_\rho(R,S^i;\Lambda^i); \tag{16}$$
$$\text{S step}:\quad S^{i+1} = \arg\min_{S=S':\ \mathrm{diag}(S)=0} L_\rho(R^{i+1},S;\Lambda^i); \tag{17}$$
$$\Lambda\text{ step}:\quad \Lambda^{i+1} = \Lambda^i - \frac{1}{\rho}\big(R^{i+1}-S^{i+1}-\hat R^s\big). \tag{18}$$

We note that both subproblems (16) and (17) have simple solutions, which we explain in the sequel; we are thus able to avoid solving a sequence of complex subproblems. In what follows, we point out the explicit solutions to both (16) and (17).

First we consider the R step. Denote by $(U)_+$ the projection of a matrix $U$ onto the convex set $\{R \succeq \varepsilon I\}$. Suppose that $U$ has the eigen-decomposition $\sum_{i=1}^p \lambda_i u_i u_i'$. Then it is easy to see that $(U)_+ = \sum_{i=1}^p \max(\lambda_i,\varepsilon)\, u_i u_i'$. The subproblem (16) is then solved as
$$R^{i+1} = \arg\min_{R\succeq\varepsilon I}\ -\langle\Lambda^i,R\rangle + \frac{1}{2\rho}\|R-S^i-\hat R^s\|_F^2 = \big(S^i+\hat R^s+\rho\Lambda^i\big)_+.$$

We now consider the S step. Define the two sets $\mathcal{S} = \{S: \mathrm{diag}(S)=0;\ S=S'\}$ and $\mathcal{T} = \{T: \|T\|_1\le\rho;\ \mathrm{diag}(T)=0;\ T=T'\}$. We also define
$$M^{i+1} = R^{i+1} - \hat R^s - \rho\Lambda^i. \tag{19}$$
Note that the subproblem (17) can be equivalently written as
$$S^{i+1} = \arg\min_{S\in\mathcal{S}}\ \|S\|_{\max} + \langle\Lambda^i,S\rangle + \frac{1}{2\rho}\|R^{i+1}-S-\hat R^s\|_F^2 = \arg\min_{S\in\mathcal{S}}\ \|S\|_{\max} + \frac{1}{2\rho}\|S-M^{i+1}\|_F^2.$$
Lemma 1 shows that (17) can be solved by a projection onto the entrywise $\ell_1$ ball.

Lemma 1. For any given symmetric matrix $M = (m_{jl})_{p\times p}$, define
$$\hat S = \arg\min_{S\in\mathcal{S}}\ \|S\|_{\max} + \frac{1}{2\rho}\|S-M\|_F^2 \quad\text{and}\quad \hat T = \arg\min_{T\in\mathcal{T}}\ \frac12\|T-M\|_F^2.$$
Let $\hat S = (\hat s_{jl})_{p\times p}$ and $\hat T = (\hat t_{jl})_{p\times p}$. Then for any $(j,l)$ we always have
$$\hat s_{jl} = (m_{jl}-\hat t_{jl})\times I_{\{j\ne l\}}.$$

Now we define
$$T^{i+1} = \arg\min_{T\in\mathcal{T}}\ \|T-M^{i+1}\|_F^2. \tag{20}$$
We note that $T^{i+1}$ is essentially a projection of a $\frac12 p(p-1)$-dimensional vector onto the $\ell_1$ ball in $\mathbb{R}^{\frac12 p(p-1)}$, and it can be computed by the exact projection algorithm of Duchi et al. (2008) in $O(p^2)$ expected time. Let $T^{i+1} = (t^{i+1}_{jl})_{p\times p}$ and $M^{i+1} = (m^{i+1}_{jl})_{p\times p}$. Then, by Lemma 1, we obtain the desired closed-form solution to (17):
$$S^{i+1} = \big((m^{i+1}_{jl}-t^{i+1}_{jl})\times I_{\{j\ne l\}}\big)_{p\times p}.$$
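The exact $\ell_1$-ball projection of Duchi et al. (2008) admits a short sorting-based implementation; the following Python sketch (ours) projects a generic vector:

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a vector v onto the l1 ball of the given
    radius (Duchi et al. 2008): sort |v|, find the threshold, then shrink."""
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]                  # sorted magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(a - theta, 0.0)
```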

Algorithm 1. Proposed alternating direction method of multipliers for obtaining $\bar R^s$.

1. Initialization: $\rho$, $S^0$ and $\Lambda^0$.

2. Iterative alternating direction augmented Lagrangian step: for the $i$-th iteration,

   2.1 Solve $R^{i+1} = (S^i + \hat R^s + \rho\Lambda^i)_+$;

   2.2 Solve $S^{i+1} = \big((m^{i+1}_{jl}-t^{i+1}_{jl})\times I_{\{j\ne l\}}\big)_{p\times p}$ with $M^{i+1}$ and $T^{i+1}$ as in (19) and (20);

   2.3 Update $\Lambda^{i+1} = \Lambda^i - \frac1\rho(R^{i+1}-S^{i+1}-\hat R^s)$.

3. Repeat the above cycle until convergence.

The global convergence of Algorithm 1 can be obtained by following Xue et al. (2012); for space considerations, we omit the technical proof. The global convergence guarantees that Algorithm 1 always converges to an optimal solution $(\bar R^s, \hat S, \hat\Lambda)$ of (14) from any initial value $(S^0,\Lambda^0)$ for any specified constant $\rho > 0$.
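Putting the pieces together, here is a minimal Python sketch of Algorithm 1 (ours; it assumes project_l1_ball from the previous sketch is in scope, runs a fixed number of iterations rather than testing convergence, and reduces the symmetric zero-diagonal projection (20) to the strict upper triangle with radius $\rho/2$, which is our reading of the entrywise $\ell_1$ constraint):

```python
import numpy as np

def nearest_correlation_admm(R_s, eps=1e-4, rho=1.0, n_iter=200):
    """ADMM sketch for the nearest correlation matrix projection (11)."""
    p = R_s.shape[0]
    S = np.zeros((p, p))
    Lam = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    for _ in range(n_iter):
        # R step (16): eigenvalue thresholding at eps
        w, V = np.linalg.eigh(S + R_s + rho * Lam)
        R = (V * np.maximum(w, eps)) @ V.T
        # S step (17): via Lemma 1 and the l1-ball projection (20)
        M = R - R_s - rho * Lam
        t = project_l1_ball(M[iu], rho / 2.0)
        T = np.zeros((p, p)); T[iu] = t; T += T.T
        S = M - T
        np.fill_diagonal(S, 0.0)
        # Lambda step (18)
        Lam = Lam - (R - S - R_s) / rho
    return R
```

In practice one would monitor the primal residual $\|R^{i+1}-S^{i+1}-\hat R^s\|_F$ to decide when to stop.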


Appendix B: Technical Proofs

Proof of Lemma 1

Proof. First note that the dual norm of the entrywise $\ell_1$ norm is the entrywise $\ell_\infty$ norm, so $\|S\|_{\max} = \frac1\rho\max_{T\in\mathcal{T}}\langle T,S\rangle$. Since multiplying the objective by $\rho > 0$ does not change the minimizer, we have
$$\min_{S\in\mathcal{S}}\ \|S\|_{\max} + \frac{1}{2\rho}\|S-M\|_F^2 = \min_{S\in\mathcal{S}}\max_{T\in\mathcal{T}}\ \langle T,S\rangle + \frac12\|S-M\|_F^2 = \max_{T\in\mathcal{T}}\min_{S\in\mathcal{S}}\ \langle T,S\rangle + \frac12\|S-M\|_F^2 = \max_{T\in\mathcal{T}}\min_{S\in\mathcal{S}}\ \frac12\|S-M+T\|_F^2 - \frac12\|T-M\|_F^2 + \frac12\|M\|_F^2.$$
Notice that the optimal solution of the inner problem with respect to $S$ is given by
$$\hat S = (\hat s_{jl})_{p\times p} = \big((m_{jl}-t_{jl})\cdot I_{\{j\ne l\}}\big)_{p\times p}.$$
After substituting $\hat S$ into the inner problem, it suffices to solve the outer problem with respect to $T$, namely
$$\max_{T\in\mathcal{T}}\ -\frac12\|T-M\|_F^2 + \frac12\|M\|_F^2 = \min_{T\in\mathcal{T}}\ \|T-M\|_F^2.$$
By definition, $\hat T = (\hat t_{jl})_{p\times p}$ is the optimal solution of this problem with respect to $T$. Therefore $\hat s_{jl} = (m_{jl}-\hat t_{jl})\times I_{\{j\ne l\}}$ always holds for the optimal solutions $\hat S$ and $\hat T$. This completes the proof of Lemma 1.

Proof of Theorem 1

Proof. In the first part, we bound I = ‖ΣyzΣ−1

zz g(z) − Σ?yz(Σ

?zz)−1g(z)‖max. To this end,

we bound I1 = ‖ΣyzΣ−1

zz − Σ?yz(Σ

?zz)−1‖`∞ and I2 = ‖g(z) − g(z)‖max = OP (

√logn log p0

nκ)

respectively. Let ϕ1 = ‖(Σ?zz)−1‖`∞ and ϕ2 = ‖Σ?

yz‖`∞ for ease of notation. Notice that

ΣyzΣ−1

zz −Σ?yz(Σ

?zz)−1 = Σ?

yz · (Σ−1

zz − (Σ?zz)−1) + (Σyz −Σ?

yz) · (Σ?zz)−1

+ (Σyz −Σ?yz) · (Σ

−1

zz − (Σ?zz)−1).

By using the triangular inequality again, we then have

I1 ≤ ‖Σ?yz‖`∞ · ‖Σ

−1

zz − (Σ?zz)−1‖`∞ + ‖(Σ?

zz)−1‖`∞ · ‖Σyz −Σ?

yz‖`∞+ ‖Σyz −Σ?

yz‖`∞ · ‖Σ−1

zz − (Σ?zz)−1‖`∞

25

Page 26: Multi-task Quantile Regression under the Transnormal Modeljqfan/papers/15/TransNormal.pdf2 Multi-task quantile regression: model and method This section presents the complete methodological

To bound I1, we need to bound ‖Σyz −Σ?yz‖`∞ and ‖Σ−1

zz − (Σ?zz)−1‖`∞ . By definition, it is

easy to see that ‖Σyz −Σ?yz‖`∞ ≤ ‖Σ−Σ?‖`∞ . Moreover, we have

‖Σ−1

zz − (Σ?zz)−1‖`∞ ≤ C‖Σzz −Σ?

zz‖`∞ · ‖(Σ?zz)−1‖`∞ ≤ Cϕ1 · ‖Σ−Σ?‖`∞ .

Thus we can derive the explicit upper bound for I1 as

I1 ≤ C(ϕ1ϕ2 + ϕ1) · ‖Σ−Σ?‖`∞ = C(ϕ1ϕ2 + ϕ1) · ‖Σ−Σ?‖`1 = OP (ξn,p). (21)

Next, we prove that
$$I_2 = \|\widehat{g}(z) - g(z)\|_{\max} = O_P\Big(\sqrt{\frac{\log n\log p_0}{n^{\kappa}}}\Big). \quad (22)$$
To this end, note that $g_j(z_j)$ follows the standard normal distribution. Hence, since $t/\sqrt{(1-\kappa)\log n} \ge 1$ over the range of integration, we have
$$\begin{aligned}
\Pr\big(|g_j(z_j)| > \sqrt{(1-\kappa)\log n}\big) &= \int_{\sqrt{(1-\kappa)\log n}}^{+\infty} C\exp(-t^2/2)\,dt \\
&\le \frac{C}{\sqrt{\log n}}\int_{\sqrt{(1-\kappa)\log n}}^{+\infty} \exp(-t^2/2)\cdot t\,dt \le C\cdot\frac{1}{n^{\frac{1-\kappa}{2}}\sqrt{\log n}}.
\end{aligned}$$
Then it immediately implies that
$$\|g(z)\|_{\max} = O_P(\sqrt{\log n}). \quad (23)$$
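For completeness, the step from the tail bound above to (23) is a union bound over the $p_0$ coordinates; the displayed convergence assumes $p_0 = o\big(n^{\frac{1-\kappa}{2}}\sqrt{\log n}\big)$, which we read as implicit in the theorem's conditions:
$$\Pr\Big(\|g(z)\|_{\max} > \sqrt{(1-\kappa)\log n}\Big) \le \sum_{j=1}^{p_0}\Pr\Big(|g_j(z_j)| > \sqrt{(1-\kappa)\log n}\Big) \le \frac{C\,p_0}{n^{\frac{1-\kappa}{2}}\sqrt{\log n}} \longrightarrow 0.$$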

Let $\eta = \sqrt{(1-\kappa)\log n}$. Now by using Lemma 1 of Mai & Zou (2012), we have, for any $\epsilon \in (0,1)$,
$$\Pr\Big(\sup_{|g_j(z_j)|\le \eta} |\widehat{g}_j(z_j) - g_j(z_j)| \ge \epsilon\Big) \le C\exp\Big(-\frac{cn^{\kappa}\epsilon^2}{\log n}\Big) + \exp\Big(-\frac{cn^{\kappa}}{\log n}\Big).$$
Thus, we apply the union bound to obtain
$$\begin{aligned}
\Pr\big(\|\widehat{g}(z) - g(z)\|_{\max} \ge \epsilon\big) &\le \sum_{j=1}^{p_0}\Pr\big(|g_j(z_j)| > \eta\big) + \sum_{j=1}^{p_0}\Pr\Big(\sup_{|g_j(z_j)|\le \eta} |\widehat{g}_j(z_j) - g_j(z_j)| \ge \epsilon\Big) \\
&\le C\cdot\frac{p_0}{n^{\frac{1-\kappa}{2}}\sqrt{\log n}} + Cp_0\exp\Big(-\frac{cn^{\kappa}\epsilon^2}{\log n}\Big) + p_0\exp\Big(-\frac{cn^{\kappa}}{\log n}\Big),
\end{aligned}$$

which immediately implies the upper bound (22) upon taking $\epsilon = C\sqrt{n^{-\kappa}\log n\log p_0}$. Now since
$$\begin{aligned}
\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{g}(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}g(z)
&= \big(\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\big)\cdot\big(\widehat{g}(z) - g(z)\big) \\
&\quad + \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\cdot\big(\widehat{g}(z) - g(z)\big) + \big(\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\big)\cdot g(z),
\end{aligned}$$


we can combine (21), (22), (23) and also the triangle inequality to bound $I$ as
$$\begin{aligned}
I &\le \|\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\cdot\|\widehat{g}(z) - g(z)\|_{\max} + \|\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\cdot\|\widehat{g}(z) - g(z)\|_{\max} \\
&\quad + \|\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\cdot\|g(z)\|_{\max} \\
&= O_P\Big(\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big).
\end{aligned}$$

In the second part, we bound $J = \|(\widehat{\Sigma}_{yy} - \widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{\Sigma}_{zy}) - (\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\|_{\ell_2}$. By using the triangle inequality, we have
$$J \le \|\widehat{\Sigma}_{yy} - \Sigma^\star_{yy}\|_{\ell_2} + \|\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{\Sigma}_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} \equiv J_1 + J_2.$$
Notice that
$$\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{\Sigma}_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy} = \widehat{\Sigma}_{yz}(\Sigma^\star_{zz})^{-1}(\widehat{\Sigma}_{zy} - \Sigma^\star_{zy}) + (\widehat{\Sigma}_{yz} - \Sigma^\star_{yz})(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy} + \widehat{\Sigma}_{yz}\big(\widehat{\Sigma}_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\big)\widehat{\Sigma}_{zy}.$$

By using the definition of the spectral norm, we have
$$\begin{aligned}
\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2} &= \max_{u=(u_z',u_y')':\ \|u\|_{\ell_2}=1}\ \Big\|\begin{pmatrix} \widehat{\Sigma}_{zz} - \Sigma^\star_{zz} & \widehat{\Sigma}_{zy} - \Sigma^\star_{zy} \\ \widehat{\Sigma}_{yz} - \Sigma^\star_{yz} & \widehat{\Sigma}_{yy} - \Sigma^\star_{yy} \end{pmatrix}\begin{pmatrix} u_z \\ u_y \end{pmatrix}\Big\|_{\ell_2} \\
&\ge \max_{u_z:\ \|u_z\|_{\ell_2}=1}\ \Big\|\begin{pmatrix} \widehat{\Sigma}_{zz} - \Sigma^\star_{zz} & \widehat{\Sigma}_{zy} - \Sigma^\star_{zy} \\ \widehat{\Sigma}_{yz} - \Sigma^\star_{yz} & \widehat{\Sigma}_{yy} - \Sigma^\star_{yy} \end{pmatrix}\begin{pmatrix} u_z \\ 0 \end{pmatrix}\Big\|_{\ell_2} \\
&\ge \max_{u_z:\ \|u_z\|_{\ell_2}=1}\ \|(\widehat{\Sigma}_{yz} - \Sigma^\star_{yz})u_z\|_{\ell_2} = \|\widehat{\Sigma}_{yz} - \Sigma^\star_{yz}\|_{\ell_2}.
\end{aligned}$$
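This sub-block argument says the spectral norm of any off-diagonal block is dominated by that of the full symmetric matrix. A quick numerical illustration (the random dimensions and matrix are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
pz, py = 5, 3
# a symmetric stand-in for the error matrix Sigma_hat - Sigma_star
D = rng.normal(size=(pz + py, pz + py))
D = (D + D.T) / 2.0
block_yz = D[pz:, :pz]   # the (y, z) off-diagonal block

# padding u = (u_z, 0) shows ||block||_2 <= ||D||_2, as in the proof
assert np.linalg.norm(block_yz, 2) <= np.linalg.norm(D, 2) + 1e-12
```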

By similar arguments, we also have $\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2} \ge \|\widehat{\Sigma}_{zy} - \Sigma^\star_{zy}\|_{\ell_2}$. Given that $\epsilon_0 \le \lambda_{\min}(\Sigma^\star) \le \lambda_{\max}(\Sigma^\star) \le \frac{1}{\epsilon_0}$, we then apply the sub-multiplicative property to obtain
$$\begin{aligned}
\|\widehat{\Sigma}_{yz}(\Sigma^\star_{zz})^{-1}(\widehat{\Sigma}_{zy} - \Sigma^\star_{zy})\|_{\ell_2} &\le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2}, \\
\|(\widehat{\Sigma}_{yz} - \Sigma^\star_{yz})(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} &\le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2}, \\
\|\widehat{\Sigma}_{yz}\big(\widehat{\Sigma}_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\big)\widehat{\Sigma}_{zy}\|_{\ell_2} &\le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2}.
\end{aligned}$$
Thus we can combine the above bounds to bound $J_2$ as
$$J_2 = \|\widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{\Sigma}_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} \le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2}.$$
In addition, due to the fact that $J_1 = \|\widehat{\Sigma}_{yy} - \Sigma^\star_{yy}\|_{\ell_2} \le \|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2}$, we have
$$J \le J_1 + J_2 \le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_2} \le C\cdot\|\widehat{\Sigma} - \Sigma^\star\|_{\ell_1} = O_P(\xi_{n,p}),$$
where the last inequality uses $\|A\|_{\ell_2} \le \|A\|_{\ell_1}$ for a symmetric matrix $A$. This completes the proof of Theorem 1.


Proof of Theorem 2

Proof. To simplify notation, we define $\mu^\star_{y|z} = \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\cdot g(z)$, $\Sigma^\star_{y|z} = \Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}$, $\widehat{\mu}_{y|z} = \widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\cdot\widehat{g}(z)$ and $\widehat{\Sigma}_{y|z} = \widehat{\Sigma}_{yy} - \widehat{\Sigma}_{yz}\widehat{\Sigma}_{zz}^{-1}\widehat{\Sigma}_{zy}$. Then $L = \mu^\star_{y|z} + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\Sigma^\star_{y|z})$ and $\widehat{L} = \widehat{\mu}_{y|z} + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\widehat{\Sigma}_{y|z})$. In addition, we introduce $\widetilde{Q}_\tau(y|z) = h^{-1}(\widehat{L})$.

In order to bound $\widehat{Q}_\tau(y|z) - Q_\tau(y|z)$, we use the triangle inequality to obtain that
$$\begin{aligned}
\|\widehat{Q}_\tau(y|z) - Q_\tau(y|z)\|_{\max} &\le \|\widehat{Q}_\tau(y|z) - \widetilde{Q}_\tau(y|z)\|_{\max} + \|\widetilde{Q}_\tau(y|z) - Q_\tau(y|z)\|_{\max} \\
&\le \|\widehat{h}^{-1}(\widehat{L}) - h^{-1}(\widehat{L})\|_{\max} + \|h^{-1}(\widehat{L}) - h^{-1}(L)\|_{\max} \equiv I + J.
\end{aligned}$$

In the sequel, we derive probability bounds for I and J respectively.

To bound $I$, we first use the triangle inequality again to bound
$$\begin{aligned}
\|\widehat{L} - L\|_{\max} &\le \|\widehat{\mu}_{y|z} - \mu^\star_{y|z}\|_{\max} + \Phi^{-1}(\tau)\cdot\|\mathrm{vdiag}(\widehat{\Sigma}_{y|z}) - \mathrm{vdiag}(\Sigma^\star_{y|z})\|_{\max} \\
&\le \|\widehat{\mu}_{y|z} - \mu^\star_{y|z}\|_{\max} + \Phi^{-1}(\tau)\cdot\|\widehat{\Sigma}_{y|z} - \Sigma^\star_{y|z}\|_{\ell_2} \\
&\le O_P\Big(\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big) + \Phi^{-1}(\tau)\cdot O_P(\xi_{n,p}) \\
&= O_P\Big(\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big).
\end{aligned}$$
Notice that $\sqrt{n^{-\kappa}\log n\log p_0} + \xi_{n,p}\sqrt{\log n} = o(1)$, which is eventually no larger than $|L_j|$. Thus, with probability tending to one, we have
$$|\widehat{L}_j| \le 2|L_j|, \quad\text{and then,}\quad \Phi(-2|L_j|) \le \Phi(\widehat{L}_j) \le \Phi(2|L_j|). \quad (24)$$
In view of (24), we use the probability inequality from Page 75 of Serfling (1980) to obtain
$$\Pr\big(|\widehat{F}^{-1}_{p_0+j}(t) - F^{-1}_{p_0+j}(t)| > \epsilon\big) \le 2\exp(-2n\delta_\epsilon^2),$$
where $\delta_\epsilon = \min_{x\in[-2|L_j|,\,2|L_j|]}\min\{F_{p_0+j}(x+\epsilon) - F_{p_0+j}(x),\ F_{p_0+j}(x) - F_{p_0+j}(x-\epsilon)\}$, for any $j = 1,\ldots,q_0$ and $t\in[\Phi(-2|L_j|), \Phi(2|L_j|)] \subset I_j$. Recall that $f = (\Phi^{-1}\circ F_1, \ldots, \Phi^{-1}\circ F_p) = (g, h)$. Note that $\widehat{h}^{-1}_j(\widehat{L}_j) - h^{-1}_j(\widehat{L}_j) = \widehat{F}^{-1}_{p_0+j}(\Phi(\widehat{L}_j)) - F^{-1}_{p_0+j}(\Phi(\widehat{L}_j))$ is the $j$-th element of $\widehat{h}^{-1}(\widehat{L}) - h^{-1}(\widehat{L})$ for


$j = 1, \ldots, q_0$. Now we use the union bound to derive
$$\begin{aligned}
\Pr\big(\|\widehat{Q}_\tau(y|z) - \widetilde{Q}_\tau(y|z)\|_{\max} > \epsilon\big) &\le \sum_{j=1}^{q_0}\Pr\big(|\widehat{h}^{-1}_j(\widehat{L}_j) - h^{-1}_j(\widehat{L}_j)| > \epsilon\big) \\
&\le \sum_{j=1}^{q_0}\Pr\big(|\widehat{F}^{-1}_{p_0+j}(\Phi(\widehat{L}_j)) - F^{-1}_{p_0+j}(\Phi(\widehat{L}_j))| > \epsilon\big) \\
&\le \sum_{j=1}^{q_0}\max_{t\in[\Phi(-2|L_j|),\,\Phi(2|L_j|)]}\Pr\big(|\widehat{F}^{-1}_{p_0+j}(t) - F^{-1}_{p_0+j}(t)| > \epsilon\big) \\
&\le 2q_0\cdot\exp(-2nM^2\epsilon^2),
\end{aligned}$$
where we use the fact that $\delta_\epsilon \ge \min_{t\in I_j}\psi_{p_0+j}(t)\cdot\epsilon \ge M\epsilon$ in the last inequality. Thus, we have
$$\|\widehat{Q}_\tau(y|z) - \widetilde{Q}_\tau(y|z)\|_{\max} = O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}}\Big).$$

To bound $J$, we apply the mean-value theorem to the $j$-th element of $h^{-1}(\widehat{L}) - h^{-1}(L)$, that is, $h^{-1}_j(\widehat{L}_j) - h^{-1}_j(L_j) = F^{-1}_{p_0+j}(\Phi(\widehat{L}_j)) - F^{-1}_{p_0+j}(\Phi(L_j))$. Namely,
$$h^{-1}_j(\widehat{L}_j) - h^{-1}_j(L_j) = (h^{-1}_j)'(\bar{L}_j)\cdot(\widehat{L}_j - L_j) = \frac{\phi(\bar{L}_j)}{\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(\bar{L}_j))\big)}\cdot(\widehat{L}_j - L_j),$$
where $\bar{L}_j$ lies on the line segment between $\widehat{L}_j$ and $L_j$, and in the second equality we use the fact that $(h^{-1}_j)'(t) = (F^{-1}_{p_0+j}\circ\Phi)'(t) = \phi(t)/\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(t))\big)$. With probability tending to one, we have
$$\|\bar{L}\| \le \|\bar{L} - L\| + \|L\| \le \|\widehat{L} - L\| + \|L\| \le 2\|L\|.$$
Then, $\phi(\bar{L}_j)/\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(\bar{L}_j))\big) \le \phi(\bar{L}_j)/\big(\min_{x\in I_j}\psi_{p_0+j}(x)\big) \le (M\sqrt{2\pi})^{-1}$. Thus we have

$$\begin{aligned}
\|\widetilde{Q}_\tau(y|z) - Q_\tau(y|z)\|_{\max} &= \|h^{-1}(\widehat{L}) - h^{-1}(L)\|_{\max} \\
&\le \max_{j=1,\ldots,q_0}\frac{\phi(\bar{L}_j)}{\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(\bar{L}_j))\big)}\cdot\|\widehat{L} - L\|_{\max} \\
&= O_P\Big(\frac{1}{M}\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\Big).
\end{aligned}$$
Therefore, we obtain the desired bound for $\widehat{Q}_\tau(y|z) - Q_\tau(y|z)$ as follows,
$$\|\widehat{Q}_\tau(y|z) - Q_\tau(y|z)\|_{\max} \le I + J = O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}} + \frac{1}{M}\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\Big).$$

This completes the proof of Theorem 2.
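The key analytic ingredient above is the chain-rule identity $(h^{-1}_j)'(t) = \phi(t)\big/\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(t))\big)$. The following sketch verifies it numerically for one concrete margin (a standard exponential distribution, our choice purely for illustration):

```python
import numpy as np
from scipy.stats import norm, expon

# h^{-1} = F^{-1} o Phi for a standard exponential margin F
h_inv = lambda s: expon.ppf(norm.cdf(s))

t = 0.7                                                     # an arbitrary point
numeric = (h_inv(t + 1e-6) - h_inv(t - 1e-6)) / 2e-6        # finite difference
analytic = norm.pdf(t) / expon.pdf(expon.ppf(norm.cdf(t)))  # phi(t)/psi(F^{-1}(Phi(t)))
assert abs(numeric - analytic) < 1e-5 * analytic
```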


Proof of Theorem 3

Proof. Define the rank-based soft-thresholding estimator
$$\widehat{\Sigma}^s_{\rm soft} = \big(s_\lambda(\widehat{r}^s_{jl})\big)_{p\times p},$$
where $s_\lambda(\cdot)$ applies the soft-thresholding rule to the off-diagonal elements of $\widehat{R}^s$. Under the event $\|\widehat{R}^s - \Sigma^\star\|_{\max} \le c\,(n^{-1}\log p)^{1/2}$, we can follow the same line of proof as in Bickel & Levina (2008b) to show that
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\big\|\widehat{\Sigma}^s_{\rm soft} - \Sigma^\star\big\|_{\ell_1} \le C\cdot s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.$$
Given that $\lambda_{\min}(\Sigma^\star) \ge \epsilon_0 \ge C s_0(\log p/n)^{\frac{1-q}{2}} + \epsilon$, it is easy to see that $\lambda_{\min}(\widehat{\Sigma}^s_{\rm soft}) \ge \epsilon$. Thus, $\widehat{\Sigma}^s_{\rm soft}$ is the unique solution to the convex rank-based positive-definite $\ell_1$ penalization under the same event, which immediately implies that
$$\sup_{\Sigma^\star\in\mathcal{G}_q}\big\|\widehat{\Sigma}^s_{\ell_1} - \Sigma^\star\big\|_{\ell_1} \le C\cdot s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.$$
Finally, we cite the entrywise estimation bound for $\widehat{R}^s$ from Xue & Zou (2012), namely
$$\Pr\big(\|\widehat{R}^s - \Sigma^\star\|_{\max} \le c\,(n^{-1}\log p)^{1/2}\big) \longrightarrow 1$$
as $n$ tends to $\infty$. This completes the proof of Theorem 3.
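For reference, the soft-thresholding map used in this proof is elementary to compute; a minimal sketch (the input matrix and tuning parameter lam are placeholders, not the paper's choices):

```python
import numpy as np

def soft_threshold_offdiag(R_hat, lam):
    """Apply s_lam(r) = sign(r) * max(|r| - lam, 0) to off-diagonal entries,
    leaving the (unit) diagonal untouched."""
    S = np.sign(R_hat) * np.maximum(np.abs(R_hat) - lam, 0.0)
    np.fill_diagonal(S, np.diag(R_hat))
    return S
```

On the event used in the proof, the smallest eigenvalue of this matrix already exceeds $\epsilon$, so it coincides with the positive-definite $\ell_1$-penalized solution and no further projection is required.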

Proof of Theorem 5

Proof. As demonstrated in Bickel & Levina (2008a), it is sufficient to bound $\|\widehat{\Theta}^s_{\rm chol} - \Theta^\star\|_{\ell_1}$. Denote by $B_k(U) = (u_{jl}I_{|j-l|\le k})_{p\times p}$ the $k$-banded version of $U = (u_{jl})_{p\times p}$. Let $A_k = B_k(A)$ be the $k$-banded version of $A$, and $D_k$ be the corresponding diagonal matrix. Define $\Theta_k = (I - A_k)'D_k^{-1}(I - A_k)$, and denote by $B_k(\Theta^\star)$ the $k$-banded version of $\Theta^\star$.

We first apply the triangle inequality to obtain
$$\|\widehat{\Theta}^s_{\rm chol} - \Theta^\star\|_{\ell_1} \le \|\widehat{\Theta}^s_{\rm chol} - \Theta_k\|_{\ell_1} + \|\Theta_k - B_k(\Theta^\star)\|_{\ell_1} + \|B_k(\Theta^\star) - \Theta^\star\|_{\ell_1}. \quad (25)$$
Notice that $\|\Theta_k - B_k(\Theta^\star)\|_{\ell_1}$ in (25) can be bounded as in Bickel & Levina (2008a), i.e.,
$$\|\Theta_k - B_k(\Theta^\star)\|_{\ell_1} \le Ck^{-\alpha}.$$


Also by the definition of $\mathcal{H}_\alpha$, it is easy to see that $\|B_k(\Theta^\star) - \Theta^\star\|_{\ell_1} \le Ck^{-\alpha}$ holds. Then it is sufficient for us to bound $\|\widehat{\Theta}^s_{\rm chol} - \Theta_k\|_{\ell_1}$ only.

Under the event $\|\widetilde{R}^s - \Sigma^\star\|_{\max} \le \zeta$ with $k\zeta = o(1)$, we follow (A14)–(A15) in Bickel & Levina (2008a) to show that $\|\widehat{A}^s - A_k\|_{\max} \le C\zeta$, and meanwhile, follow (A16)–(A17) in Bickel & Levina (2008a) to show that $\|\widehat{D}^s - D_k\|_{\max} \le Ck^2\zeta^2$. Then by using the triangle inequality, we obtain the upper bound for $\|\widehat{\Theta}^s_{\rm chol} - \Theta_k\|_{\ell_1}$ under the same event as follows,
$$\begin{aligned}
\|\widehat{\Theta}^s_{\rm chol} - \Theta_k\|_{\ell_1} &= \|(I - \widehat{A}^s)'(\widehat{D}^s)^{-1}(I - \widehat{A}^s) - (I - A_k)'D_k^{-1}(I - A_k)\|_{\ell_1} \\
&\le \|(A_k - \widehat{A}^s)'D_k^{-1}(I - \widehat{A}^s)\|_{\ell_1} + \|(I - A_k)'D_k^{-1}(A_k - \widehat{A}^s)\|_{\ell_1} \\
&\quad + \|(I - \widehat{A}^s)'\big((\widehat{D}^s)^{-1} - D_k^{-1}\big)(I - \widehat{A}^s)\|_{\ell_1} \\
&\le C\|A_k - \widehat{A}^s\|_{\ell_1} + C\|\widehat{D}^s - D_k\|_{\ell_1},
\end{aligned}$$
where the last inequality uses the facts that $\|D_k^{-1}\|_{\ell_1} \le C_1$ and $\|I - A_k\|_{\ell_1} \le C_2$ by Lemma A.2 of Bickel & Levina (2008a), together with $k\zeta = o(1)$. Note that
$$\|\widehat{D}^s - D_k\|_{\ell_1} \le \|\widehat{D}^s - D_k\|_{\max} \le Ck^2\zeta^2$$
and
$$\|A_k - \widehat{A}^s\|_{\ell_1} \le Ck\cdot\|A_k - \widehat{A}^s\|_{\max} \le Ck\zeta.$$
By using the fact that $k\zeta = o(1)$ again, we have
$$\|\widehat{\Theta}^s_{\rm chol} - \Theta_k\|_{\ell_1} \le Ck\zeta + Ck^2\zeta^2 \le Ck\zeta.$$
Next we cite an entrywise estimation bound from Xue & Zou (2012): for any positive quantity $\zeta = o(1)$, there exists some fixed constant $c$ such that
$$\Pr\big(\|\widehat{R}^s - \Sigma^\star\|_{\max} > \zeta\big) \le p^2\exp(-cn\zeta^2).$$
Since $\widetilde{R}^s$ is the nearest correlation matrix projection of $\widehat{R}^s$, we then have
$$\|\widetilde{R}^s - \Sigma^\star\|_{\max} \le 2\|\widehat{R}^s - \Sigma^\star\|_{\max} \le 2\zeta.$$
Thus, taking $\zeta = c\cdot n^{-1/2}\log^{1/2}p$ yields the upper bound for $\|\widehat{\Theta}^s_{\rm chol} - \Theta^\star\|_{\ell_1}$. This completes the proof of Theorem 5.
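To fix ideas, here is a sketch of the banded Cholesky construction $\Theta_k = (I - A_k)'D_k^{-1}(I - A_k)$ appearing throughout this proof: row $j$ of $A$ holds the coefficients of regressing the $j$-th variable on its $k$ immediate predecessors, and $D$ collects the residual variances. This is our illustration of the generic construction (cf. Bickel & Levina 2008a); the function name and inputs are ours, with Sigma a working covariance or correlation matrix such as the projected rank-based estimate.

```python
import numpy as np

def banded_cholesky_precision(Sigma, k):
    """Theta_k = (I - A_k)' D_k^{-1} (I - A_k), where row j of A_k regresses
    variable j on variables max(0, j-k), ..., j-1 (population regression
    computed from Sigma), and D_k holds the residual variances."""
    p = Sigma.shape[0]
    A = np.zeros((p, p))
    d = np.zeros(p)
    d[0] = Sigma[0, 0]
    for j in range(1, p):
        idx = np.arange(max(0, j - k), j)          # the k nearest predecessors
        coef = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])
        A[j, idx] = coef
        d[j] = Sigma[j, j] - Sigma[j, idx] @ coef  # residual variance
    I_minus_A = np.eye(p) - A
    return I_minus_A.T @ np.diag(1.0 / d) @ I_minus_A
```

With the projected rank-based correlation matrix as input, this plays the role of $\widehat{\Theta}^s_{\rm chol}$; the bandwidth $k$ then trades the $k^{-\alpha}$ bias terms in (25) against the $Ck\zeta$ stochastic term.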


Acknowledgement

Jianqing Fan's research was supported in part by National Institutes of Health grant R01-GM100474-04 and National Science Foundation grants DMS-1206464 and DMS-1406266. Hui Zou's research was supported in part by National Science Foundation grant DMS-0846068 and a grant from the Office of Naval Research. Lingzhou Xue's research was supported in part by National Institutes of Health grant R01-GM072611-09.

References

Adam, B., Qu, Y., Davis, J., Ward, M., Clements, M., Cazares, L., Semmes, O., Schellhammer, P., Yasui, Y., Feng, Z. & Wright, G. (2002), ‘Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men’, Cancer Research 62, 3609–3614.

Bickel, P. & Levina, E. (2008a), ‘Regularized estimation of large covariance matrices’, The Annals of Statistics 36, 199–227.

Bickel, P. & Levina, E. (2008b), ‘Covariance regularization by thresholding’, The Annals of Statistics 36, 2577–2604.

Bien, J. & Tibshirani, R. (2011), ‘Sparse estimation of a covariance matrix’, Biometrika 98, 807–820.

Box, G. E. P. & Cox, D. R. (1964), ‘An analysis of transformations (with discussion)’, Journal of the Royal Statistical Society: Series B 26, 211–252.

Box, G. E. P. & Tidwell, P. W. (1962), ‘Transformation of the independent variables’, Technometrics 4, 531–550.

Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011), ‘Distributed optimization and statistical learning via the alternating direction method of multipliers’, Foundations and Trends in Machine Learning 3, 1–122.

Bradic, J., Fan, J. & Wang, W. (2011), ‘Penalized composite quasi-likelihood for ultrahigh dimensional variable selection’, Journal of the Royal Statistical Society: Series B 73, 325–349.

Breiman, L. & Friedman, J. (1985), ‘Estimating optimal transformations for multiple regression and correlation (with discussion)’, Journal of the American Statistical Association 80, 580–598.

Cai, T., Zhang, C. & Zhou, H. (2010), ‘Optimal rates of convergence for covariance matrix estimation’, The Annals of Statistics 38, 2118–2144.

Danaher, P., Wang, P. & Witten, D. (2013), ‘The joint graphical lasso for inverse covariance estimation across multiple classes’, Journal of the Royal Statistical Society: Series B, to appear.

Devlin, S., Gnanadesikan, R. & Kettenring, J. (1975), ‘Robust estimation and outlier detection with correlation coefficients’, Biometrika 62, 531–545.

Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T. (2008), ‘Efficient projections onto the $\ell_1$-ball for learning in high dimensions’, Proceedings of the 25th International Conference on Machine Learning, pp. 272–279.

Fan, J., Liao, Y. & Mincheva, M. (2013), ‘Large covariance estimation by thresholding principal orthogonal complements (with discussion)’, Journal of the Royal Statistical Society: Series B 75, 603–680.

Fan, J., Ke, T., Liu, H. & Xia, L. (2013), ‘QUADRO: a supervised dimension reduction method via Rayleigh quotient optimization’, The Annals of Statistics, to appear.

Fan, Y., Härdle, W., Wang, W. & Zhu, L. (2013), ‘Composite quantile regression for the single-index model’, SFB 649 Discussion Papers, No. 2013-010.

Friedman, J. H. (2001), ‘Greedy function approximation: a gradient boosting machine’, The Annals of Statistics 29, 1189–1232.

Friedman, J. H. (2002), ‘Stochastic gradient boosting’, Computational Statistics & Data Analysis 38, 367–378.

Friedman, J. H. & Stuetzle, W. (1981), ‘Projection pursuit regression’, Journal of the American Statistical Association 76, 817–823.

Han, F. & Liu, H. (2012), ‘Semiparametric principal component analysis’, Advances in Neural Information Processing Systems, pp. 171–179.

Huang, J., Liu, N., Pourahmadi, M. & Liu, L. (2006), ‘Covariance matrix selection and estimation via penalised normal likelihood’, Biometrika 93, 85–98.

Kendall, M. (1948), Rank Correlation Methods, Charles Griffin and Co. Ltd., London.

Koenker, R. (2005), Quantile Regression, Cambridge University Press.

Koenker, R. & Bassett, G. J. (1978), ‘Regression quantiles’, Econometrica 46, 33–50.

Koenker, R. & Geling, O. (2001), ‘Reappraising medfly longevity: a quantile regression survival analysis’, Journal of the American Statistical Association 96, 458–468.

Koenker, R. & Xiao, Z. (2006), ‘Quantile autoregression (with discussion)’, Journal of the American Statistical Association 101, 980–990.

Lam, C. & Fan, J. (2009), ‘Sparsistency and rates of convergence in large covariance matrix estimation’, The Annals of Statistics 37, 4254–4278.

Levina, E., Rothman, A. & Zhu, J. (2008), ‘Sparse estimation of large covariance matrices via a nested Lasso penalty’, The Annals of Applied Statistics 2, 245–263.

Lin, Y. & Jeon, Y. (2003), ‘Discriminant analysis through a semiparametric model’, Biometrika 90, 379–392.

Liu, H., Han, F., Yuan, M., Lafferty, J. & Wasserman, L. (2012), ‘High-dimensional semiparametric Gaussian copula graphical models’, The Annals of Statistics 40, 2293–2326.

Liu, H., Lafferty, J. & Wasserman, L. (2009), ‘The nonparanormal: semiparametric estimation of high dimensional undirected graphs’, Journal of Machine Learning Research 10, 1–37.

Ma, S., Xue, L. & Zou, H. (2013), ‘Alternating direction methods for latent variable Gaussian graphical model selection’, Neural Computation 25, 2172–2198.

Mai, Q. & Zou, H. (2012), ‘Semiparametric sparse discriminant analysis in ultra-high dimensions’, Manuscript.

Qi, H. & Sun, D. (2006), ‘A quadratically convergent Newton method for computing the nearest correlation matrix’, SIAM Journal on Matrix Analysis and Applications 28, 360–385.

Rothman, A. J., Levina, E. & Zhu, J. (2009), ‘Generalized thresholding of large covariance matrices’, Journal of the American Statistical Association 104, 177–186.

Rothman, A., Levina, E. & Zhu, J. (2010), ‘A new approach to Cholesky-based covariance regularization in high dimensions’, Biometrika 97, 539–550.

Serfling, R. (1980), Approximation Theorems of Mathematical Statistics, John Wiley & Sons.

Stone, C. J. (1985), ‘Additive regression and other nonparametric models’, The Annals of Statistics 13, 689–705.

Tibshirani, R. (1988), ‘Estimating transformations for regression via additivity and variance stabilization’, Journal of the American Statistical Association 83, 394–405.

Wang, H. & He, X. (2007), ‘Detecting differential expressions in GeneChip microarray studies: a quantile approach’, Journal of the American Statistical Association 102, 104–112.

Wei, Y. & He, X. (2006), ‘Conditional growth charts’, The Annals of Statistics 34, 2069–2097.

Wei, Y., Pere, A., Koenker, R. & He, X. (2006), ‘Quantile regression methods for reference growth charts’, Statistics in Medicine 25, 1369–1382.

Wu, W. & Pourahmadi, M. (2003), ‘Nonparametric estimation of large covariance matrices of longitudinal data’, Biometrika 90, 831–844.

Wu, T., Yu, K. & Yu, Y. (2010), ‘Single-index quantile regression’, Journal of Multivariate Analysis 101, 1607–1621.

Xue, L., Ma, S. & Zou, H. (2012), ‘Positive definite $\ell_1$ penalized estimation of large covariance matrices’, Journal of the American Statistical Association 107, 1480–1491.

Xue, L. & Zou, H. (2012), ‘Regularized rank estimation of high-dimensional nonparanormal graphical models’, The Annals of Statistics 40, 2541–2571.

Xue, L. & Zou, H. (2014), ‘Rank-based tapering estimation of bandable correlation matrices’, Statistica Sinica 24, 83–100.

Yuan, M. & Lin, Y. (2007), ‘Model selection and estimation in the Gaussian graphical model’, Biometrika 94, 19–35.

Zhao, T., Roeder, K. & Liu, H. (2012), ‘Smooth-projected neighborhood pursuit for high-dimensional nonparanormal graph estimation’, Advances in Neural Information Processing Systems, pp. 162–170.

Zhu, L., Huang, M. & Li, R. (2012), ‘Semiparametric quantile regression with high-dimensional covariates’, Statistica Sinica 22, 1379–1401.

Zou, H. & Yuan, M. (2008), ‘Composite quantile regression and the oracle model selection theory’, The Annals of Statistics 36, 1108–1126.
