Submitted to the Annals of Statistics
arXiv: math.PR/0000000

A WELL CONDITIONED AND SPARSE ESTIMATE OF COVARIANCE AND INVERSE COVARIANCE MATRIX USING A JOINT PENALTY

By Ashwini Maurya

Michigan State University

We develop a method for estimating a well conditioned and sparse covariance matrix from a sample of vectors drawn from a sub-gaussian distribution in a high dimensional setting. The proposed estimator is obtained by minimizing the squared loss function under a joint penalty of the $\ell_1$ norm and the sum of squared deviations of the eigenvalues from a positive constant. The joint penalty plays two important roles: i) the $\ell_1$ penalty on each entry of the covariance matrix reduces the effective number of parameters, and consequently the estimate is sparse, and ii) the sum of squared deviations penalty on the eigenvalues controls the over-dispersion in the eigenvalues of the sample covariance matrix. In contrast to some of the existing methods of covariance matrix estimation, where often the interest is to estimate a sparse matrix, the proposed method is flexible in estimating both a sparse and a well-conditioned covariance matrix simultaneously. We also extend the method to inverse covariance matrix estimation and establish the consistency of the proposed estimators in both Frobenius and operator norm. The proposed algorithm for covariance and inverse covariance matrix estimation is very fast, efficient and easily scalable to large scale data analysis problems. Simulation studies for varying sample sizes and numbers of variables show that the proposed estimator performs better than the graphical lasso and PDSCE estimates for various choices of structured covariance and inverse covariance matrices. We also use the proposed estimator for tumor tissue classification using gene expression data and compare its performance with some other classification methods.

1. Introduction. With the recent surge in data technology and storage capacity, today's statisticians often encounter data sets where the sample size n is small and the number of variables p is very large: often hundreds, thousands or even a million or more. Examples include gene expression data and web search problems [Clarke et al. (2008), Pass et al. (2006)]. For many high dimensional data problems, the choice of classical statistical methods becomes inappropriate for making valid inference. The recent developments

AMS 2000 subject classifications: Primary 62G20, 62G05; secondary 62H12.
Corresponding author: Maurya
Key words and phrases. Sparsity, Eigenvalue Penalty, Matrix Estimation, Penalized Estimation.

in asymptotic theory deal with increasing p, as long as both p and n tend to infinity at some rate depending upon the parameter of interest.

The estimation of covariance and inverse covariance matrices is a problem of primary interest in multivariate statistical analysis. Some of the applications include: (i) Principal component analysis (PCA) [Johnstone et al. (2004), Zou et al. (2006)], where the goal is to project the data onto the "best" k-dimensional subspace, where best means that the projected data explain as much of the variation in the original data as possible without increasing k. (ii) Discriminant analysis [Mardia et al. (1979)], where the goal is to classify observations into different classes; an estimate of the covariance and inverse covariance matrix plays an important role, as the classifier is often a function of these entities. (iii) Regression analysis: if interest focuses on estimation of regression coefficients with correlated (or longitudinal) data, a sandwich estimator of the covariance matrix may be used to provide standard errors for the estimated coefficients that are robust in the sense that they remain consistent under mis-specification of the covariance structure. (iv) Gaussian graphical modeling [Meinshausen (2006), Wainwright et al. (2006), Yuan et al. (2007)]: the relationship structure among nodes can be inferred from the inverse covariance matrix; a zero entry in the inverse covariance matrix implies conditional independence between the corresponding nodes.

The estimation of a large dimensional covariance matrix based on few sample observations is a difficult problem, especially when $n \asymp p$ (here $a_n \asymp b_n$ means that there exist positive constants $c$ and $C$ such that $c \le a_n/b_n \le C$). In these situations the sample covariance matrix becomes unstable, which inflates the estimation error. It is well known that the eigenvalues of the sample covariance matrix are over-dispersed, which means that the eigen-spectrum of the sample covariance matrix is not a good estimator of its population counterpart [Marcenko and Pastur (1967), Karoui (2008)]. To illustrate this point, consider $\Sigma_p = I_p$, so that all the population eigenvalues are 1. A result from Geman (1980) shows that if the entries of the $X_i$'s are i.i.d. with a finite fourth moment and if $p/n \to c > 0$, then the largest sample eigenvalue $l_1$ satisfies

$l_1 \to (1 + \sqrt{c})^2 \quad a.s.$

This suggests that $l_1$ is not a consistent estimator of the largest eigenvalue $\sigma_1$ of the population covariance matrix. In particular, if $n = p$ then $l_1$ tends to 4 whereas $\sigma_1$ is 1. This is also evident in the eigenvalue plot in Figure 2.1. The distribution of $l_1$ also depends upon the underlying structure of the true covariance matrix. From Figure 2.1 it is evident that the smaller sample eigenvalues tend to underestimate the true eigenvalues for large p and small n. For more discussion, see Karoui (2008).
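The over-dispersion described above is easy to reproduce numerically. The following minimal simulation (not from the paper; the seed and the choice n = p = 500 are illustrative) draws standard normal data, so that the true covariance is $I_p$, and shows the largest sample eigenvalue approaching $(1+\sqrt{c})^2 = 4$:

```python
import numpy as np

# Minimal illustration: with Sigma = I_p (all true eigenvalues equal to 1)
# and p/n -> c = 1, the largest sample eigenvalue approaches (1 + sqrt(c))^2 = 4
# while the smallest collapses towards 0.
rng = np.random.default_rng(0)
n = p = 500                              # aspect ratio c = p/n = 1
X = rng.standard_normal((n, p))          # rows are i.i.d. N(0, I_p) vectors
S = X.T @ X / n                          # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)
print(round(eigvals.max(), 2))           # close to 4, not 1
print(round(eigvals.min(), 4))           # close to 0, not 1
```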

To correct this bias, a natural choice would be to shrink the sample eigenvalues towards some suitable constant to reduce the over-dispersion. For instance, Stein (1975) proposed an estimator of the form $\tilde{\Sigma} = \tilde{U}\tilde{\Lambda}\tilde{U}^T$, where $\tilde{\Lambda}$ is a diagonal matrix whose diagonal entries are transformed functions of the sample eigenvalues and $\tilde{U}$ is the matrix of sample eigenvectors. In another interesting paper, Ledoit and Wolf (2004) proposed an estimator that shrinks the sample covariance matrix towards the identity matrix. Karoui (2008) proposed a non-parametric estimator of the spectrum of eigenvalues and showed that it is consistent in the sense of weak convergence of distributions. The covariance matrix estimates based on eigen-spectrum shrinkage are

well conditioned in the sense that their eigenvalues are bounded well away from zero. These estimates are based on shrinkage of the eigenvalues and are therefore invariant under orthogonal transformations, i.e., the shrinkage estimators shrink the eigenvalues but leave the eigenvectors unchanged. In other words, the basis (the eigenvectors) in which the data are given is not taken advantage of, and therefore these methods rely on the premise that one will be able to find a good estimate in any basis. In particular, it is reasonable to believe that the basis generating the data is somewhat nice. Often this translates into the assumption that the covariance matrix has a particular structure that one should be able to take advantage of. In these situations it becomes natural to perform some form of regularization directly on the entries of the sample covariance matrix.

Much of the recent literature focuses on two broad classes of regularized covariance matrix estimation: i) one class relies on a natural ordering among the variables, where one often assumes that variables far apart are weakly correlated, and ii) the other class makes no assumption about a natural ordering among the variables. The first class includes estimators based on banding and tapering [Bickel and Levina (2008), Cai et al. (2010)]. These estimators are appropriate for a number of applications with ordered data (time series, spectroscopy, climate data). However, for many applications, including gene expression data, a priori knowledge of any canonical ordering is not available, and searching over all permutations of possible orderings would not be feasible. In these situations an $\ell_1$ penalized estimator, which yields a permutation-invariant estimate, becomes more appropriate.

To obtain a suitable estimate which is both well conditioned and sparse, we introduce two regularization terms: i) an $\ell_1$ penalty on each of the off-diagonal elements of the matrix, and ii) a squared-deviation penalty on the eigenvalues from a suitable constant. $\ell_1$ minimization problems are well studied in the covariance and inverse covariance matrix estimation literature [Friedman et al. (2007), Banerjee et al. (2008), Bickel and Levina (2008), Ravikumar et al. (2011), Bien and Tibshirani (2011), Maurya (2014), etc.]. Meinshausen and Buhlmann (2006) studied the problem of variable selection using high dimensional regression with the lasso and showed that it is a consistent selection scheme for high dimensional graphs. Rothman et al. (2008) propose an $\ell_1$ penalized log-likelihood estimator and show that their estimator is consistent in Frobenius and operator norm at the rate of $O_P\big(\sqrt{\{(p+s)\log p\}/n}\big)$ as both p and n approach infinity, where s is the number of non-zero off-diagonal elements in the true covariance matrix. Bien and Tibshirani (2011) propose an estimator of the covariance matrix as a penalized maximum likelihood estimator with a weighted lasso-type penalty. In these optimization problems, the $\ell_1$ penalty results in a sparse and permutation-invariant estimate, as compared to other $l_q$, $q \neq 1$, penalties. Another advantage is that the $\ell_1$ norm is a convex function, which makes it suitable for large scale optimization problems, and a number of fast algorithms exist for covariance and inverse covariance matrix estimation [Friedman et al. (2007), Rothman (2012)]. The squared-deviation penalty on the eigenvalues from a suitable constant overcomes the over-dispersion in the sample covariance matrix, so that the estimator remains well conditioned.

Ledoit and Wolf (2004) proposed an estimator of the covariance matrix as a

linear combination of the sample covariance matrix and the identity matrix. Their estimator of the covariance matrix is well conditioned, but it is not sparse. Rothman (2012) proposed an estimator of the covariance matrix based on a squared error loss and an $\ell_1$ penalty with a log-barrier on the determinant of the covariance matrix. The log-determinant barrier is a valid technique to achieve positive definiteness, but it is still unclear whether the iterative procedure proposed in that paper [Rothman (2012)] actually finds the right solution to the corresponding optimization problem. In another interesting paper, Xue et al. (2012) propose an estimator of the covariance matrix as the minimizer of a penalized squared loss function over the set of positive definite cones. The authors solve a positive definite constrained optimization problem and establish the consistency of their estimator. The resulting estimator is sparse and positive definite, but whether it overcomes the over-dispersion of the eigen-spectrum of the sample covariance matrix is hard to justify. Maurya (2014) proposed a joint convex penalty as a function of the $\ell_1$ norm and the trace norm (defined as the sum of singular values of a matrix) for inverse covariance matrix estimation based on a penalized likelihood approach.

In this paper, we derive an explicit rate of convergence of the proposed estimator (2.4) in Frobenius norm and operator norm. This rate depends upon the level of sparsity of the true covariance matrix. In addition, for a slight

modification of the method (Theorem 3.3), we prove the consistency of our estimate in operator norm and show that its rate is similar to that of the banded estimator of Bickel and Levina (2008). One of the major advantages of the proposed estimator is that the derived algorithm is very fast, efficient and easily scalable to large scale data analysis problems.

The rest of the paper is organized as follows. The next section highlights some background and the problem set-up for covariance and inverse covariance matrix estimation. In Section 3, we give the proposed estimator and establish its theoretical consistency. In Section 4, we give an algorithm and compare its computational time with some other existing algorithms. Section 5 highlights the performance of the proposed estimator on simulated data, while an application of the proposed estimator to real-life colon tumor data is given in Section 6.

Notation: For a matrix $M$, let $\|M\|_1$ denote its $\ell_1$ norm, defined as the sum of the absolute values of the entries of $M$; let $\|M\|_F$ denote the Frobenius norm of $M$, defined as the square root of the sum of the squared elements of $M$; let $\|M\|$ denote the operator norm (also called the spectral norm), defined as the largest absolute eigenvalue of $M$; let $M^{-}$ denote the matrix $M$ with all diagonal elements set to zero; let $M^{+}$ denote the matrix $M$ with all off-diagonal elements set to zero; let $\sigma_i(M)$ denote the i-th largest eigenvalue of $M$; let $\mathrm{tr}(M)$ denote its trace; and let $\det(M)$ denote its determinant.

2. Background and Problem Set-up. Let $X = (X_1, X_2, \dots, X_p)$ be a zero-mean p-dimensional random vector. The focus of this paper is the estimation of the covariance matrix $\Sigma := E(XX^T)$ and its inverse $\Sigma^{-1}$ from a sample of independently and identically distributed data $\{X^{(k)}\}_{k=1}^{n}$. In this section we describe the background and problem set-up more precisely.

The choice of loss function is very crucial in any optimization problem. An optimal estimator for a particular loss function may not be optimal for another choice of loss function. The recent literature on covariance and inverse covariance matrix estimation mostly focuses on estimation based on the likelihood function or a quadratic loss function [Friedman et al. (2007), Banerjee et al. (2008), Bickel and Levina (2008), Ravikumar et al. (2011), Rothman (2012), Maurya (2014), etc.]. Maximum likelihood estimation requires a tractable probability distribution of the observations, whereas the quadratic loss function has no such requirement and is therefore fully non-parametric. The quadratic loss function is convex, and due to this analytical tractability it is a widely applicable choice for many data analysis problems.


2.1. Proposed Estimator. Let $S$ be the sample covariance matrix. Consider the following optimization problem:

(2.1)  $\hat{\Sigma}_{\lambda,\gamma} = \underset{\Sigma = \Sigma^T}{\arg\min} \Big[ \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2 \Big],$

where $\sigma_i(\Sigma)$ is the i-th largest eigenvalue of the matrix $\Sigma$, and $\lambda$ and $\gamma$ are some positive constants. Note that through the penalty function $\|\Sigma^{-}\|_1$ we only penalize the off-diagonal elements of $\Sigma$. The constant $t \in \mathbb{R}^{+}$ is suitably chosen; a choice of $t$ is the mean or median of the sample eigenvalues. The weights $a_i$ are shrinkage weights associated with the i-th eigenvalue $\sigma_i$. For $a_i = 1$ for all $i = 1, 2, \dots, p$, the optimization problem (2.1) shrinks all the eigenvalues by the same weight towards the same constant $t$ (the mean of the eigenvalues), and consequently (due to the squared-deviation penalty on the eigenvalues) this yields maximum shrinkage of the eigen-spectrum. The squared-deviation penalty for eigenvalue shrinkage is chosen for the following reasons: i) it is easy to interpret, and ii) this choice of penalty function yields a very fast optimization algorithm. From here onwards we suppress the dependence of $\hat{\Sigma}_{\lambda,\gamma}$ on $(\lambda,\gamma)$ and denote it by $\hat{\Sigma}$.

For $\gamma = 0$, (2.1) is the standard lasso problem for the quadratic loss function, and its solution is (see Section 4 for the derivation of this estimator):

(2.2)  $\hat{\sigma}_{ii} = s_{ii}, \qquad \hat{\sigma}_{ij} = \mathrm{sign}(s_{ij}) \max\Big\{|s_{ij}| - \frac{\lambda}{2},\, 0\Big\}, \quad i \neq j,$

where $\mathrm{sign}(x)$ is the sign of $x$ and $|x|$ is the absolute value of $x$. It is clear from this expression that a sufficiently large value of $\lambda$ results in a sparse covariance matrix estimate, but it is hard to assess whether $\hat{\Sigma}$ of (2.2) overcomes the over-dispersion in the sample eigenvalues. The eigenvalue plot in Figure 2.1 illustrates this phenomenon for a neighborhood-type covariance matrix (see Section 5 for a description of the neighborhood-type matrix). We simulated random vectors from a multivariate normal distribution with $n = 50$, $p = 50$.
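For reference, the $\lambda$-only solution (2.2) is a simple entrywise soft-thresholding of the off-diagonal of S; the sketch below is a direct transcription of (2.2) in Python/NumPy (the function name is ours, not from the paper's code):

```python
import numpy as np

def soft_threshold_cov(S, lam):
    """Lambda-only solution (2.2): soft-threshold off-diagonal entries of S at lam/2."""
    Sigma = np.sign(S) * np.maximum(np.abs(S) - lam / 2.0, 0.0)
    np.fill_diagonal(Sigma, np.diag(S))   # diagonal entries are not penalized
    return Sigma
```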


    Fig 2.1. Comparison of eigenvalues of sample and JPEN estimate of Covariance Matrix

As is evident from Figure 2.1, the eigenvalues of the sample covariance matrix are over-dispersed, as most of them are either too large or close to zero. The eigenvalues of the Joint Penalty (JPEN) estimate (2.4) of the covariance matrix are consistent for the eigenvalues of the true covariance matrix; see Section 5 for a detailed discussion. Another drawback of the estimator (2.2) is that the estimate can be negative definite [for details see Xue et al. (2012)].

As argued earlier, to overcome the over-dispersion in the sample covariance matrix, we include the squared-deviation penalty on the eigenvalues. To illustrate its advantage, consider $\lambda = 0$. After some algebra, the minimizer $\hat{\Sigma}$ of (2.1) (for $\lambda = 0$) is given by:

(2.3)  $\hat{\Sigma} = \frac{1}{2}(\hat{\Sigma}_1 + \hat{\Sigma}_1^T), \qquad \text{where } \hat{\Sigma}_1 = (S + \gamma t\, UAU^T)(I + \gamma\, UAU^T)^{-1},$

where $A = \mathrm{diag}(A_{11}, A_{22}, \dots, A_{pp})$ with $A_{ii} = a_i$ and $U$ is a matrix of eigenvectors (refer to Section 4 for details on the choice of $U$). Note that $\hat{\Sigma}_1$ in (2.3) may not be symmetric, but $\hat{\Sigma}$ is. To see whether the estimate above is positive definite, since $\sigma_{\min}(\hat{\Sigma}_1) = \sigma_{\min}(\hat{\Sigma}_1^T)$, after some algebra we have:

$\sigma_{\min}(\hat{\Sigma}) = \sigma_{\min}\big( SU(I + \gamma A)^{-1}U^T + \gamma t\, UA(I + \gamma A)^{-1}U^T \big)$
$\ge \sigma_{\min}\big( SU(I + \gamma A)^{-1}U^T \big) + \gamma t\, \sigma_{\min}\big( UA(I + \gamma A)^{-1}U^T \big)$
$\ge \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}}$
$\ge \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} > 0$

for $\gamma t \min_{i \le p} A_{ii} > 0$. This means that the squared-deviation penalty on the eigenvalues improves $S$ to a positive definite estimator $\hat{\Sigma}$ provided that $\gamma > 0$, $t > 0$ and $\min_{i \le p} A_{ii} > 0$. Note that the estimator (2.3) is well conditioned but need not be sparse. Sparsity can be achieved by imposing the $\ell_1$ penalty on each entry of the covariance matrix. Simulation experiments have shown that in general the minimizer of (2.1) is not positive definite for all values of $\lambda > 0$ and $\gamma > 0$. To achieve an estimator that is both well conditioned and sparse positive definite, we minimize the objective function of (2.1) over a specific region of values of $(\lambda, \gamma)$ which depends upon $S$, $t$, and $A$. The proposed JPEN estimator of the covariance matrix is given by:

(2.4)  $\hat{\Sigma} = \underset{\Sigma = \Sigma^T,\ (\lambda,\gamma) \in \hat{R}_1^{S,t,A}}{\arg\min} \Big[ \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2 \Big],$

where

$\hat{R}_1^{S,t,A} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in R_1^{S,t,A,\epsilon} \big\},$

and

$R_1^{S,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma, \epsilon > 0,\ \lambda \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \max_{i \le p} (1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\}.$

The minimization in (2.4) over $\Sigma$ is for fixed $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$; $\epsilon$ is some positive constant. Note that such a choice of $(\lambda, \gamma)$ guarantees that the minimum eigenvalue of the estimate in (2.4) is at least $\epsilon > 0$. Theorem 3.1 establishes that the set $\hat{R}_1^{S,t,A}$ is asymptotically nonempty.
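To make the role of the eigenvalue penalty concrete, the sketch below implements the $\gamma$-only closed form (2.3) with equal weights A = I and with U taken from the eigendecomposition of S + εI, as suggested in Section 4; the function name and defaults are illustrative, not from the paper's code:

```python
import numpy as np

def jpen_eigen_shrink(S, gamma, t=None, eps=1e-3):
    """Gamma-only JPEN estimate (2.3) with equal weights A = I."""
    p = S.shape[0]
    A = np.eye(p)                                    # a_i = 1 for all i
    evals, U = np.linalg.eigh(S + eps * np.eye(p))   # choice of U from Section 4
    if t is None:
        t = evals.mean()                             # shrink eigenvalues toward their mean
    UAUt = U @ A @ U.T
    Sigma1 = (S + gamma * t * UAUt) @ np.linalg.inv(np.eye(p) + gamma * UAUt)
    return 0.5 * (Sigma1 + Sigma1.T)                 # symmetrize as in (2.3)
```

Note that with A = I the update reduces to (S + γtI)/(1 + γ), a linear shrinkage of S toward tI, so every eigenvalue is pulled toward t and is at least γt/(1 + γ) > 0, in line with the lower bound derived above.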

2.2. Our Contribution. The main contributions are the following:
i) The proposed estimator is both sparse and well conditioned simultaneously. This approach allows one to take advantage of prior structure, if known, on the eigenvalues of the true covariance matrix.
ii) We establish the theoretical consistency of the proposed estimators in both Frobenius and operator norm.
iii) The proposed algorithm is very fast, efficient and easily scalable to large scale optimization problems.

We performed simulations to compare the performance of the proposed estimators of the covariance and inverse covariance matrix to some other existing estimators, for a number of structured covariance and inverse covariance matrices and for varying sample sizes and dimensions. See Section 5 for further details.

3. Analysis of the JPEN Method. Definition: A random vector $X$ is said to have a sub-gaussian distribution if for each $y \in \mathbb{R}^p \setminus \{0\}$ with $\|y\|_2 = 1$ and for every $t \ge 0$ there exists $0 < \tau < \infty$ such that

(3.1)  $P\{ |\langle y, X \rangle| > t \} \le e^{-t^2/(2\tau)}.$

Theorem 3.1. Let $X := (X_1, X_2, \dots, X_p)$ be a mean-zero sub-gaussian random vector as defined in (3.1). Let $S = (1/n) X X^T$ be the sample covariance matrix and let $p/n \to c < 1$ as $n = n(p) \to \infty$. Let $\hat{R}_1^{S,t,A}$ be as defined in (2.4). For $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$ we have $\hat{R}_1^{S,t,A} \,\triangle\, R_1^{\infty} \to \phi$ in probability, where

$R_1^{\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : g(c) > \epsilon \big\},$

$g(c) > 0$ is the limit in probability of the smallest eigenvalue of $S$, and $\phi$ is the empty set.

Next we give theoretical results on the consistency of the proposed estimator (2.4) of the covariance matrix.

3.1. Covariance Matrix Estimation. We make the following assumptions about the true covariance matrix $\Sigma_0$.
A0. $X := (X_1, X_2, \dots, X_p)$ is a mean-zero vector with covariance matrix $\Sigma_0$ such that each $X_i/\sqrt{\sigma_{0ii}}$ has a sub-gaussian distribution with parameter $\tau$ as defined in (3.1).
A1. With $E = \{(i,j) : \sigma_{0ij} \neq 0,\ i \neq j\}$, the cardinality of $E$ is at most $s$ for some positive integer $s$.
A2. There exists a finite positive real number $k > 0$ such that $1/k \le \sigma_{\min}(\Sigma_0) \le \sigma_{\max}(\Sigma_0) \le k$, where $\sigma_{\min}(\Sigma_0)$ and $\sigma_{\max}(\Sigma_0)$ are the minimum and maximum eigenvalues of the matrix $\Sigma_0$, respectively.


Assumption A2 guarantees that the true covariance matrix $\Sigma_0$ is well conditioned (i.e., all its eigenvalues are finite and positive). Well conditioned means [Ledoit and Wolf (2004)] that inverting the matrix does not inflate the estimation error. Assumption A1 is more of a definition, which says that the number of non-zero off-diagonal elements is bounded by some positive integer. Theorem 3.2 below gives the rate of convergence of the proposed covariance matrix estimator (2.4) in Frobenius norm.

Theorem 3.2. Let $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$ and let $\hat{\Sigma}$ be as defined in (2.4). Under Assumptions A0, A1, A2 and for $\sigma_{\min}(\Sigma_0) \le t \le \sigma_{\max}(\Sigma_0)$, we have:

(3.2)  $\|\hat{\Sigma} - \Sigma_0\|_F = O_P\Big( \sqrt{\frac{(p+s)\log p}{n}} \Big).$

Here the worst part of the rate of convergence comes from estimating the diagonal entries. For correlation matrix estimation, the rate can be improved to $O_P\big(\sqrt{s \log p / n}\big)$ (Corollary 3.2).

Let $\Sigma_0 = W \Gamma W$ be the variance-correlation decomposition of the true covariance matrix $\Sigma_0$, where $\Gamma$ is the true correlation matrix and $W$ is the diagonal matrix of true standard deviations. Let $\hat{K}$ be the solution to the following optimization problem:

(3.3)  $\hat{K} = \underset{K = K^T,\ (\lambda,\gamma) \in \hat{R}_{1a}^{\hat{\Gamma},t,A}}{\arg\min} \Big\{ \|K - \hat{\Gamma}\|_F^2 + \lambda \|K^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(K) - t\}^2 \Big\},$

where $\hat{R}_{1a}^{\hat{\Gamma},t,A}$ is given by

(3.4)  $\hat{R}_{1a}^{\hat{\Gamma},t,A} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in R_{1a}^{\hat{\Gamma},t,A,\epsilon} \big\},$

with

$R_{1a}^{\hat{\Gamma},t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma, \epsilon > 0,\ \lambda \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat{\Gamma})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \max_{i \le p} (1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\},$

and $\hat{\Gamma}$ is the sample correlation matrix, the sample counterpart of $\Gamma$. Similar to Theorem 3.1, the following corollary establishes that the symmetric difference between $\hat{R}_{1a}^{\hat{\Gamma},t,A}$ and its asymptotic counterpart $R_{1a}^{\infty}$ is empty as $n = n(p) \to \infty$.

Corollary 3.1. Let $X := (X_1, X_2, \dots, X_p)$ be a mean-zero random vector where each $X_i$, $i = 1, \dots, p$, has a sub-gaussian distribution as defined in (3.1). Let $\hat{\Gamma}$ be the sample correlation matrix and let $p/n \to c < 1$ as $n = n(p) \to \infty$. Let $\hat{R}_{1a}^{\hat{\Gamma},t,A}$ be as defined in (3.4). Then $\hat{R}_{1a}^{\hat{\Gamma},t,A} \,\triangle\, R_{1a}^{\infty} \to \phi$ in probability, where

$R_{1a}^{\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (1 - \sqrt{c})^2 > \epsilon \big\}.$

We have the following rate of convergence for the correlation matrix estimate $\hat{K}$ of (3.3).

Corollary 3.2. Under Assumptions A0, A1, A2, for $\sigma_{\min}(\Gamma) \le t \le \sigma_{\max}(\Gamma)$ and for $(\lambda, \gamma) \in \hat{R}_{1a}^{\hat{\Gamma},t,A}$,

(3.5)  $\|\hat{K} - \Gamma\|_F = O_P\Big( \sqrt{\frac{s \log p}{n}} \Big).$

The improved rate is due to the fact that for a correlation matrix all the diagonal entries are one. Define $\hat{\Sigma}_c := \hat{W} \hat{K} \hat{W}$, where $\hat{W}$ is the diagonal matrix of estimates of the true standard deviations based on the observations. The following theorem gives the rate of convergence of the correlation-matrix-based covariance matrix estimator in operator norm.

Theorem 3.3. Under Assumptions A0, A1, A2 and for $(\lambda, \gamma) \in \hat{R}_{1a}^{\hat{\Gamma},t,A}$,

(3.6)  $\|\hat{\Sigma}_c - \Sigma_0\| = O_P\Big( \sqrt{\frac{(s+1)\log p}{n}} \Big).$

Note that $\|\hat{\Sigma}_c - \Sigma_0\|_F \le \sqrt{p}\, \|\hat{\Sigma}_c - \Sigma_0\|$. Therefore the rate of convergence in Frobenius norm of the correlation-matrix-based estimator of the covariance matrix is the same as that of the estimator defined in (2.4).
Remark: This rate of operator norm convergence is the same as the one obtained in Bickel and Levina (2008) for banded covariance matrices. Although the method of proof is very different, the similar rate of convergence in operator norm is due to a similar kind of tail inequality for the sample covariance matrix of Gaussian and sub-gaussian random variables [Ravikumar et al. (2011)]. Rothman (2012) proposes an estimator of the covariance matrix based on a similar loss function, but the choice of a different penalty function yields a very different estimate. This is also exhibited in the simulation analysis of Section 5. Moreover, our proposed estimator is applicable for estimating any non-negative definite covariance matrix, which is not the case for Rothman's (2012) estimator (since Rothman's estimator involves the logarithm of the determinant of the estimator as another penalty, to keep all the eigenvalues of the estimated matrix away from zero).

3.2. Estimation of the Inverse Covariance Matrix. Notation: We shall use $\Omega$ for the inverse covariance matrix.
Assumptions: We make the following assumptions about the true inverse covariance matrix $\Omega_0$. Let $\Omega_0 = \Sigma_0^{-1}$.
B0. The random vector $X := (X_1, X_2, \dots, X_p)$ is a mean-zero vector with covariance matrix $\Sigma_0$ such that each $X_i/\sqrt{\sigma_{0ii}}$ has a sub-gaussian distribution with parameter $\tau$ as in (3.1).
B1. With $H = \{(i,j) : \omega_{0ij} \neq 0,\ i \neq j\}$, the cardinality of $H$ is at most $s$ for some positive integer $s$.
B2. There exists $0 < k < \infty$ large enough such that $(1/k) \le \sigma_{\min}(\Omega_0) \le \sigma_{\max}(\Omega_0) \le k$ and $\sigma_{\min}(S + \epsilon I) \ge 1/k$ for all $\epsilon \gtrsim \sqrt{\log p / n}$, where $S = (1/n) X X^T$.
Remark: In Assumption B2, we require the minimum eigenvalue of $S_* := S + \epsilon I$ to be bounded away from zero (equivalently, the largest eigenvalue of $S_*^{-1}$ to be bounded above). Let $\lim_{n(p)\to\infty} p/n = c < 1$; then by a result from Bai and Yin (1993), $\lim_{n(p)\to\infty} \sigma_{\min}(S) = g(c) > 0$. Consequently $\sigma_{\min}(S + \epsilon I) \ge 1/k$ for large enough $k$. This condition is required in establishing the rate of convergence of the estimator (3.7) (see Theorem 3.5).
Define the JPEN estimator of the inverse covariance matrix $\Omega_0$ as the solution to the following optimization problem:

(3.7)  $\hat{\Omega} = \underset{\Omega = \Omega^T,\ (\lambda,\gamma) \in \hat{R}_2^{S_*,t,A}}{\arg\min} \Big[ \|\Omega - S_*^{-1}\|_F^2 + \lambda \|\Omega^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Omega) - t\}^2 \Big],$

where

(3.8)  $\hat{R}_2^{S_*,t,A} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in R_2^{S_*,t,A,\epsilon} \big\},$

with

$R_2^{S_*,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma, \epsilon > 0,\ \lambda \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(S_*^{-1})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \max_{i \le p} (1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\},$

for $A = \mathrm{diag}(A_{11}, A_{22}, \dots, A_{pp})$ with $A_{ii} = a_i$, where the $a_i$ are the eigenvalue shrinkage weights in (3.7).
Remark: Note that $S_*$ is a positive definite matrix and therefore invertible.


Theorem 3.4. Let $X := (X_1, X_2, \dots, X_p)$ be a mean-zero vector where each $X_i$, $i = 1, \dots, p$, has a sub-gaussian distribution as defined in (3.1). Let $S = (1/n) X X^T$ and $S_* = S + \epsilon I$ for $\epsilon \gtrsim \sqrt{\log p / n}$. Let $p/n \to c < 1$ as $n = n(p) \to \infty$ and let $\hat{R}_2^{S_*,t,A}$ be as defined in (3.8). Then $\hat{R}_2^{S_*,t,A} \,\triangle\, R_2^{\infty} \to \phi$ in probability, where

$R_2^{\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : g_1(c) > \epsilon \big\},$

$g_1(c) = \lim_{n=n(p)\to\infty} \sigma_{\min}(S_*^{-1})$ and $\phi$ is the empty set.

The following theorem gives the consistency of the inverse covariance matrix estimator (3.7) in Frobenius norm.

Theorem 3.5. Let $\hat{\Omega}$ be the minimizer defined in (3.7). Under Assumptions B0, B1, B2, for $(\lambda, \gamma) \in \hat{R}_2^{S_*,t,A}$ and $\sigma_{\min}(\Omega_0) \le t \le \sigma_{\max}(\Omega_0)$, we have:

(3.9)  $\|\hat{\Omega} - \Omega_0\|_F = O_P\Big( \sqrt{\frac{(p+s)\log p}{n}} \Big).$

Note that the rate of convergence here is the same as for covariance matrix estimation. Let $\hat{L}$ be the solution to the following optimization problem:

(3.10)  $\hat{L} = \underset{L = L^T,\ (\lambda,\gamma) \in \hat{R}_{2a}^{\hat{\Gamma}_*,t,A}}{\arg\min} \Big\{ \|L - \hat{\Gamma}_*^{-1}\|_F^2 + \lambda \|L^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(L) - t\}^2 \Big\},$

where $\hat{\Gamma}_*^{-1} = \hat{W} S_*^{-1} \hat{W}$ and

(3.11)  $\hat{R}_{2a}^{\hat{\Gamma}_*,t,A} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : (\lambda, \gamma, \epsilon) \in R_{2a}^{\hat{\Gamma}_*,t,A,\epsilon} \big\},$

with

$R_{2a}^{\hat{\Gamma}_*,t,A,\epsilon} = \Big\{ (\lambda, \gamma, \epsilon) : \lambda, \gamma, \epsilon > 0,\ \lambda \asymp \sqrt{\tfrac{\log p}{n}},\ \frac{\sigma_{\min}(\hat{\Gamma}_*^{-1})}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \max_{i \le p} (1 + \gamma A_{ii})^{-1} \ge \epsilon \Big\}.$

Corollary 3.3. Let $X := (X_1, X_2, \dots, X_p)$ be a mean-zero vector where each $X_i$, $i = 1, \dots, p$, has a sub-gaussian distribution as defined in (3.1). Let $p/n \to c < 1$ as $n = n(p) \to \infty$ and let $\hat{R}_{2a}^{\hat{\Gamma}_*,t,A}$ be as defined in (3.11). For $(\lambda, \gamma) \in \hat{R}_{2a}^{\hat{\Gamma}_*,t,A}$, we have $\hat{R}_{2a}^{\hat{\Gamma}_*,t,A} \,\triangle\, R_{2a}^{\infty} \to \phi$ in probability, where

$R_{2a}^{\infty} = \bigcup_{\epsilon > 0} \big\{ (\lambda, \gamma) : g_2(c) > \epsilon \big\},$


where $g_2(c)$ is the limit in probability of the smallest eigenvalue of $\hat{\Gamma}_*^{-1}$ and $\phi$ is the empty set.

We have the following rate of convergence for the inverse correlation matrix estimator given in (3.10).

Corollary 3.4. Let $\hat{L}$ be the minimizer of (3.10). Under Assumptions B0, B1, B2 and for $(\lambda, \gamma) \in \hat{R}_{2a}^{\hat{\Gamma}_*,t,A}$,

(3.12)  $\|\hat{L} - \Gamma^{-1}\|_F = O_P\Big( \sqrt{\frac{s \log p}{n}} \Big).$

This rate is the same as that of the correlation matrix estimator given in (3.3). Define $\hat{\Omega}_c := \hat{W}^{-1} \hat{L} \hat{W}^{-1}$. We have the following result on the operator norm consistency of the inverse-correlation-based inverse covariance matrix estimator.

Theorem 3.6. Under Assumptions B0, B1, B2 and for $(\lambda, \gamma) \in \hat{R}_{2a}^{\hat{\Gamma}_*,t,A}$,

(3.13)  $\|\hat{\Omega}_c - \Omega_0\| = O_P\Big( \sqrt{\frac{(s+1)\log p}{n}} \Big).$

Since $\|\hat{\Omega}_c - \Omega_0\|_F \le \sqrt{p}\, \|\hat{\Omega}_c - \Omega_0\|$, the rate of convergence of the inverse covariance matrix estimator based on the inverse correlation matrix is the same as that of the covariance matrix estimator based on the correlation matrix.

    4. An Algorithm.

4.1. Covariance Matrix Estimation. The optimization problem (2.4) can be written as:

(4.1)  $\hat{\Sigma} = \underset{\Sigma = \Sigma^T,\ (\lambda,\gamma) \in \hat{R}_1^{S,t,A}}{\arg\min}\ f(\Sigma),$

where

$f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda \|\Sigma^{-}\|_1 + \gamma \sum_{i=1}^{p} a_i \{\sigma_i(\Sigma) - t\}^2.$

A solution to (4.1) is given by:

(4.2)  $\hat{\sigma}_{ii} = M_{ii}, \qquad \hat{\sigma}_{ij} = \mathrm{sign}(M_{ij}) \max\Big\{ |M_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})},\, 0 \Big\}, \quad i \neq j,$


where

$M = \frac{1}{2}\big( M_1 + M_1^T \big) \qquad \text{with } M_1 = (S + \gamma t\, UAU^T)(I + \gamma\, UAU^T)^{-1},$

$A = \mathrm{diag}(A_{11}, A_{22}, \dots, A_{pp})$ with $A_{ii} = a_i$, and $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$.
Choice of U:
Note that $U$ is the matrix of eigenvectors of $\Sigma$, which is unknown. One choice of $U$ is the matrix of eigenvectors from the eigenvalue decomposition of $S + \epsilon I$ for some $\epsilon > 0$, i.e., if $S + \epsilon I = U_1 D_1 U_1^T$, then take $U = U_1$.
Choice of $\lambda$ and $\gamma$:
For a given value of $\gamma$, we can find a value of $\lambda$ satisfying

$\lambda < 2\,(1 + \gamma \min_{i \le p} A_{ii}) \Big\{ \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i} A_{ii}} \Big\} + 2\gamma t \min_{i \le p} A_{ii} - 2\epsilon,$

and such a choice of $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$ guarantees that the minimum eigenvalue of the estimate (4.2) is at least $\epsilon > 0$.

4.2. Inverse Covariance Matrix Estimation. To get an expression for the inverse covariance matrix estimate, we replace $S$ by $S_*^{-1}$ in (4.2). Let $A$ be the weight matrix for the eigenvalues of the inverse covariance matrix in the optimization problem (3.7). Then an optimal solution to the optimization problem (3.7) is given by:

(4.3)  $\hat{\omega}_{ii} = M^*_{ii}, \qquad \hat{\omega}_{ij} = \mathrm{sign}(M^*_{ij}) \max\Big\{ |M^*_{ij}| - \frac{\lambda}{2(1 + \gamma \max_{i \le p} A_{ii})},\, 0 \Big\}, \quad i \neq j,$

where $M^* = \frac{1}{2}(M_2 + M_2^T)$, $M_2 = (S_*^{-1} + \gamma t\, U_1 A U_1^T)(I + \gamma\, U_1 A U_1^T)^{-1}$, and $(\lambda, \gamma) \in \hat{R}_2^{S_*,t,A}$. A choice of $U_1$ is the matrix of eigenvectors from the eigendecomposition $S_*^{-1} = U_1 D_1 U_1^T$.
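Under the same assumptions, the precision-matrix version (4.3) only changes the input: the covariance update is applied to $S_*^{-1} = (S + \epsilon I)^{-1}$. The sketch below reuses the hypothetical jpen_cov helper from Section 4.1:

```python
import numpy as np

def jpen_precision(S, lam, gamma, t=None, eps=1e-3):
    """JPEN precision estimate (4.3): run the covariance update on (S + eps*I)^{-1}."""
    p = S.shape[0]
    S_star_inv = np.linalg.inv(S + eps * np.eye(p))  # S_* is positive definite, so invertible
    return jpen_cov(S_star_inv, lam, gamma, t=t, eps=eps)
```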

4.2.1. Computational Time. We compare the computational time of our algorithm to some other existing algorithms: glasso [12] (Friedman et al. (2008)) and PDSCE [28] (Rothman (2012)). Note that the exact timing of these algorithms also depends upon the implementation, platform, etc. (we did our computations in R on an AMD 2.8 GHz processor). For each estimate, the optimal tuning parameters were obtained by minimizing the empirical loss function

(4.4)  $\|\hat{\Sigma} - S_{robust}\|_F,$


where $\hat{\Sigma}$ is an estimate of the covariance matrix and $S_{robust}$ is the sample covariance matrix based on 20,000 sample observations (refer to Section 5 for a detailed discussion). Figure 4.1 shows the total computational time taken to estimate the covariance matrix by the Glasso, PDSCE and JPEN algorithms for different values of p, for the Toeplitz type of covariance matrix, on a log-log scale (see Section 5 for the Toeplitz type of covariance matrix). Although the proposed method requires optimization over a grid of values of $(\lambda, \gamma) \in \hat{R}_1^{S,t,A}$, our algorithm is very fast and easily scalable to large scale data analysis problems.


Fig 4.1. Timing comparison of JPEN, Graphical Lasso (Glasso) and PDSCE on a log-log scale.

    5. Simulation Results.

We compare the performance of the proposed method to other existing methods on simulated data for four types of structured covariance and inverse covariance matrices.

(i) Hub Graph: The rows/columns of $\Sigma_0$ are partitioned into J equally-sized disjoint groups $\{V_1 \cup V_2 \cup \dots \cup V_J\} = \{1, 2, \dots, p\}$, each group associated with a pivotal row k. Let the size $|V_1| = s$. We set $\sigma_{0i,j} = \sigma_{0j,i} = \rho$ for $i \in V_k$ and $\sigma_{0i,j} = \sigma_{0j,i} = 0$ otherwise. In our experiment, $J = [p/s]$, $k = 1, s+1, 2s+1, \dots$, and we always take $\rho = 1/(s+1)$ with $J = 20$.

(ii) Neighborhood Graph: We first uniformly sample $(y_1, y_2, \dots, y_p)$ from a unit square. We then set $\sigma_{0i,j} = \sigma_{0j,i} = \rho$ with probability $(\sqrt{2\pi})^{-1} \exp(-4\|y_i - y_j\|^2)$. The remaining entries of $\Sigma_0$ are set to zero. The number of nonzero off-diagonal elements of each row or column is restricted to be smaller than $[1/\rho]$, where $\rho$ is set to 0.245.

(iii) Toeplitz Matrix: We set $\sigma_{0i,j} = 2$ for $i = j$; $\sigma_{0i,j} = |0.75|^{|i-j|}$ for $|i - j| = 1, 2$; and $\sigma_{0i,j} = 0$ otherwise (a generating sketch is given after this list).

(iv) Block Diagonal Matrix: In this setting $\Sigma_0$ is a block diagonal matrix with varying block sizes. For $p = 500$ the number of blocks is 4 and for $p = 1000$ the number of blocks is 6. Each block of the covariance matrix is taken to be a Toeplitz-type matrix as in case (iii).

We chose similar structures of $\Omega_0$ for the simulations. For all these choices of covariance and inverse covariance matrices, we generate random vectors from the multivariate normal distribution with varying n and p. We chose $n = 50, 100$ and $p = 500, 1000$; here we report the results for $n = 50$ and $p = 500, 1000$. Please refer to Section 8 for a detailed simulation analysis. We compare the performance of the proposed covariance matrix estimator to the graphical lasso, the PDSC estimate [Rothman (2012)] and the Ledoit-Wolf estimate of the covariance matrix. The JPEN estimate (4.2) of the covariance matrix was computed using R software (version 3.0.2). The graphical lasso estimate of the covariance matrix was computed using the R package "glasso" (http://statweb.stanford.edu/~tibs/glasso/). The Ledoit-Wolf estimate was obtained using code from http://www.econ.uzh.ch/faculty/wolf/publications.html#9. The PDSC estimate was obtained using the PDSCE package (http://cran.r-project.org/web/packages/PDSCE/index.html). For the inverse covariance matrix performance comparison we only include glasso and PDSCE. For each covariance and inverse covariance matrix estimate, we calculate the Average Relative Error (ARE) based on 50 iterations using the following formula:

$ARE(\Sigma, \hat{\Sigma}) = |\log(f(S, \hat{\Sigma})) - \log(f(S, \Sigma))| \,/\, |\log(f(S, \Sigma))|,$

where $f(S, \cdot)$ is the density of the multivariate normal distribution, $S$ is the sample covariance matrix, $\Sigma$ is the true covariance matrix and $\hat{\Sigma}$ is the estimate of $\Sigma$. Other choices of performance criteria are the Kullback-Leibler loss used by Yuan and Lin (2007) and Bickel and Levina (2008).
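One reading of the ARE criterion, taking f(S, ·) to be the Gaussian likelihood written in terms of S with the additive constant omitted, is sketched below; this is our interpretation of the formula above, not code from the paper:

```python
import numpy as np

def gauss_loglik(S, Sigma, n):
    """Gaussian log-likelihood in terms of S and Sigma (additive constant omitted)."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(S @ np.linalg.inv(Sigma)))

def average_relative_error(S, Sigma_true, Sigma_hat, n):
    l_true = gauss_loglik(S, Sigma_true, n)
    l_hat = gauss_loglik(S, Sigma_hat, n)
    return abs(l_hat - l_true) / abs(l_true)
```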

The optimal values of the tuning parameters $\lambda$ and $\gamma$ were obtained by minimizing the empirical loss function given in (4.4). Simulations show that the optimal choice of tuning parameters $\lambda$ and $\gamma$ is the same as if we replace $S_{robust}$ by the true covariance matrix $\Sigma_0$. The average relative errors and their standard deviations are given in Table 5.1; the numbers in brackets are the standard errors of the relative error. Table 5.1 gives the average relative errors and standard errors of the covariance matrix estimates based on glasso, Ledoit-Wolf, PDSCE and JPEN for $n = 50$ and $p = 500, 1000$. The glasso estimate of the covariance matrix performs very poorly among all the methods. The Ledoit-Wolf estimate performs well, but the estimate is generally not sparse; also, the eigenvalues of the Ledoit-Wolf estimator are more heavily shrunk towards the center than the true eigenvalues. The JPEN estimator outperforms the other estimators for most values of p for all four types of covariance matrices. The PDSCE estimates have low average relative error, close to JPEN; this could be due to the fact that PDSCE and JPEN use a quadratic loss function with different penalty functions. Table 5.2 reports the average relative errors and their standard deviations for inverse covariance matrix estimation. Here we do not include the Ledoit-Wolf estimator and only compare the glasso and PDSCE estimates with the proposed JPEN estimator. The JPEN estimate of the inverse covariance matrix outperforms the other methods for $p = 500$ and $p = 1000$ for all four types of structured inverse covariance matrices. Figure 5.1 reports the zero-recovery plot, i.e., the percentage of times each zero element of the covariance matrix was truly recovered over 50 realizations. The JPEN estimates recover the true zeros about 90% of the time for the Hub and Neighborhood types of covariance matrices. Our proposed estimator also reflects the recovery of the true structure of the non-zero entries and of any pattern among the rows/columns of the covariance matrix.

Table 5.1
Covariance matrix estimation (standard errors in parentheses)

                 Hub type matrix              Neighborhood type matrix
                 p=500         p=1000         p=500          p=1000
Ledoit-Wolf      2.13(0.103)   2.43(0.043)    1.36(0.054)    2.89(0.028)
Glasso           10.8(0.06)    14.7(0.052)    11.9(0.056)    14.3(0.03)
PDSCE            1.22(0.052)   2.23(0.051)    0.912(0.077)   1.85(0.028)
JPEN             1.74(0.051)   1.97(0.037)    0.828(0.052)   1.66(0.028)

                 Block type matrix            Toeplitz type matrix
                 p=500         p=1000         p=500          p=1000
Ledoit-Wolf      1.54(0.102)   2.96(0.0903)   1.967(0.041)   2.344(0.028)
Glasso           30.8(0.0725)  33.9(0.063)    12.741(0.051)  18.22(0.04)
PDSCE            1.62(0.118)   3.08(0.0906)   0.873(0.042)   1.82(0.028)
JPEN             1.01(0.101)   1.91(0.0909)   0.707(0.042)   1.816(0.028)


Table 5.2
Inverse covariance matrix estimation (standard errors in parentheses)

                 Hub type matrix              Neighborhood type matrix
                 p=500         p=1000         p=500          p=1000
Glasso           13.4(0.057)   17.5(0.065)    12.694(0.03)   13.596(0.033)
PDSCE            1.12(0.046)   2.34(0.044)    0.958(0.04)    1.85(0.038)
JPEN             0.613(0.033)  0.282(0.028)   0.392(0.038)   0.525(0.036)

                 Block type matrix            Toeplitz type matrix
                 p=500         p=1000         p=500          p=1000
Glasso           12.7(0.0406)  13.6(0.0316)   19.4(0.037)    20.7(0.022)
PDSCE            1.02(0.0562)  1.9(0.038)     1.91(0.064)    3.7(0.037)
JPEN             0.372(0.0481) 0.579(0.0328)  0.664(0.068)   2.42(0.045)

To see the effect of the eigenvalue shrinkage penalty compared to other methods, we plot (Figure 5.2) the eigenvalues of the estimated covariance matrices for $n = 20$, $p = 50$. The JPEN estimates of the eigen-spectrum are far better than those of the other methods, the closest being the PDSC estimates of the eigenvalues.

Fig 5.1. Heatmap of zeros identified in the covariance matrix out of 50 realizations. White color is 50/50 zeros identified, black color is 0/50 zeros identified.


Fig 5.2. Eigenvalue plot for n = 20, p = 50 based on 50 realizations.

6. Colon Tumor Classification Example. In this section, we compare the performance of our proposed covariance matrix estimator for Linear Discriminant Analysis (LDA) classification of tumors using the gene expression data from Alon et al. (1999). In this experiment, colon adenocarcinoma tissue samples were collected, 40 of which were tumor tissues and 22 non-tumor tissues. Tissue samples were analyzed using an Affymetrix oligonucleotide array. The data were processed, filtered, and reduced to a subset of 2,000 gene expression values with the largest minimal intensity over the 62 tissue samples (source: http://genomics-pubs.princeton.edu/oncology/affydata/index.html). Additional information about the dataset and its pre-processing can be found in Alon et al. (1999). In our analysis, we reduce the number of genes by selecting the p most significant genes based on logistic regression. We obtain estimates of the inverse covariance matrix for $p = 50, 100, 200$ and then use LDA to classify these tissues as either tumorous or non-tumorous (normal). We classify each test observation x to either class $k = 0$ or $k = 1$ using the LDA rule

(6.1)  $\delta_k(x) = \underset{k}{\arg\max} \Big\{ x^T \hat{\Omega} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Omega} \hat{\mu}_k + \log(\pi_k) \Big\},$

where $\pi_k$ is the proportion of class k observations in the training data, $\hat{\mu}_k$ is the sample mean for class k on the training data, and $\hat{\Omega} := \hat{\Sigma}^{-1}$ is an estimator of the inverse of the common covariance matrix computed from the training data by one of the methods under consideration. The tuning parameters $\lambda$ and $\gamma$ were chosen using 5-fold cross-validation.

To create training and test sets, we randomly split the data into a training set of size 42 and a test set of size 20; following the approach used by Wang et al. (2007), we require the training set to have 27 tumor samples and 15 non-tumor samples. We repeat the split at random 100 times and measure the average classification error.

Table 6.1
Averages and standard errors of classification errors over 100 replications, in %.

Method                p=50          p=100         p=200
Logistic Regression   21.0(0.84)    19.31(0.89)   21.5(0.85)
SVM                   16.70(0.85)   16.76(0.97)   18.18(0.96)
Naive Bayes           13.3(0.75)    14.33(0.85)   14.63(0.75)
Graphical Lasso       10.9(1.3)     9.4(0.89)     9.8(0.90)
Joint Penalty (JPEN)  9.9(0.98)     8.9(0.93)     8.2(0.81)

Since we do not have a separate validation set, we perform 5-fold cross-validation on the training data. At each split, we divide the training data into 5 subsets (folds), where 4 subsets are used to estimate the covariance matrix and 1 subset is used to measure the classifier's performance; this procedure is repeated 5 times, taking each of the 5 subsets in turn as validation data. An optimal combination of $\lambda$ and $\gamma$ is obtained by minimizing the average classification error. The tuning parameter for the graphical lasso was obtained by a similar criterion.
The average classification errors with standard errors over the 100 splits are presented in Table 6.1. Since the sample size is smaller than the number of genes, we omit the inverse sample covariance matrix, as it is not well defined, and instead include the naive Bayes and support vector machine classifiers. Naive Bayes has been shown to perform better than the sample covariance matrix in high-dimensional settings (Bickel and Levina (2004)). The Support Vector Machine (SVM) is another popular choice of high dimensional classification tool (Chih-Wei Hsu et al. (2010)). Among all the methods, the covariance-matrix-based LDA classifiers perform far better than naive Bayes, SVM and logistic regression. For all the other classifiers, classification performance deteriorates with increasing p. For larger p, i.e., when more genes are added to the data set, the classification performance of the JPEN-based LDA classifier improves, which differs from the analysis of the same data set in Rothman et al. (2008), where the authors pointed out that as more genes are added to the data set, the classifier's performance deteriorates. Note that the classification error of a covariance-matrix-based classifier initially decreases with increasing p and then deteriorates for large p. This is due to the fact that as the dimension of the covariance matrix increases, the estimator does not remain very informative. In particular, for p = 2000, when all the genes are used in the data analysis, the classification error of JPEN and glasso is about 30%, which is much higher than for p = 50.

7. Summary. We have proposed and analyzed regularized estimation of large covariance and inverse covariance matrices using a joint penalty. One of its biggest advantages is that the optimization carries very little computational burden, unlike many other methods for covariance regularization, and the resulting algorithm is very fast, efficient and easily scalable to large scale data analysis problems. We show that our estimators of the covariance and inverse covariance matrix are consistent in Frobenius and operator norm. The operator norm consistency guarantees consistency for principal components, hence we expect that PCA will be one of the most important applications of the method. Although the estimators in (2.4) and (3.7) do not require any assumption on the structure of the true covariance and inverse covariance matrices, respectively, prior knowledge of any structure of the true covariance matrix might be helpful for choosing a suitable weight matrix and hence for improving the estimation.

Acknowledgments. I would like to express my deep gratitude to Professor Hira L. Koul for his valuable and constructive suggestions during the planning and development of this research work.

    References.

[1] Alon U., Barkai N., Notterman D., Gish K., Ybarra S., Mack D. and Levine A., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA, 96(12), 6745-6750, 1999.
[2] Banerjee O., El Ghaoui L. and d'Aspremont A., Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485-516, 2008.
[3] Bickel P. and Levina E., Regularized estimation of large covariance matrices. The Annals of Statistics, 36, 199-227, 2008.
[4] Bickel P. and Levina E., Covariance regularization by thresholding. The Annals of Statistics, 36, 2577-2604, 2008.
[5] Cai T., Zhang C. and Zhou H., Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38, 2118-2144, 2010.
[6] Cai T., Liu W. and Luo X., A constrained $\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106, 594-607, 2011.
[7] Chaudhuri S., Drton M. and Richardson T., Estimation of a covariance matrix with zeros. Biometrika, 94(1), 199-216, 2007.
[8] Clarke R., Ressom H., Wang A., Xuan J., Liu M., Gehan E. and Wang Y., The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer, 8(1), 37-49, 2008.
[9] Dempster A., Covariance selection. Biometrika, 32, 95-108, 1972.


[10] Dey D. and Srinivasan C., Estimation of a covariance matrix under Stein's loss. The Annals of Statistics, 13(4), 1581-1591, 1985.
[11] Fan J., Fan Y. and Lv J., High-dimensional covariance matrix estimation using a factor model. Journal of Econometrics.
[12] Friedman J., Hastie T. and Tibshirani R., Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432-441, 2008.
[13] Geman S., A limit theorem for the norm of random matrices. The Annals of Statistics, 8(2), 252-261, 1980.
[14] Bien J. and Tibshirani R., Sparse estimation of a covariance matrix. Biometrika, 98(4), 807-820, 2011.
[15] Johnstone I. and Lu Y., Sparse principal components analysis. Unpublished manuscript, 2004.
[16] El Karoui N., Spectrum estimation for large dimensional covariance matrices using random matrix theory. The Annals of Statistics, 36(6), 2757-2790, 2008.
[17] El Karoui N., Operator norm consistent estimation of large dimensional sparse covariance matrices. The Annals of Statistics, 36, 2717-2756, 2008.
[18] Ledoit O. and Wolf M., A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365-411, 2004.
[19] Marcenko V. and Pastur L., Distributions of eigenvalues of some sets of random matrices. Math. USSR-Sbornik, 1, 507-536, 1967.
[20] Mardia K., Kent J. and Bibby J., Multivariate Analysis. Academic Press, New York. MR0560319, 1979.
[21] Maurya A., A joint convex penalty for inverse covariance matrix estimation. Computational Statistics and Data Analysis, 75, 15-27, 2014.
[22] Maurya A., A supplement to "A well conditioned and sparse estimate of covariance and inverse covariance matrix using a joint penalty". Submitted to the Annals of Statistics, Nov. 2014.
[23] Meinshausen N. and Buhlmann P., High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462, 2006.
[24] Pass G., Chowdhury A. and Torgeson C., A Picture of Search. The First International Conference on Scalable Information Systems, Hong Kong, June 2006.
[25] Pourahmadi M., Modeling covariance matrices: The GLM and regularization perspectives. Statistical Science, 26, 369-387, 2011.
[26] Pourahmadi M., Cholesky decompositions and estimation of a covariance matrix: orthogonality of variance-correlation parameters. Biometrika, 94(4), 1006-1013, 2007.
[27] Ravikumar P., Wainwright M., Raskutti G. and Yu B., High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5, 935-980, 2011.
[28] Rothman A. J., Bickel P. J., Levina E. and Zhu J., Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2, 494-515, 2008.
[29] Rothman A., Positive definite estimators of large covariance matrices. Biometrika, 99(3), 733-740, 2012.
[30] Wainwright M., Ravikumar P. and Lafferty J., High-dimensional graphical model selection using $\ell_1$-regularized logistic regression. Proceedings of Advances in Neural Information Processing Systems, 2006.
[31] Stein C., Estimation of a covariance matrix. Rietz Lecture, 39th Annual Meeting of the IMS, Atlanta, Georgia, 1975.
[32] Wang S., Kuo T. and Hsu C., Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation. IEEE Transactions on Automatic Control, AC-31(7), July 1986.


[33] Wang L., Zhu J. and Zou H., Hybrid huberized support vector machines for microarray classification. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 983-990, New York, NY, USA. ACM Press, 2007.
[34] Xue L., Ma S. and Zou H., Positive-definite $\ell_1$-penalized estimation of large covariance matrices. Journal of the American Statistical Association, Theory and Methods, 107(500), 2012.
[35] Yin Y. and Bai Z., Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. The Annals of Probability, 21(3), 1275-1294, 1993.
[36] Yuan M. and Lin Y., Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1), 19-35, 2007.
[37] Yuan M., Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11, 2261-2286, 2009.
[38] Zhou S., Rutimann P., Xu M. and Buhlmann P., High-dimensional covariance estimation based on Gaussian graphical models. Journal of Machine Learning Research, to appear, 2011.
[39] Zou H., Hastie T. and Tibshirani R., Sparse principal components analysis. Journal of Computational and Graphical Statistics, 15, 265-286, 2006. MR2252527.

    8. Technical Proofs.

Proof of Theorem 3.1. Let $\Sigma = UDU^T$ be the eigenvalue decomposition of $\Sigma$. Let

$f_1(D) = \|UDU^T - S\|_F^2 + \lambda \|UDU^T\|_1 + \gamma \sum_{1 \le i \le p} a_i \{\sigma_i(\Sigma) - t\}^2$
$= \mathrm{tr}(D^2) - 2\,\mathrm{tr}(S U D U^T) + \mathrm{tr}(S^2) + \lambda \|UDU^T\|_1 + \gamma \{\mathrm{tr}(AD^2) - 2t\,\mathrm{tr}(AD) + t^2\,\mathrm{tr}(A)\}$
$= \mathrm{tr}(D^2(I + \gamma A)) - 2\,\mathrm{tr}(D(U^T S U + \gamma t A)) + \mathrm{tr}(S^2) + \lambda \|UDU^T\|_1 + \gamma t^2\,\mathrm{tr}(A).$

Note that this is quadratic in $D$, and since $(I + \gamma A)$ is a positive definite matrix, $f_1(D)$ is convex. Differentiating with respect to $D$, we obtain

$\frac{\partial f_1(D)}{\partial D} = 2D(I + \gamma A) - 2(U^T S U + \gamma t A) + \lambda\, U^T \mathrm{sign}(UDU^T)\, U.$

Setting $\partial f_1(D)/\partial D = 0$ gives

$\hat{D} = (U^T S U + \gamma t A)(I + \gamma A)^{-1} - \frac{\lambda}{2}\, U^T \mathrm{sign}(U\hat{D}U^T)\, U\,(I + \gamma A)^{-1}.$

Positive definiteness of the eigenvalue matrix $\hat{D}$ implies positive definiteness of $\hat{\Sigma}$. Next we derive a lower bound on the smallest eigenvalue of $\hat{D}$. Note that

$\sigma_{\max}\Big\{ \tfrac{\lambda}{2}\, U^T \mathrm{sign}(U\hat{D}U^T)\, U\,(I + \gamma A)^{-1} \Big\} \le \frac{\lambda}{2}\, \sigma_{\max}\{(I + \gamma A)^{-1}\} = \frac{\lambda}{2} \cdot \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.$

Hence we obtain

$\sigma_{\min}(\hat{D}) \ge \sigma_{\min}\{U^T S U (I + \gamma A)^{-1}\} + \gamma t\, \sigma_{\min}\{A(I + \gamma A)^{-1}\} - \frac{\lambda}{2} \cdot \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}$
$\ge \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \cdot \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.$

For $\lambda \asymp \sqrt{\log p / n}$ and $\gamma \asymp \sqrt{\log p / n}$, we have $\sigma_{\min}(S) \to g(c) > 0$ in probability by a result of Yin and Bai (1993) [35]. Next we show that $\hat{R}_1^{S,t,A} \,\triangle\, R_1^{\infty} \to \phi$ in probability. Define

$Y_{\lambda,\gamma,\epsilon,t} = \frac{\sigma_{\min}(S)}{1 + \gamma \max_{i \le p} A_{ii}} + \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \cdot \frac{1}{1 + \gamma \min_{i \le p} A_{ii}}.$

Since $\sigma_{\min}(S) \to g(c)$ in probability, for a given $\delta > 0$ there exists a positive integer $N_1$ such that for all $n = n(p) \ge N_1$,

$P\big( |Y_{\lambda,\gamma,\epsilon,t} - g(c)| < \delta \big) \ge 1 - \delta,$

i.e., $g(c) - \delta \le Y_{\lambda,\gamma,\epsilon,t} \le g(c) + \delta$ with high probability. Taking $\delta \to 0$, we have $\hat{R}_1^{S,t,A} \,\triangle\, R_1^{\infty} = \phi$. Hence the theorem.

Remark: Note that the above result holds in an asymptotic sense under the assumptions of Theorem 3.1. For finite samples, when $n < p$, $\sigma_{\min}(S) = 0$, and because $\min_{i \le p} A_{ii} > 0$,

$\sigma_{\min}(\hat{D}) \ge \gamma t \min_{i \le p} \frac{A_{ii}}{1 + \gamma A_{ii}} - \frac{\lambda}{2} \cdot \frac{1}{1 + \gamma \min_{i \le p} A_{ii}} = \frac{1}{1 + \gamma \min_{i \le p} A_{ii}} \Big\{ \gamma t \min_{i \le p} A_{ii} - \frac{\lambda}{2} \Big\} > 0$

for sufficiently large $\gamma t$ relative to $\lambda$. This guarantees that the set $\hat{R}_1^{S,t,A}$ is nonempty for finite samples.

    Proof of Theorem 3.2. Let

    f() = jj Sjj2F + kk1 + pX

    i=1

    aifi() tg2;

    where is the matrix with all the diagonal elements set to zero. Denethe function Q(:) as following:

    Q() = f() f(0)

  • 26 ASHWINI MAURYA

    where 0 is the true covariance matrix and is any other covariance matrix.Let = UDUT be eigenvalue decomposition of , D is diagonal matrix ofeigenvalues and U is matrix of eigenvectors. We have,

    Q() = k Sk2F + kk1 + tr(AD2 2t AD + t2 A) k0 Sk2F k0 k1 tr(AD20 2t AD20 + t2 A)

    (8.1)

    where A = diag(a1; a2; ; ap) and 0 = U0D0UT0 is eigenvalue decompo-sition of 0. Let n(M) := f : = T ; kk2 = Mrn; 0 < M < 1 g.The estimate ^ minimizes the Q() or equivalently ^ = ^ 0 minimizesthe G() = Q(0 +). Note that G() is convex and if ^ be its solution,then we have G(^) G(0) = 0. Therefore if we can show that G() isnon-negative for 2 n(M), this will imply that the ^ lies within sphereof radius Mrn. We require rn =

    q(p+s) log p

    n ! 0 as n = n(p) goes to 1.This will give consistency of our estimate in Frobenius norm at rate O(rn).

    k Sk2F k0 Sk2F = tr(0 20S + S0S) tr(000 20S + S0S)= tr(0 000) 2 tr(( 0)S)= tr((0 +)

    0(0 +) 000) 2 tr(0S)= tr(0) 2 tr(0(S 0))

    Next, we bound term involving S in above expression, we have

    jtr((0 S))j Xi 6=j

    jij(0ij Sij)j+Xi=1

    jii(0ii Sii)j

    maxi6=j

    (j0ij Sij j)kk1 +ppmaxi=1

    (j0ii Siij)sX

    i=1

    2ii

    C0(1 + )maxi(0ii)

    nr log pn

    kk1 +rp log p

    nk+k2

    o C1

    nr log pn

    kk1 +rp log p

    nk+k2

    oholds with high probability by a result (Lemma 1) from Ravikumar et al.(2011) on the tail inequality for sample covariance matrix of sub-gaussianrandom vectors and where C1 = C0(1 + )maxi(0ii); C0 > 0. Next weobtain upper bound on the terms involving in (3.7). we have,

\[
\gamma\big\{\mathrm{tr}(AD^2 - 2tAD)\big\} - \gamma\big\{\mathrm{tr}(AD_0^2 - 2tAD_0)\big\}
= \gamma\,\mathrm{tr}\big\{A(U^T\Sigma^2U - U_0^T\Sigma_0^2U_0)\big\} - 2\gamma t\,\mathrm{tr}\big\{A(U^T\Sigma U - U_0^T\Sigma_0U_0)\big\}.
\]

(i) $\mathrm{tr}\big\{A(U^T\Sigma^2U - U_0^T\Sigma_0^2U_0)\big\} \le \sigma_1(A)\,\mathrm{tr}(\Sigma^2 - \Sigma_0^2) \le \mathrm{tr}\big\{(\Delta + \Sigma_0)^2 - \Sigma_0^2\big\} = \mathrm{tr}(2\Delta\Sigma_0 + \Delta\Delta) \le 2\bar{k}\sqrt{p}\,\|\Delta^{+}\|_F + \mathrm{tr}(\Delta^T\Delta)$.

(ii) $\mathrm{tr}\big\{A(U^T\Sigma U - U_0^T\Sigma_0U_0)\big\} \le \sigma_1(A)\,\mathrm{tr}(\Sigma - \Sigma_0) \le \mathrm{tr}\big\{(\Delta + \Sigma_0) - \Sigma_0\big\} = \mathrm{tr}(\Delta) \le \sqrt{p}\,\|\Delta^{+}\|_F$.

To bound the term $\lambda(\|\Delta^{-} + \Sigma_0^{-}\|_1 - \|\Sigma_0^{-}\|_1)$ in (3.7), let $E$ be the index set as defined in Assumption A.2 of Theorem 3.2. Then, using the triangle inequality, we obtain
\begin{align*}
\lambda\big(\|\Delta^{-} + \Sigma_0^{-}\|_1 - \|\Sigma_0^{-}\|_1\big) &= \lambda\big(\|\Delta^{-}_E + \Sigma_0^{-}\|_1 + \|\Delta^{-}_{E^c}\|_1 - \|\Sigma_0^{-}\|_1\big) \\
&\ge \lambda\big(\|\Sigma_0^{-}\|_1 - \|\Delta^{-}_E\|_1 + \|\Delta^{-}_{E^c}\|_1 - \|\Sigma_0^{-}\|_1\big)
= \lambda\big(\|\Delta^{-}_{E^c}\|_1 - \|\Delta^{-}_E\|_1\big).
\end{align*}

Let $\lambda = (C_1/\epsilon)\sqrt{\log p/n}$ and $\gamma = (C_1/\epsilon_1)\sqrt{\log p/n}$, where $(\lambda, \gamma) \in \hat{R}_1^{S,\gamma,t,A}$ and $(1/\bar{k}) \le t \le \bar{k}$. We obtain,

\begin{align*}
G(\Delta) \ge{}& \mathrm{tr}(\Delta^T\Delta) - 2C_1\Big\{\sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}\|_1 + \sqrt{\tfrac{p\log p}{n}}\,\|\Delta^{+}\|_F\Big\} \\
&- \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big\{2\bar{k}\sqrt{p}\,\|\Delta\|_F + \|\Delta\|_F^2 + 2t\sqrt{p}\,\|\Delta^{+}\|_F\Big\}
+ \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\big(\|\Delta^{-}_{E^c}\|_1 - \|\Delta^{-}_E\|_1\big) \\
\ge{}& \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2C_1\sqrt{\tfrac{p\log p}{n}}\,\|\Delta^{+}\|_F
- 2C_1\sqrt{\tfrac{\log p}{n}}\big(\|\Delta^{-}_E\|_1 + \|\Delta^{-}_{E^c}\|_1\big) \\
&+ \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\big(\|\Delta^{-}_{E^c}\|_1 - \|\Delta^{-}_E\|_1\big)
- \frac{2C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\,(1 + \bar{k})\sqrt{p}\,\|\Delta^{+}\|_F .
\end{align*}

Also, because $\|\Delta^{-}_E\|_1 = \sum_{(i,j)\in E,\, i\neq j} |\Delta_{ij}| \le \sqrt{s}\,\|\Delta^{-}\|_F$,
\[
-2C_1\sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}_{E^c}\|_1 + \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}_{E^c}\|_1
\ge \sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}_{E^c}\|_1\Big(-2C_1 + \frac{C_1}{\epsilon}\Big) \ge 0
\]
for sufficiently small $\epsilon$. Also,
\[
-2C_1\sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}_E\|_1 \ge -2C_1\sqrt{\tfrac{\log p}{n}}\,\sqrt{s}\,\|\Delta^{-}\|_F .
\]

Therefore,
\begin{align*}
G(\Delta) \ge{}& \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2C_1\sqrt{\tfrac{p\log p}{n}}\,\|\Delta^{+}\|_F
- \frac{2C_1}{\epsilon_1}\sqrt{\tfrac{p\log p}{n}}\,(1 + \bar{k})\,\|\Delta^{+}\|_F - 2C_1\sqrt{\tfrac{s\log p}{n}}\,\|\Delta^{-}\|_F \\
\ge{}& \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{+}\|_F
- 2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{-}\|_F - \frac{2C_1(1+\bar{k})}{\epsilon_1}\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{+}\|_F \\
\ge{}& \|\Delta^{+}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - 2\|\Delta^{+}\|_F^{-1}\sqrt{\tfrac{(p+s)\log p}{n}}\,C_1\Big(1 + \frac{1+\bar{k}}{\epsilon_1}\Big)\Big]
+ \|\Delta^{-}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - 2C_1\|\Delta^{-}\|_F^{-1}\sqrt{\tfrac{(p+s)\log p}{n}}\Big] \\
\ge{}& \|\Delta^{+}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - \frac{2C_1 + 2C_1(1+\bar{k})/\epsilon_1}{M}\Big]
+ \|\Delta^{-}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - \frac{2C_1}{M}\Big] \ge 0,
\end{align*}
for all sufficiently large $n$ and $M$. Hence the theorem.

Proof of Corollary 3.1. Note that for a correlation matrix, all the variables are standardized to have mean zero and variance 1. Using a result from Bai and Yin (1993), $\sigma_{\min}(S) \to (1 - \sqrt{c})^2 > 0$ as $n = n(p) \to \infty$ with $p/n \to c < 1$. The rest of the proof is similar to that of Theorem 3.1 and is hence omitted.

Proof of Corollary 3.2. This corollary is a special case of Theorem 3.2 in which all the variables are standardized to have mean zero and variance 1.

Proof of Theorem 3.3. We have
\begin{align*}
\|\hat{\Sigma}_c - \Sigma_0\| = \|\hat{W}\hat{K}\hat{W} - WK_0W\| \le{}& \|\hat{W} - W\|\,\|\hat{K} - K_0\|\,\|\hat{W} - W\| \\
&+ \|\hat{W} - W\|\big(\|\hat{K}\|\,\|W\| + \|\hat{W}\|\,\|K_0\|\big) + \|\hat{K} - K_0\|\,\|\hat{W}\|\,\|W\|,
\end{align*}
where $K_0$ denotes the true correlation matrix and $W$ the diagonal matrix of the true standard deviations.

Since $\|K_0\| = O(1)$, it follows from Corollary 3.2 that $\|\hat{K}\| = O(1)$. Also,
\[
\|\hat{W}^2 - W^2\| = \max_{\|x\|_2=1} \sum_{i=1}^{p} |\hat{w}_i^2 - w_i^2|\,x_i^2
\le \max_{1\le i\le p} |\hat{w}_i^2 - w_i^2| \sum_{i=1}^{p} x_i^2
= \max_{1\le i\le p} |\hat{w}_i^2 - w_i^2| = O_P\Big(\sqrt{\tfrac{\log p}{n}}\Big)
\]
holds with high probability by a result (Lemma 1) from Ravikumar et al. (2011) on the tail inequality for the entries of the sample covariance matrix of sub-gaussian random vectors. Next we show that $\|\hat{W} - W\| \asymp \|\hat{W}^2 - W^2\|$ (where $A \asymp B$ means $A = O_P(B)$ and $B = O_P(A)$). We have
\[
\|\hat{W} - W\| = \max_{\|x\|_2=1} \sum_{i=1}^{p} |\hat{w}_i - w_i|\,x_i^2
= \max_{\|x\|_2=1} \sum_{i=1}^{p} \Big|\frac{\hat{w}_i^2 - w_i^2}{\hat{w}_i + w_i}\Big|\,x_i^2
\le C_3 \max_{\|x\|_2=1}\sum_{i=1}^{p} |\hat{w}_i^2 - w_i^2|\,x_i^2 = C_3\,\|\hat{W}^2 - W^2\|,
\]
where we have used the fact that the true standard deviations are bounded well above zero, i.e., there exists $0 < C_3 < \infty$ such that $1/C_3 \le w_i^{-1} \le C_3$ for all $i = 1, 2, \ldots, p$, and that the sample standard deviations are all positive, i.e., $\hat{w}_i > 0$ for all $i = 1, 2, \ldots, p$. Now, since $\|\hat{W}^2 - W^2\| \asymp \|\hat{W} - W\|$, it follows that $\|\hat{W}\| = O(1)$, which implies that
\[
\|\hat{\Sigma}_c - \Sigma_0\|_2 = O_P\Big(\sqrt{\tfrac{s\log p}{n}} + \tfrac{\log p}{n}\Big).
\]
Hence Theorem 3.3 follows.

Proof of Theorem 3.5. The method of proof for the inverse covariance matrix is similar to that for covariance matrix estimation. We keep the notation similar to that in the proof of Theorem 3.2. Define
\[
Q(\Omega) = \|\Omega - S_\tau^{-1}\|_F^2 + \lambda\|\Omega^{-}\|_1 + \gamma\,\mathrm{tr}(AD^2 - 2tAD + t^2A)
- \|\Omega_0 - S_\tau^{-1}\|_F^2 - \lambda\|\Omega_0^{-}\|_1 - \gamma\,\mathrm{tr}(AD_0^2 - 2tAD_0 + t^2A), \tag{8.2}
\]
where $\Omega_0$ is the true inverse covariance matrix, $\Omega$ is any other inverse covariance matrix, $S_\tau := S + \tau I$ for some $\tau > 0$, $A = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{pp})$, and $\Omega = UDU^T$ and $\Omega_0 = U_0D_0U_0^T$ are the eigenvalue decompositions of $\Omega$ and $\Omega_0$, respectively, where $D$ and $D_0$ are the diagonal matrices of eigenvalues and $U$ and $U_0$ are the matrices of eigenvectors. Let $\Delta = \Omega - \Omega_0$ (the difference between any estimate $\Omega$ and the true inverse covariance matrix $\Omega_0$). Define the set of symmetric matrices $\Theta_n(M) = \{\Delta : \Delta = \Delta^T,\ \|\Delta\|_F = Mr_n,\ 0 < M < \infty\}$. The estimate $\hat{\Omega}$ minimizes $Q(\Omega)$, or equivalently $\hat{\Delta} = \hat{\Omega} - \Omega_0$ minimizes $G(\Delta) = Q(\Omega_0 + \Delta)$, where $G(\Delta)$ is convex. Note that if $\hat{\Delta}$ is a solution to $G(\Delta)$, then $G(\hat{\Delta}) \le G(0) = 0$. As argued in the proof of Theorem 3.2, if we can show that $G(\Delta)$ is non-negative for every $\Delta \in \Theta_n(M)$, this will imply that $\hat{\Delta}$ lies within the sphere of radius $Mr_n$. We require $r_n = \sqrt{(p+s)\log p/n} \to 0$ as $n \to \infty$. This gives consistency of our estimate in Frobenius norm at rate $O_P(r_n)$. Along similar lines as in the proof of Theorem 3.2, for $(\lambda, \gamma) \in \hat{R}_2^{S,\gamma,t,A}$, we obtain
\begin{align*}
G(\Delta) \ge{}& \mathrm{tr}(\Delta^T\Delta) - 2\,\mathrm{tr}\big\{\Delta(S_\tau^{-1} - \Omega_0)\big\} + \frac{C_1}{\epsilon}\sqrt{\tfrac{\log p}{n}}\big(\|\Delta^{-}_{H^c}\|_1 - \|\Delta^{-}_H\|_1\big) \\
&- \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big\{2\bar{k}\sqrt{p}\,\|\Delta\|_F + \|\Delta\|_F^2 + 2t\sqrt{p}\,\|\Delta^{+}\|_F\Big\},
\end{align*}
where $H$ is the index set defined in Assumption B1 and $H^c = \{(i,j) : (i,j) \notin H,\ i, j = 1, 2, \ldots, p\}$. Also $\|\Delta^{-}_H\|_1 \le \sqrt{s}\,\|\Delta^{-}\|_F$.

Consider the term involving $S_\tau^{-1}$:
\begin{align*}
\big|\mathrm{tr}\big\{\Delta(\Omega_0 - S_\tau^{-1})\big\}\big| &= \big|\mathrm{tr}\big\{\Delta S_\tau^{-1}(S_\tau - \Omega_0^{-1})\Omega_0\big\}\big|
\le \sigma_1(S_\tau^{-1})\,\big|\mathrm{tr}\big\{\Delta(S_\tau - \Omega_0^{-1})\Omega_0\big\}\big| \\
&\le \sigma_1(S_\tau^{-1})\,\big|\mathrm{tr}\big\{\Delta(S_\tau - \Omega_0^{-1})\big\}\big|\,\sigma_1(\Omega_0)
\le \bar{k}^2\,\big|\mathrm{tr}\big\{\Delta(S_\tau - \Omega_0^{-1})\big\}\big|,
\end{align*}
by using a result on trace-norm inequalities from [31]. Now consider the term $\mathrm{tr}\{\Delta(S_\tau - \Omega_0^{-1})\}$:
\begin{align*}
\mathrm{tr}\big\{\Delta(S_\tau - \Omega_0^{-1})\big\} &= \mathrm{tr}\big\{\Delta(S + \tau I - \Omega_0^{-1})\big\} = \mathrm{tr}\big\{\Delta(S - \Omega_0^{-1})\big\} + \tau\,\mathrm{tr}(\Delta) \\
&\le C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{+}\|_F + C_1\sqrt{\tfrac{\log p}{n}}\,\|\Delta^{-}\|_1 + C_1\sqrt{\tfrac{p\log p}{n}}\,\|\Delta^{+}\|_F,
\end{align*}
which holds with high probability by a result (Lemma 1) from Ravikumar et al. (2011) on the tail inequality for sub-gaussian random vectors,

where $\tau \asymp \sqrt{\log p/n}$ and $C_1$ is defined as in the proof of Theorem 3.2. We have,
\begin{align*}
G(\Delta) \ge{}& \|\Delta\|_F^2\Big(1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}}\Big) - 2\bar{k}^2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{+}\|_F \\
&- C_1\sqrt{\tfrac{\log p}{n}}\Big\{\frac{2\sqrt{p}\,(1+\bar{k})}{\epsilon_1}\,\|\Delta^{+}\|_F + \bar{k}^2\sqrt{s}\,\|\Delta^{-}\|_F + \frac{2(1+\bar{k}^2)}{\epsilon_1}\,\|\Delta^{-}\|_F\Big\} \\
\ge{}& \|\Delta^{+}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - 2\bar{k}^2C_1\sqrt{\tfrac{(p+s)\log p}{n}}\,\|\Delta^{+}\|_F^{-1}
- \frac{2C_1}{\epsilon_1}\sqrt{\tfrac{p\log p}{n}}\,(1+\bar{k})\,\|\Delta^{+}\|_F^{-1} - 2C_1\sqrt{\tfrac{\log p}{n}}\,(1+\bar{k}^2)\,\|\Delta^{+}\|_F^{-1}\Big] \\
&+ \|\Delta^{-}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - C_1\bar{k}^2\sqrt{\tfrac{s\log p}{n}}\,\|\Delta^{-}\|_F^{-1}
- \frac{2C_1}{\epsilon_1}\sqrt{\tfrac{p\log p}{n}}\,(1+\bar{k}^2)\,\|\Delta^{-}\|_F^{-1}\Big] \\
\ge{}& \|\Delta^{+}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - \frac{2\bar{k}^2C_1 + 2C_1(1+\bar{k})/\epsilon_1 + 2C_1(1+\bar{k}^2)}{M}\Big]
+ \|\Delta^{-}\|_F^2\Big[1 - \frac{C_1}{\epsilon_1}\sqrt{\tfrac{\log p}{n}} - \frac{C_1\bar{k}^2 + 2C_1(1+\bar{k}^2)/\epsilon_1}{M}\Big] \ge 0
\end{align*}
for all sufficiently large $n$ and $M$. Hence the result.

Proof of Corollary 3.3. The proof of this corollary is similar to that of Theorem 3.1 and is hence omitted.

Proof of Corollary 3.4. The proof of this corollary is similar to that of Corollary 3.2 and is hence omitted.

Proof of Theorem 3.4. The proof of this theorem is similar to that of Theorem 3.1 and is hence omitted.

Proof of Theorem 3.6. The proof of this theorem is similar to that of Theorem 3.3 and is hence omitted.

    8.1. Derivation of the Algorithm.

8.1.1. Covariance matrix estimation. The optimization problem (2.4) can be written as:
\[
\hat{\Sigma} = \underset{\Sigma = \Sigma^T,\ (\lambda,\gamma)\in \hat{R}_1^{S,\gamma,t,A}}{\arg\min}\ f(\Sigma), \tag{8.3}
\]
where
\[
f(\Sigma) = \|\Sigma - S\|_F^2 + \lambda\|\Sigma^{-}\|_1 + \gamma\sum_{i=1}^{p} a_i\{\sigma_i(\Sigma) - t\}^2.
\]
Note that for a non-negative definite square matrix, the singular values are the same as its eigenvalues, and we have the trace identity: the sum of the eigenvalues of $\Sigma$ equals $\mathrm{tr}(\Sigma)$.

Let $\Sigma = UDU^T$, where $D$ is the diagonal matrix of eigenvalues and $U$ is the orthogonal matrix of eigenvectors. We have $\sum_{i=1}^{p} a_i\sigma_i^2(\Sigma) = \sum_{i=1}^{p} a_iD_{ii}^2 = \mathrm{tr}(AD^2)$, where $A = \mathrm{diag}(a_1, a_2, \ldots, a_p)$. Again $D = U^T\Sigma U$ implies $D^2 = D^TD = U^T\Sigma^T\Sigma U = U^T\Sigma^2U$. Therefore
\[
\mathrm{tr}(AD) = \mathrm{tr}(\Sigma UAU^T) \qquad\text{and}\qquad \mathrm{tr}(AD^2) = \mathrm{tr}(AU^T\Sigma^2U) = \mathrm{tr}(\Sigma^2UAU^T).
\]
The third term on the right-hand side of (8.3) can be written as:
\[
\sum_{i=1}^{p} a_i\{\sigma_i(\Sigma) - t\}^2 = \sum_{i=1}^{p}\big\{a_i\sigma_i^2(\Sigma) - 2t\,a_i\sigma_i(\Sigma) + a_it^2\big\}
= \mathrm{tr}(\Sigma^2UAU^T) - 2t\,\mathrm{tr}(\Sigma UAU^T) + \sum_{i=1}^{p} a_it^2.
\]

Therefore,
\begin{align*}
f(\Sigma) ={}& \|\Sigma - S\|_F^2 + \lambda\|\Sigma^{-}\|_1 + \gamma\,\mathrm{tr}(\Sigma^2UAU^T) - 2\gamma t\,\mathrm{tr}(\Sigma UAU^T) + \gamma\sum_{i=1}^{p} a_it^2 \\
={}& \mathrm{tr}(\Sigma^T\Sigma) - 2\,\mathrm{tr}(\Sigma^TS) + \mathrm{tr}(S^TS) + \lambda\|\Sigma^{-}\|_1 + \gamma\,\mathrm{tr}(\Sigma^2UAU^T) - 2\gamma t\,\mathrm{tr}(\Sigma UAU^T) + \gamma t^2\,\mathrm{tr}(A) \\
={}& \mathrm{tr}\big\{\Sigma^2(I + \gamma UAU^T)\big\} - 2\,\mathrm{tr}\big\{\Sigma(S + \gamma t\,UAU^T)\big\} + \mathrm{tr}(S^TS) + \lambda\|\Sigma^{-}\|_1 + \gamma t^2\,\mathrm{tr}(A) \\
={}& \mathrm{tr}(\Sigma^2C) - 2\,\mathrm{tr}(\Sigma B) + \mathrm{tr}(S^TS) + \lambda\|\Sigma^{-}\|_1 + \gamma t^2\,\mathrm{tr}(A) \\
={}& \mathrm{tr}\big\{(\Sigma^2 - 2\Sigma BC^{-1})C\big\} + \mathrm{tr}(S^TS) + \lambda\|\Sigma^{-}\|_1 + \gamma t^2\,\mathrm{tr}(A),
\end{align*}
where $I$ is the identity matrix, $C = I + \gamma UAU^T$ and $B = S + \gamma t\,UAU^T$. Note that $UAU^T = UA^{1/2}A^{1/2}U^T = (UA^{1/2})(UA^{1/2})^T$ is a positive definite matrix. Since $\gamma$ is non-negative, $C$ is the sum of two positive definite matrices and is therefore positive definite. Also $C^{-1} = U(I + \gamma A)^{-1}U^T$ and $\sigma_1(C) \le 1 + \gamma\max_{i\le p} A_{ii}$.
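As a quick numerical sanity check of the trace identities used above (not part of the derivation itself), the following sketch verifies that $\gamma\sum_i a_i\{\sigma_i(\Sigma)-t\}^2$ agrees with $\gamma\,\mathrm{tr}(\Sigma^2UAU^T) - 2\gamma t\,\mathrm{tr}(\Sigma UAU^T) + \gamma t^2\,\mathrm{tr}(A)$ when $U$ is taken from the eigenvalue decomposition of $\Sigma$; the matrices and constants are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p, gamma, t = 5, 0.5, 1.0

# A random positive definite Sigma and non-negative weights a summing to one
X = rng.standard_normal((20, p))
Sigma = X.T @ X / 20 + 0.1 * np.eye(p)
a = rng.random(p)
a /= a.sum()
A = np.diag(a)

# Eigenvalue decomposition Sigma = U diag(d) U^T; d plays the role of sigma_i(Sigma)
d, U = np.linalg.eigh(Sigma)

lhs = gamma * np.sum(a * (d - t) ** 2)
UAUt = U @ A @ U.T
rhs = (gamma * np.trace(Sigma @ Sigma @ UAUt)
       - 2 * gamma * t * np.trace(Sigma @ UAUt)
       + gamma * t ** 2 * np.trace(A))
assert np.isclose(lhs, rhs)   # both sides agree numerically
```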

Consider the term involving only $\Sigma$:
\begin{align*}
f_1(\Sigma) &= \mathrm{tr}\big\{(\Sigma^2 - 2\Sigma BC^{-1})C\big\} + \lambda\|\Sigma^{-}\|_1 \\
&\le \mathrm{tr}\big(\Sigma^2 - 2\Sigma BC^{-1}\big)\,\sigma_1(C) + \lambda\|\Sigma^{-}\|_1 \\
&= \|\Sigma - BC^{-1}\|_F^2\,(1 + \gamma\max_{i\le p}A_{ii}) + \lambda\|\Sigma^{-}\|_1 \quad\text{(up to an additive constant not involving $\Sigma$)} \\
&= (1 + \gamma\max_{i\le p}A_{ii})\Big\{\|\Sigma - BC^{-1}\|_F^2 + \frac{\lambda}{1 + \gamma\max_{i\le p}A_{ii}}\,\|\Sigma^{-}\|_1\Big\} =: f_2(\Sigma),
\end{align*}
where
\[
f_2(\Sigma) = (1 + \gamma\max_{i\le p}A_{ii})\Big\{\|\Sigma - BC^{-1}\|_F^2 + \frac{\lambda}{1 + \gamma\max_{i\le p}A_{ii}}\,\|\Sigma^{-}\|_1\Big\}.
\]
The function $f_2(\Sigma)$ is convex in $\Sigma$ and therefore its minimizer is unique. Note that for arbitrary choices of $\lambda$ and $\gamma$, minimization of $f_2(\Sigma)$ can yield a non-positive-definite estimator. However, as argued earlier, values of $(\lambda, \gamma) \in \hat{R}_1^{S,\gamma,t,A}$ yield a sparse, well-conditioned and positive definite estimator. Clearly the minimum of $f_2(\Sigma)$ is obtained for
\[
\mathrm{sign}(\Sigma_{ij}) = \mathrm{sign}(\Sigma_{ji}) = \mathrm{sign}\big((BC^{-1})_{ij}\big). \tag{8.4}
\]
Differentiating $f_2(\Sigma)$ yields
\[
\frac{\partial f_2(\Sigma)}{\partial \Sigma} = 2\Sigma - 2BC^{-1} + \frac{\lambda}{1 + \gamma\max_{i\le p}A_{ii}}\,\mathrm{sign}(\Sigma^{-}) = 0.
\]
Using the optimality condition (8.4), we have
\[
\hat{\Sigma}_{ii} = (BC^{-1})_{ii}, \qquad
\hat{\Sigma}_{ij} = (BC^{-1})_{ij} - \frac{\lambda}{2(1 + \gamma\max_{i\le p}A_{ii})}\,\mathrm{sign}\big((BC^{-1})_{ij}\big) \quad\text{for } i \neq j. \tag{8.5}
\]
Note that the estimate $\hat{\Sigma}$ involves the matrix of eigenvectors $U$. Since, for a given eigenvalue, the eigenvectors are not unique, we can choose a suitable matrix of eigenvectors corresponding to some positive definite covariance matrix. One choice is $U = U_1$, where $S + \epsilon I = U_1D_1U_1^T$ for some $\epsilon > 0$. Next, to check whether the solution of $f_2(\Sigma)$ given by (8.5) is feasible, consider:

Case (i): $\Sigma_{ij} \ge 0$. The solution (8.5) satisfies the optimality condition (8.4) if and only if $(BC^{-1})_{ij} \ge \dfrac{\lambda}{2(1 + \gamma\max_{i\le p}A_{ii})}$.

Case (ii): $\Sigma_{ij} < 0$. As in Case (i), the solution (8.5) satisfies the optimality condition (8.4) if and only if $(BC^{-1})_{ij} < -\dfrac{\lambda}{2(1 + \gamma\max_{i\le p}A_{ii})}$.

Note that $BC^{-1}$ need not be symmetric. To get a symmetric estimate, we symmetrize it as follows:
\[
M = \frac{1}{2}\big(BC^{-1} + (BC^{-1})^T\big).
\]
Combining these two cases, the optimal solution of (8.3) is given by:
\[
\hat{\Sigma}_{ii} = M_{ii}, \qquad
\hat{\Sigma}_{ij} = \mathrm{sign}(M_{ij})\,\max\Big\{|M_{ij}| - \frac{\lambda}{2(1 + \gamma\max_{i\le p}A_{ii})},\ 0\Big\}, \quad i \neq j, \tag{8.6}
\]
where $\mathrm{sign}(x)$ is the sign of $x$ and $|x|$ is the absolute value of $x$.
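To make the closed form (8.6) concrete, here is a minimal sketch of the resulting update in Python/NumPy. It is not the reference implementation; the function name is ours, and it assumes the weight vector $a$ (the diagonal of $A$), the tuning parameters $\lambda$, $\gamma$, $t$ and the small ridge constant $\epsilon$ used to form $U$ are supplied by the user.

```python
import numpy as np

def jpen_cov(S, a, lam, gam, t, eps=1e-3):
    """Sketch of the closed-form JPEN covariance update in (8.6).

    S   : p x p sample covariance matrix
    a   : length-p vector of eigenvalue weights (diagonal of A)
    lam : l1 penalty parameter (lambda)
    gam : eigenvalue penalty parameter (gamma)
    t   : target value for the eigenvalues
    eps : ridge constant; U is taken from the spectrum of S + eps*I
    """
    p = S.shape[0]
    # Choice of U: eigenvectors of S + eps*I (see "Choice of U" below)
    _, U = np.linalg.eigh(S + eps * np.eye(p))
    A = np.diag(a)

    # B = S + gam*t*U A U^T and C^{-1} = U (I + gam*A)^{-1} U^T
    UAUt = U @ A @ U.T
    B = S + gam * t * UAUt
    C_inv = U @ np.diag(1.0 / (1.0 + gam * a)) @ U.T

    # Symmetrize B C^{-1} and soft-threshold its off-diagonal entries
    M = B @ C_inv
    M = 0.5 * (M + M.T)
    thr = lam / (2.0 * (1.0 + gam * a.max()))
    Sigma_hat = np.sign(M) * np.maximum(np.abs(M) - thr, 0.0)
    np.fill_diagonal(Sigma_hat, np.diag(M))   # diagonal entries are not thresholded
    return Sigma_hat
```

In this sketch the entries of `a` are paired with the columns of `U`, i.e., with the ascending eigenvalue order returned by `numpy.linalg.eigh`; the weighting scheme in Section 8.2.1 orders the eigenvalues in decreasing order, so the weight vector would be reversed accordingly.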

Choice of U: Note that $U$ is the matrix of eigenvectors of $\Sigma$, which is unknown. In practice, one can choose $U$ as the matrix of eigenvectors from the eigenvalue decomposition of $S + \epsilon I$ for some $\epsilon > 0$; i.e., let $S + \epsilon I = U_1D_1U_1^T$ and take $U = U_1$.

Choice of $\lambda$ and $\gamma$: For a given value of $\gamma$, we can find the value of $\lambda$ satisfying
\[
\lambda < 2\,(1 + \gamma\min_{i\le p}A_{ii})\Big\{\frac{\sigma_{\min}(S)}{1 + \gamma\max_{i\le p}A_{ii}} + \gamma t\min_{i\le p}\frac{A_{ii}}{1 + \gamma A_{ii}} - \delta\Big\},
\]
and such a choice of $(\lambda, \gamma)$ guarantees that the minimum eigenvalue of the estimate $\hat{\Sigma}$ is at least $\delta > 0$, so that $(\lambda, \gamma) \in \hat{R}_1^{S,\gamma,t,A}$. In practice one might choose a higher value of $\lambda$ that still corresponds to a sparse and positive definite covariance matrix.
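As an illustration of this tuning rule, the admissible upper bound for $\lambda$ can be computed directly from the sample covariance matrix; this is a sketch under the assumption that the weights $a$ and the constants $\gamma$, $t$ and $\delta$ have already been chosen, and the function name is ours.

```python
import numpy as np

def lambda_upper_bound(S, a, gam, t, delta):
    """Upper bound on lambda from the 'Choice of lambda and gamma' rule above."""
    sigma_min_S = np.linalg.eigvalsh(S).min()    # smallest sample eigenvalue (0 when p > n)
    inner = (sigma_min_S / (1.0 + gam * a.max())
             + gam * t * np.min(a / (1.0 + gam * a))
             - delta)
    return 2.0 * (1.0 + gam * a.min()) * inner
```

Any $\lambda$ strictly below this bound keeps the smallest eigenvalue of $\hat{\Sigma}$ at least $\delta$ according to the rule above.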

    8.2. Simulation Results.

8.2.1. Choice of weight matrix A. For $p > n$, $p - n$ sample eigenvalues are identically equal to zero, and many of the non-zero eigenvalues are approximately zero as well. Our simulation analysis shows that if we shrink each eigenvalue towards a fixed constant (i.e., apply the same amount of shrinkage to each of the sample eigenvalues), the smaller eigenvalues are shrunk upward heavily, away from the true eigenvalues. Therefore we choose non-uniform weights for the eigenvalues to avoid over-shrinkage. Note that, given a priori knowledge of the dispersion of the eigenvalues, one might be able to find better weights. Here we do not assume knowledge of any structure among the eigenvalues and choose the weights according to the following scheme (we assume all the eigenvalues are ordered in decreasing order of magnitude):

i) Let $t$ be the average of the sample eigenvalues, and let $k$ be the index at which the ordered eigenvalues first drop below $t$. Let $r = p/n$ and $b_1 = \max(\mathrm{diag}(S))\,(1 + \sqrt{p/n})^2$.

ii) For $j = 1$ to $p$,
\[
c_j = b_j\,\big(1 + 0.005\,\log(1 + r)\big)^{|j-k|}, \qquad b_{j+1} = b_j^2 / c_j.
\]

iii)
\[
A = \mathrm{diag}(a_1, a_2, \ldots, a_p), \quad\text{where } a_j = c_j\Big/\sum_{j=1}^{p} c_j,
\]
where $|x|$ denotes the absolute value of $x$. Such a choice of weights allows more shrinkage of the extreme sample eigenvalues than of those in the center of the eigen-spectrum. The logarithmic term scales the weights; this is an arbitrary choice which has worked in our simulation settings. A small sketch of this weighting scheme is given below.
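The following sketch implements the scheme in steps i)-iii); it assumes that $|j-k|$ enters step ii) as an exponent, and the function name is ours.

```python
import numpy as np

def jpen_weights(S, n):
    """Sketch of the eigenvalue weighting scheme in steps i)-iii) above."""
    p = S.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # decreasing order
    t = eigvals.mean()
    k = int(np.argmax(eigvals < t))                  # first index below the average t
    r = p / n
    factor = 1.0 + 0.005 * np.log(1.0 + r)

    b = np.diag(S).max() * (1.0 + np.sqrt(p / n)) ** 2   # b_1
    c = np.empty(p)
    for j in range(p):
        c[j] = b * factor ** abs(j - k)
        b = b ** 2 / c[j]                            # b_{j+1}
    return c / c.sum()                               # weights a_j sum to one
```

For example, `a = jpen_weights(S, n)` produces a weight vector that can be passed as the diagonal of $A$ to the `jpen_cov` sketch given earlier.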

Figure 8.1 shows the heatmap of zero recovery (sparsity) for block and Toeplitz type covariance matrices, based on 50 realizations with $n = 50$ and $p = 50$. The JPEN estimate of the covariance matrix recovers roughly 80% of the true zeros for the Toeplitz and block type covariance matrices. The proposed estimator also recovers the true structure of the non-zero entries and any pattern among the rows/columns of the covariance matrix.

    and Toeplitz type covariance matrices based on 50 realizations for n=50 andp=50. The JPEN estimate of covariance matrix recovers the true zeros forabout 80% for Toeplitz and block type of covariance matrices. Our proposedestimator also reect the recovery of true structure of non-zero entries andany pattern among the rows/columns of covariance matrix.


Fig 8.1. Heatmap of the zeros identified in the covariance matrix out of 50 realizations. A whitish cell indicates 50/50 zeros identified; a blackish cell indicates 0/50 zeros identified.

Table 8.1 gives the average relative errors and their standard errors for the covariance matrix estimates based on glasso, Ledoit-Wolf, PDSCE and JPEN for $n = 100$ and $p = 500, 1000$. The glasso estimate of the covariance matrix performs very poorly among all the methods. The Ledoit-Wolf estimate performs well, but it is generally not sparse; moreover, its eigenvalue estimates are shrunk towards the center far more heavily than the true eigenvalues. The JPEN estimator outperforms the other estimators for most values of $p$ for all four types of covariance matrices. The PDSCE estimates have low average relative errors, close to those of JPEN; this could be due to the fact that PDSCE and JPEN both use a quadratic loss function, with different penalty functions. Table 8.2 reports the average relative errors and their standard deviations for inverse covariance matrix estimation. Here we do not include the Ledoit-Wolf estimator and compare only the glasso and PDSCE estimates with the proposed JPEN estimator. The JPEN estimate of the inverse covariance matrix outperforms the other methods for both $p = 500$ and $p = 1000$, for all four types of structured inverse covariance matrices.

8.2.2. Covariance Matrix Estimation.

Table 8.1
Covariance matrix estimation for n = 100: average relative errors (standard errors in parentheses)

                Hub type matrix              Neighborhood type matrix
                p=500          p=1000        p=500           p=1000
Ledoit-Wolf     1.07(0.165)    3.47(0.0477)  1.1(0.0331)     2.32(0.0262)
Glasso          9.07(0.167)    10.2(0.022)   9.61(0.0366)    10.4(0.0238)
PDSCE           1.48(0.0709)   2.03(0.0274)  0.844(0.0331)   1.8(0.0263)
JPEN            0.854(0.0808)  1.82(0.0273)  0.846(0.0332)   1.7(0.0263)

                Block type matrix            Toeplitz type matrix
                p=500          p=1000        p=500           p=1000
Ledoit-Wolf     4.271(0.0394)  2.18(0.11)    1.967(0.041)    2.344(0.028)
Glasso          9.442(0.0438)  30.4(0.0875)  12.741(0.051)   18.221(0.0398)
PDSCE           0.941(0.0418)  1.66(0.11)    0.873(0.0415)   1.82(0.028)
JPEN            0.887(0.0411)  1.66(0.11)    0.707(0.0416)   1.816(0.0282)

8.3. Inverse Covariance Matrix Estimation.

Table 8.2
Inverse covariance matrix estimation for n = 100: average relative errors (standard deviations in parentheses)

                Hub type matrix              Neighborhood type matrix
                p=500          p=1000        p=500           p=1000
Glasso          9.82(0.0212)   10.9(0.0204)  12.365(0.0176)  13.084(0.0178)
PDSCE           1.13(0.0269)   2.07(0.0238)  1.74(0.0549)    3.79(0.0676)
JPEN            0.138(0.0153)  0.856(0.0251) 0.260(0.0234)   1.208(0.0277)

                Block type matrix            Toeplitz type matrix
                p=500          p=1000        p=500           p=1000
Glasso          12.4(0.0266)   13.1(0.0171)  19.3(0.0271)    20.7(0.0227)
PDSCE           0.993(0.0375)  1.83(0.0251)  1.89(0.0465)    3.79(0.0382)
JPEN            0.355(0.0319)  1.18(0.0258)  1.24(0.0437)    3.18(0.0432)

Ashwini Maurya
Department of Statistics and Probability
Michigan State University
East Lansing, MI 48824-1027
U.S.A.
E-mail: [email protected]
