A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics

Statistical Applications in Genetics and Molecular Biology
Volume 4, Issue 1 (2005), Article 32

Juliane Schäfer, Department of Statistics, University of Munich, Germany
Korbinian Strimmer, Department of Statistics, University of Munich, Germany

Recommended Citation: Schäfer, Juliane and Strimmer, Korbinian (2005) “A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics,” Statistical Applications in Genetics and Molecular Biology: Vol. 4: Iss. 1, Article 32. DOI: 10.2202/1544-6115.1175

©2005 by the authors. All rights reserved.


Abstract

Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative, we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity.

Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches, both in simulations and in application to real expression data.

KEYWORDS: Shrinkage, covariance estimation, “small n, large p” problem, graphical Gaussian model (GGM), genetic network, gene expression.

Author Notes: We thank W. Schmidt-Heck, R. Guthke and K. Bayer for kindly providing us with the E. coli data and for discussion. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) via an Emmy Noether research grant to K.S.


1. Introduction

Estimation of large-scale covariance matrices is a common though often implicit task in functional genomics and transcriptome analysis:

• For instance, consider the clustering of genes using data from a microarray experiment (e.g. Eisen et al., 1998). In order to construct a hierarchical tree describing the functional grouping of genes, an estimate of the similarities between all pairs of expression profiles is needed. This is typically based on a distance measure related to the empirical correlation. Thus, if p genes are being analyzed (with p perhaps in the order of 1,000 to 10,000), a covariance matrix of size p × p has to be calculated.

• Another example is the construction of so-called relevance networks (Butte et al., 2000). These visually represent the marginal (in)dependence structure among the p genes. The networks are built by drawing edges between those pairs of genes whose absolute pairwise correlation coefficients exceed a pre-specified threshold (say, 0.8).

• Related to gene relevance networks (though conceptually quite different) are gene association networks. These are graphical models that have recently been suggested as a means of displaying the conditional dependencies among the considered genes (e.g., Toh and Horimoto, 2002; Dobra et al., 2004; Schäfer and Strimmer, 2005a). An essential input to inferring such a network is the p × p covariance matrix.

• Furthermore, the covariance matrix evidently plays an important role in the classification of genes.

• In addition, there are numerous bioinformatics algorithms that rely on the pairwise correlation coefficient as part of an (often rather ad hoc) optimality score.

Thus, a common key problem in all of these examples is as follows: how should one obtain an accurate and reliable estimate of the population covariance matrix Σ if presented with a data set that describes a large number of variables but only contains comparatively few samples (n ≪ p)?

In the vast majority of analysis problems in bioinformatics (specifically excluding classification) the simple solution is to rely either on the maximum likelihood estimate $S_{ML}$ or on the related unbiased empirical covariance matrix $S = \frac{n}{n-1} S_{ML}$,


with entries defined as

$$
s_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad (1)
$$

where $\bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}$ and $x_{ki}$ is the $k$-th observation of the variable $X_i$. Unfortunately, both $S$ and $S_{ML}$ exhibit serious defects in the “small n, large p” data setting commonly encountered in functional genomics. Specifically, in this case the empirical covariance matrix can no longer be considered a good approximation of the true covariance matrix (this is true also for moderately sized data with n ≈ p).
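This rank deficiency is easy to verify empirically. The following small R sketch (our own illustration; variable names are ours) simulates a data set with n < p and counts the non-zero eigenvalues of the sample covariance matrix:

```r
# Minimal R sketch: for n < p the sample covariance matrix is singular.
set.seed(1)
p <- 100; n <- 20
X <- matrix(rnorm(n * p), nrow = n)  # n observations of p variables
S <- cov(X)                          # unbiased empirical covariance (Eq. 1)
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
sum(ev > 1e-10)                      # at most n - 1 = 19 non-zero eigenvalues
# solve(S)  # would fail: S is singular and hence cannot be inverted
```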

For illustration consider Fig. 1, where the sample covariance estimator S is compared with an alternative estimator S* developed in Section 2 of this paper and summarized in Tab. 1. This figure shows the sorted eigenvalues of the estimated matrices in comparison with the true eigenvalues, for fixed p = 100 and various ratios p/n. It is evident by inspection that for small n the eigenvalues of S differ greatly from the true eigenvalues of Σ. In addition, for n smaller than p (bottom row in Fig. 1) S loses its full rank as a growing number of eigenvalues become zero. This has several undesirable consequences: first, S is no longer positive definite, and second, it cannot be inverted as it becomes singular. For comparison, contrast the poor performance of S with that of S* (fat green line in Fig. 1). This improved estimator exhibits none of the defects of S; in particular it is more accurate, well-conditioned, and always positive definite. Nevertheless, S* can be computed in only about twice the time required to calculate S.

With this paper we pursue three aims. First, we argue against the blind use of the empirical covariance matrix S in data situations where it is not appropriate – noting that this affects many current application areas in bioinformatics. Second, we describe a route to obtaining improved estimates of the covariance matrix via shrinkage, combined with analytic determination of the shrinkage intensity according to the Ledoit-Wolf theorem (Ledoit and Wolf, 2003). Third, we show that this new regularized estimator greatly enhances inference of gene association networks.

The remainder of the paper is organized as follows. In the next section we provide an overview of shrinkage, the Ledoit-Wolf lemma, and its application to shrinkage of covariance matrices. We discuss several potentially useful lower-dimensional targets (cf. Tab. 2), with special focus on the “diagonal, unequal variance” model. In the second part of the paper, we review methodology for inferring large-scale genetic networks (relevance and association networks). We conduct computer simulations to show that using S* in genetic network model selection is highly advantageous in terms of power and other performance criteria. Finally, we illustrate the described approach by analyzing a real gene expression data set from E. coli.


Figure 1: Ordered eigenvalues of the sample covariance matrix S (thin black line) and of the alternative estimator S* (fat green line; for definition see Tab. 1), calculated from simulated data with underlying p-variate normal distribution, for p = 100 and various ratios p/n (0.1, 0.5, 2, and 10). The true eigenvalues are indicated by a thin black dashed line.


“Small n, Large p” Covariance and Correlation Estimators S* and R*:

$$
s^*_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ r^*_{ij} \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \end{cases}
\qquad \text{and} \qquad
r^*_{ij} = \begin{cases} 1 & \text{if } i = j \\ r_{ij} \min\bigl(1, \max(0, 1 - \hat{\lambda}^*)\bigr) & \text{if } i \neq j \end{cases}
$$

with

$$
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(r_{ij})}{\sum_{i \neq j} r_{ij}^2} .
$$

Table 1: Small sample shrinkage estimators of the unrestricted covariance and correlation matrix suggested in this paper (Section 2.4). The coefficients s_ii and r_ij denote the empirical variance (unbiased) and correlation, respectively. For details of the computation of Var(r_ij) see Appendix A. Further variants of these estimators are discussed in Section 2.4.
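The estimator of Tab. 1 is simple enough to be written out in a few lines of R. The sketch below is our own illustration (the released implementation is the “corpcor” package, see Appendix B); Var-hat(r_ij) is obtained by applying the formulae of Appendix A to the standardized data matrix:

```r
# Sketch of the Tab. 1 shrinkage estimators R* and S* (target D).
shrink.estimate <- function(X) {
  n <- nrow(X); p <- ncol(X)
  Xs <- scale(X)                     # standardized data
  R  <- cor(X)
  VR <- matrix(0, p, p)              # Var-hat(r_ij) via Appendix A
  for (i in 1:p) for (j in 1:p) {
    w <- Xs[, i] * Xs[, j]           # w_kij for the standardized data
    VR[i, j] <- n / (n - 1)^3 * sum((w - mean(w))^2)
  }
  off <- row(R) != col(R)
  lambda <- sum(VR[off]) / sum(R[off]^2)   # Eq. 10
  lambda <- max(0, min(1, lambda))         # truncation (Section 2.3)
  Rstar <- (1 - lambda) * R
  diag(Rstar) <- 1
  v <- apply(X, 2, var)                    # unshrunken variances s_ii
  Sstar <- Rstar * sqrt(outer(v, v))       # s*_ij = r*_ij sqrt(s_ii s_jj)
  list(cor = Rstar, cov = Sstar, lambda = lambda)
}
```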


2. Shrinkage estimation of covariance matrices in a “small n, large p” setting

2.1. Strategies for obtaining a more efficient covariance estimator

It has long been known that the two widely-employed estimators of the covariance matrix, i.e. the unbiased ($S$) and the related maximum likelihood ($S_{ML}$) estimator, are both statistically inefficient. In a nutshell, this can be explained as a consequence of the so-called “Stein phenomenon” discovered by Stein (1956) in the context of estimating the mean vector of a multivariate normal distribution. Stein demonstrated that in high-dimensional inference problems it is often possible to improve (sometimes dramatically!) upon the maximum likelihood estimator. This result is at first counterintuitive, as maximum likelihood can be proven to be asymptotically optimal, and as such it seems not unreasonable to expect that these favorable properties of maximum likelihood also extend to the case of finite data. However, further insight into the Stein effect is provided by Efron (1982), who points out that one needs to distinguish between two different aspects of maximum likelihood inference: first, maximum likelihood as a means of summarizing the observed data and producing a maximum likelihood summary (MLS); and second, maximum likelihood as a procedure to obtain a maximum likelihood estimate (MLE). The conclusion is that maximum likelihood is unassailable as a data summarizer but that it has some clear defects as an estimating procedure.

This applies directly to the estimation of covariance matrices: $S_{ML}$ constitutes the best estimator in terms of actual fit to the data, but for medium to small data sizes it is far from being the optimal estimator for recovering the population covariance – as is well illustrated by Fig. 1. Fortunately, the Stein theorem also demonstrates that it is possible to construct a procedure for improved covariance estimation. In addition to increased efficiency and accuracy, it is desirable for such a method to exhibit the following characteristics not found in $S$ and $S_{ML}$:

1. The estimate should always be positive definite, i.e. all eigenvalues should be strictly positive.

2. The estimated covariance matrix should be well-conditioned.

The positive-definiteness requirement is an intrinsic property of the true covariance matrix that is satisfied as long as the considered random variables have non-zero variance. If a matrix is well-conditioned, i.e. if the ratio of its maximum and minimum singular values is not too large, it has full rank and can be easily inverted.


Thus, by producing a well-conditioned covariance estimate one automatically also obtains an equally well-conditioned estimate of the inverse covariance – a quantity of crucial importance, e.g., in interval estimation, classification, and graphical models.

A (naive) strategy to obtain a positive definite estimator of the covariance runs as follows: take the sample covariance S and apply, e.g., the algorithm by Higham (1988). This will adjust all eigenvalues to be larger than some prespecified threshold ε and thus guarantee positive definiteness. However, the resulting matrix will not be well-conditioned.

Another, more general procedure to obtain an improved covariance estimator is variance reduction. Consider the well-known bias-variance decomposition of the mean squared error (MSE) for the sample covariance, i.e.

$$
\text{MSE}(S) = \text{Bias}(S)^2 + \text{Var}(S). \qquad (2)
$$

As Bias(S) = 0 by construction, the only way to decrease the MSE of S, and hence increase its overall accuracy, is by reducing its variance. A simple non-parametric approach to variance reduction is offered, e.g., by bootstrap aggregation (“bagging”) of the empirical covariance matrix, as sketched below. This can be done by explicitly approximating the expectation E(S) via the bootstrap. In previous work (Schäfer and Strimmer, 2005a) we resorted to this strategy to produce improved estimates of the correlation matrix and its inverse. However, especially for the very large dimensions commonly encountered in genomics problems (often with p > 1,000), this approach is computationally by far too demanding.

Instead, in this paper we investigate “shrinking”, or more generally “biased estimation” (e.g., Hoerl and Kennard, 1970a,b; Efron, 1975; Efron and Morris, 1975, 1977), as a means of variance reduction for S. In particular, we consider a recent analytic result from Ledoit and Wolf (2003) that allows one to construct an improved covariance estimator that is not only suitable for small sample size n and large numbers of variables p but at the same time is also completely inexpensive to compute.

2.2. Shrinkage estimation and the lemma of Ledoit-Wolf

In this section we briefly review the general principles behind shrinkage estimation and discuss an analytic approach by Ledoit and Wolf (2003) for determining the optimal shrinkage level. We note that the theory outlined here is not restricted to covariance estimation but applies generally to large-dimensional estimation problems.

Let $\Psi = (\psi_1, \ldots, \psi_p)$ denote the parameters of the unrestricted high-dimensional model of interest, and $\Theta = (\theta_i)$ the matching parameters of a lower-dimensional restricted submodel.


For instance, $\Psi$ could be the mean vector of a p-dimensional multivariate normal, and $\Theta$ the vector of a corresponding constrained submodel where the means are all assumed to be equal (i.e. $\theta_1 = \theta_2 = \cdots = \theta_p$). By fitting each of the two different models to the observed data, associated estimates $U = \hat{\Psi}$ and $T = \hat{\Theta}$ are obtained. Clearly, the unconstrained estimate U will exhibit a comparatively high variance due to the larger number of parameters that need to be fitted, whereas its low-dimensional counterpart T will have lower variance but potentially also considerable bias as an estimator of the true $\Psi$.

Instead of choosing between one of these two extremes, the linear shrinkage approach suggests combining both estimators in a weighted average

$$
U^* = \lambda T + (1 - \lambda) U, \qquad (3)
$$

where $\lambda \in [0, 1]$ denotes the shrinkage intensity. Note that for $\lambda = 1$ the shrinkage estimate equals the shrinkage target T, whereas for $\lambda = 0$ the unrestricted estimate U is recovered. The key advantage of this construction is that it offers a systematic way to obtain a regularized estimate $U^*$ that outperforms the individual estimators U and T both in terms of accuracy and statistical efficiency.

A key question in this procedure is how to select an optimal value for the shrinkage parameter. In some instances, it may suffice to fix the intensity $\lambda$ at some given value, or to make it depend on the sample size according to a simple function. Often more appropriate, however, is to choose the parameter $\lambda$ in a data-driven fashion by explicitly minimizing a risk function

$$
R(\lambda) = \text{E}(L(\lambda)) = \text{E}\Bigl( \sum_{i=1}^{p} (u_i^* - \psi_i)^2 \Bigr), \qquad (4)
$$

here for example the mean squared error (MSE).

One common but computationally very intensive approach to estimating the minimizing $\lambda$ is cross-validation – for an example see Friedman (1989), where shrinkage is applied in the context of regularized classification. Another widely applied route to inferring $\lambda$ views the shrinkage problem in an empirical Bayes context. In this case the quantity E(T) is interpreted as prior mean and $\lambda$ as a hyper-parameter that may be estimated from the data by optimizing the marginal likelihood (e.g., Morris, 1983; Greenland, 2000).

It is less well known that the optimal regularization parameter $\lambda$ may often also be determined analytically. Specifically, Ledoit and Wolf (2003) recently derived a simple theorem for choosing $\lambda$ that guarantees minimal MSE without the need to specify any underlying distributions, and without requiring computationally expensive procedures such as MCMC, the bootstrap, or cross-validation.


This lemma is obtained in a straightforward fashion. Assuming that the first two moments of the distributions of U and T exist, the squared error loss risk function from Eq. 4 may be expanded as follows:

$$
\begin{aligned}
R(\lambda) &= \sum_{i=1}^{p} \text{Var}(u_i^*) + \bigl[\text{E}(u_i^*) - \psi_i\bigr]^2 \\
&= \sum_{i=1}^{p} \text{Var}\bigl(\lambda t_i + (1-\lambda) u_i\bigr) + \bigl[\text{E}\bigl(\lambda t_i + (1-\lambda) u_i\bigr) - \psi_i\bigr]^2 \\
&= \sum_{i=1}^{p} \lambda^2 \,\text{Var}(t_i) + (1-\lambda)^2 \,\text{Var}(u_i) + 2\lambda(1-\lambda)\,\text{Cov}(u_i, t_i) + \bigl[\lambda\,\text{E}(t_i - u_i) + \text{Bias}(u_i)\bigr]^2 .
\end{aligned}
\qquad (5)
$$

Analytically minimizing this function with respect to $\lambda$ gives, after some tedious algebraic calculations, the following expression for the optimal value

$$
\lambda^* = \frac{\sum_{i=1}^{p} \text{Var}(u_i) - \text{Cov}(t_i, u_i) - \text{Bias}(u_i)\,\text{E}(t_i - u_i)}{\sum_{i=1}^{p} \text{E}\bigl[(t_i - u_i)^2\bigr]}, \qquad (6)
$$

for which the minimum MSE $R(\lambda^*)$ is achieved. It can be shown that $\lambda^*$ always exists and that it is unique. If U is an unbiased estimator of $\Psi$ with $\text{E}(U) = \Psi$, this equation reduces to

$$
\lambda^* = \frac{\sum_{i=1}^{p} \text{Var}(u_i) - \text{Cov}(t_i, u_i)}{\sum_{i=1}^{p} \text{E}\bigl[(t_i - u_i)^2\bigr]}, \qquad (7)
$$

which is – apart from some further algebraic simplification – the expression given in Ledoit and Wolf (2003).

Closer inspection of Eq. 6 yields a number of insights into how the optimal shrinkage intensity is chosen:

• First, the smaller the variance of the high-dimensional estimate U, the smaller becomes $\lambda^*$. Therefore, with increasing sample size the influence of the target T diminishes.

• Second, $\lambda^*$ also depends on the correlation between the estimation error of U and that of T. If both are positively correlated then the weight put on the shrinkage target decreases. Hence, the inclusion of the second term in the numerator of Eq. 6 adjusts for the fact that the two estimators U and T are both inferred from the same data set. It also takes into account that the “prior” information associated with T is not independent of the given data.


• Third, with increasing mean squared difference between U and T (in the denominator of Eq. 6) the weight $\lambda^*$ also decreases. Note that this automatically protects the shrinkage estimate $U^*$ against a misspecified target T.

• Fourth, if the unconstrained estimator is biased, and the bias points already towards the target, the shrinkage intensity is correspondingly reduced.

Furthermore, it is noteworthy that variables that by design are kept identical in the constrained and unconstrained estimators (i.e. $t_i = u_i$ for some i) play no role in determining the intensity $\lambda^*$, as their contributions to the various terms in Eq. 6 cancel out.

This can be generalized further by allowing multiple targets, each with its own different optimal shrinkage intensity. This is especially appropriate if there exists a natural grouping of parameters in the investigated high-dimensional model. In this case one simply computes the individual targets and applies Eq. 6 to each group of variables separately. As one referee suggests, it may be helpful to cluster variables according to their variances Var($u_i$) – typically the predominant term determining the shrinkage level $\lambda^*$.

Finally, it is important to consider the transformation properties of the shrinkage procedure. From Eq. 6 it is clear that $\lambda^*$ is invariant against translations. For instance, the underlying data may be centered without affecting the estimation of the optimal shrinkage intensity. However, $\lambda^*$ is not generally invariant against scale transformations. This dependence on the absolute scales of the considered variables is a general property that shrinkage shares with other approaches to biased estimation, such as ridge regression and partial least squares (e.g. Hastie et al., 2001).

2.3. Estimation of the optimal shrinkage intensity

For practical application of Eq. 6 one needs to obtain an estimate $\hat{\lambda}^*$ of the optimal shrinkage intensity. In their paper Ledoit and Wolf (2003) emphasize that the parameters of Eq. 6 should be estimated consistently. However, this is only a very weak requirement, as consistency is an asymptotic property and a basic requirement of any sensible estimator. Furthermore, we are interested in small sample inference. Thus, instead we suggest computing $\hat{\lambda}^*$ by replacing all expectations, variances, and covariances in Eq. 6 by their unbiased sample counterparts. This leads to

$$
\hat{\lambda}^* = \frac{\sum_{i=1}^{p} \widehat{\text{Var}}(u_i) - \widehat{\text{Cov}}(t_i, u_i) - \widehat{\text{Bias}}(u_i)\,(t_i - u_i)}{\sum_{i=1}^{p} (t_i - u_i)^2}. \qquad (8)
$$

In finite samples $\hat{\lambda}^*$ may exceed the value one, and in some cases it may even become negative. In order to avoid overshrinkage or negative shrinkage, we truncate the estimated intensity accordingly, using $\hat{\lambda}^{**} = \max(0, \min(1, \hat{\lambda}^*))$ when constructing the shrinkage estimator of Eq. 3.



It is also worth noting that Eq. 8 is valid regardless of the sample size n at hand. In particular, n may be substantially smaller than p, a fact we will exploit in our suggested approach to inferring gene association networks.
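To see Eq. 8 at work outside the covariance setting, consider the simplest possible case: an unbiased estimate u of a p-dimensional mean vector shrunk towards the fixed (non-random) target t = 0, so that the Cov and Bias terms of Eq. 8 drop out. A sketch of this in R (our own illustration):

```r
# Sketch of Eq. 8 for shrinking a mean vector towards the fixed target 0.
# With a non-random target and unbiased u, Eq. 8 reduces to
# lambda = sum Var-hat(u_i) / sum u_i^2.
set.seed(1)
n <- 10; p <- 50
X <- matrix(rnorm(n * p, mean = 0.3), n, p)
u <- colMeans(X)                   # unrestricted estimate U
var.u <- apply(X, 2, var) / n      # Var-hat(u_i): squared standard error
lambda <- sum(var.u) / sum(u^2)    # Eq. 8 with t_i = 0
lambda <- max(0, min(1, lambda))   # truncation
u.star <- (1 - lambda) * u         # shrinkage estimate, Eq. 3 with T = 0
```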

2.4. Shrinkage estimation of the covariance matrix

Estimation of the unrestricted covariance matrix requires the determination of $(p^2 + p)/2$ free parameters, and thus constitutes a high-dimensional inference problem. Consequently, application of shrinkage offers a promising approach to obtaining improved estimates.

Daniels and Kass (2001) provide a fairly extensive review of empirical Bayes shrinkage estimators proposed in recent years. Unfortunately, most of the suggested estimators appear to suffer from at least one of the following drawbacks, which renders them unsuitable for the analysis of genomic data:

1. Typically, the application is restricted to data with p < n, in order to ensure that the empirical covariance S can be inverted. However, most current genomic data sets contain vastly more features than samples (p ≫ n).

2. Many of the suggested estimators are computationally expensive (e.g. those based on MCMC sampling), or assume specific underlying distributions.

These difficulties are elegantly avoided by resorting to the (almost) distribution-free Ledoit-Wolf approach to shrinkage.

In a matrix setting the equivalent to the squared error loss function is the squared Frobenius norm. Thus,

$$
L(\lambda) = \|S^* - \Sigma\|_F^2 = \|\lambda T + (1-\lambda)S - \Sigma\|_F^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} \bigl(\lambda t_{ij} + (1-\lambda) s_{ij} - \sigma_{ij}\bigr)^2 \qquad (9)
$$

is a natural quadratic measure of distance between the true ($\Sigma$) and inferred covariance matrix ($S^*$). In this formula the unconstrained unbiased empirical covariance matrix S replaces the unconstrained estimate U of Eq. 3.

Selecting a suitable estimated covariance target $T = (t_{ij})$ requires some diligence. In general, the choice of a target should be guided by the presumed lower-dimensional structure in the data set, as this determines the increase of efficiency over that of the empirical covariance.


Target A: “diagonal, unit variance” (0 estimated parameters)

$$
t_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij}) + \sum_{i} \widehat{\text{Var}}(s_{ii})}{\sum_{i \neq j} s_{ij}^2 + \sum_{i} (s_{ii} - 1)^2}
$$

Target B: “diagonal, common variance” (1 estimated parameter: v)

$$
t_{ij} = \begin{cases} v = \text{avg}(s_{ii}) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij}) + \sum_{i} \widehat{\text{Var}}(s_{ii})}{\sum_{i \neq j} s_{ij}^2 + \sum_{i} (s_{ii} - v)^2}
$$

Target C: “common (co)variance” (2 estimated parameters: v, c)

$$
t_{ij} = \begin{cases} v = \text{avg}(s_{ii}) & \text{if } i = j \\ c = \text{avg}(s_{ij}) & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij}) + \sum_{i} \widehat{\text{Var}}(s_{ii})}{\sum_{i \neq j} (s_{ij} - c)^2 + \sum_{i} (s_{ii} - v)^2}
$$

Target D: “diagonal, unequal variance” (p estimated parameters: $s_{ii}$)

$$
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij})}{\sum_{i \neq j} s_{ij}^2}
$$

Target E: “perfect positive correlation” (p estimated parameters: $s_{ii}$)

$$
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij}) - \hat{f}_{ij}}{\sum_{i \neq j} (s_{ij} - \sqrt{s_{ii} s_{jj}})^2}
$$

Target F: “constant correlation” (p + 1 estimated parameters: $s_{ii}$, $\bar{r}$)

$$
t_{ij} = \begin{cases} s_{ii} & \text{if } i = j \\ \bar{r} \sqrt{s_{ii} s_{jj}} & \text{if } i \neq j \end{cases}
\qquad
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(s_{ij}) - \bar{r} \hat{f}_{ij}}{\sum_{i \neq j} (s_{ij} - \bar{r}\sqrt{s_{ii} s_{jj}})^2}
$$

where, for targets E and F,

$$
\hat{f}_{ij} = \frac{1}{2} \left[ \sqrt{\frac{s_{jj}}{s_{ii}}}\, \widehat{\text{Cov}}(s_{ii}, s_{ij}) + \sqrt{\frac{s_{ii}}{s_{jj}}}\, \widehat{\text{Cov}}(s_{jj}, s_{ij}) \right].
$$

Table 2: Six commonly used shrinkage targets for the covariance matrix and associated estimators of the optimal shrinkage intensity – see main text for discussion. Abbreviations: v, average of sample variances; c, average of sample covariances; $\bar{r}$, average of sample correlations.


However, it is also a remarkable consequence of Eq. 6 that in fact any target will lead to a reduction in MSE, albeit only a minor one in the case of a strongly misspecified target (then $S^*$ will simply reduce to the unconstrained estimate S).

Six commonly used covariance targets are compiled in Tab. 2, along with a brief description, the dimension of the target, and the resulting estimate $\hat{\lambda}^*$. In order to compute the optimal shrinkage intensity it is necessary to estimate the variances of the individual entries of S – see Appendix A for the technical details. Note that the resulting shrinkage estimators $S^*$ all exhibit the same order of algorithmic complexity as the standard estimate S.
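As one concrete instance of Tab. 2, the following sketch (our own) computes the intensity for target B (“diagonal, common variance”), with Var-hat($s_{ij}$) estimated as in Appendix A:

```r
# Sketch: estimated shrinkage intensity for target B of Tab. 2.
lambda.targetB <- function(X) {
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centered data
  S  <- cov(X)
  v  <- mean(diag(S))                  # common variance v = avg(s_ii)
  VS <- matrix(0, p, p)                # Var-hat(s_ij), Appendix A
  for (i in 1:p) for (j in 1:p) {
    w <- Xc[, i] * Xc[, j]             # w_kij
    VS[i, j] <- n / (n - 1)^3 * sum((w - mean(w))^2)
  }
  off <- row(S) != col(S)
  num <- sum(VS[off]) + sum(diag(VS))
  den <- sum(S[off]^2) + sum((diag(S) - v)^2)
  max(0, min(1, num / den))            # truncated intensity
}
```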

Probably the most commonly employed shrinkage targets are the identity matrix and its scalar multiple. These are denoted in Tab. 2 as “diagonal, unit variance” (target A) and “diagonal, common variance” (target B). A further extension is provided by the two-parameter covariance model that in addition to the common variance (as in target B) also maintains a common covariance (“common (co)variance”, target C). The three targets share several properties. First, they are all extremely low-dimensional (0 to 2 free parameters); thus they impose a rather strong structure which in turn requires only little data to fit. Second, the resulting estimators shrink all components of the empirical covariance matrix, i.e. both diagonal and off-diagonal entries. In the literature it is easy to find examples where one of the above targets is employed – albeit not in combination with analytic estimation of the shrinkage level. For instance, the unit diagonal target A is typically used in ridge regression and the related Tikhonov regularization (e.g. Hastie et al., 2001). The target B is utilized, e.g., by Friedman (1989), who estimates $\lambda$ by means of cross-validation, by Leung and Chan (1998), who use a fixed $\lambda = \frac{2}{n+2}$, by Dobra et al. (2004) as a parameter in an inverse Wishart prior for the covariance matrix, and finally also by Ledoit and Wolf (2004b). The two-parameter target C appears not to be widely used.

Another class of covariance targets is given by the “diagonal, unequal variance” model (target D), the “perfect positive correlation” model (target E), and the “constant correlation” model (target F) of Tab. 2. A shared feature of these three targets is that they are comparatively parameter-rich, and that they only lead to shrinkage of the off-diagonal elements of S. The last two shrinkage targets were introduced with the purpose of modeling stock returns, which tend – on average – to be strongly positively correlated (Ledoit and Wolf, 2003, 2004a).

In this paper, we focus on the shrinkage target D for the estimation of covariance and correlation matrices arising in genomics problems. This “diagonal, unequal variance” model represents a compromise between the low-dimensional targets A, B, and C and the correlation models E and F. Like the simpler targets A and B it shrinks the off-diagonal entries to zero.


However, unlike shrinkage targets A and B, target D leaves the diagonal entries intact, i.e. it does not shrink the variances. Thus, this model assumes that the parameters of the covariance matrix fall into two classes, and the two classes are treated differently in the shrinkage process.

This clear separation also suggests that for shrinking purposes it may be useful to parameterize the covariance matrix in terms of variances and correlations (rather than variances and covariances), so that $s^*_{ij} = r^*_{ij} \sqrt{s_{ii} s_{jj}}$. In this formulation, shrinkage is applied to the correlations rather than the covariances. This has two distinct advantages. First, the off-diagonal elements determining the shrinkage intensity are all on the same scale. Second, the (partial) correlations derived from the resulting covariance estimator $S^*$ are independent of scale and location transformations of the underlying data matrix, just as is the case for those computed from S.

It is this form of target D that we propose for estimating correlation and covariance matrices. For reference, the corresponding formulae are collected in Tab. 1. Note the remarkably simple expression for the shrinkage intensity

$$
\hat{\lambda}^* = \frac{\sum_{i \neq j} \widehat{\text{Var}}(r_{ij})}{\sum_{i \neq j} r_{ij}^2} \qquad (10)
$$

– see also Tab. 2 (target D). For technical details such as the calculation of $\widehat{\text{Var}}(r_{ij})$ we refer to Appendix A. In this formula a concern may be the use of the empirical correlation coefficients $r_{ij}$ – after all, these are the ones that we aim to improve. Thus, it seems we face a circularity problem, namely that for an accurate estimate of the shrinkage intensity reliable estimates of correlation are needed, and vice versa. However, it is a remarkable feature of target D that it completely resolves this “chicken-and-egg” issue: regardless of whether standard or shrinkage estimates of correlation are substituted into Eq. 10, the resulting $\hat{\lambda}^*$ remains the same.

Using the target D has another important advantage: the resulting shrinkage covariance estimate will automatically be positive definite. The target D itself is always positive definite, and the convex combination of a positive definite matrix (T) with another matrix that is positive semidefinite (S) always yields a positive definite matrix. Note that this is also true for targets A and B but not for the targets C, E, and F (consider as a counterexample the target E with all variances set equal to one).

Further variants of the proposed estimator (Tab. 1) are easily constructed. One possible extension is to shrink the diagonal elements as well, using a different intensity for variances and correlations. Shrinking the variances to a common mean is standard practice in genomic case-control studies (e.g. Cui et al., 2005). It is particularly helpful if there are so few samples that the gene-specific variances are difficult to obtain. In such a case, however, it may make no sense at all to consider estimating the full covariance matrix.


3. Inference of gene networks from small sample genomic data

3.1. Methodological background

We consider here two simple approaches for modeling net-like dependency structures in genome expression data, both of which require as input an estimated large-scale covariance matrix. The first and conceptually simpler model is that of a “gene relevance network”. This was introduced by Butte et al. (2000) and is built in the following simple fashion. First, the p × p correlation matrix $P = (\rho_{ij})$ is inferred from the data. Second, for estimated correlation coefficients exceeding a prespecified threshold (say r > 0.8) an edge is drawn between the two respective genes. Thus, relevance networks represent the marginal (in)dependence structure among the p genes. In statistical terminology this type of network model is also known as a “covariance graph”.
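Given any estimated correlation matrix, this construction is a one-liner; a minimal sketch (our own):

```r
# Sketch: adjacency matrix of a relevance network (Butte et al., 2000).
relevance.net <- function(R, cutoff = 0.8) {
  A <- abs(R) > cutoff   # edge iff the absolute correlation exceeds the cut-off
  diag(A) <- FALSE       # no self-loops
  A
}
```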

Despite the popularity of relevance networks (which stems from the relative ease of their construction) there are many problems connected with their proper interpretation. For instance, the cut-off value that determines the “significant” edges is typically chosen in a rather arbitrary fashion – often simply a large value is selected with the vague aim of excluding “spurious” edges. However, this misses the statistical interpretation of the marginal correlation, which takes account of both direct as well as indirect associations. As a direct consequence, in a reasonably well-connected genetic network most genes will by construction be correlated with each other (e.g. see the analysis of the E. coli data below). Thus, in this case even a large observed degree of correlation will provide only weak evidence for the direct dependency of any two considered genes. Instead, the absence of correlation (i.e. r ≈ 0) will be a strong indicator of their independence. Therefore, even ignoring the difficulties of obtaining accurate measures of correlation from small sample data, gene relevance networks are suitable tools not for elucidating the dependence network among genes but rather for uncovering independence.

In contrast, with the class of graphical Gaussian models (GGMs), also called “covariance selection” or “concentration graph” models, a simple statistical approach exists that allows one to detect direct dependence between genes. This “gene association network” approach is based on investigating the estimated partial correlations $\tilde{r}$ for all pairs of considered genes. The traditionally developed theory of GGMs (e.g. Whittaker, 1990) is only applicable for n ≫ p. However, with the increasing interest in “small n, large p” inference a number of refinements to the GGM theory have recently been proposed that allow its application also to genomic data – see Schäfer and Strimmer (2005a,b) for a discussion and a comprehensive list of references.


In essence, in a small sample setting both the estimation of the partial correlations as well as the subsequent model selection need to be suitably modified. In the following we discuss several such approaches, including one based on the suggested covariance shrinkage estimator.

3.2. Small sample GGM selection using false discovery rate multiple testing

Standard graphical modeling theory (e.g. Whittaker, 1990) shows that the matrix of partial correlations $\tilde{P} = (\tilde{\rho}_{ij})$ is related to the inverse of the covariance matrix $\Sigma$. This relationship leads to the straightforward estimator

$$
\tilde{r}_{ij} = \hat{\tilde{\rho}}_{ij} = -\hat{\omega}_{ij} / \sqrt{\hat{\omega}_{ii}\, \hat{\omega}_{jj}}, \qquad (11)
$$

where

$$
\hat{\Omega} = (\hat{\omega}_{ij}) = \hat{\Sigma}^{-1}. \qquad (12)
$$

We note that in the last equation it is absolutely crucial that the covariance is estimated accurately, and that $\hat{\Sigma}$ is well-conditioned – otherwise the above formulae will result in a rather poor estimate of partial correlation (cf. Schäfer and Strimmer, 2005a). Here, we adopt the shrinkage estimator $S^*$ developed in the first part of this paper (Tab. 1). As we show below, this leads to (sometimes dramatic) improvements in accuracy over alternative procedures. In this context it is also interesting to note that the difficulty of obtaining reliable estimates of $\tilde{P}$ has led some researchers to instead consider partial correlations of limited order (e.g., de la Fuente et al., 2004; Wille et al., 2004; Magwene and Kim, 2004). However, using partial correlations of first or second order as a measure of dependence amounts to employing a network model that is much more similar to relevance than to association networks, and hence also inherits their interpretation difficulties.
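A sketch of Eq. 11 and Eq. 12 in R (our own illustration), to be fed with a well-conditioned estimate such as $S^*$ or $R^*$ from Tab. 1:

```r
# Sketch: partial correlations via Eqs. 11-12 from a well-conditioned
# covariance or correlation estimate (e.g. the shrinkage estimate of Tab. 1).
pcor.from.cov <- function(Sigma) {
  Omega <- solve(Sigma)                                 # Eq. 12
  P <- -Omega / sqrt(outer(diag(Omega), diag(Omega)))   # Eq. 11
  diag(P) <- 1
  P
}
```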

The second critical part of inferring GGMs is model selection. In Schäfer and Strimmer (2005a) we have suggested a simple yet quite effective search heuristic based on large-scale multiple testing of edges. This approach is based on two rationales. First, it exploits the fact that genetic networks are typically sparse, i.e. that most of the p(p − 1)/2 partial correlation coefficients $\tilde{\rho}$ vanish. In turn, this allows one to estimate the null distribution from the data, and thus to decide which edges are present or absent. Second, GGM search by multiple testing implicitly assumes that for all cliques (i.e. fully connected subsets of nodes) of size three and more, the underlying joint distribution is well approximated by the product of the bivariate densities associated with the respective undirected edges (Cox and Reid, 2004).

Specifically, in the approach of Schäfer and Strimmer (2005a) the distribution of the observed partial correlations $\tilde{r}$ across edges is taken as the mixture


$$
f(\tilde{r}) = \eta_0 f_0(\tilde{r}; \kappa) + (1 - \eta_0) f_A(\tilde{r}), \qquad (13)
$$

where $f_0$ is the null distribution, $\eta_0$ is the (unknown) proportion of “null edges”, and $f_A$ the distribution of observed partial correlations assigned to actually existing edges. The null density $f_0$ is given in Hotelling (1953) as

$$
f_0(\tilde{r}; \kappa) = (1 - \tilde{r}^2)^{(\kappa-3)/2} \, \frac{\Gamma(\frac{\kappa}{2})}{\pi^{\frac{1}{2}}\, \Gamma(\frac{\kappa-1}{2})} = |\tilde{r}| \, \text{Be}\Bigl(\tilde{r}^2; \frac{1}{2}, \frac{\kappa-1}{2}\Bigr), \qquad (14)
$$

where Be(x; a, b) is the Beta distribution and $\kappa$ is the degree of freedom, equal to the reciprocal variance of the null $\tilde{r}$. Fitting this mixture density allows $\kappa$, $\eta_0$ and even $f_A$ to be determined – for an algorithm to infer the latter see Efron (2004, 2005b). Subsequently, it is straightforward to compute the edge-specific “local false discovery rate” (fdr) via

$$
\text{Prob}(\text{null edge} \mid \tilde{r}) = \text{fdr}(\tilde{r}) = \frac{\eta_0 f_0(\tilde{r}; \kappa)}{f(\tilde{r})}, \qquad (15)
$$

i.e. the posterior probability that an edge is null given $\tilde{r}$. Finally, an edge is considered “present” or “significant” if its local fdr is smaller than 0.2 (Efron, 2005b).
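For illustration, the null density of Eq. 14 and the fdr of Eq. 15 are straightforward to evaluate once $\kappa$ and $\eta_0$ are available; in the sketch below (our own) these are assumed to have been obtained by fitting the mixture of Eq. 13, and the marginal f is crudely approximated by a kernel density estimate:

```r
# Sketch of Eqs. 14-15: Hotelling null density and local fdr.
f0 <- function(r, kappa) {                  # Eq. 14, first form
  norm <- exp(lgamma(kappa / 2) - lgamma((kappa - 1) / 2)) / sqrt(pi)
  norm * (1 - r^2)^((kappa - 3) / 2)
}
local.fdr <- function(r.obs, kappa, eta0) {
  dens <- density(r.obs)                    # crude estimate of f(r)
  f <- approx(dens$x, dens$y, xout = r.obs)$y
  pmin(1, eta0 * f0(r.obs, kappa) / f)      # Eq. 15, capped at 1
}
# edges with local.fdr(...) < 0.2 would be declared "significant"
```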

Closely related to the empirical Bayes local “fdr” statistic are the frequentist “Fdr” (also called q-value) approach advocated by Storey (2002), and the Benjamini and Hochberg (1995) “FDR” rule. In our original GGM model selection proposal (Schäfer and Strimmer, 2005a) we relied on the FDR method to identify edges in the network. However, we now suggest employing the local fdr, as this fits more naturally with the mixture model setup, and because it takes account of the dependencies among the estimated partial correlation coefficients (Efron, 2005a).

3.3. Small sample GGM selection using lasso regression

Partial correlations may not only be estimated by inversion of the covariance or correlation matrix (Eq. 11 and Eq. 12). An alternative route is offered by regressing each gene $i \in \{1, \ldots, p\}$ in turn against the remaining set of p − 1 variables. The partial correlations are then simply

$$
\tilde{r}_{ij} = \text{sign}\bigl(\hat{\beta}^{(j)}_i\bigr) \sqrt{\hat{\beta}^{(j)}_i \hat{\beta}^{(i)}_j}, \qquad (16)
$$


where $\hat{\beta}^{(i)}_j$ denotes the estimated regression coefficient of predictor variable $X_j$ for the response $X_i$. Note that while in general $\hat{\beta}^{(j)}_i \neq \hat{\beta}^{(i)}_j$, the signs of these two regression coefficients are identical.

This opens the way for obtaining small sample estimates of partial correlation and GGM inference by means of regularized regression. This avenue is pursued, e.g., by Dobra et al. (2004), who employ Bayesian variable selection. Another possibility to determine the regression coefficients is penalized regression, for instance ridge regression (Hoerl and Kennard, 1970a,b) or the lasso (Tibshirani, 1996). The latter approach has the distinct advantage that it will set many of the regression coefficients (and hence also partial correlations) exactly equal to zero. Thus, for covariance selection no additional testing is required, and an edge is recovered in the GGM network if both $\hat{\beta}^{(j)}_i$ and $\hat{\beta}^{(i)}_j$ differ from zero.
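Eq. 16 itself is a one-line matrix operation. In the sketch below (our own) B is assumed to be a p × p matrix with B[i, j] holding the fitted coefficient of predictor $X_j$ in the regression of response $X_i$, obtained from any (penalized) regression fit:

```r
# Sketch of Eq. 16: combine the two regression coefficients per gene pair.
# Where the estimated signs disagree (possible in finite samples) the
# entry is set to zero.
pcor.from.beta <- function(B) {
  P <- sign(B) * sqrt(pmax(B * t(B), 0))
  diag(P) <- 1
  P
}
```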

GGM inference using the lasso is investigated in Meinshausen and Bühlmann (2005), where it is suggested to choose the lasso penalty $\lambda_i$ for regression against variable $X_i$ according to

$$
\lambda_i = 2 \sqrt{\frac{s^{ML}_{ii}}{n}}\, \Phi^{-1}\Bigl(1 - \frac{\alpha}{2p^2}\Bigr), \qquad (17)
$$

where $\Phi(z)$ is the cumulative distribution function of the standard normal, $\alpha$ is a constant (set to 0.05 in our computations below) that controls the probability of falsely connecting two distinct connectivity components (Meinshausen and Bühlmann, 2005), and $s^{ML}_{ii}$ is the maximum-likelihood estimate of the variance of $X_i$. Note that this adaptive choice of penalty ensures that for small sample variance $\lambda_i$ vanishes, and hence in this case no penalization takes place.

3.4. Performance for synthetic data

In an extensive simulation study we compared the shrinkage and lasso approaches to GGM selection in terms of accuracy, power, and positive predictive accuracy. In addition to those two methods we also investigated two further estimators of partial correlation, denoted Π1 and Π2. These are discussed in Schäfer and Strimmer (2005a). Π1 employs the pseudoinverse instead of the matrix inverse in Eq. 12; thus for n > p it reduces to the classical estimate of partial correlation. Π2 uses the bootstrap to obtain a variance-reduced positive definite estimate of the correlation matrix. Note that in our previous study we found that Π2 exhibited the overall best performance.

Specifically, the simulation setup was as follows:

1. We controlled parameters of interest such as the number of features p, the fraction of non-zero edges $\eta_A = 1 - \eta_0$, and the sample size n of the simulated data. Specifically, we fixed p = 100 and $\eta_A = 0.04$, and varied n = 10, 20, . . . , 200.



2. We generated R = 200 random networks (i.e. partial correlation matrices) and simulated data of size n from the corresponding multivariate normal distribution.

3. From each of the R data sets we estimated the partial correlation coefficients with the four methods “shrinkage”, “lasso”, Π1, and Π2. The number of bootstrap replications required for Π2 was set to B = 500.

4. Subsequently, we computed the mean squared error by comparison with the known true values.

5. Similarly, we determined the average number of edges detected as significant, the power, and the “positive predictive value” (PPV), i.e. the fraction of correct edges among all significant edges. The PPV is sometimes also called the “true discovery rate” (TDR). Note that it is only defined if there is at least one significant edge. The fdr cut-off was set to 0.2, as suggested in Efron (2005b).

In order to simulate random “true” partial correlation matrices we relied on an algorithm producing diagonally dominant matrices – see Schäfer and Strimmer (2005a) for details. This method allows one to generate positive definite random correlation matrices of arbitrary size p × p with an a priori fixed proportion $\eta_A$ of non-null entries. Unfortunately, further structural and distributional properties, although desirable, are not easily specified – see for instance Hirschberger et al. (2004) – and the present simulation algorithm produces networks whose edges represent mostly weak links. Note that this renders their inference disproportionately hard.
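For reference, one simple construction along these lines (our own sketch; for the exact algorithm see Schäfer and Strimmer, 2005a) first fills a fraction $\eta_A$ of the off-diagonal entries, then enforces diagonal dominance and rescales:

```r
# Sketch: random positive definite partial correlation matrix with a
# fixed fraction etaA of non-zero entries, via diagonal dominance.
rand.pcor <- function(p = 100, etaA = 0.04) {
  A <- matrix(0, p, p)
  idx <- which(upper.tri(A))
  nz  <- sample(idx, round(etaA * length(idx)))
  A[nz] <- runif(length(nz), -1, 1)   # random non-null entries
  A <- A + t(A)                       # symmetrize
  diag(A) <- colSums(abs(A)) + 1      # enforce strict diagonal dominance
  P <- -cov2cor(A)                    # scale: -a_ij / sqrt(a_ii a_jj)
  diag(P) <- 1
  P
}
```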

In Fig. 2 we compare the accuracy of the four investigated estimators of partial correlation. Both the shrinkage and the lasso GGM estimators outperform the two others regardless of sample size. The previously recommended estimator Π2 is nearly as accurate for small sample sizes; however, it is computationally much more expensive than the shrinkage estimator. The peak at n = 100 associated with the estimator Π1 is a dimension resonance effect due to the use of the pseudoinverse (recall that p = 100) – see Schäfer and Strimmer (2005a) for a discussion and references.

Fig. 3a and Fig. 3b summarize the results with regard to GGM selection. Fig. 3a shows the number of edges that were detected as significant using each of the four methods. For $\eta_A = 0.04$ and p = 100 there exist exactly 198 edges in any of the simulated networks. The number of edges detected as significant for the shrinkage estimator remains well below this threshold; nevertheless, in comparison with Π1 and Π2 it typically finds the largest number of edges.


Figure 2: Mean squared error of the four investigated small-sample estimators of partial correlation (“shrinkage”, “lasso”, Π1, and Π2) in dependence on the sample size, for p = 100 genes. Note that the curves for “shrinkage” and “lasso” completely overlap.

In contrast, for simulated data the lasso GGM network approach recovers, even for small sample sizes, many more edges than are actually present. This indicates that the choice of penalization according to Eq. 17 may still be too permissive. The large number of significant edges for Π2 at very small sample sizes is a systematic bias related to the improper fit of the null model (Eq. 14).

Fig. 3b illustrates the corresponding power (i.e. the proportion of correctly identified edges) and PPV. The latter quantity is of key practical importance, as it is an estimate of the proportion of true edges among the list of edges returned as significant by the algorithm. For the shrinkage estimator the PPV is constant across the whole range of sample sizes and close to the desired level near 1 − Fdr ≈ 0.9 (Efron, 2005b). The lasso GGM estimator exhibits a very low PPV of only about 0.2. The other two estimators reach the appropriate level of PPV, but only for n > p. In terms of power, the shrinkage and the lasso GGM approaches outperform the other two investigated estimators Π1 and Π2, which exhibit reasonable power only for n > p. The power of the lasso regression approach is distinctly higher than that of the shrinkage estimator. However, this is due to the fact that the former liberally includes many edges in the resulting network without controlling the rate of false positives. In our simulations the shrinkage estimator has non-zero power only from n ≥ 30 (for p = 100).


Figure 3: Performance of methods for GGM network inference: (a) Average number of edges detected as significant. Note that there are 198 true edges in the simulated network (horizontal dashed line). (b) Power and positive predictive value (PPV) for reconstructing the GGM network topology. Gaps in the curves for the PPV indicate situations in which the PPV could not be computed (no significant edges).


As discussed above, this is very likely a consequence of our simulation setup, which produces partial correlation networks that are hard to infer. Thus, it is important to note the high PPV of this estimator: it indicates that if there is a significant edge, then the probability is very high that it actually corresponds to a true edge.

3.5. Analysis of expression profiles from an E. coli experiment

For illustration we now apply the above methods for inferring gene networks to a real data set from a microarray experiment conducted at the Institute of Applied Microbiology, University of Agricultural Sciences of Vienna (Schmidt-Heck et al., 2004). This was set up to measure the stress response of the microorganism Escherichia coli during expression of a recombinant protein. The resulting data monitor all 4,289 protein-coding genes of E. coli at 8, 15, 22, 45, 68, 90, 150, and 180 minutes after induction of the recombinant protein SOD (human superoxide dismutase). In a comparison with pooled samples before induction, 102 genes were identified by Schmidt-Heck et al. (2004) as differentially expressed in one or more samples after induction. In the following we try to establish the gene network among these 102 preselected genes.

A first impression of the dependency structure can be obtained by investigating the estimated correlation coefficients. For the shrinkage approach we obtain $\hat{\lambda}^* = 0.18$. The resulting correlation matrix has full rank (102), with condition number equal to 386.6. In contrast, the standard correlation matrix has rank 8 only and is ill-conditioned (infinite condition number). Thus, already for calculating the correlation coefficients the benefits of using the shrinkage estimator are apparent.

Fig. 4a shows the distribution of the estimated correlation coefficients, most of which differ from zero. This indicates that essentially all genes are either directly or indirectly associated with each other. Thus, constructing a traditional relevance network (Butte et al., 2000) will – at least for this data set – not lead to uncovering of the dependency structure. This is compared with the corresponding partial correlation matrix. Fig. 4b shows the distribution of the Fisher-transformed coefficients (cf. Hotelling, 1953). The contrast with the previous figure is apparent, as the distribution of partial correlations is unimodal and centered around zero. This means that most partial correlations vanish, that the number of direct interactions is small, and hence that the resulting gene association network is sparse.

Fig. 5 shows the corresponding gene association and relevance networks. The shrinkage GGM network is depicted in Fig. 5a and was derived by fitting the mixture distribution defined in Eq. 13 to the estimated partial correlations, with a cut-off fdr < 0.2.


Figure 4: (a) Histogram of the estimated shrunken correlation coefficients computed for all 102 × 101/2 = 5,151 pairs of genes. (b) Distribution of estimated partial correlation coefficients (green line) after Fisher's normalizing z-transformation (atanh) was applied for normalization purposes. Also shown are the fitted null distribution (dashed blue line) and the alternative distribution (pink) as inferred by the locfdr algorithm (Efron, 2004, 2005b). The black squares indicate the 0.2 local fdr cut-off values for the partial correlations.

The network comprises 116 significant edges, which amounts to about 2% of the 5,151 possible edges for 102 genes. This shows that for real data – in sharp contrast to our comparable simulations – the shrinkage estimator is powerful even for small sample sizes.

Several aspects of the inferred network are interesting. First, we recover the “hub” connectivity structure for the gene sucA. This gene is involved in the citric acid cycle. The existence of such hubs is a well-known property of biomolecular networks (e.g. Barabási and Oltvai, 2004). It is a strength of the present method that these nodes can be identified without any particular additional effort. Second, the edges connecting the genes lacA, lacZ and lacY are the strongest in the network, with the largest absolute values of partial correlation, and correspondingly also with the smallest local fdr values. Interestingly, these are exactly the genes on which the experiment was based: lacA, lacY and lacZ are induced by IPTG (isopropyl-beta-D-thiogalactopyranoside) dosage and initiate recombinant protein synthesis (Schmidt-Heck et al., 2004).


Figure 5: Gene networks inferred from the E. coli data by (a) the shrinkage GGM approach presented in this paper (Tab. 1), (b) the lasso GGM approach of Meinshausen and Bühlmann (2005), and (c) the relevance network with abs(r) > 0.8. Black and grey edges indicate positive and negative (partial) correlation, respectively.


For comparison, the lasso GGM network is shown in Fig. 5b. It was computed from the standardized E. coli data and contains 100 edges. Closer inspection of this network reveals an interesting structural bias introduced by the lasso regression for GGM inference. As can clearly be seen in Fig. 5b, the lasso limits the number of edges going in and out of a node. The reason for this is that the lasso imposes sparsity on the regression coefficients per node, so that in each regression only a few non-zero coefficients exist. As a consequence, the degree distribution of the E. coli lasso GGM network has an implicit upper bound. Thus, the lasso prevents the identification of hubs and also excludes power-law-type connectivity patterns. Note that, in contrast, in the shrinkage GGM approach sparsity is imposed on the network level rather than locally at the node level.

Finally, Fig. 5c shows the relevance network obtained by applying the conventional 0.8 cut-off to the absolute values of the shrunken correlation coefficients. The resulting network contains 58 edges and bears no resemblance to the GGM networks. As is clear from inspecting Fig. 4a, there are many more genes that are strongly correlated, so from this network the direct dependencies among genes cannot be deduced. Instead, we argue here that correlations should rather be employed for detecting independence among genes. The corresponding null hypothesis is that the two genes are dependent. For this purpose the mixture model of Eq. 13 is still applicable, except that the roles of $f_0$ and $f_A$ are interchanged. Thus any edge with fdr > 0.8 (defined as in Eq. 15!) would be considered significant.

As a last comment we remark that in our analysis we have plainly ignored the fact that the E. coli data derive from a time series experiment. This appears not to be too harmful for the GGM selection process – at least part of the longitudinal correlation will be accounted for by empirically fitting the null distribution (see also Efron (2005a)).

4. Discussion and summary

In this paper we draw attention to the problem of the widespread and largely uncritical use of the standard covariance estimator in the analysis of functional genomics data. As a quick glance at any recent issue of a journal such as Bioinformatics or BMC Bioinformatics will reveal, the empirical correlation and covariance estimators are often rather blindly applied by bioinformaticians to large-scale problems with many variables and few sample points, although it is well known that in this setting the standard estimators are not appropriate and may perform extremely poorly. Here, we strongly advise refraining from using the empirical covariance in the analysis of high-dimensional data such as from microarray or proteomics experiments.

We emphasize that alternatives are readily available in the form of shrinkage


estimators (e.g. Greenland, 2000). Shrinkage formalizes the idea of “borrowing strength across variables” and has proved beneficial in the problems of differential expression (e.g., Smyth, 2004; Cui et al., 2005) and classification of transcriptome data (e.g., Tibshirani et al., 2002; Zhu and Hastie, 2004). In this paper we particularly highlight the shrinkage approach of Ledoit and Wolf (2003), which allows fitting of all necessary tuning parameters in a simple analytical fashion. While this method appears to be little known, we anticipate that it will be helpful in many “small n, large p” inference problems.

In Section 2 of this paper we present a novel shrinkage estimator for the covariance and correlation matrix (Tab. 1) with guaranteed minimum MSE and positive definiteness that is not only perfectly applicable to “small n, large p” data but can also be computed in time comparable to that of the conventional estimator. By use of the theorem of Ledoit and Wolf (2003) to estimate the optimal shrinkage intensity, there is no need to specify any further parameters. Consequently, computationally expensive procedures such as cross-validation are completely avoided. As an added bonus, the proposed estimator is also distribution-free and demands only modest assumptions with regard to the existence of higher moments.

As a specific bioinformatics application we employ this covariance shrinkage estimator in the search for net-like genetic interactions. In Section 3 we show that this leads to large overall gains in the accuracy and in the power to recover the true network structure compared with a precursor approach described in Schäfer and Strimmer (2005a). In addition, our algorithm also outperforms the lasso approach to regularized GGM inference in terms of positive predictive accuracy. Furthermore, network inference using the shrinkage covariance estimator (Section 2) combined with the heuristic model selection of Section 3 takes only a few minutes even on a slow computer – thus we offer it as a fast alternative to exhaustive GGM search procedures, such as the MCMC method of Dobra et al. (2004).

Further possible uses of the proposed shrinkage covariance estimator in bioinformatics include classification of gene expression profiles. For instance, the SCRDA ("shrunken centroids regularized discriminant analysis") approach (Guo et al., 2004) employs regularized covariance and correlation matrices similar to the one described in Section 2. Hence, it should be straightforward to apply SCRDA also in conjunction with our proposed shrinkage covariance estimator.

We end with a note that "small n, large p" covariance estimation problems have recently arisen also in computational econometrics. Specifically, the inference and modeling of large financial networks (Mantegna and Stanley, 2000; Boginski et al., 2005) requires methods akin to those for gene relevance and association networks.


A. Estimation of the variance and covariance of the components of the S and R matrices

In order to compute the optimal estimated shrinkage intensity $\lambda^{\star}$ (Eq. 8) for the various structured covariance targets listed in Tab. 2, it is necessary to obtain unbiased estimates of the variance and the covariance of individual entries in the matrix $S = (s_{ij})$.

Let $x_{ki}$ be the $k$-th observation of the variable $X_i$ and $\bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_{ki}$ its empirical mean. Now set $w_{kij} = (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$ and $\bar{w}_{ij} = \frac{1}{n} \sum_{k=1}^{n} w_{kij}$. Then the unbiased empirical covariance equals

$$\widehat{\mathrm{Cov}}(x_i, x_j) = s_{ij} = \frac{n}{n-1} \bar{w}_{ij}$$

and, correspondingly, the variance is

$$\widehat{\mathrm{Var}}(x_i) = s_{ii} = \frac{n}{n-1} \bar{w}_{ii}.$$

The empirical unbiased variances and covariances of the individual entries of $S$ are computed in a similar fashion:

$$\widehat{\mathrm{Var}}(s_{ij}) = \frac{n^2}{(n-1)^2} \widehat{\mathrm{Var}}(\bar{w}_{ij}) = \frac{n}{(n-1)^2} \widehat{\mathrm{Var}}(w_{ij}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})^2.$$

Similarly,

$$\widehat{\mathrm{Cov}}(s_{ij}, s_{lm}) = \frac{n}{(n-1)^3} \sum_{k=1}^{n} (w_{kij} - \bar{w}_{ij})(w_{klm} - \bar{w}_{lm}).$$

Moments of higher order than $\mathrm{Var}(s_{ij})$, in particular variances and covariances of averages of $s_{ij}$, are neglected in estimating the optimal $\lambda^{\star}$ in Tab. 2.

The variance $\mathrm{Var}(r_{ij})$ of the empirical correlation coefficients can be estimated in a similar fashion: simply apply the above formulae to the standardized data matrix. We note that this procedure treats the estimated variances as constants and hence introduces a slight but generally negligible error. The same assumption also justifies ignoring the bias of the empirical correlation coefficients in Eq. 10.
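For reference, a direct transcription of these moment estimates into R might look as follows (a sketch; the function and argument names are ours, and x denotes the n × p data matrix):

    # Unbiased estimates of Var(s_ij) and Cov(s_ij, s_lm) as above.
    moments.s <- function(x, i, j, l, m) {
      n    <- nrow(x)
      xc   <- scale(x, center = TRUE, scale = FALSE)  # centered data
      w.ij <- xc[, i] * xc[, j]                       # w_kij
      w.lm <- xc[, l] * xc[, m]                       # w_klm
      list(var.s = n / (n - 1)^3 * sum((w.ij - mean(w.ij))^2),
           cov.s = n / (n - 1)^3 * sum((w.ij - mean(w.ij)) *
                                       (w.lm - mean(w.lm))))
    }

Applying the same function to the standardized data matrix scale(x) yields the corresponding estimates for the correlation coefficients.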


B. Available computer software

The shrinkage estimator of the covariance matrix described in this paper is implemented in the R package "corpcor". This package also contains functions for computing (partial) correlations. The analysis and visualisation of the gene expression data was performed using the "GeneTS" R package. Both packages are distributed under the GNU General Public License and are available for download from the CRAN archive at http://cran.r-project.org. "GeneTS" is also available from Bioconductor (http://www.bioconductor.org) and from http://www.statistik.lmu.de/~strimmer/software/genets/.
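A minimal usage sketch (assuming the function names cov.shrink and cor.shrink of the "corpcor" release current at the time of writing; in those versions the estimated intensity is attached as the "lambda" attribute):

    library(corpcor)
    x <- matrix(rnorm(20 * 100), nrow = 20)  # n = 20 samples, p = 100 genes
    S <- cov.shrink(x)                       # shrinkage covariance estimate
    R <- cor.shrink(x)                       # shrinkage correlation estimate
    attr(R, "lambda")                        # estimated shrinkage intensity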

References

Barabási, A.-L. and Z. N. Oltvai (2004). Network biology: understanding the cell's functional organization. Nature Rev. Genetics 5, 101–113.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300.

Boginski, V., S. Butenko, and P. M. Pardalos (2005). Statistical analysis of financial networks. Comp. Stat. Data Anal. 48, 431–443.

Butte, A. J., P. Tamayo, D. Slonim, T. R. Golub, and I. S. Kohane (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA 97, 12182–12186.

Cox, D. R. and N. Reid (2004). A note on pseudolikelihood from marginal densities. Biometrika 91, 729–737.

Cui, X., J. T. G. Hwang, J. Qiu, N. J. Blades, and G. A. Churchill (2005). Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 6, 59–75.

Daniels, M. J. and R. E. Kass (2001). Shrinkage estimators for covariance matrices. Biometrics 57, 1173–1184.

de la Fuente, A., N. Bing, I. Hoeschele, and P. Mendes (2004). Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20, 3565–3574.


Dobra, A., C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West (2004). Sparse graphical models for exploring gene expression data. J. Multiv. Anal. 90, 196–212.

Efron, B. (1975). Biased versus unbiased estimation. Adv. Math. 16, 259–277.

Efron, B. (1982). Maximum likelihood and decision theory. Ann. Statist. 10, 340–356.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Amer. Statist. Assoc. 99, 96–104.

Efron, B. (2005a). Correlation and large-scale simultaneous significance testing. Preprint, Dept. of Statistics, Stanford University.

Efron, B. (2005b). Local false discovery rates. Preprint, Dept. of Statistics, Stanford University.

Efron, B. and C. N. Morris (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70, 311–319.

Efron, B. and C. N. Morris (1977). Stein's paradox in statistics. Sci. Am. 236, 119–127.

Eisen, M. B., P. T. Spellman, P. O. Brown, and D. Botstein (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.

Friedman, J. H. (1989). Regularized discriminant analysis. J. Amer. Statist. Assoc. 84, 165–175.

Greenland, S. (2000). Principles of multilevel modelling. Intl. J. Epidemiol. 29, 158–167.

Guo, Y., T. Hastie, and R. Tibshirani (2004). Regularized discriminant analysis and its application in microarray. Preprint, Dept. of Statistics, Stanford University.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. New York: Springer.

Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra Appl. 103, 103–118.


Hirschberger, M., Y. Qi, and R. E. Steuer (2004). Randomly generating portfolio-selection covariance matrices with specified distributional assumption. Preprint, Terry College of Business, University of Georgia.

Hoerl, A. E. and R. W. Kennard (1970a). Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69–82.

Hoerl, A. E. and R. W. Kennard (1970b). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67.

Hotelling, H. (1953). New light on the correlation coefficient and its transforms. J. R. Statist. Soc. B 15, 193–232.

Ledoit, O. and M. Wolf (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J. Empir. Finance 10, 603–621.

Ledoit, O. and M. Wolf (2004a). Honey, I shrunk the sample covariance matrix. J. Portfolio Management 30, 110–119.

Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large-dimensional covariance matrices. J. Multiv. Anal. 88, 365–411.

Leung, P. L. and W. Y. Chan (1998). Estimation of the scale matrix and its eigenvalues in the Wishart and the multivariate F distributions. Ann. Inst. Statist. Math. 50, 523–530.

Magwene, P. M. and J. Kim (2004). Estimating genomic coexpression networks using first-order conditional independence. Genome Biology 5, R100.

Mantegna, R. N. and H. E. Stanley (2000). An Introduction to Econophysics: Correlations and Complexity in Finance. Cambridge, UK: Cambridge University Press.

Meinshausen, N. and P. Bühlmann (2005). High dimensional graphs and variable selection with the lasso. Ann. Statist., in press.

Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. J. Amer. Statist. Assoc. 78, 47–55.

Schäfer, J. and K. Strimmer (2005a). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764.


Schäfer, J. and K. Strimmer (2005b). Learning large-scale graphical Gaussian models from genomic data. In J. F. F. Mendes, S. N. Dorogovtsev, A. Povolotsky, F. V. Abreu, and J. G. Oliveira (Eds.), Science of Complex Networks: From Biology to the Internet and WWW, Volume 776, AIP Conference Proceedings, Aveiro, PT, August 2004, pp. 263–276. American Institute of Physics.

Schmidt-Heck, W., R. Guthke, S. Toepfer, H. Reischer, K. Duerrschmid, and K. Bayer (2004). Reverse engineering of the stress response during expression of a recombinant protein. In Proceedings of the EUNITE symposium, 10–12 June 2004, Aachen, Germany, pp. 407–412. Verlag Mainz.

Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statist. Appl. Genet. Mol. Biol. 3, 3.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate distribution. In J. Neyman (Ed.), Proc. Third Berkeley Symp. Math. Statist. Probab., Volume 1, Berkeley, pp. 197–206. Univ. California Press.

Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Statist. Soc. B 64, 479–498.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–288.

Tibshirani, R., T. Hastie, B. Narasimhan, and G. Chu (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci. USA 99, 6567–6572.

Toh, H. and K. Horimoto (2002). Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 18, 287–297.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. New York: Wiley.

Wille, A., P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelic, P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, and P. Bühlmann (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5, R92.

Zhu, J. and T. Hastie (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics 5, 427–443.
