
Variance estimation in high-dimensional linear models

Lee H. Dicker*

Department of Statistics and Biostatistics
Rutgers University
501 Hill Center, 110 Frelinghuysen Road
Piscataway, NJ 08854
e-mail: [email protected]

*Supported by NSF Grant DMS-1208785

Abstract: The residual variance and the proportion of explained variation are important quantities in many statistical models and model fitting procedures. They play an important role in regression diagnostics, model selection procedures, and in determining the performance limits in many problems. In this paper, we propose new method-of-moments based estimators for the residual variance, the proportion of explained variation and other related quantities, such as the $\ell_2$-signal strength. The proposed estimators are consistent and asymptotically normal in high-dimensional linear models with Gaussian predictors and errors, where the number of predictors $d$ is proportional to the number of observations $n$; in fact, consistency holds even in settings where $d/n \to \infty$. Existing results on residual variance estimation in high-dimensional linear models depend on sparsity in the underlying signal. Our results require no sparsity assumptions and imply that the residual variance and the proportion of explained variation may be consistently estimated even when $d > n$ and the underlying signal itself is non-estimable. Basic numerical work suggests that some of our distributional assumptions may be relaxed. A real data analysis involving gene expression data and single nucleotide polymorphism data further illustrates the performance of the proposed methods.

AMS 2000 subject classifications: Primary 62J05; secondary 62F12, 15B52.
Keywords and phrases: Asymptotic normality, Proportion of explained variation, Random matrix theory, Residual variance, Signal-to-noise ratio.

1. Introduction

Consider the linear model
$$y_i = x_i^T\beta + \epsilon_i \quad (i = 1, \ldots, n), \tag{1}$$
where $y_1, \ldots, y_n \in \mathbb{R}$ and $x_1 = (x_{11}, \ldots, x_{1d})^T, \ldots, x_n = (x_{n1}, \ldots, x_{nd})^T \in \mathbb{R}^d$ are observed outcomes and $d$-dimensional predictors, respectively, $\epsilon_1, \ldots, \epsilon_n \in \mathbb{R}$ are unobserved independent and identically distributed errors with $E(\epsilon_i) = 0$ and $\mathrm{var}(\epsilon_i) = \sigma^2 > 0$, and $\beta = (\beta_1, \ldots, \beta_d)^T \in \mathbb{R}^d$ is an unknown $d$-dimensional parameter.


Let $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ denote the $n$-dimensional vector of outcomes and $X = (x_1, \ldots, x_n)^T$ denote the $n \times d$ matrix of predictors. Also let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^n$. Then (1) may be re-expressed as $y = X\beta + \epsilon$.

In this paper, we focus on the case where the predictors $x_i$ are random. More specifically, we assume that $x_1, \ldots, x_n$ are independent and identically distributed random vectors with mean $E(x_i) = 0$ and $d \times d$ positive definite covariance matrix $\mathrm{cov}(x_i) = \Sigma$. The $x_i$ are additionally assumed to be independent of $\epsilon$. In practice, the assumption $E(x_i) = 0$ is often untenable and it may be appropriate to add an intercept term to the linear model (1). All of our theoretical results in this paper remain valid in cases where $E(x_i) \neq 0$ and an intercept term is included in the model, upon centering the data and replacing $n$ with $n - 1$.

Let $\tau^2 = \beta^T\Sigma\beta = \|\Sigma^{1/2}\beta\|^2$, where $\|\cdot\|$ denotes the $\ell_2$-norm. Then $\tau^2 = \mathrm{var}\{E(y_i \mid x_i)\}$ is a measure of the overall $\ell_2$-signal strength and $\sigma^2 = \mathrm{var}(\epsilon_i)$ is the residual variance. This paper is concerned with identifying effective estimators for $\sigma^2$, $\tau^2$, and related quantities in high-dimensional linear models, where $d$ and $n$ are large; we are particularly interested in settings where $d > n$. The parameters $\sigma^2$ and $\tau^2$ are important in many problems in statistics. In estimation and prediction problems, $\sigma^2$ frequently determines the scale of an estimator's risk under quadratic loss. Reliable estimates of $\sigma^2$ may be required to compute popular model selection statistics, such as AIC, BIC, or RIC (Akaike, 1974; Foster and George, 1994; Schwarz, 1978; Zou et al., 2007). Good estimates of $\sigma^2$ and $\tau^2$ may be used to derive plug-in estimates of other quantities, such as the proportion of explained variation $r^2 = \tau^2/(\sigma^2 + \tau^2)$ and the signal-to-noise ratio $\tau^2/\sigma^2$. The proportion of explained variation is important for various regression diagnostics, including goodness-of-fit testing, and is related to important practical concepts, such as heritability in genetics; the signal-to-noise ratio is important for regularization parameter selection and determines the performance limits in certain high-dimensional regression problems, e.g. Dicker (2013) and a 2013 technical report available from the author.

If the predictors $x_i$ are nondegenerate and $n - d$ is large, then estimating $\sigma^2$ and $\tau^2$ is straightforward. Indeed, if $d \leq n$ and $X$ has full rank, let $\hat\beta_{\mathrm{ols}} = (X^TX)^{-1}X^Ty$ be the ordinary least squares estimator for $\beta$; then
$$\hat\sigma_0^2 = \frac{1}{n-d}\|y - X\hat\beta_{\mathrm{ols}}\|^2 = \frac{1}{n-d}\|y\|^2 - \frac{1}{n-d}y^TX(X^TX)^{-1}X^Ty, \tag{2}$$
$$\hat\tau_0^2 = \frac{1}{n}\|y\|^2 - \hat\sigma_0^2 = \frac{1}{n-d}y^TX(X^TX)^{-1}X^Ty - \frac{d}{n(n-d)}\|y\|^2$$
are unbiased estimators for $\sigma^2$ and $\tau^2$, respectively. Furthermore, if $n - d \to \infty$, then $\hat\sigma_0^2$ and $\hat\tau_0^2$ are consistent and asymptotically normal, under fairly mild conditions.


When $d > n$, it is more challenging to construct reliable estimators for $\sigma^2$ and $\tau^2$; indeed, if $d > n$, then $X^TX$ is not invertible and the estimator $\hat\sigma_0^2$ breaks down. Fan et al. (2012) and Sun and Zhang (2012) have proposed methods for estimating $\sigma^2$ that are effective when $d \geq n$ and $\beta$ is sparse, e.g., the $\ell_0$- or $\ell_1$-norm of $\beta$ is small. Fan et al.'s (2012) and Sun and Zhang's (2012) results apply in settings where $d/n \to \infty$. However, their underlying sparsity assumptions may be untenable in certain instances and this can dramatically affect the performance of the proposed estimators, as demonstrated by our numerical simulations in Section 5.2.

In this paper, we propose new method of moments-based estimators for $\sigma^2$ and $\tau^2$ that are consistent when $d/n^2 \to 0$. Some of our results require $\sigma^2$ and $\tau^2$ to be bounded, but we make no additional sparsity assumptions on $\beta$. When $d/n \to \rho \in [0, \infty)$, the proposed estimators are shown to be asymptotically normal with rate $n^{-1/2}$. We also derive consistent and asymptotically normal estimators for $r^2 = \tau^2/(\sigma^2 + \tau^2)$; moreover, the same techniques may be used to derive reliable estimators for other functions of $\sigma^2$ and $\tau^2$, like the signal-to-noise ratio. One consequence of our results is that $\sigma^2$ and $\tau^2$ may be consistently estimated when $d > n$, even if $\beta$ itself is non-estimable. In addition to providing theoretical results, we illustrate the performance of the proposed estimators through simulation studies and a real-data analysis involving gene expression data and single nucleotide polymorphism data.

2. Assumptions

2.1. Distributional assumptions

Though sparsity is not required in this paper, we do make strong distributional assumptions about the data. Henceforth, we assume that
$$\epsilon_1, \ldots, \epsilon_n \sim N(0, \sigma^2), \qquad x_1, \ldots, x_n \sim N(0, \Sigma) \tag{3}$$
are all independent. Normality is used repeatedly throughout our analysis. However, we expect that key aspects of many of the results in this paper remain valid under weaker distributional assumptions. This is explored via simulation in Section 5.1.

While it is of interest to relax (3), this assumption facilitates the use of a collection of tools for random matrices developed by Chatterjee (2009), which provides the means for a highly flexible theoretical analysis, especially regarding asymptotic normality. To give a sense of the relevant arguments, let $g(\sigma^2, \tau^2)$ be some function of $\sigma^2$, $\tau^2$. We use Chatterjee's results to bound the total variation distance between estimators for $g(\sigma^2, \tau^2)$ and a standard normal random variable, and then show that this distance vanishes asymptotically for certain $g(\sigma^2, \tau^2)$ of interest.

Alternative approaches to studying asymptotic normality in random matrix theory do not require an underlying normality assumption, e.g., Bai et al. (2007) and Pan and Zhou (2008). These could potentially be applied in the problems considered here, but this may be significantly more complex, especially when considering general estimands of the form $g(\sigma^2, \tau^2)$.


2.2. Correlation among predictors

The covariance matrix $\mathrm{cov}(x_i) = \Sigma$ plays an important role in our analysis. The initially proposed estimators for $\sigma^2$ and $\tau^2$ are derived under the assumption that $\Sigma$ is known, which is equivalent to assuming that $\Sigma = I$; see Section 3.1. These estimators are unbiased, consistent, and asymptotically normal. We subsequently propose modified estimators for $\sigma^2$ and $\tau^2$ in cases where $\Sigma$ is unknown, but a norm-consistent estimator for $\Sigma$ is available, or $\Sigma$ and $\beta$ satisfy conditions described in Section 4.2, which are closely related to other conditions appearing in the random matrix theory literature (Bai et al., 2007; Pan and Zhou, 2008). It remains of interest to find estimators for $\sigma^2$ and $\tau^2$ that are consistent for broader classes of $\Sigma$.

3. Independent predictors: $\Sigma = I$

3.1. The basic estimators

We assume $\Sigma = I$ throughout the discussion in Section 3. However, conditions on $\Sigma$ will be stated explicitly in all formal results; in particular, Theorem 1 in Section 3.3, on asymptotic normality, holds for arbitrary positive definite $\Sigma$. If $\Sigma \neq I$, but $\Sigma$ is known, then one easily reduces to the case where $\Sigma = I$ by replacing $(X, \beta)$ with $(X\Sigma^{-1/2}, \Sigma^{1/2}\beta)$.

The estimators for $\sigma^2$ and $\tau^2$ proposed in this paper are based on the method of moments. Using basic facts about moments of the normal and Wishart distributions, which may be found in Section S3 of the Supplementary Material, one finds that
$$E\left(\frac{1}{n}\|y\|^2\right) = \tau^2 + \sigma^2, \qquad E\left(\frac{1}{n^2}\|X^Ty\|^2\right) = \frac{d+n+1}{n}\tau^2 + \frac{d}{n}\sigma^2. \tag{4}$$
The key observation is that these expressions are non-degenerate linear combinations of $\tau^2$ and $\sigma^2$. It follows that unbiased estimators of $\sigma^2$ and $\tau^2$ may be found by taking linear combinations of $n^{-1}\|y\|^2$ and $n^{-2}\|X^Ty\|^2$. Define
$$\hat\sigma^2 = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|X^Ty\|^2, \qquad \hat\tau^2 = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|X^Ty\|^2$$
for all positive integers $d$, $n$. Our first result follows directly from (4).

Lemma 1. If $\Sigma = I$, then $E(\hat\sigma^2) = \sigma^2$ and $E(\hat\tau^2) = \tau^2$.
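To make the construction concrete, the following is a minimal sketch of $\hat\sigma^2$ and $\hat\tau^2$ (our Python/NumPy rendering of the display above; the Monte Carlo check and its settings are illustrative assumptions). Averaging over simulated datasets gives a quick numerical check of Lemma 1, even with $d > n$.

```python
import numpy as np

def basic_estimators(X, y):
    """Method-of-moments estimators of (sigma^2, tau^2), assuming Sigma = I."""
    n, d = X.shape
    s1 = np.sum(y ** 2) / n              # n^{-1} ||y||^2
    s2 = np.sum((X.T @ y) ** 2) / n**2   # n^{-2} ||X'y||^2
    sigma2_hat = (d + n + 1) / (n + 1) * s1 - n / (n + 1) * s2
    tau2_hat = -d / (n + 1) * s1 + n / (n + 1) * s2
    return sigma2_hat, tau2_hat

# Quick Monte Carlo check of unbiasedness (Lemma 1), with d > n.
rng = np.random.default_rng(1)
n, d, sigma2, tau2 = 250, 500, 1.0, 1.0
beta = rng.normal(size=d)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)
estimates = []
for _ in range(200):
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    estimates.append(basic_estimators(X, y))
print(np.mean(estimates, axis=0))  # should be close to (1.0, 1.0)
```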


3.2. Consistency

Let $\hat\theta = (\hat\sigma^2, \hat\tau^2)$ and let $S = (n^{-1}\|y\|^2, n^{-2}\|X^Ty\|^2)$. The covariance matrix of $\hat\theta$ is important for understanding the asymptotic properties of $\hat\sigma^2$ and $\hat\tau^2$. Since $\hat\theta = AS$, where
$$A = \begin{pmatrix} \dfrac{d+n+1}{n+1} & -\dfrac{n}{n+1} \\[1ex] -\dfrac{d}{n+1} & \dfrac{n}{n+1} \end{pmatrix}, \tag{5}$$
it follows that $\mathrm{cov}(\hat\theta) = A\,\mathrm{cov}(S)A^T$. The covariance matrices for $\hat\theta$ and $S$ are computed explicitly in Lemma S1 and Corollary S1 of the Supplementary Material. Asymptotic approximations for the entries of $\mathrm{cov}(\hat\theta)$ are given in the next result, which also gives basic consistency properties for $\hat\sigma^2$, $\hat\tau^2$. The result follows directly from Corollary S1 in the Supplementary Material.

Lemma 2. If $\Sigma = I$, then
$$\mathrm{var}(\hat\sigma^2) = \frac{2}{n}\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\},$$
$$\mathrm{var}(\hat\tau^2) = \frac{2}{n}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\},$$
$$\mathrm{cov}(\hat\sigma^2, \hat\tau^2) = -\frac{2}{n}\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + 2\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.$$
In particular, $|\hat\sigma^2 - \sigma^2|, |\hat\tau^2 - \tau^2| = O_P[\{(d+n)/n^2\}^{1/2}(\sigma^2 + \tau^2)]$.

Remark 1. If $\sigma^2$, $\tau^2$ are bounded, then Lemma 2 implies that $\hat\sigma^2$ and $\hat\tau^2$ converge to $\sigma^2$ and $\tau^2$, respectively, at rate $(d+n)^{1/2}/n$; in particular, $\hat\sigma^2$ and $\hat\tau^2$ are consistent whenever $d/n^2 \to 0$. Define the plug-in estimator for $r^2$, $\hat r^2 = \hat\tau^2/(\hat\sigma^2 + \hat\tau^2)$. If $\sigma^2$, $\tau^2$ are contained in some compact subset of $(0, \infty)$, then Lemma 2 and Slutsky's theorem imply that $\hat r^2$ is consistent for $r^2$, whenever $d/n^2 \to 0$. The analogous plug-in estimator for the signal-to-noise ratio $\tau^2/\sigma^2$ has similar properties.

Remark 2. It is instructive to compare the asymptotic variance of $\hat\sigma^2$ to that of $\hat\sigma_0^2$, defined in (2). If $n \to \infty$ and $d/n \to \rho \in [0, 1)$, then $\mathrm{var}(\hat\sigma_0^2) \sim 2\sigma^4/\{n(1-\rho)\}$ and $\mathrm{var}(\hat\sigma^2) \sim (2/n)\{\rho(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\}$. Evidently, $\mathrm{var}(\hat\sigma^2)$ increases with the signal strength $\tau^2$, while $\mathrm{var}(\hat\sigma_0^2)$ does not depend on $\tau^2$; this may be an undesirable feature of $\hat\sigma^2$. On the other hand, $\mathrm{var}(\hat\sigma^2) < \mathrm{var}(\hat\sigma_0^2)$ when $\tau^2$ is small or $\rho$ is close to 1.

Remark 3. Suppose $c_1, c_2 > 0$ are fixed real numbers. Lemma 2 implies that if $d/n \to \rho \in [0, \infty)$, then $\hat\sigma^2$, $\hat\tau^2$ are consistent in the sense that
$$\lim_{d/n \to \rho} \sup_{\substack{0 \leq \sigma^2 \leq c_1 \\ 0 \leq \tau^2 \leq c_2}} E\{(\hat\sigma^2 - \sigma^2)^2\} = \lim_{d/n \to \rho} \sup_{\substack{0 \leq \sigma^2 \leq c_1 \\ 0 \leq \tau^2 \leq c_2}} E\{(\hat\tau^2 - \tau^2)^2\} = 0.$$
In a 2013 technical report available from the author, it is shown that if $\rho > 0$, then it is impossible to estimate $\beta$ in this setting. In particular, if $\rho > 0$, then
$$\liminf_{d/n \to \rho} \inf_{\hat\beta} \sup_{\substack{0 \leq \sigma^2 \leq c_1 \\ 0 \leq \tau^2 \leq c_2}} E(\|\hat\beta - \beta\|^2) > 0,$$
where the infimum is over all measurable estimators for $\beta$. Thus, Lemma 2 describes methods for consistently estimating $\sigma^2$ and $\tau^2$ in high-dimensional linear models, where $d > n$ and it is impossible to estimate $\beta$.

3.3. Asymptotic normality

Define the total variation distance between random variables $u$ and $v$, $d_{TV}(u, v) = \sup_{B \in \mathcal{B}(\mathbb{R})} |\mathrm{pr}(u \in B) - \mathrm{pr}(v \in B)|$, where $\mathcal{B}(\mathbb{R})$ denotes the collection of Borel sets in $\mathbb{R}$. The next theorem is this paper's main result on asymptotic normality. It is a direct application of results due to Chatterjee (2009) and it is proved in the Supplementary Material.

Theorem 1. Let $\lambda_1 = \|n^{-1}X^TX\|$ be the operator norm of $n^{-1}X^TX$, i.e., $\lambda_1$ is the largest eigenvalue of $n^{-1}X^TX$. Let $h : \mathbb{R}^2 \to \mathbb{R}$ be a function with continuous second order partial derivatives, let $\nabla h$ denote the gradient of $h$ and let $\nabla^2 h$ denote the Hessian of $h$. Suppose $\psi^2 = \mathrm{var}\{h(S)\} < \infty$, where $S = (n^{-1}\|y\|^2, n^{-2}\|X^Ty\|^2)$, and let $w$ be a normal random variable with the same mean and variance as $h(S)$. Then
$$d_{TV}\{h(S), w\} = O\left(\frac{\|\Sigma\|^{3/2}\xi\nu}{n^{3/2}\psi^2}\right), \tag{6}$$
where
$$\xi = \xi(\sigma^2, \tau^2, \Sigma, d, n) = \gamma_4^{1/4} + \gamma_2^{1/4} + \gamma_0^{1/4}\tau(\tau + 1),$$
$$\nu = \nu(\sigma^2, \tau^2, \Sigma, d, n) = \eta_8^{1/4} + \eta_4^{1/4} + \eta_0^{1/4}\tau^2(\tau^2 + 1) + \gamma_4^{1/4} + \gamma_0^{1/4}(\tau^2 + 1)$$
and, for non-negative integers $k$,
$$\gamma_k = \gamma_k(\sigma^2, \tau^2, \Sigma, d, n) = E\left\{\|\nabla h(S)\|^4 (\lambda_1 + 1)^6 \left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\},$$
$$\eta_k = \eta_k(\sigma^2, \tau^2, \Sigma, d, n) = E\left\{\|\nabla^2 h(S)\|^4 (\lambda_1 + 1)^{12} \left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\}.$$


Remark 1. If $\|\Sigma\|$ is bounded, then the asymptotic behavior of the upper bound (6) is determined by that of $\xi$, $\nu$, and $\psi^2$, which, in turn, is determined by the function $h$. For the functions $h$ considered in this paper, if $d/n \to \rho \in [0, \infty)$, then $\xi$, $\nu$, and $n\psi^2$ are bounded by rational functions in $\sigma^2$ and $\tau^2$. Thus, if $\|\Sigma\|$ is bounded, $d/n \to \rho \in [0, \infty)$ and $\sigma^2$, $\tau^2$ lie in some compact set, then we typically have $d_{TV}\{h(S), w\} = O(n^{-1/2})$. In other words, $h(S)$ converges to a normal random variable at rate $n^{-1/2}$. Under these conditions, if $\psi^2 = \mathrm{var}\{h(S)\}$ is known or estimable, as it is for the $h$ studied here, then asymptotically valid confidence intervals for $E\{h(S)\}$ may be constructed using Theorem 1.

Now let $A$ be the matrix (5) and let $a_1^T$, $a_2^T$ denote the first and second rows of $A$, respectively. Then $\hat\sigma^2 = a_1^TS$, $\hat\tau^2 = a_2^TS$, and $\hat r^2 = \hat\tau^2/(\hat\sigma^2 + \hat\tau^2) = a_2^TS/(a_1^TS + a_2^TS)$. Straightforward applications of Theorem 1 with $\Sigma = I$ and $h(S) = a_1^TS = \hat\sigma^2$, $h(S) = a_2^TS = \hat\tau^2$, and $h(S) = a_2^TS/(a_1^TS + a_2^TS) = \hat r^2$ give bounds on the total variation distance between $\hat\sigma^2$, $\hat\tau^2$, and $\hat r^2$ and corresponding normal random variables; some additional care must be taken for $\hat r^2 = h(S) = a_2^TS/(a_2^TS + a_1^TS)$ because it is undefined when $a_2^TS + a_1^TS = 0$. The asymptotic variance of the estimators follows from Lemma 2 and, in the case of $\hat r^2 = h(S) = a_2^TS/(a_2^TS + a_1^TS)$, a basic Taylor expansion, i.e., the delta method. This is summarized in the following corollary to Theorem 1.

Corollary 1. Suppose $\Sigma = I$ and $\sigma^2, \tau^2 \in D$ for some compact set $D \subseteq (0, \infty)$. Define
$$\psi_1^2 = 2\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\}, \tag{7}$$
$$\psi_2^2 = 2\left\{\left(1 + \frac{d}{n}\right)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\right\}, \tag{8}$$
$$\psi_0^2 = \frac{2}{(\sigma^2+\tau^2)^2}\left\{\left(1 + \frac{d}{n}\right)(\sigma^2+\tau^2)^2 - \sigma^4\right\}. \tag{9}$$
If $d/n \to \rho \in [0, \infty)$, then
$$n^{1/2}\left(\frac{\hat\sigma^2 - \sigma^2}{\psi_1}\right), \quad n^{1/2}\left(\frac{\hat\tau^2 - \tau^2}{\psi_2}\right), \quad n^{1/2}\left(\frac{\hat r^2 - r^2}{\psi_0}\right) \to N(0, 1)$$
in distribution.

Remark 1. In Corollary 1, we require that $d/n \to \rho \in [0, \infty)$. By contrast, Lemma 2 only requires $d/n^2 \to 0$ in order to ensure consistency. It may be of interest to investigate how much the conditions on $d$, $n$ in Corollary 1 can be relaxed, while still ensuring asymptotic normality.
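As an illustration of how Corollary 1 might be used in practice, the following sketch (our code, not the paper's) forms Wald-type confidence intervals for $\sigma^2$ and $\tau^2$ by plugging $\hat\sigma^2$ and $\hat\tau^2$ into (7)–(8); it assumes the function `basic_estimators` from the sketch after Lemma 1.

```python
import numpy as np
from scipy import stats

def wald_intervals(X, y, level=0.95):
    """Approximate confidence intervals for sigma^2 and tau^2 based on
    Corollary 1, with psi_1 and psi_2 estimated by plug-in."""
    n, d = X.shape
    s2, t2 = basic_estimators(X, y)
    psi1 = np.sqrt(2 * ((d / n) * (s2 + t2) ** 2 + s2 ** 2 + t2 ** 2))          # (7)
    psi2 = np.sqrt(2 * ((1 + d / n) * (s2 + t2) ** 2 - s2 ** 2 + 3 * t2 ** 2))  # (8)
    z = stats.norm.ppf(0.5 + level / 2)
    return ((s2 - z * psi1 / np.sqrt(n), s2 + z * psi1 / np.sqrt(n)),
            (t2 - z * psi2 / np.sqrt(n), t2 + z * psi2 / np.sqrt(n)))
```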


4. Unknown $\Sigma$

4.1. Estimable $\Sigma$

In this subsection, we consider the case where $\Sigma$ is unknown, but a norm-consistent estimator for $\Sigma$ is available. An estimator $\hat\Sigma$ for $\Sigma$ is norm-consistent if $\|\hat\Sigma - \Sigma\| \to 0$, where $\|\hat\Sigma - \Sigma\|$ is the operator norm of $\hat\Sigma - \Sigma$ and the convergence holds in some appropriate sense, e.g. convergence in probability or squared-mean. If $d/n \to \rho > 0$, then the sample covariance matrix $n^{-1}X^TX$ is not norm-consistent for $\Sigma$; furthermore, in the absence of additional information about $\Sigma$, it is generally not possible to find a norm-consistent estimator for $\Sigma$. However, Bickel and Levina (2008), El Karoui (2008), Cai et al. (2010), and others have shown that for wide classes of matrices $\Sigma$, norm-consistent estimators are available when $d/n \to \rho > 0$. Accurate estimation of $\Sigma$ may also be possible in semi-supervised learning situations (Lafferty and Wasserman, 2008), where additional $x_i$'s ($i = n+1, \ldots, N$) are available, but the corresponding $y_i$'s are unobserved.

Suppose $\hat\Sigma$ is a positive definite estimator for $\Sigma$ and define the estimators
$$\hat\sigma^2(\hat\Sigma) = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|\hat\Sigma^{-1/2}X^Ty\|^2,$$
$$\hat\tau^2(\hat\Sigma) = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|\hat\Sigma^{-1/2}X^Ty\|^2.$$
Then $\hat\sigma^2(\Sigma)$ and $\hat\tau^2(\Sigma)$ are the known-$\Sigma$ analogues of $\hat\sigma^2$ and $\hat\tau^2$, respectively. In particular, all of the results from Section 3 remain valid with $\hat\sigma^2(\Sigma)$ and $\hat\tau^2(\Sigma)$ in place of $\hat\sigma^2$ and $\hat\tau^2$. Since
$$\hat\sigma^2(\hat\Sigma) = \hat\sigma^2(\Sigma) + O\left\{\frac{1}{n^2}\|\Sigma^{-1/2}X^Ty\|^2\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|\right\}, \tag{10}$$
$$\hat\tau^2(\hat\Sigma) = \hat\tau^2(\Sigma) + O\left\{\frac{1}{n^2}\|\Sigma^{-1/2}X^Ty\|^2\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|\right\}, \tag{11}$$
we conclude that if $\|\Sigma^{1/2}\hat\Sigma^{-1}\Sigma^{1/2} - I\|$ is small, then asymptotic properties of $\hat\sigma^2(\hat\Sigma)$ and $\hat\tau^2(\hat\Sigma)$ are determined by those of $\hat\sigma^2(\Sigma)$ and $\hat\tau^2(\Sigma)$. The following proposition is a direct consequence of (10)–(11) and the results of Section 3.

Proposition 1. Let $\hat\Sigma$ be a positive definite estimator for $\Sigma$ and suppose $\|\Sigma\|, \|\Sigma^{-1}\|, \|\hat\Sigma\|, \|\hat\Sigma^{-1}\| = O_P(1)$.

(i) The plug-in estimators $\hat\sigma^2(\hat\Sigma)$ and $\hat\tau^2(\hat\Sigma)$ satisfy
$$|\hat\sigma^2(\hat\Sigma) - \sigma^2|, |\hat\tau^2(\hat\Sigma) - \tau^2| = O_P\left[\left\{\left(\frac{d+n}{n^2}\right)^{1/2} + \|\hat\Sigma - \Sigma\|\right\}(\sigma^2 + \tau^2)\right].$$


(ii) Let $\psi_1^2$, $\psi_2^2$, and $\psi_0^2$ be as defined in (7)–(9) and let $\hat r^2(\hat\Sigma) = \hat\tau^2(\hat\Sigma)/\{\hat\sigma^2(\hat\Sigma) + \hat\tau^2(\hat\Sigma)\}$. Suppose $\sigma^2, \tau^2 \in D$ for some compact set $D \subseteq (0, \infty)$. If $d/n \to \rho \in [0, \infty)$ and $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$, then
$$n^{1/2}\left\{\frac{\hat\sigma^2(\hat\Sigma) - \sigma^2}{\psi_1}\right\}, \quad n^{1/2}\left\{\frac{\hat\tau^2(\hat\Sigma) - \tau^2}{\psi_2}\right\}, \quad n^{1/2}\left\{\frac{\hat r^2(\hat\Sigma) - r^2}{\psi_0}\right\} \to N(0, 1)$$
in distribution.

Remark 1. Part (i) of Proposition 1 implies that if $\sigma^2$, $\tau^2$ are bounded, $d = o(n^2)$ and $\|\hat\Sigma - \Sigma\| = o_P(1)$, then $\hat\sigma^2(\hat\Sigma)$ and $\hat\tau^2(\hat\Sigma)$ are consistent for $\sigma^2$ and $\tau^2$, respectively.

Remark 2. If $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$ and the other conditions of Proposition 1 are met, then $\hat\sigma^2(\hat\Sigma)$, $\hat\tau^2(\hat\Sigma)$, and $\hat r^2(\hat\Sigma)$ are asymptotically normal with the same asymptotic variance as $\hat\sigma^2(\Sigma)$, $\hat\tau^2(\Sigma)$, and $\hat r^2(\Sigma)$, respectively. The condition $\|\hat\Sigma - \Sigma\| = o_P(n^{-1/2})$ is quite strong. However, important classes of high-dimensional covariance matrices can be estimated at this rate, e.g., certain classes of Toeplitz matrices (Cai et al., 2013).
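The plug-in construction is mechanical; here is a minimal sketch of $\hat\sigma^2(\hat\Sigma)$ and $\hat\tau^2(\hat\Sigma)$ (ours, in Python/NumPy) given any positive definite estimate `Sigma_hat`.

```python
import numpy as np

def estimators_known_cov(X, y, Sigma_hat):
    """Plug-in estimators sigma^2(Sigma_hat), tau^2(Sigma_hat) from Section 4.1."""
    n, d = X.shape
    # Since Sigma_hat is symmetric positive definite,
    # ||Sigma_hat^{-1/2} X'y||^2 = (X'y)' Sigma_hat^{-1} (X'y),
    # so a linear solve avoids an explicit matrix square root.
    v = X.T @ y
    q = v @ np.linalg.solve(Sigma_hat, v)
    y2 = np.sum(y ** 2)
    sigma2_hat = (d + n + 1) / (n * (n + 1)) * y2 - q / (n * (n + 1))
    tau2_hat = -d / (n * (n + 1)) * y2 + q / (n * (n + 1))
    return sigma2_hat, tau2_hat
```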

4.2. Non-estimable $\Sigma$

If $\mathrm{cov}(x_i) = \Sigma$ is unknown and a norm-consistent estimator is unavailable, then it is more challenging to find good estimators for $\sigma^2$, $\tau^2$ when $d > n$. To indicate where the estimators $\hat\sigma^2$, $\hat\tau^2$ from Section 3 break down when $\Sigma$ is unknown, define $\tau_k^2 = \beta^T\Sigma^k\beta$ and $m_k = d^{-1}\mathrm{tr}(\Sigma^k)$, $k = 0, 1, 2, \ldots$. Then $\tau^2 = \tau_1^2$. If $\Sigma = I$, then $\tau^2 = \tau_k^2$ and $m_k = 1$ for all $k = 0, 1, 2, \ldots$. On the other hand, if $\Sigma \neq I$, then typically $\tau_k \neq \tau_{k'}$ and $m_k \neq m_{k'}$ for $k \neq k'$. One easily checks that

$$\frac{1}{n}E(\|y\|^2) = \sigma^2 + \tau_1^2, \tag{12}$$
$$\frac{1}{n^2}E(\|X^Ty\|^2) = \frac{d}{n}m_1\sigma^2 + \frac{d}{n}m_1\tau_1^2 + \left(1 + \frac{1}{n}\right)\tau_2^2, \tag{13}$$
$$E(\hat\sigma^2) = \frac{d(1-m_1) + n + 1}{n+1}\sigma^2 + \frac{d(1-m_1) + n + 1}{n+1}\tau_1^2 - \tau_2^2,$$
$$E(\hat\tau^2) = \frac{d(m_1-1)}{n+1}\sigma^2 + \frac{d(m_1-1)}{n+1}\tau_1^2 + \tau_2^2.$$
Thus, if $\Sigma \neq I$, then $\hat\sigma^2$, $\hat\tau^2$ are no longer unbiased.

In this subsection we propose alternative estimators for $\sigma^2$, $\tau^2$ that are consistent and asymptotically normal, provided $\beta$, $\Sigma$ satisfy conditions that simplify the relationship between the various $\tau_k$ and $m_k$. Indeed, consider the approximation
$$\tau_k^2 = \beta^T\Sigma^k\beta \approx \|\beta\|^2\frac{1}{d}\mathrm{tr}(\Sigma^k) = \tau_0^2 m_k \tag{14}$$
for positive integers $k$. While (14) does not hold in general, related conditions have been proposed elsewhere in the random matrix theory literature (Bai et al., 2007; Pan and Zhou, 2008). If $\Sigma = \nu^2 I$, where $\nu^2 > 0$ is a positive real number, then (14) is an equality. More broadly, if $d$ is large and $\Sigma$ is an independent orthogonally invariant random matrix with well-behaved eigenvalues, then (14) may be a reasonable approximation for any $\beta$ (Bai et al., 2007). Furthermore, Bai et al. (2007) point out that for any given $\Sigma$ there must exist some $\beta$ such that (14) is an equality; for instance, take $\beta = \bar u$, where $\bar u = d^{-1/2}(u_1 + \cdots + u_d)$ and $u_1, \ldots, u_d$ are orthonormal eigenvectors of $\Sigma$.

Now assume that (14) holds for $k = 1, 2$. The method of moments suggests that
$$\hat m_1 = \frac{1}{d}\mathrm{tr}\left(\frac{1}{n}X^TX\right), \qquad \hat m_2 = \frac{1}{d}\mathrm{tr}\left\{\left(\frac{1}{n}X^TX\right)^2\right\} - \frac{1}{dn}\left\{\mathrm{tr}\left(\frac{1}{n}X^TX\right)\right\}^2 \tag{15}$$
are reasonable estimators for $m_1$ and $m_2$. Combining this with (14) yields $\tau_k^2 \approx \tau^2\hat m_k/\hat m_1$ ($k = 1, 2$) and, by (12)–(13),
$$\frac{1}{n}E(\|y\|^2) = \sigma^2 + \tau^2, \qquad \frac{1}{n^2}E(\|X^Ty\|^2) \approx \frac{d}{n}\hat m_1\sigma^2 + \left\{\frac{d}{n}\hat m_1 + \left(1 + \frac{1}{n}\right)\frac{\hat m_2}{\hat m_1}\right\}\tau^2.$$
In particular, up to the accuracy of approximation, $n^{-1}E(\|y\|^2)$ and $n^{-2}E(\|X^Ty\|^2)$ are linear combinations of $\sigma^2$ and $\tau^2$, with coefficients determined by the known quantities $d$, $n$, $\hat m_1$, and $\hat m_2$. Thus, we may obtain approximately unbiased estimators of $\sigma^2$ and $\tau^2$ by taking linear combinations of $n^{-1}\|y\|^2$ and $n^{-2}\|X^Ty\|^2$, with coefficients determined by $d$, $n$, $\hat m_1$, and $\hat m_2$. Indeed, we define the estimators
$$\tilde\sigma^2 = \left\{1 + \frac{d\hat m_1^2}{(n+1)\hat m_2}\right\}\frac{1}{n}\|y\|^2 - \frac{\hat m_1}{n(n+1)\hat m_2}\|X^Ty\|^2,$$
$$\tilde\tau^2 = -\frac{d\hat m_1^2}{n(n+1)\hat m_2}\|y\|^2 + \frac{\hat m_1}{n(n+1)\hat m_2}\|X^Ty\|^2,$$
where $\hat m_1$, $\hat m_2$ are given in (15).

Proposition 2 summarizes some asymptotic properties of $\tilde\sigma^2$ and $\tilde\tau^2$, which depend on the accuracy of the approximation (14). An outline of the proof may be found in the Supplementary Material.


Proposition 2. Suppose $D \subseteq (0, \infty)$ is a compact set and $\sigma^2, \tau^2 \in D$. Suppose further that there exist constants $m_-, m_+ \in \mathbb{R}$ such that $0 < m_- < m_1, m_4 < m_+$ for all $d$. Let $\Delta_k = |\tau_1^2 - \tau_0^2 m_1| + \cdots + |\tau_k^2 - \tau_0^2 m_k|$ ($k = 1, 2, \ldots$).

(i) If $0 < d/n < c$ for some constant $c \in \mathbb{R}$, then $|\tilde\sigma^2 - \sigma^2|, |\tilde\tau^2 - \tau^2| = O_P(n^{-1/2} + \Delta_2)$.

(ii) Let $\tilde r^2 = \tilde\tau^2/(\tilde\sigma^2 + \tilde\tau^2)$ and let
$$\tilde\psi_1^2 = 2\left\{\left(\frac{dm_1^2}{nm_2} + \frac{m_1m_3}{m_2^2} - 1\right)(\sigma^2+\tau^2)^2 + \left(2 - \frac{m_1m_3}{m_2^2}\right)\sigma^4 + \frac{m_1m_3}{m_2^2}\tau^4\right\},$$
$$\tilde\psi_2^2 = 2\left\{\left(\frac{dm_1^2}{nm_2} + \frac{m_1m_3}{m_2^2}\right)(\sigma^2+\tau^2)^2 - \frac{m_1m_3}{m_2^2}\sigma^4 + \left(2 + \frac{m_1m_3}{m_2^2}\right)\tau^4\right\},$$
$$\tilde\psi_0^2 = \frac{2}{(\sigma^2+\tau^2)^2}\left\{\frac{dm_1^2}{nm_2}(\sigma^2+\tau^2)^2 + \frac{2m_1m_3}{m_2^2}(\sigma^2\tau^2 + \tau^4) - \tau^4\right\}.$$
If $d/n \to \rho \in [0, \infty)$, $d \to \infty$, and $\Delta_3 = o(n^{-1/2})$, then
$$n^{1/2}\left(\frac{\tilde\sigma^2 - \sigma^2}{\tilde\psi_1}\right), \quad n^{1/2}\left(\frac{\tilde\tau^2 - \tau^2}{\tilde\psi_2}\right), \quad n^{1/2}\left(\frac{\tilde r^2 - r^2}{\tilde\psi_0}\right) \to N(0, 1)$$
in distribution.

Remark 1. The condition that $0 < m_- < m_1, m_4 < m_+ < \infty$ for all $d$ ensures that the first four moments of the empirical distribution of the eigenvalues of $\Sigma = \Sigma_d$ are well-behaved.

Remark 2. Proposition 2 (i) requires that $\Delta_2$ is small in order to ensure consistency; Proposition 2 (ii) requires that $\Delta_3$ is small. If $\Delta_2$, $\Delta_3$ are small, then (14) is a reasonable approximation for $k = 2, 3$.

Remark 3. The condition $\Delta_3 = o(n^{-1/2})$ in Proposition 2 (ii) is quite strong. For instance, if $\Sigma$ is a sample covariance matrix formed from independent $N(0, \sigma_0^2)$ data with a constant aspect ratio, then $\Delta_2 = o(1)$, but $\Delta_3 \neq o(n^{-1/2})$. On the other hand, if $\Sigma$ is a constant multiple of the identity matrix, then $\Delta_3 = o(n^{-1/2})$.

Remark 4. If $\Sigma = I$, then $m_1 = m_2 = m_3 = 1$ and $\tilde\psi_j^2 = \psi_j^2$, $j = 0, 1, 2$, where the $\psi_j^2$ are given in (7)–(9). In other words, if $\Sigma = I$, then the asymptotic variance of $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde r^2$ is the same as that of $\hat\sigma^2$, $\hat\tau^2$, and $\hat r^2$, respectively. This is driven by the fact that if $d/n \to \rho \in [0, \infty)$ and $d \to \infty$, then $|\hat m_k - m_k|$ converges at a rate faster than $n^{-1/2}$. This also highlights the necessity of the condition $d \to \infty$ in Proposition 2 (ii): if $d$ is bounded, then $|\hat m_k - m_k| = O(n^{-1/2})$ and this may affect the asymptotic distribution of the estimators.

Remark 5. In order to construct confidence intervals or conduct Wald tests based on Proposition 2 (ii), an estimate of $m_3$ is required. The method of moments and Proposition S1 in the Supplementary Material suggest the estimator
$$\hat m_3 = \frac{1}{d}\mathrm{tr}\left\{\left(\frac{1}{n}X^TX\right)^3\right\} + \frac{2}{dn^2}\left\{\mathrm{tr}\left(\frac{1}{n}X^TX\right)\right\}^3 - \frac{3}{dn}\mathrm{tr}\left(\frac{1}{n}X^TX\right)\mathrm{tr}\left\{\left(\frac{1}{n}X^TX\right)^2\right\}. \tag{16}$$

5. Simulation studies

5.1. Example 1

In this example, we investigated basic properties of some of the estimators proposed above. We fixed $d = 500$ and considered settings with $n = 250, 500$. The predictors $x_1, \ldots, x_n \in \mathbb{R}^d$ were generated according to one of three distributions. In the first setting, $x_1, \ldots, x_n \sim N(0, I)$. In the second setting, we generated a $(2d) \times d$ random matrix $Z$ with independent $N(0, 1)$ entries and took $\Sigma_w = (2d)^{-1}Z^TZ$; the predictors $x_1, \ldots, x_n$ were then generated according to a $N(0, \Sigma_w)$ distribution. In this predictor setting, a single matrix $\Sigma_w$ was used to generate all of the simulated datasets, i.e., $\Sigma_w$ was generated only once. In the third predictor setting, the individual $x_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, d$) were independent random variables taking values in $\{\pm 1\}$ with $\mathrm{pr}(x_{ij} = 1) = \mathrm{pr}(x_{ij} = -1) = 0.5$.

To generate the parameter $\beta \in \mathbb{R}^d$, we created a $d$-dimensional vector with the first $d/2$ coordinates independent uniform$(0, 1)$ and the remaining $d/2$ coordinates independent $N(0, 1)$; $\beta$ was obtained by standardizing this vector so that $\beta^T\Sigma\beta = \tau^2 = 1$. Thus, $\beta$ corresponding to the settings where $\mathrm{cov}(x_i) = I$ is scaled slightly differently from $\beta$ corresponding to $\mathrm{cov}(x_i) = \Sigma \neq I$. The residual variance was fixed at $\sigma^2 = 1$.

For each setting in this example, we generated 1000 independent datasets and computed the estimators $\hat\sigma^2 = \hat\sigma^2(I)$, $\hat\tau^2 = \hat\tau^2(I)$, $\hat r^2 = \hat r^2(I)$ and $\tilde\sigma^2$, $\tilde\tau^2$, $\tilde r^2$ for each dataset. Recall that the estimators $\hat\sigma^2$, $\hat\tau^2$, $\hat r^2$, from Section 3.1, were derived under the assumption that $x_i \sim N(0, I)$; the estimators $\tilde\sigma^2$, $\tilde\tau^2$, $\tilde r^2$, from Section 4.2, were derived under the assumption that (14) is a reasonable approximation. Summary statistics for the various estimators are reported in Table 1.
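A sketch of the data-generating step for this example (our code; the seed and helper names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 500

# One Sigma_w, generated once and reused across all simulated datasets.
Z = rng.normal(size=(2 * d, d))
Sigma_w = Z.T @ Z / (2 * d)
L = np.linalg.cholesky(Sigma_w)

def make_beta(Sigma):
    """First d/2 coordinates uniform(0,1), rest N(0,1); scaled so beta' Sigma beta = 1."""
    b = np.concatenate([rng.uniform(size=d // 2), rng.normal(size=d // 2)])
    return b / np.sqrt(b @ Sigma @ b)

def draw_predictors(n, setting):
    if setting == "identity":
        return rng.normal(size=(n, d))
    if setting == "wishart":
        return rng.normal(size=(n, d)) @ L.T      # rows ~ N(0, Sigma_w)
    return rng.choice([-1.0, 1.0], size=(n, d))   # binary +/-1 design
```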

Table 1 indicates that for $x_i \sim N(0, I)$, all of the methods perform as expected, given the theoretical results developed above: the estimators are essentially unbiased, and the standard errors match those predicted in Corollary 1 and Proposition 2. For $x_i \sim N(0, \Sigma_w)$, Table 1 indicates that $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde r^2$ are approximately unbiased and their standard errors are comparable to the setting where $x_i \sim N(0, I)$. On the other hand, $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat r^2(I)$ are badly biased when $x_i \sim N(0, \Sigma_w)$; this is not unexpected, given the discussion in Section 4.2. While this paper contains no theoretical results describing the behavior of our estimators for non-normal data, the numerical results in this example suggest that the methods proposed here might be successfully applied in broader circumstances.


Table 1. Summary statistics for Example 1; $d = 500$. Means and standard errors of various estimators, computed over 1000 independent datasets for each configuration. In each setting, $\sigma^2 = \tau^2 = 1$ and $r^2 = \tau^2/(\sigma^2 + \tau^2) = 0.5$. In the standard error column corresponding to $x_i \sim N(0, I)$, numbers in parentheses are theoretically predicted standard errors, which are denoted $\psi_1$, $\psi_2$, and $\psi_0$ in the text; see Corollary 1 and Proposition 2. Theoretically predicted standard errors for $x_i \sim N(0, \Sigma_w)$ and $x_i \in \{\pm 1\}$ binary are not known.

                     x_i ~ N(0, I)        x_i ~ N(0, Σ_w)     x_i ∈ {±1} binary
Estimator    n    Mean  Std. Error     Mean  Std. Error     Mean  Std. Error
σ̂²(I)      250    0.99  0.29 (0.28)    0.50  0.41           1.00  0.29
           500    1.00  0.16 (0.15)    0.51  0.24           1.00  0.15
σ̃²         250    1.00  0.28 (0.28)    1.01  0.26           1.00  0.29
           500    1.00  0.16 (0.15)    1.01  0.15           1.00  0.15
τ̂²(I)      250    1.00  0.33 (0.33)    1.50  0.48           0.99  0.34
           500    1.00  0.20 (0.20)    1.48  0.29           0.99  0.19
τ̃²         250    1.00  0.33 (0.33)    1.00  0.32           1.00  0.34
           500    1.00  0.20 (0.20)    0.98  0.19           1.00  0.19
r̂²(I)      250    0.50  0.15 (0.15)    0.75  0.21           0.49  0.15
           500    0.50  0.08 (0.08)    0.74  0.12           0.50  0.08
r̃²         250    0.50  0.15 (0.15)    0.49  0.14           0.49  0.15
           500    0.50  0.08 (0.08)    0.49  0.08           0.50  0.08

Indeed, the results in Table 1 for $x_i \in \{\pm 1\}$ binary show that all of the estimators are nearly unbiased and have standard errors that are similar to the corresponding standard errors in the case where $x_i \sim N(0, I)$.

One of the more striking aspects of Table 1 is the accuracy and robustness of the estimators $\tilde\sigma^2$, $\tilde\tau^2$, and $\tilde r^2$. Proposition 2 suggests that these estimators might be expected to perform well when $x_i \sim N(0, I)$ and $x_i \sim N(0, \Sigma_w)$; however, none of our theoretical results apply to the case where $x_i \in \{\pm 1\}$ is binary.

The quantities $\sigma^2$ and $\tau^2$ are non-negative. However, there is no guarantee that the estimators proposed in this paper are non-negative. Two factors contributing to negative estimates of $\sigma^2$ and $\tau^2$ are (i) random fluctuations in the data and (ii) significant violations of assumptions about $\mathrm{cov}(x_i) = \Sigma$. For most of the estimators and settings considered in this example, the percentage of datasets with negative estimates of $\tau^2$ or $\sigma^2$ was less than 1%; in these instances, it appears that negative estimates were largely explained by random fluctuations in the data and that one could still reasonably appeal to asymptotic results in order to construct confidence intervals for $\sigma^2$ and $\tau^2$, for instance. The major exception involved the estimate $\hat\sigma^2(I)$ in cases where $x_i \sim N(0, \Sigma_w)$. The estimators $\hat\sigma^2(I)$, $\hat\tau^2(I)$, and $\hat r^2(I)$ were derived under the assumption that $\mathrm{cov}(x_i) = I$ and, as noted above, are significantly biased when $x_i \sim N(0, \Sigma_w)$; in particular, $\hat\sigma^2(I)$ is biased towards 0, which led to a substantially higher fraction of estimates $\hat\sigma^2(I) < 0$. Indeed, $\hat\sigma^2(I) < 0$ in 11% of the datasets with $x_i \sim N(0, \Sigma_w)$ and $n = 250$. An even more extreme example is considered in Example 2 below, where the mean value of an estimator for $\sigma^2$ is negative. In these settings it seems challenging to make valid inferences about $\sigma^2$ and $\tau^2$. These observations further highlight the importance of understanding and validating the assumptions underlying the proposed methods.

5.2. Example 2

Sun and Zhang (2012) proposed scaled lasso and scaled minimax concave penalty methods for estimating $\sigma^2$ in high-dimensional linear models. These methods, which simultaneously estimate $\sigma^2$ and $\beta$, are very effective when $\beta$ is sparse. Let $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ denote the scaled lasso and scaled minimax concave penalty estimators for $\sigma^2$, respectively. In this example, we compare the performance of $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ with some of the estimators for $\sigma^2$ proposed in this paper, in settings where $\beta$ is both sparse and non-sparse, i.e., dense.

With $d = 3000$, the predictors in this example were generated according to $x_i \sim N(0, \Sigma)$, where $\Sigma = (\sigma_{ij})$ and $\sigma_{ij} = 0.5^{|i-j|}$. We fixed $\sigma^2 = 1$. Sparse and dense parameters $\beta \in \mathbb{R}^d$ were generated as follows. First, to generate the sparse $\beta$, five random multiples of 25 between 25 and $d - 25 = 2975$ were selected. That is, we selected $k_1, \ldots, k_5$ from $\{25, 50, 75, \ldots, 2975\}$ uniformly at random. Next, we took $\beta_0 \in \mathbb{R}^d$ to be the vector with the 7-dimensional subvector $(1, 2, 3, 4, 3, 2, 1)$ centered at the coordinates corresponding to $k_1, \ldots, k_5$, so that the $k_j$-th entry of $\beta_0$ was 4, the $(k_j \pm 1)$-th was 3, etc.; the remaining entries in $\beta_0$ were set equal to 0. We then set $\beta = \{3/(\beta_0^T\Sigma\beta_0)\}^{1/2}\beta_0$, so that $\tau^2 = \beta^T\Sigma\beta = 3$. This sparse $\beta$ was generated only once; in other words, the same sparse $\beta$ was used throughout the simulations in this example. To generate the dense $\beta$ used in this example, we followed the same procedure as for the sparse $\beta$, except that in $\beta_0$, the 7-dimensional subvector $(1, 2, 3, 4, 3, 2, 1)$ was centered at coordinates corresponding to each multiple of 25 between 25 and 2975. Thus, for the sparse $\beta$, we had $\|\beta\|_0 = 7 \times 5 = 35$, where $\|\beta\|_0$ denotes the number of non-zero coordinates in $\beta$, and for the dense $\beta$ we had $\|\beta\|_0 = 7 \times (d/25 - 1) = 833$; however, $\tau^2 = \beta^T\Sigma\beta = 3$ was the same for both the sparse and dense $\beta$. In this simulation study, we considered datasets with $n = 600$ and $n = 2400$ observations. With sparse $\beta$ and $n = 600$, the simulation settings in this example are very similar to those in Example 2 from Section 4.1 of Sun and Zhang (2012).

Under each of the settings described above, we generated 100 independent datasets and, for each simulated dataset, we computed $\hat\sigma^2_{\mathrm{lasso}}$, $\hat\sigma^2_{\mathrm{MCP}}$, $\hat\sigma^2(\hat\Sigma)$, $\hat\sigma^2(\Sigma)$, and $\tilde\sigma^2$. For $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$, we used the shrinkage parameter $\lambda_0 = \{\log(d)/n\}^{1/2}$; this value of $\lambda_0$ yielded the best performance in the numerical examples described by Sun and Zhang (2012). The estimator $\hat\sigma^2_{\mathrm{MCP}}$ requires specification of an additional parameter $\gamma$; following Sun and Zhang (2012), we took $\gamma = 2/[1 - \max_{i,j}\{X_i^TX_j/(\|X_i\|\|X_j\|)\}]$, where $X_j$ denotes the $j$-th column of $X$. The estimator $\hat\sigma^2(\hat\Sigma)$ was introduced in Section 4.1 of this paper. Here we take advantage of the AR(1) structure of $\Sigma$ and set $\hat\Sigma = (\hat\sigma_{ij})$, where $\hat\sigma_{ij} = \hat\alpha^{|i-j|}$ and $\hat\alpha = \{n(d-1)\}^{-1}\sum_{i=1}^n\sum_{j=2}^d x_{ij}x_{i(j-1)}$. We view the estimator $\hat\sigma^2(\Sigma)$ as an oracle estimator, which utilizes full knowledge of the actual covariance matrix $\Sigma$; following the discussion in Section 3.1, this estimator should perform similarly to the estimator $\hat\sigma^2(I)$ in settings where $\mathrm{cov}(x_i) = I$ and $\tau^2 = 3$. Finally, the estimator $\tilde\sigma^2$ is the unknown covariance estimator from Section 4.2. Recall that the theoretical performance guarantees for $\tilde\sigma^2$ given in Proposition 2 require $|\beta^T\Sigma^k\beta - \|\beta\|^2\mathrm{tr}(\Sigma^k)/d| \approx 0$, for $k = 1, 2$. In this example, for the sparse $\beta$ we had
$$\|\beta\|^2\frac{1}{d}\mathrm{tr}(\Sigma) - \beta^T\Sigma\beta = -1.76, \qquad \|\beta\|^2\frac{1}{d}\mathrm{tr}(\Sigma^2) - \beta^T\Sigma^2\beta = -5.54; \tag{17}$$
the corresponding quantities are essentially the same for the dense $\beta$. Summary statistics for the various estimators computed in this numerical study are reported in Table 2.

Table 2. Example 2; $d = 3000$, $\sigma^2 = 1$. Means and standard errors of estimators for $\sigma^2$, based on 100 independent datasets. Left, sparse $\beta$; right, dense $\beta$.

                          Sparse β              Dense β
           Estimator      Mean   Std. Err.      Mean   Std. Err.
n = 600    σ̂²_lasso       1.11   0.07           3.26   0.21
           σ̂²_MCP         1.05   0.06           3.10   0.21
           σ̂²(Σ̂)          0.97   0.50           0.98   0.56
           σ̂²(Σ)          0.97   0.50           0.98   0.56
           σ̃²            −0.60   0.52          −0.57   0.59
n = 2400   σ̂²_lasso       1.03   0.03           2.32   0.07
           σ̂²_MCP         1.01   0.03           2.00   0.08
           σ̂²(Σ̂)          0.98   0.16           1.01   0.15
           σ̂²(Σ)          0.98   0.16           1.01   0.15
           σ̃²            −0.59   0.21          −0.57   0.22
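A sketch of this AR(1) plug-in step (our code; `estimators_known_cov` is from the sketch in Section 4.1, and forming $\hat\Sigma = (\hat\alpha^{|i-j|})$ explicitly is a simplicity choice):

```python
import numpy as np
from scipy.linalg import toeplitz

def ar1_plugin(X, y):
    """sigma^2(Sigma_hat) with Sigma_hat exploiting the AR(1) structure of Sigma."""
    n, d = X.shape
    # hat alpha = {n(d-1)}^{-1} sum_i sum_{j>=2} x_{ij} x_{i(j-1)}
    alpha = np.sum(X[:, 1:] * X[:, :-1]) / (n * (d - 1))
    Sigma_hat = toeplitz(alpha ** np.arange(d))   # entries alpha^{|i-j|}
    return estimators_known_cov(X, y, Sigma_hat)
```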

For sparse $\beta$, the results in Table 2 indicate that $\hat\sigma^2_{\mathrm{lasso}}$, $\hat\sigma^2_{\mathrm{MCP}}$, $\hat\sigma^2(\hat\Sigma)$ and $\hat\sigma^2(\Sigma)$ are all nearly unbiased. However, the empirical standard errors for the scaled lasso and minimax concave penalty estimators are considerably smaller than the standard errors for $\hat\sigma^2(\hat\Sigma)$ and $\hat\sigma^2(\Sigma)$. In this example, the performance of $\hat\sigma^2(\hat\Sigma)$ is very similar to that of the oracle estimator $\hat\sigma^2(\Sigma)$.

The estimator $\tilde\sigma^2$ is significantly biased in this example. Indeed, the mean value of $\tilde\sigma^2$ is negative, while $\sigma^2 > 0$. The poor performance of $\tilde\sigma^2$ in this example is not completely unexpected, given that $|\beta^T\Sigma^k\beta - \|\beta\|^2\mathrm{tr}(\Sigma^k)/d|$ is substantially larger than 0 for $k = 1, 2$; see (17). In fact, more can be said. Using the approximation $\hat m_k \approx m_k = \mathrm{tr}(\Sigma^k)/d$, one can check that $E(\tilde\sigma^2) \approx \sigma^2 + \tau_1^2 - (m_1/m_2)\tau_2^2$. In this example, $\tau_1^2 - (m_1/m_2)\tau_2^2 = -1.57$ and, by the previous approximation, $E(\tilde\sigma^2) \approx -0.57$; this calculation is for the sparse $\beta$, but the result is almost exactly the same for the dense $\beta$. Note the similarity between this approximation and the empirical means of $\tilde\sigma^2$ in Table 2.


Table 3. Example 2; $d = 3000$, $\sigma^2 = 1$, $\|\beta\|^2 = 1.24$. Empirical mean squared error $\|\hat\beta - \beta\|^2$ of the scaled lasso and minimax concave penalty estimators for $\beta$, based on 100 independent datasets.

             Sparse β            Dense β
   n       lasso    MCP        lasso    MCP
   600     0.19     0.37       1.22     1.25
   2400    0.05     0.09       0.90     0.93

For dense $\beta$, the performance of $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ breaks down, while the performance of $\hat\sigma^2(\hat\Sigma)$, $\hat\sigma^2(\Sigma)$ and $\tilde\sigma^2$ remains virtually unchanged, as compared to the sparse $\beta$ case. When $n = 600$, the empirical means of $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ are both greater than 3; when $n = 2400$, the empirical means of $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ are both at least 2. Both $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ depend on associated lasso and minimax concave penalty estimators for $\beta$. The performance breakdown of $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$ when $\beta$ is dense is likely related to the fact that the corresponding estimators for $\beta$ perform poorly when $\beta$ is dense and $d/n$ is large. Indeed, in Table 3 we report the empirical mean squared error for the lasso and minimax concave penalty estimators for $\beta$ that are associated with $\hat\sigma^2_{\mathrm{lasso}}$ and $\hat\sigma^2_{\mathrm{MCP}}$; mean squared error is substantially higher for estimating dense $\beta$.

The results of this simulation study suggest that the estimators proposed in this paper may be useful for estimating $\sigma^2$ in settings where $d/n$ is large and little is known about sparsity in $\beta$. However, we emphasize two important points: the predictor covariance must be handled appropriately, which, in this example, means utilizing the fact that $\Sigma$ has AR(1) structure; and the estimators for $\sigma^2$ proposed in this paper may have larger standard error than estimators derived from a reliable estimate of $\beta$.

6. Data analysis

A series of studies have investigated associations between array-based gene expression data and single nucleotide polymorphisms from individuals in the International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/). One of the major hypotheses validated by Stranger et al. (2007a, 2012, 2007b) and others is that single nucleotide polymorphisms located near a given gene may be associated with gene expression levels. These findings may have significant implications for understanding disease susceptibility, treatment, and public health. In this analysis, we use methods proposed in this paper to estimate the proportion of variation in gene expression levels that is explained by single nucleotide polymorphisms; that is, we estimate $r^2$. In the present context, estimating $r^2$ is closely related to estimating heritability, an important concept in genetics (Yang et al., 2010).

In Stranger et al. (2007b), the authors identified 341 genes that were significantly associated with single nucleotide polymorphisms from 45 individuals in the Japanese HapMap population. We randomly selected 100 of these genes and, using a more recent gene expression dataset made publicly available by Stranger et al. (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-264/), we investigated associations between these 100 genes and nearby single nucleotide polymorphisms from $n = 80$ individuals in the Han Chinese HapMap population. Single nucleotide polymorphism data was obtained from the International HapMap Project, release 28. For each gene, we restricted our attention to single nucleotide polymorphisms located within 1 Mbp of the gene midpoint and with minor allele frequency > 5%. Single nucleotide polymorphisms with missingness > 10% in the Han Chinese population were excluded from our analysis; for single nucleotide polymorphisms included in the analysis, missing values, i.e., minor allele counts, were imputed using the marginal mean from the Han Chinese population.

For each gene, we sought to estimate $r^2 = \tau^2/(\sigma^2 + \tau^2)$ for regressing the gene expression level on the minor allele counts of single nucleotide polymorphisms within 1 Mbp of the gene midpoint. Some preprocessing of the data was required in order to ensure appropriate application of the methods proposed in this paper. First, we centered the gene expression levels and single nucleotide polymorphism minor allele counts, so that each variable had mean 0. Next, we standardized the single nucleotide polymorphism minor allele counts so that each allele count variable had standard deviation 1. As discussed at length above, correlation between the predictors is a key issue for the methods proposed in this paper. In this analysis, we de-correlated the predictors by simply removing highly correlated single nucleotide polymorphisms; we refer to this as thinning the predictors. To explain in more detail, Jiang (2004) proved that if $\mathrm{cov}(x_i) = I$, then the maximum absolute sample correlation between the predictors is approximately $2\{\log(n)/n\}^{1/2}$; thus, for each pair of single nucleotide polymorphism allele count variables with absolute sample correlation greater than $2\{\log(n)/n\}^{1/2}$, we removed one of them, chosen at random, from the dataset. After this thinning process, the maximum absolute correlation between single nucleotide polymorphisms was no more than that predicted for independent single nucleotide polymorphisms and, by this measure, the thinned single nucleotide polymorphisms appeared to be roughly independent. We subsequently applied the methods proposed in this paper to the standardized and thinned dataset. The thinning process described above is practical, but fairly crude. It may be desirable to seek other methods for handling correlation among the single nucleotide polymorphisms, e.g., estimating the correlation between single nucleotide polymorphisms.
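A sketch of the thinning step (our code; the greedy pass and column ordering are illustrative choices the paper does not specify beyond random removal within each offending pair):

```python
import numpy as np

def thin_predictors(X, rng):
    """Drop one column (at random) from each pair whose absolute sample
    correlation exceeds the Jiang (2004) threshold 2{log(n)/n}^{1/2}."""
    n, d = X.shape
    thresh = 2 * np.sqrt(np.log(n) / n)
    R = np.abs(np.corrcoef(X, rowvar=False))
    keep = np.ones(d, dtype=bool)
    for i in range(d):
        if not keep[i]:
            continue
        for j in range(i + 1, d):
            if keep[j] and R[i, j] > thresh:
                drop = i if rng.random() < 0.5 else j  # remove one at random
                keep[drop] = False
                if drop == i:
                    break
    return X[:, keep], keep
```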

Let $\mathrm{Gene}_{ik}$ denote the centered expression level for the $k$th gene ($k = 1, \ldots, 100$) in the $i$th individual ($i = 1, \ldots, n = 80$) and let $\mathrm{SNP}_{ijk}$ denote the standardized minor allele count for the $j$th single nucleotide polymorphism corresponding to the $k$th gene in individual $i$; so $j = 1, \ldots, d_k$, where $d_k$ is the number of thinned single nucleotide polymorphisms within 1 Mbp of gene $k$. For the genes included in this analysis, $d_k$ ranged from 17 to 190. The mean value of $d_k$ was 82.6; recall that $n = 80$, so that on average $d_k > n$. We used the estimator $\tilde r^2$ to estimate $r^2$ for the model
$$\mathrm{Gene}_{ik} = \sum_{j=1}^{d_k} \mathrm{SNP}_{ijk}\beta_{jk} + \epsilon_{ik}, \quad k = 1, \ldots, 100.$$
For each gene, we also computed p-values for the null hypothesis $H_0 : r^2 = 0$, using the Wald test suggested by Proposition 2 (ii). Genes significant at false discovery rate 0.05 using the Benjamini and Hochberg (1995) step-up procedure are reported in Table 4; more complete results may be found in Table S1 of the Supplementary Material.

Table 4. Significant genes at false discovery rate 0.05 for testing $H_0 : r^2 = 0$. $d_k$ is the number of single nucleotide polymorphisms included in the analysis for the corresponding gene; p-values were computed using the Wald test suggested by Proposition 2 (ii); q-values refer to false discovery rate for the Benjamini-Hochberg step-up procedure (Benjamini and Hochberg, 1995; Storey, 2002). Recall that $n = 80$ individuals were included in this analysis.

   Gene        d_k   r̃²     p-value (×10²)   q-value (×10²)
   HLA-DRB5    102   0.95   0.01             0.59
   MRPL43       56   0.79   0.01             0.59
   WBSCR27      43   0.71   0.02             0.71
   PEX6        112   0.78   0.06             1.35
   LOC197322    95   0.74   0.07             1.35
   FLJ10781    110   0.73   0.08             1.35
   TSGA10       38   0.59   0.10             1.42
   NQO2        127   0.73   0.12             1.49
   CHPT1        74   0.61   0.20             2.20
   CCNDBP1      24   0.45   0.25             2.53
   FN3KRP       48   0.53   0.29             2.61
   SLC2A8       71   0.54   0.36             3.02
   IRF5         62   0.50   0.54             4.01
   UGT2B17      32   0.42   0.56             4.01
   UTS2        100   0.55   0.63             4.22

Overall, 26 out of 100 genes were significant at false discovery rate 0.1 and 15 were significant at false discovery rate 0.05. Among genes significant at false discovery rate 0.05, estimates of $r^2$ ranged from 0.42 to 0.95, while p-values ranged from 0.0001 to 0.0056. Confidence intervals for $r^2$ corresponding to each of the reported estimates can be easily constructed using Proposition 2 (ii).

We note that 17 genes in this analysis had p-value smaller than 0.01, which far exceeds the 1 that would be expected if $r^2 = 0$ for each of the genes under consideration. Thus, since our initial list of 100 genes consisted of genes found to be significant in the Japanese HapMap population (Stranger et al., 2007b), this analysis partially validates Stranger et al.'s findings in a different HapMap population, Han Chinese. However, not all of Stranger et al.'s findings were validated in our analysis, i.e., not all of the genes under consideration were found to be significant. While there are many potential explanations for this discrepancy, e.g., random variation in the data or differences between individuals in the Han Chinese and Japanese populations, it raises questions about the power of the proposed Wald tests and the efficiency of the estimator $\tilde r^2$ that are of interest for future research; questions about the thinning process for handling correlated predictors and its effect on the power of the proposed tests may also be of interest. On the other hand, we emphasize that our analysis goes beyond just identifying significant genes and provides estimates of the fraction of variability in gene expression levels that is explained by single nucleotide polymorphisms, even in settings where there are more single nucleotide polymorphisms than observations, i.e., $d_k > n$.

Acknowledgements

The author is grateful to the Editor, the Associate Editor, and three reviewers for many insightful comments and suggestions that helped to substantially improve the paper. Thanks to Xihong Lin, Liming Liang, and Sihai Zhao for very helpful suggestions regarding the data analysis. This work was supported by the U.S. National Science Foundation.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE T. Automat. Contr. 19 716–723.

Bai, Z. D., Miao, B. Q. and Pan, G. M. (2007). On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. 35 1532–1572.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57 289–300.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Stat. 36 199–227.

Cai, T. T., Ren, Z. and Zhou, H. H. (2013). Optimal rates of convergence for estimating Toeplitz covariance matrices. Probab. Theory Rel. 156 101–143.

Cai, T. T., Zhang, C.-H. and Zhou, H. H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Stat. 38 2118–2144.

Chatterjee, S. (2009). Fluctuations of eigenvalues and second order Poincaré inequalities. Probab. Theory Rel. 143 1–40.

Dicker, L. H. (2013). Optimal equivariant prediction for high-dimensional linear models with arbitrary predictor covariance. Electron. J. Stat. 7 1806–1834.

El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Stat. 36 2717–2756.

Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. Roy. Stat. Soc. B 74 37–65.

Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Stat. 22 1947–1975.

Graczyk, P., Letac, G. and Massam, H. (2005). The hyperoctahedral group, symmetric group representations and the moments of the real Wishart distribution. J. Theor. Prob. 18 1–42.

Jiang, T. (2004). The asymptotic distributions of the largest entries of sample correlation matrices. Ann. Appl. Probab. 14 865–880.

Lafferty, J. and Wasserman, L. (2008). Statistical analysis of semi-supervised regression. Adv. Neur. In. 20 801–808.

Letac, G. and Massam, H. (2004). All invariant moments of the Wishart distribution. Scand. J. Stat. 31 295–318.

Pan, G. M. and Zhou, W. (2008). Central limit theorem for signal-to-interference ratio of reduced rank linear receiver. Ann. Appl. Probab. 18 1232–1270.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat. 6 461–464.

Storey, J. D. (2002). A direct approach to false discovery rates. J. Roy. Stat. Soc. B 64 479–498.

Stranger, B. E., Forrest, M. S., Dunning, M., Ingle, C. E., Beazley, C., Thorne, N., Redon, R., Bird, C. P., De Grassi, A., Lee, C., Tyler-Smith, C., Carter, N., Scherer, S. W., Tavaré, S., Deloukas, P., Hurles, M. E. and Dermitzakis, E. T. (2007a). Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315 848–853.

Stranger, B. E., Montgomery, S. B., Dimas, A. S., Parts, L., Stegle, O., Ingle, C. E., Sekowska, M., Smith, G. D., Evans, D., Gutierrez-Arcelus, M., Price, A., Raj, T., Nisbett, J., Nica, A. C., Beazley, C., Durbin, R., Deloukas, P. and Dermitzakis, E. T. (2012). Patterns of cis regulatory variation in diverse human populations. PLoS Genet. 8 e1002639.

Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavaré, S., Deloukas, P. and Dermitzakis, E. T. (2007b). Population genomics of human gene expression. Nat. Genet. 39 1217–1224.

Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika 99 879–898.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Goddard, M. E. and Visscher, P. M. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42 565–569.

Zou, H., Hastie, T. J. and Tibshirani, R. J. (2007). On the "degrees of freedom" of the lasso. Ann. Stat. 35 2173–2192.


    Supplementary Material

    S1. Additional results for Section 6 of the main text (data analysis)

    Results for all 100 genes included in the data analysis may be found in Table S1.

    S2. Results from the main text

    Proof of Lemma 2

    Lemma 2 follows from a corollary of the next result.

Lemma S1. Suppose that $\Sigma = I$. Then
$$\mathrm{var}\left(\frac{1}{n}\|y\|^2\right) = \frac{2}{n}(\sigma^2 + \tau^2)^2, \tag{18}$$
$$\mathrm{var}\left(\frac{1}{n^2}\|X^Ty\|^2\right) = \frac{2}{n}\left[\left\{\left(\frac{d}{n}\right)^2 + \frac{d}{n} + \frac{2d}{n^2}\right\}\sigma^4 + \left\{2\left(\frac{d}{n}\right)^2 + \frac{6d}{n} + 2 + \frac{10d}{n^2} + \frac{10}{n} + \frac{12}{n^2}\right\}\sigma^2\tau^2 + \left\{\left(\frac{d}{n}\right)^2 + \frac{5d}{n} + 4 + \frac{8d}{n^2} + \frac{15}{n} + \frac{15}{n^2}\right\}\tau^4\right], \tag{19}$$
$$\mathrm{cov}\left(\frac{1}{n}\|y\|^2, \frac{1}{n^2}\|X^Ty\|^2\right) = \frac{2}{n}\left\{\frac{d}{n}\sigma^4 + \left(\frac{2d}{n} + 2 + \frac{3}{n}\right)\sigma^2\tau^2 + \left(\frac{d}{n} + 2 + \frac{3}{n}\right)\tau^4\right\}. \tag{20}$$

Proof. Equation (18) is obvious because $\|y\|^2 \sim (\sigma^2 + \tau^2)\chi^2_n$. To prove (19), we condition on $X$ and use properties of expectations involving quadratic forms and normal random vectors to obtain
$$\begin{aligned}
\mathrm{var}(\|X^Ty\|^2) &= E\left\{\mathrm{var}(\|X^Ty\|^2 \mid X)\right\} + \mathrm{var}\left\{E(\|X^Ty\|^2 \mid X)\right\} \\
&= 2\sigma^4 E\left[\mathrm{tr}\{(X^TX)^2\}\right] + 4\sigma^2 E\left\{\beta^T(X^TX)^3\beta\right\} + \mathrm{var}\left\{\sigma^2\mathrm{tr}(X^TX) + \beta^T(X^TX)^2\beta\right\} \\
&= 2\sigma^4 E\left[\mathrm{tr}\{(X^TX)^2\}\right] + 4\sigma^2 E\left\{\beta^T(X^TX)^3\beta\right\} + \sigma^4 E\left[\{\mathrm{tr}(X^TX)\}^2\right] \\
&\quad + 2\sigma^2 E\left\{\mathrm{tr}(X^TX)\beta^T(X^TX)^2\beta\right\} + E\left[\{\beta^T(X^TX)^2\beta\}^2\right] - \left[E\left\{\sigma^2\mathrm{tr}(X^TX) + \beta^T(X^TX)^2\beta\right\}\right]^2
\end{aligned}$$


Table S1. Data analysis. $d_k$ is the number of SNPs included in the analysis for the corresponding gene (recall that $n = 80$ for each gene); p-values (for testing $H_0 : r^2 = 0$) were computed using the Wald test suggested by Proposition 2 (ii) of the main text; q-values refer to false discovery rate for the Benjamini-Hochberg step-up procedure (Benjamini and Hochberg, 1995; Storey, 2002).

Gene         Probe          d_k    r̃²     p-value (×10²)   q-value (×10²)
HLA-DRB5     ILMN 1697499   102    0.95    0.01             0.59
MRPL43       ILMN 1700477    56    0.79    0.01             0.59
WBSCR27      ILMN 1719170    43    0.71    0.02             0.71
PEX6         ILMN 1683279   112    0.78    0.06             1.35
LOC197322    ILMN 1716832    95    0.74    0.07             1.35
FLJ10781     ILMN 1794490   110    0.73    0.08             1.35
TSGA10       ILMN 1680430    38    0.59    0.10             1.42
NQO2         ILMN 1712918   127    0.73    0.12             1.49
CHPT1        ILMN 1729112    74    0.61    0.20             2.20
CCNDBP1      ILMN 1711459    24    0.45    0.25             2.53
FN3KRP       ILMN 1652333    48    0.53    0.29             2.61
SLC2A8       ILMN 1724609    71    0.54    0.36             3.02
IRF5         ILMN 1670576    62    0.50    0.54             4.01
UGT2B17      ILMN 1752214    32    0.42    0.56             4.01
UTS2         ILMN 1748520   100    0.55    0.63             4.22
SNX16        ILMN 1787415    61    0.44    0.94             5.73
F25965       ILMN 1656886   106    0.51    0.97             5.73
LZIC         ILMN 1661627   111    0.52    1.04             5.79
UGT2B7       ILMN 1679194    33    0.37    1.10             5.81
NUDT2        ILMN 1778347    48    0.38    1.57             7.39
TFAM         ILMN 1715661    53    0.39    1.61             7.39
LTBR         ILMN 1667476   115    0.47    1.66             7.39
OAS1         ILMN 1672606    67    0.40    1.70             7.39
TIMM10       ILMN 1765332    32    0.32    2.01             8.38
NDUFS5       ILMN 1776104   104    0.42    2.23             8.93
SLC27A5      ILMN 1725366    29    0.30    2.43             9.35
BCAS1        ILMN 1776647   151    0.42    3.84            13.84
SURF1        ILMN 1663407   120    0.38    3.87            13.84
SFXN2        ILMN 1795976    28    0.24    4.29            14.80
KIAA1913     ILMN 1778400    88    0.32    4.70            15.30
CRYZ         ILMN 1672389    49    0.26    4.96            15.30
DPYSL4       ILMN 1792356   113    0.35    5.02            15.30
RABGEF1      ILMN 1760884    33    0.23    5.14            15.30
PHACS        ILMN 1757950   122    0.35    5.20            15.30
STAT6        ILMN 1763198    46    0.25    5.44            15.55
CD151        ILMN 1661589    75    0.29    5.70            15.84
PAQR6        ILMN 1737631    55    0.25    6.45            17.43
RAD51C       ILMN 1760635    45    0.22    7.25            19.07
ACN9         ILMN 1771348    69    0.24    8.07            20.62
ABCD2        ILMN 1652959    35    0.18    8.40            20.62
DDX17        ILMN 1724114    82    0.25    8.45            20.62
ENTPD1       ILMN 1773125    65    0.21   10.03            23.87
PLTP         ILMN 1711748    93    0.23   10.57            24.57
UGT2B10      ILMN 1742444    31    0.15   10.91            24.79
IPP          ILMN 1789106    28    0.15   11.29            25.08
IVD          ILMN 1724207    60    0.18   11.76            25.56
STIM2        ILMN 1738449    98    0.22   12.35            26.18
RAMP1        ILMN 1764754   102    0.22   12.56            26.18
CDK5RAP2     ILMN 1725235    55    0.17   13.29            27.12
RPS16        ILMN 1651850    73    0.18   14.62            29.24
C1QTNF3      ILMN 1687260   103    0.19   15.13            29.66
MPHOSPH1     ILMN 1712452    79    0.15   18.01            34.64
DNASE1L3     ILMN 1762084    30    0.10   18.79            35.42
SUPT3H       ILMN 1813277    95    0.15   19.13            35.42
SNRPB2       ILMN 1771620   140    0.17   19.63            35.69
NIPSNAP3B    ILMN 1786308    73    0.13   20.99            37.48
RPL37A       ILMN 1711222   115    0.14   23.21            40.72
MRPL53       ILMN 1813682    72    0.10   24.59            42.39
TLR6         ILMN 1749287   124    0.13   25.30            42.87
SLC7A7       ILMN 1810275   166    0.14   26.56            43.83
RPS6KB2      ILMN 1761175    34    0.07   26.74            43.83
C10orf68     ILMN 1662791    56    0.08   27.32            44.07
PPA2         ILMN 1755041    41    0.07   28.05            44.29
GSTP1        ILMN 1679809    38    0.06   28.34            44.29
ZNF175       ILMN 1675788   168    0.12   29.21            44.94
UMPS         ILMN 1757437    80    0.07   32.79            49.68
ATP6V1C1     ILMN 1659801    74    0.06   34.65            51.71
C10orf88     ILMN 1775423   141    0.06   37.36            54.94
RPH3AL       ILMN 1693717   108    0.04   40.57            58.80
ACY1L2       ILMN 1766000    77    0.01   47.97            67.67
CIAO1        ILMN 1792837    21    0.00   48.59            67.67
MRPL21       ILMN 1744835    82    0.00   48.72            67.67
EIF2S1       ILMN 1739821    46   −0.02   57.03            77.29
STOM         ILMN 1696419    71   −0.02   57.19            77.29
LOC339229    ILMN 1805131    74   −0.03   59.85            78.89
NPEPL1       ILMN 1724194   155   −0.05   60.17            78.89
C14orf130    ILMN 1667839   117   −0.04   60.75            78.89
PRIC285      ILMN 1787509    85   −0.04   62.33            79.91
NDUFV3       ILMN 1765500   145   −0.07   66.44            84.11
MUM1         ILMN 1764764   185   −0.10   69.81            86.43
TNFRSF18     ILMN 1743100    56   −0.06   71.05            86.43
LOC144097    ILMN 1795564    60   −0.06   71.40            86.43
B3GALT3      ILMN 1732822    39   −0.05   72.53            86.43
G0S2         ILMN 1691846   115   −0.09   72.60            86.43
ABCF2        ILMN 1709222   144   −0.13   78.30            92.11
MTRR         ILMN 1702684    97   −0.13   82.96            96.16
SFRS10       ILMN 1742798   108   −0.14   83.66            96.16
PLXDC2       ILMN 1753312    85   −0.12   86.03            97.76
POLE4        ILMN 1660063    92   −0.13   87.67            98.20
KIAA0265     ILMN 1702635    79   −0.13   88.38            98.20
TCL1B        ILMN 1805857   190   −0.26   93.67            99.61
POLR3F       ILMN 1673966   109   −0.19   94.27            99.61
GPX4         ILMN 1734353   181   −0.27   94.51            99.61
IL16         ILMN 1813572   104   −0.19   95.37            99.61
RAD18        ILMN 1707548   137   −0.24   95.52            99.61
DHX30        ILMN 1813719    17   −0.05   96.43            99.61
SOS1         ILMN 1767135    54   −0.12   96.62            99.61
LOC400696    ILMN 1759989    81   −0.23   99.87           100.00
ALDH8A1      ILMN 1699258    77   −0.23  100.00           100.00

    STK25 ILMN 1668090 74 -0.25 100.00 100.00

  • L.H. Dicker/Variance estimation in high-dimensional linear models 23

    − σ4 [E {tr(XTX)}]2 − 2σ2E {tr(XTX)}E{βT(XTX)2β

    }−[E{βT(XTX)2β

    }]2.

    Equation (19) now follows from Proposition S1 below (found in Section S3). Equation (20) isproved similarly: we have

    cov(||y||2, ||XTy||2) = E{

    cov(||y||2, ||XTy||2 | X)}

    + cov{E(||y||2 | X), E(||XTy||2 | X)

    }= 2σ4E {tr(XTX)}+ 4σ2E

    {βT(XTX)2β

    }+ cov

    {βTXTXβ, σ2tr(XTX) + βT(XTX)2β

    }= 2σ4E {tr(XTX)}+ 4σ2E

    {βT(XTX)2β

    }+ σ2E {tr(XTX)βTXTXβ}+ E

    {βTXTXββT(XTX)2β

    }− σ2E (βTXTXβ)E {tr(XTX)} − E (βTXTXβ)E

    {βT(XTX)2β

    }and (20) follows from Proposition S1.
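The identities (18)–(20) are easy to sanity-check by simulation. The sketch below is our own illustration, not part of the derivation: it assumes Σ = I and Gaussian data, fixes arbitrary values of σ² and τ², and compares the empirical second moments of (n⁻¹‖y‖², n⁻²‖X^Ty‖²) with (18)–(20):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 50, 30, 1.5
beta = rng.standard_normal(d)
beta *= np.sqrt(2.0 / (beta @ beta))          # so tau^2 = ||beta||^2 = 2
tau2 = beta @ beta

reps = 20000
T = np.empty((reps, 2))
for r in range(reps):
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    T[r] = y @ y / n, np.sum((X.T @ y) ** 2) / n**2

emp = np.cov(T, rowvar=False)
var18 = 2 / n * (sigma2 + tau2) ** 2
var19 = 2 / n * (((d/n)**2 + d/n + 2*d/n**2) * sigma2**2
                 + (2*(d/n)**2 + 6*d/n + 2 + 10*d/n**2 + 10/n + 12/n**2) * sigma2 * tau2
                 + ((d/n)**2 + 5*d/n + 4 + 8*d/n**2 + 15/n + 15/n**2) * tau2**2)
cov20 = 2 / n * (d/n * sigma2**2 + (2*d/n + 2 + 3/n) * sigma2 * tau2
                 + (d/n + 2 + 3/n) * tau2**2)
print(emp[0, 0], var18)   # empirical vs (18)
print(emp[1, 1], var19)   # empirical vs (19)
print(emp[0, 1], cov20)   # empirical vs (20)
```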

Corollary S1. Suppose that Σ = I. Then
\[
\begin{aligned}
\operatorname{var}(\hat\sigma^2) &= \frac{2n}{(n+1)^2}\left\{\left(\frac{d}{n} + 1 + \frac{2d}{n^2} + \frac{2}{n} + \frac{1}{n^2}\right)\sigma^4 + \left(\frac{2d}{n} + \frac{4d}{n^2} + \frac{4}{n} + \frac{8}{n^2}\right)\sigma^2\tau^2 \right. \\
&\qquad\qquad \left. {} + \left(\frac{d}{n} + 1 + \frac{2d}{n^2} + \frac{7}{n} + \frac{10}{n^2}\right)\tau^4\right\}, \\
\operatorname{var}(\hat\tau^2) &= \frac{2n}{(n+1)^2}\left\{\left(\frac{d}{n} + \frac{2d}{n^2}\right)\sigma^4 + \left(\frac{2d}{n} + 2 + \frac{4d}{n^2} + \frac{10}{n} + \frac{12}{n^2}\right)\sigma^2\tau^2 \right. \\
&\qquad\qquad \left. {} + \left(\frac{d}{n} + 4 + \frac{2d}{n^2} + \frac{15}{n} + \frac{15}{n^2}\right)\tau^4\right\}, \\
\operatorname{cov}(\hat\sigma^2, \hat\tau^2) &= -\frac{2n}{(n+1)^2}\left\{\left(\frac{d}{n} + \frac{2d}{n^2}\right)\sigma^4 + \left(\frac{2d}{n} + \frac{4d}{n^2} + \frac{5}{n} + \frac{9}{n^2}\right)\sigma^2\tau^2 \right. \\
&\qquad\qquad \left. {} + \left(\frac{d}{n} + 2 + \frac{2d}{n^2} + \frac{10}{n} + \frac{12}{n^2}\right)\tau^4\right\}.
\end{aligned}
\]

Corollary S1 follows from Lemma S1 and the fact that
\[
\begin{pmatrix} \hat\sigma^2 \\ \hat\tau^2 \end{pmatrix} = \begin{pmatrix} \dfrac{d+n+1}{n+1} & -\dfrac{n}{n+1} \\[2ex] -\dfrac{d}{n+1} & \dfrac{n}{n+1} \end{pmatrix} \begin{pmatrix} n^{-1}\|y\|^2 \\ n^{-2}\|X^Ty\|^2 \end{pmatrix}.
\]
Lemma 2 follows immediately from Corollary S1.
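In code, the 2 × 2 matrix identity above yields σ̂² and τ̂² directly from the statistics n⁻¹‖y‖² and n⁻²‖X^Ty‖². A minimal sketch, assuming Σ = I and simulated Gaussian data (all parameter choices below are ours, for illustration only):

```python
import numpy as np

def estimators_identity_cov(X, y):
    """sigma^2 and tau^2 estimators for Sigma = I, via the 2x2 map above."""
    n, d = X.shape
    s1 = y @ y / n                       # n^{-1} ||y||^2
    s2 = np.sum((X.T @ y) ** 2) / n**2   # n^{-2} ||X^T y||^2
    M = np.array([[(d + n + 1) / (n + 1), -n / (n + 1)],
                  [-d / (n + 1),           n / (n + 1)]])
    return M @ np.array([s1, s2])        # (sigma2_hat, tau2_hat)

rng = np.random.default_rng(1)
n, d = 200, 300                          # d > n poses no problem here
beta = rng.standard_normal(d) / np.sqrt(d)   # tau^2 = ||beta||^2 near 1
X = rng.standard_normal((n, d))
y = X @ beta + rng.standard_normal(n)        # sigma^2 = 1
print(estimators_identity_cov(X, y))
```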


    Proof of Theorem 1

Theorem 1 is a direct application of Theorem 2.2 of Chatterjee (2009), which is stated here for ease of reference.

Theorem S1 (Theorem 2.2 of Chatterjee (2009)). Let v = (v₁, ..., v_m) ∼ N(0, Ψ). Suppose that g ∈ C²(Rᵐ) and let ∇g and ∇²g denote the gradient and the Hessian of g, respectively. Let
\[
\kappa_1 = [E\{\|\nabla g(v)\|^4\}]^{1/4}, \qquad \kappa_2 = [E\{\|\nabla^2 g(v)\|^4\}]^{1/4},
\]
where ‖∇²g(v)‖ is the operator norm of ∇²g(v). Suppose that E{g(v)⁴} < ∞ and let ψ² = var{g(v)}. Let w be a normal random variable having the same mean and variance as g(v). Then
\[
d_{TV}\{g(v), w\} \le \frac{2\sqrt{5}\,\|\Psi\|^{3/2}\kappa_1\kappa_2}{\psi^2}. \tag{21}
\]

Remark 1. Chatterjee's Theorem 2.2 does not actually require Gaussian v. However, for non-Gaussian v, an additional term appears in the bound (21), which is not sufficiently small for our purposes. Furthermore, the class of distributions covered by the full version of Chatterjee's Theorem 2.2 is not all-encompassing: vᵢ must be a C²-function of a normal random variable.

To prove Theorem 1, we apply Theorem S1 with v = (X, ε) ∈ R⁽ᵈ⁺¹⁾ⁿ. Let h ∈ C²(R²) and let
\[
g(X, \epsilon) = h(S),
\]
where S = S(X, ε) = (n⁻¹‖y‖², n⁻²‖X^Ty‖²). First, we bound the quantities κ₁, κ₂ in Theorem S1. In order to bound κ₁, we compute the gradient of g. Let h₁, h₂ denote the partial derivatives of h with respect to the first and second variables, respectively. Then
\[
\frac{\partial g}{\partial x_{ij}}(X, \epsilon) = h_1(S)\frac{\partial}{\partial x_{ij}}\frac{1}{n}\|y\|^2 + h_2(S)\frac{\partial}{\partial x_{ij}}\frac{1}{n^2}\|X^Ty\|^2
\]
for i = 1, ..., n, j = 1, ..., d. Let E_{ij} denote the n × d matrix with i′j′-entry δ_{ii′}δ_{jj′} (δ_{ii′} = 1 if i = i′ and 0 otherwise). Since
\[
\frac{\partial}{\partial x_{ij}}\|y\|^2 = 2\beta^TE_{ij}^Ty \quad \text{and} \quad \frac{\partial}{\partial x_{ij}}\|X^Ty\|^2 = 2y^TE_{ij}X^Ty + 2\beta^TE_{ij}^TXX^Ty,
\]
it follows that
\[
\frac{\partial g}{\partial x_{ij}}(X, \epsilon) = 2h_1(S)\left(\frac{1}{n}\beta^TE_{ij}^Ty\right) + 2h_2(S)\left(\frac{1}{n^2}y^TE_{ij}X^Ty + \frac{1}{n^2}\beta^TE_{ij}^TXX^Ty\right). \tag{22}
\]
For 1 ≤ k ≤ n, the partial derivative of g with respect to ε_k is
\[
\frac{\partial g}{\partial \epsilon_k}(X, \epsilon) = h_1(S)\frac{\partial}{\partial \epsilon_k}\frac{1}{n}\|y\|^2 + h_2(S)\frac{\partial}{\partial \epsilon_k}\frac{1}{n^2}\|X^Ty\|^2 = 2h_1(S)\left(\frac{1}{n}e_k^Ty\right) + 2h_2(S)\left(\frac{1}{n^2}e_k^TXX^Ty\right), \tag{23}
\]
where e_k = (0, ..., 0, 1, 0, ..., 0)^T ∈ Rⁿ is the k-th standard basis vector in Rⁿ and we have used the facts
\[
\frac{\partial}{\partial \epsilon_k}\|y\|^2 = 2e_k^Ty, \qquad \frac{\partial}{\partial \epsilon_k}\|X^Ty\|^2 = 2e_k^TXX^Ty.
\]
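The gradient formulas (22)–(23) can be verified numerically by finite differences. The following sketch is our own check, using the arbitrary test function h(s₁, s₂) = s₁ + 2s₂ (so h₁ ≡ 1 and h₂ ≡ 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
eps = rng.standard_normal(n)
h1, h2 = 1.0, 2.0                     # partials of h(s1, s2) = s1 + 2*s2

def g(X, eps):
    y = X @ beta + eps
    return (y @ y) / n * h1 + np.sum((X.T @ y) ** 2) / n**2 * h2

def grad(X, eps):
    """Gradient of g via (22)-(23)."""
    y = X @ beta + eps
    Xty = X.T @ y
    # (22): 2*h1*beta_j*y_i/n + 2*h2*(y_i*(X'y)_j + beta_j*(XX'y)_i)/n^2
    gX = 2 * h1 * np.outer(y, beta) / n \
        + 2 * h2 * (np.outer(y, Xty) + np.outer(X @ Xty, beta)) / n**2
    # (23): 2*h1*y_k/n + 2*h2*(XX'y)_k/n^2
    geps = 2 * h1 * y / n + 2 * h2 * (X @ Xty) / n**2
    return gX, geps

gX, geps = grad(X, eps)
t = 1e-6
for _ in range(5):                    # central differences, x-coordinates
    i, j = rng.integers(n), rng.integers(d)
    Xp, Xm = X.copy(), X.copy()
    Xp[i, j] += t; Xm[i, j] -= t
    assert abs((g(Xp, eps) - g(Xm, eps)) / (2 * t) - gX[i, j]) < 1e-5
k = rng.integers(n)                   # and one epsilon-coordinate
ep, em = eps.copy(), eps.copy()
ep[k] += t; em[k] -= t
assert abs((g(X, ep) - g(X, em)) / (2 * t) - geps[k]) < 1e-5
```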

Now recall that κ₁ = [E{‖∇g(X, ε)‖⁴}]^{1/4}. Equations (22)–(23) and the elementary inequality
\[
(a+b)^2 \le 2a^2 + 2b^2, \qquad a, b \in \mathbb{R}, \tag{24}
\]
imply that
\[
\begin{aligned}
\|\nabla g(X, \epsilon)\|^2 &= \sum_{i=1}^n\sum_{j=1}^d\left\{\frac{\partial}{\partial x_{ij}}g(X, \epsilon)\right\}^2 + \sum_{k=1}^n\left\{\frac{\partial}{\partial \epsilon_k}g(X, \epsilon)\right\}^2 \\
&\le 8h_1(S)^2\sum_{i=1}^n\sum_{j=1}^d\left(\frac{1}{n}\beta^TE_{ij}^Ty\right)^2 + 16h_2(S)^2\sum_{i=1}^n\sum_{j=1}^d\left\{\left(\frac{1}{n^2}y^TE_{ij}X^Ty\right)^2 + \left(\frac{1}{n^2}\beta^TE_{ij}^TXX^Ty\right)^2\right\} \\
&\quad + 8h_1(S)^2\sum_{k=1}^n\left(\frac{1}{n}e_k^Ty\right)^2 + 8h_2(S)^2\sum_{k=1}^n\left(\frac{1}{n^2}e_k^TXX^Ty\right)^2 \\
&= \frac{8}{n^2}(\tau^2+1)h_1(S)^2\|y\|^2 + \frac{16}{n^4}h_2(S)^2\left\{\|y\|^2\|X^Ty\|^2 + \left(\tau^2 + \frac{1}{2}\right)\|XX^Ty\|^2\right\}.
\end{aligned}
\]


Let λ₁ = ‖n⁻¹X^TX‖ be the largest eigenvalue of n⁻¹X^TX. Applying the triangle inequality and (24) yields
\[
\begin{aligned}
\|\nabla g(X, \epsilon)\|^2 &\le \frac{16}{n^2}(\tau^2+1)h_1(S)^2\left(\|X^TX\|\tau^2 + \|\epsilon\|^2\right) + \frac{128}{n^4}h_2(S)^2\|X^TX\|\left(\|X^TX\|^2\tau^4 + \|\epsilon\|^4\right) \\
&\quad + \frac{32}{n^4}h_2(S)^2\|X^TX\|^2\left(\tau^2 + \frac{1}{2}\right)\left(\|X^TX\|\tau^2 + \|\epsilon\|^2\right) \\
&\le \frac{16}{n}\|\nabla h(S)\|^2\left\{8\lambda_1\left(\frac{1}{n}\|\epsilon\|^2\right)^2 + (2\lambda_1^2+1)\frac{1}{n}\|\epsilon\|^2\tau^2 + (10\lambda_1^3+\lambda_1)\tau^4 + (\lambda_1^2+1)\frac{1}{n}\|\epsilon\|^2 + (\lambda_1^3+\lambda_1)\tau^2\right\} \\
&\le \frac{264}{n}\|\nabla h(S)\|^2(\lambda_1+1)^3\left\{\frac{1}{n}\|\epsilon\|^2\left(\frac{1}{n}\|\epsilon\|^2+1\right) + \tau^2(\tau^2+1)\right\}.
\end{aligned}
\]

Thus,
\[
\begin{aligned}
\kappa_1 &= [E\{\|\nabla g(X, \epsilon)\|^4\}]^{1/4} \\
&\le \sqrt{\frac{264}{n}}\left(E\left[\|\nabla h(S)\|^4(\lambda_1+1)^6\left\{\frac{1}{n}\|\epsilon\|^2\left(\frac{1}{n}\|\epsilon\|^2+1\right) + \tau^2(\tau^2+1)\right\}^2\right]\right)^{1/4} \\
&= O\left[\frac{1}{\sqrt{n}}\left\{\gamma_4^{1/4} + \gamma_2^{1/4} + \gamma_0^{1/4}\tau(\tau+1)\right\}\right], \tag{25}
\end{aligned}
\]
where
\[
\gamma_k = E\left[\|\nabla h(S)\|^4(\lambda_1+1)^6\left(\frac{1}{n}\|\epsilon\|^2\right)^k\right].
\]

To bound κ₂ = [E{‖∇²g(X, ε)‖⁴}]^{1/4}, we bound the operator norm of the Hessian ‖∇²g(X, ε)‖. Let
\[
\mathcal{U} = \left\{\tilde{U} = (u\ U);\ u = (u_1, \ldots, u_n)^T \in \mathbb{R}^n,\ U = (u_{ij})_{1\le i\le n,\,1\le j\le d},\ \sum_{k=1}^n u_k^2 + \sum_{i=1}^n\sum_{j=1}^d u_{ij}^2 = 1\right\}
\]
be the collection of partitioned n × (d + 1) matrices with Frobenius norm equal to one. For Ũ = (u U) ∈ 𝒰, define the differential operator
\[
D_{\tilde{U}} = \sum_{i=1}^n\sum_{j=1}^d u_{ij}\frac{\partial}{\partial x_{ij}} + \sum_{k=1}^n u_k\frac{\partial}{\partial \epsilon_k}.
\]


Then
\[
\begin{aligned}
\|\nabla^2 g(X, \epsilon)\| &= \sup_{\tilde{U}\in\mathcal{U}} D^2_{\tilde{U}}g(X, \epsilon) \\
&= \sup_{\tilde{U}\in\mathcal{U}}\left[\nabla h(S)^TD^2_{\tilde{U}}S(X, \epsilon) + \{D_{\tilde{U}}S(X, \epsilon)\}^T\nabla^2h(S)D_{\tilde{U}}S(X, \epsilon)\right] \\
&\le \sup_{\tilde{U}\in\mathcal{U}}\left\{\|\nabla h(S)\|\,\|D^2_{\tilde{U}}S(X, \epsilon)\| + \|\nabla^2h(S)\|\,\|D_{\tilde{U}}S(X, \epsilon)\|^2\right\}. \tag{26}
\end{aligned}
\]
From our previous calculations,

\[
\begin{aligned}
D_{\tilde{U}}S(X, \epsilon) &= \sum_{i=1}^n\sum_{j=1}^d u_{ij}\frac{\partial}{\partial x_{ij}}\begin{pmatrix}\frac{1}{n}\|y\|^2 \\ \frac{1}{n^2}\|X^Ty\|^2\end{pmatrix} + \sum_{k=1}^n u_k\frac{\partial}{\partial \epsilon_k}\begin{pmatrix}\frac{1}{n}\|y\|^2 \\ \frac{1}{n^2}\|X^Ty\|^2\end{pmatrix} \\
&= \sum_{i=1}^n\sum_{j=1}^d u_{ij}\begin{pmatrix}\frac{2}{n}\beta^TE_{ij}^Ty \\ \frac{2}{n^2}y^TE_{ij}X^Ty + \frac{2}{n^2}\beta^TE_{ij}^TXX^Ty\end{pmatrix} + \sum_{k=1}^n u_k\begin{pmatrix}\frac{2}{n}e_k^Ty \\ \frac{2}{n^2}e_k^TXX^Ty\end{pmatrix} \\
&= \begin{pmatrix}\frac{2}{n}\beta^TU^Ty + \frac{2}{n}u^Ty \\ \frac{2}{n^2}y^TUX^Ty + \frac{2}{n^2}\beta^TU^TXX^Ty + \frac{2}{n^2}u^TXX^Ty\end{pmatrix}.
\end{aligned}
\]

To compute D²_Ũ S(X, ε), we need the second order partial derivatives of ‖y‖² and ‖X^Ty‖²; these are
\[
\frac{\partial^2}{\partial x_{i'j'}\partial x_{ij}}\|y\|^2 = 2\beta^TE_{ij}^TE_{i'j'}\beta, \qquad \frac{\partial^2}{\partial \epsilon_k\partial x_{ij}}\|y\|^2 = 2\beta^TE_{ij}^Te_k, \qquad \frac{\partial^2}{\partial \epsilon_{k'}\partial \epsilon_k}\|y\|^2 = 2e_k^Te_{k'}
\]
and
\[
\begin{aligned}
\frac{\partial^2}{\partial x_{i'j'}\partial x_{ij}}\|X^Ty\|^2 &= 2\beta^TE_{i'j'}^TE_{ij}X^Ty + 2\beta^TE_{ij}^TE_{i'j'}X^Ty + 2y^TE_{ij}E_{i'j'}^Ty \\
&\quad + 2y^TE_{ij}X^TE_{i'j'}\beta + 2\beta^TE_{ij}^TXE_{i'j'}^Ty + 2\beta^TE_{ij}^TXX^TE_{i'j'}\beta, \\
\frac{\partial^2}{\partial \epsilon_k\partial x_{ij}}\|X^Ty\|^2 &= 2e_k^TE_{ij}X^Ty + 2y^TE_{ij}X^Te_k + 2\beta^TE_{ij}^TXX^Te_k, \\
\frac{\partial^2}{\partial \epsilon_{k'}\partial \epsilon_k}\|X^Ty\|^2 &= 2e_k^TXX^Te_{k'},
\end{aligned}
\]


for 1 ≤ i, i′, k, k′ ≤ n and 1 ≤ j, j′ ≤ d. It follows that the entries of D²_Ũ S(X, ε) are
\[
\frac{1}{n}D^2_{\tilde{U}}\|y\|^2 = \frac{2}{n}\beta^TU^TU\beta + \frac{4}{n}\beta^TU^Tu + \frac{2}{n}\|u\|^2
\]
and
\[
\begin{aligned}
\frac{1}{n^2}D^2_{\tilde{U}}\|X^Ty\|^2 &= \frac{2}{n^2}y^TUU^Ty + \frac{4}{n^2}\beta^TU^TUX^Ty + \frac{4}{n^2}\beta^TU^TXU^Ty + \frac{2}{n^2}\beta^TU^TXX^TU\beta \\
&\quad + \frac{4}{n^2}u^TUX^Ty + \frac{4}{n^2}y^TUX^Tu + \frac{4}{n^2}\beta^TU^TXX^Tu + \frac{2}{n^2}u^TXX^Tu.
\end{aligned}
\]

We conclude that
\[
\begin{aligned}
\|D_{\tilde{U}}S(X, \epsilon)\|^2 &= \frac{4}{n^2}(\beta^TU^Ty + u^Ty)^2 + \frac{4}{n^4}(y^TUX^Ty + \beta^TU^TXX^Ty + u^TXX^Ty)^2 \\
&\le \frac{8}{n^2}(\tau^2+1)\|y\|^2 + \frac{12}{n^4}\|X^TX\|\left(\|y\|^2 + \|X^TX\|\tau^2 + \|X^TX\|\right)\|y\|^2 \\
&\le \frac{16}{n}(\tau^2+1)\left(\lambda_1\tau^2 + \frac{1}{n}\|\epsilon\|^2\right) + \frac{168}{n}\lambda_1\left\{\lambda_1^2\tau^2(\tau^2+1) + \frac{1}{n}\|\epsilon\|^2\left(\lambda_1 + \frac{1}{n}\|\epsilon\|^2\right)\right\} \\
&= O\left[\frac{1}{n}\left\{(\lambda_1^3+\lambda_1)\tau^2(\tau^2+1) + \frac{1}{n}\|\epsilon\|^2\left(\lambda_1 + \frac{1}{n}\|\epsilon\|^2 + 1\right)\right\}\right] \tag{27}
\end{aligned}
\]
and
\[
\|D^2_{\tilde{U}}S(X, \epsilon)\| \le \frac{2}{n}(\tau+1)^2 + \frac{2}{n^2}\left\{\|y\|^2 + 4\|X\|(\tau+1)\|y\| + \|X^TX\|(\tau+1)^2\right\} = O\left[\frac{1}{n}\left\{(\lambda_1+1)(\tau+1)^2 + \frac{1}{n}\|\epsilon\|^2\right\}\right]. \tag{28}
\]

Combining (26)–(28), we obtain
\[
\kappa_2 = [E\{\|\nabla^2g(X, \epsilon)\|^4\}]^{1/4} = O\left[\frac{1}{n}\left\{\eta_8^{1/4} + \eta_4^{1/4} + \eta_0^{1/4}\tau^2(\tau^2+1) + \gamma_4^{1/4} + \gamma_0^{1/4}(\tau^2+1)\right\}\right], \tag{29}
\]
where
\[
\eta_k = E\left\{\|\nabla^2h(S)\|^4(\lambda_1+1)^{12}\left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\}.
\]


Appealing to Theorem S1, the bounds (25) and (29) imply
\[
d_{TV}\{g(X, \epsilon), w\} = O\left(\frac{\xi\nu}{n^{3/2}\psi^2}\right),
\]
where
\[
\xi = \xi(\sigma^2, \tau^2, \Sigma, d, n) = \gamma_4^{1/4} + \gamma_2^{1/4} + \gamma_0^{1/4}\tau(\tau+1)
\]
and
\[
\nu = \nu(\sigma^2, \tau^2, \Sigma, d, n) = \eta_8^{1/4} + \eta_4^{1/4} + \eta_0^{1/4}\tau^2(\tau^2+1) + \gamma_4^{1/4} + \gamma_0^{1/4}(\tau^2+1).
\]
This completes the proof of Theorem 1.

    Proof outline for Proposition 2

Assume that the conditions of Proposition 2 are satisfied and let
\[
\begin{aligned}
\tilde\sigma^2(\hat{m}) = \tilde\sigma^2 &= \left\{1 + \frac{d\hat{m}_1^2}{(n+1)\hat{m}_2}\right\}\frac{1}{n}\|y\|^2 - \frac{\hat{m}_1}{n(n+1)\hat{m}_2}\|X^Ty\|^2, \\
\tilde\tau^2(\hat{m}) = \tilde\tau^2 &= -\frac{d\hat{m}_1^2}{n(n+1)\hat{m}_2}\|y\|^2 + \frac{\hat{m}_1}{n(n+1)\hat{m}_2}\|X^Ty\|^2,
\end{aligned}
\]
where m̂ = (m̂₁, m̂₂)^T. Let m = (m₁, m₂)^T = (d⁻¹tr(Σ), d⁻¹tr(Σ²))^T and consider the estimators σ̃²(m), τ̃²(m). Properties of the Wishart distribution and Proposition S1 imply that m̂_k = m_k + O{(nd)⁻¹ᐟ²}, k = 1, 2. It follows that
\[
\tilde\sigma^2(\hat{m}) = \tilde\sigma^2(m) + O\{(nd)^{-1/2}\}, \qquad \tilde\tau^2(\hat{m}) = \tilde\tau^2(m) + O\{(nd)^{-1/2}\}. \tag{30}
\]
Using Proposition S1 again, one can further prove that
\[
\tilde\sigma^2(m) = \sigma^2 + O(n^{-1/2} + \tilde\Delta_2), \qquad \tilde\tau^2(m) = \tau^2 + O(n^{-1/2} + \tilde\Delta_2). \tag{31}
\]
Part (i) of Proposition 2 follows from (30)–(31).

By Proposition S1,
\[
\begin{aligned}
E\{\tilde\sigma^2(m)\} &= \sigma^2 + O(\tilde\Delta_2), \qquad E\{\tilde\tau^2(m)\} = \tau^2 + O(\tilde\Delta_2), \\
n\operatorname{var}\{\tilde\sigma^2(m)\} &= \tilde\psi_1^2 + O\{(nd)^{-1/2} + \tilde\Delta_3\}, \qquad n\operatorname{var}\{\tilde\tau^2(m)\} = \tilde\psi_2^2 + O\{(nd)^{-1/2} + \tilde\Delta_3\}, \\
n\operatorname{cov}\{\tilde\sigma^2(m), \tilde\tau^2(m)\} &= \tilde\psi_{12} + O\{(nd)^{-1/2} + \tilde\Delta_3\}, \tag{32}
\end{aligned}
\]
where
\[
\tilde\psi_{12} = -2\left\{\frac{dm_1^2}{nm_2}(\sigma^2+\tau^2)^2 + \frac{2m_1m_3}{m_2^2}\tau^4 + 2\left(\frac{m_1m_3}{m_2^2} - 1\right)\sigma^2\tau^2\right\}.
\]
Part (ii) of Proposition 2 follows from Theorem 1 (as in Corollary 1), (30) and (32).
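For concreteness, a sketch of σ̃²(m̂) and τ̃²(m̂) in code. The plug-in estimators m̂₁, m̂₂ below are one consistent choice that we adopt for illustration only (m̂₂ is an unbiased correction of d⁻¹tr(Σ̂²) under Gaussian rows); the main text's exact definitions of m̂₁, m̂₂ should be consulted before relying on this:

```python
import numpy as np

def mm_estimators(X, y):
    n, d = X.shape
    S = X.T @ X / n
    m1 = np.trace(S) / d
    # Unbiased plug-in for m2 = tr(Sigma^2)/d (our choice, for illustration):
    m2 = (np.trace(S @ S) - np.trace(S) ** 2 / n) * n**2 / ((n - 1) * (n + 2) * d)
    s1, s2 = y @ y, np.sum((X.T @ y) ** 2)
    sigma2 = (1 + d * m1**2 / ((n + 1) * m2)) * s1 / n \
        - m1 * s2 / (n * (n + 1) * m2)
    tau2 = -d * m1**2 * s1 / (n * (n + 1) * m2) \
        + m1 * s2 / (n * (n + 1) * m2)
    return sigma2, tau2

rng = np.random.default_rng(4)
n, d = 150, 200
A = rng.standard_normal((d, d)) / np.sqrt(d)
Sigma = A @ A.T + np.eye(d)
X = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
beta = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta + rng.standard_normal(n)
print(mm_estimators(X, y), (1.0, beta @ Sigma @ beta))  # estimates vs truth
```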


    S3. Moment calculations for the Wishart distribution

Suppose that X = (x₁, ..., x_n)^T is an n × d matrix with iid rows x₁, ..., x_n ∼ N(0, Σ) and that Σ is a d × d positive definite matrix. Then W = X^TX is a Wishart(n, Σ) random matrix. Let β ∈ Rᵈ. In this Supplemental Text we provide formulas for various moments involving W that are used in the paper. Letac and Massam (2004) and Graczyk et al. (2005) provide techniques for computing all such moments. These techniques are utilized here.

Let S_k denote the symmetric group on k elements. Then each permutation π ∈ S_k can be written uniquely as a product of disjoint cycles π = C₁ ⋯ C_{m(π)}, where C_j = (c_{1j} ⋯ c_{k_j j}), k₁ + ⋯ + k_{m(π)} = k, and all of the c_{ij} ∈ {1, ..., k} are distinct.

Let H₁, ..., H_k be d × d symmetric matrices and define the polynomial
\[
r_\pi(\Sigma)(H_1, \ldots, H_k) = \prod_{j=1}^{m(\pi)}\operatorname{tr}\left(\prod_{i=1}^{k_j}\Sigma H_{c_{ij}}\right).
\]
Theorem 1 in Letac and Massam (2004) and Proposition 1 in Graczyk et al. (2005) give the following formula:
\[
E\{\operatorname{tr}(WH_1)\cdots\operatorname{tr}(WH_k)\} = \sum_{\pi\in S_k}2^{k-m(\pi)}n^{m(\pi)}r_\pi(\Sigma)(H_1, \ldots, H_k). \tag{33}
\]
This is our main tool for deriving the explicit formulas in the following proposition. For non-negative integers k, define τ_k² = β^TΣᵏβ and m_k = d⁻¹tr(Σᵏ).
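Formula (33) is simple to implement by enumerating S_k together with its cycle decompositions. The sketch below is our own illustration; it checks the implementation for k = 2 against the standard Wishart identity E{tr(WH₁)tr(WH₂)} = n²tr(ΣH₁)tr(ΣH₂) + 2n tr(ΣH₁ΣH₂):

```python
import itertools
import numpy as np

def cycles(perm):
    """Disjoint-cycle decomposition of perm, where perm[i] is the image of i."""
    seen, out = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        c, i = [], start
        while i not in seen:
            seen.add(i); c.append(i); i = perm[i]
        out.append(c)
    return out

def wishart_trace_moment(n, Sigma, Hs):
    """Right-hand side of (33): sum over pi in S_k of 2^(k - m(pi)) n^m(pi) r_pi."""
    k, total = len(Hs), 0.0
    for perm in itertools.permutations(range(k)):
        cs = cycles(perm)
        r = 1.0
        for c in cs:                       # one trace factor per cycle
            M = np.eye(Sigma.shape[0])
            for i in c:
                M = M @ Sigma @ Hs[i]
            r *= np.trace(M)
        total += 2.0 ** (k - len(cs)) * float(n) ** len(cs) * r
    return total

rng = np.random.default_rng(1)
d, n = 3, 7
A = rng.standard_normal((d, d)); Sigma = A @ A.T + d * np.eye(d)
H1 = rng.standard_normal((d, d)); H1 = (H1 + H1.T) / 2
H2 = rng.standard_normal((d, d)); H2 = (H2 + H2.T) / 2
exact = n**2 * np.trace(Sigma @ H1) * np.trace(Sigma @ H2) \
    + 2 * n * np.trace(Sigma @ H1 @ Sigma @ H2)
assert np.isclose(wishart_trace_moment(n, Sigma, [H1, H2]), exact)
```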

Proposition S1. We have
\[
\begin{aligned}
E\{\operatorname{tr}(W)\} &= dnm_1 \tag{34} \\
E\{\operatorname{tr}(W)^2\} &= d^2n^2m_1^2 + 2dnm_2 \tag{35} \\
E\{\operatorname{tr}(W^2)\} &= d^2nm_1^2 + dn(n+1)m_2 \tag{36} \\
E(\beta^TW\beta) &= n\tau_1^2 \tag{37} \\
E(\beta^TW^2\beta) &= dnm_1\tau_1^2 + n(n+1)\tau_2^2 \tag{38} \\
E\{\operatorname{tr}(W)\beta^TW\beta\} &= dn^2m_1\tau_1^2 + 2n\tau_2^2 \tag{39} \\
E\{\operatorname{tr}(W)\beta^TW^2\beta\} &= d^2n^2m_1^2\tau_1^2 + dn(n^2+n+2)m_1\tau_2^2 + 2dnm_2\tau_1^2 + 4n(n+1)\tau_3^2 \tag{40} \\
E(\beta^TW\beta\,\beta^TW^2\beta) &= dn(n+2)m_1\tau_1^4 + n(n+2)(n+3)\tau_1^2\tau_2^2 \tag{41} \\
E(\beta^TW^3\beta) &= d^2nm_1^2\tau_1^2 + 2dn(n+1)m_1\tau_2^2 + dn(n+1)m_2\tau_1^2 + n(n^2+3n+4)\tau_3^2 \tag{42} \\
E\{(\beta^TW^2\beta)^2\} &= d^2n(n+2)m_1^2\tau_1^4 + 2dn(n+2)(n+3)m_1\tau_1^2\tau_2^2 + 2dn(n+2)m_2\tau_1^4 \\
&\quad + 4n(n+2)(n+3)\tau_1^2\tau_3^2 + n(n+1)(n+2)(n+3)\tau_2^4. \tag{43}
\end{aligned}
\]
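The formulas in Proposition S1 can be spot-checked by Monte Carlo. A minimal sketch, our own illustration, checking (38) and (41) for a small, arbitrary choice of d, n, Σ and β (agreement is up to simulation error):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 6
A = rng.standard_normal((d, d)); Sigma = A @ A.T + d * np.eye(d)
L = np.linalg.cholesky(Sigma)
beta = rng.standard_normal(d)
m = lambda k: np.trace(np.linalg.matrix_power(Sigma, k)) / d      # m_k
t2 = lambda k: beta @ np.linalg.matrix_power(Sigma, k) @ beta     # tau_k^2

reps, s38, s41 = 50000, 0.0, 0.0
for _ in range(reps):
    X = rng.standard_normal((n, d)) @ L.T
    W = X.T @ X
    bWb, bW2b = beta @ W @ beta, beta @ W @ W @ beta
    s38 += bW2b
    s41 += bWb * bW2b
print(s38 / reps, d*n*m(1)*t2(1) + n*(n + 1)*t2(2))               # (38)
print(s41 / reps, d*n*(n + 2)*m(1)*t2(1)**2
      + n*(n + 2)*(n + 3)*t2(1)*t2(2))                            # (41)
```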

Proof. Formulas (34) and (37) are trivial (notice that β^TWβ ∼ τ₁²χ²ₙ). Formulas (35)–(36) may be found in Letac and Massam (2004).

Now let u₁, ..., u_d ∈ Rᵈ be an orthonormal basis of Rᵈ, with β = ‖β‖u₁. Define the d × d symmetric matrices H_ij = (u_iu_j^T + u_ju_i^T)/2 and H_j = H_{1j}, i, j = 1, ..., d. Then
\[
\beta^TW^2\beta = \tau_0^2\sum_{j=1}^d\operatorname{tr}(WH_j)^2. \tag{44}
\]

Since S₂ = {(1 2), (1)(2)}, the formula (33) and Lemma S2 below imply
\[
\begin{aligned}
E\{\operatorname{tr}(WH_j)^2\} &= 2^{2-m((1\,2))}n^{m((1\,2))}\operatorname{tr}(\Sigma H_j\Sigma H_j) + 2^{2-m((1)(2))}n^{m((1)(2))}\operatorname{tr}(\Sigma H_j)^2 \\
&= n\{(u_1^T\Sigma u_j)^2 + u_1^T\Sigma u_1u_j^T\Sigma u_j\} + n^2(u_1^T\Sigma u_j)^2.
\end{aligned}
\]

To prove (38), observe that
\[
\begin{aligned}
E(\beta^TW^2\beta) &= \tau_0^2\sum_{j=1}^dE\{\operatorname{tr}(WH_j)^2\} \\
&= n(n+1)\sum_{j=1}^d\tau_0^2(u_1^T\Sigma u_j)^2 + n\sum_{j=1}^d\tau_0^2u_1^T\Sigma u_1u_j^T\Sigma u_j \\
&= n(n+1)\tau_2^2 + dnm_1\tau_1^2.
\end{aligned}
\]

For (39), equation (33) implies
\[
\begin{aligned}
E\{\operatorname{tr}(W)\beta^TW\beta\} &= \tau_0^2E\{\operatorname{tr}(W)\operatorname{tr}(WH_1)\} \\
&= 2n\tau_0^2\operatorname{tr}(\Sigma^2H_1) + n^2\tau_0^2\operatorname{tr}(\Sigma)\operatorname{tr}(\Sigma H_1) \\
&= 2n\tau_2^2 + dn^2m_1\tau_1^2.
\end{aligned}
\]

To prove (40), first notice that
\[
E\{\operatorname{tr}(W)\beta^TW^2\beta\} = \tau_0^2\sum_{j=1}^dE\{\operatorname{tr}(W)\operatorname{tr}(WH_j)^2\} \tag{45}
\]


and that (33) implies
\[
E\{\operatorname{tr}(W)\operatorname{tr}(WH_j)^2\} = \sum_{\pi\in S_3}2^{3-m(\pi)}n^{m(\pi)}r_\pi(\Sigma)(I, H_j, H_j).
\]
It is clear that
\[
r_{(1\,2\,3)}(\Sigma)(I, H_j, H_j) = r_{(1\,3\,2)}(\Sigma)(I, H_j, H_j), \qquad r_{(1\,2)(3)}(\Sigma)(I, H_j, H_j) = r_{(1\,3)(2)}(\Sigma)(I, H_j, H_j).
\]

Thus, by Lemma S2,
\[
\begin{aligned}
E\{\operatorname{tr}(W)\operatorname{tr}(WH_j)^2\} &= 8nr_{(1\,2\,3)}(\Sigma)(I, H_j, H_j) + 4n^2r_{(1\,2)(3)}(\Sigma)(I, H_j, H_j) \\
&\quad + 2n^2r_{(1)(2\,3)}(\Sigma)(I, H_j, H_j) + n^3r_{(1)(2)(3)}(\Sigma)(I, H_j, H_j) \\
&= 8n\operatorname{tr}(\Sigma^2H_j\Sigma H_j) + 4n^2\operatorname{tr}(\Sigma^2H_j)\operatorname{tr}(\Sigma H_j) \\
&\quad + 2n^2\operatorname{tr}(\Sigma)\operatorname{tr}(\Sigma H_j\Sigma H_j) + n^3\operatorname{tr}(\Sigma)\operatorname{tr}(\Sigma H_j)^2 \\
&= 2n(u_1^T\Sigma^2u_1u_j^T\Sigma u_j + u_1^T\Sigma u_1u_j^T\Sigma^2u_j + 2u_1^T\Sigma^2u_ju_1^T\Sigma u_j) \\
&\quad + 4n^2u_1^T\Sigma^2u_ju_1^T\Sigma u_j + n^2\operatorname{tr}(\Sigma)\{(u_1^T\Sigma u_j)^2 + u_1^T\Sigma u_1u_j^T\Sigma u_j\} + n^3\operatorname{tr}(\Sigma)(u_1^T\Sigma u_j)^2.
\end{aligned}
\]

Combining this with (45) yields
\[
\begin{aligned}
E\{\operatorname{tr}(W)\beta^TW^2\beta\} &= 2n\tau_0^2\sum_{j=1}^du_1^T\Sigma^2u_1u_j^T\Sigma u_j + 2n\tau_0^2\sum_{j=1}^du_1^T\Sigma u_1u_j^T\Sigma^2u_j \\
&\quad + 4n(n+1)\tau_0^2\sum_{j=1}^du_1^T\Sigma^2u_ju_1^T\Sigma u_j + n^2\operatorname{tr}(\Sigma)\tau_0^2\sum_{j=1}^du_1^T\Sigma u_1u_j^T\Sigma u_j \\
&\quad + n^2(n+1)\operatorname{tr}(\Sigma)\tau_0^2\sum_{j=1}^d(u_1^T\Sigma u_j)^2 \\
&= dn(n^2+n+2)m_1\tau_2^2 + 2dnm_2\tau_1^2 + 4n(n+1)\tau_3^2 + d^2n^2m_1^2\tau_1^2.
\end{aligned}
\]

The proof of (41) is similar to the proof of (40). By (33) and Lemma S2,
\[
\begin{aligned}
E\{\operatorname{tr}(WH_1)\operatorname{tr}(WH_j)^2\} &= 8nr_{(1\,2\,3)}(\Sigma)(H_1, H_j, H_j) + 4n^2r_{(1\,2)(3)}(\Sigma)(H_1, H_j, H_j) \\
&\quad + 2n^2r_{(1)(2\,3)}(\Sigma)(H_1, H_j, H_j) + n^3r_{(1)(2)(3)}(\Sigma)(H_1, H_j, H_j) \\
&= 8n\operatorname{tr}(\Sigma H_1\Sigma H_j\Sigma H_j) + 4n^2\operatorname{tr}(\Sigma H_1\Sigma H_j)\operatorname{tr}(\Sigma H_j) \\
&\quad + 2n^2\operatorname{tr}(\Sigma H_1)\operatorname{tr}(\Sigma H_j\Sigma H_j) + n^3\operatorname{tr}(\Sigma H_1)\operatorname{tr}(\Sigma H_j)^2 \\
&= 2n\{(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + 3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2\} + 4n^2u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 \\
&\quad + n^3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 + n^2\{(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2\} \\
&= n(n+2)(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + n(n^2+5n+6)u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2.
\end{aligned}
\]

It follows that
\[
\begin{aligned}
E(\beta^TW\beta\,\beta^TW^2\beta) &= \tau_0^4\sum_{j=1}^dE\{\operatorname{tr}(WH_1)\operatorname{tr}(WH_j)^2\} \\
&= n(n+2)\sum_{j=1}^d\tau_0^4(u_1^T\Sigma u_1)^2u_j^T\Sigma u_j + n(n^2+5n+6)\sum_{j=1}^d\tau_0^4u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2 \\
&= dn(n+2)m_1\tau_1^4 + n(n^2+5n+6)\tau_1^2\tau_2^2.
\end{aligned}
\]

To prove (42), consider the decomposition
\[
\beta^TW^3\beta = \tau_0^2\sum_{i,j=1}^d\operatorname{tr}(WH_i)\operatorname{tr}(WH_j)\operatorname{tr}(WH_{ij}).
\]
Equation (33) implies that
\[
E\{\operatorname{tr}(WH_i)\operatorname{tr}(WH_j)\operatorname{tr}(WH_{ij})\} = \sum_{\pi\in S_3}2^{3-m(\pi)}n^{m(\pi)}r_\pi(\Sigma)(H_i, H_j, H_{ij}).
\]

Since
\[
\sum_{i,j=1}^dr_{(1\,2\,3)}(\Sigma)(H_i, H_j, H_{ij}) = \sum_{i,j=1}^dr_{(1\,3\,2)}(\Sigma)(H_i, H_j, H_{ij}), \qquad \sum_{i,j=1}^dr_{(1)(2\,3)}(\Sigma)(H_i, H_j, H_{ij}) = \sum_{i,j=1}^dr_{(1\,3)(2)}(\Sigma)(H_i, H_j, H_{ij}),
\]
it follows that
\[
\begin{aligned}
E(\beta^TW^3\beta) &= 8n\tau_0^2\sum_{i,j=1}^dr_{(1\,2\,3)}(\Sigma)(H_i, H_j, H_{ij}) + 4n^2\tau_0^2\sum_{i,j=1}^dr_{(1)(2\,3)}(\Sigma)(H_i, H_j, H_{ij}) \\
&\quad + 2n^2\tau_0^2\sum_{i,j=1}^dr_{(1\,2)(3)}(\Sigma)(H_i, H_j, H_{ij}) + n^3\tau_0^2\sum_{i,j=1}^dr_{(1)(2)(3)}(\Sigma)(H_i, H_j, H_{ij}). \tag{46}
\end{aligned}
\]

By Lemma S2,
\[
\begin{aligned}
\tau_0^2\sum_{i,j=1}^dr_{(1\,2\,3)}(\Sigma)(H_i, H_j, H_{ij}) &= \tau_0^2\sum_{i,j=1}^d\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_{ij}) \\
&= \frac{\tau_0^2}{8}\sum_{i,j=1}^d\{u_1^T\Sigma u_1u_i^T\Sigma u_iu_j^T\Sigma u_j + u_1^T\Sigma u_1(u_i^T\Sigma u_j)^2 \\
&\qquad + (u_1^T\Sigma u_i)^2u_j^T\Sigma u_j + (u_1^T\Sigma u_j)^2u_i^T\Sigma u_i + 4u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j\} \\
&= \frac{1}{8}(d^2m_1^2\tau_1^2 + dm_2\tau_1^2 + 2dm_1\tau_2^2 + 4\tau_3^2), \\
\tau_0^2\sum_{i,j=1}^dr_{(1)(2\,3)}(\Sigma)(H_i, H_j, H_{ij}) &= \tau_0^2\sum_{i,j=1}^d\operatorname{tr}(\Sigma H_i)\operatorname{tr}(\Sigma H_j\Sigma H_{ij}) \\
&= \frac{\tau_0^2}{2}\sum_{i,j=1}^du_1^T\Sigma u_i(u_1^T\Sigma u_ju_i^T\Sigma u_j + u_1^T\Sigma u_iu_j^T\Sigma u_j) \\
&= \frac{1}{2}(\tau_3^2 + dm_1\tau_2^2), \\
\tau_0^2\sum_{i,j=1}^dr_{(1\,2)(3)}(\Sigma)(H_i, H_j, H_{ij}) &= \tau_0^2\sum_{i,j=1}^d\operatorname{tr}(\Sigma H_i\Sigma H_j)\operatorname{tr}(\Sigma H_{ij}) \\
&= \frac{\tau_0^2}{2}\sum_{i,j=1}^d(u_1^T\Sigma u_iu_1^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma u_j)u_i^T\Sigma u_j \\
&= \frac{1}{2}(\tau_3^2 + dm_2\tau_1^2), \\
\tau_0^2\sum_{i,j=1}^dr_{(1)(2)(3)}(\Sigma)(H_i, H_j, H_{ij}) &= \tau_0^2\sum_{i,j=1}^d\operatorname{tr}(\Sigma H_i)\operatorname{tr}(\Sigma H_j)\operatorname{tr}(\Sigma H_{ij}) \\
&= \tau_0^2\sum_{i,j=1}^du_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j = \tau_3^2.
\end{aligned}
\]


Using these results with (46) we obtain
\[
\begin{aligned}
E(\beta^TW^3\beta) &= n(d^2m_1^2\tau_1^2 + dm_2\tau_1^2 + 2dm_1\tau_2^2 + 4\tau_3^2) + 2n^2(\tau_3^2 + dm_1\tau_2^2) + n^2(\tau_3^2 + dm_2\tau_1^2) + n^3\tau_3^2 \\
&= d^2nm_1^2\tau_1^2 + 2dn(n+1)m_1\tau_2^2 + dn(n+1)m_2\tau_1^2 + (n^3 + 3n^2 + 4n)\tau_3^2.
\end{aligned}
\]

Finally, we prove (43). Similar to the proof of (41)–(42), we have the decomposition
\[
(\beta^TW^2\beta)^2 = \tau_0^4\sum_{i,j=1}^d\operatorname{tr}(WH_i)^2\operatorname{tr}(WH_j)^2.
\]
By (33),
\[
E\{\operatorname{tr}(WH_i)^2\operatorname{tr}(WH_j)^2\} = \sum_{\pi\in S_4}2^{4-m(\pi)}n^{m(\pi)}r_\pi(\Sigma)(H_i, H_i, H_j, H_j).
\]
It follows that
\[
E\{(\beta^TW^2\beta)^2\} = \sum_{\pi\in S_4}2^{4-m(\pi)}n^{m(\pi)}\tilde{r}_\pi,
\]
where
\[
\tilde{r}_\pi = \sum_{i,j=1}^d\tau_0^4r_\pi(\Sigma)(H_i, H_i, H_j, H_j).
\]
One can easily see that
\[
\begin{aligned}
\tilde{r}_{(1\,2\,3\,4)} &= \tilde{r}_{(1\,2\,4\,3)} = \tilde{r}_{(1\,3\,4\,2)} = \tilde{r}_{(1\,4\,3\,2)}, \\
\tilde{r}_{(1\,3\,2\,4)} &= \tilde{r}_{(1\,4\,2\,3)}, \\
\tilde{r}_{(1)(2\,3\,4)} &= \tilde{r}_{(1)(2\,4\,3)} = \tilde{r}_{(1\,3\,4)(2)} = \tilde{r}_{(1\,4\,3)(2)} = \tilde{r}_{(1\,2\,3)(4)} = \tilde{r}_{(1\,3\,2)(4)} = \tilde{r}_{(1\,2\,4)(3)} = \tilde{r}_{(1\,4\,2)(3)}, \\
\tilde{r}_{(1\,3)(2\,4)} &= \tilde{r}_{(1\,4)(2\,3)}, \\
\tilde{r}_{(1\,2)(3)(4)} &= \tilde{r}_{(1)(2)(3\,4)}, \\
\tilde{r}_{(1\,3)(2)(4)} &= \tilde{r}_{(1\,4)(2)(3)} = \tilde{r}_{(1)(3)(2\,4)} = \tilde{r}_{(1)(4)(2\,3)}.
\end{aligned}
\]
Thus,
\[
\begin{aligned}
E\{(\beta^TW^2\beta)^2\} &= 32n\tilde{r}_{(1\,2\,3\,4)} + 16n\tilde{r}_{(1\,3\,2\,4)} + 32n^2\tilde{r}_{(1)(2\,3\,4)} + 8n^2\tilde{r}_{(1\,3)(2\,4)} \\
&\quad + 4n^2\tilde{r}_{(1\,2)(3\,4)} + 4n^3\tilde{r}_{(1\,2)(3)(4)} + 8n^3\tilde{r}_{(1\,3)(2)(4)} + n^4\tilde{r}_{(1)(2)(3)(4)}. \tag{47}
\end{aligned}
\]


It only remains to evaluate the r̃_π. It follows from Lemma S2 that
\[
\begin{aligned}
\tilde{r}_{(1\,2\,3\,4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_i\Sigma H_j\Sigma H_j) \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{16}\{2(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 + 3u_1^T\Sigma u_1(u_1^T\Sigma u_i)^2u_j^T\Sigma u_j \\
&\qquad + 6u_1^T\Sigma u_1u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j + 3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2u_i^T\Sigma u_i \\
&\qquad + (u_1^T\Sigma u_1)^2u_i^T\Sigma u_iu_j^T\Sigma u_j + (u_1^T\Sigma u_1)^2(u_i^T\Sigma u_j)^2\} \\
&= \frac{1}{16}(2\tau_2^4 + 6dm_1\tau_1^2\tau_2^2 + 6\tau_1^2\tau_3^2 + d^2m_1^2\tau_1^4 + dm_2\tau_1^4), \\
\tilde{r}_{(1\,3\,2\,4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_i\Sigma H_j) \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{8}\{(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 + 6u_1^T\Sigma u_1u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j + (u_1^T\Sigma u_1)^2(u_i^T\Sigma u_j)^2\} \\
&= \frac{1}{8}(\tau_2^4 + 6\tau_1^2\tau_3^2 + dm_2\tau_1^4), \\
\tilde{r}_{(1)(2\,3\,4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i)\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_j) \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{4}\{(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 + u_1^T\Sigma u_1(u_1^T\Sigma u_i)^2u_j^T\Sigma u_j + 2u_1^T\Sigma u_1u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j\} \\
&= \frac{1}{4}(\tau_2^4 + dm_1\tau_1^2\tau_2^2 + 2\tau_1^2\tau_3^2), \\
\tilde{r}_{(1\,3)(2\,4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_j)^2 \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{4}\{u_1^T\Sigma u_iu_1^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma u_j\}^2 \\
&= \frac{1}{4}(\tau_2^4 + dm_2\tau_1^4 + 2\tau_1^2\tau_3^2), \\
\tilde{r}_{(1\,2)(3\,4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_i)\operatorname{tr}(\Sigma H_j\Sigma H_j) \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{4}\{(u_1^T\Sigma u_i)^2 + u_1^T\Sigma u_1u_i^T\Sigma u_i\}\{(u_1^T\Sigma u_j)^2 + u_1^T\Sigma u_1u_j^T\Sigma u_j\} \\
&= \frac{1}{4}(\tau_2^4 + 2dm_1\tau_1^2\tau_2^2 + d^2m_1^2\tau_1^4), \\
\tilde{r}_{(1\,2)(3)(4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_i)\operatorname{tr}(\Sigma H_j)^2 \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{2}\{(u_1^T\Sigma u_i)^2 + u_1^T\Sigma u_1u_i^T\Sigma u_i\}(u_1^T\Sigma u_j)^2 \\
&= \frac{1}{2}(\tau_2^4 + dm_1\tau_1^2\tau_2^2), \\
\tilde{r}_{(1\,3)(2)(4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i\Sigma H_j)\operatorname{tr}(\Sigma H_i)\operatorname{tr}(\Sigma H_j) \\
&= \sum_{i,j=1}^d\frac{\tau_0^4}{2}\{u_1^T\Sigma u_iu_1^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma u_j\}u_1^T\Sigma u_iu_1^T\Sigma u_j \\
&= \frac{1}{2}(\tau_2^4 + \tau_1^2\tau_3^2), \\
\tilde{r}_{(1)(2)(3)(4)} &= \sum_{i,j=1}^d\tau_0^4\operatorname{tr}(\Sigma H_i)^2\operatorname{tr}(\Sigma H_j)^2 = \sum_{i,j=1}^d\tau_0^4(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 = \tau_2^4.
\end{aligned}
\]

Combining this with (47), we conclude that
\[
\begin{aligned}
E\{(\beta^TW^2\beta)^2\} &= 32n\tilde{r}_{(1\,2\,3\,4)} + 16n\tilde{r}_{(1\,3\,2\,4)} + 32n^2\tilde{r}_{(1)(2\,3\,4)} + 8n^2\tilde{r}_{(1\,3)(2\,4)} \\
&\quad + 4n^2\tilde{r}_{(1\,2)(3\,4)} + 4n^3\tilde{r}_{(1\,2)(3)(4)} + 8n^3\tilde{r}_{(1\,3)(2)(4)} + n^4\tilde{r}_{(1)(2)(3)(4)} \\
&= 2n(2\tau_2^4 + 6dm_1\tau_1^2\tau_2^2 + 6\tau_1^2\tau_3^2 + d^2m_1^2\tau_1^4 + dm_2\tau_1^4) + 2n(\tau_2^4 + 6\tau_1^2\tau_3^2 + dm_2\tau_1^4) \\
&\quad + 8n^2(\tau_2^4 + dm_1\tau_1^2\tau_2^2 + 2\tau_1^2\tau_3^2) + 2n^2(\tau_2^4 + dm_2\tau_1^4 + 2\tau_1^2\tau_3^2) + n^2(\tau_2^4 + 2dm_1\tau_1^2\tau_2^2 + d^2m_1^2\tau_1^4) \\
&\quad + 2n^3(\tau_2^4 + dm_1\tau_1^2\tau_2^2) + 4n^3(\tau_2^4 + \tau_1^2\tau_3^2) + n^4\tau_2^4 \\
&= (n^4 + 6n^3 + 11n^2 + 6n)\tau_2^4 + d(2n^3 + 10n^2 + 12n)m_1\tau_1^2\tau_2^2 \\
&\quad + (4n^3 + 20n^2 + 24n)\tau_1^2\tau_3^2 + d^2(n^2 + 2n)m_1^2\tau_1^4 + d(2n^2 + 4n)m_2\tau_1^4 \\
&= d^2n(n+2)m_1^2\tau_1^4 + 2dn(n+2)(n+3)m_1\tau_1^2\tau_2^2 + 2dn(n+2)m_2\tau_1^4 \\
&\quad + 4n(n+2)(n+3)\tau_1^2\tau_3^2 + n(n+1)(n+2)(n+3)\tau_2^4.
\end{aligned}
\]
This concludes the proof of Proposition S1.

Lemma S2. Let u₁, ..., u_d ∈ Rᵈ, define H_ij = (u_iu_j^T + u_ju_i^T)/2, and let H_j = H_{1j}. For integers 1 ≤ i, j ≤ d, we have
\[
\begin{aligned}
\operatorname{tr}(\Sigma H_{ij}) &= u_i^T\Sigma u_j \tag{48} \\
\operatorname{tr}(\Sigma H_i\Sigma H_j) &= \frac{1}{2}(u_1^T\Sigma u_iu_1^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma u_j) \tag{49} \\
\operatorname{tr}(\Sigma H_i\Sigma H_{ij}) &= \frac{1}{2}(u_1^T\Sigma u_iu_i^T\Sigma u_j + u_1^T\Sigma u_ju_i^T\Sigma u_i) \tag{50} \\
\operatorname{tr}(\Sigma^2H_i\Sigma H_j) &= \frac{1}{4}(u_1^T\Sigma^2u_1u_i^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma^2u_j + u_1^T\Sigma^2u_iu_1^T\Sigma u_j + u_1^T\Sigma u_iu_1^T\Sigma^2u_j) \tag{51} \\
\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_j) &= \frac{1}{4}\{u_1^T\Sigma u_i(u_1^T\Sigma u_j)^2 + u_1^T\Sigma u_1u_1^T\Sigma u_iu_j^T\Sigma u_j + 2u_1^T\Sigma u_1u_1^T\Sigma u_ju_i^T\Sigma u_j\} \tag{52} \\
\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_{ij}) &= \frac{1}{8}\{u_1^T\Sigma u_1u_i^T\Sigma u_iu_j^T\Sigma u_j + u_1^T\Sigma u_1(u_i^T\Sigma u_j)^2 + (u_1^T\Sigma u_i)^2u_j^T\Sigma u_j \\
&\qquad + (u_1^T\Sigma u_j)^2u_i^T\Sigma u_i + 4u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j\} \tag{53} \\
\operatorname{tr}(\Sigma H_i\Sigma H_i\Sigma H_j\Sigma H_j) &= \frac{1}{16}\{2(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 + 3u_1^T\Sigma u_1(u_1^T\Sigma u_i)^2u_j^T\Sigma u_j \\
&\qquad + 6u_1^T\Sigma u_1u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j + 3u_1^T\Sigma u_1(u_1^T\Sigma u_j)^2u_i^T\Sigma u_i \\
&\qquad + (u_1^T\Sigma u_1)^2u_i^T\Sigma u_iu_j^T\Sigma u_j + (u_1^T\Sigma u_1)^2(u_i^T\Sigma u_j)^2\} \tag{54} \\
\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_i\Sigma H_j) &= \frac{1}{8}\{(u_1^T\Sigma u_i)^2(u_1^T\Sigma u_j)^2 + 6u_1^T\Sigma u_1u_1^T\Sigma u_iu_1^T\Sigma u_ju_i^T\Sigma u_j + (u_1^T\Sigma u_1)^2(u_i^T\Sigma u_j)^2\}. \tag{55}
\end{aligned}
\]
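Each identity in Lemma S2 is a finite computation and can be verified numerically. A short sketch of such a check for (49) and (53), with a random Σ and a random orthonormal basis (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
A = rng.standard_normal((d, d)); Sigma = A @ A.T + d * np.eye(d)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # columns u_1, ..., u_d
u = lambda k: U[:, k - 1]

def H(i, j):
    return (np.outer(u(i), u(j)) + np.outer(u(j), u(i))) / 2

i, j = 2, 4
s = lambda a, b, p=1: u(a) @ np.linalg.matrix_power(Sigma, p) @ u(b)
# (49): tr(Sigma H_i Sigma H_j), with H_i = H(1, i)
lhs = np.trace(Sigma @ H(1, i) @ Sigma @ H(1, j))
rhs = (s(1, i) * s(1, j) + s(1, 1) * s(i, j)) / 2
assert np.isclose(lhs, rhs)
# (53): tr(Sigma H_i Sigma H_j Sigma H_ij)
lhs = np.trace(Sigma @ H(1, i) @ Sigma @ H(1, j) @ Sigma @ H(i, j))
rhs = (s(1, 1) * s(i, i) * s(j, j) + s(1, 1) * s(i, j) ** 2
       + s(1, i) ** 2 * s(j, j) + s(1, j) ** 2 * s(i, i)
       + 4 * s(1, i) * s(1, j) * s(i, j)) / 8
assert np.isclose(lhs, rhs)
```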

Proof. The identity (48) is trivial. To prove (49), we have
\[
\begin{aligned}
\operatorname{tr}(\Sigma H_i\Sigma H_j) &= \frac{1}{4}\operatorname{tr}\{\Sigma(u_1u_i^T + u_iu_1^T)\Sigma(u_1u_j^T + u_ju_1^T)\} \\
&= \frac{1}{4}\operatorname{tr}(\Sigma u_1u_i^T\Sigma u_1u_j^T + \Sigma u_1u_i^T\Sigma u_ju_1^T + \Sigma u_iu_1^T\Sigma u_1u_j^T + \Sigma u_iu_1^T\Sigma u_ju_1^T) \\
&= \frac{1}{2}(u_1^T\Sigma u_iu_1^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma u_j).
\end{aligned}
\]

Equation (50) follows from
\[
\begin{aligned}
\operatorname{tr}(\Sigma H_i\Sigma H_{ij}) &= \frac{1}{4}\operatorname{tr}\{\Sigma(u_1u_i^T + u_iu_1^T)\Sigma(u_iu_j^T + u_ju_i^T)\} \\
&= \frac{1}{4}\operatorname{tr}(\Sigma u_1u_i^T\Sigma u_iu_j^T + \Sigma u_1u_i^T\Sigma u_ju_i^T + \Sigma u_iu_1^T\Sigma u_iu_j^T + \Sigma u_iu_1^T\Sigma u_ju_i^T) \\
&= \frac{1}{2}(u_1^T\Sigma u_iu_i^T\Sigma u_j + u_1^T\Sigma u_ju_i^T\Sigma u_i).
\end{aligned}
\]

For (51), we have
\[
\begin{aligned}
\operatorname{tr}(\Sigma^2H_i\Sigma H_j) &= \frac{1}{4}\operatorname{tr}\{\Sigma^2(u_1u_i^T + u_iu_1^T)\Sigma(u_1u_j^T + u_ju_1^T)\} \\
&= \frac{1}{4}\operatorname{tr}(\Sigma^2u_1u_i^T\Sigma u_1u_j^T + \Sigma^2u_1u_i^T\Sigma u_ju_1^T + \Sigma^2u_iu_1^T\Sigma u_1u_j^T + \Sigma^2u_iu_1^T\Sigma u_ju_1^T) \\
&= \frac{1}{4}(u_1^T\Sigma u_iu_1^T\Sigma^2u_j + u_1^T\Sigma^2u_1u_i^T\Sigma u_j + u_1^T\Sigma u_1u_i^T\Sigma^2u_j + u_1^T\Sigma^2u_iu_1^T\Sigma u_j).
\end{aligned}
\]

To prove (52)–(53), observe that
\[
\begin{aligned}
\operatorname{tr}(\Sigma H_i\Sigma H_j\Sigma H_j) &= \frac{1}{8}\operatorname{tr}\{\Sigma(u_1u_i^T + u_iu_1^T)\Sigma(u_1u_j^T + u_ju_1^T)\Sigma(u_1u_j^T + u_ju_1^T)\} \\
&= \frac{1}{8}\operatorname{tr}(\Sigma u_1u_i^T\Sigma u_1u_j^T\Sigma u_1u_j^T + \Sigma u_1u_i^T\Sigma u_1u_j^T\Sigma u_ju_1^T \\
&\qquad + \Sigma u_1u_i^T\Sigma u_ju_1^T\Sigma u_1u_j^T + \Sigma u_1u_i^T\Sigma u_ju_1^T\Sigma u_ju_1^T \\
&\qquad + \Sigma u_iu_1^T\Sigma u_1u_j^T\Sigma u_1u_j^T + \Sigma u_iu_1^T\Sigma u_1u_j^T\Sigma u_ju_1^T \\
&\qquad + \Sigma u_iu_1^T\Sigma u_ju_1^T\Sigma u_1u_j^T + \Sigma u_iu_1^T\Sigma u_ju_1^T\Sigma u_ju_1^T)
\end{aligned}
\]