

THE OPTIMAL HARD THRESHOLD FOR SINGULAR VALUES IS 4/√3

By

David L. Donoho
Matan Gavish

Stanford University

Technical Report No. 2013-04
May 2013

This research was supported in part by National Science Foundation grant DMS 0906812.

Department of Statistics
STANFORD UNIVERSITY
Stanford, California 94305-4065

http://statistics.stanford.edu

The Optimal Hard Threshold for Singular Values is 4/√3

David L. Donoho∗   Matan Gavish∗

    Abstract

We consider recovery of low-rank matrices from noisy data by hard thresholding of singular values, in which empirical singular values below a prescribed threshold λ are set to 0. We study the asymptotic MSE (AMSE) in a framework where the matrix size is large compared to the rank of the matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. The AMSE-optimal choice of hard threshold, in the case of an n-by-n matrix in noise level σ, is simply (4/√3)√n σ ≈ 2.309 √n σ when σ is known, or simply 2.858·ymed when σ is unknown, where ymed is the median empirical singular value. In plain terms, when a data singular value falls below this threshold, its associated singular vectors are too noisy to be used in the reconstruction. For nonsquare m-by-n matrices with m ≠ n the thresholding coefficients 4/√3 (σ known) and 2.858 (σ unknown) are replaced with different constants that depend only on m/n, which we provide. In our asymptotic framework, this thresholding rule adapts to unknown rank and, if needed, to unknown noise level, in an optimal manner: it is always better than hard thresholding at any other value, no matter what the matrix is that we are trying to recover, and is always better than ideal Truncated SVD (TSVD), which truncates at the true rank of the low-rank matrix we are trying to recover. Hard thresholding at the recommended value to recover an n-by-n matrix of rank r guarantees an AMSE at most 3nrσ². In comparison, the guarantee provided by TSVD is 5nrσ², the guarantee provided by optimally tuned singular value soft thresholding is 6nrσ², and the best guarantee achievable by any shrinkage of the data singular values is 2nrσ². Our recommended hard threshold value also offers, among hard thresholds, the best possible AMSE guarantees for recovering matrices with bounded nuclear norm. Empirical evidence shows that these AMSE properties of the 4/√3 thresholding rule remain valid even for relatively small n, and that the performance improvement over TSVD and other popular shrinkage rules is often substantial, turning it into the practical hard threshold of choice.

    ∗Department of Statistics, Stanford University



    Contents

1 Introduction
  1.1 Questions
  1.2 Optimal location for hard thresholding of singular values
  1.3 Answers
  1.4 Optimal singular value hard thresholding – In practice

2 Notation and Setting
  2.1 Natural problem scaling
  2.2 Asymptotic framework

3 Results
  3.1 Optimally tuned SVHT asymptotically dominates TSVD and any SVHT
  3.2 Minimaxity over matrices of bounded rank
  3.3 Comparison of worst-case AMSE
    3.3.1 TSVD: Truncation at the true rank
    3.3.2 Hard thresholding at/near the bulk edge
    3.3.3 Soft thresholding
    3.3.4 Optimal singular value shrinker
  3.4 Minimaxity over matrices of bounded nuclear norm
  3.5 When the noise level σ is unknown

4 Discussion
  4.1 The optimal threshold λ∗(β) relative to the bulk edge threshold 1 + √β
  4.2 The optimal threshold λ∗(β) relative to the USVT X̂2.02

5 A formula for AMSE of hard thresholding

6 Proofs

7 Empirical comparison of MSE with AMSE

8 Conclusion

    1 Introduction

Suppose we are interested in an m-by-n matrix X, which is thought to be either exactly or approximately of low rank, but we only observe a single noisy m-by-n matrix Y, obeying Y = X + σZ. The noise matrix Z has independent, identically distributed entries with zero mean and unit variance. We wish to recover the matrix X with some bound on the mean squared error (MSE). The default estimation technique for our task is Truncated SVD (TSVD) [1]: write

    Y = Σ_{i=1}^m yi ui vi′                                                  (1)

for the Singular Value Decomposition of the data matrix Y, where ui ∈ Rm and vi ∈ Rn, i = 1, . . . , m, are the left and right singular vectors of Y corresponding to the singular value yi. The TSVD estimator is

    X̂r = Σ_{i=1}^r yi ui vi′ ,


where r = rank(X), assumed known, and y1 ≥ . . . ≥ ym. Being the best approximation of rank r to the data in the least squares sense [2], and therefore the Maximum Likelihood Estimator when Z has Gaussian entries, the TSVD is arguably as ubiquitous in science and engineering as linear regression [3, 4, 5, 6, 7, 8].

When the true rank r of the signal X is unknown, one might try to form an estimate r̂ and then apply the TSVD X̂r̂. Methods to estimate r have been studied as early as [4, 9] (in Factor Analysis and Principal Component Analysis) and as recently as [10, 11, 12] (in our setting of Singular Value Decomposition). It is instructive to think about rank estimation (using any method), followed by TSVD, simply as hard thresholding of the data singular values, where only components yi ui vi′ for which yi passes a specified threshold are included in X̂. Let ηH(y; τ) = y·1{y≥τ} denote the hard thresholding nonlinearity, and consider Singular Value Hard Thresholding (SVHT)

    X̂τ = Σ_{i=1}^m ηH(yi; τ) ui vi′ .                                       (2)

In words, X̂τ sets to 0 any data singular value below τ.

Matrix denoisers explicitly or implicitly based on hard thresholding of singular values have been proposed in [10, 11, 13, 14, 15, 16, 17, 18, 19]. As a common example of implicit SVHT denoising, consider the standard practice of estimating r by plotting the singular values of Y in decreasing order and looking for a "large gap" or "elbow" (Figure 1, left panel). As we will see, when X is exactly or approximately low-rank and the entries of Z are white noise of zero mean and unit variance, the empirical distribution of the singular values of the m-by-n matrix Y = X + σZ forms a quarter-circle bulk, whose edge lies approximately at (1 + √β)·√n σ, with β = m/n. Only data singular values that are larger than the bulk edge are noticeable in the empirical distribution (Figure 1, right panel). Since the singular value plot "elbow" is located at the bulk edge, the popular method of TSVD at the "elbow" is an approximation of bulk-edge hard thresholding, X̂_{(1+√β)√nσ}.

Figure 1: Singular values of a data matrix Y ∈ M100,100 drawn from the "natural" model Y = X + Z/√100, where X1,1 = 1.7, X2,2 = 2.5 and Xi,j = 0 elsewhere. In Matlab, a sample can be generated using the command y = svd(diag([[1.7 2.5] zeros(1,98)]) + randn(100)/sqrt(100)). Left panel: the singular values of Y plotted in decreasing order, known in Principal Components Analysis as a scree plot. Right panel: a histogram of the singular values of Y with 15 bins. Note the bulk edge approximately at 2, and note that the location of the top two singular values of Y is approximately Xi,i + 1/Xi,i (i = 1, 2), in agreement with Lemma 2.


    1.1 Questions

Let us measure the denoising performance of a denoiser X̂ at a signal matrix X using Mean Square Error,

    || X̂(Y) − X ||²_F = Σ_{i,j} ( X̂(Y)_{i,j} − X_{i,j} )² .

The TSVD is an optimal rank-r approximation of the data matrix Y. But this does not necessarily mean that it is a good, or even reasonable, estimator of the signal matrix X, which we wish to recover. We may wonder:

  • Question 1. Assume that rank(X) is unknown but small. Is there a singular value threshold τ so that SVHT X̂τ successfully adapts to unknown rank and unknown noise level, and performs as well as TSVD would, had we known the true rank(X)?

As we will see, it is convenient to represent the threshold as τ = λ√n σ, where λ is a parameter typically between 1 and 10. Recently, Sourav Chatterjee [16] proposed that one could have a single universal choice of λ; that in our setting any λ > 2 would give near-optimal MSE, in a qualitative sense; and he specifically proposed λ = 2.02, namely X̂_{2.02√nσ}, as a universal choice for SVHT, regardless of the shape m/n of the matrix, and regardless of the underlying signal matrix X or its rank. While the rule of [16] was originally intended to be 'fairly good' across many situations not reducible to the low-rank-matrix-in-i.i.d.-noise model considered here, X̂_{2.02√nσ} is a specific proposal, which prompts the following question:

  • Question 2. Is there really a single threshold parameter λ∗ that provides good performance guarantees for MSE? Is that value 2.02?

Finally, note that singular value hard thresholding is just one strategy for matrix denoising. It is not a priori clear whether the whole idea of only 'keeping' or 'killing' empirical singular values based on their size makes sense. Could there exist a shrinkage rule η : [0,∞) → [0,∞), one that transitions more smoothly from 'killing' to 'keeping', which leads to a much better denoising scheme? We may wonder:

  • Question 3. How does optimally tuned SVHT compare with the performance of the best possible shrinkage of singular values, at least in the worst-case MSE sense?

    1.2 Optimal location for hard thresholding of singular values

In this paper, we show that in a certain asymptotic framework there are simple and convincing answers to these questions. Following Shabalin and Nobel [20], we adopt an asymptotic framework where the matrix grows while keeping the nonzero singular values of X fixed, and the signal-to-noise ratio of those singular values stays constant with increasing n.

In this asymptotic framework, for a low-rank n-by-n matrix observed in white (not necessarily Gaussian) noise of level σ,

    τ∗ = (4/√3) √n σ ≈ 2.309 √n σ

is the optimal location for hard thresholding of singular values. For a nonsquare m-by-n matrix with m ≠ n, the optimal location is

    τ∗ = λ∗(β) · √n σ ,                                                     (3)

where β = m/n. The value λ∗(β) is the optimal hard threshold coefficient for known σ. It is given by formula (10) below and tabulated for convenience in Table 1.


    1.3 Answers

    Our central observation is as follows.

When a data singular value yi is too small, its associated singular vectors ui, vi are so noisy that the component yi ui vi′ should not be included in X̂. In our asymptotic framework, which models large, low-rank matrices observed in white noise, the cutoff below which yi is too small is exactly (4/√3)√n σ (for square matrices).

  • Answer to Question 1: Optimal SVHT dominates TSVD. Optimally tuned SVHT X̂τ∗ is always at least as good as TSVD X̂r, in terms of AMSE (Theorem 2). Unlike X̂r, the optimal SVHT X̂τ∗ does not require knowledge of r = rank(X). In other words, it adapts to unknown low rank while giving uniformly equal or better performance. For square matrices, the TSVD provides a guarantee on worst-case AMSE that is 5/3 times the guarantee provided by X̂τ∗ (Table 2).

  • Answer to Question 2: Optimal SVHT dominates every other choice of hard threshold. In terms of AMSE, optimally tuned SVHT X̂τ∗ is always at least as good as SVHT X̂τ at any other fixed threshold τ = λ√n σ (Theorem 1). It is the asymptotically minimax SVHT denoiser, over matrices of small bounded rank (Theorems 3 and 4) and over matrices of small bounded nuclear norm (Theorem 5). In particular, the parameter λ = 2.02 is noticeably worse. For square matrices, X̂_{2.02√nσ} provides a guarantee for worst-case AMSE that is 4.26/3 ≈ 1.4 times the guarantee provided by X̂τ∗ (Table 2).

  • Answer to Question 3: Optimal SVHT compares adequately to the optimal shrinker. Optimally tuned SVHT X̂τ∗ provides a guarantee on worst-case asymptotic MSE that is 3/2 times (for square matrices) the best possible guarantee achievable by any shrinkage of data singular values (Table 2).

These are all rigorous results, within a specific asymptotic framework, which prescribes a certain scaling of the noise level, the matrix size, and the signal-to-noise ratio as n grows. But does AMSE predict actual MSE in finite-sized problems? In Section 7 we show finite-n simulations demonstrating the effectiveness of these results even at rather small problem sizes. In high signal-to-noise ratio, all denoisers considered here perform roughly the same, and in particular the classical TSVD is a valid choice in that regime. However, in low and moderate SNR, the performance gain of optimally tuned SVHT is substantial, and can offer a 30%–80% decrease in AMSE.

    1.4 Optimal singular value hard thresholding – In practice

For a low-rank n-by-n matrix observed in white (not necessarily Gaussian) noise of unknown level, one can use the data to obtain an approximation of the optimal location τ∗. Define

    τ̂∗ ≈ 2.858 · ymed ,

where ymed is the median singular value of the data matrix Y. The notation τ̂∗ is meant to emphasize that this is not a fixed threshold chosen a priori, but rather a data-dependent threshold. For a nonsquare m-by-n matrix with m ≠ n, the approximate optimal location when σ is unknown is

    τ̂∗ = ω(β) · ymed .                                                      (4)


The optimal hard threshold coefficient for unknown σ, denoted by ω(β), is not available as an analytic formula, but can easily be evaluated numerically. We provide a Matlab script and an online calculator for this purpose [21]; the underlying derivation appears in Section 3.5 below. Some values of ω(β) are provided in Table 4. When a high-precision value of ω(β) cannot be computed, one can use the approximation

    ω(β) ≈ 0.56 β³ − 0.95 β² + 1.82 β + 1.43 .                              (5)

The optimal SVHT for unknown noise level, X̂τ̂∗, is very simple to implement and does not require any tuning parameters. The denoised matrix X̂τ̂∗(Y) can be computed using just a few code lines in a high-level scripting language. For example, in Matlab:

    beta  = size(Y,1) / size(Y,2);
    omega = 0.56*beta^3 - 0.95*beta^2 + 1.82*beta + 1.43;
    [U, D, V] = svd(Y);
    y = diag(D);                          % data singular values
    y( y < (omega * median(y)) ) = 0;     % hard threshold at omega * y_med
    Xhat = U * diag(y) * V';

Here we have used the approximation (5). We recommend, whenever possible, to use a function omega(beta), such as the one we provide, to compute the coefficient ω(β) with high precision.

    In our asymptotic framework, τ∗ and τ̂∗ enjoy exactly the same optimality properties.This means that X̂τ̂∗ adapts to unknown low rank and to unknown noise level. Empiricalevidence suggest that their performance for finite n is similar. As a result, the answers weprovide above hold for the threshold τ∗ when the noise level is unknown, just as they holdfor the threshold τ∗ when the noise level is known.

    2 Notation and Setting

Let Mm×n denote the space of real m-by-n matrices and ||·||F the Frobenius norm on Mm×n, given by ||X||²_F = Σ_{i,j} X²_{i,j}. Matrix denoisers are functions X̂ : Mm×n → Mm×n. A general model for a signal matrix X observed in white noise Z is Y = X + σZ, with X, Y, Z ∈ Mm×n and Z a matrix whose entries are i.i.d. draws from a probability distribution with zero mean and unit variance. As performance measure for quality of estimation, we use the Mean Square Error (MSE) || X̂ − X ||²_F.

    2.1 Natural problem scaling.

With the exception of TSVD, all denoisers we will consider operate by shrinkage of data singular values, namely are of the form

    X̂ : Σ_{i=1}^m yi ui vi′ ↦ Σ_{i=1}^m η(yi; λ) ui vi′ ,                  (6)

where Y is given by (1) and η : [0,∞) → [0,∞) is some univariate shrinkage rule. As we will see, in the general model Y = X + σZ, the noise level in the singular values of Y is √n σ. Instead of specifying a different shrinkage rule for every n, we calibrate our shrinkage rules to the "natural" model Y = X + Z/√n. In this convention, shrinkage rules stay the same for every value of n, and we conveniently abuse notation by writing X̂ as in (6) for any


X̂ : Mm×n → Mm×n, keeping m and n implicit. To apply any denoiser X̂ below to data from the general model Y = X + σZ, use the denoiser

    X̂^(n,σ)(Y) = √n σ · X̂( Y / (√n σ) ) .                                  (7)

For example, to apply the SVHT

    X̂λ : Σ_{i=1}^m yi ui vi′ ↦ Σ_{i=1}^m ηH(yi; λ) ui vi′

to data sampled from the model Y = X + σZ, use X̂τ, with

    τ = λ · √n σ .

Throughout the text, we use X̂λ to denote SVHT calibrated for noise level 1/√n, and X̂τ to denote SVHT calibrated for a specific general model Y = X + σZ. To translate the AMSE of any denoiser X̂, calibrated for noise level 1/√n, to an approximate MSE of the corresponding denoiser X̂^(n,σ), calibrated for a model Y = X + σZ, we use the identity

    || X̂^(n,σ)(Y) − X ||²_F = n σ² · || X̂( X/(√n σ) + Z/√n ) − X/(√n σ) ||²_F .

Below, we spell out this translation of AMSE where appropriate.

    2.2 Asymptotic framework

Let mn be a sequence such that mn/n → β. To simplify our formulas, we assume that 0 < β ≤ 1. Let the rank r > 0 be fixed. For x ∈ Rr, with x = (x1, . . . , xr), denote by Xn(x) a sequence of matrices, where Xn is mn-by-n, whose r largest singular values are given by x and whose other singular values are 0. Let Zn denote a matrix whose entries are i.i.d. with zero mean and unit variance, and let Yn = Xn + Zn/√n denote the sequence of increasingly larger (random) noisy observed matrices.

This asymptotic model for large, low-rank matrices has been studied in [16, 20] in the context of denoising. When the signal matrix X is itself random, a scenario we do not consider here, this asymptotic model has received much attention under the name Spiked Covariance Model [22]. The two models can be formally connected [20]. However, recent results in Random Matrix Theory [23] make it possible to work in the signal-plus-noise model directly; here, we do not appeal to results on the Spiked Model.

Let X̂ be a denoiser calibrated, as discussed above, for noise level 1/√n. Define the Asymptotic MSE (AMSE) of X̂ at a signal x by¹

    M(X̂, x) = lim_{n→∞} || X̂(Yn) − Xn ||²_F .                             (8)

    3 Results

    Define the optimal hard threshold for singular values for n-by-n square matrices by

    λ∗ = 4/√3 .                                                             (9)

¹ Our results imply that the AMSE is well-defined as a function of the signal singular values x, instead of as a function of the signal matrix X itself.


    β      λ∗(β)        β      λ∗(β)
    0.05   1.5066       0.55   2.0167
    0.10   1.5816       0.60   2.0533
    0.15   1.6466       0.65   2.0887
    0.20   1.7048       0.70   2.1229
    0.25   1.7580       0.75   2.1561
    0.30   1.8074       0.80   2.1883
    0.35   1.8537       0.85   2.2197
    0.40   1.8974       0.90   2.2503
    0.45   1.9389       0.95   2.2802
    0.50   1.9786       1.00   2.3094

Table 1: Some values of the optimal hard threshold coefficient for known σ, λ∗(β) of Eq. (3). For an m-by-n matrix in noise level σ (with m/n = β), the optimal SVHT denoiser X̂τ∗ sets to zero all data singular values below the threshold τ∗ = λ∗(β)√n σ.

More generally, define the optimal threshold for m-by-n matrices with m/n = β by

    λ∗(β) := √( 2(β + 1) + 8β / ( (β + 1) + √(β² + 14β + 1) ) ) .           (10)

Some values of λ∗(β) are provided in Table 1.
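As an illustration of how Eqs. (3) and (10) are used together when σ is known, here is a minimal Matlab sketch (ours, not code from the companion site [21]; the function name svht_known_sigma is hypothetical):

    % Minimal sketch: optimal SVHT when the noise level sigma is known.
    % Y is m-by-n with m <= n, observed as Y = X + sigma*Z.
    function Xhat = svht_known_sigma(Y, sigma)
        [m, n] = size(Y);
        beta = m / n;                                              % aspect ratio
        lambda = sqrt(2*(beta+1) + 8*beta / ...
                      ((beta+1) + sqrt(beta^2 + 14*beta + 1)));    % Eq. (10)
        tau = lambda * sqrt(n) * sigma;                            % Eq. (3)
        [U, D, V] = svd(Y, 'econ');
        y = diag(D);
        y(y < tau) = 0;                                            % hard threshold
        Xhat = U * diag(y) * V';
    end

For β = 1 the computed coefficient is 4/√3 ≈ 2.3094, matching the last entry of Table 1.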

    3.1 Optimally tuned SVHT asymptotically dominates TSVD and any SVHT

Our main result is simply that X̂λ∗ always has equal or better AMSE compared to SVHT with any other choice of threshold, and compared to TSVD. In other words, from the ideal perspective of our asymptotic framework, the decision-theoretic picture is very straightforward: TSVD is asymptotically inadmissible, and so is any SVHT with λ ≠ λ∗:

Theorem 1. Asymptotic inadmissibility of SVHT at any threshold λ ≠ λ∗. For any 0 < β ≤ 1, any λ ≠ λ∗(β), any r ∈ N and any x ∈ Rr we have, almost surely,

    M(X̂λ∗, x) ≤ M(X̂λ, x) ,                                                 (11)

with strict inequality at least at one point x∗(λ) ∈ Rr.

Theorem 2. Asymptotic inadmissibility of TSVD. For any 0 < β ≤ 1, any r ∈ N and any x ∈ Rr we have

    M(X̂λ∗, x) ≤ M(X̂_{1+√β}, x) ≤ M(X̂r, x) .                               (12)

Figure 2 shows the uniform ordering of the AMSE curves, stated in Theorems 1 and 2, for a few values of β.

To apply the optimal hard threshold to m-by-n matrices sampled from the general model Y = X + σZ, we translate X̂λ∗ using Eq. (7), which amounts to using the threshold τ∗ = λ∗·√n σ. Note that Theorem 1 obviously does not imply that for any finite matrix X and τ ≠ τ∗ we have || X̂τ∗(Y) − X ||²_F ≤ || X̂τ(Y) − X ||²_F. However, empirical evidence discussed in Section 7 suggests that even for relatively small matrices, e.g. n ∼ 30, the performance gain from using X̂τ∗ is noticeable, and becomes substantial in low SNR.


Figure 2: AMSE plotted against the signal amplitude x for the denoisers discussed: TSVD X̂r, the universal hard threshold X̂2.02 proposed in [16], and hard thresholding at the optimal location proposed here, X̂λ∗. For comparison we include two denoisers discussed below, namely the bulk-edge hard threshold X̂_{1+√β} and the optimal singular value shrinkage denoiser X̂opt introduced in [24]. The panels show the aspect ratios β = 0.1, 0.3, 0.7 and 1; r = 1 everywhere. The curves were jittered in the vertical axis to avoid overlap.

    3.2 Minimaxity over matrices of bounded rank

Theorem 1 implies that X̂λ∗ is asymptotically minimax among SVHT denoisers, over the class of matrices of a given low rank. Our next result explicitly characterizes the least favorable signal and the asymptotic minimax MSE.

Theorem 3. For square matrices, β = 1, we have:

1. Asymptotically least favorable signal for SVHT. Let λ > 2. Then

    argmax_{x ∈ Rr} M(X̂λ, x) = x∗(λ) · (1, . . . , 1) ∈ Rr ,               (13)

where

    x∗(λ) = ( λ + √(λ² − 4) ) / 2 .


2. Minimax AMSE of SVHT. For the AMSE of the SVHT denoiser (2) we have

    min_λ max_{x ∈ Rr} M(X̂λ, x) = 3r .                                     (14)

3. Asymptotically minimax tuning of the SVHT threshold. For the AMSE of the SVHT denoiser (2) we have

    argmin_{λ > 2} max_{x ∈ Rr} M(X̂λ, x) = 4/√3 .                          (15)

In words, in our asymptotic framework, the least favorable signal for SVHT is fully degenerate. We will see in Lemma 2 below that the least favorable location for signal singular values, x∗(λ), is such that the top r observed data singular values fall exactly on the chosen threshold λ.
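As a quick consistency check of Theorem 3 (a worked example we add here, not part of the original argument), take λ = λ∗ = 4/√3; then

    x∗(λ∗) = ( 4/√3 + √(16/3 − 4) ) / 2 = ( 4/√3 + 2/√3 ) / 2 = √3 ,

and, by the AMSE formula of Lemma 4 with β = 1, the two branches x² and 2 + 3/x² both equal 3 at x = √3, so the worst-case AMSE per signal singular value is 3, matching the minimax value 3r of Eq. (14).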

Theorem 4. For general matrices, 0 < β ≤ 1, we have

    argmax_{x ∈ Rr} M(X̂λ, x) = x∗(λ) · (1, . . . , 1) ∈ Rr ,               (16)

where

    x∗(λ) = √( [ (λ² − β − 1) + √( (λ² − β − 1)² − 4β ) ] / 2 ) .           (17)

Moreover,

    min_λ max_{x ∈ Rr} M(X̂λ, x) = (r/2) · [ (β + 1) + √(β² + 14β + 1) ] ,   (18)

    argmin_{λ > 1+√β} max_{x ∈ Rr} M(X̂λ, x) = √( 2(β + 1) + 8β / ( (β + 1) + √(β² + 14β + 1) ) ) .   (19)

    3.3 Comparison of worst-case AMSE

By Theorem 1, the AMSE of optimally tuned SVHT X̂λ∗ is always lower than the AMSE of other choices for the hard threshold location. One way to measure how much worse the other choices are, and to compare X̂λ∗ with other popular matrix denoisers, is to evaluate their worst-case AMSE.

Table 2 compares the guarantees provided on AMSE by the shrinkage rules mentioned, for the square matrix case m = n in the model Y = X + Z/√n. For the general model Y = X + σZ, multiply each guarantee by nσ².

    3.3.1 TSVD: Truncation at the true rank

The AMSE of the TSVD X̂r is calculated in Lemma 5 below. A simple calculation shows that, in the square matrix case (β = 1),

    max_{x ∈ Rr} M(X̂r, x) = 5r .

This is 5/3 times the corresponding worst-case AMSE of X̂λ∗.


    3.3.2 Hard Thresholding at/near the bulk edge

Lemma 4 provides the AMSE of the SVHT denoiser X̂λ for any λ. Simple calculations show that

    max_{x ∈ Rr} M(X̂2, x) = 5r    and    max_{x ∈ Rr} M(X̂2.02, x) = 4.26r ,

providing the worst-case AMSE of bulk-edge SVHT and of the Universal Singular Value Threshold (USVT) of [16], respectively. The change in worst-case AMSE for just a small increase in the threshold λ seems drastic (see Figure 2). The reason for this phenomenon is discussed in Section 4.

    3.3.3 Soft Thresholding

Many authors have considered matrix denoising by applying the soft thresholding nonlinearity ηS(y; s) = (|y| − s)+ · sign(y), instead of hard thresholding, to the data singular values. The denoiser

    X̂s = Σ_{i=1}^n ηS(yi; s) ui vi′

is known as Singular Value Soft Thresholding (SVST) or SVT; see [25, 26, 27] and references therein. In our asymptotic framework, following reasoning similar to the proof of Theorem 3, one finds that the optimal tuning s∗ for the soft threshold is at the bulk edge 1 + √β, and that, in the square case, the AMSE guarantee of optimally tuned SVST X̂s∗ is 6r, which is twice as large as that of the optimally tuned SVHT X̂λ∗. It is interesting to note that both the optimal tuning for the soft threshold and the corresponding best-possible AMSE guarantee agree with calculations done in an altogether different asymptotic model, in which one first takes n → ∞ with rank r/n → ρ, and only then takes ρ → 0 [25, Sec. 8]. We also note that the worst-case AMSE of SVST is obtained in the limit of very high SNR, where SVHT does very well in comparison. When both are optimally tuned, SVHT does not dominate SVST across all matrices; in fact, soft thresholding does better than hard thresholding in low SNR (Figure 3). For example, in the square case, when the signal is near √3 (the least favorable location for X̂λ∗), the AMSE of X̂s∗ is (7 − 8/√3)r ≈ 2.38r, compared to 3r, the worst-case AMSE of X̂λ∗.

    3.3.4 Optimal Singular Value Shrinker

In the asymptotic framework we are using, Shabalin and Nobel [20] have derived an optimal singular value shrinker X̂opt. In [24] we develop a simple expression for this optimal shrinker; calibrated for the model X + Z/√n, in the square setting m = n, it takes the form

    X̂opt : Σ_{i=1}^n yi ui vi′ ↦ Σ_{i=1}^n ηopt(yi) ui vi′ ,

where

    ηopt(y) = √( (y² − 4)+ ) .
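In the calibrated square model Y = X + Z/√n, applying this shrinker takes only a few lines; the following Matlab fragment is an illustration of ours (the variable names are hypothetical), not code from [24]:

    % Minimal sketch: optimal singular value shrinkage, square case, model Y = X + Z/sqrt(n).
    n = 200;
    X = diag([2.5 1.7 zeros(1, n-2)]);          % a rank-2 signal matrix
    Y = X + randn(n) / sqrt(n);                 % noise calibrated to level 1/sqrt(n)
    [U, D, V] = svd(Y);
    y = diag(D);
    y_shrunk = sqrt(max(y.^2 - 4, 0));          % eta_opt(y) = sqrt((y^2 - 4)_+)
    Xhat_opt = U * diag(y_shrunk) * V';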

In our asymptotic framework, this rule dominates, in AMSE, any other estimator based on singular value shrinkage, at any configuration of the singular values x of the low-rank signal. The AMSE of the optimal shrinker (in the square matrix case) at x ∈ Rr is [24]

    M(X̂opt, x) = Σ_{i=1}^r { 2 − 1/xi²   if xi ≥ 1 ;    xi²   if 0 ≤ xi ≤ 1 } .    (20)


    Shrinker                                    Standing notation    Guarantee on AMSE
    Optimal singular value shrinker             X̂opt                 2r
    Optimally tuned SVHT                        X̂λ∗                  3r
    Universal Singular Value Threshold [16]     X̂2.02                ≈ 4.26r
    SVHT at bulk edge                           X̂2                   5r
    TSVD                                        X̂r                   5r
    Optimally tuned SVST                        X̂s∗                  6r

Table 2: A comparison of guarantees on AMSE provided by the singular value shrinkage rules discussed, for the square matrix case m = n in the model Y = X + Z/√n. For the general model Y = X + σZ, multiply each guarantee by nσ².

The AMSE (20) is depicted in Figure 2. It follows that the worst-case AMSE of X̂opt is

    max_{x ∈ Rr} M(X̂opt, x) = 2r

in the square case. The worst-case AMSE of X̂λ∗ is thus 1.5 times the lowest possible worst-case AMSE in the square case.

    3.4 Minimaxity over matrices of bounded nuclear norm

So far we have considered minimaxity over the class of matrices of at most rank r, where r is given. In [16], the author considered minimax estimation over a different class of matrices, namely nuclear norm balls. For a given constant ξ, this is the class of all matrices whose nuclear norm is at most ξ. Recall that the nuclear norm of a matrix X ∈ Mm×n, whose vector of singular values is x ∈ Rm, is given by ||x||1. Our next result shows that X̂λ∗ is minimax optimal over this class as well. Specifically, it is the minimax estimator, in AMSE, among all SVHT rules, over a given nuclear norm ball. We note that unlike Theorems 3 and 4, this result does not follow directly from Theorem 1. We restrict our discussion to square matrices (β = 1).

Theorem 5. Let λ > 2 and let ξ = r · (λ + √(λ² − 4))/2 for some r ∈ N.

1. The least favorable singular value configuration obeys

    argmax_{||x||1 ≤ ξ} M(X̂λ, x) = x∗(λ) · (1, . . . , 1) ∈ Rr ,            (21)

where

    x∗(λ) = ( λ + √(λ² − 4) ) / 2 .

2. The best achievable inequality between nuclear norm ξ and AMSE of a hard threshold rule is

    min_λ max_{||x||1 ≤ ξ} M(X̂λ, x) = √3 · ξ .                              (22)

3. The threshold achieving this inequality is

    argmin_{λ > 2} max_{||x||1 ≤ ξ} M(X̂λ, x) = 4/√3 .                       (23)
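To see where the constant √3 in (22) comes from, here is a one-line check we add for the reader: with λ = 4/√3 we have x∗(λ) = √3, so the nuclear norm budget is ξ = r·x∗(λ) = r√3, the least favorable configuration places r singular values at √3, and its AMSE is 3r = (3/√3)·ξ = √3·ξ, matching (22) and (23).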


    Shrinker                            Standing notation    Best possible constant C in Eq. (24)
    Optimal singular value shrinker     X̂opt                 1
    Optimally tuned SVHT                X̂λ∗                  √3 ≈ 1.73
    USVT of [16]                        X̂2.02                ≈ 3.70
    SVHT at bulk edge                   X̂2                   5
    TSVD                                X̂r                   5
    Optimally tuned SVST                X̂s∗                  ≈ 1.38

Table 3: A comparison of the best available constant in minimax AMSE over nuclear norm balls, for the shrinkage rules discussed, in the square matrix case m = n under the model Y = X + Z/√n. These constants are the same for the general model Y = X + σZ.

As an alternative to comparing denoisers by their guarantees on AMSE over a prescribed rank r, one can compare denoisers based on the best available constant C in the inequality

    min_λ max_{||x||1 ≤ ξ} M(X̂λ, x) = C · ξ .                               (24)

The results in the square matrix case are summarized in Table 3. Each constant is derived from the AMSE formula for the respective denoiser, as cited above. To understand why the best available constant for optimally tuned SVST is smaller than that of optimally tuned SVHT, consider Figure 3.

Figure 3: AMSE of optimally tuned SVHT (red), optimally tuned SVST (blue) and the optimal singular value shrinker of [24] (green), for the square case (β = 1) and r = 1. The best available constant from Table 3 is the slope of the convex envelope (dashed) of the AMSE curve (solid), namely, the slope of the secant running from the origin to the inflection point of the AMSE curve. Although the maximum (worst-case AMSE) of optimally tuned SVST X̂s∗ is higher than that of optimally tuned SVHT X̂λ∗, the slope of its convex envelope is lower.

    3.5 When the noise level σ is unknown

When the noise level in which Y is observed is unknown, it no longer makes sense to use X̂λ∗, which is calibrated for a specific noise level. We now describe a method to estimate the optimal hard threshold from the data matrix Y. To emphasize that the resulting denoiser is ready for use on data from the general model Y = X + σZ, we denote this estimated threshold by τ̂∗ and the SVHT denoiser by X̂τ̂∗. To this end, we are required to estimate the unknown noise level σ. In the closely related Spiked Covariance Model, there are existing methods for estimating an unknown noise level; see for example [28] and references therein.

Consider the following robust estimator for the parameter σ in the model Y = X + σZ:

    σ̂(Y) := ymed / √(n · µβ) ,                                              (25)

where ymed is the median singular value of Y and µβ is the median of the Marčenko–Pastur distribution with shape parameter β, namely, the unique solution in β− ≤ x ≤ β+ of the equation

    ∫_{β−}^{x} √( (β+ − t)(t − β−) ) / (2πβt) dt = 1/2 ,

where β± = (1 ± √β)². Define the optimal hard threshold for a data matrix Y ∈ Mm×n observed in unknown noise level, with m/n = β, by plugging σ̂(Y) into Eq. (3) in place of σ:

    τ̂∗(β, Y) := λ∗(β) · √n σ̂(Y) = ( λ∗(β) / √µβ ) · ymed .

Writing ω(β) = λ∗(β)/√µβ, the threshold is

    τ̂∗(β, Y) = ω(β) · ymed .

The median µβ, and hence the coefficient ω(β), are not available analytically; in [21] we make available a Matlab script and an online calculator to evaluate the coefficient ω(β). Some values are shown in Table 4. A useful approximation to ω is given as a cubic polynomial in Eq. (5) above; empirically, it agrees with ω(β) to within a few hundredths over the range 0.001 ≤ β ≤ 1.
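For completeness, here is a minimal Matlab sketch of a function omega(beta) along the lines just described (our own illustration; the companion script in [21] is the reference implementation). It evaluates λ∗(β) by Eq. (10) and the Marčenko–Pastur median µβ by numerical integration and root finding:

    % Minimal sketch: numerical evaluation of omega(beta) = lambda_star(beta) / sqrt(mu_beta).
    function w = omega(beta)
        lambda_star = sqrt(2*(beta+1) + 8*beta / ...
                           ((beta+1) + sqrt(beta^2 + 14*beta + 1)));   % Eq. (10)
        lo = (1 - sqrt(beta))^2;                                       % lower bulk edge beta_minus
        hi = (1 + sqrt(beta))^2;                                       % upper bulk edge beta_plus
        mp_pdf = @(t) sqrt((hi - t) .* (t - lo)) ./ (2*pi*beta*t);     % Marcenko-Pastur density
        mp_cdf = @(x) integral(mp_pdf, lo, x);                         % its CDF
        mu_beta = fzero(@(x) mp_cdf(x) - 0.5, [lo + 1e-9, hi - 1e-9]); % the median mu_beta
        w = lambda_star / sqrt(mu_beta);
    end

For β = 1 this should reproduce ω(1) ≈ 2.858, the last entry of Table 4.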


    β      ω(β)         β      ω(β)
    0.05   1.5194       0.55   2.2365
    0.10   1.6089       0.60   2.3021
    0.15   1.6896       0.65   2.3679
    0.20   1.7650       0.70   2.4339
    0.25   1.8371       0.75   2.5011
    0.30   1.9061       0.80   2.5697
    0.35   1.9741       0.85   2.6399
    0.40   2.0403       0.90   2.7099
    0.45   2.1060       0.95   2.7832
    0.50   2.1711       1.00   2.8582

Table 4: Some values of the optimal hard threshold coefficient for unknown noise level, ω(β) of Eq. (4). For an m-by-n matrix in unknown noise level (with m/n = β), the optimal SVHT denoiser X̂τ̂∗ sets to zero all data singular values below the threshold τ̂∗ = ω(β)·ymed, where ymed is the median singular value of the data matrix Y.

    4 Discussion

    4.1 The optimal threshold λ∗(β) relative to the bulk edge threshold 1+√β

Figure 4 shows the optimal threshold λ∗(β) as a function of β. The edge of the quarter-circle bulk, 1 + √β, the hard threshold that best emulates TSVD in our setting, is shown for comparison. In the null case X = 0, the asymptotically largest singular value is exactly at the bulk edge 1 + √β. It might seem that the bulk edge is a natural place to set a threshold, since anything smaller could be the product of a pure-noise situation. However, for β > 0.2, the optimal hard threshold λ∗(β) is 15–20% larger than the bulk edge; as β → 0, it grows about 40% larger. Inspecting the proof of Theorem 1, and particularly the expression for the AMSE of SVHT (Lemma 4), one finds the reason: one component of the AMSE is due to the angle between the signal singular vectors and the data singular vectors. This angle converges to a nonzero value as n → ∞ (given explicitly in Lemma 3), which grows as SNR decreases. When a data singular value yi is too close to the bulk, its corresponding singular vectors are so badly rotated that the rank-one matrix yi ui vi′ it contributes to the denoiser hurts the AMSE more than it helps. For example, for square matrices (β = 1), this situation is most acute when the signal singular value is just barely larger than xi = 1, causing the corresponding data singular value yi to be just barely larger than the bulk edge, which for square matrices is located at 2. An SVHT denoiser thresholding at the bulk edge would include the component yi ui vi′, incurring an AMSE of 5 (Figure 2, β = 1 panel), 5 times larger than the AMSE incurred by excluding yi from the reconstruction. The optimal threshold λ∗(β) keeps such singular values out of the picture; this is why it is necessarily larger than the bulk edge. The precise value of λ∗(β) is the point at which it becomes advantageous to include the rank-one contribution of a singular value yi in the reconstruction.

4.2 The optimal threshold λ∗(β) relative to the USVT X̂2.02

Recently, Sourav Chatterjee [16] discussed SVHT in a broad class of situations; translating his remarks to the present context, he observed that any λ > 2 can serve as a universal hard threshold for singular values (USVT), offering fairly good performance regardless of the matrix shape m/n and the underlying signal matrix X. The author makes the specific recommendation λ = 2.02 and writes:


Figure 4: The optimal threshold λ∗(β) of Eq. (10) as a function of β. The edge of the singular value bulk, 1 + √β, which is the hard threshold that corresponds to TSVD in our setting, is shown for comparison, together with the constant level 2.02.

    "The algorithm manages to cut off the singular values at the 'correct' level, depending on the structure of the unknown parameter matrix. The adaptiveness of the USVT threshold is somewhat similar in spirit to that of the SureShrink algorithm of Donoho and Johnstone."

Keeping in mind that the scope of [16] is much broader than the one considered here, we would like to evaluate this proposal in the setting of a low-rank matrix in white noise, and specifically in our asymptotic framework. Figure 4 includes the value 2.02: indeed, this threshold is larger than the bulk edge for any 0 < β ≤ 1, so Chatterjee's X̂2.02 rule asymptotically sets to zero all singular values that could arise from an underlying noise-only situation. When λ∗(β) < 2.02, the X̂2.02 rule sometimes "kills" singular values that the optimal threshold deems good enough to keep, and when λ∗(β) > 2.02, the X̂2.02 rule sometimes "keeps" singular values that did in fact arise from signal, but are so close to the bulk that the optimal threshold declares them unusable.

For β = 1, the guarantee on worst-case AMSE obtained by using λ = 2.02 over matrices of rank r is about 4.26r, roughly 1.4 times larger than the guarantee obtained by using the minimax threshold λ = 4/√3 (see Figure 2). For square matrices, the regret for preferring USVT to optimally tuned SVHT can be substantial: in low SNR (x ≈ 1), using the threshold λ = 2.02 incurs roughly twice the AMSE of the minimax threshold 4/√3.

We note that unlike the bulk-edge SVHT X̂_{1+√β} and the optimally tuned SVHT X̂λ∗, the USVT X̂2.02 does not take into account the shape factor β. A comparison of worst-case AMSE between the fixed threshold choice λ = 2.02 and the optimal hard threshold λ = λ∗(β) is shown in Figure 5. The two curves intersect at β ≈ 0.55, where the optimal threshold (10) is approximately 2.02.

One might argue that [16] proposed 2.02 based on its MSE performance over classes of matrices bounded in nuclear norm. But for that purpose, too, 2.02 is noticeably outperformed by λ∗(β). Arguing as in Theorem 5 we obtain, in the square case,

    max_{||x||1 ≤ ξ} M(X̂2.02, x) ≈ 3.70 · ξ .                               (26)

The coefficient 3.70 is about 110% worse than the best coefficient achievable by SVHT, namely √3 by (24).


Figure 5: Worst-case AMSE, max_x M(X̂λ, x), for varying shape parameter β, for two choices of hard threshold: (i) λ = 2.02 proposed in [16] and (ii) the optimal threshold λ∗(β) of Eq. (10).

One should keep in mind that USVT is applicable for a wide range of noise models, e.g. in stochastic block models. [16] is the first, to the best of our knowledge, to suggest that a matrix denoising procedure as simple as SVHT could have universal optimality properties. In our asymptotic framework of low-rank matrices in white noise, the 2.02 threshold performs fairly well in AMSE, except for very small values of β (Figure 2); but one often gets a substantial AMSE improvement by switching to the rule we recommend. Since our recommendation dominates in AMSE, there is no downside to making this switch – i.e., there is no configuration of signal singular values x which could make one regret the switch.

    5 A formula for AMSE of hard thresholding

Our main results depend on Lemma 4, a formula for the AMSE of SVHT, which we prove in this section. We invoke the following observations, which follow almost immediately from [23].

Assume that Yn is the sequence of matrices defined above with respect to x = (x1, . . . , xr) ∈ Rr. Write yn = (yn,1, . . . , yn,m) for the singular values of Yn. Let Yn = Un·(yn)∆·Vn′ denote the full Singular Value Decomposition of the data matrix Yn, where Un ∈ Mm×m and Vn ∈ Mn×n are orthogonal matrices, and where (yn)∆ ∈ Mm×n is a diagonal matrix with main diagonal yn ∈ Rm. Write un,i for the i-th column of Un, namely, the left singular vector of Yn corresponding to yn,i. Similarly, write vn,i for the i-th column of Vn. Also, write ui and vi for the i-th left and right singular vectors of the signal matrix X, which correspond to the singular value xi, 1 ≤ i ≤ r.

Lemma 2. Asymptotic data singular values. For 1 ≤ i ≤ r, almost surely,

    lim_{n→∞} yn,i = √( (xi + 1/xi)(xi + β/xi) )    if xi > β^{1/4} ,
                     1 + √β                           if xi ≤ β^{1/4} .      (27)

Proof. This follows from [23, Thm. 2.9] (see the example in Section 3.1 there, Eq. 9). Note that their result, while stated in the case where Z has i.i.d. Gaussian entries, is valid for our setting of Z with i.i.d. entries following an arbitrary distribution with finite second moment.


Lemma 3. Asymptotic angle between signal and data singular vectors. Let 1 ≤ i, j ≤ r and assume that xi has degeneracy d, namely, there are exactly d entries of x equal to xi. If xi > β^{1/4}, we have, almost surely,

    d · lim_{n→∞} |⟨ui , un,j⟩|² = (xi⁴ − β) / (xi⁴ + βxi²)    if xi = xj ,
                                   0                            if xi ≠ xj ,    (28)

and, by a slightly different formula,

    d · lim_{n→∞} |⟨vi , vn,j⟩|² = (xi⁴ − β) / (xi⁴ + xi²)     if xi = xj ,
                                   0                            if xi ≠ xj .    (29)

If however xi ≤ β^{1/4}, then, almost surely,

    lim_{n→∞} |⟨ui , un,j⟩| = lim_{n→∞} |⟨vi , vn,j⟩| = 0 .

Proof. This follows from [23, Thm. 2.10] (see the example in Section 3.1 there, Eqs. 10 and 11). There, it is proved that the squared norm of the projection onto Span{uj | xj = xi} is given by the right-hand side of (28) and, similarly, that the squared projection onto Span{vj | xj = xi} is given by the right-hand side of (29). But necessarily, for all j such that xj = xi,

    lim_{n→∞} |⟨ui , un,i⟩|² = lim_{n→∞} |⟨uj , un,i⟩|²    and    lim_{n→∞} |⟨vi , vn,i⟩|² = lim_{n→∞} |⟨vj , vn,i⟩|² ,

since the matrix Xn and the distribution of Yn are invariant under the permutation i ↔ j.

We now calculate the AMSE (30) of the hard thresholding estimator X̂λ, for a given threshold λ, at a matrix of specific aspect ratio β and signal singular values x. Lemma 2 tells us that any threshold λ < 1 + √β, namely below the bulk edge, is unreasonable, and we restrict ourselves to λ ≥ 1 + √β.

Lemma 4. AMSE of singular value hard thresholding. Fix r > 0 and x ∈ Rr. Let {Xn(x)}∞n=1 and {Zn}∞n=1 be matrix sequences as above and let λ ≥ 1 + √β. Then

    M(X̂λ, x) = Σ_{i=1}^r M(X̂λ, xi) ,                                       (30)

where

    M(X̂λ, xi) = (xi + 1/xi)(xi + β/xi) − (xi² − 2β/xi²)    if xi ≥ x∗(λ) ,
                 xi²                                          if xi < x∗(λ) ,    (31)

and x∗(λ) is given by Eq. (17).

Figure 2 shows the AMSE of Lemma 4, in the square case β = 1 and in the nonsquare cases β = 0.1, 0.3 and 0.7.

Proof. Let ei denote the i-th standard basis vector in Rm or Rn. For matrices X, Y ∈ Mm×n we denote by ⟨X , Y⟩ = Σ_{i,j} Xi,j Yi,j the Frobenius inner product. For vectors x, y ∈ Rn, we denote by ⟨x , y⟩ = Σ_{i=1}^n xi yi the Euclidean inner product. Since λ ≥ 1 + √β, we have

    X̂λ(Yn) = Σ_{i=1}^r η(yn,i; λ) un,i vn,i′


and, by definition,

    Xn = Σ_{i=1}^r xi ei ei′ .

We have

    || X̂λ(Yn) − Xn ||²_F = ⟨X̂λ(Yn) − Xn , X̂λ(Yn) − Xn⟩
        = ⟨X̂λ(Yn) , X̂λ(Yn)⟩ + ⟨Xn , Xn⟩ − 2⟨X̂λ(Yn) , Xn⟩
        = Σ_{i=1}^r η(yn,i; λ)² + Σ_{i=1}^r xi² − 2 Σ_{i,j=1}^r xi · η(yn,j; λ) · ⟨ei ei′ , un,j vn,j′⟩
        = Σ_{i=1}^r [ η(yn,i; λ)² + xi² − 2 xi Σ_{j=1}^r η(yn,j; λ) · ⟨ei ei′ , un,j vn,j′⟩ ] .      (32)

When 0 ≤ xi ≤ β^{1/4}, only the term xi² survives and Eq. (31) holds. Assume now that xi > β^{1/4}. We now consider the limiting value of each of the terms in (32). For the leftmost term, by Lemma 2, for i = 1, . . . , r we have

    lim_{n→∞} η(yn,i; λ)² = (xi + 1/xi)(xi + β/xi)    if (xi + 1/xi)(xi + β/xi) ≥ λ² ,
                            0                          if (xi + 1/xi)(xi + β/xi) < λ² .

Turning to the rightmost term of (32), by Lemma 3, for i, j = 1, . . . , r we have

    lim_{n→∞} ⟨ei , un,j⟩⟨ei , vn,j⟩ = (1/di) · (xi⁴ − β) / √( (xi⁴ + βxi²)(xi⁴ + xi²) )              if xi = xj ,  0 if xi ≠ xj
                                     = (1/di) · (xi⁴ − β) / ( xi³ √( (xi + β/xi)(xi + 1/xi) ) )       if xi = xj ,  0 if xi ≠ xj ,

where di = #{j | xj = xi}. Furthermore, since for a, x ∈ Rm and b, y ∈ Rn we have ⟨xy′ , ab′⟩ = ⟨x , a⟩⟨y , b⟩, we find that for i = 1, . . . , r,

    Σ_{j=1}^r η(yn,j; λ) · ⟨ei ei′ , un,j vn,j′⟩ = Σ_{j=1}^r η(yn,j; λ) · ⟨ei , un,j⟩⟨ei , vn,j⟩ .

For the rightmost term of (32) we conclude that

    lim_{n→∞} xi Σ_{j=1}^r η(yn,j; λ) · ⟨ei ei′ , un,j vn,j′⟩
        = Σ_{1≤j≤r : xj = xi} lim_{n→∞} η(yn,j; λ) · (1/di) · (xi⁴ − β) / ( xi² √( (xi + β/xi)(xi + 1/xi) ) )
        = (xi⁴ − β)/xi²    if (xi + 1/xi)(xi + β/xi) ≥ λ² ,
          0                 if (xi + 1/xi)(xi + β/xi) < λ² ,

where we have used Lemma 2 again. Collecting the terms, we find for the limiting value of (32) that

    lim_{n→∞} || X̂λ(Yn) − Xn ||²_F = Σ_{i=1}^r M(X̂λ, xi) ,                 (33)

where M(X̂λ, xi) is given by (31), as required.

    For the TSVD, the same argument gives:


Lemma 5. AMSE of TSVD. Fix r > 0 and x ∈ Rr. Let {Xn(x)}∞n=1 and {Zn}∞n=1 be matrix sequences as above. Then

    M(X̂r, x) = Σ_{i=1}^r M(X̂r, xi) ,                                       (34)

where

    M(X̂r, xi) = (xi + 1/xi)(xi + β/xi) − (xi² − 2β/xi²)    if xi ≥ β^{1/4} ,
                 (1 + √β)² + xi²                              if xi ≤ β^{1/4} .    (35)

    6 Proofs

Proof of Theorem 1. Let x∗ = x∗(λ∗(β)), where λ∗(β) is defined in (10) and x∗(λ) is defined in (17). Then

    x∗² = (x∗ + 1/x∗)(x∗ + β/x∗) − (x∗² − 2β/x∗²) .

It follows that for all x > 0 and λ > 1 + √β,

    M(X̂λ∗, x) = min{ x² , (x + 1/x)(x + β/x) − (x² − 2β/x²) } ≤ M(X̂λ, x) ,

and the theorem follows from Eq. (30). Figure 6 provides a friendly explanation of this proof for the square (β = 1) case.

Proof of Theorem 2. For x < β^{1/4}, by Lemma 4 and Lemma 5 we have

    M(X̂λ∗, x) = x² ≤ x² + (1 + √β)² = M(X̂r, x) .

For x ≥ β^{1/4}, by Lemma 5 and Theorem 1 we have

    M(X̂λ∗, x) ≤ M(X̂_{1+√β}, x) = M(X̂r, x) .

Proof of Theorems 3 and 4. Theorem 3 is a special case of Theorem 4. By Eq. (30), it is enough to consider the univariate function x ↦ M(X̂λ, x) defined in (31).

The Theorem follows from Lemma 4 using the following simple observation. Let 0 < β ≤ 1. For λ ≥ 1 + √β, let x∗(λ) denote the unique positive solution of the equation (x + 1/x)(x + β/x) = λ². Let λ∗ be the unique solution of the equation in λ

    x∗(λ)⁴ − (β + 1) x∗(λ)² − 3β = 0 .

Then for the function M(X̂λ, x) defined in (31), we have

    argmax_{x>0} M(X̂λ, x) = x∗(λ)                                           (36)
    argmin_{λ≥1+√β} max_{x>0} M(X̂λ, x) = λ∗                                 (37)
    min_{λ≥1+√β} max_{x>0} M(X̂λ, x) = x∗(λ∗)² .                             (38)

Note that the least favorable situation occurs when x1 = . . . = xr = x∗(λ), and that x∗(λ) is precisely the value of x for which the corresponding limiting data singular value satisfies lim_{n→∞} yn,i = λ. In other words, the least favorable situation occurs when the data singular values all coincide with each other and with the chosen hard threshold.


Figure 6: The AMSE profiles of (30) for β = 1, r = 1, for several values of λ (2.02, 2.15, 2.31, 2.45, 2.6 and 2.75). Green: x ↦ x². Blue: x ↦ 2 + 3/x². Horizontal line: location of the cutoff x∗(λ) solving x∗ + 1/x∗ = λ. Solid line: AMSE curve.

Proof of Lemma 1. Let Fn denote the empirical cumulative distribution function (CDF) of the squared singular values of Yn. Write y²med,n = Median(Fn), where Median(·) is a functional which takes as argument a CDF and delivers its median. Under our asymptotic framework, almost surely, Fn converges weakly to a limiting distribution FMP, the CDF of the Marčenko–Pastur distribution with shape parameter β. This distribution has a positive density throughout its support, in particular at its median. The median functional is continuous for weak convergence at FMP, and hence, almost surely,

    y²med,n = Median(Fn) → Median(FMP) = µβ ,    n → ∞ .

It follows that, almost surely,

    lim_{n→∞} σ̂(Yn) / (1/√n) = lim_{n→∞} ymed,n / √µβ = 1 .


    7 Empirical comparison of MSE with AMSE

We have calculated the exact optimal threshold τ∗ in a certain asymptotic framework. Its practical significance hinges on the validity of the AMSE as an approximation to MSE, for values of (m, n, r) and error distributions encountered in practice. This in turn depends on the simultaneous convergence of three terms:

    • Convergence of the top data singular values yn,i (1 ≤ i ≤ r) to the limit in Lemma 2,

    • Convergence of the bottom data singular values yn,i (r + 1 ≤ i ≤ m) to 0, and

  • Convergence of the angle between the top data singular vectors un,i, vn,i and their respective signal singular vectors to the limit in Lemma 3.

Analysis of each of these terms for specific error distributions is well beyond our current scope. Figures 7, 8, 9 and 10 show comparisons of AMSE and empirical MSE. The matrix sizes and numbers of Monte Carlo draws are small enough to demonstrate that AMSE is a reasonable approximation even for relatively small low-rank matrices. As convergence of the empirical spectrum to its limit is known to depend on moments of the underlying distributions, we include results for three error distributions: standard normal (light tails), uniform (compactly supported) and Student's t with 6 degrees of freedom (heavy tails). The only regime where the AMSE was observed to give a poor approximation to MSE was for X̂λ with λ close to 1 + √β (Figure 10). Indeed, for the case n = 50 shown, some data singular values from the bulk manage to pass the threshold λ and cause their singular vectors to be included in the estimator X̂λ. Our derivation of AMSE assumed, however, that no such singular vectors are included in X̂λ.
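As a template for this kind of finite-n experiment (a minimal version of ours, not the script that generated Figures 7–10), one can compare the empirical MSE of optimally tuned SVHT and of TSVD against the AMSE prediction over a few Monte Carlo draws:

    % Minimal sketch: empirical MSE vs. AMSE for optimally tuned SVHT and TSVD (square case).
    n = 50; r = 1; xsig = 1.3;            % small matrix, rank 1, moderate SNR
    lambda = 4/sqrt(3);                   % optimal threshold for beta = 1
    nMC = 200; mse_svht = 0; mse_tsvd = 0;
    X = diag([xsig zeros(1, n-1)]);       % signal with a single singular value xsig
    for k = 1:nMC
        Y = X + randn(n)/sqrt(n);         % calibrated model Y = X + Z/sqrt(n)
        [U, D, V] = svd(Y);
        y = diag(D);
        ysvht = y .* (y >= lambda);               % hard thresholding at lambda
        ytsvd = [y(1:r); zeros(n-r, 1)];          % truncation at the true rank r
        mse_svht = mse_svht + norm(U*diag(ysvht)*V' - X, 'fro')^2 / nMC;
        mse_tsvd = mse_tsvd + norm(U*diag(ytsvd)*V' - X, 'fro')^2 / nMC;
    end
    amse_svht = min(xsig^2, 2 + 3/xsig^2);        % AMSE of X_hat at lambda = 4/sqrt(3), beta = 1
    disp([mse_svht mse_tsvd amse_svht])           % empirical MSEs vs. the AMSE prediction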

    8 Conclusion

The asymptotic framework considered here is perhaps the simplest nontrivial model for matrix denoising. It allows one to calculate, in AMSE, basically any quantity of interest, for any denoiser of interest. The fundamental elements of matrix denoising in white noise, which underlie more complicated models, are present yet understandable and quantifiable. For example, the AMSE of any denoiser based on singular value shrinkage contains a component due to noise contamination in the data singular vectors, and this component determines a fundamental lower bound on AMSE.

We conjecture that results calculated in this model which are not attached to a specific assumption on rank (e.g., the constants in Table 3, which determine the minimax AMSE over nuclear norm balls) remain essentially correct in more complicated models.

The decision-theoretic landscape as it appears through the naive prism of our asymptotic framework is extremely simple: there is a unique admissible hard thresholding rule, and moreover a unique admissible shrinkage rule, for singular values. This is of course quite different from the situation encountered, for example, in estimating normal means. The reason is the extreme simplicity of our model: we have replaced the data singular values, which are random for finite matrix size, with their a.s. limits, and in effect neglected their random fluctuations around these limits. It is known that the data singular values that escape the bulk follow a Gaussian distribution, whose means are these limits and whose covariance structure is known; see for example [29, 30]. We have ignored this structure. However, including these second-order terms in the asymptotic distributions is only likely to achieve second-order improvements in MSE.


    Reproducible Research

Matlab scripts used to create the figures in this paper are available at the author webpage gavish.web.stanford.edu.

    Acknowledgements

We would like to thank Andrea Montanari for pointing us to the work of Shabalin and Nobel, and Drew Nobel and Sourav Chatterjee for helpful discussions. This work was partially supported by NSF DMS 0906812 (ARRA). MG was partially supported by a William R. and Sara Hart Kimball Stanford Graduate Fellowship.

    References

[1] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial & Applied Mathematics: Series B, 2(2):205–224, 1965. URL http://epubs.siam.org/doi/abs/10.1137/0702016.

[2] Carl Eckart and Gale Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 1936. URL http://link.springer.com/article/10.1007/BF02288367.

[3] O. Alter, P. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, August 2000. doi: 10.1073/pnas.97.18.10101.

[4] R. B. Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1966. URL http://www.tandfonline.com/doi/abs/10.1207/s15327906mbr0102_10.

[5] D. A. Jackson. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology, 1993. URL http://www.jstor.org/stable/10.2307/1939574.

[6] T. D. Lagerlund, F. W. Sharbrough, and N. E. Busacker. Spatial filtering of multichannel electroencephalographic recordings through principal component analysis by singular value decomposition. Journal of Clinical Neurophysiology, 14(1):73–82, January 1997. URL http://www.ncbi.nlm.nih.gov/pubmed/9013362.

[7] Alkes L. Price, Nick J. Patterson, Robert M. Plenge, Michael E. Weinblatt, Nancy A. Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, August 2006. doi: 10.1038/ng1847.

[8] O. Edfors and M. Sandell. OFDM channel estimation by singular value decomposition. Communications, . . . , 1998. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=701321.



[9] Svante Wold. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics, 20(4):397–405, 1978. URL http://www.tandfonline.com/doi/pdf/10.1080/00401706.1978.10489693.

[10] Art B. Owen and Patrick O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization. The Annals of Applied Statistics, 3(2):564–594, June 2009. ISSN 1932-6157. doi: 10.1214/08-AOAS227. URL http://projecteuclid.org/euclid.aoas/1245676186.

[11] Patrick O. Perry. Cross-Validation for Unsupervised Learning. Ph.D. thesis, Stanford University, September 2009. URL http://arxiv.org/abs/0909.3052.

[12] Peter D. Hoff. Model averaging and dimension selection for the singular value decomposition. September 2006. URL http://arxiv.org/abs/0909.3052.

[13] Dimitris Achlioptas and Frank McSherry. Fast Computation of Low Rank Matrix Approximations. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 611–618, 2001. ISBN 1581133499. URL http://dl.acm.org/citation.cfm?id=380858.

[14] Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Spectral Analysis of Data. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 619–626, 2001. ISBN 1581133499. URL http://dl.acm.org/citation.cfm?id=380859.

[15] Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604, December 2008. ISSN 0090-5364. doi: 10.1214/08-AOS600. URL http://projecteuclid.org/euclid.aos/1231165180.

[16] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. pages 1–52, 2010. URL arxiv.org/abs/1212.1247.

[17] Raghunandan H. Keshavan and Sewoong Oh. OptSpace: A Gradient Descent Algorithm on the Grassman Manifold for Matrix Completion. pages 1–26, 2009.

[18] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. Information Theory, IEEE . . . , 2010. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5466511.

[19] Jared Tanner and Ke Wei. Normalized iterative hard thresholding for matrix completion. 2012. URL http://people.maths.ox.ac.uk/tanner/papers/TaWei_NIHT.pdf.

[20] Andrey Shabalin and Andrew Nobel. Reconstruction of a Low-rank Matrix in the Presence of Gaussian Noise. Journal of Multivariate Analysis, 118:67–76, 2013. URL http://dx.doi.org/10.1016/j.jmva.2013.03.005.

[21] David L. Donoho and Matan Gavish. Companion website for the article The Optimal Hard Threshold for Singular Values is 4/sqrt(3). URL http://www.runshare.org/CompanionSite/Site303, 2013. Accessed 11 Nov 2013.

[22] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327, 2001. ISSN 00905364. doi: 10.2307/2674106. URL http://projecteuclid.org/euclid.aos/1009210544.



[23] Florent Benaych-Georges and Raj Rao Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, October 2012. ISSN 0047259X. doi: 10.1016/j.jmva.2012.04.019. URL http://linkinghub.elsevier.com/retrieve/pii/S0047259X12001108.

[24] David L. Donoho and Matan Gavish. Optimal shrinkage of singular values. Technical report, Stanford University, 2013.

[25] David L. Donoho and Matan Gavish. Minimax Risk of Matrix Denoising by Singular Value Thresholding. Technical report, Stanford University, 2013. URL http://arxiv.org/abs/1304.2085.

[26] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization, 20(4):1956, 2008. URL http://arxiv.org/abs/0810.3286.

[27] Emmanuel J. Candès, Carlos A. Sing-Long, and Joshua D. Trzasko. Unbiased Risk Estimates for Singular Value Thresholding and Spectral Estimators. Technical report, Stanford University, 2012. URL http://arxiv.org/abs/1210.4139.

[28] Shira Kritchman and Boaz Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing, 57(10):3930–3941, 2009. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4915755.

[29] Zhidong Bai and Jian-feng Yao. Central limit theorems for eigenvalues in a spiked population model. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 44(3):447–474, June 2008. ISSN 0246-0203. doi: 10.1214/07-AIHP118. URL http://projecteuclid.org/euclid.aihp/1211819420.

[30] Dai Shi. Asymptotic Joint Distribution of Extreme Sample Eigenvalues and Eigenvectors in the Spiked Population Model. April 2013. URL http://arxiv.org/abs/1304.6113.



[Figure 7 here: a 4 × 3 grid of panels plotting MSE against x, with curves for X̂r and X̂λ∗. One row per matrix size, (10, 10, 1), (20, 20, 1), (50, 50, 1), (100, 100, 1); one column per noise distribution, gaussian, uniform, student-t(6).]

Figure 7: The AMSE (solid line) and empirical MSE from 50 Monte Carlo draws (dots) of X̂r and X̂λ∗ for β = 1, r = 1. Different values of N and error distributions are shown. Panel titles indicate (m, n, r) and the noise distribution.
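The following Matlab sketch (ours, not one of the companion scripts) runs one experiment in the spirit of these panels, for the (50, 50, 1) Gaussian case: it compares TSVD at the true rank with hard thresholding at λ∗ = (4/√3)√n·σ. The scaling of the signal singular value by √n·σ and the normalization of the vertical axis by n·σ² are our assumptions and need not match the figure's conventions exactly.

```matlab
% Rank-1, 50-by-50, Gaussian noise: empirical MSE of TSVD at the true rank
% versus hard thresholding at lambda*, over a grid of signal strengths x.
n = 50; sigma = 1; T = 50;
lambda = (4 / sqrt(3)) * sqrt(n) * sigma;        % recommended threshold, sigma known
xs = 0.25:0.25:2.5;                              % signal singular value, in units of sqrt(n)*sigma
mse_tsvd = zeros(size(xs)); mse_svht = zeros(size(xs));
for i = 1:numel(xs)
    for t = 1:T
        u = randn(n, 1); u = u / norm(u);
        v = randn(n, 1); v = v / norm(v);
        X = xs(i) * sqrt(n) * sigma * (u * v');  % rank-1 signal on the natural scale
        Y = X + sigma * randn(n);
        [U, S, V] = svd(Y); y = diag(S);
        Xr = y(1) * U(:, 1) * V(:, 1)';          % TSVD truncated at the true rank r = 1
        yl = y; yl(yl < lambda) = 0;
        Xl = U * diag(yl) * V';                  % hard thresholding at lambda*
        mse_tsvd(i) = mse_tsvd(i) + norm(X - Xr, 'fro')^2 / T;
        mse_svht(i) = mse_svht(i) + norm(X - Xl, 'fro')^2 / T;
    end
end
% One plausible normalization of the vertical axis: Frobenius loss per n*sigma^2.
plot(xs, mse_tsvd / (n * sigma^2), 'o-', xs, mse_svht / (n * sigma^2), 's-');
xlabel('x'); ylabel('MSE'); legend('TSVD at true rank', 'hard threshold at \lambda^*');
```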


[Figure 8 here: a 4 × 3 grid of panels plotting MSE against x, with curves for X̂r and X̂λ∗. One row per shape, (10, 50, 1), (25, 50, 1), (35, 50, 1), (50, 50, 1); one column per noise distribution, gaussian, uniform, student-t(6).]

Figure 8: The AMSE (solid line) and empirical MSE from 50 Monte Carlo draws (dots) of X̂r and X̂λ∗ for N = 50, r = 1. Different values of β = m/n and error distributions are shown. Panel titles indicate (m, n, r) and the noise distribution.


[Figure 9 here: a 4 × 3 grid of panels plotting MSE against x, with curves for X̂r and X̂λ∗. One row per rank, (50, 50, 2), (50, 50, 3), (50, 50, 4), (50, 50, 5); one column per noise distribution, gaussian, uniform, student-t(6).]

Figure 9: The AMSE (solid line) and empirical MSE from 50 Monte Carlo draws (dots) of X̂r and X̂λ∗ for m = n = 50, namely β = 1. Different values of r and error distributions are shown. Panel titles indicate (m, n, r) and the noise distribution.


[Figure 10 here: a 4 × 3 grid of panels plotting MSE against x for X̂λ, all with (m, n, r) = (50, 50, 1). One row per threshold, λ = 2, 2.02, 2.2, 2.5; one column per noise distribution, gaussian, uniform, student-t(6).]

Figure 10: The AMSE (solid line) and empirical MSE from 50 Monte Carlo draws (dots) of X̂λ for m = n = 50, namely β = 1, and r = 1. Different values of λ and error distributions are shown. Panel titles indicate (m, n, r) and the noise distribution.
