
    Reconstruction of a Low-rank Matrix in the

    Presence of Gaussian Noise

    Andrey Shabalin and Andrew Nobel

    July 26, 2010

    Abstract

In this paper we study the problem of reconstruction of a low-rank matrix observed with additive Gaussian noise. First we show that under mild assumptions (about the prior distribution of the signal matrix) we can restrict our attention to reconstruction methods that are based on the singular value decomposition of the observed matrix and act only on its singular values (preserving the singular vectors). Then we determine the effect of noise on the SVD of low-rank matrices by building a connection between the matrix reconstruction problem and the spiked population model in random matrix theory. Based on this knowledge, we propose a new reconstruction method, called RMT, that is designed to reverse the effect of the noise on the singular values of the signal matrix and adjust for its effect on the singular vectors. With an extensive simulation study we show that the proposed method outperforms even oracle versions of both soft and hard thresholding methods and closely matches the performance of a general oracle scheme.

    1 Introduction

Existing and emerging technologies provide scientists with access to a growing wealth of data. Some data is initially produced in matrix form, while other data can be represented in a matrix form once the data from multiple samples is combined. The data is often measured with noise due to limitations of the data generating technologies. To reduce the noise we need some information about the possible structure of the signal component. In this paper


we assume the signal to be a low rank matrix. Assumptions of this sort appear in multiple fields of study including genomics (Raychaudhuri et al., 2000; Alter et al., 2000; Holter et al., 2000; Wall et al., 2001; Troyanskaya et al., 2001), compressed sensing (Candes et al., 2006; Candes and Recht; Donoho, 2006), and image denoising (Wongsawat, Rao, and Oraintara; Konstantinides et al., 1997). In many cases the signal matrix is known to have low rank. For example, a matrix of squared distances between points in d-dimensional Euclidean space is known to have rank at most d + 2. A correlation matrix for a set of points in d-dimensional Euclidean space has rank at most d. In other cases the target matrix is often assumed to have low rank, or to have a good low-rank approximation.

In this paper we address the problem of recovering a low rank signal matrix whose entries are observed in the presence of additive Gaussian noise. The reconstruction problem considered here has a signal plus noise structure. Our goal is to recover an unknown $m \times n$ matrix $A$ of low rank that is observed in the presence of i.i.d. Gaussian noise as the matrix $Y$:

\[ Y = A + \frac{\sigma}{\sqrt{n}}\, W, \quad \text{where } W_{ij} \text{ are i.i.d. } N(0, 1). \]

The factor $n^{-1/2}$ ensures that the signal and noise are comparable, and is employed for the asymptotic study of matrix reconstruction in Section 3. In what follows, we first consider the variance of the noise $\sigma^2$ to be known, and assume that it is equal to one. (In Section 4.1 we propose an estimator for $\sigma$, which we use in the proposed reconstruction method.) In this case the model simplifies to

\[ Y = A + \frac{1}{\sqrt{n}}\, W, \quad \text{where } W_{ij} \text{ are i.i.d. } N(0, 1). \quad (1) \]

Formally, a matrix recovery scheme is a map $g : \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n}$ from the space of $m \times n$ matrices to itself. Given a recovery scheme $g(\cdot)$ and an observed matrix $Y$ from the model (1), we regard $\hat{A} = g(Y)$ as an estimate of $A$, and measure the performance of the estimate $\hat{A}$ by

\[ \mathrm{Loss}(A, \hat{A}) = \| \hat{A} - A \|_F^2, \quad (2) \]

where $\|\cdot\|_F$ denotes the Frobenius norm. The Frobenius norm of an $m \times n$ matrix $B = \{b_{ij}\}$ is given by

\[ \| B \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} b_{ij}^2. \]


Note that if the vector space $\mathbb{R}^{m\times n}$ is equipped with the inner product $\langle A, B \rangle = \mathrm{tr}(A'B)$, then $\|B\|_F^2 = \langle B, B \rangle$.

1.1 Hard and Soft Thresholding

A natural starting point for reconstruction of the target matrix $A$ in (1) is the singular value decomposition (SVD) of the observed matrix $Y$. Recall that the singular value decomposition of an $m \times n$ matrix $Y$ is given by the factorization

\[ Y = U D V' = \sum_{j=1}^{m\wedge n} d_j\, u_j v_j'. \]

Here $U$ is an $m \times m$ orthogonal matrix whose columns are the left singular vectors $u_j$, $V$ is an $n \times n$ orthogonal matrix whose columns are the right singular vectors $v_j$, and $D$ is an $m \times n$ matrix with singular values $d_j = D_{jj} \ge 0$ on the diagonal and all other entries equal to zero. Although it is not necessarily square, we will refer to $D$ as a diagonal matrix and write $D = \mathrm{diag}(d_1, \ldots, d_{m\wedge n})$, where $m \wedge n$ denotes the minimum of $m$ and $n$.
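This reduced factorization is directly available in standard numerical libraries; the small numpy sketch below is our own illustration (not part of the paper) of the convention used throughout.

    import numpy as np

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((5, 3))
    # Reduced SVD: U is 5 x 3, d has length 3 = m ^ n, Vt is 3 x 3.
    U, d, Vt = np.linalg.svd(Y, full_matrices=False)
    assert np.allclose(Y, (U * d) @ Vt)   # Y = sum_j d_j u_j v_j'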

Many matrix reconstruction schemes act by shrinking the singular values of the observed matrix towards zero. Shrinkage is typically accomplished by hard or soft thresholding. Hard thresholding schemes set every singular value of $Y$ less than a given positive threshold $\lambda$ equal to zero, leaving other singular values unchanged. The family of hard thresholding schemes is defined by

\[ g_H^{\lambda}(Y) = \sum_{j=1}^{m\wedge n} d_j\, I(d_j \ge \lambda)\, u_j v_j', \quad \lambda > 0. \]

Soft thresholding schemes subtract a given positive number $\nu$ from each singular value, setting values less than $\nu$ equal to zero. The family of soft thresholding schemes is defined by

\[ g_S^{\nu}(Y) = \sum_{j=1}^{m\wedge n} (d_j - \nu)_+\, u_j v_j', \quad \nu > 0. \]

Hard and soft thresholding schemes can be defined equivalently in the penalized forms

\[ g_H^{\lambda}(Y) = \operatorname*{argmin}_B \left\{ \| Y - B \|_F^2 + \lambda^2\, \mathrm{rank}(B) \right\}, \]

\[ g_S^{\nu}(Y) = \operatorname*{argmin}_B \left\{ \| Y - B \|_F^2 + 2\nu\, \| B \|_* \right\}. \]

In the second display, $\|B\|_*$ denotes the nuclear norm of $B$, which is equal to the sum of its singular values.
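For concreteness, a minimal numpy sketch of the two thresholding families follows; the function names are ours, and the threshold values are left to the caller.

    import numpy as np

    def hard_threshold(Y, lam):
        # g_H^lambda: keep singular values of Y that are at least lam, zero out the rest.
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        return (U * np.where(d >= lam, d, 0.0)) @ Vt

    def soft_threshold(Y, nu):
        # g_S^nu: subtract nu from every singular value, truncating at zero.
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        return (U * np.maximum(d - nu, 0.0)) @ Vt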

Figure 1: Scree plot for a 1000 × 1000 rank 2 signal matrix with noise (the largest singular values of Y plotted in decreasing order against their index).

In practice, hard and soft thresholding schemes require estimates of the noise variance, as well as the selection of appropriate cutoff or shrinkage parameters. There are numerous methods in the literature for choosing the hard threshold. Heuristic methods often make use of the scree plot, which displays the singular values of $Y$ in decreasing order: $\lambda$ is typically chosen to be the y-coordinate of a well defined elbow in the resulting curve. A typical scree plot for a 1000 × 1000 matrix with rank 2 signal is shown in Figure 1. The elbow point of the curve on the plot clearly indicates that the signal has rank 2.

A theoretically justified selection of the hard threshold is presented in recent work of Bunea et al. (2010). They also provide performance guarantees for the resulting hard thresholding scheme using techniques from empirical process theory and complexity regularization. Selection of the soft thresholding shrinkage parameter $\nu$ may also be accomplished by a variety of methods. Negahban and Wainwright (2009) propose a specific choice of $\nu$ and provide performance guarantees for the resulting soft thresholding scheme.

Figure 2 illustrates the action of hard and soft thresholding on a 1000 × 1000 matrix with a rank 50 signal. The blue line indicates the singular values of the signal $A$ and the green line indicates those of the observed matrix $Y$. The plots show the singular values of the hard and soft thresholding estimates incorporating the best choice of the parameters $\lambda$ and $\nu$, respectively. It is evident from the figure that neither thresholding scheme delivers an accurate


Figure 2: Singular values of hard and soft thresholding estimates for a 1000 × 1000 matrix with a rank 50 signal. Each panel shows the singular values of A, the singular values of Y, and the best hard (left panel) or soft (right panel) thresholding estimate.

estimate of the signal's singular values. Moreover, examination of the loss indicates that they do not provide good estimates of the signal matrix.

The families of hard and soft thresholding methods encompass many existing reconstruction schemes. Both thresholding approaches seek low rank (sparse) estimates of the target matrix, and both can be naturally formulated as optimization problems. However, the family of all reconstruction schemes is much larger, and it is natural to consider alternatives to hard and soft thresholding that may offer better performance.

In this paper, we start with a principled analysis of the matrix reconstruction problem, making as few assumptions as possible. Based on this analysis and, in particular, on the analysis of the effect of noise on low-rank matrices, we design a new reconstruction method with a theoretically motivated design.

    1.2 Outline

We start the paper with an analysis of the finite sample properties of the matrix reconstruction problem. The analysis does not require the matrix $A$ to have low rank and only requires that the distribution of the noise matrix $W$ is orthogonally invariant (does not change under left and right multiplications by orthogonal matrices). Under mild conditions (the prior distribution of $A$ must be orthogonally invariant) we prove that we can restrict our attention to reconstruction methods that are based on the SVD of the observed matrix $Y$ and act only on its singular values, not affecting the singular vectors.


This result has several useful consequences. First, it reduces the space of reconstruction schemes we consider from $g : \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n}$ to just $\mathbb{R}^{m\wedge n} \to \mathbb{R}^{m\wedge n}$. Moreover, it gives us a recipe for the design of the new reconstruction scheme: we determine the effect of the noise on the singular values and the singular value decomposition of the signal matrix $A$ and then build the new reconstruction method to reverse the effect of the noise on the singular values of $A$ and account for its effect on the singular vectors.

To determine the effect of noise on a low-rank signal we build a connection between the matrix reconstruction problem and spiked population models in random matrix theory. Spiked population models were introduced by Johnstone (2001). The asymptotic matrix reconstruction model that matches the setup of spiked population models assumes the rank of $A$ and its non-zero singular values to be fixed as the matrix dimensions grow at the same rate: $m, n \to \infty$ and $m/n \to c > 0$. We use results from random matrix theory about the limiting behavior of the eigenvalues (Marcenko and Pastur, 1967; Wachter, 1978; Geman, 1980; Baik and Silverstein, 2006) and eigenvectors (Paul, 2007; Nadler, 2008; Lee et al., 2010) of sample covariance matrices in spiked population models to determine the limiting behavior of the singular values and the singular vectors of the matrix $Y$.

We apply these results to design a new matrix reconstruction method, which we call RMT for its use of random matrix theory. The method estimates the singular values of $A$ from the singular values of $Y$ and applies additional shrinkage to them to correct for the difference between the singular vectors of $Y$ and those of $A$. The method uses an estimator of the noise variance which is based on the sample distribution of the singular values of $Y$ corresponding to the zero singular values of $A$.

We conduct an extensive simulation study to compare the RMT method against the oracle versions of hard and soft thresholding and the orthogonally invariant oracle. We run all four methods on matrices of different sizes, with signals of various ranks and spectra. The simulations clearly show that the RMT method strongly outperforms the oracle versions of both hard and soft thresholding methods and closely matches the performance of the orthogonally invariant oracle (the oracle scheme that acts only on the singular values of the observed matrix).

The paper is organized as follows. In Section 2 we present the analysis of the finite sample properties of the reconstruction problem. In Section 3 we determine the effect of the noise on the singular value decomposition of


low-rank matrices. In Section 4 we construct the proposed reconstruction method based on the results of the previous section. The method employs the noise variance estimator presented in Section 4.1. Finally, in Section 5 we present the simulation study comparing the RMT method to the oracle versions of hard and soft thresholding and the orthogonally invariant oracle method.

2 Orthogonally Invariant Reconstruction Methods

The additive model (1) and Frobenius loss (2) have several elementary invariance properties that lead naturally to the consideration of reconstruction methods with analogous forms of invariance. Recall that a square matrix $U$ is said to be orthogonal if $U'U = UU' = I$, or equivalently, if the rows (or columns) of $U$ are orthonormal. If we multiply each side of (1) from the left and right by orthogonal matrices $U$ and $V$ of appropriate dimensions, we obtain

\[ U' Y V = U' A V + \frac{1}{\sqrt{n}}\, U' W V. \quad (3) \]

Proposition 1. Equation (3) is a reconstruction problem of the form (1) with signal $U'AV$ and observed matrix $U'YV$. If $\hat{A}$ is an estimate of $A$ in model (1), then $U'\hat{A}V$ is an estimate of $U'AV$ in model (3) with the same loss.

Proof. If $A$ has rank $r$ then $U'AV$ also has rank $r$. Thus to prove the first statement, it suffices to show that $U'WV$ in (3) has independent $N(0, 1)$ entries. This follows from standard properties of the multivariate normal distribution. In order to establish the second statement of the proposition, let $U$ and $V$ be the orthogonal matrices in (3). For any $m \times n$ matrix $B$,

\[ \| U'B \|_F^2 = \mathrm{tr}\big( (U'B)'(U'B) \big) = \mathrm{tr}\big( B'B \big) = \| B \|_F^2, \]

and more generally $\|U'BV\|_F^2 = \|B\|_F^2$. Applying the last equality to $B = \hat{A} - A$ yields

\[ \mathrm{Loss}(U'AV,\, U'\hat{A}V) = \| U'(\hat{A} - A)V \|_F^2 = \| \hat{A} - A \|_F^2 = \mathrm{Loss}(A, \hat{A}), \]

    as desired.


In the proof we use the fact that the distribution of the matrix $W$ does not change under left and right multiplications by orthogonal matrices. We will call such distributions orthogonally invariant.

Definition 2. A random $m \times n$ matrix $Z$ has an orthogonally invariant distribution if for any orthogonal matrices $U$ and $V$ of appropriate size the distribution of $U'ZV$ is the same as the distribution of $Z$.

In light of Proposition 1 it is natural to consider reconstruction schemes whose action does not change under orthogonal transformations of the reconstruction problem.

Definition 3. A reconstruction scheme $g(\cdot)$ is orthogonally invariant if for any $m \times n$ matrix $Y$, and any orthogonal matrices $U$ and $V$ of appropriate size, $g(U'YV) = U'g(Y)V$.

In general, a good reconstruction method need not be orthogonally invariant. For example, if the signal matrix $A$ is known to be diagonal, then for each $Y$ the estimate $g(Y)$ should be diagonal as well, and in this case $g(\cdot)$ is not orthogonally invariant. However, as we show in the next theorem, if we have no information about the singular vectors of $A$ (either prior information or information from the singular values of $A$), then it suffices to restrict our attention to orthogonally invariant reconstruction schemes.

Theorem 4. Let $Y = A + n^{-1/2}W$, where $A$ is a random target matrix. Assume that $A$ and $W$ are independent and have orthogonally invariant distributions. Then, for every reconstruction scheme $g(\cdot)$, there is an orthogonally invariant reconstruction scheme $\tilde{g}(\cdot)$ whose expected loss is the same, or smaller, than that of $g(\cdot)$.

Proof. Let $U$ be an $m \times m$ random matrix that is independent of $A$ and $W$, and is distributed according to Haar measure on the compact group of $m \times m$ orthogonal matrices. Haar measure is (uniquely) defined by the requirement that, for every $m \times m$ orthogonal matrix $C$, both $CU$ and $UC$ have the same distribution as $U$ (c.f. Hofmann and Morris, 2006). Let $V$ be an $n \times n$ random matrix distributed according to the Haar measure on the compact group of $n \times n$ orthogonal matrices that is independent of $A$, $W$ and $U$. Given a reconstruction scheme $g(\cdot)$, define a new scheme

\[ \tilde{g}(Y) = E\big[\, U\, g(U'YV)\, V' \,\big|\, Y \,\big]. \]


It follows from the definition of $U$ and $V$ that $\tilde{g}(\cdot)$ is orthogonally invariant. The independence of $\{U, V\}$ and $\{A, W\}$ ensures that conditioning on $Y$ is equivalent to conditioning on $\{A, W\}$, which yields the equivalent representation

\[ \tilde{g}(Y) = E\big[\, U\, g(U'YV)\, V' \,\big|\, A, W \,\big]. \]

Therefore,

\begin{align*}
E\, \mathrm{Loss}(A, \tilde{g}(Y)) &= E\, \big\| E[\, U\, g(U'YV)\, V' \,|\, A, W \,] - A \big\|_F^2 \\
&\le E\, \big\| U\, g(U'YV)\, V' - A \big\|_F^2 = E\, \big\| g(U'YV) - U'AV \big\|_F^2.
\end{align*}

The inequality follows from the conditional version of Jensen's inequality applied to each term in the sum defining the squared norm. The final equality follows from the orthogonality of $U$ and $V$. The last term in the previous display can be analyzed as follows:

\begin{align*}
E\, \big\| g(U'YV) - U'AV \big\|_F^2
&= E\, E\big[ \big\| g(U'AV + n^{-1/2}\, U'WV) - U'AV \big\|_F^2 \,\big|\, U, V, A \big] \\
&= E\, E\big[ \big\| g(U'AV + n^{-1/2}\, W) - U'AV \big\|_F^2 \,\big|\, U, V, A \big] \\
&= E\, \big\| g(U'AV + n^{-1/2}\, W) - U'AV \big\|_F^2.
\end{align*}

The first equality follows from the definition of $Y$; the second follows from the independence of $W$ and $U, A, V$, and the orthogonal invariance of $\mathcal{L}(W)$. By a similar argument using the orthogonal invariance of $\mathcal{L}(A)$, we have

\begin{align*}
E\, \big\| g(U'AV + n^{-1/2}\, W) - U'AV \big\|_F^2
&= E\, E\big[ \big\| g(U'AV + n^{-1/2}\, W) - U'AV \big\|_F^2 \,\big|\, U, V, W \big] \\
&= E\, E\big[ \big\| g(A + n^{-1/2}\, W) - A \big\|_F^2 \,\big|\, U, V, W \big] \\
&= E\, \big\| g(A + n^{-1/2}\, W) - A \big\|_F^2.
\end{align*}

The final term above is $E\, \mathrm{Loss}(A, g(Y))$. This completes the proof.

Based on Theorem 4 we will restrict our attention to orthogonally invariant reconstruction schemes in what follows.
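The averaging construction used in the proof can be mimicked numerically. The sketch below is a Monte Carlo illustration under our own naming (not the authors' code): it samples U and V from Haar measure via the QR decomposition of Gaussian matrices and approximates the symmetrized scheme defined above.

    import numpy as np

    def haar_orthogonal(k, rng):
        # Haar-distributed k x k orthogonal matrix via QR of a Gaussian matrix,
        # with column signs fixed so that the distribution is exactly Haar.
        Q, R = np.linalg.qr(rng.standard_normal((k, k)))
        return Q * np.sign(np.diag(R))

    def symmetrize(g, Y, n_draws=200, seed=0):
        # Monte Carlo approximation of g~(Y) = E[ U g(U'YV) V' | Y ].
        rng = np.random.default_rng(seed)
        m, n = Y.shape
        acc = np.zeros_like(Y, dtype=float)
        for _ in range(n_draws):
            U, V = haar_orthogonal(m, rng), haar_orthogonal(n, rng)
            acc += U @ g(U.T @ Y @ V) @ V.T
        return acc / n_draws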


As noted in the introduction, the singular value decomposition (SVD) of the observed matrix $Y$ is a natural starting point for reconstruction of a signal matrix $A$. As we show below, the SVD of $Y$ is intimately connected with orthogonally invariant reconstruction methods. An immediate consequence of the decomposition $Y = UDV'$ is that $U'YV = D$, so we can diagonalize $Y$ by means of left and right orthogonal multiplications.

The next proposition follows from our ability to diagonalize the signal matrix $A$ in the reconstruction problem.

Proposition 5. Let $Y = A + n^{-1/2}W$, where $W$ has an orthogonally invariant distribution. If $g(\cdot)$ is an orthogonally invariant reconstruction scheme, then for any fixed signal matrix $A$, the distribution of $\mathrm{Loss}(A, g(Y))$, and in particular $E\,\mathrm{Loss}(A, g(Y))$, depends only on the singular values of $A$.

Proof. Let $U D_A V'$ be the SVD of $A$. Then $D_A = U'AV$, and as the Frobenius norm is invariant under left and right orthogonal multiplications,

\begin{align*}
\mathrm{Loss}(A, g(Y)) &= \| g(Y) - A \|_F^2 = \| U'(g(Y) - A)V \|_F^2 \\
&= \| U'g(Y)V - U'AV \|_F^2 = \| g(U'YV) - D_A \|_F^2 \\
&= \| g(D_A + n^{-1/2}\, U'WV) - D_A \|_F^2.
\end{align*}

The result now follows from the fact that $U'WV$ has the same distribution as $W$.

We now address the implications of our ability to diagonalize the observed matrix $Y$. Let $g(\cdot)$ be an orthogonally invariant reconstruction method, and let $UDV'$ be the singular value decomposition of $Y$. It follows from the orthogonal invariance of $g(\cdot)$ that

\[ g(Y) = g(UDV') = U\, g(D)\, V' = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij}\, u_i v_j', \quad (4) \]

where the $c_{ij}$ depend only on the singular values of $Y$. In particular, any orthogonally invariant reconstruction method $g(\cdot)$ is completely determined by how it acts on diagonal matrices. The following theorem allows us to substantially refine the representation (4).

Theorem 6. Let $g(\cdot)$ be an orthogonally invariant reconstruction scheme. Then $g(Y)$ is diagonal whenever $Y$ is diagonal.


Proof. Assume without loss of generality that $m \ge n$. Let the observed matrix $Y = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, and let $\hat{A} = g(Y)$ be the reconstructed matrix. Fix a row index $1 \le k \le m$. We will show that $\hat{A}_{kj} = 0$ for all $j \ne k$. Let $D_L$ be an $m \times m$ matrix derived from the identity matrix by flipping the sign of the $k$th diagonal element. More formally, $D_L = I - 2 e_k e_k'$, where $e_k$ is the $k$th standard basis vector in $\mathbb{R}^m$. The matrix $D_L$ is known as a Householder reflection.

Let $D_R$ be the top left $n \times n$ submatrix of $D_L$. Clearly $D_L D_L' = I$ and $D_R D_R' = I$, so both $D_L$ and $D_R$ are orthogonal. Moreover, all three matrices $D_L$, $Y$, and $D_R$ are diagonal, and therefore we have the identity $Y = D_L Y D_R$. It then follows from the orthogonal invariance of $g(\cdot)$ that

\[ \hat{A} = g(Y) = g(D_L Y D_R) = D_L\, g(Y)\, D_R = D_L \hat{A} D_R. \]

The $(i, j)$th element of the matrix $D_L \hat{A} D_R$ is $\hat{A}_{ij}\, (-1)^{I(i=k)} (-1)^{I(j=k)}$, and therefore $\hat{A}_{kj} = -\hat{A}_{kj}$ if $j \ne k$. As $k$ was arbitrary, $\hat{A}$ is diagonal.

As an immediate corollary of Theorem 6 and equation (4) we obtain a compact, and useful, representation of any orthogonally invariant reconstruction scheme $g(\cdot)$.

Corollary 7. Let $g(\cdot)$ be an orthogonally invariant reconstruction method. If the observed matrix $Y$ has singular value decomposition $Y = \sum_j d_j\, u_j v_j'$ then the reconstructed matrix has the form

\[ \hat{A} = g(Y) = \sum_{j=1}^{m\wedge n} c_j\, u_j v_j', \quad (5) \]

where the coefficients $c_j$ depend only on the singular values of $Y$.

The converse of Corollary 7 is true under a mild additional condition. Let $g(\cdot)$ be a reconstruction scheme such that $g(Y) = \sum_j c_j\, u_j v_j'$, where $c_j = c_j(d_1, \ldots, d_{m\wedge n})$ are fixed functions of the singular values of $Y$. If the functions $\{c_j(\cdot)\}$ are such that $c_i(d) = c_j(d)$ whenever $d_i = d_j$, then $g(\cdot)$ is orthogonally invariant. This follows from the uniqueness of the singular value decomposition.
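In computational terms, Corollary 7 says that an orthogonally invariant scheme is specified entirely by a map from singular values to coefficients; a small numpy sketch (our notation, not the paper's) is:

    import numpy as np

    def apply_invariant_scheme(Y, coef_fn):
        # coef_fn maps the singular values (d_1 >= ... >= d_{m^n}) to coefficients (c_1, ..., c_{m^n});
        # per the condition above, equal singular values must receive equal coefficients.
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        return (U * coef_fn(d)) @ Vt

Hard and soft thresholding are recovered, for example, with coef_fn(d) = d * (d >= lam) and np.maximum(d - nu, 0).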


3 Asymptotic Matrix Reconstruction and Random Matrix Theory

Random matrix theory is broadly concerned with the spectral properties of random matrices, and is an obvious starting point for an analysis of matrix reconstruction. The matrix reconstruction problem has several points of intersection with random matrix theory. Recently a number of authors have studied low rank deformations of Wigner matrices (Capitaine et al., 2009; Feral and Peche, 2007; Mada, 2007; Peche, 2006). However, their results concern symmetric matrices, a constraint not present in the reconstruction model, and are not directly applicable to the reconstruction problem of interest here. (Indeed, our simulations of non-symmetric matrices exhibit behavior deviating from that predicted by the results of these papers.) A signal plus noise framework similar to matrix reconstruction is studied in Dozier and Silverstein (2007) and Nadakuditi and Silverstein (2007); however, both these papers model the signal matrix as random, while in the matrix reconstruction problem we assume it to be non-random. El Karoui (2008) considered the problem of estimating the eigenvalues of a population covariance matrix from a sample covariance matrix, which is similar to the problem of estimating the singular values of $A$ from the singular values of $Y$. However, for the matrix reconstruction problem it is equally important to estimate the difference between the singular vectors of $A$ and $Y$, in addition to the estimate of the singular values of $A$.

Our proposed denoising scheme is based on the theory of spiked population models in random matrix theory. Using recent results on spiked population models, we establish asymptotic connections between the singular values and vectors of the signal matrix $A$ and those of the observed matrix $Y$. These asymptotic connections provide us with finite-sample estimates that can be applied in a non-asymptotic setting to matrices of small or moderate dimensions.

    3.1 Asymptotic Matrix Reconstruction Model

The proposed reconstruction method is derived from an asymptotic version of the matrix reconstruction problem (1). For $n \ge 1$ let integers $m = m(n)$ be defined in such a way that

\[ \frac{m}{n} \to c > 0 \quad \text{as } n \to \infty. \quad (6) \]


For each $n$ let $Y$, $A$, and $W$ be $m \times n$ matrices such that

\[ Y = A + \frac{1}{\sqrt{n}}\, W, \quad (7) \]

where the entries of $W$ are independent $N(0, 1)$ random variables. We assume that the signal matrix $A$ has fixed rank $r \ge 0$ and fixed non-zero singular values $\sigma_1(A), \ldots, \sigma_r(A)$ that are independent of $n$. The constant $c$ represents the limiting aspect ratio of the observed matrices $Y$. The scale factor $n^{-1/2}$ ensures that the singular values of the signal matrix are comparable to those of the noise. We note that Model (7) matches the asymptotic model used by Capitaine et al. (2009) and Feral and Peche (2007) in their study of fixed rank perturbations of Wigner matrices.

In what follows $\sigma_j(B)$ will denote the $j$-th singular value of a matrix $B$, and $u_j(B)$ and $v_j(B)$ will denote, respectively, the left and right singular vectors corresponding to $\sigma_j(B)$. Our first proposition concerns the behavior of the singular values of $Y$ when the signal matrix $A$ is equal to zero.

Proposition 8. Under the asymptotic reconstruction model with $A = 0$ the empirical distribution of the singular values $\sigma_1(Y) \ge \cdots \ge \sigma_{m\wedge n}(Y)$ converges weakly to a (non-random) limiting distribution with density

\[ f_Y(s) = \frac{1}{\pi (c \wedge 1)\, s} \sqrt{(s^2 - a)(b - s^2)}, \quad s \in [\sqrt{a}, \sqrt{b}], \quad (8) \]

where $a = (1 - \sqrt{c})^2$ and $b = (1 + \sqrt{c})^2$. Moreover, $\sigma_1(Y) \xrightarrow{P} 1 + \sqrt{c}$ and $\sigma_{m\wedge n}(Y) \xrightarrow{P} |1 - \sqrt{c}|$ as $n$ tends to infinity.

The existence and form of the density $f_Y(\cdot)$ are a consequence of the classical Marcenko-Pastur theorem (Marcenko and Pastur, 1967; Wachter, 1978). The in-probability limits of $\sigma_1(Y)$ and $\sigma_{m\wedge n}(Y)$ follow from later work of Geman (1980) and Wachter (1978), respectively. If $c = 1$, the density function $f_Y(s)$ simplifies to the quarter-circle law $f_Y(s) = \frac{1}{\pi}\sqrt{4 - s^2}$ for $s \in [0, 2]$.
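For reference, the limiting density (8) is simple to evaluate numerically; the sketch below is our own code (not the paper's) and is reused informally later when fitting the noise level.

    import numpy as np

    def noise_singular_density(s, c):
        # Density (8) of the limiting noise singular-value distribution; c = m/n is the aspect ratio.
        s = np.asarray(s, dtype=float)
        a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
        out = np.zeros_like(s)
        inside = (s ** 2 > a) & (s ** 2 < b)
        out[inside] = np.sqrt((s[inside] ** 2 - a) * (b - s[inside] ** 2)) / (np.pi * min(c, 1.0) * s[inside])
        return out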

The next two results concern the limiting singular values and singular vectors of $Y$ when $A$ is non-zero. Proposition 9 relates the limiting singular values of $Y$ to the (fixed) singular values of $A$, while Proposition 10 relates the limiting singular vectors of $Y$ to the singular vectors of $A$. Proposition 9 is based on recent work of Baik and Silverstein (2006), while Proposition 10 is based on recent work of Paul (2007), Nadler (2008), and Lee et al. (2010). The proofs of both results are given in Section 5.5.


Proposition 9. Let $Y$ follow the asymptotic matrix reconstruction model (7) with signal singular values $\sigma_1(A) \ge \cdots \ge \sigma_r(A) > 0$. For $1 \le j \le r$, as $n$ tends to infinity,

\[
\sigma_j(Y) \xrightarrow{P}
\begin{cases}
\left( 1 + \sigma_j^2(A) + c + \dfrac{c}{\sigma_j^2(A)} \right)^{1/2} & \text{if } \sigma_j(A) > c^{1/4}, \\[6pt]
1 + \sqrt{c} & \text{if } 0 < \sigma_j(A) \le c^{1/4}.
\end{cases}
\]

The remaining singular values $\sigma_{r+1}(Y), \ldots, \sigma_{m\wedge n}(Y)$ of $Y$ are associated with the zero singular values of $A$: their empirical distribution converges weakly to the limiting distribution in Proposition 8.

Proposition 10. Let $Y$ follow the asymptotic matrix reconstruction model (7) with distinct signal singular values $\sigma_1(A) > \sigma_2(A) > \cdots > \sigma_r(A) > 0$. Fix $j$ such that $\sigma_j(A) > c^{1/4}$. Then as $n$ tends to infinity,

\[ \langle u_j(Y), u_j(A) \rangle^2 \xrightarrow{P} \left(1 - \frac{c}{\sigma_j^4(A)}\right) \Big/ \left(1 + \frac{c}{\sigma_j^2(A)}\right) \]

and

\[ \langle v_j(Y), v_j(A) \rangle^2 \xrightarrow{P} \left(1 - \frac{c}{\sigma_j^4(A)}\right) \Big/ \left(1 + \frac{1}{\sigma_j^2(A)}\right). \]

Moreover, if $k = 1, \ldots, r$ is not equal to $j$ then $\langle u_j(Y), u_k(A) \rangle \xrightarrow{P} 0$ and $\langle v_j(Y), v_k(A) \rangle \xrightarrow{P} 0$ as $n$ tends to infinity.

The limits established in Proposition 9 indicate a phase transition. If the singular value $\sigma_j(A)$ is less than or equal to $c^{1/4}$ then, asymptotically, the singular value $\sigma_j(Y)$ lies within the support of the Marcenko-Pastur distribution and is not distinguishable from the noise singular values. On the other hand, if $\sigma_j(A)$ exceeds $c^{1/4}$ then, asymptotically, $\sigma_j(Y)$ lies outside the support of the Marcenko-Pastur distribution, and the corresponding left and right singular vectors of $Y$ are associated with those of $A$ (Proposition 10).
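The limits in Propositions 9 and 10 are easy to tabulate; a short numpy sketch of the forward maps (our code, useful for checking simulations against the theory) is:

    import numpy as np

    def limiting_singular_value(sigma_A, c):
        # Proposition 9: limit in probability of sigma_j(Y) given sigma_j(A) and c = m/n.
        sigma_A = np.atleast_1d(np.asarray(sigma_A, dtype=float))
        out = np.full(sigma_A.shape, 1 + np.sqrt(c))
        above = sigma_A > c ** 0.25
        s2 = sigma_A[above] ** 2
        out[above] = np.sqrt(1 + s2 + c + c / s2)
        return out

    def limiting_cosines(sigma_A, c):
        # Proposition 10: limits of <u_j(Y), u_j(A)>^2 and <v_j(Y), v_j(A)>^2 (valid when sigma_j(A) > c**0.25).
        s2 = np.atleast_1d(np.asarray(sigma_A, dtype=float)) ** 2
        u2 = (1 - c / s2 ** 2) / (1 + c / s2)
        v2 = (1 - c / s2 ** 2) / (1 + 1 / s2)
        return u2, v2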

    4 Proposed Reconstruction Method

Assume for the moment that the variance $\sigma^2$ of the noise is known, and equal to one. Let $Y$ be an observed $m \times n$ matrix generated from the additive model


$Y = A + n^{-1/2}W$, and let

\[ Y = \sum_{j=1}^{m\wedge n} \sigma_j(Y)\, u_j(Y)\, v_j(Y)' \]

be the SVD of $Y$. Following the discussion in Section 2, we seek an estimate $\hat{A}$ of the signal matrix $A$ having the form

\[ \hat{A} = \sum_{j=1}^{m\wedge n} c_j\, u_j(Y)\, v_j(Y)', \]

where each coefficient $c_j$ depends only on the singular values $\sigma_1(Y), \ldots, \sigma_{m\wedge n}(Y)$ of $Y$. We derive $\hat{A}$ from the limiting relations in Propositions 9 and 10. By way of approximation, we treat these relations as exact in the non-asymptotic setting under study, using the symbols $\overset{l}{=}$, $\overset{l}{\le}$ and $\overset{l}{>}$ to denote limiting equality and inequality relations.

Suppose initially that the singular values and vectors of the signal matrix $A$ are known. In this case we wish to find coefficients $\{c_j\}$ minimizing

\[ \mathrm{Loss}(A, \hat{A}) = \Big\| \sum_{j=1}^{m\wedge n} c_j\, u_j(Y) v_j(Y)' - \sum_{j=1}^{r} \sigma_j(A)\, u_j(A) v_j(A)' \Big\|_F^2. \]

Proposition 9 shows that asymptotically the information about the singular values of $A$ that are smaller than $c^{1/4}$ is not recoverable from the singular values of $Y$. Thus we can restrict the first sum to the first $r_0 = \#\{j : \sigma_j(A) > c^{1/4}\}$ terms:

\[ \mathrm{Loss}(A, \hat{A}) = \Big\| \sum_{j=1}^{r_0} c_j\, u_j(Y) v_j(Y)' - \sum_{j=1}^{r} \sigma_j(A)\, u_j(A) v_j(A)' \Big\|_F^2. \]

Proposition 10 ensures that the left singular vectors $u_j(Y)$ and $u_k(A)$ are asymptotically orthogonal for $k = 1, \ldots, r$ not equal to $j = 1, \ldots, r_0$, and therefore

\[ \mathrm{Loss}(A, \hat{A}) \overset{l}{=} \sum_{j=1}^{r_0} \big\| c_j\, u_j(Y) v_j(Y)' - \sigma_j(A)\, u_j(A) v_j(A)' \big\|_F^2 + \sum_{j=r_0+1}^{r} \sigma_j^2(A). \]


Fix $1 \le j \le r_0$. Expanding the $j$-th term in the above sum gives

\begin{align*}
\big\| \sigma_j(A)\, u_j(A) v_j(A)' - c_j\, u_j(Y) v_j(Y)' \big\|_F^2
&= c_j^2\, \| u_j(Y) v_j(Y)' \|_F^2 + \sigma_j^2(A)\, \| u_j(A) v_j(A)' \|_F^2 \\
&\quad - 2\, c_j\, \sigma_j(A)\, \big\langle u_j(A) v_j(A)',\; u_j(Y) v_j(Y)' \big\rangle \\
&= \sigma_j^2(A) + c_j^2 - 2\, c_j\, \sigma_j(A)\, \langle u_j(A), u_j(Y) \rangle\, \langle v_j(A), v_j(Y) \rangle.
\end{align*}

Differentiating the last expression with respect to $c_j$ yields the optimal value

\[ c_j = \sigma_j(A)\, \langle u_j(A), u_j(Y) \rangle\, \langle v_j(A), v_j(Y) \rangle. \quad (9) \]

In order to estimate the coefficient $c_j$ we consider separately singular values of $Y$ that are at most, or greater than, $1 + \sqrt{c}$, where $c = m/n$ is the aspect ratio of $Y$. By Proposition 9, the asymptotic relation $\sigma_j(Y) \overset{l}{\le} 1 + \sqrt{c}$ implies $\sigma_j(A) \le c^{1/4}$, and in this case the $j$-th singular value of $A$ is not recoverable from $Y$. Thus if $\sigma_j(Y) \le 1 + \sqrt{c}$ we set the corresponding coefficient $c_j = 0$.

On the other hand, the asymptotic relation $\sigma_j(Y) \overset{l}{>} 1 + \sqrt{c}$ implies that $\sigma_j(A) > c^{1/4}$, and that each of the inner products in (9) is asymptotically positive. The displayed equations in Propositions 9 and 10 can then be used to obtain estimates of each term in (9) based only on the (observed) singular values of $Y$ and its aspect ratio $c$. These equations yield the following relations:

\[ \hat{\sigma}_j^2(A) = \frac{1}{2}\left[ \sigma_j^2(Y) - (1 + c) + \sqrt{\big(\sigma_j^2(Y) - (1 + c)\big)^2 - 4c} \right] \quad \text{estimates } \sigma_j^2(A), \]

\[ \hat{\theta}_j^2 = \left(1 - \frac{c}{\hat{\sigma}_j^4(A)}\right) \Big/ \left(1 + \frac{c}{\hat{\sigma}_j^2(A)}\right) \quad \text{estimates } \langle u_j(A), u_j(Y) \rangle^2, \]

\[ \hat{\varphi}_j^2 = \left(1 - \frac{c}{\hat{\sigma}_j^4(A)}\right) \Big/ \left(1 + \frac{1}{\hat{\sigma}_j^2(A)}\right) \quad \text{estimates } \langle v_j(A), v_j(Y) \rangle^2. \]

With these estimates in hand, the proposed reconstruction scheme is defined via the equation

\[ G^{RMT}_o(Y) = \sum_{j:\; \sigma_j(Y) > 1 + \sqrt{c}} \hat{\sigma}_j(A)\, \hat{\theta}_j\, \hat{\varphi}_j\; u_j(Y)\, v_j(Y)', \quad (10) \]


where $\hat{\sigma}_j(A)$, $\hat{\theta}_j$, and $\hat{\varphi}_j$ are the positive square roots of the estimates defined above.

The RMT method shares features with both hard and soft thresholding. It sets to zero singular values of $Y$ smaller than the threshold $(1 + \sqrt{c})$, and it shrinks the remaining singular values towards zero. However, unlike soft thresholding, the amount of shrinkage depends on the singular values: the larger singular values are shrunk less than the smaller ones. This latter feature is similar to that of LASSO type estimators based on an $L_q$ penalty with $0 < q < 1$.


$m \times n$ random matrices with orthogonally invariant distributions, and let $Y = A + \sigma n^{-1/2} W$ for some $\sigma$. Then $\tilde{\sigma}(Y)$ has the same expected value as $\hat{\sigma}(Y)$ and a smaller or equal variance.

Based on Propositions 12 and 13 we restrict our attention to estimates of $\sigma$ that depend only on the singular values of $Y$. It follows from Proposition 9 that the empirical distribution of the $(m \wedge n - r)$ singular values $S = \{\sigma_j(Y/\sigma) : \sigma_j(A) = 0\}$ converges weakly to a distribution with density (8) supported on the interval $[\,|1 - \sqrt{c}|,\; 1 + \sqrt{c}\,]$. Following the general approach outlined in Gyorfi et al. (1996), we estimate $\sigma$ by minimizing the Kolmogorov-Smirnov distance between the observed sample distribution of the singular values of $Y$ and that predicted by theory. Let $F$ be the CDF of the density (8). For each $\gamma > 0$ let $S_\gamma$ be the set of singular values $\sigma_j(Y)$ that fall in the interval $[\,\gamma|1 - \sqrt{c}|,\; \gamma(1 + \sqrt{c})\,]$, and let $\hat{F}_\gamma$ be the empirical CDF of $S_\gamma$. Then

\[ K(\gamma) = \sup_s \big| F(s/\gamma) - \hat{F}_\gamma(s) \big| \]

is the Kolmogorov-Smirnov distance between the theoretical and empirical singular value distribution functions, and we define

\[ \hat{\sigma}(Y) = \operatorname*{argmin}_{\gamma > 0} K(\gamma) \quad (12) \]

to be the value of $\gamma$ minimizing $K(\gamma)$. A routine argument shows that the estimator is scale invariant, in the sense that $\hat{\sigma}(\beta Y) = \beta\, \hat{\sigma}(Y)$ for each $\beta > 0$.

By considering the jump points of the empirical CDF $\hat{F}_\gamma(s)$, the supremum in $K(\gamma)$ simplifies to

\[ K(\gamma) = \max_{s_i \in S_\gamma} \left| F(s_i/\gamma) - \frac{i - 1/2}{|S_\gamma|} \right| + \frac{1}{2\,|S_\gamma|}, \]

where $\{s_i\}$ are the ordered elements of $S_\gamma$. The objective function $K(\gamma)$ is discontinuous at points where the set $S_\gamma$ changes, so we minimize it over a fine grid of points in the range where $|S_\gamma| > (m \wedge n)/2$ and $\gamma(1 + \sqrt{c}) < 2\, \sigma_1(Y)$. The closed form of the cumulative distribution function $F(\cdot)$ is presented in Section 5.4.
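A direct, if unoptimized, implementation of the estimator (12) follows. It is a sketch under our own naming: the theoretical CDF is obtained by numerically integrating the density (8) rather than from the closed form of Section 5.4, and the search grid is chosen crudely.

    import numpy as np

    def mp_cdf(x, c, grid_size=4000):
        # CDF F of the density (8), by numerical integration over its support.
        a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
        s = np.linspace(np.sqrt(a) + 1e-9, np.sqrt(b) - 1e-9, grid_size)
        dens = np.sqrt((s ** 2 - a) * (b - s ** 2)) / (np.pi * min(c, 1.0) * s)
        cdf = np.cumsum(dens) * (s[1] - s[0])
        cdf /= cdf[-1]
        return np.interp(x, s, cdf, left=0.0, right=1.0)

    def estimate_sigma(Y, n_grid=400):
        # Kolmogorov-Smirnov fit (12) of the noise level from the singular values of Y.
        m, n = Y.shape
        c = m / n
        lo, hi = abs(1 - np.sqrt(c)), 1 + np.sqrt(c)
        sv = np.sort(np.linalg.svd(Y, compute_uv=False))
        best_gamma, best_K = None, np.inf
        for gamma in np.linspace(1e-3 * sv[-1], 2 * sv[-1] / hi, n_grid):
            S = sv[(sv >= gamma * lo) & (sv <= gamma * hi)]
            if len(S) <= min(m, n) / 2:
                continue                  # grid restriction |S_gamma| > (m ^ n) / 2
            i = np.arange(1, len(S) + 1)
            K = np.max(np.abs(mp_cdf(S / gamma, c) - (i - 0.5) / len(S))) + 0.5 / len(S)
            if K < best_K:
                best_K, best_gamma = K, gamma
        return best_gamma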


    5 Simulations

We carried out a simulation study to evaluate the performance of the RMT reconstruction scheme $G^{RMT}(\cdot)$ defined in (11) using the variance estimate $\hat{\sigma}$ in (12). The study compared the performance of $G^{RMT}(\cdot)$ to three alternatives: the best hard thresholding reconstruction scheme, the best soft thresholding reconstruction scheme, and the best orthogonally invariant reconstruction scheme. Each of the three competing alternatives is an oracle-type procedure that is based on information about the signal matrix $A$ that is not available to $G^{RMT}(\cdot)$.

    5.1 Hard and Soft Thresholding Oracle Procedures

Hard and soft thresholding schemes require specification of a threshold parameter that can depend on the observed matrix $Y$. Estimation of the noise variance can be incorporated into the choice of the threshold parameter. In order to compare the performance of $G^{RMT}(\cdot)$ against every possible hard and soft thresholding scheme, we define oracle procedures

\[ G_H(Y) = g_H^{\lambda^*}(Y) \quad \text{where} \quad \lambda^* = \operatorname*{arg\,min}_{\lambda > 0} \| A - g_H^{\lambda}(Y) \|_F^2, \quad (13) \]

\[ G_S(Y) = g_S^{\nu^*}(Y) \quad \text{where} \quad \nu^* = \operatorname*{arg\,min}_{\nu > 0} \| A - g_S^{\nu}(Y) \|_F^2, \quad (14) \]

that make use of the signal $A$. By definition, the loss $\| A - G_H(Y) \|_F^2$ of $G_H(Y)$ is less than that of any hard thresholding scheme, and similarly the loss of $G_S(Y)$ is less than that of any soft thresholding procedure. In effect, the oracle procedures have access to both the unknown signal matrix $A$ and the unknown variance $\sigma$. They are constrained only by the form of their respective thresholding families. The oracle procedures are not realizable in practice.
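In a simulation, where $A$ is available, the oracle search can be emulated with a simple grid scan; the sketch below is ours and reuses the hard_threshold and soft_threshold sketches given earlier.

    import numpy as np

    def oracle_loss(A, Y, family, grid):
        # Best achievable loss ||A - g(Y)||_F^2 within a thresholding family, cf. (13)-(14).
        losses = [np.sum((A - family(Y, t)) ** 2) for t in grid]
        best = int(np.argmin(losses))
        return grid[best], losses[best]

    # Example usage (grid up to the spectral norm of Y):
    # t_star, loss_star = oracle_loss(A, Y, hard_threshold, np.linspace(0.0, np.linalg.norm(Y, 2), 200))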

    5.2 Orthogonally Invariant Oracle Procedure

As shown in Corollary 7, every orthogonally invariant reconstruction scheme $g(\cdot)$ has the form

\[ g(Y) = \sum_{j=1}^{m\wedge n} c_j\, u_j(Y)\, v_j(Y)', \]


where the coefficients $c_j$ are functions of the singular values of $Y$.

The orthogonally invariant oracle scheme has coefficients $c_j^o$ minimizing the loss

\[ \Big\| A - \sum_{j=1}^{m\wedge n} c_j\, u_j(Y)\, v_j(Y)' \Big\|_F^2 \]

over all choices of $c_j$. As in the case with the hard and soft thresholding oracle schemes, the coefficients $c_j^o$ depend on the signal matrix $A$, which in practice is unknown.

The (rank one) matrices $\{u_j(Y) v_j(Y)'\}$ form an orthonormal basis of an $(m \wedge n)$-dimensional subspace of the $mn$-dimensional space of all $m \times n$ matrices. Thus the optimal coefficient $c_j^o$ is simply the matrix inner product $\langle A,\, u_j(Y) v_j(Y)' \rangle$, and the orthogonally invariant oracle scheme has the form of a projection

\[ G^{*}(Y) = \sum_{j=1}^{m\wedge n} \langle A,\, u_j(Y) v_j(Y)' \rangle\; u_j(Y)\, v_j(Y)'. \quad (15) \]

By definition, for any orthogonally invariant reconstruction scheme $g(\cdot)$ and observed matrix $Y$, we have $\| A - G^{*}(Y) \|_F^2 \le \| A - g(Y) \|_F^2$.
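The projection (15) is equally direct to compute when $A$ is known, as in the simulations; a minimal sketch (ours, not the authors' code):

    import numpy as np

    def invariant_oracle(A, Y):
        # Orthogonally invariant oracle: c_j = <A, u_j v_j'> = u_j' A v_j.
        U, d, Vt = np.linalg.svd(Y, full_matrices=False)
        c = np.diag(U.T @ A @ Vt.T)
        return (U * c) @ Vt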

    5.3 Simulations

We compared the reconstruction schemes $G_H(Y)$, $G_S(Y)$, and $G^{RMT}(Y)$ to $G^{*}(Y)$ on a wide variety of signal matrices generated according to the model (1). As shown in Proposition 5, the distribution of the loss $\| A - g(Y) \|_F^2$ depends only on the singular values of $A$, so we considered only diagonal signal matrices. As the variance estimate $\hat{\sigma}$ used in $G^{RMT}(\cdot)$ is scale invariant, all simulations were run with noise of unit variance. (Estimation of the noise variance is not necessary for the oracle reconstruction schemes.)

    5.3.1 Square Matrices

Our initial simulations considered 1000 × 1000 square matrices. Signal matrices $A$ were generated using three parameters: the rank $r$; the largest singular value $\sigma_1(A)$; and the decay profile of the remaining singular values. We considered ranks $r \in \{1, 3, 10, 32, 100\}$, corresponding to successive powers of $\sqrt{10}$ up to $(m \wedge n)/10$, and maximum singular values $\sigma_1(A) \in \{0.9, 1, 1.1, \ldots, 10\} \cdot c^{1/4}$, falling below and above the critical threshold of $c^{1/4} = 1$. We considered several coefficient decay profiles: (i) all coefficients equal; (ii) linear decay to zero; (iii) linear decay to $\sigma_1(A)/2$; and (iv) exponential decay as powers of 0.5, 0.7, 0.9, 0.95, or 0.99. Independent noise matrices $W$ were generated for each signal matrix $A$. All reconstruction schemes were then applied to the resulting matrix $Y = A + n^{-1/2}W$. The total number of generated signal matrices was 3,680.
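A schematic of one simulated instance, under assumptions matching the description above (the decay profiles are paraphrased and the scaffolding is ours, not the authors' code):

    import numpy as np

    def simulate_instance(m, n, r, sigma_max, decay, rng):
        # Diagonal signal with r nonzero singular values plus unit-variance noise scaled by 1/sqrt(n).
        if decay == "equal":
            sv = np.full(r, sigma_max)
        elif decay == "linear":
            sv = np.linspace(sigma_max, sigma_max / r, r)   # one possible linear decay profile
        else:
            sv = sigma_max * float(decay) ** np.arange(r)   # exponential decay by a factor in (0, 1)
        A = np.zeros((m, n))
        A[np.arange(r), np.arange(r)] = sv
        Y = A + rng.standard_normal((m, n)) / np.sqrt(n)
        return A, Y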

Figure 3: Relative performance of the soft and hard thresholding oracles against the orthogonally invariant oracle for 1000 × 1000 matrices. (Each panel plots the thresholding oracle loss against the orthogonally invariant oracle loss on logarithmic axes.)

Figures 3 and 4 illustrate, respectively, the loss of the best soft thresholding, best hard thresholding and RMT reconstruction methods (y axis) relative to the best orthogonally invariant scheme (x axis). In each case the diagonal represents the performance of the orthogonally invariant oracle: points farther from the diagonal represent worse performance. The plots show clearly that $G^{RMT}(\cdot)$ outperforms the oracle schemes $G_H(\cdot)$ and $G_S(\cdot)$, and has performance comparable to that of the orthogonally invariant oracle. In particular, $G^{RMT}(\cdot)$ outperforms any hard or soft thresholding scheme, even if the latter schemes have access to the unknown variance $\sigma$ and the signal matrix $A$.

In order to summarize the results of our simulations, for each scheme $G(\cdot)$ and for each matrix $Y$ generated from a signal matrix $A$ we calculated the


Figure 4: Relative performance of the RMT method and the orthogonally invariant oracle method for 1000 × 1000 matrices. (The panel plots the RMT method loss against the orthogonally invariant oracle loss on logarithmic axes.)

relative excess loss of $G(\cdot)$ with respect to $G^{*}(\cdot)$:

\[ \mathrm{REL}(A, G(Y)) = \frac{\mathrm{Loss}(A, G(Y))}{\mathrm{Loss}(A, G^{*}(Y))} - 1. \quad (16) \]

The definition of $G^{*}(\cdot)$ ensures that the relative excess loss is non-negative. The average REL of $G_S(\cdot)$, $G_H(\cdot)$, and $G^{RMT}(\cdot)$ across the 3,680 simulated 1000 × 1000 matrices was 68.3%, 18.3%, and 0.61%, respectively. Table 1 summarizes these results, and the results of analogous simulations carried out on square matrices of different dimensions. The table clearly shows the strong performance of the RMT method for matrices with at least 50 rows and columns. Even for $m = n = 50$, the average relative excess loss of the RMT method is less than half of those of the oracle soft and hard thresholding methods.

    5.3.2 Rectangular Matrices

We performed simulations for rectangular matrices of different dimensions $m$, $n$ and different aspect ratios $c = m/n$. For each choice of dimensions $m$, $n$ we simulated target matrices using the same rules as in the square case: rank $r \in \{1, 3, 10, 32, \ldots\}$ not exceeding $(m \wedge n)/10$, maximum singular values $\sigma_1(A) \in \{0.9, 1, 1.1, \ldots, 10\} \cdot c^{1/4}$, and coefficient decay profiles like those above.


Table 1: Average relative excess losses of oracle soft thresholding, oracle hard thresholding and the proposed RMT reconstruction method for square matrices of different dimensions.

    Matrix size (square)      2000     1000     500      100      50
    Scheme   G_S(·)           0.740    0.683    0.694    0.611    0.640
             G_H(·)           0.182    0.183    0.178    0.179    0.176
             G_RMT(·)         0.003    0.006    0.008    0.029    0.071

A summary of the results is given in Table 2, which shows the average REL for matrices with 2000 rows and 10 to 2000 columns. Although the random matrix theory used to construct the RMT scheme requires $m$ and $n$ to tend to infinity at the same rate, the numbers in Table 2 clearly show that the performance of the RMT scheme is excellent even for small $n$, where the average REL ranges between 0.3% and 0.54%. The average REL of soft and hard thresholding is above 18% in each of the simulations.

Table 2: Average relative excess loss of oracle soft thresholding, oracle hard thresholding, and RMT reconstruction schemes for matrices with different dimensions and aspect ratios.

    Matrix    m               2000     2000     2000     2000     2000     2000
    size      n               2000     1000     500      100      50       10
    Method   G_S(·)           0.740    0.686    0.653    0.442    0.391    0.243
             G_H(·)           0.182    0.188    0.198    0.263    0.292    0.379
             G_RMT(·)         0.003    0.004    0.004    0.004    0.004    0.005

    Acknowledgments

This work was supported, in part, by grants from the US EPA (RD832720 and RD833825) and a grant from the NSF (DMS-0907177).

The authors would like to thank Florentina Bunea and Marten Wegkamp for helpful discussions.


    Appendix

5.4 Cumulative Distribution Function for Variance Estimation

The cumulative distribution function $F(\cdot)$ is calculated as the integral of $f_{n^{-1/2}W}(s)$. For $c = 1$ ($a = 0$, $b = 4$) it is a common integral:

\[ F(x) = \int_{\sqrt{a}}^{x} f(s)\, ds = \frac{1}{\pi} \int_{0}^{x} \sqrt{b - s^2}\, ds = \frac{1}{2\pi}\left[ x\sqrt{4 - x^2} + 4 \arcsin\frac{x}{2} \right]. \]

For $c \ne 1$ the calculations are more complicated. First we perform the change of variables $t = s^2$, which yields

\[ F(x) = \int_{\sqrt{a}}^{x} f(s)\, ds = C \int_{\sqrt{a}}^{x} s^{-2} \sqrt{(b - s^2)(s^2 - a)}\, ds^2 = C \int_{a}^{x^2} t^{-1} \sqrt{(b - t)(t - a)}\, dt, \]

where $C = 1/(2\pi (c \wedge 1))$. Next we perform a change of variables $y = t - [a + b]/2$ to make the expression in the square root look like $h^2 - y^2$, giving

\begin{align*}
F(x) &= C \int_{-[b-a]/2}^{x^2 - [a+b]/2} \frac{\sqrt{([b-a]/2 - y)\,(y + [b-a]/2)}}{y + [a+b]/2}\, dy \\
&= C \int_{-2\sqrt{c}}^{x^2 - (1+c)} \frac{\sqrt{4c - y^2}}{y + 1 + c}\, dy.
\end{align*}

The second equality above uses the fact that $a + b = 2(1 + c)$ and $b - a = 4\sqrt{c}$. The simple change of variables $y = 2\sqrt{c}\, z$ is performed next to make the numerator $\sqrt{1 - z^2}$:

\[ F(x) = \frac{\sqrt{c}}{\pi (c \wedge 1)} \int_{-1}^{[x^2 - (1+c)]/2\sqrt{c}} \frac{\sqrt{1 - z^2}}{z + (1 + c)/2\sqrt{c}}\, dz. \]

Next, the formula

\[ \int \frac{\sqrt{1 - z^2}}{z + q}\, dz = \sqrt{1 - z^2} + q \arcsin(z) - \sqrt{q^2 - 1}\, \arctan\frac{q z + 1}{\sqrt{(q^2 - 1)(1 - z^2)}} \]


is applied to find the closed form of $F(x)$ by substituting $z = [x^2 - (1 + c)]/2\sqrt{c}$ and $q = (1 + c)/2\sqrt{c}$. The final expression can be simplified using $\sqrt{q^2 - 1} = \sqrt{[(1 + c)/2\sqrt{c}]^2 - 1} = |1 - c|/2\sqrt{c}$.

5.5 Limit Theorems for the Asymptotic Matrix Reconstruction Problem

Propositions 9 and 10 in Section 3 provide an asymptotic connection between the singular values and singular vectors of the signal matrix $A$ and those of the observed matrix $Y$. Each proposition is derived from recent work in random matrix theory on spiked population models. Spiked population models were introduced by Johnstone (2001).

    5.5.1 The Spiked Population Model

The spiked population model is formally defined as follows. Let $r \ge 1$ and constants $\ell_1 \ge \cdots \ge \ell_r > 1$ be given, and for $n \ge 1$ let integers $m = m(n)$ be defined in such a way that

\[ \frac{m}{n} \to c > 0 \quad \text{as } n \to \infty. \quad (17) \]

For each $n$ let

\[ T = \mathrm{diag}(\ell_1, \ldots, \ell_r, 1, \ldots, 1) \]

be an $m \times m$ diagonal matrix (with $m = m(n)$), and let $X$ be an $m \times n$ matrix with independent $N_m(0, T)$ columns. Let $\hat{T} = n^{-1} X X'$ be the sample covariance matrix of $X$.

The matrix $X$ appearing in the spiked population model may be decomposed as a sum of matrices that parallel those in the matrix reconstruction problem. In particular, $X$ can be represented as a sum

\[ X = X_1 + Z, \quad (18) \]

where $X_1$ has independent $N_m(0, T - I)$ columns, $Z$ has independent $N(0, 1)$ entries, and $X_1$ and $Z$ are independent. It follows from the definition of $T$ that

\[ (T - I) = \mathrm{diag}(\ell_1 - 1, \ldots, \ell_r - 1, 0, \ldots, 0), \]


and therefore the entries in rows $r + 1, \ldots, m$ of $X_1$ are equal to zero. Thus, the sample covariance matrix $\hat{T}_1 = n^{-1} X_1 X_1'$ of $X_1$ has the simple block form

\[ \hat{T}_1 = \begin{pmatrix} \hat{T}_{11} & 0 \\ 0 & 0 \end{pmatrix}, \]

where $\hat{T}_{11}$ is an $r \times r$ matrix equal to the sample covariance of the first $r$ rows of $X_1$. It is clear from the block structure that the first $r$ eigenvalues of $\hat{T}_1$ are equal to the eigenvalues of $\hat{T}_{11}$, and that the remaining $(m - r)$ eigenvalues of $\hat{T}_1$ are equal to zero. The size of $\hat{T}_{11}$ is fixed, and therefore as $n$ tends to infinity, its entries converge in probability to those of $\mathrm{diag}(\ell_1 - 1, \ldots, \ell_r - 1)$. In particular,

\[ \Big\| \frac{1}{n} X_1 X_1' - (T - I) \Big\|_F^2 \xrightarrow{P} 0. \quad (19) \]

Consequently, for each $j = 1, \ldots, r$, as $n$ tends to infinity

\[ \sigma_j^2(n^{-1/2} X_1) = \lambda_j(\hat{T}_1) = \lambda_j(\hat{T}_{11}) \xrightarrow{P} \ell_j - 1 \quad (20) \]

and

\[ \langle u_j(\hat{T}_{11}), e_j \rangle^2 \xrightarrow{P} 1, \quad (21) \]

where $e_j$ is the $j$-th canonical basis element in $\mathbb{R}^r$. An easy argument shows that $u_j(n^{-1/2} X_1) = u_j(\hat{T}_1)$, and it then follows from (21) that

\[ \langle u_j(n^{-1/2} X_1), e_j \rangle^2 \xrightarrow{P} 1, \quad (22) \]

where $e_j$ is the $j$-th canonical basis element in $\mathbb{R}^m$.

    5.5.2 Proof of Proposition 9

Proposition 9 is derived from existing results on the limiting eigenvalues of $\hat{T}$ in the spiked population model. These results are summarized in the following theorem, which is a combination of Theorems 1.1, 1.2 and 1.3 in Baik and Silverstein (2006).

Theorem A. If $\hat{T}$ is derived from the spiked population model with parameters $\ell_1, \ldots, \ell_r > 1$, then for $j = 1, \ldots, r$, as $n \to \infty$,

\[
\lambda_j(\hat{T}) \xrightarrow{P}
\begin{cases}
\ell_j + \dfrac{c\, \ell_j}{\ell_j - 1} & \text{if } \ell_j > 1 + \sqrt{c}, \\[6pt]
(1 + \sqrt{c})^2 & \text{if } 1 < \ell_j \le 1 + \sqrt{c}.
\end{cases}
\]


The remaining sample eigenvalues $\lambda_{r+1}(\hat{T}), \ldots, \lambda_{m\wedge n}(\hat{T})$ are associated with the unit eigenvalues of $T$; their empirical distribution converges weakly to the Marcenko-Pastur distribution.

We also require the following inequality of Mirsky (1960).

Theorem B. If $B$ and $C$ are $m \times n$ matrices then

\[ \sum_{j=1}^{m\wedge n} \big[ \sigma_j(C) - \sigma_j(B) \big]^2 \le \| C - B \|_F^2. \]

Proof of Proposition 9. Fix $n \ge 1$ and let $Y$ follow the asymptotic reconstruction model (7), where the signal matrix $A$ has fixed rank $r$ and non-zero singular values $\sigma_1(A), \ldots, \sigma_r(A)$. Based on the orthogonal invariance of the matrix reconstruction problem, without loss of generality, we will assume that the signal matrix $A = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A), 0, \ldots, 0)$.

We begin by considering a spiked population model whose parameters match those of the matrix reconstruction model. Let $X$ have the same dimensions as $Y$ and be derived from a spiked population model with covariance matrix $T$ having $r$ non-unit eigenvalues

\[ \ell_j = \sigma_j^2(A) + 1, \quad j = 1, \ldots, r. \quad (23) \]

As noted above, we may represent $X$ as $X = X_1 + Z$, where $X_1$ has independent $N_m(0, T - I)$ columns, $Z$ has independent $N(0, 1)$ entries and $X_1$ is independent of $Z$. Recall that the limiting relations (19)-(22) hold for this representation.

The matrix reconstruction problem and spiked population model may be coupled in a natural way. Let random orthogonal matrices $U_1$ and $V_1$ be defined for each sample point in such a way that $U_1 D_1 V_1'$ is the SVD of $X_1$. By construction, the matrices $U_1$, $V_1$ depend only on $X_1$, and are therefore independent of $Z$. Consequently $U_1' Z V_1$ has the same distribution as $Z$. If we define $\tilde{W} = U_1' Z V_1$, then $\tilde{Y} = A + n^{-1/2} \tilde{W}$ has the same distribution as the observed matrix $Y$ in the matrix reconstruction problem.

We apply Mirsky's theorem with $B = \tilde{Y}$ and $C = n^{-1/2}\, U_1' X V_1$ in order to


bound the difference between the singular values of $\tilde{Y}$ and those of $n^{-1/2} X$:

\begin{align*}
\sum_{j=1}^{m\wedge n} \big[ \sigma_j(n^{-1/2} X) - \sigma_j(\tilde{Y}) \big]^2
&\le \big\| n^{-1/2}\, U_1' X V_1 - \tilde{Y} \big\|_F^2 \\
&= \big\| (n^{-1/2}\, U_1' X_1 V_1 - A) + n^{-1/2}(U_1' Z V_1 - \tilde{W}) \big\|_F^2 \\
&= \big\| n^{-1/2}\, U_1' X_1 V_1 - A \big\|_F^2 \\
&= \sum_{j=1}^{m\wedge n} \big[ \sigma_j(n^{-1/2}\, U_1' X_1 V_1) - \sigma_j(A) \big]^2 \\
&= \sum_{j=1}^{m\wedge n} \big[ \sigma_j(n^{-1/2} X_1) - \sigma_j(A) \big]^2.
\end{align*}

The first inequality above follows from Mirsky's theorem and the fact that the singular values of $n^{-1/2} X$ and $n^{-1/2}\, U_1' X V_1$ are the same, even though $U_1$ and $V_1$ may not be independent of $X$. The next two equalities follow by expanding $X$ and $\tilde{Y}$, and the fact that $\tilde{W} = U_1' Z V_1$. The third equality is a consequence of the fact that both $U_1' X_1 V_1$ and $A$ are diagonal, and the final equality follows from the equality of the singular values of $X_1$ and $U_1' X_1 V_1$.

In conjunction with (20) and (23), the last display implies that

\[ \sum_j \big[ \sigma_j(n^{-1/2} X) - \sigma_j(\tilde{Y}) \big]^2 \xrightarrow{P} 0. \]

Thus the distributional and limit results for the eigenvalues of $\hat{T} = n^{-1} X X'$ hold also for the eigenvalues of $\tilde{Y}\tilde{Y}'$, and therefore for $YY'$ as well. The relation $\sigma_j(Y) = \lambda_j^{1/2}(YY')$ completes the proof.

    5.5.3 Proof of Proposition 10

Proposition 10 may be derived from existing results on the limiting eigenvectors of the sample covariance matrix $\hat{T}$ in the spiked population model. These results are summarized in Theorem C below. The result was first established for Gaussian models and aspect ratios $0 < c < 1$ by Paul (2007). Nadler (2008) extended Paul's results to $c > 0$. Recently Lee et al. (2010) further extended the theorem to $c \ge 0$ and non-Gaussian models.


Theorem C. If $\hat{T}$ is derived from the spiked population model with distinct parameters $\ell_1 > \cdots > \ell_r > 1$, then for $1 \le j \le r$,

\[
\langle u_j(\hat{T}), u_j(T) \rangle^2 \xrightarrow{P}
\begin{cases}
\left(1 - \dfrac{c}{(\ell_j - 1)^2}\right) \Big/ \left(1 + \dfrac{c}{\ell_j - 1}\right) & \text{if } \ell_j > 1 + \sqrt{c}, \\[6pt]
0 & \text{if } 1 < \ell_j \le 1 + \sqrt{c}.
\end{cases}
\]

Moreover, for $\ell_j > 1 + \sqrt{c}$ and $k \ne j$ such that $1 \le k \le r$ we have $\langle u_j(\hat{T}), u_k(T) \rangle^2 \xrightarrow{P} 0$.

Although the last result is not explicitly stated in Paul (2007), it follows immediately from the central limit theorem for eigenvectors (Theorem 5, Paul, 2007).

We also require the following result, which is a special case of an inequality of Wedin (Wedin, 1972; Stewart, 1991).

Theorem D. Let $B$ and $C$ be $m \times n$ matrices and let $1 \le j \le m \wedge n$. If the $j$-th singular value of $C$ is separated from the singular values of $B$ and bounded away from zero, in the sense that for some $\delta > 0$

\[ \min_{k \ne j} |\sigma_j(C) - \sigma_k(B)| > \delta \quad \text{and} \quad \sigma_j(C) > \delta, \]

then

\[ \langle u_j(B), u_j(C) \rangle^2 + \langle v_j(B), v_j(C) \rangle^2 \ge 2 - \frac{2\, \| B - C \|_F^2}{\delta^2}. \]

Proof of Proposition 10. Fix $n \ge 1$ and let $Y$ follow the asymptotic reconstruction model (7), where the signal matrix $A$ has fixed rank $r$ and non-zero singular values $\sigma_1(A), \ldots, \sigma_r(A)$. Assume without loss of generality that $A = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A), 0, \ldots, 0)$.

We consider a spiked population model whose parameters match those of the matrix reconstruction problem and couple it with the matrix reconstruction model exactly as in the proof of Proposition 9. In particular, the quantities $\ell_j, T, X, X_1, Z, U_1, V_1, \tilde{W}$, and $\tilde{Y}$ are as in the proof of Proposition 9 and the preceding discussion.

Fix an index $j$ such that $\sigma_j(A) > c^{1/4}$ and thus $\ell_j > 1 + \sqrt{c}$. We apply Wedin's theorem with $B = \tilde{Y}$ and $C = n^{-1/2}\, U_1' X V_1$. There is $\delta > 0$ such that both conditions of Wedin's theorem are satisfied for the given $j$ with probability converging to 1 as $n \to \infty$. The precise choice of $\delta$ is presented


at the end of this proof. It follows from Wedin's theorem and the inequality $\langle v_j(B), v_j(C) \rangle^2 \le 1$ that

\[ \langle u_j(\tilde{Y}),\; u_j(n^{-1/2}\, U_1' X V_1) \rangle^2 = \langle u_j(B), u_j(C) \rangle^2 \ge 1 - \frac{2\, \| B - C \|_F^2}{\delta^2}. \]

It is shown in the proof of Proposition 9 that $\| B - C \|_F^2 = \| n^{-1/2}\, U_1' X V_1 - \tilde{Y} \|_F^2 \xrightarrow{P} 0$ as $n \to \infty$. Substituting $u_j(n^{-1/2}\, U_1' X V_1) = U_1'\, u_j(X)$ then yields

\[ \langle u_j(\tilde{Y}),\; U_1'\, u_j(X) \rangle^2 \xrightarrow{P} 1. \quad (24) \]

Fix $k = 1, \ldots, r$. As $\ell_j > 1 + \sqrt{c}$, Theorem C shows that $\langle u_j(\hat{T}), e_k \rangle^2$ has a non-random limit in probability, which we will denote by $\theta^2_{jk}$. The relation $\ell_j = \sigma_j^2(A) + 1$ implies that $\theta^2_{jk} = [1 - c\, \sigma_j^{-4}(A)]/[1 + c\, \sigma_j^{-2}(A)]$ if $j = k$, and $\theta^2_{jk} = 0$ otherwise. As $u_j(\hat{T}) = u_j(X)$, it follows that $\langle u_j(X), e_k \rangle^2 \xrightarrow{P} \theta^2_{jk}$. Recall that the matrix $U_1$ consists of the left singular vectors of $X_1$, i.e. $u_k(X_1) = U_1 e_k$. It is shown in (22) that $\langle U_1 e_k, e_k \rangle^2 = \langle u_k(X_1), e_k \rangle^2 \xrightarrow{P} 1$, so we can replace $e_k$ by $U_1 e_k$ in the previous display to obtain

\[ \langle u_j(X),\; U_1 e_k \rangle^2 \xrightarrow{P} \theta^2_{jk}. \]

It then follows from the basic properties of inner products that $\langle U_1'\, u_j(X),\; e_k \rangle^2 \xrightarrow{P} \theta^2_{jk}$. Using the result (24) of Wedin's theorem we may replace the left term in the inner product by $u_j(\tilde{Y})$, which yields

\[ \langle u_j(\tilde{Y}), e_k \rangle^2 \xrightarrow{P} \theta^2_{jk}. \]

As $A = \mathrm{diag}(\sigma_1(A), \ldots, \sigma_r(A), 0, \ldots, 0)$ we have $e_k = u_k(A)$. By construction the matrix $\tilde{Y}$ has the same distribution as $Y$, so it follows from the last display that

\[ \langle u_j(Y), u_k(A) \rangle^2 \xrightarrow{P} \theta^2_{jk}, \]

which is equivalent to the statement of Proposition 10 for the left singular vectors. The statement for the right singular vectors follows from consideration of the transposed reconstruction problem.


Now we find such $\delta > 0$ that for the fixed $j$ the conditions of Wedin's theorem are satisfied with probability going to 1. It follows from Proposition 9 that for $k = 1, \ldots, r$ the $k$-th singular value of $\tilde{Y}$ has a non-random limit in probability

\[ \gamma_k = \lim \sigma_k(n^{-1/2} X) = \lim \sigma_k(\tilde{Y}). \]

Let $r_0$ be the number of singular values of $A$ such that $\sigma_k(A) > c^{1/4}$ (i.e. the inequality holds only for $k = 1, \ldots, r_0$). It follows from the formula for $\gamma_k$ that $\gamma_k > 1 + \sqrt{c}$ for $k = 1, \ldots, r_0$. Note also that in this case $\gamma_k$ is a strictly increasing function of $\sigma_k(A)$. All non-zero $\sigma_j(A)$ are distinct by assumption, so all $\gamma_k$ are distinct for $k = 1, \ldots, r_0$. Note that $\gamma_{r_0+1} = 1 + \sqrt{c}$ is smaller than $\gamma_{r_0}$. Thus the limits of the first $r_0$ singular values of $\tilde{Y}$ are not only distinct, they are bounded away from all other singular values. Define

\[ \delta = \frac{1}{3} \min_{k=1,\ldots,r_0} (\gamma_k - \gamma_{k+1}) > 0. \]

For any $k = 1, \ldots, r_0 + 1$ the following inequalities are satisfied with probability going to 1 as $n \to \infty$:

\[ |\sigma_k(\tilde{Y}) - \gamma_k| < \delta \quad \text{and} \quad |\sigma_k(n^{-1/2} X) - \gamma_k| < \delta. \quad (25) \]

In applying Wedin's theorem to $B = \tilde{Y}$ and $C = n^{-1/2}\, U_1' X V_1$ we must verify that for any $j = 1, \ldots, r_0$ its two conditions are satisfied with probability going to 1. The first condition is $\sigma_j(C) > \delta$. When inequalities (25) hold,

\[ \sigma_j(C) = \sigma_j(n^{-1/2}\, U_1' X V_1) = \sigma_j(n^{-1/2} X) > \gamma_j - \delta \ge (\gamma_j - \gamma_{j+1}) - \delta \ge 3\delta - \delta = 2\delta, \quad (26) \]

so the first condition is satisfied with probability going to 1. The second condition is $|\sigma_j(C) - \sigma_k(B)| > \delta$ for all $k \ne j$. It is sufficient to check the condition for $k = 1, \ldots, r_0 + 1$ as asymptotically $\sigma_j(C) > \sigma_{r_0+1}(B)$. From the definition of $\delta$ and the triangle inequality we get

\[ |\sigma_j(C) - \sigma_k(B)| \ge |\gamma_j - \gamma_k| - |\sigma_j(C) - \gamma_j| - |\sigma_k(B) - \gamma_k| > 3\delta - \delta - \delta = \delta, \]

so the second condition also holds with probability going to 1.


    References

Alter, O., Brown, P., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences 97, 18, 10101.

Baik, J. and Silverstein, J. 2006. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis 97, 6, 1382–1408.

Bunea, F., She, Y., and Wegkamp, M. 2010. Adaptive rank penalized estimators in multivariate regression. Arxiv preprint arXiv:1004.2995.

Candes, E. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 1–56.

Candes, E., Romberg, J., and Tao, T. 2006. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52, 2, 489–509.

Capitaine, M., Donati-Martin, C., and Feral, D. 2009. The largest eigenvalue of finite rank deformation of large Wigner matrices: convergence and non-universality of the fluctuations. The Annals of Probability 37, 1, 1–47.

Donoho, D. 2006. Compressed sensing. IEEE Transactions on Information Theory 52, 4, 1289–1306.

Dozier, R. and Silverstein, J. 2007. On the empirical distribution of eigenvalues of large dimensional information-plus-noise-type matrices. Journal of Multivariate Analysis 98, 4, 678–694.

El Karoui, N. 2008. Spectrum estimation for large dimensional covariance matrices using random matrix theory. The Annals of Statistics 36, 6, 2757–2790.

Feral, D. and Peche, S. 2007. The largest eigenvalue of rank one deformation of large Wigner matrices. Communications in Mathematical Physics 272, 1, 185–228.

Fu, W. 1998. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7, 3, 397–416.

Geman, S. 1980. A limit theorem for the norm of random matrices. The Annals of Probability 8, 2, 252–261.

Gyorfi, L., Vajda, I., and Van Der Meulen, E. 1996. Minimum Kolmogorov distance estimates of parameters and parametrized distributions. Metrika 43, 1, 237–255.

Hofmann, K. and Morris, S. 2006. The Structure of Compact Groups: A Primer for the Student, a Handbook for the Expert. Walter de Gruyter.

Holter, N., Mitra, M., Maritan, A., Cieplak, M., Banavar, J., and Fedoroff, N. 2000. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proceedings of the National Academy of Sciences 97, 15, 8409.

Johnstone, I. 2001. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics 29, 2, 295–327.

Konstantinides, K., Natarajan, B., and Yovanof, G. 1997. Noise estimation and filtering using block-based singular value decomposition. IEEE Transactions on Image Processing 6, 3, 479–483.

Lee, S., Zou, F., and Wright, F. A. 2010. Convergence and prediction of principal component scores in high-dimensional settings. The Annals of Statistics.

Mada, M. 2007. Large deviations for the largest eigenvalue of rank one deformations of Gaussian ensembles. The Electronic Journal of Probability 12, 1131–1150.

Marcenko, V. and Pastur, L. 1967. Distribution of eigenvalues for some sets of random matrices. USSR Sbornik: Mathematics 1, 4, 457–483.

Mirsky, L. 1960. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics 11, 1, 50.

Nadakuditi, R. and Silverstein, J. 2007. Fundamental limit of sample eigenvalue based detection of signals in colored noise using relatively few samples. Signals, Systems and Computers, 686–690.
