Randomized Algorithms for Low-Rank Matrix Decomposition. Ben Sapp, WPE II. May 6, 2011.

  • Slide 1
  • Randomized Algorithms for Low-Rank Matrix Decomposition. Ben Sapp, WPE II. May 6, 2011.
  • Slide 2
  • Low-Rank Decomposition: The Goal Approximate an m x n matrix A by a product of two thin matrices, A ≈ L R^T, where L is m x k and R is n x k, with k << min(m, n).
  • Slide 3
  • Advantages Requires mk + nk numbers to represent the matrix instead of mn, i.e., compression. Fewer numbers = less storage space and faster matrix multiplication. In many applications, it exposes the structure of the data.
  • Slide 4
  • Exposing structure: an example Let's say your matrix is full of points in n-dimensional space which have some underlying linear structure. This data can be described almost exactly in 2 dimensions, i.e., by a rank-2 matrix: one factor describes the 2 axes in R^n, the other gives the coefficients (embedding) of the points in that basis. A small sketch of this appears below.
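    A minimal Matlab sketch (my own illustration, not from the slides): generate points in n = 50 dimensions that actually lie near a 2-dimensional subspace, then recover them with a rank-2 truncated SVD.

      % points with underlying 2-dimensional linear structure
      rng(0);
      n = 50;  m = 500;  k = 2;
      axes2  = orth(randn(n,k));             % description of the 2 axes in R^n
      coeffs = randn(m,k);                   % coefficients (embedding) of the points
      A = coeffs*axes2' + 0.01*randn(m,n);   % m slightly noisy points, one per row

      [U,S,V] = svd(A,'econ');
      A2 = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';    % rank-2 approximation
      rel_err = norm(A - A2,'fro')/norm(A,'fro')   % tiny: the data is nearly rank 2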
  • Slide 5
  • Formulation: The Fixed-Rank Problem Minimize ||A - X||_F subject to rank(X) <= k. The constraint set is non-convex, but a global optimum exists, and there is a known solution.
  • Slide 6
  • Classical Solution: The Fixed-Rank Problem Singular Value Decomposition of A: A = U Σ V^T. Optimal solution: the truncated SVD A_k = U_k Σ_k V_k^T, keeping the top k singular values and vectors (proof to come later).
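    A minimal Matlab sketch of the classical solution (assuming a dense A that fits in memory):

      A = randn(200,100);  k = 10;
      [U,S,V] = svd(A,'econ');
      Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';    % optimal rank-k approximation of A
      % for the top k triplets only, svds(A,k) computes them iteratively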
  • Slide 7
  • Truncated SVD Properties Properties (via the power method): O(mnk) time, which is fine, but O(k) passes through the data; expects cheap random access; iterative. Issues: datasets are huge (Netflix dataset 2 GB, FERET Face DB 5 GB, Wiki English 14 GB, Watson KB 1 TB); architecture is parallel and decentralized; data access is expensive.
  • Slide 8
  • Low-Rank Algorithm Desiderata 1-2 passes over the matrix as a pre-processing / examination step. Remainder of the work sub-O(mn), depending on the desired rank k rather than the ambient dimensions m x n. Trade off accuracy against computation time. Decentralized, simple, and numerically stable.
  • Slide 9
  • Randomization to the Rescue! 1-2 passes over the matrix in a pre-processing/examination step. Remainder of the work sub-O(mn), depending on the underlying rank rather than the ambient dimensions. Trade off accuracy against computation time. Decentralized, simple and numerically stable. Randomized meta-algorithm: 1. Given A (m x n), randomly obtain Y (m x s or s x s) in 1 pass. 2. Compute the exact SVD of Y in O(ms^2) or O(s^3). 3. Use Y's decomposition to approximate A's decomposition.
  • Slide 10
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 10
  • Slide 11
  • Singular Value Decomposition Any real rectangular matrix A can be factored into the form A = U Σ V^T. Tall & skinny: m > n. Short & fat: m < n.
  • Slide 12
  • SVD Properties U and V are unitary matrices with mutually orthonormal columns. These columns are called the left singular vectors (columns of U) and the right singular vectors (columns of V).
  • Slide 13
  • SVD Properties Σ contains the singular values on the diagonal, in decreasing order, σ_1 ≥ σ_2 ≥ ... ≥ 0, which correspond to the left and right singular vectors u_1, v_1; u_2, v_2; and so on.
  • Slide 14
  • SVD Properties For a vector x, the geometric interpretation of Ax = U Σ V^T x: 1. Rotate x by the rotation matrix V^T. 2. Scale the result along the coordinate axes by the singular values. 3. Either discard n - m dimensions (m < n) or append m - n zeros (m > n) to map from R^n to R^m. 4. Rotate the result by U.
  • Slide 15
  • Fundamental Subspaces Define range(A) as the set of all vectors which A maps to, range(A) = { y : y = Ax for some x }, and null(A) as the set of vectors that A maps to zero, null(A) = { x : Ax = 0 }. If A has rank k, then { u_1, ..., u_k } is an orthonormal basis for range(A). Synonymous: range(A) is the linear subspace spanned by the columns of A.
  • Slide 16
  • Frobenius norm Equivalent definitions: ||X||_F^2 = Σ_ij X_ij^2 = tr(X^T X) = Σ_i σ_i^2 (in Matlab: sum(X(:).^2)). It is unitarily invariant: ||U X V||_F = ||X||_F for unitary U and V, since tr(XY) = tr(YX).
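    A quick numerical check of the equivalent definitions and of unitary invariance (my own sketch, not from the slides):

      X  = randn(30,20);
      f1 = sqrt(sum(X(:).^2));          % entrywise definition
      f2 = sqrt(trace(X'*X));           % trace definition
      f3 = sqrt(sum(svd(X).^2));        % sum of squared singular values
      [Q,~] = qr(randn(30));            % a random orthogonal matrix
      f4 = norm(Q*X,'fro');             % unchanged under a unitary transformation
      disp([f1 f2 f3 f4 norm(X,'fro')]) % all equal up to rounding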
  • Slide 17
  • The Optimal Low-Rank Solution Theorem (Eckart-Young, 1936). Let A = U Σ V^T. Then ||A - X||_F subject to rank(X) ≤ k is minimized by X = A_k = Σ_{i=1..k} σ_i u_i v_i^T, with optimal value ||A - A_k||_F^2 = Σ_{i>k} σ_i^2.
  • Slide 18
  • The Optimal Low-Rank Solution Proof. First, the Frobenius norm is unitarily invariant: ||A - X||_F = ||U Σ V^T - X||_F = ||Σ - U^T X V||_F. To make the right term line up with the diagonal Σ, we can restrict attention to X of the form X = U D V^T, where D should be diagonal too.
  • Slide 19
  • The Optimal Low-Rank Solution Proof (continued). At this point we have ||A - X||_F^2 = ||Σ - D||_F^2 = Σ_i (σ_i - d_i)^2. Since rank(X) is at most k, at most k of the d_i can be non-zero. Conclude: set d_i = σ_i for the k largest singular values and d_i = 0 otherwise. In summary: A_k = U_k Σ_k V_k^T, with error Σ_{i>k} σ_i^2.
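    A numerical sanity check of the theorem (my own sketch): the rank-k truncation error equals the energy in the tail singular values.

      A = randn(100,60);  k = 5;
      [U,S,V] = svd(A,'econ');  sig = diag(S);
      Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
      disp([norm(A - Ak,'fro')^2, sum(sig(k+1:end).^2)])   % identical up to rounding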
  • Slide 20
  • Orthogonal Projections Projection of a vector x onto a unit vector q: (q^T x) q. Projection of a matrix A onto an orthonormal basis Q: Q Q^T A.
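    The two projections in Matlab (a small sketch):

      q = randn(50,1);  q = q/norm(q);    % unit vector
      x = randn(50,1);
      x_proj = (q'*x)*q;                  % projection of x onto q

      A = randn(50,30);
      Q = orth(randn(50,8));              % orthonormal basis, 8 columns
      A_proj = Q*(Q'*A);                  % projection of A onto range(Q)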
  • Slide 21
  • Randomized Meta-algorithm Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Form a lower-dimensional Y from A by sampling s rows and/or columns, or by applying s random projections; Y is m x s or s x s. 2. Compute the top k left singular vectors of Y to form the best rank-k basis Q for range(Y). 3. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 22
  • The Main Idea Meta-Algorithm: 1. Form Y from A randomly. 2. Get the optimal rank-k basis Q for the span of Y via SVD. 3. Project A onto Q. Bounds: how far is ||A - Q Q^T A||_F from the optimal ||A - A_k||_F?
  • Slide 23
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 23
  • Slide 24
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | ?            | ?        | ?
    SampleRowsCols    | ?            | ?        | ?
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 25
  • Sampling Rows & Columns Simple idea: too much data to deal with? Subsample! But sample proportional to squared magnitude: small columns are not so useful, large columns are more informative.
  • Slide 26
  • First pass: SampleCols [Frieze, Kannan & Vempala, 1998] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), with column i chosen with probability p_i proportional to its squared magnitude ||A(:,i)||^2. 2. Form Y from the sampled columns, rescaled by 1/sqrt(s p_i). 3. Compute Q = [q_1 ... q_k], the top k left singular vectors of Y. 4. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 27
  • Running time: SampleCols Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A proportional to their squared magnitude: O(mn), one pass. 2. Form the rescaled Y: O(ms). 3. Compute Q = [q_1 ... q_k], the top k left singular vectors of Y: O(ms^2). 4. Project A onto the subspace spanned by Q: Â = Q Q^T A. Total: O(mn + ms^2).
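    A minimal Matlab sketch of SampleCols in the style of the code slides at the end of the deck (the deck lists sampleColsRows, adaSample, and randProj but not plain sampleCols; sample_from_weights is the same assumed helper that draws s indices i.i.d. from the given weights):

      function [U,S,V] = sampleCols(A,s,k)
      p = sum(A.^2,1);  p = p/sum(p);                        % squared column magnitudes
      idx = sample_from_weights(p,s);
      Y = bsxfun(@times, A(:,idx), 1./sqrt(p(idx)))/sqrt(s); % scaled sample, m x s
      [Q,~,~] = svds(Y,k);                                   % top-k left singular vectors of Y
      B = Q'*A;                                              % project A onto span(Q)
      [U0,S,V] = svds(B,k);
      U = Q*U0;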
  • Slide 28
  • Analysis: SampleCols How different is A from Y on average? Exact: A A^T = Σ_i A(:,i) A(:,i)^T. Randomly using s columns: Y Y^T, a sum over s rescaled sampled columns. Let's start by analyzing a randomized matrix-vector multiplication algorithm.
  • Slide 29
  • Analysis: SampleCols Random matrix-vector multiplication. Exact: Ax = Σ_i x_i A(:,i). Randomized: let the random variable X take on value x_i A(:,i) / p_i with probability p_i. Then E[X] = Ax, and the variance involves Σ_i ||A(:,i)||^2 x_i^2 / p_i. How do we easily bound this variance? Set p_i proportional to ||A(:,i)||^2, the squared column magnitudes.
  • Slide 30
  • Analysis: SampleCols Handle on the variance of matrix-vector multiplication. With s samples, the variance gets s times better. Let's extend the idea to matrix-matrix multiplication A B, randomly choosing column i from A and the corresponding row i from B: define the random variable Z = Σ over s samples of A(:,i) B(i,:) / (s p_i). Then E[Z] = A B, and the variance again shrinks like 1/s.
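    A toy Matlab check of the sampled matrix product (my own sketch; the sampling probabilities are proportional to squared column magnitudes of A, as in SampleCols):

      rng(0);
      A = randn(100,500);  B = randn(500,80);  s = 200;
      p = sum(A.^2,1);  p = p/sum(p);
      cp = cumsum(p);
      idx = arrayfun(@(r) find(cp >= r, 1), rand(1,s));   % s indices drawn i.i.d. from p
      Z = zeros(100,80);
      for t = 1:s
          i = idx(t);
          Z = Z + A(:,i)*B(i,:)/(s*p(i));                  % rescaled rank-1 term; E[Z] = A*B
      end
      rel_err = norm(A*B - Z,'fro')/norm(A*B,'fro')        % shrinks like 1/sqrt(s)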
  • Slide 31
  • Analysis: SampleCols Handle on the variance of matrix-matrix multiplication: E||A B - Z||_F^2 is bounded by ||A||_F^2 ||B||_F^2 / s. Now let's look at the variance when B = A^T, with Z = Y Y^T. Plugging into the above, we get a bound on E||A A^T - Y Y^T||_F^2.
  • Slide 32
  • Analysis: SampleCols When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. We need one more lemma, which quantifies the distortion from projecting one matrix onto another's range.
  • Slide 33
  • Analysis: SampleCols Given any two matrices A and B, let Q be a top-k basis of range(B). Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. That is, the projection of A onto the best k-basis of B differs from the projection of A onto the best k-basis of A by at most the last term.
  • Slide 34
  • Analysis: SampleCols Lemma (Distortion from sampling). When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. Lemma (Distortion from projection). Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Taking the expectation of the second lemma with respect to the sampled columns, we obtain a bound for SampleCols: E||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2.
  • Slide 35
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2) | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | ?            | ?        | ?
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 36
  • One step further: SampleRowsCols [Frieze, Kannan & Vempala, 2004] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), proportional to their squared magnitude; scale and form Y (m x s). 2. Sample s rows from Y: Y(i_1,:), ..., Y(i_s,:), proportional to their squared magnitude; scale and form W (s x s). 3. Compute V = [v_1 ... v_s], the top s right singular vectors of W. 4. Compute Q = [q_1 ... q_k], where q_i = Y v_i / ||Y v_i||. 5. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 37
  • Running time: SampleRowsCols Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A proportional to their squared magnitude; scale and form Y (m x s): O(mn). 2. Sample s rows from Y proportional to their squared magnitude; scale and form W (s x s): O(ms). 3. Compute V = [v_1 ... v_s], the top s right singular vectors of W: O(s^3). 4. Compute Q = [q_1 ... q_k], where q_i = Y v_i / ||Y v_i||: O(ms^2). 5. Project A onto the subspace spanned by Q. Total running time: O(mn + s^3).
  • Slide 38
  • Analysis: SampleRowsCols (Diagram: the error is decomposed across the steps of the algorithm: sample columns, sample rows, compute a basis for the rows of W, convert from the row basis of W to a column basis of Y, project A onto Q; each step is compared against the exact solution and against SampleCols.)
  • Slide 39
  • Analysis: SampleRowsCols It turns out the additive error from sampling dominates the errors incurred by the other steps of the algorithm. Lemma (Good left projections from good right projections). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols. Then a good rank-k right projection for W yields a comparably good rank-k left projection for Y; thus we can bound the algorithm as follows. Theorem (SampleRowsCols average Frobenius error). SampleRowsCols finds a rank-k matrix Â such that E||A - Â||_F^2 ≤ ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2.
  • Slide 40
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2) | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)  | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 41
  • SampleCols is easy to break The error is additive, and we have no control over the ||A||_F^2 term. Consider a dataset consisting of a large cluster plus a few important outlying points: this data has a near-perfect rank-2 decomposition, but SampleCols will almost surely miss the outliers!
  • Slide 42
  • Improvement: AdaSample Sample some rows or columns and form a basis. The next round of sampling should be proportional to the residual part of A not captured by the current sample.
  • Slide 43
  • AdaSample [Deshpande, Rademacher, Vempala & Wang, 2006] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Start with an empty sample set S = { }, E := A. 2. For t = 1 to T: a. Pick a subset S_t of s rows of A, with row i chosen with probability proportional to ||E(i,:)||^2. b. Update S := S ∪ S_t. c. Update E := A - proj_span(S)(A). 3. Return Â, the projection of A onto the best rank-k basis of span(S).
  • Slide 44
  • Analysis: AdaSample Let's look at one iteration of AdaSample. Lemma. Let A be m x n and L ⊆ R^n a linear subspace. Let E = A - π_L(A), and let S be a collection of s rows of A sampled with probability proportional to their squared magnitude in E. Then E_S[ ||A - π_{L + span(S), k}(A)||_F^2 ] ≤ ||A - A_k||_F^2 + (k/s) ||E||_F^2, where π_{L,k}(A) is the optimal rank-k projection of A onto the subspace L. The proof is similar to the derivation of the bound for SampleCols.
  • Slide 45
  • Analysis: AdaSample Proof sketch. Apply the lemma above at each round, with L the span of the rows sampled so far and E the current residual; each application multiplies the residual's contribution by another factor of k/s. Applying the lemma T times: E||A - Â||_F^2 ≤ (1 + k/s + (k/s)^2 + ...) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2 ≤ 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2.
  • Slide 46
  • Running Time: AdaSample At iteration t we need to 1. extend the orthonormal basis for S to an orthonormal basis for S ∪ S_t: orthogonalizing s new vectors against the st existing orthogonal n x 1 vectors takes O(nts^2); and 2. project A onto the new portion of the basis spanned by S_t, which takes O(mns). So the iterative process takes O(mnsT + ns^2 T^2). Finally, we compute singular vectors to obtain the final rank-k basis, taking O(ms^2 T^2). Total running time: O(mnsT + s^2 T^2 (m + n)).
  • Slide 47
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | ?                       | ?        | ?
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 48
  • Recap so far Previous algorithms attempted to capture range ( A ) by (intelligently) sampling rows and/or columns. What if instead, we probed A with a variety of vectors to get a feel for range(A) ? 48
  • Slide 49
  • RandProj: Geometric intuition 49
  • Slide 50
  • RandProj [Halko, Martinsson & Tropp, 2010] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Draw a random n x s Gaussian test matrix Ω. 2. Form Y = A Ω. 3. Compute an orthonormal basis Q for the range of Y via SVD. 4. Return Â = Q Q^T A.
  • Slide 51
  • Running time: RandProj Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Draw a random n x s Gaussian test matrix Ω. 2. Form Y = A Ω: O(mns). 3. Compute an orthonormal basis Q for the range of Y via SVD: O(ms^2). 4. Return Â = Q Q^T A. Total running time: O(mns + ms^2).
  • Slide 52
  • Analysis: RandProj Partition the SVD of A into the top-k block and the tail: Σ_1 holds σ_1, ..., σ_k with right singular vectors V_1, and Σ_2 holds the rest with right singular vectors V_2. Let Ω_1 = V_1^T Ω and Ω_2 = V_2^T Ω. Then ||A - Q Q^T A||_F^2 ≤ ||Σ_2||_F^2 + ||Σ_2 Ω_2 Ω_1^+||_F^2: the first term is the optimal error, and the second is an extra cost proportional to the tail singular values, i.e., wasted sampling.
  • Slide 53
  • Analysis: RandProj Take expectations with respect to Ω. Lemma (Gaussian matrix properties). Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed: E||B G C||_F^2 = ||B||_F^2 ||C||_F^2 and E||G^+||_F^2 = k / (s - k - 1). Lemma (Random projection error bound). E||A - Q Q^T A||_F^2 ≤ (1 + k/(s - k - 1)) ||A - A_k||_F^2.
  • Slide 54
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 55
  • RandProj refinement Can combine RandProj with power iterations: form Y = (A A^T)^q A Ω instead of Y = A Ω. This drives the noise from the tail n - k singular values down exponentially fast in q. But we pay in running time and number of passes: O((q+1)mns + ms^2) and 2q passes through the data.
  • Slide 56
  • RandProj refinements (2) Can combine RandProj with structured random matrices, e.g. subsampled random Fourier transform (SRFT) matrices. Compute Y = A Ω using the FFT in O(mn log s) instead of the standard O(mns). Bounds are more difficult to prove, but in practice this works as well as a Gaussian test matrix.
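    One way the SRFT sketch might be implemented in Matlab (a sketch under my own assumptions about the construction: Omega = sqrt(n/s) * D * F * R with D random signs, F the unitary DFT, and R a random column selection; the full FFT below costs O(mn log n), and a subsampled FFT gives the O(mn log s) quoted above):

      A = randn(2000,512);  s = 100;
      [m,n] = size(A);
      d   = sign(randn(1,n));  d(d==0) = 1;               % random +-1 signs (D)
      idx = randperm(n,s);                                % columns kept by R
      AF  = fft(bsxfun(@times, A, d), [], 2)/sqrt(n);     % rows of A*D through the unitary DFT
      Y   = sqrt(n/s)*AF(:,idx);                          % m x s sketch Y = A*Omega
      [Q,~] = qr(Y,0);                                    % complex Q; a real (e.g. Hadamard) variant avoids this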
  • Slide 57
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
    RandProj + power  | O((q+1)mns + ms^2)      | 2q       | near optimal; tail term decays exponentially in q
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 58
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 58
  • Slide 59
  • Which one is best? Fix k and s, and assume the bounds are tight. (Schematic error-vs-time plot: SampleCols at O(mn + ms^2) and SampleRowsCols at O(mn + s^3) are cheapest but have the largest error; AdaSample at O(mnsT + s^2T^2(m+n)) moves toward the optimal error with more rounds of adaptive sampling; RandProj at O((q+1)mns + ms^2) approaches the optimal error with more power iterations.)
  • Slide 60
  • Caveats The bounds are not tight. The relative scaling of the time and error axes depends on m, n, s, k, and the spectrum of A. So which is best in practice?
  • Slide 61
  • Experiments Matlab implementations of adaSample, randProj, and sampleColsRows used in the experiments (sample_from_weights is a helper that draws s indices i.i.d. from the given weights):

      function [U,S,V] = adaSample(A,s,T)
      cols = [];  E = A;
      for t = 1:T
          % sample columns proportional to the squared magnitude of the residual
          p = sum(E.^2,1);  p = p/sum(p);
          idx = sample_from_weights(p,s);
          cols = [cols A(:,idx)];
          % orthogonalize & project
          [Q,~] = qr(cols,0);
          E = A - Q*(Q'*A);
      end
      % wrap up: SVD of the projected matrix
      B = Q'*A;
      [U,S,V] = svds(B,s*T);
      U = Q*U;

      function [U,S,V] = randProj(A,s,q)
      Omega = randn(size(A,2),s);
      Y = A*Omega;
      for i = 1:q        % optional power iterations
          Y = A'*Y;
          Y = A*Y;
      end
      [Q,~] = qr(Y,0);
      B = Q'*A;
      [U0,S,V] = svds(B,s);
      U = Q*U0;

      function [U,S,V] = sampleColsRows(A,s)
      % subsample s columns of A, then s rows of the result
      Y = subsample(A,s);           % m x s
      W = subsample(Y',s)';         % s x s
      % right singular vectors of W approximate those of Y
      [Uw,S,Vw] = svd(W);
      U = Y*Vw;
      [U,~] = qr(U,0);              % orthonormal basis Q
      V = (U'*A)';
      V = bsxfun(@times, V, 1./sqrt(sum(V.^2,1)));

      function Xsub = subsample(X,s)
      % sample s columns of X proportional to squared column magnitude, then rescale
      p_col = sum(X.^2,1);
      p_col = p_col/sum(p_col);
      colidx = sample_from_weights(p_col,s);
      Xsub = X(:,colidx);
      Xsub = bsxfun(@times, Xsub, 1./sqrt(p_col(colidx)))/sqrt(s);
  • Slide 62
  • Experiments Eigenfaces from Labeled Faces in the Wild: 13,233 images, each 96x96 pixels, collected online in 2007. A is a 13233 x 9216 matrix, 975.6 MB in double precision.
  • Slide 63
  • Eigenfaces examples We want the top k = 25 eigenfaces. (Figure: the image matrix factors into a k-dimensional face basis in pixel space and per-image loadings.)
  • Slide 64
  • Time (seconds): exact solution in ~4.5 minutes via Matlab's svds(A,25).
  • Slide 65
  • Time (seconds): RandProj, SampleRowsCols, and AdaSample; exact solution in ~4 minutes.
  • Slide 66
  • log(Time): RandProj, SampleRowsCols, and AdaSample; exact solution in ~4 minutes.
  • Slide 67
  • log(Time): RandProj, SampleRowsCols, AdaSample, and RandProj + power iterations (q = 1, 2, 3); exact in ~4 minutes, RandProj + power iterations in 4.6 secs!
  • Slide 68
  • Summary Exact eigenfaces in 4+ minutes Approximate eigenfaces in 4 seconds (75 random projections + 1 power iteration) 68
  • Slide 69
  • Conclusion Classical truncated SVD ill-suited for large datasets Randomized algorithms allow an error-vs-computation tradeoff require only a few passes through the data simple and robust 69
  • Slide 70
  • Thanks! 70
  • Slide 71
  • Backup: Matlab implementations of randProj, sampleColsRows, and adaSample, as listed on the Experiments slide. LFW: svds takes ~8 GB of memory, uses up to all 8 cores, and takes 263 secs for k = 25 with X = 13233 x 9216 (96x96 images).
  • Slide 72
  • Classical Algorithm: Truncated SVD The full SVD of an m x n matrix takes O(mn min(m,n)) time. If we only care about the top k singular values and vectors, it takes O(mnk) via the power method: 1. Multiply k random vectors by A or A^T. 2. Orthogonalize. 3. Repeat until convergence. The random vectors converge to the singular vectors.
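    A minimal Matlab sketch of this block power method (a fixed iteration count stands in for the convergence test):

      A = randn(500,300);  k = 10;  iters = 50;
      Q = orth(randn(size(A,2),k));       % k random starting vectors
      for it = 1:iters
          Q = orth(A*Q);                  % multiply by A, orthogonalize
          Q = orth(A'*Q);                 % multiply by A', orthogonalize
      end
      [Uk,Sk,W] = svd(A*Q,'econ');        % top-k singular values/vectors from the converged subspace
      Vk = Q*W;                           % compare with svds(A,k)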
  • Slide 73
  • Matrix Norms Frobenius norm: ||X||_F = sqrt(Σ_ij X_ij^2) (in Matlab: sqrt(sum(X(:).^2)) or norm(X,'fro')). L2 operator norm, a.k.a. spectral norm: ||X||_2 = max over unit vectors v of ||Xv|| = σ_1(X) (in Matlab: norm(X)).
  • Slide 74
  • Formulation: the fixed-rank problem Minimize ||A - X||_F subject to rank(X) ≤ k. The feasible set is non-convex, but a global optimum exists.
  • Slide 75
  • SVD Properties The eigenvectors of A A^T are the left singular vectors u_i, with eigenvalues σ_i^2. Proof: A A^T = U Σ V^T V Σ^T U^T = U Σ Σ^T U^T. Also, the eigenvectors of A^T A are the right singular vectors v_i, with eigenvalues σ_i^2.
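    A quick numerical check of the claim (my own sketch):

      A = randn(40,25);
      [U,S,~] = svd(A,'econ');
      u1 = U(:,1);  s1 = S(1,1);
      disp(norm(A*A'*u1 - s1^2*u1))       % ~0: u1 is an eigenvector of A*A' with eigenvalue s1^2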
  • Slide 76
  • Analysis: SampleCols Lemma. Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Proof: easy to show by manipulating the trace definition of the Frobenius norm, ||X||_F^2 = tr(X^T X).
  • Slide 77
  • Analysis: SampleRowsCols Apply the distortion lemma we've already seen twice: once for columns, once for rows. When Y is a sampled version of A as in SampleCols, E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s; the same holds for W as a row-sampled version of Y. We need one more piece of glue, relating right projections (of W) to left projections (of Y): let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols. Then a good rank-k projection onto the top right singular vectors of W yields a comparably good rank-k left projection Q for Y.
  • Slide 78
  • Analysis: RandProj Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed matrices of agreeable dimensions. Then E||B G C||_F^2 = ||B||_F^2 ||C||_F^2, and E||G^+||_F^2 = k / (s - k - 1).
  • Slide 79
  • Analysis: RandProj Lemma (Random projection error bound). Taking expectations with respect to Ω_1 and Ω_2, and using the properties of Gaussian random matrices above, we can bound the algorithm: RandProj finds a rank-k matrix Â such that E||A - Â||_F^2 ≤ (1 + k/(s - k - 1)) ||A - A_k||_F^2.
  • Slide 80
  • References [Frieze, Kannan & Vempala, 1998] [Frieze, Kannan & Vempala, 2004] [Deshpande, Rademacher, Vempala & Wang, 2006] [Halko, Martinsson & Tropp, 2010]