Randomized Algorithms for Low-Rank Matrix Decomposition. Ben Sapp, WPE II. May 6, 2011.

  • Slide 1
  • Randomized Algorithms for Low-Rank Matrix Decomposition. Ben Sapp, WPE II. May 6, 2011.
  • Slide 2
  • Low-Rank Decomposition: The Goal Approximate an m x n matrix A by a product of two thin matrices, A ≈ L R^T, where L is m x k and R is n x k, with k << min(m, n).
  • Slide 3
  • Advantages Requires mk + nk numbers to represent the matrix instead of mn, i.e., compression. Fewer numbers = less storage space and faster matrix multiplication. In many applications, it exposes the structure of the data.
  • Slide 4
  • Exposing structure: an example Let's say your matrix is full of points in n-dimensional space which have some underlying linear structure. This data can be described almost exactly in 2 dimensions, i.e., by a rank-2 matrix: one factor describes the 2 axes in R^n, the other gives the coefficients (embedding) of the points in that basis. A small sketch of this appears below.
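    A minimal Matlab sketch (my own illustration, not from the slides): generate points in n = 50 dimensions that actually lie near a 2-dimensional subspace, then recover them with a rank-2 truncated SVD.

      % points with underlying 2-dimensional linear structure
      rng(0);
      n = 50;  m = 500;  k = 2;
      axes2  = orth(randn(n,k));             % description of the 2 axes in R^n
      coeffs = randn(m,k);                   % coefficients (embedding) of the points
      A = coeffs*axes2' + 0.01*randn(m,n);   % m slightly noisy points, one per row

      [U,S,V] = svd(A,'econ');
      A2 = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';    % rank-2 approximation
      rel_err = norm(A - A2,'fro')/norm(A,'fro')   % tiny: the data is nearly rank 2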
  • Slide 5
  • Formulation: The Fixed-Rank Problem Minimize ||A - X||_F subject to rank(X) <= k. The constraint set is non-convex, but a global optimum exists, and there is a known solution.
  • Slide 6
  • Classical Solution: The Fixed-Rank Problem Singular Value Decomposition of A: A = U Σ V^T. Optimal solution: the truncated SVD A_k = U_k Σ_k V_k^T, keeping the top k singular values and vectors (proof to come later).
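    A minimal Matlab sketch of the classical solution (assuming a dense A that fits in memory):

      A = randn(200,100);  k = 10;
      [U,S,V] = svd(A,'econ');
      Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';    % optimal rank-k approximation of A
      % for the top k triplets only, svds(A,k) computes them iteratively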
  • Slide 7
  • Truncated SVD Properties Properties (via the power method): O(mnk) time, which is fine, but O(k) passes through the data; expects cheap random access; iterative. Issues: datasets are huge (Netflix dataset 2 GB, FERET Face DB 5 GB, Wiki English 14 GB, Watson KB 1 TB); architecture is parallel and decentralized; data access is expensive.
  • Slide 8
  • Low-Rank Algorithm Desiderata 1-2 passes over the matrix as a pre-processing / examination step. Remainder of the work sub-O(mn), depending on the desired rank k rather than the ambient dimensions m x n. Trade off accuracy against computation time. Decentralized, simple, and numerically stable.
  • Slide 9
  • Randomization to the Rescue! 1-2 passes over the matrix in a pre-processing/examination step. Remainder of the work sub-O(mn), depending on the underlying rank rather than the ambient dimensions. Trade off accuracy against computation time. Decentralized, simple and numerically stable. Randomized meta-algorithm: 1. Given A (m x n), randomly obtain Y (m x s or s x s) in 1 pass. 2. Compute the exact SVD of Y in O(ms^2) or O(s^3). 3. Use Y's decomposition to approximate A's decomposition.
  • Slide 10
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 10
  • Slide 11
  • Singular Value Decomposition Any real rectangular matrix A can be factored into the form A = U Σ V^T. Tall & skinny: m > n. Short & fat: m < n.
  • Slide 12
  • SVD Properties U and V are unitary matrices with mutually orthonormal columns. These columns are called the left singular vectors (columns of U) and the right singular vectors (columns of V).
  • Slide 13
  • SVD Properties Σ contains the singular values on the diagonal, in decreasing order, σ_1 ≥ σ_2 ≥ ... ≥ 0, which correspond to the left and right singular vectors u_1, v_1; u_2, v_2; and so on.
  • Slide 14
  • SVD Properties For a vector x, the geometric interpretation of Ax = U Σ V^T x: 1. Rotate x by the rotation matrix V^T. 2. Scale the result along the coordinate axes by the singular values. 3. Either discard n - m dimensions (m < n) or append m - n zeros (m > n) to map from R^n to R^m. 4. Rotate the result by U.
  • Slide 15
  • Fundamental Subspaces Define range(A) as the set of all vectors which A maps to, range(A) = { y : y = Ax for some x }, and null(A) as the set of vectors that A maps to zero, null(A) = { x : Ax = 0 }. If A has rank k, then { u_1, ..., u_k } is an orthonormal basis for range(A). Synonymous: range(A) is the linear subspace spanned by the columns of A.
  • Slide 16
  • Frobenius norm Equivalent definitions: ||X||_F^2 = Σ_ij X_ij^2 = tr(X^T X) = Σ_i σ_i^2 (in Matlab: sum(X(:).^2)). It is unitarily invariant: ||U X V||_F = ||X||_F for unitary U and V, since tr(XY) = tr(YX).
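    A quick numerical check of the equivalent definitions and of unitary invariance (my own sketch, not from the slides):

      X  = randn(30,20);
      f1 = sqrt(sum(X(:).^2));          % entrywise definition
      f2 = sqrt(trace(X'*X));           % trace definition
      f3 = sqrt(sum(svd(X).^2));        % sum of squared singular values
      [Q,~] = qr(randn(30));            % a random orthogonal matrix
      f4 = norm(Q*X,'fro');             % unchanged under a unitary transformation
      disp([f1 f2 f3 f4 norm(X,'fro')]) % all equal up to rounding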
  • Slide 17
  • The Optimal Low-Rank Solution Theorem (Eckart-Young, 1936). Let A = U Σ V^T. Then ||A - X||_F subject to rank(X) ≤ k is minimized by X = A_k = Σ_{i=1..k} σ_i u_i v_i^T, with optimal value ||A - A_k||_F^2 = Σ_{i>k} σ_i^2.
  • Slide 18
  • The Optimal Low-Rank Solution Proof. First, the Frobenius norm is unitarily invariant: ||A - X||_F = ||U Σ V^T - X||_F = ||Σ - U^T X V||_F. To make the right term line up with the diagonal Σ, we can restrict attention to X of the form X = U D V^T, where D should be diagonal too.
  • Slide 19
  • The Optimal Low-Rank Solution Proof (continued). At this point we have ||A - X||_F^2 = ||Σ - D||_F^2 = Σ_i (σ_i - d_i)^2. Since rank(X) is at most k, at most k of the d_i can be non-zero. Conclude: set d_i = σ_i for the k largest singular values and d_i = 0 otherwise. In summary: A_k = U_k Σ_k V_k^T, with error Σ_{i>k} σ_i^2.
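    A numerical sanity check of the theorem (my own sketch): the rank-k truncation error equals the energy in the tail singular values.

      A = randn(100,60);  k = 5;
      [U,S,V] = svd(A,'econ');  sig = diag(S);
      Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
      disp([norm(A - Ak,'fro')^2, sum(sig(k+1:end).^2)])   % identical up to rounding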
  • Slide 20
  • Orthogonal Projections Projection of a vector x onto a unit vector q: (q^T x) q. Projection of a matrix A onto an orthonormal basis Q: Q Q^T A.
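    The two projections in Matlab (a small sketch):

      q = randn(50,1);  q = q/norm(q);    % unit vector
      x = randn(50,1);
      x_proj = (q'*x)*q;                  % projection of x onto q

      A = randn(50,30);
      Q = orth(randn(50,8));              % orthonormal basis, 8 columns
      A_proj = Q*(Q'*A);                  % projection of A onto range(Q)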
  • Slide 21
  • Randomized Meta-algorithm Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Form a lower-dimensional Y from A by sampling s rows and/or columns, or by applying s random projections; Y is m x s or s x s. 2. Compute the top k left singular vectors of Y to form the best rank-k basis Q for range(Y). 3. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 22
  • The Main Idea Meta-Algorithm: 1. Form Y from A randomly. 2. Get the optimal rank-k basis Q for the span of Y via SVD. 3. Project A onto Q. Bounds: how far is ||A - Q Q^T A||_F from the optimal ||A - A_k||_F?
  • Slide 23
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 23
  • Slide 24
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | ?            | ?        | ?
    SampleRowsCols    | ?            | ?        | ?
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 25
  • Sampling Rows & Columns Simple idea: too much data to deal with? Subsample! But sample proportional to squared magnitude: small columns are not so useful, large columns are more informative.
  • Slide 26
  • First pass: SampleCols [Frieze, Kannan & Vempala, 1998] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), with column i chosen with probability p_i proportional to its squared magnitude ||A(:,i)||^2. 2. Form Y from the sampled columns, rescaled by 1/sqrt(s p_i). 3. Compute Q = [q_1 ... q_k], the top k left singular vectors of Y. 4. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 27
  • Running time: SampleCols Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A proportional to their squared magnitude: O(mn), one pass. 2. Form the rescaled Y: O(ms). 3. Compute Q = [q_1 ... q_k], the top k left singular vectors of Y: O(ms^2). 4. Project A onto the subspace spanned by Q: Â = Q Q^T A. Total: O(mn + ms^2).
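    A minimal Matlab sketch of SampleCols in the style of the code slides at the end of the deck (the deck lists sampleColsRows, adaSample, and randProj but not plain sampleCols; sample_from_weights is the same assumed helper that draws s indices i.i.d. from the given weights):

      function [U,S,V] = sampleCols(A,s,k)
      p = sum(A.^2,1);  p = p/sum(p);                        % squared column magnitudes
      idx = sample_from_weights(p,s);
      Y = bsxfun(@times, A(:,idx), 1./sqrt(p(idx)))/sqrt(s); % scaled sample, m x s
      [Q,~,~] = svds(Y,k);                                   % top-k left singular vectors of Y
      B = Q'*A;                                              % project A onto span(Q)
      [U0,S,V] = svds(B,k);
      U = Q*U0;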
  • Slide 28
  • Analysis: SampleCols How different is A from Y on average? Exact: A A^T = Σ_i A(:,i) A(:,i)^T. Randomly using s columns: Y Y^T, a sum over s rescaled sampled columns. Let's start by analyzing a randomized matrix-vector multiplication algorithm.
  • Slide 29
  • Analysis: SampleCols Random matrix-vector multiplication. Exact: Ax = Σ_i x_i A(:,i). Randomized: let the random variable X take on value x_i A(:,i) / p_i with probability p_i. Then E[X] = Ax, and the variance involves Σ_i ||A(:,i)||^2 x_i^2 / p_i. How do we easily bound this variance? Set p_i proportional to ||A(:,i)||^2, the squared column magnitudes.
  • Slide 30
  • Analysis: SampleCols Handle on the variance of matrix-vector multiplication. With s samples, the variance gets s times better. Let's extend the idea to matrix-matrix multiplication A B, randomly choosing column i from A and the corresponding row i from B: define the random variable Z = Σ over s samples of A(:,i) B(i,:) / (s p_i). Then E[Z] = A B, and the variance again shrinks like 1/s.
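    A toy Matlab check of the sampled matrix product (my own sketch; the sampling probabilities are proportional to squared column magnitudes of A, as in SampleCols):

      rng(0);
      A = randn(100,500);  B = randn(500,80);  s = 200;
      p = sum(A.^2,1);  p = p/sum(p);
      cp = cumsum(p);
      idx = arrayfun(@(r) find(cp >= r, 1), rand(1,s));   % s indices drawn i.i.d. from p
      Z = zeros(100,80);
      for t = 1:s
          i = idx(t);
          Z = Z + A(:,i)*B(i,:)/(s*p(i));                  % rescaled rank-1 term; E[Z] = A*B
      end
      rel_err = norm(A*B - Z,'fro')/norm(A*B,'fro')        % shrinks like 1/sqrt(s)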
  • Slide 31
  • Analysis: SampleCols Handle on the variance of matrix-matrix multiplication: E||A B - Z||_F^2 is bounded by ||A||_F^2 ||B||_F^2 / s. Now let's look at the variance when B = A^T, with Z = Y Y^T. Plugging into the above, we get a bound on E||A A^T - Y Y^T||_F^2.
  • Slide 32
  • Analysis: SampleCols When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. We need one more lemma, which quantifies the distortion from projecting one matrix onto another's range.
  • Slide 33
  • Analysis: SampleCols Given any two matrices A and B, let Q be a top-k basis of range(B). Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. That is, the projection of A onto the best k-basis of B differs from the projection of A onto the best k-basis of A by at most the last term.
  • Slide 34
  • Analysis: SampleCols Lemma (Distortion from sampling). When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s. Lemma (Distortion from projection). Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Taking the expectation of the second lemma with respect to the sampled columns, we obtain a bound for SampleCols: E||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2.
  • Slide 35
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2) | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | ?            | ?        | ?
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 36
  • One step further: SampleRowsCols [Frieze, Kannan & Vempala, 2004] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), proportional to their squared magnitude; scale and form Y (m x s). 2. Sample s rows from Y: Y(i_1,:), ..., Y(i_s,:), proportional to their squared magnitude; scale and form W (s x s). 3. Compute V = [v_1 ... v_s], the top s right singular vectors of W. 4. Compute Q = [q_1 ... q_k], where q_i = Y v_i / ||Y v_i||. 5. Project A onto the subspace spanned by Q: Â = Q Q^T A.
  • Slide 37
  • Running time: SampleRowsCols Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Sample s columns from A proportional to their squared magnitude; scale and form Y (m x s): O(mn). 2. Sample s rows from Y proportional to their squared magnitude; scale and form W (s x s): O(ms). 3. Compute V = [v_1 ... v_s], the top s right singular vectors of W: O(s^3). 4. Compute Q = [q_1 ... q_k], where q_i = Y v_i / ||Y v_i||: O(ms^2). 5. Project A onto the subspace spanned by Q. Total running time: O(mn + s^3).
  • Slide 38
  • Analysis: SampleRowsCols (Diagram: the error is decomposed across the steps of the algorithm: sample columns, sample rows, compute a basis for the rows of W, convert from the row basis of W to a column basis of Y, project A onto Q; each step is compared against the exact solution and against SampleCols.)
  • Slide 39
  • Analysis: SampleRowsCols It turns out the additive error from sampling dominates the errors incurred by the other steps of the algorithm. Lemma (Good left projections from good right projections). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols. Then a good rank-k right projection for W yields a comparably good rank-k left projection for Y; thus we can bound the algorithm as follows. Theorem (SampleRowsCols average Frobenius error). SampleRowsCols finds a rank-k matrix Â such that E||A - Â||_F^2 ≤ ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2.
  • Slide 40
  • Comparing the algorithms
    Method            | Running Time | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2) | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)  | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | ?            | ?        | ?
    RandProj          | ?            | ?        | ?
    Exact partial SVD | O(mnk)       | O(k)     | optimal: ||A - A_k||_F
  • Slide 41
  • SampleCols is easy to break The error is additive, and we have no control over the ||A||_F^2 term. Consider a dataset consisting of a large cluster plus a few important outlying points: this data has a near-perfect rank-2 decomposition, but SampleCols will almost surely miss the outliers!
  • Slide 42
  • Improvement: AdaSample Sample some rows or columns and form a basis. The next round of sampling should be proportional to the residual part of A not captured by the current sample.
  • Slide 43
  • AdaSample [Deshpande, Rademacher, Vempala & Wang, 2006] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Start with an empty sample set S = { }, E := A. 2. For t = 1 to T: a. Pick a subset S_t of s rows of A, with row i chosen with probability proportional to ||E(i,:)||^2. b. Update S := S ∪ S_t. c. Update E := A - proj_span(S)(A). 3. Return Â, the projection of A onto the best rank-k basis of span(S).
  • Slide 44
  • Analysis: AdaSample Let's look at one iteration of AdaSample. Lemma. Let A be m x n and L ⊆ R^n a linear subspace. Let E = A - π_L(A), and let S be a collection of s rows of A sampled with probability proportional to their squared magnitude in E. Then E_S[ ||A - π_{L + span(S), k}(A)||_F^2 ] ≤ ||A - A_k||_F^2 + (k/s) ||E||_F^2, where π_{L,k}(A) is the optimal rank-k projection of A onto the subspace L. The proof is similar to the derivation of the bound for SampleCols.
  • Slide 45
  • Analysis: AdaSample Proof sketch. Apply the lemma above at each round, with L the span of the rows sampled so far and E the current residual; each application multiplies the residual's contribution by another factor of k/s. Applying the lemma T times: E||A - Â||_F^2 ≤ (1 + k/s + (k/s)^2 + ...) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2 ≤ 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2.
  • Slide 46
  • Running Time: AdaSample At iteration t we need to 1. extend the orthonormal basis for S to an orthonormal basis for S ∪ S_t: orthogonalizing s new vectors against the st existing orthogonal n x 1 vectors takes O(nts^2); and 2. project A onto the new portion of the basis spanned by S_t, which takes O(mns). So the iterative process takes O(mnsT + ns^2 T^2). Finally, we compute singular vectors to obtain the final rank-k basis, taking O(ms^2 T^2). Total running time: O(mnsT + s^2 T^2 (m + n)).
  • Slide 47
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | ?                       | ?        | ?
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 48
  • Recap so far Previous algorithms attempted to capture range ( A ) by (intelligently) sampling rows and/or columns. What if instead, we probed A with a variety of vectors to get a feel for range(A) ? 48
  • Slide 49
  • RandProj: Geometric intuition 49
  • Slide 50
  • RandProj [Halko, Martinsson & Tropp, 2010] Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Draw a random n x s Gaussian test matrix Ω. 2. Form Y = A Ω. 3. Compute an orthonormal basis Q for the range of Y via SVD. 4. Return Â = Q Q^T A.
  • Slide 51
  • Running time: RandProj Input: matrix A (m x n), target rank k, number of samples s. Output: Â which approximately solves min ||A - X||_F s.t. rank(X) ≤ k. 1. Draw a random n x s Gaussian test matrix Ω. 2. Form Y = A Ω: O(mns). 3. Compute an orthonormal basis Q for the range of Y via SVD: O(ms^2). 4. Return Â = Q Q^T A. Total running time: O(mns + ms^2).
  • Slide 52
  • Analysis: RandProj Partition the SVD of A into the top-k block and the tail: Σ_1 holds σ_1, ..., σ_k with right singular vectors V_1, and Σ_2 holds the rest with right singular vectors V_2. Let Ω_1 = V_1^T Ω and Ω_2 = V_2^T Ω. Then ||A - Q Q^T A||_F^2 ≤ ||Σ_2||_F^2 + ||Σ_2 Ω_2 Ω_1^+||_F^2: the first term is the optimal error, and the second is an extra cost proportional to the tail singular values, i.e., wasted sampling.
  • Slide 53
  • Analysis: RandProj Take expectations with respect to Ω. Lemma (Gaussian matrix properties). Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed: E||B G C||_F^2 = ||B||_F^2 ||C||_F^2 and E||G^+||_F^2 = k / (s - k - 1). Lemma (Random projection error bound). E||A - Q Q^T A||_F^2 ≤ (1 + k/(s - k - 1)) ||A - A_k||_F^2.
  • Slide 54
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 55
  • RandProj refinement Can combine RandProj with power iterations: form Y = (A A^T)^q A Ω instead of Y = A Ω. This drives the noise from the tail n - k singular values down exponentially fast in q. But we pay in running time and number of passes: O((q+1)mns + ms^2) and 2q passes through the data.
  • Slide 56
  • RandProj refinements (2) Can combine RandProj with structured random matrices, e.g. subsampled random Fourier transform (SRFT) matrices. Compute Y = A Ω using the FFT in O(mn log s) instead of the standard O(mns). Bounds are more difficult to prove, but in practice this works as well as a Gaussian test matrix.
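    One way the SRFT sketch might be implemented in Matlab (a sketch under my own assumptions about the construction: Omega = sqrt(n/s) * D * F * R with D random signs, F the unitary DFT, and R a random column selection; the full FFT below costs O(mn log n), and a subsampled FFT gives the O(mn log s) quoted above):

      A = randn(2000,512);  s = 100;
      [m,n] = size(A);
      d   = sign(randn(1,n));  d(d==0) = 1;               % random +-1 signs (D)
      idx = randperm(n,s);                                % columns kept by R
      AF  = fft(bsxfun(@times, A, d), [], 2)/sqrt(n);     % rows of A*D through the unitary DFT
      Y   = sqrt(n/s)*AF(:,idx);                          % m x s sketch Y = A*Omega
      [Q,~] = qr(Y,0);                                    % complex Q; a real (e.g. Hadamard) variant avoids this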
  • Slide 57
  • Comparing the algorithms
    Method            | Running Time            | # Passes | Error w.h.p.
    SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
    SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
    AdaSample         | O(mnsT + s^2T^2(m+n))   | 2T       | 1/(1 - k/s) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
    RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
    RandProj + power  | O((q+1)mns + ms^2)      | 2q       | near optimal; tail term decays exponentially in q
    Exact partial SVD | O(mnk)                  | O(k)     | optimal: ||A - A_k||_F
  • Slide 58
  • Outline Introduction Linear Algebra preliminaries and intuition The algorithms SampleCols AdaSample RandProj A comparison 58
  • Slide 59
  • Which one is best? Fix k and s, and assume the bounds are tight. (Schematic error-vs-time plot: SampleCols at O(mn + ms^2) and SampleRowsCols at O(mn + s^3) are cheapest but have the largest error; AdaSample at O(mnsT + s^2T^2(m+n)) moves toward the optimal error with more rounds of adaptive sampling; RandProj at O((q+1)mns + ms^2) approaches the optimal error with more power iterations.)
  • Slide 60
  • Caveats The bounds are not tight. The relative scaling of the time and error axes depends on m, n, s, k, and the spectrum of A. So which is best in practice?
  • Slide 61
  • Experiments Matlab implementations of adaSample, randProj, and sampleColsRows used in the experiments (sample_from_weights is a helper that draws s indices i.i.d. from the given weights):

      function [U,S,V] = adaSample(A,s,T)
      cols = [];  E = A;
      for t = 1:T
          % sample columns proportional to the squared magnitude of the residual
          p = sum(E.^2,1);  p = p/sum(p);
          idx = sample_from_weights(p,s);
          cols = [cols A(:,idx)];
          % orthogonalize & project
          [Q,~] = qr(cols,0);
          E = A - Q*(Q'*A);
      end
      % wrap up: SVD of the projected matrix
      B = Q'*A;
      [U,S,V] = svds(B,s*T);
      U = Q*U;

      function [U,S,V] = randProj(A,s,q)
      Omega = randn(size(A,2),s);
      Y = A*Omega;
      for i = 1:q        % optional power iterations
          Y = A'*Y;
          Y = A*Y;
      end
      [Q,~] = qr(Y,0);
      B = Q'*A;
      [U0,S,V] = svds(B,s);
      U = Q*U0;

      function [U,S,V] = sampleColsRows(A,s)
      % subsample s columns of A, then s rows of the result
      Y = subsample(A,s);           % m x s
      W = subsample(Y',s)';         % s x s
      % right singular vectors of W approximate those of Y
      [Uw,S,Vw] = svd(W);
      U = Y*Vw;
      [U,~] = qr(U,0);              % orthonormal basis Q
      V = (U'*A)';
      V = bsxfun(@times, V, 1./sqrt(sum(V.^2,1)));

      function Xsub = subsample(X,s)
      % sample s columns of X proportional to squared column magnitude, then rescale
      p_col = sum(X.^2,1);
      p_col = p_col/sum(p_col);
      colidx = sample_from_weights(p_col,s);
      Xsub = X(:,colidx);
      Xsub = bsxfun(@times, Xsub, 1./sqrt(p_col(colidx)))/sqrt(s);
  • Slide 62
  • Experiments Eigenfaces from Labeled Faces in the Wild: 13,233 images, each 96x96 pixels, collected online in 2007. A is a 13233 x 9216 matrix, 975.6 MB in double precision.
  • Slide 63
  • Eigenfaces examples We want the top k = 25 eigenfaces. (Figure: the image matrix factors into a k-dimensional face basis in pixel space and per-image loadings.)
  • Slide 64
  • Time (seconds): exact solution in ~4.5 minutes via Matlab's svds(A,25).
  • Slide 65
  • Time (seconds): RandProj, SampleRowsCols, and AdaSample; exact solution in ~4 minutes.
  • Slide 66
  • log(Time): RandProj, SampleRowsCols, and AdaSample; exact solution in ~4 minutes.
  • Slide 67
  • log(Time): RandProj, SampleRowsCols, AdaSample, and RandProj + power iterations (q = 1, 2, 3); exact in ~4 minutes, RandProj + power iterations in 4.6 secs!
  • Slide 68
  • Summary Exact eigenfaces in 4+ minutes Approximate eigenfaces in 4 seconds (75 random projections + 1 power iteration) 68
  • Slide 69
  • Conclusion Classical truncated SVD ill-suited for large datasets Randomized algorithms allow an error-vs-computation tradeoff require only a few passes through the data simple and robust 69
  • Slide 70
  • Thanks! 70
  • Slide 71
  • Backup: Matlab implementations of randProj, sampleColsRows, and adaSample, as listed on the Experiments slide. LFW: svds takes ~8 GB of memory, uses up to all 8 cores, and takes 263 secs for k = 25 with X = 13233 x 9216 (96x96 images).
  • Slide 72
  • Classical Algorithm: Truncated SVD The full SVD of an m x n matrix takes O(mn min(m,n)) time. If we only care about the top k singular values and vectors, it takes O(mnk) via the power method: 1. Multiply k random vectors by A or A^T. 2. Orthogonalize. 3. Repeat until convergence. The random vectors converge to the singular vectors.
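    A minimal Matlab sketch of this block power method (a fixed iteration count stands in for the convergence test):

      A = randn(500,300);  k = 10;  iters = 50;
      Q = orth(randn(size(A,2),k));       % k random starting vectors
      for it = 1:iters
          Q = orth(A*Q);                  % multiply by A, orthogonalize
          Q = orth(A'*Q);                 % multiply by A', orthogonalize
      end
      [Uk,Sk,W] = svd(A*Q,'econ');        % top-k singular values/vectors from the converged subspace
      Vk = Q*W;                           % compare with svds(A,k)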
  • Slide 73
  • Matrix Norms Frobenius norm: ||X||_F = sqrt(Σ_ij X_ij^2) (in Matlab: sqrt(sum(X(:).^2)) or norm(X,'fro')). L2 operator norm, a.k.a. spectral norm: ||X||_2 = max over unit vectors v of ||Xv|| = σ_1(X) (in Matlab: norm(X)).
  • Slide 74
  • Formulation: the fixed-rank problem Minimize ||A - X||_F subject to rank(X) ≤ k. The feasible set is non-convex, but a global optimum exists.
  • Slide 75
  • SVD Properties The eigenvectors of A A^T are the left singular vectors u_i, with eigenvalues σ_i^2. Proof: A A^T = U Σ V^T V Σ^T U^T = U Σ Σ^T U^T. Also, the eigenvectors of A^T A are the right singular vectors v_i, with eigenvalues σ_i^2.
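    A quick numerical check of the claim (my own sketch):

      A = randn(40,25);
      [U,S,~] = svd(A,'econ');
      u1 = U(:,1);  s1 = S(1,1);
      disp(norm(A*A'*u1 - s1^2*u1))       % ~0: u1 is an eigenvector of A*A' with eigenvalue s1^2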
  • Slide 76
  • Analysis: SampleCols Lemma. Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 ≤ ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F. Proof: easy to show by manipulating the trace definition of the Frobenius norm, ||X||_F^2 = tr(X^T X).
  • Slide 77
  • Analysis: SampleRowsCols Apply the distortion lemma we've already seen twice: once for columns, once for rows. When Y is a sampled version of A as in SampleCols, E||A A^T - Y Y^T||_F^2 ≤ ||A||_F^4 / s; the same holds for W as a row-sampled version of Y. We need one more piece of glue, relating right projections (of W) to left projections (of Y): let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols. Then a good rank-k projection onto the top right singular vectors of W yields a comparably good rank-k left projection Q for Y.
  • Slide 78
  • Analysis: RandProj Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed matrices of agreeable dimensions. Then E||B G C||_F^2 = ||B||_F^2 ||C||_F^2, and E||G^+||_F^2 = k / (s - k - 1).
  • Slide 79
  • Analysis: RandProj Lemma (Random projection error bound). Taking expectations with respect to Ω_1 and Ω_2, and using the properties of Gaussian random matrices above, we can bound the algorithm: RandProj finds a rank-k matrix Â such that E||A - Â||_F^2 ≤ (1 + k/(s - k - 1)) ||A - A_k||_F^2.
  • Slide 80
  • References [Frieze, Kannan & Vempala, 1998] [Frieze, Kannan & Vempala, 2004] [Deshpande, Rademacher, Vempala & Wang, 2006] [Halko, Martinsson & Tropp, 2010]