Randomized Algorithms for Low-Rank Matrix Decomposition
Ben Sapp, WPE II, May 6, 2011
Slide 2
Low-Rank Decomposition: The Goal
Given A (m x n), find B (m x k) and C (k x n), with k << min(m, n), such that A ≈ BC.
Slide 3
Advantages
Requires mk + nk numbers to represent the matrix, instead of mn, i.e., compression. Fewer numbers means less storage space and faster matrix multiplication. In many applications, it also exposes the structure of the data.
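For concreteness (illustrative numbers): with m = 100,000, n = 10,000, and k = 20, the factored form stores mk + nk = 2.2 million numbers instead of mn = 10^9, roughly a 450x saving, and a matrix-vector product computed as B(Cx) costs O((m+n)k) instead of O(mn) operations.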
Slide 4
Exposing structure: an example
Let's say your matrix is full of points in n-dimensional space which have some underlying linear structure. Then the data can be almost exactly described in 2 dimensions, i.e., by a rank-2 matrix A ≈ BC: one factor holds a description of the 2 axes in the original n-dimensional space, and the other holds the coefficients (embedding) of the points in those axes.
Slide 5
Formulation: The Fixed-Rank Problem
minimize ||A - X||_F subject to rank(X) <= k.
The constraint set is non-convex. But a global optimum exists, and there is a known solution.
Slide 6
Classical Solution: The Fixed-Rank Problem
Singular Value Decomposition of A: A = U Σ V^T.
Optimal solution: the truncated SVD, A_k = U_k Σ_k V_k^T = sum_{i=1}^k σ_i u_i v_i^T (proof to come later).
Slide 7
Truncated SVD Properties
Properties (via the power method): O(mnk) time, which is OK, but O(k) passes through the data; expects cheap random access; iterative.
Issues: datasets are huge! Netflix dataset: 2GB. FERET Face DB: 5GB. Wikipedia (English): 14GB. Watson KB: 1TB. Architectures are parallel and decentralized, and data access is expensive.
Slide 8
Low-Rank Algorithm Desiderata
1-2 passes over the matrix as a pre-processing/examination step.
The remainder of the work should be sub-O(mn), and should depend on the desired rank k rather than the ambient dimensions m x n.
Trade off accuracy with computation time.
Decentralized, simple, and numerically stable.
Slide 9
Randomization to the Rescue!
1-2 passes over the matrix in a pre-processing/examination step. The remainder of the work is sub-O(mn), depending on the underlying rank rather than the ambient dimensions. Trades off accuracy with computation time. Decentralized, simple, and numerically stable.
Randomized meta-algorithm:
1. Given A (m x n), randomly obtain Y (m x s or s x s) in 1 pass.
2. Compute the exact SVD of Y in O(ms^2) or O(s^3).
3. Use Y's decomposition to approximate A's decomposition.
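As a concrete anchor for what follows, here is a minimal Matlab sketch of this meta-algorithm, using a Gaussian sketch for step 1 (the sampling-based variants come later in the talk); the name metaSketch is ours:

function [U,S,V] = metaSketch(A, k, s)
Y = A * randn(size(A,2), s);   % step 1: one pass over A; Y is m x s
[Q,R] = qr(Y, 0);              % step 2: orthonormal basis for range(Y)
B = Q' * A;                    % step 3: project A onto span(Q) ...
[U0,S,V] = svds(B, k);         % ... and take the rank-k SVD of the small s x n matrix
U = Q * U0;                    % A is approximated by U*S*V'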
Slide 10
Outline
Introduction
Linear algebra preliminaries and intuition
The algorithms: SampleCols, AdaSample, RandProj
A comparison
Slide 11
Singular Value Decomposition
Any real rectangular matrix A can be factored into the form A = U Σ V^T.
Tall & skinny (m > n): U is m x n, Σ is n x n, V is n x n.
Short & fat (m < n): U is m x m, Σ is m x m, V is n x m.
Slide 12
SVD Properties
U and V are unitary matrices with mutually orthonormal columns. These columns are called the left and right singular vectors: the columns u_i of U are the left singular vectors, and the columns v_i of V are the right singular vectors.
Slide 13
SVD Properties
Σ contains the singular values on the diagonal, in order: σ_1 >= σ_2 >= ... >= 0, which correspond to the left and right singular vectors: A v_i = σ_i u_i and A^T u_i = σ_i v_i.
Slide 14
SVD Properties
For a vector x, the geometric interpretation of Ax = U Σ V^T x:
1. Rotate x by the rotation matrix V^T.
2. Scale the result along the coordinate axes by the σ_i.
3. Either discard n-m dimensions (m < n) or append m-n zero dimensions (m > n), to map R^n to R^m.
4. Rotate the result by U.
Slide 15
Fundamental Subspaces
Define range(A) as the set of all vectors which A maps to: range(A) = { y : y = Ax }, and null(A) as the set of vectors that A maps to zero: null(A) = { x : Ax = 0 }.
If A has rank k, then {u_1, ..., u_k} is an orthonormal basis for range(A).
Synonymous: range(A) is the linear subspace spanned by the columns of A.
Slide 16
Frobenius norm
Equivalent definitions: ||X||_F^2 = sum_ij X_ij^2 = tr(X^T X) = sum_i σ_i^2 (in Matlab: sum(X(:).^2)).
It is unitarily invariant: ||U X V||_F = ||X||_F for unitary U and V, since tr(XY) = tr(YX).
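A quick Matlab check of these equivalences (illustrative only):

X = randn(30, 20);
a = sum(X(:).^2);          % entrywise definition of ||X||_F^2
b = trace(X'*X);           % trace definition
c = sum(svd(X).^2);        % singular value definition
[Q,R] = qr(randn(30));     % a random orthogonal (unitary) matrix
d = sum(sum((Q*X).^2));    % unitary invariance: same value
% a, b, c, d all agree up to rounding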
Slide 17
The Optimal Low-Rank Solution
Theorem (Eckart-Young, 1936). Let A = U Σ V^T with singular values σ_1 >= σ_2 >= ... Then min_{rank(X) <= k} ||A - X||_F^2 is minimized by X = A_k = sum_{i=1}^k σ_i u_i v_i^T, with optimal value ||A - A_k||_F^2 = sum_{i>k} σ_i^2.
Slide 18
The Optimal Low-Rank Solution
Proof. First, the Frobenius norm is unitarily invariant. Thus ||A - X||_F^2 = ||U^T (A - X) V||_F^2 = ||Σ - U^T X V||_F^2. Since Σ is diagonal, we can restrict attention to X of the form X = U D V^T, so that U^T X V = D is diagonal too.
Slide 19
The Optimal Low-Rank Solution
Proof (continued). At this point, we have ||A - X||_F^2 = sum_i (σ_i - d_i)^2. Since rank(X) is at most k, at most k of the d_i can be non-zero, and they are best spent cancelling the k largest σ_i. Conclude: d_i = σ_i for i <= k.
In summary: X = A_k and ||A - A_k||_F^2 = sum_{i>k} σ_i^2.
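A quick numerical sanity check of the theorem in Matlab:

A = randn(50, 30);
[U,S,V] = svd(A);
k = 5;
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % the truncated SVD
sigma = diag(S);
norm(A - Ak, 'fro')^2                      % equals ...
sum(sigma(k+1:end).^2)                     % ... the tail sum of the sigma_i^2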
Slide 20
Orthogonal Projections
Projection of a matrix A onto an orthonormal basis Q: Q Q^T A.
Projection of a vector x onto a unit vector q: q (q^T x).
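In Matlab terms the matrix projection is Q*(Q'*A); a tiny check that projecting twice changes nothing (idempotence):

A = randn(100, 20);
[Q,R] = qr(randn(100, 5), 0);   % orthonormal basis of a random 5-dim subspace
P1 = Q*(Q'*A);                  % project A onto span(Q)
P2 = Q*(Q'*P1);                 % project the projection
norm(P1 - P2, 'fro')            % ~ 0 up to rounding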
Slide 21
Randomized Meta-algorithm
Input: matrix A (m x n), target rank k, number of samples s.
Output: Â, which approximately solves min_{rank(X) <= k} ||A - X||_F.
1. Form a lower-dimensional Y by sampling s rows and/or columns from A, or by applying s random projections. Y is m x s or s x s.
2. Compute the top k left singular vectors of Y to form the best rank-k basis Q for range(Y).
3. Project A onto the subspace spanned by Q: Â = Q Q^T A.
Slide 22
The Main Idea
Meta-algorithm: 1. Form Y from A randomly. 2. Get the optimal rank-k basis Q for the span of Y via SVD. 3. Project A onto Q.
Bounds: how far is Â = Q Q^T A from the optimal A_k?
Slide 23
Outline
Introduction
Linear algebra preliminaries and intuition
The algorithms: SampleCols, AdaSample, RandProj
A comparison
Sampling Rows & Columns
Simple idea: too much data to deal with? Subsample! But sample proportional to squared magnitude: uniformly sampled columns can be uninformative, while large-magnitude columns are more informative.
Slide 26
First pass: SampleCols
Input: matrix A (m x n), target rank k, number of samples s.
Output: Â, which approximately solves min_{rank(X) <= k} ||A - X||_F.
1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), proportional to their squared magnitude: p_i = ||A(:,i)||^2 / ||A||_F^2.
2. Form Y (m x s), with columns Y(:,j) = A(:,i_j) / sqrt(s p_{i_j}).
3. Compute Q = [q_1 ... q_k], the top k left singular vectors of Y.
4. Project A onto the subspace spanned by Q: Â = Q Q^T A.
[Frieze, Kannan & Vempala, 1998]
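SampleCols itself has no listing in the experiments appendix, so here is a minimal Matlab sketch under the notation above (randsample, from the Statistics Toolbox, stands in for the author's sample_from_weights helper; sampling is with replacement):

function [Q, Ahat] = sampleCols(A, k, s)
p = sum(A.^2, 1);                        % squared column magnitudes
p = p / sum(p);                          % sampling probabilities p_i
idx = randsample(numel(p), s, true, p);  % draw s column indices
Y = bsxfun(@times, A(:,idx), 1./sqrt(s*p(idx)));  % scale so E[Y*Y'] = A*A'
[U,S,V] = svds(Y, k);                    % top k left singular vectors of Y
Q = U;                                   % rank-k basis for range(Y)
Ahat = Q * (Q' * A);                     % project A onto span(Q)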
Slide 27
Running time: SampleCols
1. Sample s columns proportional to squared magnitude: O(mn) (one pass over A to compute column norms).
2. Form Y: O(ms).
3. Compute Q, the top k left singular vectors of Y: O(ms^2).
4. Project A onto the subspace spanned by Q.
Total: O(mn + ms^2).
Slide 28
Analysis: SampleCols
How different is A from Y on average? Exact: A A^T. Randomly using s columns: Y Y^T. Let's start by analyzing a randomized matrix-vector multiplication algorithm.
Slide 29
Analysis: SampleCols
Random matrix-vector multiplication. Exact: Ab = sum_i A(:,i) b_i. Randomly use s terms of the sum: let the random variable X take on value A(:,i) b_i / p_i with probability p_i. Then E[X] = sum_i p_i A(:,i) b_i / p_i = Ab, and Var[X] <= sum_i ||A(:,i)||^2 b_i^2 / p_i. How do we easily bound this variance? Set p_i = ||A(:,i)||^2 / ||A||_F^2; then Var[X] <= ||A||_F^2 ||b||^2.
Slide 30
Analysis: SampleCols
Handle on the variance of matrix-vector multiplication: with s samples, the variance gets s times better: E||Ab - X_s||^2 <= ||A||_F^2 ||b||^2 / s, where X_s averages s independent draws.
Let's extend the idea to matrix-matrix multiplication AB = sum_i A(:,i) B(i,:), randomly choosing column i from A and the corresponding row i from matrix B. Define the random variable Z = sum over s independent draws of A(:,i) B(i,:) / (s p_i). Then E[Z] = AB and E||AB - Z||_F^2 <= ||A||_F^2 ||B||_F^2 / s.
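A hedged Matlab illustration of this randomized matrix-matrix product (randsample again stands in for sample_from_weights; the sizes are arbitrary):

A = randn(200, 1000); B = randn(1000, 150);
s = 300;
p = sum(A.^2, 1); p = p / sum(p);        % p_i proportional to ||A(:,i)||^2
idx = randsample(numel(p), s, true, p);
% unbiased estimator: the average of A(:,i)*B(i,:) / p_i over s draws
AB_est = A(:,idx) * bsxfun(@times, B(idx,:), 1./(s * p(idx))');
norm(A*B - AB_est, 'fro') / norm(A*B, 'fro')  % error shrinks like 1/sqrt(s)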
Slide 31
Analysis: SampleCols
Handle on the variance of matrix-matrix multiplication: E||AB - Z||_F^2 <= ||A||_F^2 ||B||_F^2 / s.
Now, let's look at the variance when B = A^T, so that Z = Y Y^T. Plugging into the above, we have E||A A^T - Y Y^T||_F^2 <= ||A||_F^4 / s.
Slide 32
Analysis: SampleCols
When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 <= ||A||_F^4 / s.
We need one more lemma, which quantifies the distortion when projecting one matrix onto another's range.
Slide 33
Analysis: SampleCols
Given any two matrices A and B, let Q be a top-k basis of range(B). Then
||A - Q Q^T A||_F^2 <= ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F.
In words: the projection of A onto the best k-basis of B differs from the projection of A onto the best k-basis of A by at most the last term.
Slide 34
Analysis: SampleCols
Lemma (Distortion from sampling). When Y is a sampled version of A as in SampleCols, then E||A A^T - Y Y^T||_F^2 <= ||A||_F^4 / s.
Lemma (Distortion from projection). Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 <= ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F.
Taking the expectation of the second lemma w.r.t. the sampled columns, we obtain a bound for SampleCols:
E||A - Q Q^T A||_F^2 <= ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2.
One step further: SampleRowsCols
Input: matrix A (m x n), target rank k, number of samples s.
Output: Â, which approximately solves min_{rank(X) <= k} ||A - X||_F.
1. Sample s columns from A: A(:,i_1), ..., A(:,i_s), proportional to their squared magnitude. Scale and form Y (m x s).
2. Sample s rows from Y: Y(i_1,:), ..., Y(i_s,:), proportional to their squared magnitude. Scale and form W (s x s).
3. Compute V = [v_1 ... v_s], the top s right singular vectors of W.
4. Compute Q = [q_1 ... q_k], where q_i = Y v_i / ||Y v_i||.
5. Project A onto the subspace spanned by Q: Â = Q Q^T A.
[Frieze, Kannan & Vempala, 2004]
Slide 37
Running time: SampleRowsCols
1. Sample s columns from A and form Y: O(mn).
2. Sample s rows from Y and form W: O(ms).
3. Top s right singular vectors of W: O(s^3).
4. Compute Q = Y V (and normalize): O(ms^2).
5. Project A onto the subspace spanned by Q.
Total running time: O(mn + s^3).
Slide 38
Analysis: SampleRowsCols
(Figure: the algorithm's pipeline: sample columns, then sample rows; compute a basis for the rows of W; convert the row basis of W to a column basis of Y; project A onto Q. Compared against the exact solution and SampleCols.)
Slide 39
Analysis: SampleRowsCols
It turns out this additive error dominates the errors incurred from the other steps of the algorithm, i.e.:
Lemma (Good left projections from good right projections). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols. Then the error of projecting Y onto Q is bounded by the error of projecting W onto its top right singular vectors, plus the sampling distortion.
Thus, we can bound the algorithm as follows.
Theorem (SampleRowsCols average Frobenius error). SampleRowsCols finds a rank-k matrix Â such that E||A - Â||_F^2 <= ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2.
SampleCols is easy to break
The error is additive, and we have no control over ||A||_F^2. Consider data with just a few important points off the main mass: this data has a near-perfect rank-2 decomposition, but SampleCols will almost surely miss the outliers!
Slide 42
Improvement: AdaSample
Sample some columns and form a basis. The next round of sampling should be proportional to the residual: the part of A not captured by the current sample.
Slide 43
AdaSample
Input: matrix A (m x n), target rank k, number of samples s.
Output: Â, which approximately solves min_{rank(X) <= k} ||A - X||_F.
1. Start with an empty sample set S = { }, E := A.
2. For t = 1 to T:
   a. Pick a subset S_t of s rows of A, with row i chosen with probability proportional to ||E(i,:)||^2.
   b. Update S := S ∪ S_t.
   c. Update E := A - proj_span(S)(A).
3. Return Â, the best rank-k approximation of proj_span(S)(A).
[Deshpande, Rademacher, Vempala & Wang, 2006]
Slide 44
Analysis: AdaSample
Let's look at one iteration of AdaSample.
Lemma. Let A be m x n and L a linear subspace of R^n. Let E = A - proj_L(A), and let S be a collection of s rows of A sampled proportional to ||E(i,:)||^2. Then E_S ||A - proj_{L + span(S), k}(A)||_F^2 <= ||A - A_k||_F^2 + (k/s) ||E||_F^2, where proj_{L + span(S), k}(A) is the optimal rank-k projection of A onto the subspace L + span(S).
The proof is similar to the derivation of the bound on SampleCols.
Slide 45
Analysis: AdaSample
Proof: apply the single-iteration lemma with L = span(S) at each round; the residual term contracts by a factor of k/s per round. Applying this lemma T times:
E||A - Â||_F^2 <= (1 + (k/s) + (k/s)^2 + ...) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2 <= (1 / (1 - k/s)) ||A - A_k||_F^2 + (k/s)^T ||A||_F^2.
Slide 46
Running Time: AdaSample
At iteration t, we need to:
1. Extend the orthonormal basis for S to an orthonormal basis for S ∪ S_t. Orthogonalizing s new vectors against the st existing orthogonal n x 1 vectors takes O(nts^2).
2. Project A onto the new portion of the basis spanned by S_t, which takes O(mns).
So the iterative process takes O(mnsT + ns^2 T^2). Finally, we need to compute singular vectors to obtain our final rank-k basis, taking O(ms^2 T^2).
Total running time: O(mnsT + s^2 T^2 (m + n)).
Recap so far
The previous algorithms attempted to capture range(A) by (intelligently) sampling rows and/or columns. What if, instead, we probed A with a variety of vectors to get a feel for range(A)?
Slide 49
RandProj: Geometric intuition
Slide 50
RandProj
Input: matrix A (m x n), target rank k, number of samples s.
Output: Â, which approximately solves min_{rank(X) <= k} ||A - X||_F.
1. Draw a random test matrix Ω (n x s) with i.i.d. Gaussian entries.
2. Form Y = A Ω.
3. Compute an orthonormal basis Q for the range of Y via SVD.
4. Return Â = Q Q^T A.
[Halko, Martinsson & Tropp, 2010]
Slide 51
Running time: RandProj
1. Draw a random test matrix Ω: O(ns).
2. Form Y = A Ω: O(mns).
3. Compute an orthonormal basis Q for the range of Y via SVD: O(ms^2).
4. Return Â = Q Q^T A.
Total running time: O(mns + ms^2).
Slide 52
Analysis: RandProj
Partition the SVD of A like so: A = U [Σ_1 0; 0 Σ_2] [V_1 V_2]^T, where Σ_1 carries the top k singular values. Let Ω_1 = V_1^T Ω and Ω_2 = V_2^T Ω. Then
||A - Q Q^T A||_F^2 <= ||Σ_2||_F^2 + ||Σ_2 Ω_2 Ω_1^+||_F^2:
the optimal error, plus an extra cost proportional to the tail singular values (wasted sampling).
Slide 53
Analysis: RandProj
Take expectations w.r.t. Ω.
Lemma (Gaussian matrix properties). Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed: E||B G C||_F^2 = ||B||_F^2 ||C||_F^2 and E||G^+||_F^2 = k / (s - k - 1).
Lemma (Random projection error bound). E||A - Q Q^T A||_F^2 <= (1 + k / (s - k - 1)) sum_{j>k} σ_j^2.
Slide 54
Comparing the algorithms

Method            | Running time            | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
AdaSample         | O(mnsT + s^2 T^2 (m+n)) | 2T       | (1 - k/s)^-1 ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
Exact partial SVD | O(mnk)                  | O(k)     | ||A - A_k||_F^2
Slide 55
RandProj refinement
Can combine RandProj with power iterations: use Y = (A A^T)^q A Ω instead of Y = A Ω.
This drives the noise from the tail n-k singular values down exponentially fast: the sketch sees singular values σ_j^(2q+1) instead of σ_j.
But we pay in running time and number of passes: O((q+1)mns + ms^2) and 2q passes through the data.
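For instance, using the randProj(A,s,q) listing from the experiments slide, a small synthetic demo (illustrative numbers only) of how the approximation error falls with q:

A = randn(2000,50)*randn(50,800) + 0.5*randn(2000,800);  % rank 50 plus noise
for q = 0:3
  [U,S,V] = randProj(A, 60, q);
  err(q+1) = norm(A - U*S*V', 'fro');  % drops quickly toward the optimum as q grows
end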
Slide 56
RandProj refinements (2)
Can combine RandProj with structured random matrices: subsampled random Fourier transform (SRFT) matrices. Compute Y = A Ω using the FFT in O(mn log s) instead of the standard O(mns). Difficult to prove bounds, but works as well as Gaussian in practice.
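A hedged Matlab sketch of one common SRFT construction, Ω = sqrt(n/s) D F R, with D a diagonal of random signs, F the unitary DFT, and R a random subset of s columns. This plain-fft version costs O(mn log n) rather than the O(mn log s) of a subsampled FFT, and the sketch Y is complex; the function name is ours:

function Y = srftSketch(A, s)
[m, n] = size(A);
d = sign(randn(n, 1));              % D: random +/-1 signs
AD = bsxfun(@times, A, d');         % scale the columns of A by the signs
ADF = fft(AD, [], 2) / sqrt(n);     % unitary DFT applied to each row: (A*D)*F
idx = randperm(n); idx = idx(1:s);  % R: keep s columns at random
Y = sqrt(n/s) * ADF(:, idx);        % the m x s sketch Y = A*Omega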
Slide 57
Comparing the algorithms

Method            | Running time            | # Passes | Error w.h.p.
SampleCols        | O(mn + ms^2)            | 2        | ||A - A_k||_F^2 + 2 sqrt(k/s) ||A||_F^2
SampleRowsCols    | O(mn + s^3)             | 2        | ||A - A_k||_F^2 + O(sqrt(k/s)) ||A||_F^2
AdaSample         | O(mnsT + s^2 T^2 (m+n)) | 2T       | (1 - k/s)^-1 ||A - A_k||_F^2 + (k/s)^T ||A||_F^2
RandProj          | O(mns + ms^2)           | 2        | (1 + k/(s-k-1)) ||A - A_k||_F^2
RandProj+power    | O((q+1)mns + ms^2)      | 2q       | approaches ||A - A_k||_F^2 exponentially in q
Exact partial SVD | O(mnk)                  | O(k)     | ||A - A_k||_F^2
Slide 58
Outline
Introduction
Linear algebra preliminaries and intuition
The algorithms: SampleCols, AdaSample, RandProj
A comparison
Slide 59
Which one is best?
Fix k and s, and assume the bounds are tight.
(Figure: error vs. time, with a horizontal line at the optimal error ||A - A_k||_F. Points: SampleRowsCols at O(mn + s^3); SampleCols at O(mn + ms^2); AdaSample at O(mnsT + s^2 T^2 (m+n)), improving with rounds of adaptive sampling; RandProj at O((q+1)mns + ms^2), improving with power iterations.)
Slide 60
Caveats
The bounds are not tight. The relative scaling of the time-vs-error axes depends on m, n, s, k, and the spectrum of A. So which is best in practice?
Slide 61
Experiments
Matlab implementations of the three algorithms (sample_from_weights is the author's helper for weighted sampling; a few small bugs in the listing are fixed and noted in comments):

function [U,S,V] = adaSample(A,s,T)
S = []; E = A;
for t = 1:T
  % sample columns proportional to the squared magnitude of the residual
  p = sum(E.^2,1);                  % fixed: was sqrt(...), i.e. magnitude
  p = p / sum(p);
  idx = sample_from_weights(p,s);
  St = A(:,idx);
  S = [S St];
  % orthogonalize & project
  [Q,R] = qr(S,0);
  proj_A_on_Qt = Q*(Q'*A);
  E = A - proj_A_on_Qt;
end
% wrap up (S is reused here for the singular values)
B = Q'*A;
[U,S,V] = svds(B,s*T);
U = Q*U;

function [U,S,V] = randProj(A,s,q)
Omega = randn(size(A,2),s);
Y = A*Omega;
for i = 1:q                         % q power iterations (q = 0: plain RandProj)
  Y = A'*Y; Y = A*Y;
end
[Q,R] = qr(Y,0);
B = Q'*A;
[U0,S,V] = svds(B,s);
U = Q*U0;

function [U,S,V] = sampleColsRows(A,s)
% subsample: s columns of A, then s rows of Y
Y = subsample(A,s);                 % m x s
W = subsample(Y',s)';               % s x s (fixed: sample rows, not columns, of Y)
% SVD of W'*W gives the right singular vectors of W
[Uw,S,Vw] = svd(W'*W);
S = sqrt(S);
U = Y*Vw;                           % convert row basis of W to column basis of Y
[U,R] = qr(U,0);                    % fixed: economy-size QR
V = (U'*A)';
V = bsxfun(@times,V,1./sqrt(sum(V.^2)));

function [Xsub] = subsample(X,s)
p_col = sum(X.^2,1);                % fixed: was sqrt(sum(A.^2)); use X, squared
p_col = p_col / sum(p_col);
colidx = sample_from_weights(p_col,s);
Xsub = X(:,colidx);
Xsub = bsxfun(@times,Xsub,...
              1./sqrt(p_col(colidx)))/sqrt(s);
Slide 62
Experiments
Eigenfaces from Labeled Faces in the Wild: 13,233 images, each 96x96 pixels, collected online in 2007. A is a 13233 x 9216 matrix, 975.6MB in double precision.
Slide 63
Eigenfaces examples
We want the top k = 25 eigenfaces. (Figure: A factors into per-image loadings times a k-dimensional face basis over pixels.)
Slide 64
Time (seconds): the exact solution takes ~4.5 minutes via Matlab's svds(A,25).
Slide 65
Time (seconds)
(Plot: approximation error vs. time for RandProj, SampleRowsCols, and AdaSample, all running within ~4 minutes.)
Slide 66
log(Time)
(Plot: the same comparison of RandProj, SampleRowsCols, and AdaSample on a log time axis.)
Slide 67
log(Time)
(Plot: adding RandProj + power iterations with q = 1, 2, 3: comparable error in 4.6 secs!)
Slide 68
Summary
Exact eigenfaces in 4+ minutes. Approximate eigenfaces in ~4 seconds (75 random projections + 1 power iteration).
Slide 69
Conclusion
The classical truncated SVD is ill-suited for large datasets. Randomized algorithms allow an error-vs-computation tradeoff, require only a few passes through the data, and are simple and robust.
Slide 70
Thanks!
Slide 71
(Matlab code for randProj, sampleColsRows, and adaSample: see slide 61.)
LFW: svds takes ~8GB of memory and uses up to all 8 cores; 263 secs for k = 25 on X = 13233 x 9216 (96x96 images).
Slide 72
Classical Algorithm: Truncated SVD
The full SVD of an m x n matrix takes O(mn min(m,n)) time. If we only care about the top k singular values and vectors, it takes O(mnk), via the power method:
1. Multiply k random vectors by A and A^T.
2. Orthogonalize.
3. Repeat until convergence.
The random vectors converge to the singular vectors.
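A minimal Matlab sketch of this block power method, under the simplifying assumption of a fixed iteration count rather than a convergence test (the name powerSVD is ours):

function [U,S,V] = powerSVD(A, k, iters)
[Q,R] = qr(randn(size(A,2), k), 0);   % k random starting vectors in R^n
for t = 1:iters                       % each iteration: one pass with A, one with A'
  [Q,R] = qr(A' * (A * Q), 0);        % multiply and re-orthogonalize
end
B = A * Q;                            % restrict A to the converged subspace
[U,S,W] = svd(B, 'econ');             % small SVD: B is m x k
V = Q * W;                            % A is approximated by U*S*V'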
Slide 73
Matrix Norms
Frobenius norm: ||X||_F^2 = sum_ij X_ij^2 (in Matlab: sum(X(:).^2)).
L2 operator norm, a.k.a. spectral norm: ||X||_2 = max_{||v|| = 1} ||X v|| = σ_1.
Slide 74
Formulation: the fixed-rank problem
minimize ||A - X||_F subject to rank(X) <= k.
The feasible set is non-convex, but a global optimum exists.
Analysis: SampleCols
Given any two matrices A and B, let Q be a basis of the top k left singular vectors of B. Then ||A - Q Q^T A||_F^2 <= ||A - A_k||_F^2 + 2 sqrt(k) ||A A^T - B B^T||_F.
Proof. Easy to show by manipulating the trace definition of the Frobenius norm.
Slide 77
Analysis: SampleRowsCols
Apply the distortion lemma we've already seen twice: once for the sampled columns (Y from A) and once for the sampled rows (W from Y). When Y is a sampled version of A as in SampleCols, E||A A^T - Y Y^T||_F^2 <= ||A||_F^4 / s, and likewise for W and Y.
We need one more piece of glue, relating right projections (of W) to left projections (of Y). Let q_i = Y v_i / ||Y v_i|| as in SampleRowsCols; then a basis of good right singular vectors of W yields, through Y, a comparably good left basis for Y.
Slide 78
Analysis: RandProj
Let G be a zero-mean, unit-variance Gaussian matrix of size k x s, and B, C fixed matrices of agreeable dimensions. Then E||B G C||_F^2 = ||B||_F^2 ||C||_F^2 and E||G^+||_F^2 = k / (s - k - 1).
Slide 79
Analysis: RandProj
Lemma (Random projection error bound). Taking expectations w.r.t. Ω_1 and Ω_2, and using some properties of expectations of Gaussian random matrices, we can bound the algorithm: RandProj finds a rank-k matrix Â such that E||A - Â||_F^2 <= (1 + k / (s - k - 1)) sum_{j>k} σ_j^2.