
EE546/Stat593C, “Sparse Representations: Theory, Algorithms, Applications”

Maryam Fazel, EE and Marina Meila, Statistics

Spring Quarter 2010, Univ. of Washington


Course logistics

• Tues/Thurs 4:30-5:45pm, EEB 042

• Course webpage: http://www.ee.washington.edu/class/546/2010spr/

• Course requirements:

– reading papers from the literature; 1-2 homework sets.
– final project: 30 min presentation on themes related to the course. Can be a literature review, or an application of course topics to your research. Topic proposal due 1st week of May; presentations in the last 2 weeks of class.

• Grading: based on homework, project, and participation.

• Prerequisites: probability theory and statistics, convex optimization (duality, linear and semidefinite programming).

1


Motivation

• Many applications in signal processing: medical imaging (MRI), new imaging systems

• Recommender systems, e.g. Netflix problem

• Face recognition (website/examples–Yi Ma, UIUC)

• System identification and control

• Video processing (website/demo–Mario Sznaier, NEU)

• Quantum tomography

• Network tomography and inference

• Machine learning, dimensionality reduction

active research area: many blogs, tutorials, paper repositories

2


Central mathematical problems

• sparse vector recovery (compressed sensing and recovery of sparse signals)

• low-rank matrix recovery; matrix completion problem

• many extensions, variations

• towards a unified framework

3


Problem 1: Sparse vector recovery

minimize Card(x)
subject to Ax = b

where x ∈ Rn, Card(x) is the cardinality (number of nonzero entries) of x, and A ∈ Rp×n with p ≪ n

• meaning: find ‘simplest’ x consistent with constraints

• NP-hard [Natarajan’95]; all known exact methods require enumeration

a popular heuristic:

minimize ‖x‖1

subject to Ax = b

• works empirically

• why ℓ1 norm?

[figure: the ℓ1 penalty |xi| plotted against xi]

4


Compressed sensing

framework for measurement and recovery of sparse signals [Candes,Romberg,Tao’04; Donoho’04; Tropp’04; Wakin,et al’06; Baraniuk,et al’07; . . . ]

idea: signal x ∈ Rn is k-sparse,

measurements: b = Ax, A ∈ Rp×n, p ≪ n

seminal result [Candes,Romberg,Tao’04]:

• if x is sparse enough, it can be recovered from a small number of random measurements with high probability

• recovery: ℓ1 minimization

huge impact on acquisition, transmission, and analysis of signals... (many natural signals are sparse in time, frequency, wavelet basis, etc.)
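the ℓ1 recovery step is a linear program; a sketch of the standard reformulation min Σ ti subject to −t ≤ x ≤ t, Ax = b, using SciPy (the helper name l1_min_lp is ours):

import numpy as np
from scipy.optimize import linprog

def l1_min_lp(A, b):
    """Basis pursuit as an LP over z = [x; t]: min sum(t), -t <= x <= t, Ax = b."""
    p, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])        #  x - t <= 0  and  -x - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((p, n))])     # Ax = b; t enters only the bounds
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=bounds, method="highs")
    return res.x[:n]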

5


Problem 2: (affine) rank minimization problem

find low-rank matrix given limited info (e.g., a few linear ‘measurements’):

minimize Rank(X)
subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.

can also write as A vec(X) = b with A ∈ Rp×mn (see the sketch below).

Rank(X) = dimension of the span of the rows (or columns)
= number of nonzero singular values
= smallest r s.t. X = LR^T where L is m × r and R is n × r

meaning: notions of order, degrees of freedom, complexity
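to make the identification A(X) = A vec(X) concrete, a small numpy sketch (the measurement matrices Ai and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 6, 10
A_list = [rng.standard_normal((m, n)) for _ in range(p)]   # measurement matrices
X = rng.standard_normal((m, n))

# A(X)_i = <A_i, X> = Tr(A_i^T X); the same map as a p x mn matrix acting on vec(X)
b = np.array([np.sum(Ai * X) for Ai in A_list])
A_mat = np.stack([Ai.ravel() for Ai in A_list])            # row i is vec(A_i)
assert np.allclose(b, A_mat @ X.ravel())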

6


Problem 2: rank minimization problem

find low-rank matrix given limited info (e.g., a few linear ‘measurements’):

minimize Rank X

subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.
NP-hard, nonconvex problem.

examples:

• system ID: given input u and output y (e.g., a step response), find the minimum state-space dimension

• distance geometry: given pairwise distances dij, find the minimum embedding dimension

• machine learning: users/movies database with missing ratings (5 ? 4 . . . ?); assuming a small number of features, predict the unknown entries

special case: finding sparse vector (if X diagonal)

7


special case: finding sparse vector (if X diagonal)

when X = diag(x1, x2, . . . , xn), Rank X = # of nonzero xi:

rank minimization reduces to finding the sparsest vector in C

question: is there a norm analogous to ℓ1 for the matrix rank problem?

8


Outline

for each central problem (sparsity, rank), we look at

• theory: what relaxation/heuristic? when does it provably work?

• algorithms:

– convex relaxation: ℓ1 minimization (e.g. basis pursuit), matrix nuclear norm/trace; algorithms to solve the resulting LP/SDP.

– greedy algorithms: matching pursuit, many variations

• applications: signal sensing and processing, recommender systems, network tomography, . . .

• extensions: other related problems, concepts. general notions of parsimony.

theory tools: ideas from geometry of normed spaces, randomization, optimization and duality

9


Some themes

A general approach to modeling and exploiting parsimony:

• pick appropriate notion of parsimony

• find a “natural” convex heuristic

• use probabilistic analysis to prove the heuristic works

• give efficient algorithms to solve the heuristic

10


Preview: some results on matrix rank

To give an idea of the type of problems and results we’ll discuss in the course, let’s look at the rank minimization problem

minimize Rank X

subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.

rest of today’s talk: focus on this problem

11


Trace Minimization, X ⪰ 0

minimize Tr X
subject to X ∈ C,
X = X^T ⪰ 0

(Rank X = # of nonzero λi’s; Tr X = ∑i λi(X))

[figure: Rank X counts the nonzero eigenvalues λi(X); Tr X sums them]

• convex problem, used often in practice [e.g.,Pare’00,Beck’96,’99]

• variation/improvement: iterative weighted trace minimization
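a minimal cvxpy sketch of the trace heuristic, taking C to be a set of linear measurements (the synthetic data and sizes are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 60
L = rng.standard_normal((n, 3))
X0 = L @ L.T                                   # rank-3 PSD target
A_list = [rng.standard_normal((n, n)) for _ in range(p)]
b = np.array([np.trace(Ai.T @ X0) for Ai in A_list])

X = cp.Variable((n, n), PSD=True)              # X = X^T >= 0
cons = [cp.trace(Ai.T @ X) == bi for Ai, bi in zip(A_list, b)]
prob = cp.Problem(cp.Minimize(cp.trace(X)), cons)
prob.solve()                                   # X.value is typically near X0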

12


Nuclear Norm Minimization

Extension: minimizing sum of singular values:

minimize ‖X‖∗
subject to X ∈ C

‖X‖∗ = ∑i σi(X) is called the nuclear norm (trace norm, Schatten 1-norm) of X, where σi(X) = √λi(X^T X);

its dual norm is the operator norm ‖X‖ = σmax(X)

• for X = diag(x), reduces to minimizing ‖x‖1 = ∑|xi|; the well-known ℓ1 heuristic for finding sparse vectors

• useful in generalizing sparse recovery and ℓ1 results to matrix rank

[figure: the ℓ1 penalty |xi| plotted against xi]

13


Connection to compressed sensing

Compressed sensing: a framework for measurement/recovery of sparse signals [Candes,Tao’04; Donoho’04; Wakin,et al’06; Baraniuk,et al’07; many others. . . ]

• signal x ∈ Rn has k nonzeros (k-sparse)

• measurements: b = Ax, A ∈ Rp×n, p ≪ n

• underdetermined. . . but for some (random) A, the k-sparse solution is unique and coincides with the min ℓ1 norm solution, with high probability

• recovery: ℓ1 minimization

What if object of interest is a low-rank matrix?

Examples: matrix completion from partial data (e.g., Netflix problem); Hankel system identification; quantum tomography

14


When does nuclear norm heuristic work?

minimize Rank X

subject to A(X) = b

Several key concepts from compressed sensing can be generalized:

• restricted isometry of A
• spherical section property for nullspace of A
• incoherence properties

Nuclear norm minimization “works” under these conditions...

15


A dictionary:

object                   x ∈ Rn               X ∈ Rm×n
parsimony concept        cardinality          rank
constraints              Ax                   A(X)
Hilbert space norm       ℓ2                   Frobenius
sparsity-inducing norm   ℓ1                   nuclear
dual norm                ℓ∞                   operator
recovery problem         linear programming   semidefinite programming

(Frobenius norm: ‖X‖F = √(∑i,j Xij^2) = √(∑i σi^2(X)) )

general “recipe”: (for proving heuristics work)

• give a deterministic condition on A for heuristic to give exact solution

• sometimes (often!) condition is (NP-)hard to check. . .

• invoke randomness of problem data: random A satisfies the condition with high probability

16


Restricted Isometry Property (RIP)

How does A behave when acting on low-rank matrices?

Def. For A : Rm×n → Rp, the restricted isometry constant δr(A) is the smallest δr such that

1 − δr(A) ≤ ‖A(X)‖2 / ‖X‖F ≤ 1 + δr(A)

holds for all matrices X of rank up to r.
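computing δr exactly is intractable, but sampling random rank-r matrices gives a quick lower bound on it; a sketch (assumes A_mat is normalized so that ‖A(X)‖2 ≈ ‖X‖F on average, e.g., iid N(0, 1/p) entries):

import numpy as np

def delta_r_lower_bound(A_mat, m, n, r, trials=2000, seed=0):
    """Monte Carlo lower bound on the rank-r restricted isometry constant
    of the map X -> A_mat @ vec(X), via random rank-r test matrices."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        X = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
        ratio = np.linalg.norm(A_mat @ X.ravel()) / np.linalg.norm(X, "fro")
        worst = max(worst, abs(ratio - 1.0))
    return worst          # the true delta_r is at least this large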

17


Restricted Isometry Property (RIP)


Let Rank X0 = r and A(X0) = b. Let X̂ be given by the heuristic.

Theorem. (exact recovery) If certain δ(A)’s are small enough, X0 is the unique solution with rank at most r, and X̂ = X0.

(What linear maps satisfy this? we’ll see it holds for iid Gaussian, iid Bernoulli,. . . )

18


Guaranteed minimum rank solution via nuclear norm

Theorem. Pick A ∈ Rp×mn “randomly” (e.g., iid Gaussian or Bernoulli). Then exact recovery happens with very high probability if

p ≥ c0 r(m + n) log(mn)

(here r(m + n) ≈ the low-rank degrees of freedom, and mn is the ambient dimension)

meaning: if X0 has low enough rank, then given a ‘small’ number of random constraints, the nuclear norm heuristic finds X0 with high probability.

[Recht,Fazel,Parrilo’07]

(recent improvement: O(nr) measurements enough!)

19


Numerical Example

total pixels = 46 × 81 = 3726, rank = 5

[figure: recovered images from 700, 1100, and 1250 constraints]

• random Gaussian measurements

• solve SDP (just using SeDuMi here; see the sketch below)
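a sketch of this experiment at smaller sizes (a generic solver handles the slide’s 46 × 81, rank-5 case too, just slowly; the parameters below are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 15, 20, 2
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T   # rank-2 target

for p in (80, 140, 200):                     # numbers of Gaussian measurements
    A_list = [rng.standard_normal((m, n)) for _ in range(p)]
    b = np.array([np.sum(Ai * X0) for Ai in A_list])
    X = cp.Variable((m, n))
    cons = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(A_list, b)]
    cp.Problem(cp.Minimize(cp.normNuc(X)), cons).solve()
    # expect failure below ~2r(m + n - r) = 132 measurements, exact recovery above
    print(p, np.linalg.norm(X.value - X0, "fro"))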

20


Recovery errors:

[figure: error in Frobenius norm vs. number of measurements (700-1500, with a zoom on 1050-1300), for gaussian, projection, binary, and sparse binary measurement ensembles]

• sharp transition near 1200 measurements (≈ 2r(m + n − r))

21


Phase transition: series of experiments for various n, r, p

n = 40, p runs from 0 to n^2; for fixed (n, p), r covers all values satisfying r(2n − r) ≤ p; for each (n, p, r), generate 10 random cases for Y0 and solve

min ‖X‖∗
subject to A vec(X) = A vec(Y0)

[figure: fraction of exact recoveries over the grid, x-axis p/n^2, y-axis r(2n − r)/p, showing a phase transition]

(analogous to [Donoho,Tanner’05])
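the experiment loop as a sketch, reusing the nuclear_norm_min helper sketched on page 14 (n is reduced from 40 to keep a generic solver tractable; the success threshold is our choice):

import numpy as np

n, trials = 12, 10
rng = np.random.default_rng(7)
grid = {}
for p in range(10, n * n + 1, 10):
    for r in range(1, n + 1):
        if r * (2 * n - r) > p:
            break                            # skip cells where d.o.f. exceed measurements
        hits = 0
        for _ in range(trials):
            Y0 = rng.standard_normal((n, r)) @ rng.standard_normal((n, r)).T
            A_list = [rng.standard_normal((n, n)) for _ in range(p)]
            b = np.array([np.sum(A * Y0) for A in A_list])
            X = nuclear_norm_min(A_list, b, n, n)       # helper sketched on p.14
            hits += np.linalg.norm(X - Y0, "fro") < 1e-3 * np.linalg.norm(Y0, "fro")
        grid[(p / n**2, r * (2 * n - r) / p)] = hits / trials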

22


Error bounds for noisy and approximate recovery

things work fine even with noisy constraints and approximately low-rank X. . .

y = A(X) + z, ‖z‖2 ≤ β

Theorem. If A satisfies the RIP with small δ’s, then

‖X̂ − X‖F ≤ (c0/√r) ‖X − Xr‖∗ + c1 β

and if β = 0, we have

‖X̂ − X‖∗ ≤ c2 ‖X − Xr‖∗

where X̂ is the nuclear norm solution, Xr is the best possible rank-r approximation to X, and c0, c1, c2 are constants.
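the noisy variant replaces the equality constraint with a norm ball; a minimal sketch with synthetic data (sizes and β are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
m, n, r, p, beta = 10, 12, 2, 100, 0.1
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
A_list = [rng.standard_normal((m, n)) for _ in range(p)]
z = rng.standard_normal(p)
z *= beta / (2 * np.linalg.norm(z))                      # noise with ||z||_2 <= beta
y = np.array([np.sum(Ai * X0) for Ai in A_list]) + z     # noisy measurements

X = cp.Variable((m, n))
resid = cp.hstack([cp.sum(cp.multiply(Ai, X)) for Ai in A_list]) - y
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), [cp.norm(resid, 2) <= beta])
prob.solve()                                             # error ||X.value - X0||_F = O(beta)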

23


A nullspace condition

Another sufficient condition for exact recovery:

Def. a subspace V ⊂ Rm×n satisfies the ∆-spherical section property if

‖Z‖∗ / ‖Z‖F ≥ √∆ for all Z ∈ V, Z ≠ 0

∆ lower bounds Rank Z over V: ∆ large ⇒ V doesn’t include low-rank matrices

vector case: for subspaces of Rn [Kashin’77],[Gluskin,Garnaev’84]

also used in compressed sensing e.g., [Kashin,Temlyakov’07],[Zhang’08],[Vavasis’09]

24


vector case: consider ‖x‖1/‖x‖2 ≥ √∆ for all x in the subspace;

∆ large means that if the ℓ1 unit ball is cut by the subspace V, the intersection looks “spherical”

intuitively: random subspaces should have large ∆. . .
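∆ for the nullspace of a random A can be probed numerically; a sketch (random sampling only upper-bounds the infimum over V, so this is an optimistic estimate; sizes are ours):

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
m, n, p = 10, 10, 60
A_mat = rng.standard_normal((p, m * n))
V = null_space(A_mat)                     # orthonormal basis of Null(A), mn x (mn - p)

ratios = []
for _ in range(5000):
    Z = (V @ rng.standard_normal(V.shape[1])).reshape(m, n)   # random nullspace element
    ratios.append(np.linalg.norm(Z, "nuc") / np.linalg.norm(Z, "fro"))
print(min(ratios) ** 2)                   # empirical upper bound on Delta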

25


Exact recovery

Let r = Rank X0, b = A(X0), m̄ = min{m, n}, and

X̂ := argmin ‖X‖∗ subject to A(X) = b.

Suppose the nullspace of A, Null(A), has the ∆-spherical section property.

Theorem. (exact recovery) If r < ∆/2, X0 is the only matrix of rank at most r satisfying the constraints. If r < ∆/6, then X̂ = X0.

When does spherical section property hold? iid Gaussian. Others?

summary: spherical section property deals directly with the nullspace; gives simpler proofs than RIP.

26


Algorithms for minimizing ‖X‖∗

Semidefinite program and its dual:

primal:
  minimize (over X, Y, Z)  Tr Y + Tr Z
  subject to  [ Y    X ]
              [ X^T  Z ]  ⪰ 0,    A(X) = b

dual:
  maximize (over z)  b^T z
  subject to  [ I_m       A∗(z) ]
              [ A∗(z)^T   I_n   ]  ⪰ 0
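a cvxpy sketch of the primal, packing Y, X, Z into a single PSD block W = [[Y, X], [X^T, Z]] (a standard modeling trick; sizes and data are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
m, n, p = 8, 10, 50
X0 = rng.standard_normal((m, 2)) @ rng.standard_normal((n, 2)).T   # rank-2 target
A_list = [rng.standard_normal((m, n)) for _ in range(p)]
b = np.array([np.sum(Ai * X0) for Ai in A_list])

W = cp.Variable((m + n, m + n), PSD=True)    # W = [[Y, X], [X^T, Z]] >= 0
X = W[:m, m:]                                # the off-diagonal block is X
cons = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(A_list, b)]
prob = cp.Problem(cp.Minimize(cp.trace(W)), cons)   # Tr W = Tr Y + Tr Z
prob.solve()

at the optimum Tr W = 2‖X‖∗, so the X block solves nuclear norm minimization.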

• custom interior point methods (e.g., [Liu,Vandenberghe’08])

• subgradient methods, proximal gradients (e.g., [Ma,Goldfarb,Chen’09])

• Singular Value Thresholding (SVT) [Cai,Candes,Shen’08] (see the sketch below)

• low-rank factorization (e.g., SDPLR [Burer,Monteiro’05])

• Alternating Directions Method (ADM); Augmented Lagrangian Method (ALM)
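a minimal sketch of the SVT iteration for matrix completion (parameters are illustrative; see [Cai,Candes,Shen’08] for the recommended choices of tau and step size delta):

import numpy as np

def svt_complete(M_obs, mask, tau, delta, iters=500):
    """Singular Value Thresholding: Y <- Y + delta * P_Omega(M - D_tau(Y))."""
    Y = np.zeros_like(M_obs)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrinkage operator D_tau(Y)
        Y += delta * mask * (M_obs - X)              # ascent step on observed entries
    return X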

27


other (non-SDP) algorithms:

• greedy algorithms [Lee,Bresler’09]

• special case algorithms for matrix completion [Keshavan,Oh,Montanari’09]

• Singular Value Projection (SVP) [Meka,et al’09]

choice of algorithm depends on application: problem size, accuracy required, specific structure, . . .

28


A summary of matrix rank minimization

• Rank minimization problem is NP-hard in general; many applications

• Convex heuristic: Nuclear norm minimization, variations

• For affine rank minimization with certain (random) constraints: theoretical guarantees for exact solution

• A generalization of vector sparsity and compressed sensing theory; has opened the door to a new set of applications and new links between areas

29


an active research area:

A generalization of vector sparsity and compressed sensing theory; has opened the door to a new set of applications and new links between areas:

• low-rank matrix completion: e.g., recommendation systems [Candes,Recht’08; Candes,Tao’09; Keshavan,et al’09]

• low-rank+sparse decompositions: e.g., graphical models; matrix rigidity theory [Chandrasekaran,et al’09], robust PCA and face recognition [Wright,Ma,et al’09], [Candes,Ma,et al’09]

• graph problems: e.g., some max-clique problems [Ames,Vavasis’09]

• Hankel rank minimization and system identification [Liu,Vandenberghe’09], [Mohan,Fazel’09]

• ...and more to come!

30