
EE546/Stat593C, “Sparse Representations: Theory, Algorithms, Applications”

Maryam Fazel, EE and Marina Meila, Statistics

Spring Quarter 2010, Univ. of Washington


Course logistics

• Tues/Thurs 4:30-5:45pm, EEB 042

• Course webpage: http://www.ee.washington.edu/class/546/2010spr/

• Course requirements:

– reading papers from the literature; 1-2 homework sets.
– final project: 30 min presentation on themes related to the course. Can be a literature review, or an application of course topics to your research. Topic proposal due 1st week of May; presentations in the last 2 weeks of class.

• Grading: based on homework, project, and participation.

• Prerequisites: probability theory and statistics, convex optimization (duality, linear and semidefinite programming).

1


Motivation

• Many applications in signal processing: medical imaging (MRI), new imaging systems

• Recommender systems, e.g. Netflix problem

• Face recognition (website/examples–Yi Ma, UIUC)

• System identification and control

• Video processing (website/demo–Mario Sznaier, NEU)

• Quantum tomography

• Network tomography and inference

• Machine learning, dimensionality reduction

active research area: many blogs, tutorials, paper repositories

2


Central mathematical problems

• sparse vector recovery (compressed sensing and recovery of sparse signals)

• low-rank matrix recovery; matrix completion problem

• many extensions, variations

• towards a unified framework

3


Problem 1: Sparse vector recovery

minimize Card(x)
subject to Ax = b

where x ∈ Rn, Card(x) is the cardinality (number of nonzero entries) of x, and A ∈ Rp×n with p ≪ n

• meaning: find ‘simplest’ x consistent with constraints

• NP-hard [Natarajan’95]; all known exact methods require enumeration

a popular heuristic:

minimize ‖x‖1

subject to Ax = b

• works empirically

• why ℓ1 norm?

[figure: the ℓ1 penalty |xi| plotted against xi]

4


Compressed sensing

framework for measurement and recovery of sparse signals [Candes,Romberg,Tao’04; Donoho’04; Tropp’04; Wakin,et al’06; Baraniuk,et al’07; . . . ]

idea: signal x ∈ Rn is k-sparse,

measurements: b = Ax, A ∈ Rp×n, p ≪ n

seminal result [Candes,Romberg,Tao’04]:

• if x is sparse enough, it can be recovered from a small number of random measurements with high probability

• recovery: ℓ1 minimization

huge impact on acquisition, transmission, and analysis of signals... (many natural signals are sparse in time, frequency, wavelet basis, etc.)
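the ℓ1 recovery step is a linear program; a sketch of the standard reformulation min Σ ti subject to −t ≤ x ≤ t, Ax = b, using SciPy (the helper name l1_min_lp is ours):

import numpy as np
from scipy.optimize import linprog

def l1_min_lp(A, b):
    """Basis pursuit as an LP over z = [x; t]: min sum(t), -t <= x <= t, Ax = b."""
    p, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])        #  x - t <= 0  and  -x - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((p, n))])     # Ax = b; t enters only the bounds
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=bounds, method="highs")
    return res.x[:n]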

5


Problem 2: (affine) rank minimization problem

find low-rank matrix given limited info (e.g., a few linear ‘measurements’):

minimize Rank(X)
subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.

can also write as A vec(X) = b with A ∈ Rp×mn (see the sketch below).

Rank(X) = dimension of the span of the rows (or columns)
= number of nonzero singular values
= smallest r s.t. X = LR^T where L is m × r and R is n × r

meaning: notions of order, degrees of freedom, complexity
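to make the identification A(X) = A vec(X) concrete, a small numpy sketch (the measurement matrices Ai and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 6, 10
A_list = [rng.standard_normal((m, n)) for _ in range(p)]   # measurement matrices
X = rng.standard_normal((m, n))

# A(X)_i = <A_i, X> = Tr(A_i^T X); the same map as a p x mn matrix acting on vec(X)
b = np.array([np.sum(Ai * X) for Ai in A_list])
A_mat = np.stack([Ai.ravel() for Ai in A_list])            # row i is vec(A_i)
assert np.allclose(b, A_mat @ X.ravel())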

6


Problem 2: rank minimization problem

find low-rank matrix given limited info (e.g., a few linear ‘measurements’):

minimize Rank X

subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.
NP-hard, nonconvex problem.

examples:

• system ID: given input u and output y (e.g., a step response), find the minimum state-space dimension

• distance geometry: given pairwise distances dij, find the minimum embedding dimension

• machine learning: users/movies database with missing ratings (5 ? 4 . . . ?); assuming a small number of features, predict the unknown entries

special case: finding sparse vector (if X diagonal)

7


special case: finding sparse vector (if X diagonal)

when X = diag(x1, x2, . . . , xn), Rank X = # of nonzero xi:

rank minimization reduces to finding the sparsest vector in C

question: is there a norm analogous to ℓ1 for the matrix rank problem?

8


Outline

for each central problem (sparsity, rank), we look at

• theory: what relaxation/heuristic? when does it provably work?

• algorithms:

– convex relaxation: ℓ1 minimization (e.g. basis pursuit), matrix nuclear norm/trace; algorithms to solve the resulting LP/SDP.

– greedy algorithms: matching pursuit, many variations

• applications: signal sensing and processing, recommender systems, network tomography, . . .

• extensions: other related problems, concepts. general notions of parsimony.

theory tools: ideas from geometry of normed spaces, randomization, optimization and duality

9


Some themes

A general approach to modeling and exploiting parsimony:

• pick appropriate notion of parsimony

• find a “natural” convex heuristic

• use probabilistic analysis to prove the heuristic works

• give efficient algorithms to solve the heuristic

10


Preview: some results on matrix rank

To give an idea of the type of problems and results we’ll discuss in the course, let’s look at the rank minimization problem

minimize Rank X

subject to A(X) = b

where X ∈ Rm×n; A : Rm×n → Rp is a linear map, b ∈ Rp.

rest of today’s talk: focus on this problem

11


Trace Minimization, X ⪰ 0

minimize Tr X
subject to X ∈ C,
X = X^T ⪰ 0

(Rank X = # of nonzero λi’s; Tr X = ∑i λi(X))

[figure: Rank X counts the nonzero eigenvalues λi(X); Tr X sums them]

• convex problem, used often in practice [e.g.,Pare’00,Beck’96,’99]

• variation/improvement: iterative weighted trace minimization
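a minimal cvxpy sketch of the trace heuristic, taking C to be a set of linear measurements (the synthetic data and sizes are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 60
L = rng.standard_normal((n, 3))
X0 = L @ L.T                                   # rank-3 PSD target
A_list = [rng.standard_normal((n, n)) for _ in range(p)]
b = np.array([np.trace(Ai.T @ X0) for Ai in A_list])

X = cp.Variable((n, n), PSD=True)              # X = X^T >= 0
cons = [cp.trace(Ai.T @ X) == bi for Ai, bi in zip(A_list, b)]
prob = cp.Problem(cp.Minimize(cp.trace(X)), cons)
prob.solve()                                   # X.value is typically near X0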

12


Nuclear Norm Minimization

Extension: minimizing sum of singular values:

minimize ‖X‖∗
subject to X ∈ C

‖X‖∗ = ∑i σi(X) is called the nuclear norm (trace norm, Schatten 1-norm) of X, where σi(X) = √λi(X^T X);

its dual norm is the operator norm ‖X‖ = σmax(X)

• for X = diag(x), reduces to minimizing ‖x‖1 = ∑|xi|; the well-known ℓ1 heuristic for finding sparse vectors

• useful in generalizing sparse recovery and ℓ1 results to matrix rank

[figure: the ℓ1 penalty |xi| plotted against xi]

13


Connection to compressed sensing

Compressed sensing: a framework for measurement/recovery of sparse signals [Candes,Tao’04; Donoho’04; Wakin,et al’06; Baraniuk,et al’07; many others. . . ]

• signal x ∈ Rn has k nonzeros (k-sparse)

• measurements: b = Ax, A ∈ Rp×n, p ≪ n

• underdetermined. . . but for some (random) A, the k-sparse solution is unique and coincides with the min ℓ1 norm solution, with high probability

• recovery: ℓ1 minimization

What if object of interest is a low-rank matrix?

Examples: matrix completion from partial data (e.g., Netflix problem); Hankel system identification; quantum tomography

14


When does nuclear norm heuristic work?

minimize Rank X

subject to A(X) = b

Several key concepts from compressed sensing can be generalized:

• restricted isometry of A
• spherical section property for nullspace of A
• incoherence properties

Nuclear norm minimization “works” under these conditions...

15


A dictionary:

object                   x ∈ Rn               X ∈ Rm×n
parsimony concept        cardinality          rank
constraints              Ax                   A(X)
Hilbert space norm       ℓ2                   Frobenius
sparsity-inducing norm   ℓ1                   nuclear
dual norm                ℓ∞                   operator
recovery problem         linear programming   semidefinite programming

(Frobenius norm: ‖X‖F = √(∑i,j Xij^2) = √(∑i σi^2(X)) )

general “recipe”: (for proving heuristics work)

• give a deterministic condition on A for heuristic to give exact solution

• sometimes (often!) condition is (NP-)hard to check. . .

• invoke randomness of problem data: random A satisfies the condition with high probability

16


Restricted Isometry Property (RIP)

How does A behave when acting on low-rank matrices?

Def. For A : Rm×n → Rp, the restricted isometry constant δr(A) is the smallest δr such that

1 − δr(A) ≤ ‖A(X)‖2 / ‖X‖F ≤ 1 + δr(A)

holds for all matrices X of rank up to r.
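computing δr exactly is intractable, but sampling random rank-r matrices gives a quick lower bound on it; a sketch (assumes A_mat is normalized so that ‖A(X)‖2 ≈ ‖X‖F on average, e.g., iid N(0, 1/p) entries):

import numpy as np

def delta_r_lower_bound(A_mat, m, n, r, trials=2000, seed=0):
    """Monte Carlo lower bound on the rank-r restricted isometry constant
    of the map X -> A_mat @ vec(X), via random rank-r test matrices."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        X = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
        ratio = np.linalg.norm(A_mat @ X.ravel()) / np.linalg.norm(X, "fro")
        worst = max(worst, abs(ratio - 1.0))
    return worst          # the true delta_r is at least this large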

17


Restricted Isometry Property (RIP)


Let Rank X0 = r and A(X0) = b. Let X̂ be given by the heuristic.

Theorem. (exact recovery) If certain δ(A)’s are small enough, X0 is the unique solution with rank at most r, and X̂ = X0.

(What linear maps satisfy this? we’ll see it holds for iid Gaussian, iid Bernoulli,. . . )

18


Guaranteed minimum rank solution via nuclear norm

Theorem. Pick A ∈ Rp×mn “randomly” (e.g., iid Gaussian or Bernoulli). Then exact recovery happens with very high probability if

p ≥ c0 r(m + n) log(mn)

(here r(m + n) ≈ the low-rank degrees of freedom, and mn is the ambient dimension)

meaning: if X0 has low enough rank, then given a ‘small’ number of random constraints, the nuclear norm heuristic finds X0 with high probability.

[Recht,Fazel,Parrilo’07]

(recent improvement: O(nr) measurements enough!)

19


Numerical Example

total pixels = 46 × 81 = 3726, rank = 5

[figure: recovered images from 700, 1100, and 1250 constraints]

• random Gaussian measurements

• solve SDP (just using SeDuMi here; see the sketch below)
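a sketch of this experiment at smaller sizes (a generic solver handles the slide’s 46 × 81, rank-5 case too, just slowly; the parameters below are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 15, 20, 2
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T   # rank-2 target

for p in (80, 140, 200):                     # numbers of Gaussian measurements
    A_list = [rng.standard_normal((m, n)) for _ in range(p)]
    b = np.array([np.sum(Ai * X0) for Ai in A_list])
    X = cp.Variable((m, n))
    cons = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(A_list, b)]
    cp.Problem(cp.Minimize(cp.normNuc(X)), cons).solve()
    # expect failure below ~2r(m + n - r) = 132 measurements, exact recovery above
    print(p, np.linalg.norm(X.value - X0, "fro"))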

20


Recovery errors:

[figure: error in Frobenius norm vs. number of measurements (700-1500, with a zoom on 1050-1300), for gaussian, projection, binary, and sparse binary measurement ensembles]

• sharp transition near 1200 measurements (≈ 2r(m + n − r))

21


Phase transition: series of experiments for various n, r, p

n = 40, p runs from 0 to n^2; for fixed (n, p), r covers all values satisfying r(2n − r) ≤ p; for each (n, p, r), generate 10 random cases for Y0 and solve

min ‖X‖∗
subject to A vec(X) = A vec(Y0)

[figure: fraction of exact recoveries over the grid, x-axis p/n^2, y-axis r(2n − r)/p, showing a phase transition]

(analogous to [Donoho,Tanner’05])
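the experiment loop as a sketch, reusing the nuclear_norm_min helper sketched on page 14 (n is reduced from 40 to keep a generic solver tractable; the success threshold is our choice):

import numpy as np

n, trials = 12, 10
rng = np.random.default_rng(7)
grid = {}
for p in range(10, n * n + 1, 10):
    for r in range(1, n + 1):
        if r * (2 * n - r) > p:
            break                            # skip cells where d.o.f. exceed measurements
        hits = 0
        for _ in range(trials):
            Y0 = rng.standard_normal((n, r)) @ rng.standard_normal((n, r)).T
            A_list = [rng.standard_normal((n, n)) for _ in range(p)]
            b = np.array([np.sum(A * Y0) for A in A_list])
            X = nuclear_norm_min(A_list, b, n, n)       # helper sketched on p.14
            hits += np.linalg.norm(X - Y0, "fro") < 1e-3 * np.linalg.norm(Y0, "fro")
        grid[(p / n**2, r * (2 * n - r) / p)] = hits / trials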

22


Error bounds for noisy and approximate recovery

things work fine even with noisy constraints and approximately low-rank X. . .

y = A(X) + z, ‖z‖2 ≤ β

Theorem. If A satisfies the RIP with small δ’s, then

‖X̂ − X‖F ≤ (c0/√r) ‖X − Xr‖∗ + c1 β

and if β = 0, we have

‖X̂ − X‖∗ ≤ c2 ‖X − Xr‖∗

where X̂ is the nuclear norm solution, Xr is the best possible rank-r approximation to X, and c0, c1, c2 are constants.
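the noisy variant replaces the equality constraint with a norm ball; a minimal sketch with synthetic data (sizes and β are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
m, n, r, p, beta = 10, 12, 2, 100, 0.1
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T
A_list = [rng.standard_normal((m, n)) for _ in range(p)]
z = rng.standard_normal(p)
z *= beta / (2 * np.linalg.norm(z))                      # noise with ||z||_2 <= beta
y = np.array([np.sum(Ai * X0) for Ai in A_list]) + z     # noisy measurements

X = cp.Variable((m, n))
resid = cp.hstack([cp.sum(cp.multiply(Ai, X)) for Ai in A_list]) - y
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), [cp.norm(resid, 2) <= beta])
prob.solve()                                             # error ||X.value - X0||_F = O(beta)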

23


A nullspace condition

Another sufficient condition for exact recovery:

Def. a subspace V ⊂ Rm×n satisfies the ∆-spherical section property if

‖Z‖∗ / ‖Z‖F ≥ √∆ for all Z ∈ V, Z ≠ 0

∆ lower bounds Rank Z over V: ∆ large ⇒ V doesn’t include low-rank matrices

vector case: for subspaces of Rn [Kashin’77],[Gluskin,Garnaev’84]

also used in compressed sensing e.g., [Kashin,Temlyakov’07],[Zhang’08],[Vavasis’09]

24


vector case: consider ‖x‖1/‖x‖2 ≥ √∆ for all x in the subspace;

∆ large means that if the ℓ1 unit ball is cut by the subspace V, the intersection looks “spherical”

intuitively: random subspaces should have large ∆. . .
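∆ for the nullspace of a random A can be probed numerically; a sketch (random sampling only upper-bounds the infimum over V, so this is an optimistic estimate; sizes are ours):

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
m, n, p = 10, 10, 60
A_mat = rng.standard_normal((p, m * n))
V = null_space(A_mat)                     # orthonormal basis of Null(A), mn x (mn - p)

ratios = []
for _ in range(5000):
    Z = (V @ rng.standard_normal(V.shape[1])).reshape(m, n)   # random nullspace element
    ratios.append(np.linalg.norm(Z, "nuc") / np.linalg.norm(Z, "fro"))
print(min(ratios) ** 2)                   # empirical upper bound on Delta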

25


Exact recovery

Let r = Rank X0, b = A(X0), m̄ = min{m, n}, and

X̂ := argmin ‖X‖∗ subject to A(X) = b.

Suppose the nullspace of A, Null(A), has the ∆-spherical section property.

Theorem. (exact recovery) If r < ∆/2, X0 is the only matrix of rank at most r satisfying the constraints. If r < ∆/6, then X̂ = X0.

When does spherical section property hold? iid Gaussian. Others?

summary: spherical section property deals directly with the nullspace; gives simpler proofs than RIP.

26


Algorithms for minimizing ‖X‖∗

Semidefinite program and its dual:

primal:
  minimize (over X, Y, Z)  Tr Y + Tr Z
  subject to  [ Y    X ]
              [ X^T  Z ]  ⪰ 0,    A(X) = b

dual:
  maximize (over z)  b^T z
  subject to  [ I_m       A∗(z) ]
              [ A∗(z)^T   I_n   ]  ⪰ 0
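a cvxpy sketch of the primal, packing Y, X, Z into a single PSD block W = [[Y, X], [X^T, Z]] (a standard modeling trick; sizes and data are our choices):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
m, n, p = 8, 10, 50
X0 = rng.standard_normal((m, 2)) @ rng.standard_normal((n, 2)).T   # rank-2 target
A_list = [rng.standard_normal((m, n)) for _ in range(p)]
b = np.array([np.sum(Ai * X0) for Ai in A_list])

W = cp.Variable((m + n, m + n), PSD=True)    # W = [[Y, X], [X^T, Z]] >= 0
X = W[:m, m:]                                # the off-diagonal block is X
cons = [cp.sum(cp.multiply(Ai, X)) == bi for Ai, bi in zip(A_list, b)]
prob = cp.Problem(cp.Minimize(cp.trace(W)), cons)   # Tr W = Tr Y + Tr Z
prob.solve()

at the optimum Tr W = 2‖X‖∗, so the X block solves nuclear norm minimization.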

• custom interior point methods (e.g., [Liu,Vandenberghe’08])

• subgradient methods, proximal gradients (e.g., [Ma,Goldfarb,Chen’09])

• Singular Value Thresholding (SVT) [Cai,Candes,Shen’08] (see the sketch below)

• low-rank factorization (e.g., SDPLR [Burer,Monteiro’05])

• Alternating Directions Method (ADM); Augmented Lagrangian Method (ALM)
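a minimal sketch of the SVT iteration for matrix completion (parameters are illustrative; see [Cai,Candes,Shen’08] for the recommended choices of tau and step size delta):

import numpy as np

def svt_complete(M_obs, mask, tau, delta, iters=500):
    """Singular Value Thresholding: Y <- Y + delta * P_Omega(M - D_tau(Y))."""
    Y = np.zeros_like(M_obs)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrinkage operator D_tau(Y)
        Y += delta * mask * (M_obs - X)              # ascent step on observed entries
    return X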

27


other (non-SDP) algorithms:

• greedy algorithms [Lee,Bresler’09]

• special case algorithms for matrix completion [Keshavan,Oh,Montanari’09]

• Singular Value Projection (SVP) [Meka,et al’09]

choice of algorithm depends on application: problem size, accuracy required, specific structure, . . .

28


A summary of matrix rank minimization

• Rank minimization problem is NP-hard in general; many applications

• Convex heuristic: Nuclear norm minimization, variations

• For affine rank minimization with certain (random) constraints: theoretical guarantees for exact solution

• A generalization of vector sparsity and compressed sensing theory; has opened the door to a new set of applications and new links between areas

29


an active research area:

A generalization of vector sparsity and compressed sensing theory; has opened the door to a new set of applications and new links between areas:

• low-rank matrix completion: e.g., recommendation systems [Candes,Recht’08; Candes,Tao’09; Keshavan,et al’09]

• low-rank+sparse decompositions: e.g., graphical models; matrix rigidity theory [Chandrasekaran,et al’09], robust PCA and face recognition [Wright,Ma,et al’09], [Candes,Ma,et al’09]

• graph problems: e.g., some max-clique problems [Ames,Vavasis’09]

• Hankel rank minimization and system identification [Liu,Vandenberghe’09], [Mohan,Fazel’09]

• ...and more to come!

30