Page 1:

MMSE Estimation for Sparse Representation Modeling

Michael Elad
The Computer Science Department
The Technion – Israel Institute of Technology, Haifa 32000, Israel

Joint work with Irad Yavneh & Matan Protter, The CS Department, The Technion

April 6th, 2009

Page 2:

Noise Removal?

In this talk we focus on signal/image denoising …

Important: (i) Practical application; (ii) A convenient platform for testing basic ideas in signal/image processing.

Many Considered Directions: Partial differential equations, Statistical estimators, Adaptive filters, Inverse problems & regularization, Wavelets, Example-based techniques, Sparse representations, …

Main Message Today: Several sparse representations can be found and used for better denoising performance – we introduce, motivate, discuss, demonstrate, and explain this new idea.

Remove Additive Noise?

Page 3:

Agenda

1. Background on Denoising with Sparse Representations
2. Using More than One Representation: Intuition
3. Using More than One Representation: Theory
4. A Closer Look At the Unitary Case
5. Summary and Conclusions

Page 4:

Part I Background on Denoising with Sparse Representations

Page 5:

Denoising By Energy Minimization

Many of the proposed signal denoising algorithms are related to the minimization of an energy function of the form

$$f(x) = \frac{1}{2}\|y - x\|_2^2 + \Pr(x),$$

where the first term is the relation to the measurements and the second is the prior or regularization, with
y: given measurements
x: unknown to be recovered.

This is in fact a Bayesian point of view, adopting the Maximum-A-posteriori Probability (MAP) estimation.

Clearly, the wisdom in such an approach lies in the choice of the prior – modeling the signals of interest.

[Portrait: Thomas Bayes, 1702–1761]

Page 6:

Sparse Representation Modeling

[Figure: x = Dα, where D is a fixed N×K dictionary and α is a sparse & random vector.]

Every column in D (the dictionary) is a prototype signal (atom).

The vector α is generated randomly with few (say L for now) non-zeros at random locations and with random values.

Page 7:

Back to Our MAP Energy Function

The L0 "norm" $\|\alpha\|_0$ is effectively counting the number of non-zeros in α.

The vector α is the representation (sparse/redundant), and the denoised signal is $\hat{x} = D\hat{\alpha}$, where

$$\hat{\alpha} = \arg\min_{\alpha}\; \frac{1}{2}\|y - D\alpha\|_2^2 \;\;\text{s.t.}\;\; \|\alpha\|_0 \le L.$$

Bottom line: denoising of y is done by minimizing

$$\min_{\alpha}\; \frac{1}{2}\|y - D\alpha\|_2^2 \;\;\text{s.t.}\;\; \|\alpha\|_0 \le L
\qquad\text{or}\qquad
\min_{\alpha}\; \|\alpha\|_0 \;\;\text{s.t.}\;\; \|y - D\alpha\|_2^2 \le \varepsilon^2.$$

Page 8:

The Solver We Use: Greed Based

$$\min_{\alpha}\; \|\alpha\|_0 \;\;\text{s.t.}\;\; \|y - D\alpha\|_2^2 \le \varepsilon^2$$

The Matching Pursuit (MP) is one of the greedy algorithms that finds one atom at a time [Mallat & Zhang ('93)].

Step 1: find the one atom that best matches the signal.

Next steps: given the previously found atoms, find the next one that best fits the residual. The algorithm stops when the error $\|y - D\alpha\|_2^2$ falls below the desired threshold.

The Orthogonal MP (OMP) is an improved version that re-evaluates the coefficients by Least-Squares after each round.

Page 9:

Orthogonal Matching Pursuit

OMP finds one atom at a time for approximating the solution of

$$\min_{\alpha}\; \|\alpha\|_0 \;\;\text{s.t.}\;\; \|y - D\alpha\|_2^2 \le \varepsilon^2.$$

Initialization: n = 0, $\alpha^0 = 0$, support $S^0 = \emptyset$, residual $r^0 = y - D\alpha^0 = y$.

Main iteration (n ← n + 1):
1. Compute $E(i) = \min_{z}\; \|z\, d_i - r^{n-1}\|_2^2$ for $1 \le i \le K$.
2. Choose $i_0$ such that $E(i_0) \le E(i)$ for all $1 \le i \le K$.
3. Update the support: $S^n = S^{n-1} \cup \{i_0\}$.
4. LS: $\alpha^n = \arg\min_{\alpha}\; \|y - D\alpha\|_2^2$ s.t. $\operatorname{supp}\{\alpha\} = S^n$.
5. Update the residual: $r^n = y - D\alpha^n$.
Stop if $\|r^n\|_2^2 \le \varepsilon^2$; otherwise, iterate again.
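To make the five steps above concrete, here is a minimal NumPy sketch of OMP. This is my own illustration, not the talk's code; it assumes unit-norm atoms, so that minimizing E(i) reduces to maximizing |d_i^T r|.

```python
import numpy as np

def omp(D, y, eps2):
    """Greedy OMP sketch: add one atom at a time until ||y - D a||^2 <= eps2.
    Assumes the columns of D are (approximately) unit-norm."""
    N, K = D.shape
    support = []                      # S^n: indices of the chosen atoms
    alpha = np.zeros(K)
    r = y.copy()                      # r^0 = y
    while r @ r > eps2 and len(support) < N:
        # Steps 1-2: for unit-norm atoms, argmin_i E(i) = argmax_i |d_i^T r|.
        i0 = int(np.argmax(np.abs(D.T @ r)))
        # Step 3: update the support.
        if i0 not in support:
            support.append(i0)
        # Step 4: least-squares re-evaluation of the coefficients on S^n.
        Ds = D[:, support]
        coef, *_ = np.linalg.lstsq(Ds, y, rcond=None)
        alpha = np.zeros(K)
        alpha[support] = coef
        # Step 5: update the residual.
        r = y - Ds @ coef
    return alpha, support
```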

Page 10:

Part II Using More than One Representation: Intuition

Page 11:

Back to the Beginning. What If …

Consider the denoising problem

$$\min_{\alpha}\; \|\alpha\|_0 \;\;\text{s.t.}\;\; \|y - D\alpha\|_2^2 \le \varepsilon^2,$$

and suppose that we can find a group of J candidate solutions $\{\alpha_j\}_{j=1}^{J}$ such that, for every j,

$$\|\alpha_j\|_0 \ll N \qquad\text{and}\qquad \|y - D\alpha_j\|_2^2 \le \varepsilon^2.$$

Basic questions:
• What could we do with such a set of competing solutions in order to better denoise y?
• Why should this help?
• How shall we practically find such a set of solutions?

Relevant work: [Larsson & Selen ('07)], [Schniter et al. ('08)], [Elad and Yavneh ('08)].

Page 12:

Motivation – General

Why bother with such a set?

Because each representation conveys a different story about the desired signal.

Because pursuit algorithms often fail to find the sparsest representation, so relying on a single such solution is too sensitive.

… Maybe there are “deeper” reasons?


Page 13:

Our Motivation

There is an intriguing relationship between this idea and the common practice in example-based techniques, where several examples are merged.

Consider the Non-Local-Means [Buades, Coll, & Morel ('05)]. It (i) uses a local dictionary (the neighborhood patches), (ii) builds several sparse representations (of cardinality 1), and (iii) merges them.

Why not take it further, and use general sparse representations?


Page 14:

Generating Many Representations

Our answer: randomizing the OMP. The algorithm is identical to OMP, except that the next atom is chosen at random*:

Initialization: n = 0, $\alpha^0 = 0$, support $S^0 = \emptyset$, residual $r^0 = y - D\alpha^0 = y$.

Main iteration (n ← n + 1):
1. Compute $E(i) = \min_{z}\; \|z\, d_i - r^{n-1}\|_2^2$ for $1 \le i \le K$.
2. Choose $i_0$ at random with probability proportional to $\exp\{-c\cdot E(i)\}$.
3. Update the support: $S^n = S^{n-1} \cup \{i_0\}$.
4. LS: $\alpha^n = \arg\min_{\alpha}\; \|y - D\alpha\|_2^2$ s.t. $\operatorname{supp}\{\alpha\} = S^n$.
5. Update the residual: $r^n = y - D\alpha^n$.
Stop when $\|r^n\|_2^2 \le \varepsilon^2$.

* Larsson and Schniter propose a more complicated and deterministic tree-pruning method.

For now, let's set the parameter c manually for best performance. Later we shall define a way to set it automatically.
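A minimal way to turn the OMP sketch above into a Random-OMP is to make the atom selection probabilistic. The sketch below is my own illustration under the same unit-norm-atom assumption, with c supplied by the user; it is not the authors' implementation.

```python
import numpy as np

def rand_omp(D, y, eps2, c, rng=None):
    """Random-OMP sketch: like OMP, but the next atom is drawn at random
    with probability proportional to exp(-c * E(i))."""
    if rng is None:
        rng = np.random.default_rng()
    N, K = D.shape
    support = []
    alpha = np.zeros(K)
    r = y.copy()
    while r @ r > eps2 and len(support) < N:
        E = r @ r - (D.T @ r) ** 2        # E(i) for unit-norm atoms
        p = np.exp(-c * (E - E.min()))    # shifting E only rescales p; kept for stability
        i0 = int(rng.choice(K, p=p / p.sum()))   # the randomized step 2
        if i0 not in support:
            support.append(i0)
        Ds = D[:, support]
        coef, *_ = np.linalg.lstsq(Ds, y, rcond=None)
        alpha = np.zeros(K)
        alpha[support] = coef
        r = y - Ds @ coef
    return alpha
```

Each call typically returns a different sparse representation, which is exactly the set of competing solutions asked for above.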

Page 15:

Let's Try

Proposed experiment:
• Form a random dictionary D of size 100×200.
• Multiply it by a sparse vector α₀ (with $\|\alpha_0\|_0 = 10$).
• Add Gaussian iid noise v with σ = 1 and obtain $y = D\alpha_0 + v$.
• Solve the problem $\min_{\alpha}\; \|\alpha\|_0 \;\text{s.t.}\; \|y - D\alpha\|_2^2 \le 100$ using OMP, and obtain $\hat{\alpha}^{OMP}$.
• Use Random-OMP and obtain $\{\hat{\alpha}_j^{RandOMP}\}_{j=1}^{1000}$.

Let's look at the obtained representations …

Page 16:

Some Observations

[Figure: four histograms comparing the OMP solution with the 1000 Random-OMP solutions – (i) the cardinalities, (ii) the representation errors $\|y - D\hat{\alpha}\|_2^2$, (iii) the noise attenuation $\|D\hat{\alpha} - D\alpha_0\|_2^2 / \|y - D\alpha_0\|_2^2$, and (iv) the noise attenuation versus the cardinality.]

We see that:
• The OMP gives the sparsest solution.
• Nevertheless, it is not the most effective for denoising.
• The cardinality of a representation does not reveal its efficiency.

Page 17:

The Surprise (at least for us) …

[Figure: the original representation, the OMP representation, and the averaged representation, plotted entry by entry.]

Let's propose the average

$$\hat{\alpha} = \frac{1}{1000} \sum_{j=1}^{1000} \hat{\alpha}_j^{RandOMP}$$

as our representation. This representation IS NOT SPARSE AT ALL, but it gives

$$\frac{\|D\hat{\alpha} - D\alpha_0\|_2^2}{\|y - D\alpha_0\|_2^2} = 0.06.$$
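A rough end-to-end sketch of this experiment is given below. It is my own illustration: it reuses the omp and rand_omp sketches defined earlier, and the value c = 0.5 is a hypothetical hand-tuned choice standing in for the manually set parameter mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, k, sigma, J = 100, 200, 10, 1.0, 1000

# Random dictionary with unit-norm columns, a sparse alpha0, and a noisy y.
D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)
alpha0 = np.zeros(K)
alpha0[rng.choice(K, k, replace=False)] = rng.standard_normal(k)
y = D @ alpha0 + sigma * rng.standard_normal(N)

eps2 = N * sigma**2                       # stopping threshold: ||y - D a||^2 <= 100
a_omp, _ = omp(D, y, eps2)                # requires the omp sketch from Part I
reps = [rand_omp(D, y, eps2, c=0.5) for _ in range(J)]
a_avg = np.mean(reps, axis=0)             # the (non-sparse) averaged representation

def denoising_factor(a):
    return np.sum((D @ a - D @ alpha0) ** 2) / np.sum((y - D @ alpha0) ** 2)

print("OMP:", denoising_factor(a_omp), " averaged RandOMP:", denoising_factor(a_avg))
```

Values below 1 mean that noise was attenuated; the claim of this part of the talk is that the averaged Random-OMP factor tends to be the smaller of the two.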

Page 18:

Is It Consistent? … Yes!

[Figure: scatter plot of the Random-OMP denoising factor versus the OMP denoising factor, with the mean point marked; a small cluster corresponds to cases of a zero solution.]

Here are the results of 1000 trials with the same parameters …

Page 19:

Part III Using More than One Representation: Theory

Page 20:

Our Signal Model

[Figure: x = Dα, with D a fixed N×K dictionary.]

D is fixed and known.

The vector α is built by:
• Choosing the support s with probability P(s) from all the $2^K$ possibilities Ω. For simplicity, assume that |s| = k is fixed and known.
• Choosing the $\alpha_s$ coefficients using iid Gaussian entries $\mathcal{N}(0, \sigma_x)$.

The ideal signal is $x = D\alpha = D_s\alpha_s$. The p.d.f.s P(α) and P(x) are clear and known.

Page 21:

Adding Noise

[Figure: y = x + v = Dα + v.]

Noise assumed: the noise v is an additive white Gaussian vector with p.d.f. $P_v(v)$, so that

$$P(y|x) = C\cdot\exp\left\{-\frac{\|y - x\|_2^2}{2\sigma^2}\right\}.$$

The conditional p.d.f.s P(y|s), P(s|y), and even also P(x|y) are all clear and well-defined (although they may appear nasty).

Page 22:

The Key – The Posterior P(x|y)

We have access to P(x|y).

MAP: $\hat{x}_{MAP} = \operatorname*{ArgMax}_x P(x|y)$.

MMSE: $\hat{x}_{MMSE} = E[x\,|\,y]$.

Oracle (known support s): $\hat{x}_{oracle}$.

Estimating α and then multiplying by D is equivalent to the above.

These two estimators are impossible to compute, as we show next.

Page 23:

Let's Start with the Oracle*

$$P(\alpha_s\,|\,y, s) = \frac{P(y|\alpha_s)\,P(\alpha_s)}{P(y)} \;\propto\; P(y|\alpha_s)\,P(\alpha_s)$$

$$P(\alpha_s) \propto \exp\left\{-\frac{\|\alpha_s\|_2^2}{2\sigma_x^2}\right\},
\qquad
P(y|\alpha_s) \propto \exp\left\{-\frac{\|y - D_s\alpha_s\|_2^2}{2\sigma^2}\right\}$$

$$P(\alpha_s\,|\,y, s) \propto \exp\left\{-\frac{\|y - D_s\alpha_s\|_2^2}{2\sigma^2} - \frac{\|\alpha_s\|_2^2}{2\sigma_x^2}\right\},$$

which leads to

$$\hat{\alpha}_s = \left(\frac{1}{\sigma^2} D_s^T D_s + \frac{1}{\sigma_x^2} I\right)^{-1} \frac{1}{\sigma^2} D_s^T y \;\triangleq\; \mathbf{Q}_s^{-1} h_s.$$

* When s is known.

Comments:
• This estimate is both the MAP and the MMSE (for the known support).
• The oracle estimate of x is obtained by multiplication by $D_s$.
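A minimal NumPy sketch of this oracle (my own illustration of the formula above, with sigma and sigma_x denoting the noise and coefficient standard deviations):

```python
import numpy as np

def oracle(D, y, s, sigma, sigma_x):
    """Oracle estimate for a known support s: alpha_s = Q_s^{-1} h_s."""
    Ds = D[:, list(s)]                              # the atoms in the support
    Qs = Ds.T @ Ds / sigma**2 + np.eye(len(s)) / sigma_x**2
    hs = Ds.T @ y / sigma**2
    alpha_s = np.linalg.solve(Qs, hs)
    x_hat = Ds @ alpha_s                            # oracle estimate of the signal
    return alpha_s, x_hat
```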

Page 24:

The MAP Estimation

We have seen this as the oracle's probability for the support s:

$$P(\alpha_s\,|\,y, s) \propto \exp\left\{-\frac{\|y - D_s\alpha_s\|_2^2}{2\sigma^2} - \frac{\|\alpha_s\|_2^2}{2\sigma_x^2}\right\}.$$

Marginalizing over $\alpha_s$ gives

$$P(s|y) \propto P(s)\,P(y|s) \propto P(s)\cdot\exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\},$$

and therefore

$$\hat{\alpha}_{MAP} = \operatorname*{ArgMax}_{\alpha} P(\alpha|y) = \operatorname*{ArgMax}_{s,\,\alpha_s} P(s|y)\,P(\alpha_s\,|\,y,s)
\quad\Rightarrow\quad
s_{MAP} = \operatorname*{ArgMax}_{s} P(s)\cdot\exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\}.$$
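The step from the Gaussian posterior of α_s to the exponent $\tfrac12 h_s^T \mathbf{Q}_s^{-1} h_s + \tfrac12\log\det(\mathbf{Q}_s^{-1})$ is a standard Gaussian marginalization; the short derivation below is my own filling-in of the omitted step, completing the square and integrating α_s out:

```latex
\[
\begin{aligned}
P(y|s) &= \int P(y|\alpha_s)\,P(\alpha_s)\,d\alpha_s
 \;\propto\; \int \exp\!\Big\{-\tfrac{1}{2}\,\alpha_s^T \mathbf{Q}_s \alpha_s
   + \alpha_s^T h_s - \tfrac{\|y\|_2^2}{2\sigma^2}\Big\}\,d\alpha_s \\[4pt]
 &= e^{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s - \frac{\|y\|_2^2}{2\sigma^2}}
   \int \exp\!\Big\{-\tfrac{1}{2}\,(\alpha_s - \mathbf{Q}_s^{-1} h_s)^T
   \mathbf{Q}_s\, (\alpha_s - \mathbf{Q}_s^{-1} h_s)\Big\}\,d\alpha_s \\[4pt]
 &\propto\; \exp\!\Big\{\tfrac{1}{2}\, h_s^T \mathbf{Q}_s^{-1} h_s
   + \tfrac{1}{2}\,\log\det\!\big(\mathbf{Q}_s^{-1}\big)\Big\},
\end{aligned}
\]
% since the remaining Gaussian integral equals (2\pi)^{k/2}\det(\mathbf{Q}_s)^{-1/2},
% and the factor \exp\{-\|y\|_2^2/(2\sigma^2)\} is common to all supports s.
```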

Page 25:

The MAP Estimation

$$s_{MAP} = \operatorname*{ArgMax}_{s} P(s)\cdot\exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\}$$

Implications:

The MAP estimator requires testing all the possible supports for the maximization. In typical problems this is impossible, as there is a combinatorial set of possibilities.

This is why we rarely use the exact MAP, and we typically replace it with approximation algorithms (e.g., OMP).

Page 26:

The MMSE Estimation

As before,

$$P(s|y) \propto P(s)\,P(y|s) \propto P(s)\cdot\exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\},$$

and for a known support

$$E[\alpha\,|\,y, s] = \mathbf{Q}_s^{-1} h_s = \hat{\alpha}_s,$$

which is the oracle for s, as we have seen before. Therefore

$$\hat{\alpha}_{MMSE} = E[\alpha\,|\,y] = \sum_{s} P(s|y)\,E[\alpha\,|\,y,s] = \sum_{s} P(s|y)\,\hat{\alpha}_s.$$

Page 27:

The MMSE Estimation

$$\hat{\alpha}_{MMSE} = E[\alpha\,|\,y] = \sum_{s} P(s|y)\,E[\alpha\,|\,y,s] = \sum_{s} P(s|y)\,\hat{\alpha}_s$$

Implications:

The best estimator (in terms of L2 error) is a weighted average of many sparse representations!!!

As in the MAP case, in typical problems one cannot compute this expression, as the summation is over a combinatorial set of possibilities. We should propose approximations here as well.
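For a toy-sized problem the combinatorial sums can be evaluated directly. The sketch below is my own illustration: it assumes a fixed known cardinality k with a uniform prior P(s), enumerates all supports, and returns both the exact MAP and the exact MMSE estimates of α (the zero-padded oracle solutions weighted by P(s|y)).

```python
import numpy as np
from itertools import combinations

def exact_map_mmse(D, y, k, sigma, sigma_x):
    """Exact MAP and MMSE by enumerating all supports of size k (small K only)."""
    N, K = D.shape
    supports, log_ps, oracles = [], [], []
    for s in combinations(range(K), k):
        Ds = D[:, list(s)]
        Qs = Ds.T @ Ds / sigma**2 + np.eye(k) / sigma_x**2
        hs = Ds.T @ y / sigma**2
        a_s = np.linalg.solve(Qs, hs)                  # oracle for this support
        _, logdetQ = np.linalg.slogdet(Qs)
        log_ps.append(0.5 * hs @ a_s - 0.5 * logdetQ)  # log P(s|y) up to a constant
        supports.append(s)
        oracles.append(a_s)
    log_ps = np.array(log_ps)
    w = np.exp(log_ps - log_ps.max())
    w /= w.sum()                                       # normalized P(s|y)

    best = int(np.argmax(log_ps))                      # MAP: the most probable support
    alpha_map = np.zeros(K)
    alpha_map[list(supports[best])] = oracles[best]

    alpha_mmse = np.zeros(K)                           # MMSE: weighted average of oracles
    for wi, s, a_s in zip(w, supports, oracles):
        alpha_mmse[list(s)] += wi * a_s
    return alpha_map, alpha_mmse
```

The weighted average is exactly the statement above: each term in the sum is sparse, but the sum generally is not.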

Page 28:

The Case of |s| = k = 1

$$P(s|y) \propto P(s)\cdot\exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\}
\;\propto\; \exp\left\{\frac{\sigma_x^2}{2\sigma^2(\sigma_x^2+\sigma^2)}\,(d_k^T y)^2\right\},$$

where $d_k$ is the k-th atom in D. The constant $\frac{\sigma_x^2}{2\sigma^2(\sigma_x^2+\sigma^2)}$ is our c in the Random-OMP.

Based on this we can propose a greedy algorithm for both MAP and MMSE:

MAP – choose the atom with the largest inner product (out of K), and do so one at a time, while freezing the previous ones (almost OMP).

MMSE – draw an atom at random in a greedy algorithm, based on the above probability set, getting close to P(s|y) in the overall draw.
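As a short sanity check (my own one-line derivation, assuming unit-norm atoms), this single-atom posterior is exactly what the Random-OMP selection rule samples from on its first iteration, where the residual is y:

```latex
\[
E(k) = \min_{z}\,\|z\,d_k - y\|_2^2 = \|y\|_2^2 - (d_k^T y)^2
\quad\Longrightarrow\quad
e^{-c\,E(k)} = e^{-c\,\|y\|_2^2}\; e^{\,c\,(d_k^T y)^2} \;\propto\; P\big(s=\{k\}\,\big|\,y\big),
\qquad
c = \frac{\sigma_x^2}{2\sigma^2(\sigma_x^2+\sigma^2)} .
\]
```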

Page 29:

Bottom Line

The MMSE estimation we got requires a sweep through all supports (i.e. combinatorial search) – impractical.

Similarly, an explicit expression for P(x|y) can be derived and maximized – this is the MAP estimation, and it also requires a sweep through all possible supports – impractical too.

The OMP is a (good) approximation for the MAP estimate.

The Random-OMP is a (good) approximation of the Minimum-Mean-Squared-Error (MMSE) estimate. It is close to a Gibbs sampler of the probability P(s|y) from which we should draw the weights.

Back to the beginning: Why use several representations? Because their average leads to provably better noise suppression.

Page 30:

Comparative Results

The following results correspond to a small dictionary (20×30), where the combinatorial formulas can be evaluated as well.

Parameters:
• N = 20, K = 30
• True support = 3
• σ_x = 1
• J = 10 (RandOMP representations)
• Averaged over 1000 experiments

[Figure: relative mean-squared error as a function of the noise level for (1) empirical oracle, (2) theoretical oracle, (3) empirical MMSE, (4) theoretical MMSE, (5) empirical MAP, (6) theoretical MAP, (7) OMP, and (8) RandOMP; the oracle curves assume a known support.]

Page 31:

Part IV A Closer Look At the Unitary Case: $D^T D = D D^T = I$

Page 32:

A Few Basic Observations

Let us denote $\beta = D^T y$. In the unitary case,

$$\mathbf{Q}_s = \frac{1}{\sigma^2} D_s^T D_s + \frac{1}{\sigma_x^2} I = \left(\frac{1}{\sigma^2} + \frac{1}{\sigma_x^2}\right) I
\qquad\text{and}\qquad
h_s = \frac{1}{\sigma^2} D_s^T y = \frac{1}{\sigma^2}\,\beta_s,$$

so the oracle becomes a simple scaling:

$$\hat{\alpha}_s = \mathbf{Q}_s^{-1} h_s = \frac{\sigma_x^2}{\sigma_x^2 + \sigma^2}\,\beta_s \;\triangleq\; c^2\,\beta_s.$$

Page 33:

Back to the MAP Estimation

$$s_{MAP} = \operatorname*{ArgMax}_{s} \exp\left\{\frac{1}{2} h_s^T \mathbf{Q}_s^{-1} h_s + \frac{1}{2}\log\det\!\left(\mathbf{Q}_s^{-1}\right)\right\}$$

We assume |s| = k fixed with equal probabilities, so the log-det part becomes a constant and can be discarded. In the unitary case

$$h_s^T \mathbf{Q}_s^{-1} h_s = \frac{c^2}{\sigma^2}\,\|\beta_s\|_2^2.$$

This means that MAP estimation can be easily evaluated by computing β, sorting its entries by magnitude in descending order, and choosing the k leading ones.
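A minimal sketch of this unitary-case MAP rule (my own illustration; the c² scaling of the retained entries comes from the oracle formula above):

```python
import numpy as np

def map_unitary(D, y, k, sigma, sigma_x):
    """MAP for a unitary D and known |s| = k: keep the k entries of
    beta = D^T y that are largest in magnitude, scaled by c^2."""
    beta = D.T @ y
    c2 = sigma_x**2 / (sigma_x**2 + sigma**2)
    s = np.argsort(-np.abs(beta))[:k]       # the k leading entries of |beta|
    alpha = np.zeros_like(beta)
    alpha[s] = c2 * beta[s]                 # oracle scaling on the chosen support
    return alpha
```

This is the thresholding/shrinkage structure mentioned on the next slide: keep the largest entries of β and shrink them.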

Page 34:

Closed-Form Estimation

It is well-known that MAP enjoys a closed form and simple solution in the case of a unitary dictionary D.

This closed-form solution takes the structure of thresholding or shrinkage. The specific structure depends on the fine details of the model assumed.

It is also known that OMP in this case becomes exact.

What about the MMSE? Could it have a simple closed-form solution too?

Page 35:

The MMSE … Again

This is the formula we got:

$$\hat{\alpha}_{MMSE} = \sum_{s} P(s|y)\, c^2\beta_s.$$

We combine linearly many sparse representations (with proper weights); the result is one effective representation (not sparse anymore).

Page 36:

The MMSE … Again

This is the formula we got:

$$\hat{\alpha}_{MMSE} = \sum_{s} P(s|y)\, c^2\beta_s.$$

We change the above summation into a sum over atoms,

$$\hat{\alpha}_{MMSE} = \sum_{j=1}^{K} q_j^k\, c^2\beta_j\, e_j,
\qquad
q_j^k = \sum_{s:\, j\in s} P(s|y),$$

where $e_j$ is the j-th canonical unit vector, so that there are K contributions (one per atom) to be found and used.

We have developed a closed-form recursive formula for computing the q coefficients.

Page 37:

Towards a Recursive Formula

We have seen that the governing probability for the weighted averaging is given by

$$P(s|y) \propto \exp\left\{\frac{c^2\|\beta_s\|_2^2}{2\sigma^2}\right\} = \prod_{i\in s} \exp\left\{\frac{c^2\beta_i^2}{2\sigma^2}\right\} \;\triangleq\; \prod_{i\in s} q_i,$$

so the weight of the j-th atom in $\hat{\alpha}_{MMSE} = \sum_{s} P(s|y)\,c^2\beta_s$ is

$$q_j^k = \frac{\sum_{s}\left(\prod_{i\in s} q_i\right) I(j\in s)}{\sum_{s}\prod_{i\in s} q_i},$$

where $I(j\in s)$ is the indicator function stating whether j is in s.

Page 38:

The Recursive Formula

The weights $q_j^k$ are computed recursively, one cardinality level at a time: starting from the single-atom weights $q_j^1 \propto q_j$, each level m is obtained from the weights $\{q_j^{m-1}\}_{j=1}^{K}$ of the previous level together with the elementary quantities $q_j$, until the desired $\{q_j^k\}_{j=1}^{K}$ is reached. This avoids the explicit combinatorial sum over all supports.

[Diagram: the chain $\{q_j^1\}_{j=1}^{K} \rightarrow \{q_j^2\}_{j=1}^{K} \rightarrow \{q_j^3\}_{j=1}^{K} \rightarrow \cdots \rightarrow \{q_j^k\}_{j=1}^{K}$.]
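Since the exact recursion is not reproduced here, the sketch below (my own illustration, feasible only for small K) computes the same weights q_j^k and the unitary MMSE estimate by brute force, directly from the definitions above; it can serve as a reference against which a recursive implementation is checked.

```python
import numpy as np
from itertools import combinations

def mmse_unitary_bruteforce(D, y, k, sigma, sigma_x):
    """Unitary-case MMSE by enumerating all supports of size k (small K only)."""
    beta = D.T @ y
    c2 = sigma_x**2 / (sigma_x**2 + sigma**2)
    # Elementary per-atom quantities q_i; the common shift cancels after
    # normalization because every support has the same size k.
    q = np.exp(c2 * (beta**2 - np.max(beta**2)) / (2 * sigma**2))
    K = len(beta)
    weights = np.zeros(K)                   # q_j^k = Pr(j is in s | y)
    Z = 0.0
    for s in combinations(range(K), k):
        ps = np.prod(q[list(s)])            # unnormalized P(s|y)
        Z += ps
        weights[list(s)] += ps
    weights /= Z
    return weights * c2 * beta              # MMSE estimate, entry by entry
```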

Page 39:

An Example

This is a synthetic experiment resembling the previous one, but with a few important changes:

• D is unitary.
• The representation's cardinality is 5 (the higher it is, the weaker the Random-OMP becomes).
• The dimensions are different: N = K = 64.
• J = 20 (RandOMP runs).

[Figure: relative mean-squared error as a function of the noise level for the oracle (known support), the theoretical MAP, OMP, the recursive MMSE, the theoretical MMSE, and RandOMP.]

Page 40:

Part V Summary and Conclusions

Page 41:

Today We Have Seen that …

Sparsity and redundancy are used for denoising of signals/images.

How? By finding the sparsest representation and using it to recover the clean signal.

Can we do better? Today we have shown that averaging several sparse representations of a signal leads to better denoising, as it approximates the MMSE estimator.

More on these (including the slides and the relevant papers) can be found at http://www.cs.technion.ac.il/~elad