Denoising in digital radiography: A total variation approachhomes.di.unimi.it/frosio/Lessons/AY2010-2011-SI/Download/SI_2011.… · Gaussian noise and likelihood • Im gesImages

Denoising in digital radiography:Denoising in digital radiography:A total variation approach

I. FrosioM. LuccheseN. A. Borghese

http://ais-lab.dsi.unimi.it I. Frosio, M. Lucchese, N. A. Borghese1 / 46

Images are corrupted by noise…

i) When measurement ofi) When measurement ofsome physical parameter isperformed, noisecorruption cannot becorruption cannot beavoided.

ii E h i di i iii) Each pixel of a digital imagemeasures a number ofphotons.

Therefore, from i) and ii)…

…Images are corrupted bynoise!


Gaussian noise(not so useful for digital radiographs, but a good model for learning…)(not so useful for digital radiographs, but a good model for learning…)

M i i f d l d• Measurement noise is often modeled asGaussian noise…

• Let x be the measured physical parameter,let μ be the noise free parameter and let σ2

be the variance of the measured parameter(noise power); the probability densityfunction for x is given by:

⎤⎡ ⎞⎛211( )⎥⎥⎦

⎤

⎢⎢⎣

⎡⎟⎠⎞

⎜⎝⎛ −

−=21exp

21,|

σμ

πσσμ xxp


⎦⎣

Gaussian noise and likelihood

Im ges e com osed by set of ixels x (x is vecto !)• Images are composed by a set of pixels, x (x is a vector!)• How can we quantify the probability to measure the image x,

given the probability density function for each pixel?g p f f p• Let us assume that the variance is equal for each pixel;• Let xi and μi be the measured and noiseless values for the i-th

pixelpixel;• Likelihood function, L(x|μ):

⎤⎡ ⎞⎛NN x 211 μ( ) ( ) ∏∏== ⎥

⎥⎦

⎤

⎢⎢⎣

⎡⎟⎠⎞

⎜⎝⎛ −

−==N

i

iiN

iii

xxpL11 2

1exp2

1||σ

μπσ

μμx

• L(x|μ) describes the probability to measure the image x, giventhe noise free value for each pixel, μ.


What about denoising???

Wh i d i i g h ?• What is denoising then?

Denoising estimate Denoising estimate from from xxDenoising = estimate Denoising = estimate μμ from from xx..

• How can we estimate μ?• How can we estimate μ?• Maximize p(μ|x) => this usually leads to an hard, inverse

problem.• It is easier to maximize p(x|μ), that is => maximize the

likelihood function (a “simple”, direct problem).• But… Is maximization of p(μ|x) different from that of

p(x|μ)?


Bayes and likelihood

Bayes theo em• Bayes theorem:( ) ( ) ( ) ( )μμxxxμ pppp || ⇒=( ) ( ) ( ) ( )

( ) ( )

μμμ pppp ||

A priori hypothesis on the Likelihood

( ) ( ) ( )( )x

μμxxμp

ppp || =⇒A p hyp h s s n h

estimated parameters μ. For the moment, let us suppose

p(μ) = cost.( )xpProbability density function for the

data x… Just a normalization factor!!!

• In this case, maximizing p(μ|x) or p(x|μ) isthe same!


the same!

So, let us maximize the likelihood…

I d f m imizi g L( | ) i i i mi iz• Instead of maximizing L(x|μ), it is easier to minize –log[L(x|μ)].

• When the noise is Gaussian, we get:W G g

( ) ( ) ∏∏ ⎥⎤

⎢⎡

⎟⎞

⎜⎛ −N

iiN xL

211|| μ( ) ( ) ∏∏==

⎞⎛

⎥⎥⎦⎢

⎢⎣

⎟⎠⎞

⎜⎝⎛−==

i

ii

iiixpL

11 2exp

2||

σμ

πσμμx

( ) ( )[ ] ( )∑∑==

−+⎟⎠

⎞⎜⎝

⎛−=−=N

iii

N

ixLf

1

2

1 21

21ln|ln| μ

σπσμxμx

•• MaximizeMaximize LL =>=> LeastLeast squaressquares problem!problem!

⎠⎝

Least squares!


Constant!

However, what about noise in digital radiography?radiography?

N i i di i di h i P i• Noise in digital radiography is Poisson(photon counting noise)!

• Let pn,i be the noisy (measured)number of photons associated to pixelnumber of photons associated to pixeli, and pi the unnoisy number ofho o Thphotons. Then:

pp iin −

( )!

|,

,

in

ppi

iin pepppp

iin

=


,inp

Gaussian noise: example

Constant variance

10000

12000Gaussian noise

8000

10000

6000

2000

4000

0 100 200 300 400 500 600 7000


Poisson noise: exampleLower variance for low signal

10000

12000Poisson noise

8000

10000

6000

2000

4000

0 100 200 300 400 500 600 7000


Likelihood for Poisson noise

Let us w ite the negative log likelihood fo• Let us write the negative log likelihood forthe Poisson case:

N ppN

( ) ( )

( ) ( )[ ] ( )[ ] ( )∑ ∑∑

∏∏=

−

=

==

N NN

N

i in

ppi

N

iiin p

eppppLiin

1 ,1, !

||,

ppn

( ) ( )[ ] ( )[ ] ( )

( )[ ]∑

∑ ∑∑= ==

⋅−=

=++⋅−=−=

Ni i

ini

iiin

ppp

ppppLf1 1

,1

,

ln

!lnln|ln| μxppn

• L(pn|p) is also known as Kullback-Leibler

( )[ ]∑=

⋅−=i

iini ppp1

, ln

L(pn|p) al a K llba L bldivergence (apart from a constant term,which does not affect the minimization

|http://ais-lab.dsi.unimi.it I. Frosio, M. Lucchese, N. A. Borghese11 / 46

process), KL(pn|p).

Maximize L!

L is maximized < > f is minimizedL is maximized <=> f is minimized;• Optimization (Gaussian noise) can be performed

posing: Nposing:

( ) ( )( )

ix

iff

N

jjj

⇒∀=∂

−∂⇒∀=

∂∂

⇔=∂

∂ ∑= ,0,0|| 1

2μμx0μx

( ) ixix iiii

ii

∀=⇒∀=−⇒∂∂∂

,,02 μμμμμ

• The noisy image gives the highest likelihood!!!• This solution is not so interesting… The likelihood

h i iapproach suffers from a severe overfittingproblem.


Maximize L!

L is maximized < > f is minimizedL is maximized <=> f is minimized;• Optimization (Poisson noise) can be performed

posing: Nposing:( ) ( ) ( )[ ]

ip

pppi

pff

i

N

iiini

i

⇒∀=∂

⋅−∂⇒∀=

∂∂

⇔=∂

∂ ∑= ,0

ln,0|| 1

,pp0ppp nn

ippipp

pp

inii

in

ii

∀=⇒∀=−⇒

∂∂∂

,,01 ,,

p

• The noisy image gives the highest likelihood!!!• This solution is not so interesting… The likelihood

h i i

i

approach suffers from a severe overfittingproblem.


Back to Bayes

Bayes theo em• Bayes theorem:

A priori hypothesis on the A priori hypothesis on the Likelihood

( ) ( ) ( )( )

nn p

pppppp

ppp || =⇒A priori hypothesis on the A priori hypothesis on the estimated parameters estimated parameters μμ. .

( )nppProbability density function for the

data x… Just a normalization

• If we introduce a-priori knowledge about

data x… Just a normalization factor!!!

• If we introduce a priori knowledge aboutthe solution μ, we get a Maximum APosteriori (MAP) solution – p(p|pn) is


P (MAP) l p(p|pn)maximized!

What do we have to minimize now?

W i i ( | ) ( | ) ( )• We want to maximize p(p|pn) ~ p(pn|p) p(p),that is:

( )[ ] ( ) ( )[ ] ( ) ( )[ ]∏ =⋅−=−=−N

iiin pp|pppppp1

,ln|ln|ln ppppp nn

( ) ( )[ ] ( ) ( )∑∑∑=

=−−=⋅−N

ii

N

iiin

N

iiiin

i

pp|ppppp|ppp11

,1

,

1

lnlnln

( )[ ] ( )∑=

===

−−=N

ii

iii

ppL1

111

ln|ln ppn=i 1

Negative log likelihoodRegularization term (a priori information)


A priori term

L ll d h f• Let us call px and py the two components ofthe gradient of the image.

• These are easily computed, for instance as:– px = p(i,j) – p(i-1,j);– py = p(i,j) – p(i,j-1);

• The gradient (a vector!) will be indicated asThe g adien (a vec !) will be indica ed as∇p;

• ||∇p|| indicates the norm of the gradient• ||∇p|| indicates the norm of the gradient.


A priori term – image gradients (no noise)noise)

px = p(i,j) – p(i-1,j) py = p(i,j) – p(i,j-1)y


A priori term – image gradients (noise)

px = p(i,j) – p(i-1,j) py = p(i,j) – p(i,j-1)y


A priori term – norm of image gradient

No noise Noise

In the real image, most of the areas are characterized by an (almost) nullgradient norm;

WeWe cancan forfor instanceinstance supposesuppose thatthat ||||∇∇p||p|| isis aa randomrandom variablevariable withwithGaussianGaussian distribution,distribution, zerozero meanmean andand variancevariance equalequal toto ββ22..

[Note that in the noisy image the norm of the gradient assume higher


[Note that, in the noisy image, the norm of the gradient assume highervalues low ||∇p|| means low noise!]

MAP and regularization theory

P i i di ib i• Poisson noise, normal distribution forthe norm of the gradient:

( ) ( )[ ] ( )∑=

=∇−−=N

ipLf

1ln|ln| inn ppppp

( )[ ] ∑∑ =⎥⎥⎦

⎤

⎢⎢⎣

⎡

⎟⎟

⎠

⎞

⎜⎜

⎝

⎛ ∇−−⋅−=

N

i

N

iiini ppp

12

2

1, 2

1exp2

1lnln ipβπβ

( )[ ] ( ) ∑∑

==

∇++⋅−=

⎥⎦⎢⎣⎟⎠

⎜⎝

NN

iini

ii

Nppp 22,

11

212lnln

22

ipβπβ

βπβ

( )[ ] ( ) ∑∑== ii 1

21

, 2β

Negative log likelihood

Const!!!

Regularization term (a


Negative log likelihood Regularization term (a priori information)

MAP and regularization theory

• We look for the minimum of f…• … The likelihood is maximized (data fittingTh l l h ma m ( a a f g

term)…• … At the same time, the squared norm ofA h a , h q a f

the gradient is minimized (regularizationterm)…

• … The regularization parameter (1/2β2)balances between a perfect data fitting and

ivery regular image…

( ) ( )[ ] ∑∑ ∇+⋅−=NN

pppf 21ln| ppp


( ) ( )[ ] ∑∑==

∇+⋅−=ii

iini pppf1

21

, 2ln| in ppp

β

MAP and regularization theoryFor (1/2β2) = 0 we get the maximum

likelihood solution;Increasing (1/2β2) we get a more regularIncreasing (1/2β ) we get a more regular

(less noisy) solution;For (1/2β2) -> ∞, a completely smooth

image is achieved.(1/2β2) = 0

Noise reduction.

(1/2β2) = 0.005

10

12

-2

0

2

4

6

8

Noise and edge reduction.


(1/2β2) = 0.2-6

-4

Fix the ideas

• A statistical based denoising filter is achieved minimizing:• A statistical based denoising filter is achieved minimizing:

f = -ln[L(pn|p)]-λ·ln[p(p)]

• The datadata fittingfitting term is derived fromfrom thethe noisenoise statisticalstatisticaldistributiondistribution (likelihood of the data); generally, the choice forthis term is unquestionablethis term is unquestionable.

• The regularizationregularization termterm is derived from aa--prioripriori knowledgeknowledgeregarding some properties of the solution; this term isregarding some properties of the solution; this term isgenerally user defined.

• Depending on the regularization parameter λ the first or the• Depending on the regularization parameter λ, the first or thesecond term assume more or less importance. For λ->0, themaximum likelihood solution is obtained.


Gibbs prior

U to now we ssumed no m l dist ibution fo the no m of• Up to now, we assumed a normal distribution for the norm ofthe gradient, Tikhonov regularization (quadraticpenalization).

• A more general framework is obtained considering:

p(p) = exp[-R(p)] (Gibb’s prior)

• R(p) -> Energy function ~ regularization term (note that -lnexp[-R(p)] = R(p)!)

• Tikhonov assumes R(p) = -½ (ll∇pll/β)2


Edge preserving denoising?

• Tikhonov termpenalizes the imageedges (high gradient)more than the noise

16TikhonovT t l i timore than the noise

gradients.• It is well known that

10

12

14Total variation

Tikhonovregularization does notpreserve edges. 6

8

10

R(||

Gra

d||)

preserve edges.• An edge preserving

algorithm is obtained 2

4

considering R(p)=-ll∇pll[Total variation, TV].

0 0.5 1 1.5 2 2.5 3 3.5 40

||Grad||


Tikhonov vs. TV (preview)

50

100

50

100

Filtered image Difference

100

150

200

250

300

100

150

200

250

300

Tikhonov =>

Original image 300

350

400

450

300

350

400

450

500

50

100

150

Original image

50

50 100 150 200 250 300 350 400 450 500

50050 100 150 200 250 300 350 400 450 500

500

50

200

250

300

350

50

100

150

200

50

100

150

20050 100 150 200 250 300 350 400 450 500

400

450

500

250

300

350

400

250

300

350

400TV =>


50 100 150 200 250 300 350 400 450 500

450

500

50 100 150 200 250 300 350 400 450 500

450

500

TV in digital radiography: starting point and problemspoint and problems

p noisy image affected by Poisson noise (negative• pn, noisy image affected by Poisson noise (negativelog likelihood => KL);

• p, noise free image (unknown);p, noise free image (unknown);• R(p) = ||∇p|| (Total variation);• Minimize f(p|pn) = KL(pn,p) + λ· Σi=1 N||∇pi||.M n m f(p|pn) KL(pn,p) λ i=1..N||∇pi||

• How to compute ||∇pi||? => AA compromisecompromise|| i||betweenbetween computationalcomputational efficiencyefficiency andand accuracyaccuracyhashas toto bebe achievedachieved.

• How to minimize f(p|p )? > AnAn iterativeiterative• How to minimize f(p|pn)? => AnAn iterativeiterativeoptimizationoptimization techniquetechnique isis requiredrequired.


How to compute ||∇pi||?

px = p(u,v) – p(u-1,v) The computational costxpy = p(u,v) – p(u,v-1)||pi||1= |px|+|py| L1 norm||pi||2= [px

2+py2]½ L2 norm

Com

pu

The computational costincreases with thenumber of neighboursconsidered for computing

||pi||1= |px|+|py|+ |pxy|+|pyx|||pi||2= [px

2+py2+ pxy

2+pyx2]½

utationa

considered for computingthe gradient;

The computational cost ishighe fo L2 no m with

|| i||2 x y xy yx al cost

higher for L2 norm withrespect to L1 norm;

What about accuracy? =>||pi||1= |px|+|py|+ |pxy|+|pyx|+…||pi||2= [px

2+py2+ pxy

2+pyx2+…]½

WSee experimental results!


How to minimize f(p|pn)?

f( | ) is st ongly non line solving• f(p|pn) is strongly non linear; solvingdf(p|pn)/dp=0 directly is not possible

i i i iz i h d=> iterative optimization methods.

1) Steepest descent + line search(SD+LS)(SD+LS)

2) Expectation – Maximization (dampedwith line search EM)with line search - EM)

3) Scaled gradient (SG)


Steepest descent + line search (SD+LS)

Pk+1 k df( | )/d• Pk+1=pk-α·df(p|pn)/dp =>=> Pk+1=pk-α·df(p|pn)/dp

• The damping parameter α is estimated ateach iteration to assure convergence(fk+1 fk)(fk+1<fk);

+: easy implementation;-: slow convergence, the method has been

damped (line search) to improve convergence (α>1).


EM + line search (EM)

• Consider the pixel i then:• Consider the pixel i, then:

df(p|pn)/dpi=0 =>dKL( | )/d dR/d=> dKL(p|pn)/dpi + dR/dpi = 0

=> pi·β·dR/dpi + pi - pn,i = 0 =>=> pi = pn,i / (β·dR/dpi + 1) [Fixed poin iteration]

• Damped formula: pi = pi ·(1-α) + α·pn,I / (β·dR/dpi + 1)

• The damping parameter α is estimated at each iteration toassure convergence (fk+1<fk);

+: easy implementation, fast convergence;-: the method has been damped to assure convergence (α<1,

what happens when β·dR/dpi + 1 -> 0???).


pp β R pi

Scaled gradient (SG)

• Consider the gradient method formula;• Consider the gradient method formula;• Each component of the gradient is scaled to improve

convergence (S is a diagonal matrix containing the scalingparameters):parameters):

Pk+1=pk-α·S·df(p|pn)/dp

• The matrix S is computed from an opportune gradientdecomposition and KKT conditions;

+: easy implementation, fastest convergence; it can also be demonstrated that, for positive initial values, the estimated

solution remains positive at each iteration!solution remains positive at each iteration!-: ???.


Problems with dR/dpi

Inde endently f om the o timization• Independently from the optimizationmethod, the term dR/dpi has to becomputed at each iteration far any i;computed at each iteration far any i;

• We have:

dR/dpi = d[Σi=1..N(ll∇pill1)]/dpi

XOR

dR/dpi = d[Σi=1..N(ll∇pill2)]/dpi



L i R/d• Let us compute it for ll.ll2 (R/dpi =d[Σi=1..N(ll∇pill2)]/dpi)i 1..N i 2 i

( ) ( )[ ] ( ) ( )[ ]...

1,,,1, 22

1

2,

2,

=+⎟⎠⎞⎜

⎝⎛ −−+−−

=+

=∑

=

d

vupvupvupvupd

d

ppd

ddR

N

iiyix

( ) ( )[ ] ( ) ( )[ ]( ) ( )[ ] ( ) ( )[ ] ( ) ...

,2...

1,,,1,

1,,2,1,2 ,,

22+

∇

+=+

−−+−−

−−+−−=

vuppp

vupvupvupvup

vupvupvupvup

dpdpdp

iyix

iii

• To avoid division by zero:

( ) ( ) ( )[ ] ( ) ( )[ ]...

1,,,1,2...

,2

22

,,,, ++−−+−−

+→+

∇

+=

δvupvupvupvup

ppvuppp

dpdR iyixiyix

i



L i R/d• Let us compute it for ll.ll1 (R/dpi =d[Σi=1..N(ll∇pill1)]/dpi)i 1..N i 1 i

( ) ( ) ( )( ) ( )

( ) ( )( ) ( )

∑∑

= =+⎥⎥⎦

⎤

⎢⎢⎣

⎡ −−+

−−=

+=

N

N

iiyix vupvupvupvup

dpd

dp

ppd

dpdR

22221

,,

...1

1,,

1

,1,

( ) ( ) ( ) ( )

( ) ( )[ ]∑=

=

++=

⎥⎦⎢⎣ −−−−N

iiyix

i iii

psignpsign

vupvupvupvupdpdpdp

1,,

12222

...

1,,,1,

• Here divisions by zero areautomatically avoided only “sign” isautomatically avoided – only sign isrequired -> computationally efficient!


Questions

H i hb i d h• How many neighbor pixels do we haveto consider to achieve a satisfyingaccuracy at low computational cost?

• Best norm, ll.ll1 vs ll.ll2?

• Best optimization method (SD+LS• Best optimization method (SD+LS,EM, SG)?


TV in digital radiography…

Research in progress…


Results (answers)

i l d di h i h diff• 75 simulated radiographs with differentfrequency content, corrupted by Poisson

i ( h )noise (max 15,000 photons).

• For any filtered image, measure:

MAE = 1/NΣi=1..Nlpi,noisefree-pi,filteredlRMSE = [1/NΣi=1..N(pi,noisefree-pi,filtered)2 ]½

KL = Σi=1..N[pi,noisefree·ln(pi,noisefree/pi,filtered)+pi,noisefree-pi,filtered]


2 neighbors (01010000) vs. 4 neighbors (11110000)


ll.ll1 vs. ll.ll2


EM vs. SD+LS


EM vs. SG


Convergence and iterations


Filter effect

Original Filtered


Filter effect: before filtering


Filter effect: after filtering


Conclusion

Effective edge preserving filter• Effective edge preserving filter;• 2 neighbors, ll.ll1 and EM achieve the best

compromise between accuracy and computationalcompromise between accuracy and computationalcost;

• SD achieves results better then EM when theregularization parameter is not correctly selected.

Ad iv g l iz i• Adaptive regularization parameter;• GPU (CUDA) implementation;• Expanding the likelihood model• Expanding the likelihood model

– Mixture of Poisson, Gaussian and Impulsive noise;– Include the sensor point spread function.


I p p f

Documents

Denoising in digital radiography: A total variation approachhomes.di.unimi.it/frosio/Lessons/AY2010-2011-SI/Download/SI_2011.… · Gaussian noise and likelihood • Im gesImages