

Learning Doubly Sparse Transforms for Image Representation

Saiprasad Ravishankar

Department of Electrical and Computer Engineering and Coordinated Science Laboratory

University of Illinois at Urbana-Champaign

October 1, 2012

Outline

Synthesis and Analysis models for sparse representation

Transform model - A generalized Analysis model

Unstructured transform learning

Doubly Sparse transform learning

Formulations, Algorithms, and Properties

Numerical examples

Image Denoising

Conclusions and Future Work

Synthesis Model for Sparse Representation

Given a signal y ∈ R^n and a dictionary D ∈ R^{n×K}, we assume y = Dx with ‖x‖_0 = |supp(x)| ≪ K.

Real-world signals are modeled as y = Dx + e, where e is a deviation term.

Analytical overcomplete (K > n) dictionaries for sparse signal representation - Ridgelet, Contourlet, and Curvelet dictionaries.

Given D, x = argmin ‖y − Dx‖_2^2 subject to ‖x‖_0 ≤ s, where s is the sparsity level. This is synthesis sparse coding - an NP-hard problem!

Greedy (e.g., Subspace Pursuit) and ℓ1-relaxation (e.g., Lasso) algorithms exist but are computationally expensive.
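To make the cost concrete, here is a minimal numpy sketch of Orthogonal Matching Pursuit, a representative greedy method for synthesis sparse coding (the slide names Subspace Pursuit; OMP is used here only as an illustration, and D, y, s are assumed inputs):

```python
import numpy as np

def omp(D, y, s):
    """Greedy synthesis sparse coding: pick the atom most correlated with the
    residual, refit by least squares over the current support, repeat s times."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(s):
        k = int(np.argmax(np.abs(D.T @ residual)))      # best-matching atom
        support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x[support] = coeffs
    return x
```

Each iteration requires a least-squares solve over the growing support, which is what makes such methods expensive compared with the simple thresholding used in the transform model later in this talk.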

Analysis Model for Sparse Representation

Strict Analysis Model: Given a signal y ∈ R^n and an analysis dictionary Ω ∈ R^{m×n}, ‖Ωy‖_0 ≪ m.

Noisy Signal Analysis Model : y = q + e, Ωq = z sparse.

Given Ω, q = argmin ‖y − q‖_2^2 subject to ‖Ωq‖_0 ≤ m − t. This is analysis sparse coding, where t is the co-sparsity level.

When Ω is square and full rank, q = Ω^{-1}z, and the model is identical to the synthesis model ⇒ finding z or q is NP-hard!

Rubinstein et al. (2012) use a backward-greedy algorithm. Yaghoobi et al. (2012) solve a Lagrangian version with ℓ1-relaxation.

However, the algorithms are computationally expensive.

Transform Model for Sparse Representation

Given a signal y ∈ R^n and a transform W ∈ R^{m×n}, we model Wy = x + η with ‖x‖_0 ≪ m, where η is an error term.

Natural signals and images are approximately sparse in analytical transforms - Wavelets and DCT.

Given W, x = argmin ‖Wy − x‖_2^2 subject to ‖x‖_0 ≤ s. This is transform sparse coding.

x is computed exactly by thresholding Wy (keeping its s largest-magnitude entries). Sparse coding is cheap - just like for classical transforms. The signal is recovered as W^†x (see the sketch after this slide).

Sparsifying transforms used in compression (JPEG2000), etc.
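As referenced above, a minimal numpy sketch of transform sparse coding and recovery (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def transform_sparse_code(W, y, s):
    """Exact transform sparse coding: keep the s largest-magnitude entries of Wy."""
    z = W @ y
    x = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[-s:]      # indices of the s largest magnitudes
    x[keep] = z[keep]
    return x

# Recovery: for square, invertible W, y_hat = np.linalg.solve(W, x);
# more generally y_hat = np.linalg.pinv(W) @ x, i.e., W†x.
```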

Summary of the Models

Synthesis : finding x with given D is NP-hard.

y = Dx + e , ‖x‖0 ≤ s (1)

Noisy Signal Analysis : finding q with given Ω is NP-hard.

y = q + e , ‖Ωq‖0 ≤ m − t (2)

Transform : finding x with given W is easy ⇒ efficiency in applications.

Wy = x + η , ‖x‖0 ≤ s (3)

Learning Synthesis and Analysis Dictionaries

Adapting dictionaries to a class of data is advantageous in applications.

Synthesis and Analysis dictionary learning formulations - typically non-convex and NP-hard.

Approximate algorithms for Synthesis: MOD [1], K-SVD [2], online dictionary learning [3], etc.

Heuristics for Analysis :

Strict Analysis: Sequential Minimal Eigenvalues [4], AOL [5].

Noisy Analysis: Analysis K-SVD [6], NAAOL [7].

These algorithms are computationally expensive, come with no global convergence guarantees, and may converge to bad local minima.

Yaghoobi et al. (2012) show that learnt analysis operators denoise not much better than a fixed finite-difference operator.

[1] Engan et al. '99, [2] Aharon et al. '06, [3] Mairal et al. '09, [4] Ophir et al. '11, [5] Yaghoobi et al. '11, [6] Rubinstein et al. '12, [7] Yaghoobi et al. '12.

Unstructured Transform Learning

(P1)   min_{W,X}  ‖WY − X‖_F^2 − λ log det W + µ ‖W‖_F^2   s.t. ‖X_i‖_0 ≤ s ∀ i

Y = [Y_1 | Y_2 | … | Y_N] ∈ R^{n×N}: matrix of training signals.

X = [X_1 | X_2 | … | X_N] ∈ R^{m×N}: matrix of sparse codes of the Y_i.

‖WY − X‖_F^2 is the sparsification error - it measures the deviation of the data in the transform domain from perfect sparsity at sparsity level s.

Problem (P1) is non-convex.

(P1) favors both a low sparsification error and good conditioning.

It enables complete control over the condition number. The conditioning of the transform is important in applications.

Doubly Sparse Transforms

We propose to learn W ∈ Rn×n as W = BΦ.

Φ ∈ R^{n×n}: efficient analytical transform, B ∈ R^{n×n}: sparse matrix.

Motivation: Φ matrices such as the DCT produce an approximately sparse result when applied to natural images. Modifying the result using only a sparse B can produce highly sparse output.

BΦ is called ‘doubly sparse’, since it provides sparse representations for data and has a matrix B that is sparse.

W = BΦ combines advantages of trained and analytic transforms:

adapting to data, it performs better than Φ.

sparse B ⇒ it can be stored, applied efficiently.

Learning is more efficient than for unstructured transforms.

Double Sparsity

We propose to learn W ∈ Rn×n as W = BΦ.

Φ ∈ R^{n×n}: efficient analytical transform, B ∈ R^{n×n}: sparse matrix.

Learning a synthesis D = D_a G, with D_a an analytical dictionary and G a sparse matrix, was proposed with different motivations by Rubinstein et al. '10.

Their algorithm is similar to K-SVD and has similar drawbacks.

Doubly Sparse Transform Learning

We formulate doubly sparse transform learning by setting W = BΦ in (P1). Φ is square and invertible.

(P2′)   min_{B,X}  ‖BΦY − X‖_F^2 − λ log det(BΦ) + µ ‖BΦ‖_F^2   s.t. ‖B‖_0 ≤ r, ‖X_i‖_0 ≤ s ∀ i

Sparsity is measured by ‖B‖_0 ≜ ∑_{i,j} 1_{B_{ij} ≠ 0}. r is the sparsity level.

(P2′) and (P1) are equivalent for r = n^2.

Let Ỹ = ΦY. Equivalent problem -

(P2)   min_{B,X}  ‖BỸ − X‖_F^2 − λ log det(B) + µ ‖B‖_F^2   s.t. ‖B‖_0 ≤ r, ‖X_i‖_0 ≤ s ∀ i
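A minimal sketch of precomputing Ỹ = ΦY when Φ is taken as the 2-D DCT on vectorized 8×8 patches (an assumed convention: patch size, column-major vectorization, and the random placeholder Y are illustrative):

```python
import numpy as np
from scipy.fft import dct

p = 8                                          # patch side, so n = p*p = 64
D1 = dct(np.eye(p), axis=0, norm='ortho')      # 1-D orthonormal DCT-II matrix
Phi = np.kron(D1, D1)                          # 2-D DCT acting on column-major
                                               # (Fortran-order) vectorized patches

Y = np.random.randn(p * p, 1000)               # placeholder training patches
Ytil = Phi @ Y                                 # Ytil = ΦY, computed once;
                                               # learning then involves only B and Ytil
```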

Learning Algorithm

Our algorithm for (P2) alternates between updating X and B.

Sparse Coding Step: solved with B fixed. The exact solution is thresholding BỸ (keeping the s largest-magnitude entries of each column).

min_X  ‖BỸ − X‖_F^2   s.t. ‖X_i‖_0 ≤ s ∀ i    (4)

Transform Update Step for (P2) -

min_B  ‖BỸ − X‖_F^2 − λ log det B + µ ‖B‖_F^2   s.t. ‖B‖_0 ≤ r    (5)

We could use projected gradients, or projected CG.

However, the heuristic strategy of employing standard CG followed by post-thresholding led to better empirical performance.
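A minimal numpy sketch of one alternation, under assumptions: plain gradient descent with a fixed small step stands in for the CG update described above, and the hard-thresholding projection keeps the r largest-magnitude entries of B (function names, step sizes, and iteration counts are illustrative):

```python
import numpy as np

def sparse_code(R, s):
    """Exact solution of (4): keep the s largest-magnitude entries per column of R = B @ Ytil."""
    X = np.zeros_like(R)
    rows = np.argsort(-np.abs(R), axis=0)[:s]       # top-s row indices per column
    cols = np.arange(R.shape[1])
    X[rows, cols] = R[rows, cols]
    return X

def transform_update(B, Ytil, X, lam, mu, r, step=1e-7, iters=50):
    """Heuristic for (5): descend the smooth objective, then hard-threshold B to r nonzeros."""
    for _ in range(iters):
        grad = 2 * (B @ Ytil - X) @ Ytil.T - lam * np.linalg.inv(B).T + 2 * mu * B
        B = B - step * grad
    cutoff = np.sort(np.abs(B), axis=None)[-r]      # r-th largest magnitude in B
    return B * (np.abs(B) >= cutoff)

def learn(B, Ytil, s, r, lam, mu, iters=100):
    """Alternate the two steps of (P2), starting from an initial B."""
    for _ in range(iters):
        X = sparse_code(B @ Ytil, s)
        B = transform_update(B, Ytil, X, lam, mu, r)
    return B, X
```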

Computational Advantages vs. Synthesis/Analysis

Cost per iteration of the proposed algorithm: O(Nn^2) for N training signals and B ∈ R^{n×n}.

Synthesis/Analysis K-SVD cost per iteration: O(Nn^3) for the square case. The cost is dominated by sparse coding.

Faster computations enable larger problem sizes and much lower run times for applications.

Computational Advantages Over Unstructured Learning

The sparse coding step of (P2) has a cost per iteration of Nn(βn + C_2 log n), where C_2 is a constant and β = r/n^2 is the sparsity factor of B that arises in sparse matrix multiplications.

For β ≪ 1, this cost is much lower than the cost of sparse coding in the unstructured Problem (P1), i.e., the β = 1 case.

Empirical Observation: For sufficiently small r, the algorithm for (P2) converges in fewer iterations than that for (P1).

We hypothesize that this is because, for small r, doubly sparse transforms have far fewer free parameters.
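A small sketch of the storage/speed point: keeping B in a compressed sparse format makes storing and applying W = BΦ cheap when β ≪ 1 (the sizes and the random B below are placeholders, not learnt quantities):

```python
import numpy as np
import scipy.sparse as sp

n, N, beta = 64, 10_000, 0.05     # beta = r / n^2, fraction of nonzeros in B
B = sp.random(n, n, density=beta, format='csr', random_state=0)  # stand-in for a learnt B
Ytil = np.random.randn(n, N)      # Ytil = ΦY, precomputed once

Z = B @ Ytil                      # ~ beta * n^2 * N multiplies, vs. n^2 * N for a dense W
print(B.nnz, "nonzeros stored instead of", n * n)
```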

Numerical Examples - Framework and Metrics

The next couple of examples demonstrate properties of the formulation/algorithm.

Data - zero-mean, non-overlapping patches of natural images (Barbara).

Normalized Sparsification Error (NSE) measures the fraction of energy lost in sparse fitting with sparse code X.

NSE = ‖BΦY − X‖_F^2 / ‖BΦY‖_F^2

Recovery Peak Signal to Noise Ratio (rPSNR) for images -

rPSNR = 255√P / ‖Y − W^{-1}X‖_F

P - # of image pixels, W = BΦ.

rPSNR measures the error in recovering the patches as Ŷ = W^{-1}X.
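A minimal numpy sketch of these two metrics, assuming a square, invertible W = BΦ, taking P = Y.size, and reporting rPSNR in dB via 20·log10 (an assumed reporting convention, consistent with the dB values quoted on the slides):

```python
import numpy as np

def nse(W, Y, X):
    """Normalized sparsification error: energy fraction lost by the sparse fit X."""
    WY = W @ Y
    return np.linalg.norm(WY - X, 'fro')**2 / np.linalg.norm(WY, 'fro')**2

def rpsnr_db(W, Y, X):
    """Recovery PSNR (in dB) for patches recovered as W^{-1} X."""
    P = Y.size
    err = np.linalg.norm(Y - np.linalg.solve(W, X), 'fro')
    return 20 * np.log10(255 * np.sqrt(P) / err)
```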

Convergence (n = 64, Φ = DCT, s = 11, r of 25%)

[Figure: plots vs. iteration number of the objective function, the sparsification error (with Φ = DCT and W = DCT shown for reference), the condition number κ(B), and the relative iterate change, together with the magnitude of the learnt B. rPSNR = 34.39 dB.]

Performance for Different Φ (Barbara : r of 25%)

Φ          NSE       rPSNR (dB)   NSE-Φ     rPSNR-Φ (dB)
DCT        0.0456    34.37        0.0676    32.85
Hadamard   0.0467    34.24        0.1156    30.52
Wavelets   0.0574    33.48        0.1692    28.86
Identity   0.1193    31.12        0.5145    24.03

(NSE and rPSNR are for the learnt W = BΦ; NSE-Φ and rPSNR-Φ are for the analytical Φ alone.)

Learnt transforms perform better than the analytical Φ.

Learnt W with DCT and Hadamard Φ differ only slightly in performance, although the DCT itself performs much better than the Hadamard.

Operations involving the Hadamard are faster since its entries are ±1 ⇒ doubly sparse learning allows us to exploit the inexpensive but poor Hadamard without loss of performance.

The case of Φ = I_n corresponds to a ‘self-sparse’ transform.

Doubly sparse W performs significantly better than self-sparse one.

Performance as Function of s and r ( Φ = DCT )

[Figure: plots vs. percentage sparsity r (for s = 10, 15, 20): NSE, recovery PSNR, condition number κ, and number of iterations (at s = 15).]

Performance as Function of s and r

We can learn highly sparse B with only a marginal loss in performance.

A good choice of r would also depend on Φ.

The learnt transforms perform better than the DCT even at very low r such as 2% ⇒ promise of efficient adaptive transforms over analytical ones.

Depending on r , the number of iterations is reduced 2-4x.

We expect greater speed-ups with optimized parameter choices and an optimized implementation of sparse matrix operations.

Performance vs. Patch Size n (r of 15%, s of 17%)

[Figure: plots vs. patch size n of NSE, recovery PSNR, and condition number κ for the doubly sparse transform and the DCT.]

The performance gap between the adaptive doubly sparse transforms and the DCT increases with the number of pixels in the patch (n).

DCT performance saturates at larger n.

Thus, adaptivity and efficiency can help even more at large n.

Global Transforms - Image Compression

Global transforms are learnt over a variety of images (e.g., MRI) and used to represent other images.

Such transforms were observed to perform better than fixed ones such as the DCT on test images.

Doubly sparse global transforms generalize better than unstructured global transforms.

Doubly sparse global transforms are also efficient.

Noisy Signal Transform Model - Image Denoising

Goal - estimate an image x ∈ R^P from its noisy measurement y = x + h.

(P3)   min_{x_i, α_i}  ∑_{i=1}^{M} ‖W x_i − α_i‖_2^2 + τ ∑_{i=1}^{M} ‖R_i y − x_i‖_2^2   s.t. ‖α_i‖_0 ≤ s_i ∀ i    (6)

R_i ∈ R^{n×P} extracts the i-th patch from y. M overlapping patches are assumed.

Assumption: the noisy R_i y is approximated by a noiseless patch x_i that is sparsifiable.

α_i ∈ R^n is the sparse code of x_i; τ ∝ 1/σ, with σ the noise level.

The denoised image x is obtained by averaging the x_i's.

W is learnt from patches of noisy image with fixed s.

Image Denoising Algorithm

We solve (P3) in two steps with initial xi = Ri y .

Step 1 - The α_i's are updated by thresholding W R_i y.
Step 2 - Each x_i is independently updated by least squares as follows.

x_i = G [ √τ R_i y ; α_i ],   where   G = [ √τ I ; W ]^†    (7)

(here [a ; b] denotes vertical stacking)

Single iteration per patch suffices in practice.

Choose s_i such that ‖R_i y − x_i‖_2^2 ≤ nC^2σ^2 after Step 2, where C is a fixed constant.

This requires repeating the two steps at various s_i to determine the level at which the error condition is satisfied.

This is done efficiently by adding one non-zero at a time from W R_i y (in descending order) to α_i in (7) until the condition is satisfied with the newly updated x_i.

The matrix G is pre-computed, and we update x_i by adding scaled columns of G.
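A minimal numpy sketch of Steps 1-2 for a single patch, under assumptions: variable names are illustrative, and G is rebuilt inside the function for clarity (the slides pre-compute it once per image and update x_i incrementally via scaled columns of G):

```python
import numpy as np

def denoise_patch(W, Ry, sigma, C, tau):
    """Grow alpha_i one non-zero at a time (largest magnitudes of W R_i y first),
    re-solving the least-squares update (7), until ||R_i y - x_i||^2 <= n C^2 sigma^2."""
    n = Ry.size
    G = np.linalg.pinv(np.vstack([np.sqrt(tau) * np.eye(n), W]))  # G = [sqrt(tau) I ; W]^†
    z = W @ Ry
    order = np.argsort(-np.abs(z))             # entries of W R_i y in descending magnitude
    alpha = np.zeros(W.shape[0])
    x = Ry.copy()                              # initial x_i = R_i y
    for k in order:
        alpha[k] = z[k]                        # add one more non-zero to alpha_i
        x = G @ np.concatenate([np.sqrt(tau) * Ry, alpha])    # update (7)
        if np.sum((Ry - x)**2) <= n * C**2 * sigma**2:
            break
    return x
```

The denoised image is then obtained by averaging the denoised overlapping patches, as stated on the previous slide.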

Image Denoising Example: n = 64, Φ = DCT.

[Figure: original Peppers image, noisy image (28.16 dB), denoised image, and a plot of denoised PSNR vs. percentage sparsity r. Denoised PSNR: 34.30 dB with a 64×64 W at 10% r, 34.45 dB with a 64×64 W at 100% r, and 34.28 dB with a 64×256 synthesis D.]

Image Denoising Example

The learnt W at r of 10% has κ = 2.79 ⇒ well-conditioned transforms denoise well.

The denoised PSNR at r of 100% is 0.4 dB better than that obtained using the fixed DCT in (P3).

Even at a low r of 5%, learnt W denoises better than the DCT.

Our denoising algorithm is about 3-5x faster than K-SVD denoising at low r, while at 100% r the speed-up is lower.

Run times can be drastically reduced by efficient implementation of sparse operations and by using fewer iterations, with only a marginal decrease in denoised PSNR.

Conclusions

We proposed formulations for learning doubly sparse transforms that are highly effective for natural images.

The proposed algorithms encourage well-conditioning and have low computational cost.

The doubly sparse property leads to faster learning, faster computations, reduced storage requirements, and better generalization.

Adapted doubly sparse transforms provide significantly better representations than analytical ones.

They denoise better than learnt overcomplete synthesis dictionaries.

Doubly sparse transforms also denoise as well as non-sparse ones, but faster.

Future Work : Denoising with a collection of adaptive transforms.

Issues With Hard Thresholding

For poor initializations and low r, hard thresholding of B in the transform update step (5) can cause rank loss or a change of determinant sign.

det (B) is affine in Bij with all other entries fixed.

In case of rank loss, we scale a non-zero B_ij of B so that the modified B has a positive determinant.

Alternatively, we can differentially scale the entries of an entire column of B.

The procedures maintain support of B.

If thresholding produces B with det(B) < 0, then trivially swap two rows of B along with the corresponding rows of X (see the sketch after this slide).

Empirical observation: With identity initialization, the algorithm does not reach degenerate states.
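As referenced above, a tiny numpy sketch of the determinant-sign fix (the choice of which two rows to swap is arbitrary here):

```python
import numpy as np

def fix_det_sign(B, X):
    """If hard thresholding left det(B) < 0, swap two rows of B and the
    corresponding rows of X; this flips the sign of det(B), preserves the
    support of B, and leaves the sparsification error ||B Ytil - X||_F unchanged."""
    if np.linalg.det(B) < 0:
        B[[0, 1]] = B[[1, 0]]
        X[[0, 1]] = X[[1, 0]]
    return B, X
```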