Learning Doubly Sparse Transforms for Image Representation
Saiprasad Ravishankar
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
October 1, 2012
Outline
Synthesis and Analysis models for sparse representation
Transform model - A generalized Analysis model
Unstructured transform learning
Doubly Sparse transform learning
Formulations, Algorithms, and Properties
Numerical examples
Image Denoising
Conclusions and Future Work
Synthesis Model for Sparse Representation
Given a signal y ∈ Rn and a dictionary D ∈ Rn×K , we assume y = Dx with ‖x‖0 = |supp(x)| ≪ K .
Real-world signals are modeled as y = Dx + e, where e is a deviation term.
Analytical overcomplete (K > n) dictionaries for sparse signal representation: Ridgelet, Contourlet, and Curvelet.
Given D, x = argminx ‖y − Dx‖22 subject to ‖x‖0 ≤ s, where s is the sparsity level. This is synthesis sparse coding - an NP-hard problem!
Greedy (e.g. Subspace Pursuit) and ℓ1-relaxation algorithms (e.g. Lasso) exist but are computationally expensive.
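As a concrete illustration of the greedy family, here is a minimal matching-pursuit loop - a simpler relative of the Subspace Pursuit mentioned above, not that exact algorithm. It assumes unit-norm dictionary columns; the function name is ours.

```python
import numpy as np

def matching_pursuit(D, y, s):
    """Greedy synthesis sparse coding: s times, pick the atom most
    correlated with the residual and subtract its contribution.
    Assumes the columns of D have unit norm."""
    r = y.astype(float).copy()
    x = np.zeros(D.shape[1])
    for _ in range(s):
        j = np.argmax(np.abs(D.T @ r))  # best-matching atom
        c = D[:, j] @ r                 # its coefficient on the residual
        x[j] += c
        r = r - c * D[:, j]
    return x
```

Each iteration costs a full D.T @ r product, which is why greedy synthesis sparse coding is expensive compared with the transform model discussed next.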
Analysis Model for Sparse Representation
Strict Analysis Model : Given a signal y ∈ Rn and an analysis dictionary Ω ∈ Rm×n, ‖Ωy‖0 ≪ m.
Noisy Signal Analysis Model : y = q + e, Ωq = z sparse.
Given Ω, q = argminq ‖y − q‖22 subject to ‖Ωq‖0 ≤ m − t. This is analysis sparse coding; t is the co-sparsity level.
When Ω is square and full rank, q = Ω−1z , and the model is identical to the synthesis model ⇒ finding z or q is NP-hard!
Rubinstein et al. (2012) use a backward-greedy algorithm. Yaghoobi et al. (2012) solve a Lagrangian version with ℓ1-relaxation.
However, the algorithms are computationally expensive.
Transform Model for Sparse Representation
Given a signal y ∈ Rn and a transform W ∈ Rm×n, we model Wy = x + η with ‖x‖0 ≪ m, where η is an error term.
Natural signals and images are approximately sparse in analytical transforms such as Wavelets and the DCT.
Given W , x = argminx ‖Wy − x‖22 subject to ‖x‖0 ≤ s. This is transform sparse coding.
x is computed exactly by thresholding Wy . Sparse coding is cheap! - just like for classical transforms. The signal is recovered as W †x .
Sparsifying transforms used in compression (JPEG2000), etc.
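The exact thresholding above is a few lines of NumPy; a minimal sketch with a function name of our choosing:

```python
import numpy as np

def transform_sparse_code(W, y, s):
    """Transform sparse coding: keep the s largest-magnitude entries
    of Wy, zero the rest. This solves
        min_x ||Wy - x||_2^2  s.t.  ||x||_0 <= s
    exactly - no pursuit needed."""
    z = W @ y
    x = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]  # indices of the s largest magnitudes
    x[idx] = z[idx]
    return x
```

The cost is one matrix-vector product plus a sort, which is why transform sparse coding is so much cheaper than synthesis or analysis sparse coding.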
Summary of the Models
Synthesis : finding x with given D is NP-hard.
y = Dx + e , ‖x‖0 ≤ s (1)
Noisy Signal Analysis : finding q with given Ω is NP-hard.
y = q + e , ‖Ωq‖0 ≤ m − t (2)
Transform : finding x with given W is easy ⇒ efficiency in applications.
Wy = x + η , ‖x‖0 ≤ s (3)
Learning Synthesis and Analysis Dictionaries
Adapting dictionaries to a class of data is advantageous in applications.
Synthesis and Analysis dictionary learning formulations are typically non-convex and NP-hard.
Approximate algorithms for Synthesis : MOD [1], K-SVD [2], online dictionary learning [3], etc.
Heuristics for Analysis :
Strict Analysis: Sequential Minimal Eigenvalues [4], AOL [5].
Noisy Analysis: Analysis K-SVD [6], NAAOL [7].
These algorithms are computationally expensive, with no global convergence guarantees; they may converge to bad local minima.
Yaghoobi et al. (2012) show that learnt analysis operators denoise not much better than a fixed finite-difference operator.
[1] Engan et al. '99, [2] Aharon et al. '06, [3] Mairal et al. '09, [4] Ophir et al. '11, [5] Yaghoobi et al. '11, [6] Rubinstein et al. '12, [7] Yaghoobi et al. '12.
Unstructured Transform Learning
(P1) minW,X ‖WY − X‖2F − λ log detW + µ ‖W‖2F s.t. ‖Xi‖0 ≤ s ∀ i
Y = [Y1 | Y2 | · · · | YN ] ∈ Rn×N : matrix of training signals.
X = [X1 | X2 | · · · | XN ] ∈ Rm×N : matrix of sparse codes Xi of the Yi .
‖WY − X‖2F is the sparsification error - it measures the deviation of the data in the transform domain from perfect sparsity at sparsity level s.
Problem (P1) is non-convex.
(P1) favors both a low sparsification error and good conditioning.
It enables complete control over the condition number. Conditioning of the transform is important in applications.
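For intuition, the (P1) objective can be evaluated directly; a sketch with names of our choosing (note that the −λ log det W term is only defined for det W > 0, which is exactly how it enforces non-degeneracy):

```python
import numpy as np

def p1_objective(W, Y, X, lam, mu):
    """Objective of (P1): sparsification error, minus a log-determinant
    term, plus a Frobenius-norm penalty. The last two terms together
    control the conditioning of W."""
    sign, logdet = np.linalg.slogdet(W)
    if sign <= 0:
        return np.inf  # log det undefined: (P1) implicitly forbids this
    err = np.linalg.norm(W @ Y - X, 'fro') ** 2
    return err - lam * logdet + mu * np.linalg.norm(W, 'fro') ** 2
```

Shrinking all of W toward zero drives −λ log det W to +∞ while growing W is penalized by µ‖W‖2F, so minimizers balance the two - this is the "complete control over condition number" mentioned above.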
Doubly Sparse Transforms
We propose to learn W ∈ Rn×n as W = BΦ.
Φ ∈ Rn×n : efficient analytical transform; B ∈ Rn×n : sparse matrix.
Motivation: Φ matrices such as the DCT produce an approximately sparse result when applied to natural images. Modifying that result using only a sparse B can produce highly sparse output.
BΦ is called ‘doubly sparse’, since it provides sparse representations for data and the matrix B is itself sparse.
W = BΦ combines advantages of trained and analytic transforms:
adapting to data, it performs better than Φ.
sparse B ⇒ it can be stored, applied efficiently.
Learning is more efficient than for unstructured transforms.
Double Sparsity
We propose to learn W ∈ Rn×n as W = BΦ.
Φ ∈ Rn×n : efficient analytical transform; B ∈ Rn×n : sparse matrix.
Learning a synthesis dictionary D = DaG , with Da an analytical dictionary and G a sparse matrix, was proposed with different motivations by Rubinstein et al. ’10.
Their algorithm is similar to K-SVD and has similar drawbacks.
Doubly Sparse Transform Learning
We formulate doubly sparse transform learning by setting W = BΦ in (P1). Φ is square and invertible.
(P2′) minB,X ‖BΦY − X‖2F − λ log det (BΦ) + µ ‖BΦ‖2F s.t. ‖B‖0 ≤ r , ‖Xi‖0 ≤ s ∀ i
Sparsity is measured by ‖B‖0 ≜ ∑i,j 1{Bij ≠ 0}; r is the sparsity level of B.
(P2′) and (P1) are equivalent for r = n².
Let Ȳ = ΦY . Equivalent Problem -
(P2) minB,X ‖BȲ − X‖2F − λ log det (B) + µ ‖B‖2F s.t. ‖B‖0 ≤ r , ‖Xi‖0 ≤ s ∀ i
Learning Algorithm
Our algorithm for (P2) alternates between updating X and B.
Sparse Coding Step : solved with fixed B. The solution is exact thresholding.
minX ‖BȲ − X‖2F s.t. ‖Xi‖0 ≤ s ∀ i (4)
Transform Update Step for (P2) -
minB ‖BȲ − X‖2F − λ log det B + µ ‖B‖2F s.t. ‖B‖0 ≤ r (5)
We could use projected gradients, or projected CG.
However, the heuristic strategy of employing the standard CG followed by post-thresholding led to better empirical performance.
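The alternation just described might be sketched as follows. For brevity the CG transform update is replaced here by a single gradient step; the step size, iteration count, and all names are illustrative choices of ours, not tuned values from the talk.

```python
import numpy as np

def learn_doubly_sparse(Ybar, s, r, lam=1.0, mu=1.0, iters=50, step=1e-4):
    """Alternating sketch for (P2) on Ybar = Phi @ Y.

    Sparse coding is exact column-wise thresholding; the transform
    update is one gradient step on the (P2) objective followed by hard
    thresholding of B to its r largest-magnitude entries (a simplified
    stand-in for the CG + post-thresholding heuristic).
    Warning: aggressive thresholding can make B singular; see the
    degenerate-state discussion later in the talk."""
    n = Ybar.shape[0]
    B = np.eye(n)
    X = np.zeros_like(Ybar)
    for _ in range(iters):
        # Sparse coding: keep the s largest entries of each column of B @ Ybar
        Z = B @ Ybar
        X = np.zeros_like(Z)
        idx = np.argsort(np.abs(Z), axis=0)[-s:, :]
        np.put_along_axis(X, idx, np.take_along_axis(Z, idx, axis=0), axis=0)
        # Gradient of the (P2) objective w.r.t. B:
        #   2(B Ybar - X) Ybar^T  -  lam B^{-T}  +  2 mu B
        grad = 2 * (B @ Ybar - X) @ Ybar.T \
            - lam * np.linalg.inv(B).T + 2 * mu * B
        B = B - step * grad
        # Post-threshold B to its r largest-magnitude entries
        keep = np.argsort(np.abs(B).ravel())[-r:]
        mask = np.zeros(B.size, dtype=bool)
        mask[keep] = True
        B = (B.ravel() * mask).reshape(n, n)
    return B, X
```

The sparse coding step is closed form, so all the iterative work sits in the B update - consistent with the cost analysis on the next slides.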
Computational Advantages vs. Synthesis/Analysis
Cost per iteration of the proposed algorithm: O(Nn²) for N training signals and B ∈ Rn×n.
Synthesis/Analysis K-SVD cost per iteration : O(Nn³) for the square case. The cost is dominated by sparse coding.
Faster computations enable larger problem sizes and much lower runtimes for applications.
Computational Advantages Over Unstructured Learning
The sparse coding step of (P2) has a per-iteration cost of Nn(βn + C2 log n), where C2 is a constant and β = r/n² is the sparsity factor of B that arises in sparse matrix multiplications.
For β ≪ 1, this cost is much lower than the cost of sparse coding in the unstructured Problem (P1), i.e., the β = 1 case.
Empirical Observation : For sufficiently small r , the algorithm for (P2) converges in fewer iterations than that for (P1).
We hypothesize that this is because, for small r , doubly sparse transforms have far fewer free parameters.
Numerical Examples - Framework and Metrics
The next couple of examples demonstrate properties of the formulation/algorithm.
Data - zero-mean, non-overlapping patches of natural images (Barbara).
Normalized Sparsification Error (NSE) measures the fraction of energy lost in the sparse fitting with sparse code X :
NSE = ‖BΦY − X‖2F / ‖BΦY‖2F
recovery Peak Signal to Noise Ratio (rPSNR) for images -
rPSNR = 255√P / ‖Y − W−1X‖F (reported in dB)
P - # of image pixels, W = BΦ.
rPSNR measures the error in recovering the patches as Y = W−1X .
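Both metrics are one-liners; a sketch with function names of ours, expressing rPSNR in dB (20 log10 of the ratio, an assumption consistent with the dB values reported on the following slides):

```python
import numpy as np

def nse(W, Y, X):
    """Normalized Sparsification Error: fraction of transform-domain
    energy lost when fitting the sparse code X."""
    return (np.linalg.norm(W @ Y - X, 'fro') ** 2
            / np.linalg.norm(W @ Y, 'fro') ** 2)

def rpsnr_db(W, Y, X, P):
    """Recovery PSNR (in dB) for patches recovered as W^{-1} X;
    P is the number of image pixels."""
    err = np.linalg.norm(Y - np.linalg.inv(W) @ X, 'fro')
    return 20 * np.log10(255 * np.sqrt(P) / err)
```

NSE is dimensionless in [0, 1]; rPSNR grows as the recovery error shrinks, so higher is better for rPSNR and lower is better for NSE.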
Convergence (n = 64, Φ = DCT, s = 11, r of 25%)
[Figure: Objective Function, Sparsification Error (with Φ = DCT and W = DCT reference levels), Condition Number κ(B), and Relative Iterate Change, each plotted vs. iteration number; plus the magnitude of B. rPSNR = 34.39 dB.]
Performance for Different Φ (Barbara : r of 25%)
Φ          NSE      rPSNR    NSE-Φ    rPSNR-Φ
DCT        0.0456   34.37    0.0676   32.85
Hadamard   0.0467   34.24    0.1156   30.52
Wavelets   0.0574   33.48    0.1692   28.86
Identity   0.1193   31.12    0.5145   24.03
Learnt transforms perform better than the analytical Φ.
Learnt W with DCT and Hadamard Φ differ only slightly in performance, although the DCT itself performs much better than the Hadamard.
Operations involving the Hadamard are faster since its entries are ±1 ⇒ doubly sparse learning allows us to exploit the inexpensive but poor Hadamard without loss of performance.
The case of Φ = In corresponds to a ‘self-sparse’ transform.
Doubly sparse W performs significantly better than self-sparse one.
Performance as Function of s and r ( Φ = DCT )
[Figure: NSE vs. r , rPSNR vs. r , and Condition Number κ vs. r for s = 10, 15, 20, and # Iterations vs. r at s = 15; r shown as percentage sparsity on a log scale.]
Performance as Function of s and r
We can learn a highly sparse B with only a marginal loss in performance.
A good choice of r would also depend on Φ.
The learnt transforms perform better than the DCT even at very low r such as 2% ⇒ the promise of efficient adaptive transforms over analytical ones.
Depending on r , the number of iterations is reduced 2-4x.
We expect greater speed-ups with optimized parameter choices and an optimized implementation of sparse matrix operations.
Performance vs. Patch Size n (r of 15%, s of 17%)
[Figure: NSE vs. n, rPSNR vs. n, and Condition Number κ vs. n (patch sizes up to 400), comparing the Doubly Sparse Transform with the DCT.]
The performance gap between the adaptive doubly sparse transforms and the DCT increases as a function of the number of pixels in the patch (n).
DCT performance saturates at larger n.
Thus, adaptivity and efficiency can help even more at large n.
Global Transforms - Image Compression
Global transforms are learnt over a variety of images (e.g. MRI) and used to represent other images.
Such transforms were observed to perform better than fixed ones such as the DCT on test images.
Doubly sparse global transforms generalize better than unstructured global transforms.
Doubly sparse global transforms are also efficient.
Noisy Signal Transform Model - Image Denoising
Goal - estimate an image x ∈ RP from its noisy measurement y = x + h.
(P3) minxi,αi ∑i=1…M [ ‖Wxi − αi‖22 + τ ‖Ri y − xi‖22 ] s.t. ‖αi‖0 ≤ si ∀ i (6)
Ri ∈ Rn×P extracts the i-th patch from y . M overlapping patches are assumed.
Assumption: each noisy patch Ri y is approximated by a noiseless patch xi that is sparsifiable.
αi ∈ Rn - sparse code of xi ; τ ∝ 1/σ, with σ the noise level.
The denoised image x is obtained by averaging the xi ’s.
W is learnt from patches of the noisy image with fixed s.
Image Denoising Algorithm
We solve (P3) in two steps with initial xi = Ri y .
Step 1 - the αi ’s are updated by thresholding WRi y .
Step 2 - each xi is independently updated by least squares:
xi = G [√τ Ri y ; αi ] , where G = [√τ I ; W ]† and ‘;’ denotes vertical stacking. (7)
Single iteration per patch suffices in practice.
Choose si such that ‖Ri y − xi‖22 ≤ nC²σ² after Step 2, with C fixed.
This requires repeating the two steps at various si ’s to determine the level at which the error condition is satisfied.
This is done efficiently by adding one non-zero at a time from WRi y (in descending magnitude order) to αi in (7) until the condition is satisfied with the newly updated xi .
The matrix G is pre-computed, and we update xi by adding scaled columns of G .
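One pass of the two-step patch update can be sketched as follows, for a square W; the function name is ours, and for clarity this sketch recomputes x from scratch rather than incrementally adding scaled columns of G as described above.

```python
import numpy as np

def denoise_patch(W, y_patch, s, tau):
    """One pass of the two-step patch denoiser.

    Step 1: sparse code alpha by keeping the s largest entries of W y.
    Step 2: least-squares patch update (7),
        x = G [sqrt(tau) y ; alpha],  G = pinv([sqrt(tau) I ; W])."""
    n = len(y_patch)
    # Step 1: exact thresholding of the transformed patch
    z = W @ y_patch
    alpha = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-s:]
    alpha[idx] = z[idx]
    # Step 2: stacked least-squares system
    A = np.vstack([np.sqrt(tau) * np.eye(n), W])
    G = np.linalg.pinv(A)          # in practice, pre-computed once
    rhs = np.concatenate([np.sqrt(tau) * y_patch, alpha])
    return G @ rhs
```

Since G depends only on W and τ, pre-computing it once lets the per-patch work reduce to thresholding plus a matrix-vector product.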
Image Denoising Example: n = 64, Φ = DCT.
[Figure: Original Peppers; Noisy (28.16 dB); Denoised result; Denoised PSNR vs. percentage sparsity r . PSNR = 34.30 dB with 64×64 W at 10% r ; 34.45 dB with 64×64 W at 100% r ; 34.28 dB with 64×256 synthesis D.]
Image Denoising Example
The learnt W at r of 10% has κ = 2.79 ⇒ well-conditioned transforms denoise well.
The denoised PSNR at r of 100% is 0.4 dB better than that obtained using the fixed DCT in (P3).
Even at a low r of 5%, learnt W denoises better than the DCT.
Our denoising algorithm is about 3-5x faster than K-SVD denoisingat low r , while at 100% r , the speed-up is lower.
Run times can be drastically reduced by an efficient implementation of sparse operations and by using fewer iterations, with only a marginal decrease in denoised PSNR.
Conclusions
We proposed formulations for learning doubly sparse transforms that are highly effective for natural images.
The proposed algorithms encourage well-conditioning and have low computational cost.
Doubly sparse property leads to faster learning, faster computations,reduced storage requirement, and better generalization.
Adapted doubly sparse transforms provide significantly betterrepresentations than analytical ones.
They denoise better than learnt overcomplete synthesis dictionaries.
Doubly sparse transforms also denoise as well as non-sparse ones, but faster.
Future Work : Denoising with a collection of adaptive transforms.
Issues With Hard Thresholding
For poor initializations and low r , hard thresholding of B in (P6) can cause rank loss or a change of determinant sign.
det (B) is affine in Bij with all other entries fixed.
In case of rank loss, we scale a non-zero Bij of B so that the modified B has a positive determinant.
Alternatively, we can differentially scale the entries of an entire column of B.
The procedures maintain support of B.
If thresholding produces a B with det (B) < 0, we trivially swap two rows of B along with the corresponding rows of X .
Empirical observation: With identity initialization, the algorithm does not reach degenerate states.
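The row-swap fix for a negative determinant is tiny; a sketch (which rows to swap is arbitrary, so the first two are used here):

```python
import numpy as np

def fix_det_sign(B, X):
    """If hard thresholding left det(B) < 0, swap two rows of B and
    the corresponding rows of X. Row swaps flip the determinant's
    sign, preserve the sparsity level of B, and leave the data term
    ||B Ybar - X||_F^2 unchanged (rows of B Ybar and X permute
    together)."""
    if np.linalg.det(B) < 0:
        B[[0, 1]] = B[[1, 0]]
        X[[0, 1]] = X[[1, 0]]
    return B, X
```

This keeps the objective value intact while restoring det(B) > 0, so −λ log det B stays well defined on the next iteration.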