Sparse Coding and Dictionary Learning for Image Analysis
Julien Mairal
INRIA Visual Recognition and Machine Learning Summer School, 27th July 2010
Julien Mairal Sparse Coding and Dictionary Learning 1/137
What is this lecture about?
Why sparsity, what for, and how?
Signal and image processing: Restoration, reconstruction.
Machine learning: Selecting relevant features.
Computer vision: Modelling the local appearance of image patches.
Computer vision: Recent (and intriguing) results in bag-of-words models.
Optimization: Solving challenging problems.
The Image Denoising Problem

y (measurements) = x_orig (original image) + w (noise)
Sparse representations for image restoration

y (measurements) = x_orig (original image) + w (noise)

Energy minimization problem - MAP estimation

E(x) = (1/2)‖y − x‖²₂ + Pr(x)

where the first term measures the relation to the measurements and Pr(x) is the image model (−log prior).

Some classical priors
Smoothness: λ‖Lx‖²₂
Total variation: λ‖∇x‖₁
MRF priors
. . .
What is a Sparse Linear Model?

Let x in R^m be a signal.

Let D = [d₁, . . . , d_p] ∈ R^(m×p) be a set of normalized “basis vectors”. We call it a dictionary.

D is “adapted” to x if it can represent it with a few basis vectors—that is, there exists a sparse vector α in R^p such that x ≈ Dα. We call α the sparse code.

x ≈ D [α[1], α[2], . . . , α[p]]ᵀ,  with x ∈ R^m, D ∈ R^(m×p), and α ∈ R^p sparse.
First Important Idea
Why Sparsity?
A dictionary can be good for representing a class of signals, but not for representing white Gaussian noise.
The Sparse Decomposition Problem

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λψ(α)

where the first term is the data-fitting term and ψ is a sparsity-inducing regularization.

ψ induces sparsity in α. It can be
the ℓ0 “pseudo-norm”: ‖α‖₀ ≜ #{i s.t. α[i] ≠ 0} (NP-hard),
the ℓ1 norm: ‖α‖₁ ≜ Σᵖᵢ₌₁ |α[i]| (convex),
. . .

This is a selection problem. When ψ is the ℓ1-norm, the problem is called the Lasso [Tibshirani, 1996] or basis pursuit [Chen et al., 1999].
Sparse representations for image restoration

Designed dictionaries
[Haar, 1910], [Zweig, Morlet, Grossman ∼70s], [Meyer, Mallat, Daubechies, Coifman, Donoho, Candes ∼80s–today]. . . (see [Mallat, 1999])
Wavelets, Curvelets, Wedgelets, Bandlets, . . . lets

Learned dictionaries of patches
[Olshausen and Field, 1997], [Engan et al., 1999], [Lewicki and Sejnowski, 2000], [Aharon et al., 2006], [Roth and Black, 2005], [Lee et al., 2007]

min_{αᵢ, D∈C} Σᵢ (1/2)‖xᵢ − Dαᵢ‖²₂ + λψ(αᵢ)

where the first term measures reconstruction and ψ induces sparsity:
ψ(α) = ‖α‖₀ (“ℓ0 pseudo-norm”)
ψ(α) = ‖α‖₁ (ℓ1 norm)
Sparse representations for image restoration

Solving the denoising problem [Elad and Aharon, 2006]
Extract all overlapping 8 × 8 patches xᵢ.
Solve a matrix factorization problem (reconstruction + sparsity):

min_{αᵢ, D∈C} Σⁿᵢ₌₁ (1/2)‖xᵢ − Dαᵢ‖²₂ + λψ(αᵢ),

with n > 100,000.
Average the reconstruction of each patch.
Sparse representations for image restoration
K-SVD: [Elad and Aharon, 2006]
Figure: Dictionary trained on a noisy version of the image boat.
Sparse representations for image restoration

Inpainting, Demosaicking

min_{D∈C, α} Σᵢ (1/2)‖βᵢ ⊗ (xᵢ − Dαᵢ)‖²₂ + λᵢψ(αᵢ)

RAW Image Processing pipeline: white balance, black subtraction, denoising, demosaicking, conversion to sRGB, gamma correction.
Sparse representations for image restoration
[Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b]

Sparse representations for image restoration
[Mairal, Sapiro, and Elad, 2008d]

Sparse representations for image restoration
Inpainting, [Mairal, Elad, and Sapiro, 2008b]
Sparse representations for video restoration

Key ideas for video processing [Protter and Elad, 2009]
Using a 3D dictionary.
Processing many frames at the same time.
Dictionary propagation.
Sparse representations for image restoration
Inpainting, [Mairal, Sapiro, and Elad, 2008d]
Figure: Inpainting results.
Sparse representations for image restoration
Color video denoising, [Mairal, Sapiro, and Elad, 2008d]
Figure: Denoising results, σ = 25.
Digital Zooming (Couzinie-Devy, 2010)
Figures: Original, Bicubic, and Proposed method (two examples).
Inverse half-toning
Figures: Original and reconstructed images (five examples).
One short slide on compressed sensing

Important message
Sparse coding is not “compressed sensing”.
Compressed sensing is a theory [see Candes, 2006] saying that a sparse signal can be recovered with high probability from a few linear measurements under some conditions.

Signal acquisition: Wᵀx, where W ∈ R^(m×s) is a “sensing” matrix with s ≪ m.
Signal decoding: min_{α∈R^p} ‖α‖₁ s.t. Wᵀx = WᵀDα,
with extensions to approximately sparse signals and noisy measurements.

Remark
The dictionaries we are using in this lecture do not satisfy the recovery assumptions of compressed sensing.
Important messages
Patch-based approaches are achieving state-of-the-art results for many image processing tasks.
Dictionary learning adapts to the data you want to restore.
Dictionary learning is well adapted to data that admit sparse representations. Sparsity is for sparse data only.
Next topics
Why does the ℓ1-norm induce sparsity?
Some properties of the Lasso.
Beyond sparsity: group-sparsity.
The simplest algorithm for learning dictionaries.
Links between dictionary learning and matrix factorization techniques.
Why does the ℓ1-norm induce sparsity?
Example: quadratic problem in 1D

min_{α∈R} (1/2)(x − α)² + λ|α|

Piecewise quadratic function with a kink at zero.
Derivative at 0⁺: g⁺ = −x + λ, and at 0⁻: g⁻ = −x − λ.
Optimality conditions. α is optimal iff:
|α| > 0 and (x − α) + λ sign(α) = 0, or
α = 0 and g⁺ ≥ 0 and g⁻ ≤ 0.
The solution is a soft-thresholding:

α⋆ = sign(x)(|x| − λ)₊.
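This closed-form solution is easy to check numerically. A minimal NumPy sketch (Python is used here purely for illustration; the function name is ours):

```python
import numpy as np

def soft_threshold(x, lam):
    """Closed-form minimizer of 0.5*(x - a)^2 + lam*|a| over a:
    shrink |x| by lam, zeroing it when |x| <= lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

Comparing against a brute-force grid search over α confirms the formula.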
Why does the ℓ1-norm induce sparsity?
Figure: (a) the soft-thresholding operator α⋆ as a function of x; (b) the hard-thresholding operator.
Why does the ℓ1-norm induce sparsity?
Analysis of the norms in 1D
ψ(α) = α², ψ′(α) = 2α
ψ(α) = |α|, ψ′₋(α) = −1, ψ′₊(α) = 1
The gradient of the ℓ2-norm vanishes when α gets close to 0. On its differentiable part, the gradient of the ℓ1-norm has constant magnitude.
Why does the ℓ1-norm induce sparsity?
Geometric explanation
Figure: geometric illustration of the two formulations.

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁
min_{α∈R^p} ‖x − Dα‖²₂ s.t. ‖α‖₁ ≤ T.
Important property of the Lasso
Piecewise linearity of the regularization path
Figure: Regularization path of the Lasso — coefficient values α₁, . . . , α₅ as functions of λ for

min_{α} (1/2)‖x − Dα‖²₂ + λ‖α‖₁.
Sparsity-Inducing Norms (1/2)

min_{α∈R^p} f(α) + λψ(α)

where f is the data-fitting term and ψ a sparsity-inducing norm.

Standard approach to enforce sparsity in learning procedures:
Regularizing by a sparsity-inducing norm ψ.
The effect of ψ is to set some αⱼ’s to zero, depending on the regularization parameter λ ≥ 0.

The most popular choice for ψ:
The ℓ1 norm, ‖α‖₁ = Σᵖⱼ₌₁ |αⱼ|.
For the square loss, the Lasso [Tibshirani, 1996].
However, the ℓ1 norm encodes poor information, just cardinality!
Sparsity-Inducing Norms (2/2)

Another popular choice for ψ:
The ℓ1-ℓ2 norm,

Σ_{G∈𝒢} ‖α_G‖₂ = Σ_{G∈𝒢} (Σ_{j∈G} αⱼ²)^(1/2), with 𝒢 a partition of {1, . . . , p}.

The ℓ1-ℓ2 norm sets groups of non-overlapping variables to zero (as opposed to single variables for the ℓ1 norm).
For the square loss, the group Lasso [Yuan and Lin, 2006].
However, the ℓ1-ℓ2 norm encodes fixed/static prior information and requires knowing in advance how to group the variables!

Applications:
Selecting groups of features instead of individual variables.
Multi-task learning.
Optimization for Dictionary Learning

min_{α∈R^(p×n), D∈C} Σⁿᵢ₌₁ (1/2)‖xᵢ − Dαᵢ‖²₂ + λ‖αᵢ‖₁

C ≜ {D ∈ R^(m×p) s.t. ∀j = 1, . . . , p, ‖dⱼ‖₂ ≤ 1}.

Classical optimization alternates between D and α.
Good results, but slow!
Instead, use online learning [Mairal et al., 2009a].
Optimization for Dictionary Learning
Inpainting a 12-Mpixel photograph
Matrix Factorization Problems and Dictionary Learning

min_{α∈R^(p×n), D∈C} Σⁿᵢ₌₁ (1/2)‖xᵢ − Dαᵢ‖²₂ + λ‖αᵢ‖₁

can be rewritten

min_{α∈R^(p×n), D∈C} (1/2)‖X − Dα‖²_F + λ‖α‖₁,

where X = [x₁, . . . , xₙ] and α = [α₁, . . . , αₙ].
Matrix Factorization Problems and Dictionary Learning
PCA

min_{α∈R^(p×n), D∈R^(m×p)} ‖X − Dα‖²_F,

with the additional constraints that D is orthonormal and αᵀ is orthogonal.
D = [d₁, . . . , d_p] are the principal components.
Matrix Factorization Problems and Dictionary Learning
Hard clustering

min_{α∈R^(p×n), D∈R^(m×p)} ‖X − Dα‖²_F,

with the additional constraints that α is binary and its columns sum to one.
Matrix Factorization Problems and Dictionary Learning
Soft clustering

min_{α∈R^(p×n), D∈R^(m×p)} ‖X − Dα‖²_F,

with the additional constraints that the columns of α sum to one.
Matrix Factorization Problems and Dictionary Learning
Non-negative matrix factorization [Lee and Seung, 2001]

min_{α∈R^(p×n), D∈R^(m×p)} ‖X − Dα‖²_F,

with the additional constraints that the entries of D and α are non-negative.
Matrix Factorization Problems and Dictionary Learning
NMF + sparsity?

min_{α∈R^(p×n), D∈R^(m×p)} ‖X − Dα‖²_F + λ‖α‖₁

with the additional constraints that the entries of D and α are non-negative.

Most of these formulations can be addressed with the same types of algorithms.
Matrix Factorization Problems and Dictionary Learning
Natural Patches
(a) PCA (b) NNMF (c) DL

Matrix Factorization Problems and Dictionary Learning
Faces
(d) PCA (e) NNMF (f) DL
Important messages
The ℓ1-norm induces sparsity and shrinks the coefficients (soft-thresholding).
The regularization path of the Lasso is piecewise linear.
Sparsity can be induced at the group level.
Learning the dictionary is simple, fast, and scalable.
Dictionary learning is related to several matrix factorization problems.
The SPAMS software is available for all of this: www.di.ens.fr/willow/SPAMS/.
Next topics: Computer Vision
Intriguing results on the use of dictionary learning for bags of words.
Modelling the local appearance of image patches.
Learning Codebooks for Image Classification
Idea
Replacing Vector Quantization by Learned Dictionaries!
unsupervised: [Yang et al., 2009]
supervised: [Boureau et al., 2010, Yang et al., 2010]
Learning Codebooks for Image Classification

Let an image be represented by a set of low-level descriptors xᵢ at N locations identified with their indices i = 1, . . . , N.

hard-quantization:
xᵢ ≈ Dαᵢ, αᵢ ∈ {0, 1}ᵖ and Σᵖⱼ₌₁ αᵢ[j] = 1

soft-quantization:
αᵢ[j] = e^(−β‖xᵢ−dⱼ‖²₂) / Σᵖₖ₌₁ e^(−β‖xᵢ−dₖ‖²₂)

sparse coding:
xᵢ ≈ Dαᵢ, αᵢ = argmin_α (1/2)‖xᵢ − Dα‖²₂ + λ‖α‖₁
Learning Codebooks for Image Classification
Table from Boureau et al. [2010]
Yang et al. [2009] won the PASCAL VOC’09 challenge using this kind of technique.
Learning dictionaries with a discriminative cost function

Idea:
Let us consider two sets S−, S+ of signals representing two different classes. Each set should admit a dictionary best adapted to its reconstruction.

Classification procedure for a signal x ∈ R^n:

min(R⋆(x, D−), R⋆(x, D+))

where R⋆(x, D) = min_{α∈R^p} ‖x − Dα‖²₂ s.t. ‖α‖₀ ≤ L.

“Reconstructive” training:
min_{D−} Σ_{i∈S−} R⋆(xᵢ, D−)
min_{D+} Σ_{i∈S+} R⋆(xᵢ, D+)

[Grosse et al., 2007], [Huang and Aviyente, 2006], [Sprechmann et al., 2010] for unsupervised clustering (CVPR ’10)
Learning dictionaries with a discriminative cost function

“Discriminative” training [Mairal, Bach, Ponce, Sapiro, and Zisserman, 2008a]

min_{D−,D+} Σᵢ C(λzᵢ(R⋆(xᵢ, D−) − R⋆(xᵢ, D+))),

where zᵢ ∈ {−1, +1} is the label of xᵢ and C is a logistic regression function.
Learning dictionaries with a discriminative cost function
Examples of dictionaries
Top: reconstructive; Bottom: discriminative; Left: Bicycle; Right: Background.

Learning dictionaries with a discriminative cost function
Texture segmentation

Learning dictionaries with a discriminative cost function
Pixelwise classification

Learning dictionaries with a discriminative cost function
Weakly-supervised pixel classification
Application to edge detection and classification
[Mairal, Leordeanu, Bach, Hebert, and Ponce, 2008c]
Good edges / Bad edges

Application to edge detection and classification
Berkeley segmentation benchmark
Figures: raw edge detection on the right.
Application to edge detection and classification
Contour-based classifier: [Leordeanu, Hebert, and Sukthankar, 2007]
Is there a bike, a motorbike, a car, or a person in this image?
Application to edge detection and classification
Performance gain due to the prefiltering

Ours + [Leordeanu ’07]: 96.8% | [Leordeanu ’07]: 89.4% | [Winn ’05]: 76.9%

Recognition rates for the same experiment as [Winn et al., 2005] on VOC 2005.

Category | Ours + [Leordeanu ’07] | [Leordeanu ’07]
Aeroplane | 71.9% | 61.9%
Boat | 67.1% | 56.4%
Cat | 82.6% | 53.4%
Cow | 68.7% | 59.2%
Horse | 76.0% | 67.0%
Motorbike | 80.6% | 73.6%
Sheep | 72.9% | 58.4%
Tvmonitor | 87.7% | 83.8%
Average | 75.9% | 64.2%

Recognition performance at equal error rate for 8 classes on a subset of images from Pascal 07.
Digital Art Authentication
Data courtesy of Hugues, Graham, and Rockmore [2009]
Figures: authentic and fake examples.
Important messages
Learned dictionaries are well adapted to model the local appearance of images and edges.
They can be used to learn dictionaries of SIFT features.
Next topics
Optimization for solving sparse decomposition problems
Optimization for dictionary learning
Recall: The Sparse Decomposition Problem

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λψ(α)

where the first term is the data-fitting term and ψ a sparsity-inducing regularization.

ψ induces sparsity in α. It can be
the ℓ0 “pseudo-norm”: ‖α‖₀ ≜ #{i s.t. α[i] ≠ 0} (NP-hard),
the ℓ1 norm: ‖α‖₁ ≜ Σᵖᵢ₌₁ |α[i]| (convex),
. . .

This is a selection problem.
Finding your way in the sparse coding literature. . .
. . . is not easy. The literature is vast, redundant, sometimes confusing, and many papers are claiming victory. . .
The main classes of methods are
greedy procedures [Mallat and Zhang, 1993], [Weisberg, 1980]
homotopy [Osborne et al., 2000], [Efron et al., 2004], [Markowitz, 1956]
soft-thresholding based methods [Fu, 1998], [Daubechies et al., 2004], [Friedman et al., 2007], [Nesterov, 2007], [Beck and Teboulle, 2009], . . .
reweighted-ℓ2 methods [Daubechies et al., 2009], . . .
active-set methods [Roth and Fischer, 2008]
. . .
Matching Pursuit: geometric illustration
Figures: successive steps of matching pursuit in R³ with atoms d₁, d₂, d₃. At each step, the residual r is projected onto the most correlated atom and updated, giving α = (0, 0, 0) → (0, 0, 0.75) → (0, 0.24, 0.75) → (0, 0.24, 0.65).
Matching Pursuit

min_{α∈R^p} ‖x − Dα‖²₂ s.t. ‖α‖₀ ≤ L  (with residual r = x − Dα)

1: α ← 0
2: r ← x (residual)
3: while ‖α‖₀ < L do
4:   Select the atom with maximum correlation with the residual
       ı̂ ← argmax_{i=1,...,p} |dᵢᵀr|
5:   Update the residual and the coefficients
       α[ı̂] ← α[ı̂] + dı̂ᵀr
       r ← r − (dı̂ᵀr) dı̂
6: end while
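The pseudocode above transcribes almost line for line into NumPy. An illustrative sketch (it runs a fixed number of iterations rather than testing ‖α‖₀ < L, and assumes the columns of D have unit norm):

```python
import numpy as np

def matching_pursuit(x, D, L):
    """Greedy MP: at each step, pick the atom most correlated with the
    residual and update that single coefficient and the residual."""
    alpha = np.zeros(D.shape[1])
    r = x.copy()
    for _ in range(L):
        corr = D.T @ r                     # correlations d_i^T r
        i = int(np.argmax(np.abs(corr)))   # most correlated atom
        alpha[i] += corr[i]
        r -= corr[i] * D[:, i]
    return alpha, r
```

With an orthonormal dictionary, MP recovers an L-sparse signal exactly in L steps.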
Orthogonal Matching Pursuit: geometric illustration
Figures: successive steps of OMP in R³. At each step, the active set Γ grows (∅ → {3} → {3, 2}) and x is orthogonally projected onto the span of the active atoms, giving α = (0, 0, 0) → (0, 0, 0.75) → (0, 0.29, 0.63).
Orthogonal Matching Pursuit

min_{α∈R^p} ‖x − Dα‖²₂ s.t. ‖α‖₀ ≤ L

1: Γ = ∅
2: for iter = 1, . . . , L do
3:   Select the atom which most reduces the objective
       ı̂ ← argmin_{i∈Γᶜ} { min_{α′} ‖x − D_{Γ∪{i}} α′‖²₂ }
4:   Update the active set: Γ ← Γ ∪ {ı̂}
5:   Update the residual (orthogonal projection)
       r ← (I − D_Γ(D_ΓᵀD_Γ)⁻¹D_Γᵀ) x
6:   Update the coefficients
       α_Γ ← (D_ΓᵀD_Γ)⁻¹D_Γᵀ x
7: end for
Orthogonal Matching Pursuit

Contrary to MP, an atom can only be selected once with OMP. It is, however, more difficult to implement efficiently. The keys to a good implementation in the case of a large number of signals are:
Precompute the Gram matrix G = DᵀD once and for all,
Maintain the computation of Dᵀr for each signal,
Maintain a Cholesky decomposition of (D_ΓᵀD_Γ)⁻¹ for each signal.

The total complexity for decomposing n L-sparse signals of size m with a dictionary of size p is

O(p²m) [Gram matrix] + O(nL³) [Cholesky] + O(n(pm + pL²)) [Dᵀr] = O(np(m + L²))

It is also possible to use the matrix inversion lemma instead of a Cholesky decomposition (same complexity, but less numerical stability).
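For a single signal, a plain NumPy sketch makes the structure of OMP clear. This illustrative version re-solves the least-squares subproblem with `lstsq` at each step instead of maintaining the Cholesky factorization described above, and selects atoms by maximal correlation with the residual (equivalent to the objective-reduction rule for unit-norm atoms):

```python
import numpy as np

def omp(x, D, L):
    """Orthogonal Matching Pursuit: grow an active set greedily, then
    re-fit ALL active coefficients by least squares at each step."""
    alpha = np.zeros(D.shape[1])
    active = []
    r = x.copy()
    for _ in range(L):
        i = int(np.argmax(np.abs(D.T @ r)))
        if i in active:          # no atom is selected twice
            break
        active.append(i)
        Dg = D[:, active]
        coef, *_ = np.linalg.lstsq(Dg, x, rcond=None)  # orthogonal projection
        alpha[:] = 0.0
        alpha[active] = coef
        r = x - Dg @ coef
    return alpha
```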
Example with the software SPAMS
Software available at http://www.di.ens.fr/willow/SPAMS/
>> I=double(imread(’data/lena.eps’))/255;
>> %extract all patches of I
>> X=im2col(I,[8 8],’sliding’);
>> %load a dictionary of size 64 x 256
>> D=load(’dict.mat’);
>>
>> %set the sparsity parameter L to 10
>> param.L=10;
>> alpha=mexOMP(X,D,param);
On an 8-core 2.83GHz machine: 230,000 signals processed per second!
Optimality conditions of the Lasso
Nonsmooth optimization
Directional derivatives and subgradients are useful tools for studying ℓ1-decomposition problems:

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁

In this tutorial, we use directional derivatives to derive simple optimality conditions for the Lasso.
For more information on convex analysis and nonsmooth optimization, see the following books: [Boyd and Vandenberghe, 2004], [Nocedal and Wright, 2006], [Borwein and Lewis, 2006], [Bonnans et al., 2006], [Bertsekas, 1999].
Optimality conditions of the Lasso
Directional derivatives
Directional derivative in the direction u at α:

∇f(α, u) = lim_{t→0⁺} (f(α + tu) − f(α)) / t

Main idea: in nonsmooth situations, one may need to look at all directions u and not simply p independent ones!
Proposition 1: if f is differentiable at α, ∇f(α, u) = ∇f(α)ᵀu.
Proposition 2: α is optimal iff for all u in R^p, ∇f(α, u) ≥ 0.
Optimality conditions of the Lasso

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ‖α‖₁

α⋆ is optimal iff for all u in R^p, ∇f(α⋆, u) ≥ 0—that is,

−uᵀDᵀ(x − Dα⋆) + λ Σ_{i, α⋆[i]≠0} sign(α⋆[i]) u[i] + λ Σ_{i, α⋆[i]=0} |u[i]| ≥ 0,

which is equivalent to the following conditions:

∀i = 1, . . . , p:  |dᵢᵀ(x − Dα⋆)| ≤ λ if α⋆[i] = 0;  dᵢᵀ(x − Dα⋆) = λ sign(α⋆[i]) if α⋆[i] ≠ 0.
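These conditions translate directly into a numerical certificate of optimality. An illustrative NumPy sketch (the function name and tolerance handling are ours):

```python
import numpy as np

def check_lasso_optimality(x, D, alpha, lam, tol=1e-6):
    """Check the Lasso optimality conditions:
    |d_i^T (x - D alpha)| <= lam                 if alpha[i] == 0,
    d_i^T (x - D alpha) == lam * sign(alpha[i])  otherwise."""
    g = D.T @ (x - D @ alpha)                # correlations with the residual
    zero = np.abs(alpha) <= tol
    ok_zero = np.all(np.abs(g[zero]) <= lam + tol)
    ok_active = np.all(np.abs(g[~zero] - lam * np.sign(alpha[~zero])) <= tol)
    return bool(ok_zero and ok_active)
```

In 1D with D = [1], the certified solution is exactly the soft-thresholding of x.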
Homotopy
A homotopy method provides a set of solutions indexed by a parameter.
The regularization path (λ, α⋆(λ)), for instance!
It can be useful when the path has some “nice” properties (piecewise linear, piecewise quadratic).
LARS [Efron et al., 2004] starts from a trivial solution and follows the regularization path of the Lasso, which is piecewise linear.
Homotopy, LARS [Osborne et al., 2000], [Efron et al., 2004]

∀i = 1, . . . , p:  |dᵢᵀ(x − Dα⋆)| ≤ λ if α⋆[i] = 0;  dᵢᵀ(x − Dα⋆) = λ sign(α⋆[i]) if α⋆[i] ≠ 0.  (1)

The regularization path is piecewise linear:

D_Γᵀ(x − D_Γ α⋆_Γ) = λ sign(α⋆_Γ)
α⋆_Γ(λ) = (D_ΓᵀD_Γ)⁻¹(D_Γᵀx − λ sign(α⋆_Γ)) = A + λB

A simple interpretation of LARS:
Start from the trivial solution (λ = ‖Dᵀx‖_∞, α⋆(λ) = 0).
Maintain the computations of |dᵢᵀ(x − Dα⋆(λ))| for all i.
Maintain the computation of the current direction B.
Follow the path by reducing λ until the next kink.
Example with the software SPAMShttp://www.di.ens.fr/willow/SPAMS/
>> I=double(imread(’data/lena.eps’))/255;
>> %extract all patches of I
>> X=normalize(im2col(I,[8 8],’sliding’));
>> %load a dictionary of size 64 x 256
>> D=load(’dict.mat’);
>>
>> %set the sparsity parameter lambda to 0.15
>> param.lambda=0.15;
>> alpha=mexLasso(X,D,param);
On an 8-core 2.83GHz machine: 77,000 signals processed per second!
Note that it can also solve constrained versions of the problem. The complexity is more or less the same as OMP and uses the same tricks (Cholesky decomposition).
Coordinate Descent

Coordinate descent + nonsmooth objective: WARNING: not convergent in general.

Here, the problem is equivalent to a convex smooth optimization problem with separable constraints:

min_{α₊,α₋} (1/2)‖x − D₊α₊ + D₋α₋‖²₂ + λα₊ᵀ1 + λα₋ᵀ1  s.t. α₋, α₊ ≥ 0.

For this specific problem, coordinate descent is convergent.
Supposing ‖dᵢ‖₂ = 1, updating coordinate i:

α[i] ← argmin_β (1/2)‖x − Σ_{j≠i} α[j]dⱼ − βdᵢ‖²₂ + λ|β|
     ← sign(dᵢᵀr)(|dᵢᵀr| − λ)₊,  where r = x − Σ_{j≠i} α[j]dⱼ

⇒ soft-thresholding!
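The coordinate update above can be sketched in a few lines of NumPy. An illustrative version (it recomputes the residual naively for clarity instead of maintaining it, and assumes unit-norm columns):

```python
import numpy as np

def lasso_cd(x, D, lam, n_iter=100):
    """Cyclic coordinate descent for the Lasso: each coordinate update
    is a closed-form soft-threshold of d_i^T r (columns unit-norm)."""
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        for i in range(D.shape[1]):
            r = x - D @ alpha + alpha[i] * D[:, i]  # residual without atom i
            c = D[:, i] @ r
            alpha[i] = np.sign(c) * max(abs(c) - lam, 0.0)
    return alpha
```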
Example with the software SPAMShttp://www.di.ens.fr/willow/SPAMS/
>> I=double(imread(’data/lena.eps’))/255;
>> %extract all patches of I
>> X=normalize(im2col(I,[8 8],’sliding’));
>> %load a dictionary of size 64 x 256
>> D=load(’dict.mat’);
>>
>> %set the sparsity parameter lambda to 0.15
>> param.lambda=0.15;
>> param.tol=1e-2;
>> param.itermax=200;
>> alpha=mexCD(X,D,param);
On an 8-core 2.83GHz machine: 93,000 signals processed per second!
First-order/proximal methods

min_{α∈R^p} f(α) + λψ(α)

f is strictly convex and continuously differentiable with a Lipschitz gradient.

Generalize the idea of gradient descent:

α_{k+1} ← argmin_{α∈R^p} f(α_k) + ∇f(α_k)ᵀ(α − α_k) + (L/2)‖α − α_k‖²₂ + λψ(α)
        ← argmin_{α∈R^p} (1/2)‖α − (α_k − (1/L)∇f(α_k))‖²₂ + (λ/L)ψ(α)

When λ = 0, this is equivalent to a classical gradient descent step.
First-order/proximal methods

They require solving the proximal operator efficiently:

min_{α∈R^p} (1/2)‖u − α‖²₂ + λψ(α)

For the ℓ1-norm, this amounts to a soft-thresholding:

α⋆[i] = sign(u[i])(|u[i]| − λ)₊.

There exist accelerated versions based on Nesterov's optimal first-order method (gradient method with “extrapolation”) [Beck and Teboulle, 2009, Nesterov, 2007, 1983], suited for large-scale experiments.
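The resulting scheme (ISTA) is short to implement for the Lasso. An illustrative NumPy sketch, using the spectral norm of D to bound the Lipschitz constant of the smooth part:

```python
import numpy as np

def ista(x, D, lam, n_iter=200):
    """Proximal gradient (ISTA) for the Lasso: a gradient step on the
    smooth part, then the l1 proximal operator (soft-thresholding)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)       # gradient of 0.5*||x - D a||^2
        z = alpha - grad / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return alpha
```

The accelerated (FISTA) variant adds the extrapolation step mentioned above on top of this loop.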
Optimization for Grouped Sparsity

The formulation:

min_{α∈R^p} (1/2)‖x − Dα‖²₂ + λ Σ_{g∈𝒢} ‖α_g‖_q

where the first term is the data-fitting term and the second a group-sparsity-inducing regularization.

The main classes of algorithms for solving grouped-sparsity problems are
Greedy approaches
Block-coordinate descent
Proximal methods
Optimization for Grouped Sparsity

The proximal operator:

min_{α∈R^p} (1/2)‖u − α‖²₂ + λ Σ_{g∈𝒢} ‖α_g‖_q

For q = 2:
α⋆_g = (u_g/‖u_g‖₂)(‖u_g‖₂ − λ)₊, ∀g ∈ 𝒢

For q = ∞:
α⋆_g = u_g − Π_{‖·‖₁≤λ}[u_g], ∀g ∈ 𝒢

These formulas generalize soft-thresholding to groups of variables. They are used in block-coordinate descent and proximal algorithms.
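The q = 2 formula can be sketched directly. An illustrative NumPy version, with `groups` a list of index lists assumed to partition the variables:

```python
import numpy as np

def group_soft_threshold(u, groups, lam):
    """Proximal operator of lam * sum_g ||alpha_g||_2 (q = 2 case):
    each group is shrunk toward zero by lam, or zeroed entirely."""
    alpha = np.zeros_like(u)
    for g in groups:
        ng = np.linalg.norm(u[g])
        if ng > lam:                         # otherwise the whole group is zeroed
            alpha[g] = (1.0 - lam / ng) * u[g]
    return alpha
```

With singleton groups, this reduces to the scalar soft-thresholding used for the ℓ1-norm.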
Reweighted ℓ2

Let us start from something simple:

a² − 2ab + b² ≥ 0.

Then, for b > 0,

a ≤ (1/2)(a²/b + b), with equality iff a = b,

and

‖α‖₁ = min_{ηⱼ≥0} (1/2) Σᵖⱼ₌₁ (α[j]²/ηⱼ + ηⱼ).

The formulation becomes

min_{α, ηⱼ≥ε} (1/2)‖x − Dα‖²₂ + (λ/2) Σᵖⱼ₌₁ (α[j]²/ηⱼ + ηⱼ).
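Alternating a closed-form solve in α (for fixed η, the problem is a weighted ridge regression) with the variational update ηⱼ = |α[j]| gives a simple scheme. An illustrative NumPy sketch with the ε-floor on η:

```python
import numpy as np

def reweighted_l2_lasso(x, D, lam, n_iter=50, eps=1e-6):
    """Reweighted-l2 scheme for the Lasso: for fixed eta, alpha solves
    (D^T D + lam * diag(1/eta)) alpha = D^T x; then eta_j = |alpha_j|."""
    eta = np.ones(D.shape[1])
    DtD, Dtx = D.T @ D, D.T @ x
    for _ in range(n_iter):
        W = np.diag(1.0 / np.maximum(eta, eps))  # eps-floor avoids division by 0
        alpha = np.linalg.solve(DtD + lam * W, Dtx)
        eta = np.abs(alpha)
    return alpha
```

On a simple orthonormal example, the iterates converge to the soft-thresholded solution.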
Important messages
Greedy methods directly address the NP-hard ℓ0-decomposition problem.
Homotopy methods can be extremely efficient for small or medium-sized problems, or when the solution is very sparse.
Coordinate descent generally provides a solution of small/medium precision quickly, but gets slower when there is a lot of correlation in the dictionary.
First-order methods are very attractive in the large-scale setting.
Other good alternatives exist: active-set methods, reweighted-ℓ2 methods, stochastic variants, variants of OMP, . . .
Optimization for Dictionary Learning

min_{α∈R^(p×n), D∈C} Σⁿᵢ₌₁ (1/2)‖xᵢ − Dαᵢ‖²₂ + λ‖αᵢ‖₁

C ≜ {D ∈ R^(m×p) s.t. ∀j = 1, . . . , p, ‖dⱼ‖₂ ≤ 1}.

Classical optimization alternates between D and α.
Good results, but very slow!
Optimization for Dictionary Learning[Mairal, Bach, Ponce, and Sapiro, 2009a]
Classical formulation of dictionary learning
minD∈C
fn(D) = minD∈C
1
n
n∑
i=1
l(xi ,D),
where
l(x,D)△
= minα∈Rp
1
2‖x−Dα‖22 + λ‖α‖1.
Which formulation are we interested in?
minD∈C
{
f (D) = Ex [l(x,D)] ≈ limn→+∞
1
n
n∑
i=1
l(xi ,D)}
[Bottou and Bousquet, 2008]: Online learning can
handle potentially infinite or dynamic datasets,
be dramatically faster than batch algorithms.
Optimization for Dictionary Learning
Require: D_0 ∈ R^{m×p} (initial dictionary); λ ∈ R
1: A_0 ← 0, B_0 ← 0.
2: for t = 1, . . . , T do
3:   Draw x_t.
4:   Sparse coding:
       α_t ← argmin_{α∈R^p} (1/2)‖x_t − D_{t−1}α‖₂² + λ‖α‖₁.
5:   Aggregate sufficient statistics:
       A_t ← A_{t−1} + α_t α_tᵀ,  B_t ← B_{t−1} + x_t α_tᵀ.
6:   Dictionary update (block-coordinate descent):
       D_t ← argmin_{D∈C} (1/t) ∑_{i=1}^t ( (1/2)‖x_i − Dα_i‖₂² + λ‖α_i‖₁ ).
7: end for
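A compact NumPy sketch of this loop, with ISTA standing in for the sparse-coding solver and one pass of block-coordinate descent over the dictionary columns (the column update rule follows Mairal et al., 2009a; function names are illustrative):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def lasso_ista(x, D, lam, n_iter=200):
    """Sparse-coding step: min_alpha 0.5||x - D alpha||^2 + lam||alpha||_1."""
    L = max(np.linalg.norm(D, 2) ** 2, 1e-12)   # Lipschitz constant of gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        alpha = soft_threshold(alpha - D.T @ (D @ alpha - x) / L, lam / L)
    return alpha

def online_dictionary_learning(X, p, lam, n_epochs=1, seed=0):
    """Online loop: draw x_t, sparse-code it, accumulate A_t, B_t, update D."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)              # start inside the constraint set C
    A = np.zeros((p, p))                        # A_t = sum alpha_i alpha_i^T
    B = np.zeros((m, p))                        # B_t = sum x_i alpha_i^T
    for _ in range(n_epochs):
        for t in rng.permutation(n):
            x = X[:, t]
            alpha = lasso_ista(x, D, lam)
            A += np.outer(alpha, alpha)
            B += np.outer(x, alpha)
            for j in range(p):                  # one block-coordinate-descent pass
                if A[j, j] > 1e-10:
                    # u_j = d_j + (b_j - D a_j) / A_jj, then project onto C
                    u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                    D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D
```

The key point is that the dictionary update in step 6 needs only A_t and B_t, not the past signals and codes themselves, so the memory cost is independent of t.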
Optimization for Dictionary Learning
Which guarantees do we have?
Under a few reasonable assumptions,
we build a surrogate function f_t of the expected cost f verifying

lim_{t→+∞} f_t(D_t) − f(D_t) = 0,

and D_t is asymptotically close to a stationary point.
Extensions (all implemented in SPAMS)
non-negative matrix decompositions.
sparse PCA (sparse dictionaries).
fused-lasso regularizations (piecewise-constant dictionaries).
Optimization for Dictionary Learning
Experimental results, batch vs online
m = 8 × 8, p = 256
Optimization for Dictionary Learning
Experimental results, batch vs online
m = 12 × 12 × 3, p = 512
References I
M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Mass., 1999.

J. F. Bonnans, J. C. Gilbert, C. Lemarechal, and C. A. Sagastizabal. Numerical Optimization: Theoretical and Practical Aspects. Springer-Verlag, New York, 2006.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006.

L. Bottou and O. Bousquet. The trade-offs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 161–168. MIT Press, Cambridge, MA, 2008.

Y-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
References II

S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

E. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, 2006.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57:1413–1457, 2004.

I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk. Iteratively re-weighted least squares minimization for sparse recovery. Commun. Pure Appl. Math., 2009.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.
References III

K. Engan, S. O. Aase, and J. H. Husoy. Frame based signal compression using method of optimal directions (MOD). In Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, volume 4, 1999.

J. Friedman, T. Hastie, H. Holfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

W. J. Fu. Penalized regressions: The bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7:397–416, 1998.

R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In Proceedings of the Twenty-third Conference on Uncertainty in Artificial Intelligence, 2007.

A. Haar. Zur theorie der orthogonalen funktionensysteme. Mathematische Annalen, 69:331–371, 1910.

K. Huang and S. Aviyente. Sparse representation for signal classification. In Advances in Neural Information Processing Systems, Vancouver, Canada, December 2006.

J. M. Hugues, D. J. Graham, and D. N. Rockmore. Quantification of artistic style through sparse coding analysis in the drawings of Pieter Bruegel the Elder. Proceedings of the National Academy of Sciences, USA, 107(4):1279–1283, 2009.
References IV

D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.

H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 801–808. MIT Press, Cambridge, MA, 2007.

M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond local appearance: Category recognition from pairwise interactions of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008a.

J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, January 2008b.

J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. In Proceedings of the European Conference on Computer Vision (ECCV), 2008c.
References V

J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modelling and Simulation, 7(1):214–241, April 2008d.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009b.

S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, New York, September 1999.

S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

H. M. Markowitz. The optimization of a quadratic function subject to linear constraints. Naval Research Logistics Quarterly, 3:111–133, 1956.

Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl., 27:372–376, 1983.
References VI

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.

J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the Lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.

M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27–36, 2009.

S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar. Collaborative hierarchical sparse modeling. Technical report, 2010. Preprint arXiv:1003.0400v1.
References VII

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

S. Weisberg. Applied Linear Regression. Wiley, New York, 1980.

J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005.

J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

J. Yang, K. Yu, and T. Huang. Supervised translation-invariant sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.