Learning Sparse Representations Gabriel Peyré www.numerical-tours.com


Slides of the keynote presentation at the conference ADA7, Cargese, France, 14-18 May 2012.


Page 1: Learning Sparse Representation

Learning Sparse Representations

Gabriel Peyré

www.numerical-tours.com

Pages 2-5: Image Priors

Mathematical image priors: compression, denoising, super-resolution, . . .

Smooth images, Sobolev prior: ∫ ||∇f||²  (low-pass Fourier coefficients).

Piecewise smooth images, total variation prior: ∫ ||∇f||  (sparse wavelet coefficients).

→ Learning the prior from exemplars?

Page 6: Learning Sparse Representation

Overview

• Sparsity and Redundancy

• Dictionary Learning

• Extensions

• Task-driven Learning

• Texture Synthesis

Pages 7-10: Image Representation

Dictionary D = {d_m}_{m=0}^{Q-1} of atoms d_m ∈ ℝ^N.

Image decomposition: f = Σ_{m=0}^{Q-1} x_m d_m = Dx

Image approximation: f ≈ Dx

Orthogonal dictionary (N = Q): x_m = ⟨f, d_m⟩.

Redundant dictionary (N ≤ Q): x is not unique. Examples: translation-invariant wavelets, curvelets, . . .
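The non-uniqueness of x for a redundant dictionary can be seen directly: any null-space component of D can be added to a representation without changing f = Dx. A minimal numpy sketch, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 8, 16                        # signal dimension N < number of atoms Q
D = rng.standard_normal((N, Q))
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms d_m

f = rng.standard_normal(N)

# One exact representation: minimum l2-norm solution via the pseudo-inverse.
x_ls = np.linalg.pinv(D) @ f

# Another one: add any element of the null space of D.
null_basis = np.linalg.svd(D)[2][N:].T       # Q x (Q - N) null-space basis
x_other = x_ls + null_basis @ rng.standard_normal(Q - N)

# Both satisfy f = D x, with different coefficients.
print(np.allclose(D @ x_ls, f), np.allclose(D @ x_other, f))
```

A sparsity prior is one way to single out a preferred representation among this infinite family.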

Pages 11-13: Sparsity

Decomposition: f = Σ_{m=0}^{Q-1} x_m d_m = Dx  (example: wavelet transform of an image f, coefficients x).

Sparsity: most x_m are small.

Ideal sparsity: most x_m are zero. J₀(x) = |{m : x_m ≠ 0}|

Approximate sparsity (compressibility): ||f - Dx|| is small with J₀(x) ≤ M.

Pages 14-17: Sparse Coding

Redundant dictionary D = {d_m}_{m=0}^{Q-1}, Q ≥ N → non-unique representation f = Dx.

Sparsest decomposition: min_{f=Dx} J₀(x)

Sparsest approximation: min_x ½||f - Dx||² + λJ₀(x)

Equivalent constrained forms: min_{J₀(x)≤M} ||f - Dx||  and  min_{||f-Dx||≤ε} J₀(x), with the correspondence λ ↔ M ↔ ε.

Orthogonal basis D: solved by hard thresholding,
x_m = ⟨f, d_m⟩ if |⟨f, d_m⟩| ≥ √(2λ), and x_m = 0 otherwise,
i.e. pick the M largest coefficients in {⟨f, d_m⟩}_m.

General redundant dictionary: NP-hard.
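In the orthogonal case the ℓ⁰ penalized problem decouples over coefficients, so it reduces to keeping each inner product ⟨f, d_m⟩ whose magnitude exceeds √(2λ). A minimal sketch (function name illustrative):

```python
import numpy as np

def hard_threshold_coding(f, D, lam):
    """Solve min_x 0.5||f - Dx||^2 + lam*J0(x) for orthogonal D
    (D.T @ D = I): keep coefficients <f, d_m> above sqrt(2*lam)."""
    c = D.T @ f                                      # inner products <f, d_m>
    return np.where(np.abs(c) >= np.sqrt(2 * lam), c, 0.0)

# Example with the identity basis; threshold = sqrt(2 * 0.5) = 1.
f = np.array([3.0, 0.1, -2.0, 0.05])
x = hard_threshold_coding(f, np.eye(4), lam=0.5)
print(x)                                             # -> [ 3.  0. -2.  0.]
```

For a general redundant D no such coordinate-wise shortcut exists, which is why the exact problem is NP-hard.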

Pages 18-20: Convex Relaxation: L1 Prior

Image with 2 pixels (coefficients on atoms d₀, d₁): J₀(x) = |{m : x_m ≠ 0}|.
J₀(x) = 0 ⇔ null image;  J₀(x) = 1 ⇔ sparse image;  J₀(x) = 2 ⇔ non-sparse image.

ℓq priors: J_q(x) = Σ_m |x_m|^q  (level sets shown for q = 0, 1/2, 1, 3/2, 2; convex for q ≥ 1).

Sparse ℓ¹ prior: J₁(x) = ||x||₁ = Σ_m |x_m|.

Pages 21-22: Inverse Problems

Denoising/approximation: Φ = Id.

Examples: inpainting, super-resolution, compressed sensing.

Pages 23-25: Regularized Inversion

Denoising/compression: y = f₀ + w ∈ ℝ^N.
Sparse approximation: f* = Dx*, where
x* ∈ argmin_x ½||y - Dx||² + λ||x||₁  (fidelity + ℓ¹ regularization).

Inverse problems: y = Φf₀ + w ∈ ℝ^P. Replace D by ΦD:
x* ∈ argmin_x ½||y - ΦDx||² + λ||x||₁

Numerical solvers: proximal splitting schemes. → www.numerical-tours.com
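One of the simplest proximal splitting schemes for this problem is ISTA (iterative soft thresholding), alternating a gradient step on the fidelity with the ℓ¹ prox. A hedged sketch, with A standing for ΦD and all names and sizes illustrative:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5||y - Ax||^2 + lam*||x||_1 by proximal gradient descent."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Tiny denoising example: Phi = Id, D orthogonal, sparse ground truth.
rng = np.random.default_rng(0)
D = np.linalg.qr(rng.standard_normal((16, 16)))[0]
x0 = np.zeros(16); x0[[2, 7]] = [3.0, -2.0]
y = D @ x0 + 0.05 * rng.standard_normal(16)
x = ista(D, y, lam=0.1)
print(np.flatnonzero(np.abs(x) > 0.5))     # recovers the support: [2 7]
```

Accelerated variants (FISTA) and other splittings are covered in the Numerical Tours referenced on the slide.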

Page 26: Learning Sparse Representation

Inpainting Results

Page 27: Learning Sparse Representation

Overview

• Sparsity and Redundancy

• Dictionary Learning

• Extensions

• Task-driven Learning

• Texture Synthesis

Pages 28-32: Dictionary Learning: MAP Energy

Set of (noisy) exemplars {y_k}_k.

Dictionary learning: min_{D∈C} Σ_k min_{x_k} ½||y_k - Dx_k||² + λ||x_k||₁

Constraint: C = {D = (d_m)_m : ∀m, ||d_m|| ≤ 1}. (Otherwise D → +∞, X → 0.)

Matrix formulation: min_{X,D} f(X, D) = ½||Y - DX||² + λ||X||₁, with X ∈ ℝ^{Q×K} and D ∈ C ⊂ ℝ^{N×Q}.

→ Convex with respect to X, convex with respect to D, but non-convex with respect to (X, D): local minima.

Pages 33-36: Dictionary Learning: Algorithm

Initialize D, then alternate:

Step 1: ∀k, minimization on x_k: min_{x_k} ½||y_k - Dx_k||² + λ||x_k||₁ → convex sparse coding.

Step 2: minimization on D: min_{D∈C} ||Y - DX||² → convex constrained minimization, by projected gradient descent:
D^{(ℓ+1)} = Proj_C( D^{(ℓ)} - τ_ℓ (D^{(ℓ)}X - Y) X^T )

Convergence: toward a stationary point of f(X, D).
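The alternating scheme above can be sketched in a few lines of numpy: Step 1 sparse-codes all exemplars with ISTA, Step 2 updates D by projected gradient on ½||Y - DX||² with projection onto unit-norm atoms. This is a minimal sketch, not the tuned implementations of the cited works; sizes, step sizes, and iteration counts are illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(Y, D, lam, n_iter=100):
    """Step 1: ISTA on all columns of Y at once."""
    L = np.linalg.norm(D, 2) ** 2
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        X = soft_threshold(X - D.T @ (D @ X - Y) / L, lam / L)
    return X

def update_dict(Y, D, X, n_iter=50):
    """Step 2: projected gradient descent on ||Y - DX||^2 over C."""
    tau = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-10)
    for _ in range(n_iter):
        D = D - tau * (D @ X - Y) @ X.T           # gradient step
        norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
        D = D / norms                              # Proj_C: ||d_m|| <= 1
    return D

rng = np.random.default_rng(0)
N, Q, K = 8, 12, 200
Y = rng.standard_normal((N, K))                    # exemplars y_k as columns
D = rng.standard_normal((N, Q)); D /= np.linalg.norm(D, axis=0)

for _ in range(10):                                # alternate Steps 1 and 2
    X = sparse_code(Y, D, lam=0.1)
    D = update_dict(Y, D, X)
```

Each step decreases (or leaves unchanged) the joint energy f(X, D), which is what yields convergence to a stationary point rather than a global minimum.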

Pages 37-38: Patch-based Learning

Learning the dictionary D from exemplar patches y_k [Olshausen, Field 1997].

→ State-of-the-art denoising [Elad et al. 2006]
→ Sparse texture synthesis, inpainting [Peyré 2008]

Page 39: Learning Sparse Representation

D(k) = (dm)k�1m=0

PCA dimensionality reduction:⇥ k, min

D||Y �D(k)X||

Linear (PCA): Fourier-like atoms.

Comparison with PCA

RUBINSTEIN et al.: DICTIONARIES FOR SPARSE REPRESENTATION 3

Fig. 1. Left: A few 12 £ 12 DCT atoms. Right: The first 40 KLT atoms,trained using 12£ 12 image patches from Lena.

B. Non-Linear Revolution and Elements of Modern DictionaryDesign

In statistics research, the 1980’s saw the rise of a newpowerful approach known as robust statistics. Robust statisticsadvocates sparsity as a key for a wide range of recovery andanalysis tasks. The idea has its roots in classical Physics, andmore recently in Information Theory, and promotes simplicityand conciseness in guiding phenomena descriptions. Motivatedby these ideas, the 1980’s and 1990’s were characterizedby a search for sparser representations and more efficienttransforms.

Increasing sparsity required departure from the linear model,towards a more flexible non-linear formulation. In the non-linear case, each signal is allowed to use a different setof atoms from the dictionary in order to achieve the bestapproximation. Thus, the approximation process becomes

x ⇤X

n�IK(x)

cn�n , (5)

where IK(x) is an index set adapted to each signal individually(we refer the reader to [5], [7] for more information on thiswide topic).

The non-linear view paved the way to the design ofnewer, more efficient transforms. In the process, many ofthe fundamental concepts guiding modern dictionary designwere formed. Following the historic time line, we trace theemergence of the most important modern dictionary designconcepts, which are mostly formed during the last two decadesof the 20th century.

Localization: To achieve sparsity, transforms required betterlocalization. Atoms with concentrated supports allow moreflexible representations based on the local signal characteris-tics, and limit the effects of irregularities, which are observedto be the main source of large coefficients. In this spirit, oneof the first structures to be used was the Short Time FourierTransform (STFT) [8], which emerges as a natural extensionto the Fourier transform. In the STFT, the Fourier transform isapplied locally to (possibly overlapping) portions of the signal,revealing a time-frequency (or space-frequency) descriptionof the signal. An example of the STFT is the JPEG imagecompression algorithm [9], which is based on this concept.

During the 1980’s and 1990’s, the STFT was extensivelyresearched and generalized, becoming more known as theGabor transform, named in homage of Dennis Gabor, whofirst suggested the time-frequency decomposition back in1946 [10]. Gabor’s work was independently rediscovered in

1980 by Bastiaans [11] and Janssen [12], who studied thefundamental properties of the expansion.

A basic 1-D Gabor dictionary consists of windowed wave-forms

G =©

⇤n,m(x) = w(x� ⇥m)ei2⇥�nx™

n,m�Z ,

where w(·) is a low-pass window function localized at 0(typically a Gaussian), and � and ⇥ control the time andfrequency resolution of the transform. Much of the mathe-matical foundations of this transform were laid out during thelate 1980’s by Daubechies, Grossman and Meyer [13], [14]who studied the transform from the angle of frame theory,and by Feichtinger and Grochenig [15]–[17] who employed ageneralized group-theoretic point of view. Study of the discreteversion of the transform and its numerical implementationfollowed in the early 1990’s, with notable contributions byWexler and Raz [18] and by Qian and Chen [19].

In higher dimensions, more complex Gabor structures weredeveloped which add directionality, by varying the orientationof the sinusoidal waves. This structure gained substantialsupport from the work of Daugman [20], [21], who discoveredoriented Gabor-like patterns in simple-cell receptive fields inthe visual cortex. These results motivated the deployment ofthe transform to image processing tasks, led by works such asDaugman [22] and Porat and Zeevi [23]. Today, practical usesof the Gabor transform are mainly in analysis and detectiontasks, as a collection of directional filters. Figure 2 showssome examples of 2-D Gabor atoms of various orientationsand sizes.

Multi-Resolution: One of the most significant conceptualadvancements achieved in the 1980’s was the rise of multi-scale analysis. It was realized that natural signals, and imagesspecifically, exhibited meaningful structures over many scales,and could be analyzed and described particularly efficientlyby multi-scale constructions. One of the simplest and bestknown such structures is the Laplacian pyramid, introducedin 1984 by Burt and Adelson [24]. The Laplacian pyramidrepresents an image as a series of difference images, whereeach one corresponds to a different scale and roughly adifferent frequency band.

In the second half of the 1980’s, though, the signal process-ing community was particularly excited about the developmentof a new very powerful tool, known as wavelet analysis [5],[25], [26]. In a pioneering work from 1984, Grossman andMorlet [27] proposed a signal expansion over a series oftranslated and dilated versions of a single elementary function,taking the form

W =n

⇤n,m(x) = �n/2f(�nx� ⇥m)o

n,m�Z.

This simple idea captivated the signal processing and harmonicanalysis communities, and in a series of influential works byMeyer, Daubechies, Mallat and others [13], [14], [28]–[33],an extensive wavelet theory was formalized. The theory wasformulated for both the continuous and discrete domains, anda complete mathematical framework relating the two was putforth. A significant breakthrough came from Meyer’s work in1985 [28], who found that unlike the Gabor transform (and

RUBINSTEIN et al.: DICTIONARIES FOR SPARSE REPRESENTATION 3

Fig. 1. Left: A few 12 £ 12 DCT atoms. Right: The first 40 KLT atoms,trained using 12£ 12 image patches from Lena.

B. Non-Linear Revolution and Elements of Modern DictionaryDesign

In statistics research, the 1980’s saw the rise of a newpowerful approach known as robust statistics. Robust statisticsadvocates sparsity as a key for a wide range of recovery andanalysis tasks. The idea has its roots in classical Physics, andmore recently in Information Theory, and promotes simplicityand conciseness in guiding phenomena descriptions. Motivatedby these ideas, the 1980’s and 1990’s were characterizedby a search for sparser representations and more efficienttransforms.

Increasing sparsity required departure from the linear model,towards a more flexible non-linear formulation. In the non-linear case, each signal is allowed to use a different setof atoms from the dictionary in order to achieve the bestapproximation. Thus, the approximation process becomes

x ⇤X

n�IK(x)

cn�n , (5)

where IK(x) is an index set adapted to each signal individually(we refer the reader to [5], [7] for more information on thiswide topic).

The non-linear view paved the way to the design ofnewer, more efficient transforms. In the process, many ofthe fundamental concepts guiding modern dictionary designwere formed. Following the historic time line, we trace theemergence of the most important modern dictionary designconcepts, which are mostly formed during the last two decadesof the 20th century.

Localization: To achieve sparsity, transforms required betterlocalization. Atoms with concentrated supports allow moreflexible representations based on the local signal characteris-tics, and limit the effects of irregularities, which are observedto be the main source of large coefficients. In this spirit, oneof the first structures to be used was the Short Time FourierTransform (STFT) [8], which emerges as a natural extensionto the Fourier transform. In the STFT, the Fourier transform isapplied locally to (possibly overlapping) portions of the signal,revealing a time-frequency (or space-frequency) descriptionof the signal. An example of the STFT is the JPEG imagecompression algorithm [9], which is based on this concept.

During the 1980’s and 1990’s, the STFT was extensivelyresearched and generalized, becoming more known as theGabor transform, named in homage of Dennis Gabor, whofirst suggested the time-frequency decomposition back in1946 [10]. Gabor’s work was independently rediscovered in

1980 by Bastiaans [11] and Janssen [12], who studied thefundamental properties of the expansion.

A basic 1-D Gabor dictionary consists of windowed wave-forms

G =©

⇤n,m(x) = w(x� ⇥m)ei2⇥�nx™

n,m�Z ,

where w(·) is a low-pass window function localized at 0(typically a Gaussian), and � and ⇥ control the time andfrequency resolution of the transform. Much of the mathe-matical foundations of this transform were laid out during thelate 1980’s by Daubechies, Grossman and Meyer [13], [14]who studied the transform from the angle of frame theory,and by Feichtinger and Grochenig [15]–[17] who employed ageneralized group-theoretic point of view. Study of the discreteversion of the transform and its numerical implementationfollowed in the early 1990’s, with notable contributions byWexler and Raz [18] and by Qian and Chen [19].

In higher dimensions, more complex Gabor structures weredeveloped which add directionality, by varying the orientationof the sinusoidal waves. This structure gained substantialsupport from the work of Daugman [20], [21], who discoveredoriented Gabor-like patterns in simple-cell receptive fields inthe visual cortex. These results motivated the deployment ofthe transform to image processing tasks, led by works such asDaugman [22] and Porat and Zeevi [23]. Today, practical usesof the Gabor transform are mainly in analysis and detectiontasks, as a collection of directional filters. Figure 2 showssome examples of 2-D Gabor atoms of various orientationsand sizes.

Multi-Resolution: One of the most significant conceptualadvancements achieved in the 1980’s was the rise of multi-scale analysis. It was realized that natural signals, and imagesspecifically, exhibited meaningful structures over many scales,and could be analyzed and described particularly efficientlyby multi-scale constructions. One of the simplest and bestknown such structures is the Laplacian pyramid, introducedin 1984 by Burt and Adelson [24]. The Laplacian pyramidrepresents an image as a series of difference images, whereeach one corresponds to a different scale and roughly adifferent frequency band.

In the second half of the 1980’s, though, the signal process-ing community was particularly excited about the developmentof a new very powerful tool, known as wavelet analysis [5],[25], [26]. In a pioneering work from 1984, Grossman andMorlet [27] proposed a signal expansion over a series oftranslated and dilated versions of a single elementary function,taking the form

W =n

⇤n,m(x) = �n/2f(�nx� ⇥m)o

n,m�Z.

This simple idea captivated the signal processing and harmonicanalysis communities, and in a series of influential works byMeyer, Daubechies, Mallat and others [13], [14], [28]–[33],an extensive wavelet theory was formalized. The theory wasformulated for both the continuous and discrete domains, anda complete mathematical framework relating the two was putforth. A significant breakthrough came from Meyer’s work in1985 [28], who found that unlike the Gabor transform (and

DCT PCA

Page 40: Learning Sparse Representation

D(k) = (dm)k�1m=0

PCA dimensionality reduction:⇥ k, min

D||Y �D(k)X||

Linear (PCA): Fourier-like atoms.Sparse (learning): Gabor-like atoms.

Comparison with PCA

RUBINSTEIN et al.: DICTIONARIES FOR SPARSE REPRESENTATION 3

Fig. 1. Left: A few 12 £ 12 DCT atoms. Right: The first 40 KLT atoms,trained using 12£ 12 image patches from Lena.

B. Non-Linear Revolution and Elements of Modern DictionaryDesign

In statistics research, the 1980’s saw the rise of a newpowerful approach known as robust statistics. Robust statisticsadvocates sparsity as a key for a wide range of recovery andanalysis tasks. The idea has its roots in classical Physics, andmore recently in Information Theory, and promotes simplicityand conciseness in guiding phenomena descriptions. Motivatedby these ideas, the 1980’s and 1990’s were characterizedby a search for sparser representations and more efficienttransforms.

Increasing sparsity required departure from the linear model,towards a more flexible non-linear formulation. In the non-linear case, each signal is allowed to use a different setof atoms from the dictionary in order to achieve the bestapproximation. Thus, the approximation process becomes

x ⇤X

n�IK(x)

cn�n , (5)

where IK(x) is an index set adapted to each signal individually(we refer the reader to [5], [7] for more information on thiswide topic).

The non-linear view paved the way to the design ofnewer, more efficient transforms. In the process, many ofthe fundamental concepts guiding modern dictionary designwere formed. Following the historic time line, we trace theemergence of the most important modern dictionary designconcepts, which are mostly formed during the last two decadesof the 20th century.

Localization: To achieve sparsity, transforms required betterlocalization. Atoms with concentrated supports allow moreflexible representations based on the local signal characteris-tics, and limit the effects of irregularities, which are observedto be the main source of large coefficients. In this spirit, oneof the first structures to be used was the Short Time FourierTransform (STFT) [8], which emerges as a natural extensionto the Fourier transform. In the STFT, the Fourier transform isapplied locally to (possibly overlapping) portions of the signal,revealing a time-frequency (or space-frequency) descriptionof the signal. An example of the STFT is the JPEG imagecompression algorithm [9], which is based on this concept.

During the 1980’s and 1990’s, the STFT was extensivelyresearched and generalized, becoming more known as theGabor transform, named in homage of Dennis Gabor, whofirst suggested the time-frequency decomposition back in1946 [10]. Gabor’s work was independently rediscovered in

1980 by Bastiaans [11] and Janssen [12], who studied thefundamental properties of the expansion.

A basic 1-D Gabor dictionary consists of windowed wave-forms

G =©

⇤n,m(x) = w(x� ⇥m)ei2⇥�nx™

n,m�Z ,

where w(·) is a low-pass window function localized at 0(typically a Gaussian), and � and ⇥ control the time andfrequency resolution of the transform. Much of the mathe-matical foundations of this transform were laid out during thelate 1980’s by Daubechies, Grossman and Meyer [13], [14]who studied the transform from the angle of frame theory,and by Feichtinger and Grochenig [15]–[17] who employed ageneralized group-theoretic point of view. Study of the discreteversion of the transform and its numerical implementationfollowed in the early 1990’s, with notable contributions byWexler and Raz [18] and by Qian and Chen [19].

In higher dimensions, more complex Gabor structures weredeveloped which add directionality, by varying the orientationof the sinusoidal waves. This structure gained substantialsupport from the work of Daugman [20], [21], who discoveredoriented Gabor-like patterns in simple-cell receptive fields inthe visual cortex. These results motivated the deployment ofthe transform to image processing tasks, led by works such asDaugman [22] and Porat and Zeevi [23]. Today, practical usesof the Gabor transform are mainly in analysis and detectiontasks, as a collection of directional filters. Figure 2 showssome examples of 2-D Gabor atoms of various orientationsand sizes.

Multi-Resolution: One of the most significant conceptualadvancements achieved in the 1980’s was the rise of multi-scale analysis. It was realized that natural signals, and imagesspecifically, exhibited meaningful structures over many scales,and could be analyzed and described particularly efficientlyby multi-scale constructions. One of the simplest and bestknown such structures is the Laplacian pyramid, introducedin 1984 by Burt and Adelson [24]. The Laplacian pyramidrepresents an image as a series of difference images, whereeach one corresponds to a different scale and roughly adifferent frequency band.

In the second half of the 1980’s, though, the signal process-ing community was particularly excited about the developmentof a new very powerful tool, known as wavelet analysis [5],[25], [26]. In a pioneering work from 1984, Grossman andMorlet [27] proposed a signal expansion over a series oftranslated and dilated versions of a single elementary function,taking the form

W =n

⇤n,m(x) = �n/2f(�nx� ⇥m)o

n,m�Z.

This simple idea captivated the signal processing and harmonicanalysis communities, and in a series of influential works byMeyer, Daubechies, Mallat and others [13], [14], [28]–[33],an extensive wavelet theory was formalized. The theory wasformulated for both the continuous and discrete domains, anda complete mathematical framework relating the two was putforth. A significant breakthrough came from Meyer’s work in1985 [28], who found that unlike the Gabor transform (and

RUBINSTEIN et al.: DICTIONARIES FOR SPARSE REPRESENTATION 3

Fig. 1. Left: A few 12 £ 12 DCT atoms. Right: The first 40 KLT atoms,trained using 12£ 12 image patches from Lena.

B. Non-Linear Revolution and Elements of Modern DictionaryDesign

In statistics research, the 1980’s saw the rise of a newpowerful approach known as robust statistics. Robust statisticsadvocates sparsity as a key for a wide range of recovery andanalysis tasks. The idea has its roots in classical Physics, andmore recently in Information Theory, and promotes simplicityand conciseness in guiding phenomena descriptions. Motivatedby these ideas, the 1980’s and 1990’s were characterizedby a search for sparser representations and more efficienttransforms.

Increasing sparsity required departure from the linear model,towards a more flexible non-linear formulation. In the non-linear case, each signal is allowed to use a different setof atoms from the dictionary in order to achieve the bestapproximation. Thus, the approximation process becomes

x ⇤X

n�IK(x)

cn�n , (5)

where IK(x) is an index set adapted to each signal individually(we refer the reader to [5], [7] for more information on thiswide topic).

The non-linear view paved the way to the design ofnewer, more efficient transforms. In the process, many ofthe fundamental concepts guiding modern dictionary designwere formed. Following the historic time line, we trace theemergence of the most important modern dictionary designconcepts, which are mostly formed during the last two decadesof the 20th century.

Localization: To achieve sparsity, transforms required betterlocalization. Atoms with concentrated supports allow moreflexible representations based on the local signal characteris-tics, and limit the effects of irregularities, which are observedto be the main source of large coefficients. In this spirit, oneof the first structures to be used was the Short Time FourierTransform (STFT) [8], which emerges as a natural extensionto the Fourier transform. In the STFT, the Fourier transform isapplied locally to (possibly overlapping) portions of the signal,revealing a time-frequency (or space-frequency) descriptionof the signal. An example of the STFT is the JPEG imagecompression algorithm [9], which is based on this concept.

During the 1980’s and 1990’s, the STFT was extensivelyresearched and generalized, becoming more known as theGabor transform, named in homage of Dennis Gabor, whofirst suggested the time-frequency decomposition back in1946 [10]. Gabor’s work was independently rediscovered in

1980 by Bastiaans [11] and Janssen [12], who studied thefundamental properties of the expansion.

A basic 1-D Gabor dictionary consists of windowed wave-forms

G =©

⇤n,m(x) = w(x� ⇥m)ei2⇥�nx™

n,m�Z ,

where w(·) is a low-pass window function localized at 0(typically a Gaussian), and � and ⇥ control the time andfrequency resolution of the transform. Much of the mathe-matical foundations of this transform were laid out during thelate 1980’s by Daubechies, Grossman and Meyer [13], [14]who studied the transform from the angle of frame theory,and by Feichtinger and Grochenig [15]–[17] who employed ageneralized group-theoretic point of view. Study of the discreteversion of the transform and its numerical implementationfollowed in the early 1990’s, with notable contributions byWexler and Raz [18] and by Qian and Chen [19].

In higher dimensions, more complex Gabor structures weredeveloped which add directionality, by varying the orientationof the sinusoidal waves. This structure gained substantialsupport from the work of Daugman [20], [21], who discoveredoriented Gabor-like patterns in simple-cell receptive fields inthe visual cortex. These results motivated the deployment ofthe transform to image processing tasks, led by works such asDaugman [22] and Porat and Zeevi [23]. Today, practical usesof the Gabor transform are mainly in analysis and detectiontasks, as a collection of directional filters. Figure 2 showssome examples of 2-D Gabor atoms of various orientationsand sizes.

Multi-Resolution: One of the most significant conceptualadvancements achieved in the 1980’s was the rise of multi-scale analysis. It was realized that natural signals, and imagesspecifically, exhibited meaningful structures over many scales,and could be analyzed and described particularly efficientlyby multi-scale constructions. One of the simplest and bestknown such structures is the Laplacian pyramid, introducedin 1984 by Burt and Adelson [24]. The Laplacian pyramidrepresents an image as a series of difference images, whereeach one corresponds to a different scale and roughly adifferent frequency band.

In the second half of the 1980’s, though, the signal process-ing community was particularly excited about the developmentof a new very powerful tool, known as wavelet analysis [5],[25], [26]. In a pioneering work from 1984, Grossman andMorlet [27] proposed a signal expansion over a series oftranslated and dilated versions of a single elementary function,taking the form

W =n

⇤n,m(x) = �n/2f(�nx� ⇥m)o

n,m�Z.

This simple idea captivated the signal processing and harmonicanalysis communities, and in a series of influential works byMeyer, Daubechies, Mallat and others [13], [14], [28]–[33],an extensive wavelet theory was formalized. The theory wasformulated for both the continuous and discrete domains, anda complete mathematical framework relating the two was putforth. A significant breakthrough came from Meyer’s work in1985 [28], who found that unlike the Gabor transform (and

[Slide figure: four example dictionaries, labeled DCT, PCA, Gabor, Learned]

IEEE PROCEEDINGS, VOL. X, NO. X, XX 20XX

Fig. 2. Left: A few 12×12 Gabor atoms at different scales and orientations. Right: A few atoms trained by Olshausen and Field (extracted from [34]).

contrary to common belief) the wavelet transform could be designed to be orthogonal while maintaining stability, an extremely appealing property to which much of the initial success of the wavelets can be attributed.

Specifically of interest to the signal processing community was the work of Mallat and his colleagues [31]–[33], which established the wavelet decomposition as a multi-resolution expansion and put forth efficient algorithms for computing it. In Mallat's description, a multi-scale wavelet basis is constructed from a pair of localized functions referred to as the scaling function and the mother wavelet, see Figure 3. The scaling function is a low frequency signal, and along with its translations, spans the coarse approximation of the signal. The mother wavelet is a high frequency signal, and with its various scales and translations spans the signal detail. In the orthogonal case, the wavelet basis functions at each scale are critically sampled, spanning precisely the new detail introduced by the finer level.

Non-linear approximation in the wavelet basis was shown to be optimal for piecewise-smooth 1-D signals with a finite number of discontinuities, see e.g. [32]. This was a striking finding at the time, realizing that this is achieved without prior detection of the discontinuity locations. Unfortunately, in higher dimensions the wavelet transform loses its optimality; the multi-dimensional transform is a simple separable extension of the 1-D transform, with atoms supported over rectangular regions of different sizes (see Figure 3). This separability makes the transform simple to apply; however, the resulting dictionary is only effective for signals with point singularities, while most natural signals exhibit elongated edge singularities. The JPEG2000 image compression standard, based on the wavelet transform, is indeed known for its ringing (smoothing) artifacts near edges.
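The optimality claim can be checked empirically: for a piecewise-constant signal, only the wavelet atoms whose supports straddle a discontinuity carry energy, so a handful of coefficients represent the signal exactly. The sketch below, using a Haar transform as the simplest instance, is illustrative and not the experiment from [32].

```python
import numpy as np

def haar_dwt(f):
    """Full orthonormal Haar transform of a length-2^J signal."""
    f = f.astype(float)
    out = []
    while len(f) > 1:
        out.append((f[0::2] - f[1::2]) / np.sqrt(2))  # detail coefficients
        f = (f[0::2] + f[1::2]) / np.sqrt(2)          # coarse approximation
    out.append(f)
    return np.concatenate(out[::-1])

# Piecewise-constant signal with two discontinuities
N = 1024
f = np.where(np.arange(N) < 300, 1.0, 0.0) + np.where(np.arange(N) >= 700, 2.0, 0.0)
c = haar_dwt(f)

# Only atoms straddling a jump are active: O(log N) nonzeros out of N
nonzeros = int(np.sum(np.abs(c) > 1e-9))
print(nonzeros)  # far fewer than N = 1024
```

Since the transform is orthonormal, keeping just these few coefficients reconstructs the signal with no error at all, with no need to detect the jump locations first.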

Adaptivity: Going into the 1990's, the desire to push sparsity even further, and describe increasingly complex phenomena, was gradually revealing the limits of approximation in orthogonal bases. The weakness was mostly associated with the small and fixed number of atoms in the dictionary, dictated by the orthogonality, from which the optimal representation could be constructed. Thus, one option to obtain further sparsity was to adapt the transform atoms themselves to the signal content.

One of the first such structures to be proposed was the wavelet packet transform, introduced by Coifman, Meyer and Wickerhauser in 1992 [35]. The transform is built upon the success of the wavelet transform, adding adaptivity to allow finer tuning to the specific signal properties. The main observation of Coifman et al. was that the wavelet transform

Fig. 3. Left: Coiflet 1-D scaling function (solid) and mother wavelet (dashed). Right: Some 2-D separable Coiflet atoms.

enforced a very specific time-frequency structure, with high frequency atoms having small supports and low frequency atoms having large supports. Indeed, this choice has deep connections to the behavior of real natural signals; however, for specific signals, better partitionings may be possible. The wavelet packet dictionary essentially unifies all dyadic time-frequency atoms which can be derived from a specific pair of scaling function and mother wavelet, so atoms of different frequencies can come in an array of time supports. Out of this large collection, the wavelet packet transform allows one to efficiently select an optimized orthogonal sub-dictionary for any given signal, with the standard wavelet basis being just one of an exponential number of options. The process was thus named by the authors a Best Basis search. The wavelet packet transform is, by definition, at least as good as wavelets in terms of coding efficiency. However, we note that the multi-dimensional wavelet packet transform remains a separable and non-oriented transform, and thus does not generally provide a substantial improvement over wavelets for images.
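A minimal sketch of the best-basis idea, using one Haar analysis step as the split and a simple additive sparsity cost; the cost function and toy signal here are illustrative assumptions, not the authors' exact entropy criterion.

```python
import numpy as np

def split(v):
    """One Haar analysis step: (coarse, detail), both orthonormal."""
    return (v[0::2] + v[1::2]) / np.sqrt(2), (v[0::2] - v[1::2]) / np.sqrt(2)

def cost(v, eps=1e-8):
    """Additive sparsity cost: number of significant coefficients."""
    return int(np.sum(np.abs(v) > eps))

def best_basis(v):
    """Keep node v, or split it recursively, whichever costs less."""
    if len(v) == 1:
        return [v], cost(v)
    a, d = split(v)
    ba, ca = best_basis(a)
    bd, cd = best_basis(d)
    if ca + cd < cost(v):
        return ba + bd, ca + cd
    return [v], cost(v)

# A constant signal costs 8 coefficients as raw samples,
# but the best basis finds a single-coefficient representation.
blocks, total = best_basis(np.ones(8))
print(total)  # 1
```

Because the cost is additive across sub-bands, the globally optimal sub-dictionary is found by this bottom-up comparison, despite the exponential number of candidate bases.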

Geometric Invariance and Overcompleteness: In 1992, Simoncelli et al. [36] published a thorough work advocating a dictionary property they termed shiftability, which describes the invariance of the dictionary under certain geometric deformations, e.g. translation, rotation or scaling. Indeed, the main weakness of the wavelet transform is its strong translation-sensitivity, as well as rotation-sensitivity in higher dimensions. The authors concluded that achieving these properties required abandoning orthogonality in favor of overcompleteness, since the critical number of atoms in an orthogonal transform was simply insufficient. In the same work, the authors developed an overcomplete oriented wavelet transform, the steerable wavelet transform, which was based on their previous work on steerable filters and consisted of localized 2-D wavelet atoms in many orientations, translations and scales.

For the basic 1-D wavelet transform, translation-invariance can be achieved by increasing the sampling density of the atoms. The stationary wavelet transform, also known as the undecimated or non-subsampled wavelet transform, is obtained from the orthogonal transform by eliminating the sub-sampling and collecting all translations of the atoms over the signal domain. The algorithmic foundation for this was laid by Beylkin in 1992 [37], with the development of an efficient algorithm for computing the undecimated transform. The stationary wavelet transform was indeed found to substantially improve signal recovery compared to orthogonal wavelets, and its benefits were independently demonstrated in 1995 by Nason and Silverman [38] and Coifman and Donoho [39].
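The shift-invariance gained by collecting all translations can be sketched with a one-level Haar shrinkage averaged over shifts (cycle spinning, in the spirit of Coifman and Donoho [39]); the threshold and test signal below are illustrative.

```python
import numpy as np

def haar_shrink(f, t):
    """One-level decimated Haar transform, soft-threshold the details, invert."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2)
    d = (f[0::2] - f[1::2]) / np.sqrt(2)
    d = np.sign(d) * np.maximum(np.abs(d) - t, 0.0)
    out = np.empty_like(f, dtype=float)
    out[0::2] = (a + d) / np.sqrt(2)
    out[1::2] = (a - d) / np.sqrt(2)
    return out

def cycle_spin(f, t):
    """Average the estimator over circular shifts; for a one-level transform,
    shifts 0 and 1 cover all sub-sampling phases."""
    return (haar_shrink(f, t)
            + np.roll(haar_shrink(np.roll(f, 1), t), -1)) / 2

f = np.zeros(16)
f[5:] = 1.0                       # a step edge
# The averaged estimator commutes with translation (shift-invariance):
lhs = cycle_spin(np.roll(f, 1), 0.3)
rhs = np.roll(cycle_spin(f, 0.3), 1)
print(np.allclose(lhs, rhs))  # True
```

The decimated `haar_shrink` alone fails this commutation test, which is precisely the translation-sensitivity the undecimated transform removes.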

Page 41: Learning Sparse Representation

Patch-based Denoising   [Aharon & Elad 2006]

Noisy image: f = f0 + w.

Step 1: Extract patches  yk(·) = f(zk + ·).

Page 42: Learning Sparse Representation

Patch-based Denoising   [Aharon & Elad 2006]

Noisy image: f = f0 + w.

Step 1: Extract patches  yk(·) = f(zk + ·).

Step 2: Dictionary learning:

    min_{D, (xk)_k}  Σ_k  (1/2) ||yk − D xk||² + λ ||xk||₁
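The sparse-coding half of Step 2 (minimizing over xk with D fixed) is a Lasso problem; a minimal ISTA solver is sketched below on a synthetic dictionary. All sizes and the regularization weight are illustrative assumptions.

```python
import numpy as np

def ista(D, y, lam, n_iter=200):
    """Iterative soft-thresholding for min_x 0.5||y - D x||^2 + lam ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x + D.T @ (y - D @ x) / L        # gradient step on the data term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return x

# Toy patch: overcomplete random dictionary, sparse ground truth
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
x0 = np.zeros(128)
x0[[3, 40, 90]] = [1.5, -2.0, 1.0]           # sparse ground-truth code
y = D @ x0 + 0.01 * rng.standard_normal(64)

x = ista(D, y, lam=0.05)
```

In full dictionary learning this coding step alternates with an update of D over all patches; here D is held fixed to isolate the sparse-coding sub-problem.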

Page 43: Learning Sparse Representation

Patch-based Denoising   [Aharon & Elad 2006]

Noisy image: f = f0 + w.

Step 1: Extract patches  yk(·) = f(zk + ·).

Step 2: Dictionary learning:

    min_{D, (xk)_k}  Σ_k  (1/2) ||yk − D xk||² + λ ||xk||₁

Step 3: Patch averaging:  ỹk = D xk,   f̃(·) ≈ Σ_k ỹk(· − zk).
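Steps 1 and 3 amount to extracting overlapping patches and averaging them back at their original positions; a minimal 1-D sketch (patch size and signal are arbitrary choices):

```python
import numpy as np

def extract_patches(f, n):
    """Step 1: all overlapping length-n patches y_k(.) = f(z_k + .)."""
    return np.stack([f[k:k + n] for k in range(len(f) - n + 1)])

def average_patches(Y, N):
    """Step 3: put each (processed) patch back at its location, averaging overlaps."""
    n = Y.shape[1]
    acc, cnt = np.zeros(N), np.zeros(N)
    for k, p in enumerate(Y):
        acc[k:k + n] += p
        cnt[k:k + n] += 1
    return acc / cnt

f = np.arange(20, dtype=float)
Y = extract_patches(f, 5)
g = average_patches(Y, len(f))
print(np.allclose(g, f))  # True: with unmodified patches, averaging is exact
```

In denoising, each row of `Y` would be replaced by its sparse approximation `D x_k` before averaging; the overlap averaging then suppresses independent per-patch errors.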

Page 44: Learning Sparse Representation

Learning with Missing Data

Inverse problem: y = Φ f0 + w.

Patch extractor: pk(f) = f(zk + ·).

    min_{f, D ∈ C, (xk)_k}  (1/2) ||y − Φ f||² + λ Σ_k [ (1/2) ||pk(f) − D xk||² + μ ||xk||₁ ]

Fig. 14 (from "Learning Multiscale and Sparse Representations", SIAM). Inpainting using N = 2 and n = 16×16 (bottom-right image), or N = 1 and n = 8×8 (bottom-left). J = 100 iterations were performed, producing an adaptive dictionary. During the learning, 50% of the patches were used. A sparsity factor L = 10 has been used during the learning process and L = 25 for the final reconstruction. The damaged image was created by removing 75% of the data from the original image. The initial PSNR is 6.13dB. The resulting PSNR for N = 2 is 33.97dB and 31.75dB for N = 1.

Page 45: Learning Sparse Representation

Learning with Missing Data

Inverse problem: y = Φ f0 + w.   Patch extractor: pk(f) = f(zk + ·).

    min_{f, D ∈ C, (xk)_k}  (1/2) ||y − Φ f||² + λ Σ_k [ (1/2) ||pk(f) − D xk||² + μ ||xk||₁ ]

Step 1: ∀ k, minimization on xk  →  convex sparse coding.

Page 46: Learning Sparse Representation

Learning with Missing Data

Inverse problem: y = Φ f0 + w.   Patch extractor: pk(f) = f(zk + ·).

    min_{f, D ∈ C, (xk)_k}  (1/2) ||y − Φ f||² + λ Σ_k [ (1/2) ||pk(f) − D xk||² + μ ||xk||₁ ]

Step 1: ∀ k, minimization on xk  →  convex sparse coding.

Step 2: Minimization on D  →  quadratic constrained.

Page 47: Learning Sparse Representation

Learning with Missing Data

Inverse problem: y = Φ f0 + w.   Patch extractor: pk(f) = f(zk + ·).

    min_{f, D ∈ C, (xk)_k}  (1/2) ||y − Φ f||² + λ Σ_k [ (1/2) ||pk(f) − D xk||² + μ ||xk||₁ ]

Step 1: ∀ k, minimization on xk  →  convex sparse coding.

Step 2: Minimization on D  →  quadratic constrained.

Step 3: Minimization on f  →  quadratic.
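The three-step alternation can be sketched on a toy 1-D problem. In this hypothetical setup, the dictionary update (Step 2) is omitted (D is held fixed and orthonormal), the patches are non-overlapping, and Φ is a diagonal binary mask, so the Step 3 update has a simple per-sample closed form; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_code(D, y, mu, n_iter=100):
    """Step 1 (ISTA): min_x 0.5||y - D x||^2 + mu ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x + D.T @ (y - D @ x) / L
        x = np.sign(g) * np.maximum(np.abs(g) - mu / L, 0.0)
    return x

n, P = 16, 8                                       # patch size, number of patches
N = n * P
D = np.linalg.qr(rng.standard_normal((n, n)))[0]   # fixed orthonormal dictionary
X0 = np.zeros((P, n))
for k in range(P):                                 # ground truth: 2-sparse code per patch
    X0[k, rng.choice(n, 2, replace=False)] = 2.0
f0 = np.concatenate([D @ X0[k] for k in range(P)])

mask = (rng.random(N) < 0.5).astype(float)         # Phi: keep ~50% of the samples
y = mask * f0

lam, mu, f = 1.0, 0.05, y.copy()
for _ in range(10):
    X = np.stack([sparse_code(D, f[k*n:(k+1)*n], mu) for k in range(P)])  # Step 1
    rec = np.concatenate([D @ X[k] for k in range(P)])
    # Step 3: min_f 0.5||y - mask*f||^2 + lam*0.5||f - rec||^2, solved per sample
    f = (mask * y + lam * rec) / (mask + lam)
print(np.linalg.norm(f - f0) / np.linalg.norm(f0))  # relative error
```

The Step 3 formula follows from setting the gradient Φᵀ(Φf − y) + λ(f − rec) to zero; observed samples blend data and model, while missing samples are filled entirely by the sparse reconstructions D xk. This is a structural sketch, not tuned for recovery quality.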

Page 48: Learning Sparse Representation

Inpainting Example   [Mairal et al. 2008]

Image f0  →  Observations y = Φ f0 + w  →  Regularized f.

Page 49: Learning Sparse Representation

Adaptive Inpainting and Separation   [Peyré, Fadili, Starck 2010]

[Figure panels: dictionaries compared — Wavelets, Local DCT, Learned]

Page 50: Learning Sparse Representation

Overview

•Sparsity and Redundancy

•Dictionary Learning

•Extensions

•Task-driven Learning

•Texture Synthesis

Page 51: Learning Sparse Representation

Higher Dimensional Learning

MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 57

Fig. 2. Dictionaries with 256 atoms learned on a generic database of natural images, with two different sizes of patches. Note the large number of color-less atoms. Since the atoms can have negative values, the vectors are presented scaled and shifted to the [0,255] range per channel: (a) 5×5×3 patches; (b) 8×8×3 patches.

Fig. 3. Examples of color artifacts while reconstructing a damaged version of the image (a) without the improvement proposed here. Color artifacts are reduced with the proposed technique; both images have been denoised with the same global dictionary. In (b), one observes a bias effect in the color of the castle and in parts of the water. Moreover, the color of the sky is piecewise constant (false contours), another artifact the proposed approach corrects. (a) Original. (b) Original algorithm. (c) Proposed algorithm.

Fig. 4. (a) Training image; (b) resulting dictionary, learned on the image in (a). The dictionary is more colored than the global one.

MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 61

Fig. 7. Data set used for evaluating denoising experiments.

TABLE I
PSNR RESULTS OF OUR DENOISING ALGORITHM WITH 256 ATOMS OF SIZE 7×7×3 AND 6×6×3. EACH CASE IS DIVIDED IN FOUR PARTS: THE TOP-LEFT RESULTS ARE THOSE GIVEN BY MCAULEY ET AL. [28] WITH THEIR "3×3 MODEL." THE TOP-RIGHT RESULTS ARE THOSE OBTAINED BY APPLYING THE GRAYSCALE K-SVD ALGORITHM [2] ON EACH CHANNEL SEPARATELY WITH 8×8 ATOMS. THE BOTTOM-LEFT ARE OUR RESULTS OBTAINED WITH A GLOBALLY TRAINED DICTIONARY. THE BOTTOM-RIGHT ARE THE IMPROVEMENTS OBTAINED WITH THE ADAPTIVE APPROACH WITH 20 ITERATIONS. BOLD INDICATES THE BEST RESULTS FOR EACH GROUP. AS CAN BE SEEN, OUR PROPOSED TECHNIQUE CONSISTENTLY PRODUCES THE BEST RESULTS

TABLE II
COMPARISON OF THE PSNR RESULTS ON THE IMAGE "CASTLE" BETWEEN [28] AND WHAT WE OBTAINED WITH 256 6×6×3 AND 7×7×3 PATCHES. FOR THE ADAPTIVE APPROACH, 20 ITERATIONS HAVE BEEN PERFORMED. BOLD INDICATES THE BEST RESULT, INDICATING ONCE AGAIN THE CONSISTENT IMPROVEMENT OBTAINED WITH OUR PROPOSED TECHNIQUE

patch), in order to prevent any learning of these artifacts (over-fitting). We then define the patch sparsity of the decomposition as this number of steps. The stopping criterion in (2) becomes the number of atoms used instead of the reconstruction error. Using a small value during the OMP permits learning a dictionary specialized in providing a coarse approximation. Our assumption is that (pattern) artifacts are less present in coarse approximations, preventing the dictionary from learning them. We then propose the algorithm described in Fig. 6. We typically used to prevent the learning of artifacts, and found that two outer iterations in the scheme in Fig. 6 are sufficient to give satisfactory results, while within the K-SVD, 10–20 iterations are required.

To conclude, in order to address the demosaicing problem, we use the modified K-SVD algorithm that deals with nonuniform noise, as described in the previous section, and add to it an adaptive dictionary that has been learned with low patch sparsity in order to avoid over-fitting the mosaic pattern. The same technique can be applied to generic color inpainting, as demonstrated in the next section.
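The fixed-atom-count stopping rule described above can be sketched with a minimal OMP implementation; the orthonormal dictionary and atom indices below are illustrative choices for a clean demo, not the paper's setup.

```python
import numpy as np

def omp(D, y, n_atoms):
    """Orthogonal Matching Pursuit, stopped after a fixed number of atoms
    (the 'patch sparsity' rule) rather than a target reconstruction error."""
    residual = y.astype(float).copy()
    support, x = [], np.zeros(D.shape[1])
    for _ in range(n_atoms):
        support.append(int(np.argmax(np.abs(D.T @ residual))))    # best-correlated atom
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # re-project on support
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(0)
D = np.linalg.qr(rng.standard_normal((40, 40)))[0]  # orthonormal dictionary
y = 2.0 * D[:, 7] - 1.5 * D[:, 33]                  # exactly 2 atoms active
x = omp(D, y, n_atoms=2)
print(np.nonzero(x)[0])  # indices 7 and 33
```

Capping `n_atoms` at a small value is exactly what forces the coarse, artifact-free approximations the text relies on.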

V. EXPERIMENTAL RESULTS

We are now ready to present the color image denoising, inpainting, and demosaicing results that are obtained with the proposed framework.

A. Denoising Color Images

The state-of-the-art performance of the algorithm on grayscale images has already been studied in [2]. We now evaluate our extension for color images. We trained dictionaries with different sizes of atoms (5×5×3, 6×6×3, 7×7×3 and 8×8×3) on 200 000 patches taken from a database of 15 000 images, with the patch-sparsity parameter (six atoms in the representations). We used the LabelMe database [55] to build our image database. Then we trained each dictionary with 600 iterations. This provided us with a set of generic dictionaries that we used as initial dictionaries in our denoising algorithm. Comparing the results obtained with the global approach and the adaptive one permits us to see the improvements in the learning process. We chose to evaluate

Learning D

Page 52: Learning Sparse Representation

MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 57

Fig. 2. Dictionaries with 256 atoms learned on a generic database of natural images, with two different sizes of patches. Note the large number of color-less atoms. Since the atoms can have negative values, the vectors are presented scaled and shifted to the [0, 255] range per channel: (a) 5 × 5 × 3 patches; (b) 8 × 8 × 3 patches.

Fig. 3. Examples of color artifacts while reconstructing a damaged version of the image (a) without the improvement proposed here ( in the new metric). Color artifacts are reduced with our proposed technique ( in our proposed new metric). Both images have been denoised with the same global dictionary. In (b), one observes a bias effect in the color from the castle and in some parts of the water. What is more, the color of the sky is piecewise constant when (false contours), which is another artifact our approach corrected. (a) Original. (b) Original algorithm, dB. (c) Proposed algorithm, dB.

Fig. 4. (a) Training image; (b) resulting dictionary, learned on the image in (a). The dictionary is more colored than the global one.

Inpainting

Higher Dimensional Learning



MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION 61

Fig. 7. Data set used for evaluating denoising experiments.

TABLE I. PSNR RESULTS OF OUR DENOISING ALGORITHM WITH 256 ATOMS OF SIZE 7 × 7 × 3 FOR AND 6 × 6 × 3 FOR . EACH CASE IS DIVIDED IN FOUR PARTS: THE TOP-LEFT RESULTS ARE THOSE GIVEN BY MCAULEY ET AL. [28] WITH THEIR "3 × 3 MODEL." THE TOP-RIGHT RESULTS ARE THOSE OBTAINED BY APPLYING THE GRAYSCALE K-SVD ALGORITHM [2] ON EACH CHANNEL SEPARATELY WITH 8 × 8 ATOMS. THE BOTTOM-LEFT ARE OUR RESULTS OBTAINED WITH A GLOBALLY TRAINED DICTIONARY. THE BOTTOM-RIGHT ARE THE IMPROVEMENTS OBTAINED WITH THE ADAPTIVE APPROACH WITH 20 ITERATIONS. BOLD INDICATES THE BEST RESULTS FOR EACH GROUP. AS CAN BE SEEN, OUR PROPOSED TECHNIQUE CONSISTENTLY PRODUCES THE BEST RESULTS.

TABLE II. COMPARISON OF THE PSNR RESULTS ON THE IMAGE "CASTLE" BETWEEN [28] AND WHAT WE OBTAINED WITH 256 6 × 6 × 3 AND 7 × 7 × 3 PATCHES. FOR THE ADAPTIVE APPROACH, 20 ITERATIONS HAVE BEEN PERFORMED. BOLD INDICATES THE BEST RESULT, INDICATING ONCE AGAIN THE CONSISTENT IMPROVEMENT OBTAINED WITH OUR PROPOSED TECHNIQUE.


Learning D

ONLINE LEARNING FOR MATRIX FACTORIZATION AND SPARSE CODING

Figure 7: Inpainting example on a 12-Megapixel image. Top: Damaged and restored images. Bottom: Zooming on the damaged and restored images. Note that the pictures presented here have been scaled down for display. (Best seen in color.)

6.4 Application to Large-Scale Image Processing

We demonstrate in this section that our algorithm can be used for a difficult large-scale image processing task, namely, removing the text (inpainting) from the damaged 12-Megapixel image of Figure 7. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7 × 10⁶ undamaged 12 × 12 color patches in the image with two epochs in about 8 minutes on a 2.4 GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008b). Our intent here is of course not to evaluate our learning procedure in inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard data sets. Instead, we just wish to demonstrate that it can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. For comparison, the dictionaries used for inpainting in Mairal et al. (2008b) are learned (in batch mode) on 200,000 patches only.
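The pipeline described here, learning a dictionary online from mini-batches of image patches, can be caricatured in a few lines. The sketch below uses a trivial 1-sparse code and a plain gradient step, so it only illustrates the mini-batch structure, not the actual algorithm of the paper; all names are illustrative.

```python
import numpy as np

def learn_dictionary_online(patches, k=64, n_steps=200, batch=32, lr=0.1, seed=0):
    """Toy online dictionary learning sketch (not Mairal et al.'s exact
    algorithm): alternate a 1-sparse code with a gradient step on D.
    `patches` is a (num_patches, n) array; returns an (n, k) dictionary."""
    rng = np.random.default_rng(seed)
    n = patches.shape[1]
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_steps):
        X = patches[rng.integers(0, len(patches), size=batch)].T  # (n, batch)
        # 1-sparse coding: each patch keeps only its best-matching atom.
        corr = D.T @ X
        idx = np.argmax(np.abs(corr), axis=0)
        A = np.zeros((k, batch))
        A[idx, np.arange(batch)] = corr[idx, np.arange(batch)]
        # Gradient step on the reconstruction error, then renormalize atoms.
        D += lr * (X - D @ A) @ A.T / batch
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D
```

Because each step touches only a mini-batch, memory stays constant regardless of how many patches the image provides, which is the property that makes the 12-Megapixel experiment feasible.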



Page 53: Learning Sparse Representation

Movie Inpainting

Page 54: Learning Sparse Representation

Image registration.

Facial Image Compression

show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, related to the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made in using sparse and redundant representations of signals [18–26], and learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm for learning the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy coding stage. Yet, it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.2

In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we embark from in the development of ours. We also describe the topic of sparse and redundant representations and the K-SVD, which are the foundations for our algorithm. In Section 3 we turn to present the proposed algorithm in detail, showing its various steps, and discussing its computational/memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future activities that can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still image compression algorithms, there are relatively few that consider the treatment of facial images [2–17]. Among those, the most recent and the best performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this sub-section we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geometrical alignment of the input image, so that the main features (ears, nose, mouth, hair-line, etc.) are aligned with those of a database of pre-aligned facial images. Such alignment further increases the redundancy in the handled image, due to its high cross-similarity to the database. The warping in [17] is done by an automatic detection of 13 feature points on the face, and moving them to pre-determined canonical locations. These points define a slicing of the input image into a disjoint and covering set of triangles, each exhibiting an affine warp, being a function of the motion of its three vertices. Side information on these 13 feature locations enables a reverse warp of the reconstructed image in the decoder. Fig. 1 (left side) shows the features and the induced triangles. After the warping, the image is sliced into square and non-overlapping patches (of size 8 × 8 pixels), each of which is coded separately. Such possible slicing (for illustration purposes we show this slicing with larger patches) is shown in Fig. 1 (right side).
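The tiling into disjoint square patches can be sketched in a few lines (a generic illustration; the paper slices warped facial images into 8 × 8 patches, and the function name here is ours):

```python
import numpy as np

def slice_into_patches(img, p=8):
    """Slice a 2-D image into disjoint p x p patches (cropping any
    remainder), mirroring the non-overlapping tiling used after
    face alignment. Returns an array of shape (num_patches, p, p)."""
    h = img.shape[0] - img.shape[0] % p
    w = img.shape[1] - img.shape[1] % p
    img = img[:h, :w]
    # Reshape into a grid of blocks, then flatten the grid dimensions.
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p, p)
```

Each patch is then coded independently, which is what lets the per-location dictionaries specialize in the expected local content.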

Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) per patch separately, using patches taken from the same location in 5000 training images. This way, each VQ is adapted to the expected local content, hence the high performance presented by this algorithm. The number of code-words in the VQ is a function of the bit allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, leads to a loss of some redundancy between adjacent patches, and thus a loss of potential compression.

Another ingredient in this algorithm that partly compensates for the above-described shortcoming is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8 × 8. Then it is interpolated back to the original resolution, and the residual is coded using VQ on 8 × 8 pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].
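The multi-scale residual idea (code a coarse version, upsample it, then code the residual) can be sketched with a stand-in coder. This toy version uses 2× average-pooling and nearest-neighbor upsampling in place of the paper's VQ coding and Laplacian pyramid; with a perfect coder the scheme is exactly lossless, which illustrates that all loss comes from the patch coder itself.

```python
import numpy as np

def two_scale_code(img, code):
    """Toy two-scale residual scheme: code a coarse version, upsample,
    then code the residual at full resolution. `code` is any lossy
    patch coder (here a stand-in callable). Assumes even dimensions."""
    h, w = img.shape
    # Coarse image by 2x2 average pooling.
    coarse = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    coarse_hat = code(coarse)
    # Nearest-neighbor upsampling back to full resolution.
    up = np.kron(coarse_hat, np.ones((2, 2)))
    # Code the residual and add it back.
    residual_hat = code(img - up)
    return up + residual_hat
```

Splitting the bit budget between the coarse layer and the residual is what recovers some of the inter-patch redundancy lost by coding small patches independently.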

As already mentioned above, the results shown in [17] surpass those obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we propose to replace the coding stage from VQ to sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ ℝⁿ (with n = N²). Assume that we are given a matrix D ∈ ℝⁿˣᵏ (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch x could be represented sparsely using this dictionary, i.e., the solution of

    α̂ = argmin_α ‖α‖₀  subject to  ‖Dα − x‖₂² ≤ ε²,    (1)

is expected to be very sparse, ‖α̂‖₀ ≪ n. The notation ‖α‖₀ counts the non-zero entries of α. Thus, every signal instance from the family we consider is assumed to be represented as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.

The requirement ‖Dα − x‖₂ ≤ ε suggests that the approximation of x using Dα need not be exact, and could absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation with its simplicity, very much like the rate-distortion
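Problem (1) can be illustrated with plain matching pursuit, a simpler greedy cousin of OMP, stopping once the residual error drops below ε. This is only a sketch under the stated unit-norm assumption, not the coder used in the paper, and the function name is ours.

```python
import numpy as np

def sparse_code_to_tolerance(D, x, eps, max_iter=100):
    """Sketch of problem (1): greedily accumulate atoms (plain matching
    pursuit) until ||D a - x||_2 <= eps or the iteration budget runs out.
    Columns of D are assumed to have unit norm."""
    a = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    for _ in range(max_iter):
        if np.linalg.norm(residual) <= eps:
            break
        # Best-matching atom for the current residual.
        m = int(np.argmax(np.abs(D.T @ residual)))
        c = float(D[:, m] @ residual)
        a[m] += c
        residual -= c * D[:, m]
    return a
```

A larger ε buys a sparser code at the price of a larger approximation error, which is exactly the rate-distortion style trade-off the text describes.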

2 The PCA algorithm is developed in this work as a competitive benchmark, and while it generally performs very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

[Elad et al. 2009]

Page 55: Learning Sparse Representation

Image registration.

Non-overlapping patches (fk)k.

Facial Image Compressionshow recognizable faces. We use a database containing around 6000such facial images, some of which are used for training and tuningthe algorithm, and the others for testing it, similar to the approachtaken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made inusing sparse and redundant representation of signals [18–26], andlearning their sparsifying dictionaries [27–29]. We use the K-SVDalgorithm for learning the dictionaries for representing smallimage patches in a locally adaptive way, and use these to sparse-code the patches’ content. This is a relatively simple andstraight-forward algorithm with hardly any entropy coding stage.Yet, it is shown to be superior to several competing algorithms:(i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],and (iii) A Principal Component Analysis (PCA) approach.2

In the next section we provide some background material forthis work: we start by presenting the details of the compressionalgorithm developed in [17], as their scheme is the one we embarkfrom in the development of ours. We also describe the topic ofsparse and redundant representations and the K-SVD, that arethe foundations for our algorithm. In Section 3 we turn to presentthe proposed algorithm in details, showing its various steps, anddiscussing its computational/memory complexities. Section 4presents results of our method, demonstrating the claimedsuperiority. We conclude in Section 5 with a list of future activitiesthat can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still imagecompression algorithms, there are relatively few that considerthe treatment of facial images [2–17]. Among those, the mostrecent and the best performing algorithm is the one reported in[17]. That paper also provides a thorough literature survey thatcompares the various methods and discusses similarities anddifferences between them. Therefore, rather than repeating sucha survey here, we refer the interested reader to [17]. In thissub-section we concentrate on the description of the algorithmin [17] as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geomet-rical alignment of the input image, so that the main features (ears,nose, mouth, hair-line, etc.) are aligned with those of a database ofpre-aligned facial images. Such alignment increases further theredundancy in the handled image, due to its high cross similarityto the database. The warping in [17] is done by an automaticdetection of 13 feature points on the face, and moving them topre-determined canonical locations. These points define a slicingof the input image into disjoint and covering set of triangles, eachexhibiting an affine warp, being a function of the motion of itsthree vertices. Side information on these 13 feature locationsenables a reverse warp of the reconstructed image in the decoder.Fig. 1 (left side) shows the features and the induced triangles. Afterthe warping, the image is sliced into square and non-overlappingpatches (of size 8! 8 pixels), each of which is coded separately.Such possible slicing (for illustration purpose we show this slicingwith larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quan-tization (VQ) [30–32]. The VQ dictionaries are trained (using tree-

K-Means) per each patch separately, using patches taken from thesame location from 5000 training images. This way, each VQ isadapted to the expected local content, and thus the high perfor-mance presented by this algorithm. The number of code-wordsin the VQ is a function of the bit-allocation for the patches. Aswe argue in the next section, VQ coding is limited by the availablenumber of examples and the desired rate, forcing relatively smallpatch sizes. This, in turn, leads to a loss of some redundancy be-tween adjacent patches, and thus loss of potential compression.

Another ingredient in this algorithm that partly compensatesfor the above-described shortcoming is a multi-scale codingscheme. The image is scaled down and VQ-coded using patchesof size 8! 8. Then it is interpolated back to the original resolution,and the residual is coded using VQ on 8! 8 pixel patches onceagain. This method can be applied on a Laplacian pyramid of theoriginal (warped) image with several scales [33].

As already mentioned above, the results shown in [17] surpassthose obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we pro-pose to replace the coding stage from VQ to sparse and redundantrepresentations—this leads us to the next subsection, were we de-scribe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparse-land [29]. This model suggests a parametric description of signalsources in a way that adapts to their true nature. This model willbe harnessed in this work to provide the coding mechanism forthe image patches. We consider a family of image patches of sizeN ! N pixels, ordered lexicographically as column vectors x 2 Rn

(with n " N2). Assume that we are given a matrix D 2 Rn!k (withpossibly k > n). We refer hereafter to this matrix as the dictionary.The Sparseland model suggests that every such image patch, x,could be represented sparsely using this dictionary, i.e., the solu-tion of

a " argmina

kak0 subject to kDa# xk22 6 e2; $1%

is expected to be very sparse, kak0 & n. The notation kak0 counts thenon-zero entries in a. Thus, every signal instance from the family weconsider is assumed to be represented as a linear combination offew columns (referred to hereafter as atoms) from the redundantdictionary D.

The requirement kDa# xk2 6 e suggests that the approximationof x using Da need not be exact, and could absorb a moderate errore. This suggests an approximation that trades-off accuracy of repre-sentation with its simplicity, very much like the rate-distortion

2 The PCA algorithm is developed in this work as a competitive benchmark, andwhile it is generally performing very well, it is inferior to the main algorithmpresented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) Auniform slicing to disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

show recognizable faces. We use a database containing around 6000such facial images, some of which are used for training and tuningthe algorithm, and the others for testing it, similar to the approachtaken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made inusing sparse and redundant representation of signals [18–26], andlearning their sparsifying dictionaries [27–29]. We use the K-SVDalgorithm for learning the dictionaries for representing smallimage patches in a locally adaptive way, and use these to sparse-code the patches’ content. This is a relatively simple andstraight-forward algorithm with hardly any entropy coding stage.Yet, it is shown to be superior to several competing algorithms:(i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],and (iii) A Principal Component Analysis (PCA) approach.2

In the next section we provide some background material forthis work: we start by presenting the details of the compressionalgorithm developed in [17], as their scheme is the one we embarkfrom in the development of ours. We also describe the topic ofsparse and redundant representations and the K-SVD, that arethe foundations for our algorithm. In Section 3 we turn to presentthe proposed algorithm in details, showing its various steps, anddiscussing its computational/memory complexities. Section 4presents results of our method, demonstrating the claimedsuperiority. We conclude in Section 5 with a list of future activitiesthat can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still imagecompression algorithms, there are relatively few that considerthe treatment of facial images [2–17]. Among those, the mostrecent and the best performing algorithm is the one reported in[17]. That paper also provides a thorough literature survey thatcompares the various methods and discusses similarities anddifferences between them. Therefore, rather than repeating sucha survey here, we refer the interested reader to [17]. In thissub-section we concentrate on the description of the algorithmin [17] as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geomet-rical alignment of the input image, so that the main features (ears,nose, mouth, hair-line, etc.) are aligned with those of a database ofpre-aligned facial images. Such alignment increases further theredundancy in the handled image, due to its high cross similarityto the database. The warping in [17] is done by an automaticdetection of 13 feature points on the face, and moving them topre-determined canonical locations. These points define a slicingof the input image into disjoint and covering set of triangles, eachexhibiting an affine warp, being a function of the motion of itsthree vertices. Side information on these 13 feature locationsenables a reverse warp of the reconstructed image in the decoder.Fig. 1 (left side) shows the features and the induced triangles. Afterthe warping, the image is sliced into square and non-overlappingpatches (of size 8! 8 pixels), each of which is coded separately.Such possible slicing (for illustration purpose we show this slicingwith larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quan-tization (VQ) [30–32]. The VQ dictionaries are trained (using tree-

K-Means) per each patch separately, using patches taken from thesame location from 5000 training images. This way, each VQ isadapted to the expected local content, and thus the high perfor-mance presented by this algorithm. The number of code-wordsin the VQ is a function of the bit-allocation for the patches. Aswe argue in the next section, VQ coding is limited by the availablenumber of examples and the desired rate, forcing relatively smallpatch sizes. This, in turn, leads to a loss of some redundancy be-tween adjacent patches, and thus loss of potential compression.

Another ingredient in this algorithm that partly compensatesfor the above-described shortcoming is a multi-scale codingscheme. The image is scaled down and VQ-coded using patchesof size 8! 8. Then it is interpolated back to the original resolution,and the residual is coded using VQ on 8! 8 pixel patches onceagain. This method can be applied on a Laplacian pyramid of theoriginal (warped) image with several scales [33].

As already mentioned above, the results shown in [17] surpassthose obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we pro-pose to replace the coding stage from VQ to sparse and redundantrepresentations—this leads us to the next subsection, were we de-scribe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ ℝⁿ (with n = N²). Assume that we are given a matrix D ∈ ℝ^(n×k) (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, could be represented sparsely using this dictionary, i.e., the solution of

α̂ = argmin_α ‖α‖₀  subject to  ‖Dα − x‖₂² ≤ ε²,    (1)

is expected to be very sparse, ‖α̂‖₀ ≪ n. The notation ‖α‖₀ counts the non-zero entries of α. Thus, every signal instance from the family we consider is assumed to be represented as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.

The requirement ‖Dα − x‖₂ ≤ ε suggests that the approximation of x using Dα need not be exact, and could absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation with its simplicity, very much like the rate-distortion …
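A greedy pursuit such as Orthogonal Matching Pursuit (OMP) is the standard way to approximate the solution of Eq. (1). The following is a minimal sketch under the usual assumption of unit-norm atoms, not the exact solver used in the paper:

```python
import numpy as np

def omp(D, x, eps):
    """Greedy OMP for Eq. (1): add atoms until ||D a - x||_2 <= eps.
    D: (n, k) dictionary with unit-norm columns; x: (n,) signal."""
    residual = x.copy()
    support = []
    a = np.zeros(D.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < D.shape[0]:
        # Atom most correlated with the current residual.
        j = int(np.abs(D.T @ residual).argmax())
        if j in support:
            break
        support.append(j)
        # Re-fit the coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        residual = x - D @ a
    return a
```

The ℓ₀ problem itself is combinatorial; OMP gives a sparse feasible α whose support grows one atom per iteration, which is exactly the "number of atoms per patch" knob used later for rate control.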

Footnote 2: The PCA algorithm is developed in this work as a competitive benchmark, and while it is generally performing very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piecewise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

[Elad et al. 2009]

fk

Page 56: Learning Sparse Representation

Image registration.

Non-overlapping patches (fk)k.

Dictionary learning (Dk)k.

Dk

Facial Image Compression

Before turning to present the results we should add the following: while all the results shown here refer to the specific database we operate on, the overall scheme proposed is general and should apply to other face image databases just as well. Naturally, some changes in the parameters might be necessary, and among those, the patch size is the most important to consider. We also note that as one shifts from one source of images to another, the relative size of the background in the photos may vary, and this necessarily leads to changes in performance. More specifically, when the background regions are larger (e.g., the images we use here have relatively small such regions), the compression performance is expected to improve.

4.1. K-SVD dictionaries

The primary stopping condition for the training process was set to be a limitation on the maximal number of K-SVD iterations (being 100). A secondary stopping condition was a limitation on the minimal representation error. In the image compression stage we added a limitation on the maximal number of atoms per patch. These conditions were used to allow us to better control the rates of the resulting images and the overall simulation time.

Every obtained dictionary contains 512 patches of size 15×15 pixels as atoms. In Fig. 6 we can see the dictionary that was trained for patch number 80 (the left eye) with L = 4 sparse coding atoms, and similarly, in Fig. 7 we can see the dictionary that was trained for patch number 87 (the right nostril), also with L = 4 sparse coding atoms. It can be seen that both dictionaries contain images similar in nature to the image patch for which they were trained. A similar behavior was observed in the other dictionaries.
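For concreteness, the core K-SVD step, refitting one atom together with its coefficients via a rank-1 SVD of the residual restricted to the signals that use it, can be sketched as follows. This is a simplified illustration of the textbook update, not the authors' implementation:

```python
import numpy as np

def ksvd_atom_update(D, A, X, j):
    """One K-SVD atom update (in place).
    D: (n, k) dictionary, A: (k, m) sparse codes, X: (n, m) training data."""
    users = np.nonzero(A[j])[0]           # signals whose code uses atom j
    if users.size == 0:
        return                             # unused atom: leave as is
    # Residual with atom j's contribution removed, on the using signals only.
    E = X[:, users] - D @ A[:, users] + np.outer(D[:, j], A[j, users])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                      # best rank-1 left factor (unit norm)
    A[j, users] = s[0] * Vt[0]             # matching coefficients
```

A full K-SVD iteration sparse-codes all training patches (e.g. with OMP, L atoms each) and then sweeps this update over all k atoms; the 100-iteration cap above bounds how many such sweeps are run.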

4.2. Reconstructed images

Our coding strategy allows us to learn which parts of the image are more difficult to code than others. This is done by assigning the same representation error threshold to all of the patches, and observing how many atoms are required for the representation of each patch on average. Clearly, patches with a small number of allocated atoms are simpler to represent than others. We would expect that the representation of smooth areas of the image, such as the background, parts of the face and maybe parts of the clothes, will be simpler than the representation of areas containing high-frequency elements such as the hair or the eyes. Fig. 8 shows maps of atom allocation per patch and representation error (RMSE, the square root of the mean squared error) per patch for the images in the test set at two different bit-rates. It can be seen that more atoms were allocated to patches containing the facial details (hair, mouth, eyes, and …

Fig. 6. The dictionary obtained by K-SVD for patch no. 80 (the left eye) using the OMP method with L = 4.

Fig. 7. The dictionary obtained by K-SVD for patch no. 87 (the right nostril) using the OMP method with L = 4.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 275

… show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advances made in using sparse and redundant representations of signals [18–26], and learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm for learning the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy coding stage. Yet, it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach (Footnote 2).

In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we embark from in the development of ours. We also describe the topic of sparse and redundant representations and the K-SVD, which are the foundations for our algorithm. In Section 3 we turn to present the proposed algorithm in detail, showing its various steps, and discussing its computational/memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future activities that can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still image compression algorithms, there are relatively few that consider the treatment of facial images [2–17]. Among those, the most recent and the best-performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this sub-section we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.


fk

Page 57: Learning Sparse Representation

Image registration.

Non-overlapping patches (fk)k.

Dictionary learning (Dk)k.

Sparse approximation: fk ≈ Dk xk.

Entropy coding: xk → file.
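The per-location coding pipeline on this slide (registered image → non-overlapping patches fk → sparse codes xk over learned Dk) can be sketched as below; `dictionaries` and `sparse_coder` are placeholders for the learned Dk and the pursuit algorithm, and the function name is ours:

```python
import numpy as np

def encode_image(image, dictionaries, sparse_coder, patch=15):
    """Per-location sparse coding of a registered face image.
    image: (H, W) array already warped to the canonical geometry;
    dictionaries: dict mapping patch index k -> D_k of shape (patch*patch, q);
    sparse_coder: function (D, x) -> coefficient vector.
    Returns {k: (indices, values)} ready for entropy coding."""
    H, W = image.shape
    codes, k = {}, 0
    for i in range(0, H - patch + 1, patch):        # non-overlapping grid
        for j in range(0, W - patch + 1, patch):
            x = image[i:i + patch, j:j + patch].reshape(-1)
            a = sparse_coder(dictionaries[k], x)
            nz = np.nonzero(a)[0]
            codes[k] = (nz, a[nz])                   # sparse support + values
            k += 1
    return codes
```

The decoder reverses the steps: rebuild each patch as Dk xk from the coded support and values, tile the patches, and apply the inverse warp.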

Comparison: JPEG-2k / PCA / Learning.

Dk

Facial Image CompressionBefore turning to preset the results we should add the follow-

ing: while all the results shown here refer to the specific databasewe operate on, the overall scheme proposed is general and shouldapply to other face images databases just as well. Naturally, somechanges in the parameters might be necessary, and among those,the patch size is the most important to consider. We also note thatas one shifts from one source of images to another, the relative sizeof the background in the photos may vary, and this necessarilyleads to changes in performance. More specifically, when the back-ground regions are larger (e.g., the images we use here have rela-tively small such regions), the compression performance isexpected to improve.

4.1. K-SVD dictionaries

The primary stopping condition for the training process was setto be a limitation on the maximal number of K-SVD iterations(being 100). A secondary stopping condition was a limitation onthe minimal representation error. In the image compression stagewe added a limitation on the maximal number of atoms per patch.These conditions were used to allow us to better control the ratesof the resulting images and the overall simulation time.

Every obtained dictionary contains 512 patches of size15! 15 pixels as atoms. In Fig. 6 we can see the dictionary that

was trained for patch number 80 (The left eye) for L " 4 sparsecoding atoms, and similarly, in Fig. 7 we can see the dictionary thatwas trained for patch number 87 (The right nostril) also for L " 4sparse coding atoms. It can be seen that both dictionaries containimages similar in nature to the image patch for which they weretrained for. A similar behavior was observed in other dictionaries.

4.2. Reconstructed images

Our coding strategy allows us to learn which parts of the im-age are more difficult than others to code. This is done byassigning the same representation error threshold to all of thepatches, and observing how many atoms are required for therepresentation of each patch on average. Clearly, patches witha small number of allocated atoms are simpler to represent thanothers. We would expect that the representation of smooth areasof the image such as the background, parts of the face andmaybe parts of the clothes will be simpler than the representa-tion of areas containing high frequency elements such as thehair or the eyes. Fig. 8 shows maps of atom allocation per patchand representation error (RMSE—squared-root of the meansquared error) per patch for the images in the test set in twodifferent bit-rates. It can be seen that more atoms were allocatedto patches containing the facial details (hair, mouth, eyes, and

Fig. 6. The Dictionary obtained by K-SVD for Patch No. 80 (the left eye) using the OMP method with L " 4.

Fig. 7. The Dictionary obtained by K-SVD for Patch No. 87 (the right nostril) using the OMP method with L " 4.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 275

show recognizable faces. We use a database containing around 6000such facial images, some of which are used for training and tuningthe algorithm, and the others for testing it, similar to the approachtaken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made inusing sparse and redundant representation of signals [18–26], andlearning their sparsifying dictionaries [27–29]. We use the K-SVDalgorithm for learning the dictionaries for representing smallimage patches in a locally adaptive way, and use these to sparse-code the patches’ content. This is a relatively simple andstraight-forward algorithm with hardly any entropy coding stage.Yet, it is shown to be superior to several competing algorithms:(i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],and (iii) A Principal Component Analysis (PCA) approach.2

In the next section we provide some background material forthis work: we start by presenting the details of the compressionalgorithm developed in [17], as their scheme is the one we embarkfrom in the development of ours. We also describe the topic ofsparse and redundant representations and the K-SVD, that arethe foundations for our algorithm. In Section 3 we turn to presentthe proposed algorithm in details, showing its various steps, anddiscussing its computational/memory complexities. Section 4presents results of our method, demonstrating the claimedsuperiority. We conclude in Section 5 with a list of future activitiesthat can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still imagecompression algorithms, there are relatively few that considerthe treatment of facial images [2–17]. Among those, the mostrecent and the best performing algorithm is the one reported in[17]. That paper also provides a thorough literature survey thatcompares the various methods and discusses similarities anddifferences between them. Therefore, rather than repeating sucha survey here, we refer the interested reader to [17]. In thissub-section we concentrate on the description of the algorithmin [17] as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geomet-rical alignment of the input image, so that the main features (ears,nose, mouth, hair-line, etc.) are aligned with those of a database ofpre-aligned facial images. Such alignment increases further theredundancy in the handled image, due to its high cross similarityto the database. The warping in [17] is done by an automaticdetection of 13 feature points on the face, and moving them topre-determined canonical locations. These points define a slicingof the input image into disjoint and covering set of triangles, eachexhibiting an affine warp, being a function of the motion of itsthree vertices. Side information on these 13 feature locationsenables a reverse warp of the reconstructed image in the decoder.Fig. 1 (left side) shows the features and the induced triangles. Afterthe warping, the image is sliced into square and non-overlappingpatches (of size 8! 8 pixels), each of which is coded separately.Such possible slicing (for illustration purpose we show this slicingwith larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quan-tization (VQ) [30–32]. The VQ dictionaries are trained (using tree-

K-Means) per each patch separately, using patches taken from thesame location from 5000 training images. This way, each VQ isadapted to the expected local content, and thus the high perfor-mance presented by this algorithm. The number of code-wordsin the VQ is a function of the bit-allocation for the patches. Aswe argue in the next section, VQ coding is limited by the availablenumber of examples and the desired rate, forcing relatively smallpatch sizes. This, in turn, leads to a loss of some redundancy be-tween adjacent patches, and thus loss of potential compression.

Another ingredient in this algorithm that partly compensatesfor the above-described shortcoming is a multi-scale codingscheme. The image is scaled down and VQ-coded using patchesof size 8! 8. Then it is interpolated back to the original resolution,and the residual is coded using VQ on 8! 8 pixel patches onceagain. This method can be applied on a Laplacian pyramid of theoriginal (warped) image with several scales [33].

As already mentioned above, the results shown in [17] surpassthose obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we pro-pose to replace the coding stage from VQ to sparse and redundantrepresentations—this leads us to the next subsection, were we de-scribe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparse-land [29]. This model suggests a parametric description of signalsources in a way that adapts to their true nature. This model willbe harnessed in this work to provide the coding mechanism forthe image patches. We consider a family of image patches of sizeN ! N pixels, ordered lexicographically as column vectors x 2 Rn

(with n " N2). Assume that we are given a matrix D 2 Rn!k (withpossibly k > n). We refer hereafter to this matrix as the dictionary.The Sparseland model suggests that every such image patch, x,could be represented sparsely using this dictionary, i.e., the solu-tion of

a " argmina

kak0 subject to kDa# xk22 6 e2; $1%

is expected to be very sparse, kak0 & n. The notation kak0 counts thenon-zero entries in a. Thus, every signal instance from the family weconsider is assumed to be represented as a linear combination offew columns (referred to hereafter as atoms) from the redundantdictionary D.

The requirement kDa# xk2 6 e suggests that the approximationof x using Da need not be exact, and could absorb a moderate errore. This suggests an approximation that trades-off accuracy of repre-sentation with its simplicity, very much like the rate-distortion

2 The PCA algorithm is developed in this work as a competitive benchmark, andwhile it is generally performing very well, it is inferior to the main algorithmpresented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) Auniform slicing to disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

show recognizable faces. We use a database containing around 6000such facial images, some of which are used for training and tuningthe algorithm, and the others for testing it, similar to the approachtaken in [17].

In our work we propose a novel compression algorithm, relatedto the one presented in [17], improving over it.

Our algorithm relies strongly on recent advancements made inusing sparse and redundant representation of signals [18–26], andlearning their sparsifying dictionaries [27–29]. We use the K-SVDalgorithm for learning the dictionaries for representing smallimage patches in a locally adaptive way, and use these to sparse-code the patches’ content. This is a relatively simple andstraight-forward algorithm with hardly any entropy coding stage.Yet, it is shown to be superior to several competing algorithms:(i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],and (iii) A Principal Component Analysis (PCA) approach.2

In the next section we provide some background material forthis work: we start by presenting the details of the compressionalgorithm developed in [17], as their scheme is the one we embarkfrom in the development of ours. We also describe the topic ofsparse and redundant representations and the K-SVD, that arethe foundations for our algorithm. In Section 3 we turn to presentthe proposed algorithm in details, showing its various steps, anddiscussing its computational/memory complexities. Section 4presents results of our method, demonstrating the claimedsuperiority. We conclude in Section 5 with a list of future activitiesthat can further improve over the proposed scheme.

2. Background material

2.1. VQ-based image compression

Among the thousands of papers that study still imagecompression algorithms, there are relatively few that considerthe treatment of facial images [2–17]. Among those, the mostrecent and the best performing algorithm is the one reported in[17]. That paper also provides a thorough literature survey thatcompares the various methods and discusses similarities anddifferences between them. Therefore, rather than repeating sucha survey here, we refer the interested reader to [17]. In thissub-section we concentrate on the description of the algorithmin [17] as our method resembles it to some extent.

This algorithm, like some others before it, starts with a geomet-rical alignment of the input image, so that the main features (ears,nose, mouth, hair-line, etc.) are aligned with those of a database ofpre-aligned facial images. Such alignment increases further theredundancy in the handled image, due to its high cross similarityto the database. The warping in [17] is done by an automaticdetection of 13 feature points on the face, and moving them topre-determined canonical locations. These points define a slicingof the input image into disjoint and covering set of triangles, eachexhibiting an affine warp, being a function of the motion of itsthree vertices. Side information on these 13 feature locationsenables a reverse warp of the reconstructed image in the decoder.Fig. 1 (left side) shows the features and the induced triangles. Afterthe warping, the image is sliced into square and non-overlappingpatches (of size 8! 8 pixels), each of which is coded separately.Such possible slicing (for illustration purpose we show this slicingwith larger patches) is shown in Fig. 1 (right side).

Coding of the image patches in [17] is done using vector quan-tization (VQ) [30–32]. The VQ dictionaries are trained (using tree-

K-Means) per each patch separately, using patches taken from thesame location from 5000 training images. This way, each VQ isadapted to the expected local content, and thus the high perfor-mance presented by this algorithm. The number of code-wordsin the VQ is a function of the bit-allocation for the patches. Aswe argue in the next section, VQ coding is limited by the availablenumber of examples and the desired rate, forcing relatively smallpatch sizes. This, in turn, leads to a loss of some redundancy be-tween adjacent patches, and thus loss of potential compression.

Another ingredient in this algorithm that partly compensatesfor the above-described shortcoming is a multi-scale codingscheme. The image is scaled down and VQ-coded using patchesof size 8! 8. Then it is interpolated back to the original resolution,and the residual is coded using VQ on 8! 8 pixel patches onceagain. This method can be applied on a Laplacian pyramid of theoriginal (warped) image with several scales [33].

As already mentioned above, the results shown in [17] surpassthose obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we pro-pose to replace the coding stage from VQ to sparse and redundantrepresentations—this leads us to the next subsection, were we de-scribe the principles behind this coding strategy.

2.2. Sparse and redundant representations

We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ R^n (with n = N²). Assume that we are given a matrix D ∈ R^{n×k} (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, can be represented sparsely using this dictionary, i.e., the solution of

α̂ = argmin_α ||α||_0  subject to  ||Dα − x||_2² ≤ ε²,   (1)

is expected to be very sparse, ||α̂||_0 ≪ n. The notation ||α||_0 counts the non-zero entries of α. Thus, every signal instance from the family we consider is assumed to be representable as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.
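Problem (1) is NP-hard in general, but greedy pursuit gives a practical approximation. A minimal orthogonal-matching-pursuit sketch (illustrative names, not the paper's implementation):

```python
import numpy as np

def omp(D, x, eps):
    """Greedy sparse coding: grow the support until ||D @ alpha - x||_2 <= eps."""
    n, k = D.shape
    alpha = np.zeros(k)
    residual = x.copy()
    support = []
    while np.linalg.norm(residual) > eps and len(support) < n:
        # pick the atom most correlated with the current residual
        m = int(np.argmax(np.abs(D.T @ residual)))
        if m in support:
            break  # no new atom improves the fit
        support.append(m)
        # re-fit the coefficients on the current support (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        alpha = np.zeros(k)
        alpha[support] = coef
        residual = x - D @ alpha
    return alpha

# a patch that is exactly 2-sparse in a random normalized dictionary
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 17]
alpha = omp(D, x, eps=1e-6)  # recovers a very sparse code
```

The orthogonal re-fit is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every atom already selected.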

The requirement ||Dα − x||_2 ≤ ε suggests that the approximation of x using Dα need not be exact, and can absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation with its simplicity, very much like the rate-distortion

2 The PCA algorithm is developed in this work as a competitive benchmark, and while it generally performs very well, it is inferior to the main algorithm presented in this work.

Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.

O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271

Much like other compression methods, the quality of the reconstructed images in our method improves as the bit-rate increases. However, the contribution gained from such a rate increment is not divided equally over the image. Additional bits are allocated to patches with higher representation error, and those are improved first. This property is directly caused by the nature of the compression process, which is RMSE oriented and not bit-rate oriented. The compression process sets a single RMSE threshold for all the patches, forcing each of them to reach it without fixing the number of allocated atoms per patch. Patches with simple (smooth) content are most likely to have a representation error far below the threshold even using zero or one atom, whereas patches with more complex content are expected to give a representation error very close to the threshold. Such problematic patches will be forced to improve their representation error by increasing the number of atoms they use as the RMSE threshold is decreased, while patches with a representation error below the threshold will not be forced to change at all. Fig. 11 illustrates the gradual improvement in the image quality as the bit-rate increases. As can be seen, not all the patches improve as the bit-rate increases but only some of them, such as several patches in the clothes area, in the ears and in the outline of the hair. These patches were more difficult to represent than others.

4.3. Comparing to other techniques

An important part in assessing the performance of our compression method is its comparison to known and competitive compression techniques. As mentioned before, we compare our results in this work with JPEG, JPEG2000, the VQ-based compression method described in [17], and a PCA-based compression method that was built especially for this work as a competitive benchmark. We therefore start with a brief description of the PCA technique.

The PCA-based compression method is very similar to the scheme described in this work, simply replacing the K-SVD dictionaries with Principal Component Analysis (PCA) ones. These dictionaries are square matrices storing the eigenvectors of the autocorrelation matrices of the training examples in each patch, sorted in decreasing order of their corresponding eigenvalues.
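Such a PCA dictionary can be sketched directly from this description; `pca_dictionary` is an illustrative name, and the training set here is synthetic:

```python
import numpy as np

def pca_dictionary(Y):
    """Y: n x T matrix of training patches (columns).  Returns an n x n
    orthonormal dictionary: the eigenvectors of the autocorrelation matrix,
    sorted by decreasing eigenvalue."""
    R = Y @ Y.T / Y.shape[1]            # autocorrelation matrix
    eigval, eigvec = np.linalg.eigh(R)  # returned in ascending order
    order = np.argsort(eigval)[::-1]
    return eigvec[:, order]

rng = np.random.default_rng(2)
# anisotropic synthetic training set: most energy along a few directions
Y = np.diag([5.0, 3.0, 1.0, 0.2]) @ rng.standard_normal((4, 500))
D = pca_dictionary(Y)

# coding a patch then amounts to keeping its first few coefficients in this basis
x = Y[:, 0]
coeffs = D.T @ x
approx = D[:, :2] @ coeffs[:2]  # low-rate approximation: 2 leading atoms
```

Since D is orthonormal, keeping the leading coefficients is the best k-term approximation in this fixed basis, which is exactly what makes PCA a natural benchmark against learned redundant dictionaries.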

Fig. 12. Facial images compression with a bit-rate of 400 bytes. Comparing results of JPEG2000, the PCA results, and our K-SVD method. The values in the brackets are the representation RMSE.


[Elad et al. 2009]

400 bytes

f_k

Page 58: Learning Sparse Representation

PCA Learning

Dictionary learning: C = {D : ||d_m|| ≤ 1}

Exemplars Y

Constraints on the Learning

Figure 3 (from "Online Learning for Matrix Factorization and Sparse Coding"): results obtained by PCA, NMF, dictionary learning, and SPCA (τ = 70%, 30%, 10%) for data set D.

min_{X, D ∈ C}  (1/2)||Y − DX||² + λ||X||_1

Page 59: Learning Sparse Representation

PCA NMFLearning

Dictionary learning: C = {D : ||d_m|| ≤ 1}

Non-negative matrix factorization: C = {D : ||d_m|| ≤ 1, D ≥ 0}

Exemplars Y

Constraints on the Learning

Figure 3 (from "Online Learning for Matrix Factorization and Sparse Coding"): results obtained by PCA, NMF, dictionary learning, and SPCA (τ = 70%, 30%, 10%) for data set D.

min_{X, D ∈ C}  (1/2)||Y − DX||² + λ||X||_1

Page 60: Learning Sparse Representation

Sparse PCA / PCA / NMF Learning

Dictionary learning: C = {D : ||d_m|| ≤ 1}

Non-negative matrix factorization: C = {D : ||d_m|| ≤ 1, D ≥ 0}

Sparse PCA: C = {D : ||d_m||² + λ||d_m||_1 ≤ 1}

Exemplars Y

Constraints on the Learning

Figure 3 (from "Online Learning for Matrix Factorization and Sparse Coding"): results obtained by PCA, NMF, dictionary learning, and SPCA (τ = 70%, 30%, 10%) for data set D.

min_{X, D ∈ C}  (1/2)||Y − DX||² + λ||X||_1
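A common way to attack this non-convex problem is to alternate sparse coding in X (a few ISTA steps) with a projected gradient update of D onto C = {D : ||d_m|| ≤ 1}. A minimal sketch under those assumptions, not the exact algorithm of any cited paper:

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def dict_learn(Y, p, lam=0.1, iters=50, rng=None):
    """Alternating minimization of (1/2)||Y - DX||^2 + lam ||X||_1 over X and D in C."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, T = Y.shape
    D = rng.standard_normal((n, p))
    D /= np.linalg.norm(D, axis=0)                 # start inside C
    X = np.zeros((p, T))
    for _ in range(iters):
        # sparse coding step: a few ISTA iterations on X (D fixed)
        L = np.linalg.norm(D, 2) ** 2 + 1e-12      # Lipschitz constant of the gradient
        for _ in range(10):
            X = soft(X - (D.T @ (D @ X - Y)) / L, lam / L)
        # dictionary step: gradient step on D, then project columns onto C
        D -= 0.9 / (np.linalg.norm(X, 2) ** 2 + 1e-12) * (D @ X - Y) @ X.T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, X

def objective(Y, D, X, lam=0.1):
    return 0.5 * np.sum((Y - D @ X) ** 2) + lam * np.abs(X).sum()

rng = np.random.default_rng(3)
Y = rng.standard_normal((8, 40))
D0 = rng.standard_normal((8, 12)); D0 /= np.linalg.norm(D0, axis=0)
D, X = dict_learn(Y, p=12, lam=0.1, iters=50, rng=np.random.default_rng(3))
```

Both steps are descent steps (step sizes below the inverse Lipschitz constants), so the objective decreases monotonically even though the joint problem is non-convex.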

Page 61: Learning Sparse Representation

Translation invariance + patches: [Aharon & Elad 2008] [Jojic et al. 2003]

Low-dimensional dictionary parameterization: D = (d_m)_m

Dictionary D

Image f

d_m(t) = d(z_m + t)

Signature d

Dictionary Signature / Epitome

Figure 7. 1, 2, 4 and 20 epitomes learned on the barbara image for the same parameters. They are of sizes 42, 32, 25 and 15 in order to keep the same number of elements in D. They are not represented to scale.

5.3. Influence of the Number of Epitomes

We present in this section an experiment where the number of learned epitomes varies, while keeping the same number of columns in D. The 1, 2, 4 and 20 epitomes learned on the image barbara are shown in Figure 7. When the number of epitomes is small, we observe in the epitomes some discontinuities between texture areas with different visual characteristics, which is not the case when learning several independent epitomes.

5.4. Application to Denoising

In order to evaluate the performance of epitome learning in various regimes (single epitome, multiple epitomes), we use the same methodology as [1], which builds on the successful denoising method first introduced by [9]. Let us first consider the classical problem of restoring a noisy image y in R^n which has been corrupted by a white Gaussian noise of standard deviation σ. We denote by y_i in R^m the patch of y centered at pixel i (with any arbitrary ordering of the image pixels).

The method of [9] proceeds as follows:

• Learn a dictionary D adapted to all overlapping patches y_1, y_2, ... from the noisy image y.

• Approximate each noisy patch using the learned dictionary with a greedy algorithm called orthogonal matching pursuit (OMP) [17], obtaining a clean estimate of every patch y_i by addressing the following problem:

  argmin_{α_i ∈ R^p} ||α_i||_0  s.t.  ||y_i − Dα_i||_2² ≤ Cσ²,

where Dα_i is a clean estimate of the patch y_i, ||α_i||_0 is the ℓ0 pseudo-norm of α_i, and C is a regularization parameter. Following [9], we choose C = 1.15.

• Since every pixel in y admits many clean estimates (one estimate for every patch the pixel belongs to), average the estimates.
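The three steps above can be sketched on a 1-D signal. Here hard-thresholding in a fixed orthonormal DCT basis stands in for OMP with the residual-based stopping rule, so this is an illustration of the sparse-code-then-average pipeline, not the paper's coder:

```python
import numpy as np

def denoise_1d(y, D, patch, sigma):
    """Sparse-code every overlapping patch in the orthonormal basis D
    (hard threshold at 3*sigma, a stand-in for OMP), then average the
    overlapping clean estimates at each sample."""
    n = len(y)
    acc = np.zeros(n)   # sum of clean estimates at each sample
    cnt = np.zeros(n)   # number of estimates covering each sample
    for i in range(n - patch + 1):
        p = y[i:i + patch]
        a = D.T @ p
        a[np.abs(a) < 3.0 * sigma] = 0.0   # keep only significant atoms
        acc[i:i + patch] += D @ a
        cnt[i:i + patch] += 1
    return acc / cnt

# orthonormal DCT-II basis as the (fixed) dictionary
patch = 8
k = np.arange(patch)
D = np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / patch)
D /= np.linalg.norm(D, axis=0)

rng = np.random.default_rng(4)
sigma = 0.1
clean = np.sin(np.linspace(0, 4 * np.pi, 128))
noisy = clean + sigma * rng.standard_normal(128)
rec = denoise_1d(noisy, D, patch, sigma)
```

The averaging step is what turns per-patch estimates into a consistent signal: each sample is covered by up to `patch` overlapping estimates, which suppresses the residual coefficient noise further.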

Figure 8. Artificially noised boat image (with standard deviation σ = 15), and the result of our denoising algorithm.

Quantitative results for the single epitome and the multi-scale multi-epitomes are presented in Table 1 on six images and five levels of noise. We evaluate the performance of the denoising process by computing the peak signal-to-noise ratio (PSNR) for each pair of images. For each level of noise, we have selected the best regularization parameter λ over all six images, and have then used it in all the experiments. The PSNR values are averaged over 5 experiments with 5 different noise realizations. The mean standard deviation is 0.05 dB both for the single epitome and the multi-scale multi-epitomes.

We see from this experiment that the formulation we propose is competitive with that of [1]. Learning multiple epitomes instead of a single one seems to provide better results, which might be explained by the lack of flexibility of the single-epitome representation. Evidently, these results are not as good as recent state-of-the-art denoising algorithms such as [7, 15], which exploit more sophisticated


ourselves for simplicity to the case of single and multiple epitomes of the same size and shape.

The multi-epitome version of our approach can be seen as an interpolation between a classical dictionary and a single epitome. Indeed, defining a multitude of epitomes of the same size as the considered patches is equivalent to working with a dictionary. Defining a large number of epitomes slightly larger than the patches is equivalent to shift-invariant dictionaries. In Section 5, we experimentally compare these different regimes for the task of image denoising.

4.4. Initialization

Because of the nonconvexity of the optimization problem, the question of initialization is an important issue in epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale the problem remains. Whereas classical flat dictionaries can naturally be initialized with prespecified dictionaries such as an overcomplete DCT basis (see [9]), the epitome does not admit such a natural choice. In all the experiments (unless written otherwise), we use as the initialization a single epitome (or a collection of epitomes), common to all experiments, which is learned using our algorithm, initialized with a Gaussian low-pass filtered random image, on a set of 100,000 random patches extracted from 5,000 natural images (all different from the test images used for denoising).

5. Experimental Validation

Figure 4. House, Peppers, Cameraman, Lena, Boat and Barbara images.

We provide in this section qualitative and quantitative validation. We first study the influence of the different model hyperparameters on the visual aspect of the epitome before moving to an image denoising task. We choose to represent the epitomes as images in order to more easily visualize the patches that will be extracted to form the images. Since epitomes contain negative values, they are arbitrarily rescaled between 0 and 1 for display.

In this section, we will work with several images, which are shown in Figure 4.

5.1. Influence of the Initialization

In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained.

The difference in contrast may be due to the scaling of the data in the displaying process. This experiment illustrates that different initializations lead to visually different epitomes. While this property might not be desirable, the classical dictionary learning framework also suffers from this issue, yet has led to successful applications in image processing [9].

Figure 5. Three epitomes obtained on the boat image for different initializations, but all with the same parameters. Left: epitome obtained with initialization on an epitome learned on random patches from natural images. Middle and right: epitomes obtained for two different random initializations.

5.2. Influence of the Size of the Patches

The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of size 46 × 46 are learned with different sizes of patches.

Figure 6. Pairs of epitomes of width 46 obtained for patches of width 6, 8, 9, 10 and 12. All other parameters are unchanged. Experiments run with 2 scales (20 iterations for the first scale, 5 for the second) on the house image.

As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.


Page 62: Learning Sparse Representation

Translation invariance + patches: [Aharon & Elad 2008] [Jojic et al. 2003]

Low-dimensional dictionary parameterization: D = (d_m)_m

Dictionary D

Image f

→ Faster learning.
→ Makes use of the atoms' spatial locations x_m.

d_m(t) = d(z_m + t)

Signature d

Dictionary Signature / Epitome


Page 63: Learning Sparse Representation

Overview

•Sparsity and Redundancy

•Dictionary Learning

•Extensions

•Task-driven Learning

•Texture Synthesis

Page 64: Learning Sparse Representation

Ground truth: y_k = Φ f_k + ε_k

Exemplar f_k / Observation y_k

Task Driven Learning

Page 65: Learning Sparse Representation

Ground truth: y_k = Φ f_k + ε_k

Estimator: y_k ↦ f(D, y_k)

f(D, ·): Exemplar f_k / Observation y_k

Example: ℓ¹ regularization.

Task Driven Learning

Page 66: Learning Sparse Representation

Ground truth: y_k = Φ f_k + ε_k

Estimator: y_k ↦ f(D, y_k)

f(D, ·): Exemplar f_k / Observation y_k

Task-driven learning:  min_D E(D) = Σ_k ||f_k − f(D, y_k)||²   [Mairal et al. 2010] [Peyré & Fadili 2010]

Example: ℓ¹ regularization.

Task Driven Learning

Page 67: Learning Sparse Representation

Ground truth: y_k = Φ f_k + ε_k

Estimator: y_k ↦ f(D, y_k)

f(D, ·): Exemplar f_k / Observation y_k

Task-driven learning:  min_D E(D) = Σ_k ||f_k − f(D, y_k)||²   [Mairal et al. 2010] [Peyré & Fadili 2010]

Gradient descent:  D ← D − τ Σ_k ∂f(D, y_k)^⊤ [f(D, y_k) − f_k]

Example: ℓ¹ regularization.

Task Driven Learning

Page 68: Learning Sparse Representation

Ground truth: y_k = Φ f_k + ε_k

Estimator: y_k ↦ f(D, y_k)

f(D, ·): Exemplar f_k / Observation y_k

Task-driven learning:  min_D E(D) = Σ_k ||f_k − f(D, y_k)||²   [Mairal et al. 2010] [Peyré & Fadili 2010]

Gradient descent:  D ← D − τ Σ_k ∂f(D, y_k)^⊤ [f(D, y_k) − f_k]

Compute the derivative ∂f w.r.t. D?

Example: ℓ¹ regularization.

Task Driven Learning

Page 69: Learning Sparse Representation

“s = sign(x)”

x(D, y) = argmin_{x ∈ R^P} (1/2)||y − ΦDx||² + λ||x||_1

⇒  D^⊤Φ^⊤(ΦDx − y) + λ s = 0

Dictionary Sensitivity. Sparse estimator: f(D, y) = D x(D, y)

Page 70: Learning Sparse Representation

“s = sign(x)”

⇒  x_I = (D_I^⊤ Φ^⊤ Φ D_I)^{−1} (D_I^⊤ Φ^⊤ y − λ s_I)

Support: I = {m : x_m ≠ 0},  D_I = (d_m)_{m ∈ I}

x(D, y) = argmin_{x ∈ R^P} (1/2)||y − ΦDx||² + λ||x||_1

⇒  D^⊤Φ^⊤(ΦDx − y) + λ s = 0

Dictionary Sensitivity. Sparse estimator: f(D, y) = D x(D, y)

Local expression of x(D, y) around (D, y).

Page 71: Learning Sparse Representation

“s = sign(x)”

⇒  x_I = (D_I^⊤ Φ^⊤ Φ D_I)^{−1} (D_I^⊤ Φ^⊤ y − λ s_I)

Locally:
→ s_I is constant.
→ The map y ↦ x(D, y) is affine.
→ The map D ↦ x(D, y) is a rational function.

Support: I = {m : x_m ≠ 0},  D_I = (d_m)_{m ∈ I}

Compute the derivative of D ↦ x(D, y).
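The closed-form expression for x_I can be checked numerically: solve the lasso by iterative soft-thresholding, read off the support and signs, and compare with the formula (a sanity-check sketch with random Φ and D):

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(5)
n, q, p, lam = 12, 10, 6, 0.3
Phi = rng.standard_normal((n, q))
D = rng.standard_normal((q, p))
y = rng.standard_normal(n)

# solve x(D, y) = argmin (1/2)||y - Phi D x||^2 + lam ||x||_1 by ISTA
A = Phi @ D
L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
x = np.zeros(p)
for _ in range(20000):
    x = soft(x - A.T @ (A @ x - y) / L, lam / L)

# closed form on the support:  x_I = (D_I' Phi' Phi D_I)^{-1} (D_I' Phi' y - lam s_I)
I = np.flatnonzero(np.abs(x) > 1e-6)
s_I = np.sign(x[I])
A_I = A[:, I]
x_I = np.linalg.solve(A_I.T @ A_I, A_I.T @ y - lam * s_I)
print(np.allclose(x[I], x_I, atol=1e-4))
```

Once the support and signs are frozen, this closed form is what makes x(D, y) an explicit (affine in y, rational in D) function that can be differentiated for the gradient descent on E(D).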

x(D, y) = argmin_{x ∈ R^P} (1/2)||y − ΦDx||² + λ||x||_1

⇒  D^⊤Φ^⊤(ΦDx − y) + λ s = 0

Dictionary Sensitivity. Sparse estimator: f(D, y) = D x(D, y)

Local expression of x(D, y) around (D, y).

Page 72: Learning Sparse Representation

Sparse recovery with linearized model Φ:

Unknown degradation operator: y_k = Φ_0(f_k) + w_k

f(D, Φ, y) = D argmin_{x ∈ R^P} (1/2)||y − ΦDx||² + λ||x||_1

Blind Sparse Restoration

Figure 2 (from "Task-Driven Dictionary Learning", RR n° 7400): From left to right: original images, halftoned images, reconstructed images. Even though the halftoned images (center column) perceptually look relatively close to the original images (left column), they are binary. Reconstructed images (right column) are obtained by restoring the halftoned binary images. Best viewed by zooming on a computer screen.

f_k  y_k

Page 73: Learning Sparse Representation

Sparse recovery with linearized model Φ:

Unknown degradation operator: y_k = Φ_0(f_k) + w_k

Task-driven learning of D and Φ:  min_{D, Φ} Σ_k ||f_k − f(D, Φ, y_k)||²

f(D, Φ, y) = D argmin_{x ∈ R^P} (1/2)||y − ΦDx||² + λ||x||_1

Blind Sparse Restoration


f_k  y_k  f(D, Φ, y_k)

Page 74: Learning Sparse Representation

Overview

•Sparsity and Redundancy

•Dictionary Learning

•Extensions

•Task-driven Learning

•Texture Synthesis

Page 75: Learning Sparse Representation

Texture Synthesis: generate f perceptually similar to some input f0

Page 76: Learning Sparse Representation

→ Design and manipulate statistical constraints.
→ Use statistical constraints for other imaging problems.

Texture Synthesis: generate f perceptually similar to some input f0

Page 77: Learning Sparse Representation

Dictionaries for Textures

Page 78: Learning Sparse Representation

Sparse model for all the texture patches:

E(f) = Σ_k min_x (1/2)||p_k(f) − Dx||² + λ||x||_1

Texture ensemble: T = {f : f is a local minimum of E}

p_k(f) = f(· + z_k)

Sparse Texture Ensemble [Peyré, 2008]

Page 79: Learning Sparse Representation

Sparse model for all the texture patches:

E(f) = Σ_k min_x (1/2)||p_k(f) − Dx||² + λ||x||_1

Texture ensemble: T = {f : f is a local minimum of E}

Almost bias-free sampling of T:

Initialization: f ← white noise.

p_k(f) = f(· + z_k)

Sparse Texture Ensemble [Peyré, 2008]

Page 80: Learning Sparse Representation

Sparse model for all the texture patches:

E(f) = Σ_k min_x 1/2 ||p_k(f) − D x||² + λ ||x||_1,  where p_k(f) = f(· + z_k)

Texture ensemble: T = {f : f is a local minimum of E}

Almost bias-free sampling of T:

Initialization: f ← white noise.

Iteration: x_k ← argmin_x 1/2 ||p_k(f) − D x||² + λ ||x||_1

Sparse Texture Ensemble [Peyré, 2008]
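The per-patch iteration is a lasso problem; one standard way to solve it is iterative soft-thresholding (ISTA). The sketch below is a minimal illustration (not the presentation's code); the dictionary size, λ, and iteration count are illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(D, p, lam, n_iter=200):
    """Minimize 0.5 * ||p - D x||^2 + lam * ||x||_1 with ISTA."""
    L = np.linalg.norm(D, ord=2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - p)        # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy usage with a random dictionary of unit-norm atoms (illustrative sizes).
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)
p = rng.standard_normal(64)
x = ista_lasso(D, p, lam=0.5)
```

The soft-thresholding step is what produces exactly-zero coefficients, i.e. the sparsity of the patch code.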

Page 81: Learning Sparse Representation

Sparse model for all the texture patches:

E(f) = Σ_k min_x 1/2 ||p_k(f) − D x||² + λ ||x||_1,  where p_k(f) = f(· + z_k)

Texture ensemble: T = {f : f is a local minimum of E}

Almost bias-free sampling of T:

Initialization: f ← white noise.

Iteration: x_k ← argmin_x 1/2 ||p_k(f) − D x||² + λ ||x||_1

Update: f(·) ← Σ_k y_k(· − z_k)

Sparse Texture Ensemble


Figure 1: Parameterization of the dictionary of edge patches and some examples.

Figure 2: Iterations of the synthesis algorithm with the dictionary of edges (sparsity s = 2).

Dictionary of local oscillations. In order to synthesize highly oscillating textures, we consider the following set of functions

ψ_λ(t) = sin( R_θ(t − (δ, 0)) / η ),  where λ = (θ, δ) ∈ [0, 2π) × R+.  (15)

The local frequency η globally controls the width of the oscillations, whereas θ is the local orientation of these oscillations.

Dictionaries of lines. Similarly to the edge dictionary (14), a dictionary of lines is obtained by rotating and translating a straight line

ψ_λ(t) = φ_{θ,δ,σ}(t) = exp( −(1/2σ²) ||R_θ(t − (δ, 0))||² ),  (16)

where λ = (θ, δ) ∈ [0, 2π) × R+ and where σ controls the width of the line pattern.

Dictionaries of crossings. A dictionary of crossings is obtained by considering atoms which contain two overlapping lines

ψ_λ(t) = max( φ_{θ1,δ1,σ}(t), φ_{θ2,δ2,σ}(t) ),  where λ = (θ1, δ1, θ2, δ2).  (17)

Figure 3 shows examples of synthesis for the four dictionaries generated by the sets of functions (14), (15), (16) and (17).
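As an illustration, the line and crossing atoms can be rasterized on a discrete patch grid. The sketch below is an interpretation of the scanned formulas, whose symbols are partly unreadable: we penalize only the coordinate orthogonal to the line direction, so the atom is an elongated ridge rather than an isotropic blob (an assumption on the intended formula).

```python
import numpy as np

def line_atom(tau, theta, delta, sigma):
    """Discrete line atom on a tau x tau patch: Gaussian profile around a
    line of orientation theta, offset by delta (interpretation of eq. (16))."""
    t = np.arange(tau) - tau / 2
    X, Y = np.meshgrid(t, t)
    # signed distance to the line with normal (cos theta, sin theta)
    u = np.cos(theta) * X + np.sin(theta) * Y - delta
    return np.exp(-u**2 / (2 * sigma**2))

def crossing_atom(tau, theta1, delta1, theta2, delta2, sigma):
    """Crossing atom, eq. (17): pointwise max of two line atoms."""
    return np.maximum(line_atom(tau, theta1, delta1, sigma),
                      line_atom(tau, theta2, delta2, sigma))

# A "+"-shaped crossing: two orthogonal lines through the patch center.
atom = crossing_atom(16, 0.0, 0.0, np.pi / 2, 0.0, 1.5)
print(atom.shape)  # (16, 16)
```

Sampling the parameters λ on a grid then yields a structured, parametric dictionary of patch atoms.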

3 Strict Sparsity and Non-local Expansions

Most approaches for texture synthesis in computer graphics [18, 53, 19, 28, 3, 30, 27] copy patches from an original input texture f in order to create a new texture f̃ with similar structures. These processes can be cast into the sparsity framework presented in this paper. This section considers our texture model in a restricted case where one seeks a strict sparsity s = 1 in a highly redundant dictionary.


Dictionary D


y_k = D x_k

[Peyré, 2008]

Page 82: Learning Sparse Representation

Examples of Sparse Synthesis

Synthesized f

Dictionary D, s = 1, s = 2

Edges Oscillations Lines Crossings

Figure 3: Examples of synthesis for two sparsity levels s for the four kinds of dictionaries considered.

3.1 Strict Sparsity Model

Considering the extreme case where s = 1 means that one wants each patch of the synthesized image f̃ to be close to a patch in the original exemplar texture f. Within this assumption, one can consider as a dictionary the set of all the patches extracted from the exemplar

D = (p_{x_i}(f))_{i=0}^{N−1} = Φ(f).  (18)

This dictionary is highly redundant and the synthesis algorithm looks for a perfect match

∀ i,  p_{x_i}(f̃) = λ_i p_{φ(x_i)}(f),  where λ_i ∈ R,  (19)

and the warping function φ : {0, …, √N − 1}² → {0, …, √N − 1}² maps the pixel locations of the synthesized f̃ to the pixel locations of f.

A further simplifying assumption, done frequently in computer graphics, is that λ_i = 1, which leads to the following definition of the mapping φ

∀ x,  φ(x) := argmin_y ||p_x(f̃) − p_y(f)||.  (20)

In this setting, algorithm 2 iterates between the best-fit computation (20) (step 3) and the averaging of the patches (step 4). This is similar to the optimization procedure of Kwatra et al. [27].

One can apply the iterative algorithm described in listing 2 in order to draw a random texture that minimizes E_D. Figure 4 shows the iterations of texture synthesis with this highly redundant dictionary. For these examples, the size of the patches is set to τ = 6 pixels. Figure 6 shows other examples of synthesis and compares the results with texture quilting [19]. Methods based on pixel and region copy like [19] tend to synthesize images very close to the original. Large parts of the input are often copied verbatim in the output, with sometimes periodic repetitions. In contrast, and similarly to [27], our method treats all the pixels equally and often leads to a better layout of the structures, with less global fidelity to the original.


Dictionary D

Page 83: Learning Sparse Representation

Learn D from a single exemplar f0.

Exemplar f0

Learning Sparse Ensemble

Listing 5: Sparse texture synthesis algorithm.
(1) Initialization: set f at random.
(2) Sparse code: for all locations x, compute s_x ← Proj_M(p_x(f)).
(3) Reconstruction: compute the texture f by averaging the patches: f(x) = 1/τ² Σ_{|y − x| ≤ τ/2} (D s_y)(x − y).
(4) Impose constraints: perform the histogram equalization of f with f_e, see [60].
(5) Stop: while not converged, go back to (2).
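The averaging in step (3) can be sketched as follows. This is a minimal illustration, not the presentation's code: the normalization constant in the scanned formula is hard to read, so we normalize by the per-pixel number of contributing patches, which is equivalent when patches are taken at every pixel away from the border.

```python
import numpy as np

def reconstruct_from_patches(patches, coords, shape, tau):
    """Average overlapping tau x tau patch approximations (e.g. D s_y)
    back into an image -- step (3) of the synthesis algorithm."""
    f = np.zeros(shape)
    count = np.zeros(shape)
    for p, (i, j) in zip(patches, coords):
        f[i:i+tau, j:j+tau] += p.reshape(tau, tau)
        count[i:i+tau, j:j+tau] += 1
    # divide by the number of contributions at each pixel
    return f / np.maximum(count, 1)

# Sanity usage: exact patches of an image must average back to it.
rng = np.random.default_rng(2)
g = rng.standard_normal((10, 10))
tau = 3
coords = [(i, j) for i in range(10 - tau + 1) for j in range(10 - tau + 1)]
patches = [g[i:i+tau, j:j+tau].ravel() for i, j in coords]
g2 = reconstruct_from_patches(patches, coords, g.shape, tau)
```

In the full algorithm the input patches are the sparse approximations D s_y rather than exact extractions, so the averaging blends the overlapping approximations.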

The corresponding algorithm is detailed in 5, and is equivalent to the iterative projection algorithm of Peyré [60]. This iterative algorithm can be seen as an extension of classical texture synthesis methods such as [18,61]. These computer graphics approaches use the highly redundant dictionary D = (p_x(f_e))_x of all the patches of the exemplar f_e and enforce a perfect recopy by asking a strict sparsity k = 1.

The texture model captures a compact set of parameters through the dictionary D. This model also shares similarity with statistical approaches to texture synthesis such as [62–64] where some transform-domain randomization is performed. Whereas these approaches use a fixed wavelet transform [62,64] or filters optimized from a fixed library [63], we learn this transform in a non-parametric fashion.

Figure 21 shows examples of texture synthesis for various values of the parameters m and k. Increasing the size of the dictionary allows for a more realistic synthesis, and increasing the redundancy creates more blending between the features.

Original m/n = 1, k = 2 m/n = 1, k = 8 m/n = 2, k = 2

Fig. 21. Examples of texture synthesis for various redundancy m/n and sparsity.

Inverse problems. Figure 22 shows a reconstruction from compressive


Page 84: Learning Sparse Representation

Learn D from a single exemplar f0.

Exemplar f0

Redundancy Q/N

Sparsity

Learning Sparse Ensemble


Figure 10: Iteration of the synthesis process for s = 2.

The redundancy m/n of the dictionary. More redundancy provides more geometric fidelity during the synthesis since patches of the original texture f will be better approximated in D. In contrast, using a small m leads to a compact texture model that compresses the geometric characteristics of the original texture within a few atoms. Such a model allows good generalization performance for tasks such as texture discrimination or classification when the data to process is unknown but close to f.
The sparsity s ≥ 1 of the patch expansion. Increasing the sparsity s is a way to overcome the limitations inherent to a compact dictionary (low redundancy m/n) by providing more complex linear combinations. In contrast, for very redundant dictionaries (such as the non-local expansion presented in section 3) one can even impose that s = 1. Increasing the sparsity also allows blending of features and linear variations in intensity that lead to slow illumination gradients not present in the original texture.

Figure 11 shows the influence of the sparsity parameter. In order to capture features of various sizes, one can perform a progressive synthesis with various sizes of patches τ. This leads to a multiscale synthesis algorithm that follows the one already presented in listing 3. Note that this synthesis algorithm implicitly considers a set D_j of highly redundant dictionaries at various resolutions. Other approaches have been proposed to learn a multiscale dictionary, see for instance [45, 35].

s = 2, s = 4, s = 8

r = 0.2 r = 0.5 r = 1 r = 2 r = 4

Figure 11: Influence of the redundancy r = m/n and sparsity s.


Page 85: Learning Sparse Representation

Exemplar f0

p_z(f) ≈ p_φ(z)(f0)

Patches pz(f0)

Dictionary: all patches pz(f0) = f0(z + ·)

Computer Graphics Approach

Page 86: Learning Sparse Representation

Patch copy: 1-sparsity of the synthesized f .

Mapping φ from f to f0:

Synthesized f, Exemplar f0

p_z(f) ≈ p_φ(z)(f0)

Patches pz(f0)

Dictionary: all patches pz(f0) = f0(z + ·)

Computer Graphics Approach

Page 87: Learning Sparse Representation

Patch copy: 1-sparsity of the synthesized f .

Mapping φ from f to f0:

Synthesized f, Exemplar f0

f0

p_z(f) ≈ p_φ(z)(f0)

Patches pz(f0)

Dictionary: all patches pz(f0) = f0(z + ·)

Computer Graphics Approach

Page 88: Learning Sparse Representation

Texture Inpainting

Page 89: Learning Sparse Representation

Sparse dictionary adaptation:
→ Minimizing MAP on D.
→ Task-driven formulation.

Conclusion

Page 90: Learning Sparse Representation

Sparse dictionary adaptation:
→ Minimizing MAP on D.
→ Task-driven formulation.

Patch-based processing:
→ Connection with computer graphics.
→ Learn on the data to process.

Conclusion

Page 91: Learning Sparse Representation

Sparse dictionary adaptation:
→ Minimizing MAP on D.
→ Task-driven formulation.

Patch-based processing:
→ Connection with computer graphics.
→ Learn on the data to process.

Open problems:
→ Non-convex, slow.
→ Beyond ℓ1: structured sparsity.
→ Theoretical guarantees.

Conclusion