Bilevel Sparse Coding

Jianchao Yang
Adobe Research, 345 Park Ave, San Jose, CA

Mar 15, 2013

Bilevel Sparse Coding - University of Illinois at Urbana-Champaign




Outline

1 Introduction

2 Bilevel Sparse Coding
    The learning model
    The learning algorithm

3 Sparse Modeling Applications
    Single image super-resolution
    Supervised dictionary learning
    Adaptive compressive sensing

4 Conclusion



Sparse Modeling

Many types of sensory data, e.g., images and audio, live in high-dimensional spaces but have low intrinsic dimension.

Sparse representation in some domain: a simple model and an effective prior.

Sparse representation: represent data in the most parsimonious terms,

x = Dz,

where x ∈ R^d, D ∈ R^{d×K}, and ‖z‖₀ ≪ d.

Sparsity: a driving factor for broad applications
    Compressive sensing, low-rank matrices, etc.
    Compression, denoising, deblurring, super-resolution, etc.
    Recognition, subspace clustering, deep learning, etc.



Sparse Coding – Quest for Dictionary

Signals are normally mixtures of diverse phenomena; how can we wisely choose D to perform well on the given signals?

A data-driven solution: train adaptive dictionaries from the given signal instances for sparse representations.

Given training data {x_i}_{i=1}^N, the dictionary learning problem, in its most popular form, can be formulated as

min_{D, {α_i}_{i=1}^N} ∑_{i=1}^N ‖x_i − Dα_i‖₂² + λ‖α_i‖₁, s.t. ‖D(:, j)‖₂ ≤ 1,

where D ∈ R^{d×K} (d < K) is an over-complete dictionary.

Problem: it only cares about the low-level "sparse reconstruction", not the high-level task!
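The alternating scheme commonly used for this problem can be sketched in a few lines of numpy. This is a minimal illustration of our own, not the talk's solver: ISTA solves the ℓ₁ coding step, and a regularized least-squares update followed by column-norm projection handles D.

```python
import numpy as np

def ista(x, D, lam, n_iter=200):
    """Solve min_a ||x - D a||_2^2 + lam * ||a||_1 by iterative soft thresholding."""
    L = 2 * np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - 2 * D.T @ (D @ a - x) / L                       # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)   # soft threshold
    return a

def learn_dictionary(X, K, lam=0.1, n_outer=5, seed=0):
    """Alternate between sparse coding and a dictionary update with ||D(:,k)||_2 <= 1."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = rng.standard_normal((d, K))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = np.stack([ista(X[:, i], D, lam) for i in range(N)], axis=1)  # codes
        # ridge-regularized least-squares update of D given the codes
        D = np.linalg.solve(A @ A.T + 1e-8 * np.eye(K), A @ X.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project columns onto unit ball
    return D
```

The per-sample λ and the number of alternations are placeholder choices; any Lasso solver could replace the ISTA loop.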



Bilevel Sparse Coding – Quest for Dictionary

Many vision and learning tasks can be formulated based on sparse representations:
    Image feature learning
    Image super-resolution
    Compressive sensing
    Image classification, etc.

We relate the low-level dictionary learning to the high-level task naturally with a bilevel formulation.
Goal: learn a more meaningful sparse representation for the given task.
Advantage: the training procedure is fully consistent with the testing objective.



Bilevel optimization

Mathematical programs with optimization problems in the constraints:

min_{x∈X, y} F(x, y)
s.t. G(x, y) ≤ 0,
     y = arg min_y f(x, y), s.t. g(x, y) ≤ 0.

F and f are the upper-level and lower-level objective functions, respectively.
G and g are the upper-level and lower-level constraints, respectively.
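A toy numerical illustration (our own, not from the talk) makes the structure concrete: take upper level F(x, y) = (x − 1)² + y² and lower level y(x) = arg min_y (y − x)², whose solution is y = x. Substituting the lower-level solution reduces the bilevel program to min_x (x − 1)² + x², minimized at x = 0.5; a brute-force search over the upper-level variable recovers it.

```python
import numpy as np

def lower_level(x):
    # arg min_y (y - x)^2, solved here by grid search (closed form: y = x)
    ys = np.linspace(-2.0, 2.0, 4001)
    return ys[int(np.argmin((ys - x) ** 2))]

def upper_level(x, y):
    return (x - 1.0) ** 2 + y ** 2

# Each candidate x is evaluated at the *solution* of the lower-level problem.
xs = np.linspace(-2.0, 2.0, 4001)
vals = [upper_level(x, lower_level(x)) for x in xs]
x_best = xs[int(np.argmin(vals))]
print(x_best)  # close to 0.5
```

Grid search only works in this one-dimensional toy; the slides' point is precisely that for sparse coding the lower-level problem is a nonsmooth ℓ₁ program, so descent directions must come from implicit differentiation instead.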


Bilevel optimization

Simple example: the toll-setting problem on a transportation network.
    The network manager maximizes the revenue raised from tolls.
    Network users minimize their travel costs.

max_{T, f, x} ∑_{a∈A} T_a x_a
s.t. l_a ≤ T_a ≤ u_a, ∀a ∈ A,
     (f, x) ∈ arg min_{f', x'} ∑_{a∈A} c_a x'_a + ∑_{a∈A} T_a x'_a
     s.t. ...




The Learning Model

A generic bilevel learning model:

min_{D, Θ} (1/N) ∑_{i=1}^N L(D, z_i, Θ)
s.t. z_i = arg min_α ‖α‖₁, s.t. ‖x_i − Dα‖₂² ≤ ε, ∀i,
     G(Θ) ≤ 0,
     ‖D(:, k)‖₂ ≤ 1, ∀k.

L is some smooth cost function defined by the specific task.
Θ is the parameter set of a specific model.
{x_i}_{i=1}^N are training samples from the input space X.
May involve more than one feature space.


A Simple Example

Coupled sparse coding: relate two feature spaces by their common sparse representations.

min_{D_x, D_y} (1/N) ∑_{i=1}^N ‖z_i^x − z_i^y‖₂²
s.t. z_i^x = arg min_α ‖α‖₁, s.t. ‖x_i − D_x α‖₂² ≤ ε_x, ∀i,
     z_i^y = arg min_α ‖α‖₁, s.t. ‖y_i − D_y α‖₂² ≤ ε_y, ∀i,
     ‖D_x(:, k)‖₂ ≤ 1, ∀k,
     ‖D_y(:, k)‖₂ ≤ 1, ∀k,

where {x_i, y_i}_{i=1}^N are randomly sampled from the joint space X × Y.


A Difficult Problem

Bilevel optimization: mathematical programs with optimization problems in the constraints.

min_{D, Θ} (1/N) ∑_{i=1}^N L(D, z_i, Θ)
s.t. z_i = arg min_α ‖α‖₁, s.t. ‖x_i − Dα‖₂² ≤ ε, ∀i,
     G(Θ) ≤ 0,
     ‖D(:, k)‖₂ ≤ 1, ∀k.

Optimization over D is a bilevel optimization.
L is the upper-level objective and the ℓ₁-norm minimization is the lower-level optimization.
Highly nonconvex and highly nonlinear.


Descent Method?

Regarding z as an implicit function of D through the lower-level problem, the bilevel program can be viewed solely in terms of the upper-level variable D.

Applying the chain rule, whenever ∇_D z(D) is well defined, we have

∇_D L(D, z(D), Θ) = ∇_D L(D, z, Θ) + ∇_z L(D, z, Θ) ∇_D z(D).

Problem: is the gradient ∇_D z(D) available?

z = arg min_α ‖α‖₁, s.t. ‖x − Dα‖₂² ≤ ε.


Differentiability

Lasso

The ℓ₁-norm minimization problem can be reformulated as the Lasso problem

z = arg min_α ‖x − Dα‖₂² + λ‖α‖₁.

Transition Point (Efron et al. 2004)

For a given response vector x, there is a finite sequence of λ's, λ₀ > λ₁ > ... > λ_K = 0, such that if λ is in the interval (λ_m, λ_{m+1}), the active set Λ = {k : z(k) ≠ 0} and the sign vector sign(z_Λ) are constant with respect to λ.
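Numerically, the Lasso solution can be obtained with a simple iterative soft-thresholding (ISTA) loop. The sketch below is our own illustration, not code from the talk: it solves the problem at two nearby values of λ; if no transition point of x lies between them, the two solutions share the same active set and sign pattern.

```python
import numpy as np

def lasso_ista(x, D, lam, n_iter=5000):
    """z = arg min_a ||x - D a||_2^2 + lam * ||a||_1 via iterative soft thresholding."""
    L = 2 * np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the quadratic term
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - 2 * D.T @ (D @ z - x) / L                       # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(10)

# Two nearby regularization levels: supports agree unless a
# transition point of x lies between 0.30 and 0.31.
z1 = lasso_ista(x, D, 0.30)
z2 = lasso_ista(x, D, 0.31)
print(np.flatnonzero(np.abs(z1) > 1e-6), np.flatnonzero(np.abs(z2) > 1e-6))
```

At the solution, the Lasso optimality conditions hold: the gradient of the quadratic term equals −λ sign(z(k)) on the active set and is bounded by λ in magnitude elsewhere, which is what makes the active set well defined between transition points.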


Differentiability

Theorem
Fix any λ > 0 that is not a transition point for x. Then the active set Λ and the sign vector sign(z_Λ) are locally constant with respect to both x and D.


Differentiability

If λ is not a transition point of x, we have the equiangular conditions

(a) ∂‖x − Dz‖₂² / ∂z(k) + λ sign(z(k)) = 0, for k ∈ Λ,
(b) |∂‖x − Dz‖₂² / ∂z(k)| < λ, for k ∉ Λ.

Applying implicit differentiation to Eqn. (a), we have

∂z_Λ / ∂D_Λ = (D_Λ^T D_Λ)^{-1} ( ∂(D_Λ^T x) / ∂D_Λ − (∂(D_Λ^T D_Λ) / ∂D_Λ) z_Λ ).
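This derivative can be sanity-checked numerically. The sketch below is our own construction, not code from the talk: it solves the Lasso once to fix the active set Λ and sign vector s, uses the closed form z_Λ(D_Λ) = (D_Λ^T D_Λ)^{-1}(D_Λ^T x − (λ/2) s) implied by condition (a), and compares the implicit-differentiation gradient of a toy upper-level loss against central finite differences.

```python
import numpy as np

def lasso_ista(x, D, lam, n_iter=5000):
    L = 2 * np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - 2 * D.T @ (D @ z - x) / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return z

rng = np.random.default_rng(0)
d, K, lam = 8, 15, 0.4
D = rng.standard_normal((d, K))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(d)

z = lasso_ista(x, D, lam)
Lam = np.flatnonzero(np.abs(z) > 1e-6)   # active set
s = np.sign(z[Lam])                      # sign vector, held fixed below

def z_active(D_act):
    # closed form from condition (a), with support and signs fixed
    return np.linalg.solve(D_act.T @ D_act, D_act.T @ x - 0.5 * lam * s)

t = rng.standard_normal(len(Lam))        # toy upper-level target
def loss(D_act):
    return 0.5 * np.sum((z_active(D_act) - t) ** 2)

# analytic gradient via implicit differentiation:
# beta = (D^T D)^{-1} dL/dz;  grad = (x - D z) beta^T - D beta z^T
D_act = D[:, Lam]
zA = z_active(D_act)
beta = np.linalg.solve(D_act.T @ D_act, zA - t)
grad = np.outer(x - D_act @ zA, beta) - D_act @ np.outer(beta, zA)

# central finite-difference check, entry by entry
eps = 1e-6
num = np.zeros_like(grad)
for i in range(d):
    for j in range(len(Lam)):
        E = np.zeros_like(D_act)
        E[i, j] = eps
        num[i, j] = (loss(D_act + E) - loss(D_act - E)) / (2 * eps)
print(np.max(np.abs(grad - num)))  # small discrepancy
```

The factor λ/2 appears because the Lasso here uses ‖x − Dα‖₂² without a 1/2; the quadratic loss and target t are placeholders standing in for the task-specific upper-level objective L.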


Differentiability

Let Ω denote the nonactive set. We observe that:

As z_Λ is connected only with D_Λ, a perturbation on D_Ω will not change its value. Therefore, we have

∂z_Λ / ∂D_Ω = 0. (1)

As Λ and sign(z_Λ) are constant for a small perturbation of D, z_Ω stays zero, so we have

∂z_Ω / ∂D = 0. (2)

Therefore, the nonzero part of ∇_D z(D) is defined by ∂z_Λ / ∂D_Λ.


Stochastic Gradient Descent

Given ∇_D z(D), ∇_D L can be evaluated. Applying stochastic gradient descent, we have

D^{n+1} = D^n − r_n (∂L_n/∂D) / ‖∂L_n/∂D‖₂,
r_n = r₀ / (n/N + 1)^p,

where p controls the shrinkage rate of the step size.

Project the updated dictionary onto the unit ball.

The complete optimization procedure alternately optimizes over D and Θ.
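One update of this procedure is compact to write down. The sketch below is illustrative: r₀ = 1 and p = 0.5 are placeholder values, the gradient is a stand-in, and the normalization uses the Frobenius norm.

```python
import numpy as np

def sgd_step(D, grad, n, N, r0=1.0, p=0.5):
    """One normalized SGD step on D, then project each column onto the unit ball."""
    r_n = r0 / (n / N + 1.0) ** p                  # shrinking step size r_n
    D = D - r_n * grad / np.linalg.norm(grad)      # normalized gradient step
    col_norms = np.linalg.norm(D, axis=0)          # enforce ||D(:, k)||_2 <= 1
    return D / np.maximum(col_norms, 1.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D = sgd_step(D, rng.standard_normal((16, 32)), n=1, N=100)
print(np.max(np.linalg.norm(D, axis=0)))  # <= 1 after projection
```

Columns already inside the unit ball are left untouched by the projection; only columns with norm above one are rescaled.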


Single Frame Super-resolution

Problem: given a single low-resolution input and a set of pairs (high- and low-resolution) of training patches sampled from similar images, reconstruct a high-resolution version of the input.

Applications:
    Photo zooming (e.g., Photoshop, Genuine Fractal)
    Photo printing
    Video standard conversion, etc.

Difficulty: single-image super-resolution is an extremely ill-posed problem.


Super-resolution via Sparse Recovery

High-resolution patches have sparse representations in terms of some over-complete dictionary:

x = D_h z₀,

where x ∈ R^m, D_h ∈ R^{m×K}, and ‖z₀‖₀ ≪ m.

We do not observe the high-resolution patch x, but its low-resolution version y ∈ R^n:

y = Lx = L D_h z₀ = D_l z₀.

L is the sampling matrix (blurring and downsampling).
y gives n linear measurements of the sparse coefficients z₀.

Sparse recovery? If we can obtain z₀ from y = D_l z₀ (an underdetermined linear system), we can recover x as D_h z₀.


Super-resolution via Sparse Recovery

Assume we have the coupled dictionaries D_h and D_l.

Input: low-resolution image Y.

Find the sparse solution for each patch y_p of Y by

z₀ = arg min_z ‖D_l z − y_p‖₂² + λ‖z‖₁.

Recover the corresponding high-resolution image patch as x_p = D_h z₀.

How to train D_l and D_h for good recovery?
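The test-time pipeline is then a Lasso solve per low-resolution patch followed by high-resolution synthesis. A minimal sketch of our own, with stand-in dictionaries; a real system would also subtract patch means and average overlapping patches.

```python
import numpy as np

def lasso_ista(y, D, lam, n_iter=2000):
    """Sparse code of y in dictionary D via iterative soft thresholding."""
    L = 2 * np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z - 2 * D.T @ (D @ z - y) / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return z

def super_resolve_patches(Y_patches, D_l, D_h, lam=0.1):
    """For each low-res patch y_p: z0 = sparse code of y_p under D_l; emit D_h z0."""
    out = []
    for y_p in Y_patches:
        z0 = lasso_ista(y_p, D_l, lam)   # code in the low-resolution dictionary
        out.append(D_h @ z0)             # synthesize with the high-resolution dictionary
    return np.stack(out)
```

The key point the next slides address is precisely where D_l and D_h come from: the coding step only ever sees D_l, so the two dictionaries must be trained so that codes of y recover x.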


Joint Dictionary Training – Previous Approach

Our previous solution:
Randomly sample high- and low-resolution image patch pairs {x_i, y_i}_{i=1}^N from the training data.
Learn D_h and D_l jointly:

min_{D_h, D_l, {z_i}} ∑_{i=1}^N ‖x_i − D_h z_i‖₂² + ‖y_i − D_l z_i‖₂² + λ‖z_i‖₁,
s.t. ‖D_h(:, k)‖₂ ≤ 1, ‖D_l(:, k)‖₂ ≤ 1.

However, ...


Joint Dictionary Training – Problem

In training, we have

min_{D_h, D_l, {z_i}} ∑_{i=1}^N ‖x_i − D_h z_i‖₂² + ‖y_i − D_l z_i‖₂² + λ‖z_i‖₁.

In testing, we only have the low-resolution patch y_i, so the term ‖x_i − D_h z_i‖₂² drops out:

min_{z_i} ‖y_i − D_l z_i‖₂² + λ‖z_i‖₁,

and therefore good reconstruction of x_i is not guaranteed.


Bilevel Formulation

Goal: learn D_h and D_l such that the sparse representation z of y in terms of D_l can well reconstruct x with D_h.

Given high- and low-resolution training patch pairs {x_i, y_i}_{i=1}^N, the learning model is formulated as

min_{D_h, D_l} (1/N) ∑_{i=1}^N ‖D_h z_i − x_i‖₂²
s.t. z_i = arg min_α ‖α‖₁, s.t. ‖y_i − D_l α‖₂² ≤ ε,
     ‖D_l(:, k)‖₂ ≤ 1, ∀k,
     ‖D_h(:, k)‖₂ ≤ 1, ∀k.

The training process is completely consistent with testing.


Results

Setting: 100,000 high- and low-resolution 5×5 image patch pairs are sampled for training and 100,000 for testing. D_h and D_l are initialized from joint dictionary training. The learning algorithm converges in 5 iterations.

Pixel-wise MSE reduction compared with joint dictionary training:

         1        2        3        4        5
1    21.61%   19.60%   21.89%   18.91%   20.55%
2    17.43%   15.75%   17.92%   15.69%   14.70%
3    17.15%   16.96%   19.95%   17.57%   15.99%
4    16.41%   17.78%   18.30%   16.80%   15.82%
5    20.48%   14.68%   15.52%   14.64%   20.51%


SR Results

Visual comparison: top: joint dictionary training; bottom: bilevel sparse coding.


Practical Implementation

Learn fast sparse coding approximations with a neural network.
Selective patch processing.
Takes 5s to upscale an image from 200×200 to 800×800 on a single 3 GHz core with 4 GB RAM.
One of the fastest SR algorithms.

[Figures: Input, Bicubic, Ours]


Feature Representation by Pooling Sparse Codes

Fig. The image feature extraction diagram.


Feature Representation by Pooling Sparse Codes

A simple two-layer network.

Coding: VQ, soft assignment, LLC, sparse coding, linear filtering.

Pooling: average, energy, max, log, ℓ_p.

Works well on diverse recognition benchmarks: object, scene, action, face, digit, gender, expression, age estimation, and so on.

Key component of the winning system for PASCAL09 on image recognition.

[Figure: image feature extraction diagram]


The Feature Extraction Algorithm

1 Represent image X as sets of local descriptors in a spatial pyramid:

X = [Y^0_{11}, Y^1_{11}, Y^1_{12}, ..., Y^2_{44}].

2 Given dictionary D, encode the local descriptors into sparse codes by

Z^s_{ij} = arg min_A ‖Y^s_{ij} − DA‖₂² + λ‖A‖₁,

and we obtain

S = [Z^0_{11}, Z^1_{11}, Z^1_{12}, ..., Z^2_{44}].

3 Max-pool over each set of sparse codes and concatenate:

β = ⋃_{s=0}^{2} ⋃_{i,j=1}^{2^s} [β^s_{ij}], where β^s_{ij} = max(|Z^s_{ij}|).
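The three steps map directly to code. The sketch below is our own illustration: descriptor codes and locations are random stand-ins, and the pyramid uses grids of 1×1, 2×2, and 4×4 cells over the unit square.

```python
import numpy as np

def pyramid_max_pool(codes, locs, levels=(1, 2, 4)):
    """codes: (n, K) sparse codes; locs: (n, 2) descriptor positions in [0, 1)^2.
    Max-pool |codes| within each cell of each pyramid level and concatenate."""
    n, K = codes.shape
    feats = []
    for g in levels:                               # g x g grid at this level
        cell = np.minimum((locs * g).astype(int), g - 1)
        for i in range(g):
            for j in range(g):
                mask = (cell[:, 0] == i) & (cell[:, 1] == j)
                pooled = np.max(np.abs(codes[mask]), axis=0) if mask.any() else np.zeros(K)
                feats.append(pooled)
    return np.concatenate(feats)                   # length K * (1 + 4 + 16)

rng = np.random.default_rng(0)
codes = rng.standard_normal((100, 8))   # stand-in sparse codes, K = 8
locs = rng.random((100, 2))
beta = pyramid_max_pool(codes, locs)
print(beta.shape)  # (168,)
```

The first K entries come from the 1×1 level and equal the element-wise max of |Z| over all descriptors; finer levels append per-cell maxima, preserving coarse spatial layout.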


Unsupervised Dictionary Learning

Randomly sample a set of local descriptors {x_i}_{i=1}^N from the training set, and use existing sparse coding techniques to learn a dictionary D that can sparsely represent the data:

min_{D, {α_i}_{i=1}^N} ∑_{i=1}^N ‖x_i − Dα_i‖₂² + λ‖α_i‖₁,
s.t. ‖D(:, k)‖₂ ≤ 1.

Optimization is performed in an alternating fashion: fix D and optimize {α_i}_{i=1}^N; then fix {α_i}_{i=1}^N and optimize D.


Supervised Dictionary Learning

Unsupervised dictionary learning is good for reconstruction, but not necessarily effective for classification.

Training data with image labels: {(X_i, y_i)}_{i=1}^N.

Train the dictionary together with the classifier:

min_{D, w} ∑_{i=1}^N ℓ(y_i, f(β_i, w)) + γ‖w‖₂²,
s.t. β_i = pooling(Z_i),
     Z_i = arg min_A ‖X_i − DA‖₂² + λ‖A‖₁,
     ‖D(:, k)‖₂ ≤ 1, ∀k,

where ℓ(·) is a loss function and f(·, w) is the linear prediction model.

Optimization over w is training the classifier.
Optimization over D is a bilevel program.


Face recognition

CMU Multi-PIE Database: the dataset contains 337 subjects across simultaneous variations in pose, expression, and illumination. We use session 1 for training and sessions 2–4 for testing. The dataset is challenging due to the large number of subjects and the natural variations in subject appearance over time.


Face recognition error (%) on large-scale Multi-PIE.

Error Rate (%)   Session 2   Session 3   Session 4
LDA              50.6        55.7        52.1
NN               32.7        33.8        37.2
NS               22.4        25.7        26.6
SR               8.6         9.7         9.8
U-SC             5.4         9.0         7.5
S-SC             4.8         6.6         4.9
Improvement      11.1%       26.7%       34.7%


Gender Recognition

FRGC 2.0: the dataset contains 568 individuals and 14,714 face images in total, under various lighting conditions and backgrounds. 11,700 images from 451 randomly chosen individuals serve as the training set, and 3,014 images from the remaining 114 individuals form the testing set.

Classification Error (%)

Algorithm     SVM (RBF)   CNN   U-SC   S-SC   Improvement
Error Rate    8.6         5.9   6.8    5.3    22.1%


Hand Written Digit Recognition

MNIST: the dataset consists of 70,000 handwritten digits, of which 60,000 are selected for training and the remaining 10,000 for testing.

Algorithm                  Error Rate
SVM (RBF)                  1.41
L1 sparse coding           2.02
Local coordinate coding    1.90
Deep Belief Network        1.20
CNN                        0.82
U-SC                       0.98
S-SC                       0.84
Improvement                14.3%


Bilevel Sparse Coding: Outline

1 Introduction
2 Bilevel Sparse Coding
    The learning model
    The learning algorithm
3 Sparse Modeling Applications
    Single image super-resolution
    Supervised dictionary learning
    Adaptive compressive sensing
4 Conclusion


Formulation

Let x be the original signal, Φ the sampling matrix, and y = Φx the linear measurements. Compressive sensing recovery is done by

z = arg min_α ‖α‖₁,  s.t. y = ΦD_x α,
x = D_x z.

D_x is important for the recovery quality. With the training samples {x_i}_{i=1}^N, learn D_x by directly minimizing the compressive sensing recovery error:

min_{D_x}  (1/N) Σ_{i=1}^N ‖x_i − D_x z_i‖₂²
s.t.  y_i = Φx_i,  D_y = ΦD_x,
      z_i = arg min_α ‖α‖₁,  s.t. ‖y_i − D_y α‖₂² ≤ ε.
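The recovery step can be sketched as follows. As an assumption, the constrained ℓ₁ program is replaced by its Lagrangian (lasso) form and solved with ISTA; `lam` and the solver are illustrative choices, not part of the original formulation.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cs_recover(y, Phi, Dx, lam=0.05, n_iter=300):
    """Recover x from measurements y = Phi x by solving the lasso
    min_z ||y - D_y z||_2^2 + lam ||z||_1 with D_y = Phi D_x,
    then reconstructing x_hat = D_x z."""
    Dy = Phi @ Dx                          # measurement-domain dictionary
    L = np.linalg.norm(Dy, 2) ** 2         # step-size scale for ISTA
    z = np.zeros(Dx.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z - Dy.T @ (Dy @ z - y) / L, lam / (2 * L))
    return Dx @ z                          # x_hat = D_x z
```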


CS Results

Settings: 10,000 image patches of 16×16 are randomly sampled from medical images for training, and 5,000 for testing. The Haar wavelet basis is used as our baseline and initialization. A Bernoulli random matrix is used as the sampling matrix.

[Figure: Objective value vs. iteration number for 10% sampling rate.]

[Figure: Recovery accuracy comparison on the test image patches in PSNR, learned dictionary vs. Haar wavelet, for sampling rates 0.10–0.30.]


CS Results

Image recovery on the “bone” image with 20% measurements:

[Figure: Ground truth | Wavelet (22.8 dB) | Ours (27.6 dB)]


Conclusion

Learning a meaningful representation is critical for many applications.
Many sparse coding based applications can be formulated as a bilevel program.
Bilevel programs are extremely useful in many hierarchical models.
More applications in computer vision and machine learning? E.g., model selection.
