

CVPR 2010 Submission #1726. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Hierarchical Convolutional Sparse Image Decomposition

Anonymous CVPR submission

Paper ID 1726

Abstract

Building robust low-level image representations, beyond edge primitives, is a long-standing goal in vision. Many existing feature detectors spatially pool edge information, which destroys cues such as edge intersections, parallelism and symmetry. We present a learning framework where features that capture these mid-level cues spontaneously emerge from image data. Our approach is based on the convolutional decomposition of images under a sparsity constraint and is totally unsupervised. By building a hierarchy of such decompositions we can learn rich feature sets that are a robust image representation for both the analysis and synthesis of images.

1. Introduction

In its most basic form, an image is a matrix of intensities. How we should progress from this matrix to stable mid-level representations, useful for high-level vision tasks, remains unclear. Popular feature representations such as SIFT or HOG spatially pool edge information to form descriptors that are invariant to local transformations. However, in doing so important cues such as edge intersections, grouping, parallelism and symmetry are lost.

The primal sketch scheme of Marr [17] proposed that image representations should be built in stages, starting with image primitives such as edges. These would then be grouped using geometric constraints to form “tokens” and thence to larger-scale tokens that capture the high-level content of images, such as objects. While this idea has a certain elegance, the difficulties lie in its instantiation: What should the tokens be? How should the grouping occur? Which geometric constraints are important and which can be ignored? Robust schemes that address these issues have proved difficult to formulate.

In this paper we propose a simple framework that permits the automatic construction of hierarchical image representations which embody the key elements of Marr’s primal sketch. Each level of the hierarchy groups information from the level beneath to form more complex tokens that exist over a larger scale in the image. Our grouping cue is sparsity: by encouraging parsimonious representations at each level of the hierarchy, tokens will naturally group into more complex structures. However, as we demonstrate, sparsity itself is not enough – it must be deployed within the correct architecture to have the desired effect. We adopt a convolutional approach since it provides stable latent representations at each level which preserve locality and thus facilitate the grouping mechanism.

Figure 1. (a): “Tokens” from Fig. 2-4 of Vision by D. Marr [17]. These idealized local groupings are proposed as an intermediate level of representation in Marr’s primal sketch theory. (b): Selected filters from the 3rd layer of our hierarchical model, trained in an unsupervised fashion on real-world images using convolutional sparse image decomposition.

A key feature of our approach is that it is entirely unsupervised. At each level we solve a sparse, over-complete decomposition of the latent representation from the layer beneath. Using the same parameters for learning each layer, we can automatically build rich features that correspond to concepts such as edge junctions, parallel lines, curves and basic geometric elements, such as squares. Remarkably, some of them look very similar to the tokens posited by Marr in his description of the primal sketch (see Fig. 1). A technical contribution of our paper concerns the convolutional sparse decomposition. This is a challenging optimization problem that must be solved effectively if useful features are to spontaneously emerge. Standard techniques perform poorly and we propose an alternative scheme that works well in practice.

2. Related Work

Our work has links to recent work in sparse image decompositions, as well as hierarchical representations. Lee et al. [13] and Mairal et al. [15, 16] have proposed efficient schemes for learning sparse over-complete decompositions of image patches [18], using a convex l1 sparsity term. Our approach differs in that we perform sparse decomposition over the whole image at once, not just for small image patches. As demonstrated by our experiments, this is vital if rich features are to be learned. The key to making this work efficiently is to use a convolutional approach.

A range of hierarchical image models have been proposed. Particularly relevant is the work of Zhu and colleagues [26, 23], in particular Guo et al. [7]. Here, edges are composed using a hand-crafted set of image tokens into large-scale image structures. Grouping is performed via basis pursuit with intricate splitting and merging operations on image edges. The stochastic image grammars of Zhu and Mumford [26] also use fixed image primitives, as well as a complex Markov Chain Monte-Carlo (MCMC)-based scheme to parse scenes. Our work differs in two important ways: first, we learn our image tokens completely automatically. Second, our inference scheme is far simpler than either of the above frameworks.

Zhu et al. [25] propose a top-down parts-and-structure model, but it only reasons about image edges, as provided by a standard edge detector, unlike ours which operates directly on pixels. The biologically inspired HMax model of Serre et al. [22, 21] uses exemplar templates in its intermediate representations, rather than learning conjunctions of edges as we do. Fidler and Leonardis [5, 4] propose a top-down model for object recognition which has an explicit notion of parts whose correspondence is explicitly reasoned about at each level. In contrast, our approach simply performs a low-level deconvolution operation at each level, rather than attempting to solve a correspondence problem. Amit and Geman [1] and Jin and Geman [10] apply hierarchical models to deformed LaTeX digits and car license plate recognition.

In the machine learning community, “deep learning” methods for learning feature hierarchies have recently become popular [2, 8]. Deep Belief Networks (DBNs) [8] and hierarchies of sparse auto-encoders [20, 9], like our approach, greedily construct layers from the image upwards in an unsupervised fashion. A key difference with respect to our work is that both these methods have top-down (decoder) and bottom-up (encoder) mechanisms within a single layer during learning. As these models have been motivated by the goal of improving high-level tasks like recognition, an encoder is needed to provide a fast way to compute the latent representation at test time. However, during training the latent representation produced by performing top-down inference with the decoder is constrained to be close to the output of the encoder. Since the encoders are typically simple non-linear functions, they have the potential to significantly restrict the latent representation obtainable. Restricted Boltzmann Machines, the basic module of DBNs, have the additional constraint that the encoder and decoder must share weights. Most deep learning architectures are not convolutional, but recent work by Lee et al. [14] demonstrated a convolutional RBM architecture that learns high-level image features for recognition.

In our approach, since we are concerned with image representation rather than recognition, we have no bottom-up encoder at each layer. Hence, in learning we have the single goal of finding the optimal latent representation using the top-down decoder, unencumbered by the need to also make it similar to the output of an encoder. As such, our approach can be considered a hierarchical stack of decoders. Applying the model requires us to perform inference to reliably find the optimal code at each layer.

3. Model

We first consider a single layer of sparse convolutional decomposition applied to an image. This layer takes as input an image y^i, composed of K_0 color channels y^i_1, \ldots, y^i_{K_0}. We represent each channel c of this image as a linear sum of K_1 latent feature maps z^i_k convolved with filters f_{k,c}:

\sum_{k=1}^{K_1} z^i_k \oplus f_{k,c} = y^i_c \qquad (1)

Henceforth, unless otherwise stated, symbols correspond to matrices. If y^i_c is an N_r \times N_c image and the filters are H \times H, then the latent feature maps are (N_r + H - 1) \times (N_c + H - 1) in size. But Eqn. 1 is an under-determined system, so to yield a unique solution we introduce a regularization term on z^i_k that encourages sparsity in the latent feature maps. This gives us an overall cost function of the form:

C_1(y^i) = \lambda \sum_{c=1}^{K_0} \Big\| \sum_{k=1}^{K_1} z^i_k \oplus f_{k,c} - y^i_c \Big\|_2^2 + \sum_{k=1}^{K_1} |z^i_k|_p \qquad (2)

where we assume Gaussian noise on the convolutional decomposition and some sparse norm p for the regularization. Note that the sparse norm |w|_p is actually the p-norm on the vectorized version of matrix w, i.e. |w|_p = \sum_{i,j} |w(i,j)|^p. Typically, p = 1, although other values are possible, as described in Section 3.2. \lambda is a constant that balances the relative contributions of the reconstruction of y^i and the sparsity of the feature maps z^i_k.

Note that our model is top-down in nature: given the latent feature maps, we can synthesize an image. But unlike the sparse auto-encoder approach of Ranzato et al. [19], or DBNs [8], there is no mechanism for generating the feature maps from the input, apart from minimizing the cost function C_1 in Eqn. 2. Many approaches focus on bottom-up inference, but we concentrate on obtaining high quality latent representations.
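The single-layer decomposition of Eqns. 1-2 can be sketched in a few lines of Python. This is an illustration of the cost function, not the authors' code: scipy's `convolve2d` stands in for the \oplus operator, and `mode="valid"` maps the oversized latent maps back to image size, matching the dimensions given in the text. All names below are ours.

```python
import numpy as np
from scipy.signal import convolve2d

def reconstruct(z, f):
    """Sum_k z_k (+) f_{k,c} for one channel c (Eqn. 1).

    z: (K1, Nr+H-1, Nc+H-1) latent feature maps
    f: (K1, H, H) filters for this channel
    Returns the Nr x Nc reconstruction of channel c.
    """
    # 'valid' convolution of an oversized map with an HxH filter
    # yields an Nr x Nc image, as stated in the text.
    return sum(convolve2d(zk, fk, mode="valid") for zk, fk in zip(z, f))

def cost_C1(y, z, f, lam=1.0, p=1):
    """Single-layer cost of Eqn. 2 for a one-channel image y."""
    err = reconstruct(z, f) - y
    sparsity = np.sum(np.abs(z) ** p)  # |z|_p on the vectorized maps
    return lam * np.sum(err ** 2) + sparsity
```

For a multi-channel image the squared-error term would simply be summed over channels, with one filter bank per channel, as in Eqn. 2.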


Figure 2. A single layer of convolutional sparse decomposition (best viewed in color). Layer l takes the K_{l-1} feature maps of layer l-1 as input and produces K_l feature maps. For clarity, only the connectivity for a single input map is shown. In practice the first layer is fully connected, while the connectivity of the higher layers is specified by the map g^l, which is sparse.

In learning, described in Section 3.2, we use a set of images y = \{y^1, \ldots, y^I\} for which we seek \arg\min_{f,z} C_1(y)^1, i.e. the latent feature maps for each image and the filters. Note that each image has its own set of feature maps, while the filters are common to all images. Furthermore, each feature map is connected to all color planes.

3.1. Forming a Hierarchy

The architecture described above produces sparse feature maps from a multi-channel input image. It can easily be stacked to form a hierarchy by treating the feature maps z^i_{k,l} of layer l as input for layer l+1. In other words, layer l has as its input an image with K_{l-1} channels, the number of feature maps at layer l-1. The cost function C_l for layer l is a generalization of Eqn. 2:

C_l(y) = \lambda \sum_{i=1}^{I} \sum_{c=1}^{K_{l-1}} \Big\| \sum_{k=1}^{K_l} g^l_{k,c} (z^i_{k,l} \oplus f^l_{k,c}) - z^i_{c,l-1} \Big\|_2^2 + \sum_{i=1}^{I} \sum_{k=1}^{K_l} |z^i_{k,l}|_p \qquad (3)

where z^i_{c,l-1} are the feature maps from the previous layer, and g^l_{k,c} are elements of a fixed binary matrix that determines the connectivity between the feature maps at successive layers, i.e. whether z^i_{k,l} is connected to z^i_{c,l-1} or not [12]. In the first layer we assume that g^1_{k,c} is always 1, but in higher layers it will be sparse. We train the hierarchy from the bottom upwards, thus z^i_{c,l-1} is given from the results of learning on C_{l-1}(y). This structure is illustrated in Fig. 2.

Unlike several other hierarchical models [14, 19, 9] we do not perform any pooling, sub-sampling or divisive normalization operations between layers, although these could easily be incorporated.

^1 C_1(y) = \sum_i C_1(y^i)

3.2. Learning Filters

Learning the filters requires learning the feature maps as well. To achieve this, we alternately minimize C_l(y) over the feature maps while keeping the filters fixed, and then minimize C_l(y) over the filters while keeping the feature maps fixed. This minimization is done in a layer-wise manner, starting with the first layer, where the inputs are the training images y. Details are given in Algorithm 1. We now describe how we learn the feature maps and filters by introducing a framework suited to large-scale problems.

Inferring feature maps: Inferring the optimal feature maps z^i_{k,l}, given the filters and inputs, is the crux of our approach. The sparsity constraint on z^i_{k,l} prevents the model from learning trivial solutions such as the identity function. When p = 1 the minimization problem for the feature maps is convex and a wide range of techniques have been proposed [3, 13]. Although in theory the global minimum can always be found, in practice this is difficult as the problem is very poorly conditioned. This is due to the fact that elements in the feature maps are coupled to one another through the filters. One element in a map can be affected by another distant element, meaning that the minimization can take a very long time to converge to a good solution.

We tried a range of different minimization approaches to solve Eqn. 3, including direct gradient descent, Iterative Reweighted Least Squares (IRLS) and stochastic gradient descent. We found that direct gradient descent suffers from the usual problem of flat-lining and thereby gives a poor solution. IRLS is too slow for large-scale problems with many input images. Stochastic gradient descent was found to require many thousands of iterations for convergence.

Instead, we introduce a more general framework that is suitable for any value of p > 0, including pseudo-norms where p < 1. The approach is a type of continuation method, as used by Geman [6] and Wang et al. [24]. Instead of optimizing Eqn. 3 directly, we minimize an auxiliary cost function \hat{C}_l(y) which incorporates auxiliary variables x^i_{k,l}, one for each element in the feature maps z^i_{k,l}:

\hat{C}_l(y) = \lambda \sum_{i=1}^{I} \sum_{c=1}^{K_{l-1}} \Big\| \sum_{k=1}^{K_l} g^l_{k,c} (z^i_{k,l} \oplus f^l_{k,c}) - z^i_{c,l-1} \Big\|_2^2 + \beta \sum_{i=1}^{I} \sum_{k=1}^{K_l} \| z^i_{k,l} - x^i_{k,l} \|_2^2 + \sum_{i=1}^{I} \sum_{k=1}^{K_l} |x^i_{k,l}|_p \qquad (4)

where \beta is a continuation parameter. Introducing the auxiliary variables separates the convolution part of the cost function from the |\cdot|_p term. By doing so, an alternating form of minimization for z^i_{k,l} can now be used. We first fix x^i_{k,l}, yielding a quadratic problem in z^i_{k,l}. Then, we fix z^i_{k,l} and solve a separable 1D problem for each element in x^i_{k,l}. We call these two stages the z and x sub-problems respectively. As we alternate between these two steps, we slowly increase \beta from a small initial value until it strongly clamps z^i_{k,l} to x^i_{k,l}. This has the effect of gradually introducing the sparsity constraint into the problem and gives good numerical stability in practice [11, 24]. We now consider each of the sub-problems.

z sub-problem: From Eqn. 4, we see that we can solve for each z^i_{k,l} independently of the others. Here we take derivatives of \hat{C}_l(y) w.r.t. z^i_{k,l}, assuming a fixed x^i_{k,l}:

\frac{\partial \hat{C}_l(y)}{\partial z^i_{k,l}} = 2\lambda \sum_{c=1}^{K_{l-1}} F^{lT}_{k,c} \Big( \sum_{\bar{k}=1}^{K_l} F^l_{\bar{k},c} z^i_{\bar{k},l} - z^i_{c,l-1} \Big) + 2\beta (z^i_{k,l} - x^i_{k,l}) \qquad (5)

where if g^l_{k,c} = 1, F^l_{k,c} is a sparse convolution matrix^2 equivalent to convolving with f^l_{k,c}, and is zero if g^l_{k,c} = 0. Although a variety of other sparse decomposition techniques [15, 19] use stochastic gradient descent methods to update z^i_{k,l} for each i, k separately, this is not viable in a convolutional setting. Here, the various feature maps compete with each other to explain local structure in the most compact way. This requires us to simultaneously optimize over all z^i_{k,l}'s for a fixed i and varying k. For a fixed i, setting \partial \hat{C}_l(y) / \partial z^i_{k,l} = 0 \; \forall k, the optimal z^i_{k,l} are the solution to the following K_l (N_r + H - 1)(N_c + H - 1)-dimensional linear system:

A \begin{bmatrix} z^i_{1,l} \\ \vdots \\ z^i_{K_l,l} \end{bmatrix} = \begin{bmatrix} \sum_{c=1}^{K_{l-1}} F^{lT}_{1,c} z^i_{c,l-1} + \frac{\beta}{\lambda} x^i_{1,l} \\ \vdots \\ \sum_{c=1}^{K_{l-1}} F^{lT}_{K_l,c} z^i_{c,l-1} + \frac{\beta}{\lambda} x^i_{K_l,l} \end{bmatrix} \qquad (6)

where

A = \begin{bmatrix} \sum_{c=1}^{K_{l-1}} F^{lT}_{1,c} F^l_{1,c} + \frac{\beta}{\lambda} I & \cdots & \sum_{c=1}^{K_{l-1}} F^{lT}_{1,c} F^l_{K_l,c} \\ \vdots & \ddots & \vdots \\ \sum_{c=1}^{K_{l-1}} F^{lT}_{K_l,c} F^l_{1,c} & \cdots & \sum_{c=1}^{K_{l-1}} F^{lT}_{K_l,c} F^l_{K_l,c} + \frac{\beta}{\lambda} I \end{bmatrix} \qquad (7)

In the above equation, x^i_{1,l}, \ldots, x^i_{K_l,l}, z^i_{c,l-1} and z^i_{1,l}, \ldots, z^i_{K_l,l} are in vectorized form. Eqn. 6 can be effectively minimized by conjugate gradient (CG) descent. Note that A never needs to be formed, since the Az product can be computed directly using convolution operations inside the CG iteration. Each Az product requires 2 K_{l-1} K_l convolutions of filters with the (N_r + H - 1) \times (N_c + H - 1) feature maps and can easily be parallelized.

Although some speed-up might be gained by using FFTs in place of spatial convolutions, particularly if the filter size H is large, this can introduce boundary effects in the feature maps z^i_{k,l} – therefore solving in the spatial domain is preferred.
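A minimal sketch of the z sub-problem for the special case of a single feature map and a single channel (K_l = K_{l-1} = 1): it solves (F^T F + (\beta/\lambda) I) z = F^T y + (\beta/\lambda) x with scipy's CG solver, never forming A explicitly. F and F^T are applied matrix-free as a 'valid' convolution and a 'full' convolution with the flipped filter, following footnote 2. Function and variable names are ours, not the paper's.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.sparse.linalg import LinearOperator, cg

def solve_z(y, x, f, lam=1.0, beta=1e-2):
    """Solve the z sub-problem (Eqn. 6) for one map and one channel."""
    H = f.shape[0]
    zshape = (y.shape[0] + H - 1, y.shape[1] + H - 1)
    n = zshape[0] * zshape[1]
    frev = f[::-1, ::-1]  # flipud(fliplr(f)), cf. footnote 2

    def F(z):   # F z: 'valid' convolution, oversized map -> image size
        return convolve2d(z, f, mode="valid")

    def Ft(r):  # F^T r: 'full' convolution with the flipped filter
        return convolve2d(r, frev, mode="full")

    def Az(zvec):  # (F^T F + (beta/lam) I) z, computed matrix-free
        z = zvec.reshape(zshape)
        return (Ft(F(z)) + (beta / lam) * z).ravel()

    A = LinearOperator((n, n), matvec=Az)
    b = (Ft(y) + (beta / lam) * x).ravel()
    zvec, info = cg(A, b, maxiter=500)  # A is symmetric positive definite
    return zvec.reshape(zshape)
```

With several maps and channels, the matvec would loop over the blocks of Eqn. 7, reusing the same two convolution primitives.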

x sub-problem: Given fixed z^i_{k,l}, finding the optimal x^i_{k,l} requires solving a 1D optimization problem for each element in the feature map. If p = 1 then, following Wang et al. [24], x^i_{k,l} has a closed-form solution given by:

x^i_{k,l} = \max\Big( |z^i_{k,l}| - \frac{1}{\beta},\, 0 \Big) \frac{z^i_{k,l}}{|z^i_{k,l}|} \qquad (8)

where all operations are element-wise. Alternatively, for arbitrary values of p > 0, the optimal solution can be computed via a lookup-table [11]. This permits us to impose more aggressive forms of sparsity than p = 1.

^2 F^l_{k,c} z^i_{k,l} \equiv z^i_{k,l} \oplus f^l_{k,c} and F^{lT}_{k,c} z^i_{k,l} \equiv z^i_{k,l} \oplus flipud(fliplr(f^l_{k,c})), using Matlab notation.

Updating the filters: Updating the f's given fixed x requires the computation of:

\frac{\partial \hat{C}_l(y)}{\partial f^l_{k,c}} = 2\lambda \sum_{i=1}^{I} Z^{iT}_{k,l} \Big( \sum_{\bar{k}=1}^{K_l} g^l_{\bar{k},c} (Z^i_{\bar{k},l} f^l_{\bar{k},c}) - z^i_{c,l-1} \Big) \qquad (9)

which we then use to perform a gradient update of f^l_{k,c} after computing z^i_{k,l} for a fixed i. Z is a convolution matrix similar to F. The learning procedure is summarized in Algorithm 1.
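Specialized to one map and one channel, the matrix product Z^T r of Eqn. 9 reduces to a 'valid' convolution of the flipped feature map with the residual, giving an H x H gradient for the filter. A sketch under that simplifying assumption (our naming; the flip aligns correlation indices with the convolution in the forward model):

```python
import numpy as np
from scipy.signal import convolve2d

def filter_grad(z, f, y, lam=1.0):
    """Gradient of lam * ||z (+) f - y||^2 w.r.t. the H x H filter f
    (Eqn. 9 specialized to a single map and a single channel).
    """
    r = convolve2d(z, f, mode="valid") - y  # residual, Nr x Nc
    # d/df of the squared error: 'valid' convolution of the flipped
    # feature map with the residual yields an H x H array.
    return 2.0 * lam * convolve2d(z[::-1, ::-1], r, mode="valid")
```

The analytic gradient can be checked against a central finite difference of the cost, which is a useful sanity test for any of the convolution index conventions above.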

Algorithm 1: Learning convolutional sparse image decomposition for a single layer l.
Require: Training images y, # feature maps K, connectivity g
Require: Regularization weight λ, # epochs E
Require: Continuation parameters: β0, βInc, βMax
1: Initialize feature maps and filters z ∼ N(0, ε), f ∼ N(0, ε)
2: for epoch = 1 : E do
3:   for i = 1 : I do
4:     β = β0
5:     while β < βMax do
6:       Given z^i_{k,l}, solve for x^i_{k,l} using Eqn. 8, ∀k.
7:       Given x^i_{k,l}, solve for z^i_{k,l} using Eqn. 6, ∀k.
8:       β = β · βInc
9:     end while
10:    Update f^l_{k,c} using gradient descent on Eqn. 9, ∀k, c.
11:   end for
12: end for
13: Output: Filters f
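The whole loop of Algorithm 1 can be sketched for a toy instance with one feature map and one channel, combining the shrinkage step (Eqn. 8), the CG solve of the z sub-problem (Eqn. 6), and a gradient step on the filter (Eqn. 9). This is an illustrative simplification, not the authors' implementation; the learning rate `lr` and initialization scale are our choices, and the real algorithm runs over many maps with the connectivity g.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.sparse.linalg import LinearOperator, cg

def learn_layer(images, H=7, lam=1.0, beta0=1e-5, beta_inc=12.0,
                beta_max=1e6, epochs=3, lr=1e-3, seed=0):
    """Toy instance of Algorithm 1: one feature map, one channel."""
    rng = np.random.default_rng(seed)
    f = 1e-2 * rng.normal(size=(H, H))  # f ~ N(0, eps)
    Nr, Nc = images[0].shape
    zs = [1e-2 * rng.normal(size=(Nr + H - 1, Nc + H - 1)) for _ in images]

    def conv_valid(z, flt):   # F z
        return convolve2d(z, flt, mode="valid")

    def conv_full_T(r, flt):  # F^T r, via the flipped filter (footnote 2)
        return convolve2d(r, flt[::-1, ::-1], mode="full")

    for epoch in range(epochs):
        for i, y in enumerate(images):
            z, beta = zs[i], beta0
            while beta < beta_max:  # continuation on beta (lines 5-9)
                # x sub-problem: soft shrinkage (Eqn. 8)
                x = np.sign(z) * np.maximum(np.abs(z) - 1.0 / beta, 0.0)
                # z sub-problem (Eqn. 6), solved matrix-free by CG
                shape, n = z.shape, z.size
                A = LinearOperator((n, n), matvec=lambda v: (
                    conv_full_T(conv_valid(v.reshape(shape), f), f)
                    + (beta / lam) * v.reshape(shape)).ravel())
                b = (conv_full_T(y, f) + (beta / lam) * x).ravel()
                z = cg(A, b, x0=z.ravel(), maxiter=100)[0].reshape(shape)
                beta *= beta_inc
            zs[i] = z
            # filter update: one gradient step on Eqn. 9 (line 10)
            r = conv_valid(z, f) - y
            grad = 2.0 * lam * convolve2d(z[::-1, ::-1], r, mode="valid")
            f -= lr * grad
    return f, zs
```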

3.3. Image Decomposition/Reconstruction

To use the model for image reconstruction, we first decompose an input image by using the learned filters f to find the latent representation z. We explain the procedure for a 2 layer model. We first infer the feature maps z_{k,1} for layer 1 using the input y' and the filters f^1_{k,c}, by minimizing C_1(y'). Next we update the feature maps for layer 2, z_{k,2}, in an alternating fashion. In step 1, we minimize the reconstruction error w.r.t. y', projecting z_{k,2} through the filters f^2 and f^1 to the image:

\lambda_2 \sum_{c=1}^{K_0} \Big\| \sum_{k=1}^{K_1} \Big( \sum_{b=1}^{K_2} g^2_{b,k} (z_{b,2} \oplus f^2_{b,k}) \Big) \oplus f^1_{k,c} - y'_c \Big\|_2^2 + \sum_{k=1}^{K_2} |z_{k,2}| \qquad (10)


In step 2, we update z_{k,2} by minimizing the reconstruction error w.r.t. the layer 1 feature maps z_{c,1}:

\lambda_2 \sum_{c=1}^{K_1} \Big\| \sum_{k=1}^{K_2} g^2_{k,c} (z_{k,2} \oplus f^2_{k,c}) - z_{c,1} \Big\|_2^2 + \sum_{k=1}^{K_2} |z_{k,2}| \qquad (11)

We alternate between steps 1 and 2, using conjugate gradient descent in both steps. Once z_{k,2} has converged, we reconstruct y' by projecting back to the image via f^2 and f^1:

y'_c = \sum_{k=1}^{K_1} g^1_{k,c} \Big( \sum_{b=1}^{K_2} g^2_{b,k} (z_{b,2} \oplus f^2_{b,k}) \Big) \oplus f^1_{k,c} \qquad (12)

An important detail is the addition of an extra feature map z_0 per input map at the first layer, which connects to the image via a constant uniform filter f_0. Unlike the sparsity priors on the other feature maps, z_0 has an l2 prior on its gradients, i.e. the prior is of the form \|\nabla z_0\|^2. These maps capture the low-frequency components, leaving the high-frequency edge structure to be modeled by the learned filters. The maps are only used for decomposing the image. When learning the filters, we pre-process the images by high-pass filtering them.
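The high-pass pre-processing step can be sketched as follows. The Gaussian low-pass is our illustrative choice of filter (the paper does not specify one), and the split is exact: the low band plays the role of the smooth content absorbed by z_0, while the high band is what the learned filters are trained on.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_bands(img, sigma=3.0):
    """Split an image into low- and high-frequency components.

    sigma is an illustrative choice, not a value from the paper.
    Returns (low, high) with low + high == img exactly.
    """
    low = gaussian_filter(img, sigma)   # smooth, low-frequency content
    high = img - low                    # residual edge structure
    return low, high
```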

4. Experiments

In our experiments, we use two datasets of 100×100 images, one containing natural scenes of fruits and vegetables and the other consisting of scenes of urban environments. Sample images from the two datasets (“food” and “city” respectively) are shown in Figure 3.

In all our experiments, unless otherwise stated, the same learning settings were used for all layers, namely: H = 7, λ = 1, p = 1, β_0 = 10^{-5}, β_{Inc} = 12, β_{Max} = 10^6, E = 25.

Using these parameters we trained a separate 3 layer model for each dataset, using an identical architecture. The first layer had 9 feature maps fully-connected to the input. The second layer had 45 maps: 36 of which were connected to pairs of maps in the first layer, and the remainder were singly-connected. The third layer had 150 feature maps, each of which was connected to a random pair of second layer feature maps. In Fig. 9 and Fig. 10 we show the filters that spontaneously emerge, projected back into pixel space.
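The layer-2 wiring described above can be encoded as the binary connectivity matrix g of Section 3.1. A sketch reproducing it (our naming): all C(9,2) = 36 pairs of first-layer maps, plus one singly-connected map per first-layer map, giving 45 second-layer maps.

```python
import itertools
import numpy as np

def layer2_connectivity(K1=9):
    """Binary connectivity g[k, c] between layer-2 map k and layer-1 map c.

    36 pairwise-connected maps (all pairs of the K1=9 first-layer maps)
    followed by 9 singly-connected maps, 45 rows in total.
    """
    rows = [set(p) for p in itertools.combinations(range(K1), 2)]  # 36 pairs
    rows += [{c} for c in range(K1)]                               # 9 singletons
    g = np.zeros((len(rows), K1), dtype=int)
    for k, maps in enumerate(rows):
        for c in maps:
            g[k, c] = 1
    return g
```

The 150 third-layer maps would be built the same way, each row activating a random pair of the 45 second-layer maps.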

The first layer in each model learns Gabor-style filters, although in the city case they are not evenly distributed in orientation, preferring vertical and horizontal structures. The second layer filters comprise an assorted set of V2-like elements, with center-surround (Fig. 9 – E2, I1; Fig. 10 – B4, I1); corners (Fig. 9 – D4, I2; Fig. 10 – C4, E1); T-junctions (Fig. 10 – I4); angle-junctions (Fig. 9 – D4, G4; Fig. 10 – E2); curves (Fig. 9 – A3, G1) and others. As with the 1st layer, horizontal and vertical structures dominate in the filters trained on the city dataset, while all orientations are represented in the food filters.

Figure 3. Examples from the food dataset (top two rows) and the city dataset (bottom two rows).

The third layer filters are highly diverse. Those from the model trained on food images (Fig. 9) comprise several types: oriented gratings (rows 1–4); blobs (D8, E7, H9); box-like structures (B10, F12) and others that capture parallel and converging lines (C12, J11). The filters trained on city images (Fig. 10) capture line groupings in horizontal and vertical configurations. These include: conjunctions of T-junctions (C15, G11); boxes (D14, E4) and various parallel lines (B15, D8, I3). Some of the filters are representative of the tokens shown in Fig. 2-4 of Marr [17] (see Fig. 1).

Since our model is generative, we can sample from it. In Fig. 4 we show samples from the two models at each level, projected down into pixel space. The samples were drawn using the relative firing frequencies of each feature from the training set.

4.1. Reconstruction Experiments

Our model can be used for synthesis as well as analysis. We explore the ability of our learned representation to denoise images, using a 2 layer model. In Fig. 5 we show the reconstruction results from layers 1 and 2, given a noisy input image. The reconstruction from layer 2 is closer to the original image, having a higher SNR score.


Figure 4. Samples from our two models (fruit and city rows; layers 1–3 columns), from each layer projected down into pixel space.

Figure 5. Denoising an image corrupted by white Gaussian noise (noisy input SNR = 13.84 dB) using the filters learned on the fruit dataset. The layer 2 reconstruction (SNR = 18.01 dB) is superior to the layer 1 reconstruction (SNR = 16.31 dB); for example, note the recovery of the shadow lines in the bottom right of the image in the layer 2 reconstruction.

We also explore the relative sparsity of the feature maps in the 1st and 2nd layers of our model as we vary λ (see Fig. 6). Plotting the average sparsity of each feature map against RMS reconstruction error, we see that the feature maps at layer 2 are sparser and give a lower reconstruction error than those at layer 1. In Fig. 6 we also plot the same curve for the patch-based sparse decomposition of Mairal et al. [15]. In this framework, inference is run on each patch in the image separately, and since the patches overlap, a much larger number of latent features are needed to represent the image. The curve was produced by varying the number of active dictionary elements per patch in reconstruction.

[Figure 6 plot: RMS reconstruction error (y-axis, 0.05–0.068) vs. sparsity per feature map Σ|z|₁ (x-axis, log scale 10¹–10³); curves for Layer 1, Layer 2 and Patch-based.]

Figure 6. Exploring the trade-off between sparsity and denoising performance for our 1st and 2nd layer representations (red and green respectively), as well as the patch-based approach of Mairal et al. [15] (blue). Our 2nd layer representation simultaneously achieves a lower reconstruction error and sparser feature maps.

4.2. Comparison to Patch-Based Decomposition

To demonstrate the benefits of imposing sparsity within a convolutional architecture, we compare our model to the patch-based sparse image decomposition approach of Mairal et al. [15]. Using the SPAMS code accompanying [15], we performed a patch-based decomposition of the two image sets, using 100 dictionary elements. The resulting filters are shown in Fig. 8(top). We then attempted to build a hierarchical 2 layer model by taking the sparse output vectors from each image patch and arranging them into a map over the image. Applying the SPAMS code to this map produces the 2nd layer filters shown in Fig. 8(bottom). While larger in scale than the 1st layer filters, they are mostly still Gabor-like and do not show the diversity of edge conjunctions present in our 2nd layer filters. To probe this result, we visualize the latent feature maps of our convolutional decomposition and Mairal et al.'s patch-based decomposition in Fig. 7.
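The "arranging them into a map" step can be sketched with off-the-shelf sparse coding. Here we substitute scikit-learn's OMP-based `sparse_encode` for the SPAMS code used in the paper, and the helper `patch_code_map` is purely illustrative:

```python
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.feature_extraction.image import extract_patches_2d

def patch_code_map(image, dictionary, patch_size=8, n_nonzero=2):
    """Sparse-code every overlapping patch, then arrange the code vectors
    into a spatial map over the image (one code per patch location).

    dictionary: (n_atoms, patch_size**2) array of flattened dictionary elements.
    Returns an (H - p + 1, W - p + 1, n_atoms) map of sparse coefficients.
    """
    patches = extract_patches_2d(image, (patch_size, patch_size))
    flat = patches.reshape(len(patches), -1)
    codes = sparse_encode(flat, dictionary, algorithm="omp",
                          n_nonzero_coefs=n_nonzero)
    h = image.shape[0] - patch_size + 1
    w = image.shape[1] - patch_size + 1
    return codes.reshape(h, w, dictionary.shape[0])

# Toy usage with a random (hypothetical) dictionary of 10 atoms.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
dictionary = rng.normal(size=(10, 64))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
code_map = patch_code_map(image, dictionary)
```

A second dictionary-learning pass over `code_map` is the stacking experiment described above; because neighboring columns of the map can change abruptly, the 2nd layer has little stable structure to latch onto.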

5. Conclusion

We have presented a conceptually simple framework for learning sparse, over-complete feature hierarchies. Applying this framework to natural images produces a highly diverse set of filters that capture high-order image structure beyond edge primitives. These arise without the need for hyper-parameter tuning or additional modules, such as local contrast normalization, max-pooling and rectification [9]. Our approach relies on robust optimization techniques to minimize the poorly conditioned cost functions that arise in the convolutional setting. While our algorithm is slow compared to approaches that use bottom-up encoders, heavy


[Figure 7 panels: (a) Cropped image & sliding window; (b) Convolutional representation: feature map and filters; (c) Patch-based representation: feature map and filters.]

Figure 7. A comparison of convolutional and patch-based sparse representations for a crop from a natural image (a). (b): Sparse convolutional decomposition of (a). Note the smoothly varying feature maps that preserve spatial locality. (c): Patch-based decomposition of (a) using a sliding window (green). Each column in the feature map corresponds to the sparse vector over the filters for a given x-location of the sliding window. As the sliding window moves, the latent representation is highly unstable, changing rapidly across edges. Without a stable representation, stacking the layers will not yield higher-order filters, as demonstrated in Fig. 8.

[Figure 8: 1st layer filters (top) and 2nd layer filters (bottom) learned with Mairal et al.'s method.]

Figure 8. Examples of 1st and 2nd layer filters learned using the patch-based sparse decomposition approach of Mairal et al. [15], applied to the food dataset. While the first layer filters look similar to ours, the 2nd layer filters are merely larger versions of the 1st layer filters, lacking the edge compositions found in our 2nd layer filters.

use of the convolution operator makes it amenable to parallelization and GPU implementations which could give significant speed gains.

References
[1] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691–1715, 1999.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, pages 153–160, 2007.
[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comp., 20(1):33–61, 1999.
[4] S. Fidler, M. Boben, and A. Leonardis. Similarity-based cross-layered hierarchical representation for object categorization. In CVPR, 2008.
[5] S. Fidler and A. Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In CVPR, 2007.
[6] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. PAMI, 4:932–946, 1995.
[7] C. E. Guo, S. C. Zhu, and Y. N. Wu. Primal sketch: Integrating texture and structure. Computer Vision and Image Understanding, 106:5–19, April 2007.
[8] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[9] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[10] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, volume 2, pages 2145–2152, 2006.
[11] D. Krishnan and R. Fergus. Analytic Hyper-Laplacian priors for fast image deconvolution. In NIPS, 2009. To appear.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[13] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801–808, 2007.
[14] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616, 2009.
[15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, pages 689–696, 2009.
[16] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS, 2008.
[17] D. Marr. Vision. Freeman, San Francisco, 1982.
[18] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[19] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2008.
[20] M. Ranzato, C. S. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, pages 1137–1144, 2006.
[21] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[22] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.
[23] Z. W. Tu and S. C. Zhu. Parsing images into regions, curves, and curve groups. IJCV, 69(2):223–249, August 2006.
[24] Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imag. Sci., 1(3):248–272, 2008.
[25] L. Zhu, Y. Chen, and A. L. Yuille. Learning a hierarchical deformable template for rapid deformable object parsing. PAMI, March 2009.
[26] S. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4):259–362, 2006.


[Figure 9: filter grids for the 1st, 2nd and 3rd layers of the model trained on food scenes.]

Figure 9. Filters from each layer in our model, trained on food scenes (see Fig. 3(top)). Note the rich diversity of filters and their increasing complexity with each layer. In contrast to the filters shown in Fig. 10, the filters are evenly distributed over orientation.

[Figure 10: filter grids for the 1st, 2nd and 3rd layers of the model trained on the city dataset.]

Figure 10. Filters from each layer in our model, trained on the city dataset (see Fig. 3(bottom)). Note the predominance of horizontal and vertical structures.