Image Understanding Using Sparse Representations


  • Image Understanding Using Sparse Representations

    Jayaraman J. Thiagarajan, Lawrence Livermore National Laboratory
    Karthikeyan Natesan Ramamurthy, IBM Thomas J. Watson Research Center
    Pavan Turaga and Andreas Spanias, Arizona State University

    ISBN: 978-1-62705-359-4

    Series editor: Alan C. Bovik, University of Texas at Austin

    Synthesis Lectures on Image, Video & Multimedia Processing

    Series ISSN: 1559-8136

    About Synthesis: This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

    Morgan & Claypool Publishers, www.morganclaypool.com


  • Image Understanding Using Sparse Representations

  • Synthesis Lectures on Image, Video, and Multimedia Processing

    Editor: Alan C. Bovik, University of Texas, Austin

    The Lectures on Image, Video and Multimedia Processing are intended to provide a unique and groundbreaking forum for the world's experts in the field to express their knowledge in unique and effective ways. It is our intention that the Series will contain Lectures of basic, intermediate, and advanced material depending on the topical matter and the author's level of discourse. It is also intended that these Lectures depart from the usual dry textbook format and instead give the author the opportunity to speak more directly to the reader, and to unfold the subject matter from a more personal point of view. The success of this candid approach to technical writing will rest on our selection of exceptionally distinguished authors, who have been chosen for their noteworthy leadership in developing new ideas in image, video, and multimedia processing research, development, and education. In terms of the subject matter for the series, there are few limitations that we will impose other than that the Lectures be related to aspects of the imaging sciences that are relevant to furthering our understanding of the processes by which images, videos, and multimedia signals are formed, processed for various tasks, and perceived by human viewers. These categories are naturally quite broad, for two reasons: First, measuring, processing, and understanding perceptual signals involves broad categories of scientific inquiry, including optics, surface physics, visual psychophysics and neurophysiology, information theory, computer graphics, display and printing technology, artificial intelligence, neural networks, harmonic analysis, and so on. Secondly, the domain of application of these methods is limited only by the number of branches of science, engineering, and industry that utilize audio, visual, and other perceptual signals to convey information. We anticipate that the Lectures in this series will dramatically influence future thought on these subjects as the Twenty-First Century unfolds.

    Image Understanding Using Sparse Representations
    Jayaraman J. Thiagarajan, Karthikeyan Natesan Ramamurthy, Pavan Turaga, and Andreas Spanias, 2014

    Contextual Analysis of Videos
    Myo Thida, How-lung Eng, Dorothy Monekosso, and Paolo Remagnino, 2013

    Wavelet Image Compression
    William A. Pearlman, 2013

    Remote Sensing Image Processing
    Gustavo Camps-Valls, Devis Tuia, Luis Gómez-Chova, Sandra Jiménez, and Jesús Malo, 2011

    The Structure and Properties of Color Spaces and the Representation of Color Images
    Eric Dubois, 2009

    Biomedical Image Analysis: Segmentation
    Scott T. Acton and Nilanjan Ray, 2009

    Joint Source-Channel Video Transmission
    Fan Zhai and Aggelos Katsaggelos, 2007

    Super Resolution of Images and Video
    Aggelos K. Katsaggelos, Rafael Molina, and Javier Mateos, 2007

    Tensor Voting: A Perceptual Organization Approach to Computer Vision and Machine Learning
    Philippos Mordohai and Gérard Medioni, 2006

    Light Field Sampling
    Cha Zhang and Tsuhan Chen, 2006

    Real-Time Image and Video Processing: From Research to Reality
    Nasser Kehtarnavaz and Mark Gamadia, 2006

    MPEG-4 Beyond Conventional Video Coding: Object Coding, Resilience, and Scalability
    Mihaela van der Schaar, Deepak S Turaga, and Thomas Stockhammer, 2006

    Modern Image Quality Assessment
    Zhou Wang and Alan C. Bovik, 2006

    Biomedical Image Analysis: Tracking
    Scott T. Acton and Nilanjan Ray, 2006

    Recognition of Humans and Their Activities Using Video
    Rama Chellappa, Amit K. Roy-Chowdhury, and S. Kevin Zhou, 2005

  • Copyright 2014 by Morgan & Claypool

    All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other) except for brief quotations in printed reviews, without the prior permission of the publisher.

    Image Understanding Using Sparse Representations

    Jayaraman J. Thiagarajan, Karthikeyan Natesan Ramamurthy, Pavan Turaga, and Andreas Spanias

    www.morganclaypool.com

    ISBN: 9781627053594 paperback

    ISBN: 9781627053600 ebook

    DOI 10.2200/S00563ED1V01Y201401IVM015

    A Publication in the Morgan & Claypool Publishers series

    SYNTHESIS LECTURES ON IMAGE, VIDEO, AND MULTIMEDIA PROCESSING

    Lecture #15

    Series Editor: Alan C. Bovik, University of Texas, Austin

    Series ISSN

    Synthesis Lectures on Image, Video, and Multimedia Processing

    Print 1559-8136 Electronic 1559-8144

  • Image Understanding Using Sparse Representations

    Jayaraman J. Thiagarajan, Lawrence Livermore National Laboratory

    Karthikeyan Natesan Ramamurthy, IBM Thomas J. Watson Research Center

    Pavan Turaga, Arizona State University

    Andreas Spanias, Arizona State University

    SYNTHESIS LECTURES ON IMAGE, VIDEO, AND MULTIMEDIA PROCESSING #15

    Morgan & Claypool Publishers

  • ABSTRACT

    Image understanding has been playing an increasingly crucial role in several inverse problems and computer vision. Sparse models form an important component in image understanding, since they emulate the activity of neural receptors in the primary visual cortex of the human brain. Sparse methods have been utilized in several learning problems because of their ability to provide parsimonious, interpretable, and efficient models. Exploiting the sparsity of natural signals has led to advances in several application areas including image compression, denoising, inpainting, compressed sensing, blind source separation, super-resolution, and classification.

    The primary goal of this book is to present the theory and algorithmic considerations in using sparse models for image understanding and computer vision applications. To this end, algorithms for obtaining sparse representations and their performance guarantees are discussed in the initial chapters. Furthermore, approaches for designing overcomplete, data-adapted dictionaries to model natural images are described. The development of the theory behind dictionary learning involves exploring its connection to unsupervised clustering and analyzing its generalization characteristics using principles from statistical learning theory. An exciting application area that has benefited extensively from the theory of sparse representations is compressed sensing of image and video data. Theory and algorithms pertinent to measurement design, recovery, and model-based compressed sensing are presented. The paradigm of sparse models, when suitably integrated with powerful machine learning frameworks, can lead to advances in computer vision applications such as object recognition, clustering, segmentation, and activity recognition. Frameworks that enhance the performance of sparse models in such applications by imposing constraints based on prior discriminatory information and the underlying geometrical structure, and by kernelizing the sparse coding and dictionary learning methods, are presented. In addition to presenting theoretical fundamentals in sparse learning, this book provides a platform for interested readers to explore the vastly growing application domains of sparse representations.

    KEYWORDS

    sparse representations, natural images, image reconstruction, image recovery, image classification, dictionary learning, clustering, compressed sensing, kernel methods, graph embedding

  • Contents

    1 Introduction
      1.1 Modeling Natural Images
      1.2 Natural Image Statistics
      1.3 Sparseness in Biological Vision
      1.4 The Generative Model for Sparse Coding
      1.5 Sparse Models for Image Reconstruction
          1.5.1 Dictionary Design
          1.5.2 Example Applications
      1.6 Sparse Models for Recognition
          1.6.1 Discriminative Dictionaries
          1.6.2 Bag of Words and its Generalizations
          1.6.3 Dictionary Design with Graph Embedding Constraints
          1.6.4 Kernel Sparse Methods

    2 Sparse Representations
      2.1 The Sparsity Regularization
          2.1.1 Other Sparsity Regularizations
          2.1.2 Non-Negative Sparse Representations
      2.2 Geometrical Interpretation
      2.3 Uniqueness of $\ell_0$ and its Equivalence to the $\ell_1$ Solution
          2.3.1 Phase Transitions
      2.4 Numerical Methods for Sparse Coding
          2.4.1 Optimality Conditions
          2.4.2 Basis Pursuit
          2.4.3 Greedy Pursuit Methods
          2.4.4 Feature-Sign Search
          2.4.5 Iterated Shrinkage Methods

    3 Dictionary Learning: Theory and Algorithms
      3.1 Dictionary Learning and Clustering
          3.1.1 Clustering Procedures
          3.1.2 Probabilistic Formulation
      3.2 Learning Algorithms
          3.2.1 Method of Optimal Directions
          3.2.2 K-SVD
          3.2.3 Multilevel Dictionaries
          3.2.4 Online Dictionary Learning
          3.2.5 Learning Structured Sparse Models
          3.2.6 Sparse Coding Using Examples
      3.3 Stability and Generalizability of Learned Dictionaries
          3.3.1 Empirical Risk Minimization
          3.3.2 An Example Case: Multilevel Dictionary Learning

    4 Compressed Sensing
      4.1 Measurement Matrix Design
          4.1.1 The Restricted Isometry Property
          4.1.2 Geometric Interpretation
          4.1.3 Optimized Measurements
      4.2 Compressive Sensing of Natural Images
      4.3 Video Compressive Sensing
          4.3.1 Frame-by-Frame Compressive Recovery
          4.3.2 Model-Based Video Compressive Sensing
          4.3.3 Direct Feature Extraction from Compressed Videos

    5 Sparse Models in Recognition
      5.1 A Simple Classification Setup
      5.2 Discriminative Dictionary Learning
      5.3 Sparse-Coding-Based Subspace Identification
      5.4 Using Unlabeled Data in Supervised Learning
      5.5 Generalizing Spatial Pyramids
          5.5.1 Supervised Dictionary Optimization
      5.6 Locality in Sparse Models
          5.6.1 Local Sparse Coding
          5.6.2 Dictionary Design
      5.7 Incorporating Graph Embedding Constraints
          5.7.1 Laplacian Sparse Coding
          5.7.2 Local Discriminant Sparse Coding
      5.8 Kernel Methods in Sparse Coding
          5.8.1 Kernel Sparse Representations
          5.8.2 Kernel Dictionaries in Representation and Discrimination
          5.8.3 Combining Diverse Features
          5.8.4 Application: Tumor Identification

    Bibliography

    Authors' Biographies

  • CHAPTER 1

    Introduction

    1.1 MODELING NATURAL IMAGES

    Natural images are pervasive entities that have been studied over the past five decades. The term natural images includes those images that are usually present in the environment in which we live. The definition encompasses a general class that includes scenes and objects present in nature, as well as man-made entities such as cars, buildings, and so on. Natural images have interested scientists from a wide range of fields, from psychology to applied mathematics. Though these images have rich variability when viewed as a whole, their local regions often demonstrate substantial similarities. Some examples of natural images that we have used in this book are given in Figure 1.1.

    Many proposed models for representing natural images have tried to mimic the processing of images by the human visual system. Generally, image representation is concerned with low-level vision. Hence, using local regions or patches of images, and exploiting the local similarities in order to build a set of features, can be beneficial in image representation. The representative features that are extracted from local image regions are referred to as low-level features. It is imperative that we understand and incorporate our knowledge of the human visual processing principles, as well as the local image statistics, when building systems for representing images. As a result, exploring natural image statistics has become an integral part of the design of efficient image representation systems.

    As we shift our focus from the problem of representing images to the task of recognizing objects or classifying scenes, it becomes clear that these tasks cannot be performed simply by representing the local regions of images using low-level features. Considering the problem of recognizing an object in a scene, the local features extracted will not be invariant to global transformations of the object, its local deformations, and occlusion. Hence, we need to ensure that they capture the key invariants of an object to transformations and, to some extent, occlusions, instead of just extracting low-level features that provide a good representation. Such features can be used to build image descriptors that are helpful in recognition and classification.

    1.2 NATURAL IMAGE STATISTICS

    A digital image, denoted using the symbol I, is represented using pixel values between 0 and 255 for grayscale. Image statistics refer to the various statistical properties of the raw pixel values or their transformed versions. By incorporating natural image statistics as prior information in


    Figure 1.1: Examples of some natural images used in this book. These images are obtained from the Berkeley segmentation dataset (BSDS) [1].

    a Bayesian inference scheme, one can hope to learn features that are adapted to the image data, rather than try to use classical transform methods that may not be adapted well to the data.

    Since analyzing raw pixel statistics hardly produces any useful information, analysis of natural image statistics has been performed in the Fourier [2] and wavelet domains [3]. It is observed that the power spectrum falls off as $1/f^2$, where $f$ is the frequency, and this has been one of the earliest proofs of redundancy in natural images. The marginal statistics of wavelet coefficients in a subband, illustrated in Figure 1.2, show that the distributions are more peaked and heavily tailed when compared to a Gaussian distribution. Furthermore, the joint statistics indicate a strong dependence between the coefficients in different subbands. The coefficient statistics obtained using log-Gabor filters on natural images also reveal a peaked distribution. Figure 1.3 shows the output distribution at scale 1 and orientation 1 for the Barbara image. These log-Gabor filters are typically used to detect edges in natural images. The histogram of the coefficients implies that most of the responses obtained from the filters are close to zero. This demonstration shows that natural images can be sparsely decomposed in any set of basis functions that can identify edges in the images efficiently.
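    The peaked, heavy-tailed shape of these marginal distributions is easy to verify numerically. The following sketch assumes the PyWavelets, SciPy, and scikit-image packages and uses a stock test image in place of the natural images of Figure 1.1; an excess kurtosis well above zero indicates a distribution far more peaked than a Gaussian.

```python
import numpy as np
import pywt                       # assumes the PyWavelets package is available
from scipy.stats import kurtosis
from skimage import data          # assumes scikit-image, used only for a sample grayscale image

# Any natural grayscale image could be substituted here
img = data.camera().astype(float)

# One level of the 2D discrete wavelet transform; cH, cV, cD are the detail subbands
_, (cH, cV, cD) = pywt.dwt2(img, 'db4')

# Marginal statistics of one detail subband: strongly peaked and heavy-tailed
# compared to a Gaussian, whose excess kurtosis is 0
coeffs = cH.ravel()
print("std: %.2f  excess kurtosis: %.2f" % (coeffs.std(), kurtosis(coeffs)))
```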

    Besides understanding the statistics, analyzing the topological structure of the natural image patch distribution can also provide crucial insights. As a result of such an effort reported in [4, 5], it has been shown that high-contrast image patches lie in clusters and near low-dimensional manifolds. Patches of size $3 \times 3$ were chosen from natural images and preprocessed in the logarithmic scale to remove the mean and normalize the contrast values. They were then projected using the discrete cosine transform (DCT) basis and normalized to unit norm, which makes the processed data lie on a 7-sphere in $\mathbb{R}^8$. By computing the fraction of the data points that are near a dense sampling of the surface of the sphere, the empirical distribution of the data can be determined. It has been observed that a majority of the data points are concentrated in a few high-density regions of the sphere. The sample points corresponding to the high-density regions are similar to blurred step edges for natural image patches. Further topological analysis of natural image patches has been performed in [5], which suggests that features that can efficiently represent a large portion of natural images can be extracted by computing topological equivalents to the space of natural image patches.
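    A minimal sketch of this patch preprocessing is given below, under simplifying assumptions: the exact contrast normalization used in [4, 5] is replaced by a plain unit-norm scaling, and a stock test image stands in for the natural image collection. Each $3 \times 3$ patch is mapped to a point on the 7-sphere in $\mathbb{R}^8$ by taking log intensities, removing the mean, keeping the eight non-DC DCT coefficients, and normalizing.

```python
import numpy as np
from scipy.fft import dctn
from skimage import data           # assumes scikit-image for a sample image

img = data.camera().astype(float) + 1.0   # offset avoids log(0)
rng = np.random.default_rng(0)

def patch_to_sphere(patch):
    """Map a 3x3 patch to a point on the 7-sphere in R^8 (simplified version of
    the preprocessing in [4, 5]; the exact contrast normalization there differs)."""
    v = np.log(patch)
    v -= v.mean()                              # remove the mean log intensity
    c = dctn(v, norm='ortho').ravel()[1:]      # drop the DC term, keep 8 coefficients
    n = np.linalg.norm(c)
    return c / n if n > 0 else None            # unit norm: a point on S^7

# Map a few randomly chosen 3x3 patches
rows = rng.integers(0, img.shape[0] - 3, size=5)
cols = rng.integers(0, img.shape[1] - 3, size=5)
points = [patch_to_sphere(img[r:r + 3, c:c + 3]) for r, c in zip(rows, cols)]
print(points[0], np.linalg.norm(points[0]))
```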


    Figure 1.2: (a) The Lena image, (b) marginal statistics of wavelet coefficients in a subband (number of coefficients vs. coefficient value).

    Figure 1.3: (a) The Barbara image, (b) log-Gabor coefficient statistics at scale 1 and orientation 1 for the image (histogram of coefficient magnitudes).

    1.3 SPARSENESS IN BIOLOGICAL VISION

    Most of our cognitive functions and perceptual processes are carried out in the neocortex, which is the largest part of the human brain. The primary visual cortex, also referred to as V1, is the part of the neocortex that receives visual input from the retina. The Nobel Prize-winning discoveries of Hubel and Wiesel showed that the primary visual cortex consists of cells responsive to simple and complex features in the input. V1 has receptive fields that are characterized as being spatially localized, oriented, and bandpass. In other words, they are selective to the structure of the visual input at different spatial scales. One approach to understanding the response properties of visual


    neurons has been to consider their relationship to the statistical structure of natural images in terms of efficient coding.

    A generative model that constructs images from random noise intensities assigns roughly equal probability to all images. However, the statistics of natural image patches have been shown to contain a variety of regularities. Hence, the probability of generating a natural image using the random noise model is extremely low. In other words, the redundancy in natural images makes the evolution of an efficient visual system possible. Extending this argument, understanding the behavior of the visual system in exploiting redundancy will enable us to build a plausible explanation for the coding behavior of visual neurons [6].

    The optimality of the visual coding process can be addressed with respect to different metrics of efficiency. One of the most commonly used metrics is representation efficiency. Several approaches have been developed to explore the properties of neurons involved in image representation. In [7], it was first reported that the neurons found in V1 showed similarities to Gabor filters, and, hence, different computational models were developed based on this relation [8, 9]. Furthermore, the $1/f^k$ fall-off observed in the Fourier spectra of natural images demonstrated the redundancy in the images. The $1/f$ structure arises because of the pairwise correlations in the data and the observation that natural images are approximately scale invariant. The pairwise correlations account for about 40% of the total redundancy in natural scenes. Any representation with significant correlations implies that most of the signal lies in a subspace within the larger space of possible representations [10] and, hence, the data can be represented with much-reduced dimensionality. Though pairwise correlations have been an important form of redundancy, studies have shown that there exist a number of other forms. Two images with similar $1/f$ spectra can be described in terms of differences in their sparse structure.

    For example, any linear representation of a noise image having a $1/f$ spectrum will result in a Gaussian response distribution. However, with natural images, the histogram of the responses will be non-Gaussian for a suitable choice of linear filters (Figure 1.3). A visual system that generates such response distributions with high kurtosis (fourth statistical moment) can produce a representation with visual neurons that are maximally independent. In other words, codes with maximal independence will activate neurons with maximal unique information. With respect to biological vision, sparsity implies that a small proportion of the neurons are active, and the rest of the population is inactive with high probability. Olshausen and Field showed that, by designing a neural network that attempts to find sparse linear codes for natural scenes, we can obtain a family of localized, oriented, and bandpass basis functions, similar to those found in the primary visual cortex. This provided evidence that, at the level of V1, visual system representations can efficiently match the sparseness of natural scenes. However, it must be emphasized that the sparse outputs of these models result from the sparse structure of the data. For example, similar arguments cannot be made for white-noise images. Further studies have shown that the visual neurons produce sparse responses in higher stages of cortical processing, such as the inferotemporal cortex, in addition to the


    primary visual cortex. However, there is no biological evidence to show that these sparse responses imply efficient representation of the environment by the neurons.

    Measuring the efficiency of neural systems is very complicated compared to that of engineered visual systems. In addition to characterizing visual representations, there is a need to understand their dependency on efficient learning and development. A learning algorithm must address a number of problems pertinent to generalization. For example, invariance is an important property that the learning algorithm must possess. In this case, efficient learning is determined by its ability to balance the selection of suitable neurons and achieving invariance across examples with features that vary. In other words, it is critical for the algorithm to generalize to multiple instances of an object in the images. At one end, this problem can be addressed by building a neuron for every object. Simpler tasks, such as distinguishing between different faces, can be efficiently performed using this approach. However, the number of object detectors in such a system would be exorbitantly high. At the other end, sparse codes with more neurons are necessary for visually challenging tasks, such as object categorization. Hence, the general task of object recognition requires different strategies with varying degrees of sparseness.

    Another important property of the representations in biological vision is that they are highly overcomplete, and, hence, they involve significant redundancy. This is one of the motivating factors behind using overcompleteness as an efficient way to model the redundancy in images [11]. Furthermore, overcompleteness can result in highly sparse representations, when compared to using complete codes, and this property is very crucial for generalization during learning. Though a comprehensive theory to describe human visual processing has not yet been developed, considering different measures of efficiency together has been found to be a crucial step in this direction.

    1.4 THE GENERATIVE MODEL FOR SPARSE CODING

    The linear generative model for the sparse representation of an image patch $\mathbf{x}$ is given by

    $\mathbf{x} = \mathbf{D}\mathbf{a}, \qquad (1.1)$

    where $\mathbf{x} \in \mathbb{R}^M$ is the arbitrary image patch, $\mathbf{D} \in \mathbb{R}^{M \times K}$ is the set of $K$ elementary features, and $\mathbf{a} \in \mathbb{R}^K$ is the coefficient vector. If we assume that the coefficient vector is sparse and has statistically independent components, the elements of the dictionary $\mathbf{D}$ can be inferred from the generative model using appropriate constraints. The reason for assuming a sparse prior on the elements of $\mathbf{a}$ is the intuition that images allow for their efficient representation as a sparse linear combination of patterns, such as edges, lines, and other elementary features [12]. Let us assume that all the $T$ patches extracted from the image $I$ are given by the matrix $\mathbf{X} \in \mathbb{R}^{M \times T}$, and the coefficient vectors for all the $T$ patches are given by the matrix $\mathbf{A} \in \mathbb{R}^{K \times T}$. The likelihood for the image $I$, which is represented as the matrix $\mathbf{X}$, is given by

    $\log p(\mathbf{X} \mid \mathbf{D}) = \log \int p(\mathbf{X} \mid \mathbf{D}, \mathbf{A}) \, p(\mathbf{A}) \, d\mathbf{A}. \qquad (1.2)$


    $p(\mathbf{X} \mid \mathbf{D}, \mathbf{A})$ can be computed using the linear generative model assumed in (1.1), and $p(\mathbf{A})$ is the prior that enforces sparsity constraints on the entries of the coefficient vectors. The dictionary can now be inferred as

    $\hat{\mathbf{D}} = \arg\max_{\mathbf{D}} \, \log p(\mathbf{X} \mid \mathbf{D}). \qquad (1.3)$

    The strategy proposed by Olshausen and Field [13] to determine the dictionary $\mathbf{D}$ is to use the generative model (1.1) and an approximation of (1.3), along with the constraint that the columns of $\mathbf{D}$ are of unit $\ell_2$ norm. The dictionary $\mathbf{D}$ here is overcomplete, i.e., the number of dictionary elements is greater than the dimensionality of the image patch, $K > M$. This leads to infinitely many solutions for the coefficient vector in (1.1) and, hence, the assumption on the sparsity of $\mathbf{a}$ can be used to choose the most appropriate dictionary elements that represent $\mathbf{x}$. Our strategy for computing sparse representations and learning dictionaries will also be based on this generative model. Although the generative model assumes that the distribution of coefficients is independent, there will still be statistical dependencies between the inferred coefficients. One of the reasons for this could be that the elementary features corresponding to the non-zero coefficients may occlude each other partially. In addition, the features themselves could co-occur to represent a more complex pattern in a patch.

    It should be noted that using a model such as (1.1), along with sparsity constraints on $\mathbf{a}$, deviates from the linear representation framework. Assume that we have a representation that is $S$-sparse, i.e., only $S$ out of $K$ coefficients are non-zero at any time in $\mathbf{a}$. There are $\binom{K}{S}$ such combinations possible, each representing an $S$-dimensional subspace, assuming that each set of $S$ chosen dictionary elements is linearly independent. This model leads to a union of $S$-dimensional subspaces in which the signal to be represented can reside. It is clearly not a linear model, as the sum of signals lying in two $S$-dimensional subspaces can possibly lie in a $2S$-dimensional subspace.

    1.5 SPARSE MODELS FOR IMAGE RECONSTRUCTION

    Images can be modeled using sparse representations on predefined dictionaries, as well as on those learned from the data itself. Predefined dictionaries obtained from the discrete cosine transform (DCT), wavelet, wedgelet, and curvelet transforms are widely used in signal and image processing applications. Since a single predefined dictionary cannot completely represent the patterns in an image, using a combination of predefined dictionaries is helpful in some applications.

    1.5.1 DICTIONARY DESIGN

    DCT dictionaries were some of the earliest dictionaries for patch-based image representation and are still used for initializing various dictionary learning procedures. Overcomplete DCT dictionaries can easily be constructed and have been used with sparse representations, performing substantially better than orthonormal DCT dictionaries. The set of predefined dictionaries that have probably found the widest applications in multiscale image representation are the wavelet

    Figure 1.4: (a) Parent-child relationship between the wavelet coefficients (LH, HL, and HH subbands) across different scales, (b) the quad-tree model.

    dictionaries. Wavelet coefficients have strong inter- and intra-scale dependencies, and modeling them can be helpful in applications that exploit this additional redundancy. Many successful image coders consider either implicit or explicit statistical dependencies between the wavelet coefficients. For example, the JPEG-2000 standard [14] considers the neighboring coefficients in the adjacent subbands of the same scale jointly while coding. Figure 1.4 provides the quad-tree model for a parent-child relationship between the wavelet coefficients, and this structured sparsity has been used in denoising [15], pattern recognition using templates [16], and model-based compressive sensing frameworks [17, 18]. Such a tree-based model had been used in one of the earliest wavelet coders, the Embedded Zerotree Wavelet coder [19].
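    A minimal sketch of the quad-tree indexing in Figure 1.4 follows; the coordinate convention is an assumption made for illustration. Within a detail subband, the parent of the coefficient at (row, col) sits at (row // 2, col // 2) in the same orientation one scale coarser, so every parent has a 2 x 2 block of children.

```python
def wavelet_parent(row, col, scale, coarsest_scale):
    """Parent of a detail coefficient in the quad-tree model: same orientation,
    one scale coarser, at the halved spatial index.  Coefficients at the
    coarsest detail scale have no parent."""
    if scale >= coarsest_scale:
        return None
    return (row // 2, col // 2, scale + 1)

def wavelet_children(row, col, scale):
    """The 2x2 block of children of a detail coefficient at the next finer scale."""
    if scale <= 1:
        return []
    return [(2 * row + dr, 2 * col + dc, scale - 1) for dr in (0, 1) for dc in (0, 1)]

print(wavelet_parent(5, 7, scale=1, coarsest_scale=3))   # (2, 3, 2)
print(wavelet_children(2, 3, scale=2))                   # the four children at scale 1
```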

    Using sparsity models with learned dictionaries has been very successful. The simplest model, which assumes that the coefficients are independent, gives rise to the K-SVD learning algorithm [20]. If the sparsity patterns appear in groups or blocks, block-based sparse representations can be performed [21]. Additional structure can be imposed on group sparsity, and each sparse pattern can be penalized accordingly. This gives rise to the structured sparsity frameworks, an area of ongoing research [22]. Apart from considering a structure when computing the coefficients, structured dictionaries can also be learned. An important class of dictionary learning algorithms imposes a hierarchical structure on the dictionary. For example, the multilevel dictionary learning algorithm [23] exploits the energy hierarchy found in natural images to learn dictionaries using multiple levels of 1-sparse representations.

    1.5.2 EXAMPLE APPLICATIONS

    Sparsity of the coefficients can be exploited in a variety of signal and image processing applications. Applications that use sparse models for recovery problems assume that the image data can be represented as a sparse linear combination of elements from an appropriate dictionary.


    Denoising: An image with additive noise can be represented as

    $\mathbf{X}_n = \mathbf{X} + \mathbf{N}, \qquad (1.4)$

    where $\mathbf{X}_n$ is the noisy image (vectorized) and $\mathbf{N}$ represents the noise added to the image. The denoising problem that recovers $\mathbf{X}$ from its noisy counterpart can be posed as [24]

    $\{\hat{\mathbf{a}}_{ij}, \hat{\mathbf{X}}\} = \arg\min_{\mathbf{a}_{ij}, \mathbf{X}} \sum_{ij} \|\mathbf{D}\mathbf{a}_{ij} - \mathbf{x}_{ij}\|, \quad \text{subj. to } \|\mathbf{a}_{ij}\|_0 \leq S, \; \|\mathbf{X}_n - \mathbf{X}\|_2 \leq \epsilon. \qquad (1.5)$

    Here, $\mathbf{x}_{ij}$ and $\mathbf{a}_{ij}$ denote the patch at location $(i, j)$ in the image and its corresponding coefficient vector, $\|\cdot\|_0$ represents the $\ell_0$ norm, which counts the number of non-zero entries in a vector, and $\epsilon$ denotes the error goal that depends on the additive noise. This general model can be used for any predefined or learned dictionary that operates patchwise in the image. However, it is clear that the model can be improved by considering dependencies between the coefficients within a patch or across the patches themselves [25].

    Image Inpainting: Image inpainting is a problem where the pixel values at some locations of the image are unknown and have to be predicted. Assuming the simple case where the values of random pixels in a patch are unknown, we can express the incomplete patch as

    $\mathbf{z} = \boldsymbol{\Phi}\mathbf{x} + \mathbf{n}, \qquad (1.6)$

    where $\mathbf{z} \in \mathbb{R}^N$ is the incomplete patch with the unknown entries of $\mathbf{x}$ removed, and $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$ is an identity matrix with the rows corresponding to the unknown pixel values removed. Since we know that $\mathbf{x}$ is sparsely representable, we can consider

    $\mathbf{z} = \boldsymbol{\Phi}\mathbf{D}\mathbf{a} + \mathbf{n}, \qquad (1.7)$

    which implies that $\mathbf{z}$ is sparsely representable using the equivalent dictionary $\boldsymbol{\Phi}\mathbf{D}$.

    Compressive Recovery: In compressive sensing, we sense the data $\mathbf{x}$ using a linear measurement system $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$, where $N < M$. The linear measurement matrix usually consists of independent entries drawn from a Gaussian or Bernoulli distribution [26], although designing deterministic measurement systems is also possible [27]. This can be described by an equation similar to (1.6) and (1.7), though the matrix $\boldsymbol{\Phi}$ here is different. Recovery is performed by computing a sparse representation using the equivalent dictionary. It is also possible to design the linear measurement system optimized for the dictionary so that the recovery performance improves [28].

    Note that for all three inverse problems discussed, i.e., denoising, inpainting, and compressive sensing, the recovery performance depends on the sparsity of the uncorrupted/complete patch $\mathbf{y}$ and on the dictionary used for recovery. For denoising, this will be the sparsifying dictionary $\mathbf{D}$, whereas, for compressive sensing and inpainting, this will be the equivalent dictionary $\boldsymbol{\Phi}\mathbf{D}$. The conditions on sparsity and the dictionary for the recovery of a unique representation are discussed in Chapter 2. A short numerical sketch of these recovery problems appears after the source separation discussion below.


    Source Separation: The source separation problem is different in spirit from the inverse problems discussed earlier. Assume that we have $K$ sparse sources given by the rows of $\mathbf{A} \in \mathbb{R}^{K \times T}$ and the mixing matrix (dictionary) given by $\mathbf{D}$. Note that the mixing matrix is usually overcomplete. The observations are noisy and mixed versions of the sources. The goal is to estimate the sources based on assumptions of their sparsity. While the inverse problems discussed above may or may not involve learning a dictionary, source separation requires that both the mixing matrix and the sparse sources be inferred from the observations.
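    The sketch below illustrates the shared structure of the denoising, inpainting, and compressive recovery problems on a synthetic patch. The dictionary, noise level, and the use of orthogonal matching pursuit (a greedy method discussed in Chapter 2) from scikit-learn are illustrative assumptions; any sparse recovery algorithm could be substituted.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp   # assumes scikit-learn is available

rng = np.random.default_rng(0)
M, K, S = 64, 128, 5                 # 8x8 patches, 2x overcomplete dictionary, sparsity level

# Hypothetical dictionary: random Gaussian atoms normalized to unit l2 norm
D = rng.standard_normal((M, K))
D /= np.linalg.norm(D, axis=0)

# Synthesize an S-sparse patch x = D a0
a0 = np.zeros(K)
a0[rng.choice(K, S, replace=False)] = rng.standard_normal(S)
x = D @ a0

# Denoising: recover from x + noise using the sparsifying dictionary D itself
x_noisy = x + 0.01 * rng.standard_normal(M)
a_dn = orthogonal_mp(D, x_noisy, n_nonzero_coefs=S)
print("denoising error:", np.linalg.norm(D @ a_dn - x))

# Inpainting / compressive sensing: observe z = Phi x and recover with the
# equivalent dictionary Phi D (Phi is a row-subsampled identity for inpainting,
# or a random Gaussian matrix for compressive sensing)
keep = rng.choice(M, size=40, replace=False)      # 40 of the 64 pixels are observed
Phi = np.eye(M)[keep]
z = Phi @ x
a_ip = orthogonal_mp(Phi @ D, z, n_nonzero_coefs=S)
print("inpainting error:", np.linalg.norm(D @ a_ip - x))
```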

    1.6 SPARSE MODELS FOR RECOGNITION

    The relevance of sparsity in machine learning systems has been studied extensively over the last decade. Deep Belief Networks (DBNs) have been used to effectively infer the data distribution and extract features in an unsupervised fashion [29]. Imposing sparsity constraints on DBNs for natural images will result in features that closely resemble their biological counterparts [30]. Sparse representations are also known to be more likely separable in high-dimensional spaces [31] and, hence, helpful in classification tasks. However, empirical studies designed to evaluate the actual importance of sparsity in image classification state the contrary.

    The authors of [32] argue that no performance improvement is gained by imposing sparsity in the feature extraction process when this sparsity is not tailored for discrimination. Before discussing the discriminative models using sparse representations, it is imperative to understand why sparse representations computed without any additional constraints cannot be directly used for discrimination. This can be understood by considering a simple analogy. Consider a set of low-level features needed to represent a human face. It can easily be seen that a general set of features can efficiently represent a human face. However, if we try to classify male and female faces, for instance, using those features directly will result in poor performance. For good classification performance, we may need to incorporate features that describe the discriminating characteristics between male and female faces, such as the presence of facial hair, the thickness of eyebrows, etc.

    Computer vision frameworks that use sparse representations can be divided into two broad categories: (a) those that incorporate explicit discriminative constraints between classes when learning dictionaries or computing sparse codes; and (b) those that use the sparse codes as low-level features to generalize bag-of-words-based models for use in classification. As a first example, let us consider a standard object recognition architecture based on sparse features, similar to the ones developed in [33, 34]. Given a known set of filters (dictionary atoms), sparse features are extracted by coding local regions in an image. Note that, in some cases, it may be possible to replace the computationally intensive sparse coding process by a simple convolution of the patches with the linear filters, for a small or no loss in performance. The image filters can be learned using dictionary learning procedures, or can be pre-determined based on knowledge of the data. The architecture illustrated in Figure 1.5 also includes a pre-processing step where operations such as conversion of color images to grayscale, whitening, and dimensionality reduction can be carried out.


    Figure 1.5: A biologically inspired standard object recognition architecture based on sparse representations: pre-processing, linear filters learned from unlabeled data, sparse feature extraction, a non-linear function, pooling, and classification.

    Sparse feature extraction is followed by the application of a non-linear function, such as taking the absolute values of the coefficients, and pooling. Note that this procedure is common in biologically inspired multi-layer architectures for recognition. For example, the models proposed in [35] start with grayscale pixels and successively compute the S (simple) and the C (complex) layers in an alternating manner. The simple layers typically apply local filters to compute higher-order features by combining various lower-order features in the previous layer. On the other hand, the complex layers attempt to achieve invariance to minor pose and appearance changes by pooling features of the same unit in the previous layer. Since the dimension of the descriptors is too high for practical applications, downsampling is usually performed. Some of the commonly used pooling functions include Gaussian pooling, average pooling, and maximum-value pooling (max-pooling) in the neighborhood. The final step in the architecture is to learn a classifier that discriminates the pooled features belonging to the different classes.
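    A toy version of the pipeline in Figure 1.5 is sketched below; all sizes and the random filters are illustrative assumptions. Filter responses on non-overlapping local regions stand in for sparse coding (the convolution shortcut mentioned earlier), followed by an absolute-value non-linearity and max-pooling over a grid of cells; the pooled vector would then be passed to a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
patch, K = 8, 64                        # 8x8 local regions, 64 hypothetical linear filters
filters = rng.standard_normal((K, patch * patch))
filters /= np.linalg.norm(filters, axis=1, keepdims=True)

def encode_and_pool(image, cell=4):
    """Filter responses per patch (stand-in for sparse coding), absolute-value
    non-linearity, and max-pooling over cell x cell blocks of patch descriptors."""
    H, W = image.shape
    rows, cols = H // patch, W // patch
    feats = np.zeros((rows, cols, K))
    for i in range(rows):
        for j in range(cols):
            region = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch].ravel()
            feats[i, j] = np.abs(filters @ region)        # responses + non-linearity
    pr, pc = rows // cell, cols // cell
    pooled = feats[:pr * cell, :pc * cell].reshape(pr, cell, pc, cell, K).max(axis=(1, 3))
    return pooled.ravel()                                  # descriptor fed to a classifier

descriptor = encode_and_pool(rng.standard_normal((64, 64)))
print(descriptor.shape)
```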

    1.6.1 DISCRIMINATIVE DICTIONARIES

    The general approaches in using sparse representations for discrimination applications involve either learning class-specific dictionaries or a single dictionary for multiple classes. Class-specific dictionaries are learned for the $C$ classes of the training data with the constraint that the reconstruction error $\|\mathbf{x} - \mathbf{D}_c\mathbf{a}_c\|_2^2$ is small for training data belonging to class $c$.

    When the solution is sparse, it can be computed using an optimization program similar to (2.1), with an additional non-negativity constraint. The convex program used to solve for this non-negative coefficient vector is given as

    $\min_{\mathbf{b}} \; \mathbf{1}^T \mathbf{b} \quad \text{subj. to } \mathbf{x} = \mathbf{G}\mathbf{b}, \; \mathbf{b} \geq 0. \qquad (2.10)$

    If the set

    $\{\mathbf{b} \mid \mathbf{x} = \mathbf{G}\mathbf{b}, \; \mathbf{b} \geq 0\} \qquad (2.11)$

    contains only one solution, any variational function on $\mathbf{b}$ can be used to obtain the solution [58, 59], and $\ell_1$ minimization is not required.
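    A minimal sketch of the program (2.10) using a generic LP solver is given below; the non-negative dictionary, the dimensions, and the use of SciPy's linprog are illustrative assumptions. The constraint $\mathbf{b} \geq 0$ coincides with linprog's default variable bounds.

```python
import numpy as np
from scipy.optimize import linprog      # assumes SciPy is available

rng = np.random.default_rng(0)
M, Kg, S = 20, 50, 3

# Non-negative dictionary G and a non-negative S-sparse vector b0
G = np.abs(rng.standard_normal((M, Kg)))
b0 = np.zeros(Kg)
b0[rng.choice(Kg, S, replace=False)] = rng.random(S) + 0.5
x = G @ b0

# (2.10): minimize 1^T b subject to x = G b and b >= 0
res = linprog(c=np.ones(Kg), A_eq=G, b_eq=x, bounds=(0, None), method='highs')
if res.success:
    print("recovered support:", np.flatnonzero(res.x > 1e-6))
    print("true support:     ", np.flatnonzero(b0))
```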


    2.2 GEOMETRICAL INTERPRETATION

    The generative model indicated in (1.1) with sparsity constraints is a non-linear model, because the set of all $S$-sparse vectors is not closed under addition. The sum of two $S$-sparse vectors generally results in a $2S$-sparse vector. Furthermore, sparse models are generalizations of linear subspace models, since each sparse pattern represents a subspace, and the union of all patterns represents a union of subspaces. Considering $S$-sparse coefficient vectors obtained from a dictionary of size $M \times K$, the data samples $\mathbf{x}$, obtained using the model (1.1), lie in a union of $\binom{K}{S}$ $S$-dimensional subspaces. In the case of a non-negative sparse model given by (2.9), for $S$-sparse representations, the data samples lie in a union of $\binom{K_g}{S}$ simplicial cones. Given a subset $\mathbf{G}_\Lambda$ of dictionary atoms, where $\Lambda$ is the set of $S$ indices corresponding to the non-zero coefficients, the simplicial cone generated by the atoms is given by

    $\left\{ \sum_{i \in \Lambda} b_i \mathbf{g}_i \;\middle|\; b_i \geq 0 \right\}. \qquad (2.12)$

    Note that a simplicial cone is a subset of the subspace spanned by the same dictionary atoms.

    Computing sparse representations using an $\ell_0$ minimization procedure incurs combinatorial complexity, as discussed before, and it is instructive to compare this complexity for the cases of general and non-negative representations. For a general representation, we need to identify both the support and the sign pattern, whereas for a non-negative representation, identification of the support alone is sufficient. The complexity of identifying the support alone for an $S$-sparse representation is $\binom{K}{S}$, and identification of the sign pattern along with the support incurs a complexity of $2^S \binom{K}{S}$. This is because there are $2K$ signs to choose from, and we cannot choose both positive and negative signs for the same coefficient. From Figure 2.3 it is clear that the complexity increases as the number of coefficients increases, and the non-negative sparse representation is far less complex compared to the general representation. However, we will see later that the general representation incurs lesser complexity, both with practical convex and greedy optimization procedures. The reason for this is that including additional constraints, such as non-negativity, in an optimization problem usually increases the complexity of the algorithmic procedure in arriving at an optimal solution.
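    The two counts can be compared directly; the sparsity levels chosen below are arbitrary, and $K = 200$ matches the dictionary size used in Figure 2.3.

```python
from math import comb

K = 200   # number of dictionary atoms, as in Figure 2.3

for S in (5, 10, 20, 40):
    support_only = comb(K, S)                  # non-negative case: choose the support
    support_and_sign = (2 ** S) * comb(K, S)   # general case: support plus a sign per coefficient
    print(f"S={S:3d}  non-negative: {support_only:.3e}  general: {support_and_sign:.3e}")
```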

    2.3 UNIQUENESS OF $\ell_0$ AND ITS EQUIVALENCE TO THE $\ell_1$ SOLUTION

    So far, we have discussed, at length, sparse representations and how to obtain them using $\ell_0$ minimization. However, it is also important to ensure that, for a given dictionary $\mathbf{D}$, such a representation is unique, and that both the $\ell_0$ and $\ell_1$ solutions, obtained respectively using (2.1) and (2.2), are equivalent to each other [60, 61].

    Figure 2.3: The number of patterns to be searched for if $\ell_0$ minimization is used to compute sparse representations, as a function of the number of non-zero coefficients. The size of the dictionary is $100 \times 200$. Computing non-negative representations always incurs lesser complexity compared to general representations.

    To analyze the uniqueness of the solution for an arbitrary (in this case overcomplete) dictionary $\mathbf{D}$, assume that there are two suitable representations for the input signal $\mathbf{x}$,

    $\exists \; \mathbf{a}_1 \neq \mathbf{a}_2 \text{ such that } \mathbf{x} = \mathbf{D}\mathbf{a}_1 = \mathbf{D}\mathbf{a}_2. \qquad (2.13)$

    Hence, the difference of the representations, $\mathbf{a}_1 - \mathbf{a}_2$, must be in the null space of the dictionary, $\mathbf{D}(\mathbf{a}_1 - \mathbf{a}_2) = \mathbf{0}$. This implies that some group of elements in the dictionary should be linearly dependent. To quantify the relation, we define the spark of a matrix. Given a matrix, the spark is defined as the smallest number of columns that are linearly dependent. This is quite different from the rank of a matrix, which is the largest number of columns that are linearly independent. If a signal has two different representations, as in (2.13), we must have $\|\mathbf{a}_1\|_0 + \|\mathbf{a}_2\|_0 \geq \text{spark}(\mathbf{D})$. From this argument, if any representation exists satisfying the relation $\|\mathbf{a}_1\|_0 < \text{spark}(\mathbf{D})/2$, then, for any other representation $\mathbf{a}_2$, we have $\|\mathbf{a}_2\|_0 > \text{spark}(\mathbf{D})/2$. This indicates that the sparsest representation is $\mathbf{a}_1$. To conclude, a representation is the sparsest possible if $\|\mathbf{a}\|_0 < \text{spark}(\mathbf{D})/2$. Assuming that the dictionary atoms are normalized to unit $\ell_2$ norm, let us define the Gram matrix for the dictionary as $\mathbf{H} = \mathbf{D}^T\mathbf{D}$ and denote the coherence $\mu$ as the maximum magnitude of the off-diagonal elements,

    $\mu = \max_{i \neq j} |h_{i,j}|. \qquad (2.14)$


Figure 2.4: Deterministic sparsity threshold with respect to the coherence of the dictionary.

Assuming that the dictionary atoms are normalized to unit $\ell_2$ norm, let us define the Gram matrix of the dictionary as $\mathbf{H} = \mathbf{D}^T\mathbf{D}$ and denote the coherence as the maximum magnitude of the off-diagonal elements,
$$\mu = \max_{i \neq j} |h_{i,j}|. \qquad (2.14)$$
Then we have the bound $\mathrm{spark}(\mathbf{D}) \geq 1 + 1/\mu$ [60], and, hence, it can be inferred that the representation obtained from $\ell_0$ minimization is unique and equivalent to the $\ell_1$ solution if
$$\|\mathbf{a}\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu}\right). \qquad (2.15)$$

This is referred to as the deterministic sparsity threshold, since it holds true for all sparsity patterns and non-zero values in the coefficient vectors. The threshold is illustrated in Figure 2.4 for various values of $\mu$. Since the general representation also encompasses the non-negative case, the same bound holds true for non-negative sparse representations. Furthermore, the threshold is the same for $\ell_1$ minimization as well as for greedy recovery algorithms such as Orthogonal Matching Pursuit (OMP). The deterministic sparsity threshold scales at best as $\sqrt{M}$ with increasing values of $M$. Probabilistic or robust sparsity thresholds, on the other hand, scale on the order of $M/\log K$ [62] and break this square-root bottleneck. However, the trade-off is that unique recovery using $\ell_1$ minimization is only assured with high probability, and robust sparsity thresholds for unique recovery using greedy algorithms, such as OMP, are not known.


2.3.1 PHASE TRANSITIONS

Deterministic thresholds are too pessimistic and, in reality, the performance of sparse recovery is much better than that predicted by the theory. Robust sparsity thresholds are better, but still restrictive, as they are not available for several greedy sparse recovery algorithms. Phase-transition diagrams describe sparsity thresholds at which a recovery algorithm transitions from a high probability of success to a high probability of failure, for values of the ratio $M/K$ (undersampling factor) ranging from 0 to 1. For random dictionaries and coefficient vectors whose entries are realized from various probability distributions, empirical phase transitions can be computed by finding the points at which the fraction of successful sparse recoveries is 0.5, with respect to a finely spaced grid of sparsity and undersampling factors [63].

Asymptotic phase transitions can be computed based on the theory of polytopes [58, 63, 64, 65]. For $K \to \infty$, when the dictionary entries are derived from $\mathcal{N}(0,1)$ and the non-zero coefficients are signs ($\pm 1$), the asymptotic phase transitions for $\ell_1$ minimization algorithms are shown in Figure 2.5. It can easily be shown that unconstrained $\ell_1$ minimization corresponds to the cross-polytope object, and non-negative $\ell_1$ minimization corresponds to the simplex object. Clearly, imposing the non-negativity constraint gives an improved phase transition when compared to the unconstrained case. Furthermore, empirical phase transitions computed for Rademacher, partial Hadamard, Bernoulli, and random ternary ensembles are similar to those computed for Gaussian i.i.d. ensembles [63]. The phase transitions of several greedy recovery algorithms have been analyzed and presented in [66], and optimal tuning of sparse approximation algorithms that use iterative thresholding has been performed by studying their phase-transition characteristics [67].
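A rough sketch of how such an empirical transition can be estimated is given below; it is a simplified illustration (not the procedure of [63]), using a linear-programming $\ell_1$ solver and exact-recovery checks over a sweep of sparsity levels at one undersampling factor:

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(D, x):
    """min ||a||_1 s.t. D a = x, via the split a = p - q with p, q >= 0."""
    M, K = D.shape
    res = linprog(np.ones(2 * K), A_eq=np.hstack([D, -D]), b_eq=x,
                  bounds=[(0, None)] * (2 * K), method="highs")
    return res.x[:K] - res.x[K:]

def empirical_success(M, K, S, trials=25, seed=0):
    """Fraction of exact recoveries for S-sparse sign vectors and an i.i.d.
    Gaussian dictionary; the empirical phase transition is the sparsity at
    which this fraction crosses 0.5."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        D = rng.standard_normal((M, K))
        a0 = np.zeros(K)
        support = rng.choice(K, size=S, replace=False)
        a0[support] = rng.choice([-1.0, 1.0], size=S)
        a_hat = l1_min(D, D @ a0)
        hits += np.linalg.norm(a_hat - a0) < 1e-4 * np.sqrt(S)
    return hits / trials

# Sweep the sparsity S for a fixed undersampling factor M/K = 0.5.
M, K = 40, 80
for S in range(2, 21, 3):
    print(S, empirical_success(M, K, S))
```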

2.4 NUMERICAL METHODS FOR SPARSE CODING

When the $\ell_0$ norm is used as the cost function, exact determination of the sparsest representation is an NP-hard problem [68], and the complexity of the search becomes intractable even for a moderate number of non-zero coefficients, as evident from Figure 2.3. Hence, a number of numerical algorithms that use $\ell_1$ approximation and greedy procedures have been developed to solve these problems. Note that most of these algorithms have a non-negative counterpart, and it is straightforward to develop them by appropriately placing the non-negativity constraint.

Some of the widely used methods for computing sparse representations include Matching Pursuit (MP) [69], Orthogonal Matching Pursuit (OMP) [70], Basis Pursuit (BP) [71], FOCUSS [72], Feature-Sign Search (FSS) [73], Least Angle Regression (LARS) [74], and iterated shrinkage algorithms [75, 76]. Before describing the sparse coding algorithms, we will present an overview of the optimality conditions used by these procedures.

The recovery performance of $\ell_1$ minimization (BP) and greedy (OMP) methods is compared in Figure 2.6 for general and non-negative representations. It can be seen that non-negative representations result in better recovery performance when compared to general representations, both with BP and with OMP.


Figure 2.5: Asymptotic phase transitions (sparsity $S/M$ versus undersampling factor $M/K$) for non-negative (simplex) and general $\ell_1$ minimization (cross-polytope), when the dictionary elements are derived from i.i.d. Gaussian $\mathcal{N}(0,1)$ and the non-zero coefficients are signs.

2.4.1 OPTIMALITY CONDITIONS

Considering the $\ell_0$ optimization problem in (2.1), it can be shown that a unique and, hence, optimal solution can be obtained using the BP, MP, and OMP algorithms [70, 77] if the condition given in (2.15) is satisfied. For algorithms that use the penalized $\ell_1$ formulation given in (2.4), the optimality condition is obtained by computing the sub-gradient set
$$2\mathbf{D}^T(\mathbf{D}\mathbf{a} - \mathbf{x}) + \lambda\mathbf{p}, \qquad (2.16)$$
and ensuring that it contains the zero vector. Here p is the sub-gradient of $\|\mathbf{a}\|_1$ and is defined element-wise as
$$p_i \in \begin{cases} \{+1\}, & a_i > 0, \\ [-1, +1], & a_i = 0, \\ \{-1\}, & a_i < 0. \end{cases} \qquad (2.17)$$
Hence, the optimality conditions can be simplified as [78]
$$2\,\mathbf{d}_i^T(\mathbf{x} - \mathbf{D}\mathbf{a}) = \lambda\,\mathrm{sign}(a_i), \quad \text{if } a_i \neq 0, \qquad (2.18)$$
$$2\,|\mathbf{d}_i^T(\mathbf{x} - \mathbf{D}\mathbf{a})| < \lambda, \quad \text{otherwise}. \qquad (2.19)$$


Table 2.1: Matching pursuit algorithm

Input: Input signal, $\mathbf{x} \in \mathbb{R}^M$; dictionary, $\mathbf{D}$.
Output: Coefficient vector, $\mathbf{a} \in \mathbb{R}^K$.
Initialization: Initial residual: $\mathbf{r}^{(0)} = \mathbf{x}$. Initial coefficient vector: $\mathbf{a} = \mathbf{0}$. Loop index: $l = 1$.
while convergence not reached
    Determine an index $k_l$: $k_l = \arg\max_k |\langle \mathbf{r}^{(l-1)}, \mathbf{d}_k\rangle|$.
    Update the coefficient: $a_{k_l} \leftarrow a_{k_l} + \langle \mathbf{r}^{(l-1)}, \mathbf{d}_{k_l}\rangle$.
    Compute the new residual: $\mathbf{r}^{(l)} = \mathbf{r}^{(l-1)} - \langle \mathbf{r}^{(l-1)}, \mathbf{d}_{k_l}\rangle \mathbf{d}_{k_l}$.
    Update the loop counter: $l = l + 1$.
end

These conditions are used as criteria for optimality and for coefficient selection by the LARS and FSS algorithms.
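A direct way to use (2.18) and (2.19) is as a post-hoc check on a candidate solution. The sketch below is an illustration only, assuming the scaling of the penalized objective used above ($\|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2 + \lambda\|\mathbf{a}\|_1$):

```python
import numpy as np

def check_l1_optimality(D, x, a, lam, tol=1e-6):
    """Verify the conditions (2.18)-(2.19) for a candidate solution of the
    penalized l1 problem  ||x - D a||_2^2 + lam * ||a||_1."""
    c = 2.0 * D.T @ (x - D @ a)          # correlation of the atoms with the residual
    active = np.abs(a) > tol
    cond_active = np.allclose(c[active], lam * np.sign(a[active]), atol=1e-4)
    cond_inactive = np.all(np.abs(c[~active]) <= lam + tol)
    return cond_active and cond_inactive
```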

    2.4.2 BASIS PURSUIT

Basis Pursuit (BP) is a linear programming approach [79] that solves (2.2) to find the minimum $\ell_1$ norm representation of the signal [71].

A standard linear program (LP) is a constrained optimization problem with an affine objective and affine constraints. We can convert the problem in (2.2) into a standard LP by adding an auxiliary variable $\mathbf{u} \in \mathbb{R}^K$:
$$\min_{\mathbf{a},\,\mathbf{u}}\ \mathbf{1}^T\mathbf{u} \quad \text{subject to} \quad \mathbf{D}\mathbf{a} = \mathbf{x}, \ \ \mathbf{u} \geq \mathbf{0}, \ \ -\mathbf{u} \leq \mathbf{a} \leq \mathbf{u}. \qquad (2.20)$$

Relating (2.2) to this LP, the problem is to identify which elements in a should be zero. To solve this, both simplex methods and interior point methods have been used [71]. Geometrically, the collection of feasible points is a convex polyhedron, or simplex. The simplex method works by moving around the boundary of this polyhedron, jumping from one vertex to another where the objective is better. Interior point methods, on the contrary, start from the interior of the feasible set; since the solution of an LP lies at an extreme point, the estimate moves toward the boundary as the algorithm converges. When the representation is not exact and the error goal is known, the sparse code can be computed by solving the quadratically constrained problem
$$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \|\mathbf{a}\|_1 \quad \text{subject to} \quad \|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2 \leq \epsilon. \qquad (2.21)$$


This is referred to as Basis Pursuit Denoising (BPDN), and its solution can be obtained using several efficient procedures [71]. For a data vector x of dimensionality $M$, corrupted with additive white Gaussian noise (AWGN) of variance $\sigma^2$, the squared error goal $\epsilon^2$ is fixed at $(1.15\sigma)^2 M$. Using this error goal ensures that each component of the AWGN vector lies within the range $[-1.15\sigma, 1.15\sigma]$ with a probability of 0.75. Hence, there is a very low chance of picking noise as part of the representation. For the non-negative case, the problem given in (2.10) can be directly solved as an LP, and the BPDN problem can also be modified easily to incorporate non-negativity.
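As an illustration of the LP reformulation in (2.20), the following sketch solves the equality-constrained BP problem with a generic LP solver (here scipy.optimize.linprog, an assumption of convenience; any LP solver would serve):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, x):
    """Solve min ||a||_1 s.t. D a = x by recasting it as the LP in (2.20),
    with the stacked variable [a; u]."""
    M, K = D.shape
    c = np.concatenate([np.zeros(K), np.ones(K)])   # minimize 1^T u
    A_eq = np.hstack([D, np.zeros((M, K))])         # D a = x
    I = np.eye(K)
    A_ub = np.vstack([np.hstack([I, -I]),           #  a - u <= 0
                      np.hstack([-I, -I])])         # -a - u <= 0
    b_ub = np.zeros(2 * K)
    bounds = [(None, None)] * K + [(0, None)] * K   # a free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=x,
                  bounds=bounds, method="highs")
    return res.x[:K]
```

The non-negative variant follows by restricting the bounds on a to $(0, \infty)$ and dropping the auxiliary variable.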

2.4.3 GREEDY PURSUIT METHODS

Greedy procedures for computing sparse representations operate by choosing the atom that is most strongly correlated with the current target vector, removing its contribution, and iterating. Hence, they make a sequence of locally optimal choices in an effort to obtain a globally optimal solution. There are several versions of such greedy algorithms: some are aggressive and remove all of the contribution of the chosen atom from the target, while others are less aggressive and remove only a part of the contribution. For orthonormal dictionaries, even the most aggressive greedy algorithms perform well, whereas, for overcomplete dictionaries, those that are more careful in fixing the coefficients result in a better approximation. The greedy methods for solving sparse approximation problems generalize this simple idea to the case of an arbitrary dictionary. A clear advantage of greedy algorithms is the flexibility to fix the number of non-zero coefficients and/or the error goal $\epsilon$, which is not the case with BP or BPDN.

    Matching Pursuit (MP)

This is the simplest of the greedy pursuit algorithms and was proposed by Mallat and Zhang [69]. The steps involved in the MP algorithm are shown in Table 2.1. The algorithm begins by setting the initial residual to the input signal x and making a trivial initial approximation. It then iteratively chooses the atom best correlated with the residual and updates the residual correspondingly. In every iteration, the algorithm takes into account only the current residual and not the previously selected atoms, which makes this step greedy. It is important to note that MP might select the same dictionary atom many times when the dictionary is not orthogonal. The contribution of an atom to the approximation is the inner product itself, as in the case of an orthogonal expansion. The residual is updated by removing the contribution of the component in the direction of the atom $\mathbf{d}_{k_l}$. At each step of the algorithm, a new approximant of the target signal, $\mathbf{x}^{(l)}$, is calculated based on the relationship
$$\mathbf{x}^{(l)} = \mathbf{x} - \mathbf{r}^{(l)}. \qquad (2.22)$$
When the dictionary is orthonormal, the representation $\mathbf{x}^{(S)}$ is always an optimal and unique $S$-term representation of the signal [70]. It has been shown that, for general dictionaries, the norm of the residual converges to zero [80].


Table 2.2: Orthogonal matching pursuit algorithm

Input: Input signal, $\mathbf{x} \in \mathbb{R}^M$; dictionary, $\mathbf{D}$.
Output: Coefficient vector, $\mathbf{a} \in \mathbb{R}^K$.
Initialization: Initial residual: $\mathbf{r}^{(0)} = \mathbf{x}$. Index set: $\Lambda = \{\}$. Loop index: $l = 1$.
while convergence not reached
    Determine an index $k_l$: $k_l = \arg\max_k |\langle \mathbf{r}^{(l-1)}, \mathbf{d}_k\rangle|$.
    Update the index set: $\Lambda \leftarrow \Lambda \cup \{k_l\}$.
    Compute the coefficients: $\mathbf{a} = \arg\min_{\mathbf{a}} \big\|\mathbf{x} - \sum_{j=1}^{l} a_{k_j}\mathbf{d}_{k_j}\big\|_2$.
    Compute the new residual: $\mathbf{r}^{(l)} = \mathbf{x} - \sum_{j=1}^{l} a_{k_j}\mathbf{d}_{k_j}$.
    Update the loop counter: $l = l + 1$.
end

Orthogonal Matching Pursuit (OMP)

This algorithm introduces a least squares minimization at each step of the MP algorithm, in order to ensure that the best approximation is obtained over the atoms that have already been chosen [81, 82, 83]. It is also referred to as the forward selection algorithm [84], and the steps involved are provided in Table 2.2. The initialization of OMP is similar to that of MP, with the difference that an index set $\Lambda$ is initialized to hold the indices of the atoms chosen at every step. The atom selection procedure is also greedy, as in MP. The index set is updated by adding the index of the currently chosen atom to the list of atoms already chosen. Unlike the MP algorithm, where the correlations themselves were the coefficients, OMP computes the coefficients by solving a least squares problem. The most important behavior of OMP is that the greedy selection always picks an atom that is linearly independent of the atoms already chosen. In other words,
$$\langle \mathbf{r}^{(l)}, \mathbf{d}_{k_j}\rangle = 0 \quad \text{for } j = 1, \ldots, l. \qquad (2.23)$$
As a result, the residual must equal zero within $M$ steps. Further, the atoms corresponding to the index set $\Lambda$ form a full-rank matrix and, hence, the least squares solution is unique. Since the residual is orthogonal to the atoms already chosen, OMP ensures that an atom is not chosen more than once in the approximation. The solution of the least squares problem determines the


approximant,
$$\mathbf{x}^{(l)} = \sum_{j=1}^{l} a_{k_j} \mathbf{d}_{k_j}. \qquad (2.24)$$

A non-negative version of the OMP algorithm has been proposed in [85] as a greedy procedure for recovering non-negative sparse representations. Several pursuit methods have been proposed more recently to improve upon the performance of OMP. One such procedure, Compressive Sampling Matching Pursuit (CoSaMP) [86], follows a procedure similar to OMP, but selects multiple columns of the dictionary and prunes the set of active columns in each step.
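A compact sketch of the steps in Table 2.2 follows; it is an illustration only, with the least squares step solved by np.linalg.lstsq over the currently selected atoms:

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, n_nonzero, tol=1e-6):
    """OMP as outlined in Table 2.2: greedy atom selection followed by a least
    squares fit over all atoms selected so far."""
    K = D.shape[1]
    a = np.zeros(K)
    r = x.copy()
    support = []
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ r)))          # greedy atom selection
        if k not in support:                          # guard against numerical reselection
            support.append(k)
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        r = x - D[:, support] @ coeffs                # residual orthogonal to chosen atoms
        if np.linalg.norm(r) <= tol:
            break
    a[support] = coeffs
    return a
```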

Simultaneous Orthogonal Matching Pursuit (S-OMP)

Using a set of $T$ observations from a single source, we build the signal matrix $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \ldots\ \mathbf{x}_T]$. The S-OMP algorithm identifies a representation such that all the signals in X use the same set of dictionary atoms from D with different coefficients, minimizing the error given by $\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2$. Here, $\mathbf{A} \in \mathbb{R}^{K \times T}$ denotes the coefficient matrix, where each column corresponds to the coefficients of a single observation, and $\|\cdot\|_F$ refers to the Frobenius norm. This algorithm reduces to simple OMP when $T = 1$. The atom selection step is also greedy, similar to the MP and OMP algorithms discussed earlier. Each atom is chosen such that it has the maximum sum of absolute correlations with the current target signal matrix. The basic intuition behind this is that the chosen atom will capture a large amount of energy from each column of the current target matrix [87]. Hence, this approach is very effective if the same set of atoms can be used to approximate all the columns of X.

    Least Angle Regression (LARS)

The LARS procedure computes the solution to the penalized $\ell_1$ optimization in (2.4) by selecting the active coefficients one at a time and performing a least squares procedure at every step to re-estimate them. This method is closely related to OMP, or the forward selection algorithm, but it is less greedy in reducing the residual error. Another method closely related to OMP and LARS is the forward stagewise algorithm, which takes thousands of tiny steps as it moves toward the solution. All of these algorithms update their current approximant of the signal by taking a step in the direction of the dictionary atom most correlated with the current residual,
$$\mathbf{p}^{(l+1)} = \mathbf{p}^{(l)} + \gamma\,\mathrm{sign}\big(\langle \mathbf{r}^{(l)}, \mathbf{d}_{k_l}\rangle\big)\,\mathbf{d}_{k_l}, \quad \text{where } k_l = \arg\max_j \big|\langle \mathbf{r}^{(l)}, \mathbf{d}_j\rangle\big|, \qquad (2.25)$$
with $\mathbf{p}^{(l)}$ denoting the current approximant and $\gamma$ the step size. For MP, the step size $\gamma$ is the same as $|\langle \mathbf{r}^{(l)}, \mathbf{d}_{k_l}\rangle|$, whereas for OMP the current approximant is re-estimated using a least squares method. The forward stagewise procedure is overly careful and fixes $\gamma$ as a small fraction of the MP step size. LARS is a method that strikes a balance in step-size selection.

The idea behind the LARS algorithm can be described as follows. The coefficients are initialized to zero and the dictionary atom $\mathbf{d}_{k_1}$ that is most correlated with the signal x is chosen.


The largest step possible is taken in the direction of $\mathbf{d}_{k_1}$ until another dictionary atom $\mathbf{d}_{k_2}$ has as much correlation as $\mathbf{d}_{k_1}$ with the current residual. The algorithm then proceeds in the equiangular direction between $\mathbf{d}_{k_1}$ and $\mathbf{d}_{k_2}$ until a new dictionary atom $\mathbf{d}_{k_3}$ has as much correlation with the residual as the other two. Therefore, at any step in the process, LARS proceeds equiangularly between the currently chosen atoms. The difference between MP and LARS is that, in the former case, a step is taken in the direction of the atom with maximum correlation with the residual, whereas, in the latter case, the step is taken in the direction that is equiangular to all the atoms that are most correlated with the residual. Note that the LARS algorithm always adds coefficients to the active set and never removes any, and, hence, it can only produce a sub-optimal solution for (2.4). Considering the optimality conditions in (2.18) and (2.19), the LARS algorithm implicitly checks all of them except the sign condition in (2.18). Hence, in order to make the LARS solution identical to the optimal solution of (2.4), a small modification that provides a way to delete coefficients from the current non-zero coefficient set has to be included.

The LARS algorithm can produce a whole set of solutions for (2.4), for $\lambda$ varying between 0, where $M$ dictionary atoms are chosen and a least squares solution is obtained, and its original value, where a sparse solution is obtained. If the computations are arranged properly, LARS is a computationally cheap procedure, with complexity of the same order as that of least squares.
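If scikit-learn is available (an assumption of convenience; the text does not prescribe an implementation), the solution path just described can be obtained with lars_path, where the "lasso" variant includes the coefficient-deletion modification mentioned above:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)
# A synthetic signal that is sparsely decomposable in D.
x = D @ np.where(rng.random(128) < 0.05, rng.standard_normal(128), 0.0)

# method="lasso" allows coefficients to leave the active set, so the returned
# path contains solutions of the penalized l1 problem for decreasing penalties.
alphas, active, coefs = lars_path(D, x, method="lasso")
print(coefs.shape)   # (128, number of breakpoints along the path)
```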

    2.4.4 FEATURE-SIGN SEARCH

Feature-sign search is an efficient sparse coding method that solves the penalized $\ell_1$ optimization in (2.4) by maintaining an active set of non-zero coefficients and their corresponding signs, and searching for the optimal active set and coefficient signs. In each feature-sign step, given an active set, an analytical solution is computed for the resulting quadratic program. Then, the active set and the signs are updated using a discrete line search and by selecting coefficients that promote optimality of the objective. In this algorithm, the sparse code a is initialized to the zero vector and the active set $\mathcal{A}$ is initialized to the empty set. The sign vector is denoted by $\boldsymbol{\theta}$ and its entries are defined as $\theta_i = \mathrm{sign}(a_i)$, each taking a value from $\{-1, 0, +1\}$. The algorithm consists of the following steps.

Step 1: An element of the non-active set that results in the maximum change in the error term is chosen as
$$i = \arg\max_i \left| \frac{\partial \|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2}{\partial a_i} \right|, \qquad (2.26)$$
and $i$ is added to the active set only if doing so locally improves the objective, i.e., if the sub-gradient in (2.16) can be made less than zero for that index. Hence, we append $i$ to $\mathcal{A}$ if $\partial\|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2/\partial a_i > \lambda$, setting $\theta_i = -1$, or if $\partial\|\mathbf{x} - \mathbf{D}\mathbf{a}\|_2^2/\partial a_i < -\lambda$, setting $\theta_i = +1$.

Step 2: Using coefficients only from the active set, let $\hat{\mathbf{D}}$ be the sub-dictionary corresponding to the active set, and let $\hat{\mathbf{a}}$, $\hat{\boldsymbol{\theta}}$ be the active coefficient and sign vectors. The strategy used in this feature-sign step is to find the new set of coefficients that have the sign pattern consistent with


Figure 2.6: Recovery performance for general and non-negative sparse representations. The dictionary is obtained as realizations from $\mathcal{N}(0,1)$ and is of size $100 \times 200$. Non-zero coefficients are realized from a uniform random distribution. The performance is averaged over 100 iterations for each sparsity level.

the sign vector. In order to do this, we first solve the unconstrained optimization
$$\min_{\hat{\mathbf{a}}}\ \|\mathbf{x} - \hat{\mathbf{D}}\hat{\mathbf{a}}\|_2^2 + \lambda\,\hat{\boldsymbol{\theta}}^T\hat{\mathbf{a}} \qquad (2.27)$$
and obtain $\hat{\mathbf{a}}_{new}$. Denoting the objective in (2.27) as $f(\hat{\mathbf{a}}, \hat{\boldsymbol{\theta}})$, it is easy to see that $f(\hat{\mathbf{a}}_{new}, \hat{\boldsymbol{\theta}}) \leq f(\hat{\mathbf{a}}, \hat{\boldsymbol{\theta}})$; the active set and the signs are then updated using the discrete line search between $\hat{\mathbf{a}}$ and $\hat{\mathbf{a}}_{new}$ outlined above.

$\Lambda_0 = \{i : \|\mathbf{x}_i\|_2^2 > \epsilon,\ 1 \leq i \leq T\}$, the indices of the training vectors with squared norm greater than the error goal.
$\hat{\mathbf{R}}_0 = [\mathbf{r}_{0,i}]_{i \in \Lambda_0}$.
while $\Lambda_{l-1} \neq \emptyset$ and $l \leq L$
    Initialize: $\mathbf{A}_l$, coefficient matrix of size $K_l \times T$, all zeros; $\mathbf{R}_l$, residual matrix for level $l$, of size $M \times T$, all zeros.
    $\{\mathbf{D}_l, \hat{\mathbf{A}}_l\} = \mathrm{KLC}(\hat{\mathbf{R}}_{l-1}, K_l)$.
    $\mathbf{R}^t_l = \hat{\mathbf{R}}_{l-1} - \mathbf{D}_l\hat{\mathbf{A}}_l$.
    $\mathbf{r}_{l,i} = \mathbf{r}^t_{l,j}$ and $\mathbf{a}_{l,i} = \hat{\mathbf{a}}_{l,j}$, where $i = \Lambda_{l-1}(j)$, for all $j = 1, \ldots, |\Lambda_{l-1}|$.
    $\Lambda_l = \{i : \|\mathbf{r}_{l,i}\|_2^2 > \epsilon,\ 1 \leq i \leq T\}$.
    $\hat{\mathbf{R}}_l = [\mathbf{r}_{l,i}]_{i \in \Lambda_l}$.
    $l \leftarrow l + 1$.
end

Equation (3.25) states that the energy of any training vector is equal to the sum of the squares of its coefficients and the energy of its residual. From (3.24), we also have that
$$\|\mathbf{R}_{l-1}\|_F^2 = \|\mathbf{D}_l\mathbf{A}_l\|_F^2 + \|\mathbf{R}_l\|_F^2. \qquad (3.26)$$
A multilevel dictionary, along with the activity measure of its atoms given by (3.20), learned using this algorithm with the same training set as the K-SVD dictionary obtained in Section 3.2.2, is shown in Figure 3.5. The level-wise representation energy and the average activity measure for each level of the learned MLD are given in Figure 3.6, which clearly shows the energy hierarchy of the learning algorithm. This also demonstrates that the algorithm learns geometric patterns in


Figure 3.5: An example MLD with 16 levels of 16 atoms each (left), with the leftmost column indicating atoms in level 1, proceeding toward level 16 in the rightmost column. The dictionary comprises geometric patterns in the first few levels, stochastic textures in the last few levels, and a combination of both in the middle levels, as quantified by its activity measure (right).

the first few levels, stochastic textures in the last few levels, and a hybrid set of patterns in the middle levels.

Sparse Approximation using an MLD

Sparse approximation for any test data can be performed by stacking all the levels of an MLD into a single dictionary and using any standard pursuit algorithm on D. Though this implementation is straightforward, it does not exploit the energy hierarchy observed in the learning process. The Regularized Multilevel OMP (RM-OMP) procedure incorporates the energy hierarchy into the pursuit scheme by evaluating a 1-sparse representation for each level $l$ using the sub-dictionary $\tilde{\mathbf{D}}_l$, and orthogonalizing the residual to the dictionary atoms chosen so far. In order to introduce flexibility into the representation, this sub-dictionary is built using atoms selected from the current level as well as the $u$ immediately preceding and following levels, i.e., $\tilde{\mathbf{D}}_l = [\mathbf{D}_{l-u}\ \mathbf{D}_{l-(u-1)}\ \ldots\ \mathbf{D}_l\ \ldots\ \mathbf{D}_{l+(u-1)}\ \mathbf{D}_{l+u}]$. Considering an MLD with $L$ levels and $K/L$ atoms per level, the complexity of choosing $S$ dictionary atoms using the RM-OMP algorithm is $S(2u+1)KM/L$, where $M$ is the dimensionality of the dictionary atoms. In contrast, the complexity of dictionary atom selection for the OMP algorithm is $SKM$, for a dictionary of size $M \times K$. If $u$ is chosen to be much smaller than $L$, we have $(2u+1)/L < 1$, and, hence, the savings in computation for the RM-OMP algorithm are significant in comparison to standard OMP.
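A minimal sketch of how such a level-window sub-dictionary could be assembled is shown below; level_dicts and level_subdictionary are hypothetical names, and boundary levels are simply clipped here (the text does not specify the boundary handling):

```python
import numpy as np

def level_subdictionary(level_dicts, l, u):
    """Stack the per-level dictionaries of levels l-u, ..., l+u into the
    sub-dictionary used for the 1-sparse selection at level l."""
    L = len(level_dicts)
    lo, hi = max(0, l - u), min(L - 1, l + u)   # clip the window at the boundaries
    return np.hstack(level_dicts[lo:hi + 1])

# Atom-selection cost ratio of RM-OMP to standard OMP: (2u + 1) / L.
L, u = 16, 1
print((2 * u + 1) / L)   # 0.1875, i.e., roughly a 5x reduction
```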


Figure 3.6: (a) Level-wise representation energy for the MLD learned with the BSDS training data set. (b) The level-wise activity measure shows that the atoms slowly change from geometric to texture-like as the level progresses.

3.2.4 ONLINE DICTIONARY LEARNING

The dictionary learning algorithms described so far are batch procedures, since they require access to the whole training set in order to minimize the cost function. Hence, the size of the data set that can be used for training is limited. In order for the learning procedure to scale up to millions of training samples, we need an online algorithm that is efficient, both in terms of memory and


computations. We will discuss one such procedure, proposed by Mairal et al. [104], based on stochastic approximations.

Although dictionary learning results in a solution that minimizes the empirical cost $g$ given in (3.2), it will approach the solution that minimizes the expected cost $\hat{g}$ given in (3.1) as the number of samples $T \to \infty$. The idea of online learning is to use a well-designed stochastic gradient algorithm that can result in a lower expected cost when compared to an accurate empirical minimization procedure [105]. Furthermore, for large values of $T$, empirical minimization of (3.2) using a batch algorithm becomes computationally infeasible, and it is necessary to use an online algorithm.

The online algorithm alternates between computing the sparse code for the $t$-th training sample, $\mathbf{x}_t$, with the current estimate of the dictionary $\mathbf{D}_{t-1}$, and updating the new dictionary $\mathbf{D}_t$ by minimizing the objective
$$\hat{g}_t(\mathbf{D}) \triangleq \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|\mathbf{x}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1\right), \qquad (3.27)$$

along with the constraint that the columns of D have unit $\ell_2$ norm. The sparse codes for the samples $i < t$ are retained from the previous steps of the algorithm. The online algorithm for dictionary learning is given in Table 3.3. The dictionary update using a warm restart is performed by updating each column separately; the detailed algorithm for this update procedure is available in [104]. The algorithm is quite general and can be tuned to certain special cases. For example, when a fixed-size dataset is used, we can randomly cycle through the data multiple times and also remove the contribution of previous cycles. When the dataset is huge, the computations can be sped up by processing mini-batches instead of one sample at a time. In addition, to improve the robustness of the algorithm, unused dictionary atoms must be removed and replaced with randomly chosen samples from the training set, in a manner similar to clustering procedures. Since the reliability of the dictionary improves over iterations, the initial iterations may be slowed by adjusting the step size and down-weighting the contributions of the previous data.

In [104], it is proved that this algorithm converges to a stationary point of the objective function. This is shown by proving that, under the assumptions of compact support for the data, convexity of the functions $\hat{g}_t$, and uniqueness of the sparse coding solution, $\hat{g}_t$ acts as a converging surrogate of $g$ as the total number of training samples $T \to \infty$.

3.2.5 LEARNING STRUCTURED SPARSE MODELS

The generative model in (1.1) assumes that the data is generated as a sparse linear combination of dictionary atoms. When solving an inverse problem, the observations are usually a degraded version of the data, denoted as
$$\mathbf{z} = \boldsymbol{\Phi}\mathbf{x} + \mathbf{n}, \qquad (3.28)$$
where $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$ is the degradation operator with $N \leq M$, and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_N)$. Considering the case of images, x usually denotes an image patch having the sparse representation Da, and


Table 3.3: Online dictionary learning

Initialization: $T$ training samples drawn from the distribution $p(\mathbf{x})$. Initial dictionary $\mathbf{D}^{(0)} \in \mathbb{R}^{M \times K}$ with normalized columns. $\lambda$, regularization parameter. Set $\mathbf{B} \in \mathbb{R}^{K \times K}$ and $\mathbf{C} \in \mathbb{R}^{M \times K}$ to zero matrices.
for $t = 1$ to $T$
    Draw the training sample $\mathbf{x}_t$ from $p(\mathbf{x})$.
    Sparse coding: $\mathbf{a}_t = \arg\min_{\mathbf{a}} \frac{1}{2}\|\mathbf{x}_t - \mathbf{D}_{t-1}\mathbf{a}\|_2^2 + \lambda\|\mathbf{a}\|_1$.
    $\mathbf{B}_t = \mathbf{B}_{t-1} + \mathbf{a}_t\mathbf{a}_t^T$.
    $\mathbf{C}_t = \mathbf{C}_{t-1} + \mathbf{x}_t\mathbf{a}_t^T$.
    Compute $\mathbf{D}_t$ using $\mathbf{D}_{t-1}$ as warm restart:
        $\mathbf{D}_t = \arg\min_{\mathbf{D}} \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|\mathbf{x}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1\right) = \arg\min_{\mathbf{D}} \frac{1}{t}\left(\frac{1}{2}\mathrm{Tr}(\mathbf{D}^T\mathbf{D}\mathbf{B}_t) - \mathrm{Tr}(\mathbf{D}^T\mathbf{C}_t)\right)$.
end for
Return the learned dictionary $\mathbf{D}_T$.

the matrix $\boldsymbol{\Phi}$ performs operations such as downsampling, masking, or convolution. Restoration of original images corrupted by these degradations corresponds to the inverse problems of single-image superresolution, inpainting, and deblurring, respectively. The straightforward way of restoring such images using sparse representations is to consider $\boldsymbol{\Phi}\mathbf{D}$ as the dictionary, compute the sparse coefficient vector a from the observation z, and obtain the restored data as Da. Apart from the obvious necessity that x should be sparsely decomposable in D, the conditions to be satisfied in order to obtain a good reconstruction are [106]: (a) the norm of any column in the dictionary $\boldsymbol{\Phi}\mathbf{D}$ should not become close to zero, since in that case it is not possible to recover the corresponding coefficient with any confidence; and (b) the columns of $\boldsymbol{\Phi}\mathbf{D}$ should be incoherent, since coherent atoms lead to ambiguity in the coefficient support. For uniform degradations such as downsampling and convolution, even an orthonormal D can result in a $\boldsymbol{\Phi}\mathbf{D}$ that violates these two conditions. For example, consider $\boldsymbol{\Phi}$ to be an operator that downsamples by a factor of two, in which case the DC atom $\{1, 1, 1, 1, \ldots\}$ and the highest frequency atom $\{1, -1, 1, -1, \ldots\}$ become identical after downsampling. If the dictionary D contains directional filters, an isotropic degradation operator will not lead to a complete loss of incoherence for $\boldsymbol{\Phi}\mathbf{D}$.

In order to overcome these issues and obtain a stable representation for inverse problems, it is necessary to exploit the fact that similar data admit similar representations, and to design dictionaries accordingly. As discussed in Chapter 2, this corresponds to the simultaneous sparse approximation (SSA) problem, where a set of data is represented using a common set of dictionary atoms. In this section, we will briefly discuss the formulations by Yu et al. [106] and Mairal et al. [107] that are specifically designed for inverse problems in imaging.


    Dictionaries for Simultaneous Sparse Approximation

We restate the problem of SSA discussed in Chapter 2 and elaborate on it from a dictionary learning perspective. Let us assume that the data samples $\{\mathbf{x}_i\}_{i=1}^T$ can be grouped into $G$ disjoint sets, and that the indices of the members of group $g$ are available in $\mathcal{A}_g$. Note that each dictionary atom can participate in more than one group of representations. The joint problem of SSA and dictionary learning is given by
$$\min_{\mathbf{D}, \mathbf{A}} \|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{g=1}^{G} \|\mathbf{A}_g\|_{p,q}. \qquad (3.29)$$

The matrix $\mathbf{A}_g \in \mathbb{R}^{K \times |\mathcal{A}_g|}$ contains the coefficient vectors of the data samples that belong to group $g$. The pseudo-norm $\|\mathbf{A}\|_{p,q}$ is defined as $\sum_{i=1}^{K} \|\mathbf{a}^i\|_q^p$, where $\mathbf{a}^i$ denotes the $i$-th row of $\mathbf{A}$, and the pair $(p, q)$ is chosen to be $(1, 2)$ or $(0, \infty)$. The choice $p = 1$ and $q = 2$ results in a convex penalty. Dictionary learning can be carried out using an alternating minimization scheme, where the dictionary is obtained by fixing the coefficients and the coefficients are obtained by fixing the dictionary. In practice, two basic issues need to be addressed in the group sparse coding stage, which is usually carried out using an error-constrained optimization: (a) how should the groups be selected? and (b) what is the error goal for the representation?

We provide simple answers to these questions that lead to efficient implementation schemes, following [107]. Patches can be grouped by clustering them into a prespecified number of groups, thereby keeping similar patches together and avoiding the problem of overlapping groups. The error goal for group $g$, $\epsilon_g$, is the goal for one patch scaled by the number of elements in the group. Similar to BPDN in (2.21), the error-constrained minimization is now given by
$$\min_{\{\mathbf{A}_g\}_{g=1}^{G}} \sum_{g=1}^{G} \frac{\|\mathbf{A}_g\|_{p,q}}{|\mathcal{A}_g|^p} \quad \text{such that} \quad \sum_{i \in \mathcal{A}_g} \|\mathbf{x}_i - \mathbf{D}\mathbf{a}_i\|_2^2 \leq \epsilon_g, \qquad (3.30)$$

where the group norm is normalized to ensure equal weighting of all the groups. When the sparse codes are fixed, dictionary learning can be performed using any standard procedure. In denoising and demosaicing applications, a proper adaptation of the SSA and dictionary learning procedure leads to state-of-the-art results.
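For reference, the convex case of the grouped penalty in (3.29), with $(p, q) = (1, 2)$, is simply the sum of the $\ell_2$ norms of the rows of a group's coefficient matrix; a minimal illustration (not from the text):

```python
import numpy as np

def group_norm(A, p=1, q=2):
    """The l_{p,q} pseudo-norm used in (3.29): sum over rows of the l_q norm
    raised to the power p. For p=1, q=2 this is the convex row-sparsity penalty."""
    row_norms = np.linalg.norm(A, ord=q, axis=1)
    return np.sum(row_norms ** p)

A_g = np.array([[0.0, 0.0, 0.0],
                [1.0, -2.0, 2.0],
                [0.5, 0.0, 0.0]])
print(group_norm(A_g))   # 3.5 = 0 + 3 + 0.5; rows shared across the group are favored
```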

    Dictionaries for Piecewise Linear Approximation

The main idea in building these dictionaries is to design a set of directional bases, and to represent each cluster of image patches linearly