6.870 Object Recognition and Scene Understanding student presentation MIT


Page 1: Mit6870 template matching and histograms

6.870

Object Recognition and Scene Understanding

student presentation

MIT

Page 2: Mit6870 template matching and histograms

6.870

Template matching and histograms

Nicolas Pinto

Page 3: Mit6870 template matching and histograms

Introduction

Page 4: Mit6870 template matching and histograms

Hosts

Page 5: Mit6870 template matching and histograms

Hosts

a guy...

(who has big arms)

Page 6: Mit6870 template matching and histograms

Hosts

Antonio T...

(who knows a lot about vision)

a guy...

(who has big arms)

Page 7: Mit6870 template matching and histograms

Hosts

Antonio T...

(who knows a lot about vision)

a frog...

(who has big eyes)

a guy...

(who has big arms)

Page 8: Mit6870 template matching and histograms

Hosts

Antonio T...

(who knows a lot about vision)

a frog...

(who has big eyes) and thus should know a lot about vision...

a guy...

(who has big arms)

Page 9: Mit6870 template matching and histograms

3 papers

yey!!

Page 10: Mit6870 template matching and histograms

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.


Lowe (1999)
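The first stage Lowe describes, finding maxima and minima of a difference-of-Gaussian function in scale space, can be sketched in a few lines. This is a minimal illustration only, not Lowe's implementation: the `sigmas` and `thresh` values are arbitrary assumed choices, and orientation assignment, subpixel refinement, and the SIFT key descriptor itself are all omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.03):
    """Find scale-space extrema of a difference-of-Gaussian pyramid.

    `image` is assumed normalized to [0, 1].  Returns a list of
    (row, col, scale_index) triples.
    """
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    # Adjacent blur levels are subtracted to approximate a DoG pyramid.
    dogs = np.stack([b1 - b0 for b0, b1 in zip(blurred, blurred[1:])])
    # A point is kept if it is the max or min of the 3x3x3 scale-space
    # cube around it and its response is strong enough.
    maxf = maximum_filter(dogs, size=3)
    minf = minimum_filter(dogs, size=3)
    keypoints = []
    for s in range(1, dogs.shape[0] - 1):  # need both scale neighbours
        layer = dogs[s]
        is_ext = ((layer == maxf[s]) | (layer == minf[s])) & (np.abs(layer) > thresh)
        for r, c in zip(*np.nonzero(is_ext)):
            keypoints.append((r, c, s))
    return keypoints
```

On a synthetic image containing a single Gaussian blob, the detector fires at the blob centre at the DoG level whose scale roughly matches the blob, which is the behaviour the staged filtering relies on.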


Page 11: Mit6870 template matching and histograms

Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs

INRIA Rhône-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or


Dalal and Triggs (2005)
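The pipeline the abstract describes, per-cell orientation histograms combined into overlapping, contrast-normalized blocks, can be sketched as follows. This is a simplified illustration under assumed parameter defaults (8-pixel cells, 9 unsigned orientation bins, 2x2-cell blocks, plain L2 normalization), not the authors' exact pipeline: gamma correction, Gaussian spatial weighting, and trilinear interpolation between cells and bins are omitted.

```python
import numpy as np

def hog_descriptor(image, cell=8, bins=9, block=2, eps=1e-6):
    """Sketch of a HOG descriptor for a grayscale image whose sides
    are multiples of `cell`.  Returns the concatenated block vectors."""
    img = image.astype(float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, [0, 180)
    h, w = img.shape
    ch, cw = h // cell, w // cell
    # Hard-assign each pixel's magnitude to one orientation bin per cell.
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            for b in range(bins):
                hist[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    # Overlapping block x block groups of cells, each L2-normalized;
    # this overlap-and-normalize step is what the paper finds critical.
    feats = []
    for i in range(ch - block + 1):
        for j in range(cw - block + 1):
            v = hist[i:i + block, j:j + block].ravel()
            feats.append(v / np.sqrt((v ** 2).sum() + eps ** 2))
    return np.concatenate(feats)
```

For a 32x32 input with these defaults there are 4x4 cells and 3x3 overlapping blocks of 36 values each, giving a 324-dimensional descriptor.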


Page 12: Mit6870 template matching and histograms


A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb
University of Chicago
[email protected]

David McAllester
Toyota Technological Institute at Chicago
[email protected]

Deva Ramanan
UC Irvine
[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a


Felzenszwalb et al. (2008)
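The core idea of a star-structured part model, a coarse root template plus parts that may shift away from their anchors at a quadratic deformation cost, reduces at detection time to a simple scoring rule: root response plus, per part, the best displaced part response minus its deformation penalty. The brute-force sketch below illustrates only that rule, on hypothetical precomputed filter-response maps; it ignores the 2x-resolution part features, the HOG pyramid, and the distance-transform speedup used in the actual system.

```python
import numpy as np

def score_window(root_resp, part_resps, anchors, defcosts):
    """Score one detection window of a star-structured part model.

    root_resp  -- scalar root-filter response for this window
    part_resps -- list of 2D arrays: part-filter responses over
                  candidate placements (hypothetical precomputed maps)
    anchors    -- list of (ax, ay) ideal placements per part
    defcosts   -- list of (dx, dy) quadratic deformation weights
    """
    score = root_resp
    for resp, (ax, ay), (dx, dy) in zip(part_resps, anchors, defcosts):
        h, w = resp.shape
        best = -np.inf
        # Each part independently picks the placement maximizing
        # (appearance response - deformation cost): the "max" in a
        # star model factors over parts.
        for y in range(h):
            for x in range(w):
                cost = dx * (x - ax) ** 2 + dy * (y - ay) ** 2
                best = max(best, resp[y, x] - cost)
        score += best
    return score
```

In practice this inner maximization is computed for all windows at once with a generalized distance transform, which is what makes the detector efficient; the brute-force loop above is only meant to make the scoring function concrete.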


Page 13: Mit6870 template matching and histograms

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

Proc. of the International Conference on

Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a

new class of local image features. The features are invariant

to image scaling, translation, and rotation, and partially in-

variant to illumination changes and affine or 3D projection.

These features share similar properties with neurons in in-

ferior temporal cortex that are used for object recognition

in primate vision. Features are efficiently detected through

a staged filtering approach that identifies stable points in

scale space. Image keys are created that allow for local ge-

ometric deformations by representing blurred image gradi-

ents in multiple orientation planes and at multiple scales.

The keys are used as input to a nearest-neighbor indexing

method that identifies candidate object matches. Final veri-

fication of each match is achieved by finding a low-residual

least-squares solution for the unknown model parameters.

Experimental results show that robust object recognition

can be achieved in cluttered partially-occluded images with

a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires

local image features that are unaffected by nearby clutter or

partial occlusion. The features must be at least partially in-

variant to illumination, 3D projective transforms, and com-

mon object variations. On the other hand, the features must

also be sufficiently distinctive to identify specific objects

among many alternatives. The difficulty of the object recog-

nition problem is due in large part to the lack of success in

finding such image features. However, recent research on

the use of dense local features (e.g., Schmid & Mohr [19])

has shown that efficient recognition can often be achieved

by using local image descriptors sampled at a large number

of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
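The final least-squares verification described above can be sketched in a few lines of NumPy. This is a minimal illustration (the helper name `fit_affine` and the toy matches are mine, not the paper's): each key match contributes two rows to a linear system in the six affine parameters, which is then solved in the least-squares sense.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src (n,2) points onto dst (n,2).

    Each match contributes two rows to A p = b, with
    p = [m11, m12, m21, m22, tx, ty]."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    n = len(src)
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = src   # x' = m11*x + m12*y + tx
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = src   # y' = m21*x + m22*y + ty
    A[1::2, 5] = 1.0
    b = dst.reshape(-1)  # interleaved [x'0, y'0, x'1, y'1, ...]
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# three toy matches related by a pure translation of (2, 3)
p = fit_affine([(0, 0), (1, 0), (0, 1)], [(2, 3), (3, 3), (2, 4)])
```

With three or more matches the system is (over)determined, which is why the paper requires at least 3 agreeing keys.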


Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation

We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but

Dalal and Triggs (2005)

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb
University of Chicago
[email protected]

David McAllester
Toyota Technological Institute at Chicago

[email protected]

Deva Ramanan
UC Irvine

[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution

Felzenszwalb et al. (2008)

Page 14: Mit6870 template matching and histograms

Scale-Invariant Feature Transform (SIFT)

adapted from Kucuktunc

Page 15: Mit6870 template matching and histograms

Scale-Invariant Feature Transform (SIFT)

adapted from Brown, ICCV 2003

Page 16: Mit6870 template matching and histograms

SIFT local features are

invariant...

adapted from David Lee

Page 17: Mit6870 template matching and histograms

like me they are robust...


Page 18: Mit6870 template matching and histograms

like me they are robust...


... to changes in illumination, noise, viewpoint, occlusion, etc.

Page 19: Mit6870 template matching and histograms

I am sure you want to know

how to build them


Page 20: Mit6870 template matching and histograms

I am sure you want to know

how to build them

1. find interest points or “keypoints”

Page 21: Mit6870 template matching and histograms

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

Page 22: Mit6870 template matching and histograms

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

3. compute their descriptor

Page 23: Mit6870 template matching and histograms

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

3. compute their descriptor

4. match them on other images

Page 24: Mit6870 template matching and histograms

1. find interest points or “keypoints”

Page 25: Mit6870 template matching and histograms


keypoints are taken as maxima/minima

of a DoG pyramid

in this setting, extrema are invariant to scale...

Page 26: Mit6870 template matching and histograms

a DoG (Difference of Gaussians) pyramid is simple to compute...

even he can do it!

before after

adapted from Pallus and Fleishman
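One octave of such a pyramid can be sketched in a few lines of NumPy. This is a simplified illustration, not Lowe's actual implementation; the σ schedule and function names are mine.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with a truncated, normalized kernel."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    # filter rows, then columns (edge padding keeps the output size fixed)
    rows = np.pad(img, ((0, 0), (r, r)), mode='edge')
    tmp = np.apply_along_axis(np.convolve, 1, rows, k, mode='valid')
    cols = np.pad(tmp, ((r, r), (0, 0)), mode='edge')
    return np.apply_along_axis(np.convolve, 0, cols, k, mode='valid')

def dog_pyramid(img, n_scales=4, sigma0=1.6, k_step=2**0.5):
    """One octave of a DoG pyramid: blur at increasing sigma,
    then subtract adjacent levels."""
    blurred = [gaussian_blur(img, sigma0 * k_step**i) for i in range(n_scales)]
    return [b1 - b0 for b0, b1 in zip(blurred, blurred[1:])]
```

Each DoG level approximates a band-pass filter at one scale; stacking the levels gives the 3D (x, y, scale) space searched for extrema next.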

Page 27: Mit6870 template matching and histograms

then we just have to find neighborhood extrema in this 3D DoG space

Page 28: Mit6870 template matching and histograms

then we just have to find neighborhood extrema in this 3D DoG space

if a pixel is an extremum in its neighboring region it becomes a candidate

keypoint
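The neighborhood test is just a comparison against the 26 neighbors in a 3x3x3 scale-space cube. A minimal sketch (`dog` is assumed to be the DoG levels stacked into one array):

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """True when dog[s, y, x] is strictly greater (or strictly smaller)
    than all 26 neighbors in its 3x3x3 scale-space neighborhood."""
    v = dog[s, y, x]
    cube = dog[s-1:s+2, y-1:y+2, x-1:x+2].ravel()
    others = np.delete(cube, 13)  # index 13 is the center voxel itself
    return bool((v > others).all() or (v < others).all())
```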

Page 29: Mit6870 template matching and histograms

too many keypoints?

adapted from wikipedia

Page 30: Mit6870 template matching and histograms

too many keypoints?

1. remove low contrast

adapted from wikipedia

Page 31: Mit6870 template matching and histograms

too many keypoints?

1. remove low contrast

adapted from wikipedia

Page 32: Mit6870 template matching and histograms

too many keypoints?

1. remove low contrast

2. remove edges

adapted from wikipedia

Page 33: Mit6870 template matching and histograms

too many keypoints?

1. remove low contrast

2. remove edges

adapted from wikipedia
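Both rejection tests can be sketched as follows, using the thresholds commonly quoted for SIFT (contrast threshold 0.03, curvature ratio r = 10). This is a simplified 2D version that skips the subpixel refinement step:

```python
import numpy as np

def keep_keypoint(dog, y, x, contrast_thresh=0.03, edge_ratio=10.0):
    """Reject low-contrast and edge-like candidates in a single DoG level."""
    # 1. remove low contrast: weak DoG response
    if abs(dog[y, x]) < contrast_thresh:
        return False
    # 2. remove edges: 2x2 Hessian from finite differences;
    #    an edge has one large and one small principal curvature
    dxx = dog[y, x+1] + dog[y, x-1] - 2 * dog[y, x]
    dyy = dog[y+1, x] + dog[y-1, x] - 2 * dog[y, x]
    dxy = (dog[y+1, x+1] - dog[y+1, x-1] - dog[y-1, x+1] + dog[y-1, x-1]) / 4
    tr, det = dxx + dyy, dxx * dyy - dxy**2
    if det <= 0:  # curvatures of opposite sign (or degenerate): reject
        return False
    return tr**2 / det < (edge_ratio + 1)**2 / edge_ratio
```

A blob-like point has two similar curvatures (small trace²/det); a ridge has a very unbalanced pair and fails the test.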

Page 34: Mit6870 template matching and histograms


2. find their dominant orientation

Page 35: Mit6870 template matching and histograms

each selected keypoint is assigned to one or more “dominant” orientations...

Page 36: Mit6870 template matching and histograms

each selected keypoint is assigned to one or more “dominant” orientations...

... this step is important to achieve rotation invariance

Page 37: Mit6870 template matching and histograms

How?

Page 38: Mit6870 template matching and histograms

How?
using the DoG pyramid to achieve scale invariance:

Page 39: Mit6870 template matching and histograms

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

Page 40: Mit6870 template matching and histograms

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

b. build an orientation histogram

Page 41: Mit6870 template matching and histograms

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

b. build an orientation histogram

c. keypoint’s orientation(s) = peak(s)

Page 42: Mit6870 template matching and histograms

a. compute image gradient magnitude and orientation

Page 43: Mit6870 template matching and histograms

a. compute image gradient magnitude and orientation

Page 44: Mit6870 template matching and histograms

b. build an orientation histogram

adapted from Ofir Pele

Page 45: Mit6870 template matching and histograms

c. keypoint’s orientation(s) = peak(s)

*

* the peak ;-)
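Steps a–c above amount to a magnitude-weighted orientation histogram. A minimal sketch (36 bins over 360 degrees, as in SIFT, but without the Gaussian weighting and peak interpolation of the full method):

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """a. gradient magnitude and orientation over a patch,
       b. 36-bin orientation histogram weighted by magnitude,
       c. dominant orientation = center of the peak bin (degrees)."""
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, 360), weights=mag)
    peak = hist.argmax()
    return (edges[peak] + edges[peak + 1]) / 2
```

SIFT additionally creates an extra keypoint for every secondary peak above 80% of the maximum, which is why a keypoint can have more than one orientation.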

Page 46: Mit6870 template matching and histograms


3. compute their descriptor

Page 47: Mit6870 template matching and histograms

SIFT descriptor = a set of orientation histograms

4x4 array x 8 bins = 128 dimensions (normalized)

16x16 neighborhood of pixel gradients
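A bare-bones version of this descriptor (ignoring rotation to the dominant orientation, Gaussian weighting, and clipping, which the full SIFT descriptor also applies):

```python
import numpy as np

def sift_descriptor(patch):
    """128-D descriptor from a 16x16 patch: 4x4 cells, one 8-bin
    orientation histogram per cell, concatenated and L2-normalized."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0
    desc = []
    for cy in range(4):
        for cx in range(4):
            cell = np.s_[4*cy:4*cy+4, 4*cx:4*cx+4]
            hist, _ = np.histogram(ang[cell], bins=8, range=(0, 360),
                                   weights=mag[cell])
            desc.append(hist)
    v = np.concatenate(desc)              # 16 cells x 8 bins = 128 dims
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

d = sift_descriptor(np.random.default_rng(0).random((16, 16)))
```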

Page 48: Mit6870 template matching and histograms


4. match them on other images

Page 49: Mit6870 template matching and histograms

How to match?

Page 50: Mit6870 template matching and histograms

How to match?

nearest neighbor

Page 51: Mit6870 template matching and histograms

How to match?

nearest neighbor
hough transform voting

Page 52: Mit6870 template matching and histograms

How to match?

nearest neighbor
hough transform voting
least-squares fit

Page 53: Mit6870 template matching and histograms

How to match?

nearest neighbor
hough transform voting
least-squares fit
etc.
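The simplest of these, nearest-neighbor matching, is often combined with Lowe's distance-ratio test (a later refinement of the paper's indexing scheme, shown here as an illustrative sketch):

```python
import numpy as np

def match_ratio_test(desc1, desc2, ratio=0.8):
    """Brute-force nearest-neighbor matching with the ratio test:
    keep a match only when the best distance is clearly better
    (< ratio x) than the second-best, i.e. the match is unambiguous."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```

In practice the linear scan is replaced by an approximate nearest-neighbor index (e.g. a k-d tree variant) for speed.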

Page 54: Mit6870 template matching and histograms

SIFT is great!


Page 55: Mit6870 template matching and histograms

SIFT is great!

(partially) invariant to affine transformations

Page 56: Mit6870 template matching and histograms

SIFT is great!

(partially) invariant to affine transformations

easy to understand

Page 57: Mit6870 template matching and histograms

SIFT is great!

(partially) invariant to affine transformations

easy to understand

fast to compute

Page 58: Mit6870 template matching and histograms

Extension example: Spatial Pyramid Matching using SIFT


Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik (1)
[email protected]
Beckman Institute, University of Illinois

Cordelia Schmid (2)
[email protected]
INRIA Rhone-Alpes, Montbonnot, France

Jean Ponce (1,3)
[email protected]
Ecole Normale Superieure, Paris, France

CVPR 2006

Page 59: Mit6870 template matching and histograms

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]


Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or


Dalal and Triggs (2005)

A Discriminatively Trained, Multiscale, Deformable Part Model


Felzenszwalb et al. (2008)

Page 60: Mit6870 template matching and histograms


first of all, let me put this paper in context

Page 61: Mit6870 template matching and histograms


histograms of local image measurements have been quite successful

Swain & Ballard 1991 - Color Histograms

Schiele & Crowley 1996 - Receptive Fields Histograms

Lowe 1999 - SIFT

Schneiderman & Kanade 2000 - Localized Histograms of Wavelets

Leung & Malik 2001 - Texton Histograms

Belongie et al. 2002 - Shape Context

Dalal & Triggs 2005 - Dense Orientation Histograms

...


Page 62: Mit6870 template matching and histograms


tons of “feature sets” have been proposed

features

Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor

Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM

Viola & Jones 2001 - Rectangular Differential Features + AdaBoost

Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost

Ke & Sukthankar 2004 - PCA-SIFT

...

Page 63: Mit6870 template matching and histograms

Histograms of Oriented Gradients for Human DetectionNavneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking "pedestrian detection" (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or

localizing humans in images is a challenging task...

difficult!

Wide variety of articulated poses

Variable appearance/clothing

Complex backgrounds

Unconstrained illuminations

Occlusions

Different scales

...

Page 64: Mit6870 template matching and histograms

Approach

Page 65: Mit6870 template matching and histograms

Approach

•robust feature set (HOG)

Page 66: Mit6870 template matching and histograms

Approach

•robust feature set (HOG)

Page 67: Mit6870 template matching and histograms

Approach

•robust feature set (HOG)

•simple classifier (linear SVM)

Page 68: Mit6870 template matching and histograms

Approach

•robust feature set (HOG)

•simple classifier (linear SVM)

•fast detection (sliding window)

Page 69: Mit6870 template matching and histograms

adapted from Bill Triggs

Page 70: Mit6870 template matching and histograms

• Gamma normalization

• Space: RGB, LAB or Gray

• Method: SQRT or LOG
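The gamma-normalization step above can be sketched as follows. This is a minimal illustration I wrote (not the authors' code), assuming the SQRT and LOG options are simple per-pixel compressions of the intensity range:

```python
# Hypothetical sketch of the gamma / colour normalization stage:
# compress pixel intensities before gradient computation.
import numpy as np

def gamma_normalize(image, method="sqrt"):
    """Compress the dynamic range of an 8-bit image."""
    img = image.astype(np.float64) / 255.0
    if method == "sqrt":
        return np.sqrt(img)           # square-root gamma compression
    elif method == "log":
        return np.log1p(img)          # log(1 + x) avoids log(0)
    raise ValueError("method must be 'sqrt' or 'log'")

patch = np.array([[0, 64], [128, 255]], dtype=np.uint8)
compressed = gamma_normalize(patch, "sqrt")
```

Square-root compression attenuates strong intensities more than weak ones, which reduces the influence of illumination changes on the gradients computed next.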

Page 71: Mit6870 template matching and histograms

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

Sobel


* centered performs the best
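The winning "centered" filter is just the 1-D difference mask [-1, 0, 1]. Here is a small sketch (my own, with assumed names) of how it yields a gradient magnitude and an unsigned orientation per pixel:

```python
# Illustrative gradient computation with the centred mask [-1, 0, 1].
import numpy as np

def gradients(img):
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # centred horizontal difference
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # centred vertical difference
    magnitude = np.hypot(gx, gy)
    # unsigned orientation in [0, 180), as used for pedestrian detection
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation

img = np.tile(np.arange(8.0), (8, 1))        # horizontal intensity ramp
mag, ori = gradients(img)
```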

Page 72: Mit6870 template matching and histograms

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

Sobel

remember SIFT ?

Page 73: Mit6870 template matching and histograms

...after filtering, each “pixel” represents an oriented gradient...

Page 74: Mit6870 template matching and histograms

...pixels are regrouped in “cells”, they cast a weighted vote for an orientation histogram...

HOG (Histogram of Oriented Gradients)
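The cell voting above can be sketched as follows. This is a hedged illustration, assuming one cell's magnitudes and orientations are given; the linear interpolation between neighbouring bins matches the paper's description, but the function names are mine:

```python
# Sketch of one HOG cell: each pixel casts a vote for an orientation bin,
# weighted by its gradient magnitude, split linearly between the two
# nearest bins.
import numpy as np

def cell_histogram(mag, ori, n_bins=9):
    """Orientation histogram of one cell; unsigned gradients in [0, 180)."""
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    pos = ori.ravel() / bin_width - 0.5       # fractional bin position
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    for b, f, m in zip(lo, frac, mag.ravel()):
        hist[b % n_bins] += (1.0 - f) * m     # vote for the lower bin
        hist[(b + 1) % n_bins] += f * m       # remainder to the upper bin
    return hist

mag = np.ones((8, 8))                          # uniform gradient strength
ori = np.full((8, 8), 10.0)                    # all pixels at a bin centre
h = cell_histogram(mag, ori)
```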

Page 75: Mit6870 template matching and histograms

a window can be represented like this

Page 76: Mit6870 template matching and histograms

then, cells are locally normalized using overlapping “blocks”

Page 77: Mit6870 template matching and histograms

they used two types of blocks

Page 78: Mit6870 template matching and histograms

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

Page 79: Mit6870 template matching and histograms

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

• circular

• similar to Shape Context

Page 80: Mit6870 template matching and histograms

and four different types of block normalization

Page 81: Mit6870 template matching and histograms

and four different types of block normalization
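The four normalization schemes the paper evaluates (L2-norm, L2-Hys, L1-norm, L1-sqrt) can be sketched roughly like this; the vector `v` stands for the concatenated cell histograms of one block, and the epsilon handling is my simplification:

```python
# Sketch of the four block-normalization variants from the paper.
import numpy as np

def normalize_block(v, scheme="L2-Hys", eps=1e-5):
    v = v.astype(np.float64)
    if scheme == "L1":
        return v / (np.abs(v).sum() + eps)
    if scheme == "L1-sqrt":
        return np.sqrt(v / (np.abs(v).sum() + eps))
    v = v / np.sqrt((v ** 2).sum() + eps ** 2)      # L2-norm
    if scheme == "L2-Hys":
        v = np.clip(v, 0, 0.2)                      # clip large components...
        v = v / np.sqrt((v ** 2).sum() + eps ** 2)  # ...then renormalize
    return v

block = np.array([3.0, 4.0, 0.0, 0.0])
l2 = normalize_block(block, "L2")
```

Because every cell appears in several overlapping blocks, the same histogram contributes to the final descriptor under several different normalizations, which the paper found important for performance.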

Page 82: Mit6870 template matching and histograms

like SIFT, they gain invariance...

...to illuminations, small deformations, etc.

Page 83: Mit6870 template matching and histograms

finally, a sliding window is

classified by a simple linear SVM
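Because the classifier is a linear SVM, scoring a window reduces to a dot product w · x + b, which makes the sliding-window scan cheap. A toy sketch (assumed names, 2-D map standing in for the HOG feature array):

```python
# Illustrative sliding-window detection with a linear scoring function.
import numpy as np

def sliding_windows(feat_map, win_shape, stride):
    """Yield (row, col, window) views over a 2-D feature map."""
    H, W = feat_map.shape
    h, w = win_shape
    for r in range(0, H - h + 1, stride):
        for c in range(0, W - w + 1, stride):
            yield r, c, feat_map[r:r + h, c:c + w]

def detect(feat_map, weights, bias, win_shape, stride=1, thresh=0.0):
    hits = []
    for r, c, win in sliding_windows(feat_map, win_shape, stride):
        score = float(np.dot(win.ravel(), weights) + bias)  # w.x + b
        if score > thresh:
            hits.append((r, c, score))
    return hits

fmap = np.zeros((6, 6))
fmap[2:4, 2:4] = 1.0                        # one bright 2x2 "object"
hits = detect(fmap, np.ones(4), bias=-3.5, win_shape=(2, 2))
```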

Page 84: Mit6870 template matching and histograms

during the learning phase, the algorithm “looked” for hard examples

Training

adapted from Martial Hebert
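The hard-example search is a bootstrapping loop: train an initial model, scan the negative images, and add the high-scoring false positives back into the training set before retraining. A toy sketch of the mining step (the scorer and data below are hypothetical stand-ins, not the actual detector):

```python
# Hedged sketch of hard-negative mining ("looking for hard examples").
import numpy as np

def mine_hard_negatives(score_fn, negative_windows, thresh=0.0):
    """Return the negative windows the current model wrongly scores high."""
    return [x for x in negative_windows if score_fn(x) > thresh]

# toy linear scorer standing in for the trained SVM
w = np.array([1.0, -1.0])
score = lambda x: float(np.dot(w, x))

negatives = [np.array([2.0, 0.0]),    # scores +2 -> hard negative, kept
             np.array([0.0, 2.0])]    # scores -2 -> easy, discarded
hard = mine_hard_negatives(score, negatives)
```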

Page 85: Mit6870 template matching and histograms

average gradients

positive weights negative weights

Page 86: Mit6870 template matching and histograms

Example

Page 87: Mit6870 template matching and histograms

Example

adapted from Bill Triggs

Page 88: Mit6870 template matching and histograms

Example

adapted from Martial Hebert

Page 89: Mit6870 template matching and histograms

Further Development

Page 90: Mit6870 template matching and histograms

Further Development

• Detection on Pascal VOC (2006)

Page 91: Mit6870 template matching and histograms

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

Page 92: Mit6870 template matching and histograms

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

• US Patent by MERL (2006)

Page 93: Mit6870 template matching and histograms

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

• US Patent by MERL (2006)

• Stereo Vision HoG (ICVES 2008)

Page 94: Mit6870 template matching and histograms

Extension example:

Pyramid HoG++

Page 95: Mit6870 template matching and histograms

Extension example:

Pyramid HoG++

Page 96: Mit6870 template matching and histograms

Extension example:

Pyramid HoG++

Page 97: Mit6870 template matching and histograms

A simple demo...

Page 98: Mit6870 template matching and histograms

A simple demo...

Page 99: Mit6870 template matching and histograms

A simple demo...

VIDEO HERE

Page 100: Mit6870 template matching and histograms

A simple demo...

VIDEO HERE

Page 101: Mit6870 template matching and histograms
Page 102: Mit6870 template matching and histograms

so, it doesn’t work ?!?

Page 103: Mit6870 template matching and histograms

so, it doesn’t work ?!?

no no, it works...

Page 104: Mit6870 template matching and histograms

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

Page 105: Mit6870 template matching and histograms

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal cortex...

Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr


Dalal and Triggs (2005)

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago, [email protected]

David McAllester, Toyota Technological Institute at Chicago

[email protected]

Deva Ramanan, UC Irvine

[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by "conceptually weaker" models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining "hard negative" examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a


Felzenszwalb et al.(2008)

Page 106: Mit6870 template matching and histograms

This paper describes one of the best algorithms for object detection...

Page 107: Mit6870 template matching and histograms

They used the following methods:

HOG Features

Deformable Part Model

Latent SVM

Page 108: Mit6870 template matching and histograms

They used the following methods:

HOG Features

Introduced by Dalal & Triggs (2005)

Page 109: Mit6870 template matching and histograms

They used the following methods:

Deformable Part Model

Introduced by Fischler & Elschlager (1973)

Page 110: Mit6870 template matching and histograms

They used the following methods:

Latent SVM

Introduced by the authors

Page 111: Mit6870 template matching and histograms

HOG Features

Page 112: Mit6870 template matching and histograms

Model Overview

detection, root filter, part filters, deformation models

Page 113: Mit6870 template matching and histograms

HOG Features

// 8x8 pixel blocks window

// features computed at different resolutions (pyramid)
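Computing features at different resolutions can be sketched as a simple image pyramid: repeatedly downscale the image and recompute features, so a fixed-size filter matches objects at several scales. The 2x2-block averaging below is my simplification (the paper uses finer scale steps):

```python
# Minimal sketch of a feature pyramid via repeated downscaling.
import numpy as np

def downscale(img):
    """Halve each dimension by averaging 2x2 blocks."""
    H, W = img.shape
    H2, W2 = H // 2 * 2, W // 2 * 2          # crop to even dimensions
    return img[:H2, :W2].reshape(H2 // 2, 2, W2 // 2, 2).mean(axis=(1, 3))

def pyramid(img, min_size=8):
    """Stack of progressively coarser levels, down to min_size."""
    levels = [img.astype(np.float64)]
    while min(levels[-1].shape) // 2 >= min_size:
        levels.append(downscale(levels[-1]))
    return levels

levels = pyramid(np.zeros((64, 64)))
```

In the actual model, the root filter is evaluated at a coarse level while the part filters are evaluated at twice the root's resolution.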

Page 114: Mit6870 template matching and histograms

HOG Pyramid

Page 115: Mit6870 template matching and histograms

Deformable Part Model

Page 116: Mit6870 template matching and histograms

Deformable Part Model

// each part is a local property

// springs capture spatial relationships

// here, the springs can be “negative”

Page 117: Mit6870 template matching and histograms

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 118: Mit6870 template matching and histograms

root filter

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 119: Mit6870 template matching and histograms

root filter

part filters

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 120: Mit6870 template matching and histograms

root filter

part filters

deformable model

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 121: Mit6870 template matching and histograms

Deformable Part Model

filters feature vector(at position p

in the pyramid H)

position relativeto the root location

coefficients of a quadratic function on

the placement

score of a placement
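The placement score above can be rendered as a toy computation. This is an assumed, simplified notation (scalar filter responses, a quadratic-plus-linear deformation cost per part), not the paper's exact parameterization:

```python
# Toy sketch of scoring one placement of root + parts.
import numpy as np

def placement_score(root_resp, part_resps, displacements, def_coeffs):
    """Sum of filter responses minus the deformation cost.

    part_resps[i]    -- filter response of part i at its chosen position
    displacements[i] -- (dx, dy) of part i from its anchor near the root
    def_coeffs[i]    -- (a, b): cost a*(dx^2 + dy^2) + b*(|dx| + |dy|)
    """
    score = root_resp
    for resp, (dx, dy), (a, b) in zip(part_resps, displacements, def_coeffs):
        score += resp - (a * (dx * dx + dy * dy) + b * (abs(dx) + abs(dy)))
    return score

s = placement_score(root_resp=1.0,
                    part_resps=[2.0, 2.0],
                    displacements=[(0, 0), (1, 2)],   # second part displaced
                    def_coeffs=[(0.1, 0.0), (0.1, 0.0)])
```

A strong part response can outweigh the deformation cost, which is exactly how the "springs" trade appearance evidence against geometric distortion.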

Page 122: Mit6870 template matching and histograms

Latent SVM

Page 123: Mit6870 template matching and histograms

Latent SVM

f_β(x) = max over placements z of β · Φ(x, z)

β: filters and deformation parameters
Φ(x, z): features and part displacements
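The latent-SVM scoring rule maximizes over the latent variable (the part placements) at evaluation time. A minimal sketch, with a hypothetical toy feature map standing in for the real Φ:

```python
# Sketch of latent scoring: f_beta(x) = max_z beta . Phi(x, z).
import numpy as np

def latent_score(beta, phi, x, candidates):
    """Max over latent values z of the linear score beta . Phi(x, z)."""
    return max(float(np.dot(beta, phi(x, z))) for z in candidates)

# toy Phi: the latent variable z shifts the input (stand-in for a
# part displacement); entirely hypothetical
phi = lambda x, z: np.roll(x, z)

beta = np.array([1.0, 0.0, 0.0])
x = np.array([0.0, 0.0, 5.0])
best = latent_score(beta, phi, x, candidates=[0, 1, 2])
```

Because the max makes training non-convex, the paper alternates between fixing the latent placements for positives (which makes the problem convex) and re-optimizing them.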

Page 124: Mit6870 template matching and histograms

Latent SVM

Page 125: Mit6870 template matching and histograms

Bonus

// Data Mining Hard Negatives

// Model Initialization

Page 126: Mit6870 template matching and histograms

Results

Pascal VOC 2006

Page 127: Mit6870 template matching and histograms

Results

Models learned

Page 128: Mit6870 template matching and histograms

Experiments

~ Dalal's model
~ Dalal's + LSVM

Page 129: Mit6870 template matching and histograms

Examples

errors

Page 130: Mit6870 template matching and histograms

A simple demo...

Page 131: Mit6870 template matching and histograms

A simple demo...

Page 132: Mit6870 template matching and histograms

A simple demo...

Page 133: Mit6870 template matching and histograms

A simple demo...

Page 134: Mit6870 template matching and histograms

Conclusions

Page 135: Mit6870 template matching and histograms

Conclusions

so, it doesn’t work ?!?

Page 136: Mit6870 template matching and histograms

Conclusions

so, it doesn’t work ?!?

no no, it works...

Page 137: Mit6870 template matching and histograms

Conclusions

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

Page 138: Mit6870 template matching and histograms

Conclusions

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

...or there is a problem with the seat-computer interface...

Page 139: Mit6870 template matching and histograms

Conclusion

Page 140: Mit6870 template matching and histograms