Deformable Part Models are Convolutional Neural Networks
Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik
Presenter: YANG Wei
January 25, 2016
Outline
1 Introduction
2 DeepPyramid DPMs
  Feature pyramid front-end CNN
  Constructing an equivalent CNN from a DPM
3 Implementation details
4 Experiments
Deformable Part Models vs. Convolutional Neural Networks
Deformable part models
Convolutional neural networks
Are DPMs and CNNs actually distinct?
DPMs: graphical models
CNNs: “black-box” non-linear classifiers
This paper shows that any DPM can be formulated as an equivalent CNN, i.e., deformable part models are convolutional neural networks.
DeepPyramid DPMs
Schematic model overview: “front-end CNN” + DPM-CNN
input: image pyramid
output: object detection scores
Feature pyramid front-end CNN
front-end CNN: AlexNet (conv1-conv5).
A CNN that maps an image pyramid to a feature pyramid
AlexNet: single-scale architecture
Constructing an equivalent CNN from a DPM
A single-component DPM.
mixture of components
component = root filter + part filters
Inference with DPMs
The matching process at one scale.
Architecture of DPM-CNN
The unrolled detection algorithm of DPM generates a specific network with fixed length:

1 input: conv5 feature pyramid of the front-end CNN
2 generate P+1 feature maps: 1 root filter and P part filters
3 the P part feature maps are fed into a distance transform layer
4 the root feature map is stacked (channel-wise concatenated) with the transformed part feature maps
5 the resulting (P+1)-channel feature map is convolved with an object geometry filter, which produces the output DPM score map for the input pyramid level
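The five steps can be sketched as a small NumPy toy model. This is a hedged illustration, not the paper's implementation: the helper names (`correlate_valid`, `dt_pool`, `dpm_cnn_score`), the quadratic deformation weights, and the use of per-channel `geometry` weights in place of the paper's sparse object geometry filter are all simplifying assumptions.

```python
import numpy as np

def correlate_valid(feat, filt):
    """Cross-correlate a single-channel feature map with a filter ('valid' mode)."""
    H, W = feat.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feat[y:y + h, x:x + w] * filt)
    return out

def dt_pool(resp, wx=0.1, wy=0.1):
    """Distance transform pooling: D(p) = max_q (resp(q) - cost(p - q)),
    with a quadratic deformation cost, brute-forced over the whole map."""
    H, W = resp.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.empty_like(resp)
    for y in range(H):
        for x in range(W):
            cost = wx * (xs - x) ** 2 + wy * (ys - y) ** 2
            out[y, x] = np.max(resp - cost)
    return out

def dpm_cnn_score(feat, root, parts, geometry):
    """Steps 1-5: root/part convolutions, DT pooling of part responses,
    channel-wise stacking, and a per-channel 'geometry' weighting."""
    root_map = correlate_valid(feat, root)                          # step 2 (root)
    part_maps = [dt_pool(correlate_valid(feat, p)) for p in parts]  # steps 2-3
    maps = [root_map] + part_maps
    H = min(m.shape[0] for m in maps)                               # crop to common size
    W = min(m.shape[1] for m in maps)
    stacked = np.stack([m[:H, :W] for m in maps], axis=0)           # step 4
    return np.tensordot(geometry, stacked, axes=1)                  # step 5 (1x1 filter)
```

For example, an all-ones 4×4 feature map with a 2×2 root filter of ones, one 1×1 part filter of ones, and geometry weights [1, 1] yields a 3×3 score map that is everywhere 5 (root response 4 plus DT-pooled part response 1).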
Architecture of DPM-CNN
CNN equivalent to a single-component DPM.
Traditional distance transform
Traditional distance transforms are defined for sets of points on a grid [FH05].
G: a grid
d(p−q): a measure of distance between points p, q ∈ G
B ⊆ G

Then the distance transform of B on G is

D_B(p) = min_{q∈B} d(p−q)

[Figure: distance transform under the Euclidean distance]
Traditional distance transform
The DT can also be formulated as

D_B(p) = min_{q∈G} (d(p−q) + 1_B(q))    (1)

where

1_B(q) = { 0, if q ∈ B,
         { ∞, if q ∉ B.
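The two definitions agree, which is easy to check by brute force (a small sketch; the function names are ours and any metric d works):

```python
def dt_over_set(grid, B, d):
    """D_B(p) = min_{q in B} d(p, q): distance transform of the point set B."""
    return {p: min(d(p, q) for q in B) for p in grid}

def dt_with_indicator(grid, B, d):
    """Equivalent form D_B(p) = min_{q in G} (d(p, q) + 1_B(q))."""
    one_B = {q: 0.0 if q in B else float("inf") for q in grid}
    return {p: min(d(p, q) + one_B[q] for q in grid) for p in grid}

# Manhattan distance on a 2-D grid, used here for concreteness
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
```

The indicator 1_B simply disqualifies every q outside B by adding infinite cost, so the inner minimum is always attained at some q ∈ B.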
Generalized distance transform
A generalization of distance transforms can be obtained by replacing the indicator function with an arbitrary function f′ over the grid G:

D_{f′}(p) = min_{q∈G} (d(p−q) + f′(q))

We can also define the generalized DT as a maximization by letting f(q) = −f′(q):

D_f(p) = max_{q∈G} (f(q) − d(p−q))
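The min/max duality can be verified numerically: with f′ = −f, the two forms satisfy D_f(p) = −D_{f′}(p) at every grid point (a minimal sketch, names ours):

```python
def gdt_min(grid, f_prime, d):
    """Generalized DT, min form: D_{f'}(p) = min_{q in G} (d(p, q) + f'(q))."""
    return {p: min(d(p, q) + f_prime[q] for q in grid) for p in grid}

def gdt_max(grid, f, d):
    """Generalized DT, max form: D_f(p) = max_{q in G} (f(q) - d(p, q))."""
    return {p: max(f[q] - d(p, q) for q in grid) for p in grid}
```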
Distance transform in DPM
In DPM, after computing filter responses we transform the responses of the part filters to allow spatial uncertainty:

D_i(x,y) = max_{dx,dy} (R_i(x+dx, y+dy) − w_i · φ_d(dx,dy))

where φ_d(dx,dy) = [dx, dy, dx², dy²].

The value D_i(x,y) is the maximum contribution of the part to the score of a root location that places the anchor of this part at position (x,y).

By letting p = (x,y), p−q = (dx,dy), and d(p−q) = w_i · φ_d(p−q), we can see that this is exactly in the form of a distance transform.
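A 1-D slice of this part transform can be brute-forced directly. As a simplifying assumption the linear term below uses |dx| so the deformation cost stays nonnegative; the actual DPM learns a signed 4-vector w_i per part:

```python
import numpy as np

def part_transform(R, w1, w2):
    """1-D slice of D_i(x) = max_dx (R(x+dx) - (w1*|dx| + w2*dx^2)).
    R is the part filter response at each location; w1, w2 >= 0 are
    deformation weights (assumed, not learned here)."""
    n = len(R)
    D = np.empty(n)
    for x in range(n):
        dx = np.arange(n) - x          # displacement to every response location
        D[x] = np.max(R - (w1 * np.abs(dx) + w2 * dx ** 2))
    return D
```

A single strong response spreads to its neighbors with a quadratic falloff: for R = [0, 0, 5, 0, 0] and w1 = 0, w2 = 1, the transform gives [1, 4, 5, 4, 1].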
Max pooling as distance transform
Consider max pooling of f : G → R on a regular grid G. Let the window half-length be k; then max pooling can be defined as

M_f(p) = max_{∆p ∈ {−k,…,k}} f(p+∆p)

Max pooling can be expressed equivalently as a distance transform:

M_f(p) = max_{q∈G} (f(q) − d_max(p−q))

where

d_max(p−q) = { 0, if (p−q) ∈ {−k,…,k},    (2)
             { ∞, otherwise.
Generalize max pooling to distance transform pooling
We can generalize max pooling to distance transform pooling:

unlike max pooling, the distance transform of f at p is taken over the entire domain G
rather than specifying a fixed pooling window a priori, the shape of the pooling region can be learned from the data

The released code does not include the DT pooling layer; please refer to [OW13] for more details.
Object geometry filters
The root convolution map and the DT-pooled part convolution maps are stacked into a single feature map with P+1 channels and then convolved with a sparse object geometry filter.
Combining mixture components with maxout
CNN equivalent to a multi-component DPM. A multi-component DPM-CNN is composed of one DPM-CNN per component and a maxout [GWFM+13] layer that takes a max over component DPM-CNN outputs at each location.
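The maxout combination reduces to an elementwise max over the stacked per-component score maps (a minimal sketch):

```python
import numpy as np

def maxout_components(score_maps):
    """Combine per-component DPM-CNN score maps by taking, at each
    location, the max over all components (the maxout layer)."""
    return np.max(np.stack(score_maps, axis=0), axis=0)
```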
Feature pyramid front-end CNN
Implementation details:

pretrain on ILSVRC 2012 classification using Caffe
use conv5 as the output layer
“same” convolution: zero-pad each conv/pooling layer’s input with ⌊k/2⌋ zeros on all sides (top, bottom, left and right)
(x, y) in the conv5 feature map has a receptive field centered on pixel (16x, 16y) in the input image
conv5 feature maps: stride 16; receptive field 163×163
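The stride-16 and 163×163 figures follow from standard receptive-field arithmetic over AlexNet's conv1–conv5 stack; the (kernel, stride) list below is the standard AlexNet configuration, including its two intermediate max-pooling layers:

```python
def receptive_field(layers):
    """Accumulate receptive-field size and total stride over a stack of
    conv/pool layers, each given as (kernel_size, stride). With 'same'
    zero-padding, output unit (x, y) stays centered on input pixel
    (total_stride * x, total_stride * y)."""
    rf, total_stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * total_stride
        total_stride *= s
    return rf, total_stride

# AlexNet conv1-conv5:
# conv1(11,4), pool1(3,2), conv2(5,1), pool2(3,2), conv3(3,1), conv4(3,1), conv5(3,1)
alexnet_conv5 = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
```

Running `receptive_field(alexnet_conv5)` reproduces exactly the stride of 16 and the 163×163 receptive field quoted above.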
Experiments
Detection average precision (%) on VOC 2007 test. Column C shows the number of components and column P shows the number of parts per component.
Experiments
HOG versus conv5 feature pyramids. In contrast to HOG features, conv5 features are more part-like and scale selective. Each conv5 pyramid shows 1 of 256 feature channels. The top two rows show a HOG feature pyramid and the face channel of a conv5 pyramid on the same input image.
References
[FH05] Pedro F. Felzenszwalb and Daniel P. Huttenlocher, Pictorial structures for object recognition, International Journal of Computer Vision 61 (2005), no. 1, 55–79.

[GWFM+13] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, arXiv preprint arXiv:1302.4389 (2013).

[OW13] Wanli Ouyang and Xiaogang Wang, Joint deep learning for pedestrian detection, ICCV, IEEE, 2013, pp. 2056–2063.