Deformable Part Models are Convolutional Neural Networks
Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik
Presenter: YANG Wei
January 25, 2016
Outline
1 Introduction
2 DeepPyramid DPMs
  Feature pyramid front-end CNN
  Constructing an equivalent CNN from a DPM
3 Implementation details
4 Experiments
Deformable Part Models vs. Convolutional Neural Networks
Deformable part models
Convolutional neural networks
Are DPMs and CNNs actually distinct?
DPMs: graphical models
CNNs: “black-box” non-linear classifiers
This paper shows that any DPM can be formulated as an equivalent CNN, i.e., deformable part models are convolutional neural networks.
DeepPyramid DPMs
Schematic model overview: “front-end CNN” + DPM-CNN
input: image pyramid
output: object detection scores
Feature pyramid front-end CNN
front-end CNN: AlexNet (conv1-conv5).
A CNN that maps an image pyramid to a feature pyramid
AlexNet: single-scale architecture
Constructing an equivalent CNN from a DPM
A single-component DPM.
mixture of components
component = root filter + part filters
Inference with DPMs
The matching process at one scale.
Architecture of DPM-CNN
The unrolled detection algorithm of DPM generates a specific network with fixed length:

1 input: conv5 feature pyramid of the front-end CNN
2 generate P+1 feature maps: 1 root filter and P part filters
3 the P part feature maps are fed into a distance transform layer
4 the root feature map is stacked (channel-wise concatenated) with the transformed part feature maps
5 the resulting (P+1)-channel feature map is convolved with an object geometry filter, which produces the output DPM score map for the input pyramid level
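The five steps can be sketched as a small NumPy toy model. This is a hedged illustration, not the paper's implementation: the helper names (`correlate_valid`, `dt_pool`, `dpm_cnn_score`), the quadratic deformation weights, and the use of per-channel `geometry` weights in place of the paper's sparse object geometry filter are all simplifying assumptions.

```python
import numpy as np

def correlate_valid(feat, filt):
    """Cross-correlate a single-channel feature map with a filter ('valid' mode)."""
    H, W = feat.shape
    h, w = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feat[y:y + h, x:x + w] * filt)
    return out

def dt_pool(resp, wx=0.1, wy=0.1):
    """Distance transform pooling: D(p) = max_q (resp(q) - cost(p - q)),
    with a quadratic deformation cost, brute-forced over the whole map."""
    H, W = resp.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.empty_like(resp)
    for y in range(H):
        for x in range(W):
            cost = wx * (xs - x) ** 2 + wy * (ys - y) ** 2
            out[y, x] = np.max(resp - cost)
    return out

def dpm_cnn_score(feat, root, parts, geometry):
    """Steps 1-5: root/part convolutions, DT pooling of part responses,
    channel-wise stacking, and a per-channel 'geometry' weighting."""
    root_map = correlate_valid(feat, root)                          # step 2 (root)
    part_maps = [dt_pool(correlate_valid(feat, p)) for p in parts]  # steps 2-3
    maps = [root_map] + part_maps
    H = min(m.shape[0] for m in maps)                               # crop to common size
    W = min(m.shape[1] for m in maps)
    stacked = np.stack([m[:H, :W] for m in maps], axis=0)           # step 4
    return np.tensordot(geometry, stacked, axes=1)                  # step 5 (1x1 filter)
```

For example, an all-ones 4×4 feature map with a 2×2 root filter of ones, one 1×1 part filter of ones, and geometry weights [1, 1] yields a 3×3 score map that is everywhere 5 (root response 4 plus DT-pooled part response 1).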
Architecture of DPM-CNN
CNN equivalent to a single-component DPM.
Traditional distance transform
Traditional distance transforms are defined for sets of points on a grid [FH05].
G: a grid
d(p−q): a measure of distance between points p, q ∈ G
B ⊆ G

Then the distance transform of B on G is

D_B(p) = min_{q∈B} d(p−q)

[Figure: distance transform under the Euclidean distance]
Traditional distance transform
The DT can also be formulated as

D_B(p) = min_{q∈G} (d(p−q) + 1_B(q))    (1)

where

1_B(q) = { 0, if q ∈ B,
         { ∞, if q ∉ B.
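The two definitions agree, which is easy to check by brute force (a small sketch; the function names are ours and any metric d works):

```python
def dt_over_set(grid, B, d):
    """D_B(p) = min_{q in B} d(p, q): distance transform of the point set B."""
    return {p: min(d(p, q) for q in B) for p in grid}

def dt_with_indicator(grid, B, d):
    """Equivalent form D_B(p) = min_{q in G} (d(p, q) + 1_B(q))."""
    one_B = {q: 0.0 if q in B else float("inf") for q in grid}
    return {p: min(d(p, q) + one_B[q] for q in grid) for p in grid}

# Manhattan distance on a 2-D grid, used here for concreteness
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
```

The indicator 1_B simply disqualifies every q outside B by adding infinite cost, so the inner minimum is always attained at some q ∈ B.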
Generalized distance transform
A generalization of distance transforms can be obtained by replacing the indicator function with an arbitrary function f′ over the grid G:

D_{f′}(p) = min_{q∈G} (d(p−q) + f′(q))

We can also define the generalized DT as a maximization by letting f(q) = −f′(q):

D_f(p) = max_{q∈G} (f(q) − d(p−q))
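The min/max duality can be verified numerically: with f′ = −f, the two forms satisfy D_f(p) = −D_{f′}(p) at every grid point (a minimal sketch, names ours):

```python
def gdt_min(grid, f_prime, d):
    """Generalized DT, min form: D_{f'}(p) = min_{q in G} (d(p, q) + f'(q))."""
    return {p: min(d(p, q) + f_prime[q] for q in grid) for p in grid}

def gdt_max(grid, f, d):
    """Generalized DT, max form: D_f(p) = max_{q in G} (f(q) - d(p, q))."""
    return {p: max(f[q] - d(p, q) for q in grid) for p in grid}
```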
Distance transform in DPM
In DPM, after computing filter responses we transform the responses of the part filters to allow spatial uncertainty:

D_i(x,y) = max_{dx,dy} (R_i(x+dx, y+dy) − w_i · φ_d(dx,dy))

where φ_d(dx,dy) = [dx, dy, dx², dy²].

The value D_i(x,y) is the maximum contribution of the part to the score of a root location that places the anchor of this part at position (x,y).

By letting p = (x,y), p−q = (dx,dy), and d(p−q) = w_i · φ_d(p−q), we can see that this is exactly in the form of a distance transform.
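A 1-D slice of this part transform can be brute-forced directly. As a simplifying assumption the linear term below uses |dx| so the deformation cost stays nonnegative; the actual DPM learns a signed 4-vector w_i per part:

```python
import numpy as np

def part_transform(R, w1, w2):
    """1-D slice of D_i(x) = max_dx (R(x+dx) - (w1*|dx| + w2*dx^2)).
    R is the part filter response at each location; w1, w2 >= 0 are
    deformation weights (assumed, not learned here)."""
    n = len(R)
    D = np.empty(n)
    for x in range(n):
        dx = np.arange(n) - x          # displacement to every response location
        D[x] = np.max(R - (w1 * np.abs(dx) + w2 * dx ** 2))
    return D
```

A single strong response spreads to its neighbors with a quadratic falloff: for R = [0, 0, 5, 0, 0] and w1 = 0, w2 = 1, the transform gives [1, 4, 5, 4, 1].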
Max pooling as distance transform
Consider max pooling of f : G → R on a regular grid G. Let the window half-length be k; then max pooling can be defined as

M_f(p) = max_{∆p ∈ {−k,…,k}} f(p+∆p)

Max pooling can be expressed equivalently as a distance transform:

M_f(p) = max_{q∈G} (f(q) − d_max(p−q))

where

d_max(p−q) = { 0, if (p−q) ∈ {−k,…,k},    (2)
             { ∞, otherwise.
Generalize max pooling to distance transform pooling
We can generalize max pooling to distance transform pooling:

unlike max pooling, the distance transform of f at p is taken over the entire domain G
rather than specifying a fixed pooling window a priori, the shape of the pooling region can be learned from the data

The released code does not include the DT pooling layer; please refer to [OW13] for more details.
Object geometry filters
The root convolution map and the DT-pooled part convolution maps are stacked into a single feature map with P+1 channels and then convolved with a sparse object geometry filter.
Combining mixture components with maxout
CNN equivalent to a multi-component DPM. A multi-component DPM-CNN is composed of one DPM-CNN per component and a maxout [GWFM+13] layer that takes a max over component DPM-CNN outputs at each location.
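The maxout combination reduces to an elementwise max over the stacked per-component score maps (a minimal sketch):

```python
import numpy as np

def maxout_components(score_maps):
    """Combine per-component DPM-CNN score maps by taking, at each
    location, the max over all components (the maxout layer)."""
    return np.max(np.stack(score_maps, axis=0), axis=0)
```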
Feature pyramid front-end CNN
Implementation details:

pretrain on ILSVRC 2012 classification using Caffe
use conv5 as the output layer
“same” convolution: zero-pad each conv/pooling layer’s input with ⌊k/2⌋ zeros on all sides (top, bottom, left and right)
(x, y) in the conv5 feature map has a receptive field centered on pixel (16x, 16y) in the input image
conv5 feature maps: stride 16; receptive field 163×163
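The stride-16 and 163×163 figures follow from standard receptive-field arithmetic over AlexNet's conv1–conv5 stack; the (kernel, stride) list below is the standard AlexNet configuration, including its two intermediate max-pooling layers:

```python
def receptive_field(layers):
    """Accumulate receptive-field size and total stride over a stack of
    conv/pool layers, each given as (kernel_size, stride). With 'same'
    zero-padding, output unit (x, y) stays centered on input pixel
    (total_stride * x, total_stride * y)."""
    rf, total_stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * total_stride
        total_stride *= s
    return rf, total_stride

# AlexNet conv1-conv5:
# conv1(11,4), pool1(3,2), conv2(5,1), pool2(3,2), conv3(3,1), conv4(3,1), conv5(3,1)
alexnet_conv5 = [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
```

Running `receptive_field(alexnet_conv5)` reproduces exactly the stride of 16 and the 163×163 receptive field quoted above.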
Experiments
Detection average precision (%) on VOC 2007 test. Column C shows the number of components and column P shows the number of parts per component.
Experiments
HOG versus conv5 feature pyramids. In contrast to HOG features, conv5 features are more part-like and scale selective. Each conv5 pyramid shows 1 of 256 feature channels. The top two rows show a HOG feature pyramid and the face channel of a conv5 pyramid on the same input image.
References
[FH05] Pedro F. Felzenszwalb and Daniel P. Huttenlocher, Pictorial structures for object recognition, International Journal of Computer Vision 61 (2005), no. 1, 55–79.

[GWFM+13] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, arXiv preprint arXiv:1302.4389 (2013).

[OW13] Wanli Ouyang and Xiaogang Wang, Joint deep learning for pedestrian detection, ICCV, IEEE, 2013, pp. 2056–2063.