Devil in the Details: Analysing the Performance of ConvNet Features
Ken Chatfield - University of Oxford
May 2015
The Devil is still in the Details
2011 2014
• This work compares the latest ConvNet-based feature representations on common ground
• We compare both different pre-trained network architectures and different learning heuristics
Comparing Apples to Apples
Input Dataset → CNN Arch 1 / CNN Arch 2 / IFV / … → Fixed Learning → Fixed Evaluation Protocol
Performance Evolution over VOC2007
[Bar chart: mAP on VOC2007 per method; timeline 2008 2010 2013 2014... 2015]
Method | Dim. | Aug. | mAP
BOW | 32K | – | 54.48
IFV-BL | 327K | – | 61.69
IFV | 84K | – | 64.36
IFV | 84K | f s | 68.02
DeCAF | 4K | t t | 73.41
CNN-F | 4K | f s | 77.15
CNN-M 2K | 2K | f s | 80.13
CNN-S | 4K (TN) | f s | 82.42
VGG-D+E | 4K | S s | 89.70 (2015)
CNN-based methods: DeCAF, CNN-F, CNN-M 2K, CNN-S, VGG-D+E
Evaluation Setup
Pre-trained Net on 1,000 ImageNet Classes
CNN Feature Extractor (4096-D feature vector out)
SVM Classifier: train on training set, test on test set
classifier output
Evaluate using mAP, accuracy etc.
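The last step of the evaluation setup, scoring classifier outputs with mAP, can be illustrated with a minimal sketch. It assumes one list of classifier scores and one list of binary labels per class, and it computes plain (non-interpolated) average precision; the official VOC 2007 protocol uses an interpolated variant that this sketch does not reproduce. The function names and toy data are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean precision at each positive's rank."""
    # Rank test images by decreasing classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """mAP: the average of the per-class AP values."""
    aps = [average_precision(s, l)
           for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)


# Toy data (assumed): two classes, four test images each.
scores = [[0.9, 0.8, 0.7, 0.6], [0.2, 0.9, 0.4, 0.1]]
labels = [[1, 0, 1, 0], [0, 1, 1, 0]]
map_value = mean_average_precision(scores, labels)
print(round(map_value, 4))
```

In a full pipeline the scores would come from one SVM per class applied to the 4096-D features of the test set.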
Outline
1 Different pre-trained networks
2 Data augmentation (for both CNN and IFV)
3 Dataset fine-tuning
Network Architectures
• CNN-F Network
• CNN-M Network
• CNN-S Network
• VGG Very Deep Network
Network Architectures
CNN-F Network
Similar to Krizhevsky et al. (ILSVRC-2012 winner)
input image
conv1 64x11x11 stride 4
conv2 256x5x5 stride 1
conv3 256x3x3 stride 1
conv4 512x3x3
conv5 512x3x3
fc6 d.o. 4096-D
fc7 d.o. 4096-D
x2 x2
Network Architectures
CNN-M Network
Similar to Zeiler & Fergus (ILSVRC-2013 winner)
input image
conv1 96x7x7 stride 2
conv2 256x5x5 stride 2
conv3 512x3x3 stride 1
conv4 512x3x3
conv5 512x3x3
fc6 d.o. 4096-D
fc7 d.o. 4096-D
x2 x2
Smaller receptive window size + stride in conv1
Network Architectures
CNN-S Network
Similar to Overfeat ‘accurate’ network (ICLR 2014)
input image
conv1 96x7x7 stride 2
conv2 256x5x5 stride 1
conv3 512x3x3 stride 1
conv4 512x3x3
conv5 512x3x3
fc6 d.o. 4096-D
fc7 d.o. 4096-D
x3 x2
Smaller stride in conv2
Network Architectures
VGG Very Deep Network
Simonyan & Zisserman (ICLR 2015)
input image
conv1a 64x3x3 stride 1
conv1b 64x3x3 stride 1
conv1c 64x3x3 stride 1
x2
conv2a 128x3x3 stride 1
conv2b 128x3x3 stride 1
conv2c 128x3x3 stride 1
fc6 d.o. 4096-D
fc7 d.o. 4096-D
Smaller receptive window size + stride, and deeper
Weights for C channels: three stacked 3x3 layers need 3(3²C²) = 27C², one 7x7 layer needs 7²C² = 49C²
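The weight counts above can be checked with a short calculation. This is a sketch under stated assumptions: biases are ignored, every layer maps C channels to C channels, and C = 64 is an arbitrary illustrative value. The helper names are hypothetical.

```python
def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in one conv layer with a square kernel (biases ignored)."""
    return kernel * kernel * channels_in * channels_out


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 conv layers with the same kernel."""
    return 1 + layers * (kernel - 1)


C = 64  # assumed channel count, for illustration only
three_3x3 = 3 * conv_weights(3, C, C)  # 3(3^2 C^2) = 27 C^2
one_7x7 = conv_weights(7, C, C)        # 7^2 C^2 = 49 C^2

print(three_3x3, one_7x7)
print(stacked_receptive_field(3, 3))   # same 7x7 window as a single 7x7 layer
```

Three stacked 3x3 layers cover the same 7x7 window as one 7x7 layer while using 27/49 of the weights.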
Pre-trained networks
mAP (VOC07):
• Decaf: 73.41
• CNN-F: 77.38
• CNN-M: 79.89
• CNN-S: 79.74
• VGG-VD: 89.3
Data Augmentation
Given a pre-trained ConvNet, augmentation is applied at test time
CNN Feature Extractor
Pre-trained Network
a. Extract crops
b. Pool features (average, max)
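The pooling step (b) can be sketched as an element-wise reduction over the feature vectors extracted from each crop. The function name and the toy 3-D vectors are assumptions for illustration; real features would be 4096-D.

```python
def pool_features(features, mode="average"):
    """Combine per-crop feature vectors element-wise into one vector."""
    if not features:
        raise ValueError("need at least one feature vector")
    columns = list(zip(*features))  # group values by feature dimension
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError("mode must be 'average' or 'max'")


# Toy features (assumed): two crops, three dimensions each.
crop_features = [[0.0, 2.0, 1.0], [4.0, 0.0, 1.0]]
avg_pooled = pool_features(crop_features, "average")
max_pooled = pool_features(crop_features, "max")
print(avg_pooled, max_pooled)
```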
Data Augmentation
a. No augmentation (= 1 image)
b. Flip augmentation (= 2 images)
c. Crop+Flip augmentation (= 10 images)
224x224 crops + flips
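One way to arrive at 10 images for option (c) is sketched here. The layout is an assumption: four corner crops plus one centre crop of 224x224, each with its horizontal flip, taken from an image assumed to be 256x256. The function name and the box format are hypothetical.

```python
def ten_crop_boxes(width, height, size=224):
    """Return 10 (left, top, right, bottom, flipped) crop specifications."""
    if width < size or height < size:
        raise ValueError("image is smaller than the crop size")
    right, bottom = width - size, height - size
    # Assumed layout: four corners, then the centre.
    origins = [(0, 0), (right, 0), (0, bottom), (right, bottom),
               (right // 2, bottom // 2)]
    boxes = []
    for left, top in origins:
        for flipped in (False, True):  # each crop plus its horizontal flip
            boxes.append((left, top, left + size, top + size, flipped))
    return boxes


boxes = ten_crop_boxes(256, 256)  # 256x256 input is an assumption
print(len(boxes))
print(boxes[8])  # centre crop, not flipped
```

Each box would be cut out, optionally mirrored, passed through the network, and the resulting features pooled.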
Data Augmentation
mAP (VOC07):
Augmentation | IFV | CNN-M
None | 64.36 | 76.97
Flip | 64.35 | 76.99
Crop+Flip (train pooling: sum, test pooling: sum) | 66.68 | 79.44
Crop+Flip (train pooling: none, test pooling: sum) | 67.17 | 79.89
Scale Augmentation
224x224 crops + flips
[S_min, S_max] = [256, 512]
Q = {S_min, 0.5(S_min + S_max), S_max}
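With the values given on the slide, the set Q works out as follows. The function name is hypothetical; only the formula and the [256, 512] range come from the slide.

```python
def scale_set(s_min, s_max):
    """Q = {S_min, 0.5(S_min + S_max), S_max}."""
    return [s_min, 0.5 * (s_min + s_max), s_max]


Q = scale_set(256, 512)
print(Q)
```

So the three scales are 256, 384 and 512.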
Fully Convolutional Net
Sermanet et al. 2014 (Overfeat)
• Convert final fc layers to convolutional layers
• Output is then an activation map which can be pooled
8.8% ⇒ 7.5% top-5 val. error ILSVRC-2014
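The fc-to-conv conversion can be illustrated with a toy sketch in plain Python. It assumes an fc layer without bias or nonlinearity that reads a flattened channels x k x k input; its weight rows are reshaped into k x k kernels, which can then slide over a larger input to give an activation map. All sizes and names here are illustrative assumptions, not the networks' actual dimensions.

```python
def fc_forward(weights, x_flat):
    """Fully connected layer without bias: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, x_flat)) for row in weights]


def fc_to_conv_kernels(weights, channels, k):
    """Reshape each fc weight row (length channels*k*k) into a channels x k x k kernel."""
    return [[[[row[(c * k + i) * k + j] for j in range(k)]
              for i in range(k)] for c in range(channels)] for row in weights]


def conv_valid(kernels, fmap):
    """Stride-1 'valid' cross-correlation; fmap is channels x H x W."""
    k = len(kernels[0][0])
    height, width = len(fmap[0]), len(fmap[0][0])
    out = []
    for kernel in kernels:
        plane = []
        for top in range(height - k + 1):
            plane.append([sum(kernel[c][i][j] * fmap[c][top + i][left + j]
                              for c in range(len(fmap))
                              for i in range(k) for j in range(k))
                          for left in range(width - k + 1)])
        out.append(plane)
    return out


def flatten(fmap):
    """Flatten channels x H x W in channel, row, column order."""
    return [v for channel in fmap for row in channel for v in row]


weights = [[1, 0, 0, 1], [1, 1, 1, 1]]  # toy fc layer: 2 units, 1x2x2 input
kernels = fc_to_conv_kernels(weights, channels=1, k=2)

small = [[[1, 2], [3, 4]]]                   # same size as the fc input
large = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # larger input

fc_out = fc_forward(weights, flatten(small))
conv_out_small = conv_valid(kernels, small)  # 1x1 map per unit, matches fc_out
activation_map = conv_valid(kernels, large)  # 2x2 map per unit
pooled = [sum(v for row in plane for v in row) / (len(plane) * len(plane[0]))
          for plane in activation_map]       # average pooling over the map
print(fc_out, conv_out_small, activation_map, pooled)
```

On an input of the original size the converted layer reproduces the fc output; on a larger input it yields one score per spatial position, which is then pooled.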
Fine Tuning
conv1 96x7x7
conv2 256x5x5
conv3 512x3x3
conv4 512x3x3
conv5 512x3x3
fc6 d.o. 4096-D
fc7 d.o. 4096-D
ILSVRC softmax
Fine Tuning
VOC 2007 Train Images
conv1 96x7x7
conv2 256x5x5
conv3 512x3x3
conv4 512x3x3
conv5 512x3x3
fc6 d.o. 4096-D
fc7 d.o. 4096-D
VOC07 SVM loss
Fine Tuning
mAP (VOC07):
• No TN: 79.7
• TN-CLS: 82.2
• TN-RNK: 82.4
• TN-CLS: classification loss max{ 0, 1 - y w^T φ(I) }
• TN-RNK: ranking loss max{ 0, 1 - w^T ( φ(I_POS) - φ(I_NEG) ) }
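The two losses can be written out directly for a single example. This is a minimal sketch, assuming labels y in {-1, +1} and feature vectors φ(I) given as plain lists; the weights and features are toy values and the function names are hypothetical.

```python
def dot(w, x):
    """Inner product w^T x."""
    return sum(a * b for a, b in zip(w, x))


def classification_hinge_loss(w, phi, y):
    """max{0, 1 - y w^T phi(I)}, with y assumed to be -1 or +1."""
    return max(0.0, 1.0 - y * dot(w, phi))


def ranking_hinge_loss(w, phi_pos, phi_neg):
    """max{0, 1 - w^T (phi(I_POS) - phi(I_NEG))}."""
    diff = [p - n for p, n in zip(phi_pos, phi_neg)]
    return max(0.0, 1.0 - dot(w, diff))


w = [0.5, -1.0]        # toy classifier weights (assumed)
phi_pos = [0.0, 0.5]   # positive image, score -0.5
phi_neg = [1.0, 0.0]   # negative image, score +0.5

cls_loss = classification_hinge_loss(w, phi_pos, +1)
rnk_loss = ranking_hinge_loss(w, phi_pos, phi_neg)
print(cls_loss, rnk_loss)
```

The classification loss penalises each image against a fixed margin, while the ranking loss only asks that a positive image scores at least one unit higher than a negative one.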
Comparison with State of the Art
Method | ILSVRC-2012 (top-5 error) | VOC2007 (mAP) | VOC2012 (mAP)
CNN-M 2048 | 13.5 | 80.1 | 82.4
CNN-S | 13.1 | 79.7 | 82.9
CNN-S TUNE-RNK | 13.1 | 82.4 | 83.2
Zeiler & Fergus | 16.1 | - | 79.0
Oquab et al. | 18.0 | 77.7 | 78.7 (82.8*)
Wei et al. | - | 81.5 (85.2*) | 81.7 (90.3*)
Clarifai (1 net) | 12.5 | - | -
GoogLeNet (1 net) | 7.9 | - | -
VGG Very Deep (1 net) | 7.0 | 89.3 | 89.0
Take-home Messages
If you get the details right, a relatively simple ConvNet-based pipeline can outperform much more complex architectures
• Data augmentation helps a lot, both for deep and shallow features
• Fine tuning makes a difference, and should use ranking loss where appropriate
• Smaller filters and deeper networks help, although feature computation is slower
There’s more…
• Presented here was just a subset of the full results from the paper
• Check out the paper for full results on:
  • VOC 2007
  • VOC 2012
  • Caltech-101
  • Caltech-256
  • ILSVRC-2012
Source Code
• Caffe-compatible CNN models can be downloaded from the Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
• Matlab feature computation code is also available from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval
Related Publications
“Return of the Devil in the Details: Delving Deep into Convolutional Nets” BMVC 2014 Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (Best Paper Prize)
“The devil is in the details: an evaluation of recent feature encoding methods” BMVC 2011 Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Victor Lempitsky, Andrew Zisserman (Best Poster Prize Honourable Mention, 300+ citations)
http://www.robots.ox.ac.uk/~ken