Top-down Neural Attention by Excitation Backprop
Jianming Zhang¹, Zhe Lin¹, Jonathan Brandt¹, Xiaohui Shen¹, Stan Sclaroff²
¹Adobe Research
²Boston University
ECCV 2016 Amsterdam
Motivation
2
[Figure: an artificial neural network maps an image to object categories, captions, and stories. Image credit: soul wind / stock.adobe.com]
Can these models ground their own predictions?
Goal: Generate Top-Down Attention Maps
4
[Figure: a CNN (Input → conv → conv → Inner Product) produces predictions such as "elephant" and "zebra". Bottom-up inference runs forward through the activation maps; top-down attention runs backward to produce an attention map for each predicted category.]
Related Work
5
Masking-based [1, 2]
Optimization-based [3]
Fully-conv-based [4, 5]
Backprop-based [6, 7, 8]
› General: applicable to a wide variety of DNNs
› Simple: can generate an attention map in a single backward pass
[1] Zhou et al. "Object detectors emerge in deep scene CNNs." ICLR, 2015.
[2] Bergamo et al. "Self-taught object localization with deep networks." arXiv:1409.3964, 2014.
[3] Cao et al. "Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks." ICCV, 2015.
[4] Sermanet et al. "OverFeat: Integrated recognition, localization and detection using convolutional networks." ICLR, 2014.
[5] Zhou et al. "Learning deep features for discriminative localization." CVPR, 2016.
[6] Zeiler et al. "Visualizing and understanding convolutional networks." ECCV, 2014.
[7] Simonyan et al. "Deep inside convolutional networks: Visualizing image classification models and saliency maps." ICLRW, 2014.
[8] Bach et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PLoS ONE, 2015.
Contributions
6
Excitation Backprop
• Based on the biologically-inspired Selective Tuning model of visual attention
• Probabilistic Winner-Take-All scheme that is applicable to modern DNNs
Contrastive Top-down Attention Formulation
• Significantly improves the discriminativeness of our attention maps
The Selective Tuning Model [Tsotsos et al. 1995]
7
Forward pass: compute the feature values at each layer, as well as the predictions.
Backward pass (Winner-Take-All): localize relevant regions, starting from the output layer.
For deep neural networks, this greedy winner-take-all method produces very sparse binary maps and uses information from only a very small portion of the whole network.
[1] Tsotsos et al. "Modeling visual attention via selective tuning." Artificial Intelligence, 1995.
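To make the sparsity point concrete, here is a minimal sketch of one greedy Winner-Take-All step in the spirit of Selective Tuning (the function name and flat-vector interface are illustrative assumptions, not the original formulation):

```python
import numpy as np

def greedy_wta_step(a_prev, W, winners_next):
    """One greedy WTA step: each winning neuron in layer N passes its
    token to its single strongest excitatory input in layer N-1.
    a_prev: (m,) activations of layer N-1; W: (n, m) weights;
    winners_next: indices of winning neurons in layer N."""
    winners = set()
    for i in winners_next:
        contrib = a_prev * np.maximum(W[i], 0.0)  # excitatory contributions only
        winners.add(int(np.argmax(contrib)))      # exactly one child wins per parent
    return winners
```

Because each winner selects exactly one child, the winner set never grows as the signal descends, which is why the resulting maps are sparse and binary.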
Our Approach: Probabilistic Winner-Take-All
8
Instead of deterministic Winner-Take-All [1], we sample winners (winner sampling).
Marginal Winning Probability (MWP): the probability that a neuron wins, marginalized over all top-down paths from the output. Equivalent to an Absorbing Markov Chain process.
[1] Tsotsos et al. "Modeling visual attention via selective tuning." Artificial Intelligence, 1995.
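The MWP formula on this slide did not survive extraction. As a hedged reconstruction of the paper's formulation (the notation is my paraphrase), the MWP of a neuron a_j marginalizes its conditional winning probability over its parent neurons:

```latex
P(a_j) \;=\; \sum_{a_i \in \mathcal{P}_j} P(a_j \mid a_i)\, P(a_i)
```

where \mathcal{P}_j is the set of parents of a_j in the top-down order. Starting from probability 1 at the selected output unit, this recursion walks down the network graph like an absorbing Markov chain, which is the equivalence noted above.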
Excitation Backprop
9
Assumptions:
› The responses of the activation neurons are non-negative.
› An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence of the detection.
[Figure: between Activation Layer N-1 and Activation Layer N, connections with positive weights act as excitatory neurons (+) and connections with negative weights act as inhibitory neurons (−).]
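Excitation Backprop instantiates P(a_j | a_i) as proportional to the child's activation times the positive part of the connecting weight. A minimal NumPy sketch for a fully connected layer, under the two assumptions above (the function name and flat-vector interface are mine; the authors' Caffe implementation is linked in the conclusion):

```python
import numpy as np

def excitation_backprop_fc(a_prev, W, p_next, eps=1e-12):
    """One Excitation Backprop step through a fully connected layer.

    a_prev: (m,) non-negative activations of layer N-1 (children).
    W:      (n, m) weights mapping layer N-1 to layer N (parents).
    p_next: (n,) marginal winning probabilities (MWP) of layer N.
    Returns the (m,) MWP of layer N-1.
    """
    W_pos = np.maximum(W, 0.0)   # keep only excitatory (positive) connections
    z = W_pos @ a_prev + eps     # per-parent normalization constant Z_i
    # parent i sends p_next[i] * a_j * w_ij^+ / z_i to child j;
    # summing over all parents gives each child's MWP
    return a_prev * (W_pos.T @ (p_next / z))
```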
Excitation Backprop
10
Running excitation backprop, we can extract attention maps from different layers.
Lower layers generate maps that highlight smaller-scale features.
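Under the same assumptions, extracting a map at a chosen layer is just a loop over the per-layer step above. This sketch reuses excitation_backprop_fc and treats every layer as fully connected (convolutions are handled analogously in the paper); the interface (activations[k] as layer k's flattened non-negative activations, weights[k] mapping layer k to k+1) is hypothetical:

```python
def attention_map_at(layer_idx, activations, weights, p_out, chw):
    """Propagate MWP from the output down to layer `layer_idx`, then
    sum over channels to form a spatial attention map of shape (H, W)."""
    p = p_out
    for k in range(len(weights) - 1, layer_idx - 1, -1):
        p = excitation_backprop_fc(activations[k], weights[k], p)
    c, h, w = chw
    return p.reshape(c, h, w).sum(axis=0)  # channel-summed spatial map
```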
Challenge: Responsive to Top-down Signals?
11
[Figure: maps for "zebra" and "elephant" obtained using VGG16 pool3]
Dominant neurons always win!
Negating the Output Layer for Contrastive Signals
12
[Figure: the "zebra" classifier yields a zebra map; negating the classifier weights yields a "non-zebra" classifier and a corresponding non-zebra map.]
Contrastive Maps
13
[Figure: contrastive maps for "zebra" and "elephant"]
The contrastive map subtracts the non-target map from the target map; negative values are truncated to 0 and values are rescaled (for visualization).
The contrastive attention map can be computed in a single backward pass.
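A minimal sketch of this construction. The interface is assumed: eb_top propagates MWP through the classifier layer given its weights, and eb_rest runs the remaining backward pass. Because the propagation is linear in the top-down signal for fixed activations, subtracting right below the classifier and doing one pass matches subtracting two full maps:

```python
import numpy as np

def contrastive_attention(eb_top, eb_rest, w_target):
    """Contrastive top-down attention (sketch, assumed interface)."""
    p_pos = eb_top(w_target)             # MWP below the top layer, target classifier
    p_neg = eb_top(-w_target)            # same, with negated ("non-target") weights
    p = np.maximum(p_pos - p_neg, 0.0)   # contrastive signal, truncated at 0
    return eb_rest(p)                    # single backward pass through the rest
```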
Evaluation: The Pointing Game
14
Task:
› Given an image and an object category, point to the targets.
Evaluation Metric:
› Mean pointing accuracy across categories
› Pointing anywhere on the target counts as a hit
CNN Models Tested:
› CNN-S [Chatfield et al. BMVC'14]
› VGG16 [Simonyan et al. ICLR'15]
› GoogLeNet [Szegedy et al. CVPR'15]
Model Training:
› Multi-label cross-entropy loss
› No localization annotations are used
(image credits: elena milevska / stock.adobe.com; howtomontessori.com)
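A sketch of the metric as I read it (the published protocol also allows a small tolerance margin around the target, omitted here; ground truth is taken as a boolean target mask):

```python
import numpy as np

def pointing_accuracy(attention_maps, target_masks):
    """Fraction of images where the attention maximum lies on the target.
    attention_maps, target_masks: lists of (H, W) arrays; masks are boolean."""
    hits = 0
    for amap, mask in zip(attention_maps, target_masks):
        y, x = np.unravel_index(np.argmax(amap), amap.shape)
        hits += bool(mask[y, x])   # hit if the max point falls on the target
    return hits / len(attention_maps)
```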
Results on VOC07 (GoogLeNet)
15
[Bar chart: mean pointing accuracy over categories; bar values 69.5, 79.3, 74.3, 72.8, 80.8, 79.3, and 85.1. The bar labels were lost in extraction; the compared methods include Grad [1], Deconv [2], LRP [3], and CAM [4], with the proposed contrastive method scoring highest at 85.1.]
[1] Simonyan et al. "Deep inside convolutional networks: Visualizing image classification models and saliency maps." ICLRW, 2014.
[2] Zeiler et al. "Visualizing and understanding convolutional networks." ECCV, 2014.
[3] Bach et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PLoS ONE, 2015.
[4] Zhou et al. "Learning deep features for discriminative localization." CVPR, 2016.
Results on MS COCO (GoogLeNet)
16
[Bar chart: mean pointing accuracy over categories; bar values 27.7, 42.6, 35.7, 40.2, 41.6, 43.6, and 53.8. Bar labels were lost in extraction; the best score, 53.8, belongs to the proposed method.]
Qualitative Comparison
17
Qualitative Comparison
18
Top-down Attention from an 18K-Tag Classifier
19
Train an image tag classifier for ~18K tags:
› 6M Stock images with user tags
› Pre-trained GoogLeNet model from the Caffe Model Zoo
› Multi-label cross-entropy loss
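The deck does not spell out the loss; a common reading of "multi-label cross-entropy" is an independent sigmoid per tag with averaged binary cross-entropy, sketched here under that assumption:

```python
import numpy as np

def multilabel_cross_entropy(logits, targets, eps=1e-12):
    """Averaged binary cross-entropy with an independent sigmoid per tag.
    logits, targets: (num_tags,) arrays; targets in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(targets * np.log(p + eps)
                    + (1.0 - targets) * np.log(1.0 - p + eps))
```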
An Interesting Case
20
Phrase Localization
21
Follow the evaluation protocol of the Flickr30K Entities dataset.
Localization based on top-down attention maps:
› Average the word attention maps to get the phrase attention map
› Compute object proposals
› Re-rank the proposals using the top-down phrase map (a sketch follows below)
[Bar chart: Accuracy/Recall@1 on All and Small Objects for MCG_base, Grad (MCG), Deconv (MCG), LRP (MCG), CAM (MCG), Ours (MCG), and CCA [1] (EdgeBoxes); bar values were lost in extraction.]
[1] Plummer et al. "Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." ICCV, 2015.
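As promised above, a minimal sketch of the re-ranking step. The attention-mass scoring function is my illustrative choice, not necessarily the paper's exact score:

```python
import numpy as np

def rerank_proposals(phrase_map, proposals):
    """Sort proposals by the phrase attention mass they contain.
    phrase_map: (H, W) average of the word attention maps.
    proposals: list of (x0, y0, x1, y1) integer boxes (e.g., from MCG)."""
    def attention_mass(box):
        x0, y0, x1, y1 = box
        return float(phrase_map[y0:y1, x0:x1].sum())
    return sorted(proposals, key=attention_mass, reverse=True)
```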
Conclusion
22
Excitation Backprop + Contrastive Attention = Discriminative Top-down Attention Map
GPU & CPU implementation in Caffe:
https://github.com/jimmie33/Caffe-ExcitationBP
Backup Slides
23
Does the Contrastive Attention Formulation Work for Other Methods?
24
[Bar chart, VOC07 difficult set: original vs. contrastive variants of Ours, Grad, CAM, and Deconv; bar values 60.4, 61.4, 61.9, 49.4, 70.6, 61.9, and 67.7 (the bar-to-method mapping was lost in extraction).]
Deconv:
+ Truncates negative signals
− Requires normalization
− Requires two backward passes
− Does not use the activation values in the backpropagation
Phrase Localization on the Flickr30K Entities Dataset [1]
[1] Plummer et al. “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.” ICCV, 2015.
25
Results
Mean Accuracy over Object Categories in the Pointing Game
26
Excitation Backprop
27
Assumptions:
› The responses of the activation neurons are non-negative.
› An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence of the detection.
Running excitation backprop, we can extract attention maps from different layers.
Lower layers generate maps that highlight smaller-scale features.
Example Results
28
Contrastive Attention
29
[Figure: contrastive maps for "zebra" vs. "elephant" and "elephant" vs. "zebra", thresholded at 0]
The pair of maps is well normalized by our probabilistic framework.
The contrastive attention map can be computed in a single backward pass.