Top-down Neural Attention by Excitation Backprop
Jianming Zhang¹, Zhe Lin¹, Jonathan Brandt¹, Xiaohui Shen¹, Stan Sclaroff²
¹Adobe Research
²Boston University
ECCV 2016 Amsterdam
Motivation
2
[Figure: an artificial neural network maps an image to object categories, captions, and stories. Image credit: soul wind / stock.adobe.com]
Can these models ground their own predictions?
Goal: Generate Top-Down Attention Maps
4
[Figure: a CNN (Input → conv → conv → Inner Product) produces predictions such as "elephant" and "zebra". Bottom-up inference runs forward through the activation maps; top-down attention runs backward to produce an attention map for each predicted category.]
Related Work
5
Masking-based [1, 2]
Optimization-based [3]
Fully-conv-based [4, 5]
Backprop-based [6, 7, 8]
› General: applicable to a wide variety of DNNs
› Simple: can generate an attention map in a single backward pass
[1] Zhou et al. "Object detectors emerge in deep scene CNNs." ICLR, 2015.
[2] Bergamo et al. "Self-taught object localization with deep networks." arXiv:1409.3964, 2014.
[3] Cao et al. "Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks." ICCV, 2015.
[4] Sermanet et al. "OverFeat: Integrated recognition, localization and detection using convolutional networks." ICLR, 2014.
[5] Zhou et al. "Learning deep features for discriminative localization." CVPR, 2016.
[6] Zeiler et al. "Visualizing and understanding convolutional networks." ECCV, 2014.
[7] Simonyan et al. "Deep inside convolutional networks: Visualizing image classification models and saliency maps." ICLRW, 2014.
[8] Bach et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PLoS ONE, 2015.
Contributions
6
Excitation Backprop
• Based on the biologically-inspired Selective Tuning model of visual attention
• Probabilistic Winner-Take-All scheme that is applicable to modern DNNs
Contrastive Top-down Attention Formulation
• Significantly improves the discriminativeness of our attention maps
The Selective Tuning Model [Tsotsos et al. 1995]
7
Forward pass: compute the feature values at each layer, as well as the predictions.
Backward pass (Winner-Take-All): localize relevant regions, starting from the output layer.
For deep neural networks, this greedy winner-take-all method produces very sparse binary maps and uses information from only a very small portion of the whole network.
[1] Tsotsos et al. "Modeling visual attention via selective tuning." Artificial Intelligence, 1995.
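To make the sparsity point concrete, here is a minimal sketch of one greedy Winner-Take-All step in the spirit of Selective Tuning (the function name and flat-vector interface are illustrative assumptions, not the original formulation):

```python
import numpy as np

def greedy_wta_step(a_prev, W, winners_next):
    """One greedy WTA step: each winning neuron in layer N passes its
    token to its single strongest excitatory input in layer N-1.
    a_prev: (m,) activations of layer N-1; W: (n, m) weights;
    winners_next: indices of winning neurons in layer N."""
    winners = set()
    for i in winners_next:
        contrib = a_prev * np.maximum(W[i], 0.0)  # excitatory contributions only
        winners.add(int(np.argmax(contrib)))      # exactly one child wins per parent
    return winners
```

Because each winner selects exactly one child, the winner set never grows as the signal descends, which is why the resulting maps are sparse and binary.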
Our Approach: Probabilistic Winner-Take-All
8
Instead of deterministic Winner-Take-All [1], we sample winners (winner sampling).
Marginal Winning Probability (MWP): the probability that a neuron wins, marginalized over all top-down paths from the output. Equivalent to an Absorbing Markov Chain process.
[1] Tsotsos et al. "Modeling visual attention via selective tuning." Artificial Intelligence, 1995.
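The MWP formula on this slide did not survive extraction. As a hedged reconstruction of the paper's formulation (the notation is my paraphrase), the MWP of a neuron a_j marginalizes its conditional winning probability over its parent neurons:

```latex
P(a_j) \;=\; \sum_{a_i \in \mathcal{P}_j} P(a_j \mid a_i)\, P(a_i)
```

where \mathcal{P}_j is the set of parents of a_j in the top-down order. Starting from probability 1 at the selected output unit, this recursion walks down the network graph like an absorbing Markov chain, which is the equivalence noted above.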
Excitation Backprop
9
Assumptions:
› The responses of the activation neurons are non-negative.
› An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence of the detection.
[Figure: between Activation Layer N-1 and Activation Layer N, connections with positive weights act as excitatory neurons (+) and connections with negative weights act as inhibitory neurons (−).]
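Excitation Backprop instantiates P(a_j | a_i) as proportional to the child's activation times the positive part of the connecting weight. A minimal NumPy sketch for a fully connected layer, under the two assumptions above (the function name and flat-vector interface are mine; the authors' Caffe implementation is linked in the conclusion):

```python
import numpy as np

def excitation_backprop_fc(a_prev, W, p_next, eps=1e-12):
    """One Excitation Backprop step through a fully connected layer.

    a_prev: (m,) non-negative activations of layer N-1 (children).
    W:      (n, m) weights mapping layer N-1 to layer N (parents).
    p_next: (n,) marginal winning probabilities (MWP) of layer N.
    Returns the (m,) MWP of layer N-1.
    """
    W_pos = np.maximum(W, 0.0)   # keep only excitatory (positive) connections
    z = W_pos @ a_prev + eps     # per-parent normalization constant Z_i
    # parent i sends p_next[i] * a_j * w_ij^+ / z_i to child j;
    # summing over all parents gives each child's MWP
    return a_prev * (W_pos.T @ (p_next / z))
```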
Excitation Backprop
10
Running excitation backprop, we can extract attention maps from different layers.
Lower layers generate maps that highlight smaller-scale features.
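Under the same assumptions, extracting a map at a chosen layer is just a loop over the per-layer step above. This sketch reuses excitation_backprop_fc and treats every layer as fully connected (convolutions are handled analogously in the paper); the interface (activations[k] as layer k's flattened non-negative activations, weights[k] mapping layer k to k+1) is hypothetical:

```python
def attention_map_at(layer_idx, activations, weights, p_out, chw):
    """Propagate MWP from the output down to layer `layer_idx`, then
    sum over channels to form a spatial attention map of shape (H, W)."""
    p = p_out
    for k in range(len(weights) - 1, layer_idx - 1, -1):
        p = excitation_backprop_fc(activations[k], weights[k], p)
    c, h, w = chw
    return p.reshape(c, h, w).sum(axis=0)  # channel-summed spatial map
```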
Challenge: Responsive to Top-down Signals?
11
[Figure: maps for "zebra" and "elephant" obtained using VGG16 pool3]
Dominant neurons always win!
Negating the Output Layer for Contrastive Signals
12
[Figure: the "zebra" classifier yields a zebra map; negating the classifier weights yields a "non-zebra" classifier and a corresponding non-zebra map.]
Contrastive Maps
13
[Figure: contrastive maps for "zebra" and "elephant"]
The contrastive map subtracts the non-target map from the target map; negative values are truncated to 0 and values are rescaled (for visualization).
The contrastive attention map can be computed in a single backward pass.
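A minimal sketch of this construction. The interface is assumed: eb_top propagates MWP through the classifier layer given its weights, and eb_rest runs the remaining backward pass. Because the propagation is linear in the top-down signal for fixed activations, subtracting right below the classifier and doing one pass matches subtracting two full maps:

```python
import numpy as np

def contrastive_attention(eb_top, eb_rest, w_target):
    """Contrastive top-down attention (sketch, assumed interface)."""
    p_pos = eb_top(w_target)             # MWP below the top layer, target classifier
    p_neg = eb_top(-w_target)            # same, with negated ("non-target") weights
    p = np.maximum(p_pos - p_neg, 0.0)   # contrastive signal, truncated at 0
    return eb_rest(p)                    # single backward pass through the rest
```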
Evaluation: The Pointing Game
14
Task:
› Given an image and an object category, point to the targets.
Evaluation Metric:
› Mean pointing accuracy across categories
› Pointing anywhere on the target counts as a hit
CNN Models Tested:
› CNN-S [Chatfield et al. BMVC'14]
› VGG16 [Simonyan et al. ICLR'15]
› GoogLeNet [Szegedy et al. CVPR'15]
Model Training:
› Multi-label cross-entropy loss
› No localization annotations are used
(image credits: elena milevska / stock.adobe.com; howtomontessori.com)
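A sketch of the metric as I read it (the published protocol also allows a small tolerance margin around the target, omitted here; ground truth is taken as a boolean target mask):

```python
import numpy as np

def pointing_accuracy(attention_maps, target_masks):
    """Fraction of images where the attention maximum lies on the target.
    attention_maps, target_masks: lists of (H, W) arrays; masks are boolean."""
    hits = 0
    for amap, mask in zip(attention_maps, target_masks):
        y, x = np.unravel_index(np.argmax(amap), amap.shape)
        hits += bool(mask[y, x])   # hit if the max point falls on the target
    return hits / len(attention_maps)
```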
Results on VOC07 (GoogLeNet)
15
[Bar chart: mean pointing accuracy over categories; bar values 69.5, 79.3, 74.3, 72.8, 80.8, 79.3, and 85.1. The bar labels were lost in extraction; the compared methods include Grad [1], Deconv [2], LRP [3], and CAM [4], with the proposed contrastive method scoring highest at 85.1.]
[1] Simonyan et al. "Deep inside convolutional networks: Visualizing image classification models and saliency maps." ICLRW, 2014.
[2] Zeiler et al. "Visualizing and understanding convolutional networks." ECCV, 2014.
[3] Bach et al. "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation." PLoS ONE, 2015.
[4] Zhou et al. "Learning deep features for discriminative localization." CVPR, 2016.
Results on MS COCO (GoogLeNet)
16
[Bar chart: mean pointing accuracy over categories; bar values 27.7, 42.6, 35.7, 40.2, 41.6, 43.6, and 53.8. Bar labels were lost in extraction; the best score, 53.8, belongs to the proposed method.]
Qualitative Comparison
17
Qualitative Comparison
18
Top-down Attention from an 18K-Tag Classifier
19
Train an image tag classifier for ~18K tags:
› 6M Stock images with user tags
› Pre-trained GoogLeNet model from the Caffe Model Zoo
› Multi-label cross-entropy loss
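The deck does not spell out the loss; a common reading of "multi-label cross-entropy" is an independent sigmoid per tag with averaged binary cross-entropy, sketched here under that assumption:

```python
import numpy as np

def multilabel_cross_entropy(logits, targets, eps=1e-12):
    """Averaged binary cross-entropy with an independent sigmoid per tag.
    logits, targets: (num_tags,) arrays; targets in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(targets * np.log(p + eps)
                    + (1.0 - targets) * np.log(1.0 - p + eps))
```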
An Interesting Case
20
Phrase Localization
21
Follow the evaluation protocol of the Flickr30K Entities dataset.
Localization based on top-down attention maps:
› Average the word attention maps to get the phrase attention map
› Compute object proposals
› Re-rank the proposals using the top-down phrase map (a sketch follows below)
[Bar chart: Accuracy/Recall@1 on All and Small Objects for MCG_base, Grad (MCG), Deconv (MCG), LRP (MCG), CAM (MCG), Ours (MCG), and CCA [1] (EdgeBoxes); bar values were lost in extraction.]
[1] Plummer et al. "Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models." ICCV, 2015.
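As promised above, a minimal sketch of the re-ranking step. The attention-mass scoring function is my illustrative choice, not necessarily the paper's exact score:

```python
import numpy as np

def rerank_proposals(phrase_map, proposals):
    """Sort proposals by the phrase attention mass they contain.
    phrase_map: (H, W) average of the word attention maps.
    proposals: list of (x0, y0, x1, y1) integer boxes (e.g., from MCG)."""
    def attention_mass(box):
        x0, y0, x1, y1 = box
        return float(phrase_map[y0:y1, x0:x1].sum())
    return sorted(proposals, key=attention_mass, reverse=True)
```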
Conclusion
22
Excitation Backprop + Contrastive Attention = Discriminative Top-down Attention Map
GPU & CPU implementation in Caffe:
https://github.com/jimmie33/Caffe-ExcitationBP
Backup Slides
23
Does the Contrastive Attention Formulation Work for Other Methods?
24
[Bar chart, VOC07 difficult set: original vs. contrastive variants of Ours, Grad, CAM, and Deconv; bar values 60.4, 61.4, 61.9, 49.4, 70.6, 61.9, and 67.7 (the bar-to-method mapping was lost in extraction).]
Deconv:
+ Truncates negative signals
− Requires normalization
− Requires two backward passes
− Does not use the activation values in the backpropagation
Phrase Localization on the Flickr30K Entities Dataset [1]
[1] Plummer et al. “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.” ICCV, 2015.
25
Results
Mean Accuracy over Object Categories in the Pointing Game
26
Excitation Backprop
27
Assumptions:
› The responses of the activation neurons are non-negative.
› An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence of the detection.
Running excitation backprop, we can extract attention maps from different layers.
Lower layers generate maps that highlight smaller-scale features.
Example Results
28
Contrastive Attention
29
[Figure: contrastive maps for "zebra" vs. "elephant" and "elephant" vs. "zebra", thresholded at 0]
The pair of maps is well normalized by our probabilistic framework.
The contrastive attention map can be computed in a single backward pass.