
Page 1: Top-down Neural Attention by Excitation Backprop

Jianming Zhang¹, Zhe Lin¹, Jonathan Brandt¹, Xiaohui Shen¹, Stan Sclaroff²

¹Adobe Research  ²Boston University

ECCV 2016, Amsterdam

Pages 2–3: Motivation

Artificial Neural Networks

• Object Categories
• Captions
• Stories

(image © soul wind / stock.adobe.com)

Can these models ground their own predictions?

Pages 4–6: Goal: Generate Top-Down Attention Maps

[Figure: a CNN (conv → conv → Inner Prod. → conv) maps the input image to activation maps and predictions such as “elephant” and “zebra”. Bottom-up inference runs from the input to the prediction; top-down attention propagates a selected prediction back toward the input.]

Pages 7–8: Related Work

• Masking-based [1, 2]
• Optimization-based [3]
• Fully-conv-based [4, 5]
• Backprop-based [6, 7, 8]

In contrast, our method is:
› General: applicable to a wide variety of DNNs
› Simple: can generate an attention map in a single backward pass

[1] Zhou et al. “Object detectors emerge in deep scene CNNs.” ICLR, 2015.
[2] Bergamo et al. “Self-taught object localization with deep networks.” arXiv preprint arXiv:1409.3964, 2014.
[3] Cao et al. “Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks.” ICCV, 2015.
[4] Sermanet et al. “OverFeat: Integrated recognition, localization and detection using convolutional networks.” ICLR, 2014.
[5] Zhou et al. “Learning Deep Features for Discriminative Localization.” CVPR, 2016.
[6] Zeiler et al. “Visualizing and understanding convolutional networks.” ECCV, 2014.
[7] Simonyan et al. “Deep inside convolutional networks: Visualizing image classification models and saliency maps.” ICLRW, 2014.
[8] Bach et al. “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.” PLoS ONE, 2015.

Page 9: Contributions

Excitation Backprop
• Based on the biologically-inspired Selective Tuning model of visual attention
• A probabilistic Winner-Take-All scheme that is applicable to modern DNNs

Contrastive Top-Down Attention Formulation
• Significantly improves the discriminativeness of our attention maps

Pages 10–11: The Selective Tuning Model [Tsotsos et al. 1995]

• Forward pass: compute the feature values at each layer, as well as the predictions.
• Backward pass: localize relevant regions by applying Winner-Take-All layer by layer, starting from the output layer.

For deep neural networks, this greedy winner-take-all method produces very sparse binary maps, and it only uses information from a very small portion of the whole network.

[1] Tsotsos et al. “Modeling Visual Attention via Selective Tuning.” Artificial Intelligence, 1995.
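To make the sparsity concrete, here is a minimal NumPy sketch of a greedy winner-take-all backward trace through fully connected layers. This is an illustration under simplifying assumptions (invented function and variable names, excitation measured as activation × positive weight), not the Selective Tuning algorithm itself:

```python
import numpy as np

def greedy_wta_trace(winner, acts, weights):
    """Greedy winner-take-all backward trace (drastically simplified).

    winner:  index of the winning neuron in the top layer
    acts:    child activations per layer, ordered top to bottom
    weights: weight matrices per layer, W[i, j] connecting child j to parent i

    At each layer only the single child that excites the current winner the
    most is kept, so the trace that reaches the input is extremely sparse --
    the limitation the probabilistic formulation addresses.
    """
    path = [winner]
    for a, W in zip(acts, weights):
        excitation = a * np.maximum(W[winner], 0.0)  # each child's excitatory input
        winner = int(np.argmax(excitation))          # keep only the top child
        path.append(winner)
    return path
```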

Pages 12–14: Our Approach: Probabilistic Winner-Take-All

• Winner-Take-All [1]: the winner at each layer is selected deterministically.
• Winner Sampling: instead, winners are sampled stochastically, layer by layer.
• Marginal Winning Probability (MWP): the probability that a neuron is sampled as a winner, marginalized over all winner paths from the output layer.

Computing the MWP is equivalent to an Absorbing Markov Chain process.

[1] Tsotsos et al. “Modeling Visual Attention via Selective Tuning.” Artificial Intelligence, 1995.
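For reference, the MWP recursion (paraphrasing the paper's formulation; here 𝒫ⱼ denotes the set of parent neurons of aⱼ in the top-down direction) is:

```latex
% A neuron's marginal winning probability is the sum, over its parents, of
% the probability of winning given that the parent won, weighted by the
% parent's own winning probability.
\[
P(a_j) \;=\; \sum_{a_i \in \mathcal{P}_j} P(a_j \mid a_i)\, P(a_i)
\]
```

Because all probability mass enters at the output layer and is redistributed downward step by step, this recursion behaves like an absorbing Markov chain whose absorbing states are the neurons of the target layer.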

Pages 15–17: Excitation Backprop

Assumptions:
• The responses of the activation neurons are non-negative.
• An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence in the detection.

[Figure: Activation Layer N connected to Activation Layer N−1, with excitatory neurons (+) and an inhibitory neuron (−).]
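Under these assumptions, each parent redistributes its winning probability only through excitatory (non-negative-weight) connections, in proportion to each child's activation. A minimal NumPy sketch for one fully connected layer, with invented names (the released Caffe code at https://github.com/jimmie33/Caffe-ExcitationBP is the reference implementation):

```python
import numpy as np

def eb_backward_fc(p_out, a_in, W):
    """One Excitation Backprop step through a fully connected layer.

    p_out: (m,) marginal winning probabilities of the parent (output) neurons
    a_in:  (n,) non-negative activations of the child (input) neurons
    W:     (m, n) weights of the forward pass out = W @ a_in

    Implements P(a_j) = sum_i P(a_j | a_i) P(a_i) with
    P(a_j | a_i) = Z_i * a_j * w_ij for w_ij >= 0 and 0 otherwise,
    where Z_i normalizes each parent's outgoing probabilities to sum to 1.
    """
    W_pos = np.maximum(W, 0.0)   # keep excitatory connections only
    z = W_pos @ a_in             # per-parent normalizer (1 / Z_i)
    z[z == 0.0] = 1.0            # guard: parents with no excitatory input
    # each parent spreads its probability over its children in proportion to
    # (child activation * positive weight); sum the shares per child
    return a_in * (W_pos.T @ (p_out / z))
```

Starting from a one-hot distribution on the target class and applying this step layer by layer yields the MWP map at any chosen layer; probability mass is conserved at every step, so the maps need no extra normalization.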

Page 18: Excitation Backprop

Running Excitation Backprop, we can extract attention maps from different layers. Lower layers can generate maps that highlight features of smaller scale.

Pages 19–20: Challenge: Responsive to Top-Down Signals?

[Figure: “zebra” and “elephant” attention maps obtained using VGG16 pool3.]

Dominant neurons always win!

Pages 21–23: Negating the Output Layer for Contrastive Signals

[Figure: the “zebra” classifier yields a zebra map; negating its output-layer weights yields a “non-zebra” classifier and a non-zebra map.]

Page 24: Contrastive Maps

[Figure: contrastive attention maps for “zebra” and “elephant”; negative values truncated to 0 and image values rescaled for visualization.]

The contrastive attention map can be computed in a single backward pass.
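A sketch of the contrastive computation, reusing eb_backward_fc from the sketch above and again assuming fully connected layers (convolutional layers need the analogous per-window computation). The single-pass property follows because the propagation below the classifier layer is linear in the probability signal:

```python
import numpy as np

def contrastive_attention(p_top, acts, weights):
    """Contrastive MWP sketch: target classifier minus negated classifier.

    p_top:   one-hot probabilities over classes (the target, e.g. "zebra")
    acts:    child activations per layer, ordered from the classifier down
    weights: weight matrices per layer, same ordering
    """
    p_pos = eb_backward_fc(p_top, acts[0], weights[0])   # "zebra" signal
    p_neg = eb_backward_fc(p_top, acts[0], -weights[0])  # "non-zebra" signal
    p = p_pos - p_neg                 # contrast right below the classifier
    for a, W in zip(acts[1:], weights[1:]):
        p = eb_backward_fc(p, a, W)   # shared lower layers: one pass suffices
    return np.maximum(p, 0.0)         # truncate negative values
```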

Page 25: Evaluation: The Pointing Game

Task:
› Given an image and an object category, point to the targets.

Evaluation metric:
› Mean pointing accuracy across categories.
› Pointing anywhere on the targets counts as a hit.

CNN models tested:
› CNN-S [Chatfield et al. BMVC’14]
› VGG16 [Simonyan et al. ICLR’15]
› GoogleNet [Szegedy et al. CVPR’15]

Model training:
› Multi-label cross-entropy loss.
› No localization annotations are used.

(image credits: elena milevska / stock.adobe.com; howtomontessori.com)
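A sketch of the hit test, assuming the targets are given as a binary mask (the paper allows a 15-pixel tolerance around the ground-truth region):

```python
import numpy as np

def pointing_hit(att_map, gt_mask, tol=15):
    """One pointing-game trial: does the attention maximum hit the target?

    att_map: (H, W) top-down attention map for the queried category
    gt_mask: (H, W) boolean mask of the category's ground-truth regions
    tol:     pixel tolerance around the target (15 px in the paper)
    """
    y, x = np.unravel_index(np.argmax(att_map), att_map.shape)
    ys, xs = np.nonzero(gt_mask)
    if ys.size == 0:
        return False                  # category absent from the image
    # hit if the maximum lies within `tol` pixels of any target pixel
    return bool(np.min(np.hypot(ys - y, xs - x)) <= tol)
```

Per-category accuracy is #hits / (#hits + #misses), and the reported number is the mean over categories.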

Page 26: Results on VOC07 (GoogleNet)

[Bar chart: Mean Accuracy over Categories for seven methods, including the baselines [1]–[4] and ours; bar values 69.5, 79.3, 74.3, 72.8, 80.8, 79.3, and 85.1, with our contrastive method highest at 85.1.]

[1] Simonyan et al. “Deep inside convolutional networks: Visualizing image classification models and saliency maps.” ICLRW, 2014.
[2] Zeiler et al. “Visualizing and understanding convolutional networks.” ECCV, 2014.
[3] Bach et al. “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.” PLoS ONE, 2015.
[4] Zhou et al. “Learning Deep Features for Discriminative Localization.” CVPR, 2016.

Page 27: Results on MS COCO (GoogleNet)

[Bar chart: Mean Accuracy over Categories for the same seven methods; bar values 27.7, 42.6, 35.7, 40.2, 41.6, 43.6, and 53.8, with our contrastive method highest at 53.8.]

Pages 28–29: Qualitative Comparison

[Figure: qualitative comparison of attention maps across methods.]

Page 30: Top-Down Attention from an 18K-Tag Classifier

We train an image tag classifier for ~18K tags:
› 6M stock images with user tags
› Pre-trained GoogleNet model from the Caffe Model Zoo
› Multi-label cross-entropy loss

Page 31: An Interesting Case

Page 32: Phrase Localization

We follow the evaluation protocol of the Flickr30K Entities dataset.

Localization based on top-down attention maps (see the sketch below):
› Take the average of the word attention maps to get the phrase attention map.
› Compute object proposals.
› Re-rank the proposals using the top-down phrase map.

[Bar chart: Accuracy/Recall@1 for All and for Small Objects, comparing MCG_base, Grad (MCG), Deconv (MCG), LRP (MCG), CAM (MCG), Ours (MCG), and CCA [1] (EdgeBoxes).]

[1] Plummer et al. “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.” ICCV, 2015.
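A sketch of the re-ranking step; the scoring function here is illustrative (the paper's exact ranking term differs in details), and the box format is an assumption:

```python
import numpy as np

def rerank_proposals(att_map, boxes):
    """Re-rank object proposals by top-down attention mass (illustrative).

    att_map: (H, W) phrase attention map (mean of the word maps)
    boxes:   iterable of (x0, y0, x1, y1) proposal boxes, integer pixels
    """
    total = att_map.sum() + 1e-8
    def score(box):
        x0, y0, x1, y1 = box
        inside = att_map[y0:y1, x0:x1].sum()
        area = max((x1 - x0) * (y1 - y0), 1)
        # favor boxes that cover much of the attention mass while staying
        # tight: coverage of the map times attention density in the box
        return (inside / total) * (inside / area)
    return sorted(boxes, key=score, reverse=True)
```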

Page 33: Conclusion

Excitation Backprop + Contrastive Attention = Discriminative Top-Down Attention Map

GPU & CPU implementation in Caffe:
https://github.com/jimmie33/Caffe-ExcitationBP

Page 34: Backup Slides

Page 35: Does the Contrastive Attention Formulation Work for Other Methods?

[Bar chart on the VOC07 difficult set: original vs. contrastive variants of Ours, Grad, CAM, and Deconv; bar values 60.4, 61.4, 61.9, 49.4, 70.6, 61.9, and 67.7.]

Deconv:
+ Truncates negative signals
− Requires normalization
− Requires two backward passes
− Does not use the activation values in the backpropagation

Page 36: Phrase Localization on the Flickr30K Entities Dataset [1]

[1] Plummer et al. “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.” ICCV, 2015.

Page 37: Results

Mean Accuracy over Object Categories in the Pointing Game

Page 38: Excitation Backprop

Assumptions:
• The responses of the activation neurons are non-negative.
• An activation neuron is tuned to detect certain visual features; its response is positively correlated with its confidence in the detection.

Running Excitation Backprop, we can extract attention maps from different layers. Lower layers can generate maps that highlight features of smaller scale.

Page 39: Example Results

Pages 40–43: Contrastive Attention

[Figure: contrastive attention maps for “zebra” and “elephant” queries on two images, thresholded at 0.]

The pair of maps is well normalized using our probabilistic framework.

The contrastive attention map can be computed in a single pass.