DeepFix: a fully convolutional neural network for predicting human fixations (UPC Reading Group)


DeepFix: A Fully Convolutional Neural Network for Predicting Human Fixations

Srinivas S S Kruthiventi, Kumar Ayush, and R. Venkatesh Babu (arXiv October 2015) [URL]

Slides by Xavier Giró-i-Nieto, from the Computer Vision Reading Group (27/10/2015). https://imatge.upc.edu/web/teaching/computer-vision-reading-group

Introduction

Bottom-up attention

● Automatic
● Reflexive
● Stimulus-driven


Top-down attention

● Subject's prior knowledge
● Expectations
● Task-oriented
● Memory
● Behavioral goals


Visual Attentional Mechanisms

Bottom-up: automatic, reflexive, stimulus-driven
Top-down: subject's prior knowledge, expectations, task-oriented, memory, behavioral goals


DeepFix vs. classic method (comparison figure)


MIT300 benchmark [URL]


CAT2000 benchmark [URL]

The ingredients


Very deep network


Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014)

● Inspired by Oxford's VGG net (19 layers)
● 20 layers
● Small kernel sizes

Fully convolutional network (FCN)


● Fully connected layers at the end are replaced by convolutional layers with very large receptive fields.

● They capture the global context of the scene.

● End-to-end training

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440)
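The FC-to-convolution replacement can be sketched in NumPy (toy sizes, not the paper's): a fully connected layer applied to a flattened feature map is exactly a convolution whose kernel spans the whole map, which is what lets an FCN run on inputs of any size.

```python
import numpy as np

# Hypothetical sizes: a 4x4 feature map with 3 channels, 5 output units.
H, W, C, OUT = 4, 4, 3, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))
weights = rng.standard_normal((OUT, H, W, C))

# Fully connected layer: flatten the input and apply a weight matrix.
fc_out = weights.reshape(OUT, -1) @ x.reshape(-1)

# Equivalent convolution: one HxWxC kernel per output unit, valid padding.
# With the kernel as large as the input, the convolution reduces to a
# single dot product per filter.
conv_out = np.array([(weights[k] * x).sum() for k in range(OUT)])

assert np.allclose(fc_out, conv_out)
```

On a larger input, the same kernels would slide spatially and produce a map of predictions instead of a single vector.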


Inception layers

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going Deeper With Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9)

● GoogLeNet
● Different kernel sizes operating in parallel
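A minimal sketch of the inception idea, with hypothetical single-channel kernels: parallel convolutions of different sizes run on the same input with "same" padding, so their outputs keep the input's spatial size and can be stacked along the channel axis.

```python
import numpy as np

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero 'same' padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))                              # one input map
branches = [rng.standard_normal((s, s)) for s in (1, 3, 5)]  # parallel kernels

# Each branch sees the same input; 'same' padding preserves spatial size,
# so the branch outputs concatenate cleanly along the channel axis.
y = np.stack([conv2d_same(x, k) for k in branches])
print(y.shape)  # (3, 8, 8): one output channel per parallel branch
```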


Location Biased Convolutional (LBC) layer

● Centre-bias

The network


Architecture


Small 3×3 convolutional filters with stride 1 allow a large depth without increasing the memory requirement.
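The standard arithmetic behind this choice (a VGG-style argument; the channel count below is an assumption): two stacked 3×3 layers cover the same 5×5 receptive field as one 5×5 layer, with fewer parameters and an extra non-linearity.

```python
C = 512  # assumed channel count, same in and out; biases ignored

# One 5x5 convolution vs. two stacked 3x3 convolutions: both have an
# effective 5x5 receptive field, but the stack is cheaper.
params_5x5 = 5 * 5 * C * C
params_two_3x3 = 2 * (3 * 3 * C * C)
print(params_two_3x3 / params_5x5)  # 18/25 = 0.72
```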


Max pooling layers (in red) reduce computation.


Gradual increase in the number of channels to progressively learn richer semantic representations: 64, 128, 256, 512...


Weights initialized from VGG-16 net for stable and effective learning


A 3×3 convolution kernel with holes of size 2 has a receptive field of 5×5.


Capture multi-scale semantic structure using two inception style convolutional modules


Very large receptive fields of 25x25 by introducing holes of size 6 in kernels
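Both receptive-field figures follow from the effective kernel size of a convolution with holes (à trous / dilated convolution): k_eff = (k − 1) · hole + 1. Note the 25×25 value implies 5×5 kernels, which is an inference here, not stated on the slide.

```python
def effective_kernel(k, hole):
    """Effective receptive field of a kxk kernel with hole (dilation) size."""
    return (k - 1) * hole + 1

print(effective_kernel(3, 2))  # 5  -> the 5x5 field quoted for 3x3 kernels
print(effective_kernel(5, 6))  # 25 -> 25x25, assuming 5x5 kernels
```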


Location Biased Convolutional (LBC) layers



(Equation figure: the c'th filter in the convolutional layer operates on the input blob concatenated with location blobs; the location blobs are constant during training, while the filter's weights are learnt during training.)
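A minimal sketch of the LBC idea, assuming this mechanism: constant centre-biased Gaussian maps are concatenated with the layer input, so the learnt filter weights that follow can exploit absolute location. All names and sizes below are hypothetical.

```python
import numpy as np

def gaussian_blob(h, w, sigma):
    """A constant centre-biased map; fixed, not trained."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

H, W = 16, 16
x = np.random.default_rng(0).standard_normal((64, H, W))  # input feature maps
blobs = np.stack([gaussian_blob(H, W, s) for s in (2.0, 4.0, 8.0)])

# The LBC input: ordinary features plus constant location blobs. The
# convolutional filters applied next are learnt over ALL these channels,
# so they can use absolute position when predicting saliency.
lbc_input = np.concatenate([x, blobs], axis=0)
print(lbc_input.shape)  # (67, 16, 16)
```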


The final output, of size W/8 × H/8, is upsampled to the input resolution.

Experiments


Training


● 1st stage: mouse clicks from Microsoft COCO
● 2nd stage: MIT1003 and CAT2000

Not mentioned how to go from eye fixations to heat maps!
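One common way to build such ground-truth maps (standard practice in saliency work, not necessarily what the authors did) is to place a Gaussian at every fixation point and normalise the result:

```python
import numpy as np

def fixations_to_heatmap(points, h, w, sigma=3.0):
    """Place a Gaussian at each (row, col) fixation; normalise to [0, 1]."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w))
    for (r, c) in points:
        heat += np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
    return heat / heat.max()

# Two hypothetical fixations on a 40x40 image.
heat = fixations_to_heatmap([(10, 10), (20, 30)], 40, 40)
print(heat.shape, round(float(heat.max()), 2))  # (40, 40) 1.0
```

The sigma controls how far each fixation spreads; benchmarks often pick it to match roughly one degree of visual angle.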


● End-to-end (as JuntingNet)
● Caffe framework
● 1 day on a K40 GPU!

Results
