Deep Networks for Human Visual Attention:
A hybrid model using foveal vision
Ana Filipa Vieira de Jesus Almeida
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Professor Alexandre José Malheiro Bernardino
Professor José Alberto Rosado dos Santos-Victor
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Alexandre José Malheiro Bernardino
Member of the Committee: Professor Pedro Daniel dos Santos Miraldo
April 2017
Acknowledgments
First, I would like to thank my thesis advisor, Professor Alexandre Bernardino, for the opportunity to
develop this research project in the Computer and Robot Vision Laboratory of the Instituto Superior
Técnico. The door to Prof. Bernardino's office was always open, and he was a splendid advisor who
supported me throughout this past year.
Next, I would like to give special thanks to my lab mate Rui Figueiredo for all his support and for
encouraging me to do more and better. And, of course, to the other lab mates who contributed directly
or indirectly to this thesis, in particular Atabak Dehban.
Last but not least, I would like to thank my loved ones, my parents, my twin sister and my brother,
who have supported me throughout this entire process. Finally, I extend my thanks to the friends who
accompanied me through these demanding academic years.
Resumo
Visual attention plays a fundamental role in natural and artificial systems in controlling perceptual
resources. Classic artificial visual attention systems use salient image features obtained from
filter-based information.
Recently, deep neural networks have been developed for the recognition of thousands of objects,
autonomously generating visual features optimized by training on large data sets. Besides being used
for object recognition, these features have been very successful in other visual problems such as
object segmentation, tracking and, recently, visual attention.
This work proposes a biologically plausible object classification and localization framework that
incorporates bottom-up and top-down attention mechanisms, combining convolutional neural networks
with foveal vision. First, a feed-forward pass is performed to obtain the network's predictions for
the class labels. Next, for each of the top-5 predicted classes, an object location proposal is
obtained. This proposal results from applying a segmentation mask over the saliency map, which is
first computed through a backward pass. Finally, a second feed-forward pass re-classifies the image,
this time with attention. In this last stage, two visual sensing configurations are compared: a
uniform (Cartesian) one and a non-uniform (foveated) one. In the first, the image is cropped
according to the object location proposal and attention is directed to the new image, discarding the
context. In the second, our model of human visual foveation is applied, and the image is foveated
from the center of the location proposed for a given object. In this way, attention is directed to
the object, which is classified at different resolution levels.
The main contribution of our work lies in the evaluation of the use of images with uniform and
foveated resolution. We were able to establish the relationship between these different methods and
to evaluate the information preserved by each type of sensor as a function of its parameters.
The results demonstrate that it is not necessary to store and/or transmit all the information
present in a high-resolution image since, beyond a given amount of information, the performance
obtained in the classification task saturates.
Keywords: Visual attention, object classification and localization, deep neural networks,
computer vision, space-variant vision.
Abstract
Visual attention plays a central role in natural and artificial systems to control perceptual resources.
Classic artificial visual attention systems use salient features of the image obtained from the
information given by predefined filters.
Recently, deep neural networks have been developed that recognize thousands of objects and
autonomously generate visual features optimized by training with large data sets. Besides being
used for object recognition, these features have been very successful in other visual problems such as
object segmentation, tracking and, recently, visual attention.
In this work, we propose a biologically inspired object classification and localization framework that
incorporates bottom-up and top-down attentional mechanisms, combining Deep Convolutional Neural
Networks with foveal vision. First, a feed-forward pass is performed to obtain the predicted class
labels. Next, we obtain object location proposals by applying a segmentation mask to the saliency map
computed through a backward pass. Finally, the image is re-classified with attention in a second
feed-forward pass. In this final stage, two visual sensing configurations are compared: a uniform
(Cartesian) configuration, which re-classifies a cropped patch of the image and discards the
surrounding context, and a non-uniform tessellation, which transforms the image by applying a human
visual foveation model centered at the object location proposal.
The main contribution of our work lies in the evaluation of the performance obtained with uniform and
non-uniform resolutions. We were able to establish the relationship between performance and the
different levels of information preserved by each sensing configuration. The results demonstrate
that we do not need to store and transmit all the information present in high-resolution images since,
beyond a certain amount of preserved information, the performance in the classification task saturates.
Keywords: Computer vision, deep neural networks, object classification and localization, space-
variant vision, visual attention.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 5
2.1 Visual Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Preattention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Feature Integration Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Guided Search Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Boolean Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Mechanisms for Information Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Related Work 19
3.1 Classical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Modern Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 ImageNet Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Pre-trained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.1 CaffeNet/AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.3 VGGNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Hybrid Attention Model 29
4.1 Class Saliency Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Uniform vs Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Uniform Visual System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Foveal Visual System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Information Attenuation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Uniform Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Non-Uniform Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Implementation 37
5.1 Image-Specific Class Saliency Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Weakly Supervised Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Image Re-Classification with Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6 Results 43
6.1 Uniform vs Non-Uniform Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 First Feed-Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.2.1 Classification Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.2 Localization Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Top-Down Class Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4 First vs Second Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusions 51
7.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliography 57
List of Tables
3.1 ConvNet performance following the state of the art. . . . . . . . . . . . . . . . . . . . . . . 27
6.1 Summary of the evaluated topologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Figures
2.1 Photo-receptors density in the retina. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Diagram of the macula of the retina. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Treisman’s feature integration model of early vision. . . . . . . . . . . . . . . . . . . . . . 9
2.4 Guided search for steep green targets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Boolean maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 The perceptual model cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Neural network basic structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Convolutional Neural Network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Representation of max-pooling operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Taxonomy of visual attention models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 Ideal low-pass filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Gaussian filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 A summary of the human visual foveation model with four levels. . . . . . . . . . . . . . . 33
4.4 Representation of images acquired with two different visual sensing configurations: a
uniform and a log-polar distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Multistage attentional pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Representation of saliency maps and location proposals. . . . . . . . . . . . . . . . . . . 39
5.3 Representation of the convolutional layer output for 4 different input images on VGGNet
pre-trained model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1 Information gain in function of σ for uniform and non-uniform vision. . . . . . . . . . . . . 44
6.2 Demonstration of our weakly supervised object localization method . . . . . . . . . . . . . 45
6.3 Classification and localization performance of the first pass for several topologies. . . . . 49
6.4 Classification and localization performance of the second pass for several topologies. . . 50
Nomenclature
Acronyms
ANN Artificial Neural Networks
CNN Convolutional Neural Network
ConvNet Convolutional Network
FBA Feature-based Attention
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IOR Inhibition of Return
LRN Local Response Normalization
OBA Object-based Attention
ReLU Rectified Linear Unit
WTA Winner-Take-All
Parameters
σ0 Amount of uniform blur
f0 Size of the region with high acuity
c Class label
I Image
th Threshold
Chapter 1
Introduction
1.1 Motivation
The computational resources of the human brain are limited, so it is not possible to process all the
sensory information provided by the visual perceptual modality. For this reason, it is essential to focus
these resources such that only the relevant stimuli are processed and interpreted. Selective visual
attention mechanisms are fundamental in biological systems, being responsible for prioritizing the
elements of the visual scene to be attended.
Likewise, an important issue in many computer vision applications requiring real-time visual
processing is the computational effort involved [1]. Therefore, over the past decades, many biologically
inspired, attention-based methods have been proposed with the goal of building efficient systems
capable of working in real time. Attention modeling remains a topic of active research, studying
different ways to selectively process information in order to reduce the time and computational
complexity of existing methods.
Humans use attention mechanisms based on goal-oriented (top-down) and stimulus-driven (bottom-
up) information to define the region in the visual input where the attentional focus should be oriented [2].
In this way, the amount of processing is limited to a certain region of the visual field and the regions to
explore (salient) are prioritized in time. Similar mechanisms can also be applied to artificial systems that
share similar resource limitations. Much effort has been made towards understanding and applying the
human attention mechanisms in robotic systems.
Nowadays, modeling attention is still challenging due to the huge amount of information available
at any time and the laborious, time-consuming task of creating models by hand, trying to tune where
(regions) and what (objects) the observer should look at. For this purpose, biologically inspired
neural networks have been extensively used, since they can implicitly learn those mechanisms,
circumventing the need to create models by hand.
1.2 Goals
Our work is inspired by [3], which proposed to capture visual attention through feedback Deep
Convolutional Neural Networks. Similar in spirit, we propose a biologically inspired hybrid attention
model that combines bottom-up and top-down mechanisms and is capable of efficiently locating and
recognizing objects in digital images, using human-like vision.
More specifically, our method consists of three steps: first, we perform a feed-forward pass to
obtain the predicted class labels. Second, a backward pass is made to create a saliency map, which is
used to obtain object location proposals after applying a segmentation mask. Finally, a second
feed-forward pass is executed to re-classify the image with selective attention. With a non-uniform
foveal visual sensor, attention is directed to the proposed locations using a foveal spotlight model,
whereas for the uniform sensor, the attentional spotlight is oriented in a covert manner to cropped
patches of the original image.
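The second step, saliency from a backward pass followed by a segmentation mask, can be sketched as follows. This is an illustrative numpy toy, not the thesis implementation: a made-up score gradient stands in for the ConvNet's backward pass, and the proposal is simply the bounding box of the most salient pixels.

```python
import numpy as np

def saliency_map(score_gradient, image_shape):
    """Image-specific class saliency: the saliency of each pixel is the
    magnitude of the derivative of the class score with respect to that
    pixel. In a deep network this gradient comes from one backward pass;
    here it is supplied directly for illustration."""
    return np.abs(score_gradient).reshape(image_shape)

def location_proposal(saliency, quantile=0.95):
    """Threshold the saliency map and return the bounding box of the
    supra-threshold pixels: a crude segmentation-mask proposal."""
    th = np.quantile(saliency, quantile)
    ys, xs = np.nonzero(saliency >= th)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a gradient that peaks at pixel (row 1, column 1) of an 8x8 image
rng = np.random.default_rng(0)
grad = rng.normal(0.0, 0.1, 64)
grad[1 * 8 + 1] = 5.0          # strong evidence for the class at (1, 1)
sal = saliency_map(grad, (8, 8))
print(location_proposal(sal))   # box containing the salient pixel
```

The proposal is then used either to crop (uniform sensor) or as the fixation point for foveation (non-uniform sensor).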
What differentiates our model from Cao's model [3] is the use of two visual sensing configurations.
On one hand, Cao uses high-resolution images in both passes of the model. On the other hand,
we use images with different resolutions corresponding to two visual sensing configurations: uniform
vision, where we simulate a low-resolution sensor, and non-uniform foveal vision, where the sensor
presents space-variant resolution.
Our primary goal is to evaluate the performance of several well-known, state-of-the-art Convolutional
Neural Network architectures in object detection and localization tasks. Moreover, we assess the
performance of two different visual sensory structures, a conventional uniform (Cartesian) one and a
multi-resolution, human-inspired, foveal configuration, on the first and second feed-forward passes.
For the Cartesian configuration, re-classification is performed on a cropped patch of the image,
discarding the periphery. For the human-like tessellation, the image is foveated at the center of the
proposed location. Finally, a combined configuration is evaluated in which the input image has uniform
resolution on the first feed-forward pass, and on the second feed-forward pass the image is foveated
from the center of the object location proposal.
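The foveated configuration above can be viewed as space-variant resampling: sharp at the fixation point, increasingly blurred with eccentricity. A minimal numpy sketch of such a sensor, using a box-blur pyramid as a crude stand-in for the Gaussian foveation model; the parameter names (`f0`, `levels`) and falloff law are illustrative assumptions, not the thesis's exact model:

```python
import numpy as np

def box_blur(img, k):
    """Crude separable box blur of odd width k, a stand-in for the
    Gaussian low-pass filters of a real foveation pyramid."""
    if k <= 1:
        return img.astype(float)
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    h = np.stack([p[:, i:i + img.shape[1]] for i in range(k)]).mean(0)
    return np.stack([h[i:i + img.shape[0], :] for i in range(k)]).mean(0)

def foveate(img, cx, cy, f0=4.0, levels=4):
    """Each output pixel takes its value from a blur-pyramid level chosen
    by its eccentricity: sharp inside radius f0 around fixation (cx, cy),
    one level coarser each time the eccentricity doubles."""
    h, w = img.shape
    pyramid = np.stack([box_blur(img, 2 * l + 1) for l in range(levels)])
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - cx, ys - cy)
    lev = np.clip(np.log2(ecc / f0 + 1.0), 0, levels - 1)
    lo = np.floor(lev).astype(int)
    hi = np.minimum(lo + 1, levels - 1)
    frac = lev - lo
    low = np.take_along_axis(pyramid, lo[None], 0)[0]
    high = np.take_along_axis(pyramid, hi[None], 0)[0]
    # linear interpolation between adjacent levels avoids visible rings
    return (1.0 - frac) * low + frac * high
```

A pixel at the fixation point is reproduced exactly, while a distant pixel is averaged over the widest blur kernel, which is the information attenuation that Chapter 4 quantifies.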
1.3 Document Outline
The remainder of this document is organized as follows: Chapter 2 overviews the related work and some
fundamental concepts needed to better understand the proposed attentional framework. Section 2.1
describes the different types of visual attention, and Section 2.2 defines preattention and presents
some existing theories about how this process is carried out in visual systems. Section 2.3 presents
the differences between sensation and perception and characterizes the two types of processing,
top-down and bottom-up, in more detail. Section 2.4 gives a brief introduction to artificial neural
networks, and the architecture of the Convolutional Neural Networks used is presented in Section 2.5.
Finally, Section 2.6 covers the origins of Deep Neural Networks.
In Chapter 3, a taxonomy of the diverse computational models used in visual attention is presented.
Section 3.1 presents the classical methods and Section 3.2 describes the modern methods; both are
supported by work done by well-known researchers in the visual attention field. The data set selected
for this work is presented in Section 3.3, and several convolutional network models are described in
Section 3.4.
In Chapter 4, a theoretical explanation of the saliency calculation for a specific object class is
given in Section 4.1. The different visual sensors implemented in this work are presented in
Section 4.2, in particular the uniform visual system in Section 4.2.1 and the foveal visual system in
Section 4.2.2. Finally, a study of the information content present in images, useful for the
comparison between the two sensors, is presented in Section 4.3.
The proposed hybrid attention model is presented in Chapter 5, where the various steps that
constitute the framework are described.
Finally, the obtained results are presented in Chapter 6 and, in Chapter 7, we draw conclusions
from the work carried out and point to future contributions.
Chapter 2
Background
Vision is one of the five senses and allows organisms to improve their perception of the surrounding
environment, enabling greater knowledge of the world. There is evidence that vision is the dominant
sense in human beings. For example, Colavita [4] performed experiments in which visual (light) and
auditory (tone) stimuli were presented and participants were instructed to identify the respective
stimulus. The study revealed a predisposition among the subjects to direct their attention
preferentially toward the visual modality.
The process of seeing starts with light entering the eye through the cornea. The eye has the ability
to adapt to different levels of brightness (adaptation) and to shape its lens and pupil size in order to
focus objects at different distances (accommodation). The colored part of the eye, called the iris,
controls the amount of light that enters the eye, admitting more light when the environment is dark
and less when there is plenty of light. The light then passes through the pupil and is focused by the
lens, a nearly spherical body, onto the retina. The retina is a sensory membrane responsible for
receiving the visual stimuli and converting them into electric signals, which are transmitted through
the optic nerve to the visual cortex in the brain. The retina is full of photo-receptors: the rods,
located mostly at the periphery of the retina, and the cones, which distinguish colors and are located
mostly in the center (see Figure 2.1).
The most sensitive part of the retina is called the macula and comprises hundreds of nerve endings,
which allow us to see objects in great detail. It is subdivided into the perifovea, parafovea and
fovea, of which the fovea is located at the center of the macula (see Figure 2.2). Finally, the visual
stimuli are received when the signals coming through the optic nerve reach the back of the brain,
where the visual cortex is located and the stimuli are interpreted.
The proposed object localization and classification framework uses several biologically inspired
attention mechanisms, which include space-variant vision, and Artificial Neural Networks (ANN) for
top-down cognitive processes (i.e. guided, task-biased attention) in the recognition and localization
of objects. As such, in the remainder of this chapter, we describe the fundamental concepts from
neuroscience and computer science on which the proposed framework is based.
Figure 2.1: Photo-receptor density in the retina. The cones are concentrated in the fovea, the region of highest acuity, and the rods are distributed in the periphery. Figure adapted from [5].
Figure 2.2: Diagram of the macula of the retina, showing the perifovea, parafovea and fovea.
2.1 Visual Attention
Attention is a process through which an organism selects a sub-region of the visual field, the
so-called "focus of attention", to be processed in detail. This allows suppressing the rest of the
available information to obtain efficient perception.
Depending on the number of processed inputs, attention can be selective (processing only one
input) or divided (processing more than one input at once) [6]. In selective attention, irrelevant
stimuli are blocked and the desired information is promoted. The computational resources available
to humans are limited. This idea is suggested by Broadbent's filter theory [7], which introduced the
concept of a structural bottleneck: a limit on the amount of information that can pass through the
visual pathways at any time. Given this cognitive limitation, a selective filter is needed for
information processing.
Thereby, different selection models have been proposed to decide when to attend to a certain
stimulus. On one hand, in early selection models, stimuli are filtered or selected at an early stage
of processing. Early filters select the relevant information based on basic low-level features such
as color or stimulus direction. On the other hand, late selection models involve semantic selection,
which requires more attentional resources than early selection.
In divided attention, the focus of attention is split among more than one task at a time. This is
hard since resources are limited by a cognitive budget, so divided attention demands that resources be
separated among different tasks. When we want to perform two tasks at once, our attention needs to be
divided between them; in this case, carrying out task A will decrease the performance of task B when
both are performed at the same time. There is interference if both tasks share sensory modalities
(e.g. both use visual inputs), use the same mental processing stages (e.g. seeing and listening to
words) or use the same response mechanisms (e.g. listening to words and seeing pictures) [8]. These
kinds of interference appear because the amount of available resources is insufficient to perform both
tasks well. However, if a task is carried out frequently, the knowledge acquired during task execution
leads to automation, meaning less cognitive effort.
There are three types of visual attention [9]:
1. Spatial attention, which can be overt, when the observer moves their eyes to the relevant locations
and the focus of attention matches the eye movement, or covert, when attention is allocated
to relevant locations without eye movement;
2. Feature-based attention (FBA), which can attend to specific features (e.g. color, orientation or
motion direction) of objects, regardless of their location;
3. Object-based attention (OBA), where attention is guided by object structure.
The observer's attention can be stimulus-driven, triggered by scene characteristics like color or
orientation (bottom-up factors), or driven by specific visual characteristics that depend on the task
or goal to be achieved (top-down factors). Two questions stand out: where should the observer focus
his attention (spatial attention), and on which features (feature-based attention)? A problem can also
arise: inattentional blindness [10], the inability to detect unexpected objects to which we are not
paying attention. This temporary blindness can happen because it is impossible for a viewer to
attend to all stimuli.
William James [11] defines two modes of spatial attention that facilitate the processing and selection
of information: endogenous and exogenous.
The exogenous system is responsible for automatically orienting our attention, in an involuntary and
reflexive manner, to locations where sudden changes take place. For instance, imagine that we
hear a loud sound coming from outside. Our first reaction will be to direct our gaze to the sound
source (orienting reflex) with the purpose of updating our model of the world [12]. Since these
changes are unexpected and entirely stimulus-related, they correspond to bottom-up processing, also
known as stimulus-driven attention.
The endogenous system is voluntary and corresponds to allocating attentional resources to a
predetermined location. This way, the orienting of attention results from taking task-specific goals
into account. In this situation, we can direct attention to a location in space or to an object. This
is known as top-down processing or goal-driven attention. For example, if our goal is to count how
many people will leave a room, we will orient our attention to the door. This means that, with this
knowledge, we can guide our attention to relevant places to make the process more efficient [13].
When a viewer is asked to find a specific target, he knows what to find but not where; for that, he
must search [8]. There are two types of search mechanisms: parallel (pop-out) search and serial
search. If the target differs from the other elements of the visual field by a single feature (e.g.
color), the search is performed very quickly (pop-out search) [14] [15] (see Section 2.2). There is no
need to guide attention to any of the elements of the field, since it is enough to detect the presence
of an activation in the corresponding feature map. However, if the target differs from the non-targets
(distractors) by a conjunction of features (e.g. a red horizontal bar), the search is slower (serial
search). In this case, search time depends linearly on the number of non-target elements, since
attention must be focused on each and every one of them [14] [16].
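The contrast between the two search regimes can be illustrated with a toy reaction-time simulation. All constants below (base time, per-item cost, the self-terminating search assumption) are made-up illustrative values, not empirical data:

```python
import random

def search_time(n_distractors, conjunction, t_item=50.0, t_base=200.0):
    """Toy reaction-time model of visual search, in milliseconds.
    A pop-out target is found in roughly constant time; a conjunction
    target forces a serial scan whose expected cost grows linearly
    with the number of display items."""
    if not conjunction:
        return t_base                        # parallel: one check of a feature map
    # serial self-terminating search: on average half the items are inspected
    inspected = random.randint(1, n_distractors + 1)
    return t_base + t_item * inspected

random.seed(0)
for n in (5, 20, 40):
    serial = sum(search_time(n, True) for _ in range(2000)) / 2000
    print(f"{n:2d} distractors: pop-out {search_time(n, False):.0f} ms, "
          f"serial about {serial:.0f} ms")
```

Pop-out time is flat in the number of distractors, while the average serial time grows linearly, which is the signature Treisman used to classify tasks as preattentive or not.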
2.2 Preattention
The concept of preattention refers to noticing something before attention is fully focused on it, i.e.
we can almost instantly recognize an element in the visual field using low-level information.
Typically, tasks that can be performed in less than 200-250 milliseconds are considered preattentive
[17]. The visual properties that we can detect effortlessly, called preattentive attributes, are
color, movement, form, lightness and spatial position, since they simply "pop out". Imagine an image
with a certain number of blue balls (distractors) and one red ball (target), randomly placed. If we
look at the image for a fraction of a second, we will detect the presence of the target without
focusing on any specific region. This happens because the target has a visual property, "red", that
the distractors do not. The same logic can be applied to images containing elements with different
geometric forms, like circles and triangles. However, if a target is defined by the presence of two or
more visual properties, it often cannot be found preattentively. In those cases (conjunction targets),
viewers must search through the display to confirm its presence or absence [17].
Some theories attempt to explain how preattentive processing is done, including feature
integration [18], guided search [19] and boolean maps [20]. In the remainder of this section, we
explain the basic ideas behind these theories.
2.2.1 Feature Integration Theory
Treisman [21] focused on two problems: first, determining which visual features are detected
preattentively; second, formulating a hypothesis about how the visual system performs preattentive
processing [18].
To identify the preattentive features, she ran target-detection experiments, measuring performance
by response time and by accuracy. In the response-time model, viewers are asked to complete the task
as quickly as possible while the number of distractors on the display is varied. If the elapsed time
is below some chosen threshold, the task is considered preattentive. In the accuracy model, the task
is the same and the number of distractors also varies, but the display is shown for a fraction of a
second and then removed. If the viewers finish the task accurately, the feature used to define the
target is considered preattentive. From these experiments she identified several visual features that
are detected preattentively: shape, color, size, contrast, orientation and intensity, among others [21].
To understand how preattentive processing is done, Treisman proposed a model (see Figure 2.3). In
this model, each feature map registers the activity for a specific visual feature like contrast or
size. When an image is shown, features are encoded in parallel into their respective maps. These maps
only provide the activity log of each feature. If the target has a unique feature, we just have to
check whether there is activity in the respective feature map. However, for a conjunction target, one
feature map is not enough; a serial search must be done to find the target with the correct
combination of features. In this case, focused attention is used, which increases the time and effort
spent.
Figure 2.3: Treisman’s feature integration model of early vision — detection of activity in individual feature maps can be done in parallel, but to search for a combination of features, attention must be focused. Figure adapted from [17].
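The contrast between the two search modes can be sketched in code (a toy illustration with invented items and features, not part of Treisman's model):

```python
# Toy sketch of Treisman's feature integration idea (illustrative only).
# Each "feature map" records which display items show activity for one feature.
items = [
    {"color": "red", "shape": "circle"},
    {"color": "green", "shape": "circle"},
    {"color": "green", "shape": "square"},
]

def feature_map(items, feature, value):
    """Parallel stage: one boolean activity map per feature value."""
    return [item[feature] == value for item in items]

# Unique-feature target ("red"): a single map answers in one parallel check.
red_map = feature_map(items, "color", "red")
target_present = any(red_map)

# Conjunction target ("green square"): no single map suffices, so items
# must be checked serially with focused attention.
def serial_conjunction_search(items, wanted):
    for index, item in enumerate(items):  # one item at a time
        if all(item[f] == v for f, v in wanted.items()):
            return index
    return None

found = serial_conjunction_search(items, {"color": "green", "shape": "square"})
```

The parallel check costs one map lookup regardless of display size, while the serial search time grows with the number of items, mirroring the experimental findings described above.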
2.2.2 Guided Search Theory
The theory of guided search was proposed by Wolfe [19] [22] and tries to take into account the goals of the observer in the task. He proposed that the combination of bottom-up and top-down information creates an activation map during visual search. Attention is then guided to the peaks of the activation map, which correspond to the areas of the image to which bottom-up and top-down information contributed the most.
Wolfe agreed with Treisman that, in preattentive processing, the image is divided into individual feature maps. Each map corresponds to a feature that can be filtered into several categories; for example, the feature ’orientation’ can be divided into shallow, steep, left and right (see Figure 2.4). Bottom-up activation measures how different an element is from its neighbors, while top-down activation is user-driven, answering the requests posed by the task. For instance, if we want to find a ”green” element, this generates a top-down request appropriate to the task. Wolfe also argues that observers should specify the request in terms of the categories existing in each feature map (e.g. color map, category ’G’) [23].
Figure 2.4: Guided search for steep green targets: an image is divided into individual feature maps that are filtered into categories, and bottom-up and top-down activation point out target regions. Then, the information is blended into an activation map where the highest peaks represent the locations to which attention will be directed. Figure adapted from [17].
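The map-blending idea can be sketched as follows (the maps and the equal weighting are invented for illustration, not taken from Wolfe's model):

```python
import numpy as np

# Minimal sketch of guided search's activation map: blend a stimulus-driven
# map with a task-driven map and attend to the highest peak.
bottom_up = np.array([[0.1, 0.8],
                      [0.2, 0.1]])   # local-difference saliency
top_down = np.array([[0.0, 0.9],
                     [0.0, 0.0]])    # task bias, e.g. for "steep green"

# Equal weighting is an arbitrary illustrative choice.
activation = 0.5 * bottom_up + 0.5 * top_down
peak = np.unravel_index(np.argmax(activation), activation.shape)
```

Here `peak` is the image location where both sources of activation agree most strongly, which is where attention would be directed first.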
2.2.3 Boolean Map
With the purpose of understanding why we cannot notice features that are irrelevant to the immediate task, Huang proposed a new model of low-level vision [20]. In this model, the scene is divided into two distinct and complementary regions: the excluded elements and the selected elements. The latter can be accessed for detailed analysis.
There are two ways to create boolean maps. In the first, the observer defines a value for an individual feature, and all objects with the specified feature value are selected. Imagine we are looking for green objects: the color feature label for the boolean map will be ”green”. This means that, for the feature ’color’, we want access to objects with ”green” labels. Since we are not looking for other features (e.g. orientation), their labels remain undefined. An example of boolean maps is presented in Figure 2.5, where the goal is to select red objects or vertical objects. In the second, the observer chooses a set of elements at specific spatial locations. In this case, no value is assigned to the features, so all labels are undefined.
Huang’s theory differs from feature integration theory because the latter does not provide information on feature location, whereas boolean maps preserve the spatial locations of the selected elements as well as their feature labels. Another advantage is that we can create a boolean map that is the union or the intersection of two existing maps, as shown in Figure 2.5 (d).
Figure 2.5: Boolean maps: (a) red and blue vertical and horizontal elements; (b) map for “red”, color label is red, orientation label is undefined; (c) map for “vertical”, orientation label is vertical, color label is undefined; (d) map for set intersection of the “red” and “vertical” maps. Figure adapted from [17].
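Since boolean maps are simply binary masks over the scene, the set operations of Figure 2.5 can be sketched directly (the 2×2 scene layout below is an invented example):

```python
import numpy as np

# Boolean maps as binary masks: True where an element has the selected
# feature value (layout is a made-up 2x2 scene for illustration).
red_map = np.array([[True, False],
                    [True, False]])      # map for color == "red"
vertical_map = np.array([[True, True],
                         [False, False]])  # map for orientation == "vertical"

union = red_map | vertical_map          # "red OR vertical"
intersection = red_map & vertical_map   # "red AND vertical", as in Fig. 2.5 (d)
```

Because the maps keep their spatial layout, the result of a union or intersection is again a boolean map that records where the selected elements are.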
2.3 Mechanisms for Information Processing
Sensation and perception play different but complementary roles in the way we interpret our world. Sensation is the process by which we sense the environment around us through touch, taste, sight, sound and smell. The information is then sent to the brain, where perception comes into play, interpreting the received information and making us understand what is happening around us, allowing us to form a mental representation of the environment.
In general, when it comes to processing in the context of sensation and perception, two types of processing are commonly characterized: top-down processing and bottom-up processing. On one hand, top-down processing corresponds to voluntarily allocating attention to features, objects or spatial regions based on prior knowledge and current goals/tasks. Thus, prior knowledge and the task at hand are used to influence attention in a goal-driven manner. On the other hand, bottom-up processing refers to the involuntary mechanisms responsible for directing attention to salient regions based on differences between a region and its surround (e.g. contrast). In this case, the stimulus directly triggers our attention and it is thus a data-driven process.
Humans perceive data, but the question is: how is this done? Some theories, known as constructivist theories, assume that information needs to be processed at a higher level before we build our perception of the world. Other theories, called direct theories, hold that the environment provides us enough information to perceive our world. Finally, some authors such as Neisser [24] argue that visual perception depends on both bottom-up and top-down processing.
From Gregory’s perspective, perception involves top-down processing. His theory [25] holds that prior knowledge and past experiences related to stimuli help the agent to better guess or hypothesize. In this way, the agent is constantly building his perception of reality based on the environment and stored information. However, the brain may create some incorrect hypotheses, which result in visual illusions. For Gibson, in contrast, the information provided by the environment is enough to make sense of the world (theory of affordances [26]): perception is direct and involves bottom-up processing, which is reflexive, involuntary and independent of the agent’s past experiences. Thereby, Gibson argued that perception is a bottom-up process, since the visual information needed is available in the environment, excluding the need for prior knowledge.
At last, Neisser [24] advanced the cycle theory, which describes perception as a continuous process (see Figure 2.6) in which bottom-up and top-down processing work together. People use prior knowledge of the world (top-down) to build schemas and, with them, are able to predict which information may become available (bottom-up).
In conclusion, Gibson’s theory is more limited, explaining perception solely in terms of the environment, while Gregory suggests that what we see is not enough and that attention mechanisms take advantage of prior knowledge.
Figure 2.6: The perceptual model cycle.1
2.4 Artificial Neural Networks
Artificial Neural Networks (ANN) are computational models inspired by the central nervous system of animals, especially the brain, and try to mimic the way a biological brain solves problems. Modern networks are limited by computational power, working with a few thousand to a few million neural units and millions of connections, which is still far from the complexity of the human brain.

A neural network receives inputs, transforms them and generates an output. Its key strength is the ability to learn implicit mappings between inputs and outputs, making it a powerful tool for tasks such as pattern recognition.
1source: http://www.southampton.ac.uk/engineering/research/projects [seen in May, 2016]
A neural network is organized in layers that establish connections between neurons. It starts with an input layer, where each neuron is fully connected to all neurons of the next layer. Each connection between two neurons is assigned a weight that controls the signal transmission between them. The input units receive information from the outside world and communicate with one or more hidden layers, where the actual processing takes place. In classification networks, the hidden layers distort the input data in a non-linear way with the aim of obtaining linearly separable categories at the end [27]. The last hidden layer links to the output layer, where each item is assigned to its predicted class (see Figure 2.7). All neurons in the hidden layers apply an activation function, which can be a linear, threshold or sigmoid function.
Figure 2.7: Neural network basic structure.
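The forward computation described above can be sketched for a tiny 2-3-2 network (the layer sizes and all weight values are arbitrary illustration values):

```python
import numpy as np

# Forward pass through a 2-3-2 fully connected network with a sigmoid
# hidden layer (all numbers below are invented for illustration).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2])            # input layer (2 units)
W1 = np.array([[0.1, 0.4],           # weights input -> hidden (3x2)
               [-0.3, 0.2],
               [0.5, 0.6]])
b1 = np.zeros(3)
W2 = np.array([[0.2, -0.1, 0.3],     # weights hidden -> output (2x3)
               [0.4, 0.1, -0.2]])
b2 = np.zeros(2)

hidden = sigmoid(W1 @ x + b1)        # hidden-layer activations
scores = W2 @ hidden + b2            # output layer: one score per class
```

Each matrix row holds the weights of one neuron's incoming connections, so a whole layer is computed with a single matrix-vector product.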
There are two main learning paradigms for training a neural network based classifier:
• Supervised learning - It requires a large labeled data set with input samples associated to categories. The network produces an output in the form of a vector of scores, one score for each category. Then, an objective function is computed to measure the error, i.e. the difference between the output scores and the desired pattern of scores. With this knowledge, all internal weight parameters are adjusted with the goal of minimizing the error. To correctly perform these adjustments, the learning algorithm computes a gradient vector that, for each weight, indicates how the error would vary if the weight were increased by a tiny amount [27]. Finally, the weight vector is adjusted in the direction opposite to the gradient vector.
• Unsupervised learning - The network learns intrinsic relations about the data without having a
target or label. It exploits only the statistical distribution of the input data to associate samples to
groups of related elements.
In supervised learning, the data is typically divided into three sets: the training set, used to build the model by finding relationships between the data and the pre-classified targets (labeled data); the validation set, used to tune hyper-parameters such as the number of hidden units or the depth of the neural network; and finally, the test set, used to estimate the performance of the model on unseen data.
One of the most popular neural network training algorithms is backpropagation. It requires a known desired output for each input value in order to calculate a loss function representing the difference between the current and the desired output. The backpropagation training algorithm proceeds as follows. It starts by performing feed-forward computations, with the weights and biases randomly initialized, producing an output from which the loss function is calculated. Then, feedback from the output layer is used to adjust the weights and biases such that the error is incrementally minimized. This process is repeated through all hidden layers until the input layer is reached, minimizing the loss function for that set of input values. These incremental changes have to be small, since the weights affect all inputs from the training set.
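The loop described above can be sketched on a toy problem (a 2-4-1 network trained on XOR; the architecture, learning rate and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Toy backpropagation sketch: feed-forward, error propagation, and
# gradient-descent updates, repeated until the loss shrinks.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])            # XOR targets

W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)     # random initialization
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1.T + b1)                    # hidden activations
    return h, sigmoid(h @ W2.T + b2)              # network output

_, out = forward(X)
initial_loss = float(np.mean((out - y) ** 2))

for _ in range(2000):
    h, out = forward(X)                           # feed-forward pass
    d_out = (out - y) * out * (1 - out)           # error at the output layer
    d_h = (d_out @ W2) * h * (1 - h)              # error propagated backwards
    W2 -= 0.5 * d_out.T @ h                       # small steps opposite
    b2 -= 0.5 * d_out.sum(axis=0)                 # to the gradient
    W1 -= 0.5 * d_h.T @ X
    b1 -= 0.5 * d_h.sum(axis=0)

_, out = forward(X)
final_loss = float(np.mean((out - y) ** 2))
```

After training, `final_loss` is smaller than `initial_loss`, showing the incremental error minimization at work.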
Once the network has been trained, we can present a whole new set of inputs and see how it responds, attempting to categorize each new input into the right class. For example, if we present a set of input images of cats and dogs in the training phase, the network will learn the features corresponding to these classes. In the test phase, we can present an image of a dog and see whether the network correctly classifies this input as class ’dog’ or misclassifies it as class ’cat’.
2.5 Convolutional Neural Networks
There are several types of neural networks but, as far as visual attention is concerned, the most commonly used are Convolutional Neural Networks (CNN), which are feed-forward artificial neural networks that take the spatial structure of the input into account. They have the ability to learn discriminative features from raw input data and have been used in several visual tasks such as object recognition and classification.
This type of neural network is named convolutional because it performs the mathematical operation of convolution. In the case of CNNs that process images, the signal is discrete, so the convolution of two discrete signals is done by summing the product of the two signals, where one of them is flipped and shifted [28]. The mathematical formula for the convolution of discrete signals is defined in (2.1), where x is the input signal and h is the impulse response. This operation has several applications in signal processing, such as filtering signals (2D, in image processing) or finding patterns between them.
y[n] = x[n] ∗ h[n] = Σ_{k=−∞}^{+∞} x[k] h[n − k]. (2.1)
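Equation (2.1) can be implemented directly for finite-length signals; the sketch below computes the flip-and-shift sum explicitly, and the example signals are invented for illustration:

```python
import numpy as np

# Direct implementation of the discrete convolution sum in (2.1) for
# finite signals, checked against NumPy's built-in convolution.
def convolve(x, h):
    n_out = len(x) + len(h) - 1           # length of the full convolution
    y = [0.0] * n_out
    for n in range(n_out):
        for k in range(len(x)):
            if 0 <= n - k < len(h):       # h is flipped and shifted by n
                y[n] += x[k] * h[n - k]
    return y

x = [1.0, 2.0, 3.0]                       # example input signal
h = [0.0, 1.0, 0.5]                       # example impulse response
y = convolve(x, h)
```

The same result is produced by `np.convolve(x, h)`, which frameworks use (in 2D form) to filter images.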
A CNN is composed of multiple stacked layers that filter (convolve) the input stimuli to extract useful and meaningful information depending on the task at hand. These layers have parameters that are learned in a way that allows the filters to automatically adjust to extract useful information, so there is no need to manually select relevant features. The general architecture of a CNN is shown in Figure 2.8.
Figure 2.8: Convolutional Neural Network architecture. Figure adapted from [29].
Convolutional layer: Each neuron receives a sub-region of the previous layer as input, and these local inputs are multiplied by the weights. The resulting filters are applied throughout the input space with the purpose of looking for specific features. Their weights are shared and their output is a feature map.
To configure a convolutional layer, it is necessary to set some hyper-parameters [28] such as:
• Kernel size - size of the filters;
• Stride - number of pixels that the kernel window will slide (usually, 1 for convolution layers);
• Number of filters - number of patterns that the convolution layer will look for.
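Together with an optional padding P (a common additional hyper-parameter not listed above), these settings determine the spatial output size through the standard formula out = (W − K + 2P)/S + 1, sketched below with illustrative sizes:

```python
# Spatial output size of a convolution layer; the formula is standard,
# the example numbers are illustrative choices.
def conv_output_size(width, kernel, stride=1, padding=0):
    return (width - kernel + 2 * padding) // stride + 1

# A 224-pixel-wide input with a 3x3 kernel, stride 1 and no padding
# shrinks slightly; padding of 1 preserves the width.
out_no_pad = conv_output_size(224, 3)               # 222
out_padded = conv_output_size(224, 3, padding=1)    # 224
```

Padding with P = (K − 1)/2 for odd kernels keeps the feature map the same size as the input, which is why it is so common in practice.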
Pooling layer: Is generally placed in-between convolutional layers and its goal is to down-sample the input, reducing dimensionality and producing a single output from each local region. It also decreases the amount of computation in subsequent layers by reducing the number of parameters to learn, and provides basic translation invariance. A commonly used down-sampling function is max-pooling, which takes the maximum value within each sub-region (see Figure 2.9).
Figure 2.9: Representation of max-pooling operation.2
2source: https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks [seen in December, 2016]
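The max-pooling operation of Figure 2.9 can be sketched for a 2×2 window with stride 2 (a minimal illustration; real frameworks implement this with optimized kernels, and the input values below are invented):

```python
import numpy as np

# 2x2 max-pooling with stride 2 over a single-channel input: each
# non-overlapping 2x2 block is replaced by its maximum.
def max_pool_2x2(x):
    h, w = x.shape                       # assumes h and w are even
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 8],
              [0, 6, 3, 4]], dtype=float)
pooled = max_pool_2x2(x)                 # 2x2 map of local maxima
```

The 4×4 input is reduced to a 2×2 output, quartering the number of values passed to the next layer while keeping the strongest response of each region.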
Fully-connected layer: Is the final layer and computes the class scores, consistent with the training set labels. The input of the fully-connected layers is the set of all feature maps of the previous layer. Since its outputs are no longer spatially arranged, a convolutional layer cannot follow a fully-connected one.
In a CNN, the neurons are arranged in a 2D structure (width, height) in a way that allows spatial relations between neurons and the original data to be preserved. However, with colored images, especially RGB images, an additional dimension for the separate color channels is required. In this way, we have a 3D input (width, height and depth).
In CNNs, the number of input neurons in the first network layer is equal to the input size. In essence, if an image is presented as input, the number of neurons at the first layer will be the same as the number of pixels of the input image. Therefore, if an image were used as input to a fully-connected network, it would require an enormous number of connections between neurons and hence the training of this network would be unmanageable. CNNs deal with this computational complexity issue by connecting each neuron to a sub-region of the previous layer; the weights and biases are shared, allowing the network to look for the same feature in several regions.

In the second layer, each neuron is connected to a subset of neurons from the previous layer, called its receptive field. In this way, the receptive fields of neurons in a deeper layer combine the receptive fields of several neurons from the previous layer.
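The weight-sharing argument can be made concrete with a rough parameter count (the layer sizes below are assumptions chosen for illustration, not from a specific network):

```python
# Rough parameter-count comparison for a 224x224 RGB input.
inputs = 224 * 224 * 3                    # one input unit per pixel and channel

# Fully connected first layer with 1000 hidden units: every unit sees
# every input value.
fc_params = inputs * 1000 + 1000          # weights + biases

# Convolutional first layer with 64 shared 3x3x3 filters: each filter is
# reused at every spatial location.
conv_params = 64 * (3 * 3 * 3) + 64       # weights + biases

ratio = fc_params / conv_params           # tens of thousands of times fewer
```

The fully connected layer needs over 150 million parameters, the convolutional one fewer than two thousand, which is precisely the complexity gap weight sharing closes.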
2.6 Deep Neural Networks
Deep Neural Networks (DNN) are a subclass of artificial neural networks and are characterized by having several hidden layers between the input and output layers.
Before 2006, most neural networks typically used one hidden layer, two at the most, due to the high cost of computation and the scarce amount of available data. The deep learning breakthrough occurred exactly in that year, 2006, when Hinton [30], Bengio [31] and Ranzato [32], three researchers brought together by the Canadian Institute for Advanced Research (CIFAR), were able to train networks with many more layers for the handwriting recognition task.

They used unsupervised learning methods to create layers of feature detectors without the need for labeled data. Then, they pre-trained some layers with more complex feature detectors, providing enough information to initialize the weights with sensible values. This method allowed researchers to train networks 10 or 20 times faster [27].
In recent years, CNNs have become deeper and deeper, which has resulted in a performance boost. However, they are not becoming wider (number of parameters in each layer), since very wide and shallow networks exhibit very weak generalization performance despite being good at memorization. In contrast, deeper networks can learn features through several levels of abstraction and present much better generalization results, because they learn all the intermediate features between the raw data and the high-level classification. Note that using wider and deeper networks leads to an increase in the number of parameters that the network has to learn.
Following the tendency to work with deeper networks, and considering the overfitting problem that occurs when the model fits too closely to the data set, a recent technique called Dropout has been successfully applied. The Dropout technique consists of randomly dropping out neurons during the training phase [33], which forces the network to learn more robust features, since a neuron cannot rely on the presence of another particular neuron, improving the generalization of the neural network.
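A common implementation is the "inverted dropout" variant sketched below (an assumption about the usual implementation, not taken verbatim from [33]): surviving units are scaled by 1/p during training so that the test-time network needs no rescaling:

```python
import numpy as np

# Inverted-dropout sketch: keep each unit with probability p and scale
# survivors by 1/p, so expected activations match at train and test time.
def dropout(activations, p=0.5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(activations.shape) < p   # keep-mask drawn per unit
    return activations * mask / p

h = np.ones((4, 8))           # pretend hidden-layer activations
h_train = dropout(h, p=0.5)   # ~half the units zeroed, survivors scaled to 2.0
h_test = h                    # at test time the full network is used as-is
```

A fresh mask is drawn for every training example, so each forward pass effectively trains a different thinned sub-network.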
One attempt to speed up networks by decreasing the number of parameters consists of substituting large convolutions with a combination of smaller ones. Researchers replaced a large convolution, such as a 7×7 convolution, by a cascade of several small convolutions, such as three 3×3 convolutions with the same depth [28]. In-between each of these small convolution layers, a ReLU layer is placed to increase the number of non-linearities. Therefore, we end up with a similar network but with fewer weights, which results in fewer computations and a faster network. However, this type of substitution cannot be done on the first layer, because it would result in an enormous consumption of memory [28].
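The saving can be checked with a quick weight count (biases and the interleaved ReLU layers are ignored, and the channel count C is an arbitrary illustrative choice):

```python
# Weight counts for a convolution with C input and C output channels.
C = 64
large = 7 * 7 * C * C            # one 7x7 convolution: 49 * C^2 weights
cascade = 3 * (3 * 3 * C * C)    # three stacked 3x3 convolutions: 27 * C^2
savings = 1 - cascade / large    # roughly 45% fewer weights
```

The cascade also covers the same 7×7 receptive field, so the representational reach is preserved while the weight count drops from 49C² to 27C².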
Chapter 3
Related Work
In this chapter, some methods used in visual attention are presented. The computational models associated with visual attention can be divided into three classes: bottom-up, top-down and hybrid.

We divided the state of the art concerning visual attention into two categories: the classical methods, related to models designed by hand, and the modern methods, which we characterize as those that use neural networks. A detailed explanation of classical and modern methods is presented in Section 3.1 and in Section 3.2, respectively. A well-known collection of images called ImageNet is presented in Section 3.3, and several convolutional network models are described in Section 3.4.

The taxonomy of visual attention computational models is presented in Figure 3.1 and these models are explained in the following sections.
Figure 3.1: Taxonomy of visual attention models.
3.1 Classical Methods
Visual attention computational models attempt to mimic the behavioral aspects of the human visual system. Filter-based models have three branches, corresponding to bottom-up models, top-down models, and a combination of both that we call hybrid models.
The bottom-up model corresponds to the process that carries information from the environment towards cognition (the brain) and relies entirely on stimulus information. Bottom-up mechanisms are agnostic to the task/goal at hand and have the purpose of extracting relevant low-level features and finding the most salient regions, to which attention should be directed.
There are several studies on how to determine salient regions based on purely low-level visual features. The pioneering works of Itti [14] [34] consist of combining multi-scale image features (color, intensity and orientation) into a single saliency map. Then, the Winner-Take-All (WTA) principle is applied, selecting the most salient location, which, combined with the Inhibition of Return (IOR) mechanism [35], creates a sequence of attended locations in order of decreasing saliency.
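The attend-and-inhibit loop can be sketched as follows (the feature maps and the winner-suppression scheme are simplified illustrations, not Itti's actual implementation):

```python
import numpy as np

# Sketch of an Itti-style loop: combine feature maps into a saliency map,
# repeatedly pick the winner (WTA) and suppress it (IOR).
color = np.array([[0.9, 0.1], [0.0, 0.2]])
intensity = np.array([[0.1, 0.1], [0.0, 0.7]])
orientation = np.array([[0.0, 0.4], [0.1, 0.0]])

saliency = (color + intensity + orientation) / 3.0   # simple averaging

def scanpath(saliency, n_fixations):
    s = saliency.copy()
    fixations = []
    for _ in range(n_fixations):
        winner = np.unravel_index(np.argmax(s), s.shape)  # WTA selection
        fixations.append(winner)
        s[winner] = -np.inf                               # IOR: inhibit winner
    return fixations

path = scanpath(saliency, 3)   # locations visited in decreasing saliency
```

Suppressing each winner before the next selection is what turns a static map into a sequence of fixations of decreasing saliency.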
Osberger’s approach [36] starts by performing image segmentation and then assigns perceptual importance based on a number of different factors. Human visual attention is influenced by low-level (contrast, size, shape, color and motion) and high-level features (location, people and context). Osberger chose only 5 of these features for his algorithm and, for each region, assigns an importance score to each of them. Lastly, a combination of these features results in a map representing the important regions of an image.
Kadir et al. [37] identify salient regions based on entropy measures of image intensity, while Gao [38] defines a salient region by how different it is from the surrounding background (the center-surround mechanism [39]).
The top-down model takes into account the observer’s prior knowledge, expectations and current goals. The literature on visual attention suggests several sources of top-down influence [1] when the problem is to decide where to look: attention can be drawn to specific object features, as in search models, to reach the goal more easily, or the context or gist can be used to constrain locations.

If an image is presented to an observer for, say, ∼ 80 ms or less [1], he is able to report some essential characteristics of the scene. Eye movements can be conditioned by contextual cues, taking into account, for instance, that a computer mouse is often on top of a desk, near a keyboard and a computer. Then, using that information based on scene context, it is possible to constrain the search.
There are several models for gist, using different low-level features. The gist vector can be computed by applying Gabor filters to an image and extracting universal textons [40], or by averaging filter outputs and then applying PCA (Principal Component Analysis) [41]. Another approach was presented by Itti, who used center-surround features from orientation, color and intensity channels to model gist [39]. Gist representations provide rich information that helps constrain the search to objects relevant to the observer’s goals (top-down attention).
Whenever there is a search task, top-down processes tend to dominate guidance, and target-specific features are an essential source for drawing attention more effectively. Moreover, our attention is oriented to task-relevant features; in this way, attentional resources are not wasted, and time and computational effort are saved for processing the pertinent/relevant parts of the visual field. In these conditions, we know what we are looking for (the goal), so we know from a priori knowledge the distinguishing features that we should be searching for. Thereby, as posited by guided search theory [19] [22] (see Section 2.2.2), we can modulate the gains assigned to different features. If, for example, the task is to find a green object, the gain assigned to the green color will be higher.
Taking into account that building saliency maps is a computationally intensive process [14], Lukic and Billard [42] present an efficient method to allocate visual resources in the task of reaching and grasping, where the information provided by the motor system is taken into account. They compute projections from the workspace to the image plane by applying motor babbling in simulation. This allows obtaining a large number of training samples to train a feed-forward neural network in an incremental, online manner. To take into account the motor plans of the robot, the authors propose a Coupled Dynamical System (CDS) [43] [44] to mentally simulate a trajectory and avoid obstacles. Following this approach, the initial visual search space is confined to the peripersonal space. When the robot starts to move, the attention switches to the motor-relevant parts.
Most current visual attention approaches model bottom-up and top-down processes independently. However, there must be a trade-off between purely bottom-up models, which typically fail to detect inconspicuous objects of interest, and top-down systems, which confine the search according to expectations related to task priors, excluding everything else.
In recent years, combinations of bottom-up and top-down models, which we designate as hybrid models, have been presented. For instance, Frintrop’s model [45] is composed of two saliency maps: one corresponding to top-down influences and another related to bottom-up influences. The aggregated saliency map is computed as a linear combination of those maps using a fixed weight, which proved to be an inflexible approach. Due to the loss of bottom-up information, Rasolzadeh et al. [46] presented a more flexible model where the combination of top-down and bottom-up saliency maps is done dynamically, using entropy measures that indicate how the linear combination should change over time. The conspicuity maps were created following Itti’s approach in [14], apart from the extra parameters used to weight the saliency map. They used a neural network to learn the bias of the top-down saliency map based on information provided by the scene context and the current task.
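The combination schemes above can be sketched as a weighted sum of the two maps (the maps and the weight are invented for illustration; a fixed w mimics Frintrop's scheme, while letting w vary over time would correspond to Rasolzadeh et al.):

```python
import numpy as np

# Hybrid saliency as a linear combination of a bottom-up (stimulus-driven)
# map and a top-down (task-driven) map.
S_bu = np.array([[0.2, 0.9], [0.1, 0.3]])   # stimulus-driven saliency
S_td = np.array([[0.8, 0.0], [0.1, 0.2]])   # task-driven saliency

w = 0.7                                      # fixed weight; in a dynamic
S = w * S_td + (1 - w) * S_bu                # scheme w would change over time
focus = np.unravel_index(np.argmax(S), S.shape)
```

With the top-down weight dominating, the focus lands on the task-relevant location even though the bottom-up map alone would favor a different, more conspicuous spot.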
These hybrid models suggest that the human visual system can guide attention by applying optimal top-down weights to bottom-up saliency maps, allowing quicker target detection in a background full of distractors [46].
3.2 Modern Methods
As previously mentioned in Section 3.1, there are many approaches based on models designed by hand. Lately, new approaches using Convolutional Neural Networks have been presented. In this section, we give a brief overview of the recent work done on visual attention, with an emphasis on Convolutional Neural Networks.
Several approaches have been presented to further improve the discriminative ability of deep neural
networks. There are two ways of achieving that: 1) adding regularization to improve robustness and
avoid overfitting; or 2) making the network deeper [47].
In neural networks, the number of input and output units depends on the dimensionality of the data set. Thus, regularization can be performed by controlling the number of hidden units (a free parameter). Deep neural networks (see Section 2.6) are powerful machine learning systems, and overfitting can be very difficult to handle, since the models memorize the training data instead of learning to generalize. An overfitted model presents poor predictive performance. Dropout is a technique to address this problem and was proposed by Srivastava et al. [33]. The main idea is to randomly drop units and their connections from the network during training. At test time, a network with smaller weights is used, which approximates the effect of averaging the predictions of all the thinned networks.
Szegedy et al. [47] presented a deep CNN architecture inspired by the work of Lin et al. [48]. They added 1×1 convolutional layers, increasing the depth (number of levels) and width (number of units per level) of the network to remove the computational bottleneck. This approach has several drawbacks: bigger networks require more parameters, increasing the likelihood of overfitting, and more computational resources are needed. The solution was to introduce sparse layers inside the convolutions.
Recently, work has been done to incorporate feedback strategies into deep neural networks. For instance, Recurrent Neural Networks (RNN) are used to capture attention in dynamic environments and exhibit dynamic temporal behavior. The inputs are fed back into the network, providing a kind of memory. Other examples such as Long Short-Term Memory (LSTM) or End-To-End Memory networks are also used. In this project, we focus on Convolutional Neural Networks.
Generally, neural networks are just a tool and, depending on the approach, can be applied in a bottom-up, top-down or hybrid way. Until recently, models proposed to detect regions of interest employed hand-designed features [14] [49], which lack adaptiveness.
Lin et al. [50] proposed a way of detecting saliency using deep convolutional neural networks. They
use the k -means algorithm to learn low-level filters and then, convolve them with the image (input),
generating low-level features that carry texture and color information. Over these low-level features,
pooling techniques were applied to generate mid-level features. Then, local contrast at multiple levels
was calculated using hand-designed filters yielding several maps which are combined to produce a final
saliency map.
Xiao et al. [51] present a hybrid model to detect and locate object parts, taking advantage of deep convolutional networks applied to features extracted from bottom-up region proposals. Inspired by Girshick’s work [52], they applied regions with convolutional neural networks (R-CNN) to model object parts in addition to whole objects, and to locate them. In this case, the data set used was a set of 200 species of birds, containing more than 11 000 images.
The proposed part localization model consists of three phases: training object and part detectors from bottom-up region proposals using deep convolutional features (training phase); applying a score function to all detectors and applying geometric constraints to choose the best object and part detections (test phase); and finally, extracting features from the located parts and training a classifier to assign the parts to a category.

In the training phase, ground truth bounding box annotations were used for the whole object and its semantic parts. The features extracted from region proposals are used to train a support vector machine (SVM), where regions with ≥ 0.7 overlap with the ground truth region are labeled as positive; otherwise, they are labeled as negative. After performing some experiments, they conclude that there is no need to annotate boxes during the test phase to correctly classify the bird species.
Cao et al. [3] proposed a method called Look and Think Twice to detect and locate objects in a top-down manner. They use feedback Convolutional Neural Networks and perform two passes through the network. In the first feed-forward pass, the predicted class labels are obtained, which gives a notion of the set of most probable object classes present in the input image. Then, based on the top-ranked labels given by the network, they compute the saliency map of the image with respect to each one of the top-5 class labels. Next, a segmentation mask is applied to the saliency map for a given threshold: pixels whose saliency is greater than the threshold are retained and the others discarded, which leaves us with the pixels that contributed most to the class score. The resulting blob of points is then used to define a bounding box constituting the object location proposal.

In the second feed-forward pass, the original image is cropped by the bounding box and the region is re-classified, obtaining a new set of predicted class labels. At the end, these are ranked and the top-5 are selected as the final solution.
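The threshold-and-crop step of the first pass can be sketched as follows (the saliency map and the threshold value are invented for illustration, not from Cao et al.'s implementation):

```python
import numpy as np

# Sketch of turning a class saliency map into a location proposal:
# threshold the map and fit a box around the surviving pixels.
saliency = np.array([[0.1, 0.2, 0.1, 0.0],
                     [0.1, 0.8, 0.9, 0.1],
                     [0.0, 0.7, 0.6, 0.1],
                     [0.0, 0.1, 0.0, 0.0]])

threshold = 0.5
mask = saliency > threshold                   # segmentation mask
rows, cols = np.nonzero(mask)                 # coordinates of retained pixels
bbox = (rows.min(), cols.min(),               # (top, left,
        rows.max(), cols.max())               #  bottom, right)
```

The resulting box would then be used to crop the image for the second, re-classification pass.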
3.3 ImageNet Data Set
ImageNet is a publicly available large visual data set of over 15 million labeled images belonging to about 22 thousand categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) started in 2010 and uses a subset of ImageNet formed by roughly 1 000 images in each of 1 000 categories1.
The ILSVRC 2012 data set [53] was previously divided into training, validation and test images. The validation and test data consist of 50 000 and 100 000 hand-labeled photographs, respectively, but only the validation labels were released. The remaining images (test data) were released without labels and are used to evaluate the algorithms.

Since this data set was part of a competition, the participants had to submit their results on the available test images, and only at the end of the competition did they learn the results and the respective winner. These 150 000 images (validation and test) were not part of the training data, which is formed by 1.2 million images covering the 1 000 categories.
The challenge consisted of three tasks and the data set [53] was already divided and publicly avail-
able for each of them:
1. Classification - For each image, a list of the top 5 object categories is presented in descending
order of confidence;
2. Classification with localization - The algorithm produces top 5 class labels and the correspond-
ing bounding box indicating the position of each of them. This task assesses the ability to locate
one instance of an object category;
3. Fine-grained classification - For each one of the 100+ dog categories, predict if the dog images
on test data belong to a particular category. The output of the system should be the real-valued
confidence that the dog is of a particular category.
For tasks 1 and 2, the images were hand labeled with the presence of one of 1 000 object categories and each image contains only one ground truth label.
3.4 Pre-trained Models
Training a network from scratch using a large amount of color images is computationally expensive and time consuming. Thereby, some pre-trained Convolutional Network (ConvNet) models are available at the Caffe [54] Model Zoo.
1source: http://image-net.org/challenges/LSVRC/2012/browse-synsets [seen in November, 2016]
In this section, an explanation is given on the different architectures of several pre-trained models
and some preliminary results available on Model Zoo are shown2.
3.4.1 CaffeNet/AlexNet
Krizhevsky's work [55] presents a deep convolutional neural network, called the AlexNet model, constituted by five convolutional and three fully-connected layers. The convolutional layers are followed by a ReLU layer, then the neurons are normalized by a Local Response Normalization (LRN) layer and finally a down-sampling is performed by a max-pooling layer. The fully-connected layers are followed by a ReLU and a Dropout layer with a dropout ratio of 0.5.
Two techniques were used to combat overfitting: first, artificially increasing the data set by applying small transformations to the original images, such as translations, horizontal reflections or changes in the intensity of the color channels during training; and second, using the dropout technique (see Section 2.6).
Caffe [54] provides a reference CaffeNet3 model, a modification of AlexNet in which the order of the Pooling and Normalization (LRN) layers is switched. Everything else remains the same, including the parameters of all layers. This change gives CaffeNet a slight computational advantage, since the max-pooling operation is performed before the normalization, which uses less memory and fewer calculations. Yet, there is no significant performance difference between the two models.
A pre-trained version of both models is available and both were tested to check for performance differences (see Table 3.1). Both models were trained without the data augmentation used in [55] to prevent overfitting, and the AlexNet model was initialized with non-zero biases of 0.1 instead of 1.4
Results released in [55] show a top-1 classification error of 40.7% and a top-5 classification error of 18.2% for the AlexNet model, while the public replication of AlexNet presented top-1/top-5 classification errors of 42.9% / 19.8%. The results of CaffeNet differed by less than 0.5% from AlexNet but, since it requires less memory, CaffeNet was the model chosen to perform the tests.
2source: http://caffe.berkeleyvision.org/model_zoo.html [seen in November, 2016]
3source: https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet [seen in December, 2016]
4source: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet [seen in December, 2016]
3.4.2 GoogLeNet
GoogLeNet is a deep convolutional neural network with 22 weight layers proposed by Szegedy et al. [47]
for classification and detection tasks which improved the use of computational resources. It has nine
Inception modules that allow parallel pooling and convolution operations. For classification, it uses the
spatial average of the feature maps from the last convolution layer as the confidence of categories via a
global average pooling layer. The resulting vector is then used as input into the softmax layer.
The most direct way of improving the performance of deep networks is to increase their size, both in depth (more layers) and width (more units at each layer). Even with a bigger network, a constant computational budget was maintained by using additional 1×1 convolutions as a dimension reduction method [28] before the expensive 3×3 and 5×5 convolutions, and by replacing fully connected layers with sparse ones.
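The dimension-reduction role of the 1×1 convolutions can be sketched as a per-pixel matrix multiply over channels (a toy numpy illustration, not GoogLeNet's actual implementation; the feature-map and weight shapes below are illustrative):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.
    x: feature map of shape (C_in, H, W); w: weights of shape (C_out, C_in).
    Shrinking C_in this way makes the following 3x3/5x5 convolutions cheaper."""
    return np.einsum('oi,ihw->ohw', w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((192, 28, 28))   # illustrative Inception-module input
w = rng.standard_normal((64, 192))       # reduce 192 channels to 64
y = conv1x1(x, w)                        # y.shape == (64, 28, 28)
```

A subsequent 3×3 convolution now operates on 64 instead of 192 input channels, cutting its cost by a factor of three at this point of the network.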
A replication of the model in [47] was trained and the weights file is publicly available5. However, some training differences should be highlighted: the replication uses "xavier" instead of "gaussian" to initialize the weights; the learning rate decay policy is different, allowing faster training; and training was done without data augmentation. Xavier initialization sets the weights from a Gaussian distribution with zero mean and a variance equal to the inverse of the number of input neurons, ensuring faster convergence [56].
On the one hand, the original model [47] achieved a top-5 classification error of 10.07% on the validation data and a localization error of 38.02%; the top-1 classification error was not disclosed. On the other hand, the replication model obtained a top-1 error of 31.3% and a top-5 error of 11.1%; the localization error was not published. Since the replication model's weights file was the one used, the results obtained in this project were compared with those (see Table 3.1).
3.4.3 VGGNet
VGGNet is a deep convolutional network for object recognition developed and trained by Oxford's renowned Visual Geometry Group (VGG)6 [57].
This architecture was developed with the purpose of exploring the effect of ConvNet depth on accuracy. Different configurations were evaluated, ranging from a ConvNet with 11 weight layers to one with 19 weight layers.
For the localization task, the 16-weight-layer architecture was used, where the last fully connected layer predicts the bounding box location instead of the class scores.
5source: https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet [seen in December, 2016]
6source: https://github.com/BVLC/caffe/wiki/Model-Zoo#models-used-by-the-vgg-team-in-ilsvrc-2014 [seen in December, 2016]
In comparison with the state of the art at the time, an evident improvement was reached with a deeper network, with the optimal configuration at 16-19 weight layers. Since deeper networks usually mean more parameters and a greater chance of overfitting, Simonyan et al. used small 3×3 filters in all convolutional layers.
Besides this improvement, the generalization power of the model was demonstrated by achieving state-of-the-art results on other image recognition data sets such as PASCAL Visual Object Classes (2007 and 2012) [58].
The 16-weight-layer configuration achieved a top-1/top-5 classification error of 25.6% / 8.1% and a localization error of 26.9%. The 19-weight-layer configuration decreased the previous classification errors by only 0.1%, which were the best results achieved so far. In this project, the pre-trained VGGNet model with 16 weight layers was used.
Table 3.1 compiles the classification and localization errors disclosed by the current state of the art. A dash in a table field means that the corresponding result has not been published.
As explained in Section 3.4.1, the AlexNet pre-trained model is not used in our tests, since there is no significant performance difference between the AlexNet and CaffeNet pre-trained models and CaffeNet requires less memory.
Table 3.1: ConvNet performance following the state of the art.

Model                     | Weight layers | Top-1 [%] | Top-5 [%] | Localization Error [%]
CaffeNet [55]             | 8             | 42.6      | 19.6      | —
AlexNet [55]              | 8             | 42.9      | 19.8      | —
GoogLeNet [47]            | 22            | 31.3      | 11.1      | 38.02
GoogLeNet Feedback [3]    | —             | 30.5      | 10.5      | 38.80
VGGNet [59]               | 8             | 39.7      | 17.7      | 44.60
VGGNet [57] (16 layers)   | 16            | 25.6      | 8.1       | 26.90
VGGNet [57] (19 layers)   | 19            | 25.5      | 8.0       | —
Chapter 4
Hybrid Attention Model
Our model is inspired by the work of Cao et al. [3], which uses feedback Deep Convolutional Neural Networks to capture visual attention. We propose a biologically inspired hybrid attention model that is capable of efficiently locating and recognizing objects in digital images, in a multistage manner.
Briefly, our model goes as follows:
• Load an image into the network and capture the gist of the scene getting the predicted top-5 class
labels (feed-forward pass);
• For each of the top-5 class labels, compute the saliency map in a top-down manner (backward
pass) and apply a segmentation mask;
• Calculate the tightest bounding box that covers the blob of points resulting from the segmentation mask and consider it as an object location proposal;
• Re-classify the image with selective attention (feed-forward pass) and obtain a final solution.
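The four stages above can be sketched as a single loop. The callables used here (classify, saliency_map, bbox_from_saliency, foveate) are placeholders for the components detailed in the rest of this chapter, not the actual implementation:

```python
def hybrid_attention(image, classify, saliency_map, bbox_from_saliency, foveate,
                     threshold=0.75):
    """Sketch of the multistage attention pipeline: gist, top-down saliency,
    location proposal, and re-classification with foveal attention."""
    top5 = classify(image)[:5]                          # 1) gist (feed-forward)
    candidates = []
    for label, _score in top5:
        sal = saliency_map(image, label)                # 2) backward pass
        box = bbox_from_saliency(sal, threshold)        # 3) location proposal
        candidates += classify(foveate(image, box))[:5]  # 4) attend + re-classify
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:5]                               # final top-5
```

Each `classify` call returns (label, score) pairs; the 25 second-pass predictions are sorted by score and the five best are kept, as described in the list above.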
In this chapter, we introduce the saliency map concept and explain in detail, in Section 4.1, the method proposed by Cao [3] for computing the saliency map, in a top-down manner, for a given class. In the final stage of our model, the image re-classification with attention is done for two visual sensing configurations, a uniform and a non-uniform foveal vision, which are presented in Section 4.2. Section 4.3 presents a study on image information content for both visual sensing configurations, in order to establish a relationship between them.
4.1 Class Saliency Visualization
The need to locate objects quickly and efficiently gave rise to the method proposed by Itti [14], based on visual salience, which proposes the most likely candidates and eliminates those that are less likely.
The visual features that contribute to the attentional selection of a stimulus (color, motion, orientation) are combined in a saliency map that holds normalized information from the individual feature maps. To obtain a saliency map, the input visual information is analyzed by visual neurons sensitive to several visual features of the stimuli. This analysis is done in parallel across the whole visual field at multiple spatial and temporal scales, originating a series of feature maps where each map represents the amount of a certain visual feature at any place in the visual field. In each map, according to Koch and Ullman [15], local saliency is determined by how different a location is from nearby locations in terms of color, orientation, motion and depth. The most salient location is a good candidate for attentional selection. Finally, all highlighted locations from all feature maps are combined in a single saliency map that represents a pure relevance signal, independent of the visual features.
As opposed to Itti’s [14] method that computes the saliency map in a bottom-up manner, Cao [3]
proposed a way to calculate the saliency map, in a top-down manner, given an image I and a class
c. The class score Sc(I) is a non-linear function of the image, hence an approximation of the neural
network class score with the first-order Taylor expansion [3] [59] in the neighborhood of I can be done
as follows
$$S_c(I) \approx G_c^{\top} I + b \qquad (4.1)$$

where $b$ is the bias of the model and $G_c$ is the gradient of $S_c$ with respect to $I$:

$$G_c = \frac{\partial S_c}{\partial I}. \qquad (4.2)$$
Accordingly, the saliency map for a class c is computed by calculating the score derivative of that specific class via a back-propagation pass. This is done as follows: the network output is compared with a desired output, originating an error value. Since we want the saliency map for a specific class c, the desired output is a vector of zeros where the position corresponding to class c is set to one. In this way, each neuron of the output layer is assigned an error value that is propagated backward until it reaches the input layer, where each input element ends up with an error value that roughly represents its contribution to the output. These error values form the gradient Gc. Note that, unlike in training, this gradient is taken with respect to the input image and no weight update is performed.
In order to get the saliency value for each pixel (u, v), and since the images used are multi-channel (RGB, three color channels), we rearrange the elements of the vector Gc by taking its maximum magnitude over all color channels. This method for saliency map computation is extremely simple and fast, since only a back-propagation pass is necessary. Simonyan et al. [59] show that the magnitude of the gradient Gc expresses which pixels contribute most to the class score. Consequently, it is expected that these pixels give us the localization of the object pertaining to that class (see Section 5.2) in the image (see Figure 5.2).
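The channel-collapse step can be sketched as follows; the gradient array here is a random stand-in, since in the real pipeline it comes from the network's backward pass:

```python
import numpy as np

def class_saliency(grad_c):
    """Collapse the class-score gradient (H, W, 3) into a single saliency
    map (H, W) by taking the maximum magnitude over the color channels."""
    return np.max(np.abs(grad_c), axis=-1)

rng = np.random.default_rng(0)
G_c = rng.standard_normal((8, 8, 3))   # stand-in for dS_c/dI from backprop
saliency = class_saliency(G_c)         # shape (8, 8), non-negative
```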
4.2 Uniform vs Foveal Vision
In this work we will study and evaluate two types of organization of receptor fields: a conventional
uniform distribution, typical in artificial vision systems (e.g. in standard image sensors), against a log-
polar distribution, which approximates the human eye. The latter is composed by a region of high acuity
– the fovea – and the periphery, where central and low-resolution peripheral vision occurs, respectively.
4.2.1 Uniform Visual System
As many theories of visual processing propose, a natural scene is processed in a fraction of a second [60], where a first rough description (the gist) of the scene is computed. Typically, imaging sensors use uniform resolution.
In the first feed-forward pass, we mimic the human behaviour of capturing the gist of the scene, quickly and with limited resources. For this matter, there is no need to rely on high-resolution images, since this first glimpse takes only a split second and humans are capable of extracting rough information from it [60]. In this way, we compress the images to save resources, since in most cases they are scarce.
For the initial glimpse, we want to simulate the use of a low-resolution sensor, that is, one with a lower level of detail, which consequently requires fewer resources and entails a reduction of information. However, image details correspond to edges that typically are only perceptible with high-resolution imaging sensors. For this purpose, the high-frequency details are removed with low-pass filters.
When a low-pass filter is applied to a signal, its high-frequency components are completely removed. The simplest low-pass filter is the ideal low-pass filter (see Figure 4.1), which eliminates all frequencies higher than a given cut-off frequency (fc) and keeps the lower frequencies intact. Following this approach, we would lose the high-frequency features like the edges. However, there is a way to reduce the noise while better preserving the edges and other (high-frequency) details. For this purpose, we use a Gaussian filter
that does not abruptly remove high frequencies but softens them (see Figure 4.2). The Gaussian filter alters the input image by convolution with an isotropic 2D Gaussian function defined as

$$g(u, v, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{u^2+v^2}{2\sigma^2}} \qquad (4.3)$$
where u and v represent the image coordinates and σ the standard deviation of the Gaussian distri-
bution. The 2D Gaussian function is separable into u and v components thus we can perform first a
convolution with a 1D Gaussian in the u direction, and then convolve with another 1D Gaussian in the v
direction. In this study, we define σ0 as the level of uniform blur (see Figure 4.4).
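Exploiting this separability, the uniform blur can be sketched with two 1D convolutions (a minimal numpy version; a production system would use an optimized library routine):

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian sampled out to 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def uniform_blur(img, sigma):
    """Separable 2D Gaussian blur: convolve every row, then every column."""
    k = gaussian_kernel_1d(sigma)
    out = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, out, k, mode='same')
```

Two 1D passes cost O(r) per pixel instead of O(r²) for a direct 2D convolution with kernel radius r, which is why the separability of (4.3) matters in practice.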
1source: https://i.stack.imgur.com/nLwKi.png [seen in April, 2017]
Figure 4.1: Shape of the 1D ideal low-pass filter in the frequency domain.1
Figure 4.2: 2D representation of a Gaussian filter with σ = 60.
4.2.2 Foveal Visual System
The central region of the retina of the human eye named fovea is a photoreceptor layer predominantly
constituted by cones which provide localized high-resolution color vision. The concentration of these
photoreceptor cells reduce drastically towards the periphery (see Figure 2.1) causing a loss of defini-
tion. This space-variant resolution decay is a natural mechanism to decrease the amount of information
that is transmitted to the brain (see Figure 4.4). Many artificial foveation methods have been proposed in
the literature that attempt to mimic similar behavior: geometric method [61], filtering-based method [62]
and multi-resolution methods [63].
In this work, we rely on the method proposed in [64] for image compression (e.g. in encoding/decoding applications), which is extremely fast and easy to implement, with demonstrated applicability in real-time image processing and pattern recognition tasks, as in [65]. This approach comprises four steps, as follows. The first step consists of building a Gaussian pyramid. The first pyramid level (level 1) contains the original image g1, which is low-pass filtered and down-sampled by a factor of two, obtaining the image g2 at level 2. The image g3 can be obtained from g2 by applying the same operations, and so forth. The image gk+1 has a quarter of the resolution of image gk, where k ∈ {1, ..., K} denotes the index of a pyramid level and K is the total number of pyramid levels. This process is repeated as many times as the desired number of resolution levels for the pyramid.
In the next step, the Laplacian pyramid is built by computing the difference between the original image and the low-pass filtered image. The Laplacian pyramid comprises a set of error images, where each level represents the difference between two levels of the previous output (see Figure 4.3).
Next, Gaussian weighting kernels are applied to each level of the Laplacian pyramid to implement the foveation mechanism. The Gaussian kernels are defined as in (4.3); they are generated just once for each image and then displaced to a given point defining the focus of attention.
The next step consists of locating the foveation point which corresponds to the image location that
will be displayed at the highest resolution. In our case, the foveation point is given by the center of the
object location proposal obtained through the analysis of the segmentation mask applied to the saliency
map.
At last, the foveated image is obtained by the reverse process used when building the Laplacian pyramid. A more detailed explanation of the foveation system can be found in [64].
A summary of the human visual foveation model with four levels is presented in Figure 4.3. Starting with the original image, the levels g1 to g4 of the reduced pyramid are computed. Then, the difference between successive outputs from the previous step is obtained, resulting in the images L1 to L4 of the Laplacian pyramid. These images are multiplied by the kernels and an expand-and-sum procedure is performed. An example of a foveated image obtained by this method is presented in Figure 4.4, where f0 simulates the size of the fovea, the central region of the retina of the human eye.
Figure 4.3: A summary of the steps in the human visual foveation model with four levels. The image g1 corresponds to the original image and f1 to the foveated image. The thick up arrows represent sub-sampling and the thick down arrows represent up-sampling.
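A compact sketch of these four steps follows. For simplicity it uses a box filter in place of the Gaussian low-pass, nearest-neighbour up-sampling, and a grayscale image whose side is divisible by 2^(levels-1), so it illustrates the structure of the method in [64] rather than reproducing it exactly:

```python
import numpy as np

def box_down(img):
    """2x down-sampling after a 2x2 box average (simplified low-pass)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def nn_up(img, shape):
    """Nearest-neighbour expansion back to `shape`."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def foveate(img, center, f0, levels=4):
    img = img.astype(float)
    # 1) Gaussian pyramid (box filter stands in for the Gaussian here)
    g = [img]
    for _ in range(levels - 1):
        g.append(box_down(g[-1]))
    # 2) Laplacian pyramid: difference between successive levels
    lap = [g[k] - nn_up(g[k + 1], g[k].shape) for k in range(levels - 1)]
    lap.append(g[-1])
    # 3+4) weight each Laplacian level by a Gaussian centred on the foveation
    # point, then expand-and-sum from coarse to fine. A constant width f0 at
    # each level's own resolution corresponds to f_k = 2^k f0 in image
    # coordinates, matching Eq. (4.17).
    out = lap[-1]
    for k in range(levels - 2, -1, -1):
        out = nn_up(out, lap[k].shape)
        H, W = lap[k].shape
        v, u = np.mgrid[0:H, 0:W].astype(float)
        cy, cx = center[0] / 2 ** k, center[1] / 2 ** k
        weight = np.exp(-((u - cx) ** 2 + (v - cy) ** 2) / (2 * f0 ** 2))
        out = out + weight * lap[k]
    return out
```

Near the foveation point the weights are close to 1 and the pyramid reconstruction is exact, so full detail is kept; far from it the high-frequency Laplacian levels are suppressed, producing the peripheral blur visible in Figure 4.4.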
4.3 Information Attenuation
The different visual systems presented in Section 4.2 are based on different filtering strategies, which result in a reduction of information. To compare these systems, we have to understand how each system reduces the image information and what the relationship between them is.
a: σ0 = 0 b: σ0 = 5 c: σ0 = 10
d: f0 = 30 e: f0 = 60 f: f0 = 90
Figure 4.4: Different images acquired with two different visual sensing configurations are shown: a uniform and a log-polar distribution. On top, the image of a bee eater is evenly blurred for different levels of blur (σ0). At the bottom, the same image is foveated from the center of the object location proposals for different levels of blur. The parameter f0 defines the size of the region with high acuity.
4.3.1 Uniform Vision
The uniform visual system is computed via low-pass Gaussian filters. Let us define the original image as $i(u, v)$, with discrete-time Fourier transform $I(e^{j\omega_u}, e^{j\omega_v})$. By the convolution theorem, filtering in the spatial domain corresponds to a product in the frequency domain, so the transform of the filtered image is

$$O(e^{j\omega_u}, e^{j\omega_v}) = I(e^{j\omega_u}, e^{j\omega_v})\, G(e^{j\omega_u}, e^{j\omega_v}). \qquad (4.4)$$
Parseval's theorem expresses the unitarity of the Fourier transform, establishing that the sum of the squares of a signal equals the integral of the square of its transform. Therefore, the signal information of the original image $i$ is given by

$$E_i = \sum_{u=-\infty}^{+\infty} \sum_{v=-\infty}^{+\infty} |i(u, v)|^2 = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} |I(e^{j\omega_u}, e^{j\omega_v})|^2 \, d\omega_u\, d\omega_v, \qquad (4.5)$$
and the information in the filtered image $o$ is given by

$$E_o = \sum_{u=-\infty}^{+\infty} \sum_{v=-\infty}^{+\infty} |o(u, v)|^2 = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} |I(e^{j\omega_u}, e^{j\omega_v})\, G(e^{j\omega_u}, e^{j\omega_v})|^2 \, d\omega_u\, d\omega_v. \qquad (4.6)$$
Assuming that $I(e^{j\omega_u}, e^{j\omega_v})$ has energy/information equally distributed across all frequencies, with magnitude $M$:

$$M = I(e^{j\omega_u}, e^{j\omega_v}), \quad \forall\, \omega_u, \omega_v \in [-\pi, \pi], \qquad (4.7)$$

the information $E_o$ can be expressed as

$$E_o = \frac{M^2}{4\pi^2} \int_{-\pi}^{\pi} G(\omega_u)^2\, d\omega_u \int_{-\pi}^{\pi} G(\omega_v)^2\, d\omega_v. \qquad (4.8)$$
Furthermore, since we use $\sigma \geq 1$, the discrete-time Fourier transform is well approximated by the continuous-time Fourier transform. Thus, the Gaussian filter has low energy content for $|\omega_u|, |\omega_v| > \pi$, so

$$\int_{-\pi}^{\pi} e^{-\omega_u^2 \sigma^2}\, d\omega_u \approx \int_{-\infty}^{\infty} e^{-\omega_u^2 \sigma^2}\, d\omega_u. \qquad (4.9)$$

Knowing that

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2}\frac{t^2}{\sigma^2}}\, dt = \sqrt{2\pi}\,\sigma, \qquad (4.10)$$

we obtain

$$\int_{-\infty}^{\infty} e^{-\omega_u^2 \sigma^2}\, d\omega_u = \frac{\sqrt{\pi}}{\sigma}, \qquad (4.11)$$

where the same applies to $\omega_v$.
Thereby, we can now simplify the expression of $E_o$ in (4.8) as

$$E_o = \frac{M^2}{4\pi^2} \cdot \frac{\pi}{\sigma^2} = \frac{M^2}{4\pi\sigma^2}. \qquad (4.12)$$

Finally, the information gain $P$ is given by the ratio of the information of the filtered image to the information of the original image:

$$P(\sigma) = \frac{E_o}{E_i} = \frac{1}{4\pi\sigma^2}. \qquad (4.13)$$
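This closed form can be checked numerically: blurring 2D white noise (whose spectrum is approximately flat, matching the assumption in (4.7)) with a normalized Gaussian kernel should reduce its energy by roughly P(σ). A small sketch, with illustrative image and kernel sizes:

```python
import numpy as np

def info_gain(sigma):
    """Predicted energy ratio P(sigma) = 1 / (4*pi*sigma^2), Eq. (4.13)."""
    return 1.0 / (4.0 * np.pi * sigma ** 2)

rng = np.random.default_rng(1)
sigma = 2.0
noise = rng.standard_normal((512, 512))          # flat-spectrum test image

x = np.arange(-25, 26)
k = np.exp(-0.5 * (x / sigma) ** 2)
k /= k.sum()                                     # normalized Gaussian kernel
blur = np.apply_along_axis(np.convolve, 1, noise, k, mode='same')
blur = np.apply_along_axis(np.convolve, 0, blur, k, mode='same')

empirical = (blur ** 2).sum() / (noise ** 2).sum()
# empirical should be close to info_gain(2.0), up to border effects
```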
4.3.2 Non-Uniform Foveal Vision
For the non-uniform foveal vision, we implement the method explained in Section 4.2.2, where the blur is not evenly distributed in the spatial domain.
In the first step of our foveation system, we apply low-pass Gaussian filters, as in the uniform vision case (see Section 4.3.1), so (4.13) applies, and perform down-sampling at each level of the reduced pyramid.
The normalized information due to filtering at each level $k$ of the pyramid is given by

$$P^k(\sigma_k) = \frac{1}{4\pi\sigma_k^2}, \qquad (4.14)$$

where the parameter $\sigma_k$ is related to $\sigma_0$ by

$$\sigma_k = 2^k \sigma_0. \qquad (4.15)$$
The information due to spatial weighting at each pyramid level $k$ is given by

$$R^k(f_k) = \left( \frac{1}{N} \int_{-N/2}^{N/2} e^{-\frac{1}{2}\frac{u^2}{f_k^2}}\, du \right)^2, \qquad (4.16)$$

where $N$ is the size of the image; since the images are 2D, the squared term accounts for both the $u$ and $v$ dimensions. The foveation parameter $f_k$ at each level is related to the fovea dimension $f_0$ by

$$f_k = 2^k f_0. \qquad (4.17)$$
Thus, to compute the total information compression of the pyramid for the non-uniform foveal vision, we need to take into account the normalized information due to filtering and due to spatial weighting at each level of the pyramid. The total information reduction of the pyramid is given by

$$T = \sum_{k=0}^{K} R^k P^k. \qquad (4.18)$$
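Equations (4.14)-(4.18) can be evaluated in closed form, since the Gaussian integral in (4.16) is an error function. A sketch (the image size N and the parameter values used are illustrative):

```python
import math

def filtering_info(sigma_k):
    """P^k from Eq. (4.14)."""
    return 1.0 / (4.0 * math.pi * sigma_k ** 2)

def spatial_weight_info(f_k, N):
    """R^k from Eq. (4.16): the Gaussian integral over [-N/2, N/2]
    evaluates to f_k * sqrt(2*pi) * erf(N / (2*sqrt(2)*f_k))."""
    integral = f_k * math.sqrt(2.0 * math.pi) * math.erf(N / (2.0 * math.sqrt(2.0) * f_k))
    return (integral / N) ** 2

def total_information(sigma0, f0, N, K):
    """T from Eq. (4.18), with sigma_k = 2^k sigma0 (4.15)
    and f_k = 2^k f0 (4.17)."""
    return sum(spatial_weight_info(2 ** k * f0, N) * filtering_info(2 ** k * sigma0)
               for k in range(K + 1))
```

As expected, a very wide fovea drives every R^k toward 1, recovering the purely filtering-based attenuation of the uniform case.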
Chapter 5
Implementation
In this chapter, a detailed explanation of our model is given (see Figure 5.1). In the first feed-forward pass, a rough description (the gist) of the scene is computed (Section 5.1) and analyzed via backward propagation to obtain proposals regarding the location of the object in the scene (Section 5.2). For the second feed-forward pass, two approaches are compared, the human visual foveation model and the cartesian one (Section 5.3). For the former, an image re-classification is done by directing the attention to the center of the proposed location. For the latter, the attention is directed to the cropped patch of the image, and the remaining part of the image is discarded.
Figure 5.1: Schematization of the proposed multistage attentional pipeline. It begins by loading an input image into the neural network and getting the top-5 predicted class labels. For each class label, a backward pass is done, obtaining the saliency map. A segmentation mask is then applied based on a threshold, ending up with a proposed region for the location of the object. Then, the foveation system is applied from the center of each proposed bounding box for a given f0 (in this case, f0 = 60). Each foveated image is used as input to the neural network and a forward pass is done, resulting in a new top-5 of predicted class labels. The red rectangles represent the bounding boxes that contain all pixels above the specified threshold, in this case 0.75. The red circles represent the focused area simulating the fovea and the ground truth label of the input image is go-kart.
5.1 Image-Specific Class Saliency Extraction
After making the input data selection (see Section 3.3), the pre-trained models CaffeNet, GoogLeNet and VGGNet were loaded into the corresponding networks for the test phase. Each network receives raw input data which needs to be pre-processed: the mean over all images used in the training set is subtracted in each color channel and the channels are swapped from RGB to BGR. In our tests, we use the first 100 images from the ILSVRC 2012 data set.
The CaffeNet and GoogLeNet pre-trained models require a constant input dimension of 227×227 RGB images, while the VGGNet pre-trained model requires a constant input dimension of 224×224 RGB images. Therefore the ImageNet images, which come in several resolutions, were down-sampled to the fixed resolution required by the corresponding system. The ILSVRC 2012 validation set was used to perform the tests and evaluate our model.
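This pre-processing can be sketched as follows; the mean values and the channel-first output layout are illustrative placeholders, not the exact values used by each model:

```python
import numpy as np

def preprocess(img_rgb, mean_bgr):
    """Caffe-style pre-processing sketch: swap RGB -> BGR, subtract the
    per-channel training-set mean, and move channels first (HWC -> CHW).
    Resizing to the model's fixed input size is assumed done upstream."""
    img = img_rgb[..., ::-1].astype(np.float32)        # RGB -> BGR
    img -= np.asarray(mean_bgr, dtype=np.float32)      # subtract training mean
    return np.transpose(img, (2, 0, 1))                # HWC -> CHW

x = np.zeros((227, 227, 3), dtype=np.uint8)
x[..., 0] = 255                                        # a pure-red RGB image
blob = preprocess(x, mean_bgr=(104.0, 117.0, 123.0))   # illustrative means
```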
After the pre-processing, the network was loaded with images from the ILSVRC 2012 data set. We started by getting the network's output for the input image by performing a feed-forward pass, filling the layers with data. Accessing the network's output layer, of type softmax, the actual probability scores for each class label (1 000 in total) were collected.
Restricting our attention to the five highest predicted class labels, which are the most likely to be present in a given image, the saliency map for each one of those predicted classes was computed (see Figure 5.2). The method used to compute the saliency map in a top-down manner was the one described in Section 4.1, where only an image I and a class c are required. As mentioned and previously explained, a back-propagation pass was done to calculate the score derivative for the specific class c. The magnitude of this gradient tells us which pixels are more relevant for the class score [59].
5.2 Weakly Supervised Object Localization
Considering Simonyan's findings [59] mentioned in Section 5.1, the class saliency maps hold the object localization of the corresponding class in a given image. Surprisingly, and despite being trained on image labels only, the saliency maps can be used in localization tasks.
Our object localization method based on saliency maps goes as follows. Given an image I and the corresponding class saliency map Mc, a segmentation mask is computed by selecting the pixels with saliency higher than a certain threshold and setting the rest of the pixels to zero. Considering the blob of points resulting from the segmentation mask, for a given threshold, we are able to define a bounding box covering all the non-zero saliency pixels, obtaining a guess of the localization of the object (see Figure 5.2). To set the bounding box, we use the boundingRect function from the OpenCV library, which calculates the minimal up-right bounding rectangle for the specified point set. In our case, we pass a Mat array with all non-zero saliency pixels as input to the boundingRect function.
Figure 5.2: Representation of the saliency map and the corresponding bounding box for each of the top-5 predicted class labels of a bee eater image from the ILSVRC 2012 data set. The rectangles represent the bounding boxes that cover all non-zero saliency pixels resulting from a segmentation mask with a threshold of 0.75. The rectangles shown are the same: the black ones delimit the pixels with non-zero saliency and the red ones show the input image with the location proposal.
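The thresholding and bounding-box step can be sketched without OpenCV; the function below computes the same minimal up-right rectangle, in boundingRect's (x, y, w, h) convention, directly from the coordinates of the retained pixels:

```python
import numpy as np

def saliency_bbox(saliency, threshold):
    """Segmentation mask + tightest bounding box over the retained pixels.
    Returns (x, y, w, h), or None when no pixel passes the threshold."""
    ys, xs = np.nonzero(saliency >= threshold)
    if ys.size == 0:
        return None
    x, y = int(xs.min()), int(ys.min())
    return (x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1)
```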
5.3 Image Re-Classification with Attention
The objective of performing a second pass through the neural network is to re-classify the class labels obtained in the first pass, where the gist of the scene was captured.
Given the initial guess of the object localization through the bounding boxes, the image labels are re-classified. We tested two different ways to re-classify the image labels: first, inspired by Cao's work [3], we use cropped patches around the bounding boxes, resized to the input dimension of the corresponding pre-trained model (227×227 for CaffeNet and GoogLeNet and 224×224 for VGGNet); and second, we foveate the images from the center of the bounding boxes with a fixed fovea size.
Following the first approach for image re-classification, the image patch, which supposedly corresponds to the smallest region that contains the object, was cropped from the original input image to ensure a good resolution and resized to the input dimension of the pre-trained model. Those new regions are then loaded into the neural network and a new feed-forward pass is done, resulting in a re-classification of the regions. This re-classification strategy is named by Cao [3] the "Look and Think Twice" method.
For the second approach, there is no need to crop or resize the image. We use the bounding boxes obtained from the segmentation mask and apply the foveation method described in Section 4.2.2. Considering that the bounding box provided by our framework contains the object, we direct our attention to the center of the bounding box and foveate the image for a given parameter f0, which defines the region specialized for high-resolution vision. The foveated image is then used as input to the network for the second feed-forward pass, giving rise to an image re-classification.
The image re-classification method (for both approaches) is applied to each of the five bounding
boxes proposed from the first feed-forward pass where the highest five predicted class labels of each
bounding box are preserved (see Figure 5.1). Given the total 25 labels and the corresponding scores
(confidence given by the network), we sort by descending order and pick the top-5 labels as the final
solution. The sorted top-5 labels are then used to compute the classification error, corresponding to the
second time we look to the image.
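The final ranking step can be sketched as follows (a direct sort of the 25 (label, score) pairs, as described, without merging duplicate labels):

```python
def final_top5(per_box_predictions):
    """Flatten the five top-5 lists from the second pass (25 (label, score)
    pairs) and keep the five highest-scoring entries as the final solution."""
    merged = [p for box in per_box_predictions for p in box]
    merged.sort(key=lambda p: p[1], reverse=True)
    return merged[:5]
```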
Our framework was evaluated for three different topologies:
• Uniform Cartesian vision: in the first feed-forward pass, the input image has uniform resolution; then, in the second feed-forward pass, a cropped patch of the image is used as input;
• Non-uniform foveal vision: the input image is foveated from the center for several f0 and, in the second feed-forward pass, the high-resolution image is foveated from the center of the object location proposal;
• Combined vision: in the first feed-forward pass, the input image has uniform resolution with σ0 = 5 and, for the second feed-forward pass, a high-resolution input image is foveated for different f0 from the center of the bounding box and used as input.
Figure 5.1 summarizes our framework. We start by loading an image into the network and performing a feed-forward pass, producing a list of the actual probability scores for each class label considered in the ILSVRC 2012 data set. In this case, a go-kart input image was used and, as we can verify, the ground truth label go-kart is not present in the classification top-5.
For each of the top-5 predicted class labels, a saliency map and a segmentation mask were computed, resulting in a total of five proposed bounding boxes. Next, the foveation system described in Section 4.2.2 was used, foveating the input image from the center of each proposed bounding box for a given f0. Each of the five foveated images is then loaded into the network and a new feed-forward pass is performed, giving rise to five predicted class labels for each input image, for a total of 25 predicted class labels. In order to get a final classification from the network in this second pass, the predicted class labels are sorted in descending order and the five classes with the highest scores are taken as the final solution. For the example of Figure 5.1, we end up with a final solution that matches the ground truth label of the input image, that is, go-kart, with a confidence of 27%.
Figure 5.3 presents the output of the first convolutional layer for several input images. These results show
the filters learned by the network. In the first feed-forward pass, the input images have high resolution;
in the second feed-forward pass, a cropped patch of the original image is used as input.
Figure 5.3: Representation of the first convolutional layer output for 4 different input images on the VGGNet pre-trained model: for each image, the first row shows 5 output filters and the bounding boxes; the second row shows the same filters applied to the new input images, which were cropped by the bounding box. The ground truth labels are in orange and the bounding boxes in red.
Chapter 6
Results
In this chapter, the results of the tests performed in this project are presented. We begin by establishing
a numerical relationship between uniform and non-uniform visual systems in Section 6.1, in order to be
able to make a fair comparison between the two. Next, the classification and localization performance
obtained for the first and second feed-forward passes is evaluated in Section 6.2 and Section 6.3,
respectively. Finally, in Section 6.4 the performance of the first pass is directly compared with the
performance of the second pass for the different visual topologies. Table 6.1 shows the different
topologies considered in this work.
Topology    First Pass        Second Pass
Uniform     Uniform blur      Cropped patch
Foveal      Foveate center    Foveate bounding box
Combined    Uniform blur      Foveate bounding box
Table 6.1: Summary of the evaluated topologies.
6.1 Uniform vs Non-Uniform Foveal Vision
Through the study of information gain carried out in Section 4.3, we can represent the relationship between
σ0 and f0, i.e. between uniform and non-uniform vision, respectively (see Figure 6.1). With this analysis, it is possible
to define the intersection point, that is, the values of σ0 and f0 for which the information gain is the same
for both types of sensors. The tests were done for a pyramid with 5 levels.
Figure 6.1 was computed following the theory presented in Section 4.3, where expression (4.13) gives
the evolution of the information gain for uniform vision and expression (4.18) for foveal vision.
It is possible to verify that the information gain for uniform vision is linear, in logarithmic scale, with
respect to σ0. As the blur level σ0 increases, more information is compressed, which leads to a lower gain.
Figure 6.1: Information gain as a function of σ0 for uniform vision and of f0 for non-uniform foveal vision.
For non-uniform foveal vision, the information gain tends to increase as f0 grows, but not linearly.
This behavior makes sense since, as f0 increases, the size of the high-resolution region of the image
also increases. It is important to notice that for f0 = 100 the information gain is zero, that is, for f0
greater than 100, the processed image has the same information as the original one. The intersection
point between the two vision types is obtained at a gain of approximately −24 dB when
σ0 = f0 ≈ 5. (6.1)
6.2 First Feed-Forward Pass
The first stage of our hybrid model consists of loading an image into the neural network and performing a
feed-forward pass in order to get the predicted class labels, of which the top-5 are preserved.
Then, in a top-down manner, for each one of the top-5 class labels, a backward pass is done, resulting
in a saliency map for the respective class label. The saliency map provides meaningful information
for a given class since it results from a feedback visualization with respect to that particular class,
showing which pixels contribute most to the class score. For this reason, a feasible localization method
was derived from the saliency map. A segmentation mask was applied to the saliency map by selecting
the pixels whose saliency value was higher than a certain threshold. The remaining pixels were discarded
by setting them to zero. Finally, the tightest bounding box covering the blob of non-zero saliency values
is computed, resulting in an object location proposal.
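The mask-and-tightest-box step can be written compactly with NumPy; a sketch, assuming the saliency map is first normalized to [0, 1] so that the threshold acts as a fraction of the maximum saliency:

```python
import numpy as np

def saliency_to_bbox(saliency, threshold=0.7):
    """Threshold a saliency map and return the tightest bounding box.

    The map is normalized to [0, 1]; pixels at or below `threshold` are
    discarded, and the box (row_min, col_min, row_max, col_max) covering
    the surviving pixels is returned (None if nothing survives).
    """
    sal = saliency.astype(float)
    rng = sal.max() - sal.min()
    sal = (sal - sal.min()) / (rng + 1e-12)   # normalize to [0, 1]
    mask = sal > threshold
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return rows.min(), cols.min(), rows.max(), cols.max()
```

The returned corners define the object location proposal used in the second pass.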
To evaluate our model, we compute two types of measurements: the classification error and the
localization error. The classification error is calculated by comparing the ground truth class labels provided
by ILSVRC with the preserved predicted class labels. Two error rates are commonly reported:
top-1 and top-5. The former verifies whether the predicted class label with the highest score
matches the ground truth label provided for the same image; a mismatch counts as an error. For the
latter, we verify whether the ground truth label is in the set of the five highest-scoring predicted class labels.
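The two error rates can be computed directly from the sorted predictions; a minimal sketch (the data layout, dictionaries keyed by image id, is our own assumption):

```python
def topk_errors(predictions, ground_truth):
    """Compute top-1 and top-5 error rates as fractions in [0, 1].

    `predictions` maps each image id to its class labels sorted by
    descending score; `ground_truth` maps each image id to the true label.
    """
    top1 = top5 = 0
    for img_id, labels in predictions.items():
        truth = ground_truth[img_id]
        top1 += labels[0] != truth        # best guess must match exactly
        top5 += truth not in labels[:5]   # truth must be among five best
    n = len(predictions)
    return top1 / n, top5 / n
```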
The localization is considered correct if at least one of the five predicted bounding boxes for an im-
age overlaps more than 50% with the ground truth bounding box1; otherwise the bounding box is considered a
false positive [53]. The evaluation metric is the intersection over union (IoU) between the proposed
and the ground truth bounding boxes (see Figure 6.2); this criterion was established in the ILSVRC
2012 challenge.
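The intersection-over-union criterion can be implemented in a few lines; a sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def localized(proposals, gt_box, min_iou=0.5):
    """True if at least one proposed box overlaps the ground truth box
    by more than `min_iou` -- the ILSVRC 2012 criterion above."""
    return any(iou(p, gt_box) > min_iou for p in proposals)
```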
Figure 6.2: We select a spoonbill image from the ILSVRC 2012 data set to demonstrate our weakly supervised object localization method. The red rectangles represent the computed bounding boxes for the top-3 predicted class labels and the green one the ground truth bounding box of the spoonbill image.
The classification and localization errors were calculated for the three topologies considered
in this project: first, a non-uniform foveal vision where the images are foveated at the center for
different f0; second, a uniform vision characterized by evenly blurred images for various σ0; and
finally, a combined vision that, in the first pass, uses uniformly blurred images with a blur level of
σ0 = 5, which corresponds approximately to the intersection point obtained from Figure 6.1.
6.2.1 Classification Performance
For the first pass, with both uniform and foveal sensors, we obtained the results presented in Fig-
ure 6.3. With regard to classification, a global conclusion can be drawn: the CaffeNet pre-trained
model, which has the shallowest architecture, had the worst performance, obtaining the highest clas-
sification errors in all topologies. One possible justification is that the GoogLeNet and VGG
models use smaller convolutional filters and deeper networks, which can sharpen the distinction between
similar and nearby objects.

1 Source: http://image-net.org/challenges/LSVRC/2012/index#task [accessed December 2016]
For non-uniform foveal vision, a common tendency is visible in Figure 6.3a: in all three pre-trained
models, there is an f0 value beyond which the classification error saturates, approximately f0 = 70.
This result is corroborated by the evolution of the gain depicted in Figure 6.1, where from f0 = 70
onward the gain is approximately −2 dB. This means that, beyond this fovea size, the amount of infor-
mation that is added is not relevant for the correct classification of the object.
As expected for uniform (Cartesian) vision, as σ0 increases and the blur level applied to the image
rises, the amount of information present in the image decreases, resulting in an increase in the classifi-
cation error. From Figure 6.3c it can be seen that this increase is approximately linear.
Through the relation obtained in Section 6.1, we can compare the two types of vision, the uniform
and the non-uniform foveal one. For σ0 = 5, uniform vision presents a lower classification error,
on the order of 50%. In turn, non-uniform foveal vision with f0 = 5 shows an extremely high error. We
hypothesize that the foveated area for f0 = 5 corresponds to a very small region of high acuity.
The images that make up the ILSVRC data set contain objects that occupy most of the image
area; that is, although the image has a high-resolution region, it may be too small to give an idea of
the object in the image, which leads to poor performance in the classification task.
6.2.2 Localization Performance
The threshold parameter defines which pixels are selected to create the bounding boxes that
represent the object location proposals. On one hand, if we set a low threshold, we select all the
pixels in the saliency map whose intensity exceeds it, i.e., we base our localization on a large number
of pixels at the risk of including many outliers. On the other hand, the higher the threshold, the more
restrictive the selection of pixels used for localization. By visualizing the evolution of the localization
error as a function of the threshold, it is possible to verify that there is a trade-off between the chosen
threshold and the localization error obtained.
A result consistent across all topologies is the range of threshold values that yields the smallest
localization error. For thresholds smaller than 0.4, the localization error remains stable, with the VGG
model presenting a smaller error than the other models. Beyond this point, the error curve takes the
form of a valley, reaching the lowest localization error for thresholds of 0.65 and 0.7, depending on
the topology used.
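The valley-shaped trade-off can be reproduced by sweeping the threshold and measuring the localization error at each value; a self-contained sketch (pixel-index boxes and a single saliency map per image, both simplifying assumptions of ours):

```python
import numpy as np

def localization_error(saliency_maps, gt_boxes, threshold):
    """Fraction of images whose thresholded saliency box has IoU <= 0.5
    with the ground truth box (r1, c1, r2, c2), pixel-index corners."""
    def box_from(sal):
        sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
        rows, cols = np.nonzero(sal > threshold)
        if rows.size == 0:
            return None
        return rows.min(), cols.min(), rows.max(), cols.max()

    def iou(a, b):
        r1, c1 = max(a[0], b[0]), max(a[1], b[1])
        r2, c2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, r2 - r1 + 1) * max(0, c2 - c1 + 1)
        area = lambda x: (x[2] - x[0] + 1) * (x[3] - x[1] + 1)
        return inter / float(area(a) + area(b) - inter)

    errors = 0
    for sal, gt in zip(saliency_maps, gt_boxes):
        box = box_from(sal)
        errors += box is None or iou(box, gt) <= 0.5
    return errors / len(gt_boxes)

# Sweeping thresholds traces the valley-shaped error curve, e.g.:
# errs = [localization_error(maps, gts, t) for t in np.arange(0.1, 0.9, 0.05)]
```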
GoogLeNet, the deepest model considered in this work, presents better localization performance
than the other models in the range of thresholds located in the valley. Although the VGG model is
deeper than CaffeNet, the latter performs better in localization. Both models feature two fully-connected
layers of dimension 4096, which can destroy the spatial distinctiveness of image features. GoogLeNet
does not have these fully-connected layers; instead, it adopts global average pooling for classification,
which yields better results when it comes to localizing the object.
6.3 Top-Down Class Refinement
As previously explained, the objective of performing a second pass through the neural network is to
re-classify the class labels obtained in the first pass, where the gist of the scene was captured.
For the second pass, the topologies underwent minor changes. First, in the non-uniform foveal
vision, the foveation point ceases to be the center of the image and becomes the center of the bound-
ing box proposals. Second, in the uniform (Cartesian) vision, instead of using the whole image for the
re-classification, a cropped patch of the input image defined by the proposed bounding boxes is used,
resulting in a loss of context but improved acuity. Finally, in the second pass of the combined vision,
the original high-resolution image is used and the foveal visual system presented in Section 4.2.2
is applied, with the foveation point given by the center of the bounding boxes. This configuration
corresponds to the one used in the second feed-forward pass of the non-uniform foveal vision.
In the case of non-uniform vision, the foveation point is now given by the center of the proposed
bounding boxes. The classification performance in this second pass is cumulative, that is, it
depends on the parameters used in the first pass. Thus, the foveation point, which in this second
pass corresponds to the center of the bounding boxes, depends on the threshold used in the first pass
to generate the location proposals. For the three topologies considered, the threshold used in the
segmentation mask, which conditions the location proposals in the second pass, was th = 0.7. In the
second feed-forward pass, the evolution of the classification error is expected to follow the trend
observed in the first one, i.e. the classification error increases with σ0 for the uniform vision and
decreases with f0 for the non-uniform and combined visions, since we know that the data set images
have the objects centered on them.
For the uniform vision case, the input image used for re-classification is the cropped patch of the
original image defined by the location proposals. In this way, the surrounding context is discarded.
Surprisingly, the presence or absence of context seems to make little difference to the classification of
the object. One possible justification is that each image contains only one object that occupies almost
the full image (see Figure 6.2).
6.4 First vs Second Pass
For the non-uniform foveal vision, the difference between the first and the second pass is the selected
foveation point in the input image: in the first, the image is foveated at the center and in the second, the
foveation point is moved to the center of the proposed bounding boxes.
In Figure 6.3a and Figure 6.4a, it is possible to verify that there is practically no difference in clas-
sification error, that is, there is no significant difference between foveating at the center of the image
and foveating at the center of the object location proposal. One major limitation of this experiment is
the fact that the objects are large-scale and centered in the image. Therefore, we can conclude that
for this data set and topology, there is no advantage in making a second pass through the network.
For the uniform vision, in the first pass, the image is evenly blurred for a given σ0. In the second
pass, a cropped patch of the image defined by the proposed bounding boxes is used, resulting in a
loss of context. As expected, regardless of the pass, the higher the blur level σ0, the more information
is removed, making it harder for the network to correctly classify the image and resulting in an increase
of the classification error (see Figure 6.3c and Figure 6.4c).
The combined vision model is characterized by using images with a uniform blur of σ0 = 5 in the
first pass. In this case, the deeper the neural network used, the lower the classification error.
This tendency remains in the second pass, where the deeper networks predominantly obtain better
results. Again, because the objects are centered in the image, the larger the high-resolution region f0
applied at the center of the location proposals, the lower the classification error obtained.
[Figure 6.3 plots omitted. Panels: (a) classification error vs. f0, non-uniform foveal vision; (b) localization error vs. threshold, non-uniform foveal vision; (c) classification error vs. σ0, uniform Cartesian vision; (d) localization error vs. threshold, uniform Cartesian vision; (e) classification error vs. f0, combined vision; (f) localization error vs. threshold, combined vision. Legends: feed-forward (classification) and backward (localization) curves for CaffeNet, VGGNet and GoogLeNet; all errors in %.]
Figure 6.3: Classification and localization performance of the first pass for several topologies. Three different architectures are evaluated: CaffeNet (red lines with circles), VGGNet (green lines with stars) and GoogLeNet (blue lines with squares). This order and color arrangement are the same for all the subfigures. The left column corresponds to the classification error, where dashed lines correspond to the top-1 error and solid ones to the top-5 error. The localization error is in the right column. For Figure 6.3b, dashed lines correspond to a foveation with f0 = 80 and solid lines to f0 = 100. For Figure 6.3d, dashed lines correspond to a uniform blur with σ0 = 1 and solid lines to σ0 = 5. For Figure 6.3f, dashed lines correspond to a uniform blur with σ0 = 5. The classification error was based on the predicted class labels provided by the first feed-forward pass and the localization error was computed using the proposed bounding boxes resulting from the backward pass for various thresholds.
[Figure 6.4 plots omitted. Panels: (a) classification error vs. f0, non-uniform foveal vision; (b) localization error vs. threshold, non-uniform foveal vision; (c) classification error vs. σ0, uniform Cartesian vision; (d) localization error vs. threshold, uniform Cartesian vision; (e) classification error vs. f0, combined vision; (f) localization error vs. threshold, combined vision. Legends: feed-forward (classification) and backward (localization) curves for CaffeNet, VGGNet and GoogLeNet; all errors in %.]
Figure 6.4: Classification and localization performance of the second pass for several topologies. Three different architectures are evaluated: CaffeNet (red lines with circles), VGGNet (green lines with stars) and GoogLeNet (blue lines with squares). This order and color arrangement are the same for all the subfigures. The left column corresponds to the classification error, where dashed lines correspond to the top-1 error and solid ones to the top-5 error. The localization error is in the right column. For Figure 6.4b, dashed lines correspond to a foveation with f0 = 80 and solid lines to f0 = 100. For Figure 6.4d, dashed lines correspond to a uniform blur with σ0 = 1 and solid lines to σ0 = 5. For Figure 6.4f, dashed lines correspond to a non-uniform foveal blur with f0 = 80 and solid lines to f0 = 100. The classification error was based on the predicted class labels provided by the second feed-forward pass and the localization error was computed using the proposed bounding boxes resulting from the backward pass for various thresholds.
Chapter 7
Conclusions
In this thesis we propose a biologically inspired framework for object classification and localization that
incorporates bottom-up and top-down attentional mechanisms, combining recent deep convolutional
neural networks with foveal vision.
The main goal of this study was to evaluate the performance of several well-known CNN architectures
commonly used in recognition and localization tasks, namely CaffeNet, VGGNet and GoogLeNet.
Furthermore, we tested two different visual sensory structures: a uniform vision, where it is not
necessary to move the eyes towards the region of interest (covert attention), and a non-uniform foveal
vision, where attention is directed to the object location proposals by means of overt eye movements.
Our multistage framework begins by receiving evenly blurred images in the case of uniform vision, or
multi-resolution images in the case of foveal vision, which simulate human peripheral vision. The input
image is then classified by a CNN and, for each of the top-5 predicted class labels, a backward pass is
performed to obtain the corresponding saliency map. A segmentation mask with a given threshold is
applied to these maps, producing a blob of points that serves as a location proposal. Finally, the image
is re-classified with selective attention. For the non-uniform foveal visual sensor, attention is directed
to the proposed locations by means of overt attentional spotlight movements, whereas for the uniform
sensor the attentional spotlight is oriented in a covert manner to cropped patches of the original
image.
7.1 Achievements
From the analysis of our tests, we can conclude that deeper neural networks present better classification
performance. These deep nets have the ability to learn more features, which results in a better ability
to distinguish similar and nearby objects.
Comparing the classification performance of uniform vision sensors and non-uniform foveal vision
sensors, it is possible to verify that it is preferable to have an image with a lower, uniformly distributed
resolution than a multi-resolution image where the region of greater acuity is small.
On one hand, as one would expect, the higher the level of uniformly distributed blur applied to an
image, the greater the classification error, since it is more difficult to get the gist of the scene.
On the other hand, in the case of multi-resolution vision sensors, the larger the high-resolution
region, the greater the level of detail that the network can exploit, which leads to better performance
in the classification task.
In this way, the scene gist is best captured when the entire picture is displayed, even if it is blurred. When
using a foveated image, at a glimpse we direct our attention to the high-resolution region, which,
despite its great detail, may not suffice to give us an idea of what is really in the image.
The results we obtained for non-uniform foveal vision are promising. We conclude that it is
not necessary to store and transmit all the information present in a high-resolution image since, beyond
a given f0, the performance in the classification task remains constant regardless of the size of the
high-resolution region.
7.2 Future Work
One of the major limitations in the evaluation of non-uniform foveal vision is that it is constrained by the
chosen data set, which presents objects centered in the image. In the future, we intend to test this type
of vision on other recognition and localization data sets where objects are not centered, thus having
greater localization variety. The other very relevant limitation that conditioned the tests is the scale of
the images. Scaling is a problem for the foveal sensor, in particular for very close objects, because the
overall characteristics are lost as the resolution decays very rapidly towards the periphery.
It would also be interesting to train the system directly with blur (uniform and non-uniform foveal). In
this case, it would be expected that, with this tuning of the network, performance would improve for
both classification and localization tasks.
Bibliography
[1] A. Borji and L. Itti, “State-of-the-art in visual attention modelling,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
[2] F. Katsuki and C. Constantinidis, “Bottom-up and top-down attention: different processes and over-
lapping neural systems,” The Neuroscientist, vol. 20, no. 5, pp. 509–521, 2014.
[3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, L. Wang, C. Huang, T. S. Huang, W. Xu, D. Ra-
manan, and Y. Huang, “Look and Think Twice : Capturing Top-Down Visual Attention with Feed-
back,” IEEE International Conference on Computer Vision, 2015.
[4] F. B. Colavita, “Human Sensory Dominance,” Perception & Psychophysics, vol. 16, no. 2, pp. 409–
412, 1974.
[5] B. Wandell, Foundations of Vision. Sinauer Associates, 1995.
[6] R. Parasuraman and S. Yantis, The attentive brain. Mit Press Cambridge, MA, 1998.
[7] P. Quinlan and B. Dyson, “Attention: general introduction, basic models and data,” Cognitive Psy-
chology, pp. 271–311, 2008.
[8] L. M. Ward, “Attention,” Scholarpedia, vol. 3, no. 10:1538, 2008.
[9] M. Carrasco, “Visual attention: The past 25 years,” Vision Research, vol. 51, no. 13, pp. 1484–1525,
2011.
[10] A. Mack and I. Rock, “Inattentional Blindness,” Cambridge MA MIT Press Malik J Perona P, vol. 7,
no. 1998, p. 287, 1998.
[11] W. James, “The principles of psychology (Vols. 1 & 2),” New York Holt, vol. 118, p. 688, 1890.
[12] E. Sokolov and O. Vinogradova, Neuronal mechanisms of the orienting reflex. L. Erlbaum Asso-
ciates, 1975.
[13] M. I. Posner, “Orienting of attention,” Quarterly journal of experimental psychology, vol. 32, no. 1,
pp. 3–25, 1980.
[14] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259,
1998.
[15] C. Koch and S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry.
Springer Netherlands, 1985.
[16] G. F. Woodman and S. J. Luck, “Electrophysiological measurement of rapid shifts of attention during
visual search.,” Nature, vol. 400, no. 6747, pp. 867–869, 1999.
[17] C. G. Healey and J. T. Enns, “Attention and Visual Perception in Visualization and Computer Graph-
ics,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 7, pp. 1–20, 2011.
[18] A. M. Treisman, “A Feature-Integration Theory of Attention,” Cognitive Psychology, vol. 12, pp. 97–
136, 1980.
[19] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an alternative to the feature integration
model for visual search,” Journal of Experimental Psychology: Human Perception and Performance,
vol. 15, no. 3, pp. 419–433, 1989.
[20] L. Huang and H. Pashler, “A Boolean map theory of visual attention.,” Psychological review,
vol. 114, no. 3, pp. 599–631, 2007.
[21] A. Treisman, “Preattentive processing in vision,” Computer Vision, Graphics, and Image Processing,
vol. 31, pp. 156–177, aug 1985.
[22] J. M. Wolfe, “Guided Search 2.0: A revised model of visual search,” Psychonomic Bulletin & Review,
vol. 1, no. 2, pp. 202–238, 1994.
[23] J. M. Wolfe, S. R. Friedman-Hill, M. I. Stewart, and K. M. O’Connell, “The role of categorization in
visual search for orientation.,” Journal of experimental psychology. Human perception and perfor-
mance, vol. 18, no. 1, pp. 34–49, 1992.
[24] U. Neisser, Cognition and reality: principles and implications of cognitive psychology. 1976.
[25] R. L. Gregory, Perceptions as Hypotheses. Philosophical Transactions of the Royal Society of
London. Series B, Biological sciences, vol.290, No. 1038, 1980.
[26] J. J. Gibson, “The Theory of Affordances,” in Perceiving, Acting, and Knowing, pp. 127–142 (332),
Hoboken, NJ: John Wiley & Sons Inc., 1977.
[27] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[28] A. Message, A. Farke, A. Farke, A. Farke, V. M. Arbour, M. E. Burns, R. M. Sullivan, S. G. Lu-
cas, A. K. Cantrell, T. L. Suazo, J.-r. Boisserie, A. Souron, H. T. Mackaye, A. Likius, P. Vignaud,
M. Brunet, M. Tallman, N. Amenta, E. Delson, S. R. Frost, D. Ghosh, and Z. S. Klukkert, “Artificial
Intelligence,” no. August, 2014.
[29] J.-T. Huang, J. Li, and Y. Gong, “An analysis of convolutional neural networks for speech recog-
nition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 4989–4993, 2015.
[30] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural
Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[31] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy Layer-Wise Training of Deep Net-
works,” Advances in Neural Information Processing Systems, vol. 19, no. 1, p. 153, 2007.
[32] M. aurelio Ranzato, C. Poultney, S. Chopra, Y. L. Cun, M. Ranzato, C. Poultney, S. Chopra, and
Y. L. Cun, “Efficient Learning of Sparse Representations with an Energy-Based Model,” Advances
in Neural Information Processing Systems, vol. 19, no. 1, pp. 1137–1144, 2007.
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout : A Simple
Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research (JMLR),
vol. 15, pp. 1929–1958, 2014.
[34] L. Itti and C. Koch, “A saliency-based search mechanism for overt and covert shifts of visual atten-
tion,” Vision Research, vol. 40, no. 10-12, pp. 1489–1506, 2000.
[35] S. P. Tipper, J. Driver, and B. Weaver, “Object-centred inhibition of return of visual attention.,” The
Quarterly journal of experimental psychology. A, Human experimental psychology, vol. 43, no. 2,
pp. 289–298, 1991.
[36] W. Osberger and A. Maeder, “Automatic identification of perceptually important regions in an
image,” Proceeding of the Fourteenth International Conference on Pattern Recognition, vol. 1,
pp. 701–704, 1998.
[37] T. Kadir and J. M. Brady, “Scale, Saliency and Image Description,” International Journal of Computer
Vision, vol. 45, no. 2, pp. 83–105, 2001.
[38] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” Proceedings of the
IEEE International Conference on Computer Vision, 2007.
[39] C. Siagian and L. Itti, “Rapid biologically-inspired scene classification using features shared with
visual attention,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2,
pp. 300–312, 2007.
[40] L. W. Renninger and J. Malik, “When is scene identification just texture recognition?,” Vision Re-
search, vol. 44, no. 19, pp. 2301–2311, 2004.
[41] A. Torralba, “Modeling global scene factors in attention,” Journal of the Optical Society of America
A, vol. 20, no. 7, p. 1407, 2003.
[42] L. Lukic and A. Billard, “Motor-primed Visual Attention for Humanoid Robots,” IEEE Transactions
on Autonomous Mental Development, pp. 1–16, 2015.
[43] L. Lukic and A. Billard, “Learning Coupled Dynamical Systems from Human Demonstration for
Robotic Eye-Hand Coordination,” IEEE, pp. 552–559, 2012.
[44] L. Lukic, J. Santos-Victor, and A. Billard, “Learning robotic eye-arm-hand coordination from human
demonstration: A coupled dynamical systems approach,” Biological Cybernetics, vol. 108, no. 2,
pp. 223–248, 2014.
[45] S. Frintrop, VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search.
Springer-Verlag Berlin Heidelberg, 2006.
[46] B. Rasolzadeh, A. T. Targhi, and J.-O. Eklundh, “An attentional system combining top-down and
bottom-up influences,” Attention in Cognitive Systems. Theories and Systems from an Interdisci-
plinary Viewpoint Lecture Notes in Computer Science, vol. 4840, pp. 123–140, 2007.
[47] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, “Going deeper with convolutions,” Computer Vision
Foundation, 2014.
[48] M. Lin, Q. Chen, and S. Yan, “Network In Network,” Computer Vision Foundation, 2013.
[49] Y. Lin, “A Computational Model for Saliency Maps by Using Local Entropy,” Artificial Intelligence,
pp. 967–973, 2001.
[50] Y. Lin, S. Kong, D. Wang, and Y. Zhuang, “Saliency Detection within a Deep Convolutional Archi-
tecture,” Cognitive Computing for Augmented Human Intelligence: From the AAAI-14 Workshop,
pp. 31–37, 2014.
[51] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based R-CNNs for Fine-Grained Category
Detection,” European Conference on Computer Vision (ECCV), 2014.
[52] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object de-
tection and semantic segmentation,” Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pp. 580–587, 2014.
[53] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[54] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Dar-
rell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093,
2014.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional
Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.
[56] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,”
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS),
vol. 9, pp. 249–256, 2010.
[57] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recog-
nition,” International Conference on Learning Representations, pp. 1–14, 2015.
[58] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The
Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vi-
sion, vol. 111, no. 1, pp. 98–136, 2014.
[59] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps,” International Conference on Learning Representations (Workshop), 2014.
[60] I. I. A. Groen, S. Ghebreab, H. Prins, V. A. F. Lamme, and H. S. Scholte, “From image statistics to
scene gist: evoked neural activity reveals transition from low-level natural image structure to scene
category.,” Journal of Neuroscience, vol. 33, no. 48, pp. 18814–18824, 2013.
[61] R. S. Wallace, P.-W. Ong, B. B. Bederson, and E. L. Schwartz, “Space variant image processing,”
International Journal of Computer Vision, vol. 13, no. 1, pp. 71–90, 1994.
[62] Z. Wang, “Rate scalable foveated image and video communications,” Ph.D. thesis, 2003.
[63] W. S. Geisler and J. S. Perry, “Real-time foveated multiresolution system for low-bandwidth video
communication,” in Photonics West’98 Electronic Imaging, pp. 294–305, International Society for
Optics and Photonics, 1998.
[64] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Transactions
on Communications, vol. 31, no. 4, pp. 532–540, 1983.
[65] M. J. Bastos, “Modeling human gaze patterns to improve visual search in autonomous systems,”
Master’s thesis, Instituto Superior Tecnico, 2016.