Deep Networks for Human Visual Attention:
A hybrid model using foveal vision
Ana Filipa Vieira de Jesus Almeida
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisor(s): Professor Alexandre José Malheiro Bernardino
Professor José Alberto Rosado dos Santos-Victor
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Alexandre José Malheiro Bernardino
Member of the Committee: Professor Pedro Daniel dos Santos Miraldo
April 2017
Acknowledgments
First, I would like to thank my thesis advisor, Professor Alexandre Bernardino, for the opportunity to
develop this research project in the Computer and Robot Vision Laboratory of the Instituto Superior
Técnico. The door to Prof. Bernardino's office was always open, and he was a splendid advisor who
supported me throughout this past year.
Next, I would like to give special thanks to my lab mate Rui Figueiredo for all his support and for
encouraging me to do more and better. And, of course, to the other lab mates who contributed directly
or indirectly to this thesis, in particular Atabak Dehban.
Last but not least, I would like to thank my loved ones, my parents, my twin sister and my brother,
who have supported me throughout this entire process. Finally, I extend my thanks to the friends who
accompanied me through these demanding academic years.
Resumo
Visual attention plays a fundamental role in natural and artificial systems in controlling perceptual
resources. Classic artificial visual attention systems use salient image features obtained from
filter-based information.
Recently, deep neural networks have been developed for the recognition of thousands of objects,
autonomously generating visual features optimized by training on large data sets. Besides being used
for object recognition, these features have been very successful in other visual problems such as
object segmentation, tracking and, recently, visual attention.
This work proposes a biologically plausible object classification and localization framework that
incorporates bottom-up and top-down attention mechanisms, combining convolutional neural networks
with foveal vision. First, a feed-forward pass is performed to obtain the network's predictions for
the class labels. Next, for each of the top-5 predicted classes, an object location proposal is
obtained. This proposal results from applying a segmentation mask over the saliency map, which is
first computed through a backward pass. Finally, a second feed-forward pass re-classifies the image,
this time with attention. In this last stage, two visual sensing configurations are compared: a
uniform (Cartesian) one and a non-uniform (foveated) one. In the first, the image is cropped
according to the object location proposal and attention is directed to the new image, discarding the
context. In the second, our model of human visual foveation is applied, and the image is foveated
from the center of the location proposed for a given object. In this way, attention is directed to
the object, which is classified at different resolution levels.
The main contribution of our work lies in the evaluation of the use of images with uniform and
foveated resolution. We were able to establish the relationship between these different methods and
to evaluate the information preserved by each type of sensor as a function of its parameters.
The results demonstrate that it is not necessary to store and/or transmit all the information
present in a high-resolution image since, beyond a given amount of information, the performance
obtained in the classification task saturates.
Keywords: Visual attention, object classification and localization, deep neural networks,
computer vision, space-variant vision.
Abstract
Visual attention plays a central role in natural and artificial systems to control perceptual resources.
Classic artificial visual attention systems use salient features of the image obtained from the
information given by predefined filters.
Recently, deep neural networks have been developed that recognize thousands of objects and
autonomously generate visual features optimized by training with large data sets. Besides being
used for object recognition, these features have been very successful in other visual problems such as
object segmentation, tracking and, recently, visual attention.
In this work, we propose a biologically inspired object classification and localization framework that
incorporates bottom-up and top-down attentional mechanisms, combining Deep Convolutional Neural
Networks with foveal vision. First, a feed-forward pass is performed to obtain the predicted class
labels. Next, we obtain object location proposals by applying a segmentation mask to the saliency map
computed through a backward pass. Finally, the image is re-classified with attention in a second
feed-forward pass. In this final stage, two visual sensing configurations are compared: a uniform
(Cartesian) configuration, which re-classifies a cropped patch of the image and discards the
surrounding context, and a non-uniform tessellation, which transforms the image by applying a human
visual foveation model centered at the object location proposal.
The main contribution of our work lies in the evaluation of the performance obtained with uniform and
non-uniform resolutions. We were able to establish the relationship between performance and the
different levels of information preserved by each sensing configuration. The results demonstrate
that we do not need to store and transmit all the information present in high-resolution images since,
beyond a certain amount of preserved information, the performance in the classification task saturates.
Keywords: Computer vision, deep neural networks, object classification and localization, space-
variant vision, visual attention.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 5
2.1 Visual Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Preattention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Feature Integration Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Guided Search Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Boolean Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Mechanisms for Information Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Related Work 19
3.1 Classical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Modern Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 ImageNet Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Pre-trained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.1 CaffeNet/AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.2 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4.3 VGGNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Hybrid Attention Model 29
4.1 Class Saliency Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Uniform vs Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Uniform Visual System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Foveal Visual System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Information Attenuation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Uniform Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Non-Uniform Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Implementation 37
5.1 Image-Specific Class Saliency Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Weakly Supervised Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Image Re-Classification with Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6 Results 43
6.1 Uniform vs Non-Uniform Foveal Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 First Feed-Forward Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.2.1 Classification Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.2 Localization Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Top-Down Class Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.4 First vs Second Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusions 51
7.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliography 57
List of Tables
3.1 ConvNet performance following the state of the art. . . . . . . . . . . . . . . . . . . . . . . 27
6.1 Summary of the evaluated topologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
List of Figures
2.1 Photo-receptors density in the retina. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Diagram of the macula of the retina. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Treisman’s feature integration model of early vision. . . . . . . . . . . . . . . . . . . . . . 9
2.4 Guided search for steep green targets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Boolean maps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 The perceptual model cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Neural network basic structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Convolutional Neural Network architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Representation of max-pooling operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Taxonomy of visual attention models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1 Ideal low-pass filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Gaussian filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 A summary of the human visual foveation model with four levels. . . . . . . . . . . . . . . 33
4.4 Representation of images acquired with two different visual sensing configurations: a
uniform and a log-polar distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Multistage attentional pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Representation of saliency maps and location proposals. . . . . . . . . . . . . . . . . . . 39
5.3 Representation of the convolutional layer output for 4 different input images on VGGNet
pre-trained model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1 Information gain in function of σ for uniform and non-uniform vision. . . . . . . . . . . . . 44
6.2 Demonstration of our weakly supervised object localization method . . . . . . . . . . . . . 45
6.3 Classification and localization performance of the first pass for several topologies. . . . . 49
6.4 Classification and localization performance of the second pass for several topologies. . . 50
Nomenclature
Acronyms
ANN Artificial Neural Networks
CNN Convolutional Neural Network
ConvNet Convolutional Network
FBA Feature-based Attention
ILSVRC ImageNet Large Scale Visual Recognition Challenge
IOR Inhibition of Return
LRN Local Response Normalization
OBA Object-based Attention
ReLU Rectified Linear Unit
WTA Winner-Take-All
Parameters
σ0 Amount of uniform blur
f0 Size of the region with high acuity
c Class label
I Image
th Threshold
Chapter 1
Introduction
1.1 Motivation
The computational resources of the human brain are limited, so it is not possible to process all the
sensory information provided by the visual perceptual modality. For this reason, it is essential to focus
these resources such that only the relevant stimuli are processed and interpreted. Selective visual
attention mechanisms are fundamental in biological systems, being responsible for prioritizing the
elements of the visual scene to be attended.
Likewise, an important issue in many computer vision applications requiring real-time visual
processing is the computational effort involved [1]. Therefore, over the past decades, many biologically
inspired, attention-based methods have been proposed with the goal of building efficient systems
capable of working in real time. Attention modeling remains a topic of active research, studying
different ways to selectively process information in order to reduce the time and computational
complexity of existing methods.
Humans use attention mechanisms based on goal-oriented (top-down) and stimulus-driven (bottom-
up) information to define the region in the visual input where the attentional focus should be oriented [2].
In this way, the amount of processing is limited to a certain region of the visual field and the regions to
explore (salient) are prioritized in time. Similar mechanisms can also be applied to artificial systems that
share similar resource limitations. Much effort has been made towards understanding and applying the
human attention mechanisms in robotic systems.
Nowadays, modeling attention is still challenging due to the huge amount of information available
at any time and the laborious, time-consuming task of creating models by hand, trying to tune where
(regions) and what (objects) the observer should look at. For this purpose, biologically inspired
neural networks have been extensively used, since they can implicitly learn those mechanisms,
circumventing the need to create models by hand.
1.2 Goals
Our work is inspired by [3], which proposed to capture visual attention through feedback Deep
Convolutional Neural Networks. Similar in spirit, we propose a biologically inspired hybrid attention
model that combines bottom-up and top-down mechanisms and is capable of efficiently locating and
recognizing objects in digital images, using human-like vision.
More specifically, our method consists of three steps: first, we perform a feed-forward pass to
obtain the predicted class labels. Second, a backward pass is made to create a saliency map, which is
used to obtain object location proposals after applying a segmentation mask. Finally, a second
feed-forward pass is executed to re-classify the image with selective attention. With a non-uniform
foveal visual sensor, attention is directed to the proposed locations using a foveal spotlight model,
whereas for the uniform sensor, the attentional spotlight is oriented in a covert manner to cropped
patches of the original image.
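The second step, saliency from a backward pass followed by a segmentation mask, can be sketched as follows. This is an illustrative numpy toy, not the thesis implementation: a made-up score gradient stands in for the ConvNet's backward pass, and the proposal is simply the bounding box of the most salient pixels.

```python
import numpy as np

def saliency_map(score_gradient, image_shape):
    """Image-specific class saliency: the saliency of each pixel is the
    magnitude of the derivative of the class score with respect to that
    pixel. In a deep network this gradient comes from one backward pass;
    here it is supplied directly for illustration."""
    return np.abs(score_gradient).reshape(image_shape)

def location_proposal(saliency, quantile=0.95):
    """Threshold the saliency map and return the bounding box of the
    supra-threshold pixels: a crude segmentation-mask proposal."""
    th = np.quantile(saliency, quantile)
    ys, xs = np.nonzero(saliency >= th)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a gradient that peaks at pixel (row 1, column 1) of an 8x8 image
rng = np.random.default_rng(0)
grad = rng.normal(0.0, 0.1, 64)
grad[1 * 8 + 1] = 5.0          # strong evidence for the class at (1, 1)
sal = saliency_map(grad, (8, 8))
print(location_proposal(sal))   # box containing the salient pixel
```

The proposal is then used either to crop (uniform sensor) or as the fixation point for foveation (non-uniform sensor).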
What differentiates our model from Cao's model [3] is the use of two visual sensing configurations.
On one hand, Cao uses high-resolution images in both passes of the model. On the other hand,
we use images with different resolutions corresponding to two visual sensing configurations: uniform
vision, where we simulate a low-resolution sensor, and non-uniform foveal vision, where the sensor
presents space-variant resolution.
Our primary goal is to evaluate the performance of several well-known, state-of-the-art Convolutional
Neural Network architectures in object detection and localization tasks. Moreover, we assess the
performance of two different visual sensory structures, a conventional uniform (Cartesian) one and a
multi-resolution, human-inspired, foveal configuration, on the first and second feed-forward passes.
For the Cartesian configuration, re-classification is performed on a cropped patch of the image,
discarding the periphery. For the human-like tessellation, the image is foveated at the center of the
proposed location. Finally, a combined configuration is evaluated in which the input image has uniform
resolution on the first feed-forward pass, and on the second feed-forward pass the image is foveated
from the center of the object location proposal.
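The foveated configuration above can be viewed as space-variant resampling: sharp at the fixation point, increasingly blurred with eccentricity. A minimal numpy sketch of such a sensor, using a box-blur pyramid as a crude stand-in for the Gaussian foveation model; the parameter names (`f0`, `levels`) and falloff law are illustrative assumptions, not the thesis's exact model:

```python
import numpy as np

def box_blur(img, k):
    """Crude separable box blur of odd width k, a stand-in for the
    Gaussian low-pass filters of a real foveation pyramid."""
    if k <= 1:
        return img.astype(float)
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    h = np.stack([p[:, i:i + img.shape[1]] for i in range(k)]).mean(0)
    return np.stack([h[i:i + img.shape[0], :] for i in range(k)]).mean(0)

def foveate(img, cx, cy, f0=4.0, levels=4):
    """Each output pixel takes its value from a blur-pyramid level chosen
    by its eccentricity: sharp inside radius f0 around fixation (cx, cy),
    one level coarser each time the eccentricity doubles."""
    h, w = img.shape
    pyramid = np.stack([box_blur(img, 2 * l + 1) for l in range(levels)])
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(xs - cx, ys - cy)
    lev = np.clip(np.log2(ecc / f0 + 1.0), 0, levels - 1)
    lo = np.floor(lev).astype(int)
    hi = np.minimum(lo + 1, levels - 1)
    frac = lev - lo
    low = np.take_along_axis(pyramid, lo[None], 0)[0]
    high = np.take_along_axis(pyramid, hi[None], 0)[0]
    # linear interpolation between adjacent levels avoids visible rings
    return (1.0 - frac) * low + frac * high
```

A pixel at the fixation point is reproduced exactly, while a distant pixel is averaged over the widest blur kernel, which is the information attenuation that Chapter 4 quantifies.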
1.3 Document Outline
The remainder of this document is organized as follows: Chapter 2 overviews the related work and some
fundamental concepts needed to better understand the proposed attentional framework. Section 2.1
describes the different types of visual attention, and Section 2.2 defines preattention and presents
some existing theories about how this process is carried out in visual systems. Section 2.3 presents
the differences between sensation and perception and characterizes the two types of processing,
top-down and bottom-up, in more detail. Section 2.4 gives a brief introduction to artificial neural
networks, and the architecture of the Convolutional Neural Networks used is presented in Section 2.5.
Finally, Section 2.6 covers the origins of Deep Neural Networks.
In Chapter 3, a taxonomy of the diverse computational models used in visual attention is presented.
Section 3.1 presents the classical methods and Section 3.2 describes the modern methods; both are
supported by work done by well-known researchers in the visual attention field. The data set selected
for this work is presented in Section 3.3, and several convolutional network models are described in
Section 3.4.
In Chapter 4, a theoretical explanation of the saliency calculation for a specific object class is
given in Section 4.1. The different visual sensors implemented in this work are presented in
Section 4.2, in particular the uniform visual system in Section 4.2.1 and the foveal visual system in
Section 4.2.2. Finally, a study of the information content present in images, useful for the
comparison between the two sensors, is presented in Section 4.3.
The proposed hybrid attention model is presented in Chapter 5, where the various steps that
constitute the framework are described.
Finally, the obtained results are presented in Chapter 6 and, in Chapter 7, we draw conclusions
from the work carried out and point to future contributions.
Chapter 2
Background
Vision is one of the five senses and allows organisms to improve their perception of the surrounding
environment, enabling greater knowledge of the world. There is evidence that vision is the dominant
sense in human beings. For example, Colavita [4] performed experiments in which visual (light) and
auditory (tone) stimuli were presented and participants were instructed to identify the respective
stimulus. The study revealed a predisposition among the subjects to direct their attention
preferentially toward the visual modality.
The process of seeing starts with light entering the eye through the cornea. The eye has the ability
to adapt to different levels of brightness (adaptation) and to shape its lens and pupil size in order to
focus objects at different distances (accommodation). The colored part of the eye, called the iris,
controls the amount of light that enters the eye, admitting more light when the environment is dark
and less when there is plenty of light. The light then passes through the pupil and is focused by the
lens, a nearly spherical body, onto the retina. The retina is a sensory membrane responsible for
receiving the visual stimuli and converting them into electric signals, which are transmitted through
the optic nerve to the visual cortex in the brain. The retina is full of photo-receptors: the rods,
located mostly at the periphery of the retina, and the cones, which distinguish colors and are located
mostly in the center (see Figure 2.1).
The most sensitive part of the retina is called the macula and comprises hundreds of nerve endings,
which allow us to see objects in great detail. It is subdivided into the perifovea, parafovea and
fovea, of which the fovea is located at the center of the macula (see Figure 2.2). Finally, the visual
stimuli are received when the signals coming through the optic nerve reach the back of the brain,
where the visual cortex is located and the stimuli are interpreted.
The proposed object localization and classification framework uses several biologically inspired
attention mechanisms, which include space-variant vision, and Artificial Neural Networks (ANN) for
top-down cognitive processes (i.e. guided, task-biased attention) in the recognition and localization
of objects. As such, in the remainder of this chapter, we describe the fundamental concepts from
neuroscience and computer science on which the proposed framework is based.
Figure 2.1: Photo-receptor density in the retina. The cones are concentrated in the fovea, the region of highest acuity, and the rods are distributed in the periphery. Figure adapted from [5].
Figure 2.2: Diagram of the macula of the retina, showing the perifovea, parafovea and fovea.
2.1 Visual Attention
Attention is a process through which an organism selects a sub-region of the visual field, the
so-called "focus of attention", to be processed in detail. This allows suppressing the rest of the
available information to obtain efficient perception.
Depending on the number of processed inputs, attention can be selective (processing only one
input) or divided (processing more than one input at once) [6]. In selective attention, irrelevant
stimuli are blocked and the desired information is promoted. The computational resources available
to humans are limited. This idea is suggested by Broadbent's filter theory [7], which introduced the
concept of a structural bottleneck: a limit on the amount of information that can pass through the
visual pathways at any time. Given this cognitive limitation, a selective filter is needed for
information processing.
Thereby, different selection models have been proposed to decide when to attend to a certain
stimulus. On one hand, in early selection models, stimuli are filtered or selected at an early stage
of processing. Early filters select the relevant information based on basic low-level features such
as color or stimulus direction. On the other hand, late selection models involve semantic selection,
which requires more attentional resources than early selection.
In divided attention, the focus of attention is split among more than one task at a time. This is
hard since resources are limited by a cognitive budget, so divided attention demands that resources be
separated among different tasks. When we want to perform two tasks at once, our attention needs to be
divided between them; in this case, carrying out task A will decrease the performance of task B when
both are performed at the same time. There is interference if both tasks share sensory modalities
(e.g. both use visual inputs), use the same mental processing stages (e.g. seeing and listening to
words) or use the same response mechanisms (e.g. listening to words and seeing pictures) [8]. These
kinds of interference appear because the amount of available resources is insufficient to perform both
tasks well. However, if a task is carried out frequently, the knowledge acquired during task execution
leads to automation, meaning less cognitive effort.
There are three types of visual attention [9]:
1. Spatial attention, which can be overt, when the observer moves their eyes to the relevant locations
and the focus of attention matches the eye movement, or covert, when attention is allocated
to relevant locations without eye movement;
2. Feature-based attention (FBA), which can attend to specific features (e.g. color, orientation or
motion direction) of objects, regardless of their location;
3. Object-based attention (OBA), where attention is guided by object structure.
The observer's attention can be stimulus-driven, triggered by scene characteristics like color or
orientation (bottom-up factors), or driven by specific visual characteristics that depend on the task
or goal to be achieved (top-down factors). Two questions stand out: where should the observer focus
his attention (spatial attention), and on which features (feature-based attention)? A problem can also
arise: inattentional blindness [10], the inability to detect unexpected objects to which we are not
paying attention. This temporary blindness can happen because it is impossible for a viewer to
attend to all stimuli.
William James [11] defines two modes of spatial attention that facilitate the processing and selection
of information: endogenous and exogenous.
The exogenous system is responsible for automatically orienting our attention, in an involuntary and
reflexive manner, to locations where sudden changes take place. For instance, imagine that we
hear a loud sound coming from outside. Our first reaction will be to direct our gaze to the sound
source (orienting reflex) with the purpose of updating our model of the world [12]. Since these
changes are unexpected and entirely stimulus-related, they correspond to bottom-up processing, also
known as stimulus-driven attention.
The endogenous system is voluntary and corresponds to allocating attentional resources to a
predetermined location. This way, the orienting of attention results from taking task-specific goals
into account. In this situation, we can direct attention to a location in space or to an object. This
is known as top-down processing or goal-driven attention. For example, if our goal is to count how
many people will leave a room, we will orient our attention to the door. This means that, with this
knowledge, we can guide our attention to relevant places to make the process more efficient [13].
When a viewer is asked to find a specific target, he knows what to find but not where; for that, he
must search [8]. There are two types of search mechanisms: parallel (pop-out) search and serial
search. If the target differs from the other elements of the visual field by a single feature (e.g.
color), the search is performed very quickly (pop-out search) [14] [15] (see Section 2.2). There is no
need to guide attention to any of the elements of the field, since it is enough to detect the presence
of an activation in the corresponding feature map. However, if the target differs from the non-targets
(distractors) by a conjunction of features (e.g. a red horizontal bar), the search is slower (serial
search). In this case, search time depends linearly on the number of non-target elements, since
attention must be focused on each and every one of them [14] [16].
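The contrast between the two search regimes can be illustrated with a toy reaction-time simulation. All constants below (base time, per-item cost, the self-terminating search assumption) are made-up illustrative values, not empirical data:

```python
import random

def search_time(n_distractors, conjunction, t_item=50.0, t_base=200.0):
    """Toy reaction-time model of visual search, in milliseconds.
    A pop-out target is found in roughly constant time; a conjunction
    target forces a serial scan whose expected cost grows linearly
    with the number of display items."""
    if not conjunction:
        return t_base                        # parallel: one check of a feature map
    # serial self-terminating search: on average half the items are inspected
    inspected = random.randint(1, n_distractors + 1)
    return t_base + t_item * inspected

random.seed(0)
for n in (5, 20, 40):
    serial = sum(search_time(n, True) for _ in range(2000)) / 2000
    print(f"{n:2d} distractors: pop-out {search_time(n, False):.0f} ms, "
          f"serial about {serial:.0f} ms")
```

Pop-out time is flat in the number of distractors, while the average serial time grows linearly, which is the signature Treisman used to classify tasks as preattentive or not.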
2.2 Preattention
The concept of preattention refers to noticing something before attention is fully focused on it, i.e.
we can almost instantly recognize an element in the visual field using low-level information.
Typically, tasks that can be performed in less than 200-250 milliseconds are considered preattentive
[17]. The visual properties that we can detect effortlessly, called preattentive attributes, are
color, movement, form, lightness and spatial position, since they simply "pop out". Imagine an image
with a certain number of blue balls (distractors) and one red ball (target), randomly placed. If we
look at the image for a fraction of a second, we will detect the presence of the target without
focusing on any specific region. This happens because the target has a visual property, "red", that
the distractors do not. The same logic can be applied to images containing elements with different
geometric forms, like circles and triangles. However, if a target is defined by the presence of two or
more visual properties, it often cannot be found preattentively. In those cases (conjunction targets),
viewers must search through the display to confirm its presence or absence [17].
Some theories attempt to explain how preattentive processing is done, including feature
integration [18], guided search [19] and boolean maps [20]. In the remainder of this section, we
explain the basic ideas behind these theories.
2.2.1 Feature Integration Theory
Treisman [21] focused on two problems: first, determining which visual features are detected
preattentively; second, formulating a hypothesis about how the visual system performs preattentive
processing [18].
To identify the preattentive features, she ran target-detection experiments, measuring performance
by response time and by accuracy. In the response-time model, viewers are asked to complete the task
as quickly as possible while the number of distractors on the display is varied. If the elapsed time
is below some chosen threshold, the task is considered preattentive. In the accuracy model, the task
is the same and the number of distractors also varies, but the display is shown for a fraction of a
second and then removed. If the viewers finish the task accurately, the feature used to define the
target is considered preattentive. From these experiments she identified several visual features that
are detected preattentively: shape, color, size, contrast, orientation and intensity, among others [21].
To understand how preattentive processing is done, Treisman proposed a model (see Figure 2.3). In
this model, each feature map registers the activity for a specific visual feature like contrast or
size. When an image is shown, features are encoded in parallel into their respective maps. These maps
only provide the activity log of each feature. If the target has a unique feature, we just have to
check whether there is activity in the respective feature map. However, for a conjunction target, one
feature map is not enough; a serial search must be done to find the target with the correct
combination of features. In this case, focused attention is used, which increases the time and effort
spent.
Figure 2.3: Treisman’s feature integration model of early vision — detection of activity in individual feature maps can be done in parallel, but to search for a combination of features, attention must be focused. Figure adapted from [17].
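The contrast between the two search modes can be sketched in code (a toy illustration with invented items and features, not part of Treisman's model):

```python
# Toy sketch of Treisman's feature integration idea (illustrative only).
# Each "feature map" records which display items show activity for one feature.
items = [
    {"color": "red", "shape": "circle"},
    {"color": "green", "shape": "circle"},
    {"color": "green", "shape": "square"},
]

def feature_map(items, feature, value):
    """Parallel stage: one boolean activity map per feature value."""
    return [item[feature] == value for item in items]

# Unique-feature target ("red"): a single map answers in one parallel check.
red_map = feature_map(items, "color", "red")
target_present = any(red_map)

# Conjunction target ("green square"): no single map suffices, so items
# must be checked serially with focused attention.
def serial_conjunction_search(items, wanted):
    for index, item in enumerate(items):  # one item at a time
        if all(item[f] == v for f, v in wanted.items()):
            return index
    return None

found = serial_conjunction_search(items, {"color": "green", "shape": "square"})
```

The parallel check costs one map lookup regardless of display size, while the serial search time grows with the number of items, mirroring the experimental findings described above.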
2.2.2 Guided Search Theory
The theory of guided search was proposed by Wolfe [19] [22] and tries to take into account the goals of the observer in the task. He proposed that the combination of bottom-up and top-down information creates an activation map during visual search. Attention is then guided to the peaks of the activation map, which correspond to the areas of the image to which bottom-up and top-down information contributed the most.
Wolfe agreed with Treisman that, in preattentive processing, the image is divided into individual feature maps. Each map corresponds to a feature that can be filtered into several categories; for example, the feature ’orientation’ can be divided into shallow, steep, left and right (see Figure 2.4). Bottom-up activation measures how different an element is from its neighbors, while top-down activation is user-driven, answering the requests posed by the task. For instance, if we want to find a ”green” element, this generates a top-down request appropriate to the task. Wolfe also argues that observers should specify the request in terms of the categories existing in each feature map (e.g. color map, category ’G’) [23].
Figure 2.4: Guided search for steep green targets: an image is divided into individual feature maps that are filtered into categories, and bottom-up and top-down activation point out target regions. Then, the information is blended into an activation map where the highest peaks represent the locations to which attention will be directed. Figure adapted from [17].
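The map-blending idea can be sketched as follows (the maps and the equal weighting are invented for illustration, not taken from Wolfe's model):

```python
import numpy as np

# Minimal sketch of guided search's activation map: blend a stimulus-driven
# map with a task-driven map and attend to the highest peak.
bottom_up = np.array([[0.1, 0.8],
                      [0.2, 0.1]])   # local-difference saliency
top_down = np.array([[0.0, 0.9],
                     [0.0, 0.0]])    # task bias, e.g. for "steep green"

# Equal weighting is an arbitrary illustrative choice.
activation = 0.5 * bottom_up + 0.5 * top_down
peak = np.unravel_index(np.argmax(activation), activation.shape)
```

Here `peak` is the image location where both sources of activation agree most strongly, which is where attention would be directed first.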
2.2.3 Boolean Map
With the purpose of understanding why we cannot notice features that are irrelevant to the immediate task, Huang proposed a new model of low-level vision [20]. In this model, the scene is divided into two distinct and complementary regions: the excluded elements and the selected elements. The latter can be accessed for detailed analysis.
There are two ways to create boolean maps. In the first, the observer defines a value for an individual feature, and all objects with the specified feature value are selected. Imagine we are looking for green objects: the color feature label for the boolean map will be ”green”. This means that, for the feature ’color’, we want access to objects with ”green” labels. Since we are not looking for other features (e.g. orientation), their labels remain undefined. An example of boolean maps is presented in Figure 2.5, where the goal is to select red objects or vertical objects. In the second, the observer chooses a set of elements at specific spatial locations. In this case, no value is assigned to the features, so all labels are undefined.
Huang’s theory differs from feature integration theory because the latter does not provide information on feature location, whereas boolean maps preserve the spatial locations of the selected elements as well as their feature labels. Another advantage is that we can create a boolean map that is the union or the intersection of two existing maps, as shown in Figure 2.5 (d).
Figure 2.5: Boolean maps: (a) red and blue vertical and horizontal elements; (b) map for “red”, color label is red, orientation label is undefined; (c) map for “vertical”, orientation label is vertical, color label is undefined; (d) map for set intersection of the “red” and “vertical” maps. Figure adapted from [17].
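Since boolean maps are simply binary masks over the scene, the set operations of Figure 2.5 can be sketched directly (the 2×2 scene layout below is an invented example):

```python
import numpy as np

# Boolean maps as binary masks: True where an element has the selected
# feature value (layout is a made-up 2x2 scene for illustration).
red_map = np.array([[True, False],
                    [True, False]])      # map for color == "red"
vertical_map = np.array([[True, True],
                         [False, False]])  # map for orientation == "vertical"

union = red_map | vertical_map          # "red OR vertical"
intersection = red_map & vertical_map   # "red AND vertical", as in Fig. 2.5 (d)
```

Because the maps keep their spatial layout, the result of a union or intersection is again a boolean map that records where the selected elements are.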
2.3 Mechanisms for Information Processing
Sensation and perception play different but complementary roles in the way we interpret our world. Sensation is the process by which we sense the environment around us through touch, taste, sight, sound and smell. The information is then sent to the brain, where perception comes into play, interpreting the received information and making us understand what is happening around us, allowing us to form a mental representation of the environment.
In general, when it comes to processing in the context of sensation and perception, two types of processing are commonly characterized: top-down processing and bottom-up processing. On one hand, top-down processing corresponds to voluntarily allocating attention to features, objects or spatial regions based on prior knowledge and current goals/tasks. Thus, prior knowledge and the task at hand are used to influence attention in a goal-driven manner. On the other hand, bottom-up processing refers to the involuntary mechanisms responsible for directing attention to salient regions based on differences between a region and its surround (e.g. contrast). In this case, the stimulus directly triggers our attention and it is thus a data-driven process.
Humans perceive data, but the question is: how is this done? Some theories, known as constructivist theories, assume that information needs to be processed at a higher level before we build our perception of the world. Other theories, called direct theories, hold that the environment provides us enough information to perceive our world. Finally, some authors such as Neisser [24] argue that visual perception depends on both bottom-up and top-down processing.
From Gregory’s perspective, perception involves top-down processing. His theory [25] holds that prior knowledge and past experiences related to stimuli help the agent to better guess or hypothesize. In this way, the agent is constantly building his perception of reality based on the environment and stored information. However, the brain may create some incorrect hypotheses, which result in visual illusions. For Gibson, in contrast, the information provided by the environment is enough to make sense of the world (theory of affordances [26]): perception is direct and involves bottom-up processing, which is reflexive, involuntary and independent of the agent’s past experiences. Thereby, Gibson argued that perception is a bottom-up process, since the visual information needed is available in the environment, excluding the need for prior knowledge.
At last, Neisser [24] advanced the cycle theory, which describes perception as a continuous process (see Figure 2.6) in which bottom-up and top-down processing work together. People use prior knowledge of the world (top-down) to build schemas and, with them, are able to predict which information may become available (bottom-up).
In conclusion, Gibson’s theory is more limited, explaining perception solely in terms of the environment, while Gregory suggests that what we see is not enough and that attention mechanisms take advantage of prior knowledge.
Figure 2.6: The perceptual model cycle.1
2.4 Artificial Neural Networks
Artificial Neural Networks (ANN) are computational models inspired by the central nervous system of animals, especially the brain, and try to mimic the way a biological brain solves problems. Modern networks are limited by computational power, working with a few thousand to a few million neural units and millions of connections, which is still far from the complexity of the human brain.

A neural network receives inputs, transforms them and generates an output. Its key strength is the ability to learn implicit mappings between inputs and outputs, making it a powerful tool for tasks such as pattern recognition.
1source: http://www.southampton.ac.uk/engineering/research/projects [seen in May, 2016]
A neural network is organized in layers that establish connections between neurons. It starts with an input layer, where each neuron is fully connected to all neurons of the next layer. Each connection between two neurons is assigned a weight that controls the signal transmission between them. The input units receive information from the outside world and communicate with one or more hidden layers, where the actual processing takes place. In classification networks, the hidden layers distort the input data in a non-linear way with the aim of obtaining linearly separable categories at the end [27]. The last hidden layer links to the output layer, where each item is assigned to its predicted class (see Figure 2.7). All neurons in the hidden layers apply an activation function, which can be a linear, threshold or sigmoid function.
Figure 2.7: Neural network basic structure.
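The forward computation described above can be sketched for a tiny 2-3-2 network (the layer sizes and all weight values are arbitrary illustration values):

```python
import numpy as np

# Forward pass through a 2-3-2 fully connected network with a sigmoid
# hidden layer (all numbers below are invented for illustration).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2])            # input layer (2 units)
W1 = np.array([[0.1, 0.4],           # weights input -> hidden (3x2)
               [-0.3, 0.2],
               [0.5, 0.6]])
b1 = np.zeros(3)
W2 = np.array([[0.2, -0.1, 0.3],     # weights hidden -> output (2x3)
               [0.4, 0.1, -0.2]])
b2 = np.zeros(2)

hidden = sigmoid(W1 @ x + b1)        # hidden-layer activations
scores = W2 @ hidden + b2            # output layer: one score per class
```

Each matrix row holds the weights of one neuron's incoming connections, so a whole layer is computed with a single matrix-vector product.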
There are two main learning paradigms for training a neural network based classifier:
• Supervised learning - It requires a large labeled data set with input samples associated to categories. The network produces an output in the form of a vector of scores, one score for each category. Then, an objective function is computed to measure the error, i.e. the difference between the output scores and the desired pattern of scores. With this knowledge, all internal weight parameters are adjusted with the goal of minimizing the error. To correctly perform these adjustments, the learning algorithm computes a gradient vector that, for each weight, indicates how the error would vary if the weight were increased by a tiny amount [27]. Finally, the weight vector is adjusted in the direction opposite to the gradient vector.
• Unsupervised learning - The network learns intrinsic relations about the data without having a
target or label. It exploits only the statistical distribution of the input data to associate samples to
groups of related elements.
In supervised learning, the data is typically divided into three sets: the training set, used to build the model by finding relationships between the data and the pre-classified targets (labeled data); the validation set, used to tune hyper-parameters such as the number of hidden units or the depth of the neural network; and finally, the test set, used to estimate the performance of the model on unseen data.
One of the most popular neural network training algorithms is backpropagation. It requires a known desired output for each input value in order to calculate a loss function representing the difference between the current and the desired output. The backpropagation training algorithm proceeds as follows. It starts by performing feed-forward computations, with the weights and biases randomly initialized, producing an output from which the loss function is calculated. Then, feedback from the output layer is used to adjust the weights and biases such that the error is incrementally minimized. This process is repeated through all hidden layers until the input layer is reached, minimizing the loss function for that set of input values. These incremental changes have to be small, since the weights affect all inputs from the training set.
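The loop described above can be sketched on a toy problem (a 2-4-1 network trained on XOR; the architecture, learning rate and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Toy backpropagation sketch: feed-forward, error propagation, and
# gradient-descent updates, repeated until the loss shrinks.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])            # XOR targets

W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)     # random initialization
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1.T + b1)                    # hidden activations
    return h, sigmoid(h @ W2.T + b2)              # network output

_, out = forward(X)
initial_loss = float(np.mean((out - y) ** 2))

for _ in range(2000):
    h, out = forward(X)                           # feed-forward pass
    d_out = (out - y) * out * (1 - out)           # error at the output layer
    d_h = (d_out @ W2) * h * (1 - h)              # error propagated backwards
    W2 -= 0.5 * d_out.T @ h                       # small steps opposite
    b2 -= 0.5 * d_out.sum(axis=0)                 # to the gradient
    W1 -= 0.5 * d_h.T @ X
    b1 -= 0.5 * d_h.sum(axis=0)

_, out = forward(X)
final_loss = float(np.mean((out - y) ** 2))
```

After training, `final_loss` is smaller than `initial_loss`, showing the incremental error minimization at work.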
Once the network has been trained, we can present a whole new set of inputs and see how it responds, attempting to categorize each new input into the right class. For example, if we present a set of input images of cats and dogs in the training phase, the network will learn the features corresponding to these classes. In the test phase, we can present an image of a dog and see whether the network correctly classifies this input as class ’dog’ or misclassifies it as class ’cat’.
2.5 Convolutional Neural Networks
There are several types of neural networks but, as far as visual attention is concerned, the most commonly used are Convolutional Neural Networks (CNN), which are feed-forward artificial neural networks that take the spatial structure of the input into account. They have the ability to learn discriminative features from raw input data and have been used in several visual tasks such as object recognition and classification.
This type of neural network is named convolutional because it performs the mathematical operation of convolution. In the case of CNNs that process images, the signal is discrete, so the convolution of two discrete signals is done by summing the product of the two signals, where one of them is flipped and shifted [28]. The mathematical formula for the convolution of discrete signals is defined in (2.1), where x is the input signal and h is the impulse response. This operation has several applications in signal processing, such as filtering signals (2D, in image processing) or finding patterns between them.
y[n] = x[n] ∗ h[n] = Σ_{k=−∞}^{+∞} x[k] h[n − k]. (2.1)
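Equation (2.1) can be implemented directly for finite-length signals; the sketch below computes the flip-and-shift sum explicitly, and the example signals are invented for illustration:

```python
import numpy as np

# Direct implementation of the discrete convolution sum in (2.1) for
# finite signals, checked against NumPy's built-in convolution.
def convolve(x, h):
    n_out = len(x) + len(h) - 1           # length of the full convolution
    y = [0.0] * n_out
    for n in range(n_out):
        for k in range(len(x)):
            if 0 <= n - k < len(h):       # h is flipped and shifted by n
                y[n] += x[k] * h[n - k]
    return y

x = [1.0, 2.0, 3.0]                       # example input signal
h = [0.0, 1.0, 0.5]                       # example impulse response
y = convolve(x, h)
```

The same result is produced by `np.convolve(x, h)`, which frameworks use (in 2D form) to filter images.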
A CNN is composed of multiple stacked layers that filter (convolve) the input stimuli to extract useful and meaningful information depending on the task at hand. These layers have parameters that are learned in a way that allows the filters to automatically adjust to extract useful information, so there is no need to manually select relevant features. The general architecture of a CNN is shown in Figure 2.8.
Figure 2.8: Convolutional Neural Network architecture. Figure adapted from [29].
Convolutional layer: Each neuron receives a sub-region of the previous layer as input, and these local inputs are multiplied by the weights. The resulting filters are applied throughout the input space with the purpose of looking for specific features. Their weights are shared and their output is a feature map.
To configure a convolutional layer, it is necessary to set some hyper-parameters [28] such as:
• Kernel size - size of the filters;
• Stride - number of pixels that the kernel window will slide (usually, 1 for convolution layers);
• Number of filters - number of patterns that the convolution layer will look for.
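Together with an optional padding P (a common additional hyper-parameter not listed above), these settings determine the spatial output size through the standard formula out = (W − K + 2P)/S + 1, sketched below with illustrative sizes:

```python
# Spatial output size of a convolution layer; the formula is standard,
# the example numbers are illustrative choices.
def conv_output_size(width, kernel, stride=1, padding=0):
    return (width - kernel + 2 * padding) // stride + 1

# A 224-pixel-wide input with a 3x3 kernel, stride 1 and no padding
# shrinks slightly; padding of 1 preserves the width.
out_no_pad = conv_output_size(224, 3)               # 222
out_padded = conv_output_size(224, 3, padding=1)    # 224
```

Padding with P = (K − 1)/2 for odd kernels keeps the feature map the same size as the input, which is why it is so common in practice.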
Pooling layer: Is generally placed in-between convolutional layers and its goal is to down-sample the input, reducing dimensionality and producing a single output from each local region. It also decreases the amount of computation in subsequent layers by reducing the number of parameters to learn, and provides basic translation invariance. A commonly used down-sampling function is max-pooling, which takes the maximum value within each sub-region (see Figure 2.9).
Figure 2.9: Representation of max-pooling operation.2
2source: https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks [seen in December, 2016]
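The max-pooling operation of Figure 2.9 can be sketched for a 2×2 window with stride 2 (a minimal illustration; real frameworks implement this with optimized kernels, and the input values below are invented):

```python
import numpy as np

# 2x2 max-pooling with stride 2 over a single-channel input: each
# non-overlapping 2x2 block is replaced by its maximum.
def max_pool_2x2(x):
    h, w = x.shape                       # assumes h and w are even
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 8],
              [0, 6, 3, 4]], dtype=float)
pooled = max_pool_2x2(x)                 # 2x2 map of local maxima
```

The 4×4 input is reduced to a 2×2 output, quartering the number of values passed to the next layer while keeping the strongest response of each region.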
Fully-connected layer: Is the final layer and computes the class scores, consistent with the training set labels. The input of the fully-connected layers is the set of all feature maps of the previous layer. Since its outputs are no longer spatially arranged, a convolutional layer cannot follow a fully-connected one.
In a CNN, the neurons are arranged in a 2D structure (width, height) in a way that allows spatial relations between neurons and the original data to be preserved. However, with colored images, especially RGB images, an additional dimension for the separate color channels is required. In this way, we have a 3D input (width, height and depth).
In CNNs, the number of input neurons in the first network layer is equal to the input size. In essence, if an image is presented as input, the number of neurons at the first layer will be the same as the number of pixels of the input image. Therefore, if an image were used as input to a fully-connected network, it would require an enormous number of connections between neurons and hence the training of this network would be unmanageable. CNNs deal with this computational complexity issue by connecting each neuron to a sub-region of the previous layer; the weights and biases are shared, allowing the network to look for the same feature in several regions.

In the second layer, each neuron is connected to a subset of neurons from the previous layer, called its receptive field. In this way, the receptive fields of neurons in a deeper layer combine the receptive fields of several neurons from the previous layer.
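The weight-sharing argument can be made concrete with a rough parameter count (the layer sizes below are assumptions chosen for illustration, not from a specific network):

```python
# Rough parameter-count comparison for a 224x224 RGB input.
inputs = 224 * 224 * 3                    # one input unit per pixel and channel

# Fully connected first layer with 1000 hidden units: every unit sees
# every input value.
fc_params = inputs * 1000 + 1000          # weights + biases

# Convolutional first layer with 64 shared 3x3x3 filters: each filter is
# reused at every spatial location.
conv_params = 64 * (3 * 3 * 3) + 64       # weights + biases

ratio = fc_params / conv_params           # tens of thousands of times fewer
```

The fully connected layer needs over 150 million parameters, the convolutional one fewer than two thousand, which is precisely the complexity gap weight sharing closes.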
2.6 Deep Neural Networks
Deep Neural Networks (DNN) are a subclass of artificial neural networks and are characterized by having several hidden layers between the input and output layers.
Before 2006, most neural networks typically used one hidden layer, two at the most, due to the high cost of computation and the scarce amount of available data. The deep learning breakthrough occurred exactly in that year, 2006, when Hinton [30], Bengio [31] and Ranzato [32], three researchers brought together by the Canadian Institute for Advanced Research (CIFAR), were able to train networks with many more layers for the handwriting recognition task.

They used unsupervised learning methods to create layers of feature detectors without the need for labeled data. Then, they pre-trained some layers with more complex feature detectors, providing enough information to initialize the weights with sensible values. This method allowed researchers to train networks 10 or 20 times faster [27].
In recent years, CNNs have become deeper and deeper, which has resulted in a performance boost. However, they are not becoming wider (number of parameters in each layer), since very wide and shallow networks exhibit very weak generalization performance despite being good at memorization. In contrast, deeper networks can learn features through several levels of abstraction and present much better generalization results, because they learn all the intermediate features between the raw data and the high-level classification. Note that using wider and deeper networks leads to an increase in the number of parameters that the network has to learn.
Following the tendency to work with deeper networks, and considering the overfitting problem that occurs when the model fits too closely to the data set, a recent technique called Dropout has been successfully applied. The Dropout technique consists of randomly dropping out neurons during the training phase [33], which forces the network to learn more robust features, since a neuron cannot rely on the presence of another particular neuron, improving the generalization of the neural network.
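A common implementation is the "inverted dropout" variant sketched below (an assumption about the usual implementation, not taken verbatim from [33]): surviving units are scaled by 1/p during training so that the test-time network needs no rescaling:

```python
import numpy as np

# Inverted-dropout sketch: keep each unit with probability p and scale
# survivors by 1/p, so expected activations match at train and test time.
def dropout(activations, p=0.5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(activations.shape) < p   # keep-mask drawn per unit
    return activations * mask / p

h = np.ones((4, 8))           # pretend hidden-layer activations
h_train = dropout(h, p=0.5)   # ~half the units zeroed, survivors scaled to 2.0
h_test = h                    # at test time the full network is used as-is
```

A fresh mask is drawn for every training example, so each forward pass effectively trains a different thinned sub-network.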
One attempt to speed up networks by decreasing the number of parameters consists of substituting large convolutions with a combination of smaller ones. Researchers replaced a large convolution, such as a 7×7 convolution, by a cascade of several small convolutions, such as three 3×3 convolutions with the same depth [28]. In-between each of these small convolution layers, a ReLU layer is placed to increase the number of non-linearities. Therefore, we end up with a similar network but with fewer weights, which results in fewer computations and a faster network. However, this type of substitution cannot be done on the first layer, because it would result in an enormous consumption of memory [28].
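The saving can be checked with a quick weight count (biases and the interleaved ReLU layers are ignored, and the channel count C is an arbitrary illustrative choice):

```python
# Weight counts for a convolution with C input and C output channels.
C = 64
large = 7 * 7 * C * C            # one 7x7 convolution: 49 * C^2 weights
cascade = 3 * (3 * 3 * C * C)    # three stacked 3x3 convolutions: 27 * C^2
savings = 1 - cascade / large    # roughly 45% fewer weights
```

The cascade also covers the same 7×7 receptive field, so the representational reach is preserved while the weight count drops from 49C² to 27C².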
Chapter 3
Related Work
In this chapter, some methods used in visual attention are presented. The computational models associated with visual attention can be divided into three classes: bottom-up, top-down and hybrid.

We divided the state of the art concerning visual attention into two categories: the classical methods, related to models designed by hand, and the modern methods, which we characterize as those that use neural networks. A detailed explanation of classical and modern methods is presented in Section 3.1 and in Section 3.2, respectively. A well-known collection of images called ImageNet is presented in Section 3.3, and several convolutional network models are described in Section 3.4.

The taxonomy of visual attention computational models is presented in Figure 3.1 and these models are explained in the following sections.
Figure 3.1: Taxonomy of visual attention models.
3.1 Classical Methods
Visual attention computational models attempt to mimic the behavioral aspects of the human visual system. Filter-based models have three branches, corresponding to bottom-up models, top-down models, and a combination of both that we call hybrid models.
The bottom-up model corresponds to the process that carries information from the environment towards cognition (the brain) and relies entirely on stimulus information. Bottom-up mechanisms are agnostic to the task/goal at hand and have the purpose of extracting relevant low-level features and finding the most salient regions, to which attention should be directed.
There are several studies on how to determine salient regions based on purely low-level visual features. The pioneering works of Itti [14] [34] consist of combining multi-scale image features (color, intensity and orientation) into a single saliency map. Then, the Winner-Take-All (WTA) principle is applied, selecting the most salient location, which, combined with the Inhibition of Return (IOR) mechanism [35], creates a sequence of attended locations in order of decreasing saliency.
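The attend-and-inhibit loop can be sketched as follows (the feature maps and the winner-suppression scheme are simplified illustrations, not Itti's actual implementation):

```python
import numpy as np

# Sketch of an Itti-style loop: combine feature maps into a saliency map,
# repeatedly pick the winner (WTA) and suppress it (IOR).
color = np.array([[0.9, 0.1], [0.0, 0.2]])
intensity = np.array([[0.1, 0.1], [0.0, 0.7]])
orientation = np.array([[0.0, 0.4], [0.1, 0.0]])

saliency = (color + intensity + orientation) / 3.0   # simple averaging

def scanpath(saliency, n_fixations):
    s = saliency.copy()
    fixations = []
    for _ in range(n_fixations):
        winner = np.unravel_index(np.argmax(s), s.shape)  # WTA selection
        fixations.append(winner)
        s[winner] = -np.inf                               # IOR: inhibit winner
    return fixations

path = scanpath(saliency, 3)   # locations visited in decreasing saliency
```

Suppressing each winner before the next selection is what turns a static map into a sequence of fixations of decreasing saliency.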
Osberger’s approach [36] starts by performing image segmentation and then assigns perceptual importance based on a number of different factors. Human visual attention is influenced by low-level (contrast, size, shape, color and motion) and high-level features (location, people and context). Osberger chose only 5 of these features for his algorithm and, for each region, assigns an importance score to each of them. Lastly, a combination of these features results in a map representing the important regions of an image.
Kadir et al. [37] identify salient regions based on entropy measures of image intensity, while Gao [38] defines a salient region by how different it is from the surrounding background (the center-surround mechanism [39]).
The top-down model takes into account the observer’s prior knowledge, expectations and current goals. The literature on visual attention suggests several sources of top-down influence [1] when the problem is to decide where to look: attention can be drawn to specific object features, as in search models, to reach the goal more easily, or the context or gist can be used to constrain locations.

If an image is presented to an observer for, say, ∼ 80 ms or less [1], he is able to report some essential characteristics of the scene. Eye movements can be conditioned by contextual cues, taking into account, for instance, that a computer mouse is often on top of a desk, near a keyboard and a computer. Then, using that information based on scene context, it is possible to constrain the search.
There are several models for gist, using different low-level features. The gist vector can be computed by applying Gabor filters to an image and extracting universal textons [40], or by averaging filter outputs and then applying PCA (Principal Component Analysis) [41]. Another approach was presented by Itti, who used center-surround features from orientation, color and intensity channels to model gist [39]. Gist representations provide rich information that helps constrain the search to objects relevant to the observer’s goals (top-down attention).
Whenever there is a search task, top-down processes tend to dominate guidance, and target-specific features are an essential source for drawing attention more effectively. Moreover, our attention is oriented to task-relevant features; in this way, attentional resources are not wasted, and time and computational effort are saved for processing the pertinent/relevant parts of the visual field. In these conditions, we know what we are looking for (the goal), so we know from a priori knowledge the distinguishing features that we should be searching for. Thereby, as posited by guided search theory [19] [22] (see Section 2.2.2), we can modulate the gains assigned to different features. If, for example, the task is to find a green object, the gain assigned to the green color will be higher.
Taking into account that building saliency maps is a computationally intensive process [14], Lukic and Billard [42] present an efficient method to allocate visual resources in the task of reaching and grasping, where the information provided by the motor system is taken into account. They compute projections from the workspace to the image plane by applying motor babbling in simulation. This allows obtaining a large number of training samples to train a feed-forward neural network in an incremental, online manner. To take into account the motor plans of the robot, the authors propose a Coupled Dynamical System (CDS) [43] [44] to mentally simulate a trajectory and avoid obstacles. Following this approach, the initial visual search space is confined to the peripersonal space. When the robot starts to move, the attention switches to the motor-relevant parts.
Most current visual attention approaches model bottom-up and top-down processes independently. However, there must be a trade-off between purely bottom-up models, which typically fail to detect inconspicuous objects of interest, and top-down systems, which confine the search according to expectations related to task priors, excluding everything else.
In recent years, combinations of bottom-up and top-down models, which we designate as hybrid models, have been presented. For instance, Frintrop’s model [45] is composed of two saliency maps: one corresponding to top-down influences and another related to bottom-up influences. The aggregated saliency map is computed as a linear combination of those maps using a fixed weight, which proved to be an inflexible approach. Due to the loss of bottom-up information, Rasolzadeh et al. [46] presented a more flexible model where the combination of top-down and bottom-up saliency maps is done dynamically, using entropy measures that indicate how the linear combination should change over time. The conspicuity maps were created following Itti’s approach in [14], apart from the extra parameters used to weight the saliency map. They used a neural network to learn the bias of the top-down saliency map based on information provided by the scene context and the current task.
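The combination schemes above can be sketched as a weighted sum of the two maps (the maps and the weight are invented for illustration; a fixed w mimics Frintrop's scheme, while letting w vary over time would correspond to Rasolzadeh et al.):

```python
import numpy as np

# Hybrid saliency as a linear combination of a bottom-up (stimulus-driven)
# map and a top-down (task-driven) map.
S_bu = np.array([[0.2, 0.9], [0.1, 0.3]])   # stimulus-driven saliency
S_td = np.array([[0.8, 0.0], [0.1, 0.2]])   # task-driven saliency

w = 0.7                                      # fixed weight; in a dynamic
S = w * S_td + (1 - w) * S_bu                # scheme w would change over time
focus = np.unravel_index(np.argmax(S), S.shape)
```

With the top-down weight dominating, the focus lands on the task-relevant location even though the bottom-up map alone would favor a different, more conspicuous spot.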
These hybrid models suggest that the human visual system can guide attention by applying optimal top-down weights to bottom-up saliency maps, allowing quicker target detection in a background full of distractors [46].
3.2 Modern Methods
As previously mentioned in Section 3.1, there are many approaches based on models designed by hand. Lately, new approaches using Convolutional Neural Networks have been presented. In this section, we give a brief overview of the recent work done on visual attention, with an emphasis on Convolutional Neural Networks.
Several approaches have been presented to further improve the discriminative ability of deep neural
networks. There are two ways of achieving that: 1) adding regularization to improve robustness and
avoid overfitting; or 2) making the network deeper [47].
In neural networks, the number of input and output units depends on the dimensionality of the data set. Thus, regularization can be performed by controlling the number of hidden units (a free parameter). Deep neural networks (see Section 2.6) are powerful machine learning systems, and overfitting can be very difficult to handle, since the models memorize the training data instead of learning to generalize. An overfitted model presents poor predictive performance. Dropout is a technique to address this problem and was proposed by Srivastava et al. [33]. The main idea is to randomly drop units and their connections from the network during training. At test time, a network with smaller weights is used, which approximates the effect of averaging the predictions of all the thinned networks.
Szegedy et al. [47] presented a deep CNN architecture inspired by the work of Lin et al. [48]. They added 1×1 convolutional layers, increasing the depth (number of levels) and width (number of units per level) of the network to remove the computational bottleneck. This approach has several drawbacks: bigger networks require more parameters, increasing the likelihood of overfitting, and more computational resources are needed. The solution was to introduce sparse layers inside the convolutions.
Recently, work has been done to incorporate feedback strategies into deep neural networks. For instance, Recurrent Neural Networks (RNN) are used to capture attention in dynamic environments and exhibit dynamic temporal behavior. The inputs are fed back into the network, providing a kind of memory. Other examples such as Long Short-Term Memory (LSTM) or End-To-End Memory networks are also used. In this project, we focus on Convolutional Neural Networks.
Generally, neural networks are just a tool and, depending on the approach, can be applied in a bottom-up, top-down or hybrid way. Until recently, models proposed to detect regions of interest employed hand-designed features [14] [49], which lack adaptiveness.
Lin et al. [50] proposed a way of detecting saliency using deep convolutional neural networks. They
use the k -means algorithm to learn low-level filters and then, convolve them with the image (input),
generating low-level features that carry texture and color information. Over these low-level features,
pooling techniques were applied to generate mid-level features. Then, local contrast at multiple levels
was calculated using hand-designed filters yielding several maps which are combined to produce a final
saliency map.
Xiao et al. [51] present a hybrid model to detect and locate object parts, taking advantage of deep convolutional networks applied to features extracted from bottom-up region proposals. Inspired by Girshick’s work [52], they applied regions with convolutional neural networks (R-CNN) to model object parts in addition to whole objects, and to locate them. In this case, the data set used was a set of 200 species of birds, containing more than 11 000 images.
The proposed part localization model consists of three phases: training object and part detectors from bottom-up region proposals using deep convolutional features (training phase); applying a score function to all detectors and applying geometric constraints to choose the best object and part detections (test phase); and finally, extracting features from the located parts and training a classifier to assign the parts to a category.

In the training phase, ground truth bounding box annotations were used for the whole object and its semantic parts. The features extracted from region proposals are used to train a support vector machine (SVM), where regions with ≥ 0.7 overlap with the ground truth region are labeled as positive; otherwise, they are labeled as negative. After performing some experiments, they conclude that there is no need to annotate boxes during the test phase to correctly classify the bird species.
Cao et al. [3] proposed a method called Look and Think Twice to detect and locate objects in a top-down manner. They use feedback Convolutional Neural Networks and perform two passes through the network. In the first feed-forward pass, the predicted class labels are obtained, which gives a notion of the set of most probable object classes present in the input image. Then, based on the top-ranked labels given by the network, they compute the saliency map of the image with respect to each one of the top-5 class labels. Next, a segmentation mask is applied to the saliency map for a given threshold: pixels whose saliency is greater than the threshold are retained and the others discarded, which leaves us with the pixels that contributed most to the class score. The resulting blob of points is then used to define a bounding box constituting the object location proposal.

In the second feed-forward pass, the original image is cropped by the bounding box and the region is re-classified, obtaining a new set of predicted class labels. At the end, these are ranked and the top-5 are selected as the final solution.
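The threshold-and-crop step of the first pass can be sketched as follows (the saliency map and the threshold value are invented for illustration, not from Cao et al.'s implementation):

```python
import numpy as np

# Sketch of turning a class saliency map into a location proposal:
# threshold the map and fit a box around the surviving pixels.
saliency = np.array([[0.1, 0.2, 0.1, 0.0],
                     [0.1, 0.8, 0.9, 0.1],
                     [0.0, 0.7, 0.6, 0.1],
                     [0.0, 0.1, 0.0, 0.0]])

threshold = 0.5
mask = saliency > threshold                   # segmentation mask
rows, cols = np.nonzero(mask)                 # coordinates of retained pixels
bbox = (rows.min(), cols.min(),               # (top, left,
        rows.max(), cols.max())               #  bottom, right)
```

The resulting box would then be used to crop the image for the second, re-classification pass.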
3.3 ImageNet Data Set
ImageNet is a publicly available large visual data set of over 15 million labeled images belonging to about 22 thousand categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) started in 2010 and uses a subset of ImageNet formed by roughly 1 000 images in each of 1 000 categories1.
The ILSVRC 2012 data set [53] was previously divided into training, validation and test images. The validation and test data consist of 50 000 and 100 000 hand-labeled photographs, respectively, but only the validation labels were released. The remaining images (test data) were released without labels and are used to evaluate the algorithms.

Since this data set was part of a competition, the participants had to submit their results on the available test images, and only at the end of the competition did they learn the results and the respective winner. These 150 000 images (validation and test) were not part of the training data, which is formed by 1.2 million images covering the 1 000 categories.
The challenge consisted of three tasks and the data set [53] was already divided and publicly avail-
able for each of them:
1. Classification - For each image, a list of the top 5 object categories is presented in descending
order of confidence;
2. Classification with localization - The algorithm produces top 5 class labels and the correspond-
ing bounding box indicating the position of each of them. This task assesses the ability to locate
one instance of an object category;
3. Fine-grained classification - For each one of the 100+ dog categories, predict if the dog images
on test data belong to a particular category. The output of the system should be the real-valued
confidence that the dog is of a particular category.
For tasks 1 and 2, the images were hand labeled with the presence of one of 1 000 object categories and each image contains only one ground truth label.
3.4 Pre-trained Models
Training a network from scratch using a large amount of color images is computationally expensive and time consuming. Thereby, some pre-trained Convolutional Network (ConvNet) models are available at the Caffe [54] Model Zoo.
1source: http://image-net.org/challenges/LSVRC/2012/browse-synsets [seen in November, 2016]
In this section, an explanation is given on the different architectures of several pre-trained models
and some preliminary results available on Model Zoo are shown2.
3.4.1 CaffeNet/AlexNet
Krizhevsky's work [55] presents a deep convolutional neural network, called the AlexNet model, constituted by five convolutional and three fully-connected layers. The convolutional layers are followed by a ReLU layer, then the neurons are normalized by a Local Response Normalization (LRN) layer and finally a down-sampling is performed by a max-pooling layer. The fully-connected layers are followed by a ReLU and a Dropout layer with a dropout ratio of 0.5.
Two techniques were used to combat overfitting: first, artificially increasing the data set by applying small transformations to the original images, such as translations, horizontal reflections or changes in the intensity of the color channels during training; and second, using the dropout technique (see Section 2.6).
Caffe [54] provides a reference CaffeNet3 model, a modification of AlexNet in which the order of the Pooling and Normalization (LRN) layers is switched. Everything else remains the same, including the parameters of all layers. This change gives CaffeNet a slight computational advantage, since the max-pooling operation is performed before the normalization, which uses less memory and fewer calculations. Yet, there is no significant performance difference between the two models.
A pre-trained version of both models is available and both were tested to check for performance differences (see Table 3.1). Both models were trained without the data augmentation used in [55] to prevent overfitting, and the AlexNet model was initialized with non-zero biases of 0.1 instead of 1.4
Results released in [55] show a top-1 classification error of 40.7% and a top-5 classification error of 18.2% for the AlexNet model, while the public replication of AlexNet presented top-1/top-5 classification errors of 42.9% / 19.8%. The results of CaffeNet differed by less than 0.5% from AlexNet but, since it requires less memory, CaffeNet was the model chosen to perform the tests.
2source: http://caffe.berkeleyvision.org/model_zoo.html [seen in November, 2016]
3source: https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet [seen in December, 2016]
4source: https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet [seen in December, 2016]
3.4.2 GoogLeNet
GoogLeNet is a deep convolutional neural network with 22 weight layers proposed by Szegedy et al. [47]
for classification and detection tasks which improved the use of computational resources. It has nine
Inception modules that allow parallel pooling and convolution operations. For classification, it uses the
spatial average of the feature maps from the last convolution layer as the confidence of categories via a
global average pooling layer. The resulting vector is then used as input into the softmax layer.
The most direct way of improving the performance of deep networks is to increase their size, both in depth (more layers) and width (more units at each layer). Even with a bigger network, a constant computational budget was maintained by using additional 1×1 convolutions as a dimension reduction method [28] before the expensive 3×3 and 5×5 convolutions, and by replacing fully connected layers with sparse ones.
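The dimension-reduction role of the 1×1 convolutions can be sketched as a per-pixel matrix multiply over channels (a toy numpy illustration, not GoogLeNet's actual implementation; the feature-map and weight shapes below are illustrative):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map over channels.
    x: feature map of shape (C_in, H, W); w: weights of shape (C_out, C_in).
    Shrinking C_in this way makes the following 3x3/5x5 convolutions cheaper."""
    return np.einsum('oi,ihw->ohw', w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((192, 28, 28))   # illustrative Inception-module input
w = rng.standard_normal((64, 192))       # reduce 192 channels to 64
y = conv1x1(x, w)                        # y.shape == (64, 28, 28)
```

A subsequent 3×3 convolution now operates on 64 instead of 192 input channels, cutting its cost by a factor of three at this point of the network.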
A replication of the model in [47] was trained and the weights file is publicly available5. However, some training differences should be highlighted: the replication uses "xavier" instead of "gaussian" to initialize the weights; the learning rate decay policy is different, allowing faster training; and training was done without data augmentation. Xavier initialization sets the weights from a Gaussian distribution with zero mean and a variance equal to the inverse of the number of input neurons, ensuring faster convergence [56].
On the one hand, the original model [47] achieved a top-5 classification error of 10.07% on the validation data and a localization error of 38.02%; the top-1 classification error was not disclosed. On the other hand, the replication model obtained a top-1 error of 31.3% and a top-5 error of 11.1%; the localization error was not published. Since the replication model's weights file was the one used, the results obtained in this project were compared with those (see Table 3.1).
3.4.3 VGGNet
VGGNet is a deep convolutional network for object recognition developed and trained by Oxford's renowned Visual Geometry Group (VGG)6 [57].
This architecture was developed with the purpose of exploring the effect of ConvNet depth on accuracy. Different configurations were evaluated, ranging from a ConvNet with 11 weight layers to one with 19 weight layers.
For the localization task, the 16-weight-layer architecture was used, where the last fully connected layer predicts the bounding box location instead of the class scores.
5source: https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet [seen in December, 2016]
6source: https://github.com/BVLC/caffe/wiki/Model-Zoo#models-used-by-the-vgg-team-in-ilsvrc-2014 [seen in December, 2016]
In comparison with the state of the art at the time, an evident improvement was reached with a deeper network, with the optimal configuration at 16-19 weight layers. Since deeper networks usually mean more parameters and a greater chance of overfitting, Simonyan et al. used small 3×3 filters in all convolutional layers.
Besides this improvement, the generalization power of the model was demonstrated by achieving state-of-the-art results on other image recognition data sets such as PASCAL Visual Object Classes (2007 and 2012) [58].
The 16-weight-layer configuration achieved a top-1/top-5 classification error of 25.6% / 8.1% and a localization error of 26.9%. The 19-weight-layer configuration decreased the previous classification errors by only 0.1%, which were the best results achieved so far. In this project, the pre-trained VGGNet model with 16 weight layers was used.
Table 3.1 compiles the classification and localization errors disclosed by the current state of the art. A dash in a table field means that the corresponding result has not been published.
As explained in Section 3.4.1, the AlexNet pre-trained model is not used in our tests, since there is no significant performance difference between the AlexNet and CaffeNet pre-trained models and CaffeNet requires less memory.
Table 3.1: ConvNet performance following the state of the art.

Model                     | Weight layers | Top-1 [%] | Top-5 [%] | Localization Error [%]
CaffeNet [55]             | 8             | 42.6      | 19.6      | —
AlexNet [55]              | 8             | 42.9      | 19.8      | —
GoogLeNet [47]            | 22            | 31.3      | 11.1      | 38.02
GoogLeNet Feedback [3]    | —             | 30.5      | 10.5      | 38.80
VGGNet [59]               | 8             | 39.7      | 17.7      | 44.60
VGGNet [57] (16 layers)   | 16            | 25.6      | 8.1       | 26.90
VGGNet [57] (19 layers)   | 19            | 25.5      | 8.0       | —
Chapter 4
Hybrid Attention Model
Our model is inspired by the work of Cao et al. [3], which uses feedback Deep Convolutional Neural Networks to capture visual attention. We propose a biologically inspired hybrid attention model that is capable of efficiently locating and recognizing objects in digital images, in a multistage manner.
Briefly, our model goes as follows:
• Load an image into the network and capture the gist of the scene getting the predicted top-5 class
labels (feed-forward pass);
• For each of the top-5 class labels, compute the saliency map in a top-down manner (backward
pass) and apply a segmentation mask;
• Calculate the tightest bounding box that covers the blob of points resulting from the segmentation mask and consider it as an object location proposal;
• Re-classify the image with selective attention (feed-forward pass) and obtain a final solution.
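The four stages above can be sketched as a single loop. The callables used here (classify, saliency_map, bbox_from_saliency, foveate) are placeholders for the components detailed in the rest of this chapter, not the actual implementation:

```python
def hybrid_attention(image, classify, saliency_map, bbox_from_saliency, foveate,
                     threshold=0.75):
    """Sketch of the multistage attention pipeline: gist, top-down saliency,
    location proposal, and re-classification with foveal attention."""
    top5 = classify(image)[:5]                          # 1) gist (feed-forward)
    candidates = []
    for label, _score in top5:
        sal = saliency_map(image, label)                # 2) backward pass
        box = bbox_from_saliency(sal, threshold)        # 3) location proposal
        candidates += classify(foveate(image, box))[:5]  # 4) attend + re-classify
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:5]                               # final top-5
```

Each `classify` call returns (label, score) pairs; the 25 second-pass predictions are sorted by score and the five best are kept, as described in the list above.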
In this chapter, we introduce the saliency map concept and explain in detail, in Section 4.1, the method proposed by Cao [3] for computing the saliency map, in a top-down manner, for a given class. In the final stage of our model, the image re-classification with attention is done for two visual sensing configurations, a uniform and a non-uniform foveal vision, which are presented in Section 4.2. Section 4.3 presents a study on image information content for both visual sensing configurations, in order to establish a relationship between them.
4.1 Class Saliency Visualization
The need to locate objects quickly and efficiently gave rise to the method proposed by Itti [14], based on visual salience, which proposes the most likely candidates and eliminates those that are less likely.
The visual features that contribute to the attentional selection of a stimulus (color, motion, orientation) are combined in a saliency map that holds normalized information from the individual feature maps. To obtain a saliency map, the input visual information is analyzed by visual neurons sensitive to several visual features of the stimuli. This analysis is done in parallel across the whole visual field at multiple spatial and temporal scales, originating a series of feature maps where each map represents the amount of a certain visual feature at any place in the visual field. In each map, according to Koch and Ullman [15], local saliency is determined by how different a location is from nearby locations in terms of color, orientation, motion and depth. The most salient location is a good candidate for attentional selection. Finally, all highlighted locations from all feature maps are combined in a single saliency map that represents a pure relevance signal, independent of the visual features.
As opposed to Itti’s [14] method that computes the saliency map in a bottom-up manner, Cao [3]
proposed a way to calculate the saliency map, in a top-down manner, given an image I and a class
c. The class score Sc(I) is a non-linear function of the image, hence an approximation of the neural
network class score with the first-order Taylor expansion [3] [59] in the neighborhood of I can be done
as follows
$$S_c(I) \approx G_c^{\top} I + b \qquad (4.1)$$

where $b$ is the bias of the model and $G_c$ is the gradient of $S_c$ with respect to $I$:

$$G_c = \frac{\partial S_c}{\partial I}. \qquad (4.2)$$
Accordingly, the saliency map for a class c is computed by calculating the score derivative of that specific class via a back-propagation pass. This is done as follows: the network output is compared with a desired output, originating an error value. Since we want the saliency map for a specific class c, the desired output is a vector of zeros where the position corresponding to class c is set to one. In this way, each neuron of the output layer is assigned an error value that is propagated backward until it reaches the input layer, where each input element ends up with an error value that roughly represents its contribution to the output. These error values form the gradient Gc. Note that, unlike in training, this gradient is taken with respect to the input image and no weight update is performed.
In order to get the saliency value for each pixel (u, v), and since the images used are multi-channel (RGB, three color channels), we rearrange the elements of the vector Gc by taking its maximum magnitude over all color channels. This method for saliency map computation is extremely simple and fast, since only a back-propagation pass is necessary. Simonyan et al. [59] show that the magnitude of the gradient Gc expresses which pixels contribute most to the class score. Consequently, it is expected that these pixels give us the localization of the object pertaining to that class (see Section 5.2) in the image (see Figure 5.2).
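The channel-collapse step can be sketched as follows; the gradient array here is a random stand-in, since in the real pipeline it comes from the network's backward pass:

```python
import numpy as np

def class_saliency(grad_c):
    """Collapse the class-score gradient (H, W, 3) into a single saliency
    map (H, W) by taking the maximum magnitude over the color channels."""
    return np.max(np.abs(grad_c), axis=-1)

rng = np.random.default_rng(0)
G_c = rng.standard_normal((8, 8, 3))   # stand-in for dS_c/dI from backprop
saliency = class_saliency(G_c)         # shape (8, 8), non-negative
```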
4.2 Uniform vs Foveal Vision
In this work we will study and evaluate two types of organization of receptor fields: a conventional
uniform distribution, typical in artificial vision systems (e.g. in standard image sensors), against a log-
polar distribution, which approximates the human eye. The latter is composed by a region of high acuity
– the fovea – and the periphery, where central and low-resolution peripheral vision occurs, respectively.
4.2.1 Uniform Visual System
As many theories of visual processing propose, a natural scene is processed in a fraction of a second [60], where a first rough description (the gist) of the scene is computed. Typically, imaging sensors use uniform resolution.
In the first feed-forward pass, we mimic the human behaviour of capturing the gist of the scene, quickly and with limited resources. For this matter, there is no need to rely on high-resolution images, since this first glimpse takes only a split second and humans are capable of extracting rough information from it [60]. In this way, we compress the images to save resources, since in most cases they are scarce.
For the initial glimpse, we want to simulate the use of a low-resolution sensor, that is, one with a lower level of detail, which consequently requires fewer resources and entails a reduction of information. However, image details correspond to edges that typically are only perceptible with high-resolution imaging sensors. For this purpose, the high-frequency details are removed with low-pass filters.
When a low-pass filter is applied to a signal, its high-frequency components are completely removed. The simplest low-pass filter is the ideal low-pass filter (see Figure 4.1), which eliminates all frequencies higher than a given cut-off frequency (fc) and keeps the lower frequencies intact. Following this approach, we would lose the high-frequency features like the edges. However, there is a way to reduce the noise while better preserving the edges and other (high-frequency) details. For this purpose, we use a Gaussian filter
that does not abruptly remove high frequencies but softens them (see Figure 4.2). The Gaussian filter alters the input image by convolution with an isotropic 2D Gaussian function defined as

$$g(u, v, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{u^2+v^2}{2\sigma^2}} \qquad (4.3)$$
where u and v represent the image coordinates and σ the standard deviation of the Gaussian distri-
bution. The 2D Gaussian function is separable into u and v components thus we can perform first a
convolution with a 1D Gaussian in the u direction, and then convolve with another 1D Gaussian in the v
direction. In this study, we define σ0 as the level of uniform blur (see Figure 4.4).
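Exploiting this separability, the uniform blur can be sketched with two 1D convolutions (a minimal numpy version; a production system would use an optimized library routine):

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian sampled out to 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def uniform_blur(img, sigma):
    """Separable 2D Gaussian blur: convolve every row, then every column."""
    k = gaussian_kernel_1d(sigma)
    out = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, out, k, mode='same')
```

Two 1D passes cost O(r) per pixel instead of O(r²) for a direct 2D convolution with kernel radius r, which is why the separability of (4.3) matters in practice.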
1source: https://i.stack.imgur.com/nLwKi.png [seen in April, 2017]
Figure 4.1: Shape of the 1D ideal low-pass filter in the frequency domain.1
Figure 4.2: 2D representation of a Gaussian filter with σ = 60.
4.2.2 Foveal Visual System
The central region of the retina of the human eye named fovea is a photoreceptor layer predominantly
constituted by cones which provide localized high-resolution color vision. The concentration of these
photoreceptor cells reduce drastically towards the periphery (see Figure 2.1) causing a loss of defini-
tion. This space-variant resolution decay is a natural mechanism to decrease the amount of information
that is transmitted to the brain (see Figure 4.4). Many artificial foveation methods have been proposed in
the literature that attempt to mimic similar behavior: geometric method [61], filtering-based method [62]
and multi-resolution methods [63].
In this work, we rely on the method proposed in [64] for image compression (e.g. in encoding/decoding applications), which is extremely fast and easy to implement, with demonstrated applicability in real-time image processing and pattern recognition tasks, as in [65]. This approach comprises four steps, as follows. The first step consists of building a Gaussian pyramid. The first pyramid level (level 1) contains the original image g1, which is low-pass filtered and down-sampled by a factor of two, obtaining the image g2 at level 2. The image g3 can be obtained from g2 by applying the same operations, and so forth. The image gk+1 has a quarter of the resolution of image gk, where k ∈ {1, ..., K} denotes the index of a pyramid level and K is the total number of pyramid levels. This process is repeated as many times as the desired number of resolution levels for the pyramid.
In the next step, the Laplacian pyramid is built by computing the difference between the original image and the low-pass filtered image. The Laplacian pyramid comprises a set of error images, where each level represents the difference between two levels of the previous output (see Figure 4.3).
Next, Gaussian weighting kernels are applied to each level of the Laplacian pyramid to implement the foveation mechanism. The Gaussian kernels are defined as in (4.3); they are generated just once for each image and then displaced to a given point defining the focus of attention.
The next step consists of locating the foveation point which corresponds to the image location that
will be displayed at the highest resolution. In our case, the foveation point is given by the center of the
object location proposal obtained through the analysis of the segmentation mask applied to the saliency
map.
At last, the foveated image is obtained by the reverse process used when building the Laplacian pyramid. A more detailed explanation of the foveation system can be found in [64].
A summary of the human visual foveation model with four levels is presented in Figure 4.3. Starting with the original image, the levels g1 to g4 of the reduced pyramid are computed. Then, the difference between successive outputs from the previous step is obtained, resulting in the images L1 to L4 of the Laplacian pyramid. These images are multiplied by the kernels and an expand-and-sum procedure is performed. An example of a foveated image obtained by this method is presented in Figure 4.4, where f0 simulates the size of the fovea, the central region of the retina of the human eye.
Figure 4.3: A summary of the steps in the human visual foveation model with four levels. The image g1 corresponds to the original image and f1 to the foveated image. The thick up arrows represent sub-sampling and the thick down arrows represent up-sampling.
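A compact sketch of these four steps follows. For simplicity it uses a box filter in place of the Gaussian low-pass, nearest-neighbour up-sampling, and a grayscale image whose side is divisible by 2^(levels-1), so it illustrates the structure of the method in [64] rather than reproducing it exactly:

```python
import numpy as np

def box_down(img):
    """2x down-sampling after a 2x2 box average (simplified low-pass)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def nn_up(img, shape):
    """Nearest-neighbour expansion back to `shape`."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def foveate(img, center, f0, levels=4):
    img = img.astype(float)
    # 1) Gaussian pyramid (box filter stands in for the Gaussian here)
    g = [img]
    for _ in range(levels - 1):
        g.append(box_down(g[-1]))
    # 2) Laplacian pyramid: difference between successive levels
    lap = [g[k] - nn_up(g[k + 1], g[k].shape) for k in range(levels - 1)]
    lap.append(g[-1])
    # 3+4) weight each Laplacian level by a Gaussian centred on the foveation
    # point, then expand-and-sum from coarse to fine. A constant width f0 at
    # each level's own resolution corresponds to f_k = 2^k f0 in image
    # coordinates, matching Eq. (4.17).
    out = lap[-1]
    for k in range(levels - 2, -1, -1):
        out = nn_up(out, lap[k].shape)
        H, W = lap[k].shape
        v, u = np.mgrid[0:H, 0:W].astype(float)
        cy, cx = center[0] / 2 ** k, center[1] / 2 ** k
        weight = np.exp(-((u - cx) ** 2 + (v - cy) ** 2) / (2 * f0 ** 2))
        out = out + weight * lap[k]
    return out
```

Near the foveation point the weights are close to 1 and the pyramid reconstruction is exact, so full detail is kept; far from it the high-frequency Laplacian levels are suppressed, producing the peripheral blur visible in Figure 4.4.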
4.3 Information Attenuation
The different visual systems presented in Section 4.2 are based on different filtering strategies, which result in a reduction of information. To compare these systems, we have to understand how each system reduces the image information and what the relationship between them is.
a: σ0 = 0 b: σ0 = 5 c: σ0 = 10
d: f0 = 30 e: f0 = 60 f: f0 = 90
Figure 4.4: Different images acquired with two different visual sensing configurations are shown: a uniform and a log-polar distribution. On top, the image of a bee eater is evenly blurred for different levels of blur (σ0). At the bottom, the same image is foveated from the center of the object location proposals for different levels of blur. The parameter f0 defines the size of the region with high acuity.
4.3.1 Uniform Vision
The uniform visual system is computed via low-pass Gaussian filters. Let us define the original image as $i(u, v)$, with discrete-time Fourier transform $I(e^{j\omega_u}, e^{j\omega_v})$. By the convolution theorem, filtering in the spatial domain corresponds to a product in the frequency domain, so the transform of the filtered image is

$$O(e^{j\omega_u}, e^{j\omega_v}) = I(e^{j\omega_u}, e^{j\omega_v})\, G(e^{j\omega_u}, e^{j\omega_v}). \qquad (4.4)$$
Parseval's theorem expresses the unitarity of the Fourier transform, establishing that the sum of the squares of a signal equals the integral of the square of its transform. Therefore, the signal information of the original image $i$ is given by

$$E_i = \sum_{u=-\infty}^{+\infty} \sum_{v=-\infty}^{+\infty} |i(u, v)|^2 = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} |I(e^{j\omega_u}, e^{j\omega_v})|^2 \, d\omega_u\, d\omega_v, \qquad (4.5)$$
and the information in the filtered image $o$ is given by

$$E_o = \sum_{u=-\infty}^{+\infty} \sum_{v=-\infty}^{+\infty} |o(u, v)|^2 = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} |I(e^{j\omega_u}, e^{j\omega_v})\, G(e^{j\omega_u}, e^{j\omega_v})|^2 \, d\omega_u\, d\omega_v. \qquad (4.6)$$
Assuming that $I(e^{j\omega_u}, e^{j\omega_v})$ has energy/information equally distributed across all frequencies, with magnitude $M$:

$$M = I(e^{j\omega_u}, e^{j\omega_v}), \quad \forall\, \omega_u, \omega_v \in [-\pi, \pi], \qquad (4.7)$$

the information $E_o$ can be expressed as

$$E_o = \frac{M^2}{4\pi^2} \int_{-\pi}^{\pi} G(\omega_u)^2\, d\omega_u \int_{-\pi}^{\pi} G(\omega_v)^2\, d\omega_v. \qquad (4.8)$$
Furthermore, since we use $\sigma \geq 1$, the discrete-time Fourier transform is well approximated by the continuous-time Fourier transform. Thus, the Gaussian filter has low energy content for $|\omega_u|, |\omega_v| > \pi$, so

$$\int_{-\pi}^{\pi} e^{-\omega_u^2 \sigma^2}\, d\omega_u \approx \int_{-\infty}^{\infty} e^{-\omega_u^2 \sigma^2}\, d\omega_u. \qquad (4.9)$$

Knowing that

$$\int_{-\infty}^{\infty} e^{-\frac{1}{2}\frac{t^2}{\sigma^2}}\, dt = \sqrt{2\pi}\,\sigma, \qquad (4.10)$$

we obtain

$$\int_{-\infty}^{\infty} e^{-\omega_u^2 \sigma^2}\, d\omega_u = \frac{\sqrt{\pi}}{\sigma}, \qquad (4.11)$$

where the same applies to $\omega_v$.
Thereby, we can now simplify the expression of $E_o$ in (4.8) as

$$E_o = \frac{M^2}{4\pi^2} \cdot \frac{\pi}{\sigma^2} = \frac{M^2}{4\pi\sigma^2}. \qquad (4.12)$$

Finally, the information gain $P$ is given by the ratio of the information of the filtered image to the information of the original image:

$$P(\sigma) = \frac{E_o}{E_i} = \frac{1}{4\pi\sigma^2}. \qquad (4.13)$$
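This closed form can be checked numerically: blurring 2D white noise (whose spectrum is approximately flat, matching the assumption in (4.7)) with a normalized Gaussian kernel should reduce its energy by roughly P(σ). A small sketch, with illustrative image and kernel sizes:

```python
import numpy as np

def info_gain(sigma):
    """Predicted energy ratio P(sigma) = 1 / (4*pi*sigma^2), Eq. (4.13)."""
    return 1.0 / (4.0 * np.pi * sigma ** 2)

rng = np.random.default_rng(1)
sigma = 2.0
noise = rng.standard_normal((512, 512))          # flat-spectrum test image

x = np.arange(-25, 26)
k = np.exp(-0.5 * (x / sigma) ** 2)
k /= k.sum()                                     # normalized Gaussian kernel
blur = np.apply_along_axis(np.convolve, 1, noise, k, mode='same')
blur = np.apply_along_axis(np.convolve, 0, blur, k, mode='same')

empirical = (blur ** 2).sum() / (noise ** 2).sum()
# empirical should be close to info_gain(2.0), up to border effects
```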
4.3.2 Non-Uniform Foveal Vision
For the non-uniform foveal vision, we implement the method explained in Section 4.2.2, where the blur is not evenly distributed in the spatial domain.
In the first step of our foveation system, we apply low-pass Gaussian filters, as in the uniform vision case (see Section 4.3.1), so (4.13) applies, and perform down-sampling at each level of the reduced pyramid.
The normalized information due to filtering at each level $k$ of the pyramid is given by

$$P^k(\sigma_k) = \frac{1}{4\pi\sigma_k^2}, \qquad (4.14)$$

where the parameter $\sigma_k$ is related to $\sigma_0$ by

$$\sigma_k = 2^k \sigma_0. \qquad (4.15)$$
The information due to spatial weighting at each pyramid level $k$ is given by

$$R^k(f_k) = \left( \frac{1}{N} \int_{-N/2}^{N/2} e^{-\frac{1}{2}\frac{u^2}{f_k^2}}\, du \right)^2, \qquad (4.16)$$

where $N$ is the size of the image; since the images are 2D, the squared term accounts for both the $u$ and $v$ dimensions. The foveation parameter $f_k$ at each level is related to the fovea dimension $f_0$ by

$$f_k = 2^k f_0. \qquad (4.17)$$
Thus, to compute the total information compression of the pyramid for the non-uniform foveal vision, we need to take into account the normalized information due to filtering and due to spatial weighting at each level of the pyramid. The total information reduction of the pyramid is given by

$$T = \sum_{k=0}^{K} R^k P^k. \qquad (4.18)$$
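Equations (4.14)-(4.18) can be evaluated in closed form, since the Gaussian integral in (4.16) is an error function. A sketch (the image size N and the parameter values used are illustrative):

```python
import math

def filtering_info(sigma_k):
    """P^k from Eq. (4.14)."""
    return 1.0 / (4.0 * math.pi * sigma_k ** 2)

def spatial_weight_info(f_k, N):
    """R^k from Eq. (4.16): the Gaussian integral over [-N/2, N/2]
    evaluates to f_k * sqrt(2*pi) * erf(N / (2*sqrt(2)*f_k))."""
    integral = f_k * math.sqrt(2.0 * math.pi) * math.erf(N / (2.0 * math.sqrt(2.0) * f_k))
    return (integral / N) ** 2

def total_information(sigma0, f0, N, K):
    """T from Eq. (4.18), with sigma_k = 2^k sigma0 (4.15)
    and f_k = 2^k f0 (4.17)."""
    return sum(spatial_weight_info(2 ** k * f0, N) * filtering_info(2 ** k * sigma0)
               for k in range(K + 1))
```

As expected, a very wide fovea drives every R^k toward 1, recovering the purely filtering-based attenuation of the uniform case.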
Chapter 5
Implementation
In this chapter, a detailed explanation of our model is given (see Figure 5.1). In the first feed-forward pass, a rough description (the gist) of the scene is computed (Section 5.1) and analyzed via backward propagation to obtain proposals regarding the location of the object in the scene (Section 5.2). For the second feed-forward pass, two approaches are compared, the human visual foveation model and the cartesian one (Section 5.3). For the former, an image re-classification is done by directing the attention to the center of the proposed location. For the latter, the attention is directed to the cropped patch of the image, and the remaining part of the image is discarded.
Figure 5.1: Schematization of the proposed multistage attentional pipeline. It begins by loading an input image into the neural network and getting the top-5 predicted class labels. For each class label, a backward pass is done, obtaining the saliency map. A segmentation mask is then applied based on a threshold, ending up with a proposed region for the location of the object. Then, the foveation system is applied from the center of each proposed bounding box for a given f0 (in this case, f0 = 60). Each foveated image is used as input to the neural network and a forward pass is done, resulting in a new top-5 of predicted class labels. The red rectangles represent the bounding boxes that contain all pixels above the specified threshold, in this case 0.75. The red circles represent the focused area simulating the fovea and the ground truth label of the input image is go-kart.
5.1 Image-Specific Class Saliency Extraction
After making the input data selection (see Section 3.3), the pre-trained models CaffeNet, GoogLeNet and VGGNet were loaded into the corresponding networks for the test phase. Each network receives raw input data which needs to be pre-processed: the mean over all images used in the training set is subtracted in each color channel and the channels are swapped from RGB to BGR. In our tests, we use the first 100 images from the ILSVRC 2012 data set.
The CaffeNet and GoogLeNet pre-trained models require a constant input dimension of 227×227 RGB images, while the VGGNet pre-trained model requires a constant input dimension of 224×224 RGB images. Therefore the ImageNet images, which come in several resolutions, were down-sampled to the fixed resolution required by the corresponding system. The ILSVRC 2012 validation set was used to perform the tests and evaluate our model.
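This pre-processing can be sketched as follows; the mean values and the channel-first output layout are illustrative placeholders, not the exact values used by each model:

```python
import numpy as np

def preprocess(img_rgb, mean_bgr):
    """Caffe-style pre-processing sketch: swap RGB -> BGR, subtract the
    per-channel training-set mean, and move channels first (HWC -> CHW).
    Resizing to the model's fixed input size is assumed done upstream."""
    img = img_rgb[..., ::-1].astype(np.float32)        # RGB -> BGR
    img -= np.asarray(mean_bgr, dtype=np.float32)      # subtract training mean
    return np.transpose(img, (2, 0, 1))                # HWC -> CHW

x = np.zeros((227, 227, 3), dtype=np.uint8)
x[..., 0] = 255                                        # a pure-red RGB image
blob = preprocess(x, mean_bgr=(104.0, 117.0, 123.0))   # illustrative means
```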
After the pre-processing, the network was loaded with images from the ILSVRC 2012 data set. We started by getting the network's output for the input image by performing a feed-forward pass, filling the layers with data. Accessing the network's output layer, of type softmax, the actual probability scores for each class label (1 000 in total) were collected.
Restricting our attention to the five highest predicted class labels, which are the most likely to be present in a given image, the saliency map for each one of those predicted classes was computed (see Figure 5.2). The method used to compute the saliency map in a top-down manner was the one described in Section 4.1, where only an image I and a class c are required. As mentioned and previously explained, a back-propagation pass was done to calculate the score derivative for the specific class c. The magnitude of this gradient tells us which pixels are more relevant for the class score [59].
5.2 Weakly Supervised Object Localization
Considering Simonyan's findings [59] mentioned in Section 5.1, the class saliency maps hold the object localization of the corresponding class in a given image. Surprisingly, and despite being trained on image labels only, the saliency maps can be used in localization tasks.
Our object localization method based on saliency maps goes as follows. Given an image I and the corresponding class saliency map Mc, a segmentation mask is computed by selecting the pixels with saliency higher than a certain threshold and setting the rest of the pixels to zero. Considering the blob of points resulting from the segmentation mask, for a given threshold, we are able to define a bounding box covering all the non-zero saliency pixels, obtaining a guess of the localization of the object (see Figure 5.2). To set the bounding box, we use the boundingRect function from the OpenCV library, which calculates the minimal up-right bounding rectangle for the specified point set. In our case, we pass a Mat array with all non-zero saliency pixels as input to the boundingRect function.
Figure 5.2: Representation of the saliency map and the corresponding bounding box for each of the top-5 predicted class labels of a bee eater image from the ILSVRC 2012 data set. The rectangles represent the bounding boxes that cover all non-zero saliency pixels resulting from a segmentation mask with a threshold of 0.75. The rectangles shown are the same: the black ones delimit the pixels with non-zero saliency and the red ones show the input image with the location proposal.
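The thresholding and bounding-box step can be sketched without OpenCV; the function below computes the same minimal up-right rectangle, in boundingRect's (x, y, w, h) convention, directly from the coordinates of the retained pixels:

```python
import numpy as np

def saliency_bbox(saliency, threshold):
    """Segmentation mask + tightest bounding box over the retained pixels.
    Returns (x, y, w, h), or None when no pixel passes the threshold."""
    ys, xs = np.nonzero(saliency >= threshold)
    if ys.size == 0:
        return None
    x, y = int(xs.min()), int(ys.min())
    return (x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1)
```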
5.3 Image Re-Classification with Attention
The objective of performing a second pass through the neural network is to re-classify the class labels obtained in the first pass, where the gist of the scene was captured.
Given the initial guess of the object localization through the bounding boxes, the image labels are re-classified. We tested two different ways to re-classify the image labels: first, inspired by Cao's work [3], we use cropped patches around the bounding boxes, resized to the input dimension of the corresponding pre-trained model (227×227 for CaffeNet and GoogLeNet and 224×224 for VGGNet); and second, we foveate the images from the center of the bounding boxes with a fixed fovea size.
Following the first approach for image re-classification, the image patch, which supposedly corresponds to the smallest region that contains the object, was cropped from the original input image to ensure a good resolution and resized to the input dimension of the pre-trained model. Those new regions are then loaded into the neural network and a new feed-forward pass is done, resulting in a re-classification of the regions. This re-classification strategy is named by Cao [3] the "Look and Think Twice" method.
For the second approach, there is no need to crop or resize the image. We use the bounding boxes obtained from the segmentation mask and apply the foveation method described in Section 4.2.2. Considering that the bounding box provided by our framework contains the object, we direct our attention to the center of the bounding box and foveate the image for a given parameter f0, which defines the region specialized for high-resolution vision. The foveated image is then used as input to the network for the second feed-forward pass, giving rise to an image re-classification.
The image re-classification method (for both approaches) is applied to each of the five bounding
boxes proposed from the first feed-forward pass where the highest five predicted class labels of each
bounding box are preserved (see Figure 5.1). Given the total 25 labels and the corresponding scores
(confidence given by the network), we sort by descending order and pick the top-5 labels as the final
solution. The sorted top-5 labels are then used to compute the classification error, corresponding to the
second time we look to the image.
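The final ranking step can be sketched as follows (a direct sort of the 25 (label, score) pairs, as described, without merging duplicate labels):

```python
def final_top5(per_box_predictions):
    """Flatten the five top-5 lists from the second pass (25 (label, score)
    pairs) and keep the five highest-scoring entries as the final solution."""
    merged = [p for box in per_box_predictions for p in box]
    merged.sort(key=lambda p: p[1], reverse=True)
    return merged[:5]
```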
Our framework was evaluated for three different topologies:
• Uniform Cartesian vision: in the first feed-forward pass, the input image has uniform resolution; then, in the second feed-forward pass, a cropped patch of the image is used as input;
• Non-uniform foveal vision: the input image is foveated from the center for several f0 and, in the second feed-forward pass, the high-resolution image is foveated from the center of the object location proposal;
• Combined vision: in the first feed-forward pass, the input image has uniform resolution with σ0 = 5 and, for the second feed-forward pass, a high-resolution input image is foveated for different f0 from the center of the bounding box and used as input.
Figure 5.1 summarizes our framework. We start by loading an image into the network and performing a feed-forward pass, producing a list of the actual probability scores for each class label considered in the ILSVRC 2012 data set. In this case, a go-kart input image was used and, as we can verify, the ground truth label go-kart is not present in the classification top-5.
For each of the top-5 predicted class labels, a saliency map and a segmentation mask were computed, resulting in a total of five proposed bounding boxes. Next, the foveation system described in Section 4.2.2 was used, foveating the input image from the center of each proposed bounding box for a given f0. Each of the five foveated images is then loaded into the network and a new feed-forward pass is performed, giving rise to five predicted class labels for each input image, for a total of 25 predicted class labels. In order to get a final classification from the network in this second pass, the predicted class labels are sorted in descending order and the five classes with the highest scores are taken as the final solution. For the example of Figure 5.1, we end up with a final solution that matches the ground truth label of the input image, that is, go-kart, with a confidence of 27%.
Figure 5.3 presents the output of the first convolutional layer for several input images. These results show
the filters learned by the network. In the first feed-forward pass, the input images have high resolution;
in the second feed-forward pass, a cropped patch of the original image is used as input.
Figure 5.3: Representation of the first convolutional layer output for 4 different input images on the VGGNet pre-trained model: for each image, the first row shows 5 output filters and the bounding boxes; the second row shows the same filters applied to the new input images, which were cropped by the bounding box. The ground truth labels are in orange and the bounding boxes in red.
Chapter 6
Results
In this chapter, the results of the tests performed in this project are presented. We begin by establishing
a numerical relationship between uniform and non-uniform visual systems in Section 6.1, in order to be
able to make a fair comparison between the two. Next, the classification and localization performance
obtained for the first and second feed-forward passes is evaluated in Section 6.2 and Section 6.3,
respectively. Finally, in Section 6.4 the performance of the first pass is directly compared with the
performance of the second pass for the different visual topologies. Table 6.1 shows the different
topologies considered in this work.
Topology    First Pass        Second Pass
Uniform     Uniform blur      Cropped patch
Foveal      Foveate center    Foveate bounding box
Combined    Uniform blur      Foveate bounding box
Table 6.1: Summary of the evaluated topologies.
6.1 Uniform vs Non-Uniform Foveal Vision
Through the study of information gain carried out in Section 4.3, we can represent the relationship between
σ0 and f0, i.e. between uniform and non-uniform vision, respectively (see Figure 6.1). With this analysis, it is possible
to define the intersection point, that is, the values of σ0 and f0 for which the information gain is the same
for both types of sensors. The tests were done for a pyramid with 5 levels.
Figure 6.1 was computed following the theory presented in Section 4.3, where expression (4.13) gives
the evolution of the information gain for uniform vision and expression (4.18) for foveal vision.
It is possible to verify that the information gain for uniform vision is linear, in logarithmic scale, with
respect to σ0. As the blur level σ0 increases, more information is compressed, which leads to a lower gain.
Figure 6.1: Information gain as a function of σ0 for uniform vision and of f0 for non-uniform foveal vision.
For non-uniform foveal vision, the information gain tends to increase as f0 grows, but not linearly.
This behavior makes sense since, as f0 increases, the size of the high-resolution region of the image
also increases. It is important to notice that for f0 = 100 the information gain is zero, that is, for f0
greater than 100, the processed image has the same information as the original one. The intersection
point between the two vision types is obtained at a gain of approximately −24 dB when
σ0 = f0 ≈ 5. (6.1)
6.2 First Feed-Forward Pass
The first stage of our hybrid model consists of loading an image into the neural network and performing a
feed-forward pass in order to get the predicted class labels, of which the top-5 are preserved.
Then, in a top-down manner, for each one of the top-5 class labels, a backward pass is done, resulting
in a saliency map for the respective class label. The saliency map provides meaningful information
for a given class since it results from a feedback visualization with respect to that particular class,
showing which pixels contribute most to the class score. For this reason, a feasible localization method
was derived from the saliency map. A segmentation mask was applied to the saliency map by selecting
the pixels whose saliency value was higher than a certain threshold. The remaining pixels were discarded
by setting them to zero. Finally, the tightest bounding box covering the blob of non-zero saliency values
is computed, resulting in an object location proposal.
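The mask-and-tightest-box step can be written compactly with NumPy; a sketch, assuming the saliency map is first normalized to [0, 1] so that the threshold acts as a fraction of the maximum saliency:

```python
import numpy as np

def saliency_to_bbox(saliency, threshold=0.7):
    """Threshold a saliency map and return the tightest bounding box.

    The map is normalized to [0, 1]; pixels at or below `threshold` are
    discarded, and the box (row_min, col_min, row_max, col_max) covering
    the surviving pixels is returned (None if nothing survives).
    """
    sal = saliency.astype(float)
    rng = sal.max() - sal.min()
    sal = (sal - sal.min()) / (rng + 1e-12)   # normalize to [0, 1]
    mask = sal > threshold
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return rows.min(), cols.min(), rows.max(), cols.max()
```

The returned corners define the object location proposal used in the second pass.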
To evaluate our model, we compute two types of measurements: the classification error and the
localization error. The classification error is calculated by comparing the ground truth class labels provided
by ILSVRC with the preserved predicted class labels. Two error rates are commonly reported:
top-1 and top-5. The former verifies whether the predicted class label with the highest score
matches the ground truth label provided for the same image; a mismatch counts as an error. For the
latter, we verify whether the ground truth label is in the set of the five highest-scoring predicted class labels.
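The two error rates can be computed directly from the sorted predictions; a minimal sketch (the data layout, dictionaries keyed by image id, is our own assumption):

```python
def topk_errors(predictions, ground_truth):
    """Compute top-1 and top-5 error rates as fractions in [0, 1].

    `predictions` maps each image id to its class labels sorted by
    descending score; `ground_truth` maps each image id to the true label.
    """
    top1 = top5 = 0
    for img_id, labels in predictions.items():
        truth = ground_truth[img_id]
        top1 += labels[0] != truth        # best guess must match exactly
        top5 += truth not in labels[:5]   # truth must be among five best
    n = len(predictions)
    return top1 / n, top5 / n
```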
The localization is considered correct if at least one of the five predicted bounding boxes for an im-
age overlaps more than 50% with the ground truth bounding box1; otherwise the bounding box is considered a
false positive [53]. The evaluation metric is the intersection over union (IoU) between the proposed
and the ground truth bounding boxes (see Figure 6.2); this criterion was established in the ILSVRC
2012 challenge.
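The intersection-over-union criterion can be implemented in a few lines; a sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def localized(proposals, gt_box, min_iou=0.5):
    """True if at least one proposed box overlaps the ground truth box
    by more than `min_iou` -- the ILSVRC 2012 criterion above."""
    return any(iou(p, gt_box) > min_iou for p in proposals)
```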
Figure 6.2: We select a spoonbill image from the ILSVRC 2012 data set to demonstrate our weakly supervised object localization method. The red rectangles represent the computed bounding boxes for the top-3 predicted class labels and the green one the ground truth bounding box of the spoonbill image.
The classification and localization errors were calculated for the three topologies considered
in this project: first, a non-uniform foveal vision where the images are foveated at the center for
different f0; second, a uniform vision characterized by evenly blurred images for various σ0; and
finally, a combined vision that, in the first pass, uses uniformly blurred images with a blur level of
σ0 = 5, which corresponds approximately to the intersection point obtained from Figure 6.1.
6.2.1 Classification Performance
For the first pass, with both uniform and foveal sensors, we obtained the results presented in Fig-
ure 6.3. With regard to classification, a global conclusion can be drawn: the CaffeNet pre-trained
model, which has the shallowest architecture, had the worst performance, obtaining the highest clas-
sification errors in all topologies. One possible justification is that the GoogLeNet and VGG
models use smaller convolutional filters and deeper networks, which can sharpen the distinction between
similar and nearby objects.

1 Source: http://image-net.org/challenges/LSVRC/2012/index#task [accessed December 2016]
For non-uniform foveal vision, a common tendency is visible in Figure 6.3a: in all three pre-trained
models, there is an f0 value beyond which the classification error saturates, approximately f0 = 70.
This result is corroborated by the evolution of the gain depicted in Figure 6.1, where from f0 = 70
onward the gain is approximately −2 dB. This means that, beyond this fovea size, the amount of infor-
mation that is added is not relevant for the correct classification of the object.
As expected for uniform (Cartesian) vision, as σ0 increases and the blur level applied to the image
rises, the amount of information present in the image decreases, resulting in an increase in the classifi-
cation error. From Figure 6.3c it can be seen that this increase is approximately linear.
Through the relation obtained in Section 6.1, we can compare the two types of vision, the uniform
and the non-uniform foveal one. For σ0 = 5, uniform vision presents a lower classification error,
on the order of 50%. In turn, non-uniform foveal vision with f0 = 5 shows an extremely high error. We
hypothesize that the foveated area for f0 = 5 corresponds to a very small region of high acuity.
The images that make up the ILSVRC data set contain objects that occupy most of the image
area; that is, although the image has a high-resolution region, it may be too small to give an idea of
the object in the image, which leads to poor performance in the classification task.
6.2.2 Localization Performance
The threshold parameter defines which pixels are selected to create the bounding boxes that
represent the object location proposals. On one hand, if we set a low threshold, we select all the
pixels in the saliency map whose intensity exceeds it, i.e., we base our localization on a large number
of pixels at the risk of including many outliers. On the other hand, the higher the threshold, the more
restrictive the selection of pixels used for localization. By visualizing the evolution of the localization
error as a function of the threshold, it is possible to verify that there is a trade-off between the chosen
threshold and the localization error obtained.
A result consistent across all topologies is the range of threshold values that yields the smallest
localization error. For thresholds smaller than 0.4, the localization error remains stable, with the VGG
model presenting a smaller error than the other models. Beyond this point, the error curve takes the
form of a valley, reaching the lowest localization error for thresholds of 0.65 and 0.7, depending on
the topology used.
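The valley-shaped trade-off can be reproduced by sweeping the threshold and measuring the localization error at each value; a self-contained sketch (pixel-index boxes and a single saliency map per image, both simplifying assumptions of ours):

```python
import numpy as np

def localization_error(saliency_maps, gt_boxes, threshold):
    """Fraction of images whose thresholded saliency box has IoU <= 0.5
    with the ground truth box (r1, c1, r2, c2), pixel-index corners."""
    def box_from(sal):
        sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
        rows, cols = np.nonzero(sal > threshold)
        if rows.size == 0:
            return None
        return rows.min(), cols.min(), rows.max(), cols.max()

    def iou(a, b):
        r1, c1 = max(a[0], b[0]), max(a[1], b[1])
        r2, c2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, r2 - r1 + 1) * max(0, c2 - c1 + 1)
        area = lambda x: (x[2] - x[0] + 1) * (x[3] - x[1] + 1)
        return inter / float(area(a) + area(b) - inter)

    errors = 0
    for sal, gt in zip(saliency_maps, gt_boxes):
        box = box_from(sal)
        errors += box is None or iou(box, gt) <= 0.5
    return errors / len(gt_boxes)

# Sweeping thresholds traces the valley-shaped error curve, e.g.:
# errs = [localization_error(maps, gts, t) for t in np.arange(0.1, 0.9, 0.05)]
```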
GoogLeNet, the deepest model considered in this work, presents better localization performance
than the other models in the range of thresholds located in the valley. Although the VGG model is
deeper than CaffeNet, the latter performs better in localization. Both models feature two fully-connected
layers of dimension 4096, which can destroy the spatial distinctiveness of image features. GoogLeNet
does not have these fully-connected layers; instead, it adopts global average pooling for classification,
which yields better results when it comes to localizing the object.
6.3 Top-Down Class Refinement
As previously explained, the objective of performing a second pass through the neural network is to
re-classify the class labels obtained in the first pass, where the gist of the scene was captured.
For the second pass, the topologies underwent minor changes. First, in the non-uniform foveal
vision, the foveation point ceases to be the center of the image and becomes the center of the bound-
ing box proposals. Second, in the uniform (Cartesian) vision, instead of using the whole image for the
re-classification, a cropped patch of the input image defined by the proposed bounding boxes is used,
resulting in a loss of context but improved acuity. Finally, in the second pass of the combined vision,
the original high-resolution image is used and the foveal visual system presented in Section 4.2.2
is applied, with the foveation point given by the center of the bounding boxes. This configuration
corresponds to the one used in the second feed-forward pass of the non-uniform foveal vision.
In the case of non-uniform vision, the foveation point is now given by the center of the proposed
bounding boxes. The classification performance in this second pass is cumulative, that is, it
depends on the parameters used in the first pass. Thus, the foveation point, which in this second
pass corresponds to the center of the bounding boxes, depends on the threshold used in the first pass
to generate the location proposals. For the three topologies considered, the threshold used in the
segmentation mask, which conditions the location proposals in the second pass, was th = 0.7. In the
second feed-forward pass, the evolution of the classification error is expected to follow the trend
observed in the first one, i.e. the classification error increases with σ0 for the uniform vision and
decreases with f0 for the non-uniform and combined visions, since we know that the data set images
have the objects centered on them.
For the uniform vision case, the input image used for re-classification is the cropped patch of the
original image defined by the location proposals. In this way, the surrounding context is discarded.
Surprisingly, the presence or absence of context seems to make little difference to the classification of
the object. One possible justification is that each image contains only one object that occupies almost
the full image (see Figure 6.2).
6.4 First vs Second Pass
For the non-uniform foveal vision, the difference between the first and the second pass is the selected
foveation point in the input image: in the first, the image is foveated at the center and in the second, the
foveation point is moved to the center of the proposed bounding boxes.
In Figure 6.3a and Figure 6.4a, it is possible to verify that there is practically no difference in clas-
sification error, that is, there is no significant difference between foveating at the center of the image
and foveating at the center of the object location proposal. One major limitation of this experiment is
the fact that the objects are large-scale and centered in the image. Therefore, we can conclude that
for this data set and topology, there is no advantage in making a second pass through the network.
For the uniform vision, in the first pass, the image is evenly blurred for a given σ0. In the second
pass, a cropped patch of the image defined by the proposed bounding boxes is used, resulting in a
loss of context. As expected, regardless of the pass, the higher the blur level σ0, the more information
is removed, making it harder for the network to correctly classify the image and resulting in an increase
of the classification error (see Figure 6.3c and Figure 6.4c).
The combined vision model is characterized by using images with a uniform blur of σ0 = 5 in the
first pass. In this case, the deeper the neural network used, the lower the classification error.
This tendency remains in the second pass, where the deeper networks predominantly obtain better
results. Again, because the objects are centered in the image, the larger the high-resolution region f0
applied at the center of the location proposals, the lower the classification error obtained.
[Figure 6.3 plots omitted. Panels: (a) classification error vs. f0, non-uniform foveal vision; (b) localization error vs. threshold, non-uniform foveal vision; (c) classification error vs. σ0, uniform Cartesian vision; (d) localization error vs. threshold, uniform Cartesian vision; (e) classification error vs. f0, combined vision; (f) localization error vs. threshold, combined vision. Legends: feed-forward (classification) and backward (localization) curves for CaffeNet, VGGNet and GoogLeNet; all errors in %.]
Figure 6.3: Classification and localization performance of the first pass for several topologies. Three different architectures are evaluated: CaffeNet (red lines with circles), VGGNet (green lines with stars) and GoogLeNet (blue lines with squares). This order and color arrangement are the same for all the subfigures. The left column corresponds to the classification error, where dashed lines correspond to the top-1 error and solid ones to the top-5 error. The localization error is in the right column. For Figure 6.3b, dashed lines correspond to a foveation with f0 = 80 and solid lines to f0 = 100. For Figure 6.3d, dashed lines correspond to a uniform blur with σ0 = 1 and solid lines to σ0 = 5. For Figure 6.3f, dashed lines correspond to a uniform blur with σ0 = 5. The classification error was based on the predicted class labels provided by the first feed-forward pass and the localization error was computed using the proposed bounding boxes resulting from the backward pass for various thresholds.
[Figure 6.4 plots omitted. Panels: (a) classification error vs. f0, non-uniform foveal vision; (b) localization error vs. threshold, non-uniform foveal vision; (c) classification error vs. σ0, uniform Cartesian vision; (d) localization error vs. threshold, uniform Cartesian vision; (e) classification error vs. f0, combined vision; (f) localization error vs. threshold, combined vision. Legends: feed-forward (classification) and backward (localization) curves for CaffeNet, VGGNet and GoogLeNet; all errors in %.]
Figure 6.4: Classification and localization performance of the second pass for several topologies. Three different architectures are evaluated: CaffeNet (red lines with circles), VGGNet (green lines with stars) and GoogLeNet (blue lines with squares). This order and color arrangement are the same for all the subfigures. The left column corresponds to the classification error, where dashed lines correspond to the top-1 error and solid ones to the top-5 error. The localization error is in the right column. For Figure 6.4b, dashed lines correspond to a foveation with f0 = 80 and solid lines to f0 = 100. For Figure 6.4d, dashed lines correspond to a uniform blur with σ0 = 1 and solid lines to σ0 = 5. For Figure 6.4f, dashed lines correspond to a non-uniform foveal blur with f0 = 80 and solid lines to f0 = 100. The classification error was based on the predicted class labels provided by the second feed-forward pass and the localization error was computed using the proposed bounding boxes resulting from the backward pass for various thresholds.
Chapter 7
Conclusions
In this thesis we propose a biologically inspired framework for object classification and localization that
incorporates bottom-up and top-down attentional mechanisms, combining recent deep convolutional
neural networks with foveal vision.
The main goal of this study was to evaluate the performance of several well-known CNN architectures
commonly used in recognition and localization tasks, namely CaffeNet, VGGNet and GoogLeNet.
Furthermore, we tested two different visual sensory structures: a uniform vision, where it is not
necessary to move the eyes towards the region of interest (covert attention), and a non-uniform foveal
vision, where attention is directed to the object location proposals by means of overt eye movements.
Our multistage framework begins by receiving evenly blurred images in the case of uniform vision, or
multi-resolution images in the case of foveal vision, which simulate human peripheral vision. The input
image is then classified by a CNN and, for each of the top-5 predicted class labels, a backward pass is
performed to obtain the corresponding saliency map. A segmentation mask with a given threshold is
applied to these maps, producing a blob of points that serves as a location proposal. Finally, the image
is re-classified with selective attention. For the non-uniform foveal visual sensor, attention is directed
to the proposed locations by means of overt attentional spotlight movements, whereas for the uniform
sensor the attentional spotlight is oriented in a covert manner to cropped patches of the original
image.
7.1 Achievements
From the analysis of our tests, we can conclude that deeper neural networks present better classification
performance. These deep nets have the ability to learn more features, which results in a better ability
to distinguish similar and nearby objects.
Comparing the classification performance of uniform vision sensors and non-uniform foveal vision
sensors, it is possible to verify that it is preferable to have an image with a lower, uniformly distributed
resolution than a multi-resolution image where the region of greater acuity is small.
On one hand, as one would expect, the higher the level of uniformly distributed blur applied to an
image, the greater the classification error, since it is more difficult to get the gist of the scene.
On the other hand, in the case of multi-resolution vision sensors, the larger the high-resolution
region, the greater the level of detail that the network can exploit, which leads to better performance
in the classification task.
In this way, the scene gist is best captured when the entire picture is displayed, even if it is blurred. When
using a foveated image, at a glimpse we direct our attention to the high-resolution region, which,
despite its great detail, may not suffice to give us an idea of what is really in the image.
The results we obtained for non-uniform foveal vision are promising. We conclude that it is
not necessary to store and transmit all the information present in a high-resolution image since, beyond
a given f0, the performance in the classification task remains constant regardless of the size of the
high-resolution region.
7.2 Future Work
One of the major limitations in the evaluation of non-uniform foveal vision is that it is constrained by the
chosen data set, which presents objects centered in the image. In the future, we intend to test this type
of vision on other recognition and localization data sets where objects are not centered, thus having
greater localization variety. The other very relevant limitation that conditioned the tests is the scale of
the images. Scaling is a problem for the foveal sensor, in particular for very close objects, because the
overall characteristics are lost as the resolution decays very rapidly towards the periphery.
It would also be interesting to train the system directly with blur (uniform and non-uniform foveal). In
this case, it would be expected that, with this tuning of the network, performance would improve for
both classification and localization tasks.
Bibliography
[1] A. Borji and L. Itti, “State-of-the-art in visual attention modelling,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
[2] F. Katsuki and C. Constantinidis, “Bottom-up and top-down attention: different processes and over-
lapping neural systems,” The Neuroscientist, vol. 20, no. 5, pp. 509–521, 2014.
[3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, L. Wang, C. Huang, T. S. Huang, W. Xu, D. Ra-
manan, and Y. Huang, “Look and Think Twice : Capturing Top-Down Visual Attention with Feed-
back,” IEEE International Conference on Computer Vision, 2015.
[4] F. B. Colavita, “Human Sensory Dominance,” Perception & Psychophysics, vol. 16, no. 2, pp. 409–
412, 1974.
[5] B. Wandell, Foundations of Vision. Sinauer Associates, 1995.
[6] R. Parasuraman and S. Yantis, The attentive brain. Mit Press Cambridge, MA, 1998.
[7] P. Quinlan and B. Dyson, “Attention: general introduction, basic models and data,” Cognitive Psy-
chology, pp. 271–311, 2008.
[8] L. M. Ward, “Attention,” Scholarpedia, vol. 3, no. 10:1538, 2008.
[9] M. Carrasco, “Visual attention: The past 25 years,” Vision Research, vol. 51, no. 13, pp. 1484–1525,
2011.
[10] A. Mack and I. Rock, “Inattentional Blindness,” Cambridge MA MIT Press Malik J Perona P, vol. 7,
no. 1998, p. 287, 1998.
[11] W. James, “The principles of psychology (Vols. 1 & 2),” New York Holt, vol. 118, p. 688, 1890.
[12] E. Sokolov and O. Vinogradova, Neuronal mechanisms of the orienting reflex. L. Erlbaum Asso-
ciates, 1975.
[13] M. I. Posner, “Orienting of attention,” Quarterly journal of experimental psychology, vol. 32, no. 1,
pp. 3–25, 1980.
[14] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259,
1998.
[15] C. Koch and S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry.
Springer Netherlands, 1985.
[16] G. F. Woodman and S. J. Luck, “Electrophysiological measurement of rapid shifts of attention during
visual search.,” Nature, vol. 400, no. 6747, pp. 867–869, 1999.
[17] C. G. Healey and J. T. Enns, “Attention and Visual Perception in Visualization and Computer Graph-
ics,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 7, pp. 1–20, 2011.
[18] A. M. Treisman, “A Feature-Integration Theory of Attention,” Cognitive Psychology, vol. 12, pp. 97–
136, 1980.
[19] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an alternative to the feature integration
model for visual search,” Journal of Experimental Psychology: Human Perception and Performance,
vol. 15, no. 3, pp. 419–433, 1989.
[20] L. Huang and H. Pashler, “A Boolean map theory of visual attention.,” Psychological review,
vol. 114, no. 3, pp. 599–631, 2007.
[21] A. Treisman, “Preattentive processing in vision,” Computer Vision, Graphics, and Image Processing,
vol. 31, pp. 156–177, aug 1985.
[22] J. M. Wolfe, “Guided Search 2.0: A revised model of visual search,” Psychonomic Bulletin & Review,
vol. 1, no. 2, pp. 202–238, 1994.
[23] J. M. Wolfe, S. R. Friedman-Hill, M. I. Stewart, and K. M. O’Connell, “The role of categorization in
visual search for orientation.,” Journal of experimental psychology. Human perception and perfor-
mance, vol. 18, no. 1, pp. 34–49, 1992.
[24] U. Neisser, Cognition and reality: principles and implications of cognitive psychology. 1976.
[25] R. L. Gregory, Perceptions as Hypotheses. Philosophical Transactions of the Royal Society of
London. Series B, Biological sciences, vol.290, No. 1038, 1980.
[26] J. J. Gibson, “The Theory of Affordances,” in Perceiving, Acting, and Knowing, pp. 127–142 (332),
Hoboken, NJ: John Wiley & Sons Inc., 1977.
[27] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[28] A. Message, A. Farke, A. Farke, A. Farke, V. M. Arbour, M. E. Burns, R. M. Sullivan, S. G. Lu-
cas, A. K. Cantrell, T. L. Suazo, J.-r. Boisserie, A. Souron, H. T. Mackaye, A. Likius, P. Vignaud,
M. Brunet, M. Tallman, N. Amenta, E. Delson, S. R. Frost, D. Ghosh, and Z. S. Klukkert, “Artificial
Intelligence,” no. August, 2014.
[29] J.-T. Huang, J. Li, and Y. Gong, “An analysis of convolutional neural networks for speech recog-
nition,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 4989–4993, 2015.
[30] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural
Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[31] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy Layer-Wise Training of Deep Net-
works,” Advances in Neural Information Processing Systems, vol. 19, no. 1, p. 153, 2007.
[32] M. aurelio Ranzato, C. Poultney, S. Chopra, Y. L. Cun, M. Ranzato, C. Poultney, S. Chopra, and
Y. L. Cun, “Efficient Learning of Sparse Representations with an Energy-Based Model,” Advances
in Neural Information Processing Systems, vol. 19, no. 1, pp. 1137–1144, 2007.
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout : A Simple
Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research (JMLR),
vol. 15, pp. 1929–1958, 2014.
[34] L. Itti and C. Koch, “A saliency-based search mechanism for overt and covert shifts of visual atten-
tion,” Vision Research, vol. 40, no. 10-12, pp. 1489–1506, 2000.
[35] S. P. Tipper, J. Driver, and B. Weaver, “Object-centred inhibition of return of visual attention.,” The
Quarterly journal of experimental psychology. A, Human experimental psychology, vol. 43, no. 2,
pp. 289–298, 1991.
[36] W. Osberger and A. Maeder, “Automatic identification of perceptually important regions in an
image,” Proceeding of the Fourteenth International Conference on Pattern Recognition, vol. 1,
pp. 701–704, 1998.
[37] T. Kadir and J. M. Brady, “Scale, Saliency and Image Description,” International Journal of Computer
Vision, vol. 45, no. 2, pp. 83–105, 2001.
[38] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” Proceedings of the
IEEE International Conference on Computer Vision, 2007.
[39] C. Siagian and L. Itti, “Rapid biologically-inspired scene classification using features shared with
visual attention,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2,
pp. 300–312, 2007.
[40] L. W. Renninger and J. Malik, “When is scene identification just texture recognition?,” Vision Re-
search, vol. 44, no. 19, pp. 2301–2311, 2004.
[41] A. Torralba, “Modeling global scene factors in attention,” Journal of the Optical Society of America
A, vol. 20, no. 7, p. 1407, 2003.
[42] L. Lukic and A. Billard, “Motor-primed Visual Attention for Humanoid Robots,” IEEE Transactions
on Autonomous Mental Development, pp. 1–16, 2015.
[43] L. Lukic and A. Billard, “Learning Coupled Dynamical Systems from Human Demonstration for
Robotic Eye-Hand Coordination,” IEEE, pp. 552–559, 2012.
[44] L. Lukic, J. Santos-Victor, and A. Billard, “Learning robotic eye-arm-hand coordination from human
demonstration: A coupled dynamical systems approach,” Biological Cybernetics, vol. 108, no. 2,
pp. 223–248, 2014.
[45] S. Frintrop, VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search.
Springer-Verlag Berlin Heidelberg, 2006.
[46] B. Rasolzadeh, A. T. Targhi, and J.-O. Eklundh, “An attentional system combining top-down and
bottom-up influences,” Attention in Cognitive Systems. Theories and Systems from an Interdisci-
plinary Viewpoint Lecture Notes in Computer Science, vol. 4840, pp. 123–140, 2007.
[47] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, “Going deeper with convolutions,” Computer Vision
Foundation, 2014.
[48] M. Lin, Q. Chen, and S. Yan, “Network In Network,” Computer Vision Foundation, 2013.
[49] Y. Lin, “A Computational Model for Saliency Maps by Using Local Entropy,” Artificial Intelligence,
pp. 967–973, 2001.
[50] Y. Lin, S. Kong, D. Wang, and Y. Zhuang, “Saliency Detection within a Deep Convolutional Archi-
tecture,” Cognitive Computing for Augmented Human Intelligence: From the AAAI-14 Workshop,
pp. 31–37, 2014.
[51] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based R-CNNs for Fine-Grained Category
Detection,” European Conference on Computer Vision (ECCV), 2014.
[52] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object de-
tection and semantic segmentation,” Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pp. 580–587, 2014.
[53] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[54] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Dar-
rell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093,
2014.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional
Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.
[56] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,”
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS),
vol. 9, pp. 249–256, 2010.
[57] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recog-
nition,” International Conference on Learning Representations, pp. 1–14, 2015.
[58] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The
Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vi-
sion, vol. 111, no. 1, pp. 98–136, 2014.
[59] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps,” International Conference on Learning Representations (Workshop), 2014.
[60] I. I. A. Groen, S. Ghebreab, H. Prins, V. A. F. Lamme, and H. S. Scholte, “From image statistics to
scene gist: evoked neural activity reveals transition from low-level natural image structure to scene
category.,” Journal of Neuroscience, vol. 33, no. 48, pp. 18814–18824, 2013.
[61] R. S. Wallace, P.-W. Ong, B. B. Bederson, and E. L. Schwartz, “Space variant image processing,”
International Journal of Computer Vision, vol. 13, no. 1, pp. 71–90, 1994.
[62] Z. Wang, “Rate scalable foveated image and video communications,” Ph.D. thesis, 2003.
[63] W. S. Geisler and J. S. Perry, “Real-time foveated multiresolution system for low-bandwidth video
communication,” in Photonics West’98 Electronic Imaging, pp. 294–305, International Society for
Optics and Photonics, 1998.
[64] P. J. Burt and E. H. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Transactions
on Communications, vol. 31, no. 4, pp. 532–540, 1983.
[65] M. J. Bastos, “Modeling human gaze patterns to improve visual search in autonomous systems,”
Master’s thesis, Instituto Superior Tecnico, 2016.