Learning Attention Based Saliency in Videos from Human Eye Movements
Sunaad Nataraju, Vineeth Balasubramanian, Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC)
Arizona State University, Tempe, AZ 85287
{sunaad.nataraju, vineeth.nb, panch}@asu.edu
Abstract
In this paper we propose a framework to learn and predict saliency in videos using human eye movements. In our approach, we record the eye-gaze of users as they are watching videos, and then learn the low-level features of regions that are of visual interest. The learnt classifier is then used to predict salient regions in videos belonging to the same application. So far, predicting saliency in images and videos has been approached from mainly two different perspectives, namely visual attention modeling and spatio-temporal interest point detection. Such approaches are purely vision-based and detect regions having a predefined set of characteristics, such as complex motion or high contrast, for all kinds of videos. However, what is 'interesting' varies from one application to another. By learning features of regions that capture the attention of viewers while watching a video, we aim to distinguish those that are actually salient in the given context from the rest. This is especially useful in an environment where users are interested only in a certain kind of activity, as in the case of surveillance or biomedical applications. In this paper, the proposed framework is implemented using a neural network that learns the low-level features defined in the visual attention modeling literature (Itti's saliency model) based on the interesting regions as identified by the eye-gaze movements of viewers. In our experiments with news videos of popular channels, the results show a significant improvement in the identification of relevant salient regions in such videos, when compared with existing approaches.
1. Introduction

As growing numbers of videos are generated each day,
there has been an equally increasing need to be able to iden-
tify relevant regions of interest for all video-based analysis
tasks, such as medical diagnosis and surveillance. Existing
approaches to identify salient regions of interest in images
and videos have typically been addressed from two different
schools of thought. One approach is based on visual atten-
tion modeling, and the most widely accepted methodology
in this respect has been proposed by Itti et al. in [3]. Their
approach is biologically inspired by the visual system in pri-
mates in which the incoming image is first decomposed into
individual feature maps based on color, orientation, inten-
sity and motion. These maps are then combined to obtain a
final saliency map for the image that indicates the individ-
ual saliency of each pixel. In the other school of thought,
salient (or interest) points are detected using a set of prede-
fined saliency functions. The pixels that have a sufficiently
high response to these functions (possible local maxima) are
generally identified as a salient or an interest point. Such
popular approaches include the 3-D Harris corner detector
[9], and the periodic detector presented by Dollar et al. in
[1]. Here, we propose a novel approach to combine the two
schools of thought by learning visual saliency in videos us-
ing human eye movements.
Existing methods in both the aforementioned approaches
have largely been vision-based, i.e., they rely on generic
predefined filters that characterize regions of interest with
a bias towards extrema in certain image characteristics. The
inherent problem with such approaches is that they use the same
set of filters to determine saliency in all types of videos.
While such an approach generally performs well to detect
artifacts such as corners, edges and motion-based interest
regions, 'interestingness' is strongly dependent on the
application under consideration, and it may not be appropriate
to generalize this concept. For example, in surveillance
applications, monitoring personnel would consider
specific events such as a person carrying a gun as salient, al-
though the background of this scene may be cluttered with
other 'interesting' image artifacts based on edges or color.
Similarly, in biomedical applications such as colonoscopy
videos, a physician would specifically consider patterns cor-
responding to tumors as salient, rather than all complex pat-
terns corresponding to normal characteristics in the body.
The use of human eye movements to learn temporal char-
acteristics of interest points in videos was introduced in [6]
and was based on the popular periodic detector [1]. A neural
network model was used to learn the coefficients of the
temporal filters. While viewing scenes in images and video,
human eye movements are driven by two approaches: top-down and bottom-up. In the top-down approach, the user
views the scene with a particular objective which could po-
tentially be based on the viewer’s thoughts, mood, etc. On
the other hand, in the bottom-up approach, the viewer is not
looking for anything in particular and the eye movements
are guided only by certain image or video features such as
high contrast, orientations and motion in the scene. Atten-
tion modeling algorithms such as [3], [11] use such local
image features to model the bottom-up approach. Though
these models have been widely accepted, they have had lim-
ited success in predicting eye movements in natural images.
This can be attributed to the fact that the top-down component
plays a significant role and is difficult to model. In this work, we
propose a unique approach which can be considered to be
a blend of these two approaches. The low-level features
which are known to trigger bottom-up saliency are learnt
for regions that attract user attention in a particular appli-
cation scope - which can be understood as governing top-
down saliency. Our approach learns the saliency of regions
in videos belonging to a particular application, and is subse-
quently used to predict relevant regions of interest in similar
videos.
A few commonly used terms to describe eye movements
are fixations, smooth pursuit and saccades. Fixations
refer to regions on which the eye focuses for an extended
period, ranging from a few tens to hundreds of milliseconds.
Smooth pursuit occurs when the eye continually tracks a
moving object over time while viewing a video. Saccades
are the quick movements of the eye between fixations.
Fixations and smooth pursuits indicate regions that
are of interest to the viewer. In this work, we intend to use
the features from these regions to train our algorithm.
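The paper does not describe how fixations are separated from saccades in the recorded gaze stream. A common dispersion-based heuristic (in the spirit of the I-DT algorithm) can be sketched as follows; the window length and 30-pixel dispersion threshold are illustrative assumptions, not values from the paper:

```python
# Dispersion-based fixation detection sketch: a group of consecutive gaze
# samples counts as a fixation if its spatial spread stays below a
# threshold for a minimum duration; other samples are treated as saccades.

def _dispersion(window):
    xs, ys = zip(*window)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def fixations(gaze, max_dispersion=30.0, min_samples=5):
    """gaze: list of (x, y) samples at a fixed rate (e.g., 50 Hz).
    Returns (start, end) index pairs covering detected fixations."""
    out, start = [], 0
    while start + min_samples <= len(gaze):
        end = start + min_samples
        if _dispersion(gaze[start:end]) <= max_dispersion:
            # grow the window while the gaze stays tightly clustered
            while end < len(gaze) and _dispersion(gaze[start:end + 1]) <= max_dispersion:
                end += 1
            out.append((start, end))
            start = end
        else:
            start += 1  # likely a saccade sample; slide past it
    return out
```

At a 50 Hz sampling rate, `min_samples=5` corresponds to a 100 ms minimum fixation duration.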
The approach of using human eye movements to indi-
cate saliency has two distinct advantages: (i) Unlike video
datasets that are typically captured under controlled condi-
tions with a stationary background, real-world videos are
often comprised of several regions, each of which has its
own spatio-temporal characteristics. Each of these regions
may or may not be relevant depending on the class of videos
being studied. Human eye movements are capable of indi-
cating the relevant regions of interest. (ii) Supervised learn-
ing approaches that can possibly learn the relevance of re-
gions need labeling of training samples. This is a very
tedious and time-consuming process, where users would be
required to manually select the regions that are of interest.
This can be avoided by using human eye movements to label
the regions. Users can be asked to watch videos naturally,
and their eye movements can be recorded to subconsciously
label regions of interest. This process proves to be an effi-
cient and quick way of obtaining ground truth for a learning
framework that can predict regions of interest in new videos.
In summary, we propose a machine learning based
framework that learns characteristics of regions that bear
a high attention-based saliency in specific applications.
Though it might initially seem that the training could be
a laborious process because of this specificity, our proposed
model could be easily incorporated into a routine task. Eye-
tracking technology has grown in recent years, and a com-
mercially available eye tracker is typically integrated into a
standard desktop monitor and thereby allows a natural un-
obtrusive way to record eye movements as users are view-
ing videos. This technology enables the users (such as
physicians diagnosing abnormalities in medical videos and
surveillance personnel monitoring videos) to carry out their
regular routines as their gaze is being tracked. Thus, the
proposed framework can be used with ease in such applications,
without any explicit data-collection procedure each time the
model needs to be learnt or updated. Another advantage is
that such an approach is non-parametric. Pure vision-based
approaches require thresholds and parameter values that must
be set manually and that heavily influence the results. While
machine learning frameworks also depend on parameters, these
are learnt from the training data, eliminating the need to
set them manually.
The remainder of the paper is organized as follows. Re-
lated background work is discussed in Section 2. The data
collection procedure and the experimental set up are de-
scribed in Section 3. In Section 4, we present our algorithm.
The results and conclusion are then discussed in Section 5
and 6 respectively.
2. Related Work
Attention modeling has been well studied over the last
few decades. The approach proposed by Itti described in
[3] has been widely accepted and known to resemble the
bottom-up approach of eye movements well. This model
was built on the architecture that was suggested by Koch
and Ullman in [8]. The visual search strategy in humans is
based on the feature integration theory of combining indi-
vidual feature maps corresponding to low level features in
images, as explained in [12]. The model suggested by Itti
makes use of the concept of a saliency map. First, the
input image is represented at multiple scales using a Gaussian
pyramid. Individual feature maps are extracted for each
scale, based on color, contrast and orientations. These are
then combined across the scales so as to have a single fea-
ture map for a given feature. The final saliency map, which
indicates the interestingness of each pixel in an image, is
obtained by a linear combination of the feature maps.
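The final combination step of such a model can be sketched as follows. The feature maps here are random stand-ins; a full pipeline would compute them with center-surround differences across the Gaussian pyramid:

```python
import numpy as np

# Itti-style final combination sketch: per-feature conspicuity maps
# (e.g., intensity, color, orientation) are normalized to a common range
# and summed linearly into one saliency map.

def normalize(m):
    lo, hi = float(m.min()), float(m.max())
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)

def saliency_map(feature_maps, weights=None):
    maps = [normalize(m) for m in feature_maps]
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)  # uniform linear combination
    return sum(w * m for w, m in zip(weights, maps))

rng = np.random.default_rng(0)
maps = [rng.random((48, 64)) for _ in range(3)]  # stand-in feature maps
S = saliency_map(maps)
y, x = np.unravel_index(S.argmax(), S.shape)     # most salient pixel
```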
In the other category of approaches, existing methods
that identify spatio-temporal interest points have largely
been purely vision-based. All functions and filters used in
such pure vision-based approaches are predefined and have
a bias towards certain features. While some of them have
been extensions of existing 2-D algorithms (such as the 3-D
Harris corner detector), there are others specifically meant
for the spatio-temporal domain. Initial work in detecting
spatio-temporal interest points was done by Laptev et al.
in [9]. In this approach, an extension of the Harris corner
detector [2] into the third dimension was proposed to detect
points that are spatio-temporal corners. The problem
with this algorithm was that the points were too sparse to
characterize videos. Since selected points had to be a spa-
tial as well as a temporal corner, very few were detected.
Another popular approach that has been used extensively in
action classification problems is the periodic detector [1].
The saliency function here measures the response from a
quadrature pair of Gabor filters in the temporal domain.
This approach is known to produce a high number of points
in regions having motion, including spatio-temporal cor-
ners. Though it primarily detects periodic motion, it re-
sponds to other kinds of motion as well. The approach sug-
gested by Oikonomopoulos et al. [10] is an extension of the
Kadir and Brady spatial interest detector [4] to the
temporal domain. This detector measures the information
content of pixels not only in the spatial, but also in the tem-
poral neighborhood. Spatio-temporal points are compared
using a metric based on the Chamfer distance. Another ap-
proach suggested by Ke et al. in [5] makes use of volumet-
ric features and video optical flow to detect motion. This is
based on the rectangular features used by Viola and Jones
[13]. Among the aforementioned detectors, the periodic de-
tector of Dollar et al. [1] has been the most popular choice.
A framework to learn temporal characteristics of inter-
est points in videos from eye movements was introduced
by Kienzle et al. in [6] and was based on the popular pe-
riodic detector [1]. A neural network model was used to
learn the temporal filters, instead of the quadrature pair of
1-D temporal Gabor filters used in the periodic detector. In
their approach, they recorded eye gaze data from several
viewers watching various short clips from a movie. A feed-
forward neural network model is used to learn the saliency
function. The filter coefficients are optimized using logis-
tic regression to fit the saliency model to the eye movement
data. The detector was validated on human action classifica-
tion. The results showed state-of-the-art accuracy and even
outperformed some of the previous methods. We integrate
a modified version of such a learning framework into the
saliency detection architecture suggested by Itti, to develop
machine learning models that learn saliency characteristics
for videos in a specific application. Such a model is capa-
ble of locating salient regions in videos based on the ap-
plication, rather than plainly selecting pixels having prede-
termined features such as high motion and complex spatial
textures, as in the case of existing approaches.
3. Experimental Setup
Before we describe the proposed learning methodology,
we present the details of the experimental setup for ease
of understanding. The Tobii 1750 eye tracker was used to
record eye movements. This device, as shown in Figure
1, is integrated into a 21-inch monitor. It tracks the eye
gaze of viewers as they use the monitor, at a sampling
frequency of 50 Hz. The device has an accuracy of
0.5 degrees. The eye tracking procedure involves an initial
calibration for each person. Once calibrated, the eye tracker
is capable of compensating for the user’s head movements,
thereby allowing a natural uninterrupted viewing procedure.
In this work, we used news videos to demonstrate the ap-
Figure 1. Tobii 1750 eye tracker
plicability of our approach. Real-world video clips were
downloaded from websites of popular channels, instead of
creating an artificial dataset. News videos were used for
this work since these videos generally have several regions
with motion, apart from those corresponding to the news
reader. These movements include people walking behind
the news reader, camera movement, and arbitrarily moving
design patterns in the background.
Eleven volunteers participated in this experiment. Each
of these volunteers was unaware of the purpose of the
experiment and was instructed to watch the news videos naturally. The
users were asked to sit at a normal viewing distance from
the monitor while watching the videos. The eye gaze infor-
mation of these volunteers was recorded for thirty-seven
different news video snippets. In between each of the videos,
a blank screen was displayed for four seconds so that the
initial eye movements during a video are not biased by the
previous video. The regions in the neighborhood of the pixels
viewed by the volunteers were labeled as positive samples;
all other regions correspond to the negative class. To
generate negative samples, arbitrary points were drawn from
the regions that were not viewed by the users. This procedure
produces samples whose spatio-temporal features did not
capture the attention of any of the users.
These positive and negative samples are used to train the
learning framework, described in the next section.
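The sampling procedure above can be sketched as follows. The 40-pixel radius defining a "viewed neighborhood" is an illustrative assumption; the paper does not state the window size it used:

```python
import random

# Labeling sketch: pixels near a recorded gaze point become positive
# samples; negatives are drawn at random from locations far from every
# gaze point, so their features did not attract any viewer's attention.

def label_samples(gaze_points, frame_w, frame_h, n_neg, radius=40, seed=0):
    rng = random.Random(seed)
    positives = list(gaze_points)  # one positive sample per gaze location
    negatives = []
    while len(negatives) < n_neg:
        x, y = rng.randrange(frame_w), rng.randrange(frame_h)
        # keep only points outside every viewed neighborhood
        if all((x - gx) ** 2 + (y - gy) ** 2 > radius ** 2
               for gx, gy in gaze_points):
            negatives.append((x, y))
    return positives, negatives

pos, neg = label_samples([(100, 120), (520, 300)], 640, 480, n_neg=50)
```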
Figure 2. Positive and negative training samples for an image
4. Learning Attention Based Saliency
The proposed learning framework is motivated by the
widely accepted Itti’s saliency detection approach [3], i.e.,
we use the focal concept of obtaining saliency maps from
biologically inspired low-level feature maps. However, in-
stead of thresholding generic filter responses of such low-
level features, we introduce a learning framework that can
capture the image characteristics of salient regions based on
the labeled data that is obtained by recording the eye gaze
of users as they are naturally watching videos for a specific
application.
As mentioned earlier, the concept of learning interest
points in videos using human eye movements was intro-
duced in [6]. However, this framework had several draw-
backs that motivated us to take up the proposed frame-
work. Primarily, the work of Kienzle et al. in [6] used
only temporal descriptors. However, temporal descriptors
are insufficient to characterize regions of interest in
videos (the figures presented in Section 4 illustrate this
point clearly). Spatial features must also be considered to
detect points that are truly of 'spatio-temporal' interest.
Furthermore, motion-based
saliency models are incapable of handling camera motion
and zoom. They require that camera induced motion be
subtracted before the algorithms can be applied. Inclusion
of spatial features makes our framework robust to camera
movements (as illustrated in Figure 4).
The low-level features used to construct maps in the visual
system of primates include color, intensity, orientation and
motion (as in [3]). In our approach, we construct descriptors
representing these low-level features for each training sample,
and train a neural network whose coefficients fit the training
labels. A window around each sample's pixel is used to
construct the descriptors. The feature descriptors include:
(i) Color intensity histograms: Three separate histograms
of the intensities corresponding to red, green and blue are
concatenated to obtain the color intensity histogram. Each
bin spans 20 intensity values; hence, each histogram has a
length of 13.
(ii) Gradient orientation histogram: The horizontal and
vertical gradients of each pixel in the window are calculated
using the masks [1, −1] and [1, −1]^T respectively. The
inverse tangent of their ratio gives the orientation of the
gradient. The contribution of each pixel to the histogram is
weighted by the magnitude of its gradient. Such a histogram
captures not only the orientation characteristics, but also
the intensity contrasts in the region. The histogram is of
length 18, with each bin spanning 10 degrees between 0 and
180 degrees.
(iii) Motion descriptor: This is similar to that suggested
in [6], and consists of the pixel intensities in the neighbor-
ing frames corresponding to the same spatial coordinates.
Such a vector characterizes the temporal features well, as
illustrated in their results.
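As an illustration, the gradient orientation histogram in (ii) can be computed as follows (a pure-Python sketch over a grayscale patch):

```python
import math

# Gradient-orientation histogram sketch: [1, -1] difference masks give
# the horizontal and vertical gradients; the inverse tangent of their
# ratio gives the orientation; each pixel's vote is weighted by its
# gradient magnitude. 18 bins of 10 degrees cover 0-180 degrees.

def orientation_histogram(patch):
    """patch: 2-D list of grayscale intensities."""
    hist = [0.0] * 18
    h, w = len(patch), len(patch[0])
    for y in range(h - 1):
        for x in range(w - 1):
            gx = patch[y][x + 1] - patch[y][x]  # [1, -1] mask
            gy = patch[y + 1][x] - patch[y][x]  # [1, -1]^T mask
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // 10), 17)] += mag  # magnitude-weighted vote
    return hist
```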
Once the feature vectors, F_s, for all the samples are
obtained, the saliency model is represented using a
feed-forward neural network:

O_s = b_{s0} + Σ_{i=1}^{k_s} α_{si} tanh(F_s · W_{si} + b_{si})    (1)
Here, W_{si} and α_{si} are the weights of the first and
second layers respectively, and b_{si} are the corresponding
bias parameters. The final saliency, S_s, is given by the
logistic sigmoidal activation function applied to the output
O_s:

S_s = 1/(1 + exp(−O_s))    (2)
The logistic function is known to produce good results for
binary classification, and is hence used. A further motivation
for this choice is that its inverse gives O_s a probabilistic
interpretation: O_s is the logit, i.e., the log of the ratio of
the class probabilities, ln(P(C = 1|F)/P(C = 0|F)), where
C = 1 and C = 0 represent the positive and negative classes
respectively. Most vision-based saliency functions, on the
other hand, have arbitrary ranges, which makes their
interpretation difficult. In order to train our saliency
model, we build individual neural networks for each of the
features.
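Equations (1) and (2) amount to a small forward pass, which can be sketched as follows; the weights in the usage example are toy values, not learned coefficients:

```python
import math

# Forward pass of the saliency model in Eqs. (1) and (2): a single
# hidden layer of tanh units, whose output O_s is squashed by a logistic
# sigmoid to give the saliency S_s in (0, 1).

def saliency(F, W, alpha, b0, b):
    """F: feature vector; W: list of k hidden-unit weight vectors;
    alpha: k output weights; b0: output bias; b: k hidden biases."""
    O = b0 + sum(a * math.tanh(sum(f * w for f, w in zip(F, Wi)) + bi)
                 for a, Wi, bi in zip(alpha, W, b))   # Eq. (1)
    return 1.0 / (1.0 + math.exp(-O))                 # Eq. (2)
```

For example, `saliency([0.5, -1.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0], 0.1, [0.0, 0.0])` returns a value strictly between 0 and 1, interpretable as the probability that the sample is salient.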
Once the weights of the neural network are trained, the
algorithm is ready to detect salient regions in similar videos
that belong to the same application. For a given frame in a
video sequence, saliency is predicted as follows. We construct
the individual feature maps, and the final saliency map is
obtained as a point-wise multiplication of these maps. In
doing so, we spatially filter out regions that are temporally
detected as interesting. Once the saliency map is obtained,
the interest points are those that correspond to local maxima
within their neighborhood.
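This prediction step (point-wise multiplication of the feature maps, then local-maxima selection) can be sketched as follows; the 3x3 neighborhood and the threshold are illustrative choices:

```python
import numpy as np

# Prediction sketch: per-feature saliency maps are combined by
# point-wise multiplication, and interest points are the local maxima
# of the combined map within a 3x3 neighborhood.

def combine_and_pick(maps, threshold=0.5):
    S = np.ones_like(maps[0])
    for m in maps:
        S = S * m  # point-wise multiplication of feature maps
    points = []
    h, w = S.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = S[y - 1:y + 2, x - 1:x + 2]
            if S[y, x] >= threshold and S[y, x] == patch.max():
                points.append((x, y))
    return S, points
```

The multiplication suppresses any location that scores poorly on even one feature, which is how spatially uninteresting motion gets filtered out.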
Figure 3. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video with moving patterns
in the background. (e) shows the fixations of users on the frame
Figure 4. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video with moving patterns
in the background. (e) shows the fixations of users on the frame
5. Results
The results for our experiments are presented from two
perspectives. In the first set of results, we present the ac-
curacy of predicting potential eye-gaze positions on test
videos using the individual neural networks. Secondly, we
visually illustrate the performance of our framework in de-
tecting salient pixels in videos.
(i) Eye Movement Prediction: The eye gaze data for
eleven participants was recorded for a set of 37 news videos.
The samples from the first 27 videos were used to train the
models. The positive and negative samples from the remain-
ing ten videos were used to test the performance of the clas-
sifiers. Classification results for the individual neural net-
works that were discussed in the previous section are shown
in Table 1. The performance of the motion descriptor also
indicates the performance of the approach mentioned in [6].
It is worth mentioning that the results indicate the perfor-
mance of the features for videos belonging to this specific
application alone.
Table 1. Prediction accuracy of the individual neural networks on
samples from the test videos. Note that the results for the motion
descriptor-based neural network are equivalent to the results
obtained using earlier work [6].

Neural Network               | Positive | Negative
Edge orientation histogram   | 88.3%    | 42.6%
Motion descriptor            | 73.2%    | 71.4%
Color histogram              | 82.4%    | 89.4%
Concatenated vector          | 82.4%    | 86.0%
(ii) Salient Point detection: Figures 3 and 4 illustrate
the performance of the algorithm using two different ex-
amples. The images to the left are the original frames.
These frames have different regions with motion. Figures
3(b) and 4(b) show the points detected using the periodic
detector. Figures 3(c) and 4(c) show the pixels predicted
by the algorithm in [7]. It is evident that both these detec-
tors, being motion-based, respond to all regions in the video
frame that have movements. The results obtained using our
approach in which the saliency map is obtained from indi-
vidual feature maps are presented in Figures 3(d) and 4(d).
Clearly, this framework is able to distinguish the salient re-
gions from the irrelevant ones having motion. Also, the
fixations of the users, as recorded by the eye tracker, are
shown in the images to the right, i.e., Figures 3(e) and 4(e). As can
be seen, the interest points detected by our approach bear a
high correlation to the fixations of the users in both the ex-
amples. The frame shown in Figure 5 is taken from a video
that was captured using a moving camera. As expected, the
pure motion-based approaches detect motion in all parts of
the scene, and in turn predict interest points from all re-
gions. Here, we see that our learned detector is robust to
the camera movements. It is evident from the examples that
our framework ensures interest points have relevant spatial
content along with temporal content. In addition, we also
experimented with the proposed framework to design a sin-
gle neural network that was built using a concatenated fea-
ture vector (using all the considered features). However, we
found that this approach did not perform as well as using
individual neural networks for each feature. We believe that
this may have been due to the high dimensionality of the
resulting concatenated feature vector, and we plan to study
this further in our future work.

Figure 5. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video taken using a moving hand-held video camera

Another point to be noted is that the saliency model detects
regions based on those that are commonly viewed by all the
users of the experiment. For example, in this work, we found that all the
volunteers in our experiments viewed the faces of the news-
casters during the video. However, there might be a case
where some users view other regions (possibly the text on
the screen). In such a scenario, the learning framework will
learn features based on the most commonly viewed regions.
6. Conclusion and Future Work

Detecting saliency in videos has emerged as an important
component of many video analysis problems. While
traditional approaches in the domain tend to be vision-
based, we have proposed a novel machine-learning frame-
work that is trained to detect spatio-temporal interest points
based on the interest of viewers, as captured by human
eye movements. Our results have shown that the learned
detector is capable of predicting the most relevant spatio-
temporal interest points, even in the presence of background
movements. This approach is non-parametric and is also
capable of handling camera-induced motion in videos. A
seeming limitation (which is also an important feature of
the approach) is that the algorithm will have to be trained
individually for each application to detect relevant interest
points. However, as mentioned earlier, state-of-the-art eye
tracking devices can be integrated into monitors and are ca-
pable of unobtrusively tracking eye gaze as users are freely
viewing videos. This way, our framework can be easily
incorporated while learning saliency in the videos
corresponding to the users' application. In future work, we plan
to study the choice of more feature descriptors that provide
the best possible performance in learning the saliency func-
tions. In addition, we also intend to include a multi-scale
approach so as to be able to handle interest regions at dif-
ferent scales.
References

[1] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International Workshop on, pages 65-72, 2005.
[2] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50, 1988.
[3] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
[4] T. Kadir and M. Brady. Scale saliency: A novel approach to salient feature and scale selection. In Visual Information Engineering, 2003. VIE 2003. International Conference on, pages 25-28, 2003.
[5] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005, volume 1, 2005.
[6] W. Kienzle, B. Scholkopf, F. Wichmann, and M. Franz. How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. Lecture Notes in Computer Science, 4713:405, 2007.
[7] W. Kienzle, F. Wichmann, B. Scholkopf, and M. Franz. Learning an interest operator from human eye movements. In IEEE Conference on Computer Vision and Pattern Recognition, volume 17, page 22, 2006.
[8] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4):219-227, 1985.
[9] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107-123, 2005.
[10] A. Oikonomopoulos, I. Patras, and M. Pantic. Human action recognition with spatiotemporal salient points. IEEE Transactions on Systems, Man and Cybernetics - Part B, 36(3):710-719, 2006.
[11] F. Stentiford. Attention-based similarity. Pattern Recognition, 40(3):771-783, 2007.
[12] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97-136, 1980.
[13] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of CVPR 2001, volume 1, 2001.