Learning Attention Based Saliency in Videos from Human Eye Movements
Sunaad Nataraju, Vineeth Balasubramanian, Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC)
Arizona State University, Tempe, AZ 85287
{sunaad.nataraju, vineeth.nb, panch}@asu.edu
Abstract
In this paper we propose a framework to learn and predict saliency in videos using human eye movements. In our approach, we record the eye-gaze of users as they are watching videos, and then learn the low-level features of regions that are of visual interest. The learnt classifier is then used to predict salient regions in videos belonging to the same application. So far, predicting saliency in images and videos has been approached from mainly two different perspectives, namely visual attention modeling and spatio-temporal interest point detection. Such approaches are purely vision-based and detect regions having a predefined set of characteristics, such as complex motion or high contrast, for all kinds of videos. However, what is 'interesting' varies from one application to another. By learning features of regions that capture the attention of viewers while watching a video, we aim to distinguish those that are actually salient in the given context from the rest. This is especially useful in an environment where users are interested only in a certain kind of activity, as in the case of surveillance or biomedical applications. In this paper, the proposed framework is implemented using a neural network that learns the low-level features defined in the visual attention modeling literature (Itti's saliency model) based on the interesting regions as identified by the eye-gaze movements of viewers. In our experiments with news videos of popular channels, the results show a significant improvement in the identification of relevant salient regions in such videos, when compared with existing approaches.
1. Introduction

As growing numbers of videos are generated each day,
there has been an equally increasing need to be able to iden-
tify relevant regions of interest for all video-based analysis
tasks, such as medical diagnosis and surveillance. Existing
approaches to identify salient regions of interest in images
and videos have typically been addressed from two different
schools of thought. One approach is based on visual atten-
tion modeling, and the most widely accepted methodology
in this respect has been proposed by Itti et al. in [3]. Their
approach is biologically inspired by the visual system in pri-
mates in which the incoming image is first decomposed into
individual feature maps based on color, orientation, inten-
sity and motion. These maps are then combined to obtain a
final saliency map for the image that indicates the individ-
ual saliency of each pixel. In the other school of thought,
salient (or interest) points are detected using a set of prede-
fined saliency functions. The pixels that have a sufficiently
high response to these functions (possible local maxima) are
generally identified as a salient or an interest point. Such
popular approaches include the 3-D Harris corner detector
[9], and the periodic detector presented by Dollar et al. in
[1]. Here, we propose a novel approach to combine the two
schools of thought by learning visual saliency in videos us-
ing human eye movements.
Existing methods in both the aforementioned approaches
have largely been vision-based, i.e., they rely on generic
predefined filters that characterize regions of interest with
a bias towards extrema in certain image characteristics. The
inherent problem with such approaches is that they use the same
set of filters to determine saliency in all types of videos.
While such an approach generally performs well to detect
artifacts such as corners, edges and motion-based interest
regions, 'interestingness' is strongly dependent on the
application under consideration, and it may not be appropriate
to generalize this concept. For example, in surveillance
applications, monitoring personnel would consider
specific events such as a person carrying a gun as salient, al-
though the background of this scene may be cluttered with
other 'interesting' image artifacts based on edges or color.
Similarly, in biomedical applications such as colonoscopy
videos, a physician would specifically consider patterns cor-
responding to tumors as salient, rather than all complex pat-
terns corresponding to normal characteristics in the body.
The use of human eye movements to learn temporal char-
acteristics of interest points in videos was introduced in [6]
and was based on the popular periodic detector [1]. A neural
network model was used to learn the coefficients of the
temporal filters. While viewing scenes in images and video,
human eye movements are driven by two approaches: top-down and bottom-up. In the top-down approach, the user
views the scene with a particular objective which could po-
tentially be based on the viewer’s thoughts, mood, etc. On
the other hand, in the bottom-up approach, the viewer is not
looking for anything in particular and the eye movements
are guided only by certain image or video features such as
high contrast, orientations and motion in the scene. Atten-
tion modeling algorithms such as [3], [11] use such local
image features to model the bottom-up approach. Though
these models have been widely accepted, they have had lim-
ited success in predicting eye movements in natural images.
This can be attributed to the fact that the top-down component
plays a significant role and is difficult to model. In this work, we
propose a unique approach which can be considered to be
a blend of these two approaches. The low-level features
which are known to trigger bottom-up saliency are learnt
for regions that attract user attention in a particular appli-
cation scope - which can be understood as governing top-
down saliency. Our approach learns the saliency of regions
in videos belonging to a particular application, and is subse-
quently used to predict relevant regions of interest in similar
videos.
A few commonly used terms to describe eye movements
are fixations, smooth pursuit and saccades. Fixations
refer to regions on which the eye focuses for an extended
period, ranging from a few tens to hundreds of milliseconds.
Smooth pursuit occurs when the eye continually tracks a
moving object over time while viewing a video. Saccades
are the quick movements of the eye between fixations.
Fixations and smooth pursuits indicate regions that
are of interest to the viewer. In this work, we intend to use
the features from these regions to train our algorithm.
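The paper does not describe how fixations are separated from saccades in the recorded gaze stream. A common dispersion-based heuristic (in the spirit of the I-DT algorithm) can be sketched as follows; the window length and 30-pixel dispersion threshold are illustrative assumptions, not values from the paper:

```python
# Dispersion-based fixation detection sketch: a group of consecutive gaze
# samples counts as a fixation if its spatial spread stays below a
# threshold for a minimum duration; other samples are treated as saccades.

def _dispersion(window):
    xs, ys = zip(*window)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def fixations(gaze, max_dispersion=30.0, min_samples=5):
    """gaze: list of (x, y) samples at a fixed rate (e.g., 50 Hz).
    Returns (start, end) index pairs covering detected fixations."""
    out, start = [], 0
    while start + min_samples <= len(gaze):
        end = start + min_samples
        if _dispersion(gaze[start:end]) <= max_dispersion:
            # grow the window while the gaze stays tightly clustered
            while end < len(gaze) and _dispersion(gaze[start:end + 1]) <= max_dispersion:
                end += 1
            out.append((start, end))
            start = end
        else:
            start += 1  # likely a saccade sample; slide past it
    return out
```

At a 50 Hz sampling rate, `min_samples=5` corresponds to a 100 ms minimum fixation duration.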
The approach of using human eye movements to indi-
cate saliency has two distinct advantages: (i) Unlike video
datasets that are typically captured under controlled condi-
tions with a stationary background, real-world videos are
often comprised of several regions, each of which has its
own spatio-temporal characteristics. Each of these regions
may or may not be relevant depending on the class of videos
being studied. Human eye movements are capable of indi-
cating the relevant regions of interest. (ii) Supervised learn-
ing approaches that can possibly learn the relevance of re-
gions need labeling of training samples. This is a very
tedious and time-consuming process, where users would be
required to manually select the regions that are of interest.
This can be avoided by using human eye movements to label
the regions. Users can be asked to watch videos naturally,
and their eye movements can be recorded to subconsciously
label regions of interest. This process proves to be an effi-
cient and quick way of obtaining ground truth for a learning
framework that can predict regions of interest in new videos.
In summary, we propose a machine learning based
framework that learns characteristics of regions that bear
a high attention-based saliency in specific applications.
Though it might initially seem that the training could be
a laborious process because of this specificity, our proposed
model could be easily incorporated into a routine task. Eye-
tracking technology has grown in recent years, and a com-
mercially available eye tracker is typically integrated into a
standard desktop monitor and thereby allows a natural un-
obtrusive way to record eye movements as users are view-
ing videos. This technology enables the users (such as
physicians diagnosing abnormalities in medical videos and
surveillance personnel monitoring videos) to carry out their
regular routines as their gaze is being tracked. Thus, the
proposed framework can be used with ease in such applications,
without any explicit data-collection procedure each time the
model needs to be learnt or updated. Another advantage is
that such an approach is non-parametric. Pure vision-based
approaches require thresholds and parameter values that must
be set manually and that heavily influence the results. While
machine learning frameworks also depend on parameters, these
are learnt from the training data, eliminating the need to
set them manually.
The remainder of the paper is organized as follows. Re-
lated background work is discussed in Section 2. The data
collection procedure and the experimental set up are de-
scribed in Section 3. In Section 4, we present our algorithm.
The results and conclusion are then discussed in Section 5
and 6 respectively.
2. Related Work
Attention modeling has been well studied over the last
few decades. The approach proposed by Itti described in
[3] has been widely accepted and known to resemble the
bottom-up approach of eye movements well. This model
was built on the architecture that was suggested by Koch
and Ullman in [8]. The visual search strategy in humans is
based on the feature integration theory of combining indi-
vidual feature maps corresponding to low level features in
images, as explained in [12]. The model suggested by Itti
makes use of the concept of a saliency map. First, the
input image is represented at multiple scales using a Gaussian
pyramid. Individual feature maps are extracted for each
scale, based on color, contrast and orientations. These are
then combined across the scales so as to have a single fea-
ture map for a given feature. The final saliency map, which
indicates the interestingness of each pixel in an image, is
obtained by a linear combination of the feature maps.
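The final combination step of such a model can be sketched as follows. The feature maps here are random stand-ins; a full pipeline would compute them with center-surround differences across the Gaussian pyramid:

```python
import numpy as np

# Itti-style final combination sketch: per-feature conspicuity maps
# (e.g., intensity, color, orientation) are normalized to a common range
# and summed linearly into one saliency map.

def normalize(m):
    lo, hi = float(m.min()), float(m.max())
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)

def saliency_map(feature_maps, weights=None):
    maps = [normalize(m) for m in feature_maps]
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)  # uniform linear combination
    return sum(w * m for w, m in zip(weights, maps))

rng = np.random.default_rng(0)
maps = [rng.random((48, 64)) for _ in range(3)]  # stand-in feature maps
S = saliency_map(maps)
y, x = np.unravel_index(S.argmax(), S.shape)     # most salient pixel
```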
In the other category of approaches, existing methods
that identify spatio-temporal interest points have largely
been purely vision-based. All functions and filters used in
such pure vision-based approaches are predefined and have
a bias towards certain features. While some of them have
been extensions of existing 2-D algorithms (such as the 3-D
Harris corner detector), there are others specifically meant
for the spatio-temporal domain. Initial work in detecting
spatio-temporal interest points was done by Laptev et al.
in [9]. In this approach, an extension of the Harris corner
detector [2] into the third dimension was proposed to detect
points that are spatio-temporal corners. The problem
with this algorithm was that the points were too sparse to
characterize videos. Since selected points had to be a spa-
tial as well as a temporal corner, very few were detected.
Another popular approach that has been used extensively in
action classification problems is the periodic detector [1].
The saliency function here measures the response from a
quadrature pair of Gabor filters in the temporal domain.
This approach is known to produce a high number of points
in regions having motion, including spatio-temporal cor-
ners. Though it primarily detects periodic motion, it re-
sponds to other kinds of motion as well. The approach sug-
gested by Oikonomopoulos et al. [10] is an extension of the
Kadir and Brady spatial interest detector [4] to the
temporal domain. This detector measures the information
content of pixels not only in the spatial, but also in the tem-
poral neighborhood. Spatio-temporal points are compared
using a metric based on the Chamfer distance. Another ap-
proach suggested by Ke et al. in [5] makes use of volumet-
ric features and video optical flow to detect motion. This is
based on the rectangular features used by Viola and Jones
[13]. Among the aforementioned detectors, the periodic de-
tector of Dollar et al. [1] has been the most popular choice.
A framework to learn temporal characteristics of inter-
est points in videos from eye movements was introduced
by Kienzle et al. in [6] and was based on the popular pe-
riodic detector [1]. A neural network model was used to
learn the temporal filters, instead of the quadrature pair of
1-D temporal Gabor filters used in the periodic detector. In
their approach, they recorded eye gaze data from several
viewers watching various short clips from a movie. A feed-
forward neural network model is used to learn the saliency
function. The filter coefficients are optimized using logis-
tic regression to fit the saliency model to the eye movement
data. The detector was validated on human action classifica-
tion. The results showed state-of-the-art accuracy and even
outperformed some of the previous methods. We integrate
a modified version of such a learning framework into the
saliency detection architecture suggested by Itti, to develop
machine learning models that learn saliency characteristics
for videos in a specific application. Such a model is capa-
ble of locating salient regions in videos based on the ap-
plication, rather than plainly selecting pixels having prede-
termined features such as high motion and complex spatial
textures, as in the case of existing approaches.
3. Experimental Setup
Before we describe the proposed learning methodology,
we present the details of the experimental setup for ease
of understanding. The Tobii 1750 eye tracker was used to
record eye movements. This device, as shown in Figure
1, is integrated into a 21-inch monitor. It tracks the eye
gaze of viewers as they use the monitor, at a sampling
frequency of 50 Hz. The device has an accuracy of
0.5 degrees. The eye tracking procedure involves an initial
calibration for each person. Once calibrated, the eye tracker
is capable of compensating for the user’s head movements,
thereby allowing a natural uninterrupted viewing procedure.
In this work, we used news videos to demonstrate the ap-
Figure 1. Tobii 1750 eye tracker
plicability of our approach. Real-world video clips were
downloaded from websites of popular channels, instead of
creating an artificial dataset. News videos were used for
this work since these videos generally have several regions
with motion, apart from those corresponding to the news
reader. These movements include people walking behind
the news reader, camera movement, and arbitrarily moving
design patterns in the background.
Eleven volunteers participated in this experiment. Each
of these volunteers was unaware of the purpose of the
experiment and was instructed to watch the news videos naturally. The
users were asked to sit at a normal viewing distance from
the monitor while watching the videos. The eye gaze infor-
mation of these volunteers was recorded for thirty-seven
different news video snippets. In between each of the videos,
a blank screen was displayed for four seconds so that the
initial eye movements during a video are not biased by the
previous video. The regions in the neighborhood of the pixels
viewed by the volunteers were labeled as positive samples;
all other regions correspond to the negative class. To
generate negative samples, arbitrary points were drawn from
the regions that were not viewed by the users. This procedure
produces samples whose spatio-temporal features did not
capture the attention of any of the users.
These positive and negative samples are used to train the
learning framework, described in the next section.
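The sampling procedure above can be sketched as follows. The 40-pixel radius defining a "viewed neighborhood" is an illustrative assumption; the paper does not state the window size it used:

```python
import random

# Labeling sketch: pixels near a recorded gaze point become positive
# samples; negatives are drawn at random from locations far from every
# gaze point, so their features did not attract any viewer's attention.

def label_samples(gaze_points, frame_w, frame_h, n_neg, radius=40, seed=0):
    rng = random.Random(seed)
    positives = list(gaze_points)  # one positive sample per gaze location
    negatives = []
    while len(negatives) < n_neg:
        x, y = rng.randrange(frame_w), rng.randrange(frame_h)
        # keep only points outside every viewed neighborhood
        if all((x - gx) ** 2 + (y - gy) ** 2 > radius ** 2
               for gx, gy in gaze_points):
            negatives.append((x, y))
    return positives, negatives

pos, neg = label_samples([(100, 120), (520, 300)], 640, 480, n_neg=50)
```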
Figure 2. Positive and negative training samples for an image
4. Learning Attention Based Saliency
The proposed learning framework is motivated by the
widely accepted Itti’s saliency detection approach [3], i.e.,
we use the focal concept of obtaining saliency maps from
biologically inspired low-level feature maps. However, in-
stead of thresholding generic filter responses of such low-
level features, we introduce a learning framework that can
capture the image characteristics of salient regions based on
the labeled data that is obtained by recording the eye gaze
of users as they are naturally watching videos for a specific
application.
As mentioned earlier, the concept of learning interest
points in videos using human eye movements was intro-
duced in [6]. However, this framework had several draw-
backs that motivated us to take up the proposed frame-
work. Primarily, the work of Kienzle et al. in [6] used
only temporal descriptors. However, temporal descriptors
are insufficient to characterize regions of interest in
videos (the figures presented in Section 4 illustrate this
point clearly). Spatial features must also be considered to
detect points that are truly of 'spatio-temporal' interest.
Furthermore, motion-based
saliency models are incapable of handling camera motion
and zoom. They require that camera induced motion be
subtracted before the algorithms can be applied. Inclusion
of spatial features makes our framework robust to camera
movements (as illustrated in Figure 4).
The low-level features used to construct maps in the visual
system of primates include color, intensity, orientation and
motion (as in [3]). In our approach, we construct descriptors
representing these low-level features for each training sample,
and train a neural network whose coefficients fit the training
labels. A window around each sample's pixel is used to
construct the descriptors. The feature descriptors include:
(i) Color intensity histograms: Three separate histograms
of the intensities corresponding to red, green and blue are
concatenated to obtain the color intensity histogram. Each
bin spans 20 intensity values; hence, each histogram has a
length of 13.
(ii) Gradient orientation histogram: The horizontal and
vertical gradients of each pixel in the window are calculated
using the masks [1, −1] and [1, −1]^T respectively. The
inverse tangent of their ratio gives the orientation of the
gradient. The contribution of each pixel to the histogram is
weighted by the magnitude of its gradient. Such a histogram
captures not only the orientation characteristics, but also
the intensity contrasts in the region. The histogram is of
length 18, with each bin spanning 10 degrees between 0 and
180 degrees.
(iii) Motion descriptor: This is similar to that suggested
in [6], and consists of the pixel intensities in the neighbor-
ing frames corresponding to the same spatial coordinates.
Such a vector characterizes the temporal features well, as
illustrated in their results.
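As an illustration, the gradient orientation histogram in (ii) can be computed as follows (a pure-Python sketch over a grayscale patch):

```python
import math

# Gradient-orientation histogram sketch: [1, -1] difference masks give
# the horizontal and vertical gradients; the inverse tangent of their
# ratio gives the orientation; each pixel's vote is weighted by its
# gradient magnitude. 18 bins of 10 degrees cover 0-180 degrees.

def orientation_histogram(patch):
    """patch: 2-D list of grayscale intensities."""
    hist = [0.0] * 18
    h, w = len(patch), len(patch[0])
    for y in range(h - 1):
        for x in range(w - 1):
            gx = patch[y][x + 1] - patch[y][x]  # [1, -1] mask
            gy = patch[y + 1][x] - patch[y][x]  # [1, -1]^T mask
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // 10), 17)] += mag  # magnitude-weighted vote
    return hist
```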
Once the feature vectors, F_s, for all the samples are
obtained, the saliency model is represented using a
feed-forward neural network:

O_s = b_{s0} + Σ_{i=1}^{k_s} α_{si} tanh(F_s · W_{si} + b_{si})    (1)
Here, W_{si} and α_{si} are the weights of the first and
second layers respectively, and b_{si} are the corresponding
bias parameters. The final saliency, S_s, is given by the
logistic sigmoidal activation function applied to the output
O_s:

S_s = 1/(1 + exp(−O_s))    (2)
The logistic function is known to produce good results for
binary classification, and is hence used. A further motivation
for this choice is that its inverse gives O_s a probabilistic
interpretation: O_s is the logit, i.e., the log of the ratio of
the class probabilities, ln(P(C = 1|F)/P(C = 0|F)), where
C = 1 and C = 0 represent the positive and negative classes
respectively. Most vision-based saliency functions, on the
other hand, have arbitrary ranges, which makes their
interpretation difficult. In order to train our saliency
model, we build individual neural networks for each of the
features.
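Equations (1) and (2) amount to a small forward pass, which can be sketched as follows; the weights in the usage example are toy values, not learned coefficients:

```python
import math

# Forward pass of the saliency model in Eqs. (1) and (2): a single
# hidden layer of tanh units, whose output O_s is squashed by a logistic
# sigmoid to give the saliency S_s in (0, 1).

def saliency(F, W, alpha, b0, b):
    """F: feature vector; W: list of k hidden-unit weight vectors;
    alpha: k output weights; b0: output bias; b: k hidden biases."""
    O = b0 + sum(a * math.tanh(sum(f * w for f, w in zip(F, Wi)) + bi)
                 for a, Wi, bi in zip(alpha, W, b))   # Eq. (1)
    return 1.0 / (1.0 + math.exp(-O))                 # Eq. (2)
```

For example, `saliency([0.5, -1.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0], 0.1, [0.0, 0.0])` returns a value strictly between 0 and 1, interpretable as the probability that the sample is salient.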
Once the weights of the neural network are trained, the
algorithm is ready to detect salient regions in similar videos
that belong to the same application. For a given frame in a
video sequence, saliency is predicted as follows. We construct
the individual feature maps, and the final saliency map is
obtained as a point-wise multiplication of these maps. In
doing so, we spatially filter out regions that are temporally
detected as interesting. Once the saliency map is obtained,
the interest points are those that correspond to local maxima
within their neighborhood.
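This prediction step (point-wise multiplication of the feature maps, then local-maxima selection) can be sketched as follows; the 3x3 neighborhood and the threshold are illustrative choices:

```python
import numpy as np

# Prediction sketch: per-feature saliency maps are combined by
# point-wise multiplication, and interest points are the local maxima
# of the combined map within a 3x3 neighborhood.

def combine_and_pick(maps, threshold=0.5):
    S = np.ones_like(maps[0])
    for m in maps:
        S = S * m  # point-wise multiplication of feature maps
    points = []
    h, w = S.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = S[y - 1:y + 2, x - 1:x + 2]
            if S[y, x] >= threshold and S[y, x] == patch.max():
                points.append((x, y))
    return S, points
```

The multiplication suppresses any location that scores poorly on even one feature, which is how spatially uninteresting motion gets filtered out.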
Figure 3. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video with moving patterns
in the background. (e) shows the fixations of users on the frame
Figure 4. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video with moving patterns
in the background. (e) shows the fixations of users on the frame
5. Results
The results for our experiments are presented from two
perspectives. In the first set of results, we present the ac-
curacy of predicting potential eye-gaze positions on test
videos using the individual neural networks. Secondly, we
visually illustrate the performance of our framework in de-
tecting salient pixels in videos.
(i) Eye Movement Prediction: The eye gaze data for
eleven participants was recorded for a set of 37 news videos.
The samples from the first 27 videos were used to train the
models. The positive and negative samples from the remain-
ing ten videos were used to test the performance of the clas-
sifiers. Classification results for the individual neural net-
works that were discussed in the previous section are shown
in Table 1. The performance of the motion descriptor also
indicates the performance of the approach mentioned in [6].
It is worth mentioning that the results indicate the perfor-
mance of the features for videos belonging to this specific
application alone.
Table 1. Prediction accuracy of the individual neural networks on
samples from the test videos. Note that the results for the motion
descriptor-based neural network are equivalent to the results
obtained using earlier work [6].

Neural Network               | Positive | Negative
Edge orientation histogram   | 88.3%    | 42.6%
Motion descriptor            | 73.2%    | 71.4%
Color histogram              | 82.4%    | 89.4%
Concatenated vector          | 82.4%    | 86.0%
(ii) Salient Point detection: Figures 3 and 4 illustrate
the performance of the algorithm using two different ex-
amples. The images to the left are the original frames.
These frames have different regions with motion. Figures
3(b) and 4(b) show the points detected using the periodic
detector. Figures 3(c) and 4(c) show the pixels predicted
by the algorithm in [7]. It is evident that both these detec-
tors, being motion-based, respond to all regions in the video
frame that have movements. The results obtained using our
approach in which the saliency map is obtained from indi-
vidual feature maps are presented in Figures 3(d) and 4(d).
Clearly, this framework is able to distinguish the salient re-
gions from the irrelevant ones having motion. Also, the
fixations of the users, as recorded by the eye tracker, are
shown in the images to the right, i.e., Figures 3(e) and 4(e). As can
be seen, the interest points detected by our approach bear a
high correlation to the fixations of the users in both the ex-
amples. The frame shown in Figure 5 is taken from a video
that was captured using a moving camera. As expected, the
pure motion-based approaches detect motion in all parts of
the scene, and in turn predict interest points from all re-
gions. Here, we see that our learned detector is robust to
the camera movements. It is evident from the examples that
our framework ensures interest points have relevant spatial
content along with temporal content. In addition, we also
experimented with the proposed framework to design a sin-
gle neural network that was built using a concatenated fea-
ture vector (using all the considered features). However, we
found that this approach did not perform as well as using
individual neural networks for each feature. We believe that
this may have been due to the high dimensionality of the
resulting concatenated feature vector, and we plan to study
this further in our future work.

Figure 5. Interest points detected using (b) periodic detector, (c) detector in [6] and (d) our learned detector on a video taken using a moving hand-held video camera

Another point to be noted is that the saliency model detects
regions based on those that are commonly viewed by all the
users of the experiment. For example, in this work, we found that all the
volunteers in our experiments viewed the faces of the news-
casters during the video. However, there might be a case
where some users view other regions (possibly the text on
the screen). In such a scenario, the learning framework will
learn features based on the most commonly viewed regions.
6. Conclusion and Future Work

Detecting saliency in videos has emerged as an important
component of many video analysis problems. While
traditional approaches in the domain tend to be vision-
based, we have proposed a novel machine-learning frame-
work that is trained to detect spatio-temporal interest points
based on the interest of viewers, as captured by human
eye movements. Our results have shown that the learned
detector is capable of predicting the most relevant spatio-
temporal interest points, even in the presence of background
movements. This approach is non-parametric and is also
capable of handling camera-induced motion in videos. A
seeming limitation (which is also an important feature of
the approach) is that the algorithm will have to be trained
individually for each application to detect relevant interest
points. However, as mentioned earlier, state-of-the-art eye
tracking devices can be integrated into monitors and are ca-
pable of unobtrusively tracking eye gaze as users are freely
viewing videos. This way, our framework can be easily
incorporated while learning saliency in the videos
corresponding to the users' application. In future work, we plan
to study the choice of more feature descriptors that provide
the best possible performance in learning the saliency func-
tions. In addition, we also intend to include a multi-scale
approach so as to be able to handle interest regions at dif-
ferent scales.
References

[1] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005. 2nd Joint IEEE International Workshop on, pages 65-72, 2005.
[2] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50, 1988.
[3] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
[4] T. Kadir and M. Brady. Scale saliency: A novel approach to salient feature and scale selection. In Visual Information Engineering, 2003. VIE 2003. International Conference on, pages 25-28, 2003.
[5] Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Tenth IEEE International Conference on Computer Vision, 2005. ICCV 2005, volume 1, 2005.
[6] W. Kienzle, B. Scholkopf, F. Wichmann, and M. Franz. How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. Lecture Notes in Computer Science, 4713:405, 2007.
[7] W. Kienzle, F. Wichmann, B. Scholkopf, and M. Franz. Learning an interest operator from human eye movements. In IEEE Conference on Computer Vision and Pattern Recognition, volume 17, page 22, 2006.
[8] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4(4):219-227, 1985.
[9] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107-123, 2005.
[10] A. Oikonomopoulos, I. Patras, and M. Pantic. Human action recognition with spatiotemporal salient points. IEEE Transactions on Systems, Man and Cybernetics - Part B, 36(3):710-719, 2006.
[11] F. Stentiford. Attention-based similarity. Pattern Recognition, 40(3):771-783, 2007.
[12] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97-136, 1980.
[13] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of CVPR 2001, volume 1, 2001.