
Camera Motion-Based Analysis of User Generated Video

Golnaz Abdollahian, Student Member, IEEE, Cuneyt M. Taskiran, Member, IEEE, Zygmunt Pizlo, and Edward J. Delp, Fellow, IEEE

Abstract—In this paper we propose a system for the analysis of user generated video (UGV). UGV often has a rich camera motion structure that is generated at the time the video is recorded by the person taking the video, i.e., the camera person. We exploit this structure by defining a new concept known as camera view for temporal segmentation of UGV. The segmentation provides a video summary with unique properties that is useful in applications such as video annotation. Camera motion is also a powerful feature for identification of keyframes and regions of interest (ROIs) since it is an indicator of the camera person's interests in the scene and can also attract the viewer's attention. We propose a new location-based saliency map which is generated based on camera motion parameters. This map is combined with other saliency maps generated using features such as color contrast, object motion, and face detection to determine the ROIs. In order to evaluate our methods we conducted several user studies. A subjective evaluation indicated that our system produces results that are consistent with viewers' preferences. We also examined the effect of camera motion on human visual attention through an eye tracking experiment. The results showed a high dependency between the distribution of fixation points of the viewers and the direction of camera movement, which is consistent with our location-based saliency map.

Index Terms—Content-based video analysis, eye tracking, home video, motion-based analysis, regions of interest, saliency maps, user generated video, video summarization.

    I. INTRODUCTION

Due to the availability of online repositories, such as YouTube, and social networking sites, there has been a tremendous increase in the amount of personal video generated and consumed by average users. Such video is often referred to as user generated video (UGV) as opposed to produced video, which is produced and edited by professionals, e.g., television programs, movies, and commercials. The large amount of user generated content available has increased the need for compact representations of UGV sequences that are intuitive for users and let them easily and quickly browse large collections of video data, especially on portable devices such as mobile phones.

Manuscript received December 19, 2008; revised October 06, 2009. First published November 13, 2009; current version published December 16, 2009. This work was supported by a Motorola Partnerships in Research Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Qibin Sun.

G. Abdollahian and E. J. Delp are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]; [email protected]).

C. M. Taskiran is with the Application Research and Technology Center, Motorola, Inc., Schaumburg, IL 60196 USA (e-mail: [email protected]).

Z. Pizlo is with the Department of Psychological Sciences, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2009.2036286

Due to its characteristics, when compared to produced video, analyzing UGV presents unique challenges. Produced video has a rich shot-scene-sequence syntactic structure that is created by video editors who follow a well-known set of editing guidelines that are genre specific. In this type of video, properties of shots, e.g., types of camera views and distribution of shot durations, are related to the content and can be exploited to provide clues about the relative importance of a shot and its content. Therefore, most video analysis algorithms use the video shot as the basic building block and start by segmenting the video into shots [1], [2].

UGV sequences are typically unedited and unstructured, where each UGV clip can be considered as a one-shot video sequence determined by the camera start and stop operations. The lack of a well-defined syntactic structure precludes the use of most content-based video analysis approaches for UGV.

However, UGV generally has a rich camera motion structure that is generated by the person taking the video, i.e., the camera person, who edits the video in real time by moving the camera; e.g., rather than having a general shot and then cutting to an interesting object, a camera zoom or a pan to the object is used. Such content-based large camera motion is not typical in produced video, where the same effect is obtained through video editing.

In this paper we propose a system for UGV analysis that exploits camera motion for temporal segmentation and keyframe selection, and also combines it with other features for identification of regions of interest (ROIs). While one could use many other properties or features to analyze UGV, the goal of this paper is to examine what can be done by focusing mainly on the camera motion.

The temporal segmentation algorithm proposed here makes use of camera motion by dividing the video into camera views, which we define as the basic units of UGV. The video is segmented into the different views that the camera observes during displacement or change of viewing angle with respect to the scene. This segmentation has the special property that by selecting at least one frame from each segment for the video summary, we obtain a notion of all the scenes captured by the camera. Therefore, even with a simple keyframe selection strategy, the video summary will cover all the camera views. This property does not hold for methods that are based on other low-level features such as color, e.g., in [3]. In some video sequences the color distribution does not have significant variations throughout the video. There may be a few colors dominating the color distribution in all the views. Segmentation or summarization methods that are based on measuring the difference between the color distributions of frames fail to capture all the scenes in this kind of video sequence. Our segmentation approach measures the displacement of the camera and is capable of detecting changes in the camera view. This method of summarization is particularly useful for video annotation, where we can associate location-based tags with the video based on the backgrounds and views without missing any scenes.

The main application of our system is for mobile devices, which have become more popular for recording, sharing, downloading, and watching UGV. Having an efficient video representation is even more crucial for these devices due to their limited bandwidth, battery life, and storage. Therefore, our goal is to use computationally efficient methods in each part of the system. Another challenge in dealing with video content on mobile devices is displaying the video or its summary on a small screen. Most video summarization approaches, especially the ones that represent video as hierarchical trees and storyboards [4]–[7], are not suitable for such devices. Therefore, we employ a human attention model to extract saliency information within keyframes and identify ROIs, i.e., regions that capture the attention of a viewer while watching the video. In this way, the results are more consistent with the viewer's perception without using semantic analysis. We propose a new location-based saliency map which uses camera motion information to determine the saliency values of pixels with respect to their spatial location in the frame. This map is combined with saliency maps generated based on other cues, such as color contrast, local motion, and face detection, to extract ROIs in the keyframes.

Several user studies were conducted to verify our hypotheses and methods. A subjective study was done to validate our end-to-end system. This study showed a high correlation between the output of our system and the user-selected results. In addition, we explored the effect of camera motion on the human visual system through an eye tracking experiment in which we recorded the eye movements of subjects while they watched videos with various types of camera motion. The experiment showed how different types of camera motion led to different patterns in the distribution of eye fixation points, indicating a high dependency between the direction of camera motion and visual saliency in video.

Our proposed system is illustrated in Fig. 1. The output of the system is a set of keyframes with highlighted ROIs as a summarization of the input video sequence. First, the global motion is classified and the frames with intentional camera behavior are identified (Section III). Motion information is then used to temporally segment the video into camera views (Section IV). The most representative frames are selected from each view; our strategy for extracting these keyframes as a form of video summarization is described in Section V. To explore the information structure within the keyframes, we consider several factors to develop saliency maps for the extracted keyframes (Section VI). These maps are used to extract regions of interest and highlight them in the presented keyframes (Section VI-E). Experimental results and the user studies are reported in Section VII.

    Fig. 1. Overview of the proposed user generated video analysis system.

    II. PREVIOUS WORK

The first task in most video content analysis systems is extracting the syntactic structure of the video in the temporal domain and segmenting the sequence into smaller units that are easier to manage. In the majority of these systems, which are designed to work on produced video, shots are considered to be the basic elements of the video sequence. Shots can be further segmented into sub-shots or can be clustered and organized to form scenes. Several methods have been proposed for shot boundary and scene change detection [1]–[3], [5], [6], [8]–[11]. These methods use various features to measure the similarity between frames or shots; for example, color histograms [3], [4], [9], [10], edge information [4], [5], motion information [9], luminance projections [10], or a combination of these features [2], [4], [9].

Other approaches proposed for video segmentation and keyframe selection employ the clustering of video frames [2], [12]. In these methods, frames are represented in a high-dimensional vector space using low-level visual features, and a distance metric is used to measure the similarity between the clusters. After clustering, the centroid of each cluster is chosen to form the keyframe summarization of the video. For example, in [2] the frames are clustered in a bottom-up fashion by iteratively merging the closest clusters.

Video summarization is an important video analysis task that exploits the extracted structure of video and aims to make videos easier to manage and more enjoyable to watch by decreasing redundancy. A great deal of research has been done in this area [13]. There are several ways to represent a video summary, including mosaic images [7], [14], a set of keyframes [2], [10], [12], key objects [15], or a reduced-length video [16]. The crucial task in video summarization is identifying what is important or salient in a video. Finding a generic definition for visual importance is difficult due to the lack of tools for modeling human cognition and affection. As a result, many researchers have limited their attention to specific domains, such as news [17], [18] or sports [19]–[21] programs that have predefined events as their highlights.

Some proposed schemes for identifying highlights in UGV sequences are semi-automatic and depend on manual selection of important segments [4], [15], [22]. For example, the system in [22] allows users to select their favorite clips in a sequence and the desired total output length, and provides the appropriate boundaries for each clip based on the suitability of the analyzed video material and the requested clip length.

Some recent approaches use human attention models to define visual saliency or importance based on what captures a viewer's attention while watching a video [23]–[25]. Human attention models are used to create saliency maps that indicate how viewers are visually attracted to different regions in a scene and identify the ROIs. There are several applications that use ROIs for image and video adaptation to smaller screens. One such application is described in [26], where ROIs are used to model the information structure within the frames and generate browsing paths. Chen et al. [27] use a human attention model to create different cropped and scaled versions of images according to the screen size. Several factors such as contrast, size, location, shape, faces, foreground/background, and local motion activity can influence visual attention and have been used to identify ROIs in images and videos [23], [28]–[31]. Most previous attention models have not considered camera motion as an independent factor in identifying visual saliency within a frame.

For example, in [25] the global motion is subtracted from the macroblock motion vectors to obtain relative motion, which is used for generating temporal saliency maps. Ma et al. [24] use a camera attention model to assign an attention factor to each frame based on the type and speed of camera motion, but this factor is the same for all pixels within a frame. As we will show in this paper, camera motion plays an important role in attracting viewers' attention while watching a video. This feature has previously been used in video summarization approaches as an indicator of the camera person's level of interest [32], [33]. Here, we will show that this feature also has a major influence on viewers' visual attention and therefore can be used as a powerful tool for UGV content analysis.

    III. MOTION-BASED FRAME LABELING

    A. Global Motion Estimation

In the majority of UGV, camera motion is limited to a few operations, e.g., pan, tilt, and zoom; more complex camera movements, such as rotation, rarely occur in UGV. Several motion models and estimation methods have been proposed in the literature for global motion estimation and camera motion characterization [34]–[36]. However, our goal here is to be computationally efficient in order to target devices with low processing power, such as mobile devices. Therefore, we use a simplified three-parameter global camera motion model covering the three major directions: horizontal ($t_x$), vertical ($t_y$), and radial ($z$). This model adequately describes the majority of the camera motion we have observed in UGV. The motion model is defined as

$$x' = z\,x + t_x, \qquad y' = z\,y + t_y \qquad (1)$$

where $(x, y)$ and $(x', y')$ are the matched pixel locations in the current and reference frame, respectively. The Integral Template Matching algorithm [37] is used to estimate the motion parameters. In this method a template, $T$, in the current frame is matched against the previous frame using a full search in a predefined search window. The template is illustrated as the white region in Fig. 2. The central part of the frame is excluded from the template to avoid the effect of object motion close to the frame center. This is motivated by the observation that if a moving object is of interest to the camera person, it will most likely be close to the frame center, and if it is not, it usually does not stay in the camera view for long. Such local motion significantly decreases the accuracy of the estimation, especially for the radial motion parameter. Pixels close to the frame boundary are also removed from the template because of errors in the pixel values due to compression and camera aberration.

Fig. 2. Template used for motion parameter estimation is shown as the white area.

Fig. 3. Decision tree used to label video frames.

The parameters $t_x$, $t_y$, and $z$ are estimated by minimizing the distance between the 2-D template in the current frame, $T_n$, and the previous template transformed using the parameters $t_x$, $t_y$, and $z$, denoted by $T_{n-1}(t_x, t_y, z)$:

$$(\hat{t}_x, \hat{t}_y, \hat{z}) = \arg\min_{t_x, t_y, z} \sum_{(x, y) \in T} \left| T_n(x, y) - T_{n-1}(x, y; t_x, t_y, z) \right| \qquad (2)$$

In order to accelerate the template matching process, we estimate the initial values for $t_x$ and $t_y$ using the Integral Projection method [38]. In this technique, the 2-D intensity matrix of each frame is projected onto two 1-D vectors in the horizontal and vertical directions. The projections in each direction are matched between consecutive frames, resulting in initial estimates for the translational parameters $t_x$ and $t_y$. A parabolic fitting is then performed to refine these values to sub-pixel resolution. The template matching starts from this initial point and iterates through the values in the search window. The iteration stops when a local minimum is found.
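The initialization step can be illustrated with a short Python/NumPy sketch. It only shows the integral projection idea (matching 1-D row and column intensity profiles of consecutive frames to seed $t_x$ and $t_y$); the function names, the exhaustive 1-D search, and the search range are our own choices, and the parabolic sub-pixel refinement used in the paper is omitted.

```python
import numpy as np

def integral_projections(gray):
    """Project a 2-D intensity image onto 1-D horizontal and vertical profiles."""
    return gray.sum(axis=0), gray.sum(axis=1)   # column sums, row sums

def match_1d(p_cur, p_ref, max_shift=16):
    """Find the integer shift that best aligns two 1-D projections (SAD criterion)."""
    best_shift, best_cost = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        a = p_cur[max(0, s):len(p_cur) + min(0, s)]
        b = p_ref[max(0, -s):len(p_ref) + min(0, -s)]
        cost = np.mean(np.abs(a - b))
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift

def initial_translation(gray_cur, gray_ref, max_shift=16):
    """Initial (t_x, t_y) estimate used to seed the template matching search."""
    cols_c, rows_c = integral_projections(gray_cur.astype(np.float64))
    cols_r, rows_r = integral_projections(gray_ref.astype(np.float64))
    tx = match_1d(cols_c, cols_r, max_shift)   # horizontal shift from column profiles
    ty = match_1d(rows_c, rows_r, max_shift)   # vertical shift from row profiles
    return tx, ty
```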

    B. Motion Classification

After the camera motion parameters are estimated, video frames are labeled based on the type of motion as a preprocessing step in the analysis. Camera motion in UGV usually contains both intentional motion and unintentional motion, such as shaky and fast motion. While intentional camera motion provides valuable cues about the relative importance of a segment, unintentional motion decreases the perceived quality of the video and can be misleading in the analysis. In our system, video frames are classified into one of four classes using the decision tree structure shown in Fig. 3.

Due to their superior performance, we used support vector machine (SVM) classifiers [39] at each decision node of this tree. The LIBSVM library, an open source software library for SVM training and classification, was used [40]. Forty video sequences, each a few minutes in duration, were manually labeled and used as the training data. These sequences were all UGVs recorded by several users and contained different types of camera motion, including both normal and abnormal motion. The video sequences covered a wide variety of outdoor and indoor scenes, e.g., playgrounds, parks, university campuses, beaches, museums, ranches, and the interiors of various buildings.

Fig. 4. Automatically labeled frames. In the top two rows, each row consists of four consecutive frames classified in the same motion group. The bottom rows are a set of eight consecutive frames labeled as shaky motion.

In order to label a frame, we first classify it as having a zoom or not, using the 3-D motion vector $(t_x, t_y, z)$ as the feature vector. Then, blurry frames caused by fast camera motion and shaky segments caused by frequent changes of camera motion are detected sequentially. For these two decisions we use a method similar to [41], where SVM classifiers are trained on an eight-dimensional feature vector derived from the parameters $t_x$ and $t_y$ over a temporal sliding window. The eight features are the average velocity, average acceleration, variance of acceleration, and average number of direction changes in the vertical and horizontal directions over the sliding window. The size of the sliding window is different for blurry and shaky motion classification. This is due to the fact that fast camera motion that causes blurriness may last for just a few frames, whereas shaky behavior occurs over a period of time in which the camera changes direction frequently; to detect this behavior, camera motion must be monitored over a relatively longer time interval. The appropriate window sizes for blurry and shaky motion detection were determined experimentally. Frames that are not labeled as zoom, blurry, or shaky are identified as stable motion with no zooms. Some examples of video frames that are automatically labeled by this approach are shown in Fig. 4.
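A minimal sketch of the window-level feature extraction is given below, assuming the per-frame parameters $t_x$ and $t_y$ are already available from Section III-A. The feature ordering and helper names are ours; the paper trains SVMs on these features with LIBSVM, which is not shown here.

```python
import numpy as np

def motion_features(tx, ty):
    """Eight features from per-frame translational parameters over one window:
    for each of the x and y directions: mean velocity, mean acceleration,
    variance of acceleration, and average number of direction changes."""
    feats = []
    for v in (np.asarray(tx, float), np.asarray(ty, float)):
        a = np.diff(v)                                   # frame-to-frame acceleration
        sign_changes = np.sum(np.diff(np.sign(v)) != 0)  # direction reversals
        feats += [v.mean(),
                  a.mean(),
                  a.var(),
                  sign_changes / max(len(v) - 1, 1)]
    return np.array(feats)

def sliding_windows(tx, ty, win):
    """Feature vectors for every window of length `win` (one per start position)."""
    return np.array([motion_features(tx[i:i + win], ty[i:i + win])
                     for i in range(len(tx) - win + 1)])
```

The resulting feature vectors can be fed to any SVM implementation (LIBSVM in the paper) trained on the manually labeled sequences.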

IV. TEMPORAL VIDEO SEGMENTATION BASED ON THE USE OF CAMERA VIEW

In order to address the problem of temporal video segmentation of UGV sequences, we propose a new segmentation approach based on camera views. First, we define some terminology used below. Two frames are considered to be correlated if they overlap with each other, i.e., at least a common part of the background is visible in both frames. A camera view is a temporal concept defined as a set of consecutive frames that are all correlated with each other. We occasionally use the term view to refer to a camera view. The view transitions, or view boundaries, which occur when the camera is displaced or there is a change of viewing angle, are detected to temporally segment the video. The boundaries are selected such that the frame right before a segmented camera view and the one right after it are not correlated. At least one frame from each view is selected to be present in the video summary, as described in the next section. With this setup, the summary provides the user with a notion of all the scenes captured by the camera. This characteristic of the video summary is particularly useful for applications such as video annotation [42].

In order to detect the camera view boundaries, we define the displacement vector between frames $i$ and $j$ as $D(i, j) = (d_x, d_y, d_z)$, where $d_x$, $d_y$, and $d_z$ are the total horizontal, vertical, and radial interframe displacements given by

$$d_x(i, j) = \frac{1}{W} \sum_{n=i+1}^{j} t_x^{(n)}, \qquad d_y(i, j) = \frac{1}{H} \sum_{n=i+1}^{j} t_y^{(n)}, \qquad d_z(i, j) = \frac{1}{K} \sum_{n=i+1}^{j} \left( z^{(n)} - 1 \right) \qquad (3)$$

where $t_x^{(n)}$, $t_y^{(n)}$, and $z^{(n)}$ are the motion parameters defined in Section III-A for frame $n$. The displacement values in the $x$ and $y$ directions are normalized to the frame width, $W$, and height, $H$, respectively; these values are the minimum displacements needed to make the two frames uncorrelated with each other. The displacement caused by radial motion is not the same for all the pixels in the frame: the displacement of a point $(x, y)$ measured from the frame center is $(z - 1)(x, y)$, and the average displacement of the pixels in each quarter of the frame is $(z - 1)W/4$ in the horizontal direction and $(z - 1)H/4$ in the vertical direction. Dividing the horizontal value by $W$ and the vertical value by $H$ results in $(z - 1)/4$ in each direction; thus, we set $K$ to 4 in (3). The translational displacements $d_x$ and $d_y$ are updated when there are no zooms, and the radial displacement $d_z$ is updated only during zooms. This is due to the fact that the estimates for $t_x$ and $t_y$ could be inaccurate during a zoom.

Segmentation of a video sequence into different camera views can be considered as the selection of a set of view boundary frames, $\{b_1, b_2, \ldots, b_N\}$, one for each of the $N$ segments. The view boundary frames are identified as follows: starting from the beginning of the sequence, a boundary frame is flagged whenever the magnitude of the displacement vector between the current frame and the previously detected boundary frame becomes larger than 1, which indicates that the two frames have almost no overlap. This procedure is repeated until the end of the video clip is reached. We place a constraint on the selection such that a boundary frame cannot be chosen during intervals labeled as blurry segments in the motion classification step, since we do not want the video summary to include blurry frames. Hence, boundaries that fall inside blurry regions are relocated to the end of the blurry segment. Fig. 5 shows the 3-D displacement curves for two UGVs. Each point on the curve is a frame number, $n$, mapped into the displacement space using the displacement vector. The red asterisks indicate the corresponding view boundaries detected by the algorithm.

Fig. 5. Displacement curves and camera view boundaries indicated by red asterisks for two different UGVs.
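The boundary detection can be sketched as follows. The accumulation of normalized displacements, the threshold of 1, and the handling of the radial term follow our reading of the partly illegible equations above, so the constants and the per-frame `labels` interface should be treated as assumptions rather than the authors' exact implementation.

```python
import numpy as np

def view_boundaries(tx, ty, z, labels, W, H, K=4.0, thresh=1.0):
    """Flag camera-view boundaries by accumulating normalized displacement
    since the last boundary.  tx, ty, z are per-frame motion parameters;
    labels holds the per-frame motion class ('zoom', 'blurry', 'shaky', 'stable')."""
    boundaries = [0]
    d = np.zeros(3)                      # accumulated (d_x, d_y, d_z)
    for i in range(1, len(tx)):
        if labels[i] == 'zoom':
            d[2] += (z[i] - 1.0) / K     # radial term updated only during zooms
        else:
            d[0] += tx[i] / W            # translational terms otherwise
            d[1] += ty[i] / H
        # a boundary is never placed inside a blurry run; it is postponed until
        # the first non-blurry frame after the run
        if np.linalg.norm(d) >= thresh and labels[i] != 'blurry':
            boundaries.append(i)         # frames no longer overlap: new camera view
            d[:] = 0.0
    return boundaries
```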

    V. KEYFRAME SELECTION

Many different types of video summary representations, collectively known as summary visualizations, have been proposed [2], [7], [10], [12]–[16]. The two most common types of summary visualizations are video abstracts, which are based on one or more keyframes extracted from each segment, and video skims, where a shorter summary video is generated from selected parts of the original sequence. In this section we describe our approach for selecting keyframes in UGVs in order to produce video abstracts. Moreover, these frames are used in other steps of the analysis, such as the identification of ROIs in a scene.

In order to represent the segment it is extracted from, a keyframe should be the frame with the highest subjective importance in the segment. However, defining relative importance in scenes with diverse content is a difficult problem. As mentioned in the previous section, the segmentation has the property that by using a simple strategy for keyframe selection, the video summary will cover almost all the views captured in the video. Thus, we would like to avoid complicated algorithms for this purpose, but at the same time we would like our results to be consistent with what the camera person and the viewers find more interesting.

For this purpose, we conducted a preliminary experiment in which we asked a number of users to select a set of representative frames from several UGV clips with durations of 15–120 s. The clips were first segmented using our camera view segmentation algorithm. The users had access to the individual video frames. We gave the following statement to the users: "If you wanted to summarize the important content of the video segment in a minimum number of frames, which frame(s) would you choose?" The users also had the option not to select any frames if they thought the segment contained insignificant information. Moreover, the subjects were asked to provide the reason why they selected each of the keyframes. Based on the collected information we formed some initial conclusions. In general, the selected keyframes had the following properties: the frames were picked when there was a nice view of the objects of interest, when a new action occurred, after a zoom-in (close-up view), after a zoom-out (overall view), when a new semantic object entered the camera view (text, people, buildings), when there was a pause after a continuous camera movement, or when there was a significant change in the background. The users did not select any frames from segments that were highly blurred or shaky, or when the camera motion did not change and the scene was static, e.g., the camera pans to the left over a homogeneous background.

Since our intention was to avoid the complex tasks of object and action recognition in our system, our keyframe selection strategy is based only on camera motion. This way the results are compatible with viewers' preferences as well as with what the camera person deems important while shooting the video through intentional camera motion. The following frames are selected as keyframes (a code sketch of these rules follows the list):

  • The frame after a zoom-in is usually a close-up view of an object of interest and is included in the summary.

  • The frame after a large zoom-out is chosen as a keyframe to give an overview of the scene.

  • A camera move-and-hold is another indicative pattern, and the frame where the camera pauses is usually emphasized.

  • For segments during which the camera has constant motion, all frames are considered to be of relatively equal importance. In this case, the frame closest to the middle of the segment with the least amount of motion is chosen as the keyframe in order to minimize blurriness.
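A compact sketch of these rules is shown below. It assumes each camera-view segment already carries per-frame motion labels and magnitudes from Section III; the segment dictionary layout and the pause-detection thresholds are illustrative placeholders, not values from the paper.

```python
def select_keyframe(segment):
    """Pick one keyframe index for a camera-view segment, following the
    camera-motion rules above.  `segment` is assumed to provide per-frame
    motion labels and per-frame motion magnitudes."""
    frames = segment['frames']                 # list of frame indices
    labels = segment['labels']                 # per-frame motion class
    mag    = segment['motion_magnitude']       # per-frame translational/zoom magnitude

    # Rules 1-2: the frame right after a zoom (zoom-in close-up or zoom-out overview)
    for i in range(1, len(frames)):
        if labels[i - 1] == 'zoom' and labels[i] != 'zoom':
            return frames[i]

    # Rule 3: camera move-and-hold -- first frame where the camera pauses after motion
    for i in range(1, len(frames)):
        if mag[i - 1] > 0.5 and mag[i] < 0.05:   # thresholds are illustrative only
            return frames[i]

    # Rule 4: otherwise, the frame nearest the segment middle with the least motion,
    # to minimize blurriness
    mid = len(frames) // 2
    window = range(max(0, mid - 5), min(len(frames), mid + 6))
    best = min(window, key=lambda i: mag[i])
    return frames[best]
```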

    VI. KEYFRAME SALIENCY MAPS AND ROI EXTRACTION

In order to extract higher-level information from the keyframes, we employ a human attention model to obtain saliency information within the keyframes. This information is used to identify ROIs, i.e., regions that capture the attention of a viewer while watching the video. To identify the ROIs, we first need to generate the keyframe saliency maps. A saliency map or importance map $S$ of a frame indicates how much a viewer is visually attracted to different regions in the frame. Studies based on eye movements have identified several factors that influence visual attention, such as motion activity, contrast, size, shape, faces, and spatial location [23], [28], [29]. In this paper, we propose a new location-based saliency map that employs camera motion information. We combine this map with a color contrast saliency map, a moving objects saliency map, and highlighted faces to generate the keyframe saliency maps.

    A. Color Contrast Saliency Map

One of the important factors that causes a region to stand out and be more noticeable is the contrast of that region compared to its neighborhood. This includes contrast in both luminance and color. In [29], Ma and Zhang used the color components in LUV space as the stimulus in order to generate saliency maps for images. We use a technique similar to [29] in our system; however, we use the RGB color space to generate the contrast-based saliency map. Our experiments showed better results in this space compared to using only the UV components, since it combines the luminance and color contrasts.

First, the three-dimensional pixel vectors in RGB space are clustered into a small number of color vectors using the generalized Lloyd algorithm (GLA) for vector quantization [43]. A fixed number of clusters is used to avoid the additional computational complexity of finding the optimum number of clusters, since this number does not have a major effect on the quality of the results. Based on the results using our video database, we found 32 colors to be sufficient for our purpose. The color-quantized frame is then downsampled by a factor of 16 in each dimension in order to reduce the computational complexity. Finally, the contrast-based saliency value for each pixel $i$ in the downsampled image is obtained by

$$S_c(i) = \sum_{j \in \Theta_i} d(p_i, p_j) \qquad (4)$$

where $p_i$ and $p_j$ are the RGB pixel values, $\Theta_i$ is the neighborhood of pixel $i$, and $d$ is the Gaussian distance. In our experiments a 5 × 5 neighborhood was used. Fig. 6 shows an example of the contrast-based saliency map for an extracted keyframe.

Fig. 6. Example of contrast-based saliency map for a keyframe. (a) Original keyframe. (b) Color-quantized frame. (c) Enlarged downsampled frame. (d) Saliency map.
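The contrast map computation can be sketched as follows. We use SciPy's k-means as a stand-in for the generalized Lloyd algorithm and a plain Euclidean color difference in place of the Gaussian distance, so this is only an approximation of the map described above.

```python
import numpy as np
from scipy.cluster.vq import kmeans2   # stand-in for the generalized Lloyd algorithm

def contrast_saliency(frame_rgb, n_colors=32, down=16, nbhd=2):
    """Contrast-based saliency: quantize colors, downsample, then sum the
    color distance of each pixel to its (2*nbhd+1)^2 neighborhood."""
    h, w, _ = frame_rgb.shape
    pixels = frame_rgb.reshape(-1, 3).astype(np.float64)
    centroids, labels = kmeans2(pixels, n_colors, minit='points')
    quant = centroids[labels].reshape(h, w, 3)

    small = quant[::down, ::down]                    # downsample by 16 in each dimension
    sh, sw, _ = small.shape
    sal = np.zeros((sh, sw))
    for y in range(sh):
        for x in range(sw):
            y0, y1 = max(0, y - nbhd), min(sh, y + nbhd + 1)
            x0, x1 = max(0, x - nbhd), min(sw, x + nbhd + 1)
            diff = small[y0:y1, x0:x1] - small[y, x]
            sal[y, x] = np.sqrt((diff ** 2).sum(axis=-1)).sum()
    return sal / sal.max() if sal.max() > 0 else sal
```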

    B. Moving Object Saliency Map

Regions of a frame with significant motion with respect to the background can also attract attention. To determine the moving object saliency map, we examine the magnitude and phase of macroblock relative motion vectors. Macroblock (MB) motion vectors, $\vec{v}(i, j)$, are generated using a full search algorithm for macroblocks of size 16 × 16 pixels. The global motion parameters $t_x$ and $t_y$ are subtracted from the motion vectors to compensate for global motion:

$$\vec{v}_r(i, j) = \vec{v}(i, j) - (t_x, t_y) \qquad (5)$$

where $\vec{v}_r(i, j)$ represents the relative motion vector for the macroblock at location $(i, j)$. Since motion vectors in flat regions are usually erroneous, we apply a threshold on the average norm of the gradient for each MB and assign zero relative motion to the MBs with values below the threshold. The motion intensity and motion phase entropy maps are then generated similarly to [44]. The motion intensity $I(i, j)$ and motion phase $\phi(i, j)$ for $\vec{v}_r(i, j) = (v_x, v_y)$ are defined as

$$I(i, j) = \sqrt{v_x^2 + v_y^2} \qquad (6)$$

$$\phi(i, j) = \arctan\!\left(\frac{v_y}{v_x}\right) \qquad (7)$$

The phase entropy map, $E$, indicates the regions with inconsistent motion, which usually belong to the boundary of the moving object. The value of $E$ for each MB is determined as follows. First, an eight-bin phase histogram is obtained for a sliding window of size 5 × 5 MBs at the location of the MB. The phase entropy at $(i, j)$ is then

$$E(i, j) = -\sum_{k=1}^{8} p_k \log p_k \qquad (8)$$

where $p_k$ is the probability of the $k$th phase, whose value is estimated from the histogram. Combining the intensity and phase entropy maps results in a moving objects saliency map:

$$S_m(i, j) = I(i, j) \cdot E(i, j) \qquad (9)$$

Fig. 7 illustrates an example of a moving objects saliency map for a video frame. Part (a) illustrates the estimated motion vectors before global motion compensation. Parts (b) and (c) are the maps generated based on the magnitude and phase entropy of the motion vectors, respectively. Combining these two maps results in the moving objects saliency map in part (d).

Fig. 7. Example of moving object saliency map. (a) Motion vectors (not compensated). (b) Compensated motion intensity map. (c) Phase entropy map. (d) Final moving objects saliency map.
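A sketch of the moving object saliency computation is given below, assuming the macroblock motion vectors and per-macroblock gradient magnitudes have already been computed (block matching is not shown). The gradient threshold and the product combination of intensity and phase entropy follow our reconstruction of (5)–(9).

```python
import numpy as np

def moving_object_saliency(mv, tx, ty, grad_mag, grad_thresh=10.0):
    """mv: (rows, cols, 2) macroblock motion vectors; (tx, ty): global motion;
    grad_mag: average gradient magnitude per macroblock (to reject flat blocks)."""
    rel = mv - np.array([tx, ty])                 # compensate global motion, eq. (5)
    rel[grad_mag < grad_thresh] = 0.0             # unreliable vectors in flat regions

    intensity = np.linalg.norm(rel, axis=-1)      # eq. (6)
    phase = np.arctan2(rel[..., 1], rel[..., 0])  # motion direction per macroblock

    rows, cols = intensity.shape
    entropy = np.zeros_like(intensity)
    bins = np.linspace(-np.pi, np.pi, 9)          # eight phase bins
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(0, r - 2), min(rows, r + 3)     # 5x5 macroblock window
            c0, c1 = max(0, c - 2), min(cols, c + 3)
            hist, _ = np.histogram(phase[r0:r1, c0:c1], bins=bins)
            p = hist / hist.sum()
            p = p[p > 0]
            entropy[r, c] = -(p * np.log(p)).sum()       # eq. (8)

    sal = intensity * entropy                     # combine magnitude and inconsistency
    return sal / sal.max() if sal.max() > 0 else sal
```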

    C. Location-Based Saliency Map

A location-based saliency map indicates how the saliency values of pixels change with respect to their spatial location in the frame. Most approaches that have considered location-based saliency are based on experiments on still images, which have resulted in central saliency, i.e., the center of the image is considered to be visually more important [28], [30], [31]. As we will show through our user study in Section VII-C, the direction of the camera motion also has a major effect on the regions where a viewer looks in the sequence. For UGV, in which there is not much object activity, usually the intentional motion of the camera determines the story by moving around the scene or zooming in/out. The human visual system has a tendency to follow this motion and particularly to look for new objects that are about to enter the camera view. For example, if the camera is panning towards the right, the viewer is more attracted to the right side of the scene, and when the camera starts to zoom out, the attention to the borders of the frame increases. In the case of a zoom-in or still camera, the location saliency is similar to the one for still pictures, and the viewer's attention is more concentrated on the center of the frame. We conducted an eye tracking experiment to verify this hypothesis, as we will describe in Section VII.

The global motion parameters were used to generate the location saliency maps for the extracted keyframes. Three individual maps for the horizontal, vertical, and radial directions are generated as in (10), (11), and (12) and combined to form the location saliency map (13):

(10)

(11)

(12)

(13)

where $(x, y)$ is the pixel location and the three constants in the maps were experimentally found to be optimal at 10, 5, and 0.5, respectively. The parameter $r$ represents the distance of a pixel from the center of the frame and $r_{max}$ is its maximum value in the frame. The parameters $W$ and $H$ are the frame width and height, respectively. After combining the horizontal and vertical maps, the peak of the map function is shifted from the frame center in the direction of the translational motion; if there is no translational motion, the peak occurs at the center of the frame. The radial map is either decreasing or increasing as we move from the center to the borders, depending on whether the camera has a zoom-in/no-zoom or a zoom-out operation. Some examples of maps for various camera operations are shown in Fig. 8. In part (a) the camera is zooming in, so the values are larger at the frame center. Parts (b) and (d) show the maps for panning left and tilting down operations, which move the peak of attention toward the left and bottom parts of the frame, respectively. In part (c) the camera is zooming out while panning toward the left, so the attention is moved toward the borders with more emphasis on the left side of the frame.

Fig. 8. Examples of motion-based location maps for: (a) zoom-in, (b) large panning toward left, (c) zoom-out with panning left, and (d) tilting down.
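Because the exact expressions in (10)–(13) are not legible in this copy, the sketch below only reproduces the qualitative behavior described above: a saliency peak shifted in the direction of the translational motion, and a spread that narrows for zoom-in and widens for zoom-out. The Gaussian form, the shift gain, and the spread constants are assumptions, not the paper's formulas.

```python
import numpy as np

def location_saliency(W, H, tx, ty, z, shift_gain=5.0, spread=0.35):
    """Qualitative location-based saliency: peak shifted along the camera
    translation, spread increased for zoom-out and decreased for zoom-in.
    (Illustrative Gaussian form only; not the paper's equations (10)-(13).)"""
    ys, xs = np.mgrid[0:H, 0:W]
    cx = W / 2.0 + shift_gain * tx                 # peak moves in the pan direction
    cy = H / 2.0 + shift_gain * ty                 # and in the tilt direction
    sigma = spread * min(W, H)
    if z != 1.0:
        sigma *= 1.0 / z if z > 1.0 else 2.0 - z   # zoom-in narrows, zoom-out widens
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    sal = np.exp(-d2 / (2.0 * sigma ** 2))
    return sal / sal.max()
```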

    D. Combined Saliency Map

In order to generate the combined saliency map from the above maps, first the color contrast and moving object saliency maps are superimposed, since they represent two independent factors in attracting visual attention. In addition to the low-level visual features mentioned above, specific objects such as faces, hands, and text can also draw the viewer's attention. These regions are semantically important and may have been overlooked in the low-level saliency maps. In our system, faces are detected and highlighted after combining the low-level saliency maps (in our case the color contrast and moving objects maps) by assigning a high saliency value to the pixels inside the face regions, as shown in Fig. 9. We used an online face detection system [45] for this purpose.

Fig. 9. Highlighting face areas in the saliency map. (a) Original frame. (b) Saliency map after face detection.

Fig. 10. Membership functions of the fuzzy sets: ROIs ($R$) and insignificant regions ($\bar{R}$).

The location-based saliency map is then multiplied pixel-wise with this map to yield the combined saliency map, $S$:

$$S = S_l \cdot F(S_c + S_m) \qquad (14)$$

where $F(\cdot)$ denotes the process of highlighting faces on the map and $S_l$ is the location saliency map defined in (13). The values of $S$ are normalized to [0, 255]. Examples of combined saliency maps for several keyframes are shown in Section VII.
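A minimal sketch of the combination step is given below, assuming the individual maps are already computed and the face detector returns bounding boxes; the value assigned inside face regions is an assumption.

```python
import numpy as np

def combined_saliency(s_color, s_motion, s_location, face_boxes, face_value=1.0):
    """Superimpose low-level maps, highlight detected faces, then weight by the
    location map and normalize to [0, 255].  face_boxes: list of (x0, y0, x1, y1)."""
    low = s_color + s_motion
    low = low / low.max() if low.max() > 0 else low
    for (x0, y0, x1, y1) in face_boxes:
        low[y0:y1, x0:x1] = face_value          # assumed: faces boosted to the maximum
    s = low * s_location                        # pixel-wise combination as in (14)
    return np.uint8(255 * s / s.max()) if s.max() > 0 else np.uint8(s)
```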

    E. Identification of ROIs

The saliency map is a gray level image that represents the relative importance of pixels in a frame. In order to extract regions of interest from the saliency map, a region growing algorithm proposed in [29] is used in our system. In this method, fuzzy partitioning is employed to classify the pixels into ROIs, $R$, and insignificant regions, i.e., areas that attract less attention from the viewer, $\bar{R}$. The membership functions of these two fuzzy sets, $\mu_R$ and $\mu_{\bar{R}}$, are shown in Fig. 10, where $g$ is the gray level value in the saliency map, ranging from $0$ to $L-1$, and $L$ is the number of gray level bins. The parameters $a$ and $c$ in Fig. 10 are determined by minimizing the difference of the entropies of the fuzzy sets as follows:

$$(a^*, c^*) = \arg\min_{a, c} \left| H_R(a, c) - H_{\bar{R}}(a, c) \right| \qquad (15)$$

where $H_R$ and $H_{\bar{R}}$ are the entropies of the two fuzzy sets. The seeds for region growing are defined to be the pixels with a membership value of one that are local maxima of the saliency map. Adjacent pixels whose saliency and membership values exceed given thresholds become part of the seeds for the iterative growing. Examples of extracted regions for some saliency maps are shown in Fig. 11. More examples will be presented when we describe our experimental results in Section VII.
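The sketch below gives a simplified, crisp version of this step: a histogram cut that balances the entropies of the two classes stands in for the fuzzy partitioning of [29], and connected regions containing strong saliency peaks stand in for the seeded region growing. The thresholds are illustrative only.

```python
import numpy as np
from scipy import ndimage

def extract_rois(sal, n_bins=256):
    """Simplified ROI extraction from a combined saliency map in [0, 255]."""
    hist, edges = np.histogram(sal, bins=n_bins, range=(0, 255))
    p = hist / hist.sum()

    def entropy(q):
        q = q[q > 0]
        if q.size == 0:
            return 0.0
        q = q / q.sum()
        return -(q * np.log(q)).sum()

    # gray-level cut minimizing |H_R - H_Rbar| (crisp analogue of (15))
    diffs = [abs(entropy(p[t:]) - entropy(p[:t])) for t in range(1, n_bins)]
    t_star = edges[1 + int(np.argmin(diffs))]

    mask = sal >= t_star
    labels, n = ndimage.label(mask)

    # keep connected regions that contain a strong saliency peak (the "seeds")
    peak = sal.max()
    rois = []
    for k in range(1, n + 1):
        region = labels == k
        if sal[region].max() >= 0.9 * peak:
            rois.append(region)
    return rois
```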


Fig. 15. Top five salient locations detected using Itti et al.'s model. The white circles show the most salient points in the frames and the yellow circles are the next four. The red arrows indicate the order.

The main distinction between the two systems is evident in the frames containing intentional camera motion. As we show in our eye tracking study in Section VII-C, this factor draws the overall attention of the viewer to a part of the frame depending on the direction of the motion. For example, the camera is panning towards the left in frame (b) in Fig. 14. As a result, the visual attention is focused along the horizontal line at the center of the frame and towards the left. In the same frame in Fig. 15, we can see that both the deck of the ship and the horizon view of the city are among the salient points; however, the camera motion gives higher priority to the horizon region. A similar situation occurs in frame (g). Also, in frames (f) and (l) the camera has a zoom-out and therefore a larger region is selected as the ROI compared to the zoom-in case in frame (k).

    B. Subjective Evaluation of Our End-to-End System

As previously mentioned, the outcome of our system is a collection of keyframes with highlighted ROIs (Fig. 14). We conducted a user study of our overall system to compare our results to viewers' preferences. In our system, a video is first divided into camera views and keyframes are extracted from each camera view to represent the scene. Our system chooses the frames that are emphasized by the camera person through specific patterns in the camera motion. The view segmentation is done in such a way that by selecting one frame from each camera view, the set of keyframes will cover almost all the scenes captured in the video. Consequently, if the ROIs are correctly detected, their set should contain all the important objects of the video. In order to evaluate the end-to-end system, a subjective assessment was conducted as described below.

Method: Ten subjects participated in the user study. Each subject watched ten UGV sequences and was asked to select a fixed number of keyframes and an arbitrary number of objects of interest from each clip, with the following descriptions.

  • Keyframe: If you want to summarize the video in the specified number of frames, which frames would you choose to represent the video?

  • Object of interest: What is (are) the most important object(s) in the video?

For each video, the subjects were required to select exactly the same number of keyframes as determined by our system. This number depended on the number of camera views and the specific camera patterns detected by our system. Ten videos in the range of 30 s to 2 min were used for the experiment. The videos contained different types of intentional motion, such as pan, tilt, zoom, and hold, and also abnormal motion such as blurry and shaky motion. The videos were from a wide range of scenes, such as a zoo, university campuses, playgrounds, beaches, a cruise, and parks. The subjects had access to each individual frame and were able to play the video frame by frame or at the normal frame rate. They were asked to watch the videos at least once before making their selection.

Data Analysis—Keyframe Selection: In order to compare the results of our system to the results from the user study, we consider the outputs of our system as the reference set and measure the closeness of the user-selected data to this set by determining the precision, recall, and F measure defined in (16)–(18). The F measure is the harmonic mean of precision and recall. For each frame selected by a user, if it is close enough to one of our system outputs, we consider it a correct hit, otherwise a false alarm. For each output of our system, if a subject does not select a correct frame, it is counted as a miss for that subject:

$$\text{precision} = \frac{\#\text{hits}}{\#\text{hits} + \#\text{false alarms}} \qquad (16)$$

$$\text{recall} = \frac{\#\text{hits}}{\#\text{hits} + \#\text{misses}} \qquad (17)$$

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (18)$$

Several frames of a video can represent the same content, e.g., when the camera has small movement on a static scene or when the objects in the scene move slightly. Fig. 16 illustrates examples of this scenario. In this figure, either the camera, the object in the scene, or both have small motion from the top frame to the bottom frame, but the frames are semantically similar. Therefore, the first step in comparing two frames is motion compensation between the two. The three-parameter motion compensation described earlier is used for this purpose. We limit the search range to 20 pixels for a frame size of 360 × 240 pixels. Larger displacements are considered significant motion and are not compensated. Since the motion estimation is not very accurate, and also because of the artifacts caused by video compression, two similar frames may still not match, especially around the edges. Thus, we use a Gaussian filter to blur the edges. Next, we use the SSIM metric [47] to measure the distance between the blurred images. A threshold on the SSIM distance is used to decide whether or not two frames are similar. This threshold is obtained separately for each keyframe of each video, because during some intervals there may be little change in the visual content of the frame although it changes semantically, while in others the visual content changes rapidly. Thus, we determine the threshold based on the local variation among the frames in the neighborhood of each keyframe. The threshold for each keyframe is the average distance between that keyframe and its two neighboring keyframes.
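The frame comparison can be sketched with scikit-image's SSIM implementation as below. The motion compensation step is omitted here, the blur strength is a placeholder, and the adaptive threshold follows the neighbor-based rule described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

def ssim_distance(a, b, blur_sigma=1.5):
    """Distance between two grayscale frames: blur to suppress compression
    artifacts, then use 1 - SSIM."""
    a = gaussian_filter(a.astype(np.float64), blur_sigma)
    b = gaussian_filter(b.astype(np.float64), blur_sigma)
    return 1.0 - structural_similarity(a, b, data_range=255)

def keyframe_thresholds(keyframes):
    """Adaptive threshold per keyframe: average distance to its neighbors."""
    th = []
    for i, kf in enumerate(keyframes):
        nbrs = [keyframes[j] for j in (i - 1, i + 1) if 0 <= j < len(keyframes)]
        th.append(np.mean([ssim_distance(kf, n) for n in nbrs]) if nbrs else 0.0)
    return th

def is_hit(user_frame, keyframes, thresholds):
    """A user-selected frame counts as a hit if its distance to the most
    similar system keyframe is below that keyframe's threshold."""
    d = [ssim_distance(user_frame, kf) for kf in keyframes]
    i = int(np.argmin(d))
    return d[i] <= thresholds[i], i
```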


Fig. 16. Examples of frame pairs in a video where the two frames are semantically similar even though the camera and the objects in the scene have moved.

TABLE I: COMPARISON BETWEEN THE DATA FROM THE USER STUDY AND OUR SYSTEM

TABLE II: STATISTICS FOR RECALL, PRECISION, AND F PARAMETERS

For each frame selected by a user, we compare it to the four closest indices among the keyframes determined by our system and find the most similar keyframe based on the SSIM distance. If this distance is less than the threshold associated with that keyframe, it is considered a correct hit for that keyframe, and otherwise a false alarm. Table I shows the statistics for the ten videos. The average precision, recall, and F measure and their variances and standard errors are listed in Table II. It can be observed that the average precision, 87.32%, is higher than the average recall of 70.38%. This is due to the fact that several user-selected frames may be matched to one system-selected keyframe, so some of the outputs are missed by the user. The results indicate a high correlation between the outcome of our system and the user-selected frames. This not only shows that camera motion patterns affect how users select keyframes but also validates our view segmentation step. Our view segmentation divides the video into groups of correlated frames; therefore the set of frames selected from all the subshots covers almost all scenes in the video.

Fig. 17 illustrates a number of typical failure cases of our system based on the user study. In video sequence (a) the child, shown by the red arrows, makes a short appearance in the video and runs to the garage. The camera follows the child and holds when she enters the garage. Frame 371 is selected by our system as the keyframe, and the entire area inside the garage is considered to be the ROI. However, most users selected keyframes from the segment where the child is running, e.g., frames 251, 268, and 303, and identified the child as the ROI. This issue is due to the fact that the process of selecting keyframes is based entirely on camera motion and does not consider object motion. Sequence (b) shows a similar case where different users chose different representative frames based on the position of the children in the scene. Video sequence (c) shows an example where the results can be controversial. The camera holds on the view of the castle (frame 62), pans to the left, and moves back to the same view of the castle at frame 234. Our system does not discard frames from similar views taken at different times, since those frames might be semantically different due to a change of the objects in the scene. In case (d) the camera has a 90° rotation, and some users selected the rotated frame (frame 813) as the keyframe, whereas our system selected frame 935. Since the motion model we used is a three-parameter model, it does not account for rotation. We noted that intentional rotation is not a frequent pattern in home videos.

Fig. 17. Examples where the output of our system fails with respect to the user study. Each row contains frames extracted from a video sequence, e.g., vid(a) represents the frames from video sequence (a), and fr. stands for the word frame.

Data Analysis—Objects of Interest: For each object that a user selected, we checked whether it was included in the identified ROIs of at least one of the extracted keyframes from the output of our system. The users identified an object of interest by including a description of the object and the frame in which it appears in the video. If the object appears on more than one occasion, the user did not need to cite the same object more than once. The subjects were allowed to select as many objects as they thought were important. A total of 360 objects were selected by the ten subjects from the ten videos, which results in an average of 3.6 objects per video per subject. Of these, 314 objects (87.22%) appeared in the highlighted set of keyframes extracted by the algorithm. Most of the missed objects were small objects that appeared in the video for a very short period of time. Again, the results not only assess the quality of the saliency map and ROI identification steps but also validate our view segmentation, which makes the set of keyframes an enclosing set of the important objects in the video.


C. User Study on the Effect of Camera Motion on Human Visual Attention

In this section we investigate how camera motion influences visual attention through an eye tracking experiment. The results verified the hypothesis that underlies the location-based saliency map in Section VI-C. In this user study the eye movements of subjects were recorded using an infrared eye tracking system. The experiment was designed so as to minimize the effect of visual features other than camera motion. For this purpose, a number of video sequences with different types of camera motion were generated.

Test Sequence Stimuli: Six videos, each with a different type of camera motion (pan right, pan left, tilt up, tilt down, zoom in, and zoom out), were used. Each video consisted of several short video sequences with durations ranging from 10 to 40 s. All sequences in each video contained only one type of camera motion. Between the video sequences, a black frame with a bright point at the center was inserted for a duration of 2 s, and the subjects were asked to fixate on the bright point when the black frame appeared. This was done to move the attention of the subject back to the center and avoid any bias when switching from one sequence to another. In order to minimize the effect of visual features other than camera motion, we generated the video sequences as described below.

  • For every scene with a specific direction of camera motion, we recorded the exact same scene with the opposite direction of camera motion. For example, if a sequence was constructed by panning the camera towards the right, another sequence was constructed from the same scene by panning the camera towards the left. As a result, the effect of any static visual feature is equal in both directions of camera movement.

  • Moving objects, people, and written text were avoided in the videos, since these factors significantly influence human visual attention.

Each subject viewed only one of the six videos. The videos were displayed at a rate of 25 fps with a display resolution of 478 × 403 pixels. The screen was located at a distance of 63 cm from the viewer's eyes.

Participants: Eighteen subjects took part in the controlled experiment. All had normal or corrected-to-normal vision. The subjects had not viewed any of the videos before and were not aware of our hypothesis. Each subject viewed the video corresponding to only one type of camera motion; therefore, for each category of motion, data were collected from three subjects.

Apparatus: An infrared-based eye tracker (RK-726PCI Pupil/Corneal Reflection Tracking System, ISCAN) was used to record the eye movements. The eye tracker recorded at a sampling rate of 60 Hz with a mean spatial accuracy of 1°. A chin rest with a forehead stabilizer (Table Mounted Head Restraint, Applied Science Group) was used to restrict the movement of the participant's head. Fig. 18 illustrates the system.

Calibration: Before each test, a nine-point static calibration procedure was performed during which the subject was asked to fixate on each point, shown as a white cross on a black screen. These points included eight points on the border and one at the center. Each subject was calibrated individually due to variation in eye geometry and physiology. Once the calibration was done, the eye tracker was able to record the eye movements.

    Fig. 18. Eye tracking experimental setup.

Fig. 19. Projection of the calibration points from stimuli space to the data space.

In order to confirm the quality of the calibration and also compensate for possible nonlinearity in the eye tracker data, a nine-point dynamic calibration was performed prior to the experiment. This calibration was done by asking the subjects to follow a moving white cross on the black screen. The cross displayed each of the nine points for 4 s before moving to the next point.

The dynamic calibration data were used to find the relationship between the eye tracking points and their corresponding locations in the frame. These data indicated a nonlinearity in our eye tracking device that caused the calibration points located on the rectangular grid in the stimuli to be warped into points on trapezoids in the data plane, as illustrated in Fig. 19. In order to compensate for this nonlinearity, we used an eight-parameter model to transform each data point into its corresponding location in the stimuli space. The plane was divided into four segments (Fig. 19) and a different set of transform parameters was obtained for each segment using the nine calibration points. The eight-parameter model is defined as

$$x_s = \frac{a_1^{(i)} x_d + a_2^{(i)} y_d + a_3^{(i)}}{a_7^{(i)} x_d + a_8^{(i)} y_d + 1}, \qquad y_s = \frac{a_4^{(i)} x_d + a_5^{(i)} y_d + a_6^{(i)}}{a_7^{(i)} x_d + a_8^{(i)} y_d + 1} \qquad (19)$$

where $(x_d, y_d)$ are the data point coordinates, $(x_s, y_s)$ are the corresponding stimuli point coordinates, and the index $i$ indicates to which of the four trapezoids in the data space the point belongs (Fig. 19).
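With (19) written as the standard eight-parameter projective mapping (our reconstruction), the per-trapezoid parameters can be fit from the calibration correspondences by linear least squares, as sketched below; the function names are ours.

```python
import numpy as np

def fit_projective(data_pts, stim_pts):
    """Least-squares fit of the eight-parameter (projective) mapping that sends
    eye-tracker data coordinates to stimulus coordinates.
    data_pts, stim_pts: (N, 2) arrays of corresponding points, N >= 4."""
    A, b = [], []
    for (xd, yd), (xs, ys) in zip(data_pts, stim_pts):
        A.append([xd, yd, 1, 0, 0, 0, -xs * xd, -xs * yd]); b.append(xs)
        A.append([0, 0, 0, xd, yd, 1, -ys * xd, -ys * yd]); b.append(ys)
    params, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return params                                  # a1 .. a8

def apply_projective(params, pt):
    """Map one data-space point into stimulus space with the fitted parameters."""
    a1, a2, a3, a4, a5, a6, a7, a8 = params
    xd, yd = pt
    denom = a7 * xd + a8 * yd + 1.0
    return (a1 * xd + a2 * yd + a3) / denom, (a4 * xd + a5 * yd + a6) / denom
```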

Data Analysis: The eye-gaze data provided a direct measure of the overt visual spatial attention of each subject in our test. The fixation points were extracted by discarding the blinks and saccades from the data. In order to detect the saccades, we used a velocity-based algorithm that discriminates between fixations and saccades based on their angular velocities [48]. The point-to-point velocity of each sample was obtained by dividing the distance between the current and previous point by the sampling time. This velocity was converted to angular velocity using the distance between the eyes and the visual stimuli. A sample point was labeled as a saccade and discarded from the data if its angular velocity was greater than a threshold; we used a threshold value of 20°/s.

Fig. 20. Examples of the recorded fixation points for six different types of camera motion. (a) Pan right. (b) Tilt up. (c) Zoom in. (d) Pan left. (e) Tilt down. (f) Zoom out.

TABLE III: STATISTICS OF THE FIXATION POINTS FOR PAN RIGHT AND PAN LEFT CAMERA MOTIONS
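The saccade filtering step described above can be sketched as follows. The 20°/s threshold and the 63-cm viewing distance come from the text; the pixel-to-centimeter factor is a display-dependent placeholder, and the blink handling is omitted.

```python
import numpy as np

def remove_saccades(gaze_xy, fs=60.0, view_dist_cm=63.0,
                    px_per_cm=40.0, vel_thresh_deg=20.0):
    """Keep only fixation samples: drop samples whose angular velocity exceeds
    the threshold (20 deg/s in the study).  px_per_cm is a display-dependent
    placeholder used to convert pixel distances to centimeters."""
    g = np.asarray(gaze_xy, float)
    d_px = np.linalg.norm(np.diff(g, axis=0), axis=1)        # pixels per sample
    d_cm = d_px / px_per_cm
    ang_deg = np.degrees(2.0 * np.arctan(d_cm / (2.0 * view_dist_cm)))
    vel = ang_deg * fs                                        # degrees per second
    keep = np.concatenate([[True], vel < vel_thresh_deg])     # first sample kept
    return g[keep]
```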

The distinct fixation points were then determined by clustering the sample points between each pair of consecutive saccades. Fig. 20 shows the fixation points recorded from six different subjects for videos with different types of camera motion. As can be seen from the figure, in the case of panning and tilting, the center of the distribution of the fixation points is highly dependent on the direction of the camera motion. For example, in the video with the camera panning toward the right [Fig. 20(a)], the majority of fixation points are at the right side of the frame. In the case of zoom in and zoom out, the camera motion affects the variance of the distribution: the zoom-out data points are more scattered over the frame area, whereas the zoom-in data points are more concentrated around the mean.

1) Motion Type: Pan/Tilt: To show that the difference between the sample mean and the frame center, with (0,0) coordinates, is statistically significant in the pan and tilt videos, a t test is performed on the collected data. For pan left and pan right camera motions, the test is performed on the x coordinates of the fixation points. Similarly, the test is carried out in the y direction for tilt up and tilt down camera motions. Tables III and IV present the summary statistics and the corresponding test statistic value for each set of fixation data in the pan and tilt videos, respectively. Here, the x axis points toward the right and the y axis points downward.

As indicated in Table III, the sample means are located on the right side of the frame (positive values) for videos with pan right

TABLE IV. STATISTICS OF THE FIXATION POINTS FOR TILT UP AND TILT DOWN CAMERA MOTIONS

TABLE V. STATISTICS OF THE FIXATION POINTS FOR ZOOM-IN AND ZOOM-OUT CAMERA MOTIONS

camera motion and on the left side (negative values) for videos with pan left camera motion. In a similar manner, depending on the direction of camera tilt, the sample means are shifted to the lower or upper side of the frame (Table IV). According to the t tests, in all cases the displacement of the sample means from the frame center is statistically significant at the 99% confidence level.
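For a single pan or tilt sequence, the displacement test above amounts to a one-sample t test of the relevant fixation coordinate against the frame-center value of zero. The sketch below, with illustrative variable names and synthetic data, shows the computation using SciPy.

import numpy as np
from scipy import stats

def test_horizontal_shift(fix_x, alpha=0.01):
    """One-sample t test: is the mean fixation x coordinate nonzero?

    fix_x: 1-D array of fixation x coordinates with the frame center at 0
    and the x axis pointing toward the right.
    """
    t_stat, p_val = stats.ttest_1samp(fix_x, popmean=0.0)
    return t_stat, p_val, p_val < alpha

# Synthetic data standing in for a pan-right sequence (mean shifted right).
rng = np.random.default_rng(0)
fake_fix_x = rng.normal(loc=40.0, scale=60.0, size=200)
print(test_horizontal_shift(fake_fix_x))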

2) Motion Type: Zoom: The statistics of the fixation data for videos with zooming are summarized in Table V. As illustrated in the table, the variances in both the x and y directions are significantly larger in the zoom-out case than in the zoom-in case. This result implies that the direction of camera zoom (in or out) affects the covariances of the sample points. To verify the significance of the difference between the covariance matrices in the two cases, we put all the zoom-in data in one set and all the zoom-out data in another. A multivariate F test is then performed to verify the significance of the difference between the covariance matrices. We used the method proposed by Box [49] to determine the value of F. The two degrees of freedom for the F distribution are obtained from the number of samples in each class, the number of classes, and the dimension of the sample points; in our case, the second degree of freedom is large enough to be treated as effectively infinite. The value of F for the eye tracking data is 128.02 (Table VI), which exceeds the critical value of F at the 99% confidence level by a large margin. We conclude that the difference between the covariance matrices is statistically significant at the 99% confidence level.
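The sketch below outlines this comparison of covariance matrices using Box's statistic and its standard F approximation [49]; the function name is illustrative, and the exact computation used in the paper may differ in detail.

import numpy as np

def box_m_f_test(samples_a, samples_b):
    """Box's M test comparing the covariance matrices of two samples.

    samples_a, samples_b: (n_i, p) arrays, e.g., the zoom-in and zoom-out
    fixation points (p = 2).  Returns the F statistic and its two degrees
    of freedom under the usual F approximation of Box's M.
    """
    groups = [np.asarray(samples_a, dtype=float),
              np.asarray(samples_b, dtype=float)]
    k = len(groups)                              # number of classes
    p = groups[0].shape[1]                       # dimension of the samples
    nu = np.array([g.shape[0] - 1 for g in groups], dtype=float)
    covs = [np.cov(g, rowvar=False) for g in groups]

    # Pooled covariance and the M statistic.
    s_pooled = sum(v * c for v, c in zip(nu, covs)) / nu.sum()
    m_stat = nu.sum() * np.log(np.linalg.det(s_pooled)) \
        - sum(v * np.log(np.linalg.det(c)) for v, c in zip(nu, covs))

    # Scaling terms of the F approximation.
    c1 = (np.sum(1.0 / nu) - 1.0 / nu.sum()) \
        * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (k - 1))
    c2 = (np.sum(1.0 / nu**2) - 1.0 / nu.sum()**2) \
        * (p - 1) * (p + 2) / (6.0 * (k - 1))

    df1 = p * (p + 1) * (k - 1) / 2.0
    df2 = (df1 + 2.0) / abs(c2 - c1**2)
    if c2 > c1**2:
        f_stat = m_stat * (1.0 - c1 - df1 / df2) / df1
    else:
        b = df2 / (1.0 - c1 + 2.0 / df2)
        f_stat = df2 * m_stat / (df1 * (b - m_stat))
    return f_stat, df1, df2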

    VIII. CONCLUSION AND FUTURE WORK

In this paper, we proposed a system for UGV analysis. UGVs contain a rich camera motion structure that can be an indicator of importance in the scene. While one could use many other properties or features to analyze UGV, the goal of this paper was to examine what can be done by mainly focusing


TABLE VI. F STATISTICS FOR COMPARING THE COVARIANCE MATRICES IN ZOOM-IN AND ZOOM-OUT DATA

on the camera motion. Since camera motion in UGV may exhibit both intentional and unintentional behavior, we used motion classification as a preprocessing step. Frames were labeled as zoom, blurry, shaky, or stable motion based on their global motion information. A temporal segmentation algorithm was proposed based on the concept of camera views, which relates each subshot to a different view. We used a simple keyframe selection strategy based on camera motion patterns to represent each view. The segmentation step eliminates the need for a complicated keyframe selection method, and the resulting video summary covers all scenes captured in the video.

Through a controlled eye tracking experiment, we showed that camera motion has a major effect on viewers' attention. Our experiment indicated a high dependency between the distribution of the fixation points in the recorded eye movements and the type and direction of camera motion, and statistical tests confirmed that the results were significant. We therefore employed camera motion, in addition to several other factors, to generate saliency maps for keyframes and identify ROIs based on visual attention. The ROIs can be used to present a more compact video summary that is suitable for devices with small screens.

A user study was conducted to subjectively evaluate the

end-to-end performance of our system. The results indicated a high correlation between the output of our system and the user-selected results. The study validated the different stages of our system, such as segmentation, keyframe summarization, and identification of ROIs.

As future work, we are exploring the extension of our system to annotate UGV. We will take advantage of our proposed segmentation step and use the keyframes and ROIs to annotate the video by tagging the video summary with geo-tags and content-based tags.

    REFERENCES

[1] C.-W. Ngo, T.-C. Pong, and H.-J. Zhang, Recent advances in content-based video analysis, Int. J. Image Graph., vol. 1, no. 3, pp. 445–468, 2001.
[2] C. Taskiran, J. Chen, A. Albiol, L. Torres, C. Bouman, and E. Delp, ViBE: A compressed video database structured for active browsing and search, IEEE Trans. Multimedia, vol. 6, no. 1, pp. 103–118, Feb. 2004.
[3] B. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 6, pp. 533–544, Dec. 1995.
[4] P. Wu, A semi-automatic approach to detect highlights for home video annotation, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, May 2004, vol. 5, pp. 957–960.
[5] D. Gatica-Perez, A. Loui, and M. T. Sun, Finding structure in home videos by probabilistic hierarchical clustering, IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 539–548, Jun. 2003.
[6] M. Yeung, B. L. Yeo, and B. Liu, Extracting story units from long programs for video browsing and navigation, in Proc. 3rd IEEE Int. Conf. Multimedia Computing and Systems, Jun. 1996, pp. 296–305.
[7] R. Dony, J. Mateer, and J. Robinson, Techniques for automated reverse storyboarding, Proc. Inst. Elect. Eng., Vis., Image, Signal Process., vol. 152, no. 4, pp. 425–436, Aug. 2005.
[8] A. F. Smeaton, P. Over, and W. Kraaij, Evaluation campaigns and TRECVid, in Proc. 8th ACM Int. Workshop Multimedia Information Retrieval (MIR '06), 2006, pp. 321–330.
[9] S.-H. Huang, Q.-J. Wu, K.-Y. Chang, H.-C. Lin, S.-H. Lai, W.-H. Wang, Y.-S. Tsai, C.-L. Chen, and G.-R. Chen, Intelligent home video management system, in Proc. 3rd Int. Conf. Information Technology, Research and Education (ITRE 2005), Jun. 2005, pp. 176–180.
[10] M. Yeung and B. Liu, Efficient matching and clustering of video shots, in Proc. IEEE Int. Conf. Image Processing, Oct. 1995, vol. 1, pp. 338–341.
[11] J. Nesvadba, F. Ernst, J. Perhavc, J. Benois-Pineau, and L. Primaux, Comparison of shot boundary detectors, in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2005, pp. 788–791.
[12] A. Hanjalic and H. Zhang, An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis, IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp. 1280–1289, Dec. 1999.
[13] P. Over, A. Smeaton, and G. Awad, in TRECVID Rushes Summarization Workshop, 2008. [Online]. Available: http://www-nlpir.nist.gov/projects/tvpubs/tv8.slides/tvs08.slides.pdf.
[14] H. Sawhney, S. Ayer, and M. Gorkani, Model-based 2d and 3d dominant motion estimation for mosaicking and video representation, in Proc. 5th Int. Conf. Computer Vision, Los Alamitos, CA, Jun. 1995, pp. 583–590.
[15] D. Gatica-Perez and M.-T. Sun, Linking objects in videos by importance sampling, in Proc. IEEE Int. Conf. Multimedia and Expo, 2002, vol. 2, pp. 525–528.
[16] C. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, and E. Delp, Automated video program summarization using speech transcripts, IEEE Trans. Multimedia, vol. 8, no. 4, pp. 775–791, Aug. 2006.
[17] A. Albiol, L. Torres, and E. Delp, The indexing of persons in news sequences using audio-visual data, in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Hong Kong, Apr. 2003.
[18] J. Choi and D. Jeong, Story board construction using segmentation of MPEG encoded news video, in Proc. 43rd IEEE Midwest Symp. Circuits and Systems, 2000, vol. 2, pp. 758–761.
[19] Y.-P. Tan, D. Saur, S. Kulkarni, and P. Ramadge, Rapid estimation of camera motion from compressed video with application to video annotation, IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 1, pp. 133–146, Feb. 2000.
[20] Y. Rui, A. Gupta, and A. Acero, Automatically extracting highlights for TV baseball programs, in Proc. 8th ACM Int. Conf. Multimedia, 2000, pp. 105–115.
[21] N. Babaguchi, Y. Kawai, Y. Yasugi, and T. Kitahashi, Linking live and replay scenes in broadcasted sports video, in Proc. 2000 ACM Workshops Multimedia, Nov. 2000, pp. 205–208.
[22] A. Girgensohn, J. Boreczky, P. Chiu, J. Doherty, J. Foote, G. Golovchinsky, S. Uchihashi, and L. Wilcox, A semi-automatic approach to home video editing, in Proc. ACM Symp. User Interface Software and Technology, San Diego, CA, Nov. 2000, vol. 2, pp. 81–89.
[23] L. Itti, C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
[24] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, A user attention model for video summarization, in Proc. 10th ACM Int. Conf. Multimedia, 2002, pp. 533–542.
[25] O. L. Meur, D. Thoreau, P. L. Callet, and D. Barba, A spatio-temporal model of the selective human visual attention, in Proc. Int. Conf. Image Processing, 2005, vol. 3, pp. 1188–1191.
[26] X. Xie, H. Liu, W. Ma, and H. Zhang, Browsing large pictures under limited display sizes, IEEE Trans. Multimedia, vol. 8, no. 4, pp. 707–715, Aug. 2006.
[27] L.-Q. Chen, X. Xie, X. Fan, W.-Y. Ma, H.-J. Zhang, and H.-Q. Zhou, A visual attention model for adapting images on small displays, Multimedia Syst., vol. 9, pp. 353–364, Oct. 2003.
[28] W. Osberger and A. Rohaly, Automatic detection of regions of interest in complex video sequences, in Proc. SPIE, Human Vision and Electronic Imaging VI, Bellingham, WA, 2001, vol. 4299, pp. 361–372.


[29] Y.-F. Ma and H.-J. Zhang, Contrast-based image attention analysis by using fuzzy growing, in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 374–381.
[30] F. Liu and M. Gleicher, Automatic image retargeting with fisheye-view warping, in Proc. 18th Annu. ACM Symp. User Interface Software and Technology, Oct. 2005, pp. 153–162.
[31] Y. Hu, L.-T. Chia, and D. Rajan, Region-of-interest based image resolution adaptation for MPEG-21 digital item, in Proc. ACM Multimedia, Oct. 2004.
[32] G. Abdollahian and E. J. Delp, Analysis of unstructured video based on camera motion, in Proc. SPIE Int. Conf. Multimedia Content Access: Algorithms and Systems, San Jose, CA, Jan. 2007, vol. 6506.
[33] J. R. Kender and B. L. Yeo, On the structure and analysis of home videos, in Proc. Asian Conf. Computer Vision, Taipei, Taiwan, Jan. 2000.
[34] W. Kraaij and T. Ianeva, TREC Video Low-level Feature (Camera Motion) Task Overview, 2005. [Online]. Available: http://www-nlpir.nist.gov/projects/tvpubs/tv5.papers/tv5.llf.slides.final.pdf.
[35] L.-Y. Duan, M. Xu, Q. Tian, and C.-S. Xu, Nonparametric motion model with applications to camera motion pattern classification, in Proc. 12th Annu. ACM Int. Conf. Multimedia (MULTIMEDIA '04), 2004, pp. 328–331.
[36] F. Coudert, J. Benois-Pineau, and D. Barba, Dominant motion estimation and video partitioning with a 1D signal approach, in Proc. SPIE Conf. Multimedia Storage and Archiving Systems III, 1998, vol. 3527.
[37] D. Lan, Y. Ma, and H. Zhang, A novel motion-based representation for video mining, in Proc. IEEE Int. Conf. Multimedia and Expo, Baltimore, MD, Jul. 2003.
[38] K. Sauer and B. Schwartz, Efficient block motion estimation using integral projections, IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 5, pp. 513–518, Oct. 1996.
[39] C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov., vol. 2, pp. 121–167, 1998.
[40] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm.
[41] S. Wu, Y. Ma, and H. Zhang, Video quality classification based home video segmentation, in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2005, pp. 217–220.
[42] G. Abdollahian and E. J. Delp, User generated video annotation using geo-tagged image databases, in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2009.
[43] R. M. Gray, Vector quantization, IEEE ASSP Mag., vol. 1, no. 2, pp. 4–29, Apr. 1984.
[44] Y.-F. Ma and H.-J. Zhang, A model of motion attention for video skimming, in Proc. 2002 Int. Conf. Image Processing, 2002, vol. 1, pp. 129–132.
[45] Pittsburgh Pattern Recognition, Demonstration: Face Detection in Photographs. [Online]. Available: http://www.demo.pittpatt.com/.
[46] L. Itti, The iLab neuromorphic vision C++ toolkit: Free tools for the next generation of vision algorithms, Neuromorphic Eng., vol. 1, no. 1, p. 10, 2004.
[47] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[48] D. D. Salvucci and J. H. Goldberg, Identifying fixations and saccades in eye-tracking protocols, in Proc. 2000 ACM Symp. Eye Tracking Research and Applications (ETRA '00), 2000, pp. 71–78.
[49] G. E. P. Box, A general distribution theory for a class of likelihood criteria, Biometrika, vol. 36, no. 3/4, pp. 317–346, Dec. 1949.

Golnaz Abdollahian (S'06) received the B.S. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2004. Currently, she is pursuing the Ph.D. degree in the area of communications, networking, signal, and image processing at Purdue University, West Lafayette, IN.

During the summer of 2007, she was a Student Intern at the Eastman Kodak Company, Rochester, NY. Her research interests include multimedia content analysis, video summarization, indexing, and retrieval.

Cuneyt M. Taskiran (M'99) was born in Istanbul, Turkey. He received the B.S. and M.S. degrees in electrical engineering from Bogazici University, Istanbul, and the Ph.D. degree in electrical and computer engineering and the M.A. degree in linguistics from Purdue University, West Lafayette, IN.

His research interests include media analysis and association for content-based applications, video summarization, and natural language watermarking.

Zygmunt Pizlo received the Ph.D. degree in electrical and computer engineering in 1982 from the Institute of Electron Technology, Warsaw, Poland, and the Ph.D. degree in psychology in 1991 from the University of Maryland, College Park.

He is a Professor of psychology at Purdue University, West Lafayette, IN. His research interests include all aspects of visual perception, motor control, and problem solving.

Edward J. Delp (S'70–M'79–SM'86–F'97) was born in Cincinnati, OH. He received the B.S.E.E. (cum laude) and M.S. degrees from the University of Cincinnati, and the Ph.D. degree from Purdue University, West Lafayette, IN. In May 2002, he received an Honorary Doctor of Technology from the Tampere University of Technology, Tampere, Finland.

From 1980 to 1984, he was with the Department of Electrical and Computer Engineering at The University of Michigan, Ann Arbor. Since August 1984, he has been with the School of Electrical and Computer Engineering and the School of Biomedical Engineering at Purdue University. From 2002 to 2008, he was a chaired professor and held the title The Silicon Valley Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering. In 2008, he was named a Distinguished Professor and is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering. In 2007, he received a Distinguished Professor appointment from the Academy of Finland as part of the Finland Distinguished Professor Program (FiDiPro). This appointment is at the Tampere International Center for Signal Processing at the Tampere University of Technology. His research interests include image and video compression, multimedia security, medical imaging, multimedia systems, communication, and information theory. He has also consulted for various companies and government agencies in the areas of signal, image, and video processing, pattern recognition, and secure communications. He has published and presented more than 400 papers.

Dr. Delp is a Fellow of the SPIE, a Fellow of the Society for Imaging Science and Technology (IS&T), and a Fellow of the American Institute of Medical and Biological Engineering. In 2004, he received the Technical Achievement Award from the IEEE Signal Processing Society (SPS) for his work in image and video compression and multimedia security. In 2008, he received the Society Award from the SPS. This is the highest award given by the SPS, and it cited his work in multimedia security and image and video compression. In 2009, he received the Purdue College of Engineering Faculty Excellence Award for Research.