An iteratively reweighting algorithm for dynamic video summarization

Pei Dong & Yong Xia & Shanshan Wang & Li Zhuo & David Dagan Feng

Received: 3 January 2014 / Revised: 7 May 2014 / Accepted: 26 May 2014
© Springer Science+Business Media New York 2014

Abstract Information explosion has imposed unprecedented challenges on the conventional ways of video data consumption. Hence, providing condensed and meaningful video summaries to viewers has been recognized as a beneficial and attractive research topic in the multimedia community in recent years. Analyzing both the visual and textual modalities proves essential for an automatic video summarizer to pick up the important content of a video. However, most established studies in this direction either use heuristic rules or rely on simple forms of text analysis. This paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm based on the joint and adaptive use of the visual modality and accompanying subtitles. The proposed algorithm takes advantage of our developed SEmantic inDicator of videO seGment (SEDOG) feature for exploring the most representative concepts for describing the video. Meanwhile, the iteratively reweighting scheme effectively updates the dynamic surrogate of the original video by combining the high-level features in an adaptive manner. The proposed algorithm has been compared to four state-of-the-art video summarization approaches, namely the speech transcript-based (STVS) algorithm, the attention model-based (AMVS) algorithm, the sparse dictionary selection-based (DSVS) algorithm and the heterogeneity image patch index-based (HIPVS) algorithm, on different video genres, including documentary, movie and TV news. Our results show that the proposed IRDVS algorithm can produce summarized videos with better quality.

Multimed Tools Appl
DOI 10.1007/s11042-014-2126-8

P. Dong · Y. Xia · S. Wang · D. D. Feng
Biomedical and Multimedia Information Technology (BMIT) Research Group, School of Information Technologies, The University of Sydney, Sydney NSW 2006, Australia

P. Dong (*) · L. Zhuo
Signal and Information Processing Laboratory, Beijing University of Technology, Beijing 100124, China
e-mail: [email protected]

Y. Xia (*)
Shaanxi Key Lab of Speech & Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
e-mail: [email protected]

S. Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

S. Wang
School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China


Keywords Video summarization · Semantic indicator of video segment (SEDOG) · Iterative weight estimation · Multimodal features · Saliency ranking

1 Introduction

With the advent of the big data era, humans have been increasingly exposed to an exponentially growing amount of video data. Take YouTube for example: about 100 hours of video are uploaded every minute [1]. As a result, traditional manual browsing often turns out to be inefficient and unhelpful when we need relevant and essential video content. Fast and effective approaches are therefore in great demand to help people digest and manipulate huge video archives.

Video summarization aims to produce a concise surrogate of a full-length video and thus offers video users a less time-consuming way to grasp the video's essence. The video surrogate can be either dynamic or static. A dynamic video summary, also known as a video skim, is a playable yet shorter video clip that consists of a number of segments extracted from the original video. A static video summary is often termed an image storyboard, which consists of a series of selected video key frames. Recently, video summarization has drawn increasing research attention in the multimedia community, with a number of algorithms having been proposed in the literature [44,50,61]. A specialized group of methods focuses on raw (i.e., unedited) rushes videos [48,49]. They aim to identify the large proportion of junk and repetitive material in the retakes using algorithms such as clustering [56] and maximal matching by bipartite graphs [6]. Only the most representative video clips are retained to compose a summary much shorter than the original video [64]. Most other video summarization approaches are proposed for edited, content-intensive videos. Many researchers have employed visual (e.g. color, shape, texture and edge) or audio descriptors for identifying the essential video content [4,5,14,54]. Pritch et al. [54] condensed the activities in a video into a synopsis by using color information and either pixel-based or object-based graph optimization. Almeida et al. [4,5] adopted the zero-mean normalized cross correlation metric to measure frame similarity on the color histograms of the compressed-domain DC coefficients. De Avila et al. [14] proposed the VSUMM algorithm for static video summarization. They adopted the k-means algorithm to cluster the pre-sampled video frames into a number of clusters according to the extracted color histogram features and used the information from shot detection to estimate the number of clusters. Matos and Pereira [43] addressed the creation of MPEG-7 compliant summary descriptions via a low-level arousal model, which works primarily for high-action video content such as sports and action movies. A heterogeneity image patch index [13] was proposed to measure the entropy of a video frame based on pixel information only; both static and dynamic video summarization schemes were introduced utilizing this index and were evaluated on consumer videos. Those low-level feature-based algorithms, however, are often inadequate for video summarization, since they do not adequately consider the underlying semantics of the video towards a high-level understanding.

1.1 Related work

The mismatch between low-level features and high-level semantics, known as the semantic gap, prevents video summarizers from fully fulfilling their mission. Therefore, many researchers have explored the semantics for video summarization [9,10,12,15,16,18,20,21,26,41,45,59,60]. Domain-specific rules were designed to detect semantic events in soccer videos [18,60] and


broadcast baseball videos [26]. In contrast to such methods, more generic approaches were developed to extract semantics for less domain-specific applications, including: (1) employing text analysis techniques from natural language processing [59], (2) estimating the amount of attention that would be drawn from viewers [16,41], (3) integrating the linguistic word category to complement the audiovisual content analysis [20,21], (4) utilizing the concept entities of transcript terms [9] or high-level semantic concepts [15] to assist the estimation of visual content importance, (5) exploring the physiological responses of viewers [10,45], and (6) regularizing the sparsity in the video representation to semantically grasp the video [12].

Taskiran et al. [59] proposed a video summarizer exploiting the text information recognized from the speech transcripts of video programs. This algorithm is rooted in natural language processing. Unlike most video summarization approaches, it segments the original video based on automatic pause detection in the audio track rather than on the analysis of visual changes. The term frequency-inverse document frequency is used to derive a score for each video segment. A video skim is then generated by selecting the segments with high scores while fulfilling a summary duration requirement. Although well suited to mining the essential information in audio, this algorithm might be less competent when the viewer is interested in both the semantics and the visual content.

Exploiting the notion of human attention that originated in psychological research [28] provides another angle from which to summarize videos. Extending the pioneering computer vision work on visual attention [2,27], Ma et al. [41] introduced a user attention model framework which can incorporate information from visual, audio and linguistic cues. This method is free from heuristic rule-based understanding of the video's high-level semantics. In its application to video summarization, the video content importance is captured via the attention mechanisms, namely the visual and aural attention models. However, this work did not actually employ the linguistic module, which could bring benefits to the video summary quality. A recent work by Ejaz et al. [16] extracts key frames via a visual attention clue-based approach, in which improved computational efficiency over traditional optical flow-based methods is achieved by using temporal gradient-based dynamic visual saliency detection.

Extending the idea of movie summarization with a visual and audio-based curve [19], Evangelopoulos et al. [21] proposed to further incorporate a textual cue so that a saliency curve derived from three modalities could be formed to detect the salient events that lead to video summaries. In Ref. [20], the hierarchical fusion of multimodal saliency was further explored. Although the authors considered multimodal information, the weights assigned to the movie transcript terms were determined by a simple part-of-speech (POS) based method.

Chen et al. [9] proposed to generalize the transcript terms of video shots into four concept entity categories (namely "who", "what", "where" and "when"), and the shots were correspondingly grouped based on this textual classification. A graph entropy-based method was then used to pick up a subset of significant shots for flexible browsing with their relations emphasized. In Ref. [15], video summarization with high-level concept detection was studied, where the semantic coherence between the video's accompanying subtitles and trained high-level concepts was explored. It has been pointed out that thousands of basic concept detectors might be sufficient to achieve a decent concept detection accuracy [25]. However, training concept detectors requires a considerable amount of manual work, which explains the relatively small number of concept detectors currently available. Nevertheless, incorporating semantic information with other features substantially benefits video summarization [15,22,40].

Physiological data obtained from viewers, such as signals of the brain [65], heart and skin [10,45], have been exploited as external sources of information to overcome the challenge of the semantic gap in video analysis. Money and Agius [45] considered five types of


physiological response measurements, including the electro-dermal response, heart rate, blood volume pulse, respiration rate, and respiration amplitude, to identify the essential parts of movies that match the emotional experience of a viewer. In contrast to Ref. [45], which targets single-user, personalized summarization, Chênes et al. [10] integrated the physiological signals of more than one participant to produce a general highlight of the original video.

Sparse representation has also been explored for video summarization. Cong et al. [12] employed both the CENTRIST descriptor [67] for scene understanding and color moments to represent video frames. The summarization of consumer videos is then mathematically formulated as an iterative dictionary selection problem. Key frames and essential video segments are identified by selecting a sparse dictionary which addresses both the sparseness of the selected visual data and a low reconstruction error in representing the original video.

Although the aforementioned approaches have recognized the importance of using high-level semantics in video summarization, they still suffer from several limitations. These methods either fail to employ the valuable text information or rely on the text information alone. Some conduct their text analysis via a simple differentiation of linguistic term categories or a rough classification of text terms into a very limited number of entity types. Some rely on heuristic settings for feature extraction and fusion. In our previous work [15], we also explored high-level semantic information and trained concepts for video summarization. Unfortunately, the semantic coherence proposed in that paper requires manual settings for concept selection and it simply averages the contributions of different concepts.

1.2 Outline of our work

Anchored in the above observations, this paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm based on the joint and adaptive use of the visual and textual information. The proposed algorithm takes advantage of our developed SEmantic inDicator of videO seGment (SEDOG) feature in exploring the most representative concepts for describing the video. Like our previously introduced semantic coherence [15], SEDOG also leverages the detectors trained on a set of semantic concepts [31] and an external linguistic knowledge base [51]. Differently, SEDOG avoids manually tuned parameters in the concept selection step, and more reasonable feature value derivation schemes are developed. Furthermore, IRDVS incorporates an iterative reweighting scheme to balance the contributions of different features and therefore generates the dynamic video summary in an adaptive manner. To evaluate the proposed algorithm, it has been compared to four state-of-the-art algorithms, namely a speech transcript-based video summarization (STVS) algorithm [59], an attention model-based video summarization (AMVS) algorithm [41], a sparse dictionary selection-based video summarization (DSVS) algorithm [12] and a heterogeneity image patch index-based video summarization (HIPVS) algorithm [13], on different video genres, including documentary, movie and TV news.

2 Proposed method

The proposed IRDVS algorithm consists of four major steps: (1) extracting four groups of low-level visual features from the video frames, (2) deriving two types of high-level visual features and a semantic feature named SEDOG for each video segment based on the low-level visual features and accompanying subtitles, (3) combining the three high-level features in an iteratively estimated linear model to obtain the saliency scores, and (4) producing a


dynamic surrogate of the original video using the saliency scores. The diagram of this algorithm is illustrated in Fig. 1.

Fig. 1 Diagram of the proposed IRDVS algorithm: the video frames and subtitles of a video sequence yield the visual features (color fcm, texture fwt, motion fmv, keypoint fkp) and the high-level features (motion attention fM, face attention fF, SEDOG fE, the latter using semantic concept detectors and WordNet::Similarity), which are fused in an iteratively estimated linear model to produce saliency scores and the video summary

2.1 Low-level visual representation

To enable a semantically meaningful representation of the video content, four groups of basic visual features are extracted from the video frames. This step serves as a preparation for the high-level visual and semantic representation, which will not only connect the signal-level information with the high-level semantics [3,25,57] but also lead to the identification of the salient parts of the original video. Among the various types of low-level descriptors, we employ the color moment, wavelet texture and local keypoint features [31], which will be used to derive the concept detection-based [38,66] semantic feature SEDOG, and the motion feature, which serves the high-level motion attention-based representation.

2.1.1 Color moment features

To capture the low-level fundamental characteristics of a video frame, we divide the frame into 5×5 non-overlapping blocks [30] and, on each block, calculate the first-order moment and the second- and third-order central moments of each color component in the Lab color space [31]. The color moments on all 25 blocks of the i-th frame compose the color moment feature vector fcm(i).
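For concreteness, a minimal sketch of this block-wise moment computation is given below (in Python); it assumes the frame has already been converted to the Lab color space, e.g. with skimage.color.rgb2lab, and any scaling or ordering details specific to [31] are not reproduced.

```python
import numpy as np

def color_moment_features(frame_lab, grid=5):
    """f_cm(i): mean and second/third central moments of each colour channel
    on a grid x grid block partition (Sec. 2.1.1). `frame_lab` is an H x W x 3
    array already converted to Lab, e.g. with skimage.color.rgb2lab."""
    h, w, _ = frame_lab.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = frame_lab[ys[by]:ys[by + 1], xs[bx]:xs[bx + 1], :].reshape(-1, 3)
            mean = block.mean(axis=0)
            centred = block - mean
            m2 = (centred ** 2).mean(axis=0)     # second-order central moment
            m3 = (centred ** 3).mean(axis=0)     # third-order central moment
            feats.append(np.concatenate([mean, m2, m3]))
    return np.concatenate(feats)                 # 25 blocks x 9 values = 225-D
```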

2.1.2 Wavelet texture features

To describe the texture information in each frame, we divide the frame into 3×3 non-overlapping blocks and apply a three-level Haar wavelet decomposition to the intensity values of each block [31]. The variances of the obtained wavelet coefficients in the horizontal, vertical and diagonal directions at each level are computed. The assembly of the variances of the wavelet coefficients on all 9 blocks of the i-th frame is defined as the wavelet texture feature vector fwt(i).
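A sketch of the wavelet texture feature follows, using the PyWavelets package for the three-level Haar decomposition; boundary handling and any coefficient normalization used in [31] are assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_texture_features(gray, grid=3, levels=3):
    """f_wt(i): variances of the Haar wavelet detail coefficients (horizontal,
    vertical, diagonal) at each decomposition level, on a grid x grid block
    partition of a grey-scale frame (Sec. 2.1.2)."""
    h, w = gray.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = gray[ys[by]:ys[by + 1], xs[bx]:xs[bx + 1]].astype(float)
            coeffs = pywt.wavedec2(block, 'haar', level=levels)
            for cH, cV, cD in coeffs[1:]:        # detail coefficients per level
                feats.extend([cH.var(), cV.var(), cD.var()])
    return np.asarray(feats)                     # 9 blocks x 3 levels x 3 = 81-D
```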

2.1.3 Motion features

It is widely recognized that the human eye is sensitive to changes in visual content. To characterize the visual changes, we divide each frame into M×N non-overlapping blocks, each containing 16×16 pixels, and calculate block-based motion vectors via full-search motion estimation [33]. In the i-th frame, let the motion vector on the (m,n)-th block be denoted by v(i,m,n). The motion feature for this frame is the assembly of the M×N motion vectors, fmv(i).
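A minimal sketch of full-search block matching is shown below; the ±8-pixel search range and the sum-of-absolute-differences matching cost are assumptions, as the text does not fix them.

```python
import numpy as np

def full_search_motion_vectors(prev, curr, block=16, search=8):
    """Full-search block matching between consecutive grey-scale frames
    (Sec. 2.1.3): for each block x block block of `curr`, find the displacement
    within +/- `search` pixels in `prev` that minimises the sum of absolute
    differences. Returns an (M, N, 2) array of motion vectors v(i, m, n)."""
    h, w = curr.shape
    M, N = h // block, w // block
    prev = prev.astype(np.int32)
    curr = curr.astype(np.int32)
    mv = np.zeros((M, N, 2), dtype=int)
    for m in range(M):
        for n in range(N):
            y0, x0 = m * block, n * block
            target = curr[y0:y0 + block, x0:x0 + block]
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    sad = np.abs(prev[y:y + block, x:x + block] - target).sum()
                    if best is None or sad < best:
                        best, mv[m, n] = sad, (dy, dx)
    return mv
```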

2.1.4 Local keypoint features

Since local keypoint-based bag-of-features (BoF) representations can serve as an effective complement to globally computed features in semantic video analysis [37], we employ the soft-weighted local keypoint features [30], which are based on the significance of keypoints over a vocabulary of 500 visual words, to characterize salient regions. Given the i-th frame, keypoints are detected by the difference-of-Gaussians (DoG) detector, represented by the scale-invariant feature transform (SIFT) descriptor, and clustered into the 500 visual words. The keypoint feature vector fkp(i) is calculated as the weighted similarities between the keypoints and the visual words under a four-nearest-neighbor principle [30].
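The sketch below illustrates one common soft-weighting scheme over a visual vocabulary, in which each keypoint votes for its four nearest visual words with weights that decay with the neighbor rank; the exact weighting function of [30] is not reproduced here, so the decay factor and distance damping are assumptions.

```python
import numpy as np

def soft_weighted_bof(descriptors, codebook, k=4):
    """Soft-weighted bag-of-features histogram (Sec. 2.1.4). `descriptors` is
    an n x d array of local descriptors (e.g. SIFT), `codebook` a V x d array
    of visual words (V = 500 in the paper). Each keypoint votes for its k
    nearest words with geometrically decaying, distance-damped weights; the
    exact weighting of [30] may differ from this sketch."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dist = np.linalg.norm(codebook - d, axis=1)
        for rank, idx in enumerate(np.argsort(dist)[:k]):
            hist[idx] += (0.5 ** rank) / (1.0 + dist[idx])
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```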

2.2 High-level visual and semantic representation

The grasp of video information towards content selection can be more natural and reliable when a high-level representation is available [18,44,68]. Based on the fundamental visual features produced in the previous subsection and the video's accompanying text, we consider both the clues from user attention and multimodal semantics, and derive three groups of high-level visual and semantic representations for any given video segment χs that ranges from the i1(s)-th frame to the i2(s)-th frame.

2.2.1 Motion attention features

The early work by James [28] and other researchers in psychology on human attention has set a crucial foundation for the use of attention modeling in computer vision [27,41,55]. The cognitive mechanism of attention is essential in the analysis and understanding of human thinking and activities [29,35,53] and hence useful in selecting relatively salient content for video summaries [7,8,16]. We employ the motion attention model, which is a component of the widely used user attention model [41], to derive high-level motion attention features that are more suitable for semantic analysis.

Suppose we have a spatial window with 5×5 blocks in the same frame, and a temporal window with 7 blocks at the same spatial location in adjacent frames. Let both windows be centered on the (m,n)-th block of the i-th frame, and consider the motion vectors inside them. We evenly divide the phase interval [0,2π) into eight bins and count the spatial phase histogram $H^{(s)}_{i,m,n}(\zeta)$, 1≤ζ≤8, in the spatial window and the temporal phase histogram $H^{(t)}_{i,m,n}(\zeta)$, 1≤ζ≤8, in the temporal window, respectively. The spatial coherence inductor and temporal coherence inductor [41] are then calculated as

$$C_s(i,m,n) = -\sum_{\zeta} p_s(\zeta)\,\log p_s(\zeta), \qquad (1)$$

$$C_t(i,m,n) = -\sum_{\zeta} p_t(\zeta)\,\log p_t(\zeta), \qquad (2)$$

where $p_s(\zeta) = H^{(s)}_{i,m,n}(\zeta)\big/\sum_{\zeta} H^{(s)}_{i,m,n}(\zeta)$ and $p_t(\zeta) = H^{(t)}_{i,m,n}(\zeta)\big/\sum_{\zeta} H^{(t)}_{i,m,n}(\zeta)$ are the phase distributions in the spatial window and temporal window, respectively. Then, the motion attention feature for the i-th frame can be defined by combining the magnitudes of the motion vectors, the spatial coherence inductors and the temporal coherence inductors as follows:

$$MOT(i) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N} |v(i,m,n)|\,C_t(i,m,n)\,\big[1 - |v(i,m,n)|\,C_s(i,m,n)\big]. \qquad (3)$$

Finally, we apply a median filter with nine input elements to the obtained motion attention feature sequence to suppress the noise and violent changes between adjacent frames. The motion attention feature for the video segment χs is the average of the smoothed features {MOT(i) | i = i1(s),…,i2(s)} over the related frames:

$$f_M(s) = \frac{1}{i_2(s) - i_1(s) + 1}\sum_{i=i_1(s)}^{i_2(s)} MOT(i). \qquad (4)$$
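Putting Eqs. (1)–(4) together, the following sketch computes the frame-level and segment-level motion attention from the motion-vector fields of Sec. 2.1.3. Rescaling the motion magnitudes and the phase entropies to [0,1] is an assumption made to keep the bracketed term of Eq. (3) well behaved; the precise normalization in the user attention model [41] may differ.

```python
import numpy as np
from scipy.ndimage import median_filter

def phase_entropy(vectors, bins=8):
    """Entropy of the motion-vector phase histogram, cf. Eqs. (1)-(2),
    normalised by log(bins) to lie in [0, 1] (an assumption)."""
    phases = np.arctan2(vectors[:, 0], vectors[:, 1]) % (2 * np.pi)
    hist, _ = np.histogram(phases, bins=bins, range=(0, 2 * np.pi))
    p = hist / hist.sum() if hist.sum() > 0 else np.full(bins, 1.0 / bins)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(bins))

def motion_attention_frame(mv_seq, i, spatial=2, temporal=3):
    """MOT(i) of Eq. (3). mv_seq[i] is the (M, N, 2) motion-vector field of
    frame i; spatial=2 gives the 5x5 block window and temporal=3 the 7-frame
    window of Sec. 2.2.1. Motion magnitudes are rescaled to [0, 1] here."""
    M, N, _ = mv_seq[i].shape
    mag = np.linalg.norm(mv_seq[i], axis=2)
    mag = mag / (mag.max() + 1e-8)
    att = 0.0
    for m in range(M):
        for n in range(N):
            sp = mv_seq[i][max(0, m - spatial):m + spatial + 1,
                           max(0, n - spatial):n + spatial + 1].reshape(-1, 2)
            tw = np.stack([mv_seq[t][m, n]
                           for t in range(max(0, i - temporal),
                                          min(len(mv_seq), i + temporal + 1))])
            Cs = phase_entropy(sp)          # spatial coherence inductor, Eq. (1)
            Ct = phase_entropy(tw)          # temporal coherence inductor, Eq. (2)
            att += mag[m, n] * Ct * (1.0 - mag[m, n] * Cs)
    return att / (M * N)

def motion_attention_segment(mot_sequence, i1, i2):
    """f_M(s) of Eq. (4): 9-tap median filtering followed by averaging."""
    smoothed = median_filter(np.asarray(mot_sequence, float), size=9)
    return float(smoothed[i1:i2 + 1].mean())
```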

2.2.2 Face attention features

The presence of human faces usually indicates the semantic importance of the video content. We adopt the robust real-time face detection algorithm [62] to detect the human faces in each frame. For the j-th detected face, besides its area $A_F(j)$, a position-related weight $w_{fp}(j)$ is also assigned to express the attention it draws from viewers. The distribution of this weight inside each frame is illustrated in Fig. 2. Thus, the face attention feature for the i-th frame is calculated as [41]

$$FAC(i) = \frac{\sum_j w_{fp}(j)\, A_F(j)}{A_{frm}}, \qquad (5)$$

where $A_{frm}$ is the area of the whole video frame. To alleviate the impact of imperfect face detection, the obtained face attention feature sequence is smoothed by a median filter with five input elements. The face attention feature for the video segment χs is defined as the average of the smoothed features {FAC(i) | i = i1(s),…,i2(s)} over the related frames:

$$f_F(s) = \frac{1}{i_2(s) - i_1(s) + 1}\sum_{i=i_1(s)}^{i_2(s)} FAC(i). \qquad (6)$$
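A sketch of Eqs. (5)–(6) follows. The face boxes may come from any implementation of the Viola–Jones detector [62] (e.g. an OpenCV Haar cascade); the 3×3 position-weight grid only approximates Fig. 2, and the uniform grid used to locate a face's cell is an assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

# 3x3 centre-biased position weights, approximating the layout of Fig. 2
POS_WEIGHTS = np.array([[1, 2, 1],
                        [4, 8, 4],
                        [1, 2, 1]], dtype=float) / 24.0

def face_attention_frame(faces, frame_h, frame_w):
    """FAC(i) of Eq. (5). `faces` is a list of (x, y, w, h) boxes, e.g. from a
    Viola-Jones detector [62]. A uniform 3x3 grid is assumed when locating a
    face's cell (the grid in Fig. 2 may use unequal row heights)."""
    fac = 0.0
    for (x, y, w, h) in faces:
        cx, cy = x + w / 2.0, y + h / 2.0
        col = min(2, int(3 * cx / frame_w))
        row = min(2, int(3 * cy / frame_h))
        fac += POS_WEIGHTS[row, col] * (w * h)
    return fac / (frame_h * frame_w)

def face_attention_segment(fac_sequence, i1, i2):
    """f_F(s) of Eq. (6): 5-tap median filtering followed by averaging."""
    smoothed = median_filter(np.asarray(fac_sequence, float), size=5)
    return float(smoothed[i1:i2 + 1].mean())
```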

2.2.3 SEDOG

Mining the concepts relevant to the visual information is beneficial to identifying the semantic meaning conveyed in a video sequence [24,46,58,69,70]. To exploit the semantics, we introduce a high-level feature, namely the SEmantic inDicator of videO seGment (SEDOG).

Fig. 2 Position-related face weights in each frame (left) and a sample image with weights assigned to three faces (right)


SEDOG is computed using VIREO-374 [31], which consists of 374 concepts and three types of support vector machines (SVMs) for each concept. The SVMs have been trained using the color moment features, wavelet texture features and local keypoint features to estimate the membership of a video frame belonging to the corresponding concept. Some concepts in VIREO-374 are listed in Table 1.

For the s-th video segment, we calculate its SEDOG feature in three major steps, as summarized in Fig. 3. First, for the middle frame [9] im(s) of χs, we extract the color moment features fcm(im(s)), wavelet texture features fwt(im(s)) and local keypoint features fkp(im(s)), and apply each group of features to the SVM-based concept detectors in VIREO-374 to generate the semantic memberships {ucm(s,j), uwt(s,j), ukp(s,j) | j = 1,2,…,374} of the frame belonging to each concept. The concept membership of the segment χs and the j-th concept is defined as

$$u(s,j) = \frac{u_{cm}(s,j) + u_{wt}(s,j) + u_{kp}(s,j)}{3}, \qquad (7)$$

which represents the confidence that the three types of detectors mutually have that the j-th concept relates to the frame im(s) of the video segment χs.

Fig. 3 Diagram for computing the SEDOG for each video segment: the key frame provides the color moment (fcm), wavelet texture (fwt) and local keypoint (fkp) features, which are fed to the VIREO-374 concept detectors to obtain the concept membership u; the subtitles and WordNet::Similarity provide the textual relatedness ρ; both are combined into SEDOG (fE)

Table 1 Examples of single-word concepts and multi-word concepts in VIREO-374. Parts-of-speech (e.g. #n) and proper senses (e.g. #1) are manually assigned based on the concept definitions [32] and WordNet::Similarity [51]

Single-word concepts                         Multi-word concepts
Original concept   Term with POS and sense   Original concept     Constituent terms with POSs and sense indices
Airplane           airplane#n#1              Corporate_Leader     company#n#1, leader#n#1
Animal             animal#n#1                Explosion_Fire       explosion#n#1, fire#n#1
Car                car#n#1                   Industrial_Setting   industry#n#1, setting#n#2
Desert             desert#n#1                Male_News_Subject    male#n#2, news#n#1, subject#n#6
Forest             forest#n#1                Pedestrian_Zone      pedestrian#n#1, zone#n#1
First_Lady         first_lady#n#2            People_Marching      person#n#1, marching#n#1
Military           military#n#1              Police_Security      police#n#1, security#n#1
Pipes              pipe#n#2                  Us_Flags             america#n#1, flag#n#1
Shopping_Mall      shopping_mall#n#1         Television_Tower     television#n#1, tower#n#1

Second, the semantic similarity is measured in terms of the textual information using subtitles. Since the subtitles in neighboring segments often relate to each other and contribute to the same semantics, a temporal window that consists of W segments is centered on the current segment. Let all subtitle terms (except the stop words1) in the W segments be denoted by Γst(s) and the set of constituent words of the j-th concept be denoted by Γcp(j). The textual semantic similarity between the segment χs and the j-th concept is calculated as

$$\kappa(s,j) = \max_{\gamma\in\Gamma_{st}(s)} \frac{1}{|\Gamma_{cp}(j)|}\sum_{\omega\in\Gamma_{cp}(j)} \eta(\gamma,\omega), \qquad (8)$$

where |·| denotes the cardinality of a set, and η(·,·) is the semantic similarity between a pair of linguistic terms obtained using the WordNet::Similarity package [51]. To accurately retrieve the term pair similarity, we have manually picked the parts-of-speech and proper senses of the constituent terms of the concepts according to their definitions [32]. Note that concepts like "First_Lady" and "Shopping_Mall" shown in Table 1 are still considered single-word concepts, since each of them is defined as a whole in WordNet::Similarity.

To reduce the impact of less relevant concepts, the textual relatedness is defined by thresholding the textual semantic similarity according to the corresponding concept membership:

$$\rho(s,j) = \begin{cases} \dfrac{1}{Q}\,\kappa(s,j), & u(s,j)\in(0.5,\,1] \\[4pt] 0, & u(s,j)\in[0,\,0.5] \end{cases} \qquad (9)$$

where Q is a normalization factor that ensures $\sum_{j=1}^{374}\rho(s,j)=1$. Since the outputs of the SVMs are probabilities of binary classification problems, a threshold of 0.5 is naturally used in (9).

Finally, the SEDOG score of the segment χs, denoted by fE(s), is defined as the sum of the concept memberships weighted by the corresponding textual relatednesses:

$$f_E(s) = \sum_{j=1}^{374} \rho(s,j)\, u(s,j). \qquad (10)$$

In this formulation, the textual relatedness ρ(s,j) is used to adjust the contribution of the j-th concept or even prune the concept when ρ(s,j)=0. An example of computing the SEDOG score is illustrated in Fig. 4.
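The complete SEDOG computation of Eqs. (7)–(10) can be summarized as follows, assuming that the per-detector concept probabilities and a term-pair similarity function are already available; the term_sim argument is a hypothetical stand-in for the WordNet::Similarity backend [51].

```python
import numpy as np

def sedog_score(u_cm, u_wt, u_kp, subtitle_terms, concept_terms, term_sim):
    """SEDOG feature f_E(s) for one segment, following Eqs. (7)-(10).
    u_cm, u_wt, u_kp : length-374 arrays of SVM probability outputs of the
                       colour-moment, wavelet-texture and keypoint detectors.
    subtitle_terms   : subtitle terms (stop words removed) in the W-segment
                       window, Gamma_st(s).
    concept_terms    : 374 lists with the constituent terms Gamma_cp(j).
    term_sim(a, b)   : term-pair similarity eta(a, b); a stand-in for the
                       WordNet::Similarity backend used in the paper."""
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0      # Eq. (7)

    kappa = np.zeros(len(concept_terms))                                    # Eq. (8)
    for j, words in enumerate(concept_terms):
        if subtitle_terms and words:
            kappa[j] = max(np.mean([term_sim(g, w) for w in words])
                           for g in subtitle_terms)

    rho = np.where(u > 0.5, kappa, 0.0)                                     # Eq. (9)
    if rho.sum() > 0:
        rho = rho / rho.sum()          # the normalisation factor 1/Q

    return float(np.dot(rho, u))                                            # Eq. (10)
```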

2.3 Dynamic video summarization

In this study, video summarization aims to select a subset of video segments based on their saliency scores. For each video segment χs, we define its saliency score as a linear combination [17,34] of its motion attention feature fM(s), face attention feature fF(s) and SEDOG feature fE(s):

$$f_{SAL}(s) = w_M(s)\, f_M(s) + w_F(s)\, f_F(s) + w_E(s)\, f_E(s), \qquad (11)$$

where wM(s), wF(s) and wE(s) are the feature weighting parameters. Note that each type of feature is linearly normalized to the interval [0,1] prior to the linear fusion.

1 http://www.tomdiethe.com/teaching/remove_stopwords.m


The video summarization problem is now cast into the estimation of the weighting parameters, which can be achieved in the following iterative process.

Let the iterations be indexed by k. At the k-th iteration, we first calculate each weighting parameter $w_\#(s)$ ($\#\in\{M,F,E\}$), which is determined simultaneously by a macro factor $\alpha_\#(s)$ and a micro factor $\beta_\#(s)$:

$$w_\#(s) = \alpha_\#(s)\cdot\beta_\#(s). \qquad (12)$$

The macro factor $\alpha_\#(s)$ measures the relative significance of $f_\#(s)$ globally in the entire video sequence and can be defined as

$$\alpha_\#(s) = 1 - \frac{r_\#(s)}{N_S}, \qquad (13)$$

where $r_\#(s)$ is the rank of $f_\#(s)$ after $\{f_\#(s)\,|\,s=1,2,\ldots,N_S\}$ is sorted in descending order, and $N_S$ is the total number of segments in the video. The micro factor $\beta_\#(s)$ captures the importance of the current video segment χs compared to its temporally closest segment χs′ in the previous video summary and can be calculated as

$$\beta^{(k)}_\#(s) = 1 + \frac{f_\#\big(s^{(k)}\big) - f_\#\big(s'^{(k-1)}\big)}{f_\#\big(s^{(k)}\big) + f_\#\big(s'^{(k-1)}\big)}. \qquad (14)$$

With the high-level visual and semantic features and the currently estimated weighting parameters $w_M^{(k)}(s)$, $w_F^{(k)}(s)$ and $w_E^{(k)}(s)$, we can calculate the saliency score $f_{SAL}^{(k)}(s)$ and sort the saliency scores of all video segments in descending order. Then, the first $N_{VS}^{(k)}$ segments of the sorted sequence, i.e. those with the largest saliency scores, are selected to form a new video summary. The number of chosen segments $N_{VS}^{(k)}$ should also satisfy the constraint that the length of the video summary does not exceed a pre-specified target duration limit.

Fig. 4 An example showing the computation of SEDOG

The weighting parameters are initialized equally and the iterative estimation terminates when a maximum iteration number Kmax is reached. In our experiments, Kmax is set to 15. The video summarization process is summarized in Algorithm 1.

Algorithm 1. Video summarization based on the iteratively estimated linear model
1: Input: fM(s), fF(s), fE(s), s = 1,2,…,NS; target duration limit.
2: Initialization: $w_M^{(0)}(s) = w_F^{(0)}(s) = w_E^{(0)}(s) = 1/3$, s = 1,2,…,NS; k = 0.
3: Compute $f_{SAL}^{(k)}(s)$ according to Eq. (11). Sort $f_{SAL}^{(k)}(s)$; obtain the initial summary considering the saliency ranks and the target duration limit.
4: Calculate αM(s), αF(s) and αE(s) using Eq. (13).
5: while k ≤ Kmax do
6:   k ← k + 1.
7:   Update $\beta_M^{(k)}(s)$, $\beta_F^{(k)}(s)$ and $\beta_E^{(k)}(s)$ via Eq. (14).
8:   Update $w_M^{(k)}(s)$, $w_F^{(k)}(s)$ and $w_E^{(k)}(s)$ via Eq. (12).
9:   Update $f_{SAL}^{(k)}(s)$ via Eq. (11).
10:  Sort $f_{SAL}^{(k)}(s)$; obtain a new summary for the k-th iteration considering the saliency ranks and the target duration limit.
11: end while
12: Output: Final video skim (i.e. the summary of the Kmax-th iteration).
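A compact sketch of Algorithm 1 is given below. The greedy, duration-constrained segment selection and the way the temporally closest segment s′ is located are assumptions not fixed by the pseudocode; the authors' implementation may handle ties and the duration constraint differently.

```python
import numpy as np

def irdvs_summary(f_M, f_F, f_E, durations, budget, k_max=15):
    """Iteratively reweighted segment selection (sketch of Algorithm 1).
    f_M, f_F, f_E: per-segment features, already normalised to [0, 1];
    durations: segment lengths in seconds; budget: target summary duration."""
    feats = {'M': np.asarray(f_M, float), 'F': np.asarray(f_F, float),
             'E': np.asarray(f_E, float)}
    n_seg = len(durations)

    # Macro factors, Eq. (13): alpha = 1 - rank/N_S, rank from a descending sort
    alpha = {}
    for key, f in feats.items():
        ranks = np.empty(n_seg)
        ranks[np.argsort(-f)] = np.arange(1, n_seg + 1)
        alpha[key] = 1.0 - ranks / n_seg

    def select(saliency):
        # Greedily keep the highest-saliency segments that still fit the budget
        chosen, used = [], 0.0
        for s in np.argsort(-saliency):
            if used + durations[s] <= budget:
                chosen.append(int(s))
                used += durations[s]
        return sorted(chosen)                      # keep original temporal order

    w = {key: np.full(n_seg, 1.0 / 3.0) for key in feats}          # step 2
    saliency = sum(w[k] * feats[k] for k in feats)                 # Eq. (11)
    summary = select(saliency)                                     # step 3

    for _ in range(k_max):                                         # steps 5-11
        # temporally closest segment s' in the previous summary, for every s
        # (assumes at least one segment fits the duration budget)
        closest = np.array([min(summary, key=lambda c: abs(c - s))
                            for s in range(n_seg)])
        for key, f in feats.items():
            beta = 1.0 + (f - f[closest]) / (f + f[closest] + 1e-8)  # Eq. (14)
            w[key] = alpha[key] * beta                               # Eq. (12)
        saliency = sum(w[k] * feats[k] for k in feats)               # Eq. (11)
        summary = select(saliency)
    return summary
```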

3 Experiments and results

Although numerous videos are accessible nowadays from shared repositories (e.g. YouTube2 and the Open Video Project3) and released datasets (e.g. Kodak's consumer video benchmark dataset [39] and the VSUMM dataset [14]), the evaluation of the proposed algorithm cannot be directly carried out using these available resources. The main reason is that no suitable ground truths are provided for dynamic video summarization. For the Open Video Project and the VSUMM dataset, although storyboards of key frames are available, they were produced specifically for evaluating static video summarization rather than dynamic summarization.

In our experiments, for a comprehensive performance evaluation, the proposed IRDVS algorithm and the other competing algorithms were applied to a dataset of 14 test videos. This dataset has a total duration of about 7.6 h and spans different genres, namely documentary, movie and news, as listed in Table 2. Documentaries were included based on the consideration that their subtitles largely match the video content and semantics [9]. Movies and news have also been widely utilized in video summarization. The four documentaries were obtained from YouTube, the movie videos, each lasting about half an hour, were extracted as continuous clips from three well-known movies, and the news videos are from NBC News, USA and ABC News, Australia. For the NBC News videos, commercials were manually removed and the remaining segments were concatenated to form our test videos.

2 http://www.youtube.com/
3 http://www.open-video.org/


In the experiments, the errors in the video subtitles were preserved so that the algorithms could be tested in a practical setting.

3.1 Performance evaluation

The evaluation of video summarization algorithms remains an open problem and is quite subjective. Although quantitative scores can be given by users to measure informativeness, enjoyability [42,47], satisfaction [52], experience [63], interrelation [9] and so on, subjective evaluation can merely provide an overall and rough assessment of the entire summary. As to objective evaluation, the metrics used are often designed for specific video types. For instance, the duration of the summary and the difference between the sizes of the target and actual summaries are adopted for rushes video summarization [56,64], while the content missing rate, calculated as the percentage of content elements defined in the original video but missing from the summary, is employed for instructional videos [11].

In this paper, unlike the methods obtaining a single score from each user for the whole summary, we used a divide-and-conquer strategy based on video segment scores from users. Firstly, the original video was manually partitioned into segments, each of which maintains semantically independent content. Next, an importance score was assigned to each video segment by the invited user. Then the algorithm-made video skims were compared to their ground-truth counterparts, which were generated based on the segment scores. This method was inspired by the above-mentioned whole-summary rating approaches. However, the major difference is that it makes the evaluation process more manageable, since the user only has to deal with a number of smaller and simpler subtasks.

For performance evaluation, the metrics of precision (P), recall (R) and F-measure (F) were utilized. In our scenario, the precision is the fraction of video segments in the algorithm-made skim that are correct according to the ground-truth summary, while the recall is defined as the fraction of correct video segments that are picked up by the algorithm. The F-measure is the harmonic mean of the precision and recall and can be used as a more comprehensive metric.

Table 2 Test videos

Genre        Video                                             Abbreviation   Duration (mins)
Documentary  Astrobiology                                      ASTR           44.5
             Constellations                                    CSTL           44.5
             Cosmic Holes                                      CSMH           44.5
             Mars                                              MARS           44.1
Movie        A Beautiful Mind (part 1)                         BM-I           23.4
             A Beautiful Mind (part 2)                         BM-II          39.0
             Harry Potter and the Sorcerer's Stone (part 1)    HP1-I          37.5
             Harry Potter and the Sorcerer's Stone (part 2)    HP1-II         23.2
             The Legend of 1900 (part 1)                       LGD-I          26.7
             The Legend of 1900 (part 2)                       LGD-II         28.1
TV news      ABC News, Australia (part 1)                      ABC-I          30.0
             ABC News, Australia (part 2)                      ABC-II         30.0
             NBC News, USA (part 1)                            NBC-I          21.8
             NBC News, USA (part 2)                            NBC-II         18.7


To obtain these metrics, the video segments in the algorithm-made video skim are categorized into true positives (TP) and false positives (FP). For a given video segment, if at least a fraction p (50 % in our experiments) of its frames also appear in the ground-truth summary, it is considered a true positive; otherwise, it is a false positive. In the ground-truth summary, any video segment that is not sufficiently matched by any segment in the true positive set falls into the false negative (FN) category. Therefore, the precision, recall and F-measure of an algorithm-made summary can be calculated as follows:

$$P = \frac{n_{TP}}{n_{TP}+n_{FP}}, \qquad R = \frac{n_{TP}}{n_{TP}+n_{FN}}, \qquad F = \frac{2PR}{P+R}, \qquad (15)$$

where nTP, nFP and nFN are the numbers of true positives, false positives and false negatives, respectively.
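As a concrete reading of this protocol, the sketch below computes P, R and F from frame-range segments. The same threshold p is reused to decide whether a ground-truth segment is "sufficiently matched" by the true positives, since the text does not state that criterion numerically; this is an assumption.

```python
def evaluate_summary(algo_segments, gt_segments, p=0.5):
    """Precision, recall and F-measure of Eq. (15). Segments are
    (first_frame, last_frame) pairs, assumed non-overlapping within a summary."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

    def matched(seg, others):
        covered = sum(overlap(seg, o) for o in others)
        return covered >= p * (seg[1] - seg[0] + 1)

    tp = [s for s in algo_segments if matched(s, gt_segments)]   # true positives
    n_tp, n_fp = len(tp), len(algo_segments) - len(tp)
    n_fn = sum(1 for g in gt_segments if not matched(g, tp))     # unmatched ground truth

    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f
```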

3.2 Ground-truth summary

A ground-truth video skim was produced based on the scores given by an invited user and the target summary length. Firstly, the user viewed the original full-length video and grasped its structure and ideas. Then, considering the relative content importance of the video segments, the user provided quantitative scores in the interval [0,100] for all segments. In our experiments, no time limit was set for completing the whole task, so the video could be played as many times as necessary.

Given the user scores, all video segments were ranked in descending order. A ground-truth summary that maximally uses but does not exceed the summary length budget was then formed by concatenating the highly ranked segments while preserving their order of appearance in the original video. In the experiments, three users participated in the ground-truth making task independently. Hence, each test video has three different ground-truth summaries for any given summary length.
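One simple greedy reading of this construction is sketched below: segments are ranked by their user scores, added while they still fit the length budget, and finally reordered by their appearance in the original video.

```python
def ground_truth_summary(scores, durations, budget):
    """Build a ground-truth skim from user importance scores (Sec. 3.2):
    rank segments by score, keep the highest-ranked ones that still fit the
    duration budget, then restore their original temporal order."""
    order = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    chosen, used = [], 0.0
    for s in order:
        if used + durations[s] <= budget:
            chosen.append(s)
            used += durations[s]
    return sorted(chosen)   # order of appearance in the original video
```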

3.3 Results

The proposed IRDVS algorithm was compared to four state-of-the-art video summarization methods, namely the STVS algorithm [59], the AMVS algorithm [41], the DSVS algorithm [12] and the HIPVS algorithm [13]. The STVS algorithm temporally partitions a video using audio pause detection and then computes a score for each video segment based on the automatically recognized speech transcripts to summarize the video. In our implementation, we directly provided the STVS algorithm with the video subtitles and the manually obtained video partitioning results used in making the ground-truth summaries. These two groups of information are near-perfect substitutes for the results of automatic speech recognition and audio pause detection, since they are free from the performance limitations of those two modules. As to our implementation of the AMVS algorithm, the motion attention model and the face attention model were integrated via a linear fusion in which the weights, in the absence of prior knowledge, were set to be equal. Since Ref. [41] suggested using the information of pauses and silence to decide video segment boundaries when making video skims, we also provided the AMVS algorithm with the same manually produced video partitioning results as we did for STVS. We tested the DSVS algorithm with the parameter values suggested by its authors and terminated its dictionary selection process after it had sufficiently converged. The HIPVS algorithm was implemented with the suggested spatial downsampling operation on all original video frames.


The video summary quality evaluated using the precision, recall and F-measure is presented in Table 3. For a given video genre, the performances were calculated as follows. Firstly, the video summaries were generated by applying all five algorithms at summary lengths of 20 %, 30 % and 40 %. Next, each summary was compared to the three user ground-truth summaries respectively to yield the metric values P, R and F. Finally, for each algorithm, its results on all test videos of the given genre and against all ground truths were averaged to give its performance for a specific summary length and metric.

The results on the documentary, movie and news genres show that all five algorithms generate better summaries as the summary length increases. Our proposed IRDVS algorithm attained the best results in most cases, with the exception that, for movies and under the precision metric, it is the second best on the 20 % and 30 % summaries and marginally underperformed the second best on the 40 % summaries.

Table 3 Performance comparison of the five video summarization algorithms based on the precision (P), recall (R) and F-measure (F). For each summary length, metric and genre, the values of the STVS, AMVS, DSVS, HIPVS and proposed IRDVS algorithms are listed, together with the rank of the IRDVS algorithm among the five

Summary length  Metric  Genre        STVS   AMVS   DSVS   HIPVS  IRDVS  IRDVS rank
20 %            P       Documentary  0.107  0.184  0.172  0.171  0.235  1
                        Movie        0.145  0.360  0.184  0.171  0.304  2
                        News         0.134  0.334  0.231  0.149  0.436  1
                        Overall      0.129  0.293  0.196  0.164  0.325  1
                R       Documentary  0.042  0.120  0.135  0.164  0.415  1
                        Movie        0.088  0.204  0.187  0.244  0.644  1
                        News         0.041  0.200  0.151  0.155  0.623  1
                        Overall      0.057  0.175  0.158  0.188  0.561  1
                F       Documentary  0.060  0.145  0.151  0.167  0.297  1
                        Movie        0.108  0.251  0.182  0.199  0.410  1
                        News         0.063  0.248  0.177  0.146  0.496  1
                        Overall      0.077  0.215  0.170  0.171  0.401  1
30 %            P       Documentary  0.219  0.253  0.296  0.295  0.339  1
                        Movie        0.350  0.431  0.258  0.231  0.376  2
                        News         0.209  0.368  0.322  0.205  0.483  1
                        Overall      0.259  0.351  0.292  0.244  0.399  1
                R       Documentary  0.100  0.146  0.228  0.288  0.505  1
                        Movie        0.176  0.266  0.211  0.340  0.726  1
                        News         0.075  0.249  0.229  0.234  0.678  1
                        Overall      0.117  0.220  0.223  0.288  0.636  1
                F       Documentary  0.136  0.185  0.257  0.291  0.402  1
                        Movie        0.230  0.327  0.229  0.273  0.494  1
                        News         0.109  0.295  0.266  0.209  0.553  1
                        Overall      0.158  0.269  0.251  0.258  0.483  1
40 %            P       Documentary  0.356  0.364  0.383  0.403  0.415  1
                        Movie        0.479  0.536  0.329  0.360  0.475  3
                        News         0.406  0.424  0.362  0.317  0.512  1
                        Overall      0.414  0.441  0.358  0.360  0.468  1
                R       Documentary  0.162  0.213  0.295  0.405  0.563  1
                        Movie        0.262  0.325  0.252  0.479  0.779  1
                        News         0.150  0.285  0.267  0.382  0.707  1
                        Overall      0.191  0.274  0.271  0.422  0.683  1
                F       Documentary  0.222  0.269  0.333  0.403  0.476  1
                        Movie        0.336  0.401  0.283  0.410  0.586  1
                        News         0.218  0.339  0.305  0.335  0.590  1
                        Overall      0.259  0.336  0.307  0.383  0.551  1


Although the HIPVS algorithm is often the second best on documentaries and the AMVS algorithm yielded comparatively good results on the movie and news videos, both algorithms are text-free in nature and thus less able to balance the multi-modality information across all three video genres. A further comparison of the recall values shows that the performance of the STVS algorithm is comparatively low, especially at the summary length of 20 %, which is probably due to its text-only analysis. Accordingly, the higher recalls of the proposed IRDVS algorithm can probably be attributed to its exploitation of information from both the visual and textual modalities.

In the last column of Table 3, the overall performances, calculated as the mean of the per-genre results, indicate that our IRDVS algorithm is the best in all cases. Furthermore, the five algorithms were ranked based on the metric values averaged over all summary lengths, as listed in Table 4. The overall rank of the IRDVS algorithm is 1.00.

Since multiple user ground-truth summaries were used in the evaluation, we analyzed the fluctuation of the precision, recall and F-measure. For each test video, the metric values corresponding to the three user ground truths were used to compute an individual standard deviation, and these individual values were then averaged over all test videos, as illustrated in Fig. 5. These statistics demonstrate that the results of IRDVS against different ground truths are quite stable. However, the IRDVS, DSVS and HIPVS algorithms are generally advantageous in terms of the standard deviation at the 20 %, 30 % and 40 % summary lengths respectively, each winning two out of three cases. Since the value ranges of the five algorithms' performances are quite different (see Table 3), the relative standard deviation (RSD) [23,36], defined as the ratio of the standard deviation to the mean, was further employed. According to the RSDs of all metrics shown in Fig. 6, the proposed IRDVS algorithm outperformed the other four competitors in all cases.

4 Discussions

4.1 Impact of window size parameter W

The window size parameter W is a crucial factor for SEDOG. When W varies, a different amount of context information from the textual modality is considered. Therefore, it is desirable to find a value of W that balances the accuracy and the computational load of the IRDVS algorithm.

Table 4 Performances and ranks of the STVS, AMVS, DSVS, HIPVS and proposed IRDVS algorithms based on the precision (P), recall (R) and F-measure (F). The numbers in parentheses are the ranks among the five algorithms; the overall ranks are averaged over all the metric ranks

Algorithm      STVS        AMVS        DSVS        HIPVS       IRDVS
P              0.267 (4)   0.362 (2)   0.282 (3)   0.256 (5)   0.397 (1)
R              0.122 (5)   0.223 (3)   0.217 (4)   0.299 (2)   0.627 (1)
F              0.165 (5)   0.273 (2)   0.242 (4)   0.270 (3)   0.478 (1)
Overall rank   4.67        2.33        3.67        3.33        1.00


Performances in the precision, recall and F-measure, computed over all video genres and user ground truths, are illustrated in Fig. 7 for five different settings of W. Additionally, the performances averaged over the three summary lengths are reported in Table 5, together with the metric ranks and overall ranks. Since the scheme with a window size of 3 obtained the overall optimal rank of 1.33, we set W to 3 in all experiments.

Fig. 5 Stability, measured as the standard deviation (smaller is better), of the algorithm-generated video summaries when evaluated against the ground-truth summaries made using the scores from different users, grouped by summary length (20 %, 30 % and 40 %). Each value is an average of the standard deviations over all test videos

Fig. 6 The relative standard deviations (smaller is better) of the performances of the five algorithms (STVS, AMVS, DSVS, HIPVS, IRDVS) at summary lengths of 20 %, 30 % and 40 %. Each value is computed over the results against the ground-truth summaries from different users


4.2 Role and contribution of the iteratively estimated linear model

4.2.1 Convergence property

In the iterative weight estimation process, since the feature sequences remain the same across iterations, the feature weighting parameters determine the segment saliency scores and thus the video summaries. Therefore, we examined the average change of the feature weights with the maximum number of iterations Kmax set to 15. The curves for the documentary videos "CSTL" and "CSMH", the movies "HP1-I" and "LGD-II" and the news videos "ABC-II" and "NBC-II" are illustrated in Fig. 8. It can be observed that the feature weights change rapidly within the first few iterations and that the proposed algorithm gradually reaches a quite stable set of weights after about five iterations in most cases.

4.2.2 Final summary versus initial summary

To demonstrate the improvement of the final summary obtained by the iterative weight estimation process over the initial summary based on equal-weight fusion, Fig. 9 shows several example frames of video segments, together with their accompanying subtitles, from the 20 %-long summaries. All these examples appear in the final summaries but were not picked up in the initial summaries. This shows that the iteratively reweighting process effectively incorporates highly relevant and semantically essential content into the final summaries.

Fig. 7 Performance comparison of the IRDVS algorithm when the window size W for SEDOG takes the values 1, 3, 5, 7 and 9, at summary lengths of 20 %, 30 % and 40 %

Table 5 Average performances of the IRDVS algorithm with different window size settings for SEDOG. The numbers in parentheses are the metric ranks

W              1           3           5           7           9
P              0.407 (1)   0.397 (2)   0.391 (3)   0.383 (5)   0.385 (4)
R              0.610 (5)   0.627 (1)   0.626 (2)   0.623 (4)   0.626 (3)
F              0.478 (2)   0.478 (1)   0.474 (3)   0.466 (5)   0.469 (4)
Overall rank   2.67        1.33        2.67        4.67        3.67


4.3 Computational cost

The proposed IRDVS algorithm provides an efficient framework for generating dynamic video summaries by integrating multimodal features. The iterative weight estimation process can be implemented very efficiently. Our current implementation in 64-bit MATLAB 7.11, tested on a workstation with an Intel Core 2 Duo 3.00 GHz CPU and 4 GB RAM, consumed only 5.2 seconds to complete the 15 iterations for producing the summaries of our 7.6-hour dataset.

The most time-consuming part of the IRDVS algorithm, however, is feature extraction, mainly because it involves motion estimation, face detection and obtaining term pair similarities from WordNet::Similarity.

Fig. 8 Curves of the average change of the feature weighting parameters against the iteration index k for six test videos (panels (a)-(f)). The summary lengths tested include 20 %, 30 % and 40 %


Fig. 9 Example frames, with the subtitles of the corresponding video segments, that are in the final video summaries but missing from the summaries at the initialization stage generated by equal-weight fusion: (a) ASTR, (b) CSMH, (c) BM-I, (d) ABC-I. The target summary length is 20 %


Several strategies can be leveraged to reduce the overhead introduced by these expensive steps. First, the motion vectors can alternatively be parsed from the compressed videos so that explicit motion estimation is avoided. Second, a medium-sized lookup table containing frequently used term pair similarities for the semantic concepts can be built, so that only the remaining "unknown" similarities have to be obtained from WordNet::Similarity. Third, the multi-core capability of modern CPUs can be exploited to run independent processes, e.g. the visual and textual parts, in parallel.

5 Conclusions and outlook

As a major endeavor towards the assisted consumption and manipulation of fast-growing digital video archives, a variety of video summarization approaches have emerged in the multimedia community. Distinguished from most of the existing multimodal and semantic video summarization work, this paper proposes an iteratively reweighting dynamic video summarization (IRDVS) algorithm which adaptively integrates two high-level visual features with the proposed semantic indicator of video segment (SEDOG) derived from the visual and textual information. The summarization process is performed by rating each video segment according to its saliency score, which is calculated as a linear combination of the three types of high-level features. The weighting parameter of each feature is efficiently updated via an iterative estimation process. On a test dataset consisting of documentaries, movies and TV news, the proposed IRDVS algorithm compared favorably to the STVS algorithm [59], AMVS algorithm [41], DSVS algorithm [12] and HIPVS algorithm [13] in terms of the precision, recall and F-measure.

Our future work may include long-term video summarization and recommendation with high-level semantics. Incorporating more modalities, such as the audio tracks, would be another future endeavor. Furthermore, we may also consider expanding our current dataset and providing the ground truths for the evaluation and comparison of dynamic video summarization methods.

Acknowledgments This work was supported in part by the Australian Research Council grants, in part by the China Scholarship Council under Grant 2011623084, in part by the National Natural Science Foundation of China (No. 61372149, No. 61370189, No. 61100212), in part by the Program for New Century Excellent Talents in University (No. NCET-11-0892), in part by the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20121103110017), in part by the Natural Science Foundation of Beijing (No. 4142009), in part by the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. CIT&TCD201304036, No. CIT&TCD201404043), and in part by the Science and Technology Development Program of Beijing Education Committee (No. KM201410005002). We appreciate the anonymous reviewers for their constructive comments. Copyrights of images, videos and subtitles used in this work are the property of their respective owners.

References

1. (2013) Here’s to eight great years. YouTube Blog. http://youtube-global.blogspot.com/2013/05/heres-to-eight-great-years.html.

2. Ahmad S (1991) VISIT: A neural model of covert visual attention. In: Advances in Neural Information Processing Systems (NIPS), vol 4. pp 420–427.


3. Alatan AA, Akansu A, Wolf W (2001) Multi-modal dialog scene detection using hidden Markov models for content-based multimedia indexing. Multimed Tools and Appl 14(2):137–151

4. Almeida J, Leite NJ, Torres RS (2012) VISON: video summarization for online applications. Pattern Recogn Lett 33(4):397–409

5. Almeida J, Leite NJ, Torres RS (2013) Online video summarization on compressed domain. J Vis Commun Image Represent 24(6):729–738

6. Bai L, Hu Y, Lao S, Smeaton AF, O’Connor NE (2010) Automatic summarization of rushes video using bipartite graphs. Multimed Tools and Appl 49(1):63–80

7. Borji A, Itti L (2013) State-of-the-art in visual attention modeling. IEEE Trans on Pattern Anal and Mach Intell 35(1):185–207

8. Chen BW, Bharanitharan K, Wang JC, Fu Z, Wang JF (2014) Novel mutual information analysis of attentive motion entropy algorithm for sports video summarization. In: Huang YM, Chao HC, Deng DJ, Park JJ (eds) Advanced Technologies, Embedded and Multimedia for Human-centric Computing, vol 260. Lecture Notes in Electrical Engineering. Springer, Netherlands, pp 1031–1042

9. Chen B-W, Wang J-C, Wang J-F (2009) A novel video summarization based on mining the story-structure and semantic relations among concept entities. IEEE Trans on Multimed 11(2):295–312

10. Chênes C, Chanel G, Soleymani M, Pun T (2013) Highlight detection in movie scenes through inter-users, physiological linkage. In: Ramzan N, Zwol R, Lee J-S, Clüver K, Hua X-S (eds) Social Media Retrieval. Computer Communications and Networks, Springer London, pp 217–237

11. Choudary C, Liu T (2007) Summarization of visual content in instructional videos. IEEE Trans on Multimed 9(7):1443–1455

12. Cong Y, Yuan J, Luo J (2012) Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Trans on Multimed 14(1):66–75

13. Dang CT, Radha H (2014) Heterogeneity image patch index and its application to consumer video summarization. IEEE Trans on Image Process 23(6):2704–2718

14. de Avila SEF, Lopes APB, da Luz JA, de Albuquerque AA (2011) VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn Lett 32(1):56–68

15. Dong P, Wang Z, Zhuo L, Feng DD (2010) Video summarization with visual and semantic features. In: Qiu G, Lam K-M, Kiya H, Xue X, Kuo CCJ, Lew MS (eds) Advances in Multimedia Information Processing - Pacific Rim Conference on Multimedia 2010, Part I. Lecture Notes in Computer Science, vol 6297. Springer, Berlin, pp 203–214

16. Ejaz N, Mehmood I, Wook Baik S (2013) Efficient visual attention based framework for extracting key frames from videos. Signal Process Image Commun 28(1):34–44

17. Ejaz N, Tariq TB, Baik SW (2012) Adaptive key frame extraction for video summarization using an aggregation mechanism. J Vis Commun Image Represent 23(7):1031–1040

18. Ekin A, Tekalp AM, Mehrotra R (2003) Automatic soccer video analysis and summarization. IEEE Trans Image Process 12(7):796–807

19. Evangelopoulos G, Rapantzikos K, Potamianos A, Maragos P, Zlatintsi A, Avrithis Y (2008) Movie summarization based on audiovisual saliency detection. In: Proceedings of the 15th IEEE International Conference on Image Processing (ICIP), 12–15 Oct. 2008. pp 2528–2531.

20. Evangelopoulos G, Zlatintsi A, Potamianos A, Maragos P, Rapantzikos K, Skoumas G, Avrithis Y (2013) Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Trans on Multimed 15(7):1553–1568

21. Evangelopoulos G, Zlatintsi A, Skoumas G, Rapantzikos K, Potamianos A, Maragos P, Avrithis Y (2009) Video event detection and summarization using audio, visual and text saliency. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp 3553–3556.

22. Fersini E, Sartori F (2012) Semantic storyboard of judicial debates: a novel multimedia summarization environment. Program: Elec Libr Inf Syst 46(2):119–219

23. Garestier F, Le Toan T (2010) Estimation of the backscatter vertical profile of a pine forest using single baseline P-band (Pol-)InSAR data. IEEE Trans Geosci Remote Sens 48(9):3340–3348

24. Hauptmann A, Yan R, Lin W-H, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans on Multimed 9(5):958–966

25. Hauptmann A, Yan R, Lin W-H (2007) How many high-level concepts will fill the semantic gap in news video retrieval? In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands. ACM, pp 627–634.

26. Hung M-H, Hsieh C-H (2008) Event detection of broadcast baseball videos. IEEE Trans on Circ and Syst for Video Technol 18(12):1713–1726

27. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans on Pattern Anal and Mach Intell 20(11):1254–1259


28. James W (1890) The Principles of psychology. Harvard University Press.

29. Jiang Y-G, Bhattacharya S, Chang S-F, Shah M (2013) High-level event recognition in unconstrained videos. Int J Multimed Inf Retrieval 2(2):73–101

30. Jiang Y-G, Ngo C-W, Yang J (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR), Amsterdam, The Netherlands. ACM, pp 494–501.

31. Jiang YG, Yang J, Ngo CW, Hauptmann AG (2010) Representations of keypoint-based semantic concept detection: a comprehensive study. IEEE Trans on Multimed 12(1):42–53

32. Kennedy L, Hauptmann A (2006) LSCOM lexicon definitions and annotations (version 1.0). DTO Challenge workshop on large scale concept ontology for multimedia. Columbia University ADVENT technical report.

33. Kim J-N, Choi T-S (2000) A fast full-search motion-estimation algorithm using representative pixels and adaptive matching scan. IEEE Trans on Circ and Syst for Video Technol 10(7):1040–1048

34. Kleban J, Sarkar A, Moxley E, Mangiat S, Joshi S, Kuo T, Manjunath BS (2007) Feature fusion and redundancy pruning for rush video summarization. In: Proceedings of the international workshop on TRECVID video summarization (TVS), Augsburg, Bavaria, Germany. ACM, pp 84–88.

35. Knudsen EI (2007) Fundamental components of attention. Annu Rev Neurosci 30:57–78

36. Koral KF, Yendiki A, Lin Q, Dewaraja YK, Fessler JA (2004) Determining total I-131 activity within a VoI using SPECT, a UHE collimator, OSEM, and a constant conversion factor. IEEE Trans Nucl Sci 51(3):611–618

37. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). pp 2169–2178.

38. Lin L, Chen C, Shyu M-L, Chen S-C (2011) Weighted subspace filtering and ranking algorithms for video concept retrieval. IEEE Multimed 18(3):32–43

39. Loui A, Luo J, Chang S-F, Ellis D, Jiang W, Kennedy L, Lee K, Yanagawa A (2007) Kodak’s consumer video benchmark data set: concept definition and annotation. In: Proceedings of the 9th ACM SIGMM international workshop on Multimedia Information Retrieval (MIR), Augsburg, Bavaria, Germany. ACM, pp 245–254.

40. Luo JB, Papin C, Costello K (2009) Towards extracting semantically meaningful key frames from personal video clips: from humans to computers. IEEE Trans on Circ and Syst for Video Technol 19(2):289–301

41. Ma Y-F, Hua X-S, Lu L, Zhang H-J (2005) A generic framework of user attention model and its application in video summarization. IEEE Trans on Multimed 7(5):907–919

42. Ma Y-F, Lu L, Zhang H-J, Li M (2002) A user attention model for video summarization. In: Proceedings of the Tenth ACM International Conference on Multimedia, Juan-les-Pins, France. ACM, pp 533–542.

43. Matos N, Pereira F (2008) Automatic creation and evaluation of MPEG-7 compliant summary descriptions for generic audiovisual content. Signal Process Image Commun 23(8):581–598

44. Money AG, Agius H (2008) Video summarisation: a conceptual framework and survey of the state of the art. J Vis Commun Image Represent 19(2):121–143

45. Money AG, Agius H (2010) ELVIS: Entertainment-led video summaries. ACM Trans Multimed Comput Commun Appl 6(3):1–30

46. Mylonas P, Spyrou E, Avrithis Y, Kollias S (2009) Using visual context and region semantics for high-level concept detection. IEEE Trans on Multimed 11(2):229–243

47. Ngo C-W, Ma Y-F, Zhang H-J (2005) Video summarization and scene detection by graph modeling. IEEE Trans on Circ and Syst for Video Technol 15(2):296–305

48. Over P, Smeaton AF, Awad G (2008) The TRECVID 2008 BBC rushes summarization evaluation. In: Proceedings of the 2nd ACM TRECVID video summarization workshop, Vancouver, British Columbia, Canada. ACM, pp 1–20.

49. Over P, Smeaton AF, Kelly P (2007) The TRECVID 2007 BBC rushes summarization evaluation pilot. In: Proceedings of the international workshop on TRECVID video summarization, Augsburg, Bavaria, Germany. ACM, pp 1–15.

50. Pal R, Ghosh A, Pal SK (2012) Video summarization and significance of content: a review. In: Handbook on soft computing for video surveillance. Chapman & Hall/CRC cryptography and network security series. Chapman and Hall/CRC, pp 79–102.

51. Pedersen T, Patwardhan S, Michelizzi J (2004) WordNet::Similarity - Measuring the relatedness of concepts. In: Proceedings of the nineteenth national conference on artificial intelligence (AAAI). pp 1024–1025.

52. Peng W-T, Chu W-T, Chang C-H, Chou C-N, Huang W-J, Chang W-Y, Hung Y-P (2011) Editing by viewing: automatic home video summarization by viewing behavior analysis. IEEE Trans on Multimed 13(3):539–550


53. Posner MI, Petersen SE (1990) The attention system of the human brain. Annu Rev Neurosci 13:25–42

54. Pritch Y, Rav-Acha A, Peleg S (2008) Nonchronological video synopsis and indexing. IEEE Trans on Pattern Anal and Mach Intell 30(11):1971–1984

55. Rapantzikos K, Avrithis Y, Kollias S (2011) Spatiotemporal features for action recognition and salient event detection. Cogn Comput 3(1):167–184

56. Ren J, Jiang J (2009) Hierarchical modeling and adaptive clustering for real-time summarization of rush videos. IEEE Trans on Multimed 11(5):906–917

57. Tamrakar A, Ali S, Yu Q, Liu J, Javed O, Divakaran A, Cheng H, Sawhney H (2012) Evaluation of low-level features and their combinations for complex event detection in open source videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 16–21 June 2012. pp 3681–3688.

58. Tang S, Zheng Y-T, Wang Y, Chua TS (2012) Sparse ensemble learning for concept detection. IEEE Trans on Multimed 14(1):43–54

59. Taskiran CM, Pizlo Z, Amir A, Ponceleon D, Delp EJ (2006) Automated video program summarization using speech transcripts. IEEE Trans on Multimed 8(4):775–791

60. Tavassolipour M, Karimian M, Kasaei S (2014) Event detection and summarization in soccer videos using Bayesian network and copula. IEEE Trans on Circ and Syst for Video Technol 24(2):291–304

61. Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimed Comput Commun Appl 3(1):1–37

62. Viola PA, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154

63. Wang M, Hong R, Li G, Zha Z-J, Yan S, Chua T-S (2012) Event driven web video summarization by tag localization and key-shot identification. IEEE Trans on Multimed 14(4):975–985

64. Wang F, Ngo C-W (2012) Summarizing rushes videos by motion, object, and event understanding. IEEE Trans on Multimed 14(1):76–87

65. Wang S, Zhu Y, Wu G, Ji Q (2013) Hybrid video emotional tagging using users’ EEG and video content. Multimed Tools and Appl doi:10.1007/s11042-013-1450-8

66. Wei X-Y, Jiang Y-G, Ngo C-W (2011) Concept-driven multi-modality fusion for video search. IEEE Trans on Circ and Syst for Video Technol 21(1):62–73

67. Wu J, Rehg JM (2011) CENTRIST: a visual descriptor for scene categorization. IEEE Trans on Pattern Anal and Mach Intell 33(8):1489–1501

68. Xu G, Ma Y-F, Zhang H-J, Yang S-Q (2005) An HMM-based framework for video semantic analysis. IEEE Trans on Circ and Syst for Video Technol 15(11):1422–1433

69. Yuan Z, Lu T, Wu D, Huang Y, Yu H (2011) Video summarization with semantic concept preservation. In: Proceedings of the 10th International Conference on Mobile and Ubiquitous Multimedia (ACM MUM), Beijing, China. ACM, 2107609, pp 109–112.

70. Zhu S, Ngo C-W, Jiang Y-G (2012) Sampling and ontologically pooling web images for visual concept learning. IEEE Trans on Multimed 14(4):1068–1078

Pei Dong received the bachelor’s degree in electronic information engineering and master’s degree in signal and information processing from Beijing University of Technology, China, in 2005 and 2008, respectively. He is currently pursuing the Ph.D. degree in School of Information Technologies, The University of Sydney, Australia. His current research interests include video and image processing, pattern recognition, machine learning, and computer vision.


Yong Xia received the B.E., M.E., and Ph.D. degrees in computer science and technology from Northwestern Polytechnical University, Xi’an, China, in 2001, 2004, and 2007, respectively. He was a Postdoctoral Research Fellow in the Biomedical and Multimedia Information Technology Research Group, School of Information Technologies, University of Sydney, Sydney, Australia. He is currently a full professor in School of Computer Science, Northwestern Polytechnical University, and also an Associate Medical Physics Specialist in the Department of PET and Nuclear Medicine, Royal Prince Alfred Hospital, Sydney. His research interests include medical imaging, image processing, computer-aided diagnosis, pattern recognition, and machine learning.

Shanshan Wang received her bachelor’s degree in biomedical engineering from Central South University, China, in 2009. She is now pursuing a double Ph.D. degree as a cotutelle student at Shanghai Jiao Tong University, China, in biomedical engineering and at The University of Sydney, Australia, in computer science. Her research interests include inverse problems in medical imaging and image processing, such as MR/PET image reconstruction, image denoising and dictionary learning.


Li Zhuo graduated from the University of Electronic Science and Technology, Chengdu, in 1992, received the master’s degree in signal and information processing from Southeast University in 1998, and the Ph.D. degree in pattern recognition and intelligent systems from Beijing University of Technology in 2004. She has been a full professor since 2007 and a supervisor of Ph.D. students since 2009. She has published over 100 research papers and authored three books. Her research interests include image/video coding and transmission, multimedia content analysis, and wireless video sensor networks.

David Dagan Feng received the M.E. degree in electrical engineering & computer science from Shanghai Jiao Tong University, Shanghai, China, in 1982, the M.Sc. degree in biocybernetics and the Ph.D. degree in computer science from the University of California, Los Angeles, CA, USA, in 1985 and 1988, respectively, where he received the Crump Prize for Excellence in Medical Engineering. He is currently the Head of School of Information Technologies and the Director of the Institute of Biomedical Engineering and Technology, University of Sydney, Sydney, Australia, a Guest Professor of a number of universities, and a Chair Professor of Hong Kong Polytechnic University, Hong Kong. He is a fellow of IEEE, ACS, HKIE, IET, and the Australian Academy of Technological Sciences and Engineering.
