
University of Central Florida at TRECVID 2004

Yun Zhai, Xiaochun Cao, Yunjun Zhang, Omar Javed, Alper Yilmaz, Fahd Rafi, Saad Ali, Orkun Alatas, Saad Khan, and Mubarak Shah

Computer Vision Laboratory, University of Central Florida

Orlando, Florida, U.S.

1 Introduction

This year, the Computer Vision Group at the University of Central Florida participated in two tasks in TRECVID 2004: High-Level Feature Extraction and Story Segmentation. For the feature extraction task, we developed detection methods for "Madeleine Albright", "Bill Clinton", "Beach", "Basketball Scored" and "People Walking/Running". We used the AdaBoost technique and employed the speech recognition output in addition to visual cues. For story segmentation, we used a two-phase approach. The video is first coarsely segmented by finding the anchor persons. This coarse segmentation is then refined in the second phase by detecting weather and sports stories and by merging semantically related stories, as determined by visual and text similarities.

2 Feature Extraction

We submitted five features: Madeleine Albright, Bill Clinton, Beach, Basketball Scored, and People Walking/Running. Section 2.1 describes the method for detecting person X (Albright and Clinton); Section 2.2 describes the beach detector; Section 2.3 presents the detection of basketball scoring; finally, Section 2.4 presents the method for classifying people walking/running shots.

2.1 Finding Person X in Broadcast News

Our approach to finding a specific person X combines text cues from the given transcripts, face detection on key frames, and face recognition. Figure 2 gives an overview of our algorithm.

At the core of the text reinforcement algorithm is the "wordtime" data provided by the CMU Informedia team. These wordtime files contain the time, in milliseconds from the start of the broadcast, at which each particular word is spoken. The speech information in news broadcast videos is highly correlated with the visual data, which makes the wordtime data a powerful resource for classifying video and detecting features. In the case of detecting shots containing footage of Bill Clinton and/or Madeleine Albright, we observed from the training data that the occurrence of the words "Bill Clinton" and "Madeleine Albright" is highly correlated with their appearance in the video. This is a natural consequence of either the newscaster or the reporter introducing these two personalities. We therefore scanned the wordtime data for the relevant words and extracted the times, denoted as hit times, at which they are spoken. The hit times are used to label the shots spanning those durations as candidate shots. These candidate shots are then further tested and refined by a Haar face detector, the details of which are described later.

The words scanned as hints for Bill Clinton were "Clinton" and "President". The results showed that there is no need to scan for the word "Bill", since the former president is almost never introduced as simply "Bill"; in fact, scanning for "Bill" increases the number of false positives. At times Bill Clinton is referred to only as the "President", so we found it necessary to search for the word "President" as well. In the case of Madeleine Albright, we scanned the wordtime data for the phrase "Madeleine Albright".

After extracting all the text cues from the given transcripts, we filter the key frames based on the temporal location of the cues, keeping only the frames that are within a preset threshold of a cue. According to our experimental results, the text cues give more reliable output than visual cues. Therefore, by filtering the key frames based on text cues, we save a large amount of computation time that would otherwise be spent on face detection in key frames that contain no person X.
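To make this filtering concrete, here is a minimal sketch of selecting candidate key frames from text-cue hit times. The wordtime file layout (one "word time_ms" pair per line), the hint-word set, and the window size are illustrative assumptions, not the exact settings used in our runs.

```python
# Sketch of text-cue filtering against wordtime data (assumed format: "<word> <time_ms>").
TARGET_WORDS = {"clinton", "president"}   # hint words for Bill Clinton (example)
WINDOW_MS = 10_000                        # assumed preset threshold around each hit time

def load_hit_times(wordtime_path):
    """Return the times (ms) at which any target word is spoken."""
    hits = []
    with open(wordtime_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[0].lower().strip('.,"') in TARGET_WORDS:
                hits.append(int(parts[1]))
    return hits

def candidate_key_frames(key_frames, hit_times, window_ms=WINDOW_MS):
    """Keep key frames whose timestamp lies within window_ms of some hit time.
    key_frames: list of (shot_id, time_ms) pairs."""
    return [(sid, t) for sid, t in key_frames
            if any(abs(t - h) <= window_ms for h in hit_times)]
```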

We assume that if a specific shot contains person X, there should exist a frontal view of person X in the corresponding key frame. Based on this assumption, face regions were located in the shots' key frames. The face detector we used is a modified version of the Haar-like feature face detector in OpenCV [1]; it outputs only one face region.
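Below is a minimal sketch of running a Haar face detector on a key frame with the current OpenCV Python bindings. Our detector was a modified version of the OpenCV code, so the cascade file, parameters, and the largest-face selection rule shown here are assumptions for illustration.

```python
# Sketch of Haar-like feature face detection on a key frame (OpenCV).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_single_face(key_frame_bgr):
    """Return one face region (x, y, w, h), or None if no face is found."""
    gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep a single region (the largest), mimicking the single-output behaviour above.
    return max(faces, key=lambda f: f[2] * f[3])
```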


Figure 1: Evaluation results for Madeleine Albright. Total shots returned: 19; total shots with the feature: 19; total shots with the feature returned: 4; precision at shots with the feature: 0.2105; average precision: 0.0899. Precision at n shots: 5: 0.4000, 10: 0.3000, 15: 0.2667, 20: 0.2000, 30: 0.1333, 100: 0.0400, 200: 0.0200, 500: 0.0090, 1000: 0.0040, 2000: 0.0020. Values of interpolated precision and recall are plotted in the graph instead of being listed.

Given the results of the face detection, we need to recognize the specific person's face. Therefore, we build two limited face recognizers, one for each specific person, based on faces extracted from the training data set. We collect k sample faces {F_1, F_2, ..., F_k} for person X and n − k non-X faces {F_{k+1}, F_{k+2}, ..., F_n} from the training data set. An eigenspace is built for the faces {F_1, F_2, ..., F_n} and the Eigenfaces {eigF_1, eigF_2, ..., eigF_n} are obtained. Next, a Support Vector Machine (SVM) classifier is trained on the Eigenface data and is used to recognize person X among the extracted faces in the testing data.
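The recognizer can be sketched with off-the-shelf components as follows; the number of eigen-components, the SVM kernel, and the image layout are illustrative assumptions rather than the settings used in our experiments.

```python
# Sketch of the Eigenface + SVM recognizer for person X (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def train_person_x_recognizer(face_images, labels, n_components=50):
    """face_images: (n, h, w) grayscale face crops; labels: 1 for person X, 0 otherwise."""
    X = np.asarray(face_images, dtype=np.float64).reshape(len(face_images), -1)
    pca = PCA(n_components=n_components, whiten=True).fit(X)  # eigenspace / Eigenfaces
    svm = SVC(kernel="rbf").fit(pca.transform(X), labels)     # SVM on Eigenface coefficients
    return pca, svm

def is_person_x(pca, svm, face_image):
    """Classify a detected face crop (same size as the training crops)."""
    coeffs = pca.transform(np.asarray(face_image, dtype=np.float64).reshape(1, -1))
    return bool(svm.predict(coeffs)[0])
```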

The recognition rate is low for the following reasons. First, the face detector misses many faces, especially small faces, faces in side view, and faces in crowds. Second, the face detector sometimes detects only part of a face, which causes problems in recognition; it might help to use skin information to extract the complete face from the face detection results. Finally, we use only key frames instead of multiple frames to extract faces, which is less robust since more frames provide more information. The evaluation results for the feature "Madeleine Albright" are shown in Figure 1, and those for "Bill Clinton" in Figure 3.

Figure 2: The three-step approach to recognizing person X.

Figure 3: Evaluation results for Bill Clinton. Total shots returned: 2000; total shots with the feature: 409; total shots with the feature returned: 103; precision at shots with the feature: 0.1516; average precision: 0.0536. Precision at n shots: 5: 0.0000, 10: 0.0000, 15: 0.2000, 20: 0.2500, 30: 0.4000, 100: 0.3500, 200: 0.2400, 500: 0.1320, 1000: 0.0780, 2000: 0.0515. (Interpolated precision/recall is plotted in the graph.)

2.2. Beach

The Beach feature was extracted from key frames using color information and its spatial context. Low-level features include color correlograms and color histograms in the RGB and HSV color spaces, computed on entire images. In addition, the images were divided into regions, and regional color histograms were also computed. The regional histograms capture the fact that a beach scene has a very high probability of showing sky in the upper portion of the image and sand or water in the lower portion. The above features were selected because they were the most discriminative among all the features tested for beach detection; the selection was made by analysis with AdaBoost classifiers.

The AdaBoost [2] framework was used to train the classifier. The weak classifiers were simple thresholds on the value of each dimension of the feature vector.
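A sketch of such a boosted classifier with per-dimension threshold (decision stump) weak learners, using scikit-learn for illustration; the number of boosting rounds and the feature layout are assumptions.

```python
# Sketch of AdaBoost over decision-stump weak classifiers (one threshold per dimension).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_beach_classifier(feature_vectors, labels, n_rounds=200):
    """feature_vectors: (n_samples, n_dims) correlogram/histogram features;
    labels: 1 for beach key frames, 0 otherwise."""
    stump = DecisionTreeClassifier(max_depth=1)  # a single threshold on one feature dimension
    return AdaBoostClassifier(stump, n_estimators=n_rounds).fit(feature_vectors, labels)

# Usage (hypothetical):
# clf = train_beach_classifier(train_X, train_y)
# confidences = clf.decision_function(test_X)   # used later to rank candidate shots
```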

A beach can appear in an image in different settings, which affects the values of the spatial color features. We therefore selected a subset of training examples from the TRECVID 2003 data set that conformed to a general beach setting (sand and water occupying a substantial area). These hand-picked examples were augmented with a data set collected from the internet; this was necessary due to the difference between the definition of the feature and the annotated data provided. The TRECVID 2003 training images included some obviously "difficult" beach images, which were excluded to avoid overfitting of our trained classifier. A very important factor in selecting positive samples was that the training and testing data had colors quite different from what one would expect in images of beaches from high-quality sources; e.g., the sky was closer to grey than blue in all videos.

The images labelled as beaches by the boosted classifier were then tested for motion. Images with high motion content were removed, as one would not expect a beach scene to have high motion.

Figure 4: Evaluation results for Beach. Total shots returned: 1183; total shots with the feature: 374; total shots with the feature returned: 130; precision at shots with the feature: 0.1711; average precision: 0.0635. Precision at n shots: 5: 0.4000, 10: 0.2000, 15: 0.2000, 20: 0.2500, 30: 0.2667, 100: 0.2300, 200: 0.2000, 500: 0.1560, 1000: 0.1210, 2000: 0.0650. (Interpolated precision/recall is plotted in the graph.)

Finally, images were ordered using heuristics on the relative RGB channel strengths in the image. Images with high blue content were given a higher rank, images with high green content were given a lower rank, and images with high red content fell in between. These crude heuristics were based on observations from the TRECVID 2003 data. Within these ranks, the images were sorted according to the confidence provided by the boosted classifier for each image. The evaluation results for the feature "Beach" are shown in Figure 4.
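A sketch of this ranking heuristic; deciding channel dominance from mean channel values is an illustrative assumption.

```python
# Sketch of the RGB-dominance ranking of beach candidates.
import numpy as np

def dominant_channel_rank(image_rgb):
    """Coarse rank group: 0 = blue-dominant (best), 1 = red-dominant, 2 = green-dominant."""
    r, g, b = (image_rgb[..., c].mean() for c in range(3))
    if b >= r and b >= g:
        return 0
    return 1 if r >= g else 2

def order_beach_candidates(candidates):
    """candidates: list of (image_rgb, classifier_confidence) pairs.
    Sort by rank group first, then by boosted-classifier confidence within each group."""
    return sorted(candidates, key=lambda c: (dominant_channel_rank(c[0]), -c[1]))
```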

2.3. Basketball

We adopted a two-step approach to detect the basketball going through the hoop. The first step detects basketball game shots, while the second step determines which of the basketball shots detected in the first step contain the ball going through the hoop. The features used for detecting basketball games were similar to those used for beach detection: color correlograms and color histograms, computed on whole images. The intuition for these features is that the rich color information present in any basketball game setting can be exploited effectively for detection.

An AdaBoost training framework similar to that used for the beach feature was employed, with threshold-based weak classifiers for each dimension of the feature vector. The classifier was trained on the TRECVID 2003 development data set, with a large number of additional positive examples collected from internet resources. This enabled the final classifier to perform better on a more general set of basketball games.

The shots returned by the first step were refined by our textual analyzer. At the core of our text reinforcement is the CMU wordtime data. We observed that when the ball passes through the hoop, it is very likely that the newscaster will report the match score. These scores appear in the wordtime data along with the times at which they are spoken. We searched the wordtime files for basketball scores and extracted these times, called hit times. The hit times are used to label the shots spanning those durations as candidate shots. If a shot returned by the boosted classifier falls within these candidate shots, we label it as a shot with a basket in it. The evaluation results for the feature "Basketball Scored" are shown in Figure 5.

Figure 5: Evaluation results for Basketball Scored. Total shots returned: 94; total shots with the feature: 103; total shots with the feature returned: 37; precision at shots with the feature: 0.3592; average precision: 0.2632. Precision at n shots: 5: 1.0000, 10: 0.8000, 15: 0.7333, 20: 0.8000, 30: 0.7667, 100: 0.3700, 200: 0.1850, 500: 0.0740, 1000: 0.0270, 2000: 0.0185. (Interpolated precision/recall is plotted in the graph.)

2.4. People Walking/Running

The TRECVID data set contained a large variety of scenes with multiple walking or running people. Persons were observed in a wide range of poses and scales in such scenes, and in many cases there was significant person-to-person occlusion. There was also significant variation in the camera views. However, these scenes also had some common attributes. One commonality in many of the walking/running scenes was the presence of camera motion, since the camera usually followed the walking people to keep them centered in the image. Furthermore, in such scenes the faces of the people were usually at least partially visible. In addition, many sports shots, especially those of basketball and football, contained multiple walking or running people.

In view of the above observations, we used several cues to detect the walking/running feature: face detection, skin detection, ego-motion estimation, and sports-shot detection.

• Face Detection: Face detection was performed using the AdaBoost algorithm. The training set contained both frontal and side-view poses as positive face examples; the negative examples consisted of images of a variety of potential backgrounds. Haar-like features obtained from these examples were used to train the boosted classifier.

• Skin Detection: Skin was detected using a naive Bayes classifier. Normalized RGB color values and the variance of R, G, B in a 3 × 3 neighborhood were used as features for skin detection. The distributions of these features for both skin and non-skin examples were approximated using histograms. In the test phase, a per-pixel (skin or non-skin) decision was made based on the likelihood ratio of these distributions; a sketch of this test is given after this list. Some skin detection results are shown in Figure 7.

• Global Motion Estimation: The global motion was estimated by fitting affine parameters to block motion vectors. Large translation parameters indicate a pan, tilt, or translational movement of the camera.

• Sports Shot Detection: Shots containing basketball and football play were detected using a combination of color and audio cues. For more detail on this method, see the basketball feature in Section 2.3.
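As referenced in the skin detection cue above, the per-pixel likelihood-ratio test can be sketched as follows; the histogram binning, feature layout, and the decision threshold of 1.0 are illustrative assumptions.

```python
# Sketch of the per-pixel naive Bayes skin test (likelihood ratio of feature histograms).
import numpy as np

def pixel_features(patch_rgb):
    """patch_rgb: 3x3 RGB neighborhood. Returns normalized r, g of the center pixel
    plus the per-channel variance over the neighborhood."""
    norm = patch_rgb / (patch_rgb.sum(axis=2, keepdims=True) + 1e-6)
    center = norm[1, 1, :2]
    var = patch_rgb.reshape(-1, 3).var(axis=0)
    return np.concatenate([center, var])

def is_skin(features, skin_hists, nonskin_hists, bin_edges):
    """Naive Bayes: multiply per-feature likelihood ratios from the learned histograms."""
    ratio = 1.0
    for f, hs, hn, edges in zip(features, skin_hists, nonskin_hists, bin_edges):
        b = int(np.clip(np.digitize(f, edges) - 1, 0, len(hs) - 1))
        ratio *= (hs[b] + 1e-6) / (hn[b] + 1e-6)
    return ratio > 1.0
```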

A rule-based system employing the above-mentioned cues was used to make the final walking/running detection decision; a sketch of the rule is given after the short list below. A shot was labelled as containing multiple walking or running people if

• multiple faces or significant skin regions were detected in the key-frame along with camera translation, pan, or tilt motion, or

• a sports (football, basketball) shot was detected.
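A minimal sketch of this rule-based decision, assuming the four cues have already been reduced to per-shot counts and booleans; the skin-area threshold is an illustrative value.

```python
# Sketch of the rule-based fusion of the walking/running cues.
def is_walking_running_shot(num_faces, skin_area_fraction,
                            camera_pan_tilt_or_translation, is_sports_shot,
                            skin_area_thresh=0.05):
    # Rule 1: multiple faces or a significant skin region, together with camera motion.
    people_evidence = num_faces >= 2 or skin_area_fraction >= skin_area_thresh
    if people_evidence and camera_pan_tilt_or_translation:
        return True
    # Rule 2: basketball/football shots are assumed to contain walking or running people.
    return is_sports_shot
```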

System evaluation results are shown in Figure 6.

Figure 6: Evaluation results for People Walking/Running. Total shots returned: 2000; total shots with the feature: 1695; total shots with the feature returned: 273; precision at shots with the feature: 0.1434; average precision: 0.0752. Precision at n shots: 5: 0.8000, 10: 0.9000, 15: 0.9333, 20: 0.9500, 30: 0.9667, 100: 0.8100, 200: 0.5100, 500: 0.2940, 1000: 0.1810, 2000: 0.1365. (Interpolated precision/recall is plotted in the graph.)

Figure 7: First row: some face detection results. Second row: skin detection results on the images shown in the first row.

3 Story Segmentation

In news videos we often observe the following pattern: first, the anchor person appears and introduces a news story; then the camera switches to outside the studio, for example, the scene of an airplane crash site, with or without a reporter. After traversing the key sites, the camera switches back to the studio, and the anchor person starts another news story. This can be summarized in the form: [anchor]→[shots of story 1]→[anchor]→[shots of story 2]→[...]. This pattern can be represented by the Shot Connectivity Graph (SCG) (Figure 8). In this graph, the nodes represent the shots in the video, with similar shots represented by a single node, and the edges connecting the nodes are the transitions between shots, as shown in Figure 8. The stories in the video correspond to the large cycles in the SCG that are connected at the node representing the anchor. Our goal of segmenting the stories in the video is therefore the same as finding these cycles in the SCG. We have developed an efficient and robust framework to segment news programs into story topics. The framework contains two phases: (1) an initial segmentation based on the detection of the anchor person, including both the main anchor and the sub-anchor(s), and (2) a refinement based on further detection of weather and sports stories and on merging semantically related adjacent stories.

In the first phase, we detect the cycles that are connected at the "anchor" node, so story segmentation is transformed into detecting the "anchor" shots in the video. The properties of extended facial regions in the key-frames of the shots are analyzed to cluster similar shots into corresponding groups. Note that besides the large cycles, there are also smaller cycles embedded in the bigger ones. These are explained by the appearance of reporters, interviewers, or sub-anchors for a specific news story, e.g., finance news. We also consider the last case as a portion of the news story segments. The detection method for the sub-anchor(s) is the same as that for the main anchor.

Figure 8: Shot Connectivity Graph. The node with the blue bounding box represents the anchor person shots in the news video. In this simple example, the video consists of two news stories and one commercial.

In the second phase, the initial segmentation obtained from the anchor detection is refined by further detecting news stories with special formats and by merging semantically related stories. Some news stories with special formats involve no anchor; these stories are "hidden" inside the large cycles of the SCG, and other techniques are used to "discover" them within the initial story segments. We incorporated two kinds of special stories into our system: weather news and sports news. The color pattern of the shots is examined to filter out candidate weather shots, and these candidates are then verified by their motion content. The largest continuous segment of the remaining weather shots forms the weather story. For the detection of the sports story, we used the text correlation of the shots to sporting words. Similar to the weather story segmentation, nearby sports shots are grouped into the sports story.

It is possible that the initial segments from the first phase are not semantically independent. For example, for a particular story the anchor may appear more than once, which causes multiple cycles in the SCG. In this situation, merging of the semantically related stories is needed. Two adjacent stories are merged if they present similar patterns in visual appearance, in word narration, or in both. The visual similarity is computed as the color similarity between the non-anchor shots of the adjacent stories. The narrative similarity is defined as the Normalized Text Similarity (NTS) based on the automatic speech recognition (ASR) output of the videos. The visual and text similarities are then combined to represent the overall similarity between the stories.

3.1. Phase I - Anchor Detection

We construct the SCG by representing shots containing the same person by a single node. There are two common approaches for clustering similar shots: (1) similarity measures based on global features, for example, the color histograms of the key frames; and (2) similarities based on the correlation of the human faces. The problem with the first approach is that if the settings of the studio change, the global features of the anchor shots may be less similar. With the second approach, the face correlation is sensitive to face pose, lighting conditions, etc., and therefore tends to create several clusters for the same person. To overcome these problems, we use the "body", an extended region of the face. Within one news video the anchor wears the same dress all the time, and we take this fact as the cue. For each shot in the video, we take the middle frame as the key frame of that shot, detect the face by [10], and find the body region by extending the face region to cover the upper body of the person. The similarity of two shots s_i and s_j is defined as the histogram intersection of the body patches f_i and f_j:

HI(f_i, f_j) = Σ_{b ∈ all bins} min(H_i(b), H_j(b)),   (1)

where H_i(b) and H_j(b) are the b-th bins of the histograms of the "body" patches f_i and f_j, respectively. Some example "body" patches are shown in Figure 9. Non-facial shots are considered to have zero similarity to all other shots. The shots are then clustered into groups using iso-data; each group corresponds to a particular person. If the key-frame of a shot contains multiple "bodies", the shot is assigned to the existing largest group with high similarity. Eventually, the shots that contain the main anchor form the largest cluster in the video. Once the anchor shots are detected, the video is segmented into initial stories by taking every anchor shot as the starting point of a news story.

Figure 9: The top row (a) shows sample key-frames; row (b) shows the corresponding body regions obtained from face detection. Global feature comparison fails to cluster the anchor shots together if applied to the top row.
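A sketch of the body-patch histogram intersection of Eq. (1); the bin count and the RGB quantization are illustrative assumptions.

```python
# Sketch of body-patch histograms and their intersection (Eq. 1).
import numpy as np

def body_histogram(body_patch_rgb, bins=8):
    """3-D RGB histogram of the 'body' patch, normalized to sum to 1."""
    hist, _ = np.histogramdd(body_patch_rgb.reshape(-1, 3),
                             bins=(bins, bins, bins), range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1)

def histogram_intersection(h_i, h_j):
    """HI(f_i, f_j) = sum over all bins of min(H_i(b), H_j(b))."""
    return float(np.minimum(h_i, h_j).sum())
```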

Usually, in news sections of special interest, the main anchor hands over to an expert or sub-anchor; such a phenomenon can be found, for example, in finance news. The sub-anchor also appears multiple times with different story focuses, and reappearing sub-anchors result in small cycles in the SCG. Note that many stories do not have a sub-anchor, and some stories also present small cycles for other reasons: reporters or interviewers. However, a sub-anchor usually appears more often than other miscellaneous persons, so the true sub-anchor can be identified by examining the size of the largest group. Only groups with a sufficient number of facial shots are declared sub-anchor shots. The detections of the main anchor and the sub-anchors provide the initial story segmentation.

3.2. Phase II - Refinement

3.2.1 Weather Detection

In the news story segmentation, segments related to the weather forecast are considered separate stories from the general stories. To classify a weather shot, we use both the color and the motion information of the video. Weather shots exhibit certain color patterns in the visual signal, such as greenish and bluish tones; some example key-frames are shown in Figure 10. Furthermore, to ensure that the audience can absorb the important weather information, the motion content of the shot should be low.

Figure 10: Row (a): example key-frames of weather shots. Row (b): key-frames of non-weather shots.

From the training data set we have the key-frames of the weather forecast shots. For a key-frame k_m, a color histogram H(k_m) over the RGB channels is computed. The histograms of all key-frames are then clustered into distinct groups based on the mean and the variance of the RGB channels. These groups form the color model T = {t_1, ..., t_n} for weather shot detection, where t_i is the average histogram of model group i. To test whether a shot s is a weather shot, we compute the histogram H(s) of its key-frame and compare it with each t_i in the color model. If the distance between H(s) and one of the histograms in the color model is within tolerance, the shot s is classified as a weather shot.
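A sketch of the color-model construction and test described above. The clustering shown (k-means over the histograms) is a stand-in for the mean/variance grouping used in our system, and the number of groups, the L1 distance, and the tolerance are illustrative assumptions.

```python
# Sketch of building the weather color model T and testing a shot against it.
import numpy as np
from sklearn.cluster import KMeans

def build_color_model(training_histograms, n_groups=5):
    """Cluster training key-frame histograms and return the group means T = {t_1, ..., t_n}."""
    km = KMeans(n_clusters=n_groups, n_init=10).fit(np.asarray(training_histograms))
    return km.cluster_centers_

def is_weather_shot(key_frame_hist, color_model, tolerance=0.35):
    """Accept the shot if its histogram is close enough to some model histogram t_i."""
    distances = np.abs(color_model - key_frame_hist).sum(axis=1)  # L1 distance to each t_i
    return bool(distances.min() <= tolerance)
```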

The motion content is analyzed to verify the initial detections. To verify whether a candidate shot s is a true weather shot, we perform the following steps (a sketch of this computation is given after the list):

1. For each frame F_i in the shot, the motion field U_i between F_i and F_i+1 is computed on a grid X_i of 16 × 16 blocks.

2. Estimate the affine motion parameters A_i from U_i using the equation U_i = A_i X_i.

3. Apply the parameters A_i to X_i to generate the re-projected motion field U_i^p.

4. Compute the motion content M_i as the average magnitude of the "disagreement" between the original motion field U_i and the re-projected field U_i^p.

5. The motion content of shot s is the mean of {M_1, ..., M_{n_s−1}}, where n_s is the number of frames in the shot. If the motion content of the candidate shot s is above a defined threshold, the shot is rejected as a non-weather shot.
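A sketch of the per-shot motion-content computation (steps 1-5), using dense optical flow sampled on a block grid as a stand-in for block motion vectors; the block size, flow method, and rejection threshold are assumptions.

```python
# Sketch of motion-content estimation: residual between block motion and its affine fit.
import cv2
import numpy as np

def shot_motion_content(gray_frames, block=16):
    """gray_frames: list of 8-bit grayscale frames of one shot."""
    residuals = []
    for f0, f1 in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(f0, f1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = f0.shape
        ys, xs = np.mgrid[block // 2:h:block, block // 2:w:block]
        pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
        vecs = flow[ys.ravel(), xs.ravel()]                      # U_i sampled on the grid X_i
        A = np.hstack([pts, np.ones((len(pts), 1))])             # affine model: u = A_i [x, y, 1]^T
        params, _, _, _ = np.linalg.lstsq(A, vecs, rcond=None)   # estimate A_i
        reproj = A @ params                                      # re-projected field U_i^p
        residuals.append(np.linalg.norm(vecs - reproj, axis=1).mean())  # disagreement M_i
    return float(np.mean(residuals))

# A candidate shot would be rejected as non-weather if shot_motion_content(frames) > THRESHOLD.
```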

Finally, other false detections are eliminated by taking only the largest temporally continuous section as the true weather news story.

3.2.2 Sport Detection

We utilize a normalized text similarity measure to detect sporting shots. In sports video we often hear particular words that relate only to sports games, e.g., "scoring", "quarterback", "home run", "basketball", "Olympic", etc. Given such a database of sports-related words, we can relate a shot to the sporting database by computing the correlation between the words spoken in the shot and the words in the database. The text information is provided by the automatic speech recognition (ASR) output of the video [4]. The ASR output contains the recognized words from the audio track of the news program and their starting times. From each candidate shot s, we extract the key-words between the shot's time boundaries by applying a filter that prunes common words, such as "is" and "the". The remaining key-words form a sentence Sen_s for this shot. The similarity of the candidate shot s to the sporting database is defined as:

SportSim(s) = K_s / L(Sen_s),   (2)

where K_s is the number of key-words from shot s that also appear in the database, and L(Sen_s) is the length of the key-word sentence of shot s. The detection system declares shots having a strong correlation with the sporting database to be sporting shots. As with the technique used for weather detection, false detections are removed by taking only the largest continuous section of the detected sporting shots as the sporting story. In Figure 11, the upper plot shows the similarity of the shots to the sporting database, the bars in the bottom row represent the potential sporting shots, and the red region is the true sporting story in the video.

Figure 11: The sporting similarities of the shots in the video (x-axis: shot number; y-axis: sport similarity). Bars in the bottom row represent the potential sports shots, and the red region represents the actual sporting story.
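A sketch of the SportSim computation of Eq. (2); the stop-word list and the sports vocabulary are small illustrative stand-ins for the word database.

```python
# Sketch of the shot-to-sports-database similarity (Eq. 2).
STOP_WORDS = {"is", "the", "a", "an", "of", "to", "and", "in"}
SPORTS_WORDS = {"scoring", "quarterback", "basketball", "olympic", "touchdown"}

def sport_sim(asr_words):
    """asr_words: recognized words whose start times fall within the shot."""
    keywords = [w.lower() for w in asr_words if w.lower() not in STOP_WORDS]  # sentence Sen_s
    if not keywords:
        return 0.0
    k_s = sum(1 for w in keywords if w in SPORTS_WORDS)  # K_s
    return k_s / len(keywords)                           # K_s / L(Sen_s)
```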

The results of the detections of the main anchor, sub-anchor(s), weather forecast, and sporting stories are combined and passed to the story-merging step of the framework.

3.2.3 Story Merging

The proposed segmentation method over-segments the video when an anchor appears more than once in a single story. To overcome this problem, we merge adjacent segments based on their visual and text similarities. We use the histogram intersection technique to compute the visual similarity of two stories and the Normalized Text Similarity (NTS) as the text similarity measure.

Suppose stories S_i and S_j are news sections on the same topic created by phase 1, with n_i and n_j non-anchor shots respectively. For each shot, we extract the middle frame as the key-frame of that shot. The visual similarity V(i, j) between stories S_i and S_j is defined as:

V(i, j) = max( HI(s_i^p, s_j^q) ),  p ∈ [1...n_i], q ∈ [1...n_j],   (3)

where HI(s_i^p, s_j^q) is the histogram intersection between shots s_i^p and s_j^q. This means that if there are two visually similar shots in the adjacent stories, the two stories are likely to focus on a similar news topic.

Semantic similarity is not always reflected in visual appearance. For example, a news program about a taxation plan may first show interviews with middle-class citizens; then, after a brief summary by the anchor, the program switches to Congress to show the debate between the political parties on the same plan. In this case, the visual appearances of these two adjacent stories are not similar at all. However, if two stories focus on the same topic, there is usually a correlation in the narration. In our framework, this narrative correlation between stories S_i and S_j, with sentences Sen_i and Sen_j, is calculated by the Normalized Text Similarity (NTS):

NTS(i, j) = (K_{i→j} + K_{j→i}) / (L(Sen_i) + L(Sen_j)),   (4)

where K_{i→j} is the number of words in Sen_i that also appear in Sen_j, K_{j→i} is defined similarly, and L(Sen_i) and L(Sen_j) are the lengths of Sen_i and Sen_j, respectively.

The final similarity between stories S_i and S_j is a fusion of the visual similarity V(i, j) and the normalized text similarity NTS(i, j) (Figure 12):

Sim(i, j) = α_V × V(i, j) + α_NTS × NTS(i, j),   (5)

where α_V and α_NTS are weights balancing the importance of the two measures. If Sim(i, j) for two adjacent stories S_i and S_j is above a defined threshold, the two stories are merged into a single one.

Figure 12: The story similarity plot for the stories created by phase 1 (x-axis: story number; y-axis: story similarity). The red peaks are the stories merged into the following ones in phase 2.
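A sketch of the merging pass that combines Eqs. (3)-(5); the weights, the merge threshold, and the story data layout are illustrative assumptions.

```python
# Sketch of pairwise story similarity (Eqs. 3-5) and greedy merging of adjacent stories.
import numpy as np

def story_similarity(story_i, story_j, alpha_v=0.5, alpha_nts=0.5):
    """story_*: dict with 'shot_hists' (key-frame histograms of non-anchor shots)
    and 'words' (filtered ASR key-words)."""
    v = max((float(np.minimum(hp, hq).sum())                    # V(i, j), Eq. 3
             for hp in story_i["shot_hists"] for hq in story_j["shot_hists"]),
            default=0.0)
    wi, wj = set(story_i["words"]), set(story_j["words"])
    k_ij = sum(1 for w in story_i["words"] if w in wj)          # K_{i->j}
    k_ji = sum(1 for w in story_j["words"] if w in wi)          # K_{j->i}
    nts = (k_ij + k_ji) / max(len(story_i["words"]) + len(story_j["words"]), 1)  # Eq. 4
    return alpha_v * v + alpha_nts * nts                        # Sim(i, j), Eq. 5

def merge_adjacent_stories(stories, threshold=0.4):
    """Merge a story into its predecessor when their similarity exceeds the threshold."""
    merged = [stories[0]]
    for story in stories[1:]:
        if story_similarity(merged[-1], story) > threshold:
            merged[-1] = {"shot_hists": merged[-1]["shot_hists"] + story["shot_hists"],
                          "words": merged[-1]["words"] + story["words"]}
        else:
            merged.append(story)
    return merged
```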

4. Evaluation Results and Discussions

The results of the feature extraction task are shown in Table 1, including the average precision and the relative standing among the corresponding runs for each feature. Since the TRECVID 2004 data set contains a variety of news-related videos and commercials, we believe that the techniques used are relevant to most multimedia content analysis and search tasks. In particular, they can help multimedia search system developers not only in selecting suitable features, but also in determining the trade-off between accuracy and computational efficiency for semantic concept detection.

Table 1: UCF performance on the feature extraction task with five features. The second column gives the average precision and the last column the relative standing among all the runs for each feature.

Feature            Avg. Precision   Rel. Standing
(29) Albright      0.0899           8 / 56
(30) Clinton       0.0536           48 / 72
(32) Beach         0.0635           1 / 59
(33) Basketball    0.2632           25 / 63
(35) Walking       0.0752           23 / 49

Figure 13: Relative standings of UCF's feature extraction methods compared with all the runs in the task, one panel per feature: 29 Madeleine Albright, 30 Bill Clinton, 32 Beach, 33 Basketball Scored, 35 People Walking/Running. The runs are sorted by mean average precision (MAP), and the red bars represent UCF's standing.

The overall precision and recall for the story segmentation task are shown in Figure 14. The precision and recall of the UCF system are 0.5390 and 0.8030, respectively. Since the proposed method is motivated by the shot connectivity graph, it is biased towards more structured news videos. For instance, ABC videos often follow the pattern described in Section 3, and the initial segmentation is able to provide a solution close to the true segmentation of the video. On the other hand, the structure of CNN videos sometimes deviates from that of ABC: multiple stories may appear in a single shot, and the anchor person sometimes appears more than once in a single story, causing over-segmentation. The merging technique described in Section 3.2.3 is useful in such situations.

Figure 14: Precision and recall plot of the average performance of all the runs (x-axis: recall; y-axis: precision). The red spot represents the standing of the UCF framework.

References

[1] Intel Open Source Computer Vision Library. URL: http://www.intel.com/research/mrl/research/opencv/.

[2] Y. Freund and R.E. Schapire, "A Decision-theoretic Generalization of On-line Learning and an Application to Boosting", JCSS, 55(1):119-139, August 1997.

[3] B. Adams, C. Dorai, and S. Venkatesh, "Novel Approach to Determining Tempo and Dramatic Story Sections in Motion Pictures", ICIP, 2000.


[4] J.L. Gauvain, L. Lamel, and G. Adda, "The LIMSI Broadcast News Transcription System", Speech Communication, 37(1-2):89-108, 2002.

[5] A. Hanjalic, R.L. Lagendijk, and J. Biemond, "Automated High-Level Movie Segmentation for Advanced Video-Retrieval Systems", CSVT, Vol. 9, Issue 4, 1999.

[6] O. Javed, S. Khan, Z. Rasheed, and M. Shah, "Visual Content Based Segmentation of Talk and Game Shows", IJCA, 2002.

[7] R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Scene Determination Based on Video and Audio Features", IEEE Conf. on Multimedia Computing and Systems, 1999.

[8] Y. Li, S. Narayanan, and C.-C. Jay Kuo, "Movie Content Analysis Indexing and Skimming", Video Mining, Kluwer Academic Publishers, Chapter 5, 2003.

[9] C.W. Ngo, H.J. Zhang, R.T. Chin, and T.C. Pong, "Motion-Based Video Representation for Scene Change Detection", IJCV, 2001.

[10] P. Viola and M. Jones, "Robust Real-Time Object Detection", IJCV, 2001.

[11] M. Yeung, B. Yeo, and B. Liu, "Segmentation of Videos by Clustering and Graph Analysis", Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94-109, July 1998.
