
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 201–206, Lisbon, Portugal, 17-21 September 2015. © 2015 Association for Computational Linguistics.

Cross-document Event Coreference Resolution based on Cross-media Features

Tongtao Zhang1, Hongzhi Li2, Heng Ji1, Shih-Fu Chang2

1 Computer Science Department, Rensselaer Polytechnic Institute, {zhangt13, jih}@rpi.edu

2 Department of Computer Science, Columbia University, {hongzhi.li, shih.fu.chang}@columbia.edu

Abstract

In this paper we focus on a new problem of event coreference resolution across television news videos. Based on the observation that the contents from multiple data modalities are complementary, we develop a novel approach to jointly encode effective features from both closed captions and video key frames. Experiment results demonstrate that visual features provided a 7.2% absolute F-score gain over state-of-the-art text-based event extraction and coreference resolution.

1 Introduction

TV news is the medium that broadcasts events, stories and other information via television. The broadcast is organized into programs called "newscasts". Typically, a newscast involves one or several anchors who introduce stories and coordinate transitions among topics, and reporters or journalists who present events in the field, in scenes captured by cameramen. Similar to newspapers, the same stories are often reported by multiple newscast agents. Moreover, in order to increase the impact on the audience, the same stories and events are reported multiple times. TV audiences passively receive redundant information and often have difficulty obtaining a clear and useful digest of ongoing events. These properties create a need for automatic methods to cluster information and remove redundancy. We propose a new research problem: event coreference resolution across multiple news videos.

To tackle this problem, a good starting point is processing the Closed Captions (CC) that accompany videos in newscasts. The CC is either generated by automatic speech recognition (ASR) systems or transcribed by a human stenotype operator who inputs phonetics that are instantly and automatically translated into text, from which events can be extracted.

Figure 1: Similar visual contents improve detection of a coreferential event pair which has a low text-based confidence score. Closed Captions: "It's not clear when it was killed."; "Jordan just executed two ISIS prisoners, direct retaliation for the capture of the killing Jordanian pilot."

There has been some previous work on event coreference resolution, such as (Chen and Ji, 2009b; Chen et al., 2009; Lee et al., 2012; Bejan and Harabagiu, 2010). However, these approaches focused only on formally written newswire articles and utilized textual features. They do not perform well on CC due to (1) the propagated errors from upstream components (e.g., automatic speech/stenotype recognition and event extraction), and (2) the incompleteness of information. Unlike written news, newscasts are often limited in time due to fixed TV program schedules.

Thus, anchors and journalists are trained and expected to organize reports that are comprehensively informative, with complementary visual and CC descriptions, within a short time. These two sides have minimal overlapping information while being interdependent. For example, anchors and reporters introduce the background story that is not presented in the videos, and thus the events extracted from CC often lack information about participants.

For example, as shown in Figure 1, these two Conflict.Attack event mentions are coreferential. However, in the first event mention, a mistake in the Closed Caption ("he was killed" → "it was killed") makes event extraction and text-based coreference systems unable to detect and link "it" to the entity "Jordanian pilot". Fortunately, videos often illustrate brief descriptions with vivid visual content. Moreover, diverse anchors, reporters and TV channels tend to use similar or identical video content to describe the same story, even though they usually use different words and phrases. Therefore, the challenges faced by coreference resolution methods based on text information can be addressed by incorporating visual similarity. In this example, the visual similarity between the corresponding video frames is high because both of them show the scene of the Jordanian pilot.

Related work such as (Kong et al., 2014), (Ramanathan et al., 2014), (Motwani and Mooney, 2012) and (Ramanathan et al., 2013) has explored methods of linking visual materials with texts. However, these methods mainly focus on connecting image concepts with entities in text mentions, and some of them do not clearly distinguish entities from events in the documents, since the definition of visual concepts often requires both. Moreover, the aforementioned work mainly focuses on improving visual content recognition by introducing text features, while our work takes the opposite route: it takes advantage of visual information to improve event coreference resolution.

In this paper, we propose to jointly incorporate features from both the speech (textual) and video (visual) channels for the first time. We also build a newscast crawling system that can automatically accumulate video recordings and transcribe closed captions. With this crawler, we created a benchmark dataset which is fully annotated with cross-document coreferential events.¹

¹ The dataset can be found at http://www.ee.columbia.edu/dvmm/newDownloads.htm

2 Approach

2.1 Event Extraction

Given unstructured transcribed CC, we extract entities and events and present them in structured form. We follow the terminology used in ACE (Automatic Content Extraction) (NIST, 2005):

• Entity: an object or set of objects in the world, such as a person, organization or facility.
• Entity mention: words or phrases in the text that mention an entity.
• Event: a specific occurrence involving participants.
• Event trigger: the word that most clearly expresses an event's occurrence.
• Event argument: an entity, temporal expression or value that has a certain role (e.g., Time-Within, Place) in an event.
• Event mention: a sentence (or a text span extent) that mentions an event, including a distinct trigger and the arguments involved.
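
To make the terminology concrete, the following is a minimal, hypothetical Python sketch of how such ACE-style objects could be represented; the class and field names are illustrative and are not the authors' actual data model.

```python
# Hypothetical data-structure sketch of the ACE terminology above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityMention:
    text: str           # words or phrases in the text that mention an entity
    entity_type: str    # e.g., "PER", "ORG", "FAC"


@dataclass
class EventArgument:
    mention: EntityMention
    role: str           # e.g., "Attacker", "Place", "Time-Within"


@dataclass
class EventMention:
    sentence: str       # text span that mentions the event
    trigger: str        # word that most clearly expresses the event's occurrence
    event_type: str     # e.g., "Conflict"
    event_subtype: str  # e.g., "Attack"
    arguments: List[EventArgument] = field(default_factory=list)
```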

2.2 Text-based Event Coreference Resolution

Coreferential events are defined as the same specific occurrence mentioned in different sentences, documents and transcript texts. Coreferential events should happen in the same place and within the same time period, and the entities involved and their roles should be identical. From the perspective of extracted events, each specific attribute and argument of those events should match. However, mentions of the same event may appear in the form of diverse words and phrases, and they do not always cover all arguments or attributes.

To tackle these challenges, we adopt a Maximum Entropy (MaxEnt) model as in (Chen and Ji, 2009b). We consider every pair of event mentions that share the same event type as a candidate and exploit features proposed in (Chen and Ji, 2009b; Chen et al., 2009). Note that the goal in (Chen and Ji, 2009b; Chen et al., 2009) was to resolve event coreference within the same document, whereas our scenario is a cross-document/video transcript setting, so we remove some improper and invalid features. We also investigated the approaches of (Lee et al., 2012) and (Bejan and Harabagiu, 2010), but the confidence estimates from these alternative methods are not reliable. Moreover, the input to event coreference consists of automatic results from event extraction rather than gold-standard annotations, so the noise and errors significantly impact coreference performance, especially for unsupervised approaches (Bejan and Harabagiu, 2010).

Nevertheless, we still incorporate features from the aforementioned methods. Table 1 shows the features that constitute the input of the MaxEnt model.
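
As a rough illustration of the pairwise setup, the sketch below trains a logistic-regression (MaxEnt) classifier over feature dictionaries of the kind listed in Table 1 and scores a candidate pair. scikit-learn is used here as a stand-in for the authors' MaxEnt toolkit, and the feature values are toy examples.

```python
# Hedged sketch of a pairwise MaxEnt coreference scorer; scikit-learn's
# LogisticRegression stands in for whatever MaxEnt implementation was used.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one feature dict per candidate pair (EM_i, EM_j),
# label 1 = coreferential, 0 = not coreferential.
train_feats = [
    {"type_subtype": "Conflict_Attack", "exact_match": 1, "argument_match": 1},
    {"type_subtype": "Conflict_Attack", "exact_match": 0, "mod_conflict": 1},
]
train_labels = [1, 0]

model = make_pipeline(DictVectorizer(sparse=True),
                      LogisticRegression(max_iter=1000))
model.fit(train_feats, train_labels)

# Confidence W_ij that a new candidate pair is coreferential.
test_pair = {"type_subtype": "Conflict_Attack", "stem_match": 1}
w_ij = model.predict_proba([test_pair])[0, 1]
print(f"coreference confidence W_ij = {w_ij:.3f}")
```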

2.3 Visual Similarity

Visual content provides useful cues complementary to those used in the text-based approach to event coreference resolution. For example, two coreferential events typically show similar or even duplicate scenes, objects, and activities in the visual channel. Coherence of such visual content has been used to group multiple video shots into the same video story (Hsu et al., 2003), but it has not been used for event coreference resolution. Recent work in computer vision has demonstrated tremendous progress in large-scale visual content recognition. In this work, we adopt the state-of-the-art techniques of (Krizhevsky et al., 2012) and (Simonyan and Zisserman, 2014), which train robust convolutional neural networks (CNN) over millions of web images to detect 20,000 semantic categories defined in ImageNet (Deng et al., 2009) from each image. The features from the second-to-last layer of such a deep network can be considered a high-level visual representation that discriminates various semantic classes (scenes, objects, activities). They have been found effective for computing visual similarity between images, either by directly computing the L2 distance between such features or through further metric learning. To compute the similarity between the videos associated with two candidate event mentions, we sample multiple frames from each video and aggregate the similarity scores of the few most similar image pairs between the videos. Let $\{f^i_1, f^i_2, \ldots, f^i_l\}$ be the key frames sampled from video $V_i$ and $\{f^j_1, f^j_2, \ldots, f^j_l\}$ be the key frames sampled from video $V_j$. All frames are resized to a fixed resolution of 256 x 256 and fed into our pre-trained CNN model. We obtain the high-level visual representation $F_m = \mathrm{FC7}(f_m)$ for each frame $f_m$ from the output of the second-to-last fully connected layer (FC7) of the CNN model; $F_m$ is a 4096-dimensional vector. The visual distance between frames $f_m$ and $f_n$ is defined by the L2 distance:

$$D_{mn} = \left\| \mathrm{FC7}(f^i_m) - \mathrm{FC7}(f^j_n) \right\|_2 . \qquad (1)$$

The distance of the video pair $(V_i, V_j)$ is computed as

$$\bar{D}_{ij} = \frac{1}{k} \sum_{(f_m, f_n)} D_{mn} , \qquad (2)$$

where $(f_m, f_n)$ ranges over the top $k$ most similar frame pairs. In our experiments, we use $k = 3$. This aggregation over the top matches is intended to capture similarity between videos that share only partially overlapping content.
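
A small sketch of Equations (1) and (2) is given below, assuming the FC7 features of the sampled key frames have already been extracted (the CNN feature extraction itself is omitted); array shapes and names are illustrative.

```python
# Sketch of Eq. (1)-(2): pairwise L2 distances between FC7 frame features,
# averaged over the top-k most similar frame pairs.
import numpy as np


def video_distance(fc7_i: np.ndarray, fc7_j: np.ndarray, k: int = 3) -> float:
    """fc7_i: (l_i, 4096) FC7 features of video V_i's key frames;
    fc7_j: (l_j, 4096) features of V_j's key frames."""
    # All pairwise L2 distances D_mn (Eq. 1).
    diffs = fc7_i[:, None, :] - fc7_j[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).ravel()
    # Mean over the k smallest distances, i.e. the k most similar pairs (Eq. 2).
    k = min(k, dists.size)
    return float(np.sort(dists)[:k].mean())


# Usage with random stand-in features.
rng = np.random.default_rng(0)
d_bar_ij = video_distance(rng.normal(size=(5, 4096)), rng.normal(size=(4, 4096)))
```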

Each news video story typically starts with an introduction by an anchor person, followed by news footage showing the visual scenes or activities of the event. Therefore, when computing visual similarity, it is important to exclude the anchor shots and focus on the story-related clips. Anchor frame detection (Hsu et al., 2003) is a well-studied problem. To detect anchor frames automatically, a face detector is applied to all I-frames of a video, giving the location and size of each detected face. After checking the temporal consistency of the detected faces within each shot, we obtain a set of candidate anchor faces. The detected face regions are further extended to regions of interest that may include the hair and upper body. All candidate faces detected from the same video are clustered based on their HSV color histograms. It is reasonable to assume that the most frequent face cluster corresponds to the anchor. Once the anchor frames are detected, they are excluded, and only the non-anchor frames are used to compute the visual similarity between the videos associated with event mentions.
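
The anchor-filtering step could look roughly like the sketch below: detect faces on sampled frames, describe each face region with an HSV color histogram, cluster the histograms, and mark frames whose faces fall in the largest cluster as anchor frames. OpenCV's Haar cascade and scikit-learn's KMeans are stand-ins for the detector and clustering actually used, and the thresholds and cluster count are illustrative.

```python
# Hedged sketch of anchor-face filtering via face detection and
# HSV-histogram clustering (parameters are illustrative).
import cv2
import numpy as np
from sklearn.cluster import KMeans

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def face_histograms(frames):
    """Return one normalized HSV histogram per detected face region."""
    hists, owners = [], []
    for idx, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
            roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([roi], [0, 1, 2], None, [8, 8, 8],
                                [0, 180, 0, 256, 0, 256])
            hists.append(cv2.normalize(hist, hist).flatten())
            owners.append(idx)
    return np.array(hists), owners


def anchor_frame_ids(frames, n_clusters=3):
    """Indices of frames whose faces fall in the most frequent (anchor) cluster."""
    hists, owners = face_histograms(frames)
    if len(hists) < n_clusters:
        return set()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(hists)
    anchor_label = np.bincount(labels).argmax()
    return {owners[i] for i, lab in enumerate(labels) if lab == anchor_label}
```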

2.4 Joint Re-ranking

Using the visual distance calculated in Section 2.3, we rerank the confidence values produced by the text-based MaxEnt model of Section 2.2. We use the following empirical equation to adjust the confidence:

$$W'_{ij} = W_{ij} \cdot e^{-\frac{\bar{D}_{ij}}{\alpha} + 1} , \qquad (3)$$

where $W_{ij}$ denotes the original coreference confidence between event mentions $i$ and $j$, $\bar{D}_{ij}$ denotes the visual distance between the corresponding video frames where the event mentions were spoken, and α is a parameter that adjusts the impact of the visual distance. In the current implementation, we empirically set α to the average of the pairwise visual distances between the videos of all event coreference candidates. With this α, we generally enhance the confidence of event pairs with small visual distances and penalize those with large ones. An alternative way to set α is through cross-validation over separate data partitions.
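
A minimal sketch of the re-ranking rule in Equation (3) follows, with α defaulting to the mean pairwise visual distance as described above; the dictionary-based interface is hypothetical.

```python
# Sketch of Eq. (3): boost text-based confidences for visually similar pairs
# and penalize visually distant ones.
import math


def rerank(text_conf, visual_dist, alpha=None):
    """text_conf, visual_dist: dicts keyed by the same event-mention pairs."""
    if alpha is None:
        # Default alpha: average pairwise visual distance over all candidates.
        alpha = sum(visual_dist.values()) / len(visual_dist)
    # exp(-D/alpha + 1) > 1 when D < alpha (boost), < 1 when D > alpha (penalty).
    return {pair: w * math.exp(-visual_dist[pair] / alpha + 1)
            for pair, w in text_conf.items()}
```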

Table 1: Features for Event Coreference Resolution (EM_i: the first event mention, EM_j: the second event mention)

Baseline features:
• type_subtype: pair of event type and subtype in EM_i
• trigger_pair: trigger pair of EM_i and EM_j
• pos_pair: part-of-speech pair of the triggers of EM_i and EM_j
• nominal: 1 if the trigger of EM_i is nominal
• nom_number: "plural" or "singular" if the trigger of EM_i is nominal
• pronominal: 1 if the trigger of EM_i is pronominal
• exact_match: 1 if the trigger spelling in EM_i matches that in EM_j
• stem_match: 1 if the trigger stem in EM_i matches that in EM_j
• trigger_sim: the semantic similarity score between the triggers of EM_i and EM_j using WordNet (Miller, 1995)

Argument features:
• argument_match: 1 if the arguments holding the same roles in EM_i and EM_j match

Attribute features:
• mod, pol, gen, ten: the four event attributes of EM_i: modality, polarity, genericity and tense
• mod_conflict, pol_conflict, gen_conflict, ten_conflict: 1 if the corresponding attribute of EM_i and EM_j conflicts


3 Experiments

3.1 Data and Setting

We establish a system that actively monitors over 100 U.S. major broadcast TV channels such as ABC, CNN and FOX, and has been crawling newscasts from these channels for more than two years (Li et al., 2013a). With this crawler, we retrieve 100 videos and their corresponding transcribed CC on the topic of "ISIS"². The system also temporally aligns the CC text with the transcript from automatic speech recognition following the methods in (Huang et al., 2003), which provides accurate time alignment between the CC text and the video frames. As CC consists of capitalized letters, we apply the true-casing tool from Stanford CoreNLP (Manning et al., 2014) to the CC. Then we apply a state-of-the-art event extraction system (Li et al., 2013b; Li et al., 2014) to extract event mentions from the CC. We asked two human annotators to investigate all event pairs and annotate coreferential pairs as the ground truth; the Kappa coefficient for inter-annotator agreement is 74.11%. To evaluate system performance, we rank the confidence scores of all event mention pairs and present the results as a Precision vs. Detection Depth curve. Finally, we find the video frames corresponding to the event mentions, remove the anchor frames and calculate the visual similarity between the videos. Our final dataset consists of 85 videos, 207 events and 848 event pairs, of which 47 pairs are coreferential.

² ISIS is an abbreviation for Islamic State of Iraq and Syria.
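
To illustrate the evaluation, the sketch below computes precision at each detection depth from ranked pair confidences; the function and variable names are hypothetical.

```python
# Sketch of the Precision vs. Detection Depth evaluation: rank all event-mention
# pairs by confidence and compute precision among the top-N pairs at each depth N.
def precision_at_depths(confidences, gold_pairs, max_depth=150):
    """confidences: {pair: score}; gold_pairs: set of coreferential pairs."""
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    precisions, hits = [], 0
    for depth, pair in enumerate(ranked[:max_depth], start=1):
        hits += pair in gold_pairs
        precisions.append(hits / depth)
    return precisions
```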

We adopt the MaxEnt-based coreference resolution system of (Chen and Ji, 2009b; Chen et al., 2009) as our baseline and use the ACE 2005 English Corpus as the training set for the model. A 5-fold cross-validation is conducted on the training set and the average F-score is 56%. This is lower than the results reported in (Chen and Ji, 2009a) because we remove some features that are not available in the cross-document scenario.

3.2 Results

The peak F-score for the baseline system is 44.23%, while our cross-media method boosts it to 51.43%. Figure 2 shows the improvement after incorporating the visual information. We adopt the Wilcoxon signed-rank test to determine the significance of the differences between the pairs of precision scores at the same depth. The z-ratio is 3.22, which shows the improvement is significant.
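
The significance test can be reproduced in outline with scipy's Wilcoxon signed-rank test over the paired precision values at each depth; the numbers below are placeholders, not the paper's actual precision curves.

```python
# Hedged sketch: paired significance test between baseline and re-ranked
# precision values at the same detection depths (placeholder values).
from scipy.stats import wilcoxon

prec_baseline = [0.50, 0.45, 0.60, 0.52, 0.48, 0.55]  # e.g. from precision_at_depths()
prec_ours = [0.62, 0.50, 0.66, 0.58, 0.57, 0.61]
stat, p_value = wilcoxon(prec_baseline, prec_ours)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
```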

Figure 2: Performance comparison between the baseline and our cross-media method on the top 150 pairs (precision plotted against detection depth). Circles indicate the peak F-scores.

For example, the event pair "So why hasn't U.S. air strikes targeted Kobani within the city limits" and "Our strikes continue alongside our partners." was mistakenly considered coreferential based on text features. In fact, the former "strikes" refers to the airstrikes while the latter refers to the war or battle; therefore, they are not coreferential. The corresponding video shots show two different scenes: the former shows a bombing while the latter shows the president giving a speech about the strikes. Thus the visual distance successfully corrected this error.

3.3 Error Analysis

However, from Figure 2 we can also see that some errors are caused by the visual features. One major error type involves negative pairs with both relatively high textual coreference confidence scores and relatively high visual similarity. On the text side, such a pair contains similar events, for example: "The Penn(tagon) says coalition air strikes in and around the Syrian city of Kobani have kill hundreds of ISIS fighters but more are streaming in even as the air campaign intensifies." and "Throughout the day, explosions from coalition air strikes sent plums of smoke towering into the sky.". These mentions describe two airstrikes during different time periods and are not coreferential, but the baseline system ranks the pair highly. Our current approach limits the image frames to those overlapping with the speech of an event mention, and in this error case both videos show a "battle" scene, yielding a small visual distance. The earlier assumption that anchors and journalists tend to use similar videos when describing the same events may therefore introduce errors when similar text event mentions come with similar video shots. For such errors, one potential solution is to expand the video frame window to capture more events and concepts from the videos. Expanding the detection range to include visual events in the temporal neighborhood could also help differentiate the events.

3.4 Discussion

A more systematic way of choosing α in Equation 3 would be useful. One idea is to adapt the α value to different types of events; e.g., we expect some event types to be more visually oriented than others and thus to benefit from a smaller α value.

We also note the impact of errors from the upstream event extraction system. According to (Li et al., 2014), the F-score of event trigger labeling is 65.3%, and that of event argument labeling is 45%. Missing arguments in events is a main problem; thus the performance on automatically extracted event mentions is significantly worse than on gold-standard mentions. About 20 more coreferential pairs could be detected if events and arguments were perfectly extracted.

4 Conclusions and Future Work

In this paper, we improved event coreference resolution on newscast speech by incorporating visual similarity. We also built a crawler that provides a benchmark dataset of videos with aligned closed captions. This system can also help create more datasets for research on video description generation. In the future, we will focus on improving event extraction from text by introducing more fine-grained cross-media information such as object, concept and event detection results from videos. Moreover, joint detection of events from both sides is our ultimate goal; however, we still need to explore the mapping between events on the text and visual sides, and automatic detection of a wide range of objects and events from news video itself remains challenging.

Acknowledgement

This work was supported by the U.S. DARPA DEFT Program No. FA8750-13-2-0041, ARL NS-CTA No. W911NF-09-2-0053, NSF CAREER Award IIS-1523198, the AFRL DREAM project, and gift awards from IBM, Google, Disney and Bosch. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Cosmin Adrian Bejan and Sanda Harabagiu. 2010. Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1412–1422.

Zheng Chen and Heng Ji. 2009a. Event coreference resolution: Algorithm, feature impact and evaluation. In Proceedings of the Events in Emerging Text Types (eETTs) Workshop, in conjunction with RANLP, Bulgaria.

Zheng Chen and Heng Ji. 2009b. Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 54–57.

Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types (eETTs '09), pages 17–22.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.

Winston Hsu, Shih-Fu Chang, Chih-Wei Huang, Lyndon Kennedy, Ching-Yung Lin, and Giridharan Iyengar. 2003. Discovery and fusion of salient multimodal features toward news story segmentation. In Electronic Imaging 2004, pages 244–258.

Chih-Wei Huang, Winston Hsu, and Shih-Fu Chang. 2003. Automatic closed caption alignment based on speech recognition transcripts. Technical report, Columbia University.

Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. 2014. What are you talking about? Text-to-image coreference. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3558–3565.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500.

Hongzhi Li, Brendan Jou, Joseph G. Ellis, Daniel Morozoff, and Shih-Fu Chang. 2013a. News Rover: Exploring topical structures and serendipity in heterogeneous multimedia news. In Proceedings of the 21st ACM International Conference on Multimedia, pages 449–450.

Qi Li, Heng Ji, and Liang Huang. 2013b. Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 73–82.

Qi Li, Heng Ji, Yu Hong, and Sujian Li. 2014. Constructing information networks using one single model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1846–1851.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Tanvi S. Motwani and Raymond J. Mooney. 2012. Improving video activity recognition using object recognition and text mining. In Proceedings of the 20th European Conference on Artificial Intelligence, pages 600–605.

NIST. 2005. The ACE 2005 evaluation plan. http://www.itl.nist.gov/iad/mig/tests/ace/ace05/doc/ace05-evaplan.v3.pdf.

Vignesh Ramanathan, Percy Liang, and Li Fei-Fei. 2013. Video event understanding using natural language descriptions. In Proceedings of the 2013 IEEE International Conference on Computer Vision, pages 905–912.

Vignesh Ramanathan, Armand Joulin, Percy Liang, and Li Fei-Fei. 2014. Linking people in videos with their names using coreference resolution. In Computer Vision – ECCV 2014, pages 95–110.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
