Introduction to Cross-Document Coreference

Amit Bagga

StreamSage/Comcast

[email protected]

Outline

• Motivation and Definition
• Comparison with Within-Document Coreference, WSD and other NL tasks
• Methodologies for Entity Cross-Document Coreference
• Other types of Cross-Document Coreference
  – Concept Cross-Document Coreference
  – Event Cross-Document Coreference
  – Cross-Media Coreference
  – Cross-Language, Cross-Document Coreference
• Scoring Methodologies

Motivation

• Proper names comprise approximately 10% of news text (Coates-Stephens, 1992)

• Names are often ambiguous across documents
  – increasingly becoming a challenge for NLP systems as collection size and generality grow
  – also as systems break the “document boundary”

Definition

• Cross-Document Coreference (CDC) for entities, in broad terms, asks
  – how can one computationally disambiguate the intended referent of a name
    • Winchester & Lee 2002
  – for example, it asks which ‘John Smith’ is meant by a particular occurrence of the string “John Smith”

Comparison with Within-Document Coreference

• Within a document
  – Identical or similarly named entities seldom appear in the same context
    • when they do, writers distinguish them explicitly
    • i.e. it is usually the case that we have one referent per discourse
  – Variant forms of the same name generally obey certain predictable regularities
    • For example: Michael Jordan may be referred to as Michael, Mr. Jordan, Jordan, etc.
• Across documents
  – The assumption that the same or similar names refer to the same entity is not valid
  – Linguistic theories do not apply
  – The only way to distinguish between these entities is to examine context

Comparison with WSD

• CDC can be thought of as disambiguating the “sense” of usage of a name
• In WSD:
  – Usually possible to enumerate a priori all possible senses of a word
  – Number of possible senses of a word is small (1-10)
• In CDC:
  – A large corpus can contain 10s or 100s of entities with the same name, which are impossible to enumerate a priori
  – From a linguistic perspective, all entities are equally plausible

The Role of Context

• Similar to WSD, context is vital for CDC– context can be of different sizes

• window of words centered around a name, sentence containing name, group of sentences, or even whole document

– modeling context can be done in many different ways

• bag of words, set of phrases, set of entities, set of relations, etc.

• All CDC systems use context in one form or another

Bag of Words Approach

– Bagga and Baldwin, 1998

– Within-document coreference system is used to identify all mentions of entity

– Sentences containing mention are extracted from each document

• “summaries” with respect to entity

– Set of summaries compared using VSM (tf*idf)

– Single-link clustering used

– Version 2 (1999) eliminates use of within document coreference system

• sentences containing any variant of name extracted
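The comparison and clustering steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the log-based idf weighting, the function names, and the threshold value are all assumptions made for the example. Single-link clustering with a fixed threshold is implemented here as the connected components of the graph that links any two summaries whose similarity exceeds the threshold.

```python
import math
from collections import Counter


def tfidf_vectors(summaries):
    """Map each summary (a string) to a sparse tf*idf vector (a dict)."""
    docs = [Counter(s.lower().split()) for s in summaries]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_link(summaries, threshold):
    """Group summaries into chains: any pair above threshold is linked."""
    vecs = tfidf_vectors(summaries)
    parent = list(range(len(vecs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)

    chains = {}
    for i in range(len(vecs)):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())
```

Each returned chain is a list of summary indices judged to refer to the same entity. The all-pairs loop is the n-to-n comparison cost that later approaches try to reduce.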

Corpus, Evaluation, and Results

– 197 articles containing “John Smith” extracted from 2 years of New York Times data

• 35 different John Smiths

– B-CUBED algorithm used

– Version 2 results

• 84% F-Measure

• 90% Precision, 78% Recall

• < 1% F-Measure drop when compared to original system

Minimizing Context Matches

• Kazi and Ravin, 2000
• Problem with Bagga and Baldwin, 1998
  – Prohibitively expensive in terms of storage and n-to-n comparisons (especially in a large corpus)
• Use IBM’s Nominator for named entity identification and within-document coreference (non-pronominal)
• CDC task is merging canonical names from different documents that refer to the same entity
• Context analysis done by use of a Context Thesaurus
  – Given a name, returns a ranked list of terms that are related to the name in the corpus

# Docs    Nominator Output
17        Bush (unspecified gender)
1         Christopher Bush (male)
1         Douglas Bush (male)
26        George Bush (male)
2         George Bush; President Bush (male)
1         George W. Bush; Gov. George W. Bush; President George Bush (male)
1         Mr. Bush (male)
2         President Bush (male)
7         Vannevar Bush (unspecified gender)

E1       Christopher Bush
E2       Douglas Bush
E3       George W. Bush
E4       Vannevar Bush

M1       George Bush, mergeable with E3 (first name and gender)
M2       Mr. Bush, mergeable with E1-E4
M3       President Bush,
M4-M9    Bush,

• E = Exclusives – i.e. no merging possible

• M = Mergeables – i.e. compatible with some or all exclusives

• Tables are created by analyzing two lists sorted by ambiguity
  – PERS names
    • George Walker Bush > George W. Bush > George Bush > G. Bush > Bush
  – PLACE names
    • Albany, NY > Albany
• Merging steps
  – Merge identical canonical strings >= 2 words
    • Merges 28 George Bush, 2 President Bush, and 7 Vannevar Bush articles into 3 equivalence classes
  – Between mergeables and exclusives, combine if any variants share a common prefix
    • Merges E3, M1 and M3 (common prefix = President)
• Reduces # of context matches from 58x58 to 7x4


• Corpus – 1998 editions of New York Times
• 15 name families
  – For example: Berger, Black, Brown, Bush, Clinton, Gore, etc.
• B-CUBED algorithm for scoring
• Without context comparisons:
  – Avg Precision = 98.5%
  – Avg Recall = 72.85%
• No results reported when context comparisons are used (Ravin and Kazi, 1999)

3 Models of Similarity

• Gooi and Allan, 2004
• Methodology similar to Bagga and Baldwin
  – extract 55-word snippets centered at name or its variant
• Problem with Bagga and Baldwin
  – sharp drop-off in F-Measure around threshold
• 3 different models of similarity
  – Incremental Vector Space
    • tf*idf, but with average-link clustering
  – KL divergence
    • snippets are represented as probability distributions of words
    • similarity = “distance” between two probability distributions
  – Agglomerative Vector Space
    • tf*idf with bottom-up, complete-link clustering
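To make the KL divergence model concrete, here is a small sketch of a “distance” between two snippets represented as word distributions. The add-alpha smoothing, the symmetrized form, and the function names are assumptions made for illustration; the slides do not say how the original system smooths or symmetrizes the divergence.

```python
import math
from collections import Counter


def word_distribution(snippet, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(snippet.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def symmetric_kl(snippet_a, snippet_b):
    """Symmetrized divergence: 0 for identical snippets, larger when they differ."""
    vocab = set(snippet_a.lower().split()) | set(snippet_b.lower().split())
    p = word_distribution(snippet_a, vocab)
    q = word_distribution(snippet_b, vocab)
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Smoothing is needed because plain KL divergence is undefined when a word in one snippet has zero probability in the other. A smaller value means the two snippets are more likely to describe the same entity.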

Corpus

• John Smith corpus (Bagga and Baldwin)
• Person-x corpus
  – created by querying TREC collection with queries like arts, business, sports, etc.
  – BBN’s IdentiFinder used for named entity recognition
  – one name (and its corresponding variants) randomly replaced with phrase Person-x
  – 34,404 documents; 14,767 actual unique entities

Evaluation and Results

• B-CUBED algorithm used for scoring
• Agglomerative VS best
  – 88.2% F-Measure for John Smith corpus
  – 83% F-Measure for Person-x corpus
• When run on each sub-corpus (arts, sports, etc.) of Person-x corpus
  – F-Measure drops to 77%
  – shows that a more homogeneous corpus is more difficult
• Results for Agglomerative VS degrade much more smoothly around threshold than others

Second Order Co-Occurrence

• Three methods – independently published
• Bagga, Baldwin, and Ramesh, 2001 - 2-pass algorithm
  – First pass: as before
  – Second pass:
    • for each chain, compute set of most frequent overlapping words in chain (signature words for chain)
    • for each singleton document after pass 1, compare to each chain
      – use signature words to extract additional sentences
      – compare enhanced summary to every summary in chain
      – merge if similarity > threshold
      – if not merged with any chain, remains singleton
• Winchester and Lee, 2001
  – named entity detection and conflation within documents is done as pre-processing step
  – based on Schütze’s (1998) algorithm for context-group discrimination
  – 3 types of vectors are created
    • Term Vectors – formed for each name occurring in context of entity of interest and its variants
      – stores co-occurrence stats for term across whole corpus
    • Context Vectors – formed for entity of interest by summing all term vectors associated with its context
      – term vectors are weighted with their idf scores before sum
    • Entity Vectors – for each entity, it is centroid of set of context vectors
  – entity disambiguation is done by comparing Entity Vectors using VSM with single-link clustering
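The three vector types can be illustrated with a short sketch. It assumes document-level co-occurrence counts, a log-based idf, and pre-tokenized input, and all function names are hypothetical; the original system's windowing and weighting details are not given in the slides.

```python
import math
from collections import Counter, defaultdict


def term_vectors(corpus):
    """Term vector per word: counts of the words it co-occurs with in a document."""
    vecs = defaultdict(Counter)
    for doc in corpus:
        words = set(doc)
        for w in words:
            for other in words - {w}:
                vecs[w][other] += 1
    return vecs


def idf(corpus):
    """Inverse document frequency of every word in the corpus."""
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n / c) for w, c in df.items()}


def context_vector(context, tvecs, idf_scores):
    """Sum of the idf-weighted term vectors of the words in one context."""
    total = Counter()
    for w in context:
        for other, count in tvecs.get(w, {}).items():
            total[other] += idf_scores.get(w, 0.0) * count
    return total


def entity_vector(contexts, tvecs, idf_scores):
    """Centroid of the context vectors collected for one entity."""
    centroid = Counter()
    for ctx in contexts:
        for t, v in context_vector(ctx, tvecs, idf_scores).items():
            centroid[t] += v / len(contexts)
    return centroid
```

Because a context is represented through the corpus-wide company its words keep, two contexts can be similar even when they share no words directly, which is the point of second-order co-occurrence.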

Corpus, Evaluation, Results

• Bagga, Baldwin, and Ramesh
  – John Smith corpus, B-CUBED scoring
  – new F-Measure 91% (+7 from before)
• Winchester and Lee
  – 30 name sets; 10 each of PER, LOC, ORG
  – from 6000 WSJ articles
  – B-CUBED scoring
  – discovered that selective creation of 3 types of vectors boosts performance
    • for example, LOC helps disambiguate other LOC
    • Birmingham, Alabama vs UK; John Smith associated with Pocahontas
  – overall F-Measure 78.5%
    • NAM – 90.3%, LOC – 79.2%, ORG – 72.5%
• Guha and Garg, 2004
  – mine descriptions associated with entity of interest (sketch)
    • descriptions are other entities + professions that are in close proximity
  – comparing descriptions
    • different weights given to different descriptions given type of entity of interest and entity-type of description
      – for example: location is more likely to be disambiguated by another location than by the name of a person
  – Corpus and Evaluation
    • 26 entities (names + places), 2-6 instances identified of each
    • sent as queries to search engines, top 150 results collated and manually tagged for truth
    • best F-Measure = 90.3%

Maximum Entropy Model

• Fleischman and Hovy, 2004 – use ME to determine if two concept/instance pairs are same entity
  – concept/instance pairs – ACL dataset (2M pairs)
    • John Edwards/lawyer and John Edwards/politician
  – Name features: NAME-COMMON (census), NAME-FAME (ACL dataset), WEB-FAME (Google)
  – Web features: based on # of Google hits with name plus headwords of concepts used as queries
  – Overlap features: based on # words overlapping in context of names and concepts
  – Semantic features: based on semantic relatedness of concepts (WordNet)
    • for example: lawyers are more likely to become politicians
  – Estimated Statistics features: probabilities that a name is associated with a particular concept (computed over entire ACL dataset)
• Disambiguation using group-average agglomerative clustering
• Tested on set of 31 concept/instance pairs (1875 used for training)
  – 20 had a single referent
  – F-Measure = 93.9%
  – baseline (all in same chain) = 92.4%

Robust Reading Approach

• Li, Morie, and Roth, 2004
  – a global probabilistic view of how documents are generated and how entities are “sprinkled” into them
• Model 1 (simplest – no notion of author)
  – entities are present in a document with a prior probability, independent of other entities
  – mentions (references) are selected according to probability distribution P(mj|ei)
  – i.e. entity referenced by a mention is not dependent on other mentions
• Model 2 (more expressive)
  – # of entities in doc and # of mentions follow uniform distribution
  – entities enter doc with a prior probability, independent of others
  – representative (canonical form) for each entity is selected according to P(rj|ei)
  – for each representative, mentions are selected by P(mk|rj)
  – i.e. entity referenced by a mention depends on other mentions in the same document
• Model 3 (least relaxation)
  – # of entities based on uniform distribution – but not independent of each other
  – entities in doc viewed as nodes in a weighted directed graph with edges labeled as P(ej|ei)
  – entities inserted in document via a random walk starting at an entity with prior probability P(ek)
  – representatives and mentions follow the same probabilities as Model 2
  – i.e. entity referenced by a mention depends on other mentions in same document, but also on other entities in entire corpus
• Models learned using truncated EM algorithm
• Evaluation
  – 300 NYT articles from TREC corpus
  – 8000 mentions corresponding to 2000 entities (people, locations, organizations)
  – compared to SOFT-TF-IDF and baseline (entities with identical writing are same)
  – overall F-Measure = 89% (model 2)
  – baseline = 70.7% and SOFT-TF-IDF = 79.8%
• Model 3 does not perform best because
  – global dependencies enforce restrictions over groupings of similar mentions
  – because of limited document set, estimating global dependency is inaccurate

Using IE Features

• 3 different methods published
• Mann and Yarowsky, 2003
  – use unsupervised learning to learn patterns from corpus that capture biographical features
    • birth day, birth year, birth place and occupation
  – use bottom-up centroid agglomerative clustering for disambiguation
  – vectors for each document are generated by using the following
    • all words (plain) or proper nouns (nnp)
    • most relevant words (mi and tf-idf)
    • basic biographical features (feat)
    • extended biographical features (extfeat)

Corpus, Evaluation, and Results

• Mann and Yarowsky
  – Pseudoname corpus
    • query Google with names of 8 people
      – take 28 possible pairs and replace with different pseudonames
  – Naturally occurring corpus
    • query for 4 naturally occurring polysemous names
      – example: Jim Clark
    • 60 articles for each name
    • 3-way classification (top 2 occurring people + “others”)
  – Disambiguating accuracy for Pseudonames
    • 86.4% with nnp+feat+tf-idf
  – For naturally occurring corpus
    • using mutual information 88% Precision and 73% Recall
• Niu, Li, and Srihari, 2004 - use 3 different categories of contextual features
  – set of 50 words centered around name (or alias)
  – other entities occurring in 50 word context of name (or alias)
  – automatically extracted relationships (25 possible)
    • birth day, age, affiliation, title, address, degree, etc.
  – features combined using Maximum Entropy Model
• Evaluation using B-CUBED algorithm
  – 4 sets of 4 famous names mixed together using pseudonames
    • 88% F-Measure achieved
  – 2 naturally occurring sets
    • Peter Sutherland – 96% F-Measure
    • John Smith – 85% F-Measure

• Dozier and Zielund, 2004
  – CDC for people in legal domain
    • attorneys, judges, and expert witnesses
  – Combine IE techniques with record linkage techniques
    • biographical records for attorneys and judges created manually from Westlaw Legal Directory
    • biographical record for expert witnesses created through text mining
    • IE techniques extract templates associated with each type from document
    • record linkage part uses Bayesian network to match templates with biographical records
  – Evaluation
    • for docs with stereotypical syntax and full names – 98% precision and 95% recall
    • Otherwise, 95% precision and 60% recall

Baseline

• Guha and Garg, 2004
  – established baseline when full docs were compared using TF-IDF without considering context for 26 entities (names and places)
  – 2-6 instances of each entity considered
  – for each instance, top 10 results evaluated
  – 22.5% accuracy overall

Types of CDC

• Named Entities
  – described earlier
• Terms or Concept
  – Kazi and Ravin, 2000
• Events
  – Bagga and Baldwin, 1999
• Cross-Media and/or Multimedia Coreference
  – Between text and pictures for names (Bagga and Hu, unpublished)
  – Between text and video for names (Satoh and Kanade, 1997)
  – Between video streams (using image and text) for events (Bagga, Hu, and Zhong, 2002)
• Cross-Language, Cross-Document Coreference
  – parallel corpus (Harabagiu and Maiorano, 2000)
  – non-parallel corpus – open problem, although manual results encouraging (Bagga and Baldwin, unpublished)

Term or Concept CDC

• Single or multi-word terms refer to concepts occurring in domain

• Multi-word terms
  – identified by Terminator (rule-based)
    • form subset of noun phrases in document
  – discard those that occur only once in document
    • for example: price rose where rose is mistakenly identified as noun
  – discard those that are found only as proper sub-strings
    • for example: dimension space (part of lower dimension space)
  – are seldom ambiguous and are merged across documents

Single Word Terms

• Capitalized single words are most common sources of ambiguity
  – for example: Wired – name of magazine and an adjective that is first word in sentence
• Within-doc categorization of single words
  – If capitalized word occurs in lowercase in document – consider as regular word
  – If capitalized word appears as capitalized in middle of sentence – consider as name
  – If no lowercase occurrences and word appears at beginning of sentence or in title/header – consider as term
  – All other single words not identified as part of name or multi-word terms – consider as lower-case term
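The first three within-document rules can be written as a small decision function. This is only a sketch: it assumes the document is already split into tokenized sentences, that the rules are tried in the order listed, and that titles and headers are passed in as ordinary sentences. The function name and return labels are hypothetical.

```python
def categorize(word, sentences):
    """Categorize a capitalized single word within one document.

    `sentences` is a list of token lists. Rules are tried in the order
    given on the slide; using the first match is an assumption.
    """
    lower = word.lower()
    positions = [(i, tok) for sent in sentences for i, tok in enumerate(sent)]

    # Rule 1: the word also occurs in lowercase somewhere in the document.
    if any(tok == lower for _, tok in positions):
        return "regular word"
    # Rule 2: the word appears capitalized in the middle of a sentence.
    if any(tok == word and i > 0 for i, tok in positions):
        return "name"
    # Rule 3: no lowercase occurrences, seen only at the start of a sentence.
    if any(tok == word and i == 0 for i, tok in positions):
        return "term"
    return "not found"
```

For example, a document in which "Wired" appears mid-sentence yields "name", while one in which it only opens a sentence yields "term", leaving the ambiguity to be resolved across documents.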

Disambiguating Single Words Across Documents

Lower-case Term    Uncat. Name
enliven            Enliven
bush               Bush
wired              Wired

Unambiguous cases – no merging

Upper-case Term    Lower-case Term    Uncat. Name    Name is variant of
Finds              finds              ----           ----
Loss               loss               ----           ----
Allied             ----               Allied         ----
Microsoft          ----               Microsoft      Microsoft Corp.
N.Y.               ----               ----           New York

Ambiguous cases – merge if only name or only lower-case term found in corpus

• Single occurrences of single capitalized terms can be merged with occurrences of corresponding names if names occur more than once in at least one document

• No evaluation was performed

Upper-case Term    Uncat. Name    # Docs    # occurrences within doc
Find               Find           2         1
Please             Please         8         1
Met                Met            5         2-3
Sun                Sun            12        1-3
Apple              Apple          203       1-46

Event CDC

• Bagga and Baldwin, 1999
  – similar approach to entity-based CDC
• Two events are coreferent iff the players, time, and location are the same
• Event CDC system extracts as “summaries” sentences which contain:
  – main event verb (for example: resign)
  – nominalization of main verb (for example: resignation)
  – synonyms (for example: quit)
• Summaries are clustered using single-link clustering and VSM similarity

Evaluation and Results

• Articles chosen for 3 events: resignations, elections, and espionage
  – 2 years of New York Times data
• B-CUBED algorithm used for scoring

Event           # docs    F-Measure    Precision    Recall    F-Measure (2-pass algorithm)
resignations    219       84           95           75        84
elections       135       43           50           37        45
espionage       184       76           79           74        81

Analysis

• Events are harder than entities:
  – no within-document coreference
  – no explicit references
  – are at times spread over the entire document
• Analysis of Elections event
  – elections are temporal in nature
    • disambiguating phrases largely use temporal references (for example: upcoming fall elections, elections last year, next elections, etc.)
    • exposes weakness of using a bag of words approach
  – presence of sub-events
    • US General election consists of both Presidential elections and Congressional elections
  – “players” are the same due to high rate of incumbency
  – descriptions of events are very similar
    • issues in every election are similar (inflation, unemployment, economy)

Cross-Media Coreference – Between Text and Video (Names)

• Satoh and Kanade, 1997
• Association of face and name in video
  – given unknown face, infer name or,
  – given name, guess faces which are likely to have that name
• Use closed caption transcripts and video images for correlation
• Face extraction: neural-network based face detector to locate faces in images
• Name candidate extraction: use Oxford Text Archive dictionary (approx. 70k words)
  – Word is considered to be a proper noun if
    • annotated as one in dictionary
    • not found in dictionary
• Face similarity: eigenvector based method to compute distance between two faces
• Face and name co-occurrence: use co-occurrence factor
  – captures how well name and face co-occur in time
• No large scale evaluation done
• Problem with technique: false positives
  – especially for famous people
  – Clinton mentioned by news anchor repeatedly
  – name gets associated with news anchor

Between Text and Pictures (Names)

• Bagga and Hu, unpublished (2004)
• Algorithm
  – Use text and image based features to identify coreference
  – Tested on web pages
    • Text narrowed by extracting sentences containing name variants of entity
    • Image features computed by analyzing distribution of colors in L*a*b perceptual color space
  – Across URLs, first compute text similarity (VSM) and image similarity (L*a*b) and then combine

Preliminary Results

Maps related to Captain John Smith’s explorations

Portraits of Captain John Smith

Captain John Smith as portrayed in the movie Pocahontas

Cross-Media Coreference

• Goal: identify and track “important” news events in broadcast news video
• Observations:
  – “important” stories of the day are repeated within/across stations
  – common footage scenes can be used as representative clips for these stories

Structure of Broadcast News

[Figure: Structure of Broadcast News. Labels: News; Story seg. 1-3; Commercial Segment 1; Story 1-2; Scene 1-7; images, sound, and closed caption tracks]

Methodology

• For each video source, use closed caption text:
  – to identify segment boundaries (>> signs indicate speaker change)
  – identify and eliminate commercial segments (based upon text-tiling method)
  – cluster story segments into stories
• Use complete link, hierarchical clustering to identify overlapping stories between programs
  – identify common footage scenes between each pair of overlapping stories

Common Footage Detection

[Figure: Common Footage Detection. For an overlapping story, scenes from video source 1 and video source 2 supply key frames (visual similarity) and text (text similarity), which feed combined-media clustering to produce common footages]

Examples – Found by System

News conference on Iraqi bombing

CBS 4257

CBS 2829 CBS 3873 NBC 3885 NBC 5061

CBS 13833 NBC 16317

CBS 38805 NBC 20805

Flood rescue -> rescue school bus

US submarine -> US submarine incident

CBS 4125 NBC 7377

Topic: US/Iraq -> US bombing of Iraq.

More Examples

Same stories and similar key-frame images, but not really identical footage.

CBS 2253 NBC 4173

CBS 2001 NBC 3177

Night at Baghdad -> night bombing at Iraq.

Iraqi map

CBS 5193 NBC 30021

UN cars -> UN inspectors leaving Iraq

Found by algorithm, but missed by human subjects

CBS 501 CBS 13305 NBC 16977

US submarine incident. Missed because weak text link and image intensity change.

Missed by system

False positive:

Death of Dale Earnhardt

Results

• System achieves on average 71% recall, 37% precision
  – 4 test sets
  – each set consisted of 2 thirty-minute news programs from CBS and NBC (same day)
• Majority of false positives occur due to presence of studio scenes
• If studio scenes are eliminated from results (when stories are the same)
  – precision increases to 87%

Cross-Language CDC: Parallel Corpus

• Harabagiu and Maiorano, 2000
• Use parallel corpus English and Romanian
  – Romanian obtained by manually translating MUC-6 and MUC-7 corpora
• Within-document coreference system run within each language
• Parallelism used to improve coreference in each language by using features/coreference chain information from the other
  – English precision increases from 84% to 87% while preserving recall
  – Romanian precision increases from 72% to 76% while preserving recall

Cross-Language CDC: Non-Parallel Corpus

• Bagga and Baldwin (unpublished)
• Algorithm evaluated manually on a small set of documents in English and Korean
  – for each document, extract sentences containing mentions of entity (name variants only) – “summary”
  – translate each summary from non-English language to English using a bilingual dictionary (word-for-word translation, without regard for sense)
  – Compare “approximate translations” with English summaries using VSM
• Initial results were promising with limited decline in F-Measure
• Identification of transliterated names is a major problem

Cross-Language CDC: Arabic Non-Parallel Corpus

• Sayeed, et al., 2009
• Based on Bagga and Baldwin, 1998
  – Use BBN’s Serif for computing Within-Document Coreference chains
  – for each document, extract windows of 50 words around mentions of entity – “summary”
  – One variation of the system tries to address the name transliteration problem by
    • a) translating the longest names in each document into English
    • b) correlating which ones are “similar” in English, and
    • c) attempting to find xdoc coreference between these discovered pairs
  – A baseline system identified xdoc coreference when longest matching names were exact matches
• Tested on 412 document set from ACE 2008 corpus
  – Baseline B-Cubed F-Measure = 40.6 (best F-Measure for task = 69)
  – F-Measure for System without name translation = 40.6
  – F-Measure for system with name translation = 41.3

Evaluation Methodologies

• MUC-6/7 algorithm – Vilain, et al., 1996
  – originally developed for within-document coreference
• B-CUBED
  – Bagga and Baldwin, 1998
• Clustering
  – Treat CDC as a clustering problem
• ACE – Automatic Content Extraction Program
  – developed for Entity Detection and Tracking (EDT) task (currently, used for within-document EDT only)

The MUC Scorer: Example

[Figure: truth chains compared with Response A and Response B]

MUC Scoring Algorithm

• Precision Error is determined by asking:
  – How many links must be added to truth (key) to have the same equivalence classes as the response?

• For recall error, reverse the roles above.
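A link-based scorer in the spirit of this description can be sketched as below. The function follows the commonly cited formulation in which each chain contributes (size of chain minus the number of pieces the other partition splits it into), divided by (size of chain minus 1). The function name is hypothetical, and the example chains in the usage note are an assumed configuration, since the figure with the actual chains did not survive extraction.

```python
def muc_score(key, response):
    """Link-based MUC score of `response` measured against `key`.

    Both arguments are lists of sets of mentions. As written this gives
    recall; swap the arguments to obtain precision.
    """
    numerator = denominator = 0
    for chain in key:
        # Count the pieces this chain is split into by the other partition.
        # A mention absent from the other partition counts as its own piece.
        pieces = set()
        for mention in chain:
            index = next((i for i, r in enumerate(response) if mention in r), None)
            pieces.add(index if index is not None else ("singleton", mention))
        numerator += len(chain) - len(pieces)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0
```

With an assumed truth of three chains over 12 mentions, {1..5}, {6, 7}, {8..12}, a response that wrongly merges the last two chains and a response that wrongly merges the first and last chains both score precision 9/10 and recall 9/9, even though the second error mixes up more mentions.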

Problem: All Errors are Equal

• For response A:
  – Precision = 9/10
  – Recall = 9/9
• For response B:
  – Precision = 9/10
  – Recall = 9/9
• Unintuitive results in the extreme cases
  – N = # of entities
  – m = # of chains (truth)
  – All entities in same chain: $P = \frac{N - m}{N - 1}$
  – P => 1, if N >> m

An Intuition for Scoring Differently

[Figure: truth chains compared with Response A, labeled “A Mistake!”, and Response B, labeled “A Bigger Mistake!!”]

B-CUBED Algorithm: An entity based approach

• For each entity, i:

$\text{Precision}_i = \frac{\text{\# of correct elements in output chain containing element}_i}{\text{\# of elements in output chain containing element}_i}$

$\text{Recall}_i = \frac{\text{\# of correct elements in output chain containing element}_i}{\text{\# of elements in the truth chain containing element}_i}$

$\text{Final Precision} = \sum_{i=1}^{N} \frac{1}{N} \cdot \text{Precision}_i$

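Example: Precision

Response A: $\frac{1}{12}\left[5 \cdot \frac{5}{5} + 2 \cdot \frac{2}{7} + 5 \cdot \frac{5}{7}\right] = \frac{16}{21}$ (76%)

Response B: $\frac{1}{12}\left[5 \cdot \frac{5}{10} + 2 \cdot \frac{2}{2} + 5 \cdot \frac{5}{10}\right] = \frac{7}{12}$ (58%)

Recall for both responses is 100%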
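The per-entity scoring can be checked with a short sketch. It assumes equal weights of 1/N, that every mention appears in exactly one truth chain and one response chain, and an example configuration of three truth chains over 12 mentions chosen to be consistent with the 76% and 58% figures above. The function name is hypothetical.

```python
def b_cubed(truth, response):
    """Return (precision, recall) averaged over all mentions with weight 1/N.

    `truth` and `response` are lists of sets; every mention is assumed to
    appear in exactly one chain of each.
    """
    truth_of = {m: chain for chain in truth for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    precision = recall = 0.0
    for m in truth_of:
        # Correct elements: mentions sharing both the truth and output chain of m.
        correct = len(truth_of[m] & resp_of[m])
        precision += correct / len(resp_of[m])
        recall += correct / len(truth_of[m])
    n = len(truth_of)
    return precision / n, recall / n


# Assumed example: truth chains of sizes 5, 2 and 5.
truth = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10, 11, 12}]
response_a = [{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10, 11, 12}]  # merges the 2-chain and a 5-chain
response_b = [{1, 2, 3, 4, 5, 8, 9, 10, 11, 12}, {6, 7}]  # merges the two 5-chains
```

Under these assumptions `b_cubed(truth, response_a)` gives a precision of 16/21 and `b_cubed(truth, response_b)` gives 7/12, with recall 1.0 in both cases, so the response that mixes up more mentions is penalized more heavily.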

ACE Scoring Algorithm

• Types of errors: miss and false alarm
• Score is calculated as a function of “cost”
• Cost depends on
  – entity type
    • person, organization, geo-political entity, location, and facility
  – entity level
    • name, nominal reference, and pronominal reference

• used for evaluation purposes only

• No published CDC evaluation using this algorithm

Type/Level    PER     ORG     GPE     LOC      FAC
NAM           1       0.5     0.25    0.1      0.05
NOM           0.2     0.1     0.05    0.02     0.01
PRO           0.04    0.02    0.01    0.004    0.002

The cost of a single miss or false alarm

$C_{EDT}(S) = \frac{\sum_{t}\sum_{l}\left\{C_{Miss}(t,l) \cdot N_{Miss}(t,l) + C_{FA}(t,l) \cdot N_{FA}(t,l)\right\}}{\sum_{t}\sum_{l}\left\{C_{Miss}(t,l) \cdot N_{Ref}(t,l)\right\}}$

where the sums run over entity type t and entity level l

• N_Ref = total number of reference entities in source, S
• Denominator is normalization factor = cost when no entities are output
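The normalized cost can be computed directly from the cost table. The sketch below assumes, as the table caption suggests, that a miss and a false alarm of the same type and level cost the same; the function name and the dictionary layout are hypothetical.

```python
TYPES = ["PER", "ORG", "GPE", "LOC", "FAC"]

# Cost of a single miss or false alarm, indexed by level and then type.
COST = {
    "NAM": dict(zip(TYPES, [1, 0.5, 0.25, 0.1, 0.05])),
    "NOM": dict(zip(TYPES, [0.2, 0.1, 0.05, 0.02, 0.01])),
    "PRO": dict(zip(TYPES, [0.04, 0.02, 0.01, 0.004, 0.002])),
}


def edt_cost(n_miss, n_fa, n_ref):
    """Normalized EDT cost for one source.

    Each argument maps (type, level) pairs to counts. The result is 1.0
    when every reference entity is missed and nothing is output.
    """
    def weighted(counts):
        return sum(COST[level][etype] * n for (etype, level), n in counts.items())

    return (weighted(n_miss) + weighted(n_fa)) / weighted(n_ref)
```

For instance, with 10 reference PER names and 4 reference ORG names, missing 2 PER names and raising 2 false-alarm ORG names costs (2 × 1 + 2 × 0.5) / (10 × 1 + 4 × 0.5) = 0.25. Lower is better, and the value can exceed 1 when false alarms are numerous.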

Applications

• IR, EDT, and TDT
• Name Matching Problem (Patman and Thomson, 2003)
  – When are different name strings potential references to the same entity? (Qaddafi, Gadafi, Gaddafi, Kaddafi, Qaddafy, etc.)
• Cross-Document IE and Information Fusion
  – increases chances of a pattern match
  – information may be more explicit in one or more articles
  – the set of articles may contain more information than any one
• Multi-Document Summarization
  – 2002 DUC evaluation – earthquakes
  – systems had difficulty distinguishing between earthquakes
• Question Answering
  – When was Kennedy born? – which Kennedy is being referred to?
• Link Analysis
  – linking entities is a first step towards identifying more complex relationships across documents

Conclusions

• CDC is a feasible task
  – context (text/images/video) around entity/event provides enough information to disambiguate
• Entity-based CDC
  – many different methods/models
  – performance over different, large corpora is consistently in mid 80s
• Other types of CDC
  – simple models/methods have been tried
  – plenty of opportunity to explore more sophisticated contextual models
• Evaluation Methodologies
  – several different ones exist; no consensus on best one
• Applications
  – time is ripe for integrating entity-based CDC in higher level applications