Cross Document Coreference


Kepa Joseba Rodríguez, Seminar on EXtreme Information Extraction

Rovereto, 25 March 2009


Outline

Background.

Intra-doc/cross-doc coreference tasks.
Overview of a system.

Unsupervised personal name disambiguation.

Generation of extraction patterns.

Algorithm of (Ravichandran & Hovy, 2002)

Generation of vectors and clustering.

Evaluation

Optional: Disambiguation of geographic names.

Optional: Clustering of news.



The task of CDC

Cross-document coreference occurs when the same person, place, event or concept is discussed in more than one text source (Bagga & Baldwin, 1998).



Intra-document vs. cross-document coreference

There are substantial differences between intra-document and cross-document coreference resolution.

Within a document there is a certain consistency that we cannot expect across documents.
Most underlying principles of linguistics and discourse context cannot be applied across documents.

There are some links between the two.

Resolving intra-document coreference helps in resolving cross-document coreference.
Resolving cross-document coreference can help in resolving intra-document coreference (Haghighi & Klein, 2007).







Unsupervised personal name disambiguation (1)

A personal name can refer to thousands of different entities in the real world.

Ex: for the name Jim Clark, Google returns 76,000 different web pages (Mann & Yarowsky, 2003):

1 Jim Clark  Race car driver from Scotland
2 Jim Clark  Clock-maker from Colorado
3 Jim Clark  Film editor
4 Jim Clark  Netscape founder
5 Jim Clark  Disaster survivor
6 Jim Clark  Car salesman in Kansas
... Jim Clark  ...

Each entry has features that may help to disambiguate the entity.




Earlier approaches to personal name disambiguation represent the context as vectors.

Instances with an identical name are distinguished by potentially indicative words.

Jim Clark - car
Jim Clark - film
Jim Clark - Netscape
Jim Clark - Colorado

For personal names, more precise information is available than for other kinds of entities.




Information extraction techniques can add categorial information such as:

Age/date of birth.
Nationality.
Profession.

Space of associated names. It can be used:

As a vector-based bag-of-words model.
With specific extracted types of association, such as:

family relationships: son, wife, married to...
employment relationships: manager of, etc.



Generation of extraction patterns

Patterns are automatically generated from data.

Good performance is possible without a parser or other language-specific resources.

Automatic generation is easier to apply to new languages.

Potentially higher precision and recall than hand-written patterns.



(R & H) algorithm for pattern extraction (1)

Select items for the query (e.g. +Mozart, +1756).

Search a document collection for documents that contain both terms.

Extract the sentences in which both terms occur.

Search for the longest matching substrings between the sentences. For the sentences:

The great composer Mozart (1756-1791) achieved fame at a young age.
Mozart (1756-1791) was a genius.
The whole world would always be indebted to the great music of Mozart (1756-1791).

the longest matching substring is “Mozart (1756-1791)”.
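
The matching step can be sketched as follows. This is a brute-force illustration only: an efficient implementation would more likely use a suffix tree, and the function name `longest_common_substring` is hypothetical, not taken from the slides.

```python
def longest_common_substring(sentences):
    """Return the longest string that occurs in every sentence (brute force)."""
    shortest = min(sentences, key=len)
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in sentence for sentence in sentences):
                return candidate
    return ""


sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
print(longest_common_substring(sentences))  # Mozart (1756-1791)
```

Replacing the query terms in the result with <NAME> and <ANSWER> would then give a candidate surface pattern.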



(R & H) algorithm for pattern extraction (2)

Repeat the same procedure for other terms like

+Newton +1642
+Gandhi +1869
...

For BIRTHDATE the algorithm produces this output:

born in <ANSWER>, <NAME>
<NAME> was born in <ANSWER>
<NAME> (<ANSWER> -
<NAME> (<ANSWER> -)
...



(R & H) algorithm to calculate precision (1)

Build a collection of documents that contain the question term (the name).

Query a search engine using only the question term.
Download the top 1000 web documents.

Extract the sentences that contain the question term.
For each extracted pattern, count its occurrences in these sentences in two ways:

Presence of the pattern with the <ANSWER> tag matched by any word (Ca), e.g. Mozart was born in <WORD>.
Presence of the pattern with the <ANSWER> tag matched by the correct term (Co), e.g. Mozart was born in 1756.

P = Co/Ca
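
A minimal sketch of the precision count P = Co/Ca for one pattern. It assumes that <ANSWER> stands for a single word-like token and that sentences are plain strings; the function name `pattern_precision` and the sample sentences are made up for illustration.

```python
import re


def pattern_precision(pattern, name, answer, sentences):
    """Return (Co, Ca, P) for one surface pattern over a list of sentences."""
    # Turn the surface pattern into a regex: literal text, <NAME> filled in,
    # and <ANSWER> matched by any single word-like token (an assumption).
    regex_text = re.escape(pattern).replace(re.escape("<NAME>"), re.escape(name))
    regex = re.compile(regex_text.replace(re.escape("<ANSWER>"), r"(\w+)"))
    ca = co = 0
    for sentence in sentences:
        for match in regex.finditer(sentence):
            ca += 1                      # pattern matched with any word
            if match.group(1) == answer:
                co += 1                  # pattern matched with the correct term
    return co, ca, (co / ca if ca else 0.0)


sample = [
    "Mozart was born in 1756.",
    "Mozart was born in Salzburg.",
    "Mozart was born in 1756 in Salzburg.",
    "Mozart died in 1791.",
]
co, ca, p = pattern_precision("<NAME> was born in <ANSWER>", "Mozart", "1756", sample)
```

On this toy sample the pattern matches three times and is correct twice, which illustrates why "was born in" scores lower than a pattern that only ever precedes a year.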


(R & H) algorithm to calculate precision (2)

Example: precision of the extracted patterns for BIRTHDATE.

1.0   <NAME> (<ANSWER> -)
0.85  <NAME> was born on <ANSWER>
0.6   <NAME> was born in <ANSWER>
0.59  <NAME> was born <ANSWER>
0.53  <ANSWER> <NAME> was born



Unsupervised Clustering

(Mann & Yarowsky, 2003)

Clustering method used: bottom-up centroid agglomerative clustering.

Each document is represented by a vector of automatically extracted features.

The two most similar vectors are merged to produce a new cluster.

The new cluster is represented by a vector equal to the centroid of the clustered vectors.
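
The merge loop above can be sketched as follows. This is a sketch under assumptions: vectors are dense lists, similarity is cosine (the slides name cosine explicitly only for the baseline), and the stopping criterion `target_clusters` and all function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agglomerate(vectors, target_clusters):
    """Merge the two most similar clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # document indices per cluster
    centroids = [list(v) for v in vectors]
    while len(clusters) > max(target_clusters, 1):
        # Find the pair of clusters whose centroids are most similar.
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: cosine(centroids[pair[0]], centroids[pair[1]]),
        )
        clusters[i] += clusters[j]
        # The merged cluster is represented by the centroid of its documents.
        centroids[i] = centroid([vectors[d] for d in clusters[i]])
        del clusters[j], centroids[j]
    return clusters


docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2]]
result = agglomerate(docs, 2)
```

The pairwise search is quadratic per merge, which is acceptable for a few hundred pages per name but would need an index or a priority queue for larger collections.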



Cluster refactoring

Unsupervised agglomerative clustering can lead to problems.

The most similar pages are clustered at the beginning of the process.
The less similar pages are added as stragglers at the top levels of the cluster tree.
The top-level clusters are less discriminative than the clusters at the bottom of the tree.

The refactoring:
Clustering is stopped when a given percentage of the documents has been clustered and the clusters have reached a given size.
The remaining documents are assigned to the closest cluster according to the distance measure.
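
The final assignment step might look like the sketch below. It assumes that "closest" means highest cosine similarity to a cluster centroid; the slides do not fix the distance measure, and `assign_stragglers` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_stragglers(cluster_centroids, stragglers):
    """Map each leftover document vector to the index of its most similar centroid."""
    return [
        max(range(len(cluster_centroids)), key=lambda k: cosine(cluster_centroids[k], doc))
        for doc in stragglers
    ]


seed_centroids = [[0.95, 0.05, 0.0], [0.0, 0.95, 0.1]]
leftover = [[0.5, 0.1, 0.0], [0.1, 0.4, 0.3]]
assignment = assign_stragglers(seed_centroids, leftover)
```

Because stragglers are attached without updating the centroids, the seed clusters stay as discriminative as they were when clustering stopped.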



Methods for vector generation

Baseline

Techniques of selective term weighting.

Term Frequency / Inverse Document Frequency (tf-idf).
Mutual Information (mi).

Biographical features (feat)

Extended biographical features (extfeat)

Cluster refactoring.



Baseline

The term vectors are composed of only proper nouns.

The similarity between vectors is computed using standard cosine similarity.

\[ \cos(a, b) = \frac{a \cdot b}{\|a\| \times \|b\|} \]



TF-IDF

Techniques of selective term weighting.

TF-IDF weight (Term Frequency - Inverse Document Frequency)

Measure used to evaluate how important a word is to a document in a collection.
The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection.

\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad \mathrm{idf}_i = \frac{|D|}{|\{d : t_i \in d\}|} \qquad \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \]
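
A small sketch of the weighting as written on this slide, with idf taken as the plain ratio of collection size to document frequency. Many implementations take the logarithm of that ratio instead; that variant is not shown here. The function name `tfidf` and the toy documents are illustrative.

```python
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append(
            {term: (n / total) * (n_docs / doc_freq[term]) for term, n in counts.items()}
        )
    return weights


toy_docs = [["clark", "netscape", "netscape"], ["clark", "race", "car"]]
toy_weights = tfidf(toy_docs)
```

In the toy collection "clark" occurs in every document and so gets no boost, while "netscape" is both frequent in its document and rare in the collection.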



Mutual Information

Mutual Information: measure used to evaluate the mutual dependence between random variables.

Given a document collection c, for each word w we compute

\[ I(w; c) = \frac{p(w \mid c)}{p(w)} \]

We selected words that

appear more than 20 times in the collection
have an I(w; c) > 10

These words are added to the document's feature vector with a weight equal to log(I(w; c)).
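
The selection rule can be sketched as below. The slides do not say how p(w) is estimated; this sketch assumes a separate background corpus, and the names `mi_weights`, `min_count` and `min_ratio` are hypothetical.

```python
import math
from collections import Counter


def mi_weights(collection_tokens, background_tokens, min_count=20, min_ratio=10.0):
    """Return {word: log(I(w; c))} for words passing both selection thresholds."""
    coll = Counter(collection_tokens)
    back = Counter(background_tokens)
    n_coll, n_back = sum(coll.values()), sum(back.values())
    weights = {}
    for word, count in coll.items():
        if count <= min_count or back[word] == 0:
            continue  # too rare in the collection, or p(w) cannot be estimated
        ratio = (count / n_coll) / (back[word] / n_back)  # p(w|c) / p(w)
        if ratio > min_ratio:
            weights[word] = math.log(ratio)
    return weights


collection = ["trumpeter"] * 30 + ["the"] * 70
background = ["trumpeter"] * 1 + ["the"] * 999
selected = mi_weights(collection, background)
```

Words unseen in the background corpus are skipped here rather than smoothed, which is a simplification.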



Extracted biographical features (feat)

Use of biographical features extracted with the algorithm of (Ravichandran & Hovy, 2002).

Biographical information is used to link documents: documents which contain similar extracted features have the same referent.

The extracted biographical features also help disambiguation: documents with different extracted features belong to different clusters.



Extracted biographical features (feat)

Type         Extracted feature

birth place  Midland (4), Texas (3), Alton (1), Illinois (1)
birth year   1926 (9), 1967 (3), 1973 (2), 1947 (1), 1958 (1), 1969 (1)
occupation   actor (11), trumpeter (9), heavyweight (2), ...
spouse       Demi Moore (1)

Table: feat features extracted for the Davis/Harrelson pseudoname



Extended biographical features (extfeat)

In this method the system gives a higher weight to words that have appeared as pattern fillers.

Example:

The system recognises 1756 as a birth year using surface patterns.
When 1756 is then found in a context outside an extraction pattern, it is given a higher weight and added to the document vector as a potential biographical feature.

In the experiment this was applied to words that appear more than a threshold of 4 times.

The value of the weight is the log of the number of times the word was found as an extracted feature.



word        w(mi)  w(extfeat)

adderley    3.50   0
snipes      5.16   0
coltrane    5.06   0
bitches     4.99   0
danson      4.97   0
hemp        4.97   0
mullally    4.95   0
porgy       4.94   0
remastered  4.92   0
actor       3.50   2.40
1926        0      2.20
trumpeter   0      2.20
midland     0      1.39

Table: the 10 words with the highest mutual information with the document collection and all extfeat words for the Davis/Harrelson pseudoname
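
The extfeat weights in the table are consistent with the natural log of the extraction counts in the feat table (actor 11, 1926 and trumpeter 9, midland 4). A sketch that reproduces them, assuming the threshold is inclusive because midland, extracted 4 times, is listed; the name `extfeat_weights` is hypothetical.

```python
import math


def extfeat_weights(feature_counts, min_count=4):
    """Weight each sufficiently frequent extracted word by log(count).

    min_count is treated as inclusive here, which is an assumption read off
    the table rather than something the slides state.
    """
    return {
        word: math.log(count)
        for word, count in feature_counts.items()
        if count >= min_count
    }


# Counts taken from the feat table for the Davis/Harrelson pseudoname.
counts = {"actor": 11, "1926": 9, "trumpeter": 9, "midland": 4, "texas": 3}
rounded = {word: round(w, 2) for word, w in extfeat_weights(counts).items()}
print(rounded)  # {'actor': 2.4, '1926': 2.2, 'trumpeter': 2.2, 'midland': 1.39}
```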



Experiments: the data set

The data set consisted of web pages collected using Google for a set of target personal names.

No more than 1000 pages for each target name.
No requirement that the web page be focused on the name.
No minimum number of occurrences of the name in the page.



Evaluation on pseudonames

Pseudonames created as follows:

Take retrieval results for two different people.
Replace all references to each name with a unique shared pseudoname.

The resulting collection consists of documents which are ambiguous as to whom they are talking about.

The aim of the clustering is to separate the two people behind the introduced pseudoname.
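
A minimal sketch of how such a test collection could be built. It only replaces exact full-name matches, whereas the procedure above replaces all references to each name; the people, texts and the function name `make_pseudoname_corpus` are invented for illustration.

```python
import re


def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b, pseudoname):
    """Replace both real names with one shared pseudoname; keep gold labels."""
    pattern = re.compile("|".join(re.escape(n) for n in (name_a, name_b)))
    corpus = []
    for label, docs in ((name_a, docs_a), (name_b, docs_b)):
        for text in docs:
            # The gold label records which person the page was retrieved for.
            corpus.append((pattern.sub(lambda m: pseudoname, text), label))
    return corpus


corpus = make_pseudoname_corpus(
    ["Alice Smith won the race."], "Alice Smith",
    ["Bob Jones edited the film."], "Bob Jones",
    "Smith/Jones",
)
```

The gold labels come for free from the retrieval step, so clustering accuracy can be measured without manual annotation.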




Select a set of 8 different people:
Historical figures.
Figures from media and pop culture.
Non-famous people with similar backgrounds (birth date, profession, etc.).

Submit Google queries and retrieve up to 1000 pages about each person.

Select a maximum of 100 pages for each person.

Evaluation of two granularities of feature extraction:

Use high-precision rules to extract occupation, birthday, spouse, birth location and school.
Use high-recall rules to extract the same terms and add parent/child relationships.




Method       Accuracy

nnp          79.7
nnp + tfidf  79.7
nnp + mi     82.9

Table: Disambiguation accuracy of different clustering methods




Extracted features    Feature set size
                      small   large

nnp+feat              82.5    85.1
nnp+feat+extfeat      82.0    84.6
nnp+feat+mi           85.6    85.3
nnp+feat+tfidf        82.9    86.4

Table: Disambiguation accuracy of different clustering methods and different sizes of feature sets



Evaluation on naturally ambiguous names

Start with a selection of 4 polysemous names with an average of 60 different instances for each of them.

Manual annotation with name-ID numbers.

The occurrences of each name should be classified into 3 clusters:

The 2 automatically derived first-pass majority seed sets.
The residual set for “other uses”.

Weighting method    Precision  Recall

TF-IDF              .81        .70
Mutual Information  .88        .73



Conclusions

The results of the clustering are improved by:

Learning and using automatically extracted biographic information.
The use of weighting techniques.

The produced clusters can be used as seeds for disambiguating further entities.



Disambiguating geographic names in a digital library



Outline

Task of the Perseus project.

Problems of the task domain.

External knowledge sources.

Identification and classification of proper names.

First disambiguation of geographical names.

Simple characterisation of the document context.

Final disambiguation.



Task of the Perseus project

Perseus Project (Smith & Crane, 2002): a library with historical data in the humanities, from ancient Greece to 19th-century America.

Over a million toponym references.

The task consists of:

Identifying geographic names.
Linking the names to information about location, type, dates of occupation, relation to other places, inhabitants, etc.
Linking the names to a position on a map.



Problems of the domain

Introducing an entity with an unambiguous mention is less common than in newspaper articles.

There are great differences between the documents, such as:

Different document sizes.
Lack of standard structures.
Different registers and dialects.
Historical variations: borders, names associated with different political systems, etc.

Long-distance anaphora.

The resolution process is more similar to cross-document coreference resolution on the web than in corpora.



Knowledge sources

The system uses external knowledge sources. The most important are:

Getty Thesaurus of Geographic Names.

Cruchley's gazetteer of London, built for geocoding.

Lists of authors of the entries in the Dictionary of National Biography, which help to add information to the documents.



Identification and classification of proper-names

Identifying the proper names and classifying them for the first time is done using simple heuristics.

Capitalisation and punctuation conventions.

Markup added by the editor of the document.

Language-specific honorifics (Mr., Dr., etc).

Generic topographic labels are taken as “moderate” evidence that the name may be geographic.

Rocky Mountains
Charles River

Stand-alone names are preferably classified as personal names.

John (personal name vs. village in Louisiana or Virginia)



Disambiguation (1)

Based on local context.

Explicit disambiguating tags put after the names, e.g. “Lancaster, PA”, “Vienna, Austria”, post code, etc.

If an ambiguous place name is mentioned together with other place names, its most likely interpretation is the one that is geographically near the others.
e.g. if “Philadelphia” and “Harrisburg” appear in the same paragraph, the preferred interpretation of “Lancaster” will be the town in Pennsylvania, and not the town in England or Arizona.



Disambiguation (2)

Based on document context.

Preponderance of geographic references in the entire document.
For short documents, like newspaper articles, document context and local context are treated as the same.

Based on world knowledge.

Captured from gazetteers and other reference works.
Facts about a place like political coordinates, size, etc.



Simple characterisation of the document context

Aggregate all of the possible locations for all the toponyms in the document onto a one-by-one degree grid.

Assign weights for the number of mentions of each toponym.

Prune the grid based on general world knowledge.

Compute the centroid of this weighted map.

Compute the standard deviation of the distance of the points from this centroid.

Discard points more than two times the standard deviation away from the centroid.

Calculate a new centroid.
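
The centroid and pruning steps can be sketched as follows. The sketch assumes plain Euclidean distance in degrees and takes the weighted root-mean-square distance from the centroid as the "standard deviation"; neither choice is stated in the slides, and the function names and sample cells are illustrative.

```python
import math


def weighted_centroid(points):
    """points: (lat, lon, weight) grid cells. Returns the weighted mean (lat, lon)."""
    total = sum(w for _, _, w in points)
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon


def prune_and_recenter(points, k=2.0):
    """Drop cells farther than k deviations from the centroid, then recompute it."""
    c_lat, c_lon = weighted_centroid(points)
    dists = [math.hypot(la - c_lat, lo - c_lon) for la, lo, _ in points]
    total = sum(w for _, _, w in points)
    # Weighted root-mean-square distance, used here as the spread measure.
    sd = math.sqrt(sum(w * d * d for d, (_, _, w) in zip(dists, points)) / total)
    kept = [p for p, d in zip(points, dists) if d <= k * sd]
    return weighted_centroid(kept), kept


# Three cells in Pennsylvania (weights = mention counts) and one in England.
cells = [(40, -76, 3), (40, -75, 2), (40, -77, 2), (54, -3, 1)]
center, kept = prune_and_recenter(cells)
```

In this toy document the English reading lies more than two deviations from the first centroid, so it is discarded and the new centroid falls inside the Pennsylvania cluster.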



Final disambiguation.

The local context of a toponym is represented by a moving window of the four previous and four following toponyms in the text.

Only unambiguous or already disambiguated toponyms are considered.

Each possible interpretation of the ambiguous toponym is scored using:

Geographical proximity to the toponyms around it.
Proximity to the centroid for the document.
Relative importance.

The interpretation that achieves the highest score is selected.



Evaluation (1)

The system was evaluated using 5 hand-annotated corpora.

Corpus         PCat  Prec  Rec   F1

Greek          0.98  0.93  0.99  0.96
Roman          0.99  0.91  1.00  0.95
London         0.92  0.86  0.96  0.91
California     0.92  0.83  0.96  0.89
Upper Midwest  0.89  0.74  0.89  0.81



Evaluation (2)

Categorisation performs better on the Greek and Roman history texts than on texts about more recent topics.

In places with a high population density we found more toponyms that are ambiguous with other names.

Mistakes occur where ethnonyms are used as geo-political entities (like “The Germans” in the text Cæsar's Gallic War).

Proper names are usually not inflected in English.
We can add rules by hand to correct this, but the precision of the system could decrease.



Conclusions

Simple heuristic categorisation seems to work properly for entities that appear in certain kinds of texts.

The evaluation procedure is not very clear.

Some cases are not covered properly by the gazetteers, but the use of huge fine-grained gazetteers leads to higher recall at the cost of lower precision.

An alternative is the use of linguistic processing and machine learning techniques for restricted cases and collections of documents.



NewsExplorer: multilingual coreference resolution



NewsExplorer

NewsExplorer (Steinberger & Pouliquen, 2008) is an application that gathers and aggregates extracted information for 19 languages.

Each entity is displayed on a dedicated web page.

For each entity the user gets:

List of the latest news clusters in which the entity has been mentioned.
List of other entities found in the same clusters.
Titles and other phrases describing the entity.
Quotations by the entity or about it.
Photograph if available.
Wikipedia page about the entity if available.



Text analysis components of the system (1)

Monolingual document clustering.

Named entity recognition.

Person.
Organisation.
Geographical location.

Named entity disambiguation.

Quotation recognition and reference resolution for name parts.

Identification and mapping of name variants for the same person.

Topic detection and tracking.



Text analysis components of the system (2)

Categorisation of documents according to a multilingual thesaurus.

Cluster similarity calculation:

monolingual.
across languages.



Language independent rules for geo-tagging

Use of document context:

If a name can be a personal name or the name of a place, and it has been mentioned as a person earlier, then the preferred reading is that it is a person.
If a country has been mentioned in the text and a polysemous item then appears, resolve the ambiguity in favour of a place in the mentioned country.
Prefer locations that are physically closer to other, unambiguous locations that have been mentioned in the context.



Language independent rules for geo-tagging

In case of polysemy, the most important places are preferred.

Ignore places that cannot be disambiguated.

Combine the rules giving different weights.



Inflection and regular variations (1)

Hyphen/space alternations (Jean-Marie / Jean Marie).

Diacritic variations (Schröder / Schroder).

Name inversion: change of position between first and last name.

Typos: relatively frequent in names like Condoleezza Rice, often written as Condoleza, Condolezza, etc.

Simplification: Condoleezza Rice and George W. Bush are frequently simplified as Ms. Rice and President Bush.



Inflection and regular variations (2)

Morphological declensions: use of prefixes and suffixes in several languages.

Transliteration from other alphabets:

there is no one-to-one mapping between letters.
there are different conventions.

Vowel variations, especially in transliterations from and into Arabic.



Identification of name variants

Some of these variants can be predicted and generated using sets of regular expressions, e.g. the declension of personal names in Slovene:

s/[aeo]?/(e|a|o|u|om|em|m|ju|jem|ja)?/
For every frequent name in the database a pattern is generated, like
Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?

For cases that cannot be resolved by the regular expressions:

Normalise the names, translating them to a language-independent representation.
Compute the edit distance between the name variant and the normalised names.
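
Both steps can be sketched in a few lines. The sketch assumes the substitution applies to an optional final vowel of the name (the rule above is not anchored explicitly) and uses lowercasing as a stand-in for the language-independent normalisation, which the slides do not specify. The names `variant_pattern` and `edit_distance` are hypothetical.

```python
import re

SUFFIXES = "(e|a|o|u|om|em|m|ju|jem|ja)?"


def variant_pattern(name):
    """Strip an optional final a/e/o and append the optional suffix group."""
    stem = re.sub(r"[aeo]?$", "", name)
    return re.compile(re.escape(stem) + SUFFIXES)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


pierre = variant_pattern("Pierre")
gemayel = variant_pattern("Gemayel")
distance = edit_distance("Condoleezza".lower(), "Condoleza".lower())
```

Using `fullmatch` with these patterns accepts inflected forms such as "Pierrom" while rejecting unrelated words that merely share the stem.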



Doc. categorisation with multilingual thesaurus (1)

Eurovoc Thesaurus: hierarchically organised controlled vocabulary developed by European institutions and national parliaments of different countries.

It is used in public administrations for cataloguing, search and retrieval of large multilingual collections.

The thesaurus consists of 6000 descriptors organised into 21 fields and, at the second level, into 127 micro-thesauri.



Doc. categorisation with multilingual thesaurus (2)

For each descriptor, NewsExplorer produces a ranked set of words statistically related to it.

These sets of words were produced from a large number of hand-annotated documents, by comparing the word frequencies of the subset of texts indexed with each descriptor with the word frequencies of the whole training corpus.

This model is completed with a list of stop words, so that irrelevant words do not affect the categorisation task.



Thanks



References (1)

Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.

Haghighi, A. and Klein, D. (2007). Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Mann, G.S. and Yarowsky, D. (2003). Unsupervised Personal Name Disambiguation. In Proceedings of CoNLL.



References (2)

Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Smith, D.A. and Crane, G. (2002). Disambiguating geographic names in a historical digital library. In Proceedings of ECDL.

Steinberger, R. and Pouliquen, B. (2008). NewsExplorer - combining various text analysis tools to allow multilingual news linking and exploration. Lecture notes for the lecture held at the SORIA Summer School “Cursos de Tecnologías Lingüísticas”.


