
Cross Document Coreference


"Cross Document Coreference" - Slides at the Seminar on EXtreme Information Extraction. 25. March 2009. -- University of Trento. Italy.


Cross document coreference

Kepa Joseba Rodríguez

Seminar on EXtreme Information Extraction

Rovereto, 25. March 2009


Outline

Background.

Intra-doc/cross-doc coreference tasks.

Overview of a system.

Unsupervised personal name disambiguation.

Generation of extraction patterns.

Algorithm of (Ravichandran & Hovy, 2002)

Generation of vectors and clustering.

Evaluation

Optional: Disambiguation of geographic names.

Optional: Clustering of news.


The task of CDC

Cross document coreference occurs when the same person, place, event or concept is discussed in more than one text source (Bagga & Baldwin, 1998).


Intra-document vs. cross-document coreference

There are substantial differences between intra-document and cross-document coreference resolution.

- Within a document there is a certain consistency that we cannot expect across documents.
- Most underlying principles of linguistics and discourse context cannot be applied across documents.

There are some links between both.

- The resolution of intra-document coreference helps in the resolution of cross-document coreference.
- The resolution of cross-document coreference can help in the resolution of intra-document coreference (Haghighi & Klein, 2007).


Unsupervised personal name disambiguation (1)

A personal name can refer to thousands of different entities in the real world.

Example: for the name Jim Clark, Google returns 76,000 different web pages (Mann & Yarowsky, 2003):

1 Jim Clark    Race car driver from Scotland
2 Jim Clark    Clock-maker from Colorado
3 Jim Clark    Film editor
4 Jim Clark    Netscape founder
5 Jim Clark    Disaster survivor
6 Jim Clark    Car salesman in Kansas
... Jim Clark  ...

Each entry has features that may help to disambiguate the entity.


Unsupervised personal name disambiguation (2)

Earlier approaches to personal name disambiguation represent the context of each mention as a vector.

Instances with an identical name are distinguished by potentially indicative words.

- Jim Clark - car
- Jim Clark - film
- Jim Clark - Netscape
- Jim Clark - Colorado

For personal names, more precise information is available than for other kinds of entities.


Unsupervised personal name disambiguation (3)

Information extraction techniques can add categorical information such as:

- Age/date of birth.
- Nationality.
- Profession.

Space of associated names. It can be used:

- As a vector-based bag-of-words model.
- With extracted specific types of association, such as:
  - family relationships: son, wife, married to...
  - employment relationships: manager of, etc.


Generation of extraction patterns

Patterns are automatically generated from data.

Good performance is possible without a parser or other language-specific resources.

Automatically generated patterns are easier to apply to new languages.

Potentially higher precision and recall than hand-written patterns.


(R & H) algorithm for pattern extraction (1)

Select items for the query (e.g. +Mozart, +1756).

Search a document collection for documents that contain both terms.

Extract the sentences in which both terms occur.

Search for the longest matches between sentences. For the sentences:

- The great composer Mozart (1756-1791) achieved fame at a young age.
- Mozart (1756-1791) was a genius.
- The whole world would always be indebted to the great music of Mozart (1756-1791).

the longest matching substring is “Mozart (1756-1791)”.
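The matching step can be illustrated with a small sketch. It is a brute-force search written for readability, not the suffix-tree method a real implementation would likely use; the function name `longest_common_substring` is a hypothetical helper introduced here.

```python
def longest_common_substring(sentences):
    """Return the longest string that occurs in every sentence (brute force)."""
    if not sentences:
        return ""
    shortest = min(sentences, key=len)
    # Try candidate lengths from longest to shortest and stop at the first hit.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in sentences):
                return candidate
    return ""


sentences = [
    "The great composer Mozart (1756-1791) achieved fame at a young age.",
    "Mozart (1756-1791) was a genius.",
    "The whole world would always be indebted to the great music of Mozart (1756-1791).",
]
match = longest_common_substring(sentences)
print(match)  # Mozart (1756-1791)
```

The brute-force search is quadratic in the length of the shortest sentence, which is acceptable for single sentences but would not scale to the whole collection.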


(R & H) algorithm for pattern extraction (2)

Repeat the same procedure for other terms like

- +Newton +1642
- +Gandhi +1869
- ...

For BIRTHDATE the algorithm produces this output:

- born in <ANSWER>, <NAME>
- <NAME> was born in <ANSWER>
- <NAME> (<ANSWER> -
- <NAME> (<ANSWER> -)
- ...
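How a matched phrase turns into a pattern with tags can be sketched as follows. This is a minimal illustration that assumes plain string substitution of the query terms; `to_pattern` is a hypothetical helper, and the example phrase is made up for the demonstration.

```python
def to_pattern(phrase, name, answer):
    """Replace the question term and the answer term by their tags."""
    return phrase.replace(name, "<NAME>").replace(answer, "<ANSWER>")


pattern = to_pattern("Gandhi was born in 1869", "Gandhi", "1869")
print(pattern)  # <NAME> was born in <ANSWER>
```

Once phrases from many name/answer pairs are generalised this way, identical patterns from different pairs can be counted together.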


(R & H) algorithm to calculate precision (1)

Build a collection of documents that contain the question term (the name):

- Query a search engine using only the question term.
- Download the top 1000 web documents.

Extract the sentences that contain the question term.

For each extracted pattern, count in the extracted sentences:

- Ca: occurrences of the pattern with the <ANSWER> tag matched by any word, e.g. Mozart was born in <WORD>.
- Co: occurrences of the pattern with the <ANSWER> tag matched by the correct term, e.g. Mozart was born in 1756.

P = Co / Ca
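The precision computation can be sketched with a regular expression. This is a simplified illustration: it assumes the pattern contains exactly one <ANSWER> tag and that the answer is a single word; `pattern_precision` and the example sentences are hypothetical.

```python
import re


def pattern_precision(pattern, name, answer, sentences):
    """Return Co / Ca for one pattern over the extracted sentences."""
    # Instantiate <NAME>, then let <ANSWER> match any single word.
    prefix, suffix = pattern.replace("<NAME>", name).split("<ANSWER>")
    regex = re.compile(re.escape(prefix) + r"(\w+)" + re.escape(suffix))
    c_a = c_o = 0
    for sentence in sentences:
        for found in regex.findall(sentence):
            c_a += 1                  # pattern matched by any word
            if found == answer:
                c_o += 1              # pattern matched by the correct term
    return c_o / c_a if c_a else 0.0


sentences = [
    "Mozart was born in 1756.",
    "Mozart was born in Salzburg.",
    "Mozart was born in 1756 in Salzburg.",
    "Mozart wrote operas.",
]
precision = pattern_precision("<NAME> was born in <ANSWER>", "Mozart", "1756", sentences)
```

Here the pattern matches three times and twice with the correct year, so the sketch yields a precision of 2/3.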


(R & H) algorithm to calculate precision (2)

Example: precision of the extracted patterns for BIRTHDATE.

1.0   <NAME> (<ANSWER> -)
0.85  <NAME> was born on <ANSWER>
0.6   <NAME> was born in <ANSWER>
0.59  <NAME> was born <ANSWER>
0.53  <ANSWER> <NAME> was born


Unsupervised Clustering

(Mann & Yarowsky, 2003)

Clustering method used: bottom-up centroid agglomerative clustering.

Each document is represented by a vector of automatically extracted features.

The two most similar vectors are merged to produce a new cluster.

The new cluster is represented by a vector equal to the centroid of the clustered vectors.
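A minimal sketch of bottom-up centroid clustering follows. It assumes dense vectors of equal length, cosine similarity between centroids, and a fixed target number of clusters as the stopping criterion; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agglomerate(doc_vectors, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    # Each cluster is a pair (member indices, centroid vector).
    clusters = [([i], list(v)) for i, v in enumerate(doc_vectors)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        members = clusters[i][0] + clusters[j][0]
        merged = (members, centroid([doc_vectors[m] for m in members]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
result = agglomerate(docs, n_clusters=2)
```

On the toy data the first two and the last two vectors end up in separate clusters. The pairwise search is cubic overall, which is fine for the at most 1000 pages per name used in the experiments but would need an index for larger collections.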


Cluster refactoring

Unsupervised agglomerative clustering can lead to problems.

- The most similar pages are clustered at the beginning of the process.
- The less similar pages are added as stragglers at the top levels of the cluster tree.
- The top-level clusters are less discriminative than the clusters at the bottom of the tree.

The refactoring:

- Clustering is stopped when a given percentage of the documents have been clustered and the clusters have reached a given size.
- The rest of the documents are assigned to the closest cluster.


Methods for vector generation

Baseline

Techniques of selective term weighting.

- Term Frequency / Inverse Document Frequency (tf-idf).
- Mutual Information (mi).

Biographical features (feat)

Extended biographical features (extfeat)

Cluster refactoring.


Baseline

The term vectors are composed of only proper nouns.

The similarity between vectors is computed using standard cosine similarity.

\cos(a, b) = \frac{a \cdot b}{\|a\| \times \|b\|}
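For the proper-noun baseline, the cosine can be computed directly on sparse term-count vectors. The sketch below assumes each document has already been reduced to its proper nouns; the two toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of two sparse vectors given as term -> weight mappings."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = Counter(["Netscape", "Silicon", "Netscape"])
doc2 = Counter(["Netscape", "Scotland"])
similarity = cosine_similarity(doc1, doc2)
```

Only the shared term "Netscape" contributes to the dot product, giving 2 / sqrt(10) for this pair.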


TF-IDF

Techniques of selective term weighting.

TF-IDF weight (Term Frequency - Inverse Document Frequency)

- A measure of how important a word is to a document in a collection.
- The importance increases proportionally to the number of times a word appears in the document, but it is offset by the frequency of the word in the collection.

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

idf_i = \frac{|D|}{|\{d : t_i \in d\}|}

tfidf_{i,j} = tf_{i,j} \times idf_i
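A short sketch of the weighting follows. The idf on this slide is written as a plain ratio; since the logarithm of that ratio is the more common definition, the sketch exposes both through a hypothetical `use_log` flag. Function name and toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents, use_log=True):
    """Return one term -> weight mapping per tokenised document."""
    n_docs = len(documents)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        row = {}
        for term, n in counts.items():
            tf = n / total
            idf = n_docs / doc_freq[term]
            if use_log:
                idf = math.log(idf)
            row[term] = tf * idf
        weights.append(row)
    return weights


docs = [["clark", "race", "car"], ["clark", "film", "film"]]
weights = tf_idf(docs)
```

With the logarithm, a term that occurs in every document (here "clark") gets weight zero, which is the desired behaviour for the shared target name.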


Mutual Information

Mutual Information: a measure of the mutual dependence between random variables.

Given a document collection c, for each word w we compute

I(w; c) = \frac{p(w \mid c)}{p(w)}

We selected words that

- appear more than 20 times in the collection
- have I(w; c) > 10

These words are added to the document’s feature vector with a weight equal to log(I(w; c)).
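The selection rule can be sketched as below. The slide does not say how p(w) is estimated, so the sketch assumes a background probability table from some general corpus; `mi_features`, `background_prob` and the toy numbers are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def mi_features(collection_tokens, background_prob, min_count=20, min_ratio=10.0):
    """Select words with count > min_count and I(w; c) > min_ratio."""
    counts = Counter(collection_tokens)
    total = sum(counts.values())
    features = {}
    for word, n in counts.items():
        p_w = background_prob.get(word)
        if n <= min_count or not p_w:
            continue
        ratio = (n / total) / p_w        # p(w | c) / p(w)
        if ratio > min_ratio:
            features[word] = math.log(ratio)
    return features


tokens = ["trumpeter"] * 30 + ["the"] * 70
background = {"trumpeter": 0.001, "the": 0.5}
features = mi_features(tokens, background)
```

In the toy collection only "trumpeter" passes both thresholds; "the" is frequent but no more frequent than in the background.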


Extracted biographical features (feat)

Use of biographical features extracted with the algorithm of (Ravichandran & Hovy, 2002).

Biographical information is used to link documents: documents that contain similar extracted features are taken to have the same referent.

The extracted biographical features also help disambiguation: documents with different extracted features are assigned to different clusters.


Extracted biographical features (feat)

Type         Extracted feature

birth place  Midland (4), Texas (3), Alton (1), Illinois (1)
birth year   1926 (9), 1967 (3), 1973 (2), 1947 (1), 1958 (1), 1969 (1)
occupation   actor (11), trumpeter (9), heavyweight (2), ...
spouse       Demi Moore (1)

Table: feat features extracted for the Davis/Harrelson pseudoname


Extended biographical features (extfeat)

In this method the system gives a higher weight to words that have appeared as fillers of extraction patterns.

Example:

- The system recognises 1756 as a birth year using surface patterns.
- When 1756 is then found in a context outside an extraction pattern, it is given a higher weight and added to the document vector as a potential biographical feature.

In the experiment this was applied to words that appear as extracted features more than a threshold of 4 times.

The value of the weight is the log of the number of times the word was found as an extracted feature.
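The weighting can be checked against the counts in the Davis/Harrelson feat table. The sketch assumes the natural logarithm and treats the threshold as "at least 4", because midland (4 extractions) carries an extfeat weight in the reported results while texas (3) does not; both points are assumptions, as is the helper name.

```python
import math


def extfeat_weights(extracted_counts, threshold=4):
    """Weight = log(count) for words extracted at least `threshold` times."""
    return {word: math.log(n) for word, n in extracted_counts.items() if n >= threshold}


counts = {"actor": 11, "trumpeter": 9, "1926": 9, "midland": 4, "texas": 3}
weights = extfeat_weights(counts)
for word, weight in weights.items():
    print(word, round(weight, 2))
```

Rounded to two decimals this gives 2.4 for actor, 2.2 for trumpeter and 1926, and 1.39 for midland.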


word        w(mi)  w(extfeat)

adderley    3.50   0
snipes      5.16   0
coltrane    5.06   0
bitches     4.99   0
danson      4.97   0
hemp        4.97   0
mullally    4.95   0
porgy       4.94   0
remastered  4.92   0
actor       3.50   2.40
1926        0      2.20
trumpeter   0      2.20
midland     0      1.39

Table: the 10 words with the highest mutual information with the document collection and all extfeat words for the Davis/Harrelson pseudoname


Experiments: the data set

The data set consisted of web pages collected using Google for a set of target personal names.

- No more than 1000 pages for each target name.
- No requirement that the web page be focused on the name.
- No minimum number of occurrences of the name in the page.


Evaluation on pseudonames

Pseudonames created as follows:

- Take the retrieval results for two different people.
- Replace all references to each name by a single shared pseudoname.

The resulting collection consists of documents that are ambiguous as to whom they are talking about.

The aim of the clustering is to separate the two people behind the introduced pseudoname.
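The construction of such a test collection can be sketched as follows. The people, texts and pseudoname are invented for illustration, and the sketch only replaces exact full-name matches, whereas replacing all references would also require handling partial names.

```python
import re


def make_pseudoname_corpus(docs_a, name_a, docs_b, name_b, pseudoname):
    """Merge two document sets, hiding both names behind one pseudoname.

    Returns (text, gold_label) pairs; the label is kept for evaluation only.
    """
    pattern = re.compile("|".join(re.escape(n) for n in (name_a, name_b)))
    corpus = []
    for label, docs in ((name_a, docs_a), (name_b, docs_b)):
        for text in docs:
            corpus.append((pattern.sub(pseudoname, text), label))
    return corpus


corpus = make_pseudoname_corpus(
    ["Alice Smith won the race."], "Alice Smith",
    ["Bob Jones edited the film."], "Bob Jones",
    "Smith/Jones",
)
```

Because the gold label is known by construction, clustering accuracy can be computed without manual annotation.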


Evaluation on pseudonames

Select a set of 8 different people:

- Historical figures.
- Figures from media and pop culture.
- Non-famous people with similar background (birth date, profession, etc.).

Submit Google queries and retrieve up to 1000 pages about each person.

Select a maximum of 100 pages for each person.

Evaluate two granularities of feature extraction:

- High-precision rules to extract occupation, birthday, spouse, birth location and school.
- High-recall rules to extract the same terms plus parent/child relationships.


Evaluation on pseudonames

Method        Accuracy

nnp           79.7
nnp + tfidf   79.7
nnp + mi      82.9

Table: Disambiguation accuracy of different clustering methods


Evaluation on pseudonames

                     extracted feature set size
Method               small   large

nnp+feat             82.5    85.1
nnp+feat+extfeat     82.0    84.6
nnp+feat+mi          85.6    85.3
nnp+feat+tfidf       82.9    86.4

Table: Disambiguation accuracy of different clustering methods and different sizes of feature sets


Evaluation on naturally ambiguous names

Start with a selection of 4 polysemous names with an average of 60 different instances for each of them.

Manual annotation with name-ID numbers.

The occurrences of each name should be classified into 3 clusters:

- The 2 automatically derived first-pass majority seed sets.
- The residual set for “other uses”.

Weighting method     Precision  Recall

TF-IDF               .81        .70
Mutual Information   .88        .73


Conclusions

The results of the clustering are improved by:

- Learning and using automatically extracted biographic information.
- The use of weighting techniques.

The produced clusters can be used as seeds for disambiguating further entities.


Disambiguating geographic names in a digital library


Outline

Task of the Perseus project.

Problems of the task domain.

External knowledge sources.

Identification and classification of proper names.

First disambiguation of geographical names.

Simple characterisation of the document context.

Final disambiguation.


Task of the Perseus project

Perseus Project (Smith & Crane, 2002): a library with historical data in the humanities, from ancient Greece to 19th-century America.

Over a million toponym references.

The task consists of:

- Identifying geographic names.
- Linking the names to information about location, type, dates of occupation, relation to other places, inhabitants, etc.
- Linking the names to a position on a map.


Problems of the domain

Introducing an entity with an unambiguous mention is less common than in newspaper articles.

There are great differences between the documents, such as:

- Different document sizes.
- Lack of standard structures.
- Different registers and dialects.
- Historical variations: borders, names associated with different political systems, etc.

Long-distance anaphora.

The resolution process is more similar to cross-document coreference resolution on the web than in corpora.


Knowledge sources

The system uses external knowledge sources. The most important are:

Getty Thesaurus of Geographic Names.

Cruchley’s gazetteer of London, which was built for geocoding.

Lists of authors of the entries in the Dictionary of National Biography, which help to add information to the documents.


Identification and classification of proper-names

Proper names are identified and given a first classification using simple heuristics.

Capitalisation and punctuation conventions.

Markup added by the editor of the document.

Language-specific honorifics (Mr., Dr., etc.).

Generic topographic labels are taken as “moderate” evidence that the name may be geographic.

- Rocky Mountains
- Charles River

Stand-alone names are preferably classified as personal names.

John (personal name vs. village in Louisiana or Virginia)


Disambiguation (1)

Based on local context.

- Explicit disambiguating tags put after the names, e.g. “Lancaster, PA”, “Vienna, Austria”, post code, etc.
- If an ambiguous place name is mentioned together with other place names, the most likely interpretation is the one that is geographically near the others. E.g. if “Philadelphia” and “Harrisburg” appear in the same paragraph, the preferred interpretation of “Lancaster” will be the town in Pennsylvania, and not the town in England or Arizona.


Disambiguation (2)

Based on document context.

- Preponderance of geographic references in the entire document.
- For short documents, like newspaper articles, document context and local context are considered to be the same.

Based on world knowledge.

- Captured from gazetteers and other reference works.
- Facts about a place like political coordinates, size, etc.


Simple characterisation of the document context

Aggregate all of the possible locations for all the toponyms in the document onto a one-by-one degree grid.

Assign weights according to the number of mentions of each toponym.

Prune the grid based on general world knowledge.

Compute the centroid of this weighted map.

Compute the standard deviation of the distance of the points from this centroid.

Discard points more than two times the standard deviation away from the centroid.

Calculate a new centroid.
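The centroid steps can be sketched as below. The sketch makes several simplifying assumptions: it skips the world-knowledge pruning, treats latitude and longitude as planar coordinates, and takes the standard deviation to be the weighted root-mean-square distance from the centroid. The grid cells and weights are illustrative only.

```python
import math


def weighted_centroid(points):
    """points: list of (lat, lon, weight) grid cells."""
    total = sum(w for _, _, w in points)
    lat = sum(la * w for la, _, w in points) / total
    lon = sum(lo * w for _, lo, w in points) / total
    return lat, lon


def document_centroid(points, k=2.0):
    """Centroid after discarding cells more than k standard deviations away."""
    c = weighted_centroid(points)
    total = sum(w for _, _, w in points)
    dists = [math.hypot(la - c[0], lo - c[1]) for la, lo, _ in points]
    sd = math.sqrt(sum(w * d * d for d, (_, _, w) in zip(dists, points)) / total)
    kept = [p for p, d in zip(points, dists) if d <= k * sd]
    return weighted_centroid(kept) if kept else c


cells = [
    (40, -75, 5),   # e.g. five mentions resolved near Philadelphia
    (40, -77, 3),
    (40, -76, 2),
    (54, -3, 1),    # a single candidate in England
]
lat, lon = document_centroid(cells)
```

In this toy document the single cell in England lies more than two standard deviations from the first centroid, so the recomputed centroid falls among the Pennsylvania cells at (40.0, -75.8).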


Final disambiguation.

The local context of a toponym is represented by a moving window of the four previous and four following toponyms in the text.

Only unambiguous or already disambiguated toponyms are considered.

Each possible interpretation of the ambiguous toponym is scored using:

- Geographical proximity to the toponyms around it.
- Proximity to the centroid of the document.
- Relative importance.

The interpretation that achieves the highest score is selected.
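A possible shape for this scoring is sketched below. The slide names the three criteria but not how they are combined, so the proximity function, the equal weights, the importance values and the approximate coordinates are all assumptions made for the illustration.

```python
import math


def proximity(a, b):
    """Map planar distance in degrees to a score in (0, 1]; closer is higher."""
    return 1.0 / (1.0 + math.hypot(a[0] - b[0], a[1] - b[1]))


def score(candidate, context, doc_centroid, weights=(1.0, 1.0, 1.0)):
    coords = candidate["coords"]
    # Mean proximity to the already resolved toponyms in the window.
    local = sum(proximity(coords, c) for c in context) / len(context) if context else 0.0
    w_local, w_doc, w_imp = weights
    return (w_local * local
            + w_doc * proximity(coords, doc_centroid)
            + w_imp * candidate["importance"])


def disambiguate(candidates, context, doc_centroid):
    return max(candidates, key=lambda c: score(c, context, doc_centroid))


candidates = [
    {"name": "Lancaster, PA", "coords": (40.0, -76.3), "importance": 0.3},
    {"name": "Lancaster, England", "coords": (54.0, -2.8), "importance": 0.5},
]
context = [(40.0, -75.2), (40.3, -76.9)]   # resolved toponyms in the window
best = disambiguate(candidates, context, doc_centroid=(40.0, -75.8))
```

Even though the English town is given a higher importance here, the two proximity terms favour the Pennsylvania reading.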


Evaluation (1)

The system was evaluated on 5 hand-annotated corpora.

Corpus          PCat   Prec   Rec    F1

Greek           0.98   0.93   0.99   0.96
Roman           0.99   0.91   1.00   0.95
London          0.92   0.86   0.96   0.91
California      0.92   0.83   0.96   0.89
Upper Midwest   0.89   0.74   0.89   0.81
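The F1 column can be reproduced from the precision and recall columns, assuming F1 is the usual harmonic mean of the two. The helper name `f1` is introduced here for the check.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per corpus, taken from the table.
rows = {
    "Greek": (0.93, 0.99, 0.96),
    "Roman": (0.91, 1.00, 0.95),
    "London": (0.86, 0.96, 0.91),
    "California": (0.83, 0.96, 0.89),
    "Upper Midwest": (0.74, 0.89, 0.81),
}
for corpus, (p, r, reported) in rows.items():
    print(corpus, round(f1(p, r), 2), reported)
```

All five recomputed values agree with the reported ones after rounding to two decimals.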


Evaluation (2)

Categorisation performs better on the Greek and Roman history texts than on texts about more recent topics.

In places with a high population density we found more toponyms that are ambiguous with other names.

Mistakes occur where ethnonyms are used as geo-political entities (like “The Germans” in Cæsar’s Gallic War).

- Proper names are usually not inflected in English.
- We can add rules by hand to correct this, but the precision of the system could decrease.


Conclusions

Simple heuristic categorisation seems to work properly for entities that appear in certain kinds of texts.

The evaluation procedure is not very clear.

Some cases are not covered properly by the gazetteers, but the use of huge fine-grained gazetteers leads to higher recall at the cost of lower precision.

An alternative is the use of linguistic processing and machine learning techniques for restricted cases and collections of documents.


NewsExplorer: multilingual coreference resolution


NewsExplorer

NewsExplorer (Steinberger & Pouliquen, 2008) is an application that gathers and aggregates extracted information for 19 languages.

Each entity is displayed on a dedicated web page.

For each entity the user gets:

- A list of the latest news clusters in which the entity has been mentioned.
- A list of other entities found in the same clusters.
- Titles and other phrases describing the entity.
- Quotations by or about the entity.
- A photograph, if available.
- The Wikipedia page about the entity, if available.


Text analysis components of the system (1)

Monolingual document clustering.

Named entity recognition.

- Person.
- Organisation.
- Geographical location.

Named entity disambiguation.

Quotation recognition and reference resolution for name parts.

Identification and mapping of name variants for the same person.

Topic detection and tracking.


Text analysis components of the system (2)

Categorisation of documents according to a multilingual thesaurus.

Cluster similarity calculation:

- monolingual.
- across languages.


Language independent rules for geo-tagging

Use of document context:

- If a name can be a personal name or a place name and it has been mentioned as a person earlier, the preferred reading is that it is a person.
- If a country has been mentioned in the text and a polysemous item then appears, resolve the ambiguity in favour of a place in that country.
- Prefer locations that are physically closer to other, unambiguous locations that have been mentioned in the context.


Language independent rules for geo-tagging

In case of polysemy, the most important places are preferred.

Ignore places that cannot be disambiguated.

Combine the rules giving different weights.


Inflection and regular variations (1)

Hyphen/space alternations (Jean-Marie / Jean Marie).

Diacritic variations (Schröder / Schroder).

Name inversion: change of position between first and last name.

Typos: relatively frequent in names like Condoleezza Rice, often written as Condoleza, Condolezza, etc.

Simplification: Condoleezza Rice and George W. Bush are frequently simplified as Ms. Rice and President Bush.


Inflection and regular variations (2)

Morphological declensions: use of prefixes and suffixes in several languages.

Transliteration from other alphabets:

- there is no one-to-one mapping between letters.
- there are different conventions.

Vowel variations, especially in transliterations from and into Arabic.


Identification of name variants

Some of these variants can be predicted and generated using sets of regular expressions, e.g. the declension of personal names in Slovene:

- s/[aeo]?/(e|a|o|u|om|em|m|ju|jem|ja)?/
- For every frequent name in the database a pattern is generated, like
  Pierr(e|a|o|u|om|em|m|ju|jem|ja)?
  Gemayel(e|a|o|u|om|em|m|ju|jem|ja)?

For cases that cannot be resolved by the regular expressions:

- Normalise the names, translating them to a language-independent representation.
- Compute the edit distance between the name variant and the normalised names.
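Both steps can be sketched in a few lines. The sketch assumes the substitution is meant to apply at the end of the name, and it uses plain Levenshtein distance on lower-cased strings as a stand-in for the comparison of normalised names; the normalisation itself is not specified here, and the helper names are hypothetical.

```python
import re

SUFFIXES = "(e|a|o|u|om|em|m|ju|jem|ja)?"


def variant_pattern(name):
    """Strip an optional final a/e/o and allow any of the listed suffixes."""
    stem = re.sub(r"[aeo]?$", "", name, count=1)
    return re.compile(re.escape(stem) + SUFFIXES)


def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


pierre = variant_pattern("Pierre")
print(bool(pierre.fullmatch("Pierrom")))            # True
print(bool(pierre.fullmatch("Pierrot")))            # False
print(edit_distance("condoleezza", "condoleza"))    # 2
```

A small edit distance relative to the name length can then be used to propose that a variant belongs to a known name; the acceptance threshold is left open.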


Doc. categorisation with multilingual thesaurus (1)

Eurovoc Thesaurus: hierarchically organised controlled vocabulary developed by European institutions and national parliaments of different countries.

It is used in public administrations for cataloguing, search and retrieval of large multilingual collections.

The thesaurus consists of 6000 descriptors organised in 21 fields and, at the second level, into 127 micro-thesauri.


Doc. categorisation with multilingual thesaurus (2)

For each descriptor, NewsExplorer produces a ranked set of words statistically related to it.

These sets of words were produced from a large number of hand-annotated documents, by comparing the word frequencies in the subset of texts indexed with each descriptor with the word frequencies in the whole training corpus.

This model is complemented by a list of stop words so that irrelevant words do not affect the categorisation task.


Thanks


References (1)

Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics.

Haghighi, A. and Klein, D. (2007). Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Mann, G.S. and Yarowsky, D. (2003). Unsupervised Personal Name Disambiguation. In Proceedings of CoNLL.


References (2)

Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Smith, D.A. and Crane, G. (2002). Disambiguating geographic names in a historical digital library. In Proceedings of ECDL.

Steinberger, R. and Pouliquen, B. (2008). NewsExplorer - combining various text analysis tools to allow multilingual news linking and exploration. Lecture notes for the lecture held at the SORIA Summer School “Cursos de Tecnologías Lingüísticas”.
