Resources for linguistically motivated Multilingual Anaphora Resolution

Kepa Joseba Rodríguez

Advisor: Massimo Poesio

18 January 2011


Outline

1 Motivation of the research

2 Contributions of this dissertation

3 Limitations of previous annotation schemes

4 Annotation scheme proposal

5 Annotated data

6 Usability of the data for anaphora resolution

7 Use of the data

8 Conclusions



Motivation

Linguistic research: cross-linguistic studies about anaphora (Poesio et al., 2004)

Applications: summarization (Steinberger et al., 2007)

Applications: machine translation

1 German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.

2 English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.

3 English (Google Translate): Peter gave Mary flowers to his casting. Then she let them dry up.

4 English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.



Contributions

Development of a linguistically motivated annotation scheme for anaphoric relations.

Implementation of the scheme for manual annotation of English and Italian data.

Creation of annotated data for English and Italian.

Use of the corpora for feature extraction and development of anaphora resolution systems in English and Italian.

Participation of the systems in SemEval 2010.



Limitations of previous schemes (1)

Coverage of the annotation.

Annotation of reference.

Identification and annotation of discontinuity of semantic material.

Problem of multiple interpretations: ambiguity.




Coverage of the annotation:

Annotated relations: only identity

ACE-like annotation schemes constrain the annotation to noun phrases from a list of semantic types.

Genres: most annotation schemes focus the annotation on a few genres.




Annotation of reference

Expletives: they are not considered.

There are two people waiting for the interview.

Predication:

MUC, ACE: no distinction between predication and identity relation.

OntoNotes: no semantic criteria to decide which noun phrase is referring and which is a predicate.

[The president of the bank] is [John Smith].

[John Smith] is [the president of the bank].

Coordination: coordinated items are considered referring expressions in corpora like MUC or OntoNotes.

[Milosevic or anyone else]

Nominals and proper names in premodifier position.



Identification of discontinuous semantic material.

Bill and Hillary Clinton

black cars and bikes

Multiple interpretations are not captured

[The house] is on [a long street]. [It] is very dirty.



Annotation scheme

Annotation of all noun phrases

Distinction between referring and non-referring expressions

Annotation of clitics attached to the verb and empty pronouns

Introduction of ambiguity

Introduction of discontinuous markables

Annotation of different kinds of relations: identity, discourse deixis and bridging.



Reference

Markables are classified as referring or non-referring

Non-referring markables are annotated with the type of non-referring expression

Referring markables are annotated with:

Information status: new or old.

Semantic type



Reference

Types of non-referring expressions

Expletives

[There] are two people waiting for the interview

The new car is [there]

Predicate: semantic criteria to distinguish predicate and referring expression.

[Il presidente della Repubblica, [Giorgio Napolitano]] (English: the President of the Republic, Giorgio Napolitano)

[The president of the bank] is [John Smith].

[John Smith] is [the president of the bank].

Quantifiers: [All of [the box cars]]

Coordination.

Idiomatic expressions

by [the nape of [the neck]]


Semantic types

1 Person

2 Animate

3 Organization

4 Facility

5 Geopolitical entity (GPE)

6 Location

7 Temporal

8 Numerical

9 Concrete

10 Abstract

11 Event

12 Other

13 Unknown



Annotation of ambiguity

Not always a unique interpretation for a markable.

1 Be careful hooking up [the engine] to [the boxcar] because [it] is faulty.

2 [The house] is on [a long street]. [It] is very dirty.

In case of ambiguity, we tag the markable as ambiguous and we annotate the possible interpretations.

Other possible ambiguities are:

Information status: between new and old.

Old and not referring.



List of annotated features

Agreement features

Gender

Number

Person

Grammatical function

Reference and information status

Semantic type

Type of non-referring

Link to antecedent

Ambiguity

Bridging
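
The feature list above can be pictured as one record per markable. The following is only a minimal sketch and not the storage format of ARRAU or LMC: the class name `Markable`, its field names, the token-span representation and the toy values are all assumptions made for illustration. Discontinuous markables are modelled as several token spans, and ambiguity as more than one candidate antecedent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Markable:
    """Hypothetical record for one annotated noun phrase (not the corpus format)."""
    markable_id: str
    # One (start, end) token span per fragment, end exclusive; more than one
    # span means the markable is discontinuous.
    spans: List[Tuple[int, int]]
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    referring: bool = True
    information_status: Optional[str] = None  # "new" or "old"
    semantic_type: Optional[str] = None       # e.g. "person", "concrete"
    non_referring_type: Optional[str] = None  # e.g. "expletive", "predicate"
    # Ids of candidate antecedents; more than one id marks an ambiguous anaphor.
    antecedents: List[str] = field(default_factory=list)
    bridging_anchor: Optional[str] = None

    @property
    def is_discontinuous(self) -> bool:
        return len(self.spans) > 1

    @property
    def is_ambiguous(self) -> bool:
        return len(self.antecedents) > 1


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("m1", [(0, 2)], information_status="new", semantic_type="concrete")
street = Markable("m2", [(4, 7)], information_status="new", semantic_type="location")
it = Markable("m3", [(8, 9)], information_status="old", antecedents=["m1", "m2"])

# "Bill and Hillary Clinton": the markable "Bill ... Clinton" skips "and Hillary".
bill = Markable("m4", [(0, 1), (3, 4)], semantic_type="person")

print(it.is_ambiguous, bill.is_discontinuous)
```

Keeping all candidate antecedents in one list lets an ambiguous pronoun such as [It] be stored without forcing the annotator to pick a single reading.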



Description of the annotated data

ARRAU (English)

Wall Street Journal texts

Trains dialogues

Gnome corpus

Pear stories

Live Memories Corpus for Italian (LMC)

Wikipedia sites

Blog sites

VENEX dataset



Description: English corpus

WSJ dataset

205 files

147,600 words in 5,585 sentences. 47,900 markables.

1% discontinuous markables, 12.6% non-referring.

Trains dialogues

35 files

26,000 words in 4,600 sentences. 5,200 markables.

GNOME corpus

5 files

21,600 words in 1,000 sentences. 6,100 markables.

PEAR stories

20 files

14,000 words in 2,000 sentences. 3,900 markables.



Description: Italian corpus

Wikipedia dataset:

144 files.

140,000 words in 4,700 sentences. 44,500 markables.

0.5% discontinuous markables, 0.5% clitics attached to the verb, 4.5% empty subjects.

13.7% non-referring.

Blogs dataset:

75 files.

53,000 words in 2,230 sentences.

16,000 markables.

VENEX corpus:

30 files

20,300 words in 720 sentences

6,220 markables



Reliability of the annotation – ARRAU

Previous study on the annotation of anaphoric links published by Poesio and Artstein (2008)

Metric: Krippendorff's α

α = 0.6-0.7

Statistics reflect the complexity of the task.



Reliability of the annotation – LMC

Metric: Siegel and Castellan's κ

Information status and reference: old, new and non-referring

κ = 0.80

Basic annotation of the markable: new, phrase antecedent, segment antecedent, predicate, quantifier, expletive, coordination and idiom.

κ = 0.79

Main disagreement between discourse new and predicate

Semantic type

κ = 0.85



Reliability of the annotation – LMC

Link to the antecedent

κ = 0.88

Antecedent of clitics

κ = 0.84

Antecedent of empty pronouns

κ = 0.93
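
To make the κ values above easier to read, here is a minimal sketch of a chance-corrected agreement coefficient of this family. It assumes that expected agreement is estimated from the label distribution pooled over all coders, which is the formulation usually attributed to Siegel and Castellan. The function name `pooled_kappa` and the toy labels are hypothetical and have no connection to the LMC figures.

```python
from collections import Counter
from typing import Sequence


def pooled_kappa(items: Sequence[Sequence[str]]) -> float:
    """Chance-corrected agreement; items[i] holds the labels all coders gave to item i."""
    if not items:
        raise ValueError("at least one item is required")
    n_coders = len(items[0])
    if n_coders < 2 or any(len(labels) != n_coders for labels in items):
        raise ValueError("every item needs the same number (>= 2) of labels")

    pooled = Counter()   # label counts over all coders and items
    observed = 0.0
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        # Proportion of coder pairs that agree on this item.
        agreeing_pairs = sum(n * (n - 1) for n in counts.values())
        observed += agreeing_pairs / (n_coders * (n_coders - 1))
    observed /= len(items)

    total = len(items) * n_coders
    # Expected agreement if labels were drawn from the pooled distribution.
    expected = sum((n / total) ** 2 for n in pooled.values())
    if expected == 1.0:
        raise ValueError("only one label was used; kappa is undefined")
    return (observed - expected) / (1.0 - expected)


# Toy example: two coders, four markables.
toy = [
    ("old", "old"),
    ("new", "new"),
    ("old", "new"),
    ("non-referring", "non-referring"),
]
print(round(pooled_kappa(toy), 3))
```

A value of 1 means perfect agreement and 0 means agreement at chance level, so the reported values between 0.79 and 0.93 are far above chance.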



Use of the corpus for anaphora resolution (1)

Baseline proposed by Soon et al. (2001)

Classifier: MaxEnt

English data: ACE02, MUC-7 and ARRAU

Italian data: ICAB and LMC

Evaluation metrics:

MUC (Vilain et al., 1995)

CEAF (Luo, 2005)

Link-based evaluation
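
The baseline of Soon et al. (2001) is a mention-pair model. As that approach is usually described, each anaphor forms a positive training pair with its closest preceding coreferent mention, and a negative pair with every mention in between. The sketch below illustrates only this instance-creation step. Whether the systems in this work sample instances in exactly this way is not stated here, and the function name and data layout are assumptions.

```python
from typing import Dict, List, Optional, Tuple


def soon_training_pairs(
    mentions: List[str],
    entity_of: Dict[str, Optional[str]],
) -> List[Tuple[str, str, bool]]:
    """Build (antecedent, anaphor, is_coreferent) training pairs.

    mentions  -- mention ids in textual order
    entity_of -- gold entity id per mention, None if the mention is in no chain
    """
    pairs: List[Tuple[str, str, bool]] = []
    for j, anaphor in enumerate(mentions):
        entity = entity_of.get(anaphor)
        if entity is None:
            continue
        # Search backwards for the closest mention of the same entity.
        closest = None
        for i in range(j - 1, -1, -1):
            if entity_of.get(mentions[i]) == entity:
                closest = i
                break
        if closest is None:
            continue  # first mention of its entity: no training pair
        pairs.append((mentions[closest], anaphor, True))
        # Every mention between antecedent and anaphor is a negative example.
        for k in range(closest + 1, j):
            pairs.append((mentions[k], anaphor, False))
    return pairs


# Toy document: m1 and m4 corefer; m2 starts another entity; m3 is in no chain.
order = ["m1", "m2", "m3", "m4"]
gold = {"m1": "e1", "m2": "e2", "m3": None, "m4": "e1"}
for pair in soon_training_pairs(order, gold):
    print(pair)
```

A classifier such as MaxEnt is then trained on feature vectors computed for these pairs.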




English corpora: ARRAU, ACE, MUC

               ACE Carafe  MUC-7  ACE02  ARRAU
MUC            0.618       0.585  0.590  0.557
CEAF-AGGR Φ-3  0.537       0.379  0.393  0.683
CEAF-AGGR Φ-4  0.506       0.206  0.309  0.717
Link-based     0.638       0.594  0.532  0.540
Pronouns       0.686       0.492  0.597  0.558
Nominals       0.355       0.455  0.239  0.352
Names          0.638       0.817  0.784  0.763
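
The MUC score reported for these corpora is the link-based metric of Vilain et al. (1995). The sketch below shows how it is commonly computed: recall counts, for each gold entity, how many links survive once the entity is partitioned by the system output, and precision swaps the two roles. The function names are hypothetical, response entities are assumed to be disjoint, and this is not the scorer used to produce the figures here.

```python
from typing import List, Set, Tuple

Entity = Set[str]


def _link_recall(key: List[Entity], response: List[Entity]) -> float:
    """Share of links in `key` that are preserved by `response`."""
    numerator = 0
    denominator = 0
    for entity in key:
        if not entity:
            continue
        parts = set()    # response entities that overlap this key entity
        unaligned = 0    # mentions missing from the response count as singletons
        for mention in entity:
            for index, candidate in enumerate(response):
                if mention in candidate:
                    parts.add(index)
                    break
            else:
                unaligned += 1
        numerator += len(entity) - (len(parts) + unaligned)
        denominator += len(entity) - 1
    return numerator / denominator if denominator else 0.0


def muc(key: List[Entity], response: List[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of the MUC link-based score."""
    recall = _link_recall(key, response)
    precision = _link_recall(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One gold chain of four mentions, split into two chains by the system.
gold_chains = [{"A", "B", "C", "D"}]
system_chains = [{"A", "B"}, {"C", "D"}]
precision, recall, f1 = muc(gold_chains, system_chains)
print(precision, recall, f1)
```

Because the metric counts links only, singleton entities contribute nothing to it. This is one reason why MUC and CEAF can rank the same systems differently.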




Italian corpora: LMC, ICAB

               ICAB   LMC-Sys  LMC-Gold
MUC            0.494  0.456    0.619
CEAF-AGGR Φ-3  0.557  0.622    0.798
CEAF-AGGR Φ-4  0.560  0.671    0.869
Link-based     0.556  0.470    0.580
Pronouns       0.452  0.520    0.521
Nominals       0.421  0.303    0.522
Names          0.741  0.642    0.752




Use of C4 decision trees to compare the impact of individual features.

The impact of the baseline features is similar for English and Italian, with two exceptions:

The impact of gender matching is high in English, but has no effect for Italian.

The use of automatically computed aliases has a high impact for Italian and a low impact for English.
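
To make the two features named above concrete, the sketch below shows one possible form of a gender-matching feature and an alias feature for a mention pair. Both are simplified illustrations. The three-valued gender feature follows the usual mention-pair convention, while the alias heuristic (acronym or shared last token) is an assumption and not the alias computation used in these experiments.

```python
from typing import List, Optional


def gender_agreement(gender_a: Optional[str], gender_b: Optional[str]) -> str:
    """Three-valued feature: 'unknown' when either gender is missing."""
    if gender_a is None or gender_b is None:
        return "unknown"
    return "match" if gender_a == gender_b else "mismatch"


def _acronym(tokens: List[str]) -> str:
    # Initial letters of capitalised tokens, e.g. ["Live", "Memories"] -> "LM".
    return "".join(token[0] for token in tokens if token[0].isupper())


def is_alias(name_a: str, name_b: str) -> bool:
    """Hypothetical alias heuristic: acronym match or shared last token."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    if not tokens_a or not tokens_b:
        return False
    # Acronym: "International Business Machines" vs "IBM".
    if len(tokens_b) == 1 and len(tokens_a) > 1 and _acronym(tokens_a) == tokens_b[0]:
        return True
    if len(tokens_a) == 1 and len(tokens_b) > 1 and _acronym(tokens_b) == tokens_a[0]:
        return True
    # Shared last token: "Giorgio Napolitano" vs "Napolitano".
    return tokens_a[-1].lower() == tokens_b[-1].lower()


print(gender_agreement("fem", "masc"))
print(is_alias("Giorgio Napolitano", "Napolitano"))
print(is_alias("International Business Machines", "IBM"))
```

In a decision tree, the importance of such a feature can be read from how often and how high in the tree it is used as a split.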



Use of the data

5th International Workshop on Semantic Evaluations (SemEval 2010)

Task: Coreference Resolution in Multiple Languages.

Comparative research about zero-anaphora in Italian and Japanese

Training and evaluation of content extraction models in the Live Memories project.



Conclusions

Linguistically motivated annotation scheme applicable to English and Italian.

Scheme used to annotate different genres: newspapers, encyclopedic text, dialogue, narrative and weblogs.

Corpora are usable to build anaphora resolution models.

Datasets have been used for international competitions and for linguistic research.


