Resources for linguistically motivated Multilingual Anaphora Resolution

Kepa Joseba Rodríguez

Advisor: Massimo Poesio

18 January 2011


PhD defense presentation


Outline

1 Motivation of the research

2 Contributions of this dissertation

3 Limitations of previous annotation schemes

4 Annotation scheme proposal

5 Annotated data

6 Usability of the data for anaphora resolution

7 Use of the data

8 Conclusions


Motivation

Linguistic research: cross-linguistic studies about anaphora (Poesio et al. 2004)

Applications: summarization (Steinberger et al. 2007)

Applications: machine translation

1 German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.

2 English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.

3 English (Google Translate): Peter gave Mary flowers to his casting. Then she let them dry up.

4 English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.


Contributions

Development of a linguistically motivated annotation scheme for anaphoric relations.

Implementation of the scheme for manual annotation of English and Italian data.

Creation of annotated data for English and Italian.

Use of the corpora for feature extraction and development of anaphora resolution systems in English and Italian.

Participation of the systems in SemEval 2010.


Limitations of previous schemes (1)

Coverage of the annotation.

Annotation of reference.

Identification and annotation of discontinuity of semantic material.

Problem of multiple interpretations: ambiguity.


Limitations of previous schemes (2)

Coverage of the annotation:

Annotated relations: only identity

ACE-like annotation schemes constrain the annotation to noun phrases from a list of semantic types.

Genres: most annotation schemes focus on a few genres.


Limitations of previous schemes (3)

Annotation of reference

Expletives: they are not considered.

There are two people waiting for the interview.

Predication:

MUC, ACE: no distinction between predication and identity relation.

OntoNotes: no semantic criteria to decide which noun phrase is referring and which is a predicate.

[The president of the bank] is [John Smith].

[John Smith] is [the president of the bank].

Coordination: coordinated items are considered referring expressions in corpora like MUC or OntoNotes.

[Milosevic or anyone else]

Nominals and proper names in premodifier position.


Limitations of previous schemes (4)

Identification of discontinuous semantic material.

Bill and Hillary Clinton

black cars and bikes

Multiple interpretations are not captured

[The house] is on [a long street]. [It] is very dirty.


Annotation scheme

Annotation of all noun phrases

Distinction between referring and non-referring expressions

Annotation of clitics attached to the verb and empty pronouns

Introduction of ambiguity

Introduction of discontinuous markables

Annotation of different kinds of relations: identity, discourse deixis and bridging.


Reference

Markables are classified as referring or non-referring

Non-referring markables are annotated with the type of non-referring expression

Referring markables are annotated with:

Information status: new or old.

Semantic type


Reference

Types of non-referring expressions

Expletives

[There] are two people waiting for the interview

The new car is [there]

Predicate: semantic criteria to distinguish predicate and referring expression.

[Il presidente della Repubblica, [Giorgio Napolitano]]

[The president of the bank] is [John Smith].

[John Smith] is [the president of the bank].

Quantifiers:

[All of [the box cars]]

Coordination.

Idiomatic expressions

by [the nape of [the neck]]


Semantic types

1 Person
2 Animate
3 Organization
4 Facility
5 Geopolitical entity (GPE)
6 Location
7 Temporal
8 Numerical
9 Concrete
10 Abstract
11 Event
12 Other
13 Unknown


Annotation of ambiguity

Not always a unique interpretation for a markable.

1 Be careful hooking up [the engine] to [the boxcar] because [it] is faulty.

2 [The house] is on [a long street]. [It] is very dirty.

In case of ambiguity, we tag the markable as ambiguous and we annotate the possible interpretations.

Other possible ambiguities are:

Information status: between new and old.

Old and non-referring.


List of annotated features

Agreement features

Gender

Number

Person

Grammatical function

Reference and information status

Semantic type

Type of non-referring

Link to antecedent

Ambiguity

Bridging
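The feature list above can be pictured as one record per markable. The following is only a minimal sketch of such a record: the class name, field names, value strings and token offsets are illustrative assumptions, not the storage format actually used for ARRAU or LMC.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Category names follow the slides; the identifiers are hypothetical.
NON_REFERRING_TYPES = {"expletive", "predicate", "quantifier", "coordination", "idiom"}


@dataclass
class Markable:
    """One annotated noun phrase (illustrative layout, not the corpus format)."""
    markable_id: str
    # One (start, end) token span per fragment, end exclusive; more than
    # one fragment means the markable is discontinuous.
    spans: List[Tuple[int, int]]
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    referring: bool = True
    information_status: Optional[str] = None  # "new" or "old"
    semantic_type: Optional[str] = None
    non_referring_type: Optional[str] = None
    # More than one candidate antecedent marks the markable as ambiguous.
    antecedents: List[str] = field(default_factory=list)
    bridging_anchor: Optional[str] = None

    @property
    def discontinuous(self) -> bool:
        return len(self.spans) > 1

    @property
    def ambiguous(self) -> bool:
        return len(self.antecedents) > 1

    def validate(self) -> None:
        """Check that the referring flag and the non-referring type agree."""
        if self.referring and self.non_referring_type is not None:
            raise ValueError("referring markable cannot have a non-referring type")
        if not self.referring and self.non_referring_type not in NON_REFERRING_TYPES:
            raise ValueError("non-referring markable needs a known type")


# "[The house] is on [a long street]. [It] is very dirty."
# Token offsets are illustrative and ignore punctuation.
house = Markable("m1", [(0, 2)], information_status="new")
street = Markable("m2", [(4, 7)], information_status="new")
it = Markable("m3", [(7, 8)], information_status="old", antecedents=["m1", "m2"])

# "Bill ... Clinton" inside "Bill and Hillary Clinton": two fragments.
bill = Markable("m4", [(0, 1), (3, 4)])
```

Keeping the spans and the antecedents as lists means that discontinuity and ambiguity need no separate flags in this sketch; both fall out of the list lengths.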


Description of the annotated data

ARRAU (English)

Wall Street Journal texts

Trains dialogues

Gnome corpus

Pear stories

Live Memories Corpus for Italian (LMC)

Wikipedia sites

Blog sites

VENEX dataset


Description: English corpus

WSJ dataset
205 files
147,600 words in 5,585 sentences. 47,900 markables.
1% discontinuous markables, 12.6% non-referring.

Trains dialogues
35 files
26,000 words in 4,600 sentences. 5,200 markables.

GNOME corpus
5 files
21,600 words in 1,000 sentences. 6,100 markables.

PEAR stories
20 files
14,000 words in 2,000 sentences. 3,900 markables.


Description: Italian corpus

Wikipedia dataset:

144 files
140,000 words in 4,700 sentences. 44,500 markables.
0.5% discontinuous markables, 0.5% clitics attached to the verb, 4.5% empty subjects.
13.7% non-referring.

Blogs dataset:

75 files
53,000 words in 2,230 sentences. 16,000 markables.

VENEX corpus:

30 files
20,300 words in 720 sentences. 6,220 markables.


Reliability of the annotation – ARRAU

Previous study on the annotation of anaphoric links published by Poesio and Artstein (2008)

Metric: Krippendorff’s α

α = 0.6-0.7

Statistics reflect the complexity of the task.


Reliability of the annotation – LMC

Metric: Siegel and Castellan’s κ

Information status and reference: old, new and non-referring

κ = 0.80

Basic annotation of the markable: new, phrase antecedent, segment antecedent, predicate, quantifier, expletive, coordination and idiom.

κ = 0.79

Main disagreement: between discourse-new and predicate

Semantic type

κ = 0.85
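As a rough illustration of how a κ value of this kind is obtained, the sketch below computes a chance-corrected agreement coefficient whose expected agreement comes from the label distribution pooled over all coders. That pooled formulation is an assumption about the coefficient meant here (it is the form usually attributed to Siegel and Castellan, and the same form as Fleiss' multi-coder coefficient). The function name and the toy labels are hypothetical; none of the LMC annotation data is reproduced.

```python
from collections import Counter
from typing import Hashable, Sequence


def pooled_kappa(items: Sequence[Sequence[Hashable]]) -> float:
    """Chance-corrected agreement with expected agreement from pooled labels.

    items[i] holds the labels the coders assigned to item i; every item
    must be labelled by the same number of coders (at least two).
    """
    n_coders = len(items[0])
    if n_coders < 2 or any(len(labels) != n_coders for labels in items):
        raise ValueError("each item needs the same number (>= 2) of labels")

    pooled = Counter()
    observed = 0.0
    pairs_per_item = n_coders * (n_coders - 1)
    for labels in items:
        counts = Counter(labels)
        pooled.update(counts)
        # Proportion of agreeing ordered coder pairs on this item.
        observed += sum(c * (c - 1) for c in counts.values()) / pairs_per_item
    observed /= len(items)

    # Expected agreement: chance that two labels drawn from the pooled
    # distribution coincide.
    total = len(items) * n_coders
    expected = sum((c / total) ** 2 for c in pooled.values())
    if expected == 1.0:
        raise ValueError("all labels identical; coefficient undefined")
    return (observed - expected) / (1.0 - expected)


# Toy example: two coders, four markables, three categories.
toy = [
    ("old", "old"),
    ("new", "new"),
    ("old", "new"),
    ("non-referring", "non-referring"),
]
print(round(pooled_kappa(toy), 3))
```

On the toy data the observed agreement is 0.75 and the expected agreement is 22/64, so the coefficient is 13/21, about 0.619.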


Reliability of the annotation – LMC

Link to the antecedent

κ = 0.88

Antecedent of clitics

κ = 0.84

Antecedent of empty pronouns

κ = 0.93


Use of the corpus for anaphora resolution (1)

Baseline proposed by Soon et al. (2001)

Classifier: MaxEnt

English data: ACE02, MUC-7 and ARRAU

Italian data: ICAB and LMC

Evaluation metrics:

MUC (Vilain et al. 1995)

CEAF (Luo, 2005)

Link-based evaluation
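The Soon et al. (2001) baseline is a mention-pair model: a classifier decides for each pair of mentions whether they corefer. The sketch below shows only the training-instance selection usually associated with that model. Each anaphor is paired positively with its closest preceding mention from the same chain, and negatively with every mention in between. The function name and input layout are assumptions made for illustration; feature extraction and the MaxEnt classifier are left out.

```python
from typing import Dict, List, Optional, Tuple


def soon_training_pairs(
    mentions: List[str], chain_of: Dict[str, Optional[int]]
) -> List[Tuple[str, str, int]]:
    """Return (candidate antecedent, anaphor, label) triples.

    mentions: mention ids in document order.
    chain_of: mention id -> coreference chain id (None for singletons).
    """
    pairs: List[Tuple[str, str, int]] = []
    for j, anaphor in enumerate(mentions):
        chain = chain_of.get(anaphor)
        if chain is None:
            continue
        # Walk backwards to the closest preceding mention of the same chain.
        for i in range(j - 1, -1, -1):
            if chain_of.get(mentions[i]) == chain:
                pairs.append((mentions[i], anaphor, 1))
                # Every mention strictly between the two is a negative example.
                for k in range(i + 1, j):
                    pairs.append((mentions[k], anaphor, 0))
                break
    return pairs


# Toy document: m1 and m4 corefer, m2 is a singleton, m3 starts another chain.
example = soon_training_pairs(
    ["m1", "m2", "m3", "m4"],
    {"m1": 0, "m2": None, "m3": 1, "m4": 0},
)
print(example)
```

For the toy document this yields one positive pair (m1, m4) and two negative pairs (m2, m4) and (m3, m4).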


Use of the corpus for anaphora resolution (2)

English corpora: ARRAU, ACE, MUC

                ACE Carafe  MUC-7  ACE02  ARRAU
MUC                  0.618  0.585  0.590  0.557
CEAF-AGGR Φ-3        0.537  0.379  0.393  0.683
CEAF-AGGR Φ-4        0.506  0.206  0.309  0.717
Link-based           0.638  0.594  0.532  0.540
Pronouns             0.686  0.492  0.597  0.558
Nominals             0.355  0.455  0.239  0.352
Names                0.638  0.817  0.784  0.763
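For readers unfamiliar with the MUC metric of Vilain et al. (1995), the sketch below shows how it counts links: recall asks how many of the links needed to build each key chain survive in the response, and precision swaps the two roles. This is a minimal illustration that assumes the chains within each partition are disjoint. The function names are hypothetical, and it is not the scorer used to produce the figures reported here.

```python
from typing import Hashable, List, Set, Tuple


def _muc_side(gold: List[Set[Hashable]], other: List[Set[Hashable]]) -> Tuple[int, int]:
    """Numerator and denominator of the MUC ratio for one direction."""
    num = den = 0
    for chain in gold:
        # Count the pieces `other` splits this chain into; a mention that
        # `other` leaves out forms a piece of its own.
        covered: Set[Hashable] = set()
        parts = 0
        for o in other:
            overlap = chain & o
            if overlap:
                parts += 1
                covered |= overlap
        parts += len(chain - covered)
        num += len(chain) - parts
        den += len(chain) - 1
    return num, den


def muc_score(
    key: List[Set[Hashable]], response: List[Set[Hashable]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the MUC link-counting scheme."""
    r_num, r_den = _muc_side(key, response)
    p_num, p_den = _muc_side(response, key)
    recall = r_num / r_den if r_den else 0.0
    precision = p_num / p_den if p_den else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One key chain of four mentions, split into two chains by the response.
p, r, f = muc_score([{"a", "b", "c", "d"}], [{"a", "b"}, {"c", "d"}])
print(p, r, f)
```

In the toy case both response links are correct, so precision is 1.0, while only two of the three key links are recovered, giving recall 2/3 and F1 0.8.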


Use of the corpus for anaphora resolution (3)

Italian corpora: LMC, ICAB

                 ICAB  LMC-Sys  LMC-Gold
MUC             0.494    0.456     0.619
CEAF-AGGR Φ-3   0.557    0.622     0.798
CEAF-AGGR Φ-4   0.560    0.671     0.869
Link-based      0.556    0.470     0.580
Pronouns        0.452    0.520     0.521
Nominals        0.421    0.303     0.522
Names           0.741    0.642     0.752


Use of the corpus for anaphora resolution (4)

Use of C4 decision trees to compare the impact of individual features.

The impact of the baseline features is similar for English and Italian, with two exceptions:

The impact of gender matching is high in English, but it has no effect for Italian.

The use of automatically computed aliases has a high impact for Italian and a low impact for English.


Use of the data

5th International Workshop on Semantic Evaluation (SemEval 2010)

Task: Coreference Resolution in Multiple Languages.

Comparative research on zero anaphora in Italian and Japanese

Training and evaluation of content extraction models in the Live Memories project.


Conclusions

Linguistically motivated annotation scheme applicable to English and Italian.

Scheme used to annotate different genres: newspapers, encyclopedic text, dialogue, narrative and weblogs.

Corpora are usable to build anaphora resolution models.

Datasets have been used for international competitions and for linguistic research.
