1
CLEF 2009, Corfu
Question Answering Track Overview

J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi

D. Santos, L.M. Cabral

A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, N. Moreau, P. Osenova
2
QA Tasks & Time

[Timeline figure, 2003-2009: Multiple Language QA Main Task (ResPubliQA in 2009); temporal restrictions and lists; Answer Validation Exercise (AVE); GikiCLEF; Real Time; QA over Speech Transcriptions (QAST); WiQA; WSD QA]
3
2009 campaign
• ResPubliQA: QA on European Legislation
• GikiCLEF: QA requiring geographical reasoning on Wikipedia
• QAST: QA on speech transcriptions of European Parliament plenary sessions
4
QA 2009 campaign
Task       | Registered groups  | Participant groups | Submitted runs          | Organizing people
ResPubliQA | 20                 | 11                 | 28 + 16 (baseline runs) | 9
GikiCLEF   | 27                 | 8                  | 17                      | 2
QAST       | 12                 | 4                  | 86 (5 subtasks)         | 8
Total      | 59 showed interest | 23 groups          | 147 runs evaluated      | 19 + additional assessors
5
ResPubliQA 2009: QA on European Legislation

Organizers
Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

Additional Assessors
Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

Advisory Board
Donna Harman, Maarten de Rijke, Dominique Laurent
6
Evolution of the task
Years: 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009

Target languages: 3, 7, 8, 9, 10, 11, 8
Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; European Legislation (2009)
Number of questions: 200, later 500 (2009)
Type of questions: 200 Factoid; + Temporal restrictions; + Definitions; - Type of question; + Lists; + Linked questions; + Closed lists; - Linked, + Reason, + Purpose, + Procedure (2009)
Supporting information: Document, then Snippet, then Paragraph
Size of answer: Snippet, then Exact, then Paragraph (2009)
7
Objectives
1. Move towards a domain of potential users
2. Compare systems working in different languages
3. Compare QA technology with pure IR
4. Introduce more types of questions
5. Introduce Answer Validation technologies
8
Collection
• Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
• XML-TEI.2 encoding
• Unfortunately, not parallel at the paragraph level -> extra work
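As a rough illustration of what working with that encoding involves, here is a minimal Python sketch that extracts paragraphs from one TEI document. The file name and the assumption that paragraphs are plain <p> elements are hypothetical; the actual JRC-Acquis markup may differ in detail (namespaces, ids).

```python
# Minimal sketch: extract candidate paragraphs from one XML-TEI document.
# Tag layout and file name are assumptions for illustration only.
import xml.etree.ElementTree as ET

def load_paragraphs(path):
    """Return (paragraph_number, text) pairs for every non-empty <p> element."""
    root = ET.parse(path).getroot()
    paragraphs = []
    for i, p in enumerate(root.iter("p"), start=1):
        text = " ".join("".join(p.itertext()).split())  # collapse whitespace
        if text:
            paragraphs.append((i, text))
    return paragraphs

# paragraphs = load_paragraphs("jrc-document-en.xml")  # hypothetical file name
```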
9
500 questions
REASON
• Why did a Commission expert conduct an inspection visit to Uruguay?

PURPOSE/OBJECTIVE
• What is the overall objective of the eco-label?

PROCEDURE
• How are stable conditions in the natural rubber trade achieved?

In general, any question that can be answered in a paragraph.
10
500 questions
Also FACTOID
• In how many languages is the Official Journal of the Community published?

and DEFINITION
• What is meant by “whole milk”?

No NIL questions.
11
12
Translation of questions
13
Selection of the final pool of 500 questions out of the 600 produced
14
15
Systems response
No Answer ≠ Wrong Answer

1. Decide whether or not to answer
• [ YES | NO ]
• A classification problem
• Machine Learning, provers, etc.
• Textual Entailment

2. Provide the paragraph (ID + text) that answers the question

Aim: leaving a question unanswered has more value than giving a wrong answer.
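A minimal sketch of that YES/NO decision, assuming each candidate paragraph carries a validation score in [0, 1]; the names and the threshold below are illustrative, not part of the track specification:

```python
# Sketch: abstain when the best validated candidate looks too weak.
def respond(candidates, threshold=0.5):
    """candidates: list of (paragraph_id, text, validation_score) triples.
    Returns (answered, best): answered is the YES/NO decision; best is the
    candidate kept for assessment even when the system abstains."""
    if not candidates:
        return False, None                       # NoA with no candidate
    best = max(candidates, key=lambda c: c[2])   # highest-scoring paragraph
    answered = best[2] >= threshold
    return answered, best                        # (ID + text) when answered
```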
16
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given

Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions answered correctly) that also takes unanswered questions into account.
17
Evaluation measure
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right)
18
Evaluation measure
If n_U = 0 then c@1 = n_R / n (plain accuracy)
If n_R = 0 then c@1 = 0
If n_U = n then c@1 = 0

Leaving a question unanswered adds value only if it avoids returning a wrong answer.

In c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right), both occurrences of n_R/n are the accuracy: the added value credited for unanswered questions is the performance shown on the answered ones.
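The measure is a one-liner to implement. A sketch, with the edge cases above as checks (the example call uses the icia092roro counts from the Romanian results table on slide 20):

```python
def c_at_1(n_r, n_u, n):
    """n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

assert c_at_1(100, 0, 500) == 100 / 500     # n_U = 0  ->  plain accuracy
assert c_at_1(0, 150, 500) == 0.0           # n_R = 0  ->  0
assert c_at_1(0, 500, 500) == 0.0           # n_U = n  ->  0
print(round(c_at_1(260, 156, 500), 2))      # 0.68 for icia092roro
```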
19
List of Participants
System Team
elix ELHUYAR-IXA, SPAIN
icia RACAI, ROMANIA
iiit Search & Info Extraction Lab, INDIA
iles LIMSI-CNRS-2, FRANCE
isik ISI-Kolkata, INDIA
loga U. Koblenz-Landau, GERMANY
mira MIRACLE, SPAIN
nlel U. Politecnica Valencia, SPAIN
syna Synapse Développement, FRANCE
uaic Al.I.Cuza U. of Iasi, ROMANIA
uned UNED, SPAIN
20
Value of reducing wrong answers
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.76 | 0.76     | 381 | 119 | 0    | 0      | 0      | 0
icia092roro | 0.68 | 0.52     | 260 | 84  | 156  | 0      | 0      | 156
icia091roro | 0.58 | 0.47     | 237 | 156 | 107  | 0      | 0      | 107
UAIC092roro | 0.47 | 0.47     | 236 | 264 | 0    | 0      | 0      | 0
UAIC091roro | 0.45 | 0.45     | 227 | 273 | 0    | 0      | 0      | 0
base092roro | 0.44 | 0.44     | 220 | 280 | 0    | 0      | 0      | 0
base091roro | 0.37 | 0.37     | 185 | 315 | 0    | 0      | 0      | 0
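Reading c@1 against this table: icia092roro answered 344 questions (260 R + 84 W) and left 156 unanswered, so c@1 = (260 + 156 · (260/500)) / 500 ≈ 0.68, while plain accuracy is only 260/500 = 0.52. The credited abstentions account for the entire gap between the two columns.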
21
Detecting wrong answers
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.56 | 0.56     | 278 | 222 | 0    | 0      | 0      | 0
loga091dede | 0.44 | 0.40     | 186 | 221 | 93   | 16     | 68     | 9
loga092dede | 0.44 | 0.40     | 187 | 230 | 83   | 12     | 62     | 9
base092dede | 0.38 | 0.38     | 189 | 311 | 0    | 0      | 0      | 0
base091dede | 0.35 | 0.35     | 174 | 326 | 0    | 0      | 0      | 0
While maintaining the number of correct answers, the candidate answer was wrong for 83% of the unanswered questions: a very good step towards improving the system.
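The 83% figure can be checked against the table: loga091dede left 93 questions unanswered, and for 68 of them the candidate was wrong while 9 had no candidate at all, i.e. (68 + 9) / 93 ≈ 0.83.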
22
IR important, not enough
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.90 | 0.90     | 451 | 49  | 0    | 0      | 0      | 0
uned092enen | 0.61 | 0.61     | 288 | 184 | 28   | 15     | 12     | 1
uned091enen | 0.60 | 0.59     | 282 | 190 | 28   | 15     | 13     | 0
nlel091enen | 0.58 | 0.57     | 287 | 211 | 2    | 0      | 0      | 2
uaic092enen | 0.54 | 0.52     | 243 | 204 | 53   | 18     | 35     | 0
base092enen | 0.53 | 0.53     | 263 | 236 | 1    | 1      | 0      | 0
base091enen | 0.51 | 0.51     | 256 | 243 | 1    | 0      | 1      | 0
elix092enen | 0.48 | 0.48     | 240 | 260 | 0    | 0      | 0      | 0
uaic091enen | 0.44 | 0.42     | 200 | 253 | 47   | 11     | 36     | 0
elix091enen | 0.42 | 0.42     | 211 | 289 | 0    | 0      | 0      | 0
syna091enen | 0.28 | 0.28     | 141 | 359 | 0    | 0      | 0      | 0
isik091enen | 0.25 | 0.25     | 126 | 374 | 0    | 0      | 0      | 0
iiit091enen | 0.20 | 0.11     | 54  | 37  | 409  | 0      | 11     | 398
elix092euen | 0.18 | 0.18     | 91  | 409 | 0    | 0      | 0      | 0
elix091euen | 0.16 | 0.16     | 78  | 422 | 0    | 0      | 0      | 0
• Feasible task
• A perfect combination is 50% better than the best system
• Many systems fall below the IR baselines
23
Comparison across languages

• Same questions
• Same documents
• Same baseline systems
• A strict comparison is affected only by the language variable
• It is thus feasible to detect the most promising approaches across languages
24
Comparison across languages
System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.60 |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.40 | 0.53 | 0.42 | 0.38
Systems above the baselines:
• icia: Boolean retrieval + intensive NLP + ML-based validation, plus very good knowledge of the collection (Eurovoc terms…)
• Baseline: Okapi BM25 tuned for paragraph retrieval
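For context on that baseline, below is a self-contained sketch of Okapi BM25 scoring over paragraphs. It is illustrative only: k1 and b are the textbook defaults rather than the organizers' tuned values, and tokenization is assumed to have happened upstream. Under this scheme, a pure-IR baseline simply returns the top-scoring paragraph as its answer.

```python
# Sketch: rank paragraphs (token lists) against a query with Okapi BM25.
import math
from collections import Counter

def bm25_scores(query_terms, paragraphs, k1=1.2, b=0.75):
    """Return one BM25 score per paragraph; higher means a better match."""
    N = len(paragraphs)
    avgdl = sum(len(p) for p in paragraphs) / N          # average length
    df = Counter(t for p in paragraphs for t in set(p))  # document frequency
    scores = []
    for p in paragraphs:
        tf = Counter(p)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(p) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores
```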
25
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• nlel092: n-gram-based retrieval, combining evidence from several languages
• Baseline: Okapi BM25 tuned for paragraph retrieval
26
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• uned: Okapi BM25 + NER + paragraph validation + n-gram-based re-ranking
• Baseline: Okapi BM25 tuned for paragraph retrieval
27
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• nlel091: n-gram-based paragraph retrieval
• Baseline: Okapi BM25 tuned for paragraph retrieval
28
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• loga: Lucene + deep NLP + logic + ML-based validation
• Baseline: Okapi BM25 tuned for paragraph retrieval
29
Conclusion
• Compare systems working in different languages

• Compare QA technology with pure IR
  - Pay more attention to paragraph retrieval: an old issue, late-1990s state of the art (English)
  - Pure IR performance: 0.38 - 0.58
  - Largest difference with respect to the IR baselines: 0.44 vs. 0.68 (Romanian)
  - Intensive NLP
  - ML-based answer validation

• Introduce more types of questions
  - Some types are difficult to distinguish
  - Any question that can be answered in a paragraph
  - Analysis of results by question type (in progress)
30
Conclusion
• Introduce Answer Validation technologies
  - Evaluation measure: c@1
  - Value in reducing wrong answers
  - Detecting wrong answers is feasible

• Feasible task
  - 90% of the questions have been answered (perfect combination)
  - Room for improvement: best systems around 60%

• Even with fewer participants we have more comparison, more analysis, more learning

• ResPubliQA proposal for 2010: SC and breakout session
31
Interest in ResPubliQA 2010

GROUPS
1. Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2. Linguateca (Nuno Cardoso)
3. RACAI (Dan Tufis, Radu Ion)
4. Jesus Vilares
5. Univ. Koblenz-Landau (Bjorn Pelzer)
6. Thomson Reuters (Isabelle Moulinier)
7. Gracinda Carvalho
8. UNED (Alvaro Rodrigo)
9. Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10. Uni. Hagen (Ingo Glockner)
11. Linguit (Jochen L. Leidner)
12. Uni. Saarland (Dietrich Klakow)
13. ELHUYAR-IXA (Arantxa Otegi)
14. MIRACLE TEAM (Paloma Martínez Fernández)

But we need more!

You already have a Gold Standard of 500 questions & answers to play with…