1
CLEF 2009, Corfu
Question Answering Track Overview

J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi

D. Santos, L.M. Cabral

A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, N. Moreau, P. Osenova
2
QA Tasks & Time

[Timeline figure, 2003-2009: Multiple Language QA Main Task (ResPubliQA in 2009); temporal restrictions and lists; Answer Validation Exercise (AVE); GikiCLEF; Real Time; QA over Speech Transcriptions (QAST); WiQA; WSD QA]
3
2009 campaign
• ResPubliQA: QA on European Legislation
• GikiCLEF: QA requiring geographical reasoning on Wikipedia
• QAST: QA on speech transcriptions of European Parliament plenary sessions
4
QA 2009 campaign
Task       | Registered groups  | Participant groups | Submitted runs          | Organizing people
ResPubliQA | 20                 | 11                 | 28 + 16 (baseline runs) | 9
GikiCLEF   | 27                 | 8                  | 17                      | 2
QAST       | 12                 | 4                  | 86 (5 subtasks)         | 8
Total      | 59 showed interest | 23 groups          | 147 runs evaluated      | 19 + additional assessors
5
ResPubliQA 2009: QA on European Legislation

Organizers
Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

Additional Assessors
Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

Advisory Board
Donna Harman, Maarten de Rijke, Dominique Laurent
6
Evolution of the task
Years: 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009

Target languages: 3, 7, 8, 9, 10, 11, 8
Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; European Legislation (2009)
Number of questions: 200, later 500 (2009)
Type of questions: 200 Factoid; + Temporal restrictions; + Definitions; - Type of question; + Lists; + Linked questions; + Closed lists; - Linked, + Reason, + Purpose, + Procedure (2009)
Supporting information: Document, then Snippet, then Paragraph
Size of answer: Snippet, then Exact, then Paragraph (2009)
7
Objectives
1. Move towards a domain of potential users
2. Compare systems working in different languages
3. Compare QA technology with pure IR
4. Introduce more types of questions
5. Introduce Answer Validation technologies
8
Collection
• Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
• XML-TEI.2 encoding
• Unfortunately, not parallel at the paragraph level -> extra work
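As a rough illustration of what working with that encoding involves, here is a minimal Python sketch that extracts paragraphs from one TEI document. The file name and the assumption that paragraphs are plain <p> elements are hypothetical; the actual JRC-Acquis markup may differ in detail (namespaces, ids).

```python
# Minimal sketch: extract candidate paragraphs from one XML-TEI document.
# Tag layout and file name are assumptions for illustration only.
import xml.etree.ElementTree as ET

def load_paragraphs(path):
    """Return (paragraph_number, text) pairs for every non-empty <p> element."""
    root = ET.parse(path).getroot()
    paragraphs = []
    for i, p in enumerate(root.iter("p"), start=1):
        text = " ".join("".join(p.itertext()).split())  # collapse whitespace
        if text:
            paragraphs.append((i, text))
    return paragraphs

# paragraphs = load_paragraphs("jrc-document-en.xml")  # hypothetical file name
```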
9
500 questions
REASON
• Why did a Commission expert conduct an inspection visit to Uruguay?

PURPOSE/OBJECTIVE
• What is the overall objective of the eco-label?

PROCEDURE
• How are stable conditions in the natural rubber trade achieved?

In general, any question that can be answered in a paragraph.
10
500 questions
Also FACTOID
• In how many languages is the Official Journal of the Community published?

and DEFINITION
• What is meant by “whole milk”?

No NIL questions.
11
12
Translation of questions
13
Selection of the final pool of 500 questions out of the 600 produced
14
15
Systems response
No Answer ≠ Wrong Answer

1. Decide whether or not to answer
• [ YES | NO ]
• A classification problem
• Machine Learning, provers, etc.
• Textual Entailment

2. Provide the paragraph (ID + text) that answers the question

Aim: leaving a question unanswered has more value than giving a wrong answer.
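A minimal sketch of that YES/NO decision, assuming each candidate paragraph carries a validation score in [0, 1]; the names and the threshold below are illustrative, not part of the track specification:

```python
# Sketch: abstain when the best validated candidate looks too weak.
def respond(candidates, threshold=0.5):
    """candidates: list of (paragraph_id, text, validation_score) triples.
    Returns (answered, best): answered is the YES/NO decision; best is the
    candidate kept for assessment even when the system abstains."""
    if not candidates:
        return False, None                       # NoA with no candidate
    best = max(candidates, key=lambda c: c[2])   # highest-scoring paragraph
    answered = best[2] >= threshold
    return answered, best                        # (ID + text) when answered
```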
16
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given

Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions answered correctly) that also takes unanswered questions into account.
17
Evaluation measure
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right)
18
Evaluation measure
If n_U = 0 then c@1 = n_R / n (plain accuracy)
If n_R = 0 then c@1 = 0
If n_U = n then c@1 = 0

Leaving a question unanswered adds value only if it avoids returning a wrong answer.

In c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right), both occurrences of n_R/n are the accuracy: the added value credited for unanswered questions is the performance shown on the answered ones.
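The measure is a one-liner to implement. A sketch, with the edge cases above as checks (the example call uses the icia092roro counts from the Romanian results table on slide 20):

```python
def c_at_1(n_r, n_u, n):
    """n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

assert c_at_1(100, 0, 500) == 100 / 500     # n_U = 0  ->  plain accuracy
assert c_at_1(0, 150, 500) == 0.0           # n_R = 0  ->  0
assert c_at_1(0, 500, 500) == 0.0           # n_U = n  ->  0
print(round(c_at_1(260, 156, 500), 2))      # 0.68 for icia092roro
```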
19
List of Participants
System Team
elix ELHUYAR-IXA, SPAIN
icia RACAI, ROMANIA
iiit Search & Info Extraction Lab, INDIA
iles LIMSI-CNRS-2, FRANCE
isik ISI-Kolkata, INDIA
loga U. Koblenz-Landau, GERMANY
mira MIRACLE, SPAIN
nlel U. Politecnica Valencia, SPAIN
syna Synapse Développement, FRANCE
uaic Al.I.Cuza U. of Iasi, ROMANIA
uned UNED, SPAIN
20
Value of reducing wrong answers
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.76 | 0.76     | 381 | 119 | 0    | 0      | 0      | 0
icia092roro | 0.68 | 0.52     | 260 | 84  | 156  | 0      | 0      | 156
icia091roro | 0.58 | 0.47     | 237 | 156 | 107  | 0      | 0      | 107
UAIC092roro | 0.47 | 0.47     | 236 | 264 | 0    | 0      | 0      | 0
UAIC091roro | 0.45 | 0.45     | 227 | 273 | 0    | 0      | 0      | 0
base092roro | 0.44 | 0.44     | 220 | 280 | 0    | 0      | 0      | 0
base091roro | 0.37 | 0.37     | 185 | 315 | 0    | 0      | 0      | 0
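Reading c@1 against this table: icia092roro answered 344 questions (260 R + 84 W) and left 156 unanswered, so c@1 = (260 + 156 · (260/500)) / 500 ≈ 0.68, while plain accuracy is only 260/500 = 0.52. The credited abstentions account for the entire gap between the two columns.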
21
Detecting wrong answers
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.56 | 0.56     | 278 | 222 | 0    | 0      | 0      | 0
loga091dede | 0.44 | 0.40     | 186 | 221 | 93   | 16     | 68     | 9
loga092dede | 0.44 | 0.40     | 187 | 230 | 83   | 12     | 62     | 9
base092dede | 0.38 | 0.38     | 189 | 311 | 0    | 0      | 0      | 0
base091dede | 0.35 | 0.35     | 174 | 326 | 0    | 0      | 0      | 0
While maintaining the number of correct answers, the candidate answer was wrong for 83% of the unanswered questions: a very good step towards improving the system.
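The 83% figure can be checked against the table: loga091dede left 93 questions unanswered, and for 68 of them the candidate was wrong while 9 had no candidate at all, i.e. (68 + 9) / 93 ≈ 0.83.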
22
IR important, not enough
System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.90 | 0.90     | 451 | 49  | 0    | 0      | 0      | 0
uned092enen | 0.61 | 0.61     | 288 | 184 | 28   | 15     | 12     | 1
uned091enen | 0.60 | 0.59     | 282 | 190 | 28   | 15     | 13     | 0
nlel091enen | 0.58 | 0.57     | 287 | 211 | 2    | 0      | 0      | 2
uaic092enen | 0.54 | 0.52     | 243 | 204 | 53   | 18     | 35     | 0
base092enen | 0.53 | 0.53     | 263 | 236 | 1    | 1      | 0      | 0
base091enen | 0.51 | 0.51     | 256 | 243 | 1    | 0      | 1      | 0
elix092enen | 0.48 | 0.48     | 240 | 260 | 0    | 0      | 0      | 0
uaic091enen | 0.44 | 0.42     | 200 | 253 | 47   | 11     | 36     | 0
elix091enen | 0.42 | 0.42     | 211 | 289 | 0    | 0      | 0      | 0
syna091enen | 0.28 | 0.28     | 141 | 359 | 0    | 0      | 0      | 0
isik091enen | 0.25 | 0.25     | 126 | 374 | 0    | 0      | 0      | 0
iiit091enen | 0.20 | 0.11     | 54  | 37  | 409  | 0      | 11     | 398
elix092euen | 0.18 | 0.18     | 91  | 409 | 0    | 0      | 0      | 0
elix091euen | 0.16 | 0.16     | 78  | 422 | 0    | 0      | 0      | 0
• Feasible task
• A perfect combination is 50% better than the best system
• Many systems fall below the IR baselines
23
Comparison across languages

• Same questions
• Same documents
• Same baseline systems
• A strict comparison is affected only by the language variable
• It is thus feasible to detect the most promising approaches across languages
24
Comparison across languages
System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.60 |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.40 | 0.53 | 0.42 | 0.38
Systems above the baselines:
• icia: Boolean retrieval + intensive NLP + ML-based validation, plus very good knowledge of the collection (Eurovoc terms…)
• Baseline: Okapi BM25 tuned for paragraph retrieval
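For context on that baseline, below is a self-contained sketch of Okapi BM25 scoring over paragraphs. It is illustrative only: k1 and b are the textbook defaults rather than the organizers' tuned values, and tokenization is assumed to have happened upstream. Under this scheme, a pure-IR baseline simply returns the top-scoring paragraph as its answer.

```python
# Sketch: rank paragraphs (token lists) against a query with Okapi BM25.
import math
from collections import Counter

def bm25_scores(query_terms, paragraphs, k1=1.2, b=0.75):
    """Return one BM25 score per paragraph; higher means a better match."""
    N = len(paragraphs)
    avgdl = sum(len(p) for p in paragraphs) / N          # average length
    df = Counter(t for p in paragraphs for t in set(p))  # document frequency
    scores = []
    for p in paragraphs:
        tf = Counter(p)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(p) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores
```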
25
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• nlel092: n-gram-based retrieval, combining evidence from several languages
• Baseline: Okapi BM25 tuned for paragraph retrieval
26
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• uned: Okapi BM25 + NER + paragraph validation + n-gram-based re-ranking
• Baseline: Okapi BM25 tuned for paragraph retrieval
27
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• nlel091: n-gram-based paragraph retrieval
• Baseline: Okapi BM25 tuned for paragraph retrieval
28
Comparison across languages
(same results table as slide 24)
Systems above the baselines:
• loga: Lucene + deep NLP + logic + ML-based validation
• Baseline: Okapi BM25 tuned for paragraph retrieval
29
Conclusion
• Compare systems working in different languages

• Compare QA technology with pure IR
  - Pay more attention to paragraph retrieval: an old issue, late-1990s state of the art (English)
  - Pure IR performance: 0.38 - 0.58
  - Largest difference with respect to the IR baselines: 0.44 vs. 0.68 (Romanian)
  - Intensive NLP
  - ML-based answer validation

• Introduce more types of questions
  - Some types are difficult to distinguish
  - Any question that can be answered in a paragraph
  - Analysis of results by question type (in progress)
30
Conclusion
• Introduce Answer Validation technologies
  - Evaluation measure: c@1
  - Value in reducing wrong answers
  - Detecting wrong answers is feasible

• Feasible task
  - 90% of the questions have been answered (perfect combination)
  - Room for improvement: best systems around 60%

• Even with fewer participants we have more comparison, more analysis, more learning

• ResPubliQA proposal for 2010: SC and breakout session
31
Interest in ResPubliQA 2010

GROUPS
1. Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2. Linguateca (Nuno Cardoso)
3. RACAI (Dan Tufis, Radu Ion)
4. Jesus Vilares
5. Univ. Koblenz-Landau (Bjorn Pelzer)
6. Thomson Reuters (Isabelle Moulinier)
7. Gracinda Carvalho
8. UNED (Alvaro Rodrigo)
9. Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10. Uni. Hagen (Ingo Glockner)
11. Linguit (Jochen L. Leidner)
12. Uni. Saarland (Dietrich Klakow)
13. ELHUYAR-IXA (Arantxa Otegi)
14. MIRACLE TEAM (Paloma Martínez Fernández)

But we need more!

You already have a Gold Standard of 500 questions & answers to play with…