Evaluating Question Answering Validation
Anselmo Peñas (and Alvaro Rodrigo)
NLP & IR group, UNED
nlp.uned.es
Information Science Institute
Marina del Rey, December 11, 2009
Old friends
Question Answering: nothing else than answering a question
Natural Language Understanding: something is there if you are able to answer a question
QA: an extrinsic evaluation for NLU

Suddenly… (see the track?) … the QA Track at TREC
Question Answering at TREC
Object of evaluation itself. Redefined as (roughly speaking):
a highly precision-oriented IR task where NLP was necessary
• Especially for Answer Extraction

Before:
• Knowledge Base (e.g. semantic networks)
• Specific domain
• Single accurate answer (with explanation)
• More Reasoning

After:
• Big document collections (news, blogs)
• Unrestricted domain
• Ranking of answers (linked to documents)
• More Retrieval
What’s this story about?
A timeline of QA tasks at CLEF, 2003–2010:
• Multiple Language QA Main Task, later ResPubliQA (2009)
• Temporal restrictions and lists
• Answer Validation Exercise (AVE)
• GikiCLEF
• Real Time
• QA over Speech Transcriptions (QAST)
• WiQA, WSD QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Outline

1. Analysis of current systems performance
2. Mid-term goals and strategy
3. Evaluation task definition
4. Analysis of the evaluation cycle
• Short cycle: result analysis, methodology analysis
• Long cycle: generation of methodology and evaluation resources, task activation and development
Systems performance, 2003–2006 (Spanish)

Overall: best result < 60%
Definitions: best result > 80%, NOT an IR approach
Pipeline Upper Bounds

SOMETHING to break the pipeline

Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer

1.0 × 0.8 × 0.8 = 0.64

Not enough evidence
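The compounding of stage accuracies can be sketched in a few lines; the per-stage numbers below are the illustrative ones from the slide, not measurements of any real system.

```python
from functools import reduce

# Illustrative per-stage accuracies from the slide:
# question analysis, passage retrieval, answer extraction.
stage_accuracies = [1.0, 0.8, 0.8]

# In a strict pipeline every stage depends on the previous one,
# so errors compound multiplicatively: the product is an upper
# bound on end-to-end accuracy, regardless of the final ranking.
upper_bound = reduce(lambda a, b: a * b, stage_accuracies)
print(round(upper_bound, 2))  # 0.64
```

This is why a validation step that can break the pipeline (rejecting answers and retrying) matters: it removes the hard ceiling imposed by the product.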
Results in CLEF-QA 2006 (Spanish)
Perfect combination: 81%
Best system: 52.5%

Different systems are the best with ORGANIZATION, with PERSON, and with TIME questions.
Collaborative architectures
Different systems answer different types of questions better
• Specialization
• Collaboration
Question → QA sys1, QA sys2, QA sys3, … QA sysn → candidate answers → SOMETHING for combining / selecting → Answer
Collaborative architectures
How to select the good answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not deeper content analysis?
Mid-Term Goal

Goal: improve QA systems performance

New mid-term goal: improve the devices for rejecting / accepting / selecting answers

The new task (2006): validate the correctness of the answers given by real QA systems, the participants at CLEF QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Define Answer Validation
Decide whether an answer is correct or not
More precisely, the task:
Given
• Question
• Answer
• Supporting text
decide if the answer is correct according to the supporting text.

Let's call it the Answer Validation Exercise (AVE)
Wish list

Test collection
• Questions
• Answers
• Supporting texts
• Human assessments

Evaluation measures
Participants
Evaluation linked to main QA task
Question Answering Track → questions, systems' answers, systems' supporting texts → Answer Validation Exercise (ACCEPT / REJECT)

Human judgements (R, W, X, U) → mapping to (ACCEPT / REJECT) → evaluation
QA Track results → AVE Track results

Reuse human assessments
Answer Validation Exercise (AVE)

Question + candidate answer + supporting text → Answer Validation → answer is correct | answer is not correct, or not enough evidence

AVE 2007–2008: validation directly over the (question, answer, supporting text) triple
AVE 2006: automatic hypothesis generation, then textual entailment between text and hypothesis
Outline
Motivation and goals
Definition and general framework
AVE 2006
• Underlying architecture: pipeline
• Evaluating the validation
• As an RTE exercise: text-hypothesis pairs
AVE 2007 & 2008
QA 2009
AVE 2006: An RTE exercise

If the text semantically entails the hypothesis, then the answer is expected to be correct.

Question + exact answer (from a QA system) + supporting snippet → hypothesis + text → entailment?

Is this true? Yes, in 95% of cases with current QA systems (J LOG COMP 2009)
Collections AVE 2006
Available at: nlp.uned.es/clef-qa/ave/
Testing (pairs entail.) Training
English 2088 (10% YES) 2870 (15% YES)
Spanish 2369 (28% YES) 2905 (22% YES)
German 1443 (25% YES)
French 3266 (22% YES)
Italian 1140 (16% YES)
Dutch 807 (10% YES)
Portuguese 1324 (14% YES)
Evaluating the Validation
Validation: decide if each candidate answer is correct or not
• YES | NO

Collections are not balanced

Approach: detect if there is enough evidence to accept an answer
Measures: precision, recall and F over correct answers
Baseline system: accept all answers
Evaluating the Validation
                   Correct Answer   Incorrect Answer
Answer Accepted    n_CA             n_WA
Answer Rejected    n_CR             n_WR

recall = n_CA / (n_CA + n_CR)

precision = n_CA / (n_CA + n_WA)

F = 2 · precision · recall / (precision + recall)
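These validation measures can be sketched as a small function; the counts in the example are invented for illustration, not taken from the AVE results.

```python
def validation_scores(n_ca, n_wa, n_cr, n_wr):
    """Precision, recall and F over correct answers.

    n_ca: correct answers accepted    n_wa: incorrect answers accepted
    n_cr: correct answers rejected    n_wr: incorrect answers rejected
    """
    recall = n_ca / (n_ca + n_cr) if (n_ca + n_cr) else 0.0
    precision = n_ca / (n_ca + n_wa) if (n_ca + n_wa) else 0.0
    f = (2 * precision * recall / (precision + recall)
         if (precision + recall) else 0.0)
    return precision, recall, f

# Illustrative counts: 60 correct accepted, 40 wrong accepted,
# 20 correct rejected, 80 wrong rejected.
p, r, f = validation_scores(60, 40, 20, 80)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.75 0.67
```

Note that the "accept all answers" baseline gets recall 1.0 by construction, so only precision separates it from real validators on these collections.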
Results AVE 2006
Language     Baseline (F)   Best system (F)   Reported Techniques
English      .27            .44               Logic
Spanish      .45            .61               Logic
German       .39            .54               Lexical, Syntax, Semantics, Logic, Corpus
French       .37            .47               Overlapping, Learning
Dutch        .19            .39               Syntax, Learning
Portuguese   .38            .35               Overlapping
Italian      .29            .41               Overlapping, Learning
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
• Underlying architecture: multi-stream
• Quantify the potential benefit of AV in QA
• Evaluating the correct selection of one answer
• Evaluating the correct rejection of all answers
QA 2009
AVE 2007 & 2008

Question → QA sys1, QA sys2, QA sys3, … QA sysn (participant systems in CLEF-QA) → candidate answers + supporting texts → Answer Validation & Selection → Answer

Evaluation of Answer Validation & Selection
Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
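Reading this collection format can be sketched with the standard library; this is a minimal sketch assuming the structure shown above, with the sample string being a trimmed copy of the slide's example.

```python
import xml.etree.ElementTree as ET

# A trimmed sample in the AVE collection format shown on the slide.
sample = """
<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home
    appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
"""

q = ET.fromstring(sample)
question = q.findtext("q_str").strip()
# Each <a> element carries one candidate answer plus the document
# of its supporting text; the validator judges each one.
candidates = [
    (a.get("id"), a.findtext("a_str").strip(), a.find("t_str").get("doc"))
    for a in q.findall("a")
]
print(question)          # What is Zanussi?
print(candidates[0][0])  # 116_1
```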
Evaluating the Selection
Goals:
• Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
• Develop measures more comparable to QA accuracy

qa_accuracy = n_correctly_answered_questions / n_questions
Evaluating the selection
Given a question with several candidate answers, two options:

Selection: select an answer ≡ try to answer the question
• Correct selection: the answer was correct
• Incorrect selection: the answer was incorrect

Rejection: reject all candidate answers ≡ leave the question unanswered
• Correct rejection: all candidate answers were incorrect
• Incorrect rejection: not all candidate answers were incorrect
Evaluating the Selection
n questions: n = n_CA + n_WA + n_WS + n_WR + n_CR

                                  Question with     Question without
                                  correct answer    correct answer
Question answered correctly
(one answer selected)             n_CA              -
Question answered incorrectly     n_WA              n_WS
Question unanswered
(all answers rejected)            n_WR              n_CR

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n
Evaluating the Selection
qa_accuracy = n_CA / n
rej_accuracy = n_CR / n

accuracy = (n_CA + n_CR) / n

Rewards rejection (collections are not balanced)

Interpretation for QA: all questions correctly rejected by AV will be answered correctly
Evaluating the Selection

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n

estimated = (1/n) · (n_CA + n_CR · (n_CA / n))

Interpretation for QA: questions correctly rejected have value as if they were answered correctly in qa_accuracy proportion
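The selection measures above can be computed together in one place; the counts in the example are illustrative, not taken from the AVE results tables.

```python
def selection_measures(n_ca, n_cr, n):
    """Selection measures from the slides.

    n_ca: questions answered correctly (one answer selected)
    n_cr: questions correctly rejected (all candidates were wrong)
    n:    total number of questions
    """
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    # Plain accuracy rewards rejection as much as answering,
    # which is too generous on unbalanced collections.
    accuracy = (n_ca + n_cr) / n
    # Estimated measure: a correctly rejected question counts as
    # answered correctly only in qa_accuracy proportion.
    estimated = (n_ca + n_cr * (n_ca / n)) / n
    return qa_accuracy, rej_accuracy, accuracy, estimated

# Illustrative counts: 200 correct answers, 100 correct rejections,
# 500 questions in total.
qa, rej, acc, est = selection_measures(n_ca=200, n_cr=100, n=500)
print(round(qa, 2), round(rej, 2), round(acc, 2), round(est, 2))
```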
Techniques in AVE 2007
Generates hypotheses                   6
Wordnet                                3
Chunking                               3
n-grams, longest common subsequences   5
Phrase transformations                 2
NER                                    5
Num. expressions                       6
Temp. expressions                      4
Coreference resolution                 2
Dependency analysis                    3
Syntactic similarity                   4
Functions (sub, obj, etc.)             3
Syntactic transformations              1
Word-sense disambiguation              2
Semantic parsing                       4
Semantic role labeling                 2
First-order logic representation       3
Theorem prover                         3
Semantic similarity                    2
Conclusion of AVE

Answer Validation before:
• It was assumed to be a QA module
• But there was no space for its own development

The new devices should help to improve QA; they:
• Introduce more content analysis
• Use Machine Learning techniques
• Are able to break pipelines or combine streams

Let's transfer them to the QA main task
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009
CLEF QA 2009 campaign
ResPubliQA: QA on European Legislation
GikiCLEF: QA requiring geographical reasoning on Wikipedia
QAST: QA on Speech Transcriptions of European Parliament Plenary sessions
CLEF QA 2009 campaign
Task        Registered groups    Participant groups   Submitted runs            Organizing people
ResPubliQA  20                   11                   28 + 16 (baseline runs)   9
GikiCLEF    27                   8                    17 runs                   2
QAST        12                   4                    86 (5 subtasks)           8
Total       59 showed interest   23 groups            147 runs evaluated        19 + additional assessors
ResPubliQA 2009: QA on European Legislation

Organizers:
Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

Additional assessors:
Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

Advisory board:
Donna Harman, Maarten de Rijke, Dominique Laurent
Evolution of the task
Year:               2003  2004  2005  2006  2007  2008  2009
Target languages:   3     7     8     9     10    11    8

Collections: news 1994; then + news 1995 + Wikipedia Nov. 2006; then European Legislation (2009)
Number of questions: 200, then 500 (2009)
Type of questions: 200 factoid; + temporal restrictions; + definitions; - type of question; + lists; + linked questions; + closed lists; - linked; + reason; + purpose; + procedure
Supporting information: document, then snippet, then paragraph
Size of answer: snippet, then exact, then paragraph
Collection
Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
500 questions
REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
PROCEDURE: How are stable conditions in the natural rubber trade achieved?

In general, any question that can be answered in a paragraph
500 questions
Also FACTOID:
• In how many languages is the Official Journal of the Community published?
and DEFINITION:
• What is meant by "whole milk"?

No NIL questions
Systems response
No Answer ≠ Wrong Answer

1. Decide whether to answer or not
• [ YES | NO ]
• A classification problem
• Machine learning, provers, etc.
• Textual entailment
2. Provide the paragraph (ID + text) that answers the question

Aim: leaving a question unanswered has more value than giving a wrong answer
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given

Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions correctly answered) that takes unanswered questions into account
Evaluation measure
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = (1/n) · (n_R + n_U · (n_R / n))
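The c@1 measure is small enough to sketch directly; the worked example uses the icia092roro run from the Romanian results table later in the talk (260 correct, 156 unanswered, 500 questions).

```python
def c_at_1(n_r, n_u, n):
    """c@1: accuracy extended to reward leaving questions unanswered.

    n_r: correctly answered questions
    n_u: unanswered questions
    n:   total number of questions
    """
    # Each unanswered question is credited as if it were answered
    # correctly in n_r / n proportion (the system's accuracy on
    # the questions it did answer over the whole set).
    return (n_r + n_u * (n_r / n)) / n

# icia092roro from the Romanian results: 260 R, 156 NoA, 500 questions.
print(round(c_at_1(260, 156, 500), 2))  # 0.68
```

Note the two degenerate cases from the next slide: with no unanswered questions c@1 reduces to plain accuracy, and answering nothing at all scores 0.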
Evaluation measure
c@1 = (1/n) · (n_R + n_U · (n_R / n))

If n_U = 0 then c@1 = n_R / n (accuracy)
If n_R = 0 then c@1 = 0
If n_U = n then c@1 = 0

Leaving a question unanswered adds value only if this avoids returning a wrong answer.
The added value is the performance shown with the answered questions: accuracy.
List of Participants
System Team
elix   ELHUYAR-IXA, SPAIN
icia   RACAI, ROMANIA
iiit   Search & Info Extraction Lab, INDIA
iles   LIMSI-CNRS-2, FRANCE
isik   ISI-Kolkata, INDIA
loga   U. Koblenz-Landau, GERMANY
mira   MIRACLE, SPAIN
nlel   U. Politècnica de València, SPAIN
syna   Synapse Développement, FRANCE
uaic   Al.I. Cuza U. of Iasi, ROMANIA
uned   UNED, SPAIN
Value of reducing wrong answers
System        c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination   0.76   0.76       381   119   0      0        0        0
icia092roro   0.68   0.52       260   84    156    0        0        156
icia091roro   0.58   0.47       237   156   107    0        0        107
UAIC092roro   0.47   0.47       236   264   0      0        0        0
UAIC091roro   0.45   0.45       227   273   0      0        0        0
base092roro   0.44   0.44       220   280   0      0        0        0
base091roro   0.37   0.37       185   315   0      0        0        0
Detecting wrong answers
System        c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination   0.56   0.56       278   222   0      0        0        0
loga091dede   0.44   0.40       186   221   93     16       68       9
loga092dede   0.44   0.40       187   230   83     12       62       9
base092dede   0.38   0.38       189   311   0      0        0        0
base091dede   0.35   0.35       174   326   0      0        0        0

Maintaining the number of correct answers, the candidate answer was not correct for 83% of unanswered questions.
A very good step towards improving the system.
IR important, not enough
System c@1 Accuracy #R #W #NoA #NoA R #NoA W #NoA empty
combination 0.9 0.9 451 49 0 0 0 0
uned092enen 0.61 0.61 288 184 28 15 12 1
uned091enen 0.6 0.59 282 190 28 15 13 0
nlel091enen 0.58 0.57 287 211 2 0 0 2
uaic092enen 0.54 0.52 243 204 53 18 35 0
base092enen 0.53 0.53 263 236 1 1 0 0
base091enen 0.51 0.51 256 243 1 0 1 0
elix092enen 0.48 0.48 240 260 0 0 0 0
uaic091enen 0.44 0.42 200 253 47 11 36 0
elix091enen 0.42 0.42 211 289 0 0 0 0
syna091enen 0.28 0.28 141 359 0 0 0 0
isik091enen 0.25 0.25 126 374 0 0 0 0
iiit091enen 0.2 0.11 54 37 409 0 11 398
elix092euen 0.18 0.18 91 409 0 0 0 0
elix091euen 0.16 0.16 78 422 0 0 0 0
Achievable Task
Perfect combination is 50% better than best system
Many systems under the IR baselines
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009
Conclusion
Conclusion
A new QA evaluation setting, assuming that leaving a question unanswered has more value than giving a wrong answer.

This assumption gives space for the further development of QA systems, and hopefully improves their performance.
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)