Evaluating Question Answering Validation
Anselmo Peñas (and Alvaro Rodrigo)
NLP & IR group, UNED
nlp.uned.es
Information Science Institute
Marina del Rey, December 11, 2009
Old friends
Question Answering: nothing else than answering a question
Natural Language Understanding: something is there if you are able to answer a question
QA: an extrinsic evaluation for NLU

Suddenly… (see the track?) … the QA Track at TREC
Question Answering at TREC
Object of evaluation itself. Redefined as (roughly speaking):
a highly precision-oriented IR task where NLP was necessary
• Especially for Answer Extraction

Before:
• Knowledge Base (e.g. semantic networks)
• Specific domain
• Single accurate answer (with explanation)
• More Reasoning

After:
• Big document collections (news, blogs)
• Unrestricted domain
• Ranking of answers (linked to documents)
• More Retrieval
What’s this story about?
A timeline of QA tasks at CLEF, 2003–2010:
• Multiple Language QA Main Task, later ResPubliQA (2009)
• Temporal restrictions and lists
• Answer Validation Exercise (AVE)
• GikiCLEF
• Real Time
• QA over Speech Transcriptions (QAST)
• WiQA, WSD QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Outline

1. Analysis of current systems performance
2. Mid-term goals and strategy
3. Evaluation task definition
4. Analysis of the evaluation cycle
• Short cycle: result analysis, methodology analysis
• Long cycle: generation of methodology and evaluation resources, task activation and development
Systems performance, 2003–2006 (Spanish)

Overall: best result < 60%
Definitions: best result > 80%, NOT an IR approach
Pipeline Upper Bounds

SOMETHING to break the pipeline

Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer

1.0 × 0.8 × 0.8 = 0.64

Not enough evidence
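The compounding of stage accuracies can be sketched in a few lines; the per-stage numbers below are the illustrative ones from the slide, not measurements of any real system.

```python
from functools import reduce

# Illustrative per-stage accuracies from the slide:
# question analysis, passage retrieval, answer extraction.
stage_accuracies = [1.0, 0.8, 0.8]

# In a strict pipeline every stage depends on the previous one,
# so errors compound multiplicatively: the product is an upper
# bound on end-to-end accuracy, regardless of the final ranking.
upper_bound = reduce(lambda a, b: a * b, stage_accuracies)
print(round(upper_bound, 2))  # 0.64
```

This is why a validation step that can break the pipeline (rejecting answers and retrying) matters: it removes the hard ceiling imposed by the product.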
Results in CLEF-QA 2006 (Spanish)
Perfect combination: 81%
Best system: 52.5%

Different systems are the best with ORGANIZATION, with PERSON, and with TIME questions.
Collaborative architectures
Different systems answer different types of questions better
• Specialization
• Collaboration
Question → QA sys1, QA sys2, QA sys3, … QA sysn → candidate answers → SOMETHING for combining / selecting → Answer
Collaborative architectures
How to select the good answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not deeper content analysis?
Mid-Term Goal

Goal: improve QA systems performance

New mid-term goal: improve the devices for rejecting / accepting / selecting answers

The new task (2006): validate the correctness of the answers given by real QA systems, the participants at CLEF QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Define Answer Validation
Decide whether an answer is correct or not
More precisely, the task:
Given
• Question
• Answer
• Supporting text
decide if the answer is correct according to the supporting text.

Let's call it the Answer Validation Exercise (AVE)
Wish list

Test collection
• Questions
• Answers
• Supporting texts
• Human assessments

Evaluation measures
Participants
Evaluation linked to main QA task
Question Answering Track → questions, systems' answers, systems' supporting texts → Answer Validation Exercise (ACCEPT / REJECT)

Human judgements (R, W, X, U) → mapping to (ACCEPT / REJECT) → evaluation
QA Track results → AVE Track results

Reuse human assessments
Answer Validation Exercise (AVE)

Question + candidate answer + supporting text → Answer Validation → answer is correct | answer is not correct, or not enough evidence

AVE 2007–2008: validation directly over the (question, answer, supporting text) triple
AVE 2006: automatic hypothesis generation, then textual entailment between text and hypothesis
Outline
Motivation and goals
Definition and general framework
AVE 2006
• Underlying architecture: pipeline
• Evaluating the validation
• As an RTE exercise: text-hypothesis pairs
AVE 2007 & 2008
QA 2009
AVE 2006: An RTE exercise

If the text semantically entails the hypothesis, then the answer is expected to be correct.

Question + exact answer (from a QA system) + supporting snippet → hypothesis + text → entailment?

Is this true? Yes, in 95% of cases with current QA systems (J LOG COMP 2009)
Collections AVE 2006
Available at: nlp.uned.es/clef-qa/ave/
Testing (pairs entail.) Training
English 2088 (10% YES) 2870 (15% YES)
Spanish 2369 (28% YES) 2905 (22% YES)
German 1443 (25% YES)
French 3266 (22% YES)
Italian 1140 (16% YES)
Dutch 807 (10% YES)
Portuguese 1324 (14% YES)
Evaluating the Validation
Validation: decide if each candidate answer is correct or not
• YES | NO

Collections are not balanced

Approach: detect if there is enough evidence to accept an answer
Measures: precision, recall and F over correct answers
Baseline system: accept all answers
Evaluating the Validation
                   Correct Answer   Incorrect Answer
Answer Accepted    n_CA             n_WA
Answer Rejected    n_CR             n_WR

recall = n_CA / (n_CA + n_CR)

precision = n_CA / (n_CA + n_WA)

F = 2 · precision · recall / (precision + recall)
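These validation measures can be sketched as a small function; the counts in the example are invented for illustration, not taken from the AVE results.

```python
def validation_scores(n_ca, n_wa, n_cr, n_wr):
    """Precision, recall and F over correct answers.

    n_ca: correct answers accepted    n_wa: incorrect answers accepted
    n_cr: correct answers rejected    n_wr: incorrect answers rejected
    """
    recall = n_ca / (n_ca + n_cr) if (n_ca + n_cr) else 0.0
    precision = n_ca / (n_ca + n_wa) if (n_ca + n_wa) else 0.0
    f = (2 * precision * recall / (precision + recall)
         if (precision + recall) else 0.0)
    return precision, recall, f

# Illustrative counts: 60 correct accepted, 40 wrong accepted,
# 20 correct rejected, 80 wrong rejected.
p, r, f = validation_scores(60, 40, 20, 80)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.6 0.75 0.67
```

Note that the "accept all answers" baseline gets recall 1.0 by construction, so only precision separates it from real validators on these collections.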
Results AVE 2006
Language     Baseline (F)   Best system (F)   Reported Techniques
English      .27            .44               Logic
Spanish      .45            .61               Logic
German       .39            .54               Lexical, Syntax, Semantics, Logic, Corpus
French       .37            .47               Overlapping, Learning
Dutch        .19            .39               Syntax, Learning
Portuguese   .38            .35               Overlapping
Italian      .29            .41               Overlapping, Learning
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
• Underlying architecture: multi-stream
• Quantify the potential benefit of AV in QA
• Evaluating the correct selection of one answer
• Evaluating the correct rejection of all answers
QA 2009
AVE 2007 & 2008

Question → QA sys1, QA sys2, QA sys3, … QA sysn (participant systems in CLEF-QA) → candidate answers + supporting texts → Answer Validation & Selection → Answer

Evaluation of Answer Validation & Selection
Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
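Reading this collection format can be sketched with the standard library; this is a minimal sketch assuming the structure shown above, with the sample string being a trimmed copy of the slide's example.

```python
import xml.etree.ElementTree as ET

# A trimmed sample in the AVE collection format shown on the slide.
sample = """
<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home
    appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
"""

q = ET.fromstring(sample)
question = q.findtext("q_str").strip()
# Each <a> element carries one candidate answer plus the document
# of its supporting text; the validator judges each one.
candidates = [
    (a.get("id"), a.findtext("a_str").strip(), a.find("t_str").get("doc"))
    for a in q.findall("a")
]
print(question)          # What is Zanussi?
print(candidates[0][0])  # 116_1
```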
Evaluating the Selection
Goals:
• Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
• Develop measures more comparable to QA accuracy

qa_accuracy = n_correctly_answered_questions / n_questions
Evaluating the selection
Given a question with several candidate answers, two options:

Selection: select an answer ≡ try to answer the question
• Correct selection: the answer was correct
• Incorrect selection: the answer was incorrect

Rejection: reject all candidate answers ≡ leave the question unanswered
• Correct rejection: all candidate answers were incorrect
• Incorrect rejection: not all candidate answers were incorrect
Evaluating the Selection
n questions: n = n_CA + n_WA + n_WS + n_WR + n_CR

                                  Question with     Question without
                                  correct answer    correct answer
Question answered correctly
(one answer selected)             n_CA              -
Question answered incorrectly     n_WA              n_WS
Question unanswered
(all answers rejected)            n_WR              n_CR

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n
Evaluating the Selection
qa_accuracy = n_CA / n
rej_accuracy = n_CR / n

accuracy = (n_CA + n_CR) / n

Rewards rejection (collections are not balanced)

Interpretation for QA: all questions correctly rejected by AV will be answered correctly
Evaluating the Selection

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n

estimated = (1/n) · (n_CA + n_CR · (n_CA / n))

Interpretation for QA: questions correctly rejected have value as if they were answered correctly in qa_accuracy proportion
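The selection measures above can be computed together in one place; the counts in the example are illustrative, not taken from the AVE results tables.

```python
def selection_measures(n_ca, n_cr, n):
    """Selection measures from the slides.

    n_ca: questions answered correctly (one answer selected)
    n_cr: questions correctly rejected (all candidates were wrong)
    n:    total number of questions
    """
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    # Plain accuracy rewards rejection as much as answering,
    # which is too generous on unbalanced collections.
    accuracy = (n_ca + n_cr) / n
    # Estimated measure: a correctly rejected question counts as
    # answered correctly only in qa_accuracy proportion.
    estimated = (n_ca + n_cr * (n_ca / n)) / n
    return qa_accuracy, rej_accuracy, accuracy, estimated

# Illustrative counts: 200 correct answers, 100 correct rejections,
# 500 questions in total.
qa, rej, acc, est = selection_measures(n_ca=200, n_cr=100, n=500)
print(round(qa, 2), round(rej, 2), round(acc, 2), round(est, 2))
```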
Techniques in AVE 2007
Generates hypotheses                   6
Wordnet                                3
Chunking                               3
n-grams, longest common subsequences   5
Phrase transformations                 2
NER                                    5
Num. expressions                       6
Temp. expressions                      4
Coreference resolution                 2
Dependency analysis                    3
Syntactic similarity                   4
Functions (sub, obj, etc.)             3
Syntactic transformations              1
Word-sense disambiguation              2
Semantic parsing                       4
Semantic role labeling                 2
First-order logic representation       3
Theorem prover                         3
Semantic similarity                    2
Conclusion of AVE

Answer Validation before:
• It was assumed to be a QA module
• But there was no space for its own development

The new devices should help to improve QA; they:
• Introduce more content analysis
• Use Machine Learning techniques
• Are able to break pipelines or combine streams

Let's transfer them to the QA main task
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009
CLEF QA 2009 campaign
ResPubliQA: QA on European Legislation
GikiCLEF: QA requiring geographical reasoning on Wikipedia
QAST: QA on Speech Transcriptions of European Parliament Plenary sessions
CLEF QA 2009 campaign
Task        Registered groups    Participant groups   Submitted runs            Organizing people
ResPubliQA  20                   11                   28 + 16 (baseline runs)   9
GikiCLEF    27                   8                    17 runs                   2
QAST        12                   4                    86 (5 subtasks)           8
Total       59 showed interest   23 groups            147 runs evaluated        19 + additional assessors
ResPubliQA 2009: QA on European Legislation

Organizers:
Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

Additional assessors:
Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

Advisory board:
Donna Harman, Maarten de Rijke, Dominique Laurent
Evolution of the task
Year:               2003  2004  2005  2006  2007  2008  2009
Target languages:   3     7     8     9     10    11    8

Collections: news 1994; then + news 1995 + Wikipedia Nov. 2006; then European Legislation (2009)
Number of questions: 200, then 500 (2009)
Type of questions: 200 factoid; + temporal restrictions; + definitions; - type of question; + lists; + linked questions; + closed lists; - linked; + reason; + purpose; + procedure
Supporting information: document, then snippet, then paragraph
Size of answer: snippet, then exact, then paragraph
Collection
Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
500 questions
REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
PROCEDURE: How are stable conditions in the natural rubber trade achieved?

In general, any question that can be answered in a paragraph
500 questions
Also FACTOID:
• In how many languages is the Official Journal of the Community published?
and DEFINITION:
• What is meant by "whole milk"?

No NIL questions
Systems response
No Answer ≠ Wrong Answer

1. Decide whether to answer or not
• [ YES | NO ]
• A classification problem
• Machine learning, provers, etc.
• Textual entailment
2. Provide the paragraph (ID + text) that answers the question

Aim: leaving a question unanswered has more value than giving a wrong answer
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given

Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions correctly answered) that takes unanswered questions into account
Evaluation measure
n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = (1/n) · (n_R + n_U · (n_R / n))
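The c@1 measure is small enough to sketch directly; the worked example uses the icia092roro run from the Romanian results table later in the talk (260 correct, 156 unanswered, 500 questions).

```python
def c_at_1(n_r, n_u, n):
    """c@1: accuracy extended to reward leaving questions unanswered.

    n_r: correctly answered questions
    n_u: unanswered questions
    n:   total number of questions
    """
    # Each unanswered question is credited as if it were answered
    # correctly in n_r / n proportion (the system's accuracy on
    # the questions it did answer over the whole set).
    return (n_r + n_u * (n_r / n)) / n

# icia092roro from the Romanian results: 260 R, 156 NoA, 500 questions.
print(round(c_at_1(260, 156, 500), 2))  # 0.68
```

Note the two degenerate cases from the next slide: with no unanswered questions c@1 reduces to plain accuracy, and answering nothing at all scores 0.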
Evaluation measure
c@1 = (1/n) · (n_R + n_U · (n_R / n))

If n_U = 0 then c@1 = n_R / n (accuracy)
If n_R = 0 then c@1 = 0
If n_U = n then c@1 = 0

Leaving a question unanswered adds value only if this avoids returning a wrong answer.
The added value is the performance shown with the answered questions: accuracy.
List of Participants
System Team
elix   ELHUYAR-IXA, SPAIN
icia   RACAI, ROMANIA
iiit   Search & Info Extraction Lab, INDIA
iles   LIMSI-CNRS-2, FRANCE
isik   ISI-Kolkata, INDIA
loga   U. Koblenz-Landau, GERMANY
mira   MIRACLE, SPAIN
nlel   U. Politècnica de València, SPAIN
syna   Synapse Développement, FRANCE
uaic   Al.I. Cuza U. of Iasi, ROMANIA
uned   UNED, SPAIN
Value of reducing wrong answers
System        c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination   0.76   0.76       381   119   0      0        0        0
icia092roro   0.68   0.52       260   84    156    0        0        156
icia091roro   0.58   0.47       237   156   107    0        0        107
UAIC092roro   0.47   0.47       236   264   0      0        0        0
UAIC091roro   0.45   0.45       227   273   0      0        0        0
base092roro   0.44   0.44       220   280   0      0        0        0
base091roro   0.37   0.37       185   315   0      0        0        0
Detecting wrong answers
System        c@1    Accuracy   #R    #W    #NoA   #NoA R   #NoA W   #NoA empty
combination   0.56   0.56       278   222   0      0        0        0
loga091dede   0.44   0.40       186   221   93     16       68       9
loga092dede   0.44   0.40       187   230   83     12       62       9
base092dede   0.38   0.38       189   311   0      0        0        0
base091dede   0.35   0.35       174   326   0      0        0        0

Maintaining the number of correct answers, the candidate answer was not correct for 83% of unanswered questions.
A very good step towards improving the system.
IR important, not enough
System c@1 Accuracy #R #W #NoA #NoA R #NoA W #NoA empty
combination 0.9 0.9 451 49 0 0 0 0
uned092enen 0.61 0.61 288 184 28 15 12 1
uned091enen 0.6 0.59 282 190 28 15 13 0
nlel091enen 0.58 0.57 287 211 2 0 0 2
uaic092enen 0.54 0.52 243 204 53 18 35 0
base092enen 0.53 0.53 263 236 1 1 0 0
base091enen 0.51 0.51 256 243 1 0 1 0
elix092enen 0.48 0.48 240 260 0 0 0 0
uaic091enen 0.44 0.42 200 253 47 11 36 0
elix091enen 0.42 0.42 211 289 0 0 0 0
syna091enen 0.28 0.28 141 359 0 0 0 0
isik091enen 0.25 0.25 126 374 0 0 0 0
iiit091enen 0.2 0.11 54 37 409 0 11 398
elix092euen 0.18 0.18 91 409 0 0 0 0
elix091euen 0.16 0.16 78 422 0 0 0 0
Achievable Task
Perfect combination is 50% better than best system
Many systems under the IR baselines
Outline
Motivation and goals
Definition and general framework
AVE 2006
AVE 2007 & 2008
QA 2009
Conclusion
Conclusion
A new QA evaluation setting, assuming that leaving a question unanswered has more value than giving a wrong answer.

This assumption gives space for the further development of QA systems, and hopefully improves their performance.
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)