
Evaluating and Improving Inference Rules via Crowdsourcing

Naomi Zeichner

Supervisors: Prof. Ido Dagan & Dr. Meni Adler


Inference Rules – an important component in semantic applications

Question Answering:
  Q: Where was Reagan raised?
  A (text): Reagan was brought up in Dixon.
  Rule: X brought up in Y → X raised in Y
  The rule lets the text entail the hypothesis needed to answer the question.

Information Extraction (Hiring event, slots PERSON and ROLE):
  Text: Bob worked as an analyst for Dell
  Rule: X work as Y → X hired as Y
  Extraction: PERSON = Bob, ROLE = analyst

Current State

• Many algorithms exist for the automatic acquisition of inference rules
• The quality of automatically acquired rules is poor
• We would like an indication of how likely a rule is to produce correct rule applications

Example rules:
  X reside in Y → X live in Y
  X criticize Y → X attack Y
  X reside in Y → X born in Y

Our Goal

An efficient and reliable way to manually assess the validity of inference rules.

Useful for two purposes:
• Dataset for training and evaluation
• Improving the rule base

Outline

1. Inference Rule-Base Evaluation – current state
2. Crowdsourcing Rule Application Annotations – our framework
3. Evaluate & Improve Inference Rule-Base – use cases

Evaluation - What are the options?

1. Impact on an end task (QA, IE, RTE)
   Pro: What interests an inference-system developer
   Con: Such systems have many components addressing multiple phenomena, so it is hard to assess the effect of a single resource

2. Judge rule correctness directly (e.g., X reside in Y → X live in Y, X criticize Y → X attack Y, X reside in Y → X born in Y)
   Pro: Theoretically the most intuitive
   Con: In practice hard to do; often results in low inter-annotator agreement

3. Instance-based evaluation (Szpektor et al., 2007; Bhagat et al., 2007)
   Pro: Simulates the utility of rules in an application; yields high inter-annotator agreement

Instance-Based Evaluation

Annotation flow for a rule application (Rule → find a sentence matching the LHS → generate the RHS):
  1. Does the sentence entail the LHS extraction? If not → Invalid
  2. Is the generated RHS meaningful? If not → Not Entailing
  3. Does the sentence entail the RHS? Yes → Entailing; No → Not Entailing

Example – Rule: X acquire Y → X buy Y
  Sentence: Kim acquired new abilities at school.   LHS extraction: Kim acquired abilities   Phrase: Kim buy abilities.   (incorrect application)
  Sentence: Dropbox acquired Audiogalaxy.   Phrase: Dropbox buy Audiogalaxy.   (Entailing)
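To make the annotation logic concrete, here is a minimal sketch of the decision flow just described. It is illustrative only: the three boolean judgments would in practice come from human annotators, and the function and label names are assumptions, not part of the original framework.

```python
# Minimal sketch of the instance-based annotation decision flow.
def judge_rule_application(sentence_entails_lhs: bool,
                           rhs_is_meaningful: bool,
                           sentence_entails_rhs: bool) -> str:
    """Map the three annotator judgments to a final label for one rule application."""
    if not sentence_entails_lhs:      # e.g. a parse error produced a wrong LHS extraction
        return "invalid"
    if not rhs_is_meaningful:         # meaningless RHS phrase
        return "not-entailing"
    return "entailing" if sentence_entails_rhs else "not-entailing"

# Illustrative call, loosely following the "Kim acquired new abilities" example above.
print(judge_rule_application(True, False, False))   # -> not-entailing
```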

Instance-Based Evaluation – Issues

• Complex
• Requires lengthy guidelines & training
• Hard to replicate
• Szpektor reported 43%

Crowdsourcing

• Recent trend of using crowdsourcing for annotation tasks
• Requires tasks to be: coherent, simple
• Does not allow for: long instructions, extensive training

Requirements Summary

Replicable & Reliable
• Rule applications: a good representation of rule use; coherent
• Annotation process: simple; communicates entailment without lengthy guidelines and training

Outline (next: Crowdsourcing Rule Application Annotations – our framework)

Overview

Pipeline: Rule Base → Generation → Rule Applications → Crowdsourcing → Annotated Rule Applications

Overview - Generation

Generation step: for each rule in the rule base, find a sentence matching the LHS and generate the corresponding RHS, producing rule applications to be crowdsourced.

Rule Application Generation

Rule: X shoot Y → X attack Y
Templates are dependency-parse fragments:
  LHS: X:N ←subj– shoot:V –obj→ Y:N
  RHS: X:N ←subj– attack:V –obj→ Y:N

Rule Application Generation – Sentence Extraction

Sentence: The bank manager shoots one of the robbers.
Dependency parse: manager ←subj– shoot –obj→ one –of→ robbers (with "The", "bank", "the" attached below).
The LHS template matches with X = manager and Y = one (of the robbers).

Rule Application Generation – RHS Phrase Generation

RHS template: X:N ←subj– attack:V –obj→ Y:N
Sentence: The bank manager shoots one of the robbers.
Instantiation steps:
  1. Linearize the RHS template: "X attack Y"
  2. Substitute the matched argument heads: "manager attack one"
  3. Expand each argument to its full sub-tree: "The bank manager attack one of the robbers."
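As a rough illustration of this generation step (matching the LHS template against a dependency parse, then instantiating the RHS with the full argument sub-trees), here is a simplified sketch. The toy parse representation, token indices, and function names are assumptions for illustration, not the actual implementation.

```python
from collections import defaultdict

# Toy dependency parse of "The bank manager shoots one of the robbers."
# TOKENS maps token index -> word; EDGES are (head_index, relation, dependent_index).
TOKENS = {1: "The", 2: "bank", 3: "manager", 4: "shoots", 5: "one", 6: "of", 7: "the", 8: "robbers"}
EDGES = [(4, "subj", 3), (3, "det", 1), (3, "nn", 2),
         (4, "obj", 5), (5, "mod", 6), (6, "pcomp-n", 8), (8, "det", 7)]

children = defaultdict(list)
for head, rel, dep in EDGES:
    children[head].append(dep)

def subtree(root):
    """Return the yield of a node and all its descendants, in sentence order."""
    idxs, stack = [root], list(children[root])
    while stack:
        node = stack.pop()
        idxs.append(node)
        stack.extend(children[node])
    return " ".join(TOKENS[i] for i in sorted(idxs))

def match_lhs(verb_idx):
    """Match the template 'X <-subj- verb -obj-> Y' at a given verb node."""
    args = {rel: dep for head, rel, dep in EDGES if head == verb_idx and rel in ("subj", "obj")}
    if {"subj", "obj"} <= args.keys():
        return {"X": args["subj"], "Y": args["obj"]}
    return None

slots = match_lhs(4)  # the node for "shoots"
if slots:
    print(TOKENS[slots["X"]], "attack", TOKENS[slots["Y"]])    # manager attack one
    print(subtree(slots["X"]), "attack", subtree(slots["Y"]))  # The bank manager attack one of the robbers
```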

Rule Application Generation – Sentence Filtering

Problem: the LHS phrase is not always entailed by the sentence (Sentence ⊭ LHS)
  Cause: parsing errors
  Example – Sentence: They were first used as fighting dogs.   LHS extraction: they fight dogs
Solution: verify the sentence parse; 53% of sentences were filtered out
Bonus: also filters out ungrammatical sentences

Overview - Crowdsourcing

Crowdsourcing step, for each generated rule application:
  1. Is the RHS phrase meaningful? If not → Not Entailing
  2. Does the sentence entail the RHS? Yes → Entailing; No → Not Entailing

Crowdsourcing: Simplify the Process

Example rule applications:
  Rule: X greet Y → X marry Y     Sentence: Mr. Monk visits her, and she greets him with real joy.   Phrase: she marry him
  Rule: X acquire Y → X buy Y     Sentence: Kim acquired new abilities at school.                    Phrase: Kim buy abilities
  Rule: X shoot Y → X attack Y    Sentence: The bank manager shoots one of the robbers.              Phrase: The bank manager attack one of the robbers.

The annotation is split into two simple tasks:

Task 1 – Is a phrase meaningful?
  she marry him
  Kim buy abilities
  The bank manager attack one of the robbers.

Task 2 – Judge whether a phrase is true given a sentence (only phrases judged meaningful in Task 1):
  Sentence: Mr. Monk visits her, and she greets him with real joy.   Phrase: she marry him
  Sentence: The bank manager shoots one of the robbers.              Phrase: The bank manager attack one of the robbers.

Crowdsourcing: Communicate Entailment

Gold standard – annotated rule applications, used in two ways:

1. Educating: "confusing" examples used as gold, with feedback if Turkers get them wrong
   Sentence: Michelle thinks like an artist
   Phrase: Michelle behave like an artist
   Feedback: No. It is quite possible for someone to think like an artist but not behave like an artist.

2. Enforcing: unanimous examples used as gold to estimate Turker reliability

Crowdsourcing: Aggregate Annotation

• Each rule application is evaluated by 3 Turkers
• Annotations are aggregated by either:
  – Majority vote
  – Bias correction for non-expert annotators (Snow et al., 2008)

Crowdsourcing: Aggregate Annotation – Snow's Method

x_i – "true" label of example i;  y_iw – label provided by worker w to example i;  y_i1…y_iW – all worker labels for example i

Aggregated label:
    x_i = Y  if  posterior_ratio(i) > 0,  N otherwise

    posterior_ratio(i) = log [ P(x_i = Y | y_i1…y_iW) / P(x_i = N | y_i1…y_iW) ]
                       = Σ_{w∈W} log [ P(y_iw | x_i = Y) / P(y_iw | x_i = N) ] + log [ P(x_i = Y) / P(x_i = N) ]

Worker probabilities P(y_iw = Y | x_i = Y), P(y_iw = N | x_i = Y), P(y_iw = Y | x_i = N), P(y_iw = N | x_i = N):
estimates of worker w's probability of answering Y or N given the true label x_i, calculated from the worker's performance on expert-annotated examples.

Prior P(x_i): assumed uniform, so the prior term drops out.
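A minimal sketch of both aggregation options (majority vote, and the Snow-style log-odds with worker confusion probabilities estimated on gold examples). The data layout (dicts keyed by example and worker ids), the Laplace smoothing constant, and the uniform prior handling are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# labels[(example_id, worker_id)] = "Y" or "N";  gold[example_id] = expert label for the gold subset.

def majority_vote(labels, example_id, workers):
    """Pick the most frequent answer given by the workers who judged this example."""
    votes = Counter(labels[(example_id, w)] for w in workers)
    return votes.most_common(1)[0][0]

def worker_confusions(labels, gold, smoothing=1.0):
    """Estimate P(worker answer | true label) per worker from the gold subset (Laplace-smoothed)."""
    counts = defaultdict(Counter)
    for (ex, w), ans in labels.items():
        if ex in gold:
            counts[w][(ans, gold[ex])] += 1
    probs = {}
    for w, c in counts.items():
        probs[w] = {}
        for true in ("Y", "N"):
            total = c[("Y", true)] + c[("N", true)] + 2 * smoothing
            for ans in ("Y", "N"):
                probs[w][(ans, true)] = (c[(ans, true)] + smoothing) / total
    return probs

def snow_label(labels, example_id, workers, probs):
    """Sum per-worker log-likelihood ratios; a uniform prior contributes nothing to the sum."""
    log_odds = 0.0
    for w in workers:
        ans = labels[(example_id, w)]
        log_odds += math.log(probs[w][(ans, "Y")] / probs[w][(ans, "N")])
    return "Y" if log_odds > 0 else "N"
```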

Crowdsourcing: Evaluation

Agreement between Turker and expert annotations:

  Task     Agreement   Kappa
  Task 1   88%         0.70
  Task 2   93%         0.78*

* considerably higher than the 0.65 kappa reported by Szpektor et al. (2007)

Requirements Summary

Replicable & Reliable – crowdsourcing
• Rule applications: a good representation of rule use; coherent
• Annotation process: simple; communicates entailment without lengthy guidelines and training

How the requirements were met:
  – Parsing validation
  – Simple Wikipedia & argument sub-trees
  – Split tasks
  – Gold standard

Outline (next: Evaluate & Improve Inference Rule-Base – use cases)

Our Goals

Find an efficient and reliable way to manually assess the validity of inference rules.

Useful for two purposes:
• Dataset for training and evaluation → Use Case 1: Evaluating Rule Acquisition Methods
• Improving the rule base → Use Case 2: Improving the Accuracy Estimates of Automatically Acquired Inference Rules

Use Case 1: Data Set

• A supplementary study derived from this work (Zeichner et al., 2012)
• Generated rule applications using four inference-rule learning methods
• Annotated each rule application using our framework
• After some filtering, 6,567 rule applications remained

Use Case 1: Output

• Task 1: 1,012 meaningless phrases (counted as non-entailment); 5,555 meaningful phrases (passed to Task 2)
• Task 2: 2,447 positive entailment; 3,108 negative entailment
• Overall: 6,567 rule applications, annotated for $1000 in about a week

Use Case 1: Algorithm Comparison

  Algorithm                          AUC
  DIRT (Lin and Pantel, 2001)        0.40
  Cover (Weeds and Weir, 2003)       0.43
  BInc (Szpektor and Dagan, 2008)    0.44
  Berant (Berant et al., 2010)       0.52

Use Case 1: Results

• A large-scale dataset of rule-application annotations, obtained quickly and at a reasonable cost
• Allowed comparison between different inference-rule learning methods

Use Case 2: Setting

• We follow the evaluation methodology of Szpektor et al. (2008)
• Implemented a naïve Information Extraction (IE) system
• Example – target event: Attack (template X attack Y), expanded with learned rules and their scores:
    X shoot Y → X attack Y      0.245773
    X bomb Y → X attack Y       0.30322
    X destroy Y → X attack Y    0.298797
  Matching sentence: Banks was convicted of shooting and killing a 16-year-old at a park in 1980

Use Case 2: Rule Re-Scoring Methods

• Original Score (baseline): the score produced by the rule-learning algorithm
• Crowd Score: the fraction of the rule's judged instantiations that were annotated as entailing
• Combined Score: a linear combination of the Crowd Score and the rule-learning score
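A minimal sketch of these three scores, assuming a per-rule list of crowd judgments and the learner's original score. The slides only say "linear combination", so the mixing weight `alpha` below is an illustrative assumption.

```python
def crowd_score(judgments):
    """Fraction of a rule's judged instantiations annotated as entailing."""
    return sum(1 for j in judgments if j == "entailing") / len(judgments)

def combined_score(crowd, original, alpha=0.5):
    """Linear combination of the crowd score and the rule-learning score (alpha is illustrative)."""
    return alpha * crowd + (1 - alpha) * original

# Example: a rule judged on 10 instantiations, 7 entailing, with a learner score of 0.30.
c = crowd_score(["entailing"] * 7 + ["not-entailing"] * 3)
print(c, combined_score(c, 0.30))   # 0.7 0.5
```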

Use Case 2: Context-Specific Instructions

• The context in which rules will be used must be reflected in the crowdsourced annotations
• Example – event End-Position, template X fire Y: "fire" can mean X shoot Y or X dismiss Y, and only the dismiss sense is relevant to the event
• Annotation guidelines were adapted so that judgments take this context into account

Use Case 2: Evaluation

• Comparison with manual expert ranking
• Performance on an Information Extraction (IE) task

Use Case 2: Evaluation – Manual Ranking

Top-ranked rules for the event Sentence:

  Original Score                        Crowd Score
  convict X of Y → sentence Y to X      convict X of Y → sentence X for Y
  convict X of Y → sentence X for Y     condemn X to Y → sentence Y to X
  X guilty of Y → sentence X for Y      X serve Y → sentence Y to X
  X order Y → X sentence Y              convict X of Y → sentence Y to X
  convict X of Y → Y sentence X         convict X of Y → sentence Y in X

Mean Average Precision – Original Score: 0.47, Crowd Score: 0.80

Use Case 2: Evaluation – IE Performance

Ranking setting – Mean Average Precision:

  Scoring Method    Majority   Snow
  Original Score    0.077      0.077
  Crowd Score       0.115      0.135
  Combined Score    0.118      0.138

Use Cases: Error Analysis – Crowdsourced Annotation Performance

• Ambiguity
  Sentence: members disagree with leadership
  Phrase: members take exception to leadership   ("take exception to" = raise an objection vs. take offense)

• Entailment definition
  Sentence: A doctor claimed he died of stomach cancer
  Phrase: he die of stomach cancer

Use Cases: Error Analysis – Rule-Base Performance on the IE Task

• Corpora differences
  Event: Arrest-Jail    Rule: X capture Y → X arrest Y
  From the IE corpus:
    Sentence: American commandos captured a half brother of Saddam Hussein on Thursday
    Phrase: commandos arrest half brother
  From Simple Wikipedia:
    Sentence: In 1622 AD, Nurhaci's armies captured Guang Ning.
    Phrase: Nurhaci's armies arrest Guang Ning

Future Work

• A better corpus to use for rule-application generation
• Use the framework to determine rule context
• Treat rule-base ranking as a machine-learning problem ("learning to rank")

Conclusion

• A replicable framework
• High-quality annotations obtained quickly and at reasonable cost
• Hopefully encourages the use of inference rules

Thank You

Generation: Creating the RHS Instantiation

• Template linearization – e.g., the dependency template for "attack X in Y" (attack:V –obj→ X:N, attack:V –mod→ in:Prep → Y:N) is linearized into the phrase "attack X in Y".

Instance-Based Evaluation – Decisions

Target: judge whether a rule application is valid or not.

  Rule: X fight Y → X attack Y       Sentence: The American soldiers fought the British troops, in 1775.   Phrase: The American soldiers attack the British troops.
  Rule: X turn in Y → X bring in Y   Sentence: Humans turn in bed during the night.                        Phrase: Humans bring in bed
  Rule: X greet Y → X marry Y        Sentence: Mr. Monk visits her, and she greets him with real joy.      Phrase: she marry him

Crowdsourcing: Aggregate Annotation – Snow's Method (derivation)

Bias correction for non-expert annotators.

Worker probabilities: P(y_iw = Y | x_i = Y), P(y_iw = Y | x_i = N), P(y_iw = N | x_i = Y), P(y_iw = N | x_i = N)

    posterior_ratio(i) = log [ P(x_i = Y | y_i1…y_iW) / P(x_i = N | y_i1…y_iW) ]
                       = log [ P(x_i = Y) Π_{w∈W} P(y_iw | x_i = Y) ] − log [ P(x_i = N) Π_{w∈W} P(y_iw | x_i = N) ]   (Bayes' rule, with worker answers assumed independent)
                       = Σ_{w∈W} log [ P(y_iw | x_i = Y) / P(y_iw | x_i = N) ] + log [ P(x_i = Y) / P(x_i = N) ]

Use Case 1: Data Set - Details

1. Apply each rule-learning method to a set of one billion tuple extractions of the form: Arg1 predicate Arg2
2. Sample 5,000 extractions
3. For each extraction, apply all relevant rules
4. Compare the extractions of each method to the crowdsourced annotations to obtain true-positive (TP), false-positive (FP) and false-negative (FN) counts
5. Calculate Recall and Precision:

   Recall = TP / (TP + FN)        Precision = TP / (TP + FP)
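A small sketch of steps 4-5, under assumed data structures: comparing one method's extractions against the crowdsourced annotations and computing recall and precision. The set-of-ids layout is illustrative; the AUC values in the comparison table would come from collecting such precision/recall points, e.g. by varying the method's confidence threshold.

```python
def precision_recall(predicted, gold_positive, gold_negative):
    """predicted: set of extraction ids produced by a method;
    gold_positive / gold_negative: ids annotated entailing / non-entailing by the crowd."""
    tp = len(predicted & gold_positive)
    fp = len(predicted & gold_negative)
    fn = len(gold_positive - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative numbers only.
print(precision_recall({1, 2, 3, 5}, {1, 2, 4}, {3, 5, 6}))  # (0.5, 0.666...)
```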

Use Case 2: Mean Average Precision (MAP) – expert ranking

1. MAP ← 0
2. for each IE event:
   2.1 AP ← 0
   2.2 TPcum ← 0
   2.3 FPcum ← 0
   2.4 for each rule in the event (in ranked order):
       2.4.1 if the rule is judged correct by the expert:
             2.4.1.1 TPcum ← TPcum + 1
             2.4.1.2 AP ← AP + [TPcum / (TPcum + FPcum)]     (cumulative precision at this rank)
       2.4.2 else:
             2.4.2.1 FPcum ← FPcum + 1
   2.5 AP ← AP / TPcum
   2.6 MAP ← MAP + AP
3. MAP ← MAP / number of events

AP – average precision;  TP / FP – true / false positive;  TPcum / FPcum – cumulative TP / FP
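The same computation as a runnable sketch. It assumes each event supplies its rules in ranked order as booleans (True = judged correct by the expert); the slide's pseudocode implicitly assumes at least one correct rule per event, so the guard for an all-wrong event is an added assumption.

```python
def mean_average_precision(events):
    """events: one list per IE event of booleans (True = rule judged correct),
    ordered by the ranking being evaluated."""
    ap_values = []
    for ranked_judgments in events:
        tp_cum = fp_cum = 0
        ap = 0.0
        for correct in ranked_judgments:
            if correct:
                tp_cum += 1
                ap += tp_cum / (tp_cum + fp_cum)   # cumulative precision at this rank
            else:
                fp_cum += 1
        ap_values.append(ap / tp_cum if tp_cum else 0.0)
    return sum(ap_values) / len(ap_values)

# Two toy events: rankings [correct, wrong, correct] and [wrong, correct].
print(mean_average_precision([[True, False, True], [False, True]]))  # ~0.667
```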

Use Case 2: Mean Average Precision (MAP) – IE performance

1. MAP ← 0
2. for each IE event:
   2.1 AP ← 0
   2.2 TPcum ← 0
   2.3 FPcum ← 0
   2.4 for each rule in the event:
       2.4.1 FPcum ← FPcum + ruleFP
       2.4.2 ruleTPcum ← 0
       2.4.3 for each ruleTP:
             2.4.3.1 ruleTPcum ← ruleTPcum + 1
             2.4.3.2 AP ← AP + [ruleTPcum / (TPcum + ruleTPcum + FPcum)]     (cumulative precision)
       2.4.4 TPcum ← TPcum + ruleTPcum
   2.5 AP ← AP / TPcum
   2.6 MAP ← MAP + AP
3. MAP ← MAP / number of events

AP – average precision;  TP / FP – true / false positive;  ruleTP / ruleFP – TP / FP for the rule;  TPcum / FPcum – cumulative TP / FP;  ruleTPcum – cumulative TP for the rule

Crowdsourcing: Evaluation - Kappa

• Takes into account the agreement occurring by chance:

    κ = [Pr(a) − Pr(e)] / [1 − Pr(e)]

  Pr(a) – relative observed agreement among raters:
    Pr(a) = (examples agreed on) / (all annotated examples)

  Pr(e) – hypothetical probability of chance agreement:
    Pr(e) = (examples worker A answered Yes / all annotated examples) × (examples worker B answered Yes / all annotated examples)
          + (examples worker A answered No / all annotated examples) × (examples worker B answered No / all annotated examples)
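A small sketch of this two-rater kappa for Yes/No answers, matching the formula above; the parallel-list layout is an assumption for illustration.

```python
def cohens_kappa(a_answers, b_answers):
    """a_answers, b_answers: parallel lists of 'Yes'/'No' answers from two workers."""
    n = len(a_answers)
    pr_a = sum(x == y for x, y in zip(a_answers, b_answers)) / n      # observed agreement
    pr_e = sum((a_answers.count(v) / n) * (b_answers.count(v) / n)    # chance agreement
               for v in ("Yes", "No"))
    return (pr_a - pr_e) / (1 - pr_e)

print(cohens_kappa(["Yes", "Yes", "No", "No"], ["Yes", "No", "No", "No"]))  # 0.5
```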