Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations for Natural Language Tasks
Rion Snow, Brendan O’Connor, Daniel Jurafsky, Andrew Y. Ng
The primacy of data
(Banko and Brill, 2001): Scaling to Very Very Large Corpora
for Natural Language Disambiguation
Datasets drive research
• Penn Treebank => statistical parsing
• PropBank => semantic role labeling
• Switchboard => speech recognition
• UN Parallel Text => statistical machine translation
• Pascal RTE => textual entailment
• WordNet / SemCor => word sense disambiguation
The advent of human computation
• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)
Amazon Mechanical Turk
But what if your task isn’t “fun”?
mturk.com
Using AMT for dataset creation
• Su et al. (2007): name resolution, attribute extraction
• Nakov (2008): paraphrasing noun compounds
• Kaisser and Lowe (2008): sentence-level QA annotation
• Kaisser et al. (2008): customizing QA summary length
• Zaenen (2008): evaluating RTE agreement
Using AMT is cheap
Paper                     Labels    Cents/Label
Su et al. (2007)          10,500    1.5
Nakov (2008)              19,018    unreported
Kaisser and Lowe (2008)   24,321    2.0
Kaisser et al. (2008)     45,300    3.7
Zaenen (2008)              4,000    2.0
And it’s fast...
blog.doloreslabs.com
But is it good?
• Objective: compare nonexpert annotation quality on NLP tasks with gold standard, expert-annotated data
• Method: pick 5 standard datasets, and relabel each point with 10 new annotations
• Compare Turker agreement to each dataset’s reported expert interannotator agreement
Tasks
• Affect Recognition (Strapparava and Mihalcea, 2007)
• Word Similarity (Miller and Charles, 1991)
• Textual Entailment (Dagan et al., 2006)
• WSD (Pradhan et al., 2007)
• Temporal Annotation (Pustejovsky et al., 2003)
sim(boy, lad) > sim(rooster, noon)
ran happens before fell in: “The horse ran past the barn fell.”
“a bass on the line” vs. “a funky bass line”
if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?
fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)
Tasks

Task                  Expert Labelers   Unique Examples   Interannotator Agreement   Answer Type
Affect Recognition    6                 700               0.603                      numeric
Word Similarity       1                 30                0.958                      numeric
Textual Entailment    1                 800               0.91                       binary
Temporal Annotation   1                 462               Unknown                    binary
WSD                   1                 177               Unknown                    ternary
Affect Recognition
Interannotator Agreement
• 6 total experts.
• One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the avg. of the other 5 annotators.
Emotion    1-E ITA
Anger      0.459
Disgust    0.583
Fear       0.711
Joy        0.596
Sadness    0.645
Surprise   0.464
Valence    0.844
All        0.603
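For reference, here is a minimal Python sketch of this leave-one-out expert ITA computation; the `expert_scores` array is a hypothetical stand-in for the six experts’ numeric labels on the 700 headlines.

```python
import numpy as np
from scipy.stats import pearsonr

def expert_ita(expert_scores):
    """Average leave-one-out ITA: correlate each annotator's scores
    with the mean of the remaining annotators' scores.

    expert_scores: array of shape (n_annotators, n_examples)
    """
    n_annotators = expert_scores.shape[0]
    correlations = []
    for i in range(n_annotators):
        others_mean = np.delete(expert_scores, i, axis=0).mean(axis=0)
        r, _ = pearsonr(expert_scores[i], others_mean)
        correlations.append(r)
    return float(np.mean(correlations))

# Hypothetical data: 6 experts scoring 700 headlines on one emotion.
scores = np.random.rand(6, 700)
print(expert_ita(scores))
```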
Nonexpert ITA
We average over k annotations to create a single “proto-labeler”. We plot the ITA of this proto-labeler for up to 10 annotations and compare it to the average single-expert ITA.
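A minimal sketch of this proto-labeler construction, assuming hypothetical arrays `nonexpert_scores` (ten Turker annotations per example) and `expert_mean` (the experts’ averaged score per example):

```python
import numpy as np
from scipy.stats import pearsonr

def proto_labeler_ita(nonexpert_scores, expert_mean, k):
    """Average k nonexpert annotations per example into one proto-labeler
    and correlate it with the expert average.

    nonexpert_scores: array of shape (n_annotations, n_examples)
    expert_mean:      array of shape (n_examples,)
    """
    # For simplicity this takes the first k annotations; one could
    # instead average over many random subsets of size k.
    proto = nonexpert_scores[:k].mean(axis=0)
    r, _ = pearsonr(proto, expert_mean)
    return r

# Hypothetical data: 10 Turker scores and an expert average for 700 headlines.
turk = np.random.rand(10, 700)
gold = np.random.rand(700)
for k in range(1, 11):
    print(k, proto_labeler_ita(turk, gold, k))
```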
Interannotator Agreement

Emotion    1-E ITA   10-N ITA
Anger      0.459     0.675
Disgust    0.583     0.746
Fear       0.711     0.689
Joy        0.596     0.632
Sadness    0.645     0.776
Surprise   0.464     0.496
Valence    0.844     0.669
All        0.603     0.694
[Plots: proto-labeler correlation vs. number of annotators (1-10), one panel per emotion: anger, disgust, fear, joy, sadness, surprise.]
Number of nonexpert annotators required to match expert ITA, on average: 4
Task                  1-E ITA   10-N ITA
Affect Recognition    0.603     0.694
Word Similarity       0.958     0.952
Textual Entailment    0.91      0.897
Temporal Annotation   Unknown   0.940
WSD                   Unknown   0.994
Interannotator Agreement
[Plots: agreement vs. number of annotators (1-10): word similarity (correlation), RTE (accuracy), before/after temporal ordering (accuracy), WSD (accuracy).]
Error Analysis: WSD
Only 1 “mistake” out of 177 labels:
“The Egyptian president said he would visit Libya today...”
Semeval Task 17 marks this as “executive officer of a firm” sense, while Turkers voted for “head of a country” sense.
Error Analysis: RTE
• Bob Carpenter: “Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong.”
• Bob Carpenter’s full analysis available at“Fool’s Gold Standard”, http://lingpipe-blog.com/
~10 disagreements out of 100:
Close Examples
T: “Google files for its long awaited IPO.”
H: “Google goes public.”
Labeled “TRUE” in PASCAL RTE-1,Turkers vote 6-4 “FALSE”.
T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.
H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.
Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.
Weighting Annotators
• There are a small number of very prolific, very noisy annotators. If we plot each annotator:
[Scatter plot: per-annotator accuracy vs. number of annotations, Task: RTE.]
• We should be able to do better than majority voting.
Weighting Annotators
• To infer the true label x_i, we weight each response y_i^w from annotator w using a small gold-standard training set, choosing the label with the highest posterior log-odds:
    log [ P(x_i=1 | y_i) / P(x_i=0 | y_i) ] = Σ_w log [ P(y_i^w | x_i=1) / P(y_i^w | x_i=0) ] + log [ P(x_i=1) / P(x_i=0) ]
• We estimate each annotator’s response distribution from 5% of the gold-standard test set and evaluate with 20-fold cross-validation.
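A minimal sketch of this gold-calibrated voting for a binary task, assuming a simple per-annotator likelihood model estimated from the gold examples (all names here are hypothetical, not from the released code):

```python
import math
from collections import defaultdict

def estimate_likelihoods(gold_labels, annotations, smoothing=1.0):
    """Estimate P(response | true label) per annotator from gold examples.

    gold_labels: dict example_id -> true label (0 or 1)
    annotations: iterable of (annotator_id, example_id, response) tuples
    """
    counts = defaultdict(lambda: defaultdict(float))
    for worker, example, response in annotations:
        if example in gold_labels:
            counts[worker][(response, gold_labels[example])] += 1.0
    likelihoods = {}
    for worker, c in counts.items():
        probs = {}
        for true in (0, 1):
            total = c[(0, true)] + c[(1, true)] + 2 * smoothing
            for r in (0, 1):
                probs[(r, true)] = (c[(r, true)] + smoothing) / total
        likelihoods[worker] = probs
    return likelihoods

def weighted_vote(responses, likelihoods, prior=0.5):
    """Posterior log-odds vote over one example's responses.

    responses: list of (annotator_id, response) pairs for the example.
    Annotators unseen in the gold set contribute nothing (log-odds 0).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for worker, r in responses:
        if worker in likelihoods:
            p = likelihoods[worker]
            log_odds += math.log(p[(r, 1)] / p[(r, 0)])
    return 1 if log_odds > 0 else 0
```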
Weighting Annotators
[Plots: accuracy vs. number of annotators for RTE and before/after, comparing gold-calibrated weighting to naive majority voting.]
• Several follow-up posts at http://lingpipe-blog.com
RTE: 4.0% avg. accuracy increase
Temporal: 3.4% avg. accuracy increase
Cost Summary

Task                  Total Labels   Cost (USD)   Time (hours)   Labels/USD   Labels/Hour
Affect Recognition    7000           $2.00        5.93           3500         1180.4
Word Similarity       300            $0.20        0.17           1500         1724.1
Textual Entailment    8000           $8.00        89.3           1000         89.59
Temporal Annotation   4620           $13.86       39.9           333.3        115.85
WSD                   1770           $1.76        8.59           1005.7       206.1
All                   21690          $25.82       143.9          840.0        150.7
In Summary
• All collected data and annotator instructions are available at: http://ai.stanford.edu/~rion/annotations
• Summary blog post and comments on the Dolores Labs blog: http://blog.doloreslabs.com
nlp.stanford.edu | ai.stanford.edu | doloreslabs.com
Supplementary Slides
Training systems on nonexpert annotations
• A simple affect recognition classifier trained on the averaged nonexpert votes outperforms one trained on a single expert annotation.
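The slide does not name the classifier; purely as an illustration, here is a sketch that fits a bag-of-words ridge regression to averaged nonexpert emotion scores (the `headlines` and `scores` variables are tiny placeholders, not the released data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Placeholder training data: headline strings and their averaged
# nonexpert scores for one emotion (e.g., fear), on a 0-100 scale.
headlines = ["Tropical storm forms in Atlantic", "Goal delight for Sheva"]
scores = np.array([80.0, 5.0])

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(headlines)

model = Ridge(alpha=1.0)
model.fit(X, scores)

# Score an unseen headline for the same emotion.
print(model.predict(vectorizer.transform(["Storm warning issued for coast"])))
```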
Where are Turkers?
United States   77.1%
India            5.3%
Philippines      2.8%
Canada           2.8%
UK               1.9%
Germany          0.8%
Italy            0.5%
Netherlands      0.5%
Portugal         0.5%
Australia        0.4%
Remaining 7.3% divided among 78 countries / territories
Analysis by Dolores Labs
Who are Turkers?
“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU (behind-the-enemy-lines.blogspot.com)
[Charts: Turker gender, education, age, and annual income distributions.]
Why are Turkers?
“Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU (behind-the-enemy-lines.blogspot.com)
A. To kill time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change / extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen / keep mind sharp
I. Learn English
How much does AMT pay?
“How Much Turking Pays?”, Panos Ipeirotis, NYU (behind-the-enemy-lines.blogspot.com)
Annotation Guidelines: Affective Text
Annotation Guidelines: Word Similarity
Annotation Guidelines: Textual Entailment
Annotation Guidelines: Temporal Ordering
Annotation Guidelines: Word Sense Disambiguation
Example Task: Affect Recognition
We label 100 headlines for each of 7 emotions.
We pay 4 cents for 20 headlines (140 total labels).
Total cost: $2.00
Time to complete: 5.94 hrs
Example Task: Word Similarity
30 word pairs (Rubenstein and Goodenough, 1965)
We pay 10 Turkers 2 cents apiece to score all 30 word pairs.
Total cost: $0.20
Time to complete: 10.4 minutes
Word Similarity ITA
[Plot: proto-labeler correlation vs. number of annotations (1-10) for word similarity.]
• Comparison against multiple annotators (see plot above)
• On average, 4 nonexpert annotators are needed to match the expert ITA.
Datasets lead the way
• WSJ + syntactic annotation = Penn Treebank => statistical parsing
• Brown corpus + sense labeling = SemCor => WSD
• Treebank + role labels = PropBank => SRL
• political speeches + translations = United Nations parallel corpora => statistical machine translation
• more: RTE, TimeBank, ACE/MUC, etc.
Datasets drive research
• Penn Treebank => statistical parsing
• PropBank => semantic role labeling
• Switchboard => speech recognition
• UN Parallel Text => statistical MT
• Enron E-mail Corpus => social network analysis
• Pascal RTE => textual entailment
• WordNet / SemCor => word sense disambiguation