1
Natural Language Inference What are NLI models really learning? Kicking out Premises… Takeaways SNLI (Bowman et. al, 2015) 570 K MultiNLI (Williams et. al., 2017) 433 K Results consistent with numerous findings of issues with NLP datasets like ROC (Cai et. al., 2017) and VQA (Agrawal et. al., 2016). Annotation artifacts could be addressed by improving annotation protocols preemptively to correct for common biases. NLI models learn from lexical cues rather than entailment semantics. SNLI Accuracy (%) 50 62.5 75 87.5 100 DAM ESIM DIIN 93.4 92.6 92.4 72.7 71.3 69.4 86.8 85.8 84.7 Full Hard Easy MultiNLI Matched 0 22.5 45 67.5 90 DAM ESIM DIIN 87.6 86.2 85.3 64.1 59.3 55.8 77 74.1 72 MultiNLI Mismatched 0 22.5 45 67.5 90 DAM ESIM DIIN 86.8 85.2 85.7 64.4 58.9 56.2 76.5 73.1 72.1 Two dogs are running through a field. Contradiction The pets are sitting on a couch. Entailment There are animals outdoors. Neutral Some puppies are running to catch a stick. Premise Text Classification Accuracy (%) 0 17.5 35 52.5 70 SNLI MultiNLI MultiNLI 52.3 53.9 67 35.2 35.4 34.3 Majority Class Hypothesis-only Matched Mismatched Annotation Artifacts in Natural Language Inference Data Suchin Gururangan* Swabha Swayamdipta* Omer Levy Roy Schwartz Samuel R. Bowman Noah A. Smith * equal contribution Negation Older man with white hair and a red cap painting the golden gate bridge on the shore with the golden gate bridge in the distance. Contradiction Nobody wears a cap. Premise Cats! Three dogs racing on racetrack. Contradiction Three cats race on a track. Premise Contradiction Artifacts Purpose Clauses A group of female athletes are huddled together and excited. Neutral They are huddled together because they are working together. Premise Modifiers A middle-aged man works under the engine of a train on rail tracks. Neutral A man is doing work on a black Amtrack train. Premise Neutral Artifacts Shortening A person in a red shirt is mowing the grass with a green riding mower. Entailment A person in red is cutting the grass on a riding mower. Premise Generalization Some men and boys are playing frisbee in a grassy area. Entailment People play frisbee outdoors. Premise Entailment Artifacts C E N Hypothesis Premise C E N Hypothesis Easy Hard Premise Hard/Easy Classification Text Classification SNLI premises are Flickr captions. MultiNLI premises are collected from diverse genre. Hypotheses are crowdsource-generated. Resources Hard benchmarks for MultiNLI: www.kaggle.com/c/multinli-matched-open-hard-evaluation/ Hard benchmark for SNLI: www.nlp.stanford.edu/projects/snli/snli_1.0_test_hard.jsonl DAM - Decomposable Attention Model (Parikh et. al. 2016) ESIM - Enhanced Sequential Inference Model (Chen et. al., 2017) DIIN - Densely Interactive Inference Network (Gong et. al. 2018) Over 50% of NLI examples can be correctly classified without ever observing the premise!

Annotation Artifacts in Natural Language Inference Data · DAM ESIM DIIN 85.7 85.2 86.8 64.4 56.2 58.9 76.5 72.1 73.1 Two dogs are running through a field. Contradiction The pets

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Annotation Artifacts in Natural Language Inference Data · DAM ESIM DIIN 85.7 85.2 86.8 64.4 56.2 58.9 76.5 72.1 73.1 Two dogs are running through a field. Contradiction The pets

Natural Language Inference What are NLI models really learning?

Kicking out Premises…

Takeaways

SNLI (Bowman et. al, 2015) 570 K MultiNLI (Williams et. al., 2017) 433 K

★Results consistent with numerous findings of issues with NLP datasets like ROC (Cai et. al., 2017) and VQA (Agrawal et. al., 2016).

★Annotation artifacts could be addressed by improving annotation protocols preemptively to correct for common biases.

NLI models learn from lexical cues rather than entailment semantics.

SNLI

Acc

urac

y (%

)

50

62.5

75

87.5

100

DAM ESIM DIIN

93.492.692.4

72.771.369.4

86.885.884.7

Full Hard Easy

MultiNLI Matched

0

22.5

45

67.5

90

DAM ESIM DIIN

87.686.285.3

64.159.355.8

7774.172

MultiNLI Mismatched

0

22.5

45

67.5

90

DAM ESIM DIIN

86.885.285.7

64.458.956.2

76.573.172.1

Two dogs are running through a field.

Contradiction

The pets are sitting on a couch.

Entailment

There are animals

outdoors.

Neutral

Some puppies are running to catch

a stick.

Premise

Tex

t Cl

assi

fica

tion

A

ccur

acy

(%)

0

17.5

35

52.5

70

SNLI MultiNLI MultiNLI

52.353.9

67

35.235.434.3

Majority ClassHypothesis-only

Matched Mismatched

Annotation Artifacts in Natural Language Inference Data

Suchin Gururangan* Swabha Swayamdipta* Omer Levy Roy Schwartz Samuel R. Bowman Noah A. Smith

* equal contribution

Negation

Older man with white hair and a red cap painting the golden gate bridge on the shore with the golden gate bridge in the distance.

Contradiction

Nobody wears a cap.

Premise

Cats!

Three dogs racing on racetrack.

Contradiction

Three cats race on a track.

Premise

Contradiction Artifacts

Purpose

Clauses

A group of female athletes are huddled together and

excited.

Neutral

They are huddled together

because they are working together.

Premise

Modifiers

A middle-aged man works under the engine of a train on

rail tracks.

Neutral

A man is doing work on a black Amtrack train.

Premise

Neutral Artifacts

Shortening

A person in a red shirt is mowing the grass with a

green riding mower.

Entailment

A person in red is cutting the grass on a

riding mower.

Premise

Generalizatio

n

Some men and boys are playing frisbee in a grassy

area.

Entailment

People play frisbee outdoors.

Premise

Entailment Artifacts

C

E

N

Hypothesis

Premise

C

E

N

Hypothesis

Easy

Hard

Premise

Hard/Easy Classification

Tex

t Cl

assi

fica

tion

SNLI premises are Flickr captions. MultiNLI premises are collected from diverse genre. Hypotheses are crowdsource-generated.

Resources★Hard benchmarks for MultiNLI:

www.kaggle.com/c/multinli-matched-open-hard-evaluation/

★Hard benchmark for SNLI: www.nlp.stanford.edu/projects/snli/snli_1.0_test_hard.jsonl

DAM - Decomposable Attention Model (Parikh et. al. 2016) ESIM - Enhanced Sequential Inference Model (Chen et. al., 2017) DIIN - Densely Interactive Inference Network (Gong et. al. 2018)

Over 50% of NLI examples can be

correctly classified without ever observing the

premise!