Annotation Artifacts in Natural Language Inference Data · DAM ESIM DIIN 85.7 85.2 86.8 64.4 56.2 58.9 76.5 72.1 73.1 Two dogs are running through a field. Contradiction The pets

Natural Language Inference What are NLI models really learning?

Kicking out Premises…

Takeaways

SNLI (Bowman et. al, 2015) 570 K MultiNLI (Williams et. al., 2017) 433 K

★Results consistent with numerous findings of issues with NLP datasets like ROC (Cai et. al., 2017) and VQA (Agrawal et. al., 2016).

★Annotation artifacts could be addressed by improving annotation protocols preemptively to correct for common biases.

NLI models learn from lexical cues rather than entailment semantics.

SNLI

Acc

urac

y (%

)

50

62.5

75

87.5

100

DAM ESIM DIIN

93.492.692.4

72.771.369.4

86.885.884.7

Full Hard Easy

MultiNLI Matched

0

22.5

45

67.5

90

DAM ESIM DIIN

87.686.285.3

64.159.355.8

7774.172

MultiNLI Mismatched

0

22.5

45

67.5

90

DAM ESIM DIIN

86.885.285.7

64.458.956.2

76.573.172.1

Two dogs are running through a field.

Contradiction

The pets are sitting on a couch.

Entailment

There are animals

outdoors.

Neutral

Some puppies are running to catch

a stick.

Premise

Tex

t Cl

assi

fica

tion

A

ccur

acy

(%)

0

17.5

35

52.5

70

SNLI MultiNLI MultiNLI

52.353.9

67

35.235.434.3

Majority ClassHypothesis-only

Matched Mismatched

Annotation Artifacts in Natural Language Inference Data

Suchin Gururangan* Swabha Swayamdipta* Omer Levy Roy Schwartz Samuel R. Bowman Noah A. Smith

* equal contribution

Negation

Older man with white hair and a red cap painting the golden gate bridge on the shore with the golden gate bridge in the distance.

Contradiction

Nobody wears a cap.

Premise

Cats!

Three dogs racing on racetrack.

Contradiction

Three cats race on a track.

Premise

Contradiction Artifacts

Purpose

Clauses

A group of female athletes are huddled together and

excited.

Neutral

They are huddled together

because they are working together.

Premise

Modifiers

A middle-aged man works under the engine of a train on

rail tracks.

Neutral

A man is doing work on a black Amtrack train.

Premise

Neutral Artifacts

Shortening

A person in a red shirt is mowing the grass with a

green riding mower.

Entailment

A person in red is cutting the grass on a

riding mower.

Premise

Generalizatio

n

Some men and boys are playing frisbee in a grassy

area.

Entailment

People play frisbee outdoors.

Premise

Entailment Artifacts

C

E

N

Hypothesis

Premise

C

E

N

Hypothesis

Easy

Hard

Premise

Hard/Easy Classification

Tex

t Cl

assi

fica

tion

SNLI premises are Flickr captions. MultiNLI premises are collected from diverse genre. Hypotheses are crowdsource-generated.

Resources★Hard benchmarks for MultiNLI:

www.kaggle.com/c/multinli-matched-open-hard-evaluation/

★Hard benchmark for SNLI: www.nlp.stanford.edu/projects/snli/snli_1.0_test_hard.jsonl

DAM - Decomposable Attention Model (Parikh et. al. 2016) ESIM - Enhanced Sequential Inference Model (Chen et. al., 2017) DIIN - Densely Interactive Inference Network (Gong et. al. 2018)

Over 50% of NLI examples can be

correctly classified without ever observing the

premise!

http://www.kaggle.com/c/multinli-matched-open-hard-evaluation/

Documents

Annotation Artifacts in Natural Language Inference Data · DAM ESIM DIIN 85.7 85.2 86.8 64.4 56.2 58.9 76.5 72.1 73.1 Two dogs are running through a field. Contradiction The pets