Transcript

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728, Brussels, Belgium, October 31 – November 4, 2018. © 2018 Association for Computational Linguistics


Pathologies of Neural Models Make Interpretations Difficult

Shi Feng1  Eric Wallace1  Alvin Grissom II2  Mohit Iyyer3,4  Pedro Rodriguez1  Jordan Boyd-Graber1

1University of Maryland  2Ursinus College  3UMass Amherst  4Allen Institute for Artificial Intelligence

{shifeng,ewallac2,entilzha,jbg}@umiacs.umd.edu, [email protected], [email protected]

Abstract

One way to interpret neural model predictions is to highlight the most important input features—for example, a heatmap visualization over the words in an input sentence. In existing interpretation methods for NLP, a word's importance is determined by either input perturbation—measuring the decrease in model confidence when that word is removed—or by the gradient with respect to that word. To understand the limitations of these methods, we use input reduction, which iteratively removes the least important word from the input. This exposes pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods. As we confirm with human experiments, the reduced examples lack information to support the prediction of any label, but models still make the same predictions with high confidence. To explain these counterintuitive results, we draw connections to adversarial examples and confidence calibration: pathological behaviors reveal difficulties in interpreting neural models trained with maximum likelihood. To mitigate their deficiencies, we fine-tune the models by encouraging high-entropy outputs on reduced examples. Fine-tuned models become more interpretable under input reduction without accuracy loss on regular examples.

1 Introduction

Many interpretation methods for neural networks explain the model's prediction as a counterfactual: how does the prediction change when the input is modified? Adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015) highlight the instability of neural network predictions by showing how small perturbations to the input dramatically change the output.

SQuAD
Context:    In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original:   What did Tesla spend Astor's money on ?
Reduced:    did
Confidence: 0.78 → 0.91

Figure 1: SQuAD example from the validation set. Given the original Context, the model makes the same correct prediction ("Colorado Springs experiments") on the Reduced question as the Original, with even higher confidence. For humans, the reduced question, "did", is nonsensical.

A common, non-adversarial form of model interpretation is feature attribution: features that are crucial for predictions are highlighted in a heatmap. One can measure a feature's importance by input perturbation. Given an input for text classification, a word's importance can be measured by the difference in model confidence before and after that word is removed from the input—the word is important if confidence decreases significantly. This is the leave-one-out method (Li et al., 2016b). Gradients can also measure feature importance; for example, a feature is influential to the prediction if its gradient is a large positive value. Both perturbation and gradient-based methods can generate heatmaps, implying that the model's prediction is highly influenced by the highlighted, important words.

Instead, we study how the model's prediction is influenced by the unimportant words. We use input reduction, a process that iteratively removes the unimportant words from the input while maintaining the model's prediction. Intuitively, the words remaining after input reduction should be important for prediction.



Moreover, the words should match the leave-one-out method's selections, which closely align with human perception (Li et al., 2016b; Murdoch et al., 2018). However, rather than providing explanations of the original prediction, our reduced examples more closely resemble adversarial examples. In Figure 1, the reduced input is meaningless to a human but retains the same model prediction with high confidence. Gradient-based input reduction exposes pathological model behaviors that contradict what one expects based on existing interpretation methods.

In Section 2, we construct more of these counterintuitive examples by augmenting input reduction with beam search and experiment with three tasks: SQuAD (Rajpurkar et al., 2016) for reading comprehension, SNLI (Bowman et al., 2015) for textual entailment, and VQA (Antol et al., 2015) for visual question answering. Input reduction with beam search consistently reduces the input sentence to very short lengths—often only one or two words—without lowering model confidence on its original prediction. The reduced examples appear nonsensical to humans, which we verify with crowdsourced experiments. In Section 3, we draw connections to adversarial examples and confidence calibration; we explain why the observed pathologies are a consequence of the overconfidence of neural models. This elucidates limitations of interpretation methods that rely on model confidence. In Section 4, we encourage high model uncertainty on reduced examples with entropy regularization. The pathological model behavior under input reduction is mitigated, leading to more reasonable reduced examples.

2 Input Reduction

To explain model predictions using a set of important words, we must first define importance. After defining input perturbation and gradient-based approximation, we describe input reduction with these importance metrics. Input reduction drastically shortens inputs without causing the model to change its prediction or significantly decrease its confidence. Crowdsourced experiments confirm that reduced examples appear nonsensical to humans: input reduction uncovers pathological model behaviors.

2.1 Importance from Input Gradient

Ribeiro et al. (2016) and Li et al. (2016b) define importance by seeing how confidence changes when a feature is removed; a natural approximation is to use the gradient (Baehrens et al., 2010; Simonyan et al., 2014). We formally define these importance metrics in natural language contexts and introduce the efficient gradient-based approximation. For each word in an input sentence, we measure its importance by the change in the confidence of the original prediction when we remove that word from the sentence. We switch the sign so that when the confidence decreases, the importance value is positive.

Formally, let $\mathbf{x} = \langle x_1, x_2, \dots, x_n \rangle$ denote the input sentence, $f(y \mid \mathbf{x})$ the predicted probability of label $y$, and $y = \operatorname{argmax}_{y'} f(y' \mid \mathbf{x})$ the original predicted label. The importance is then

$$g(x_i \mid \mathbf{x}) = f(y \mid \mathbf{x}) - f(y \mid \mathbf{x}_{-i}). \tag{1}$$

To calculate the importance of each word in a sentence with n words, we need n forward passes of the model, each time with one of the words left out. This is highly inefficient, especially for longer sentences. Instead, we approximate the importance value with the input gradient. For each word in the sentence, we calculate the dot product of its word embedding and the gradient of the output with respect to the embedding. The importance of n words can thus be computed with a single forward-backward pass. This gradient approximation has been used for various interpretation methods for natural language classification models (Li et al., 2016a; Arras et al., 2016); see Ebrahimi et al. (2017) for further details on the derivation. We use this approximation in all our experiments as it selects the same words for removal as an exhaustive search (no approximation).
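As a concrete illustration of this gradient approximation, the sketch below scores each word by the dot product of its embedding with the gradient of the predicted label's probability. The `model` and `embeddings` arguments are hypothetical stand-ins for a real classifier, so treat this as a minimal sketch of the first-order approximation rather than the exact implementation used in the paper.

```python
import torch

def gradient_importance(model, embeddings, label_idx):
    """Approximate the leave-one-out importance of Eq. (1) with one
    forward-backward pass. `model` is assumed to map a (1, n, d) embedding
    tensor to class logits; `embeddings` holds the input word embeddings."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    probs = torch.softmax(model(embeddings), dim=-1)   # (1, num_classes)
    probs[0, label_idx].backward()                     # d f(y|x) / d embeddings
    # First-order Taylor expansion: f(y | x_{-i}) ≈ f(y | x) - e_i · grad_i,
    # so g(x_i | x) ≈ e_i · grad_i (positive when removing the word hurts
    # the model's confidence in its original prediction).
    return (embeddings * embeddings.grad).sum(dim=-1).squeeze(0)  # (n,)
```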

2.2 Removing Unimportant Words

Instead of looking at the words with high importance values—what interpretation methods commonly do—we take a complementary approach and study how the model behaves when the supposedly unimportant words are removed. Intuitively, the important words should remain after the unimportant ones are removed.

Our input reduction process iteratively removes the unimportant words. At each step, we remove the word with the lowest importance value until the model changes its prediction.



We experiment with three popular datasets: SQuAD (Rajpurkar et al., 2016) for reading comprehension, SNLI (Bowman et al., 2015) for textual entailment, and VQA (Antol et al., 2015) for visual question answering. We describe each of these tasks and the model we use below, providing full details in the Supplement.

In SQuAD, each example is a context paragraph and a question. The task is to predict a span in the paragraph as the answer. We reduce only the question while keeping the context paragraph unchanged. The model we use is the DrQA Document Reader (Chen et al., 2017).

In SNLI, each example consists of two sentences: a premise and a hypothesis. The task is to predict one of three relationships: entailment, neutral, or contradiction. We reduce only the hypothesis while keeping the premise unchanged. The model we use is Bilateral Multi-Perspective Matching (BiMPM) (Wang et al., 2017).

In VQA, each example consists of an image and a natural language question. We reduce only the question while keeping the image unchanged. The model we use is Show, Ask, Attend, and Answer (Kazemi and Elqursh, 2017).

During the iterative reduction process, we ensure that the prediction does not change (exact same span for SQuAD); consequently, the model accuracy on the reduced examples is identical to the original. The predicted label is used for input reduction and the ground truth is never revealed. We use the validation set for all three tasks.

Most reduced inputs are nonsensical to humans (Figure 2) as they lack information for any reasonable human prediction. However, models make confident predictions, at times even more confident than the original.

To find the shortest possible reduced inputs (potentially the most meaningless), we relax the requirement of removing only the least important word and augment input reduction with beam search. We limit the removal to the k least important words, where k is the beam size, and decrease the beam size as the remaining input is shortened.1 We empirically select beam size five as it produces comparable results to larger beam sizes with reasonable computation cost. The requirement of maintaining model prediction is unchanged.

1 We set beam size to max(1, min(k, L − 3)), where k is the maximum beam size and L is the current length of the input sentence.
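The sketch below shows one way the reduction loop with beam search could be implemented. The callables `predict` and `importance` are hypothetical stand-ins (for example, the model's argmax prediction and the gradient-based scores above); the paper does not prescribe this exact code, so treat it as an illustrative reading of the procedure and of footnote 1.

```python
import heapq

def input_reduction(words, predict, importance, max_beam=5):
    """Iteratively delete one of the k least important words while the
    original prediction is preserved, keeping the shortest survivors.
    `predict(words)` returns the predicted label and `importance(words)`
    returns one score per word; both are assumed interfaces."""
    original = predict(words)
    shortest = tuple(words)
    beam = [tuple(words)]
    while beam:
        candidates = set()
        for cand in beam:
            cand = list(cand)
            # footnote 1: shrink the beam as the input gets shorter
            k = max(1, min(max_beam, len(cand) - 3))
            scores = importance(cand)
            least = heapq.nsmallest(k, range(len(cand)), key=lambda i: scores[i])
            for i in least:
                reduced = tuple(cand[:i] + cand[i + 1:])
                # keep only candidates that preserve the original prediction
                if reduced and predict(list(reduced)) == original:
                    candidates.add(reduced)
        beam = sorted(candidates, key=len)[:max_beam]
        if beam and len(beam[0]) < len(shortest):
            shortest = beam[0]
    return list(shortest)
```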

SNLI
Premise:    Well dressed man and woman dancing in the street
Original:   Two man is dancing on the street
Reduced:    dancing
Answer:     Contradiction
Confidence: 0.977 → 0.706

VQA
Original:   What color is the flower ?
Reduced:    flower ?
Answer:     yellow
Confidence: 0.827 → 0.819

Figure 2: Examples of original and reduced inputs where the models predict the same Answer. Reduced shows the input after reduction. We remove words from the hypothesis for SNLI, questions for SQuAD and VQA. Given the nonsensical reduced inputs, humans would not be able to provide the answer with high confidence, yet the neural models do.

With beam search, input reduction finds extremely short reduced examples with little to no decrease in the model's confidence on its original predictions. Figure 3 compares the length of input sentences before and after the reduction. For all three tasks, we can often reduce the sentence to only one word. Figure 4 compares the model's confidence on original and reduced inputs. On SQuAD and SNLI the confidence decreases slightly, and on VQA the confidence even increases.

2.3 Humans Confused by Reduced Inputs

On the reduced examples, the models retain their original predictions despite short input lengths. The following experiments examine whether these predictions are justified or pathological, based on how humans react to the reduced inputs.

For each task, we sample 200 examples that are correctly classified by the model and generate their reduced examples. In the first setting, we compare the human accuracy on original and reduced examples. We recruit two groups of crowd workers and task them with textual entailment, reading comprehension, or visual question answering. We show one group the original inputs and the other the reduced.



[Figure 3 shows histograms of example length before and after reduction for SQuAD, SNLI, and VQA; mean lengths drop from 11.5 to 2.3 (SQuAD), 7.5 to 1.5 (SNLI), and 6.2 to 2.3 (VQA).]

Figure 3: Distribution of input sentence length before and after reduction. For all three tasks, the input is often reduced to one or two words without changing the model's prediction.

[Figure 4 shows the density of model confidence on Original vs. Reduced inputs for SQuAD (answer start and end), SNLI, and VQA.]

Figure 4: Density distribution of model confidence on reduced inputs is similar to the original confidence. In SQuAD, we predict the beginning and the end of the answer span, so we show the confidence for both.

Humans are no longer able to give the correct answer, showing a significant accuracy loss on all three tasks (compare Original and Reduced in Table 1).

The second setting examines how random the reduced examples appear to humans. For each of the original examples, we generate a version where words are randomly removed until the length matches the one generated by input reduction. We present the original example along with the two reduced examples and ask crowd workers their preference between the two reduced ones. The workers' choice is almost fifty-fifty (vs. Random in Table 1): the reduced examples appear almost random to humans.
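For concreteness, the random control can be generated as in the sketch below; the function name and seeding are illustrative assumptions, not details specified in the paper.

```python
import random

def random_reduction(words, target_length, seed=0):
    """Remove words uniformly at random until the input matches the length
    produced by input reduction (the control condition described above)."""
    rng = random.Random(seed)
    words = list(words)
    while len(words) > target_length:
        words.pop(rng.randrange(len(words)))
    return words
```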

These results leave us with two puzzles: why are the models highly confident on the nonsensical reduced examples? And why, when the leave-one-out method selects important words that appear reasonable to humans, does input reduction select words that are nonsensical?

Dataset   Original   Reduced   vs. Random
SQuAD     80.58      31.72     53.70
SNLI-E    76.40      27.66     42.31
SNLI-N    55.40      52.66     50.64
SNLI-C    76.20      60.60     49.87
VQA       76.11      40.60     61.60

Table 1: Human accuracy on Reduced examples drops significantly compared to the Original examples; however, model predictions are identical. The reduced examples also appear random to humans—they do not prefer them over random inputs (vs. Random). For SQuAD, accuracy is reported using F1 scores; other numbers are percentages. For SNLI, we report results on the three classes separately: entailment (-E), neutral (-N), and contradiction (-C).

3 Making Sense of Reduced Inputs

Having established the incongruity of our definition of importance vis-à-vis human judgements, we now investigate possible explanations for these results. We explain why model confidence can empower methods such as leave-one-out to generate reasonable interpretations but also lead to pathologies under input reduction. We attribute these results to two issues of neural models.

3.1 Model Overconfidence

Neural models are overconfident in their predictions (Guo et al., 2017). One explanation for overconfidence is overfitting: the model overfits the negative log-likelihood loss during training by learning to output low-entropy distributions over classes. Neural models are also overconfident on examples outside the training data distribution. As Goodfellow et al. (2015) observe for image classification, samples from pure noise can sometimes trigger highly confident predictions.



These so-called rubbish examples are degenerate inputs that a human would trivially classify as not belonging to any class but for which the model predicts with high confidence. Goodfellow et al. (2015) argue that the rubbish examples exist for the same reason that adversarial examples do: the surprisingly linear nature of neural models. In short, the confidence of a neural model is not a robust estimate of its prediction uncertainty.

Our reduced inputs satisfy the definition of rubbish examples: humans have a hard time making predictions based on the reduced inputs (Table 1), but models make predictions with high confidence (Figure 4). Starting from a valid example, input reduction transforms it into a rubbish example.

The nonsensical, almost random results are best explained by looking at a complete reduction path (Figure 5). In this example, the transition from valid to rubbish happens immediately after the first step: following the removal of "Broncos", humans can no longer determine which team the question is asking about, but model confidence remains high. Not being able to lower its confidence on rubbish examples—as it is not trained to do so—the model neglects "Broncos" and eventually the process generates nonsensical results.

In this example, the leave-one-out method will not highlight "Broncos". However, this is not a failure of the interpretation method but of the model itself. The model assigns a low importance to "Broncos" in the first step, causing it to be removed—leave-one-out would be able to expose this particular issue by not highlighting "Broncos". However, in cases where a similar issue only appears after a few unimportant words are removed, the leave-one-out method would fail to expose the unreasonable model behavior.

Input reduction can expose deeper issues of model overconfidence and stress test a model's uncertainty estimation and interpretability.

3.2 Second-order Sensitivity

So far, we have seen that the output of a neural model is sensitive to small changes in its input. We call this first-order sensitivity, because interpretation based on the input gradient is a first-order Taylor expansion of the model near the input (Simonyan et al., 2014). However, the interpretation also shifts drastically with small input changes (Figure 6). We call this second-order sensitivity.

The shifting heatmap suggests a mismatch between the model's first- and second-order sensitivities.

SQuAD
Context: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Stanford University and stayed at the Santa Clara Marriott.

Question:
(0.90, 0.89) Where did the Broncos practice for the Super Bowl ?
(0.92, 0.88) Where did the practice for the Super Bowl ?
(0.91, 0.88) Where did practice for the Super Bowl ?
(0.92, 0.89) Where did practice the Super Bowl ?
(0.94, 0.90) Where did practice the Super ?
(0.93, 0.90) Where did practice Super ?
(0.40, 0.50) did practice Super ?

Figure 5: A reduction path for a SQuAD validation example. The model prediction is always correct and its confidence stays high (shown on the left in parentheses) throughout the reduction. Each line shows the input at that step, with an underline indicating the word to remove next. The question becomes unanswerable immediately after "Broncos" is removed in the first step. However, in the context of the original question, "Broncos" is the least important word according to the input gradient.

The heatmap shifts when, with respect to the removed word, the model has low first-order sensitivity but high second-order sensitivity.

Similar issues complicate comparable interpretation methods for image classification models. For example, Ghorbani et al. (2017) modify image inputs so the highlighted features in the interpretation change while maintaining the same prediction. To achieve this, they iteratively modify the input to maximize changes in the distribution of feature importance. In contrast, the shifting heatmap we observe occurs by only removing the least impactful features, without a targeted optimization. They also speculate that the steepest gradient directions for the first- and second-order sensitivity values are generally orthogonal. Loosely speaking, the shifting heatmap suggests that the direction of the smallest gradient value can sometimes align with very steep changes in second-order sensitivity.

When explaining individual model predictions, the heatmap suggests that the prediction is made based on a weighted combination of words, as in a linear model, which is not true unless the model is indeed taking a weighted sum, such as in a DAN (Iyyer et al., 2015). When the model composes representations by a non-linear combination of words, a linear interpretation oblivious to second-order sensitivity can be misleading.



SQuAD
Context: QuickBooks sponsored a "Small Business Big Game" contest, in which Death Wish Coffee had a 30-second commercial aired free of charge courtesy of QuickBooks. Death Wish Coffee beat out nine other contenders from across the United States for the free advertisement.

Question:
What company won free advertisement due to QuickBooks contest ?
What company won free advertisement due to QuickBooks ?
What company won free advertisement due to ?
What company won free due to ?
What won free due to ?
What won due to ?
What won due to
What won due
What won
What

Figure 6: Heatmap generated with leave-one-out shifts drastically despite only removing the least important word (underlined) at each step. For instance, "advertisement" is the most important word in step two but becomes the least important in step three.

4 Mitigating Model Pathologies

The previous section explains the observed pathologies from the perspective of overconfidence: models are too certain on rubbish examples when they should not make any prediction. Human experiments in Section 2.3 confirm that the reduced examples fit the definition of rubbish examples. Hence, a natural way to mitigate the pathologies is to maximize model uncertainty on the reduced examples.

4.1 Regularization on Reduced Inputs

To maximize model uncertainty on reduced examples, we use the entropy of the output distribution as an objective. Given a model f trained on a dataset (X, Y), we generate reduced examples using input reduction for all training examples in X. Beam search often yields multiple reduced versions with the same minimum length for each input x, and we collect all of these versions together to form the "negative" example set X̃.

Let $H(\cdot)$ denote the entropy and $f(y \mid x)$ the probability of the model predicting $y$ given $x$. We fine-tune the existing model to simultaneously maximize the log-likelihood on regular examples and the entropy on reduced examples:

$$\sum_{(x,\, y) \in (X,\, Y)} \log f(y \mid x) \;+\; \lambda \sum_{x \in \tilde{X}} H\big(f(y \mid x)\big), \tag{2}$$

where the hyperparameter $\lambda$ controls the trade-off between the two terms.

          Accuracy          Reduced length
          Before    After   Before    After
SQuAD     77.41     78.03   2.27      4.97
SNLI      85.71     85.72   1.50      2.20
VQA       61.61     61.54   2.30      2.87

Table 2: Model accuracy on regular validation examples remains largely unchanged after fine-tuning. However, the length of the reduced examples (Reduced length) increases on all three tasks, making them less likely to appear nonsensical to humans.

Similar entropy regularization is used by Pereyra et al. (2017), but not in combination with input reduction; their entropy term is calculated on regular examples rather than reduced examples.
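A minimal sketch of the fine-tuning objective in Eq. (2) is shown below, assuming a generic classifier that maps batched inputs to logits; the function and argument names are illustrative, and the reduced examples are assumed to have been pre-generated by input reduction.

```python
import torch
import torch.nn.functional as F

def fine_tuning_loss(model, batch_x, batch_y, reduced_x, lam=1.0):
    """Negated objective of Eq. (2): minimize NLL on regular examples
    while maximizing the output entropy on reduced examples."""
    # log-likelihood term on regular (x, y) pairs
    log_probs = F.log_softmax(model(batch_x), dim=-1)
    nll = F.nll_loss(log_probs, batch_y)

    # entropy term on reduced examples (their original labels are not used)
    red_log_probs = F.log_softmax(model(reduced_x), dim=-1)
    entropy = -(red_log_probs.exp() * red_log_probs).sum(dim=-1).mean()

    # lam is the trade-off hyperparameter (lambda in Eq. 2)
    return nll - lam * entropy
```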

4.2 Regularization Mitigates Pathologies

On regular examples, entropy regularization does no harm to model accuracy, with a slight increase for SQuAD (Accuracy in Table 2).

After entropy regularization, input reduction produces more reasonable reduced inputs (Figure 7). In the SQuAD example from Figure 1, the reduced question changed from "did" to "spend Astor money on ?" after fine-tuning. The average length of reduced examples also increases across all tasks (Reduced length in Table 2). To verify that model overconfidence is indeed mitigated—that the reduced examples are less "rubbish" compared to before fine-tuning—we repeat the human experiments from Section 2.3.

Human accuracy increases across all three tasks (Table 3). We also repeat the vs. Random experiment: we re-generate the random examples to match the lengths of the new reduced examples from input reduction, and find humans now prefer the reduced examples to random ones. The increase in both human performance and preference suggests that the reduced examples are more reasonable; model pathologies have been mitigated.

While these results are promising, it is not clear whether our input reduction method is necessary to achieve them. To provide a baseline, we fine-tune models using inputs randomly reduced to the same lengths as the ones generated by input reduction. This baseline improves neither the model accuracy on regular examples nor interpretability under input reduction (judged by lengths of reduced examples). Input reduction is effective in generating negative examples to counter model overconfidence.



SQuAD
Context:    In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original:   What did Tesla spend Astor's money on ?
Answer:     Colorado Springs experiments
Before:     did
After:      spend Astor money on ?
Confidence: 0.78 → 0.91 → 0.52

SNLI
Premise:    Well dressed man and woman dancing in the street
Original:   Two man is dancing on the street
Answer:     Contradiction
Before:     dancing
After:      two man dancing
Confidence: 0.977 → 0.706 → 0.717

VQA
Original:   What color is the flower ?
Answer:     yellow
Before:     flower ?
After:      What color is flower ?
Confidence: 0.847 → 0.918 → 0.745

Figure 7: SQuAD example from Figure 1, SNLI and VQA (image omitted) examples from Figure 2. We apply input reduction to models both Before and After entropy regularization. The models still predict the same Answer, but the reduced examples after fine-tuning appear more reasonable to humans.

5 Discussion

Rubbish examples have been studied in the image domain (Goodfellow et al., 2015; Nguyen et al., 2015), but to our knowledge not for NLP. Our input reduction process gradually transforms a valid input into a rubbish example. We can often determine which word's removal causes the transition to occur—for example, removing "Broncos" in Figure 5. These rubbish examples are particularly interesting, as they are also adversarial: the difference from a valid example is small, unlike image rubbish examples generated from pure noise, which are far outside the training data distribution.

The robustness of NLP models has been studied extensively (Papernot et al., 2016; Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018), and most studies define adversarial examples similarly to the image domain: small perturbations to the input lead to large changes in the output. HotFlip (Ebrahimi et al., 2017) uses a gradient-based approach, similar to image adversarial examples, to flip the model prediction by perturbing a few characters or words.

          Accuracy          vs. Random
          Before    After   Before    After
SQuAD     31.72     51.61   53.70     62.75
SNLI-E    27.66     32.37   42.31     50.62
SNLI-N    52.66     50.50   50.64     58.94
SNLI-C    60.60     63.90   49.87     56.92
VQA       40.60     51.85   61.60     61.88

Table 3: Human accuracy increases after fine-tuning the models. Humans also prefer gradient-based reduced examples over randomly reduced ones, indicating that the reduced examples are more meaningful to humans after regularization.

Our work and Belinkov and Bisk (2018) both identify cases where noisy user inputs become adversarial by accident: common misspellings break neural machine translation models; we show that incomplete user input can lead to unreasonably high model confidence.

Other failures of interpretation methods have been explored in the image domain. The sensitivity issue of gradient-based interpretation methods, similar to our shifting heatmaps, is observed by Ghorbani et al. (2017) and Kindermans et al. (2017). They show that various forms of input perturbation—from adversarial changes to simple constant shifts in the image input—cause significant changes in the interpretation. Ghorbani et al. (2017) make a similar observation about second-order sensitivity, that "the fragility of interpretation is orthogonal to fragility of the prediction".

Previous work studies biases in the annotation process that lead to datasets that are easier than desired or expected and that eventually induce pathological models. We attribute our observed pathologies primarily to the lack of accurate uncertainty estimates in neural models trained with maximum likelihood. SNLI hypotheses contain artifacts that allow training a model without the premises (Gururangan et al., 2018); we apply input reduction at test time to the hypothesis. Similarly, VQA images are surprisingly unimportant for training a model; we reduce the question. The recent SQuAD 2.0 (Rajpurkar et al., 2018) augments the original reading comprehension task with an uncertainty modeling requirement, the goal being to make the task more realistic and challenging.

Section 3.1 explains the pathologies from the overconfidence perspective.



One explanation for overconfidence is overfitting: Guo et al. (2017) show that, late in maximum likelihood training, the model learns to minimize loss by outputting low-entropy distributions without improving validation accuracy. To examine if overfitting can explain the input reduction results, we run input reduction using DrQA model checkpoints from every training epoch. Input reduction still achieves similar results on earlier checkpoints, suggesting that better convergence in maximum likelihood training cannot fix the issues by itself—we need new training objectives with uncertainty estimation in mind.

5.1 Methods for Mitigating Pathologies

We use the reduced examples generated by input reduction to regularize the model and improve its interpretability. This resembles adversarial training (Goodfellow et al., 2015), where adversarial examples are added to the training set to improve model robustness. The objectives are different: entropy regularization encourages high uncertainty on rubbish examples, while adversarial training makes the model less sensitive to adversarial perturbations.

Pereyra et al. (2017) apply entropy regularization on regular examples from the start of training to improve model generalization. A similar method is label smoothing (Szegedy et al., 2016). In comparison, we fine-tune a model with entropy regularization on the reduced examples for better uncertainty estimates and interpretations.

To mitigate overconfidence, Guo et al. (2017) propose post-hoc fine-tuning a model's confidence with Platt scaling. This method adjusts the softmax function's temperature parameter using a small held-out dataset to align confidence with accuracy. However, because the output is calibrated using the entire confidence distribution, not individual values, this does not reduce overconfidence on specific inputs, such as the reduced examples.
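For reference, a minimal sketch of this kind of temperature scaling is shown below; it fits a single scalar T on held-out logits by grid search, which is an illustrative simplification rather than the exact procedure of Guo et al. (2017).

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick one temperature T that minimizes held-out negative log-likelihood
    of softmax(logits / T). Because a single T rescales every example's
    confidence, it cannot fix overconfidence on particular inputs such as
    reduced examples."""
    def nll(T):
        scaled = logits / T
        scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()
    return min(grid, key=nll)
```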

5.2 Generalizability of Findings

To highlight the erratic model predictions on short examples and provide a more intuitive demonstration, we present paired-input tasks. On these tasks, the short lengths of reduced questions and hypotheses obviously contradict the necessary number of words for a human prediction (further supported by our human studies). We also apply input reduction to single-input tasks, including sentiment analysis (Maas et al., 2011) and Quizbowl (Boyd-Graber et al., 2012), achieving similar results.

Interestingly, the reduced examples transfer to other architectures. In particular, when we feed fifty reduced SNLI inputs from each class—generated with the BiMPM model (Wang et al., 2017)—through the Decomposable Attention Model (Parikh et al., 2016),2 the same prediction is triggered 81.3% of the time.

6 Conclusion

We introduce input reduction, a process that iteratively removes unimportant words from an input while maintaining a model's prediction. Combined with gradient-based importance estimates often used for interpretations, we expose pathological behaviors of neural models. Without lowering model confidence on its original prediction, an input sentence can be reduced to the point where it appears nonsensical, often consisting of one or two words. Human accuracy degrades when shown the reduced examples instead of the original, in contrast to neural models, which maintain their original predictions.

We explain these pathologies with known issues of neural models: overconfidence and sensitivity to small input changes. The nonsensical reduced examples are caused by inaccurate uncertainty estimates—the model is not able to lower its confidence on inputs that do not belong to any label. Second-order sensitivity is another reason why gradient-based interpretation methods may fail to align with human perception: a small change in the input can cause, at the same time, a minor change in the prediction but a large change in the interpretation. Input reduction perturbs the input multiple times and can expose deeper issues of model overconfidence and oversensitivity that other methods cannot. Therefore, it can be used to stress test the interpretability of a model.

Finally, we fine-tune the models by maximizing entropy on reduced examples to mitigate the deficiencies. This improves interpretability without sacrificing model accuracy on regular examples.

To properly interpret neural models, it is important to understand their fundamental characteristics: the nature of their decision surfaces, robustness against adversaries, and limitations of their training objectives. We explain fundamental difficulties of interpretation due to pathologies in neural models trained with maximum likelihood.

2 http://demo.allennlp.org/textual-entailment



Our work suggests several future directions to improve interpretability: more thorough evaluation of interpretation methods, better uncertainty and confidence estimates, and interpretation beyond bag-of-words heatmaps.

Acknowledgments

Feng was supported under subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113. JBG is supported by NSF Grant IIS1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. The authors would like to thank Hal Daumé III, Alexander M. Rush, Nicolas Papernot, members of the CLIP lab at the University of Maryland, and the anonymous reviewers for their feedback.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision.

Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. Explaining predictions of non-linear classifiers in NLP. In Workshop on Representation Learning for NLP.

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. Journal of Machine Learning Research.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of Empirical Methods in Natural Language Processing.

Jordan L. Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Association for Computational Linguistics.

Amirata Ghorbani, Abubakar Abid, and James Y. Zou. 2017. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference of Machine Learning.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke S. Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Conference of the North American Chapter of the Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of Empirical Methods in Natural Language Processing.

Vahid Kazemi and Ali Elqursh. 2017. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867.

Jiwei Li, Xinlei Chen, Eduard H. Hovy, and Daniel Jurafsky. 2016a. Visualizing and understanding neural models in NLP. In Conference of the North American Chapter of the Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Daniel Jurafsky. 2016b. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Andrew Maas, Raymond Daly, Peter Pham, Dan Huang, Andrew Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In Proceedings of the International Conference on Learning Representations.

Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition.

Nicolas Papernot, Patrick D. McDaniel, Ananthram Swami, and Richard E. Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. IEEE Military Communications Conference.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of Empirical Methods in Natural Language Processing.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In Proceedings of the International Conference on Learning Representations.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the Association for Computational Linguistics.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Computer Vision and Pattern Recognition.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In International Joint Conference on Artificial Intelligence.