
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples. Transactions of the Association of Computational Linguistics, 2019, 14 pages.

@article{Wallace:Rodriguez:Feng:Yamada:Boyd-Graber-2019,
  Title = {Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples},
  Author = {Eric Wallace and Pedro Rodriguez and Shi Feng and Ikuya Yamada and Jordan Boyd-Graber},
  Journal = {Transactions of the Association of Computational Linguistics},
  Year = {2019},
  Volume = {10},
  Url = {http://umiacs.umd.edu/~jbg//docs/2019_tacl_trick.pdf}
}

Downloaded from http://umiacs.umd.edu/~jbg/docs/2019_tacl_trick.pdf


Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering

Eric Wallace
EE and UMIACS
University of Maryland
[email protected]

Pedro Rodriguez, Shi Feng
CS, UMIACS
University of Maryland
{pedro, shifeng}@umiacs.umd.edu

Ikuya Yamada
Studio Ousia
[email protected]

Jordan Boyd-Graber
CS, iSchool, UMIACS, LSC
University of Maryland
[email protected]

Abstract

Adversarial evaluation stress tests a model’s understanding of natural language. While past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.

1 Introduction

Proponents of machine learning claim human parity on tasks like reading comprehension (Yu et al., 2018) and commonsense inference (Devlin et al., 2018). Despite these successes, many evaluations neglect that computers solve NLP tasks in a fundamentally different way than humans.

Models can succeed without developing “true” language understanding, instead learning superficial patterns from crawled (Chen et al., 2016) or manually annotated datasets (Kaushik and Lipton, 2018; Gururangan et al., 2018). Thus, recent work stress tests models via adversarial evaluation: elucidating a system’s capabilities by exploiting its weaknesses (Jia and Liang, 2017; Belinkov and Glass, 2019). Unfortunately, while adversarial evaluation reveals simplistic model failures (Ribeiro et al., 2018; Mudrakarta et al., 2018), exploring more complex failure patterns requires human involvement (Figure 1): automatically modifying natural language examples without invalidating them is difficult. Hence, the diversity of adversarial examples is often severely restricted.

Instead, our human–computer hybrid approach uses human creativity to generate adversarial examples. A user interface presents model interpretations and helps users craft model-breaking examples (Section 3). We apply this to a question answering (QA) task called Quizbowl, where trivia enthusiasts—who write questions for academic competitions—create diverse examples that stump existing QA models.

The adversarially-authored test set is nonetheless as easy as regular questions for humans (Section 4), but the relative accuracy of strong QA models drops by as much as 40% (Section 5). We also host live human vs. computer matches: although models typically defeat top human teams, we observe spectacular model failures on adversarial questions.

Analyzing the adversarial edits uncovers phenomena that humans can solve but computers cannot (Section 6), validating that our framework uncovers creative, targeted adversarial edits (Section 7). Our resulting adversarial dataset presents a fun, challenging, and diverse resource for future QA research: a system that masters it will demonstrate more robust language understanding.


[Figure 1 diagram. Existing Methods: Hypothesize Phenomenon → Create Model / Get Examples → Human Verification → Adversarial or Not? Our Method: Create Interpretations → Human Authoring → Adversarial or Not? → Categorize Phenomena.]

Figure 1: Adversarial evaluation in NLP typically focuses on a specific phenomenon (e.g., word replacements) and then generates the corresponding examples (top). Consequently, adversarial examples are limited to the diversity of what the underlying generative model or perturbation rule can produce, and also require downstream human evaluation to ensure validity. Our setup (bottom) instead has human-authored examples, using human–computer collaboration to craft adversarial examples with greater diversity.

2 Adversarial Evaluation for NLP

Adversarial examples (Szegedy et al., 2013) often reveal model failures better than traditional test sets. However, automatic adversarial generation is tricky for NLP (e.g., by replacing words) without changing an example’s meaning or invalidating it.

Recent work side-steps this by focusing on simple transformations that preserve meaning. For instance, Ribeiro et al. (2018) generate adversarial perturbations such as replacing What has → What’s. Other minor perturbations such as typos (Belinkov and Bisk, 2018), adding distractor sentences (Jia and Liang, 2017; Mudrakarta et al., 2018), or character replacements (Ebrahimi et al., 2018) preserve meaning while degrading model performance.

Generative models can discover more adversarial perturbations but require post-hoc human verification of the examples. For example, neural paraphrase or language models can generate syntax modifications (Iyyer et al., 2018), plausible captions (Zellers et al., 2018), or NLI premises (Zhao et al., 2018). These methods improve example-level diversity but mainly target a specific phenomenon, e.g., rewriting question syntax.

Furthermore, existing adversarial perturbations are restricted to sentences—not the paragraph inputs of Quizbowl and other tasks—due to challenges in long-text generation. For instance, syntax paraphrase networks (Iyyer et al., 2018) applied to Quizbowl only yield valid paraphrases 3% of the time (Appendix A).

2.1 Putting a Human in the Loop

Instead, we task human authors with adversarial writing of questions: generating examples which break a specific QA system but are still answerable by humans. We expose model predictions and interpretations to question authors, who find question edits that confuse the model.

The user interface makes the adversarial writing process interactive and model-driven, in contrast to adversarial examples written independent of a model (Ettinger et al., 2017). The result is an adversarially-authored dataset that explicitly exposes a model’s limitations by design.

Human-in-the-loop generation can replace or aid model-based adversarial generation approaches. Creating interfaces and interpretations is often easier than designing and training generative models for specific domains. In domains where adversarial generation is feasible, human creativity can reveal which tactics automatic approaches can later emulate. Model-based and human-in-the-loop generation approaches can also be combined by training models to mimic human adversarial edit history, using the relative merits of both approaches.

3 Our QA Testbed: Quizbowl

The “gold standard” of academic competitions between universities and high schools is Quizbowl. Unlike QA formats such as Jeopardy! (Ferrucci et al., 2010), Quizbowl questions are designed to be interrupted: questions are read to two competing teams and whoever knows the answer first interrupts the question and “buzzes in”.

This style of play requires questions to be structured “pyramidally” (Jose, 2017): questions start with difficult clues and get progressively easier. These questions are carefully crafted to allow the most knowledgeable player to answer first. A question on Paris that begins “this capital of France” would test reaction speed, not knowledge; thus, skilled authors arrange the clues so players will recognize them with increasing probability (Figure 2).

The answers to Quizbowl questions are typically well-known entities. In the QA community (Hirschman and Gaizauskas, 2001), this is called “factoid” QA: the entities come from a relatively closed set of possible answers.


The protagonist of this opera describes the future day when her lover will arrive on a boat in the aria “Un Bel Di” or “One Beautiful Day”. The only baritone role in this opera is the consul Sharpless who reads letters for the protagonist, who has a maid named Suzuki. That protagonist blindfolds her child Sorrow before stabbing herself when her lover B. F. Pinkerton returns with a wife. For 10 points, name this Giacomo Puccini opera about an American lieutenant’s affair with the Japanese woman Cio-Cio San.
Answer: Madama Butterfly

Figure 2: An example Quizbowl question. The question becomes progressively easier (for humans) to answer later on; thus, more knowledgeable players can answer after hearing fewer clues. Our adversarial writing process ensures that the clues also challenge computers.

3.1 Known Exploits of Quizbowl Questions

Like most QA datasets, Quizbowl questions are written for humans. Unfortunately, the heuristics that question authors use to select clues do not always apply to computers. For example, humans are unlikely to memorize every song in every opera by a particular composer. This, however, is trivial for a computer. In particular, a simple QA system easily solves the example in Figure 2 from seeing the reference to “Un Bel Di”. Other questions contain uniquely identifying “trigger words” (Harris, 2006). For example, “martensite” only appears in questions on steel. For these examples, a QA system needs to understand no additional information other than an if–then rule.

One might wonder if this means that factoid QA is thus an uninteresting, nearly solved research problem. However, some Quizbowl questions are fiendishly difficult for computers. Many questions have intricate coreference patterns (Guha et al., 2015), require reasoning across multiple types of knowledge, or involve complex wordplay. If we can isolate and generate questions with these difficult phenomena, “simplistic” factoid QA quickly becomes non-trivial.

3.2 Models and Datasets

We conduct two rounds of adversarial writing. In the first, authors attack a traditional Information Retrieval (IR) system. The IR model is the baseline from a NIPS 2017 shared task on Quizbowl (Boyd-Graber et al., 2018) based on ElasticSearch (Gormley and Tong, 2015).

In the second round, authors attack either the IR model or a neural QA model. The neural model is a bidirectional RNN using the gated recurrent unit architecture (Cho et al., 2014). The model treats Quizbowl as classification and predicts the answer entity from a sequence of words represented as 300-dimensional GloVe embeddings (Pennington et al., 2014). Both models in this round are trained using an expanded dataset of approximately 110,000 Quizbowl questions. We expanded the round two dataset to incorporate more diverse answers (25,000 entities versus 11,000 in round one).
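To make the classification setup concrete, a minimal sketch of a bidirectional GRU answer classifier over pretrained word embeddings is shown below. The class name, hidden size, and pooling choice are illustrative assumptions, not the paper’s exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class QuizbowlRNN(nn.Module):
    """Sketch of an RNN that maps question words to an answer entity."""

    def __init__(self, glove: torch.Tensor, num_answers: int, hidden: int = 300):
        super().__init__()
        # `glove` is a (vocab_size, 300) matrix of pretrained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove, freeze=False)
        self.rnn = nn.GRU(glove.size(1), hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_answers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices for a question prefix.
        states, _ = self.rnn(self.embed(token_ids))
        # Mean-pool the hidden states over time, then score every answer entity.
        return self.out(states.mean(dim=1))
```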

3.3 Interpreting Quizbowl Models

To help write adversarial questions, we expose what the model is thinking to the authors. We interpret models using saliency heat maps: each word of the question is highlighted based on its importance to the model’s prediction (Ribeiro et al., 2016).

For the neural model, word importance is the decrease in prediction probability when a word is removed (Li et al., 2016; Wallace et al., 2018). We focus on gradient-based approximations (Simonyan et al., 2014; Montavon et al., 2017) for their computational efficiency.

To interpret a model prediction on an input sequence of $n$ words $w = \langle w_1, w_2, \ldots, w_n \rangle$, we approximate the classifier $f$ with a linear function of $w_i$ derived from the first-order Taylor expansion. The importance of $w_i$, with embedding $v_i$, is the derivative of $f$ with respect to the one-hot vector:

$$\frac{\partial f}{\partial w_i} = \frac{\partial f}{\partial v_i}\,\frac{\partial v_i}{\partial w_i} = \frac{\partial f}{\partial v_i} \cdot v_i. \qquad (1)$$

This simulates how model predictions change when a particular word’s embedding is set to the zero vector, i.e., it approximates word removal (Ebrahimi et al., 2018; Wallace et al., 2018).
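A minimal sketch of this gradient-times-embedding importance score is shown below, assuming a PyTorch model that exposes an embedding step and a classification step; the `embed` and `classify` hooks and tensor shapes are illustrative placeholders rather than the authors’ code.

```python
import torch

def word_importance(model, token_ids: torch.Tensor, answer_idx: int) -> torch.Tensor:
    """Approximate each word's importance as (df/dv_i) . v_i (Equation 1)."""
    # Embed the question and make the embeddings a leaf so gradients are kept.
    vectors = model.embed(token_ids).detach().requires_grad_(True)  # (1, seq_len, dim)
    logits = model.classify(vectors)                                # (1, num_answers)
    # Backpropagate from the score of the answer of interest.
    logits[0, answer_idx].backward()
    # Dot product of gradient and embedding approximates zeroing out each word.
    return (vectors.grad * vectors).sum(dim=-1).squeeze(0).detach()
```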

For the IR model, we use the ElasticSearch Highlight API (Gormley and Tong, 2015), which provides word importance scores based on query matches from the inverted index.
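For illustration, retrieving the highlighted query matches with the Python Elasticsearch client might look like the following sketch; the index name, document field, and client setup are assumptions rather than details given in the paper.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def ir_highlights(question_text: str, index: str = "quizbowl"):
    """Return highlighted fragments for each retrieved training question."""
    response = es.search(
        index=index,
        body={
            "query": {"match": {"text": question_text}},
            # Ask ElasticSearch to mark which terms of the stored `text`
            # field matched the query; these matches drive the Evidence panel.
            "highlight": {"fields": {"text": {}}},
        },
    )
    return [hit.get("highlight", {}).get("text", [])
            for hit in response["hits"]["hits"]]
```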

3.4 Adversarial Writing Interface

The authors interact with either the IR or RNN model through a user interface¹ (Figure 3). An author writes their question in the upper right while the model’s top five predictions (Machine Guesses) appear in the upper left. If the top prediction is the right answer, the interface indicates where in the question the model is first correct. The goal is to cause the model to be incorrect or to delay the correct answer position as much as possible.²

¹ https://github.com/Eric-Wallace/trickme-interface/

Figure 3: The author writes a question (top right), the QA system provides guesses (left), and explains why it makes those guesses (bottom right). The author can then adapt their question to “trick” the model.

The words of the current question are highlighted using the applicable interpretation method in the lower right (Evidence). We do not enforce time restrictions or require questions to be adversarial: if the author fails to break the system, they are free to “give up” and submit any question.
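One way to compute the “first correct” position that the interface reports is to replay the question word by word and return the earliest prefix at which the model’s top guess matches the answer. The sketch below assumes a `predict_top` callable (prefix text → top guess); the name is a placeholder for whichever model is being attacked.

```python
def first_correct_position(words, answer, predict_top):
    """Index of the earliest word prefix where the top guess equals the answer.

    Returns None if the model is never correct on any prefix.
    """
    for i in range(1, len(words) + 1):
        if predict_top(" ".join(words[:i])) == answer:
            return i - 1
    return None
```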

The interface continually updates as the author writes. We track the question edit history to identify recurring model failures (Section 6) and understand how interpretations guide the authors (Section 7).

3.5 Question Authors

We focus on members of the Quizbowl community: they have deep trivia knowledge and craft questions for Quizbowl tournaments (Jennings, 2006). We award prizes for questions read at live human–computer matches (Section 5.3).

The question authors are familiar with the standard format of Quizbowl questions (Lujan and Teitler, 2003). The questions follow a common paragraph structure, are well edited for grammar, and finish with a simple “give-away” clue. These constraints benefit the adversarial writing process as it is very clear what constitutes a difficult but valid question. Thus, our examples go beyond surface-level “breaks” such as character noise (Belinkov and Bisk, 2018) or syntax changes (Iyyer et al., 2018). Rather, questions are difficult because of their semantic content (examples in Section 6).

² The authors want normal Quizbowl questions which humans can easily answer by the very end. For popular answers (e.g., Australia or Suez Canal), writing novel final give-away clues is difficult. We thus expect models to often answer correctly by the very end of the question.

3.6 How an Author Writes a Question

To see how an author might write a question with the interface, we walk through an example of writing a question’s first sentence. The author first selects the answer to their question from the training set—Johannes Brahms—and begins:

Karl Ferdinand Pohl showed this composer some pieces on which this composer’s Variations on a Theme by Haydn were based.

The QA system buzzes (i.e., it has enough information to interrupt and answer correctly) after “composer”. The author sees that the name “Karl Ferdinand Pohl” appears in Brahms’ Wikipedia page and avoids that specific phrase, describing Pohl’s position instead of naming him directly:

This composer was given a theme called “Chorale St. Antoni” by the archivist of the Vienna Musikverein, which could have been written by Ignaz Pleyel.


Science                                             17%
History                                             22%
Literature                                          18%
Fine Arts                                           15%
Religion, Mythology, Philosophy, and Social Science 13%
Current Events, Geography, and General Knowledge    15%

Total Questions                                     1213

Table 1: The topical diversity of the questions in the adversarially-authored dataset based on a random sample of 100 questions.

This rewrite adds in some additional information (there is a scholarly disagreement over who wrote the theme and its name), and the QA system now incorrectly thinks the answer is Frédéric Chopin. The user can continue to build on the theme, writing

While summering in Tutzing, this composer turned that theme into “Variations on a Theme by Haydn”.

Again, the author then sees that the system buzzes on “Variations on a Theme” with the correct answer. However, the author can rewrite it in its original German, “Variationen über ein Thema von Haydn”, to fool the system. The author continues to create entire questions the model cannot solve.

4 A New Adversarially-Authored Dataset

Our adversarial dataset consists of 1213 questions with 6,541 sentences across diverse topics (Table 1).³ There are 807 questions written against the IR system and 406 against the neural model by 115 unique authors. We plan to hold twice-yearly competitions to continue data collection.

4.1 Validating Questions with Quizbowlers

We validate that the adversarially-authored questions are not of poor quality or too difficult for humans. We first automatically filter out questions based on length, the presence of vulgar statements, or repeated submissions (including re-submissions from the Quizbowl training or evaluation data).

We next host a human-only Quizbowl event using intermediate and expert players (former and current collegiate Quizbowl players). We select sixty adversarially-authored questions and sixty standard high school national championship questions, both with the same number of questions per category (list of categories in Table 1).

³ Data available at http://trickme.qanta.org.

To answer a Quizbowl question, a player interrupts the question: the earlier the better. To capture this dynamic, we record both the average answer position (as a percentage of the question, lower is better) and answer accuracy. We shuffle the regular and adversarially-authored questions, read them to players, and record these two metrics.

The adversarially-authored questions are on average easier for humans than the regular test questions. For the adversarially-authored set, humans buzz with 41.6% of the question remaining and an accuracy of 89.7%. On the standard questions, humans buzz with 28.3% of the question remaining and an accuracy of 84.2%. The difference in accuracy between the two types of questions is not significant (p = 0.16 using Fisher’s exact test), but the buzzing position is earlier for adversarially-authored questions (p = 0.0047 for a two-sided t-test). We expect the questions that were not played to be of comparable difficulty because they went through the same submission process and post-processing. We further explore the human-perceived difficulty of the adversarially-authored questions in Section 5.3.
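A small sketch of how these two metrics and significance tests can be computed with SciPy is given below; the per-question record format (whether the player was correct and what fraction of the question remained at the buzz) is an assumption for illustration.

```python
from scipy.stats import fisher_exact, ttest_ind

def buzz_metrics(records):
    """records: list of (correct: bool, fraction_remaining: float) per question."""
    accuracy = sum(c for c, _ in records) / len(records)
    mean_remaining = sum(r for _, r in records) / len(records)
    return accuracy, mean_remaining

def compare_question_sets(adversarial, regular):
    """Fisher's exact test on accuracy, two-sided t-test on buzzing position."""
    # 2x2 contingency table of correct/incorrect counts for each question set.
    table = [
        [sum(c for c, _ in adversarial), sum(not c for c, _ in adversarial)],
        [sum(c for c, _ in regular), sum(not c for c, _ in regular)],
    ]
    _, p_accuracy = fisher_exact(table)
    _, p_position = ttest_ind([r for _, r in adversarial], [r for _, r in regular])
    return p_accuracy, p_position
```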

5 Computer Experiments

This section evaluates QA systems on the adversarially-authored questions. We test three models: the IR and RNN models shown in the interface, as well as a Deep Averaging Network (Iyyer et al., 2015, DAN) to evaluate the transferability of the adversarial questions. We break our study into two rounds. The first round consists of adversarially-authored questions written against the IR system (Section 5.1); the second round questions target both the IR and RNN (Section 5.2).

Finally, we also hold live competitions that pit the state-of-the-art Studio Ousia model (Yamada et al., 2018) against human teams (Section 5.3).

5.1 First Round Attacks: IR Adversarial Questions Transfer To All Models

The first round of adversarially-authored questions targets the IR model; these questions are significantly harder for the IR, RNN, and DAN models (Figure 4). For example, the DAN’s accuracy drops from 54.1% to 32.4% on the full question (60% of original performance).

[Figure 4 plots: accuracy versus percent of question revealed for the DAN, RNN, and IR models, comparing Regular Test questions with Round 1 IR Adversarial questions.]

Figure 4: The first round of adversarial writing attacks the IR model. Like regular test questions, adversarially-authored questions begin with difficult clues that trick the model. However, the adversarial questions are significantly harder during the crucial middle third of the question.

[Figure 5 plots: accuracy versus percent of question revealed for the DAN, RNN, and IR models, comparing Regular Test, Round 2 IR Adversarial, and Round 2 RNN Adversarial questions.]

Figure 5: The second round of adversarial writing attacks the IR and RNN models. The questions targeted against the IR system degrade the performance of all models. However, the reverse does not hold: the IR model is robust to the questions written to fool the RNN.

For both adversarially-authored and original test questions, the early clues are difficult to answer (accuracy about 10% through 25% of the question). However, during the middle third of the questions, where buzzes in Quizbowl most frequently occur, the accuracy on original test questions rises significantly more quickly than on the adversarially-authored ones. For both types of questions, the accuracy rises towards the end as the clues become “give-aways”.

5.2 Second Round Attacks: RNN Adversarial Questions are Brittle

In the second round, the authors also attack an RNN model. All models tested in the second round are trained on a larger dataset (Section 3.2).

A similar trend holds for IR adversarial questions in the second round (Figure 5): a question that tricks the IR system also fools the two neural models (i.e., adversarial examples transfer). For example, the DAN model was never targeted but had substantial accuracy decreases in both rounds.

However, this does not hold for questions written adversarially against the RNN model. On these questions, the neural models struggle but the IR model is largely unaffected (Figure 5, right).

5.3 Humans vs. Computer, Live!

In the offline setting (i.e., no pressure to “buzz” before an opponent), models demonstrably struggle on the adversarial questions. But what happens in standard Quizbowl: live, head-to-head games?

We run two live humans vs. computer matches. The first match uses IR adversarial questions in a forty question, tossup-only Quizbowl format. We pit a human team of national-level Quizbowl players against the Studio Ousia model (Yamada et al., 2018), the current state-of-the-art Quizbowl system. The model combines neural, IR, and knowledge graph components (details in Appendix B), and won the 2017 NIPS shared task, defeating a team of expert humans 475–200 on regular Quizbowl test questions. Although the team at our live event was comparable to the NIPS 2017 team, the tables were turned: the human team won handily, 300–30.

Our second live event is significantly larger: seven human teams play against models on over 400 questions written adversarially against the RNN model. The human teams range in ability from high school Quizbowl players to national-level teams (Jeopardy! champions, Academic Competition Federation national champions, top scorers in the World Quizzing Championships). The models are based on either IR or neural methods. Despite a few close games between the weaker human teams and the models, humanity prevailed in every match.⁴

[Figure 6 plots: human accuracy versus percent of question revealed for Intermediate, Expert, and National teams, comparing Regular Test, IR Adversarial, and RNN Adversarial questions.]

Figure 6: Humans find adversarially-authored questions about as difficult as normal questions: rusty weekend warriors (Intermediate), active players (Expert), or the best trivia players in the world (National).

[Figure 7 plot: Studio Ousia model accuracy versus percent of question revealed for Regular Test, IR Adversarial, and RNN Adversarial questions.]

Figure 7: The accuracy of the state-of-the-art Studio Ousia model degrades on the adversarially-authored questions despite never being directly targeted. This verifies that our findings generalize beyond the RNN and IR models.

Figures 6–7 summarize the live match results for the humans and Ousia model, respectively. Humans and models have considerably different trends in answer accuracy. Human accuracy on both regular and adversarial questions rises quickly in the last half of the question (curves in Figure 6). In essence, the “give-away” clues at the end of questions are easy for humans to answer.

On the other hand, models on regular test questions do well in the first half, i.e., the “difficult” clues for humans are easier for models (Regular Test in Figure 7). However, models, like humans, struggle on adversarial questions in the first half.

⁴ Videos available at http://trickme.qanta.org.

6 What Makes Adversarially-authored Questions Hard?

This section analyzes the adversarially-authored questions to identify the source of their difficulty.

6.1 Quantitative Differences in Questions

One possible source of difficulty is data scarcity: the answers to adversarial questions rarely appear in the training set. However, this is not the case; the mean number of training examples per answer (e.g., George Washington) is 14.9 for the adversarial questions versus 16.9 for the regular test data.

Another explanation for question difficulty is limited “overlap” with the training data, i.e., models cannot match n-grams from the training clues. We measure the proportion of test n-grams that also appear in training questions with the same answer (Table 2). The overlap is roughly equal for unigrams but surprisingly higher for adversarial questions’ bigrams. The adversarial questions are also shorter and have fewer named entities (NEs). However, the proportion of named entities is roughly equivalent.
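Below is a sketch of this overlap measurement, assuming questions are stored as (answer, text) pairs; the whitespace tokenization and averaging are illustrative choices rather than the paper’s exact procedure.

```python
from collections import defaultdict

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_with_training(test_questions, train_questions, n=1):
    """Mean fraction of a test question's n-grams that also appear in
    training questions sharing the same answer."""
    train_by_answer = defaultdict(set)
    for answer, text in train_questions:
        train_by_answer[answer] |= ngrams(text.lower().split(), n)

    fractions = []
    for answer, text in test_questions:
        grams = ngrams(text.lower().split(), n)
        if grams:
            fractions.append(len(grams & train_by_answer[answer]) / len(grams))
    return sum(fractions) / len(fractions)
```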

One difference between the questions written against the IR system and the ones written against the RNN model is the drop in NEs. The decrease in NEs is higher for IR adversarial questions, which may explain their generalization: the RNN is more sensitive to changes in phrasing, while the IR system is more sensitive to specific words.

6.2 Categorizing Adversarial Phenomena

We next qualitatively analyze adversarially-authored questions. We manually inspect the author edit logs, classifying questions into six different phenomena in two broad categories (Table 3) from a random sample of 100 questions, double counting questions into multiple phenomena when applicable.


                         Adversarial   Regular
Unigram overlap              0.40        0.37
Bigram overlap               0.08        0.05
Longest n-gram overlap       6.73        6.87
Average NE overlap           0.38        0.46
  IR Adversarial             0.35
  RNN Adversarial            0.44
Total Words                 107.1       133.5
Total NE                      9.1        12.5

Table 2: The adversarially-authored questions have similar n-gram overlap to the regular test questions. However, the overlap of the named entities (NE) decreases for IR Adversarial questions.

Composing Seen Clues       15%
Logic & Calculations        5%
Multi-Step Reasoning       25%
Paraphrases                38%
Entity Type Distractors     7%
Novel Clues                26%

Total Questions            1213

Table 3: A breakdown of the phenomena in the adversarially-authored dataset.

6.2.1 Adversarial Category 1: Reasoning

The first question category requires reasoning about known clues (Table 4).

Composing Seen Clues: These questions provide entities with a first-order relationship to the correct answer. The system must triangulate the correct answer by “filling in the blank”. For example, the first question of Table 4 names the place of death of Tecumseh. The training data contains a question about his death reading “though stiff fighting came from their Native American allies under Tecumseh, who died at this battle” (The Battle of the Thames). The system must connect these two clues to answer.

Logic & Calculations: These questions require mathematical or logical operators. For example, the training data contains a clue about the Battle of Thermopylae: “King Leonidas and 300 Spartans died at the hands of the Persians”. The second question in Table 4 requires adding 150 to the number of Spartans.

Multi-Step Reasoning: This question type requires multiple reasoning steps between entities. For example, the last question of Table 4 requires a reasoning step from the “I Have A Dream” speech to the Lincoln Memorial and then another reasoning step to reach Abraham Lincoln.

6.2.2 Adversarial Category 2: Distracting Clues

The second category consists of circumlocutory clues (Table 5).

Paraphrases: A common adversarial modification is to paraphrase clues to remove exact n-gram matches from the training data. This renders our IR system useless but also hurts the neural models. Many of the adversarial paraphrases go beyond syntax-only changes (e.g., the first row of Table 5).

Entity Type Distractors: Whether explicit or implicit in a model, one key component for QA is determining the answer type of the question. Authors take advantage of this by providing clues that cause the model to select the wrong answer type. For example, in the second question of Table 5, the “lead-in” clue implies the answer may be an actor. The RNN model answers Don Cheadle in response despite previously seeing the Bill Clinton “playing a saxophone” clue in the training data.

Novel Clues: Some adversarially-authored questions are hard not because of phrasing or logic but because our models have not seen these clues. These questions are easy to create: users can add Novel Clues that—because they are not uniquely associated with an answer—confuse the models. While not as linguistically interesting, novel clues are not captured by Wikipedia or Quizbowl data, thus improving the dataset’s diversity. For example, adding clues about literary criticism (Hardwick, 1967; Watson, 1996) to a question about Lillian Hellman’s The Little Foxes: “Ritchie Watson commended this play’s historical accuracy for getting the price for a dozen eggs right—ten cents—to defend against Elizabeth Hardwick’s contention that it was a sentimental history.” Novel clues create an incentive for models to use information beyond past questions and Wikipedia.

Novel clues have different effects on IR and neural models: while IR models largely ignore them, novel clues can lead neural models astray. For example, on a question about Tiananmen Square, the RNN model buzzes on the clue “World Economic Herald”. However, adding a novel clue about “the history of shaving” renders the brittle RNN unable to buzz on the “World Economic Herald” clue that it was able to recognize before.⁵ This helps to explain why adversarially-authored questions written against the RNN do not stump IR models.

Question | Prediction | Answer | Phenomenon
This man, who died at the Battle of the Thames, experienced a setback when his brother Tenskwatawa’s influence over their tribe began to fade. | Battle of Tippecanoe | Tecumseh | Composing Seen Clues
This number is one hundred fifty more than the number of Spartans at Thermopylae. | Battle of Thermopylae | 450 | Logic & Calculations
A building dedicated to this man was the site of the “I Have A Dream” speech. | Martin Luther King Jr. | Abraham Lincoln | Multi-Step Reasoning

Table 4: The first category of adversarially-authored questions consists of examples that require reasoning. Answer displays the correct answer (all models were incorrect). For these examples, connecting the training and adversarially-authored clues is simple for humans but difficult for models.

Set | Question | Prediction | Phenomenon
Training | Name this sociological phenomenon, the taking of one’s own life. | Suicide | Paraphrase
Adversarial | Name this self-inflicted method of death. | Arthur Miller | Paraphrase
Training | Clinton played the saxophone on The Arsenio Hall Show. | Bill Clinton | Entity Type Distractor
Adversarial | He was edited to appear in the film “Contact”. . . For ten points, name this American president who played the saxophone on an appearance on the Arsenio Hall Show. | Don Cheadle | Entity Type Distractor

Table 5: The second category of adversarial questions consists of clues that are present in the training data but are written in a distracting manner. Training shows relevant snippets from the training data. Prediction displays the RNN model’s answer prediction (always correct on Training, always incorrect on Adversarial).

7 How Do Interpretations Help?

This section explores how model interpretations help to guide adversarial authors. We analyze the question edit log, which reflects how authors modify questions given a model interpretation.

A direct edit of the highlighted words often creates an adversarial example (e.g., Figure 8). Figure 9 shows a more intricate example. The left plot shows the Question Length, as well as the position where the model is first correct (Buzzing Position, lower is better). We show two adversarial edits. In the first (1), the author removes the first sentence of the question, which makes the question easier for the model (buzz position decreases). The author counteracts this in the second edit (2), where they use the interpretation to craft a targeted modification which breaks the IR model.

⁵ The “history of shaving” is a tongue-in-cheek name for a poster displaying the hirsute leaders of Communist thought. It goes from the bearded Marx and Engels, to the mustachioed Lenin and Stalin, and finally the clean-shaven Mao.

One of these concepts . . . a Hyperbola is a type of, for ten points, what shapes made by passing a plane through a namesake solid, [removed: that also includes the ellipse, parabola?] [added: whose area is given by one-third Pi r squared times height?]
Prediction: Conic Section (✓) → Sphere (✗)

Figure 8: The interpretation successfully aids an attack against the IR system. The author removes the phrase containing the words “ellipse” and “parabola”, which are highlighted in the interface (shown in bold). In its place, they add a phrase which the model associates with the answer sphere.

However, models are not always this brittle. In Figure 10 (Appendix C), the interpretation fails to aid an adversarial attack against the RNN model. At each step, the author uses the highlighted words as a guide to edit targeted portions of the question, yet


fails to trick the model. The author gives up and submits their relatively non-adversarial question.

Figure 9: The Question Length and the position where the model is first correct (Buzzing Position, lower is better) are shown as a question is written. In (1), the author makes a mistake by removing a sentence, which makes the question easier for the IR model. In (2), the author uses the interpretation, replacing the highlighted word (shown in bold) “molecules” with “species” to trick the RNN model.

7.1 Interviews With Adversarial Authors

We also interview the adversarial authors who attended our live events. Multiple authors agree that identifying oft-repeated “stock” clues was the interface’s most useful feature. As one author explained, “There were clues which I did not think were stock clues but were later revealed to be”. In particular, the author’s question about the Congress of Vienna used a clue about “Kraków becoming a free city”, which the model immediately recognized.

Another interviewee was Jordan Brownstein,⁶ a national Quizbowl champion and one of the best active players, who felt that computer opponents were better at questions that contained direct references to battles or poetry. He also explained how the different writing styles used by each Quizbowl author increase the difficulty of questions for computers. The interface’s evidence panel allows authors to read existing clues, which encourages these unique stylistic choices.

⁶ https://www.qbwiki.com/wiki/Jordan_Brownstein

8 Related Work

New datasets often allow for a finer-grained analysis of a linguistic phenomenon, task, or genre. The LAMBADA dataset (Paperno et al., 2016) tests a model’s understanding of the broad contexts present in book passages, while the Natural Questions corpus (Kwiatkowski et al., 2019) combs Wikipedia for answers to questions that users trust search engines to answer (Oeldorf-Hirsch et al., 2014). Other work focuses on natural language inference, where challenge examples highlight model failures (Wang et al., 2019; Glockner et al., 2018; Naik et al., 2018). Our work is unique in that we use human adversaries to expose model weaknesses, which provides a diverse set of phenomena (from paraphrases to multi-hop reasoning) that models cannot solve.

Other work puts an adversary in the data annotation or postprocessing loop. For instance, Dua et al. (2019) and Zhang et al. (2018) filter out easy questions using a baseline QA model, while Zellers et al. (2018) use stylistic classifiers to filter language inference examples. Rather than filtering out easy questions, we instead use human adversaries to generate hard ones. Similar to our work, Ettinger et al. (2017) use human adversaries. We extend their setting by providing humans with model interpretations to facilitate adversarial writing. Moreover, we have a ready-made audience of question writers to generate adversarial questions.

The collaborative adversarial writing process reflects the complementary abilities of humans and computers. For instance, “centaur” chess teams of both a human and a computer are often stronger than a human or computer alone (Case, 2018). In Starcraft, humans devise high-level “macro” strategies, while computers are superior at executing fast and precise “micro” actions (Vinyals et al., 2017). In NLP, computers aid simultaneous human interpreters (He et al., 2016) at remembering forgotten information or translating unfamiliar words.

Finally, recent approaches to adversarial evaluation of NLP models (Section 2) typically target one phenomenon (e.g., syntactic modifications) and complement our human-in-the-loop approach.

9 Conclusion

One of the challenges of machine learning is knowing why systems fail. This work brings together two threads that attempt to answer this question: visualizations and adversarial examples. Visualizations underscore the capabilities of existing models, while adversarial examples—crafted with the ingenuity of human experts—show that these models are still far from matching human prowess.

Our experiments with both neural and IR methodologies show that QA models still struggle with synthesizing clues, handling distracting information, and adapting to unfamiliar data. Our adversarially-authored dataset is only the first of many iterations (Ruef et al., 2016): as models improve, future adversarially-authored datasets can elucidate the limitations of next-generation QA systems.

While we focus on QA, our procedure is applicable to other NLP settings where there is (1) a pool of talented authors who (2) write text with specific goals. Future research can look to craft adversarially-authored datasets for other NLP tasks that meet these criteria.

Acknowledgments

We thank all of the Quiz Bowl players, writers, and judges who helped make this work possible, especially Ophir Lifshitz and Daniel Jensen. We also thank the anonymous reviewers and members of the UMD “Feet Thinking” group for helpful comments. Finally, we would also like to thank Sameer Singh, Matt Gardner, Pranav Goel, Sudha Rao, Pouya Pezeshkpour, Zhengli Zhao, and Saif Mohammad for their useful feedback. This work was supported by NSF Grant IIS-1822494. Shi Feng is partially supported by subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113, and Pedro Rodriguez is partially supported by NSF Grant IIS-1409287 (UMD). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. In Transactions of the Association for Computational Linguistics.

Jordan Boyd-Graber, Shi Feng, and Pedro Rodriguez. 2018. Human-Computer Question Answering: The Case for Quizbowl. Springer.

Nicky Case. 2018. How To Become A Centaur. Journal of Design and Science.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Conference of the North American Chapter of the Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Association for Computational Linguistics.


Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3).

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the Association for Computational Linguistics.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide. O'Reilly Media, Inc.

Anupam Guha, Mohit Iyyer, Danny Bouman, and Jordan Boyd-Graber. 2015. Removing the training wheels: A coreference dataset that entertains humans and challenges computers. In North American Association for Computational Linguistics.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics.

Elizabeth Hardwick. 1967. The Little Foxes revived. The New York Review of Books, 9(11).

Bob Harris. 2006. Prisoner of Trebekistan: A Decade in Jeopardy!

He He, Jordan Boyd-Graber, and Hal Daumé III. 2016. Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In Conference of the North American Chapter of the Association for Computational Linguistics.

Lynette Hirschman and Rob Gaizauskas. 2001. Natural language question answering: The view from here. Natural Language Engineering, 7(4):275–300.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Conference of the North American Chapter of the Association for Computational Linguistics.

Ken Jennings. 2006. Brainiac: adventures in the curious, competitive, compulsive world of trivia buffs. Villard.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of Empirical Methods in Natural Language Processing.

Ike Jose. 2017. The craft of writing pyramidal quiz questions: Why writing quiz bowl questions is an intellectual task.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of Empirical Methods in Natural Language Processing.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Rhinehart, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, et al. 2019. Natural Questions: A benchmark for question answering research. In Transactions of the Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Paul Lujan and Seth Teitler. 2003. Writing good quizbowl questions: A quick primer.

Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2017. Methods for interpreting and understanding deep neural networks. arXiv preprint arXiv:1706.07979.

Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the model understand the question? In Proceedings of the Association for Computational Linguistics.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the International Conference on Computational Linguistics.

Anne Oeldorf-Hirsch, Brent Hecht, Meredith Ringel Morris, Jaime Teevan, and Darren Gergle. 2014. To search or to ask: The routing of information needs between traditional search engines and social networks. In Conference on Computer Supported Cooperative Work and Social Computing.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the Association for Computational Linguistics.

Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations.

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. 2017. Starcraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.

Eric Wallace, Shi Feng, and Jordan Boyd-Graber. 2018. Interpreting neural networks with nearest neighbors. In EMNLP 2018 Workshop on Analyzing and Interpreting Neural Networks for NLP.

Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the International Conference on Learning Representations.

Ritchie D. Watson. 1996. Lillian Hellman's "The Little Foxes" and the new south creed: An ironic view of southern history. The Southern Literary Journal, 28(2):59–68.

Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo, and Yoshiyasu Takefuji. 2018. Studio Ousia's quiz bowl question answering system. arXiv preprint arXiv:1803.08652.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In Proceedings of the International Conference on Learning Representations.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of Empirical Methods in Natural Language Processing.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2018. Generating natural adversarial examples. In Proceedings of the International Conference on Learning Representations.


Original sentence | SCPN paraphrase | Phenomenon | Success/Failure
its types include "frictional", "cyclical", and "structural" | its types include "frictional", and structural | Missing Information | ✗
german author of the sorrows of young werther and a two-part faust | german author of the sorrows of mr. werther | Lost Named Entity | ✗
name this elegy on the death of john keats composed by percy shelley | name was this elegy on the death of percy shelley | Incorrect Clue | ✗
identify this play about willy loman written by arthur miller | so you can identify this work of mr. miller | Unsuited Syntax Template | ✗
he employed marco polo and his father as ambassadors | he hired marco polo and his father as ambassadors | Verb Synonym | ✓

Table 6: Failure and success cases for SCPN. The model fails to create a valid paraphrase of the sentence for 97% of questions.

A Failure of Syntactically Controlled Paraphrase Networks

We apply the Syntactically Controlled Paraphrase Network (Iyyer et al., 2018, SCPN) to Quizbowl questions. The model operates on the sentence level and cannot paraphrase paragraphs. We thus feed in each sentence independently, ignoring possible breaks in coreference. The model does not correctly paraphrase most of the complex sentences present in Quizbowl questions. The paraphrases were rife with issues: ungrammatical, repetitive, or missing information.

To simplify the setting, we focus on paraphrasing the shortest sentence from each question (often the final clue). The model still fails in this case. We analyze a random sample of 200 paraphrases: only six maintained all of the original information.

Table 6 shows common failure cases. One recurring issue is an inability to maintain the correct named entities after paraphrasing. In Quizbowl, maintaining entity information is vital for ensuring question validity. We were surprised by this failure because SCPN incorporates a copy mechanism.

B Studio Ousia Quizbowl Model

The Studio Ousia system works by aggregating scores from both a neural text classification model and an IR system. Additionally, it scores answers based on their match with the correct entity type (e.g., religious leader, government agency, etc.) predicted by a neural entity type classifier. The Studio Ousia system also uses data beyond Quizbowl questions and the text of Wikipedia pages, integrating entities from a knowledge graph and customized word vectors (Yamada et al., 2018).
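As a rough sketch of this kind of score aggregation (the weights, score names, and linear combination below are illustrative assumptions, not the published system’s exact formula):

```python
def aggregate_scores(neural_probs, ir_scores, type_match, weights=(1.0, 1.0, 1.0)):
    """Combine per-answer scores from a neural classifier, an IR system,
    and an entity-type match signal; each argument maps answers to scores.

    The linear combination is a placeholder; the actual aggregation is
    described in Yamada et al. (2018).
    """
    w_neural, w_ir, w_type = weights
    candidates = set(neural_probs) | set(ir_scores) | set(type_match)
    combined = {
        answer: w_neural * neural_probs.get(answer, 0.0)
        + w_ir * ir_scores.get(answer, 0.0)
        + w_type * type_match.get(answer, 0.0)
        for answer in candidates
    }
    # Return the highest-scoring candidate answer.
    return max(combined, key=combined.get)
```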

C Failed Adversarial Attempt

Figure 10 shows a user’s failed attempt to break the neural Quizbowl model.

In his speeches this . . . As a Senator, [removed: this man supported Paraguay in the Chaco War, believing Bolivia was backed by Standard Oil.] [added: this man’s campaign was endorsed by Milo Reno and Charles Coughlin.]
Prediction: Huey Long (✓) → Huey Long (✓)

In his speeches this . . . As a Senator, this man’s campaign was endorsed by Milo Reno and [removed: Charles Coughlin.] [added: a Catholic priest and radio show host.]
Prediction: Huey Long (✓) → Huey Long (✓)

Figure 10: A failed attempt to trick the neural model. The author modifies the question multiple times, replacing words suggested by the interpretation, but is unable to break the system.