
Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, Pontus Stenetorp
Department of Computer Science
University College London
{m.bartolo,a.roberts,j.welbl,s.riedel,p.stenetorp}@cs.ucl.ac.uk

Abstract

Innovations in annotation methodology have been a catalyst for Reading Comprehension (RC) datasets and models. One recent trend to challenge current RC models is to involve a model in the annotation process: humans create questions adversarially, such that the model fails to answer them correctly. In this work we investigate this annotation methodology and apply it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop. This allows us to explore questions such as the reproducibility of the adversarial effect, transfer from data collected with varying model-in-the-loop strengths, and generalisation to data collected without a model. We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets, yet with progressive performance deterioration with increasingly stronger models-in-the-loop. Furthermore, we find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop. When trained on data collected with a BiDAF model in the loop, RoBERTa achieves 39.9 F1 on questions that it cannot answer when trained on SQuAD – only marginally lower than when trained on data collected using RoBERTa itself (41.0 F1).

1 Introduction

Data collection is a fundamental prerequisite for Machine Learning-based approaches to Natural Language Processing (NLP). Innovations in data acquisition methodology, such as crowdsourcing, have led to major breakthroughs in scalability and preceded the "deep learning revolution", for which they can arguably be seen as co-responsible (Deng et al., 2009; Bowman et al., 2015; Rajpurkar et al., 2016).

[Figure 1 shows a passage about early steam engines ("Using boiling water to produce mechanical motion goes back over 2000 years [...] In 1698 Thomas Savery patented a steam pump that used [...]. Thomas Newcomen's atmospheric engine was the first commercial true steam engine using a piston, and was used in 1712 for pumping in a mine."), together with i) example questions and the corresponding human and model answers (e.g., "Who created the first commercial piston steam engine?" – human and model both answer "Thomas Newcomen"; "What happened in the year prior to the penultimate year of the 17th century?" – human: "Thomas Savery patented a steam pump", model: "1698"), and ii) questions generated with increasing model-in-the-loop strength – None (SQuAD): "In what year did Savery patent his steam pump?"; BiDAF: "Early attempts to use steam to do work were what?"; BERT: "Who was the man who had an engine in the 18th century?"; RoBERTa: "What happened in the year prior to the penultimate year of the 17th century?".]

Figure 1: Human annotation with a model in the loop, showing: i) the "Beat the AI" annotation setting where only questions that the model does not answer correctly are accepted, and ii) questions generated this way, with a progressively stronger model in the annotation loop.

Annotation approaches include expert annotation, for example, relying on trained linguists (Marcus et al., 1993), crowd-sourcing by non-experts (Snow et al., 2008), distant supervision (Mintz et al., 2009; Joshi et al., 2017), and leveraging document structure (Hermann et al., 2015). The concrete data collection paradigm chosen dictates the degree of scalability, annotation cost, precise task structure (often arising as a compromise of the above) and difficulty, domain coverage, as well as resulting dataset biases and model blind spots (Jia and Liang, 2017; Schwartz et al., 2017; Gururangan et al., 2018).

A recently emerging trend in NLP dataset creation is the use of a model-in-the-loop when composing samples: a contemporary model is used either as a filter or directly during annotation, to identify samples wrongly predicted by the model.


Examples of this method are realised in Build It Break It, The Language Edition (Ettinger et al., 2017), HotpotQA (Yang et al., 2018a), SWAG (Zellers et al., 2018), Mechanical Turker Descent (Yang et al., 2018b), DROP (Dua et al., 2019), CODAH (Chen et al., 2019), Quoref (Dasigi et al., 2019), and AdversarialNLI (Nie et al., 2019).1 This approach probes model robustness and ensures that the resulting datasets pose a challenge to current models, which drives research to tackle new sets of problems.

We study this approach in the context of RC, and investigate its robustness in the face of continuously progressing models – do adversarially constructed datasets quickly become outdated in their usefulness as models grow stronger?

Based on models trained on the widely used SQuAD dataset, and following the same annotation protocol, we investigate the annotation setup where an annotator has to compose questions for which the model predicts the wrong answer. As a result, only samples that the model fails to predict correctly are retained in the dataset – see Figure 1 for an example.

We apply this annotation strategy with three distinct models in the loop, resulting in datasets with 12,000 samples each. We then study the reproducibility of the adversarial effect when retraining the models with the same data, as well as the generalisation ability of models trained using datasets produced with and without a model adversary. Models can, to a considerable degree, learn to generalise to more challenging questions, based on training sets collected with both stronger and also weaker models in the loop. Compared to training on SQuAD, training on adversarially composed questions leads to a similar degree of generalisation to non-adversarially written questions, both for SQuAD and NaturalQuestions (Kwiatkowski et al., 2019). It furthermore leads to general improvements across the model-in-the-loop datasets we collect, as well as improvements of more than 20.0 F1 for both BERT and RoBERTa on an extractive subset of DROP (Dua et al., 2019), another adversarially composed dataset. When conducting a systematic analysis of the concrete questions different models fail to answer correctly, as well as non-adversarially composed questions, we see that the nature of the resulting questions changes: questions composed with a model in the loop are overall more diverse, use more paraphrasing, multi-hop inference, comparisons, and background knowledge, and are generally less easily answered by matching an explicit statement that states the required information literally. Given our observations, we believe a model-in-the-loop approach to annotation shows promise and should be considered when creating future RC datasets.

1 The idea was alluded to at least as early as Richardson et al. (2013), but it has only recently seen wider adoption.

To summarise, our contributions are as follows: First, an investigation into the model-in-the-loop approach to RC data collection based on three progressively stronger models, together with an empirical performance comparison when trained on datasets constructed with adversaries of different strength. Second, a comparative investigation into the nature of questions composed to be unsolvable by a sequence of progressively stronger models. Third, a study of the reproducibility of the adversarial effect and the generalisation ability of models trained in various settings.

2 Related Work

Constructing Challenging Datasets Recent efforts in dataset construction have driven considerable progress in RC, yet datasets are structurally diverse and annotation methodologies vary. With its large size and combination of free-form questions with answers as extracted spans, SQuAD1.1 (Rajpurkar et al., 2016) has become an established benchmark that has inspired the construction of a series of similarly structured datasets. However, mounting evidence suggests that models can achieve strong generalisation performance merely by relying on superficial cues – such as lexical overlap, term frequencies, or entity type matching (Chen et al., 2016; Weissenborn et al., 2017; Sugawara et al., 2018). It has thus become an increasingly important consideration to construct datasets that RC models find challenging, and for which natural language understanding is a requisite for generalisation. Attempts to achieve this non-trivial aim have typically revolved around extensions to the SQuAD dataset annotation methodology. They include unanswerable questions (Trischler et al., 2017; Rajpurkar et al., 2018; Reddy et al., 2019; Choi et al., 2018), adding the option of "Yes" or "No" answers (Dua et al., 2019; Kwiatkowski et al., 2019), questions requiring reasoning over multiple sentences or documents (Welbl et al., 2018; Yang et al., 2018a), questions requiring rule interpretation or context awareness (Saeidi et al., 2018; Choi et al., 2018; Reddy et al., 2019), limiting annotator passage exposure by sourcing questions first (Kwiatkowski et al., 2019), controlling answer types by including options for dates, numbers, or spans from the question (Dua et al., 2019), as well as questions with free form answers (Nguyen et al., 2016; Kociský et al., 2018; Reddy et al., 2019).

Adversarial Annotation One recently adopted approach to constructing challenging datasets involves the use of an adversarial model to select examples that it does not perform well on, an approach which superficially is akin to active learning (Lewis and Gale, 1994). Here, we make a distinction between two sub-categories of adversarial annotation: i) adversarial filtering, where the adversarial model is applied offline in a separate stage of the process, usually after data generation; examples include SWAG (Zellers et al., 2018), ReCoRD (Zhang et al., 2018), HotpotQA (Yang et al., 2018a), and HellaSWAG (Zellers et al., 2019); ii) model-in-the-loop adversarial annotation, where the annotator can directly interact with the adversary during the annotation process and uses the feedback to further inform the generation process; examples include CODAH (Chen et al., 2019), Quoref (Dasigi et al., 2019), DROP (Dua et al., 2019), FEVER2.0 (Thorne et al., 2019), AdversarialNLI (Nie et al., 2019), as well as work by Dinan et al. (2019), Kaushik et al. (2020), and Wallace et al. (2019) for the Quizbowl task.

We are primarily interested in the latter category, as this feedback loop creates an environment where the annotator can probe the model directly to explore its weaknesses and formulate targeted adversarial attacks. Although Dua et al. (2019) and Dasigi et al. (2019) make use of adversarial annotations for RC, both annotation setups limit the reach of the model-in-the-loop: in DROP, primarily due to the imposition of specific answer types, and in Quoref by focusing on co-reference, which is already a known RC model weakness.

In contrast, we investigate a scenario where annotators interact with a model in its original task setting – annotators must thus explore a range of natural adversarial attacks, as opposed to filtering out "easy" samples during the annotation process.

1. Human generates question q and selects answer ah for passage p.
2. (p, q) is sent to the model. The model predicts answer am.
3. The F1 score between ah and am is calculated; if the F1 score is greater than a threshold (40%), the human loses.
4(a). Human wins. The human-sourced adversarial example (p, q, ah) is collected.
4(b). Human loses. The process is restarted (same p).

Figure 2: Overview of the annotation process to collect adversarially written questions from humans using a model in the loop.

3 Annotation Methodology

3.1 Annotation Protocol

The data annotation protocol is based on SQuAD1.1, with a model in the loop, and the additional instruction that questions should only have one answer in the passage, which directly mirrors the setting in which these models were trained.

Formally, provided with a passage p, a human annotator generates a question q and selects a (human) answer ah by highlighting the corresponding span in the passage. The input (p, q) is then given to the model, which returns a predicted (model) answer am. To compare the two, a word-overlap F1 score between ah and am is computed; a score above a threshold of 40% is considered a "win" for the model.2 This process is repeated until the human "wins"; Figure 2 gives a schematic overview of the process. All successful (p, q, ah) triples, that is, those which the model is unable to answer correctly, are then retained for further validation.
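The "win" criterion above is a plain token-overlap F1 between the two answer strings. The following is a minimal sketch of that check, our own illustration rather than the authors' exact implementation (it omits the answer normalisation performed by the official SQuAD evaluation script):

```python
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    """Token-level word-overlap F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def model_wins(a_human: str, a_model: str, threshold: float = 0.4) -> bool:
    """The model 'wins' if its answer overlaps the human answer above the threshold,
    in which case the question is rejected and the annotator must try again."""
    return f1_score(a_model, a_human) > threshold
```

For instance, a model answer of "New York City" against a human answer of "New York" scores 0.8 F1 under this sketch, above the 40% threshold, matching the example in footnote 2.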

3.2 Annotation Details

Models in the Annotation Loop We begin by training three different models, which are used as adversaries during data annotation. As a seed dataset for training the models we select the widely used SQuAD1.1 (Rajpurkar et al., 2016) dataset, a large-scale resource for which a variety of mature and well-performing models are readily available. Furthermore, unlike cloze-based datasets, SQuAD is robust to passage/question-only adversarial attacks (Kaushik and Lipton, 2018).

2 This threshold is set after initial experiments to not be overly restrictive given acceptable answer spans, e.g., a human answer of "New York" vs. model answer "New York City" would still lead to a model "win".


Figure 3: "Beat the AI" question generation interface. Human annotators are tasked with asking questions about a provided passage which the model in the loop fails to answer correctly.

We will compare dataset annotation with a series of three progressively stronger models as adversary in the loop, namely BiDAF (Seo et al., 2017), BERT-Large (Devlin et al., 2019), and RoBERTa-Large (Liu et al., 2019b). Each of these will serve as a model adversary in a separate annotation experiment and result in three distinct datasets; we will refer to these as DBiDAF, DBERT, and DRoBERTa respectively. Examples from the validation set of each are shown in Table 1. We rely on the AllenNLP (Gardner et al., 2018) and Transformers (Wolf et al., 2019) model implementations, and our models achieve EM/F1 scores of 65.5%/77.5%, 82.7%/90.3%, and 86.9%/93.6% for BiDAF, BERT, and RoBERTa respectively on the SQuAD1.1 validation set, consistent with results reported in other work.

Our choice of models reflects both the transition from LSTM-based to pre-trained transformer-based models, as well as a graduation among the latter; we investigate how this is reflected in datasets collected with each of these different models in the annotation loop. For each of the models we collect 10,000 training, 1,000 validation, and 1,000 test examples. Dataset sizes are motivated by the data efficiency of transformer-based pretrained models (Devlin et al., 2019; Liu et al., 2019b), which has improved the viability of smaller-scale data collection efforts for investigative and analysis purposes.

To ensure the experimental integrity provided by reporting all results on a held-out test set, we split the existing SQuAD1.1 validation set in half (stratified by document title), as the official test set is not publicly available. We maintain passage consistency across the training, validation and test sets of all datasets to enable like-for-like comparisons. Lastly, we use the majority vote answer as ground truth for SQuAD1.1 to ensure that all our datasets have one valid answer per question, enabling us to fairly draw direct comparisons. For clarity, we will hereafter refer to this modified version of SQuAD1.1 as DSQuAD.
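A minimal sketch of the majority-vote reduction over SQuAD's multiple reference answers is given below; how ties are broken is our own assumption, not stated in the paper:

```python
from collections import Counter

def majority_vote_answer(reference_answers: list[str]) -> str:
    """Reduce a question's multiple SQuAD reference answers to a single ground-truth
    answer by majority vote; for ties, Counter.most_common keeps the first-seen answer."""
    return Counter(reference_answers).most_common(1)[0][0]
```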

Crowdsourcing We use custom-designed Human Intelligence Tasks (HITs) served through Amazon Mechanical Turk (AMT) for all annotation efforts. Workers are required to be based in Canada, the UK, or the US, have a HIT Approval Rate greater than 98%, and have previously completed at least 1,000 HITs successfully. We experiment with and without the AMT Master requirement and find no substantial difference in quality, but observe a throughput reduction of nearly 90%. We pay USD 2.00 for every question generation HIT, during which workers are required to compose up to five questions that "beat" the model in the loop (cf. Figure 3).


BiDAF
Passage: [. . . ] the United Methodist Church has placed great emphasis on the importance of education. As such, the United Methodist Church established and is affiliated with around one hundred colleges [. . . ] of Methodist-related Schools, Colleges, and Universities. The church operates three hundred sixty schools and institutions overseas.
Question: The United Methodist Church has how many schools internationally?

BiDAF
Passage: In a purely capitalist mode of production (i.e. where professional and labor organizations cannot limit the number of workers) the workers wages will not be controlled by these organizations, or by the employer, but rather by the market. Wages work in the same way as prices for any other good. Thus, wages can be considered as a [. . . ]
Question: What determines worker wages?

BiDAF
Passage: [. . . ] released to the atmosphere, and a separate source of water feeding the boiler is supplied. Normally water is the fluid of choice due to its favourable properties, such as non-toxic and unreactive chemistry, abundance, low cost, and its thermodynamic properties. Mercury is the working fluid in the mercury vapor turbine [. . . ]
Question: What is the most popular type of fluid?

BERT
Passage: [. . . ] Jochi was secretly poisoned by an order from Genghis Khan. Rashid al-Din reports that the great Khan sent for his sons in the spring of 1223, and while his brothers heeded the order, Jochi remained in Khorasan. Juzjani suggests that the disagreement arose from a quarrel between Jochi and his brothers in the siege of Urgench [. . . ]
Question: Who went to Khan after his order in 1223?

BERT
Passage: In the Sandgate area, to the east of the city and beside the river, resided the close-knit community of keelmen and their families. They were so called because [. . . ] transfer coal from the river banks to the waiting colliers, for export to London and elsewhere. In the 1630s about 7,000 out of 20,000 inhabitants of Newcastle died of plague [. . . ]
Question: Where did almost half the people die?

BERT
Passage: [. . . ] was important to reduce the weight of coal carried. Steam engines remained the dominant source of power until the early 20th century, when advances in the design of electric motors and internal combustion engines gradually resulted in the replacement of reciprocating (piston) steam engines, with shipping in the 20th-century [. . . ]
Question: Why did steam engines become obsolete?

RoBERTa
Passage: [. . . ] and seven other hymns were published in the Achtliederbuch, the first Lutheran hymnal. In 1524 Luther developed his original four-stanza psalm paraphrase into a five-stanza Reformation hymn that developed the theme of "grace alone" more fully. Because it expressed essential Reformation doctrine, this expanded version of "Aus [. . . ]
Question: Luther's reformed hymn did not feature stanzas of what quantity?

RoBERTa
Passage: [. . . ] tight end Greg Olsen, who caught a career-high 77 passes for 1,104 yards and seven touchdowns, and wide receiver Ted Ginn, Jr., who caught 44 passes for 739 yards and 10 touchdowns; [. . . ] receivers included veteran Jerricho Cotchery (39 receptions for 485 yards), rookie Devin Funchess (31 receptions for 473 yards and [. . . ]
Question: Who caught the second most passes?

RoBERTa
Passage: Other prominent alumni include anthropologists David Graeber and Donald Johanson, who is best known for discovering the fossil of a female hominid australopithecine known as "Lucy" in the Afar Triangle region, psychologist John B. Watson, American psychologist who established the psychological school of behaviorism, communication theorist Harold Innis, chess grandmaster Samuel Reshevsky, and conservative international relations scholar and White House Coordinator of Security Planning for the National Security Council Samuel P. Huntington.
Question: Who thinks three moves ahead?

Table 1: Validation set examples of questions collected using different RC models (BiDAF, BERT, and RoBERTa) in the annotation loop. The answer to the question is highlighted in the passage.

The mean HIT completion times for BiDAF, BERT, and RoBERTa are 551.8s, 722.4s, and 686.4s respectively. Furthermore, we find that human workers are able to generate questions that successfully "beat" the model in the loop 59.4% of the time for BiDAF, 47.1% for BERT, and 44.0% for RoBERTa. These metrics broadly reflect the relative strength of the models.

3.3 Quality Control

Training and Qualification We provide a two-part worker training interface in order to i) familiarise workers with the process, and ii) conduct a first screening based on worker outputs. The interface familiarises workers with formulating questions, and answering them through span selection. Workers are asked to generate questions for two given answers, to highlight answers for two given questions, to generate one full question-answer pair, and finally to complete a question generation HIT with BiDAF as the model in the loop. Each worker's output is then reviewed manually (by the authors); those who pass the screening are added to the pool of qualified annotators.


Manual Worker Validation In the second annotation stage, qualified workers produce data for the "Beat the AI" question generation task. A sample of every worker's HITs is manually reviewed based on their total number of completed tasks n, determined by ⌊5 · log10(n) + 1⌋, chosen for convenience. This is done after every annotation batch; if workers fall below an 80% success threshold at any point, their qualification is revoked and their work is discarded in its entirety.
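This review schedule grows only logarithmically with a worker's output; a one-line sketch of the stated formula, purely for illustration:

```python
import math

def num_hits_to_review(n: int) -> int:
    """Number of a worker's HITs sampled for manual review, given n completed tasks."""
    return math.floor(5 * math.log10(n) + 1)

# For example, a worker with 100 completed HITs has 11 reviewed; with 1,000 HITs, 16.
```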

Question Answerability As the models used in the annotation task become stronger, the resulting questions tend to become more complex. However, this also means that it becomes more challenging to disentangle measures of dataset quality from inherent question difficulty. As such, we use the following condition of human answerability for an annotated question-answer pair: it is answerable if at least one of three additional non-expert human validators can provide an answer matching the original. We conduct answerability checks on both the validation and test sets, and achieve answerability scores of 87.95%, 85.41%, and 82.63% for DBiDAF, DBERT, and DRoBERTa. We discard all questions deemed unanswerable from the validation and test sets, and further discard all data from any workers with less than half of their questions considered answerable. It should be emphasised that the main purpose of this process is to create a level playing field for comparison across datasets constructed for different model adversaries; it can inevitably result in valid questions being discarded. The total cost for training and qualification, dataset construction, and validation is approximately USD 27,000.

Human Performance We select a randomly chosen validator's answer to each question and compute Exact Match (EM) and word overlap F1 scores with the original to calculate non-expert human performance; Table 2 shows the result. We observe a clear trend: the stronger the model in the loop used to construct the dataset, the harder the resulting questions become for humans.

3.4 Dataset Statistics

Table 3 provides general details on the number of passages and question-answer pairs used in the different dataset splits. The average number of words in questions and answers, as well as the average longest n-gram overlap between passage and question, are given in Table 4.

Resource    Dev EM  Dev F1  Test EM  Test F1
DBiDAF      63.0    76.9    62.6     78.5
DBERT       59.2    74.3    63.9     76.9
DRoBERTa    58.1    72.0    58.7     73.7

Table 2: Non-expert human performance results for a randomly-selected validator per question.

Resource    #Passages (Train / Dev / Test)    #QAs (Train / Dev / Test)
DSQuAD      18,891 / 971 / 1,096              87,599 / 5,278 / 5,292
DBiDAF      2,523 / 278 / 277                 10,000 / 1,000 / 1,000
DBERT       2,444 / 283 / 292                 10,000 / 1,000 / 1,000
DRoBERTa    2,552 / 341 / 333                 10,000 / 1,000 / 1,000

Table 3: Number of passages and question-answer pairs for each data resource.

We can again observe two clear trends: from weaker towards stronger models used in the annotation loop, the average length of answers increases, and the largest n-gram overlap drops from 3 to 2 tokens. That is, on average there is a trigram overlap between the passage and question for DSQuAD, but only a bigram overlap for DRoBERTa (Figure 4).3 This is in line with prior observations on lexical overlap as a predictive cue in SQuAD (Weissenborn et al., 2017; Min et al., 2018); questions with less overlap are harder to answer for any of the three models.
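To make the overlap statistic concrete, here is a minimal sketch of how the longest passage-question n-gram overlap can be computed; whitespace tokenisation is our simplifying assumption, and a real implementation would also strip punctuation and normalise tokens:

```python
def longest_ngram_overlap(passage: str, question: str) -> int:
    """Length (in tokens) of the longest contiguous n-gram shared by passage and question."""
    p_tokens = passage.lower().split()
    q_tokens = question.lower().split()
    # Collect every passage n-gram once, then look up question n-grams against the set.
    passage_ngrams = {
        tuple(p_tokens[i:i + n])
        for n in range(1, len(p_tokens) + 1)
        for i in range(len(p_tokens) - n + 1)
    }
    longest = 0
    for n in range(1, len(q_tokens) + 1):
        for i in range(len(q_tokens) - n + 1):
            if tuple(q_tokens[i:i + n]) in passage_ngrams:
                longest = max(longest, n)
    return longest
```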

We furthermore analyse question types based on the question wh-word. We find that – in contrast to DSQuAD – the datasets collected with a model in the annotation loop have fewer when, how and in questions, and more which, where and why questions, as well as questions in the other category, which indicates increased question diversity. In terms of answer types, we observe more common noun and verb phrase clauses than in DSQuAD, as well as fewer dates, names, and numeric answers. This reflects the strong answer-type matching capabilities of contemporary RC models. The training and validation sets used in this analysis (DBiDAF, DBERT and DRoBERTa) will be publicly released.

3 Note that the original SQuAD1.1 dataset can be considered a limit case of the adversarial annotation framework, in which the model in the loop always predicts the wrong answer, thus every question is accepted.


[Figure 4 shows, for each dataset, the distribution of the longest n-gram overlap between passage and question: SQuAD (µ = 3.03, σ = 1.88), BiDAF (µ = 2.23, σ = 1.57), BERT (µ = 2.15, σ = 1.70), and RoBERTa (µ = 2.03, σ = 1.33).]

Figure 4: Distribution of longest n-gram overlap between passage and question for different datasets. µ: mean; σ: standard deviation.

                  DSQuAD  DBiDAF  DBERT  DRoBERTa
Question length   10.3    9.8     9.8    10.0
Answer length     2.6     2.9     3.0    3.2
N-gram overlap    3.0     2.2     2.1    2.0

Table 4: Average number of words per question and answer, and average longest n-gram overlap between passage and question.

4 Experiments

4.1 Consistency of the Model in the Loop

We begin with an experiment regarding the consistency of the adversarial nature of the models in the annotation loop. Our annotation pipeline is designed to reject all samples where the model correctly predicts the answer. How reproducible is this when retraining the model with the same training data? To measure this, we evaluate the performance of instances of BiDAF, BERT, and RoBERTa, which only differ from the model used during annotation in their random initialisation and order of mini-batch samples during training. These results are shown in Table 5.

First, we observe – as expected given our annotation constraints – that model performance is 0.0 EM on datasets created with the same respective model in the annotation loop. We observe however that retrained models do not reliably perform as poorly on those samples. For example, BERT reaches 19.7 EM, whereas the original model used during annotation provides no correct answer, with 0.0 EM.

Model      Resource          Original EM / F1    Re-init. EM / F1
BiDAF      DBiDAF (dev)      0.0 / 5.3           10.7 (0.8) / 20.4 (1.0)
BERT       DBERT (dev)       0.0 / 4.9           19.7 (1.0) / 30.1 (1.2)
RoBERTa    DRoBERTa (dev)    0.0 / 6.1           15.7 (0.9) / 25.8 (1.2)
BiDAF      DBiDAF (test)     0.0 / 5.5           11.6 (1.0) / 21.3 (1.2)
BERT       DBERT (test)      0.0 / 5.3           18.9 (1.2) / 29.4 (1.1)
RoBERTa    DRoBERTa (test)   0.0 / 5.9           16.1 (0.8) / 26.7 (0.9)

Table 5: Consistency of the adversarial effect (or lack thereof) when retraining the models in the loop on the same data again, but with different random seeds. We report the mean and standard deviation (in parentheses) over 10 re-initialisation runs.

This demonstrates that random model components can substantially affect the adversarial annotation process. The evaluation furthermore serves as a baseline for subsequent model evaluations: this much of the performance range can be learnt merely by retraining the same model. A possible takeaway for employing the model-in-the-loop annotation strategy in the future is to rely on ensembles of adversaries and reduce the dependency on one particular model instantiation, as investigated by Grefenstette et al. (2018).

4.2 Adversarial Generalisation

A potential problem with the focus on challenging questions is that they might be very distinct from one another, leading to difficulties in learning to generalise to and from them. We conduct a series of experiments in which we train on DBiDAF, DBERT, and DRoBERTa, and observe how well models can learn to generalise to the respective test portions of these datasets. Table 6 shows the results, and there is a multitude of observations.

First, one clear trend we observe across all training data setups is a negative performance progression when evaluated against datasets constructed with a stronger model in the loop. This trend holds true for all but the BiDAF model, in each of the training configurations, and for each of the evaluation datasets. For example, RoBERTa trained on DRoBERTa achieves 72.1, 57.1, 49.5, and 41.0 F1 when evaluated on DSQuAD, DBiDAF, DBERT, and DRoBERTa respectively.

Second, we observe that the BiDAF model is not able to generalise well to datasets constructed with a model in the loop, independent of its training setup. In particular it is unable to learn from DBiDAF, thus failing to overcome some of its own blind spots through adversarial training.


Model    Trained on     DSQuAD (EM / F1)          DBiDAF (EM / F1)          DBERT (EM / F1)           DRoBERTa (EM / F1)        DDROP (EM / F1)           DNQ (EM / F1)
BiDAF    DSQuAD(10K)    40.9 (0.6) / 54.3 (0.6)   7.1 (0.6) / 15.7 (0.6)    5.6 (0.3) / 13.5 (0.4)    5.7 (0.4) / 13.5 (0.4)    3.8 (0.4) / 8.6 (0.6)     25.1 (1.1) / 38.7 (0.7)
BiDAF    DBiDAF         11.5 (0.4) / 20.9 (0.4)   5.3 (0.4) / 11.6 (0.5)    7.1 (0.4) / 14.8 (0.6)    6.8 (0.5) / 13.5 (0.6)    6.5 (0.5) / 12.4 (0.4)    15.7 (1.1) / 28.7 (0.8)
BiDAF    DBERT          10.8 (0.3) / 19.8 (0.4)   7.2 (0.5) / 14.4 (0.6)    6.9 (0.3) / 14.5 (0.4)    8.1 (0.4) / 15.0 (0.6)    7.8 (0.9) / 14.5 (0.9)    16.5 (0.6) / 28.3 (0.9)
BiDAF    DRoBERTa       10.7 (0.2) / 20.2 (0.3)   6.3 (0.7) / 13.5 (0.8)    9.4 (0.6) / 17.0 (0.6)    8.9 (0.9) / 16.0 (0.8)    15.3 (0.8) / 22.9 (0.8)   13.4 (0.9) / 27.1 (1.2)
BERT     DSQuAD(10K)    69.4 (0.5) / 82.7 (0.4)   35.1 (1.9) / 49.3 (2.2)   15.6 (2.0) / 27.3 (2.1)   11.9 (1.5) / 23.0 (1.4)   18.9 (2.3) / 28.9 (3.2)   52.9 (1.0) / 68.2 (1.0)
BERT     DBiDAF         66.5 (0.7) / 80.6 (0.6)   46.2 (1.2) / 61.1 (1.2)   37.8 (1.4) / 48.8 (1.5)   30.6 (0.8) / 42.5 (0.6)   41.1 (2.3) / 50.6 (2.0)   54.2 (1.2) / 69.8 (0.9)
BERT     DBERT          61.2 (1.8) / 75.7 (1.6)   42.9 (1.9) / 57.5 (1.8)   37.4 (2.1) / 47.9 (2.0)   29.3 (2.1) / 40.0 (2.3)   39.4 (2.2) / 47.6 (2.2)   49.9 (2.3) / 65.7 (2.3)
BERT     DRoBERTa       57.0 (1.7) / 71.7 (1.8)   37.0 (2.3) / 52.0 (2.5)   34.8 (1.5) / 45.9 (2.0)   30.5 (2.2) / 41.2 (2.2)   39.0 (3.1) / 47.4 (2.8)   45.8 (2.4) / 62.4 (2.5)
RoBERTa  DSQuAD(10K)    68.6 (0.5) / 82.8 (0.3)   37.7 (1.1) / 53.8 (1.1)   20.8 (1.2) / 34.0 (1.0)   11.0 (0.8) / 22.1 (0.9)   25.0 (2.2) / 39.4 (2.4)   43.9 (3.8) / 62.8 (3.1)
RoBERTa  DBiDAF         64.8 (0.7) / 80.0 (0.4)   48.0 (1.2) / 64.3 (1.1)   40.0 (1.5) / 51.5 (1.3)   29.0 (1.9) / 39.9 (1.8)   44.5 (2.1) / 55.4 (1.9)   48.4 (1.1) / 66.9 (0.8)
RoBERTa  DBERT          59.5 (1.0) / 75.1 (0.9)   45.4 (1.5) / 60.7 (1.5)   38.4 (1.8) / 49.8 (1.7)   28.2 (1.5) / 38.8 (1.5)   42.2 (2.3) / 52.6 (2.0)   45.8 (1.1) / 63.6 (1.1)
RoBERTa  DRoBERTa       56.2 (0.7) / 72.1 (0.7)   41.4 (0.8) / 57.1 (0.8)   38.4 (1.1) / 49.5 (0.9)   30.2 (1.3) / 41.0 (1.2)   41.2 (0.9) / 51.2 (0.8)   43.6 (1.1) / 61.6 (0.9)

Table 6: Training models on various datasets, each with 10,000 samples, and measuring their generalisation to different evaluation datasets. We report the mean and standard deviation (in parentheses) over 10 runs with different random seeds.

Irrespective of the training dataset, BiDAF consistently performs poorly on the adversarially collected evaluation datasets, and we also note a substantial performance drop when trained on DBiDAF, DBERT, or DRoBERTa and evaluated on DSQuAD.

In contrast, BERT and RoBERTa are able to partially overcome their blind spots through training on data collected with a model in the loop, and to a degree that far exceeds what would be expected from random retraining (cf. Table 5). For example, BERT reaches 47.9 F1 when trained and evaluated on DBERT, while RoBERTa trained on DRoBERTa reaches 41.0 F1 on DRoBERTa, both considerably better than random retraining or than training on the non-adversarially collected DSQuAD(10K), with gains of 20.6 F1 for BERT and 18.9 F1 for RoBERTa over the latter. These observations suggest that there exists learnable structure among harder questions that can be picked up by some of the models, yet not all, as BiDAF fails to achieve this. The fact that even BERT can learn to generalise to DRoBERTa, but not BiDAF to DBERT, suggests the existence of an inherent limitation to what BiDAF can learn from these new samples, compared to BERT and RoBERTa.

More generally, we observe that training on DS, where S is a stronger RC model, helps generalise to DW, where W is a weaker model, for example, training on DRoBERTa and testing on DBERT. On the other hand, training on DW also leads to generalisation towards DS. For example, RoBERTa trained on 10,000 SQuAD samples reaches 22.1 F1 on DRoBERTa (DS), whereas training RoBERTa on DBiDAF and DBERT (DW) bumps this number to 39.9 F1 and 38.8 F1, respectively.

Third, we observe similar performance degradation patterns for both BERT and RoBERTa on DSQuAD when trained on data collected with increasingly stronger models in the loop. For example, RoBERTa evaluated on DSQuAD achieves 82.8, 80.0, 75.1, and 72.1 F1 when trained on DSQuAD(10K), DBiDAF, DBERT, and DRoBERTa respectively. This may indicate a gradual shift in the distributions of composed questions as the model in the loop gets stronger.

These observations suggest an encouraging takeaway for the model-in-the-loop annotation paradigm: even though a particular model might be chosen as an adversary in the annotation loop, which at some point falls behind more recent state-of-the-art models, these future models can still benefit from data collected with the weaker model, and also generalise better to samples composed with the stronger model in the loop.

We further show experimental results for the same models and training datasets, but now including SQuAD as additional training data, in Table 7. In this training setup we generally see improved generalisation to DBiDAF, DBERT, and DRoBERTa. Interestingly, the relative differences between DBiDAF, DBERT, and DRoBERTa as training sets used in conjunction with SQuAD are much diminished, and especially DRoBERTa as (part of) the training set now generalises substantially better.


Model    Training Dataset      DSQuAD (EM / F1)          DBiDAF (EM / F1)          DBERT (EM / F1)           DRoBERTa (EM / F1)
BiDAF    DSQuAD                56.7 (0.5) / 70.1 (0.3)   11.6 (1.0) / 21.3 (1.1)   8.6 (0.6) / 17.3 (0.8)    8.3 (0.7) / 16.8 (0.5)
BiDAF    DSQuAD + DBiDAF       56.3 (0.6) / 69.7 (0.4)   14.4 (0.9) / 24.4 (0.9)   15.6 (1.1) / 24.7 (1.1)   14.3 (0.5) / 23.3 (0.7)
BiDAF    DSQuAD + DBERT        56.2 (0.6) / 69.4 (0.6)   14.4 (0.7) / 24.2 (0.8)   15.7 (0.6) / 25.1 (0.6)   13.9 (0.8) / 22.7 (0.8)
BiDAF    DSQuAD + DRoBERTa     56.2 (0.7) / 69.6 (0.6)   14.7 (0.9) / 24.8 (0.8)   17.9 (0.5) / 26.7 (0.6)   16.7 (1.1) / 25.0 (0.8)
BERT     DSQuAD                74.8 (0.3) / 86.9 (0.2)   46.4 (0.7) / 60.5 (0.8)   24.4 (1.2) / 35.9 (1.1)   17.3 (0.7) / 28.9 (0.9)
BERT     DSQuAD + DBiDAF       75.2 (0.4) / 87.2 (0.2)   52.4 (0.9) / 66.5 (0.9)   40.9 (1.3) / 51.2 (1.5)   32.9 (0.9) / 44.1 (0.8)
BERT     DSQuAD + DBERT        75.1 (0.3) / 87.1 (0.3)   54.1 (1.0) / 68.0 (0.8)   43.7 (1.1) / 54.1 (1.3)   34.7 (0.7) / 45.7 (0.8)
BERT     DSQuAD + DRoBERTa     75.3 (0.4) / 87.1 (0.3)   53.0 (1.1) / 67.1 (0.8)   44.1 (1.1) / 54.4 (0.9)   36.6 (0.8) / 47.8 (0.5)
RoBERTa  DSQuAD                73.2 (0.4) / 86.3 (0.2)   48.9 (1.1) / 64.3 (1.1)   31.3 (1.1) / 43.5 (1.2)   16.1 (0.8) / 26.7 (0.9)
RoBERTa  DSQuAD + DBiDAF       73.9 (0.4) / 86.7 (0.2)   55.0 (1.4) / 69.7 (0.9)   46.5 (1.1) / 57.3 (1.1)   31.9 (0.8) / 42.4 (1.0)
RoBERTa  DSQuAD + DBERT        73.8 (0.2) / 86.7 (0.2)   55.4 (1.0) / 70.1 (0.9)   48.9 (1.0) / 59.0 (1.2)   32.9 (1.3) / 43.7 (1.4)
RoBERTa  DSQuAD + DRoBERTa     73.5 (0.3) / 86.5 (0.2)   55.9 (0.7) / 70.6 (0.7)   49.1 (1.2) / 59.5 (1.2)   34.7 (1.0) / 45.9 (1.2)

Table 7: Training models on SQuAD, as well as SQuAD combined with different adversarially created datasets. We report the mean and standard deviation (in parentheses) over 10 runs with different random seeds.

We see that BERT and RoBERTa both show consistent performance gains with the addition of the original SQuAD1.1 training data, but unlike in Table 6, this comes without any noticeable decline in performance on DSQuAD, suggesting that the adversarially constructed datasets expose inherent model weaknesses, as investigated by Liu et al. (2019a).

Furthermore, RoBERTa achieves the strongest results on the adversarially collected evaluation sets, in particular when trained on DSQuAD + DRoBERTa. This stands in contrast to the results in Table 6, where training on DBiDAF in several cases led to better generalisation than training on DRoBERTa. A possible explanation is that training on DRoBERTa leads to a larger degree of overfitting to specific adversarial examples in DRoBERTa than training on DBiDAF, and that the inclusion of a large number of standard SQuAD training samples can mitigate this effect.

Results for the models trained on all the datasets combined (DSQuAD, DBiDAF, DBERT, and DRoBERTa) are shown in Table 8. These further support the previous observations and provide additional performance gains where, for example, RoBERTa achieves F1 scores of 86.9 on DSQuAD, 74.1 on DBiDAF, 65.1 on DBERT, and 52.7 on DRoBERTa, surpassing the best previous performance on all adversarial datasets.

Finally, we identify a risk of datasets constructed with weaker models in the loop becoming outdated. For example, RoBERTa achieves 58.2 EM/73.2 F1 on DBiDAF, in contrast to 0.0 EM/5.5 F1 for BiDAF – which is not far from the non-expert human performance of 62.6 EM/78.5 F1 (cf. Table 2).

It is also interesting to note that, even when training on all the combined data (cf. Table 8), BERT outperforms RoBERTa on DRoBERTa and vice versa, suggesting that there may exist weaknesses inherent to each model class.

4.3 Generalisation to Non-Adversarial Data

Compared to standard annotation, the model-in-the-loop approach generally results in new question distributions. Consequently, models trained on adversarially composed questions might not be able to generalise to standard ("easy") questions, thus limiting the practical usefulness of the resulting data. To what extent do models trained on model-in-the-loop questions generalise differently to standard ("easy") questions, compared to models trained on standard ("easy") questions?


Model    DSQuAD (EM / F1)          DBiDAF (EM / F1)          DBERT (EM / F1)           DRoBERTa (EM / F1)
BiDAF    57.1 (0.4) / 70.4 (0.3)   17.1 (0.8) / 27.0 (0.9)   20.0 (1.0) / 29.2 (0.8)   18.3 (0.6) / 27.4 (0.7)
BERT     75.5 (0.2) / 87.2 (0.2)   57.7 (1.0) / 71.0 (1.1)   52.1 (0.7) / 62.2 (0.7)   43.0 (1.1) / 54.2 (1.0)
RoBERTa  74.2 (0.3) / 86.9 (0.3)   59.8 (0.5) / 74.1 (0.6)   55.1 (0.6) / 65.1 (0.7)   41.6 (1.0) / 52.7 (1.0)

Table 8: Training models on SQuAD combined with all the adversarially created datasets DBiDAF, DBERT, and DRoBERTa. We report the mean and standard deviation (in parentheses) over 10 runs with different random seeds.

To measure this we further train each of our three models on either DBiDAF, DBERT, or DRoBERTa and test on DSQuAD, with results in the DSQuAD columns of Table 6. For comparison, the models are also trained on 10,000 SQuAD1.1 samples (referred to as DSQuAD(10K)) chosen from the same passages as the adversarial datasets, thus eliminating size and paragraph choice as potential confounding factors. The models are tuned for EM on the held-out DSQuAD validation set. Note that, although performance values on the majority vote DSQuAD dataset are lower than on the original, for the reasons described earlier, this enables direct comparisons across all datasets.

Remarkably, neither BERT nor RoBERTa shows substantial drops when trained on DBiDAF compared to training on SQuAD data (−2.1 F1 and −2.8 F1): training these models on a dataset with a weaker model in the loop still leads to strong generalisation even to data from the original SQuAD distribution, which all models in the loop are trained on. BiDAF, on the other hand, fails to learn such information from the adversarially collected data, and drops >30 F1 for each of the new training sets, compared to training on SQuAD.

We also observe a gradual decrease in generalisation to SQuAD when training on DBiDAF towards training on DRoBERTa. This suggests that the stronger the model, the more dissimilar the resulting data distribution becomes from the original SQuAD distribution. We later find further support for this explanation in a qualitative analysis (Section 5). It may however also be due to a limitation of BERT and RoBERTa – similar to BiDAF – in learning from a data distribution designed to beat these models; an even stronger model might learn more from, for example, DRoBERTa.

4.4 Generalisation to DROP and NaturalQuestions

Finally, we investigate to what extent models can transfer skills learned on the datasets created with a model in the loop to two recently introduced datasets: DROP (Dua et al., 2019) and NaturalQuestions (Kwiatkowski et al., 2019). In this experiment we select the subsets of DROP and NaturalQuestions that align with the structural constraints of SQuAD to ensure a like-for-like analysis. Specifically, we only consider questions in DROP where the answer is a span in the passage and where there is only one candidate answer. For NaturalQuestions, we consider all non-tabular long answers as passages, remove HTML tags and use the short answer as the extracted span. We apply this filtering on the validation sets for both datasets. Next we split them, stratifying by document (as we did for DSQuAD), which results in 1409/1418 validation and test set examples for DROP, and 964/982 for NaturalQuestions, respectively. We denote these datasets as DDROP and DNQ for clarity and distinction from their unfiltered versions. We consider the same models and training datasets as before, but tune on the respective validation sets of DDROP and DNQ. Table 6 shows the results of these experiments in the respective DDROP and DNQ columns.
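As a rough illustration of the extractive-subset criterion described above, here is a sketch under our own simplifying assumptions (it ignores the datasets' concrete file formats and assumes HTML has already been stripped for NaturalQuestions):

```python
def keep_for_extractive_subset(passage: str, candidate_answers: list[str]) -> bool:
    """Keep a question only if it has exactly one candidate answer and that answer
    occurs verbatim as a span of the passage, mirroring the SQuAD-style constraint
    used here to filter DROP and NaturalQuestions."""
    return len(candidate_answers) == 1 and candidate_answers[0] in passage
```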

First, we observe clear generalisation improvements towards DDROP across all models compared to training on DSQuAD(10K) when training on any of DBiDAF, DBERT, or DRoBERTa. That is, including a model in the loop for the training dataset leads to improved transfer towards DDROP. Note that DROP also makes use of a BiDAF model in the loop during annotation; these results are in line with our prior observations when testing the same setups on DBiDAF, DBERT and DRoBERTa, compared to training on DSQuAD(10K).


[Figure 5 compares the proportion of comprehension types required by questions in SQuAD, D-BiDAF, D-BERT, D-RoBERTa, DROP, and NaturalQuestions, over the labels: explicit, paraphrasing, external knowledge, co-reference, multi-hop, comparative, numeric, negation, filtering, temporal, spatial, inductive, and implicit.]

Figure 5: Comparison of comprehension types for the questions in different datasets. The label types are neither mutually exclusive nor comprehensive. Values above columns indicate excess of the axis range.

Second, we observe overall strong transfer results towards DNQ, with up to 69.8 F1 for a BERT model trained on DBiDAF. Note that this result is similar to, and even slightly improves over, model training with SQuAD data of the same size. That is, relative to training on SQuAD data, training on adversarially collected data DBiDAF does not impede generalisation to the DNQ dataset, which was created without a model in the annotation loop. We then however see a similar negative performance progression as observed before when testing on DSQuAD: the stronger the model in the annotation loop of the training dataset, the lower the test accuracy on test data from a data distribution composed without a model in the loop.

5 Qualitative Analysis

Having applied the general model-in-the-loop methodology on models of varying strength, we next perform a qualitative comparison of the nature of the resulting questions. As reference points we also include the original SQuAD questions, as well as DROP and NaturalQuestions in this comparison: these datasets are both constructed to overcome limitations in SQuAD and have subsets sufficiently similar to SQuAD to make an analysis possible. Specifically, we seek to understand the qualitative differences in terms of reading comprehension challenges posed by the questions in each of these datasets.

5.1 Comprehension Requirements

There exists a variety of prior work that seeks to understand the types of knowledge, comprehension skills or types of reasoning required to answer questions based on text (Rajpurkar et al., 2016; Clark et al., 2018; Sugawara et al., 2019; Dua et al., 2019; Dasigi et al., 2019); we are however unaware of any commonly accepted formalism. We take inspiration from these but develop our own taxonomy of comprehension requirements which suits the datasets analysed. Our taxonomy contains 13 labels, most of which are commonly used in other work. However, the following three deserve additional clarification: i) explicit – for which the answer is stated nearly word-for-word in the passage as it is in the question, ii) filtering – a set of answers is narrowed down to select one by some particular distinguishing feature, and iii) implicit – the answer builds on information implied by the passage and does not otherwise require any of the other types of reasoning.

We annotate questions with labels from this catalogue in a manner that is not mutually exclusive, and neither fully comprehensive; the development of such a catalogue is itself very challenging. Instead, we focus on capturing the most salient characteristics of each given question, and assign it up to three of the labels in our catalogue. In total, we analyse 100 samples from the validation set of each of the datasets; Figure 5 shows the results.

5.2 Observations

An initial observation is that the majority (57%) of answers to SQuAD questions are stated explicitly, without comprehension requirements beyond the literal level. This number decreases substantially for any of the model-in-the-loop datasets derived from SQuAD (e.g., 8% for DBiDAF) and also DDROP, yet 42% of questions in DNQ share this property. In contrast to SQuAD, the model-in-the-loop questions generally tend to involve more paraphrasing. They also require more external knowledge, and multi-hop inference (beyond co-reference resolution), with an increasing trend for stronger models used in the annotation loop.


Model-in-the-loop questions further fan out into a variety of small, but non-negligible proportions of more specific types of inference required for comprehension, for example, spatial or temporal inference (both going beyond explicitly stated spatial or temporal information) – SQuAD questions rarely require these at all. Some of these more particular inference types are common features of the other two datasets, in particular comparative questions for DROP (60%) and to a small extent also NaturalQuestions. Interestingly, DBiDAF possesses the largest number of comparison questions (11%) among our model-in-the-loop datasets, whereas DBERT and DRoBERTa only possess 1% and 3%, respectively. This offers an explanation for our previous observation in Table 6, where BERT and RoBERTa perform better on DDROP when trained on DBiDAF rather than on DBERT or DRoBERTa. It is likely that BiDAF as a model in the loop is worse than BERT and RoBERTa at comparative questions, as evidenced by the results in Table 6, with BiDAF reaching 8.6 F1, BERT reaching 28.9 F1, and RoBERTa reaching 39.4 F1 on DDROP (when trained on DSQuAD(10K)).

The distribution of NaturalQuestions contains elements of both the SQuAD and DBiDAF distributions, which offers a potential explanation for the strong performance on DNQ of models trained on DSQuAD(10K) and DBiDAF. Finally, the gradually shifting distribution away from both SQuAD and NaturalQuestions as the model-in-the-loop strength increases reflects our prior observations on the decreasing performance on SQuAD and NaturalQuestions of models trained on datasets with progressively stronger models in the loop.

6 Discussion and Conclusions

We have investigated an RC annotation paradigm that requires a model in the loop to be "beaten" by an annotator. Applying this approach with progressively stronger models in the loop (BiDAF, BERT, and RoBERTa), we produced three separate datasets. Using these datasets, we investigated several questions regarding the annotation paradigm, in particular whether such datasets grow outdated as stronger models emerge, and their generalisation to standard (non-adversarially collected) questions. We found that stronger models can still learn from data collected with a weak adversary in the loop, and their generalisation improves even on datasets collected with a stronger adversary. Models trained on data collected with a model in the loop further generalise well to non-adversarially collected data, both on SQuAD and on NaturalQuestions, yet we observe a gradual deterioration in performance with progressively stronger adversaries.

We see our work as a contribution towards the emerging paradigm of model-in-the-loop annotation. While this paper has focused on RC, with SQuAD as the original dataset used to train model adversaries, we see no reason in principle why findings would not be similar for other tasks using the same annotation paradigm, when crowdsourcing challenging samples with a model in the loop. We would expect the insights and benefits conveyed by model-in-the-loop annotation to be the greatest on mature datasets where models exceed human performance: here the resulting data provides a magnifying glass on model performance, focused in particular on samples which models struggle on. On the other hand, applying the method to datasets where performance has not yet plateaued would likely result in a distribution more similar to the original data, which is challenging to models a priori. We hope that the series of experiments on replicability, observations on transfer between datasets collected using models of different strength, as well as our findings regarding generalisation to non-adversarially collected data, can support and inform future research and annotation efforts using this paradigm.

Acknowledgements

The authors would like to thank Christopher Potts for his detailed and constructive feedback, and our reviewers. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 875160, and by the UK Defence Science and Technology Laboratory (Dstl) and Engineering and Physical Sciences Research Council (EPSRC) under grant EP/R018693/1, as a part of the collaboration between US DOD, UK MOD, and UK EPSRC under the Multidisciplinary University Research Initiative (MURI).

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics.

Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 63–69, Minneapolis, USA. Association for Computational Linguistics.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932, Hong Kong, China. Association for Computational Linguistics.

Jia Deng, R. Socher, Li Fei-Fei, Wei Dong, Kai Li, and Li-Jia Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 00, pages 248–255.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4537–4546, Hong Kong, China. Association for Computational Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. CoRR, abs/1711.01505.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6, Melbourne, Australia. Association for Computational Linguistics.


Edward Grefenstette, Robert Stanforth, Brendan O’Donoghue, Jonathan Uesato, Grzegorz Swirszcz, and Pushmeet Kohli. 2018. Strength in numbers: Trading-off robustness and computation via adversarially-trained ensembles. CoRR, abs/1811.09300.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.

Tomáš Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR, pages 3–12. ACM/Springer.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019a. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1725–1735, Melbourne, Australia. Association for Computational Linguistics.

Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated MAchine Reading COmprehension dataset. arXiv preprint arXiv:1611.09268.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25, Vancouver, Canada. Association for Computational Linguistics.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In The International Conference on Learning Representations (ICLR).

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Honolulu, Hawaii. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2019. Assessing the benchmarking capacity of machine reading comprehension datasets. CoRR, abs/1911.09241.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 1–6, Hong Kong, China. Association for Computational Linguistics.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Computational Linguistics.

Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial examples for question answering. Transactions of the Association for Computational Linguistics, 7:387–401.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada. Association for Computational Linguistics.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018a. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Mastering the dungeon: Grounded language learning by mechanical turker descent. In International Conference on Learning Representations.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.


Figure 6: Analysis of question types across datasets (questions bucketed by first wh-word: what, who, when, which, where, why, how, in, on, to, other; proportions in %, shown for SQuAD, BiDAF, BERT and RoBERTa).

Figure 7: Question length distribution across datasets (SQuAD: μ = 10.29, σ = 3.50; BiDAF: μ = 9.76, σ = 4.51; BERT: μ = 9.77, σ = 4.36; RoBERTa: μ = 9.99, σ = 4.61).

A Additional Dataset Statistics

Question statistics In Figure 7 we analyse question lengths in SQuAD1.1 and compare them to questions constructed with different models in the annotation loop. While the means of the distributions are similar, there is more question-length variability when using a model in the loop. We also analyse question types by wh-word, as described earlier (see Figure 6). This is shown in further detail in sunburst plots of the first three question tokens for D_SQuAD (cf. Figure 11), D_BiDAF (cf. Figure 13), D_BERT (cf. Figure 12), and D_RoBERTa (cf. Figure 14). We observe a general trend towards more diverse questions with increasing model-in-the-loop strength.
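The question-length and wh-word statistics above can be recomputed from the raw question strings. The following Python sketch is illustrative only and is not the authors' analysis code: it assumes questions are available as plain strings, uses whitespace tokenisation, and buckets questions by their first token using the wh-word categories shown in Figure 6.

import statistics
from collections import Counter

# Wh-word buckets as they appear in the question-type analysis (Figure 6).
WH_WORDS = {"what", "who", "when", "which", "where", "why", "how", "in", "on", "to"}

def question_stats(questions):
    """Length statistics and a first-token (wh-word) breakdown for question strings."""
    lengths = [len(q.split()) for q in questions]
    first_tokens = [q.split()[0].lower() for q in questions if q.strip()]
    types = Counter(t if t in WH_WORDS else "other" for t in first_tokens)
    total = sum(types.values())
    return {
        "mean_length": statistics.mean(lengths),
        "std_length": statistics.pstdev(lengths),  # population standard deviation
        "type_distribution": {t: c / total for t, c in types.most_common()},
    }

# Toy example; the real analysis would iterate over each dataset split.
print(question_stats([
    "Who patented the first steam pump?",
    "How long has boiling water been used to produce motion?",
]))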

Answer statistics Figure 9 allows for further analysis of answer lengths across datasets. We observe that answers for all datasets constructed with a model in the loop tend to be longer than in SQuAD. There is furthermore a trend of increasing answer length and variability with increasing model-in-the-loop strength. We show an analysis of answer types in Figure 8.

Figure 8: Analysis of answer types across datasets (categories: Common Noun Phrase, Date, Other Entity, Person, Organisation, Other Numeric, Verb Phrase, Other, Adjective Phrase, Location, Clause; proportions in % for SQuAD, BiDAF, BERT and RoBERTa).

Figure 9: Answer length distribution across datasets (SQuAD: μ = 2.57, σ = 2.27; BiDAF: μ = 2.94, σ = 3.23; BERT: μ = 2.96, σ = 3.50; RoBERTa: μ = 3.17, σ = 4.59).

B Annotation Interface Details

We have three key steps in the dataset construction process: i) training and qualification, ii) “Beat the AI” annotation, and iii) answer validation.

Training and Qualification This is a combined training and qualification task; a screenshot of the interface is shown in Figure 15. The first step involves a set of five assignments requiring the worker to demonstrate the ability to generate questions and indicate answers by highlighting the corresponding spans in the passage. Once complete, the worker is shown a sample “Beat the AI” HIT for a pre-determined passage, which helps facilitate manual validation. In earlier experiments these two steps were presented as separate interfaces; however, this created a bottleneck between the two layers of qualification and slowed down annotation considerably. In total, 1,386 workers completed this task, with 752 being assigned the qualification.

Figure 10: Worker distribution, together with the number of manually validated QA pairs per worker (valid, invalid, or not validated).

“Beat the AI” Annotation The “Beat the AI” question generation HIT presents workers with a randomly selected passage from SQuAD1.1, about which they are expected to generate questions and provide answers. This data is sent to the corresponding model-in-the-loop API, running on AWS infrastructure consisting primarily of a load balancer and a t2.xlarge EC2 instance with the T2/T3 Unlimited setting enabled, to allow high sustained CPU performance during annotation runs. The model API returns a prediction, which is scored against the worker’s answer to determine whether the worker has successfully managed to “beat” the model. Only questions which the model fails to answer are considered valid; a screenshot of this interface is shown in Figure 16. Workers are asked to ideally submit at least three valid questions; however, fewer are also accepted, in particular for very short passages. A sample of each worker’s HITs is manually validated; workers who do not satisfy the question quality requirements have their qualification revoked and all their annotated data discarded. This was the case for 99 workers. Worker validation distributions are shown in Figure 10.
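Whether a submitted question “beats” the model comes down to scoring the model’s prediction against the worker’s answer. The sketch below is a hypothetical illustration of such a check using standard SQuAD-style answer normalisation and word-overlap F1; the exact scoring rule and the acceptance threshold used during annotation are not specified in this appendix, so both should be read as assumptions.

import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, reference):
    """Word-overlap F1 between a predicted answer and a reference answer span."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def beats_the_model(model_prediction, worker_answer, threshold=0.4):
    """Accept the question only if the model's prediction scores below the
    (illustrative) F1 threshold against the worker's answer."""
    return f1_score(model_prediction, worker_answer) < threshold

# True: the prediction shares no content words with the worker's answer.
print(beats_the_model("the red car", "a blue bicycle"))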

Answer Validation The answer validation interface (cf. Figure 17) is used to validate the answerability of the validation and test sets for each different model used in the annotation loop. Every previously collected question generation HIT from these dataset parts, which had not been discarded during manual validation, is submitted to at least 3 distinct annotators. Workers are shown the passage and the previously generated questions, and are asked to highlight the answer in the passage. In a post-processing step, only questions with at least 1 valid matching answer out of 3 are finally retained.
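A minimal sketch of that retention filter is given below; the agreement test shown (normalised exact match against the originally annotated answer) is an assumption, since the precise matching criterion is not spelled out in this appendix.

def normalize(text):
    """Lowercase and collapse whitespace; a stand-in for fuller answer normalisation."""
    return " ".join(text.lower().split())

def retain_question(original_answer, validation_answers, min_matches=1):
    """Keep a question if at least `min_matches` of the validation answers match
    the original answer after normalisation (the matching rule is illustrative)."""
    target = normalize(original_answer)
    matches = sum(normalize(a) == target for a in validation_answers)
    return matches >= min_matches

# Retained: one of the three validation answers reproduces the original span.
print(retain_question("New York City", ["NYC", "new york city", "Manhattan"]))  # True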

C Catalogue of Comprehension Requirements

We give a description of each item in our catalogue of comprehension requirements in Table 9, accompanied by an example for illustration. These are the labels used for the qualitative analysis performed in Section 5.


Figure 11: Question sunburst plot for D_SQuAD.

Figure 12: Question sunburst plot for D_BERT.

Figure 13: Question sunburst plot for D_BiDAF.

Figure 14: Question sunburst plot for D_RoBERTa.


Figure 15: Training and qualification interface. Workers are first expected to familiarise themselves with the interface and then complete a sample “Beat the AI” task for validation.


Figure 16: “Beat the AI” question generation interface. Human annotators are tasked with asking questions about a provided passage which the model in the loop fails to answer correctly.

Figure 17: Answer validation interface. Workers are expected to provide answers to questions generated in the “Beat the AI” task. The additional answers are used to determine question answerability and non-expert human performance.


Explicit: Answer stated nearly word-for-word in the passage as it is in the question.
Passage: Sayyid Abul Ala Maududi was an important early twentieth-century figure in the Islamic revival in India [. . . ]
Question: Who was an important early figure in the Islamic revival in India?

Paraphrasing: Question paraphrases parts of the passage, generally relying on context-specific synonyms.
Passage: Seamans’ establishment of an ad-hoc committee [. . . ]
Question: Who created the ad-hoc committee?

External Knowledge: The question cannot be answered without access to sources of knowledge beyond the passage.
Passage: [. . . ] the 1988 film noir thriller Stormy Monday, directed by Mike Figgis and starring Tommy Lee Jones, Melanie Griffith, Sting and Sean Bean.
Question: Which musician was featured in the film Stormy Monday?

Co-reference: Requires resolution of a relationship between two distinct words referring to the same entity.
Passage: Tamara de Lempicka was a famous artist born in Warsaw. [. . . ] Better than anyone else she represented the Art Deco style in painting and art [. . . ]
Question: Through what creations did Lempicka express a kind of art popular after WWI?

Multi-Hop: Requires more than one step of inference, often across multiple sentences.
Passage: [. . . ] and in 1916 married a Polish lawyer Tadeusz Lempicki. Better than anyone else she represented the Art Deco style in painting and art [. . . ]
Question: Into what family did the artist who represented the Art Deco style marry?

Comparative: Requires a comparison between two or more attributes (e.g., smaller than, last).
Passage: The previous chairs were Rajendra K. Pachauri, elected in May 2002; Robert Watson in 1997; and Bert Bolin in 1988.
Question: Who was elected earlier, Robert Watson or Bert Bolin?

Numeric: Any numeric reasoning (e.g., some form of calculation is required to arrive at the correct answer).
Passage: [. . . ] it has been estimated that Africans will make up at least 30% of the delegates at the 2012 General Conference, and it is also possible that 40% of the delegates will be from outside [. . . ]
Question: From which continent is it estimated that members will make up nearly a third of participants in 2012?

Negation: Requires interpreting a single or multiple negations.
Passage: Subordinate to the General Conference are the jurisdictional and central conferences which also meet every four years.
Question: What is not in charge?

Filtering: Narrowing down a set of answers to select one by some particular distinguishing feature.
Passage: [. . . ] was engaged with Johannes Bugenhagen, Justus Jonas, Johannes Apel, Philipp Melanchthon and Lucas Cranach the Elder and his wife as witnesses [. . . ]
Question: Whose partner could testify to the couple’s agreement to marry?

Temporal: Requires an understanding of time and change, and related aspects. Goes beyond directly stated answers to When questions or external knowledge.
Passage: In 2010 the Amazon rainforest experienced another severe drought, in some ways more extreme than the 2005 drought.
Question: What occurred in 2005 and then again five years later?

Spatial: Requires an understanding of the concept of space, location, or proximity. Goes beyond finding directly stated answers to Where questions.
Passage: Warsaw lies in east-central Poland about 300 km (190 mi) from the Carpathian Mountains and about 260 km (160 mi) from the Baltic Sea, 523 km (325 mi) east of Berlin, Germany.
Question: Is Warsaw closer to the Baltic Sea or Berlin, Germany?

Inductive: A particular case is addressed in the passage but inferring the answer requires generalisation to a broader category.
Passage: [. . . ] frequently evoked by particular events in his life and the unfolding Reformation. This behavior started with his learning of the execution of Johann Esch and Heinrich Voes, the first individuals to be martyred by the Roman Catholic Church for Lutheran views [. . . ]
Question: How did the Roman Catholic Church deal with non-believers?

Implicit: Builds on information implied in the passage and does not otherwise require any of the above types of reasoning.
Passage: Despite the disagreements on the Eucharist, the Marburg Colloquy paved the way for the signing in 1530 of the Augsburg Confession, and for the [. . . ]
Question: What could not keep the Augsburg confession from being signed?

Table 9: Comprehension requirement definitions and examples from adversarial model-in-the-loop annotated RC datasets. Note that these types are not mutually exclusive. The annotated answer is highlighted in yellow.
