
Danish Fact Verification:
An End-to-End Machine Learning System for Automatic Fact-Checking of Danish Textual Claims

Julie Binau and Henri Schulte
{jubi,hens}@itu.dk

Supervised by Leon Derczynski
[email protected]

A thesis presented for the degree of M.Sc. Software Development (Design)

Project code: KISPECI1SE

IT University of Copenhagen
Denmark

2nd June 2020


Table of Contents

1 Introduction 3

2 Related Works 3
2.1 FEVER Data Set 4
2.2 FEVER Shared Task 4
2.3 Submissions to the FEVER Shared Task 5
2.4 Bidirectional Encoder Representations from Transformers (BERT) 5
2.5 Statistical Cues in Data Sets 6

3 Data Set Creation 6
3.1 Data Collection Workflow for Wikipedia 7
3.1.1 Parsing 7
3.1.2 Claim Generation 7
3.1.3 Generating Mutations 8
3.1.4 Dictionary 8
3.1.5 Annotation 8
3.1.6 Data Creation Tools 9
3.1.7 Inter-Annotator Agreement 9
3.2 General Considerations 9
3.3 Using FEVER Data 10
3.4 Data Set Statistics 10

4 Fact Verification: Technical Approach 11
4.1 Evidence Retrieval System 11
4.2 Fact Verification Classifier 11
4.2.1 Filling Evidence for NEI Claims 11
4.2.2 Balancing Classes 11
4.2.3 Generating Model Inputs 12
4.2.4 Fine-Tuning the Model 12
4.3 Fact Verification Pipeline 12

5 Analysis and Results 13
5.1 Model Fine-Tuning 13
5.1.1 Balancing Classes in Training Data 13
5.1.2 Testing Learning Rates 14
5.1.3 Fine-Tuning Epochs 14
5.1.4 Sequence Length 15
5.1.5 Increasing Batch Size 16
5.1.6 Alternative Training Data 16
5.2 Evidence Retrieval 16
5.2.1 Evidence Retrieval in Isolation 16
5.2.2 Evidence Retrieval in the Pipeline 17
5.3 End-to-End Test Set Performance 18
5.4 Probing for Statistical Cues 19
5.4.1 Methodology 19
5.4.2 Cues in the FEVER Data Set 19
5.4.3 Cues in our Data Set 20
5.5 Data Ablations 21
5.5.1 Shuffled Input 21
5.5.2 No Evidence 21
5.5.3 Random Labels 22
5.5.4 Incomplete Input 22
5.5.5 Shuffling Nouns and Verbs 22


6 Discussion 23
6.1 Impact of Evidence Selection Decisions 23
6.2 Challenges in Evidence Retrieval 23
6.3 Alternative Sources for Claims and Evidence 23
6.4 Substituting with FEVER Data 24
6.4.1 Zero-Shot Cross-Lingual Transfer 24
6.4.2 Translated FEVER Data 24

7 Conclusion 24

8 Acknowledgements 25

List of Figures

1 Example of a fact-checked claim from TjekDet. 4
2 Example of claims from the FEVER data set. 4
3 Representation over the data creation workflow. 7
4 Parsing example. 7
5 Claim generation example. 7
6 Mutation example. 8
7 Dictionary example. 8
8 Example of claims labelled Supported, NotEnoughInfo and Refuted. 9
9 Screenshot of Annotation Tool in the Terminal. 9
10 Banner on a Wikipedia page indicating that the article has been authored by a bot. 10
11 Example of a poor translation. 10
12 Representation over fact verification pipeline. 13
13 f1 scores for different balancing strategies. 13
14 Precision and recall by class for different balancing strategies. 14
15 Train loss for different learning rate values. 14
16 f1 scores for different learning rates. 15
17 Loss by batch when fine-tuning for three epochs. 15
18 f1 by number of fine-tuning epochs on development set in pipeline. 15
19 f1 for varying sequence length on development set with original evidence. 16
20 f1 for different data sets. 16
21 Precision, recall and f1 for different values of l and k. 17
22 f1 in the pipelined system for retrieved and original evidence. 18
23 Confusion matrix over pipeline predictions on test set. 18
24 f1 for base, shuffled input and no evidence. 22
25 f1 by percent of randomised labels. 22
26 f1 for base and replaced words. 22

List of Tables

1 Class distribution of data sets. 10
2 Composition of FEVER samples. 16
3 Evidence retrieval configurations with highest f1. 17
4 Selection of configurations in the pipeline. 17
5 Parameters resulting in highest f1. 18
6 Fact Verification models across different tasks. 19
7 Top 5 cues by coverage. 20
8 High-productivity cues. 20
9 Top 6 cues by harmonic mean of utility and coverage. 20
10 Top 5 cues by coverage. 21
11 Top 6 cues by harmonic mean of utility and coverage. 21


Abstract—In this thesis we address the challenge of automatic fact verification for Danish by developing the first end-to-end solution for this task. Accordingly, we present a new data set and a trained inference model for Danish-language fact verification. The data set consists of 3,395 textual claims, labelled by human annotators based on the claims’ ability to be supported by evidence from Wikipedia. We assess the presence of unintended signals in our data set by adapting a neural probing method to the fact verification task. Further, we demonstrate our data set’s feasibility for Danish fact verification by developing an end-to-end machine learning system that retrieves relevant evidence for a claim and predicts its veracity. This system achieves a micro f1 of 58.6% and macro f1 of 53.4% on our test set. Ultimately, we deem that automatic fact verification for Danish is feasible, and that sufficient tools exist to allow future research to narrow the gap to English.

1. Introduction

Denmark is affected by digital misinformation furthered by recent developments in digital technologies, such as artificial intelligence and autonomous bots. In order to investigate the influence of these technologies on Danish society, a coalition of Danish members of parliament and industry professionals united to form the SIRI Commission in 2016 1. Their resulting report identifies bots spreading biased political news as a “serious threat to our democracy” (SIRI-Kommissionen, 2019). To alleviate this threat, they call for media to actively combat misinformation, e.g. through fact checking.

Fact checking, however, requires considerable manual labour, and fact checkers are unable to keep up with the ever-increasing amount of information that is being published online (Soleimani et al., 2020). Therefore, fact checking is now being considered a candidate for automation (Augenstein et al., 2019; Derczynski and Bontcheva, 2014) within the field of natural language processing (NLP). So far, research into fact verification is largely centered around the English language, but as SIRI-Kommissionen (2019) shows, the need for fact verification is also present in smaller languages, such as Danish.

Some current fact verification approaches make use of machine learning models that are trained on a data set of claims for which the veracity is known, e.g. they may be labelled as true or false. Models are trained to predict these labels, sometimes while consulting additional evidence, with the aim of constructing a general understanding of the task that can then be applied to claims for which the label is unknown. Traditionally, training such a model requires the existence of a suitable data set in the target language. However, the availability of Danish NLP resources is limited overall (Kirkedal et al., 2019) and to our knowledge a fact verification data set for Danish does not yet exist.

1. https://www.dr.dk/nyheder/viden/tech/kommission-skal-undersoege-kunstig-intelligens-indflydelse-paa-samfundet

Thus, in order to train a machine learning model to fact-check, our first step is to create a Danish fact verification data set, composed of claims and supporting or refuting evidence. This process requires considerable human oversight and involvement, to ensure that claims are annotated accurately. Then, we use our data set to train a machine learning model to perform fact verification on Danish claims. While doing so, we also address the challenge of retrieving evidence against which to verify a given claim.

The machine learning models that are used within fact verification and NLP in general can be rather complex due to their many independent parameters. For that reason we spend a considerable part of our analysis on finding the optimal configuration for our particular task and model.

Ultimately, we aim to identify and evaluate a possible way to use modern NLP technologies for fact verification in Danish and lay a foundation for future development in this area. To that end we present a fact checking data set containing thousands of labelled Danish claims and corresponding evidence. We also present a functioning fact verification system that demonstrates the applicability of modern NLP technologies to the domain of Danish fact verification.

2. Related Works

As the need for automatic detection of false statements continues to grow (Derczynski and Bontcheva, 2014), several large data sets for fact verification have emerged. These data sets generally provide statements, or claims, of varying veracity, sometimes accompanied by supporting or refuting evidence. The sources of these claims are generally regarded as either naturally-occurring or synthetic. Naturally-occurring claims are not specifically created for fact verification and can be found, for example, on social media. Claims in social media, however, are generally not accompanied by veracity labels and evidence. This limits their usefulness for automatic fact verification. Fact-checkers, on the other hand, work on finding and verifying claims (Vlachos and Riedel, 2014), thereby providing veracity labels. The example shown in Figure 1 (see page 4) checks the claim “Bill Gates already has his own CoVid-19 vaccine ready for you, and here is the patent”. The claim, which occurred in a number of Facebook posts, is assigned a label (false) and an explanation for the verdict is given2.

Researchers are now looking towards fact-checking sites as a potential source for naturally-occurring fact verification data sets. As of 2019, the largest data set of this kind is MultiFC (Augenstein et al., 2019), which collects over 30,000 claims from 26 English-language fact-checking websites.

2. https://www.mm.dk/tjekdet/artikel/afsloerer-dokument-at-covid-19-vaccinen-allerede-findes-nej-coronaskeptikere-misforstaar-patentansoegning


Figure 1. Example of a fact-checked claim from TjekDet.

In order to reach this number of claims, Augenstein et al. (2019) make a compromise between quantity and uniformity. By including 26 different sites, they source claims with up to 40 different veracity labels, from “true” and “mostly false” to “a little baloney” and “4 Pinocchios”. Additionally, their methodology limits the size of their data set to the number of fact-checked claims on the included sites.

To overcome this limitation, Thorne et al. (2018) create claims synthetically. By authoring claims specifically for the purpose of fact verification, they are able to control the size of the data set and the number of labels. In the section below, we present the FEVER data set as well as outline the FEVER shared task which builds on said data set. Following this, we present a selected handful of FEVER submissions and the methods they use to address the task.

2.1. FEVER Data Set

The original FEVER data set presented in Thorne et al. (2018) comprises a total of 185,445 claims created from approximately 50,000 popular Wikipedia articles and annotated as either SUPPORTS, REFUTES or NOTENOUGHINFO. Additionally, claims that are labelled SUPPORTS and REFUTES also come with the evidence against which this judgement has been made. Thorne et al. (2018) create this data set in two stages using 50 annotators: First generating claims from Wikipedia articles, then labelling them against evidence from Wikipedia.

The claim generation stage entails providing annotators with a randomly sampled sentence from the introductory section of an English Wikipedia article and asking them to create claims about the article’s entity, i.e. its subject. In addition to basing their claims on the provided sentence alone, annotators are also given the choice to use information from hyperlinked articles to allow for more complex claims (Thorne et al., 2018).

Annotators are also asked to create mutations of these claims by, for example, negating, generalising or replacing part of the claim. This is done to introduce refutable and non-verifiable claims into the data set (Thorne et al., 2018).

In the second stage, Thorne et al. (2018) ask annotators to label the previously created claims as either SUPPORTS, REFUTES or NOTENOUGHINFO. For the first two labels, annotators also mark the sentences they use as evidence for their decision. Once again, the annotators have access to articles hyperlinked in the entity’s article as well. Figure 2 shows annotated examples from the FEVER data set. Note that for the sake of simplicity some information has been omitted from the figure. The original data also includes which parts of the evidence specifically support or refute a claim as well as whether a claim is verifiable or not.

ID: 159923
Claim: Melisandre wrote the third novel in A Song of Ice and Fire.
[wiki/Melisandre]: Melisandre of Asshai is a fictional character in the A Song of Ice and Fire series of fantasy novels by American author George R. R. Martin [...]
Verdict: Refuted

ID: 40360
Claim: Me Before You has Emilia Clarke in the cast.
[wiki/Me_Before_You_(film)]: Me Before You is a 2016 British-American romantic drama film directed by Thea Sharrock as her directorial debut and adapted by English author Jojo Moyes from her 2012 novel of the same name. The film stars Emilia Clarke, Sam Claflin, [...]
Verdict: Supported

ID: 52213
Claim: A dog has written for Doctor Who.
[null]: Null
Verdict: Not Enough Info

Figure 2. Example of claims from the FEVER data set.

Thorne et al. (2018) segment the labelled data set into multiple subsets for training, development and testing, with the first retaining a majority of the claims at a size of 145,449.

2.2. FEVER Shared Task

Following the publication of the FEVER data set, Thorne et al. (2019) publish the FEVER shared task. This task challenges participants to develop a system that is capable of retrieving relevant textual evidence (from Wikipedia) for the human-generated claims in the FEVER data set (Thorne et al., 2018) and predict their veracity against that evidence. As a conclusive step, any system submitted to this task must assign a label to each claim: SUPPORTED, REFUTED or NOTENOUGHINFO. Participants are free to decide how to best retrieve evidence and predict labels. Submissions are measured not only on their label accuracy, but also on their ability to retrieve the expected evidence, which together constitute the FEVER score (Thorne et al., 2019).

2.3. Submissions to the FEVER Shared Task

In their paper, Thorne et al. (2019) highlight the methods used by some of the most successful task submissions, and report multiple teams using a multi-step approach to evidence selection. As an example of such a multi-step approach, Nie et al. (2019) present the Three-Phase Procedure, which consists of a document retrieval system, sentence retrieval and finally claim verification. For document retrieval, Nie et al. (2019) and Yoneda et al. (2019) extract keywords from each claim, and use those keywords to query the body of articles for matching evidence. Several participants report using combinations of named entities, noun phrases and capitalised expressions from the claim to create keywords for querying suitable information retrieval systems (Thorne et al., 2019). However, Yoneda et al. (2019) find that the structure of a standard claim frequently includes a mention of the entity’s title. Hence, extracting noun phrases or named entities could aid in retrieving relevant sentences for a claim as they may represent an entity. They also underline issues in co-referencing claims, which refer to their entity as it, they, she etc., rather than using the actual title of the entity the pronouns refer to.

A more naive approach assumes that a keyword represents any span of the text which can eventually be matched with a piece of text from a Wikipedia article (Nie et al., 2019). As mentioned, a popular method includes extracting noun phrases from claims. Hanselowski et al. (2019) reason one could “consider every noun phrase as a potential entity mention”, but also add that they achieve better results when combining it with a heuristic to also include other categories of words in e.g. movie or song titles. Otto (2019) also uses “noun chunks” from claims for querying an index. However, in this approach the noun phrases are also paired with a Named Entity Recognition (NER) subtask. Similarly, Malon (2019) reports that using NER in a document retrieval step ultimately provides a “significant boost” to the final results.

For the final step, claim verification, submissions exhibit a variety of solutions, albeit united in their reliance on both word embeddings to transform the textual claim-evidence pairs and some inference model to ultimately evaluate veracity.

For fact verification tasks such as FEVER, Soleimani et al. (2020) suggest using BERT as their language model. In their work, they implement BERT for sentence selection by first retrieving five potential evidence documents and then applying one of two strategies: The first is a point-wise strategy, where every piece of potential evidence is directly classified as evidence or non-evidence (Soleimani et al., 2020). The second strategy is to pair potential positive evidence together with negative samples, and then compare the contents to appropriately classify evidence (Soleimani et al., 2020). Similarly, Stammbach and Neumann (2019) also use BERT for sentence selection, but suggest a two-hop evidence process inspired by the work reported by Nie et al. (2019). Stammbach and Neumann (2019) argue that using BERT specifically for sentence selection enables a theoretical ability to “retrieve most of the relevant evidence in the FEVER dataset”. While Soleimani et al. (2020) do not investigate using BERT in an end-to-end framework, but merely suggest it for future research, Stammbach and Neumann (2019) implement BERT for claim verification as well. When using the BERT model, Stammbach and Neumann (2019) report 71.5% label accuracy. However, when excluding non-verifiable claims, label accuracy increases to 85.3% (Stammbach and Neumann, 2019). They conclude that BERT seems to struggle in classifying NOTENOUGHINFO claims, stating that these are “the most problematic one to predict correctly” (Stammbach and Neumann, 2019). Overall, researchers show promising results when using BERT for both the sentence selection and claim verification steps of the FEVER task (Stammbach and Neumann, 2019; Soleimani et al., 2020).

2.4. Bidirectional Encoder Representations from Transformers (BERT)

BERT is a language representation model using both pre-training and fine-tuning for natural language processing tasks (Devlin et al., 2018). It is trained on two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The MLM task helps the model learn deep bidirectional representations of the input tokens, rather than a left-to-right or right-to-left representation of a sequence. In MLM training 15% of (randomly chosen) tokens are replaced with [MASK] tokens, which the model must then predict (Devlin et al., 2018). The second task, NSP, trains BERT on “understanding the relationship between two sentences”, which Devlin et al. (2018) argue is not already captured by language modelling. Both tasks are performed on BookCorpus and Wikipedia3. Ultimately, these tasks make BERT’s “deep bidirectional” nature able to contribute in tackling a “broad set of NLP tasks” (Devlin et al., 2018).

Originally written for TensorFlow, BERT has since been implemented for PyTorch as part of Huggingface Transformers (Wolf et al., 2019). They offer several models with task-specific overhead, including but not limited to: BertForSequenceClassification, BertForMultipleChoice, BertForQuestionAnswering and BertForTokenClassification4. All tasks require additional fine-tuning to the specific domain, but could potentially be applied to a claim verification task.

3. https://github.com/google-research/bert
4. https://huggingface.co/transformers/model_doc/bert.html
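To illustrate the kind of task-specific head described above, the sketch below loads the multilingual checkpoint with a sequence-classification head. It follows the Transformers API described by Wolf et al. (2019) and is an illustrative snippet rather than the system presented later in this thesis; num_labels=3 is our assumption for three verdict classes.

```python
# Minimal sketch: a pre-trained BERT checkpoint with a sequence-classification
# head from Huggingface Transformers (illustrative only).
from transformers import BertForSequenceClassification, BertTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # multilingual checkpoint incl. Danish

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,  # e.g. Supported, Refuted, NotEnoughInfo
)
# The classification head is randomly initialised and must be fine-tuned
# on task-specific data before it produces meaningful predictions.
```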

BERT models are available in multiple variants, differing in size, e.g. base and large, case-sensitivity, i.e. cased and uncased, and pre-training language. In addition to pre-training on English only, a multilingual version pre-trained on up to 104 languages, including Danish, is also available 5.

While the original English-trained BERT is a state-of-the-art language processing model (Devlin et al., 2018), we see Ronnqvist et al. (2019) test the quality of multilingual BERT (mBERT) on a handful of tasks which differ in complexity and report mixed results. Specifically, these tasks are tested on English, German and Nordic languages (including Danish), where Ronnqvist et al. (2019) write:

“We found that the multilingual model notably lags behind the available monolingual models and the gap opens as the complexity of the task increases.”

As a suggested reason for these results, Ronnqvist et al. (2019) argue that while monolingual BERT is trained on a 100% English corpus, any multilingual implementation of BERT will be trained on less than 1% of a corpus, assuming all languages have equal representation in a given corpus.

While Ronnqvist et al. (2019) remain critical of the quality of mBERT, others report mBERT “doing well” when trying tasks using zero-shot cross-lingual transfers (Pires et al., 2019; Wu and Dredze, 2019). The concept of a zero-shot cross-lingual transfer concerns training a model in a source language (most often a high-resource language), and then transferring it to a target language (Wu and Dredze, 2019). Hence, zero-shot transfers test a model’s ability to generalise across languages, as this approach specifically trains models on data from one language and evaluates them in another (Pires et al., 2019). Wu and Dredze (2019) execute five different tasks to test the quality of mBERT in a zero-shot setting. The tasks include: Document classification, natural language inference, named entity recognition, part-of-speech tagging, and dependency parsing. In their results Wu and Dredze (2019) report that mBERT “[...] effectively learns a good multilingual representation” in the aforementioned tasks. Pires et al. (2019) also report encouraging results using mBERT for zero-shot transfer, and state that the model is robust in its cross-lingual generalisations “[...] without being explicitly trained for it.”. In conclusion, the results from Wu and Dredze (2019) and Pires et al. (2019) show potential in implementing a zero-shot cross-lingual transfer, whereas Ronnqvist et al. (2019) show limitations in the capabilities of the mBERT model.

2.5. Statistical Cues in Data Sets

BERT has achieved state-of-the-art results in a number of NLP tasks and even exceeded human baselines in some (Devlin et al., 2018). Yet, when Niven and Kao (2019) use BERT in the Argument Reasoning Comprehension Task (ARCT) (Habernal et al., 2018) and find near-human performance, they question the legitimacy of their results, as the task requires additional world knowledge that the model was not provided with. Niven and Kao (2019) examine what BERT had actually learned about the task and find that “BERT’s surprising performance can be entirely accounted for in terms of exploiting spurious statistical cues” (Niven and Kao, 2019). Correspondingly, Schuster et al. (2020), who use BERT to identify cues in FEVER, find that they are able to achieve high classifier accuracy, even when completely disregarding evidence.

5. https://github.com/google-research/bert

The effect that Niven and Kao (2019) and Schuster et al. (2020) demonstrate has also been dubbed the Clever Hans Effect (Heinzerling, 2019) and occurs when models learn not from the intended data, but from unintentional signals hidden within. To counter this effect, Heinzerling (2019) proposes performing data set ablations, in which data sets are intentionally distorted to assess the impact on model performance. We see another attempt to limit the presence of unintentional cues in FEVER, where annotators are asked to refrain from creating simple negations, e.g. by using “not” (Thorne et al., 2018).

Niven and Kao (2019) present an analytical method to determine the presence of statistical cues in a data set. They consider uni- and bigrams (i.e. one- and two-word phrases) in the ARCT data and calculate a productivity and coverage value for each. They define the productivity of a cue as “the proportion of applicable data points for which [the cue] predicts the correct answer” and a cue’s coverage as “the proportion of applicable cases over the total number of data points” (Niven and Kao, 2019). Based on these values they identify “not” as a strong cue that predicts the right answer 61% of the time (productivity), while occurring in 61% of all data points (coverage). Niven and Kao (2019) also demonstrate that BERT’s performance on ARCT falls significantly when strong cues are eliminated from the data.
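To make the two measures concrete, the sketch below computes unigram productivity and coverage over a set of labelled claims. It is a simplified adaptation of Niven and Kao's (2019) formulation (single labelled sequences rather than ARCT warrant pairs), and the function and variable names are ours, not theirs.

```python
from collections import Counter, defaultdict

def cue_statistics(claims, labels):
    """Simplified sketch of a cue analysis in the spirit of Niven and Kao (2019).

    productivity: among the data points containing a cue, the proportion whose
                  gold label equals the cue's most frequent co-occurring label.
    coverage:     the proportion of all data points that contain the cue.
    """
    n = len(claims)
    label_counts = defaultdict(Counter)   # cue -> Counter over gold labels
    occurrences = Counter()               # cue -> number of claims containing it

    for claim, label in zip(claims, labels):
        for cue in set(claim.lower().split()):
            occurrences[cue] += 1
            label_counts[cue][label] += 1

    stats = {}
    for cue, count in occurrences.items():
        majority_label, hits = label_counts[cue].most_common(1)[0]
        stats[cue] = {
            "majority_label": majority_label,
            "productivity": hits / count,
            "coverage": count / n,
        }
    return stats

# Example: the cue "ikke" (Danish "not") might show high productivity for
# Refuted if annotators often mutate claims by simple negation.
```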

Schuster et al. (2020) and Niven and Kao (2019) demonstrate the need for awareness of potential statistical cues for those creating as well as those consuming data sets, and provide suggestions on how to detect them, such as probing for statistical cues and performing data ablations.

3. Data Set Creation

According to the Danish Language Council (Dansk Sprognævn), the absence of large-scale data sets in Danish currently presents one of the greatest challenges to the development of Danish NLP technologies (Kirchmeier et al., 2019). At the same time, and perhaps as a result, the number of models and tools for NLP tasks in Danish still remains relatively limited (Kirkedal et al., 2019). The same is true for the Danish fact verification task that we are targeting. In the absence of meaningful quantities of suitable data, we must at this time consider the creation of a sufficiently-sized data set a prerequisite for the development of a Danish fact verification model.

3.1. Data Collection Workflow for Wikipedia

Our approach to creating a data set of suitable size includes splitting the workflow into smaller manageable tasks (see Figure 3). Between each step we store the current state of the data to best track and manage its life cycle. We decide to use Wikipedia as a single source for claim generation. Hence, we start by parsing the data from Wikipedia to extract only the necessary information. Based on this we generate a claim, which is then transformed into multiple mutations. Later on, we construct dictionaries for each claim. Finally, we annotate claims against evidence, labelling each claim as either Supported, Refuted or NotEnoughInfo. We present each step in the workflow followed by examples of the claim throughout the process.

Figure 3. Representation over the data creation workflow.

3.1.1. Parsing. The first step is to extract entities and their respective abstracts from a Wikipedia page. We do so by parsing the latest Danish Wikipedia dump6 as of 13 February 2020. We are able to reliably extract the article text and hyperlinks using WikiExtractor7. These are fed into our parsing tool where for each entity we extract the abstract and all hyperlinks within. The resulting abstracts occasionally contain encoding errors that we fix using ftfy (Speer, 2019). While this correctly encodes variants of Latin characters such as ø, å, æ, etc., it breaks text in non-Latin alphabets. Since we are primarily focused on Danish, we consider this trade-off permissible.
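As a small illustration of the encoding repair step, ftfy exposes a single fix_text function; the example string below is a hypothetical mojibake of the kind we encounter, not taken from our data.

```python
# Sketch: repairing a mis-encoded abstract with ftfy (Speer, 2019).
import ftfy

broken = "Flemming Quist MÃ¸ller er en dansk musiker."  # hypothetical mojibake
fixed = ftfy.fix_text(broken)
print(fixed)  # -> "Flemming Quist Møller er en dansk musiker."
```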

For example, the Wikipedia article for the Danish movie It don’t mean a thing8 is reduced to the data shown in Figure 4. It is worth noting that for the purpose of illustration we show a very compact example here. Abstracts can stretch over many sentences and include at times over a dozen linked entities.

Entity: [wiki/It_don’t_mean_a_thing]
Abstract: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller efter eget manuskript.
Linked Entities: [animationsfilm, Flemming_Quist_Møller, manuskript]

Figure 4. Parsing example.

3.1.2. Claim Generation. The second step in the data collection workflow is to generate claims. To create one claim, we randomly sample one entity and its abstract from the set of all Danish Wikipedia pages. Based on these, a claim author composes one claim which must be supported by the abstract and regard the entity. Claims are composed exclusively based on knowledge retrieved from the abstract. In Figure 5 we present an example output that is saved to a file in this step.

Claim: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller.
Entity: [wiki/It_don’t_mean_a_thing]
Abstract: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller efter eget manuskript.
Linked Entities: [animationsfilm, Flemming_Quist_Møller, manuskript]

Figure 5. Claim generation example.

6. https://dumps.wikimedia.org/dawiki/latest/
7. https://github.com/attardi/wikiextractor
8. https://da.wikipedia.org/wiki/It_don%27t_mean_a_thing


3.1.3. Generating Mutations. In the third step of data creation, a claim author must mutate the claims generated in the preceding step. In contrast to the previous claim construction, authors are now allowed to include world knowledge. The claim author composes a set of mutations of the original claim, without considerations towards its ability to be supported by evidence. When composing mutations the claim author is not presented with evidence, but merely the original claim and the entity it concerns. During this step, we apply the rules for mutations presented by Thorne et al. (2018), with the important note that any mutation must still concern the source entity. In Figure 6 we present a mutation of the claim presented in Figure 5 (see page 7).

Claim: It don’t mean a thing er instrueret af en dansk mand, som hedder Flemming.
Entity: [wiki/It_don’t_mean_a_thing]
Abstract: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller efter eget manuskript.
Linked Entities: [animationsfilm, Flemming_Quist_Møller, manuskript]

Figure 6. Mutation example.

3.1.4. Dictionary. Up to this point, the data set of mutated claims contains references to the linked entities, but the entities alone are of little value for annotation. Therefore, this step consists of looking up the linked entities in the parsed data and adding their abstracts as evidence to the mutated claims. The result is shown in Figure 7.

We recognise that the inclusion of a dictionary is necessary for allowing complexity in claims while also limiting the amount of world knowledge introduced during annotation.

Claim: It don’t mean a thing er instrueret af en dansk mand, som hedder Flemming.
Entity: [wiki/It_don’t_mean_a_thing]
Abstract: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller efter eget manuskript.
Dictionary: (Entity: Flemming Quist Møller, Abstract: Flemming Oluf Quist Møller (født 19. maj 1942 i Taarbæk) er en dansk musiker, tegner, filmarbejder og forfatter. Han spiller i musikgruppen Bazaar, sammen med Peter Bastian og Anders Koppel.), (Entity: manuskript, Abstract: Manuskript (latin “manu scriptus”, “skrevet i hånden”) er et dokument; håndskrevet, skrevet på skrivemaskine eller på computer.)

Figure 7. Dictionary example.

3.1.5. Annotation. While annotating the generated claims against the saved evidence, we disallow world knowledge, i.e. knowledge that cannot be found in the presented evidence, from influencing annotation. Common sense, however, is permitted. For example, a Wikipedia article states:

Danish: Rimsø-stenen er en runesten.
English translation: The Rimsø Stone is a rune stone.

A claim could suggest the Rimsø Stone to be a tombstone. An annotator might know that rune stones were in fact used as tombstones, yet without further evidence the claim cannot be supported based on common sense alone. However, if the evidence included the definition of a rune stone, an annotator could label the claim-evidence pair as Supported.

Additionally, we consider an annotator’s general understanding of words and categories as common sense. This includes understanding that some things, such as geographic locations, are mutually exclusive: If a physical location is in Denmark, it cannot also be in Norway.

During annotation, we encounter a peculiar ambiguity in the Danish language. For example:

Danish: David Beckham er en engelsk fodboldspiller.
English translation: David Beckham is an English football player.

Here an annotator might question the meaning of being English. Does being an English football player imply that one played football in England or that one is born in England and is now a football player? This example supports the idea of having the annotator use common sense to label a claim-evidence pair. A claim like this might be read differently from annotator to annotator, hence we ask annotators to consider whether they are convinced by the evidence and to label accordingly.

A similar situation would arise if a claim alleged David Beckham to be an athlete or a chef. We consider it permissible common sense that football players are athletes and would require no further evidence to support that claim. However, we also allow annotators to refute a claim if it is contradicted by the evidence, such as David Beckham being a football player, instead of a chef. While Thorne et al. (2018) rightfully argue that a person can possess multiple jobs, and that having worked as a football player does not exclude one from also working as a chef, we base our decision on the assumption that Wikipedia editors follow the Gricean Maxim of Quantity (Geis et al., 1976) and provide the relevant information in its entirety. We apply the same logic to citizenship. Therefore, we determine that if a Wikipedia article states a person’s profession or citizenship, annotators can assume all relevant professions or citizenships to be mentioned.

With the guidelines above, we annotate all claim-evidence pairs against a dictionary. The annotator must log whether a dictionary entry was used, such that it can be stored with the source evidence for a given claim. In Figure 8 we present the annotated example for Flemming Quist Møller as well as annotations for each other label. The claim suggests that Flemming is a Danish man. This claim could be supported with the use of the dictionary, as the dictionary entry on Flemming Quist Møller explicitly states his citizenship.

Claim: It don’t mean a thing er instrueret af en dansk mand, som hedder Flemming.
Entity: [wiki/It_don’t_mean_a_thing]
Evidence: It don’t mean a thing er en animationsfilm instrueret af Flemming Quist Møller efter eget manuskript.
Entity: [wiki/Flemming_Quist_Møller]
Evidence: Flemming Oluf Quist Møller (født 19. maj 1942 i Taarbæk) er en dansk musiker, tegner, filmarbejder og forfatter. Han spiller i musikgruppen Bazaar, sammen med Peter Bastian og Anders Koppel.
Label: Supported

Claim: Jødisk Informationscenter blev oprettet efter afslutningen på 2. verdenskrig.
Entity: [Jødisk_Informationscenter]
Evidence: Null
Label: Not Enough Info

Claim: Suicide Squad er en superheltefilm instrueret af Margot Robbie.
Entity: [wiki/Suicide_Squad]
Evidence: Suicide Squad er en amerikansk superheltefilm fra 2016. Filmen er skrevet og instrueret af David Ayer, og har Will Smith, Jared Leto, Margot Robbie, Joel Kinnaman, Jai Courtney, Cara Delevingne og Viola Davis i hovedrollerne.
Label: Refuted

Figure 8. Example of claims labelled Supported, NotEnoughInfo and Refuted.

3.1.6. Data Creation Tools. Our data collection strategy is supported by simple command line tools9. We present a screenshot of the annotation tool in Figure 9. We designed this process with the objective of enabling transparency and iterative workflows. The fact that the individual workflows are disjoint enables us to gather data in batches, allowing us to start using the data before it has been collected in full (see Figure 3, page 7). Additionally, we can manually access the data between each step to review quality and assist in debugging.

9. https://github.com/HenriSchulte/Danish-Fact-Verification

Figure 9. Screenshot of Annotation Tool in the Terminal.

3.1.7. Inter-Annotator Agreement. Before annotating all data entries, we perform an inter-annotator agreement study to ensure alignment in annotation methods and a shared understanding of what making reasonable assumptions using common sense entails. We use the Fleiss κ-score to measure agreement (Fleiss, 1971). We each label 100 randomly sampled claims, without conferring with each other. This results in a Fleiss κ-score of 0.75, which signifies a sufficient level of agreement. A κ-score between 0.61-0.80 is considered substantial for an annotation task (Landis and Koch, 1977, p. 165).
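For reference, the agreement score can be reproduced with standard tooling. The sketch below uses the Fleiss' κ implementation in statsmodels and assumes the doubly-annotated claims are arranged with one row per claim and one column per annotator; the example labels are hypothetical.

```python
# Sketch: computing Fleiss' kappa for two annotators with statsmodels.
import numpy as np
from statsmodels.stats import inter_rater as irr

# One row per claim, one column per annotator (hypothetical labels).
annotations = np.array([
    ["Supported", "Supported"],
    ["Refuted",   "Refuted"],
    ["Refuted",   "NotEnoughInfo"],
    # ... 100 doubly-annotated claims in total
])

# aggregate_raters converts raw labels into a subjects-by-categories count table.
table, _categories = irr.aggregate_raters(annotations)
kappa = irr.fleiss_kappa(table)
print(f"Fleiss kappa: {kappa:.2f}")
```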

3.2. General Considerations

We are largely following the data collection procedure outlined by Thorne et al. (2018) to maintain a certain, albeit debatable, degree of comparability. This means that we retain the structure of claim-evidence pairs labelled as either Supported, Refuted or NotEnoughInfo. We also continue using Wikipedia as a primary data source for evidence and claim generation. While there certainly exist downsides to using the Danish Wikipedia, its relatively large size, high accessibility and structured format lend themselves well to this particular task.

In some areas, however, we decide to diverge from the data set creation process of FEVER. Some of these differences stem from the fact that our team of annotators is limited to the two authors while Thorne et al. (2018) were able to use a team of 50 annotators. While this certainly adds limitations on our part, we also argue that it relieves us from putting greater emphasis on designing an annotation process that is scalable and user-friendly. Therefore, we develop simple text-based command line tools instead of using the existing FEVER Annotation Platform10. For the same reason, we are unable to replicate Thorne et al. (2018)’s use of 5-way agreement and super-annotators to strengthen labelling. Regardless, like FEVER we use the Fleiss κ-score to measure inter-annotator agreement (see Section 3.1.7).

10. https://github.com/awslabs/fever/tree/master/fever-annotations-platform

Additionally, there are minor differences in our annotation strategy. While annotators in FEVER’s claim labelling task were allowed to add additional evidence from Wikipedia (Thorne et al., 2018), our annotation is limited to the evidence in the entity’s abstract and the abstracts of hyperlinked entities within. Further, in our system, annotators specify which abstracts are contributing to their decision, not which specific sentences do. These simplifications allow us to move forward with the task despite the small number of annotators.

That being said, we do retain other ideas of Thorne et al. (2018) by basing our data set on purely synthetic data and generally disregarding temporality. However, we do acknowledge that the manufactured, synthetic nature of our data might introduce latent signals resulting in a potentially biased model. For this reason we analyse the data set for potential bias in Section 5.4.3. In addition, our annotation strategy is also time-agnostic. This means that claims that are supported at some point are considered supported, even if they are no longer accurate at the current time. For example, the claim “Lars Løkke Rasmussen is the Prime Minister of Denmark.” can be supported against the evidence “Lars Løkke Rasmussen was Prime Minister of Denmark from 2015-2019.” even though this is no longer the case today.

As we are basing our data on the Danish Wikipedia, we need to be aware of areas in which it differs from its English counterpart, such as size, quality and composition. After parsing the Danish Wikipedia, we are left with 256,048 articles, many of which are not much longer than a few sentences. This is exacerbated by the fact that some articles are not actually authored by humans but automatically generated from external databases. For example, the abstract shown in Figure 4 (see page 7) was written by a bot, as indicated by a banner at the top of the page as shown in Figure 10 11. We decide not to exclude these cases, as the presence of machine-generated text in the data mirrors the current reality (SIRI-Kommissionen, 2019). We also note a lack of diversity in the Danish Wikipedia. A few categories alone make up a considerable part, specifically sports, movies and administrative divisions.

Figure 10. Banner on a Wikipedia page indicating that the article has been authored by a bot.

11. https://da.wikipedia.org/wiki/It_don%27t_mean_a_thing

3.3. Using FEVER Data

As a result of following a similar generation process, our data set bears some similarity to FEVER. Consequently, we decided to explore the option of machine-translating English FEVER data (Thorne et al., 2018) into Danish in addition to generating data from the Danish Wikipedia. For this purpose we use the Microsoft Translator Text API12. Their machine translation system has been shown to achieve human parity in some Chinese to English news translation tasks (Hassan et al., 2018). While Danish is a rather low-resource language when compared to Chinese, we consider the quality of English-to-Danish translations to be sufficient, even if not flawless. At times the system still fails at recognising named entities. In the example shown in Figure 11 the name of actress Claire Danes is translated as Claire danskere, as her last name is a homograph of the English name for the people of Denmark and is translated as such.

FEVER original: The series stars Claire Danes as Carrie Mathison, a Central Intelligence Agency officer with bipolar disorder.
Danish translation: Serien stjerner Claire danskere som Carrie Mathison, en Central Intelligence Agency officer med bipolar lidelse.

Figure 11. Example of a poor translation.

Due to the uncertain data quality, we have elected not to include translations of English FEVER claims and evidence in our own data set but instead perform separate experiments.

3.4. Data Set Statistics

During our annotation process, we have generated a total of 3,395 labelled claims with corresponding evidence from the Danish Wikipedia. Most claims that are verifiable (labelled either Supported or Refuted) can be verified based on their source entities’ articles alone; however, 19.29% of verifiable claims make use of one or more dictionary entries, i.e. additional evidence from linked articles. We have divided these claims into a training, development and test set, as seen in Table 1. The data set is available on GitHub13.

TABLE 1. CLASS DISTRIBUTION OF DATA SETS.

Data set Refuted Supported NEI

train 839 1186 350

dev 119 171 52

test 245 333 100

12. https://www.microsoft.com/en-us/translator/business/translator-api/
13. https://github.com/HenriSchulte/Danish-Fact-Verification


Thorne et al. (2018) ensure that each source entity only appears in one of the three splits, presumably to guarantee that the development and test splits include completely unseen data. We have also adopted this standard for our data set.

Additionally, we would have preferred a balanced development and test data set. The small size of our NotEnoughInfo class, however, makes such an approach unsuitable in our case. Instead, we explore methods for compensating for the imbalanced training data in Section 5.1.1.

4. Fact Verification: Technical Approach

In addition to our Danish claim verification data set, we also present a claim verification system that is trained and tested on this data. This system consists of two major parts, the evidence retrieval system and the classifier, which can be executed in a pipeline configuration14.

4.1. Evidence Retrieval System

Finding relevant evidence is an essential element of fact-checking a claim, as human fact-checkers require evidence on which to base their judgement. Thus, we apply the same principle to non-human fact-checkers as well. We develop an evidence retrieval system that automates this process by using Apache Lucene15 to query the parsed Wikipedia dump for relevant evidence. We do so in two consecutive steps, first selecting article abstracts and from those selecting relevant sentences as evidence.

To begin, we construct a search index over the article abstracts in the parsed Wikipedia dump that we also use for data generation. Each abstract is considered a separate document for the purpose of indexing. We restrict ourselves to only using the abstracts of articles to ensure consistency between the original evidence and the retrieved one. This way, retrieved evidence originates from the same superset of sentences as the original evidence does.

After building the index, we prepare the search query. We initially query the abstracts with the entire claim; however, we find that this returns many results that contain similar sentences, but are otherwise entirely unrelated. For example, the claim “Yosemite National Park is located in northern Kentucky” might result in the very similar, but entirely unrelated evidence “Yellowstone National Park is located in northern Wyoming”. Some contestants in the FEVER Shared Task extract noun phrases from the claim and use those to select relevant documents (Thorne et al., 2019; Hanselowski et al., 2019). We imitate this approach, albeit in the most simple fashion, by extracting noun phrases from the claim using Apache OpenNLP16 and using only those to query the abstracts. We find this to produce more reliable results than querying with the entire claim, as it removes noise introduced by words that are not central to the claim’s meaning.

14. https://github.com/HenriSchulte/Danish-Fact-Verification
15. https://lucene.apache.org/

From these results we continue with a certain number of abstracts, k, that is subject to later tuning. At this point we have retrieved the abstracts of those articles that contain the noun phrases in the claim. The next step is to select the relevant sentences from these abstracts. For this purpose we use OpenNLP to separate the abstracts into individual sentences and build a search index over all sentences. We find that OpenNLP’s Danish Sentence Detector Model17 is able to reliably split evidence sentences in our data.

Instead of indexing over articles, the second search index now contains all sentences from the k most relevant abstracts. We query this index with the entire claim to make sure that we retrieve the most similar sentences. The first l sentences are then added as evidence to the claim. As with the number of abstracts, k, the number of sentences, l, is determined through experiments in Section 5.2.
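In outline, the two-stage retrieval can be sketched as follows. The token-overlap scorer and the regular-expression sentence split below are simplified stand-ins for the Lucene queries and the OpenNLP noun-phrase and sentence models we actually use; only the overall k-abstracts-then-l-sentences logic reflects our system.

```python
# Sketch of the two-stage retrieval with naive stand-ins for Lucene/OpenNLP.
import re

def score(query, text):
    """Toy relevance score: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_evidence(claim, abstracts, k=5, l=5):
    # Stage 1: rank abstracts against the claim (in the real system, a Lucene
    # query built from the claim's noun phrases) and keep the top k.
    top_abstracts = sorted(abstracts, key=lambda a: score(claim, a), reverse=True)[:k]

    # Stage 2: split the k abstracts into sentences, rank the sentences against
    # the full claim and keep the top l as retrieved evidence.
    sentences = [s.strip() for a in top_abstracts
                 for s in re.split(r"(?<=[.!?])\s+", a) if s.strip()]
    return sorted(sentences, key=lambda s: score(claim, s), reverse=True)[:l]
```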

It is worth noting that while we use the same source as in our previous data generation process, we do not aim for parity in evidence retrieval. During data generation, entire abstracts are added as evidence, regardless of whether all sentences are required to annotate the claim. Here, we instead filter out sentences to avoid including irrelevant ones in the retrieved evidence.

4.2. Fact Verification Classifier

Based on mBERT (Devlin et al., 2018), our classifier makes use of an extensively pre-trained, multilingual model that also supports Danish. To then adapt the pre-trained model to our claim verification task, we add a step of domain-specific training, so-called fine-tuning.

4.2.1. Filling Evidence for NEI Claims. Our annotation process does not produce any evidence for claims that are labeled NotEnoughInfo, as they are not verifiable against any evidence. Using these claims directly for fine-tuning the model would not allow the model to learn how to actually classify non-verifiable claims in a general sense. Thorne et al. (2018) address this issue by either sampling a sentence from the nearest page or randomly from Wikipedia. As our evidence is structured slightly differently, we elect to add the entire abstract of the claim’s source article as evidence. This method results in evidence that is similar to that of verifiable claims, minimising the chance that its label is given away by anything but the meaning of the evidence.

16. https://opennlp.apache.org/
17. http://opennlp.sourceforge.net/models-1.5/

4.2.2. Balancing Classes. Before using our training data set on the model, we take a number of preparatory and pre-processing steps, beginning with class balancing. We add this step to ensure that the model is equally trained on all three classes, even though these classes appear in the training data with greatly varying frequency. We implement two separate balancing strategies: Oversampling and class weights. When oversampling, some entries from the smaller classes, in our case NotEnoughInfo and Refuted, are duplicated to produce an evenly balanced data set. This transforms the size of our training data set from 2,375 entries to 3,558 with all three classes being represented evenly.

The other strategy is to use class weights to increase the misclassification cost for the model when it fails to correctly predict the label for an entry of a smaller class. We do this by calculating weights based on the frequency of each class in the training data (before oversampling) and use those weights to amplify the loss when misclassifying underrepresented instances. Both approaches are established strategies for dealing with imbalanced data sets within machine learning generally (Chawla, 2010; Ling and Sheng, 2008). We investigate their specific effects on our classifier in Section 5.1.1.
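The sketch below illustrates both strategies under assumptions: a simple random-oversampling helper and inverse-frequency class weights in the style of scikit-learn's "balanced" heuristic. The helper names are hypothetical and the exact weight formula we use may differ.

```python
from collections import Counter
import random

def oversample(entries, labels):
    """Duplicate random entries of the smaller classes until all classes
    are as large as the majority class."""
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {c: [e for e, l in zip(entries, labels) if l == c] for c in counts}
    balanced = []
    for c, items in by_class.items():
        extra = [random.choice(items) for _ in range(target - len(items))]
        balanced.extend((e, c) for e in items + extra)
    random.shuffle(balanced)
    return balanced

def class_weights(labels):
    """Weights inversely proportional to class frequency, computed on the
    training labels before oversampling."""
    counts = Counter(labels)
    n, m = len(labels), len(counts)
    return {c: n / (m * counts[c]) for c in counts}
```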

4.2.3. Generating Model Inputs. The inference model at the core of our classifier does not process the claim and evidence in their raw, textual form. Therefore, we first transform them into numerical inputs. Specifically, we base our classifier on BertForSequenceClassification from Huggingface Transformers (Wolf et al., 2019), which provides a pre-trained model that takes input_ids, labels, attention_masks and token_type_ids as inputs.

As the pre-trained model we choose bert-base-multilingual-cased, as we hope to benefit from both its multilingual and its cased attributes. Most important to our task, this model is pre-trained to be multilingual on 104 languages including Danish18, which absolves us from having to fine-tune BERT for a new language, rather than just a new task. Further, we choose the cased model, as we require the capability of distinguishing between upper- and lowercase words. This is needed as named entities in Danish are generally indicated through capitalisation. For example, Kamille is a woman's name, whereas kamille is the Danish word for chamomile.

The labels in the training data can be transformed into numerical representations most easily by mapping each label to a value. We select 0 for Refuted, 1 for Supported and 2 for NotEnoughInfo.

To generate input_ids we concatenate a claim and its evidence and tokenise them with the pre-trained bert-base-multilingual-cased model. Instead of splitting the input into words, BERT's tokenizer divides the input into wordpieces (Wu et al., 2016), fragments of words which are then translated to numerical representations. Additionally, BERT makes use of special classification and separator tokens (Devlin et al., 2018). We make sure to add the classification token [CLS] to the beginning

18. https://github.com/google-research/bert/blob/master/multilingual.md

of the input and separate claim and evidence by [SEP]. Then, the resulting tensors are padded (or truncated) to the maximum input length to ensure that all inputs are of equal size.

As we are padding the input, we need to make certain that the model does not consider the padding in its prediction. We do so by providing attention_masks, which identify padding in the input_ids by having a value of 0. The value 1 is added in positions where the input is relevant.

By providing token_type_ids we can indicate to the model that the input is comprised of two distinct sequences, in our case claim and evidence. We therefore generate for each claim a tensor of identical size to the input_ids, where tokens belonging to the claim are assigned the value 0 and tokens belonging to the evidence are assigned the value 1.
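The following sketch shows, under assumptions, how such inputs can be assembled with the Huggingface tokenizer; the helper name encode_pair and the exact truncation handling are illustrative rather than our actual implementation.

```python
import torch
from transformers import BertTokenizer

LABEL_MAP = {"Refuted": 0, "Supported": 1, "NotEnoughInfo": 2}
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def encode_pair(claim, evidence, max_len=125):
    # [CLS] claim [SEP] evidence [SEP], truncated to max_len wordpieces.
    claim_tokens = ["[CLS]"] + tokenizer.tokenize(claim) + ["[SEP]"]
    evidence_tokens = tokenizer.tokenize(evidence) + ["[SEP]"]
    tokens = (claim_tokens + evidence_tokens)[:max_len]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # token_type_ids: 0 for claim tokens, 1 for evidence tokens.
    token_type_ids = ([0] * len(claim_tokens) + [1] * len(evidence_tokens))[:max_len]
    # attention_mask: 1 for real tokens, 0 for padding added below.
    attention_mask = [1] * len(input_ids)

    padding = max_len - len(input_ids)
    input_ids += [0] * padding
    token_type_ids += [0] * padding
    attention_mask += [0] * padding

    return (torch.tensor(input_ids),
            torch.tensor(attention_mask),
            torch.tensor(token_type_ids))
```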

4.2.4. Fine-Tuning the Model. Our starting point for fine-tuning is the set of pre-trained weights from the bert-base-multilingual-cased model with its default configuration. We prioritise this over the uncased version as it retains capitalisation, which we hope might help the model differentiate between named entities, which are usually capitalised in Danish, and other noun phrases, which are not. Then, one batch at a time, the model predicts labels for the inputs in the current batch. False predictions result in higher loss, which is modified with the aforementioned class weights prior to back-propagation.
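A minimal sketch of one such fine-tuning step is shown below, assuming the inputs produced by the encoding sketch above; the optimizer choice and the class-weight values are illustrative assumptions, and the weighted loss is computed manually from the model's logits.

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Higher weights for the less frequent classes; the values are hypothetical.
class_weights = torch.tensor([1.4, 1.0, 3.3])
loss_fn = CrossEntropyLoss(weight=class_weights)

def training_step(input_ids, attention_mask, token_type_ids, labels):
    model.train()
    optimizer.zero_grad()
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids)
    logits = outputs[0]             # loss is computed manually to apply weights
    loss = loss_fn(logits, labels)  # higher cost for misclassified minority claims
    loss.backward()
    optimizer.step()
    return loss.item()
```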

After fine-tuning is complete, the model is saved in a binary format, which allows us to load it again at a later point without repeating the fine-tuning process. This step is crucial for the creation of the fact verification pipeline.

4.3. Fact Verification Pipeline

So far, we have only described the use of our classifier for fine-tuning a pre-trained model. However, we reconfigure our classifier to also be used for evaluation. In this case, we have the classifier load a model that we previously fine-tuned and evaluate its predictions on unseen data, such as our test data set.

As claims in the test data set had their evidence removed, we need to find relevant evidence before reading the data into the classifier. Therefore, we configure both the classifier and our evidence retrieval system to work in a pipeline (see representation in Figure 12, page 13). When executed, any provided set of claims will first be supplemented with evidence by our evidence retrieval system and then have its labels predicted by the classifier.

We confirm the integrity of the pipeline by ensuring that its results are identical to those of running evidence retrieval and classifier individually.


Figure 12. Representation of the fact verification pipeline.

5. Analysis and Results

We perform a number of experiments on our data, our fact verification pipeline and its individual components to optimise performance and assess data quality. To begin with, we optimise the classifier and evidence retrieval components by tweaking a range of parameters, such as learning rate, fine-tuning epochs and number of evidence sentences to retrieve. We then test the best model on our test set and report encouraging performance. Lastly, we probe our data for statistical cues and perform data ablations to assess the validity of our results.

We evaluate model performance on multiple parameters and metrics. However, we primarily measure our fact verification system using f1 scores. According to Sasaki (2007), the unweighted f1 score is determined by the harmonic mean of precision and recall for each label, where:

f_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

We use micro and macro f1 to evaluate the multi-class performance. The micro f1 can, in short, represent the overall label accuracy (i.e. the proportion of correctly labelled claims out of all claims) of the classifier, whereas the macro f1 score represents the arithmetic mean over each label's f1 score19.
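For illustration, both scores can be computed with scikit-learn (see footnote 19); the label vectors in this small example are made up.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 2, 0, 1]   # 0 = Refuted, 1 = Supported, 2 = NotEnoughInfo
y_pred = [0, 1, 2, 2, 1, 1]

micro = f1_score(y_true, y_pred, average="micro")  # equals label accuracy here
macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class f1
print(micro, macro)
```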

5.1. Model Fine-Tuning

We fine-tune the pre-trained mBERT model bert-base-multilingual-cased with various tweaks to its hyperparameters and data processing, and evaluate its performance on the development data set. We adapt the configuration of our model where such adaptations lead to an increase in the classifier's performance.

5.1.1. Balancing Classes in Training Data. Our data generation process has resulted in an imbalanced training data set (see Table 1, page 10), where the majority class is nearly three times the size of the minority class. To assess the effect that this imbalance has on fine-tuning the model, we employ two balancing strategies: sampling and class weights.

19. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

For sampling, we have the possibility of over- or undersampling the data. Where oversampling has the disadvantage that it may lead to overfitting, undersampling may discard important data (Chawla, 2010). As our training data set, and especially its minority class, is already limited in size, we have elected to disregard undersampling in this set of experiments.

Alternatively, class weights can be used to assign a higher misclassification cost to members of the minority class during entropy calculation (Ling and Sheng, 2008). The weight of a class can be derived from its frequency or determined through experimentation (Tayyar Madabushi et al., 2019). As a starting point, we proportionally assign higher weights to the two less frequent classes, but recognise that this could be a potential point for further tweaking.

To test the effect of implementing these balancing strategies, we fine-tune the model with each of them individually, with both, and with none of them enabled, and compare the classifier's performance. The results are shown in Figure 13. Without any balancing,

Figure 13. f1 scores for different balancing strategies.


Figure 14. Precision and recall by class for different balancing strategies.

the classifier achieves a macro f1 of 51.7%. With class weights or oversampling enabled, f1 rises to 56.5% or 64.4% respectively. With both balancing strategies enabled at the same time, however, f1 reaches 68.2%, outperforming all previous attempts. While we achieve the best performance with both balancing strategies in combination, oversampling seems to contribute the greater part of this benefit.

A closer look at each class's precision and recall reveals further insights into how these balancing strategies affect the classifier's predictions (see Figure 14). We find that the smaller classes, NotEnoughInfo and Refuted, which only make up 15% and 35% of the training data set respectively, are affected to a greater degree by the balancing strategies. Without any balancing, the classifier does not predict any claims as NotEnoughInfo, mostly misclassifying them as Refuted instead. When class weights are used, NotEnoughInfo claims receive a high weight due to the class's low frequency. The classifier then reacts by falsely predicting many Refuted claims as NotEnoughInfo. Enabling oversampling, with or without additional class weights, produces more balanced results for all classes. While we can clearly see the impact of our balancing strategies on the smaller classes, the inverse is true for the majority class, Supported. As expected, its precision and recall track the overall performance of the classifier, indicating that the majority class is not greatly affected by the balancing strategies employed.

Based on these results we conclude that class weights and oversampling can both provide performance improvements, albeit to varying degrees. We proceed with both strategies in combination, as we find this to improve our classifier's macro f1 by 16.5 percentage points.

5.1.2. Testing Learning Rates. To best determine an optimal learning rate for our model, we measure loss and f1 for several values. In Figure 15, we show the train loss over five learning rates from 1e-6 to 1e-4. The figure shows that the lowest and highest rates never reach a train loss close to 0.0, and we generally find a higher variance of loss, with occasional spikes for 1e-4 especially. We therefore focus on the learning rates 1e-5, 2e-5, and 3e-5, all of which exhibit minimal loss, with 1e-5 showing the least variance.

Figure 15. Train loss for different learning rate values.

Consequently, we measure f1 on our development set for the learning rates 1e-5, 2e-5, and 3e-5 (see Figure 16, page 15). Here we see that 2e-5 has slightly lower values than its neighbours, with a micro f1 of 64.5% and a macro f1 of 56.4%. However, the difference is more subtle between 1e-5 (macro f1: 65.7%) and 3e-5 (macro f1: 66.5%). While train loss and f1 are nearly identical between 1e-5 and 3e-5, we have observed cases in which performance drops dramatically for 3e-5 when other parameters are changed. Therefore, we consider 1e-5 to be the preferred learning rate at this point.

5.1.3. Fine-Tuning Epochs. The number of fine-tuning epochs determines the duration of our model's domain-specific training.


Figure 16. f1 scores for different learning rates.

We alter this parameter to detect whether we can achieve better performance with additional fine-tuning.

Our starting point is to fine-tune the model for three epochs. We initially deemed this to be sufficient, as training loss does not seem to decrease much further after the second epoch. This is apparent in Figure 17 (see page 15), where out of the three training epochs of 295 batches each, only the first two show considerable reductions in loss.

Figure 17. Loss by batch when fine-tuning for three epochs.

Ultimately, we set out to validate this approach through direct comparison of classifier performance for models that were fine-tuned for varying numbers of epochs. We test on our fact verification pipeline with our development set, therefore challenging the models to classify claims against retrieved evidence. In addition to the current model with 3 epochs, we also fine-tune models for 10, 20, 30 and 100 epochs. Fine-tuning the model for 100 epochs can be done in about four hours on an Azure NC12 virtual machine with a Tesla K80 GPU.

The model fine-tuned for 20 epochs achieves the highest f1 in these tests, with a micro f1 of 55.0% and a macro f1 of 50.0% (see Figure 18). We find that performance begins to degrade with increased fine-tuning duration beyond 20 epochs, as the model trained for 100 epochs achieves nearly identical f1 to the model trained for only 3 epochs. We hypothesise that due to the relatively small size of our training set, the model easily overfits to the training data and therefore performs worse on unseen data. This hypothesis is supported by earlier research suggesting a relationship between more epochs and lower test performance due to overfitting (Rosin and Fierens, 1995).

Expecting extended fine-tuning to negatively impact performance, we are surprised to see that our worst-performing model is fine-tuned for 30 epochs rather than the maximum 100. The poor performance can be reproduced by repeating the test with a new model that is fine-tuned for 30 epochs. At this point, we can offer no explanation for this phenomenon, and continue with 20 fine-tuning epochs instead.

Figure 18. f1 by number of fine-tuning epochs on the development set in the pipeline.

5.1.4. Sequence Length. Depending on available memory and the remaining parameters, BERT can support a sequence length of up to 512 tokens. Accordingly, we experiment with sequence lengths between 5 and 250. In Figure 19 (see page 16) we present the results of running the classifier on our development set with original evidence. The figure shows the optimal sequence length to be 125, with a macro f1 of 69.2%, whereas a longer input of 250 gives a moderately lower macro f1 of 64.1%. As the sequence length is reduced below 125, performance decreases as well, with a minimum macro f1 of 36.8% at a sequence length of 5.

We question why a length of 250 performs slightly worse than a length of 125. Presumably a larger sequence length should simply provide more evidence. However, we hypothesise that more evidence might not necessarily be beneficial, as much of the important information used to support or refute claims may appear in the beginning of the evidence. Another point in this matter may be the general length of Danish Wikipedia abstracts (see for example Figure 4, page 7). Many abstracts are rather short and might not exceed a sequence length of 125, even when combined with a claim.


Figure 19. f1 for varying sequence lengths on the development set with original evidence.

We gain no benefit from increasing the sequence length, while minimising it is beneficial to memory usage and execution time. For these reasons we continue with a sequence length of 125.

5.1.5. Increasing Batch Size. When using BERT, batch size and sequence length are two of the factors that contribute most to memory usage. As we determine our model to perform best with a rather short sequence length of 125 (see Section 5.1.4), we are able to increase the batch size and test its effect on performance. For a similar sequence length of 128, the authors of BERT recommend a batch size of 32 (see footnote 20). We test the pipeline performance on our development set with a model fine-tuned for 20 epochs and batch sizes of 8 and 32. Indeed, we see small improvements with 32 claims per batch. Where a batch size of 8 results in a micro f1 of 55.0% and a macro f1 of 50.0%, quadrupling the batch size increases performance to 56.1% and 51.5% respectively.

5.1.6. Alternative Training Data. As an alternative to fine-tuning with our own Danish training data, we also consider two additional approaches. First, we use FEVER data (Thorne et al., 2018) that we machine-translate to Danish (see Section 3.3) for fine-tuning. Secondly, we perform a zero-shot cross-lingual transfer in which we fine-tune the model with the original English FEVER data before testing it on our Danish data. For each approach we sample around 1000 claims, with their compositions being shown in Table 2. To ensure comparability, all models are fine-tuned with identical parameters over three epochs and then tested in the pipeline.

As Figure 20 shows, the performance of both approaches falls short of fine-tuning on our Danish data, although the gap is smaller when performing zero-shot cross-lingual transfer. Where fine-tuning on our Danish data achieves a macro f1 of 46.7%, zero-shot transfer with the English FEVER data achieves 34.8% and fine-tuning on the Danish translation of the FEVER data achieves 26.0%.

20. https://github.com/google-research/bert#out-of-memory-issues

TABLE 2. COMPOSITION OF FEVER SAMPLES.

Data set             Refuted  Supported  NEI
Danish Translation   271      711        35
English Sample       272      715        35

In the case of fine-tuning with the translated data, the model fails to learn entirely, predicting nearly all claims as Supported. Based on these results we continue fine-tuning with our Danish data only.

5.2. Evidence Retrieval

We evaluate our evidence retrieval system in two distinct steps. First, we consider the evidence retrieval in isolation, measured by comparing the sentences it retrieves for each claim against the original evidence. Then, we evaluate the system as part of our fact verification pipeline, where we measure the classifier's ability to predict labels for the claims with their retrieved evidence.

5.2.1. Evidence Retrieval in Isolation. Like the FEVER baseline (Thorne et al., 2018) and many submissions to the FEVER task (Thorne et al., 2019), our evidence retrieval also consists of two steps: document selection and sentence selection. We evaluate them as one, as both subtasks follow an identical structure, only with separate variables. Where k describes the number of article abstracts to retrieve during document retrieval, l is the number of sentences to select from those abstracts. We determine candidates for the optimal values of these variables as follows: First, we remove the original evidence from the data set. Then, we let the evidence retrieval system populate the data with evidence it has retrieved from Wikipedia. Finally, we compare the retrieved evidence against the original corpus and compute precision and recall for sentences in the retrieved evidence.
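A small sketch of this comparison is given below, assuming retrieved and original evidence are available as sets of sentences per claim; the function is illustrative rather than our exact evaluation code.

```python
def retrieval_scores(retrieved, original):
    """retrieved, original: lists of sentence sets, one per claim."""
    tp = fp = fn = 0
    for ret, orig in zip(retrieved, original):
        tp += len(ret & orig)   # retrieved sentences found in the original evidence
        fp += len(ret - orig)   # retrieved sentences that are not
        fn += len(orig - ret)   # original evidence sentences that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```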

Figure 20. f1 for different data sets.


In the FEVER baseline, Thorne et al. (2018) use k = 5 and l = 5 for their evidence retrieval. When assigning identical values to k and l, we find a clear correlation between the number of documents selected, k, and precision, as well as between the number of sentences, l, and recall (see Figure 21). At least the latter observation makes intuitive sense: as the evidence, to which we are comparing the retrieved sentences, is comprised of full-length abstracts of relevant articles, selecting more sentences increases the chance of finding the correct ones. Simply put, higher recall is achieved by increasing l, although we see only limited improvements in recall for l > 10.

The inverse is true for the correlation between precision and k. Higher precision is achieved by a low k. In fact, the value of k seems to have a much greater effect on precision than l has on recall. Note that while k describes the number of article abstracts selected, precision is calculated based on the sentences selected from those articles, not whether the articles themselves were identified correctly. By constraining the evidence retrieval system to only selecting one or a few articles, we can ensure that the l sentences picked from those articles are likely to be found in the original evidence, provided the system succeeds in selecting relevant articles.

Figure 21. Precision, recall and f1 for different values of l and k.

To balance precision and recall, we also compute macro f1 for each result. As shown in Figure 21, f1 peaks at 29.0% for 3 ≤ l, k ≤ 5. While this beats the FEVER baseline at 18.3%, it does not compare well to other FEVER submissions, which reached up to 64.9% f1 (Thorne et al., 2019).

We therefore consider other scenarios, in which k and l are not necessarily equal to each other. These are listed in Table 3 in descending order of the f1 they achieve. The five best results all have l ≥ 8. For the sake of comparison we also include the earlier result for k, l = 10, which is considerably worse. It therefore seems that performance can be improved significantly by decreasing k. We see the highest f1, 37.1%, for k = 3, possibly since this allows our evidence retrieval system to select not just sentences from a claim's source entity, but also from other entities that appear in its dictionary.

TABLE 3. EVIDENCE RETRIEVAL CONFIGURATIONS WITH HIGHEST f1.

k    l    Precision  Recall  f1
3    10   0.418      0.333   0.371
3    8    0.416      0.325   0.365
2    10   0.504      0.285   0.364
4    10   0.344      0.342   0.343
1    10   0.709      0.212   0.327
...
10   10   0.195      0.288   0.233

5.2.2. Evidence Retrieval in the Pipeline. By evaluating our evidence retrieval system in isolation we establish optimal values for the number of articles, k, and the number of sentences, l, that are being retrieved. We note, however, that due to the inclusion of entire abstracts, not all original evidence sentences can be found by searching with the claim, as they might not bear any resemblance to it. While this makes it harder to achieve high precision and recall during evidence retrieval, it might also mean that the evidence that was originally attached to the claim during annotation might not actually be optimal to classify the claim against. For this reason, we also need to consider the classifier performance in the fact verification pipeline. This way, rather than measuring how close the evidence retrieval system can get to reproducing the original evidence, we test which configuration of the evidence retrieval system results in the most accurate predictions down the line.

TABLE 4. SELECTION OF CONFIGURATIONS IN THE PIPELINE.

k   l    Micro f1  Macro f1
1   4    0.596     0.546
1   10   0.596     0.546
2   2    0.576     0.518
3   10   0.564     0.469
3   3    0.509     0.437

We report the performance in terms of f1 as calculated on the classifier's predictions for claims with the evidence retrieved by the evidence retrieval system. As expected, the configuration that results in the highest f1 during evidence retrieval (k = 3, l = 10) does not excel in the pipeline. In fact, all configurations with k < 3 that we test perform better (see Table 4). This indicates that in this case, performance is highly dependent on k, i.e. the number of articles from which evidence sentences are selected.


We hypothesise that this is due to the fact that in the classifier we trim sequences to 125 tokens, which means that evidence exceeding this length is not considered in the prediction. This is supported by the first two results in Table 4. These show that in the pipeline, increasing l above 4 does not result in increased performance, or indeed any effect at all.

It is worth noting that the configuration with the highest f1 has k = 1 and therefore selects its entire evidence from only one article. This approach likely fails at retrieving sufficient evidence for claims that rely on more than one article to be supported or refuted. However, we assume that this shortcoming is outweighed by the advantage that we receive from only selecting sentences from the one most relevant article, instead of allowing our evidence retrieval system to select sentences from further articles. This suggests that further refinement of the evidence retrieval system is needed to also support more complex claims.

While we are satisfied with the results of the pipeline, we stress that its performance is not yet equal to that of running the classifier on claims with the original evidence provided. Doing so results in another 8.2% increase in macro f1, as shown in Figure 22.

Figure 22. f1 in the pipelined system for retrieved and original evidence.

5.3. End-to-End Test Set Performance

In our experiments we have attempted to find the ideal parameters for our model. For the most part we have evaluated our model based on its performance on our development set in the pipeline, measured by unweighted macro f1. Now that we have optimised the model on our development set, we finally examine the pipeline performance on the test set.

We choose to forego an equivalent of the FEVER score (Thorne et al., 2019), as requiring the model to retrieve a certain set of evidence is not supported by our data set. Instead, we focus on the model's ability to correctly label given claims, for which we consider f1 scores to be entirely sufficient. Even though evidence retrieval is not directly included in the metric, it heavily influences model performance (and thereby f1), as we show in Section 5.2.2. We find that we are able to achieve the highest f1 with the parameters listed in Table 5. This way, we achieve a micro f1 of 58.6% and a macro f1 of 53.4% on the test set with retrieved evidence.

TABLE 5. PARAMETERS RESULTING IN HIGHEST f1.

Parameter            Value
Balancing            Oversampling + class weights
Sequence length      125 tokens
Fine-tuning epochs   20
Learning rate        1e-5
Batch size           32
Articles (k)         1
Sentences (l)        ≥ 4

To put the system's results into perspective, we match it up against a naive baseline that uses cosine similarity to predict a claim's veracity. This baseline is first trained on the training set, where it learns the average similarity between claim and evidence for each class. Then it is asked to predict labels for claims in the test set. Even when provided with the original evidence, this approach only yields a micro f1 of 36.6% and a macro f1 of 32.9%. At 36.6%, label accuracy, which in our task is equivalent to micro f1, is close to random. In comparison to this, our best pipeline configuration performs more than 20 percentage points better in both metrics.
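A hedged sketch of such a baseline is shown below: it learns the average claim-evidence similarity per class on the training data and assigns the class whose average is closest. Since the exact text representation used for the similarity computation is not detailed here, TF-IDF vectors are assumed purely for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fit_baseline(claims, evidences, labels):
    # Learn the average claim-evidence similarity for each class.
    vec = TfidfVectorizer().fit(claims + evidences)
    sims = [cosine_similarity(vec.transform([c]), vec.transform([e]))[0, 0]
            for c, e in zip(claims, evidences)]
    means = {lab: np.mean([s for s, l in zip(sims, labels) if l == lab])
             for lab in set(labels)}
    return vec, means

def predict_baseline(vec, means, claim, evidence):
    # Assign the class whose average similarity is closest to this pair's similarity.
    sim = cosine_similarity(vec.transform([claim]), vec.transform([evidence]))[0, 0]
    return min(means, key=lambda lab: abs(means[lab] - sim))
```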

Figure 23. Confusion matrix over pipeline predictions on test set.


When considering performance on the individual classes, however, we find a significant discrepancy between the verifiable classes, Supported and Refuted, and the non-verifiable NotEnoughInfo class. Label accuracy on the verifiable classes is 62.1%, whereas it is only 38.0% for NotEnoughInfo. The model's failure to reliably predict non-verifiable claims is apparent in Figure 23 (see page 18), which shows most claims of this class labelled as Refuted. We echo Stammbach and Neumann (2019) in concluding that NotEnoughInfo seems to be the most challenging class to predict.

As no other points of reference for performance on our Danish data set exist yet, we look to similar claim verification tasks for comparison. Augenstein et al. (2019) provide micro and macro f1 for their models, whereas Thorne et al. (2019) publish label accuracy, which we assume equivalent to micro f1, for their own baseline and FEVER submissions (see Table 6). Of those submissions we include UNC-NLP (Nie et al., 2019) in our comparison, as they achieved the highest label accuracy of all FEVER submissions (Thorne et al., 2019).

TABLE 6. FACT VERIFICATION MODELS ACROSS DIFFERENT TASKS.

Model                            Micro F1  Macro F1
FEVER Baseline                   0.4884    -
FEVER UNC-NLP                    0.6821    -
MultiFC crawled ranked + meta    0.625     0.492
Our Naive Baseline               0.366     0.329
Our Best Pipeline                0.586     0.534

Even though our ability to directly compare performance across these different data sets is limited, we take this comparison as an indication that we managed to develop a fact verification model that performs adequately as a starting point for Danish fact verification.

5.4. Probing for Statistical Cues

As we present in Section 2.5, data sets are not always free of unintentional cues that may sabotage a model's learning. To ensure the validity of our own results, we analyse our data for the presence of statistical cues. This method was originally conceived by Niven and Kao (2019) for a data set on machine reasoning. Despite differences in the structure of their data set and our own, we are able to modify their analysis to also cover fact verification tasks. As we would lack grounds for comparison if the analysis were performed only on our own data set, we begin by probing for statistical cues in the training set of FEVER (Thorne et al., 2018) before moving on to our own data.

5.4.1. Methodology. We follow the approach of Niven and Kao (2019) in determining a productivity and coverage score for each cue in the data. As the structure of their data set is fundamentally different from the data set presented in Thorne et al. (2018), we have made amendments to their methodology in order to attain comparable results.

As in Niven and Kao (2019), we consider any uni- or bigram a potential cue. We extract these cues from the claims in the data set and take note of the associated label. This allows us to calculate the applicability of a given cue (α_k), which represents the absolute number of claims in the data set that contain the cue, irrespective of their label. Let T be the set of all cues and n the number of claims.

\alpha_k = \sum_{i=1}^{n} \mathbb{1}\left[\exists\, k \in T\right]

We define the productivity of a cue (π_k) as the maximum number of claims containing the cue for one particular label j, i.e. the frequency of the most common label across the claims that contain the cue. In practical terms, the productivity is the chance that a model correctly labels a claim by assigning it the most common label of a given cue in the claim.

\pi_k = \frac{\max_j \left[ \sum_{i=1}^{n} \mathbb{1}\left[\exists\, j, k \in T_j\right] \right]}{\alpha_k}

Due to this definition, productivity may range from 1/m to 1, where m is the number of unique labels, which is 3 in our case. The coverage of a cue (ξ_k) is defined by Niven and Kao (2019) as ξ_k = α_k / n. We retain this definition with the caveat that we derive α_k differently.

This approach assumes a balanced data set with regard to the frequency of each label. If executed on an imbalanced data set, a given cue's productivity would be dominated by the most frequent label, not because it is actually more likely to appear in a claim with that label but purely because the label is more frequent overall. We generate a balanced sample by undersampling majority classes. In order to not discard data from the majority classes, however, we repeat the process ten times with random samples. Even though this approach is not guaranteed to cover every single data point in the majority class, we find that this is a better compromise than oversampling minority classes or introducing class weights when calculating productivity. This way we prevent inflating the productivity of rare cues that appear exclusively in the smallest class.
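The sketch below illustrates the core of this computation under assumptions: uni- and bigram cues are extracted from (already label-balanced) claims, and their applicability, productivity, coverage and utility are derived as defined above. The tokenisation and helper names are simplified stand-ins rather than our exact implementation.

```python
from collections import Counter, defaultdict

def extract_cues(claim):
    tokens = claim.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens + bigrams)

def cue_statistics(claims, labels):
    n, m = len(claims), len(set(labels))
    label_counts = defaultdict(Counter)          # cue -> label frequencies
    for claim, label in zip(claims, labels):
        for cue in extract_cues(claim):
            label_counts[cue][label] += 1

    stats = {}
    for cue, counts in label_counts.items():
        applicability = sum(counts.values())     # number of claims containing the cue
        productivity = max(counts.values()) / applicability
        coverage = applicability / n
        utility = productivity - 1 / m
        stats[cue] = (productivity, coverage, utility)
    return stats
```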

5.4.2. Cues in the FEVER Data Set. We apply the described methodology to the FEVER train data set (Thorne et al., 2018) and thereby determine productivity and coverage for 14,320 cues. Considering the cues with a productivity of 1, i.e. cues that could predict the label with 100% accuracy, is not particularly relevant, as none of them have a coverage over 0.01, meaning that they only appear in ≤ 1% of claims. In fact, there are 12,126 cues that only ever appear with one label (≈ 85%).


Table 7 shows the cues with the highest coverage. It is dominated by common English stop words with productivity near the minimum of 1/3. This means that to a machine learning model these cues provide very little utility in finding a shortcut. Some of the more common cues do still provide some utility though. The cues “an”, “to” and “and” each appear in 6-8% of all claims and provide 0.44, 0.53 and 0.49 productivity respectively.

TABLE 7. TOP 5 CUES BY COVERAGE.

Cue   Productivity  Coverage
a     0.36          0.34
is    0.38          0.32
in    0.37          0.30
the   0.36          0.26
was   0.35          0.25

These values pale, however, in comparison to the slightly less common but considerably more productive cues “not” and “only” (see Table 8). While these only have a coverage of 0.04 each, they provide productivity of 0.86 and 0.90 respectively. Even though Thorne et al. (2018) explicitly mention that they attempted to minimise the use of “not” for the creation of refuted claims, we find that in our sample, claims containing “not” were labelled REFUTES 86% of the time.

TABLE 8. HIGH-PRODUCTIVITY CUES.

Cue    Productivity  Coverage
not    0.86          0.04
only   0.90          0.04

Niven and Kao (2019) find that in the ARCT data set (Habernal et al., 2018) the cue “not” has a productivity of 61% and a coverage of 64%. In the FEVER train data, we find “not” to have a higher productivity while having a much lower coverage.

The productivity values alone are not necessarily comparable across data sets though. Niven and Kao (2019) acknowledge that a cue is only useful to a machine learning model if π_k > 1/m. In their case, every claim can have two possible labels, i.e. m = 2. For the FEVER data set three labels exist. This means that the productivity threshold at which cues start becoming useful to a model is higher in the ARCT task. We should therefore actually consider the utility of a cue to the model (ρ_k).

\rho_k = \pi_k - \frac{1}{m}

For “not” this provides a utility value of ≈ 0.11 in ARCT and ≈ 0.53 in the train set of FEVER, meaning that in the FEVER data the cue provides a significantly higher utility to a ML model.

This conclusion is drawn from utility alone, though. For the sake of comparability across both utility and coverage, we condense these values into one metric by taking their harmonic mean. We choose the harmonic mean as it assigns higher values to cues that are both utilisable and covering. For “not” this results in ≈ 0.19 in ARCT and ≈ 0.07 in the FEVER train data.

Considering cues by their harmonic mean of utility and coverage suggests that despite their high productivity, “not” and “only” might not be the most relevant cues in the data, as they are outranked by common stop words that still provide noticeable utility (see Table 9).

TABLE 9. TOP 6 CUES BY HARMONIC MEAN OF UTILITY AND COVERAGE.

Cue    Utility  Coverage  Harmonic Mean
to     0.19     0.07      0.10
an     0.10     0.08      0.09
and    0.15     0.06      0.09
is     0.04     0.32      0.08
not    0.53     0.04      0.07
only   0.56     0.04      0.07

During the creation of the FEVER data set, Thorne et al. (2018) foresaw the potential issue with using “not” to negate a claim and discouraged annotators from doing so. However, they could not have foreseen that other, seemingly neutral words, such as “to” and “and”, would appear in such an imbalanced manner. In fact, in our samples 53% of claims containing “to” are labelled as REFUTES and 49% of claims containing “and” are labelled as SUPPORTS. These distributions are hard to predict and perhaps even harder to make sense of. Therefore, we strongly advocate for an analysis of potentially problematic cues in all NLP data sets that are intended for labelling tasks, including our own.

5.4.3. Cues in our Data Set. Having identified potentially problematic cues in the FEVER data set (Thorne et al., 2018), we use the same methodology to probe our own train data set for statistical cues.

Considering the cues with the highest coverage, we see a pattern similar to our findings on the FEVER data set. In fact, the top five most covering cues are again all common stop words, albeit in Danish (see Table 10, page 21). These cues also show similar productivity and coverage values, e.g. “a” in FEVER has a productivity of 0.36 and a coverage of 0.34, while its Danish translation “en” has a productivity of 0.37 and a coverage of 0.33 in our data set.

Unlike in the FEVER data set, however, we find no cues comparable to “not” and “only” in terms of their high


TABLE 10. TOP 5 CUES BY COVERAGE.

Cue     Productivity  Utility  Coverage
er      0.39          0.06     0.44
i       0.36          0.03     0.37
en      0.37          0.04     0.33
er en   0.39          0.06     0.21
af      0.39          0.06     0.17

productivity and non-negligible coverage. In our data set, all cues with a productivity over 0.6 have a coverage under 0.02, making them rather insignificant. This is apparent when considering the cues with the highest harmonic mean of utility and coverage. These cues, shown in Table 11, are the most utilisable and covering cues in our training data set. Among them, no cue has a higher utility than 0.11, suggesting that cues that occur frequently enough to be impactful do not provide a high amount of utility.

TABLE 11. TOP 6 CUES BY HARMONIC MEAN OF UTILITY AND COVERAGE.

Cue     Utility  Coverage  Harmonic Mean
er      0.06     0.44      0.11
har     0.11     0.09      0.10
er en   0.06     0.21      0.09
af      0.06     0.17      0.09
blev    0.06     0.10      0.08
for     0.07     0.08      0.07

The term “not” was identified as a problematic cue by Niven and Kao (2019) in the ARCT data set, and we also find it providing high utility in FEVER (0.53). However, “ikke” (Danish for “not”) only ranks 22nd in terms of harmonic mean, with a utility of 0.14 and a coverage of 0.03, resulting in a harmonic mean of 0.05. While the cue appears with a similar frequency in FEVER and our data, “not” in FEVER provides almost four times the utility that “ikke” provides in our data.

While the harmonic mean values are surprisingly similar between our and the FEVER data set (see Table 9), they are achieved through different attributes. The highlighted cues from FEVER generally provide higher utility, while cues in our data provide less utility but occur more frequently. These factors are balanced in such a way that no cue in either data set reaches a harmonic mean over 0.11.

We consider this an encouraging indicator that both the FEVER data set of Thorne et al. (2018) and our own are free of cues that may significantly sabotage genuine machine learning efforts. Our Danish data set especially exhibits a low correlation between the words used in a claim and its assigned label. We attribute this to the fact that we were able to benefit from Thorne et al. (2018)'s observation regarding trivial negations with “not” and were therefore careful to avoid replicating this issue in our own data generation.

5.5. Data Ablations

In the previous section, we isolate our data and analyse data quality mathematically; now we look at our data in practice. The data should ideally allow the model to be trained without giving latent signals. As Heinzerling (2019) states, a data set will in most cases be built with a purpose, so patterns aligning with the task might influence a strong model to “recognise, remember, and generalise these patterns”. To test whether such patterns or cues influence how the model performs on a data set, Heinzerling (2019) suggests performing ablations on one's data and observing what impact the changes might have on performance.

Accordingly, we perform four data ablations suggested by Heinzerling (2019). This provides us with the possibility to test our data by means of our classifier. Ultimately, we can view our data both from a mathematical analysis and from a more practical perspective. The ablations can also support or refute our hypothesis that no element of our data can be discarded without impacting performance. If any of the ablations prove to have a negative influence on performance, we can assume that the modified property is essential.

As a ground for comparison, we run the classifier on our development set without evidence retrieval and achieve a micro f1 of 73.8% and a macro f1 of 64.1%. In the examples below we demonstrate the effect of a given ablation on this original claim:

“Christian Siegfried Nissen-Benzon havde en bror.” (“Christian Siegfried Nissen-Benzon had a brother.”)

5.5.1. Shuffled Input. In the first ablation we shuffle the words in a claim and in its evidence, as we believe that our task requires sequential reasoning to perform well. An example of a shuffled claim is:

“Siegfried bror. Nissen-Benzon havde en Christian”.

The results show a significantly worse performance in comparison to the base performance (see Figure 24, page 22), with a micro f1 of 41.3% and a macro f1 of 38.1%.

5.5.2. No Evidence. The second ablation is to remove all evidence from all claim-evidence pairs. Hence, we train the model on the claims and their labels only. In Figure 24 (page 22) we show the performance, giving a micro f1 of 44.2% and a macro f1 of 38.9%. The figure shows a drop in performance compared to the base, but also a slight improvement in micro f1 over the shuffled input.


Figure 24. f1 for base, shuffled input and no evidence.

5.5.3. Random Labels. Here we assign random labels to each claim-evidence pair in the data set, with each claim being assigned any of the three labels. This helps us to examine whether the model is indeed learning from our data. If model performance remains the same, even though labels have been randomly assigned to claim-evidence pairs, we must assume that the model is learning nothing of value to the task.

Heinzerling (2019) suggests executing this experiment with different coverage over the data set. In Figure 25 we show the performance measured by micro and macro f1 with varying ratios of randomly assigned labels. The results show that performance is inversely related to the ratio of randomly assigned labels: the higher the ratio of random labels, the lower the model performance. At the extreme, with 100% random labels, f1 is halved (micro f1: 35.5%, macro f1: 26.6%) compared to 0% random labels. Hence, the results suggest that the model's learning is supported by the way annotators label the claim-evidence pairs.

Figure 25. f1 by percent of randomised labels.
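The three ablations discussed so far can be illustrated with the following sketch, in which each helper maps a list of (claim, evidence, label) triples to an ablated copy; the function names and label strings are illustrative, not taken from our actual ablation code.

```python
import random

LABELS = ["Refuted", "Supported", "NotEnoughInfo"]

def shuffle_words(text):
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

def ablate_shuffled(data):
    # Shuffle the words of every claim and every evidence text.
    return [(shuffle_words(c), shuffle_words(e), l) for c, e, l in data]

def ablate_no_evidence(data):
    # Keep only the claim and its label.
    return [(c, "", l) for c, e, l in data]

def ablate_random_labels(data, ratio=1.0):
    # Reassign a random label to the given share of claim-evidence pairs.
    return [(c, e, random.choice(LABELS) if random.random() < ratio else l)
            for c, e, l in data]
```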

5.5.4. Incomplete Input. The suggested ablations from Heinzerling (2019) also include giving incomplete data to the model. We decide to simulate this by tweaking the sequence length, i.e. the number of tokens provided to the model. Due to BERT's WordPiece tokenization (Wu et al., 2016), sequence length does not directly refer to the number of words, but rather to the number of wordpieces in a concatenated claim-evidence pair.

As we have already performed an analysis of the effects of altering the sequence length (see Section 5.1.4), we can simply refer to Figure 19 on page 16 for results. Here, we see that decreasing the sequence length, thereby providing incomplete input, severely affects classifier performance, nearly halving macro f1 when the sequence length is reduced from 125 all the way to 5. These results indicate that claims in our data set rely on (at least partially) complete input to be verified.

5.5.5. Shuffling Nouns and Verbs. The final ablation explores the impact of changing all nouns and verbs in claims and evidence. This is not only an additional test for cue words (see Section 5.4.3), but also for non-content cues, e.g. Refuted claims always being longer than those labelled Supported. To do this, we replace every noun and verb in every claim-evidence pair with another random noun or verb of the same type that occurs in the data. The example below presents an example claim with replaced words in bold:

“Christian Siegfried Nissen-Benzon bebor en rekordliste.” (“Christian Siegfried Nissen-Benzon inhabits a record list.”)

The results in Figure 26 show that, with a micro f1 of 46.5% and a macro f1 of 31.4%, performance is considerably worsened compared to the base.

All of the ablations presented endorse the quality of our data set. We randomly alter aspects of the structure or composition, and with each alteration we see worsened performance compared to our base. The results suggest that correct words and word order are significant factors. In

Figure 26. f1 for base and replaced words.


addition, the presence of evidence matters greatly in helping the model learn the relation between claim and label. We show that, when not provided with evidence, the model is incapable of achieving the same performance as in the base, suggesting a lack of cues in the data that would allow for correct classification based on the claim alone. Finally, our results exhibit the model's ability to build a meaningful understanding of the relationship between claim, evidence and label, as randomly-labelled claims greatly impact performance.

Our ablative analysis examines a number of properties in our data set and finds that none of them can be disregarded without noticeably affecting performance. While this might not be a complete coverage of possible cues and patterns, we now have increased confidence in the robustness of our data set.

6. Discussion

In the process of creating our Danish data set and our fact verification pipeline, we discover a number of challenges and opportunities that so far remain largely unaddressed. Hence, we use this section to discuss and reflect upon some points that we feel require further review for this and potential future research: evidence selection, evidence retrieval and data sources.

6.1. Impact of Evidence Selection Decisions

The process from designing a data annotation system to testing the classifier is a long and convoluted one. This leads us to make a number of decisions about how we select evidence without full visibility of the impact that these decisions might have later on. One of these decisions is to add entire abstracts of relevant articles to the evidence of a claim, as opposed to Thorne et al. (2018), who only add sentences that annotators highlight as being relevant. While this simplifies and speeds up our annotation process, we wonder about the consequences of this decision. We hypothesise that adding all sentences from an abstract as evidence, even those not required to validate the claim, leads to an overall decrease in the evidence's relevance. This way we might be unknowingly handicapping our model by providing it with too much non-relevant information. On the other hand, we can still see that some relevance remains: as shown in Section 5.5, replacing nouns and verbs in the evidence or leaving it out entirely significantly affects performance, suggesting that there still is a certain degree of relevance to the evidence provided.

A related concern is that our method of selecting evidence lengthens the evidence beyond the maximum input length of our model. In fact, in Section 5.2.2 we find that the pipeline performance does not change when retrieving more than four evidence sentences. While many articles in the Danish Wikipedia only have short abstracts, four sentences may not always be enough to accommodate the entire abstract of a claim's source entity plus its dictionary entries. Nevertheless, we determined that allowing for

longer inputs does not necessarily increase performance (see Section 5.1.4). We hypothesise that this may be attributed to the most relevant information commonly being found in the beginning of a Wikipedia abstract.

Another decision is to limit annotators to validating claims based solely on the abstracts of the claim's source entity and its dictionary. This leads to situations where a claim's mutations introduce new entities that are not included in the evidence. Since the annotators were not able to consult additional articles, they were sometimes limited in their ability to effectively validate those kinds of claims on the given evidence alone. Our evidence retrieval, however, would often find articles for these entities, leading to some cases in which retrieved evidence would actually be of higher quality than the original evidence. Thorne et al. (2019) supplement their original training data with evidence retrieved by systems participating in the FEVER shared task. Given a reliable evidence retrieval system, we consider this a potential opportunity to strengthen the evidence in our data set as well.

6.2. Challenges in Evidence Retrieval

In Section 5.5 we demonstrate the importance of proper evidence to the model's performance. However, we also see that our evidence retrieval component cannot fully meet this standard yet, as model performance on retrieved evidence does not achieve parity with original evidence (see Figure 22, page 18). We consider this an indication that our next focus should be to develop a more sophisticated evidence retrieval component. Some of the works that we introduce in Section 2.2 exhibit promising examples of evidence retrieval technologies, such as Hanselowski et al. (2019), who use NER and noun phrases to find relevant documents.

We see the greatest opportunity for improvement in our approach towards sentence selection, which currently is solely based on querying sentences with Lucene. This approach only takes into account the similarity between a sentence and the claim, but not whether they stand in an argumentative relationship to each other. While identifying argumentative relationships is considered a challenging task for machine learning models (Niven and Kao, 2019), we believe that adding a component for recognising textual entailment to sentence selection has the potential of strengthening the resulting evidence by directly retrieving sentences that support or refute the claim. We see encouraging results in e.g. Stammbach and Neumann (2019) and Soleimani et al. (2020), who use BERT not only for claim verification but also for sentence selection.

6.3. Alternative Sources for Claims and Evidence

We explore fine-tuning BERT with different training data: our own Danish data, FEVER data translated to Danish and English FEVER data for cross-lingual transfer. All of these approaches have in common that the training data is


solely based on either the English or Danish Wikipedia. Although claims are generated by human annotators, evidence sentences are directly selected from Wikipedia without modification.

Not only is Wikipedia the source of all our fine-tuning data; mBERT is even pre-trained exclusively on Wikipedia21. We wonder whether this sole reliance on Wikipedia limits the model's general ability to perform claim verification in other domains, such as journalism or social media. This ability is further called into question by the synthetic nature of our claims. Can a model trained on synthetic data achieve comparable performance on naturally-occurring claims? While Augenstein et al. (2019) state that real-world data sets are irreplaceable in this regard, we assess that replicating their process of collecting a fact verification data set from fact-checking websites is not feasible for low-resource languages. The MultiFC data set achieves its size by sourcing claims from 26 different English fact-checking websites (Augenstein et al., 2019). Meanwhile, the International Fact Checking Network lists TjekDet as its only member in Denmark22, and their fact-checks are not frequent enough to constitute a real-world data set on their own.

While we cannot yet rely fully on real-world data, we encourage future research into Danish fact verification to also include fact-checking, journalistic and other encyclopedic sources.

6.4. Substituting with FEVER Data

In spite of the considerable annotation effort that we invest into the execution of our data generation workflow (see Section 3), we also experiment with using alternate data sources for fine-tuning in Section 5.1.6. Inspired by the positive results reported by Wu and Dredze (2019) and Pires et al. (2019), we experiment specifically with zero-shot cross-lingual transfer. Unfortunately, we are not able to reproduce similar results, as without further modification, performance for zero-shot cross-lingual transfer remains underwhelming (see Section 5.1.6).

While Ronnqvist et al. (2019) report sub-par results from using mBERT on Nordic languages, we do not believe this to be obstructing our transfer attempt, as results improve when training on our own data.

6.4.1. Zero-Shot Cross-Lingual Transfer. While we do not believe mBERT to be obstructing our multilingual transfer efforts, we consider a difference in data formats between ours and the FEVER data set a potential issue. On parameters like annotation rules, we take liberties and follow a slightly different set of rules (see Section 3.2) than the FEVER annotators. This likely negatively affects performance when using FEVER data for training and our data for testing. With the model learning how the FEVER data set

21. https://github.com/google-research/bert/blob/master/multilingual.md
22. https://ifcncodeofprinciples.poynter.org/signatories

defines when a claim is verifiable, and then testing on our slightly altered definition, results are understandably worse than when using one data set only for the entire task. Additionally, the FEVER task (Thorne et al., 2019) is arguably more complex than the tasks reported by Wu and Dredze (2019) and might not lend itself to cross-lingual transfer to the same degree. The FEVER task requires multi-class classification over long input sequences, whereas Wu and Dredze (2019) report that their document classification task uses only two input sentences (even if the document is much larger). Although Pires et al. (2019) suggest that a model could, without being trained for it, still adjust and learn through zero-shot transfer, we do not achieve the expected results without further changes to the model. We may see better results if we tweak hyperparameters, class weights, or other elements of the architecture to better fit a zero-shot cross-lingual transfer task. Our model is optimised for fine-tuning, so by optimising for this approach instead, we could achieve improved results.

6.4.2. Translated FEVER Data. In Section 3.2 we consider combining our own data with translations of original FEVER data points. However, due to inferior quality in the translated samples, we refrain from the idea. Instead, we explore using the translated data to train the model, and then test on our own data in Section 5.1.6. The results reveal that using translated FEVER data cannot yet match performance on our own data. Nonetheless, we deem it possible to achieve better results using data translations from a high-resource language to a low-resource language in the future. Such a scenario would require near-flawless machine translation, the opportunity to use manpower for translations, or a combination of human revision over machine-translated material. If any of these scenarios were to become a reality, it could greatly aid NLP research in generating sizeable data sets in low-resource languages. Our findings suggest that, in our case, data sets specifically generated for Danish retain an advantage over other approaches that leverage larger English data sets.

7. Conclusion

With only limited previous research into fact verification for Danish, we set out to identify and evaluate a possible way to use modern NLP technologies to address the challenge and lay a foundation for future development in this area. To this end we present a Danish fact verification data set and a fact verification pipeline consisting of evidence retrieval and a classifier.

We construct a fact verification data set based on the Danish Wikipedia, comprised of 3,395 labelled claims and evidence from one or multiple Wikipedia articles. The claims are labelled by human annotators with high agreement. To ensure the quality of the data set we perform data ablations and probe for statistical cues. Both methodologies indicate that our data set is largely free of unintended cues and patterns.


We develop a fact verification system in three stages: fine-tuning, evidence retrieval, and classification. First, we fine-tune a language representation model, multilingual BERT, with our Danish data. We optimise the model for ideal performance using our development data set by tweaking its configuration and parameters. During evidence retrieval we collect evidence for each claim by querying Wikipedia abstracts with the claim's noun phrases and selecting the most relevant sentences. Following this, we use the fine-tuned model to classify the claim against the retrieved evidence, culminating in the prediction of a veracity label for each claim. We configure the evidence retrieval and classification to work in an end-to-end pipeline, on which we report our final results.
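
To make the sentence-selection step more concrete, the following is a simplified sketch that ranks candidate abstract sentences by TF-IDF similarity to a claim and keeps the top k. It illustrates the idea only and is not the retrieval component described in Section 4.1; the claim and candidate sentences are invented examples.

# Simplified sketch of selecting the most relevant evidence sentences for a
# claim via TF-IDF cosine similarity. Not the retrieval component from
# Section 4.1; the claim and candidate sentences are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

claim = "Margrethe 2. blev dronning af Danmark i 1972."
candidate_sentences = [
    "Margrethe 2. blev Danmarks dronning den 14. januar 1972.",
    "Danmark er et land i Skandinavien.",
    "Dronningen er kendt for sin interesse for arkæologi.",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([claim] + candidate_sentences)
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

k = 2  # keep the k highest-scoring sentences as evidence
top_k = sorted(zip(scores, candidate_sentences), reverse=True)[:k]
for score, sentence in top_k:
    print(f"{score:.3f}  {sentence}")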

Our fact verification pipeline achieves a micro F1 of 58.6% and a macro F1 of 53.4% on our test set, broadly on par with similar fact verification research on English. We hypothesise that performance can be enhanced by improving the evidence retrieval component to select more relevant evidence, as we do not yet exhaust mBERT's potential for fact verification. Current fact verification research already presents sophisticated evidence retrieval methods that may be applied to Danish as well.
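
The difference between the two averages can be illustrated with a small example; the label vectors below are invented for illustration and are not outputs of our system.

# Contrast between micro and macro F1 on a three-class label set.
# The label vectors are invented for illustration only.
from sklearn.metrics import f1_score

# 0 = Supported, 1 = Refuted, 2 = NotEnoughInfo (illustrative mapping)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0, 1, 2]

micro = f1_score(y_true, y_pred, average="micro")  # pools all decisions
macro = f1_score(y_true, y_pred, average="macro")  # averages per-class F1
print(f"micro F1: {micro:.3f}, macro F1: {macro:.3f}")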

In conclusion, we find that, although Danish is often considered a low-resource language, sufficient tools exist to support Danish fact verification. We show that fact verification methods that have been proposed for English can be applied to Danish as well. We are optimistic that, given sufficient data sets, the development of fact verification systems for Danish need not lag behind that of high-resource languages.

8. Acknowledgements

We would like to thank our supervisor Leon Derczynski for his time and effort in guiding us through this process. Thank you for your constant availability, and for trusting us to lead this research. We would also like to thank Troels Bjerre Lund and Manuel Rafael Ciosici, who lent us their knowledge on several occasions. Finally, we thank Christos Christodoulopoulos and James Thorne for providing us with detailed insights into their work on FEVER. We are grateful to have been given the chance to contribute to this area of research.

References

Augenstein, I., Lioma, C., Wang, D., Chaves Lima, L., Hansen, C., Hansen, C., and Simonsen, J. G. (2019). MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims. pages 4684–4696.

Chawla, N. V. (2010). Data Mining for Imbalanced Datasets: An Overview. In Maimon, O. and Rokach, L., editors, Data Mining and Knowledge Discovery Handbook, pages 875–886. Springer US, Boston, MA.

Derczynski, L. and Bontcheva, K. (2014). PHEME: Veracity in digital social networks. In CEUR Workshop Proceedings, volume 1181, pages 19–22. CEUR-WS.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters.

Geis, M. L., Cole, P., and Morgan, J. (1976). Syntax and Semantics Volume 3: Speech Acts.

Habernal, I., Wachsmuth, H., Gurevych, I., and Stein, B. (2018). The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants. pages 1930–1940.

Hanselowski, A., Zhang, H., Li, Z., Sorokin, D., Schiller, B., Schulz, C., and Gurevych, I. (2019). UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification. pages 103–108.

Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.-Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., and Zhou, M. (2018). Achieving Human Parity on Automatic Chinese to English News Translation.

Heinzerling, B. (2019). NLP's Clever Hans Moment has Arrived.

Kirchmeier, S., Henrichsen, P. J., Diderichsen, P., and Hansen, N. B. (2019). Dansk Sprogteknologi i Verdensklasse.

Kirkedal, A., Plank, B., Derczynski, L., and Schluter, N. (2019). The Lacunae of Danish Natural Language Processing. Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 356–362.

Landis, J. R. and Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1):159.

Ling, C. X. and Sheng, V. S. (2008). Cost-Sensitive Learning and the Class Imbalance Problem. Encyclopedia of Machine Learning, pages 231–235.

Malon, C. (2019). Team Papelo: Transformer Networks at FEVER. pages 109–113.

Nie, Y., Chen, H., and Bansal, M. (2019). Combining Fact Extraction and Verification with Neural Semantic Matching Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6859–6866.

Niven, T. and Kao, H.-Y. (2019). Probing Neural Network Comprehension of Natural Language Arguments. pages 4658–4664.

Otto, W. (2019). Team GESIS Cologne: An all in all sentence-based approach for FEVER. pages 145–149.

Pires, T., Schlinger, E., and Garrette, D. (2019). How Multilingual is Multilingual BERT? pages 4996–5001.

Ronnqvist, S., Kanerva, J., Salakoski, T., and Ginter, F. (2019). Is Multilingual BERT Fluent in Language Generation?

Rosin, P. L. and Fierens, F. (1995). Improving neural network generalisation. International Geoscience and Remote Sensing Symposium (IGARSS), 2:1255–1257.

Sasaki, Y. (2007). The truth of the F-measure. Teach Tutor mater, pages 1–5.

Schuster, T., Shah, D. J., Yeo, Y. J. S., Filizzola, D., Santus, E., and Barzilay, R. (2020). Towards debiasing fact verification models. EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 3419–3425.

SIRI-Kommissionen (2019). AI, medier og demokrati.

Soleimani, A., Monz, C., and Worring, M. (2020). BERT for Evidence Retrieval and Claim Verification. pages 359–366.

Speer, R. (2019). ftfy.

Stammbach, D. and Neumann, G. (2019). Team DOMLIN: Exploiting Evidence Enhancement for the FEVER Shared Task. pages 105–109.

Tayyar Madabushi, H., Kochkina, E., and Castelle, M. (2019). Cost-Sensitive BERT for Generalisable Sentence Classification on Imbalanced Data. pages 125–134.

Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. (2018). FEVER: a Large-scale Dataset for Fact Extraction and VERification. pages 809–819.

Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C., and Mittal, A. (2019). The Fact Extraction and VERification (FEVER) Shared Task. pages 1–9.

Vlachos, A. and Riedel, S. (2014). Fact Checking: Task definition and dataset construction. pages 18–22.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing.

Wu, S. and Dredze, M. (2019). Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. pages 1–23.

Yoneda, T., Mitchell, J., Welbl, J., Stenetorp, P., and Riedel, S. (2019). UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF). pages 97–102.
