
Calibrating Structured Output Predictors for Natural Language Processing

Abhyuday Jagannatha¹, Hong Yu¹,²
¹ College of Information and Computer Sciences, University of Massachusetts Amherst
² Department of Computer Science, University of Massachusetts Lowell
{abhyuday, hongyu}@cs.umass.edu

Abstract

We address the problem of calibrating prediction confidence for output entities of interest in natural language processing (NLP) applications. It is important that NLP applications such as named entity recognition and question answering produce calibrated confidence scores for their predictions, especially if the applications are to be deployed in a safety-critical domain such as healthcare. However, the output space of such structured prediction models is often too large to adapt binary or multi-class calibration methods directly. In this study, we propose a general calibration scheme for output entities of interest in neural network based structured prediction models. Our proposed method can be used with any binary class calibration scheme and a neural network model. Additionally, we show that our calibration method can also be used as an uncertainty-aware, entity-specific decoding step to improve the performance of the underlying model at no additional training cost or data requirements. We show that our method outperforms current calibration techniques for named entity recognition, part-of-speech tagging and question answering. Our decoding step also improves the model's performance across several tasks and benchmark datasets. Our method improves calibration and model performance on out-of-domain test scenarios as well.

1 Introduction

Several modern machine-learning based Natural Language Processing (NLP) systems can provide a confidence score with their output predictions. This score can be used as a measure of predictor confidence. A well-calibrated confidence score is a probability measure that is closely correlated with the likelihood of the model output's correctness. As a result, NLP systems with calibrated confidence can predict when their predictions are likely to be incorrect and therefore should not be trusted. This property is necessary for the responsible deployment of NLP systems in safety-critical domains such as healthcare and finance. Calibration of predictors is a well-studied problem in machine learning (Guo et al., 2017; Platt, 2000); however, widely used methods in this domain are often defined as binary or multi-class problems (Naeini et al., 2015; Nguyen and O'Connor, 2015). The structured output schemes of NLP tasks such as information extraction (IE) (Sang and De Meulder, 2003) and extractive question answering (Rajpurkar et al., 2018) have an output space that is often too large for standard multi-class calibration schemes.

Formally, we study NLP models that provide conditional probabilities pθ(y|x) for a structured output y given input x. The output can be a label sequence in the case of part-of-speech (POS) or named entity recognition (NER) tasks, a span prediction in the case of extractive question answering (QA) tasks, or a relation prediction in the case of relation extraction tasks. pθ(y|x) can be used as a score of the model's confidence in its prediction. However, pθ(y|x) is often a poor estimate of model confidence for the output y. The output space of the model in sequence-labelling tasks is often large, and therefore pθ(y|x) for any output instance y will be small. For instance, in a sequence labelling task with C classes and a sequence length of L, the number of possible events in the output space is of the order of C^L. Additionally, recent efforts (Guo et al., 2017; Nguyen and O'Connor, 2015; Dong et al., 2018; Kumar and Sarawagi, 2019) at calibrating machine learning models have shown that they are poorly calibrated. Empirical results from Guo et al. (2017) show that techniques used in deep neural networks, such as dropout, and their large architecture sizes can negatively affect the calibration of their outputs in binary and multi-class classification tasks.


In parallel, large neural network architectures based on contextual embeddings (Devlin et al., 2018; Peters et al., 2018) have shown state-of-the-art performance across several NLP tasks (Andrew and Gao, 2007; Wang et al., 2019). They are being rapidly adopted for information extraction and other NLP tasks in safety-critical applications (Zhu et al., 2018; Sarabadani, 2019; Li et al., 2019; Lee et al., 2019). Studying the mis-calibration in such models and efficiently calibrating them is imperative for their safe deployment in the real world.

In this study, we demonstrate that neural network models show high calibration errors for NLP tasks such as POS, NER and QA. We extend the work by Kuleshov and Liang (2015) to define well-calibrated forecasters for output entities of interest in structured prediction for NLP tasks. We provide a novel calibration method that applies to a wide variety of NLP tasks and can be used to produce model confidences for specific output entities instead of the complete label sequence prediction. We provide a general scheme for designing manageable and relevant output spaces for such problems. We show that our methods lead to improved calibration performance on a variety of benchmark NLP datasets. Our method also leads to improved out-of-domain calibration performance as compared to the baseline, suggesting that our calibration methods can generalize well.

Lastly, we propose a procedure to use our calibrated confidence scores to re-score the predictions in our defined output event space. This procedure can be interpreted as a scheme to combine model uncertainty scores and entity-specific features with decoding methods like Viterbi. We show that this re-scoring leads to consistent improvements in model performance across several tasks at no additional training or data requirements.

2 Calibration framework for Structured Prediction NLP models

    2.1 Background

Structured prediction refers to the task of predicting a structured output y = [y1, y2, ..., yL] for an input x. In NLP, a wide array of tasks including parsing, information extraction, and extractive question answering fall within this category. Recent approaches towards solving such tasks are commonly based on neural networks that are trained by minimizing the following objective:

$$\mathcal{L}(\theta \mid D) = -\sum_{i=0}^{|D|} \log\big(p_\theta(y^{(i)} \mid x^{(i)})\big) + R(\theta) \tag{1}$$

where θ is the parameter vector of the neural network, R is the regularization penalty, and D is the dataset {(y^(i), x^(i))} for i = 0, ..., |D|. The trained model pθ can then be used to produce the output ŷ = argmax_{y∈Y} pθ(y|x). Here, the corresponding model probability pθ(ŷ|x) is the uncalibrated confidence score.

In binary classification, the output space Y is {0, 1}. The confidence score of such a classifier can be calibrated by training a forecaster Fy : [0, 1] → [0, 1] which maps the model confidence pθ(y|x) to a recalibrated score Fy(pθ(y|x)) (Platt, 2000). A widely used method for binary class calibration is Platt scaling, where Fy is a logistic regression model. Similar methods have also been defined for multi-class classification (Guo et al., 2017). However, extending this to structured prediction in NLP settings is non-trivial since the output space |Y| is often too large for us to calibrate the output probabilities of all events.
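As a concrete illustration, the sketch below fits a Platt-style forecaster with a scikit-learn logistic regression. This is a minimal sketch under the assumption that held-out confidences and binary correctness labels are already available; the function names and toy values are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of Platt scaling as a binary-class calibrator, assuming
# scikit-learn is available. `scores` are uncalibrated model confidences
# p_theta(y=1|x) on a held-out set and `labels` are 0/1 correctness labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit a logistic-regression forecaster F_y on held-out confidences."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores).reshape(-1, 1), np.asarray(labels))
    return lr

def platt_calibrate(lr, scores):
    """Map raw confidences to recalibrated probabilities in [0, 1]."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]

# Example usage with toy values:
# lr = fit_platt([0.9, 0.2, 0.75, 0.6], [1, 0, 1, 0])
# platt_calibrate(lr, [0.8, 0.3])
```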

2.2 Related Work

Calibration methods for binary/multi-class classification have been widely studied in related literature (Bröcker, 2009; Guo et al., 2017). Recent efforts at confidence modeling for NLP have focused on several tasks such as co-reference (Nguyen and O'Connor, 2015), semantic parsing (Dong et al., 2018) and neural machine translation (Kumar and Sarawagi, 2019).

2.3 Calibration in Structured Prediction

In this section, we define the calibration framework by Kuleshov and Liang (2015) in the context of structured prediction problems in NLP. The model pθ denotes the neural network that produces a conditional probability pθ(y|x) given an (x, y) tuple. In a multi/binary class setting, a function Fy is used to map the output pθ(y|x) to a calibrated confidence score for all y ∈ Y. In a structured prediction setting, since the cardinality of Y is usually large, we instead focus on the event of interest set I(x). I(x) contains events of interest E that are defined using the output events relevant to the deployment requirements of a model. Each event E is a subset of Y. There can be several different schemes to define I(x). In later sections, we discuss related work on calibration that can be understood as applications of different I(x) schemes. In this work, we define a general framework for constructing I(x) for NLP tasks which allows us to maximize calibration performance on output entities of interest.

We define Fy(E, x, pθ) to be a function that takes the event E, the input feature x and pθ, and produces a confidence score between [0, 1]. We refer to this calibration function as the forecaster and use Fy(E, x) as a shorthand, since it is implicit that Fy depends on the outputs of pθ. We would like to find the forecaster that minimizes the discrepancy between Fy(E, x) and P(y ∈ E|x) for (x, y) sampled from P(x, y) and E uniformly sampled from I(x).

A commonly used methodology for constructing a forecaster for pθ is to train it on a held-out dataset D_dev. A forecaster for a binary classifier is perfectly calibrated if

$$P(y = 1 \mid F_y(x) = p) = p. \tag{2}$$

It is trained on samples from {(x, I(y = 1)) : (x, y) ∈ D_dev}. For our forecaster based on I(x), perfect calibration would imply that

$$P(y \in E \mid F_y(x, E) = p) = p. \tag{3}$$

The training data samples for our forecaster are {(x, I(y ∈ E)) : E ∈ I(x), (x, y) ∈ D_dev}.

2.4 Construction of Event of Interest set I(x)

The main contributions of this paper stem from our proposed schemes for constructing the aforementioned I(x) sets for NLP applications.

Entities of Interest: In the interest of brevity, let us define "entities of interest" φ(x) as the set of all entity predictions that can be queried from pθ for a sample x. For instance, in the case of answer span prediction for QA, φ(x) may contain the MAP prediction of the best answer span (answer start and end indexes). In a parsing or sequence labeling task, φ(x) may contain the top-k label sequences obtained from Viterbi decoding. In a relation or named-entity extraction task, φ(x) contains the relation or named entity span predictions, respectively. Each entity s in φ(x) corresponds to an event set E that is defined by all outputs in Y that contain the entity s. I(x) contains the set E for all entities in φ(x).

Positive Entities and Events: We are interested in providing a calibrated probability for y ∈ E corresponding to an entity s, for all s in φ(x). Here y is the correct label sequence for the input x. If y lies in the set E for an entity s, we refer to s as a positive entity and the event as a positive event. In the example of named entity recognition, s may refer to a predicted entity span and E refers to all possible sequences in Y that contain the predicted span. The corresponding event is positive if the correct label sequence y contains the span prediction s.

Schemes for construction of I(x): While constructing the set φ(x), we should ensure that it is limited to a relatively small number of output entities, while still covering as many positive events in I(x) as possible. To explain this consideration, let us take the example of a parsing task such as syntactic or semantic parsing. Two possible schemes for defining I(x) are:

1. Scheme 1: φ(x) contains the MAP label sequence prediction. I(x) contains the event corresponding to whether the label sequence y′ = argmax_y pθ(y|x) is correct.

2. Scheme 2: φ(x) contains all possible label sequences. I(x) contains an event corresponding to whether the label sequence y′ is correct, for all y′ ∈ Y.

Calibration of model confidence by Dong et al. (2018) can be viewed as Scheme 1, where the entity of interest is the MAP label sequence prediction. In contrast, using Platt scaling in a one-vs-all setting for multi-class classification (Guo et al., 2017) can be seen as an implementation of Scheme 2, where the entity of interest is the presence of a class label. As discussed in previous sections, Scheme 2 is too computationally expensive for our purposes due to the large value of |Y|. Scheme 1 is computationally cheaper, but it has lower coverage of positive events. For instance, a sequence labelling model with 60% accuracy at the sentence level means that only 60% of positive events are covered by the set corresponding to argmax_y pθ(y|x) predictions. In other words, only 60% of the correct outputs of model pθ will be used for constructing the forecaster. This can limit the positive events in I(x). Including the top-k predictions in φ(x) may increase the coverage of positive events and therefore increase the positive training data for the forecaster. The optimal choice of k involves a trade-off: a larger value of k implies broader coverage of positive events and more positive training data for forecaster training, but it may also lead to an unbalanced training dataset that is skewed in favour of negative training examples.

Calibration             BERT        BERT+CRF    DistilBERT
Platt                   15.90±.03   15.56±.23   12.30±.13
Calibrated Mean         2.55±.34    2.31±.35    2.02±.16
+Var                    2.11±.32    2.55±.32    2.73±.40
Platt+top2              11.4±.07    14.21±.16   11.03±.31
Calibrated Mean+top2    2.94±.29    4.82±.15    3.61±.17
+Var+top2               2.17±.35    4.26±.10    2.43±.16
+Rank+top2              2.43±.30    2.43±.45    2.21±.09
+Rank+Var+top2          1.81±.12    2.29±.27    1.97±.14
Platt+top3              17.46±.13   18.11±.16   12.84±.37
+Rank+Var+top3          3.18±.12    3.71±.25    2.05±.06

Table 1: ECE percentages on Penn Treebank for different models and calibration methods. The results are for top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. ECE for the uncalibrated BERT, BERT+CRF and DistilBERT models is 35.11%, 33.72% and 28.06%, respectively. heuristic-k is 2 for all +Rank+Var+topk forecasters. The full feature model +Rank+Var+topk with k = 3 is also provided for completeness.


Task-specific details about φ(x) are provided in the later sections. For the purposes of this paper, top-k refers to the top k MAP sequence predictions, also referred to as argmax(k).

    2.5 Forecaster Construction

Here we provide a summary of the steps involved in forecaster construction; remaining details are in the Appendix. We train the neural network model pθ on the training data split for a task and use the validation data for monitoring the loss and early stopping. After training is complete, this validation data is re-purposed to create the forecaster training data. We use an MC-Dropout (Gal and Ghahramani, 2016) average of n=10 samples to get a low-variance estimate of the logit outputs from the neural networks. This average is fed into the decoding step of the model pθ to obtain the top-k label sequence predictions. We then collect the relevant entities in φ(x), along with the I(y ∈ E) labels, to form the training data for the forecaster. We use gradient boosted decision trees (Friedman, 2001) as our region-based (Dong et al., 2018; Kuleshov and Liang, 2015) forecaster model.
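A minimal sketch of this forecaster-training step is shown below, assuming each entity of interest extracted from the held-out top-k predictions has already been turned into a feature vector with a binary label I(y ∈ E). The scikit-learn gradient-boosted-tree estimator stands in for the paper's gradient boosted decision trees; all names and hyperparameter values are illustrative.

```python
# Sketch of forecaster training on held-out entity events. `features` is a
# list of per-entity feature vectors (mean MC-Dropout probability, variance
# percentiles, rank, ...) and `labels` holds the binary event labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_forecaster(features, labels):
    """Fit a gradient-boosted-tree forecaster on held-out entity events."""
    forecaster = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    forecaster.fit(np.asarray(features), np.asarray(labels))
    return forecaster

def forecast_confidence(forecaster, features):
    """Calibrated confidence P(y in E | features) for each entity event."""
    return forecaster.predict_proba(np.asarray(features))[:, 1]
```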

Choice of the hyperparameter k: We limit our choice of k to {2, 3}. We train our forecasters on training data constructed through top-2 and top-3 extraction each. These two models are then evaluated on top-1 extraction training data, and the best value of k is used for evaluation on the test set. This heuristic for k selection is based on the fact that the top-1 training data for a good predictor pθ is a positive-event rich dataset. Therefore, this dataset can be used to reject a larger k if it leads to reduced performance on positive events. We refer to the value of k obtained from this heuristic as heuristic-k.
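The k-selection heuristic can be sketched as follows, reusing the hypothetical train_forecaster/forecast_confidence helpers above and an ece(confidences, labels) helper (one version is sketched in the Appendix); the data layout is an assumption.

```python
# Sketch of heuristic-k selection: train on top-k event data, evaluate on
# the positive-event-rich top-1 event data, keep the k with the lowest ECE.
def select_heuristic_k(topk_data, top1_data, candidate_ks=(2, 3)):
    """topk_data[k] -> (features, labels) built from top-k extraction;
    top1_data -> (features, labels) built from top-1 extraction only."""
    best_k, best_ece, best_model = None, float("inf"), None
    for k in candidate_ks:
        model = train_forecaster(*topk_data[k])
        conf = forecast_confidence(model, top1_data[0])
        score = ece(conf, top1_data[1])  # lower is better
        if score < best_ece:
            best_k, best_ece, best_model = k, score, model
    return best_k, best_model
```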

2.6 Feature Construction for Calibration

We use three categories of features as inputs to our forecaster.

Model and Model Uncertainty based features contain the mean probability obtained by averaging the marginal probability of the "entity of interest" over 10 MC-Dropout samples of pθ. The average of marginal probabilities acts as a reduced-variance estimate of the uncalibrated model confidence. Our experiments use pre-trained contextual word embedding architectures as the backbone networks. We obtain MC-Dropout samples by enabling dropout sampling for all dropout layers of the networks. We also provide the 10th and 90th percentile values from the MC-Dropout samples, to provide model uncertainty information to the forecaster. Since our forecaster training data contains entity predictions from top-k MAP predictions, we also include the rank k as a feature. We refer to these two features as "Var" and "Rank" in our models.
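A small sketch of these model-uncertainty features, assuming mc_probs holds the marginal probability of one entity of interest under each of the 10 MC-Dropout passes and rank is its position in the top-k list (names are illustrative):

```python
# Model-uncertainty feature construction for a single entity of interest.
import numpy as np

def uncertainty_features(mc_probs, rank):
    mc_probs = np.asarray(mc_probs)
    return {
        "mean_prob": mc_probs.mean(),        # reduced-variance confidence estimate
        "p10": np.percentile(mc_probs, 10),  # "Var" features: spread of the
        "p90": np.percentile(mc_probs, 90),  # MC-Dropout samples
        "rank": rank,                        # "Rank" feature (position in top-k list)
    }
```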

Entity of interest based features contain the length of the entity span if the output task is named entity recognition. We only use this feature in the NER experiments and refer to it as "ln".

Data Uncertainty based features: Dong et al. (2018) propose the use of language modelling (LM) and OOV-word-based features as a proxy for data uncertainty estimation.

Calibration        BERT        BERT+CRF    DistilBERT
Baseline           60.30±.12   62.31±.11   60.17±.08
+Rank+Var+top2     60.30±.23   62.31±.09   60.13±.11
+Rank+Var+top3     59.84±.16   61.06±.14   58.95±.08

Table 2: Micro-avg f-score for POS datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events E ∈ I(s), and the top selection is chosen. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-Dropout averaged (sample size = 10) output from the model pθ. heuristic-k is 2 for +Rank+Var+topk forecasters.

The use of word-pieces and large pre-training corpora in contextual word embedding models like BERT may affect the efficacy of LM-based features. Nevertheless, we use LM perplexity (referred to as "lm") in the QA task to investigate its effectiveness as an indicator of distributional shift in the data. Essentially, our analysis focuses on LM perplexity as a proxy for distributional uncertainty (Malinin and Gales, 2018) in our out-of-domain experiments. The use of word-pieces in models like BERT reduces the negative effect of OOV words on model prediction. Therefore, we do not include OOV features in our experiments.

    3 Experiments and Results

We use the BERT-base (Devlin et al., 2018) and DistilBERT (Sanh et al., 2019) network architectures for our experiments. The validation split for each dataset was used for early stopping of BERT fine-tuning and as training data for forecaster training. POS and NER experiments are evaluated on Penn Treebank, and on CoNLL 2003 (Sang and De Meulder, 2003) and MADE 1.0 (Jagannatha et al., 2019), respectively. QA experiments are evaluated on the SQuAD1.1 (Rajpurkar et al., 2018) and EMRQA (Pampari et al., 2018) corpora. We also investigate the performance of our forecasters on an out-of-domain QA corpus constructed by applying the EMRQA QA data generation scheme (Pampari et al., 2018) to the MADE 1.0 named entity and relations corpus. Details for these datasets are provided in their relevant sections.

We use the expected calibration error (ECE) metric defined by Naeini et al. (2015) with N = 20 bins (Guo et al., 2017) to evaluate the calibration of our models. ECE is defined as an estimate of the expected difference between the model confidence and accuracy. ECE has been used in several related works (Guo et al., 2017; Maddox et al., 2019; Kumar et al., 2018; Vaicenavicius et al., 2019) to estimate model calibration. We use Platt scaling as the baseline calibration model. It uses the length-normalized probability averaged across 10 MC-Dropout samples as its input. The lower variance and length invariance of this input feature make Platt scaling a strong baseline. We also use a "Calibrated Mean" baseline using gradient boosted decision trees as the estimator with the same input feature as Platt.

3.1 Calibration for Part-of-Speech Tagging

Part-of-speech (POS) tagging is a sequence labelling task where the input is a text sentence and the output is a sequence of syntactic tags. We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). We can define either the token prediction or the complete sequence prediction as the entity of interest. Since using a token-level entity of interest effectively reduces the calibration problem to that of calibrating a multi-class classifier, we instead study the case where the predicted label sequence of the entire sentence forms the entity of interest set. The event of interest set is defined by the events y = MAP_k(x), which denote whether each top-k sentence-level MAP prediction is correct. We use three choices of pθ models, namely BERT, BERT+CRF and DistilBERT. We use model uncertainty and rank based features for our POS experiments.

Table 1 shows the ECE values for our baseline, proposed and ablated models. The value of heuristic-k is 2 for all +Rank+Var+topk forecasters across all PTB models. "topk" in Table 1 refers to forecasters trained with additional top-k predictions. Our methods outperform both baselines by a large margin. Both the "Rank" and "Var" features help in improving model calibration. Inclusion of top-2 prediction sequences also improves the calibration performance significantly. Table 1 also shows the performance of our full feature model "+Rank+Var+topk" for the sub-optimal value of k = 3. It has lower performance than k = 2 across all models. Therefore, for the subsequent experimental sections, we only report top-k calibration performance using the heuristic-k value.

We use the confidence predictions of our full-feature model +Rank+Var+topk to re-rank the top-k predictions in the test set. Table 2 shows the sentence-level (entity of interest) accuracy for our re-ranked top prediction and the original model prediction.

Calibration             CoNLL (BERT)   MADE 1.0 (bioBERT)
Platt                   2.00±.12       4.00±.07
Calibrated Mean         2.29±.33       3.07±.18
+Var                    2.43±.36       3.05±.17
+Var+ln                 2.24±.14       2.92±.24
Platt+top3              16.64±.48      2.14±.18
Calibrated Mean+top3    17.06±.50      2.22±.31
+Var+top3               17.10±.24      2.17±.39
+Rank+Var+top3          2.01±.33       2.34±.15
+Rank+Var+ln+top3       1.91±.29       2.12±.24

Table 3: ECE percentages for the two named entity datasets and calibration methods. The results are for all predicted named entity spans in top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. ECE for uncalibrated span marginals from the BERT model is 3.68% and 5.59% for the CoNLL and MADE 1.0 datasets, respectively. heuristic-k is 3 for all +Rank+Var+top3 forecasters.

Calibration          CoNLL (BERT)   MADE 1.0 (bioBERT)
Baseline             89.45±.08      84.01±.11
+Rank+Var+top3       89.73±.12      84.33±.07
+Rank+Var+ln+top3    89.78±.10      84.34±.10

Table 4: Micro-avg f-score for NER datasets and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events E ∈ I(s), and a confidence value of 0.5 is used as a cutoff. Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-Dropout averaged (sample size = 10) output of model pθ. heuristic-k is 3 for all +Rank+Var+top3 forecasters.


    3.2 Calibration for Named Entities

For Named Entity (NE) recognition experiments, we use two NE-annotated datasets, namely CoNLL 2003 and MADE 1.0. CoNLL 2003 consists of documents from the Reuters corpus annotated with named entities such as Person, Location, etc. The MADE 1.0 dataset is composed of electronic health records annotated with clinical named entities such as Medication, Indication and Adverse effects.

The entity of interest for NER is the named entity span prediction. We define φ(x) as the predicted entity spans in the argmax(k) label sequence predictions for x. We use BERT-base with token-level softmax outputs and marginal likelihood based training. The model uncertainty estimates for the "Var" feature are computed by estimating the variance of length-normalized MC-Dropout samples of the span marginals. Due to the similar trends in behaviour of the BERT and BERT+CRF models in the POS experiments, we only use the BERT model for NER. However, the span marginal computation can be easily extended to linear-chain CRF models. We also use the length of the predicted named entity as the feature "ln" in this experiment. Complete details about the forecaster and baselines are in the Appendix. The value of heuristic-k is 3 for all +Rank+Var+topk forecasters. We show ablation and baseline results for k = 3 only. However, no other forecasters for any k ∈ {2, 3} outperform our best forecasters in Table 3.

We use the confidence predictions of our "+Rank+Var+top3" models to re-score the confidence predictions for all spans predicted in the top-3 MAP predictions for samples in the test set. A threshold of 0.5 was used to remove span predictions with low confidence scores. Table 4 shows the named entity level (entity of interest) micro-F score for our re-ranked top prediction and the original model prediction. We see that re-ranked predictions from our models consistently improve the model f-score.
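A rough sketch of this re-scoring step is given below, assuming candidate spans from the top-3 MAP sequences have already been paired with forecaster feature vectors, and reusing the hypothetical forecast_confidence helper from earlier. The data layout and the de-duplication rule are assumptions, not the paper's exact procedure.

```python
# Uncertainty-aware re-scoring of NER spans with the calibrated forecaster.
def rescore_spans(candidate_spans, forecaster, threshold=0.5):
    """candidate_spans: list of (span, feature_vector) from top-k decoding.
    Returns spans whose calibrated confidence clears the threshold."""
    kept = []
    for span, feats in candidate_spans:
        conf = forecast_confidence(forecaster, [feats])[0]
        if conf >= threshold:
            kept.append((span, conf))
    # De-duplicate spans that appear in several top-k sequences, keeping the
    # highest calibrated confidence for each.
    best = {}
    for span, conf in kept:
        if span not in best or conf > best[span]:
            best[span] = conf
    return best
```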

    3.3 Calibration for QA Models

We use three datasets to evaluate our calibration methods on the QA task. Our QA tasks are modeled as extractive QA with single-span answer predictions. SQuAD1.1 and EMRQA (Pampari et al., 2018) are open-domain and clinical-domain QA datasets, respectively. We process the EMRQA dataset by restricting the passage length and removing unanswerable questions. We also design an out-of-domain evaluation of calibration using clinical QA datasets. We follow the guidelines from Pampari et al. (2018) to create a QA dataset from MADE 1.0 (Jagannatha et al., 2019).

Calibration             SQuAD1.1    EMRQA       MADE 1.0    MADE 1.0 (OOD)
                        (BERT)      (bioBERT)   (bioBERT)   (bioBERT)
Platt                   3.69±.16    5.07±.37    3.64±.17    15.20±.16
Calibrated Mean         2.95±.26    2.28±.18    2.50±.31    13.26±.94
+Var                    2.92±.28    2.74±.15    2.71±.32    12.41±.95
Platt+top3              7.71±.28    5.42±.25    11.87±.19   16.36±.26
Calibrated Mean+top3    3.52±.35    2.11±.19    9.21±.25    12.11±.24
+Var+top3               3.56±.29    2.20±.20    9.26±.27    11.67±.27
+Var+lm+top3            3.54±.21    2.12±.19    6.07±.26    12.42±.32
+Rank+Var+top3          2.47±.18    1.98±.10    1.77±.23    12.69±.20
+Rank+Var+lm+top3       2.79±.32    2.24±.29    1.66±.27    12.60±.28

Table 5: ECE percentages for the QA tasks SQuAD1.1, EMRQA and MADE 1.0. MADE 1.0 (OOD) refers to the out-of-domain evaluation of a QA model that is trained and calibrated on the EMRQA training and validation splits. The results are for top-1 MAP predictions on the test data. ECE standard deviation is estimated by repeating the experiments for 5 repetitions. The BERT model's uncalibrated ECE for SQuAD1.1, EMRQA, MADE 1.0 and MADE 1.0 (OOD) is 6.24%, 6.10%, 20.10% and 18.70%, respectively. heuristic-k is 3 for all +Rank+Var+topk forecasters.

Calibration          SQuAD1.1    EMRQA       MADE 1.0    MADE 1.0 (OOD)
                     (BERT)      (bioBERT)   (bioBERT)   (bioBERT)
Baseline             79.79±.08   70.97±.14   66.21±.18   31.62±.12
+Rank+Var+top3       80.04±.11   71.34±.22   66.33±.12   31.99±.11
+Rank+Var+lm+top3    80.03±.15   71.37±.26   66.33±.15   32.02±.09

Table 6: Exact Match accuracy for QA datasets using the baseline and our best proposed calibration method. The confidence score from the calibration method is used to re-rank the events E ∈ I(s). Standard deviation is estimated by repeating the experiments for 5 repetitions. Baseline refers to the MC-Dropout averaged (sample size = 10) output of model pθ. heuristic-k is 3 for all +Rank+Var+topk forecasters.

This allows us to have two QA datasets with common question forms but different text distributions. In this experimental setup, we can mimic the evaluation of calibration methods in a real-world scenario, where the task specifications remain the same but the underlying text source changes. Details about dataset pre-processing and construction are provided in the Appendix.

The entity of interest for QA is the top-k answer span prediction. We use the "lm" perplexity as a feature in this experiment to analyze its behaviour in out-of-domain evaluations. We use a 2-layer unidirectional LSTM to train a next-word language model on the EMRQA passages. This language model is then used to compute the perplexity of a sentence for the "lm" input feature to the forecaster. We use the same baselines as in the previous two tasks.
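The "lm" feature reduces to a per-sentence perplexity; a minimal sketch, assuming token_logprobs are next-word log-probabilities of the sentence under the trained LSTM language model (how they are produced is model-specific):

```python
# Sketch of the "lm" data-uncertainty feature: sentence perplexity from
# per-token next-word log-probabilities.
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / max(n, 1))
```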

Based on Table 5, our methods outperform the baselines by a large margin in both in-domain and out-of-domain experiments. The value of heuristic-k is 3 for all +Rank+Var+topk forecasters. We show ablation and baseline results for k = 3 only. However, no other forecasters for any k ∈ {2, 3} outperform our best forecasters in Table 5. Our models are evaluated on the SQuAD1.1 dev set, and on test sets from EMRQA and MADE 1.0. They show consistent improvements in ECE and Exact Match accuracy.

    4 Discussion

Our proposed methods outperform the baselines in most tasks and remain competitive in the others.

Features and top-k samples: The inclusion of top-k features improves performance in almost all tasks when the rank of the prediction is included. We see large increases in calibration error when the top-k prediction samples are included in forecaster training without the rank information, in tasks such as CoNLL NER and MADE 1.0 QA. This may be because the k = 1, 2, 3 predictions may have similar model confidence and uncertainty values.

Figure 1: Modified reliability plots (Accuracy - Confidence vs. Confidence) on the MADE 1.0 QA test set. The dotted horizontal line represents perfect calibration. Scatter point diameter denotes bin size; the inner diameter of the scatter point denotes the number of positive events in that bin.

Therefore, a more discriminative signal such as rank is needed to prioritize them. For instance, the probabilities of the k = 1 and k = 2 MAP predictions for POS tagging may differ by only one or two tokens. In a sentence of length 10 or more, this difference in probability, when normalized by length, amounts to a very small shift in the overall model confidence score. Therefore, an additional input of rank k leads to a substantial gain in performance for all models in POS.

Our task-agnostic scheme of "Rank+Var+topk" based forecasters consistently outperforms or stays competitive with other forecasting methods. However, results from task-specific features such as "lm" and "ln" show that the use of task-specific features can further reduce the calibration error. Our domain-shift experimental setup has the same set of questions in both the in-domain and out-of-domain datasets; only the data distribution of the answer passages is different. However, we do not observe an improvement in out-of-domain performance from the "lm" feature. A more detailed analysis of task-specific features in QA with both data and question shifts is required. We leave further investigation of such schemes as future work.

Choice of k is important: The optimal choice of k seems to be strongly dependent on the inherent properties of the task and its output event set. In all our experiments, for a specific task, all +Rank+Var+topk forecasters exhibit consistent behaviour with respect to the choice of k.

Figure 2: An example of a named entity span from the CoNLL dataset. Rank is the kth rank from top-k MAP inference (Viterbi decoding). Mean Prob and Std are the mean and standard deviation of the length-normalized probabilities (geometric mean of the marginal probabilities for each token in the span). Calibrated confidence is the output of Rank+Var+ln+top3.

In POS experiments, heuristic-k = 2; in all other tasks, heuristic-k = 3. Our heuristic-k models are the best performing models, suggesting that the heuristic described in Section 2.5 may generalize to other tasks as well.

Re-scoring: We show that using our forecaster confidence to re-rank the entities of interest leads to a modest boost in model performance for the NER and QA tasks. In POS, no appreciable gain or drop in performance was observed for k = 2. We believe this may be due to the already high token-level accuracy (above 97%) on Penn Treebank data. Nevertheless, this suggests that our re-scoring does not lead to a degradation in model performance in cases where it is not effective.

Our forecaster re-scores the top-k entity confidence scores based on the model uncertainty score and entity-level features such as entity length. Intuitively, we want to prioritize low-uncertainty predictions over high-uncertainty ones if their uncalibrated confidence scores are similar. We provide an example of such re-ranking in Figure 2. It shows named entity span predictions for the correct span "Such". The model pθ produces two entity predictions, "off-spinner Such" and "Such". The uncalibrated confidence score of "off-spinner Such" is higher than that of "Such", but the variance of its prediction is higher as well. Therefore, the +Rank+Var+ln+top3 forecaster re-ranks the second (and correct) prediction higher. It is important to note here that the variance of "off-spinner Such" may be higher simply because it involves two token predictions, as compared to only one token prediction in "Such". This, along with the "ln" feature in +Rank+Var+ln+top3, may mean that the forecaster is also using length information along with uncertainty to make this prediction. However, we see similar improvements in the QA tasks, where the "ln" feature is not used and all entity predictions involve two predictions (span start and end index predictions). These results suggest that uncertainty features are useful in both calibration and re-ranking of predicted structured output entities.

Out-of-domain Performance: Our experiments testing the performance of calibrated QA systems on out-of-domain data suggest that our methods result in improved calibration on unseen data as well. Additionally, our methods also lead to an improvement in system accuracy on out-of-domain data, suggesting that the mapping learned by the forecaster model is not specific to a dataset. However, there is still a large gap between the calibration error for within-domain and out-of-domain testing. This can be seen in the reliability plot shown in Figure 1, where the number of samples in each bin is denoted by the radius of the scatter point. The calibrated models shown in the figure correspond to the "+Rank+Var+lm+top3" forecaster calibrated using both in-domain and out-of-domain validation datasets for forecaster training. We see that out-of-domain forecasters are over-confident, and this behaviour is not mitigated by using data-uncertainty aware features like "lm". This is likely due to a shift in the model's prediction error when applied to a new dataset. Re-calibration of the forecaster using a validation set from the out-of-domain data seems to bridge the gap. However, we can see that the sharpness (Kuleshov and Liang, 2015) of the out-of-domain trained, in-domain calibrated model is much lower than that of the in-domain trained, in-domain calibrated one. Additionally, a validation dataset is often not available in the real world. Mitigating the loss in calibration and sharpness induced by out-of-domain evaluation is an important avenue for future research.

Uncertainty Estimation: We use MC-Dropout as a model (epistemic) uncertainty estimation method in our experiments. However, our method is not specific to MC-Dropout and is compatible with any method that can provide a predictive distribution over token-level outputs. As a result, any Bayesian or ensemble-based uncertainty estimation method (Welling and Teh, 2011; Lakshminarayanan et al., 2017; Ritter et al., 2018) can be used with our scheme. In this work, we do not investigate the use of aleatoric uncertainty for calibration. Our use of language model features is aimed at accounting for distributional uncertainty rather than aleatoric uncertainty (Gal, 2016; Malinin and Gales, 2018). Investigating the use of different types of uncertainty for calibration remains future work.

    5 Conclusion

We present a new calibration and confidence-based re-scoring scheme for structured output entities in NLP. We show that our calibration methods outperform competitive baselines on several NLP tasks. Our task-agnostic methods can provide calibrated model outputs for specific entities instead of the entire label sequence prediction. We also show that our calibration method can improve the trained model's accuracy at no additional training or data cost. Our method is compatible with modern NLP architectures like BERT. Lastly, we show that our calibration does not over-fit to in-domain data and is capable of generalizing calibration to out-of-domain datasets.

    Acknowledgement

Research reported in this publication was supported by the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health under Award Number R01HL125089.

References

Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40.

Jochen Bröcker. 2009. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Li Dong, Chris Quirk, and Mirella Lapata. 2018. Confidence modeling for neural semantic parsing. arXiv preprint arXiv:1805.04604.

Jerome H. Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.

Yarin Gal. 2016. Uncertainty in deep learning. University of Cambridge, 1:3.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org.

Abhyuday Jagannatha, Feifan Liu, Weisong Liu, and Hong Yu. 2019. Overview of the first natural language processing challenge for extracting medication, indication, and adverse drug events from electronic health record notes (MADE 1.0). Drug Safety, 42(1):99–111.

Volodymyr Kuleshov and Percy S. Liang. 2015. Calibrated structured prediction. In Advances in Neural Information Processing Systems, pages 3474–3482.

Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. 2018. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2810–2819.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.

Fei Li, Yonghao Jin, Weisong Liu, Bhanu Pratap Singh Rawat, Pengshan Cai, and Hong Yu. 2019. Fine-tuning bidirectional encoder representations from transformers (BERT)-based models on large-scale electronic health record notes: An empirical study. JMIR Medical Informatics, 7(3):e14830.

Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. A simple baseline for Bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476.

Andrey Malinin and Mark Gales. 2018. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047–7058.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pages 114–119. Association for Computational Linguistics.

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Cláudio A. Naranjo, Usoa Busto, Edward M. Sellers, P. Sandor, I. Ruiz, E. A. Roberts, E. Janecek, C. Domecq, and D. J. Greenblatt. 1981. A method for estimating the probability of adverse drug reactions. Clinical Pharmacology & Therapeutics, 30(2):239–245.

Khanh Nguyen and Brendan O'Connor. 2015. Posterior calibration and exploratory analysis for natural language processing models. arXiv preprint arXiv:1508.05154.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

J. Platt. 2000. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers, pages 61–74.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.


Hippolyt Ritter, Aleksandar Botev, and David Barber. 2018. A scalable Laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018 Conference Track Proceedings, volume 6.

Erik F. Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Sarah Sarabadani. 2019. Detection of adverse drug reaction mentions in tweets using ELMo. In Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 120–122.

Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. 2019. Evaluating model calibration in classification. arXiv preprint arXiv:1902.06977.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Max Welling and Yee W. Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Henghui Zhu, Ioannis Ch. Paschalidis, and Amir Tahmasebi. 2018. Clinical concept extraction with contextual word embedding. arXiv preprint arXiv:1810.10566.

A Appendices

    A.1 Algorithm Details:

The forecaster construction algorithm is provided in Algorithm 1. The candidate events in Algorithm 1 are obtained by extracting the top-k label sequences for every output. The logits obtained from pθ are averaged over 10 MC-Dropout samples before being fed into the final output layer. We use the validation dataset from the task's original split to train the forecaster. The validation dataset is used to construct both the training and validation splits for the forecaster. The training split contains all top-k predicted entities. The validation split contains only top-1 predicted entities.

    A.2 Evaluation Details

We use the expected calibration error (ECE) score defined by Naeini et al. (2015) to evaluate our calibration methods. Expected calibration error is a score that estimates the expected absolute difference between model confidence and accuracy. It is calculated by binning the model outputs into N bins (N = 20 for our experiments) and then computing the expected calibration error across all bins. It is defined as

$$\text{ECE} = \sum_{i=0}^{N} \frac{|B_i|}{n} \left| \text{acc}(B_i) - \text{conf}(B_i) \right|, \tag{4}$$

where N is the number of bins, n is the total number of data samples, and B_i is the i-th bin. The functions acc(·) and conf(·) calculate the accuracy and model confidence for a bin.
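A minimal sketch of this computation with N = 20 equal-width bins, assuming confidences are forecaster outputs in [0, 1] and labels are binary correctness indicators for the same events:

```python
# Sketch of the ECE computation in equation (4) with equal-width bins.
import numpy as np

def ece(confidences, labels, n_bins=20):
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to one of N equal-width bins over [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = len(confidences)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        acc = labels[mask].mean()           # acc(B_i)
        conf = confidences[mask].mean()     # conf(B_i)
        err += (mask.sum() / total) * abs(acc - conf)
    return err
```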

    A.3 Implementation Details

We use AllenNLP's wrapper with HuggingFace's Transformers code¹ for our implementation². We use BERT-base-cased (Wolf et al., 2019) weights as the initialization for general-domain datasets and bioBERT weights (Lee et al., 2019) as the initialization for clinical datasets. We use cased models for our analysis, since bioBERT (Lee et al., 2019) uses cased models. A common learning rate of 2e-5 was used for all experiments. We used the validation data splits provided by the datasets. In cases where a validation dataset was not provided, such as MADE 1.0, EMRQA or SQuAD1.1, we use 10% of the training data as the validation data.

¹ https://github.com/huggingface/transformers
² The code for forecaster construction is available at https://github.com/abhyudaynj/StructuredPredictionCalibrationNLP

We use a patience of 5 for early stopping, with each epoch consisting of 20,000 steps. We use the final evaluation metric instead of the negative log likelihood (NLL) to monitor and early-stop the training. This reduces the mis-calibration of the underlying pθ model, since Guo et al. (2017) observe that neural nets overfit on NLL. The implementation for each experiment is described in the following subsections.

A.3.1 Part-of-speech experiments

We evaluate our method on the Penn Treebank dataset (Marcus et al., 1994). Our experiments use the standard training (sections 1–18), validation (19–21) and test (22–24) splits from the WSJ portion of the Penn Treebank. The uncalibrated output of our model for a candidate label sequence is estimated as

$$\hat{p} = \frac{1}{M} \sum_{\text{MC-Dropout}} p_\theta(y_1, y_2, \ldots, y_L \mid x)^{\frac{1}{L}}, \tag{5}$$

where M is the number of dropout samples and L is the length of the sentence. The L-th root accounts for different sentence lengths. We observe that this kind of normalization improves the calibration of both the baseline and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use two choices of pθ models, namely BERT and BERT+CRF. The BERT-only model adds a linear layer to the output of the BERT network and uses a softmax activation function to produce marginal label probabilities for each token. BERT+CRF uses a CRF layer on top of unary potentials obtained from the BERT network outputs.
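Equation (5) amounts to averaging, over M dropout passes, the per-token geometric mean of the sequence probability. A minimal sketch, assuming sample_logprobs holds log pθ(y1, ..., yL | x) for the candidate sequence under each MC-Dropout pass and L is the sentence length:

```python
# Sketch of the length-normalized, MC-Dropout averaged confidence in eq. (5).
import math

def length_normalized_confidence(sample_logprobs, L):
    # p^(1/L) = exp(log p / L); average the per-sample normalized probabilities.
    return sum(math.exp(lp / L) for lp in sample_logprobs) / len(sample_logprobs)
```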

We use Platt scaling (Platt, 2000) as the baseline calibration model. Our Platt scaling model uses the MC-Dropout average of the length-normalized probability output of the model pθ as input. The lower variance and length invariance of this input feature make Platt scaling a very strong baseline. We also use a "Calibrated Mean" baseline using gradient boosted decision trees as the estimator with the same input feature as Platt.

A.3.2 NER Experiments

For the CoNLL dataset, the "testa" file was reserved for validation and "testb" for test. For MADE 1.0 (Jagannatha et al., 2019), since a validation data split was not provided, we randomly selected 10% of the training data as validation data.

Algorithm 1: Forecaster construction for model pθ with max rank kmax.
Input: Uncalibrated model pθ, validation dataset D = {(x(i), y(i))} for i = 0, ..., |D|, and kmax.
Output: Forecaster Fy.

Function Get-Forecaster(pθ, D, kmax):
    for i ← 0 to |D| do
        I(x(i)) ← Get-Candidate-Events(pθ, x(i), kmax)
        Dtrain ← {(x(i), c, E) : c = 1(y(i) ∈ E), ∀E ∈ I(x(i))}
        Ik=1(x(i)) ← Get-Candidate-Events(pθ, x(i), 1)
        Dval ← {(x(i), c, E) : c = 1(y(i) ∈ E), ∀E ∈ Ik=1(x(i))}
    end
    Train forecasters Fy(k) for k = {1, ..., kmax} using Dtrain
    Fy ← Fy(k) with minimum ECE on Dval
    return Fy

Function Get-Candidate-Events(pθ, x, kmax):
    Construct the top-kmax label sequences using the MC-Dropout average of pθ(x) logits.
    Extract the relevant entity set φ(x) from the top-kmax label sequences.
    I(x) ← events corresponding to entities in φ(x).
    return I(x)

The length-normalized marginal probability for a span starting at token i and of length l is estimated as

$$\hat{p} = \frac{1}{M} \sum_{\text{MC-Dropout}} p_\theta(y_i, y_{i+1}, \ldots, y_{i+l-1} \mid x)^{\frac{1}{l}}. \tag{6}$$

We use this as the input to both the baseline and proposed models. We observe that this kind of normalization improves the calibration of the baseline and proposed models. We do not normalize the probabilities while reporting the ECE of uncalibrated models. We use BIO tags for training. While decoding, we also allow spans that start with an "I-" tag.

A.3.3 QA experiments

We use three datasets for our QA experiments: SQuAD1.1, EMRQA and MADE 1.0. Our main aim in these experiments is to understand the behaviour of calibration, not the complexity of the tasks themselves. Therefore, we restrict the passage lengths of the EMRQA and MADE 1.0 datasets to be similar to SQuAD1.1. We pre-process the passages from EMRQA to remove unannotated answer span instances and reduce the passage length to 20 sentences. EMRQA provides multiple question templates for the same question type (referred to as a logical form in Pampari et al. (2018)). For each annotation, we randomly sample 3 question templates for our QA experiments. This is done to ensure that question types that have multiple question templates are not over-represented in the data.

For example, the question type for "Does he take anything for her —problem—" has 49 available answer templates, whereas "How often does the patient take —medication—" has only one. So for each annotation, we sample 3 question templates for a question type. If the question type does not have 3 available templates, we up-sample. For more details, please refer to Pampari et al. (2018).

EMRQA is a QA dataset constructed from named entity and relation annotations from clinical i2b2 datasets, consisting of adverse event, medication and risk related questions (Pampari et al., 2018). We also aim to test the performance of our calibration method on out-of-domain test data. To do so, we construct a QA dataset from the clinical named entity and relation dataset MADE 1.0, using the questions and the dataset construction procedure followed in EMRQA. This allows us to have two QA datasets with common question forms, but different text distributions. This experimental setup enables us to evaluate how a QA system would perform when deployed on a new text corpus. This corresponds to the application scenario where a fixed set of questions (such as an adverse event questionnaire (Naranjo et al., 1981)) are to be answered for clinical records from different sources. Both EMRQA and MADE 1.0 are constructed from clinical documents. However, the documents themselves have different structure and language due to their different clinical sources, thereby mimicking the real-world application scenarios of clinical QA systems.

MADE QA Construction: MADE 1.0 (Jagannatha et al., 2019) is an NER and relation dataset that has annotations similar to the "relations" and "medication" i2b2 datasets used in EMRQA. EMRQA uses an automated procedure to construct questions and answers from NER and relation annotations. We replicate the automated QA construction followed by Pampari et al. (2018) on the MADE 1.0 dataset to obtain a corresponding QA dataset. For this construction, we use question templates that rely on annotations common to both the MADE 1.0 and EMRQA datasets. Examples of common questions are in Table 7. A full list of questions in MADE 1.0 QA is in the "question templates.csv" file included in the supplementary materials. The dataset splits for EMRQA and MADE QA are provided in Table 8.

Forecaster features: Since we only consider single-span answer predictions, we require a constant number of predictions (the answer start and answer end token indexes) for this task. Therefore, we do not use the "ln" feature in this task. The uncalibrated probability of an event is normalized as follows and then used as the input to all calibration models:

$$\hat{p} = \frac{1}{M} \sum_{\text{MC-Dropout}} p_\theta(y_{\text{start}}, y_{\text{end}} \mid x)^{\frac{1}{2}}. \tag{7}$$

Unlike the previous tasks, extractive QA with single-span output does not have a varying number of output predictions for each data sample; it always predicts only the start and end spans. Therefore, using the length-normalized (where the length is always 2) uncalibrated output does not significantly affect the calibration of the baseline models. However, we use the length-normalized uncalibrated probability as our input feature to keep our base set of features consistent throughout the tasks. Additionally, in extractive QA tasks with non-contiguous spans, the number of output predictions can vary and be higher than 2. In such cases, based on our results on POS and NER, the length-normalized probability may prove to be more useful. The "Var" and "Rank" features are estimated as described for the previous tasks.

Input       Output      Example Question Form
Problem     Treatment   How does the patient manage her —problem—
Treatment   Problem     Why is the patient on —treatment—
Problem     Problem     Has the patient ever been diagnosed or treated for —problem—
Drug        Drug        Has patient ever been prescribed —medication—

Table 7: Examples of questions that are common to the EMRQA and MADE QA datasets.

Dataset Name   Train   Validation   Test
EMRQA          74414   8870         9198
MADE QA        99496   14066        21309

Table 8: Dataset sizes for the MADE QA pairs that were constructed using guidelines from EMRQA. EMRQA dataset splits are also provided for comparison.