CONFIDENCE INTERVALS FOR SEROPREVALENCE By Thomas J

CONFIDENCE INTERVALS FOR SEROPREVALENCE

By

Thomas J. DiCiccio David M. Ritzwoller Joseph P. Romano Azeem M. Shaikh

Technical Report No. 2021-02 March 2021

Department of Statistics STANFORD UNIVERSITY

Stanford, California 94305-4065

CONFIDENCE INTERVALS FOR SEROPREVALENCE

By

Thomas J. DiCiccio Cornell University

David M. Ritzwoller Joseph P. Romano Stanford University

Azeem M. Shaikh

University of Chicago

Technical Report No. 2021-02 March 2021

This research was supported in part by National Science Foundation grants

MMS 1949845 and SES 1530661.

Department of Statistics STANFORD UNIVERSITY

Stanford, California 94305-4065

http://statistics.stanford.edu

Confidence Intervals for Seroprevalence ∗

Thomas J. DiCiccio David M. Ritzwoller

Department of Social Statistics Graduate School of BusinessCornell University Stanford [email protected] [email protected]

Joseph P. Romano Azeem M. Shaikh

Departments of Statistics and Economics Department of EconomicsStanford University University of Chicago

[email protected] [email protected]

March 27, 2021

Abstract

This paper concerns the construction of confidence intervals in standard seroprevalencesurveys. In particular, we discuss methods for constructing confidence intervals for the pro-portion of individuals in a population infected with a disease using a sample of antibody testresults and measurements of the test’s false positive and false negative rates. We begin bydocumenting erratic behavior in the coverage probabilities of standard Wald and percentilebootstrap intervals when applied to this problem. We then consider two alternative sets of in-tervals constructed with test inversion. The first set of intervals are approximate, using eitherasymptotic or bootstrap approximation to the finite-sample distribution of a chosen test statis-tic. We consider several choices of test statistic, including maximum likelihood estimators andgeneralized likelihood ratio statistics. We show with simulation that, at empirically relevantparameter values and sample sizes, the coverage probabilities for these intervals are close totheir nominal level and are approximately equi-tailed. The second set of intervals are shownto contain the true parameter value with probability at least equal to the nominal level, but canbe conservative in finite samples. To conclude, we outline the application of the methods thatwe consider to several related problems, and we provide a set of practical recommendations.

Keywords: Confidence Intervals, Novel Coronavirus, Serology Testing, Seroprevalence, TestInversion

MSC 2020 Subject Classification: Primary 62F03, Secondary 62P10

∗We acknowledge funding from the National Science Foundation under the Graduate Research Fellowship Programand under the grants MMS-1949845 and SES-1530661.

DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 1

1. Introduction

Accurate measurement of the spread of infectious diseases is critical for informing and calibrating

effective public health strategy (Fauci et al., 2020; Peeling et al., 2020). Seroprevalence surveys, in

which antibody tests are administered to samples of individuals from a population, are a practical

and widely applied strategy for assessing the progression of a pandemic (Krammer and Simon,

2020; Alter and Seder, 2020). However, antibody tests, which detect the presence of viral anti-

bodies in blood samples, are imperfect.1 In particular, false positive and false negative antibody

test results occur with nontrivial probabilties.2 Accounting for the variation in the results of a sero-

prevalence survey induced by this imperfection is important for valid assessment of the uncertainty

in measurements of the spread of an infectious disease.

In this paper, we study the construction of confidence intervals in a standard seroprevalence

survey. Given the public interest in explicit representations of disease incidence, our objective is to

analyze the accuracy of various methods of constructing confidence intervals, so that results from

empirical analyses can be reported with statistical precision. We demonstrate that some methods

based on test inversion offer advantages relative to more standard confidence interval constructions

in terms of the stability and validity of their coverage probabilities.

In a standard seroprevalence survey, the proportion of a population that has been infected with

a disease is a smooth function of the parameters of three independent binomial trials whose real-

izations are observed. Although it may be expected that standard approaches to confidence interval

construction are well suited for such a simple parametric problem, in Section 3, we demonstrate

with a Monte Carlo experiment that standard Wald and percentile bootstrap confidence intervals

have erratic coverage probabilities at empirically relevant parameter values and sample sizes when

applied to this problem.

In fact, as documented in Brown et al. (2001), erratic coverage probabilities for confidence

intervals constructed by using standard methods surface even in the context of inference on a sin-

gle binomial parameter. Bootstrap (and other) methods that are typically second-order correct

in continuous problems may not achieve this accuracy in discrete problems. Usually, claims of

second-order correctness are made based on Edgeworth expansions (Hall, 2013). However, in

1Deeks et al. (2020) give an early systematic review of the accuracy of SARS-CoV-2 antibody tests, highlightingseveral methodological limitations.

2Ainsworth et al. (2020) measure the false positive and negative rates for five leading SARS-CoV-2 immunoassays.Estimates of false positive rates ranged from 0.1% to 1.1%. Estimates of false negative rates ranged from 0.9% to7.3%.

2 CONFIDENCE INTERVALS FOR SEROPREVALENCE

discrete settings, Cramer’s condition – a necessary condition for the application of an Edgeworth

expansion – fails, and second-order accuracy may not be achievable. For example, atoms in the

binomial distribution based on n trials have order n−1/2, so expansions to order n−1 must account

for this discreteness. Additionally, when a binomial random variable has parameter value near zero

or one, even first-order approximations to its limiting distribution are not normal, while many stan-

dard methods for constructing confidence intervals rely, either explicitly or implicitly, on a normal

approximation holding. As the parameter of interest in a seroprevalence survey is a function of

three binomial parameters, inference in this setting is more challenging than for a single binomial

proportion.

To address the erratic coverage probabilities in standard confidence interval constructions, we

consider several alternative approaches based on test inversion. Test inversion is an approach to

confidence interval construction that exploits the duality between tests of the value of a parameter

and confidence intervals for that parameter. A test inversion confidence interval for a parameter

θ consists of the set of points θ0 for which the null hypothesis H(θ0) : θ = θ0 is not rejected.

For parameters θ where the corresponding null hypothesis H(θ0) is simple, the application of test

inversion is typically straightforward. By contrast, for parameters where the corresponding null

hypothesis is composite, the application of test inversion is not immediate. This is the case for

seroprevalence.

Therefore, in Section 4, we discuss the general problem of test inversion for parameters whose

corresponding null hypotheses are composite, and we develop several approaches to this general

problem. We consider both methods based on asymptotic or bootstrap approximation and meth-

ods with finite-sample guarantees. In the later case, a maximization of p-values over a nuisance

parameter space is required. Similar constructions are considered in Berger and Boos (1994) and

Silvapulle (1996). In practice, this maximization is carried out over a discrete grid. We provide

a refinement to such an approximation that maintains the finite-sample coverage requirement. We

take particular care in requiring that the confidence intervals that we develop behave well at both

endpoints; that is, we require that they are equi-tailed.3

In Section 5, we apply these approaches to construct confidence intervals for seroprevalence.

We consider several choices of test statistic, including maximum likelihood estimators and gen-

eralized likelihood ratio statistics. We demonstrate with simulation that the intervals based on

3A 1 − α confidence interval is equi-tailed if the probabilities that the parameter exceeds the upper endpoint or isbelow the lower endpoint of the interval are both near or below α/2. That is, an equi-tailed 1− α confidence intervalshould be given by the set of points satisfying both an upper and a lower confidence bound, each at level 1− α/2.


asymptotic or bootstrap approximation have coverage probabilities that, at empirically relevant

parameter values and sample sizes, are close to, but potentially below, the nominal level and are

approximately equi-tailed. By contrast, the finite-sample valid intervals are conservative, but al-

ways have coverage probabilities that satisfy the coverage requirement.

We contextualize our analysis with data used to estimate seroprevalence at early stages of the

2019 SARS-Cov-2 pandemic. In particular, as a running example, we measure coverage prob-

abilities and average interval lengths for each of the methods we consider at sample sizes and

parameter values close to the estimates and sample sizes of Bendavid et al. (2020a), which was

posted on medRxiv on April 11th, 2020.4 This preprint estimates that the number of coronavirus

cases in Santa Clara County, California on April 3 - 4, 2020 was more than fifty times larger than

the number of officially diagnosed cases, and as a result, received widespread coverage in the pop-

ular and scientific press (Kolata, 2020; Mallapaty, 2020). The methods and design of this study

– including the reported confidence intervals – were questioned by many researchers (Eisen and

Tibshirani, 2020), prompting the release of a revised draft on April 27th, 2020, which we refer to

as Bendavid et al. (2020b), that integrated additional data.5 Our analysis highlights statistical chal-

lenges in seroprevalence surveys at early stages of the spread of infectious diseases, when disease

incidence is close to the error rates of new diagnostic technologies.

Nevertheless, seroprevalence surveys have advantages relative to alternative methods for mea-

suring the spread of infectious diseases. In particular, seroprevalence surveys directly measure the

presence of viral antibodies and therefore do not rely on measurements of reported cases or deaths

that may be subject to large and complex biases. Byambasuren et al. (2020) provide a systematic

review of SARS-CoV-2 seroprevalence surveys, identifying 17 studies that they consider to be of

particularly high quality. The authors find that estimated SARS-CoV-2 seroprevalence ranged from

0.56-717 times the number of reported coronavirus diagnoses in each study’s respective regions,

and that 10 of the studies imply at least 10 times more SARS-CoV-2 infections than reported

cases.6 Given the extent to which seroprevalence surveys imply substantively different popula-

tion exposure, the development of rigorous methods for inference in these studies is important for

informing public health decision-making.

This paper contributes to the literatures on inference in seroprevalence surveys (Rogan and

4This preprint has subsequently been published as Bendavid et al. (2021).5See Gelman (2020), Fithian (2020), and Bennett and Steyvers (2020) for further discussion and analysis of Ben-

david et al. (2020b). In particular, these articles highlight issues and propose alternative approaches for combiningdata measuring false positive rates from different samples.

6They note that estimates of seroprevalence remain well below thresholds for the development of herd immunity.


Gladen, 1978; Hui and Walter, 1980; Walter and Irwig, 1988); see Jewell (2004) for a general

introduction to epidemiological statistics. More broadly, we contribute to the large literature on

test inversion. The classical duality between tests and confidence intervals is discussed in Chapter

3 of Lehmann and Romano (2005). Bootstrap approaches to confidence construction based on

estimating nuisance parameters are developed in Efron (1981), DiCiccio and Romano (1990), and

Carpenter (1999). Conservative approaches to confidence interval construction that maximize p-

values over an appropriate nuisance parameter space are considered in Berger and Boos (1994) and

Silvapulle (1996). Gelman and Carpenter (2020) take a Bayesian approach to the problem studied

in this paper. Toulis (2020) uses test inversion based on a particular choice of test statistic, though

the resulting confidence interval is based on projection. Cai et al. (2020) is more closely related to

one of the approaches we consider, and we discuss some important differences in Section 5.7

We begin, in Section 2, by formulating the problem of interest. Section 3 reviews standard

approaches for confidence interval construction, including Wald intervals, the percentile bootstrap,

and projection methods, and documents their mediocre performance when applied to construct

confidence intervals for seroprevalence. Section 4 develops several approaches to confidence in-

terval construction for general parametric problems based on test inversion. These methods are

applied to seroprevalence in Section 5. In Section 6, we outline the application of the methods

considered in Section 4 to several related problems, including inference on vaccine efficacy in ran-

domized clinical trials. Section 7 provides practical recommendations and a summary of results.

2. Problem Formulation

Suppose that we have obtained three samples of individuals of sizes n1, n2, and n3, with n =

(n1, n2, n3)>. The first sample is selected at random from the population of all individuals. We

have prior knowledge that all individuals in the second sample have not had the disease of interest

and that all individuals in the third sample have had the disease of interest. An antibody test is

administered to all individuals in each sample. The data X1, X2, and X3 denote the number of

individuals in the first, second, and third sample that test positive for the disease, respectively. We

letX denote the vector (X1, X2, X3)>. Observed outcomes ofX are denoted by x = (x1, x2, x3)>.

We assume that the results of each test are independent of the results of all other tests and

7We became aware of Cai et al. (2020), which was posted on arXiv on November 29th, 2020, late in the preparationof this paper, on March 3rd, 2021.


that the results of all tests within a given sample are independent and identically distributed.8 In

particular, we assume that

Xid∼ Binomial (ni, pi) (1)

for each i = 1, 2, and 3. In this context, p1 is the probability of a positive test for a randomly

selected individual, p2 is the probability of a positive test conditional on not having had the disease,

and p3 is the probability of a positive test conditional on having had the disease. The quantities

1− p2 and p3 are referred to as the specificity and sensitivity of the test, respectively.

Additionally, we assume that the test has diagnostic value in the sense that p2 < p3.9 That is,

the probability of testing positive is larger for an individual who has had the disease than for an

individual who has not had the disease.10 In this case, p1 necessarily satisfies p2 ≤ p1 ≤ p3. In

sum, the parameter p = (p1, p2, p2)> exists in the parameter space

Ω =p ∈ [0, 1]3 : p2 ≤ p1 ≤ p3, p2 < p3

.

We are interested in estimating a confidence interval for the probability π that an individual

randomly selected from the population of all individuals has had the disease. By the law of total

of probability, we have that

p1 = p2 (1− π) + p3π , (2)

and, therefore, that

π =p1 − p2

p3 − p2

. (3)

We refer to π as seroprevalence.

A natural estimate of seroprevalence is given by

πn =pn,1 − pn,2pn,3 − pn,2

,

where pn = (pn,1, pn,2, pn,3)> and pn,i = Xi/ni is the usual empirical frequency for group i. We let

the maximum likelihood estimator (MLE) of p for the model p ∈ Ω be denoted by pn. Accordingly,

8We assume the sample sizes are small relative to population size so that the difference between sampling with andwithout replacement is negligible.

9Note that some of the methods developed in Section 5 will not require p2 < p3; see footnote 22.10Additionally, it may be desirable to impose the restriction that p2 < p1, i.e., that the probability that a randomly

sampled individual possesses the disease is greater than zero. This restriction reduces the parameter space for π from[0, 1] to (0, 1].


the MLE of π for the model p ∈ Ω is given by

πn =pn,1 − pn,2pn,3 − pn,2

,

where pn = (pn,1, pn,2, pn,3)>. Typically, pn and pn agree, with the exception occurring if pn /∈ Ω.

An explicit formula for pn is given in Appendix A.

Finally, to facilitate further discussion, let

JTn,p(t) = Pp Tn ≤ t .

denote the distribution of the general statistic Tn under p, and also introduce the related quantity

JTn,p(t−) = Pp Tn < t ,which will be of use in computing p-values for tests of the null hypothesis

π = π0 against alternatives of the form π > π0.

3. Standard Interval Constructions

In this section, we outline and measure coverage probabilities for several standard methods for

constructing confidence intervals for seroprevalence.

3.1 Normal and Bootstrap Approximation

The most obvious approach to constructing confidence intervals for π is to approximate the finite-

sample distribution of πn with its limiting normal distribution. Specifically, by the Delta Method

and Slutsky’s Theorem, under model (1) and i.i.d. sampling, we have that

(πn − π0)/√

Vπn (pn)d→ N (0, 1) ,

as min (ni)→∞, where

Vπn(p) =1

(p3 − p2)2σ21 (p1) +

(p1 − p3)2

(p3 − p2)4σ22 (p2) +

(p2 − p1)2

(p3 − p2)4σ23 (p3) , (4)

and σ2i (pi) = pi (1− pi) /ni. This asymptotic approximation suggests the standard Wald interval

CIπn,∆ =(πn ± z1−α/2

√Vπn (pn)

), (5)


where z1−α/2 is the 1− α/2 quantile of the standard normal cumulative distribution function Φ(·).

We refer to CIπn,∆ as the delta-method confidence interval. Bendavid et al. (2020a) makes use of

this construction.

This interval may perform poorly in finite samples. In particular, the finite-sample distribution

of πn is not necessarily symmetric. For small or large values of the binomial parameter, a Poisson

distribution gives a better approximation than a normal distribution to a binomial distribution.

Moreover, the discreteness in the binomial distribution – with atoms that are of order one over the

square root of the sample size – prevents usual Edgeworth expansions and ensuing second-order

correctness.

To address some of these issues, Bendavid et al. (2020b) apply the percentile bootstrap confi-

dence interval developed in Efron (1981). This interval is given by

CIπn,pb =(cπn (α/2, pn) , cπn (1− α/2, pn)

), (6)

where, for a general test statistic Tn, cTn (1− α, p) denotes the 1− α quantile of JTn,p(t). Typically,

this interval can be shown to be first-order accurate (Efron, 1985), in the sense that the coverage

of the interval is equal to the nominal coverage probability plus a constant of the same order as

the inverse of the square root of the sample size.11 Nevertheless, second-order differences can

have a large effect in small and moderate samples, and the higher-order properties of the percentile

bootstrap have not been established for binomial parameters near the boundary of their parameter

spaces.

Efron (1987) gives a refinement to the percentile bootstrap confidence interval, referred to

as the accelerated bias-corrected, or BCα, confidence interval. The infeasible BCα confidence

interval is given by

CI∗πn,bca =(cπn (Φ (Z (α/2, z0, a)) , pn) , cπn (Φ (Z (1− α/2, z0, a)) , pn)

),

with

Z (α, z0, a) = z0 +(z0 + zα)

1− a (z0 + zα),

where the bias constant z0 and acceleration constant a are unknown. Efron (1987) and DiCiccio

and Romano (1995) discuss approximations to z0 and a, which we define in Appendix B and denote

11In our problem, the relevant quantity scaling this approximation is the minimum sample size min(ni).


by z0 and a. If these approximations are plugged into the infeasible BCα interval, then we obtain

the feasible BCα confidence interval

CIπn,bca =(cπn (Φ (Z (α/2, z0, a)) , pn) , cπn (Φ (Z (1− α/2, z0, a)) , pn)

). (7)

Typically, BCa intervals are second-order accurate, in the sense that the one-sided coverage error

of each endpoint is equal to α/2 plus a constant of the same order as the inverse of the sample

size. However, in our context, such theoretical results are not available due to the discreteness of

the distribution of the data X and potentially small values of the parameter p.

3.2 Projection Intervals

The Wald and bootstrap intervals presented in Section 3.1 are approximate. In contrast, it may be

desirable to construct intervals that ensure coverage of at least 1 − α in finite samples. A well-

known example of such finite-sample valid – or exact – intervals are the Clopper and Pearson

(1934) intervals for the binomial proportion.

A very simple, but crude, approach to constructing finite-sample valid confidence intervals for

seroprevalence is to use projection. In particular, suppose that R1−α is a joint confidence region

for p of nominal level 1 − α. For example, one possible choice of joint confidence region is the

rectangle

R1−α = I1,1−γ × I2,1−γ × I3,1−γ , (8)

where Ij,1−γ is a nominal 1− γ confidence interval for pj , and 1− γ is chosen so that the cartesian

product of marginal confidence intervals jointly contains p with nominal coverage 1 − α. That is,

γ is taken to satisfy (1− γ)3 = 1−α, or γ = 1− (1−α)1/3. In our implementation, we apply the

standard Clopper and Pearson (1934) confidence intervals for pj , as they have guaranteed coverage

in finite-samples.12 Alternatively, the region R1−α can be constructed by inverting likelihood ratio

tests, but would incur a significantly larger computational cost. A related approach is developed in

Toulis (2020).

Given any region R1−α, the projection method simply constructs the confidence interval

I1−α = [ infp∈R1−α

π(p), supp∈R1−α

π(p)] .

12Other choices exist, however, and in particular the intervals recommended in Brown et al. (2001) may performwell.


where π(p) = (p1 − p2)/(p3 − p2) is the value of seroprevalence associated with the parameter p.

Trivially, the chance that π is contained in I1−α is bounded below by the chance that p is contained

in R1−α. Thus, if R1−α has a guaranteed coverage of 1− α, then so does I1−α.

When R1−α is a product of intervals, as in (8), the computation of the projection interval I1−α

is trivial. The reason is that π(p) is monotone increasing in p1 and monotone decreasing in each of

p2 and p3, as p varies on the parameter space Ω. Thus, if the lower and upper confidence limits for

pj are represented as

Ij,1−γ = [ILj,1−γ, IUj,1−γ] ,

then the resulting interval I for π can be easily computed as

I1−α = [π(IL1,1−γ, IU2,1−γ, I

U3,1−γ), π(IU1,1−γ, I

L2,1−γ, I

L3,1−γ)] .

Therefore, the projection interval is easy to implement, but is generally very conservative, in that

the actual coverage is much larger than the nominal level. This results in intervals that are generally

unnecessarily wide.

3.3 Measuring Coverage

To assess the finite-sample performance of the Delta Method, percentile bootstrap, and projection

confidence intervals discussed in Sections 3.1 and 3.2, we estimate by a Monte Carlo experiment

their coverage probabilities and average lengths at parameterizations close to the sample size and

estimates of Bendavid et al. (2020a). We return to this experiment in Section 5, where we measure

the coverage probabilities for the alternative methods considered in Section 4.

Bendavid et al. (2020a) recruited n1 = 3300 participants for serologic testing for SARS-CoV-2

antibodies. The total number of positive tests was x1 = 50. The authors use n2 = 401 pre-

COVID era blood samples to measure the specificity of their test, of which only x2 = 2 samples

tested positive. Similarly, the authors use n3 = 122 blood samples from confirmed COVID-19

patients, of which x3 = 103 samples tested positive.13 These data correspond to an estimated

seroprevalence of πn = 0.012.14 Table 1 reports the nominal 95% confidence intervals computed

13The specificity and sensitivity samples combine data provided by the test manufacturer and additional tests run atStanford. We refer the reader to the statistical appendix of Bendavid et al. (2020a) for further details.

14 Bendavid et al. (2020a) report an alternative estimate of seroprevalence in which the demographics of theirsample are weighted to match the overall demographics of Santa Clara County. We briefly discuss the applicationof the general methods developed in Section 4 to this setting in Section 6, and view further consideration as a usefulextension. Gelman and Carpenter (2020) give a Bayesian approach that accommodates sample weights. Cai et al.(2020) also address the case where are samples are reweighted according to population characteristics.


Interval Ave. Length Ave. Length vs. Delta Method Coverage

Delta Method [0.003,0.022] 0.0185 1.000 0.904

Percentile Bootstrap [0.001,0.021] 0.0186 1.005 0.895

BCα Bootstrap [0.001,0.020] 0.0191 1.028 0.895

Projection [0.001,0.028] 0.0270 1.4578 1.000

Table 1: Average Interval Length and Coverage of Nominal 95% Confidence Intervals for π

Notes: Table 1 reports the Delta Method, percentile bootstrap, BCα bootstrap, and projection confidence intervals,at nominal level 95% computed on data from Bendavid et al. (2020a). Estimates of the average length and coveragefor these intervals at sample sizes n and estimated values pn from this study are also displayed. Estimates of averagelength and coverage are taken over 100,000 bootstrap replicates ofX at the sample size n and the estimated parameterspn from this study.

by the Delta Method, percentile bootstrap, BCα bootstrap, and projection method for the observed

values from Bendavid et al. (2020a).

For each parameter (e.g., p1), we take a large number of replicates of X at each value of an

evenly spaced grid around the estimated value of the parameter (e.g., 0.001, . . . , pn,1, . . . , 2 · pn,1),

holding the other five parameters (e.g., p2, p3, n1, n2, n3) fixed at their observed values (e.g., pn,2,

pn,3, n1, n2, n3).15 For each method at each combination of parameter values and sample sizes, we

compute the proportion of replicates for which the true value of π (i.e., the value of π associated

with the parameterization) is below, contained in, or above the corresponding confidence interval

with nominal coverage probability α = 0.05.

Figure 1 displays the results of this Monte Carlo experiment for the Delta Method, percentile

bootstrap, BCα bootstrap, and projection intervals at parameter values around pn,1, pn,2, n1, and

n2.16 We note that the black dots display one minus the proportion of replicates for which the

realized confidence interval contains the true value of π, i.e., one minus the estimated coverage of

the confidence interval. Additionally, Table 1 reports estimates of the coverage and average length

of each interval taken over 100,000 bootstrap replicates at the sample sizes n and estimated values

pn from Bendavid et al. (2020a).

We find that the Delta Method, percentile bootstrap, and BCα bootstrap intervals are quite

15Due to the computational simplicity of the intervals discussed in this section, we measure coverage with 100,000replications. We use 10,000 replications when we return to this experiment in Section 5.

16There is little variation in the coverage probabilities for parameter values around p3 and n3, so we omit the resultsof this experiment for the sake of clarity.


Figure 1: Estimated Coverage Levels of Nominal 95% Confidence Intervals

p1 p2 n1 n2

πn,∆

πn,pb

πn,bca

Projection

0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800

0.0

0.1

0.2

0.0

0.1

0.2

0.0

0.1

0.2

0.0

0.1

0.2

1 − Coverage Lower Upper

Notes: Figure 1 displays estimates of the coverage probabilities of the Delta Method (πn,∆), percentile bootstrap(πn, pb), BCα bootstrap (π, bca), and projection intervals at parameter values close to the estimate pn and samplesize n of Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minus the proportionof replicates for which the true value of π falls in the realized confidence intervals, i.e., one minus the estimatedcoverage probability. The purple squares and blue triangles denote the proportion of replicates that fall below andabove realized confidence intervals, respectively. The vertical dotted line denotes the estimated value of pn,1, pn,1 orsample size n1, n2 for Bendavid et al. (2020a). The horizontal dotted line denotes one minus the nominal coverage of0.95.

liberal for many parameterizations. In fact, in most cases, the estimated coverage of a nominal 95%

interval is below 90%. Moreover, the estimated coverage decreases sharply as n2 and p1 become

small and is, in most cases, not equi-tailed, in the sense that the proportions of replicates that fall

below and above the confidence intervals are not approximately equal. By contrast, the projection

method is quite conservative. They are approximately 45% longer than the Delta Method intervals

at sample sizes n and estimated values pn from Bendavid et al. (2020a). These findings motivate

the development of alternative methods for constructing confidence intervals that have less erratic

coverage probabilities or ensure coverage at least at the nominal value.


4. Test Inversion

In this section, we consider both approximate and finite-sample valid approaches to the general

problem of constructing test-inversion confidence intervals for parameters θ where the correspond-

ing null hypothesis H (θ0) : θ = θ0 is composite.

To this end, we introduce a more general notation. Suppose that the data X follow a general

parametric model indexed by a parameter (θ, ϑ). The parameter of interest θ is real-valued, tak-

ing values in possibly some strict subset of the real line, and the nuisance parameter ϑ is finite

dimensional. The parameter (θ, ϑ) lies in a parameter space Ω, which need not be a product space,

i.e., the range of ϑ may depend on θ. For a fixed value θ0, the parameter space for the nuisance

parameter ϑ is denoted by

Ω(θ0) = (θ, ϑ) ∈ Ω : θ = θ0 .

Observe that for the case of seroprevalence π, we have that θ = π and can assign ϑ = (p1, p3).

Consider a test of the null hypothesis H (θ0) : θ = θ0 against the alternative θ > θ0 that rejects

for large values of the test statistic Tn. A 1 − α/2 lower confidence bound for θ constructed with

test inversion is given by

L1−α/2 = inf θ0 : H (θ0) is not rejected at level α/2 against the alternative θ > θ0 .

Similarly, a 1− α/2 upper bound for θ constructed with test inversion is given by

U1−α/2 = sup θ0 : H (θ0) is not rejected at level α/2 against the alternative θ < θ0 .

Thus, a two-sided 1 − α confidence interval is given by[L1−α/2, U1−α/2

]. Test inversion reduces

the problem of confidence interval construction for θ to the problem of testing H(θ0) : θ = θ0

against θ > θ0 and θ < θ0.

4.1 Simple Null Hypotheses

It is illustrative to assume that the nuisance parameter ϑ = ϑ0 is known. In this case, the null hy-

pothesis H(θ0) is simple, and so one-sided tests that control the level at α/2 are easily constructed.

Indeed, consider the test that rejects H(θ0) against θ > θ0 for large values of the test statistic

Tn = Tn(X), where the subscript n denotes the sample size and may be a vector. The cumulative


distribution function of Tn is given by

Fn,θ,ϑ(t) = F Tn,θ,ϑ(t) = Pθ,ϑTn ≤ t , (9)

where we use a superscript T to indicate dependence on the choice of test statistic sequence T =

Tn. Additionally, define the related quantity

Fn,θ,ϑ(t−) = F Tn,θ,ϑ(t−) = Pθ,ϑTn < t , (10)

and let t0 denote the observed value of Tn.

The null probability that Tn ≥ t0 is given by

qL,θ0,ϑ0 = 1− Fn,θ0,ϑ0(t−0 ) , (11)

and is a valid p-value in the sense that the test that rejects when this quantity is ≤ α/2 has size

≤ α/2; see, e.g., Lemma 3.3.1 in Lehmann and Romano (2005).17 Thus, a 1 − α/2 confidence

set for θ includes all θ0 such that Fn,θ0,ϑ0(t−0 ) < 1− α/2. If Fn,θ0,ϑ0(t

−0 ) is continuous and strictly

monotone decreasing in θ0, then a 1−α/2 lower confidence bound θL may be obtained by solving

Fn,θL,ϑ0(t−0 ) = 1− α/2 . (12)

Similarly, the quantity

qU,θ0,ϑ0 = Fn,θ0,ϑ0(t) (13)

is a valid p-value for testing H(θ0) against θ < θ0 and a 1− α/2 upper confidence bound θU may

be obtained by solving

Fn,θU ,ϑ0(t0) = α/2 . (14)

Thus, [θL, θU ] is a 1−α confidence interval for θ.18 For a single binomial parameter, this construc-

tion leads to the Clopper and Pearson (1934) interval.19

17Throughout, we will denote various p-values by q rather than p because p is reserved for various estimates ofbinomial parameters.

18Note one may wish to test H(θ0) at each endpoint of the confidence interval to determine whether it should be aclosed or open interval, but we simply take the conservative approach and make it closed.

19Even in the case that the distribution of Tn is discrete, the function Fθ0,ϑ0(t−0 ) is typically continuous in θ0 (as inthe binomial case). If not, one could use the infimum over θ0 such that Fθ0,ϑ0(t−0 ) < 1− α/2 as a lower bound.


4.2 An Approximate Approach

The confidence interval construction developed in the previous section assumes that the nuisance

parameter ϑ is known, and is therefore infeasible. In this section, we discuss a feasible approach

that uses an approximation to ϑ. In particular, if ϑ(θ0) is the MLE for ϑ subject to the constraint

θ = θ0, then the infeasible p-values (11) and (13) can be replaced with

qL,θ0,ϑ(θ0) = 1− Fn,θ0,ϑ(θ0)(t−0 ) and qU,θ0,ϑ(θ0) = Fn,θ0,ϑ(θ0)(t0), (15)

respectively, where Fn,θ0,ϑ(θ0) is approximated either analytically or with the parametric bootstrap.

Accordingly, the infeasible confidence interval [θL, θU ] is replaced with the feasible confidence

interval [θL, θU ], where, analogously to (12) and (14), the endpoints θL and θU are the values of θ0

that satisfy

Fn,θ0,ϑ(θ0)(t−0 ) = 1− α/2 and Fn,θ0,ϑ(θ0)(t0) = α/2 , (16)

respectively. In other words, either Wald or parametric bootstrap tests are constructed for each θ0,

where the distribution of the test statistic Tn is determined under the parameter (θ0, ϑ(θ0)). This

approach was used in DiCiccio and Romano (1990) and DiCiccio and Romano (1995).

In this approximate approach, the family of distributions indexed by (θ, ϑ) has been reduced

to an approximate least favorable one-dimensional family of distributions governed by the pa-

rameter (θ0, ϑ(θ0)) as θ0 varies. The reduction, in principle, allows one to apply any method for

construction of confidence intervals for a one-dimensional parameter. This approach implicitly

orthogonalizes the parameter of interest with respect to the nuisance parameter, so that the effect

of estimating the nuisance parameter is negligible to second-order, which then typically results in

second-order accurate confidence intervals. See Cox and Reid (1987) for a discussion of the role

of orthogonal parameterizations for inference about a scalar parameter in the presence of nuisance

parameters.

4.3 An Infeasible Finite-Sample Approach

The interval endpoints obtained in the previous section, defined in (16), are approximate as the

MLE ϑ(θ0) is an approximation to the true value of the nuisance parameter ϑ0. As a result, the

coverage probability of an interval of this form will depend on the quality of this approximation,

which may be uncertain in small samples.

By contrast, intervals that ensure coverage of at least 1− α in finite samples may be desirable.


An infeasible approach to constructing intervals with test inversion that achieves this guarantee

proceeds by taking the supremum of the p-values over all possible values of the nuisance compo-

nent ϑ, giving

qL,θ0,sup = sup(θ0,ϑ)∈Ω(θ0)

1− Fn,θ0,ϑ(t−0 ) and qU,θ0,sup = sup(θ0,ϑ)∈Ω(θ0)

Fn,θ0,ϑ(t0). (17)

These p-values are clearly valid in finite samples. For example, if ϑ0 is the true value of ϑ, then we

have that

Pθ0,ϑ0qL,θ0,sup ≤ u = Pθ0,ϑ0 sup(θ0,ϑ)∈Ω(θ0)

1− Fn,θ0,ϑ(T−n ) ≤ u (18)

≤ Pθ0,ϑ01− Fn,θ0,ϑ0(T−n ) ≤ u ≤ u .

Note, however, that these p-values may be conservative, as the supremum over ϑ may be obtained

at a value far from ϑ0. On the other hand, if the distribution of the test statistic does not vary much

with ϑ, then these p-values will not be overly conservative. Therefore, it pays to choose a test

statistic that is nearly pivotal, in the sense that its distribution does not depend heavily on ϑ.

The p-values defined in (17) may be impractically conservative if the distribution of Tn is highly

variable with ϑ. To address this issue, one can restrict the space of values for ϑ that are considered

by first constructing a 1−γ confidence region for ϑ. Such an approach is considered in Berger and

Boos (1994) and Silvapulle (1996). This idea has also been applied to nonparametric contexts; see

Romano et al. (2014) and further references therein.

This refined approach proceeds as follows. Fix a small number γ and let I1−γ be a 1 − γ

confidence region for ϑ. Consider the following modified p-values defined by

qL,θ0,I1−γ =

supϑ∈I1−γ 1− Fn,θ0,ϑ(t−0 ) + γ if ϑ ∈ I1−γ : (θ0, ϑ) ∈ Ω(θ0) 6= ∅

γ otherwise(19)

and

qU,θ0,I1−γ =

supϑ∈I1−γ Fn,θ0,ϑ(t0) + γ if ϑ ∈ I1−γ : (θ0, ϑ) ∈ Ω(θ0) 6= ∅

γ otherwise.(20)

The one-sided test that rejects when qL,θ0,I1−γ ≤ α/2 leads to a 1 − α/2 lower confidence bound

for θ. Note that it is possible that the region I1−γ may not include any ϑ for which (θ0, ϑ) ∈ Ω(θ0),


in which case the modified p-value is defined to be γ, i.e., the null hypothesis H(θ0) is rejected at

any level α ≥ γ if the confidence region I1−γ for ϑ is completely incompatible with θ = θ0.20

The p-values obtained by (19) and (20) are valid in finite samples. The following result follows

from the lemma presented in Section 2 of Berger and Boos (1994) or the theorem presented in

Section 2 of Silvapulle (1996).

Lemma 4.1 The p-values qL,θ0,I1−γ and qU,θ0,I1−γ are valid in the sense that, for all ϑ and 0 ≤ u ≤1, we have that

Pθ0,ϑqL,θ0,I1−γ ≤ u ≤ u and Pθ0,ϑqU,θ0,I1−γ ≤ u ≤ u . (21)

PROOF OF LEMMA 4.1: The event qL,θ0,I1−γ ≤ u implies that either

1− Fn,θ0,ϑ(T−n ) ≤ u− γ

or ϑ /∈ I1−γ . The latter event has probability ≤ γ and the former event has probability ≤ u − γ.

An identical argument holds for the event qU,θ0,I1−γ ≤ u.

4.4 A Feasible Finite-Sample Approach

The finite-sample approach defined in Section 4.3 is infeasible, as it involves computing a supre-

mum over an infinite set Ω(θ0). A natural approximation to this approach is to discretize Ω(θ0), so

that the supremum is replaced by maximum over a finite set of values on a grid. In particular, if

G = G(θ0) denotes a finite grid over the space Ω(θ0), then (17) can be approximated by

qL,θ0,max = max(θ0,ϑ)∈G(θ0)

1− Fn,θ0,ϑ(t−0 ) and qU,θ0,max = max(θ0,ϑ)∈G(θ0)

Fn,θ0,ϑ(t0) , (22)

respectively. Similarly, the refinement given in (19) and (20) can be approximated by replacing

I1−γ with I1−γ , where I1−γ denotes a finite grid (or ε-net) approximating I1−γ . In contemporaneous

work, Cai et al. (2020) use this construction to develop confidence intervals for seroprevalence that

ensure finite-sample Type 1 error control up to the error induced by the finiteness of G = G(θ0).

In this section, we develop a modification to this construction that provably maintains finite-

20Note that one may let I1−γ = I1−γ(θ0) depend on θ0, so that I(θ0) denotes a confidence region for ϑ with θfixed at θ0. The choice of I may also depend on whether or not one is constructing a lower confidence bound or upperconfidence bound.


sample Type 1 error control for testing H(θ0), and thereby maintains control over the level of

confidence sets obtained through test inversion by directly accounting for the approximation error

induced by a finite discretization of Ω(θ0). Toward this end, we introduce further structure. Sup-

pose that the components of the data X = (X1, . . . , Xk)> are independent, that the distribution of

Xi depends on a parameter βi, and that the family of distributions forXi has a monotone likelihood

ratio in Xi. Assume further that interest focuses on a real-valued parameter θ = f(β1, . . . , βk) and

that the nuisance parameters are given by ϑ = (β2, . . . , βk). The model can equivalently be thought

of as being parametrized by (θ, ϑ) ∈ Ω as before, or through β = (β1, . . . , βk) ∈ Ω. Additionally,

define

Ω(θ0) = β ∈ Ω : f(β) = θ0 .

In the context of seroprevalence, θ = π.

As before, let Tn = Tn(X1, . . . , Xk) be a test statistic for testing H(θ0), with t0 denoting its

realized value. Note that Tn could itself depend on θ0, which is seen below to occur for seropreva-

lence. Assume that Tn is montone with respect to each each component Xi. It is without loss of

generality, then, to assume that Tn is monotone increasing with respect to each component Xi.21

Let Jn,β(·) denote the cumulative distribution function of Tn for the β- parametrization, so that

Jn,β(t) = Fn,θ,ϑ(t) = PβTn ≤ t ,

and let β(θ0) denote the MLE for β subject to the constraint that θ = θ0. In this case, for example,

we can also represent the bootstrap-based inverse testing p-value defined in Section 4.2 with

qU,θ0,ϑ(θ0) = Jn,β(θ0)(t0) ,

and similarly for the other p-values previously introduced.

The objective is to replace the supremum over I1−γ in (19) and (20) with a finite maximum

while maintaining Type 1 error control in finite samples. Toward this end, consider a partition of

the values of ϑ in I1−γ into r regions E1, . . . , Er. In our implementation, each region is given by a

hyperrectangle of the form

[β′2, β′′2 ]× · · · × [β′k, β

′′k ] , (23)

21If Tn is monotone decreasing with respect to a particular component, say Xj , then Xj can be replaced by −Xj ,whose family of distributions is then monotone increasing with respect to −βj .


though this is not necessary. Note that each region may not lie entirely in the parameter space for

ϑ. For region Ej , let (β2(j), . . . , βk(j)) be the vector giving the smallest value that all but the first

component of β takes on in Ej; that is

βi(j) = infβi : (β2, . . . , βk) ∈ Ej .

Similarly, let (β2(j), . . . , βk(j)) be the vector giving the largest value that all but the first compo-

nent of β takes on in Ej; that is,

βi(j) = supβi : (β2, . . . , βk) ∈ Ej .

For a hyperrectangle of the form (23), clearly βi(j) = β′i and βi(j) = β′′i . Congruently, let

β1(j) = infβ1 : β ∈ Ω(θ0), (β2, . . . , βk) ∈ Ej (24)

and

β1(j) = supβ1 : β ∈ Ω(θ0), (β2, . . . , βk) ∈ Ej . (25)

denote the smallest and largest values that the first component of β takes on Ω(π0) subject to the

constraint that (β2, . . . , βk) in Ej . If the infimum or supremum in (24) or (25) is over a non-empty

set, then define sL(j) = Jn,β(j)(t−) and sU(j) = Jn,β(j)(t), where

β(j) = (β1(j), . . . , βk(j)) and β(j) = (β1(j), . . . , βk(j)) .

Otherwise, if there is no β in Ω(π0) with (β2, . . . , βk) in Ej , then set sL(j) = 1 and sU(j) = 0,

respectively.

We construct the p-values

qL,θ0,I1−γ = max1≤j≤r

(1− sL(j)) + γ (26)

and

qU,θ0,I1−γ = max1≤j≤r

sU(j) + γ (27)

by taking the maximum over the adjusted p-values 1−sL(j) and sU(j) . This refinement is feasible

and valid in finite samples.


Theorem 4.1 Assume that the components of the data X = (X1, . . . , Xk)> are independent, that

each component Xi has distribution in a family having a monotone likelihood ratio, and that the

statistic Tn = Tn(X1, . . . , Xk) is monotone increasing with respect to each component Xi. Let

I1−γ be a 1 − γ confidence region for (β2, . . . , βk). Then, the p-values qL,θ0,I1−γ and qU,θ0,I1−γ are

valid for testing H(θ0) in the sense that, for any 0 ≤ u ≤ 1 and any ϑ,

Pθ0,ϑqL,θ0,I1−γ ≤ u ≤ u and Pθ0,ϑqU,θ0,I1−γ ≤ u ≤ u .

PROOF OF THEOREM 4.1. First, note that I1−γ could be the whole space Ω(θ0) by taking γ = 0.

It follows from Lemma A.1 in Romano et al. (2011) (which is a simple generalization of Lemma

3.4.2 in Lehmann and Romano (2005)) that the family of distributions of Tn satisfy, for any t,

β = (β1, . . . , βk), β′ = (β′1, . . . , β′k) with β′i ≥ βi for all i ≥ 1:

Jn,β′(t) ≤ Jn,β(t) . (28)

The same is true if t is replaced by t−. Thus, we have that Jn,β(j)(t−) ≤ Jn,β(t−) and Jn,β(j)(t) ≥

Jn,β(t) for any β with θ = f(β) = θ0 and (β2, . . . , βk) ∈ Ej . Therefore, the p-values defined by

(19) and (20) satisfy

qL,θ0,I1−γ ≤ max1≤j≤r

(1− sL(j)) + γ = qL,θ0,I1−γ and

qU,θ0,I1−γ ≤ max1≤j≤r

sU(j) + γ = qU,θ0,I1−γ .

Since qL,θ0,I1−γ and qU,θ0,I1−γ are valid p-values (by Lemma 4.1), then so are any random variables

that are stochastically larger.

Thus, tests based on the p-values qL,θ0,I1−γ and qU,θ0,I1−γ may be used to testH(θ0) and, through

test inversion, yield finite-sample valid confidence bounds for θ.

5. Test-Inversion Inference for Seroprevalence

In this section, we apply the methods considered in Section 4 to the problem of constructing ap-

proximate and finite-sample valid confidence intervals for seroprevalence.


5.1 Test Statistics

We begin by exhibiting a set of test statistics Tn applicable to our problem. To this end, we must

first consider estimation of p unconditionally and under the restriction that π = π0. In Appendix

A, we outline the maximum likelihood estimation of p under the restrictions that p ∈ Ω and that

p ∈ Ω(π0) = p ∈ Ω : π = π0 .

We let pn and pn(π0) denote the resultant estimators, respectively.

A natural choice for the test statistic Tn is the difference between πn and π0. Moreover, this

statistic can be Studentized with an estimate of its standard deviation, giving the test statistic

πn (π0) = (πn − π0)/√

Vπn (pn) ,

where Vπn(p) is given in (4). Similarly, πn−π0 can be Studentized with an estimate of its standard

deviation under the constraint π = π0, giving the test statistic

πn,C (π0) = (πn − π0)/√

Vπn (pn(π0)).

Observe that as p1 = p2 (1− π) + p3π, we can rewrite the condition π0 = π as the linear

restriction b (π0)> p = 0 where b (π0) = (1,− (1− π0) ,−π0)> . This observation suggests con-

sideration of the linear test statistic φn (π0) = b(π0)>pn. Moreover, as the variance of φn (π0) is

exactly equal to

Vφn(π0) (p) = σ21 (p1) + (1− π0)2 σ2

2 (p2) + π20σ

23 (p3) , (29)

and can be estimated with the plug-in estimator Vφn(π0) (pn), the test statistic φn (π0) can be Stu-

dentized, giving the alternative test statistic

φn (π0) = φn (π0)/√

Vφ(π0) (pn).

In turn, we can studentize φn (π0) with an estimate of its variance under the restriction π0 = π,

giving the statistic

φn,C (π0) = φn (π0)/√

Vφ(π0) (pn(π0)).22

22Observe that, additionally, φn (π0) is well-defined if p2 = p3, and so φn (π0), φn (π0), or φn.C (π0) may bedesirable choices in situations where p2 is close to p3.


Alternatively, we can use statistics based on the likelihood function. In particular, let L (p | x)

be the likelihood function given by

L (p | x) =∏

1≤i≤3

(nixi

)pxii (1− pj)ni−xi .

The generalized likelihood ratio test statistic for testing H(π0) : π = π0 is given by

Wn = Wn (π0) = 2 ·supp∈Ω

logL (p | X)

supp∈Ω(π0)

logL (p | X)

= 2 ·∑

1≤j≤3

(Xj log

(pn,j

pn,j(π0)

)+ (nj −Xj) log

(1− pn,j

1− pn,j(π0)

)).

Large values of Wn give evidence for both π < π0 and π > π0. To address this issue, we also

consider the signed root likelihood ratio statistic for the restriction π = π0, given by

Rn = Rn (π0) = sign (πn − π0) ·√Wn (π0),

Corrections to improve the accuracy of Wn based on its signed square root Rn have a long his-

tory; see Lawley (1956), Barndorff-Nielsen (1986), Fraser and Reid (1987), Jensen (1986), Jensen

(1992), DiCiccio et al. (2001) and Lee and Young (2005). Frydenberg and Jensen (1989) consider

the effect of discreteness on the efficacy of corrections to improve asymptotic approximations to

the distribution of the likelihood ratio statistic. The statistic Rn can be re-centered and Studentized

as

Rn =(Rn −mR

n (pn(π0))) /√

V Rn (pn(π0)) ,

wheremRn (p) and V R

n (p) denote the mean and variance ofRn under p and in practice are computed

with the bootstrap under pn(π0).

5.2 Approximate Intervals

We now outline the application of the approximate intervals considered in Section 4.2 and measure

the performance of these intervals in the Monte Carlo experiment developed in Section 3.3. Sup-

pose that we are using the test statistic Tn with observed value t0. The approximate test-inversion


intervals are constructed by first computing, for each π0, the p-values

qL,π0,pn(π0) = 1− JTn,pn(π0)(t−0 ) and qU,π0,pn(π0) = JTn,pn(π0)(t0).

The resultant interval with nominal coverage 1− α then takes the form

π0 : qL,π0,pn(π0) ≥ α/2 and qU,π0,pn(π0) ≥ α/2

.

We begin by considering asymptotic approximations to the distribution JTn,pn(π0) for different test

statistics.

Observe that, under the null hypothesis H (π0), the test statistics πn,C(π0), φn,C (π0), and Rn

are asymptotically N (0, 1), and Wn is asymptotically χ21. Thus, if we set Tn equal to any of the

asymptotically normal statistics, we can approximate Jn,pn(π0) with a standard normal distribution.

Likewise, we may apply a χ21 approximation if we set Tn equal to Wn.23

Table 2 reports realizations of these approximate confidence intervals for the test statistics

πn,C(π0), φn,C (π0), Wn, and Rn for the observed values from Bendavid et al. (2020a). Addi-

tionally, Table 2 presents estimates of the coverage and average interval length taken over 10,000

bootstrap replicates computed at the n and estimate pn from this study. Notably, each of these

intervals now includes zero.24 Intervals constructed using πn,C(π0) and φn,C are approximately

3-4% longer, on average, than the Delta Method intervals. Intervals constructed using Wn and

Rn are approximately 3-4% shorter, on average, than the Delta Method intervals. Each interval

construction has coverage probability significantly closer to the nominal level.

Figure 2 displays estimates of the coverage probabilities for these intervals in the Monte Carlo

experiment developed in Section 3.3. Recall that the black dots display one minus the estimates

of the coverage probabilities of the respective intervals, and that the purple squares and blue tri-

angles display the proportion of the replicates in which the true value of π falls below and above

the realized confidence interval, respectively. In contrast to the results for the standard methods

displayed in Figure 1, we estimate that the coverage probabilities for these methods are very close

to the nominal value of 95% for most parameterizations. The intervals constructed with πn,C (π0)

23Observe that using normal approximations to πn or πn is equivalent to constructing Wald intervals for thesestatistics. For that reason, we focus on statistics that make explicit use of the null hypothesis restriction π = π0.

24If the null hypothesis π = 0 is of particular interest or concern, then there exists an exact uniformly most powerfulunbiased level α test for the equivalent problem of testing p1 = p2 against p1 > p2. This is a conditional one-sidedbinomial test; see Section 4.5 of Lehmann and Romano (2005). Such a test does not exist for other values of π0.


Statistic Interval Ave. Length Ave. Length vs. Delta Method Coverage

πn,C [0.000,0.020] 0.0193 1.0404 0.963

φn,C [0.000,0.020] 0.0191 1.0302 0.963

Wn [0.000,0.021] 0.0177 0.9563 0.927

Rn [0.000,0.021] 0.0181 0.9758 0.950

Table 2: Average Interval Length and Coverage for Test-Inversion Nominal 95% Confidence In-tervals for Seroprevalence Using Asymptotic Approximation

Notes: Table 2 reports the approximate test-inversion confidence intervals, constructed with an asymptotic approxima-tion to test statistic null distributions, computed on data from Bendavid et al. (2020a). Estimates of the average lengthand coverage for these intervals at the n and estimate pn from this study are also displayed. Estimates of averagelength and coverage are taken over 10,000 bootstrap replicates of X at the sample size n and the estimated parameterspn from this study.

and φn,C (π0) are not particularly equi-tailed, with the true value of π falling above the realized

confidence interval more frequently than it falls below. The intervals constructed with Wn and Rn

are more equi-tailed, but the overall coverage probability appears more sensitive to perturbations

in p2 and n2.

Next, we refine this approach by directly computing Jn,pn(π0) with the bootstrap. This method

is more accurate, but can be significantly more computationally expensive. Table 3 reports real-

izations of these approximate confidence intervals for the test statistics πn, πn,C(π0), φn, φn,C(π0),

Wn, and Rn computed on data from Bendavid et al. (2020a), in addition to estimates of the cov-

erage and average interval length at the n and estimate pn for this study. Again, each of these

intervals include zero, are roughly the same length as the delta-method intervals, and have cover-

age probability close to the nominal level.

Figure 3 displays estimates of the coverage probabilities for these intervals for the test statistics

πn, πn,C(π0), φn, φn,C(π0), Wn, and Rn in the same Monte Carlo experiment developed in Section

3.3. Similarly to Figure 2, these intervals have coverage close to the nominal value and are approx-

imately equi-tailed. In particular, the interval constructed with Rn is equi-tailed and appears to be

the least sensitive to perturbations in p2 and n2.


Figure 2: Coverage for Test-Inversion 95% Intervals Based on Asymptotic Approximation

p1 p2 n1 n2

π~n,C,aa

φ~n,C,aa

Wn,aa

R~

n,aa

0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800

0.00

0.05

0.00

0.05

0.00

0.05

0.00

0.05


Notes: Figure 2 displays estimates of the coverage probabilities of the approximate confidence intervals constructedwith an asymptotic approximation to test statistic null distributions. Estimates of the coverage for the interval con-structed with the test statistic Tn are denoted by “Tn, aa” and are computed at parameter values close to the estimatespn and sample size n of Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minusthe proportion of replicates for which the true value of π falls in the realized confidence intervals, i.e., one minusthe estimated coverage probability. The purple squares and blue triangles denote the proportion of replicates that fallbelow and above realized confidence intervals, respectively. The vertical dotted line denotes the estimated value ofpn,1, pn,2, or sample n1, n2 for Bendavid et al. (2020a). The horizontal dotted line denotes one minus the nominalcoverage of 0.95.

5.3 Finite-Sample Valid Intervals

We now turn to the application of the finite-sample valid intervals discussed in Section 4.4 to

seroprevalence. We focus our development on the test statistic φn(π0) as it is linear, and therefore

monotone with respect to each sample Xi, as is required.25

To begin, we must choose a partition of the parameter space Ω into a parameter of interest and

a nuisance component. Recall from the discussion in Section 4.3, that finite-sample exact intervals

formed by maximizing p-values over a nuisance space perform best if the distribution of the chosen

25Alternatively, one may also consider the test statistic πn = f(pn), where f(p) = (p1 − p2)/(p3 − p2). Then, fis monotone with respect to each component as long as p ∈ Ω.


Figure 3: Coverage for Test-Inversion Nominal 95% Confidence Intervals Based on the Bootstrap

p1 p2 n1 n2

πn,boot

π~n,boot

φn,boot

φ~n,boot

Wn,boot

Rn,boot

0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06


Notes: Figure 3 displays estimates of the coverage probabilities of the approximate confidence intervals constructedwith the bootstrap. Estimates of the coverage for the interval constructed with the test statistic Tn are denoted by“Tn, boot” and are computed at parameter values close to the estimates pn and sample size n of Bendavid et al.(2020a) as specified in Section 3.3. The black dots denote the one minus the proportion of replicates for which thetrue value of π fall in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purplesquares and blue triangles denote the proportion of replicates that fall below and above realized confidence intervals,respectively. The vertical dotted line denotes the estimate pn,1, pn,2 or sample size n1, n2 for Bendavid et al. (2020a).The horizontal dotted line denotes one minus the nominal coverage of 0.95.


Statistic Interval Ave. Length Ave. Length vs. Delta Method Coverage

πn [0.000,0.021] 0.0189 1.0186 0.965

φn [0.000,0.021] 0.0188 1.0133 0.962

πn [0.000,0.021] 0.0188 1.0150 0.953

φn [0.000,0.021] 0.0188 1.0141 0.953

Wn [0.000,0.021] 0.0181 0.9788 0.946

Rn [0.000,0.021] 0.0186 1.0035 0.951

Table 3: Average Interval Length and Coverage for Approximate Test-Inversion Nominal 95%Confidence Intervals Based on the Bootstrap

Notes: Table 3 reports the approximate test-inversion confidence intervals, constructed with a parametric bootstrapapproximation to null distributions of test statistics, computed on data from Bendavid et al. (2020a). Estimates of theaverage length and coverage for these intervals at sample size n and estimate pn from this study are also displayed.Estimates of average length and coverage are taken over 10,000 bootstrap replicates of X at the sample size n andestimate pn from this study.

test statistic does not vary much with the nuisance parameter. For small values of π0, the variance

of φn(π0), denoted by Vφn(π0)(π0) and specified in (29), is insensitive to changes in p3, as the

variance σ23(p3) enters into Vφn(π0)(π0) linearly and scaled by π2

0 . For sample sizes comparable to

the measurements taken in Bendavid et al. (2020a), where n1 is much larger than n2, the variance

Vφn(π0)(π0) will be less sensitive to changes in p1 than to changes in p3. Thus, we set the nuisance

component ϑ = (p1, p3), giving the parameterization (π, ϑ).

We note that for different sample sizes, it may be attractive to set the nuisance component

ϑ = (p2, p3). For example, Bendavid et al. (2020b) – the April 27th draft of Bendavid et al. (2020a)

– includes larger sensitivity and specificity samples n2 and n3. In particular, the specificity sample

n2 was increased from 401 to 3324.26,27

To fix ideas, consider Figure 4, which displays a heat-map of the 0.975 quantile of φn(π0)

under different parameterizations (π0, ϑ), with the sample size and the null hypothesis restriction

26The choice ϑ = (p2, p3) also has computation advantages. In particular, by the identity p1 = p2 (1− π) + p3π,for any value of seroprevalence π0 and any values of p2 and p3 satisfying p2 < p3, there is a value of p1 that satisfiesp2 ≤ p1 ≤ p3 such that π0 = (p1 − p2)/(p3 − p2). That is, any value of (p2, p3) corresponds to a unique value of p1consistent with a given value of π0 and satisfying the a priori restrictions on the parameter space Ω.

27These additional data were are aggregated over several samples taken at different times and locations. Gelman(2020), Fithian (2020), and Bennett and Steyvers (2020) highlight issues with this aggregation.


Figure 4: Bootstrap Quantiles and Initial Nuisance Parameter Confidence Region

0.7

0.8

0.9

1.0

0.01 0.02 0.03 0.04 0.05p1

p3

0.975 Quantile of Bootstrap Distribution

0.003 0.005 0.007 0.008 0.01 0.012 0.013 0.015 0.017 0.018 0.02

Notes: Figure 4 displays a heat-map of the 0.975 quantile of φn(π0) under different parameterizations (π0, ϑ), wherethe sample size and null hypothesis restriction π0 equal to the sample size and estimated prevalence πn from Bendavidet al. (2020a). The black line exhibits the boundary of the parameter space Ω(π0). The black square dot denotes theconstrained MLE ϑ(π0) = (pn,1(π0), pn,3(π0)). With γ = 0.001, the grey rectangle denotes a 1 − γ confidenceregion for ϑ constructed by taking the cartesian product of two

√1− γ level confidence regions for p1 and p3 each

constructed with the method of Clopper and Pearson (1934). The white dots denote a 10× 10 grid over this space.

π0 equal to the sample size and estimated prevalence πn from Bendavid et al. (2020a). The square

black dot denotes the constrained MLE ϑ(π0) = (pn,1(π0), pn,3(π0)). The black line exhibits the

boundary of the parameter space Ω(π0).

Recall that in the constructions of approximate intervals considered in Section 5.2, a point π0 is

excluded from a confidence interval with nominal coverage 0.95 if the observed value of the chosen

test statistic Tn exceeds or falls below the 0.975 or 0.025 quantiles of the statistic’s finite-sample

distribution at the constrained MLE pn(π0). But, for example, the 0.975 quantile of the bootstrap

distribution of φn(π0) is variable over parameterizations (π0, ϑ). As a result, these approximate

intervals will not exactly control the coverage probability in finite samples, as the event that ϑ

differs from ϑ(π0) occurs with positive probability.


In turn, comparing the realized value of a test statistic to quantiles of the statistic’s finite-

sample distribution at every value of the nuisance component ϑ is both infeasible, as the space of ϑ

is infinite, and impractical, as it would lead to extremely conservative intervals. In fact, we can see

that in Figure 4, the 0.975 quantile of the bootstrap distribution of φn(π0) is approximately four

times as large at p1 = 0.05 than at p1 = pn,1(π0).

Thus, the finite-sample approach developed in Section 4.4 begins by constructing a 1 − γ

confidence region for ϑ and forming a finite grid over this space. The initial confidence region I1−γ

is illustrated in Figure 4 by the greyed rectangle, and a 10 × 10 grid over this space is illustrated

by the grid of white dots. The confidence region I1−γ is formed by taking the Cartesian product of

two√

1− γ level confidence regions for p1 and p3, each constructed by using the exact intervals

of Clopper and Pearson (1934) with γ = 0.001. This grid partitions the values of ϑ in I1−γ into

r = 81 rectangles, which we enumerate E1, . . . , Er. Define, for i = 1, 3, the extreme points

pi(j) = infpi : (p1, p3)> ∈ Ej and pi(j) = suppi : (p1, p3)> ∈ Ej

as well as

p2(j) = infp2 : p ∈ Ω(π0), (p1, p3)> ∈ Ej and (30)

p2(j) = supp2 : p ∈ Ω(π0), (p1, p3)> ∈ Ej .

As the test statistic φn(π0) is monotone increasing in X1 and monotone decreasing in X2 and X3,

define

pL(j) =(p1(j), p2(j), p3(j)

)and pU(j) =

(p1(j), p2(j), p3(j)

)as well as

sL(j) = Jφ(π0)n,pL(j)(t

−0 ) and sU(j) = J

φ(π0)n,pU (j)(t0)

where sL(j) and sL(j) are set equal to 1 and 0, respectively, if the infimum or supremum in (30) are

taken over the empty set. Thus, by Theorem 4.1 we can construct the finite-sample valid p-values

qL,π0,I1−γ = max1≤j≤r

(1− sL(j)) + γ and qU,π0,I1−γ = max1≤j≤r

(sU(j)) + γ

for testing the null hypothesis π = π0. Hence, the resultant finite-sample valid interval with


nominal coverage 1− α takes the form

π0 : qL,π0,I1−γ ≥ α/2 and qU,π0,I1−γ ≥ α/2

.

This approach is closely related to the method developed in Cai et al. (2020), though there

are some differences. Roughly, Cai et al. (2020) compute p-values for test of the null hypothesis

π = π0 with the parametric bootstrap using the particular choice of test statistic πn at each point of

a grid spanning a confidence region for the nuisance parameter, although the exact specification of

their procedure differs from our own. Their construction begins by constructing a joint confidence

region for all three parameters, while our approach proceeds from a smaller initial region for just

p1 and p3. We make an additional correction for a grid approximation to the nuisance space, which

allows us to ensure finite-sample validity. Finally, their construction does not guarantee that the

resulting intervals are equi-tailed.

Table 4 reports realizations of these finite-sample valid intervals with γ equal to 10−4, 10−3,

and 10−2, as well as the projection intervals discussed in Section 3.2, for Bendavid et al. (2020a).

In addition, Table 4 reports estimates of the coverage and average interval length at the estimated

values of pn for this study. The cost of ensuring finite-sample valid coverage is large. The realized

intervals are roughly 40% wider on average than intervals constructed with the Delta Method,

and the coverage is very close to one. Figure 5 displays estimates of the coverage probabilities

for the finite-sample valid intervals for γ equal to 10−4, 10−3, and 10−2 as well as the projection

intervals in the same Monte Carlo experiment developed in Section 3.3.28 Again, the coverage

is very close to one at small sample sizes. However, the finite-sample valid intervals outperform

the crude projection intervals in terms of coverage and average length for larger values of γ. The

difference is most salient in the measurements of coverage.

The additional costs associated with correction of the approximation error induced by a finite

discretization of the nuisance space are not overly burdensome. In particular, consider the test

inversion intervals for π constructed with the p-values

qL,π0,I1−γ,g = max

1− J φ(π0)n,p (t−0 ) + γ : p ∈ Ω(π0), (p2, p3) ∈ I1−γ,g

28Note that the proportion of Monte Carlo replicates for which the true value of π falls below the realized intervals isvery close to zero at most parameter values, and so dots denoting one minus the estimated coverage and the proportionof Monte Carlo replicates for which the true value of π falls above the realized intervals are approximately overlaid.


Method γ Interval Ave. Length Ave. Length vs. Delta Method Coverage

Exact

0.0001 [0.000,0.028] 0.0283 1.5311 0.999

0.0010 [0.000,0.027] 0.0269 1.4538 0.998

0.0100 [0.000,0.026] 0.0259 1.3979 0.998

Projection [0.001,0.028] 0.0270 1.4578 1.000

Table 4: Average Interval Length and Coverage for Valid Test-Inversion and Projection Nominal95% Intervals

Notes: Table 4 reports the finite-sample valid test-inversion and projection confidence intervals computed on data fromBendavid et al. (2020a). Estimates of the average length and coverage for these intervals at sample size n and estimatepn from this study are also displayed. Estimates of average length and coverage are taken over 10,000 bootstrapreplicates of X at the sample size n and the estimate pn from this study.

and

qU,π0,I1−γ,g = maxJ φ(π0)n,p (t0) + γ : p ∈ Ω(π0), (p2, p3) ∈ I1−γ,g

,

where I1−γ,g is a g × g grid over the initial confidence region I1−γ . That is, I1−γ,g denotes the

white dots in Figure 4, where in that case g = 10. The realized value for these intervals with

g = 10 and γ = 10−2 for data from Bendavid et al. (2020a) are [0.000, 0.025]. For these values of

g and γ, this interval construction has an average length of 0.0249, which is 34.85% longer than the

delta-method interval on average, i.e., they are 3.5% shorter than the finite-sample valid intervals

considered in this section.

There are several facets of the finite-sample valid confidence intervals considered in this section

that could potentially be improved. As previously mentioned, there are several possible choices

for the nuisance parameter ϑ that lead to an initial confidence region I1−γ . In addition, in order

to get lower or upper confidence bounds for π, an initial confidence region may be constructed

by taking the product of appropriate one-sided bounds, respectively, rather than using a single

joint confidence region for both lower and upper bounds. This change should save roughly γ/2

in over-coverage. It may be more desirable to use a Studentized test statistic, as its distribution

may vary even less within the initial confidence region. However, a nontrivial modification of the

correction for the approximation error induced by a finite discretization of the nuisance space is

required, since monotonicity may be violated. Finally, for a particular data set, smaller grids over

the nuisance space may be applied to further reduce the length of intervals.


Figure 5: Coverage for Finite-Sample Valid Test-Inversion and Projection Nominal 95% Intervals

p1 p2 n1 n2

φn,γ = 0.0001

φn,γ = 0.001

φn,γ = 0.01

Projection

0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800

0.000

0.002

0.004

0.000

0.002

0.004

0.000

0.002

0.004

0.000

0.002

0.004


Notes: Figure 5 displays estimates of the coverage probabilities of the finite-sample valid test inversion and projectionconfidence intervals. Estimates of the coverage are computed at parameter values close to the estimates pn and samplesize n from Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minus the proportion ofreplicates for which the true value of π falls in the realized confidence intervals, i.e., one minus the estimated coverageprobability. The purple squares and blue triangles denote the proportion of replicates that fall below and above realizedconfidence intervals, respectively. The vertical dotted line denotes the estimated value of pn,1, pn,1, or sample size n1,n2 from Bendavid et al. (2020a).

6. Application to Additional Problems

The methods discussed in Section 4 apply quite generally to the construction of confidence in-

tervals for real-valued parameters θ. To review, it is required that a statistical model be indexed

by (θ, ϑ), where θ is the parameter of interest and ϑ is a nuisance parameter. The inverse testing

approach to confidence interval construction tests the null hypothesis specified by H(θ0) : θ = θ0

for each θ0 and takes as a confidence region all θ0 where H(θ0) is not rejected.

The approximate approach considered in Section 4.2 constructs tests of H(θ0) by simulating

under (θ0, ϑ(θ0)), where ϑ(θ0) is an estimate of ϑ assumingH(θ0) is true. As such, the accuracy of

this method holds under standard bootstrap arguments. The finite-sample valid approach consid-


ered in Section 4.4 proceeds by first constructing a confidence interval for ϑ, either for each fixed

θ0 or without any restrictions, and thus requires knowledge of a finite-sample valid confidence re-

gion for ϑ. Finite-sample error control is obtained by minimizing the p-value of a test of H(θ0)

with ϑ taking values over the confidence region. However, if a grid search is required for this

minimization, finite-sample error control can be maintained by the correction developed in Section

4.4, as is verified formally in Theorem 4.1. The application of this result requires the use of a test

statistic that is monotone, and has a monotone likelihood ratio, in each of its components.

Such conditions apply broadly. For example, suppose X1, . . . , Xk are independent, with Xi

distributed as binomial with parameters ni and pi. The goal is to construct a confidence interval

for some parameter θ = f(p1, . . . , pk). In this case, the family of distributions of Xi has monotone

likelihood ratio in Xi. However, in order to apply Theorem 4.1, we just need to specify ϑ and

a confidence interval for ϑ, and to verify that the chosen test statistic is monotone with respect

to each of its components. This is straightforward to check when the test statistic is given by

Tn = f(p1, . . . , pn), where pi = Xi/ni.

Some important special cases include the following:

(i) Difference of Proportions: f(p1, p2) = p2 − p1

(ii) Relative Risk Reduction: f(p1, p2) = 1− p2/p1

(iii) Odds Ratio: f(p1, p2) = [p2/(1− p2)]/[p1/(1− p1)]

(iv) Prevalence: f(p1, p2, p3) = (p1 − p2)/(p3 − p2)

In cases (i)-(iii), f is monotone in each argument and we may set ϑ equal to p1 or p2 and apply a

Clopper-Pearson interval. As previously discussed, in case (iv), f is monotone in each argument

on the parameter space

Ω =p ∈ [0, 1]3 : p2 ≤ p1 ≤ p3, p2 < p3

,

and we may set ϑ = (p1, p3), and apply a joint confidence set based on marginal Clopper-Pearson

intervals. Agresti and Min (2002) and Fagerland et al. (2015) consider confidence intervals for

(i)-(iii), but not for (iv) as we do here. They consider similar constructions by minimizing p-values

over a nuisance parameter space, but do not account for the discretization required. Uniformly most

accurate unbiased confidence bounds exist only for the odds ratio based on classical constructions,


but optimality considerations fail for the other parameters; see e.g., Problem 5.29 of Lehmann and

Romano (2005).

An important application of estimates of relative risk reduction are in measurements of vaccine

efficacy in randomized clinical trials. In that context, p2 is the attack rate – or probability of

contracting the targeted infectious disease – for experimental subjects who have been treated with

a vaccine and p1 is the attack rate of experimental subjects who have been treated with a placebo.

Jewell (2004) gives a discussion of statistical aspects of this problem.

Constructing confidence intervals for prevalence as in (iv) may be extended to settings where

observations are collected on each of S subpopulations or strata of a population. All of the methods

discussed in this paper apply to this context. Indeed, suppose pi1 denotes the probability that a

randomly selected individual in the ith stratum tests positive. Suppose all strata make use of the

same estimates 1− p2 and p3 of sensitivity 1−p2 and specificity p3, respectively. Then, prevalence

in the ith stratum is πi = (pi1 − p2)/(p3 − p2). If the ith stratum gets the known weight wi, then

the overall prevalence is π =∑

iwiπi, which is clearly a function of S + 2 binomial parameters.

Estimates of prevalence are also of interest in the literature on “causal attribution,” which aims

to measure “how much” of an outcome is attributable to a binary treatment. For example, the es-

timand of interest in Ganong and Noel (2020), which aims to measure the proportion of mortgage

defaults that are caused exclusively by negative equity – as opposed to adverse life events – in the

years during and following the great recession, takes the form of prevalence. Pearl (1999), Rosen-

baum (2001), and Yamamoto (2012) provide further discussion of identification and inference in

analyses of causal attribution, as well as a survey several applications.

7. Conclusion

Standard methods for constructing confidence intervals for seroprevalence derived from the Delta

Method, the percentile bootstrap, and the BCa bootstrap have coverage probabilities that behave

erratically and are not consistently near the desired nominal level at empirically relevant sample

sizes and parameter values. By contrast, methods that combine test inversion with the parametric

bootstrap lead to stable coverage probabilities that are close to the nominal level across a variety

of statistics. Specifically, statistics that are properly Studentized or based on the generalized like-

lihood ratio statistic – and in particular its signed square root – exhibit superior performance. Test

inversion based on the signed square root generalized likelihood ratio statistic gives the best overall

performance in terms of its stability and validity over a range of empirically relevant parameteri-


zations and sample sizes. On the other hand, if one desires a method that has guaranteed coverage

in finite-samples, then we have provided an alternative method that has finite-sample validity, but

does have coverage above the nominal level and results in longer intervals on average.


ReferencesAgresti, A. and Min, Y. (2002). Unconditional small-sample confidence intervals for the odds ratio. Bio-

statistics, 3(3):379–386.

Ainsworth, M., Andersson, M., Auckland, K., Baillie, J. K., Barnes, E., Beer, S., Beveridge, A., Bibi, S.,Blackwell, L., Borak, M., et al. (2020). Performance characteristics of five immunoassays for sars-cov-2:a head-to-head benchmark comparison. The Lancet Infectious Diseases, 20(12):1390–1400.

Alter, G. and Seder, R. (2020). The power of antibody-based surveillance. The New England Journal ofMedicine.

Barndorff-Nielsen, O. (1986). Inference on full and partial parameters based on the standardized signed loglikelihood ratio. Biometrika, 73:307–322.

Bendavid, E., Mulaney, B., Sood, N., Shah, S., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saavedra-Walker, R., Tedrow, J., Bogan, A., Kupiec, T., Eichner, D., Gupta, R., Ioannidis, J. P. A., and Bhat-tacharya, J. (2021). COVID-19 antibody seroprevalence in Santa Clara County, California. InternationalJournal of Epidemiology. dyab010.

Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saave-dra, R., Tedrow, J., Tversky, D., Bogan, T. K., Eichner, D., Gupta, R., Ioannidis, J., and Bhattacharya, J.(2020a). Covid-19 antibody seroprevalence in santa clara county, california. MedRxiv, April 11.

Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saave-dra, R., Tedrow, J., Tversky, D., Bogan, T. K., Eichner, D., Gupta, R., Ioannidis, J., and Bhattacharya, J.(2020b). Covid-19 antibody seroprevalence in santa clara county, california. MedRxiv, April 27.

Bennett, S. T. and Steyvers, M. (2020). Estimating covid-19 antibody seroprevalence in santa clara county,california. a re-analysis of bendavid et al. medRxiv.

Berger, R. and Boos, D. (1994). P values maximized over a confidence set for the nuisance parameter.Journal of the American Statistical Association, 89(427):1012–1016.

Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. StatisticalScience, pages 101–117.

Byambasuren, O., Dobler, C. C., Bell, K., Rojas, D. P., Clark, J., McLaws, M.-L., and Glasziou, P. (2020).Estimating the seroprevalence of sars-cov-2 infections: systematic review. medRxiv.

Cai, B., Ioannidis, J., Bendavid, E., and Tian, L. (2020). Exact inference for disease prevalence based on atest with unknown specificity and sensitivity. arXiv preprint arXiv:2011.14423.

Carpenter, J. (1999). Test inversion bootstrap confidence intervals. Journal of the Royal Statistical Society:Series B (Statistical Methodology), 61(1):159–172.

Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case ofthe binomial. Biometrika, 26(4):404–413.

Cox, D. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discus-sion). Journal of the Royal Statistical Society, B, 49:1–39.


Deeks, J. J., Dinnes, J., Takwoingi, Y., Davenport, C., Spijker, R., Taylor-Phillips, S., Adriano, A., Beese, S.,Dretzke, J., di Ruffano, L. F., et al. (2020). Antibody tests for identification of current and past infectionwith sars-cov-2. Cochrane Database of Systematic Reviews, (6).

DiCiccio, T., Martin, M., and Stern, S. (2001). Simple and accurate one-sided inference from signed rootsof likelihood ratios. The Canadian Journal of Statistics, 29(1):67–76.

DiCiccio, T. J. and Romano, J. P. (1990). Nonparametric confidence limits by resampling methods and leastfavorable families. Internat. Statist. Rev., 58:59–76.

DiCiccio, T. J. and Romano, J. P. (1995). On bootstrap procedures for second-order accurate confidencelimits in parametric models. Statistica Sinica, pages 141–160.

Efron, B. (1981). Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics,9(2):139–158.

Efron, B. (1985). Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72(1):45–58.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association,82(397):171–185.

Eisen, M. B. and Tibshirani, R. (2020). How to identify flawed research before it becomes dangerous. TheNew York Times, 21.

Fagerland, M. W., Lydersen, S., and Laake, P. (2015). Recommended confidence intervals for two indepen-dent binomial proportions. Statistical methods in medical research, 24(2):224–254.

Fauci, A. S., Lane, H. C., and Redfield, R. R. (2020). Covid-19—navigating the uncharted. The NewEngland Journal of Medicine, 382(13):1268.

Fithian, W. (2020). Statistical comment on the revision of bendavid et al.

Fraser, D. A. S. and Reid, N. (1987). On conditional inference for a real parameter: a differential approachon the sample space. Biometrika, 75:251–264.

Frydenberg, M. and Jensen, J. L. (1989). Is the ‘improved likelihood ratio statistic’ really improved in thediscrete case? Biometrika, 76:655–661.

Ganong, P. and Noel, P. J. (2020). Why do borrowers default on mortgages? a new method for causalattribution. Technical report, National Bureau of Economic Research.

Gelman, A. (2020). Concerns with that stanford study of coronavirus prevalence. Statistical Modeling,Causal Inference, and Social Science blog.

Gelman, A. and Carpenter, B. (2020). Bayesian analysis of tests with unknown specificity and sensitivity.Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(5):1269–1283.

Hall, P. (2013). The Bootstrap and Edgeworth Expansion. Springer.

Hui, S. L. and Walter, S. D. (1980). Estimating the error rates of diagnostic tests. Biometrics, pages 167–171.

Jensen, J. L. (1986). Similar tests and the standardized log ratio statistic. Biometrika, 73:567–572.


Jensen, J. L. (1992). The modified signed likelihood statistic and saddlepoint approximations. Biometrika,79:693–703.

Jewell, N. P. (2004). Statistics for Epidemiology. CRC Press, Boca Raton, Florida.

Johnson, S. G. (2020). The nlopt nonlinear-optimization package. http://github.com/stevengj/nlopt.

Kolata, G. (2020). Coronavirus infections may not be uncommon, tests suggest. The New York Times.

Kraft, D. (1988). A software package for sequential quadratic programming. Forschungsbericht- DeutscheForschungs- und Versuchsanstalt fur Luft- und Raumfahrt.

Krammer, F. and Simon, V. (2020). Serology assays to manage covid-19. Science, 368(6495):1060–1061.

Lawley, D. N. (1956). A general method for approximating to the distribution of the likelihood ratio criteria.Biometrika, 43:295–303.

Lee, S. and Young, G. (2005). Parametric bootstrapping with nuisance parameters. Statistics and ProbabilityLetters, 71:143–153.

Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses. Springer, New York.

Mallapaty, S. (2020). Antibody tests suggest that coronavirus infections vastly exceed official counts. Nature(Lond.).

Pearl, J. (1999). Probabilities of causation: three counterfactual interpretations and their identification.Synthese, 121(1):93–149.

Peeling, R. W., Wedderburn, C. J., Garcia, P. J., Boeras, D., Fongwen, N., Nkengasong, J., Sall, A., Ta-nuri, A., and Heymann, D. L. (2020). Serology testing in the covid-19 pandemic response. The LancetInfectious Diseases.

Rogan, W. J. and Gladen, B. (1978). Estimating prevalence from the results of a screening test. Americanjournal of epidemiology, 107(1):71–76.

Romano, J., Shaikh, A., and Wolf, M. (2011). Consonance and the closure method in multiple testing, article12. International Journal of Biostatistics, 7(1).

Romano, J., Shaikh, A., and Wolf, M. (2014). A practical two-step method for testing moment inequalities.Econometrica, 82(5):1979–2002.

Rosenbaum, P. R. (2001). Effects attributable to treatment: Inference in experiments and observationalstudies with a discrete pivot. Biometrika, 88(1):219–231.

Silvapulle, M. (1996). A test in the presence of nuisance parameters. Journal of the American StatisticalAssociation, 91(436):1690–1693.

Toulis, P. (2020). Estimation of covid-19 prevalence from serology tests: A partial identification approach.University of Chicago, Becker Friedman Institute for Economics Working Paper, (2020-54).

Walter, S. D. and Irwig, L. M. (1988). Estimation of test error rates, disease prevalence and relative riskfrom misclassified data: a review. Journal of clinical epidemiology, 41(9):923–937.

Yamamoto, T. (2012). Understanding the past: Statistical analysis of causal attribution. American Journalof Political Science, 56(1):237–256.


A. Maximum Likelihood Estimation of Model ParametersIn this section, we develop constrained and unconstrained maximum likelihood estimators for p.Let L (x|p) be the likelihood of observing the data x under p, given by

L (x|p) =∏

1≤i≤3

(nixi

)pxii (1− pj)ni−xi .

The maximum likelihood estimate of p, which we denote by pn, is the solution to the inequalityconstrained convex optimization problem

Maximizep∈Ω

logL (x | p)

It is clear that

pn =

pn if pn,2 ≤ pn,1 ≤ pn,3(x1+x2n1+n2

, x1+x2n1+n2

, pn,3

)if pn,1 < pn,2 ≤ pn,3(

pn,1,x2+x3n3+n3

, x2+x3n3+n3

)if pn,2 ≤ pn,3 < pn,1(∑

j xj∑j nj

,∑j xj∑j nj

,∑j xj∑j nj

)if pn,3 < pn,2.

The constraint p2 ≤ p1 ≤ p3 is unlikely to be binding in a realistic setting. However, somemethods that we study rely on estimates based on bootstrap replicates XB

id∼ Binomial (ni, pn,i),

and in that case there is a small probability that pBn,j = XBj /nj will not satisfy pBn,2 ≤ pBn,1 ≤ pBn,3.

Recall that we can rewrite the condition π = π0 as the linear restriction b (π0)> p = 0 whereb (π0) = (1,− (1− π0) ,−π0)> . The maximum likelihood estimate of p under the constraint thatb (π0)> p = 0, which we denote by pn(π0) is the solution to the inequality and equality constrainedconvex optimization problem

Maximizep∈Ω

logL (x | p)

Subject to b (π0)> p = 0.

As an analytic solution to this problem is not readily available, we solve it numerically with the se-quential quadratic programming algorithm presented in Kraft (1988), available through the NLoptopen-source optimization library (Johnson, 2020).29

B. Accelerated Bias-Corrected Bootstrap ComputationIn this section, we discuss approximation of the constants z0 and a in the BCα confidence interval.Efron (1987) showed that

z0 = Φ−1(J πn,pn (πn)

)29Note that the sequential quadratic programming algorithm we apply is a (highly optimized) quasi-newton method,

approximating the Hessian of the objective function with updates specified by gradient evaluations. As an analyticexpression for the Hessian of the objective function is available, this is unnecessarily computationally costly. However,we were not able to locate any publicly available code accessible in the R programming language for inequality andequality constrained Newton’s method optimization.


and so can be computed from Monte Carlo estimation of J πn (t, pn). Let z0 denote the bootstrapestimate of z0.

Let the likelihood function be given by

L (p | x) =∏

1≤j≤3

(njxj

)pxjj (1− pj)nj−xj .

DiCiccio and Romano (1995) show that a can be approximated with

a =1

6

∑ijk λijkµiµjµk(∑i,j λijµiµj

)3/2

,

where

θi =∂π

∂pi, µi =

∑j

θi/λij,

λij = E[(

∂

∂pilogL (p | X)

)(∂

∂pjlogL (p | X)

)],

λijk = E[(

∂

∂pilogL (p | X)

)(∂

∂pjlogL (p | X)

)(∂

∂pklogL (p | X)

)].

In our setting, it can be seen that

θ1 =1

p3 − p2

θ2 = p1−p3(p3−p2)2

θ3 =p2 − p1

(p3 − p2)2

and that

λij =

ni

pi(1−pi) i = j

0 i 6= j

λijk =

ni(1−2pi)

p2i (1−pi)2 i = j = k

0 otherwise,

Thus, we find that

a =1

6

∑3i=1 λiiµ

3i(∑3

i=1 λiiµ2i

)3/2

where µi = θ/λii. Replacing pi by pn,i in the above expressions for the λii and µi results in theestimator a for the acceleration constant.

Documents

CONFIDENCE INTERVALS FOR SEROPREVALENCE By Thomas J