Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
CONFIDENCE INTERVALS FOR SEROPREVALENCE
By
Thomas J. DiCiccio David M. Ritzwoller Joseph P. Romano Azeem M. Shaikh
Technical Report No. 2021-02 March 2021
Department of Statistics STANFORD UNIVERSITY
Stanford, California 94305-4065
CONFIDENCE INTERVALS FOR SEROPREVALENCE
By
Thomas J. DiCiccio Cornell University
David M. Ritzwoller Joseph P. Romano Stanford University
Azeem M. Shaikh
University of Chicago
Technical Report No. 2021-02 March 2021
This research was supported in part by National Science Foundation grants
MMS 1949845 and SES 1530661.
Department of Statistics STANFORD UNIVERSITY
Stanford, California 94305-4065
http://statistics.stanford.edu
Confidence Intervals for Seroprevalence ∗
Thomas J. DiCiccio David M. Ritzwoller
Department of Social Statistics Graduate School of BusinessCornell University Stanford [email protected] [email protected]
Joseph P. Romano Azeem M. Shaikh
Departments of Statistics and Economics Department of EconomicsStanford University University of Chicago
[email protected] [email protected]
March 27, 2021
Abstract
This paper concerns the construction of confidence intervals in standard seroprevalencesurveys. In particular, we discuss methods for constructing confidence intervals for the pro-portion of individuals in a population infected with a disease using a sample of antibody testresults and measurements of the test’s false positive and false negative rates. We begin bydocumenting erratic behavior in the coverage probabilities of standard Wald and percentilebootstrap intervals when applied to this problem. We then consider two alternative sets of in-tervals constructed with test inversion. The first set of intervals are approximate, using eitherasymptotic or bootstrap approximation to the finite-sample distribution of a chosen test statis-tic. We consider several choices of test statistic, including maximum likelihood estimators andgeneralized likelihood ratio statistics. We show with simulation that, at empirically relevantparameter values and sample sizes, the coverage probabilities for these intervals are close totheir nominal level and are approximately equi-tailed. The second set of intervals are shownto contain the true parameter value with probability at least equal to the nominal level, but canbe conservative in finite samples. To conclude, we outline the application of the methods thatwe consider to several related problems, and we provide a set of practical recommendations.
Keywords: Confidence Intervals, Novel Coronavirus, Serology Testing, Seroprevalence, TestInversion
MSC 2020 Subject Classification: Primary 62F03, Secondary 62P10
∗We acknowledge funding from the National Science Foundation under the Graduate Research Fellowship Programand under the grants MMS-1949845 and SES-1530661.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 1
1. Introduction
Accurate measurement of the spread of infectious diseases is critical for informing and calibrating
effective public health strategy (Fauci et al., 2020; Peeling et al., 2020). Seroprevalence surveys, in
which antibody tests are administered to samples of individuals from a population, are a practical
and widely applied strategy for assessing the progression of a pandemic (Krammer and Simon,
2020; Alter and Seder, 2020). However, antibody tests, which detect the presence of viral anti-
bodies in blood samples, are imperfect.1 In particular, false positive and false negative antibody
test results occur with nontrivial probabilties.2 Accounting for the variation in the results of a sero-
prevalence survey induced by this imperfection is important for valid assessment of the uncertainty
in measurements of the spread of an infectious disease.
In this paper, we study the construction of confidence intervals in a standard seroprevalence
survey. Given the public interest in explicit representations of disease incidence, our objective is to
analyze the accuracy of various methods of constructing confidence intervals, so that results from
empirical analyses can be reported with statistical precision. We demonstrate that some methods
based on test inversion offer advantages relative to more standard confidence interval constructions
in terms of the stability and validity of their coverage probabilities.
In a standard seroprevalence survey, the proportion of a population that has been infected with
a disease is a smooth function of the parameters of three independent binomial trials whose real-
izations are observed. Although it may be expected that standard approaches to confidence interval
construction are well suited for such a simple parametric problem, in Section 3, we demonstrate
with a Monte Carlo experiment that standard Wald and percentile bootstrap confidence intervals
have erratic coverage probabilities at empirically relevant parameter values and sample sizes when
applied to this problem.
In fact, as documented in Brown et al. (2001), erratic coverage probabilities for confidence
intervals constructed by using standard methods surface even in the context of inference on a sin-
gle binomial parameter. Bootstrap (and other) methods that are typically second-order correct
in continuous problems may not achieve this accuracy in discrete problems. Usually, claims of
second-order correctness are made based on Edgeworth expansions (Hall, 2013). However, in
1Deeks et al. (2020) give an early systematic review of the accuracy of SARS-CoV-2 antibody tests, highlightingseveral methodological limitations.
2Ainsworth et al. (2020) measure the false positive and negative rates for five leading SARS-CoV-2 immunoassays.Estimates of false positive rates ranged from 0.1% to 1.1%. Estimates of false negative rates ranged from 0.9% to7.3%.
2 CONFIDENCE INTERVALS FOR SEROPREVALENCE
discrete settings, Cramer’s condition – a necessary condition for the application of an Edgeworth
expansion – fails, and second-order accuracy may not be achievable. For example, atoms in the
binomial distribution based on n trials have order n−1/2, so expansions to order n−1 must account
for this discreteness. Additionally, when a binomial random variable has parameter value near zero
or one, even first-order approximations to its limiting distribution are not normal, while many stan-
dard methods for constructing confidence intervals rely, either explicitly or implicitly, on a normal
approximation holding. As the parameter of interest in a seroprevalence survey is a function of
three binomial parameters, inference in this setting is more challenging than for a single binomial
proportion.
To address the erratic coverage probabilities in standard confidence interval constructions, we
consider several alternative approaches based on test inversion. Test inversion is an approach to
confidence interval construction that exploits the duality between tests of the value of a parameter
and confidence intervals for that parameter. A test inversion confidence interval for a parameter
θ consists of the set of points θ0 for which the null hypothesis H(θ0) : θ = θ0 is not rejected.
For parameters θ where the corresponding null hypothesis H(θ0) is simple, the application of test
inversion is typically straightforward. By contrast, for parameters where the corresponding null
hypothesis is composite, the application of test inversion is not immediate. This is the case for
seroprevalence.
Therefore, in Section 4, we discuss the general problem of test inversion for parameters whose
corresponding null hypotheses are composite, and we develop several approaches to this general
problem. We consider both methods based on asymptotic or bootstrap approximation and meth-
ods with finite-sample guarantees. In the later case, a maximization of p-values over a nuisance
parameter space is required. Similar constructions are considered in Berger and Boos (1994) and
Silvapulle (1996). In practice, this maximization is carried out over a discrete grid. We provide
a refinement to such an approximation that maintains the finite-sample coverage requirement. We
take particular care in requiring that the confidence intervals that we develop behave well at both
endpoints; that is, we require that they are equi-tailed.3
In Section 5, we apply these approaches to construct confidence intervals for seroprevalence.
We consider several choices of test statistic, including maximum likelihood estimators and gen-
eralized likelihood ratio statistics. We demonstrate with simulation that the intervals based on
3A 1 − α confidence interval is equi-tailed if the probabilities that the parameter exceeds the upper endpoint or isbelow the lower endpoint of the interval are both near or below α/2. That is, an equi-tailed 1− α confidence intervalshould be given by the set of points satisfying both an upper and a lower confidence bound, each at level 1− α/2.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 3
asymptotic or bootstrap approximation have coverage probabilities that, at empirically relevant
parameter values and sample sizes, are close to, but potentially below, the nominal level and are
approximately equi-tailed. By contrast, the finite-sample valid intervals are conservative, but al-
ways have coverage probabilities that satisfy the coverage requirement.
We contextualize our analysis with data used to estimate seroprevalence at early stages of the
2019 SARS-Cov-2 pandemic. In particular, as a running example, we measure coverage prob-
abilities and average interval lengths for each of the methods we consider at sample sizes and
parameter values close to the estimates and sample sizes of Bendavid et al. (2020a), which was
posted on medRxiv on April 11th, 2020.4 This preprint estimates that the number of coronavirus
cases in Santa Clara County, California on April 3 - 4, 2020 was more than fifty times larger than
the number of officially diagnosed cases, and as a result, received widespread coverage in the pop-
ular and scientific press (Kolata, 2020; Mallapaty, 2020). The methods and design of this study
– including the reported confidence intervals – were questioned by many researchers (Eisen and
Tibshirani, 2020), prompting the release of a revised draft on April 27th, 2020, which we refer to
as Bendavid et al. (2020b), that integrated additional data.5 Our analysis highlights statistical chal-
lenges in seroprevalence surveys at early stages of the spread of infectious diseases, when disease
incidence is close to the error rates of new diagnostic technologies.
Nevertheless, seroprevalence surveys have advantages relative to alternative methods for mea-
suring the spread of infectious diseases. In particular, seroprevalence surveys directly measure the
presence of viral antibodies and therefore do not rely on measurements of reported cases or deaths
that may be subject to large and complex biases. Byambasuren et al. (2020) provide a systematic
review of SARS-CoV-2 seroprevalence surveys, identifying 17 studies that they consider to be of
particularly high quality. The authors find that estimated SARS-CoV-2 seroprevalence ranged from
0.56-717 times the number of reported coronavirus diagnoses in each study’s respective regions,
and that 10 of the studies imply at least 10 times more SARS-CoV-2 infections than reported
cases.6 Given the extent to which seroprevalence surveys imply substantively different popula-
tion exposure, the development of rigorous methods for inference in these studies is important for
informing public health decision-making.
This paper contributes to the literatures on inference in seroprevalence surveys (Rogan and
4This preprint has subsequently been published as Bendavid et al. (2021).5See Gelman (2020), Fithian (2020), and Bennett and Steyvers (2020) for further discussion and analysis of Ben-
david et al. (2020b). In particular, these articles highlight issues and propose alternative approaches for combiningdata measuring false positive rates from different samples.
6They note that estimates of seroprevalence remain well below thresholds for the development of herd immunity.
4 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Gladen, 1978; Hui and Walter, 1980; Walter and Irwig, 1988); see Jewell (2004) for a general
introduction to epidemiological statistics. More broadly, we contribute to the large literature on
test inversion. The classical duality between tests and confidence intervals is discussed in Chapter
3 of Lehmann and Romano (2005). Bootstrap approaches to confidence construction based on
estimating nuisance parameters are developed in Efron (1981), DiCiccio and Romano (1990), and
Carpenter (1999). Conservative approaches to confidence interval construction that maximize p-
values over an appropriate nuisance parameter space are considered in Berger and Boos (1994) and
Silvapulle (1996). Gelman and Carpenter (2020) take a Bayesian approach to the problem studied
in this paper. Toulis (2020) uses test inversion based on a particular choice of test statistic, though
the resulting confidence interval is based on projection. Cai et al. (2020) is more closely related to
one of the approaches we consider, and we discuss some important differences in Section 5.7
We begin, in Section 2, by formulating the problem of interest. Section 3 reviews standard
approaches for confidence interval construction, including Wald intervals, the percentile bootstrap,
and projection methods, and documents their mediocre performance when applied to construct
confidence intervals for seroprevalence. Section 4 develops several approaches to confidence in-
terval construction for general parametric problems based on test inversion. These methods are
applied to seroprevalence in Section 5. In Section 6, we outline the application of the methods
considered in Section 4 to several related problems, including inference on vaccine efficacy in ran-
domized clinical trials. Section 7 provides practical recommendations and a summary of results.
2. Problem Formulation
Suppose that we have obtained three samples of individuals of sizes n1, n2, and n3, with n =
(n1, n2, n3)>. The first sample is selected at random from the population of all individuals. We
have prior knowledge that all individuals in the second sample have not had the disease of interest
and that all individuals in the third sample have had the disease of interest. An antibody test is
administered to all individuals in each sample. The data X1, X2, and X3 denote the number of
individuals in the first, second, and third sample that test positive for the disease, respectively. We
letX denote the vector (X1, X2, X3)>. Observed outcomes ofX are denoted by x = (x1, x2, x3)>.
We assume that the results of each test are independent of the results of all other tests and
7We became aware of Cai et al. (2020), which was posted on arXiv on November 29th, 2020, late in the preparationof this paper, on March 3rd, 2021.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 5
that the results of all tests within a given sample are independent and identically distributed.8 In
particular, we assume that
Xid∼ Binomial (ni, pi) (1)
for each i = 1, 2, and 3. In this context, p1 is the probability of a positive test for a randomly
selected individual, p2 is the probability of a positive test conditional on not having had the disease,
and p3 is the probability of a positive test conditional on having had the disease. The quantities
1− p2 and p3 are referred to as the specificity and sensitivity of the test, respectively.
Additionally, we assume that the test has diagnostic value in the sense that p2 < p3.9 That is,
the probability of testing positive is larger for an individual who has had the disease than for an
individual who has not had the disease.10 In this case, p1 necessarily satisfies p2 ≤ p1 ≤ p3. In
sum, the parameter p = (p1, p2, p2)> exists in the parameter space
Ω =p ∈ [0, 1]3 : p2 ≤ p1 ≤ p3, p2 < p3
.
We are interested in estimating a confidence interval for the probability π that an individual
randomly selected from the population of all individuals has had the disease. By the law of total
of probability, we have that
p1 = p2 (1− π) + p3π , (2)
and, therefore, that
π =p1 − p2
p3 − p2
. (3)
We refer to π as seroprevalence.
A natural estimate of seroprevalence is given by
πn =pn,1 − pn,2pn,3 − pn,2
,
where pn = (pn,1, pn,2, pn,3)> and pn,i = Xi/ni is the usual empirical frequency for group i. We let
the maximum likelihood estimator (MLE) of p for the model p ∈ Ω be denoted by pn. Accordingly,
8We assume the sample sizes are small relative to population size so that the difference between sampling with andwithout replacement is negligible.
9Note that some of the methods developed in Section 5 will not require p2 < p3; see footnote 22.10Additionally, it may be desirable to impose the restriction that p2 < p1, i.e., that the probability that a randomly
sampled individual possesses the disease is greater than zero. This restriction reduces the parameter space for π from[0, 1] to (0, 1].
6 CONFIDENCE INTERVALS FOR SEROPREVALENCE
the MLE of π for the model p ∈ Ω is given by
πn =pn,1 − pn,2pn,3 − pn,2
,
where pn = (pn,1, pn,2, pn,3)>. Typically, pn and pn agree, with the exception occurring if pn /∈ Ω.
An explicit formula for pn is given in Appendix A.
Finally, to facilitate further discussion, let
JTn,p(t) = Pp Tn ≤ t .
denote the distribution of the general statistic Tn under p, and also introduce the related quantity
JTn,p(t−) = Pp Tn < t ,which will be of use in computing p-values for tests of the null hypothesis
π = π0 against alternatives of the form π > π0.
3. Standard Interval Constructions
In this section, we outline and measure coverage probabilities for several standard methods for
constructing confidence intervals for seroprevalence.
3.1 Normal and Bootstrap Approximation
The most obvious approach to constructing confidence intervals for π is to approximate the finite-
sample distribution of πn with its limiting normal distribution. Specifically, by the Delta Method
and Slutsky’s Theorem, under model (1) and i.i.d. sampling, we have that
(πn − π0)/√
Vπn (pn)d→ N (0, 1) ,
as min (ni)→∞, where
Vπn(p) =1
(p3 − p2)2σ21 (p1) +
(p1 − p3)2
(p3 − p2)4σ22 (p2) +
(p2 − p1)2
(p3 − p2)4σ23 (p3) , (4)
and σ2i (pi) = pi (1− pi) /ni. This asymptotic approximation suggests the standard Wald interval
CIπn,∆ =(πn ± z1−α/2
√Vπn (pn)
), (5)
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 7
where z1−α/2 is the 1− α/2 quantile of the standard normal cumulative distribution function Φ(·).
We refer to CIπn,∆ as the delta-method confidence interval. Bendavid et al. (2020a) makes use of
this construction.
This interval may perform poorly in finite samples. In particular, the finite-sample distribution
of πn is not necessarily symmetric. For small or large values of the binomial parameter, a Poisson
distribution gives a better approximation than a normal distribution to a binomial distribution.
Moreover, the discreteness in the binomial distribution – with atoms that are of order one over the
square root of the sample size – prevents usual Edgeworth expansions and ensuing second-order
correctness.
To address some of these issues, Bendavid et al. (2020b) apply the percentile bootstrap confi-
dence interval developed in Efron (1981). This interval is given by
CIπn,pb =(cπn (α/2, pn) , cπn (1− α/2, pn)
), (6)
where, for a general test statistic Tn, cTn (1− α, p) denotes the 1− α quantile of JTn,p(t). Typically,
this interval can be shown to be first-order accurate (Efron, 1985), in the sense that the coverage
of the interval is equal to the nominal coverage probability plus a constant of the same order as
the inverse of the square root of the sample size.11 Nevertheless, second-order differences can
have a large effect in small and moderate samples, and the higher-order properties of the percentile
bootstrap have not been established for binomial parameters near the boundary of their parameter
spaces.
Efron (1987) gives a refinement to the percentile bootstrap confidence interval, referred to
as the accelerated bias-corrected, or BCα, confidence interval. The infeasible BCα confidence
interval is given by
CI∗πn,bca =(cπn (Φ (Z (α/2, z0, a)) , pn) , cπn (Φ (Z (1− α/2, z0, a)) , pn)
),
with
Z (α, z0, a) = z0 +(z0 + zα)
1− a (z0 + zα),
where the bias constant z0 and acceleration constant a are unknown. Efron (1987) and DiCiccio
and Romano (1995) discuss approximations to z0 and a, which we define in Appendix B and denote
11In our problem, the relevant quantity scaling this approximation is the minimum sample size min(ni).
8 CONFIDENCE INTERVALS FOR SEROPREVALENCE
by z0 and a. If these approximations are plugged into the infeasible BCα interval, then we obtain
the feasible BCα confidence interval
CIπn,bca =(cπn (Φ (Z (α/2, z0, a)) , pn) , cπn (Φ (Z (1− α/2, z0, a)) , pn)
). (7)
Typically, BCa intervals are second-order accurate, in the sense that the one-sided coverage error
of each endpoint is equal to α/2 plus a constant of the same order as the inverse of the sample
size. However, in our context, such theoretical results are not available due to the discreteness of
the distribution of the data X and potentially small values of the parameter p.
3.2 Projection Intervals
The Wald and bootstrap intervals presented in Section 3.1 are approximate. In contrast, it may be
desirable to construct intervals that ensure coverage of at least 1 − α in finite samples. A well-
known example of such finite-sample valid – or exact – intervals are the Clopper and Pearson
(1934) intervals for the binomial proportion.
A very simple, but crude, approach to constructing finite-sample valid confidence intervals for
seroprevalence is to use projection. In particular, suppose that R1−α is a joint confidence region
for p of nominal level 1 − α. For example, one possible choice of joint confidence region is the
rectangle
R1−α = I1,1−γ × I2,1−γ × I3,1−γ , (8)
where Ij,1−γ is a nominal 1− γ confidence interval for pj , and 1− γ is chosen so that the cartesian
product of marginal confidence intervals jointly contains p with nominal coverage 1 − α. That is,
γ is taken to satisfy (1− γ)3 = 1−α, or γ = 1− (1−α)1/3. In our implementation, we apply the
standard Clopper and Pearson (1934) confidence intervals for pj , as they have guaranteed coverage
in finite-samples.12 Alternatively, the region R1−α can be constructed by inverting likelihood ratio
tests, but would incur a significantly larger computational cost. A related approach is developed in
Toulis (2020).
Given any region R1−α, the projection method simply constructs the confidence interval
I1−α = [ infp∈R1−α
π(p), supp∈R1−α
π(p)] .
12Other choices exist, however, and in particular the intervals recommended in Brown et al. (2001) may performwell.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 9
where π(p) = (p1 − p2)/(p3 − p2) is the value of seroprevalence associated with the parameter p.
Trivially, the chance that π is contained in I1−α is bounded below by the chance that p is contained
in R1−α. Thus, if R1−α has a guaranteed coverage of 1− α, then so does I1−α.
When R1−α is a product of intervals, as in (8), the computation of the projection interval I1−α
is trivial. The reason is that π(p) is monotone increasing in p1 and monotone decreasing in each of
p2 and p3, as p varies on the parameter space Ω. Thus, if the lower and upper confidence limits for
pj are represented as
Ij,1−γ = [ILj,1−γ, IUj,1−γ] ,
then the resulting interval I for π can be easily computed as
I1−α = [π(IL1,1−γ, IU2,1−γ, I
U3,1−γ), π(IU1,1−γ, I
L2,1−γ, I
L3,1−γ)] .
Therefore, the projection interval is easy to implement, but is generally very conservative, in that
the actual coverage is much larger than the nominal level. This results in intervals that are generally
unnecessarily wide.
3.3 Measuring Coverage
To assess the finite-sample performance of the Delta Method, percentile bootstrap, and projection
confidence intervals discussed in Sections 3.1 and 3.2, we estimate by a Monte Carlo experiment
their coverage probabilities and average lengths at parameterizations close to the sample size and
estimates of Bendavid et al. (2020a). We return to this experiment in Section 5, where we measure
the coverage probabilities for the alternative methods considered in Section 4.
Bendavid et al. (2020a) recruited n1 = 3300 participants for serologic testing for SARS-CoV-2
antibodies. The total number of positive tests was x1 = 50. The authors use n2 = 401 pre-
COVID era blood samples to measure the specificity of their test, of which only x2 = 2 samples
tested positive. Similarly, the authors use n3 = 122 blood samples from confirmed COVID-19
patients, of which x3 = 103 samples tested positive.13 These data correspond to an estimated
seroprevalence of πn = 0.012.14 Table 1 reports the nominal 95% confidence intervals computed
13The specificity and sensitivity samples combine data provided by the test manufacturer and additional tests run atStanford. We refer the reader to the statistical appendix of Bendavid et al. (2020a) for further details.
14 Bendavid et al. (2020a) report an alternative estimate of seroprevalence in which the demographics of theirsample are weighted to match the overall demographics of Santa Clara County. We briefly discuss the applicationof the general methods developed in Section 4 to this setting in Section 6, and view further consideration as a usefulextension. Gelman and Carpenter (2020) give a Bayesian approach that accommodates sample weights. Cai et al.(2020) also address the case where are samples are reweighted according to population characteristics.
10 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Interval Ave. Length Ave. Length vs. Delta Method Coverage
Delta Method [0.003,0.022] 0.0185 1.000 0.904
Percentile Bootstrap [0.001,0.021] 0.0186 1.005 0.895
BCα Bootstrap [0.001,0.020] 0.0191 1.028 0.895
Projection [0.001,0.028] 0.0270 1.4578 1.000
Table 1: Average Interval Length and Coverage of Nominal 95% Confidence Intervals for π
Notes: Table 1 reports the Delta Method, percentile bootstrap, BCα bootstrap, and projection confidence intervals,at nominal level 95% computed on data from Bendavid et al. (2020a). Estimates of the average length and coveragefor these intervals at sample sizes n and estimated values pn from this study are also displayed. Estimates of averagelength and coverage are taken over 100,000 bootstrap replicates ofX at the sample size n and the estimated parameterspn from this study.
by the Delta Method, percentile bootstrap, BCα bootstrap, and projection method for the observed
values from Bendavid et al. (2020a).
For each parameter (e.g., p1), we take a large number of replicates of X at each value of an
evenly spaced grid around the estimated value of the parameter (e.g., 0.001, . . . , pn,1, . . . , 2 · pn,1),
holding the other five parameters (e.g., p2, p3, n1, n2, n3) fixed at their observed values (e.g., pn,2,
pn,3, n1, n2, n3).15 For each method at each combination of parameter values and sample sizes, we
compute the proportion of replicates for which the true value of π (i.e., the value of π associated
with the parameterization) is below, contained in, or above the corresponding confidence interval
with nominal coverage probability α = 0.05.
Figure 1 displays the results of this Monte Carlo experiment for the Delta Method, percentile
bootstrap, BCα bootstrap, and projection intervals at parameter values around pn,1, pn,2, n1, and
n2.16 We note that the black dots display one minus the proportion of replicates for which the
realized confidence interval contains the true value of π, i.e., one minus the estimated coverage of
the confidence interval. Additionally, Table 1 reports estimates of the coverage and average length
of each interval taken over 100,000 bootstrap replicates at the sample sizes n and estimated values
pn from Bendavid et al. (2020a).
We find that the Delta Method, percentile bootstrap, and BCα bootstrap intervals are quite
15Due to the computational simplicity of the intervals discussed in this section, we measure coverage with 100,000replications. We use 10,000 replications when we return to this experiment in Section 5.
16There is little variation in the coverage probabilities for parameter values around p3 and n3, so we omit the resultsof this experiment for the sake of clarity.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 11
Figure 1: Estimated Coverage Levels of Nominal 95% Confidence Intervals
p1 p2 n1 n2
πn,∆
πn,pb
πn,bca
Projection
0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800
0.0
0.1
0.2
0.0
0.1
0.2
0.0
0.1
0.2
0.0
0.1
0.2
1 − Coverage Lower Upper
Notes: Figure 1 displays estimates of the coverage probabilities of the Delta Method (πn,∆), percentile bootstrap(πn, pb), BCα bootstrap (π, bca), and projection intervals at parameter values close to the estimate pn and samplesize n of Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minus the proportionof replicates for which the true value of π falls in the realized confidence intervals, i.e., one minus the estimatedcoverage probability. The purple squares and blue triangles denote the proportion of replicates that fall below andabove realized confidence intervals, respectively. The vertical dotted line denotes the estimated value of pn,1, pn,1 orsample size n1, n2 for Bendavid et al. (2020a). The horizontal dotted line denotes one minus the nominal coverage of0.95.
liberal for many parameterizations. In fact, in most cases, the estimated coverage of a nominal 95%
interval is below 90%. Moreover, the estimated coverage decreases sharply as n2 and p1 become
small and is, in most cases, not equi-tailed, in the sense that the proportions of replicates that fall
below and above the confidence intervals are not approximately equal. By contrast, the projection
method is quite conservative. They are approximately 45% longer than the Delta Method intervals
at sample sizes n and estimated values pn from Bendavid et al. (2020a). These findings motivate
the development of alternative methods for constructing confidence intervals that have less erratic
coverage probabilities or ensure coverage at least at the nominal value.
12 CONFIDENCE INTERVALS FOR SEROPREVALENCE
4. Test Inversion
In this section, we consider both approximate and finite-sample valid approaches to the general
problem of constructing test-inversion confidence intervals for parameters θ where the correspond-
ing null hypothesis H (θ0) : θ = θ0 is composite.
To this end, we introduce a more general notation. Suppose that the data X follow a general
parametric model indexed by a parameter (θ, ϑ). The parameter of interest θ is real-valued, tak-
ing values in possibly some strict subset of the real line, and the nuisance parameter ϑ is finite
dimensional. The parameter (θ, ϑ) lies in a parameter space Ω, which need not be a product space,
i.e., the range of ϑ may depend on θ. For a fixed value θ0, the parameter space for the nuisance
parameter ϑ is denoted by
Ω(θ0) = (θ, ϑ) ∈ Ω : θ = θ0 .
Observe that for the case of seroprevalence π, we have that θ = π and can assign ϑ = (p1, p3).
Consider a test of the null hypothesis H (θ0) : θ = θ0 against the alternative θ > θ0 that rejects
for large values of the test statistic Tn. A 1 − α/2 lower confidence bound for θ constructed with
test inversion is given by
L1−α/2 = inf θ0 : H (θ0) is not rejected at level α/2 against the alternative θ > θ0 .
Similarly, a 1− α/2 upper bound for θ constructed with test inversion is given by
U1−α/2 = sup θ0 : H (θ0) is not rejected at level α/2 against the alternative θ < θ0 .
Thus, a two-sided 1 − α confidence interval is given by[L1−α/2, U1−α/2
]. Test inversion reduces
the problem of confidence interval construction for θ to the problem of testing H(θ0) : θ = θ0
against θ > θ0 and θ < θ0.
4.1 Simple Null Hypotheses
It is illustrative to assume that the nuisance parameter ϑ = ϑ0 is known. In this case, the null hy-
pothesis H(θ0) is simple, and so one-sided tests that control the level at α/2 are easily constructed.
Indeed, consider the test that rejects H(θ0) against θ > θ0 for large values of the test statistic
Tn = Tn(X), where the subscript n denotes the sample size and may be a vector. The cumulative
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 13
distribution function of Tn is given by
Fn,θ,ϑ(t) = F Tn,θ,ϑ(t) = Pθ,ϑTn ≤ t , (9)
where we use a superscript T to indicate dependence on the choice of test statistic sequence T =
Tn. Additionally, define the related quantity
Fn,θ,ϑ(t−) = F Tn,θ,ϑ(t−) = Pθ,ϑTn < t , (10)
and let t0 denote the observed value of Tn.
The null probability that Tn ≥ t0 is given by
qL,θ0,ϑ0 = 1− Fn,θ0,ϑ0(t−0 ) , (11)
and is a valid p-value in the sense that the test that rejects when this quantity is ≤ α/2 has size
≤ α/2; see, e.g., Lemma 3.3.1 in Lehmann and Romano (2005).17 Thus, a 1 − α/2 confidence
set for θ includes all θ0 such that Fn,θ0,ϑ0(t−0 ) < 1− α/2. If Fn,θ0,ϑ0(t
−0 ) is continuous and strictly
monotone decreasing in θ0, then a 1−α/2 lower confidence bound θL may be obtained by solving
Fn,θL,ϑ0(t−0 ) = 1− α/2 . (12)
Similarly, the quantity
qU,θ0,ϑ0 = Fn,θ0,ϑ0(t) (13)
is a valid p-value for testing H(θ0) against θ < θ0 and a 1− α/2 upper confidence bound θU may
be obtained by solving
Fn,θU ,ϑ0(t0) = α/2 . (14)
Thus, [θL, θU ] is a 1−α confidence interval for θ.18 For a single binomial parameter, this construc-
tion leads to the Clopper and Pearson (1934) interval.19
17Throughout, we will denote various p-values by q rather than p because p is reserved for various estimates ofbinomial parameters.
18Note one may wish to test H(θ0) at each endpoint of the confidence interval to determine whether it should be aclosed or open interval, but we simply take the conservative approach and make it closed.
19Even in the case that the distribution of Tn is discrete, the function Fθ0,ϑ0(t−0 ) is typically continuous in θ0 (as inthe binomial case). If not, one could use the infimum over θ0 such that Fθ0,ϑ0(t−0 ) < 1− α/2 as a lower bound.
14 CONFIDENCE INTERVALS FOR SEROPREVALENCE
4.2 An Approximate Approach
The confidence interval construction developed in the previous section assumes that the nuisance
parameter ϑ is known, and is therefore infeasible. In this section, we discuss a feasible approach
that uses an approximation to ϑ. In particular, if ϑ(θ0) is the MLE for ϑ subject to the constraint
θ = θ0, then the infeasible p-values (11) and (13) can be replaced with
qL,θ0,ϑ(θ0) = 1− Fn,θ0,ϑ(θ0)(t−0 ) and qU,θ0,ϑ(θ0) = Fn,θ0,ϑ(θ0)(t0), (15)
respectively, where Fn,θ0,ϑ(θ0) is approximated either analytically or with the parametric bootstrap.
Accordingly, the infeasible confidence interval [θL, θU ] is replaced with the feasible confidence
interval [θL, θU ], where, analogously to (12) and (14), the endpoints θL and θU are the values of θ0
that satisfy
Fn,θ0,ϑ(θ0)(t−0 ) = 1− α/2 and Fn,θ0,ϑ(θ0)(t0) = α/2 , (16)
respectively. In other words, either Wald or parametric bootstrap tests are constructed for each θ0,
where the distribution of the test statistic Tn is determined under the parameter (θ0, ϑ(θ0)). This
approach was used in DiCiccio and Romano (1990) and DiCiccio and Romano (1995).
In this approximate approach, the family of distributions indexed by (θ, ϑ) has been reduced
to an approximate least favorable one-dimensional family of distributions governed by the pa-
rameter (θ0, ϑ(θ0)) as θ0 varies. The reduction, in principle, allows one to apply any method for
construction of confidence intervals for a one-dimensional parameter. This approach implicitly
orthogonalizes the parameter of interest with respect to the nuisance parameter, so that the effect
of estimating the nuisance parameter is negligible to second-order, which then typically results in
second-order accurate confidence intervals. See Cox and Reid (1987) for a discussion of the role
of orthogonal parameterizations for inference about a scalar parameter in the presence of nuisance
parameters.
4.3 An Infeasible Finite-Sample Approach
The interval endpoints obtained in the previous section, defined in (16), are approximate as the
MLE ϑ(θ0) is an approximation to the true value of the nuisance parameter ϑ0. As a result, the
coverage probability of an interval of this form will depend on the quality of this approximation,
which may be uncertain in small samples.
By contrast, intervals that ensure coverage of at least 1− α in finite samples may be desirable.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 15
An infeasible approach to constructing intervals with test inversion that achieves this guarantee
proceeds by taking the supremum of the p-values over all possible values of the nuisance compo-
nent ϑ, giving
qL,θ0,sup = sup(θ0,ϑ)∈Ω(θ0)
1− Fn,θ0,ϑ(t−0 ) and qU,θ0,sup = sup(θ0,ϑ)∈Ω(θ0)
Fn,θ0,ϑ(t0). (17)
These p-values are clearly valid in finite samples. For example, if ϑ0 is the true value of ϑ, then we
have that
Pθ0,ϑ0qL,θ0,sup ≤ u = Pθ0,ϑ0 sup(θ0,ϑ)∈Ω(θ0)
1− Fn,θ0,ϑ(T−n ) ≤ u (18)
≤ Pθ0,ϑ01− Fn,θ0,ϑ0(T−n ) ≤ u ≤ u .
Note, however, that these p-values may be conservative, as the supremum over ϑ may be obtained
at a value far from ϑ0. On the other hand, if the distribution of the test statistic does not vary much
with ϑ, then these p-values will not be overly conservative. Therefore, it pays to choose a test
statistic that is nearly pivotal, in the sense that its distribution does not depend heavily on ϑ.
The p-values defined in (17) may be impractically conservative if the distribution of Tn is highly
variable with ϑ. To address this issue, one can restrict the space of values for ϑ that are considered
by first constructing a 1−γ confidence region for ϑ. Such an approach is considered in Berger and
Boos (1994) and Silvapulle (1996). This idea has also been applied to nonparametric contexts; see
Romano et al. (2014) and further references therein.
This refined approach proceeds as follows. Fix a small number γ and let I1−γ be a 1 − γ
confidence region for ϑ. Consider the following modified p-values defined by
qL,θ0,I1−γ =
supϑ∈I1−γ 1− Fn,θ0,ϑ(t−0 ) + γ if ϑ ∈ I1−γ : (θ0, ϑ) ∈ Ω(θ0) 6= ∅
γ otherwise(19)
and
qU,θ0,I1−γ =
supϑ∈I1−γ Fn,θ0,ϑ(t0) + γ if ϑ ∈ I1−γ : (θ0, ϑ) ∈ Ω(θ0) 6= ∅
γ otherwise.(20)
The one-sided test that rejects when qL,θ0,I1−γ ≤ α/2 leads to a 1 − α/2 lower confidence bound
for θ. Note that it is possible that the region I1−γ may not include any ϑ for which (θ0, ϑ) ∈ Ω(θ0),
16 CONFIDENCE INTERVALS FOR SEROPREVALENCE
in which case the modified p-value is defined to be γ, i.e., the null hypothesis H(θ0) is rejected at
any level α ≥ γ if the confidence region I1−γ for ϑ is completely incompatible with θ = θ0.20
The p-values obtained by (19) and (20) are valid in finite samples. The following result follows
from the lemma presented in Section 2 of Berger and Boos (1994) or the theorem presented in
Section 2 of Silvapulle (1996).
Lemma 4.1 The p-values qL,θ0,I1−γ and qU,θ0,I1−γ are valid in the sense that, for all ϑ and 0 ≤ u ≤1, we have that
Pθ0,ϑqL,θ0,I1−γ ≤ u ≤ u and Pθ0,ϑqU,θ0,I1−γ ≤ u ≤ u . (21)
PROOF OF LEMMA 4.1: The event qL,θ0,I1−γ ≤ u implies that either
1− Fn,θ0,ϑ(T−n ) ≤ u− γ
or ϑ /∈ I1−γ . The latter event has probability ≤ γ and the former event has probability ≤ u − γ.
An identical argument holds for the event qU,θ0,I1−γ ≤ u.
4.4 A Feasible Finite-Sample Approach
The finite-sample approach defined in Section 4.3 is infeasible, as it involves computing a supre-
mum over an infinite set Ω(θ0). A natural approximation to this approach is to discretize Ω(θ0), so
that the supremum is replaced by maximum over a finite set of values on a grid. In particular, if
G = G(θ0) denotes a finite grid over the space Ω(θ0), then (17) can be approximated by
qL,θ0,max = max(θ0,ϑ)∈G(θ0)
1− Fn,θ0,ϑ(t−0 ) and qU,θ0,max = max(θ0,ϑ)∈G(θ0)
Fn,θ0,ϑ(t0) , (22)
respectively. Similarly, the refinement given in (19) and (20) can be approximated by replacing
I1−γ with I1−γ , where I1−γ denotes a finite grid (or ε-net) approximating I1−γ . In contemporaneous
work, Cai et al. (2020) use this construction to develop confidence intervals for seroprevalence that
ensure finite-sample Type 1 error control up to the error induced by the finiteness of G = G(θ0).
In this section, we develop a modification to this construction that provably maintains finite-
20Note that one may let I1−γ = I1−γ(θ0) depend on θ0, so that I(θ0) denotes a confidence region for ϑ with θfixed at θ0. The choice of I may also depend on whether or not one is constructing a lower confidence bound or upperconfidence bound.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 17
sample Type 1 error control for testing H(θ0), and thereby maintains control over the level of
confidence sets obtained through test inversion by directly accounting for the approximation error
induced by a finite discretization of Ω(θ0). Toward this end, we introduce further structure. Sup-
pose that the components of the data X = (X1, . . . , Xk)> are independent, that the distribution of
Xi depends on a parameter βi, and that the family of distributions forXi has a monotone likelihood
ratio in Xi. Assume further that interest focuses on a real-valued parameter θ = f(β1, . . . , βk) and
that the nuisance parameters are given by ϑ = (β2, . . . , βk). The model can equivalently be thought
of as being parametrized by (θ, ϑ) ∈ Ω as before, or through β = (β1, . . . , βk) ∈ Ω. Additionally,
define
Ω(θ0) = β ∈ Ω : f(β) = θ0 .
In the context of seroprevalence, θ = π.
As before, let Tn = Tn(X1, . . . , Xk) be a test statistic for testing H(θ0), with t0 denoting its
realized value. Note that Tn could itself depend on θ0, which is seen below to occur for seropreva-
lence. Assume that Tn is montone with respect to each each component Xi. It is without loss of
generality, then, to assume that Tn is monotone increasing with respect to each component Xi.21
Let Jn,β(·) denote the cumulative distribution function of Tn for the β- parametrization, so that
Jn,β(t) = Fn,θ,ϑ(t) = PβTn ≤ t ,
and let β(θ0) denote the MLE for β subject to the constraint that θ = θ0. In this case, for example,
we can also represent the bootstrap-based inverse testing p-value defined in Section 4.2 with
qU,θ0,ϑ(θ0) = Jn,β(θ0)(t0) ,
and similarly for the other p-values previously introduced.
The objective is to replace the supremum over I1−γ in (19) and (20) with a finite maximum
while maintaining Type 1 error control in finite samples. Toward this end, consider a partition of
the values of ϑ in I1−γ into r regions E1, . . . , Er. In our implementation, each region is given by a
hyperrectangle of the form
[β′2, β′′2 ]× · · · × [β′k, β
′′k ] , (23)
21If Tn is monotone decreasing with respect to a particular component, say Xj , then Xj can be replaced by −Xj ,whose family of distributions is then monotone increasing with respect to −βj .
18 CONFIDENCE INTERVALS FOR SEROPREVALENCE
though this is not necessary. Note that each region may not lie entirely in the parameter space for
ϑ. For region Ej , let (β2(j), . . . , βk(j)) be the vector giving the smallest value that all but the first
component of β takes on in Ej; that is
βi(j) = infβi : (β2, . . . , βk) ∈ Ej .
Similarly, let (β2(j), . . . , βk(j)) be the vector giving the largest value that all but the first compo-
nent of β takes on in Ej; that is,
βi(j) = supβi : (β2, . . . , βk) ∈ Ej .
For a hyperrectangle of the form (23), clearly βi(j) = β′i and βi(j) = β′′i . Congruently, let
β1(j) = infβ1 : β ∈ Ω(θ0), (β2, . . . , βk) ∈ Ej (24)
and
β1(j) = supβ1 : β ∈ Ω(θ0), (β2, . . . , βk) ∈ Ej . (25)
denote the smallest and largest values that the first component of β takes on Ω(π0) subject to the
constraint that (β2, . . . , βk) in Ej . If the infimum or supremum in (24) or (25) is over a non-empty
set, then define sL(j) = Jn,β(j)(t−) and sU(j) = Jn,β(j)(t), where
β(j) = (β1(j), . . . , βk(j)) and β(j) = (β1(j), . . . , βk(j)) .
Otherwise, if there is no β in Ω(π0) with (β2, . . . , βk) in Ej , then set sL(j) = 1 and sU(j) = 0,
respectively.
We construct the p-values
qL,θ0,I1−γ = max1≤j≤r
(1− sL(j)) + γ (26)
and
qU,θ0,I1−γ = max1≤j≤r
sU(j) + γ (27)
by taking the maximum over the adjusted p-values 1−sL(j) and sU(j) . This refinement is feasible
and valid in finite samples.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 19
Theorem 4.1 Assume that the components of the data X = (X1, . . . , Xk)> are independent, that
each component Xi has distribution in a family having a monotone likelihood ratio, and that the
statistic Tn = Tn(X1, . . . , Xk) is monotone increasing with respect to each component Xi. Let
I1−γ be a 1 − γ confidence region for (β2, . . . , βk). Then, the p-values qL,θ0,I1−γ and qU,θ0,I1−γ are
valid for testing H(θ0) in the sense that, for any 0 ≤ u ≤ 1 and any ϑ,
Pθ0,ϑqL,θ0,I1−γ ≤ u ≤ u and Pθ0,ϑqU,θ0,I1−γ ≤ u ≤ u .
PROOF OF THEOREM 4.1. First, note that I1−γ could be the whole space Ω(θ0) by taking γ = 0.
It follows from Lemma A.1 in Romano et al. (2011) (which is a simple generalization of Lemma
3.4.2 in Lehmann and Romano (2005)) that the family of distributions of Tn satisfy, for any t,
β = (β1, . . . , βk), β′ = (β′1, . . . , β′k) with β′i ≥ βi for all i ≥ 1:
Jn,β′(t) ≤ Jn,β(t) . (28)
The same is true if t is replaced by t−. Thus, we have that Jn,β(j)(t−) ≤ Jn,β(t−) and Jn,β(j)(t) ≥
Jn,β(t) for any β with θ = f(β) = θ0 and (β2, . . . , βk) ∈ Ej . Therefore, the p-values defined by
(19) and (20) satisfy
qL,θ0,I1−γ ≤ max1≤j≤r
(1− sL(j)) + γ = qL,θ0,I1−γ and
qU,θ0,I1−γ ≤ max1≤j≤r
sU(j) + γ = qU,θ0,I1−γ .
Since qL,θ0,I1−γ and qU,θ0,I1−γ are valid p-values (by Lemma 4.1), then so are any random variables
that are stochastically larger.
Thus, tests based on the p-values qL,θ0,I1−γ and qU,θ0,I1−γ may be used to testH(θ0) and, through
test inversion, yield finite-sample valid confidence bounds for θ.
5. Test-Inversion Inference for Seroprevalence
In this section, we apply the methods considered in Section 4 to the problem of constructing ap-
proximate and finite-sample valid confidence intervals for seroprevalence.
20 CONFIDENCE INTERVALS FOR SEROPREVALENCE
5.1 Test Statistics
We begin by exhibiting a set of test statistics Tn applicable to our problem. To this end, we must
first consider estimation of p unconditionally and under the restriction that π = π0. In Appendix
A, we outline the maximum likelihood estimation of p under the restrictions that p ∈ Ω and that
p ∈ Ω(π0) = p ∈ Ω : π = π0 .
We let pn and pn(π0) denote the resultant estimators, respectively.
A natural choice for the test statistic Tn is the difference between πn and π0. Moreover, this
statistic can be Studentized with an estimate of its standard deviation, giving the test statistic
πn (π0) = (πn − π0)/√
Vπn (pn) ,
where Vπn(p) is given in (4). Similarly, πn−π0 can be Studentized with an estimate of its standard
deviation under the constraint π = π0, giving the test statistic
πn,C (π0) = (πn − π0)/√
Vπn (pn(π0)).
Observe that as p1 = p2 (1− π) + p3π, we can rewrite the condition π0 = π as the linear
restriction b (π0)> p = 0 where b (π0) = (1,− (1− π0) ,−π0)> . This observation suggests con-
sideration of the linear test statistic φn (π0) = b(π0)>pn. Moreover, as the variance of φn (π0) is
exactly equal to
Vφn(π0) (p) = σ21 (p1) + (1− π0)2 σ2
2 (p2) + π20σ
23 (p3) , (29)
and can be estimated with the plug-in estimator Vφn(π0) (pn), the test statistic φn (π0) can be Stu-
dentized, giving the alternative test statistic
φn (π0) = φn (π0)/√
Vφ(π0) (pn).
In turn, we can studentize φn (π0) with an estimate of its variance under the restriction π0 = π,
giving the statistic
φn,C (π0) = φn (π0)/√
Vφ(π0) (pn(π0)).22
22Observe that, additionally, φn (π0) is well-defined if p2 = p3, and so φn (π0), φn (π0), or φn.C (π0) may bedesirable choices in situations where p2 is close to p3.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 21
Alternatively, we can use statistics based on the likelihood function. In particular, let L (p | x)
be the likelihood function given by
L (p | x) =∏
1≤i≤3
(nixi
)pxii (1− pj)ni−xi .
The generalized likelihood ratio test statistic for testing H(π0) : π = π0 is given by
Wn = Wn (π0) = 2 ·supp∈Ω
logL (p | X)
supp∈Ω(π0)
logL (p | X)
= 2 ·∑
1≤j≤3
(Xj log
(pn,j
pn,j(π0)
)+ (nj −Xj) log
(1− pn,j
1− pn,j(π0)
)).
Large values of Wn give evidence for both π < π0 and π > π0. To address this issue, we also
consider the signed root likelihood ratio statistic for the restriction π = π0, given by
Rn = Rn (π0) = sign (πn − π0) ·√Wn (π0),
Corrections to improve the accuracy of Wn based on its signed square root Rn have a long his-
tory; see Lawley (1956), Barndorff-Nielsen (1986), Fraser and Reid (1987), Jensen (1986), Jensen
(1992), DiCiccio et al. (2001) and Lee and Young (2005). Frydenberg and Jensen (1989) consider
the effect of discreteness on the efficacy of corrections to improve asymptotic approximations to
the distribution of the likelihood ratio statistic. The statistic Rn can be re-centered and Studentized
as
Rn =(Rn −mR
n (pn(π0))) /√
V Rn (pn(π0)) ,
wheremRn (p) and V R
n (p) denote the mean and variance ofRn under p and in practice are computed
with the bootstrap under pn(π0).
5.2 Approximate Intervals
We now outline the application of the approximate intervals considered in Section 4.2 and measure
the performance of these intervals in the Monte Carlo experiment developed in Section 3.3. Sup-
pose that we are using the test statistic Tn with observed value t0. The approximate test-inversion
22 CONFIDENCE INTERVALS FOR SEROPREVALENCE
intervals are constructed by first computing, for each π0, the p-values
qL,π0,pn(π0) = 1− JTn,pn(π0)(t−0 ) and qU,π0,pn(π0) = JTn,pn(π0)(t0).
The resultant interval with nominal coverage 1− α then takes the form
π0 : qL,π0,pn(π0) ≥ α/2 and qU,π0,pn(π0) ≥ α/2
.
We begin by considering asymptotic approximations to the distribution JTn,pn(π0) for different test
statistics.
Observe that, under the null hypothesis H (π0), the test statistics πn,C(π0), φn,C (π0), and Rn
are asymptotically N (0, 1), and Wn is asymptotically χ21. Thus, if we set Tn equal to any of the
asymptotically normal statistics, we can approximate Jn,pn(π0) with a standard normal distribution.
Likewise, we may apply a χ21 approximation if we set Tn equal to Wn.23
Table 2 reports realizations of these approximate confidence intervals for the test statistics
πn,C(π0), φn,C (π0), Wn, and Rn for the observed values from Bendavid et al. (2020a). Addi-
tionally, Table 2 presents estimates of the coverage and average interval length taken over 10,000
bootstrap replicates computed at the n and estimate pn from this study. Notably, each of these
intervals now includes zero.24 Intervals constructed using πn,C(π0) and φn,C are approximately
3-4% longer, on average, than the Delta Method intervals. Intervals constructed using Wn and
Rn are approximately 3-4% shorter, on average, than the Delta Method intervals. Each interval
construction has coverage probability significantly closer to the nominal level.
Figure 2 displays estimates of the coverage probabilities for these intervals in the Monte Carlo
experiment developed in Section 3.3. Recall that the black dots display one minus the estimates
of the coverage probabilities of the respective intervals, and that the purple squares and blue tri-
angles display the proportion of the replicates in which the true value of π falls below and above
the realized confidence interval, respectively. In contrast to the results for the standard methods
displayed in Figure 1, we estimate that the coverage probabilities for these methods are very close
to the nominal value of 95% for most parameterizations. The intervals constructed with πn,C (π0)
23Observe that using normal approximations to πn or πn is equivalent to constructing Wald intervals for thesestatistics. For that reason, we focus on statistics that make explicit use of the null hypothesis restriction π = π0.
24If the null hypothesis π = 0 is of particular interest or concern, then there exists an exact uniformly most powerfulunbiased level α test for the equivalent problem of testing p1 = p2 against p1 > p2. This is a conditional one-sidedbinomial test; see Section 4.5 of Lehmann and Romano (2005). Such a test does not exist for other values of π0.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 23
Statistic Interval Ave. Length Ave. Length vs. Delta Method Coverage
πn,C [0.000,0.020] 0.0193 1.0404 0.963
φn,C [0.000,0.020] 0.0191 1.0302 0.963
Wn [0.000,0.021] 0.0177 0.9563 0.927
Rn [0.000,0.021] 0.0181 0.9758 0.950
Table 2: Average Interval Length and Coverage for Test-Inversion Nominal 95% Confidence In-tervals for Seroprevalence Using Asymptotic Approximation
Notes: Table 2 reports the approximate test-inversion confidence intervals, constructed with an asymptotic approxima-tion to test statistic null distributions, computed on data from Bendavid et al. (2020a). Estimates of the average lengthand coverage for these intervals at the n and estimate pn from this study are also displayed. Estimates of averagelength and coverage are taken over 10,000 bootstrap replicates of X at the sample size n and the estimated parameterspn from this study.
and φn,C (π0) are not particularly equi-tailed, with the true value of π falling above the realized
confidence interval more frequently than it falls below. The intervals constructed with Wn and Rn
are more equi-tailed, but the overall coverage probability appears more sensitive to perturbations
in p2 and n2.
Next, we refine this approach by directly computing Jn,pn(π0) with the bootstrap. This method
is more accurate, but can be significantly more computationally expensive. Table 3 reports real-
izations of these approximate confidence intervals for the test statistics πn, πn,C(π0), φn, φn,C(π0),
Wn, and Rn computed on data from Bendavid et al. (2020a), in addition to estimates of the cov-
erage and average interval length at the n and estimate pn for this study. Again, each of these
intervals include zero, are roughly the same length as the delta-method intervals, and have cover-
age probability close to the nominal level.
Figure 3 displays estimates of the coverage probabilities for these intervals for the test statistics
πn, πn,C(π0), φn, φn,C(π0), Wn, and Rn in the same Monte Carlo experiment developed in Section
3.3. Similarly to Figure 2, these intervals have coverage close to the nominal value and are approx-
imately equi-tailed. In particular, the interval constructed with Rn is equi-tailed and appears to be
the least sensitive to perturbations in p2 and n2.
24 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Figure 2: Coverage for Test-Inversion 95% Intervals Based on Asymptotic Approximation
p1 p2 n1 n2
π~n,C,aa
φ~n,C,aa
Wn,aa
R~
n,aa
0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800
0.00
0.05
0.00
0.05
0.00
0.05
0.00
0.05
1 − Coverage Lower Upper
Notes: Figure 2 displays estimates of the coverage probabilities of the approximate confidence intervals constructedwith an asymptotic approximation to test statistic null distributions. Estimates of the coverage for the interval con-structed with the test statistic Tn are denoted by “Tn, aa” and are computed at parameter values close to the estimatespn and sample size n of Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minusthe proportion of replicates for which the true value of π falls in the realized confidence intervals, i.e., one minusthe estimated coverage probability. The purple squares and blue triangles denote the proportion of replicates that fallbelow and above realized confidence intervals, respectively. The vertical dotted line denotes the estimated value ofpn,1, pn,2, or sample n1, n2 for Bendavid et al. (2020a). The horizontal dotted line denotes one minus the nominalcoverage of 0.95.
5.3 Finite-Sample Valid Intervals
We now turn to the application of the finite-sample valid intervals discussed in Section 4.4 to
seroprevalence. We focus our development on the test statistic φn(π0) as it is linear, and therefore
monotone with respect to each sample Xi, as is required.25
To begin, we must choose a partition of the parameter space Ω into a parameter of interest and
a nuisance component. Recall from the discussion in Section 4.3, that finite-sample exact intervals
formed by maximizing p-values over a nuisance space perform best if the distribution of the chosen
25Alternatively, one may also consider the test statistic πn = f(pn), where f(p) = (p1 − p2)/(p3 − p2). Then, fis monotone with respect to each component as long as p ∈ Ω.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 25
Figure 3: Coverage for Test-Inversion Nominal 95% Confidence Intervals Based on the Bootstrap
p1 p2 n1 n2
πn,boot
π~n,boot
φn,boot
φ~n,boot
Wn,boot
Rn,boot
0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800
0.00
0.02
0.04
0.06
0.00
0.02
0.04
0.06
0.00
0.02
0.04
0.06
0.00
0.02
0.04
0.06
0.00
0.02
0.04
0.06
0.00
0.02
0.04
0.06
1 − Coverage Lower Upper
Notes: Figure 3 displays estimates of the coverage probabilities of the approximate confidence intervals constructedwith the bootstrap. Estimates of the coverage for the interval constructed with the test statistic Tn are denoted by“Tn, boot” and are computed at parameter values close to the estimates pn and sample size n of Bendavid et al.(2020a) as specified in Section 3.3. The black dots denote the one minus the proportion of replicates for which thetrue value of π fall in the realized confidence intervals, i.e., one minus the estimated coverage probability. The purplesquares and blue triangles denote the proportion of replicates that fall below and above realized confidence intervals,respectively. The vertical dotted line denotes the estimate pn,1, pn,2 or sample size n1, n2 for Bendavid et al. (2020a).The horizontal dotted line denotes one minus the nominal coverage of 0.95.
26 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Statistic Interval Ave. Length Ave. Length vs. Delta Method Coverage
πn [0.000,0.021] 0.0189 1.0186 0.965
φn [0.000,0.021] 0.0188 1.0133 0.962
πn [0.000,0.021] 0.0188 1.0150 0.953
φn [0.000,0.021] 0.0188 1.0141 0.953
Wn [0.000,0.021] 0.0181 0.9788 0.946
Rn [0.000,0.021] 0.0186 1.0035 0.951
Table 3: Average Interval Length and Coverage for Approximate Test-Inversion Nominal 95%Confidence Intervals Based on the Bootstrap
Notes: Table 3 reports the approximate test-inversion confidence intervals, constructed with a parametric bootstrapapproximation to null distributions of test statistics, computed on data from Bendavid et al. (2020a). Estimates of theaverage length and coverage for these intervals at sample size n and estimate pn from this study are also displayed.Estimates of average length and coverage are taken over 10,000 bootstrap replicates of X at the sample size n andestimate pn from this study.
test statistic does not vary much with the nuisance parameter. For small values of π0, the variance
of φn(π0), denoted by Vφn(π0)(π0) and specified in (29), is insensitive to changes in p3, as the
variance σ23(p3) enters into Vφn(π0)(π0) linearly and scaled by π2
0 . For sample sizes comparable to
the measurements taken in Bendavid et al. (2020a), where n1 is much larger than n2, the variance
Vφn(π0)(π0) will be less sensitive to changes in p1 than to changes in p3. Thus, we set the nuisance
component ϑ = (p1, p3), giving the parameterization (π, ϑ).
We note that for different sample sizes, it may be attractive to set the nuisance component
ϑ = (p2, p3). For example, Bendavid et al. (2020b) – the April 27th draft of Bendavid et al. (2020a)
– includes larger sensitivity and specificity samples n2 and n3. In particular, the specificity sample
n2 was increased from 401 to 3324.26,27
To fix ideas, consider Figure 4, which displays a heat-map of the 0.975 quantile of φn(π0)
under different parameterizations (π0, ϑ), with the sample size and the null hypothesis restriction
26The choice ϑ = (p2, p3) also has computation advantages. In particular, by the identity p1 = p2 (1− π) + p3π,for any value of seroprevalence π0 and any values of p2 and p3 satisfying p2 < p3, there is a value of p1 that satisfiesp2 ≤ p1 ≤ p3 such that π0 = (p1 − p2)/(p3 − p2). That is, any value of (p2, p3) corresponds to a unique value of p1consistent with a given value of π0 and satisfying the a priori restrictions on the parameter space Ω.
27These additional data were are aggregated over several samples taken at different times and locations. Gelman(2020), Fithian (2020), and Bennett and Steyvers (2020) highlight issues with this aggregation.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 27
Figure 4: Bootstrap Quantiles and Initial Nuisance Parameter Confidence Region
0.7
0.8
0.9
1.0
0.01 0.02 0.03 0.04 0.05p1
p3
0.975 Quantile of Bootstrap Distribution
0.003 0.005 0.007 0.008 0.01 0.012 0.013 0.015 0.017 0.018 0.02
Notes: Figure 4 displays a heat-map of the 0.975 quantile of φn(π0) under different parameterizations (π0, ϑ), wherethe sample size and null hypothesis restriction π0 equal to the sample size and estimated prevalence πn from Bendavidet al. (2020a). The black line exhibits the boundary of the parameter space Ω(π0). The black square dot denotes theconstrained MLE ϑ(π0) = (pn,1(π0), pn,3(π0)). With γ = 0.001, the grey rectangle denotes a 1 − γ confidenceregion for ϑ constructed by taking the cartesian product of two
√1− γ level confidence regions for p1 and p3 each
constructed with the method of Clopper and Pearson (1934). The white dots denote a 10× 10 grid over this space.
π0 equal to the sample size and estimated prevalence πn from Bendavid et al. (2020a). The square
black dot denotes the constrained MLE ϑ(π0) = (pn,1(π0), pn,3(π0)). The black line exhibits the
boundary of the parameter space Ω(π0).
Recall that in the constructions of approximate intervals considered in Section 5.2, a point π0 is
excluded from a confidence interval with nominal coverage 0.95 if the observed value of the chosen
test statistic Tn exceeds or falls below the 0.975 or 0.025 quantiles of the statistic’s finite-sample
distribution at the constrained MLE pn(π0). But, for example, the 0.975 quantile of the bootstrap
distribution of φn(π0) is variable over parameterizations (π0, ϑ). As a result, these approximate
intervals will not exactly control the coverage probability in finite samples, as the event that ϑ
differs from ϑ(π0) occurs with positive probability.
28 CONFIDENCE INTERVALS FOR SEROPREVALENCE
In turn, comparing the realized value of a test statistic to quantiles of the statistic’s finite-
sample distribution at every value of the nuisance component ϑ is both infeasible, as the space of ϑ
is infinite, and impractical, as it would lead to extremely conservative intervals. In fact, we can see
that in Figure 4, the 0.975 quantile of the bootstrap distribution of φn(π0) is approximately four
times as large at p1 = 0.05 than at p1 = pn,1(π0).
Thus, the finite-sample approach developed in Section 4.4 begins by constructing a 1 − γ
confidence region for ϑ and forming a finite grid over this space. The initial confidence region I1−γ
is illustrated in Figure 4 by the greyed rectangle, and a 10 × 10 grid over this space is illustrated
by the grid of white dots. The confidence region I1−γ is formed by taking the Cartesian product of
two√
1− γ level confidence regions for p1 and p3, each constructed by using the exact intervals
of Clopper and Pearson (1934) with γ = 0.001. This grid partitions the values of ϑ in I1−γ into
r = 81 rectangles, which we enumerate E1, . . . , Er. Define, for i = 1, 3, the extreme points
pi(j) = infpi : (p1, p3)> ∈ Ej and pi(j) = suppi : (p1, p3)> ∈ Ej
as well as
p2(j) = infp2 : p ∈ Ω(π0), (p1, p3)> ∈ Ej and (30)
p2(j) = supp2 : p ∈ Ω(π0), (p1, p3)> ∈ Ej .
As the test statistic φn(π0) is monotone increasing in X1 and monotone decreasing in X2 and X3,
define
pL(j) =(p1(j), p2(j), p3(j)
)and pU(j) =
(p1(j), p2(j), p3(j)
)as well as
sL(j) = Jφ(π0)n,pL(j)(t
−0 ) and sU(j) = J
φ(π0)n,pU (j)(t0)
where sL(j) and sL(j) are set equal to 1 and 0, respectively, if the infimum or supremum in (30) are
taken over the empty set. Thus, by Theorem 4.1 we can construct the finite-sample valid p-values
qL,π0,I1−γ = max1≤j≤r
(1− sL(j)) + γ and qU,π0,I1−γ = max1≤j≤r
(sU(j)) + γ
for testing the null hypothesis π = π0. Hence, the resultant finite-sample valid interval with
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 29
nominal coverage 1− α takes the form
π0 : qL,π0,I1−γ ≥ α/2 and qU,π0,I1−γ ≥ α/2
.
This approach is closely related to the method developed in Cai et al. (2020), though there
are some differences. Roughly, Cai et al. (2020) compute p-values for test of the null hypothesis
π = π0 with the parametric bootstrap using the particular choice of test statistic πn at each point of
a grid spanning a confidence region for the nuisance parameter, although the exact specification of
their procedure differs from our own. Their construction begins by constructing a joint confidence
region for all three parameters, while our approach proceeds from a smaller initial region for just
p1 and p3. We make an additional correction for a grid approximation to the nuisance space, which
allows us to ensure finite-sample validity. Finally, their construction does not guarantee that the
resulting intervals are equi-tailed.
Table 4 reports realizations of these finite-sample valid intervals with γ equal to 10−4, 10−3,
and 10−2, as well as the projection intervals discussed in Section 3.2, for Bendavid et al. (2020a).
In addition, Table 4 reports estimates of the coverage and average interval length at the estimated
values of pn for this study. The cost of ensuring finite-sample valid coverage is large. The realized
intervals are roughly 40% wider on average than intervals constructed with the Delta Method,
and the coverage is very close to one. Figure 5 displays estimates of the coverage probabilities
for the finite-sample valid intervals for γ equal to 10−4, 10−3, and 10−2 as well as the projection
intervals in the same Monte Carlo experiment developed in Section 3.3.28 Again, the coverage
is very close to one at small sample sizes. However, the finite-sample valid intervals outperform
the crude projection intervals in terms of coverage and average length for larger values of γ. The
difference is most salient in the measurements of coverage.
The additional costs associated with correction of the approximation error induced by a finite
discretization of the nuisance space are not overly burdensome. In particular, consider the test
inversion intervals for π constructed with the p-values
qL,π0,I1−γ,g = max
1− J φ(π0)n,p (t−0 ) + γ : p ∈ Ω(π0), (p2, p3) ∈ I1−γ,g
28Note that the proportion of Monte Carlo replicates for which the true value of π falls below the realized intervals isvery close to zero at most parameter values, and so dots denoting one minus the estimated coverage and the proportionof Monte Carlo replicates for which the true value of π falls above the realized intervals are approximately overlaid.
30 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Method γ Interval Ave. Length Ave. Length vs. Delta Method Coverage
Exact
0.0001 [0.000,0.028] 0.0283 1.5311 0.999
0.0010 [0.000,0.027] 0.0269 1.4538 0.998
0.0100 [0.000,0.026] 0.0259 1.3979 0.998
Projection [0.001,0.028] 0.0270 1.4578 1.000
Table 4: Average Interval Length and Coverage for Valid Test-Inversion and Projection Nominal95% Intervals
Notes: Table 4 reports the finite-sample valid test-inversion and projection confidence intervals computed on data fromBendavid et al. (2020a). Estimates of the average length and coverage for these intervals at sample size n and estimatepn from this study are also displayed. Estimates of average length and coverage are taken over 10,000 bootstrapreplicates of X at the sample size n and the estimate pn from this study.
and
qU,π0,I1−γ,g = maxJ φ(π0)n,p (t0) + γ : p ∈ Ω(π0), (p2, p3) ∈ I1−γ,g
,
where I1−γ,g is a g × g grid over the initial confidence region I1−γ . That is, I1−γ,g denotes the
white dots in Figure 4, where in that case g = 10. The realized value for these intervals with
g = 10 and γ = 10−2 for data from Bendavid et al. (2020a) are [0.000, 0.025]. For these values of
g and γ, this interval construction has an average length of 0.0249, which is 34.85% longer than the
delta-method interval on average, i.e., they are 3.5% shorter than the finite-sample valid intervals
considered in this section.
There are several facets of the finite-sample valid confidence intervals considered in this section
that could potentially be improved. As previously mentioned, there are several possible choices
for the nuisance parameter ϑ that lead to an initial confidence region I1−γ . In addition, in order
to get lower or upper confidence bounds for π, an initial confidence region may be constructed
by taking the product of appropriate one-sided bounds, respectively, rather than using a single
joint confidence region for both lower and upper bounds. This change should save roughly γ/2
in over-coverage. It may be more desirable to use a Studentized test statistic, as its distribution
may vary even less within the initial confidence region. However, a nontrivial modification of the
correction for the approximation error induced by a finite discretization of the nuisance space is
required, since monotonicity may be violated. Finally, for a particular data set, smaller grids over
the nuisance space may be applied to further reduce the length of intervals.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 31
Figure 5: Coverage for Finite-Sample Valid Test-Inversion and Projection Nominal 95% Intervals
p1 p2 n1 n2
φn,γ = 0.0001
φn,γ = 0.001
φn,γ = 0.01
Projection
0.01 0.02 0.03 0.005 0.010 2000 4000 6000 200 400 600 800
0.000
0.002
0.004
0.000
0.002
0.004
0.000
0.002
0.004
0.000
0.002
0.004
1 − Coverage Lower Upper
Notes: Figure 5 displays estimates of the coverage probabilities of the finite-sample valid test inversion and projectionconfidence intervals. Estimates of the coverage are computed at parameter values close to the estimates pn and samplesize n from Bendavid et al. (2020a) as specified in Section 3.3. The black dots denote the one minus the proportion ofreplicates for which the true value of π falls in the realized confidence intervals, i.e., one minus the estimated coverageprobability. The purple squares and blue triangles denote the proportion of replicates that fall below and above realizedconfidence intervals, respectively. The vertical dotted line denotes the estimated value of pn,1, pn,1, or sample size n1,n2 from Bendavid et al. (2020a).
6. Application to Additional Problems
The methods discussed in Section 4 apply quite generally to the construction of confidence in-
tervals for real-valued parameters θ. To review, it is required that a statistical model be indexed
by (θ, ϑ), where θ is the parameter of interest and ϑ is a nuisance parameter. The inverse testing
approach to confidence interval construction tests the null hypothesis specified by H(θ0) : θ = θ0
for each θ0 and takes as a confidence region all θ0 where H(θ0) is not rejected.
The approximate approach considered in Section 4.2 constructs tests of H(θ0) by simulating
under (θ0, ϑ(θ0)), where ϑ(θ0) is an estimate of ϑ assumingH(θ0) is true. As such, the accuracy of
this method holds under standard bootstrap arguments. The finite-sample valid approach consid-
32 CONFIDENCE INTERVALS FOR SEROPREVALENCE
ered in Section 4.4 proceeds by first constructing a confidence interval for ϑ, either for each fixed
θ0 or without any restrictions, and thus requires knowledge of a finite-sample valid confidence re-
gion for ϑ. Finite-sample error control is obtained by minimizing the p-value of a test of H(θ0)
with ϑ taking values over the confidence region. However, if a grid search is required for this
minimization, finite-sample error control can be maintained by the correction developed in Section
4.4, as is verified formally in Theorem 4.1. The application of this result requires the use of a test
statistic that is monotone, and has a monotone likelihood ratio, in each of its components.
Such conditions apply broadly. For example, suppose X1, . . . , Xk are independent, with Xi
distributed as binomial with parameters ni and pi. The goal is to construct a confidence interval
for some parameter θ = f(p1, . . . , pk). In this case, the family of distributions of Xi has monotone
likelihood ratio in Xi. However, in order to apply Theorem 4.1, we just need to specify ϑ and
a confidence interval for ϑ, and to verify that the chosen test statistic is monotone with respect
to each of its components. This is straightforward to check when the test statistic is given by
Tn = f(p1, . . . , pn), where pi = Xi/ni.
Some important special cases include the following:
(i) Difference of Proportions: f(p1, p2) = p2 − p1
(ii) Relative Risk Reduction: f(p1, p2) = 1− p2/p1
(iii) Odds Ratio: f(p1, p2) = [p2/(1− p2)]/[p1/(1− p1)]
(iv) Prevalence: f(p1, p2, p3) = (p1 − p2)/(p3 − p2)
In cases (i)-(iii), f is monotone in each argument and we may set ϑ equal to p1 or p2 and apply a
Clopper-Pearson interval. As previously discussed, in case (iv), f is monotone in each argument
on the parameter space
Ω =p ∈ [0, 1]3 : p2 ≤ p1 ≤ p3, p2 < p3
,
and we may set ϑ = (p1, p3), and apply a joint confidence set based on marginal Clopper-Pearson
intervals. Agresti and Min (2002) and Fagerland et al. (2015) consider confidence intervals for
(i)-(iii), but not for (iv) as we do here. They consider similar constructions by minimizing p-values
over a nuisance parameter space, but do not account for the discretization required. Uniformly most
accurate unbiased confidence bounds exist only for the odds ratio based on classical constructions,
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 33
but optimality considerations fail for the other parameters; see e.g., Problem 5.29 of Lehmann and
Romano (2005).
An important application of estimates of relative risk reduction are in measurements of vaccine
efficacy in randomized clinical trials. In that context, p2 is the attack rate – or probability of
contracting the targeted infectious disease – for experimental subjects who have been treated with
a vaccine and p1 is the attack rate of experimental subjects who have been treated with a placebo.
Jewell (2004) gives a discussion of statistical aspects of this problem.
Constructing confidence intervals for prevalence as in (iv) may be extended to settings where
observations are collected on each of S subpopulations or strata of a population. All of the methods
discussed in this paper apply to this context. Indeed, suppose pi1 denotes the probability that a
randomly selected individual in the ith stratum tests positive. Suppose all strata make use of the
same estimates 1− p2 and p3 of sensitivity 1−p2 and specificity p3, respectively. Then, prevalence
in the ith stratum is πi = (pi1 − p2)/(p3 − p2). If the ith stratum gets the known weight wi, then
the overall prevalence is π =∑
iwiπi, which is clearly a function of S + 2 binomial parameters.
Estimates of prevalence are also of interest in the literature on “causal attribution,” which aims
to measure “how much” of an outcome is attributable to a binary treatment. For example, the es-
timand of interest in Ganong and Noel (2020), which aims to measure the proportion of mortgage
defaults that are caused exclusively by negative equity – as opposed to adverse life events – in the
years during and following the great recession, takes the form of prevalence. Pearl (1999), Rosen-
baum (2001), and Yamamoto (2012) provide further discussion of identification and inference in
analyses of causal attribution, as well as a survey several applications.
7. Conclusion
Standard methods for constructing confidence intervals for seroprevalence derived from the Delta
Method, the percentile bootstrap, and the BCa bootstrap have coverage probabilities that behave
erratically and are not consistently near the desired nominal level at empirically relevant sample
sizes and parameter values. By contrast, methods that combine test inversion with the parametric
bootstrap lead to stable coverage probabilities that are close to the nominal level across a variety
of statistics. Specifically, statistics that are properly Studentized or based on the generalized like-
lihood ratio statistic – and in particular its signed square root – exhibit superior performance. Test
inversion based on the signed square root generalized likelihood ratio statistic gives the best overall
performance in terms of its stability and validity over a range of empirically relevant parameteri-
34 CONFIDENCE INTERVALS FOR SEROPREVALENCE
zations and sample sizes. On the other hand, if one desires a method that has guaranteed coverage
in finite-samples, then we have provided an alternative method that has finite-sample validity, but
does have coverage above the nominal level and results in longer intervals on average.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 35
ReferencesAgresti, A. and Min, Y. (2002). Unconditional small-sample confidence intervals for the odds ratio. Bio-
statistics, 3(3):379–386.
Ainsworth, M., Andersson, M., Auckland, K., Baillie, J. K., Barnes, E., Beer, S., Beveridge, A., Bibi, S.,Blackwell, L., Borak, M., et al. (2020). Performance characteristics of five immunoassays for sars-cov-2:a head-to-head benchmark comparison. The Lancet Infectious Diseases, 20(12):1390–1400.
Alter, G. and Seder, R. (2020). The power of antibody-based surveillance. The New England Journal ofMedicine.
Barndorff-Nielsen, O. (1986). Inference on full and partial parameters based on the standardized signed loglikelihood ratio. Biometrika, 73:307–322.
Bendavid, E., Mulaney, B., Sood, N., Shah, S., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saavedra-Walker, R., Tedrow, J., Bogan, A., Kupiec, T., Eichner, D., Gupta, R., Ioannidis, J. P. A., and Bhat-tacharya, J. (2021). COVID-19 antibody seroprevalence in Santa Clara County, California. InternationalJournal of Epidemiology. dyab010.
Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saave-dra, R., Tedrow, J., Tversky, D., Bogan, T. K., Eichner, D., Gupta, R., Ioannidis, J., and Bhattacharya, J.(2020a). Covid-19 antibody seroprevalence in santa clara county, california. MedRxiv, April 11.
Bendavid, E., Mulaney, B., Sood, N., Shah, S., Ling, E., Bromley-Dulfano, R., Lai, C., Weissberg, Z., Saave-dra, R., Tedrow, J., Tversky, D., Bogan, T. K., Eichner, D., Gupta, R., Ioannidis, J., and Bhattacharya, J.(2020b). Covid-19 antibody seroprevalence in santa clara county, california. MedRxiv, April 27.
Bennett, S. T. and Steyvers, M. (2020). Estimating covid-19 antibody seroprevalence in santa clara county,california. a re-analysis of bendavid et al. medRxiv.
Berger, R. and Boos, D. (1994). P values maximized over a confidence set for the nuisance parameter.Journal of the American Statistical Association, 89(427):1012–1016.
Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. StatisticalScience, pages 101–117.
Byambasuren, O., Dobler, C. C., Bell, K., Rojas, D. P., Clark, J., McLaws, M.-L., and Glasziou, P. (2020).Estimating the seroprevalence of sars-cov-2 infections: systematic review. medRxiv.
Cai, B., Ioannidis, J., Bendavid, E., and Tian, L. (2020). Exact inference for disease prevalence based on atest with unknown specificity and sensitivity. arXiv preprint arXiv:2011.14423.
Carpenter, J. (1999). Test inversion bootstrap confidence intervals. Journal of the Royal Statistical Society:Series B (Statistical Methodology), 61(1):159–172.
Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case ofthe binomial. Biometrika, 26(4):404–413.
Cox, D. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference (with discus-sion). Journal of the Royal Statistical Society, B, 49:1–39.
36 CONFIDENCE INTERVALS FOR SEROPREVALENCE
Deeks, J. J., Dinnes, J., Takwoingi, Y., Davenport, C., Spijker, R., Taylor-Phillips, S., Adriano, A., Beese, S.,Dretzke, J., di Ruffano, L. F., et al. (2020). Antibody tests for identification of current and past infectionwith sars-cov-2. Cochrane Database of Systematic Reviews, (6).
DiCiccio, T., Martin, M., and Stern, S. (2001). Simple and accurate one-sided inference from signed rootsof likelihood ratios. The Canadian Journal of Statistics, 29(1):67–76.
DiCiccio, T. J. and Romano, J. P. (1990). Nonparametric confidence limits by resampling methods and leastfavorable families. Internat. Statist. Rev., 58:59–76.
DiCiccio, T. J. and Romano, J. P. (1995). On bootstrap procedures for second-order accurate confidencelimits in parametric models. Statistica Sinica, pages 141–160.
Efron, B. (1981). Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics,9(2):139–158.
Efron, B. (1985). Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72(1):45–58.
Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association,82(397):171–185.
Eisen, M. B. and Tibshirani, R. (2020). How to identify flawed research before it becomes dangerous. TheNew York Times, 21.
Fagerland, M. W., Lydersen, S., and Laake, P. (2015). Recommended confidence intervals for two indepen-dent binomial proportions. Statistical methods in medical research, 24(2):224–254.
Fauci, A. S., Lane, H. C., and Redfield, R. R. (2020). Covid-19—navigating the uncharted. The NewEngland Journal of Medicine, 382(13):1268.
Fithian, W. (2020). Statistical comment on the revision of bendavid et al.
Fraser, D. A. S. and Reid, N. (1987). On conditional inference for a real parameter: a differential approachon the sample space. Biometrika, 75:251–264.
Frydenberg, M. and Jensen, J. L. (1989). Is the ‘improved likelihood ratio statistic’ really improved in thediscrete case? Biometrika, 76:655–661.
Ganong, P. and Noel, P. J. (2020). Why do borrowers default on mortgages? a new method for causalattribution. Technical report, National Bureau of Economic Research.
Gelman, A. (2020). Concerns with that stanford study of coronavirus prevalence. Statistical Modeling,Causal Inference, and Social Science blog.
Gelman, A. and Carpenter, B. (2020). Bayesian analysis of tests with unknown specificity and sensitivity.Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(5):1269–1283.
Hall, P. (2013). The Bootstrap and Edgeworth Expansion. Springer.
Hui, S. L. and Walter, S. D. (1980). Estimating the error rates of diagnostic tests. Biometrics, pages 167–171.
Jensen, J. L. (1986). Similar tests and the standardized log ratio statistic. Biometrika, 73:567–572.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 37
Jensen, J. L. (1992). The modified signed likelihood statistic and saddlepoint approximations. Biometrika,79:693–703.
Jewell, N. P. (2004). Statistics for Epidemiology. CRC Press, Boca Raton, Florida.
Johnson, S. G. (2020). The nlopt nonlinear-optimization package. http://github.com/stevengj/nlopt.
Kolata, G. (2020). Coronavirus infections may not be uncommon, tests suggest. The New York Times.
Kraft, D. (1988). A software package for sequential quadratic programming. Forschungsbericht- DeutscheForschungs- und Versuchsanstalt fur Luft- und Raumfahrt.
Krammer, F. and Simon, V. (2020). Serology assays to manage covid-19. Science, 368(6495):1060–1061.
Lawley, D. N. (1956). A general method for approximating to the distribution of the likelihood ratio criteria.Biometrika, 43:295–303.
Lee, S. and Young, G. (2005). Parametric bootstrapping with nuisance parameters. Statistics and ProbabilityLetters, 71:143–153.
Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses. Springer, New York.
Mallapaty, S. (2020). Antibody tests suggest that coronavirus infections vastly exceed official counts. Nature(Lond.).
Pearl, J. (1999). Probabilities of causation: three counterfactual interpretations and their identification.Synthese, 121(1):93–149.
Peeling, R. W., Wedderburn, C. J., Garcia, P. J., Boeras, D., Fongwen, N., Nkengasong, J., Sall, A., Ta-nuri, A., and Heymann, D. L. (2020). Serology testing in the covid-19 pandemic response. The LancetInfectious Diseases.
Rogan, W. J. and Gladen, B. (1978). Estimating prevalence from the results of a screening test. Americanjournal of epidemiology, 107(1):71–76.
Romano, J., Shaikh, A., and Wolf, M. (2011). Consonance and the closure method in multiple testing, article12. International Journal of Biostatistics, 7(1).
Romano, J., Shaikh, A., and Wolf, M. (2014). A practical two-step method for testing moment inequalities.Econometrica, 82(5):1979–2002.
Rosenbaum, P. R. (2001). Effects attributable to treatment: Inference in experiments and observationalstudies with a discrete pivot. Biometrika, 88(1):219–231.
Silvapulle, M. (1996). A test in the presence of nuisance parameters. Journal of the American StatisticalAssociation, 91(436):1690–1693.
Toulis, P. (2020). Estimation of covid-19 prevalence from serology tests: A partial identification approach.University of Chicago, Becker Friedman Institute for Economics Working Paper, (2020-54).
Walter, S. D. and Irwig, L. M. (1988). Estimation of test error rates, disease prevalence and relative riskfrom misclassified data: a review. Journal of clinical epidemiology, 41(9):923–937.
Yamamoto, T. (2012). Understanding the past: Statistical analysis of causal attribution. American Journalof Political Science, 56(1):237–256.
38 CONFIDENCE INTERVALS FOR SEROPREVALENCE
A. Maximum Likelihood Estimation of Model ParametersIn this section, we develop constrained and unconstrained maximum likelihood estimators for p.Let L (x|p) be the likelihood of observing the data x under p, given by
L (x|p) =∏
1≤i≤3
(nixi
)pxii (1− pj)ni−xi .
The maximum likelihood estimate of p, which we denote by pn, is the solution to the inequalityconstrained convex optimization problem
Maximizep∈Ω
logL (x | p)
It is clear that
pn =
pn if pn,2 ≤ pn,1 ≤ pn,3(x1+x2n1+n2
, x1+x2n1+n2
, pn,3
)if pn,1 < pn,2 ≤ pn,3(
pn,1,x2+x3n3+n3
, x2+x3n3+n3
)if pn,2 ≤ pn,3 < pn,1(∑
j xj∑j nj
,∑j xj∑j nj
,∑j xj∑j nj
)if pn,3 < pn,2.
The constraint p2 ≤ p1 ≤ p3 is unlikely to be binding in a realistic setting. However, somemethods that we study rely on estimates based on bootstrap replicates XB
id∼ Binomial (ni, pn,i),
and in that case there is a small probability that pBn,j = XBj /nj will not satisfy pBn,2 ≤ pBn,1 ≤ pBn,3.
Recall that we can rewrite the condition π = π0 as the linear restriction b (π0)> p = 0 whereb (π0) = (1,− (1− π0) ,−π0)> . The maximum likelihood estimate of p under the constraint thatb (π0)> p = 0, which we denote by pn(π0) is the solution to the inequality and equality constrainedconvex optimization problem
Maximizep∈Ω
logL (x | p)
Subject to b (π0)> p = 0.
As an analytic solution to this problem is not readily available, we solve it numerically with the se-quential quadratic programming algorithm presented in Kraft (1988), available through the NLoptopen-source optimization library (Johnson, 2020).29
B. Accelerated Bias-Corrected Bootstrap ComputationIn this section, we discuss approximation of the constants z0 and a in the BCα confidence interval.Efron (1987) showed that
z0 = Φ−1(J πn,pn (πn)
)29Note that the sequential quadratic programming algorithm we apply is a (highly optimized) quasi-newton method,
approximating the Hessian of the objective function with updates specified by gradient evaluations. As an analyticexpression for the Hessian of the objective function is available, this is unnecessarily computationally costly. However,we were not able to locate any publicly available code accessible in the R programming language for inequality andequality constrained Newton’s method optimization.
DICICCIO, RITZWOLLER, ROMANO, AND SHAIKH 39
and so can be computed from Monte Carlo estimation of J πn (t, pn). Let z0 denote the bootstrapestimate of z0.
Let the likelihood function be given by
L (p | x) =∏
1≤j≤3
(njxj
)pxjj (1− pj)nj−xj .
DiCiccio and Romano (1995) show that a can be approximated with
a =1
6
∑ijk λijkµiµjµk(∑i,j λijµiµj
)3/2
,
where
θi =∂π
∂pi, µi =
∑j
θi/λij,
λij = E[(
∂
∂pilogL (p | X)
)(∂
∂pjlogL (p | X)
)],
λijk = E[(
∂
∂pilogL (p | X)
)(∂
∂pjlogL (p | X)
)(∂
∂pklogL (p | X)
)].
In our setting, it can be seen that
θ1 =1
p3 − p2
θ2 = p1−p3(p3−p2)2
θ3 =p2 − p1
(p3 − p2)2
and that
λij =
ni
pi(1−pi) i = j
0 i 6= j
λijk =
ni(1−2pi)
p2i (1−pi)2 i = j = k
0 otherwise,
Thus, we find that
a =1
6
∑3i=1 λiiµ
3i(∑3
i=1 λiiµ2i
)3/2
where µi = θ/λii. Replacing pi by pn,i in the above expressions for the λii and µi results in theestimator a for the acceleration constant.