23
Ann Inst Stat Math (2019) 71:983–1005 https://doi.org/10.1007/s10463-018-0668-7 Simultaneous confidence bands for the distribution function of a finite population in stratified sampling Lijie Gu 1 · Suojin Wang 2 · Lijian Yang 3 Received: 16 August 2017 / Revised: 24 April 2018 / Published online: 21 May 2018 © The Institute of Statistical Mathematics, Tokyo 2018 Abstract Stratified sampling is one of the most important survey sampling approaches and is widely used in practice. In this paper, we consider the estimation of the distribu- tion function of a finite population in stratified sampling by the empirical distribution function (EDF) and kernel distribution estimator (KDE), respectively. Under gen- eral conditions, the rescaled estimation error processes are shown to converge to a weighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs) are constructed for the population distribution function based on EDF and KDE. Simulation experiments and illustrative data example show that the cover- age frequencies of the proposed SCBs under the optimal and proportional allocations are close to the nominal confidence levels. Keywords Confidence band · Stratified population distribution · Allocation · Brownian bridge · Kernel · Superpopulation This research was supported in part by Jiangsu Specially-Appointed Professor Program SR10700111, Jiangsu Province Key-Discipline Program ZY107992, National Natural Science Foundation of China Awards NSFC 11371272, 11771240, 11701403, Research Fund for the Doctoral Program of Higher Education of China Award 20133201110002, 2017 Jiangsu Overseas Visiting Scholar Program for University Prominent Young and Middle-aged Teachers and Presidents, and the Simons Foundation Mathematics and Physical Sciences Program Award #499650. Helpful comments from a reviewer are greatly appreciated. B Lijian Yang [email protected] 1 School of Mathematical Sciences and Center for Advanced Statistics and Econometrics Research, Soochow University, Suzhou 215006, China 2 Department of Statistics, Texas A&M University, College Station, TX 77843, USA 3 Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing 100084, China 123

Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

Ann Inst Stat Math (2019) 71:983–1005https://doi.org/10.1007/s10463-018-0668-7

Simultaneous confidence bands for the distributionfunction of a finite population in stratified sampling

Lijie Gu1 · Suojin Wang2 · Lijian Yang3

Received: 16 August 2017 / Revised: 24 April 2018 / Published online: 21 May 2018© The Institute of Statistical Mathematics, Tokyo 2018

Abstract Stratified sampling is one of themost important survey sampling approachesand is widely used in practice. In this paper, we consider the estimation of the distribu-tion function of a finite population in stratified sampling by the empirical distributionfunction (EDF) and kernel distribution estimator (KDE), respectively. Under gen-eral conditions, the rescaled estimation error processes are shown to converge to aweighted sum of transformed Brownian bridges. Moreover, simultaneous confidencebands (SCBs) are constructed for the population distribution function based on EDFand KDE. Simulation experiments and illustrative data example show that the cover-age frequencies of the proposed SCBs under the optimal and proportional allocationsare close to the nominal confidence levels.

Keywords Confidence band · Stratified population distribution · Allocation ·Brownian bridge · Kernel · Superpopulation

This research was supported in part by Jiangsu Specially-Appointed Professor Program SR10700111,Jiangsu Province Key-Discipline Program ZY107992, National Natural Science Foundation of ChinaAwards NSFC 11371272, 11771240, 11701403, Research Fund for the Doctoral Program of HigherEducation of China Award 20133201110002, 2017 Jiangsu Overseas Visiting Scholar Program forUniversity Prominent Young and Middle-aged Teachers and Presidents, and the Simons FoundationMathematics and Physical Sciences Program Award #499650. Helpful comments from a reviewer aregreatly appreciated.

B Lijian [email protected]

1 School of Mathematical Sciences and Center for Advanced Statistics and EconometricsResearch, Soochow University, Suzhou 215006, China

2 Department of Statistics, Texas A&M University, College Station, TX 77843, USA

3 Center for Statistical Science and Department of Industrial Engineering, Tsinghua University,Beijing 100084, China

123

Page 2: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

984 L. Gu et al.

1 Introduction

Survey sampling is one of themost important branches of statistics. Its classic researchcontents are to design appropriate sampling methods to obtain some units from thefinite population, and to make inference about the population using the sampled obser-vations, such as the population total, population mean, and population quantiles; seeCochran (1977) and Lohr (2009). Therefore, one particular task is to estimate thecumulative distribution function (CDF) of the finite population and make correspond-ing inferences.

There are many approaches in the current literature for estimating the populationdistribution function under various conditions; see, for instance, Chambers and Dun-stan (1986),Wang andDorfman (1996), Chen andWu (2002), Frey (2009), andO’Neilland Stern (2012) . Most of these existing works focus on studying the consistency andasymptotic properties at any fixed point of the proposed distribution estimator. How-ever, this is often unsatisfactory as investigators may be interested in making statisticalinferences on the unknown distribution function or testing whether the population dis-tribution function is significantly different from a known distribution, for which asimultaneous confidence band (SCB) is desirable.

SCB is a powerful and appropriate inference tool for an entire unknown curve orfunction, which is a direct analogous concept of a confidence interval, regarded as acollection of confidence intervals over the whole range of function. For instance, theearlier work of Bickel and Rosenblatt (1973) is about SCB for a probability densityfunction. The constructions of SCB for nonparametric regression function can be foundin Härdle (1989), Xia (1998), Wang and Yang (2009). Recent theoretical developmenton SCB has appeared in various contexts, see Degras (2011), Zhu et al. (2012), Caoet al. (2012), Ma et al. (2012), Zheng et al. (2014), Gu et al. (2014), Song et al. (2014),and Cao et al. (2016) for functional data analysis; Song and Yang (2009) and Caiand Yang (2015) for a conditional variance function; Wang et al. (2013) about smoothSCB for cumulative distribution function in the independent and identically distributed(i.i.d.) settings based on continuous kernel distribution estimator (KDE); Gu and Yang(2015) for a single-index link function; Shao and Yang (2012) and Wang et al. (2014)for time series analysis; and Cardot and Josserand (2011) and Cardot et al. (2013) forSCB for functional data mean curve under survey sampling.

In the context of SCB for the finite population distribution function, Frey (2009)constructed the nonsmooth Kolmogorov–Smirnov type of SCB for the population dis-tribution function under simple random sampling by a recursive algorithm to computeexact coverage frequency of SCB designed for the case when the sample size andpopulation size are both small. More recently, Wang et al. (2016) considered largesample asymptotics based SCB for the CDF of the finite population also under simplerandom sampling.

However, in practice, stratified sampling is a standard technique in survey method-ology commonly employed to increase efficiency over simple random sampling.Consider, for instance, a farm acres dataset in Example 3.2 of Lohr (2009), whichconsists of four census regions of the United States–Northeast, North Central, South,andWest. Figure 8 depicts the distribution functions and histograms of the four regions,clearly showing significant variation among the four, as is also seen in Figure 3.1 in

123

Page 3: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 985

Lohr (2009). Thus, stratified sampling is highly recommended in such a case. Yet,there do not exist any results on SCB for the CDF of the finite population under strat-ified sampling in previous works. If one naively treats a stratified sample as a simplerandom sample (SRS) and applies the method proposed by Wang et al. (2016) to con-struct SCBs for a stratified population distribution, the performance of the SCBs is notsatisfying for a stratified sample as expected. In particular, the empirical coverage fre-quencies are much different from the nominal confidence level. Some detailed resultsare shown in Tables 3, 4, 5 and 6 and explained in Sects. 4 and 5. Hence, in this paper,we consider estimators of the finite population distribution function under stratifiedsampling based on the nonsmooth empirical distribution function (EDF) defined in(4) below and the smooth kernel distribution estimator (KDE) in (6), and developtheir corresponding SCBs to provide a powerful tool to make statistical inferences forstratified population distribution function.

The rest of the paper is organized as follows. Section 2 provides themain theoreticalresults as well as technical conditions needed in our theoretical development. Section3 describes the actual procedures to implement the SCBs. Simulation studies are givenin Sect. 4, and illustrative data example in Sect. 5. Some technical proofs are given inthe Appendix.

2 Main results

In stratified sampling, a finite population π of N units is first divided into severalnonoverlapping subpopulations called strata. Once the strata have been determined,a simple random sample is drawn from each stratum with the drawings being madeindependently across the different strata. The total sample size is denoted by n.

In sample surveys, usually the population size N is large but finite. Thus, the classicframework of asymptotics in statistics may not directly apply for a finite population.On the other hand, in sampling theory the finite population π may be viewed as asample of size N drawn from a superpopulation which has a continuous distribu-tion F(x). This is the setting we assume in this work. Specifically, in the spirit ofRosén (1964) for finite population asymptotics, we assume that there is a sequenceof populations {πk}∞k=1 as i.i.d. random samples of sizes Nk (Nk → ∞ as k → ∞)generated from a superpopulation with a mixture continuous distribution functionF(x) = ∑S

s=1 WsFs(x), where each Fs(x) is a continuous distribution function andWs , s = 1, 2, . . . , S, are weights satisfying Ws ∈ (0, 1),

∑Ss=1 Ws = 1. Each πk

can be viewed as a “post-stratified” population in the sense that it can be dividedinto S strata π1k, π2k, . . . , πSk of N1k, N2k, . . . , NSk units respectively according tothe superpopulation components Fs(x), s = 1, 2, . . . , S. It is clear that πsk can beregarded as an i.i.d. random sample from Fs(x) conditional on Nsk . Then stratifiedrandom sampling is applied to the population πk . Notice that in this framework, thestratum sizes Nsk, s = 1, 2, . . . , S, are treated as observed random variables due to therandomness of drawingπk from the superpopulation and Nk = N1k+N2k+· · ·+NSk .Let Wsk = Nsk N

−1k be the stratum weights of population πk , which are easily seen

to satisfy that Wsk converges to Ws almost surely, and of course in probability, ask → ∞.

123

Page 4: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

986 L. Gu et al.

Denote ysi,k , 1 ≤ s ≤ S, 1 ≤ i ≤ Nsk as the characteristic of interest for the i th

unit in the sth stratum of πk . Then one has the population πk = {ysi,k

}S,Nsks=1,i=1 and the

subpopulations (strata) πsk = {ysi,k

}Nski=1, s = 1, . . . , S. Mathematically, the discrete

CDF of the population πk can be defined as

FNk (x) = N−1k

S∑

s=1

Nsk∑

i=1

I (ysi,k ≤ x) =S∑

s=1

Wsk FNsk (x), (1)

where FNsk (x) is the CDF of πsk given by

FNsk (x) = N−1sk

Nsk∑

i=1

I (ysi,k ≤ x), s = 1, . . . , S. (2)

By the law of large numbers, we have FNk (x) →p F(x) and FNsk (x) →p Fs(x) ask → ∞, ∀x ∈ R, where→p means convergence in probability. Our goal is to estimatethe population distribution FNk (x) as well as the superpopulation distribution F(x)with their corresponding SCBs in stratified sampling.

For each k ≥ 1, denote{Ys1,k, . . . ,Ysnsk ,k

}as an SRS drawn from the sth stratum

of population πk with sample size nsk (nsk ≤ Nsk), s = 1, . . . , S. Hence, the totalsample size is nk = n1k + n2k + · · · + nSk . Define Fnsk (x) and Fnk (x) as the EDFsof sth stratum πsk and population πk , which are both random functions due to therandomness of sampling, given by

Fnsk (x) = n−1sk

nsk∑

i=1

I (Ysi,k ≤ x), s = 1, . . . , S, (3)

Fnk (x) =S∑

s=1

Wsk Fnsk (x), x ∈ R. (4)

In addition,we adoptKDE, an integral formof a distribution estimator that appearedin Reiss (1981), Liu and Yang (2008), Wang et al. (2013) for i.i.d. and stationarysequences, to each stratum CDF FNsk and the population CDF FNk . They are respec-tively defined as

F̂sk(x) =∫ x

−∞n−1sk

nsk∑

i=1

Khs

(u − Ysi,k

)du, s = 1, . . . , S, x ∈ R, (5)

F̂k(x) =S∑

s=1

Wsk F̂sk(x), x ∈ R (6)

in which K is a kernel function rescaled as Khs (u) = K (u/hs) /hs with bandwidthhs = hnsk > 0.

123

Page 5: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 987

For convenience, denote the maximal deviation between any two distribution func-tions F1 and F2 as

M (F1, F2) = ‖F1 − F2‖∞ = supx∈R |F1(x) − F2(x)| , (7)

and for nonnegative integer p and γ ∈ (0, 1], the collection of functions whose pthderivatives satisfy Hölder conditions of order γ as

C (p,γ ) (R) ={

g : R → R

∣∣∣∣∣‖g‖p,γ = sup

x1 =x2,x1,x2∈R

∣∣g(p) (x1) − g(p) (x2)

∣∣

|x1 − x2|γ < +∞}

.

We assume some general technical conditions as follows:

(C1) ∀s ∈ {1, . . . , S}, min (nsk, Nsk − nsk) →p ∞, as k → ∞.(C2) There exist constants ws ∈ (0, 1) and C ∈ [0, 1), such that as k → ∞,

wsk = nsk/nk →p ws and nk/Nk → C.(C3) There exist an integer p ≥ 0 and γ ∈ (1/2, 1] such that Fs ∈ C (p,γ ) (R), and

Fs(x) is uniformly continuous over x ∈ R.(C4) The bandwidth hs = hnsk > 0 satisfies λskh

p+γnsk →p 0, as k → ∞, in which

λsk =(n−1sk − N−1

sk

)−1/2.

(C5) The kernel K is a continuous and symmetric function, supported on [−1, 1],and is a q th order kernel for some even integer q > p + γ , i.e., its momentsμr (K ) = ∫

vr K (v) dv satisfy μ0 (K ) ≡ 1, μq (K ) = 0, μr (K ) ≡ 0 for anyinteger r ∈ (0, q).

Remark 1 In many convergence statements of the conditions, we use convergencein probability because we treat the population as an i.i.d. random sample from thesuperpopulation. In this framework, both nsk and Nsk are treated as observed randomvariables. In the rest of the paper, as a convention we may drop “in probability” forbrevity if there is no confusion when discussing convergence in probability.

Since in stratified sampling an SRS is independently drawn from each stratum, wefirst state the following Theorem 1which is essentially Theorems 1 and 2 inWang et al.(2016). It is a finite population version of Donsker’s Theorem and the equivalence ofEDF and KDE under simple random sampling. Its proof can be found in Wang et al.(2016).

Theorem 1 Under Condition (C1), as k → ∞, for the sth stratum of population πk ,FNsk (x), Fnsk (x) and F̂sk(x) respectively given in (2), (3) and (5) satisfy

λsk{Fnsk (x) − FNsk (x)

} d→ Bs (Fs(x)) , s = 1, . . . , S, (8)

where λsk =(n−1sk − N−1

sk

)−1/2 = n1/2sk (1 − nsk/Nsk)−1/2 is a finite population

corrected scale factor and Bs (·) represents the Brownian bridge for the sth stratum.In addition, under Conditions (C1)–(C5),

123

Page 6: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

988 L. Gu et al.

λskM(F̂sk, Fnsk

)= λsk supx∈R

∣∣∣F̂sk(x) − Fnsk (x)

∣∣∣ = op (1) , s = 1, . . . , S. (9)

The above theorem can be extended to finite stratified sampling. To properly for-mulate the extension, we first provide a simple lemma.

Lemma 1 Under Condition (C2), ∀s ∈ {1, . . . , S}, w−1s − CW−1

s ≥ 0, and thereexists at least one s ∈ {1, . . . , S} such that w−1

s − CW−1s > 0.

The above lemma ensures that all(w−1s − CW−1

s

)/ (1 − C) in the next theorem

are nonnegative with at least one of them being positive. The proofs of Lemma 1,Theorem 2 and others are given in the Appendix.

Theorem 2 Under Conditions (C1), (C2), as k → ∞, FNk (x) and Fnk (x) given in(1), (4) satisfy

λk{Fnk (x) − FNk (x)

} d→ B∗(x),

in which λk =(n−1k − N−1

k

)−1/2and

B∗(x) =S∑

s=1

Ws

√(w−1s − CW−1

s

)/ (1 − C)Bs {Fs(x)} (10)

where the transformed Brownian bridges Bs {Fs(x)}, s = 1, . . . , S, are independentof each other.

Corollary 1 Under Conditions (C1)–(C2), as k → ∞, one has

P[λkM

(Fnk , FNk

) ≤ t] → D∗ (t) = P

[maxx∈R

∣∣B∗(x)

∣∣ ≤ t

], t ≥ 0,

where B∗ (x) is as defined in (10), and D∗ (t) represents the extreme value distributionof B∗ (x). Hence, for α ∈ (0, 1), an asymptotic 100 (1 − α)% SCB based on Fnk forthe finite population CDF FNk (x) is

[max

{Fnk (x) − λ−1

k D∗1−α, 0

},min

{Fnk (x) + λ−1

k D∗1−α, 1

}], x ∈ R, (11)

where D∗1−α = (D∗)−1 (1 − α) is the 100 (1 − α)th percentile of D∗ with (D∗)−1

being the inverse function of D∗.

Theorem 3 extends the asymptotic property of the finite population CDF FNk (x)to the superpopulation CDF F(x).

Theorem 3 Under Conditions (C1)–(C2) and if nk/Nk → C ≡ 0 as k → ∞, onehas λkM(FNk , F) = op (1). Hence,

P[λkM(Fnk , F) ≤ t

] → D∗ (t) , t ∈ R.

Furthermore, for α ∈ (0, 1), an asymptotic 100 (1 − α)% SCB based on Fnk for thesuperpopulation CDF F(x) is constructed as (11).

123

Page 7: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 989

Theorem 4 below shows that nonsmooth Fnk and smooth F̂k are asymptoticallyuniformly equivalent. By Slutsky’s Theorem, F̂k automatically inherits the asymptoticproperties of Fnk . Therefore, one can successfully construct smooth SCBs for finitepopulation CDF FNk (x) and superpopulation CDF F(x) based on F̂k .

Theorem 4 Under Conditions (C1)–(C5), as k → ∞, Fnk (x) and F̂k(x) given in (4),

(6) satisfy λkM(F̂k, Fnk

)= op (1). Consequently,

P

[λkM(F̂k, FNk ) ≤ t

]→ D∗ (t) , t ∈ R.

Furthermore, under nk/Nk → C ≡ 0 in Condition (C2),

P

[λkM(F̂k, F) ≤ t

]→ D∗ (t) , t ∈ R.

Hence, an asymptotic 100 (1 − α)% smooth SCBs based on F̂k for finite populationCDF FNk (x) and superpopulation CDF F(x) are constructed as

[max

{F̂k(x) − λ−1

k D∗1−α, 0

},min

{F̂k(x) + λ−1

k D∗1−α, 1

}], x ∈ R. (12)

Remark 2 When S = 1, stratified sampling is reduced to simple random samplingand the extreme value of B∗(x) in Theorem 2 follows the Kolmogorov distribution,

i.e., D∗ (t) = D (t) = 1 − 2∞∑j=1

(−1) j−1 exp(−2 j2t2

), t > 0; D (t) = 0, t ≤ 0.

The critical value D∗1−α = (D∗)−1 (1 − α) can be computed by a standard numerical

method.When S > 1, however, the distribution of B∗(x) is not distribution-free, and D∗

1−α

can be obtained only by non-trivial Monte-Carlo methods. More details about theimplementation of the SCBs will be presented in the following Sect. 3.

3 Implementation

In this section, we outline a practical procedure to implement the construction of theSCBs for finite population CDF FNk (x) and superpopulation CDF F(x) based onestimators EDF Fnk (x) and KDE F̂k(x) given in ( 11) and (12), respectively.

The kernel function used in our proposed KDE F̂k(x) is the quartic kernel,K (u) = 15

(1 − u2

)2I {|u| ≤ 1} /16, which satisfies Condition (C6) with q = 2. The

bandwidth for each stratum is taken to be hs1 = IQRs ×λ−2sk , or hs2 = IQRs ×λ

−2/3sk ,

s = 1, . . . , S, where IQRs stands for the Inter-Quartile Range of an SRS drawn fromthe sth stratum, and hs1 automatically satisfies Condition (C5), while hs2 satisfies (C5)if p + γ > 3/2, especially taking p = 1 since γ ∈ (1/2, 1].

The most important part of the procedure to construct the SCBs is to obtain thecritical values D∗

1−α = (D∗)−1 (1 − α), i.e., the 100 (1 − α)th percentiles of D∗.

123

Page 8: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

990 L. Gu et al.

According to the stratified sample drawn from each stratum, it is easy to constructthe empirical CDF of each stratum Fnsk , s = 1, . . . , S, and then to generate thetransformed Brownian bridges Bsl

{Fnsk (x)

}, l = 1, . . . , L , where L is a preset large

integer, the default of which is set to be 1000 due to lengthy simulation time. Onetakes the maximal absolute value over a equally divided values of x for each copyof the weighted sum of Brownian bridges

∑Ss=1 λkWskλ

−1sk Bsl

{Fnsk (x)

}(SCBs for

FNk ) or∑S

s=1 Wskw−1/2sk Bsl

{Fnsk (x)

}(SCBs for F), and estimates D∗

1−α by theempirical quantile of these maximum values. The above limiting method is based onthe limiting distribution stated in Theorems 2–4. In our implementation, a is taken tobe 401, which appears to be sufficient in this application; see more discussion on thisin the next section.

An alternative method for obtaining the critical values in finite sample cases isthe bootstrap technique. Suppose for each stratum, Nsk = tsknsk + rsk , 1 ≤ rsk ≤nsk − 1, s = 1, . . . , S. Adopting the idea of McCarthy and Snowden (1985), onereplicates each sample elements tsk times, together with selecting rsk elements fromthe sample without replacement to construct an artificial population, from which Lrepeated stratified samples are drawnwithout replacement. For each stratifiedbootstrapsample, the bootstrap EDF and KDE are computed. Then one takes the maximalabsolute value of the difference between each bootstrap EDF (KDE) and real sampleEDF (KDE), and then estimates the critical values for the SCB based on EDF (KDE)using the proper quantiles of these maximum values.

For stratified random sampling, there are two common methods to allocate thenumber of units to be drawn from each stratum. One is called proportional allocationas nsk the number of sampled units in each stratum is proportional to Nsk the size ofthe stratum, i.e., nsk = (nk/Nk) Nsk . The other is the Neyman allocation, a specialcase of the optimal allocation in which the costs in the strata are assumed to be equal.In this case, nsk is proportional to NskVsk , where V 2

sk is population variance in stratums. For the estimation of the population CDF,

V 2sk =

∫Nsk

Nsk − 1FNsk (x)

{1 − FNsk (x)

}dx, s = 1, . . . , S,

where the integration is approximated by the sum over 401 equally spaced grid pointson the range of a pilot SRS drawn from stratum s and the unknown FNsk (x) may beestimated by the corresponding pilot sample EDF.

4 Simulation study

In this section, we present some simulation results to show the finite-sample perfor-mance of the proposed estimators and the corresponding SCBs.

In our simulation settings, we consider a population with four strata to match thereal data “agpop.dat” discussed in Sect. 5. These four strata are generated, respectivelyfrom� (1.41, 0.66),� (1.45, 2.25),� (0.67, 2.98),� (0.76, 9.56), which are regardedas four superpopulations following Gamma distribution � (υ, β) with the probabilitydensity function (PDF):

123

Page 9: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 991

f (x) = 1

βυ� (υ)xυ−1e−x/β, x > 0.

The parameters of the above four Gamma distributions are taken to be the rescaledmoment estimation based on agpop.dat.

The number of units for each stratum is taken to be N1k = 200, N2k = 1000,N3k = 1400, N4k = 400, respectively, so that the total number of units in the entirepopulation is Nk = N1k + N2k + N3k + N4k = 3000, while the total sample size nk istaken to 150, 300, 600, 900.Denotem, M as theminimumandmaximumof a stratifiedrandom sample. We examine the global discrepancy between Fnk (x) and F̂k(x) underthe proportional and optimal allocations measured by the Integrated Squared Error(ISE):

ISE(Fnk , FNk ) =∫

{Fnk (x) − FNk (x)}2dx,

ISE(F̂k, FNk ) =∫

{F̂k(x) − FNk (x)}2dx,

ISE(Fnk , F) =∫

{Fnk (x) − F(x)}2dx,

ISE(F̂k, F) =∫

{F̂k(x) − F(x)}2dx,

in which the integration is approximated by the sum over 401 equally spacedgrid points from m − (M − m) n−2

k to M . One then computes the Mean Inte-grated Squared Error (MISE) as the average of ISE over 1000 replications. Underthe proportional and optimal allocations, Figs. 1, 2, 3 and 4 show respectively,the boxplots of the random values ISE(F̂k, FNk ), ISE(F̂k, F) and the randomratios ISE(F̂k, FNk )/ISE

(Fnk , FNk

), ISE(F̂k, F)/ISE(Fnk , F). These plots indi-

150 300 600 900

0.00

0.02

0.04

0.06

0.08

Sample Size

ISE

(a)

150 300 600 900

0.00

0.02

0.04

0.06

0.08

Sample Size

ISE

(b)

Fig. 1 Boxplots of ISE(F̂k , FNk ) with bandwidth hs1 under the optimal/proportional allocation, respec-tively. a Optimal allocation. b Proportional allocation

123

Page 10: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

992 L. Gu et al.

Sample Size

ISE

(a)

150 300 600 900 150 300 600 900

0.00

0.02

0.04

0.06

0.08

0.00

0.02

0.04

0.06

0.08

Sample Size

ISE

(b)

Fig. 2 Boxplots of ISE(F̂k , F)with bandwidth hs1 under the optimal/proportional allocation, respectively.a Optimal allocation. b Proportional allocation

Sample Size

ISE

_rat

io

(a)

150 300 600 900 150 300 600 900

0.85

0.90

0.95

1.00

1.05

0.85

0.90

0.95

1.00

1.05

Sample Size

ISE

_rat

io

(b)

Fig. 3 Boxplots of the ratio ISE(F̂k , FNk )/ISE(Fnk , FNk ) with bandwidth hs1 under the opti-mal/proportional allocation, respectively. a Optimal allocation. b Proportional allocation

cate that for both allocations ISE(F̂k, FNk ) →p 0, ISE(F̂k, F) →p 0 andISE(F̂k, FNk )/ISE

(Fnk , FNk

) →p 1, ISE(F̂k, F)/ISE(Fnk , F) →p 1. More-over, Figs. 5 and 6 show that the differences of ISE between the two allo-cations are not significant. Table 1 contains MISE(F̂k, FNk ), MISE(F̂k, F) andthe ratios MISE(F̂k, FNk )/MISE(Fnk , FNk ), MISE(F̂k, F)/MISE(Fnk , F), whichare compared under the two allocations. It shows that as nk increases, undereither the proportional or optimal allocation, MISE(F̂k, FNk ) goes to zero andMISE(F̂k, FNk )/MISE(Fnk , FNk ) is smaller than (but close to) 1. MISE(F̂k, F) andMISE(F̂k, F) /MISE(Fnk , F) behave similarly. All these findings are consistent withthe asymptotic properties. In addition, although MISE(F̂k, FNk ) and MISE(F̂k, F)

123

Page 11: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 993

Sample Size

ISE

_rat

io

(a)

150 300 600 900 150 300 600 900

0.85

0.90

0.95

1.00

1.05

0.85

0.90

0.95

1.00

1.05

Sample Size

ISE

_rat

io

(b)

Fig. 4 Boxplots of the ratio ISE(F̂k , F)/ISE(Fnk , F) with bandwidth hs1 under the optimal/proportionalallocation respectively. a Optimal allocation. b Proportional allocation

Sample Size

ISE

_rat

io

(a)

150 300 600 900

05

1015

150 300 600 900

05

1015

Sample Size

ISE

_rat

io

(b)

Fig. 5 Boxplots of the ratio a ISEopt(Fnk , FNk )/ISEprop(Fnk , FNk ) and b ISEopt(F̂k , FNk )/

ISEprop(F̂k , FNk ) with bandwidth hs1

under the optimal allocation are smaller than that under the proportional allocation,the differences are small.

Next, we compare the SCBs constructed by F̂k , Fnk with the confidence levels1−α = 0.99, 0.95, 0.9, 0.8. Under the proportional allocation and optimal allocation,respectively, Tables 3 and 4 report the coverage frequencies over 1000 replicationsthat the true curve was covered by SCBs at the 401 equally spaced grid points fromm − (M − m) n−2

k to M . For visualization of actual function estimates, Fig. 7 depicts

curves of the true population CDF FNk , EDF Fnk , KDE F̂k taking bandwidth hs1,together with corresponding asymptotic 95% nonsmooth and smooth SCBs at nk =300. Other settings yielded similar results, but are not included to save space. Ourmain findings are summarized as follows:

123

Page 12: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

994 L. Gu et al.

Sample Size

ISE

_rat

io

(a)

150 300 600 900

05

1015

150 300 600 900

05

1015

Sample Size

ISE

_rat

io

(b)

Fig. 6 Boxplots of the ratio a ISEopt(Fnk , F)/ISEprop(Fnk , F) and b ISEopt(F̂k , F)/ISEprop(F̂k , F)

with bandwidth hs1

Table 1 ComparingMISEs of F̂k (with bandwidth hs1) and Fnk under the optimal/proportional allocation,respectively

nk

150 300 600 900

MISEopt(F̂k , FNk ) 0.0089 0.0041 0.0018 0.0011

MISEprop(F̂k , FNk ) 0.0094 0.0047 0.0020 0.0012

MISEopt(F̂k , FNk )/MISEprop(F̂k , FNk ) 0.9489 0.8588 0.9011 0.8987

MISEopt(F̂k , FNk )/MISEopt(Fnk , FNk ) 0.9888 0.9948 0.9975 0.9992

MISEprop(F̂k , FNk )/MISEprop(Fnk , FNk ) 0.9882 0.9945 0.9977 0.9991

MISEopt(F̂k , F) 0.0096 0.0046 0.0024 0.0016

MISEprop(F̂k , F) 0.0100 0.0051 0.0026 0.0018

MISEopt(F̂k , F)/MISEprop(F̂k , F) 0.9627 0.8913 0.9298 0.9122

MISEopt(F̂k , F)/MISEopt(Fnk , F) 0.9888 0.9944 0.9973 0.9988

MISEprop(F̂k , F)/MISEprop(Fnk , F) 0.9878 0.9939 0.9974 0.9988

(1) From Tables 3, 4 and Fig. 7, one can see that there are no significant differencesbetween the two allocations regarding the performance of SCBs. The coveragefrequencies of SCBs for population CDF FNk based on EDF Fnk and KDF F̂kwith bandwidth hs1 are close to the nominal levels. Meanwhile, three curves ofFNk , Fnk , F̂k are very close in Fig. 7. All these results reveal that the smooth KDEF̂k is asymptotically as efficient as the nonsmooth EDF Fnk , and automaticallyinherits the asymptotic properties of Fnk , which is consistent with Theorem 4.

(2) It is not surprising that the SCBs based on F̂k with bandwidth hs2 do not workwell. This is because in our simulation settings the above four Gamma distri-bution � (1.41, 0.66) , � (1.45, 2.25), � (0.67, 2.98), � (0.76, 9.56) as Fs, s =

123

Page 13: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 995

population CDFstratified KDEsmooth SCBstratified EDFnonsmooth SCB

(a)

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

population CDFstratified KDEsmooth SCBstratified EDFnonsmooth SCB

(b)

Fig. 7 Plots of 95% SCBs for the population CDF FNk with bandwidth hs1 at nk = 300 by opti-mal/proportional allocation, respectively. a Optimal allocation. b Proportional allocation

Table 2 Comparisons of the computing time (minutes) of SCBs’ coverage frequencies given the confidencelevels 1 − α = 0.99, 0.95, 0.9, 0.8 based on the bootstrap/limiting method with bandwidth hs1 over 1000replications under the proportional allocation

Computing time nk = 150 nk = 300 nk = 600 nk = 900

Limiting 6.89 7.58 8.35 8.83

Bootstrap 215.30 350.17 596.49 868.00

1, 2, 3, 4, do not satisfy p+ γ > 3/2. To examine this, we ran further simulationstudies by choosing� (2, 0.5),� (3, 1) as Fs, s = 1, 2,which satisfy p+γ > 3/2,and the results with N1k = 2000, N2k = 1000 under the proportional allocationare shown in Table 5. However, bandwidth hs2 still does not perform as well ashs1. Since bandwidth hs2 satisfies Condition (C5) only when Fs has smoothnessorder p + γ > 3/2 and is also sensitive to the simulation results, we recommendusing bandwidth hs1 in practice.

(3) Tables 3, 4 and 5 show that the coverage frequencies of SCBs for FNk based on Fnkand F̂k with bandwidth hs1 by the bootstrap method are generally as good as thatof SCBs by the limiting method. But in Table 2, one can see that the computingtime for the bootstrap is much longer than that for the limiting method.

(4) With Nk fixed at 3000, the performance of the SCBs for F deteriorates as nkincreases from150 to 900, because the condition limk→∞ nk/Nk = 0 forTheorem3 is increasingly violated.

(5) For comparisons, we use the same stratified samples but incorrectly treat them asif they were simple random samples and apply the method developed in Wanget al. (2016) to construct SCBs for FNk and F . As shown in Tables 3, 4 and 5,the coverage frequencies of SCBs for simple random sampling are much lowerthan the nominal levels under the optimal allocation, while are generally higherthan the nominal levels under the proportional allocation. In one word, the SCBs

123

Page 14: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

996 L. Gu et al.

Table3

Coveragefrequenciesof

SCBsforFNkandF

underop

timal

allocatio

nbasedon

bootstrap/lim

iting

metho

dcomparedto

simplerand

omsampling:

leftof

the

parentheses-F̂kwith

hs1;right

oftheparentheses-F̂kwith

hs2;insidetheparentheses-Fn k

n kSC

B1

−α

=0.99

1−

α=

0.95

1−

α=

0.90

1−

α=

0.80

150

Boo

tstrap,FNk

0.98

8(0.987

)0.98

40.94

5(0.947

)0.93

70.89

2(0.899

)0.85

70.77

4(0.778

)0.64

8

Lim

iting

,FNk

0.98

8(0.984

)0.99

90.93

8(0.931

)0.97

90.89

2(0.883

)0.95

00.79

4(0.774

)0.88

0

SRS,

FNk

0.95

8(0.956

)0.98

20.86

4(0.858

)0.91

10.80

1(0.804

)0.84

90.69

6(0.692

)0.78

2

Lim

iting

,F

0.98

3(0.979

)0.99

70.93

6(0.927

)0.97

80.87

1(0.856

)0.95

00.76

2(0.744

)0.88

0

SRS,

F0.95

5(0.953

)0.98

10.86

1(0.858

)0.91

90.79

0(0.782

)0.86

00.66

4(0.656

)0.77

7

300

Boo

tstrap,FNk

0.98

4(0.984

)0.98

10.95

4(0.956

)0.92

70.90

1(0.901

)0.77

90.80

4(0.792

)0.47

7

Lim

iting

,FNk

0.98

5(0.985

)0.99

30.95

2(0.950

)0.97

10.91

2(0.908

)0.94

10.79

9(0.787

)0.81

2

SRS,

FNk

0.91

6(0.915

)0.93

60.74

7(0.746

)0.80

90.61

6(0.615

)0.69

80.44

8(0.445

)0.54

3

Lim

iting

,F

0.97

4(0.976

)0.98

90.92

9(0.923

)0.96

80.87

9(0.871

)0.93

20.75

8(0.732

)0.77

7

SRS,

F0.88

8(0.888

)0.92

80.72

3(0.722

)0.79

80.59

0(0.588

)0.69

50.42

4(0.417

)0.54

8

600

Boo

tstrap,FNk

0.98

3(0.985

)0.98

00.95

5(0.956

)0.76

50.90

6(0.909

)0.44

80.81

4(0.809

)0.12

6

Lim

iting

,FNk

0.98

4(0.983

)0.99

20.94

5(0.944

)0.94

20.90

3(0.904

)0.75

60.80

4(0.795

)0.39

5

SRS,

FNk

0.69

5(0.694

)0.73

00.43

2(0.430

)0.47

90.29

9(0.296

)0.36

00.17

6(0.172

)0.20

5

Lim

iting

,F

0.96

4(0.964

)0.98

60.83

3(0.833

)0.91

50.81

7(0.810

)0.71

90.69

7(0.689

)0.35

3

SRS,

F0.63

9(0.639

)0.71

10.39

2(0.391

)0.47

70.27

8(0.276

)0.35

60.16

2(0.161

)0.20

1

900

Boo

tstrap,FNk

0.98

9(0.991

)0.94

80.94

1(0.940

)0.47

40.89

5(0.894

)0.15

60.80

5(0.806

)0.01

9

Lim

iting

,FNk

0.98

6(0.986

)0.98

60.95

1(0.949

)0.75

30.90

4(0.904

)0.38

80.79

6(0.793

)0.09

1

SRS,

FNk

0.41

9(0.421

)0.44

80.20

0(0.200

)0.22

40.08

3(0.085

)0.09

80.04

0(0.040

)0.02

8

Lim

iting

,F

0.95

5(0.956

)0.96

70.85

0(0.852

)0.69

40.74

6(0.748

)0.34

10.60

6(0.601

)0.07

6

SRS,

F0.36

1(0.360

)0.44

00.16

7(0.167

)0.20

50.09

2(0.093

)0.12

10.04

6(0.047

)0.05

1

123

Page 15: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 997

Table4

Coveragefrequenciesof

SCBsforFNkandFun

dertheprop

ortio

nalallo

catio

nbasedon

thebo

otstrap/lim

iting

metho

dcomparedto

simplerand

omsampling:

left

oftheparentheses-F̂kwith

hs1;right

oftheparentheses-F̂kwith

hs2;insidetheparentheses-Fn k

n kSC

B1

−α

=0.99

1−

α=

0.95

1−

α=

0.90

1−

α=

0.80

150

Boo

tstrap,FNk

0.98

9(0.987

)0.98

90.95

1(0.951

)0.93

90.90

4(0.896

)0.84

90.79

0(0.776

)0.62

5

Lim

iting

,FNk

0.98

5(0.985

)0.99

60.94

5(0.934

)0.97

70.89

8(0.885

)0.94

90.80

5(0.771

)0.88

1

SRS,

FNk

0.99

7(0.997

)1.00

00.97

2(0.972

)0.99

00.94

3(0.940

)0.97

20.87

6(0.875

)0.94

1

Lim

iting

,F

0.97

9(0.978

)0.99

50.93

4(0.926

)0.97

30.88

0(0.867

)0.94

40.77

5(0.739

)0.88

5

SRS,

F0.99

5(0.995

)0.99

80.96

6(0.965

)0.98

90.92

9(0.929

)0.98

90.86

6(0.858

)0.93

2

300

Boo

tstrap,FNk

0.98

8(0.990

)0.98

40.94

2(0.938

)0.89

20.87

4(0.874

)0.71

80.76

5(0.759

)0.37

5

Lim

iting

,FNk

0.98

8(0.986

)0.99

30.94

4(0.941

)0.97

10.89

4(0.889

)0.93

60.79

4(0.774

)0.76

3

SRS,

FNk

0.99

6(0.996

)0.99

70.97

7(0.976

)0.98

80.94

4(0.943

)0.97

40.87

4(0.871

)0.92

3

Lim

iting

,F

0.97

7(0.974

)0.99

00.92

1(0.917

)0.96

70.86

2(0.854

)0.91

80.73

3(0.718

)0.74

9

SRS,

F0.99

0(0.990

)0.99

60.95

7(0.956

)0.98

10.91

6(0.914

)0.95

70.83

9(0.836

)0.90

7

600

Boo

tstrap,FNk

0.99

5(0.994

)0.97

80.94

2(0.941

)0.68

30.89

4(0.893

)0.34

80.76

4(0.770

)0.08

3

Lim

iting

,FNk

0.99

1(0.990

)0.99

50.95

2(0.952

)0.92

70.91

1(0.909

)0.73

40.81

8(0.820

)0.33

0

SRS,

FNk

0.99

7(0.997

)0.99

80.97

9(0.982

)0.99

20.95

5(0.954

)0.97

10.89

8(0.898

)0.89

0

Lim

iting

,F

0.97

2(0.971

)0.98

80.91

4(0.911

)0.90

00.84

3(0.844

)0.72

30.71

3(0.707

)0.34

2

SRS,

F0.98

9(0.989

)0.99

70.94

8(0.949

)0.97

50.91

5(0.914

)0.94

50.82

4(0.824

)0.85

5

900

Boo

tstrap,FNk

0.99

0(0.991

)0.91

60.96

0(0.962

)0.38

00.93

1(0.931

)0.10

40.81

8(0.818

)0.00

9

Lim

iting

,FNk

0.98

8(0.987

)0.97

50.94

6(0.945

)0.66

50.89

2(0.890

)0.29

00.79

2(0.789

)0.05

3

SRS,

FNk

0.99

7(0.997

)0.99

70.98

1(0.981

)0.97

90.94

7(0.947

)0.91

80.88

4(0.882

)0.64

3

Lim

iting

,F

0.94

2(0.939

)0.95

40.84

3(0.838

)0.63

30.75

4(0.744

)0.28

70.59

6(0.594

)0.04

9

SRS,

F0.97

6(0.975

)0.98

70.91

0(0.909

)0.94

40.84

5(0.842

)0.86

30.71

9(0.717

)0.59

7

123

Page 16: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

998 L. Gu et al.

Table5

Coveragefrequenciesof

SCBsforCDFof

popu

latio

nwith

twostrata

from

Gam

madistributio

nun

dertheprop

ortio

nalallocatio

nbasedon

thebo

otstrap/lim

iting

methodcomparedto

simplerandom

sampling:

leftof

theparentheses-F̂kwith

hs1;right

oftheparentheses-F̂kwith

hs2;insidetheparentheses-Fn k

n kSC

B1

−α

=0.99

1−

α=

0.95

1−

α=

0.90

1−

α=

0.80

150

Boo

tstrap,FNk

0.99

0(0.989

)0.98

70.94

8(0.946

)0.94

20.89

9(0.896

)0.88

80.80

9(0.791

)0.78

1

Lim

iting

,FNk

0.98

2(0.983

)0.99

30.94

6(0.940

)0.97

90.88

9(0.883

)0.95

70.78

8(0.776

)0.90

0

SRS,

FNk

0.99

6(0.996

)0.99

90.98

5(0.984

)0.99

40.97

0(0.968

)0.98

80.92

3(0.908

)0.97

5

Lim

iting

,F

0.97

8(0.976

)0.99

40.93

7(0.928

)0.97

30.88

2(0.873

)0.95

20.76

5(0.738

)0.89

9

SRS,

F0.99

6(0.996

)0.99

90.98

0(0.979

)0.99

50.96

4(0.961

)0.98

70.90

4(0.897

)0.97

0

300

Boo

tstrap,FNk

0.98

5(0.983

)0.98

20.93

2(0.934

)0.92

90.88

8(0.885

)0.86

90.78

3(0.779

)0.75

0

Lim

iting

,FNk

0.99

0(0.991

)1.00

00.94

4(0.942

)0.97

80.90

4(0.901

)0.94

50.80

2(0.796

)0.89

1

SRS,

FNk

1.00

0(1.000

)1.00

00.98

9(0.988

)1.00

00.97

1(0.969

)0.99

00.92

2(0.921

)0.96

5

Lim

iting

,F

0.98

3(0.981

)0.99

60.92

6(0.922

)0.97

10.86

4(0.856

)0.93

70.71

9(0.710

)0.87

0

SRS,

F0.99

7(0.997

)1.00

00.98

4(0.984

)0.99

50.95

3(0.951

)0.98

40.88

5(0.883

)0.95

5

600

Boo

tstrap,FNk

0.98

8(0.986

)0.98

40.94

2(0.942

)0.93

10.89

2(0.899

)0.88

10.80

1(0.809

)0.76

7

Lim

iting

,FNk

0.99

5(0.995

)0.99

80.95

9(0.959

)0.97

90.92

0(0.921

)0.95

00.81

9(0.816

)0.89

4

SRS,

FNk

1.00

0(1.000

)1.00

00.99

6(0.995

)0.99

90.97

6(0.976

)0.99

50.93

5(0.934

)0.96

6

Lim

iting

,F

0.97

0(0.969

)0.99

20.90

3(0.902

)0.95

40.83

4(0.831

)0.91

50.69

7(0.690

)0.83

2

SRS,

F0.99

7(0.997

)1.00

00.97

1(0.968

)0.99

10.93

7(0.933

)0.97

40.85

5(0.853

)0.93

8

900

Boo

tstrap,FNk

0.99

2(0.992

)0.99

10.94

1(0.944

)0.92

70.89

3(0.895

)0.86

70.79

5(0.796

)0.74

6

Lim

iting

,FNk

0.98

4(0.983

)0.98

80.94

8(0.948

)0.96

30.90

9(0.906

)0.93

40.80

1(0.799

)0.86

1

SRS,

FNk

0.99

6(0.996

)0.99

70.98

3(0.982

)0.99

00.96

4(0.964

)0.98

00.92

3(0.923

)0.94

6

Lim

iting

,F

0.95

1(0.952

)0.98

10.86

0(0.856

)0.92

50.76

1(0.760

)0.86

30.59

8(0.597

)0.75

6

SRS,

F0.99

0(0.990

)0.99

50.95

7(0.955

)0.98

10.90

1(0.899

)0.95

40.79

7(0.796

)0.89

1

123

Page 17: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 999

for simple random sampling do not perform well for a stratified sample. This is astrong motivation for us to propose the new method in this paper.

A reviewer pointed out that the number of grid points can be sensitive to the results.There are two issues associated with this point: (1) how the number, a, of grid pointsaffects the simulated critical values and (2) how the number of grid points, b, affectsthe computation of the coverage frequencies. To investigate these two questions. weconducted some further simulations in the case of proportional allocation with nk =150. The results with the replication number of weighted sum of Brownian bridgesL = 20,000 show that the differences of the critical values using a = 401 and a = 801grid points are negligibly small at all significance levels under consideration, and thusthe differences of coverage frequencies between using a = 401 and a = 801 arealso negligibly small. Moreover, the coverage frequencies between using b = 401and b = 801 points are in fact identical at all significance levels. We believe that onemajor contributing factor to this phenomenon is the smooth and monotone increasingfeatures of the distribution function. In any event, using a = b = 401 appears to besufficient in our simulation studies.

In conclusion, we recommend the smooth SCB based on KDE F̂k with bandwidthhs1 for population CDF FNk . One can use the limiting method to obtain the criticalvalues for SCBs to save computing time. The nonsmooth SCB based on EDF Fnk is agood choice for validation purposes.

5 Illustrative data example

As an illustration, we apply the proposed method to the USA Census of Agriculturedataset introduced in the Introduction which is described and analyzed in Lohr (2009)(see Example 3.2). The population level data file is called agpop.dat. We focus on the1992 information on acreage devoted to farms for each of the Nk = 3078 counties andcounty-equivalents in the United States as our population. According to the censusregions, the population is divided into four strata: Northeast, North Central, South, andWest. After omitting the 19 missing data, the numbers of each stratum are Nk1 = 213,Nk2 = 1052 , Nk3 = 1376, Nk4 = 418. Figure 8 shows the four discrete populationdistribution functions FNsk (x), s = 1, . . . , 4, and their smoothed relative frequencyplots. These relative frequency plots appear to be from different Gamma distributions.Hence it is useful to stratify the overall population.

Table 6 shows the coverage frequencies of the SCBs for the agpop distributionFNk (x) under the proportional allocation over 1000 replications, including the smoothSCB based on KDE F̂k with bandwidth hs1 and the nonsmooth SCB based on EDFFnk by the bootstrap and limiting methods, compared to SRS. In contrast, the coveragefrequencies of the smooth SCB by the limiting method approach the nominal levels,while the SCBs for SRS are inaccurate as their coverage frequencies are much higherthan the nominal levels. Figure 9 depicts the true population CDF FNk , EDF Fnk ,KDE F̂k with bandwidth hs1 and the corresponding asymptotic 95% nonsmooth andsmooth SCBs at sample size nk = 300 in one of the 1000 runs. One sees that notonly the curves of Fnk , F̂k fit the true population CDF FNk well, but also the 95%nonsmooth and smooth SCBs both contain FNk . All these numerical and graphical

123

Page 18: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

1000 L. Gu et al.

0.0

0.2

0.4

0.6

0.8

1.0

Em

piric

al C

DF

NortheastNorth CentralSouthWestpopulation CDF

(a)

0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000

0e+

001e

−06

2e−

063e

−06

4e−

065e

−06

6e−

06

smoo

thed

rel

ativ

e fr

eque

ncy

NortheastNorth CentralSouthWest

(b)

Fig. 8 Plots of a empirical CDFs of each stratum and population, b smoothed relative frequencies for eachstratum in agpop.dat

Table 6 Coverage frequencies of the SCBs for agpop CDF FNk (x) with bandwidth hs1 based on 1000replications under proportional allocation

nk SCB 1 − α

0.99 0.95 0.90 0.80

150 Bootstrap, nonsmooth 0.982 0.935 0.843 0.733

Bootstrap, smooth 0.980 0.934 0.859 0.744

Limiting, nonsmooth 0.984 0.952 0.876 0.777

Limiting, smooth 0.985 0.952 0.893 0.786

SRS, nonsmooth 1.000 0.990 0.976 0.925

SRS, smooth 1.000 0.990 0.977 0.931

300 Bootstrap, nonsmooth 0.980 0.935 0.870 0.752

Bootstrap, smooth 0.985 0.941 0.876 0.760

Limiting, nonsmooth 0.989 0.944 0.874 0.746

Limiting, smooth 0.991 0.950 0.881 0.755

SRS, nonsmooth 0.999 0.983 0.955 0.892

SRS, smooth 0.999 0.984 0.956 0.895

600 Bootstrap, nonsmooth 0.995 0.966 0.915 0.818

Bootstrap, smooth 0.996 0.964 0.909 0.819

Limiting, nonsmooth 0.988 0.953 0.903 0.819

Limiting, smooth 0.989 0.955 0.905 0.825

SRS, nonsmooth 0.996 0.989 0.963 0.902

SRS, smooth 0.996 0.989 0.962 0.902

900 Bootstrap, nonsmooth 0.994 0.968 0.915 0.785

Bootstrap, smooth 0.994 0.967 0.921 0.790

Limiting, nonsmooth 0.990 0.965 0.910 0.794

Limiting, smooth 0.990 0.964 0.916 0.793

SRS, nonsmooth 0.998 0.990 0.971 0.915

SRS, smooth 0.998 0.990 0.971 0.916

123

Page 19: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 1001

0 500000 1000000 1500000 2000000 2500000 3000000

0.0

0.2

0.4

0.6

0.8

1.0

agpop SCB

population CDFstratified KDEsmooth SCBstratified EDFnonsmooth SCB

Fig. 9 Plot of 95% SCBs for the “agpop” CDF with bandwidth hs1 at nk = 300 under proportionalallocation

results suggest that our proposed SCBs are robust tools for statistical inference on thefinite population distribution in stratified random sampling.

The R code used to produce the results in the simulations and example is availableon the website: https://github.com/gulijie2018/StratifiedSCB.

Appendix

In this Appendix, we use an = o (bn) to denote that limn→∞ an/bn = 0, and an =O (bn) to denote that lim supn→∞ an/bn = c, where c is a constant. In addition,we denote by op

(Op

)and oa.s. a sequence of random variables of order o (O) in

probability and almost surely, respectively, while ua.s. means oa.s. uniformly in thedomain.

In the following we will prove Lemma 1 and Theorems 2–4.

A.1 Proof of Lemma 1

Our framework given in Sect. 2 and Condition (C2) ensure that, for any s ∈ {1, . . . , S},

limk→∞ (Nsk/Nk) = Ws, lim

k→∞ (nsk/nk) = ws, limk→∞ (nk/Nk) = C.

Hence,

CW−1s ws = lim

k→∞ (nk/Nk) (Nsk/Nk)−1 (nsk/nk) = lim

k→∞ (nsk/Nsk) ≤ 1.

123

Page 20: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

1002 L. Gu et al.

Making use of the simple inequality

nk/Nk =∑S

s=1 nsk∑S

s=1 Nsk≥ min

1≤s≤S(nsk/Nsk)

and letting k → ∞, one obtains that

C = limk→∞ nk/Nk ≥ lim

k→∞ min1≤s≤S

(nsk/Nsk) = min1≤s≤S

limk→∞ (nsk/Nsk)

= min1≤s≤S

CW−1s ws,

since C < 1 according to Condition (C2), one obtains that min1≤s≤S CW−1s ws < 1.

The Lemma 1 is proved. �

A.2 Proof of Theorem 2

For s = 1, . . . , S, combining (8) in Theorem 1 with Skorohod’s Representation The-orem shown in Theorem 6.7 of Billingsley (1999), there exits a version B̃sk (·) of

Brownian bridge Bs (·) that satisfies B̃sk (Fs(x))d→ Bs (Fs(x)) as k → ∞ such that

supx∈R∣∣∣λsk

{Fnsk (x) − FNsk (x)

} − B̃sk (Fs(x))∣∣∣ → 0, a.s.,

which implies that

Fnsk (x) − FNsk (x) = λ−1sk B̃sk (Fs(x)) + ua.s.

(λ−1sk

).

Recalling the definitions of FNk (x) and Fnk (x) given in (1) and (4), one has

λk{Fnk (x) − FNk (x)

} = λk

{S∑

s=1

Wsk Fnsk (x) −S∑

s=1

Wsk FNsk (x)

}

= λk

S∑

s=1

Wsk{Fnsk (x) − FNsk (x)

}

= λk

S∑

s=1

Wsk

{λ−1sk B̃sk (Fs(x)) + ua.s.

(λ−1sk

)}.

According to Condition (C2), as k → ∞,

nskNsk

= nsknk

· nkNk

· Nk

Nsk→p wsCW−1

s ,

123

Page 21: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 1003

and

λkWskλ−1sk = Nsk

Nk

√nknsk

· 1 − nsk/Nsk

1 − nk/Nk→p Ws

w−1s

1 − wsCW−1s

1 − C. (A.1)

Hence,

λk{Fnk (x) − FNk (x)

} d→S∑

s=1

Ws

√(w−1s − CW−1

s

)/ (1 − C)Bs {Fs(x)} .

The proof of Theorem 2 is completed. �

A.3 Proof of Theorem 3

Note that λk N−1/2k =

(n−1k − N−1

k

)−1/2N−1/2k = (nk/Nk)

1/2 (1 − nk/Nk)−1/2

→ 0 when nk/Nk → C ≡ 0 as k → ∞. Because of a sequence of populations{πk}∞k=1 as i.i.d. random samples generated from F(x), Donsker’s Theorem entails

that N 1/2k

{FNk (x) − F(x)

} d→ B {F(x)}. Hence, as k → ∞,

λkM(FNk , F) = λkOp

(N−1/2k

)= op (1) .

Then Theorem 3 follows by Theorem 2 and Slutsky’s Theorem. �

A.4 Proof of Theorem 4

According to the definitions of Fnk (x) and F̂k(x) given in (4) and (6), one has

λk

{Fnk (x) − F̂k(x)

}= λk

{S∑

s=1

Wsk Fnsk (x) −S∑

s=1

Wsk F̂sk(x)

}

.

Then (9) and (A.1) imply that

λkM(F̂k, FNk ) = λk supx∈R∣∣∣F̂k(x) − Fnk (x)

∣∣∣

≤ λk

S∑

s=1

Wsk supx∈R∣∣∣F̂sk(x) − Fnsk (x)

∣∣∣

= λk

S∑

s=1

Wskλ−1sk × op (1) = op (1) .

Applying Theorems 2 and 3 and Slutsky’s Theorem, Theorem 4 is proved. �

123

Page 22: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

1004 L. Gu et al.

References

Bickel, P. J., Rosenblatt,M. (1973). On some globalmeasures of the deviations of density function estimates.Annals of Statistics, 1, 1071–1095.

Billingsley, P. (1999). Convergence of Probability Measures (2nd ed.). New York: Wiley.Cai, L., Yang, L. (2015). A smooth simultaneous confidence band for conditional variance function. TEST,

24, 632–655.Cao,G.,Yang,L., Todem,D. (2012). Simultaneous inference for themean function basedondense functional

data. Journal of Nonparametric Statistics, 24, 359–377.Cao, G., Wang, L., Li, Y., Yang, L. (2016). Oracle-efficient confidence envelopes for covariance functions

in dense functional data. Statistica Sinica, 26, 359–383.Cardot, H., Josserand, E. (2011). Horvitz-Thompson estimators for functional data: asymptotic confidence

bands and optimal allocation for stratified sampling. Biometrika, 98, 107–118.Cardot, H., Degras, D., Josserand, E. (2013). Confidence bands for Horvitz–Thompson estimators using

sampled noisy functional data. Bernoulli, 19, 2067–2097.Chambers, R. L., Dunstan, R. (1986). Estimation distribution functions from survey data. Biometrika, 73,

597–604.Chen, J.,Wu, C. (2002). Estimation of distribution function and quantiles using themodel-calibrated pseudo

empirical likelihood method. Statistica Sinica, 12, 1223–1239.Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: Wiley.Degras, D. (2011). Simultaneous confidence bands for nonparametric regression with functional data.

Statistica Sinica, 21, 1735–1765.Frey, J. (2009). Confidence bands for the CDF when sampling from a finite population. Computational

Statistics and Data Analysis, 53, 4126–4132.Gu, L., Yang, L. (2015). Oracally efficient estimation for single-index link function with simultaneous

confidence band. Electronic Journal of Statistics, 9, 1540–1561.Gu, L., Wang, L., Härdle, W., Yang, L. (2014). A simultaneous confidence corridor for varying coefficient

regression with sparse functional data. TEST, 23, 806–843.Härdle, W. (1989). Asymptotic maximal deviation of M-smoothers. Journal of Multivariate Analysis, 29,

163–179.Liu, R., Yang, L. (2008). Kernel estimation of multivariate cumulative distribution function. Journal of

Nonparametric Statistics, 20, 661–677.Lohr, S. (2009). Sampling: Design and analysis (2nd ed.). Boston: Brooks/Cole.Ma, S., Yang, L., Carroll, R. (2012). A simultaneous confidence band for sparse longitudinal regression.

Statistica Sinica, 22, 95–122.McCarthy, P. J., Snowden, C. B. (1985). The bootstrap and finite population sampling. Vital and Health

Statistics, 73, 1–23.O’Neill, T., Stern, S. (2012). Finite population corrections for the Kolmogorov–Smirnov tests. Journal of

Nonparametric Statistics, 24, 497–504.Reiss, R. (1981). Nonparametric estimation of smooth distribution functions. Scandinavian Journal of

Statistics, 8, 116–119.Rosén, B. (1964). Limit theorems for sampling from finite population. Arkiv för Matematik, 5, 383–424.Shao, Q., Yang, L. (2012). Polynomial spline confidence band for time series trend. Journal of Statistical

Planning and Inference, 142, 1678–1689.Song, Q., Yang, L. (2009). Spline confidence bands for variance function. Journal of Nonparametric Statis-

tics, 21, 589–609.Song, Q., Liu, R., Shao, Q., Yang, L. (2014). A simultaneous confidence band for dense longitudinal

regression. Communications in Statistics-Theory and Methods, 43, 5195–5210.Wang, J., Yang, L. (2009). Polynomial spline confidence bands for regression curves. Statistica Sinica, 19,

325–342.Wang, J., Cheng, F., Yang, L. (2013). Smooth simultaneous confidence bands for cumulative distribution

functions. Journal of Nonparametric Statistics, 25, 395–407.Wang, J., Liu,R., Cheng, F.,Yang, L. (2014).Oracally efficient estimation of autoregressive error distribution

with simultaneous confidence band. Annals of Statistics, 42, 654–668.Wang, J., Wang, S., Yang, L. (2016). Simultaneous confidence bands for the distribution function of a finite

population and of its superpopulation. TEST, 25, 692–709.

123

Page 23: Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous confidence bands (SCBs)

SCB for stratified population distribution 1005

Wang, S., Dorfman, A. (1996). A new estimator for the finite population distribution function. Biometrika,83, 639–652.

Xia, Y. (1998). Bias-corrected confidence bands in nonparametric regression. Journal of the Royal StatisticalSociety Series B, 60, 797–811.

Zheng, S., Yang, L., Härdle, W. (2014). A smooth simultaneous confidence corridor for the mean of sparsefunctional data. Journal of the American Statistical Association, 109, 661–673.

Zhu, H., Li, R., Kong, L. (2012). Multivariate varying coefficient model for functional responses. Annalsof Statistics, 40, 2634–2666.

123