Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Estimating Disease Onset Distribution
Functions from Censored Mixture Data
Yanyuan Ma and Yuanjia Wang
Abstract
We consider nonparametric estimation of disease onset distribution functions inmultiple populations using censored data with unknown population identifiers. Theproblem is motivated from studies aiming at estimating the disease risk distributionin deleterious mutation carriers for the purpose of genetic counseling and design oftherapeutic intervention trials to modify disease onset progression. In these studies,the distribution of disease risk in subjects assumes a mixture form. Although thepopulation identifiers are missing, design and scientific knowledge allow easy calculationof the probability of an observation belonging to each population. We propose a generalfamily of simple weighted least squares estimators and show that the existing consistentnonparametric methods belong to this family. We further show that this family alsoincludes a class of imputation estimators that is not known in the literature for this typeof problems. We identify a computationally effortless estimator in the family, studyits asymptotic properties, and show its significant efficiency gain comparing to theexisting estimators in the literature. The application to a large genetic epidemiologicalstudy of Huntington’s disease reveals information on the age-at-onset distribution ofHD not previously available in the clinical literature which may generate new clinicalhypothesis.
KEY WORDS: Mixture observations; NPMLE; Unknown Population Label; WeightedLeast Squares. Huntington’s disease;
Short title: Estimation in Censored Mixture Data
Yanyuan Ma is Professor (E-mail: [email protected]), Department of Statistics, Texas A&M Univer-sity, 3143 TAMU, College Station, TX 77843-3143. Yuanjia Wang is Assistant Professor (E-mail: [email protected]), Department of Biostatistics, Columbia University, New York, NY 10032. Thisresearch is supported by grants from the U.S. National Science Foundation and the U.S. National Instituteof Neurological Disorders and Stroke (NS073671-01). Samples and data from the COHORT Study, whichreceives support from HP Therapeutics, Inc., were used in this study. We thank the Huntington StudyGroup COHORT investigators and coordinators who collected data and/or samples used in this study, aswell as participants and their families who made this work possible.
1 Introduction
In some scientific studies, it is of interest to estimate distribution function of an outcome us-
ing data arising from a mixture of multiple populations with unknown population identifiers.
For example, in Huntington’s disease (HD) research, one of the major goals is to estimate
the distribution of age-at-onset of HD (subject to censoring) in HD-gene mutation carriers.
Accurate estimation of the carrier distribution is important for genetic counseling and for
designing clinical trials of therapeutics modifying disease onset progression. These estimates
also provide prospective measures of the accuracy of a genetic mutation test. In the Co-
operative Huntington’s Observational Research Trial (COHORT, Huntington Study Group
COHORT Investigators, 2012), initial participants (probands) in COHORT underwent a
clinical evaluation and were genotyped. Through a systematic family history interview, they
also reported age-at-onset of HD of their relatives. However, relatives are not genotyped
and their mutation status is unknown. Thus, relatives form a sample of a mixture of car-
rier and non-carrier populations with unknown population identifiers. The probability of
a subject belonging to the mutation carrier population is easily calculated by Mendelian
inheritance. Such a design where probands are genotyped and provide disease onset times
(subject to censoring) in their relatives through a family history interview is commonly ap-
plied to study distribution of a disease in mutation carriers (Marder et al. 2003; Wang et
al. 2008; Huntington Study Group COHORT Investigators 2012).
Another example of studies resulting data with similar structure is the quantitative trait
locus (QTL) studies. In the discovery stage of an QTL study, subjects are genotyped at
known locations along their genome, and the goals are to determine the location of the
gene influencing manifestation of a quantitative trait. The genotypes at the typed markers
are known for a subject, but they are missing for locations in between markers. Under a
standard interval mapping framework (Lander and Botstein 1989, Wu et al. 2007), a subject’s
trait distribution is a mixture of QTL genotype-specific distributions, where the mixture
proportions are obtained from the design of experiment (e.g., backcross, intercross), location
1
and genotypes at the flanking markers and recombination fraction between the markers and
the QTL. In many cases, the quantitative outcome of interest such as time to flowering of a
plant (Ferrira et al. 1995; Lin and Wu 2006) is subject to right censoring.
The research goal of both types of studies can be formulated as estimating distribution
functions for censored outcomes arising from multiple populations while it is unknown from
which population each observation is drawn. The probability that an observation belongs
to each population can be calculated through taking into account the scientific knowledge
and the experiment design. Modeling the distribution in each population parametrically,
for example through a Gaussian mixture model, and proceeding with the usual maximum
likelihood estimation is one choice (McLachlan and Peel 2000). To be more flexible and
to leave the distribution in each population completely model free, Wacholder et al. (1998)
investigated a nonparametric model and proposed a nonparametric maximum likelihood es-
timator (NPMLE). Later on, two alternative estimators were developed, respectively aiming
at ensuring monotonicity of the nonparametric distribution function (Chatterjee and Wa-
cholder, 2001) and at improving estimation efficiency (Fine et al., 2004). Since the proposal
in Chatterjee and Wacholder (2001) is also a nonparametric maximum likelihood estimator,
to distinguish it from the original estimator in Wacholder et al. (1998), the original proposal
is named NPMLE1 and the modified version NPMLE2. The estimator in Fine et al. (2004)
exploits the independence assumption between the censoring times and event times, hence
is named IND.
When using these methods to analyze the COHORT data, we observe that the existing
nonparametric methods are inadequate. To be specific, NPMLE1 and IND are very ineffi-
cient. In addition, IND requires each subject to have a positive probability of being observed
to have an event at all time points of interest, which is often not satisfied for many chronic
diseases, and is violated in the COHORT study. In the COHORT study, not every family
member will eventually develop HD, therefore the probability of being censored is one for
the disease-free subjects. Finally, NPMLE2 is not a consistent estimator (Ma and Wang
2
2012). To provide valid estimation and improve stability and efficiency, we propose a general
family of simple weighted least square (WLS) type estimators. We derive the asymptotically
optimal member of this family, and identify an extremely computationally efficient estimator
that has competitive performance compared with the optimal member. We demonstrate the
relationship of the WLS family with the existing methods and with a class of imputation
based methods that has not been proposed for this type of problems.
The rest of this paper is organized as follows. In Section 2, we propose the WLS family,
identify the recommended estimator within this family and derive its asymptotic proper-
ties for inference. We study its relationship to the existing estimators and provide insights
on limitations of the existing estimators. In Section 3, we carry out simulation studies to
demonstrate the finite sample properties to illustrate our theoretical findings. In Section
4, we apply the proposed methods in the COHORT HD data to estimate the age-at-onset
distribution for HD-gene mutation carriers from family members whose genotypes are not ob-
served. Our analysis provides interesting information not previously available in the clinical
literature. We provide an example to perform HD risk prediction using HD mutation testing
results and other demographic information. We conclude this work with some discussions in
Section 5.
2 A family of WLS estimators
Assume there are p populations with cumulative distribution functions F1(t), . . . , Fp(t) that
are completely arbitrary. The corresponding probability density functions are f1(t), . . . , fp(t).
Let F(t) = {F1(t), . . . , Fp(t)}T and f(t) = {f1(t), . . . , fp(t)}T. Assume that the ith (i =
1, . . . , n) subject belongs to one of the p populations, and the probability of this observation
belonging to the kth population is qik for k = 1, . . . , p. Thus, we can write the ith observation
as (qi, Si), where qi = (qi1, . . . , qip)T, and Si is a random event time. Note that
∑pk=1 qik = 1.
In all the studies of interest, qi only takes m <∞ different vector values which we denote by
3
u1, . . . ,um, and we assume that there are rj observations corresponding to each of the uj’s
for j = 1, . . . ,m. Obviously∑m
j=1 rj = n. Assume further that the observations are subject
to censoring at Ci’s, and the censoring times are independent of the survival times. Note
that we also allow the situation that the censoring distribution has smaller support than the
support of the event times. In summary, an observation subject to censoring can be written
as (qi, Yi, δi), where Yi = min(Ci, Si) and δi = I(Si ≤ Ci). The observations are assumed to
be independent and ordered so that Y1 < Y2 · · · < Yn−1 < Yn. Our interest lies in estimating
the p distribution functions F1(t), . . . , Fp(t) and making inference.
To illustrate these notations using the studies we introduced in Section 1, note that
Si can be age-at-onset of an event (e.g., time-to-onset of HD). For the HD study, qik is the
probabilities of the ith relative carrying the kth genotype at the HD-gene given the proband’s
genotype status, and Fk(t) is the distribution function of Si’s within the subjects with the kth
genotype. An autosomal dominant disease yields p = 2, and additive genetic models yield
p = 3. Components of F(t) corresponding to certain genotypes are also called penetrance
functions of a mutation, that is, the probability of developing a disease by certain age for
homozygous or heterozygous mutation carriers. In the QTL studies, qik is the probability of
a subject carrying the kth genotype at the QTL given genotypes at the flanking markers.
The dimension p depends on the experimental design, for example, p = 2 for a backcross
experiment and p = 3 for an intercross experiment. In either situation, since genotypes may
not be observed, the distribution of Si is a mixture of F1, . . . , Fp, i.e., qTi F.
Taking advantage of the finiteness of m, we propose to first estimate distribution of the
outcomes in each of the m fixed mixture groups, and then use a familiar weighted least square
to retrieve the distribution F. Specifically, denote Hj(t) = uTj F(t) for j = 1, . . . ,m, and let
H(t) = {H1(t), . . . , Hm(t)}T. Obviously, Hj(t) is a valid cumulative distribution function,
and can be estimated using all observations with qi = uj for i = 1, . . . , n. For convenience,
the collection of observations with qi = uj is denoted as (Yji, δji) for i = 1, . . . , rj, and we
also assume they are ordered so that Yj1 < Yj2 < · · · < Yjrj−1 < Yjrj for all j = 1, . . . ,m.
4
Denote an estimated distribution function as Hj(t) and let H(t) = {H1(t), . . . , Hm(t)}T.
Denote the matrix U = (u1, . . . ,um). From H(t) = UTF(t), we easily obtain a WLS family
of estimators,
F(t) = (UWUT)−1UWH(t), (1)
where W is a m×m weight matrix.
2.1 The proposed estimator and its inference
Within the WLS family, we propose to use a diagonal matrix with r1, . . . , rm as diagonal
elements in the weight matrix W and use a classical Kaplan-Meier estimator in the jth
group to obtain Hj(t) for j = 1, . . . ,m. The resulting estimator has a simple form
F(t) = (URUT)−1URH(t) =
(m∑j=1
rjujuTj
)−1{ m∑j=1
rjujHj(t)
}, (2)
where R is a diagonal matrix with the diagonal elements r1, . . . , rm. Because the Kaplan-
Meier estimator is known to be root-n consistent, we can easily obtain that the estimator
F(t) is root-n consistent. Its asymptotic variance can be estimated as
cov{F(t)
}=
(m∑j=1
rjujuTj
)−1{ m∑j=1
r2j σ2j (t)uju
Tj
}(m∑j=1
rjujuTj
)−1,
where σ2j (t) = {1− Hj(t)}2
∑Yji≤t δji/{(rj − i)(rj − i+ 1)}.
This result provides an easy way to perform hypothesis testing. For example, to test
H0 : aTF(t) = c versus H1 : aTF(t) 6= c or H1 : aTF(t) < c or H1 : aTF(t) > c for any length
p vector a and any constant c, the Wald type test statistic is
T ={aTF(t)− c
}/[aTcov
{F(t)
}a]1/2
.
5
The statistic T has a standard normal distribution under H0. When a = (1,−1, 0, . . . , 0)T
and c = 0, this corresponds to testing whether the subjects from the first and second pop-
ulation have the same distribution at t, which is a research question often encountered in
practice.
It is also possible to perform the test simultaneously at several different t values. For
example, let t = (t1, . . . , tl)T and assume t1 < · · · < tl. Denote F(t) = {F(t1), . . . ,F(tl)}
and let c be a length l vector. Suppose one wishes to test H0 : aTF(t) = cT versus H1 :
aTF(t) 6= cT. Denote the Kaplan-Meier estimator Hj(t) = {Hj(t1), . . . , Hj(tl)}, and its
variance-covariance matrix as Vj(t). Using the asymptotic properties of the Kaplan-Meier
estimator, we know that Vj(t) can be estimated by Vj(t), where the (a, b)th entry is Vj,a,b =
{1− Hj(ta)}{1− Hj(tb)}∑
Yji≤ta δji/{(rj − i)(rj − i+ 1)} for any 1 ≤ a ≤ b ≤ l. Thus, one
can form the test statistic
T = {aTF(t)− cT}
[m∑j=1
{aT(URUT)−1URej}2Vj
]−1{aTF(t)− cT}T,
where ej is a length-m vector with one on the jth entry and zero elsewhere. Under H0,
T has a chi-square distribution with degrees of freedom l. A useful case in practice is
when a = (1,−1, 0, . . . , 0)T and c = 0. This corresponds to testing if the first and second
populations have the same distribution simultaneously at all values in the vector t.
Testing H0 : aTF(t) = c(t) at all t values is also possible, where c(t) is an arbitrary
deterministic function of t. From Breslow and Crowley (1974), r1/2j {Hj(t)−Hj(t)} converges
weakly to a Gaussian process for j = 1, . . . ,m with mean zero and an explicit covariance
function. Thus R(t) = aTF(t) − c(t) as a linear combination of the Hj(t)’s also has the
similar property of converging weakly to a Gaussian process. One can form test statistic
such as Kolmogorov-Smirnov type statistic supt∈[0,τ ]R(t) (Fleming et al. 1980) or∫ τ0R(t)dt
(Pepe and Fleming 1989) and derive their asymptotic null distributions.
However, the asymptotic distributions might not always be suitable to use in practice.
6
One reason is that the approximation at the large value of t can be quite imprecise. The
second reason is that the above asymptotic results are valid only in the region Hj(t) < 1.
In practice, some of the populations might have a smaller support than others. Hence,
depending on the uj values, for the same t value, some Hj(t) might be smaller than one
while others might be one. This creates complications in practice, especially because it is
often not known which Hj(t) has what support. The third reason is that only when rj is
large, the asymptotic expression will be a close approximation. However, in practice, some
of the rj values can be quite small. Due to these reasons, we propose to use an alternative
permutation approach when the asymptotic results are not suitable.
When p = 2, a test of scientific interest is whether there is a difference between distribu-
tions of mutation carriers and non-carriers, i.e., H0 : F1(t) = F2(t), either at a finite set of
t values or for the entire range. A permutation strategy can be used (Churchill and Doerge
1994) in this case. Specifically, we permute the (Yi, δi) pairs and couple them with q1, . . . ,qn
values to create a permuted sample, and use the estimator in (2) to obtain a new estimate
of F(t) and a permuted test statistic F1(t) − F2(t). Repeat this process sufficiently large
number of times to obtain the empirical distribution of F1(t) − F2(t) under H0. Note that
to perform the same test when p > 2, the permutation needs to be carried out separately
within each subsample sharing the same (qi3, . . . , qip)T value.
In the following, we further explore the WLS family of estimators in (1) and provide a
justification for our recommendation in (2). We also show that the two existing methods
NPMLE1 and IND are non-ideal members of the WLS family.
2.2 Choice of group estimation
We first study the choice of methods to estimate Hj(t) for j = 1, . . . ,m in (1). It is easy
to see that Hj(t) is the distribution function of Si’s for the collection of observations that
satisfies qi = uj. Thus, estimation within the uj group is a classical problem of estimating
distribution function with randomly censored data. The familiar Kaplan-Meier estimator is
7
known to be the maximum likelihood estimator in this setting (Kaplan and Meier, 1958)
hence provides the most efficient estimate for each Hj(t). Thus this is the optimal choice.
An additional advantage is that other than the independent censoring assumption, no extra
requirement needs to be imposed on the relationship between the censoring process and the
event process for the Kaplan-Meier estimator to be valid.
NPMLE1 in Wacholder et al. (1998) proceeds by performing a nonparametric maximum
likelihood in each of the m groups, and recover F(t) via F(t) = (UUT)−1UH(t). Hence it
is a member of the WLS family. It makes the good choice of using Kaplan-Meier estimation
in estimating Hj(t).
IND proposed in Fine et al. (2004) makes a different choice in estimating Hj(t), and then
recovers F(t) via F(t) = (URUT)−1URH(t). Hence it is also a member of the WLS family.
To estimate H(t), IND exploits the independence of the event process and the censoring
process, and uses the relation Pr(Yi > t) = Pr(Si > t) Pr(Ci > t). The IND estimates Hj(t)
through
Hj(t) = 1− 1
G(t)
{1∑n
i=1 I(qi = uj)
n∑i=1
I(qi = uj)I(Yi ≥ t)
},
where G(t) is a Kaplan-Meier estimate of the survival function G(t) = Pr(Ci > t) of the
censoring process. This method originates from Ying et al. (1995). However, it has several
limitations comparing to a direct Kaplan-Meier estimator of Hj(t). First, the method can
only be used in the region G(t) > 0. Therefore in the situations where the censoring process
has a smaller support than the event process, and if t is larger than the upper limit of the
possible censoring time, the method ceases to be valid. This is the case with the Huntington
study data. Second, it is less efficient than the maximum likelihood estimation, which is
reflected in our simulation results. Third, it is not easy to obtain the variance estimation of
the IND. Finally, although a Kaplan-Meier estimation is avoided in the estimation related
to the event process, it is still used in the estimation of the censoring process. Hence it does
8
not provide computational advantage either.
2.3 Choice of Weights
Because for different j values, Hj(t)’s are estimated using distinct observations, H1(t), . . . , Hm(t)
are mutually independent. Thus, the optimal weight matrix W should be diagonal. Let the
diagonal elements of W be w1, . . . , wm. The estimation variance of the WLS family in (1) is
cov{F (t)
}=
(m∑j=1
wjujuTj
)−1{ m∑j=1
w2jσ
2j (t)uju
Tj
}(m∑j=1
wjujuTj
)−1,
where σ2j (t) is the variance of the estimator Hj(t). Thus, theoretically, by letting wj =
1/σ2j (t), we would obtain the optimal weights in terms of estimation efficiency within the
WLS family.
Although this is the optimal weighting strategy in theory, in practice, we observe that it is
often sub-optimal. We provide several explanations. First, σ2j (t) is not known and can only be
estimated in practice. Although asymptotically this estimation itself does not cause efficiency
loss for any WLS estimator, it creates numerical instability in finite samples. This instability
can be especially harmful when some groups contain very few observations, because the
estimation of σ2j (t) can be noisy. Second, occasionally, it may happen that in one of the
groups, say the j0th group, the last observation is not censored and its observed event time
Srj0 is smaller than t, which is the time at which we are interested in estimating Hj0(t). In this
case, the Kaplan-Meier estimator yields Hj0(t) = 1, and the estimated variance σ2j0
(t) = 0.
Although this can be handled numerically either by assigning an upper limit on the weight
wj0 or by solving a constrained least square problem instead of directly implementing WLS,
the numerical instability caused by this phenomenon still persists. Intrinsically, this is caused
by the fact that we cannot assess σ2j0
(t) sufficiently well at the upper limit of the data. For
example, σ2j0
(t) = 0 might be a suitable estimation of the variability of Hj0(t) if the event
process indeed has the support to the left of t, and it might not be by chance that all the
9
observations are to the left of t.
Since Ma and Wang (2012) observed that in the absence of censoring, equally weighting
each observation and weighting each observation by its inverse-variance exhibited very little
difference in terms of estimation efficiency, we propose to simply assign equal weights to each
observation in the same uj group. This results in the weight choice of wj = rj in (2), which is
a direct result from the fact that if each observation has a weight of one, then the group with
rj observations gets a weight of rj. This weighting strategy is simple and extremely stable in
computation. In the simulations in Section 3, we did not find any better weighting scheme
other than this simple choice, even including the theoretically optimal weights calculated
using the true Hj(t) functions.
Inspecting the weighting choice of IND, we find that it matches exactly with our pro-
posal. NPMLE1 on the other hand makes the choice of assigning wj = 1 for j = 1, . . . ,m.
This choice unnecessarily downweights the observations belonging to the larger groups, and
consequently diminishes the advantage of the more accurately estimated Hj(t)’s. In practice,
we find this choice leads to substantial efficiency loss. In addition, it can also be vulnerable
to some degenerated groups. For example, when a group contains only one observation, the
Kaplan-Meier estimation in this group is certainly not reliable, yet this estimation is allowed
to enter the final estimator of F(t) with the same importance as the other estimates which
can greatly influence the final result.
2.4 Link with imputation methods
An alternative method of treating censoring in general is imputation, and is related to the
self-consistent estimator (Efron 1967). In our context, the idea is the following.
Suppose with full data, we have a consistent estimating equation
0 =n∑i=1
φ{I(Si ≤ t),qi,F(t)}. (3)
10
Under censoring, if Si is observed, then we can use the ith observation as it is in (3). If
Si is censored by Ci, then two situations can occur. If Ci > t, then it is certain that
Si > t as well. Hence we can safely replace I(Si ≤ t) by zero in the ith observation in
(3). If Ci ≤ t, then Si can be in (Ci, t] or in (t,∞). Given censoring, the probability of
Si ∈ (Ci, t] is {qTi F(t) − qT
i F(Ci)}/{1 − qTi F(Ci)}, while the probability of Si ∈ (t,∞) is
{1− qTi F(t)}/{1− qT
i F(Ci)}. Thus we can replace the ith term in (3) by two terms
qTi F(t)− qT
i F(Ci)
1− qTi F(Ci)
φ{1,qi,F(t)}+1− qT
i F(t)
1− qTi F(Ci)
φ{0,qi,F(t)}.
Of course, qTi F(·) is unknown, but it is Hj(·) for qi = uj and can be estimated using any of
the previous mentioned estimators. In summary, the final estimating equation is
0 =n∑i=1
δiφ{I(Si ≤ t),qi,F(t)}+n∑i=1
(1− δi)I(Ci > t)φ{0,qi,F(t)}
+n∑i=1
(1− δi)I(Ci ≤ t)
[qTi F(t)− qT
i F(Ci)
1− qTi F(Ci)
φ{1,qi,F(t)}+1− qT
i F(t)
1− qTi F(Ci)
φ{0,qi,F(t)}
],
where we write Hj(Ci) = qTi F(Ci) and Hj(t) = qT
i F(t) if qi = uj.
In fact, the only known class of consistent estimating equations of the form in (3) is
φ{I(Si ≤ t),qi,F(t)} = ωiqiI(Si ≤ t) − ωiqiqTi F(t) (Ma and Wang 2012). This yields the
estimating equation
n∑i=1
{δiωiqiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
qTi F(t)− qT
i F(Ci)
1− qTi F(Ci)
ωiqi − ωiqiqTi F(t)
}= 0.
In the Appendix, we show that this is a member of the WLS family (1).
11
3 Simulation study
We perform a simulation study to illustrate the finite sample performance of several estima-
tion and inference procedures discussed above. We generate a total of 1000 repetitions, each
with the sample size n = 1000. The data is generated from a mixture of p = 2 different popu-
lations, with m = 3 different mixing probability vectors. The first two mixing groups contain
approximately 40% and 5% of the observations, and the remaining 55% observations are in
the third mixing group. The true population distributions are both truncated exponential,
with support [0, 10] and [0, 5] respectively. We generate censoring times from a uniform
distribution between 0 and 3.9. This results in an approximately 50% censoring rate. We
perform both estimation and testing under this simulation design. When investigating the
type I error rate under H0, we set both distributions to be the same truncated exponential
with support [0, 5], while keeping everything else unchanged. This results in about 40%
censoring rate.
We implemented our proposed WLS method, as well as the existing methods: IND,
NPMLE1 and NPMLE2. For illustration, we also provided the oracle WLS method, where
the weights are calculated using the true distribution functions. The estimation results at
three representative time points are provided in Table 1, where they are respectively at the
beginning, middle and end of the range of the observed time points. The estimation results
for the entire curves are depicted in Figure 1. It is seen that NPMLE2 shows a very large bias
in comparison to all the other methods, while both IND and NPMLE1 have larger estimation
variability in comparison to the proposed WLS method. For example, at t = 1.95, the bias
of NPMLE2 is about 25 to 30 times larger than the other three consistent estimators (WLS,
IND and NPMLE1), while the empirical standard errors of IND and NPLME1 are 30% and
29% larger than the WLS, respectively. The efficiency gain is more notable towards the
higher end of the t values. When t = 2.92, the improvement in empirical standard error of
the proposed WLS over IND and NPMLE1 is 38% and 30%, respectively. The WLS has very
small biases, and its estimation variance is about the same as the oracle WLS (the difference
12
is ≤ 2%).
We also examined several tests based on the proposed WLS estimator. We report es-
timation and the single, multiple points and curve testing results in Table 2 and Table 3.
Table 2 shows the finite sample bias of the estimated cumulative distribution functions, their
empirical standard errors, estimated standard errors and 95% confidence interval coverage
at the three representative time points under the null and the alternative. It is seen that the
estimation biases are small, the estimation standard errors are well estimated and the 95%
confidence interval coverages are close to their nominal level. For the single and multiple time
points testing, we used the test statistics proposed in section 2.1 and their asymptotic null
distributions to compute p values. For testing the entire difference between two distribution
functions, we used the test statistic supt |F1(t) − F2(t)| and performed 1000 permutations
to compute its p value. It is seen from Table 3 that the type I error rates of all three tests
adhere to their nominal levels. In addition, the power of the three tests is comparable.
4 Analysis of COHORT data
We applied the methods discussed in previous sections to the COHORT HD data. HD
is an autosomal dominant neurodegenerative disease caused by expansion of CAG repeats
(Huntingtons Disease Collaborative Research Group, 1993). Subjects with a CAG repeats
length ≥36 are considered to be HD mutation positive and have highly elevated risk of
developing the disease, while CAG subjects with repeats length< 36 do not develop HD
(Rubinsztein et al. 1996; Nance et al. 1998; Walker 2007). There were 4735 subjects in
the analysis, with m = 6 distinct values of a subject’s probability of carrying the HD gene
mutation. It is known that the Cox proportional hazards model may not be suitable for
modeling age-at-onset of HD and a parametric model with six parameters was proposed
(Langbehn et al. 2004).
We take a nonparametric approach to protect against misspecification. We implemented
13
the proposed WLS as well as consistent estimators in the existing literature (IND and
NPMLE1), and the results are shown in Figure 2. IND provides highly non-smooth es-
timates for the carrier group at several ages (30 and 90) and has an estimate much larger
than one at older ages. It also provided some positive estimates for the non-carrier group,
which is inconsistent with clinical knowledge since subjects without HD mutation do not
develop HD (Rubinsztein et al. 1996; Nance et al. 1998; Walker 2007). The performance
problems for IND are encountered because in some of the m subgroups, the sample size is
small. Thus estimates within these groups could result in large variability, and they subse-
quently adversely influence the overall estimates when combining subgroups to form IND.
NPMLE1 provides an estimated cumulative risk of below 40% at age 80, which may be too
low comparing with the existing clinical literature. Finally, the proposed WLS estimates the
cumulative risk of HD to be 33.9% (95% CI: [32.0%, 35.8%]) by age 40 and 74.5% (95% CI:
[73.9%, 76.0%]) by age 80.
Next, we estimate the disease distribution functions stratified by gender. Figure 3
presents the estimated distribution functions for males and females. We presented the same
four estimators as in the overall analysis and similar conclusions for these estimators can be
drawn. The proposed WLS provides the most reasonable estimates. It suggests that females
might have a slightly elevated risk than males across a wide range of ages. We performed a
permutation test of the difference between the entire distribution curve of female and male
carriers as introduced in Section 2.1, and obtained a p-value of 0.083, which suggests some
evidence of a gender effect.
To further investigate the potential gender difference, we estimate the distribution func-
tion stratified by both gender and whether a subject reported an affected father or affected
mother at the time of the family history interview (Figure 4). We observe that female car-
riers with an affected father had a slightly higher risk than female carriers with an affect
mother across a wide range of age. In contrast, male carriers with an affected father had
a similar risk comparing to male carriers with an affected mother until age 60, and after
14
age 60 the risk in the former is slightly higher. The test comparing the difference between
female carriers with affected father to female carriers with affected mother had a p-value of
0.096. These results are consistent with a potential anticipation effect: it is observed that
a male could transmit an expanded CAG repeats to his offspring which may increase the
likelihood of an earlier age-at-onset in the offspring (Ranen et al. 1995; Wexler et al. 2004).
Our analysis suggests that the anticipation effect may manifest in female offsprings across
a wide range of age, while for male offsprings, the anticipation effect did not manifest until
about age 60. This observation has not been noted in the clinical literature and may worth
further investigation as a potential new medical hypothesis. Finally, a test comparing female
or male carriers who reported an affected parent (either mother or father) to female or male
carriers who did not report any affected parent at the time of interview is significant with a
p-value < 0.01.
The estimated risk curve can also serve as a prospective accuracy measure of HD mutation
test. Let Y = 1 denote a positive mutation test and Y = 0 a negative mutation test. The
first component of F(t) is the cumulative risk for carriers, i.e., F1(t) = pr(T ≤ t|Y = 1).
This quantity is referred as the positive predictive value in the diagnostic testing literature.
Unlike retrospective accuracy measures such as sensitivity and specificity which are studied
extensively, prospective predictive measures (especially for time-to-event outcomes) are less
explored. Here the estimated curves can be used to predict a subject’s risk of HD given his
or her mutation test results and other demographic information. For example, from Figure
4, a female subject who has a positive HD mutation and report an affected father has about
65% chance of developing HD by age 50. These curves can also be used to predict conditional
probabilities of developing HD in the next few years given the current age of a subject.
15
5 Discussion
Here we provide a general WLS family to estimate the distribution functions of several
populations when the observations are from a mixture of these populations and are sub-
ject to right censoring. Existing consistent nonparametric estimators in these problems are
NPMLE1 and IND, and they are shown to be nonideal members of this family. We have
further proposed a practically optimal member of the WLS family. It is easy to see that when
there is no censoring, the proposed WLS estimator is identical to the IND. However, when
there is censoring we demonstrate that the propose estimator has superior performance and
computational stability comparing to both IND and NPMLE1. In addition, the proposed
estimator is extremely easy to implement and its asymptotic properties are also easily estab-
lished. We illustrate the methods and their applications to perform risk prediction through
an application to the COHORT study.
Finally, it is worth pointing out that in addition to the WLS family, an alternative estima-
tor is a maximum likelihood estimator (MLE) through self-consistent equation. Specifically,
when no censoring is present, treating F(t) as a parameter, its log-likelihood function is∑ni=1 I(Si ≤ t)log{qT
i F(t)}+∑n
i=1 I(Si > t)log{1−qTi F(t)}. Maximizing this function with
respect to F(t) will yield an estimating equation
n∑i=1
φ{I(si ≤ t),qi,F(t)} =n∑i=1
I(Si ≤ t)− qTi F(t)
qTi F(t){1− qT
i F(t)}qi = 0.
We can then use the same imputation procedure as in Section 2.4 to obtain a new estimator
for F(t) that is not within the WLS family. The MLE has not been reported in the literature
before, hence this provides another new estimator. However, when examining it in the
simulations, we find no efficiency gain over the proposed WLS estimator. In addition, since
the MLE cannot be solved explicitly, its computation requires iteration procedure such as
the Newton-Raphson method. In some occasions, the iterative computation may cause
numerical instability, and the algorithm may even fail to converge. In light of these numerical
16
performances, we suggest to use the proposed WLS method in (2).
Appendix
Illustration on imputation estimators as a member of WLS family
The imputation estimating equation
n∑i=1
{δiωiqiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
qTi F(t)− qT
i F(Ci)
1− qTi F(Ci)
ωiqi − ωiqiqTi F(t)
}= 0.
can be explicitly solved to obtain
F(t) =
(n∑i=1
ωiqiqTi
)−1 n∑i=1
{δiωiqiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
qTi F(t)− qT
i F(Ci)
1− qTi F(Ci)
ωiqi
}
=
[m∑j=1
ujuTj
{n∑i=1
ωiI(qi = uj)
}]−1m∑j=1
uj
n∑i=1
I(qi = uj)ωi
{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
Hj(t)− Hj(Ci)
1− Hj(Ci)
}.
Denote
Hj(t) =n∑i=1
ωiI(qi = uj)∑ni=1 ωiI(qi = uj)
{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
Hj(t)− Hj(Ci)
1− Hj(Ci)
}
=n∑i=1
ωiI(qi = uj)∑ni=1 ωiI(qi = uj)
{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
Hj(t)−Hj(Ci)
1−Hj(Ci)
}+R.
Then
R =n∑i=1
ωiI(qi = uj)∑ni=1 ωiI(qi = uj)
(1− δi)I(Ci ≤ t)
{Hj(t)− Hj(Ci)
1− Hj(Ci)− Hj(t)−Hj(Ci)
1−Hj(Ci)
}
has the property that n1/2R has a normal distribution with mean zero when n → ∞ as
long as Hj(t)’s are consistent estimates of Hj(t) and are asymptotically normal. Simple
17
calculation shows that in the jth group,
E
{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)
Hj(t)−Hj(Ci)
1−Hj(Ci)
}= Hj(t).
Hence, Hj(t) is a root-n consistent estimator of Hj(t), and the imputation estimator has the
equivalent form of
F(t) =
[m∑j=1
ujuTj
{n∑i=1
ωiI(qi = uj)
}]−1 m∑j=1
uj
{n∑i=1
ωiI(qi = uj)
}Hj(t).
Viewing∑n
i=1 ωiI(qi = uj) as wj, the imputation estimator is within the WLS family (1).
References
Breslow, N. and Crowley, J. (1974). A large sample study of the life table and product limit
estimates under random censorship. Annals of Statistics, 2, 437-453.
Chatterjee, N. and Wacholder, S. (2001). “A Marginal Likelihood Approach for Estimating
Penetrance from Kin-cohort Designs”. Biometrics, 57, 245-252.
Churchill, G. A., and Doerge, R. W. (1994). Empirical threshold values for quantitative
trait mapping. Genetics, 138, 963-971.
Efron, B. (1967). “The Two-Sample ProblemWith Censored Data,” in Proceedings Fifth
Berkeley Symposium in Mathematical Statistics, Vol. IV, eds. L. Le Cam and J. Ney-
man, New York: Prentice-Hall, pp. 831-853.
Ferreira, M. E., Stagopan, J., Yandell, B. S., Williams, P.H. and Osborn, T.C. (1995).
“Mapping loci controlling vernalization requirement and flower time in Brassica napus”.
Theoretical and Applied Genetics, 90, 727-732.
Fine, J. P., Zou, F. and Yandell, B. S. (2004). Nonparametric estimation of the effects of
18
quantitative trait loci. Biometrics, 5, 501-513.
Fleming, T., O’Fallon, J., and O’Brien, P. (1980). Modified Kolmogorov-Smirnov Test
Procedures with Application to Arbitrarily Right-Censored Data, Biometrics, 36, 607-
625.
Gail, M., Pee, D., Benichou, J., and Carroll, R. (1999). Designing studies to estimate the
penetrance of an identified autosomal dominant mutation: cohort, case-control, and
genotyped-proband designs. Genetic Epidemiology, 16, 15-39.
Huntingtons Disease Collaborative Research Group. (1993). A novel gene containing a
trinucleotide repeat that is expanded and unstable on Huntingtons disease chromosomes.
Cell, 72, 971-983.
Huntington Study Group COHORT Investigators (2012). Characterization of a Large Group
of Individuals with Huntington Disease and their Relatives Enrolled in the COHORT
Study. PLoS ONE, In press.
Kaplan, E. L. and Meier, P. (1958). Nonparametric Estimation from incomplete observa-
tions. Journal of the American Statistical Association, 53, 457-481.
Lander, E. S. and Botstein, D. (1989). “Mapping Mendelian Factors Underlying Quantitative
Traits Using RFLP Linkage Maps”. Genetics, 121, 743-756.
Langbehn, D. R., Brinkman, R. R., Falush, D., Paulsen, J. S., Hayden, M. R. (2004). A new
model for prediction of the age of onset and penetrance for Huntington’s disease based
on CAG length. Clin Genet, 65, 267-277.
Lin, M. and R. L. Wu (2006). A joint model for nonparametric functional mapping of
longitudinal trajectories and time-to-events. BMC Bioinformatics 7(1), 138.
Ma, Y. and Wang, Y. (2012). Efficient Distribution Estimation for Data with Unobserved
Sub-population Identifiers. Electronic Journal of Statistics. 6, 710-737.
Marder K., Levy, G., Louis, E.D., Mejia-Santana, H., Cote, L.,Andrews, H.,Harris, J.,Waters,
19
C., Ford, B., Frucht, S., Fahn, S. and Ottman, R. (2003). Accuracy of family history
data on Parkinson’s disease. Neurology, 61, 18-23.
McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.
Moore, D., Chatterjee, N., Pee, D., and Gail, M. (2001). Pseudo-likelihood estimates of
the cumulative risk of an autosomal dominant disease from a kin-cohort study. Genet.
Epidemiol. 20, 210-227.
Nance M. A., Seltzer W, Ashizawa T, Bennett R, McIntosh N, Myers RH, Potter NT, Shea
DK. ACMG/ASHG Statement (1998). Laboratory guidelines for Huntington disease
genetic testing. Am J Hum Genet, 62, 1243-1247.
Pepe, M., and Fleming T. (1989). “Weighted Kaplan-Meier Statistics: A Class of Distance
Tests for Censored Survival Data.” Biometrics, 45, 497-507.
Ranen, N. G., Stine O. C., and Abbott M. H. et al. (1995), Anticipation and instability
of IT-15 (CAG)n repeats in parent-offspring pairs with Huntington disease, Am J Hum
Genet, 57, 593-602.
Rubinsztein, D. C., Leggo, J., Coles, R., and Almqvist, E. et al. (1996). Phenotypic charac-
terization of individuals with 30-40 CAG repeats in the Huntington disease (HD) gene
reveals HD cases with 36 repeats and apparently normal elderly individuals with 36-39
repeats. Am J Hum Genet, 59, 16-22.
The Huntington’s Disease Collaborative Research Group (1993). “A novel gene containing
a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromo-
somes”. Cell, 72, 971-983.
Wacholder, S., Hartge, P., Struewing, J., Pee, D., McAdams, M., Brody, L. and Tucker,
M. (1998). “The Kin-cohort Study for Estimating Penetrance”. American Journal of
Epidemiology, 148, 623-630.
Walker, F. O. (2007). “Huntington’s disease”. Lancet, 369, 218-228.
20
Wang, Y., Clark, L.N., Louis, E.D., Mejia-Santana, H., Harris, J., Cote, L.J., Waters, C.,
Andrews, D., Ford, B., Frucht, S., Fahn, S., Ottman, R., Rabinowitz, D. and Marder,
K (2008). Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using
the kin-cohort method. Arch Neurol. 65(4):467-474.PMID: 18413468.
Wexler, N. S., Lorimer, J., Porter, J., Gomez, F., Moskowitz, C. et al. (2004). Venezuelan
kindreds reveal that genetic and environmental factors modulate Huntington’s disease
age of onset, Proc Natl Acad Sci, 101, 3498-3503.
Wu, R., Ma, C., and Casella, G. (2007). S tatistical Genetics of Quantitative Traits: Linkage,
Maps, and QTL. New York: Springer.
Ying, Z. Jung, S. H. and Wei, L. J. (1995). Survival Analysis with median regression models.
Journal of the American Statistical Association, 90, 178-184.
21
Table 1: Simulation study. Estimation bias and variance for different estimators.
Method bias(F1) SD(F1) bias(F2) SD(F2)t = 0.9750,F(t) = (0.2357, 0.4203)T
Oracle -0.0010 0.0229 -0.0009 0.0229WLS 0.0004 0.0229 0.0004 0.0229IND -0.0000 0.0289 0.0009 0.0289
NPMLE1 0.0002 0.0310 0.0002 0.0310NPMLE2 -0.0222 0.0200 0.0136 0.0200
t = 1.9500,F(t) = (0.4203, 0.6785)T
Oracle -0.0034 0.0297 -0.0008 0.0297WLS -0.0014 0.0298 0.0005 0.0298IND -0.0013 0.0387 0.0004 0.0387
NPMLE1 -0.0018 0.0383 0.0001 0.0383NPMLE2 -0.0418 0.0268 0.0251 0.0268
t = 2.9250,F(t) = (0.5651, 0.8371)T
Oracle -0.0048 0.0349 -0.0019 0.0349WLS -0.0007 0.0347 -0.0001 0.0347IND 0.0003 0.0479 -0.0012 0.0479
NPMLE1 -0.0016 0.0444 -0.0010 0.0444NPMLE2 -0.0551 0.0324 0.0337 0.0324
22
Table 2: Simulation study. Estimation bias (bias), empirical standard error (sd), estimated
standard error (sd) and 95% confidence.
t F(t) bias(F ) sd(F ) sd(F ) 95% CIUnder H0 : F1(t) = F2(t)
0.9750 0.4203 -0.0010 0.0264 0.0262 0.93600.4203 0.0004 0.0300 0.0291 0.9400
1.9500 0.6785 -0.0002 0.0287 0.0280 0.94400.6785 0.0003 0.0308 0.0310 0.9470
2.9250 0.8371 -0.0002 0.0276 0.0276 0.95200.8371 -0.0010 0.0308 0.0306 0.9410
Under H1 : F1(t) 6= F2(t)0.9750 0.2357 0.0004 0.0229 0.0225 0.9500
0.4203 0.0004 0.0288 0.0284 0.94601.9500 0.4203 -0.0014 0.0298 0.0290 0.9380
0.6785 0.0005 0.0327 0.0320 0.94902.9250 0.5651 -0.0007 0.0347 0.0348 0.9500
0.8371 -0.0001 0.0336 0.0338 0.9480
Table 3: Simulation study. Empirical rejection rates for the single, muliple and curve testingat various nominal levels. Multiple t is the result of testing F1(t) = F2(t) at the three listedt values simultaneously. Curve is testing the curve F1(t) = F2(t).
t-value 0.01 0.05 0.1 0.2Under H0 : F1(t) = F2(t)
0.9750 0.0120 0.0470 0.1050 0.20601.9500 0.0130 0.0610 0.0970 0.18302.9250 0.0080 0.0510 0.0970 0.2050
Multiple t 0.0090 0.0540 0.1070 0.2080All t 0.0150 0.0620 0.1070 0.2090
Under H1 : F1(t) 6= F2(t)0.9750 0.9790 0.9950 0.9980 0.99801.9500 0.9950 0.9990 1.0000 1.00002.9250 0.9920 0.9970 0.9990 1.0000
Multiple t 0.9990 1.0000 1.0000 1.0000All t 0.9780 0.9970 0.9990 0.9990
23
Figure 1: Simulation study. True CDF (solid) and the mean (dashed), 95% pointwise con-fidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. TheOracle (left) are WLS (right) are plotted on the top row and are almost identical and havesuperior performance. IND (left), NPMLE1 (middle) and NPMLE2 (right) have either havewider bands or biased and are on the bottom row.
0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
0 0.5 1 1.5 2 2.5 3 3.5
0
0.2
0.4
0.6
0.8
1
24
Figure 2: COHORT study. The cumulative rate of disease onset based on WLS (upper-left),IND (upper-right), and NPMLE1 (lower).
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
25
Figure 3: COHORT study. The cumulative risk of disease onset based on WLS (upper-left),IND (upper-right), and NPMLE1 (lower) stratified by gender (black: female, red: male).
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
0 10 20 30 40 50 60 70 80 90 1000
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Age
Cum
ula
tive r
isk o
f H
D
26
Figure 4: COHORT study. The cumulative risk of disease onset stratified by gender andreport of affected father (black), affected mother (blue) or none (red) at the time of familyhistory interview (based on WLS).
10 20 30 40 50 60 70 80 900
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Age
Cum
ula
tive r
isk o
f H
D
Male
father
mother
non
10 20 30 40 50 60 70 80 900
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Age
Cum
ula
tive r
isk o
f H
D
Female
father
mother
non
27