Estimating Disease Onset Distribution Functions from ... · of problems. We identify a computationally e ortless estimator in the family, study ... it is of interest to estimate distribution

Estimating Disease Onset Distribution

Functions from Censored Mixture Data

Yanyuan Ma and Yuanjia Wang

Abstract

We consider nonparametric estimation of disease onset distribution functions inmultiple populations using censored data with unknown population identifiers. Theproblem is motivated from studies aiming at estimating the disease risk distributionin deleterious mutation carriers for the purpose of genetic counseling and design oftherapeutic intervention trials to modify disease onset progression. In these studies,the distribution of disease risk in subjects assumes a mixture form. Although thepopulation identifiers are missing, design and scientific knowledge allow easy calculationof the probability of an observation belonging to each population. We propose a generalfamily of simple weighted least squares estimators and show that the existing consistentnonparametric methods belong to this family. We further show that this family alsoincludes a class of imputation estimators that is not known in the literature for this typeof problems. We identify a computationally effortless estimator in the family, studyits asymptotic properties, and show its significant efficiency gain comparing to theexisting estimators in the literature. The application to a large genetic epidemiologicalstudy of Huntington’s disease reveals information on the age-at-onset distribution ofHD not previously available in the clinical literature which may generate new clinicalhypothesis.

KEY WORDS: Mixture observations; NPMLE; Unknown Population Label; WeightedLeast Squares. Huntington’s disease;

Short title: Estimation in Censored Mixture Data

Yanyuan Ma is Professor (E-mail: [email protected]), Department of Statistics, Texas A&M Univer-sity, 3143 TAMU, College Station, TX 77843-3143. Yuanjia Wang is Assistant Professor (E-mail: [email protected]), Department of Biostatistics, Columbia University, New York, NY 10032. Thisresearch is supported by grants from the U.S. National Science Foundation and the U.S. National Instituteof Neurological Disorders and Stroke (NS073671-01). Samples and data from the COHORT Study, whichreceives support from HP Therapeutics, Inc., were used in this study. We thank the Huntington StudyGroup COHORT investigators and coordinators who collected data and/or samples used in this study, aswell as participants and their families who made this work possible.

1 Introduction

In some scientific studies, it is of interest to estimate distribution function of an outcome us-

ing data arising from a mixture of multiple populations with unknown population identifiers.

For example, in Huntington’s disease (HD) research, one of the major goals is to estimate

the distribution of age-at-onset of HD (subject to censoring) in HD-gene mutation carriers.

Accurate estimation of the carrier distribution is important for genetic counseling and for

designing clinical trials of therapeutics modifying disease onset progression. These estimates

also provide prospective measures of the accuracy of a genetic mutation test. In the Co-

operative Huntington’s Observational Research Trial (COHORT, Huntington Study Group

COHORT Investigators, 2012), initial participants (probands) in COHORT underwent a

clinical evaluation and were genotyped. Through a systematic family history interview, they

also reported age-at-onset of HD of their relatives. However, relatives are not genotyped

and their mutation status is unknown. Thus, relatives form a sample of a mixture of car-

rier and non-carrier populations with unknown population identifiers. The probability of

a subject belonging to the mutation carrier population is easily calculated by Mendelian

inheritance. Such a design where probands are genotyped and provide disease onset times

(subject to censoring) in their relatives through a family history interview is commonly ap-

plied to study distribution of a disease in mutation carriers (Marder et al. 2003; Wang et

al. 2008; Huntington Study Group COHORT Investigators 2012).

Another example of studies resulting data with similar structure is the quantitative trait

locus (QTL) studies. In the discovery stage of an QTL study, subjects are genotyped at

known locations along their genome, and the goals are to determine the location of the

gene influencing manifestation of a quantitative trait. The genotypes at the typed markers

are known for a subject, but they are missing for locations in between markers. Under a

standard interval mapping framework (Lander and Botstein 1989, Wu et al. 2007), a subject’s

trait distribution is a mixture of QTL genotype-specific distributions, where the mixture

proportions are obtained from the design of experiment (e.g., backcross, intercross), location

1

and genotypes at the flanking markers and recombination fraction between the markers and

the QTL. In many cases, the quantitative outcome of interest such as time to flowering of a

plant (Ferrira et al. 1995; Lin and Wu 2006) is subject to right censoring.

The research goal of both types of studies can be formulated as estimating distribution

functions for censored outcomes arising from multiple populations while it is unknown from

which population each observation is drawn. The probability that an observation belongs

to each population can be calculated through taking into account the scientific knowledge

and the experiment design. Modeling the distribution in each population parametrically,

for example through a Gaussian mixture model, and proceeding with the usual maximum

likelihood estimation is one choice (McLachlan and Peel 2000). To be more flexible and

to leave the distribution in each population completely model free, Wacholder et al. (1998)

investigated a nonparametric model and proposed a nonparametric maximum likelihood es-

timator (NPMLE). Later on, two alternative estimators were developed, respectively aiming

at ensuring monotonicity of the nonparametric distribution function (Chatterjee and Wa-

cholder, 2001) and at improving estimation efficiency (Fine et al., 2004). Since the proposal

in Chatterjee and Wacholder (2001) is also a nonparametric maximum likelihood estimator,

to distinguish it from the original estimator in Wacholder et al. (1998), the original proposal

is named NPMLE1 and the modified version NPMLE2. The estimator in Fine et al. (2004)

exploits the independence assumption between the censoring times and event times, hence

is named IND.

When using these methods to analyze the COHORT data, we observe that the existing

nonparametric methods are inadequate. To be specific, NPMLE1 and IND are very ineffi-

cient. In addition, IND requires each subject to have a positive probability of being observed

to have an event at all time points of interest, which is often not satisfied for many chronic

diseases, and is violated in the COHORT study. In the COHORT study, not every family

member will eventually develop HD, therefore the probability of being censored is one for

the disease-free subjects. Finally, NPMLE2 is not a consistent estimator (Ma and Wang

2

2012). To provide valid estimation and improve stability and efficiency, we propose a general

family of simple weighted least square (WLS) type estimators. We derive the asymptotically

optimal member of this family, and identify an extremely computationally efficient estimator

that has competitive performance compared with the optimal member. We demonstrate the

relationship of the WLS family with the existing methods and with a class of imputation

based methods that has not been proposed for this type of problems.

The rest of this paper is organized as follows. In Section 2, we propose the WLS family,

identify the recommended estimator within this family and derive its asymptotic proper-

ties for inference. We study its relationship to the existing estimators and provide insights

on limitations of the existing estimators. In Section 3, we carry out simulation studies to

demonstrate the finite sample properties to illustrate our theoretical findings. In Section

4, we apply the proposed methods in the COHORT HD data to estimate the age-at-onset

distribution for HD-gene mutation carriers from family members whose genotypes are not ob-

served. Our analysis provides interesting information not previously available in the clinical

literature. We provide an example to perform HD risk prediction using HD mutation testing

results and other demographic information. We conclude this work with some discussions in

Section 5.

2 A family of WLS estimators

Assume there are p populations with cumulative distribution functions F1(t), . . . , Fp(t) that

are completely arbitrary. The corresponding probability density functions are f1(t), . . . , fp(t).

Let F(t) = {F1(t), . . . , Fp(t)}T and f(t) = {f1(t), . . . , fp(t)}T. Assume that the ith (i =

1, . . . , n) subject belongs to one of the p populations, and the probability of this observation

belonging to the kth population is qik for k = 1, . . . , p. Thus, we can write the ith observation

as (qi, Si), where qi = (qi1, . . . , qip)T, and Si is a random event time. Note that

∑pk=1 qik = 1.

In all the studies of interest, qi only takes m <∞ different vector values which we denote by

3

u1, . . . ,um, and we assume that there are rj observations corresponding to each of the uj’s

for j = 1, . . . ,m. Obviously∑m

j=1 rj = n. Assume further that the observations are subject

to censoring at Ci’s, and the censoring times are independent of the survival times. Note

that we also allow the situation that the censoring distribution has smaller support than the

support of the event times. In summary, an observation subject to censoring can be written

as (qi, Yi, δi), where Yi = min(Ci, Si) and δi = I(Si ≤ Ci). The observations are assumed to

be independent and ordered so that Y1 < Y2 · · · < Yn−1 < Yn. Our interest lies in estimating

the p distribution functions F1(t), . . . , Fp(t) and making inference.

To illustrate these notations using the studies we introduced in Section 1, note that

Si can be age-at-onset of an event (e.g., time-to-onset of HD). For the HD study, qik is the

probabilities of the ith relative carrying the kth genotype at the HD-gene given the proband’s

genotype status, and Fk(t) is the distribution function of Si’s within the subjects with the kth

genotype. An autosomal dominant disease yields p = 2, and additive genetic models yield

p = 3. Components of F(t) corresponding to certain genotypes are also called penetrance

functions of a mutation, that is, the probability of developing a disease by certain age for

homozygous or heterozygous mutation carriers. In the QTL studies, qik is the probability of

a subject carrying the kth genotype at the QTL given genotypes at the flanking markers.

The dimension p depends on the experimental design, for example, p = 2 for a backcross

experiment and p = 3 for an intercross experiment. In either situation, since genotypes may

not be observed, the distribution of Si is a mixture of F1, . . . , Fp, i.e., qTi F.

Taking advantage of the finiteness of m, we propose to first estimate distribution of the

outcomes in each of the m fixed mixture groups, and then use a familiar weighted least square

to retrieve the distribution F. Specifically, denote Hj(t) = uTj F(t) for j = 1, . . . ,m, and let

H(t) = {H1(t), . . . , Hm(t)}T. Obviously, Hj(t) is a valid cumulative distribution function,

and can be estimated using all observations with qi = uj for i = 1, . . . , n. For convenience,

the collection of observations with qi = uj is denoted as (Yji, δji) for i = 1, . . . , rj, and we

also assume they are ordered so that Yj1 < Yj2 < · · · < Yjrj−1 < Yjrj for all j = 1, . . . ,m.

4

Denote an estimated distribution function as Hj(t) and let H(t) = {H1(t), . . . , Hm(t)}T.

Denote the matrix U = (u1, . . . ,um). From H(t) = UTF(t), we easily obtain a WLS family

of estimators,

F(t) = (UWUT)−1UWH(t), (1)

where W is a m×m weight matrix.

2.1 The proposed estimator and its inference

Within the WLS family, we propose to use a diagonal matrix with r1, . . . , rm as diagonal

elements in the weight matrix W and use a classical Kaplan-Meier estimator in the jth

group to obtain Hj(t) for j = 1, . . . ,m. The resulting estimator has a simple form

F(t) = (URUT)−1URH(t) =

(m∑j=1

rjujuTj

)−1{ m∑j=1

rjujHj(t)

}, (2)

where R is a diagonal matrix with the diagonal elements r1, . . . , rm. Because the Kaplan-

Meier estimator is known to be root-n consistent, we can easily obtain that the estimator

F(t) is root-n consistent. Its asymptotic variance can be estimated as

cov{F(t)

}=

(m∑j=1

rjujuTj

)−1{ m∑j=1

r2j σ2j (t)uju

Tj

}(m∑j=1

rjujuTj

)−1,

where σ2j (t) = {1− Hj(t)}2

∑Yji≤t δji/{(rj − i)(rj − i+ 1)}.

This result provides an easy way to perform hypothesis testing. For example, to test

H0 : aTF(t) = c versus H1 : aTF(t) 6= c or H1 : aTF(t) < c or H1 : aTF(t) > c for any length

p vector a and any constant c, the Wald type test statistic is

T ={aTF(t)− c

}/[aTcov

{F(t)

}a]1/2

.

5

The statistic T has a standard normal distribution under H0. When a = (1,−1, 0, . . . , 0)T

and c = 0, this corresponds to testing whether the subjects from the first and second pop-

ulation have the same distribution at t, which is a research question often encountered in

practice.

It is also possible to perform the test simultaneously at several different t values. For

example, let t = (t1, . . . , tl)T and assume t1 < · · · < tl. Denote F(t) = {F(t1), . . . ,F(tl)}

and let c be a length l vector. Suppose one wishes to test H0 : aTF(t) = cT versus H1 :

aTF(t) 6= cT. Denote the Kaplan-Meier estimator Hj(t) = {Hj(t1), . . . , Hj(tl)}, and its

variance-covariance matrix as Vj(t). Using the asymptotic properties of the Kaplan-Meier

estimator, we know that Vj(t) can be estimated by Vj(t), where the (a, b)th entry is Vj,a,b =

{1− Hj(ta)}{1− Hj(tb)}∑

Yji≤ta δji/{(rj − i)(rj − i+ 1)} for any 1 ≤ a ≤ b ≤ l. Thus, one

can form the test statistic

T = {aTF(t)− cT}

[m∑j=1

{aT(URUT)−1URej}2Vj

]−1{aTF(t)− cT}T,

where ej is a length-m vector with one on the jth entry and zero elsewhere. Under H0,

T has a chi-square distribution with degrees of freedom l. A useful case in practice is

when a = (1,−1, 0, . . . , 0)T and c = 0. This corresponds to testing if the first and second

populations have the same distribution simultaneously at all values in the vector t.

Testing H0 : aTF(t) = c(t) at all t values is also possible, where c(t) is an arbitrary

deterministic function of t. From Breslow and Crowley (1974), r1/2j {Hj(t)−Hj(t)} converges

weakly to a Gaussian process for j = 1, . . . ,m with mean zero and an explicit covariance

function. Thus R(t) = aTF(t) − c(t) as a linear combination of the Hj(t)’s also has the

similar property of converging weakly to a Gaussian process. One can form test statistic

such as Kolmogorov-Smirnov type statistic supt∈[0,τ ]R(t) (Fleming et al. 1980) or∫ τ0R(t)dt

(Pepe and Fleming 1989) and derive their asymptotic null distributions.

However, the asymptotic distributions might not always be suitable to use in practice.

6

One reason is that the approximation at the large value of t can be quite imprecise. The

second reason is that the above asymptotic results are valid only in the region Hj(t) < 1.

In practice, some of the populations might have a smaller support than others. Hence,

depending on the uj values, for the same t value, some Hj(t) might be smaller than one

while others might be one. This creates complications in practice, especially because it is

often not known which Hj(t) has what support. The third reason is that only when rj is

large, the asymptotic expression will be a close approximation. However, in practice, some

of the rj values can be quite small. Due to these reasons, we propose to use an alternative

permutation approach when the asymptotic results are not suitable.

When p = 2, a test of scientific interest is whether there is a difference between distribu-

tions of mutation carriers and non-carriers, i.e., H0 : F1(t) = F2(t), either at a finite set of

t values or for the entire range. A permutation strategy can be used (Churchill and Doerge

1994) in this case. Specifically, we permute the (Yi, δi) pairs and couple them with q1, . . . ,qn

values to create a permuted sample, and use the estimator in (2) to obtain a new estimate

of F(t) and a permuted test statistic F1(t) − F2(t). Repeat this process sufficiently large

number of times to obtain the empirical distribution of F1(t) − F2(t) under H0. Note that

to perform the same test when p > 2, the permutation needs to be carried out separately

within each subsample sharing the same (qi3, . . . , qip)T value.

In the following, we further explore the WLS family of estimators in (1) and provide a

justification for our recommendation in (2). We also show that the two existing methods

NPMLE1 and IND are non-ideal members of the WLS family.

2.2 Choice of group estimation

We first study the choice of methods to estimate Hj(t) for j = 1, . . . ,m in (1). It is easy

to see that Hj(t) is the distribution function of Si’s for the collection of observations that

satisfies qi = uj. Thus, estimation within the uj group is a classical problem of estimating

distribution function with randomly censored data. The familiar Kaplan-Meier estimator is

7

known to be the maximum likelihood estimator in this setting (Kaplan and Meier, 1958)

hence provides the most efficient estimate for each Hj(t). Thus this is the optimal choice.

An additional advantage is that other than the independent censoring assumption, no extra

requirement needs to be imposed on the relationship between the censoring process and the

event process for the Kaplan-Meier estimator to be valid.

NPMLE1 in Wacholder et al. (1998) proceeds by performing a nonparametric maximum

likelihood in each of the m groups, and recover F(t) via F(t) = (UUT)−1UH(t). Hence it

is a member of the WLS family. It makes the good choice of using Kaplan-Meier estimation

in estimating Hj(t).

IND proposed in Fine et al. (2004) makes a different choice in estimating Hj(t), and then

recovers F(t) via F(t) = (URUT)−1URH(t). Hence it is also a member of the WLS family.

To estimate H(t), IND exploits the independence of the event process and the censoring

process, and uses the relation Pr(Yi > t) = Pr(Si > t) Pr(Ci > t). The IND estimates Hj(t)

through

Hj(t) = 1− 1

G(t)

{1∑n

i=1 I(qi = uj)

n∑i=1

I(qi = uj)I(Yi ≥ t)

},

where G(t) is a Kaplan-Meier estimate of the survival function G(t) = Pr(Ci > t) of the

censoring process. This method originates from Ying et al. (1995). However, it has several

limitations comparing to a direct Kaplan-Meier estimator of Hj(t). First, the method can

only be used in the region G(t) > 0. Therefore in the situations where the censoring process

has a smaller support than the event process, and if t is larger than the upper limit of the

possible censoring time, the method ceases to be valid. This is the case with the Huntington

study data. Second, it is less efficient than the maximum likelihood estimation, which is

reflected in our simulation results. Third, it is not easy to obtain the variance estimation of

the IND. Finally, although a Kaplan-Meier estimation is avoided in the estimation related

to the event process, it is still used in the estimation of the censoring process. Hence it does

8

not provide computational advantage either.

2.3 Choice of Weights

Because for different j values, Hj(t)’s are estimated using distinct observations, H1(t), . . . , Hm(t)

are mutually independent. Thus, the optimal weight matrix W should be diagonal. Let the

diagonal elements of W be w1, . . . , wm. The estimation variance of the WLS family in (1) is

cov{F (t)

}=

(m∑j=1

wjujuTj

)−1{ m∑j=1

w2jσ

2j (t)uju

Tj

}(m∑j=1

wjujuTj

)−1,

where σ2j (t) is the variance of the estimator Hj(t). Thus, theoretically, by letting wj =

1/σ2j (t), we would obtain the optimal weights in terms of estimation efficiency within the

WLS family.

Although this is the optimal weighting strategy in theory, in practice, we observe that it is

often sub-optimal. We provide several explanations. First, σ2j (t) is not known and can only be

estimated in practice. Although asymptotically this estimation itself does not cause efficiency

loss for any WLS estimator, it creates numerical instability in finite samples. This instability

can be especially harmful when some groups contain very few observations, because the

estimation of σ2j (t) can be noisy. Second, occasionally, it may happen that in one of the

groups, say the j0th group, the last observation is not censored and its observed event time

Srj0 is smaller than t, which is the time at which we are interested in estimating Hj0(t). In this

case, the Kaplan-Meier estimator yields Hj0(t) = 1, and the estimated variance σ2j0

(t) = 0.

Although this can be handled numerically either by assigning an upper limit on the weight

wj0 or by solving a constrained least square problem instead of directly implementing WLS,

the numerical instability caused by this phenomenon still persists. Intrinsically, this is caused

by the fact that we cannot assess σ2j0

(t) sufficiently well at the upper limit of the data. For

example, σ2j0

(t) = 0 might be a suitable estimation of the variability of Hj0(t) if the event

process indeed has the support to the left of t, and it might not be by chance that all the

9

observations are to the left of t.

Since Ma and Wang (2012) observed that in the absence of censoring, equally weighting

each observation and weighting each observation by its inverse-variance exhibited very little

difference in terms of estimation efficiency, we propose to simply assign equal weights to each

observation in the same uj group. This results in the weight choice of wj = rj in (2), which is

a direct result from the fact that if each observation has a weight of one, then the group with

rj observations gets a weight of rj. This weighting strategy is simple and extremely stable in

computation. In the simulations in Section 3, we did not find any better weighting scheme

other than this simple choice, even including the theoretically optimal weights calculated

using the true Hj(t) functions.

Inspecting the weighting choice of IND, we find that it matches exactly with our pro-

posal. NPMLE1 on the other hand makes the choice of assigning wj = 1 for j = 1, . . . ,m.

This choice unnecessarily downweights the observations belonging to the larger groups, and

consequently diminishes the advantage of the more accurately estimated Hj(t)’s. In practice,

we find this choice leads to substantial efficiency loss. In addition, it can also be vulnerable

to some degenerated groups. For example, when a group contains only one observation, the

Kaplan-Meier estimation in this group is certainly not reliable, yet this estimation is allowed

to enter the final estimator of F(t) with the same importance as the other estimates which

can greatly influence the final result.

2.4 Link with imputation methods

An alternative method of treating censoring in general is imputation, and is related to the

self-consistent estimator (Efron 1967). In our context, the idea is the following.

Suppose with full data, we have a consistent estimating equation

0 =n∑i=1

φ{I(Si ≤ t),qi,F(t)}. (3)

10

Under censoring, if Si is observed, then we can use the ith observation as it is in (3). If

Si is censored by Ci, then two situations can occur. If Ci > t, then it is certain that

Si > t as well. Hence we can safely replace I(Si ≤ t) by zero in the ith observation in

(3). If Ci ≤ t, then Si can be in (Ci, t] or in (t,∞). Given censoring, the probability of

Si ∈ (Ci, t] is {qTi F(t) − qT

i F(Ci)}/{1 − qTi F(Ci)}, while the probability of Si ∈ (t,∞) is

{1− qTi F(t)}/{1− qT

i F(Ci)}. Thus we can replace the ith term in (3) by two terms

qTi F(t)− qT

i F(Ci)

1− qTi F(Ci)

φ{1,qi,F(t)}+1− qT

i F(t)

1− qTi F(Ci)

φ{0,qi,F(t)}.

Of course, qTi F(·) is unknown, but it is Hj(·) for qi = uj and can be estimated using any of

the previous mentioned estimators. In summary, the final estimating equation is

0 =n∑i=1

δiφ{I(Si ≤ t),qi,F(t)}+n∑i=1

(1− δi)I(Ci > t)φ{0,qi,F(t)}

+n∑i=1

(1− δi)I(Ci ≤ t)

[qTi F(t)− qT

i F(Ci)

1− qTi F(Ci)

φ{1,qi,F(t)}+1− qT

i F(t)

1− qTi F(Ci)

φ{0,qi,F(t)}

],

where we write Hj(Ci) = qTi F(Ci) and Hj(t) = qT

i F(t) if qi = uj.

In fact, the only known class of consistent estimating equations of the form in (3) is

φ{I(Si ≤ t),qi,F(t)} = ωiqiI(Si ≤ t) − ωiqiqTi F(t) (Ma and Wang 2012). This yields the

estimating equation

n∑i=1

{δiωiqiI(Si ≤ t) + (1− δi)I(Ci ≤ t)

qTi F(t)− qT

i F(Ci)

1− qTi F(Ci)

ωiqi − ωiqiqTi F(t)

}= 0.

In the Appendix, we show that this is a member of the WLS family (1).

11

3 Simulation study

We perform a simulation study to illustrate the finite sample performance of several estima-

tion and inference procedures discussed above. We generate a total of 1000 repetitions, each

with the sample size n = 1000. The data is generated from a mixture of p = 2 different popu-

lations, with m = 3 different mixing probability vectors. The first two mixing groups contain

approximately 40% and 5% of the observations, and the remaining 55% observations are in

the third mixing group. The true population distributions are both truncated exponential,

with support [0, 10] and [0, 5] respectively. We generate censoring times from a uniform

distribution between 0 and 3.9. This results in an approximately 50% censoring rate. We

perform both estimation and testing under this simulation design. When investigating the

type I error rate under H0, we set both distributions to be the same truncated exponential

with support [0, 5], while keeping everything else unchanged. This results in about 40%

censoring rate.

We implemented our proposed WLS method, as well as the existing methods: IND,

NPMLE1 and NPMLE2. For illustration, we also provided the oracle WLS method, where

the weights are calculated using the true distribution functions. The estimation results at

three representative time points are provided in Table 1, where they are respectively at the

beginning, middle and end of the range of the observed time points. The estimation results

for the entire curves are depicted in Figure 1. It is seen that NPMLE2 shows a very large bias

in comparison to all the other methods, while both IND and NPMLE1 have larger estimation

variability in comparison to the proposed WLS method. For example, at t = 1.95, the bias

of NPMLE2 is about 25 to 30 times larger than the other three consistent estimators (WLS,

IND and NPMLE1), while the empirical standard errors of IND and NPLME1 are 30% and

29% larger than the WLS, respectively. The efficiency gain is more notable towards the

higher end of the t values. When t = 2.92, the improvement in empirical standard error of

the proposed WLS over IND and NPMLE1 is 38% and 30%, respectively. The WLS has very

small biases, and its estimation variance is about the same as the oracle WLS (the difference

12

is ≤ 2%).

We also examined several tests based on the proposed WLS estimator. We report es-

timation and the single, multiple points and curve testing results in Table 2 and Table 3.

Table 2 shows the finite sample bias of the estimated cumulative distribution functions, their

empirical standard errors, estimated standard errors and 95% confidence interval coverage

at the three representative time points under the null and the alternative. It is seen that the

estimation biases are small, the estimation standard errors are well estimated and the 95%

confidence interval coverages are close to their nominal level. For the single and multiple time

points testing, we used the test statistics proposed in section 2.1 and their asymptotic null

distributions to compute p values. For testing the entire difference between two distribution

functions, we used the test statistic supt |F1(t) − F2(t)| and performed 1000 permutations

to compute its p value. It is seen from Table 3 that the type I error rates of all three tests

adhere to their nominal levels. In addition, the power of the three tests is comparable.

4 Analysis of COHORT data

We applied the methods discussed in previous sections to the COHORT HD data. HD

is an autosomal dominant neurodegenerative disease caused by expansion of CAG repeats

(Huntingtons Disease Collaborative Research Group, 1993). Subjects with a CAG repeats

length ≥36 are considered to be HD mutation positive and have highly elevated risk of

developing the disease, while CAG subjects with repeats length< 36 do not develop HD

(Rubinsztein et al. 1996; Nance et al. 1998; Walker 2007). There were 4735 subjects in

the analysis, with m = 6 distinct values of a subject’s probability of carrying the HD gene

mutation. It is known that the Cox proportional hazards model may not be suitable for

modeling age-at-onset of HD and a parametric model with six parameters was proposed

(Langbehn et al. 2004).

We take a nonparametric approach to protect against misspecification. We implemented

13

the proposed WLS as well as consistent estimators in the existing literature (IND and

NPMLE1), and the results are shown in Figure 2. IND provides highly non-smooth es-

timates for the carrier group at several ages (30 and 90) and has an estimate much larger

than one at older ages. It also provided some positive estimates for the non-carrier group,

which is inconsistent with clinical knowledge since subjects without HD mutation do not

develop HD (Rubinsztein et al. 1996; Nance et al. 1998; Walker 2007). The performance

problems for IND are encountered because in some of the m subgroups, the sample size is

small. Thus estimates within these groups could result in large variability, and they subse-

quently adversely influence the overall estimates when combining subgroups to form IND.

NPMLE1 provides an estimated cumulative risk of below 40% at age 80, which may be too

low comparing with the existing clinical literature. Finally, the proposed WLS estimates the

cumulative risk of HD to be 33.9% (95% CI: [32.0%, 35.8%]) by age 40 and 74.5% (95% CI:

[73.9%, 76.0%]) by age 80.

Next, we estimate the disease distribution functions stratified by gender. Figure 3

presents the estimated distribution functions for males and females. We presented the same

four estimators as in the overall analysis and similar conclusions for these estimators can be

drawn. The proposed WLS provides the most reasonable estimates. It suggests that females

might have a slightly elevated risk than males across a wide range of ages. We performed a

permutation test of the difference between the entire distribution curve of female and male

carriers as introduced in Section 2.1, and obtained a p-value of 0.083, which suggests some

evidence of a gender effect.

To further investigate the potential gender difference, we estimate the distribution func-

tion stratified by both gender and whether a subject reported an affected father or affected

mother at the time of the family history interview (Figure 4). We observe that female car-

riers with an affected father had a slightly higher risk than female carriers with an affect

mother across a wide range of age. In contrast, male carriers with an affected father had

a similar risk comparing to male carriers with an affected mother until age 60, and after

14

age 60 the risk in the former is slightly higher. The test comparing the difference between

female carriers with affected father to female carriers with affected mother had a p-value of

0.096. These results are consistent with a potential anticipation effect: it is observed that

a male could transmit an expanded CAG repeats to his offspring which may increase the

likelihood of an earlier age-at-onset in the offspring (Ranen et al. 1995; Wexler et al. 2004).

Our analysis suggests that the anticipation effect may manifest in female offsprings across

a wide range of age, while for male offsprings, the anticipation effect did not manifest until

about age 60. This observation has not been noted in the clinical literature and may worth

further investigation as a potential new medical hypothesis. Finally, a test comparing female

or male carriers who reported an affected parent (either mother or father) to female or male

carriers who did not report any affected parent at the time of interview is significant with a

p-value < 0.01.

The estimated risk curve can also serve as a prospective accuracy measure of HD mutation

test. Let Y = 1 denote a positive mutation test and Y = 0 a negative mutation test. The

first component of F(t) is the cumulative risk for carriers, i.e., F1(t) = pr(T ≤ t|Y = 1).

This quantity is referred as the positive predictive value in the diagnostic testing literature.

Unlike retrospective accuracy measures such as sensitivity and specificity which are studied

extensively, prospective predictive measures (especially for time-to-event outcomes) are less

explored. Here the estimated curves can be used to predict a subject’s risk of HD given his

or her mutation test results and other demographic information. For example, from Figure

4, a female subject who has a positive HD mutation and report an affected father has about

65% chance of developing HD by age 50. These curves can also be used to predict conditional

probabilities of developing HD in the next few years given the current age of a subject.

15

5 Discussion

Here we provide a general WLS family to estimate the distribution functions of several

populations when the observations are from a mixture of these populations and are sub-

ject to right censoring. Existing consistent nonparametric estimators in these problems are

NPMLE1 and IND, and they are shown to be nonideal members of this family. We have

further proposed a practically optimal member of the WLS family. It is easy to see that when

there is no censoring, the proposed WLS estimator is identical to the IND. However, when

there is censoring we demonstrate that the propose estimator has superior performance and

computational stability comparing to both IND and NPMLE1. In addition, the proposed

estimator is extremely easy to implement and its asymptotic properties are also easily estab-

lished. We illustrate the methods and their applications to perform risk prediction through

an application to the COHORT study.

Finally, it is worth pointing out that in addition to the WLS family, an alternative estima-

tor is a maximum likelihood estimator (MLE) through self-consistent equation. Specifically,

when no censoring is present, treating F(t) as a parameter, its log-likelihood function is∑ni=1 I(Si ≤ t)log{qT

i F(t)}+∑n

i=1 I(Si > t)log{1−qTi F(t)}. Maximizing this function with

respect to F(t) will yield an estimating equation

n∑i=1

φ{I(si ≤ t),qi,F(t)} =n∑i=1

I(Si ≤ t)− qTi F(t)

qTi F(t){1− qT

i F(t)}qi = 0.

We can then use the same imputation procedure as in Section 2.4 to obtain a new estimator

for F(t) that is not within the WLS family. The MLE has not been reported in the literature

before, hence this provides another new estimator. However, when examining it in the

simulations, we find no efficiency gain over the proposed WLS estimator. In addition, since

the MLE cannot be solved explicitly, its computation requires iteration procedure such as

the Newton-Raphson method. In some occasions, the iterative computation may cause

numerical instability, and the algorithm may even fail to converge. In light of these numerical

16

performances, we suggest to use the proposed WLS method in (2).

Appendix

Illustration on imputation estimators as a member of WLS family

The imputation estimating equation

n∑i=1


qTi F(t)− qT

i F(Ci)

1− qTi F(Ci)

ωiqi − ωiqiqTi F(t)

}= 0.

can be explicitly solved to obtain

F(t) =

(n∑i=1

ωiqiqTi

)−1 n∑i=1


qTi F(t)− qT

i F(Ci)

1− qTi F(Ci)

ωiqi

}

=

[m∑j=1

ujuTj

{n∑i=1

ωiI(qi = uj)

}]−1m∑j=1

uj

n∑i=1

I(qi = uj)ωi

{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)

Hj(t)− Hj(Ci)

1− Hj(Ci)

}.

Denote

Hj(t) =n∑i=1

ωiI(qi = uj)∑ni=1 ωiI(qi = uj)

{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)

Hj(t)− Hj(Ci)

1− Hj(Ci)

}

=n∑i=1


{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)

Hj(t)−Hj(Ci)

1−Hj(Ci)

}+R.

Then

R =n∑i=1


(1− δi)I(Ci ≤ t)

{Hj(t)− Hj(Ci)

1− Hj(Ci)− Hj(t)−Hj(Ci)

1−Hj(Ci)

}

has the property that n1/2R has a normal distribution with mean zero when n → ∞ as

long as Hj(t)’s are consistent estimates of Hj(t) and are asymptotically normal. Simple

17

calculation shows that in the jth group,

E

{δiI(Si ≤ t) + (1− δi)I(Ci ≤ t)

Hj(t)−Hj(Ci)

1−Hj(Ci)

}= Hj(t).

Hence, Hj(t) is a root-n consistent estimator of Hj(t), and the imputation estimator has the

equivalent form of

F(t) =

[m∑j=1

ujuTj

{n∑i=1

ωiI(qi = uj)

}]−1 m∑j=1

uj

{n∑i=1

ωiI(qi = uj)

}Hj(t).

Viewing∑n

i=1 ωiI(qi = uj) as wj, the imputation estimator is within the WLS family (1).

References

Breslow, N. and Crowley, J. (1974). A large sample study of the life table and product limit

estimates under random censorship. Annals of Statistics, 2, 437-453.

Chatterjee, N. and Wacholder, S. (2001). “A Marginal Likelihood Approach for Estimating

Penetrance from Kin-cohort Designs”. Biometrics, 57, 245-252.

Churchill, G. A., and Doerge, R. W. (1994). Empirical threshold values for quantitative

trait mapping. Genetics, 138, 963-971.

Efron, B. (1967). “The Two-Sample ProblemWith Censored Data,” in Proceedings Fifth

Berkeley Symposium in Mathematical Statistics, Vol. IV, eds. L. Le Cam and J. Ney-

man, New York: Prentice-Hall, pp. 831-853.

Ferreira, M. E., Stagopan, J., Yandell, B. S., Williams, P.H. and Osborn, T.C. (1995).

“Mapping loci controlling vernalization requirement and flower time in Brassica napus”.

Theoretical and Applied Genetics, 90, 727-732.

Fine, J. P., Zou, F. and Yandell, B. S. (2004). Nonparametric estimation of the effects of

18

quantitative trait loci. Biometrics, 5, 501-513.

Fleming, T., O’Fallon, J., and O’Brien, P. (1980). Modified Kolmogorov-Smirnov Test

Procedures with Application to Arbitrarily Right-Censored Data, Biometrics, 36, 607-

625.

Gail, M., Pee, D., Benichou, J., and Carroll, R. (1999). Designing studies to estimate the

penetrance of an identified autosomal dominant mutation: cohort, case-control, and

genotyped-proband designs. Genetic Epidemiology, 16, 15-39.

Huntingtons Disease Collaborative Research Group. (1993). A novel gene containing a

trinucleotide repeat that is expanded and unstable on Huntingtons disease chromosomes.

Cell, 72, 971-983.

Huntington Study Group COHORT Investigators (2012). Characterization of a Large Group

of Individuals with Huntington Disease and their Relatives Enrolled in the COHORT

Study. PLoS ONE, In press.

Kaplan, E. L. and Meier, P. (1958). Nonparametric Estimation from incomplete observa-

tions. Journal of the American Statistical Association, 53, 457-481.

Lander, E. S. and Botstein, D. (1989). “Mapping Mendelian Factors Underlying Quantitative

Traits Using RFLP Linkage Maps”. Genetics, 121, 743-756.

Langbehn, D. R., Brinkman, R. R., Falush, D., Paulsen, J. S., Hayden, M. R. (2004). A new

model for prediction of the age of onset and penetrance for Huntington’s disease based

on CAG length. Clin Genet, 65, 267-277.

Lin, M. and R. L. Wu (2006). A joint model for nonparametric functional mapping of

longitudinal trajectories and time-to-events. BMC Bioinformatics 7(1), 138.

Ma, Y. and Wang, Y. (2012). Efficient Distribution Estimation for Data with Unobserved

Sub-population Identifiers. Electronic Journal of Statistics. 6, 710-737.

Marder K., Levy, G., Louis, E.D., Mejia-Santana, H., Cote, L.,Andrews, H.,Harris, J.,Waters,

19

C., Ford, B., Frucht, S., Fahn, S. and Ottman, R. (2003). Accuracy of family history

data on Parkinson’s disease. Neurology, 61, 18-23.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.

Moore, D., Chatterjee, N., Pee, D., and Gail, M. (2001). Pseudo-likelihood estimates of

the cumulative risk of an autosomal dominant disease from a kin-cohort study. Genet.

Epidemiol. 20, 210-227.

Nance M. A., Seltzer W, Ashizawa T, Bennett R, McIntosh N, Myers RH, Potter NT, Shea

DK. ACMG/ASHG Statement (1998). Laboratory guidelines for Huntington disease

genetic testing. Am J Hum Genet, 62, 1243-1247.

Pepe, M., and Fleming T. (1989). “Weighted Kaplan-Meier Statistics: A Class of Distance

Tests for Censored Survival Data.” Biometrics, 45, 497-507.

Ranen, N. G., Stine O. C., and Abbott M. H. et al. (1995), Anticipation and instability

of IT-15 (CAG)n repeats in parent-offspring pairs with Huntington disease, Am J Hum

Genet, 57, 593-602.

Rubinsztein, D. C., Leggo, J., Coles, R., and Almqvist, E. et al. (1996). Phenotypic charac-

terization of individuals with 30-40 CAG repeats in the Huntington disease (HD) gene

reveals HD cases with 36 repeats and apparently normal elderly individuals with 36-39

repeats. Am J Hum Genet, 59, 16-22.

The Huntington’s Disease Collaborative Research Group (1993). “A novel gene containing

a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromo-

somes”. Cell, 72, 971-983.

Wacholder, S., Hartge, P., Struewing, J., Pee, D., McAdams, M., Brody, L. and Tucker,

M. (1998). “The Kin-cohort Study for Estimating Penetrance”. American Journal of

Epidemiology, 148, 623-630.

Walker, F. O. (2007). “Huntington’s disease”. Lancet, 369, 218-228.

20

Wang, Y., Clark, L.N., Louis, E.D., Mejia-Santana, H., Harris, J., Cote, L.J., Waters, C.,

Andrews, D., Ford, B., Frucht, S., Fahn, S., Ottman, R., Rabinowitz, D. and Marder,

K (2008). Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using

the kin-cohort method. Arch Neurol. 65(4):467-474.PMID: 18413468.

Wexler, N. S., Lorimer, J., Porter, J., Gomez, F., Moskowitz, C. et al. (2004). Venezuelan

kindreds reveal that genetic and environmental factors modulate Huntington’s disease

age of onset, Proc Natl Acad Sci, 101, 3498-3503.

Wu, R., Ma, C., and Casella, G. (2007). S tatistical Genetics of Quantitative Traits: Linkage,

Maps, and QTL. New York: Springer.

Ying, Z. Jung, S. H. and Wei, L. J. (1995). Survival Analysis with median regression models.

Journal of the American Statistical Association, 90, 178-184.

21

Table 1: Simulation study. Estimation bias and variance for different estimators.

Method bias(F1) SD(F1) bias(F2) SD(F2)t = 0.9750,F(t) = (0.2357, 0.4203)T

Oracle -0.0010 0.0229 -0.0009 0.0229WLS 0.0004 0.0229 0.0004 0.0229IND -0.0000 0.0289 0.0009 0.0289

NPMLE1 0.0002 0.0310 0.0002 0.0310NPMLE2 -0.0222 0.0200 0.0136 0.0200

t = 1.9500,F(t) = (0.4203, 0.6785)T

Oracle -0.0034 0.0297 -0.0008 0.0297WLS -0.0014 0.0298 0.0005 0.0298IND -0.0013 0.0387 0.0004 0.0387

NPMLE1 -0.0018 0.0383 0.0001 0.0383NPMLE2 -0.0418 0.0268 0.0251 0.0268

t = 2.9250,F(t) = (0.5651, 0.8371)T

Oracle -0.0048 0.0349 -0.0019 0.0349WLS -0.0007 0.0347 -0.0001 0.0347IND 0.0003 0.0479 -0.0012 0.0479

NPMLE1 -0.0016 0.0444 -0.0010 0.0444NPMLE2 -0.0551 0.0324 0.0337 0.0324

22

Table 2: Simulation study. Estimation bias (bias), empirical standard error (sd), estimated

standard error (sd) and 95% confidence.

t F(t) bias(F ) sd(F ) sd(F ) 95% CIUnder H0 : F1(t) = F2(t)

0.9750 0.4203 -0.0010 0.0264 0.0262 0.93600.4203 0.0004 0.0300 0.0291 0.9400

1.9500 0.6785 -0.0002 0.0287 0.0280 0.94400.6785 0.0003 0.0308 0.0310 0.9470

2.9250 0.8371 -0.0002 0.0276 0.0276 0.95200.8371 -0.0010 0.0308 0.0306 0.9410

Under H1 : F1(t) 6= F2(t)0.9750 0.2357 0.0004 0.0229 0.0225 0.9500

0.4203 0.0004 0.0288 0.0284 0.94601.9500 0.4203 -0.0014 0.0298 0.0290 0.9380

0.6785 0.0005 0.0327 0.0320 0.94902.9250 0.5651 -0.0007 0.0347 0.0348 0.9500

0.8371 -0.0001 0.0336 0.0338 0.9480

Table 3: Simulation study. Empirical rejection rates for the single, muliple and curve testingat various nominal levels. Multiple t is the result of testing F1(t) = F2(t) at the three listedt values simultaneously. Curve is testing the curve F1(t) = F2(t).

t-value 0.01 0.05 0.1 0.2Under H0 : F1(t) = F2(t)

0.9750 0.0120 0.0470 0.1050 0.20601.9500 0.0130 0.0610 0.0970 0.18302.9250 0.0080 0.0510 0.0970 0.2050

Multiple t 0.0090 0.0540 0.1070 0.2080All t 0.0150 0.0620 0.1070 0.2090

Under H1 : F1(t) 6= F2(t)0.9750 0.9790 0.9950 0.9980 0.99801.9500 0.9950 0.9990 1.0000 1.00002.9250 0.9920 0.9970 0.9990 1.0000

Multiple t 0.9990 1.0000 1.0000 1.0000All t 0.9780 0.9970 0.9990 0.9990

23

Figure 1: Simulation study. True CDF (solid) and the mean (dashed), 95% pointwise con-fidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. TheOracle (left) are WLS (right) are plotted on the top row and are almost identical and havesuperior performance. IND (left), NPMLE1 (middle) and NPMLE2 (right) have either havewider bands or biased and are on the bottom row.

0 0.5 1 1.5 2 2.5 3 3.5

0

0.2

0.4

0.6

0.8

1

0 0.5 1 1.5 2 2.5 3 3.5

0

0.2

0.4

0.6

0.8

1

0 0.5 1 1.5 2 2.5 3 3.5

0

0.2

0.4

0.6

0.8

1

0 0.5 1 1.5 2 2.5 3 3.5

0

0.2

0.4

0.6

0.8

1

0 0.5 1 1.5 2 2.5 3 3.5

0

0.2

0.4

0.6

0.8

1

24

Figure 2: COHORT study. The cumulative rate of disease onset based on WLS (upper-left),IND (upper-right), and NPMLE1 (lower).

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

25

Figure 3: COHORT study. The cumulative risk of disease onset based on WLS (upper-left),IND (upper-right), and NPMLE1 (lower) stratified by gender (black: female, red: male).

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

0 10 20 30 40 50 60 70 80 90 1000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Age

Cum

ula

tive r

isk o

f H

D

26

Figure 4: COHORT study. The cumulative risk of disease onset stratified by gender andreport of affected father (black), affected mother (blue) or none (red) at the time of familyhistory interview (based on WLS).

10 20 30 40 50 60 70 80 900

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Age

Cum

ula

tive r

isk o

f H

D

Male

father

mother

non

10 20 30 40 50 60 70 80 900

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Age

Cum

ula

tive r

isk o

f H

D

Female

father

mother

non

27

Documents

Estimating Disease Onset Distribution Functions from ... · of problems. We identify a computationally e ortless estimator in the family, study ... it is of interest to estimate distribution