Weighted Kaplan-Meier Estimator for Different Sampling Methods
A PROJECT
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Weitong Yin
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE
Dr. Steven Chiou and Dr. Barry James
August, 2015
© Weitong Yin 2015
ALL RIGHTS RESERVED
Acknowledgements
First of all, I would like to extend my sincere gratitude to my advisors, Dr. Barry James and Dr. Steven Chiou, for their support of my coursework, this project, and my future studies over the past two years. I would also like to give my sincere thanks to Dr. Kang James for serving as my committee member. During her seminar classes, she offered me important ideas that contributed greatly to this project. Furthermore, I would like to thank my parents for their spiritual support during my studies in the United States.
Dedication
To those who held me up over the years
Abstract
The Kaplan-Meier (KM) estimator is a powerful approach for estimating survival functions. Under some specific sampling methods, however, bias may affect the accuracy of the estimated survival functions. In this paper we are interested in weighted KM estimators for different sampling methods, and we propose weighted KM estimators for case-cohort sampling and for stratified sampling. To evaluate the effectiveness of these weighted KM estimators, the bootstrap is applied to analyze the variance of the estimates. Simulation results are provided that compare the estimated survival curves with survival curves based on the Weibull distribution, from which the data are randomly generated.
Keywords: Kaplan-Meier estimator, Case-cohort sampling, Stratified sampling, Weighted Kaplan-Meier estimator, Bootstrap.
Contents
Acknowledgements
Dedication
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Kaplan-Meier Estimator
1.2 Right-Censoring
1.3 Sampling
1.3.1 Simple Random Sampling
1.3.2 Systematic Sampling
1.3.3 Case-Cohort Sampling
1.3.4 Stratified Random Sampling
2 Methodology
2.1 Introduction: Maximum Likelihood Estimation
2.1.1 Parametric Maximum Likelihood Estimation
2.1.2 Nonparametric Maximum Likelihood Estimation
2.2 Weighted Kaplan-Meier Estimator
2.2.1 Kaplan-Meier Estimator for Case-Cohort Sampling
2.2.2 Kaplan-Meier Estimator for Stratified Random Sampling
3 Analysis
3.1 Parametric Approach: Greenwood's Formula
3.2 Nonparametric Approach: Bootstrap
4 Simulation
4.1 Data Set
4.2 Weighted Kaplan-Meier Estimators
4.2.1 Kaplan-Meier Estimator with Case-Cohort Weights
4.2.2 Kaplan-Meier Estimator with Stratified Sampling Weights
4.3 Bootstrap
5 Conclusion
References
Appendix A. R Codes
List of Tables
4.1 Summary for Survival Function Sample
List of Figures
4.1 Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.2 Weighted KM Estimator (Stratified Sampling by Time) with Different Censoring Rates
4.3 Weighted KM Estimator (Stratified Sampling by Censoring Status) with Different Censoring Rates
4.4 Bootstrap 95% C.I. of Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.5 Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Censoring Status) with Different Censoring Rates
4.6 Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Time) with Different Censoring Rates
Chapter 1
Introduction
1.1 Kaplan-Meier Estimator
Survival analysis is the analysis of the time until one or more events happen. Usually, it deals with finding survival functions, which model time-to-event data. The "event" typically refers to the occurrence of some event of interest, such as "death" or "failure". Our object of primary interest, the survival function, is defined as S(t) = Pr(T > t)[1], where t is a specified time, T denotes the random "event" time, and Pr stands for probability. In words, the survival function is the probability that the event occurs later than the specified time t. Usually we assume S(0) = 1, meaning that no event has happened at the starting time. As t increases, S(t) decreases toward 0, which corresponds to no survivors.
We consider estimators of the survival function that make no prior assumptions about the shapes of the relevant functions; this is commonly called the nonparametric approach to survival analysis. Because a nonparametric analysis gives some information about the pattern of the distribution, it can also help us choose a proper parametric model. The Kaplan-Meier estimator, also known as the product-limit estimator, is the most commonly used nonparametric method, and it allows for censoring.
1.2 Right-Censoring
Right-censoring is common in survival analysis. For example, in a medical study, some patients may withdraw before we observe the event of interest. As a result, we only observe the time at which those patients left, and never the time at which the event would have occurred for them.
Considering the case without censoring, let S(t) be the probability that a given subject survives through time t. For a sample of size N from this population, let the observed times until death of the N sample members be t_1 \le t_2 \le t_3 \le \dots \le t_N. Corresponding to each t_i are n_i, the number "at risk" just prior to time t_i, and d_i, the number of events that happened at time t_i.
The simplest case is to estimate the survival function from the whole population without any censoring. The estimator can then be expressed in product form:

S(t) = \prod_{t_i < t} \frac{n_i - d_i}{n_i}. \quad (1.1)
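To make the product form (1.1) concrete, here is a minimal sketch in Python (the thesis's own code is in R; `km_estimate` is an illustrative helper, not part of the thesis):

```python
# Illustrative sketch of the product form (1.1): at each event time before t,
# multiply in the factor (n_i - d_i) / n_i; censored subjects simply leave the
# risk set without contributing a factor.
def km_estimate(times, events, t):
    pairs = sorted(zip(times, events))
    n = len(pairs)                 # everyone is at risk initially
    s = 1.0
    for i, (ti, ev) in enumerate(pairs):
        if ti >= t:
            break
        at_risk = n - i            # subjects with observed time >= ti
        if ev:                     # an event: one death out of at_risk subjects
            s *= (at_risk - 1) / at_risk
    return s

times = [1, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0]           # 1 = event observed, 0 = censored
print(km_estimate(times, events, 3.5))
```

Here two events occur before t = 3.5 among five subjects, giving S = (4/5)(3/4) = 0.6.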
This estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association. It is therefore also called the KM estimator, as mentioned at the beginning of this article.
However, the classical KM estimator cannot always meet our needs in real-world applications. For example, if it is too expensive or too time consuming to collect and analyze data on all subjects, we have to take a sample from the whole population. Naively using the observed sample data can lead to misleading results, because the statistical information in the sample is often biased in a way that depends strongly on the sampling method. Consider a sample that includes all the subjects who experienced the event but only some of those who did not; the classical KM estimator would then underestimate the survival function. To solve this problem, we have to adjust for the bias by incorporating a corresponding "weight", which depends on the sampling method applied. The modified estimator is called the weighted KM estimator. In particular, these estimators also cover the case with censoring.
1.3 Sampling
Sampling is the process of selecting units (e.g., people, organizations) from a population
of interest so that by studying the sample we may fairly generalize our results back to
the population from which they were chosen.
1.3.1 Simple Random Sampling
Simple random sampling is one of the basic forms of sampling. There are two ways
of taking samples: without replacement and with replacement. In the case with re-
placement, the same subject may be withdrawn more than once. In the case without
replacement, all subjects in the sample are distinct. Simple random sampling from a
population of N subjects is withdrawing independent samples of size 1. Obviously, each
subject is randomly withdrawn with same probability 1N with replacement. Since in
most cases, sampling the same subject doesn’t give us any additional information, we
usually take simple sampling without replacement. We withdraw a sample with size n
from the population. Each subset of n distinct subjects in the population has the same
probability of being selected. A sample with size n is randomly selected with probabilityn!(N−n)!
N ! . Thus, under this circumstance, each subject in the population is selected with
probability nN [2].
1.3.2 Systematic Sampling
Systematic sampling is a sampling method involving the selection of elements from
an ordered sampling frame. In this approach, progression through the list is treated
circularly, with a return to the top once the end of the list is passed. The sampling
starts by selecting a subject from the list at random and then every kth subject in the
frame is selected, where k is the sampling interval: this is calculated as: k = Nn where
n is the preassigned sample size, and N is the population size[2].
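As an illustration (a sketch, not taken from the thesis), circular systematic sampling with interval k = N/n can be written as:

```python
import random

# Illustrative circular systematic sampling: a random start, then every k-th
# element of the ordered frame, wrapping around the end of the list.
def systematic_sample(frame, n):
    N = len(frame)
    k = N // n                       # sampling interval k = N / n
    start = random.randrange(N)      # random starting subject
    return [frame[(start + i * k) % N] for i in range(n)]

random.seed(1)
sample = systematic_sample(list(range(100)), 10)
print(sample)                        # 10 subjects spaced k = 10 apart
```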
1.3.3 Case-Cohort Sampling
The case-cohort design (Prentice 1986) is an effective and economical design that originated to allow efficient analysis of studies where it is too expensive and time consuming
to collect and analyze data on all subjects. "Cases" refers to those subjects who have experienced a specific event; "controls" refers to those subjects who have not. Case-cohort random sampling involves two steps. First, a subset called the sub-cohort is randomly selected from the whole cohort, regardless of status (having experienced the specific event or not). Second, the sub-cohort and the remaining cases in the cohort that were not selected into the sub-cohort form the case-cohort sample[3]. Strictly speaking, case-cohort sampling is a special case of stratified random sampling in which one of the strata, the cases, is assigned a 100% sampling fraction.
1.3.4 Stratified Random Sampling
In statistical surveys, when subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum) independently. Stratification is the process of dividing the members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then simple random sampling or systematic sampling is applied within each stratum. Optimal allocation is at the core of stratified sampling strategies.
Sometimes we want the allocation that provides the most precision, assuming that the direct cost of sampling an individual subject is equal across strata. Given a stratified sample with a fixed total sample size, a special case of optimal allocation is due to Neyman. Under Neyman allocation, the best sample size for stratum s is[2]:

n_s = \frac{n N_s \sigma_s}{\sum_i N_i \sigma_i}, \quad (1.2)

where n_s is the sample size for stratum s, n is the total sample size, N_s is the population size of stratum s, and \sigma_s is the standard deviation of stratum s.
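A small numeric sketch of (1.2); the stratum sizes and standard deviations below are hypothetical:

```python
# Neyman allocation (1.2): n_s = n * N_s * sigma_s / sum_i(N_i * sigma_i).
def neyman_allocation(n, sizes, sds):
    total = sum(N * sd for N, sd in zip(sizes, sds))
    return [n * N * sd / total for N, sd in zip(sizes, sds)]

# three hypothetical strata; the large, variable stratum gets most of the sample
alloc = neyman_allocation(100, sizes=[500, 300, 200], sds=[2.0, 1.0, 1.0])
print([round(a, 2) for a in alloc])  # [66.67, 20.0, 13.33]
```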
Chapter 2
Methodology
2.1 Introduction: Maximum Likelihood Estimation
2.1.1 Parametric Maximum Likelihood Estimation
First, we need some definitions.
Survival functions:

S(t) = P(T > t) = 1 - F(t), \qquad S_C(t) = P(C > t) = 1 - F_C(t).

Hazard functions:

h(t) = \lim_{\delta \to 0} \frac{P(t \le T < t + \delta \mid T \ge t)}{\delta}, \qquad h_C(t) = \lim_{\delta \to 0} \frac{P(t \le C < t + \delta \mid C \ge t)}{\delta}.
For each subject, only one member of the pair (y_i, c_i) is observed, recorded as t_i. By dividing the data into two categories, censored and uncensored, we can express the likelihood function as

L = L_1 \cdot L_2.

The likelihood contributions differ according to whether the data are censored. For uncensored data:

L_1 = \prod h(y_i) S(y_i) P(C > y_i) = \prod h(y_i) S(y_i) S_C(y_i).
For censored data:

L_2 = \prod h_C(c_i) S_C(c_i) P(T > c_i) = \prod h_C(c_i) S_C(c_i) S(c_i).
In order to express the likelihood function in a simple form, we define an indicator \gamma_i as follows:

\gamma_i = 1 if y_i < c_i, and \gamma_i = 0 otherwise.
The likelihood function under these circumstances can be expressed as:

L = \prod [h(t_i) S(t_i) S_C(t_i)]^{\gamma_i} [h_C(t_i) S_C(t_i) S(t_i)]^{1-\gamma_i} = \prod h(t_i)^{\gamma_i} h_C(t_i)^{1-\gamma_i} S(t_i) S_C(t_i).
Furthermore, when maximizing the likelihood function we only care about the estimated values of the parameters \lambda and k. Since, by assumption, the failure time and the censoring time are independent, we can simplify the likelihood function by removing the factors S_C(t_i) and h_C(t_i)^{1-\gamma_i}:

L = \prod h(t_i)^{\gamma_i} S(t_i).

Since it is easier to work with, we take the logarithm of L:

\log L = \sum [\gamma_i \log h(t_i) + \log S(t_i)].
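As a sketch of how this log-likelihood is used (illustrative Python, not the thesis's code; the shape k is treated as known and \lambda is found by a crude grid search):

```python
import math, random

# Censored Weibull log-likelihood: log L = sum_i [gamma_i * log h(t_i) + log S(t_i)],
# with S(t) = exp(-(t/lam)^k) and hazard h(t) = (k/lam) * (t/lam)^(k-1).
def loglik(lam, k, times, gamma):
    ll = 0.0
    for t, g in zip(times, gamma):
        if g:  # uncensored: contributes log h(t)
            ll += math.log(k / lam) + (k - 1) * math.log(t / lam)
        ll += -(t / lam) ** k          # log S(t) enters for every subject
    return ll

random.seed(2)
k, lam_true = 1.0, 0.5                 # k taken as known for this sketch
event  = [random.weibullvariate(lam_true, k) for _ in range(2000)]
censor = [random.uniform(0, 1) for _ in range(2000)]
times  = [min(e, c) for e, c in zip(event, censor)]
gamma  = [1 if e < c else 0 for e, c in zip(event, censor)]

# crude grid search for the MLE of lambda
grid = [0.2 + 0.01 * i for i in range(80)]
lam_hat = max(grid, key=lambda lam: loglik(lam, k, times, gamma))
print(lam_hat)  # should land near the true lambda = 0.5
```

For k = 1 the MLE has the closed form sum(t_i)/sum(gamma_i), and the grid search lands near it.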
2.1.2 Nonparametric Maximum Likelihood Estimation
The KM estimator has an elegant interpretation as a nonparametric maximum likelihood estimator. Its originators, Edward L. Kaplan and Paul Meier, provided an intuitive approach. If a subject is censored at time t, then to make the likelihood function as large as possible, we keep the survival function unchanged at the censoring time. If instead a subject experiences the event at t_{(i)}, we need to make the survival function just before t_{(i)} as large as possible and the survival function at t_{(i)} as small as possible.
Then the likelihood function L can be expressed as:

L = \prod_{i=1}^{m} [S(t_{(i-1)}) - S(t_{(i)})]^{d_i} [S(t_{(i)})]^{c_i}. \quad (2.1)
Write p_i = S(t_{(i)}) / S(t_{(i-1)}). Then S(t_{(i)}) can be expressed as:

S(t_{(i)}) = p_1 p_2 p_3 \dots p_i. \quad (2.2)

We can then rewrite the likelihood function as:

L = \prod_{i=1}^{m} (1 - p_i)^{d_i} (p_i)^{c_i} (p_1 p_2 p_3 \dots p_{i-1})^{d_i + c_i}. \quad (2.3)
Let n_i = \sum_{j \ge i} (d_j + c_j) be the total number of subjects at risk at t_{(i)}. Then we can write the likelihood function as:

L = \prod_{i=1}^{m} (1 - p_i)^{d_i} (p_i)^{n_i - d_i}. \quad (2.4)
Based on this likelihood function, the maximum likelihood estimate of p_i is

\hat{p}_i = \frac{n_i - d_i}{n_i}. \quad (2.5)

The KM estimator is the product of these estimated probabilities \hat{p}_i.
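A quick numeric sanity check (illustrative, not from the thesis) that \hat{p}_i = (n_i - d_i)/n_i maximizes one factor (1 - p_i)^{d_i} (p_i)^{n_i - d_i} of (2.4):

```python
import math

# maximize the log of one likelihood factor over a fine grid of p values
n_i, d_i = 10, 3
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: d_i * math.log(1 - p) + (n_i - d_i) * math.log(p))
print(p_hat)  # 0.7, which equals (n_i - d_i) / n_i as in (2.5)
```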
2.2 Weighted Kaplan-Meier Estimator
2.2.1 Kaplan-Meier Estimator for Case-Cohort Sampling
As noted in Section 1.3.3, the case-cohort design (Prentice 1986) originated to allow efficient analysis of studies where it is too expensive and time consuming to collect and analyze data on all subjects. A case-cohort sample is a combination of all the cases in the cohort and only part of the controls. If such a sample is used naively to estimate the survival function, the survival function will be underestimated: the death rate in the case-cohort sample is clearly higher than in the full cohort, and a biased sample with an inflated death rate cannot represent the statistical characteristics of the full cohort.
In order to adjust for this bias, we incorporate a new weight h_i; we denote the resulting weighted estimator by S_1, which can be expressed as[3]:

S_1(t) = \prod_{y_{(i)} < t} h_i \, p_i. \quad (2.6)
The weight h_i is defined as

h_i = \gamma_i + \frac{(1 - \gamma_i)\, \alpha_i}{p_n}, \quad (2.7)

where p_n is the sub-cohort inclusion probability,

p_n = \tilde{n} / n, \quad (2.8)

with \tilde{n} the sample size of the sub-cohort and n the sample size of the full cohort, and \alpha_i is the inclusion indicator:

\alpha_i = 1 if the ith observation is in the sub-cohort, and \alpha_i = 0 otherwise. \quad (2.9)
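One way to put these weights to work, sketched in Python, is to treat h_i as inverse-probability weights inside the KM risk-set counts. This is an assumption about the implementation (the thesis's own simulations use R's survfit with a weights argument), and `weighted_km_curve` is a hypothetical helper:

```python
import random

def weighted_km_curve(times, gamma, weights):
    # weighted KM: replace subject counts by weight totals in each factor
    data = sorted(zip(times, gamma, weights))
    at_risk = sum(w for _, _, w in data)   # weighted number at risk
    s, curve = 1.0, []
    for t, g, w in data:
        if g and at_risk > 0:              # event: weighted (n_i - d_i)/n_i factor
            s *= max(0.0, 1.0 - w / at_risk)
        at_risk -= w                       # subject leaves the risk set
        curve.append((t, s))
    return curve

random.seed(3)
n, sub = 1000, 300
event  = [random.weibullvariate(0.5, 1) for _ in range(n)]
censor = [random.uniform(0, 1) for _ in range(n)]
times  = [min(e, c) for e, c in zip(event, censor)]
gamma  = [1 if e < c else 0 for e, c in zip(event, censor)]
alpha  = [1] * sub + [0] * (n - sub)       # sub-cohort inclusion indicators
random.shuffle(alpha)
p_n = sub / n
# weight (2.7): cases keep weight 1; sampled controls get 1 / p_n, others 0
h = [g + (1 - g) * a / p_n for g, a in zip(gamma, alpha)]
curve = weighted_km_curve(times, gamma, h)
print(curve[-1][1])                        # estimated S at the largest time
```

Unsampled controls receive weight 0 and therefore drop out of both the numerator and the risk set.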
2.2.2 Kaplan-Meier Estimator for Stratified Random Sampling
When the full cohort data are not available, the survival function cannot be estimated directly. Since the observed data form a biased sample, using them naively would lead to incorrect results. To solve this problem, we need to incorporate a proper weight. Let \psi_{is} be the strata indicator: \psi_{is} = 1 if the ith subject is in the sth stratum and \psi_{is} = 0 otherwise. Let \xi_i be the sampling indicator: \xi_i = 1 if the ith subject is sampled and \xi_i = 0 otherwise. Let p_{n,s} = \tilde{n}_s / n_s be the inclusion probability, where n_s and \tilde{n}_s are the number of subjects in the sth stratum and the number of sampled subjects in the sth stratum, respectively. To adjust for the sampling bias, we again incorporate a weight h_i; the resulting weighted estimator S_2 can be expressed as[4]:

S_2(t) = \prod_{y_{(i)} < t} h_i \, p_i, \quad (2.10)

where the weight for the ith subject is[4]:

h_i = \sum_{s=1}^{S} \frac{\xi_i \psi_{is}}{p_{n,s}}. \quad (2.11)
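Since each subject belongs to exactly one stratum, the sum in (2.11) has a single nonzero term: a sampled subject's weight is the inverse of its own stratum's inclusion probability. A small sketch with hypothetical strata:

```python
import random

# h_i = sum_s xi_i * psi_is / p_{n,s}; with one stratum per subject this is
# simply xi_i / p_{n, s(i)}.
def stratified_weights(strata, sampled, incl_prob):
    return [xi / incl_prob[s] for s, xi in zip(strata, sampled)]

random.seed(4)
incl_prob = {1: 1.0, 2: 0.5, 3: 0.2}          # hypothetical sampling fractions
strata  = [random.choice([1, 2, 3]) for _ in range(10)]
sampled = [1 if random.random() < incl_prob[s] else 0 for s in strata]
h = stratified_weights(strata, sampled, incl_prob)
print(h)
```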
Chapter 3
Analysis
To assess the accuracy of the survival function estimators, we need to analyze the variances and confidence intervals of the estimated survival functions obtained from the different estimators.
3.1 Parametric Approach: Greenwood's Formula
Theorem 1 (Greenwood's formula). The variance of the survival function estimator is given by Greenwood's formula[5]:

\widehat{\mathrm{Var}}(\hat{S}(t)) = (\hat{S}(t))^2 \sum_{j=1}^{k} \frac{d_j}{n_j (n_j - d_j)}. \quad (3.1)

Since n_i \hat{p}_i \sim \mathrm{Binomial}(n_i, p_i), the variance of \hat{p}_i is given by

\mathrm{Var}(\hat{p}_i) = \frac{p_i (1 - p_i)}{n_i}. \quad (3.2)
To obtain the variance of \hat{S}(t), the KM estimator, we apply the delta method. First, taking logs, we find

\log \hat{S}(t_{(i)}) = \sum_{j=1}^{i} \log \hat{p}_j. \quad (3.3)
The delta method gives the approximation

\mathrm{Var}(f(X)) \approx (f'(X))^2 \mathrm{Var}(X). \quad (3.4)

In our case, \mathrm{Var}(\log \hat{p}_i) can be expressed as

\mathrm{Var}(\log \hat{p}_i) = \left(\frac{1}{p_i}\right)^2 \mathrm{Var}(\hat{p}_i) = \frac{1 - p_i}{n_i p_i}. \quad (3.5)
Since we assume the covariances of the \hat{p}_j's are zero, we find

\mathrm{Var}(\log \hat{S}(t_{(i)})) = \sum_{j=1}^{i} \frac{1 - p_j}{n_j p_j} = \sum_j \frac{d_j}{n_j (n_j - d_j)}. \quad (3.6)

Applying the delta method again yields the final result:

\mathrm{Var}(\hat{S}(t_{(i)})) = (\hat{S}(t_{(i)}))^2 \sum_j \frac{1 - p_j}{n_j p_j}. \quad (3.7)
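Formula (3.1) is easy to compute alongside the KM product itself; a brief sketch with made-up (n_j, d_j) counts:

```python
# Greenwood's formula (3.1): Var(S(t)) = S(t)^2 * sum_j d_j / (n_j * (n_j - d_j)).
def km_with_greenwood(counts):
    """counts: list of (n_j, d_j) pairs at the event times up to t."""
    s, total = 1.0, 0.0
    for n_j, d_j in counts:
        s *= (n_j - d_j) / n_j               # KM factor
        total += d_j / (n_j * (n_j - d_j))   # Greenwood summand
    return s, s * s * total

s, var = km_with_greenwood([(10, 1), (9, 2), (6, 1)])
print(s, var)  # S = 7/12 together with its Greenwood variance
```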
3.2 Nonparametric Approach: Bootstrap
In medical science we very often have to deal with small sample sizes, for several reasons: a common one is the rarity of an illness, or the difficulty of gathering patients with the same characteristics. Often we also have little information about the distributions of the samples, so classical statistical methods do not work for analyzing the variance. Facing these problems, the bootstrap turns out to be one of the best choices. Without assuming a distribution for the sample, such as a t- or normal distribution, we can still use bootstrap resampling to assess the estimated survival functions.
Suppose we draw N bootstrap samples and obtain N Kaplan-Meier estimates \hat{S}_j(t). As the bootstrap estimator of the survival function we take[6]

\hat{S}_B(t) = \frac{1}{N} \sum_{j=1}^{N} \hat{S}_j(t). \quad (3.8)
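A sketch of (3.8) and of the percentile intervals used later in Section 4.3 (illustrative Python on uncensored exponential data; `km_at` is a hypothetical helper re-implementing the product form):

```python
import random

# bootstrap the KM estimate at a fixed time t: resample (time, status) pairs
# with replacement, recompute the estimate, then average as in (3.8) and take
# rough percentile bounds.
def km_at(data, t):
    s, n = 1.0, len(data)
    for i, (ti, ev) in enumerate(sorted(data)):
        if ti >= t:
            break
        if ev:                       # event among the n - i still at risk
            s *= (n - i - 1) / (n - i)
    return s

random.seed(5)
data = [(random.expovariate(2.0), 1) for _ in range(200)]  # uncensored, rate 2
reps = sorted(km_at([random.choice(data) for _ in data], 0.3)
              for _ in range(200))
mean_s = sum(reps) / len(reps)       # bootstrap estimator S_B(0.3)
lo, hi = reps[4], reps[194]          # rough 2.5% and 97.5% quantiles
print(mean_s, lo, hi)                # true S(0.3) = exp(-0.6), about 0.55
```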
Chapter 4
Simulation
In this section, we aim to compare the weighted KM estimators with the unweighted KM estimators under two sampling schemes: (1) case-cohort sampling and (2) stratified sampling. Since our data set comes from a known distribution, comparing the estimates with the theoretical values demonstrates the improvement. All simulations are conducted in R.
4.1 Data Set
For convenience of evaluation, we generate the data randomly. First of all, we assume the event time T follows a Weibull distribution with parameters \lambda, k; correspondingly, the survival function is S(t) = e^{-(t/\lambda)^k}. As a first step, we generate a random vector, Weibull, of length n, each element randomly generated from a Weibull distribution with parameters \lambda, k. These values stand for the pre-assumed event times of the n subjects. In the same way, we randomly generate another vector of the same length, Censor, each element drawn from a uniform distribution U(0, C); these values stand for the censoring times of the corresponding n subjects. At each position of the two vectors, we take the minimum value to construct a new vector, Time, of the same length n. We use the values in Time as the observed times (the event time or the censoring time, whichever is smaller). At the same time, we record the censoring status in an indicator vector, Ind: at each position we record 0 if the observation is censored, and 1 otherwise. At this stage we have generated a raw data set. We then order the data set by Time in increasing order and permute Ind along with Time (Ind itself is not sorted independently). Now we have constructed a full cohort for survival analysis.
In R, we use the function survfit to estimate the survival function. The summary of the survival function looks like this:

time     n.risk  n.event  survival  standard error  lower 95% CI  upper 95% CI
0.00823  200     1        0.995     0.00499         0.985         1
0.06673  188     1        0.990     0.00724         0.976         1
0.10404  180     1        0.984     0.00905         0.967         1
0.12567  177     1        0.979     0.01057         0.958         1
0.14638  174     1        0.973     0.01191         0.95          0.997
0.16613  169     1        0.967     0.01316         0.942         0.993
Table 4.1: Summary for Survival Function Sample
4.2 Weighted Kaplan Meier Estimators
4.2.1 Kaplan Meier Estimator with Case-cohort Weights
When the sample is drawn by case-cohort sampling, the KM estimator without weights will be biased, regardless of the censoring rate. This is the main reason we investigate a weighted KM estimator.

Simulations were conducted to show that the weighted KM estimator removes this bias. As mentioned at the beginning of this chapter, we choose Weibull(1, 0.5) for the pre-assumed event times and, for comparison, uniform distributions U(0, 2), U(0, 1), and U(0, 0.8) to control the censoring rates of 20%, 40%, and 80%, respectively. For each simulation, we use a full cohort of size 1000, from which a sub-cohort of size 300 is selected by simple random sampling. We then include all the other cases that are not in the sub-cohort. The average case-cohort sample sizes are 600, 700, and 800 at the 20%, 40%, and 80% censoring rates, respectively. Both the unweighted and the weighted KM estimators are applied to this sampled group.
We repeated the process 1000 times. Having fixed time points every 0.01 from 0 to C, we recorded the survival function values at each time point in every replication. Over the 1000 replications, the average survival function value was taken at each time point, plotted, and compared with the true survival curve from Weibull(1, 0.5). Figure 4.1 shows the average of the 1000 replications. Our estimate, the green curve, overlaps the true curve almost everywhere, so it appears to be a very good estimate. We can also see that the unweighted KM estimates, the red curves, show significant bias in all cases, and the bias increases as the censoring rate increases.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.1: Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.2.2 Kaplan Meier Estimator with Stratified Sampling Weights
When given a population with stratified sampling, the KM estimator, without weights,
will be biased, regardless of the censoring rate. This is the main reason we are investi-
gating a weighted KM estimator.
Simulations were conducted to show weighted KM estimator removed such kind of
bias from unweighted KM estimator. As mentioned at the beginning of this chapter,
we choose Weibull(1, 0.5) for pre-assumed event time and for comparison we choose
uniform distributions, U(0, 2), U(0, 1) and U(0, 0.8) to control the censoring rate with
respect to 20%,40%,80%. For each single simulation, we choose full cohort with sample
size 1000.
We simulated two stratifying schemes: stratified by time and stratified by censoring status. For the first scheme, we divided the full cohort into three subsets by observation time: (0, 1/3), (1/3, 2/3), and (2/3, \infty). From each stratum, samples with sampling fractions 1, 0.5, and 0.2, respectively, were selected by simple random sampling, and all the sampled data were combined to form a stratified sample; the average stratified sample size is about 520. The second scheme uses the same approach: we divided the full cohort into two subsets according to censoring status, with sampling fraction 1 in the stratum of censored subjects and 0.2 in the stratum of uncensored subjects. Simulations for both stratifying schemes were conducted at censoring rates of 20%, 40%, and 80%, and both the unweighted and the weighted KM estimators were applied to the sampled groups.
We repeated the process 1000 times. Having fixed time points every 0.01 from 0 to C, we recorded the survival function values at each time point in every replication. Over the 1000 replications, the average survival function value was taken at each time point, plotted, and compared with the true survival curve from Weibull(1, 0.5). Figure 4.2 shows the average of the 1000 replications. Our estimate, the green curve, overlaps the true curve almost everywhere, so it appears to be a very good estimate. We can also see that the unweighted KM estimates, the red curves, show significant bias in all cases, and the bias increases as the censoring rate increases.
The following results were obtained when stratifying by observation time.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.2: Weighted KM Estimator (Stratified Sampling by Time) with Different Censoring Rates
The simulations below stratify by censoring status, again averaged over 1000 replications.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.3: Weighted KM Estimator (Stratified Sampling by Censoring Status) with Different Censoring Rates
4.3 Bootstrap
In this section, we applied Bootstrap to sampled data sets with the Bootstrap number,
200. In order to get rid of occasional results, we take the average of 200 replications.
By obtaining the quantiles of 97.5% and 2.5% from 200 survival function values at each
time point, 95% confidence interval(C.I.) of our estimations are constructed. Figure
4.4, Figure 4.5 and Figure 4.6 show that the theoretical curve locates in the 95% C.I.
of our estimations. In each figure, green line is the 95% upper bound and blue line
is the 95% lower bound. Black lines, the theoretical curves, are nearly overlapped
almost everywhere by the red lines, which are bootstrap estimation from weighted KM
estimators.
[Figure: three panels plotting S(t) against time, titled "Case-cohort Sampling with 20% Censoring Rate", "Case-cohort Sampling with 40% Censoring Rate", and "Case-cohort Sampling with 80% Censoring Rate".]
Figure 4.4: Bootstrap 95% C.I. of Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
[Figure: three panels plotting S(t) against time, titled "Stratified by Censoring Status with 20% Censoring Rate", "Stratified by Censoring Status with 40% Censoring Rate", and "Stratified by Censoring Status with 80% Censoring Rate".]
Figure 4.5: Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Censoring Status) with Different Censoring Rates
[Figure: three panels plotting S(t) against time, titled "Stratified by Time with 20% Censoring Rate", "Stratified by Time with 40% Censoring Rate", and "Stratified by Time with 80% Censoring Rate".]
Figure 4.6: Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Time) with Different Censoring Rates
Chapter 5
Conclusion
From the simulation results, we can see that our weighted KM estimators are significantly closer to the theoretical values than the unweighted KM estimator at every censoring level. Indeed, averaged over 1000 replications, the weighted KM estimators overlapped the expected theoretical curve almost everywhere, while the unweighted KM estimators were clearly biased. The bootstrap gives further evidence: the theoretical curve lies within the 95% confidence intervals of our weighted KM estimators. Thus, we conclude that the weighted KM estimators for case-cohort sampling and stratified sampling give more accurate estimates of the survival function than the unweighted KM estimator.
Although the unweighted KM estimator often works well for unbiased sampling schemes, we suggest double-checking the sampling scheme before applying it to the data. For a biased sampling scheme, a proper weighted KM estimator should be set up. This article provided weights for only two special cases, case-cohort sampling and stratified sampling, with three distinct censoring rates. To reach more general conclusions, further exploration is needed under other settings, such as theoretical distributions other than the Weibull, or other biased sampling schemes.
References
[1] Borgan, O. 2005. Kaplan-Meier Estimator. Encyclopedia of Biostatistics.
[2] Lohr, Sharon. 2009. Sampling: Design and Analysis. Cengage Learning.
[3] Chiou, S., Kang, S., and Yan, J. 2014. Fast Accelerated Failure Time Modeling for Case-Cohort Data. Statistics and Computing, 24(4):559-568.
[4] Chiou, S., Kang, S., and Yan, J. 2014+. Semiparametric Accelerated Failure Time Modeling for Clustered Failure Times from Stratified Sampling. Journal of the American Statistical Association, forthcoming.
[5] Jenkins, Stephen P. 2005. Survival Analysis. Unpublished manuscript, Institute for Social and Economic Research, University of Essex, Colchester, UK.
[6] Efron, Bradley, and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
Appendix A
R Codes
## Stratified sampling (strata based on time)
library(survival)  ## for survfit() and Surv()

simDat <- function(n, C, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the three strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  ## straID <- ifelse(delta == 1, 1, ifelse(T < 2/3, 2, 3))
  straID <- ifelse(T < 1/3, 1, ifelse(T < 2/3, 2, 3))
  sampInd <- unlist(sapply(1:3, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  ## the draws are generated stratum by stratum; since the strata are time
  ## intervals, sorting by T maps them back to the original rows
  sampInd[order(T)] <- sampInd
  hi[order(T)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
ttmp <- seq(0, 2, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 2, c(1, 0.5, 0.2))
  ## without weights
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = sampInd == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with inverse-probability weights
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = sampInd == 1)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "Weighted KM Estimation with 20% Censoring Rate")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
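In both fits above, survfit evaluates the product-limit formula with every death count and risk-set size replaced by its hi-weighted total. As a language-neutral illustration of that formula (a minimal Python sketch for exposition only; the helper name weighted_km is ours, not part of the R appendix):

```python
def weighted_km(times, events, weights):
    """Weighted product-limit (Kaplan-Meier) estimator.

    S(t) = product over event times u <= t of
           (1 - weighted deaths at u / weighted number at risk at u).
    Returns a list of (event_time, survival) pairs.
    """
    data = sorted(zip(times, events, weights))
    surv = 1.0
    curve = []
    for u in sorted({t for t, e, _ in data if e == 1}):
        at_risk = sum(w for t, _, w in data if t >= u)             # weighted risk set
        deaths = sum(w for t, e, w in data if t == u and e == 1)   # weighted events
        surv *= 1.0 - deaths / at_risk
        curve.append((u, surv))
    return curve
```

With all weights equal, the sketch reduces to the ordinary Kaplan-Meier estimator, which is the sense in which the unweighted fit is a special case; rescaling all weights by a constant leaves the curve unchanged.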
## Bootstrap
library(survival)
library(boot)
library(Hmisc)  ## for bootkm()
n <- 200
Censor <- runif(n, 0, 1)
Weibull <- rweibull(n, 1, 5)
time <- pmin(Censor, Weibull)
ind <- ifelse(Censor <= Weibull, 0, 1)
temp <- data.frame(cbind(time, ind)[order(time), ])
survobj <- Surv(temp$time, temp$ind)
S <- rep(1, n)
SE <- rep(1, n)
for (i in 1:length(S)) {
  ## draw one set of bootstrap replicates of S(t_i) and use it for both
  ## the point estimate and the standard error
  boots <- bootkm(survobj, B = 100, times = temp$time[i], pr = TRUE)
  S[i] <- mean(boots)
  SE[i] <- sd(boots)
}
plot(temp$time, S, xlab = "Time", ylab = "Survival Function",
     main = "Bootstrap Estimation", type = "p")
lines(seq(0, max(temp$time), 0.01),
      pweibull(seq(0, max(temp$time), 0.01), 1, 5, lower.tail = FALSE),
      col = 2)
## Stratified sampling (strata based on censoring status)
simDat <- function(n, C, a, b, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the two strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, a, b)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  straID <- delta
  sampInd <- unlist(sapply(1:2, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  ## the draws are generated stratum by stratum; the strata are defined by
  ## delta here, so order by straID (not by T) to return to the original rows
  sampInd[order(straID)] <- sampInd
  hi[order(straID)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
ttmp <- seq(0, 2, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 2, 1, 0.5, c(1, 0.2))
  ## without weights
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = sampInd == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with inverse-probability weights
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = sampInd == 1)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "Weighted KM Estimation with 80% Censoring Rate")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
## Case-cohort sampling
n <- 200
Censor <- runif(n, 0, 1)
Weibull <- rweibull(n, 1, 0.5)
time <- pmin(Censor, Weibull)
ind <- ifelse(Censor <= Weibull, 0, 1)
temp <- data.frame(cbind(time, ind)[order(time), ])
subinx <- sample(1:length(time), n/3, replace = FALSE)
temp$subcohort <- 0
temp$subcohort[subinx] <- 1
pn <- table(temp$subcohort)[[2]] / sum(table(temp$subcohort))
## cases get weight 1; sampled censored subjects get weight 1/pn
temp$hi <- temp$ind + (1 - temp$ind) * temp$subcohort / pn

simDat <- function(n, C, a, b) {
  ## C controls the censoring rate;
  ## a random fifth of the cohort forms the subcohort
  cen <- runif(n, 0, C)
  Y <- rweibull(n, a, b)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  subinx <- sample(1:length(T), n/5, replace = FALSE)
  subcohort <- rep(0, length(T))
  subcohort[subinx] <- 1
  pn <- table(subcohort)[[2]] / sum(table(subcohort))
  hi <- delta + (1 - delta) * subcohort / pn
  data.frame(T = T, delta = delta, subcohort = subcohort, hi = hi)
}
ttmp <- seq(0, 1, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 1, 1, 5)
  ## without weights, subcohort only
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = subcohort == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with case-cohort weights over all cases plus the subcohort (hi > 0)
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = hi > 0)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
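The weight line hi <- delta + (1 - delta) * subcohort / pn encodes the case-cohort design: every case is always observed and keeps weight 1, while a censored subject is observed only if drawn into the subcohort, so it stands in for 1/pn censored subjects. A minimal sketch of just this rule (illustrative Python; case_cohort_weights is a hypothetical helper, not thesis code):

```python
def case_cohort_weights(delta, subcohort, pn):
    """Case-cohort weights: each case (delta == 1) gets weight 1; a censored
    subject gets weight 1/pn if sampled into the subcohort, else 0."""
    return [d + (1 - d) * s / pn for d, s in zip(delta, subcohort)]
```

Subjects with weight 0 (censored and outside the subcohort) contribute nothing to the weighted fit and can simply be dropped from it.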
## Bootstrap
library(bootstrap)
library(survival)
simDatStr <- function(n, C, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the three strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  ## straID <- ifelse(delta == 1, 1, ifelse(T < 2/3, 2, 3))  ## stratified by censoring status
  straID <- ifelse(T < 1/3, 1, ifelse(T < 2/3, 2, 3))  ## stratified by time
  sampInd <- unlist(sapply(1:3, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  sampInd[order(T)] <- sampInd
  hi[order(T)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
simDatCase <- function(n, C) {
  ## C controls the censoring rate;
  ## a random third of the cohort forms the subcohort
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  temp <- data.frame(cbind(T, delta)[order(T), ])
  subinx <- sample(1:length(T), n/3, replace = FALSE)
  temp$subcohort <- 0
  temp$subcohort[subinx] <- 1
  pn <- table(temp$subcohort)[[2]] / sum(table(temp$subcohort))
  temp$hi <- temp$delta + (1 - temp$delta) * temp$subcohort / pn
  data.frame(T = temp$T, delta = temp$delta, hi = temp$hi)
}
## Bootstrap
nrep <- 200   ## replication times
nboot <- 200  ## bootstrapping times
ttmp <- seq(0, 1, 0.01)  ## dummy time vector, upper bound at C
sv0 <- svBU <- svBL <- matrix(0, ncol = length(ttmp), nrow = nrep)
for (m in 1:nrep) {
  dat <- simDatCase(1000, 1)  ## change C to control the censoring rate (2: 20%, 1: 40%, 0.2: 80%)
  fit0 <- survfit(Surv(T, delta) ~ 1, data = dat, weights = hi)
  sv0[m, ] <- sapply(ttmp, function(x) c(1, fit0$surv)[max(which(x >= c(0, fit0$time)))])
  sample1 <- subset(dat, hi > 0)  ## the observed case-cohort sample
  svB <- matrix(0, ncol = length(ttmp), nrow = nboot)
  for (i in 1:nboot) {
    ## resample rows with replacement, keeping each row's weight
    subinx <- sample(1:nrow(sample1), nrow(sample1), replace = TRUE)
    resample <- sample1[subinx, ]
    fit1 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = resample)
    svB[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  }
  for (k in 1:length(ttmp)) {
    svBU[m, k] <- quantile(svB[, k], 0.975)  ## 95% upper bound
    svBL[m, k] <- quantile(svB[, k], 0.025)  ## 95% lower bound
  }
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "95% C.I. from Bootstrap of Case-cohort Sampling with 40% Censoring Rate")
lines(ttmp, apply(sv0, 2, mean), col = 2)
lines(ttmp, apply(svBU, 2, mean), col = 3)
lines(ttmp, apply(svBL, 2, mean), col = 4)
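The inner loop above is the percentile bootstrap: each of the nboot resamples yields a survival curve, and the 2.5% and 97.5% empirical quantiles of the replicates at each time point form the pointwise interval. The generic recipe, sketched in Python for a scalar statistic (an illustrative helper under our own naming, not the thesis code):

```python
import random

def percentile_ci(stat, sample, nboot=200, alpha=0.05, seed=1):
    """Percentile bootstrap interval: resample the data with replacement,
    recompute the statistic, and take the alpha/2 and 1 - alpha/2
    empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(sample) for _ in sample])
                  for _ in range(nboot))
    lo = reps[int((alpha / 2) * (nboot - 1))]
    hi = reps[int((1 - alpha / 2) * (nboot - 1))]
    return lo, hi
```

The R code applies this recipe once per grid point of ttmp, with the weighted KM estimate at that time playing the role of stat.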