Weighted Kaplan-Meier Estimator for Different Sampling Methods
A PROJECT
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Weitong Yin
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE
Dr. Steven Chiou and Dr. Barry James
August, 2015
© Weitong Yin 2015
ALL RIGHTS RESERVED
Acknowledgements
First of all, I would like to extend my sincere gratitude to my advisors, Dr. Barry James and Dr. Steven Chiou, for their support of my coursework, this project, and my future studies over the past two years. I would also like to give my sincere thanks to Dr. Kang James for serving as my committee member. During her seminar classes, she offered me important ideas that contributed greatly to this project. Furthermore, I would like to thank my parents for their spiritual support during my studies in the United States.
Dedication
To those who held me up over the years
Abstract
The Kaplan-Meier (KM) estimator is a powerful approach for estimating survival functions. Under some specific sampling methods, however, bias may affect the accuracy of the estimated survival functions. In this paper we are interested in weighted KM estimators for different sampling methods, and we propose weighted KM estimators for case-cohort sampling and for stratified sampling. To evaluate the effectiveness of these weighted KM estimators, the bootstrap is applied to analyze the variance of the estimates. Simulation results are provided that compare the estimated survival curves with survival curves based on the Weibull distribution, from which the data are randomly generated.
Keywords: Kaplan-Meier estimator, Case-cohort sampling, Stratified sampling, Weighted Kaplan-Meier estimator, Bootstrap.
Contents
Acknowledgements
Dedication
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Kaplan-Meier Estimator
1.2 Right-Censoring
1.3 Sampling
1.3.1 Simple Random Sampling
1.3.2 Systematic Sampling
1.3.3 Case-Cohort Sampling
1.3.4 Stratified Random Sampling
2 Methodology
2.1 Introduction: Maximum Likelihood Estimation
2.1.1 Parametric Maximum Likelihood Estimation
2.1.2 Nonparametric Maximum Likelihood Estimation
2.2 Weighted Kaplan-Meier Estimator
2.2.1 Kaplan-Meier Estimator for Case-Cohort Sampling
2.2.2 Kaplan-Meier Estimator for Stratified Random Sampling
3 Analysis
3.1 Parametric Approach: Greenwood's Formula
3.2 Nonparametric Approach: Bootstrap
4 Simulation
4.1 Data Set
4.2 Weighted Kaplan-Meier Estimators
4.2.1 Kaplan-Meier Estimator with Case-Cohort Weights
4.2.2 Kaplan-Meier Estimator with Stratified Sampling Weights
4.3 Bootstrap
5 Conclusion
References
Appendix A. R Codes
List of Tables
4.1 Summary for Survival Function Sample
List of Figures
4.1 Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.2 Weighted KM Estimator (Stratified Sampling by Time) with Different Censoring Rates
4.3 Weighted KM Estimator (Stratified Sampling by Censoring Status) with Different Censoring Rates
4.4 Bootstrap 95% C.I. of Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.5 Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Censoring Status) with Different Censoring Rates
4.6 Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Time) with Different Censoring Rates
Chapter 1
Introduction
1.1 Kaplan-Meier Estimator
Survival analysis is the analysis of the time until one or more events happen. Usually, it deals with finding survival functions, which model time-to-event data. The "event" typically refers to the occurrence of some event of interest, such as "death" or "failure". Our object of primary interest, the survival function, is defined as S(t) = Pr(T > t)[1], where t is a specified time, T denotes the random "event" time, and Pr stands for probability. In words, the survival function is the probability that the event occurs later than the specified time t. Usually we assume S(0) = 1, meaning that no event has happened at the starting time. As t increases, S(t) decreases toward 0, which corresponds to no survivors.
We consider estimators of the survival function that make no prior assumptions about the shapes of the relevant functions; this is commonly called the nonparametric approach to survival analysis. Because a nonparametric analysis gives some information about the pattern of the distribution, it can also help us choose a proper parametric model. The Kaplan-Meier estimator, also known as the product-limit estimator, is the most commonly used nonparametric method, and it allows for censoring.
1.2 Right-Censoring
Right-censoring is common in survival analysis. For example, in a medical study, some patients may withdraw before we observe the event of interest. As a result, we only observe the time at which those patients left, and never the time at which the event would have occurred for them.
Considering the case without censoring, let S(t) be the probability that a given subject survives through time t. For a sample of size N from this population, let the observed times until death of the N sample members be t_1 \le t_2 \le t_3 \le \dots \le t_N. Corresponding to each t_i are n_i, the number "at risk" just prior to time t_i, and d_i, the number of events that happened at time t_i.
The simplest case is to estimate the survival function from the whole population without any censoring. The estimator can then be expressed in product form:

S(t) = \prod_{t_i < t} \frac{n_i - d_i}{n_i}. \quad (1.1)
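To make the product form (1.1) concrete, here is a minimal sketch in Python (the thesis's own code is in R; `km_estimate` is an illustrative helper, not part of the thesis):

```python
# Illustrative sketch of the product form (1.1): at each event time before t,
# multiply in the factor (n_i - d_i) / n_i; censored subjects simply leave the
# risk set without contributing a factor.
def km_estimate(times, events, t):
    pairs = sorted(zip(times, events))
    n = len(pairs)                 # everyone is at risk initially
    s = 1.0
    for i, (ti, ev) in enumerate(pairs):
        if ti >= t:
            break
        at_risk = n - i            # subjects with observed time >= ti
        if ev:                     # an event: one death out of at_risk subjects
            s *= (at_risk - 1) / at_risk
    return s

times = [1, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0]           # 1 = event observed, 0 = censored
print(km_estimate(times, events, 3.5))
```

Here two events occur before t = 3.5 among five subjects, giving S = (4/5)(3/4) = 0.6.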
This estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association. It is therefore also called the KM estimator, as mentioned at the beginning of this article.
However, the classical KM estimator cannot always meet our needs in real-world applications. For example, if it is too expensive or too time consuming to collect and analyze data on all subjects, we have to take a sample from the whole population. Naively using the observed sample data can lead to misleading results, because the statistical information in the sample is often biased in a way that depends strongly on the sampling method. Consider a sample that includes all the subjects who experienced the event but only some of those who did not; the classical KM estimator would then underestimate the survival function. To solve this problem, we have to adjust for the bias by incorporating a corresponding "weight", which depends on the sampling method applied. The modified estimator is called the weighted KM estimator. In particular, these estimators also cover the case with censoring.
1.3 Sampling
Sampling is the process of selecting units (e.g., people, organizations) from a population
of interest so that by studying the sample we may fairly generalize our results back to
the population from which they were chosen.
1.3.1 Simple Random Sampling
Simple random sampling is one of the basic forms of sampling. There are two ways
of taking samples: without replacement and with replacement. In the case with re-
placement, the same subject may be withdrawn more than once. In the case without
replacement, all subjects in the sample are distinct. Simple random sampling from a
population of N subjects is withdrawing independent samples of size 1. Obviously, each
subject is randomly withdrawn with same probability 1N with replacement. Since in
most cases, sampling the same subject doesn’t give us any additional information, we
usually take simple sampling without replacement. We withdraw a sample with size n
from the population. Each subset of n distinct subjects in the population has the same
probability of being selected. A sample with size n is randomly selected with probabilityn!(N−n)!
N ! . Thus, under this circumstance, each subject in the population is selected with
probability nN [2].
1.3.2 Systematic Sampling
Systematic sampling is a sampling method involving the selection of elements from
an ordered sampling frame. In this approach, progression through the list is treated
circularly, with a return to the top once the end of the list is passed. The sampling
starts by selecting a subject from the list at random and then every kth subject in the
frame is selected, where k is the sampling interval: this is calculated as: k = Nn where
n is the preassigned sample size, and N is the population size[2].
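As an illustration (a sketch, not taken from the thesis), circular systematic sampling with interval k = N/n can be written as:

```python
import random

# Illustrative circular systematic sampling: a random start, then every k-th
# element of the ordered frame, wrapping around the end of the list.
def systematic_sample(frame, n):
    N = len(frame)
    k = N // n                       # sampling interval k = N / n
    start = random.randrange(N)      # random starting subject
    return [frame[(start + i * k) % N] for i in range(n)]

random.seed(1)
sample = systematic_sample(list(range(100)), 10)
print(sample)                        # 10 subjects spaced k = 10 apart
```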
1.3.3 Case-Cohort Sampling
The case-cohort design (Prentice 1986) is an effective and economical design that originated to allow efficient analysis of studies where it is too expensive and time consuming
to collect and analyze data on all subjects. "Cases" refers to those subjects who have experienced a specific event; "controls" refers to those subjects who have not. Case-cohort random sampling involves two steps. First, a subset called the sub-cohort is randomly selected from the whole cohort, regardless of status (having experienced the specific event or not). Second, the sub-cohort and the remaining cases in the cohort that were not selected into the sub-cohort form the case-cohort sample[3]. Strictly speaking, case-cohort sampling is a special case of stratified random sampling in which one of the strata, the cases, is assigned a 100% sampling fraction.
1.3.4 Stratified Random Sampling
In statistical surveys, when subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum) independently. Stratification is the process of dividing the members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then simple random sampling or systematic sampling is applied within each stratum. Optimal allocation is at the core of stratified sampling strategies.
Sometimes we want the allocation that provides the most precision, assuming that the direct cost of sampling an individual subject is equal across strata. Given a stratified sample with a fixed total sample size, a special case of optimal allocation is due to Neyman. Under Neyman allocation, the best sample size for stratum s is[2]:

n_s = \frac{n N_s \sigma_s}{\sum_i N_i \sigma_i}, \quad (1.2)

where n_s is the sample size for stratum s, n is the total sample size, N_s is the population size of stratum s, and \sigma_s is the standard deviation of stratum s.
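A small numeric sketch of (1.2); the stratum sizes and standard deviations below are hypothetical:

```python
# Neyman allocation (1.2): n_s = n * N_s * sigma_s / sum_i(N_i * sigma_i).
def neyman_allocation(n, sizes, sds):
    total = sum(N * sd for N, sd in zip(sizes, sds))
    return [n * N * sd / total for N, sd in zip(sizes, sds)]

# three hypothetical strata; the large, variable stratum gets most of the sample
alloc = neyman_allocation(100, sizes=[500, 300, 200], sds=[2.0, 1.0, 1.0])
print([round(a, 2) for a in alloc])  # [66.67, 20.0, 13.33]
```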
Chapter 2
Methodology
2.1 Introduction: Maximum Likelihood Estimation
2.1.1 Parametric Maximum Likelihood Estimation
First, we need some definitions.
Survival functions:

S(t) = P(T > t) = 1 - F(t), \qquad S_C(t) = P(C > t) = 1 - F_C(t).

Hazard functions:

h(t) = \lim_{\delta \to 0} \frac{P(t \le T < t + \delta \mid T \ge t)}{\delta}, \qquad h_C(t) = \lim_{\delta \to 0} \frac{P(t \le C < t + \delta \mid C \ge t)}{\delta}.
For each subject, only one member of the pair (y_i, c_i) is observed, recorded as t_i. By dividing the data into two categories, censored and uncensored, we can express the likelihood function as

L = L_1 \cdot L_2.

The likelihood contributions differ according to whether the data are censored. For uncensored data:

L_1 = \prod h(y_i) S(y_i) P(C > y_i) = \prod h(y_i) S(y_i) S_C(y_i).
For censored data:

L_2 = \prod h_C(c_i) S_C(c_i) P(T > c_i) = \prod h_C(c_i) S_C(c_i) S(c_i).
In order to express the likelihood function in a simple form, we define an indicator \gamma_i as follows:

\gamma_i = 1 if y_i < c_i, and \gamma_i = 0 otherwise.
The likelihood function under these circumstances can be expressed as:

L = \prod [h(t_i) S(t_i) S_C(t_i)]^{\gamma_i} [h_C(t_i) S_C(t_i) S(t_i)]^{1-\gamma_i} = \prod h(t_i)^{\gamma_i} h_C(t_i)^{1-\gamma_i} S(t_i) S_C(t_i).
Furthermore, when maximizing the likelihood function we only care about the estimated values of the parameters \lambda and k. Since, by assumption, the failure time and the censoring time are independent, we can simplify the likelihood function by removing the factors S_C(t_i) and h_C(t_i)^{1-\gamma_i}:

L = \prod h(t_i)^{\gamma_i} S(t_i).

Since it is easier to work with, we take the logarithm of L:

\log L = \sum [\gamma_i \log h(t_i) + \log S(t_i)].
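As a sketch of how this log-likelihood is used (illustrative Python, not the thesis's code; the shape k is treated as known and \lambda is found by a crude grid search):

```python
import math, random

# Censored Weibull log-likelihood: log L = sum_i [gamma_i * log h(t_i) + log S(t_i)],
# with S(t) = exp(-(t/lam)^k) and hazard h(t) = (k/lam) * (t/lam)^(k-1).
def loglik(lam, k, times, gamma):
    ll = 0.0
    for t, g in zip(times, gamma):
        if g:  # uncensored: contributes log h(t)
            ll += math.log(k / lam) + (k - 1) * math.log(t / lam)
        ll += -(t / lam) ** k          # log S(t) enters for every subject
    return ll

random.seed(2)
k, lam_true = 1.0, 0.5                 # k taken as known for this sketch
event  = [random.weibullvariate(lam_true, k) for _ in range(2000)]
censor = [random.uniform(0, 1) for _ in range(2000)]
times  = [min(e, c) for e, c in zip(event, censor)]
gamma  = [1 if e < c else 0 for e, c in zip(event, censor)]

# crude grid search for the MLE of lambda
grid = [0.2 + 0.01 * i for i in range(80)]
lam_hat = max(grid, key=lambda lam: loglik(lam, k, times, gamma))
print(lam_hat)  # should land near the true lambda = 0.5
```

For k = 1 the MLE has the closed form sum(t_i)/sum(gamma_i), and the grid search lands near it.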
2.1.2 Nonparametric Maximum Likelihood Estimation
The KM estimator has an elegant interpretation as a nonparametric maximum likelihood estimator. Its originators, Edward L. Kaplan and Paul Meier, provided an intuitive approach. If a subject is censored at time t, then to make the likelihood function as large as possible, we keep the survival function unchanged at the censoring time. If instead a subject experiences the event at t_{(i)}, we need to make the survival function just before t_{(i)} as large as possible and the survival function at t_{(i)} as small as possible.
Then the likelihood function L can be expressed as:

L = \prod_{i=1}^{m} [S(t_{(i-1)}) - S(t_{(i)})]^{d_i} [S(t_{(i)})]^{c_i}. \quad (2.1)
Write p_i = S(t_{(i)}) / S(t_{(i-1)}). Then S(t_{(i)}) can be expressed as:

S(t_{(i)}) = p_1 p_2 p_3 \dots p_i. \quad (2.2)

We can then rewrite the likelihood function as:

L = \prod_{i=1}^{m} (1 - p_i)^{d_i} (p_i)^{c_i} (p_1 p_2 p_3 \dots p_{i-1})^{d_i + c_i}. \quad (2.3)
Let n_i = \sum_{j \ge i} (d_j + c_j) be the total number of subjects at risk at t_{(i)}. Then we can write the likelihood function as:

L = \prod_{i=1}^{m} (1 - p_i)^{d_i} (p_i)^{n_i - d_i}. \quad (2.4)
Based on this likelihood function, the maximum likelihood estimate of p_i is

\hat{p}_i = \frac{n_i - d_i}{n_i}. \quad (2.5)

The KM estimator is the product of these estimated probabilities \hat{p}_i.
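A quick numeric sanity check (illustrative, not from the thesis) that \hat{p}_i = (n_i - d_i)/n_i maximizes one factor (1 - p_i)^{d_i} (p_i)^{n_i - d_i} of (2.4):

```python
import math

# maximize the log of one likelihood factor over a fine grid of p values
n_i, d_i = 10, 3
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: d_i * math.log(1 - p) + (n_i - d_i) * math.log(p))
print(p_hat)  # 0.7, which equals (n_i - d_i) / n_i as in (2.5)
```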
2.2 Weighted Kaplan-Meier Estimator
2.2.1 Kaplan-Meier Estimator for Case-Cohort Sampling
As noted in Section 1.3.3, the case-cohort design (Prentice 1986) originated to allow efficient analysis of studies where it is too expensive and time consuming to collect and analyze data on all subjects. A case-cohort sample is a combination of all the cases in the cohort and only part of the controls. If such a sample is used naively to estimate the survival function, the survival function will be underestimated: the death rate in the case-cohort sample is clearly higher than in the full cohort, and a biased sample with an inflated death rate cannot represent the statistical characteristics of the full cohort.
In order to adjust for this bias, we incorporate a new weight h_i; we denote the resulting weighted estimator by S_1, which can be expressed as[3]:

S_1(t) = \prod_{y_{(i)} < t} h_i \, p_i. \quad (2.6)
The weight h_i is defined as

h_i = \gamma_i + \frac{(1 - \gamma_i)\, \alpha_i}{p_n}, \quad (2.7)

where p_n is the sub-cohort inclusion probability,

p_n = \tilde{n} / n, \quad (2.8)

with \tilde{n} the sample size of the sub-cohort and n the sample size of the full cohort, and \alpha_i is the inclusion indicator:

\alpha_i = 1 if the ith observation is in the sub-cohort, and \alpha_i = 0 otherwise. \quad (2.9)
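One way to put these weights to work, sketched in Python, is to treat h_i as inverse-probability weights inside the KM risk-set counts. This is an assumption about the implementation (the thesis's own simulations use R's survfit with a weights argument), and `weighted_km_curve` is a hypothetical helper:

```python
import random

def weighted_km_curve(times, gamma, weights):
    # weighted KM: replace subject counts by weight totals in each factor
    data = sorted(zip(times, gamma, weights))
    at_risk = sum(w for _, _, w in data)   # weighted number at risk
    s, curve = 1.0, []
    for t, g, w in data:
        if g and at_risk > 0:              # event: weighted (n_i - d_i)/n_i factor
            s *= max(0.0, 1.0 - w / at_risk)
        at_risk -= w                       # subject leaves the risk set
        curve.append((t, s))
    return curve

random.seed(3)
n, sub = 1000, 300
event  = [random.weibullvariate(0.5, 1) for _ in range(n)]
censor = [random.uniform(0, 1) for _ in range(n)]
times  = [min(e, c) for e, c in zip(event, censor)]
gamma  = [1 if e < c else 0 for e, c in zip(event, censor)]
alpha  = [1] * sub + [0] * (n - sub)       # sub-cohort inclusion indicators
random.shuffle(alpha)
p_n = sub / n
# weight (2.7): cases keep weight 1; sampled controls get 1 / p_n, others 0
h = [g + (1 - g) * a / p_n for g, a in zip(gamma, alpha)]
curve = weighted_km_curve(times, gamma, h)
print(curve[-1][1])                        # estimated S at the largest time
```

Unsampled controls receive weight 0 and therefore drop out of both the numerator and the risk set.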
2.2.2 Kaplan-Meier Estimator for Stratified Random Sampling
When the full cohort data are not available, the survival function cannot be estimated directly. Since the observed data form a biased sample, using them naively would lead to incorrect results. To solve this problem, we need to incorporate a proper weight. Let \psi_{is} be the strata indicator: \psi_{is} = 1 if the ith subject is in the sth stratum and \psi_{is} = 0 otherwise. Let \xi_i be the sampling indicator: \xi_i = 1 if the ith subject is sampled and \xi_i = 0 otherwise. Let p_{n,s} = \tilde{n}_s / n_s be the inclusion probability, where n_s and \tilde{n}_s are the number of subjects in the sth stratum and the number of sampled subjects in the sth stratum, respectively. To adjust for the sampling bias, we again incorporate a weight h_i; the resulting weighted estimator S_2 can be expressed as[4]:

S_2(t) = \prod_{y_{(i)} < t} h_i \, p_i, \quad (2.10)

where the weight for the ith subject is[4]:

h_i = \sum_{s=1}^{S} \frac{\xi_i \psi_{is}}{p_{n,s}}. \quad (2.11)
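Since each subject belongs to exactly one stratum, the sum in (2.11) has a single nonzero term: a sampled subject's weight is the inverse of its own stratum's inclusion probability. A small sketch with hypothetical strata:

```python
import random

# h_i = sum_s xi_i * psi_is / p_{n,s}; with one stratum per subject this is
# simply xi_i / p_{n, s(i)}.
def stratified_weights(strata, sampled, incl_prob):
    return [xi / incl_prob[s] for s, xi in zip(strata, sampled)]

random.seed(4)
incl_prob = {1: 1.0, 2: 0.5, 3: 0.2}          # hypothetical sampling fractions
strata  = [random.choice([1, 2, 3]) for _ in range(10)]
sampled = [1 if random.random() < incl_prob[s] else 0 for s in strata]
h = stratified_weights(strata, sampled, incl_prob)
print(h)
```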
Chapter 3
Analysis
To assess the accuracy of the survival function estimators, we need to analyze the variances and confidence intervals of the estimated survival functions obtained from the different estimators.
3.1 Parametric Approach: Greenwood's Formula
Theorem 1 (Greenwood's formula). The variance of the survival function estimator is given by Greenwood's formula[5]:

\widehat{\mathrm{Var}}(\hat{S}(t)) = (\hat{S}(t))^2 \sum_{j=1}^{k} \frac{d_j}{n_j (n_j - d_j)}. \quad (3.1)

Since n_i \hat{p}_i \sim \mathrm{Binomial}(n_i, p_i), the variance of \hat{p}_i is given by

\mathrm{Var}(\hat{p}_i) = \frac{p_i (1 - p_i)}{n_i}. \quad (3.2)
To obtain the variance of \hat{S}(t), the KM estimator, we apply the delta method. First, taking logs, we find

\log \hat{S}(t_{(i)}) = \sum_{j=1}^{i} \log \hat{p}_j. \quad (3.3)
The delta method gives the approximation

\mathrm{Var}(f(X)) \approx (f'(X))^2 \mathrm{Var}(X). \quad (3.4)

In our case, \mathrm{Var}(\log \hat{p}_i) can be expressed as

\mathrm{Var}(\log \hat{p}_i) = \left(\frac{1}{p_i}\right)^2 \mathrm{Var}(\hat{p}_i) = \frac{1 - p_i}{n_i p_i}. \quad (3.5)
Since we assume the covariances of the \hat{p}_j's are zero, we find

\mathrm{Var}(\log \hat{S}(t_{(i)})) = \sum_{j=1}^{i} \frac{1 - p_j}{n_j p_j} = \sum_j \frac{d_j}{n_j (n_j - d_j)}. \quad (3.6)

Applying the delta method again yields the final result:

\mathrm{Var}(\hat{S}(t_{(i)})) = (\hat{S}(t_{(i)}))^2 \sum_j \frac{1 - p_j}{n_j p_j}. \quad (3.7)
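Formula (3.1) is easy to compute alongside the KM product itself; a brief sketch with made-up (n_j, d_j) counts:

```python
# Greenwood's formula (3.1): Var(S(t)) = S(t)^2 * sum_j d_j / (n_j * (n_j - d_j)).
def km_with_greenwood(counts):
    """counts: list of (n_j, d_j) pairs at the event times up to t."""
    s, total = 1.0, 0.0
    for n_j, d_j in counts:
        s *= (n_j - d_j) / n_j               # KM factor
        total += d_j / (n_j * (n_j - d_j))   # Greenwood summand
    return s, s * s * total

s, var = km_with_greenwood([(10, 1), (9, 2), (6, 1)])
print(s, var)  # S = 7/12 together with its Greenwood variance
```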
3.2 Nonparametric Approach: Bootstrap
In medical science we very often have to deal with small sample sizes, for several reasons: a common one is the rarity of an illness, or the difficulty of gathering patients with the same characteristics. Often we also have little information about the distributions of the samples, so classical statistical methods do not work for analyzing the variance. Facing these problems, the bootstrap turns out to be one of the best choices. Without assuming a distribution for the sample, such as a t- or normal distribution, we can still use bootstrap resampling to assess the estimated survival functions.
Suppose we draw N bootstrap samples and obtain N Kaplan-Meier estimates \hat{S}_j(t). As the bootstrap estimator of the survival function we take[6]

\hat{S}_B(t) = \frac{1}{N} \sum_{j=1}^{N} \hat{S}_j(t). \quad (3.8)
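A sketch of (3.8) and of the percentile intervals used later in Section 4.3 (illustrative Python on uncensored exponential data; `km_at` is a hypothetical helper re-implementing the product form):

```python
import random

# bootstrap the KM estimate at a fixed time t: resample (time, status) pairs
# with replacement, recompute the estimate, then average as in (3.8) and take
# rough percentile bounds.
def km_at(data, t):
    s, n = 1.0, len(data)
    for i, (ti, ev) in enumerate(sorted(data)):
        if ti >= t:
            break
        if ev:                       # event among the n - i still at risk
            s *= (n - i - 1) / (n - i)
    return s

random.seed(5)
data = [(random.expovariate(2.0), 1) for _ in range(200)]  # uncensored, rate 2
reps = sorted(km_at([random.choice(data) for _ in data], 0.3)
              for _ in range(200))
mean_s = sum(reps) / len(reps)       # bootstrap estimator S_B(0.3)
lo, hi = reps[4], reps[194]          # rough 2.5% and 97.5% quantiles
print(mean_s, lo, hi)                # true S(0.3) = exp(-0.6), about 0.55
```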
Chapter 4
Simulation
In this section, we aim to compare the weighted KM estimators with the unweighted KM estimators under two sampling schemes: (1) case-cohort sampling and (2) stratified sampling. Since our data set comes from a known distribution, comparing the estimates with the theoretical values demonstrates the improvement. All simulations are conducted in R.
4.1 Data Set
For convenience of evaluation, we generate the data randomly. First of all, we assume the event time T follows a Weibull distribution with parameters \lambda, k; correspondingly, the survival function is S(t) = e^{-(t/\lambda)^k}. As a first step, we generate a random vector, Weibull, of length n, each element randomly generated from a Weibull distribution with parameters \lambda, k. These values stand for the pre-assumed event times of the n subjects. In the same way, we randomly generate another vector of the same length, Censor, each element drawn from a uniform distribution U(0, C); these values stand for the censoring times of the corresponding n subjects. At each position of the two vectors, we take the minimum value to construct a new vector, Time, of the same length n. We use the values in Time as the observed times (the event time or the censoring time, whichever is smaller). At the same time, we record the censoring status in an indicator vector, Ind: at each position we record 0 if the observation is censored, and 1 otherwise. At this stage we have generated a raw data set. We then order the data set by Time in increasing order and permute Ind along with Time (Ind itself is not sorted independently). Now we have constructed a full cohort for survival analysis.
In R, we use the function survfit to estimate the survival function. The summary of the survival function looks like this:

time     n.risk  n.event  survival  standard error  lower 95% CI  upper 95% CI
0.00823  200     1        0.995     0.00499         0.985         1
0.06673  188     1        0.990     0.00724         0.976         1
0.10404  180     1        0.984     0.00905         0.967         1
0.12567  177     1        0.979     0.01057         0.958         1
0.14638  174     1        0.973     0.01191         0.95          0.997
0.16613  169     1        0.967     0.01316         0.942         0.993
Table 4.1: Summary for Survival Function Sample
4.2 Weighted Kaplan Meier Estimators
4.2.1 Kaplan Meier Estimator with Case-cohort Weights
When the sample is drawn by case-cohort sampling, the KM estimator without weights will be biased, regardless of the censoring rate. This is the main reason we investigate a weighted KM estimator.

Simulations were conducted to show that the weighted KM estimator removes this bias. As mentioned at the beginning of this chapter, we choose Weibull(1, 0.5) for the pre-assumed event times and, for comparison, uniform distributions U(0, 2), U(0, 1), and U(0, 0.8) to control the censoring rates of 20%, 40%, and 80%, respectively. For each simulation, we use a full cohort of size 1000, from which a sub-cohort of size 300 is selected by simple random sampling. We then include all the other cases that are not in the sub-cohort. The average case-cohort sample sizes are 600, 700, and 800 at the 20%, 40%, and 80% censoring rates, respectively. Both the unweighted and the weighted KM estimators are applied to this sampled group.
We repeated the process 1000 times. Having fixed time points every 0.01 from 0 to C, we recorded the survival function values at each time point in every replication. Over the 1000 replications, the average survival function value was taken at each time point, plotted, and compared with the true survival curve from Weibull(1, 0.5). Figure 4.1 shows the average of the 1000 replications. Our estimate, the green curve, overlaps the true curve almost everywhere, so it appears to be a very good estimate. We can also see that the unweighted KM estimates, the red curves, show significant bias in all cases, and the bias increases as the censoring rate increases.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.1: Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
4.2.2 Kaplan Meier Estimator with Stratified Sampling Weights
When given a population with stratified sampling, the KM estimator, without weights,
will be biased, regardless of the censoring rate. This is the main reason we are investi-
gating a weighted KM estimator.
Simulations were conducted to show weighted KM estimator removed such kind of
bias from unweighted KM estimator. As mentioned at the beginning of this chapter,
we choose Weibull(1, 0.5) for pre-assumed event time and for comparison we choose
uniform distributions, U(0, 2), U(0, 1) and U(0, 0.8) to control the censoring rate with
respect to 20%,40%,80%. For each single simulation, we choose full cohort with sample
size 1000.
We simulated two stratifying schemes: stratified by time and stratified by censoring status. For the first scheme, we divided the full cohort into three subsets by observation time: (0, 1/3), (1/3, 2/3), and (2/3, \infty). From each stratum, samples with sampling fractions 1, 0.5, and 0.2, respectively, were selected by simple random sampling, and all the sampled data were combined to form a stratified sample; the average stratified sample size is about 520. The second scheme uses the same approach: we divided the full cohort into two subsets according to censoring status, with sampling fraction 1 in the stratum of censored subjects and 0.2 in the stratum of uncensored subjects. Simulations for both stratifying schemes were conducted at censoring rates of 20%, 40%, and 80%, and both the unweighted and the weighted KM estimators were applied to the sampled groups.
We repeated the process 1000 times. Having fixed time points every 0.01 from 0 to C, we recorded the survival function values at each time point in every replication. Over the 1000 replications, the average survival function value was taken at each time point, plotted, and compared with the true survival curve from Weibull(1, 0.5). Figure 4.2 shows the average of the 1000 replications. Our estimate, the green curve, overlaps the true curve almost everywhere, so it appears to be a very good estimate. We can also see that the unweighted KM estimates, the red curves, show significant bias in all cases, and the bias increases as the censoring rate increases.
The following results were obtained when stratifying by observation time.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.2: Weighted KM Estimator (Stratified Sampling by Time) with Different Censoring Rates
The simulations below stratify by censoring status, again averaged over 1000 replications.
[Figure: three panels plotting S(t) against time, titled "Weighted KM Estimation with 20% Censoring Rate", "Weighted KM Estimation with 40% Censoring Rate", and "Weighted KM Estimation with 80% Censoring Rate".]
Figure 4.3: Weighted KM Estimator (Stratified Sampling by Censoring Status) with Different Censoring Rates
4.3 Bootstrap
In this section, we applied Bootstrap to sampled data sets with the Bootstrap number,
200. In order to get rid of occasional results, we take the average of 200 replications.
By obtaining the quantiles of 97.5% and 2.5% from 200 survival function values at each
time point, 95% confidence interval(C.I.) of our estimations are constructed. Figure
4.4, Figure 4.5 and Figure 4.6 show that the theoretical curve locates in the 95% C.I.
of our estimations. In each figure, green line is the 95% upper bound and blue line
is the 95% lower bound. Black lines, the theoretical curves, are nearly overlapped
almost everywhere by the red lines, which are bootstrap estimation from weighted KM
estimators.
[Figure: three panels plotting S(t) against time, titled "Case-cohort Sampling with 20% Censoring Rate", "Case-cohort Sampling with 40% Censoring Rate", and "Case-cohort Sampling with 80% Censoring Rate".]
Figure 4.4: Bootstrap 95% C.I. of Weighted KM Estimator (Case-cohort Sampling) with Different Censoring Rates
[Figure: three panels plotting S(t) against time, titled "Stratified by Censoring Status with 20% Censoring Rate", "Stratified by Censoring Status with 40% Censoring Rate", and "Stratified by Censoring Status with 80% Censoring Rate".]
Figure 4.5: Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Censoring Status) with Different Censoring Rates
[Figure: three panels plotting S(t) against time, titled "Stratified by Time with 20% Censoring Rate", "Stratified by Time with 40% Censoring Rate", and "Stratified by Time with 80% Censoring Rate".]
Figure 4.6: Bootstrap 95% C.I. of Weighted KM Estimator (Stratified by Time) with Different Censoring Rates
Chapter 5
Conclusion
From the simulation results, we can see that our weighted KM estimators are significantly closer to the theoretical values than the unweighted KM estimator at every censoring level. Indeed, averaged over 1000 replications, the weighted KM estimators overlapped the expected theoretical curve almost everywhere, while the unweighted KM estimators were clearly biased. The bootstrap gives further evidence: the theoretical curve lies within the 95% confidence intervals of our weighted KM estimators. Thus, we conclude that the weighted KM estimators for case-cohort sampling and stratified sampling give more accurate estimates of the survival function than the unweighted KM estimator.
Although the unweighted KM estimator often works well for unbiased sampling schemes, we suggest double-checking the sampling scheme before applying it to the data. For a biased sampling scheme, a proper weighted KM estimator should be set up. This article provided weights for only two special cases, case-cohort sampling and stratified sampling, with three distinct censoring rates. To reach more general conclusions, further exploration is needed under other settings, such as theoretical distributions other than the Weibull, or other biased sampling schemes.
References
[1] Borgan, O. 2005. Kaplan-Meier Estimator. Encyclopedia of Biostatistics.
[2] Lohr, Sharon. 2009. Sampling: Design and Analysis. Cengage Learning.
[3] Chiou, S., Kang, S., and Yan, J. 2014. Fast Accelerated Failure Time Modeling for Case-Cohort Data. Statistics and Computing, 24(4):559-568.
[4] Chiou, S., Kang, S., and Yan, J. 2014+. Semiparametric Accelerated Failure Time Modeling for Clustered Failure Times from Stratified Sampling. Journal of the American Statistical Association, forthcoming.
[5] Jenkins, Stephen P. 2005. Survival Analysis. Unpublished manuscript, Institute for Social and Economic Research, University of Essex, Colchester, UK.
[6] Efron, Bradley, and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
Appendix A
R Codes
## Stratified sampling (strata based on time)
library(survival)  ## for survfit() and Surv()

simDat <- function(n, C, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the three strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  ## straID <- ifelse(delta == 1, 1, ifelse(T < 2/3, 2, 3))
  straID <- ifelse(T < 1/3, 1, ifelse(T < 2/3, 2, 3))
  sampInd <- unlist(sapply(1:3, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  ## the draws are generated stratum by stratum; since the strata are time
  ## intervals, sorting by T maps them back to the original rows
  sampInd[order(T)] <- sampInd
  hi[order(T)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
ttmp <- seq(0, 2, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 2, c(1, 0.5, 0.2))
  ## without weights
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = sampInd == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with inverse-probability weights
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = sampInd == 1)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "Weighted KM Estimation with 20% Censoring Rate")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
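In both fits above, survfit evaluates the product-limit formula with every death count and risk-set size replaced by its hi-weighted total. As a language-neutral illustration of that formula (a minimal Python sketch for exposition only; the helper name weighted_km is ours, not part of the R appendix):

```python
def weighted_km(times, events, weights):
    """Weighted product-limit (Kaplan-Meier) estimator.

    S(t) = product over event times u <= t of
           (1 - weighted deaths at u / weighted number at risk at u).
    Returns a list of (event_time, survival) pairs.
    """
    data = sorted(zip(times, events, weights))
    surv = 1.0
    curve = []
    for u in sorted({t for t, e, _ in data if e == 1}):
        at_risk = sum(w for t, _, w in data if t >= u)             # weighted risk set
        deaths = sum(w for t, e, w in data if t == u and e == 1)   # weighted events
        surv *= 1.0 - deaths / at_risk
        curve.append((u, surv))
    return curve
```

With all weights equal, the sketch reduces to the ordinary Kaplan-Meier estimator, which is the sense in which the unweighted fit is a special case; rescaling all weights by a constant leaves the curve unchanged.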
## Bootstrap
library(survival)
library(boot)
library(Hmisc)  ## for bootkm()
n <- 200
Censor <- runif(n, 0, 1)
Weibull <- rweibull(n, 1, 5)
time <- pmin(Censor, Weibull)
ind <- ifelse(Censor <= Weibull, 0, 1)
temp <- data.frame(cbind(time, ind)[order(time), ])
survobj <- Surv(temp$time, temp$ind)
S <- rep(1, n)
SE <- rep(1, n)
for (i in 1:length(S)) {
  ## draw one set of bootstrap replicates of S(t_i) and use it for both
  ## the point estimate and the standard error
  boots <- bootkm(survobj, B = 100, times = temp$time[i], pr = TRUE)
  S[i] <- mean(boots)
  SE[i] <- sd(boots)
}
plot(temp$time, S, xlab = "Time", ylab = "Survival Function",
     main = "Bootstrap Estimation", type = "p")
lines(seq(0, max(temp$time), 0.01),
      pweibull(seq(0, max(temp$time), 0.01), 1, 5, lower.tail = FALSE),
      col = 2)
## Stratified sampling (strata based on censoring status)
simDat <- function(n, C, a, b, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the two strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, a, b)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  straID <- delta
  sampInd <- unlist(sapply(1:2, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  ## the draws are generated stratum by stratum; the strata are defined by
  ## delta here, so order by straID (not by T) to return to the original rows
  sampInd[order(straID)] <- sampInd
  hi[order(straID)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
ttmp <- seq(0, 2, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 2, 1, 0.5, c(1, 0.2))
  ## without weights
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = sampInd == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with inverse-probability weights
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = sampInd == 1)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "Weighted KM Estimation with 80% Censoring Rate")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
## Case-cohort sampling
n <- 200
Censor <- runif(n, 0, 1)
Weibull <- rweibull(n, 1, 0.5)
time <- pmin(Censor, Weibull)
ind <- ifelse(Censor <= Weibull, 0, 1)
temp <- data.frame(cbind(time, ind)[order(time), ])
subinx <- sample(1:length(time), n/3, replace = FALSE)
temp$subcohort <- 0
temp$subcohort[subinx] <- 1
pn <- table(temp$subcohort)[[2]] / sum(table(temp$subcohort))
## cases get weight 1; sampled censored subjects get weight 1/pn
temp$hi <- temp$ind + (1 - temp$ind) * temp$subcohort / pn

simDat <- function(n, C, a, b) {
  ## C controls the censoring rate;
  ## a random fifth of the cohort forms the subcohort
  cen <- runif(n, 0, C)
  Y <- rweibull(n, a, b)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  subinx <- sample(1:length(T), n/5, replace = FALSE)
  subcohort <- rep(0, length(T))
  subcohort[subinx] <- 1
  pn <- table(subcohort)[[2]] / sum(table(subcohort))
  hi <- delta + (1 - delta) * subcohort / pn
  data.frame(T = T, delta = delta, subcohort = subcohort, hi = hi)
}
ttmp <- seq(0, 1, 0.01)  ## dummy time vector, upper bound at C
sv1 <- sv2 <- matrix(0, ncol = length(ttmp), nrow = 1000)
for (i in 1:1000) {
  dat <- simDat(1000, 1, 1, 5)
  ## without weights, subcohort only
  fit1 <- survfit(Surv(T, delta) ~ 1, data = dat, subset = subcohort == 1)
  sv1[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  ## with case-cohort weights over all cases plus the subcohort (hi > 0)
  fit2 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = dat, subset = hi > 0)
  sv2[i, ] <- sapply(ttmp, function(x) c(1, fit2$surv)[max(which(x >= c(0, fit2$time)))])
}
plot(ttmp, pweibull(ttmp, 1, 5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)")
lines(ttmp, apply(sv1, 2, mean), col = 2)  ## clearly biased
lines(ttmp, apply(sv2, 2, mean), col = 3)
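The weight line hi <- delta + (1 - delta) * subcohort / pn encodes the case-cohort design: every case is always observed and keeps weight 1, while a censored subject is observed only if drawn into the subcohort, so it stands in for 1/pn censored subjects. A minimal sketch of just this rule (illustrative Python; case_cohort_weights is a hypothetical helper, not thesis code):

```python
def case_cohort_weights(delta, subcohort, pn):
    """Case-cohort weights: each case (delta == 1) gets weight 1; a censored
    subject gets weight 1/pn if sampled into the subcohort, else 0."""
    return [d + (1 - d) * s / pn for d, s in zip(delta, subcohort)]
```

Subjects with weight 0 (censored and outside the subcohort) contribute nothing to the weighted fit and can simply be dropped from it.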
## Bootstrap
library(bootstrap)
library(survival)
simDatStr <- function(n, C, p) {
  ## C controls the censoring rate;
  ## p (vector) controls the inclusion probabilities in the three strata
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  ## straID <- ifelse(delta == 1, 1, ifelse(T < 2/3, 2, 3))  ## stratified by censoring status
  straID <- ifelse(T < 1/3, 1, ifelse(T < 2/3, 2, 3))  ## stratified by time
  sampInd <- unlist(sapply(1:3, function(x) rbinom(table(straID)[x], 1, p[x])))
  hi <- sampInd / rep(p, table(straID))
  sampInd[order(T)] <- sampInd
  hi[order(T)] <- hi
  data.frame(T = T, delta = delta, straID = straID, sampInd = sampInd, hi = hi)
}
simDatCase <- function(n, C) {
  ## C controls the censoring rate;
  ## a random third of the cohort forms the subcohort
  cen <- runif(n, 0, C)
  Y <- rweibull(n, 1, 0.5)
  T <- pmin(cen, Y)
  delta <- ifelse(T < cen, 1, 0)
  temp <- data.frame(cbind(T, delta)[order(T), ])
  subinx <- sample(1:length(T), n/3, replace = FALSE)
  temp$subcohort <- 0
  temp$subcohort[subinx] <- 1
  pn <- table(temp$subcohort)[[2]] / sum(table(temp$subcohort))
  temp$hi <- temp$delta + (1 - temp$delta) * temp$subcohort / pn
  data.frame(T = temp$T, delta = temp$delta, hi = temp$hi)
}
## Bootstrap
nrep <- 200   ## replication times
nboot <- 200  ## bootstrapping times
ttmp <- seq(0, 1, 0.01)  ## dummy time vector, upper bound at C
sv0 <- svBU <- svBL <- matrix(0, ncol = length(ttmp), nrow = nrep)
for (m in 1:nrep) {
  dat <- simDatCase(1000, 1)  ## change C to control the censoring rate (2: 20%, 1: 40%, 0.2: 80%)
  fit0 <- survfit(Surv(T, delta) ~ 1, data = dat, weights = hi)
  sv0[m, ] <- sapply(ttmp, function(x) c(1, fit0$surv)[max(which(x >= c(0, fit0$time)))])
  sample1 <- subset(dat, hi > 0)  ## the observed case-cohort sample
  svB <- matrix(0, ncol = length(ttmp), nrow = nboot)
  for (i in 1:nboot) {
    ## resample rows with replacement, keeping each row's weight
    subinx <- sample(1:nrow(sample1), nrow(sample1), replace = TRUE)
    resample <- sample1[subinx, ]
    fit1 <- survfit(Surv(T, delta) ~ 1, weights = hi, data = resample)
    svB[i, ] <- sapply(ttmp, function(x) c(1, fit1$surv)[max(which(x >= c(0, fit1$time)))])
  }
  for (k in 1:length(ttmp)) {
    svBU[m, k] <- quantile(svB[, k], 0.975)  ## 95% upper bound
    svBL[m, k] <- quantile(svB[, k], 0.025)  ## 95% lower bound
  }
}
plot(ttmp, pweibull(ttmp, 1, 0.5, lower.tail = FALSE), type = "l",
     xlab = "time", ylab = "S(t)",
     main = "95% C.I. from Bootstrap of Case-cohort Sampling with 40% Censoring Rate")
lines(ttmp, apply(sv0, 2, mean), col = 2)
lines(ttmp, apply(svBU, 2, mean), col = 3)
lines(ttmp, apply(svBL, 2, mean), col = 4)
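The inner loop above is the percentile bootstrap: each of the nboot resamples yields a survival curve, and the 2.5% and 97.5% empirical quantiles of the replicates at each time point form the pointwise interval. The generic recipe, sketched in Python for a scalar statistic (an illustrative helper under our own naming, not the thesis code):

```python
import random

def percentile_ci(stat, sample, nboot=200, alpha=0.05, seed=1):
    """Percentile bootstrap interval: resample the data with replacement,
    recompute the statistic, and take the alpha/2 and 1 - alpha/2
    empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(sample) for _ in sample])
                  for _ in range(nboot))
    lo = reps[int((alpha / 2) * (nboot - 1))]
    hi = reps[int((1 - alpha / 2) * (nboot - 1))]
    return lo, hi
```

The R code applies this recipe once per grid point of ttmp, with the weighted KM estimate at that time playing the role of stat.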