Survival Analysis in Medical Research
By Qamruz Zaman1, Karl P Pfeiffer2,
1Department of Statistics, University of Peshawar, Pakistan
2Department of Medical Statistics, Informatics and Health Economics,
Medical University Innsbruck
Abstract
Over the last few decades, special attention has been given to the field of survival analysis.
Survival analysis techniques play an important part in different areas of research. The
purpose of this article is not to elaborate on its uses in different fields but to describe some
of the frequently used concepts of survival analysis in medical research. Nonparametric
techniques of survival analysis (the Kaplan-Meier method and the logrank test) are the most
popular because they are simple and require few distributional assumptions. For multivariate
analysis, if the proportional hazards assumption is satisfied, the semi-parametric Cox
proportional hazards model is used to identify risk factors, while if the hazards are not
proportional, a time-dependent regression model is applied to the data set.
Furthermore, the hazard functions of commonly used survival distributions are described.
Some areas that are underrated owing to a lack of software are also discussed.
Key words: Censoring, Cox regression model, Failure time, Kaplan-Meier survival
function, Logrank test, Proportional hazards assumption
Corresponding author:
Qamruz Zaman E-mail: ayanqamar@gmail.com
Editor: Ghorai, Jugal K. jugal@csd.uwm.edu
1. Introduction
Nowadays, survival analysis techniques have become important tools for analyzing
data from medicine, engineering, marketing, etc. Although the basic
functions of survival analysis are the same in all fields, the method is known by different names:
in engineering it is called reliability analysis, in sociology event history analysis,
in economics duration analysis, and in medical research survival analysis. Survival
analysis comprises the models and methods used for analyzing lifetime data.
One of the common uses of survival analysis in clinical trials is the comparison of survival
times under different treatments for a fatal disease; a demographer can use the technique
to study the length of working hours of a group of people or the duration of marriage.
In an open book exam, the examiner can use survival analysis to measure the number of
hours taken to complete the paper. In engineering, one use of survival analysis is to study
the waiting time to failure of an item. In economics, we may study the survival of a new business.
Survival analysis differs from other statistical procedures for the following reasons.
1. In survival analysis, the response variable is always time.
2. Staggered entries are common in medical research. By staggered entries we
mean that the individuals in the study do not all have the same entry time. This
does not affect the survival analysis, as the analysis deals with the length of the
observation time and does not depend on a common entry time.
3. The assumption of normality does not hold in survival analysis, as survival data are
generally skewed. The commonly used distributions in survival analysis are the
exponential, Weibull, lognormal, gamma, log-logistic, etc.
4. Observations may be censored, which may affect the estimate of the hazard function.
5. Covariates may be time dependent.
A data sample is said to be censored when the value of the variable is not observed for
some of the items in the sample. In medical studies, the actual time of death of some
subjects may not be recorded for many reasons, e.g. they move away or the time allocated for
the study elapses before the event occurs [1]. One of the main reasons for censoring is the
limited duration of the study period.
Survival analysis is a very vast field, which cannot be covered in a single article. The
main aim of this article is to avoid technical terms as much as
possible and to describe only important and commonly used topics. This should be
helpful not only for experienced researchers but also for emerging ones. Although some work
has already been done in this direction [2-7], most of these papers either describe
specific topics or are very technical, not only for medical practitioners but sometimes also
for statisticians who have less knowledge or experience of the subject.
Some of the books covering the concept of survival analysis are Modelling Survival Data
in Medical Research [8], Statistical Models Based on Counting Processes [9], Analysis of
Survival Data [10], Survival Analysis [11], Analysing Survival Data from clinical trials
and Observational Studies [12] and Survival analysis with Long-term Survivors [13].
Survival analysis is based on the time until an event occurs. Time may be measured in hours, days,
weeks, months or years from the beginning of follow-up until an event occurs. Time is a
positive real-valued variable with a continuous distribution. Examples of survival time
are
• Time from the diagnosis of a disease to the development of a disease
• Time to death
• Length of stay in a jail/ hospital/ school
• HIV viral load measurements
The event may be recovery from a disease, death, etc. In medical research three
kinds of survival analysis technique (parametric, semi-parametric and nonparametric) are
mainly used to obtain estimates of survival probabilities and of the mean or median survival time.
These methods are also used for the comparison of two or more treatments or procedures.
Similarly, multivariate survival analysis procedures are used to identify risk
factors.
Survival time is described by three functions:
1.1. Cumulative Proportion Surviving (Survival Function) S(t)
Let the survival time (a random variable) be denoted by T. The survival function, denoted by
$S(t)$, is defined as the probability that an individual survives longer than t:
$$S(t) = P(\text{an individual survives longer than } t) = P(T > t)$$
$$S(t) = 1 - P(\text{an individual fails before } t) = 1 - F(t)$$
where $F(t)$ is the distribution function of T. The range of $S(t)$ is from 0 to 1, i.e.
$0 \le S(t) \le 1$. The graph of the estimated survival function is a step
function and is called the survival curve. At time zero, $S(t)$ takes its maximum value 1,
and if the last observed time is an event time, the estimated $S(t)$ reaches its minimum value zero.
1.2. Probability Density Function
The probability density function of failure time data is defined as
$$f(t) = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t}$$
$$f(t) = \frac{dF(t)}{dt} = \frac{d}{dt}\bigl(1 - S(t)\bigr) = -\frac{d}{dt}S(t) = -S'(t)$$
The probability density function is also known as the unconditional failure rate.
1.3. Hazard Function:
The hazard function measures the probability of failure during a very small interval,
given that the individual has survived to the beginning of the interval. It is defined as
$$h(t) = \lim_{\Delta t \to 0} \frac{P(\text{an individual who survives to time } t \text{ fails in } (t, t + \Delta t))}{\Delta t}$$
The function is also known as the instantaneous failure rate, force of mortality, conditional
failure rate and age-specific failure rate. The hazard function is a rate rather than a
probability: it is non-negative but need not lie between 0 and 1. The function is commonly
used for identifying a model, such as the exponential, Weibull or gamma curve, that fits
one's data. A survival model is usually expressed in terms of the hazard function.
The cumulative hazard function is defined as
$$H(t) = \int_0^t h(u)\,du, \qquad h(t) = \frac{d}{dt}H(t)$$
There exist relations among S(t), h(t) and f(t). By definition
$$h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t)$$
$$H(t) = -\log S(t)$$
Equivalently,
$$S(t) = \exp\bigl(-H(t)\bigr) = \exp\left(-\int_0^t h(u)\,du\right)$$
Thus, given any one of the three functions S(t), h(t) and f(t), the others can be derived.
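The relations among S(t), h(t) and f(t) can be checked numerically. The sketch below is only an illustration: it assumes a Weibull-type hazard as a convenient test case, and the function names and parameter values are hypothetical choices, not part of the original article. It integrates the hazard to obtain H(t), recovers S(t) = exp(-H(t)), and forms f(t) = h(t) S(t).

```python
import math

def weibull_survival(t, lam, p):
    """S(t) = exp(-(lam*t)**p), used here only as a test case."""
    return math.exp(-((lam * t) ** p))

def weibull_hazard(t, lam, p):
    """h(t) = lam * p * (lam*t)**(p-1), the hazard matching the S(t) above."""
    return lam * p * (lam * t) ** (p - 1)

def cumulative_hazard(hazard, t, steps=10_000):
    """H(t) = integral of h(u) from 0 to t, approximated by the midpoint rule."""
    width = t / steps
    return sum(hazard((k + 0.5) * width) for k in range(steps)) * width

# Hypothetical parameter values chosen for illustration only.
lam, p, t = 0.15, 2.0, 4.0

H = cumulative_hazard(lambda u: weibull_hazard(u, lam, p), t)
S_from_hazard = math.exp(-H)                      # S(t) = exp(-H(t))
S_direct = weibull_survival(t, lam, p)            # closed form, for comparison
f_direct = weibull_hazard(t, lam, p) * S_direct   # f(t) = h(t) * S(t)

print(S_from_hazard, S_direct, f_direct)
```

The two survival values agree up to numerical integration error, which is the sense in which any one of the three functions determines the other two.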
The characteristics of the survival distribution are used for estimating the mean and median
survival time and survival probabilities. Several methods have been developed for comparing
the survival times of different groups, e.g. the survival times of treatment and placebo
groups or of male and female thalassaemia patients. Similarly, for identifying potential risk
or prognostic factors, regression methods using semi-parametric and parametric approaches
have been developed. Survival analysis thus combines three commonly used families of
techniques: nonparametric, semi-parametric and parametric. Here, we briefly describe some
aspects of these techniques.
2. Nonparametric approach
The simplest approach, and the one with the fewest assumptions, is the nonparametric approach.
Nonparametric methods are often easier to understand than
parametric methods. Furthermore, nonparametric analyses are more widely used in
situations where there is doubt about the exact form of the distribution. Among the
nonparametric methods, the most popular and commonly used is the Kaplan-Meier method [14].
The Kaplan-Meier (KM) estimator is also called the nonparametric maximum likelihood
estimator. It is used for estimating survival probabilities. The method is a modified
form of the life table technique, with the condition that each time interval contains
exactly one event and the event occurs at the beginning of the interval. For large samples,
the estimator is approximately normally distributed with mean S(t), and its variance is
estimated by Greenwood's formula. The estimator is defined as follows.
Let $x_1, x_2, \dots, x_n$ be independent and identically distributed survival times with
distribution function F(x) and let G(c) be the distribution function of the independent and
identically distributed censoring times $c_1, c_2, \dots, c_n$. Further, $x_i$ and $c_i$ are
assumed to be independent. Let $t_i = \min\{x_i, c_i\}$ be the observed survival time and let
$\delta_i = I(x_i \le c_i)$ indicate whether the survival time is an event ($\delta_i = 1$) or
censored ($\delta_i = 0$). Let the number of individuals who are alive just before
time $t_i$, including those who are about to die at this time, be denoted by $n_i$, and let
$d_i$ denote the number who die at this time. The Kaplan-Meier estimator is defined as
$$\hat{S}_{KM}(t) = \prod_{i:\, t_i \le t} \left(\frac{n_i - d_i}{n_i}\right)^{\delta_i} \qquad (1)$$
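The product-limit calculation in equation (1) can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `kaplan_meier` and the toy data are assumptions made here, and subjects sharing an observed time are grouped so that tied events are handled in a single factor.

```python
def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct event time.

    times  : observed times t_i = min(x_i, c_i)
    events : delta_i, 1 if the event was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects that share the same observed time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by (n_i - d_i) / n_i only at event times.
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set after t.
        n_at_risk -= leaving
    return curve

# Hypothetical toy data: six subjects, two of them censored.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 4))
```

In this toy data the last observed time is an event, so the estimated curve drops to zero at the end; changing the last indicator to 0 leaves the curve above zero.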
An important assumption of the Kaplan-Meier survival function is that the distribution
of censoring times is independent of the exact survival times. The function is a step
function that changes only at event times. If the last observed time is an event time, the
Kaplan-Meier estimate for all later times is 0. But if the last observed time is a censored
time, the Kaplan-Meier estimate remains greater than 0. In this case one cannot calculate
the mean survival time and chooses the median instead, which is common in practice for
skewed data. Another reason for the unpopularity of the mean survival time is that the right
tail has a marked influence on the mean, while censoring often makes it difficult to estimate
the right tail. Similarly, if the last observation is censored and the estimated survival
curve never falls to 50% or below, it is not possible to calculate the median survival time.
A typical pattern of the Kaplan-Meier survival curve is shown in Figure 1.
Kaplan and Meier gave an approximate variance formula for their estimator, which they
attributed to Greenwood [15]. The Greenwood variance estimator for the
Kaplan-Meier survival function is
$$\widehat{Var}_{Gr}\bigl[\hat{S}_{KM}(t)\bigr] = \hat{S}_{KM}^{2}(t) \sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)} \qquad (2)$$
Greenwood's formula is accurate for large samples, but concerns have been raised about its
small-sample performance. In particular, it tends to underestimate the true variance,
especially in the right tail of the survival distribution [16]. Furthermore, it conveys little
intuition about how the variance of the survival function changes, especially when
censoring occurs. To overcome this problem, Peto [16] introduced the following estimator,
$$\widehat{Var}_{P}\bigl[\hat{S}_{KM}(t)\bigr] = \frac{\hat{S}_{KM}^{2}(t)\,\bigl[1 - \hat{S}_{KM}(t)\bigr]}{n_i} \qquad (3)$$
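Greenwood's formula (2) and Peto's formula (3) can be computed alongside the Kaplan-Meier estimate. The following is a hedged sketch: the function name and toy data are hypothetical, and n_i in Peto's formula is taken here to be the number at risk just before the event time, which is an assumption about the convention intended.

```python
def km_with_variances(times, events):
    """Return rows (t, S_hat, greenwood_var, peto_var) at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    gw_sum = 0.0  # running sum of d_i / (n_i * (n_i - d_i))
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= (n_at_risk - deaths) / n_at_risk
            # When everyone at risk dies, S_hat is 0 and the Greenwood term
            # would divide by zero; it is skipped because S_hat**2 = 0 anyway.
            if n_at_risk > deaths:
                gw_sum += deaths / (n_at_risk * (n_at_risk - deaths))
            greenwood = surv ** 2 * gw_sum
            peto = surv ** 2 * (1.0 - surv) / n_at_risk
            rows.append((t, surv, greenwood, peto))
        n_at_risk -= leaving
    return rows

# Hypothetical toy data: six subjects, two of them censored.
rows = km_with_variances([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
for t, s, gw, pt in rows:
    print(t, round(s, 4), round(gw, 5), round(pt, 5))
```

Printing both columns side by side makes it easy to see how the two estimators diverge as the risk set shrinks in the right tail.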
However, it has also been claimed that Peto's estimator underestimates the true variance,
especially in the left and right tails of the survival distribution [17] and with heavy
censoring [18]. To deal with the effect of censoring on the variance, Zhao [19] proposed a
homogeneous variance estimator and Borkowf [20] an adjusted hybrid variance
estimator. Owing to the lack of software for these newer estimators, however, Greenwood's
and Peto's estimators remain the ones commonly used in practice.
2.1 Proportional Hazards Assumption
The most important assumption for the logrank test and the Cox regression model is the
proportional hazards assumption. We illustrate the concept with the help of an example.
Consider the leukaemia data set of Freireich et al. [21], which was also used
completely or partially by Kleinbaum [22], Gehan [23], Rossa and Zieliński [24] and
Borkowf [20]. The data set consists of the remission times in weeks of two groups,
treatment and placebo, each having 21 patients. The placebo group is free of
censoring, while in the treatment group 9 events occurred during the research period and
the remaining 12 observations were censored. The survival plots of the two groups are
presented in Figure 2.
Figure 2 shows that the two curves appear "parallel", meaning that the
vertical distance between the two curves is roughly constant over time. Since the two curves
do not cross at any point, the assumption of proportional hazards for the two groups appears
reasonable. Instead of plotting survival functions, one can also check the assumption by
plotting $\log\{-\log[S(t)]\}$ against $\log t$ for each group.
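Why the log(-log) plot is informative can be seen in one line. As a sketch, assume the hazards of the two groups are proportional with a constant ratio written here as $e^{\beta}$ (the symbol $\beta$ and the group subscripts are notation assumed for this illustration), and use $H(t) = -\log S(t)$ from Section 1.3:

```latex
\begin{aligned}
h_2(t) = e^{\beta} h_1(t)
  &\;\Longrightarrow\; H_2(t) = e^{\beta} H_1(t)
   \;\Longrightarrow\; S_2(t) = S_1(t)^{e^{\beta}} \\
  &\;\Longrightarrow\; \log\{-\log S_2(t)\} = \beta + \log\{-\log S_1(t)\}
\end{aligned}
```

Under proportional hazards the two curves of log{-log[S(t)]} therefore differ by the same constant at every t, so roughly parallel curves on this scale support the assumption.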
2.2 Comparison of Survival Distributions
The Kaplan-Meier survival function is used for estimating and drawing a single survival
curve, but there are many situations in which we want to compare more than one curve,
e.g. the survival curves of patients receiving different treatments or of male and female
smokers.
Comparison of two or more survival distributions (curves) is common practice in
medical research. Several methods are used for comparing survival distributions, of
which the most commonly used rank-based tests are the logrank test, also called the
Mantel-Cox test [25], and Gehan's test, also referred to as the Wilcoxon test [23, 26].
Specially designed tests are necessary for censored data; without censoring, the
nonparametric Mann-Whitney U test or Wilcoxon rank sum test can be used.
The logrank statistic, like many other $\chi^2$ tests, compares observed with expected events.
This can be explained by considering two groups, I and II. Let
$t_{(1)} < t_{(2)} < \dots < t_{(r)}$ be the r distinct death times in the two groups combined.
At time $t_{(j)}$, let $d_{1j}$ and $d_{2j}$ be the numbers of deaths in Groups I and II,
respectively, and let $n_{1j}$ and $n_{2j}$ be the numbers of persons at risk
just prior to time $t_{(j)}$. All these can be summarized in Table 1.
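The logrank test statistic is
$$\chi^2 = \frac{\left(\sum_{j=1}^{r} \bigl(d_{1j} - Ex_{1j}\bigr)\right)^{2}}{Var} \sim \chi^2(1) \qquad (4)$$
where $Ex_{1j} = n_{1j} d_j / n_j$ is the mean of the hypergeometric random variable $d_{1j}$,
with $d_j = d_{1j} + d_{2j}$ and $n_j = n_{1j} + n_{2j}$, and the variance of $d_{1j}$ is given as
$$Var(d_{1j}) = \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^{2}\,(n_j - 1)}, \quad \text{so} \quad Var = \sum_{j=1}^{r} Var(d_{1j})$$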
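The observed-versus-expected bookkeeping behind statistic (4) is easy to lose in the notation, so a minimal sketch may help. The function name and the toy data below are assumptions for illustration; the code follows the hypergeometric mean and variance given above, pooling the two groups at each distinct death time.

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-group logrank chi-square statistic (1 degree of freedom)."""
    death_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in death_times:
        n1 = sum(1 for x in times1 if x >= t)   # at risk in Group I
        n2 = sum(1 for x in times2 if x >= t)   # at risk in Group II
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e)
        n, d = n1 + n2, d1 + d2                 # pooled totals n_j and d_j
        observed_minus_expected += d1 - n1 * d / n
        if n > 1:                               # variance term is 0/0 when n = 1
            variance += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    return observed_minus_expected ** 2 / variance

# Hypothetical toy data with no censoring, for illustration only.
group1_times, group1_events = [1, 3, 5], [1, 1, 1]
group2_times, group2_events = [2, 4, 6], [1, 1, 1]
stat = logrank_statistic(group1_times, group1_events, group2_times, group2_events)
print(stat)
```

The returned value would be compared with the chi-square distribution with one degree of freedom, as stated in (4).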
A large value of $\chi^2$ leads to the conclusion that the two treatments are not equally
effective. Some of the properties of the logrank test were studied by Gill [27]. The test is
more appropriate, powerful and reliable than the other tests in situations
where the survival curves do not cross, i.e. where the hazard functions are
proportional or the "odds ratios" are constant over time intervals. In the case of crossing
curves, weighted tests such as Gehan's Wilcoxon test are more powerful. The Wilcoxon test
puts more weight on early events than the logrank test and is defined as
$$\chi^2_G = \frac{\left(\sum_{j=1}^{r} n_j \bigl(d_{1j} - Ex_{1j}\bigr)\right)^{2}}{\sum_{j=1}^{r} n_j^{2}\, Var(d_{1j})} \sim \chi^2(1) \qquad (5)$$
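Statistic (5) differs from the logrank statistic only in the weight attached to each death time. The sketch below makes that explicit by taking the weight as a function of the number at risk; the function name, the `weight` parameter and the toy data are assumptions introduced here. With the weight equal to n_j it gives Gehan's statistic, and with a constant weight of 1 it reduces to the logrank statistic.

```python
def weighted_logrank(times1, events1, times2, events2, weight=lambda n: n):
    """Weighted two-group test; the default weight n_j gives Gehan's statistic."""
    group1 = list(zip(times1, events1))
    pooled = group1 + list(zip(times2, events2))
    death_times = sorted({t for t, e in pooled if e})
    numerator = 0.0
    denominator = 0.0
    for t in death_times:
        n1 = sum(1 for x, _ in group1 if x >= t)          # at risk in Group I
        n = sum(1 for x, _ in pooled if x >= t)           # at risk overall
        d1 = sum(1 for x, e in group1 if x == t and e)    # deaths in Group I
        d = sum(1 for x, e in pooled if x == t and e)     # deaths overall
        w = weight(n)
        numerator += w * (d1 - n1 * d / n)
        if n > 1:
            var = n1 * (n - n1) * d * (n - d) / (n ** 2 * (n - 1))
            denominator += w ** 2 * var
    return numerator ** 2 / denominator

# Hypothetical toy data with no censoring, for illustration only.
g1_times, g1_events = [1, 3, 5], [1, 1, 1]
g2_times, g2_events = [2, 4, 6], [1, 1, 1]
gehan = weighted_logrank(g1_times, g1_events, g2_times, g2_events)
plain = weighted_logrank(g1_times, g1_events, g2_times, g2_events,
                         weight=lambda n: 1.0)
print(gehan, plain)
```

Because the risk set is largest at early death times, the weight n_j emphasises early differences between the groups, which is the property described in the text.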
Besides the Wilcoxon test, there are other weighted tests for non-proportional
hazards, e.g. the Tarone-Ware test [28]. These tests can also be used for comparing
more than two groups. The behaviour of these tests on two published examples is
discussed next. Table 2 gives the values of the logrank, Wilcoxon and Tarone-Ware tests and
their corresponding p-values for the leukaemia data set (proportional hazards assumption
satisfied, as shown in Figure 2).
All three tests support the alternative hypothesis that the two survival
curves are not equal. Here all the tests give satisfactory results, but this does not happen
in all situations.
Consider the case where the two survival curves or hazard functions cross. In this case, the
logrank test does not provide a satisfactory answer and other options must be considered.
To illustrate the concept, consider an example from Klein and Moeschberger [29]; the data
set was also used by Lin and Wang [30]. The data set is divided according to surgical
(Group 1) or percutaneous (Group 2) placement of the catheter. The aim of the study
was to assess the time to first exit-site infection (in months) in patients with renal
insufficiency. The two groups consisted of 43 and 76 patients respectively. For further
details about the data, consult the above references. Survival curves for the two groups were
reproduced using R [31] and are shown in Figure 3. The results are
summarized in Table 3.
From Figure 3, the two survival curves appear different, but the results in
Table 3 contradict this: none of the tests gives a satisfactory result. This
means that one cannot recommend a single test for each and every situation. For the above
data, the test suggested by Lin and Wang, which is based on the squared difference, gives a
statistic value of 2.2516 and a p-value of 0.0121, indicating a difference between the two
groups.
Besides these, other tests are available, but no ideal test is suitable in every situation.
A test that is suitable in one situation may fail in another. Some tests give more weight to
early events [23, 32] and some give more weight to late events [33].
3. Semi-parametric approach
3.1 Cox regression models
In diagnosing a disease, medical doctors investigate its causes and other
characteristics. For example, does the heart patient have high blood
pressure? Is a family history of diabetes related to the development of diabetes?
In this case, high blood pressure and family history are referred to as covariates, risk
factors or explanatory variables. Nowadays, identifying the most important risk
factors has become an important task in managing disease. Regression analysis is
generally used for identifying risk factors. Because of the presence of censoring in
survival data, however, ordinary regression models are not suitable. For this purpose,
Cox's regression model, or the Cox proportional hazards (PH) model [34], is
widely used in survival analysis. The proportional hazards regression model is very popular
because of its simple concept and the accessibility of software [5, 35-39]. Like the logrank
test, it is based on the proportional hazards assumption.
If the set of covariates for subject i is represented by
$z_i = (z_{i1}, z_{i2}, \dots, z_{ik})$, the well-known Cox regression model is
$$h(t, z_i) = h_0(t) \exp(\beta z_i) \qquad (6)$$
$$h(t, z_i) = h_0(t) \exp\left(\sum_{j=1}^{k} \beta_j z_{ij}\right), \quad j = 1, 2, \dots, k$$
where $h_0(t)$ is an unspecified baseline hazard function that is a function of time only and
is assumed to be the same for all subjects, the $\beta_j$ are unknown parameters describing
the importance of the covariates, and the $z_{ij}$ are the values measured on subject i at
time zero. The baseline hazard describes the shape of the distribution, while
$\exp(\beta z_i)$ gives the level of each individual's hazard. The model is based on the
assumption that the covariates affect the hazard in a multiplicative way [12]. z is a vector
of covariates of interest, which may be discrete (sex, marital status), continuous (blood
pressure) or a mixture of discrete and continuous factors (interaction of height and sex).
The main advantage of the Cox PH model is that we can estimate the parameters $\beta$ without
having to estimate $h_0(t)$. This makes the model semi-parametric.
The model can also be expressed as
$$\log\left\{\frac{h(t, z_i)}{h_0(t)}\right\} = \sum_{j=1}^{k} \beta_j z_{ij}$$
If we take $y_i = \log\{h(t, z_i)/h_0(t)\}$, the above equation becomes the multiple regression
equation
$$y_i = \sum_{j=1}^{k} \beta_j z_{ij}$$
The hazards for two individuals with covariates $z_1$ and $z_2$ are thus assumed to be
proportional:
$$HR = \frac{h_1(t)}{h_2(t)} = \exp\bigl(\beta (z_1 - z_2)\bigr)$$
The Cox model assumes that the hazard ratio comparing two specifications of the covariates is
constant over time. Examples of such covariates are the sex of the patient, the treatment,
etc. Consider a single covariate z taking two values, z = 0 if the patient receives
treatment 1 and z = 1 if the patient receives treatment 2. In terms of proportional hazards,
this can be written as $h_2(t) = e^{\beta} h_1(t)$. If $e^{\beta} < 1$, the effect of
treatment 2 is greater than that of treatment 1 (its hazard is lower). Both have the same
effect if $e^{\beta} = 1$. If $e^{\beta} > 1$, treatment 1 is superior to treatment 2.
$\beta$ is the log hazard ratio.
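As a small worked illustration with a purely hypothetical coefficient (the value $\beta = -0.5$ is assumed here and is not estimated from any data in this article):

```latex
\beta = -0.5 \;\Longrightarrow\; HR = e^{\beta} = e^{-0.5} \approx 0.61,
\qquad h_2(t) \approx 0.61\, h_1(t)
```

That is, at every time t the hazard under treatment 2 would be about 39% lower than under treatment 1, and because $e^{\beta} < 1$ treatment 2 would be judged the more effective.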
The assumption of proportional hazards in Cox regression means that the baseline hazard
function is a function of time (t) only, while the exponential part involves the $z_i$ but
is independent of t. If the covariates change with time they are called time-dependent
covariates; otherwise they are time-independent covariates. Patient performance during the
treatment period is an example of a time-dependent covariate. Several procedures have been
suggested for checking this assumption [40-42]. An easy and simple method of checking the
proportional hazards assumption in the presence of covariates is the plotting method.
Suppose $z_1$ and $z_2$ are two different sets of covariates; plot
$\log[-\log S(t, z_1)]$ and $\log[-\log S(t, z_2)]$ against time (or a function of time),
where $S(t, z)$ is the survival function. If the two curves are parallel, there is no reason
to believe that the proportional hazards assumption is violated. The method is easy for a few
covariates but for a large number it becomes tedious. Besides plotting the log cumulative
hazard functions, there is another method of checking the proportional hazards assumption,
which is based on introducing the time-dependent variable t or log t into the model as an
independent variable, centered about a point t*:
$$h(t) = h_0(t) \exp\bigl[\beta_1 z + \beta_2 z (\log t - \log t^{*})\bigr] \qquad (7)$$
Test $\beta_2 = 0$; if this hypothesis is not rejected, the proportional hazards assumption is
satisfied. If it is rejected, a time-dependent model can be used.
Since $h_0(t)$ is free from parametric assumptions, it is not possible to apply the full
likelihood function for estimating the $\beta$'s. For this model, Cox [34, 43] suggested an
estimation procedure in which the analysis concentrates only on the effect of the covariates,
leaving $h_0(t)$ completely unspecified. The $\beta$-coefficients in the Cox proportional
hazards model can be estimated by the partial likelihood method; the partial likelihood is a
function of the observed survival times and the unknown parameters:
$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\sum_{j=1}^{k} \beta_j z_{ij}\right)}{\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j=1}^{k} \beta_j z_{lj}\right)} \right] \qquad (8)$$
where $t_{(i)}$, i = 1, 2, ..., n, denote the n ordered exact failure times and $R(t_{(i)})$
consists of all individuals whose survival times are at least $t_{(i)}$. The summation in the
denominator of the likelihood is the sum of the values of
$\exp\left(\sum_{j=1}^{k} \beta_j z_{lj}\right)$ over all individuals who are at risk
at this time. The likelihood depends only on the ranks of the failure times, and estimates
based on the partial likelihood have larger variance than those based on the complete
likelihood.
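To make the partial likelihood (8) concrete, the sketch below evaluates its logarithm for a single covariate and maximises it by a simple ternary search. Everything here is an illustrative assumption: the function names, the toy data, the restriction to one covariate with no tied failure times, and the choice of ternary search (which relies on the log partial likelihood being concave in β); statistical packages use Newton-type methods instead.

```python
import math

def cox_log_partial_likelihood(beta, times, events, z):
    """Log of the partial likelihood for one covariate, assuming no ties."""
    loglik = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set R(t_i): everyone whose observed time is at least t_i.
        risk_sum = sum(math.exp(beta * z[l])
                       for l, t_l in enumerate(times) if t_l >= t_i)
        loglik += beta * z[i] - math.log(risk_sum)
    return loglik

def fit_beta(times, events, z, lo=-5.0, hi=5.0, iters=100):
    """Maximise the log partial likelihood over [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if (cox_log_partial_likelihood(m1, times, events, z)
                < cox_log_partial_likelihood(m2, times, events, z)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical toy data: z = 1 for one treatment, z = 0 for the other.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 0, 1]
toy_z = [1, 0, 1, 0, 0, 1]

beta_hat = fit_beta(toy_times, toy_events, toy_z)
print(beta_hat, math.exp(beta_hat))  # estimated log hazard ratio and hazard ratio
```

Note that the baseline hazard never appears in the code, which is the practical meaning of leaving $h_0(t)$ unspecified.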
Tied survival times can be handled by using the likelihood form of Breslow [44]. This is
simple to understand and is most suitable when the number of tied
observations at any death time is not too large.
$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\sum_{j=1}^{k} \beta_j z_{ij}\right)}{\left[\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j=1}^{k} \beta_j z_{lj}\right)\right]} \right]^{c_i} \qquad (9)$$
where $c_i$ denotes the number of tied survival times at $t_{(i)}$. In 1981, Tsiatis
investigated its properties [45]. Other approaches to tied data are those of Efron [46],
Cox [34] and Kalbfleisch and Prentice [47]. When the number of ties is large, Cox assumes
discrete survival times and generalizes the model by means of a logistic transformation:
$$\frac{h(t)\,dt}{1 - h(t)\,dt} = \frac{h_0(t)\,dt}{1 - h_0(t)\,dt} \exp\left(\sum_{j=1}^{k} \beta_j z_{ij}\right)$$
All these procedures apply to the Cox proportional hazards model. If its assumption
fails, they cannot be used. There are several reasons why the assumption may fail, the most
common being the involvement of time-dependent covariates. Examples of time-dependent
covariates are the family status of a woman, which may change during the study period (from
girl to married woman and from married woman to mother), and the cholesterol level of a
patient, which changes during the study period; measurements from regular examinations of
patients in a clinic are also time-dependent covariates. In time-dependent
Cox regression, the values of a variable change over time. Time-dependent
covariates are also referred to as updated covariates [5] and are of two types: 1) external
and 2) internal variables [47].
An external variable is a variable whose values in most cases are known in advance. Such
variables do not require the survival of the patient. Simple examples of external variables
are the age of a patient and the dose of a drug. Measurement of an internal variable is
possible only if the patient is alive. Examples are systolic blood pressure, white blood cell
count, etc. For the analysis of time-dependent covariates, the extended Cox model may be
used, called the time-dependent or non-proportional hazards Cox model.
Consider a set of q covariates, of which $q_1$ are time independent and $q_2$ are time
dependent. The modified form of the Cox proportional hazards model is
obtained by dividing the exponential part into time-independent and time-dependent
parts, i.e.
$$h(t, z) = h_0(t) \exp\left[\sum_{i=1}^{q_1} \beta_i z_i + \sum_{j=1}^{q_2} \alpha_j z_j(t)\right] \qquad (10)$$
As for the Cox proportional hazards model, the regression coefficients of the
time-dependent model are obtained through the maximum likelihood procedure. The
likelihood for the time-dependent model is the same as that for the time-independent model
except that z is replaced by z(t), so
$$L(\beta) = \prod_{i=1}^{n} \left[\frac{\exp\left(\sum_{j=1}^{k} \beta_j z_{ij}(t_{(i)})\right)}{\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j=1}^{k} \beta_j z_{lj}(t_{(i)})\right)}\right]^{c_i} \qquad (11)$$
3.2 Accelerated failure-time models
Although the Cox proportional hazards model is the most frequently used model in survival
analysis, other models exist, for example accelerated failure-time models. In
this model more attention is given to the survival time than to the hazard function. With
censored data, if in linear regression analysis we replace the response variable $T_i$ by
$\log T_i$, the resulting model is called the accelerated failure time model [6, 48, 49].
The typical form of the model is
$$\log T_i = \beta' Z_i + \varepsilon_i \qquad (12)$$
The role of the covariates in the above equation is to accelerate (or decelerate) the time to
failure. The error terms $\varepsilon_i$ are assumed to be independent and identically
distributed with mean zero. Various choices of the distribution of $\varepsilon$ lead to
regression versions of different parametric survival models. A Weibull regression model is
obtained if $\varepsilon$ has an extreme value distribution and a lognormal model is obtained
if it has a standard normal distribution.
Kay and Kinnersley [2] showed that the accelerated failure time model performs better
than the proportional hazards model in applications where the effect of treatment is to
accelerate or delay the event of interest. Equation (12) is referred to as a parametric
model if the distribution of the baseline hazard function is specified; if not, it is called
a semi-parametric model. Cox and Oakes [10] and Kalbfleisch and Prentice [47] described
the parametric survival models in detail. The main reason for the unpopularity of the
accelerated failure time model is its complicated estimation process, even when the data set
contains a small number of covariates [50]. Similarly, the accelerated failure time model is
usually based on a parametric model, which may sometimes be difficult to fit, and this is one
of the main reasons for the popularity of the proportional hazards model.
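One way to see the "acceleration" in equation (12) is to exponentiate it. The following is a sketch in which $T_{0i} = e^{\varepsilon_i}$ and its survival function $S_0$ are notation assumed here for the baseline case (all covariates equal to zero):

```latex
\begin{aligned}
T_i &= e^{\beta' Z_i}\, T_{0i}, \qquad T_{0i} = e^{\varepsilon_i} \\
S(t \mid Z_i) &= P(T_i > t) = P\bigl(T_{0i} > t\, e^{-\beta' Z_i}\bigr) = S_0\bigl(t\, e^{-\beta' Z_i}\bigr)
\end{aligned}
```

The covariates therefore rescale the time axis: every quantile of the survival time, including the median, is multiplied by $e^{\beta' Z_i}$ relative to the baseline.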
4. Parametric Approach and Models
In parametric models, methods of estimation and inference based on the likelihood are
easy and straightforward, but they rest on stronger assumptions than semi-parametric
and nonparametric models. Choosing a theoretical
distribution that fits the data well is an art; the aim of this article is not to describe
the art but to discuss some familiar survival distributions. A detailed description of the
parametric methods of survival analysis is given by Lawless [51]. The standard distributions
commonly used elsewhere in statistics are in most cases not suitable for survival data. The
exponential, Weibull, gamma and log-logistic are the most familiar survival distributions.
Of these, the simplest and most commonly used is the exponential distribution. Epstein
[52] and Davis [53] gave examples that fit the exponential distribution. Similarly,
Byar et al. [54] and Zelen [55] also applied the distribution. In the exponential
distribution, the hazard function is assumed to be constant over time. A straight-line plot
of $\log S(t)$ against t is an indication that the data follow the exponential distribution
[56]; the negative of the slope of the line is an estimate of the hazard rate $\lambda$. The
survival function of the exponential distribution is $S(t) = \exp(-\lambda t)$,
$\lambda > 0$, and the hazard function is $h(t) = \lambda$. Survival patterns of the
exponential distribution are shown in Figure 4 for λ = 0.5, 0.15, 0.05.
The Weibull distribution plays the same part in the analysis of survival data as the normal
distribution does in linear modelling. Weibull proposed the distribution in 1939 and
discussed its applications in 1951 [57, 58]. The Weibull distribution has
two parameters, a shape parameter p and a scale parameter λ, which allow it to take
different shapes. The Weibull distribution reduces to the exponential distribution when
$p = 1$, i.e. a constant hazard rate. If $p < 1$ the hazard rate decreases, and it increases
for $p > 1$. Owing to these characteristics, the Weibull distribution has broader application
in survival analysis than the exponential distribution. The survival function of the Weibull
distribution is $S(t) = \exp[-(\lambda t)^{p}]$, which is also helpful for graphical
presentation. Survival curves of the Weibull distribution are shown in Figure 5 for
λ = 0.5, p = 0.5; λ = 0.15, p = 1; λ = 0.05, p = 2.
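The effect of the shape parameter on the Weibull hazard can be tabulated directly. The sketch below is illustrative only: it assumes the parametrisation S(t) = exp[-(λt)^p], for which the hazard is h(t) = λp(λt)^(p-1), it reuses the three parameter pairs quoted for Figure 5, and the function names and time grid are hypothetical.

```python
import math

def exponential_survival(t, lam):
    """S(t) = exp(-lam * t); the hazard is the constant lam."""
    return math.exp(-lam * t)

def weibull_survival(t, lam, p):
    """S(t) = exp(-(lam * t) ** p) under the assumed parametrisation."""
    return math.exp(-((lam * t) ** p))

def weibull_hazard(t, lam, p):
    """h(t) = lam * p * (lam * t) ** (p - 1), derived from the S(t) above."""
    return lam * p * (lam * t) ** (p - 1)

grid = [0.5, 1.0, 2.0, 4.0, 8.0]                      # arbitrary time points
figure5_params = [(0.5, 0.5), (0.15, 1.0), (0.05, 2.0)]

hazards = {
    (lam, p): [weibull_hazard(t, lam, p) for t in grid]
    for lam, p in figure5_params
}

for (lam, p), values in hazards.items():
    print(f"lambda={lam}, p={p}:", [round(v, 4) for v in values])
```

Reading the printed rows from left to right shows a falling hazard for p = 0.5, a flat hazard for p = 1 (the exponential case) and a rising hazard for p = 2.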
Like the Weibull distribution, the log-logistic distribution is characterized by two
parameters (shape p and scale λ). If $p \le 1$ the hazard rate decreases with time,
while for $p > 1$ it increases to a maximum and then decreases. The hazard function and
survival function of the distribution are
$$h(t) = \lambda p (\lambda t)^{p-1} \bigl[1 + (\lambda t)^{p}\bigr]^{-1}, \qquad S(t) = \bigl[1 + (\lambda t)^{p}\bigr]^{-1}$$
Survival curves of the log-logistic distribution are shown in Figure 6 for
λ = 0.5, p = 0.5; λ = 0.15, p = 1; λ = 0.05, p = 2.
The gamma distribution is likewise characterized by a shape parameter p and a scale
parameter λ. When p = 1 (p > 1, p < 1), the hazard rate is constant (increases, decreases).
The lognormal distribution also belongs to the family of two-parameter distributions.
Boag [59] first applied the distribution in cancer research. The hazard function of the
lognormal distribution involves the incomplete normal integral. The function increases
initially to a maximum and then decreases. The distribution has been frequently applied
in biomedical research.
Like the Cox model, parametric models can also be fitted to survival data using the maximum likelihood method. The procedure is as follows. Suppose that the survival times $t_1, t_2, \ldots, t_n$ are observed, that q of the n individuals die at times $t_{(1)} < t_{(2)} < \cdots < t_{(q)}$, and that the survival times of the remaining n-q (q < n) individuals are censored. If $f(t)$ denotes the probability density function of the survival time t and $S(t)$ the survival function, the likelihood function can be expressed as
$\prod_{i=1}^{n} \{f(t_i)\}^{c_i} \{S(t_i)\}^{1-c_i}$ (13)
where $c_i$ is an indicator variable that takes the value 0 when the survival time is censored and 1 when it is uncensored. For data without censoring, the function reduces to the product of the probability density functions. The function can be maximized with respect to the unknown parameters.
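To make the censored-data likelihood concrete, the sketch below applies it to the exponential distribution, where $f(t) = \lambda e^{-\lambda t}$ and $S(t) = e^{-\lambda t}$. In that case the log-likelihood is $q \log\lambda - \lambda \sum_i t_i$ and its maximizer is $\hat{\lambda} = q / \sum_i t_i$. The exponential model is chosen here only because it has a closed-form solution; the function names and the toy data are hypothetical and not taken from the paper.

```python
import math

def exp_log_likelihood(lam, times, events):
    """Log of the censored-data likelihood for an exponential model.

    Each death (event = 1) contributes log f(t) = log(lam) - lam * t;
    each censored time (event = 0) contributes log S(t) = -lam * t.
    """
    total = 0.0
    for t, c in zip(times, events):
        log_f = math.log(lam) - lam * t
        log_s = -lam * t
        total += c * log_f + (1 - c) * log_s
    return total

def exp_mle(times, events):
    """Closed-form maximizer: number of deaths / total follow-up time."""
    return sum(events) / sum(times)

# Hypothetical toy data (not from the paper): 1 = death, 0 = censored
times = [2.0, 3.0, 3.0, 5.0, 8.0]
events = [1, 1, 0, 1, 0]

lam_hat = exp_mle(times, events)
# Crude grid search to confirm that the closed form maximizes the log-likelihood
grid = [0.001 * k for k in range(1, 1000)]
lam_grid = max(grid, key=lambda lam: exp_log_likelihood(lam, times, events))
print(round(lam_hat, 4), round(lam_grid, 3))
```

For distributions without a closed-form maximizer, such as the Weibull, the same log-likelihood would be passed to a numerical optimizer instead of the grid search used here.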
5. Computer Software
The Kaplan-Meier estimator, Greenwood’s variance estimator, log-rank and weighted log-rank tests, Cox regression (proportional hazards and time-dependent) and parametric regression are available in well-known software packages such as R [31], SPSS [60], S-Plus [61], SAS [62], STATA [63] and BMDP [64]. In STATA, the stset and sts commands are used to draw Kaplan-Meier curves. In SAS, PROC LIFETEST is available, while in S-Plus the surv.fit(time, status) and survfit commands give the survival probabilities. Some packages also support time-dependent analysis, but accelerated failure time and parametric models are not widely used because software for them is not readily available. This area needs special
attention.
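For readers who want to see what such routines compute, the following is a minimal sketch of the Kaplan-Meier estimator with Greenwood’s variance in plain Python. It is not the implementation used by any of the packages cited above; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t), Greenwood variance)] at each distinct death time.

    times  : observed follow-up times
    events : 1 if the time is a death, 0 if it is censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, gw_sum = 1.0, 0.0
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, c in data if tt == t and c == 1)
        removed = sum(1 for tt, c in data if tt == t)  # deaths + censored at t
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            if n_at_risk > deaths:  # skip the term when S(t) drops to zero
                gw_sum += deaths / (n_at_risk * (n_at_risk - deaths))
            out.append((t, surv, surv ** 2 * gw_sum))
        n_at_risk -= removed
        i += removed
    return out

# Hypothetical toy data (not from the paper): 1 = death, 0 = censored
for t, s, var in kaplan_meier([2.0, 3.0, 3.0, 5.0, 8.0], [1, 1, 0, 1, 0]):
    print(t, round(s, 3), round(var, 4))
```

Individuals censored at a death time are counted as still at risk at that time, which is the usual convention for tied death and censoring times.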
6. Conclusion
Techniques of survival analysis have had a substantial impact on the development of medical research. Almost every medical journal contains articles that directly or indirectly use the methods of survival analysis. In most research, semi-parametric and nonparametric methods are preferred over parametric methods because of their less restrictive conditions. The Kaplan-Meier method, the logrank test and Cox’s proportional hazards model are commonly used and popular in survival data analysis. When the proportional hazards assumption is not satisfied, time-dependent covariates are easily incorporated into the model using available software. The development of computer software has made it easier to identify the correct parametric distribution as well as the model. If the data follow a specific distribution, the results obtained have smaller variance than those of nonparametric methods.
Despite all these efforts, there is still scope for improvement in the understanding of basic concepts, their application and their presentation in research articles. Easy access to software would make the underused accelerated failure time model more popular. This area needs special attention: better application in research and various extensions of the model may prove useful in the future.
References
1. Gross AJ, Clark VA. Survival Distributions: Reliability Applications in the
Biomedical Sciences. Wiley: New York, 1975.
2. Kay R, Kinnersley N. On the use of the accelerated failure time model as an
alternative to the proportional hazards model in the treatment of time to event
data: A case study in Influenza. Drug Information Journal 2002; 36:571-579.
3. Fleming TR, Lin DY. Survival analysis in clinical trials: past developments and
future directions. Biometrics 2000; 56:971-983.
4. Hougaard P. Fundamentals of Survival Data. Biometrics 1999; 55:13-22.
5. Altman DG, de Stavola BL. Practical problems in fitting a proportional hazards
model to data with updated measurements of the covariates. Statistics in Medicine
1994; 13:301-341.
6. Wei LJ. The accelerated failure time model: a useful alternative to the Cox
Regression model in Survival Analysis. Statistics in Medicine 1992; 11:1871-
1879.
7. Andersen PK. Survival Analysis 1982-1991: The second decade of the
proportional hazards regression model. Statistics in Medicine 1991; 10: 1931-
1941.
8. Collett D. Modelling Survival Data in Medical Research. Chapman & Hall:
London, 1994.
9. Andersen PK, Borgan Ø, Gill RD, Keiding N. Statistical Models Based on
Counting Processes. Springer Verlag: New York, 1993.
10. Cox DR, Oakes D. Analysis of Survival Data. Chapman & Hall: London, 1984.
11. Klein JP, Moeschberger M. Survival Analysis. Springer Verlag: New York, 1997.
12. Marubini E, Valsecchi M G. Analyzing Survival Data from clinical trials and
observational studies. John Wiley & Sons: Chichester, 1995.
13. Maller R, Zhou X. Survival Analysis with Long-term Survivors. John Wiley &
Sons: Chichester, 1996.
14. Kaplan EL, Meier PL. Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association 1958; 53: 457-481.
15. Greenwood M. A report on the natural duration of cancer. In Reports on
Public Health and Medical Subjects, vol. 33. His Majesty’s Stationery Office:
London, 1926; 1-26.
16. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N,
McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical
trials requiring prolonged observation of each patient. II. Analysis and examples.
British Journal of Cancer 1977; 35: 1-39.
17. Altman DG. Practical Statistics for Medical Research. Chapman & Hall/CRC:
Boca Raton 1991; 365-395.
18. Slud EV, Byar DP, Green SB. A comparison of reflected versus test-based
confidence intervals for the median survival time, based on censored data.
Biometrics 1984; 40: 587-600.
19. Zhao G. The homogenetic estimate for the variance of survival rate. Statistics in
Medicine 1996; 15:51-60.
20. Borkowf CB. A simple hybrid variance estimator for the Kaplan-Meier survival
function. Statistics in Medicine 2005; 24: 827-851.
21. Freireich EJ, Gehan E, Frei III E, Schroeder LR, Wolman IJ, Anbari R, Burgert
EO, Mills SD, Pinkel D, Selawry OS, Moon JH, Gendel BR, Spurr CL, Storrs R,
Haurani F, Hoogstraten B, Lee S. The effect of 6-mercaptopurine on the duration
of steroid-induced remissions in acute leukaemia: a model for evaluation of other
potentially useful therapy. Blood 1963; 21: 699-716.
22. Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer: New York,
1995.
23. Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-
censored samples. Biometrika 1965; 52: 203-223.
24. Rossa A, Zieliński R. A simple improvement of the Kaplan-Meier estimator.
Communications in Statistics - Theory and Methods 2002; 13: 147-158.
25. Mantel N. Evaluation of survival data and two new rank order statistics arising in
its consideration. Cancer Chemotherapy Reports 1966; 50: 163-170.
26. Gehan EA. A generalized two sample Wilcoxon test for doubly-censored data.
Biometrika 1965; 52: 650-653.
27. Gill RD. Censoring and Stochastic Integrals, Volume 24, Mathematical Centre
Tracts. Amsterdam: Mathematisch Centrum 1980.
28. Tarone RE, Ware J. On distribution-free tests for equality of survival
distributions. Biometrika 1977; 64: 156-160.
29. Klein JP, Moeschberger M. Survival Analysis: Techniques for Censored and
Truncated Data. Springer, 1998.
30. Lin X, Wang H. A New Testing Approach for Comparing the Overall
Homogeneity of Survival Curves. Biometrical Journal 2004; 46: 489-496.
31. R Development Core Team. R: a language and environment for statistical computing.
Vienna, Austria, R Foundation for Statistical Computing, 2004.
32. Moreau T, Maccario J, Lellouch J, Huber C. Weighted log rank statistics for
comparing two distributions. Biometrika 1992; 79: 195-198.
33. Gray R, Tsiatis A. A linear rank test for use when the main interest is in
differences in cure rates. Biometrics 1989; 45: 899-904.
34. Cox DR. Regression models and life tables (with discussion). Journal of the
Royal Statistical Society Series B 1972; 34: 187-220.
35. Lin DY. Goodness-of-fit analysis for the Cox regression model based on a class
of parameter estimators. Journal of the American Statistical Association 1991; 86:
725-728.
36. Aalen OO. A linear regression model for the analysis of life times. Statistics in
Medicine 1989; 8: 907-925.
37. Lagakos SW, Schoenfeld DA. Properties of proportional-hazards tests under
misspecified regression models. Biometrics 1984; 40: 1037-1048.
38. Solomon PJ. Effect of misspecification of regression models in the analysis of
survival data. Biometrika 1984; 71: 291-298.
39. Bryson MC, Johnson ME. The incidence of monotone likelihood in the Cox
model. Technometrics 1981; 23: 381-383.
40. Pettitt AN, Bin Daud I. Case-weighted measures of influence for proportional
hazards regression. Applied Statistics 1989; 38:51-67.
41. Kay R. Goodness of fit methods for the proportional hazards regression model: a
review. Revue d’Épidémiologie et de Santé Publique 1984; 32: 185-198.
42. Andersen PK. Testing goodness of fit of Cox’s regression and life models.
Biometrics 1982; 38: 67-77.
43. Cox DR. Partial likelihood. Biometrika 1975; 62: 269-276.
44. Breslow N. Covariance analysis of censored survival data. Biometrics 1974; 30:
89-99.
45. Tsiatis AA. A large sample study of the estimate for the integrated hazard
function in Cox’s regression model for survival data. Annals of Statistics 1981; 9:
93-108.
46. Efron B. The efficiency of Cox’s likelihood function for censored data. Journal
of the American Statistical Association 1977; 72: 557-565.
47. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data.
Wiley: New York, 1980.
48. Koul H, Susarla V, Van Ryzin J. Regression analysis with randomly right
censored data. Annals of Statistics 1981; 9: 1276-1288.
49. Miller R. Least squares regression with censored data. Biometrika 1976; 63: 449-
464.
50. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure
time model. Biometrika 2003; 90: 341-353.
51. Lawless JF. Statistical Models and Methods for Lifetime Data. Wiley: New York,
1982.
52. Epstein B. The Exponential Distribution and Its Role in Life Testing.
Industrial Quality Control 1958; 15: 2-7.
53. Davis DJ. An analysis of Some Failure Data. Journal of the American Statistical
Association 1952; 47: 113-150.
54. Byar DP. Selecting Optimum Treatment in Clinical Trials Using Covariate
Information, presented at the 1974 Annual Meeting of the American Statistical
Association, August 28, 1974.
55. Zelen M. Applications of Exponential Models to Problems in Cancer
Research. Journal of the Royal Statistical Society Series A 1966; 3: 368-398.
56. Lee ET. Statistical Methods for Survival Data Analysis. Lifetime Learning
Publications: Belmont, California, 1980.
57. Weibull W. A Statistical Distribution of Wide Applicability. Journal of Applied
Mechanics 1951; 18: 293-297.
58. Weibull W. A Statistical Theory of the Strength of Materials, Ingeniors
Vetenskaps Akademien Handlingar, No. 151; The Phenomenon of Rupture in
Solids 1939; 293-297.
59. Boag JW. Maximum Likelihood Estimates of Proportion of Patients Cured by
Cancer Therapy. Journal of the Royal Statistical Society Series B 1949; 11: 15.
60. SPSS Corporation. SPSS 11.5 for Windows. Chicago, 2004.
61. SPLUS. Version 2.0, Statistical Sciences Inc., Seattle 1992.
62. SAS Software: SAS Institute Inc, Cary, NC, 1991.
63. STATA. Release 3: Reference Manual, Computing Research Center, Los
Angeles, 1989.
64. BMDP Statistical Software, Vol. II, Dixon WJ (Ed.). University of California
Press, Berkeley, 1988.
Table 1. Number of deaths at the jth death time in each of two groups
| Group | Number of deaths at $t_{(j)}$ | Number surviving beyond $t_{(j)}$ | Number at risk just before $t_{(j)}$ |
|---|---|---|---|
| I | d1j | n1j - d1j | n1j |
| II | d2j | n2j - d2j | n2j |
| Total | dj | nj - dj | nj |
Table 2. Results of Logrank, Wilcoxon and Tarone-Ware tests and their corresponding p-values for leukaemia data set
| Statistical test | Value | p-value |
|---|---|---|
| Logrank | 16.793 | 0.0000 |
| Wilcoxon | 13.995 | 0.0002 |
| Tarone-Ware | 15.124 | 0.0000 |
Table 3. Results of Logrank, Wilcoxon and Tarone-Ware tests and their corresponding p-values for kidney dialysis patients
| Statistical test | Value | p-value |
|---|---|---|
| Logrank | 2.530 | 0.1117 |
| Wilcoxon | 0.002 | 0.9636 |
| Tarone-Ware | 0.403 | 0.5260 |
Figure 1: Pattern of Kaplan-Meier Survival Function by SPSS
[Plot: survival function (0 to 1) against time.]
Figure 2: Survival plot of Leukaemia Data Set
[Plot: survival function (0 to 1) against time in weeks; curves for the treatment group and the placebo group.]
Figure 3: Survival curves comparison of the estimated time to infection for kidney dialysis patients.
Figure 4: Survival curves of exponential distribution for different values of hazard rate
[Plot: survival function (0 to 1) against time (0 to 20); curves for hazard rates 0.5, 0.15 and 0.05.]
Figure 5: Survival curves of Weibull distribution for different values of shape parameter.
[Plot: survival function (0 to 1) against time (0 to 20); curves for shape 0.5, 1 and 2.]
Figure 6: Survival curves of log-logistic distribution for different values of shape parameter.
[Plot: survival function (0 to 1) against time (0 to 20); curves for shape 0.5, 1 and 2.]