Survival Analysis in Medical Research

By Qamruz Zaman1, Karl P Pfeiffer2

1Department of Statistics, University of Peshawar, Pakistan

2Department of Medical Statistics, Informatics and Health Economics,

Medical University Innsbruck

Abstract

For the last few decades, special attention has been given to the field of survival analysis. Survival analysis techniques play an important part in different areas of research. The purpose of this article is not to elaborate its uses in different fields but to describe some of the frequently used concepts of survival analysis in medical research. Nonparametric techniques of survival analysis (the Kaplan-Meier method and the logrank test) are popular because of their simplicity and because they make few distributional assumptions. For multivariate analysis, the semi-parametric Cox proportional hazards model is used to identify risk factors when the proportional hazards assumption is satisfied; when the hazards are not proportional, a time-dependent regression model is applied to the data set. Furthermore, hazard functions of commonly used survival distributions are described. Some areas that are underused because of a lack of software are also discussed.

Key words: Censoring, Cox regression model, Failure time, Kaplan-Meier survival

function, Logrank test, Proportional hazards assumption

Corresponding author:

Qamruz Zaman E-mail: ayanqamar@gmail.com

Editor: Ghorai, Jugal K. jugal@csd.uwm.edu

1. Introduction

Nowadays, survival analysis techniques have become important tools for analysing data in medicine, engineering, marketing and other fields. Although the basic functions of survival analysis are the same in all fields, it is known by different names: in engineering it is called reliability analysis, in sociology event history analysis, in economics duration analysis, and in medical research survival analysis. Survival analysis comprises the models and methods used for analysing lifetime data. One of the common uses of survival analysis in clinical trials is the comparison of survival times under different treatments for a fatal disease; a demographer can use the technique to study the length of working hours of a group of people or the duration of marriage. In an open book exam, the examiner can use survival analysis to measure the number of hours taken to complete the paper. In engineering, one use of survival analysis is the study of the waiting time to failure of an item. In economics, we may study the survival of a new business.

Survival analysis differs from other statistical procedures for the following reasons.

1. In survival analysis, the response variable is always time.

2. Staggered entries are common in medical research. By staggered entries we mean that the individuals in the study do not all enter at the same time. This does not affect the survival analysis, as the analysis deals with the length of the observation time and does not require a common entry time.

3. The assumption of normality does not hold in survival analysis, as survival data are generally skewed. The commonly used distributions in survival analysis are the exponential, Weibull, lognormal, gamma and log-logistic.

4. Censoring is present, which may affect the estimation of the hazard function.

5. Covariates may be time dependent.

A data sample is said to be censored when values of the variable are not observed for

some of the items in the sample. In medical studies, the actual time of death of some subjects may not be observed for many reasons, e.g. they move away or the allocated time for the study elapses before the event occurs [1]. One of the main reasons for censoring is the limited duration of the study period.

Survival analysis is a very vast field, which cannot be covered in a single article. The main aim of this article is to avoid technical terms as much as possible and to describe only important and commonly used topics. This should be helpful not only for experienced but also for emerging researchers. Although some work has already been done in this direction [2-7], most of the papers either describe specific topics or are very technical, not only for medical practitioners but sometimes also for statisticians who have less knowledge or experience of the subject.

Some of the books covering the concept of survival analysis are Modelling Survival Data

in Medical Research [8], Statistical Models Based on Counting Processes [9], Analysis of

Survival Data [10], Survival Analysis [11], Analysing Survival Data from clinical trials

and Observational Studies [12] and Survival analysis with Long-term Survivors [13].

Survival analysis is based on the time until an event occurs. Time may be measured in hours, days, weeks, months or years from the beginning of follow-up until an event occurs. Time is a positive real-valued variable with a continuous distribution. Examples of survival time are:

• Time from the diagnosis of a disease to the development of a disease

• Time to death

• Length of stay in a jail/ hospital/ school

• HIV viral load measurements

The event may be recovery from a disease, death, etc. In medical research three classes of techniques (parametric, semi-parametric and nonparametric) are mainly used to obtain estimates of survival probabilities and of the mean or median survival time. These methods are also used for the comparison of two or more treatments or procedures. Similarly, multivariate survival analysis is used to identify risk factors.

Survival time is described by three functions:

1.1. Cumulative Proportion Surviving (Survival Function) S (t)

Let the survival time (a random variable) be denoted by T. The survival function, denoted by S(t), is defined as the probability that an individual survives longer than t:

$$S(t) = P(\text{an individual survives longer than } t) = P(T > t)$$

$$S(t) = 1 - P(\text{an individual fails before } t) = 1 - F(t)$$

where F(t) is the distribution function of T. The range of S(t) is between 0 and 1, i.e. $0 \le S(t) \le 1$. The graph of the estimated survival function is a step function and is called the survival curve. At time zero, S(t) takes its maximum value 1, and if the last observed time is an event time the estimated S(t) reaches the minimum value zero.

1.2. Probability Density Function

The probability density function of failure time data is defined as

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t}$$

$$f(t) = \frac{dF(t)}{dt} = \frac{d}{dt}\bigl(1 - S(t)\bigr) = -\frac{d}{dt}S(t) = -S'(t)$$

The probability density function is also known as the unconditional failure rate.

1.3. Hazard Function:

The hazard function measures the risk of failure during a very small interval, given that the individual has survived to the beginning of the interval. It is defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{P(\text{an individual who survives to time } t \text{ fails in } (t, t + \Delta t))}{\Delta t}$$

The function is also known as the instantaneous failure rate, force of mortality, conditional failure rate and age-specific failure rate. The hazard function is a rate, not a probability, and it is not restricted to lie between 0 and 1. It is commonly used for identifying the model, such as an exponential, Weibull or gamma curve, that fits one's data. A survival model is usually expressed in terms of the hazard function.

The cumulative hazard function is defined as

$$H(t) = \int_0^t h(u)\,du, \qquad h(t) = \frac{d}{dt}H(t)$$

There exist relations among S(t), h(t) and f(t). By definition

$$h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t)$$

$$H(t) = -\log S(t)$$

Equivalently,

$$S(t) = \exp\bigl(-H(t)\bigr) = \exp\left(-\int_0^t h(u)\,du\right)$$

Thus, given any one of the three functions S(t), h(t) and f(t), the others can be derived.
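The relations among S(t), f(t), h(t) and H(t) can be checked numerically. The following is a minimal sketch, assuming purely for illustration a survival function of the form S(t) = exp(-lam * t**p) with hypothetical values lam = 0.05 and p = 2; the function names are not from the article.

```python
import math

def survival(t, lam=0.05, p=2.0):
    """Illustrative survival function S(t) = exp(-lam * t**p) (assumed form)."""
    return math.exp(-lam * t ** p)

def density(t, eps=1e-6):
    """f(t) = -dS/dt, approximated by a central finite difference."""
    return -(survival(t + eps) - survival(t - eps)) / (2 * eps)

def hazard(t):
    """h(t) = f(t) / S(t)."""
    return density(t) / survival(t)

def cumulative_hazard(t):
    """H(t) = -log S(t)."""
    return -math.log(survival(t))

t = 3.0
print("h(t)        =", hazard(t))                        # analytically lam * p * t
print("exp(-H(t))  =", math.exp(-cumulative_hazard(t)))  # should reproduce S(t)
print("S(t)        =", survival(t))
```

Starting from S(t) alone, the sketch recovers the density, the hazard and the cumulative hazard, which is the sense in which any one of the three functions determines the others.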

The characteristics of the survival distribution are used for estimating the mean and median survival times and survival probabilities. Several methods have been developed for comparing the survival times of different groups, e.g. treatment and placebo groups, or male and female thalassaemia patients. Similarly, regression methods using semi-parametric and parametric approaches have been developed for identifying potential risk or prognostic factors. Survival analysis thus combines three commonly used families of techniques: nonparametric, semi-parametric and parametric. Here, we briefly describe some aspects of these techniques.

2. Nonparametric approach

The simplest method, which is free of distributional assumptions, is the nonparametric method. Nonparametric methods are often easier to understand than parametric methods. Furthermore, nonparametric analyses are more widely used in situations where there is doubt about the exact form of the distribution. Among the nonparametric methods, the most popular and commonly used is the Kaplan-Meier method [14]. The Kaplan-Meier (KM) estimator is also called the nonparametric maximum likelihood estimator. It is used for estimating survival probabilities. The method is a modified form of the life table technique, with the condition that each time interval contains exactly one event and the event occurs at the beginning of the interval. For large samples, the estimator is approximately normally distributed with mean S(t), and its variance is estimated by Greenwood's formula. The estimator is defined as follows.

Let x1, x2, …, xn be independent and identically distributed survival times having distribution function F(x) and let G(c) be the distribution function of independent and identically distributed censoring times c1, c2, …, cn. Further, xi and ci are assumed to be independent. Let $t_i = \min\{x_i, c_i\}$ be the observed survival time and let $\delta_i = I(x_i \le c_i)$ indicate whether the survival time is an event ($\delta_i = 1$) or censored ($\delta_i = 0$). Let the number of individuals who are alive just before time $t_i$, including those who are about to die at this time, be denoted by $n_i$, and let $d_i$ denote the number who die at this time. The Kaplan-Meier estimator is defined as

$$\hat{S}_{KM}(t) = \prod_{i:\, t_i \le t} \left(\frac{n_i - d_i}{n_i}\right) \tag{1}$$

The important assumption of the Kaplan-Meier survival function is that the distribution of censoring times is independent of the exact survival times. The function is a step function that changes only at event times. If the last observed time is an event time, the Kaplan-Meier estimate for all times greater than the last time is 0. But if the last observed time is a censored time, the Kaplan-Meier function remains greater than 0. In this case one cannot calculate the mean survival time and the median is chosen instead, which is common in practice for skewed data. The other reason for the unpopularity of the mean survival time is that the right tail has a marked influence on the mean, while censoring often makes it difficult to estimate the right tail. Similarly, if the last observation is censored and the estimated survival function never falls to 50% or below, it is not possible to calculate the median survival time. A typical pattern of the Kaplan-Meier survival curve is shown in Figure 1.
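The product-limit calculation can be sketched in a few lines of code. The example below is a minimal illustration rather than a replacement for statistical software; the function names and the small data set (times in weeks, with 0 marking a censored observation) are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct event time.

    times  : observed times t_i = min(x_i, c_i)
    events : 1 if the event was observed, 0 if the time is censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by (n_i - d_i) / n_i at each event time.
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve

def median_survival(curve):
    """Smallest event time with S_hat(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical data: the last observation (25 weeks) is censored.
times = [6, 6, 6, 7, 10, 13, 16, 22, 23, 25]
events = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
print("median:", median_survival(curve))
```

Because the largest observed time in this hypothetical sample is censored, the estimated curve stops above zero, which is the situation described above in which the mean survival time cannot be computed.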

Kaplan and Meier gave an approximate variance formula for their estimator, which they attributed to Greenwood [15]. The Greenwood variance estimator for the Kaplan-Meier survival function is

$$\widehat{Var}_{Gr}\left[\hat{S}_{KM}(t)\right] = \hat{S}_{KM}^{2}(t) \sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)} \tag{2}$$

The problem with the Greenwood estimator is that it underestimates the true variance, especially in the right tail of the survival distribution [16]. Furthermore, it conveys little intuition about how the variance of the survival function changes, especially when censoring occurs. Greenwood's formula is accurate only for large samples, and concerns have been raised about its small-sample performance. To overcome this problem, Peto [16] introduced the following estimator,

$$\widehat{Var}_{P}\left[\hat{S}_{KM}(t)\right] = \frac{\hat{S}_{KM}^{2}(t)\left[1 - \hat{S}_{KM}(t)\right]}{n_i} \tag{3}$$

where $n_i$ is the number of individuals at risk at time $t$.

But it is also claimed that Peto's estimator underestimates the true variance, especially in the left and right tails of the survival distribution [17], and with heavy censoring [18]. To address the effect of censoring on the variance, Zhao [19] proposed the homogeneous variance estimator and Borkowf [20] the adjusted hybrid variance estimator. However, because software for these estimators is not widely available, Greenwood's and Peto's estimators remain common in practice.
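The two variance estimators that are common in practice can be computed alongside the Kaplan-Meier estimate. The sketch below is illustrative only: the function name is hypothetical, Peto's variance is taken here as S²(1 - S)/n with n the number at risk just before the event time (conventions for n differ between sources), and the Greenwood variance is set to zero once the estimate reaches zero.

```python
def km_with_variance(times, events):
    """Kaplan-Meier estimate with Greenwood and Peto variances.

    Returns a list of (t, S_hat, var_greenwood, var_peto) at event times.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    gw_sum = 0.0  # running sum of d_i / (n_i * (n_i - d_i))
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= (n_at_risk - deaths) / n_at_risk
            if n_at_risk > deaths:
                gw_sum += deaths / (n_at_risk * (n_at_risk - deaths))
                var_gw = surv ** 2 * gw_sum
            else:
                var_gw = 0.0  # S_hat is 0 here; convention assumed for this sketch
            # Peto: S^2 (1 - S) / n, with n taken as the number at risk (assumption).
            var_peto = surv ** 2 * (1.0 - surv) / n_at_risk
            rows.append((t, surv, var_gw, var_peto))
        n_at_risk -= leaving
    return rows

# Hypothetical sample without censoring: Greenwood reduces to S(1 - S)/n.
for row in km_with_variance([1, 2, 3, 4], [1, 1, 1, 1]):
    print(row)
```

With no censoring, the Greenwood variance coincides with the binomial variance S(1 - S)/n of a simple proportion, which gives a quick sanity check of an implementation.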

2.1 Proportional Hazards Assumption

The most important assumption for the logrank test and the Cox regression model is the proportional hazards assumption. We illustrate the concept with the help of an example. Consider the leukaemia data set of Freireich et al. [21], which was also used completely or partially by Kleinbaum [22], Gehan [23], Rossa and Zieliński [24] and Borkowf [20]. The data set consists of the remission times in weeks of two groups, treatment and placebo, each having 21 patients. The placebo group is free of censoring, while in the treatment group 9 events occurred during the research period and the remaining 12 observations were censored. The survival plots of the two groups are presented in Figure 2.

Figure 2 shows that the two curves appear “parallel”, meaning that they keep a roughly steady separation over time. Since the two curves do not cross at any point, the plot gives no evidence against proportional hazards for the two groups. Instead of plotting survival functions, one can also check the assumption by plotting $\log\{-\log[S(t)]\}$ against $\log t$ for each group; under proportional hazards these curves are parallel.
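The log-log check can be motivated by a short derivation, sketched here under the assumption that the hazards of the two groups satisfy $h_2(t) = e^{\beta} h_1(t)$ for some constant β (the symbol β and the group subscripts are introduced only for this illustration):

```latex
\begin{align*}
H_2(t) &= \int_0^t h_2(u)\,du = e^{\beta} H_1(t), \\
S_2(t) &= \exp\{-H_2(t)\} = \{S_1(t)\}^{e^{\beta}}, \\
\log\{-\log S_2(t)\} &= \beta + \log\{-\log S_1(t)\}.
\end{align*}
```

Under proportional hazards the two log-log curves therefore differ by the constant β at every time point, so they appear parallel whether plotted against t or log t.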

2.2 Comparison of Survival Distributions

The Kaplan-Meier survival function is used for estimating and drawing a single survival curve, but there are many situations in which we want to compare two or more curves, e.g. survival curves of patients receiving different treatments, or of male and female smokers.

Comparison of two or more survival distributions (curves) is common practice in medical research. Several methods are used for comparing survival distributions, of which the most commonly used rank-based tests are the logrank test, also called the Mantel-Cox test [25], and Gehan's test, also referred to as the generalized Wilcoxon test [23, 26]. Specially designed tests are necessary for censored data; in the absence of censoring, the nonparametric Mann-Whitney U test or Wilcoxon rank sum test can be used.

The logrank statistic, like many other χ² tests, compares observed with expected events. This can be explained by considering two Groups I and II. Let $t_{(1)} < t_{(2)} < \dots < t_{(r)}$ be the r distinct death times in the two groups combined. At time $t_{(j)}$, let $d_{1j}$ and $d_{2j}$ be the numbers of deaths in Group I and II, respectively, with $d_j = d_{1j} + d_{2j}$. Furthermore, $n_{1j}$ and $n_{2j}$ are the numbers of persons at risk just prior to time $t_{(j)}$, with $n_j = n_{1j} + n_{2j}$. All these can be summarized in Table 1.

The logrank test statistic is

$$\chi^2 = \frac{\left(\sum_{j=1}^{r} (d_{1j} - E_{1j})\right)^2}{Var} \sim \chi^2(1) \tag{4}$$

where $E_{1j} = n_{1j} d_j / n_j$ is the mean of the hypergeometric random variable $d_{1j}$, and the variance of $d_{1j}$ is given as

$$Var(d_{1j}) = \frac{n_{1j}\, n_{2j}\, d_j\, (n_j - d_j)}{n_j^2\, (n_j - 1)}, \quad \text{so} \quad Var = \sum_{j=1}^{r} Var(d_{1j})$$

A large value of χ² would lead to the conclusion that the two treatments are not equally effective. Some of the properties of the logrank test were studied by Gill [27]. The test is more appropriate, powerful and reliable than the other tests in situations where the survival curves do not cross, i.e. where the hazard functions are proportional or the “odds ratios” are constant over time intervals. In the case of crossing curves, weighted tests such as Gehan's Wilcoxon test are more powerful. The Wilcoxon test puts more weight on early events than the logrank test and is described as

$$\chi^2 = \frac{\left(\sum_{j=1}^{r} n_j (d_{1j} - E_{1j})\right)^2}{Var_W} \sim \chi^2(1) \tag{5}$$

where $n_j$ is the total number at risk just prior to the j-th death time, $E_{1j}$ is the expected number of deaths in Group I at that time, and $Var_W = \sum_{j} n_j^2\, Var(d_{1j})$ is the variance of the weighted sum.
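Both statistics share the same observed-minus-expected structure and differ only in the weight given to each death time. The sketch below assumes two groups labelled 1 and 2 and uses a hypothetical function name and data; the p-value uses the identity that the upper tail of a χ² variable with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def two_group_test(times, events, groups, weight="logrank"):
    """Weighted logrank-type chi-square test for two groups (labels 1 and 2).

    weight="logrank" uses w_j = 1; weight="gehan" uses w_j = n_j.
    Returns (chi_square, p_value) on 1 degree of freedom.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    num = 0.0
    var = 0.0
    for tj in event_times:
        # Numbers at risk just before tj in each group.
        n1 = sum(1 for t, g in zip(times, groups) if t >= tj and g == 1)
        n2 = sum(1 for t, g in zip(times, groups) if t >= tj and g == 2)
        # Deaths at tj in group 1 and in both groups combined.
        d1 = sum(1 for t, e, g in zip(times, events, groups)
                 if t == tj and e == 1 and g == 1)
        d = sum(1 for t, e in zip(times, events) if t == tj and e == 1)
        n = n1 + n2
        w = 1.0 if weight == "logrank" else float(n)
        num += w * (d1 - n1 * d / n)          # observed minus expected
        if n > 1:
            var += w ** 2 * n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    chi2 = num ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square(1)
    return chi2, p_value

# Hypothetical data: four subjects, all events observed.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
groups = [1, 1, 2, 2]
print(two_group_test(times, events, groups, "logrank"))
print(two_group_test(times, events, groups, "gehan"))
```

The sketch does not guard against a zero variance, which occurs when only one group is ever at risk; production software handles such edge cases and also extends the test to more than two groups.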

Besides the Wilcoxon test, there are other weighted tests for the non-proportional hazards case, e.g. the Tarone-Ware test [28]. These tests can also be used for comparing more than two groups. The behaviour of these tests on two published examples is discussed next. Table 2 gives the values of the logrank, Wilcoxon and Tarone-Ware tests and their corresponding p-values for the leukaemia data set (proportional hazards assumption satisfied, as shown in Figure 2).

All three tests give results in favour of the alternative hypothesis that the two survival curves are not equal. Here all tests give satisfactory results, but this does not happen in all situations.

Consider the case where the two survival curves or hazard functions cross. In this case, the logrank test does not provide a satisfactory answer and other options must be considered.

To illustrate the concept, consider an example from Klein and Moeschberger [29]; the data set was also used by Lin and Wang [30]. The data are divided by method of catheter placement: surgical (Group 1) and percutaneous (Group 2). The aim of the study was to assess the time to first exit-site infection (in months) in patients with renal insufficiency. The two groups consisted of 43 and 76 patients respectively. For further details about the data, consult the above references. Survival curves for the two groups were reproduced using R [31] and are shown in Figure 3. The results are summarized in Table 3.

From Figure 3, the two survival curves appear to be different, but the results in Table 3 do not reflect this: none of the tests gives a satisfactory result. This means that no single test can be recommended for each and every situation. For the above data, Lin and Wang suggested a test based on the squared difference, which gives a statistic value of 2.2516 and a p-value of 0.0121, indicating a difference between the two groups.

Besides these, some other tests are also available, but no test is ideal for each and every situation. A test that is suitable in one situation may fail in another. Some give more weight to early events [23, 32] and some give more weight to late events [33].

3. Semi-parametric approach

3.1 Cox regression models

When diagnosing a disease, medical doctors investigate its causes and other characteristics. For example, does the heart patient have high blood pressure? Is a family history of diabetes related to the development of diabetes? In this case, high blood pressure and family history are referred to as covariates, risk factors or explanatory variables. Nowadays, identifying the most important risk factors is becoming an important task in managing disease. Regression analysis is generally used for identifying risk factors, but because of the presence of censoring, ordinary regression models are not suitable for survival data. For this purpose, Cox's regression model, also called the Cox proportional hazards (PH) model [34], is widely used in survival analysis. The proportional hazards regression model is very popular because of its simple concept and the accessibility of software [5, 35-39]. Like the logrank test, it is based on the proportional hazards assumption.

If a set of covariates is represented by $Z_i = (z_{1i}, z_{2i}, \dots, z_{ki})$, the Cox regression model is

$$h(t, z_i) = h_0(t) \exp(\beta z_i) \tag{6}$$

$$h(t, z_i) = h_0(t) \exp\left(\sum_{j=1}^{k} \beta_j z_{ji}\right), \quad j = 1, 2, \dots, k$$

where $h_0(t)$ is an unspecified baseline hazard function that depends on time only and is assumed to be the same for all subjects, $\beta_j$ are unknown parameters describing the importance of the covariates, and $z_{ji}$ are the values measured on subject i at time zero. The baseline hazard describes the shape of the distribution, while $\exp(\beta z_i)$ gives the level of each individual's hazard. The model is based on the assumption that the covariates affect the hazard in a multiplicative way [12]. z is a vector of covariates of interest, which may be discrete (sex, marital status), continuous (blood pressure) or a mixture of discrete and continuous factors (an interaction of height and sex). The main advantage of the Cox PH model is that we can estimate the parameters β without having to estimate $h_0(t)$. This fact makes the model semi-parametric.

The model can also be expressed as

$$\log\left\{\frac{h(t, z_i)}{h_0(t)}\right\} = \sum_{j=1}^{k} \beta_j z_{ji}$$

If we take $y_i = \log\left\{\dfrac{h(t, z_i)}{h_0(t)}\right\}$, the above equation becomes the multiple regression equation

$$y_i = \sum_{j=1}^{k} \beta_j z_{ji}$$

The hazards for two individuals with covariate values $z_1$ and $z_2$ are assumed to be proportional:

$$HR = \frac{h(t, z_1)}{h(t, z_2)} = \exp\bigl(\beta (z_1 - z_2)\bigr)$$

Cox's model assumes that the hazard ratio comparing two specifications of the covariates is constant over time. Examples of such covariates are the sex of the patient, treatment, etc. Consider a single covariate z taking two values, z = 0 if the patient receives treatment 1 and z = 1 if the patient receives treatment 2. In terms of proportional hazards, this can be written as $h_2(t) = e^{\beta} h_1(t)$. If $e^{\beta} < 1$, the hazard under treatment 2 is lower, i.e. treatment 2 is more effective than treatment 1. Both have the same effect if $e^{\beta} = 1$. If $e^{\beta} > 1$, treatment 1 is superior to treatment 2. β is the log hazard ratio.

The proportional hazards assumption in Cox regression means that the baseline hazard function is a function of time (t) only, whereas the exponential part involves the zi's but is independent of t. If the zi's do depend on t, they are called time-dependent covariates; otherwise they are time-independent covariates. Patient performance during the treatment period is an example of a time-dependent covariate. Several procedures have been suggested for checking this assumption [40-42]. An easy and simple method of checking the proportional hazards assumption in the presence of covariates is the plotting method. Suppose $z_1$ and $z_2$ are two different sets of covariates; draw the plots of $\log[-\log S(t, z_1)]$ and $\log[-\log S(t, z_2)]$ against time (or a function of time), where $S(t, z)$ is the survival function. If the two curves are parallel, there is no reason to believe that the proportional hazards assumption is violated. The method is easy for a few covariates but becomes tedious for a large number. Besides plotting the log cumulative hazard functions, there is another method, which introduces into the model a time-dependent term, the covariate multiplied by t or log t centred about a point t*:

$$h(t) = h_0(t) \exp\bigl[\beta_1 z + \beta_2 z (\log t - \log t^*)\bigr] \tag{7}$$

Test $\beta_2 = 0$; if it holds, the proportional hazards assumption is satisfied. If it does not, a time-dependent model can be used.

Since $h_0(t)$ is free from parametric assumptions, it is not possible to apply the full likelihood function for estimating the β's. For the model, Cox [34, 43] suggested an estimation procedure in which the analysis concentrates only on the effect of the covariates, leaving $h_0(t)$ completely unspecified. The β-coefficients in the Cox proportional hazards model can be estimated by the partial likelihood method; the partial likelihood is a function of the observed survival times and the unknown parameters:

$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\sum_{j=1}^{k} \beta_j z_{j(i)}\right)}{\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j=1}^{k} \beta_j z_{jl}\right)} \right] \tag{8}$$

where $t_{(i)}$, i = 1, 2, …, n, denote the n ordered exact failure times, $z_{j(i)}$ is the value of the j-th covariate for the individual failing at $t_{(i)}$, and $R(t_{(i)})$ consists of all individuals whose survival times are at least $t_{(i)}$. The summation in the denominator of the likelihood is the sum of the values of $\exp\left(\sum_{j=1}^{k} \beta_j z_{jl}\right)$ over all individuals who are at risk at this time. The likelihood depends only on the ranks of the failure times, and estimates based on the partial likelihood have larger variance than those based on the complete likelihood.
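The partial likelihood can be evaluated directly for a small data set. The sketch below assumes a single covariate and no tied event times; the function names and data are hypothetical, and the maximiser is a plain golden-section search chosen for brevity, not the Newton-Raphson iteration typically used by statistical packages.

```python
import math

def log_partial_likelihood(beta, times, events, z):
    """Cox log partial likelihood for one covariate, assuming no ties."""
    total = 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set R(t_i): everyone whose observed time is at least t_i.
        denom = sum(math.exp(beta * z[l])
                    for l in range(len(times)) if times[l] >= times[i])
        total += beta * z[i] - math.log(denom)
    return total

def fit_beta(times, events, z, lo=-5.0, hi=5.0, iters=100):
    """Maximise the concave log partial likelihood on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if (log_partial_likelihood(c, times, events, z)
                > log_partial_likelihood(d, times, events, z)):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2.0

# Hypothetical data: z = 1 for treatment 2, z = 0 for treatment 1.
times = [2, 3, 5, 7, 8]
events = [1, 1, 0, 1, 1]
z = [1, 0, 1, 0, 1]
beta_hat = fit_beta(times, events, z)
print("beta_hat:", beta_hat, "hazard ratio:", math.exp(beta_hat))
```

Note that only the ordering of the times enters the calculation: rescaling all times by a positive constant leaves the risk sets, and hence the estimate, unchanged.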

Tied survival times can be handled by using the likelihood form of Breslow [44]. This is simple to understand and is most suitable when the number of tied observations at any death time is not too large.

$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\sum_{j=1}^{k} \beta_j s_{j(i)}\right)}{\left[\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j=1}^{k} \beta_j z_{jl}\right)\right]^{d_i}} \right] \tag{9}$$

where $d_i$ denotes the number of tied survival times at $t_{(i)}$ and $s_{j(i)}$ is the sum of the j-th covariate over those $d_i$ individuals. In 1981, Tsiatis investigated its properties [45]. Other approaches for dealing with ties are those of Efron [46], Cox [34], Kalbfleisch and Prentice [47], etc. When the number of tied survival times is large, Cox assumed discrete survival times and generalized the model by introducing a logistic transformation

$$\frac{h(t)\,dt}{1 - h(t)\,dt} = \frac{h_0(t)\,dt}{1 - h_0(t)\,dt} \exp\left(\sum_{j=1}^{k} \beta_j z_{ji}\right)$$

All these procedures apply to the Cox proportional hazards model. If its assumption fails, the procedures cannot be used. There are several reasons for the failure of the assumption, of which the most common is the involvement of time-dependent covariates. Examples of time-dependent covariates are the marital or family status of a woman, which may change during the study period (e.g. from single to married, or from married to mother); the cholesterol level of a patient, which changes during the study period; and measurements taken at regular examinations of patients in a clinic. In time-dependent Cox regression, the values of a variable change over time. Time-dependent covariates are also referred to as updated covariates [5] and are of two types: 1) external and 2) internal variables [47].

An external variable is one whose values are in most cases known in advance. Such variables do not require the survival of the patient. Simple examples of external variables are the age of a patient and the dose of a drug. An internal variable can be measured only while the patient is alive. Examples are systolic blood pressure, white blood cell count, etc. For the analysis of time-dependent covariates the extended Cox model may be used, called the time-dependent or non-proportional hazards Cox model.

Consider a set of q covariates, of which q1 are time independent and q2 are time dependent. The modified form of the Cox proportional hazards model is obtained by dividing the exponential part into time-independent and time-dependent parts, i.e.

$$h(t, z_i) = h_0(t) \exp\left[\sum_{j=1}^{q_1} \beta_j z_{ji} + \sum_{j=1}^{q_2} \alpha_j z_{ji}(t)\right] \tag{10}$$

As in the Cox proportional hazards model, the regression coefficients of the time-dependent model are obtained by maximizing the partial likelihood. The likelihood for the time-dependent model is the same as that for the time-independent model except that z is replaced by z(t), so

$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\left(\sum_{j} \beta_j z_{j(i)}(t_{(i)})\right)}{\sum_{l \in R(t_{(i)})} \exp\left(\sum_{j} \beta_j z_{jl}(t_{(i)})\right)} \right] \tag{11}$$

3.2 Accelerated failure-time models

Although the Cox proportional hazards model is the most frequently used in survival analysis, there exist other models, for example accelerated failure-time models. In this model more attention is given to the survival time than to the hazard function. With censored data, if in linear regression analysis we replace the response variable $T_i$ by $\log T_i$, the resulting model is called the accelerated failure time model [6, 48, 49]. The typical form of the model is

$$\log T_i = \beta' Z_i + \varepsilon_i \tag{12}$$

The role of the covariates in the above equation is to accelerate (or decelerate) the time to failure. The error terms $\varepsilon_i$ are assumed to be independent and identically distributed with mean zero. Different choices of the distribution of ε lead to regression versions of different parametric survival models. A Weibull regression model is obtained if ε has an extreme value distribution, and a lognormal model is obtained if it has a standard normal distribution. Kay and Kinnersley [2] showed that the accelerated failure time model performs better than the proportional hazards model in applications where the effect of treatment is to accelerate or delay the event of interest. Equation (12) is referred to as a parametric model if the distribution of the baseline hazard function is specified; if not, it is called a semi-parametric model. Cox and Oakes [10] and Kalbfleisch and Prentice [47] described the parametric survival models in detail. The main reason for the unpopularity of the accelerated failure time model is its complicated estimation process, even if the data set contains only a small number of covariates [50]. Similarly, the accelerated failure time model is usually based on a parametric model, which may sometimes be difficult to fit, and this is one of the main reasons for the popularity of the proportional hazards model.
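The sense in which covariates "accelerate" time under the log-linear model $\log T = \beta' Z + \varepsilon$ can be seen from a short derivation. It is sketched here with $S_0$ denoting the survival function of $e^{\varepsilon}$, a symbol introduced only for this illustration:

```latex
\begin{align*}
S(t \mid Z) &= P(T > t \mid Z) = P(\log T > \log t \mid Z) \\
            &= P(\varepsilon > \log t - \beta' Z) \\
            &= S_0\!\left(t\, e^{-\beta' Z}\right),
\qquad S_0(s) = P\!\left(e^{\varepsilon} > s\right).
\end{align*}
```

The covariates thus rescale the time axis by the factor $e^{-\beta' Z}$: every quantile of the survival time, including the median, is multiplied by $e^{\beta' Z}$ relative to the baseline.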

4. Parametric Approach and Models

In parametric models, methods of estimation and inference based on the likelihood are straightforward, but they rest on stronger assumptions than semi-parametric and nonparametric models. Choosing a theoretical distribution that fits the data well is an art; the aim of this article is not to describe that art but to discuss some familiar survival distributions. A detailed description of the parametric methods of survival analysis was given by Lawless [51]. The standard distributions commonly used elsewhere in statistics are in most cases not suitable for survival data. The exponential, Weibull, gamma and log-logistic are the most familiar survival distributions. Of these, the simplest and most commonly used is the exponential distribution. Epstein [52] and Davis [53] gave examples that fit the exponential distribution. Similarly, Byar et al. [54] and Zelen [55] also applied the distribution. In the exponential distribution, the hazard function is assumed to be constant over time. A straight-line plot of $\log S(t)$ against t is an indication that the data follow the exponential distribution [56]; the slope of the line is $-\lambda$, so its negative is an estimate of the hazard rate λ. The survival function of the exponential distribution is $S(t) = \exp(-\lambda t)$, $\lambda > 0$, and the hazard function is $h(t) = \lambda$. Survival patterns of the exponential distribution are shown in Figure 4 for λ = 0.5, 0.15, 0.05.

The Weibull distribution plays the same part in the analysis of survival data as the normal distribution does in linear modelling. In 1939, Weibull proposed the distribution and in 1951 he discussed its applications [57, 58]. The Weibull distribution has two parameters, a shape parameter (p) and a scale parameter (λ), which allow it to take different shapes. The Weibull distribution reduces to the exponential distribution when $p = 1$, i.e. a constant hazard rate. If $p < 1$ the hazard rate decreases, and it increases for $p > 1$. Because of these characteristics, the Weibull distribution has broader application in survival analysis than the exponential distribution. The survival function of the Weibull distribution is $S(t) = \exp(-\lambda t^{p})$, which is also helpful for graphical presentation. Survival curves of the Weibull distribution are shown in Figure 5 for λ=0.5, p=0.5; λ=0.15, p=1; λ=0.05, p=2.

Like the Weibull distribution, the log-logistic distribution is characterized by two parameters (shape p and scale λ). If $p \le 1$ the hazard rate decreases monotonically, while for $p > 1$ it increases to a maximum and then decreases. The hazard function and survival function of the distribution are

$$h(t) = \lambda p (\lambda t)^{p-1} \left[1 + (\lambda t)^{p}\right]^{-1}, \qquad S(t) = \left[1 + (\lambda t)^{p}\right]^{-1}$$

Survival curves of the log-logistic distribution are shown in Figure 6 for λ=0.5, p=0.5; λ=0.15, p=1; λ=0.05, p=2.

p and λ are also the shape and scale parameters of the gamma distribution. When $p = 1$ ($p > 1$, $p < 1$), the hazard rate is constant (increases, decreases).
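The hazard shapes of the Weibull and log-logistic distributions can be compared numerically. The sketch below assumes the parameterisations S(t) = exp(-lam * t**p) for the Weibull (hence h(t) = lam * p * t**(p-1)) and S(t) = 1 / (1 + (lam*t)**p) for the log-logistic; the function names and the time grid are illustrative choices, and the parameter pairs are those quoted for the figures.

```python
def weibull_hazard(t, lam, p):
    """h(t) = lam * p * t**(p - 1), assuming S(t) = exp(-lam * t**p)."""
    return lam * p * t ** (p - 1)

def loglogistic_hazard(t, lam, p):
    """h(t) = lam * p * (lam*t)**(p-1) / (1 + (lam*t)**p)."""
    return lam * p * (lam * t) ** (p - 1) / (1.0 + (lam * t) ** p)

grid = [1.0, 5.0, 10.0, 20.0, 40.0]  # illustrative time points
for lam, p in [(0.5, 0.5), (0.15, 1.0), (0.05, 2.0)]:
    wb = [round(weibull_hazard(t, lam, p), 4) for t in grid]
    ll = [round(loglogistic_hazard(t, lam, p), 4) for t in grid]
    print(f"lam={lam}, p={p}")
    print("  Weibull hazard:     ", wb)
    print("  log-logistic hazard:", ll)
```

For p = 1 the Weibull hazard is flat while the log-logistic hazard still declines, and for p = 2 the log-logistic hazard rises to a peak and then falls, which is the main practical difference between the two families.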

The lognormal distribution also comes under the umbrella of two-parameter distributions. Boag [59] first applied the distribution in cancer research. The hazard function of the lognormal distribution involves the incomplete normal integral. The function increases

initially to a maximum and then decreases. The distribution has been frequently applied

in biomedical research.

Like the Cox model, parametric models can also be fitted to survival data using the
maximum likelihood method. The procedure is as follows.

Suppose that the survival times $t_1, t_2, \ldots, t_n$ are observed, that q of the n individuals
die at $t_{(1)} < t_{(2)} < \cdots < t_{(q)}$, and that the survival times of the remaining n-q (q < n)
individuals are censored. If $f(t)$ denotes the probability density function of the survival
time t and $S(t)$ the survival function, the likelihood function can be expressed as

$L = \prod_{i=1}^{n} \{f(t_i)\}^{\delta_i} \{S(t_i)\}^{1-\delta_i}$

where $\delta_i$ is an indicator variable that takes the value 0 when the survival time is
censored and 1 when it is uncensored. For data without censoring, the function reduces to
the product of the probability density functions. The function can be maximized with
respect to the unknown parameters.
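As an illustration of this likelihood, the sketch below works through the simplest parametric case, the exponential distribution with f(t) = λ exp(-λt) and S(t) = exp(-λt). For this model the censored log-likelihood is maximized in closed form by the number of deaths divided by the total follow-up time. The data values and function names are hypothetical and serve only to show the mechanics.

```python
import math


def exponential_loglik(lam, times, events):
    """Censored log-likelihood: sum of d*log f(t) + (1-d)*log S(t),
    with f(t) = lam*exp(-lam*t) and S(t) = exp(-lam*t)."""
    total = 0.0
    for t, d in zip(times, events):
        log_f = math.log(lam) - lam * t   # contribution of an observed death
        log_s = -lam * t                  # contribution of a censored time
        total += d * log_f + (1 - d) * log_s
    return total


def exponential_mle(times, events):
    """Closed-form maximizer: number of deaths / total follow-up time."""
    return sum(events) / sum(times)


# Hypothetical toy data: event = 1 for a death, 0 for a censored time.
times = [2.0, 3.0, 5.0, 7.0, 11.0, 4.0]
events = [1, 1, 0, 1, 0, 1]

lam_hat = exponential_mle(times, events)

if __name__ == "__main__":
    print(f"estimated hazard rate: {lam_hat:.3f}")
    print(f"log-likelihood at the estimate: "
          f"{exponential_loglik(lam_hat, times, events):.3f}")
```

For distributions such as the Weibull or log-logistic there is generally no closed-form solution, and the same log-likelihood would be maximized numerically.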

5. Computer Software

The Kaplan-Meier estimator, Greenwood's variance estimator, log-rank and weighted
log-rank tests, Cox regression (proportional hazards and time-dependent) and parametric
regression are available in well-known software packages such as R [31], SPSS [60],
S-Plus [61], SAS [62], STATA [63] and BMDP [64]. In STATA, the stset and sts commands are
used for drawing the Kaplan-Meier curves. In SAS there is PROC LIFETEST, while in S-Plus
the surv.fit (time, status) and survfit commands are used to obtain the survival
probabilities. Some packages also provide time-dependent analysis, but accelerated failure
time and parametric models are not very popular because software for them is not easily
available. This area needs special attention.
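To make concrete what these commands compute, the following is a minimal sketch of the Kaplan-Meier product-limit estimate in plain Python. It is not a substitute for the packages listed above, and the function name and toy data are hypothetical. Individuals censored at a death time are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : observed follow-up times
    events : 1 if the time is a death, 0 if it is censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Hypothetical toy data: one censored observation is tied with a death at t = 2.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    for t, s in kaplan_meier(toy_times, toy_events):
        print(f"t={t}  S(t)={s:.4f}")
```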

6. Conclusion

Techniques of survival analysis have had a substantial impact on the development of
medical research. Almost every medical journal contains articles that directly or indirectly
use the methods of survival analysis. In most research, semi-parametric and nonparametric
methods are preferred over parametric methods because of their relaxed assumptions. The
Kaplan-Meier method, the logrank test and Cox's proportional hazards model are commonly
used and popular in survival data analysis. When the proportional hazards assumption is
not satisfied, a time-dependent covariate is easily incorporated into the model using
available software. The development of computer software has made it easier to identify
the correct parametric distribution as well as the model. If the data follow a specific
distribution, the results obtained have smaller variance than those from nonparametric
methods.

Despite all these efforts, there is still scope for improving the understanding of basic
concepts, their applications and their presentation in research articles. Easy access to
software would make the underused accelerated failure time model more popular. The area
needs special attention: improving its application in research and developing various
extensions of the model may prove useful in the future.

References

1. Gross AJ, Clark VA. Survival Distributions: Reliability Applications in the

Biomedical Sciences. Wiley: New York, 1975.

2. Kay R, Kinnersley N. On the use of the accelerated failure time model as an

alternative to the proportional hazards model in the treatment of time to event

data: A case study in Influenza. Drug Information Journal 2002; 36:571-579.

3. Fleming TR, Lin DY. Survival analysis in clinical trials: past developments and

future directions. Biometrics 2000; 56:971-983.

4. Hougaard P. Fundamentals of Survival Data. Biometrics 1999; 55:13-22.

5. Altman DG, de Stavola BL. Practical problems in fitting a proportional hazards

model to data with updated measurements of the covariates. Statistics in Medicine

1994; 13:301-341.

6. Wei LJ. The accelerated failure time model: a useful alternative to the Cox

Regression model in Survival Analysis. Statistics in Medicine 1992; 11:1871-

7. Andersen PK. Survival Analysis 1982-1991: The second decade of the

proportional hazards regression model. Statistics in Medicine 1991; 10: 1931-

8. Collett D. Modeling survival data in Medical research. Chapman & Hall:

London, 1994.

9. Andersen PK, Borgan Ø, Gill RD, Keiding N. Statistical Models Based on

Counting Processes. Springer Verlag: New York, 1993.

10. Cox DR, Oakes D. Analysis of Survival Data. Chapman & Hall: London, 1984.

11. Klein JP, Moeschberger M. Survival Analysis. Springer Verlag: New York, 1997.

12. Marubini E, Valsecchi M G. Analyzing Survival Data from clinical trials and

observational studies. John Wiley & Sons: Chichester, 1995.

13. Maller R, Zhou X. Survival Analysis with Long-term Survivors. John Wiley &

Sons: Chichester, 1996.

14. Kaplan EL, Meier PL. Nonparametric estimation from incomplete observations.

Journal of the American Statistical Association 1958; 53: 457-481.

15. Greenwood Major. A report on the natural duration of cancer. In Reports on

public Health and Medical Subjects, vol. 33. His Majesty’s Stationery Office:

London, 1926; 1-26.

16. Peto R, Pike MC, Armitage P, Breslow NE, Cox DR, Howard SV, Mantel N,

McPherson K, Peto J, Smith PG. Design and analysis of randomized clinical

trials requiring prolonged observation of each patient. II. Analysis and examples.

British Journal of Cancer 1977; 35: 1-39.

17. Altman DG. Practical Statistics for Medical Research. Chapman & Hall/CRC:

Boca Raton 1991; 365-395.

18. Slud EV, Byar DP, Green SB. A comparison of reflected versus test-based

confidence intervals for the median survival time, based on censored data.

Biometrics 1984; 40: 587-600.

19. Zhao G. The homogenetic estimate for the variance of survival rate. Statistics in

Medicine 1996; 15:51-60.

20. Borkowf CB. A simple hybrid variance estimator for the Kaplan-Meier survival

function. Statistics in Medicine 2005; 24: 827-851.

21. Freireich EJ, Gehan E, Frei III E, Schroeder LR, Wolman IJ, Anbari R, Burgert

EO, Mills SD, Pinkel D, Selawry OS, Moon JH, Gendel BR, Spurr CL, Storrs R,

Haurani F, Hoogstraten B, Lee S. The effect of 6-mercaptopurine on the duration

of steroid-induced remissions in acute leukaemia: a model for evaluation of other

potentially useful therapy. Blood 1963; 21:699-716.

22. Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer: New York,

23. Gehan EA. A generalized Wilcoxon test for comparing arbitrarily singly-

censored samples. Biometrika 1965; 52: 203-223.

24. Rossa A, Zieliński R. A simple improvement of the Kaplan-Meier estimator.

Communications in Statistics - Theory and Methods 2002; 13: 147-158.

25. Mantel N. Evaluation of survival data and two new rank order statistics arising in

its consideration. Cancer Chemotherapy Report 1966; 50: 163-170.

26. Gehan EA. A generalized two sample Wilcoxon test for doubly-censored data.

Biometrika 1965; 52: 650-653.

27. Gill RD. Censoring and Stochastic Integrals, Volume 24, Mathematical Centre

Tracts. Amsterdam: Mathematisch Centrum 1980.

28. Tarone RE, Ware J. On distribution-free tests for equality of survival

distributions. Biometrika 1977; 64: 156-160.

29. Klein JP, Moeschberger M. Survival Analysis, Techniques for censored and

Truncated Data. Springer 1998.

30. Lin X, Wang H. A New Testing Approach for Comparing the Overall

Homogeneity of Survival Curves. Biometrical Journal 2004; 46: 489-496.

31. R Development Core Team. R: a language and environment for statistical computing.

Vienna, Austria, R Foundation for Statistical Computing, 2004.

32. Moreau T, Maccario J, Lellouch J, Huber C. Weighted log rank statistics for

comparing two distributions. Biometrika 1992; 79: 195-198.

33. Gray R, Tsiatis A. A linear rank test for use when the main interest is in

differences in cure rates. Biometrics 1989; 45: 899-904.

34. Cox DR. Regression models and life tables (with discussion). Journal of the

Royal Statistical Society Series B 1972; 34: 187-220.

35. Lin DY. Goodness-of-fit analysis for the Cox regression model based on a class

of parameter estimators. Journal of the American Statistical Association 1991; 86:

725-728.

36. Aalen OO. A linear regression model for the analysis of life times. Statistics in

Medicine 1989; 8: 907-925.

37. Lagakos SW, Schoenfeld DA. Properties of proportional-hazards tests under

misspecified regression models. Biometrics 1984; 40: 1037-1048.

38. Solomon PJ. Effect of misspecification of regression models in the analysis of

survival data. Biometrika 1984; 71: 291-298.

39. Bryson MC, Johnson ME. The incidence of monotone likelihood in the Cox

model. Technometrics 1981; 23: 381-383.

40. Pettitt AN, Bin Daud I. Case-weighted measures of influence for proportional

hazards regression. Applied Statistics 1989; 38:51-67.

41. Kay R. Goodness of fit methods for the proportional hazards regression model: a

review. Revue d'Épidémiologie et de Santé Publique 1984; 32: 185-198.

42. Andersen PK. Testing goodness of fit of Cox’s regression and life models.

Biometrics 1982; 38: 67-77.

43. Cox DR. Partial likelihood. Biometrika 1975; 62: 269-276.

44. Breslow N. Covariance analysis of censored survival data. Biometrics 1974; 30:

89-99.

45. Tsiatis AA. A large sample study of the estimate for the integrated hazard

function in Cox’s regression model for survival data. Annals of Statistics 1981; 9:

93-108.

46. Efron B. The efficiency of Cox’s likelihood function for censored data. Journal

of the American Statistical Association 1977; 72: 557-65.

47. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data.

Wiley: New York, 1980.

48. Koul H, Susarla V, Van Ryzin J. Regression analysis with randomly right

censored data. Annals of Statistics 1981; 9: 1276-1288.

49. Miller R. Least squares regression with censored data. Biometrika 1976; 63: 449-

50. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure

time model. Biometrika 2003; 90: 341-353.

51. Lawless JF. Statistical Models and Methods for Lifetime Data. Wiley: New York,

52. Epstein B. The Exponential Distribution and Its Role in Life Testing.

Industrial Quality Control 1958; 15: 2-7.

53. Davis DJ. An analysis of Some Failure Data. Journal of the American Statistical

Association 1952; 47: 113-150.

54. Byar, David P. Selecting Optimum Treatment in Clinical Trials Using Covariate

information, presented at the 1974 Annual Meeting of the American Statistical

Association, August 28, 1974.

55. Zelen M. Applications of Exponential Models to Problems in Cancer

Research. Journal of the Royal Statistical Society Series A 1966; 3: 368-398.

56. Lee ET. Statistical Methods for Survival Data Analysis. Lifetime Learning

Publications Belmont: California, 1980.

57. Weibull W. A Statistical Distribution of Wide Applicability. Journal of Applied

Mechanics 1951; 18: 293-297.

58. Weibull W. A Statistical Theory of the Strength of Materials, Ingeniors

Vetenskaps Akademien Handlingar, No. 151; The Phenomenon of Rupture in

Solids 1939; 293-297.

59. Boag JW. Maximum Likelihood Estimates of Proportion of Patients Cured by

Cancer Therapy. Journal of the Royal Statistical Society Series B 1949; 11: 15.

60. SPSS Corporation. SPSS 11.5 for windows. Chicago, 2004

61. SPLUS. Version 2.0, Statistical Sciences Inc., Seattle 1992.

62. SAS Software: SAS Institute Inc, Cary, NC, 1991.

63. STATA. Release 3: Reference Manual, Computing Research Center, Los

Angeles, 1989.

64. BMDP Statistical Software, Vol. II, Dixon, W.J. (Ed.). University of California

Press, Berkeley 1988.

Table 1. Number of deaths at the jth death time in each of two groups

Group   Number of deaths at $t_{(j)}$   Number surviving beyond $t_{(j)}$   Number at risk just before $t_{(j)}$
1       $d_{1j}$                        $n_{1j} - d_{1j}$                   $n_{1j}$
2       $d_{2j}$                        $n_{2j} - d_{2j}$                   $n_{2j}$
Total   $d_j$                           $n_j - d_j$                         $n_j$

Table 2. Results of Logrank, Wilcoxon and Tarone-Ware tests and their corresponding p-values for the leukaemia data set

Statistical test   Value    p-value
Logrank            16.793   <0.0001
Wilcoxon           13.995   0.0002
Tarone-Ware        15.124   <0.0001

Table 3. Results of Logrank, Wilcoxon and Tarone-Ware tests and their corresponding p-values for kidney dialysis patients

Statistical test   Value   p-value
Logrank            2.530   0.1117
Wilcoxon           0.002   0.9636
Tarone-Ware        0.403   0.5260

Figure 1: Pattern of Kaplan-Meier Survival Function by SPSS

Figure 2: Survival plot of Leukaemia Data Set (time in weeks; treatment group and placebo group)

Figure 3: Survival curves comparison of the estimated time to infection for kidney dialysis patients.

Figure 4: Survival curves of exponential distribution for different values of hazard rate (0.5, 0.15 and 0.05)

Figure 5: Survival curves of Weibull distribution for different values of shape parameter (0.5, 1 and 2).

Figure 6: Survival curves of log-logistic distribution for different values of shape parameter (0.5, 1 and 2).
