
Chapter 5

Bootstrap, permutation tests and LASSO

R is designed to perform powerful and complex series of computations. However, it is also very useful for doing simple calculations and you can, in fact, use it as a calculator (several of the authors have installed it on their smart phones for this very purpose). The more complex things we will learn later in the course depend on being able to do simple manipulations of data and so this is the first topic we will cover in the workshop.

The objectives of this session are for you to learn:

• How to construct bootstrap confidence intervals from first principles,

• How to do permutation tests from first principles,

• How to identify optimal model complexity for LASSO regression models,

• How to use the glmnet package.


5.1 Bootstrap: the basic idea

Imagine you hold a large database of clinical data and wish to give out a subset of it to your students to analyse. To make it concrete, let’s say there are N = 100 000 individuals for whom you have measured their BMI. To generate some fake data to represent the large database, you can type the following:

set.seed(888) # So you get the same data as me!
BMI = rlnorm(100000,3.2,0.12)
hist(BMI) # Yup, quite realistic BMI distribution

[Figure: histogram of BMI (frequency vs BMI)]

Let’s imagine you are going to give the 50 students under your supervision each a sample of size n = 100 BMI measurements and get them to calculate the median BMI, and you want to know how far off these will be from the real median BMI from the entire dataset (so you can increase the sample size for the students if it’s too far off). The standard error is a measure of sampling variability for a statistic such as the median, and you can calculate it easily for this example by doing the following:

• For each student:

– sample 100 people from the database,
– calculate the median and store for reference.

• Then calculate the standard deviation of the medians. This is the standard error.

set.seed(777) # So you get the same results as me!
medianoutput=c() # Put results in here
for(student in 1:50){
  minidata = sample(BMI,100) # sample of 100
  medianoutput[student] = median(minidata)
}
sd(medianoutput) # the SE


hist(medianoutput) # to see what it looks like
points(median(BMI),1,pch='+',col='red',cex=3)

[Figure: histogram of medianoutput, with the median of the full BMI data marked by a red ‘+’]

If you used the same random number seed as we did, then you should get a SE of about 0.300. But that is just an estimated SE based on a small number of repeat draws (in this case, 50, one per student). If you repeated the process many more times, you’d have a better estimate of the SE. Let’s try that, pretending we had 10 000 students:

set.seed(666) # So you get the same results as me!
medianoutput=c() # Put results in here
for(student in 1:10000){
  minidata = sample(BMI,100) # sample of 100
  medianoutput[student] = median(minidata)
}
sd(medianoutput) # the SE

hist(medianoutput) # to see what it looks like
points(median(BMI),1,pch='+',col='red',cex=3)


[Figure: histogram of medianoutput for the larger number of repeats, with the full-data median marked by a red ‘+’]

Now the SE is 0.365. If we repeated it with even more ‘students’, the SE would barely change (try it and see). For bigger and bigger samples of students, we don’t get a different SE. Therefore, we have determined the standard error to be 0.365 for

• the median

• of a sample of size 100

• from this dataset.

Now, what about the more usual situation: you have one dataset, and you want to work out the SE for some statistic, such as the median, for that dataset. We can stick with the existing scenario a little longer, and pretend you are one of the students, who has been given a dataset of 100 patients and wants to work out the SE for the median based just on that sample.

set.seed(555) # So you get the same results as me!
mydata = sample(BMI,100) # the boss gives you this


[Figure: histogram of the full BMI data (‘Full data’, left) and of your sample mydata (‘Your data’, right)]

The graphs above show the distribution of what you have in your data (right) and the actual population it was drawn from (left). Clearly, while these are not the same, they are quite similar. So you could repeat the earlier process to estimate the SE for the median, but resampling the sample you have rather than sampling again from the population.

set.seed(444) # So you get the same results as me!
medianoutput=c() # Put results in here
for(student in 1:10000){
  minidata = sample(mydata,100,replace=TRUE) # note the 2 changes
  medianoutput[student] = median(minidata)
}
sd(medianoutput) # the SE

hist(medianoutput) # to see what it looks like
points(median(mydata),1,pch='+',col='red',cex=3)


[Figure: histograms of medianoutput from repeatedly sampling the population (left) and from resampling the sample (right), each with the relevant median marked by a red ‘+’]

The standard error of this is 0.365, i.e. the same as if you’d been sampling the full population rather than resampling the sample of size 100!1

1 In general this is not the case, but they should be close.

What we’ve just done is to derive a bootstrap estimate of the standard error for an estimate. This works by treating the sample as if it were the population; resampling (with replacement!) samples of the same size as yours, so you get the correct sample size in the calculations; calculating the statistic (here the median, but it’s general) for each resample; and then quantifying the spread of the resulting statistics. The approach has two forms of error:

1. Error due to replacing the population by the actual sample. There is nothing you can do about this error, but it is small if your original sample is fairly large.

2. Error due to resampling a finite number of times. You can reduce this error by running your code for longer, for instance by changing for(student in 1:50) to for(student in 1:10000).

The standard error thus obtained can be used to construct a confidence interval in the usual way (±1.96 SE), or the confidence interval can be obtained by taking the quantiles of the resampled statistics:

quantile(medianoutput,c(0.025,0.975))

The quantile approach is slightly better as it does not presuppose the sampling distribution of the statistic to be normally distributed.
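For instance, continuing with the medianoutput vector from the resampling above, the two intervals can be computed side by side (a sketch; the exact numbers depend on the seed):

median(mydata)+c(-1.96,1.96)*sd(medianoutput) # normal approximation
quantile(medianoutput,c(0.025,0.975)) # percentile interval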

5.2 Bootstrap: using R functions

Although the code to bootstrap a confidence interval is quite simple:

output=c()
for(b in 1:10000){ # eg 10 000, can change
  bootsample = sample(data,length(data),replace=TRUE)
  output[b] = fn(bootsample) # fn just some function, eg median
}
quantile(output,c(0.025,0.975)) # CI

you may find you wish to implement it using a function.

bs=function(x,fn,nboot){
  output=c()
  for(b in 1:nboot){
    bootsample = sample(x,length(x),replace=TRUE) # resample x itself, not the global data
    output[b] = fn(bootsample)
  }
  return(quantile(output,c(0.025,0.975)))
}
bs(data,median,10000)

## 2.5% 97.5%
## 24.22581 25.29295

bs(data,mean,10000)

## 2.5% 97.5%
## 24.44064 25.64401

bs(data,sd,10000)

## 2.5% 97.5%
## 2.623573 3.474083

5.3 Permutation tests

The nice thing about the bootstrap is that it does not make any assumptions about the distribution of the data: in the examples above, we calculated confidence intervals for medians, but we did not assume the data were normal or even that the sampling distribution was normal. Given that even traditional non-parametric tests make assumptions about the distributions2, can we do something like the bootstrap to assess hypotheses?

It turns out we can test whether two distributions have the same value of some test statistic by doing a permutation test.Suppose that we can arrange the data into two vectors: x, containing the group membership (for simplicity let’s callthese 1 and 2), and y, containing the data values. Let’s say we wish to see whether the function fn (e.g. the median)is the same for the two groups. This will be the case if x and y are independent of each other.

If we were to shuffle the order of one of the vectors (x, say, becoming xs), then it will be the case that xs and y are independent. So if we shuffle the vector many, many times, and for each shuffle we calculate fn(y[xs==1])-fn(y[xs==2]), we can generate the distribution of this difference under the assumption that the null hypothesis is true. Then we can inspect where in this distribution the actual difference lies.

2 For instance, the Mann–Whitney–Wilcoxon test assumes that either the two distributions are the same or that one ‘stochastically dominates’ the other.

If you recall the definition of a p-value, it is the probability (conditional on the null hypothesis being true) of the test statistic being as or more extreme than it actually was. So if you calculate the proportion of samples in which the difference fn(y[xs==1])-fn(y[xs==2]) is further into the tails than fn(y[x==1])-fn(y[x==2]), you have the p-value. An example follows to clarify how the calculation is done.

5.3.1 Permutation test example

We’ll look at data3 from a cohort of patients with dengue. To read their data into R, assuming the following file is in your working directory, you can type:

denguedata=read.csv('data_Dengue_Singapore.csv')

These data are from patients who had dengue haemorrhagic fever (DHF) at presentation to the hospital, or who did not (but had dengue fever, DF). Let’s look to see whether the distribution of platelet counts (plotted below) differed between these two groups.

[Figure: histograms of platelet count (per cent of total) for the DF and DHF groups]

As the data are quite skewed, we’ll work with the median as the measure of centrality and assess whether it is different in the two groups of patients.

x=as.numeric(denguedata$DHFpresentation=='Yes') # convert to 0, 1
y=denguedata$Platelet
actualdifference = median(y[x==0])-median(y[x==1])

H0differences=c()
set.seed(333)
for(b in 1:10000){
  xs=sample(x) # shuffle the group labels
  H0differences[b]=median(y[xs==0])-median(y[xs==1])
}
pvalue=mean(H0differences>=abs(actualdifference))+
  mean(H0differences<= -abs(actualdifference))

3 These are actually synthetic data based on a cohort of patients. Their details have been mixed up a bit to protect their privacy.

When we ran this with this seed, we got a p-value of 0.0001. Of course, we could write a function to do it, which facilitates tests of equality of different functions of the data:

permtest=function(x,y,fn,nperm=10000){
  actualdifference = fn(y[x==0])-fn(y[x==1])
  H0differences=c()
  for(b in 1:nperm){
    xs=sample(x)
    H0differences[b]=fn(y[xs==0])-fn(y[xs==1])
  }
  pvalue=mean(H0differences>=abs(actualdifference))+
    mean(H0differences<= -abs(actualdifference))
  return(pvalue)
}
permtest(x,y,median)

## [1] 0

permtest(x,log(y),mean)

## [1] 0

permtest(x,y,sd)

## [1] 0.0013

5.4 LASSO

In the last two sections, we’ve seen two ways we can randomly resample or shuffle the dataset to perform non-parametric inference. A similar approach can be used in parametric inference to validate a particular model form and estimate its performance out of sample using cross-validation. We’ll see how this works in the context of using LASSO (least absolute shrinkage and selection operator) for regression.


5.4.1 Recap: regression models

In regression modelling, for each individual (i) we have one measurement (yi) that we will call the outcome, dependent or response variable, and one or more variables (x1i, x2i, etc) that we will call the input, independent or explanatory variables. The explanatory variables are either numeric or can be converted to numeric by introducing dummy variables. The response variable is either numeric (in which case we typically fit a linear regression), binary (logistic), counts (Poisson) or time/censoring indicators (Cox). Regardless of which model for yi we use, the fitting procedure involves optimising some function (minimising the sum of squares, maximising the log-likelihood, etc) over the space of values the coefficients of the xs can take.

Regression   Support of y       Model of y                       Link to x
Linear       ℝ                  yi ∼ N(µi, σ²)                   µi = β0 + β1x1i + β2x2i + . . .
Logistic     {0, 1}             yi ∼ Bin(1, pi)                  log(pi/[1 − pi]) = β0 + β1x1i + β2x2i + . . .
Poisson      {0, 1, 2, . . .}   yi ∼ Po(µi)                      log µi = β0 + β1x1i + β2x2i + . . .
Cox          (ℝ⁺, {0, 1})       yi arises at rate h(yi, xi)      h(yi, xi) = h0(yi) exp(β1x1i + β2x2i + . . .)

Two major issues that arise in regression modelling are: how to select the predictor variables if you have a lot of them, and how to overcome the tendency for the coefficients to be too large (e.g. if the predictors are correlated, or if there is separation in a logistic regression). Both can be addressed by shrinking the estimated coefficients towards or to zero: if a coefficient is shrunk to exactly zero, then the predictor it belongs to is excluded from the model. This can be effected using LASSO.

5.4.2 The LASSO objective function

In standard linear regression, we seek the values of β0, β1, . . . that minimise the sum of squared errors

∑i [β0 + β1x1i + β2x2i + . . . − yi]²

(summing over individuals i = 1, . . . , n), which is equivalent to maximising the (log) likelihood.

If we want to make the coefficients smaller, we can add a penalty term to the objective function that depends on the magnitude of the coefficients. In LASSO, the penalty term is the scaled sum of the absolute values of the coefficients4, so that we now minimise

∑i [β0 + β1x1i + β2x2i + . . . − yi]² + λ ∑k |βk|

where k runs over the K coefficients.

4 Note, some special approaches are used for the intercept β0 and the variance term σ². Details can be found in Hastie et al (2009) Elements of Statistical Learning 2e, for example.
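To make the objective concrete, here is a minimal sketch of it in R (the function name lasso_objective is ours, for illustration), assuming x is an n-by-K matrix of predictors, beta holds the K slopes and, as noted in the footnote, the intercept beta0 is not penalised:

lasso_objective = function(beta0, beta, x, y, lambda){
  residuals = beta0 + x %*% beta - y # fitted values minus observations
  sum(residuals^2) + lambda*sum(abs(beta)) # penalty excludes the intercept
}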

The scaling parameter λ determines how much the βs are shrunk towards 0: if λ is equal to 0, then there is no shrinkage at all, whereas if λ is huge, the function can be minimised by shrinking all the parameters back towards 0. For intermediate values, some βs will be set to 0 and others not. But how to determine λ?

Too little or too much shrinkage can lead to a low ability to generalise to other data, e.g. if quirks in the data lead you to confuse noise for signal. We can determine an optimal amount of shrinkage by partitioning the data into different sets: some data are used to try out different values of λ and to estimate the βs for each one, while the other data are used to assess the predictive performance of models of the different complexities. We typically do this using cross-validation.

5.4.3 Cross validation

A straightforward way to validate a statistical model is to build it using one dataset and validate it on a separate dataset. If you just have a single dataset to work with, then you can still do validation by either dividing the data in two and measuring the performance in the second part, or by repeatedly dividing the data in two and measuring the average performance in the repeated ‘second parts’. The latter is called cross-validation.

Commonly we divide the data into 10 segments, and use each successively as validation data, with the other 9 segments being used to build a model (to train it). Because the model will differ in each of the training sets, this is a way to validate a procedure rather than a particular model.
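As a minimal sketch of this in R, assuming a hypothetical data frame df with an outcome y and a predictor x:

set.seed(2020) # hypothetical seed
folds = sample(rep(1:10,length.out=nrow(df))) # randomly assign rows to 10 folds
errors = numeric(10)
for(k in 1:10){
  m = lm(y~x, data=df[folds!=k,]) # train on the other 9 segments
  pred = predict(m, newdata=df[folds==k,]) # predict the held-out segment
  errors[k] = mean((pred - df$y[folds==k])^2) # out-of-sample MSE for fold k
}
mean(errors) # the cross-validated estimate of prediction error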

55

Page 11: Bootstrap, permutation tests and LASSO · Chapter 5 Bootstrap, permutation tests and LASSO ... Let’s imagine you are going to give the 50 students under your supervision each a

The R Training Course (extra material) Saw Swee Hock School of Public Health

In the context of LASSO, we can use cross-validation to select the value of λ that optimises out of sample predictive performance. The procedure is as follows (a from-scratch R sketch follows the list):

1. Divide the data into 10 parts. For each of these:

(a) Let that part be the validation data and the other 9 parts be the training data.

(b) For a set of values of penalty term λ, find the optimal values of the coefficients β for that penalty.

(c) For each of these sets of coefficients, predict the validation data, and measure the accuracy of the predictions (e.g. by the mean squared error).

2. After repeating this for each fold, for each value of λ, find the average out of sample predictive accuracy. Select the value of λ that minimises the average error.

3. Refit the model to the entire dataset using the optimal value of λ.
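Here is a from-scratch sketch of the procedure, assuming a numeric predictor matrix x and response y (such as the fake data generated in the next subsection) and using the glmnet function introduced there; in practice cv.glmnet does all of this for you:

library(glmnet)
set.seed(111) # hypothetical seed
folds = sample(rep(1:10, length.out=nrow(x))) # step 1: divide into 10 parts
lambdas = exp(seq(log(10), log(0.01), length.out=50)) # a grid of penalties
cvmse = matrix(NA, 10, length(lambdas))
for(k in 1:10){
  fitk = glmnet(x[folds!=k,], y[folds!=k], lambda=lambdas) # step (b)
  pred = predict(fitk, newx=x[folds==k,]) # one column of predictions per lambda
  cvmse[k,] = colMeans((pred - y[folds==k])^2) # step (c): MSE per lambda
}
best = lambdas[which.min(colMeans(cvmse))] # step 2: best average performance
fit = glmnet(x, y, lambda=best) # step 3: refit to the entire dataset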

5.4.4 LASSO and cross-validation in R

To learn how to implement LASSO in R, we’ll start with some fake data, so that we know the ‘truth’ that the LASSO is meant to reproduce. We’ll make it a very simple, clean example, with n = 1000 individuals, each with 10 x variables measured (we’ll assume xik ∼ N(100, 20²) for i = 1, . . . , n and k = 1, . . . , 10) and one y variable, which has a mean that depends on the first three values of x but none of the other seven.

## Make fake dataset: the fakaset
set.seed(12345)
n=1000
x=matrix(rnorm(n*10,100,20),n,10)
y=rnorm(n,0.1*x[,1]+0.2*x[,2]+0.3*x[,3],10)

We’ll work with the glmnet package, which contains two functions we’ll use to build a model with LASSO. First, the function with the same name:

library(glmnet)
fit=glmnet(x,y)
plot(fit)
points(c(0,0,0),c(0.1,0.2,0.3),
  col=c('black','red','green'),pch=16) # 16 for a filled circle

This fits a sequence of models of varying complexity (as measured by λ or its transformation, the L1 norm, which is presented on the x-axis). For each level of complexity, the optimal coefficients of the xs are found and plotted in the y direction. We’ve overlaid as points the true values of the first three coefficients (the non-zero ones) in the same colour as the corresponding lines. We can see that as the L1 norm gets bigger (i.e. the λ gets smaller), first x3 enters the model (green), then x2 (red), then x1 (black), then the other coefficients (which in reality are zero) start to enter the model, while the coefficients of the first three variables grow towards their true values.


[Figure: LASSO coefficient paths (coefficients vs L1 norm); the upper axis gives the number of non-zero coefficients]

To identify the value of λ to use in the final model, we use the cv.glmnet() function to do ten-fold (by default) cross-validation, as follows:

cv=cv.glmnet(x,y)
#fit=glmnet(x,y,lambda = cv$lambda[which(cv$lambda>=cv$lambda.1se)])
plot(cv)

The chart shows, for a discrete set of penalty terms (cv$lambda if you want to extract them) on the x-axis, the estimated out of sample predictions’ mean squared error on the y-axis. The upper x-axis shows the number of non-zero parameters in the model, so you can see that the model becomes more complicated as we move from right to left, and that the prediction error stops getting better after 3–6 parameters have entered the model. The ‘optimal’ penalty with the minimal error is indicated by one of the dotted lines (with around 9 terms in the model). Also marked is the model with the highest penalty/fewest parameters within 1 standard error of the optimal (this is around 3 terms in the model, which does coincide with the number of coefficients there should be!). This so-called ‘1SE rule’ is advocated by Hastie et al5 among others, but the justification for it seems merely to be that it ‘works well’, which is a tad unsatisfying. (And note, if you repeated the ten-fold cross-validation many more times, the SE would reduce, so this estimate is not robust to different decisions about the implementation of the algorithm.)

5 Hastie, Tibshirani, Friedman (2009) The Elements of Statistical Learning, 2e.


[Figure: cross-validation curve from plot(cv): mean squared error vs log(lambda); the upper axis gives the number of non-zero coefficients, and dotted lines mark the minimum-error and 1SE penalties]

Despite the weak justification for the ‘1SE rule’, it does work quite well, so let’s use it. To fit the model with the optimal λ, we can either take the original fit and try to find the one with the closest λ, or rerun the LASSO with the target λ as the end of a sequence of values (the algorithm supposedly is more efficient when given a sequence of λs). To do the latter:

fit=glmnet(x,y,lambda = cv$lambda[which(cv$lambda>=cv$lambda.1se)])
plot(fit)
points(c(0,0,0),c(0.1,0.2,0.3),col=1:3,pch=16)

We see that in this instance, the three parameters are selected correctly, and the values of the parameters are all a bit smaller than the values used in the simulations (i.e. they have been shrunk back towards 0).

[Figure: coefficient paths for the refitted model (coefficients vs L1 norm), with the true coefficient values overlaid as points]


The optimal parameters can be obtained thus:

coef(fit,s=cv$lambda.1se)

The fitted values from the model can be obtained thus (for the original data, but you can replace with a new set of xs if you like):

predict(fit,newx = x,s=cv$lambda.1se)

Note that glmnet does not (currently) provide estimates of uncertainty (i.e. confidence intervals) or perform traditional hypothesis testing (to obtain p-values). However, with some work, you could embed it in a bootstrap routine to estimate CIs!
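For instance, a rough sketch of such a bootstrap, resampling rows of the data and refitting at the chosen penalty each time (assuming x, y and cv from above):

set.seed(99) # hypothetical seed
bootcoefs = matrix(NA, 200, ncol(x)+1) # 200 resamples; intercept plus 10 slopes
for(b in 1:200){
  rows = sample(nrow(x), replace=TRUE) # resample individuals with replacement
  fitb = glmnet(x[rows,], y[rows], lambda=cv$lambda.1se) # single chosen penalty (fine for a sketch)
  bootcoefs[b,] = as.numeric(coef(fitb))
}
apply(bootcoefs, 2, quantile, c(0.025,0.975)) # rough 95% CIs per coefficient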

The glmnet function can be applied to other regressions, not just linear regression, by setting the family option to one of “gaussian”, “binomial”, “poisson”, “multinomial”, “cox”, and “mgaussian”.
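For example, a logistic LASSO for a made-up binary outcome ybin (hypothetical, just to illustrate the syntax, and relevant to Challenges 5 and 6 below) would look like:

ybin = rbinom(nrow(x), 1, 0.5) # a made-up 0/1 outcome
cvb = cv.glmnet(x, ybin, family='binomial')
coef(cvb, s='lambda.1se')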


5.5 Practical: bootstrap

5.5.1 Challenge 1!

Take the data from the clinical dengue study in Singapore and construct 95% confidence intervals for the median platelet count among patients who have DHF on admission, who have DF on admission and subsequently develop DHF, and those who have DF throughout their hospitalisation. Construct also a 95% confidence interval for the difference in medians between patients who start with DF and progress to DHF, and patients who have DF throughout.

5.5.2 Challenge 2!

The sex ratio at birth is defined to be the ratio of the number of male births to the number of female births in a given area and interval of time. In ‘normal’ countries it is a little more than 1 (1.04–1.07) but can be affected by wars or natural disasters, or by sex selection. The sex ratio in India and China is highly skewed due to strong son preference.

The table below shows the number of births by ethnic group and gender of the child in Singapore for the year 2017. Find a 95% confidence interval for the sex ratio at birth for these groups, to assess whether Singapore could be subject to the same son-preference-related sex selection as China and India.

Group     Female    Male
Chinese   11 319    12 015
Indian     2 151     2 243
Malay      3 529     3 758
Other      2 123     2 386
All       19 202    20 402


5.6 Practical: permutation test

5.6.1 Challenge 3!

Laoprasopwattana et al (2014)6 collated data on children with severe dengue in Songklanagarind Hospital over a 22-year period, in an attempt to identify risk factors for death (12.6% of children with severe dengue over this time period died).

Use a permutation test to assess whether the mean weights (Bw, in kg) and/or ages are different for children who lived and those who died.

5.6.2 Challenge 4!

The primary vector for dengue is the mosquito Aedes aegypti7, a domesticated mosquito that likes living in urban environments. The mosquito is a very loving creature and forms life-long partnerships with his or her mate. Only the female bites (the males are vegetarian), and only when she needs blood for her eggs. A novel method of control involves infecting mosquitoes with the bacterium Wolbachia, which interferes with their reproduction and lifespan.

Segoli et al (2014)8 conducted field studies in Cairns, Australia, which is the only part of Australia to see autochthonous transmission. In a series of experiments, they introduced infected and uninfected males to a population of females and measured mating success.

Get the data for experiment 1. Take the data from the two control replicates with compatible and incompatible males (i.e. there were 20 females in each, and either 30 infected or 30 non-infected males).

First, look at whether the female managed to lay any viable eggs or not. Use a chi-squared test to assess whether the probability of being able to lay any viable eggs differed depending on whether the male was infected or not. (Clearly, the number of viable eggs is very different and no test is necessary to tell you that.)

Next, look at the number of eggs laid. Use a permutation test to assess whether the mean number of eggs laid (not necessarily viable) differs between the infected and uninfected groups.

6 Laoprasopwattana K, Chaimongkol W, Pruekprasert P, Geater A (2014) Acute Respiratory Failure and Active Bleeding Are the Important Fatality Predictive Factors for Severe Dengue Viral Infection. PLOS ONE 9(12): e114499. doi:10.1371/journal.pone.0114499
7 Although there are arguments that the genus should be Stegomyia, in public health and some entomological circles we prefer to stick to Aedes.
8 Segoli M, Hoffmann AA, Lloyd J, Omodei GJ, Ritchie SA (2014) The Effect of Virus-Blocking Wolbachia on Male Competitiveness of the Dengue Vector Mosquito, Aedes aegypti. PLOS Negl Trop Dis 8(12): e3294. doi:10.1371/journal.pntd.0003294


5.7 Practical: LASSO

5.7.1 Challenge 5!

Take the Thai paediatric dengue data, and use LASSO to develop a model to predict the outcome (death or not). Use a logistic regression form of LASSO.

5.7.2 Challenge 6!

Take the Singapore dengue data, and use LASSO to develop a model to predict the outcome (DHF or not) among those patients who do not have DHF on admission. Use a logistic regression form of LASSO.


Chapter 6

Survival analysis

Survival analysis is the study of the time it takes for events to happen. It is frequently used in medical statistics, to understand the time to remission of cancer, to the next psychiatric episode, or to death after release from hospital, although it has many other applications, in ecology, business, and recidivism, for example. Survival analytical techniques provide a way to understand, via analogies of regression, the effect of predictors on the development of the event of interest. Time to event data are complicated by the censoring that often results from patient drop-out or termination of studies, and this makes analysis of such data markedly different to other experimental or observational data.

These notes will guide you in learning to distinguish the different kinds of censored data, to fit and compare non-parametric summaries of survival data via the Kaplan–Meier method, and to develop semi-parametric regression models using the Cox proportional hazards model.

The first part of these notes introduces the key factors that distinguish survival data from other sorts of data commonly encountered in statistics: censoring, which often results from patients dropping out of long studies; and the survival and hazard functions, which are used to summarise survival data. We will then look at non-parametric methods to understand survival data.

6.1 Censoring, survival and hazards

6.1.1 Definitions

Censoring is when an observation is incomplete. The cause of the censoring must be independent of the event of interest if we are to use standard methods of analysis, but statistical theory does not tell you if the events are independent or not: you need to rely on your own knowledge and judgement to determine if this is a reasonable assumption. Although censoring does arise from time to time in non-survival data, for instance, when test readings cannot be made below a certain minimum threshold, it affects almost all survival analysis.

6.1.2 Examples of censored data

Imagine a study in which lung cancer patients are recruited to test the effect of a drug on their survival from lung cancer, e.g. by randomly allocating consenting patients in a hospital to receive the new drug or the current standard of care equiprobably.

Consider the following three patients:


Figure 6.1: Diagram of right censoring. Each line represents an individual. The grey box is the period of the study. Black dots indicate the (known) times at which events occur. White dots indicate censoring events. Some are due to the study ending, others are due to drop-out during the study.

• Patient A takes part in the study until her death at age TA. Her survival time is uncensored.

• Patient B takes part in the study until age TB. He then leaves the study. His survival time is censored: we know it is at least TB but we don’t know it precisely.

• Patient C takes part in the study until age TC. She then is hit by a car and dies. Her survival time with regard to the event of interest, namely death through lung cancer, is also censored: we know it is at least TC but the unfortunate accident has censored it.

6.1.3 Types of censoring

The most common form of censoring is right censoring. Both examples in the previous section were of right censoring. Here, the subject is followed until some time, at which the event has yet to occur, but then takes no further part in the study. This may be because:

• the subject dies from another cause, independently of the cause of interest;

• the study ends while the subject survives; or

• the subject is lost to the study, by dropping out, moving to a different area, etc.

If our data contain only uncensored and right-censored data, we can represent all individuals by the triple (i, ti, δi) (see Figure 6.2):

• i indexes subjects,

• ti is the time at which the death or censoring event occurs to individual i, and


Figure 6.2: Example data formatted in a spreadsheet. Three pieces of information can represent all individuals.

• δi is an indicator: δi = 1 if i is uncensored and δi = 0 if censored.

Most methods developed for survival data apply to right censored data only.
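As a small made-up example of this representation in R:

# Five individuals: t is the event or censoring time, delta = 1 if the event
# was observed and 0 if censored (all values hypothetical)
survdata = data.frame(i=1:5, t=c(3.2,5.0,1.7,5.0,2.4), delta=c(1,0,1,0,1))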

Other types of censoring are much rarer and are not considered in this short course!

6.1.4 Survival function

The survival function, S(t), is the probability of surviving at least to time t, i.e. S(t) = p(T > t). A very common way to estimate it non-parametrically was given by Kaplan and Meier (1958)1, which allows its estimation even in the presence of censoring. We’ll learn how to do that shortly. Being estimable in the presence of censoring is perhaps the main reason S(t) is favoured over other characteristics of the distribution, like the mean. Typically, the population survival function is smooth, though estimates of it are not.

1 Kaplan EL, Meier P (1958) J Am Stat Assoc 53:457–81.

6.1.5 Hazard function

The hazard function, h(t), is the instantaneous failure rate. Its formal definition requires some mathematical notation:

h(t) = lim(∆t→0) p(t < T < t + ∆t | T ≥ t) / ∆t

If you don’t have enough mathematical training to make sense of that equation, a loose way to think of the hazard is as the risk of the event happening right now given that it has not happened yet.

We can write S(t) as a function of h(t): S(t) = exp(−∫₀ᵗ h(u) du). Because the two are functions of each other, they contain the same information. We usually use the survival function for plotting and testing for differences, and the hazard function when doing regressions for observational studies.

6.1.6 Behaviour of hazard function

As can be seen in Figure 6.3, even though the hazard plots have very different shapes, the survival plots look very similar. As such, we usually don’t care about the exact shape of the hazard function, but just whether it is higher or lower for some groups (e.g. men and women, or drug and placebo arms).

Figure 6.3: Constant hazard (top): this may be appropriate for the time until dengue infection in aseasonal Singapore. Increasing hazard (middle): this may be appropriate for time of death of cancer patients who are not responding to treatment, or death rates among adult animals. Rising and falling hazard (bottom): this may be appropriate for time to onset of symptoms of tuberculosis following infection.

6.2 The Kaplan–Meier estimate of the survival function

The Kaplan–Meier estimate of the survival function is an empirical or non-parametric method of estimating S(t) from non- or right-censored data. It is extremely popular as it requires only very weak assumptions and yet utilises the information content of both fully observed and right-censored data. It comes as standard in most statistical packages (such as R and SPSS). The estimator was developed back in the 1950s, independently by Kaplan and Meier, who used it to understand the duration of cancer (Meier) and the lifetime of sub-oceanic telephone cables (Kaplan). They independently submitted their research on survival times to the Journal of the American Statistical Association, whose editor encouraged them to submit a joint paper, which they did in Kaplan and Meier (1958)2. Google Scholar has over 40 000 citations for this paper.

2 Kaplan EL, Meier P (1958) J Am Stat Assoc 53:457–81.

6.2.1 Motivating example: Leukæmia data

We consider remission times for two groups of leukæmia patients. Freireich et al. (1963)3 applied 6-Mercaptopurine and a placebo to 42 youths (≤ 20 years) with leukæmia in one of a series of landmark studies carried out at the US’s National Cancer Institute (you can read about Freireich and other leading scientists in the early war on cancer in the wonderful book The Emperor of All Maladies by Mukherjee). The times of interest are the duration of remission in weeks. These are:

6-MP     6, 6, 6, 7, 10, 13, 16, 22, 23, 6+, 9+, 10+, 11+, 17+, 19+, 20+, 25+, 32+, 32+, 34+, 35+

Placebo 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23

using the notation t+ to indicate a right censored observation at time t. So, for example, we have three patients in the 6-MP arm whose remission lasted 6 weeks, and another whose remission lasted at least 6 weeks (the 6+ entry).

3 Freireich et al (1963) Blood 21:699–716.

6.2.2 The estimate in the absence of censoring

Question: What proportion of the sample given a placebo survived to:

• time 0.0? Answer:

• time 0.9? Answer:

• time 1.0? Answer:

• time 1.1? Answer:

• time 2.0? Answer:

In the absence of censoring, the empirical survival function S(t) is a step function with heights equal to the proportion of the starting population surviving to the instant after t, i.e. S(t) = n(t+)/n(0).

Here we are using the notation t− to mean an instant before time t and t+ to mean an instant after, and n(t) is the number of participants in the study at time t.
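You can check your answers in R using the placebo times listed in the previous subsection (there is no censoring in that arm):

placebo = c(1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
S = function(t) mean(placebo > t) # proportion surviving past time t
sapply(c(0, 0.9, 1, 1.1, 2), S)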

6.2.3 Kaplan–Meier’s method in the presence of censoring

What do we do when there is censoring? The above formula requires generalising to the case when there are changing numbers of active participants in the study. The generalisation accounting for drop out is called the Kaplan–Meier estimate. The Kaplan–Meier estimate of S(t) is S(t) = S(t−) × p(T > t | T ≥ t).


If no failures occur at time t, p(T > t | T ≥ t) = 1. If one or more failures occur at time t, p(T > t | T ≥ t) = [n(t−) − d(t)]/n(t−), where d(t) is the number of participants who had the event at time t. Note that the Kaplan–Meier estimate does not change between events, nor at times when only censorings occur. It drops only at times when a failure has been observed.
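As a from-scratch sketch of this update rule (assuming vectors time and event, with event = 1 for failures and 0 for censorings; the survfit function introduced below does this properly, with confidence intervals):

km = function(time, event){
  ts = sort(unique(time[event==1])) # the distinct failure times
  S = numeric(length(ts)); s = 1
  for(j in seq_along(ts)){
    n = sum(time >= ts[j]) # n(t-): number still at risk just before ts[j]
    d = sum(time==ts[j] & event==1) # d(t): number of failures at ts[j]
    s = s*(1 - d/n) # S(t) = S(t-) x p(T > t | T >= t)
    S[j] = s
  }
  data.frame(t=ts, S=S) # the estimated step function, evaluated at failure times
}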

6.2.4 Confidence interval for S(t)

There is an ugly formula that estimates and approximates the variance of the survival function at any given time point, which we could try to use to construct asymptotic confidence intervals for S(t) using the usual S(t) ± 1.96 SE{S(t)}. However, this may lead to a CI that exceeds 1 or is less than 0, which is not possible for a survival probability. An alternative was proposed by Kalbfleisch and Prentice (2002)4 that gets around this problem, which involves getting a variance estimate for the complementary log-log function of the survival probability, log{− log S(t)}, creating a 95% confidence interval on the complementary log-log scale, and then back transforming to the survival scale. This ensures the 95% CI lies entirely between 0 and 1. In R, you should select this option.

4 Kalbfleisch JD and Prentice RL (2002) The Statistical Analysis of Failure Time Data, 2e, London: Wiley.

6.2.5 Finding the Kaplan–Meier estimate in R

We need to learn how to plot the Kaplan–Meier curve and confidence intervals for it. This can be done quite easily using the survival package. Let’s see how it works using an example from a clinical trial in which patients with bipolar disorder were randomised to omega 3 or a placebo and the time until their next attack of bipolar disorder was recorded (Stoll et al. 1999). The data are available in the file bipolar.csv on the webpage. There are three columns in the dataset: t (time), d (censoring) and o (treatment). The first gives the survival or censoring time in days, the second indicates if the observation is an event or a censoring, and the third indicates if the patient received omega 3 oil (omega) or the placebo (placebo).

Here is the code:

library(survival)
bipolardata = read.csv('bipolar.csv') #(1)
fit = survfit(Surv(t,d)~o, conf.type='log-log', data=bipolardata) #(2)
plot(fit, conf.int=TRUE, mark.time=FALSE, col=c('black','red'),
  xlab='Time (days)', ylab='Probability of no relapse') #(3)

(1) Reads in the dataset if it is contained in the current directory.

(2) Derives the Kaplan–Meier estimates and stores them in an object called fit. The survfit function is used to do this. The main argument of the function is a model equation, in this case Surv(t,d)~o. This looks like the model equation for regression or generalised linear models, with the response variable on the left and the predictor on the right (the predictor is the binary variable o that tells us if the patient got omega 3 oil or not). However, in this case, the response is not just a simple y variable, but the ‘survival object’ Surv(t,d). This brings together the information on the time of the event and whether the event was an attack of bipolar disorder or a censoring event. The other arguments tell it to use complementary log-log confidence intervals (see above for a discussion) and to look in the bipolardata object to find t, d, and o.

(3) Plots the fitted Kaplan–Meier estimates. The optional arguments tell the function to plot confidence intervals, not to mark the times at which censorings occurred, to make the placebo black and the drug red, and how to label the axes.


Figure 6.4: Kaplan–Meier curves (S(t) vs t) for a randomised clinical trial of omega 3 supplements (orange) vs placebo (blue) for bipolar disorder.

6.2.6 Testing differences in survival curves

As in most statistics, a key objective is to test whether subpopulations behave in the same way. For example, a group of patients was allocated to treatment with omega 3 oils or a placebo by Stoll et al (1999)5 and the duration without an attack recorded and plotted (orange for omega 3, blue for placebo, Figure 6.4). It appears that the two treatment arms do differ, with the survival curve of the patients on the drug lying above that of the placebo patients. Is this a genuine finding, or can it be explained by a small sample size giving the spurious impression of a difference?

5 Stoll et al (1999) Arch Gen Psychiatry 56:407–12.

Various tests have been proposed for testing for differences in survival between categorical covariates; we present three named ones. They all have a very similar structure, but have different power depending on what the exact nature of the difference between the survival curves is.

6.2.7 Test with two categories

Initially we limit attention to two categories.

Hypotheses:

H0: S1(t) = S2(t) for all t
H1: S1(t) ≠ S2(t) for at least one t

Assumptions: This test makes several important assumptions.

• Censoring is independent of group.

• The total number of events in each arm is ‘large’.


P-value: The p-value, p = Pr(Q > q | H0 true), where Q is the test statistic and q its observed value, follows from the CDF of the χ² distribution with 1 degree of freedom.

Weights: The various tests differ in terms of the weights used, and hence the kind of discrepancies between the survival functions that the test is best able to pick up.

• The log-rank or Mantel–Haenszel test uses wi = 1 (Peto and Peto, 1972). It puts emphasis on larger values of time.

• The (generalised) Wilcoxon test uses wi = ni (Gehan, 1965; Breslow, 1970). It puts emphasis on smaller values of time.

• The Tarone–Ware test uses wi = √ni (Tarone and Ware, 1977). It puts emphasis on intermediate values of time.

6.2.8 Tests with multiple categories

If there are ≥ 3 groups to be compared, similar tests can be constructed by generalising the 2-group case.

Hypotheses:

H0: S1(t) = S2(t) = · · · = SK(t) for all t
H1: There is at least one pair of categories k and j such that Sk(t) ≠ Sj(t) for at least one t.

6.2.9 In R

Running these tests in R is simple. We use the function survdiff to test for differences between two or more survival curves, which uses the following syntax:

survdiff(Surv(t,d)~o, data=bipolardata, rho=0)

The first part is the survival object we used in plotting the Kaplan–Meier curves, and in this case we ‘regress’ it on the variable o (omega oils or placebo). The rho option allows us to switch between the log-rank (rho=0), (generalised) Wilcoxon (rho=1) and Tarone–Ware options, i.e. it controls the weights mentioned above.

6.2.10 Discussion

These tests are very useful in assessing whether a covariate affects survival, and, for a randomised trial, might be all we need to demonstrate that an intervention works. However, they do not allow us to say how survival is affected. Ideally, we’d like to be able to say how much more at risk one group is than another. We’d also like to be able to incorporate non-categorical covariates, such as age, and several variables, to deal with confounding in an observational study.

This can be done using Cox’s proportional hazards model, which we will discuss in the next part of the course. The Cox model also allows you to adjust for confounders and is thus a form of regression model.

The Kaplan–Meier approach has one major strength: its robustness. It makes no assumptions about the distribution of survival times and is therefore appropriate for the whole gamut of patterns that can arise in practice. However, it has weaknesses. A minor weakness is that it is biased for small times, before the first event has occurred. The major weakness is that it provides no mechanism to adjust for confounders, except to stratify the analysis and analyse a few subgroups independently. This limits its use to randomised trials (why?), to exploratory analysis, or to summarising single populations. To adjust for confounding, we typically use Cox regression.


6.3 Cox regression

In the last section we considered testing for a difference in survival based on a categorical covariate, such as treatment. This lets us know if there is a difference, but it doesn’t help us answer how much more at risk one individual is than another. Similarly, it is not ideal when dealing with a continuous covariate: we can arbitrarily bin the covariate into groups, but a different grouping will give a different test result. Importantly, it does not adjust for confounders, making it difficult to justify for observational studies.

In this section, we seek to achieve two things:

• incorporate continuous covariates into our survival analysis; and

• analyse the effect of (not just the presence of) covariates on survival.

We do these within a unified framework, namely using Cox’s proportional hazards model (PHM).

6.3.1 Who is Cox?

David Cox is an English statistician, and a renowned one at that. He has written over 300 papers or books on a variety of topics, has advised government, was knighted for his contribution to science, and holds numerous fellowships and awards. His paper introducing the proportional hazards assumption and inference for it (Cox, 1972) has been cited around 20 000 times, according to Google Scholar.

6.3.2 Recall: linear regression

In linear regression, we use a predictor variable x to explain some of the uncertainty in a response variable y. If the ith individual has independent and dependent variables xi and yi, respectively, the linear model is yi = β0 + β1xi + εi, where the εi are assumed to have mean 0 and standard deviation σ (they may also be assumed normally distributed). Note that this is a model, and it depends on certain assumptions, e.g. that the relationship is linear. Note also that as εi ∈ (−∞, ∞) and β0 + β1xi ∈ (−∞, ∞), so must yi ∈ (−∞, ∞).

6.3.3 Survival regression

How can we do likewise for survival data? We choose to focus on models for the hazard function, as this allows statements such as “the risk to males is X times the risk to females” more readily than using the survival function as our basis.

A natural first guess for such a regression survival model would be h(t, x) = β0 + β1x, i.e. the hazard at time t for an individual with covariate(s) x is β0 plus β1 times the covariate(s).

There is no “error” term as the randomness is implicit to the survival process, for the same reasons there are no “errors” in a logistic regression model. Here we have used the notation h(t, x) to be the hazard function for an individual whose independent variable has the value x, while β0 is a baseline hazard function (for the time being assumed constant in time t) for individuals with x = 0.

However, this is a bad model. The range of h(t, x) may extend below zero for certain values of β0 or β1, but the range of h(t, x) must be ≥ 0. Luckily, a similar problem has arisen and been solved in generalised linear modelling. There, the predictors are incorporated into different distributions for the dependent variable. For a Poisson model, the mean must be positive, and the exponential function is used as the canonical link function between covariates and mean. We thus follow suit by exponentiating the covariate terms: h(t, x) = exp(β0 + β1x) = h0 exp(β1x) > 0.


We can have more than one predictor but for simplicity let us assume a single one for now (we’ll see some examples with more, later). Note that for a cohort with identical predictors x, the above form implies that lifetimes are exponentially distributed, which we know to be unrealistic for most public health problems.

6.3.4 The Cox proportional hazards model

We therefore consider the following generalisation: h(t, x1, x2, . . .) = h0(t) exp(β1x1 + β2x2 + . . .). Note that we have decomposed the hazard into a product of two items:

• h0(t), a term that depends on time but not the covariates; and

• exp(β1x1 + β2x2 + . . .), a term that depends on the covariates but not time.

This is the Cox PHM. The beauty of this model, as observed by Cox, is that if you use a model of this form, and you are interested in the effects of the covariates on survival, then you do not need to specify the form of h0(t). Even without doing so you may estimate the βs. The Cox PHM is thus called a semi-parametric model, as some assumptions are made (on exp(β1x1 + β2x2 + . . .)) but no form is pre-specified for the baseline hazard h0(t).

To see why it is called the PHM, consider two individuals A and B with covariates x = a and x = b (which we can treat for simplicity as scalars). Then their hazards at time t are

h(t, a) = h0(t) exp(βa)
h(t, b) = h0(t) exp(βb) = h(t, a) exp(β[b − a]).

In other words, h(t, a) ∝ h(t, b), i.e. the hazards are proportional to each other and their ratio does not depend on time. In particular, the hazard for the individual with covariate a is exp(β[a − b]) times that of the individual with covariate b. This term, exp(β[a − b]), is called the hazard ratio comparing a to b.

If you considered two individuals such that a = b + 1, i.e. with the covariate just one unit different from each other, then the hazard ratio becomes exp(β), which we usually call the hazard ratio for that covariate, as it measures how each unit change in the predictor scales the hazard.

If β = 0 then the hazard ratio for that covariate is equal to exp(0) = 1, i.e. that covariate doesn’t affect survival. Thus we can use the notion of hazard ratios to assess if covariates influence survival. The hazard ratio also tells us how much more likely one individual is to die than another at any particular point in time. If the hazard ratio comparing men to women were 2, say, it would mean that, at any instant in time, men are twice as likely to die as women.

Note however that this is a model: it could be wrong. There may be an interaction between covariates and time, in which case hazards are not proportional. There are ways to check for violations of the proportional hazards assumption and to extend the PHM to incorporate such interactions. Similarly, there is no reason why we should expect the log of the hazard function to be linear in the covariates. Unlike in linear regression, there is no simple way to absorb additional variation (the σ term) as the randomness in the data is generated implicitly by the hazard function, so that leaving a predictor out cannot really be rectified by hoping an error term can capture that source of variation.

6.3.5 Hypothesis testing for the PHM

There are two tests that are very useful in testing the hypothesis that one or more covariates have no effect. These are the Wald and the (partial) likelihood ratio tests. We’ll consider just the Wald test for this workshop, but if you’d like to know about the (partial) likelihood ratio test, let us know (it is useful for testing between competing, nested models). The Wald test has null and alternative hypotheses:


H0 : β = 0

H1 : β ≠ 0

for the model h(t, x) = h0(t) exp(βx). For models with multiple parameters, it is most often convenient to use the Wald test for one parameter at a time.

For a test of a single parameter being equal to 0, the Wald test statistic is z = β̂/SE(β̂), where β̂ is the estimated coefficient. If the null hypothesis is true, then z should be a realization from a N(0, 1) distribution. Large values of |z| support the alternative hypothesis.
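
To make this concrete, here is a minimal sketch of the Wald calculation in R, using the coefficient and standard error that will reappear in the age example of Section 6.3.8:

beta.hat = -0.012878 # estimated coefficient for AGE (Section 6.3.8)
se.hat = 0.007188    # its standard error
z = beta.hat/se.hat  # Wald statistic: about -1.79
2*pnorm(-abs(z))     # two-sided p-value: about 0.073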

6.3.6 Fitting the Cox PHM in R

The survival package has built-in routines to fit Cox PHMs. As with Kaplan–Meier, the response variable is a survival object rather than a single variable. The syntax is:

fit=coxph(Surv(time,censoring)~predictor1+predictor2)

This can then be summarised using

summary(fit)

which returns parameter estimates, confidence intervals and Wald test results.
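
If you prefer to extract these quantities programmatically rather than read them off the printout, a sketch (assuming fit is a fitted coxph object as above):

exp(coef(fit))    # hazard ratios
exp(confint(fit)) # 95% confidence intervals for the hazard ratios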

6.3.7 Examples

We consider the following special cases one at a time:

• one continuous covariate;

• two continuous covariates;

• one binary covariate;

• one categorical covariate; and

• one continuous and one categorical/binary covariate.

Note, though, that the approach generalises to more complex models in an obvious way. Each will be illustrated by an example of an analysis of survival data using the uis data (in the file uis.csv). These relate to the length of time drug users are able to avoid drug use following a residential treatment programme. Eight covariates were also recorded. Refer to Hosmer et al (2008)⁶ for more details and references. In all these examples, TIME is the survival time (until first re-use of drugs or leaving the study [note that it was considered likely that those leaving the study had taken up drug use again, and so these are not considered to be right censored in this particular example]) or the time of an (unspecified) right-censoring event.

⁶ Hosmer et al (2008) Applied Survival Analysis 2e. London: Wiley.


6.3.8 A single continuous covariate

Theory:

• The covariate is x ∈ R.

• The parameter is β ∈ R.

• The hazard rate is h(t, x) = h0(t) exp(βx).

• The hazard ratio for two individuals with covariates xa and xb is exp{β(xa − xb)}. Increasing x by one unit scales the hazard rate by exp{β([x + 1] − x)} = e^β.

Example: age of drug addicts at the time of admission to the programme. Here is the R code and output:

library(survival)
data=read.csv('uis.csv')
fit1=coxph(Surv(TIME,CENSOR)~AGE,data=data)
summary(fit1)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ AGE, data = data)
##
##   n= 623, number of events= 504
##    (5 observations deleted due to missingness)
##
##          coef exp(coef) se(coef)      z Pr(>|z|)
## AGE -0.012878  0.987204 0.007188 -1.792   0.0732 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##     exp(coef) exp(-coef) lower .95 upper .95
## AGE    0.9872      1.013    0.9734     1.001
##
## Concordance= 0.52  (se = 0.014 )
## Rsquare= 0.005   (max possible= 1 )
## Likelihood ratio test= 3.25  on 1 df,   p=0.07
## Wald test            = 3.21  on 1 df,   p=0.07
## Score (logrank) test = 3.21  on 1 df,   p=0.07

The fitted hazard rate is h(t, ai) = h0(t) exp{−0.013ai}, where ai is the age of individual i. Thus, each year of an addict's age is estimated to multiply the risk of taking drugs again by 0.987 (95% confidence interval: 0.97, 1.00). This turns out not to be statistically discernible from no effect (p = 0.073, from the Pr(>|z|) cell and the Wald test row), and so we have no reason to reject the hypothesis that age has no effect on reversion to drug use.
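
The fitted coefficient can also be used for other contrasts. For example, a sketch of the estimated hazard ratio comparing two addicts ten years apart in age:

exp(10*coef(fit1)["AGE"]) # about 0.88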

6.3.9 Two continuous covariates

Theory: (using scalar notation)

• The covariates are (x1, x2) ∈ R2.

• The parameters are (β1, β2) ∈ R2, or (β1, β2, β12) ∈ R3 if there is an interaction between x1 and x2.


• With no interaction, the hazard rate is h(t, x1, x2) = h0(t) exp(β1x1 + β2x2).

• With an interaction, the hazard rate is h(t, x1, x2) = h0(t) exp(β1x1 + β2x2 + β12x1x2).

• With no interaction, the hazard ratio for two individuals with covariates (x1A, x2) and (x1B, x2) is exp{β1(x1A − x1B)}, i.e. it depends on x1 only if x2 is fixed. Increasing x1 by one unit while keeping x2 fixed scales the hazard rate by e^β1. A similar interpretation holds for β2 by symmetry. With an interaction, the βs can no longer be interpreted thus. The effect of increasing x1 while keeping x2 fixed depends on the value of x2.

Example: number of previous drug treatments and depression of drug addicts. Let ni be the number of previous drug treatments and bi be the Beck depression score for individual i. Here, ni is in the range [0, 40] and bi in [0, 54].

fit2=coxph(Surv(TIME,CENSOR)~BECKTOTA+NDRUGTX,data=data)
summary(fit2)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ BECKTOTA + NDRUGTX, data = data)
##
##   n= 580, number of events= 468
##    (48 observations deleted due to missingness)
##
##              coef exp(coef) se(coef)     z Pr(>|z|)
## BECKTOTA 0.009553  1.009599 0.004818 1.983   0.0474 *
## NDRUGTX  0.030238  1.030699 0.007649 3.953 7.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##          exp(coef) exp(-coef) lower .95 upper .95
## BECKTOTA     1.010     0.9905     1.000     1.019
## NDRUGTX      1.031     0.9702     1.015     1.046
##
## Concordance= 0.554  (se = 0.014 )
## Rsquare= 0.03   (max possible= 1 )
## Likelihood ratio test= 17.58  on 2 df,   p=2e-04
## Wald test            = 19.63  on 2 df,   p=5e-05
## Score (logrank) test = 19.78  on 2 df,   p=5e-05

The fitted hazard rate for the no-interaction model is h(t, bi, ni) = h0(t) exp{0.00955bi + 0.0302ni}.

A Beck scale of 0–9 indicates no depression, 10–18 indicates mild–moderate depression, 19–29 indicates moderate–severe depression and 30–63 indicates severe depression (Beck et al., 1988). Thus the hazard ratio for a "typical" addict with severe depression (a score of 40) relative to one with mild depression (15) is exp{(40 − 15) × 0.00955} = 1.27. The p-value for the hypothesis that there is no effect of depression is just under 5%, indicating some weak evidence of an effect, with more depressed drug users more likely to go back to drug use. The hazard ratio for someone who has already undergone 5 treatments for drug addiction (the mean for these data) relative to someone who has never had any treatment is exp{(5 − 0) × 0.0302} = 1.16. A serial user with 20 prior treatments has a hazard rate exp{(20 − 0) × 0.0302} = 1.83 times that of someone undergoing his/her first treatment. This is highly significant, with the p-value for the hypothesis that there is no effect of prior treatment being less than 0.01%.
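
These hazard ratios can be reproduced from the fitted model rather than by hand, e.g.:

b = coef(fit2)
exp((40-15)*b["BECKTOTA"]) # severe vs mild depression: about 1.27
exp((20-0)*b["NDRUGTX"])   # 20 prior treatments vs none: about 1.83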

The interaction model is fitted using the following code:


fit3=coxph(Surv(TIME,CENSOR)~BECKTOTA*NDRUGTX,data=data)
summary(fit3)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ BECKTOTA * NDRUGTX, data = data)
##
##   n= 580, number of events= 468
##    (48 observations deleted due to missingness)
##
##                       coef exp(coef)  se(coef)     z Pr(>|z|)
## BECKTOTA         0.0042471 1.0042561 0.0063580 0.668    0.504
## NDRUGTX          0.0078984 1.0079297 0.0192870 0.410    0.682
## BECKTOTA:NDRUGTX 0.0012341 1.0012349 0.0009519 1.296    0.195
##
##                  exp(coef) exp(-coef) lower .95 upper .95
## BECKTOTA             1.004     0.9958    0.9918     1.017
## NDRUGTX              1.008     0.9921    0.9705     1.047
## BECKTOTA:NDRUGTX     1.001     0.9988    0.9994     1.003
##
## Concordance= 0.555  (se = 0.014 )
## Rsquare= 0.033   (max possible= 1 )
## Likelihood ratio test= 19.24  on 3 df,   p=2e-04
## Wald test            = 22.81  on 3 df,   p=4e-05
## Score (logrank) test = 23.29  on 3 df,   p=4e-05

None of these terms is now significant, although the model as a whole is (the p-values at the bottom of the output are all small). We would therefore drop the interaction term and go back to the no-interaction model.
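
A more formal way to choose between the two nested models is the (partial) likelihood ratio test mentioned in Section 6.3.5; a sketch using anova() on the two fits:

anova(fit2, fit3) # a small p-value would favour keeping the interaction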

6.3.10 A single binary covariate

Theory:

• The covariate is x ∈ {0, 1}. If the set has alternative labels, relabel them 0 and 1.

• The parameter is β ∈ R.

• The hazard rate is h(t, x) = h0(t) exp{βx}. In particular:

h(t, 0) = h0(t)

h(t, 1) = h0(t)e^β

It is obvious that there are just two hazard rates and that for category 1 the hazard rate is e^β times that for category 0.

• The hazard ratio for group 1 relative to group 0 is e^β.

Example: the effect of “race” on the effectiveness of the drug treatment. Individuals have been classified as “white” (or, more appropriately, pinkish-grey) and “other” (it is not clear what people with both European and non-European ancestry are classified as). Let us denote “white” as 0 and “other” as 1, and the “race” of individual i as ri.


fit4=coxph(Surv(TIME,CENSOR)~RACE,data=data)
summary(fit4)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ RACE, data = data)
##
##   n= 622, number of events= 504
##    (6 observations deleted due to missingness)
##
##         coef exp(coef) se(coef)      z Pr(>|z|)
## RACE -0.2846    0.7523   0.1059 -2.687   0.0072 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##      exp(coef) exp(-coef) lower .95 upper .95
## RACE    0.7523      1.329    0.6113    0.9258
##
## Concordance= 0.532  (se = 0.011 )
## Rsquare= 0.012   (max possible= 1 )
## Likelihood ratio test= 7.59  on 1 df,   p=0.006
## Wald test            = 7.22  on 1 df,   p=0.007
## Score (logrank) test = 7.27  on 1 df,   p=0.007

The fitted hazard rate is h(t, ri) = h0(t) exp(−0.285ri); that is, the hazard rate for “others” is 0.75 times (95% confidence interval 0.611–0.926) that of “whites”. The p-value is less than 1%, so there is strong evidence that the residential programme is better at treating “others” than “whites”.

6.3.11 A single categorical covariate or factor

Theory:

• The covariate is x ∈ {c0, c1, . . . , cK−1}. We cannot simply relabel these {0, 1, . . . , K − 1} as we did for two categories, as there is no reason why the hazard ratio for group ci+1 relative to ci should be e^β for all neighbouring pairs. One way to deal with this is just to inform your software package that these are factors or categorical data (in R we do this using the factor() command). Alternatively, we can create dummy binary variables Cik = 1 if xi = ck and 0 otherwise. Note that we only create K − 1 of these, as Ci0 = 1 − (Ci1 + · · · + Ci,K−1).

• The parameters are βk ∈ R for k = 1, . . . , K − 1.

• The hazard rate is h(t, xi) = h0(t) exp(β1Ci1 + · · · + βK−1Ci,K−1).

In particular:

h(t, c0) = h0(t)

h(t, c1) = h0(t)e^β1

...

h(t, cK−1) = h0(t)e^βK−1.

• The hazard ratio for group ck (k ≠ 0) relative to group c0 is e^βk. The hazard ratio for groups ck and cj (k ≠ 0 ≠ j) is exp(βk − βj).


Example: effect of drug used on reversion to drug use. Each individual has been categorised according to heroin or cocaine use (particularly hard drugs). Category 1 represents heroin and cocaine use, 2 heroin only, 3 cocaine only and 4 neither heroin nor cocaine. We will write di = k if individual i fell into drug use category k.

fit5=coxph(Surv(TIME,CENSOR)~factor(HERCOC),data=data)
summary(fit5)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ factor(HERCOC), data = data)
##
##   n= 610, number of events= 493
##    (18 observations deleted due to missingness)
##
##                     coef exp(coef) se(coef)      z Pr(>|z|)
## factor(HERCOC)2  0.07759   1.08068  0.14450  0.537   0.5913
## factor(HERCOC)3 -0.25489   0.77501  0.13508 -1.887   0.0592 .
## factor(HERCOC)4 -0.16220   0.85027  0.13017 -1.246   0.2127
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##                 exp(coef) exp(-coef) lower .95 upper .95
## factor(HERCOC)2    1.0807     0.9253    0.8141     1.434
## factor(HERCOC)3    0.7750     1.2903    0.5947     1.010
## factor(HERCOC)4    0.8503     1.1761    0.6588     1.097
##
## Concordance= 0.536  (se = 0.013 )
## Rsquare= 0.013   (max possible= 1 )
## Likelihood ratio test= 7.79  on 3 df,   p=0.05
## Wald test            = 7.91  on 3 df,   p=0.05
## Score (logrank) test = 7.96  on 3 df,   p=0.05

The fitted hazard function is:

h(t, 1) = h0(t)

h(t, 2) = h0(t)× 1.08

h(t, 3) = h0(t)× 0.78

h(t, 4) = h0(t)× 0.85.

Together these paint a confusing message: soft drug users appear to be at a lower risk of “re-offending” than those using both heroin and cocaine, while heroin users appear more at risk than cocaine users. However, the p-value of the model—i.e. assessing the null hypothesis that no predictor is associated with survival versus the alternative that at least one is—is only around 5% (the output at the bottom), and none of the parameters has an associated p-value of less than 5%. We would conclude that there is no strong evidence of an effect of the type of drug used.
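
Hazard ratios between two non-reference categories follow from exp(βk − βj), and the reference category can be changed with relevel(); a sketch:

b = coef(fit5)
exp(b["factor(HERCOC)3"] - b["factor(HERCOC)2"]) # cocaine only vs heroin only: about 0.72
# Refit with heroin only (category 2) as the reference:
data$HERCOC2 = relevel(factor(data$HERCOC), ref="2")
fit5b = coxph(Surv(TIME,CENSOR)~HERCOC2, data=data)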

6.3.12 A single binary and a single continuous covariate

Theory:

• The covariates are x1 ∈ {0, 1} and x2 ∈ R.


• The parameters are (β1, β2) ∈ R2 for the no-interaction model and (β1, β2, β12) ∈ R3 for the interaction model.

• For the no-interaction model, the hazard rate is h(t, x1, x2) = h0(t) exp(β1x1 + β2x2). In particular

h(t, 0, x2) = h0(t) exp(β2x2)

h(t, 1, x2) = h0(t) exp(β1 + β2x2)

• For the interaction model, the hazard rate is h(t, x1, x2) = h0(t) exp(β1x1 + β2x2 + β12x1x2), i.e.:

h(t, 0, x2) = h0(t) exp(β2x2)

h(t, 1, x2) = h0(t) exp(β1 + [β2 + β12]x2)

• For the no-interaction model, the hazard ratio between the group with x1 = 1 and the group with x1 = 0 is e^β1. The hazard ratio for a one-unit change in x2 for either of the x1 groups is e^β2.

• For the interaction model, the hazard ratio for a one-unit change in x2 for the group with x1 = 0 is exp(β2); for the group with x1 = 1 it is exp(β2 + β12). The hazard ratio comparing the group with x1 = 1 to that with x1 = 0 depends on the value of the continuous covariate x2.

Example: “race” and number of treatments for drug addiction. Each individual has been categorised as “white” (x1 = 0) or “other” (x1 = 1), and the number of previous treatments for drug addiction (x2) has been recorded.

fit6=coxph(Surv(TIME,CENSOR)~RACE+NDRUGTX,data=data)
summary(fit6)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ RACE + NDRUGTX, data = data)
##
##   n= 607, number of events= 493
##    (21 observations deleted due to missingness)
##
##              coef exp(coef)  se(coef)      z Pr(>|z|)
## RACE    -0.256819  0.773508  0.107428 -2.391 0.016820 *
## NDRUGTX  0.026618  1.026976  0.007616  3.495 0.000474 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##         exp(coef) exp(-coef) lower .95 upper .95
## RACE       0.7735     1.2928    0.6266    0.9548
## NDRUGTX    1.0270     0.9737    1.0118    1.0424
##
## Concordance= 0.559  (se = 0.014 )
## Rsquare= 0.03   (max possible= 1 )
## Likelihood ratio test= 18.28  on 2 df,   p=1e-04
## Wald test            = 19.71  on 2 df,   p=5e-05
## Score (logrank) test = 19.88  on 2 df,   p=5e-05

Thus, being “non-white” lowers the risk of returning to drug use by 23% (i.e. 1 − 0.7735 = 0.23; 95% confidence interval 5%–37%), which is statistically significant (p = 0.017), after adjusting for the number of past treatments for drug abuse. Each past treatment increases the risk of failure of the course by 2.7% (95% CI 1.18–4.24%, p < 0.001).

If we allow an interaction, the output is as follows:


fit7=coxph(Surv(TIME,CENSOR)~RACE*NDRUGTX,data=data)
summary(fit7)

## Call:
## coxph(formula = Surv(TIME, CENSOR) ~ RACE * NDRUGTX, data = data)
##
##   n= 607, number of events= 493
##    (21 observations deleted due to missingness)
##
##                   coef exp(coef)  se(coef)      z Pr(>|z|)
## RACE         -0.251486  0.777644  0.145241 -1.732 0.083361 .
## NDRUGTX       0.026761  1.027122  0.008044  3.327 0.000879 ***
## RACE:NDRUGTX -0.001351  0.998650  0.024813 -0.054 0.956576
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##              exp(coef) exp(-coef) lower .95 upper .95
## RACE            0.7776     1.2859    0.5850     1.034
## NDRUGTX         1.0271     0.9736    1.0111     1.043
## RACE:NDRUGTX    0.9986     1.0014    0.9512     1.048
##
## Concordance= 0.559  (se = 0.014 )
## Rsquare= 0.03   (max possible= 1 )
## Likelihood ratio test= 18.28  on 3 df,   p=4e-04
## Wald test            = 19.76  on 3 df,   p=2e-04
## Score (logrank) test = 20.13  on 3 df,   p=2e-04

As the interaction term is not statistically significant (with a p-value very close to 1), we would prefer the no-interaction model.
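
Had the interaction been retained, the hazard ratio for RACE would depend on NDRUGTX, as in the theory above; a sketch at the mean of five prior treatments:

b = coef(fit7)
exp(b["RACE"] + 5*b["RACE:NDRUGTX"]) # about 0.77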

6.3.13 Suitability of the Proportional Hazards Assumption

Unlike the Kaplan–Meier approach, when you develop a PHM you are making a fairly strong assumption: that the hazards at all times are proportional to each other. As in all statistical modelling, it is important to assess the validity of assumptions such as this. There are three ways to assess the proportional hazards assumption (PHA): graphically, via a complementary log-log plot; by testing using residuals; and by testing within an extended Cox PHM.

The graphical approach requires a categorical variable. You estimate the survival function separately for each group, calculate the complementary log-log transform log(−log Ŝ(t)) for each group, then plot these functions against (log) time. If the two (or more) curves look approximately parallel, then the PHA is probably okay.
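
A sketch of this graphical check in R, using RACE from the earlier examples as the grouping variable (an illustrative choice):

km = survfit(Surv(TIME,CENSOR)~RACE, data=data)
plot(km, fun="cloglog", col=1:2,
     xlab="Time (log scale)", ylab="log(-log S(t))")
# Roughly parallel curves are consistent with proportional hazards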

To use residuals requires extracting the Schoenfeld residuals, one set per covariate. These are then tested for independence of time: a systematic trend of the residuals with time indicates a violation of the PHA.
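
In R this is automated by cox.zph() in the survival package, which tests each covariate's (scaled) Schoenfeld residuals for a trend in time; a sketch using the earlier two-covariate fit:

cox.zph(fit6)       # small p-values suggest non-proportional hazards
plot(cox.zph(fit6)) # residuals against time, one panel per covariate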

To extend the model usually involves incorporating an interaction with time: each individual's follow-up is split into multiple records at some arbitrary breakpoint, and the covariate effect is allowed to differ after the breakpoint. See Hosmer et al (2008)⁷ for details.
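
A sketch of this in R using survSplit() from the survival package; the breakpoint of 100 (in the units of TIME) and the choice of RACE as the covariate are both arbitrary, illustrative assumptions:

data2 = survSplit(Surv(TIME,CENSOR)~RACE, data=data, cut=100,
                  episode="epoch")
fitTD = coxph(Surv(tstart,TIME,CENSOR)~RACE+RACE:strata(epoch),
              data=data2)
summary(fitTD) # an interaction term near 0 is consistent with the PHA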

If you would like to learn more about these, let us know.

⁷ Hosmer et al (2008) Applied Survival Analysis 2e. London: Wiley.


Task 1: malaria combination therapy trial

Abdulla et al (2012)⁸ conducted a phase III randomised trial to compare pyronaridine-artesunate (PA) to mefloquine plus artesunate (MA) as a combination therapy for patients with P. falciparum malaria. The study was conducted in around 1200 patients in Thailand, Cambodia, Viet Nam, Burkina Faso, the Ivory Coast, Tanzania, and India between 2007 and 2008. The outcome was time to aparasitemia.

The dataset is available in the file malariatrial.csv.

The data from Cambodia (kh) have been separately marked out from those from other countries (ot). In addition to this, the dataset contains the treatment variable (Tx), the time (in h) to parasite clearance, and an indicator of right censoring (1 if right censored, 0 if not).

Part A: Read in the data and make Kaplan–Meier plots of the time to aparasitemia by treatment. Summarise this plot.

Part B: Conduct an appropriate test of a difference between the two arms. What would you write in a paper on these data?

Part C: Why has Cambodia been singled out? Investigate and summarise your findings.

⁸ N Engl J Med 366:1298–1309.


Task 2: malaria incubation periods

Lover et al (2013, 2014)⁹ did a systematic review of the literature on volunteer challenge studies in malaria (yes, there have been such studies). They extracted data from old papers on the subject and compiled measurements of the incubation period (the time from infection to symptom onset) into a dataset, a summarised version of which can be found in vivaxincubationperiods.csv. The studies were conducted in different locations, and the part of the world has been indicated in the file by the variables latitude (either temperate or tropical) and hemisphere (either newworld, i.e. the Americas, or oldworld, i.e. Eurasia and Africa).

Part A: Read in the data and make Kaplan–Meier plots of the incubation time (measured in days). Summarise what you find.

Part B: Are there statistically significant differences between the geographic locations from which the strains were obtained?

⁹ Emerg Infect Dis 19:1058–1065; BMC Inf Dis 14:539.
