The cross-over trial: a subtle knife

september2008 135


In the second book of the trilogy “His Dark Materials”, author Philip Pullman describes how one of the main characters gains possession of the “subtle knife”¹.

This knife is so sharp that it can cut through any known material, and even through the curtains that divide this world from the infinite number of adjoining worlds that exist simultaneously. You will have to read the books to understand the concept of other adjoining worlds!

What has this got to do with cross-over trials? Well, this type of trial is not unlike the subtle knife: maybe not as sharp, but certainly sharp compared with the alternatives, and quite at home in the various adjacent worlds that make up the vast spectrum of application areas where the design is used. These worlds include medicine, psychology, sports science, dairy science and agriculture. So what is it? Read on!

The 2×2 cross-over trial

The 2×2 cross-over trial is the simplest and probably the most used of these designs, though one that is not without controversy. Like the subtle knife, not everyone can use it well.

We will take as our context the comparison of two drugs that provide relief for patients suffering from chronic obstructive pulmonary disease (COPD). This is a disease of the lungs that causes breathlessness and is prevalent among heavy smokers. We will label the drugs as A and B and suppose that a sufficiently large number of patients have volunteered to take part in a clinical trial that will evaluate and compare the effects of these drugs. The patients are randomly divided into two groups of equal size, which we will label as I and II. Each patient in group I takes drug A for 4 weeks and then crosses over (hence the name of the trial) to take drug B for 4 weeks. In group II each patient takes drug B for 4 weeks and then crosses over to take drug A for 4 weeks. The plan of the trial is summarised in Table 1. After the trial is completed the data obtained are analysed to determine if drug A is superior to drug B.

Why does each patient need to take both drugs? The answer is that this is the strength of the design: the effects of drugs A and B can be compared directly using the two responses recorded on each patient. The alternative would be to compare the effects of drug A observed on some patients with the effects of drug B observed on a different set of patients. This is the so-called parallel-groups design, and we will say more about this design later.

But why do we need two groups of patients when we can make the within-subject comparisons by using just one of the groups? Well, tempus fugit, time moves on, and the conditions of the trial in period 2 might be different to those in period 1. For example, if the weather happened to be cold and damp in the second period and warm and dry in the first period, we might expect to see more severe symptoms in the second period. Or perhaps the medical staff and equipment in the second period are not the same as in the first. In well-planned trials we do not necessarily expect to see an effect of time, but building this possibility into the design provides some insurance against such an eventuality. Using two groups ensures that we can fit a statistical model that will separate out the difference between the two drugs from any differences between the two periods. For an example of a real COPD cross-over trial, see Jones and Kenward², where much more information on cross-over trials can be found.
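One common way to write such a model is sketched below. The notation is assumed for illustration and is not taken from the article: y_{ijk} is the response of patient k in group i during period j, \mu is an overall mean, s_{ik} is a patient effect, \pi_j is a period effect, \tau_{d(i,j)} is the effect of the drug given to group i in period j, and e_{ijk} is random error. Carry-over effects are assumed to be absent.

```latex
\begin{align*}
y_{ijk} &= \mu + s_{ik} + \pi_j + \tau_{d(i,j)} + e_{ijk} \\
E(y_{11k} - y_{12k}) &= (\pi_1 - \pi_2) + (\tau_A - \tau_B) && \text{group I (A then B)} \\
E(y_{21k} - y_{22k}) &= (\pi_1 - \pi_2) - (\tau_A - \tau_B) && \text{group II (B then A)}
\end{align*}
```

Under these assumptions the patient effect cancels from each within-patient difference, and half the difference between the two group expectations leaves \tau_A - \tau_B free of the period terms. With only one group, the drug difference and the period difference could not be told apart.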

Plots and analysis

The data obtained from a 2×2 trial consist of two measurements on each patient: one from the end of period 1 and one from the end of period 2. In our example of treating COPD, the recorded response is the mean morning peak flow rate (PEFR) in litres per minute, which is a measure of the rate at which a patient can exhale. We expect that the drugs will increase this rate, and one more so than the other. When plotted, such data might look like the plot in Figure 1, where there are 20 patients in each group. Here the two responses from each patient have been joined by a line: if the line is going up, the response was higher in the second period, and vice versa.

We can see that most patients have a downward trend in response in group I and an upward trend in group II. In other words, patients seem to be doing better on drug A. However, not all patients are better off taking drug A, as can be seen in the plot of the group I data. This plot also highlights the differences between the patients: some have low PEFR values and others have high values. Nevertheless, the improvement with drug A as compared with drug B is about the same for almost everyone. This is an ideal situation for a cross-over trial: large between-patient variability and small within-patient variability. By basing our analyses on the within-patient differences we can get a precise estimate of the mean difference between the two drugs.

How do we estimate this difference? Assume we have calculated the mean of the responses in each period and in each group, to give four means, $m_{11}$, $m_{12}$, $m_{21}$ and $m_{22}$, where $m_{ij}$ is the mean from group $i$ and period $j$, $i = 1, 2$ and $j = 1, 2$. These are shown and identified in Table 2, along with the values obtained from our example dataset. These means are plotted in Figure 2, where we have joined the means from the same treatment with a line.

Cross-over trials are a sharp and useful tool. Byron Jones explains just how sharp they can be, and how not to cut yourself.

Table 1. 2×2 cross-over trial examining effect of two different drugs on two groups of patients

Group   Period 1 (first 4 weeks)   Period 2 (second 4 weeks)
I       Drug A                     Drug B
II      Drug B                     Drug A

The difference $m_{11} - m_{12}$ is an estimate of the mean effect of (drug A – drug B) plus the mean effect of (period 1 – period 2). The difference $m_{21} - m_{22}$ is an estimate of the mean effect of (drug B – drug A) plus the mean effect of (period 1 – period 2). Therefore, to cancel out the period difference and estimate the mean effect of drug A – drug B, we calculate $0.5 \times [(m_{11} - m_{12}) - (m_{21} - m_{22})]$. If there are $N$ patients in total ($N/2$ in each group) and the standard deviation of the response is $\sigma$, then the standard error of this estimator is $\sqrt{2(1-\rho)/N}\,\sigma$, where $\rho$ is the correlation between the two responses on the same patient. So as $\rho$ gets larger the standard error gets smaller.

An alternative, but equivalent, way to think about the standard error of this estimator is to consider the within-patient and between-patient components of the variance of a response on a patient: $\sigma_W^2$ and $\sigma_B^2$, respectively, where $\sigma^2 = \sigma_W^2 + \sigma_B^2$. Then $\rho = \sigma_B^2/(\sigma_W^2 + \sigma_B^2)$ and the formula for the standard error is $\sqrt{2\sigma_W^2/N}$, which has a familiar form for the standard error of the difference in two means, with $N$ observations in each mean. As the variance of the difference between the two repeated observations on a patient ($2\sigma_W^2$) gets smaller, the standard error gets smaller.
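As a minimal sketch, the estimator above can be applied to the four means reported in Table 2. The function names are illustrative only. The period estimate is not given in the article; it follows from the same argument with the two differences added rather than subtracted. The value of sigma used for the standard error is a hypothetical assumption, while N = 40 matches the 20 patients per group described for Figure 1.

```python
from math import sqrt

# Cell means from Table 2 (mean PEFR, litres/min).
m11, m12 = 229.0, 209.0  # group I: drug A in period 1, drug B in period 2
m21, m22 = 201.0, 252.0  # group II: drug B in period 1, drug A in period 2


def treatment_estimate(m11, m12, m21, m22):
    """Estimated mean effect of drug A minus drug B (period difference cancels)."""
    return 0.5 * ((m11 - m12) - (m21 - m22))


def period_estimate(m11, m12, m21, m22):
    """Estimated mean effect of period 1 minus period 2 (drug difference cancels)."""
    return 0.5 * ((m11 - m12) + (m21 - m22))


def crossover_se(sigma, rho, n_total):
    """Standard error sqrt(2(1 - rho)/N) * sigma of the treatment estimate."""
    return sqrt(2.0 * (1.0 - rho) / n_total) * sigma


print(treatment_estimate(m11, m12, m21, m22))  # 35.5
print(period_estimate(m11, m12, m21, m22))     # -15.5

# Hypothetical sigma = 60 litres/min and rho = 2/3, with N = 40 patients.
print(crossover_se(sigma=60.0, rho=2.0 / 3.0, n_total=40))
```

On these means drug A is estimated to raise mean PEFR by 35.5 litres/min relative to drug B.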

The parallel-groups design

Before saying more about the cross-over trial, let us return, as promised, to the parallel-groups design. In this design the available patients are randomly divided into two groups. Everyone in group I will get drug A and everyone in group II will get drug B. This design is summarised in Table 3.

If this design was a knife it would not be as sharp as the subtle knife (the cross-over): in fact it is quite blunt. By this I mean that to achieve the same precision of estimation of the mean difference in drug effects that can be achieved by the cross-over trial, this design requires many more patients. To see this, we note that the estimator of the difference in mean drug effects is the observed mean response of the group I patients minus the observed mean response of the group II patients. If there are $N/2$ patients in each group then the standard error of this estimator is $\sqrt{4/N}\,\sigma$. The ratio of the standard errors from the two designs (parallel-groups:cross-over) is $\sqrt{2/(1-\rho)}$. If $\rho = 2/3$, for example, which is not at all unusual, then the parallel-groups design will require $2/(1-\rho)$, i.e. 6, times as many patients as the cross-over trial (note that sample size depends on the square of the ratio of the standard errors). Even when $\rho = 0$, and the ratio of sample sizes reduces to 2, the cross-over design retains a benefit: the two observations on the same patient are as valuable as observations on two different patients. Of course, the parallel-groups design can be sharpened up a little if a baseline response value is taken before the drug is taken and a change-from-baseline score is used as the response. Then the ratio of sample sizes is 4.
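The comparison of the two designs can be checked numerically. The sketch below codes the two standard-error formulas quoted in the text and the resulting ratio of sample sizes; the function names are illustrative, and the values of sigma and N used in the check are arbitrary assumptions, since both cancel from the ratio.

```python
from math import sqrt


def parallel_se(sigma, n_total):
    """Standard error sqrt(4/N) * sigma for the parallel-groups design."""
    return sqrt(4.0 / n_total) * sigma


def crossover_se(sigma, rho, n_total):
    """Standard error sqrt(2(1 - rho)/N) * sigma for the 2x2 cross-over design."""
    return sqrt(2.0 * (1.0 - rho) / n_total) * sigma


def sample_size_ratio(rho):
    """Parallel-groups sample size divided by cross-over sample size: 2/(1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return 2.0 / (1.0 - rho)


# The sample-size ratio is the square of the ratio of standard errors.
sigma, n_total = 1.0, 40  # arbitrary values; they cancel in the ratio
for rho in (0.0, 0.5, 2.0 / 3.0, 0.9):
    se_ratio = parallel_se(sigma, n_total) / crossover_se(sigma, rho, n_total)
    print(f"rho = {rho:.3f}: SE ratio = {se_ratio:.3f}, "
          f"sample-size ratio = {sample_size_ratio(rho):.1f}")
```

The ratio grows without bound as rho approaches 1, which is why the function rejects rho = 1.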

A plot of the ratio of sample sizes is given in Figure 3, where we have indicated the ratios corresponding to $\rho = 0$ and $\rho = 2/3$. Finally, we note something that we will refer to later: the cross-over trial contains within it a parallel-groups design. If only the data from the first period were used, then the structure of the data would be as in Table 3.

Figure 1. Patients plot for mean peak flow rate (PEFR). [Plot not reproduced: mean PEFR in litres/min (vertical axis, 0 to 500) for each patient, under drugs A then B in group I and drugs B then A in group II.]

Table 2. 2×2 cross-over trial: means of patient responses (mean PEFR, litres/min) with drugs A and B

Group   Period 1 (first 4 weeks)   Period 2 (second 4 weeks)
I       m11 = 229 (drug A)         m12 = 209 (drug B)
II      m21 = 201 (drug B)         m22 = 252 (drug A)

Figure 2. Means plot for mean peak flow rate (PEFR). [Plot not reproduced: mean PEFR in litres/min (vertical axis, 150 to 300) against period (1 and 2), showing the four means m11 (drug A), m12 (drug B), m21 (drug B) and m22 (drug A).]

Table 3. Parallel-groups trial examining effect of two different drugs on two groups of patients

Group   Period (4 weeks)
I       Drug A
II      Drug B

Figure 3. Ratio of parallel-group (PG) sample size to cross-over (CO) sample size. [Plot not reproduced: ratio of sample sizes PG/CO (vertical axis, 2 to 20) against rho (horizontal axis, 0.0 to 1.0).]

Potential disadvantages

If the cross-over trial is as sharp as I have made out, you may wonder why it is not used in every clinical trial. Well, unlike the subtle knife, it cannot be used in all circumstances. A key assumption is that patients will be in the same medical state at the start of period 2 as they were at the start of period 1. If our treatment cures the disease then clearly the cross-over cannot be used. The sorts of medical conditions where the cross-over has been successfully used are those that are chronic and stable over time. Good examples are asthma, COPD, migraine, hypertension, arthritis and heart problems, to name but a few (other examples are given towards the end of this article).

The other limiting factor is that the half-lives of the drugs to be compared must not be too long. If the active constituents of a drug given in period 1 are not cleared from the body before the start of period 2, then there will be a residual amount of the first drug still present when the second drug is given in period 2. This residual amount could affect the performance of the drug given in period 2. This is referred to as the carry-over effect of the first drug. If carry-over effects are large it is difficult to separate the true effect of the second drug from the residual effect of the first. In order to remove any suspected residual effects, a so-called wash-out period is sometimes interposed between period 1 and period 2. How long the wash-out period needs to be depends on the half-life of the drug, or on the lag time of the drug effect, which may have to work through a number of different biological pathways before it manifests itself. A period equal to about five half-lives is usually considered to be adequate for pharmacological carry-over.
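To see why about five half-lives is usually regarded as enough, the sketch below computes the fraction of a drug remaining after a given number of half-lives. It assumes simple first-order elimination, which the article does not state, and the 12-hour half-life is a purely hypothetical value.

```python
def fraction_remaining(n_half_lives):
    """Fraction of drug left after n half-lives, assuming first-order elimination."""
    return 0.5 ** n_half_lives


def washout_duration(half_life, n_half_lives=5):
    """Wash-out length, in the same time units as half_life."""
    return half_life * n_half_lives


print(fraction_remaining(5))   # 0.03125, i.e. about 3% of the drug remains
print(washout_duration(12.0))  # 60.0 hours for a hypothetical 12-hour half-life
```

This covers pharmacological carry-over only; a lag in the drug effect itself is not captured by the half-life.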

The potential for carry-over effects is certainly a limiting factor on the use of this particular cross-over design. However, if more periods and/or additional sequences of taking the drugs are used, then carry-over effects can be efficiently separated from the drug effects and unambiguous results obtained.

Testing for a carry-over effect difference

Isn’t it annoying to be told that something you have been doing for years, and thought was right, is wrong and should be stopped? Well, this is what happened to one proposal for how to test when the two drugs may have carry-over effects. Before we say more about this proposal, a little more has to be said about how we might test for carry-over effects.

Suppose we were to add together both the responses from each patient to give a “total” response. The total response from a patient in group I would, apart from the background random variability, measure the sum of the period effects plus the sum of the drug effects. The same can be said for a total response from a patient in group II. In other words, if there are no other fixed effects in our statistical model, the true difference between the mean of the patient totals from group I and the mean of the patient totals from group II will equal zero. However, if only one of the drugs has a carry-over effect, or if both drugs have carry-over effects but these are not the same, then the aforementioned true difference will not be zero. The null hypothesis of a zero difference between the true mean totals can therefore be tested using a simple two-sample t-test (assuming that the data are normally distributed). If this t-test gives a significant result, there is evidence that the carry-over effects are unequal, i.e. one drug has a carry-over effect and the other does not, or both drugs have carry-over effects and these are unequal.

So far so good. But recall the earlier comparison in precision between the cross-over trial and the parallel-groups design. The test we have just described is a comparison of two groups in a parallel-groups design. As we know, if this design was a knife it would not be very sharp: in fact, when testing for carry-over effects the knife is very blunt. The large between-patient variability that is seen in cross-over trials is what makes the knife blunt. So the chance of detecting a difference between the carry-over effects that is of the same size as, or smaller than, the difference between the drug effects is very low indeed. In other words, it is a pretty worthless test. If a significant result is obtained then either you have been very unlucky, in that this is one of the 1 in 20 or 1 in 100 occasions on which you expect to make a type I error, or there is a very large difference in the carry-over effects. In the latter case the trial planners need to go back to the drawing board and figure out why they did not anticipate this in the first place and instead use a parallel-groups trial to compare the drugs (in the former case don’t try making a living as a gambler!).
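The test on patient totals described above can be sketched as follows. This is an illustrative implementation of the pooled-variance two-sample t statistic, not code from the article, and the PEFR values are invented for the example. The statistic would be compared with a t distribution on the returned degrees of freedom; no p-value is computed here.

```python
from math import sqrt
from statistics import mean, variance


def carryover_t_statistic(group1, group2):
    """Pooled two-sample t statistic comparing mean patient totals.

    Each group is a list of (period 1 response, period 2 response) pairs.
    Returns the t statistic and its degrees of freedom.
    """
    totals1 = [y1 + y2 for y1, y2 in group1]
    totals2 = [y1 + y2 for y1, y2 in group2]
    n1, n2 = len(totals1), len(totals2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(totals1) + (n2 - 1) * variance(totals2)) / df
    se = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(totals1) - mean(totals2)) / se, df


# Hypothetical PEFR data (litres/min), for illustration only.
group_i = [(310, 290), (150, 140), (420, 380), (230, 215), (275, 260)]
group_ii = [(200, 250), (330, 370), (120, 160), (260, 300), (390, 445)]

t_stat, df = carryover_t_statistic(group_i, group_ii)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

Because the totals carry all of the between-patient variability, the denominator of this statistic is large, which is the bluntness the text describes.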

Now the proposal I wish to refer to was introduced by Grizzle³ and went like this: we first test for (differential) carry-over effects using the difference in group mean totals, as described above. If this test is not significant then we can assume that there are no carry-over effects and use the data from both periods (as we have done above) to estimate and test for the difference between the two drugs. If the carry-over test is significant, then the data in the second period are “contaminated” by the carry-over effects and cannot be used: the comparison of the two drugs can only be made using the data from the first period (no chance of carry-over effects being present there).

However, this is a flawed procedure, as Freeman⁴ pointed out. This two-stage procedure (using the carry-over test to decide whether the data from period 1 or the data from both periods should be used) inflates the type I error of the test for a drug difference and biases the estimate of this difference. Unfortunately, this two-stage procedure became standard practice in the years following its introduction. Hopefully, anyone aware of Freeman’s results will have stopped using it. If you are someone for whom this is unhappy news, you now know what to do: stop using this two-stage procedure and take even more care that your experiment is well designed.
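The inflation of the type I error can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the article: the variances are treated as known, so z-tests replace t-tests; the carry-over test is run at the 10% level and the drug test at the 5% level; the correlation is 2/3; and data are generated with no drug, period or carry-over effects at all. It does not reproduce Freeman's calculations.

```python
import random
from statistics import NormalDist, fmean


def two_stage_type1_rate(n_per_group=12, trials=10000, var_between=2.0,
                         var_within=1.0, alpha_carry=0.10, alpha_drug=0.05,
                         seed=2008):
    """Simulated type I error rate of the two-stage procedure under the null."""
    rng = random.Random(seed)
    sd_b, sd_w = var_between ** 0.5, var_within ** 0.5
    z_carry = NormalDist().inv_cdf(1.0 - alpha_carry / 2.0)
    z_drug = NormalDist().inv_cdf(1.0 - alpha_drug / 2.0)
    # Known standard errors of the difference between the two group means of
    # patient totals, within-patient differences and first-period responses.
    se_total = (2.0 * (4.0 * var_between + 2.0 * var_within) / n_per_group) ** 0.5
    se_diff = (2.0 * (2.0 * var_within) / n_per_group) ** 0.5
    se_first = (2.0 * (var_between + var_within) / n_per_group) ** 0.5

    def simulate_group():
        # Each patient: shared patient effect plus independent period errors.
        patients = []
        for _ in range(n_per_group):
            s = rng.gauss(0.0, sd_b)
            patients.append((s + rng.gauss(0.0, sd_w), s + rng.gauss(0.0, sd_w)))
        return patients

    rejections = 0
    for _ in range(trials):
        g1, g2 = simulate_group(), simulate_group()
        z_total = (fmean(p[0] + p[1] for p in g1)
                   - fmean(p[0] + p[1] for p in g2)) / se_total
        if abs(z_total) > z_carry:
            # Carry-over test significant: analyse first-period data only.
            z = (fmean(p[0] for p in g1) - fmean(p[0] for p in g2)) / se_first
        else:
            # Otherwise use the within-patient differences from both periods.
            z = (fmean(p[0] - p[1] for p in g1)
                 - fmean(p[0] - p[1] for p in g2)) / se_diff
        if abs(z) > z_drug:
            rejections += 1
    return rejections / trials


rate = two_stage_type1_rate()
print(f"Nominal level 0.05, simulated two-stage type I error rate: {rate:.3f}")
```

The inflation arises because the first-period comparison is strongly correlated with the carry-over test on the totals: when the carry-over test happens to be significant by chance, the first-period test is much more likely than 1 in 20 to be significant too.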

Some additional examples

One of the major uses of the 2×2 cross-over trial is in showing that two drugs are bioequivalent (for more details see Patterson and Jones⁵ and Senn⁶), i.e. that their effects are sufficiently similar that one drug may safely be used in place of the other. The COPD example already given is somewhat typical of a clinical trial run by a pharmaceutical company, where patients make regular visits to a clinic to be assessed.

However, not all cross-over trials take place in such controlled conditions.

James et al.⁷ compared the effect of 40 mg of oxprenolol with that of a placebo on the anxiety levels of musicians who might be expected to experience stage-fright. In this 2×2 trial, 24 musicians were randomly divided into two groups of 12 and each musician gave two musical performances, the two performances being on separate days. In addition to having a range of clinical assessments, their musical performances were assessed by two professional assessors. It was found that taking the drug improved overall musical performance, particularly the one given on the first day.

Another cross-over trial took place at altitudes of up to 19 000 feet. Greene et al.⁸ compared the effectiveness of acetazolamide with that of placebo in preventing mountain sickness. The subjects were 24 amateur climbers who were divided randomly into two groups of 12. Each group climbed each of two mountains (Mount Kilimanjaro and Mount Kenya) over a 3-week period with five rest days in between. It was concluded that acetazolamide is an effective prophylactic for acute mountain sickness. We might suppose that in this particular trial the consequences of subject drop-out might be more serious than usual!

Conclusion

I hope that this article has shown something of the usefulness of cross-over trials and given some insight into their particular properties. As with the subtle knife, however, much thought must be given as to when they should be used and whether other types of design are more appropriate. When the assumptions are met, this design is very efficient and has a sharpness that cannot be beaten.

References

1. Pullman, P. (2007) The Subtle Knife. London: Scholastic.
2. Jones, B. and Kenward, M. G. (2003) Design and Analysis of Cross-over Trials, 2nd edn. Boca Raton, FL: Chapman and Hall/CRC.
3. Grizzle, J. E. (1965) The two-period change-over design and its use in clinical trials. Biometrics, 21, 467–480.
4. Freeman, P. (1989) The performance of the two-stage analysis of two treatment, two period crossover trials. Statistics in Medicine, 8, 1421–1432.
5. Patterson, S. D. and Jones, B. (2006) Bioequivalence and Statistics in Clinical Pharmacology. Boca Raton, FL: Chapman and Hall/CRC.
6. Senn, S. (2004) When is a drug not a drug? Significance, 1, 159–161.
7. James, I. M., Griffith, D. N. W., Pearson, R. M. and Newbury, P. (1977) Effect of oxprenolol on stage-fright in musicians. Lancet, 2, 952–954.
8. Greene, M. K., Kerr, A. M., McIntosh, I. B. and Prescott, R. J. (1981) Acetazolamide in prevention of acute mountain sickness: a double-blind controlled cross-over study. British Medical Journal, 283, 811–813.

Byron Jones is Senior Director and Senior Statistical Consultant at Pfizer Global Research and Development, Sandwich, Kent.
