Optimization of testing times and critical values in sequential equivalence testing

* Correspondence to: Hans-Helge Müller, Institute of Medical Biometry, Philipps-University of Marburg, Bunsenstrasse 3, D-35037 Marburg, Germany. E-mail: MuellerH@Mailer.Uni-Marburg.DE

CCC 0277-6715/99/141769-20$17.50 Copyright © 1999 John Wiley & Sons, Ltd.

STATISTICS IN MEDICINE

Statist. Med. 18, 1769-1788 (1999)

OPTIMIZATION OF TESTING TIMES AND CRITICAL VALUES IN SEQUENTIAL EQUIVALENCE TESTING

HANS-HELGE MÜLLER* AND HELMUT SCHÄFER

Institute of Medical Biometry, Philipps-University of Marburg, Bunsenstrasse 3, D-35037 Marburg, Germany

SUMMARY

In long-term clinical trials, interim analyses are planned to reduce the number of patients needed. To meet this issue in a practical way, group sequential designs are used. Most of these trials are conducted with the objective of demonstrating differences in efficacy of treatments, for example, to show superiority of a new drug or experimental treatment to a control. However, an increasing number of trials are designed to establish equivalence in efficacy or bioequivalence. This paper deals with group sequential test procedures in two-sided equivalence trials. Optimized designs with respect to sample size behaviour are constructed. Tables containing optimal testing times and corresponding optimal critical values or values to construct an underlying α-spending function, respectively, are provided. An example illustrates their use when planning interim analyses in equivalence trials.

INTRODUCTION

When testing for differences in efficacy of two treatments in major clinical trials, interim analyses are carried out on ethical or economic grounds. When planning the trial, relevant differences are proposed. If such differences are present, the implementation of interim analyses often leads to early termination, with the benefit of saving time in claiming superiority of one of the treatments and of saving patients and costs for the trial. These statements remain valid for trials with the objective to establish equivalence instead of superiority when the therapeutic effects of the treatments are nearly identical.

This paper is concerned with group sequential test procedures for the two-sided equivalence problem. Group sequential designs will be constructed which are optimized with respect to the average sample size.

Sequential procedures are applied to control for the type I error rate. They will be chosen satisfying a power requirement for a special value of differences of the treatments in the primary endpoint. Using fully sequential test procedures, the minimal average sample size may be obtained. In fully sequential designs, an interim analysis is performed after every assessed patient. However, the use of fully sequential plans or procedures with many interim looks has disadvantages. If treatment differences other than those assumed for the power requirement are present, the desired savings are replaced by greatly increased patient numbers and costs. Although such treatment differences are not hoped for, they frequently occur. For instance, a new drug, which is compared with a standard control with the objective of demonstrating superiority, may have no, or only a small, therapeutic benefit. Furthermore, every interim analysis requires complete data and high data quality, so that high standing charges arise. Therefore group sequential procedures are preferable in most situations. Designs with up to five analyses are practicable and account for most of the reduction in average sample size. Common designs are based on the proposals by Pocock, O'Brien, Fleming and Harrington.¹⁻⁴ For superiority trials, Brittain and Bailey⁵ completed these procedures by two- and three-stage designs nearly minimizing the average sample size for the proposed difference in efficacy. In combination with the α-spending function approach of Lan and DeMets,⁶ the group sequential procedures become flexible.

Controlled clinical trials to demonstrate bioequivalence of two drugs or therapeutic equivalence of an experimental treatment with a control usually require more patients than trials to prove differences in bioavailability or efficacy. Consequently, minimizing the average number of patients is at least as important for equivalence trials as for superiority trials. Whitehead⁷ reviewed the criteria and methods for claiming equivalence. He also identified appropriate sequential procedures for combined equivalence and superiority trials. The designs are based on the boundary approach, which has its roots in fully sequential testing. Thus the double triangular designs proposed by Whitehead involve a large number of interim analyses, although his procedure could also accommodate fewer looks. In practice, these designs with a large number of interim looks (in the example,⁷ interim analyses were planned every 999 patients, with at most 40 interim analyses) have the disadvantages mentioned above. Other group sequential equivalence test procedures published in the literature are based on repeated confidence intervals following the approach of Jennison and Turnbull.⁸,⁹ They do not exhaust the type I error level and have no optimality properties with respect to the average sample size over an expected region of differences in efficacy. Even if the power of the whole procedure is adequate, the limited amount of evidence available at the times of early interim looks causes the observed conservative behaviour of this approach to group sequential equivalence testing in contrast to fixed sample equivalence testing (see also Senn,¹⁰ pp. 328-329).

The purpose of this paper is to present optimized group sequential procedures testing for equivalence (non-inferiority and non-superiority) in the context of a single normally distributed response variable. The test procedures are based on observations of a Brownian motion with drift, where the drift parameter describes the difference in efficacy. For every analysis the decision to claim equivalence on a local significance level follows the 'optimal test' of the paper by Mehring.¹¹ Group sequential plans with up to five stages are constructed to minimize the average sample size, specifically by choosing optimal testing times as well as optimal critical boundaries.

Although the test procedures are only exact in the Gaussian setting, they also may be applied in situations where the underlying process for the statistical analyses is approximately a Brownian motion with drift. Thus this approach to group sequential equivalence testing is expected to be used in bioequivalence and outside bioequivalence, for instance, occasionally in therapeutic decision-making (see Senn,¹⁰ p. 208), whenever power is adequate and reduction of sample size due to sequential methods is of interest. For example, by asymptotic theory, these sequential methods may be applied when planning survival studies in the field of life-threatening diseases.¹²⁻¹⁵ Most of the therapeutic trials, however, will be trials to demonstrate differences: superiority, clinically relevant superiority, or non-inferiority. Note that non-inferiority, meaning that the experimental treatment will not be inferior to the control by more than a clinically unimportant difference, is also termed one-sided equivalence or briefly equivalence. Application and limitation to application of two-sided equivalence testing and sequential methods in this area will be outlined in the discussion.

Here an ethical requirement will be mentioned: if a clinically relevant difference of the treatments to be compared is possible, for instance, in survival studies, an additional monitoring procedure to detect such a relevant treatment difference when performing interim analyses has to be implemented for ethical acceptance. The procedure for testing equivalence will then become conservative. Improvements of the presented designs will be discussed.

As an illustration of the proposed methods, the randomized double-blind clinical trial of Castle et al.¹⁶ on the comparison of salmeterol with salbutamol in asthmatic patients who require regular bronchodilator treatment will be used. This study is an example of a large equivalence trial. With a fixed-sample test, the authors showed equivalence of the rates of serious events under treatment with salmeterol and salbutamol. Whitehead⁷ makes use of this trial to demonstrate the application of sequential methods for planning equivalence trials. We take a second look at the same example to discuss the pros and cons of optimized group sequential procedures.

GROUP SEQUENTIAL EQUIVALENCE TESTING

Suppose that an experimental treatment (E) will be compared to a control (C) in a controlled clinical trial with the objective of establishing equivalence in efficacy, where safety is not a primary concern of the trial, or with the objective to show bioequivalence. Patients will be randomized to the two treatments, $n_E$ to the experimental treatment and $n_C$ to the control. Observation of a patient $i$ will result in a normally distributed outcome $X_i$. In a hypertension trial, for example, $X_{i,E}$ or $X_{i,C}$ may be the deviation from a systolic blood pressure level to be adjusted for patient $i$ when treated with a new anti-hypertensive drug E or a standard anti-hypertensive drug C, respectively. The random variables $X_i$ are assumed to be independent and, in each treatment group, identically distributed. The treatment effects of E and C will be described by the location parameters $\mu_E$ and $\mu_C$, respectively. For simplicity, the variance $\sigma^2$ will be assumed to be identical in both groups and, in a first instance, known. The statistic to measure treatment differences is based on the average outcomes $\bar{X}_E$ and $\bar{X}_C$. The advantage of E over C may be estimated by the difference $\bar{X}_E - \bar{X}_C$ and transformed into the statistic

$$\sqrt{n q (1-q)}\,\frac{\bar{X}_E - \bar{X}_C}{\sigma}$$

where $n = n_E + n_C$ denotes the total sample size and $q = n_E/n$ the proportion of patients allocated to the experimental treatment.

As the trial proceeds, data will be accumulated until the total sample size $n$ is reached. The proportion $t$ of assessed patients relative to $n$ is a measure of the amount of information relative to the proposed end of the trial. Therefore $t$ will be termed information time or information fraction. The statistic

$$T_t = \sqrt{t n q (1-q)}\,\frac{\bar{X}_{E,t} - \bar{X}_{C,t}}{\sigma}$$

measures the cumulative discrepancy of E with respect to C involving the outcome of the patients up to time $t$. For $t$ between 0 and 1, $T_t$ defines a stochastic process. $\sqrt{t}\,T_t$ can be extended to a Brownian motion with drift parameter

$$\delta = \sqrt{n q (1-q)}\,\frac{\mu_E - \mu_C}{\sigma} \qquad \text{(linkage formula for normal outcome).}$$
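The statistic $T_t$ and the linkage formula can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the authors' software: the function names `equivalence_statistic` and `drift_parameter` are hypothetical, $\sigma$ is taken as known, and the observed allocation at the time of the analysis is used in place of the planned proportion $q$.

```python
import math


def equivalence_statistic(x_e, x_c, n_total, sigma):
    """Return (t, T_t) from the outcomes observed so far.

    x_e, x_c : outcomes observed so far under E and under C
    n_total  : planned maximum total sample size n
    sigma    : common standard deviation (assumed known)
    """
    n_e, n_c = len(x_e), len(x_c)
    mean_diff = sum(x_e) / n_e - sum(x_c) / n_c
    # t*n*q*(1-q) with t*n = n_e + n_c and q = n_e / (n_e + n_c)
    # (assumption: observed allocation stands in for the planned q)
    scale = math.sqrt(n_e * n_c / (n_e + n_c))
    t = (n_e + n_c) / n_total
    return t, scale * mean_diff / sigma


def drift_parameter(n_total, q, mu_e, mu_c, sigma):
    """Linkage formula for normal outcome: sqrt(n q (1-q)) (mu_E - mu_C) / sigma."""
    return math.sqrt(n_total * q * (1.0 - q)) * (mu_e - mu_c) / sigma
```

For example, `equivalence_statistic([1, 2, 3, 4], [0, 1, 2, 3], n_total=16, sigma=1.0)` evaluates the statistic at information time 0.5, where the mean difference of 1 is scaled by $\sqrt{2}$.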

Asymptotic theory allows use of the Gaussian model in a wide field of applications. To utilize the proposed methods, one then has to link the relevant efficacy parameters in the special application and the drift parameter $\delta$. If the variance is unknown, the normal case in medicine, the estimated variance may be used instead of $\sigma^2$ in the definition of the statistic $T_t$. In survival studies, $T_t$ may be the asymptotic logrank statistic. Then, the relevant efficacy parameter will be $\theta$, the quotient of the proportional hazard rates, and the formula to link $\theta$ and the drift parameter $\delta$ is

$$\delta = \sqrt{n q (1-q)}\,\log\theta \qquad \text{(linkage formula for survival data)}$$

where $n$ is the maximum number of deaths to be observed when performing the study. Another example dealing with equivalence trials in the setting of binary response data and a respective linkage formula for binary response data is given in a later section. Moreover, accommodation to within-patient comparisons is possible. In bioequivalence studies, for instance, a two-period cross-over model may be fitted involving log-transformed areas under curves of concentration-time profiles (see also Lan and Zucker¹⁷).
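As a small illustration of the survival linkage formula, the sketch below evaluates the drift for a given hazard ratio and, by solving the same formula for $n$, the number of deaths needed to reach a target drift. The function names and the hazard ratio of 1.25 used in the usage note are hypothetical; solving for $n$ is a direct rearrangement of the formula above, offered here as an assumption about how one might use it rather than as the authors' procedure.

```python
import math


def drift_from_hazard_ratio(n_deaths, hazard_ratio, q=0.5):
    """Linkage formula for survival data: sqrt(n q (1-q)) * log(theta)."""
    return math.sqrt(n_deaths * q * (1.0 - q)) * math.log(hazard_ratio)


def deaths_for_drift(target_drift, hazard_ratio, q=0.5):
    """Rearranged linkage formula: n = drift^2 / (q (1-q) (log theta)^2)."""
    return target_drift ** 2 / (q * (1.0 - q) * math.log(hazard_ratio) ** 2)
```

With a target drift of 3.2897 (the value of the fixed sample test for a type I error of 0.05 and a type II error of 0.10 in Table I(b)) and a purely hypothetical hazard ratio of 1.25, `deaths_for_drift(3.2897, 1.25)` gives the maximum number of deaths under equal allocation.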

Figure 1 shows three simulated sample paths of the process $T_t$, two paths with different behaviour in the situation of advantage of E over C ($\delta > 0$) and one for equal therapeutic effects ($\delta = 0$). The function of expected values $\sqrt{t}\,\delta$ is also plotted against the information time. At every time, $T_t$ is normally distributed with variance 1. (Further explanation of the figure will follow below.)

Testing for equivalence means testing a null hypothesis of the form $H_0: |\delta| \ge \delta^*$ against the alternative $H_1: |\delta| < \delta^*$, where $\delta^*$ is the drift parameter of the Brownian motion corresponding in some way, which will be explained later, to the minimal important difference in efficacy of the treatments. Because of the symmetry of the hypotheses, positive and negative $\delta$-values are exchangeable.

In group sequential equivalence testing, the maximum number of analyses will be denoted by $m$. The information times at which the analyses are performed will be denoted by $t_1, t_2, \ldots, t_m = 1$. The trial stops if there is evidence from an analysis to claim equivalence. The $k$th analysis at information time $t_k$ will be performed on the observed value of $T_{t_k}$. If this observed value is between the critical boundaries $\pm b_k$, equivalence is claimed and the trial stops. Otherwise, the trial goes through to the next analysis. If the trial comes up to the end and $|T_1| > b_m$, the null hypothesis of important differences between the treatments would be accepted.

Test procedures testing for equivalence have to control for the type I error level. Probability of establishing equivalence although the null hypothesis is true will be restricted to a given level $\alpha$. Furthermore, a power requirement will be made. A power of $1 - \beta$ will be stipulated to detect equality $\delta = 0$.

Figure 1 illustrates two different four-stage group sequential test procedures with $\alpha = 0.05$ and type II error level $\beta = 0.1$. The plotted dots show the testing times and critical boundaries at these times. Simulated sample paths of different Gaussian processes are plotted for illustration. For the simulated study in Figure 1(a) equivalence will be claimed on the strength of the first interim analysis although the null hypothesis is true. This may happen in 5 per cent of all the paths of the Gaussian process with $\delta = \delta^*$. Using the same test procedure, in Figure 1(b) equivalence is established, a correct decision in this case. This will happen in 90 per cent of the paths of the Gaussian process with $\delta = 0$. Here the trial stops after the third interim analysis, saving about 20 per cent of the patient numbers. Figures 1(c) and 1(d) illustrate another test procedure. In both examples the correct decision is taken. In example 1(c), the null hypothesis, that important differences between the two treatments may be present, is accepted at the end of the trial. In example 1(d), equivalence is claimed at the second interim analysis. Compared with the fixed sample test, there was a saving in the patient numbers of more than 20 per cent.

Figure 1. Simulated sample paths of the Gaussian process, expected mean and two different optimized four-stage group sequential designs (dots) testing equivalence, $\alpha = 0.05$ and $\beta = 0.1$ (see also Table I(b)). Criterion for minimization: (a) and (b), average sample size given equality, $\delta^* = 3.5217$; (c) and (d), sum of average sample size given equality and maximum sample size, $\delta^* = 3.3385$. Drift parameter $\delta$: (a) and (c), $\delta = \delta^*$, minimal important difference in the null hypothesis; (b) and (d), $\delta = 0$, equality in the equivalence alternative
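The 5 per cent and 90 per cent figures quoted for the first design can be checked approximately by simulation. The sketch below is a Monte Carlo illustration, not the numerical integration used in the paper: it takes the testing times, critical boundaries and drift value of the four-stage design in Table I(b) (criterion: average sample size given equality), simulates the Brownian motion $\sqrt{t}\,T_t$ through its independent increments, and applies the stopping rule described above. The function names, the number of paths and the seed are arbitrary choices.

```python
import math
import random

# Four-stage design from Table I(b), alpha = 0.05, beta = 0.10,
# optimized for the average sample size given equality.
TIMES = [0.4805, 0.6052, 0.7748, 1.0000]
BOUNDS = [0.4809, 0.7587, 1.0669, 1.5097]
DELTA_STAR = 3.5217


def stage_of_claim(delta, rng):
    """Return the analysis (1-based) at which equivalence is claimed, or 0 if never."""
    b_motion, t_prev = 0.0, 0.0  # Brownian motion with drift: sqrt(t) * T_t
    for k, (t, bound) in enumerate(zip(TIMES, BOUNDS), start=1):
        dt = t - t_prev
        # Independent increment with mean delta*dt and variance dt
        b_motion += rng.gauss(delta * dt, math.sqrt(dt))
        t_prev = t
        if abs(b_motion / math.sqrt(t)) < bound:  # |T_t| inside +/- b_k
            return k
    return 0


def claim_probability(delta, n_paths=100_000, seed=1):
    """Monte Carlo estimate of the probability of claiming equivalence."""
    rng = random.Random(seed)
    claims = sum(stage_of_claim(delta, rng) > 0 for _ in range(n_paths))
    return claims / n_paths
```

Calling `claim_probability(DELTA_STAR)` should give an estimate close to the type I error level of 0.05, and `claim_probability(0.0)` an estimate close to the power of 0.90, up to simulation error and the rounding of the tabulated values.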

The type I error level $\alpha$ will be exhausted for the limiting value $\delta^*$ in the null hypothesis, so that there is no unnecessary loss of power. $\delta^*$ depends on $\alpha$, $\beta$, the testing times and the critical boundaries of a design and has to be calculated, for instance, in the way explained in the next paragraph.

If the number of analyses and the testing times $0 < t_1 < t_2 < \cdots < t_m = 1$ are fixed, there are at least two possible concepts to define a group sequential design: the α-spending function approach⁶ or the concept of power-spending (Bauer¹⁸). Laying down the values of a chosen α-spending function at the testing times and $\beta$, or the values of a chosen power-spending function at the testing times and $\alpha$, the critical boundaries $b_1, b_2, \ldots, b_m$ will be computed together with the value $\delta^*$, which defines the null hypothesis of the test. This may be done by numerical integration with the corresponding spending algorithm using the independent increments of a Brownian motion as shown by Armitage et al.¹⁹ The result of these calculations is a statistical test procedure for an abstract Gaussian process $T_t$ as obtained from a Brownian motion following the definition above.

For application to a special clinical trial, the linkage formula has to be used. In a first step, the statistical model for the special clinical trial has to be selected and the minimal important difference in efficacy of the treatments must be stated within this model. The minimal important difference is the limiting value of differences clinically acceptable for equivalence. In the case of normal outcome variable, this difference may be denoted by $\mu^*_E - \mu^*_C$ and measured in multiples of $\sigma$. To link the abstract Gaussian model and the model for the special clinical trial, one then has to determine the required maximum sample size of the trial. The minimal important difference and the characteristic value $\delta^*$ are substituted into the linkage formula. For simplicity of notation, the minimal important difference is supposed to be positive, as is $\delta^*$. Transforming the formula, this results in the necessary maximum sample size to achieve the proposed power at $\delta = 0$, for example:

$$n = \frac{1}{q(1-q)}\,\frac{\sigma^2}{(\mu^*_E - \mu^*_C)^2}\,\delta^{*2} \qquad \text{(sample size formula for normal outcome).}$$

Finally, the $k$th interim analysis will be performed after assessment of $n\,t_k$ patients.
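The sample size formula and the resulting schedule of analyses can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the minimal important difference of half a standard deviation in the usage note is an invented planning value, and rounding patient numbers up to whole patients is an assumption, since the text does not say how fractional values are handled.

```python
import math


def max_sample_size(delta_star, diff_in_sd, q=0.5):
    """Sample size formula for normal outcome.

    delta_star : characteristic drift value of the chosen design
    diff_in_sd : minimal important difference (mu*_E - mu*_C) / sigma
    q          : proportion of patients allocated to E
    """
    return delta_star ** 2 / (q * (1.0 - q) * diff_in_sd ** 2)


def analysis_sizes(n_max, times):
    """Number of assessed patients n * t_k at each analysis (rounded up, an assumption)."""
    return [math.ceil(n_max * t) for t in times]
```

For the four-stage design with $\delta^* = 3.5217$ from Table I(b), equal allocation and a hypothetical minimal important difference of 0.5 standard deviations, `max_sample_size(3.5217, 0.5)` is about 198.4 patients; rounding up to 199 and calling `analysis_sizes(199, [0.4805, 0.6052, 0.7748, 1.0])` gives analyses after 96, 121, 155 and 199 patients.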

In the next section, tables of special group sequential test procedures will be presented. All these procedures are exact level $\alpha$ procedures with power $1 - \beta$ to detect equality $\delta = 0$. For instance, the two designs of the illustration in Figure 1 can be found in Table I(b). Sometimes it may be reasonable to detect another fraction of $\delta^*$ with adequate power instead of $0 \cdot \delta^*$ (see Senn,¹⁰ pp. 215-217). To allow for this requirement a simple improvement of the algorithm will be necessary. However, an increase in calculation time is evident.

In what follows, the use of the provided tables will be explained in detail; for every design tabulated in the tables, there is a list of values beginning at the left margin of the table. The entries stand for the characteristic values of a design. In the first row there are the error levels $\alpha$ and $\beta$, the maximum number $m$ and the average number of analyses given equality, the calculated value of $\delta^*$, the value of $\delta^{*2}$, termed maximum sample size, to be used in the sample size formula to obtain the maximum sample size, and a value termed average sample size, which may be used instead of $\delta^{*2}$ in the same formula to calculate the average sample size given equality of treatment effects. The testing times and the critical values are tabulated in the following rows of the table, one row for each testing time. The first and the last column contain the respective testing time and the corresponding critical boundary for the process $T_t$. Instead of critical boundaries for $T_t$ one may use the nominal levels. These levels are the local levels for the two-sided p-values of conventional tests for superiority. The trial stops, and equivalence is claimed, if a local p-value is greater than the corresponding nominal level. The other columns show, in each row corresponding to the testing time, the power-spending values and the α-spending values for construction of an underlying power-spending function or α-spending function, respectively, to achieve flexibility.
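The relation between a critical boundary and its nominal level is the usual two-sided normal tail probability, which can be written down directly. The sketch below uses Python's `statistics.NormalDist`; the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def nominal_level(boundary):
    """Two-sided p-value threshold corresponding to a critical boundary b_k."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(boundary))


def boundary_from_nominal(level):
    """Critical boundary b_k corresponding to a two-sided nominal level."""
    return _STD_NORMAL.inv_cdf(1.0 - level / 2.0)
```

For the two-stage design in Table I(a), `nominal_level(1.0058)` and `nominal_level(1.8701)` agree with the tabulated nominal levels 0.31450 and 0.06147 to within rounding. Equivalence is claimed at an analysis when the conventional two-sided p-value exceeds this level, which is the same event as $|T_{t_k}| < b_k$.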

In the tables, optimized group sequential designs are presented. The next section deals with the method of optimization. Using the planned information times for the analyses and the α-spending values given in the table, an α-spending function may be defined, for instance by interpolation or a combination of building plateaus and interpolation. In this way, the well known α-spending approach⁶ may be applied to group sequential equivalence testing. This approach meets the practical problem of deviations from the planned sample sizes when performing the respective interim or final analysis. Note that the application of this principle requires numerical integration to adjust the critical values. Small deviations from the planned information fractions will then result in nearly no loss of optimality. Of course, this statement will only be valid provided a small deviation results in only a small change of the planned α-spending value. In practice, only an artificial construction of an α-spending function will fail to meet this requirement. (Note that α-spending values for information times in the neighbourhood of planned interim looks may also be defined by optimization, fixing the information times and α-spending values for all other analyses.)
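One way to read "defined by interpolation" is a piecewise-linear function through the tabulated α-spending values. The sketch below builds such a function; it is one possible construction, not the authors' prescription. The function name is hypothetical, and anchoring the function at zero for information time zero is an assumption in line with the usual definition of an α-spending function. Recomputing the critical values for unplanned information times still requires the numerical integration mentioned in the text, which is not shown here.

```python
from bisect import bisect_right


def make_alpha_spending(times, alpha_values):
    """Piecewise-linear alpha-spending function through (0, 0) and the tabulated points."""
    ts = [0.0] + list(times)
    spent = [0.0] + list(alpha_values)

    def alpha_spent(t):
        if t <= 0.0:
            return 0.0
        if t >= ts[-1]:
            return spent[-1]
        i = bisect_right(ts, t)  # ts[i-1] <= t < ts[i]
        weight = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
        return spent[i - 1] + weight * (spent[i] - spent[i - 1])

    return alpha_spent
```

For the four-stage design in Table I(b), `make_alpha_spending([0.4805, 0.6052, 0.7748, 1.0], [0.02325, 0.03452, 0.04200, 0.05000])` reproduces the tabulated values at the planned times and interpolates between them, so an analysis performed at information time 0.50 instead of 0.4805 would be allotted slightly more than 0.02325.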

OPTIMIZED DESIGNS

Sequential methods in statistical testing were developed to reach a reliable conclusion as early as possible and to save study costs. Study costs will essentially depend on the number of patients. When using sequential designs, the number of patients is a random variable, denoted by $N$. As demonstrated in the paper by Whitehead⁷ and in the previous section, sequential designs may reduce the number of patients not only in superiority trials, but also in equivalence trials.

Theoretically, there is an unlimited number of sequential designs which may be used. To select a group sequential design for a special clinical trial, one may specify the maximum number of analyses and then choose the testing times $t_1, t_2, \ldots, t_{m-1}$ and critical boundaries $b_1, b_2, \ldots, b_m$ in such a way that the average sample size is minimized. However, the average sample size $E(N)$ depends on the unknown drift parameter $\delta$. This functional dependence may be denoted by $E_\delta(N)$. To select an optimal group sequential design in a special situation, one has to make assumptions upon the range of possible values of $\delta$ which may be expected. If equality or approximate equality in efficacy of both treatments is assumed, then it is recommendable to use the minimization criterion $E_\delta(N)$ for $\delta = 0$. The corresponding optimal group sequential designs are tabulated in Table I after an explanatory heading and the tabulated values of the fixed sample test as reference for comparison.
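The criterion $E_0(N)$ can be evaluated for a tabulated design from its testing times and power-spending values. The sketch below reads the power-spending values at the interim analyses as cumulative probabilities of stopping when $\delta = 0$, so that all remaining paths continue to the final analysis. This reading is an assumption, although it reproduces the tabulated averages to within rounding. Function names are hypothetical.

```python
def stage_probabilities(cumulative_stop):
    """P(trial ends at analysis k) from cumulative stopping probabilities at the interim analyses."""
    probs, previous = [], 0.0
    for value in cumulative_stop:
        probs.append(value - previous)
        previous = value
    probs.append(1.0 - previous)  # all remaining paths reach the final analysis
    return probs


def average_sample_size(times, cumulative_stop, max_sample_size):
    """E(N) = n * sum_k t_k * P(trial ends at analysis k)."""
    probs = stage_probabilities(cumulative_stop)
    return max_sample_size * sum(t * p for t, p in zip(times, probs))


def average_number_of_analyses(cumulative_stop):
    """Expected number of analyses actually performed."""
    probs = stage_probabilities(cumulative_stop)
    return sum(k * p for k, p in enumerate(probs, start=1))
```

For the four-stage design in Table I(b) with testing times 0.4805, 0.6052, 0.7748 and 1, interim power-spending values 0.3694, 0.6043 and 0.7720, and maximum sample size value 12.4025, these functions return approximately 8.404 and 2.254, in line with the tabulated 8.4036 and 2.2543.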

If there is considerable uncertainty whether the treatments actually are equivalent, then it may not be recommendable to base the choice of the optimal group sequential design only on the assumption $\delta = 0$. One may wish to have reasonable protection against an undesired increase of sample size when the true value of $\delta$ is near or larger than the value $\delta^*$, which is the value corresponding to the minimal important difference of the treatments. In this case, one may use the sum of the average sample size for two different values of $\delta$, for example $\delta = 0$ and a $\delta$-value between $\delta^*$ and infinity. The optimization criteria $E_0(N) + E_{\delta^*}(N)$ and $E_0(N) + E_\infty(N) = E_0(N) + n$ lead to nearly identical designs. The optimal group sequential designs corresponding to the criterion $E_0(N) + n$ are tabulated in the lower part of Table I.

For the two types of optimized designs in Table I the type I error risk is set to $\alpha = 0.05$. Type II error risk varies from $\beta = 0.05$ over $\beta = 0.1$ and $\beta = 0.2$ to $\beta = 0.3$, splitting up Table I in parts (a)-(d). Designs with up to five stages for the lower type II error levels and up to four stages for the higher levels are provided. The values were computed using a search algorithm ($(2m-2)$-dimensional search) in addition to an algorithm for the α-spending method mentioned in the section before.


Table I. Group sequential test procedures for equivalence with optimized testing times and critical values. Fixed sample test as reference, optimization with respect to average sample size given equality, and optimization with respect to sum of average sample size given equality and maximum sample size

(a)

| α | β | Maximum number of analyses | Average number of analyses | δ* | Maximum sample size | Average sample size |
|---|---|---|---|---|---|---|
| | | Information time | Power-spending value | α-spending value | Nominal level | Critical boundary |
| Fixed sample test | | | | | | |
| 0.05 | 0.05 | 1 | 1.0000 | 3.6048 | 12.9947 | 12.9947 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.05000 | 1.9600 |
| Optimized: average sample size given equality | | | | | | |
| 0.05 | 0.05 | 2 | 1.3145 | 3.8290 | 14.6613 | 9.9952 |
| | | 0.5357 | 0.6855 | 0.03612 | 0.31450 | 1.0058 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.06147 | 1.8701 |
| 0.05 | 0.05 | 3 | 1.6780 | 3.9086 | 15.2769 | 9.3144 |
| | | 0.4392 | 0.5209 | 0.02939 | 0.47910 | 0.7076 |
| | | 0.6497 | 0.8011 | 0.04180 | 0.23503 | 1.1875 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.06681 | 1.8330 |
| 0.05 | 0.05 | 4 | 2.0532 | 3.9479 | 15.5858 | 9.0232 |
| | | 0.3993 | 0.4226 | 0.02521 | 0.57742 | 0.5572 |
| | | 0.5300 | 0.6785 | 0.03678 | 0.36961 | 0.8972 |
| | | 0.7193 | 0.8457 | 0.04398 | 0.20003 | 1.2815 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.06998 | 1.8120 |
| 0.05 | 0.05 | 5 | 2.4332 | 3.9709 | 15.7679 | 8.8632 |
| | | 0.3778 | 0.3580 | 0.02226 | 0.64198 | 0.4649 |
| | | 0.4710 | 0.5873 | 0.03310 | 0.46603 | 0.7290 |
| | | 0.5969 | 0.7526 | 0.04012 | 0.30971 | 1.0158 |
| | | 0.7662 | 0.8689 | 0.04512 | 0.18072 | 1.3386 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.07209 | 1.7985 |
| Optimized: sum of average sample size given equality and maximum sample size | | | | | | |
| 0.05 | 0.05 | 2 | 1.3754 | 3.6570 | 13.3733 | 10.3665 |
| | | 0.6400 | 0.6246 | 0.02064 | 0.37538 | 0.8864 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.05372 | 1.9291 |
| 0.05 | 0.05 | 3 | 1.7950 | 3.6670 | 13.4466 | 9.7995 |
| | | 0.5656 | 0.4601 | 0.01561 | 0.53986 | 0.6130 |
| | | 0.7495 | 0.7449 | 0.02505 | 0.28446 | 1.0704 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.05462 | 1.9219 |
| 0.05 | 0.05 | 4 | 2.2224 | 3.6709 | 13.4758 | 9.5591 |
| | | 0.5343 | 0.3685 | 0.01299 | 0.63155 | 0.4796 |
| | | 0.6548 | 0.6133 | 0.02038 | 0.42660 | 0.7950 |
| | | 0.8109 | 0.7959 | 0.02732 | 0.23940 | 1.1765 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.05500 | 1.9189 |
| 0.05 | 0.05 | 5 | 2.6523 | 3.6730 | 13.4911 | 9.4273 |
| | | 0.5172 | 0.3100 | 0.01128 | 0.68996 | 0.3989 |
| | | 0.6058 | 0.5227 | 0.01769 | 0.52160 | 0.6409 |
| | | 0.7166 | 0.6905 | 0.02297 | 0.36027 | 0.9149 |
| | | 0.8496 | 0.8245 | 0.02883 | 0.21191 | 1.2483 |
| | | 1.0000 | 0.9500 | 0.05000 | 0.05521 | 1.9172 |

Table I. (Continued)

(b)

| α | β | Maximum number of analyses | Average number of analyses | δ* | Maximum sample size | Average sample size |
|---|---|---|---|---|---|---|
| | | Information time | Power-spending value | α-spending value | Nominal level | Critical boundary |
| Fixed sample test | | | | | | |
| 0.05 | 0.10 | 1 | 1.0000 | 3.2897 | 10.8222 | 10.8222 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.10000 | 1.6449 |
| Optimized: average sample size given equality | | | | | | |
| 0.05 | 0.10 | 2 | 1.3909 | 3.4464 | 11.8780 | 9.0237 |
| | | 0.6055 | 0.6091 | 0.03389 | 0.39089 | 0.8580 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.11886 | 1.5596 |
| 0.05 | 0.10 | 3 | 1.8183 | 3.4976 | 12.2330 | 8.5923 |
| | | 0.5177 | 0.4570 | 0.02729 | 0.54305 | 0.6082 |
| | | 0.7117 | 0.7247 | 0.03966 | 0.31902 | 0.9965 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.12676 | 1.5270 |
| 0.05 | 0.10 | 4 | 2.2543 | 3.5217 | 12.4025 | 8.4036 |
| | | 0.4805 | 0.3694 | 0.02325 | 0.63058 | 0.4809 |
| | | 0.6052 | 0.6043 | 0.03452 | 0.44804 | 0.7587 |
| | | 0.7748 | 0.7720 | 0.04200 | 0.28603 | 1.0669 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.13113 | 1.5097 |
| 0.05 | 0.10 | 5 | 2.6934 | 3.5355 | 12.4997 | 8.2987 |
| | | 0.4602 | 0.3127 | 0.02043 | 0.68733 | 0.4025 |
| | | 0.5509 | 0.5194 | 0.03090 | 0.53573 | 0.6193 |
| | | 0.6681 | 0.6767 | 0.03789 | 0.39426 | 0.8519 |
| | | 0.8162 | 0.7977 | 0.04328 | 0.26725 | 1.1094 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.13389 | 1.4989 |
| Optimized: sum of average sample size given equality and maximum sample size | | | | | | |
| 0.05 | 0.10 | 2 | 1.4496 | 3.3273 | 11.0706 | 9.2645 |
| | | 0.7036 | 0.5504 | 0.02074 | 0.44957 | 0.7561 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.10630 | 1.6151 |
| 0.05 | 0.10 | 3 | 1.9284 | 3.3352 | 11.1234 | 8.9049 |
| | | 0.6375 | 0.4026 | 0.01568 | 0.59741 | 0.5281 |
| | | 0.7992 | 0.6690 | 0.02552 | 0.36417 | 0.9074 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.10791 | 1.6077 |
| 0.05 | 0.10 | 4 | 2.4118 | 3.3385 | 11.1453 | 8.7484 |
| | | 0.6090 | 0.3221 | 0.01301 | 0.67787 | 0.4154 |
| | | 0.7176 | 0.5440 | 0.02070 | 0.49751 | 0.6784 |
| | | 0.8515 | 0.7220 | 0.02803 | 0.31940 | 0.9957 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.10862 | 1.6044 |
| 0.05 | 0.10 | 5 | 2.8966 | 3.3402 | 11.1572 | 8.6612 |
| | | 0.5931 | 0.2712 | 0.01126 | 0.72878 | 0.3467 |
| | | 0.6741 | 0.4615 | 0.01792 | 0.58297 | 0.5491 |
| | | 0.7720 | 0.6182 | 0.02348 | 0.43648 | 0.7782 |
| | | 0.8838 | 0.7526 | 0.02972 | 0.29115 | 1.0556 |
| | | 1.0000 | 0.9000 | 0.05000 | 0.10902 | 1.6026 |

Table I. (Continued)

(c)

| α | β | Maximum number of analyses | Average number of analyses | δ* | Maximum sample size | Average sample size |
|---|---|---|---|---|---|---|
| | | Information time | Power-spending value | α-spending value | Nominal level | Critical boundary |
| Fixed sample test | | | | | | |
| 0.05 | 0.20 | 1 | 1.0000 | 2.9263 | 8.5631 | 8.5631 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.20000 | 1.2816 |
| Optimized: average sample size given equality | | | | | | |
| 0.05 | 0.20 | 2 | 1.5035 | 3.0133 | 9.0797 | 7.7420 |
| | | 0.7033 | 0.4965 | 0.03089 | 0.50348 | 0.6690 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.22670 | 1.2089 |
| 0.05 | 0.20 | 3 | 2.0295 | 3.0390 | 9.2355 | 7.5337 |
| | | 0.6319 | 0.3671 | 0.02439 | 0.63291 | 0.4776 |
| | | 0.7920 | 0.6034 | 0.03674 | 0.44693 | 0.7605 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.23636 | 1.1842 |
| 0.05 | 0.20 | 4 | 2.5599 | 3.0507 | 9.3070 | 7.4402 |
| | | 0.6008 | 0.2957 | 0.02053 | 0.70434 | 0.3795 |
| | | 0.7079 | 0.4942 | 0.03146 | 0.56025 | 0.5825 |
| | | 0.8426 | 0.6502 | 0.03927 | 0.41905 | 0.8081 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.24130 | 1.1717 |
| Optimized: sum of average sample size given equality and maximum sample size | | | | | | |
| 0.05 | 0.20 | 2 | 1.5537 | 2.9475 | 8.6877 | 7.8629 |
| | | 0.7873 | 0.4463 | 0.02087 | 0.55366 | 0.5923 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.20912 | 1.2560 |
| 0.05 | 0.20 | 3 | 2.1210 | 2.9524 | 8.7168 | 7.6889 |
| | | 0.7355 | 0.3239 | 0.01566 | 0.67606 | 0.4179 |
| | | 0.8605 | 0.5550 | 0.02609 | 0.48033 | 0.7058 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.21156 | 1.2493 |
| 0.05 | 0.20 | 4 | 2.6897 | 2.9546 | 8.7294 | 7.6108 |
| | | 0.7124 | 0.2590 | 0.01289 | 0.74101 | 0.3305 |
| | | 0.7989 | 0.4449 | 0.02104 | 0.59601 | 0.5302 |
| | | 0.8993 | 0.6064 | 0.02888 | 0.43901 | 0.7739 |
| | | 1.0000 | 0.8000 | 0.05000 | 0.21267 | 1.2463 |

(d)

α     β     Maximum number  Average number  δ*      Maximum      Average
            of analyses     of analyses             sample size  sample size
      Information  Power-spending  α-spending  Nominal  Critical
      time         value           value       level    boundary

0.05  0.30  1               1.0000          2.6803  7.1841       7.1841
      1.0000       0.7000          0.05000     0.30000  1.0364

0.05  0.30  2               1.5940          2.7291  7.4480       6.7834
      0.7802       0.4060          0.02860     0.59403  0.5330
      1.0000       0.7000          0.05000     0.32814  0.9779


Table I. (Continued)
(d) (contd)

α     β     Maximum number  Average number  δ*      Maximum      Average
            of analyses     of analyses             sample size  sample size
      Information  Power-spending  α-spending  Nominal  Critical
      time         value           value       level    boundary

0.05  0.30  3               2.2015          2.7429  7.5232       6.6779
      0.7244       0.2975          0.02214     0.70246  0.3820
      0.8508       0.5009          0.03459     0.54969  0.5982
      1.0000       0.7000          0.05000     0.33737  0.9594

0.05  0.30  4               2.8108          2.7491  7.5575       6.6298
      0.6995       0.2392          0.01842     0.76081  0.3044
      0.7864       0.4056          0.02912     0.64652  0.4586
      0.8897       0.5445          0.03731     0.52591  0.6343
      1.0000       0.7000          0.05000     0.34192  0.9504

0.05  0.30  2               1.6345          2.6922  7.2479       6.8461
      0.8483       0.3655          0.02096     0.63449  0.4754
      1.0000       0.7000          0.05000     0.30966  1.0159

0.05  0.30  3               2.2744          2.6951  7.2637       6.7580
      0.8091       0.2640          0.01556     0.73599  0.3372
      0.9027       0.4616          0.02655     0.57203  0.5651
      1.0000       0.7000          0.05000     0.31229  1.0104

0.05  0.30  4               2.9136          2.6964  7.2708       6.7176
      0.7912       0.2110          0.01271     0.78904  0.2676
      0.8574       0.3666          0.02123     0.67098  0.4248
      0.9309       0.5089          0.02954     0.53489  0.6206
      1.0000       0.7000          0.05000     0.31351  1.0079
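The columns of Table I are linked by simple normal-distribution identities, which can be checked numerically. The following minimal sketch assumes (this is not stated explicitly in this section) that the standardized statistic at information time $t$ is distributed as $N(\delta\sqrt{t}, 1)$ and that the trial stops at an interim analysis only to claim equivalence, that is, when the absolute value of the statistic falls below the critical boundary. It uses the first stage of the two-stage design with $\alpha = 0.05$ and $\beta = 0.20$ from the upper part of Table I(c); all function and variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function

# First stage of the two-stage design in Table I(c), alpha = 0.05, beta = 0.20.
DELTA_STAR = 3.0133   # drift at the boundary of the null hypothesis
T1 = 0.7033           # information time of the interim analysis
C1 = 0.6690           # critical boundary at the interim analysis


def stop_probability(boundary, t, delta):
    """P(|Z| < boundary) for Z ~ N(delta * sqrt(t), 1) (assumed scaling)."""
    mean = delta * sqrt(t)
    return PHI(boundary - mean) - PHI(-boundary - mean)


# Nominal level: two-sided p-value that corresponds to the critical boundary.
nominal_level = 2.0 * (1.0 - PHI(C1))

# Power-spending value: probability of claiming equivalence when delta = 0.
power_spent = stop_probability(C1, T1, 0.0)

# Alpha-spending value: the same probability at the null boundary delta = delta*.
alpha_spent = stop_probability(C1, T1, DELTA_STAR)

# Average number of analyses under equality for a two-stage design.
average_analyses = 1.0 * power_spent + 2.0 * (1.0 - power_spent)

print(f"nominal level    {nominal_level:.5f}")
print(f"power spent      {power_spent:.4f}")
print(f"alpha spent      {alpha_spent:.5f}")
print(f"average analyses {average_analyses:.4f}")
```

Under these assumptions the computed values agree, up to rounding of the tabulated inputs, with the first-stage entries of that design (nominal level 0.50348, power-spending value 0.4965, α-spending value 0.03089) and with its average number of analyses, 1.5035.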

In Table II, the required increment in maximum sample size and the gain in average sample size under the condition $\delta = 0$, compared to the fixed sample test, are tabulated for both types of optimized designs. The results demonstrate that group sequential equivalence testing is most profitable when the power of the trial to establish equivalence is high. If the power is less than 70 per cent, interim analyses will not be worthwhile. Furthermore, for trials with a power of 95 per cent, most of the expected gain in reduction of patient numbers will be achieved with four interim analyses. There will be no relevant additional gain by further increasing the number of interim analyses. When the sum of the average sample size given equality and the maximum sample size is used as a minimization criterion for constructing optimized group sequential designs, there will be only a small increase in maximum sample size compared to the fixed sample test. Nevertheless, the average sample size is close to the minimal possible average sample size.

Of course, it is possible to construct optimal group sequential designs for any specified value of the drift parameter $\delta$. As a third type of example, the special value $\delta = \tfrac{1}{2}\delta^*$ is used for exemplary tabulation in Table III. Like the designs of the second type, these designs will be appropriate if there are good reasons to assume that there are no clinically relevant differences in the primary endpoint between the two treatments, but one cannot assume that the treatments are equal.


Table II. Saving of sample size by optimized group sequential test procedures for equivalence, α = 0.05. Optimization with respect to average sample size given equality, and optimization with respect to sum of average sample size given equality and maximum sample size

β     Number of  Maximum      Average sample size
      stages     sample size  given equality

Optimization with respect to average sample size given equality
0.05  2          +12.8%       -23.1%
0.05  3          +17.6%       -28.3%
0.05  4          +19.9%       -30.6%
0.05  5          +21.3%       -31.8%
0.10  2          +9.8%        -16.6%
0.10  3          +13.0%       -20.6%
0.10  4          +14.6%       -22.3%
0.10  5          +15.5%       -23.3%
0.20  2          +6.0%        -9.6%
0.20  3          +7.9%        -12.0%
0.20  4          +8.7%        -13.1%
0.30  2          +3.7%        -5.6%
0.30  3          +4.7%        -7.0%
0.30  4          +5.2%        -7.7%

Optimization with respect to sum of average sample size given equality and maximum sample size
0.05  2          +2.9%        -20.2%
0.05  3          +3.5%        -24.6%
0.05  4          +3.7%        -26.4%
0.05  5          +3.8%        -27.5%
0.10  2          +2.3%        -14.4%
0.10  3          +2.8%        -17.7%
0.10  4          +3.0%        -19.2%
0.10  5          +3.1%        -20.0%
0.20  2          +1.5%        -8.2%
0.20  3          +1.8%        -10.2%
0.20  4          +1.9%        -11.1%
0.30  2          +0.9%        -4.7%
0.30  3          +1.1%        -5.9%
0.30  4          +1.2%        -6.5%
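The percentages in Table II follow directly from the maximum and average sample sizes listed in Table I, taken relative to the fixed sample test with the same $\alpha$ and $\beta$. As a minimal sketch, the rows for $\beta = 0.20$ can be recomputed from the entries of Table I(c); the dictionary layout and the helper name `percent_change` are illustrative choices.

```python
# Sample sizes from Table I(c), alpha = 0.05, beta = 0.20.
FIXED_N = 8.5631  # fixed sample test (one analysis)

# (criterion, number of stages) -> (maximum sample size, average sample size)
DESIGNS = {
    ("average", 2): (9.0797, 7.7420),
    ("average", 3): (9.2355, 7.5337),
    ("average", 4): (9.3070, 7.4402),
    ("sum", 2): (8.6877, 7.8629),
    ("sum", 3): (8.7168, 7.6889),
    ("sum", 4): (8.7294, 7.6108),
}


def percent_change(value, reference):
    """Relative change of value against reference, in per cent."""
    return 100.0 * (value / reference - 1.0)


def table_two_row(max_n, avg_n):
    """Increment in maximum and change in average sample size, one decimal."""
    return (
        round(percent_change(max_n, FIXED_N), 1),
        round(percent_change(avg_n, FIXED_N), 1),
    )


for (criterion, stages), (max_n, avg_n) in DESIGNS.items():
    max_pct, avg_pct = table_two_row(max_n, avg_n)
    print(f"{criterion:8s} stages={stages}  max {max_pct:+.1f}%  avg {avg_pct:+.1f}%")
```

Here "average" denotes optimization with respect to the average sample size given equality and "sum" denotes optimization with respect to the sum of average and maximum sample size.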

As a function of the location parameter $\delta/\delta^*$, the average sample sizes of the three types of group sequential tests presented above are plotted relative to the fixed sample test (Figure 2). A curve like these will be called the sample size characteristic curve of a group sequential design. As can be seen from Figure 2(c), the four-stage group sequential test which is optimal for $\delta = 0$ is still nearly optimal for values $|\delta| < 0.3\,\delta^*$, but is no longer optimal in the region $|\delta| > 0.5\,\delta^*$. The sample size characteristic curves of the group sequential tests which are optimal with respect to the sum of maximum and average sample size given $\delta = 0$ and of those which are optimal with respect to the average sample size for $|\delta| = 0.5\,\delta^*$ are nearly equal. Over the whole region of $\delta$-values, their sample size characteristic curves are near the lower boundary of the sample size characteristic curves of all four-stage procedures. The use of these types of group sequential tests thus leads to nearly optimal gain in average sample size over the whole range of possible values of difference in the main outcome of the two treatments.
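For a two-stage design, a sample size characteristic curve can be evaluated in closed form, because the trial can stop early only at the single interim analysis. The sketch below is an illustration, not the computation used for Figure 2: it takes the two-stage design with $\alpha = 0.05$ and $\beta = 0.20$ from the upper part of Table I(c) (Figure 2 itself refers to $\beta = 0.1$) and assumes that the standardized statistic at information time $t$ is $N(\delta\sqrt{t}, 1)$ with stopping only for equivalence. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf

# Two-stage design from Table I(c), alpha = 0.05, beta = 0.20.
FIXED_N = 8.5631      # sample size of the fixed sample test
N_MAX = 9.0797        # maximum sample size of the two-stage design
DELTA_STAR = 3.0133   # drift at the boundary of the null hypothesis
T1 = 0.7033           # information time of the interim analysis
C1 = 0.6690           # critical boundary at the interim analysis


def relative_average_sample_size(ratio):
    """Average sample size relative to the fixed sample test.

    ratio is the location parameter delta / delta*.
    """
    mean = ratio * DELTA_STAR * sqrt(T1)
    # Probability of claiming equivalence (and stopping) at the interim look.
    p_stop = PHI(C1 - mean) - PHI(-C1 - mean)
    average_n = N_MAX * (T1 * p_stop + 1.0 * (1.0 - p_stop))
    return average_n / FIXED_N


for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"delta/delta* = {ratio:4.2f}  relative average sample size = "
          f"{relative_average_sample_size(ratio):.4f}")
```

At $\delta = 0$ the sketch reproduces the tabulated average sample size of 7.7420 up to rounding; for larger $|\delta|$ early stopping becomes rare and the average approaches the maximum sample size, which exceeds that of the fixed sample test.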


Table III. Group sequential test procedures for equivalence with optimized testing times and critical values. Optimization with respect to average sample size given the half of δ* as unimportant difference

α     β     Maximum number  Average number  δ*      Maximum      Average
            of analyses     of analyses             sample size  sample size
      Information  Power-spending  α-spending  Nominal  Critical
      time         value           value       level    boundary

0.05  0.05  4               2.1017          3.6936  11.7063      9.4416
      0.5254       0.3937          0.01461     0.60626  0.5154
      0.6620       0.6633          0.02369     0.37114  0.8943
      0.8318       0.8413          0.03247     0.18628  1.3217
      1.0000       0.9500          0.05000     0.05706  1.9029

0.05  0.10  4               2.3435          3.3497  11.2205      8.6866
      0.5993       0.3331          0.01403     0.66694  0.4304
      0.7175       0.5704          0.02269     0.46917  0.7238
      0.8615       0.7530          0.03115     0.28406  1.0712
      1.0000       0.9000          0.05000     0.11075  1.5949

Figure 2. Average sample size of optimized group sequential test procedures for equivalence relative to the fixed sample test as reference, example $\alpha = 0.05$ and $\beta = 0.1$. (a) Minimization with respect to average sample size given equality, up to five stages. (b) Minimization with respect to sum of average sample size given equality and maximum sample size, up to five stages. (c) Four-stage designs from (a) and (b) and minimized with respect to average sample size given half of $\delta^*$, together with the lower boundary (dotted line) of all four-stage procedures

AN EXAMPLE: COMPARISON OF TWO BRONCHODILATORS

Castle et al.$^{16}$ examined in a clinical trial the different effects of the two bronchodilators salmeterol (E) and salbutamol (C) in asthmatic patients; $n = 25{,}180$ patients were randomized, $n_E = 16{,}787$ to salmeterol treatment and $n_C = 8393$ to salbutamol. For the 2:1 randomization, the fraction of patients allocated to the group obtaining the experimental treatment is $q = 2/3$. Various analyses were carried out, and the principal conclusion of the clinical comparison was that the two treatments are equivalent. The main claim for equivalence seems to come from the statistical comparison of the incidence of serious events (including death) during the first 16 weeks. There were 722 serious events observed in the salmeterol treatment group and 362 in the group treated with salbutamol. The incidence rates $p_E$ and $p_C$ can be estimated by the proportions $\bar{x}_E = 0.04301$ and $\bar{x}_C = 0.04313$, respectively. An estimate of their difference $p_E - p_C$ is given by $-0.00012$ with 90 per cent confidence interval $[-0.00459, 0.00434]$. Thus, by the demonstrated similarity of the incidence rates, the treatments are equivalent if the incidence of serious events is accepted as the primary endpoint.
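The reported estimates can be recomputed from the event counts. The sketch below assumes that the 90 per cent confidence interval is the usual normal-approximation (Wald) interval with unpooled variance; the text does not name the method, but this choice reproduces the quoted figures. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Observed data quoted from the trial of Castle et al.
events_e, n_e = 722, 16787   # salmeterol (E)
events_c, n_c = 362, 8393    # salbutamol (C)

# Estimated incidence rates and their difference.
x_e = events_e / n_e
x_c = events_c / n_c
difference = x_e - x_c

# Unpooled standard error of the difference of two proportions.
standard_error = sqrt(x_e * (1 - x_e) / n_e + x_c * (1 - x_c) / n_c)

# Two-sided 90 per cent interval uses the 95 per cent normal quantile.
z = NormalDist().inv_cdf(0.95)
lower = difference - z * standard_error
upper = difference + z * standard_error

print(f"x_E = {x_e:.5f}, x_C = {x_c:.5f}")
print(f"difference = {difference:.5f}")
print(f"90% CI = [{lower:.5f}, {upper:.5f}]")
```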

Suppose the trial is redesigned with the objective of establishing equivalence by comparison of the binomially distributed numbers of serious events in the two treatment groups. Discrepancy of the treatments may be measured using the statistic

$$\frac{\bar{X}_E - \bar{X}_C}{\sqrt{\dfrac{\bar{X}_E(1-\bar{X}_E)}{n_E} + \dfrac{\bar{X}_C(1-\bar{X}_C)}{n_C}}}.$$

When sequentially accumulating data, the information time $t$ is the proportion of assessed patients relative to $n$, and the statistic above forms a process approximately following a Gaussian process $T_t$ with drift parameter

$$\delta = \sqrt{n}\,\frac{p_E - p_C}{\sqrt{\dfrac{p_E(1-p_E)}{q} + \dfrac{p_C(1-p_C)}{1-q}}} \qquad \text{(linkage formula for binary outcome)}.$$

Consequently, the approximate formula for sample size determination is

$$n = \left(\frac{p^*_E(1-p^*_E)}{q} + \frac{p^*_C(1-p^*_C)}{1-q}\right)\frac{1}{(p^*_E - p^*_C)^2}\; n_{\text{table}} \qquad \text{(sample size formula for binary outcome)}$$

where $p^*_E$ and $p^*_C$ are the incidence rates corresponding to the minimal important difference of the two treatments and $n_{\text{table}}$ denotes the sample size provided in the tables.

As supposed by Whitehead,$^{7}$ the probability of a serious event occurrence within the first 16 weeks is assumed to be $p^*_C = 0.05$ for patients treated with salbutamol. A reduction of this probability by one percentage point, to a value $p^*_E = 0.04$ for patients treated with salmeterol, will then be the minimal important difference. Under these specifications, including the intention to allocate patients by balanced randomization with fraction $q = 2/3$ to salmeterol, the fixed sample test with significance level $\alpha = 0.05$ to test for equivalence and power $1 - \beta = 0.95$ to detect equal incidence rates would then require recruitment of 26,004 patients using the specified asymptotic model, 17,336 patients randomized to salmeterol and 8668 randomized to salbutamol.
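The sample size formula for binary outcome can be evaluated for these specifications. The sketch below rests on two assumptions that are not taken from the text: the fixed-sample table entry $n_{\text{table}}$ is approximated by $(z_{1-\alpha} + z_{1-\beta/2})^2$ instead of being read from Table I(a), and the result is rounded up to a multiple of 3 so that the 2:1 allocation yields whole group sizes. With these choices the quoted patient numbers are reproduced. Function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def variance_term(p_e, p_c, q):
    """p_E(1-p_E)/q + p_C(1-p_C)/(1-q) from the linkage formula."""
    return p_e * (1 - p_e) / q + p_c * (1 - p_c) / (1 - q)


def sample_size(p_e, p_c, q, n_table):
    """Sample size formula for binary outcome."""
    return variance_term(p_e, p_c, q) / (p_e - p_c) ** 2 * n_table


def drift(n, p_e, p_c, q):
    """Linkage formula: drift parameter delta for total sample size n."""
    return sqrt(n) * (p_e - p_c) / sqrt(variance_term(p_e, p_c, q))


P_E, P_C, Q = 0.04, 0.05, 2 / 3
ALPHA, BETA = 0.05, 0.05

# Assumption: approximation of the fixed-sample table entry.
delta_star = Z(1 - ALPHA) + Z(1 - BETA / 2)
n_table = delta_star ** 2

n_raw = sample_size(P_E, P_C, Q, n_table)
n_total = 3 * ceil(n_raw / 3)   # assumption: round up to a multiple of 3
n_control = n_total // 3        # salbutamol
n_experimental = 2 * n_control  # salmeterol

print(f"unrounded n = {n_raw:.1f}")
print(f"total {n_total}, salmeterol {n_experimental}, salbutamol {n_control}")
print(f"drift at the minimal important difference: {drift(n_total, P_E, P_C, Q):.4f}")
```

The last line applies the linkage formula in the forward direction: with the rounded sample size, the drift at the minimal important difference has absolute value close to the approximated $\delta^*$.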

Figure 3 ((a) and (b)) illustrates the definition of the hypotheses of important differences versus equivalence as a consequence of the specifications. The definition is based on the assumption that the difference of the incidence rates is an adequate parameter to evaluate differences in serious events and that the baseline rate for the well accepted control, $p^*_C = 0.05$, is approximately correct. The definition of the equivalence hypothesis should also be acceptable for values of $p_C$ in an extended region of interest. If not, another statistical model should be applied. For example, a model focusing on the logits of incidences instead of the incidence rates may be used in this case. See Whitehead$^{7}$ for a description of this alternative model resembling a Brownian motion with drift, and see Figure 3 ((c) and (d)) for distinction between the definitions of the equivalence alternative.


Figure 3. Hypotheses for the example 'comparison of two bronchodilators', $H_1$: area of equivalence meaning the area of unimportant differences in incidence rates. Statistical model: (a) and (b), model for the example in this paper; (c) and (d), model with drift parameter $\delta$ proportional to the difference of logits of incidence rates. Presented region of incidence rates: (a) and (c), local baseline assumption (*) extended to the region of interest; (b) and (d), global space of incidence rates

Interim analyses will be planned to gain a maximal reduction in average sample size. Next one has to choose the number of interim analyses which are feasible. In the case of the present study, recruitment per week was about 500 patients. Thus, the duration of the recruitment was about one year. In this situation, it will not be practical to perform more than three interim analyses. Therefore, an optimized four-stage design is selected from Table I(a). Two different possibilities will be illustrated in the following paragraphs, design (a) and design (b), taken from the upper and lower part of the table, respectively.

Figure 4. Simulated sample path for the example 'comparison of two bronchodilators' and optimized four-stage group sequential designs (dots) testing equivalence, $\alpha = 0.05$ and $\beta = 0.05$ (see also Table I(a)). Criterion for minimization: (a) average sample size given equality; (b) sum of average sample size given equality and maximum sample size

If one can be sure that there are nearly equal incidence rates, the minimization criterion 'average sample size under the condition of equality' will be chosen (a). Then the maximum sample size will be 31,188 patients, 20,792 treated with salmeterol and 10,396 with salbutamol. If the incidence rates are equal, the power of stopping within the first, up to the second, and up to the third interim look will be 0.4226, 0.6785 and 0.8457, respectively. Hence, the average sample size will be reduced to 18,055 patients. The information time points for the interim analyses taken from the table are 0.3993, 0.5300 and 0.7193. By multiplying these values with the maximum sample size, the information time is transformed into the respective number of patients. Thus, the first, second and third interim analysis will be performed after assessment of 4151, 5510 and 7478, respectively, patients treated with salbutamol and twice the number treated with salmeterol. (Note that randomization techniques should balance treatment allocation to control the fraction $q$. One may aim at near accuracy in the neighbourhood of the information times where analyses are planned to be performed.) The critical boundaries of the test statistic $T_t$ given above for claiming equivalence are $\pm 0.5572$, $\pm 0.8972$ and $\pm 1.2815$, respectively. If the trial has not stopped after an interim analysis, the final analysis is performed on the maximum number of patients. The critical boundaries for $T_1$ are $\pm 1.8120$. Instead of the critical boundaries, one can use the nominal levels given in the table. In practice, one will perform a conventional test for differences of the treatments after the respective number of patients for an analysis is assessed. The unadjusted two-sided p-values obtained from these tests are compared to the nominal levels. When testing for equivalence, the null hypothesis will be rejected and the trial will be stopped if the p-value is larger than the nominal level provided in the table.
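The planning arithmetic and the decision rule for design (a) can be sketched as follows. The numbers are those quoted in the paragraph above; the nominal levels are computed from the critical boundaries as two-sided p-values, which is an assumption about how the table column is defined, and the function names are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf

# Design (a) as quoted in the text.
N_MAX = 31188                                   # maximum total sample size
N_MAX_CONTROL = 10396                           # salbutamol patients at the maximum
INFO_TIMES = (0.3993, 0.5300, 0.7193, 1.0)      # three interim looks, final analysis
CUM_STOP_PROB = (0.4226, 0.6785, 0.8457, 1.0)   # under equality; the trial ends at t = 1
BOUNDARIES = (0.5572, 0.8972, 1.2815, 1.8120)   # critical boundaries for |T_t|


def control_patients(t):
    """Number of salbutamol patients assessed at information time t."""
    return round(t * N_MAX_CONTROL)


def average_sample_size():
    """Expected total sample size under equal incidence rates."""
    total, previous = 0.0, 0.0
    for t, cumulative in zip(INFO_TIMES, CUM_STOP_PROB):
        total += t * (cumulative - previous)  # stop exactly at this analysis
        previous = cumulative
    return N_MAX * total


def nominal_level(boundary):
    """Two-sided p-value that corresponds to a critical boundary."""
    return 2.0 * (1.0 - PHI(boundary))


def claim_equivalence(z_stat, stage):
    """Reject the null hypothesis if the p-value exceeds the nominal level."""
    p_value = 2.0 * (1.0 - PHI(abs(z_stat)))
    return p_value > nominal_level(BOUNDARIES[stage])


print([control_patients(t) for t in INFO_TIMES[:3]])
print(round(average_sample_size()))
print([round(nominal_level(c), 5) for c in BOUNDARIES])
```

Comparing the p-value with the nominal level is the same decision as comparing $|T_t|$ with the critical boundary, since the two-sided p-value decreases as the absolute value of the statistic grows. The computed average sample size matches the quoted 18,055 only up to rounding of the tabulated inputs.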

If there is uncertainty whether the incidence rates are equal, it may be preferable to use the minimization criterion which takes into account the maximum sample size (b). Then the maximum sample size will be 26,967 patients, 17,978 randomly allocated to salmeterol and 8989 to salbutamol. When the assumption of nearly equal incidence rates fails, no more than 1000 patients have to be recruited in addition to the number of patients required for the fixed sample design. Although the average sample size under the condition of equality will be 19,128 patients, which is about 1000 patients more compared to the minimal average sample size of all four-stage procedures, nevertheless there will be a saving of about 7000 patients.
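The trade-off between designs (a) and (b) can be summarized relative to the fixed sample test with 26,004 patients. The short sketch below only rearranges numbers quoted in the text; the names are illustrative.

```python
FIXED_N = 26004  # fixed sample test

DESIGNS = {
    "(a)": {"max_n": 31188, "avg_n": 18055},
    "(b)": {"max_n": 26967, "avg_n": 19128},
}


def percent_change(value, reference):
    """Relative change of value against reference, in per cent."""
    return 100.0 * (value / reference - 1.0)


def summarize(design):
    """Absolute and relative changes against the fixed sample test."""
    return {
        "extra_max": design["max_n"] - FIXED_N,
        "saving_avg": FIXED_N - design["avg_n"],
        "max_pct": round(percent_change(design["max_n"], FIXED_N), 1),
        "avg_pct": round(percent_change(design["avg_n"], FIXED_N), 1),
    }


for name, design in DESIGNS.items():
    print(name, summarize(design))

# Price paid by design (b) in average sample size compared with design (a).
print(DESIGNS["(b)"]["avg_n"] - DESIGNS["(a)"]["avg_n"])
```

The relative changes agree with the four-stage rows for $\beta = 0.05$ in Table II.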

Figure 4 illustrates group sequential equivalence testing in the study of Castle et al.$^{16}$ when using one of the two explicitly described optimized designs referred to as (a) and (b), respectively. One simulated sample path consistent with the presented result of the trial and the two exemplary designs are shown. The path was plotted up to the information time corresponding to the 25,180 recruited patients. The information times within the designs are less than 1 because the maximum sample size will increase when using group sequential designs. If design (a) had been implemented, the trial would have stopped at the second interim analysis, claiming equivalence. Using design (b), the trial would have stopped at the first interim look, with conclusion of equivalence. In both cases, the reliable conclusion of equivalence could be made based on substantially fewer patients than were recruited in the real trial.

DISCUSSION

Developed to establish bioequivalence in biopharmacy, statistical testing for equivalence has been extended to the general context of comparisons of treatment effects with the objective of demonstrating equivalence. In medical care, even small treatment differences become important when treatment differences are divided into those important and those unimportant, the latter meaning that the treatments will be accepted to be clinically equivalent. Consequently, equivalence trials for therapeutic decision-making require in general large patient numbers. Early stopping as soon as reliable claiming of equivalence is possible will save costs of the trial. If the new treatment is known to have important secondary advantages, for example, is less toxic or involves less extensive surgery, then it is also an ethical issue to conclude equivalence as early as possible. Therefore sequential methods have also been explored in equivalence trials.

If the experimental treatment is known to have secondary advantages (for example, is more conservative) over the control it may be preferable to test whether the experimental treatment is at least as effective as the control (non-inferiority/one-sided equivalence instead of two-sided equivalence). If no secondary advantages of one of the treatments over the other are known, examination of secondary endpoints may be a primary concern of the trial. Then the trial should not stop when equivalence in the primary endpoint is demonstrated. Thus, sequential testing for two-sided therapeutic equivalence of two treatments is mainly based on economic grounds.

The last paragraph addresses the point that trials with the objective of establishing equivalence have to be divided carefully into one-sided equivalence trials and two-sided equivalence trials. Occasions were stated where one-sided therapeutic equivalence testing is appropriate. On the other hand, known therapeutic differences in secondary endpoints, advantages and disadvantages, often cannot be summarized to overall secondary advantages (or disadvantages) of the experimental treatment. Then there are occasions in therapeutic decision-making where establishing two-sided equivalence in the primary endpoint will be of interest.

It should be mentioned that the trial of Castle et al.$^{16}$ was not designed as an equivalence trial with a primary endpoint. Furthermore, there are doubts as to whether to conclude equivalence based on the incidence of serious events as defined. (Besides, note that the two drugs are not therapeutically equivalent; salmeterol has the longer duration of action and salbutamol the faster onset.) Despite the fact that this trial is therefore a poor teaching example for planning and analysing an equivalence trial, the example is illustrative and allows for a simple comparison with the sequential procedures identified by Whitehead.$^{7}$ For anyone who disapproves of the shortcomings, see the INJECT trial,$^{20}$ a large equivalence trial on 6010 patients. (Note that the INJECT trial was planned as a one-sided equivalence trial.)

According to the desired savings in average patient numbers, Whitehead$^{7}$ demonstrates the advantages of sequential designs in equivalence testing. He also discusses possible disadvantages of the double triangular design and a design due to Pampallona and Tsiatis.$^{21}$ Both designs were developed to establish either superiority or equivalence. Since the formal hypotheses have a competitive intersection of unimportant differences, they appear to be more adequate if the major objective is demonstrating superiority. If the major objective of the trial is demonstrating equivalence, a claim of superiority of one of the treatments causes a problem. There may be a high risk of concluding superiority if the treatment effects are not identical, but the differences are clinically irrelevant or unimportant. In the case of unimportant differences of the treatments, which means that the equivalence alternative is true, a conclusion of stopping the trial with claimed superiority from an interim analysis may be misinterpreted in the way that the treatments are not equivalent. (The correct interpretation would be that the treatments are not equally efficacious. An example for the possibility of a misinterpretation of study results when sequentially testing competitive hypotheses will be mentioned. The 'Groupe d'Etude et de Traitement du Carcinome Hépatocellulaire' conducted a trial$^{22}$ with the primary objective of demonstrating superiority of the experimental treatment. The trial stopped after an interim analysis although clinically relevant superiority of the experimental treatment was likely.) Hence, as a first step, this paper was interested in constructing sequential designs for the sole objective of establishing equivalence. Consequently, comparisons of the sample size behaviour of the proposed optimized group sequential designs with the designs presented by Whitehead$^{7}$ have to be restricted to the maximum sample size and the average sample size when the treatments are nearly equal.

It may be preferable in most of the therapeutic equivalence trials to use designs which additionally allow for early termination with acceptance of the null hypothesis of important differences because clinically relevant differences may be present. The trial should stop for ethical reasons at the time of an interim analysis, if clinically relevant differences of the treatments are likely and unimportant differences are unlikely. In some cases, the objective of establishing clinically relevant superiority of one of the treatments may be a secondary concern of the trial. Then early stopping with demonstration of clinically relevant differences is also of interest. All these cases will require extended stopping rules or designs testing combined hypotheses depending on the size of the gap between clinically unimportant and clinically relevant differences. Dealing with optimized group sequential test procedures in these cases was beyond the scope of this paper. However, additional testing for major differences of the treatments is recommended for ethical acceptance. If major differences can be shown on the basis of an interim analysis, the trial should stop with acceptance of the null hypothesis. As a consequence, there will be a loss in optimality of the proposed procedure testing equivalence. For example, at each interim analysis a confidence interval with confidence level $1 - \alpha$ or $1 - 2\alpha$ for the effect size parameter $\delta$ may be calculated. Then the following stopping rule may be added: stop the trial with acceptance of important differences if the confidence interval covers effect sizes corresponding, via the linkage formula, to clinically relevant differences but does not cover any unimportant difference.
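The additional stopping rule can be sketched in code. This is one possible reading under stated assumptions, not a procedure specified in the paper: the standardized statistic at information time $t$ is taken to be $N(\delta\sqrt{t}, 1)$, so that a naive (not repeated) confidence interval for $\delta$ is the statistic plus or minus a normal quantile, divided by $\sqrt{t}$, and unimportant differences are taken to be all $\delta$ with $|\delta| < \delta^*$. The example values $\delta^* = 3.3497$ and $t = 0.5993$ are those of the four-stage design with $\beta = 0.10$ in Table III; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def drift_confidence_interval(z_stat, t, level=0.90):
    """Naive confidence interval for delta, assuming Z_t ~ N(delta*sqrt(t), 1)."""
    quantile = NormalDist().inv_cdf(0.5 + level / 2.0)
    root_t = sqrt(t)
    return ((z_stat - quantile) / root_t, (z_stat + quantile) / root_t)


def stop_for_important_difference(z_stat, t, delta_star, level=0.90):
    """True if the interval contains no unimportant difference.

    Unimportant differences are assumed to be |delta| < delta_star, so the
    interval must lie entirely at or above delta_star, or entirely at or
    below -delta_star.
    """
    lower, upper = drift_confidence_interval(z_stat, t, level)
    return lower >= delta_star or upper <= -delta_star


# Illustration at the first interim analysis; level 0.90 is 1 - 2*alpha.
DELTA_STAR, T1 = 3.3497, 0.5993
for z_stat in (1.0, 5.0, -5.0):
    interval = drift_confidence_interval(z_stat, T1)
    decision = stop_for_important_difference(z_stat, T1, DELTA_STAR)
    print(f"z = {z_stat:+.1f}  CI = ({interval[0]:.3f}, {interval[1]:.3f})  stop: {decision}")
```

Adding such a rule changes the operating characteristics of the equivalence test, which is the loss in optimality mentioned above; the sketch does not quantify that loss.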

The approach considered in this paper may be regarded as a first step of optimization in designing interim analyses for equivalence trials. Further investigation will improve optimization when extending the class of stopping rules. Searching for the optimum will then be much more time intensive. In a next step, optimized designs for two-sided superiority trials as well as for two-sided equivalence trials may be computed which allow stopping with accepting the null hypothesis if there will only be a small chance of claiming the desired alternative.

This paper has presented group sequential methods for two-sided equivalence testing. These methods are a compromise between two extremes; one extreme is the fixed sample design with no interim analysis at all, and the other extreme is to perform many interim analyses in a design based on fully sequential test procedures. With the group sequential methods proposed in the present paper, only a few interim analyses will be performed. The example shows that group sequential test procedures with only four stages will reach most of the possible saving of patient numbers. The average sample size was reduced from about 26,000 to about 18,000 or 19,000, respectively. The gain is less than the reduction to about 15,000 achieved by applying the double triangular design, but the maximum sample size is only increased to about 31,000 or 27,000, respectively, instead of 40,000 with the double triangular test. The excessively high maximum sample size of the double triangular design may be explained as follows. First, the interim analyses in the beginning of the trial are α-consuming without offering sufficient power to stop the trial because there is not enough information to conclude. Second, the number of interim analyses in the example of Whitehead$^{7}$ was about three times higher for the double triangular test than for the design due to Pampallona and Tsiatis; however, the group sequential designs drawn from the work of Pampallona and Tsiatis$^{21}$ are not optimal with respect to any sample size criterion.

Monitoring to achieve high data quality for an analysis has its price measured in time delay and money. In practice, there are many trials where it is impossible or too cost intensive to implement procedures with many interim analyses. Hence, in many clinical trials practical issues of data collection allow only for a small number of interim analyses. In these cases, the design should be optimized with respect to ethical and economic aspects. If it is possible to specify an overall cost function including all these aspects, optimal testing times and critical values have to be determined to minimize the costs. In the present paper, this problem was solved for simple cost functions depending on the maximum and the average number of patients to be recruited for the trial. The class of cost functions may be extended; for instance, a weighted sum of the average number of patients and the average number of analyses can be taken.

We suggest that optimized group sequential procedures with few stages will combine the benefit of savings with practical issues. Thus we recommend their use also in equivalence testing.

REFERENCES

1. Pocock, S. J. 'Group sequential methods in the design and analysis of clinical trials', Biometrika, 64, 191-199 (1977).
2. O'Brien, P. C. and Fleming, T. R. 'A multiple testing procedure for clinical trials', Biometrics, 35, 549-556 (1979).
3. Pocock, S. J. 'Interim analyses for randomized clinical trials: the group sequential approach', Biometrics, 38, 153-162 (1982).
4. Fleming, T. R., Harrington, D. P. and O'Brien, P. C. 'Designs for group sequential tests', Controlled Clinical Trials, 5, 348-361 (1984).
5. Brittain, E. H. and Bailey, K. R. 'Optimization of multistage testing times and critical values in clinical trials', Biometrics, 49, 763-772 (1993).
6. Lan, K. K. G. and DeMets, D. L. 'Discrete sequential boundaries for clinical trials', Biometrika, 70, 659-663 (1983).
7. Whitehead, J. 'Sequential designs for equivalence studies', Statistics in Medicine, 15, 2703-2715 (1996).
8. Durrleman, S. and Simon, R. 'Planning and monitoring of equivalence studies', Biometrics, 46, 329-336 (1990).
9. Jennison, C. and Turnbull, B. W. 'Sequential equivalence testing and repeated confidence intervals, with applications to normal and binary responses', Biometrics, 49, 31-43 (1993).
10. Senn, S. J. Statistical Issues in Drug Development, Wiley, Chichester, 1997.
11. Mehring, G. H. 'On optimal tests for general interval-hypotheses', Communications in Statistics - Theory and Methods, 22, 1257-1297 (1993).
12. Tsiatis, A. A. 'Repeated significance testing for a general class of statistics used in censored survival analysis', Journal of the American Statistical Association, 77, 855-861 (1982).
13. Sellke, T. and Siegmund, D. 'Sequential analysis of the proportional hazards model', Biometrika, 70, 315-326 (1983).
14. Olschewski, M. and Schumacher, M. 'Sequential analysis of survival times in clinical trials', Biometrical Journal, 28, 273-293 (1986).
15. Lan, K. K. G. and DeMets, D. L. 'Group sequential procedures: calendar versus information time', Statistics in Medicine, 8, 1191-1198 (1989).


16. Castle, W., Fuller, R., Hall, J. and Palmer, J. 'Serevent nationwide surveillance study: comparison of salmeterol with salbutamol in asthmatic patients who require regular bronchodilator treatment', British Medical Journal, 306, 1034-1037 (1993).
17. Lan, K. K. G. and Zucker, D. M. 'Sequential monitoring of clinical trials: the role of information and Brownian motion', Statistics in Medicine, 12, 753-765 (1993).
18. Bauer, P. 'The choice of sequential boundaries based on the concept of power spending', Biometrie und Informatik in Medizin und Biologie, 23, 3-15 (1992).
19. Armitage, P., McPherson, C. K. and Rowe, B. C. 'Repeated significance tests on accumulating data', Journal of the Royal Statistical Society Series A, 132, 235-244 (1969).
20. 'Randomised, double-blind comparison of reteplase double-bolus administration with streptokinase in acute myocardial infarction (INJECT): trial to investigate equivalence. INternational Joint Efficacy Comparison of Thrombolytics', Lancet, 346, 329-336, 324-325, 980 (1995).
21. Pampallona, S. and Tsiatis, A. A. 'Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis', Journal of Statistical Planning and Inference, 42, 19-35 (1994).
22. Groupe d'Etude et de Traitement du Carcinome Hépatocellulaire 'A comparison of lipiodol chemoembolization and conservative treatment for unresectable hepatocellular carcinoma', New England Journal of Medicine, 332, 1256-1261 (1995).
