Evaluating effects of treatment in subgroups of patients within a clinical trial: The case of non-Q-wave myocardial infarction and beta blockers


Salim Yusuf, MRCP, DPhil, Janet Wittes, PhD, and Jeffrey Probstfield, MD

Most medical researchers believe that randomized clinical trials are the best means of evaluating the effects of a treatment on outcomes in a particular disease.1 Randomized clinical trials are particularly important when the plausible effect is only moderate, e.g., a 15, 20 or 25% reduction in the risk of developing a major adverse outcome such as death, reinfarction or stroke.2-4 In order to detect moderate treatment effects reliably, the errors inherent in the clinical trial must be relatively small. Two sources of error, systematic biases and random errors, occur and both should be minimized. Both these “errors” affect the reliable detection of treatment effects in a trial as a whole or in subgroups within the trial.2

Systematic biases in a single trial are avoided by the allocation of patients to active treatment or control by using strict randomization (not by alternating, odd-even dates or any other method that allows foreknowledge of treatment assignment). Further, analyses should include all randomized patients and the results should be reported based upon rules established in the protocol or before knowledge of the results. Random errors are chiefly avoided by having studies of sufficient size and by combining the results of several related trials. For most common treatments of interest in cardiovascular disease, because the plausible range of effects is only about 15 or 20%, often several hundred to about a thousand events are needed to reach reliable conclusions.2,4 Even when the above criteria are satisfied, we are only in a position to provide reliable answers regarding the average effects in the overall trial, but not about the effects in specific subgroups.

How, then, should one apply the results of a trial to specific subsets of patients, each subset being only a part of the overall data? In this commentary, we will point out the following: the treatment effect is likely to be qualitatively similar (i.e., in the same direction) in all subgroups of patients without obvious contraindications to treatment but is also likely to be quantitatively dissimilar (differences in degree of effect) even when the effects appear to be identical; and estimates of treatment effect within a subgroup chosen for special emphasis are usually “biased” and so the most appropriate estimate in a subgroup is closer to the overall result.

From the Clinical Trials Branch, National Heart, Lung, and Blood Institute, Bethesda, Maryland and the Veterans Administration Cooperative Studies Group, West Haven, Connecticut. Manuscript received and accepted April 18, 1990.

Address for reprints: Salim Yusuf, MRCP, DPhil, Clinical Trials Branch, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, Bethesda, Maryland 20892.

Low likelihood of differences in kind (qualitative interaction) but higher likelihood of differences in degree (quantitative interaction): Patients who are thought to be clearly benefited or definitely harmed by a given treatment are usually not entered into trials. Therefore, a priori, trials exclude the “extremes” expected on biologic and pharmacologic grounds. The low likelihood of qualitative interactions in a trial is supported by the experiences from cardiovascular clinical trials conducted in the last 3 decades.3,4 Claims made in individual trials for apparent qualitative interactions between various subgroups have not been replicated in further studies. For example, Andersen et al5 claimed that long-term β blockade is beneficial among patients <65 years of age and harmful in those older. The international practolol study claimed that treatment was only beneficial among those with anterior infarction, but not among those with inferior infarction.6 In both cases, subsequent studies showed benefit in the elderly and in those with inferior myocardial infarction (MI). Another example concerns the evaluation of thrombolytic agents in acute MI. Many investigators were so convinced that treatment >6 hours would be of little benefit that several studies excluded such patients (i.e., a strong prior expectation of a qualitative interaction). The available data suggest, however, that even delayed treatment provides about two-thirds of the benefit of earlier treatment (a quantitative interaction).3,7

Biases and errors in detecting subgroup effects within a trial: STATISTICAL POWER: In a trial designed with power adequate to detect a given difference in the overall trial, the power to detect similarly sized differences within the various subgroups is substantially lower. The smaller the subgroup, the lower the power. For example, the β Blocker Heart Attack Trial (BHAT), which randomized about 4,000 patients, was designed to have 90% power to detect an overall reduction in mortality of 25%.8 To have 90% power of detecting a similar effect in a specific subset of patients (with similar event rates), one would need 4,000 patients in each subset of interest. Conversely, in a trial such as BHAT, which provides a clear result overall, even if the effect is identical in several large subsets, random variation may exaggerate or dilute the effects so that some subgroups may spuriously appear to have a large effect and others no effect or even a harmful effect. A practical demonstration of this point is an analysis reported by the Second International Study of Infarct Survival.7 Despite overwhelming evidence that aspirin decreased mortality overall, 2 subgroups (those born under the astrologic signs of Libra and Gemini) appeared to be harmed (Table I). In this case, it is highly likely that the true effect in patients in these 2 subgroups is closer to the overall trial result than to the apparent harm observed within these subgroups.

220 THE AMERICAN JOURNAL OF CARDIOLOGY VOLUME 66

TABLE I Example of “Subgrouping” in the Second International Study of Infarct Survival (ISIS-2): Astrology and Aspirin

Vascular mortality at week 5

Subgroup | Aspirin, deaths/patients (%) | Placebo, deaths/patients (%) | Odds decrease (% ± SD)
Patients born under Libra and Gemini* | 150/1,357 (11.1) | 147/1,442 (10.2) | 8% adverse (NS)
Patients born under other “birth signs” | 654/7,228 (9.0) | 868/7,157 (12.1) | 26% ± 5 (p <0.0001)
Overall results | 804/8,587 (9.4) | 1,016/8,600 (11.8) | 23% ± 4 (p <0.0001)

* The best estimate in this subgroup is probably closer to the overall results than to the apparent effect observed in this subgroup alone.

NS = not significant; SD = standard deviation.

TABLE II Hypothetical Subgroup Effects Illustrating the “Play of Chance” in a Trial That Shows Clear Overall Benefit

Group | Treatment, deaths/patients (%) | Control, deaths/patients (%) | Risk decrease (%) | p value
Overall result | 240/3,000 (8) | 300/3,000 (10) | 20 | <0.01
Subgroup A* | 80/1,000 (8) | 100/1,000 (10) | 20 | NS
Subgroup B† | 70/1,000 (7) | 110/1,000 (11) | 36 | <0.001
Subgroup C‡ | 90/1,000 (9) | 90/1,000 (9) | 0 | NS

The probability of observing B or C, if only one trial were conducted and only 3 subgroups were examined, is about 11 and 6%, respectively. However, for most questions, often several trials are conducted and large numbers of subgroups are examined. Therefore, the probability of observing B or C within any single trial increases substantially.

* Effect observed (20% risk decrease) is identical to the overall result, but because the size of the subgroup is only one-third that of the overall trial, the difference is not statistically significant. † The 180 deaths (the same as in subgroup A) are split differently. Play of chance has decreased the number of deaths by just 10 in the treated group (i.e., 80 - 10 = 70) and increased the number of deaths by 10 in the control group (100 + 10 = 110). Observed treatment effect is large (36% risk decrease) and is highly significant (p <0.001). ‡ Reverse of situation B. Play of chance has diluted the result so that there are 90 deaths each in the treated and control groups (0% risk decrease).

NS = not significant.

The probability of observing apparent variations in the sizes of effect among different subgroups when, in fact, there are none, is greater than the probability of observing similar results. An example of the play of chance with subgroups in a hypothetical trial that provides a clear result overall is listed in Table II. If one examined the effects in several subgroups, the play of chance would appear to exaggerate (subgroup B) or dilute (subgroup C) the results in some subgroups. In subgroup A, even though the apparent results are identical to the overall result, the smaller sample size renders the difference not statistically significant. Furberg and Byington9 examined the effects of propranolol in about 150 subgroups in the BHAT (Figure 1). Although the estimated effects are clustered around the overall effect, a few effects in small subgroups appear to be much more extreme. This pattern, which approximates a “normal” distribution, suggests that there is no real heterogeneity of effect among the various subgroups in this trial.
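The Table II pattern can be checked with a few lines of arithmetic. The sketch below takes the hypothetical counts for subgroups A, B and C, computes a two-sided p value for each subgroup on its own, and then computes a chi-square for heterogeneity of the log odds ratios across the three. The choice of tests (a pooled two-proportion z test and an inverse-variance heterogeneity statistic) is an assumption made for illustration; the table does not say which test produced its p values, so exact values need not match its thresholds digit for digit. None of this is BHAT data.

```python
from math import exp, log
from statistics import NormalDist

# Hypothetical counts from Table II:
# (deaths_treated, n_treated, deaths_control, n_control)
SUBGROUPS = {
    "A": (80, 1000, 100, 1000),
    "B": (70, 1000, 110, 1000),
    "C": (90, 1000, 90, 1000),
}


def two_sided_p(d_t, n_t, d_c, n_c):
    """Two-sided p value from a pooled two-proportion z test."""
    pooled = (d_t + d_c) / (n_t + n_c)
    se = (pooled * (1 - pooled) * (1 / n_t + 1 / n_c)) ** 0.5
    z = (d_t / n_t - d_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def log_odds_ratio(d_t, n_t, d_c, n_c):
    """Log odds ratio and its approximate variance (Woolf's formula)."""
    lor = log((d_t / (n_t - d_t)) / (d_c / (n_c - d_c)))
    var = 1 / d_t + 1 / (n_t - d_t) + 1 / d_c + 1 / (n_c - d_c)
    return lor, var


def heterogeneity(subgroups):
    """Inverse-variance chi-square for heterogeneity of the log odds ratios."""
    stats = [log_odds_ratio(*counts) for counts in subgroups.values()]
    weights = [1 / var for _, var in stats]
    pooled = sum(w * lor for w, (lor, _) in zip(weights, stats)) / sum(weights)
    q = sum(w * (lor - pooled) ** 2 for w, (lor, _) in zip(weights, stats))
    if len(stats) - 1 != 2:
        raise ValueError("the closed-form p value below holds only for 2 degrees of freedom")
    return q, exp(-q / 2)  # chi-square survival function for 2 degrees of freedom


p_values = {name: two_sided_p(*counts) for name, counts in SUBGROUPS.items()}
q_stat, p_het = heterogeneity(SUBGROUPS)

for name, p in p_values.items():
    print(f"Subgroup {name}: two-sided p = {p:.4f}")
print(f"Heterogeneity chi-square = {q_stat:.2f} on 2 df, p = {p_het:.3f}")
```

Under these assumptions subgroup B is nominally significant on its own and subgroup C shows no effect at all, yet the heterogeneity test does not reach the 5% level, which is the situation the text describes: discordant-looking subgroups without statistical evidence that the effects truly differ.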

STATISTICAL MULTIPLICITY: Examination of multiple subgroups makes it very difficult to interpret a claim of statistical significance at a given p value (e.g., p = 0.05). For example, in tests of 2 truly comparable treatments, examining numerous subgroups will almost guarantee that at least 1 comparison will be “nominally significant.” Conversely, even if a treatment were truly effective in all subgroups, examination of numerous subgroups will generate some in which there appears to be no effect or even a harmful effect (Figure 1).
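The size of the multiplicity problem can be sketched with one formula: if k independent comparisons are each tested at level alpha when no true difference exists, the chance of at least one nominally significant result is 1 - (1 - alpha)^k. Independence is an assumption made here for simplicity; real subgroups overlap and are correlated, so the figures below are only a rough guide. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20, 50, 150):
    print(f"{k:>3} subgroups: {prob_at_least_one_false_positive(k):.3f}")
```

With a single test the chance is alpha itself; by about 14 independent tests it passes one half, and with 150 it is close to certainty.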

[Figure 1 not reproduced. Horizontal axis: difference between propranolol and placebo mortality rates (%) in subgroup.]

FIGURE 1. Overall, propranolol showed a 29% decrease in risk of death in the β Blocker Heart Attack Trial. The mortality rate in the control group was about 10%. Therefore, the average absolute decrease was 2.5%. All the large subgroups (white area, >2,174 patients) and most of the intermediate size subgroups (stippled area, 988 to 2,173 patients) are clustered around the mean decrease of 2.5%. Of the 150 subgroups examined, only the small ones (dark area, <987 patients) show an apparent lack of effect or extreme benefit. This pattern indicates that there is no real heterogeneity of treatment effect.

THE AMERICAN JOURNAL OF CARDIOLOGY JULY 15, 1990 221

BIASES IN EMPHASIZING SUBGROUP EFFECTS: Reports of trials may emphasize the most striking subgroup effects even if they were found on post hoc analyses. Because such “extreme” results are largely the product of random errors, describing or estimating the effects within that extreme subgroup alone without formally placing the observations in the context of the overall result and similar analyses of other relevant trials is usually biased and often misleading. A more appropriate analysis would use methods that account for the number of subgroups, their relation to other subgroups and the size of the effect within the subgroups and overall.10
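One family of such methods shrinks each subgroup estimate toward the overall result, by an amount that depends on how noisy the subgroup is. The sketch below illustrates the idea on the hypothetical Table II counts. It uses a simple method-of-moments estimate of the between-subgroup variance, which is a stand-in chosen for brevity and not the procedure of reference 10; all function and variable names are hypothetical.

```python
from math import exp, log

# Hypothetical counts from Table II:
# (deaths_treated, n_treated, deaths_control, n_control)
TABLE_II = {
    "A": (80, 1000, 100, 1000),
    "B": (70, 1000, 110, 1000),
    "C": (90, 1000, 90, 1000),
}


def log_odds_ratio(d_t, n_t, d_c, n_c):
    """Log odds ratio and its approximate sampling variance."""
    lor = log((d_t / (n_t - d_t)) / (d_c / (n_c - d_c)))
    var = 1 / d_t + 1 / (n_t - d_t) + 1 / d_c + 1 / (n_c - d_c)
    return lor, var


def shrink_toward_overall(subgroups):
    """Pull each subgroup log odds ratio toward the pooled estimate."""
    names = list(subgroups)
    stats = [log_odds_ratio(*subgroups[name]) for name in names]
    w = [1 / var for _, var in stats]
    pooled = sum(wi * lor for wi, (lor, _) in zip(w, stats)) / sum(w)
    q = sum(wi * (lor - pooled) ** 2 for wi, (lor, _) in zip(w, stats))
    # Method-of-moments estimate of the true between-subgroup variance (never negative).
    tau2 = max(0.0, (q - (len(w) - 1)) / (sum(w) - sum(wi * wi for wi in w) / sum(w)))
    shrunk = {}
    for name, (lor, var) in zip(names, stats):
        keep = tau2 / (tau2 + var)  # share of the subgroup's own deviation that is retained
        shrunk[name] = pooled + keep * (lor - pooled)
    return pooled, shrunk


observed = {name: log_odds_ratio(*counts)[0] for name, counts in TABLE_II.items()}
pooled, shrunk = shrink_toward_overall(TABLE_II)

for name in TABLE_II:
    print(
        f"Subgroup {name}: observed odds decrease {100 * (1 - exp(observed[name])):.0f}%, "
        f"shrunken {100 * (1 - exp(shrunk[name])):.0f}%"
    )
```

Every shrunken estimate lies between the subgroup's own result and the pooled result, so the extreme subgroup B is pulled back and the apparently null subgroup C moves toward benefit. This matches the qualitative argument of the text, that the most appropriate estimate in a subgroup is closer to the overall result.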

REPLICATION AND CONSISTENCY OF FINDINGS: Observed subgroup effects (whether based on prior hypotheses or not) should be tested in other related studies. Only if a consistent pattern emerges from most or all of these studies should the subgroup effect be believed. A formal approach would be to collect the data systematically from all relevant trials, not just the ones that report it, as it is probable that the effects in the remaining studies were quite unremarkable. Even if the real effects of treatment are similar in several subgroups, the play of chance can produce apparently discordant results for some of the subgroups in different trials.

Effects of beta blockers in patients with Q-wave and non-Q-wave myocardial infarction: In this issue, Gheorghiade and his colleagues report post hoc “hypothesis-generating” subgroup analyses from BHAT.11 BHAT showed a clear decrease in mortality with long-term β blockers in the overall trial. The overall results are consistent with a large body of evidence from 24 trials on a total of about 20,000 patients. An overview of all trials indicates a 23% decrease in mortality (95% confidence interval of 15 to 31%; p <0.0001). In BHAT, there were fewer nonfatal myocardial infarctions in the treated group (4.4% propranolol vs 5.3% placebo), but this difference was not statistically significant. Several other trials, however, showed similar results. Indeed, the data from all trials clearly indicate that nonfatal MI is decreased by 26% (95% confidence interval of 17 to 34%; p <0.0001).12 Gheorghiade et al11 observed no benefit of propranolol among the 601 patients with non-Q-wave MI and therefore hypothesized that β blockers are ineffective in such patients. One has to evaluate their analysis and hypothesis against the substantial reports that show very clear evidence of decreases in mortality and reinfarction of about 25% overall. The suggestion that β blockers may be ineffective among patients with non-Q-wave MI is likely to be spurious for the following 4 reasons.

First, the results in these subgroups are not significantly different from each other (chi-square for heterogeneity of the log odds ratio is not significant). Because BHAT examined the effects in about 150 subgroups, a number of small subgroups, which includes non-Q-wave MI, may appear to have no benefit simply by chance (Figure 1).

Second, the power to detect a difference of 25% (the difference observed overall) is very low (roughly 15%). The 95% confidence intervals of the effect in those with non-Q-wave MI easily include a 30% risk decrease. Even if the analysis had shown a modest adverse effect, given the small size of the subgroup, it could still be consistent with a worthwhile decrease (e.g., 20 to 25%) in mortality and reinfarction.

Third, the results are directly contradicted by the timolol trial results, which showed a “significant” decrease in mortality among those with non-Q-wave MI (14% in placebo vs 7% among treated patients; nominal p <0.05).13

Last, the likelihood of benefit in non-Q-wave MI is indirectly strengthened by observations in several of the trials of early intravenous β blockade that indicate prevention of MI among those without ST elevation on the entry electrocardiogram (most of whom probably have a patent infarct-related artery, which is a common feature of non-Q-wave MI).14,15 Moreover, the decrease in mortality is similar among those with and without ST elevation on the entry electrocardiogram.16
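The second reason above, low power in a subgroup of about 600 patients, can be illustrated with a standard normal-approximation power calculation for comparing two proportions. This is a sketch under stated assumptions: the 10% control mortality is taken from the Figure 1 caption, the subgroup is assumed to split evenly into two arms of about 300, and the subgroup is assumed to share the overall event rate. The commentary does not give the subgroup's actual event rates, so the output is not expected to reproduce the quoted figure exactly.

```python
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z test comparing two proportions."""
    norm = NormalDist()
    se = (p_control * (1 - p_control) / n_per_arm
          + p_treated * (1 - p_treated) / n_per_arm) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of a significant result in the direction of the true difference.
    return norm.cdf(abs(p_control - p_treated) / se - z_crit)


CONTROL_MORTALITY = 0.10                      # "about 10%" in the Figure 1 caption
TREATED_MORTALITY = CONTROL_MORTALITY * 0.75  # a 25% relative decrease

for n_per_arm in (300, 1000, 2000):
    power = power_two_proportions(CONTROL_MORTALITY, TREATED_MORTALITY, n_per_arm)
    print(f"{n_per_arm:>5} patients per arm: power = {power:.2f}")
```

With 300 patients per arm the computed power is well under 25%, the same order as the roughly 15% quoted in the text, and it rises steeply as the arms grow toward the size of the whole trial. A non-significant result in a subgroup this small therefore says little about whether the treatment works in it.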

If we use all available information and recognize the limitations of the sort of subgroup analysis presented by Gheorghiade et al,11 we conclude that β blockers are likely to benefit patients with non-Q-wave MI, as well as those with Q-wave MI. The size of the benefit for non-Q-wave MI is less clear, but is likely to be closer to the 25% decrease in mortality and reinfarction that is observed in an overview of all trials. Therefore, patients with non-Q-wave MI should be considered for prophylactic therapy with β blockers.

REFERENCES
1. Friedman L, Simon R, Verter J, Wittes J, Wittes R. Proceedings of the workshop on evaluation of therapy. Stat Med 1984;3:307-475.
2. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med 1984;3:409-420.
3. Yusuf S, Wittes J, Friedman L. Overview of results of randomized clinical trials in heart disease I: treatments following myocardial infarction. JAMA 1988;260:2088-2093.
4. Yusuf S, Wittes J, Friedman L. Overview of results of randomized clinical trials in heart disease II: unstable angina, heart failure, primary prevention with aspirin and risk factor modification. JAMA 1988;260:2259-2263.
5. Andersen MP, Bechsgaard P, Frederiksen J, Hansen DA, Jurgensen HJ, Nielsen B, Pedersen F, Pedersen-Bjergaard O, Rasmussen SL. Effect of alprenolol on mortality among patients with definite or suspected acute myocardial infarction: preliminary results. Lancet 1979;2:865-868.
6. Multicenter International Study. Supplementary report: reduction in mortality after myocardial infarction with long-term beta-adrenoceptor blockade. Br Med J 1977;2:419-421.
7. ISIS-2 Collaborative Group. Randomized trial of IV streptokinase, oral aspirin, both or neither among 17,187 cases of suspected acute myocardial infarction. Lancet 1988;2:349-360.
8. Beta-blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction. I. Mortality results. JAMA 1982;247:1707-1714.
9. Furberg CD, Byington RP. What do subgroup analyses reveal about differential response to beta-blocker therapy? The Beta-Blocker Heart Attack Trial Experience. Circulation 1983;(suppl I)67:98-101.
10. Davis CE, Leffingwell DP. Empirical Bayes estimates of subgroup effects in clinical trials. Controlled Clin Trials 1990;11:37-42.
11. Gheorghiade M, Schultz L, Tilley B, Kao W, Goldstein S. Effects of propranolol in non-Q-wave acute myocardial infarction in the Beta-Blocker Heart Attack Trial. Am J Cardiol 1990;66.
12. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade during and after myocardial infarction: an overview of the randomized trials. Prog Cardiovasc Dis 1985;27:335-371.
13. Pedersen TR. The Norwegian multicenter study of timolol after myocardial infarction. Circulation 1983;(suppl I)67:49-53.
14. Hjalmarson A, Herlitz J, Holmberg S, Ryden L, Swedberg K, Vedin A, Waagstein F, Waldenstrom A, Waldenstrom J, Wedel H, Wilhelmsen L, Wilhelmsson C. The Goteborg metoprolol trial. Effects on mortality and morbidity in acute myocardial infarction. Circulation 1983;(suppl I)67:26-31.
15. Yusuf S, Sleight P, Rossi P, Ramsdale D, Peto R, Furze L, Sterry H, Pearson M, Motwani R, Parish S, Gray R, Bennett D, Bray C. Reduction in infarct size, arrhythmias and chest pain by early intravenous beta-blockade in suspected acute myocardial infarction. Circulation 1983;(suppl I)67:[page range illegible].
16. First International Study of Infarct Survival Collaborative Group. Randomized trial of intravenous atenolol among 16,027 cases of suspected acute myocardial infarction: ISIS-1. Lancet 1986;2:57-66.
