
STATISTICS IN MEDICINE, VOL. 8, 415-425 (1989)

SURROGATE ENDPOINTS IN CLINICAL TRIALS: CARDIOVASCULAR DISEASES

JANET WITTES*, EDWARD LAKATOS
Biostatistics Research Branch, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, Federal Building, Room 2A11, Bethesda MD 20892, U.S.A.

AND

JEFFREY PROBSTFIELD
Clinical Trials Branch, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, Federal Building, Room XOS, Bethesda MD 20892, U.S.A.

SUMMARY A surrogate endpoint in a cardiovascular clinical trial is defined as an endpoint measured in lieu of some other so-called ‘true’ endpoint. A surrogate is especially useful if it is easily measured and highly correlated with the true endpoint. Often the ‘true’ endpoint is one with clinical importance to the patient, for example, mortality or a major clinical outcome, while a surrogate is one biologically closer to the process of disease, for example, ejection fraction. Use of the surrogate can often lead to dramatic reductions in sample size and much shorter studies than use of the true endpoint. We discuss several problems common in trials with surrogate endpoints. Most important is the effect of missing data, especially in the face of informative censoring. Possible solutions are the assignment of scores or formal penalties to missing data.

KEY WORDS Surrogate endpoints Endpoints Clinical trials Missing data

1. INTRODUCTION

Arguing for a surrogate endpoint often entails a hint of disreputability, for the very word ‘surrogate’ evokes images of distorted motherhood. One picture dates from Science in 1959: the pathetic face of Harlow’s infant rhesus monkey peering out at us as its little hands clutch forlornly its tattered towel.1 The article reported that an infant monkey can be comforted by a towel mother, a cuddly ‘surrogate’, but not by a wire-mesh construction even if the wire surrogate is the only source of food. The lesson was complicated, but one thing seemed clear. Some surrogates are better than others, and which is superior depends on what one measures. More recently, the term ‘surrogate’ has brought Mary Beth Whitehead2 to mind, and the definitions of ‘real’ and ‘surrogate’ get hopelessly intertwined.

In clinical trials, although a true endpoint generally measures clinical benefit while a surrogate measures process of disease, sometimes the distinction between the true and the surrogate becomes blurred. This paper addresses several aspects of surrogate endpoints. First, in Section 2, we attempt a definition. In Section 3 we deal with the use of surrogates in cardiovascular clinical trials, while in Section 4 we discuss the relationship between surrogate and true endpoints. The remainder of the paper describes statistical and inferential problems with surrogate endpoints. Section 5, which is an overview, points out that many clinical trials that use surrogate endpoints suffer from missing data. We amplify the discussion of missing endpoints in Section 6. In Section 7 we mention other statistical problems with surrogate endpoints. Finally, Section 8 is a short conclusion.

* Address correspondence to Janet Wittes

0277-6715/89/040415-11$05.50 © 1989 by John Wiley & Sons, Ltd.

Received December 1987 Revised September 1988

2. WHAT IS A SURROGATE ENDPOINT?

We define a surrogate endpoint as one that we elect to measure as a substitute for some other variable. The word itself comes from the Latin ‘to elect, or ask, in place of’. Some have expressed the view that all-cause mortality is the only ‘true’ endpoint in a clinical setting. Cause-specific mortality and major morbid events are then surrogates used as stratagems to increase specificity and hence statistical power. Similarly, this view holds that all continuous, and many discrete, measures of outcome are ‘surrogates’ on the grounds that the only truly important event is death, and untoward changes in a measured variable merely indicate the inevitable drift towards death.

Other investigators consider not only death but any event or symptom that brings a patient to a doctor as a ‘true’ endpoint; measured variables, like blood pressure or arterial patency, that the patient does not perceive as symptoms are merely surrogates. Another view is that the ‘true’ endpoint for an intervention is usually some cause-specific mortality or morbidity. Total mortality then becomes a surrogate when competing risks prevent the disentanglement of cause-specific deaths from all-cause mortality.

This paper takes a more general, non-prescriptive view: if we design an experiment to show a treatment effect in a variable Y, but instead we measure X, a variable pathophysiologically related to Y, then X is a surrogate for Y. For example, if our interest is to test whether ingestion of calcium alters systolic blood pressure (SBP), then SBP is the variable of interest. On the other hand, if our interest is to know whether ingesting calcium is likely to prevent strokes, then stroke is the ‘true’ outcome variable. In this case, an experiment that measures SBP instead of stroke is using SBP as a surrogate. Another possibility is that our interest is to learn whether supplementing diet with calcium prolongs life. Then, total mortality is the true endpoint and stroke is a surrogate used to assure high power. Thus, one study’s endpoint may be another’s surrogate; the context of the therapeutic intervention determines whether the variable under examination is the variable of interest or simply a surrogate.

3. WHY USE A SURROGATE?

In studies of cardiovascular disease, surrogate endpoints have appeal both operationally and scientifically. Friedman, Furberg, and DeMets3 (p. 17) discuss some advantages and disadvantages of surrogates. Operationally, surrogates are useful for several reasons. First, the length of time required for follow-up in a trial that uses a surrogate endpoint is often much shorter than in a trial that uses a true endpoint. Second, in some cases a surrogate may be easier to measure than the true endpoint. For example, if the true endpoint is the size of the infarction as measured by myocardial scintigraphy, a measure of enzymes is an easily obtainable surrogate. Third, the prevalence of certain rare diseases may be so low that a study of the ‘true’ endpoint is impossible.

Perhaps the most important practical advantage of a surrogate is that the sample size in a trial with a surrogate may be considerably lower than in a trial of the true endpoint. In many clinical trials of cardiovascular disease, very few people die. Consider the treatment of a myocardial infarction (MI). About 80 per cent of people who have an MI and reach the hospital survive the first 10 days.4 The one-year survival rate for those dismissed from hospital post-infarction is about 90 per cent.5 Such high survival rates clearly benefit the patient, but they lead to trials with very large sample sizes when mortality is the endpoint. Currently, several teams of researchers are involved in trials of thrombolytic agents. The rationale for the use of such drugs is that if an agent can dissolve the clot occluding the blood vessel, blood will resume its flow, less heart muscle will die, and the heart will therefore pump more efficiently. The pooled estimate of the efficacy of such agents is that their use can reduce short-term mortality by about 20 per cent.6 GISSI7 was a trial of about 12,000 MI patients, half randomized to therapy and half to control, that studied whether thrombolysis reduced mortality in the period of hospitalization immediately post-MI.

Table I. Typical cardiovascular trials with surrogate and true endpoints: a comparison of sample sizes and follow-up periods

                        True endpoint trial         Surrogate endpoint trial
Event                   Endpoint  Size    Length    Endpoint                   Size  Length
Myocardial infarction   Death     4000    5 yrs     Coronary artery patency    200   90 min
Myocardial infarction   Death     4000    5 yrs     Ejection fraction          30    2-4 wks
Stroke                  Stroke    25000   5 yrs     Diastolic blood pressure   200   1-2 yrs

We might study the efficacy of a thrombolytic agent by measuring EF instead of mortality, arguing as follows. A measure of the effectiveness of the heart in pumping blood is the so-called ‘ejection fraction’ (EF). Healthy people have ejection fractions over 50 per cent. For people with EFs between 15 per cent and 35 per cent, a five percentage point improvement is considered prognostically important.8 After an MI, one’s EF may fall; recovery is thought to be accompanied by a rise in EF. Therefore, assuming that the standard deviation of EF is roughly 12, the sample size required to detect a five point difference between treatment and control is only 60 patients.
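The arithmetic behind a calculation of this kind can be sketched with the usual normal-approximation formula for comparing two means. The sketch is illustrative only: the function name `n_per_arm`, the two-sided significance level of 0.05 and the power of 0.80 are assumptions, not values stated in the text, and the result is sensitive to them, so it should not be expected to reproduce the figure quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided, two-sample comparison of means.

    Normal approximation:
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sigma / delta)^2
    alpha and power are assumed design values, not taken from the paper.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2)


# EF example: a five-point difference with a standard deviation of about 12.
print(n_per_arm(delta=5, sigma=12))
# Halving the detectable difference roughly quadruples the requirement.
print(n_per_arm(delta=2.5, sigma=12))
```

The requirement grows with the square of sigma/delta, which is why a continuous surrogate with a modest standard deviation can need far fewer patients than a rare binary event such as death.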

Similarly dramatic differences in sample size obtain with diseases related to blood pressure. A currently unanswered question is the efficacy of antihypertensive treatment for people with diastolic blood pressure in the 85-90 mmHg range. A study designed to compare the difference in stroke rate among such people would require about 25,000 subjects followed for about five years; one based on maintenance of a blood pressure drop would need only 200 subjects followed for a year or two.

Table I compares the required sample size and time of follow-up for trials with true endpoints to trials of surrogates for three typical cardiovascular events.

The operational advantages of a surrogate may provide a strong pragmatic rationale in its favour; intellectually more cogent arguments often stem from the biological relevance and the temporal immediacy of the surrogate. For example, in trials of interventions administered post-MI, 10-day mortality as a surrogate for long-term survival has temporal immediacy. Presumably, nearly all deaths within the first few days after an MI are related to the infarction. As the follow-up after intervention becomes long, deaths from processes quite remote from the cardiovascular system begin to account for an increasingly high percentage of the total mortality. A surrogate endpoint measured close in time to the intervention is quite uncontaminated by extraneous events.

The physiological closeness of an intervention to the studied endpoint is intellectually satisfying. Thus, if thrombolysis liquefies clots, and if clot-free arteries are better for the patient, why not simply measure whether a clot remains in the artery after treatment? If the clot is gone, the argument goes, the therapy must be beneficial. Further, one can learn about the condition of an endpoint assessed soon after treatment, much sooner than one can measure time until death. Similarly, EF is a more direct measure of cardiovascular function than is mortality. As physicians often ask, why should death from a totally unrelated cause ‘count against’ the therapy?

A similar argument applies to hypertension. Since low blood pressure is prognostically favourable, if a treatment under question reduces blood pressure, why not simply measure blood pressure? Why not measure blood cholesterol to assess the efficacy of a cholesterol-lowering regimen? Or arrhythmias to see whether an antiarrhythmic agent works? If these seem a bit circular, why not measure arterial patency for a cholesterol-lowering trial? The answer is of course obvious, for the problem of causality arises. The fact that low cholesterol (or open arteries, or a high ejection fraction) is prognostically favourable does not necessarily imply that the lowering of cholesterol (or the opening of arteries, or the raising of ejection fraction) will prolong life. We must distinguish the use of endpoints to test mechanism from the use of endpoints to test the clinical benefits of therapy.

A surrogate is appropriate when a test of mechanism can replace a test for clinical benefit. The persuasiveness of the surrogate, then, depends in large measure on the state of understanding of the disease under investigation and on the availability of studies indicating that a change in the value of the surrogate leads to an alteration in the true endpoint. For example, the Hypertension Detection and Follow-up Program9 randomized hypertensives to treatment or ‘usual care’. Although the protocol called for the use of specific antihypertension medications, the interpretation of the data was not simply that the tested drugs lower mortality from stroke, but that lowering elevated diastolic blood pressure reduces mortality.

4. THE RELATIONSHIP BETWEEN A TRUE ENDPOINT AND ITS SURROGATES

A variable becomes a surrogate by one or more of several paths. The route from epidemiology to clinical trials provides one paradigm. The logic is as follows. Epidemiologic data show that a particular variable has a strong relationship with risk. Laboratory or other biological insights provide a rationale for the observed relationship. Perhaps one fits a logistic or Cox regression to some continuous parameter with mortality, either all-cause or cause-specific, as the outcome. A dose-response relationship between the variable, now dubbed ‘risk factor’, and the risk of mortality emerges. Further epidemiologic studies may follow; these new studies may possess more of a longitudinal flavour. When the association between risk factor and risk seems clear, the intellectual stage is set for clinical trials to test whether intervention with the risk factor reduces risk. Such trials, which may be very large, seek to demonstrate the effect of manipulation of the risk factor on the clinical course of disease. If the trials demonstrate that alteration in the level of risk factor indeed reduces mortality or major morbidity, the variable may serve as a surrogate in future trials. Further trials of other interventions or on different populations may use the surrogate as the endpoint.

One fallacy in this logical process is that a new intervention may reduce the risk factor by some pathway irrelevant to the development of morbid events. Suppose the surrogate is a marker for a variety of processes, only one of which confers the risk. Then a therapeutic manoeuvre that alters the risk factor by manipulation of a process not related to the risk will appear effective in a surrogate endpoint trial, but will in fact not be effective in practice. For example, although EF is a validated prognostic variable following an MI,8 its prognostic value in patients with cardiomyopathy is unknown. Thus a treatment that increases EF in cardiomyopathy may not necessarily confer any benefit to the patient.10 An even more disturbing question relates to the clinical relevance of a second MI in the hours or days immediately after initiation of thrombolytic therapy. Is the second MI a new event or is it the late manifestation of the present MI interrupted by therapy? In the former case, it is probably indicative of poor prognosis; in the latter, it may not be relevant. Thus, even a ‘proven’ surrogate may not predict change in risk accurately.

Another route for the choice of a surrogate is statistical. One can consider a variable to be a surrogate if it bears a clear statistical relationship with the true endpoint. In some situations, one might demand a rigorous mathematical relationship between the surrogate and the true. In any case, a convincing surrogate should have both biological and clinical relevance; statistical relationship alone is not sufficient.

5. MISSING VALUES IN TRIALS WITH SURROGATE ENDPOINTS

The primary statistical problem with trials using surrogate endpoints is the high likelihood of missing, or censored, data. We use the term ‘censored’ to refer to data censored by time and other processes. Non-informative censoring is a censoring mechanism that operates independently of the variable one is trying to measure. Informative censoring, on the other hand, is a process of censoring that depends on the variable of interest. If the treatment is effective, the degree of informative censoring differs by treatment. Surrogate endpoints often suffer from informative censoring.

Well-run, adequately funded clinical trials can usually ascertain the vital status of over 99 per cent of the participants. Therefore one can measure all-cause mortality without bias. Cause-specific mortality is a little more difficult, because to assign a cause of death, the investigators must have some medical records and must assign diagnoses independently of the knowledge of treatment. A still more difficult endpoint to assess without bias is a major morbid event, for example, stroke or MI. The bias in all these, however, pales compared with the potential for bias caused by missing data in trials with surrogate endpoints. Consider, for example, a few trials that have used surrogate measures. Blankenhorn et al.,11 who investigated the effect of lipid lowering on arterial patency among coronary artery bypass patients, missed 15 per cent of the follow-up arteriograms. Most of these missing data resulted from patients’ failure to return for follow-up visits. In ISAM,12 a study of intravenous streptokinase given post-MI, the investigators managed to measure EF among fewer than 50 per cent of the patients entered.

Although many authors have proposed methods for dealing with missing data,13 most have focused on situations in which the ‘missingness’ occurs at random. In clinical trials, the chance of having a missing endpoint may well relate to the treatment variable of most interest. How should one handle missing data? We mention three possible methods for dealing with missing endpoints in clinical trials:

1. Analyse the data available and ignore the fact that some observations are missing. This, the most common approach, has large potential bias.

2. Use a formal statistical method to attempt to reduce the bias caused by informative censoring. The simplest approach is to assign a score to the missing value. More complicated methods are under investigation, but as mentioned above, have little practical use if a large proportion of data is missing.

3. Use an informal rule to penalize a study with missing data. Section 6 proposes a possible strategy.

One method to deal with missing data is to assign scores. A score is the result of applying a rule to replace the missing values. One must specify the rule before analysing the data. In this section, we discuss a simple scoring system relevant when the missing observations result from a reason considered part of the ordering of the outcome. For example, if the outcome measure is EF, and the subject fails to appear because he has died, then an EF of 0 is one possible scoring consistent with the above strategy because the EF of a dead person is, in fact, zero. If the outcome is ‘change in EF’, then we can assign the deaths a score either of 0 or the lowest rank. In both cases, one should analyse such data with tests based on ranks. The resulting distributions tend to be highly skewed, and rank tests are less strongly affected by extreme scores than are parametric tests.

Introduction of the scoring system leads to additional problems. How do we score transplants? A transplant is not quite so extreme as a death, but it does represent an abject failure of the treatment under test if the outcome is EF. Should we assign it zero also? A next-to-worst outcome? What about the person who regains health, returns to work and full activity, but refuses to have an angiogram? Should we delete this subject or should we infer from his regained health that his EF is nearly normal? Should we attempt a partial ordering of the data, that is, employ EF when available, but interweave functional status when that is known? Does such a tack force us to randomization tests on partially ordered data? The uncertainty about how to score the missing data and the ambiguities introduced thereby become especially serious in the presence of a large number of missing observations.

EF is relatively easy to score compared with other endpoints affected by common treatments. For example, when blood pressure rises substantially, physicians often treat until it declines. This leads to mixed endpoints. For example, we might define ‘disease’ as ‘diastolic blood pressure ≥ 90 or patient on medication’. Such a ploy does not always work well, because sometimes physicians medicate certain patients whose usual blood pressure does not exceed 90 mmHg.
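The EF scoring rule described above, in which a death is scored as zero and the arms are then compared with a rank test, can be sketched as follows. The data are hypothetical, the helper names (`midranks`, `score`, `rank_sum_z`) are introduced here for illustration, and the Wilcoxon rank-sum statistic with a normal approximation is one possible choice of rank test rather than a method the text prescribes.

```python
from math import sqrt


def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def score(ef, died):
    """Pre-specified rule (an assumption here): a death is scored as EF = 0."""
    return 0.0 if died else float(ef)


def rank_sum_z(treated, control):
    """Wilcoxon rank-sum z for treated vs control, with tie-corrected variance."""
    combined = treated + control
    ranks = midranks(combined)
    n1, n2, n = len(treated), len(control), len(combined)
    w = sum(ranks[:n1])  # rank sum of the treated arm
    expected = n1 * (n + 1) / 2
    tie_sizes = [combined.count(v) for v in set(combined)]
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    return (w - expected) / sqrt(variance)


# Hypothetical follow-up data: (EF in per cent, died before measurement).
treated_raw = [(35, False), (42, False), (28, False), (None, True), (50, False)]
control_raw = [(30, False), (None, True), (None, True), (25, False), (38, False)]

treated = [score(ef, died) for ef, died in treated_raw]
control = [score(ef, died) for ef, died in control_raw]
print(round(rank_sum_z(treated, control), 2))
```

Because all deaths share the score 0 they are tied at the lowest rank, so the variance of the rank sum needs the tie correction; with many deaths an exact or permutation version of the test may be preferable to the normal approximation used in this sketch.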

Scoring is not without problems, for while biologically plausible values seem naturally appealing, any assignment of a score is the analyst’s formalized method of guessing what the value would have been had it been observable. To the extent that this guess misrepresents the expectation of the unobservable value, it introduces some arbitrariness into the analysis. Thus, the unbiasedness achieved through including all randomized subjects in the analysis is replaced by the bias inherent in the analyst’s assignment of score. For example, one can argue that no EF of zero has ever been measured on a live person; thus the assigned 0 cannot represent the true value of a missing, but living, person. Sometimes a preferable approach is to analyse the observed data but then discuss the sensitivity of the conclusions to reasonable assumptions about the missing values (see Section 6).

Furthermore, lurking behind every scoring system for missing values are several implicit assumptions. Let $X_{[r]}$ be the $r$th largest ejection fraction and $Y_{[s]}$ the $s$th longest survival time in the combined sample. A superscript $T$ or $C$ denotes whether the observation comes from the treated or control group. Then $X^{C}_{[i]}$ and $X^{T}_{[i+1]}$ denote the $i$th and $(i+1)$th largest EF; the $i$th is from a patient assigned control, the $(i+1)$th from one assigned treatment. Similarly, let $Y^{T}_{[j]}$ and $Y^{C}_{[j+1]}$ denote the order statistics that correspond with two times of death. Suppose the ranking is:

$$\ldots,\ Y^{T}_{[j]},\ Y^{C}_{[j+1]},\ \ldots,\ X^{C}_{[i]},\ X^{T}_{[i+1]},\ \ldots$$

A change in exactly two points, so that $Y^{T}_{[j]}$ becomes $Y'^{T}_{[j]} = Y^{C}_{[j+1]} + \varepsilon$ and $X^{C}_{[i]}$ becomes $X'^{C}_{[i]} = X^{T}_{[i+1]} + \varepsilon$, achieves the new ordering:

$$\ldots,\ Y^{C}_{[j+1]},\ Y'^{T}_{[j]},\ \ldots,\ X^{T}_{[i+1]},\ X'^{C}_{[i]},\ \ldots$$

This has the same rank statistic as the original ordering. Thus, an increase of $(Y^{C}_{[j+1]} - Y^{T}_{[j]} + \varepsilon)$ in the survival time $Y^{T}_{[j]}$ leads to a change in rank sum equal but opposite to the change caused by an increase of $(X^{T}_{[i+1]} - X^{C}_{[i]} + \varepsilon)$ in the EF $X^{C}_{[i]}$. The system of scoring has the hidden assumption that, on average, additional survival of one day is equivalent to an increase of a fixed number of EF units. Unlike similar statements arising from conclusions in the analysis of epidemiologic data, this relationship is a function of the arbitrarily chosen score so that the inferences drawn are also arbitrary. One of the major objections to scoring systems is that they require an implicit, often untestable assumption that equates differences in survival with differences in the surrogate measure.

Table II. Number and percentage of patients experiencing various endpoints in a study of the effect of aspirin in hypertension*

                                   Aspirin (N = 610)   Placebo (N = 648)   z-statistic for difference
Event                              n       %           n       %           in event rate
DBP ≥ 90†                          194     31.8        164     25.3         2.6
Medication‡                        176     28.9        171     26.4         1.0
DBP ≥ 90 or medication             272     44.6        264     40.7         1.4
Death                              30      4.9         45      7.4         -1.9
Death or medication                191     31.1        201     31.0         0.1
Death, DBP ≥ 90, or medication     293     48.0        293     45.2         1.0

* The data are from the Coronary Drug Project Aspirin Study.14 The study group consists of the patients randomized to aspirin or placebo who were not on antihypertensive medication at baseline and who had at least one follow-up visit.
† ‘DBP ≥ 90’ means that on at least one follow-up visit a measured diastolic blood pressure was ≥ 90.
‡ ‘Medication’ means that during at least one follow-up visit, the patient was on antihypertensive medication.

Problems of scoring are not academic. Consider the data of Table II from the Coronary Drug Project Aspirin Study (CDPA).14 The question is whether aspirin leads to increased blood pressure; the data pertain to patients who had an MI and who were randomized to a daily dose of aspirin or placebo to test whether aspirin reduces mortality. Among the patients randomized into the study, we consider only those with at least one follow-up visit who were not on antihypertensive medications at baseline. What should we choose as the endpoint? Ideally, we would like to know what proportion of patients had a diastolic blood pressure above 90. Two events, however, can prevent the meaningful measurement of blood pressure. First, we may find an artificially low blood pressure because a physician assigned antihypertensive medication. If the prescribing physician correctly diagnosed hypertension, then those on medications were in fact hypertensive. Second, the patient may die. Some, but not all, of the deaths may be related to hypertension. The table is disturbing, because depending on the definition of the outcome variable, the aspirin appears as either dangerous or beneficial. The message seems to be that the group given aspirin had a lower mortality rate, but the information about blood pressure seems murky.
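How much the apparent effect on blood pressure depends on the definition of the endpoint can be checked directly from the counts in Table II. The sketch below recomputes the z-statistic for the blood-pressure definitions; the pooled-variance form of the two-proportion z-test is an assumption, since the text does not say which formula produced the tabulated values, and the function name is introduced here for illustration.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for a difference in proportions, using the pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


N_ASPIRIN, N_PLACEBO = 610, 648

# Event counts (aspirin, placebo) for the blood-pressure definitions in Table II.
definitions = {
    "DBP >= 90": (194, 164),
    "Medication": (176, 171),
    "DBP >= 90 or medication": (272, 264),
    "Death, DBP >= 90, or medication": (293, 293),
}

for name, (x_aspirin, x_placebo) in definitions.items():
    z = two_proportion_z(x_aspirin, N_ASPIRIN, x_placebo, N_PLACEBO)
    print(f"{name:34s} z = {z:4.1f}")
```

A positive z here means a higher event rate on aspirin. The same data give a statistic beyond 2.5 for measured pressure alone and about 1 once medication and death are folded into the endpoint.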

6. A PROPOSAL TO HANDLE MISSING DATA

In our experience missing endpoints occur disproportionately among patients most and least sick. We therefore regard the production and handling of missing data as a game played by Nature and statisticians against each other. Nature attempts to enhance the observed treatment effect, while the statisticians desire to provide an unbiased estimate of the treatment effect. To protect against Nature, the statisticians aim to underestimate the effect in order to produce a test with significance level no higher than the preselected α-level. To win such a game, Nature would cause either the sickest of the treated subjects or the least sick of the placebo to have missing data. For their part, the statisticians must establish a data analytic rule that counteracts any tendency of Nature to maximize the apparent effect of treatment.

We propose the use of strategies that incorporate penalties for the missing data. One possibility is to assign to the missing cohort in each treatment group the observed rate in the opposite treatment group. A less severe penalty is to assign to all missing values the average of the observed rates from both treatment arms. A third strategy, which we do not endorse, is the so-called ‘worst-case’ analysis. This method assigns the most extremely pessimistic value to the missing data. For example, if the outcome is binary, one would assign ‘success’ to all missing values in the control arm and ‘failure’ in the treatment arm. In our view, such a rule produces unnecessarily pessimistic estimates.

Table III. Data from a hypothetical trial of antiplatelet agents on arterial patency

                                                     Placebo   Treatment   Estimated treatment effect
Sample size (N)                                      600       600         -
Patients with endpoints (n)                          510       510         -
Patients with at least one occluded artery (x)       255       204         -
Patients missing endpoints (m)                       90        90          -
Event rates*
(a) Observed proportion of patients with at
    least one occluded artery (x/n)                  0.50      0.40        -20.0%
(b) Penalized rate, assuming missing have
    (i) average rate                                 0.49      0.41        -17.2%
    (ii) opposite rate                               0.48      0.42        -12.5%
(c) Worst case analysis                              0.42      0.49        +16.7%

* Estimated occlusion rates:
(a) Observed rate: $x/n$
(b) Penalized rate:
    (i) $\hat{\theta}_p = [x_p + 0.45(90)]/N$, $\hat{\theta}_t = [x_t + 0.45(90)]/N$
    (ii) $\hat{\theta}_p = [x_p + 0.4(90)]/N$, $\hat{\theta}_t = [x_t + 0.5(90)]/N$
(c) Worst case: $\hat{\theta}_p = x_p/N$, $\hat{\theta}_t = (x_t + m)/N$

Table IV. Calculated power of test of antiplatelet agent under assumption of no bias in the missing values*, total sample size = 1200, treatment effect = 20 per cent

Percentage   Power omitting   Power if missing values     Power if missing values are
missing      missing data     are assigned average rate   assigned opposite treatment rate
0            88               88                          88
10           85               81                          77
15           84               77                          71

* If there is bias in the censoring, these calculated values overestimate the true power.
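The penalty rules described above amount to simple arithmetic on the observed counts. The following sketch applies them to the hypothetical counts of Table III; the function name `penalized_rates` and the rule labels are introduced here for illustration only.

```python
def penalized_rates(x_p, x_t, n_p, n_t, N, rule):
    """Estimated event rates (placebo, treatment) with a penalty for missing data.

    x_*: observed events, n_*: patients with an endpoint, N: randomized per arm.
    """
    m_p, m_t = N - n_p, N - n_t          # patients missing the endpoint
    obs_p, obs_t = x_p / n_p, x_t / n_t  # observed rates among those measured
    if rule == "observed":
        return obs_p, obs_t
    if rule == "average":                # both arms get the mean observed rate
        fill_p = fill_t = (obs_p + obs_t) / 2
    elif rule == "opposite":             # each arm gets the other arm's rate
        fill_p, fill_t = obs_t, obs_p
    elif rule == "worst":                # no events in placebo, all events in treatment
        fill_p, fill_t = 0.0, 1.0
    else:
        raise ValueError(f"unknown rule: {rule}")
    return (x_p + fill_p * m_p) / N, (x_t + fill_t * m_t) / N


# Counts from the hypothetical trial in Table III (event = occluded artery).
for rule in ("observed", "average", "opposite", "worst"):
    placebo, treatment = penalized_rates(255, 204, 510, 510, 600, rule)
    print(f"{rule:9s} placebo = {placebo:.4f}  treatment = {treatment:.4f}")
```

The unrounded rates printed here round to the two-decimal entries of Table III, and the gap between the arms shrinks from the observed rule to the average rule to the opposite rule before reversing under the worst case.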

All these rules incorporating penalties produce tests that are more conservative than the usual approaches to missing values. The standard methods in the literature impute missing values under the assumption either that censoring is random, or, at least, that censoring occurs as the result of a process removed from the deliberate control of the investigator. Having established a rule for handling missing data, we should calculate our sample size or power using the expected treatment effect, the expected proportion missing, and the rule. For example, suppose we aim to study 1200 people randomized into two groups, a placebo and an antiplatelet drug, to investigate arterial patency measured five years after coronary artery bypass surgery. Table III summarizes some putative data with 15 per cent of the observations missing. In the presence of informative censoring, use of the observed rates will produce biased test statistics. Table IV displays the power if the data were missing at random and the penalties applied. Obviously, under random censoring the penalties lead to a decrease in power, but they provide protection against the possibility that the probability of being missing depends upon treatment.
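The way a penalty rule lowers power under random censoring can be sketched with a normal approximation. The code below is a rough illustration, not the calculation behind Table IV: the two-sided 5 per cent test, the control rate of 0.50, and the treatment of penalized rates as binomial proportions on all randomized patients are assumptions made here, so the numbers it prints need not match the table.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)


def power_under_rule(p_c, p_t, n_randomized, frac_missing, rule):
    """Power when a fraction of each arm is missing at random (hypothetical helper)."""
    f = frac_missing
    if rule == "omit":
        # analyse only the patients with observed endpoints
        return power_two_proportions(p_c, p_t, n_randomized * (1 - f))
    if rule == "average":
        fill_c = fill_t = (p_c + p_t) / 2
    elif rule == "opposite":
        fill_c, fill_t = p_t, p_c
    else:
        raise ValueError(f"unknown rule: {rule}")
    # expected rates after the missing fraction is filled in by the rule
    q_c = (1 - f) * p_c + f * fill_c
    q_t = (1 - f) * p_t + f * fill_t
    return power_two_proportions(q_c, q_t, n_randomized)


for f in (0.0, 0.10, 0.15):
    row = [power_under_rule(0.50, 0.40, 600, f, r) for r in ("omit", "average", "opposite")]
    print(f"{100 * f:4.0f}% missing: " + "  ".join(f"{100 * p:5.1f}" for p in row))
```

Under these assumptions the three rules agree when nothing is missing, and the loss of power grows with both the proportion missing and the severity of the penalty.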

Another approach is to use penalty rules as part of a sensitivity analysis. If the extent of the missing data is large enough to lead to conflicting results under reasonable penalties, we are uncomfortable drawing conclusions.

7. OTHER STATISTICAL PROBLEMS WITH SURROGATE ENDPOINTS

Thus far, we have viewed surrogate endpoints from the vantage point of the end of the study. In cardiovascular disease, surrogate endpoints often involve problems of screening at the beginning of the trial. Studies that enter patients on the basis of an abnormal value of the very variable that constitutes the surrogate present special problems. For example, if we enter only patients with low EF into a trial for which the outcome is either EF or change in EF, the surrogate would be an observation from a truncated distribution. Typically, unless the study assesses its baseline with at least one measure not used for screening, such truncation leads to regression towards the mean. Therefore, the observed mean change in a variable is not an unbiased estimate of the true change. Regression to the mean exacerbates the bias in the estimated treatment effect caused by informative censoring [15].
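A small simulation makes the effect of truncation concrete. In the sketch below every number (a true EF distributed around 50 with standard deviation 10, measurement error with standard deviation 5, an entry cutoff of 40) is hypothetical, and no patient's true EF changes during follow-up. The apparent change is computed twice: against the screening measurement, and against a second baseline measurement that played no part in screening.

```python
import random
from statistics import mean


def simulate_regression_to_mean(n=100_000, cutoff=40.0, seed=1):
    """Mean apparent EF change among enrolled patients when nothing truly changes."""
    rng = random.Random(seed)
    change_vs_screen, change_vs_second = [], []
    for _ in range(n):
        true_ef = rng.gauss(50.0, 10.0)
        screen = true_ef + rng.gauss(0.0, 5.0)           # measurement used for entry
        if screen >= cutoff:
            continue                                      # not enrolled
        second_baseline = true_ef + rng.gauss(0.0, 5.0)   # not used for screening
        follow_up = true_ef + rng.gauss(0.0, 5.0)         # true EF is unchanged
        change_vs_screen.append(follow_up - screen)
        change_vs_second.append(follow_up - second_baseline)
    return mean(change_vs_screen), mean(change_vs_second)


biased, unbiased = simulate_regression_to_mean()
print(f"mean change from screening value:  {biased:+.2f}")
print(f"mean change from second baseline:  {unbiased:+.2f}")
```

The change measured from the screening value is clearly positive even though nothing has happened, while the change measured from the independent baseline stays near zero.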

We briefly mention heterogeneity of variance as one more problem common to many surrogate endpoints. Data we have seen on cholesterol and blood pressure have convinced us that some people have much more variability in their measures than others. If this is so, screening will accept people differentially depending on their intrinsic variances.
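The differential acceptance can be seen in a one-line probability calculation. The sketch below assumes two hypothetical people with the same true mean diastolic pressure (85 mmHg) but different within-person standard deviations (5 and 10 mmHg), screened on a single reading against a cutoff of 95 mmHg; all of these values are invented for illustration.

```python
from statistics import NormalDist


def prob_pass_screen(true_mean, within_sd, cutoff):
    """Probability that a single measurement exceeds the screening cutoff."""
    return 1.0 - NormalDist(mu=true_mean, sigma=within_sd).cdf(cutoff)


stable = prob_pass_screen(true_mean=85.0, within_sd=5.0, cutoff=95.0)
variable = prob_pass_screen(true_mean=85.0, within_sd=10.0, cutoff=95.0)
print(f"stable person:   {stable:.3f}")
print(f"variable person: {variable:.3f}")
```

With a cutoff above the true mean, the more variable person is several times as likely to be accepted, so the enrolled cohort is enriched for people with large intrinsic variance.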

What is the practical approach statisticians should take? In designing a study with a surrogate, the statistician should examine critically the assumptions underlying the design parameters. The effects of truncation, informative censoring, and heterogeneity of variance can be large, and can lead to bias and imprecision in the estimates of the treatment effects. Thus, the naive, normal-theory sample sizes computed for surrogate endpoint trials may be much too small to detect a clinically relevant treatment effect. We have noted that some degree of screening often occurs even if not directly stated. One should scrutinize the entry criteria to see whether hidden among them is a screen that can lead to a truncated baseline distribution. Also, the statistician must be aware that the baseline characteristics of patients who enter a trial may not be constant over the course of the study. In particular, a shift tends to occur in trials in which the very ill prevalent cases enter earliest while the less ill incident cases enter later. Such a subtle temporal shift in baseline distributions affects the distribution of outcome, for it often induces greater variability than expected. Particularly in a small study, the actual power may be considerably lower than anticipated.

Another problem with the use of surrogates in many cardiovascular diseases is a consequence of their most important advantage: the required sample size is often too small to detect infrequent but major adverse effects of therapy. For example, a study designed to show whether thrombolytic therapy liquefies clots may be too small to detect an excess rate of intracranial haemorrhage.
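The gap between the two sample sizes is easy to quantify with the usual normal-approximation formula for comparing two proportions. In the sketch below the surrogate rates (0.50 versus 0.40) and the adverse-event rates (0.5 per cent versus 1.0 per cent) are hypothetical, as are the two-sided 5 per cent significance level and 80 per cent power.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to compare two proportions."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_sum ** 2 * variance / (p1 - p2) ** 2)


n_surrogate = n_per_arm(0.50, 0.40)    # common surrogate outcome
n_adverse = n_per_arm(0.005, 0.010)    # rare adverse event, rate doubled by therapy
print(f"per arm for the surrogate:     {n_surrogate}")
print(f"per arm for the adverse event: {n_adverse}")
```

Under these assumptions a trial sized for the surrogate enrolls roughly a tenth of the patients needed to detect a doubling of the rare adverse event.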


8. CONCLUSIONS

Surrogate endpoints play an important role in many cardiovascular clinical trials, frequently as secondary endpoints and sometimes as primary. Nonetheless, they sometimes present a number of problems both clinically and statistically. We must recognize their limitations during all aspects of the planning, conduct, and analysis of clinical trials that use them. In particular, we should understand the links between the surrogate and the true endpoints. When the surrogate is a continuous variable correlated with a screening variable, we should consider strategies to minimize regression to the mean. At the beginning of the trial, we must have a reasonable plan for analysis that includes rules to deal with missing data.

Surrogates often produce dramatic reductions in sample size and length of follow-up. Thus they may permit studies that otherwise would not be feasible. On the other hand, we should be suspicious of calculated sample sizes that appear too small or follow-up periods that appear too short, so that we can ensure that the goals of the trial are testable, the results generalizable, and the sample size large enough to provide insight into adverse effects of therapy. When used wisely, surrogate endpoints can aid our understanding of the biological processes underlying the disease and the mechanisms of therapy.

ACKNOWLEDGEMENTS

Intellectual parentage is often hard to identify. We have discussed the use of surrogate endpoints with many of our colleagues both within and outside the National Heart, Lung, and Blood Institute. A workshop entitled ‘Surrogate endpoints in cardiovascular clinical trials’ was held in Bethesda, MD, on 10 June 1985. Some of the ideas in this paper are distillations and reformulations of parts of the discussion at that meeting. In particular, we thank Kent Bailey, Erica Brittain, Jeffrey Cutler, Robert Cunnion, Larry Friedman, Max Halperin, Eugene Passamani, and Margaret Wu for many insights into the use of surrogate endpoints. What we have presented here reflects our own views; we know that the opinions of those we have listed often differ from ours. We are especially grateful to James Dambrosia, Robert Wittes, and Salim Yusuf for thoughtful readings of an earlier draft of this paper. We thank Ron Branton and Clyde Hodge for typing the manuscript.

REFERENCES

1. Harlow, H. F. and Zimmerman, R. R. ‘Affectional responses in the infant monkey’, Science, 130, 421-432 (1959).

2. Whitehead vs. Stern (‘The Baby M Case’) New Jersey Supreme Court, 107 NJ 140.

3. Friedman, L. M., Furberg, C. D. and DeMets, D. L. Fundamentals of Clinical Trials, 2nd edn., PSG Publishing Co., Littleton, Massachusetts, 1984.

4. Roig, E., Castaner, A., Simmons, B., Patel, R., Ford, E. and Cooper, R. ‘In-hospital mortality rates from acute myocardial infarction by race in U.S. hospitals: Findings from the National Hospital Discharge Survey’, Circulation, 76, 280-287 (1987).

5. Davis, H. T., DeCamilla, J., Bayer, L. W. and Moss, A. J. ‘Survivorship patterns in the post hospital phase of myocardial infarction’, Circulation, 60, 1252-1258 (1979).

6. Yusuf, S., Collins, R., Peto, R., Furberg, C., Stampfer, M. J., Goldhaber, S. Z. and Hennekens, C. H. ‘Intravenous and intracoronary fibrinolytic therapy in acute myocardial infarction: Overview of results of mortality, reinfarction and side effects from 33 randomized controlled trials’, European Heart Journal, 6, 556-585 (1985).

7. Gruppo Italiano per lo Studio della Streptochinasi nell'Infarto Miocardico (GISSI). ‘Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction’, Lancet, 1, 397-401 (1986).

8. The Multicenter Postinfarction Research Group. ‘Risk stratification and survival after myocardial infarction’, New England Journal of Medicine, 30, 331-335 (1985).

9. The Hypertension Detection and Follow-up Program Cooperative Group. ‘Implication of the Hypertension Detection and Follow-up Programs’, Progress in Cardiovascular Disease, 29, 3-10 (1986).

10. Parrillo, J. E., Cunnion, R. E., Epstein, S. E., Parker, M. M., Suffredini, A. F., Brenner, M., Schaer, G. L., Palmeri, S., Alling, D., Wittes, J., Ferrans, V. J., Rodriquez, A. R., and Fauci, A. S. ‘Anti-inflammatory therapy in dilated cardiomyopathy: A prospective, randomized, controlled trial’ (in preparation).

11. Blankenhorn, D. H., Nessim, S. A., Johnson, R. L., Sanmarco, M. E., Azen, S. P., and Cashin-Hemphill, L. ‘Beneficial effects of combined colestipol-niacin therapy on coronary atherosclerosis and coronary venous bypass grafts’, Journal of the American Medical Association, 257, 3233-3240 (1987).

12. The I.S.A.M. Study Group. ‘A prospective trial of intravenous streptokinase in acute myocardial infarction. Mortality and infarct size at 21 days’, New England Journal of Medicine, 314, 1465-1471 (1986).

13. Little, R. J. A. and Rubin, D. B. Statistical Analysis With Missing Data, Wiley, New York, 1987.

14. Coronary Drug Project Research Group. ‘Aspirin in coronary heart disease’, Journal of Chronic Disease, 29, 1342-1350 (1976).

15. Lakatos, E., Wittes, J., and Zucker, D. ‘Regression to the mean and informative censoring’ (in preparation).