
Validation of Surrogate Endpoints in Multiple Randomized Clinical Trials with Discrete Outcomes


Didier Renard¹*, Helena Geys¹, Geert Molenberghs¹, Tomasz Burzykowski¹ and Marc Buyse²

¹ Biostatistics, Center for Statistics, Limburgs Universitair Centrum, Universitaire Campus, B-3590 Diepenbeek, Belgium

² International Institute for Drug Development, Belgium

Summary

This article extends the work of Buyse et al. (2000) on the validation of surrogate endpoints in a meta-analytic setting to the case of two discrete outcomes, the focus being on binary endpoints. The methodology entails fitting of a joint model for the surrogate and the true endpoints that includes several random effects. We propose to fit this model using a pairwise likelihood (PL) approach which seems better suited to the problem at hand than maximum likelihood or penalized quasi-likelihood. The performance of the PL estimator is evaluated on the grounds of limited simulations and the methodology is illustrated on data from a meta-analysis of five clinical trials comparing antipsychotic agents for the treatment of chronic schizophrenia.

Key words: Binary data; Meta-analysis; Pairwise likelihood; Surrogate endpoint; Validation.

Biometrical Journal 44 (2002) 8, 921–935

© WILEY-VCH Verlag Berlin GmbH, 13086 Berlin, 2002 0323-3847/02/0812-0921 $ 17.50+.50/0

* Corresponding author: [email protected]

1. Introduction

The evaluation of a treatment (Z) is based on the observation of a clinically meaningful endpoint which is referred to as the “true” endpoint (T). Often the true endpoint upon which treatment benefits will ultimately be assessed is distant in time or measured at high expense, making it worthwhile to consider an intermediate or surrogate endpoint (S) that can be measured earlier, more conveniently, or more frequently than the endpoint of interest.

The validation of surrogate endpoints in clinical trials is a controversial issue (Boissel et al., 1992; Fleming and DeMets, 1996; De Gruttola et al., 1997; Chuang-Stein and DeMasi, 1998) and should be rigorously established. In a landmark paper, Prentice (1989) proposed a definition as well as a set of operational criteria to validate surrogate endpoints, but they are equivalent solely if the surrogate and true endpoints are binary (Buyse and Molenberghs, 1998). Freedman, Graubard, and Schatzkin (1992) supplemented these criteria with the so-called proportion of treatment explained (PTE), which quantifies the proportion of treatment effect on the true endpoint that is mediated through the surrogate endpoint. This quantity has some drawbacks, however. Thus, it is not a genuine proportion in the strict sense as it can take on values on the whole real line. Also, confidence limits for this quantity tend to be wide in general, unless the sample size is large. Molenberghs et al. (2002) further discuss difficulties associated with this measure.

Buyse and Molenberghs (1998) proposed to replace PTE by two quantities: the relative effect, linking the effects of treatment on both endpoints at the population level, and the adjusted association, an individual-level measure of agreement between both endpoints after accounting for the effect of treatment. They focused on the case where both the surrogate and true endpoints are either binary or normally distributed. Technically, a joint model for the two endpoints is required. The relative effect is defined as $RE = \beta/\alpha$, where $\alpha$ and $\beta$ denote the effects of Z on S and T respectively. For normally distributed endpoints, the adjusted association $\gamma_Z$ is simply the correlation between S and T after correcting for treatment. For binary endpoints, $\gamma_Z$ can take the form of a log odds ratio. When the true and surrogate endpoints are jointly normally distributed, it turns out that PTE is the product of a nuisance factor, RE and $\gamma_Z$. This suggests that PTE is, in effect, a composite quantity, a mixture of two aspects of the model: the fixed effects (population-averaged level) and the random component (individual level).

In order to be informative and of practical value, the validation of a surrogate endpoint will typically require a large number of observations. It is therefore useful to consider situations where data are available from multiple randomized experiments, where the experimental unit could be the center in a multicentric trial or the trial in a meta-analysis of several trials. Buyse et al. (2000) show how the relative effect and the adjusted association can be extended to this setting when the two endpoints are normally distributed.

Since these authors focused on normally distributed endpoints, it is necessary to explore other settings, often more complicated in nature, due to the absence of a unifying framework provided by the multivariate normal distribution. This paper puts emphasis on the case of discrete outcomes, more specifically on endpoints that are binary (or that can be dichotomized). Section 2 briefly reviews the approach of Buyse et al. (2000). Section 3 extends the model previously proposed to the case of two binary endpoints. We discuss parameter estimation and propose to use a pairwise likelihood (PL) approach, which is evaluated on the grounds of simulations in Section 4. Section 5 illustrates the method on data from a meta-analysis of five clinical trials comparing antipsychotic agents for the treatment of chronic schizophrenia.


2. Two Normally Distributed Endpoints

We briefly describe the two-stage model used by Buyse et al. (2000) for surrogate endpoint validation in multiple randomized trials in the case of two normally distributed endpoints. We refer to this paper for additional details.

The first stage is based upon a joint regression model for S and T:

$$
\begin{cases}
S_{ij} \mid Z_{ij} = \mu_{Si} + \alpha_i Z_{ij} + \varepsilon_{Sij}, \\
T_{ij} \mid Z_{ij} = \mu_{Ti} + \beta_i Z_{ij} + \varepsilon_{Tij},
\end{cases}
\tag{2.1}
$$

where the indices i and j refer to trials and subjects within trials respectively; $\mu_{Si}$ and $\mu_{Ti}$ are trial-specific intercepts; and $\alpha_i$ and $\beta_i$ are trial-specific effects of treatment Z on the two endpoints in trial $i = 1, \ldots, N$. Finally, $\varepsilon_{Sij}$ and $\varepsilon_{Tij}$ are correlated error terms, assumed to be normally distributed with mean zero and covariance matrix

$$
\Sigma = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}.
\tag{2.2}
$$

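At the second stage, we assume

$$
\begin{pmatrix} \mu_{Si} \\ \mu_{Ti} \\ \alpha_i \\ \beta_i \end{pmatrix}
=
\begin{pmatrix} \mu_S \\ \mu_T \\ \alpha \\ \beta \end{pmatrix}
+
\begin{pmatrix} m_{Si} \\ m_{Ti} \\ a_i \\ b_i \end{pmatrix},
\tag{2.3}
$$

where the second term on the right-hand side is assumed to follow a zero-mean normal distribution with covariance matrix

$$
D = \begin{pmatrix}
d_{SS} & d_{ST} & d_{Sa} & d_{Sb} \\
d_{ST} & d_{TT} & d_{Ta} & d_{Tb} \\
d_{Sa} & d_{Ta} & d_{aa} & d_{ab} \\
d_{Sb} & d_{Tb} & d_{ab} & d_{bb}
\end{pmatrix}.
\tag{2.4}
$$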

The random-effects representation is obtained by combining the two steps above:

$$
\begin{cases}
S_{ij} \mid Z_{ij} = \mu_S + m_{Si} + (\alpha + a_i) Z_{ij} + \varepsilon_{Sij}, \\
T_{ij} \mid Z_{ij} = \mu_T + m_{Ti} + (\beta + b_i) Z_{ij} + \varepsilon_{Tij}.
\end{cases}
\tag{2.5}
$$

Since both the individual- and trial-level associations are of interest here, the surrogate endpoint validation issue is examined at each of these levels. A key motivation for validating a surrogate endpoint is to be able to predict the effect of treatment on the true endpoint, based on the observed effect of treatment on the surrogate endpoint. It is therefore essential to explore the quality of the prediction of the treatment effect on the true endpoint by information obtained in the validation process based on trials $i = 1, \ldots, N$ and by information available on the surrogate endpoint in a new trial $i = 0$, say.


A measure to assess the quality of a surrogate at the trial level is given by the coefficient of determination

$$
R^2_{\text{trial}} =
\frac{
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}^{T}
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}
}{d_{bb}}.
\tag{2.6}
$$

This coefficient measures how precisely the effect of treatment on the true endpoint can be predicted if the treatment effect on the surrogate endpoint has been observed in a new trial ($i = 0$). It is unitless and ranges in the unit interval if the corresponding variance-covariance matrix D is positive-definite, two desirable features for its interpretation.

The association between the surrogate and final endpoints is captured by the coefficient of determination

$$
R^2_{\text{indiv}} = \frac{\sigma_{ST}^2}{\sigma_{SS}\,\sigma_{TT}},
\tag{2.7}
$$

which simply is the squared correlation between S and T after accounting for trial and treatment effects.

A surrogate endpoint will be said to be ‘valid’ if it is both trial-level valid ($R^2_{\text{trial}} \approx 1$) and individual-level valid ($R^2_{\text{indiv}} \approx 1$). Guidelines about how close $R^2_{\text{trial}}$ and $R^2_{\text{indiv}}$ have to be to 1 are hard to formulate in full generality. This will be based, preferably, upon expert opinion and confidence limits for these coefficients should be examined.
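The two coefficients (2.6) and (2.7) are simple functions of D and Σ. The following minimal sketch evaluates them in plain Python; the function names and the ordering of D as (m_S, m_T, a, b) are our own illustrative choices, and the numerical input is the D matrix used as the true value in the simulations of Section 4.

```python
import math

def r2_trial(D):
    """Trial-level R^2 of (2.6). D is 4x4, ordered (m_S, m_T, a, b)."""
    d_SS, d_Sa, d_aa = D[0][0], D[0][2], D[2][2]
    d_Sb, d_ab, d_bb = D[0][3], D[2][3], D[3][3]
    det = d_SS * d_aa - d_Sa ** 2  # determinant of the 2x2 block
    # Quadratic form (d_Sb, d_ab) A^{-1} (d_Sb, d_ab)^T,
    # with A = [[d_SS, d_Sa], [d_Sa, d_aa]] inverted in closed form.
    quad = (d_aa * d_Sb ** 2 - 2 * d_Sa * d_Sb * d_ab + d_SS * d_ab ** 2) / det
    return quad / d_bb

def r2_indiv(sigma_SS, sigma_ST, sigma_TT):
    """Individual-level R^2 of (2.7): squared residual correlation."""
    return sigma_ST ** 2 / (sigma_SS * sigma_TT)

c = math.sqrt(0.8)
D_sim = [[1, c, 0, 0],
         [c, 1, 0, 0],
         [0, 0, 1, c],
         [0, 0, c, 1]]

print(r2_trial(D_sim))        # close to 0.8, the true value listed in Table 1
print(r2_indiv(1.0, c, 1.0))  # close to 0.8 when the residual correlation is sqrt(0.8)
```

The closed-form 2×2 inverse keeps the sketch dependency-free; it requires the block determinant to be non-zero, which holds whenever D is positive-definite.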

3. Two Binary Endpoints

In order to extend the methodology to the case of two binary endpoints, we shall adopt a latent variable model formulation. That is, we posit the existence of a pair of continuously distributed latent variables $(\tilde S_{ij}, \tilde T_{ij})$ which produce the actual binary values $(S_{ij}, T_{ij})$. These unobservable variables are assumed to have a joint normal distribution and the realized value of $S_{ij}$ (resp. $T_{ij}$) equals 1 if $\tilde S_{ij} > 0$ (resp. $\tilde T_{ij} > 0$), and 0 otherwise.

We are now in a position to adopt the modeling strategy outlined in the previous section. Consider the following random-effects model on the latent variable scale:

$$
\begin{cases}
\tilde S_{ij} \mid Z_{ij} = \mu_S + m_{Si} + (\alpha + a_i) Z_{ij} + \tilde\varepsilon_{Sij}, \\
\tilde T_{ij} \mid Z_{ij} = \mu_T + m_{Ti} + (\beta + b_i) Z_{ij} + \tilde\varepsilon_{Tij}.
\end{cases}
\tag{3.8}
$$

The sole difference here is that variances at the individual level ($\sigma_{SS}$ and $\sigma_{TT}$) are not identifiable parameters and can be fixed, arbitrarily and without loss of generality, to 1. An analogous phenomenon occurs in standard regression models for binary data formulated through the existence of a latent variable (see, for instance, Cox and Snell, 1989). The $\Sigma$ matrix defined in (2.2) can therefore be replaced by

$$
\Sigma = \begin{pmatrix} 1 & \rho_{ST} \\ \rho_{ST} & 1 \end{pmatrix}.
\tag{3.9}
$$

This formulation is attractive since the coefficients of determination defined in the previous section can readily be employed without any modification, although formally, their interpretation is bound to the postulated latent variables generating the observed binary responses. Model (3.8) leads to the following model:

$$
\begin{cases}
\Phi^{-1}\left(P[S_{ij} = 1 \mid Z_{ij}, m_{Si}, a_i, m_{Ti}, b_i]\right) = \mu_S + m_{Si} + (\alpha + a_i) Z_{ij}, \\
\Phi^{-1}\left(P[T_{ij} = 1 \mid Z_{ij}, m_{Si}, a_i, m_{Ti}, b_i]\right) = \mu_T + m_{Ti} + (\beta + b_i) Z_{ij},
\end{cases}
\tag{3.10}
$$

where $\Phi$ denotes the standard normal cumulative distribution function. This model can be recognized as a multilevel probit model and be regarded either as a three-level model or, more precisely, as a multivariate two-level model for binary response data (Goldstein, 1995). It also belongs to the class of so-called generalized linear mixed models (Breslow and Clayton, 1993).

The contribution of the ith trial to the likelihood function for the parameters $\boldsymbol\beta = (\mu_S, \alpha, \mu_T, \beta)^T$, D and $\rho_{ST}$, conditionally on $\mathbf{b}_i = (m_{Si}, a_i, m_{Ti}, b_i)^T$, is

$$
\ell_i(\boldsymbol\beta, D, \rho_{ST} \mid \mathbf{b}_i) = \prod_{j=1}^{n_i} \prod_{k=0}^{1} \prod_{l=0}^{1} P[S_{ij} = k, T_{ij} = l \mid \mathbf{b}_i]^{\delta_{ijkl}},
\tag{3.11}
$$

where

$$
\delta_{ijkl} =
\begin{cases}
1 & \text{if } S_{ij} = k \text{ and } T_{ij} = l, \\
0 & \text{otherwise.}
\end{cases}
$$
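As an illustration of (3.11), the sketch below evaluates the four cell probabilities $P[S_{ij} = k, T_{ij} = l \mid \mathbf{b}_i]$ and the resulting conditional log-likelihood of one trial. It is only a sketch under stated assumptions: the function names are ours, the bivariate normal distribution function is approximated by Simpson's rule rather than by a library routine, and the toy data and parameter values are made up.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bvn_cdf(h, k, rho, n=2000):
    """Phi_2(h, k; rho) for |rho| < 1, via Simpson's rule on
    the integral of phi(x) * Phi((k - rho x) / sqrt(1 - rho^2)) up to h."""
    lo = -8.0  # the normal mass below -8 is negligible
    if h <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    f = lambda x: norm_pdf(x) * norm_cdf((k - rho * x) / s)
    step = (h - lo) / n  # n must be even for Simpson's rule
    total = f(lo) + f(h)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * step)
    return total * step / 3.0

def cell_probs(eta_S, eta_T, rho):
    """P[S = k, T = l | b_i], keyed by (k, l), for linear predictors eta_S, eta_T."""
    p11 = bvn_cdf(eta_S, eta_T, rho)
    p10 = norm_cdf(eta_S) - p11
    p01 = norm_cdf(eta_T) - p11
    return {(1, 1): p11, (1, 0): p10, (0, 1): p01, (0, 0): 1.0 - p11 - p10 - p01}

def cond_loglik(data, beta, b_i, rho):
    """Logarithm of (3.11) for one trial; data is a list of (z, s, t)."""
    mu_S, alpha, mu_T, beta_T = beta
    m_S, a, m_T, b = b_i
    ll = 0.0
    for z, s, t in data:
        eta_S = mu_S + m_S + (alpha + a) * z
        eta_T = mu_T + m_T + (beta_T + b) * z
        ll += math.log(cell_probs(eta_S, eta_T, rho)[(s, t)])
    return ll

toy_trial = [(-1, 0, 0), (1, 1, 1), (1, 1, 0)]  # made-up (z, s, t) triples
print(cond_loglik(toy_trial, (0.2, 0.1, 0.4, 0.1), (0.0, 0.0, 0.0, 0.0), 0.5))
```

Given $\mathbf{b}_i$, the matrix D does not enter this expression; it only appears once the random effects are integrated out.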

Maximum likelihood estimates for the unknown parameters can be obtained by maximizing the integrated likelihood function, whose ith contribution is given by

$$
\ell_i(\boldsymbol\beta, D, \rho_{ST}) = \int \ell_i(\boldsymbol\beta, D, \rho_{ST} \mid \mathbf{b}_i)\, \phi_4(\mathbf{b}_i; D)\, d\mathbf{b}_i,
\tag{3.12}
$$

where $\phi_4(\mathbf{b}_i; D)$ denotes the joint density function of the normal distribution with mean 0 and covariance matrix D.

Numerical integration could be used to evaluate this function, by quadrature or with Markov chain Monte Carlo methods for instance, but such an approach would be computationally intensive. A number of procedures have been proposed to avoid numerical integration. Breslow and Clayton (1993), for example, exploit the penalized quasi-likelihood (PQL) method by applying Laplace’s integral approximation. They also consider marginal quasi-likelihood (MQL), a name they give to a procedure previously proposed by Goldstein (1991). These two approaches can be regarded as first-order Taylor expansions of the mean function about the current estimated fixed part predictor (MQL) or the current predicted value (PQL). Based on simulated data, Rodriguez and Goldman (1995) demonstrate that these approximate procedures may be seriously biased when applied to binary response data. Their simulations reveal that both fixed effects and variance components may suffer from substantial, if not severe, attenuation bias under certain circumstances. Goldstein and Rasbash (1996) in turn show that including a second-order term in the PQL expansion (PQL2) leads to considerable improvement and largely eliminates biases described by Rodriguez and Goldman.

While the computational aspect of the PQL or PQL2 procedures is relatively modest, bias in the estimation of variance components may constitute a problem here since we have direct interest in D and $\rho_{ST}$. A compromise between computational burden and bias would therefore be most desirable. An attractive solution is to replace the marginal likelihood by a function that is easier to evaluate, and hence to maximize, but yet enjoys desirable asymptotic properties such as consistency or normality. A way to achieve this goal is to use the product of all pairwise likelihoods within a trial instead of the full contribution of a trial to the likelihood. This is an example of a pseudo-likelihood function (Arnold and Strauss, 1991), which has also been termed composite likelihood (Lindsay, 1988). Le Cessie and van Houwelingen (1994), for instance, propose the use of PL to model correlated binary outcomes with logistic marginal response probabilities. They consider, however, a population-averaged model, as opposed to a ‘subject-specific’ (or rather, ‘trial-specific’) model considered in the present paper.

More formally, the contribution of the ith trial to the log PL can be written

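$$
p\ell_i = \sum_{j=1}^{2n_i} \sum_{k=1}^{j-1} \ell_{jk},
\tag{3.13}
$$

where $\ell_{jk}$ is the log-likelihood of the pair $(Y_{ij}, Y_{ik})$, with $\mathbf{Y}_i = (S_{i1}, \ldots, S_{in_i}, T_{i1}, \ldots, T_{in_i})$, that is:

$$
\ell_{jk} = Y^{(11)}_{jk} \log p^{(11)}_{jk} + Y^{(10)}_{jk} \log p^{(10)}_{jk} + Y^{(01)}_{jk} \log p^{(01)}_{jk} + Y^{(00)}_{jk} \log p^{(00)}_{jk},
$$

where

$$
p^{(lm)}_{jk} = P[Y_{ij} = l, Y_{ik} = m \mid Z_{ij}, Z_{ik}]
$$

and

$$
Y^{(lm)}_{jk} =
\begin{cases}
1 & \text{if } Y_{ij} = l \text{ and } Y_{ik} = m, \\
0 & \text{otherwise.}
\end{cases}
$$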

The different terms in (3.13) reflect four different types of association, as illustrated in Figure 1:

(i) the association between the surrogate and true endpoints measured on the same individual;
(ii) the association between the surrogate endpoints measured on two distinct individuals;
(iii) the association between the true endpoints measured on two distinct individuals;
(iv) the association between the surrogate and true endpoints measured on two distinct individuals.

Each of these pairwise contributions can be written in terms of univariate and bivariate probits. For example, the probability that both S and T are zero for subject j in trial i can be written as:

$$
P[S_{ij} = 0, T_{ij} = 0 \mid Z_{ij}] = P[\tilde S_{ij} < 0, \tilde T_{ij} < 0 \mid Z_{ij}]
= \Phi_2\left( -\frac{\mu_S + \alpha Z_{ij}}{\sqrt{\operatorname{var}(\tilde S_{ij})}},\; -\frac{\mu_T + \beta Z_{ij}}{\sqrt{\operatorname{var}(\tilde T_{ij})}};\; \rho_{ij} \right).
$$

In this expression, $\operatorname{var}(\tilde S_{ij})$, $\operatorname{var}(\tilde T_{ij})$ and $\rho_{ij}$ are obtained by selecting the appropriate $2 \times 2$ submatrix of the covariance matrix $V_i = Z_i D Z_i^T + R_i$, where $Z_i$ is a suitable design matrix and $R_i$ is a block-diagonal matrix with blocks equal to $\Sigma$. The function $\Phi_2(x, y; \rho)$ denotes the standardized bivariate normal distribution function with correlation coefficient $\rho$.

Estimates of $\boldsymbol\beta$, D and $\rho_{ST}$ can be obtained by maximizing the log PL function

$$
p\ell = \sum_{i=1}^{N} p\ell_i^{*} = \sum_{i=1}^{N} p\ell_i / (2n_i - 1).
$$

The above formula corrects for the fact that each response occurs $(2n_i - 1)$ times in the ith contribution to the log PL. It can be shown that under standard regularity conditions, PL estimators are consistent and asymptotically normally distributed. The proofs are closely related to the classical proofs for maximum likelihood estimators (Lehmann, 1983). If $v$ denotes the vector containing the elements of D, the asymptotic covariance matrix of the PL estimators $(\tilde{\boldsymbol\beta}, \tilde v, \tilde\rho_{ST})$ can be approximated by the “sandwich estimator” $J^{-1} K J^{-1}$, where

$$
J = -\sum_{i=1}^{N} \frac{\partial^2 p\ell_i^{*}(\tilde{\boldsymbol\beta}, \tilde v, \tilde\rho_{ST})}{\partial(\boldsymbol\beta, v, \rho_{ST})\, \partial(\boldsymbol\beta, v, \rho_{ST})^T}
$$

and

$$
K = \sum_{i=1}^{N} \frac{\partial p\ell_i^{*}(\tilde{\boldsymbol\beta}, \tilde v, \tilde\rho_{ST})}{\partial(\boldsymbol\beta, v, \rho_{ST})}\, \frac{\partial p\ell_i^{*}(\tilde{\boldsymbol\beta}, \tilde v, \tilde\rho_{ST})}{\partial(\boldsymbol\beta, v, \rho_{ST})^T}.
$$

Fig. 1. Association structure for the surrogate and true endpoints in two distinct individuals j and k in trial i.
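To make the construction of (3.13) concrete, the following sketch evaluates the normalized contribution $p\ell_i^{*}$ of a single trial. It is not the SAS IML program used for the analyses: the function names, the ordering of D as (m_S, m_T, a, b), the Simpson-rule approximation of $\Phi_2$ and the toy data are all assumptions made for illustration. Each response is given its marginal mean and a design row, and each pair's variances and correlation are read off $V_i = Z_i D Z_i^T + R_i$.

```python
import math
from itertools import combinations

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bvn_cdf(h, k, rho, n=400):
    """Phi_2(h, k; rho) for |rho| < 1, approximated by Simpson's rule (n even)."""
    lo = -8.0
    if h <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    f = lambda x: norm_pdf(x) * norm_cdf((k - rho * x) / s)
    step = (h - lo) / n
    total = f(lo) + f(h)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * step)
    return total * step / 3.0

def pairwise_loglik(data, beta, D, rho_ST):
    """Normalized log PL contribution of one trial; data is a list of (z, s, t)."""
    mu_S, alpha, mu_T, beta_T = beta
    resp = []  # one entry per response: (subject, y, marginal mean, design row)
    for j, (z, s, t) in enumerate(data):
        resp.append((j, s, mu_S + alpha * z, (1.0, 0.0, z, 0.0)))
        resp.append((j, t, mu_T + beta_T * z, (0.0, 1.0, 0.0, z)))

    def quad(r1, r2):  # r1^T D r2, one entry of Z_i D Z_i^T
        return sum(r1[p] * D[p][q] * r2[q] for p in range(4) for q in range(4))

    total = 0.0
    for u, v in combinations(resp, 2):
        var_u = quad(u[3], u[3]) + 1.0  # residual variances are fixed to 1
        var_v = quad(v[3], v[3]) + 1.0
        # S and T of the same subject also share the residual correlation
        cov = quad(u[3], v[3]) + (rho_ST if u[0] == v[0] else 0.0)
        corr = cov / math.sqrt(var_u * var_v)
        h_u, h_v = u[2] / math.sqrt(var_u), v[2] / math.sqrt(var_v)
        p11 = bvn_cdf(h_u, h_v, corr)
        p = {(1, 1): p11, (1, 0): norm_cdf(h_u) - p11, (0, 1): norm_cdf(h_v) - p11}
        p[(0, 0)] = 1.0 - sum(p.values())
        total += math.log(p[(u[1], v[1])])
    return total / (2 * len(data) - 1)

c = math.sqrt(0.8)
D_sim = [[1, c, 0, 0], [c, 1, 0, 0], [0, 0, 1, c], [0, 0, c, 1]]
toy_trial = [(-1, 0, 1), (1, 1, 1), (-1, 0, 0)]  # made-up (z, s, t) triples
print(pairwise_loglik(toy_trial, (0.0, -1.0, 0.0, -2.0), D_sim, 0.5))
```

When D is zero and $\rho_{ST} = 0$, all responses are independent and the value returned reduces to the sum of the univariate log-probabilities, which is the role of the $(2n_i - 1)$ divisor. Maximization and the sandwich covariance are not covered by this sketch.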

To conclude this section, we briefly discuss the practical implementation of the method. To ensure a positive-definite covariance matrix D and to improve the convergence properties of the algorithm, a Cholesky decomposition $D^{*T} D^{*} = D$ was used and the PL maximized with respect to the unique elements of the Cholesky factor. To constrain the residual correlation parameter $\rho_{ST}$ to lie in the interval $[-1, 1]$, Fisher’s z-transformation

$$
\eta_{ST} = \log\left(\frac{1 + \rho_{ST}}{1 - \rho_{ST}}\right)
$$

was used. The procedure was implemented in SAS IML (SAS Institute Inc., 1995), using the NLPDD (Double Dogleg) optimization routine, which requires only function and gradient calls, for maximizing the log PL function. The program is available upon request from the first author.
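The two reparameterizations can be sketched as follows. This is an illustration in plain Python, not the SAS IML code; the function names are hypothetical, and the transformation of $\rho_{ST}$ follows the form printed above (without a factor 1/2), whose inverse is $\rho_{ST} = \tanh(\eta_{ST}/2)$.

```python
import math

def d_from_cholesky(theta):
    """Map 10 unconstrained values to D = L L^T, with L lower-triangular (4x4).

    With D* = L^T this is the decomposition D*^T D* = D, so D is
    symmetric and non-negative definite whatever the values in theta.
    """
    L = [[0.0] * 4 for _ in range(4)]
    values = iter(theta)
    for r in range(4):
        for c in range(r + 1):
            L[r][c] = next(values)
    return [[sum(L[r][k] * L[c][k] for k in range(4)) for c in range(4)]
            for r in range(4)]

def eta_from_rho(rho):
    """eta_ST = log((1 + rho) / (1 - rho)), defined for -1 < rho < 1."""
    return math.log((1.0 + rho) / (1.0 - rho))

def rho_from_eta(eta):
    """Inverse map: any real eta gives a correlation between -1 and 1."""
    return math.tanh(eta / 2.0)

theta = [1.0, 0.5, 1.2, -0.3, 0.1, 0.8, 0.2, 0.0, 0.4, 0.6]  # arbitrary values
D = d_from_cholesky(theta)
print(D[0][0], D[0][1], rho_from_eta(eta_from_rho(0.8)))
```

An optimizer can then work on the 10 elements of the factor and on $\eta_{ST}$ without any explicit constraint.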

4. Simulations

A simulation study was conducted to further investigate the behavior of the PL estimator under different scenarios with varying trial numbers and sizes. Of particular interest is the impact of these parameters on the $R^2$ measures of surrogacy defined in Section 2 and on convergence difficulties. A more extensive simulation study, aimed at comparing the performances of the PL estimator with ML or PQL2 in simpler settings, can be found in Renard, Geys, and Molenberghs (2002). The conclusions from this study were that PL compares well to ML or PQL2 estimators and exhibits a moderate loss of precision.

The true underlying model in our simulation was taken to be:

$$
\begin{cases}
\tilde S_{ij} \mid Z_{ij} = (0 + m_{Si}) + (-1 + a_i) Z_{ij} + \tilde\varepsilon_{Sij}, \\
\tilde T_{ij} \mid Z_{ij} = (0 + m_{Ti}) + (-2 + b_i) Z_{ij} + \tilde\varepsilon_{Tij},
\end{cases}
$$

with

$$
D = \begin{pmatrix}
1 & \sqrt{0.8} & 0 & 0 \\
\sqrt{0.8} & 1 & 0 & 0 \\
0 & 0 & 1 & \sqrt{0.8} \\
0 & 0 & \sqrt{0.8} & 1
\end{pmatrix}
$$

and

$$
\Sigma = \begin{pmatrix} 1 & \rho_{ST} \\ \rho_{ST} & 1 \end{pmatrix}.
$$
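A data set from this model could be generated along the following lines. The sketch is ours and makes assumptions not stated in the text: treatment is coded −1/+1 and alternated between consecutive subjects, the seed and function names are arbitrary, and draws from the multivariate normal distribution are obtained from a hand-written Cholesky factor.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def mvn_draw(L, rng):
    """One draw from N(0, L L^T)."""
    z = [rng.gauss(0.0, 1.0) for _ in L]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]

C = math.sqrt(0.8)
D_TRUE = [[1, C, 0, 0], [C, 1, 0, 0], [0, 0, 1, C], [0, 0, C, 1]]  # order: m_S, m_T, a, b

def simulate_trial(n_subj, rho_ST, rng):
    """Binary (S, T) pairs for one trial, returned as a list of (z, s, t)."""
    L_D = cholesky(D_TRUE)
    L_e = cholesky([[1.0, rho_ST], [rho_ST, 1.0]])
    m_S, m_T, a, b = mvn_draw(L_D, rng)   # trial-level random effects
    rows = []
    for j in range(n_subj):
        z = 1 if j % 2 else -1            # ASSUMPTION: -1/+1 coding, alternating arms
        e_S, e_T = mvn_draw(L_e, rng)     # correlated latent errors
        s_lat = (0.0 + m_S) + (-1.0 + a) * z + e_S
        t_lat = (0.0 + m_T) + (-2.0 + b) * z + e_T
        rows.append((z, int(s_lat > 0), int(t_lat > 0)))
    return rows

rng = random.Random(2002)
data = [simulate_trial(10, C, rng) for _ in range(20)]  # 20 trials of 10 subjects
print(data[0])
```

Thresholding the latent variables at zero reproduces the definition of the binary responses given in Section 3.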

Note that additional simulations involving more complex forms of the D matrix led to similar conclusions.

Data were generated using different scenarios with fixed and variable trial sizes. In this paper, we report only on a small set of simulations with fixed trial sizes (20 trials with 10 and 100 subjects) and different values of $\rho_{ST}$. Conclusions remained basically unchanged when trial size was allowed to vary. In each case, 250 replicates were generated. Results are presented in Table 1 where for each parameter, we have reported the 5%-trimmed mean, the empirical standard error (across replications) and the mean of the estimated standard errors.

Table 1

Simulation results (250 replications)

                     20 trials, 10 subj.      20 trials, 10 subj.      20 trials, 100 subj.
                     (ρ_ST = √0.5)            (ρ_ST = √0.8)            (ρ_ST = √0.8)
Parameter   True     mean    s.d.    s.e.     mean    s.d.    s.e.     mean    s.d.    s.e.
μ_S         0        0.012   0.305   0.290    0.003   0.284   0.285    0.018   0.237   0.233
α           -1      -1.041   0.515   0.485   -1.040   0.511   0.490   -1.012   0.293   0.288
μ_T         0        0.009   0.301   0.299    0.008   0.315   0.294    0.050   0.232   0.236
β           -2      -2.048   0.820   0.731   -2.061   0.656   0.696   -1.995   0.469   0.408
d_SS        1        1.094   0.939   0.802    1.081   0.821   0.778    0.995   0.454   0.406
d_ST        √0.8     0.955   0.722   0.647    0.986   0.659   0.671    0.906   0.385   0.374
d_TT        1        1.291   0.956   0.860    1.157   0.819   0.812    1.031   0.404   0.415
d_Sa        0       -0.064   0.884   0.801   -0.072   0.765   0.750   -0.007   0.343   0.359
d_Ta        0        0.060   0.775   0.729   -0.018   0.736   0.711    0.002   0.328   0.351
d_aa        1        1.455   1.796   1.596    1.479   1.702   1.527    1.040   0.604   0.550
d_Sb        0        0.044   0.850   0.803    0.008   0.668   0.731   -0.026   0.463   0.427
d_Tb        0       -0.097   0.959   0.908   -0.105   0.741   0.805   -0.041   0.483   0.452
d_ab        √0.8     1.004   1.190   1.109    1.053   1.138   1.092    0.863   0.618   0.516
d_bb        1        1.584   2.066   1.926    1.402   1.784   1.618    1.022   1.063   0.762
ρ_ST        †        0.774   0.119   0.122    0.917   0.067   0.076    0.896   0.026   0.025
R²_trial    0.8      0.827   0.243   0.331    0.835   0.241   0.342    0.858   0.195   0.167
R²_indiv    †        0.608   0.182   0.184    0.843   0.116   0.138    0.803   0.047   0.045
Converged            203 (81%)                153 (61%)                250 (100%)

† See column headers for true value of ρ_ST.

Estimates of the fixed-effects parameters seem to exhibit some bias with the smaller sample size (first two settings) but this bias is essentially eliminated when sample size increases (third setting). Their estimated standard errors are relatively close to the empirical ones in all three sets of simulations. Variance parameters in the D matrix tend to be overestimated in smaller samples, owing to the higher degree of skewness in their distribution. The same comment applies, to a lesser extent, to covariance parameters, especially when their magnitude is large. These problems are not observed with the larger sample size (20 trials of size 100). Also, standard errors of the variance parameters tend to be underestimated in general.

The parameter $\rho_{ST}$ was overestimated in the first two simulation settings, especially in the first one where the bias is sizeable and the estimates exhibit more variability than in the second setting. It was, however, correctly estimated in the third setting, once trial size is large. Standard errors are well approximated in each case. The same comments hold for the derived parameter $R^2_{\text{indiv}} = \rho^2_{ST}$. The $R^2_{\text{trial}}$ parameter suffers from upward bias in all three simulation settings and its distribution is strongly skewed towards small values. The amount of bias in this parameter could be attenuated by increasing replication at the trial level.

A final comment concerns convergence of the algorithm. Convergence percentages are given in the last row of Table 1. It can be noticed how these percentages are affected by the magnitude of $\rho_{ST}$ and the trial size. Note that the numbers given in the first two settings exclude cases where the solution lay close to the boundary of the parameter space (value of $\det(D)$ extremely close to 0 or very large value of $\eta_{ST}$). As expected, this problem was more frequent with the largest value of $\rho_{ST}$ (second setting). No convergence problems were encountered in the last set of simulations which was characterized by a larger number of subjects in each trial.

For comparison purposes we also used the PQL procedure, as implemented in the SAS macro GLIMMIX (Wolfinger and O’Connell, 1993), to analyze each simulated data set based on the second simulation scenario. Besides the well-known downward bias occurring in the (co)variance parameters (D and $\rho_{ST}$), the proportion of data sets where the algorithm converged was dramatically low. This proportion was about 44% with an unconstrained $\Sigma$ matrix and dropped to about 25% when the elements on the main diagonal of $\Sigma$ were constrained to equal 1. In addition, even when the algorithm did actually converge, the resulting D matrix was not always positive-definite, therefore yielding $R^2_{\text{trial}}$ values outside the unit interval. Although the GLIMMIX macro is known to perform poorly in general, convergence difficulties are not attributable to this fact alone but rather hint at a problem inherent to the algorithm itself. This was also seen in the simulation study of Renard, Geys, and Molenberghs (2002) where very poor convergence rates were found with the PQL2 algorithm under certain settings.

5. Example: a Meta-Analysis of Trials in Schizophrenic Subjects

To illustrate the methodology, we use data from five clinical trials comparing the effects of risperidone to conventional antipsychotic agents (or placebo) for the treatment of chronic schizophrenia. Only subjects who received optimal doses of risperidone (4–6 mg/day) or an active control (haloperidol, perphenazine, zuclopenthixol) were included in this analysis. Depending on the trial, treatment was administered for a duration of 4 to 8 weeks and data at endpoint are analyzed here.

Even though this is not a standard situation for surrogate validation due to the lack of a ‘gold’ standard, we consider as our primary measure (true endpoint), for illustrative purposes, the Clinical Global Impression (CGI) overall change versus baseline. This scale ranges from 1 = ‘very much improved’ to 7 = ‘very much worsened’ and is used by the treating physician to assess a subject’s overall clinical improvement compared to baseline. We define a response in CGI as an improvement since baseline (CGI grade of 1 to 3) and a non-response otherwise (worsening). As a surrogate measure for global improvement, we consider clinical response defined as a 20% or higher reduction in the Positive and Negative Symptoms Scale (PANSS) score from baseline to endpoint. This corresponds to a commonly accepted criterion for defining a clinical response (Kay et al., 1988). Thus, we seek to quantify the extent to which a response in PANSS, a measure of psychiatric disorder, can predict clinical improvement as observed by the physician.

Pooled data from the five trials are presented in Table 2. It can be seen that the relationship between S and T is very strong ($OR_{ST} = 31.5$, $\chi^2 = 261.4$, $P < 0.0001$), as can be expected. Note that patients were rated by the same treating physicians on PANSS and CGI, thereby bringing some possible contamination bias. Table 3 shows parameter estimates and their standard errors for model (3.10). This model was fitted using the PQL2 procedure implemented in the MLwiN software package (Goldstein et al., 1998) and using the PL approach. Since the number of trials is too small in this example, centers were treated as grouping units. Thus, 176 units were available for the analysis.

Table 2

Pooled data for the schizophrenia example: Surrogate endpoint (S) = response in PANSS score; True endpoint (T) = improvement in CGI overall change versus baseline

Z                S     T = 0        T = 1
Active Control   0     151 (72)†    58 (28)
                 1      15 (6)     220 (94)
Risperidone      0      91 (71)     37 (29)
                 1      20 (9)     213 (91)

† Frequency (row percentage)

As can be seen in Table 3, the PL procedure leads to an estimated D matrix that is positive-definite. With PQL2, on the other hand, some elements of D were constrained to be zero and as a result, the estimated value of the $R^2_{\text{trial}}$ coefficient cannot even be calculated. This makes direct comparison of parameter estimates for D rather difficult between PQL2 and PL. This put aside, fixed-effects parameter estimates are quite similar and the anticipated loss of efficiency in PL estimates is moderate (less than 15%). Also, with PL, the parameter $\rho_{ST}$ exhibits both a much higher point estimate and a much larger standard error.

Interestingly, the estimated value of $R^2_{\text{trial}}$ is really low (0.006), whereas the estimated value of $R^2_{\text{indiv}}$ is rather high (0.924). The latter confirms the strong association between S and T (at the individual level) which was seen in Table 2 and suggests that they both capture overlapping components of a subject’s psychotic status. The very low estimated value for $R^2_{\text{trial}}$, on the other hand, shows that S provides very poor predictions for treatment effects on T (at the center level), thereby making clinical response a rather poor surrogate for clinical improvement according to our criterion. We see here one advantage of this approach in that individual and trial (center in this example) level components of association can be completely disentangled. In such an example, this is important since both are indeed very different.

932 D. Renard et al.: Validation of Surrogate Endpoints in Multiple Randomized Clinical Trials

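Table 3

Results for the schizophrenia data using PQL2 and PL. Parameter estimates and standard errors (S.E.) are reported; "–" marks a quantity that was not estimated.

| Parameter | PQL2 Estimate | PQL2 S.E. | PL Estimate | PL S.E. |
|---|---|---|---|---|
| $\mu_S$ | 0.227 | 0.056 | 0.233 | 0.062 |
| $\alpha^{\dagger}$ | 0.166 | 0.046 | 0.161 | 0.049 |
| $\mu_T$ | 0.441 | 0.054 | 0.445 | 0.062 |
| $\beta^{\dagger}$ | 0.100 | 0.050 | 0.109 | 0.057 |
| $d_{SS}$ | 0.126 | 0.050 | 0.121 | 0.057 |
| $d_{ST}$ | 0.088 | 0.042 | 0.091 | 0.055 |
| $d_{TT}$ | 0.083 | 0.045 | 0.076 | 0.063 |
| $d_{Sa}$ | – | – | −0.005 | 0.054 |
| $d_{Ta}$ | – | – | −0.004 | 0.040 |
| $d_{aa}$ | – | – | 0.001 | 0.005 |
| $d_{Sb}$ | −0.007 | 0.024 | 0.006 | 0.046 |
| $d_{Tb}$ | 0.001 | 0.022 | 0.024 | 0.041 |
| $d_{ab}$ | – | – | −0.001 | 0.002 |
| $d_{bb}$ | 0.029 | 0.023 | 0.059 | 0.045 |
| $\rho_{ST}$ | 0.679 | 0.018 | 0.961 | 0.027 |
| $R^2_{\mathrm{trial}}$ | – | – | 0.006 | 0.082 |
| $R^2_{\mathrm{indiv}}$ | 0.461 | 0.024 | 0.924 | 0.052 |

† Treatment coding: −1 = active control, +1 = risperidone.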
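The tabulated values are consistent with the individual-level coefficient being the square of the latent correlation, $R^2_{\mathrm{indiv}} = \rho_{ST}^2$. The short check below is a sketch that assumes this relation and squares the two point estimates of $\rho_{ST}$ from Table 3; the variable names are illustrative.

```python
# Point estimates of rho_ST reported in Table 3.
rho_st = {"PQL2": 0.679, "PL": 0.961}

# Assumed relation: R^2_indiv equals the squared latent correlation.
r2_indiv = {method: round(rho ** 2, 3) for method, rho in rho_st.items()}
print(r2_indiv)  # {'PQL2': 0.461, 'PL': 0.924}
```

Both results agree with the reported $R^2_{\mathrm{indiv}}$ estimates to three decimals.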


6. Discussion

In this paper, we have extended the approach proposed by Buyse et al. (2000) to assess the validity of a surrogate endpoint in a meta-analytic context when both the surrogate and the final endpoints are discrete in nature, the emphasis being on binary outcomes. This was done by adopting a latent variable model formulation which allows us to carry over previously proposed measures of surrogacy in a natural way, under the assumption that latent variables are normally distributed. This, in turn, dictates the use of a joint probit model for the surrogate and the true endpoints.

The major difficulty rests in parameter estimation since, on the one hand, a direct likelihood approach would be computationally involved and, on the other hand, standard approximate methods such as PQL may not be satisfactory since interest centers on the random components of the model here. This was our prime motivation for using a PL approach, as it strikes a balance between computational burden and bias, although at the (small) price of lower efficiency.

It is well known that generalized linear mixed models are challenging to fit in general and can pose numerous estimation problems. Practical experience suggests that it is not uncommon for the PQL algorithm to exhibit numerical instability and fail to converge, the problem being worse with PQL2 and with more complicated models such as the one dealt with in the present paper. Some simulations suggested that PL is more robust against convergence issues (Renard, Geys, and Molenberghs, 2002), which gives an added advantage to this procedure. Numerical problems should still be expected to occur frequently in the kind of applications sought here, however. Buyse et al. (2000) noticed how often model (2.5) fails to converge in practice. Simulations indicated that such factors as the number of trials or between-trial variability can be critical for improving convergence properties of the algorithm. Obviously, these problems remain topical here, if not worsened, as less information is conveyed by binary response variables than by continuous ones.

To conclude, we briefly outline how the method can be extended to ordinal endpoints. We can adopt the "threshold concept" and assume that there are unobservable latent variables that are related to the actual responses S and T through a series of cutoff points. For instance, if S has K categories, we need to define a set of $(K - 1)$ threshold values $\gamma_1 < \dots < \gamma_{K-1}$ and postulate that S and the corresponding normally distributed latent variable, U say, are connected by

$$S = k \;\Leftrightarrow\; \gamma_{k-1} < U \le \gamma_k, \qquad k = 1, \dots, K,$$

with $\gamma_0 = -\infty$ and $\gamma_K = +\infty$ and where, for convenience, we can assume that $\gamma_1 = 0$. On the latent variable scale, we can again consider model (3.8) and the associated coefficients of determination as measures of surrogacy at the trial and individual levels. Parameter estimation can proceed as before by considering the likelihood of all possible pairs of outcomes. Threshold values used to define S and T are simply extra parameters to be estimated in the PL function.

An extension to mixed situations, where one endpoint is discrete and the other is continuous, is also feasible. The same modeling strategy can be followed, with one of the components assumed to be normally distributed and the other being obtained via a latent, normally distributed variable. The PL function then involves evaluation of univariate probits only.
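The threshold rule for ordinal endpoints can be sketched in a few lines of Python. The function names, the cutoff values, and the unit-variance latent variable with mean `eta` are assumptions made for illustration only; the sketch shows how a latent value is mapped to a category and how the category probabilities follow from the normal distribution function.

```python
import math
from bisect import bisect_left


def categorize(u, thresholds):
    """Return k such that gamma_{k-1} < u <= gamma_k (categories 1..K)."""
    # thresholds holds gamma_1 < ... < gamma_{K-1};
    # gamma_0 = -inf and gamma_K = +inf are implicit.
    return bisect_left(thresholds, u) + 1


def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probs(eta, thresholds):
    """P(S = k), k = 1..K, for a latent U ~ N(eta, 1) (assumed unit variance)."""
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [std_normal_cdf(c - eta) for c in cuts]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]


# Hypothetical cutoffs for K = 4 categories, with gamma_1 = 0.
gammas = [0.0, 1.0, 2.0]
labels = [categorize(u, gammas) for u in (-0.5, 0.0, 0.5, 1.0, 2.5)]
probs = category_probs(0.3, gammas)
```

In a pairwise likelihood, such probabilities would be replaced by their bivariate counterparts for each pair of outcomes, with the cutoffs treated as additional parameters.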

Acknowledgement

The first and third authors gratefully acknowledge support from an LUC Bijzonder Onderzoeksfonds grant. The second author was supported by the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), Belgium. The authors are also grateful to Johnson and Johnson, Pharmaceutical Research and Development, and to Tony Vangeneugden in particular, for permission to use their data.

References

Arnold, B. C. and Strauss, D., 1991: Pseudolikelihood estimation: some examples. Sankhya B 53, 233–243.

Boissel, J. P., Collet, J. P., Moleur, P., and Haugh, M., 1992: Surrogate endpoints: a basis for a rational approach. European Journal of Clinical Pharmacology 43, 235–244.

Breslow, N. E. and Clayton, D. G., 1993: Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

Buyse, M. and Molenberghs, G., 1998: The validation of surrogate endpoints in randomized experiments. Biometrics 54, 1014–1029.

Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., and Geys, H., 2000: The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics 1, 49–67.

Chuang-Stein, C. and DeMasi, R., 1998: Surrogate endpoints in AIDS drug development: current status. Drug Information Journal 32, 439–459.

Cox, D. R. and Snell, E. J., 1989: Analysis of Binary Data, 2nd edition. Chapman and Hall, London.

De Gruttola, V., Fleming, T. R., Lin, D. Y., and Coombs, R., 1997: Validating surrogate markers – are we being naive? Journal of Infectious Diseases 175, 237–246.

Fleming, T. R. and DeMets, D. L., 1996: Surrogate endpoints in clinical trials: are we being misled? Annals of Internal Medicine 125, 605–613.

Flandre, P. and Saidi, Y., 1999: Letters to the editor: Estimating the proportion of treatment effect explained by a surrogate marker. Statistics in Medicine 18, 107–115.

Freedman, L. S., Graubard, B. I., and Schatzkin, A., 1992: Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine 11, 167–178.

Goldstein, H., 1991: Nonlinear multilevel models, with an application to discrete response data. Biometrika 78, 45–51.

Goldstein, H., 1995: Multilevel Statistical Models, 2nd edition. Edward Arnold, London.

Goldstein, H. and Rasbash, J., 1996: Improved approximations for multilevel models with binary responses. Journal of the Royal Statistical Society A 159, 505–513.

Goldstein, H., Rasbash, J., Plewis, I., Draper, D., Browne, W., Yang, M. et al., 1998: A User's Guide to MLwiN. Institute of Education, London.

Kay, S. R., Opler, L. A., and Lindenmayer, J.-P., 1988: Reliability and validity of the Positive and Negative Syndrome Scale for schizophrenia. Psychiatric Research 23, 99–110.

Le Cessie, S. and Van Houwelingen, J. C., 1994: Logistic regression for correlated binary data. Applied Statistics 43, 95–108.

Lehmann, E. L., 1983: Theory of Point Estimation. Wiley, New York.

Lindsay, B. G., 1988: Composite likelihood methods. Contemporary Mathematics 80, 221–239.

Molenberghs, G., Buyse, M., Burzykowski, T., Renard, D., and Geys, H., 2002: Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Submitted for publication.

Prentice, R. L., 1989: Surrogate endpoints in clinical trials: definitions and operational criteria. Statistics in Medicine 8, 431–440.

Renard, D., Geys, H., and Molenberghs, G., 2002: A pairwise likelihood approach to estimation in multilevel probit models. To appear in Computational Statistics and Data Analysis.

Rodríguez, G. and Goldman, N., 1995: An assessment of estimation procedures for multilevel models with binary responses. Journal of the Royal Statistical Society A 158, 73–89.

SAS Institute Inc., 1995: SAS/IML Software: Changes and Enhancements Through Release 6.11. SAS Institute Inc., Cary, NC.

Wolfinger, R. and O'Connell, M., 1993: Generalized linear mixed models: a pseudo-likelihood approach. Journal of Statistical Computation and Simulation 48, 233–243.

Received, August 2001
Accepted, January 2002
