Introduction to Meta-analysis Borenstein, Hedges, Higgins & … Chapters... · 2014-07-14 · Introduction to Meta-analysis Borenstein, Hedges, Higgins & Rothstein . CAMARADES: Bringing

CAMARADES: Bringing evidence to translational medicine

Heterogeneity

Chapters 15 and 16

Introduction to Meta-analysis

Borenstein, Hedges, Higgins &

Rothstein


Heterogeneity

Chapter 15

Overview

• The goal of a synthesis is not simply to compute

a summary effect, but rather to make sense of

the pattern of effects.

– If the effect size is consistent across studies we need

to know that and to consider the implications, and if it

varies we need to know that and consider the

different implications.


• The problem: the observed variation in the estimated

effect sizes is partly spurious as it includes both true

variation in effect sizes and also random error.

• We use the following measures:

– Q statistic (a measure of weighted squared deviations)

– P value (the result of a statistical test based on Q)

– T2 (the between-studies variance)

– T (the between-studies standard deviation)

– I2 (the ratio of true heterogeneity to total observed variation)


• Under the random effects model we allow that true effect

sizes may vary from study to study. We discuss

approaches to identify and then quantify this

heterogeneity.

• When we discuss heterogeneity in effect sizes we mean

the variation in true effect sizes.

• However the variation that we actually observe is partly

spurious, incorporating both (true) heterogeneity and

also random error.

Chapter 16

Identifying and Quantifying

Heterogeneity

True effect size = the effect size in the underlying population, and the effect size

we would observe if the study had an infinitely large sample size (and therefore

no sampling error).


• If all studies in an analysis shared the same true effect size, so that true heterogeneity is zero, we would still not expect the observed effects to be identical to each other. Because of within-study error, we would expect each to fall within some range of the common effect.

• If the true effect sizes vary from one study to the next, the observed effects vary from one another for 2 reasons:

1. the real heterogeneity in effect size

2. the within-study error

To quantify the heterogeneity we partition the observed variance into these 2 components and then focus on the real heterogeneity in effect sizes.

How do we do this?


1. We compute the total amount of study-to-

study variation actually observed

2. We estimate how much the observed effects

would be expected to vary from each other

if the true effect was actually the same in all

studies

3. The excess variation (if any) is assumed to

reflect real differences in effect sizes (that is,

the heterogeneity)


• Observed effects are

identical in A & B.

• CIs are relatively wide in A

and relatively narrow in B.

• In A, all studies could share

a common effect, with the

observed dispersion falling

within the umbrella of the

CIs.

• The CIs for the B studies

are quite narrow and cannot

comfortably account for the

observed dispersion.

• Similarly, the observed

effects are identical in C &

D where C has wider CIs.

• In C, the effects can be fully

explained by within-study

error, while in D they

cannot.

Dispersion across studies relative

to error within studies


• From A to C both the within-study variance and the observed variance have been

multiplied by 2. Same for B and D.

• The scale has increased but the ratio (observed/within) is unchanged.

• While the effects are more widely dispersed in the 2nd row than in the 1st, this is

not relevant to the purpose of isolating the true dispersion.

• What matters is the ratio of observed to expected dispersion, which is the same in

A and C (and is the same in B and D)

• Q= a statistic that is sensitive to the ratio of the observed variation to the within-

study error, rather than their absolute values.


Computing Q

• 1st step in partitioning the heterogeneity is to compute Q, defined as

$Q = \sum_{i=1}^{k} W_i (Y_i - M)^2$

» where Wi is the study weight (1/Vi)

» Yi is the study effect size

» M is the summary effect

» k is the number of studies

• In words, we compute the deviation of each effect size from the mean, square it, weight this by the inverse-variance for that study, and sum these values over all studies to yield the weighted sum of squares (WSS), or Q.
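The computation described in words above can be sketched as follows. This is a minimal illustration, assuming the summary effect M is the inverse-variance weighted mean; the function name `q_statistic` and the six effect sizes and variances are made up for the example and are not taken from the book.

```python
def q_statistic(effects, variances):
    """Return (Q, M): the weighted sum of squared deviations and the weighted mean."""
    weights = [1.0 / v for v in variances]  # W_i = 1 / V_i
    m = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - m) ** 2 for w, y in zip(weights, effects))
    return q, m


# Hypothetical data for six studies (illustration only).
effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]
variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]

q, m = q_statistic(effects, variances)
print(f"M = {m:.3f}, Q = {q:.2f}, df = {len(effects) - 1}")
```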


The expected value of Q

based on within-study error

• Because Q is a standardised measure the

expected value does not depend on the metric

of the effect size, but is simply the degrees of

freedom (df),

$df = k - 1$

where k is the number of studies


The excess variation

• Since Q is the observed WSS and df is the

expected WSS (under the assumption that all

studies share a common effect), the difference, Q - df,

reflects the excess variation, the part that will be

attributed to differences in the true effects from

study to study.


Ratio of observed to expected

variation

A) Observed value of Q =3

Expected value= 5 (k-1)

Observed variation is less than

expected based on within study

error (Q is less than the

degrees of freedom)

B) Observed variation is greater

than we would expect based on

within-study error (Q is greater

than the degrees of freedom)

• Q reflects the total dispersion (WSS)

• Q-df reflects the excess dispersion


• Q = 3 in both A and C because these plots share the same ratio

• Q= 12 in both B and D because these plots share the same ratio

• (despite the fact that the absolute range of effects is higher in C and

D)
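A small numerical check of this point, as a sketch under stated assumptions: the data are hypothetical, and "multiplying the observed variance by 2" is modelled by stretching every effect by the square root of 2.

```python
import math


def q_statistic(effects, variances):
    """Weighted sum of squared deviations from the inverse-variance weighted mean."""
    weights = [1.0 / v for v in variances]
    m = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - m) ** 2 for w, y in zip(weights, effects))


effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]  # hypothetical
variances = [0.03] * 6                           # hypothetical

# Double both variances: stretch the effects by sqrt(2), which doubles
# their observed variance, and double each within-study variance.
wide_effects = [y * math.sqrt(2) for y in effects]
wide_variances = [v * 2 for v in variances]

q_narrow = q_statistic(effects, variances)
q_wide = q_statistic(wide_effects, wide_variances)
print(f"Q (original) = {q_narrow:.4f}, Q (rescaled) = {q_wide:.4f}")
```

The two Q values agree because the factor of 2 in the squared deviations cancels the factor of 1/2 in the weights.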


How to derive Tau2 and I2 from

Q and df

We can use Q to estimate the

variance (and standard deviation)

of the true effects : start with Q,

remove the dependence on the

number of studies, and return to the

original metric. These are Tau2 and

Tau.

To estimate what proportion of the

observed variance reflects real

differences among studies (rather than

random error) we will start with Q,

remove the dependence on the

number of studies, and express the

results as a ratio, I2


• Researchers typically ask: is the heterogeneity statistically

significant?

• We test the null hypothesis that all studies share a common effect

size.

• Typically we set alpha at 0.10 or at 0.05, with a p value less than

alpha leading us to reject the null hypothesis, and conclude that the

studies do not share a common effect size.

• This test of significance is sensitive both to the magnitude of the

effect (here, the excess dispersion) and the precision with which this

effect is estimated (here, based on the number of studies).
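The test compares Q with a chi-squared distribution on k - 1 degrees of freedom. The sketch below uses only the standard library; `chi2_sf` is a helper written for this example (a statistics package would normally supply it), and the inputs are the Q values this deck quotes for plots A and B (Q = 3.00 and Q = 12.00, six studies each).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df.

    Built up with the recurrence
    Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1), where h = x / 2.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-h)             # base case: df = 2
    else:
        a, tail = 0.5, math.erfc(math.sqrt(h))  # base case: df = 1
    while a < df / 2.0:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return tail


for label, q, k in (("A", 3.0, 6), ("B", 12.0, 6)):
    p = chi2_sf(q, k - 1)
    print(f"Plot {label}: Q = {q:.2f}, df = {k - 1}, p = {p:.3f}")
# Plot A: Q = 3.00, df = 5, p = 0.700
# Plot B: Q = 12.00, df = 5, p = 0.035
```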


The impact of excess dispersion

Compare plots A vs B, which both have 6

studies.

As the excess dispersion increases (Q

moves from 3.00 in A to 12.00 in B) the p

value moves from 0.70 to 0.035.

Similarly when we compare plots C vs D.

The impact of number of studies

Compare plots A vs C, identical except A

has 6 studies and C has 12 with the

same estimated value of between-study

variation.

With the additional precision the p-value

moves away from zero, from 0.70 (for A)

to 0.83 (for C).

Compare plots B vs D, identical except B

has 6 studies and D has 12 with the

same estimated value of between-

studies variation (Tau2=0.037). With the

added precision the p-value moves

towards zero.


The p-value for the left-hand

column moves towards 1.0

as we add studies, while the

p-value for the right-hand

column moves towards 0.0 as

we add studies.

At left, since Q is less than df

the additional evidence

strengthens the case that the

excess dispersion is zero, and

moves the p-value towards 1.

At right, since Q exceeds df,

the additional evidence

strengthens the case that the

excess dispersion is not zero,

and moves the p-value

towards 0.


Q and its p-value

• While a significant p-value provides evidence that the true effects vary, the converse is not true.

• A nonsignificant p-value should not be taken as evidence that the effect sizes are consistent, since the lack of significance could be due to low power.

• The Q statistic and p-value address only the test of significance and should never be used as surrogates for the amount of true variance.

• A nonsignificant p-value could reflect a trivial amount of observed dispersion, but could also reflect a substantial amount of observed dispersion with imprecise studies

• Similarly, a significant p-value could reflect a substantial amount of observed dispersion but could also reflect a minor amount of observed dispersion with precise studies.

• The purpose of the test is to assess the viability of the null hypothesis, and not to estimate the magnitude of the true dispersion.


Estimating Tau2

• Tau2 is defined as the variance of the true effect sizes

• In other words, if we had an infinitely large sample of studies, each itself infinitely large (so that the estimate in each study was the true effect), and computed the variance of these effects, this variance would be Tau2.

• Since we cannot observe the true effects we cannot compute this variance directly but estimate it from the observed effects, with the estimate denoted T2.

• To yield this estimate we start with the difference (Q-df), which represents the dispersion in true effects on a standardised scale. We divide by a quantity (C) which has the effect of putting the measure back into its original metric and also making it an average, rather than a sum, of squared deviations:

$T^2 = \frac{Q - df}{C}$, where $C = \sum W_i - \frac{\sum W_i^2}{\sum W_i}$


• This means that T2 is in the same metric (squared) as the effect itself,

and also reflects the absolute amount of variation in that scale.

• While the actual variance of the true effects can never be less

than zero, our estimate of this value T2 can be less than zero if,

because of sampling error, the observed variance is less than we

would expect based on within-study error, in other words, if Q<df. In this

case, T2 is simply set to zero.

• If Q>df then T2 will be positive and it will be based on 2 factors. The

first is the amount of excess variation (Q-df), and the second is the

metric of the effect size index.
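A minimal sketch of this estimate, assuming the standard method-of-moments scaling factor C = sum(W) - sum(W^2) / sum(W). The function name `tau_squared_dl` and the data are hypothetical.

```python
import math


def tau_squared_dl(effects, variances):
    """Return (T2, T): method-of-moments estimates of the between-studies
    variance and standard deviation, with T2 truncated at zero."""
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    m = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - m) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # C returns the excess dispersion (Q - df) to the original (squared) metric.
    c = sum_w - sum(w * w for w in weights) / sum_w
    t2 = max(0.0, (q - df) / c)  # set to zero when Q < df
    return t2, math.sqrt(t2)


effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]    # hypothetical
variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]  # hypothetical
t2, t = tau_squared_dl(effects, variances)
print(f"T2 = {t2:.3f}, T = {t:.3f}")
```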


• The impact of the excess variation on our estimate of T2 is evident if we compare A vs B.

– The within study error is smaller in B. Therefore while the observed variation is the same in both plots, a

higher proportion of this variation is assumed to be real in B. As we move from A to B, Q moves from

12 to 48.01 and T2 from 0.037 to 0.057.

• The impact of the scale on our estimate of T2 is evident if we compare C vs D.

– Q and df are the same in the 2 plots, which means that the same proportion of the observed variance

will be attributed to between-studies variance. However, the absolute amount of the variance is larger in

D, so this proportion translates into a larger estimate of Tau2. As we move from C to D, T2 moves from

0.037 to 0.096.


T2

• T2 (our estimate for the variance of the true effects) is

used to assign weights under the random effects model,

where the weight assigned to each study is

$W^*_i = \frac{1}{V^*_{Y_i}}$, with $V^*_{Y_i} = V_{Y_i} + T^2$

• In words, the total variance for a study (V*Yi) is the sum

of the within-study variance (VYi) and the between-studies

variance (T2).

• This method of estimating the variance between studies

is known as the method of moments or the DerSimonian

and Laird method.
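The weighting rule can be sketched as below. This assumes T2 has already been estimated; the function name `random_effects_summary`, the data, and the T2 value passed in are illustrative only.

```python
def random_effects_summary(effects, variances, t2):
    """Return (weights, summary effect, variance of the summary effect)."""
    # Total variance per study = within-study variance + between-studies variance.
    weights = [1.0 / (v + t2) for v in variances]
    sum_w = sum(weights)
    m_star = sum(w * y for w, y in zip(weights, effects)) / sum_w
    return weights, m_star, 1.0 / sum_w


effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]    # hypothetical
variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]  # hypothetical

for t2 in (0.0, 0.037):  # T2 = 0 reproduces the fixed-effect weights
    weights, m_star, var_m = random_effects_summary(effects, variances, t2)
    shares = [w / sum(weights) for w in weights]
    print(f"T2 = {t2}: M* = {m_star:.3f}, largest relative weight = {max(shares):.2f}")
```

Adding T2 to every study's variance makes the weights more similar to one another, so the most precise study dominates less than it does when T2 is zero.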


Tau

• Tau2 refers to the actual variance and T2 is our

estimate of this parameter.

• Now we turn to the standard deviation of the true

effect sizes.

• Here, Tau refers to the actual standard deviation

and T is our estimate of this parameter.

• T, the estimate of the standard deviation, is

simply the square root of T2.


The expected distribution of true

effects, based on T.

E.g. in plot A the summary effect is 0.41 and T is 0.193. We expect that some 95% of

the true effects will fall in the range of 0.41 plus or minus 1.96 T, or 0.04 to 0.79

and this is reflected in the bell curve.

Plots A and B have the same observed variance, but differ in the proportion of this

variance that is attributed to real differences in effect size.

In A, the bell curve is relatively narrow and captures only a fraction of the observed

dispersion; the rest is assumed to reflect error. In B, the bell curve is relatively

wide, and captures a larger fraction of the dispersion, since most of the dispersion

is here assumed to be real.


The expected distribution of true

effects, based on T.

Similarly, in plots C vs D the ratio of true to observed variance is the same, but the

observed dispersion is larger in D. The bell curve is wider in D than in C but in both

cases a comparable proportion of the effects fall within the range of the curve

(because the ratio is the same).


T

• T enables us to talk about the substantive

importance of the dispersion.

• Consider an intervention with a summary effect size of

0.50.

– If T is 0.10, then most of the true effects (95%) fall in the

approximate range of 0.30 to 0.70.

– If T is 0.20 then most of the true effects fall in the

approx range of 0.10 to 0.90.

– If T is 0.30 then most of the true effects fall in the

approx range of -0.10 to +1.10
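These ranges come from the summary effect plus or minus roughly two T. A minimal sketch, assuming the true effects are normally distributed and ignoring uncertainty in the summary effect and in T itself (the helper name `true_effect_range` is made up for this example):

```python
def true_effect_range(summary, t, z=1.96):
    """Approximate range expected to contain about 95% of the true effects."""
    return summary - z * t, summary + z * t


for t in (0.10, 0.20, 0.30):
    low, high = true_effect_range(0.50, t)
    print(f"T = {t:.2f}: {low:+.2f} to {high:+.2f}")
```

With z = 1.96 the limits differ slightly from the rounded figures on the slide, which correspond to a multiplier of about 2.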


I2

• What proportion of the observed variance reflects real differences in effect size?

• The statistic I2 reflects this proportion:

$I^2 = \left(\frac{Q - df}{Q}\right) \times 100\%$

• That is, the ratio of excess dispersion to total dispersion. The statistic I2 can be viewed as a statistic of the form

$I^2 = \left(\frac{\tau^2}{\tau^2 + V_Y}\right) \times 100\%$

• That is, the ratio of true heterogeneity to total variance across the observed effect estimates. However, this is not a true definition of I2 because in reality there is not a single VY, since the within-study variances vary from study to study.

• The I2 statistic is a descriptive statistic and not an estimate of any underlying quantity.
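As a sketch, I2 can be computed directly from Q and df as the ratio of excess to total dispersion, truncated at zero when Q is below df. The inputs are Q values that appear in this deck's plots; the function name `i_squared` is made up for the example.

```python
def i_squared(q, df):
    """Percentage of the observed dispersion attributed to true heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


print(f"Q = 12, df = 5: I2 = {i_squared(12.0, 5):.2f}%")
print(f"Q = 3,  df = 5: I2 = {i_squared(3.0, 5):.2f}%")
```

With Q entered as exactly 12 this gives 58.33%. The 58.34% quoted in the slides presumably reflects an unrounded value of Q.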


Impact of excess dispersion on I2

• For any df, I2 moves in

tandem with Q. As such, it is

driven entirely by the ratio of

observed dispersion to

within-study dispersion.

• In the top row, both plots A

and B have a Q value of

12.00 with 5 degrees of

freedom. Therefore both

have an I2 of 58.34%.

• Similarly in plots C and D.

The wider scale does not

impact the I2.


Impact of excess dispersion on I2

The scale of I2 has a range of 0-100%, regardless of the scale used for the

meta-analysis itself. It can be interpreted as a ratio, and has the additional advantage of

being analogous to indices used in psychometrics (where reliability is the ratio of

true to total variance) or regression (where R2 is the proportion of the total variance

that can be explained by covariates). Importantly, I2 is not directly affected by the

number of studies in the analysis.

I2 reflects the extent of

overlap of CIs, which is not

dependent on the actual

location or spread of the

true effects. As such it is

convenient to view I2 as a

measure of inconsistency

across the findings of the

studies, and not as a

measure of the real

variation across the

underlying true effects.


Concluding remarks I2

• I2 allows us to discuss the amount of variance on a relative scale

• We can use I2 to determine what proportion of the observed variance is real

• If I2 is near zero, then almost all of the observed variance is spurious, which means there is nothing to explain.

• If I2 is large, then it would make sense to speculate about reasons for the variance & possibly to apply techniques such as subgroup analysis or meta-regression to try & explain it.

– Low 25%

– Moderate 50%

– High 75%

• This indicates what proportion of the observed variation is real but does not address the dispersion.


Comparing the measures of

heterogeneity

• The Q statistic and its p-value serve as a test of significance. Useful because it depends on the

number of studies and is not sensitive to the metric of the effect size index.

• T2 serves as our estimate of the between-studies variance in the analysis and T serves as our estimate of

the standard deviation of the true effects. Useful because they are sensitive to the metric of

the effect size and they are not sensitive to the number of studies.

• I2 is the ratio of true heterogeneity to total variation in observed effects, a kind of signal to

noise ratio. Useful because it is not sensitive to the metric of the effect size and it is not

sensitive to the number of studies.


• T2 and T reflect the amount of true heterogeneity

(the variance or the standard deviation)

• I2 reflects the proportion of observed dispersion

that is due to this heterogeneity.


Above, I2 is the same in both plots but in A the true effects are clustered in a small range

(T2=0.006) while in B they are dispersed across a wider range (T2=0.037).

I2 reflects only the

proportion of

variance that is

true, and says

nothing about the

absolute value of

this variance.

T2 reflects only the

absolute value of

the true variance

and says nothing

about the

proportion of

observed variance

that is true.

Above, T2 is the same in both plots, but in A it is a large part (I2= 58.34%) of a small observed

dispersion whereas in B it is a small part (I2=16.01%) of a large observed dispersion.


Please note

• T2 is tied to the effect size index while I2 is not. For example, T2 for a

synthesis of risk ratios will be in the metric of log risk ratios while T2

for a synthesis of standardised mean differences will be in the

metric of standardised mean differences. It would not be

meaningful to compare the T2 values for 2 syntheses unless they

were in the same metric.

• By contrast, I2 is on a ratio scale of 0% to 100% and it is possible to

compare this value from different syntheses.


1. The Q statistic and its p-value only address the viability of the null hypothesis and not the amount of excess dispersion.

2. Q is sensitive to relative variance (the kind tracked by I2) and not absolute variance (the kind tracked by T2 and T).

An informative presentation of heterogeneity indices requires both a measure of the magnitude and a measure of uncertainty. Magnitude may be represented by the degree of true variation on the scale of the effect measure (T2) or the degree of inconsistency (I2) or both. Uncertainty over whether apparent heterogeneity is genuine may be expressed using the p-value for Q or using confidence intervals for T2 or I2.

Note that uncertainty around T2 or I2 is often very large. If the studies themselves have poor precision (wide CIs), this could mask the presence of real heterogeneity, resulting in an estimate of zero for T2 and I2. Therefore, it would be a mistake to interpret a T2 or I2 of zero as meaning that the effect sizes are consistent unless this is justified by CIs for T2 and I2 that exclude large values.


Confidence intervals for T2 and I2

