
Generalizing experimental study results to target populations

Elizabeth Stuart

Johns Hopkins Bloomberg School of Public Health

Departments of Mental Health, Biostatistics, and Health Policy and Management

[email protected]/∼estuart

Funding thanks to NSF DRL-1335843, IES R305D150003

February 26, 2016

Elizabeth Stuart (JHSPH) Generalizability February 26, 2016 1 / 25

Outline

1 Introduction, context, and framework

2 The setting and overview of approaches

3 Reweighting approaches

4 Conclusions


Making research results relevant: A range of policy or practice questions

A given district or school may go on to the What Works Clearinghouse to see whether a new reading intervention is “evidence-based” and helpful for them

The state of Maryland may be deciding whether to recommend the new program for all schools or districts in the state

Or for all “struggling” schools?

Medicare may be deciding whether or not to approve payment for a new treatment for back pain

Should a broad public health media campaign be started around not switching car seats to forward facing until a child is 12 months old?


From individual to population effects

All of these reflect a “population” average treatment effect

e.g., across individuals in a population, does this intervention work “on average”?

This population could be fairly narrow, or quite broad

There may actually be underlying treatment effect heterogeneity

e.g., stronger effects for some individuals

Lots of interest in tailoring treatments for individuals; not my focus today

But for policy questions that motivate today’s talk, desire an overall average effect
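As a hedged formalization of the "overall average effect" above, the population and trial-sample averages could be written in potential-outcomes notation. The symbols here are assumptions introduced for illustration and do not appear on the slide: $Y_i(1)$ and $Y_i(0)$ are the outcomes individual $i$ would have under treatment and control, $N$ is the size of the target population, $n$ is the size of the trial, and $S_i = 1$ marks trial participants.

```latex
\[
\text{PATE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(Y_i(1) - Y_i(0)\bigr),
\qquad
\text{SATE} = \frac{1}{n}\sum_{i:\,S_i = 1}\bigl(Y_i(1) - Y_i(0)\bigr)
\]
```

A randomized trial gives an unbiased estimate of the second quantity; the policy questions above concern the first.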


At this point, relatively little attention to how well results from a given study might carry over to a relevant target population

This talk will discuss recent work trying to get people to start thinking about these issues, while taking advantage of recent advances in study quality and data


How much do we need to worry about external validity?

Lots of evidence that the people or groups that participate in trials differ from general populations

Will cause bias if the factors that differ also moderate treatment effects

Districts that participate in rigorous educational evaluations much larger than typical districts in the US (Stuart et al., under review)

People that participate in trials of drug abuse treatment have higher education levels than those in drug abuse treatment nationwide (Susukida et al., in press)

Increasing worries about lack of minority representation in clinical trials

And these differences can lead to external validity bias (Bell et al., in press)
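One way to see why bias needs both conditions (a covariate that differs and that also moderates effects) is the following sketch. It assumes a discrete covariate $X$ and a conditional effect $\tau(x)$ that is the same in the trial and the population for a given $x$; the notation ($\tau$, $p_{\text{trial}}$, $p_{\text{pop}}$) is introduced here for illustration and is not taken from the slides.

```latex
\[
\underbrace{\sum_x \tau(x)\,p_{\text{trial}}(x)}_{\text{trial average effect}}
\;-\;
\underbrace{\sum_x \tau(x)\,p_{\text{pop}}(x)}_{\text{population average effect}}
\;=\;
\sum_x \tau(x)\,\bigl[p_{\text{trial}}(x) - p_{\text{pop}}(x)\bigr]
\]
```

The right-hand side is zero if $\tau(x)$ is constant in $x$ (both distributions sum to one) or if the trial and population distributions of $X$ coincide. It can be non-zero only when the covariate both differs across the two groups and moderates the effect.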


The setting

Assume we have one randomized trial, already conducted

And also covariate data on some target population of interest (do not have treatment values or outcomes in the population)

The question: How can we use these data to estimate the effects of the intervention in the target population?

Note: Focused on assessing and enhancing external validity with respect to the characteristics of trial and population subjects

Lots of other threats to external validity as well: scale-up problems, implementation, different settings, . . . (see Cook, 2014)


Analysis approaches for estimating population effects

Meta-analysis: When multiple studies available, but does not necessarily give population estimates

Cross-design synthesis: Explicitly combines experimental and non-experimental effect estimates (Pressler & Kaizar, 2013)

Model-based approaches: Model outcome in the trial, use to predict outcomes in the population (e.g., BART; Kern et al., 2016)

Post-stratification: Estimate separate effects within strata, then combine using population proportions

Reweighting: Like a smoothed version of post-stratification (Cole & Stuart, 2010; O’Muircheartaigh & Hedges, 2014)

(Of course design options exist too, e.g., aiming to enroll representative (or “balanced”) samples (Royall!))
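The post-stratification idea in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited paper: the record layout, the stratum names, the outcome values, and the population shares are all hypothetical, and the sketch assumes every stratum has at least one treated and one control subject.

```python
from collections import defaultdict


def stratum_effects(records):
    """Difference in mean outcomes (treated minus control) within each stratum.

    records: iterable of (stratum, treated, outcome) tuples, treated in {0, 1}.
    Assumes each stratum contains at least one treated and one control record.
    """
    # sums[stratum][arm] = [running outcome total, number of subjects]
    sums = defaultdict(lambda: {1: [0.0, 0], 0: [0.0, 0]})
    for stratum, treated, outcome in records:
        cell = sums[stratum][int(treated)]
        cell[0] += outcome
        cell[1] += 1
    return {
        stratum: arms[1][0] / arms[1][1] - arms[0][0] / arms[0][1]
        for stratum, arms in sums.items()
    }


def post_stratified_effect(effects, population_shares):
    """Combine stratum effects using population (not trial) proportions."""
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(effects[s] * population_shares[s] for s in effects)


# Hypothetical trial records: (stratum, treated, outcome)
trial_records = [
    ("young", 1, 3.0), ("young", 1, 5.0), ("young", 0, 2.0), ("young", 0, 2.0),
    ("older", 1, 10.0), ("older", 0, 4.0),
]
# Hypothetical target-population proportions for the same strata
shares = {"young": 0.75, "older": 0.25}

effects = stratum_effects(trial_records)
population_effect = post_stratified_effect(effects, shares)
print(effects, population_effect)
```

With many covariates the strata become sparse, which is one motivation for the smoothed, weighting-based version.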


Case study: The ACTG Trial

Examined highly active antiretroviral (HAART) therapy for HIV compared to standard combination therapy

577 US HIV+ adults randomized to treatment, 579 to control

33/577 and 63/579 endpoints (AIDS/death) during 52-week follow-up

Intent-to-treat analysis: Hazard ratio of 0.51 (95% CI: 0.33, 0.77)

Cole & Stuart (2010)
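As a quick arithmetic check on the counts above, the crude 52-week risks in each arm can be computed directly. Note the assumption: this is a simple risk ratio from cumulative counts, not the hazard ratio reported on the slide, which comes from a time-to-event analysis. The two happen to be close here (roughly 0.53 versus 0.51).

```python
# Endpoint counts reported on the slide (AIDS/death during 52-week follow-up).
events_treated, n_treated = 33, 577
events_control, n_control = 63, 579

# Crude cumulative risk in each arm.
risk_treated = events_treated / n_treated
risk_control = events_control / n_control

# Crude contrasts; these ignore event timing and censoring.
risk_ratio = risk_treated / risk_control
risk_difference = risk_treated - risk_control

print(f"risk, treated: {risk_treated:.3f}")
print(f"risk, control: {risk_control:.3f}")
print(f"risk ratio:    {risk_ratio:.2f}")
print(f"risk diff:     {risk_difference:.3f}")
```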


The target population

Don’t necessarily just care about people in trial

What would the effects of the treatment be if implemented nationwide?

US estimates of the number of people infected with HIV in 2006 (CDC, 2008)

HIV incidence was estimated using a statistical approach with adjustment for testing frequency and extrapolated to the US

Have joint distribution of sex, race, and age groups of the newly infected individuals


Inverse probability of selection weighting

Weight the trial subjects up to the population

Each subject in trial receives weight $w_i = \frac{1}{P(S_i = 1 \mid X)}$

(Inverse of their probability of being in the trial)

Use those weights when calculating means or running regressions

Related to inverse probability of treatment weighting, Horvitz-Thompson estimation in surveys
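The weighting steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis from the case study: the covariate is a single categorical stratum, the selection probability is estimated as the trial count divided by the population count in that stratum (treating the trial as a subset of the population), and the effect is a weighted difference in means rather than a weighted hazard ratio. All data values and function names are hypothetical.

```python
from collections import Counter


def selection_weights(trial_strata, population_counts):
    """Inverse probability of selection weights, one per stratum.

    P(S = 1 | X = x) is estimated as (trial count in x) / (population count in x).
    """
    trial_counts = Counter(trial_strata)
    weights = {}
    for stratum, n_trial in trial_counts.items():
        p_selected = n_trial / population_counts[stratum]
        weights[stratum] = 1.0 / p_selected
    return weights


def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in pairs)
    return sum(y * w for y, w in pairs) / total_weight


def weighted_difference_in_means(trial, weights):
    """Treated-minus-control difference in weighted mean outcomes."""
    treated = [(y, weights[x]) for x, t, y in trial if t == 1]
    control = [(y, weights[x]) for x, t, y in trial if t == 0]
    return weighted_mean(treated) - weighted_mean(control)


# Hypothetical trial: (stratum, treated, outcome); "older" is over-represented.
trial = [
    ("older", 1, 10.0), ("older", 0, 4.0), ("older", 1, 10.0), ("older", 0, 4.0),
    ("young", 1, 3.0), ("young", 0, 2.0),
]
# Hypothetical target-population counts by stratum.
population_counts = {"older": 400, "young": 600}

weights = selection_weights([x for x, _, _ in trial], population_counts)
unweighted = weighted_difference_in_means(trial, {x: 1.0 for x in population_counts})
weighted = weighted_difference_in_means(trial, weights)
print(weights, unweighted, weighted)
```

In this toy example the under-represented "young" stratum has the smaller effect, so weighting pulls the estimate down relative to the unweighted trial contrast. With several covariates the selection probability would typically come from a model such as logistic regression rather than from cell counts.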


Standard assumptions

Experiment was randomized

“Sample ignorability for treatment effects”: selection into the trial independent of impacts given the observed covariates

For the same value of observed covariates, impacts the same across trial and population

No unmeasured variables related to selection into the trial and treatment effects (Sensitivity analysis for this: Nguyen et al., under review)

“Overlap”: all individuals in the population had a non-zero probability of participating in the trial

Analogous to strong ignorability/unconfoundedness of treatment assignment in non-experimental studies

(If outcome under control observed in the population, can use a slightly different assumption)
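The three assumptions above could be written formally as follows. This is one hedged rendering rather than the exact statement in any cited paper; the symbols are assumptions introduced here: $S_i$ indicates trial participation, $T_i$ treatment assignment, $X_i$ the observed covariates, and $Y_i(1)$, $Y_i(0)$ the potential outcomes.

```latex
\begin{align*}
\text{Randomization:} \quad & T_i \perp\!\!\!\perp \bigl(Y_i(0),\, Y_i(1)\bigr) \mid S_i = 1 \\
\text{Sample ignorability:} \quad & S_i \perp\!\!\!\perp \bigl(Y_i(1) - Y_i(0)\bigr) \mid X_i \\
\text{Overlap:} \quad & P(S_i = 1 \mid X_i = x) > 0 \quad \text{for all } x \text{ in the target population}
\end{align*}
```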


Effect heterogeneity and predictors of participation

People in trial more likely to be:

Older (not 13-29)

Male

White or Hispanic

Those characteristics also moderate effects in the trial

Detrimental effects for young people

Largest effects for those 30-39

Larger effects for males, as compared to females

Larger effects for Black individuals, as compared to White or Hispanic individuals


Estimated population effects

                         Hazard ratio   95% CI
Crude trial results      0.51           0.33, 0.77
Age weighted             0.68           0.39, 1.17
Sex weighted             0.53           0.34, 0.82
Race weighted            0.46           0.29, 0.72
Age-sex-race weighted    0.57           0.33, 1.00

CIs wider for weighted results

Effects generally somewhat attenuated, except for weighting only by race
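One common way to see why weighting widens confidence intervals is Kish's approximate effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$, which shrinks as the weights become more unequal. The sketch below is an illustration only; it is not a calculation from the case study, and the weight values are hypothetical.

```python
def effective_sample_size(weights):
    """Kish's approximate effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Equal weights: no loss of precision, effective size equals the actual size.
equal_weights = [1.0] * 6
# Hypothetical unequal selection weights for six trial subjects.
unequal_weights = [100.0, 100.0, 100.0, 100.0, 300.0, 300.0]

ess_equal = effective_sample_size(equal_weights)
ess_unequal = effective_sample_size(unequal_weights)
print(ess_equal, ess_unequal)
```

Here six subjects with unequal weights carry roughly the information of four and a half equally weighted subjects, so interval estimates widen accordingly.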


Placebo checks

Can also use the weighting as a diagnostic

Weighted control group mean should match the population outcome mean if the control conditions are the same (“placebo check”)

In HAART case, if we had mortality information in the population, could see if weighted mortality rate among control group matched the population mortality rate (assuming no treatment in the population)

If placebo check fails, may indicate unobserved differences between the groups

Hartman et al., 2013; Stuart et al., 2011
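A placebo check of this kind might be sketched as below. The function name, the tolerance rule, and all numbers are hypothetical; a real check would compare the gap against its sampling uncertainty (for example with a standard error or a formal equivalence test) rather than a fixed tolerance.

```python
def placebo_check(control_outcomes, control_weights, population_mean, tolerance):
    """Compare the weighted control-group mean with the population outcome mean.

    Returns (weighted control mean, gap, passed), where gap is the weighted
    control mean minus the population mean and passed is abs(gap) <= tolerance.
    """
    total_weight = sum(control_weights)
    weighted_control_mean = (
        sum(y * w for y, w in zip(control_outcomes, control_weights)) / total_weight
    )
    gap = weighted_control_mean - population_mean
    return weighted_control_mean, gap, abs(gap) <= tolerance


# Hypothetical control-arm outcomes and their selection weights.
outcomes = [4.0, 4.0, 2.0]
weights = [100.0, 100.0, 300.0]

# Case 1: population mean close to the weighted control mean, check passes.
mean_ok, gap_ok, passed_ok = placebo_check(outcomes, weights, 2.9, tolerance=0.25)
# Case 2: population mean far from it, check fails, suggesting unobserved differences.
mean_bad, gap_bad, passed_bad = placebo_check(outcomes, weights, 3.5, tolerance=0.25)
print(mean_ok, gap_ok, passed_ok)
print(mean_bad, gap_bad, passed_bad)
```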


Everyone wants to assume that study results generalize

But very few statistical methods exist

At this point, lots of “hand waving,” qualitative statements

Need more statistical methods to quantify and improve external validity

For both study design and study analysis


What do we need to assess and enhance external validity?

Information on the factors that influence treatment effect heterogeneity

Information on the factors that influence participation in rigorous evaluations

Data on all of these factors in the trial and the population

Not very helpful if these factors not observed in the population

Methods that allow for the differences between trial and population on these factors

These are coming along


Data a primary limiting factor

Right now we have very little information on factors that influence effects or participation in trials

Sometimes hard to find population data

Trial data also often not publicly available

Even harder to find population data that has the same measures as trial of interest

Stuart & Rhodes (under review): Hard to find appropriate population data, and even then out of over 400 measures in each, only about 7 were comparable


Conclusions

Can’t necessarily assume that average effects seen in a trial would carry over directly to a target population

Methods allow us to adjust for differences in observed characteristics between the trial sample and population to estimate population treatment effects

But only as good as the data available!


And remember . . .

“With better data, fewer assumptions are needed.” - Rubin (2005, p. 324)

“You can’t fix by analysis what you bungled by design.” - Light, Singer and Willett (1990, p. v)

“Real world relationships are invariably more complicated than those we can represent in mathematically tractable models.” - Royall and Pfeffermann (1981, p. 16)


References, with thanks to all my co-authors

Bell, S.H., Olsen, R.B., Orr, L.L., and Stuart, E.A. (in press). Estimates of external validity bias when impact evaluations select sites non-randomly. Forthcoming in Educational Evaluation and Policy Analysis.

Cole, S.R. and Stuart, E.A. (2010). Generalizing evidence from randomized clinical trials to target populations: the ACTG-320 trial. American Journal of Epidemiology 172: 107-115.

Imai, K., King, G., and Stuart, E.A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society, Series A 171: 481-502.

Kern, H.L., Stuart, E.A., Hill, J., and Green, D.P. (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness.

Olsen, R., Bell, S., Orr, L., and Stuart, E.A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management 32(1): 107-121.

Stuart, E.A., Cole, S.R., Bradshaw, C.P., and Leaf, P.J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society, Series A 174(2): 369-386.

Stuart, E.A., Bradshaw, C.P., and Leaf, P.J. (2015). Assessing the generalizability of randomized trial results to target populations. Prevention Science 16(3): 475-485.

Susukida, R., Crum, R., Stuart, E.A., and Mojtabai, R. (in press). Assessing sample representativeness in randomized control trials: Application to the National Institute of Drug Abuse Clinical Trials Network. Forthcoming in Addiction.

