NBER WORKING PAPER SERIES
UNDERSTANDING AND MISUNDERSTANDING RANDOMIZED CONTROLLED TRIALS
Angus DeatonNancy Cartwright
Working Paper 22595http://www.nber.org/papers/w22595
NATIONAL BUREAU OF ECONOMIC RESEARCH1050 Massachusetts Avenue
Cambridge, MA 02138September 2016, Revised October 2017
We acknowledge helpful discussions with many people over the several years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia, and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Jim Heckman, Bo Honoré, Chuck Manski, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1. We have benefited from generous comments on an earlier version by Christopher Adams, Tim Besley, Chris Blattman, Sylvain Chassang, Jishnu Das, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Williams, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U), the Spencer Foundation, and the National Science Foundation (award 1632471). Deaton acknowledges financial support from the National Institute on Aging through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30 AG024928. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.
© 2016 by Angus Deaton and Nancy Cartwright. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.
Understanding and Misunderstanding Randomized Controlled Trials Angus Deaton and Nancy CartwrightNBER Working Paper No. 22595September 2016, Revised October 2017JEL No. C10,C26,C93,O22
ABSTRACT
RCTs would be more useful if there were more realistic expectations of them and if their pitfalls were better recognized. For example, and contrary to many claims in the applied literature, randomization does not equalize everything but the treatment across treatments and controls, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) confounders. Estimates apply to the trial sample only, sometimes a convenience sample, and usually selected; justification is required to extend them to other groups, including any population to which the trial sample belongs. Demanding “external validity” is unhelpful because it expects too much of an RCT while undervaluing its contribution. Statistical inference on ATEs involves hazards that are not always recognized. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon and not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not “what works,” but “why things work”.
Angus Deaton361 Wallace HallWoodrow Wilson SchoolPrinceton UniversityPrinceton, NJ 08544-1013and [email protected]
Nancy CartwrightDepartment of PhilosophyDurham University50 Old ElvetDurham, UK DH1 3HN and University of California, San [email protected]
2
IntroductionRandomizedcontrolledtrials(RCTs)arecurrentlywidelyvisibleineconomicstoday
andhavebeenusedinthesubjectatleastsincethe1960s(seeGreenbergand
Shroder(2004)foracompendium).Itisoftenclaimedthatsuchtrialscandiscover
“whatworks”ineconomics,aswellasinpoliticalscience,education,andsocialpol-
icy.Amongbothresearchersandthegeneralpublic,RCTsareperceivedtoyield
causalinferencesandestimatesofaveragetreatmenteffects(ATEs)thataremore
reliableandmorecrediblethanthosefromanyotherempiricalmethod.Theyare
takentobelargelyexemptfromthemyriadeconometricproblemsthatcharacterize
observationalstudies,torequireminimalsubstantiveassumptions,littleornoprior
information,andtobelargelyindependentof“expert”knowledgethatisoftenre-
gardedasmanipulable,politicallybiased,orotherwisesuspect.
Therearenow“WhatWorks”centersusingandrecommendingRCTsina
rangeofareasofsocialconcernacrossEuropeandtheAnglophoneworld.These
centersseeRCTsastheirpreferredtoolandindeedoftenpreferRCTevidencelexi-
cographically.Asoneofmanyexamples,theUSDepartmentofEducation’sstandard
for“strongevidenceofeffectiveness”requiresa“well-designedandimplemented”
RCT;noobservationalstudycanearnsuchalabel.This“goldstandard”claimabout
RCTsislesscommonineconomics,butImbens(2010,407)writesthat“randomized
experimentsdooccupyaspecialplaceinthehierarchyofevidence,namelyatthe
verytop.”TheAbdulLatifJameelPovertyActionLab(J-PAL),whosestatedmission
is“toreducepovertybyensuringthatpolicyisinformedbyscientificevidence”,ad-
vertisesthatitsaffiliatedprofessors“conductrandomizedevaluationstotestand
improvetheeffectivenessofprogramsandpoliciesaimedatreducingpoverty”,J-
PAL(2017).Theleadpageofitswebsite(echoedinthe‘Evaluation’section)notes
“843ongoingandcompletedrandomizedevaluationsin80countries”withnomen-
tionofanystudiesthatarenotrandomized.
Inmedicine,thegoldstandardviewhaslongbeenwidespread,e.g.fordrug
trialsbytheFDA;anotableexceptionistherecentpaperbyFrieden(2017),ex-di-
3
rectoroftheU.S.CentersforDiseaseControlandPrevention,wholistskeylimita-
tionsofRCTsaswellasarangeofcontextswhereRCTs,evenwhenfeasible,are
dominatedbyothermethods.
WearguethatanyspecialstatusforRCTsisunwarranted.Whichmethodis
mostlikelytoyieldagoodcausalinferencedependsonwhatwearetryingtodis-
coveraswellasonwhatknowledgeisalreadyavailable.Whenlittleprior
knowledgeisavailable,nomethodislikelytoyieldwell-supportedconclusions.This
paperisnotacriticismofRCTsinandofthemselves,letaloneanattempttoidentify
goodandbadstudies.Instead,wewillarguethat,dependingonwhatwewantto
discover,whywewanttodiscoverit,andwhatwealreadyknow,therewilloftenbe
superiorroutesofinvestigation.
Wepresenttwosetsofarguments.Thefirstisanenquiryintotheideathat
ATEsestimatedfromRCTSarelikelytobeclosertothetruththanthoseestimated
inotherways.ThesecondexploreshowtousetheresultsofRCTsoncewehave
them.Inthefirstsection,ourdiscussionrunsinfamiliartermsofbiasandefficiency,
orexpectedloss.Noneofthismaterialisnew,butweknowofnosimilartreatment,
andwewishtodisputemanyoftheclaimsthatarefrequentlymadeintheapplied
literature.Someroutinemisunderstandingsare:(a)randomizationensuresafair
trialbyensuringthat,atleastwithhighprobability,treatmentandcontrolgroups
differonlyinthetreatment;(b)RCTsprovidenotonlyunbiasedestimatesofATEs
butalsopreciseestimates;(c)statisticalinferenceinRCTs,whichrequiresonlythe
simplecomparisonofmeans,isstraightforward,sothatstandardsignificancetests
arereliable.
Nothingwesayinthepapershouldbetakenasageneralargumentagainst
RCTs;wearesimplytryingtochallengeunjustifiableclaims,andexposemisunder-
standings.WearenotagainstRCTs,onlymagicalthinkingaboutthem.Themisun-
derstandingsareimportantbecausewebelievethattheycontributetothecommon
perceptionthatRCTsalwaysprovidethestrongestevidenceforcausalityandforef-
fectiveness.
4
Inthesecondpartofthepaper,wediscusshowtousetheevidencefrom
RCTs.Thenon-parametricandtheory-freenatureofRCTs,whichisarguablyanad-
vantageinestimation,isoftenadisadvantagewhenwetrytousetheresultsoutside
ofthecontextinwhichtheresultswereobtained;credibilityinestimationcanlead
toincredibilityinuse.Muchoftheliterature,perhapsinspiredbyCampbelland
Stanley’s(1963)famous“primacyofinternalvalidity”,appearstobelievethatinter-
nalvalidityisnotonlynecessarybutalmostsufficienttoguaranteetheusefulnessof
theestimatesindifferentcontexts.Butyoucannotknowhowtousetrialresults
withoutfirstunderstandinghowtheresultsfromRCTsrelatetotheknowledgethat
youalreadypossessabouttheworld,andmuchofthisknowledgeisobtainedby
othermethods.OncethecommitmenthasbeenmadetoseeingRCTswithinthis
broaderstructureofknowledgeandinference,andwhentheyaredesignedtoen-
hanceit,theycanbeenormouslyuseful,notjustforwarrantingclaimsofeffective-
nessbutforscientificprogressmoregenerally.Cumulativescienceisnotadvanced
throughmagicalthinking.
TheliteratureontheprecisionofATEsestimatedfromRCTsgoesbacktothe
verybeginning.Gosset(writingas`Student’)neveracceptedFisher’sargumentsfor
randomizationinagriculturalfieldtrialsandarguedconvincinglythathisownnon-
randomdesignsfortheplacementoftreatmentandcontrolsyieldedmoreprecise
estimatesoftreatmenteffects(seeStudent(1938)andZiliak(2014)).Gosset
workedforGuinnesswhereinefficiencymeantlostrevenue,sohehadreasonsto
care,asshouldwe.Fisherwontheargumentintheend,notbecauseGossetwas
wrongaboutefficiency,butbecause,unlikeGosset’sprocedures,randomizationpro-
videsasoundbasisforstatisticalinference,andthusforjudgingwhetheranesti-
matedATEisdifferentfromzerobychance.Moreover,Fisher’sblockingprocedures
canlimittheinefficiencyfromrandomization(seeYates(1939)).Gosset’sreserva-
tionswereechoedmuchlaterinSavage’s(1962)commentthataBayesianshould
notchoosetheallocationoftreatmentsandcontrolsatrandombutinsuchaway
that,givenwhatelseisknownaboutthetopicandthesubjects,theirplacementre-
vealsthemosttotheresearcher.Theseissuesabouthowtoincorporatepriorinfor-
mationintorandomizedtrialsarecentraltoSection1.
5
Ineconomics,thestrengthsandweaknessesofRCTsarewellexploredinthe
volumesbyHausmanandWise(1985)andbyGarfinkelandManski(1992);inthe
latter,theintroductionbyGarfinkelandManskiisabalancedsummaryofwhatran-
domizedtrialscanandcannotdo.ThepaperinthatvolumebyHeckman(1992)
raisesmanyoftheissuesthatheandhiscoauthorshaveexploredinsubsequentpa-
pers,seeinparticularHeckmanandSmith(1995),andHeckman,LalondeandSmith
(1999)whofocusonlabormarketexperiments.Manski(2013)containsagood
summaryofbothstrengthsandweaknesses.
Thereisalsoamorecontestedrecentliterature.Ontheonehand,thereare
proceduresthattakeasfundamentaltheunrestrictedindividualtreatmenteffectsof
individualsandseeknon-parametricapproachestoestimatingtheiraverage.Onthe
otherhand,theseproceduresarecontrastedwithanapproachthatuseselementsof
economictheorytodefineparametersofinterestandtoidentifymagnitudesthat
arelikelytobeinvarianttopolicymanipulationoracrosscontexts,whereinvari-
anceisdefinedinthesenseofHurwicz(1966).TheintroductioninImbensand
Wooldridge(2009)provideaneloquentdefenseofthetreatment-effectformulation.
Itemphasizesthecredibilitythatcomesfromatheory-freespecificationwithalmost
unlimitedheterogeneityintreatmenteffects.TheintroductioninHeckmanand
Vytlacil(2007)makesanequallyeloquentcaseagainst,notingthatthecrucialingre-
dientsoftreatmentsinRCTsareoftennotclearlyspecified—sothatweoftendonot
knowwhatthetreatmentreallyis—andthatthetreatmenteffectsarehardtolinkto
invariantparametersthatwouldbeusefulelsewhere.Aspectsofthesamedebate
featureinImbens(2010),AtheyandImbens(2017),AngristandPischke(2017),
Heckman(2005,2008,2010)andHeckmanandUrzua(2010).
Deaton(2010)complainsabouttheuseofinstrumentalvariables,including
randomization,asasubstituteforthinkingaboutandconstructingmodelsofeco-
nomicdevelopment.HearguesagainsttheideathatusingRCTstoevaluateprojects
todiscover“whatworks”caneveryieldasystematicbodyofscientificknowledge
thatcanbeusedtoreduceoreliminatepoverty.Thatpaperisanargumentagainst
theusefulnessoftheheterogeneoustreatmentapproach.Itarguesthatrefusingto
6
modelheterogeneity,thoughavoidingassumptions,precludesthesortofcumula-
tiveresearchprogramthatmightyieldusefulpolicy.Thepaper’sclaimthatRCTs
havenospecialclaimtogeneratecredibleandusefulknowledgewaschallengedby
Imbens(2010);someofhisargumentsareansweredbelow.Cartwright(2007)and
CartwrightandMunro(2010)challengeany“goldstandard”viewofRCTs.Cart-
wright(2011,2012,2016)andCartwrightandHardie(2012)focusonthequestion
ofhowtousetheresultsofRCTsandwhatwecanlearnwhenanexperimentshows
thatsomepolicyworkssomewhere.Section2pursuestheseissuesingeneraland
throughcasestudies.
Section1:DoRCTsgivegoodestimatesofAverageTreatmentEffects
Inthissection,weexplorehowtoestimateaveragetreatmenteffects(ATEs)andthe
roleofrandomization.WenotefirstthatestimatingATEsisonlyoneofmanyuses
forthedatageneratedbyanRCT.Westartfromatrialsample,acollectionofsub-
jectsthatwillbeallocatedrandomlytoeitherthetreatmentorcontrolarmofthe
trial.This“sample”mightbe,butrarelyis,arandomsamplefromsomepopulation
ofinterest.Morefrequently,itisselectedinsomeway,forexampletothosewilling
toparticipate,orissimplyaconveniencesamplethatisavailabletothetrialists.
Givenrandomallocationtotreatmentsandcontrols,thedatafromthetrialallowthe
identificationoftwodistributions,𝐹"(𝑌")and𝐹&(𝑌&),ofoutcomes𝑌"and𝑌&inthe
treatedanduntreatedcaseswithinthetrialsample.TheestimatedATEisthediffer-
enceinmeansofthetwodistributionsandisthefocusofmuchoftheliteraturein
socialscienceandmedicine.Yetpolicymakersandresearchersmaywellbeinter-
estedinotherfeaturesofthetwodistributions.Forexample,ifYisincome,they
maybeinterestedinwhetheratreatmentreducedincomeinequality,orinwhatit
didtothe10thor90thpercentilesoftheincomedistribution,eventhoughdifferent
peopleoccupythosepercentilesinthetreatmentandcontroldistributions(seeBit-
leretal(2006)foranexampleinUSwelfarepolicy).Cancertrialsstandardlyusethe
mediandifferenceinsurvival,whichcomparesthetimesuntilhalfthepatientshave
diedineacharm.Morecomprehensively,policymakersmaywishtocompareex-
pectedutilitiesfortreatedanduntreatedunderthetwodistributionsandconsider
7
optimalexpected-utilitymaximizingtreatmentrulesconditionalonthecharacteris-
ticsofsubjects(seeManski(2004)andManskiandTetenov(2016);Bhattacharya
andDupas(2012)containsanapplication.)Theseusesareimportant,butwefocus
onATEshereanddonotconsidertheseotherusesofRCTsanyfurtherinthispaper.
1.1Whatdoesrandomizationdo?
Ausefulwaytothinkabouttheestimationoftreatmenteffectsistouseaschematic
linearcausalmodeloftheform:
(1)
where, istheoutcomeforuniti,𝑇( isadichotomous(1,0)treatmentdummyindi-
catingwhetherornotiistreated,and𝛽( istheindividualtreatmenteffectofthe
treatmentoni.Thex’saretheobservedorunobservedotherlinearcausesofthe
outcome,andwesupposethat(1)capturesaminimalsetofcausesof𝑌( sufficientto
fixitsvalue.Jmaybe(very)large.Becausetheheterogeneityoftheindividualtreat-
menteffects,𝛽( ,isunrestricted,weallowthepossibilitythatthetreatmentinteracts
withthex’sorothervariables,sothattheeffectsofTcandependonanyothervaria-
bles.Notethatwedonotneedisubscriptsonthe𝛾’sthatcontroltheeffectsofthe
othercauses;iftheireffectsdifferacrossindividuals,weincludetheinteractionsof
individualcharacteristicswiththeoriginalx’sasnewx’s.Giventhatthex’scanbe
unobservable,thisisnotrestrictive.
Consideranexperimentthataimstotellussomethingaboutthetreatment
effects;thismightormightnotuserandomization.Eitherway,wecanrepresentthe
treatmentgroupashaving𝑇( = 1andthecontrolgroupashaving𝑇( = 0.Giventhe
study(ortrial)sample,subtractingtheaverageoutcomesamongthecontrolsfrom
theaverageoutcomesamongthetreatments,weget
Y
1−Y
0= β
1+ γ j (xij
1−
j=1
J
∑ xij0) = β
1+ (S
1− S
0) (2)
Thefirsttermonthefar-right-handsideof(2),whichistheATEinthetrialsample,
iswhatwewant,butthesecondtermorerrorterm,whichisthesumofthenetav-
eragebalanceofothercausesacrossthetwogroups,willgenerallybenon-zeroand
Yi = βiTi + γ j xijj=1
J∑Yi
8
needstobedealtwithsomehow.Wegetwhatwewantwhenthemeansofallthe
othercausesareidenticalinthetwogroups,ormoreprecisely(andlessonerously)
whenthesumoftheirnetdifferences𝑆" − 𝑆&iszero;thisisthecaseofperfectbal-
ance.Withperfectbalance,thedifferencebetweenthetwomeansisexactlyequalto
theaverageofthetreatmenteffectsamongthetreated,sothatwehavetheultimate
precisioninthatweknowthetruthinthetrialsample,atleastinthislinearcase.As
always,the“truth”herereferstothetrialsample,anditisalwaysimportanttobe
awarethatthetrialsamplemaynotberepresentativeofthepopulationthatisulti-
matelyofinterest,includingthepopulationfromwhichthetrialsamplecomes;any
suchextensionrequiresfurtherargument.
Howdowegetbalance,orsomethingclosetoit?What,exactly,istheroleof
randomization?Inalaboratoryexperiment,wherethereisusuallymuchprior
knowledgeoftheothercauses,theexperimenterhasagoodchanceofcontrolling
(orsubtractingawaytheeffectsof)theothercauses,aimingtoensurethatthelast
termin(1)isclosetozero.Failingsuchknowledgeandcontrol,analternativeis
matching,whichisfrequentlyusedinstatistical,medical,andeconometricwork.For
eachsubject,amatchisfoundthatisascloseaspossibleonallsuspectedcauses,so
that,onceagain,thelasttermin(1)canbekeptsmall.Whenwehaveagoodideaof
thecauses,matchingmayalsodeliverapreciseestimate.Ofcourse,whenthereare
unknownorunobservablecausesthathaveimportanteffects,neitherlaboratory
controlnormatchingoffersprotection.
Whatdoesrandomizationdo?Sincethetreatmentsandcontrolscomefrom
thesameunderlyingdistribution,randomizationguarantees,byconstruction,that
thelasttermontherightin(1)iszeroinexpectation,subjecttothecaveatthatno
correlationsofthex’swithYareintroducedpost-randomization,forexampleby
subjectsnotacceptingtheirassignment.Theexpectationhereistakenoverre-
peatedrandomizationsonthetrialsample,eachwithitsownallocationoftreat-
mentsandcontrols.Assumingthatourcaveatholds,thelasttermin(2)willbezero
whenaveragedoverthisinfinitenumberof(entirelyhypothetical)replications,and
9
theaverageoftheestimatedATEswillbethetrueATEinthetrialsample.So𝛽"de-
liversanunbiasedestimateoftheATEamongthetreatedinthetrialsample,andit
doessowhetherornotthecausesareobserved.Unbiasednessdoesnotrequireus
toknowanythingabouttheothercausesthoughitdoesrequirethattheynotchange
afterrandomizationsoastomakethemcorrelatedwiththetreatment,whichisan
importantcaveattowhichweshallreturn.IftheRCTisrepeatedmanytimesonthe
sametrialsample,then,assumingourcaveatholdsinthetrials,thelasttermin(2)
willbezerowhenaveragedoveraninfinitenumberof(entirelyhypothetical)trials,
andtheaverageoftheestimatedATEswillbethetrueATEinthetrialsample.Of
course,noneofthisistrueinanyonetrialwherethedifferenceinmeanswillbe
equaltotheaveragetreatmenteffectamongthosetreatedplusthetermthatreflects
theimbalanceintheneteffectsoftheothercauses.Wedonotknowthesizeofthis
errorterm,andthereisnothinginrandomizationthatlimitsitssize;bychancethe
randomizationinoursingletrialcanover-representanimportantexcludedcause(s)
inonearmovertheother,inwhichcasetherewillbeadifferencebetweenthe
meansofthetwogroupsthatisnotcausedbythetreatment.
Theunbiasednessresultcaneasilybecompromised.Inparticular,thetreat-
mentmustnotbecorrelatedwithanyothercause.Randomassignmentisdesigned
toaidwiththis,butitisnotsufficientif,forexample,thereislackofblindingsothat
individualsareawareoftheirassignment,orifthoseadministeringthetreatment
aresoaware,andifthatawarenesstriggersanothercause.Similarly,researchers
sometimesreturntoindividualswhowererandomizedyearsbefore,sothatthere
hasbeentimeforthesubjectsorotherstolearntheirassignmentorforothercauses
tobeinfluencedbytheassignment.Thisagainopensupthepossibilityofunbal-
ancedeffectsofcausesotherthanthetreatmentweareinterestedin.Wehaveal-
readynotedthatunbiasednessreferstothetrialsample,whichmayormaynotbe
representativeofthepopulationofinterest.
Ifweweretorepeatthetrialmanytimes,theover-representationoftheun-
balancedcauseswillsometimesbeinthetreatmentsandsometimesinthecontrols.
Theimbalancewillvaryoverreplicationsofthetrial,andalthoughwecannotsee
thisfromoursingletrial,weshouldbeabletocaptureitseffectsonourestimateof
10
theATEfromanestimatedstandarderror.ThiswasFisher’sinnovation:notthat
randomizationbalancedothercausesbetweentreatmentsandcontrolsbutthat,
conditionalonourcaveatabove,randomizationprovidesthebasisforcalculating
thesizeoftheerror.Gettingthestandarderrorandassociatedsignificancestate-
mentsrightareofthegreatestimportance.Giventheabsenceoftreatment-related
post-randomizationchangesinothercauses,randomizationyieldsanunbiasedesti-
mateoftheATEinthetrialsampleaswellasasoundmethodformeasuringerrorof
estimationinthatsample;thereinliesitsvirtue,notthatityieldspreciseestimates
throughbalance.
1.2Misunderstandings:claimingtoomuch
Everythingsofarshouldbeperfectlyfamiliar,butexactlywhatrandomizationdoes
isfrequentlylostinthepracticalliterature.Thereisoftenconfusionbetweenperfect
control,ontheonehand(asinalaboratoryexperimentorperfectmatchingwithno
unobservablecauses),andcontrolinexpectationontheother,whichiswhatran-
domizationcontributes.Ifweknewenoughabouttheproblemtobeabletocontrol
well,thatiswhatwewoulddo.Randomizationisanalternativewhenwedonot
knowenough,butisgenerallyinferiortogoodcontrol.Wesuspectthatatleastsome
ofthepopularandprofessionalenthusiasmforRCTs,aswellasthebeliefthatthey
areprecisebyconstruction,comesfrommisunderstandingsaboutbalance.These
misunderstandingsarenotsomuchamongthetrialistswhowilloftengiveacorrect
accountwhenpressed.Theycomefromimprecisestatementsbytrialiststhatare
takenliterallybythelayaudiencethatthetrialistsarekeentoreach.
Suchamisunderstandingiswellcapturedbyaquotefromthesecondedition
oftheonlinemanualonimpactevaluationjointlyissuedbytheInter-AmericanDe-
velopmentBankandtheWorldBank(thefirst,2011editionissimilar):
“Wecanbeconfidentthatourestimatedimpactconstitutesthetrueimpact
oftheprogram,sincewehaveeliminatedallobservedandunobservedfac-
torsthatmightotherwiseplausiblyexplainthedifferenceinoutcomes.”Ger-
tler,Martinez,Premand,Rawlings,andVermeersch(2016,69).
11
Thisstatementisfalse,becauseitconfusesactualbalanceinanysingletrialwith
balanceinexpectationovermany(hypothetical)trials.Ifitweretrue,andifallfac-
torswereindeedcontrolled(andnoimbalanceswereintroducedpostrandomiza-
tion),thedifferencewouldbeanexactmeasureoftheaveragetreatmenteffect
amongthetreatedinthetrialpopulation(atleastintheabsenceofmeasurementer-
ror).Weshouldnotonlybeconfidentofourestimatebut,asthequotesays,we
wouldknowthatitisthetruth.Notethatthestatementcontainsnoreferenceto
samplesize;wegetthetruthbyvirtueofbalance,notfromalargenumberofobser-
vations.
AsimilarquotecomesfromJohnList,oneofthemostimaginativeandsuc-
cessfulscholarswhouseRCTs:
“complicationsthataredifficulttounderstandandcontrolrepresentkeyrea-
sonstoconductexperiments,notapointofskepticism.Thisisbecauseran-
domizationactsasaninstrumentalvariable,balancingunobservablesacross
controlandtreatmentgroups.”Al-UbaydliandList(2013)(italicsintheorig-
inal.)
AndfromDeanKarlan,founderandPresidentofYale’sInnovationsforPovertyAc-
tion,whichrunsdevelopmentRCTsaroundtheworld:
“Asinmedicaltrials,weisolatetheimpactofaninterventionbyrandomly
assigningsubjectstotreatmentsandcontrolgroups.Thismakesitsothatall
thoseotherfactorswhichcouldinfluencetheoutcomearepresentintreat-
mentandcontrol,andthusanydifferenceinoutcomecanbeconfidentlyat-
tributedtotheintervention.”Karlan,GoldbergandCopestake(2009)
Andfromthemedicalliterature,fromadistinguishedpsychiatristwhoisdeeply
skepticaloftheuseofevidencefromRCTs,
“Thebeautyofarandomizedtrialisthattheresearcherdoesnotneedtoun-
derstandallthefactorsthatinfluenceoutcomes.Saythatanundiscoveredge-
neticvariationmakescertainpeopleunresponsivetomedication.Theran-
domizingprocesswillensure—ormakeithighlyprobable—thatthearmsof
thetrialcontainequalnumbersofsubjectswiththatvariation.Theresultwill
beafairtest.”(Kramer,2016,p.18)
12
ClaimsareevenmadethatRCTsrevealknowledgewithoutpossibilityoferror.Judy
Gueron,thelong-timepresidentofMDRC,whichhasbeenrunningRCTsonUSgov-
ernmentpolicyfor45years,askswhyfederalandstateofficialswerepreparedto
supportrandomizationinspiteoffrequentdifficultiesandinspiteoftheavailability
ofothermethodsandconcludesthatitwasbecause“theywantedtolearnthetruth,”
GueronandRolston(2013,429).Therearemanystatementsoftheform“Weknow
that[projectX]workedbecauseitwasevaluatedwitharandomizedtrial,”Dynarski
(2015).
ItiscommontotreattheATEfromanRCTasifitwerethetruth,notjustin
thetrialsamplebutmoregenerally.Ineconomics,afamousexampleisLalonde’s
(1986)studyoftrainingprograms,whoseresultswereatoddswithanumberof
previousnon-randomizedstudies.Thepaperpromptedalarge-scalere-examination
oftheobservationalstudiestotrytobringthemintoline,thoughitnowseemsjust
aslikelythatthedifferenceslieinthefactthatthestudyresultsapplytodifferent
populations(Heckman,Lalonde,andSmith(1999)).Inepidemiology,Davey-Smith
andIbrahim(2002)statethat“observationalstudiespropose,RCTsdispose”.A
goodexampleistheRCTofhormonereplacementtherapy(HRT)forpost-menopau-
salwomen.HRThadpreviouslybeensupportedbypositiveresultsfromahigh-
qualityandlong-runningobservationalstudy,buttheRCTwasstoppedinthefaceof
excessdeathsinthetreatmentgroup.ThenegativeresultoftheRCTledtowide-
spreadabandonmentofthetherapy,whichmighthavebeenamistake(seeVanden-
broucke(2009)andFrieden(2017)).Yetthemedicalandpopularliteraturerou-
tinelystatesthattheRCTwasrightandtheearlierstudywrong,simplybecausethe
earlierstudywasnotrandomized.Thegoldstandardor“truth”viewdoesharm
whenitunderminestheobligationofsciencetoreconcileRCTsresultswithother
evidenceinaprocessofcumulativeunderstanding.
Thefalsebeliefinautomaticprecisionsuggeststhatweneedpaynoatten-
tiontotheothercausesin(1)or(2).Indeed,GerberandGreen(2012),intheir
standardtextforRCTsinpoliticalscience,writethatrunninganRCTis“aresearch
strategythatdoesnotrequire,letalonemeasure,allpotentialconfounders.”Thisis
trueifwearehappywithestimatesthatarearbitrarilyfarfromthetruth,justso
13
longastheerrorscanceloutoveraseriesofimaginaryexperiments.Inreality,the
causalitythatisbeingattributedtothetreatmentmight,infact,becomingfroman
imbalanceinsomeothercauseinourparticulartrial;limitingthisrequiresserious
thoughtaboutpossibleconfounders.
1.3Samplesize,balance,andprecision
Atthetimeofrandomizationandintheabsenceofpost-randomizationchangesin
othercauses,atrialismorelikelytobebalancedwhenthesamplesizeislarge.As
thesamplesizetendstoinfinity,themeansofthex’sinthetreatmentandcontrol
groupswillbecomearbitrarilyclose.Yetthisisoflittlehelpinfinitesamples.As
Fisher(1926)noted:“Mostexperimentersoncarryingoutarandomassignment
willbeshockedtofindhowfarfromequallytheplotsdistributethemselves,”quoted
inMorganandRubin(2012).Evenwithverylargesamplesizes,ifthereisalarge
numberofcauses,balanceoneachcausemaybeinfeasible.Evenwithjustthree
causeswiththreevalueseach,thereare27cellstobalance,andinmostsocialand
medicalcasestherewillbemore.Vandenbroucke(2004)notesthattherearethree
billionbasepairsinthehumangenome,manyorallofwhichcouldberelevantprog-
nosticfactorsforthebiologicaloutcomethatweareseekingtoinfluence.Itistrue,
as(2)makesclear,thatwedonotneedbalanceoneachcauseindividually,onlyon
theirneteffect,theterm𝑆" − 𝑆&.Butconsiderthehumangenomebasepairs.Outof
allofthosebillions,onlyonemightbeimportant,andifthatoneisunbalanced,the
resultsofasingletrialcanbefarfromthetruth.Statementsaboutlargesamples
guaranteeingbalancearenotusefulwithoutguidelinesabouthowlargeislarge
enough,andsuchstatementscannotbemadewithoutknowledgeofothercauses
andhowtheyaffectoutcomes.Ofcourse,lackofbalanceintheneteffectofeither
observablesornon-observablesin(2)doesnotcompromisetheinferenceinanRCT
inthesenseofobtainingastandarderrorfortheunbiasedATE(seeSenn(2013)for
aparticularlyclearstatement).
HavingrunanRCT,itmakesgoodsensetoexamineanyavailablecovariates
forbalancebetweenthetreatmentsandcontrols;ifwesuspectthatanobserved
variablexisapossiblecause,anditsmeansinthetwogroupsareverydifferent,we
14
shouldtreatourresultswithappropriatesuspicion.Inpractice,trialistsineconom-
ics(andinsomeotherdisciplines)usuallycarryoutastatisticaltestforbalanceaf-
terrandomizationbutbeforeanalysis,presumablywiththeaimoftakingsomeap-
propriateactionifbalancefails.Thefirsttableofthepapertypicallypresentsthe
samplemeansofobservablecovariates—theobservablex’sin(1)orinteractivefac-
torsrepresentedinβ—forthecontrolandtreatmentgroups,togetherwiththeirdif-
ferences,andtestsforwhetherornottheyaresignificantlydifferentfromzero,ei-
thervariablebyvariable,orjointly.Thesetestsareappropriateforunbiasednessif
weareconcernedthattherandomnumbergeneratormighthavefailed,orifweare
worriedthattherandomizationisunderminedbynon-blindedsubjectswhosys-
tematicallyunderminetheallocation.Otherwise,unbiasednessisguaranteedbythe
randomization,whateverthetestsshow,andasthenextparagraphdemonstrates,
thetestisnotinformativeaboutthebalancethatwouldleadtoprecision.
Ifwewrite𝜇&and𝜇"forthe(vectorsof)truemeansinthetrialsample(i.e.
themeansoverallpossiblerandomizations)oftheobservedcausesofYinthecon-
trolandtreatmentgroupsatthepointofassignment,thenullhypothesisis(pre-
sumably,asjudgedbythetypicalbalancetest)thatthetwovectorsareidentical,
withthealternativebeingthattheyarenot.Butiftherandomizationhasbeencor-
rectlydonethenullhypothesisistruebyconstruction(seee.g.Altman(1985)and
Senn(1994)),whichmayhelpexplainwhyitsorarelyfailsinpractice.AsBegg
(1990)notes,“(I)tisatestofanullhypothesisthatisknowntobetrue.Therefore,if
thetestturnsouttobesignificantitis,bydefinition,aafalsepositive.”Thisis,of
course,consistentwithFisher’scommentsabouttheplotsinthefield,whichnotes
thattwosamplesofplotsrandomlydrawnfromthesamefieldcanlookveryunbal-
anced.Indeed,althoughwecannot“test”itinthisway,weknowthatthenullhy-
pothesisisalsotruefortheunobservablecauses.Notethecontrastwiththestate-
mentquotedaboveclaimingthatRCTsguaranteebalanceoncausesacrosstreat-
mentandcontrolgroups.Thosestatementsrefertobalanceofcausesatthepointof
assignmentinanysingletrial,whichisnotguaranteedbyrandomization,whereas
thebalancetestsareaboutthebalanceofcausesatthepointofassignmentinexpec-
15
tationovermanytrials,whichisguaranteedbyrandomization.Theconfusionisper-
hapsunderstandable,butitisaconfusionnevertheless.Ofcourse,itisalwaysgood
practicetolookforimbalancesbetweenobservedcovariatesinanysingletrialusing
somemoreappropriatedistancemeasure,forexamplethenormalizeddifferencein
means(ImbensandWooldridge(2009,equation3)).Similarly,itwouldhavebeen
goodpracticeforFishertoabandonarandomizationinwhichtherewereclearpat-
ternsinthe(random)distributionofplotsacrossthefield,eventhoughthetreat-
mentandcontrolplotswererandomlyselectionsthat,byconstruction,couldnot
differ“significantly”usingthestandard(incorrect)balancetest.Whethersuchim-
balancesshouldbeseenasunderminingtheestimateoftheATEdependsonour
priorsaboutwhichcovariatesarelikelytobeimportant,andhowimportant,which
is(notcoincidentally)thesamethoughtexperimentthatisroutinelyundertakenin
observationalstudieswhenweworryaboutconfounding.
Oneproceduretoimprovebalanceistoadaptthedesignbeforerandomiza-
tion,forexample,bystratification.Fisher,whoasthequoteaboveillustrates,was
wellawareofthelossofprecisionfromrandomizationarguedfor“blocking”(strati-
fication)inagriculturaltrialsorforusingLatinSquares,bothofwhichrestrictthe
amountofimbalance.Stratification,tobeuseful,requiressomepriorunderstanding
ofthefactorsthatarelikelytobeimportant,andsoittakesusawayfromthe“no
knowledgerequired”or“nopriorsaccepted”appealofRCTs;itrequiresthinking
aboutandmeasuringcovariates.ButasScriven(1974,103)notes:“(C)ausehunting,
likelionhunting,isonlylikelytobesuccessfulifwehaveaconsiderableamountof
relevantbackgroundknowledge”.Cartwright(1994,Chapter2)putsitevenmore
strongly,“nocausesin,nocausesout”.StratificationinRCTs,asinotherformsof
sampling,isastandardmethodforusingbackgroundknowledgetoincreasethe
precisionofanestimator.Ithasthefurtheradvantagethatitallowsfortheexplora-
tionofdifferentATEsindifferentstratawhichcanbeusefulinadaptingortrans-
portingtheresultstootherlocations(seeSection2).
Stratificationisnotpossibleiftherearetoomanycovariates,orifeachhas
manyvalues,sothattherearemorecellsthancanbefilledgiventhesamplesize.
Withfivecovariates,andtenvaluesoneach,andnopriorstolimitthestructure,we
16
wouldhave100,000possiblestrata.Fillingtheseiswellbeyondthesamplesizesin
mosttrials.Analternativethatworksmoregenerallyistore-randomize.Iftheran-
domizationgivesanobviousimbalanceonknowncovariates—treatmentplotsall
ononesideofthefield,allofthetreatmentclinicsinoneregion,toomanyrichand
toofewpoorinthecontrolgroup—wetryagain,andkeeptryinguntilwegetabal-
ancemeasuredasasmallenoughdistancebetweenthemeansoftheobservedco-
variatesinthetwogroups.MorganandRubin(2012)suggesttheMahalanobisD–
statisticbeusedasacriterionanduseFisher’srandomizationinference(tobedis-
cussedfurtherbelow)tocalculatestandarderrorsthattakethere-randomization
intoaccount.Analternative,widelyadaptedinpractice,istoadjustforcovariatesby
runningaregression(orcovariance)analysis,withtheoutcomeontheleft-hand
sideandthetreatmentdummyandthecovariatesasexplanatoryvariables,includ-
ingpossibleinteractionsbetweencovariatesandtreatmentdummies.Freedman
(2008)showsthattheadjustedestimateoftheATEisbiasedinfinitesamples,with
thebiasdependingonthecorrelationbetweenthesquaredtreatmenteffectandthe
covariates.Acceptingsomebiasinexchangeforgreaterprecisionwilloftenmake
sense,thoughitcertainlyunderminesanygoldstandardargumentthatreliesonun-
biasednesswithoutconsiderationofprecision.
1.4Shouldwerandomize?
ThetensionbetweenrandomizationandprecisionthatgoesbacktoFisher,Gosset,
andSavagehasbeenreopenedinrecentpapersbyKasy(2016),Banerjee,Chassang,
andSnowberg(BCS)(2016)andBanerjee,Chassang,Montero,andSnowberg
(BCMS)(2016).
Thetrade-offbetweenbiasandprecisioncanbeformalizedinseveralways,
forexamplebyspecifyingalossorutilityfunctionthatdependsonhowauserisaf-
fectedbydeviationsoftheestimateoftheATEfromthetruthandthenchoosingan
estimatororanexperimentaldesignthatminimizesexpectedlossormaximizesex-
pectedutility.AsSavage(1962)noted,foraBayesian,thisinvolvesallocatingtreat-
mentsandcontrolsin“thespecificlayoutthatpromisedtotellhimthemost,”but
withoutrandomization.Ofcourse,thisrequiresseriousandperhapsdifficultthought
aboutthemechanismsunderlyingtheATE,whichrandomizationavoids.Savagealso
17
notesthatseveralpeoplewithdifferentpriorsmaybeinvolvedinaninvestigation
andthatindividualpriorsmaybeunreliablebecauseof“vaguenessandtemptation
toself-deception,”defectsthatrandomizationmayalleviate,oratleastevade.BCMS
(2016)provideaproofofaBayesianno-randomizationtheorem,andBCS(2016)
provideanillustrationofaschooladministratorwhohaslongbelievedthatschool
outcomesaredetermined,notbyschoolquality,butbyparentalbackground,and
whocanlearnthemostbyplacingdeprivedchildrenin(supposed)high-quality
schoolsandprivilegedchildrenin(supposed)low-qualityschools,whichisthekind
ofstudysettingthatcasestudymethodologyiswellattunedto.AsBCSnote,thisal-
locationwouldnotpersuadethosewithdifferentpriors,andtheyproposerandomi-
zationasameansofsatisfyingskepticalobservers.
Severalpointsareimportant.First,theanti-randomizationtheoremisnota
justificationofanynon-randomizeddesign,forexample,onethatallowsselection
onunobservables,butonlytheoptimaldesignthatismostinformative.Accordingto
Chalmers(2001)andBothwellandPodolsky(2016),thedevelopmentofrandomi-
zationinmedicineoriginatedwithBradford-Hill,whousedrandomizationinthe
firstRCTinmedicine—thestreptomycintrial—becauseitpreventeddoctorsselect-
ingpatientsonthebasisofperceivedneed(oragainstperceivedneed,leaningover
backwardasitwere),anargumentrecentlyechoedbyWorrall(2007).Randomiza-
tionservesthispurpose,butsodoothernon-discretionaryschemes;whatisre-
quiredisthathiddeninformationshouldnotbeallowedtoaffecttheallocation.
Second,theidealrulesbywhichunitsareallocatedtotreatmentorcontrol
dependonthecovariatesandontheinvestigators’priorsabouthowthecovariates
affecttheoutcomes.Thisopensupallsortsofmethodsofinferencethatarelongfa-
miliartoeconomistsbutthatareexcludedbypurerandomization.Forexample,
whatphilosopherscallthehypothetico-deductivemethodworksbyusingtheoryto
makeapredictionthatcanbetakentothedataforpotentialfalsification(asinthe
schoolexampleabove).Thisisthewaythatphysicistslearn,asdoeconomistswhen
theyusetheorytoderivepredictionsthatcanbetestedagainstthedata,perhapsin
anRCT,butmorefrequentlynot.Someofthemostfruitfulresearchprogramsin
economicshavebeengeneratedbythepuzzlesthatresultwhenthedatafailto
18
matchsuchtheoreticalpredictions,suchastheequitypremiumpuzzle,variouspur-
chasingpowerparitypuzzles,theFeldstein-Horiokapuzzle,theconsumption
smoothnesspuzzle,thepuzzleofcaloriedeclineinthefaceofmalnourishmentand
incomegrowth,andmanyothers.
Third,randomization,byrunningroughshodoverpriorinformationfrom
theoryandfromcovariates,iswastefulandevenunethicalwhenitunnecessarilyex-
posespeople,orunnecessarilymanypeople,topossibleharminariskyexperiment.
Worrall(2008)documentsthe(extreme)caseofECMO,anewtreatmentfornew-
bornswithpersistentpulmonaryhypertensionthatwasdevelopedinthe1970sby
intelligentanddirectedtrialanderrorwithinawell-understoodtheoryofthedis-
ease.Inearlyexperimentationbytheinventors,mortalitywasreducedfrom80to
20percent.TheinvestigatorsfeltcompelledtoconductanRCT,albeitwithanadap-
tive‘play-the-winner’designinwhicheachsuccessinanarmincreasedtheproba-
bilityofthenextbabybeingassignedtothatarm.Onebabyreceivedconventional
therapyanddied,11receivedECMOandlived.Evenso,astandardrandomizedcon-
trolledtrialwasthoughtnecessary.Withastoppingruleoffourdeaths,fourmore
babies(outoften)diedinthecontrolgroupandnoneoftheninewhoreceived
ECMO.
Fourth,thenon-randommethodsusepriorinformation,whichiswhythey
dobetterthanrandomization.Thisisbothanadvantageandadisadvantage,de-
pendingonone’sperspective.Ifpriorinformationisnotwidelyaccepted,orisseen
asnon-crediblebythoseweareseekingtopersuade,wewillgeneratemorecredible
estimatesifwedonotusethosepriors.Indeed,thisiswhyBCS(2016)recommend
randomizeddesigns,includinginmedicineandindevelopmenteconomics.Theyde-
velopatheoryofaninvestigatorwhoisfacinganadversarialaudiencewhowill
challengeanypriorinformationandcanevenpotentiallyvetoresultsbasedonit
(thinkofadministrativeagenciessuchastheFDAorjournalreferees).Theexperi-
mentertradesoffhisorherowndesireforprecision(andpreventingpossibleharm
tosubjects),whichwouldrequirepriorinformation,againstthewishesoftheaudi-
ence,whowantsnothingtodowiththosepriors.Eventhen,theapprovaloftheau-
dienceisonlyexante;oncethefullyrandomizedexperimenthasbeendone,nothing
19
stopscriticsarguingthat,infact,therandomizationdidnotofferafairtestbecause
importantothercauseswerenotbalanced.AmongdoctorswhouseRCTs,andespe-
ciallymeta-analysis,suchargumentsare(appropriately)common(seeKramer
(2016)).
Today,whenthepublichascometoquestionexpertpriorknowledge,RCTs
willflourish.Incaseswherethereisgoodreasontodoubtthegoodfaithofexperi-
menters,asinmanypharmaceuticaltrials,randomizationwillindeedbeanappro-
priateresponse.Butwebelievesuchargumentsaredestructiveforscientificen-
deavor(whichisnotthepurposeoftheFDA)andshouldberesistedasageneral
prescriptioninscientificresearch.Previousknowledgeneedstobebuiltonandin-
corporatedintonewknowledge,notdiscardedinthefaceofaggressiveignorance.
Thesystematicrefusaltousepriorknowledgeandtheassociatedpreferencefor
RCTsarerecipesforpreventingcumulativescientificprogress.Intheend,itisalso
self-defeating.ToquoteRodrik(2016)“thepromiseofRCTsastheory-freelearning
machinesisafalseone.”
1.5StatisticalinferenceinRCTs
TheestimatedATEinasimpleRCTisthedifferenceinthemeansbetweenthetreat-
mentandcontrolgroups.Whencovariatesareallowedfor,asinmostRCTsineco-
nomics,theATEisusuallyestimatedfromthecoefficientonthetreatmentdummy
inaregressionthatlookslike(1),butwiththeheterogeneityin𝛽ignored.Modern
workcalculatesstandarderrorsallowingforthepossibilitythatresidualvariances
maybedifferentinthetreatmentandcontrolgroups,usuallybyclusteringthe
standarderrors,whichisequivalenttothefamiliartwosamplestandarderrorin
thecasewithnocovariates.Statisticalinferenceisdonewitht-valuesintheusual
way.Theseproceduresdonotalwaysgivetherightanswer.
Lookingbackat(1),theunderlyingobjectsofinterestaretheindividual
treatmenteffects𝛽( foreachoftheindividualsinthetrialsample.Neitherthey,nor
theirdistribution𝐺(𝛽)isidentifiedfromanRCT;becauseRCTsmakesofewas-
sumptionswhich,inmanycases,istheirstrength,theycanidentifyonlythemeanof
thedistribution.Inmanyobservationalstudies,researchersarepreparedtomake
moreassumptionsonfunctionalformsorondistributions,andforthatpriceweare
20
abletoidentifyotherquantitiesofinterest.Withouttheseassumptions,inferences
mustbebasedonthedifferenceinthetwomeans,astatisticthatissometimesill-
behaved,asweshalldiscussbelow.Thisill-behaviorhasnothingtodowithRCTs,
perse,butwithinRCTs,andtheirminimalassumptions,wecannoteasilyswitch
fromthemeantosomeotherquantityofinterest.
Fisherproposedthatstatisticalinferenceshouldbedoneusingwhathasbe-
comeknownas“randomization”inference,aprocedurethatisasnon-parametricas
theRCT-basedestimateofanATEitself.Totestthenullhypothesisthat𝛽( = 0for
alli,notethat,underthenullthatthetreatmenthasnoeffectonanyindividual,an
estimatednonzeroATEmustbeaconsequenceoftheparticularrandomallocation
thatgeneratedit.Bytabulatingallpossiblecombinationsoftreatmentsandcontrols
inourtrialsample,andtheATEassociatedwitheach,wecancalculatetheexactdis-
tributionoftheestimatedATEunderthenull.Thisallowsustocalculatetheproba-
bilityofcalculatinganestimateaslargeasouractualestimatewhentherearenoef-
fectsoftreatment.Thisrandomizationtestrequiresafinitesample,butitwillwork
foranysamplesize(seeImbensandWooldridge(2009)foranexcellentaccountof
theprocedure).Imbens(2010)arguesthatitisthisrandomizationinferenceplus
theunbiasednessoftheATEthatprovidesthetwinnon-parametricpillarsthatsup-
portplacingRCTsatthe“verytop”ofthehierarchyofevidence.
Randomizationinferencecanbeusedfornullhypothesesthatspecifythatall
ofthetreatmenteffectsarezero,asintheaboveexample,butitcannotbeusedto
testthehypothesisthattheaveragetreatmenteffectiszero,whichwilloftenbeof
interest.Inagriculturaltrials,andinmedicine,thestronger(sharp)hypothesisthat
thetreatmenthasnoeffectwhateverisoftenofinterest.Inmanyeconomicapplica-
tionsthatinvolvemoney,suchaswelfareexperimentsorcost-benefitanalyses,we
areinterestedinwhethertheneteffectofthetreatmentispositiveornegative,and
inthesecases,randomizationinferencecannotbeused.Noneofwhichargues
againstitswideruseinsocialscienceswhenappropriate.
Incaseswhererandomizationinferencecannotbeused,wemustconstruct
testsforthedifferencesintwomeans.Standardprocedureswilloftenworkwell,but
therearetwopotentialpitfalls.One,the‘Fisher-Behrensproblem’,comesfromthe
21
factthat,whenthetwosampleshavedifferentvariances—whichwetypicallywant
topermit—theusualt-statisticdoesnothavethet-distribution.Thesecondprob-
lem,whichismuchhardertoaddress,occurswhenthedistributionoftreatmentef-
fectsisnotsymmetric(BahadurandSavage(1956)).Neitherpitfallisspecificto
RCTs,butRCTsforceustoworkwithmeansinestimatingtreatmenteffectsand,
withonlyaveryfewexceptionsintheliterature,socialscientistswhouseRCTsap-
peartobeunawareofthedifficulties.
InthesimplecaseofcomparingtwomeansinanRCTwithoutcovariates,in-
ferenceisusuallybasedonthetwo–samplet–statisticwhichiscomputedbydivid-
ingtheATEbytheestimatedstandarderrorwhosesquareisgivenby
𝜎4 =𝑛" − 1 6" 𝑌( − 𝑌"(7"
4
𝑛"+
𝑛& − 1 6" 𝑌( − 𝑌&(7&4
𝑛&3
where0referstocontrolsand1totreatments,sothatthereare𝑛"treatmentsand
𝑛&controls,and𝑌"and𝑌&arethetwomeans.Ashaslongbeenknown,the“t-statis-
tic”basedon(3)isnotdistributedasStudent’stifthetwovariances(treatmentand
control)arenotidenticalbuthastheBehrens–Fisherdistribution.Inextremecases,
whenoneofthevariancesiszero,thet–statistichaseffectivedegreesoffreedom
halfofthatofthenominaldegreesoffreedom,sothatthetest-statistichasthicker
tailsthanallowedfor,andtherewillbetoomanyrejectionswhenthenullistrue.
Young(2016)arguesthatthisproblemisworsewhenthetrialresultsarean-
alyzedbyregressingoutcomesnotonlyonthetreatmentdummybutalsoonaddi-
tionalcontrolsandwhenusingclusteredorrobuststandarderrors.Whenthede-
signmatrixissuchthatthemaximalinfluenceislarge,sothatforsomeobservations
outcomeshavelargeinfluenceontheirownpredictedvalues,thereisareductionin
theeffectivedegreesoffreedomforthet–value(s)oftheaveragetreatmenteffect(s)
leadingtospuriousfindingsofsignificance.Younglooksat2,003regressionsre-
portedin53RCTpapersintheAmericanEconomicAssociationjournalsandrecalcu-
latesthesignificanceoftheestimatesusingrandomizationinferenceappliedtothe
authors’originaldata.In30to40percentoftheestimatedtreatmenteffectsinindi-
vidualequationswithcoefficientsthatarereportedassignificant,hecannotreject
thenullofnoeffectforanyobservation;thefractionofspuriouslysignificantresults
22
increasesfurtherwhenhesimultaneouslytestsforallresultsineachpaper.These
spuriousfindingscomeinpartfromissuesofmultiple-hypothesistesting,both
withinregressionswithseveraltreatmentsandacrossregressions.Withinregres-
sions,treatmentsarelargelyorthogonal,butauthorstendtoemphasizesignificant
t–valuesevenwhenthecorrespondingF-testsareinsignificant.Acrossequations,
resultsareoftenstronglycorrelated,sothat,atworst,differentregressionsarere-
portingvariantsofthesameresult,thusspuriouslyaddingtothe“killcount”ofsig-
nificanteffects.Atthesametime,thepervasivenessofobservationswithhighinflu-
encegeneratesspurioussignificanceonitsown.
Theseissuesarenowbeingtakenmoreseriously.InadditiontoYoung
(2016),ImbensandKolesár(2016)providepracticaladvicefordealingwiththe
Fisher-Behrensproblem,andthebestcurrentpracticetriestobecarefulaboutmul-
tiplehypothesistesting.Yetitremainsthecasethatmanyoftheresultsinthelitera-
turearespuriouslysignificant.
Spurioussignificancealsoariseswhenthedistributionoftreatmenteffects
containsoutliersor,moregenerally,isnotsymmetric.Standardt–testsbreakdown
indistributionswithenoughskewness(seeLehmannandRomano(2005,p.466–
8)).Howdifficultisittomaintainsymmetry?Andhowbadlyisinferenceaffected
whenthedistributionoftreatmenteffectsisnotsymmetric?Ineconomics,manytri-
alshaveoutcomesvaluedinmoney.Doesananti-povertyinnovation—forexample
microfinance—increasetheincomesoftheparticipants?Incomeitselfisnotsym-
metricallydistributed,andthismightbetrueofthetreatmenteffectstooifthereare
afewpeoplewhoaretalentedbutcredit-constrainedentrepreneursandwhohave
treatmenteffectsthatarelargeandpositive,whilethevastmajorityofborrowers
fritterawaytheirloans,oratbestmakepositivebutmodestprofits.Arecentsum-
maryoftheliteratureisconsistentwiththis(seeBanerjee,Karlan,andZinman
(2015)).Anotherimportantexampleisexpendituresonhealthcare.Mostpeople
havezeroexpenditureinanygivenperiod,butamongthosewhodoincurexpendi-
tures,afewindividualsspendhugeamountsthataccountforalargeshareoftheto-
tal.Indeed,inthefamousRandhealthexperiment(seeManning,Newhouseetal.
23
(1987,1988)),thereisasingleverylargeoutlier.Theauthorsrealizethatthecom-
parisonofmeansacrosstreatmentarmsisfragile,and,althoughtheydonotsee
theirproblemexactlyasdescribedhere,theyobtaintheirpreferredestimatesusing
astructuralapproachthatisexplicitlydesignedtomodeltheskewnessofexpendi-
tures.
Insomecases,itwillbeappropriatetodealwithoutliersbytrimming,trans-
forming,oreliminatingobservationsthathavelargeeffectsontheestimates.Butif
theexperimentisaprojectevaluationdesignedtoestimatethenetbenefitsofapol-
icy,theeliminationofgenuineoutliers,asintheRandHealthExperiment,willviti-
atetheanalysis.Itispreciselytheoutliersthatmakeorbreaktheprogram.Trans-
formations,suchastakinglogarithms,mayhelptoproducesymmetry,butthey
changethenatureofthequestionbeingasked;acostbenefitanalysismustbedone
indollars,notlogdollars.
Weconsideranexamplethatillustrateswhatcanhappeninarealisticbut
simplifiedcase;thefullresultsarereportedintheAppendix.Weimagineapopula-
tionofindividuals,eachwithatreatmenteffect𝛽( .Theparentpopulationmeanof
thetreatmenteffectsiszero,butthereisalongtailofpositivevalues;weusealeft-
shiftedlognormaldistribution.Wehaveamicrofinancetrialinmind,wherethereis
alongpositivetailofrareindividualswhocandoamazingthingswithcreditwhile
mostpeoplecannotuseiteffectively.Atrialsampleof2n individualsisrandomly
drawnfromtheparentpopulationandisrandomlysplitbetweenntreatmentsand
ncontrols.Withineachtrialsample,whosetrueATEwillgenerallydifferfromzero
becauseofthesampling,werunmanyRCTsandtabulatethevaluesoftheATEfor
each.
Usingstandardt-tests,the(trueintheparentdistribution)hypothesisthat
theATEiszeroisrejectedbetween14(𝑛 = 25)and6percent(𝑛 = 500)ofthetime.
Theserejectionscomefromtwoseparateissues,bothofwhicharerelevantinprac-
tice;(a)thattheATEintrialsamplediffersfromtheATEintheparentpopulationof
interest,and(b)thatthet-valuesarenotdistributedastinthepresenceofoutliers.
24
Theproblemcasesarewhenthetrialsamplehappenstocontainoneormoreoutli-
ers,somethingthatisalwaysariskgiventhelongpositivetailoftheparentdistribu-
tion.Whenthishappens,everythingdependsonwhethertheoutlierisamongthe
treatmentsorthecontrols;ineffect,theoutliersbecomethesample,reducingthe
effectivenumberofdegreesoffreedom.Inextremecases,oneofwhichisillustrated
inFigureA.1,thedistributionofestimatedATEsisbimodal,dependingonthegroup
towhichtheoutlierisassigned.Whentheoutlierisinthetreatmentgroup,thedis-
persionacrossoutcomesislarge,asistheestimatedstandarderror,andsothose
outcomesrarelyrejectthenullusingthestandardtableoft–values.Theover-rejec-
tionscomefromcaseswhentheoutlierisinthecontrolgroup,theoutcomesarenot
sodispersed,andthet–valuescanbelarge,negative,andsignificant.Whilethese
casesofbimodaldistributionsmaynotbecommonanddependontheexistenceof
largeoutliers,theyillustratetheprocessthatgeneratestheover-rejectionsandspu-
rioussignificance.Notethatthereisnoremedythroughrandomizationinference
here,giventhatourinterestisinthehypothesisthattheaveragetreatmenteffectis
zero.
OurreadingoftheliteratureonRCTsindevelopmenteconomicssuggests
thattheyarenotexemptfromtheseconcerns.Manydevelopmenttrialsarerunon
(sometimesvery)smallsamples,theyhavetreatmenteffectswhereasymmetryis
hardtoruleout—especiallywhentheoutcomesareinmoney—andtheyoftengive
resultsthatarepuzzling,oratleastnoteasilyinterpretedintermsofeconomicthe-
ory.NeitherBanerjeeandDuflo(2012)norKarlanandAppel(2011),whocitemany
RCTs,raiseconcernsaboutmisleadinginference,implicitlytreatingallresultsasre-
liable.Nodoubttherearebehaviorsintheworldthatareinconsistentwithstandard
economics,andsomecanbeexplainedbystandardbiasesinbehavioraleconomics,
butitwouldalsobegoodtobesuspiciousofthesignificancetestsbeforeaccepting
thatanunexpectedfindingiswell-supportedandthattheorymustberevised.Repli-
cationofresultsindifferentsettingsmaybehelpful,iftheyaretherightkindof
places(seeourdiscussioninSection2).Yetithardlysolvestheproblemgiventhat
theasymmetrymaybeinthesamedirectionindifferentsettings,thatitseemslikely
tobesoinjustthosesettingsthataresufficientlyliketheoriginaltrialsettingtobe
25
ofuseforinferenceaboutthepopulationofinterest,andthatthe“significant”t–val-
ueswillshowdeparturesfromthenullinthesamedirection.This,then,replicates
thespuriousfindings.
Asummary
Whatdotheargumentsofthissectionmeanabouttheimportanceofrandomization
andtheinterpretationthatshouldbegiventoanestimatedATEfromarandomized
trial?First,weshouldbesurethatanunbiasedestimateofanATEforthetrialpopu-
lationislikelytobeusefulenoughtowarrantthecostsofrunningthetrial.Second,
sincerandomizationdoesnotensureorthogonality,caremustbetaken(e.g.by
blinding)thattherearenosignificantpost-randomizationcorrelateswiththetreat-
ment.Thisisawell-knownlessonbutmanysocialandeconomictrialsarenot
blindedandinsufficientdefenseisofferedthatunbiasednessisnotundermined.In-
deed,lackofblindingisnottheonlysourceofpost-randomizationbias.Treatments
andcontrolsmaybehandledindifferentplaces,orbydifferentlytrainedpractition-
ers,oratdifferenttimesofday,andthesedifferencescanbringwiththemsystem-
aticdifferencesintheothercausestowhichthetwogroupsareexposed.Thesecan,
andshould,beguardedagainst.Butdoingsorequiresanunderstandingofwhat
thesecausallyrelevantfactorsmightbe.Third,theinferenceproblemsreviewed
herecannotjustbepresumedaway.Whenthereissubstantialheterogeneity,the
ATEinthetrialsamplecanbequitedifferentfromtheATEinthepopulationofin-
terest,evenifthetrialisrandomlyselectedfromthatpopulation;inpractice,there-
lationshipbetweenthetrialsampleandthepopulationisoftenobscure.
Beyondthat,inmanycases,thestatisticalinferencewillbefine,butserious
attentionshouldbegiventothepossibilitythatthereareoutliersintreatmentef-
fects,somethingthatknowledgeoftheproblemcansuggestandwhereinspectionof
themarginaldistributionsoftreatmentsandcontrolsmaybeinformative.Forexam-
ple,ifbotharesymmetric,itseemsunlikely(thoughcertainlynotimpossible)that
thetreatmenteffectsarehighlyskewed.MeasurestodealwithFisher-Behrens
shouldbeusedandrandomizationinferenceconsideredwhenappropriatetothe
hypothesisofinterest.
26
Allofthiscanberegardedasrecommendationsforimprovementtocurrent
practice,notachallengetoit.Morefundamentally,westronglycontesttheoften-ex-
pressedideathattheATEcalculatedfromanRCTisautomaticallyreliable,thatran-
domizationautomaticallycontrolsforunobservables,orworstofall,thatthecalcu-
latedATEistrue.If,bychance,itisclosetothetruth,thetruthwearereferringtois
thetruthinthetrialsampleonly.Tomakeanyinferencebeyondthatrequiresanar-
gumentofthekindweconsiderinthenextsection.Wehavealsoarguedthat,de-
pendingonwhatwearetryingtomeasureandwhatwewanttousethatmeasure
for,thereisnopresumptionthatanRCTisthebestmeansofestimatingit.Thattoo
requiresanargument,notapresumption.
Section2:Usingtheresultsofrandomizedcontrolledtrials
2.1Introduction
SupposewehaveestimatedanATEfromawell-conductedRCTonatrialsample,
andourstandarderrorgivesusreasontobelievethattheeffectdidnotcomeabout
bychance.Wethushavegoodwarrantthatthetreatmentcausestheeffectinour
trialsample,uptothelimitsofstatisticalinference.Whataresuchfindingsgoodfor?
Theliteratureineconomics,asindeedinmedicineandinsocialpolicy,has
paidmoreattentiontoobtainingresultsthantoconsideringwhatcanbedonewith
them.Thereislittletheoreticalorempiricalworktoguideushowandforwhatpur-
posestousethefindingsofRCTs,suchastheconditionsunderwhichthesamere-
sultsholdoutsideoftheoriginalsettings,howtheymightbeadaptedforuseelse-
where,orhowtheymightbeusedforformulating,testing,understanding,orprob-
inghypothesesbeyondtheimmediaterelationbetweenthetreatmentandtheout-
comeinvestigatedinthestudy.Yetitcannotbethatknowinghowtouseresultsis
lessimportantthanknowinghowtodemonstratethem.Anychainofevidenceisonly
asstrongasitweakestlink,sothatarigorouslyestablishedeffectwhoseapplicabil-
ityisjustifiedbyaloosedeclarationofsimilewarrantslittle.Iftrialsaretobeuseful,
weneedpathstotheirusethatareascarefullyconstructedasarethetrialsthem-
selves.
Theargumentforthe“primacyofinternalvalidity”madebyShadish,Cook,
andCampbell(2002)maybereasonableasawarningthatbadRCTsareunlikelyto
27
generalize,butitissometimesincorrectlytakentoimplythatresultsofaninternally
validtrialwillautomatically,oroften,apply‘asis’elsewhere,orthatthisshouldbe
thedefaultassumptionfailingargumentstothecontrary,asifaparameter,once
wellestablished,canbeexpectedtobeinvariantacrosssettings.Aninvarianceas-
sumptionisoftenmadeinmedicine,forexample,whereitissometimesplausible
thataparticularprocedureordrugworksthesamewayeverywhere(thoughsee
Horton(2000)forastrongdissentandRothwell(2005)forexamplesonbothsides
ofthequestion).Weshouldalsonotetherecentmovementtoensurethattestingof
drugsincludeswomenandminoritiesbecausemembersofthosegroupssuppose
thattheresultsoftrialsonmostlyhealthyyoungwhitemalesdonotapplytothem.
2.2Usingresults,transportability,andexternalvalidity
Supposeatrialhasestablishedaresultinaspecificsetting.If`thesame’resultholds
elsewhere,itissaidtohave`externalvalidity’.Externalvaliditymayreferjusttothe
transportabilityofthecausalconnection,orgofurtherandrequirereplicationofthe
magnitudeoftheATE.Eitherway,theresultholds—everywhere,orwidely,orin
somespecificelsewhere—oritdoesnot.
Thisbinaryconceptofexternalvalidityisoftenunhelpfulbecauseitasksthe
resultsofanRCTtosatisfyaconditionthatisneithernecessarynorsufficientfora
trialtobeuseful,andsobothoverstatesandunderstatestheirvalue.Itdirectsusto-
wardsimpleextrapolation—whetherthesameresultholdselsewhere—orsimple
generalization—itholdsuniversallyoratleastwidely—andawayfrommorecom-
plexbutmoreusefulapplicationsoftheresults.Thefailureofexternalvalidityinter-
pretedassimplegeneralizationorextrapolationsayslittleaboutthevalueofthe
trial.
First,thereareseveralusesofRCTsthatdonotrequiretransportabilitybe-
yondtheoriginalcontext;wediscusstheseinthenextsubsection.Second,thereare
oftengoodreasonstoexpectthattheresultsfromawell-conducted,informative,
andpotentiallyusefulRCTwillnotapplyelsewhereinanysimpleway.Withoutfur-
therunderstandingandanalysis,evensuccessfulreplicationtellsuslittleeitherfor
oragainstsimplegeneralizationortosupportfortheconclusionthatthenextwill
workinthesameway.Nordofailuresofreplicationmaketheoriginalresultuseless.
28
Weoftenlearnmuchfromcomingtounderstandwhyreplicationfailedandcanuse
thatknowledge,inlookingforhowthefactorsthatcausedtheoriginalresultmight
operatedifferentlyindifferentsettings.Third,andparticularlyimportantforscien-
tificprogress,theRCTresultcanbeincorporatedintoanetworkofevidenceandhy-
pothesesthattestorexploreclaimsthatlookverydifferentfromtheresultsre-
portedfromtheRCT.WeshallgiveexamplesbelowofextremelyusefulRCTsthat
arenotexternallyvalidinthe(usual)sensethattheirresultsdonotholdelsewhere,
whetherinaspecifictargetsettingorinthemoresweepingsenseofholdingevery-
where.
BertrandRussell’schicken(Russell(1912))providesanexcellentexampleof
thelimitationstostraightforwardextrapolationfromrepeatedsuccessfulreplica-
tion.Thebirdinfers,onrepeatedevidence,thatwhenthefarmercomesinthe
morning,hefeedsher.TheinferenceservesherwelluntilChristmasmorning,when
hewringsherneckandservesherfordinner.Thoughthischickendidnotbaseher
inferenceonanRCT,hadweconstructedoneforher,wewouldhaveobtainedthe
sameresultthatshedid.Herproblemwasnothermethodology,butratherthatshe
didnotunderstandthesocialandeconomicstructurethatgaverisetothecausalre-
lationsthatsheobserved.
So,establishingcausalitydoesnothinginandofitselftoguaranteegenerali-
zability.NordoestheabilityofanidealRCTtoeliminatebiasfromselectionorfrom
omittedvariablesmeanthattheresultingATEfromthetrialsamplewillapplyany-
whereelse.Theissueisworthmentioningonlybecauseoftheenormousweight
thatiscurrentlyattachedineconomicstothediscoveryandlabelingofcausalrela-
tions,aweightthatishardtojustifyforeffectsthatmayhaveonlylocalapplicabil-
ity,whatmightbelabeled‘anecdotalcausality’.Theoperationofacausegenerally
requiresthepresenceof“supportfactors”,withoutwhichacausethatproducesthe
targetedeffectinoneplace,eventhoughitmaybepresentandhavethecapacityto
operateelsewhere,willremainlatentandinoperative.WhatMackie(1974)called
INUScausality(InsufficientbutNon-redundantpartsofaconditionthatisitselfUn-
necessarybutSufficientforacontributiontotheoutcome)isoftenthekindofcau-
salitywesee.Astandardexampleisahouseburningdownbecausethetelevision
29
waslefton,althoughtelevisionsdonotoperateinthiswaywithoutsupportfactors,
suchaswiringfaults,thepresenceoftinder,andsoon.Thisisstandardfareinepi-
demiology,whichusestheterm`causalpie’torefertoasetofcausesthatarejointly
butnotseparatelysufficientforaneffect.
Ifwerewrite(1)intheform
𝑌( = 𝛽(𝑇( + 𝛾<𝑥(< = 𝜃 𝑤( 𝑇(
@
<A"
+ 𝛾<𝑥(<
@
<A"
(4)
wherethefunction𝜃(. )controlshowak-vector𝑤( ofk`supportfactors’affectindi-
viduali’streatmenteffect𝛽( .Thesupportfactorsmayincludesomeofthex’s.Since
theATEistheaverageofthe𝛽(𝑠,twopopulationswillhavethesameATEifandonly
iftheyhavethesameaveragefortheneteffectofthesupportfactorsnecessaryfor
thetreatmenttowork,i.e.forthequantityinfrontof𝑇( .Thesearehoweverjustthe
kindoffactorsthatarelikelytobedifferentlydistributedindifferentpopulations,
andindeedwedogenerallyfinddifferentATEsindifferenteconomic(andotherso-
cialpolicy)RCTsindifferentplaceseveninthecaseswhere(unusually)theyall
pointinthesamedirection.
Causalprocessesoftenrequirehighlyspecializedeconomic,cultural,orsocial
structurestoenablethemtowork.ConsidertheRubeGoldbergmachinethatis
riggedupsothatflyingakitesharpensapencil(CartwrightandHardie(2012,77)).
Theunderlyingstructureaffordsaveryspecificformof(4)thatwillnotdescribe
causalprocesseselsewhere.NeitherthesameATEnorthesamequalitativecausal
relationscanbeexpectedtoholdwherethespecificformfor(4)isdifferent.Indeed,
wecontinuallyattempttodesignsystemsthatwillgeneratecausalrelationsthatwe
likeandthatwillruleoutcausalrelationsthatwedonotlike.Healthcaresystems
aredesignedtopreventnursesanddoctorsmakingerrors;carsaredesignedsothat
driverscannotstarttheminreverse;workschedulesforpilotsaredesignedsothey
donotflytoomanyconsecutivehourswithoutrestbecausealertnessandperfor-
mancearecompromised.
30
AsintheRubeGoldbergmachineandinthedesignofcarsandworksched-
ules,theeconomicstructureandequilibriummaydifferinwaysthatsupportdiffer-
entkindsofcausalrelationsandthusrenderatrialinonesettinguselessinanother.
Forexample,atrialthatreliesonprovidingincentivesforpersonalpromotionisof
nouseinastateinwhichapoliticalsystemlockspeopleintotheirsocialandeco-
nomicpositions.Cashtransfersthatareconditionalonparentstakingtheirchildren
toclinicscannotimprovechildhealthintheabsenceoffunctioningclinics.Policies
targetedatmenmaynotworkforwomen.Weusealevertotoastourbread,butlev-
ersonlyoperatetotoastbreadinatoaster;wecannotbrowntoastbypressingan
accelerator,eveniftheprincipleoftheleveristhesameinbothatoasterandacar.
Ifwemisunderstandthesetting,ifwedonotunderstandwhythetreatmentinour
RCTworks,werunthesamerisksasRussell’schicken.
2.3WhenRCTsspeakforthemselves:notransportabilityrequired
Forsomethingswewanttolearn,anRCTisenoughbyitself.AnRCTmayprovidea
counterexampletoageneraltheoreticalproposition,eithertothepropositionitself
(asimplerefutationtest)ortosomeconsequenceofit(acomplexrefutationtest).
AnRCTmayalsoconfirmapredictionofatheory,andalthoughthisdoesnotcon-
firmthetheory,itisevidenceinitsfavor,especiallyifthepredictionseemsinher-
entlyunlikelyinadvance.Thisisallfamiliarterritory,andthereisnothingunique
aboutanRCT;itissimplyoneamongmanypossibletestingprocedures.Evenwhen
thereisnotheory,orveryweaktheory,anRCT,bydemonstratingcausalityinsome
populationcanbethoughtofasproofofconcept,thatthetreatmentiscapableof
workingsomewhere.Thisisoneoftheargumentsfortheimportanceofinternalva-
lidity.
NoristransportationcalledforwhenanRCTisusedforevaluation,forexam-
pletosatisfydonorsthattheprojecttheyfundedachieveditsaimsinthepopulation
inwhichitwasconducted.Evenso,forsuchevaluations,saybytheWorldBank,to
beglobalpublicgoodsrequiresargumentsandguidelinesthatjustifyusingthere-
sultsinsomewayelsewhere;theglobalpublicgoodisnotanautomaticby-product
oftheBankfulfillingitsfiduciaryresponsibility.Whenthecomponentsoftreat-
mentschangeacrossstudies,evaluationsneednotleadtocumulativeknowledge.Or
31
asHeckmanetal(1999,1934)note,“thedataproducedfromthem[socialexperi-
ments]arefarfromidealforestimatingthestructuralparametersofbehavioral
models.Thismakesitdifficulttogeneralizefindingsacrossexperimentsortouse
experimentstoidentifythepolicy-invariantstructuralparametersthatarerequired
foreconometricpolicyevaluation.”
Ofcourse,whenweaskexactlywhatthoseinvariantstructuralparameters
are,whethertheyexist,andhowtheyshouldbemodeled,weopenupmajorfault
linesinmodernappliedeconomics.Forexample,wedonotintendtoendorseinter-
temporaldynamicmodelsofbehaviorastheonlywayofrecoveringtheparameters
thatweneed.Wealsorecognizethattheusefulnessofsimplepricetheoryisnotas
universallyacceptedasitoncewas.Butthepointremainsthatweneedsomething,
someregularityorsomeinvariance,andthatsomethingcanrarelyberecoveredby
simplygeneralizingacrosstrials.
Athirdnon-problematicandimportantuseofanRCTiswhentheparameter
ofinterestistheATEinawell-definedpopulationfromwhichthetrialsampleisit-
selfarandomsample.Inthiscasethesampleaveragetreatmenteffect(SATE)isan
unbiasedestimatorofthepopulationaveragetreatmenteffect(PATE)that,byas-
sumption,isourtarget(seeImbens(2004)fortheseterms).Werefertothisasthe
`publichealth’case;likemanypublichealthinterventions,thetargetistheaverage,
`populationhealth,’notthehealthofindividuals.Onemajor(andwidelyrecog-
nized)dangerofthisuseofRCTsisthatscalingupfrom(evenarandom)sampleto
thepopulationwillnotgothroughinanysimplewayiftheoutcomesofindividuals
orgroupsofindividualschangethebehaviorofothers—whichwillbecommonin
economicexamplesbutperhapslesssoinhealth.Thereisalsoanissueoftimingif
timeelapsesbetweenthetrialandtheimplementation.
Ineconomics,a`public-health-style’exampleistheimpositionofacommod-
itytax,wherethetotaltaxrevenueisofinterestandpolicymakersdonotcarewho
paysthetax.Indeed,theorycanoftenidentifyaspecific,well-definedquantity
whosemeasurementiskeyforapolicy(seeDeatonandNg(1998)foranexampleof
whatChetty(2009)callsa“sufficient”statistic).Inthiscase,thebehaviorofaran-
domsampleofindividualsmightwellprovideagoodguidetothetaxrevenuethat
32
canbeexpected.Anothercasecomesfromworkonpovertyprogramswherethe
sponsorsaremostconcernedaboutthebudget;wediscussthesecasesattheendof
thisSection.Evenhere,itiseasytoimaginebehavioraleffectscomingintoplaythat
driveawedgebetweenthetrialanditsfull-scaleimplementation,forexampleif
complianceishigherwhentheschemeiswidelypublicized,orifgovernmentagen-
ciesimplementtheschemedifferentlyfromtrialists.
2.4Transportingresultslaterallyandglobally
TheprogramofRCTsineconomics,asinotherareasofsocialscience,hasthe
broadergoaloffindingout`whatworks.’Atitsmostambitious,thisaimsforuniver-
salreach,andthedevelopmenteconomicsliteraturefrequentlyarguesthat“credi-
bleimpactevaluationsareglobalpublicgoodsinthesensethattheycanofferrelia-
bleguidancetointernationalorganizations,governments,donors,andnongovern-
mentalorganizations(NGOs)beyondnationalborders”,DufloandKremer(2008,
93).SometimestheresultsofasingleRCTareadvocatedashavingwideapplicabil-
ity,withespeciallystrongendorsementwhenthereisatleastonereplication.For
example,KremerandHolla(2009,3)useaKenyantrialasthebasisforablanket
statementwithoutspecifyingcontext,“Provisionoffreeschooluniforms,forexam-
ple,leadsto10%-15%reductionsinteenpregnancyanddropoutrates.”Dufloand
Kremer(2008,104),writingaboutanothertrial,aremorecautious,citingtwoeval-
uationsandrestrictingthemselvestoIndia:“Onecanberelativelyconfidentabout
recommendingthescaling-upofthisprogram,atleastinIndia,onthebasisofthese
estimates,sincetheprogramwascontinuedforaperiodoftime,wasevaluatedin
twodifferentcontexts,andhasshownitsabilitytoberolledoutonalargescale.”
Evenanumberofreplicationsdonotprovideasoundbasisforinference.Without
theorytosupporttheprojectionofresults,thisisjustinductionbysimpleenumera-
tion—swan1iswhite,swan2iswhite,...,soallswansarewhite.
TheproblemofgeneralizationextendsbeyondRCTs,toboth`fullycon-
trolled’laboratoryexperimentsandtomostnon-experimentalfindings.Ourargu-
menthereisthatevidencefromRCTsisnotautomaticallysimplygeneralizable,and
thatitssuperiorinternalvalidity,ifandwhenitexists,doesnotprovideitwithany
uniqueinvarianceacrosscontext.Thattransportationisfarfromautomaticalso
33
tellsuswhy(evenideal)RCTsofsimilarinterventionsgivedifferentanswersindif-
ferentsettings.Suchdifferencesdonotnecessarilyreflectmethodologicalfailings
andwillholdacrossperfectlyexecutedRCTsjustastheydoacrossobservational
studies.
ManyadvocatesofRCTsunderstandthat`whatworks’needstobequalified
to`whatworksunderwhichcircumstances’andtrytosaysomethingaboutwhat
thosecircumstancesmightbe,forexample,byreplicatingRCTsindifferentplaces
andthinkingintelligentlyaboutthedifferencesinoutcomeswhentheyfindthem.
Sometimesthisisdoneinasystematicway,forexamplebyhavingmultipletreat-
mentswithinthesametrialsothatitispossibletoestimatea`responsesurface’that
linksoutcomestovariouscombinationsoftreatments(seeGreenbergandSchroder
(2004)orShadishetal(2002)).Forexample,theRANDhealthexperimenthadmul-
tipletreatments,allowinginvestigation,notofhowmuchhealthinsurancein-
creasedexpendituresunderdifferentcircumstances.Someofthenegativeincome
taxexperiments(NITs)inthe1960sand1970sweredesignedtoestimateresponse
surfaces,withthenumberoftreatmentsandcontrolsineacharmoptimizedtomax-
imizeprecisionofestimatedresponsefunctionssubjecttoanoverallcostlimit(see
Conlisk(1973)).Experimentsontime-of-daypricingforelectricityhadasimilar
structure(seeAigner(1985)).
TheexperimentsbyMDRC(originallyknownastheManpowerDevelopment
ResearchCorporation)havealsobeenanalyzedacrosscitiesinanefforttolinkcity
featurestotheresultsoftheRCTswithinthem(seeBloom,Hill,andRiccio(2005)).
UnliketheRANDandNITexamples,theseareexpostanalysesofcompletedtrials;
thesameistrueofVivalt(2015),whofinds,forthecollectionoftrialsshestudied,
thatdevelopment-relatedRCTsrunbygovernmentagenciestypicallyfindsmaller
(standardized)effectsizesthanRCTsrunbyacademicsorbyNGOs.Boldetal
(2013),whoranparallelRCTsonaninterventionimplementedeitherbyanNGOor
bythegovernmentofKenya,foundsimilarresultsthere.Notethattheseanalyses
haveadifferentpurposefrommeta-analysesthatassumethatdifferenttrialsesti-
matethesameparameteruptonoiseandaverageinordertoincreaseprecision.
34
Althoughthereareissueswithallofmethodsofinvestigatingdifferences
acrosstrials,withoutsomedisciplineitistooeasytocomeupwith`just-so’orfairy
storiesthataccountfordifferences.Weriskaprocedurethat,ifaresultisreplicated
infullorinpartinatleasttwoplaces,putsthattreatmentintothe`itworks’box
and,iftheresultdoesnotreplicate,casuallyinterpretsthedifferenceinawaythat
allowsatleastsomeofthefindingstosurvive.
Howcanwedobetterthansimplegeneralizationandsimpleextrapolation?
Manywritersemphasizetheroleoftheoryintransportingandusingtheresultsof
trials,andweshalldiscussthisinthenextsubsection.Butstatisticalapproachesare
alsowidelyused;thesearedesignedtodealwiththepossibilitythattreatmentef-
fectsvarysystematicallywithothervariables.Referringbackto(4),itisclearthat,
supposingthesameformof(4)obtains,ifthedistributionofthewvaluesisthe
sameinthenewcircumstancesasintheold,theATEintheoriginaltrialwillholdin
thenewcircumstances.Ingeneral,ofcourse,thisconditionwillnothold,nordowe
haveanyobviouswayofcheckingitunlessweknowwhatthesupportfactorsarein
bothplaces.Oneproceduretodealwithinteractionsispost-experimentalstratifica-
tion,whichparallelspost-surveystratificationinsamplesurveys.Thetrialisbroken
upintosubgroupsthathavethesamecombinationofknown,observablew’s(age,
race,genderforexample),thentheATEswithineachofthesubgroupsarecalcu-
lated,andthentheyarereassembledaccordingtotheconfigurationofw’sinthe
newcontext.ThiscanbeusedtoestimatetheATEinanewcontext,ortocorrectes-
timatestotheparentpopulationwhenthetrialsampleisnotarandomsampleof
theparent.Othermethodscanbeusedwhentherearetoomanyw’sforstratifica-
tion,forexamplebyestimatingtheprobabilityofeachobservationinthepopulation
includedinthetrialsampleasafunctionofthew’s,thenweightingeachobservation
bytheinverseofthesepropensityscores.AgoodreferenceforthesemethodsisStu-
artetal(2011),orineconomics,Angrist(2004)andHotz,Imbens,andMortimer
(2005).
Thesemethodsareoftennotapplicable,however.First,reweightingworks
onlywhentheobservablefactorsusedforreweightingincludeall(andonly)genu-
ineinteractivecauses.Second,aswithanyformofreweighting,thevariablesusedto
35
constructtheweightsmustbepresentinboththeoriginalandnewcontext.Forex-
ample,ifwearetocarryaresultforwardintime,wemaynotbeabletoextrapolate
fromaperiodoflowinflationtoaperiodofhighinflation.AsHotzetal(2005)note,
itwilltypicallybenecessarytoruleoutsuch`macro’effects,whetherovertime,or
overlocations.Third,italsodependsonassumingthatthesamegoverningequation
(4)coversthetrialandthetargetpopulation.
PearlandBareinboim(2011,2014)andBareinboimandPearl(2013,2014)
providestrategiesforinferringinformationaboutnewpopulationsfromtrialre-
sultsthataremoregeneralthanreweighting.Theysupposewehaveavailableboth
causalinformationandprobabilisticinformationforpopulationA(e.g.theexperi-
mentalone),whileforpopulationB(thetarget)wehaveonly(some)probabilistic
information,andalsothatweknowthatcertainprobabilisticandcausalfactsare
sharedbetweenthetwoandcertainonesarenot.Theyoffertheoremsdescribing
whatcausalconclusionsaboutpopulationBaretherebyfixed.Theirworkunder-
linesthefactthatexactlywhatconclusionsaboutonepopulationcanbesupported
byinformationaboutanotherdependsonexactlywhatcausalandprobabilisticfacts
theyhaveincommon. ButasMuller(2015)notes,this,liketheproblemwithsimple
reweighting,takesusbacktothesituationthatRCTsaredesignedtoavoid,where
weneedtostartfromacompleteandcorrectspecificationofthecausalstructure.
RCTscanavoidthisinestimation—whichisoneoftheirstrengths,supportingtheir
credibility—butthebenefitvanishesassoonaswetrytocarrytheirresultstoanew
context.
Thisdiscussionleadstoanumberofpoints.First,wecannotgettogeneral
claimsbysimplegeneralization;thereisnowarrantfortheconvenientassumption
thattheATEestimatedinaspecificRCTisaninvariantparameter,northatthe
kindsofinterventionsandoutcomeswemeasureintypicalRCTsparticipateingen-
eralcausalrelations.Whileitistruethatgeneralcausalclaimsexist—thatgravita-
tionalmassesattracteachother,orthatpeoplerespondtoincentives—theseuse
relativelyabstractconceptsandoperateatamuchhigherlevelthantheclaimsthat
canbereasonablyinferredfromatypicalRCT.
36
Second,thoughtfulpre-experimentalstratificationinRCTsislikelytobeval-
uable,orfailingthat,subgroupanalysis,becauseitcanprovideinformationthatmay
beusefulforgeneralizationortransportation.Forexample,KremerandHolla
(2009)notethat,intheirtrials,schoolattendanceissurprisinglysensitivetosmall
subsidies,whichtheysuggestisbecausetherearealargenumberofstudentsand
parentswhoareonthe(financial)marginbetweenattendingandnotattending
school;ifthisisindeedthemechanismfortheirresults,agoodvariableforstratifi-
cationwouldbedistancefromtherelevantcutoff.Wealsoneedtoknowthatthis
samemechanismworksinanynewtargetsetting.
Third,weneedtobeexplicitaboutcausalstructure,evenifthatmeansmore
modelbuildingandmore—ordifferent—assumptionsthanadvocatesofRCTsare
oftencomfortablewith.Tobeclear,modelingcausalstructuredoesnotcommitus
totheelaborateandoftenincredibleassumptionsthatcharacterizesomestructural
modelingineconomics,butthereisnoescapefromthinkingaboutthewaythings
work;thewhyaswellasthewhat.
Fourth,wewilltypicallyneedtoknowmorethantheresultsoftheRCTitself,
forexampleaboutdifferencesinsocial,economic,andculturalstructuresandabout
thejointdistributionsofcausalvariables,knowledgethatwilloftenonlybeavaila-
blethroughobservationalstudies.Wewillalsoneedexternalinformation,boththe-
oreticalandempirical,tosettleonaninformativecharacterizationofthepopulation
enrolledintheRCTbecausehowthatpopulationisdescribediscommonlytakento
besomeindicationofwhichotherpopulationstheresultsarelikelytobeexportable
to.Manymedicalandpsychologicaljournalsareexplicitaboutthis.Forinstance,the
rulesforsubmissionrecommendedbytheInternationalCommitteeofMedicalJour-
nalEditors,ICMJE(2015,14)insistthatarticleabstracts“Clearlydescribetheselec-
tionofobservationalorexperimentalparticipants(healthyindividualsorpatients,
includingcontrols),includingeligibilityandexclusioncriteriaandadescriptionof
thesourcepopulation.”AnRCTisconductedonaspecifictrialsample,somehow
drawnfromapopulationofspecificindividuals.Theresultsobtainedarefeaturesof
thatsample,ofthoseveryindividualsatthatverytime,notanyotherpopulation
37
withanydifferentindividualsthatmight,forexample,satisfyoneoftheinfiniteset
ofdescriptionsthatthetrialsamplesatisfies.
Thissameissueisconfrontedalreadyinstudydesign.Apartfromspecial
cases,likeposthocevaluationforpayment-for-results,wearenotespeciallycon-
cernedtolearnabouttheveryindividualsenrolledinthetrial.Mostexperiments
are,andshouldbe,conductedwithaneyetowhattheresultscanhelpuslearn
aboutotherpopulations.Thiscannotbedonewithoutsubstantialassumptions
aboutwhatmightandwhatmightnotberelevanttotheproductionoftheoutcome
studied.(Forexample,theICMJEguidelines(2015,14)goontosay:“Becausethe
relevanceofsuchvariablesasage,sex,orethnicityisnotalwaysknownatthetime
ofstudydesign,researchersshouldaimforinclusionofrepresentativepopulations
intoallstudytypesandataminimumprovidedescriptivedatafortheseandother
relevantdemographicvariables,”p14.)Sobothintelligentstudydesignandrespon-
siblereportingofstudyresultsinvolvesubstantialbackgroundassumptions.
Ofcourse,thisistrueforallstudies.ButRCTsrequirespecialconditionsif
theyaretobeconductedatallandespeciallyiftheyaretobeconductedsuccess-
fully—forexample,localagreements,compliantsubjects,affordableadministrators,
multipleblinding,peoplecompetenttomeasureandrecordoutcomesreliably,aset-
tingwhererandomallocationismorallyandpoliticallyacceptable,etc.—whereas
observationaldataareoftenmorereadilyandwidelyavailable.InthecaseofRCTs,
thereisdangerthatthesekindsofconsiderationshavetoomucheffect.Thisisespe-
ciallyworrisomewherethefeaturesthatthetrialsampleshouldhavearenotjusti-
fied,madeexplicit,orsubjectedtoseriouscriticalreview.
Theneedforobservationalknowledgeisoneofmanyreasonswhyitiscoun-
ter-productivetoinsistthatRCTsarethegoldstandardorthatsomecategoriesof
evidenceshouldbeprioritizedoverothers;thesestrategiesleaveushelplessinus-
ingRCTsbeyondtheiroriginalcontext.TheresultsofRCTsmustbeintegratedwith
otherknowledge,includingthepracticalwisdomofpolicymakers,iftheyaretobe
useableoutsidethecontextinwhichtheywereconstructed.
38
Contrarytomuchpracticeinmedicineaswellasineconomics,conflictsbe-
tweenRCTsandobservationalresultsneedtobeexplained,forexamplebyrefer-
encetothedifferentpopulationsineach,aprocessthatwillsometimesyieldim-
portantevidence,includingontherangeofapplicabilityoftheRCTresultsthem-
selves.WhilethevalidityoftheRCTwillsometimesprovideanunderstandingof
whytheobservationalstudyfoundadifferentanswer,thereisnobasis(orexcuse)
forthecommonpracticeofdismissingtheobservationalstudysimplybecauseit
wasnotanRCTandthereforemustbeinvalid.Itisabasictenetofscientificadvance
that,ascollectiveknowledgeadvances,newfindingsmustbeabletoexplainandbe
integratedwithpreviousresults,evenresultsthatarenowthoughttobeinvalid;
methodologicalprejudiceisnotanexplanation.
2.5Usingtheoryforgeneralization
Economistshavebeencombiningtheoryandrandomizedcontrolledtrialssincethe
earlyexperiments.OrcuttandOrcutt(1968)laidouttheinspirationfortheincome
taxtrialsusingasimplestatictheoryoflaborsupply.Accordingtothis,people
choosehowtodividetheirtimebetweenworkandleisureinanenvironmentin
whichtheyreceiveaminimumGiftheydonotwork,andwheretheyreceiveanad-
ditionalamount(1-t)wforeachhourtheywork,wherewisthewagerate,andtisa
taxrate.ThetrialsassigneddifferentcombinationsofGandttodifferenttrial
groups,sothattheresultstracedoutthelaborsupplyfunction,allowingestimation
oftheparametersofpreferences,whichcouldthenbeusedinawiderangeofpolicy
calculations,forexampletoraiserevenueatminimumutilitylosstoworkers.
Followingtheseearlytrials,therehasbeenacontinuingtraditionofusing
trialresults,togetherwiththebaselinedatacollectedforthetrial,tofitstructural
modelsthataretobeusedmoregenerally.EarlyexamplesincludeMoffitt(1979)on
laborsupplyandWise(1985)onhousing;amorerecentexampleisHeckman,Pinto
andSavelyev(2013)forthePerrypre-schoolprogram.Developmenteconomicsex-
amplesincludeAttanasio,Meghir,andSantiago(2012),Attanasioetal(2015),Todd
andWolpin(2006),Wolpin(2013),andDuflo,Hanna,andRyan(2012).These
39
structuralmodelssometimesrequireformidableauxiliaryassumptionsonfunc-
tionalformsorthedistributionsofunobservables,buttheyhavecompensatingad-
vantages,includingtheabilitytointegratetheoryandevidence,tomakeout-of-sam-
plepredictions,andtoanalyzewelfare,andtheuseofRCTevidenceallowsthere-
laxationofatleastsomeoftheassumptionsthatareneededforidentification.Inthis
way,thestructuralmodelsborrowcredibilityfromtheRCTsandinreturnhelpset
theRCTresultswithinacoherentframework.Withoutsomesuchinterpretation,the
welfareimplicationsofRCTresultscanbeproblematic;knowinghowpeopleingen-
eral(letalonejustpeopleinthetrialpopulation)respondtosomepolicyisrarely
enoughtotellwhetherornottheyaremadebetteroff,Harrison(2014a,b).Tradi-
tionalwelfareeconomicsdrawsalinkfrompreferencestobehavior,alinkthatisre-
spectedinstructuralworkbutoftenlostinthe`whatworks’literature,andwithout
whichwehavenobasisforinferringwelfarefrombehavior.Whatworksisnot
equivalenttowhatshouldbe.
Lighttouchtheorycandomuchtointerpret,toextend,andtouseRCTre-
sults.InboththeRANDHealthExperimentandnegativeincometaxexperiments,an
immediateissueconcernedthedifferencebetweenshortandlong-runresponses;
indeed,differencesbetweenimmediateandultimateeffectsoccurinawiderangeof
RCTs.BothhealthandtaxRCTsaimedtodiscoverwhatwouldhappenifconsum-
ers/workerswerepermanentlyfacedwithhigherorlowerprices/wages,butthetri-
alscouldonlyrunforalimitedperiod.Atemporarilyhightaxrateonearningsisef-
fectivelya`firesale’onleisure,sothattheexperimentprovidedanopportunityto
takeavacationandmakeuptheearningslater,anincentivethatwouldbeabsentin
apermanentscheme.Howdowegetfromtheshort-runresponsesthatcomefrom
thetrialtothelong-runresponsesthatwewanttoknow?Metcalf(1973)andAsh-
enfelter(1978)providedanswersfortheincometaxexperiments,asdidArrow
(1975)fortheRandHealthExperiment.
Arrow’sanalysisillustrateshowtousebothstructureandobservationaldata
totransportandadaptresultsfromonesettingtoanother.Hemodelsthehealthex-
perimentasatwo-periodmodelinwhichthepriceofmedicalcareisloweredinthe
firstperiodonly,andshowshowtoderivewhatwewant,whichistheresponsein
40
thefirstperiodifpriceswereloweredbythesameproportioninbothperiods.The
magnitudethatwewantisS,thecompensatedpricederivativeofmedicalcarein
period1inthefaceofidenticalincreasesin𝑝"and𝑝4inbothperiods1and2.Thisis
equalto𝑠"" + 𝑠"4,thesumofthederivativesofperiod1’sdemandwithrespectto
thetwoprices.Thetrialgivesonly𝑠"".Butifwehavepost-trialdataonmedicalser-
vicesforbothtreatmentsandcontrols,wecaninfer𝑠4",theeffectoftheexperi-
mentalpricemanipulationonpost-experimentalcare.Choicetheory,intheformof
Slutskysymmetrysaysthat𝑠"4 = 𝑠4"andsoallowsArrowtoinfer𝑠"4andthusS.He
contraststhiswithMetcalf’salternativesolution,whichmakesdifferentassump-
tions—thattwoperiodpreferencesareintertemporallyadditive,inwhichcasethe
long-runelasticitycanbeobtainedfromknowledgeoftheincomeelasticityofpost-
experimentalmedicalcare,whichwouldhavetocomefromanobservationalanaly-
sis.Thesetwoalternativeapproachesshowhowwecanchoose,basedonourwill-
ingnesstomakeassumptionsandonthedatathatwehave,asuitablecombination
of(elementaryandtransparent)theoreticalassumptionsandobservationaldatain
ordertoadaptandusetrialresults.Suchanalysiscanalsohelpdesigntheoriginal
trialbyclarifyingwhatweneedtoknowinordertousetheresultsofatemporary
treatmenttoestimatethepermanenteffectsthatweneed.Ashenfelterprovidesa
thirdsolution,notingthatthetwo-periodmodelisformallyidenticaltoatwo-person
model,sothatwecanuseinformationontwo-personlaborsupplytotellusabout
thedynamics.
Theorycanoftenallowustoreclassifyneworunknownsituationsasanalo-
goustosituationswherewealreadyhavebackgroundknowledge.Onefrequently
usefulwayofdoingthisiswhenthenewpolicycanberecastasequivalenttoa
changeinthebudgetconstraintthatrespondentsface.Theconsequencesofanew
policymaybeeasiertopredictifwecanreduceittoequivalentchangesinincome
andprices,whoseeffectsareoftenwellunderstoodandwell-studied.Toddand
Wolpin(2008)andWolpin(2013)makethispointandprovideexamples.Inthela-
borsupplycase,anincreaseinthetaxratehasthesameeffectasadecreaseinthe
wagerate,sothatwecanrelyonpreviousliteraturetopredictwhatwillhappen
whentaxratesarechanged.InthecaseofMexico’sPROGRESAconditionalcash
41
transferprogram,ToddandWolpinnotethatthesubsidiespaidtoparentsiftheir
childrengotoschoolcanbethoughtofasacombinationofreductioninchildren’s
wagesandanincreaseinparents’income,whichallowsthemtopredicttheresults
oftheconditionalcashexperimentwithlimitedadditionalassumptions.Ifthis
works,asitpartiallydoesintheiranalysis,thetrialhelpsconsolidateprevious
knowledgeandcontributestoanevolvingbodyoftheoryandempirical,including
trial,evidence.
Theprogramofthinkingaboutpolicychangesasequivalenttopriceandin-
comechangeshasalonghistoryineconomics;muchofrationalchoicetheorycanbe
sointerpreted(seeDeatonandMuellbauer(1980)formanyexamples).Whenthis
conversioniscredible,andwhenatrialonsomeapparentlyunrelatedtopiccanbe
modeledasequivalenttoachangeinpricesandincomes,andwhenwecanassume
thatpeopleindifferentsettingsrespondrelevantlysimilarlytochangesinprices
andincomes,wehaveareadymadeframeworkforincorporatingthetrialresults
intopreviousknowledge,aswellasforextendingthetrialresultsandusingthem
elsewhere.Ofcourse,alldependsonthevalidityandcredibilityofthetheory;peo-
plemaynotinfacttreatataxincreaseasadecreaseinthepriceofleisure,andbe-
havioraleconomicsisfullofexampleswhereapparentlyequivalentstimuligenerate
non-equivalentoutcomes.Theembraceofbehavioraleconomicsbymanyofthecur-
rentgenerationoftrialistsmayaccountfortheirlimitedwillingnesstouseconven-
tionalchoicetheoryinthisway.Unfortunately,behavioraleconomicsdoesnotyet
offerareplacementforthegeneralframeworkofchoicetheorythatissousefulin
thisregard.
Theorycanalsohelpwiththeproblemweraisedofdelineatingthepopula-
tiontowhichthetrialresultsimmediatelyapplyandforthinkingaboutmoving
fromthispopulationtopopulationsofinterest.Ashenfelter’s(1978)analysisis
againagoodillustrationandpredatesmuchsimilarworkinlaterliterature.Thein-
cometaxexperimentsofferedparticipationinthetrialtoarandomsampleofthe
populationofinterest.Becausetherewasnoblindingandnocompulsion,people
whowererandomizedintothetreatmentgroupwerefreetochoosetorefusetreat-
ment.Asinmanysubsequentanalyses,Ashenfeltersupposesthatpeoplechooseto
42
participateifitisintheirinteresttodoso,dependingonwhathasbecomeknownin
theRCTandInstrumentalVariablesliteratureastheirownidiosyncratic`gain.’The
simplelaborsupplymodelgivesanapproximatecondition:Ifthetreatmentin-
creasesthetaxratefrom t0 to t1 withanoffsettingincreaseinG,thenanindividual
assignedtotheexperimentalgroupwilldeclinetoparticipateif
(t1 − t0 )w0h0 + 12
s00 (t1 − t0 ) > G1 −G0 (5)
wheresubscript1referstothetreatmentsituation,0tothecontrol,ℎ&ishours
worked,and𝑠&&isthe(negative)utility-constantresponseofhoursworkedtothe
taxrate.Ifthereisnosubstitution,thesecondtermontheleft-handsideiszero,and
peoplewillaccepttreatmentiftheincreaseinGmorethanmakesupforthein-
creasesintaxespayable,the`breakeven’condition.Inconsequence,thosewith
higherearningsarelesslikelytoaccepttreatment.Somebetter-offpeoplewithhigh
substitutioneffectswillalsoaccepttreatmentiftheopportunitytobuymorecheap
leisureissufficiententicement.
Theselectiveacceptanceoftreatmentlimitstheanalyst’sabilitytolearn
aboutthebetter-offorlow-substitutionpeoplewhodeclinetreatmentbutwho
wouldhavetoacceptitifthepolicywereimplemented.Boththeintention-to-treat
estimatorandthe`astreated’estimatorthatcomparesthetreatedandtheun-
treatedareaffected,notjustbythelaborsupplyeffectsthatthetrialisdesignedto
induce,butbythekindofselectioneffectsthatrandomizationisdesignedtoelimi-
nate.Ofcourse,theanalysisthatleadsto(5)canperhapshelpussaysomething
aboutthisandhelpusadjustthetrialestimatesbacktowhatwewouldliketoknow.
Yetthisisnoeasymatterbecauseselectiondepends,notonlyonobservables,such
aspre-experimentalearningsandhoursworked,buton(muchhardertoobserve)
laborsupplyresponsesthatlikelyvaryacrossindividuals.ParaphrasingAshenfelter,
wecannotestimatetheeffectsofapermanentcompulsorynegativeincometaxpro-
gramfromatransitoryvoluntarytrialwithoutstrongassumptionsoradditionalevi-
dence.
Muchofthemodernliterature,forexampleontrainingprograms,wrestles
withtheissueofexactlywhoisrepresentedbytheRCTresults,includingnotonly
43
whoparticipatesinthefirstplacebutwholeavesbeforethetrialiscompleted(see
againHeckman,LalondeandSmith(1999)).Asintheexamplesabove,modelingat-
tritionwithinatrialcanyieldestimatesofbehavioralresponsesthatcanbeusedto
transportthefindingstoothersettings(seeChanandHamilton(2006),Chassang,
PadróIMiguel,andSnowberg(2012)andChassangetal(2015)).Whenpeopleare
allowedtorejecttheirrandomlyassignedtreatmentaccordingtotheirown(realor
perceived)advantage,ortodropoutofatrialonanestimateofthebenefitsand
costsfromdoingso,wehavecomealongwayawayfromtherandomallocationin
thestandardconceptionofarandomizedcontrolledtrial.Moreover,theabsenceof
blindingiscommoninsocialandeconomicRCTs,andwhiletherearetrials,suchas
welfaretrials,thateffectivelycompelpeopletoaccepttheirassignments,andsome
wherethetreatmentisgenerousenoughtodoso,therearetrialswheresubjects
havemuchfreedomand,inthosecasesitislessthanobvioustouswhatrole,ifany,
randomizationplaysinwarrantingtheresults.
2.6Scalingup:usingtheaverageforpopulations
ManyRCTsaresmall-scaleandlocal,forexampleinafewschools,clinics,orfarms
inaparticulargeographic,cultural,socio-economicsetting.Ifsuccessfulaccording
toacost-effectivenesscriterion,forexample,itisacandidateforscaling-up,apply-
ingthesameinterventionforamuchlargerarea,oftenawholecountry,orsome-
timesevenbeyond,aswhensometreatmentisconsideredforallrelevantWorld
Bankprojects.Thefactthattheinterventionmightworkdifferentlyatscalehaslong
beennotedintheeconomicsliterature,e.g.GarfinkelandManski(1992),Heckman
(1992),andMoffitt(1992),andisrecognizedintherecentreviewbyBanerjeeand
Duflo(2009).Wewantheretoemphasizethepervasivenessofsucheffectsaswell
astonoteagainthatthisshouldnotbetakenasanargumentagainstusingRCTsbut
onlyagainsttheideathateffectsatscalearelikelytobethesameasinthetrial.
Anexampleofwhatareoftencalled`generalequilibriumeffects’comesfrom
agriculture.SupposeanRCTdemonstratesthatinthestudypopulationanewwayof
usingfertilizerhadasubstantialpositiveeffecton,say,cocoayields,sothatfarmers
whousedthenewmethodssawincreasesinproductionandinincomescompared
tothoseinthecontrolgroup.Iftheprocedureisscaleduptothewholecountry,or
44
toallcocoafarmersworldwide,thepricewilldrop,andifthedemandforcocoais
priceinelastic—asisusuallythoughttobethecase,atleastintheshortrun—cocoa
farmers’incomeswillfall.Indeed,theconventionalwisdomformanycropsisthat
farmersdobestwhentheharvestissmall,notlarge.Ofcourse,theseconsiderations
mightnotbedecisiveindecidingwhetherornottopromotetheinnovation,and
theremaystillbelongtermgainsif,forexample,somefarmersfindsomethingbet-
tertodothangrowingcocoa.But,inthiscase,thescaled-upeffectisoppositeinsign
tothetrialeffect.Theproblemisnotwiththetrialresults,whichcanbeusefullyin-
corporatedintoamorecomprehensivemarketmodelthatincorporatesthere-
sponsesestimatedbythetrial.Theproblemisonlyifweassumethattheaggregate
looksliketheindividual.Thatotheringredientsoftheaggregatemodelmustcome
fromobservationalstudiesshouldnotbeacriticism,evenforthosewhofavorRCTs;
itissimplythepriceofdoingseriousanalysis.
Therearemanypossibleinterventionsthataltersupplyordemandwhoseef-
fect,inaggregate,willchangeapriceorawagethatisheldconstantintheoriginal
RCT.Educationwillchangethesuppliesofskilledversusunskilledlabor,withimpli-
cationsforrelativewagerates.Conditionalcashtransfersincreasethedemandfor
(andperhapssupplyof)schoolsandclinics,whichwillchangepricesorwaiting
lines,orboth.Thereareinteractionsbetweenpeoplethatwilloperateonlyatscale.
Givingonechildavouchertogotoprivateschoolmightimproveherfuture,butdo-
ingsoforeveryonecandecreasethequalityofeducationforthosechildrenwhoare
leftinthepublicschools(seethecontrastingstudiesofAngristetal(2002)and
HsiehandUrquiola(2002)).Educationalortrainingprogramsmaybenefitthose
whoaretreatedbutharmthoseleftbehind;Créponetal(2014)recognizetheissue
andshowhowtoadaptanRCTtodealwithit.
Scalingupcanalsodisturbthepoliticalequilibrium.Anexploitativegovern-
mentmaynotallowthemasstransferofmoneyfromabroadtoapowerlessseg-
mentofthepopulation,thoughitmaypermitasmall-scaleRCTofcashtransfers,
perhapseveninthehopethatalarge-scaleimplementationwillyieldopportunities
forpredation.ProvisionofhealthcarebyforeignNGOsmaybesuccessfulintrials,
buthaveunintendednegativeconsequencestoscalebecauseofgeneralequilibrium
45
effectsonthesupplyofhealthcarepersonnel,orbecauseitdisturbsthenatureof
thecontractbetweenthepeopleandagovernmentthatisusingtaxrevenuetopro-
videservices.InIndia,thegovernmentspendslargesumsonfoodsubsidiesthrough
asystem(thePDS)thatisbothcorruptandinefficient,withmuchofthegrainthatis
procuredfailingtofinditswaytotheintendedbeneficiaries.LocalizedRCTson
whetherornotfamiliesarebetteroffwithcashtransfersarenotinformativeabout
howpoliticianswouldchangetheamountofthetransferiffacedwithunanticipated
inflation,andatleastasimportant,whetherthegovernmentcouldcutprocurement
fromrelativelywealthyandpoliticallypowerfulfarmers.Withoutapoliticaland
generalequilibriumanalysis,itisimpossibletothinkabouttheeffectsofreplacing
foodsubsidieswithcashtransfers(seee.g.Basu(2010)).
Eveninmedicine,wherebiologicalinteractionsbetweenpeoplearelesscom-
monthanaresocialinteractionsinsocialscience,interactionscanbeimportant.In-
fectiousdiseasesareonewell-knownexample,whereimmunizationprogramsaf-
fectthedynamicsofdiseasetransmissionthroughherdimmunity(seeFineand
Clarkson(1986)andManski(2013,52)).Thesocialandeconomicsettingalsoaf-
fectshowdrugsareactuallyusedandthesameissuescanarise;thedistinctionbe-
tweenefficacyandeffectivenessinclinicaltrialsisinpartrecognitionofthefact.
2.7Drillingdown:usingtheaverageforindividuals
Justasthereareissueswithscaling-up,itisnotobvioushowtousetheresultsfrom
RCTsatthelevelofindividualunits,evenindividualunitsthatwereincludedinthe
trial.Awell-conductedRCTdeliversanATEforthetrialpopulationbut,ingeneral,
thataveragedoesnotapplytoeveryone.Itisnottrue,forexample,asarguedinthe
AmericanMedicalAssociation’s“Users’guidetothemedicalliterature”that“ifthe
patientwouldhavebeenenrolledinthestudyhadshebeenthere—thatisshemeets
alloftheinclusioncriteriaanddoesn’tviolateanyoftheexclusioncriteria—thereis
littlequestionthattheresultsareapplicable”(seeGuyattetal(1994,60)).Even
moremisleadingaretheoften-heardstatementsthatanRCTwithanaveragetreat-
menteffectinsignificantlydifferentfromzerohasshownthatthetreatmentworks
fornoone.
46
Theseissuesarefamiliartophysicianspracticingevidence-basedmedicine
whoseguidelinesrequire“integratingindividualclinicalexpertisewiththebest
availableexternalclinicalevidencefromsystematicresearch,”Sackettetal(1996,
71).Exactlywhatthismeansisunclear;physiciansknowmuchmoreabouttheirpa-
tientsthanisallowedforintheATEfromtheRCT(though,onceagain,stratification
inthetrialislikelytobehelpful)andtheyoftenhaveintuitiveexpertisefromlong
practicethatcanhelpthemidentifyfeaturesinaparticularpatientthatmayinflu-
encetheeffectivenessofagiventreatmentforthatpatient.Butthereisanoddbal-
ancestruckhere.Thesejudgmentsaredeemedadmissibleindiscussionwiththein-
dividualpatient,buttheydon’tadduptoevidencetobemadepubliclyavailable,
withtheusualcautionsaboutcredibility,bythestandardsadoptedbymostEBM
sites.Itisalsotruethatphysicianscanhaveprejudicesand`knowledge’thatmight
beanythingbut.Clearly,therearesituationswhereforcingpractitionerstofollow
theaveragewilldobetter,evenforindividualpatients,andotherswheretheoppo-
siteistrue,KahnemanandKlein(2009).
Whetherornotaveragesareusefultoindividualsraisesthesameissue
throughoutsocialscienceresearch.Imaginetwoschools,StJoseph’sandSt.Mary’s,
bothofwhichwereincludedinanRCTofaclassroominnovation.Theinnovationis
successfulonaverage,butshouldtheschoolsadoptit?ShouldStMary’sbeinflu-
encedbyapreviousattemptinStJoseph’sthatwasjudgedafailure?Manywould
dismissthisexperienceasanecdotalandaskhowStJoseph’scouldhaveknownthat
itwasafailurewithoutbenefitof`rigorous’evidence.YetifStMary’sislikeStJo-
seph’s,withasimilarmixofpupils,asimilarcurriculum,andsimilaracademic
standing,mightnotStJoseph’sexperiencebemorerelevanttowhatmighthappen
atStMary’sthanisthepositiveaveragefromtheRCT?Andmightitnotbeagood
ideafortheteachersandgovernorsofStMary’stogotoStJoseph’sandfindout
whathappenedandwhy?Theymaybeabletoobservethemechanismofthefailure,
ifsuchitwas,andfigureoutwhetherthesameproblemswouldapplyforthem,or
whethertheymightbeabletoadapttheinnovationtomakeitworkforthem,per-
hapsevenmoresuccessfullythanthepositiveaverageinthetrial.
47
Onceagain,thesequestionsareunlikelytobeeasilyansweredinpractice;
but,aswithtransportability,thereisnoseriousalternativetotrying.Assumingthat
theaverageworksforyouwilloftenbewrong,anditwillatleastsometimesbepos-
sibletodobetter.Asinthemedicalcase,theadvicetoindividualschoolsoftenlacks
specificity.Forexample,theU.S.InstituteofEducationScienceshasprovideda
“user-friendly”guidetopracticessupportedbyrigorousevidence,USDepartmentof
Education(2003).Theadvice,whichissimilartorecommendationsindevelopment
economics,isthattheinterventionbedemonstratedeffectivethroughwell-designed
RCTsinmorethanonesiteandthat“thetrialsshoulddemonstratetheinterven-
tion’seffectivenessinschoolsettingssimilartoyours”(2003,17).Nooperational
definitionof“similar”isprovided.
2.8Examplesandillustrationsfromeconomics
OurargumentsinthisSectionshouldnotbecontroversial,yetwebelievethatthey
representanapproachthatisdifferentfrommostcurrentpractice.Todocument
thisandtofilloutthearguments,weprovidesomeexamples.Whiletheseareocca-
sionallycritical,ourpurposeisconstructive;indeed,webelievethatmisunderstand-
ingsabouthowtouseRCTshaveartificiallylimitedtheirusefulness,aswellasalien-
atedsomewhowouldotherwiseusethem.
Conditionalcashtransfers(CCTs)areinterventionsthathavebeentestedus-
ingRCTs(andotherRCT-likemethods)andareoftencitedasaleadingexampleof
howanevaluationwithstronginternalvalidityleadstoarapidspreadofthepolicy,
e.g.AngristandPischke(2010)amongmanyothers.Thinkthroughthecausalchain
thatisrequiredforCCTstobesuccessful:Peoplemustlikemoney,theymustlike
(ordonotobjecttoomuch)totheirchildrenbeingeducatedandvaccinated,there
mustexistschoolsandclinicsthatarecloseenoughandwellenoughstaffedtodo
theirjob,andthegovernmentoragencythatisrunningtheschememustcareabout
thewellbeingoffamiliesandtheirchildren.Thatsuchconditionsholdinawide
rangeof(althoughcertainlynotall)countriesmakesitunsurprisingthatCCTs
`work’inmanyreplications,thoughtheycertainlywillnotworkinplaceswherethe
schoolsandclinicsdonotexist,e.g.Levy(2001),norinplaceswherepeople
stronglyopposeeducationorvaccination.
48
Similarly,giventhatthesupportfactorswilloperatewithdifferentstrengths
andeffectivenessindifferentplaces,itisalsonotsurprisingthatthesizeoftheATE
differsfromplacetoplace;forexample,Vivalt’sAidGradewebsitelists29estimates
fromarangeofcountriesofthestandardized(dividedbylocalstandarddeviationof
theoutcome)effectsofCCTsonschoolattendance;allbutfourshowtheexpected
positiveeffect,andtherangerunsfrom–8to+38percentagepoints,Vivalt(2015).
Eveninthisleadingcase,wherewemightreasonablyconcludethatCCTs`work’in
gettingchildrenintoschool,itwouldbehardtocalculatecrediblecost-effectiveness
numbersortocometoageneralconclusionaboutwhetherCCTsaremoreorless
costeffectivethanotherpossiblepolicies.Bothcostsandeffectsizescanbeex-
pectedtodifferinnewsettings,justastheyhaveinobservedones,makingthese
predictionsdifficult.
Therangeofestimatesillustratesthatthesimpleviewofexternalvalidity—
thattheATEtransportsfromoneplacetoanother—isnotreasonable.AidGrade
usesstandardizedmeasuresofeffectsizedividedbystandarddeviationofoutcome
atbaseline,asdoesthemajormulti-countrystudybyBanerjeeetal(2015).Butwe
mightprefermeasuresthathaveaneconomicinterpretation,suchasadditional
monthsofschoolingper$100spent(forexampleifadonoristryingtodecide
wheretospend,seebelow).Nutritionmightbemeasuredbyheight,orbythelogof
height.EveniftheATEbyonemeasurecarriesacross,itwillonlydosousingan-
othermeasureiftherelationshipbetweenthetwomeasuresisthesameinbothsit-
uations.Thisisexactlythesortofthingthataformalanalysisoftransportability
forcesustothinkabout.(NotealsothattheATEintheoriginalRCTcandifferde-
pendingonwhethertheoutcomeismeasuredinlevelsorinlogs;itiseasytocon-
structexampleswherethetwoATEshavedifferentsigns.)
Muchoftheeconomicsliterature,likethemedicalliterature,workswiththe
viewofexternalvaliditythat,unlessthereisevidencetothecontrary,thedirection
andsizeoftreatmenteffectscanbetransportedfromoneplacetoanother.TheJ-
PALwebsitereportsitsfindingsunderageneralheadingofpolicyrelevance,subdi-
videdbyaselectionoftopics.Undereachtopic,thereisalistofrelevantRCTsfrom
arangeofdifferentsettingsaroundtheworld.Theseareconvenientlyconverted
49
intoacommoncost-effectivenessmeasuresothat,forexample,under“education”,
subhead“studentparticipation”,therearefourstudiesfromAfrica:oninforming
parentsaboutthereturnstoeducationinMadagascar,ondeworming,onschooluni-
forms,andonmeritscholarships,allfromKenya.Theunitsofmeasurementaread-
ditionalyearsofstudenteducationper$100,andamongthesefourstudies,theav-
erageeffectsofspending$100are20.7years,13.9years,0.71yearsand0.27years
respectively.(Notethatthisisadifferent—andmuchsuperior—standardization
fromtheeffectsizestandardizationdiscussedbelow.)
Whatcanweconcludefromsuchcomparisons?Foraphilanthropicdonorin-
terestedineducation,andifmarginalandaverageeffectsarethesame,theymight
indicatethatthebestplacetodevoteamarginaldollarisinMadagascar,whereit
wouldbeusedtoinformparentsaboutthevalueofeducation.Thisiscertainlyuse-
ful,butitisnotasusefulasstatementsthatinformationordewormingprograms
areeverywheremorecost-effectivethanprogramsinvolvingschooluniformsor
scholarships,orifnoteverywhere,atleastoversomedomain,anditisthesesecond
kindsofcomparisonthatwouldgenuinelyfulfillthepromiseof`findingoutwhat
works.’Butsuchcomparisonsonlymakesenseifwecantransporttheresultsfrom
oneplacetoanother,iftheKenyanresultsalsoholdinMadagascar,Mali,orNa-
mibia,orsomeotherlistofplaces.J-PAL’smanualforcost-effectiveness,Dhaliwalet
al(2012)explainsin(entirelyappropriate)detailhowtohandlevariationincosts
acrosssites,notingvariablefactorssuchaspopulationdensity,prices,exchange
rates,discountrates,inflation,andbulkdiscounts.Butitgivesshortshrifttocross-
sitevariationinthesizeofATEs,whichplayanequalpartinthecalculationsofcost
effectiveness.Themanualbrieflynotesthatdiminishingreturns(orthelast-mile
problem)mightbeimportantintheorybutarguesthatthebaselinelevelsofout-
comesarelikelytobesimilarinthepilotandreplicationareas,sothattheATEcan
besafelytransportedasis.Allofthislacksajustificationfortransportability,some
understandingofwhenresultstransport,whentheydonot,orbetterstill,howthey
shouldbemodifiedtomakethemtransportable.
50
OneofthelargestandmosttechnicallyimpressiveofthedevelopmentRCTs
isbyBanerjeeetal(2015),whichtestsa“graduation”programdesignedtoperma-
nentlyliftextremelypoorpeoplefrompovertybyprovidingthemwithagiftofa
productiveasset(fromguinea-pigs,(regular-)pigs,sheep,goats,orchickensde-
pendingonlocale),trainingandsupport,andlife-skillscoaching,aswellassupport
forconsumption,saving,andhealthservices.Theideaisthatthispackageofaidcan
helppeoplebreakoutofpovertytrapsinawaythatwouldnotbepossiblewithone
interventionatatime.ComparableversionsoftheprogramweretestedinEthiopia,
Ghana,Honduras,India,Pakistan,andPeruand,exceptingHonduras(wherethe
chickensdied)findlargelypositiveandpersistenteffects—withsimilar(standard-
ized)effectsizes—forarangeofoutcomes(economic,mentalandphysicalhealth,
andfemaleempowerment).Onesiteapart,essentiallyeveryoneacceptedtheiras-
signment.ReplicationofpositiveATEsoversuchawiderangeofplacescertainly
providesproofofconceptforsuchascheme.YetBauchet,Morduch,andRavi(2015)
failtoreplicatetheresultinSouthIndia,wherethecontrolgroupgotaccesstomuch
thesamebenefits,whatHeckman,Hohman,andSmith(2000)call“substitution
bias”.Evenso,theresultsareimportantbecause,althoughthereisalongstanding
interestinpovertytraps,manyeconomistshavebeenskepticaloftheirexistenceor
thattheycouldbesprungbysuchaid-basedpolicies.Inthissense,thestudyisan
importantcontributiontothetheoryofeconomicdevelopment;ittestsatheoretical
propositionandwill(orshould)changemindsaboutit.
Anumberofdifficultiesremain.Astheauthorsnote,suchtrialscannottellus
whichcomponentofthetreatmentaccountedfortheresults,orwhichmightbedis-
pensable—amuchmoreexpensivemultifactorialtrialwouldberequired—thoughit
seemslikelyinpracticethatthecostliestcomponent—therepeatedvisitsfortrain-
ingandsupport—islikelytobethefirsttobecutbycash-strappedpoliticiansorad-
ministrators.Andasnoted,itisnotclearwhatshouldcountas(simple)replication
ininternationalcomparisons;itishardtothinkoftheusesofstandardizedeffect
sizes,excepttodocumentthateffectsexisteverywhereandthattheyaresimilarly
largerelativetolocalvariationinsuchthings.
51
Theeffectsize—theATEstandardizedbybeingexpressedinnumbersof
standarddeviationsoftheoriginaloutcome—thoughconvenientlydimensionless,
haslittletorecommendit.AswithmuchofRCTpractice,itstripsoutanyeconomic
content—noratesofreturn,orbenefitsminuscosts—anditremovesanydiscipline
onwhatisbeingcompared.Applesandorangesbecomeimmediatelycomparable,
asdotreatmentswhoseinclusioninameta-analysisislimitedonlybytheimagina-
tionoftheanalystsinclaimingsimilarity.Trainingprogramsforphysicalfitnesscan
bepooledwithtrainingprogramsforwelding,ormarketing,orevenobedience
trainingforpets.Inpsychology,wheretheconceptoriginated,thisresultsinendless
disputesaboutwhatshouldandshouldnotbepooledinameta-analysis.Goldberger
andManski(1995,769)notethat“standardizationaccomplishesnothingexceptto
givequantitiesinnoncomparableunitsthesuperficialappearanceofbeingincom-
parableunits.Thisaccomplishmentisworsethanuseless—ityieldsmisleadingin-
ferences.”Beyondthat,Simpson(2017)notesthatrestrictionsonthetrialsample—
oftengoodpracticetoreducebackgroundnoiseandtohelpdetectaneffect—will
reducethebaselinestandarddeviationandinflatetheeffectsize.Moregenerally,ef-
fectsizesareopentomanipulationbyexclusionrules.Itmakesnosensetoclaim
replicabilityonthebasisofeffectsizes,letalonetousethemtorankprojects.Effect
sizesareirrelevantforpolicymaking.
Thegraduationstudycanbetakenastheclosesttofulfillingthe`findingout
whatworks’aimoftheRCTmovementindevelopment.Yetitissilentonperhaps
thecrucialaspectforpolicy,whichisthatthetrialwasruninpartnershipwith
NGOs,whereaswhatwewouldliketoknowiswhetheritcouldbereplicatedbygov-
ernments,includingthosegovernmentsthatareincapableofgettingdoctors,
nurses,andteacherstoshowuptoclinicsorschools,Chaudhuryetal(2005),
Banerjee,DeatonandDuflo(2004),orofregulatingthequalityofmedicalcareinei-
therthepublicorprivatesectors,Filmer,HammerandPritchett(2000)orDasand
Hammer(2005).Infact,wealreadyknowagreatdealabout`whatworks.’Vaccina-
tionswork,maternalandchildhealthcareserviceswork,andclassroomteaching
works.Yetknowingthisdoesnotgetthosethingsdone.Addinganotherprogram
thatworksunderidealconditionsisusefulonlywhereconditionsareinfactideal,in
52
whichcaseitwouldlikelybeunnecessary.Findingoutwhatworksisnotthemagic
keytoeconomicdevelopment.Technicalknowledge,thoughalwaysworthhaving,
requiressuitableinstitutionsandsuitableincentivesifitistodoanygood.
Asimilarpointisdocumentedinthecontrastbetweenasuccessfultrialthat
usedcamerasandthreatsofwagereductionstoincentivizeattendanceofteachers
inschoolsrunbyanNGOinRajasthaninIndia,Duflo,Hanna,andRyan(2012),and
thesubsequentfailureofafollow-upprograminthesamestatetotacklemassab-
senteeismofhealthworkers,Banerjee,Duflo,andGlennerster(2008).Inthe
schools,thecamerasandtimekeepingworkedasintended,andteacherattendance
increased.Intheclinics,therewasashort-runeffectonnurseattendance,butitwas
quicklyeliminated.(Theabilityofagentseventuallytounderminepoliciesthatare
initiallyeffectiveiscommonenoughandnoteasilyhandledwithinanRCT.)Inboth
trials,therewereincentivestoimproveattendance,andtherewereincentivesto
findawaytosabotagethemonitoringandrestoreworkerstotheiraccustomedpo-
sitions;theforceoftheseincentivesisa`high-level’cause,likegravity,ortheprinci-
pleofthelever,thatworksinmuchthesamewayeverywhere.Fortheclinics,some
sabotagewasdirect—thesmashingofcameras—andsomewassubtler,whengov-
ernmentsupervisorsprovidedofficial,thoughspecious,reasonsformissingwork.
WecanonlyconjecturewhythecausalitywasswitchedinthemovefromNGOto
government;wesuspectthatworkingforahighly-respectedlocalNGOisadifferent
contractfromworkingforthegovernment,wherenotshowingupforworkis
widely(ifinformally)understoodtobepartofthedeal.Theincentiveleverworks
whenitiswiredupright,aswiththeNGOs,butnotwhenthewiringcutsitout,as
withthegovernment.Knowing`whatworks’inthesenseofthetreatmenteffecton
thetrialpopulationisoflimitedvaluewithoutunderstandingthepoliticalandinsti-
tutionalenvironmentinwhichitisset.Thisunderlinestheneedtounderstandthe
underlyingsocial,economic,andculturalstructures—includingtheincentivesand
agencyproblemsthatinhibitservicedelivery—thatarerequiredtosupportthe
causalpathwaysthatweshouldliketoseeatwork.
Trialsineconomicdevelopmentoftentakeplaceinartificialenvironments.
Drèze(2016)notes,basedonextensiveexperienceinIndia,“whenaforeignagency
53
comesinwithitsheavybootsandsuitcasesofdollarstoadministera`treatment,’
whetherthroughalocalNGOorgovernmentorwhatever,thereisalotgoingon
otherthanthetreatment.”Thereisalsothesuspicionthatatreatmentthatworks
doessobecauseofthepresenceofthe`treators,’oftenfromabroad,andmaynot
workdosowiththepeoplewhowillworkitinpractice.
Thereisalsomuchtobelearnedfrommanyyearsofeconomictrialsinthe
UnitedStates,particularlyfromtheworkofMDRC,fromtheearlyincometaxtrials,
aswellasfromtheRandHealthExperiment.Followingtheincometaxtrials,MDRC
hasrunmanyrandomizedtrialssincethe1970s,mostlyfortheFederalgovernment
butalsoforindividualstatesandforCanada(seethethoroughandinformativeac-
countbyGueronandRolston(2011)forthefactualinformationunderlyingthefol-
lowingdiscussion).MDRC’sprogram,likethatofJPALindevelopment,isintended
tofindout`whatworks’inthestateandfederalwelfareprograms.Theseprograms
areconditionalcashtransfersinwhichpoorrecipientsaregivencashprovidedthey
satisfycertainconditionssuchasworkrequirementsortraining,whichareoften
thesubjectofthetrial.Whatarethebenefitsandcostsofvariousalternatives,both
totherecipientsandtothelocalandfederaltaxpayers?Alloftheseprogramsare
deeplypoliticized,withsharplydifferentviewsoverbothfactsanddesirability.
Manyengagedinthesedisputesfeelcertainofwhatshouldbedoneandwhatits
consequenceswillbesothat,bytheirlights,controlgroupsareunethicalbecause
theydeprivesomepeopleofwhattheadvocates`know’willbecertainbenefits.
Giventhis,itisperhapssurprisingthatRCTshavebecometheacceptednormfor
thiskindofpolicyevaluationintheUS.
Thereasonsowemuchtopoliticalinstitutions,aswellastothecommonbe-
lief,exploredinSection1,thatRCTscanrevealthetruth.AttheFederallevel,pro-
spectivepoliciesarevettedbythenon-partisanCongressionalBudgetOffice(CBO),
whichmakesitsownestimatesofthebudgetaryimplicationsoftheprogram.Ideo-
logueswhoseprogramsarescoredpoorlybytheCBOhaveanincentivetosupport
anRCT,nottoconvincethemselves,buttoconvinceopponents;onceagain,RCTs
arevaluablewhenyouropponentsdonotshareyourprior.Andcontrolgroupsare
54
easiertoputinplacewhenthereareinsufficientfundstocoverthewholepopula-
tion.TherewasalsoawidespreadandlargelyuncriticalbeliefthatRCTsgivethe
rightanswer,atleastforthebudgetaryimplications,which,ratherthanthewellbe-
ingoftherecipients,wereoftentheprimaryconcern;notethatallofthesetrialsare
onpoorpeoplebyrichpeoplewhoaretypicallymoreconcernedwithcostthanwith
thewellbeingofthepoor,Greenberg,SchroderandOnstott(1999).MDRCstrials
couldthereforebeeffectivedispute-reconciliationmechanismsbothforthosewho
sawtheneedforevidenceandforthosewhodidnot(exceptinstrumentally).The
outcomeherefitswithour`publichealth’case;whatthepoliticiansneedtoknowis
nottheoutcomesforindividuals,orevenhowtheoutcomesinonestatemight
transporttoanother,buttheaveragebudgetarycostinaspecificplace,something
thatagoodRCTconductedonarepresentativesampleofthetargetpopulationcan
deliver,atleastintheabsenceofgeneralequilibriumeffects,timingeffects,etc.
TheseRCTsbyMDRCandothercontractorshavedemonstratedboththefea-
sibilityoflarge-scalesocialtrialsincludingthepossibilityofrandomizationinthese
settings(wheremanyparticipantswerehostiletotheidea),aswellastheiruseful-
nesstopolicymakers.Theyalsoseemtohavechangedbeliefs,forexampleinfavor
ofthedesirabilityofworkrequirementsasaconditionofwelfare,evenamongmany
originallyopposed.Therearealsolimitations;thetrialsappeartohavehadatbesta
minorinfluenceonscientificthinkingaboutbehaviorinlabormarketsand,inthat
sense,theyaremoreabout`plumbing’thanscience,Duflo(2017).Theresultsof
similarprogramshaveoftenbeendifferentacrossdifferentsites,andtherehasto
datebeennofirmunderstandingofwhy;indeed,thetrialsarenotdesignedtore-
vealthis,Moffitt(2004).Finally,andperhapscruciallyforthepotentialcontribution
toeconomicscience,therehasbeenlittlesuccessinunderstandingeithertheunder-
lyingstructuresorchainsofcausation,inspiteofadeterminedeffortfromthebe-
ginningtoopentheblackboxes.
TheRANDhealthexperiment,Manningetal(1975a,b),providesadifferent
butequallyinstructivestoryifonlybecauseitsresultshavepermeatedtheacademic
andpolicydiscussionsabouthealthcareeversince.Itwasoriginallydesignedtotest
whethermoregenerousinsurancecausespeopletousemoremedicalcareand,ifso,
55
byhowmuch.Theincentiveeffectsarehardlyindoubttoday;theimmortalityofthe
studycomesratherfromthefactthatitsmulti-arm(responsesurface)designal-
lowedthecalculationofanelasticityforthestudypopulation,thatmedicalexpendi-
turesdecreasedby–0.1to–0.2percentforeverypercentageincreaseinthecopay-
ment.AccordingtoAron-Dine,Einav,andFinkelstein(2013),itisthisdimensionless
andthusapparentlytransportablenumberthathasbeenusedeversincetodiscuss
thedesignofhealthcarepolicy;theelasticityhascometobetreatedasauniversal
constant.Ironically,theyarguethattheestimatecannotbereplicatedinrecentstud-
ies,anditisevenunclearthatitisfirmlybasedontheoriginalevidence.Thispoints,
onceagain,tothecentralimportanceoftransportabilityfortheusefulness,both
shortandlongterm,ofatrial.Here,thesimpledirecttransportabilityoftheresult
seemstohavebeenlargelyillusorythough,aswehaveargued,thisdoesnotmean
thatmorecomplexconstructionsbasedontheresultsofthetrialwouldnothave
donebetter.
Conclusions
Itisusefultorespondtotwochallengesthatareoftenputtous,onefrommedicine
andonefromsocialscience.Themedicalchallengeis,“Ifyouarebeingprescribeda
newdrug,wouldn’tyouwantittohavebeenthroughanRCT?”Thesecond(related)
challengeis,“OK,youhavehighlightedsomeoftheproblemswithRCTs,butother
methodshaveallofthoseproblems,plusproblemsoftheirown.”Webelievethatwe
haveansweredbothoftheseinthepaperbutthatitishelpfultorecapitulate.
Themedicalchallengeisaboutyou,aspecificperson,sothatoneanswer
wouldbethatyoumaybedifferentfromtheaverage,andyouareentitledtoand
oughttoaskabouttheoryandevidenceaboutwhetheritwillworkforyou.This
wouldbeintheformofaconversationbetweenyouandyourphysician,whoknows
alotaboutyou.Youwouldwanttoknowhowthisclassofdrugissupposedtowork
andwhetherthatmechanismislikelytoworkforyou.Isthereanyevidencefrom
otherpatients,especiallypatientslikeyou,withyourconditionandinyourcircum-
stances,oraretheresuggestionsfromtheory?Whatscientificworkhasbeendone
toidentifywhatsupportfactorsmatterforsuccesswiththiskindofdrug?Iftheonly
56
informationavailableisfromthepharmaceuticalcompany,anRCTmightseemlike
agoodidea.Buteventhen,andalthoughknowledgeofthemeaneffectamongsome
groupiscertainlyofvalue,youmightgivelittleweighttoanRCTwhoseparticipants
areselectedinthewaytheywereselectedinthetrial,orwherethereislittleinfor-
mationaboutwhethertheoutcomesarerelevanttoyou.Recallthatmanynewdrugs
areprescribed‘off-label’,forapurposeforwhichtheywerenottested,andbeyond
that,thatmanynewdrugsareadministeredintheabsenceofanRCTbecauseyou
areactuallybeingenrolledinone.Forpatientswhoselastchanceistoparticipatein
atrialofsomenewdrug,thisisexactlythesortofconversationyoushouldhave
withyourphysician(followedbyoneaskinghertorevealwhetheryouareintheac-
tivearm,sothatyoucanswitchifnot),andsuchconversationsneedtotakeplace
forallprescriptionsthatarenewtoyou.Intheseconversations,theresultsofan
RCTmayhavemarginalvalue.Ifyourphysiciantellsyouthatsheendorsesevidence-
basedmedicine,andthatthedrugwillmostlikelyworkforyoubecauseanRCThas
shownthat‘itworks’,itistimetofindanewphysician.
Thesecondchallengeclaimsthatothermethodsarealwaysdominatedbyan
RCT.Thiskindofchallengeisnotwell-formulated.Dominatedforansweringwhat
question,forwhatpurposes?ThechiefadvantageoftheRCTisthatitcan,ifwell-
conducted,giveanunbiasedestimateofanATEinastudy(trial)sampleandthus
provideevidencethatthetreatmentcausedtheoutcomeinsomeindividualsinthat
sample.Ifthatiswhatyouwanttoknowandthere’slittlebackgroundknowledge
availableandthepriceisright,thenanRCTmaybethebestchoice.Astoother
questions,theRCTresultcanbepart—butusuallyonlyasmallpart—ofthedefense
of(a)ageneralclaim,(b)aclaimthatthetreatmentwillcausethatoutcomefor
someotherindividuals,oreven(c)aclaimaboutwhattheATEwillbeinsomeother
population.Buttheydolittlefortheseenterprisesontheirown.Whatisthebest
overallpackageofresearchworkfortacklingthesequestions—mostcost-effective
andmostlikelytoproducecorrectresults—dependsonwhatweknowandwhat
differentkindsofresearchwillcost.
57
ThereareexampleswhereanRCTdoesbetterthananobservationalstudy,
andtheseseemtobethecasesthatcometomindfordefendersofRCTs.Forexam-
ple,regressionsofwhetherpeoplewhogetMedicaiddobetterorworsethanpeople
withprivateinsurancearevitiatedbygrossdifferencesintheothercharacteristics
ofthetwopopulations.ButitisalongstepfromthattosayingthatanRCTcansolve
theproblem,letalonethatitistheonlywaytosolvetheproblem.Itwillnotonlybe
expensivepersubject,butitcanonlyenrollaselectedandalmostcertainlyunrepre-
sentativestudysample,itcanberunonlytemporarily,andtherecruitmenttothe
experimentwillnecessarilybedifferentfromrecruitmentinaschemethatisper-
manentandopentothefullqualifiedpopulation.Noneofthisremovestheblem-
ishesoftheobservationalstudy,buttherearemanymethodsofmitigatingitsdiffi-
culties,sothat,intheend,anobservationalstudywithcrediblecorrectionsanda
morerelevantandmuchlargerstudysample—todayoftenthecompletepopulation
ofinterestthroughadministrativerecords—mayprovideabetterestimate.Every-
thinghastobejudgedonacase-by-casebasis.Thereisnorigorousargumentfora
lexicographicpreferenceforRCTs.
Thereisalsoanimportantlineofenquirythatgoes,notonlybeyondRCTs,
butbeyondthe‘methodofdifferences’thatiscommontoRCTs,regressions,orany
formofcontrolledoruncontrolledcomparison.Thehypothetico-deductivemethod
confrontstheory-baseddeductionswiththedata—eitherobservationalorexperi-
mental.Asnotedabove,economistsroutinelyusetheorytoteaseoutanewimplica-
tionthatcanbetakentothedata,andtherearealsogoodexamplesinmedicine
suchasBleyerandWelch(2012)’sdemonstrationofthelimitedimpactonbreast
cancerincidenceofmammographyscreening,atopicwhereothermethodshave
generatedgreatcontroversyandlittleconsensus.
RCTsaretheultimateinnon-parametricestimationofaveragetreatmentef-
fectsinthetrialsamplesbecausetheymakesofewassumptionsaboutheterogene-
ity,causalstructure,choiceofvariables,andfunctionalform.RCTsareoftenconven-
ientwaystointroduceexperimenter-controlledvariance—ifyouwanttoseewhat
happens,thenkickitandsee,twistthelion’stail—butnotethatmanyexperiments,
includingmanyofthemostimportant(andNobelPrizewinning)experimentsin
58
economics,donotanddidnotuserandomization,Harrison(2013),Svorencik
(2015).Butthecredibilityoftheresults,eveninternally,canbeunderminedbyun-
balancedcovariatesandbyexcessiveheterogeneityinresponses,especiallywhen
thedistributionofeffectsisasymmetric,whereinferenceonmeanscanbehazard-
ous.Ironically,thepriceofthecredibilityinRCTsisthatwecanonlyrecoverthe
meanofthedistributionoftreatmenteffects,andthatonlyforthetrialsample.Yet,
inthepresenceofoutliers,reliableinferenceonmeansisdifficult.Andrandomiza-
tioninandofitselfdoesnothingunlessthedetailsareright;purposiveselectioninto
theexperimentalpopulation,likepurposiveselectionintoandoutofassignment,
underminesinferenceinjustthesamewayasdoesselectioninobservationalstud-
ies.Lackofblinding,whetherofparticipants,trialists,datacollectors,oranalysts,
underminesinference,akintoafailureofexclusionrestrictionsininstrumentalvari-
ableanalysis.
ThelackofstructurecanbeseriouslydisablingwhenwetrytouseRCTre-
sultsoutsideofafewcontexts,suchasprogramevaluation,hypothesistesting,or
establishingproofofconcept.Beyondthat,theresultscannotbeusedtohelpmake
predictionsbeyondthetrialsamplewithoutmorestructure,withoutmorepriorin-
formation,andwithouthavingsomeideaofwhatmakestreatmenteffectsvaryfrom
placetoplaceortimetotime.Thereisnooptionbuttocommittosomecausal
structureifwearetoknowhowtouseRCTevidenceoutoftheoriginalcontext.
Simplegeneralizationandsimpleextrapolationdonotcutthemustard.Thisistrue
ofanystudy,experimentalorobservational.Butobservationalstudiesarefamiliar
with,androutinelyworkwith,thesortofassumptionsthatRCTsclaimtoavoid,so
thatiftheaimistouseempiricalevidence,anycredibilityadvantagethatRCTshave
inestimationisnolongeroperative.AndbecauseRCTstellussolittleaboutwhyre-
sultshappen,theyhaveadisadvantageoverstudiesthatuseawiderrangeofprior
informationanddatatohelpnaildownmechanisms.
Yetoncethatcommitmenthasbeenmade,RCTevidencecanbeextremely
useful,pinningdownpartofastructure,helpingtobuildstrongerunderstanding
andknowledge,andhelpingtoassesswelfareconsequences.Asourexamplesshow,
thiscanoftenbedonewithoutcommittingtothefullcomplexityofwhatareoften
59
thoughtofasstructuralmodels.Yetwithoutthestructurethatallowsustoplace
RCTresultsincontext,ortounderstandthemechanismsbehindthoseresults,not
onlycanwenottransportwhether`itworks’elsewhere,butwecannotdothestand-
ardstuffofeconomics,whichistosaywhethertheinterventionisactuallywelfare
improving.Withoutknowingwhythingshappenandwhypeopledothings,werun
theriskofworthlesscasual(`fairystory’)causaltheorizingandhavegivenupon
oneofthecentraltasksofeconomics.
Wemustbackawayfromtherefusaltotheorize,fromtheexultationinour
abilitytohandleunlimitedheterogeneity,andactuallySAYsomething.Perhapspar-
adoxically,unlesswearepreparedtomakeassumptions,andtosaywhatweknow,
makingstatementsthatwillbeincredibletosome,allthecredibilityoftheRCTisfor
naught.
RCTsineconomicsonhealth,labor,anddevelopmenthaveproventheir
worthinprovidingproofsofconceptandattestingpredictionsthatsomepolicies
mustalwaysworkorcanneverwork.But,aselsewhereineconomics,wecannot
findoutwhysomethingworksbysimplydemonstratingthatitdoeswork,nomatter
howoften,whichleavesusuninformedastowhetherthepolicyshouldbeimple-
mented.Beyondthat,smallscale,demonstrationRCTsarenotcapableoftellingus
whatwouldhappenifthesepolicieswereimplementedtoscale,ofcapturingunin-
tendedconsequencesthattypicallycannotbeincludedintheprotocols,orofmodel-
ingwhatwillhappenifschemesareimplementeddifferentlythaninthetrial,forex-
amplebygovernments,whosemotivesandoperatingprinciplesaredifferentfrom
theNGOsoracademicswhotypicallyruntrials.Whileitistruethatabstract
knowledgeisalwayslikelytobebeneficial,successfulpolicydependsoninstitutions
andonpolitics,mattersonwhichRCTshavelittletosay.TheresultsofRCTscanand
shouldfeedintopublicdebateaboutwhatshouldbedone,butweareondangerous
groundwhentheyareused,ongroundsoftheirsupposedepistemicsuperiority,to
insulatepolicyfromdemocraticprocesses.
Citations
60
Aigner,DennisJ.,1985,“Theresidentialelectricitytime-of-usepricingexperiments.Whathavewelearned?”inDavidA.WiseandJerryA.Hausman,Socialexperimen-tation,Chicago,Il.ChicagoUniversityPressforNationalBureauofEconomicRe-search,11–54.
Al-Ubaydil,Omar,andJohnA.List,2013,“Onthegeneralizabilityofexperimentalre-sultsineconomics,”inG.FrechetteandA.Schotter,Methodsofmodernexperi-mentaleconomics,OxfordUniversityPress.
Altman,DouglasG.,1985,“Comparabilityofrandomizedgroups,”JournaloftheRoyalStatisticalSociety,SeriesD(TheStatistician),34(1),Statisticsinhealth,125–36.
Angrist,JoshuaD.,2004,“Treatmenteffectheterogeneityintheoryandpractice,”EconomicJournal,114,C52–C83.
Angrist,JoshuaD.,EricBettinger,ErikBloom,ElizabethKing,andMichaelKremer,2002,“VouchersforprivateschoolinginColombia:evidencefromarandomizednaturalexperiment,”AmericanEconomicReview,92(5),1535–58.
Angrist,JoshuaD.andJörn-SteffenPischke,2010,“Thecredibilityrevolutioninem-piricaleconomics:howbetterresearchdesignistakingtheconoutofeconomet-rics,”JournalofEconomicPerspectives,24(2),3–30.
Angrist,JoshuaD.andJörn-SteffenPischke,2017,“Undergraduateeconometricsin-struction:throughourclasses,darkly,”JournalofEconomicPerspectives,31(2),125-44.
Aron-Dine,Aviva,LiranEinav,andAmyFinkelstein,2013,“TheRANDhealthinsur-anceexperiment,threedecadeslater,”JournalofEconomicPerspectives,27(1),197–222.
Arrow,KennethJ.,1975,“Twonotesoninferringlongrunbehaviorfromsocialex-periments,”DocumentNo.P-5546,SantaMonica,CA.RandCorporation.
Ashenfelter,Orley,1978,“Thelaborsupplyresponseofwageearners,”inJohnL.PalmerandJosephA.Pechman,eds.,Welfareinruralareas:theNorthCarolina–IowaIncomeMaintenanceExperiment,Washington,DC.TheBrookingsInstitu-tion.109–38.
Athey,SusanandGuidoW.Imbens,2017,“Thestateofappliedeconometrics:cau-salityandpolicyevaluation,”JournalofEconomicPerspectives,31(2),3-32.
Attanasio,Orazio,CostasMeghir,andAnaSantiago,2012,“EducationchoicesinMexico:usingastructuralmodelandarandomizedexperimenttoevaluatePRO-GRESA,”ReviewofEconomicStudies,79(1),37–66.
Attanasio,Orazio,SarahCattan,EmlaFitzsimons,CostasMeghir,andMartaRubioCodina,2015,“Estimatingtheproductionfunctionforhumancapital:resultsfromarandomizedcontrolledtrialinColombia,”London.InstituteforFiscalStudies,WorkingPapernoW15/06.
Bahadur,R.R.,andLeonardJ.Savage,1956,“Thenon-existenceofcertainstatisticalproceduresinnonparametricproblems,”AnnalsofMathematicalStatistics,25:1115–22.
Banerjee,Abhijit,SylvainChassang,SergioMontero,andErikSnowberg,2016,“Atheoryofexperimenters,”processed,July2016.
61
Banerjee,Abhijit,SylvainChassang,andErikSnowberg,2016,“Decisiontheoreticapproachestoexperimentdesignandexternalvalidity,”Cambridge,MA.NBERWorkingPaperno22167,April.
Banerjee,Abhijit,AngusDeaton,andEstherDuflo,2004,“Healthcaredeliveryinru-ralRajasthan,”EconomicandPoliticalWeekly,39(9),944–9.
Banerjee,AbhijitandEstherDuflo,2009,“Theexperimentalapproachtodevelop-menteconomics,”AnnualReviewofEconomics,1,151-78.
Banerjee,AbhijitandEstherDuflo,2012,Pooreconomics:aradicalrethinkingofthewaytofightglobalpoverty,PublicAffairs.
Banerjee,Abhijit,EstherDuflo,NathanaelGoldberg,DeanKarlan,RobertOsei,Wil-liamParienté,JeremyShapiro,BramThuysbaert,andChristopherUdry,2015,“Amultifacetedprogramcauseslastingprogressfortheverypoor:evidencefromsixcountries,”Science,348(6236),1260799.
Banerjee,Abhijit,EstherDuflo,andRachelGlennerster,2008,“Puttingaband-aidonacorpse:incentivesfornursesintheIndianpublichealthcaresystem,”JournaloftheEuropeanEconomicAssociation,6(2–3),487–500.
Banerjee,AbhijitV.andRuiminHe,2003,“TheWorldBankofthefuture,”AmericanEconomicReview,93(2),39–44.
Banerjee,Abhijit,DeanKarlan,andJonathanZinman,2015,“Sixrandomizedevalua-tionsofmicrocredit:introductionandfurthersteps,”AmericanEconomicJournal:AppliedEconomics,7(1),1-21.
Bareinboim,EliasandJudeaPearl,2013,“Ageneralalgorithmfordecidingtrans-portabilityofexperimentalresults,”JournalofCausalInference,1(1),107-34.
Bareinboim,EliasandJudeaPearl,2014,“Transportabilityfrommultipleenviron-mentswithlimitedexperiments:completenessresults,”inM.Welling,Z.Ghah-ramani,C.Cortes,andN.Lawrence,eds.,AdvancesofNeuralInformationPro-cessing,27,(NIPSProceedings),280-8.
Bauchet,Jonathan,JonathanMorduch,andShamikaRavi,2015,“Failurevsdisplace-ment:whyaninnovativeanti-povertyprogramshowednonetimpactinSouthIndia,”JournalofDevelopmentEconomics,116,1–16.
Basu,Kaushik,2010,“TheeconomicsoffoodgrainmanagementinIndia,”MinistryofFinance,Delhi.http://finmin.nic.in/workingpaper/Foodgrain.pdf
Begg,ColinB.,1990,“Significancetestsofcovarianceimbalanceinclinicaltrials,”ControlledClinicalTrials,11(4),223-5.
Bhattacharya,DebopamandPascalineDupas,2012,“Inferringwelfaremaximizingtreatmentassignmentunderbudgetconstraints,”JournalofEconometrics,167(1),168-96.
Bitler,MarianneP.,JonahB.Gelbach,andHilaryW.Hoynes,2006,“Whatmeanim-pactsmiss:distributionaleffectsofwelfarereformexperiments,”AmericanEco-nomicReview,96(4),988-1012.
Bleyer,Archie,andH.GilbertWelch,2012,“Effectofthreedecadesofscreeningmammographyonbreast-cancerincidence,”NewEnglandJournalofMedicine,367,1998-2005
Bloom,HowardS.,CarolynJ.Hill,andJamesA.Riccio,2005,“Modelingcross-siteex-perimentaldifferencestofindoutwhyprogrameffectivenessvaries,”inHoward
62
S.Bloom,ed.,Learningmorefromsocialexperiments:evolvinganalyticalap-proaches,NewYork,NY.RussellSage.
Bold,Tessa,MwangiKimenyi,GermanoMwabu,AliceNg’ang’a,andJustinSandefur,2013,“Scalingupwhatworks:experimentalevidenceonexternalvalidityinKen-yaneducation,”Washington,DC.CenterforGlobalDevelopment,WorkingPaper321.
Bothwell,LauraE.andScottH.Podolsky,2016,“Theemergenceoftherandomized,controlledtrial,”NewEnglandJournalofMedicine,375(6),501–4.doi:10.1056/NEJMp1604635
Campbell,D.T.andJ.C.Stanley,1963,Experimentalandquasi-experimentaldesignsforresearch.Chicago.RandMcNally.
Cartwright,Nancy,1994,Nature’scapacitiesandtheirmeasurement.Oxford.Claren-donPress.
Cartwright,Nancy,2007,“AreRCTsthegoldstandard?”Biosocieties,2,11-20.Cartwright,Nancy,2011,“Aphilosopher’sviewofthelongroadfromRCTstoeffec-tiveness,”TheLancet,377,1400-01.
Cartwright,Nancy,2012,“Presidentialaddress:willthispolicyworkforyou?Pre-dictingeffectivenessbetter:howphilosophyhelps,”PhilosophyofScience,79,973-89.
Cartwright,Nancy,2016.“Whereistherigorwhenyouneedit?”inI.Marinovic,ed.,FoundationsandTrendsinAccounting:specialissueoncausalinferenceincapitalmarketsresearch,10(2-4):106-24.
Cartwright,NancyandJeremyHardie,2012,Evidencebasedpolicy:apracticalguidetodoingitbetter,Oxford.OxfordUniversityPress.
Cartwright,NancyandEileenMunro,2010,“ThelimitationsofRCTsinpredictingeffectiveness,”JournalofExperimentalChildPsychology,16(2),
Chalmers,Iain,2001,“Comparinglikewithlike:somehistoricalmilestonesintheevolutionofmethodstocreateunbiasedcomparisongroupsintherapeuticexper-iments,”InternationalJournalofEpidemiology,30,1156–64.
Chan,TatY.andBartonH.Hamilton,2006,“Learning,privateinformation,andtheeconomicevaluationofrandomizedexperiments,”JournalofPoliticalEconomy,114(6),997-1040.
Chassang,Sylvain,GerardPadróIMiguel,andErikSnowberg,2012,“Selectivetrials:aprincipal–agentapproachtorandomizedcontrolledexperiments,”AmericanEconomicReview,102(4),1279–1309.
Chassang,Sylvain,ErikSnowberg,BenSeymour,andCayleyBowles,2015,“Ac-countingforbehaviorintreatmenteffects:newapplicationsforblindtrials,”PLoSOne,10(6),e0127227.doi:10:1371/journal.pone.0127227.
Chaudhury,Nazmul,JeffreyHammer,MichaelKremer,KarthikMuralidharan,andF.HalseyRogers,2005,“Missinginaction:teacherandhealthworkerabsenceinde-velopingcountries,”JournalofEconomicPerspectives,19(4),91–116.
Chyn,Eric,2016,“Movedtoopportunity:thelong-runeffectofpublichousingdemo-litiononlabormarketoutcomesofchildren,”UniversityofMichigan.http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf
63
Chetty,Raj,2009,“Sufficientstatisticsforwelfareanalysis:abridgebetweenstruc-turalandreduced-formmethods,”AnnualReviewofEconomics,1,451-87.
Conlisk,John,1973,“Choiceofresponsefunctionalformindesigningsubsidyexper-iments,”Econometrica,41(4),643–56.
Crépon,Bruno,EstherDuflo,MarcGurgand,RolandRathelot,andPhilippeZamora,2014,“Dolabormarketpolicieshavedisplacementeffects?evidencefromaclus-teredrandomizedexperiment,”QuarterlyJournalofEconomics,128(2),531–80.
Das,JishnuandJeffreyHammer,2005,”Whichdoctor?Combiningvignettesanditemresponsetomeasureclinicalcompetence,”JournalofDevelopmentEconom-ics,78,348–83.
Davey-Smith,George,andShahIbrahim,2002,“Datadredging,bias,orconfound-ing,”BritishMedicalJournal,325,1437-8.
Deaton,Angus,2010,“Instruments,randomization,andlearningaboutdevelop-ment,”JournalofEconomicLiterature,48(2),424-55.
Deaton,AngusandNancyCartwright,2016,“Understandingandmisunderstandingrandomizedcontrolledtrials,”http://www.princeton.edu/~deaton/down-load.html?pdf=Deaton_Cartwright_RCTs_with_ABSTRACT_August_25.pdf
Deaton,AngusandJohnMuellbauer,1980,Economicsandconsumerbehavior,NewYork.CambridgeUniversityPress.
Deaton,AngusandSerenaNg,1998,“Parametricandnonparametricapproachestopriceandtaxreform,”JournaloftheAmericanStatisticalAssociation,93(443),900-9.
Dhaliwal,Iqbal,EstherDuflo,RachelGlennerster,andCaitlinTulloch,2012,“Com-parativecost-effectivenessanalysistoinformpolicyindevelopingcountries:ageneralframeworkwithapplicationsforeducation,”J–PAL,MIT,December3rd.http://www.povertyactionlab.org/publication/cost-effectiveness
Drèze,Jean,2016,Personalemailcommunication.Duflo,Esther,2017,“Theeconomistasplumber,”AmericanEconomicReview,107(5),1-26.
Duflo,Esther,RemaHanna,andStephenP.Ryan,2012,“Incentiveswork:gettingteacherstocometoschool,”AmericanEconomicReview,102(4),1241–78.
Duflo,EstherandMichaelKremer,2008,“Useofrandomizationintheevaluationofdevelopmenteffectiveness,”inWilliamEasterly,ed.,Reinventingforeignaid.Washington,DC.Brookings,93–120.
Dynarski,Susan,2015,“Helpingthepoorineducation:thepowerofasimplenudge,”NewYorkTimes,Jan17,2015.
Fine,PaulE.M.andJacquelineA.Clarkson,1986,“Individualversuspublicprioritiesinthedeterminationofoptimalvaccinationpolicies,”AmericanJournalofEpide-miology,124(6),1012–20.
Fisher,RonaldA.,1926,“Thearrangementoffieldexperiments,”JournaloftheMin-istryofAgricultureofGreatBritain,33,503–13.
Filmer,Deon,JeffreyHammer,andLantPritchett,2000,“Weaklinksinthechain:adiagnosisofhealthpolicyinpoorcountries,”WorldBankResearchObserver,15(2),199–204.
64
Freedman,DavidA.,2008,“Onregressionadjustmentstoexperimentaldata,”Ad-vancesinAppliedMathematics,40,180–93.
Frieden,ThomasR.,2017,“Evidenceforhealthdecisionmaking—beyondrandom-ized,controlledtrials,”NewEnglandJournalofMedicine,377,465-75.
Garfinkel,IrwinandCharlesF.Manski,1992,“Introduction,”inIrwinGarfinkelandCharlesF.Manski,eds.,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress.1–22.
Gerber,AlanS.andDonaldP.Green,2012,FieldExperiments,NewYork.Norton.Gertler,PaulJ.,SebastianMartinez,PatrickPremand,LauraB.Rawlings,andChristelM.J.Vermeersch,2016,Impactevaluationinpractice,2ndEdition,Washington,DC.Inter-AmericanDevelopmentBankandWorldBank.
Goldberger,ArthurS.andCharlesF.Manski,1995,“ReviewArticle:TheBellCurvebyHerrnsteinandMurray,”JournalofEconomicLiterature,33(2),762-76.
Greenberg,DavidandMarkShroder,2004,Thedigestofsocialexperiments(3rded.),Washington,DC.UrbanInstitutePress.
Greenberg,David,MarkShroder,andMatthewOnstott,1999,“Thesocialexperi-mentmarket,”JournalofEconomicPerspectives,13(3),157–72.
Gueron,JudithM.andHowardRolston,2013,Fightingforreliableevidence,NewYork,RussellSage.
Guyatt,Gordon,DavidL.Sackett,andDeborahJ.CookfortheEvidence-BasedMedi-cineWorkingGroup,1994,“Users’guidestothemedicalliteratureII:howtouseanarticleabouttherapyorprevention.B.Whatweretheresultsandwilltheyhelpmeincaringformypatients?”JournaloftheAmericanMedicalAssociation,271(1),59–63.
Harrison,GlennW.,2013,“Fieldexperimentsandmethodologicalintolerance,”Jour-nalofEconomicMethodology,20(2),103–17.
Harrison,GlennW.,2014,“Impactevaluationandwelfareevaluation,”EuropeanJournalofDevelopmentResearch,26,39–45.
Harrison,GlennW.,2014,“Cautionarynotesontheuseoffieldexperimentstoad-dresspolicyissues,”OxfordReviewofEconomicPolicy,30(4),753-63.
Hausman,JerryA.andDavidA.Wise,1985,“Technicalproblemsinsocialexperi-mentation:costversuseaseofanalysis,”inJerryA.HausmanandDavidA.Wise,eds.,SocialExperimentation,Chicago,IL.ChicagoUniversityPress.187–220.
Heckman,JamesJ.,1992,“Randomizationandsocialpolicyevaluation,”inCharlesF.ManskiandIrwinGarfinkel,eds.,Evaluatingwelfareandtrainingprograms,Cam-bridge,MA.HarvardUniversityPress.547–70.
Heckman,JamesJ.,1997,“Instrumentalvariables:astudyofimplicitbehavioralas-sumptionsusedinmakingprogramevaluations,”JournalofHumanResources,32(3),441–62.
Heckman,JamesJ.,2005,“Thescientificmodelofcausality,”SociologicalMethodol-ogy,35(1),1-97.
Heckman,JamesJ.,2008,“Econometriccausality,”InternationalStatisticalReview,76(1),1-27.
65
Heckman,JamesJ.,2010,“Buildingbridgesbetweenstructuralandprogramevalua-tionapproachestoevaluatingpolicy,”JournalofEconomicLiterature,48(2),356-98.
Heckman,JamesJ.,NeilHohman,andJeffreySmith,withtheassistanceofMichaelKhoo,2000,“Substitutionanddropoutbiasinsocialexperiments:astudyofaninfluentialsocialexperiment,”QuarterlyJournalofEconomics,115(2),651–94.
Heckman,JamesJ.,RobertJ.Lalonde,andJeffreyA.Smith,1999,“Theeconomicsandeconometricsofactivelabormarkets,”Chapter31inAshenfelter,OrleyandDa-vidCard,eds.Handbookoflaboreconomics,Amsterdam.North-Holland,3(A),1866–2097.
Heckman,JamesJ.,RodrigoPinto,andPeterSavelyev,2013,“Understandingthemechanismsthroughwhichaninfluentialearlychildhoodprogramboostedadultoutcomes,”AmericanEconomicReview,103(6),2052–86.
Heckman,JamesJ.andJeffreySmith,1995,“Assessingthecaseforsocialexperi-ments,”JournalofEconomicPerspectives,9(2),85-110.
Heckman,JamesJ.,JeffreySmith,andNancyClements,1997,“Makingthemostoutofprogrammeevaluationsandsocialexperiments:accountingforheterogeneityinprogrammeimpacts,”ReviewofEconomicStudies,64(4),487–535.
Heckman,JamesJ.andSergioUrzúa,2010,“ComparingIVwithstructuralmodels:whatsimpleIVcanandcannotidentify,”JournalofEconometrics,156,27-37.
Heckman,JamesJ.andEdwardVytlacil,2005,“Structuralequations,treatmentef-fects,andeconometricpolicyevaluation,”Econometrica,73(3),669–738.
Heckman,JamesJ.andEdwardJ.Vytlacil,2007,“Econometricevaluationofsocialprograms,Part1:causalmodels,structuralmodels,andeconometricpolicyeval-uation,”Chapter70inJamesJ.HeckmanandEdwardE.Leamer,eds.,HandbookofEconometrics,6B,4779–874.
Horton,Richard,2000,“Commonsenseandfigures:therhetoricofvalidityinmedi-cine:BradfordHillmemoriallecture1999,”Statisticsinmedicine,19,3149–64.
Hotz,V.Joseph,GuidoW.Imbens,andJulieH.Mortimer,2005,“Predictingtheeffi-cacyoffuturetrainingprogramsusingpastexperienceatotherlocations,”Jour-nalofEconometrics,125,241–70.
Hsieh,Chang-taiandMiguelUrquiola,2006,“Theeffectsofgeneralizedschoolchoiceonachievementandstratification:evidencefromChile’svoucherpro-gram,”JournalofPublicEconomics,90,1477–1503.
Hurwicz,Leonid,1966,“Onthestructuralformofinterdependentsystems,”StudiesinLogicandtheFoundationsofMathematics,44,232-9.
Imbens,GuidoW.,2004,“Nonparametricestimationofaveragetreatmenteffectsunderexogeneity:areview,”ReviewofEconomicsandStatistics,86(1),4–29.
Imbens,GuidoW.,2010,“BetterLATEthannothing:somecommentsonDeaton(2009)andHeckmanandUrzua,”JournalofEconomicLiterature,48(2),399–423.
Imbens,GuidoW.andJoshuaD.Angrist,1994,“Identificationandestimationoflocalaveragetreatmenteffects,”Econometrica,62(2),467–75.
Imbens,GuidoW.andMichalKolesár,2016,“Robuststandarderrorsinsmallsam-ples:somepracticaladvice,”ReviewofEconomicsandStatistics,98(4),701-12.
66
Imbens,GuidoW.andJeffreyM.Wooldridge,2009,“Recentdevelopmentsintheeconometricsofprogramevaluation,”JournalofEconomicLiterature,47(1),5–86.
InternationalCommitteeofMedicalJournalEditors,2015,Recommendationsfortheconduct,reporting,editing,andpublicationofscholarlyworkinmedicaljournals,http://www.icmje.org/icmje-recommendations.pdf(accessed,August20,2016.)
J_PAL,2017,https://www.povertyactionlab.org/about-j-pal,(accessed,August21,2017).
Kahneman,DanielandGaryKlein,2009,“Conditionsforintuitiveexpertise:afailuretodisagree,”AmericanPsychologist,64(6),515–26.
Karlan,DeanandJacobAppel,2011,Morethangoodintentions:howaneweconom-icsishelpingtosolveglobalpoverty,NewYork.Dutton.
Karlan,Dean,NathanealGoldbergandJamesCopestake,2009,“Randomizedcon-trolledtrialsarethebestwaytomeasureimpactofmicrofinanceprogramsandimprovemicrofinanceproductdesigns,”EnterpriseDevelopmentandMicro-finance,20(3),167–76.
Kasy,Maximilian,2016,“Whyexperimentersmightnotwanttorandomize,andwhattheycoulddoinstead,”PoliticalAnalysis,1–15doi:10.1093/pan/mpw012
Kramer,Peter,2016,Ordinarilywell:thecaseforantidepressants,NewYork.Farrar,Straus,andGiroux.
Kremer,MichaelandAlakaHolla,2009,“Improvingeducationinthedevelopingworld:whathavewelearnedfromrandomizedevaluations?”AnnualReviewofEconomics,1,513–42.
Lalonde,RobertJ.,1986,“Evaluatingtheeconometricevaluationsoftrainingpro-gramswithexperimentaldata,”AmericanEconomicReview,76(4),604-20.
Lehman,Erich.L.andJosephP.Romano,2005,Testingstatisticalhypotheses(thirdedition),NewYork.Springer.
Levy,Santiago,2006,Progressagainstpoverty:sustainingMexico’sProgresa-Opor-tunidadesprogram,Washington,DC.Brookings.
Mackie,JohnL.,1974,Thecementoftheuniverse:astudyofcausation,Oxford.Ox-fordUniversityPress.
Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,andArleenLeibowitz,1988a,“Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedexperiment,”AmericanEconomicReview,77(3),251–77.
Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,BernadetteBenjamin,ArleenLeibowitz,M.SusanMarquis,andJackZwanziger,1988b,Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedex-periment,SantaMonica,CA.RAND.
Manski,CharlesF.,2004,“Treatmentrulesforheterogeneouspopulations,”Econo-metrica,72(4),1221-46.
Manski,CharlesF.,2013,Publicpolicyinanuncertainworld:analysisanddecisions,Cambridge,MA.HarvardUniversityPress.
Manski,CharlesF.andAlekseyTetenov,2016,“Sufficienttrialsizetoinformclinicalpractice,”PNAS,113(38),10518-23.
67
Metcalf,CharlesE.,1973,“Makinginferencesfromcontrolledincomemaintenanceexperiments,”AmericanEconomicReview,63(3),478–83.
Moffitt,Robert,1979,“ThelaborsupplyresponseintheGaryexperiment,”JournalofHumanResources,14(4),477–87.
Moffitt,Robert,1992,“Evaluationmethodsforprogramentryeffects,”Chapter6inCharlesManskiandIrwinGarfinkel,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress,231–52.
Moffitt,Robert,2004,“Theroleofrandomizedfieldtrialsinsocialscienceresearch:aperspectivefromevaluationsofreformsofsocialwelfareprograms,”AmericanBehavioralScientist,47(5),506–40
Morgan,KariLockandDonaldB.Rubin,2012,“Rerandomizationtoimprovecovari-atebalanceinexperiments,”AnnalsofStatistics,40(2),1263–82.
Muller,SeánM.,2015,“Causalinteractionandexternalvalidity:obstaclestothepol-icyrelevanceofrandomizedevaluations,”WorldBankEconomicReview,29,S217–S225.
Orcutt,GuyH.andAliceG.Orcutt,1968,“Incentiveanddisincentiveexperimenta-tionforincomemaintenancepolicypurposes,”AmericanEconomicReview,58(4),754–72.
Pearl,JudeaandEliasBareinboim,2011,“Transportabilityofcausalandstatisticalrelations:aformalapproach,”Proceedingsofthe25thAAAIConferenceonArtificialIntelligence,AAAIPress,247-54,
Pearl, Judea and Elias Bareinboim, 2014, “External validity: from do-calculus to trans-portability across populations,” Statistical Science, 29(4), 579-95.
Rodrik,Dani,2006,personalemailcommunication.Rothwell,PeterM.,2005,“Externalvalidityofrandomizedcontrolledtrials:‘towhomdotheresultsofthetrialapply’”,Lancet,365,82–93.
Russell,Bertrand,2008[1912],Theproblemsofphilosophy,Rockville,MD.ArcManor.
Sackett,DavidL.,WilliamM.C.Rosenberg,J.A.MuirGray,R.BrianHaynesandW.ScottRichardson,1996,“Evidencebasedmedicine:whatitisandwhatitisn’t,”BritishMedicalJournal,312(January13),71–2.
Savage,LeonardJ.,1962,“Subjectiveprobabilityandstatisticalpractice,”inG.A.Bar-nardandD.R.Cox,eds.,TheFoundationsofStatisticalInference,London.Me-thuen.9-35.
Scriven,Michael,1974,“Evaluationperspectivesandprocedures,”inW.JamesPop-ham,ed.,Evaluationineducation—currentapplications,Berkeley,CA.McCutchanPublishingCorporation.
Senn,Stephen,1994,“Testingforbaselinebalanceinclinicaltrials,”StatisticsinMedicine,13,1715–26.
Senn,Stephen,2013,“Sevenmythsofrandomizationinclinicaltrials,”StatisticsinMedicine32,1439–50.
Shadish,WilliamR.,ThomasD.Cook,andDonaldT.Campbell,2002,Experimentalandquasi-experimentaldesignsforgeneralizedcausalinference,Boston,MA.HoughtonMifflin.
Simpson,Adrian,2017,“Themisdirectionofpublicpolicy:comparingandcombin-ingstandardisedeffectsizes,”JournalofEducationalPolicy32(4),450-66.
68
Stuart,ElizabethA.,StephenR.Cole,andCatharineP.Bradshaw,andPhilipJ.Leaf,2011,“Theuseofpropensityscorestoassessthegeneralizabilityofresultsfromrandomizedtrials,”JournaloftheRoyalStatisticalSocietyA,174(2),369–86.
Student(W.S.Gosset),1938,“Comparisonbetweenbalancedandrandomarrange-mentsoffieldplots,”Biometrika,29(3/4),363-78.
Svorencik,Andrej,2015,Theexperimentalturnineconomics:ahistoryofexperi-mentaleconomics,UtrechtSchoolofEconomics,DissertationSeries#29,http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026
Todd,PetraE.andKennethJ.Wolpin,2006,“Assessingtheimpactofaschoolsub-sidyprograminMexico:usingasocialexperimenttovalidateadynamicbehav-ioralmodelofchildschoolingandfertility,”AmericanEconomicReview,96(5),1384–1417.
Todd,PetraE.andKennethJ.Wolpin,2008,“Exanteevaluationofsocialprograms,”Annalesd’EconomieetdelaStatistique,91/92,263–91.
U.S.DepartmentofEducation,InstituteofEducationSciences,NationalCenterforEducationEvaluationandRegionalAssistance,2003,Identifyingandimplement-ingeducationalpracticessupportedbyrigorousevidence:auserfriendlyguide,Washington,DC.InstituteofEducationSciences.
Vandenbroucke,JanP.,2004,“Whenareobservationalstudiesascredibleasran-domizedcontrolledtrials?”TheLancet,363:1728–31.
Vandenbroucke,JanP..2009,“TheHRTcontroversy:observationalstudiesandRCTsfallinline,”TheLancet,373,1233-5.
Vivalt,Eva,2015,“Howmuchcanwegeneralizefromimpactevaluations?”NYU,un-published.http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf
White,Halbert,1980,“Aheteroskedasticity-consistentcovariancematrixestimatorandadirecttestforheteroskedasticity,”Econometrica,50(1),1–25.
Wise,DavidA.,1985,“Abehavioralmodelversusexperimentation:theeffectsofhousingsubsidiesonrent,”inP.BruckerandR.Pauly,eds.MethodsofOperationsResearch,50,VerlagAnonHain.441–89.
Wolpin,KennethI.,2013,Thelimitsofinferencewithouttheory,Camridge,MA.MITPress.
Worrall,John,2007,“Evidenceinmedicineandevidence-basedmedicine,”Philoso-phyCompass,2/6,981–1022.
Worrall,John,2008,“Evidenceandethicsinmedicine,”PerspectivesinBiologyandMedicine,51(3),418-31.
Yates,Frank,1939,“Thecomparativeadvantagesofsystematicandrandomizedar-rangementsinthedesignofagriculturalandbiologicalexperiments,”Biometrika,30(3/4),440-66.
Young,Alwyn,2016,“ChannelingFisher:randomizationtestsandthestatisticalin-significanceofseeminglysignificantexperimentalresults,”LondonSchoolofEco-nomics,WorkingPaper,Feb.
Ziliak,StephenT.,2014,“Balancedversusrandomizedfieldexperimentsineconom-ics:whyW.S.Gossetaka‘Student’matters,”ReviewofBehavioralEconomics,1,167–208.
69
Appendix:MonteCarloexperimentforanRCTwithoutliersInthisillustrativeexample,thereisparentpopulationeachmemberofwhichhashisorher
owntreatmenteffect;thesearecontinuouslydistributedwithashiftedlognormaldistribu-
tionwithzeromeansothatthepopulationATEiszero.Theindividualtreatmenteffectsβ
aredistributedsothat β + e0.5 ∼ Λ(0,1) ,forstandardizedlognormaldistributionΛ. Inthe
absenceoftreatment,everyoneinthesamplerecordszero,sothesampleaveragetreat-
menteffectinanyonetrialissimplythemeanoutcomeamongthentreatments.Forvalues
ofnequalto25,50,100,200,and500wedrawfromtheparentpopulation100trialsam-
pleseachofsize2n;withfivevaluesofn,thisgivesus500trialsamplesinall;becauseof
samplingthetrueATE’sineachtrialsamplewillnotbezero.Foreachofthese500samples,
werandomizeintoncontrolsandntreatments,estimatetheATEanditsestimatedt–value
(usingthestandardtwo-samplet–value,orequivalently,byrunningaregressionwithro-
bustt–values),andthenrepeat1,000times,sowehave1,000ATEestimatesandt–values
foreachofthe500trialsamples.TheseallowustoassessthedistributionofATEestimates
andtheirnominalt–valuesforeachtrial.
TheresultsareshowninTableA1.Eachrowcorrespondstoasamplesize.Ineach
row,weshowtheresultsof100,000individualtrials,composedof1,000replicationson
eachofthe100trial(experimental)samples.Thecolumnsareaveragedoverall100,000tri-
als.
TableA1:RCTswithskewedtreatmenteffects
Samplesize MeanofATE
estimates
Meanofnominalt–
values
Fractionnullre-
jected(percent)
25
50
0.0268
0.0266
–0.4274
–0.2952
13.54
11.20
100 –0.0018 –0.2600 8.71
200 0.0184 –0.1748 7.09
500 –0.0024 –0.1362 6.06
Note:1,000randomizationsoneachof100drawsofthetrialsamplerandomlydrawnfromalognormaldistributionoftreatmenteffectsshiftedtohaveazeromean.
70
Thelastcolumnshowsthefractionsoftimesthenullthatistrueinthepopulationis
rejectedinthetrialsamplesandisourkeyresult.Whenthereareonly50treatmentsand
50controls(row2),the(true)nullisrejected11.2percentofthetime,insteadofthe5per-
centthatwewouldlikeandexpectifwewereunawareoftheproblem.Whenthereare500
unitsineacharm,therejectionrateis6.06percent,muchclosertothenominal5percent.
FigureA1:EstimatesofanATEwithanoutlierinthetrialsample
FigureA1illustratestheestimatedATEsfromanextremetrialsamplefromthesimulations
inthesecondrowwith100observationsintotal;thehistogramshowsthe1,000estimates
oftheATEforthattrialsample.Thistrialsamplehasasinglelargeoutlyingtreatmenteffect
of48.3;themean(s.d.)oftheother99observationsis–0.51(2.1);whentheoutlierisinthe
treatmentgroup,wegettheright-handsideofthefigure,whenitisinthecontrolgroup,we
gettheleft-handside.
0.5
11.
5D
ensi
ty
-.5 0 .5 1 1.5 21,000 estimates of average treatment effect