Download pdf - NBER WORKING PAPER SERIES UNDERSTANDING AND ...Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett,

NBER WORKING PAPER SERIES

UNDERSTANDING AND MISUNDERSTANDING RANDOMIZED CONTROLLED TRIALS

Angus DeatonNancy Cartwright

Working Paper 22595http://www.nber.org/papers/w22595

NATIONAL BUREAU OF ECONOMIC RESEARCH1050 Massachusetts Avenue

Cambridge, MA 02138September 2016, Revised October 2017

We acknowledge helpful discussions with many people over the several years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia, and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Jim Heckman, Bo Honoré, Chuck Manski, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1. We have benefited from generous comments on an earlier version by Christopher Adams, Tim Besley, Chris Blattman, Sylvain Chassang, Jishnu Das, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Williams, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U), the Spencer Foundation, and the National Science Foundation (award 1632471). Deaton acknowledges financial support from the National Institute on Aging through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30 AG024928. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.

NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official NBER publications.

© 2016 by Angus Deaton and Nancy Cartwright. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

Understanding and Misunderstanding Randomized Controlled Trials Angus Deaton and Nancy CartwrightNBER Working Paper No. 22595September 2016, Revised October 2017JEL No. C10,C26,C93,O22

ABSTRACT

RCTs would be more useful if there were more realistic expectations of them and if their pitfalls were better recognized. For example, and contrary to many claims in the applied literature, randomization does not equalize everything but the treatment across treatments and controls, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) confounders. Estimates apply to the trial sample only, sometimes a convenience sample, and usually selected; justification is required to extend them to other groups, including any population to which the trial sample belongs. Demanding “external validity” is unhelpful because it expects too much of an RCT while undervaluing its contribution. Statistical inference on ATEs involves hazards that are not always recognized. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon and not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not “what works,” but “why things work”.

Angus Deaton361 Wallace HallWoodrow Wilson SchoolPrinceton UniversityPrinceton, NJ 08544-1013and [email protected]

Nancy CartwrightDepartment of PhilosophyDurham University50 Old ElvetDurham, UK DH1 3HN and University of California, San [email protected]

2

IntroductionRandomizedcontrolledtrials(RCTs)arecurrentlywidelyvisibleineconomicstoday

andhavebeenusedinthesubjectatleastsincethe1960s(seeGreenbergand

Shroder(2004)foracompendium).Itisoftenclaimedthatsuchtrialscandiscover

“whatworks”ineconomics,aswellasinpoliticalscience,education,andsocialpol-

icy.Amongbothresearchersandthegeneralpublic,RCTsareperceivedtoyield

causalinferencesandestimatesofaveragetreatmenteffects(ATEs)thataremore

reliableandmorecrediblethanthosefromanyotherempiricalmethod.Theyare

takentobelargelyexemptfromthemyriadeconometricproblemsthatcharacterize

observationalstudies,torequireminimalsubstantiveassumptions,littleornoprior

information,andtobelargelyindependentof“expert”knowledgethatisoftenre-

gardedasmanipulable,politicallybiased,orotherwisesuspect.

Therearenow“WhatWorks”centersusingandrecommendingRCTsina

rangeofareasofsocialconcernacrossEuropeandtheAnglophoneworld.These

centersseeRCTsastheirpreferredtoolandindeedoftenpreferRCTevidencelexi-

cographically.Asoneofmanyexamples,theUSDepartmentofEducation’sstandard

for“strongevidenceofeffectiveness”requiresa“well-designedandimplemented”

RCT;noobservationalstudycanearnsuchalabel.This“goldstandard”claimabout

RCTsislesscommonineconomics,butImbens(2010,407)writesthat“randomized

experimentsdooccupyaspecialplaceinthehierarchyofevidence,namelyatthe

verytop.”TheAbdulLatifJameelPovertyActionLab(J-PAL),whosestatedmission

is“toreducepovertybyensuringthatpolicyisinformedbyscientificevidence”,ad-

vertisesthatitsaffiliatedprofessors“conductrandomizedevaluationstotestand

improvetheeffectivenessofprogramsandpoliciesaimedatreducingpoverty”,J-

PAL(2017).Theleadpageofitswebsite(echoedinthe‘Evaluation’section)notes

“843ongoingandcompletedrandomizedevaluationsin80countries”withnomen-

tionofanystudiesthatarenotrandomized.

Inmedicine,thegoldstandardviewhaslongbeenwidespread,e.g.fordrug

trialsbytheFDA;anotableexceptionistherecentpaperbyFrieden(2017),ex-di-

3

rectoroftheU.S.CentersforDiseaseControlandPrevention,wholistskeylimita-

tionsofRCTsaswellasarangeofcontextswhereRCTs,evenwhenfeasible,are

dominatedbyothermethods.

WearguethatanyspecialstatusforRCTsisunwarranted.Whichmethodis

mostlikelytoyieldagoodcausalinferencedependsonwhatwearetryingtodis-

coveraswellasonwhatknowledgeisalreadyavailable.Whenlittleprior

knowledgeisavailable,nomethodislikelytoyieldwell-supportedconclusions.This

paperisnotacriticismofRCTsinandofthemselves,letaloneanattempttoidentify

goodandbadstudies.Instead,wewillarguethat,dependingonwhatwewantto

discover,whywewanttodiscoverit,andwhatwealreadyknow,therewilloftenbe

superiorroutesofinvestigation.

Wepresenttwosetsofarguments.Thefirstisanenquiryintotheideathat

ATEsestimatedfromRCTSarelikelytobeclosertothetruththanthoseestimated

inotherways.ThesecondexploreshowtousetheresultsofRCTsoncewehave

them.Inthefirstsection,ourdiscussionrunsinfamiliartermsofbiasandefficiency,

orexpectedloss.Noneofthismaterialisnew,butweknowofnosimilartreatment,

andwewishtodisputemanyoftheclaimsthatarefrequentlymadeintheapplied

literature.Someroutinemisunderstandingsare:(a)randomizationensuresafair

trialbyensuringthat,atleastwithhighprobability,treatmentandcontrolgroups

differonlyinthetreatment;(b)RCTsprovidenotonlyunbiasedestimatesofATEs

butalsopreciseestimates;(c)statisticalinferenceinRCTs,whichrequiresonlythe

simplecomparisonofmeans,isstraightforward,sothatstandardsignificancetests

arereliable.

Nothingwesayinthepapershouldbetakenasageneralargumentagainst

RCTs;wearesimplytryingtochallengeunjustifiableclaims,andexposemisunder-

standings.WearenotagainstRCTs,onlymagicalthinkingaboutthem.Themisun-

derstandingsareimportantbecausewebelievethattheycontributetothecommon

perceptionthatRCTsalwaysprovidethestrongestevidenceforcausalityandforef-

fectiveness.

4

Inthesecondpartofthepaper,wediscusshowtousetheevidencefrom

RCTs.Thenon-parametricandtheory-freenatureofRCTs,whichisarguablyanad-

vantageinestimation,isoftenadisadvantagewhenwetrytousetheresultsoutside

ofthecontextinwhichtheresultswereobtained;credibilityinestimationcanlead

toincredibilityinuse.Muchoftheliterature,perhapsinspiredbyCampbelland

Stanley’s(1963)famous“primacyofinternalvalidity”,appearstobelievethatinter-

nalvalidityisnotonlynecessarybutalmostsufficienttoguaranteetheusefulnessof

theestimatesindifferentcontexts.Butyoucannotknowhowtousetrialresults

withoutfirstunderstandinghowtheresultsfromRCTsrelatetotheknowledgethat

youalreadypossessabouttheworld,andmuchofthisknowledgeisobtainedby

othermethods.OncethecommitmenthasbeenmadetoseeingRCTswithinthis

broaderstructureofknowledgeandinference,andwhentheyaredesignedtoen-

hanceit,theycanbeenormouslyuseful,notjustforwarrantingclaimsofeffective-

nessbutforscientificprogressmoregenerally.Cumulativescienceisnotadvanced

throughmagicalthinking.

TheliteratureontheprecisionofATEsestimatedfromRCTsgoesbacktothe

verybeginning.Gosset(writingas`Student’)neveracceptedFisher’sargumentsfor

randomizationinagriculturalfieldtrialsandarguedconvincinglythathisownnon-

randomdesignsfortheplacementoftreatmentandcontrolsyieldedmoreprecise

estimatesoftreatmenteffects(seeStudent(1938)andZiliak(2014)).Gosset

workedforGuinnesswhereinefficiencymeantlostrevenue,sohehadreasonsto

care,asshouldwe.Fisherwontheargumentintheend,notbecauseGossetwas

wrongaboutefficiency,butbecause,unlikeGosset’sprocedures,randomizationpro-

videsasoundbasisforstatisticalinference,andthusforjudgingwhetheranesti-

matedATEisdifferentfromzerobychance.Moreover,Fisher’sblockingprocedures

canlimittheinefficiencyfromrandomization(seeYates(1939)).Gosset’sreserva-

tionswereechoedmuchlaterinSavage’s(1962)commentthataBayesianshould

notchoosetheallocationoftreatmentsandcontrolsatrandombutinsuchaway

that,givenwhatelseisknownaboutthetopicandthesubjects,theirplacementre-

vealsthemosttotheresearcher.Theseissuesabouthowtoincorporatepriorinfor-

mationintorandomizedtrialsarecentraltoSection1.

5

Ineconomics,thestrengthsandweaknessesofRCTsarewellexploredinthe

volumesbyHausmanandWise(1985)andbyGarfinkelandManski(1992);inthe

latter,theintroductionbyGarfinkelandManskiisabalancedsummaryofwhatran-

domizedtrialscanandcannotdo.ThepaperinthatvolumebyHeckman(1992)

raisesmanyoftheissuesthatheandhiscoauthorshaveexploredinsubsequentpa-

pers,seeinparticularHeckmanandSmith(1995),andHeckman,LalondeandSmith

(1999)whofocusonlabormarketexperiments.Manski(2013)containsagood

summaryofbothstrengthsandweaknesses.

Thereisalsoamorecontestedrecentliterature.Ontheonehand,thereare

proceduresthattakeasfundamentaltheunrestrictedindividualtreatmenteffectsof

individualsandseeknon-parametricapproachestoestimatingtheiraverage.Onthe

otherhand,theseproceduresarecontrastedwithanapproachthatuseselementsof

economictheorytodefineparametersofinterestandtoidentifymagnitudesthat

arelikelytobeinvarianttopolicymanipulationoracrosscontexts,whereinvari-

anceisdefinedinthesenseofHurwicz(1966).TheintroductioninImbensand

Wooldridge(2009)provideaneloquentdefenseofthetreatment-effectformulation.

Itemphasizesthecredibilitythatcomesfromatheory-freespecificationwithalmost

unlimitedheterogeneityintreatmenteffects.TheintroductioninHeckmanand

Vytlacil(2007)makesanequallyeloquentcaseagainst,notingthatthecrucialingre-

dientsoftreatmentsinRCTsareoftennotclearlyspecified—sothatweoftendonot

knowwhatthetreatmentreallyis—andthatthetreatmenteffectsarehardtolinkto

invariantparametersthatwouldbeusefulelsewhere.Aspectsofthesamedebate

featureinImbens(2010),AtheyandImbens(2017),AngristandPischke(2017),

Heckman(2005,2008,2010)andHeckmanandUrzua(2010).

Deaton(2010)complainsabouttheuseofinstrumentalvariables,including

randomization,asasubstituteforthinkingaboutandconstructingmodelsofeco-

nomicdevelopment.HearguesagainsttheideathatusingRCTstoevaluateprojects

todiscover“whatworks”caneveryieldasystematicbodyofscientificknowledge

thatcanbeusedtoreduceoreliminatepoverty.Thatpaperisanargumentagainst

theusefulnessoftheheterogeneoustreatmentapproach.Itarguesthatrefusingto

6

modelheterogeneity,thoughavoidingassumptions,precludesthesortofcumula-

tiveresearchprogramthatmightyieldusefulpolicy.Thepaper’sclaimthatRCTs

havenospecialclaimtogeneratecredibleandusefulknowledgewaschallengedby

Imbens(2010);someofhisargumentsareansweredbelow.Cartwright(2007)and

CartwrightandMunro(2010)challengeany“goldstandard”viewofRCTs.Cart-

wright(2011,2012,2016)andCartwrightandHardie(2012)focusonthequestion

ofhowtousetheresultsofRCTsandwhatwecanlearnwhenanexperimentshows

thatsomepolicyworkssomewhere.Section2pursuestheseissuesingeneraland

throughcasestudies.

Section1:DoRCTsgivegoodestimatesofAverageTreatmentEffects

Inthissection,weexplorehowtoestimateaveragetreatmenteffects(ATEs)andthe

roleofrandomization.WenotefirstthatestimatingATEsisonlyoneofmanyuses

forthedatageneratedbyanRCT.Westartfromatrialsample,acollectionofsub-

jectsthatwillbeallocatedrandomlytoeitherthetreatmentorcontrolarmofthe

trial.This“sample”mightbe,butrarelyis,arandomsamplefromsomepopulation

ofinterest.Morefrequently,itisselectedinsomeway,forexampletothosewilling

toparticipate,orissimplyaconveniencesamplethatisavailabletothetrialists.

Givenrandomallocationtotreatmentsandcontrols,thedatafromthetrialallowthe

identificationoftwodistributions,𝐹"(𝑌")and𝐹&(𝑌&),ofoutcomes𝑌"and𝑌&inthe

treatedanduntreatedcaseswithinthetrialsample.TheestimatedATEisthediffer-

enceinmeansofthetwodistributionsandisthefocusofmuchoftheliteraturein

socialscienceandmedicine.Yetpolicymakersandresearchersmaywellbeinter-

estedinotherfeaturesofthetwodistributions.Forexample,ifYisincome,they

maybeinterestedinwhetheratreatmentreducedincomeinequality,orinwhatit

didtothe10thor90thpercentilesoftheincomedistribution,eventhoughdifferent

peopleoccupythosepercentilesinthetreatmentandcontroldistributions(seeBit-

leretal(2006)foranexampleinUSwelfarepolicy).Cancertrialsstandardlyusethe

mediandifferenceinsurvival,whichcomparesthetimesuntilhalfthepatientshave

diedineacharm.Morecomprehensively,policymakersmaywishtocompareex-

pectedutilitiesfortreatedanduntreatedunderthetwodistributionsandconsider

7

optimalexpected-utilitymaximizingtreatmentrulesconditionalonthecharacteris-

ticsofsubjects(seeManski(2004)andManskiandTetenov(2016);Bhattacharya

andDupas(2012)containsanapplication.)Theseusesareimportant,butwefocus

onATEshereanddonotconsidertheseotherusesofRCTsanyfurtherinthispaper.

1.1Whatdoesrandomizationdo?

Ausefulwaytothinkabouttheestimationoftreatmenteffectsistouseaschematic

linearcausalmodeloftheform:

(1)

where, istheoutcomeforuniti,𝑇( isadichotomous(1,0)treatmentdummyindi-

catingwhetherornotiistreated,and𝛽( istheindividualtreatmenteffectofthe

treatmentoni.Thex’saretheobservedorunobservedotherlinearcausesofthe

outcome,andwesupposethat(1)capturesaminimalsetofcausesof𝑌( sufficientto

fixitsvalue.Jmaybe(very)large.Becausetheheterogeneityoftheindividualtreat-

menteffects,𝛽( ,isunrestricted,weallowthepossibilitythatthetreatmentinteracts

withthex’sorothervariables,sothattheeffectsofTcandependonanyothervaria-

bles.Notethatwedonotneedisubscriptsonthe𝛾’sthatcontroltheeffectsofthe

othercauses;iftheireffectsdifferacrossindividuals,weincludetheinteractionsof

individualcharacteristicswiththeoriginalx’sasnewx’s.Giventhatthex’scanbe

unobservable,thisisnotrestrictive.

Consideranexperimentthataimstotellussomethingaboutthetreatment

effects;thismightormightnotuserandomization.Eitherway,wecanrepresentthe

treatmentgroupashaving𝑇( = 1andthecontrolgroupashaving𝑇( = 0.Giventhe

study(ortrial)sample,subtractingtheaverageoutcomesamongthecontrolsfrom

theaverageoutcomesamongthetreatments,weget

Y

1−Y

0= β

1+ γ j (xij

1−

j=1

J

∑ xij0) = β

1+ (S

1− S

0) (2)

Thefirsttermonthefar-right-handsideof(2),whichistheATEinthetrialsample,

iswhatwewant,butthesecondtermorerrorterm,whichisthesumofthenetav-

eragebalanceofothercausesacrossthetwogroups,willgenerallybenon-zeroand

Yi = βiTi + γ j xijj=1

J∑Yi

8

needstobedealtwithsomehow.Wegetwhatwewantwhenthemeansofallthe

othercausesareidenticalinthetwogroups,ormoreprecisely(andlessonerously)

whenthesumoftheirnetdifferences𝑆" − 𝑆&iszero;thisisthecaseofperfectbal-

ance.Withperfectbalance,thedifferencebetweenthetwomeansisexactlyequalto

theaverageofthetreatmenteffectsamongthetreated,sothatwehavetheultimate

precisioninthatweknowthetruthinthetrialsample,atleastinthislinearcase.As

always,the“truth”herereferstothetrialsample,anditisalwaysimportanttobe

awarethatthetrialsamplemaynotberepresentativeofthepopulationthatisulti-

matelyofinterest,includingthepopulationfromwhichthetrialsamplecomes;any

suchextensionrequiresfurtherargument.

Howdowegetbalance,orsomethingclosetoit?What,exactly,istheroleof

randomization?Inalaboratoryexperiment,wherethereisusuallymuchprior

knowledgeoftheothercauses,theexperimenterhasagoodchanceofcontrolling

(orsubtractingawaytheeffectsof)theothercauses,aimingtoensurethatthelast

termin(1)isclosetozero.Failingsuchknowledgeandcontrol,analternativeis

matching,whichisfrequentlyusedinstatistical,medical,andeconometricwork.For

eachsubject,amatchisfoundthatisascloseaspossibleonallsuspectedcauses,so

that,onceagain,thelasttermin(1)canbekeptsmall.Whenwehaveagoodideaof

thecauses,matchingmayalsodeliverapreciseestimate.Ofcourse,whenthereare

unknownorunobservablecausesthathaveimportanteffects,neitherlaboratory

controlnormatchingoffersprotection.

Whatdoesrandomizationdo?Sincethetreatmentsandcontrolscomefrom

thesameunderlyingdistribution,randomizationguarantees,byconstruction,that

thelasttermontherightin(1)iszeroinexpectation,subjecttothecaveatthatno

correlationsofthex’swithYareintroducedpost-randomization,forexampleby

subjectsnotacceptingtheirassignment.Theexpectationhereistakenoverre-

peatedrandomizationsonthetrialsample,eachwithitsownallocationoftreat-

mentsandcontrols.Assumingthatourcaveatholds,thelasttermin(2)willbezero

whenaveragedoverthisinfinitenumberof(entirelyhypothetical)replications,and

9

theaverageoftheestimatedATEswillbethetrueATEinthetrialsample.So𝛽"de-

liversanunbiasedestimateoftheATEamongthetreatedinthetrialsample,andit

doessowhetherornotthecausesareobserved.Unbiasednessdoesnotrequireus

toknowanythingabouttheothercausesthoughitdoesrequirethattheynotchange

afterrandomizationsoastomakethemcorrelatedwiththetreatment,whichisan

importantcaveattowhichweshallreturn.IftheRCTisrepeatedmanytimesonthe

sametrialsample,then,assumingourcaveatholdsinthetrials,thelasttermin(2)

willbezerowhenaveragedoveraninfinitenumberof(entirelyhypothetical)trials,

andtheaverageoftheestimatedATEswillbethetrueATEinthetrialsample.Of

course,noneofthisistrueinanyonetrialwherethedifferenceinmeanswillbe

equaltotheaveragetreatmenteffectamongthosetreatedplusthetermthatreflects

theimbalanceintheneteffectsoftheothercauses.Wedonotknowthesizeofthis

errorterm,andthereisnothinginrandomizationthatlimitsitssize;bychancethe

randomizationinoursingletrialcanover-representanimportantexcludedcause(s)

inonearmovertheother,inwhichcasetherewillbeadifferencebetweenthe

meansofthetwogroupsthatisnotcausedbythetreatment.

Theunbiasednessresultcaneasilybecompromised.Inparticular,thetreat-

mentmustnotbecorrelatedwithanyothercause.Randomassignmentisdesigned

toaidwiththis,butitisnotsufficientif,forexample,thereislackofblindingsothat

individualsareawareoftheirassignment,orifthoseadministeringthetreatment

aresoaware,andifthatawarenesstriggersanothercause.Similarly,researchers

sometimesreturntoindividualswhowererandomizedyearsbefore,sothatthere

hasbeentimeforthesubjectsorotherstolearntheirassignmentorforothercauses

tobeinfluencedbytheassignment.Thisagainopensupthepossibilityofunbal-

ancedeffectsofcausesotherthanthetreatmentweareinterestedin.Wehaveal-

readynotedthatunbiasednessreferstothetrialsample,whichmayormaynotbe

representativeofthepopulationofinterest.

Ifweweretorepeatthetrialmanytimes,theover-representationoftheun-

balancedcauseswillsometimesbeinthetreatmentsandsometimesinthecontrols.

Theimbalancewillvaryoverreplicationsofthetrial,andalthoughwecannotsee

thisfromoursingletrial,weshouldbeabletocaptureitseffectsonourestimateof

10

theATEfromanestimatedstandarderror.ThiswasFisher’sinnovation:notthat

randomizationbalancedothercausesbetweentreatmentsandcontrolsbutthat,

conditionalonourcaveatabove,randomizationprovidesthebasisforcalculating

thesizeoftheerror.Gettingthestandarderrorandassociatedsignificancestate-

mentsrightareofthegreatestimportance.Giventheabsenceoftreatment-related

post-randomizationchangesinothercauses,randomizationyieldsanunbiasedesti-

mateoftheATEinthetrialsampleaswellasasoundmethodformeasuringerrorof

estimationinthatsample;thereinliesitsvirtue,notthatityieldspreciseestimates

throughbalance.

1.2Misunderstandings:claimingtoomuch

Everythingsofarshouldbeperfectlyfamiliar,butexactlywhatrandomizationdoes

isfrequentlylostinthepracticalliterature.Thereisoftenconfusionbetweenperfect

control,ontheonehand(asinalaboratoryexperimentorperfectmatchingwithno

unobservablecauses),andcontrolinexpectationontheother,whichiswhatran-

domizationcontributes.Ifweknewenoughabouttheproblemtobeabletocontrol

well,thatiswhatwewoulddo.Randomizationisanalternativewhenwedonot

knowenough,butisgenerallyinferiortogoodcontrol.Wesuspectthatatleastsome

ofthepopularandprofessionalenthusiasmforRCTs,aswellasthebeliefthatthey

areprecisebyconstruction,comesfrommisunderstandingsaboutbalance.These

misunderstandingsarenotsomuchamongthetrialistswhowilloftengiveacorrect

accountwhenpressed.Theycomefromimprecisestatementsbytrialiststhatare

takenliterallybythelayaudiencethatthetrialistsarekeentoreach.

Suchamisunderstandingiswellcapturedbyaquotefromthesecondedition

oftheonlinemanualonimpactevaluationjointlyissuedbytheInter-AmericanDe-

velopmentBankandtheWorldBank(thefirst,2011editionissimilar):

“Wecanbeconfidentthatourestimatedimpactconstitutesthetrueimpact

oftheprogram,sincewehaveeliminatedallobservedandunobservedfac-

torsthatmightotherwiseplausiblyexplainthedifferenceinoutcomes.”Ger-

tler,Martinez,Premand,Rawlings,andVermeersch(2016,69).

11

Thisstatementisfalse,becauseitconfusesactualbalanceinanysingletrialwith

balanceinexpectationovermany(hypothetical)trials.Ifitweretrue,andifallfac-

torswereindeedcontrolled(andnoimbalanceswereintroducedpostrandomiza-

tion),thedifferencewouldbeanexactmeasureoftheaveragetreatmenteffect

amongthetreatedinthetrialpopulation(atleastintheabsenceofmeasurementer-

ror).Weshouldnotonlybeconfidentofourestimatebut,asthequotesays,we

wouldknowthatitisthetruth.Notethatthestatementcontainsnoreferenceto

samplesize;wegetthetruthbyvirtueofbalance,notfromalargenumberofobser-

vations.

AsimilarquotecomesfromJohnList,oneofthemostimaginativeandsuc-

cessfulscholarswhouseRCTs:

“complicationsthataredifficulttounderstandandcontrolrepresentkeyrea-

sonstoconductexperiments,notapointofskepticism.Thisisbecauseran-

domizationactsasaninstrumentalvariable,balancingunobservablesacross

controlandtreatmentgroups.”Al-UbaydliandList(2013)(italicsintheorig-

inal.)

AndfromDeanKarlan,founderandPresidentofYale’sInnovationsforPovertyAc-

tion,whichrunsdevelopmentRCTsaroundtheworld:

“Asinmedicaltrials,weisolatetheimpactofaninterventionbyrandomly

assigningsubjectstotreatmentsandcontrolgroups.Thismakesitsothatall

thoseotherfactorswhichcouldinfluencetheoutcomearepresentintreat-

mentandcontrol,andthusanydifferenceinoutcomecanbeconfidentlyat-

tributedtotheintervention.”Karlan,GoldbergandCopestake(2009)

Andfromthemedicalliterature,fromadistinguishedpsychiatristwhoisdeeply

skepticaloftheuseofevidencefromRCTs,

“Thebeautyofarandomizedtrialisthattheresearcherdoesnotneedtoun-

derstandallthefactorsthatinfluenceoutcomes.Saythatanundiscoveredge-

neticvariationmakescertainpeopleunresponsivetomedication.Theran-

domizingprocesswillensure—ormakeithighlyprobable—thatthearmsof

thetrialcontainequalnumbersofsubjectswiththatvariation.Theresultwill

beafairtest.”(Kramer,2016,p.18)

12

ClaimsareevenmadethatRCTsrevealknowledgewithoutpossibilityoferror.Judy

Gueron,thelong-timepresidentofMDRC,whichhasbeenrunningRCTsonUSgov-

ernmentpolicyfor45years,askswhyfederalandstateofficialswerepreparedto

supportrandomizationinspiteoffrequentdifficultiesandinspiteoftheavailability

ofothermethodsandconcludesthatitwasbecause“theywantedtolearnthetruth,”

GueronandRolston(2013,429).Therearemanystatementsoftheform“Weknow

that[projectX]workedbecauseitwasevaluatedwitharandomizedtrial,”Dynarski

(2015).

ItiscommontotreattheATEfromanRCTasifitwerethetruth,notjustin

thetrialsamplebutmoregenerally.Ineconomics,afamousexampleisLalonde’s

(1986)studyoftrainingprograms,whoseresultswereatoddswithanumberof

previousnon-randomizedstudies.Thepaperpromptedalarge-scalere-examination

oftheobservationalstudiestotrytobringthemintoline,thoughitnowseemsjust

aslikelythatthedifferenceslieinthefactthatthestudyresultsapplytodifferent

populations(Heckman,Lalonde,andSmith(1999)).Inepidemiology,Davey-Smith

andIbrahim(2002)statethat“observationalstudiespropose,RCTsdispose”.A

goodexampleistheRCTofhormonereplacementtherapy(HRT)forpost-menopau-

salwomen.HRThadpreviouslybeensupportedbypositiveresultsfromahigh-

qualityandlong-runningobservationalstudy,buttheRCTwasstoppedinthefaceof

excessdeathsinthetreatmentgroup.ThenegativeresultoftheRCTledtowide-

spreadabandonmentofthetherapy,whichmighthavebeenamistake(seeVanden-

broucke(2009)andFrieden(2017)).Yetthemedicalandpopularliteraturerou-

tinelystatesthattheRCTwasrightandtheearlierstudywrong,simplybecausethe

earlierstudywasnotrandomized.Thegoldstandardor“truth”viewdoesharm

whenitunderminestheobligationofsciencetoreconcileRCTsresultswithother

evidenceinaprocessofcumulativeunderstanding.

Thefalsebeliefinautomaticprecisionsuggeststhatweneedpaynoatten-

tiontotheothercausesin(1)or(2).Indeed,GerberandGreen(2012),intheir

standardtextforRCTsinpoliticalscience,writethatrunninganRCTis“aresearch

strategythatdoesnotrequire,letalonemeasure,allpotentialconfounders.”Thisis

trueifwearehappywithestimatesthatarearbitrarilyfarfromthetruth,justso

13

longastheerrorscanceloutoveraseriesofimaginaryexperiments.Inreality,the

causalitythatisbeingattributedtothetreatmentmight,infact,becomingfroman

imbalanceinsomeothercauseinourparticulartrial;limitingthisrequiresserious

thoughtaboutpossibleconfounders.

1.3Samplesize,balance,andprecision

Atthetimeofrandomizationandintheabsenceofpost-randomizationchangesin

othercauses,atrialismorelikelytobebalancedwhenthesamplesizeislarge.As

thesamplesizetendstoinfinity,themeansofthex’sinthetreatmentandcontrol

groupswillbecomearbitrarilyclose.Yetthisisoflittlehelpinfinitesamples.As

Fisher(1926)noted:“Mostexperimentersoncarryingoutarandomassignment

willbeshockedtofindhowfarfromequallytheplotsdistributethemselves,”quoted

inMorganandRubin(2012).Evenwithverylargesamplesizes,ifthereisalarge

numberofcauses,balanceoneachcausemaybeinfeasible.Evenwithjustthree

causeswiththreevalueseach,thereare27cellstobalance,andinmostsocialand

medicalcasestherewillbemore.Vandenbroucke(2004)notesthattherearethree

billionbasepairsinthehumangenome,manyorallofwhichcouldberelevantprog-

nosticfactorsforthebiologicaloutcomethatweareseekingtoinfluence.Itistrue,

as(2)makesclear,thatwedonotneedbalanceoneachcauseindividually,onlyon

theirneteffect,theterm𝑆" − 𝑆&.Butconsiderthehumangenomebasepairs.Outof

allofthosebillions,onlyonemightbeimportant,andifthatoneisunbalanced,the

resultsofasingletrialcanbefarfromthetruth.Statementsaboutlargesamples

guaranteeingbalancearenotusefulwithoutguidelinesabouthowlargeislarge

enough,andsuchstatementscannotbemadewithoutknowledgeofothercauses

andhowtheyaffectoutcomes.Ofcourse,lackofbalanceintheneteffectofeither

observablesornon-observablesin(2)doesnotcompromisetheinferenceinanRCT

inthesenseofobtainingastandarderrorfortheunbiasedATE(seeSenn(2013)for

aparticularlyclearstatement).

HavingrunanRCT,itmakesgoodsensetoexamineanyavailablecovariates

forbalancebetweenthetreatmentsandcontrols;ifwesuspectthatanobserved

variablexisapossiblecause,anditsmeansinthetwogroupsareverydifferent,we

14

shouldtreatourresultswithappropriatesuspicion.Inpractice,trialistsineconom-

ics(andinsomeotherdisciplines)usuallycarryoutastatisticaltestforbalanceaf-

terrandomizationbutbeforeanalysis,presumablywiththeaimoftakingsomeap-

propriateactionifbalancefails.Thefirsttableofthepapertypicallypresentsthe

samplemeansofobservablecovariates—theobservablex’sin(1)orinteractivefac-

torsrepresentedinβ—forthecontrolandtreatmentgroups,togetherwiththeirdif-

ferences,andtestsforwhetherornottheyaresignificantlydifferentfromzero,ei-

thervariablebyvariable,orjointly.Thesetestsareappropriateforunbiasednessif

weareconcernedthattherandomnumbergeneratormighthavefailed,orifweare

worriedthattherandomizationisunderminedbynon-blindedsubjectswhosys-

tematicallyunderminetheallocation.Otherwise,unbiasednessisguaranteedbythe

randomization,whateverthetestsshow,andasthenextparagraphdemonstrates,

thetestisnotinformativeaboutthebalancethatwouldleadtoprecision.

Ifwewrite𝜇&and𝜇"forthe(vectorsof)truemeansinthetrialsample(i.e.

themeansoverallpossiblerandomizations)oftheobservedcausesofYinthecon-

trolandtreatmentgroupsatthepointofassignment,thenullhypothesisis(pre-

sumably,asjudgedbythetypicalbalancetest)thatthetwovectorsareidentical,

withthealternativebeingthattheyarenot.Butiftherandomizationhasbeencor-

rectlydonethenullhypothesisistruebyconstruction(seee.g.Altman(1985)and

Senn(1994)),whichmayhelpexplainwhyitsorarelyfailsinpractice.AsBegg

(1990)notes,“(I)tisatestofanullhypothesisthatisknowntobetrue.Therefore,if

thetestturnsouttobesignificantitis,bydefinition,aafalsepositive.”Thisis,of

course,consistentwithFisher’scommentsabouttheplotsinthefield,whichnotes

thattwosamplesofplotsrandomlydrawnfromthesamefieldcanlookveryunbal-

anced.Indeed,althoughwecannot“test”itinthisway,weknowthatthenullhy-

pothesisisalsotruefortheunobservablecauses.Notethecontrastwiththestate-

mentquotedaboveclaimingthatRCTsguaranteebalanceoncausesacrosstreat-

mentandcontrolgroups.Thosestatementsrefertobalanceofcausesatthepointof

assignmentinanysingletrial,whichisnotguaranteedbyrandomization,whereas

thebalancetestsareaboutthebalanceofcausesatthepointofassignmentinexpec-

15

tationovermanytrials,whichisguaranteedbyrandomization.Theconfusionisper-

hapsunderstandable,butitisaconfusionnevertheless.Ofcourse,itisalwaysgood

practicetolookforimbalancesbetweenobservedcovariatesinanysingletrialusing

somemoreappropriatedistancemeasure,forexamplethenormalizeddifferencein

means(ImbensandWooldridge(2009,equation3)).Similarly,itwouldhavebeen

goodpracticeforFishertoabandonarandomizationinwhichtherewereclearpat-

ternsinthe(random)distributionofplotsacrossthefield,eventhoughthetreat-

mentandcontrolplotswererandomlyselectionsthat,byconstruction,couldnot

differ“significantly”usingthestandard(incorrect)balancetest.Whethersuchim-

balancesshouldbeseenasunderminingtheestimateoftheATEdependsonour

priorsaboutwhichcovariatesarelikelytobeimportant,andhowimportant,which

is(notcoincidentally)thesamethoughtexperimentthatisroutinelyundertakenin

observationalstudieswhenweworryaboutconfounding.

Oneproceduretoimprovebalanceistoadaptthedesignbeforerandomiza-

tion,forexample,bystratification.Fisher,whoasthequoteaboveillustrates,was

wellawareofthelossofprecisionfromrandomizationarguedfor“blocking”(strati-

fication)inagriculturaltrialsorforusingLatinSquares,bothofwhichrestrictthe

amountofimbalance.Stratification,tobeuseful,requiressomepriorunderstanding

ofthefactorsthatarelikelytobeimportant,andsoittakesusawayfromthe“no

knowledgerequired”or“nopriorsaccepted”appealofRCTs;itrequiresthinking

aboutandmeasuringcovariates.ButasScriven(1974,103)notes:“(C)ausehunting,

likelionhunting,isonlylikelytobesuccessfulifwehaveaconsiderableamountof

relevantbackgroundknowledge”.Cartwright(1994,Chapter2)putsitevenmore

strongly,“nocausesin,nocausesout”.StratificationinRCTs,asinotherformsof

sampling,isastandardmethodforusingbackgroundknowledgetoincreasethe

precisionofanestimator.Ithasthefurtheradvantagethatitallowsfortheexplora-

tionofdifferentATEsindifferentstratawhichcanbeusefulinadaptingortrans-

portingtheresultstootherlocations(seeSection2).

Stratificationisnotpossibleiftherearetoomanycovariates,orifeachhas

manyvalues,sothattherearemorecellsthancanbefilledgiventhesamplesize.

Withfivecovariates,andtenvaluesoneach,andnopriorstolimitthestructure,we

16

wouldhave100,000possiblestrata.Fillingtheseiswellbeyondthesamplesizesin

mosttrials.Analternativethatworksmoregenerallyistore-randomize.Iftheran-

domizationgivesanobviousimbalanceonknowncovariates—treatmentplotsall

ononesideofthefield,allofthetreatmentclinicsinoneregion,toomanyrichand

toofewpoorinthecontrolgroup—wetryagain,andkeeptryinguntilwegetabal-

ancemeasuredasasmallenoughdistancebetweenthemeansoftheobservedco-

variatesinthetwogroups.MorganandRubin(2012)suggesttheMahalanobisD–

statisticbeusedasacriterionanduseFisher’srandomizationinference(tobedis-

cussedfurtherbelow)tocalculatestandarderrorsthattakethere-randomization

intoaccount.Analternative,widelyadaptedinpractice,istoadjustforcovariatesby

runningaregression(orcovariance)analysis,withtheoutcomeontheleft-hand

sideandthetreatmentdummyandthecovariatesasexplanatoryvariables,includ-

ingpossibleinteractionsbetweencovariatesandtreatmentdummies.Freedman

(2008)showsthattheadjustedestimateoftheATEisbiasedinfinitesamples,with

thebiasdependingonthecorrelationbetweenthesquaredtreatmenteffectandthe

covariates.Acceptingsomebiasinexchangeforgreaterprecisionwilloftenmake

sense,thoughitcertainlyunderminesanygoldstandardargumentthatreliesonun-

biasednesswithoutconsiderationofprecision.

1.4Shouldwerandomize?

ThetensionbetweenrandomizationandprecisionthatgoesbacktoFisher,Gosset,

andSavagehasbeenreopenedinrecentpapersbyKasy(2016),Banerjee,Chassang,

andSnowberg(BCS)(2016)andBanerjee,Chassang,Montero,andSnowberg

(BCMS)(2016).

Thetrade-offbetweenbiasandprecisioncanbeformalizedinseveralways,

forexamplebyspecifyingalossorutilityfunctionthatdependsonhowauserisaf-

fectedbydeviationsoftheestimateoftheATEfromthetruthandthenchoosingan

estimatororanexperimentaldesignthatminimizesexpectedlossormaximizesex-

pectedutility.AsSavage(1962)noted,foraBayesian,thisinvolvesallocatingtreat-

mentsandcontrolsin“thespecificlayoutthatpromisedtotellhimthemost,”but

withoutrandomization.Ofcourse,thisrequiresseriousandperhapsdifficultthought

aboutthemechanismsunderlyingtheATE,whichrandomizationavoids.Savagealso

17

notesthatseveralpeoplewithdifferentpriorsmaybeinvolvedinaninvestigation

andthatindividualpriorsmaybeunreliablebecauseof“vaguenessandtemptation

toself-deception,”defectsthatrandomizationmayalleviate,oratleastevade.BCMS

(2016)provideaproofofaBayesianno-randomizationtheorem,andBCS(2016)

provideanillustrationofaschooladministratorwhohaslongbelievedthatschool

outcomesaredetermined,notbyschoolquality,butbyparentalbackground,and

whocanlearnthemostbyplacingdeprivedchildrenin(supposed)high-quality

schoolsandprivilegedchildrenin(supposed)low-qualityschools,whichisthekind

ofstudysettingthatcasestudymethodologyiswellattunedto.AsBCSnote,thisal-

locationwouldnotpersuadethosewithdifferentpriors,andtheyproposerandomi-

zationasameansofsatisfyingskepticalobservers.

Severalpointsareimportant.First,theanti-randomizationtheoremisnota

justificationofanynon-randomizeddesign,forexample,onethatallowsselection

onunobservables,butonlytheoptimaldesignthatismostinformative.Accordingto

Chalmers(2001)andBothwellandPodolsky(2016),thedevelopmentofrandomi-

zationinmedicineoriginatedwithBradford-Hill,whousedrandomizationinthe

firstRCTinmedicine—thestreptomycintrial—becauseitpreventeddoctorsselect-

ingpatientsonthebasisofperceivedneed(oragainstperceivedneed,leaningover

backwardasitwere),anargumentrecentlyechoedbyWorrall(2007).Randomiza-

tionservesthispurpose,butsodoothernon-discretionaryschemes;whatisre-

quiredisthathiddeninformationshouldnotbeallowedtoaffecttheallocation.

Second,theidealrulesbywhichunitsareallocatedtotreatmentorcontrol

dependonthecovariatesandontheinvestigators’priorsabouthowthecovariates

affecttheoutcomes.Thisopensupallsortsofmethodsofinferencethatarelongfa-

miliartoeconomistsbutthatareexcludedbypurerandomization.Forexample,

whatphilosopherscallthehypothetico-deductivemethodworksbyusingtheoryto

makeapredictionthatcanbetakentothedataforpotentialfalsification(asinthe

schoolexampleabove).Thisisthewaythatphysicistslearn,asdoeconomistswhen

theyusetheorytoderivepredictionsthatcanbetestedagainstthedata,perhapsin

anRCT,butmorefrequentlynot.Someofthemostfruitfulresearchprogramsin

economicshavebeengeneratedbythepuzzlesthatresultwhenthedatafailto

18

matchsuchtheoreticalpredictions,suchastheequitypremiumpuzzle,variouspur-

chasingpowerparitypuzzles,theFeldstein-Horiokapuzzle,theconsumption

smoothnesspuzzle,thepuzzleofcaloriedeclineinthefaceofmalnourishmentand

incomegrowth,andmanyothers.

Third,randomization,byrunningroughshodoverpriorinformationfrom

theoryandfromcovariates,iswastefulandevenunethicalwhenitunnecessarilyex-

posespeople,orunnecessarilymanypeople,topossibleharminariskyexperiment.

Worrall(2008)documentsthe(extreme)caseofECMO,anewtreatmentfornew-

bornswithpersistentpulmonaryhypertensionthatwasdevelopedinthe1970sby

intelligentanddirectedtrialanderrorwithinawell-understoodtheoryofthedis-

ease.Inearlyexperimentationbytheinventors,mortalitywasreducedfrom80to

20percent.TheinvestigatorsfeltcompelledtoconductanRCT,albeitwithanadap-

tive‘play-the-winner’designinwhicheachsuccessinanarmincreasedtheproba-

bilityofthenextbabybeingassignedtothatarm.Onebabyreceivedconventional

therapyanddied,11receivedECMOandlived.Evenso,astandardrandomizedcon-

trolledtrialwasthoughtnecessary.Withastoppingruleoffourdeaths,fourmore

babies(outoften)diedinthecontrolgroupandnoneoftheninewhoreceived

ECMO.

Fourth,thenon-randommethodsusepriorinformation,whichiswhythey

dobetterthanrandomization.Thisisbothanadvantageandadisadvantage,de-

pendingonone’sperspective.Ifpriorinformationisnotwidelyaccepted,orisseen

asnon-crediblebythoseweareseekingtopersuade,wewillgeneratemorecredible

estimatesifwedonotusethosepriors.Indeed,thisiswhyBCS(2016)recommend

randomizeddesigns,includinginmedicineandindevelopmenteconomics.Theyde-

velopatheoryofaninvestigatorwhoisfacinganadversarialaudiencewhowill

challengeanypriorinformationandcanevenpotentiallyvetoresultsbasedonit

(thinkofadministrativeagenciessuchastheFDAorjournalreferees).Theexperi-

mentertradesoffhisorherowndesireforprecision(andpreventingpossibleharm

tosubjects),whichwouldrequirepriorinformation,againstthewishesoftheaudi-

ence,whowantsnothingtodowiththosepriors.Eventhen,theapprovaloftheau-

dienceisonlyexante;oncethefullyrandomizedexperimenthasbeendone,nothing

19

stopscriticsarguingthat,infact,therandomizationdidnotofferafairtestbecause

importantothercauseswerenotbalanced.AmongdoctorswhouseRCTs,andespe-

ciallymeta-analysis,suchargumentsare(appropriately)common(seeKramer

(2016)).

Today,whenthepublichascometoquestionexpertpriorknowledge,RCTs

willflourish.Incaseswherethereisgoodreasontodoubtthegoodfaithofexperi-

menters,asinmanypharmaceuticaltrials,randomizationwillindeedbeanappro-

priateresponse.Butwebelievesuchargumentsaredestructiveforscientificen-

deavor(whichisnotthepurposeoftheFDA)andshouldberesistedasageneral

prescriptioninscientificresearch.Previousknowledgeneedstobebuiltonandin-

corporatedintonewknowledge,notdiscardedinthefaceofaggressiveignorance.

Thesystematicrefusaltousepriorknowledgeandtheassociatedpreferencefor

RCTsarerecipesforpreventingcumulativescientificprogress.Intheend,itisalso

self-defeating.ToquoteRodrik(2016)“thepromiseofRCTsastheory-freelearning

machinesisafalseone.”

1.5StatisticalinferenceinRCTs

TheestimatedATEinasimpleRCTisthedifferenceinthemeansbetweenthetreat-

mentandcontrolgroups.Whencovariatesareallowedfor,asinmostRCTsineco-

nomics,theATEisusuallyestimatedfromthecoefficientonthetreatmentdummy

inaregressionthatlookslike(1),butwiththeheterogeneityin𝛽ignored.Modern

workcalculatesstandarderrorsallowingforthepossibilitythatresidualvariances

maybedifferentinthetreatmentandcontrolgroups,usuallybyclusteringthe

standarderrors,whichisequivalenttothefamiliartwosamplestandarderrorin

thecasewithnocovariates.Statisticalinferenceisdonewitht-valuesintheusual

way.Theseproceduresdonotalwaysgivetherightanswer.

Lookingbackat(1),theunderlyingobjectsofinterestaretheindividual

treatmenteffects𝛽( foreachoftheindividualsinthetrialsample.Neitherthey,nor

theirdistribution𝐺(𝛽)isidentifiedfromanRCT;becauseRCTsmakesofewas-

sumptionswhich,inmanycases,istheirstrength,theycanidentifyonlythemeanof

thedistribution.Inmanyobservationalstudies,researchersarepreparedtomake

moreassumptionsonfunctionalformsorondistributions,andforthatpriceweare

20

abletoidentifyotherquantitiesofinterest.Withouttheseassumptions,inferences

mustbebasedonthedifferenceinthetwomeans,astatisticthatissometimesill-

behaved,asweshalldiscussbelow.Thisill-behaviorhasnothingtodowithRCTs,

perse,butwithinRCTs,andtheirminimalassumptions,wecannoteasilyswitch

fromthemeantosomeotherquantityofinterest.

Fisherproposedthatstatisticalinferenceshouldbedoneusingwhathasbe-

comeknownas“randomization”inference,aprocedurethatisasnon-parametricas

theRCT-basedestimateofanATEitself.Totestthenullhypothesisthat𝛽( = 0for

alli,notethat,underthenullthatthetreatmenthasnoeffectonanyindividual,an

estimatednonzeroATEmustbeaconsequenceoftheparticularrandomallocation

thatgeneratedit.Bytabulatingallpossiblecombinationsoftreatmentsandcontrols

inourtrialsample,andtheATEassociatedwitheach,wecancalculatetheexactdis-

tributionoftheestimatedATEunderthenull.Thisallowsustocalculatetheproba-

bilityofcalculatinganestimateaslargeasouractualestimatewhentherearenoef-

fectsoftreatment.Thisrandomizationtestrequiresafinitesample,butitwillwork

foranysamplesize(seeImbensandWooldridge(2009)foranexcellentaccountof

theprocedure).Imbens(2010)arguesthatitisthisrandomizationinferenceplus

theunbiasednessoftheATEthatprovidesthetwinnon-parametricpillarsthatsup-

portplacingRCTsatthe“verytop”ofthehierarchyofevidence.

Randomizationinferencecanbeusedfornullhypothesesthatspecifythatall

ofthetreatmenteffectsarezero,asintheaboveexample,butitcannotbeusedto

testthehypothesisthattheaveragetreatmenteffectiszero,whichwilloftenbeof

interest.Inagriculturaltrials,andinmedicine,thestronger(sharp)hypothesisthat

thetreatmenthasnoeffectwhateverisoftenofinterest.Inmanyeconomicapplica-

tionsthatinvolvemoney,suchaswelfareexperimentsorcost-benefitanalyses,we

areinterestedinwhethertheneteffectofthetreatmentispositiveornegative,and

inthesecases,randomizationinferencecannotbeused.Noneofwhichargues

againstitswideruseinsocialscienceswhenappropriate.

Incaseswhererandomizationinferencecannotbeused,wemustconstruct

testsforthedifferencesintwomeans.Standardprocedureswilloftenworkwell,but

therearetwopotentialpitfalls.One,the‘Fisher-Behrensproblem’,comesfromthe

21

factthat,whenthetwosampleshavedifferentvariances—whichwetypicallywant

topermit—theusualt-statisticdoesnothavethet-distribution.Thesecondprob-

lem,whichismuchhardertoaddress,occurswhenthedistributionoftreatmentef-

fectsisnotsymmetric(BahadurandSavage(1956)).Neitherpitfallisspecificto

RCTs,butRCTsforceustoworkwithmeansinestimatingtreatmenteffectsand,

withonlyaveryfewexceptionsintheliterature,socialscientistswhouseRCTsap-

peartobeunawareofthedifficulties.

InthesimplecaseofcomparingtwomeansinanRCTwithoutcovariates,in-

ferenceisusuallybasedonthetwo–samplet–statisticwhichiscomputedbydivid-

ingtheATEbytheestimatedstandarderrorwhosesquareisgivenby

𝜎4 =𝑛" − 1 6" 𝑌( − 𝑌"(7"

4

𝑛"+

𝑛& − 1 6" 𝑌( − 𝑌&(7&4

𝑛&3

where0referstocontrolsand1totreatments,sothatthereare𝑛"treatmentsand

𝑛&controls,and𝑌"and𝑌&arethetwomeans.Ashaslongbeenknown,the“t-statis-

tic”basedon(3)isnotdistributedasStudent’stifthetwovariances(treatmentand

control)arenotidenticalbuthastheBehrens–Fisherdistribution.Inextremecases,

whenoneofthevariancesiszero,thet–statistichaseffectivedegreesoffreedom

halfofthatofthenominaldegreesoffreedom,sothatthetest-statistichasthicker

tailsthanallowedfor,andtherewillbetoomanyrejectionswhenthenullistrue.

Young(2016)arguesthatthisproblemisworsewhenthetrialresultsarean-

alyzedbyregressingoutcomesnotonlyonthetreatmentdummybutalsoonaddi-

tionalcontrolsandwhenusingclusteredorrobuststandarderrors.Whenthede-

signmatrixissuchthatthemaximalinfluenceislarge,sothatforsomeobservations

outcomeshavelargeinfluenceontheirownpredictedvalues,thereisareductionin

theeffectivedegreesoffreedomforthet–value(s)oftheaveragetreatmenteffect(s)

leadingtospuriousfindingsofsignificance.Younglooksat2,003regressionsre-

portedin53RCTpapersintheAmericanEconomicAssociationjournalsandrecalcu-

latesthesignificanceoftheestimatesusingrandomizationinferenceappliedtothe

authors’originaldata.In30to40percentoftheestimatedtreatmenteffectsinindi-

vidualequationswithcoefficientsthatarereportedassignificant,hecannotreject

thenullofnoeffectforanyobservation;thefractionofspuriouslysignificantresults

22

increasesfurtherwhenhesimultaneouslytestsforallresultsineachpaper.These

spuriousfindingscomeinpartfromissuesofmultiple-hypothesistesting,both

withinregressionswithseveraltreatmentsandacrossregressions.Withinregres-

sions,treatmentsarelargelyorthogonal,butauthorstendtoemphasizesignificant

t–valuesevenwhenthecorrespondingF-testsareinsignificant.Acrossequations,

resultsareoftenstronglycorrelated,sothat,atworst,differentregressionsarere-

portingvariantsofthesameresult,thusspuriouslyaddingtothe“killcount”ofsig-

nificanteffects.Atthesametime,thepervasivenessofobservationswithhighinflu-

encegeneratesspurioussignificanceonitsown.

Theseissuesarenowbeingtakenmoreseriously.InadditiontoYoung

(2016),ImbensandKolesár(2016)providepracticaladvicefordealingwiththe

Fisher-Behrensproblem,andthebestcurrentpracticetriestobecarefulaboutmul-

tiplehypothesistesting.Yetitremainsthecasethatmanyoftheresultsinthelitera-

turearespuriouslysignificant.

Spurioussignificancealsoariseswhenthedistributionoftreatmenteffects

containsoutliersor,moregenerally,isnotsymmetric.Standardt–testsbreakdown

indistributionswithenoughskewness(seeLehmannandRomano(2005,p.466–

8)).Howdifficultisittomaintainsymmetry?Andhowbadlyisinferenceaffected

whenthedistributionoftreatmenteffectsisnotsymmetric?Ineconomics,manytri-

alshaveoutcomesvaluedinmoney.Doesananti-povertyinnovation—forexample

microfinance—increasetheincomesoftheparticipants?Incomeitselfisnotsym-

metricallydistributed,andthismightbetrueofthetreatmenteffectstooifthereare

afewpeoplewhoaretalentedbutcredit-constrainedentrepreneursandwhohave

treatmenteffectsthatarelargeandpositive,whilethevastmajorityofborrowers

fritterawaytheirloans,oratbestmakepositivebutmodestprofits.Arecentsum-

maryoftheliteratureisconsistentwiththis(seeBanerjee,Karlan,andZinman

(2015)).Anotherimportantexampleisexpendituresonhealthcare.Mostpeople

havezeroexpenditureinanygivenperiod,butamongthosewhodoincurexpendi-

tures,afewindividualsspendhugeamountsthataccountforalargeshareoftheto-

tal.Indeed,inthefamousRandhealthexperiment(seeManning,Newhouseetal.

23

(1987,1988)),thereisasingleverylargeoutlier.Theauthorsrealizethatthecom-

parisonofmeansacrosstreatmentarmsisfragile,and,althoughtheydonotsee

theirproblemexactlyasdescribedhere,theyobtaintheirpreferredestimatesusing

astructuralapproachthatisexplicitlydesignedtomodeltheskewnessofexpendi-

tures.

Insomecases,itwillbeappropriatetodealwithoutliersbytrimming,trans-

forming,oreliminatingobservationsthathavelargeeffectsontheestimates.Butif

theexperimentisaprojectevaluationdesignedtoestimatethenetbenefitsofapol-

icy,theeliminationofgenuineoutliers,asintheRandHealthExperiment,willviti-

atetheanalysis.Itispreciselytheoutliersthatmakeorbreaktheprogram.Trans-

formations,suchastakinglogarithms,mayhelptoproducesymmetry,butthey

changethenatureofthequestionbeingasked;acostbenefitanalysismustbedone

indollars,notlogdollars.

Weconsideranexamplethatillustrateswhatcanhappeninarealisticbut

simplifiedcase;thefullresultsarereportedintheAppendix.Weimagineapopula-

tionofindividuals,eachwithatreatmenteffect𝛽( .Theparentpopulationmeanof

thetreatmenteffectsiszero,butthereisalongtailofpositivevalues;weusealeft-

shiftedlognormaldistribution.Wehaveamicrofinancetrialinmind,wherethereis

alongpositivetailofrareindividualswhocandoamazingthingswithcreditwhile

mostpeoplecannotuseiteffectively.Atrialsampleof2n individualsisrandomly

drawnfromtheparentpopulationandisrandomlysplitbetweenntreatmentsand

ncontrols.Withineachtrialsample,whosetrueATEwillgenerallydifferfromzero

becauseofthesampling,werunmanyRCTsandtabulatethevaluesoftheATEfor

each.

Usingstandardt-tests,the(trueintheparentdistribution)hypothesisthat

theATEiszeroisrejectedbetween14(𝑛 = 25)and6percent(𝑛 = 500)ofthetime.

Theserejectionscomefromtwoseparateissues,bothofwhicharerelevantinprac-

tice;(a)thattheATEintrialsamplediffersfromtheATEintheparentpopulationof

interest,and(b)thatthet-valuesarenotdistributedastinthepresenceofoutliers.

24

Theproblemcasesarewhenthetrialsamplehappenstocontainoneormoreoutli-

ers,somethingthatisalwaysariskgiventhelongpositivetailoftheparentdistribu-

tion.Whenthishappens,everythingdependsonwhethertheoutlierisamongthe

treatmentsorthecontrols;ineffect,theoutliersbecomethesample,reducingthe

effectivenumberofdegreesoffreedom.Inextremecases,oneofwhichisillustrated

inFigureA.1,thedistributionofestimatedATEsisbimodal,dependingonthegroup

towhichtheoutlierisassigned.Whentheoutlierisinthetreatmentgroup,thedis-

persionacrossoutcomesislarge,asistheestimatedstandarderror,andsothose

outcomesrarelyrejectthenullusingthestandardtableoft–values.Theover-rejec-

tionscomefromcaseswhentheoutlierisinthecontrolgroup,theoutcomesarenot

sodispersed,andthet–valuescanbelarge,negative,andsignificant.Whilethese

casesofbimodaldistributionsmaynotbecommonanddependontheexistenceof

largeoutliers,theyillustratetheprocessthatgeneratestheover-rejectionsandspu-

rioussignificance.Notethatthereisnoremedythroughrandomizationinference

here,giventhatourinterestisinthehypothesisthattheaveragetreatmenteffectis

zero.

OurreadingoftheliteratureonRCTsindevelopmenteconomicssuggests

thattheyarenotexemptfromtheseconcerns.Manydevelopmenttrialsarerunon

(sometimesvery)smallsamples,theyhavetreatmenteffectswhereasymmetryis

hardtoruleout—especiallywhentheoutcomesareinmoney—andtheyoftengive

resultsthatarepuzzling,oratleastnoteasilyinterpretedintermsofeconomicthe-

ory.NeitherBanerjeeandDuflo(2012)norKarlanandAppel(2011),whocitemany

RCTs,raiseconcernsaboutmisleadinginference,implicitlytreatingallresultsasre-

liable.Nodoubttherearebehaviorsintheworldthatareinconsistentwithstandard

economics,andsomecanbeexplainedbystandardbiasesinbehavioraleconomics,

butitwouldalsobegoodtobesuspiciousofthesignificancetestsbeforeaccepting

thatanunexpectedfindingiswell-supportedandthattheorymustberevised.Repli-

cationofresultsindifferentsettingsmaybehelpful,iftheyaretherightkindof

places(seeourdiscussioninSection2).Yetithardlysolvestheproblemgiventhat

theasymmetrymaybeinthesamedirectionindifferentsettings,thatitseemslikely

tobesoinjustthosesettingsthataresufficientlyliketheoriginaltrialsettingtobe

25

ofuseforinferenceaboutthepopulationofinterest,andthatthe“significant”t–val-

ueswillshowdeparturesfromthenullinthesamedirection.This,then,replicates

thespuriousfindings.

Asummary

Whatdotheargumentsofthissectionmeanabouttheimportanceofrandomization

andtheinterpretationthatshouldbegiventoanestimatedATEfromarandomized

trial?First,weshouldbesurethatanunbiasedestimateofanATEforthetrialpopu-

lationislikelytobeusefulenoughtowarrantthecostsofrunningthetrial.Second,

sincerandomizationdoesnotensureorthogonality,caremustbetaken(e.g.by

blinding)thattherearenosignificantpost-randomizationcorrelateswiththetreat-

ment.Thisisawell-knownlessonbutmanysocialandeconomictrialsarenot

blindedandinsufficientdefenseisofferedthatunbiasednessisnotundermined.In-

deed,lackofblindingisnottheonlysourceofpost-randomizationbias.Treatments

andcontrolsmaybehandledindifferentplaces,orbydifferentlytrainedpractition-

ers,oratdifferenttimesofday,andthesedifferencescanbringwiththemsystem-

aticdifferencesintheothercausestowhichthetwogroupsareexposed.Thesecan,

andshould,beguardedagainst.Butdoingsorequiresanunderstandingofwhat

thesecausallyrelevantfactorsmightbe.Third,theinferenceproblemsreviewed

herecannotjustbepresumedaway.Whenthereissubstantialheterogeneity,the

ATEinthetrialsamplecanbequitedifferentfromtheATEinthepopulationofin-

terest,evenifthetrialisrandomlyselectedfromthatpopulation;inpractice,there-

lationshipbetweenthetrialsampleandthepopulationisoftenobscure.

Beyondthat,inmanycases,thestatisticalinferencewillbefine,butserious

attentionshouldbegiventothepossibilitythatthereareoutliersintreatmentef-

fects,somethingthatknowledgeoftheproblemcansuggestandwhereinspectionof

themarginaldistributionsoftreatmentsandcontrolsmaybeinformative.Forexam-

ple,ifbotharesymmetric,itseemsunlikely(thoughcertainlynotimpossible)that

thetreatmenteffectsarehighlyskewed.MeasurestodealwithFisher-Behrens

shouldbeusedandrandomizationinferenceconsideredwhenappropriatetothe

hypothesisofinterest.

26

Allofthiscanberegardedasrecommendationsforimprovementtocurrent

practice,notachallengetoit.Morefundamentally,westronglycontesttheoften-ex-

pressedideathattheATEcalculatedfromanRCTisautomaticallyreliable,thatran-

domizationautomaticallycontrolsforunobservables,orworstofall,thatthecalcu-

latedATEistrue.If,bychance,itisclosetothetruth,thetruthwearereferringtois

thetruthinthetrialsampleonly.Tomakeanyinferencebeyondthatrequiresanar-

gumentofthekindweconsiderinthenextsection.Wehavealsoarguedthat,de-

pendingonwhatwearetryingtomeasureandwhatwewanttousethatmeasure

for,thereisnopresumptionthatanRCTisthebestmeansofestimatingit.Thattoo

requiresanargument,notapresumption.

Section2:Usingtheresultsofrandomizedcontrolledtrials

2.1Introduction

SupposewehaveestimatedanATEfromawell-conductedRCTonatrialsample,

andourstandarderrorgivesusreasontobelievethattheeffectdidnotcomeabout

bychance.Wethushavegoodwarrantthatthetreatmentcausestheeffectinour

trialsample,uptothelimitsofstatisticalinference.Whataresuchfindingsgoodfor?

Theliteratureineconomics,asindeedinmedicineandinsocialpolicy,has

paidmoreattentiontoobtainingresultsthantoconsideringwhatcanbedonewith

them.Thereislittletheoreticalorempiricalworktoguideushowandforwhatpur-

posestousethefindingsofRCTs,suchastheconditionsunderwhichthesamere-

sultsholdoutsideoftheoriginalsettings,howtheymightbeadaptedforuseelse-

where,orhowtheymightbeusedforformulating,testing,understanding,orprob-

inghypothesesbeyondtheimmediaterelationbetweenthetreatmentandtheout-

comeinvestigatedinthestudy.Yetitcannotbethatknowinghowtouseresultsis

lessimportantthanknowinghowtodemonstratethem.Anychainofevidenceisonly

asstrongasitweakestlink,sothatarigorouslyestablishedeffectwhoseapplicabil-

ityisjustifiedbyaloosedeclarationofsimilewarrantslittle.Iftrialsaretobeuseful,

weneedpathstotheirusethatareascarefullyconstructedasarethetrialsthem-

selves.

Theargumentforthe“primacyofinternalvalidity”madebyShadish,Cook,

andCampbell(2002)maybereasonableasawarningthatbadRCTsareunlikelyto

27

generalize,butitissometimesincorrectlytakentoimplythatresultsofaninternally

validtrialwillautomatically,oroften,apply‘asis’elsewhere,orthatthisshouldbe

thedefaultassumptionfailingargumentstothecontrary,asifaparameter,once

wellestablished,canbeexpectedtobeinvariantacrosssettings.Aninvarianceas-

sumptionisoftenmadeinmedicine,forexample,whereitissometimesplausible

thataparticularprocedureordrugworksthesamewayeverywhere(thoughsee

Horton(2000)forastrongdissentandRothwell(2005)forexamplesonbothsides

ofthequestion).Weshouldalsonotetherecentmovementtoensurethattestingof

drugsincludeswomenandminoritiesbecausemembersofthosegroupssuppose

thattheresultsoftrialsonmostlyhealthyyoungwhitemalesdonotapplytothem.

2.2Usingresults,transportability,andexternalvalidity

Supposeatrialhasestablishedaresultinaspecificsetting.If`thesame’resultholds

elsewhere,itissaidtohaveèxternalvalidity’.Externalvaliditymayreferjusttothe

transportabilityofthecausalconnection,orgofurtherandrequirereplicationofthe

magnitudeoftheATE.Eitherway,theresultholds—everywhere,orwidely,orin

somespecificelsewhere—oritdoesnot.

Thisbinaryconceptofexternalvalidityisoftenunhelpfulbecauseitasksthe

resultsofanRCTtosatisfyaconditionthatisneithernecessarynorsufficientfora

trialtobeuseful,andsobothoverstatesandunderstatestheirvalue.Itdirectsusto-

wardsimpleextrapolation—whetherthesameresultholdselsewhere—orsimple

generalization—itholdsuniversallyoratleastwidely—andawayfrommorecom-

plexbutmoreusefulapplicationsoftheresults.Thefailureofexternalvalidityinter-

pretedassimplegeneralizationorextrapolationsayslittleaboutthevalueofthe

trial.

First,thereareseveralusesofRCTsthatdonotrequiretransportabilitybe-

yondtheoriginalcontext;wediscusstheseinthenextsubsection.Second,thereare

oftengoodreasonstoexpectthattheresultsfromawell-conducted,informative,

andpotentiallyusefulRCTwillnotapplyelsewhereinanysimpleway.Withoutfur-

therunderstandingandanalysis,evensuccessfulreplicationtellsuslittleeitherfor

oragainstsimplegeneralizationortosupportfortheconclusionthatthenextwill

workinthesameway.Nordofailuresofreplicationmaketheoriginalresultuseless.

28

Weoftenlearnmuchfromcomingtounderstandwhyreplicationfailedandcanuse

thatknowledge,inlookingforhowthefactorsthatcausedtheoriginalresultmight

operatedifferentlyindifferentsettings.Third,andparticularlyimportantforscien-

tificprogress,theRCTresultcanbeincorporatedintoanetworkofevidenceandhy-

pothesesthattestorexploreclaimsthatlookverydifferentfromtheresultsre-

portedfromtheRCT.WeshallgiveexamplesbelowofextremelyusefulRCTsthat

arenotexternallyvalidinthe(usual)sensethattheirresultsdonotholdelsewhere,

whetherinaspecifictargetsettingorinthemoresweepingsenseofholdingevery-

where.

BertrandRussell’schicken(Russell(1912))providesanexcellentexampleof

thelimitationstostraightforwardextrapolationfromrepeatedsuccessfulreplica-

tion.Thebirdinfers,onrepeatedevidence,thatwhenthefarmercomesinthe

morning,hefeedsher.TheinferenceservesherwelluntilChristmasmorning,when

hewringsherneckandservesherfordinner.Thoughthischickendidnotbaseher

inferenceonanRCT,hadweconstructedoneforher,wewouldhaveobtainedthe

sameresultthatshedid.Herproblemwasnothermethodology,butratherthatshe

didnotunderstandthesocialandeconomicstructurethatgaverisetothecausalre-

lationsthatsheobserved.

So,establishingcausalitydoesnothinginandofitselftoguaranteegenerali-

zability.NordoestheabilityofanidealRCTtoeliminatebiasfromselectionorfrom

omittedvariablesmeanthattheresultingATEfromthetrialsamplewillapplyany-

whereelse.Theissueisworthmentioningonlybecauseoftheenormousweight

thatiscurrentlyattachedineconomicstothediscoveryandlabelingofcausalrela-

tions,aweightthatishardtojustifyforeffectsthatmayhaveonlylocalapplicabil-

ity,whatmightbelabeled‘anecdotalcausality’.Theoperationofacausegenerally

requiresthepresenceof“supportfactors”,withoutwhichacausethatproducesthe

targetedeffectinoneplace,eventhoughitmaybepresentandhavethecapacityto

operateelsewhere,willremainlatentandinoperative.WhatMackie(1974)called

INUScausality(InsufficientbutNon-redundantpartsofaconditionthatisitselfUn-

necessarybutSufficientforacontributiontotheoutcome)isoftenthekindofcau-

salitywesee.Astandardexampleisahouseburningdownbecausethetelevision

29

waslefton,althoughtelevisionsdonotoperateinthiswaywithoutsupportfactors,

suchaswiringfaults,thepresenceoftinder,andsoon.Thisisstandardfareinepi-

demiology,whichusestheterm`causalpie’torefertoasetofcausesthatarejointly

butnotseparatelysufficientforaneffect.

Ifwerewrite(1)intheform

𝑌( = 𝛽(𝑇( + 𝛾<𝑥(< = 𝜃 𝑤( 𝑇(

@

<A"

+ 𝛾<𝑥(<

@

<A"

(4)

wherethefunction𝜃(. )controlshowak-vector𝑤( ofk`supportfactors’affectindi-

viduali’streatmenteffect𝛽( .Thesupportfactorsmayincludesomeofthex’s.Since

theATEistheaverageofthe𝛽(𝑠,twopopulationswillhavethesameATEifandonly

iftheyhavethesameaveragefortheneteffectofthesupportfactorsnecessaryfor

thetreatmenttowork,i.e.forthequantityinfrontof𝑇( .Thesearehoweverjustthe

kindoffactorsthatarelikelytobedifferentlydistributedindifferentpopulations,

andindeedwedogenerallyfinddifferentATEsindifferenteconomic(andotherso-

cialpolicy)RCTsindifferentplaceseveninthecaseswhere(unusually)theyall

pointinthesamedirection.

Causalprocessesoftenrequirehighlyspecializedeconomic,cultural,orsocial

structurestoenablethemtowork.ConsidertheRubeGoldbergmachinethatis

riggedupsothatflyingakitesharpensapencil(CartwrightandHardie(2012,77)).

Theunderlyingstructureaffordsaveryspecificformof(4)thatwillnotdescribe

causalprocesseselsewhere.NeitherthesameATEnorthesamequalitativecausal

relationscanbeexpectedtoholdwherethespecificformfor(4)isdifferent.Indeed,

wecontinuallyattempttodesignsystemsthatwillgeneratecausalrelationsthatwe

likeandthatwillruleoutcausalrelationsthatwedonotlike.Healthcaresystems

aredesignedtopreventnursesanddoctorsmakingerrors;carsaredesignedsothat

driverscannotstarttheminreverse;workschedulesforpilotsaredesignedsothey

donotflytoomanyconsecutivehourswithoutrestbecausealertnessandperfor-

mancearecompromised.

30

AsintheRubeGoldbergmachineandinthedesignofcarsandworksched-

ules,theeconomicstructureandequilibriummaydifferinwaysthatsupportdiffer-

entkindsofcausalrelationsandthusrenderatrialinonesettinguselessinanother.

Forexample,atrialthatreliesonprovidingincentivesforpersonalpromotionisof

nouseinastateinwhichapoliticalsystemlockspeopleintotheirsocialandeco-

nomicpositions.Cashtransfersthatareconditionalonparentstakingtheirchildren

toclinicscannotimprovechildhealthintheabsenceoffunctioningclinics.Policies

targetedatmenmaynotworkforwomen.Weusealevertotoastourbread,butlev-

ersonlyoperatetotoastbreadinatoaster;wecannotbrowntoastbypressingan

accelerator,eveniftheprincipleoftheleveristhesameinbothatoasterandacar.

Ifwemisunderstandthesetting,ifwedonotunderstandwhythetreatmentinour

RCTworks,werunthesamerisksasRussell’schicken.

2.3WhenRCTsspeakforthemselves:notransportabilityrequired

Forsomethingswewanttolearn,anRCTisenoughbyitself.AnRCTmayprovidea

counterexampletoageneraltheoreticalproposition,eithertothepropositionitself

(asimplerefutationtest)ortosomeconsequenceofit(acomplexrefutationtest).

AnRCTmayalsoconfirmapredictionofatheory,andalthoughthisdoesnotcon-

firmthetheory,itisevidenceinitsfavor,especiallyifthepredictionseemsinher-

entlyunlikelyinadvance.Thisisallfamiliarterritory,andthereisnothingunique

aboutanRCT;itissimplyoneamongmanypossibletestingprocedures.Evenwhen

thereisnotheory,orveryweaktheory,anRCT,bydemonstratingcausalityinsome

populationcanbethoughtofasproofofconcept,thatthetreatmentiscapableof

workingsomewhere.Thisisoneoftheargumentsfortheimportanceofinternalva-

lidity.

NoristransportationcalledforwhenanRCTisusedforevaluation,forexam-

pletosatisfydonorsthattheprojecttheyfundedachieveditsaimsinthepopulation

inwhichitwasconducted.Evenso,forsuchevaluations,saybytheWorldBank,to

beglobalpublicgoodsrequiresargumentsandguidelinesthatjustifyusingthere-

sultsinsomewayelsewhere;theglobalpublicgoodisnotanautomaticby-product

oftheBankfulfillingitsfiduciaryresponsibility.Whenthecomponentsoftreat-

mentschangeacrossstudies,evaluationsneednotleadtocumulativeknowledge.Or

31

asHeckmanetal(1999,1934)note,“thedataproducedfromthem[socialexperi-

ments]arefarfromidealforestimatingthestructuralparametersofbehavioral

models.Thismakesitdifficulttogeneralizefindingsacrossexperimentsortouse

experimentstoidentifythepolicy-invariantstructuralparametersthatarerequired

foreconometricpolicyevaluation.”

Ofcourse,whenweaskexactlywhatthoseinvariantstructuralparameters

are,whethertheyexist,andhowtheyshouldbemodeled,weopenupmajorfault

linesinmodernappliedeconomics.Forexample,wedonotintendtoendorseinter-

temporaldynamicmodelsofbehaviorastheonlywayofrecoveringtheparameters

thatweneed.Wealsorecognizethattheusefulnessofsimplepricetheoryisnotas

universallyacceptedasitoncewas.Butthepointremainsthatweneedsomething,

someregularityorsomeinvariance,andthatsomethingcanrarelyberecoveredby

simplygeneralizingacrosstrials.

Athirdnon-problematicandimportantuseofanRCTiswhentheparameter

ofinterestistheATEinawell-definedpopulationfromwhichthetrialsampleisit-

selfarandomsample.Inthiscasethesampleaveragetreatmenteffect(SATE)isan

unbiasedestimatorofthepopulationaveragetreatmenteffect(PATE)that,byas-

sumption,isourtarget(seeImbens(2004)fortheseterms).Werefertothisasthe

`publichealth’case;likemanypublichealthinterventions,thetargetistheaverage,

`populationhealth,’notthehealthofindividuals.Onemajor(andwidelyrecog-

nized)dangerofthisuseofRCTsisthatscalingupfrom(evenarandom)sampleto

thepopulationwillnotgothroughinanysimplewayiftheoutcomesofindividuals

orgroupsofindividualschangethebehaviorofothers—whichwillbecommonin

economicexamplesbutperhapslesssoinhealth.Thereisalsoanissueoftimingif

timeelapsesbetweenthetrialandtheimplementation.

Ineconomics,a`public-health-style’exampleistheimpositionofacommod-

itytax,wherethetotaltaxrevenueisofinterestandpolicymakersdonotcarewho

paysthetax.Indeed,theorycanoftenidentifyaspecific,well-definedquantity

whosemeasurementiskeyforapolicy(seeDeatonandNg(1998)foranexampleof

whatChetty(2009)callsa“sufficient”statistic).Inthiscase,thebehaviorofaran-

domsampleofindividualsmightwellprovideagoodguidetothetaxrevenuethat

32

canbeexpected.Anothercasecomesfromworkonpovertyprogramswherethe

sponsorsaremostconcernedaboutthebudget;wediscussthesecasesattheendof

thisSection.Evenhere,itiseasytoimaginebehavioraleffectscomingintoplaythat

driveawedgebetweenthetrialanditsfull-scaleimplementation,forexampleif

complianceishigherwhentheschemeiswidelypublicized,orifgovernmentagen-

ciesimplementtheschemedifferentlyfromtrialists.

2.4Transportingresultslaterallyandglobally

TheprogramofRCTsineconomics,asinotherareasofsocialscience,hasthe

broadergoaloffindingout`whatworks.’Atitsmostambitious,thisaimsforuniver-

salreach,andthedevelopmenteconomicsliteraturefrequentlyarguesthat“credi-

bleimpactevaluationsareglobalpublicgoodsinthesensethattheycanofferrelia-

bleguidancetointernationalorganizations,governments,donors,andnongovern-

mentalorganizations(NGOs)beyondnationalborders”,DufloandKremer(2008,

93).SometimestheresultsofasingleRCTareadvocatedashavingwideapplicabil-

ity,withespeciallystrongendorsementwhenthereisatleastonereplication.For

example,KremerandHolla(2009,3)useaKenyantrialasthebasisforablanket

statementwithoutspecifyingcontext,“Provisionoffreeschooluniforms,forexam-

ple,leadsto10%-15%reductionsinteenpregnancyanddropoutrates.”Dufloand

Kremer(2008,104),writingaboutanothertrial,aremorecautious,citingtwoeval-

uationsandrestrictingthemselvestoIndia:“Onecanberelativelyconfidentabout

recommendingthescaling-upofthisprogram,atleastinIndia,onthebasisofthese

estimates,sincetheprogramwascontinuedforaperiodoftime,wasevaluatedin

twodifferentcontexts,andhasshownitsabilitytoberolledoutonalargescale.”

Evenanumberofreplicationsdonotprovideasoundbasisforinference.Without

theorytosupporttheprojectionofresults,thisisjustinductionbysimpleenumera-

tion—swan1iswhite,swan2iswhite,...,soallswansarewhite.

TheproblemofgeneralizationextendsbeyondRCTs,toboth`fullycon-

trolled’laboratoryexperimentsandtomostnon-experimentalfindings.Ourargu-

menthereisthatevidencefromRCTsisnotautomaticallysimplygeneralizable,and

thatitssuperiorinternalvalidity,ifandwhenitexists,doesnotprovideitwithany

uniqueinvarianceacrosscontext.Thattransportationisfarfromautomaticalso

33

tellsuswhy(evenideal)RCTsofsimilarinterventionsgivedifferentanswersindif-

ferentsettings.Suchdifferencesdonotnecessarilyreflectmethodologicalfailings

andwillholdacrossperfectlyexecutedRCTsjustastheydoacrossobservational

studies.

ManyadvocatesofRCTsunderstandthat`whatworks’needstobequalified

to`whatworksunderwhichcircumstances’andtrytosaysomethingaboutwhat

thosecircumstancesmightbe,forexample,byreplicatingRCTsindifferentplaces

andthinkingintelligentlyaboutthedifferencesinoutcomeswhentheyfindthem.

Sometimesthisisdoneinasystematicway,forexamplebyhavingmultipletreat-

mentswithinthesametrialsothatitispossibletoestimatea`responsesurface’that

linksoutcomestovariouscombinationsoftreatments(seeGreenbergandSchroder

(2004)orShadishetal(2002)).Forexample,theRANDhealthexperimenthadmul-

tipletreatments,allowinginvestigation,notofhowmuchhealthinsurancein-

creasedexpendituresunderdifferentcircumstances.Someofthenegativeincome

taxexperiments(NITs)inthe1960sand1970sweredesignedtoestimateresponse

surfaces,withthenumberoftreatmentsandcontrolsineacharmoptimizedtomax-

imizeprecisionofestimatedresponsefunctionssubjecttoanoverallcostlimit(see

Conlisk(1973)).Experimentsontime-of-daypricingforelectricityhadasimilar

structure(seeAigner(1985)).

TheexperimentsbyMDRC(originallyknownastheManpowerDevelopment

ResearchCorporation)havealsobeenanalyzedacrosscitiesinanefforttolinkcity

featurestotheresultsoftheRCTswithinthem(seeBloom,Hill,andRiccio(2005)).

UnliketheRANDandNITexamples,theseareexpostanalysesofcompletedtrials;

thesameistrueofVivalt(2015),whofinds,forthecollectionoftrialsshestudied,

thatdevelopment-relatedRCTsrunbygovernmentagenciestypicallyfindsmaller

(standardized)effectsizesthanRCTsrunbyacademicsorbyNGOs.Boldetal

(2013),whoranparallelRCTsonaninterventionimplementedeitherbyanNGOor

bythegovernmentofKenya,foundsimilarresultsthere.Notethattheseanalyses

haveadifferentpurposefrommeta-analysesthatassumethatdifferenttrialsesti-

matethesameparameteruptonoiseandaverageinordertoincreaseprecision.

34

Althoughthereareissueswithallofmethodsofinvestigatingdifferences

acrosstrials,withoutsomedisciplineitistooeasytocomeupwith`just-so’orfairy

storiesthataccountfordifferences.Weriskaprocedurethat,ifaresultisreplicated

infullorinpartinatleasttwoplaces,putsthattreatmentintotheìtworks’box

and,iftheresultdoesnotreplicate,casuallyinterpretsthedifferenceinawaythat

allowsatleastsomeofthefindingstosurvive.

Howcanwedobetterthansimplegeneralizationandsimpleextrapolation?

Manywritersemphasizetheroleoftheoryintransportingandusingtheresultsof

trials,andweshalldiscussthisinthenextsubsection.Butstatisticalapproachesare

alsowidelyused;thesearedesignedtodealwiththepossibilitythattreatmentef-

fectsvarysystematicallywithothervariables.Referringbackto(4),itisclearthat,

supposingthesameformof(4)obtains,ifthedistributionofthewvaluesisthe

sameinthenewcircumstancesasintheold,theATEintheoriginaltrialwillholdin

thenewcircumstances.Ingeneral,ofcourse,thisconditionwillnothold,nordowe

haveanyobviouswayofcheckingitunlessweknowwhatthesupportfactorsarein

bothplaces.Oneproceduretodealwithinteractionsispost-experimentalstratifica-

tion,whichparallelspost-surveystratificationinsamplesurveys.Thetrialisbroken

upintosubgroupsthathavethesamecombinationofknown,observablew’s(age,

race,genderforexample),thentheATEswithineachofthesubgroupsarecalcu-

lated,andthentheyarereassembledaccordingtotheconfigurationofw’sinthe

newcontext.ThiscanbeusedtoestimatetheATEinanewcontext,ortocorrectes-

timatestotheparentpopulationwhenthetrialsampleisnotarandomsampleof

theparent.Othermethodscanbeusedwhentherearetoomanyw’sforstratifica-

tion,forexamplebyestimatingtheprobabilityofeachobservationinthepopulation

includedinthetrialsampleasafunctionofthew’s,thenweightingeachobservation

bytheinverseofthesepropensityscores.AgoodreferenceforthesemethodsisStu-

artetal(2011),orineconomics,Angrist(2004)andHotz,Imbens,andMortimer

(2005).

Thesemethodsareoftennotapplicable,however.First,reweightingworks

onlywhentheobservablefactorsusedforreweightingincludeall(andonly)genu-

ineinteractivecauses.Second,aswithanyformofreweighting,thevariablesusedto

35

constructtheweightsmustbepresentinboththeoriginalandnewcontext.Forex-

ample,ifwearetocarryaresultforwardintime,wemaynotbeabletoextrapolate

fromaperiodoflowinflationtoaperiodofhighinflation.AsHotzetal(2005)note,

itwilltypicallybenecessarytoruleoutsuch`macro’effects,whetherovertime,or

overlocations.Third,italsodependsonassumingthatthesamegoverningequation

(4)coversthetrialandthetargetpopulation.

PearlandBareinboim(2011,2014)andBareinboimandPearl(2013,2014)

providestrategiesforinferringinformationaboutnewpopulationsfromtrialre-

sultsthataremoregeneralthanreweighting.Theysupposewehaveavailableboth

causalinformationandprobabilisticinformationforpopulationA(e.g.theexperi-

mentalone),whileforpopulationB(thetarget)wehaveonly(some)probabilistic

information,andalsothatweknowthatcertainprobabilisticandcausalfactsare

sharedbetweenthetwoandcertainonesarenot.Theyoffertheoremsdescribing

whatcausalconclusionsaboutpopulationBaretherebyfixed.Theirworkunder-

linesthefactthatexactlywhatconclusionsaboutonepopulationcanbesupported

byinformationaboutanotherdependsonexactlywhatcausalandprobabilisticfacts

theyhaveincommon. ButasMuller(2015)notes,this,liketheproblemwithsimple

reweighting,takesusbacktothesituationthatRCTsaredesignedtoavoid,where

weneedtostartfromacompleteandcorrectspecificationofthecausalstructure.

RCTscanavoidthisinestimation—whichisoneoftheirstrengths,supportingtheir

credibility—butthebenefitvanishesassoonaswetrytocarrytheirresultstoanew

context.

Thisdiscussionleadstoanumberofpoints.First,wecannotgettogeneral

claimsbysimplegeneralization;thereisnowarrantfortheconvenientassumption

thattheATEestimatedinaspecificRCTisaninvariantparameter,northatthe

kindsofinterventionsandoutcomeswemeasureintypicalRCTsparticipateingen-

eralcausalrelations.Whileitistruethatgeneralcausalclaimsexist—thatgravita-

tionalmassesattracteachother,orthatpeoplerespondtoincentives—theseuse

relativelyabstractconceptsandoperateatamuchhigherlevelthantheclaimsthat

canbereasonablyinferredfromatypicalRCT.

36

Second,thoughtfulpre-experimentalstratificationinRCTsislikelytobeval-

uable,orfailingthat,subgroupanalysis,becauseitcanprovideinformationthatmay

beusefulforgeneralizationortransportation.Forexample,KremerandHolla

(2009)notethat,intheirtrials,schoolattendanceissurprisinglysensitivetosmall

subsidies,whichtheysuggestisbecausetherearealargenumberofstudentsand

parentswhoareonthe(financial)marginbetweenattendingandnotattending

school;ifthisisindeedthemechanismfortheirresults,agoodvariableforstratifi-

cationwouldbedistancefromtherelevantcutoff.Wealsoneedtoknowthatthis

samemechanismworksinanynewtargetsetting.

Third,weneedtobeexplicitaboutcausalstructure,evenifthatmeansmore

modelbuildingandmore—ordifferent—assumptionsthanadvocatesofRCTsare

oftencomfortablewith.Tobeclear,modelingcausalstructuredoesnotcommitus

totheelaborateandoftenincredibleassumptionsthatcharacterizesomestructural

modelingineconomics,butthereisnoescapefromthinkingaboutthewaythings

work;thewhyaswellasthewhat.

Fourth,wewilltypicallyneedtoknowmorethantheresultsoftheRCTitself,

forexampleaboutdifferencesinsocial,economic,andculturalstructuresandabout

thejointdistributionsofcausalvariables,knowledgethatwilloftenonlybeavaila-

blethroughobservationalstudies.Wewillalsoneedexternalinformation,boththe-

oreticalandempirical,tosettleonaninformativecharacterizationofthepopulation

enrolledintheRCTbecausehowthatpopulationisdescribediscommonlytakento

besomeindicationofwhichotherpopulationstheresultsarelikelytobeexportable

to.Manymedicalandpsychologicaljournalsareexplicitaboutthis.Forinstance,the

rulesforsubmissionrecommendedbytheInternationalCommitteeofMedicalJour-

nalEditors,ICMJE(2015,14)insistthatarticleabstracts“Clearlydescribetheselec-

tionofobservationalorexperimentalparticipants(healthyindividualsorpatients,

includingcontrols),includingeligibilityandexclusioncriteriaandadescriptionof

thesourcepopulation.”AnRCTisconductedonaspecifictrialsample,somehow

drawnfromapopulationofspecificindividuals.Theresultsobtainedarefeaturesof

thatsample,ofthoseveryindividualsatthatverytime,notanyotherpopulation

37

withanydifferentindividualsthatmight,forexample,satisfyoneoftheinfiniteset

ofdescriptionsthatthetrialsamplesatisfies.

Thissameissueisconfrontedalreadyinstudydesign.Apartfromspecial

cases,likeposthocevaluationforpayment-for-results,wearenotespeciallycon-

cernedtolearnabouttheveryindividualsenrolledinthetrial.Mostexperiments

are,andshouldbe,conductedwithaneyetowhattheresultscanhelpuslearn

aboutotherpopulations.Thiscannotbedonewithoutsubstantialassumptions

aboutwhatmightandwhatmightnotberelevanttotheproductionoftheoutcome

studied.(Forexample,theICMJEguidelines(2015,14)goontosay:“Becausethe

relevanceofsuchvariablesasage,sex,orethnicityisnotalwaysknownatthetime

ofstudydesign,researchersshouldaimforinclusionofrepresentativepopulations

intoallstudytypesandataminimumprovidedescriptivedatafortheseandother

relevantdemographicvariables,”p14.)Sobothintelligentstudydesignandrespon-

siblereportingofstudyresultsinvolvesubstantialbackgroundassumptions.

Ofcourse,thisistrueforallstudies.ButRCTsrequirespecialconditionsif

theyaretobeconductedatallandespeciallyiftheyaretobeconductedsuccess-

fully—forexample,localagreements,compliantsubjects,affordableadministrators,

multipleblinding,peoplecompetenttomeasureandrecordoutcomesreliably,aset-

tingwhererandomallocationismorallyandpoliticallyacceptable,etc.—whereas

observationaldataareoftenmorereadilyandwidelyavailable.InthecaseofRCTs,

thereisdangerthatthesekindsofconsiderationshavetoomucheffect.Thisisespe-

ciallyworrisomewherethefeaturesthatthetrialsampleshouldhavearenotjusti-

fied,madeexplicit,orsubjectedtoseriouscriticalreview.

Theneedforobservationalknowledgeisoneofmanyreasonswhyitiscoun-

ter-productivetoinsistthatRCTsarethegoldstandardorthatsomecategoriesof

evidenceshouldbeprioritizedoverothers;thesestrategiesleaveushelplessinus-

ingRCTsbeyondtheiroriginalcontext.TheresultsofRCTsmustbeintegratedwith

otherknowledge,includingthepracticalwisdomofpolicymakers,iftheyaretobe

useableoutsidethecontextinwhichtheywereconstructed.

38

Contrarytomuchpracticeinmedicineaswellasineconomics,conflictsbe-

tweenRCTsandobservationalresultsneedtobeexplained,forexamplebyrefer-

encetothedifferentpopulationsineach,aprocessthatwillsometimesyieldim-

portantevidence,includingontherangeofapplicabilityoftheRCTresultsthem-

selves.WhilethevalidityoftheRCTwillsometimesprovideanunderstandingof

whytheobservationalstudyfoundadifferentanswer,thereisnobasis(orexcuse)

forthecommonpracticeofdismissingtheobservationalstudysimplybecauseit

wasnotanRCTandthereforemustbeinvalid.Itisabasictenetofscientificadvance

that,ascollectiveknowledgeadvances,newfindingsmustbeabletoexplainandbe

integratedwithpreviousresults,evenresultsthatarenowthoughttobeinvalid;

methodologicalprejudiceisnotanexplanation.

2.5Usingtheoryforgeneralization

Economistshavebeencombiningtheoryandrandomizedcontrolledtrialssincethe

earlyexperiments.OrcuttandOrcutt(1968)laidouttheinspirationfortheincome

taxtrialsusingasimplestatictheoryoflaborsupply.Accordingtothis,people

choosehowtodividetheirtimebetweenworkandleisureinanenvironmentin

whichtheyreceiveaminimumGiftheydonotwork,andwheretheyreceiveanad-

ditionalamount(1-t)wforeachhourtheywork,wherewisthewagerate,andtisa

taxrate.ThetrialsassigneddifferentcombinationsofGandttodifferenttrial

groups,sothattheresultstracedoutthelaborsupplyfunction,allowingestimation

oftheparametersofpreferences,whichcouldthenbeusedinawiderangeofpolicy

calculations,forexampletoraiserevenueatminimumutilitylosstoworkers.

Followingtheseearlytrials,therehasbeenacontinuingtraditionofusing

trialresults,togetherwiththebaselinedatacollectedforthetrial,tofitstructural

modelsthataretobeusedmoregenerally.EarlyexamplesincludeMoffitt(1979)on

laborsupplyandWise(1985)onhousing;amorerecentexampleisHeckman,Pinto

andSavelyev(2013)forthePerrypre-schoolprogram.Developmenteconomicsex-

amplesincludeAttanasio,Meghir,andSantiago(2012),Attanasioetal(2015),Todd

andWolpin(2006),Wolpin(2013),andDuflo,Hanna,andRyan(2012).These

39

structuralmodelssometimesrequireformidableauxiliaryassumptionsonfunc-

tionalformsorthedistributionsofunobservables,buttheyhavecompensatingad-

vantages,includingtheabilitytointegratetheoryandevidence,tomakeout-of-sam-

plepredictions,andtoanalyzewelfare,andtheuseofRCTevidenceallowsthere-

laxationofatleastsomeoftheassumptionsthatareneededforidentification.Inthis

way,thestructuralmodelsborrowcredibilityfromtheRCTsandinreturnhelpset

theRCTresultswithinacoherentframework.Withoutsomesuchinterpretation,the

welfareimplicationsofRCTresultscanbeproblematic;knowinghowpeopleingen-

eral(letalonejustpeopleinthetrialpopulation)respondtosomepolicyisrarely

enoughtotellwhetherornottheyaremadebetteroff,Harrison(2014a,b).Tradi-

tionalwelfareeconomicsdrawsalinkfrompreferencestobehavior,alinkthatisre-

spectedinstructuralworkbutoftenlostinthe`whatworks’literature,andwithout

whichwehavenobasisforinferringwelfarefrombehavior.Whatworksisnot

equivalenttowhatshouldbe.

Lighttouchtheorycandomuchtointerpret,toextend,andtouseRCTre-

sults.InboththeRANDHealthExperimentandnegativeincometaxexperiments,an

immediateissueconcernedthedifferencebetweenshortandlong-runresponses;

indeed,differencesbetweenimmediateandultimateeffectsoccurinawiderangeof

RCTs.BothhealthandtaxRCTsaimedtodiscoverwhatwouldhappenifconsum-

ers/workerswerepermanentlyfacedwithhigherorlowerprices/wages,butthetri-

alscouldonlyrunforalimitedperiod.Atemporarilyhightaxrateonearningsisef-

fectivelya`firesale’onleisure,sothattheexperimentprovidedanopportunityto

takeavacationandmakeuptheearningslater,anincentivethatwouldbeabsentin

apermanentscheme.Howdowegetfromtheshort-runresponsesthatcomefrom

thetrialtothelong-runresponsesthatwewanttoknow?Metcalf(1973)andAsh-

enfelter(1978)providedanswersfortheincometaxexperiments,asdidArrow

(1975)fortheRandHealthExperiment.

Arrow’sanalysisillustrateshowtousebothstructureandobservationaldata

totransportandadaptresultsfromonesettingtoanother.Hemodelsthehealthex-

perimentasatwo-periodmodelinwhichthepriceofmedicalcareisloweredinthe

firstperiodonly,andshowshowtoderivewhatwewant,whichistheresponsein

40

thefirstperiodifpriceswereloweredbythesameproportioninbothperiods.The

magnitudethatwewantisS,thecompensatedpricederivativeofmedicalcarein

period1inthefaceofidenticalincreasesin𝑝"and𝑝4inbothperiods1and2.Thisis

equalto𝑠"" + 𝑠"4,thesumofthederivativesofperiod1’sdemandwithrespectto

thetwoprices.Thetrialgivesonly𝑠"".Butifwehavepost-trialdataonmedicalser-

vicesforbothtreatmentsandcontrols,wecaninfer𝑠4",theeffectoftheexperi-

mentalpricemanipulationonpost-experimentalcare.Choicetheory,intheformof

Slutskysymmetrysaysthat𝑠"4 = 𝑠4"andsoallowsArrowtoinfer𝑠"4andthusS.He

contraststhiswithMetcalf’salternativesolution,whichmakesdifferentassump-

tions—thattwoperiodpreferencesareintertemporallyadditive,inwhichcasethe

long-runelasticitycanbeobtainedfromknowledgeoftheincomeelasticityofpost-

experimentalmedicalcare,whichwouldhavetocomefromanobservationalanaly-

sis.Thesetwoalternativeapproachesshowhowwecanchoose,basedonourwill-

ingnesstomakeassumptionsandonthedatathatwehave,asuitablecombination

of(elementaryandtransparent)theoreticalassumptionsandobservationaldatain

ordertoadaptandusetrialresults.Suchanalysiscanalsohelpdesigntheoriginal

trialbyclarifyingwhatweneedtoknowinordertousetheresultsofatemporary

treatmenttoestimatethepermanenteffectsthatweneed.Ashenfelterprovidesa

thirdsolution,notingthatthetwo-periodmodelisformallyidenticaltoatwo-person

model,sothatwecanuseinformationontwo-personlaborsupplytotellusabout

thedynamics.

Theorycanoftenallowustoreclassifyneworunknownsituationsasanalo-

goustosituationswherewealreadyhavebackgroundknowledge.Onefrequently

usefulwayofdoingthisiswhenthenewpolicycanberecastasequivalenttoa

changeinthebudgetconstraintthatrespondentsface.Theconsequencesofanew

policymaybeeasiertopredictifwecanreduceittoequivalentchangesinincome

andprices,whoseeffectsareoftenwellunderstoodandwell-studied.Toddand

Wolpin(2008)andWolpin(2013)makethispointandprovideexamples.Inthela-

borsupplycase,anincreaseinthetaxratehasthesameeffectasadecreaseinthe

wagerate,sothatwecanrelyonpreviousliteraturetopredictwhatwillhappen

whentaxratesarechanged.InthecaseofMexico’sPROGRESAconditionalcash

41

transferprogram,ToddandWolpinnotethatthesubsidiespaidtoparentsiftheir

childrengotoschoolcanbethoughtofasacombinationofreductioninchildren’s

wagesandanincreaseinparents’income,whichallowsthemtopredicttheresults

oftheconditionalcashexperimentwithlimitedadditionalassumptions.Ifthis

works,asitpartiallydoesintheiranalysis,thetrialhelpsconsolidateprevious

knowledgeandcontributestoanevolvingbodyoftheoryandempirical,including

trial,evidence.

Theprogramofthinkingaboutpolicychangesasequivalenttopriceandin-

comechangeshasalonghistoryineconomics;muchofrationalchoicetheorycanbe

sointerpreted(seeDeatonandMuellbauer(1980)formanyexamples).Whenthis

conversioniscredible,andwhenatrialonsomeapparentlyunrelatedtopiccanbe

modeledasequivalenttoachangeinpricesandincomes,andwhenwecanassume

thatpeopleindifferentsettingsrespondrelevantlysimilarlytochangesinprices

andincomes,wehaveareadymadeframeworkforincorporatingthetrialresults

intopreviousknowledge,aswellasforextendingthetrialresultsandusingthem

elsewhere.Ofcourse,alldependsonthevalidityandcredibilityofthetheory;peo-

plemaynotinfacttreatataxincreaseasadecreaseinthepriceofleisure,andbe-

havioraleconomicsisfullofexampleswhereapparentlyequivalentstimuligenerate

non-equivalentoutcomes.Theembraceofbehavioraleconomicsbymanyofthecur-

rentgenerationoftrialistsmayaccountfortheirlimitedwillingnesstouseconven-

tionalchoicetheoryinthisway.Unfortunately,behavioraleconomicsdoesnotyet

offerareplacementforthegeneralframeworkofchoicetheorythatissousefulin

thisregard.

Theorycanalsohelpwiththeproblemweraisedofdelineatingthepopula-

tiontowhichthetrialresultsimmediatelyapplyandforthinkingaboutmoving

fromthispopulationtopopulationsofinterest.Ashenfelter’s(1978)analysisis

againagoodillustrationandpredatesmuchsimilarworkinlaterliterature.Thein-

cometaxexperimentsofferedparticipationinthetrialtoarandomsampleofthe

populationofinterest.Becausetherewasnoblindingandnocompulsion,people

whowererandomizedintothetreatmentgroupwerefreetochoosetorefusetreat-

ment.Asinmanysubsequentanalyses,Ashenfeltersupposesthatpeoplechooseto

42

participateifitisintheirinteresttodoso,dependingonwhathasbecomeknownin

theRCTandInstrumentalVariablesliteratureastheirownidiosyncratic`gain.’The

simplelaborsupplymodelgivesanapproximatecondition:Ifthetreatmentin-

creasesthetaxratefrom t0 to t1 withanoffsettingincreaseinG,thenanindividual

assignedtotheexperimentalgroupwilldeclinetoparticipateif

(t1 − t0 )w0h0 + 12

s00 (t1 − t0 ) > G1 −G0 (5)

wheresubscript1referstothetreatmentsituation,0tothecontrol,ℎ&ishours

worked,and𝑠&&isthe(negative)utility-constantresponseofhoursworkedtothe

taxrate.Ifthereisnosubstitution,thesecondtermontheleft-handsideiszero,and

peoplewillaccepttreatmentiftheincreaseinGmorethanmakesupforthein-

creasesintaxespayable,the`breakeven’condition.Inconsequence,thosewith

higherearningsarelesslikelytoaccepttreatment.Somebetter-offpeoplewithhigh

substitutioneffectswillalsoaccepttreatmentiftheopportunitytobuymorecheap

leisureissufficiententicement.

Theselectiveacceptanceoftreatmentlimitstheanalyst’sabilitytolearn

aboutthebetter-offorlow-substitutionpeoplewhodeclinetreatmentbutwho

wouldhavetoacceptitifthepolicywereimplemented.Boththeintention-to-treat

estimatorandtheàstreated’estimatorthatcomparesthetreatedandtheun-

treatedareaffected,notjustbythelaborsupplyeffectsthatthetrialisdesignedto

induce,butbythekindofselectioneffectsthatrandomizationisdesignedtoelimi-

nate.Ofcourse,theanalysisthatleadsto(5)canperhapshelpussaysomething

aboutthisandhelpusadjustthetrialestimatesbacktowhatwewouldliketoknow.

Yetthisisnoeasymatterbecauseselectiondepends,notonlyonobservables,such

aspre-experimentalearningsandhoursworked,buton(muchhardertoobserve)

laborsupplyresponsesthatlikelyvaryacrossindividuals.ParaphrasingAshenfelter,

wecannotestimatetheeffectsofapermanentcompulsorynegativeincometaxpro-

gramfromatransitoryvoluntarytrialwithoutstrongassumptionsoradditionalevi-

dence.

Muchofthemodernliterature,forexampleontrainingprograms,wrestles

withtheissueofexactlywhoisrepresentedbytheRCTresults,includingnotonly

43

whoparticipatesinthefirstplacebutwholeavesbeforethetrialiscompleted(see

againHeckman,LalondeandSmith(1999)).Asintheexamplesabove,modelingat-

tritionwithinatrialcanyieldestimatesofbehavioralresponsesthatcanbeusedto

transportthefindingstoothersettings(seeChanandHamilton(2006),Chassang,

PadróIMiguel,andSnowberg(2012)andChassangetal(2015)).Whenpeopleare

allowedtorejecttheirrandomlyassignedtreatmentaccordingtotheirown(realor

perceived)advantage,ortodropoutofatrialonanestimateofthebenefitsand

costsfromdoingso,wehavecomealongwayawayfromtherandomallocationin

thestandardconceptionofarandomizedcontrolledtrial.Moreover,theabsenceof

blindingiscommoninsocialandeconomicRCTs,andwhiletherearetrials,suchas

welfaretrials,thateffectivelycompelpeopletoaccepttheirassignments,andsome

wherethetreatmentisgenerousenoughtodoso,therearetrialswheresubjects

havemuchfreedomand,inthosecasesitislessthanobvioustouswhatrole,ifany,

randomizationplaysinwarrantingtheresults.

2.6Scalingup:usingtheaverageforpopulations

ManyRCTsaresmall-scaleandlocal,forexampleinafewschools,clinics,orfarms

inaparticulargeographic,cultural,socio-economicsetting.Ifsuccessfulaccording

toacost-effectivenesscriterion,forexample,itisacandidateforscaling-up,apply-

ingthesameinterventionforamuchlargerarea,oftenawholecountry,orsome-

timesevenbeyond,aswhensometreatmentisconsideredforallrelevantWorld

Bankprojects.Thefactthattheinterventionmightworkdifferentlyatscalehaslong

beennotedintheeconomicsliterature,e.g.GarfinkelandManski(1992),Heckman

(1992),andMoffitt(1992),andisrecognizedintherecentreviewbyBanerjeeand

Duflo(2009).Wewantheretoemphasizethepervasivenessofsucheffectsaswell

astonoteagainthatthisshouldnotbetakenasanargumentagainstusingRCTsbut

onlyagainsttheideathateffectsatscalearelikelytobethesameasinthetrial.

Anexampleofwhatareoftencalled`generalequilibriumeffects’comesfrom

agriculture.SupposeanRCTdemonstratesthatinthestudypopulationanewwayof

usingfertilizerhadasubstantialpositiveeffecton,say,cocoayields,sothatfarmers

whousedthenewmethodssawincreasesinproductionandinincomescompared

tothoseinthecontrolgroup.Iftheprocedureisscaleduptothewholecountry,or

44

toallcocoafarmersworldwide,thepricewilldrop,andifthedemandforcocoais

priceinelastic—asisusuallythoughttobethecase,atleastintheshortrun—cocoa

farmers’incomeswillfall.Indeed,theconventionalwisdomformanycropsisthat

farmersdobestwhentheharvestissmall,notlarge.Ofcourse,theseconsiderations

mightnotbedecisiveindecidingwhetherornottopromotetheinnovation,and

theremaystillbelongtermgainsif,forexample,somefarmersfindsomethingbet-

tertodothangrowingcocoa.But,inthiscase,thescaled-upeffectisoppositeinsign

tothetrialeffect.Theproblemisnotwiththetrialresults,whichcanbeusefullyin-

corporatedintoamorecomprehensivemarketmodelthatincorporatesthere-

sponsesestimatedbythetrial.Theproblemisonlyifweassumethattheaggregate

looksliketheindividual.Thatotheringredientsoftheaggregatemodelmustcome

fromobservationalstudiesshouldnotbeacriticism,evenforthosewhofavorRCTs;

itissimplythepriceofdoingseriousanalysis.

Therearemanypossibleinterventionsthataltersupplyordemandwhoseef-

fect,inaggregate,willchangeapriceorawagethatisheldconstantintheoriginal

RCT.Educationwillchangethesuppliesofskilledversusunskilledlabor,withimpli-

cationsforrelativewagerates.Conditionalcashtransfersincreasethedemandfor

(andperhapssupplyof)schoolsandclinics,whichwillchangepricesorwaiting

lines,orboth.Thereareinteractionsbetweenpeoplethatwilloperateonlyatscale.

Givingonechildavouchertogotoprivateschoolmightimproveherfuture,butdo-

ingsoforeveryonecandecreasethequalityofeducationforthosechildrenwhoare

leftinthepublicschools(seethecontrastingstudiesofAngristetal(2002)and

HsiehandUrquiola(2002)).Educationalortrainingprogramsmaybenefitthose

whoaretreatedbutharmthoseleftbehind;Créponetal(2014)recognizetheissue

andshowhowtoadaptanRCTtodealwithit.

Scalingupcanalsodisturbthepoliticalequilibrium.Anexploitativegovern-

mentmaynotallowthemasstransferofmoneyfromabroadtoapowerlessseg-

mentofthepopulation,thoughitmaypermitasmall-scaleRCTofcashtransfers,

perhapseveninthehopethatalarge-scaleimplementationwillyieldopportunities

forpredation.ProvisionofhealthcarebyforeignNGOsmaybesuccessfulintrials,

buthaveunintendednegativeconsequencestoscalebecauseofgeneralequilibrium

45

effectsonthesupplyofhealthcarepersonnel,orbecauseitdisturbsthenatureof

thecontractbetweenthepeopleandagovernmentthatisusingtaxrevenuetopro-

videservices.InIndia,thegovernmentspendslargesumsonfoodsubsidiesthrough

asystem(thePDS)thatisbothcorruptandinefficient,withmuchofthegrainthatis

procuredfailingtofinditswaytotheintendedbeneficiaries.LocalizedRCTson

whetherornotfamiliesarebetteroffwithcashtransfersarenotinformativeabout

howpoliticianswouldchangetheamountofthetransferiffacedwithunanticipated

inflation,andatleastasimportant,whetherthegovernmentcouldcutprocurement

fromrelativelywealthyandpoliticallypowerfulfarmers.Withoutapoliticaland

generalequilibriumanalysis,itisimpossibletothinkabouttheeffectsofreplacing

foodsubsidieswithcashtransfers(seee.g.Basu(2010)).

Eveninmedicine,wherebiologicalinteractionsbetweenpeoplearelesscom-

monthanaresocialinteractionsinsocialscience,interactionscanbeimportant.In-

fectiousdiseasesareonewell-knownexample,whereimmunizationprogramsaf-

fectthedynamicsofdiseasetransmissionthroughherdimmunity(seeFineand

Clarkson(1986)andManski(2013,52)).Thesocialandeconomicsettingalsoaf-

fectshowdrugsareactuallyusedandthesameissuescanarise;thedistinctionbe-

tweenefficacyandeffectivenessinclinicaltrialsisinpartrecognitionofthefact.

2.7Drillingdown:usingtheaverageforindividuals

Justasthereareissueswithscaling-up,itisnotobvioushowtousetheresultsfrom

RCTsatthelevelofindividualunits,evenindividualunitsthatwereincludedinthe

trial.Awell-conductedRCTdeliversanATEforthetrialpopulationbut,ingeneral,

thataveragedoesnotapplytoeveryone.Itisnottrue,forexample,asarguedinthe

AmericanMedicalAssociation’s“Users’guidetothemedicalliterature”that“ifthe

patientwouldhavebeenenrolledinthestudyhadshebeenthere—thatisshemeets

alloftheinclusioncriteriaanddoesn’tviolateanyoftheexclusioncriteria—thereis

littlequestionthattheresultsareapplicable”(seeGuyattetal(1994,60)).Even

moremisleadingaretheoften-heardstatementsthatanRCTwithanaveragetreat-

menteffectinsignificantlydifferentfromzerohasshownthatthetreatmentworks

fornoone.

46

Theseissuesarefamiliartophysicianspracticingevidence-basedmedicine

whoseguidelinesrequire“integratingindividualclinicalexpertisewiththebest

availableexternalclinicalevidencefromsystematicresearch,”Sackettetal(1996,

71).Exactlywhatthismeansisunclear;physiciansknowmuchmoreabouttheirpa-

tientsthanisallowedforintheATEfromtheRCT(though,onceagain,stratification

inthetrialislikelytobehelpful)andtheyoftenhaveintuitiveexpertisefromlong

practicethatcanhelpthemidentifyfeaturesinaparticularpatientthatmayinflu-

encetheeffectivenessofagiventreatmentforthatpatient.Butthereisanoddbal-

ancestruckhere.Thesejudgmentsaredeemedadmissibleindiscussionwiththein-

dividualpatient,buttheydon’tadduptoevidencetobemadepubliclyavailable,

withtheusualcautionsaboutcredibility,bythestandardsadoptedbymostEBM

sites.Itisalsotruethatphysicianscanhaveprejudicesand`knowledge’thatmight

beanythingbut.Clearly,therearesituationswhereforcingpractitionerstofollow

theaveragewilldobetter,evenforindividualpatients,andotherswheretheoppo-

siteistrue,KahnemanandKlein(2009).

Whetherornotaveragesareusefultoindividualsraisesthesameissue

throughoutsocialscienceresearch.Imaginetwoschools,StJoseph’sandSt.Mary’s,

bothofwhichwereincludedinanRCTofaclassroominnovation.Theinnovationis

successfulonaverage,butshouldtheschoolsadoptit?ShouldStMary’sbeinflu-

encedbyapreviousattemptinStJoseph’sthatwasjudgedafailure?Manywould

dismissthisexperienceasanecdotalandaskhowStJoseph’scouldhaveknownthat

itwasafailurewithoutbenefitof`rigorous’evidence.YetifStMary’sislikeStJo-

seph’s,withasimilarmixofpupils,asimilarcurriculum,andsimilaracademic

standing,mightnotStJoseph’sexperiencebemorerelevanttowhatmighthappen

atStMary’sthanisthepositiveaveragefromtheRCT?Andmightitnotbeagood

ideafortheteachersandgovernorsofStMary’stogotoStJoseph’sandfindout

whathappenedandwhy?Theymaybeabletoobservethemechanismofthefailure,

ifsuchitwas,andfigureoutwhetherthesameproblemswouldapplyforthem,or

whethertheymightbeabletoadapttheinnovationtomakeitworkforthem,per-

hapsevenmoresuccessfullythanthepositiveaverageinthetrial.

47

Onceagain,thesequestionsareunlikelytobeeasilyansweredinpractice;

but,aswithtransportability,thereisnoseriousalternativetotrying.Assumingthat

theaverageworksforyouwilloftenbewrong,anditwillatleastsometimesbepos-

sibletodobetter.Asinthemedicalcase,theadvicetoindividualschoolsoftenlacks

specificity.Forexample,theU.S.InstituteofEducationScienceshasprovideda

“user-friendly”guidetopracticessupportedbyrigorousevidence,USDepartmentof

Education(2003).Theadvice,whichissimilartorecommendationsindevelopment

economics,isthattheinterventionbedemonstratedeffectivethroughwell-designed

RCTsinmorethanonesiteandthat“thetrialsshoulddemonstratetheinterven-

tion’seffectivenessinschoolsettingssimilartoyours”(2003,17).Nooperational

definitionof“similar”isprovided.

2.8Examplesandillustrationsfromeconomics

OurargumentsinthisSectionshouldnotbecontroversial,yetwebelievethatthey

representanapproachthatisdifferentfrommostcurrentpractice.Todocument

thisandtofilloutthearguments,weprovidesomeexamples.Whiletheseareocca-

sionallycritical,ourpurposeisconstructive;indeed,webelievethatmisunderstand-

ingsabouthowtouseRCTshaveartificiallylimitedtheirusefulness,aswellasalien-

atedsomewhowouldotherwiseusethem.

Conditionalcashtransfers(CCTs)areinterventionsthathavebeentestedus-

ingRCTs(andotherRCT-likemethods)andareoftencitedasaleadingexampleof

howanevaluationwithstronginternalvalidityleadstoarapidspreadofthepolicy,

e.g.AngristandPischke(2010)amongmanyothers.Thinkthroughthecausalchain

thatisrequiredforCCTstobesuccessful:Peoplemustlikemoney,theymustlike

(ordonotobjecttoomuch)totheirchildrenbeingeducatedandvaccinated,there

mustexistschoolsandclinicsthatarecloseenoughandwellenoughstaffedtodo

theirjob,andthegovernmentoragencythatisrunningtheschememustcareabout

thewellbeingoffamiliesandtheirchildren.Thatsuchconditionsholdinawide

rangeof(althoughcertainlynotall)countriesmakesitunsurprisingthatCCTs

`work’inmanyreplications,thoughtheycertainlywillnotworkinplaceswherethe

schoolsandclinicsdonotexist,e.g.Levy(2001),norinplaceswherepeople

stronglyopposeeducationorvaccination.

48

Similarly,giventhatthesupportfactorswilloperatewithdifferentstrengths

andeffectivenessindifferentplaces,itisalsonotsurprisingthatthesizeoftheATE

differsfromplacetoplace;forexample,Vivalt’sAidGradewebsitelists29estimates

fromarangeofcountriesofthestandardized(dividedbylocalstandarddeviationof

theoutcome)effectsofCCTsonschoolattendance;allbutfourshowtheexpected

positiveeffect,andtherangerunsfrom–8to+38percentagepoints,Vivalt(2015).

Eveninthisleadingcase,wherewemightreasonablyconcludethatCCTs`work’in

gettingchildrenintoschool,itwouldbehardtocalculatecrediblecost-effectiveness

numbersortocometoageneralconclusionaboutwhetherCCTsaremoreorless

costeffectivethanotherpossiblepolicies.Bothcostsandeffectsizescanbeex-

pectedtodifferinnewsettings,justastheyhaveinobservedones,makingthese

predictionsdifficult.

Therangeofestimatesillustratesthatthesimpleviewofexternalvalidity—

thattheATEtransportsfromoneplacetoanother—isnotreasonable.AidGrade

usesstandardizedmeasuresofeffectsizedividedbystandarddeviationofoutcome

atbaseline,asdoesthemajormulti-countrystudybyBanerjeeetal(2015).Butwe

mightprefermeasuresthathaveaneconomicinterpretation,suchasadditional

monthsofschoolingper$100spent(forexampleifadonoristryingtodecide

wheretospend,seebelow).Nutritionmightbemeasuredbyheight,orbythelogof

height.EveniftheATEbyonemeasurecarriesacross,itwillonlydosousingan-

othermeasureiftherelationshipbetweenthetwomeasuresisthesameinbothsit-

uations.Thisisexactlythesortofthingthataformalanalysisoftransportability

forcesustothinkabout.(NotealsothattheATEintheoriginalRCTcandifferde-

pendingonwhethertheoutcomeismeasuredinlevelsorinlogs;itiseasytocon-

structexampleswherethetwoATEshavedifferentsigns.)

Muchoftheeconomicsliterature,likethemedicalliterature,workswiththe

viewofexternalvaliditythat,unlessthereisevidencetothecontrary,thedirection

andsizeoftreatmenteffectscanbetransportedfromoneplacetoanother.TheJ-

PALwebsitereportsitsfindingsunderageneralheadingofpolicyrelevance,subdi-

videdbyaselectionoftopics.Undereachtopic,thereisalistofrelevantRCTsfrom

arangeofdifferentsettingsaroundtheworld.Theseareconvenientlyconverted

49

intoacommoncost-effectivenessmeasuresothat,forexample,under“education”,

subhead“studentparticipation”,therearefourstudiesfromAfrica:oninforming

parentsaboutthereturnstoeducationinMadagascar,ondeworming,onschooluni-

forms,andonmeritscholarships,allfromKenya.Theunitsofmeasurementaread-

ditionalyearsofstudenteducationper$100,andamongthesefourstudies,theav-

erageeffectsofspending$100are20.7years,13.9years,0.71yearsand0.27years

respectively.(Notethatthisisadifferent—andmuchsuperior—standardization

fromtheeffectsizestandardizationdiscussedbelow.)

Whatcanweconcludefromsuchcomparisons?Foraphilanthropicdonorin-

terestedineducation,andifmarginalandaverageeffectsarethesame,theymight

indicatethatthebestplacetodevoteamarginaldollarisinMadagascar,whereit

wouldbeusedtoinformparentsaboutthevalueofeducation.Thisiscertainlyuse-

ful,butitisnotasusefulasstatementsthatinformationordewormingprograms

areeverywheremorecost-effectivethanprogramsinvolvingschooluniformsor

scholarships,orifnoteverywhere,atleastoversomedomain,anditisthesesecond

kindsofcomparisonthatwouldgenuinelyfulfillthepromiseof`findingoutwhat

works.’Butsuchcomparisonsonlymakesenseifwecantransporttheresultsfrom

oneplacetoanother,iftheKenyanresultsalsoholdinMadagascar,Mali,orNa-

mibia,orsomeotherlistofplaces.J-PAL’smanualforcost-effectiveness,Dhaliwalet

al(2012)explainsin(entirelyappropriate)detailhowtohandlevariationincosts

acrosssites,notingvariablefactorssuchaspopulationdensity,prices,exchange

rates,discountrates,inflation,andbulkdiscounts.Butitgivesshortshrifttocross-

sitevariationinthesizeofATEs,whichplayanequalpartinthecalculationsofcost

effectiveness.Themanualbrieflynotesthatdiminishingreturns(orthelast-mile

problem)mightbeimportantintheorybutarguesthatthebaselinelevelsofout-

comesarelikelytobesimilarinthepilotandreplicationareas,sothattheATEcan

besafelytransportedasis.Allofthislacksajustificationfortransportability,some

understandingofwhenresultstransport,whentheydonot,orbetterstill,howthey

shouldbemodifiedtomakethemtransportable.

50

OneofthelargestandmosttechnicallyimpressiveofthedevelopmentRCTs

isbyBanerjeeetal(2015),whichtestsa“graduation”programdesignedtoperma-

nentlyliftextremelypoorpeoplefrompovertybyprovidingthemwithagiftofa

productiveasset(fromguinea-pigs,(regular-)pigs,sheep,goats,orchickensde-

pendingonlocale),trainingandsupport,andlife-skillscoaching,aswellassupport

forconsumption,saving,andhealthservices.Theideaisthatthispackageofaidcan

helppeoplebreakoutofpovertytrapsinawaythatwouldnotbepossiblewithone

interventionatatime.ComparableversionsoftheprogramweretestedinEthiopia,

Ghana,Honduras,India,Pakistan,andPeruand,exceptingHonduras(wherethe

chickensdied)findlargelypositiveandpersistenteffects—withsimilar(standard-

ized)effectsizes—forarangeofoutcomes(economic,mentalandphysicalhealth,

andfemaleempowerment).Onesiteapart,essentiallyeveryoneacceptedtheiras-

signment.ReplicationofpositiveATEsoversuchawiderangeofplacescertainly

providesproofofconceptforsuchascheme.YetBauchet,Morduch,andRavi(2015)

failtoreplicatetheresultinSouthIndia,wherethecontrolgroupgotaccesstomuch

thesamebenefits,whatHeckman,Hohman,andSmith(2000)call“substitution

bias”.Evenso,theresultsareimportantbecause,althoughthereisalongstanding

interestinpovertytraps,manyeconomistshavebeenskepticaloftheirexistenceor

thattheycouldbesprungbysuchaid-basedpolicies.Inthissense,thestudyisan

importantcontributiontothetheoryofeconomicdevelopment;ittestsatheoretical

propositionandwill(orshould)changemindsaboutit.

Anumberofdifficultiesremain.Astheauthorsnote,suchtrialscannottellus

whichcomponentofthetreatmentaccountedfortheresults,orwhichmightbedis-

pensable—amuchmoreexpensivemultifactorialtrialwouldberequired—thoughit

seemslikelyinpracticethatthecostliestcomponent—therepeatedvisitsfortrain-

ingandsupport—islikelytobethefirsttobecutbycash-strappedpoliticiansorad-

ministrators.Andasnoted,itisnotclearwhatshouldcountas(simple)replication

ininternationalcomparisons;itishardtothinkoftheusesofstandardizedeffect

sizes,excepttodocumentthateffectsexisteverywhereandthattheyaresimilarly

largerelativetolocalvariationinsuchthings.

51

Theeffectsize—theATEstandardizedbybeingexpressedinnumbersof

standarddeviationsoftheoriginaloutcome—thoughconvenientlydimensionless,

haslittletorecommendit.AswithmuchofRCTpractice,itstripsoutanyeconomic

content—noratesofreturn,orbenefitsminuscosts—anditremovesanydiscipline

onwhatisbeingcompared.Applesandorangesbecomeimmediatelycomparable,

asdotreatmentswhoseinclusioninameta-analysisislimitedonlybytheimagina-

tionoftheanalystsinclaimingsimilarity.Trainingprogramsforphysicalfitnesscan

bepooledwithtrainingprogramsforwelding,ormarketing,orevenobedience

trainingforpets.Inpsychology,wheretheconceptoriginated,thisresultsinendless

disputesaboutwhatshouldandshouldnotbepooledinameta-analysis.Goldberger

andManski(1995,769)notethat“standardizationaccomplishesnothingexceptto

givequantitiesinnoncomparableunitsthesuperficialappearanceofbeingincom-

parableunits.Thisaccomplishmentisworsethanuseless—ityieldsmisleadingin-

ferences.”Beyondthat,Simpson(2017)notesthatrestrictionsonthetrialsample—

oftengoodpracticetoreducebackgroundnoiseandtohelpdetectaneffect—will

reducethebaselinestandarddeviationandinflatetheeffectsize.Moregenerally,ef-

fectsizesareopentomanipulationbyexclusionrules.Itmakesnosensetoclaim

replicabilityonthebasisofeffectsizes,letalonetousethemtorankprojects.Effect

sizesareirrelevantforpolicymaking.

Thegraduationstudycanbetakenastheclosesttofulfillingthe`findingout

whatworks’aimoftheRCTmovementindevelopment.Yetitissilentonperhaps

thecrucialaspectforpolicy,whichisthatthetrialwasruninpartnershipwith

NGOs,whereaswhatwewouldliketoknowiswhetheritcouldbereplicatedbygov-

ernments,includingthosegovernmentsthatareincapableofgettingdoctors,

nurses,andteacherstoshowuptoclinicsorschools,Chaudhuryetal(2005),

Banerjee,DeatonandDuflo(2004),orofregulatingthequalityofmedicalcareinei-

therthepublicorprivatesectors,Filmer,HammerandPritchett(2000)orDasand

Hammer(2005).Infact,wealreadyknowagreatdealabout`whatworks.’Vaccina-

tionswork,maternalandchildhealthcareserviceswork,andclassroomteaching

works.Yetknowingthisdoesnotgetthosethingsdone.Addinganotherprogram

thatworksunderidealconditionsisusefulonlywhereconditionsareinfactideal,in

52

whichcaseitwouldlikelybeunnecessary.Findingoutwhatworksisnotthemagic

keytoeconomicdevelopment.Technicalknowledge,thoughalwaysworthhaving,

requiressuitableinstitutionsandsuitableincentivesifitistodoanygood.

Asimilarpointisdocumentedinthecontrastbetweenasuccessfultrialthat

usedcamerasandthreatsofwagereductionstoincentivizeattendanceofteachers

inschoolsrunbyanNGOinRajasthaninIndia,Duflo,Hanna,andRyan(2012),and

thesubsequentfailureofafollow-upprograminthesamestatetotacklemassab-

senteeismofhealthworkers,Banerjee,Duflo,andGlennerster(2008).Inthe

schools,thecamerasandtimekeepingworkedasintended,andteacherattendance

increased.Intheclinics,therewasashort-runeffectonnurseattendance,butitwas

quicklyeliminated.(Theabilityofagentseventuallytounderminepoliciesthatare

initiallyeffectiveiscommonenoughandnoteasilyhandledwithinanRCT.)Inboth

trials,therewereincentivestoimproveattendance,andtherewereincentivesto

findawaytosabotagethemonitoringandrestoreworkerstotheiraccustomedpo-

sitions;theforceoftheseincentivesisa`high-level’cause,likegravity,ortheprinci-

pleofthelever,thatworksinmuchthesamewayeverywhere.Fortheclinics,some

sabotagewasdirect—thesmashingofcameras—andsomewassubtler,whengov-

ernmentsupervisorsprovidedofficial,thoughspecious,reasonsformissingwork.

WecanonlyconjecturewhythecausalitywasswitchedinthemovefromNGOto

government;wesuspectthatworkingforahighly-respectedlocalNGOisadifferent

contractfromworkingforthegovernment,wherenotshowingupforworkis

widely(ifinformally)understoodtobepartofthedeal.Theincentiveleverworks

whenitiswiredupright,aswiththeNGOs,butnotwhenthewiringcutsitout,as

withthegovernment.Knowing`whatworks’inthesenseofthetreatmenteffecton

thetrialpopulationisoflimitedvaluewithoutunderstandingthepoliticalandinsti-

tutionalenvironmentinwhichitisset.Thisunderlinestheneedtounderstandthe

underlyingsocial,economic,andculturalstructures—includingtheincentivesand

agencyproblemsthatinhibitservicedelivery—thatarerequiredtosupportthe

causalpathwaysthatweshouldliketoseeatwork.

Trialsineconomicdevelopmentoftentakeplaceinartificialenvironments.

Drèze(2016)notes,basedonextensiveexperienceinIndia,“whenaforeignagency

53

comesinwithitsheavybootsandsuitcasesofdollarstoadministera`treatment,’

whetherthroughalocalNGOorgovernmentorwhatever,thereisalotgoingon

otherthanthetreatment.”Thereisalsothesuspicionthatatreatmentthatworks

doessobecauseofthepresenceofthe`treators,’oftenfromabroad,andmaynot

workdosowiththepeoplewhowillworkitinpractice.

Thereisalsomuchtobelearnedfrommanyyearsofeconomictrialsinthe

UnitedStates,particularlyfromtheworkofMDRC,fromtheearlyincometaxtrials,

aswellasfromtheRandHealthExperiment.Followingtheincometaxtrials,MDRC

hasrunmanyrandomizedtrialssincethe1970s,mostlyfortheFederalgovernment

butalsoforindividualstatesandforCanada(seethethoroughandinformativeac-

countbyGueronandRolston(2011)forthefactualinformationunderlyingthefol-

lowingdiscussion).MDRC’sprogram,likethatofJPALindevelopment,isintended

tofindout`whatworks’inthestateandfederalwelfareprograms.Theseprograms

areconditionalcashtransfersinwhichpoorrecipientsaregivencashprovidedthey

satisfycertainconditionssuchasworkrequirementsortraining,whichareoften

thesubjectofthetrial.Whatarethebenefitsandcostsofvariousalternatives,both

totherecipientsandtothelocalandfederaltaxpayers?Alloftheseprogramsare

deeplypoliticized,withsharplydifferentviewsoverbothfactsanddesirability.

Manyengagedinthesedisputesfeelcertainofwhatshouldbedoneandwhatits

consequenceswillbesothat,bytheirlights,controlgroupsareunethicalbecause

theydeprivesomepeopleofwhattheadvocates`know’willbecertainbenefits.

Giventhis,itisperhapssurprisingthatRCTshavebecometheacceptednormfor

thiskindofpolicyevaluationintheUS.

Thereasonsowemuchtopoliticalinstitutions,aswellastothecommonbe-

lief,exploredinSection1,thatRCTscanrevealthetruth.AttheFederallevel,pro-

spectivepoliciesarevettedbythenon-partisanCongressionalBudgetOffice(CBO),

whichmakesitsownestimatesofthebudgetaryimplicationsoftheprogram.Ideo-

logueswhoseprogramsarescoredpoorlybytheCBOhaveanincentivetosupport

anRCT,nottoconvincethemselves,buttoconvinceopponents;onceagain,RCTs

arevaluablewhenyouropponentsdonotshareyourprior.Andcontrolgroupsare

54

easiertoputinplacewhenthereareinsufficientfundstocoverthewholepopula-

tion.TherewasalsoawidespreadandlargelyuncriticalbeliefthatRCTsgivethe

rightanswer,atleastforthebudgetaryimplications,which,ratherthanthewellbe-

ingoftherecipients,wereoftentheprimaryconcern;notethatallofthesetrialsare

onpoorpeoplebyrichpeoplewhoaretypicallymoreconcernedwithcostthanwith

thewellbeingofthepoor,Greenberg,SchroderandOnstott(1999).MDRCstrials

couldthereforebeeffectivedispute-reconciliationmechanismsbothforthosewho

sawtheneedforevidenceandforthosewhodidnot(exceptinstrumentally).The

outcomeherefitswithour`publichealth’case;whatthepoliticiansneedtoknowis

nottheoutcomesforindividuals,orevenhowtheoutcomesinonestatemight

transporttoanother,buttheaveragebudgetarycostinaspecificplace,something

thatagoodRCTconductedonarepresentativesampleofthetargetpopulationcan

deliver,atleastintheabsenceofgeneralequilibriumeffects,timingeffects,etc.

TheseRCTsbyMDRCandothercontractorshavedemonstratedboththefea-

sibilityoflarge-scalesocialtrialsincludingthepossibilityofrandomizationinthese

settings(wheremanyparticipantswerehostiletotheidea),aswellastheiruseful-

nesstopolicymakers.Theyalsoseemtohavechangedbeliefs,forexampleinfavor

ofthedesirabilityofworkrequirementsasaconditionofwelfare,evenamongmany

originallyopposed.Therearealsolimitations;thetrialsappeartohavehadatbesta

minorinfluenceonscientificthinkingaboutbehaviorinlabormarketsand,inthat

sense,theyaremoreabout`plumbing’thanscience,Duflo(2017).Theresultsof

similarprogramshaveoftenbeendifferentacrossdifferentsites,andtherehasto

datebeennofirmunderstandingofwhy;indeed,thetrialsarenotdesignedtore-

vealthis,Moffitt(2004).Finally,andperhapscruciallyforthepotentialcontribution

toeconomicscience,therehasbeenlittlesuccessinunderstandingeithertheunder-

lyingstructuresorchainsofcausation,inspiteofadeterminedeffortfromthebe-

ginningtoopentheblackboxes.

TheRANDhealthexperiment,Manningetal(1975a,b),providesadifferent

butequallyinstructivestoryifonlybecauseitsresultshavepermeatedtheacademic

andpolicydiscussionsabouthealthcareeversince.Itwasoriginallydesignedtotest

whethermoregenerousinsurancecausespeopletousemoremedicalcareand,ifso,

55

byhowmuch.Theincentiveeffectsarehardlyindoubttoday;theimmortalityofthe

studycomesratherfromthefactthatitsmulti-arm(responsesurface)designal-

lowedthecalculationofanelasticityforthestudypopulation,thatmedicalexpendi-

turesdecreasedby–0.1to–0.2percentforeverypercentageincreaseinthecopay-

ment.AccordingtoAron-Dine,Einav,andFinkelstein(2013),itisthisdimensionless

andthusapparentlytransportablenumberthathasbeenusedeversincetodiscuss

thedesignofhealthcarepolicy;theelasticityhascometobetreatedasauniversal

constant.Ironically,theyarguethattheestimatecannotbereplicatedinrecentstud-

ies,anditisevenunclearthatitisfirmlybasedontheoriginalevidence.Thispoints,

onceagain,tothecentralimportanceoftransportabilityfortheusefulness,both

shortandlongterm,ofatrial.Here,thesimpledirecttransportabilityoftheresult

seemstohavebeenlargelyillusorythough,aswehaveargued,thisdoesnotmean

thatmorecomplexconstructionsbasedontheresultsofthetrialwouldnothave

donebetter.

Conclusions

Itisusefultorespondtotwochallengesthatareoftenputtous,onefrommedicine

andonefromsocialscience.Themedicalchallengeis,“Ifyouarebeingprescribeda

newdrug,wouldn’tyouwantittohavebeenthroughanRCT?”Thesecond(related)

challengeis,“OK,youhavehighlightedsomeoftheproblemswithRCTs,butother

methodshaveallofthoseproblems,plusproblemsoftheirown.”Webelievethatwe

haveansweredbothoftheseinthepaperbutthatitishelpfultorecapitulate.

Themedicalchallengeisaboutyou,aspecificperson,sothatoneanswer

wouldbethatyoumaybedifferentfromtheaverage,andyouareentitledtoand

oughttoaskabouttheoryandevidenceaboutwhetheritwillworkforyou.This

wouldbeintheformofaconversationbetweenyouandyourphysician,whoknows

alotaboutyou.Youwouldwanttoknowhowthisclassofdrugissupposedtowork

andwhetherthatmechanismislikelytoworkforyou.Isthereanyevidencefrom

otherpatients,especiallypatientslikeyou,withyourconditionandinyourcircum-

stances,oraretheresuggestionsfromtheory?Whatscientificworkhasbeendone

toidentifywhatsupportfactorsmatterforsuccesswiththiskindofdrug?Iftheonly

56

informationavailableisfromthepharmaceuticalcompany,anRCTmightseemlike

agoodidea.Buteventhen,andalthoughknowledgeofthemeaneffectamongsome

groupiscertainlyofvalue,youmightgivelittleweighttoanRCTwhoseparticipants

areselectedinthewaytheywereselectedinthetrial,orwherethereislittleinfor-

mationaboutwhethertheoutcomesarerelevanttoyou.Recallthatmanynewdrugs

areprescribed‘off-label’,forapurposeforwhichtheywerenottested,andbeyond

that,thatmanynewdrugsareadministeredintheabsenceofanRCTbecauseyou

areactuallybeingenrolledinone.Forpatientswhoselastchanceistoparticipatein

atrialofsomenewdrug,thisisexactlythesortofconversationyoushouldhave

withyourphysician(followedbyoneaskinghertorevealwhetheryouareintheac-

tivearm,sothatyoucanswitchifnot),andsuchconversationsneedtotakeplace

forallprescriptionsthatarenewtoyou.Intheseconversations,theresultsofan

RCTmayhavemarginalvalue.Ifyourphysiciantellsyouthatsheendorsesevidence-

basedmedicine,andthatthedrugwillmostlikelyworkforyoubecauseanRCThas

shownthat‘itworks’,itistimetofindanewphysician.

Thesecondchallengeclaimsthatothermethodsarealwaysdominatedbyan

RCT.Thiskindofchallengeisnotwell-formulated.Dominatedforansweringwhat

question,forwhatpurposes?ThechiefadvantageoftheRCTisthatitcan,ifwell-

conducted,giveanunbiasedestimateofanATEinastudy(trial)sampleandthus

provideevidencethatthetreatmentcausedtheoutcomeinsomeindividualsinthat

sample.Ifthatiswhatyouwanttoknowandthere’slittlebackgroundknowledge

availableandthepriceisright,thenanRCTmaybethebestchoice.Astoother

questions,theRCTresultcanbepart—butusuallyonlyasmallpart—ofthedefense

of(a)ageneralclaim,(b)aclaimthatthetreatmentwillcausethatoutcomefor

someotherindividuals,oreven(c)aclaimaboutwhattheATEwillbeinsomeother

population.Buttheydolittlefortheseenterprisesontheirown.Whatisthebest

overallpackageofresearchworkfortacklingthesequestions—mostcost-effective

andmostlikelytoproducecorrectresults—dependsonwhatweknowandwhat

differentkindsofresearchwillcost.

57

ThereareexampleswhereanRCTdoesbetterthananobservationalstudy,

andtheseseemtobethecasesthatcometomindfordefendersofRCTs.Forexam-

ple,regressionsofwhetherpeoplewhogetMedicaiddobetterorworsethanpeople

withprivateinsurancearevitiatedbygrossdifferencesintheothercharacteristics

ofthetwopopulations.ButitisalongstepfromthattosayingthatanRCTcansolve

theproblem,letalonethatitistheonlywaytosolvetheproblem.Itwillnotonlybe

expensivepersubject,butitcanonlyenrollaselectedandalmostcertainlyunrepre-

sentativestudysample,itcanberunonlytemporarily,andtherecruitmenttothe

experimentwillnecessarilybedifferentfromrecruitmentinaschemethatisper-

manentandopentothefullqualifiedpopulation.Noneofthisremovestheblem-

ishesoftheobservationalstudy,buttherearemanymethodsofmitigatingitsdiffi-

culties,sothat,intheend,anobservationalstudywithcrediblecorrectionsanda

morerelevantandmuchlargerstudysample—todayoftenthecompletepopulation

ofinterestthroughadministrativerecords—mayprovideabetterestimate.Every-

thinghastobejudgedonacase-by-casebasis.Thereisnorigorousargumentfora

lexicographicpreferenceforRCTs.

Thereisalsoanimportantlineofenquirythatgoes,notonlybeyondRCTs,

butbeyondthe‘methodofdifferences’thatiscommontoRCTs,regressions,orany

formofcontrolledoruncontrolledcomparison.Thehypothetico-deductivemethod

confrontstheory-baseddeductionswiththedata—eitherobservationalorexperi-

mental.Asnotedabove,economistsroutinelyusetheorytoteaseoutanewimplica-

tionthatcanbetakentothedata,andtherearealsogoodexamplesinmedicine

suchasBleyerandWelch(2012)’sdemonstrationofthelimitedimpactonbreast

cancerincidenceofmammographyscreening,atopicwhereothermethodshave

generatedgreatcontroversyandlittleconsensus.

RCTsaretheultimateinnon-parametricestimationofaveragetreatmentef-

fectsinthetrialsamplesbecausetheymakesofewassumptionsaboutheterogene-

ity,causalstructure,choiceofvariables,andfunctionalform.RCTsareoftenconven-

ientwaystointroduceexperimenter-controlledvariance—ifyouwanttoseewhat

happens,thenkickitandsee,twistthelion’stail—butnotethatmanyexperiments,

includingmanyofthemostimportant(andNobelPrizewinning)experimentsin

58

economics,donotanddidnotuserandomization,Harrison(2013),Svorencik

(2015).Butthecredibilityoftheresults,eveninternally,canbeunderminedbyun-

balancedcovariatesandbyexcessiveheterogeneityinresponses,especiallywhen

thedistributionofeffectsisasymmetric,whereinferenceonmeanscanbehazard-

ous.Ironically,thepriceofthecredibilityinRCTsisthatwecanonlyrecoverthe

meanofthedistributionoftreatmenteffects,andthatonlyforthetrialsample.Yet,

inthepresenceofoutliers,reliableinferenceonmeansisdifficult.Andrandomiza-

tioninandofitselfdoesnothingunlessthedetailsareright;purposiveselectioninto

theexperimentalpopulation,likepurposiveselectionintoandoutofassignment,

underminesinferenceinjustthesamewayasdoesselectioninobservationalstud-

ies.Lackofblinding,whetherofparticipants,trialists,datacollectors,oranalysts,

underminesinference,akintoafailureofexclusionrestrictionsininstrumentalvari-

ableanalysis.

ThelackofstructurecanbeseriouslydisablingwhenwetrytouseRCTre-

sultsoutsideofafewcontexts,suchasprogramevaluation,hypothesistesting,or

establishingproofofconcept.Beyondthat,theresultscannotbeusedtohelpmake

predictionsbeyondthetrialsamplewithoutmorestructure,withoutmorepriorin-

formation,andwithouthavingsomeideaofwhatmakestreatmenteffectsvaryfrom

placetoplaceortimetotime.Thereisnooptionbuttocommittosomecausal

structureifwearetoknowhowtouseRCTevidenceoutoftheoriginalcontext.

Simplegeneralizationandsimpleextrapolationdonotcutthemustard.Thisistrue

ofanystudy,experimentalorobservational.Butobservationalstudiesarefamiliar

with,androutinelyworkwith,thesortofassumptionsthatRCTsclaimtoavoid,so

thatiftheaimistouseempiricalevidence,anycredibilityadvantagethatRCTshave

inestimationisnolongeroperative.AndbecauseRCTstellussolittleaboutwhyre-

sultshappen,theyhaveadisadvantageoverstudiesthatuseawiderrangeofprior

informationanddatatohelpnaildownmechanisms.

Yetoncethatcommitmenthasbeenmade,RCTevidencecanbeextremely

useful,pinningdownpartofastructure,helpingtobuildstrongerunderstanding

andknowledge,andhelpingtoassesswelfareconsequences.Asourexamplesshow,

thiscanoftenbedonewithoutcommittingtothefullcomplexityofwhatareoften

59

thoughtofasstructuralmodels.Yetwithoutthestructurethatallowsustoplace

RCTresultsincontext,ortounderstandthemechanismsbehindthoseresults,not

onlycanwenottransportwhetherìtworks’elsewhere,butwecannotdothestand-

ardstuffofeconomics,whichistosaywhethertheinterventionisactuallywelfare

improving.Withoutknowingwhythingshappenandwhypeopledothings,werun

theriskofworthlesscasual(`fairystory’)causaltheorizingandhavegivenupon

oneofthecentraltasksofeconomics.

Wemustbackawayfromtherefusaltotheorize,fromtheexultationinour

abilitytohandleunlimitedheterogeneity,andactuallySAYsomething.Perhapspar-

adoxically,unlesswearepreparedtomakeassumptions,andtosaywhatweknow,

makingstatementsthatwillbeincredibletosome,allthecredibilityoftheRCTisfor

naught.

RCTsineconomicsonhealth,labor,anddevelopmenthaveproventheir

worthinprovidingproofsofconceptandattestingpredictionsthatsomepolicies

mustalwaysworkorcanneverwork.But,aselsewhereineconomics,wecannot

findoutwhysomethingworksbysimplydemonstratingthatitdoeswork,nomatter

howoften,whichleavesusuninformedastowhetherthepolicyshouldbeimple-

mented.Beyondthat,smallscale,demonstrationRCTsarenotcapableoftellingus

whatwouldhappenifthesepolicieswereimplementedtoscale,ofcapturingunin-

tendedconsequencesthattypicallycannotbeincludedintheprotocols,orofmodel-

ingwhatwillhappenifschemesareimplementeddifferentlythaninthetrial,forex-

amplebygovernments,whosemotivesandoperatingprinciplesaredifferentfrom

theNGOsoracademicswhotypicallyruntrials.Whileitistruethatabstract

knowledgeisalwayslikelytobebeneficial,successfulpolicydependsoninstitutions

andonpolitics,mattersonwhichRCTshavelittletosay.TheresultsofRCTscanand

shouldfeedintopublicdebateaboutwhatshouldbedone,butweareondangerous

groundwhentheyareused,ongroundsoftheirsupposedepistemicsuperiority,to

insulatepolicyfromdemocraticprocesses.

Citations

60

Aigner,DennisJ.,1985,“Theresidentialelectricitytime-of-usepricingexperiments.Whathavewelearned?”inDavidA.WiseandJerryA.Hausman,Socialexperimen-tation,Chicago,Il.ChicagoUniversityPressforNationalBureauofEconomicRe-search,11–54.

Al-Ubaydil,Omar,andJohnA.List,2013,“Onthegeneralizabilityofexperimentalre-sultsineconomics,”inG.FrechetteandA.Schotter,Methodsofmodernexperi-mentaleconomics,OxfordUniversityPress.

Altman,DouglasG.,1985,“Comparabilityofrandomizedgroups,”JournaloftheRoyalStatisticalSociety,SeriesD(TheStatistician),34(1),Statisticsinhealth,125–36.

Angrist,JoshuaD.,2004,“Treatmenteffectheterogeneityintheoryandpractice,”EconomicJournal,114,C52–C83.

Angrist,JoshuaD.,EricBettinger,ErikBloom,ElizabethKing,andMichaelKremer,2002,“VouchersforprivateschoolinginColombia:evidencefromarandomizednaturalexperiment,”AmericanEconomicReview,92(5),1535–58.

Angrist,JoshuaD.andJörn-SteffenPischke,2010,“Thecredibilityrevolutioninem-piricaleconomics:howbetterresearchdesignistakingtheconoutofeconomet-rics,”JournalofEconomicPerspectives,24(2),3–30.

Angrist,JoshuaD.andJörn-SteffenPischke,2017,“Undergraduateeconometricsin-struction:throughourclasses,darkly,”JournalofEconomicPerspectives,31(2),125-44.

Aron-Dine,Aviva,LiranEinav,andAmyFinkelstein,2013,“TheRANDhealthinsur-anceexperiment,threedecadeslater,”JournalofEconomicPerspectives,27(1),197–222.

Arrow,KennethJ.,1975,“Twonotesoninferringlongrunbehaviorfromsocialex-periments,”DocumentNo.P-5546,SantaMonica,CA.RandCorporation.

Ashenfelter,Orley,1978,“Thelaborsupplyresponseofwageearners,”inJohnL.PalmerandJosephA.Pechman,eds.,Welfareinruralareas:theNorthCarolina–IowaIncomeMaintenanceExperiment,Washington,DC.TheBrookingsInstitu-tion.109–38.

Athey,SusanandGuidoW.Imbens,2017,“Thestateofappliedeconometrics:cau-salityandpolicyevaluation,”JournalofEconomicPerspectives,31(2),3-32.

Attanasio,Orazio,CostasMeghir,andAnaSantiago,2012,“EducationchoicesinMexico:usingastructuralmodelandarandomizedexperimenttoevaluatePRO-GRESA,”ReviewofEconomicStudies,79(1),37–66.

Attanasio,Orazio,SarahCattan,EmlaFitzsimons,CostasMeghir,andMartaRubioCodina,2015,“Estimatingtheproductionfunctionforhumancapital:resultsfromarandomizedcontrolledtrialinColombia,”London.InstituteforFiscalStudies,WorkingPapernoW15/06.

Bahadur,R.R.,andLeonardJ.Savage,1956,“Thenon-existenceofcertainstatisticalproceduresinnonparametricproblems,”AnnalsofMathematicalStatistics,25:1115–22.

Banerjee,Abhijit,SylvainChassang,SergioMontero,andErikSnowberg,2016,“Atheoryofexperimenters,”processed,July2016.

61

Banerjee,Abhijit,SylvainChassang,andErikSnowberg,2016,“Decisiontheoreticapproachestoexperimentdesignandexternalvalidity,”Cambridge,MA.NBERWorkingPaperno22167,April.

Banerjee,Abhijit,AngusDeaton,andEstherDuflo,2004,“Healthcaredeliveryinru-ralRajasthan,”EconomicandPoliticalWeekly,39(9),944–9.

Banerjee,AbhijitandEstherDuflo,2009,“Theexperimentalapproachtodevelop-menteconomics,”AnnualReviewofEconomics,1,151-78.

Banerjee,AbhijitandEstherDuflo,2012,Pooreconomics:aradicalrethinkingofthewaytofightglobalpoverty,PublicAffairs.

Banerjee,Abhijit,EstherDuflo,NathanaelGoldberg,DeanKarlan,RobertOsei,Wil-liamParienté,JeremyShapiro,BramThuysbaert,andChristopherUdry,2015,“Amultifacetedprogramcauseslastingprogressfortheverypoor:evidencefromsixcountries,”Science,348(6236),1260799.

Banerjee,Abhijit,EstherDuflo,andRachelGlennerster,2008,“Puttingaband-aidonacorpse:incentivesfornursesintheIndianpublichealthcaresystem,”JournaloftheEuropeanEconomicAssociation,6(2–3),487–500.

Banerjee,AbhijitV.andRuiminHe,2003,“TheWorldBankofthefuture,”AmericanEconomicReview,93(2),39–44.

Banerjee,Abhijit,DeanKarlan,andJonathanZinman,2015,“Sixrandomizedevalua-tionsofmicrocredit:introductionandfurthersteps,”AmericanEconomicJournal:AppliedEconomics,7(1),1-21.

Bareinboim,EliasandJudeaPearl,2013,“Ageneralalgorithmfordecidingtrans-portabilityofexperimentalresults,”JournalofCausalInference,1(1),107-34.

Bareinboim,EliasandJudeaPearl,2014,“Transportabilityfrommultipleenviron-mentswithlimitedexperiments:completenessresults,”inM.Welling,Z.Ghah-ramani,C.Cortes,andN.Lawrence,eds.,AdvancesofNeuralInformationPro-cessing,27,(NIPSProceedings),280-8.

Bauchet,Jonathan,JonathanMorduch,andShamikaRavi,2015,“Failurevsdisplace-ment:whyaninnovativeanti-povertyprogramshowednonetimpactinSouthIndia,”JournalofDevelopmentEconomics,116,1–16.

Basu,Kaushik,2010,“TheeconomicsoffoodgrainmanagementinIndia,”MinistryofFinance,Delhi.http://finmin.nic.in/workingpaper/Foodgrain.pdf

Begg,ColinB.,1990,“Significancetestsofcovarianceimbalanceinclinicaltrials,”ControlledClinicalTrials,11(4),223-5.

Bhattacharya,DebopamandPascalineDupas,2012,“Inferringwelfaremaximizingtreatmentassignmentunderbudgetconstraints,”JournalofEconometrics,167(1),168-96.

Bitler,MarianneP.,JonahB.Gelbach,andHilaryW.Hoynes,2006,“Whatmeanim-pactsmiss:distributionaleffectsofwelfarereformexperiments,”AmericanEco-nomicReview,96(4),988-1012.

Bleyer,Archie,andH.GilbertWelch,2012,“Effectofthreedecadesofscreeningmammographyonbreast-cancerincidence,”NewEnglandJournalofMedicine,367,1998-2005

Bloom,HowardS.,CarolynJ.Hill,andJamesA.Riccio,2005,“Modelingcross-siteex-perimentaldifferencestofindoutwhyprogrameffectivenessvaries,”inHoward

62

S.Bloom,ed.,Learningmorefromsocialexperiments:evolvinganalyticalap-proaches,NewYork,NY.RussellSage.

Bold,Tessa,MwangiKimenyi,GermanoMwabu,AliceNg’ang’a,andJustinSandefur,2013,“Scalingupwhatworks:experimentalevidenceonexternalvalidityinKen-yaneducation,”Washington,DC.CenterforGlobalDevelopment,WorkingPaper321.

Bothwell,LauraE.andScottH.Podolsky,2016,“Theemergenceoftherandomized,controlledtrial,”NewEnglandJournalofMedicine,375(6),501–4.doi:10.1056/NEJMp1604635

Campbell,D.T.andJ.C.Stanley,1963,Experimentalandquasi-experimentaldesignsforresearch.Chicago.RandMcNally.

Cartwright,Nancy,1994,Nature’scapacitiesandtheirmeasurement.Oxford.Claren-donPress.

Cartwright,Nancy,2007,“AreRCTsthegoldstandard?”Biosocieties,2,11-20.Cartwright,Nancy,2011,“Aphilosopher’sviewofthelongroadfromRCTstoeffec-tiveness,”TheLancet,377,1400-01.

Cartwright,Nancy,2012,“Presidentialaddress:willthispolicyworkforyou?Pre-dictingeffectivenessbetter:howphilosophyhelps,”PhilosophyofScience,79,973-89.

Cartwright,Nancy,2016.“Whereistherigorwhenyouneedit?”inI.Marinovic,ed.,FoundationsandTrendsinAccounting:specialissueoncausalinferenceincapitalmarketsresearch,10(2-4):106-24.

Cartwright,NancyandJeremyHardie,2012,Evidencebasedpolicy:apracticalguidetodoingitbetter,Oxford.OxfordUniversityPress.

Cartwright,NancyandEileenMunro,2010,“ThelimitationsofRCTsinpredictingeffectiveness,”JournalofExperimentalChildPsychology,16(2),

Chalmers,Iain,2001,“Comparinglikewithlike:somehistoricalmilestonesintheevolutionofmethodstocreateunbiasedcomparisongroupsintherapeuticexper-iments,”InternationalJournalofEpidemiology,30,1156–64.

Chan,TatY.andBartonH.Hamilton,2006,“Learning,privateinformation,andtheeconomicevaluationofrandomizedexperiments,”JournalofPoliticalEconomy,114(6),997-1040.

Chassang,Sylvain,GerardPadróIMiguel,andErikSnowberg,2012,“Selectivetrials:aprincipal–agentapproachtorandomizedcontrolledexperiments,”AmericanEconomicReview,102(4),1279–1309.

Chassang,Sylvain,ErikSnowberg,BenSeymour,andCayleyBowles,2015,“Ac-countingforbehaviorintreatmenteffects:newapplicationsforblindtrials,”PLoSOne,10(6),e0127227.doi:10:1371/journal.pone.0127227.

Chaudhury,Nazmul,JeffreyHammer,MichaelKremer,KarthikMuralidharan,andF.HalseyRogers,2005,“Missinginaction:teacherandhealthworkerabsenceinde-velopingcountries,”JournalofEconomicPerspectives,19(4),91–116.

Chyn,Eric,2016,“Movedtoopportunity:thelong-runeffectofpublichousingdemo-litiononlabormarketoutcomesofchildren,”UniversityofMichigan.http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf

63

Chetty,Raj,2009,“Sufficientstatisticsforwelfareanalysis:abridgebetweenstruc-turalandreduced-formmethods,”AnnualReviewofEconomics,1,451-87.

Conlisk,John,1973,“Choiceofresponsefunctionalformindesigningsubsidyexper-iments,”Econometrica,41(4),643–56.

Crépon,Bruno,EstherDuflo,MarcGurgand,RolandRathelot,andPhilippeZamora,2014,“Dolabormarketpolicieshavedisplacementeffects?evidencefromaclus-teredrandomizedexperiment,”QuarterlyJournalofEconomics,128(2),531–80.

Das,JishnuandJeffreyHammer,2005,”Whichdoctor?Combiningvignettesanditemresponsetomeasureclinicalcompetence,”JournalofDevelopmentEconom-ics,78,348–83.

Davey-Smith,George,andShahIbrahim,2002,“Datadredging,bias,orconfound-ing,”BritishMedicalJournal,325,1437-8.

Deaton,Angus,2010,“Instruments,randomization,andlearningaboutdevelop-ment,”JournalofEconomicLiterature,48(2),424-55.

Deaton,AngusandNancyCartwright,2016,“Understandingandmisunderstandingrandomizedcontrolledtrials,”http://www.princeton.edu/~deaton/down-load.html?pdf=Deaton_Cartwright_RCTs_with_ABSTRACT_August_25.pdf

Deaton,AngusandJohnMuellbauer,1980,Economicsandconsumerbehavior,NewYork.CambridgeUniversityPress.

Deaton,AngusandSerenaNg,1998,“Parametricandnonparametricapproachestopriceandtaxreform,”JournaloftheAmericanStatisticalAssociation,93(443),900-9.

Dhaliwal,Iqbal,EstherDuflo,RachelGlennerster,andCaitlinTulloch,2012,“Com-parativecost-effectivenessanalysistoinformpolicyindevelopingcountries:ageneralframeworkwithapplicationsforeducation,”J–PAL,MIT,December3rd.http://www.povertyactionlab.org/publication/cost-effectiveness

Drèze,Jean,2016,Personalemailcommunication.Duflo,Esther,2017,“Theeconomistasplumber,”AmericanEconomicReview,107(5),1-26.

Duflo,Esther,RemaHanna,andStephenP.Ryan,2012,“Incentiveswork:gettingteacherstocometoschool,”AmericanEconomicReview,102(4),1241–78.

Duflo,EstherandMichaelKremer,2008,“Useofrandomizationintheevaluationofdevelopmenteffectiveness,”inWilliamEasterly,ed.,Reinventingforeignaid.Washington,DC.Brookings,93–120.

Dynarski,Susan,2015,“Helpingthepoorineducation:thepowerofasimplenudge,”NewYorkTimes,Jan17,2015.

Fine,PaulE.M.andJacquelineA.Clarkson,1986,“Individualversuspublicprioritiesinthedeterminationofoptimalvaccinationpolicies,”AmericanJournalofEpide-miology,124(6),1012–20.

Fisher,RonaldA.,1926,“Thearrangementoffieldexperiments,”JournaloftheMin-istryofAgricultureofGreatBritain,33,503–13.

Filmer,Deon,JeffreyHammer,andLantPritchett,2000,“Weaklinksinthechain:adiagnosisofhealthpolicyinpoorcountries,”WorldBankResearchObserver,15(2),199–204.

64

Freedman,DavidA.,2008,“Onregressionadjustmentstoexperimentaldata,”Ad-vancesinAppliedMathematics,40,180–93.

Frieden,ThomasR.,2017,“Evidenceforhealthdecisionmaking—beyondrandom-ized,controlledtrials,”NewEnglandJournalofMedicine,377,465-75.

Garfinkel,IrwinandCharlesF.Manski,1992,“Introduction,”inIrwinGarfinkelandCharlesF.Manski,eds.,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress.1–22.

Gerber,AlanS.andDonaldP.Green,2012,FieldExperiments,NewYork.Norton.Gertler,PaulJ.,SebastianMartinez,PatrickPremand,LauraB.Rawlings,andChristelM.J.Vermeersch,2016,Impactevaluationinpractice,2ndEdition,Washington,DC.Inter-AmericanDevelopmentBankandWorldBank.

Goldberger,ArthurS.andCharlesF.Manski,1995,“ReviewArticle:TheBellCurvebyHerrnsteinandMurray,”JournalofEconomicLiterature,33(2),762-76.

Greenberg,DavidandMarkShroder,2004,Thedigestofsocialexperiments(3rded.),Washington,DC.UrbanInstitutePress.

Greenberg,David,MarkShroder,andMatthewOnstott,1999,“Thesocialexperi-mentmarket,”JournalofEconomicPerspectives,13(3),157–72.

Gueron,JudithM.andHowardRolston,2013,Fightingforreliableevidence,NewYork,RussellSage.

Guyatt,Gordon,DavidL.Sackett,andDeborahJ.CookfortheEvidence-BasedMedi-cineWorkingGroup,1994,“Users’guidestothemedicalliteratureII:howtouseanarticleabouttherapyorprevention.B.Whatweretheresultsandwilltheyhelpmeincaringformypatients?”JournaloftheAmericanMedicalAssociation,271(1),59–63.

Harrison,GlennW.,2013,“Fieldexperimentsandmethodologicalintolerance,”Jour-nalofEconomicMethodology,20(2),103–17.

Harrison,GlennW.,2014,“Impactevaluationandwelfareevaluation,”EuropeanJournalofDevelopmentResearch,26,39–45.

Harrison,GlennW.,2014,“Cautionarynotesontheuseoffieldexperimentstoad-dresspolicyissues,”OxfordReviewofEconomicPolicy,30(4),753-63.

Hausman,JerryA.andDavidA.Wise,1985,“Technicalproblemsinsocialexperi-mentation:costversuseaseofanalysis,”inJerryA.HausmanandDavidA.Wise,eds.,SocialExperimentation,Chicago,IL.ChicagoUniversityPress.187–220.

Heckman,JamesJ.,1992,“Randomizationandsocialpolicyevaluation,”inCharlesF.ManskiandIrwinGarfinkel,eds.,Evaluatingwelfareandtrainingprograms,Cam-bridge,MA.HarvardUniversityPress.547–70.

Heckman,JamesJ.,1997,“Instrumentalvariables:astudyofimplicitbehavioralas-sumptionsusedinmakingprogramevaluations,”JournalofHumanResources,32(3),441–62.

Heckman,JamesJ.,2005,“Thescientificmodelofcausality,”SociologicalMethodol-ogy,35(1),1-97.

Heckman,JamesJ.,2008,“Econometriccausality,”InternationalStatisticalReview,76(1),1-27.

65

Heckman,JamesJ.,2010,“Buildingbridgesbetweenstructuralandprogramevalua-tionapproachestoevaluatingpolicy,”JournalofEconomicLiterature,48(2),356-98.

Heckman,JamesJ.,NeilHohman,andJeffreySmith,withtheassistanceofMichaelKhoo,2000,“Substitutionanddropoutbiasinsocialexperiments:astudyofaninfluentialsocialexperiment,”QuarterlyJournalofEconomics,115(2),651–94.

Heckman,JamesJ.,RobertJ.Lalonde,andJeffreyA.Smith,1999,“Theeconomicsandeconometricsofactivelabormarkets,”Chapter31inAshenfelter,OrleyandDa-vidCard,eds.Handbookoflaboreconomics,Amsterdam.North-Holland,3(A),1866–2097.

Heckman,JamesJ.,RodrigoPinto,andPeterSavelyev,2013,“Understandingthemechanismsthroughwhichaninfluentialearlychildhoodprogramboostedadultoutcomes,”AmericanEconomicReview,103(6),2052–86.

Heckman,JamesJ.andJeffreySmith,1995,“Assessingthecaseforsocialexperi-ments,”JournalofEconomicPerspectives,9(2),85-110.

Heckman,JamesJ.,JeffreySmith,andNancyClements,1997,“Makingthemostoutofprogrammeevaluationsandsocialexperiments:accountingforheterogeneityinprogrammeimpacts,”ReviewofEconomicStudies,64(4),487–535.

Heckman,JamesJ.andSergioUrzúa,2010,“ComparingIVwithstructuralmodels:whatsimpleIVcanandcannotidentify,”JournalofEconometrics,156,27-37.

Heckman,JamesJ.andEdwardVytlacil,2005,“Structuralequations,treatmentef-fects,andeconometricpolicyevaluation,”Econometrica,73(3),669–738.

Heckman,JamesJ.andEdwardJ.Vytlacil,2007,“Econometricevaluationofsocialprograms,Part1:causalmodels,structuralmodels,andeconometricpolicyeval-uation,”Chapter70inJamesJ.HeckmanandEdwardE.Leamer,eds.,HandbookofEconometrics,6B,4779–874.

Horton,Richard,2000,“Commonsenseandfigures:therhetoricofvalidityinmedi-cine:BradfordHillmemoriallecture1999,”Statisticsinmedicine,19,3149–64.

Hotz,V.Joseph,GuidoW.Imbens,andJulieH.Mortimer,2005,“Predictingtheeffi-cacyoffuturetrainingprogramsusingpastexperienceatotherlocations,”Jour-nalofEconometrics,125,241–70.

Hsieh,Chang-taiandMiguelUrquiola,2006,“Theeffectsofgeneralizedschoolchoiceonachievementandstratification:evidencefromChile’svoucherpro-gram,”JournalofPublicEconomics,90,1477–1503.

Hurwicz,Leonid,1966,“Onthestructuralformofinterdependentsystems,”StudiesinLogicandtheFoundationsofMathematics,44,232-9.

Imbens,GuidoW.,2004,“Nonparametricestimationofaveragetreatmenteffectsunderexogeneity:areview,”ReviewofEconomicsandStatistics,86(1),4–29.

Imbens,GuidoW.,2010,“BetterLATEthannothing:somecommentsonDeaton(2009)andHeckmanandUrzua,”JournalofEconomicLiterature,48(2),399–423.

Imbens,GuidoW.andJoshuaD.Angrist,1994,“Identificationandestimationoflocalaveragetreatmenteffects,”Econometrica,62(2),467–75.

Imbens,GuidoW.andMichalKolesár,2016,“Robuststandarderrorsinsmallsam-ples:somepracticaladvice,”ReviewofEconomicsandStatistics,98(4),701-12.

66

Imbens,GuidoW.andJeffreyM.Wooldridge,2009,“Recentdevelopmentsintheeconometricsofprogramevaluation,”JournalofEconomicLiterature,47(1),5–86.

InternationalCommitteeofMedicalJournalEditors,2015,Recommendationsfortheconduct,reporting,editing,andpublicationofscholarlyworkinmedicaljournals,http://www.icmje.org/icmje-recommendations.pdf(accessed,August20,2016.)

J_PAL,2017,https://www.povertyactionlab.org/about-j-pal,(accessed,August21,2017).

Kahneman,DanielandGaryKlein,2009,“Conditionsforintuitiveexpertise:afailuretodisagree,”AmericanPsychologist,64(6),515–26.

Karlan,DeanandJacobAppel,2011,Morethangoodintentions:howaneweconom-icsishelpingtosolveglobalpoverty,NewYork.Dutton.

Karlan,Dean,NathanealGoldbergandJamesCopestake,2009,“Randomizedcon-trolledtrialsarethebestwaytomeasureimpactofmicrofinanceprogramsandimprovemicrofinanceproductdesigns,”EnterpriseDevelopmentandMicro-finance,20(3),167–76.

Kasy,Maximilian,2016,“Whyexperimentersmightnotwanttorandomize,andwhattheycoulddoinstead,”PoliticalAnalysis,1–15doi:10.1093/pan/mpw012

Kramer,Peter,2016,Ordinarilywell:thecaseforantidepressants,NewYork.Farrar,Straus,andGiroux.

Kremer,MichaelandAlakaHolla,2009,“Improvingeducationinthedevelopingworld:whathavewelearnedfromrandomizedevaluations?”AnnualReviewofEconomics,1,513–42.

Lalonde,RobertJ.,1986,“Evaluatingtheeconometricevaluationsoftrainingpro-gramswithexperimentaldata,”AmericanEconomicReview,76(4),604-20.

Lehman,Erich.L.andJosephP.Romano,2005,Testingstatisticalhypotheses(thirdedition),NewYork.Springer.

Levy,Santiago,2006,Progressagainstpoverty:sustainingMexico’sProgresa-Opor-tunidadesprogram,Washington,DC.Brookings.

Mackie,JohnL.,1974,Thecementoftheuniverse:astudyofcausation,Oxford.Ox-fordUniversityPress.

Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,andArleenLeibowitz,1988a,“Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedexperiment,”AmericanEconomicReview,77(3),251–77.

Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,BernadetteBenjamin,ArleenLeibowitz,M.SusanMarquis,andJackZwanziger,1988b,Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedex-periment,SantaMonica,CA.RAND.

Manski,CharlesF.,2004,“Treatmentrulesforheterogeneouspopulations,”Econo-metrica,72(4),1221-46.

Manski,CharlesF.,2013,Publicpolicyinanuncertainworld:analysisanddecisions,Cambridge,MA.HarvardUniversityPress.

Manski,CharlesF.andAlekseyTetenov,2016,“Sufficienttrialsizetoinformclinicalpractice,”PNAS,113(38),10518-23.

67

Metcalf,CharlesE.,1973,“Makinginferencesfromcontrolledincomemaintenanceexperiments,”AmericanEconomicReview,63(3),478–83.

Moffitt,Robert,1979,“ThelaborsupplyresponseintheGaryexperiment,”JournalofHumanResources,14(4),477–87.

Moffitt,Robert,1992,“Evaluationmethodsforprogramentryeffects,”Chapter6inCharlesManskiandIrwinGarfinkel,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress,231–52.

Moffitt,Robert,2004,“Theroleofrandomizedfieldtrialsinsocialscienceresearch:aperspectivefromevaluationsofreformsofsocialwelfareprograms,”AmericanBehavioralScientist,47(5),506–40

Morgan,KariLockandDonaldB.Rubin,2012,“Rerandomizationtoimprovecovari-atebalanceinexperiments,”AnnalsofStatistics,40(2),1263–82.

Muller,SeánM.,2015,“Causalinteractionandexternalvalidity:obstaclestothepol-icyrelevanceofrandomizedevaluations,”WorldBankEconomicReview,29,S217–S225.

Orcutt,GuyH.andAliceG.Orcutt,1968,“Incentiveanddisincentiveexperimenta-tionforincomemaintenancepolicypurposes,”AmericanEconomicReview,58(4),754–72.

Pearl,JudeaandEliasBareinboim,2011,“Transportabilityofcausalandstatisticalrelations:aformalapproach,”Proceedingsofthe25thAAAIConferenceonArtificialIntelligence,AAAIPress,247-54,

Pearl, Judea and Elias Bareinboim, 2014, “External validity: from do-calculus to trans-portability across populations,” Statistical Science, 29(4), 579-95.

Rodrik,Dani,2006,personalemailcommunication.Rothwell,PeterM.,2005,“Externalvalidityofrandomizedcontrolledtrials:‘towhomdotheresultsofthetrialapply’”,Lancet,365,82–93.

Russell,Bertrand,2008[1912],Theproblemsofphilosophy,Rockville,MD.ArcManor.

Sackett,DavidL.,WilliamM.C.Rosenberg,J.A.MuirGray,R.BrianHaynesandW.ScottRichardson,1996,“Evidencebasedmedicine:whatitisandwhatitisn’t,”BritishMedicalJournal,312(January13),71–2.

Savage,LeonardJ.,1962,“Subjectiveprobabilityandstatisticalpractice,”inG.A.Bar-nardandD.R.Cox,eds.,TheFoundationsofStatisticalInference,London.Me-thuen.9-35.

Scriven,Michael,1974,“Evaluationperspectivesandprocedures,”inW.JamesPop-ham,ed.,Evaluationineducation—currentapplications,Berkeley,CA.McCutchanPublishingCorporation.

Senn,Stephen,1994,“Testingforbaselinebalanceinclinicaltrials,”StatisticsinMedicine,13,1715–26.

Senn,Stephen,2013,“Sevenmythsofrandomizationinclinicaltrials,”StatisticsinMedicine32,1439–50.

Shadish,WilliamR.,ThomasD.Cook,andDonaldT.Campbell,2002,Experimentalandquasi-experimentaldesignsforgeneralizedcausalinference,Boston,MA.HoughtonMifflin.

Simpson,Adrian,2017,“Themisdirectionofpublicpolicy:comparingandcombin-ingstandardisedeffectsizes,”JournalofEducationalPolicy32(4),450-66.

68

Stuart,ElizabethA.,StephenR.Cole,andCatharineP.Bradshaw,andPhilipJ.Leaf,2011,“Theuseofpropensityscorestoassessthegeneralizabilityofresultsfromrandomizedtrials,”JournaloftheRoyalStatisticalSocietyA,174(2),369–86.

Student(W.S.Gosset),1938,“Comparisonbetweenbalancedandrandomarrange-mentsoffieldplots,”Biometrika,29(3/4),363-78.

Svorencik,Andrej,2015,Theexperimentalturnineconomics:ahistoryofexperi-mentaleconomics,UtrechtSchoolofEconomics,DissertationSeries#29,http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026

Todd,PetraE.andKennethJ.Wolpin,2006,“Assessingtheimpactofaschoolsub-sidyprograminMexico:usingasocialexperimenttovalidateadynamicbehav-ioralmodelofchildschoolingandfertility,”AmericanEconomicReview,96(5),1384–1417.

Todd,PetraE.andKennethJ.Wolpin,2008,“Exanteevaluationofsocialprograms,”Annalesd’EconomieetdelaStatistique,91/92,263–91.

U.S.DepartmentofEducation,InstituteofEducationSciences,NationalCenterforEducationEvaluationandRegionalAssistance,2003,Identifyingandimplement-ingeducationalpracticessupportedbyrigorousevidence:auserfriendlyguide,Washington,DC.InstituteofEducationSciences.

Vandenbroucke,JanP.,2004,“Whenareobservationalstudiesascredibleasran-domizedcontrolledtrials?”TheLancet,363:1728–31.

Vandenbroucke,JanP..2009,“TheHRTcontroversy:observationalstudiesandRCTsfallinline,”TheLancet,373,1233-5.

Vivalt,Eva,2015,“Howmuchcanwegeneralizefromimpactevaluations?”NYU,un-published.http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf

White,Halbert,1980,“Aheteroskedasticity-consistentcovariancematrixestimatorandadirecttestforheteroskedasticity,”Econometrica,50(1),1–25.

Wise,DavidA.,1985,“Abehavioralmodelversusexperimentation:theeffectsofhousingsubsidiesonrent,”inP.BruckerandR.Pauly,eds.MethodsofOperationsResearch,50,VerlagAnonHain.441–89.

Wolpin,KennethI.,2013,Thelimitsofinferencewithouttheory,Camridge,MA.MITPress.

Worrall,John,2007,“Evidenceinmedicineandevidence-basedmedicine,”Philoso-phyCompass,2/6,981–1022.

Worrall,John,2008,“Evidenceandethicsinmedicine,”PerspectivesinBiologyandMedicine,51(3),418-31.

Yates,Frank,1939,“Thecomparativeadvantagesofsystematicandrandomizedar-rangementsinthedesignofagriculturalandbiologicalexperiments,”Biometrika,30(3/4),440-66.

Young,Alwyn,2016,“ChannelingFisher:randomizationtestsandthestatisticalin-significanceofseeminglysignificantexperimentalresults,”LondonSchoolofEco-nomics,WorkingPaper,Feb.

Ziliak,StephenT.,2014,“Balancedversusrandomizedfieldexperimentsineconom-ics:whyW.S.Gossetaka‘Student’matters,”ReviewofBehavioralEconomics,1,167–208.

69

Appendix:MonteCarloexperimentforanRCTwithoutliersInthisillustrativeexample,thereisparentpopulationeachmemberofwhichhashisorher

owntreatmenteffect;thesearecontinuouslydistributedwithashiftedlognormaldistribu-

tionwithzeromeansothatthepopulationATEiszero.Theindividualtreatmenteffectsβ

aredistributedsothat β + e0.5 ∼ Λ(0,1) ,forstandardizedlognormaldistributionΛ. Inthe

absenceoftreatment,everyoneinthesamplerecordszero,sothesampleaveragetreat-

menteffectinanyonetrialissimplythemeanoutcomeamongthentreatments.Forvalues

ofnequalto25,50,100,200,and500wedrawfromtheparentpopulation100trialsam-

pleseachofsize2n;withfivevaluesofn,thisgivesus500trialsamplesinall;becauseof

samplingthetrueATE’sineachtrialsamplewillnotbezero.Foreachofthese500samples,

werandomizeintoncontrolsandntreatments,estimatetheATEanditsestimatedt–value

(usingthestandardtwo-samplet–value,orequivalently,byrunningaregressionwithro-

bustt–values),andthenrepeat1,000times,sowehave1,000ATEestimatesandt–values

foreachofthe500trialsamples.TheseallowustoassessthedistributionofATEestimates

andtheirnominalt–valuesforeachtrial.

TheresultsareshowninTableA1.Eachrowcorrespondstoasamplesize.Ineach

row,weshowtheresultsof100,000individualtrials,composedof1,000replicationson

eachofthe100trial(experimental)samples.Thecolumnsareaveragedoverall100,000tri-

als.

TableA1:RCTswithskewedtreatmenteffects

Samplesize MeanofATE

estimates

Meanofnominalt–

values

Fractionnullre-

jected(percent)

25

50

0.0268

0.0266

–0.4274

–0.2952

13.54

11.20

100 –0.0018 –0.2600 8.71

200 0.0184 –0.1748 7.09

500 –0.0024 –0.1362 6.06

Note:1,000randomizationsoneachof100drawsofthetrialsamplerandomlydrawnfromalognormaldistributionoftreatmenteffectsshiftedtohaveazeromean.

70

Thelastcolumnshowsthefractionsoftimesthenullthatistrueinthepopulationis

rejectedinthetrialsamplesandisourkeyresult.Whenthereareonly50treatmentsand

50controls(row2),the(true)nullisrejected11.2percentofthetime,insteadofthe5per-

centthatwewouldlikeandexpectifwewereunawareoftheproblem.Whenthereare500

unitsineacharm,therejectionrateis6.06percent,muchclosertothenominal5percent.

FigureA1:EstimatesofanATEwithanoutlierinthetrialsample

FigureA1illustratestheestimatedATEsfromanextremetrialsamplefromthesimulations

inthesecondrowwith100observationsintotal;thehistogramshowsthe1,000estimates

oftheATEforthattrialsample.Thistrialsamplehasasinglelargeoutlyingtreatmenteffect

of48.3;themean(s.d.)oftheother99observationsis–0.51(2.1);whentheoutlierisinthe

treatmentgroup,wegettheright-handsideofthefigure,whenitisinthecontrolgroup,we

gettheleft-handside.

0.5

11.

5D

ensi

ty

-.5 0 .5 1 1.5 21,000 estimates of average treatment effect