
Journal of Economic Literature 48 (June 2010): 424–455
http://www.aeaweb.org/articles.php?doi=10.1257/jel.48.2.424


Instruments, Randomization, and Learning about Development

Angus Deaton*

There is currently much debate about the effectiveness of foreign aid and about what kind of projects can engender economic development. There is skepticism about the ability of econometric analysis to resolve these issues or of development agencies to learn from their own experience. In response, there is increasing use in development economics of randomized controlled trials (RCTs) to accumulate credible knowledge of what works, without overreliance on questionable theory or statistical methods. When RCTs are not possible, the proponents of these methods advocate quasi-randomization through instrumental variable (IV) techniques or natural experiments. I argue that many of these applications are unlikely to recover quantities that are useful for policy or understanding: two key issues are the misunderstanding of exogeneity and the handling of heterogeneity. I illustrate from the literature on aid and growth. Actual randomization faces similar problems as does quasi-randomization, notwithstanding rhetoric to the contrary. I argue that experiments have no special ability to produce more credible knowledge than other methods, and that actual experiments are frequently subject to practical problems that undermine any claims to statistical or epistemic superiority. I illustrate using prominent experiments in development and elsewhere. As with IV methods, RCT-based evaluation of projects, without guidance from an understanding of underlying mechanisms, is unlikely to lead to scientific progress in the understanding of economic development. I welcome recent trends in development experimentation away from the evaluation of projects and toward the evaluation of theoretical mechanisms. (JEL C21, F35, O19)

* Deaton: Princeton University. This is a lightly revised and updated version of “Instruments of Development: Randomization in the Tropics and the Search for the Elusive Keys to Economic Development,” originally given as the Keynes Lecture, British Academy, October 9, 2008, and published in the Proceedings of the British Academy 102, 2008 Lectures, © The British Academy 2009. I am grateful to the British Academy for permission to reprint the unchanged material. I have clarified some points in response to Imbens (2009), but have not changed anything of substance and stand by my original arguments. For helpful discussions and comments on earlier versions, I am grateful to Abhijit Banerjee, Tim Besley, Richard Blundell, Ron Brookmayer, David Card, Anne Case, Hank Farber, Bill Easterly, Erwin Diewert, Jean Drèze, Esther Duflo, Ernst Fehr, Don Green, Jim Heckman, Ori Heffetz, Karla Hoff, Bo Honoré, Štěpán Jurajda, Dean Karlan, Aart Kraay, Michael Kremer, David Lee, Steve Levitt, Winston Lin, John List, David McKenzie, Margaret McMillan, Costas Meghir, Branko Milanovic, Chris Paxson, Franco Peracchi, Dani Rodrik, Sam Schulhofer-Wohl, Jonathan Skinner, Jesse Rothstein, Rob Sampson, Burt Singer, Finn Tarp, Alessandro Tarozzi, Chris Udry, Gerard van den Berg, Eric Verhoogen, Susan Watkins, Frank Wolak, and John Worrall. I acknowledge a particular intellectual debt to Nancy Cartwright, who has discussed these issues patiently with me for several years, whose own work on causality has greatly influenced me, and who pointed me toward other important work; to Jim Heckman, who has long thought deeply about these issues, and many of whose views are recounted here; and to the late David Freedman, who consistently and effectively fought against the (mis)use of technique as a substitute for substance and thought. None of which removes the need for the usual disclaimer, that the views expressed here are entirely my own. I acknowledge financial support from NIA through grant P01 AG05842–14 to the National Bureau of Economic Research.


    1. Introduction

The effectiveness of development assistance is a topic of great public interest. Much of the public debate among noneconomists takes it for granted that, if the funds were made available, poverty would be eliminated (Thomas Pogge 2005; Peter Singer 2004), and at least some economists agree (Jeffrey D. Sachs 2005, 2008). Others, most notably William Easterly (2006, 2009), are deeply skeptical, a position that has been forcefully argued at least since P. T. Bauer (1971, 1981). Few academic economists or political scientists agree with Sachs’s views, but there is a wide range of intermediate positions, well assembled by Easterly (2008). The debate runs the gamut from the macro—can foreign assistance raise growth rates and eliminate poverty?—to the micro—what sorts of projects are likely to be effective?—should aid focus on electricity and roads, or on the provision of schools and clinics or vaccination campaigns? Here I shall be concerned with both the macro and micro kinds of assistance. I shall have very little to say about what actually works and what does not—but it is clear from the literature that we do not know. Instead, my main concern is with how we should go about finding out whether and how assistance works, and with methods for gathering evidence and learning from it in a scientific way that has some hope of leading to the progressive accumulation of useful knowledge about development.

I am not an econometrician, but I believe that econometric methodology needs to be assessed, not only by methodologists, but by those who are concerned with the substance of the issue. Only they (we) are in a position to tell when something has gone wrong with the application of econometric methods, not because they are incorrect given their assumptions, but because their assumptions do not apply, or because they are incorrectly conceived for the problem at hand. Or at least that is my excuse for meddling in these matters.

Any analysis of the extent to which foreign aid has increased economic growth in recipient countries immediately confronts the familiar problem of simultaneous causality; the effect of aid on growth, if any, will be disguised by effects running in the opposite direction, from poor economic performance to compensatory or humanitarian aid. It is not obvious how to disentangle these effects, and some have argued that the question is unanswerable and that econometric studies of it should be abandoned. Certainly, the econometric studies that use international evidence to examine aid effectiveness currently have low professional status. Yet it cannot be right to give up on the issue. There is no general or public understanding that nothing can be said, and to give up the econometric analysis is simply to abandon precise statements for loose and unconstrained histories of episodes selected to support the position of the speaker.

The analysis of aid effectiveness typically uses cross-country growth regressions, with the simultaneity between aid and growth dealt with using instrumental variable methods. I shall argue in the next section that there has been a good deal of misunderstanding in the literature about the use of instrumental variables. Econometric analysis has changed its focus over the years, away from the analysis of models derived from theory toward much looser specifications that are statistical representations of program evaluation. With this shift, instrumental variables have moved from being solutions to a well-defined problem of inference to being devices that induce quasi-randomization. Old and new understandings of instruments coexist, leading to errors, misunderstandings, and confusion, as well as unfortunate and unnecessary rhetorical barriers between disciplines working on the same problems. These abuses of technique have contributed to a general skepticism about the ability of econometric analysis to answer these big questions.

A similar state of affairs exists in the microeconomic area, in the analysis of the effectiveness of individual programs and projects, such as the construction of infrastructure—dams, roads, water supply, electricity—and in the delivery of services—education, health, or policing. There is frustration with aid organizations, particularly the World Bank, for allegedly failing to learn from its projects and to build up a systematic catalog of what works and what does not. In addition, some of the skepticism about macro econometrics extends to micro econometrics, so that there has been a movement away from such methods and toward randomized controlled trials. According to Esther Duflo, one of the leaders of the new movement in development, “Creating a culture in which rigorous randomized evaluations are promoted, encouraged, and financed has the potential to revolutionize social policy during the 21st century, just as randomized trials revolutionized medicine during the 20th,” this from a 2004 Lancet editorial headed “The World Bank is finally embracing science.”

In section 4 of this paper, I shall argue that, under ideal circumstances, randomized evaluations of projects are useful for obtaining a convincing estimate of the average effect of a program or project. The price for this success is a focus that is too narrow and too local to tell us “what works” in development, to design policy, or to advance scientific knowledge about development processes. Project evaluations, whether using randomized controlled trials or nonexperimental methods, are unlikely to disclose the secrets of development nor, unless they are guided by theory that is itself open to revision, are they likely to be the basis for a cumulative research program that might lead to a better understanding of development. This argument applies a fortiori to instrumental variables strategies that are aimed at generating quasi-experiments; the value of econometric methods cannot and should not be assessed by how closely they approximate randomized controlled trials. Following Nancy Cartwright (2007a, 2007b), I argue that evidence from randomized controlled trials can have no special priority. Randomization is not a gold standard because “there is no gold standard” (Cartwright 2007a). Randomized controlled trials cannot automatically trump other evidence, they do not occupy any special place in some hierarchy of evidence, nor does it make sense to refer to them as “hard” while other methods are “soft.” These rhetorical devices are just that; metaphor is not argument, nor does endless repetition make it so.

More positively, I shall argue that the analysis of projects needs to be refocused toward the investigation of potentially generalizable mechanisms that explain why and in what contexts projects can be expected to work. The best of the experimental work in development economics already does so because its practitioners are too talented to be bound by their own methodological prescriptions. Yet there would be much to be said for doing so more openly. I concur with Ray Pawson and Nick Tilley (1997), who argue that thirty years of project evaluation in sociology, education, and criminology was largely unsuccessful because it focused on whether projects worked instead of on why they worked. In economics, warnings along the same lines have been repeatedly given by James J. Heckman (see particularly Heckman 1992 and Heckman and Jeffrey A. Smith 1995), and much of what I have to say is a recapitulation of his arguments.

The paper is organized as follows. Section 2 lays out some econometric preliminaries concerning instrumental variables and the vexed question of exogeneity. Section 3 is about aid and growth. Section 4 is about randomized controlled trials. Section 5 is about using empirical evidence and where we should go now.

2. Instruments, Identification, and the Meaning of Exogeneity

It is useful to begin with a simple and familiar econometric model that I can use to illustrate the differences between different flavors of econometric practice—this has nothing to do with economic development, but it is simple and easy to contrast with the development practice that I wish to discuss. In contrast to the models that I will discuss later, I think of this as a model in the spirit of the Cowles Foundation. It is the simplest possible Keynesian macroeconomic model of national income determination taken from once-standard econometrics textbooks. There are two equations that together comprise a complete macroeconomic system. The first equation is a consumption function in which aggregate consumption is a linear function of aggregate national income, while the second is the national income accounting identity that says that income is the sum of consumption and investment. I write the system in standard notation as

(1) C = α + βY + u,

(2) Y ≡ C + I.

According to (1), consumers choose the level of aggregate consumption with reference to their income, while in (2) investment is set by the “animal spirits” of entrepreneurs in some way that is outside of the model. No modern macroeconomist would take this model seriously, though the simple consumption function is an ancestor of more satisfactory and complete modern formulations; in particular, we can think of it (or at least its descendants) as being derived from a coherent model of intertemporal choice. Similarly, modern versions would postulate some theory for what determines investment I—here it is simply taken as given and assumed to be orthogonal to the consumption disturbance u.

In this model, consumption and income are simultaneously determined so that, in particular, a stochastic realization of u—consumers displaying animal spirits of their own—will affect not only C but also Y through equation (2), so that there is a positive correlation between u and Y. As a result, ordinary least squares (OLS) estimation of (1) will lead to upwardly biased and inconsistent estimates of the parameter β.

This simultaneity problem can be dealt with in a number of ways. One is to solve (1) and (2) to get the reduced form equations

(3) C = α/(1 − β) + [β/(1 − β)]I + u/(1 − β),

(4) Y = α/(1 − β) + [1/(1 − β)]I + u/(1 − β).

Both of these equations can be consistently estimated by OLS, and it is easy to show that the same estimates of α and β will be obtained from either one. An alternative method of estimation is to focus on the consumption function (1) and to use our knowledge of (2) to note that investment can be used as an instrumental variable (IV) for income. In the IV regression, there is a “first stage” regression in which income is regressed on investment; this is identical to equation (4), which is part of the reduced form. In the second stage, consumption is regressed on the predicted value of income from (4). In this simple case, the IV estimate of β is identical to the estimate from the reduced form. This simple model may not be a very good model—but it is a model, if only a primitive one.
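This equivalence is easy to verify numerically. The following sketch is mine, not part of the original lecture: it simulates the two-equation system with arbitrary invented parameter values, shows the upward bias of OLS applied to (1), and shows that the IV estimate using investment as the instrument matches the ratio of the two reduced-form slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
alpha, beta = 10.0, 0.6          # invented "true" values for illustration

# Investment is "animal spirits": determined outside the model,
# independent of the consumption disturbance u.
I = rng.normal(50.0, 5.0, n)
u = rng.normal(0.0, 2.0, n)

# Solve the system (1)-(2): Y from reduced form (4), then C from (1).
Y = (alpha + I + u) / (1.0 - beta)
C = alpha + beta * Y + u

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# OLS of (1) is biased upward: u feeds into Y through identity (2).
print("OLS estimate of beta:", slope(Y, C))

# IV with I as instrument: cov(C, I)/cov(Y, I) is consistent for beta.
print("IV estimate of beta: ", np.cov(C, I)[0, 1] / np.cov(Y, I)[0, 1])

# Reduced-form slopes: C on I gives beta/(1 - beta); Y on I gives
# 1/(1 - beta). Their ratio recovers the same beta as IV.
print("Indirect least squares:", slope(I, C) / slope(I, Y))
```

With these particular values, OLS converges to roughly 0.65 rather than 0.6, while the IV and indirect least squares estimates agree with each other and with the truth, which is the identity claimed in the text.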

I now leap forward sixty years and consider an apparently similar set up, again using an absurdly simple specification. The World Bank (let us imagine) is interested in whether to advise the government of China to build more railway stations as part of its poverty reduction strategy. The Bank economists write down an econometric model in which the poverty head count ratio in city c is taken to be a linear function of an indicator R of whether or not the city has a railway station,

(5) Pc = γ + θRc + vc,

where θ (I hesitate to call it a parameter) indicates the effect—presumably negative—of infrastructure (here a railway station) on poverty. While we cannot expect to get useful estimates of θ from OLS estimation of (5)—railway stations may be built to serve more prosperous cities, they are rarely built in deserts where there are no people, or there may be “third factors” that influence both—this is seen as a “technical problem” for which there is a wide range of econometric treatments including, of course, instrumental variables.

We no longer have the reduced form of the previous model to guide us but, if we can find an instrument Z that is correlated with whether a town has a railway station but uncorrelated with v, we can do the same calculations and obtain a consistent estimate. For the record, I write this equation

(6) Rc = ϕ + φZc + ηc.

Good candidates for Z might be indicators of whether the city has been designated by the Government of China as belonging to a special “infrastructure development area,” or perhaps an earthquake that conveniently destroyed a selection of railway stations, or even the existence of a river confluence near the city, since rivers were an early source of power, and railways served the power-based industries. I am making fun, but not much. And these instruments all have the real merit that there is some mechanism linking them to whether or not the town has a railway station, something that is not automatically guaranteed by the instrument being correlated with R and uncorrelated with v (see, for example, Peter C. Reiss and Frank A. Wolak 2007, pp. 4296–98).

My main argument is that the two econometric structures, in spite of their resemblance and the fact that IV techniques can be used for both, are in fact quite different. In particular, the IV procedures that work for the effect of national income on consumption are unlikely to give useful results for the effect of railway stations on poverty. To explain the differences, I begin with the language. In the original example, the reduced form is a fully specified system since it is derived from a notionally complete model of the determination of income. Consumption and income are treated symmetrically and appear as such in the reduced form equations (3) and (4). In contemporary examples, such as the railways, there is no complete theoretical system and there is no symmetry. Instead, we have a “main” equation (5), which used to be the “structural” equation (1). We also have a “first-stage” equation, which is the regression of railway stations on the instrument. The now rarely considered regression of the variable of interest on the instrument, here of poverty on earthquakes or on river confluences, is nowadays referred to as the reduced form, although it was originally one equation of a multiple equation reduced form—equation (6) is also part of the reduced form—within which it had no special significance. These language shifts sometimes cause confusion, but they are not the most important differences between the two systems.

The crucial difference is that the relationship between railways and poverty is not a model at all, unlike the consumption model that embodied an (admittedly crude) theory of income determination. While it is clearly possible that the construction of a railway station will reduce poverty, there are many possible mechanisms, some of which will work in one context and not in another. In consequence, θ is unlikely to be constant over different cities, nor can its variation be usefully thought of as random variation that is uncorrelated with anything else of interest. Instead, it is precisely the variation in θ that encapsulates the poverty reduction mechanisms that ought to be the main objects of our enquiry. Instead, the equation of interest—the so-called “main equation” (5)—is thought of as a representation of something more akin to an experiment or a biomedical trial in which some cities get “treated” with a station and some do not. The role of econometric analysis is not, as in the Cowles example, to estimate and investigate a causal model, but “to create an analogy, perhaps forced, between an observational study and an experiment” (David A. Freedman 2006, p. 691).

One immediate task is to recognize and somehow deal with the variation in θ, which is typically referred to as the “heterogeneity problem” in the literature. The obvious way is to define a parameter of interest in a way that corresponds to something we want to know for policy evaluation—perhaps the average effect on poverty over some group of cities—and then devise an appropriate estimation strategy. However, this step is often skipped in practice, perhaps because of a mistaken belief that the “main equation” (5) is a structural equation in which θ is a constant, so that the analysis can go immediately to the choice of instrument Z, over which a great deal of imagination and ingenuity is often exercised. Such ingenuity is often needed because it is difficult simultaneously to satisfy both of the standard criteria required for an instrument, that it be correlated with Rc and uncorrelated with vc. However, if heterogeneity is indeed present, even satisfying the standard criteria is not sufficient to prevent the probability limit of the IV estimator depending on the choice of instrument (Heckman 1997). Without explicit prior consideration of the effect of the instrument choice on the parameter being estimated, such a procedure is effectively the opposite of standard statistical practice, in which a parameter of interest is defined first, followed by an estimator that delivers that parameter. Instead, we have a procedure in which the choice of the instrument, which is guided by criteria designed for a situation in which there is no heterogeneity, is implicitly allowed to determine the parameter of interest. This goes beyond the old story of looking for an object where the light is strong enough to see; rather, we have at least some control over the light but choose to let it fall where it may, and then proclaim that whatever it illuminates is what we were looking for all along.

Recent econometric analysis has given us a more precise characterization of what we can expect from such a method. In the railway example, where the instrument is the designation of a city as belonging to the “special infrastructure zone,” the probability limit of the IV estimator is the average of poverty reduction effects over those cities that were induced to construct a railway station by being so designated. This average is known as the “local average treatment effect” (LATE) and its recovery by IV estimation requires a number of nontrivial conditions, including, for example, that no cities that would have constructed a railway station are perverse enough to be actually deterred from doing so by the positive designation (see Guido W. Imbens and Joshua D. Angrist 1994, who established the LATE theorem). The LATE may or may not be a parameter of interest to the World Bank or the Chinese government and, in general, there is no reason to suppose that it will be. For example, the parameter estimated will typically not be the average poverty reduction effect over the designated cities, nor will it be the average effect over all cities.
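A small simulation makes the point concrete. Everything in this sketch is my own invention (the numbers, the take-up rule, and the assumption that cities with the largest poverty effects are keenest to build): with heterogeneous θ, the Wald/IV estimate converges on the mean of θ among the compliers that the particular instrument happens to create, which equals neither the mean over all builders nor the mean over all cities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

theta = rng.normal(-2.0, 1.0, n)   # city-specific effect of a station

# Assumed take-up rule: cities that would gain most are keenest to build.
keenness = -theta + rng.normal(0.0, 1.0, n)
R0 = keenness > 2.0                # would build without designation
R1 = keenness > 0.5                # would build if designated (R1 >= R0)

D = rng.integers(0, 2, n)          # randomized designation
R = np.where(D == 1, R1, R0)       # observed station indicator

v = rng.normal(0.0, 1.0, n)
P = 5.0 + theta * R + v            # poverty outcome, as in (5)

# Wald/IV estimate of the effect of R on P, using D as the instrument.
wald = ((P[D == 1].mean() - P[D == 0].mean())
        / (R[D == 1].mean() - R[D == 0].mean()))

compliers = R1 & ~R0
print("IV (Wald) estimate:       ", wald)
print("mean theta, compliers:    ", theta[compliers].mean())
print("mean theta, all builders: ", theta[R == 1].mean())
print("mean theta, all cities:   ", theta.mean())
```

The first two numbers agree (that is the LATE theorem; monotonicity holds here because the designated threshold is the lower one), and both differ from the last two. Change the thresholds, that is, change the instrument, and the estimand changes with them.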

I find it hard to make any sense of the LATE. We are unlikely to learn much about the processes at work if we refuse to say anything about what determines θ; heterogeneity is not a technical problem calling for an econometric solution but a reflection of the fact that we have not started on our proper business, which is trying to understand what is going on. Of course, if we are as skeptical of the ability of economic theory to deliver useful models as are many applied economists today, the ability to avoid modeling can be seen as an advantage, though it should not be a surprise when such an approach delivers answers that are hard to interpret. Note that my complaint is not with the “local” nature of the LATE—that property is shared by many estimation strategies, and I will discuss later how we might overcome it. The issue here is rather the “average” and the lack of an ex ante characterization of the set over which the averaging is done. Angrist and Jörn-Steffen Pischke (2010) have recently claimed that the explosion of instrumental variables methods, including LATE estimation, has led to greater “credibility” in applied econometrics. I am not entirely certain what credibility means, but it is surely undermined if the parameter being estimated is not what we want to know. While in many cases what is estimated may be close to, or may contain information about, the parameter of interest, that this is actually so requires demonstration and is not true in general (see Heckman and Sergio Urzua 2009, who analyze cases where the LATE is an uninteresting and potentially misleading assemblage of parts of the underlying structure).

There is a related issue that bedevils a good deal of contemporary applied work, which is the understanding of exogeneity, a word that I have so far avoided. Suppose, for the moment, that the effect of railway stations on poverty is the same in all cities and we are looking for an instrument, which is required to be exogenous in order to consistently estimate θ. According to Merriam-Webster’s dictionary, “exogenous” means “caused by factors or an agent from outside the organism or system,” and this common usage is often employed in applied work. However, the consistency of IV estimation requires that the instrument be orthogonal to the error term v in the equation of interest, which is not implied by the Merriam-Webster definition (see Edward E. Leamer 1985, p. 260). Jeffrey M. Wooldridge (2002, p. 50) warns his readers that “you should not rely too much on the meaning of ‘endogenous’ from other branches of economics” and goes on to note that the usage in econometrics, while related to traditional definitions, “is used broadly to describe any situation where an explanatory variable is correlated with the disturbance.” Heckman (2000) suggests using the term “external” (which he traces back to Wright and Frisch in the 1930s) for the Merriam-Webster definition, for variables whose values are not set or caused by the variables in the model, and keeping “exogenous” for the orthogonality condition that is required for consistent estimation in this instrumental variable context. The terms are hardly standard, but I adopt them here because I need to make the distinction. The main issue, however, is not the terminology but that the two concepts be kept distinct, so that we can see when the argument being offered is a justification for externality when what is required is a justification for exogeneity. An instrument that is external, but not exogenous, will not yield consistent estimates of the parameter of interest, even when the parameter of interest is a constant.


An alternative approach is to keep the Merriam-Webster (or “other branches of economics”) definition for exogenous and to require that, in addition to being exogenous, an instrument satisfy the “exclusion restrictions” of being uncorrelated with the disturbance. I have no objection to this usage, though the need to defend these additional restrictions is not always appreciated in practice. Yet exogeneity in this sense has no consequences for the consistency of econometric estimators and so is effectively meaningless.

Failure to separate externality and exogeneity—or to build a case for the validity of the exclusion restrictions—has caused, and continues to cause, endless confusion in the applied development (and other) literatures. Natural or geographic variables—distance from the equator (as an instrument for per capita GDP in explaining religiosity, Rachel M. McCleary and Robert J. Barro 2006), rivers (as an instrument for the number of school districts in explaining educational outcomes, Caroline M. Hoxby 2000), land gradient (as an instrument for dam construction in explaining poverty, Duflo and Rohini Pande 2007), or rainfall (as an instrument for economic growth in explaining civil war, Edward Miguel, Shanker Satyanath, and Ernest Sergenti 2004, and the examples could be multiplied ad infinitum)—are not affected by the variables being explained, and are clearly external. So are historical variables—the mortality of colonial settlers is not influenced by current institutional arrangements in ex-colonial countries (Daron Acemoglu, Simon Johnson, and James A. Robinson 2001), nor does a country’s growth rate today influence the identity of its past colonizers (Barro 1998). Whether any of these instruments is exogenous (or satisfies the exclusion restrictions) depends on the specification of the equation of interest, and is not guaranteed by its externality. And because exogeneity is an identifying assumption that must be made prior to analysis of the data, empirical tests cannot settle the question. This does not prevent many attempts in the literature, often by misinterpreting a satisfactory overidentification test as evidence for valid identification. Such tests can tell us whether estimates change when we select different subsets from a set of possible instruments. While the test is clearly useful and informative, acceptance is consistent with all of the instruments being invalid, while failure is consistent with a subset being correct. Passing an overidentification test does not validate instrumentation.
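To see why acceptance of an overidentification test is consistent with all of the instruments being invalid, consider the following invented two-instrument sketch (mine, not from the paper): both instruments fail the exclusion restriction in the same way, through a common omitted factor, so each delivers the same wrong answer, and the Sargan-type statistic, which in effect only compares the instruments with each other, remains a central chi-square draw with no power to detect the problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta = 1.0                         # true coefficient

q = rng.normal(size=n)             # omitted variable
z1 = q + rng.normal(size=n)        # both "instruments" load on q, so
z2 = q + rng.normal(size=n)        # both violate the exclusion restriction
x = z1 + z2 + q + rng.normal(size=n)
y = beta * x + 2.0 * q + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z1, z2])
X = np.column_stack([np.ones(n), x])

# Two-stage least squares: project X on Z, regress y on the projection.
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
b = np.linalg.lstsq(Xhat, y, rcond=None)[0]
print("2SLS estimate of beta:", b[1], "(true value 1.0; plim here is 1.5)")

# Sargan statistic: n * R-squared from regressing 2SLS residuals on Z.
e = y - X @ b
ehat = Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
print("Sargan statistic:", n * ehat.var() / e.var(),
      "(compare the chi-square(1) 5% critical value, 3.84)")
```

By symmetry, z1 alone and z2 alone both converge to 1.5 rather than 1.0, so the test, which asks whether the instruments give the same answer, typically does not reject even though both are invalid.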

In my running example, earthquakes and rivers are external to the system and are neither caused by poverty nor by the construction of railway stations, and the designation as an infrastructure zone may also be determined by factors independent of poverty or railways. But even earthquakes (or rivers) are not exogenous if they have an effect on poverty other than through their destruction (or encouragement) of railway stations, as will almost always be the case. The absence of simultaneity does not guarantee exogeneity—exogeneity requires the absence of simultaneity but is not implied by it. Even random numbers—the ultimate external variables—may be endogenous, at least in the presence of heterogeneous effects, if agents choose to accept or reject their assignment in a way that is correlated with the heterogeneity. Again, the example comes from Heckman’s (1997) discussion of Angrist’s (1990) famous use of draft lottery numbers as an instrumental variable in his analysis of the subsequent earnings of Vietnam veterans.

I can illustrate Heckman’s argument using the Chinese railways example with the zone designation as instrument. Rewrite the equation of interest, (5), as

(7) Pc = γ + θ̄ Rc + wc = γ + θ̄ Rc + {vc + (θ − θ̄)Rc},

where wc is defined by the term in curly brackets, and θ̄ is the mean of θ over the cities that get the station, so that the compound error term w has mean zero. Suppose the designation as an infrastructure zone is Dc, which takes values 1 or 0, and that the Chinese bureaucracy, persuaded by young development economists, decides to randomize and designates cities by flipping a yuan. For consistent estimation of θ̄, we want the covariance of the instrument with the error to be zero. The covariance is

(8) E(Dc wc) = E[(θ − θ̄)RD] = E[(θ − θ̄) | D = 1, R = 1] × P(D = 1, R = 1),

which will be zero if either (a) the average effect of building a railway station on poverty among the cities induced to build one by the designation is the same as the average effect among those that would have built one anyway, or (b) no city that is not designated builds a railway station. If (b) is not guaranteed by fiat, we cannot suppose that it will otherwise hold, and we might reasonably hope that, among the cities that build railway stations, those induced to do so by the designation are those where there is the largest effect on poverty, which violates (a). In the example of the Vietnam veterans, the instrument (the draft lottery number) fails to be exogenous because the error term in the earnings equation depends on each individual’s rate of return to schooling, and whether or not each potential draftee accepted their assignment—their veteran’s status—depends on that rate of return. This failure of exogeneity is referred to by Richard Blundell and Monica Costa Dias (2009) as selection on idiosyncratic gain, and it adds to any bias caused by any failure of the instrument to be orthogonal to vc, ruled out here by the randomness of the instrument.
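The covariance in (8) can be checked numerically. In this sketch (my numbers and take-up rule again, continuing the simulation above), the designation Dc is a fair coin flip, so it is orthogonal to vc by construction, yet its sample covariance with the compound error wc is clearly nonzero because take-up is selected on the gains.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

theta = rng.normal(-2.0, 1.0, n)
keenness = -theta + rng.normal(0.0, 1.0, n)

D = rng.integers(0, 2, n)                  # the randomized yuan flip
R = ((keenness > 2.0) | ((D == 1) & (keenness > 0.5))).astype(float)
v = rng.normal(0.0, 1.0, n)

# theta_bar is the mean of theta over cities that get a station, so the
# compound error w = v + (theta - theta_bar) * R has mean zero.
theta_bar = theta[R == 1].mean()
w = v + (theta - theta_bar) * R

print("cov(D, v):", np.cov(D, v)[0, 1])   # ~0: randomization works
print("cov(D, w):", np.cov(D, w)[0, 1])   # nonzero: condition (a) fails
```

Condition (b) fails here because undesignated cities with high enough keenness build anyway, and condition (a) fails because the designation recruits builders with systematically different θ, so the instrument is external but not exogenous.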

The general lesson is once again the ultimate futility of trying to avoid thinking about how and why things work—if we do not do so, we are left with undifferentiated heterogeneity that is likely to prevent consistent estimation of any parameter of interest. One appropriate response is to specify exactly how cities respond to their designation, an approach that leads to Heckman’s local instrumental variable methods (Heckman and Edward Vytlacil 1999, 2007; Heckman, Urzua, and Vytlacil 2006). In a similar vein, David Card (1999) reviews estimates of the rate of return to schooling and explores how the choice of instruments leads to estimates that are averages over different subgroups of the population so that, by thinking about the implicit selection, evidence from different studies can be usefully summarized and compared. Similar questions are pursued in Gerard van den Berg (2008).

    3. Instruments of Development

The question of whether aid has helped economies grow faster is typically asked within the framework of standard growth regressions. These regressions use data for many countries over a period of years, usually from the Penn World Table, the current version of which provides data on real per capita GDP and its components in purchasing power dollars for more than 150 countries as far back as 1950. The model to be estimated has the rate of growth of per capita GDP as the dependent variable, while the explanatory variables include the lagged value of GDP per capita, the share of investment in GDP, and measures of the educational level of the population (see, for example, Barro and Xavier Sala-i-Martin 1995, chapter 12, for an overview). Other variables are often added, and my main concern here is with one of these, external assistance (aid) as a fraction of GDP. A typical specification can be written

(9) ∆ln Yc,t+1 = β0 + β1 ln Yct + β2 (Ict/Yct) + β3 Hct + β4 Zct + θAct + uct,

where Y is per capita GDP, I is investment, H is a measure of human capital or education, and A is the variable of interest, aid as a share of GDP. Z stands for whatever other variables are included. The index c is for country and t for time. Growth is rarely measured on a year-to-year basis—the data in the Penn World Table are not suitable for annual analysis—so that growth may be measured over ten, twenty, or forty year intervals. With around forty years of data, there are four, two, or one observations for each country.

An immediate question is whether the growth equation (9) is a model-based Cowles-type equation, as in my national income example, or whether it is more akin to the atheoretical analysis in my invented Chinese railway example. There are elements of both here. If we ignore the Z and A variables in (9), the model can be thought of as a Solow growth model, extended to add human capital to physical capital (see again Barro and Sala-i-Martin, who derive their empirical specifications from the theory, and also N. Gregory Mankiw, David Romer, and David N. Weil 1992, who extended the Solow model to include education). However, the addition of the other variables, including aid, is typically less well justified. In some cases, for example under the assumption that all aid is invested, it is possible to calculate what effect we might expect aid to have (see Raghuram G. Rajan and Arvind Subramanian 2008). If we follow this route, (9) would not be useful—because aid is already included—and we should instead investigate whether aid is indeed invested, and then infer the effectiveness of aid from the effectiveness of investment. Even so, it presumably matters what kind of investment is promoted by aid, and aid for roads, for dams, for vaccination programs, or for humanitarian purposes after an earthquake are likely to have different effects on subsequent growth. More broadly, one of the main issues of contention in the debate is what aid actually does. Just to list a few of the possibilities: does aid increase investment, does aid crowd out domestic investment, is aid stolen, does aid create rent-seeking, or does aid undermine the institutions that are required for growth? Once all of these possibilities are admitted, it is clear that the analysis of (9) is not a Cowles model at all, but is seen as analogous to a biomedical experiment in which different countries are “dosed” with different amounts of aid, and we are trying to measure the average response. As in the Chinese railways case, a regression such as (9) will not give us what we want because the doses of aid are not randomly administered to different countries, so our first task is to find an instrumental variable that will generate quasi-randomness.

The most obvious problem with a regression of aid on growth is the simultaneous feedback from growth to aid that is generated by humanitarian responses to economic collapse or to natural or man-made disasters that engender economic collapse. More generally, aid flows from rich countries to poor countries, and poor countries, almost by definition, are those with poor records of economic growth. This feedback, from low growth to high aid, will obscure, nullify, or reverse any positive effects of aid. Most of the literature attempts to eliminate this feedback by using one or more instrumental variables and, although they would not express it in these terms, the aim of the instrumentation is to restore a situation in which the pure effect of aid on growth can be observed as if in a randomized situation. How close we get to this ideal depends, of course, on the choice of instrument.
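The sign of this bias is easy to reproduce. In the sketch below, which is entirely made up and only mimics the shape of regression (9), aid has no effect at all on growth, but humanitarian aid flows toward countries hit by bad growth shocks; OLS then returns a spuriously negative aid coefficient of the kind the literature reports.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150 * 4                          # 150 countries, four ten-year spells

lnY0 = rng.normal(8.0, 1.0, n)       # initial log GDP per capita
inv = rng.uniform(0.05, 0.40, n)     # investment share of GDP
school = rng.uniform(2.0, 12.0, n)   # human capital proxy
u = rng.normal(0.0, 0.02, n)         # growth shocks: collapses, disasters

# Compensatory aid: rises when the growth shock is bad. True theta = 0.
aid = np.clip(0.05 - 1.5 * u + rng.normal(0.0, 0.02, n), 0.0, None)

growth = 0.30 - 0.03 * lnY0 + 0.10 * inv + 0.002 * school + 0.0 * aid + u

X = np.column_stack([np.ones(n), lnY0, inv, school, aid])
coef, *_ = np.linalg.lstsq(X, growth, rcond=None)
print("OLS estimate of theta on aid:", coef[-1], "(true effect is zero)")
```

An IV estimate would replace aid by its projection on instruments supposed to be unrelated to u, which is exactly where the population and “Egypt” instruments discussed next come in.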


Although there is some variation across studies, there is a standard set of instruments, originally proposed by Peter Boone (1996), which includes the log of population size and various country dummies, for example, a dummy for Egypt or for francophone West Africa. One or both of these instruments are used in almost all the papers in a large subsequent literature, including Craig Burnside and David Dollar (2000), Henrik Hansen and Finn Tarp (2000, 2001), Carl-Johan Dalgaard and Hansen (2001), Patrick Guillaumont and Lisa Chauvet (2001), Robert Lensink and Howard White (2001), Easterly, Ross Levine, and David Roodman (2004), Dalgaard, Hansen, and Tarp (2004), Michael Clemens, Steven Radelet, and Rikhil Bhavnani (2004), Rajan and Subramanian (2008), and Roodman (2007). The rationale for population size is that larger countries get less aid per capita because the aid agencies allocate aid on a country basis, with less than full allowance for population size. The rationale for what I shall refer to as the “Egypt instrument” is that Egypt gets a great deal of American aid as part of the Camp David accords in which it agreed to a partial rapprochement with Israel. The same argument applies to the francophone countries, which receive additional aid from France because of their French colonial legacy. By comparing these countries with countries not so favored, or by comparing populous with less populous countries, we can observe a kind of variation in the share of aid in GDP that is unaffected by the negative feedback from poor growth to compensatory aid. In effect, we are using the variation across populations of different sizes as a natural experiment to reveal the effects of aid.

If we examine the effects of aid on growth without any allowance for reverse causality, for example by estimating equation (9) by OLS, the estimated effect is typically negative. For example, Rajan and Subramanian (2008), in one of the most careful recent studies, find that an increase in aid by one percent of GDP comes with a reduction in the growth rate of one-tenth of a percentage point a year. Easterly (2006) provides many other (sometimes spectacular) examples of negative associations between aid and growth. When instrumental variables are used to eliminate the reverse causality, Rajan and Subramanian find a weak or zero effect of aid and contrast that finding with the robust positive effects of investment on growth in specifications like (9). I should note that, although Rajan and Subramanian’s study is an excellent one, it is certainly not without its problems and, as the authors note, there are many difficult econometric problems over and above the choice of instruments, including how to estimate dynamic models with country fixed effects on limited data, the choice of countries and sample period, the type of aid that needs to be considered, and so on. Indeed, it is those other issues that are the focus of most of the literature cited above. The substance of this debate is far from over.

My main concern here is with the use of the instruments, what they tell us, and what they might tell us. The first point is that neither the “Egypt” (or colonial heritage) instrument nor the population instrument is plausibly exogenous; both are external—Camp David is not part of the model, nor was it caused by Egypt’s economic growth, and similarly for population size—but exogeneity would require that neither “Egypt” nor population size have any influence on economic growth except through the effects on aid flows, which makes no sense at all. We also need to recognize the heterogeneity in the aid responses and try to think about how the different instruments are implicitly choosing different averages, involving different weightings or subgroups of countries. Or we could stop right here, conclude that there are no valid instruments, and that the aid-to-growth question is not answerable in this way. I shall argue otherwise, but I should also note that similar challenges over the validity of instruments have become routine in applied econometrics, leading to widespread skepticism by some, while others press on undaunted in an ever more creative search for exogeneity.

Yet consideration of the instruments is not without value, especially if we move away from instrumental variable estimation, with the use of instruments seen as technical, not substantive, and think about the reduced form, which contains substantive information about the relationship between growth and the instruments. For the case of population size, we find that, conditional on the other variables, population size is unrelated to growth, which is one of the reasons that the IV estimates of the effects of aid are small or zero. This (partial) regression coefficient is a much simpler object than is the instrumental variable estimate; under standard assumptions, it tells us how much faster large countries grow than small countries, once the standard effects of the augmented Solow model have been taken into account. Does this tell us anything about the effectiveness of aid? Not directly, though it is surely useful to know that, while larger countries receive less per capita aid in relation to per capita income, they grow just as fast as countries that have received more, once we take into account the amount that they invest, their levels of education, and their starting level of GDP. But we would hardly conclude from this fact alone that aid does not increase growth. Perhaps aid works less well in small countries, or perhaps there is an offsetting positive effect of population size on economic growth. Both are possible and both are worth further investigation. More generally, such arguments are susceptible to fruitful discussions, not only among economists, but also with other social scientists and historians who study these questions, something that is typically difficult with instrumental variable methods. Economists’ claims to methodological superiority based on instrumental variables ring particularly hollow when it is economists themselves who are so often misled. My argument is that, for both economists and noneconomists, the direct consideration of the reduced form is likely to generate productive lines of enquiry.

The case of the “Egypt” instrument is somewhat different. Once again the reduced form is useful (Egypt doesn’t grow particularly fast in spite of all the aid it gets in consequence of Camp David), though mostly for making it immediately clear that the comparison of Egypt versus non-Egypt, or francophone versus nonfrancophone, is not a useful way of assessing the effectiveness of aid on growth. There is no reason to suppose that “being Egypt” has no effect on its growth other than through aid from the United States. Yet almost every paper in this literature unquestioningly uses the Egypt dummy as an instrument. Similar instruments based on colonial heritage face exactly the same problem; colonial heritage certainly affects aid, and colonial heritage is not influenced by current growth performance, but different colonists behaved differently and left different legacies of institutions and infrastructure, all of which have their own persistent effects on growth today.

The use of population size, an Egypt dummy, or colonial heritage variables as instruments in the analysis of aid effectiveness cannot be justified. These instruments are external, not exogenous, or, if we use the Merriam-Webster definition of exogeneity, they clearly fail the exclusion restrictions. Yet they continue in almost universal use in the aid-effectiveness literature, and are endorsed for this purpose by the leading exponents of IV methods (Angrist and Pischke 2010).

I conclude this section with an example that helps bridge the gap between analyses of the macro and analyses of the micro effects of aid. Many microeconomists agree that instrumentation in cross-country regressions is unlikely to be useful, while claiming that microeconomic analysis is capable of doing better. We may not be able to answer ill-posed questions about the macroeconomic effects of foreign assistance, but we can surely do better on specific projects and programs. Abhijit Vinayak Banerjee and Ruimin He (2008) have provided a list of the sort of studies that they like and that they believe should be replicated more widely. One of these, also endorsed by Duflo (2004), is a famous paper by Angrist and Victor Lavy (1999) on whether schoolchildren do better in smaller classes, a position frequently endorsed by parents and by teachers’ unions but not always supported by empirical work. The question is an important one for development assistance because smaller class sizes cost more and are a potential use for foreign aid. Angrist and Lavy’s paper uses a natural experiment, not a real one, and relies on IV estimation, so it provides a bridge between the relatively weak natural experiments in this section and the actual randomized controlled trials in the next.

Angrist and Lavy’s study is about the allocation of children enrolled in a school into classes. Many countries set their class sizes to conform to some version of Maimonides’ rule, which sets a maximum class size, beyond which additional teachers must be found. In Israel, the maximum class size is set at 40. If there are fewer than 40 children enrolled, they will all be in the same class. If there are 41, there will be two classes, one of 20 and one of 21. If there are 81 or more children, the first two classes will be full, and more must be set up. Angrist and Lavy’s figure 1 plots actual class size and Maimonides’ rule class size against the number of children enrolled; this graph starts off running along the 45-degree line, falls discontinuously to 20.5 when enrollment reaches 41, increases with slope 0.5 up to an enrollment of 80, falls to 27 (81 divided by 3) at 81, rises again with slope one-third, and so on. They show that actual class sizes, while not exactly conforming to the rule, are strongly influenced by it and exhibit the same saw-tooth pattern. They then plot test scores against enrollment and show that they display the opposite pattern, rising at each of the discontinuities where class size abruptly falls. This is a natural experiment, with Maimonides’ rule inducing quasi-experimental variation and generating a predicted class size for each level of enrollment, which serves as an instrumental variable in a regression of test scores on class size. These IV estimates, unlike the OLS estimates, show that children in smaller classes do better.
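The instrument itself is a deterministic function of enrollment. Here is a minimal sketch (mine) of the saw-tooth, using the predicted-class-size formula usually attributed to Angrist and Lavy (1999): enrollment divided by the number of classes that the cap of 40 forces.

```python
import numpy as np

def maimonides(enrollment):
    """Predicted class size under Maimonides' rule with a cap of 40."""
    n_classes = (enrollment - 1) // 40 + 1   # classes forced by the cap
    return enrollment / n_classes

e = np.arange(1, 161)
f = maimonides(e)
print(f[39], f[40])   # 40.0 at enrollment 40, 20.5 at enrollment 41
print(f[79], f[80])   # 40.0 at enrollment 80, 27.0 at enrollment 81

# In the Angrist-Lavy design, f(enrollment) instruments actual class
# size in a regression of test scores on class size, with smooth
# controls in enrollment, so that only the discontinuous saw-tooth
# variation identifies the class-size effect.
```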

Angrist and Lavy’s paper, the creativity of its method, and the clarity of its results have set the standard for micro empirical work since it was published, and it has had a far-reaching effect on subsequent empirical work in labor and development economics. Yet there is a problem, which has become apparent over time. Note first the heterogeneity; it is improbable that the effect of lower class size is the same for all children so that, under the assumptions of the LATE theorem, the IV estimate recovers a weighted average of the effects for those children who are shifted by Maimonides’ rule from a larger to a smaller class. Those children might not be the same as other children, which makes it hard to know how useful the numbers might be in other contexts, for example when all children are put in smaller classes. The underlying reasons for this heterogeneity are not addressed in this quasi-experimental approach. To be sure of what is happening here, we need to know more about how different children finish up in different classes, which raises the possibility that the variation across the discontinuities may not be orthogonal to other factors that affect test scores.

A recent paper by Miguel Urquiola and Eric Verhoogen (2009) explores how it is that children are allocated to different class sizes in a related, but different, situation in Chile, where a version of Maimonides’ rule is in place. Urquiola and Verhoogen note that parents care a great deal about whether their children are in the 40-child class or the 20-child class and, for the private schools they study, they construct a model in which there is sorting across the boundary, so that the children in the smaller classes have richer, more educated parents than the children in the larger classes. Their data match such a model, so that at least some of the differences in test scores across class size come from differences in the children that would be present whatever the class size. This paper is an elegant example of why it is so dangerous to make inferences from natural experiments without understanding the mechanisms at work. It also strongly suggests that the question of the effects of class size on student performance is not well defined without a description of the environment in which class size is being changed (see also Christopher A. Sims 2010).

Another good example comes to me in private correspondence from Branko Milanovic, who was a child in Belgrade in a school only half of whose teachers could teach in English, and who was randomly assigned to a class taught in Russian. He remembers losing friends whose parents (correctly) perceived the superior value of an English education, and were insufficiently dedicated socialists to accept their assignment. The two language groups of children remaining in the school, although “randomly” assigned, are far from identical, and IV estimates using the randomization could overstate the superiority of the English-medium education.

More generally, these are examples where an instrument induces actual or quasi-random assignment, here of children into different classes, but where the assignment can be undone, at least partially, by the actions of the subjects. If children—or their parents—care about whether they are in small or large classes, or in Russian or English classes, some will take evasive action—by protesting to authorities, finding a different school, or even moving—and these actions will generally differ by rich and poor children, by children with more or less educated parents, or by any factor that affects the cost to the child of being in a larger class. The behavioral response to the quasi-randomization (or indeed randomization) means that the groups being compared are not identical to start with (see also Justin McCrary 2008 and David S. Lee and Thomas Lemieux 2009 for further discussion and for methods of detection when this is happening).

In preparation for the next section, I note that the problem here is not the fact that we have a quasi-experiment rather than a real experiment, so that there was no actual randomization. If children had been randomized into class size, as in the Belgrade example, the problems would have been the same unless there had been some mechanism for forcing the children (and their parents) to accept the assignment.

    4. Randomization in the Tropics

Skepticism about econometrics, doubts about the usefulness of structural models in economics, and the endless wrangling over identification and instrumental variables have led to a search for alternative ways of learning about development. There has also been frustration with the World Bank’s apparent failure to learn from its own projects and its inability to provide a convincing argument that its past activities have enhanced economic growth and poverty reduction. Past development practice is seen as a succession of fads, with one supposed magic bullet replacing another—from planning to infrastructure to human capital to structural adjustment to health and social capital to the environment and back to infrastructure—a process that seems not to be guided by progressive learning. For many economists, and particularly for the group at the Poverty Action Lab at MIT, the solution has been to move toward randomized controlled trials of projects, programs, and policies. RCTs are seen as generating gold standard evidence that is superior to econometric evidence and that is immune to the methodological criticisms that are typically directed at econometric analyses. Another aim of the program is to persuade the World Bank to replace its current evaluation methods with RCTs; Duflo (2004) argues that randomized trials of projects would generate knowledge that could be used elsewhere, an international public good. Banerjee (2007b, chapter 1) accuses the Bank of “lazy thinking,” of a “resistance to knowledge,” and notes that its recommendations for poverty reduction and empowerment show a striking “lack of distinction made between strategies founded on the hard evidence provided by randomized trials or natural experiments and the rest.” In all this there is a close parallel with the evidence-based movement in medicine that preceded it, and the successes of RCTs in medicine are frequently cited. Yet the parallels are almost entirely rhetorical, and there is little or no reference to the dissenting literature, as surveyed, for example, in John Worrall (2007), who documents the rise and fall in medicine of the rhetoric used by Banerjee. Nor is there any recognition of the many problems of medical RCTs, some of which I shall discuss as I go.

    The movement in favor of RCTs is currently very successful. The World Bank is now conducting substantial numbers of randomized trials, and the methodology is sometimes explicitly requested by governments, who supply the World Bank with funds for this purpose (see World Bank 2008 for details of the Spanish Trust Fund for Impact Evaluation). There is a new International Initiative for Impact Evaluation, which “seeks to improve the lives of poor people in low- and middle-income countries by providing, and summarizing, evidence of what works, when, why and for how much” (International Initiative for Impact Evaluation 2008), although not exclusively by randomized controlled trials. The Poverty Action Lab lists dozens of completed and ongoing projects in a large number of countries, many of which are project evaluations. Many development economists would join many physicians in subscribing to the jingoist view proclaimed by the editors of the British Medical Journal (quoted by Worrall 2007, p. 996), which noted that “Britain has given the world Shakespeare, Newtonian physics, the theory of evolution, parliamentary democracy—and the randomized trial.”

4.1 The Ideal RCT

    Under ideal conditions and when correctly executed, an RCT can estimate certain quantities of interest with minimal assumptions, thus absolving RCTs of one complaint against econometric methods—that they rest on often implausible economic models. It is useful to lay out briefly the (standard) framework for these results, originally due to Jerzy Neyman in the 1920s, currently often referred to as the Holland–Rubin framework or the Rubin causal model (see Freedman 2006 for a discussion of the history). According to this, each member of the population under study, labeled i, has two possible values associated with it, Y_{i0} and Y_{i1}, which are the outcomes that i would display if it did not get the treatment, T_i = 0, and if it did get the treatment, T_i = 1. Since each i is either in the treatment group or in the control group, we observe one of Y_{i0} and Y_{i1} but not both. We would like to know something about the distribution over i of the effects of the treatment, Y_{i1} − Y_{i0}, in particular its mean Ȳ_1 − Ȳ_0. In a sense, the most surprising thing about this set-up is that we can say anything at all without further assumptions or without any modeling. But that is the magic that is wrought by the randomization.

    What we can observe in the data is the difference between the average outcome in the treatments and the average outcome in the controls, or E(Y_i | T_i = 1) − E(Y_i | T_i = 0). This difference can be broken up into two terms

(10)  E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0)
        = [E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 1)]
        + [E(Y_{i0} | T_i = 1) − E(Y_{i0} | T_i = 0)].

Note that on the right hand side the second term in the first square bracket cancels out with the first term in the second square bracket. But the term in the second square bracket is zero by randomization; the nontreatment outcomes, like any other characteristic, are identical in expectation in the control and treatment groups. We can therefore write (10) as

(11)  E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0)
        = [E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 1)],

so that the difference in the two observable outcomes is the difference between the average treated outcome and the average untreated outcome in the treatment group. The last term on the right hand side would be unobservable in the absence of randomization.

    We are not quite done. What we would like is the average of the difference, rather than the difference of averages that is currently on the right-hand side of (11). But the expectation is a linear operator, so that the difference of the averages is identical to the average of the differences, and we reach, finally,

(12)  E(Y_{i1} | T_i = 1) − E(Y_{i0} | T_i = 0) = E(Y_{i1} − Y_{i0} | T_i = 1).

The difference in means between the treatments and controls is an estimate of the average treatment effect among the treated which, since the treatments and controls differ only by randomization, is an estimate of the average treatment effect for all. This standard but remarkable result depends both on randomization and on the linearity of expectations.

    One immediate consequence of this derivation is a fact that is often quoted by critics of RCTs, but often ignored by practitioners, at least in economics: RCTs are informative about the mean of the treatment effects, Y_{i1} − Y_{i0}, but do not identify other features of the distribution. For example, the median of the difference is not the difference in medians, so an RCT is not, by itself, informative about the median treatment effect, something that could be of as much interest to policymakers as the mean treatment effect. It might also be useful to know the fraction of the population for which the treatment effect is positive, which once again is not identified from a trial. Put differently, the trial might reveal an average positive effect although nearly all of the population is hurt with a few receiving very large benefits, a situation that cannot be revealed by the RCT, although it might be disastrous if implemented. Indeed, Ravi Kanbur (2001) has argued that much of the disagreement about development policy is driven by differences of this kind.
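
    To make the point concrete, here is a purely illustrative simulation; the numbers, and the use of Python with numpy, are my own assumptions rather than anything in the original argument. The difference in means recovers the mean treatment effect, while the difference in medians says nothing about the median treatment effect:

    # Illustrative only: randomization identifies the mean treatment effect,
    # but not the median treatment effect or the fraction of units helped.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Heterogeneous effects: 90 percent of units are slightly hurt (-0.1),
    # 10 percent gain a lot (+2.0), so the mean effect is +0.11 while the
    # median effect is -0.1 and most units are worse off.
    y0 = rng.normal(0.0, 1.0, n)
    alpha = np.where(rng.random(n) < 0.9, -0.1, 2.0)   # alpha_i = Y_i1 - Y_i0
    y1 = y0 + alpha

    t = rng.random(n) < 0.5                 # random assignment
    y = np.where(t, y1, y0)                 # one potential outcome seen per unit

    print(y[t].mean() - y[~t].mean(), alpha.mean())              # both ~ +0.11
    print(np.median(y[t]) - np.median(y[~t]), np.median(alpha))  # ~ +0.03 vs -0.1

    The difference in medians here is a small positive number even though the median unit is made worse off, which is exactly the potentially disastrous case described above.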

    Given the minimal assumptions that go into an RCT, it is not surprising that it cannot tell us everything that we would like to know. Heckman and Smith (1995) discuss these issues at greater length and also note that, in some circumstances, more can be learned. Essentially, the RCT gives us two marginal distributions from which we would like to infer a joint distribution; this is impossible, but the marginal distributions limit the joint distribution in ways that can be useful. For example, Charles F. Manski (1996)
notes that a planner who is maximizing the expected value of a social welfare function needs only the two marginal distributions to check the usefulness of the treatment. Beyond that, if the probability distribution of outcomes among the treated stochastically dominates the distribution among the controls, we know that appropriately defined classes of social welfare functions will show an improvement without having to know what the social welfare function is. Not all relevant cases are covered by these examples; even if a drug saves lives on average, we need to know whether it is uniformly beneficial or kills some and saves more. To answer such questions, we will have to make assumptions beyond those required for an RCT; as usual, some questions can be answered with fewer assumptions than others.
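
    A dominance check of this kind uses only the two marginal distributions that the trial does identify. The sketch below is my own construction, with artificial samples deliberately built so that the verdicts are guaranteed; it compares the empirical CDFs at every observed outcome value:

    # First-order stochastic dominance from the two marginal distributions.
    import numpy as np

    def dominates(treated, control):
        # Treated dominates control if its empirical CDF lies weakly below
        # the control CDF at every observed outcome value.
        grid = np.union1d(treated, control)
        cdf_t = np.searchsorted(np.sort(treated), grid, side="right") / len(treated)
        cdf_c = np.searchsorted(np.sort(control), grid, side="right") / len(control)
        return bool(np.all(cdf_t <= cdf_c))

    rng = np.random.default_rng(1)
    base = rng.normal(0.0, 1.0, 5000)
    print(dominates(base + 0.5, base))        # pure location shift: True
    print(dominates(2.0 * base + 0.2, base))  # higher mean, higher spread: False

    The second case is the drug that helps on average but spreads the outcome distribution: the mean comparison is favorable, yet dominance fails, so the welfare conclusion depends on which social welfare function one holds.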

    In practice, researchers who conduct randomized controlled trials often do present results on statistics other than the mean. For example, the results can be used to run a regression of the form

(13)  Y_i = β_0 + β_1 T_i + Σ_j θ_j X_{ij} + Σ_j ϕ_j X_{ij} × T_i + u_i,

     where T   is a binary variable that indicatestreatment status, and the  X ’s are variouscharacteristics measured at baseline thatare included in the regression both on theirown (main effects) and as interactions with

    treatment status (see Suresh de Mel, DavidMcKenzie, and Christopher Woodruff 2008for an example of a field experiment withmicro-enterprises in Sri Lanka). The esti-mated treatment effect now varies across thepopulation, so that it is possible, for example,to estimate whether the average treatmenteffect is positive or negative for various sub-groups of interest. These estimates dependon more assumptions than the trial itself,

    in particular on the validity of running aregression like (13), on which I shall havemore to say below. One immediate chargeagainst such a procedure is data mining. Asufficiently determined examination of anytrial will eventually reveal some subgroupfor which the treatment yielded a significanteffect of some sort, and there is no general

     way of adjusting standard errors to protectagainst the possibility. A classic example frommedicine comes from the ISIS-2 trial of theuse of aspirin after heart attacks (RichardPeto, Rory Collins, and Richard Gray 1995).A randomized trial established a beneficialeffect with a significance level of better than

    10−

    6,  yet ex post analysis of the data showedthat there was no significant effect for trialsubjects whose astrological signs were Libraor Gemini. In drug trials, the FDA rulesrequire that analytical plans be submittedprior to trials and drugs cannot be approvedbased on ex post data analysis. As noted byRobert J. Sampson (2008), one analysis ofthe recent Moving to Opportunity experi-ment has an appendix listing tests of manythousands of outcomes (Lisa Sanbonmatsuet al. 2006).
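
    The arithmetic of the problem is easy to see. Under my simplifying assumption of independent subgroup tests at the 5 percent level with no true effect anywhere (real subgroups overlap, which is part of why no general correction exists), the chance of at least one spurious “significant” subgroup grows quickly with the number of tests:

    # False-positive arithmetic for ex post subgroup searches (illustrative).
    for k in (1, 12, 50):   # e.g., k = 12 for the twelve astrological signs
        print(k, "independent tests at 5%:", round(1 - 0.95**k, 2))
    # prints 0.05, 0.46, and 0.92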

    I am not arguing against posttrial subgroup analysis, only that, as is enshrined in the FDA rules, any special epistemic status (as in “gold standard,” “hard,” or “rigorous” evidence) possessed by RCTs does not extend to ex post subgroup analysis, if only because there is no guarantee that a new RCT on post-experimentally defined subgroups will yield the same result. Such analyses do not share any special evidential status that might arguably be accorded to RCTs and must be assessed in exactly the same way as we would assess any nonexperimental or econometric study. These issues are wonderfully exposed by the subgroup analysis of drug effectiveness by Ralph I. Horwitz et al. (1996), criticized by Douglas G. Altman (1998), who refers to such studies as “a false trail,” by Stephen Senn and Frank Harrell (1997),
who call them “wisdom after the event,” and by George Davey Smith and Matthias Egger (1998), who call them “incommunicable knowledge,” drawing the response by Horwitz et al. (1997) that their critics have reached “the tunnel at the end of the light.”

    While it is clearly absurd to discard data because we do not know how to analyze them with sufficient purity, and while many important findings have come from posttrial analysis of experimental data, both in medicine and in economics, for example of the negative income tax experiments of the 1960s, the concern about data mining remains real enough. In large-scale, expensive trials, a zero or very small result is unlikely to be welcomed, and there is likely to be considerable pressure to search for some subpopulation or some outcome that shows a more palatable result, if only to help justify the cost of the trial.

    The mean treatment effect from an RCT may be of limited value for a physician or a policymaker contemplating specific patients or policies. A new drug might do better than a placebo in an RCT, yet a physician might be entirely correct in not prescribing it for a patient whose characteristics, according to the physician’s theory of the disease, might lead her to suppose that the drug would be harmful. Similarly, if we are convinced that dams in India do not reduce poverty on average, as in Duflo and Pande’s (2007) IV study, there is no implication about any specific dam, even one of the dams included in the study, yet it is always a specific dam that a policymaker has to approve. Their evidence certainly puts a higher burden of proof on those proposing a new dam, as would be the case for a physician prescribing in the face of an RCT, though the force of the evidence depends on the size of the mean effect and the extent of the heterogeneity in the responses. As was the case with the material discussed in sections 2 and 3, heterogeneity poses problems for the analysis of RCTs, just as it posed problems for nonexperimental methods that sought to approximate randomization. For this reason, in his Planning of Experiments, David R. Cox (1958, p. 15) begins his book with the assumption that the treatment effects are identical for all subjects. He notes that the RCT will still estimate the mean treatment effect with heterogeneity but argues that such estimates are “quite misleading,” citing the example of two internally homogeneous subgroups with distinct average treatment effects, so that the RCT delivers an estimate that applies to no one. Cox’s recommendation makes a good deal of sense when the experiment is being applied to the parameter of a well-specified model, but it could not be further away from most current practice in either medicine or economics.

    One of the reasons why subgroup analysis is so hard to resist is that researchers, however much they may wish to escape the straitjacket of theory, inevitably have some mechanism in mind, and some of those mechanisms can be “tested” on the data from the trial. Such “testing,” of course, does not satisfy the strict evidential standards that the RCT has been set up to satisfy and, if the investigation is constrained to satisfy those standards, no ex post speculation is permitted. Without a prior theory and within its own evidentiary standards, an RCT targeted at “finding out what works” is not informative about mechanisms, if only because there are always multiple mechanisms at work. For example, when two independent but identical RCTs in two cities in India find that children’s scores improved less in Mumbai than in Vadodara, the authors state “this is likely related to the fact that over 80 percent of the children in Mumbai had already mastered the basic language skills the program was covering” (Duflo, Rachel Glennerster, and Michael Kremer 2008). It is not clear how “likely” is established here, and there is certainly no evidence that conforms to
the “gold standard” that is seen as one of the central justifications for the RCTs. For the same reason, repeated successful replications of a “what works” experiment, i.e., one that is unrelated to some underlying or guiding mechanism, are both unlikely and unlikely to be persuasive. Learning about theory, or mechanisms, requires that the investigation be targeted toward that theory, toward why something works, not whether it works. Projects can rarely be replicated, though the mechanisms underlying success or failure will often be replicable and transportable. This means that, if the World Bank had indeed randomized all of its past projects, it is unlikely that the cumulated evidence would contain the key to economic development.

    Cartwright (2007a) summarizes the benefits of RCTs relative to other forms of evidence. In the ideal case, “if the assumptions of the test are met, a positive result implies the appropriate causal conclusion,” that the intervention “worked” and caused a positive outcome. She adds “the benefit that the conclusions follow deductively in the ideal case comes with great cost: narrowness of scope” (p. 11).

    4.2 Tropical RCTs in Practice

    How well do actual RCTs approximate the ideal? Are the assumptions generally met in practice? Is the narrowness of scope a price that brings real benefits, or is the superiority of RCTs largely rhetorical? RCTs allow the investigator to induce variation that might not arise nonexperimentally, and this variation can reveal responses that could never have been found otherwise. Are these responses the relevant ones? As always, there is no substitute for examining each study in detail, and there is certainly nothing in the RCT methodology itself that grants immunity from problems of implementation. Yet there are some general points that are worth discussion.

    The first is the seemingly obvious practical matter of how to compute the results of a trial. In theory, this is straightforward—we simply compare the mean outcome in the experimental group with the mean outcome in the control group, and the difference is the causal effect of the intervention. This simplicity, compared with the often baroque complexity of econometric estimators, is seen as one of the great advantages of RCTs, both in generating convincing results and in explaining those results to policymakers and the lay public. Yet any difference is not useful without a standard error, and the calculation of the standard error is rarely quite so straightforward. As Ronald A. Fisher (1935) emphasized from the very beginning, in his famous discussion of the tea lady, randomization plays two separate roles. The first is to guarantee that the probability law governing the selection of the control group is the same as the probability law governing the selection of the experimental group. The second is to provide a probability law that enables us to judge whether a difference between the two groups is significant. In his tea lady example, Fisher uses combinatoric analysis to calculate the exact probabilities of each possible outcome, but in practice this is rarely done.

    Duflo, Glennerster, and Kremer (2008, p. 3921) (DGK) explicitly recommend what seems to have become the standard method in the development literature, which is to run a restricted version of the regression (13), including only the constant and the treatment dummy,

(14)  Y_i = β_0 + β_1 T_i + u_i.

As is easily shown, the OLS estimate of β_1 is simply the difference between the mean of the Y_i in the experimental and control groups, which is exactly what we want. However, the standard error of β_1 from the OLS regression is not generally correct. One problem is that the variance among the
experimentals may be different from the variance among the controls, and to assume that the experiment does not affect the variance is very much against the minimalist spirit of RCTs. If the regression (14) is run with the standard heteroskedasticity correction to the standard error, the result will be the same as the formula for the standard error of the difference between two means, but not otherwise, except in the special case where there are equal numbers of experimentals and controls, in which case it turns out that the correction makes no difference and the OLS standard error is correct. It is not clear in the experimental development literature whether the correction is routinely done in practice, and the handbook review by DGK makes no mention of it, although it provides a thoroughly useful review of many other aspects of standard errors.

    Even with the correction for unequal variances, we are not quite done. The general problem of testing the significance of the difference between the means of two normal populations with different variances is known as the Fisher–Behrens problem. The test statistic computed by dividing the difference in means by its estimated standard error does not have the t-distribution when the variances are different in treatments and controls, and the significance of the estimated difference in means is likely to be overstated if no correction is made. If there are equal numbers of treatments and controls, the statistic will be approximately distributed as Student’s t, but with degrees of freedom that can be as little as half the nominal degrees of freedom when one of the two variances is zero. In general, there is also no reason to suppose that the heterogeneity in the treatment effects is normal, which will further complicate inference in small samples.
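
    The following sketch, on simulated data of my own devising, shows the practical content of these two paragraphs: with unequal variances and unequal group sizes, the classical OLS standard error on the treatment dummy differs from the two-sample formula; a heteroskedasticity-robust (HC2) standard error reproduces that formula for a single binary regressor; and a Welch t-test supplies the adjusted degrees of freedom:

    # Standard errors for a difference in means, three ways (illustrative).
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    y_c = rng.normal(0.0, 1.0, 400)   # controls: low variance, larger group
    y_t = rng.normal(0.3, 3.0, 100)   # treated: high variance, smaller group

    y = np.concatenate([y_c, y_t])
    T = np.concatenate([np.zeros(400), np.ones(100)])
    X = sm.add_constant(T)

    classical = sm.OLS(y, X).fit()               # homoskedastic OLS SE
    robust = sm.OLS(y, X).fit(cov_type="HC2")    # heteroskedasticity-robust SE
    welch_se = np.sqrt(y_t.var(ddof=1)/100 + y_c.var(ddof=1)/400)

    print(classical.bse[1], robust.bse[1], welch_se)   # robust matches welch_se
    print(stats.ttest_ind(y_t, y_c, equal_var=False))  # Welch test, adjusted df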

    Another standard practice, recommended by DGK, and which is also common in medical RCTs according to Freedman (2008), is to run the regression (14) with additional controls taken from the baseline data, or equivalently (13) with the X_{ij} but without the interactions,

(15)  Y_i = β_0 + β_1 T_i + Σ_j θ_j X_{ij} + u_i.

    The standard argument is that, if the randomization is done correctly, the X_{ij} will be orthogonal to the treatment variable T_i, so that their inclusion does not affect the estimate of β_1, which is the parameter of interest. However, by absorbing variance, as compared with (14), they will increase the precision of the estimate—this is not necessarily the case, but will often be true. DGK (p. 3924) give an example: “controlling for baseline test scores in evaluations of educational interventions greatly improves the precision of the estimates, which reduces the cost of these evaluations when a baseline test can be conducted.”
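
    The precision gain is easy to reproduce on simulated data; the numbers below are my own assumptions. A baseline variable that strongly predicts the outcome absorbs residual variance, and the standard error on the treatment dummy shrinks roughly in proportion to the residual standard deviation:

    # Why baseline controls can tighten the estimate (illustrative).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 500
    baseline = rng.normal(0.0, 1.0, n)           # e.g., a baseline test score
    T = (rng.random(n) < 0.5).astype(float)      # random assignment
    y = 0.2 * T + 0.9 * baseline + rng.normal(0.0, 0.5, n)  # true effect 0.2

    without = sm.OLS(y, sm.add_constant(T)).fit()
    with_x = sm.OLS(y, sm.add_constant(np.column_stack([T, baseline]))).fit()
    print(without.bse[1], with_x.bse[1])   # roughly 0.09 versus 0.045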

    There are two problems with this procedure. The first, which is noted by DGK, is that, as with posttrial subgroup analysis, there is a risk of data mining—trying different control variables until the experiment “works”—unless the control variables are specified in advance. Again, it is hard to tell whether or how often this dictum is observed. The second problem is analyzed by Freedman (2008), who notes that (15) is not a standard regression because of the heterogeneity of the responses. Write α_i for the (hypothetical) treatment response of unit i so that, in line with the discussion in the previous subsection, α_i = Y_{i1} − Y_{i0}, and we can write the identity

(16)  Y_i = Y_{i0} + α_i T_i = Ȳ_0 + α_i T_i + (Y_{i0} − Ȳ_0),

which looks like the regression (15) with the X’s and the error term capturing the variation in Y_{i0}. The only difference is that
the coefficient on the treatment term has an i suffix because of the heterogeneity. If we define α = E(α_i | T_i = 1), the average treatment effect among the treated, as the parameter of interest, as in section 4.1, we can rewrite (16) as

(17)  Y_i = Ȳ_0 + α T_i + (Y_{i0} − Ȳ_0) + (α_i − α) T_i.

    Finally, and to illustrate, suppose that we model the variation in Y_{i0} as a linear function of an observable scalar X_i and a residual η_i; we have

(18)  Y_i = β_0 + α T_i + θ(X_i − X̄) + [η_i + (α_i − α) T_i],

with β_0 = Ȳ_0, which is in the regression form (15) but allows us to see the links with the experimental quantities.

    Equation (18) is analyzed in some detail by Freedman (2008). It is easily shown that T_i is orthogonal to the compound error, but that this is not true of X_i − X̄. However, the two right hand side variables are uncorrelated because of the randomization, so the OLS estimate of α is consistent. This is not true of θ, though this may not be a problem if the aim is simply to reduce the sampling variance. A more serious issue is that the dependency between T_i and the compound error term means that the OLS estimate of the average treatment effect α is biased, and in small samples this bias—which comes from the heterogeneity—may be substantial. Freedman notes that the leading term in the bias of the OLS estimate of α is φ/n, where n is the sample size and

(19)  φ = −lim_{n→∞} (1/n) Σ_{i=1}^{n} (α_i − α) Z_i²,

where Z_i is the standardized (z-score) version of X_i. Equation (19) shows that the bias comes from the heterogeneity or, more specifically, from a covariance between the heterogeneity in the treatment effects and the squares of the included covariates. With the sample sizes typically encountered in these experiments, which are often expensive to conduct, the bias can be substantial. One possible strategy here would be to compare the estimates of α with and without covariates; even ignoring pretest bias, it is not clear how to make such a comparison without a good estimate of the standard error. Alternatively, and as noted by Imbens (2009) in his commentary on this paper, it is possible to remove bias using a “saturated” regression model, for example by estimating (15) when the covariates are discrete and there is a complete set of interactions. This is equivalent to stratifying on each combination of values of the covariates, which diminishes any effect on reducing the standard errors and, unless the stratification is done ex ante, raises the usual concerns about data mining.
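
    The bias can be exhibited by simulation in the spirit of Freedman’s analysis, though the design below, with its particular numbers and unequal allocation, is entirely my own assumption. Holding a small, fixed population of potential outcomes and re-randomizing the assignment many times traces out the randomization distributions of the unadjusted and covariate-adjusted estimators:

    # Finite-sample bias from covariate adjustment under heterogeneity
    # (an illustrative construction, not Freedman's own example).
    import numpy as np

    rng = np.random.default_rng(4)
    n = 24                                  # deliberately small trial
    x = rng.normal(0.0, 1.0, n)
    z = (x - x.mean()) / x.std()            # standardized covariate, as in (19)
    alpha = 1.0 + 2.0 * z**2                # effects covary with Z_i squared
    y0 = x + rng.normal(0.0, 1.0, n)        # fixed potential outcomes
    y1 = y0 + alpha

    unadj, adj = [], []
    for _ in range(20_000):                 # randomization distribution
        t = np.zeros(n)
        t[rng.choice(n, 8, replace=False)] = 1.0
        y = np.where(t == 1.0, y1, y0)
        unadj.append(y[t == 1.0].mean() - y[t == 0.0].mean())
        X = np.column_stack([np.ones(n), t, x])
        adj.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

    print("true average effect:", alpha.mean())        # 3.0 by construction
    print("mean unadjusted estimate:", np.mean(unadj)) # unbiased for 3.0
    print("mean adjusted estimate:  ", np.mean(adj))   # a gap of order 1/n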

    Of these and related issues in medical trials, Freedman (2008, p. 13) writes: “Practitioners will doubtless be heard to object that they know all this perfectly well. Perhaps, but then why do they so often fit models without discussing assumptions?”

    All of the issues so far can be dealt with, either by appropriately calculating standard errors or by refraining from the use of covariates, though this might involve drawing larger and more expensive samples.

    However, there are other practical problems that are harder to fix. One of these is that subjects may fail to accept assignment, so that people who are assigned to the experimental group may refuse, and controls may find a way of getting the treatment, and either may drop out of the experiment altogether. The classical remedy of double blinding, so that neither the subject nor the experimenter knows which subject is in which group, is
rarely feasible in social experiments—children know their class size—and is often not feasible in medical trials—subjects may decipher the randomization, for example by asking a laboratory to check that their medicine is not a placebo. Heckman (1992) notes that, in contrast to people, “plots of ground do not respond to anticipated treatments of fertilizer, nor can they excuse themselves from being treated.” This makes the important point, further developed by Heckman in later work, that the deviations from assignment are almost certainly purposeful, at least in part. The people who struggle to escape their assignment will do so more vigorously the higher are the stakes, so that the deviations from assignment cannot be treated as random measurement error, but will compromise the results in fundamental ways.

    Once again, there is a widely used technical fix, which is to run regressions like (15) or (18) with actual treatment status in place of the assigned treatment status T_i. This replacement will destroy the orthogonality between treatment and the error term, so that OLS estimation will no longer yield a consistent estimate of the average treatment effect among the treated. However, the assigned treatment status, which is known to the experimenter, is orthogonal to the error term and is correlated with the actual treatment status, and so can serve as an instrumental variable for the latter. But now we are back to the discussion of instrumental variables in section 2, and we are doing econometrics, not an ideal RCT. Under the assumption of no “defiers”—people who do the opposite of their assignment just because of the assignment (and it is not clear just why there should be no defiers, Freedman 2006)—the instrumental variable estimator converges to the LATE. As before, it is unclear whether this is what we want, and there is no way to find out without modeling the behavior that is responsible for the heterogeneity of the response to assignment, as in the local instrumental variable approach developed by Heckman and his coauthors, Heckman and Vytlacil (1999, 2007). Alternatively, and as recommended by Freedman (2005, p. 4; 2006), it is always informative to make a simple unadjusted comparison of the average outcomes between treatments and controls according to the original assignment. This may also be enough if what we are concerned with is whether the treatment works or not, rather than with the size of the effect. In terms of instrumental variables, this is a recommendation to look at the reduced form, and again harks back to similar arguments in section 2 on aid effectiveness.
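
    For concreteness, here is a minimal sketch, on data of my own construction with imperfect take-up, of the two quantities just discussed: the unadjusted comparison by original assignment (the intention-to-treat effect, or reduced form), and the Wald/IV ratio that rescales it by the difference in take-up and converges to the LATE under the no-defiers assumption:

    # Intention-to-treat (reduced form) versus the Wald/IV ratio (illustrative).
    import numpy as np

    rng = np.random.default_rng(5)
    n = 10_000
    z = rng.random(n) < 0.5                     # random assignment
    u = rng.random(n)                           # latent compliance type
    d = u < np.where(z, 0.80, 0.15)             # take-up: 80% if assigned, 15% if not
    y = 1.0 * d + rng.normal(0.0, 1.0, n)       # treatment itself raises y by 1.0

    itt = y[z].mean() - y[~z].mean()            # compare as originally assigned
    first_stage = d[z].mean() - d[~z].mean()    # difference in take-up, ~0.65
    print("ITT (reduced form):", itt)           # ~0.65 = 1.0 x 0.65
    print("Wald/IV estimate:  ", itt / first_stage)  # ~1.0, the LATE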

    One common problem is that the people who agree to participate in the experiment (as either experimentals or controls) are not themselves randomly drawn from the general population, so that, even if the experiment itself is perfectly executed, the results are not transferable from the experimental to the parent population and will not be a reliable guide to policy in the parent population. In effect, the selection or omitted variable bias that is a potential problem in nonexperimental studies comes back in a different form and, without an analysis of the two biases, it is impossible to conclude which estimate is better—a biased nonexperimental analysis might do better than a randomized controlled trial if enrollment into the trial is nonrepresentative. In drug trials, ethical protocols typically require the principle of “equipoise”—that the physician, or at least physicians in general, believe that the patient has an equal chance of improvement with the experimental and the control drug. Yet risk-averse patients will not accept such a gamble, so that there is selection into treatment, and the requirements of ethics and of representativity come into direct conflict. While this argument does not apply to all trials, there are many ways in which the experimental (experimentals plus controls) and parent populations can differ.

    There are also operational problems that afflict every actual experiment; these can be mitigated by careful planning—in RCTs compared with econometri