Journal of Educational Psychology, 1974, Vol. 66, No. 5, 688-701

ESTIMATING CAUSAL EFFECTS OF TREATMENTS IN

RANDOMIZED AND NONRANDOMIZED STUDIES 1

DONALD B. RUBIN 2

Educational Testing Service, Princeton, New Jersey

A discussion of matching, randomization, random sampling, and other methods of controlling extraneous variation is presented. The objective is to specify the benefits of randomization in estimating causal effects of treatments. The basic conclusion is that randomization should be employed whenever possible but that the use of carefully controlled nonrandomized data to estimate causal effects is a reasonable and necessary procedure in many cases.

Recent psychological and educational literature has included extensive criticism of the use of nonrandomized studies to estimate causal effects of treatments (e.g., Campbell & Erlebacher, 1970). The implication in much of this literature is that only properly randomized experiments can lead to useful estimates of causal effects. If taken as applying to all fields of study, this position is untenable. Since the extensive use of randomized experiments is limited to the last half century, 3 and randomized experimentation in fact is not used in much scientific investigation today, 4 one is led to the conclusion that most scientific "truths" have been established without using randomized experiments. In addition, most of us successfully determine the causal effects of many of our everyday actions, even interpersonal behaviors, without the benefit of randomization.

Even if the position that causal effects of treatments can only be well established from randomized experiments is taken as applying only to the social sciences in which

1 I would like to thank E. J. Anastasio, A. E. Beaton, W. G. Cochran, K. M. Kazarow, and R. L. Linn for helpful comments on earlier versions of this paper. I would also like to thank the U. S. Office of Education for supporting work on this paper under contract OEC-0-71-3715.

2 Requests for reprints should be sent to Donald B. Rubin, Division of Data Analysis Research, Educational Testing Service, Princeton, New Jersey 08540.

3 Essentially since Fisher (1925).

4 For example, in Davies (1954), a well-known textbook on experimental design in industrial work, randomization is not emphasized.

there are currently few well-established causal relationships, its implication, to ignore existing observational data, may be counterproductive. Often the only immediately available data are observational (nonrandomized) and either (a) the cost of performing the equivalent randomized experiment to test all treatments is prohibitive (e.g., 100 reading programs under study); (b) there are ethical reasons why the treatments cannot be randomly assigned (e.g., estimating the effects of heroin addiction on intellectual functioning); or (c) estimates based on results of experiments would be delayed many years (e.g., effect of childhood intake of cholesterol on longevity). In cases such as these, it seems more reasonable to try to estimate the effects of the treatments from nonrandomized studies than to ignore these data and dream of the ideal experiment or make "armchair" decisions without the benefit of data analysis. Using the indications from nonrandomized studies, one can, if necessary, initiate randomized experiments for those treatments that require better estimates or that look most promising.

The position here is not that randomization is overused. On the contrary, given a choice between the data from a randomized experiment and an equivalent nonrandomized study, one should choose the data from the experiment, especially in the social sciences where much of the variability is often unassigned to particular causes. However, we will develop the position that nonrandomized studies as well as randomized


experiments can be useful in estimating causal treatment effects.

In order to avoid unnecessary complication, we will restrict discussion to the very simple study consisting of 2N units (e.g., subjects), half having been exposed to an experimental (E) treatment (e.g., a compensatory reading program) and the other half having been exposed to a control (C) treatment (e.g., a regular reading program). If Treatments E and C were assigned to the 2N units randomly, that is, using some mechanism that assured each unit was equally likely to be exposed to E as to C, then the study is called a randomized experiment or more simply an experiment; otherwise, the study is called a nonrandomized study, a quasi-experiment, or an observational study. The objective is to determine for some population of units (e.g., underprivileged sixth-grade children) the "typical" causal effect of the E versus C treatment on a dependent variable Y, where Y could be dichotomous (e.g., success-failure) or more continuous (e.g., score on a given reading test). The central question concerns the benefits of randomization in determining the causal effect of the E versus C treatment on Y.

DEFINING THE CAUSAL EFFECT OF THE E VERSUS C TREATMENT

Intuitively, the causal effect of one treatment, E, over another, C, for a particular unit and an interval of time from t1 to t2 is the difference between what would have happened at time t2 if the unit had been exposed to E initiated at t1 and what would have happened at t2 if the unit had been exposed to C initiated at t1: "If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone," or "Because an hour ago I took two aspirins instead of just a glass of water, my headache is now gone." Our definition of the causal effect of the E versus C treatment will reflect this intuitive meaning.

First define a trial to be a unit and an associated pair of times, t1 and t2, where t1 denotes the time of initiation of a treatment and t2 denotes the time of measurement of a dependent variable, Y, where t1 < t2. We restrict our attention to Treatments E and C that could be randomly assigned; thus, we assume (a) a time of initiation of treatment can be ascertained for each unit exposed to E or C and (b) E and C are exclusive of each other in the sense that a trial cannot simultaneously be an E trial and a C trial (i.e., if E is defined to be C plus some action, the initiation of both is the initiation of E; if E and C are alternative actions, the initiation of both E and C is the initiation of neither of these but rather of a third treatment, E + C).

Now define the causal effect of the E versus C treatment on Y for a particular trial (i.e., a particular unit and associated times t1, t2) as follows:

Let y(E) be the value of Y measured 5 at t2 on the unit, given that the unit received the experimental Treatment E initiated at t1;
Let y(C) be the value of Y measured at t2 on the unit, given that the unit received the control Treatment C initiated at t1;
Then y(E) − y(C) is the causal effect of the E versus C treatment on Y for that trial, that is, for that particular unit and the times t1, t2.

For example, assume that the unit is a particular child, the experimental treatment is an enriched reading program, and the control treatment is a regular reading program. Suppose that if the child were given the enriched program initiated at time t1, 10 days later at time t2 he would have a score of 38 items correct on a reading test; and suppose that if the child instead were given the regular program initiated at time t1, at time t2 he would score 34 items correct. Then the causal effect on the reading test for that trial (that child and times t1, t2) of the enriched program versus the regular program is 38 − 34 = 4 more items correct.

5 The measured value of Y stated with reference to time t2 is considered the "true" value of Y at t2. This position can be justified by defining Y by a measuring instrument that always yields the measured Y (e.g., Y is the score on a particular IQ test as recorded by the subject's teacher). Since an "error" in the measured Y can only be detected by a "better" measuring instrument (e.g., a machine-produced score on that same IQ test), the values of a "truer" score can be viewed as the values of a different dependent variable. Clearly, any study is more meaningful to the investigator if the dependent variable better reflects underlying concepts he feels are important (e.g., is more accurate), but that does not imply he must consider errors about some unmeasurable "true score." For the reader who prefers the concept of such errors of measurement, he may consider the following discussion to assume negligible "technical errors" so that Y is essentially the "true" Y.

The problem in measuring y(E) − y(C) is that we can never observe both y(E) and y(C) since we cannot return to time t1 to give the other treatment. We may have the same unit measured on both treatments in two trials (a repeated measure design), but since there may exist carryover effects (e.g., the effect of the first treatment wears off slowly) or general time trends (e.g., as the child ages, his learning ability increases), we cannot be certain that the unit's responses would be identical at both times.

Assume now that there are M trials for which we want the "typical" causal effect. For simplicity of exposition, assume that each trial is associated with a different unit and expand the above notation by adding the subscript j to denote the jth trial (j = 1, 2, ..., M); thus yj(E) − yj(C) is the causal effect of the E versus C treatment for the jth trial, that is, the jth unit and the associated times of initiation of treatment, t1j, and measurement of Y, t2j.

An obvious definition of the "typical" causal effect of the E versus C treatment for the M trials is the average (mean) causal effect for the M trials:

(1/M) Σ_{j=1}^{M} [yj(E) − yj(C)]

Even though other definitions of typical are interesting, 6 they lead to more complications when discussing properties of estimates under randomization. Hence we assume the average causal effect is the desired typical causal effect for the M trials and proceed to the problem of its estimation given the obvious constraint that we can never actually measure both yj(E) and yj(C) for any trial.

6 Notice that if all but one of the individual causal effects are small and that one is very large, the average causal effect may be substantially larger than all but one of the individual causal effects and thus not very "typical." Other possible definitions of the typical causal effect for the M trials are the median causal effect (the median of the individual causal effects) or the midmean causal effect (the average of the middle half of the individual causal effects). If the individual causal effects, yj(E) − yj(C), are approximately symmetrically distributed about a central value, sensible definitions of "typical" will yield similar values.
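The mean, median, and midmean definitions of a "typical" causal effect can be compared with a small numerical sketch. The potential outcomes below are invented for illustration, with one unit given a deliberately huge individual effect so that the mean is far from "typical" while the median and midmean are not:

```python
# Hypothetical potential outcomes for M = 5 trials; y_E[j] and y_C[j] are
# what trial j would yield under E and C (illustrative numbers only).
y_E = [38, 35, 40, 33, 90]
y_C = [34, 33, 37, 32, 31]

effects = sorted(e - c for e, c in zip(y_E, y_C))   # [1, 2, 3, 4, 59]
M = len(effects)

mean_effect = sum(effects) / M                 # average causal effect: 13.8
median_effect = effects[M // 2]                # median causal effect: 3
middle = effects[M // 4 : M - M // 4]          # middle half of the effects
midmean_effect = sum(middle) / len(middle)     # midmean causal effect: 3.0
```

One very large individual effect (59) pulls the mean to 13.8 even though four of the five individual effects are 4 or less, which is exactly the sense in which the average may not be "typical."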

RANDOMIZATION, MATCHING, AND ESTIMATING THE TYPICAL CAUSAL EFFECT IN THE 2N TRIAL STUDY

For now assume that the objective is to estimate the typical causal effect only for the 2N trials in the study. Of course, in order for the results of a study to be of much interest, we must be able to generalize to units and associated times other than those in the study. However, the issue of generalizing results to other trials is discussed separately from the issue of estimating the typical causal effect for the trials under study. Also, for now we only consider the simple and standard estimate of the typical causal effect of E versus C: the average Y difference between those units who received E and those units who received C. After considering this estimate when there are only two trials in the study and then when there are 2N (N > 1) trials in the study, we will more formally discuss two benefits of randomization.

Two-Trial Study

Suppose there are two trials under study, one trial having a unit exposed to E and the other having a unit exposed to C. The typical causal effect for the two trials is

(1/2)[y1(E) − y1(C) + y2(E) − y2(C)]. [1]

The estimate of this quantity from the study, the difference between the measured Y for the unit who received E and the measured Y for the unit who received C, is either

y1(E) − y2(C) [2]

or

y2(E) − y1(C) [3]



depending upon which unit was assigned E. Neither Equation 2 nor Equation 3 is necessarily close to Equation 1 or to the causal effect for either unit,

y1(E) − y1(C) [4]

or

y2(E) − y2(C) [5]

even if these individual causal effects are equal. If the Treatments E and C were randomly assigned to units, we are equally likely to have observed the difference in Equation 2 as that in Equation 3, so that the average or "expected" difference in Y between experimental and control units is the average of Equations 2 and 3,

(1/2){[y1(E) − y2(C)] + [y2(E) − y1(C)]},

which equals Equation 1, the typical causal effect for the two trials. For this reason, if the treatments are randomly assigned, the difference in Y between the experimental and control units is called an "unbiased" estimate of the desired typical causal effect.

Now suppose that the two units are very similar in the way they respond to the E and C treatments at the times of their trials. By this we mean that on the basis of "extra information," we know y1(E) is about equal to y2(E) and y1(C) is about equal to y2(C); that is, the two trials are closely "matched" with respect to the effects of the two treatments. It then follows that Equation 2 is about equal to Equation 3, and both are about equal to the desired typical causal effect in Equation 1. In fact, if the two units react identically in their trials, Equation 5 = Equation 4 = Equation 3 = Equation 2 = Equation 1, and randomization is absolutely irrelevant. Clearly, having closely "matched" trials increases the closeness of the calculated experimental minus control difference to the typical causal effect for the two trials, while random assignment of treatments does not improve that estimate.

Although two-trial studies are almost unheard of in the behavioral sciences, they are not uncommon in the physical sciences. For example, when comparing the heat expansion rates (per hour) of a metal alloy in oxygen and nitrogen, an investigator might use 2 one-foot lengths of the alloy. Because the lengths of alloy are so closely matched before being exposed to the treatment (almost identical compositions and dimensions), the units should respond almost identically to the treatments even when initiated at different times, and thus the calculated experimental (oxygen) minus control (nitrogen) difference should be an excellent estimate of the typical causal effect, Equation 1.

A skeptical observer, however, could always claim that the experimental minus control difference is not a good estimate of the typical causal effect of the E versus C treatment because the two units were not absolutely identical prior to the application of the treatments. For example, he could claim that the length of alloy molded first would expand more rapidly. Hence, he might argue that what was measured was really the effect of the difference in order of manufacture, not the causal effect of the oxygen versus nitrogen treatment. Since units are never absolutely identical before the application of treatments, this kind of argument, whether "sensible" or not, can always be made. Nevertheless, if the two trials are closely matched with respect to the expected effects of the treatments, that is, if (a) the two units are matched prior to the initiation of treatments on all variables thought to be important in the sense that they causally affect Y and (b) the possible effects of different times of initiation of treatment and measurement of Y are controlled, then the investigator can be confident that he is in fact measuring the causal effect of the E versus C treatment for those two trials. This kind of confidence is much easier to generate in the physical sciences, where there are models that successfully assign most variability to specific causes, than in the social sciences, where often important causal variables have not been identified.

Another source of confidence that the experimental minus control difference is a good estimate of the causal effect of E versus C is replication: Are similar results obtained under similar conditions? One type of replication is the inclusion of more than two trials in the study.


The 2N Trial Study

Suppose there are 2N trials (N > 1) in the study, half with N units having received the E treatment and the other half with N units having received the C treatment. The immediate objective is to find the typical causal effect of the E versus C treatment on Y for the 2N trials, say τ:

τ = (1/2N) Σ_{j=1}^{2N} [yj(E) − yj(C)]

Let SE denote the set of indices of the E trials and SC denote the set of indices of the C trials (SE ∪ SC = {1, 2, ..., 2N}). Then the difference between the average observed Y in the E trials and the average observed Y in the C trials can be expressed as

ȳd = (1/N) Σ_{j∈SE} yj(E) − (1/N) Σ_{j∈SC} yj(C)

where Σ_{j∈SE} and Σ_{j∈SC} indicate, respectively, summation over all indices in SE (i.e., all E trials) and over all indices in SC (i.e., all C trials). We now consider how close this estimate ȳd is to the typical causal effect τ and what advantage there might be if we knew the treatments were randomly assigned.

First, assume that for each unit receiving E there is a unit receiving C, and the two units react identically at the times of their trials; that is, the 2N trials are actually N perfectly matched pairs. We now show that the estimate ȳd in this case equals τ. ȳd can be expressed as the average experimental minus control (E − C) difference across the N matched pairs of trials. Since the (E − C) difference in each matched pair of trials is the typical causal effect for both trials of that pair, the average of those differences is the typical causal effect for all N pairs and thus all 2N trials. This result holds whether the treatments were randomly assigned or not. In fact, if one had N identically matched pairs, a "thoughtless" random assignment could be worse than a nonrandom assignment of E to one member of the pair and C to the other. By "thoughtless" we mean some random assignment that does not assure that the members of each matched pair get different treatments: picking the N indices to receive E "from a hat" containing the numbers 1 through 2N rather than tossing a fair coin for each matched pair to see which unit is to receive E.
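The contrast between "from a hat" assignment and a coin toss per pair can be checked by enumerating allocations. The pairing below is hypothetical (N = 3 matched pairs, units numbered 0 through 5):

```python
import itertools

N = 3
pairs = [(2 * k, 2 * k + 1) for k in range(N)]   # matched pairs of unit indices
units = range(2 * N)

# "From a hat": any N of the 2N units may receive E.
hat = list(itertools.combinations(units, N))     # C(6, 3) = 20 allocations
# Count allocations in which some pair has both members assigned to E,
# so that pair yields no within-pair E - C comparison.
bad = sum(any(a in alloc and b in alloc for a, b in pairs) for alloc in hat)

# Coin toss per pair: choose the E member of each pair; 2**N allocations,
# every one of which gives each pair exactly one E and one C unit.
paired = list(itertools.product(*pairs))
```

Here 12 of the 20 equally likely "hat" allocations break at least one matched pair, while every one of the 2^3 = 8 within-pair allocations preserves the pairing.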

In practice, of course, we never have exactly matched trials. However, if matched pairs of trials are very similar in the sense that prior to the initiation of treatments the investigator has controlled those variables that might appreciably affect Y, then ȳd should be close to τ. If, in addition, the estimated causal effect is replicable in the sense that the N individual estimated causal effects for each matched pair are very similar, the investigator might feel even more confident that he is in fact estimating the typical causal effect for the 2N trials (e.g., 2N children from the same school matched by sex and initial reading score into N pairs, with the same observed E − C difference in final score in each matched pair). Similarly, if the trials are not pair-matched but are all similar (e.g., all children are males from the same school with similar pretest scores) and if we observe that all yj(E), j∈SE, are about equal and all yj(C), j∈SC, are about equal, the investigator would also feel confident that he is in fact estimating the typical causal effect for the 2N trials.

Nevertheless, it is obvious that if treatments were systematically assigned to units, the addition of replication evidence cannot dissuade the critic who believes the effect being measured is due to a variable used to assign treatments (e.g., in the reading study, if more active children always received the enriched program, or in the heat-expansion study, if the first molded alloy was always measured in oxygen). If treatments were randomly assigned, all systematic sources of bias would be made random, and thus it would be unlikely, especially if N is large, that almost all E trials would be with the more active children or the first molded alloy. Hence, any effect of that variable would be at least partially balanced in the sense of systematically favoring neither the E treatment nor the C treatment over the 2N trials. In addition, using the replications, there could be evidence to refute the skeptic's claim of the importance of that variable (e.g., in each matched trial we get about the same estimate whether the more active child gets E or C). Of course, if we knew the skeptic's claim beforehand, a specific control of this additional variable would be more advisable than relying on randomization (e.g., in a random half of the matched trials assign E to the more active child, and in the other half assign C to the more active child, or include the child's activity as a matching variable).

It is important to realize, however, that whether treatments are randomly assigned or not, no matter how carefully matched the trials, and no matter how large N, a skeptical observer could always eventually find some variable that systematically differs in the E trials and C trials (e.g., length of longest hair on the child) and claim that ȳd estimates the effect of this variable rather than τ, the causal effect of the E versus C treatment. Within the experiment there can be no refutation of this claim; only a logical argument explaining that the variable cannot causally affect the dependent variable, or additional data outside the study, can be used to counter it.

TWO FORMAL BENEFITS OF RANDOMIZATION

If randomization can never assure us that we are correctly estimating the causal effect of E versus C for the 2N trials under study, what are the benefits of randomization besides the intuitive ones that follow from making all systematic sources of bias into random ones? Formally, randomization provides a mechanism to derive probabilistic properties of estimates without making further assumptions. We will consider two such properties that are important:

1. The average E − C difference is an "unbiased" estimate of τ, the typical causal effect for the 2N trials.

2. Precise probabilistic statements can be made indicating how unusual the observed E − C difference, ȳd, would be under specific hypothesized causal effects.

More advanced discussion of the formal benefits of randomization may be found in Scheffé (1959) and Kempthorne (1952).

Unbiased Estimation over the Randomization Set

We begin by defining the "randomization set" to be the set of r allocations that were equally likely to be observed given the randomization plan. For example, if the treatments were randomly assigned to trials with no restrictions (the completely randomized experiment, Cochran & Cox, 1957), each one of the (2N choose N) possible allocations of N trials to E and N trials to C was equally likely to be the observed allocation. Thus, the collection of all of these r = (2N choose N) allocations is known as the randomization set for this completely randomized experiment. If the treatments were assigned randomly within matched pairs (the randomized blocks experiment, Cochran & Cox, 1957), any of the 2^N allocations, with each member of the pair receiving a different treatment, was equally likely to be the observed one. Hence, for the experiment with randomization done within matched pairs, the collection of these r = 2^N equally likely allocations is known as the randomization set.

For each of the r possible allocations in the randomization set, there is a corresponding average E − C difference that would have been calculated had that allocation been chosen. If the expectation (i.e., average) of these r possible average differences equals τ, the average E − C difference is called unbiased over the randomization set for estimating τ. We now show that given randomly assigned treatments, the average E − C difference is an unbiased estimate of τ, the typical causal effect for the 2N trials.

By the definition of random assignment, each trial is equally likely to be an E trial as a C trial. Hence, the contribution of the jth trial (j = 1, ..., 2N) to the average E − C difference is yj(E)/N in half of the r allocations in the randomization set and −yj(C)/N in the other half. The expected contribution of the jth trial to the average E − C difference is therefore

(1/2)[yj(E)/N] − (1/2)[yj(C)/N] = [yj(E) − yj(C)]/2N.

Summing over all 2N trials, the expectation of the average E − C difference over the r allocations in the randomization set is

(1/2N) Σ_{j=1}^{2N} [yj(E) − yj(C)],

which is the typical causal effect for the 2N trials, τ.
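This unbiasedness can also be verified by brute force: enumerate every allocation in the randomization set of a completely randomized experiment and average the resulting E − C differences. The potential outcomes below are invented for illustration (2N = 6 trials):

```python
import itertools

y_E = [38, 35, 40, 33, 36, 39]   # yj(E) for j = 0..5 (hypothetical)
y_C = [34, 33, 37, 32, 30, 35]   # yj(C) for j = 0..5 (hypothetical)
n2 = len(y_E)                    # 2N trials
N = n2 // 2

# Typical causal effect tau for the 2N trials.
tau = sum(e - c for e, c in zip(y_E, y_C)) / n2

# Average E - C difference for every equally likely allocation of N trials to E.
diffs = []
for S_E in itertools.combinations(range(n2), N):
    S_C = [j for j in range(n2) if j not in S_E]
    diffs.append(sum(y_E[j] for j in S_E) / N - sum(y_C[j] for j in S_C) / N)

expectation = sum(diffs) / len(diffs)   # averages to tau: the estimate is unbiased
```

Over all C(6, 3) = 20 allocations the differences average exactly to τ, even though any single observed ȳd can be far from it.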

Although the unbiasedness of the E − C difference is appealing in the sense that it indicates that we are tending to estimate τ, its impact is not immediately overwhelming: the one E − C difference we have observed, ȳd, may or may not be close to τ. In a vague sense we may believe ȳd should be close to τ because the unbiasedness indicates that "on the average" the E − C difference is τ, but this belief may be tempered when other properties of the estimate are revealed; for example, without additional constraints on the symmetry of effects, the average E − C difference is not equally likely to be above τ as below it.

In addition, after observing the values of some important unmatched variable, we may no longer believe ȳd tends to estimate τ. For example, suppose in the study of reading programs, initial reading score is not a matching variable, and after the experiment is complete we find that the average initial score for the children exposed to E was higher than for those exposed to C. Clearly we would now believe that ȳd probably overestimates τ even if treatments were randomly assigned.

In sum, the unbiasedness of the E − C difference for τ follows from the random assignment of treatments; it is a desirable property because it indicates that "on the average" we tend to estimate the correct quantity, but it hardly solves the problem of estimating the typical causal effect. As yet we have no indication whether to believe ȳd is close to τ nor any ability to adjust for important information we may possess.

Probabilistic Statements from the Randomization Set

A second formal advantage of randomization is that it provides a mechanism for making precise probabilistic statements indicating how unusual the observed E − C difference, ȳd, would be under specific hypotheses. Suppose that the investigator hypothesizes exactly what the individual causal effects are for each of the 2N trials and these hypothesized values are τ̂j, j = 1, ..., 2N. The hypothesized typical causal effect for the 2N trials is thus

τ̂ = (1/2N) Σ_{j=1}^{2N} τ̂j

Having the τ̂j and the observed yj(E), j∈SE, and yj(C), j∈SC, we can calculate hypothesized values, say ŷj(C) and ŷj(E), for all of the 2N trials. For j∈SE, yj(E) is observed and yj(C) is unobserved; hence, for these trials ŷj(E) = yj(E) and ŷj(C) = yj(E) − τ̂j. For j∈SC, yj(C) is observed and yj(E) is unobserved; hence, for these trials ŷj(C) = yj(C) and ŷj(E) = yj(C) + τ̂j. Thus, we can calculate hypothesized ŷj(E) and ŷj(C) for all 2N trials, and using these, we can calculate an hypothesized average E − C difference for each of the r allocations of the 2N trials in the randomization set.

Suppose that we calculate the r hypothesized average E − C differences and list them from high to low, noting which E − C difference corresponds to the SE, SC allocation we have actually observed. This difference, ȳd, is the only one which does not use the hypothesized τ*j. If treatments were assigned completely at random to the trials and the hypothesized τ*j are correct, any one of the r = (2N choose N) differences was equally likely to be the observed one; similarly, if treatments were randomly assigned within matched pairs, each of the r = 2^N differences with each member of a matched pair getting a different treatment was equally likely to be the observed one. Intuitively, if the hypothesized τ*j are essentially correct, we would expect the observed difference ȳd to be rather typical of the (r − 1) other differences that were equally likely to be observed; that is, ȳd should be near the center of the distribution of the r E − C differences. If the observed difference is in the tail of the distribution and therefore not typical of the r differences, we might doubt the correctness of the hypothesized τ*j.
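As a quick numerical check (not in the original), the two randomization-set sizes just described can be computed directly; N = 5 here is an arbitrary illustrative choice.

```python
from math import comb

N = 5                         # 2N = 10 trials
r_complete = comb(2 * N, N)   # completely random assignment of N trials to E
r_pairs = 2 ** N              # random assignment within N matched pairs
print(r_complete, r_pairs)    # 252 32
```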

Since the average of the r E − C differences is the hypothesized typical causal effect, τ̄*, and the r allocations are equally likely, we can make the following probabilistic statement:

Under the hypothesis that the causal effects are given by the τ*j, j = 1, …, 2N, the probability that we would observe an average E − C difference that is as far or farther from τ̄* than the one we have observed is m/r, where m is the number of allocations in the randomization set that yield E − C differences that are as far or farther from τ̄* than ȳd.

If this probability, called the "significance level" for the hypothesized τ*j, is very small, we either must admit that the observed value is unusual in the sense that it is in the tail of the distribution of the equally likely differences, or we must reject the plausibility of the hypothesized τ*j.

The most common hypothesis for which a significance level is calculated is that the E versus C treatment has no effect on Y whatsoever (i.e., τ*j = 0 for all trials). Other common hypotheses assume that the effect of the E versus C treatment on Y is a nonzero constant (i.e., τ*j = τ0 for all trials).7
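To make the procedure concrete, here is a small exhaustive sketch (the data are hypothetical, not from the paper): given the observed responses, the actual allocation, and hypothesized effects τ*j, it reconstructs the hypothesized responses for every trial, computes the E − C difference for every allocation in the randomization set, and returns the significance level m/r.

```python
from itertools import combinations
from statistics import mean

def randomization_test(y_obs, treated, tau_star):
    """Exhaustive randomization test for hypothesized unit-level causal
    effects tau_star[j] (tau_star[j] = 0 for all j is the usual
    hypothesis of no effect).  y_obs[j] is the observed response of
    trial j; treated[j] is True if trial j actually received E.
    Returns (observed E - C difference, significance level m/r)."""
    n2 = len(y_obs)
    # Hypothesized responses of every trial under E and under C.
    yE = [y if t else y + s for y, t, s in zip(y_obs, treated, tau_star)]
    yC = [y - s if t else y for y, t, s in zip(y_obs, treated, tau_star)]
    tau_bar = mean(tau_star)
    obs = (mean(y for y, t in zip(y_obs, treated) if t)
           - mean(y for y, t in zip(y_obs, treated) if not t))
    # One hypothesized E - C difference per allocation in the set.
    diffs = []
    for chosen in combinations(range(n2), n2 // 2):
        sE = set(chosen)
        diffs.append(mean(yE[j] for j in sE)
                     - mean(yC[j] for j in range(n2) if j not in sE))
    m = sum(abs(d - tau_bar) >= abs(obs - tau_bar) - 1e-12 for d in diffs)
    return obs, m / len(diffs)

# Hypothetical study: 3 E trials, 3 C trials, no-effect hypothesis.
obs, p = randomization_test([12, 15, 11, 9, 8, 10],
                            [True, True, True, False, False, False],
                            [0.0] * 6)
print(obs, p)  # obs = 3.666..., p = 2/20 = 0.1
```

For matched pairs, the same idea applies with the loop running over the 2^N within-pair swaps instead of all (2N choose N) subsets.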

The ability to make precise probabilistic statements about the observed ȳd under various hypotheses without additional assumptions is a tremendous benefit of randomization, especially since ȳd tends to estimate τ. However, one must realize that these simple probabilistic statements refer only to the 2N trials used in the study and do not reflect additional information (i.e., other variables) that we may also have measured.
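Footnote 7's construction of confidence limits can likewise be sketched by inverting constant-effect hypotheses over a grid of candidate values τ0; everything here (data, grid, α) is hypothetical.

```python
from itertools import combinations
from statistics import mean

def sig_level_const(y, trt, tau0):
    """Exhaustive randomization-set significance level for the
    hypothesis that every trial's causal effect equals tau0."""
    n2 = len(y)
    yE = [v if t else v + tau0 for v, t in zip(y, trt)]
    yC = [v - tau0 if t else v for v, t in zip(y, trt)]
    obs = (mean(v for v, t in zip(y, trt) if t)
           - mean(v for v, t in zip(y, trt) if not t))
    diffs = []
    for chosen in combinations(range(n2), n2 // 2):
        sE = set(chosen)
        diffs.append(mean(yE[j] for j in sE)
                     - mean(yC[j] for j in range(n2) if j not in sE))
    m = sum(abs(d - tau0) >= abs(obs - tau0) - 1e-12 for d in diffs)
    return m / len(diffs)

def confidence_limits(y, trt, alpha, grid):
    """Keep every candidate tau0 whose significance level is >= alpha;
    the kept set's endpoints sketch a (1 - alpha) confidence interval."""
    kept = [t0 for t0 in grid if sig_level_const(y, trt, t0) >= alpha]
    return (min(kept), max(kept)) if kept else None

y = [12, 15, 11, 9, 8, 10]
trt = [True, True, True, False, False, False]
lims = confidence_limits(y, trt, 0.05, [k / 2 for k in range(-8, 25)])
```

The grid resolution controls how finely the interval endpoints are located; a coarse grid gives only a coarse interval.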

PRESENTING THE RESULTS OF AN EXPERIMENT AS BEING OF GENERAL INTEREST

Before presenting the results of an experiment as being relevant, an investigator should believe that he has measured the

7 These hypotheses for a constant effect can be used to form "confidence limits" for τ. Given that the τ*j are constant, the set of all hypothesized τ0 such that the associated significance level is greater than or equal to α = m/r forms a (1 − α) confidence interval for τ: of the r such (1 − α) confidence intervals one could have constructed (one for each of the r allocations in the randomization set), r(1 − α) = r − m of them include the true value of τ assuming all τj = τ. See Lehmann (1959, p. 59) for the proof.

causal effect of the E versus C treatment and not the effect of some extraneous variable. Also, he should believe that the result is applicable to a population of trials besides the 2N in the experiment.

Considering Additional Variables

As indicated previously, the investigator should be prepared to consider the possible effect of other variables besides those explicit in the experiment. Often additional variables will be ones that the investigator considers relevant because they may causally affect Y; therefore, he may want to adjust the estimate ȳd and significance levels of hypotheses to reflect the values of these variables in the study. At times the variables will be ones which cannot causally affect Y even though in the study they may be correlated with the observed values of Y. An investigator who refuses to consider any additional variables is in fact saying that he does not care if ȳd is a bad estimate of the typical causal effect of the E versus C treatment but instead is satisfied with mathematical properties (i.e., unbiasedness) of the process by which he calculated it.

Consider first the case of an obviously important variable. As an example, suppose in the reading study, with programs randomly assigned, we found that the average E − C difference in final score was four items correct and that under the hypothesis of no effects the significance level was .01; also assume that initial score was not a matching variable and in fact the difference in initial score was also four items correct. Admittedly, this is probably a rare event given the randomization, but rare events do happen rarely. Given that it did happen, we would indeed be foolish to believe that ȳd = 4 items is a good estimate of τ and/or to believe in the implausibility of the hypothesis of no treatment effects indicated by the .01 significance level. Rather, it would seem more sensible to believe that ȳd overestimates τ and significance levels underestimate the plausibility of hypotheses that suggest zero or negative values for τ.

A commonly used and obvious correction is to calculate the average E − C difference in gain score rather than final score. That is, for each trial there is a "pretest" score which was measured before the initiation of treatments, and the gain score for each trial is the final score minus the pretest score. More generally we will speak of a "prior" score or "prior" variable which would have the same value, xj, whether the jth unit received E or C. It then follows, given random assignment of treatments, that the adjusted estimate (e.g., gain score)

    (1/N) Σ_{j∈SE} [yj(E) − xj] − (1/N) Σ_{j∈SC} [yj(C) − xj]

remains an unbiased estimate of τ over the randomization set: Each prior score appears in half of the equally likely allocations as xj/N and the other half as −xj/N; hence, averaged over all allocations, the jth prior score has no effect.8 But this result holds for any set of prior scores xj, j = 1, …, 2N, whether sensible or not. For example, in an experiment evaluating a compensatory reading program, with Y being the final score on a reading test, the prior variable "pretest reading score" or perhaps "IQ" properly scaled makes sense, but "height in millimeters" does not. Also, why not use the prior variable "one half pretest score"?
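The claim that each prior score cancels over the randomization set can be checked numerically; the potential outcomes and prior scores below are invented for the purpose.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes y_j(E), y_j(C) and prior scores x_j
# for 2N = 6 trials.
yE = [14.0, 16.0, 12.0, 11.0, 13.0, 15.0]
yC = [10.0, 13.0, 9.0, 8.0, 11.0, 12.0]
x = [7.0, 9.0, 6.0, 5.0, 8.0, 9.0]
n2 = len(yE)
tau = mean(yE[j] - yC[j] for j in range(n2))  # typical causal effect

def adjusted(sE):
    """Average E - C difference in gain scores for one allocation."""
    return (mean(yE[j] - x[j] for j in sE)
            - mean(yC[j] - x[j] for j in range(n2) if j not in sE))

# Each x_j enters half the equally likely allocations as +x_j/N and the
# other half as -x_j/N, so averaging over the whole randomization set
# recovers tau (here tau = 3.0) regardless of the x_j chosen.
avg = mean(adjusted(set(s)) for s in combinations(range(n2), n2 // 2))
assert abs(avg - tau) < 1e-9
```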

Clearly, in order to make an intelligent adjustment for extra information, we cannot be guided solely by the concept of unbiasedness over the randomization set. We need some model for the effect of prior variables in order to use their values in an intelligent manner. For example, if the final score typically would equal the initial score if there were no E − C treatment effect (as with the length of the alloys in the heat expansion experiment), the gain score is perfectly reasonable. In the physical sciences, more complex models representing generally accepted functional relationships are often used; however, in the social sciences there are rarely such accepted relationships upon which to rely. What then does the investigator do in order to adjust intelligently the final reading scores for the subjects' varying IQs, grade levels, socioeconomic status, and

8 If the prior score could vary depending on whether the unit received E or C (i.e., it is a variable measured after the initiation of the treatment), we would have no assurance that the adjusted E − C difference is an unbiased estimate over the randomization set.

so on? Apparently, he must be willing to make some assumptions about the functional form of the causal effect of these other variables on Y. If he assumes, perhaps based on indications in previous data, some "known" function for xj (e.g., in the compensatory reading program example, suppose xj equals [.01 × IQ]² × pretest × [percentile of family income]), so that xj is the same whether the jth unit received E or C, from the previous discussion the average E − C difference in adjusted scores remains an unbiased estimate of τ. If the investigator assumes a model whose parameters are unknown and estimates these parameters by some method from the data, in general the average E − C difference in adjusted scores is no longer unbiased over the randomization set because the adjustment for the jth trial depends on which trials received E and which received C (e.g., in the analysis of covariance, the estimated regression coefficients in general vary over the r allocations in the randomization set). Hence, forming an intelligent adjusted estimate may not be simple even in a randomized experiment.
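A classical covariance adjustment can be sketched as follows (hypothetical data; a sketch, not a prescription from the paper): the pooled within-group slope b is estimated from whichever trials happened to fall in E and C, which is exactly why it changes from allocation to allocation.

```python
from statistics import mean

def ancova_estimate(y, x, trt):
    """Covariance-adjusted E - C difference: estimate a pooled
    within-group slope b by least squares, then adjust the simple
    difference for the E - C difference in the prior variable x."""
    E = [j for j, t in enumerate(trt) if t]
    C = [j for j, t in enumerate(trt) if not t]

    def parts(group):
        xb = mean(x[j] for j in group)
        yb = mean(y[j] for j in group)
        sxy = sum((x[j] - xb) * (y[j] - yb) for j in group)
        sxx = sum((x[j] - xb) ** 2 for j in group)
        return sxy, sxx

    sxyE, sxxE = parts(E)
    sxyC, sxxC = parts(C)
    b = (sxyE + sxyC) / (sxxE + sxxC)  # re-estimated from the observed split
    return (mean(y[j] for j in E) - mean(y[j] for j in C)
            - b * (mean(x[j] for j in E) - mean(x[j] for j in C)))

# If y were exactly 2x plus a constant effect of 3 for E trials,
# the adjustment recovers 3:
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trt = [True, False, True, False, True, False]
y = [2 * xi + (3.0 if t else 0.0) for xi, t in zip(x, trt)]
est = ancova_estimate(y, x, trt)
assert abs(est - 3.0) < 1e-9
```

Reassigning the same trials to different E and C groups would, in general, give a different b and hence a differently adjusted estimate, which is the source of the bias over the randomization set noted above.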

Significance levels for any adjusted estimate can be found by calculating the adjusted estimate rather than the simple E − C difference for each of the equally likely allocations in the randomization set. However, if the adjusted estimate does not tend to estimate τ in a sensible manner, the resulting significance level may not be of much interest.

Now consider a variable that is brought to the investigator's attention, but he feels it cannot causally affect Y (e.g., in the compensatory reading example, age of oldest living relative). Eventually a skeptic can find such a variable that systematically differs in the E trials and the C trials even in the best of experiments. Considering only that variable, it is indeed unlikely given randomization that there would be such a discrepancy between its values in E trials and C trials, but its occurrence cannot be denied. If the skeptic adjusts ȳd by using a standard model (e.g., covariance), the adjusted estimate and related significance levels may then give misleading results (e.g., zero estimate of τ; the hypothesis that all causal effects are zero, τ*j = 0, is very plausible). In fact, using such models one can obtain any estimated causal effect desired by searching for and finding a prior variable or combination of prior variables that yield the desired result. Such a search should be more difficult given that randomization was performed, but even with randomized data the investigator must be prepared to ignore variables that he feels cannot causally affect Y. On the other hand, he may want to adjust for such a variable if he feels it is a surrogate for an unmeasured variable that can causally affect Y (e.g., age of oldest living relative is a surrogate for mental stability of the family in the compensatory reading example).

The point of this discussion is that when trying to estimate the typical causal effect in the 2N-trial experiment, handling additional variables may not be trivial without a well-developed causal model that will properly adjust for those prior variables that causally affect Y and ignore other variables that do not causally affect Y even if they are highly correlated with the observed values of Y. Without such a model, the investigator must be prepared to ignore some variables he feels cannot causally affect Y and use a somewhat arbitrary model to adjust for those variables he feels are important. An example which demonstrates that it is not always simple to interpret significant results in a randomized experiment with many prior variables recorded is the recent controversy over the utility of oral diabetic drugs.9

Generalizing Results to Other Trials

In order to believe that the results of an experiment are of practical interest, we generally must believe that the 2N trials in the study are representative of a population of other future trials. For example, if the experimental treatment is a compensatory reading program and the trials are composed of sixth-grade school children with treatments initiated in fall 1970 and Y measured in spring 1971, the results are of little interest unless we believe they tell us something about future sixth graders who

9 See, for example, Schor (1971) and Cornfield (1971).

might be exposed to this compensatory reading program.

For simplicity, assume the 2N trials in the study are a simple random sample from a "target population" of M trials to which we want to generalize the results; by simple random sample we mean that each of the (M choose 2N) ways of choosing the 2N trials is equally likely to be selected. If T is the typical (average) causal effect for all M trials, it then follows, given random assignment of treatments, that the average E − C difference for the 2N trials used is an unbiased estimate of T over the random sampling plan and over the randomization set. In other words, in each of the r × (M choose 2N) ways of choosing 2N trials from M trials and then randomly assigning N trials to E and N trials to C, there is a calculated average E − C difference, and the average of these r × (M choose 2N) differences is T: Because of the randomization and random sampling, each trial is equally likely to be an E trial as a C trial and thus contributes yj(E)/N to the E − C difference as often as it contributes −yj(C)/N. It also follows that under a hypothesized set of causal effects, τ*j, j = 1, …, M, the significance level (the probability that we would observe a difference as far or farther from τ̄* than ȳd), given that we have sampled the 2N trials in the study, is m/r, where m is the number of allocations in the randomization set that yield estimates as far or farther from τ̄* than ȳd.10
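The double averaging claimed here, over every simple random sample and every allocation within it, can be verified exhaustively for a toy population; the potential outcomes below are made up, with M = 6 and 2N = 4.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for a target population of M = 6 trials.
yE = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0]
yC = [8.0, 9.0, 8.0, 10.0, 10.0, 11.0]
M, n2 = len(yE), 4  # each study samples 2N = 4 trials
T = mean(yE[j] - yC[j] for j in range(M))  # population typical effect

diffs = []
for sample in combinations(range(M), n2):      # every simple random sample
    for sE in combinations(sample, n2 // 2):   # every allocation within it
        sC = [j for j in sample if j not in sE]
        diffs.append(mean(yE[j] for j in sE) - mean(yC[j] for j in sC))

# Over all (M choose 2N) x (2N choose N) equally likely sample/allocation
# pairs, the average E - C difference equals T.
assert abs(mean(diffs) - T) < 1e-9
```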

If we let M grow to infinity (a reasonable assumption in many experiments when the population to which we want to generalize results is essentially unlimited, for example, all future sixth-grade students), some additional probabilistic results follow. For example, the usual covariance adjusted estimate is an unbiased estimate of T (not necessarily τ) over the random sampling plan and the randomization set, but whether the adjustment actually adjusts for the

10 Even though we have hypothesized τ*j for all trials, we cannot calculate hypothesized ŷj(E) and ŷj(C) for the unsampled trials, and thus the probabilistic statement is conditional on the observed trials.


additional variable(s) still depends on the appropriateness of the underlying linear model.

Hence, given random sampling of trials, the ability to generalize results to other trials seems relatively straightforward probabilistically. However, most experiments are designed to be generalized to future trials; we never have a random sample of trials from the future but at best a random sample from the present; in fact, experiments are usually conducted in constrained, atypical environments and within a restricted period of time. Thus, in order to generalize the results of any experiment to future trials of interest, we minimally must believe that there is a similarity of effects across time and more often must believe that the trials in the study are "representative" of the population of trials. This step of faith may be called making an assumption of "subjective random sampling" in order to assert such properties as (a) ȳd (or ȳd adjusted) tends to estimate the typical causal effect T and (b) the plausibility of hypothesized τ*j, j = 1, …, M, is given by the usual conditional significance level.

Even though the trials in an experiment are often not very representative of the trials of interest, investigators do make and must be willing to make this assumption of subjective random sampling in order to believe their results are useful. When investigators carefully describe their sample of trials and the ways in which they may differ from those in the target population, this tacit assumption of subjective random sampling seems perfectly reasonable. If there is an important variable that differs between the sample of trials and the population of trials, an attempt to adjust the estimate based on the same kinds of models discussed previously is quite appropriate.11

PRESENTING THE RESULTS OF A NONRANDOMIZED STUDY AS BEING OF GENERAL INTEREST

The same two issues previously discussed as arising when presenting the results of an experiment also arise when presenting the results of a nonrandomized study as being relevant. However, the first issue, the effect of variables not explicitly controlled, is usually more serious in nonrandomized than in randomized studies, while the second, the applicability of the results to a population of interest, is often more serious in randomized than in nonrandomized studies.

Effect of Variables Not Explicitly Controlled

In order to believe that ȳd in a nonrandomized study is a good estimate of τ, the typical causal effect for the 2N trials in the study, we must believe that there are no extraneous variables that affect Y and systematically differ in the E and C groups; but we have to believe this even in a randomized experiment. The primary difference is that without randomization there is often a strong suspicion that there are such variables, while with randomization such suspicions are generally not as strong.

Consider a carefully controlled nonrandomized study—a study in which there are no obviously important prior variables that systematically differ in the E trials and the C trials. In such a study, there is a real sense in which a claim of "subjective randomization" can be made. For example, if the study was composed of carefully matched pairs of trials, there might be a very defensible belief that within each matched pair each unit was equally likely to receive E as C in the sense that if you were shown the units without being told which received E, only half the time would you correctly guess which received E.12 Under this assumption of subjective randomization, the usual estimates and significance levels can be used as if the study had been randomized; this procedure is analogous to assuming subjective random sampling in order to make inferences about a target population. Until an obviously important variable is found that systematically differs in the E and C trials, the belief in subjective randomization is well founded.

Now consider a nonrandomized study in

11 See Cochran (1963) on regression and ratio adjustments. These are appropriate whether the sample is actually random or not.

12 Perhaps this is all that is meant by "randomization" to some Bayesians under any circumstance (see Savage, 1954, p. 66).


which an obviously important prior variable is found that systematically differs in the E and C trials. We must adjust the estimate ȳd and the associated significance levels just as we would if the study were in fact a properly randomized experiment. An obvious way to adjust for such variables is to assume subjective randomization (i.e., the study was randomized and the observed difference on prior variables occurred "by chance"), and use the methods discussed in the previous section "Considering Additional Variables" appropriate for an experiment (i.e., gain scores, adjustment by a known function, covariance adjustment).

The main problem with this approach is that having found an important prior variable that systematically differs in the E and C trials, we might suspect that there are other such variables, while if the study were randomized we might not be as suspicious of finding these prior variables. Additionally, the various methods of adjustment that yield unbiased estimates given randomization have varying biases under different models without randomization. Even though an unbiased estimate is not (as we have seen) the total answer to estimating τ, it is more desirable than a badly biased estimate. Recent work on methods of reducing bias in nonrandomized studies is summarized in Cochran and Rubin (1974). Much work remains to be done, especially for many prior variables and nonlinear relations between these and Y.

In sum, with respect to variables not explicitly controlled, a randomized study leaves the investigator in a more comfortable position than does a nonrandomized study. Nevertheless, the following points remain true for both: (a) Any adjustment is somewhat dependent upon the appropriateness of the underlying model—if the model is appropriate the confounding effect of the prior variables is reduced or eliminated, while if the model is inappropriate a confounding effect remains. (b) We can never know that all causally relevant prior variables that systematically differ in the E and C trials have been controlled. (c) We must be prepared to ignore irrelevant prior variables even if they systematically differ in E and C trials, or else we can obtain any estimate desired by eventually finding the "right" irrelevant prior variables.

Generalizing Results to Other Trials

For almost any study to be of interest, the results must be generalizable to a population of trials. Typically, nonrandomized studies have more representative trials than experiments, since the latter are often conducted in constrained environments. Thus, if the choice is between a nonrandomized study whose 2N trials consisted of N representative E trials closely matched to N representative C trials and an experiment whose 2N trials were highly atypical, it is not clear which we should prefer; in practice there may be a trade-off between the reasonableness of the assumptions of subjective random sampling and subjective randomization (e.g., consider a carefully matched nonrandomized evaluation of existing compensatory reading programs and an experiment having these compensatory reading programs randomly assigned to inmates at a penitentiary).

In a sense, all studies lie on a continuum from irrelevant to relevant with respect to answering a question. A poorly controlled nonrandomized study conducted on atypical trials is barely relevant, but a small randomized study with much missing data conducted on the same atypical trials is not much better. Similarly, a very well-controlled experiment conducted on a representative sample of trials is very relevant, and a very well-controlled nonrandomized study (e.g., E and C trials matched on all causally important variables, several control groups each with a potentially different bias) conducted on a representative sample of trials is almost as good. Typically, real-world studies fall somewhere in the middle of this continuum, with nonrandomized studies having more representative trials than experiments but less control over prior variables.

SUMMARY

The basic position of this paper can be summarized as follows: estimating the typical causal effect of one treatment versus another is a difficult task unless we understand the actual process well enough to (a) assign most of the variability in Y to specific causes and (b) ignore associated but causally irrelevant variables. Short of such understanding, random sampling and randomization help in that all sensible estimates tend to estimate the correct quantity, but these procedures can never completely assure us that we are obtaining a good estimate of the treatment effect.13

Almost never do we have a random sample from the target population of trials, and thus we must generally rely on the belief in subjective random sampling, that is, that there is no important variable that differs in the sample and the target population. Similarly, often the only data available are observational, and we must rely on the belief in subjective randomization, that is, that there is no important variable that differs in the E trials and C trials. With or without random sampling or randomization, if an important prior variable is found that systematically differs in E and C trials or in the sample and target population, we are faced with either adjusting for it or not putting much faith in our estimate. However, we cannot adjust for any variable presented, because if we do, any estimate can be obtained.

In both randomized and nonrandomized studies, the investigator should think hard about variables besides the treatment that may causally affect Y and plan in advance how to control for the important ones—either by matching or adjustment or both. When presenting the results to the reader, it is important to indicate the extent to which the assumptions of subjective randomization and subjective random sampling can be believed and what methods of control have been employed.14 If a nonrandomized study is carefully controlled, the investigator can reach conclusions similar to those he

13 Even assuming a good estimate of the causal effect of E versus C, there remains the problem of determining which aspects of the treatments are responsible for the effect. Consider, for example, "expectancy" effects in education (Rosenthal, 1971) and the associated problems of deciding the relative causal effects of the content of programs and the implementation of programs.

14 Recent advice on the design and analysis of observational studies is given by Cochran in Bancroft (1972).

would reach in a similar experiment.15 In fact, if the effect of the E versus C treatment is large enough, he will be able to detect it in small, nonrepresentative samples and poorly controlled studies.

Basic problems in social science research are that causal models are not yet well formulated, there are many possible treatments, and in many cases the differential effects of treatments appear to be quite small. Given this situation, it seems reasonable to (a) search for treatments with large effects using well-controlled nonrandomized studies when experiments are impractical and (b) rely on further experimental study for more refined estimates of the effects of those treatments that appear to be important. The practical alternative to using nonrandomized studies in this way is evaluating many treatments by introspection rather than by data analysis.

REFERENCES

Bancroft, T. A. Statistical papers in honor of George W. Snedecor. Ames: Iowa State University Press, 1972.

Campbell, D. T., & Erlebacher, A. How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), The disadvantaged child. Vol. 3. Compensatory education: A national debate. New York: Brunner/Mazel, 1970.

Cochran, W. G. Sampling techniques. New York: Wiley, 1963.

Cochran, W. G., & Cox, G. M. Experimental designs. (2nd ed.) New York: Wiley, 1957.

Cochran, W. G., & Rubin, D. B. Controlling bias in observational studies: A review. Mahalanobis Memorial Volume, Sankhyā, Series A, 1974, 1-30.

Cornfield, J. The University Group Diabetes Program. Journal of the American Medical Association, 1971, 217, 1676-1687.

Davies, O. L. Design and analysis of industrial experiments. London: Oliver & Boyd, 1954.

Fisher, R. A. Statistical methods for research workers. (1st ed.) New York: Hafner, 1925.

Kempthorne, O. The design and analysis of experiments. New York: Wiley, 1952.

Lehmann, E. L. Testing statistical hypotheses. New York: Wiley, 1959.

15 See, for example, the large scale evaluation of the Salk vaccine described by Meier in Tanur et al. (1972).

Rosenthal, R. Teacher expectation and pupil learning. In R. D. Strom (Ed.), Teachers and the learning process. Englewood Cliffs, N.J.: Prentice-Hall, 1971.

Savage, L. J. The foundations of statistics. New York: Wiley, 1954.

Scheffé, H. The analysis of variance. New York: Wiley, 1959.

Schor, S. The University Group Diabetes Program. Journal of the American Medical Association, 1971, 217, 1671-1675.

Tanur, J. M., Mosteller, F., Kruskal, W. H., Link, R. F., Pieters, R. S., & Rising, G. R. Statistics: A guide to the unknown. San Francisco: Holden-Day, 1972.

(Received July 30, 1973)