
In Defense of External Invalidity

Douglas G. Mook
University of Virginia

ABSTRACT: Many psychological investigations are accused of "failure to generalize to the real world" because of sample bias or artificiality of setting. It is argued in this article that such "generalizations" often are not intended. Rather than making predictions about the real world from the laboratory, we may test predictions that specify what ought to happen in the lab. We may regard even "artificial" findings as interesting because they show what can occur, even if it rarely does. Or, where we do make generalizations, they may have added force because of artificiality of sample or setting. A misplaced preoccupation with external validity can lead us to dismiss good research for which generalization to real life is not intended or meaningful.

The greatest weakness of laboratory experiments lies in their artificiality. Social processes observed to occur within a laboratory setting might not necessarily occur within more natural social settings.

—Babbie, 1975, p. 254

In order to behave like scientists we must construct situations in which our subjects . . . can behave as little like human beings as possible and we do this in order to allow ourselves to make statements about the nature of their humanity.

—Bannister, 1966, p. 24

Experimental psychologists frequently have to listen to remarks like these. And one who has taught courses in research methods and experimental psychology, as I have for the past several years, has probably had no problem in alerting students to the "artificiality" of research settings. Students, like laypersons (and not a few social scientists for that matter), come to us quite prepared to point out the remoteness of our experimental chambers, our preoccupation with rats and college sophomores, and the comic-opera "reactivity" of our shock generators, electrode paste, and judgments of lengths of line segments on white paper.

They see all this. My problem has been not to alert them to these considerations, but to break their habit of dismissing well-done, meaningful, informative research on grounds of "artificiality."

The task has become a bit harder over the last few years because a full-fledged "purr" word has gained currency: external validity. Articles and monographs have been written about its proper nurture, and checklists of specific threats to its well-being are now appearing in textbooks. Studies unescorted by it are afflicted by—what else?—external invalidity. That phrase has a lovely mouth-filling resonance to it, and there is, to be sure, a certain poetic justice in our being attacked with our own jargon.

Warm Fuzzies and Cold Creepies

The trouble is that, like most "purr" and "snarl" words, the phrases external validity and external invalidity can serve as serious barriers to thought. Obviously, any kind of validity is a warm, fuzzy Good Thing; and just as obviously, any kind of invalidity must be a cold, creepy Bad Thing. Who could doubt it?

It seems to me that these phrases trapped even their originators, in just that way. Campbell and Stanley (1967) introduce the concept thus: "External validity asks the question of generalizability: To what populations, settings, treatment variables, and measurement variables can this effect be generalized?" (p. 5). Fair enough. External validity is not an automatic desideratum; it asks a question. It invites us to think about the prior questions: To what populations, settings, and so on, do we want the effect to be generalized? Do we want to generalize it at all?

But their next sentence is: "Both types of criteria are obviously important . . ." And ". . . the selection of designs strong in both types of validity is obviously our ideal" (Campbell & Stanley, 1967, p. 5).

I intend to argue that this is simply wrong. If it sounds plausible, it is because the word validity has given it a warm coat of downy fuzz. Who wants to be invalid—internally, externally, or in any other way? One might as well ask for acne. In a way, I wish the authors had stayed with the term generalizability, precisely because it does not sound nearly so good. It would then be easier to remember that we are not dealing with a criterion, like clear skin, but with a question, like "How can we get this sofa down the stairs?" One asks that question if, and only if, moving the sofa is what one wants to do.

But generalizability is not quite right either. The question of external validity is not the same as the question of generalizability. Even an experiment that is clearly "applicable to the real world," perhaps because it was conducted there (e.g., Bickman's, 1974, studies of obedience on the street corner), will have some limits to its generalizability. Cultural, historical, and age-group limits will surely be present; but these are unknown and no single study can discover them all. Their determination is empirical.

The external-validity question is a special case. It comes to this: Are the sample, the setting, and the manipulation so artificial that the class of "target" real-life situations to which the results can be generalized is likely to be trivially small? If so, the experiment lacks external validity. But that argument still begs the question I wish to raise here: Is such generalization our intent? Is it what we want to do? Not always.

The Agricultural Model

These baleful remarks about external validity (EV) are not quite fair to its originators. In defining the concept, they had a particular kind of research in mind, and it was the kind in which the problem of EV is meaningful and important.

These are the applied experiments. Campbell and Stanley (1967) had in mind the kind of investigation that is designed to evaluate a new teaching procedure or the effects of an "enrichment" program on the culturally deprived. For that matter, the research context in which sampling theory was developed in its modern form—agricultural research—has a similar purpose. The experimental setting resembles, or is a special case of, a real-life setting in which one wants to know what to do. Does this fertilizer (or this pedagogical device) promote growth in this kind of crop (or this kind of child)? If one finds a significant improvement in the experimental subjects as compared with the controls, one predicts that implementation of a similar manipulation, in a similar setting with similar subjects, will be of benefit on a larger scale.

That kind of argument does assume that one's experimental manipulation represents the broader-scale implementation and that one's subjects and settings represent their target populations. Indeed, part of the thrust of the EV concept is that we have been concerned only with subject representativeness and not enough with representativeness of the settings and manipulations we have sampled in doing experiments.

Deese (1972), for example, has taken us to task for this neglect:

Some particular set of conditions in an experiment is generally taken to be representative of all possible conditions of a similar type. . . . In the investigation of altruism, situations are devised to permit people to make altruistic choices. Usually a single situation provides the setting for the experimental testing. . . . [the experimenter] will allow that one particular situation to stand for the unspecified circumstances in which an individual could be altruistic. . . . the social psychologist as experimenter is content to let a particular situation stand for an indefinite range of possible testing situations in a vague and unspecified way. (pp. 59-60)

It comes down to this: The experimenter is generalizing on the basis of a small and biased sample, not of subjects (though probably those too), but of settings and manipulations.1

The entire argument rests, however, on an applied, or what I call an "agricultural," conception of the aims of research. The assumption is that the experiment is intended to be generalized to similar subjects, manipulations, and settings. If this is so, then the broader the generalizations one can make, the more real-world occurrences one can predict from one's findings and the more one has learned about the real world from them. However, it may not be so. There are experiments—very many of them—that do not have such generalization as their aim.

This is not to deny that we have talked nonsense on occasion. We have. Sweeping generalizations about "altruism," or "anxiety," or "honesty" have been made on evidence that does not begin to support them, and for the reasons Deese gives. But let it also be said that in many such cases, we have seemed to talk nonsense only because our critics, or we ourselves, have assumed that the "agricultural" goal of generalization is part of our intent.

But in many (perhaps most) of the experiments Deese has in mind, the logic goes in a different direction. We are not making generalizations, but testing them. To show what a difference this makes, let me turn to an example.

A Case Study of a Flat Flunk

Surely one of the experiments that has had permanent impact on our thinking is the study of "mother love" in rhesus monkeys, elegantly conducted by Harlow. His wire mothers and terry-cloth mothers are permanent additions to our vocabulary of classic manipulations.

I thank James E. Deese and Wayne Shebilske for their comments on an earlier version of this article.

Requests for reprints should be sent to Douglas G. Mook, Department of Psychology, University of Virginia, Charlottesville, Virginia 22901.

1 In fairness, Deese goes on to make a distinction much like the one I intend here. "If the theory and observations are explicitly related to one another through some rigorous logical process, then the sampling of conditions may become completely unnecessary" (p. 60). I agree. "But a theory having such power is almost never found in psychology" (p. 61). I disagree, not because I think our theories are all that powerful, but because I do not think all that much power is required for what we are usually trying to do.


And his finding that contact comfort was a powerful determinant of "attachment," whereas nutrition was small potatoes, was a massive spike in the coffin of the moribund, but still wriggling, drive-reduction theories of the 1950s.

As a case study, let us see how the Harlow wire- and cloth-mother experiment stands up to the criteria of EV.

The original discussion of EV by Campbell and Stanley (1967) reveals that the experimental investigation they had in mind was a rather complex mixed design with pretests, a treatment imposed or withheld (the independent variable), and a posttest. Since Harlow's experiment does not fit this mold, the first two of their "threats to external validity" do not arise at all: pretest effects on responsiveness and multiple-treatment interference.

The other two threats on their list do arise in Harlow's case. First, "there remains the possibility that the effects . . . hold only for that unique population from which the . . . [subjects were] selected" (Campbell & Stanley, 1967, p. 19). More generally, this is the problem of sampling bias, and it raises the specter of an unrepresentative sample. Of course, as every student knows, the way to combat the problem (and never mind that nobody does it) is to select a random sample from the population of interest.

Were Harlow's baby monkeys representative of the population of monkeys in general? Obviously not; they were born in captivity and then orphaned besides. Well, were they a representative sample of the population of lab-born, orphaned monkeys? There was no attempt at all to make them so. It must be concluded that Harlow's sampling procedures fell far short of the ideal.

Second, we have the undeniable fact of the "patent artificiality of the experimental setting" (Campbell & Stanley, 1967, p. 20). Campbell and Stanley go on to discuss the problems posed by the subjects' knowledge that they are in an experiment and by what we now call "demand characteristics." But the problem can be generalized again: How do we know that what the subjects do in this artificial setting is what they would do in a more natural one? Solutions have involved hiding from the subjects the fact that they are subjects; moving from a laboratory to a field setting; and, going further, trying for a "representative sample" of the field settings themselves (e.g., Brunswik, 1955).

What then of Harlow's work? One does not know whether his subjects knew they were in an experiment; certainly there is every chance that they experienced "expectations of the unusual, with wonder and active puzzling" (Campbell & Stanley, 1967, p. 21). In short, they must have been cautious, bewildered, reactive baby monkeys indeed. And what of the representativeness of the setting? Real monkeys do not live within walls. They do not encounter mother figures made of wire mesh, with rubber nipples; nor is the advent of a terry-cloth cylinder, warmed by a light bulb, a part of their natural lifestyle. What can this contrived situation possibly tell us about how monkeys with natural upbringing would behave in a natural setting?

On the face of it, the verdict must be a flat flunk. On every criterion of EV that applies at all, we find Harlow's experiment either manifestly deficient or simply unevaluable. And yet our tendency is to respond to this critique with a resounding "So what?" And I think we are quite right to so respond.

Why? Because using the lab results to make generalizations about real-world behavior was no part of Harlow's intention. It was not what he was trying to do. That being the case, the concept of EV simply does not arise—except in an indirect and remote sense to be clarified shortly.

Harlow did not conclude, "Wild monkeys in the jungle probably would choose terry-cloth over wire mothers, too, if offered the choice." First, it would be a moot conclusion, since that simply is not going to happen. Second, who cares whether they would or not? The generalization would be trivial even if true. What Harlow did conclude was that the hunger-reduction interpretation of mother love would not work. If anything about his experiment has external validity, it is this theoretical point, not the findings themselves. And to see whether the theoretical conclusion is valid, we extend the experiments or test predictions based on theory.2 We do not dismiss the findings and go back to do the experiment "properly," in the jungle with a random sample of baby monkeys.

The distinction between generality of findings and generality of theoretical conclusions underscores what seems to me the most important source of confusion in all this, which is the assumption that the purpose of collecting data in the laboratory is to predict real-life behavior in the real world. Of course, there are times when that is what we are trying to do, and there are times when it is not. When it is, then the problem of EV confronts us, full force. When it is not, then the problem of EV is either meaningless or trivial, and a misplaced preoccupation with it can seriously distort our evaluation of the research.

But if we are not using our experiments to predict real-life behavior, what are we using them for? Why else do an experiment?

2 The term theory is used loosely to mean, not a strict deductive system, but a conclusion on which different findings converge. Harlow's demonstration draws much of its force from the context of other findings (by Ainsworth, Bowlby, Spitz, and others) with which it articulates.


There are a number of other things we may be doing. First, we may be asking whether something can happen, rather than whether it typically does happen. Second, our prediction may be in the other direction; it may specify something that ought to happen in the lab, and so we go to the lab to see whether it does. Third, we may demonstrate the power of a phenomenon by showing that it happens even under unnatural conditions that ought to preclude it. Finally, we may use the lab to produce conditions that have no counterpart in real life at all, so that the concept of "generalizing to the real world" has no meaning. But even where findings cannot possibly generalize and are not supposed to, they can contribute to an understanding of the processes going on. Once again, it is that understanding which has external validity (if it does)—not the findings themselves, much less the setting and the sample. And this implies in turn that we cannot assess that kind of validity by examining the experiment itself.

Alternatives to Generalization

"What Can" Versus "What Does"

"Person perception studies using photographs or brief exposure of the stimulus person have commonly found that spectacles, lipstick and untidy hair have a great effect on judgments of intelligence and other traits. It is suggested . . . that these results are probably exaggerations of any effect that might occur when more information about a person is available" (Argyle, 1969, p. 19). Later in the same text, Argyle gives a specific example: "Argyle and McHenry found that targeted persons were judged as 13 points of IQ more intelligent when wearing spectacles and when seen for 15 seconds; however, if they were seen during 5 minutes of conversation spectacles made no difference" (p. 135).

Argyle (1969) offers these data as an example of how "the results [of an independent variable studied in isolation] may be exaggerated" (p. 19). Exaggerated with respect to what? With respect to what "really" goes on in the world of affairs. It is clear that on these grounds, Argyle takes the 5-minute study, in which glasses made no difference, more seriously than the 15-second study, in which they did.

Now from an "applied" perspective, there is noquestion that Argyle is right. Suppose that only the15-second results were known; and suppose that onthe basis of them, employment counselors beganadvising their students to wear glasses or sales execu-tives began requiring their salespeople to do so. Theresult would be a great deal of wasted time, and allbecause of an "exaggerated effect," or what I havecalled an "inflated variable" (Mook, 1982). Powerfulin the laboratory (13 IQ points is a lot!), eyeglasses

are a trivial guide to a person's intelligence and aretreated as such when more information is available.

On the other hand, is it not worth knowing that such a bias can occur, even under restricted conditions? Does it imply an implicit "theory" or set of "heuristics" that we carry about with us? If so, where do they come from?

There are some intriguing issues here. Why should the person's wearing eyeglasses affect our judgments of his or her intelligence under any conditions whatever? As a pure guess, I would hazard the following: Maybe we believe that (a) intelligent people read more than less intelligent ones, and (b) that reading leads to visual problems, wherefore (c) the more intelligent are more likely to need glasses. If that is how the argument runs, then it is an instance of how our person perceptions are influenced by causal "schemata" (Nisbett & Ross, 1980)—even where at least one step in the theoretical sequence ([b] above) is, as far as we know, simply false.

Looked at in that way, the difference between the 15-second and the 5-minute condition is itself worth investigating further (as it would not be if the latter simply "invalidated" the former). If we are so ready to abandon a rather silly causal theory in the light of more data, why are some other causal theories, many of them even sillier, so fiercely resistant to change?

The point is that in thinking about the matter this way, we are taking the results strictly as we find them. The fact that eyeglasses can influence our judgments of intelligence, though it may be quite devoid of real-world application, surely says something about us as judges. If we look just at that, then the issue of external validity does not arise. We are no longer concerned with generalizing from the lab to the real world. The lab (qua lab) has led us to ask questions that might not otherwise occur to us. Surely that alone makes the research more than a sterile intellectual exercise.

Predicting From and Predicting To

The next case study has a special place in my heart. It is one of the things that led directly to this article, which I wrote fresh from a delightful roaring argument with my students about the issues at hand.

The study is a test of the tension-reduction view of alcohol consumption, conducted by Higgins and Marlatt (1973). Briefly, the subjects were made either highly anxious or not so anxious by the threat of electric shock, and were permitted access to alcohol as desired. If alcohol reduces tension and if people drink it because it does so (Cappell & Herman, 1972), then the anxious subjects should have drunk more. They did not.

Writing about this experiment, one of my better students gave it short shrift: "Surely not many alcoholics are presented with such a threat under normal conditions."

Indeed. The threat of electric shock can hardly be "representative" of the dangers faced by anyone except electricians, hi-fi builders, and Psychology 101 students. What then? It depends! It depends on what kind of conclusion one draws and what one's purpose is in doing the study.

Higgins and Marlatt could have drawn this conclusion: "Threat of shock did not cause our subjects to drink in these circumstances. Therefore, it probably would not cause similar subjects to drink in similar circumstances either." A properly cautious conclusion, and manifestly trivial.

Or they could have drawn this conclusion: "Threat of shock did not cause our subjects to drink in these circumstances. Therefore, tension or anxiety probably does not cause people to drink in normal, real-world situations." That conclusion would be manifestly risky, not to say foolish; and it is that kind of conclusion which raises the issue of EV. Such a conclusion does assume that we can generalize from the simple and protected lab setting to the complex and dangerous real-life one and that the fear of shock can represent the general case of tension and anxiety. And let me admit again that we have been guilty of just this kind of foolishness on more than one occasion.

But that is not the conclusion Higgins and Marlatt drew. Their argument had an entirely different shape, one that changes everything. Paraphrased, it went thus: "Threat of shock did not cause our subjects to drink in these circumstances. Therefore, the tension-reduction hypothesis, which predicts that it should have done so, either is false or is in need of qualification." This is our old friend, the hypothetico-deductive method, in action. The important point to see is that the generalizability of the results, from lab to real life, is not claimed. It plays no part in the argument at all.
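The shape of that argument is worth displaying. Schematically (the notation is mine, not Higgins and Marlatt's), it is the modus tollens of hypothesis testing:

\[ H \rightarrow P; \qquad \neg P; \qquad \therefore\ \neg H \ \text{(or $H$ must be qualified)}, \]

where $H$ is the tension-reduction hypothesis and $P$ is the prediction that threatened subjects will drink more. Nothing in the schema asks whether the laboratory resembles real life; the laboratory's only job is to establish $\neg P$ under conditions where $P$ was genuinely required by $H$.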

Of course, these findings may not require much modification of the tension-reduction hypothesis. It is possible—indeed it is highly likely—that there are tensions and tensions; and perhaps the nagging fears and self-doubts of the everyday have a quite different status from the acute fear of electric shock. Maybe alcohol does reduce these chronic fears and is taken, sometimes abusively, because it does so.3 If these possibilities can be shown to be true, then we could sharpen the tension-reduction hypothesis, restricting it (as it is not restricted now) to certain kinds of tension and, perhaps, to certain settings. In short, we could advance our understanding. And the "artificial" laboratory findings would have contributed to that advance. Surely we cannot reasonably ask for more.

3 I should note, however, that there is considerable doubt about that as a statement of the general case. Like Harlow's experiment, the Higgins and Marlatt (1973) study articulates with a growing body of data from very different sources and settings, but all, in this case, calling the tension-reduction theory into question (cf. Mello & Mendelson, 1978).

It seems to me that this kind of argument characterizes much of our research—much more of it than our critics recognize. In very many cases, we are not using what happens in the laboratory to "predict" the real world. Prediction goes the other way: Our theory specifies what subjects should do in the laboratory. Then we go to the laboratory to ask, Do they do it? And we modify our theory, or hang onto it for the time being, as results dictate. Thus we improve our theories, and—to say it again—it is these that generalize to the real world if anything does.

Let me turn to an example of another kind. To this point, it is artificiality of setting that has been the focus. Analogous considerations can arise, however, when one thinks through the implications of artificiality of, or bias in, the sample. Consider a case study.

A great deal of folklore, supported by some powerful psychological theories, would have it that children acquire speech of the forms approved by their culture—that is, grammatical speech—through the impact of parents' reactions to what they say. If a child emits a properly formed sentence (so the argument goes), the parent responds with approval or attention. If the utterance is ungrammatical, the parent corrects it or, at the least, withholds approval.

Direct observation of parent-child interactions, however, reveals that this need not happen. Brown and Hanlon (1970) report that parents react to the content of a child's speech, not to its form. If the sentence emitted is factually correct, it is likely to be approved by the parent; if false, disapproved. But whether the utterance embodies correct grammatical form has surprisingly little to do with the parent's reaction to it.

What kind of sample were Brown and Hanlon dealing with here? Families that (a) lived in Boston, (b) were well educated, and (c) were willing to have squadrons of psychologists camped in their living rooms, taping their conversations. It is virtually certain that the sample was biased even with respect to the already limited "population" of upper-class-Bostonian-parents-of-young-children.

Surely a sample like that is a poor basis from which to generalize to any interesting population. But what if we turn it around? We start with the theoretical proposition: Parents respond to the grammar of their children's utterances (as by making approval contingent or by correcting mistakes). Now we make the prediction: Therefore, the parents we observe ought to do that. And the prediction is disconfirmed.

Going further, if we find that the children Brown and Hanlon studied went on to acquire Bostonian-approved syntax, as seems likely, then we can draw a further prediction and see it disconfirmed. If the theory is true, and if these parents do not react to grammaticality or its absence, then these children should not pick up grammatical speech. If they do so anyway, then parental approval is not necessary for the acquisition of grammar. And that is shown not by generalizing from sample to population, but by what happened in the sample.

It is of course legitimate to wonder whether the same contingencies would appear in Kansas City working-class families or in slum dwellers in the Argentine. Maybe parental approval/disapproval is a much more potent influence on children's speech in some cultures or subcultures than in others. Nevertheless, the fact would remain that the parental approval theory holds only in some instances and must be qualified appropriately. Again, that would be well worth knowing, and this sample of families would have played a part in establishing it.

The confusion here may reflect simple historical accident. Considerations of sampling from populations were brought to our attention largely by survey researchers, for whom the procedure of "generalizing to a population" is of vital concern. If we want to estimate the proportion of the electorate intending to vote for Candidate X, and if Y% of our sample intends to do so, then we want to be able to say something like this: "We can be 95% confident that Y% of the voters, plus or minus Z, intend to vote for X." Then the issue of representativeness is squarely before us, and the horror stories of biased sampling and wildly wrong predictions, from the Literary Digest poll on down, have every right to keep us awake at night.
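(To make the "plus or minus Z" concrete: for a simple random sample of size $n$ with sample proportion $\hat{p}$, the standard large-sample 95% margin of error is

\[ Z = 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \]

so a poll of 1,000 voters with $\hat{p} = .50$ carries a margin of about 3 percentage points either way. Note that the formula presupposes random sampling from the target population, which is exactly what the Literary Digest poll lacked.)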

But what has to be thought through, case by case, is whether that is the kind of conclusion we intend to draw. In the Brown and Hanlon (1970) case, nothing could be more unjustified than a statement of the kind, "We can be W% certain that X% of the utterances of Boston children, plus or minus Y, are true and are approved." The biased sample rules such a conclusion out of court at the outset. But it was never intended. The intended conclusion was not about a population but about a theory. That parental approval tracks content rather than form, in these children, means that the parental approval theory of grammar acquisition either is simply false or interacts in unsuspected ways with some attribute(s) of the home.

In yet other cases, the subjects are of interest precisely because of their unrepresentativeness. Washoe, Sarah, and our other special students are of interest because they are not representative of a language-using species. And with all the quarrels their accomplishments have given rise to, I have not seen them challenged as "unrepresentative chimps," except by students on examinations (I am not making that up). The achievements of mnemonists (which show us what can happen, rather than what typically does) are of interest because mnemonists are not representative of the rest of us. And when one comes across a mnemonist one studies that mnemonist, without much concern for his or her representativeness even as a mnemonist.

But what do students read? "Samples should always be as representative as possible of the population under study." "[A] major concern of the behavioral scientist is to ensure that the sample itself is a good representative [sic] of the population." (The sources of these quotations do not matter; they come from an accidental sample of books on my shelf.)

The trouble with these remarks is not that they are false—sometimes they are true—but that they are unqualified. Representativeness of sample is of vital importance for certain purposes, such as survey research. For other purposes it is a trivial issue.4 Therefore, one must evaluate the sampling procedure in light of the purpose—separately, case by case.

Taking the Package Apart

Everyone knows that we make experimental settings artificial for a reason. We do it to control for extraneous variables and to permit separation of factors that do not come separately in Nature-as-you-find-it. But that leaves us wondering how, having stepped out of Nature, we get back in again. How do our findings apply to the real-life setting in all its complexity?

I think there are times when the answer has to be, "They don't." But we then may add, "Something else does. It is called understanding."

4 There is another sense in which "generalizing to a population" attends most psychological research: One usually tests the significance of one's findings, and in doing so one speaks of sample values as estimates of population parameters. In this connection, though, the students are usually reassured that they can always define the population in terms of the sample and take it from there—which effectively leaves them wondering what all the flap was about in the first place.

Perhaps this is the place to note that some of the case studies I have presented may raise questions in the reader's mind that are not dealt with here. Some raise the problem of interpreting null conclusions; adequacy of controls for confounding variables may be worrisome; and the Brown and Hanlon (1970) study faced the problem of observer effects (adequately dealt with, I think; see Mook, 1982). Except perhaps for the last one, however, these issues are separate from the problem of external validity, which is the only concern here.


As an example, consider dark adaptation. Psychophysical experiments, conducted in restricted, simplified, ecologically invalid settings, have taught us these things among others:

1. Dark adaptation occurs in two phases. There is a rapid and rather small increase in sensitivity, followed by a delayed but greater increase.

2. The first of these phases reflects dark adaptation by the cones; the second, by the rods.

Hecht (1934) demonstrated the second of these conclusions by taking advantage of some facts about cones (themselves established in ecologically invalid photochemical and histological laboratories). Cones are densely packed near the fovea; and they are much less sensitive than the rods to the shorter visible wavelengths. Thus, Hecht was able to tease out the cone component of the dark-adaptation curve by making his stimuli small, restricting them to the center of the visual field, and turning them red.
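As a purely schematic illustration (mine, not Hecht's; every parameter below is invented for the sake of the picture), the two-phase curve can be seen as the lower envelope of a fast, shallow cone curve and a slow, deep rod curve, with a small, central, red stimulus tracing the cone branch alone:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 30, 300)  # minutes in the dark

# Log threshold for each receptor system: exponential decay toward an
# asymptote (arbitrary units; time constants and levels are made up).
cone = 2.0 + 2.5 * np.exp(-t / 1.5)                      # fast, but bottoms out high
rod = 5.0 * np.exp(-np.clip(t - 5.0, 0.0, None) / 6.0)   # delayed, slow, goes much lower

# The observer detects via whichever system is more sensitive at each
# moment, producing the classic rod-cone break.
threshold = np.minimum(cone, rod)

plt.plot(t, threshold, "k", label="white test light (both systems)")
plt.plot(t, cone, "r--", label="small, foveal, red light (cones only)")
plt.xlabel("Time in dark (min)")
plt.ylabel("Log threshold (arbitrary units)")
plt.legend()
plt.title("Schematic dark-adaptation curves")
plt.show()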

Now let us contemplate the manifest ecological invalidity of this setting. We have a human subject in a dark room, staring at a place where a tiny red light may appear. Who on earth spends time doing that, in the world of affairs? And on each trial, the subject simply makes a "yes, I see it/no, I don't" response. Surely we have subjects who "behave as little like human beings as possible" (Bannister, 1966). We might be calibrating a photocell for all the difference it would make.

How then do the findings apply to the realworld? They do not. The task, variables, and settinghave no real-world counterparts, What does apply,and in spades, is the understanding of how the visualsystem works that such experiments have given us.That is what we apply to the real-world setting—toflying planes at night, to the problem of reading X-ray prints on the spot, to effective treatment of nightblindness produced by vitamin deficiency, and muchbesides.

Such experiments, I say, give us understanding of real-world phenomena. Why? Because the processes we dissect in the laboratory also operate in the real world. The dark-adaptation data are of interest because they show us a process that does occur in many real-world situations. Thus we could, it is true, look at the laboratory as a member of a class of "target" settings to which the results apply. But it certainly is not a "representative" member of that set. We might think of it as a limiting, or even defining, member of that set. To what settings do the results apply? The shortest answer is: to any setting in which it is relevant that (for instance) as the illumination dims, sensitivity to longer visible wavelengths drops out before sensitivity to short ones does. The findings do not represent a class of real-world phenomena; they define one.

Alternatively, one might use the lab not to explore a known phenomenon, but to determine whether such and such a phenomenon exists or can be made to occur. (Here again the emphasis is on what can happen, not what usually does.) Henshel (1980) has noted that some intriguing and important phenomena, such as biofeedback, could never have been discovered by sampling or mimicking natural settings. He points out, too, that if a desirable phenomenon occurs under laboratory conditions, one may seek to make natural settings mimic the laboratory rather than the other way around. Engineers are familiar with this approach. So, for instance, are many behavior therapists.

(I part company with Henshel's excellent discussion only when he writes, "The requirement of 'realism,' or a faithful mimicking of the outside world in the laboratory experiment, applies only to . . . hypothesis testing within the logico-deductive model of research" [p. 470]. For reasons given earlier, I do not think it need apply even there.)

The Drama of the Artificial

To this point, I have considered alternatives to the "analogue" model of research and have pointed out that we need not intend to generalize our results from sample to population, or from lab to life. There are cases in which we do want to do that, of course. Where we do, we meet another temptation: We may assume that in order to generalize to "real life," the laboratory setting should resemble the real-life one as much as possible. This assumption is the force behind the cry for "representative settings."

The assumption is false. There are cases in which the generalization from research setting to real-life settings is made all the stronger by the lack of resemblance between the two. Consider an example.

A research project that comes in for criticism along these lines is the well-known work on obedience by Milgram (1974). In his work, the difference between a laboratory and a real-life setting is brought sharply into focus. Soldiers in the jungles of Viet Nam, concentration camp guards on the fields of Eastern Europe—what resemblance do their environments bear to a sterile room with a shock generator and an intercom, presided over by a white-coated scientist? As a setting, Milgram's surely is a prototype of an "unnatural" one.

One possible reaction to that fact is to dismiss the work bag and baggage, as Argyle (1969) seems to do: "When a subject steps inside a psychological laboratory he steps out of culture, and all the normal rules and conventions are temporarily discarded and replaced by the single rule of laboratory culture—'do what the experimenter says, no matter how absurd or unethical it may be'" (p. 20). He goes on to cite Milgram's work as an example.


All of this—which is perfectly true—comes in a discussion of how "laboratory research can produce the wrong results" (Argyle, 1969, p. 19). The wrong results! But that is the whole point of the results. What Milgram has shown is how easily we can "step out of culture" in just the way Argyle describes—and how, once out of culture, we proceed to violate its "normal rules and conventions" in ways that are a revelation to us when they occur. Remember, by the way, that most of the people Milgram interviewed grossly underestimated the amount of compliance that would occur in that laboratory setting.

Another reaction, just as wrong but unfortunately even more tempting, is to start listing similarities and differences between the lab setting and the natural one. The temptation here is to get involved in count-'em mechanics: The more differences there are, the greater the external invalidity. Thus:

One element lacking in Milgram's situation that typically obtains in similar naturalistic situations is that the experimenter had no real power to harm the subject if the subject failed to obey orders. The subject could always simply get up and walk out of the experiment, never to see the experimenter again. So when considering Milgram's results, it should be borne in mind that a powerful source of obedience in the real world was lacking in this situation. (Kantowitz & Roediger, 1978, pp. 387-388)

"Borne in mind" to what conclusion? Since the nextsentence is "Nonetheless, Milgram's results are trulyremarkable" (p. 388), we must suppose that the re-marks were meant in criticism.

Now the lack of threat of punishment is, to be sure, a major difference between Milgram's lab and the jungle war or concentration camp setting. But what happened? An astonishing two thirds obeyed anyway. The force of the experimenter's authority was sufficient to induce normal decent adults to inflict pain on another human being, even though they could have refused without risk. Surely the absence of power to punish, though a distinct difference between Milgram's setting and the others, only adds to the drama of what he saw.

There are other threats to the external validity of Milgram's findings, and some of them must be taken more seriously. There is the possibility that the orders he gave were "legitimized by the laboratory setting" (Orne & Evans, 1965, p. 199). Perhaps his subjects said in effect, "This is a scientific experiment run by a responsible investigator, so maybe the whole business isn't as dangerous as it looks." This possibility (which is quite distinct from the last one, though the checklist approach often confuses the two) does leave us with nagging doubts about the generalizability of Milgram's findings. Camp guards and jungle fighters do not have this cognitive escape hatch available to them. If Milgram's subjects did say "It must not be dangerous," then his conclusion—people are surprisingly willing to inflict danger under orders—is in fact weakened.

The important thing to see is that the checklist approach will not serve us. Here we have two differences between lab and life—the absence of punishment and the possibility of discounting the danger of obedience. The latter difference weakens the impact of Milgram's findings; the former strengthens it. Obviously we must move beyond a simple count of differences and think through what the effect of each one is likely to be.

Validity of What?

Ultimately, what makes research findings of interest is that they help us understand everyday life. That understanding, however, comes from theory or the analysis of mechanism; it is not a matter of "generalizing" the findings themselves. This kind of validity applies (if it does) to statements like "The hunger-reduction interpretation of infant attachment will not do," or "Theory-driven inferences may bias first impressions," or "The Purkinje shift occurs because rod vision has these characteristics and cone vision has those." The validity of these generalizations is tested by their success at prediction and has nothing to do with the naturalness, representativeness, or even nonreactivity of the investigations on which they rest.

Of course there are also those cases in which one does want to predict real-life behavior directly from research findings. Survey research, and most experiments in applied settings such as factory or classroom, have that end in view. Predicting real-life behavior is a perfectly legitimate and honorable way to use research. When we engage in it, we do confront the problem of EV, and Babbie's (1975) comment about the artificiality of experiments has force.

What I have argued here is that Babbie's comment has force only then. If this is so, then external validity, far from being "obviously our ideal" (Campbell & Stanley, 1967), is a concept that applies only to a rather limited subset of the research we do.

A Checklist of Decisions

I am afraid that there is no alternative to thinking through, case by case, (a) what conclusion we want to draw and (b) whether the specifics of our sample or setting will prevent us from drawing it. Of course there are seldom any fixed rules about how to "think through" anything interesting. But here is a sample of questions one might ask in deciding whether the usual criteria of external validity should even be considered:

As to the sample: Am I (or is he or she whose work I am evaluating) trying to estimate from sample characteristics the characteristics of some population? Or am I trying to draw conclusions not about a population, but about a theory that specifies what these subjects ought to do? Or (as in linguistic apes) would it be important if any subject does, or can be made to do, this or that?

As to the setting: Is it my intention to predict what would happen in a real-life setting or "target" class of such settings? Our "thinking through" divides depending on the answer.

The answer may be no. Once again, we may be testing a prediction rather than making one; our theory may specify what ought to happen in this setting. Then the question is whether the setting gives the theory a fair hearing, and the external-validity question vanishes altogether.

Or the answer may be yes. Then we must ask, Is it therefore necessary that the setting be "representative" of the class of target settings? Is it enough that it be a member of that class, if it captures processes that must operate in all such settings? If the latter, perhaps it should be a "limiting case" of the settings in which the processes operate—the simplest possible one, as a psychophysics lab is intended to be. In that case, the stripped-down setting may actually define the class of target settings to which the findings apply, as in the dark-adaptation story. The question is only whether the setting actually preserves the processes of interest,5 and again the issue of external validity disappears.

We may push our thinking through a step further. Suppose there are distinct differences between the research setting and the real-life target ones. We should remember to ask: So what? Will they weaken or restrict our conclusions? Or might they actually strengthen and extend them (as does the absence of power to punish in Milgram's experiments)?

Thinking through is of course another warm, fuzzy phrase, I quite agree. But I mean it to contrast with the cold creepies with which my students assault research findings: knee-jerk reactions to "artificiality"; finger-jerk pointing to "biased samples" and "unnatural settings"; and now, tongue-jerk imprecations about "external invalidity." People are already far too eager to dismiss what we have learned (even that biased sample who come to college and elect our courses!). If they do so, let it be for the right reasons.

5 Of course, whether an artificial setting does preserve the process can be a very real question. Much controversy centers on such questions as whether the operant-conditioning chamber really captures the processes that operate in, say, the marketplace. If resolution of that issue comes, however, it will depend on whether the one setting permits successful predictions about the other. It will not come from pointing to the "unnaturalness" of the one and the "naturalness" of the other. There is no dispute about that.

REFERENCES

Argyle, M. Social interaction. Chicago: Atherton Press, 1969.

Babbie, E. R. The practice of social research. Belmont, Calif.: Wadsworth, 1975.

Bannister, D. Psychology as an exercise in paradox. Bulletin of the British Psychological Society, 1966, 19, 21-26.

Bickman, L. Social roles and uniforms: Clothes make the person. Psychology Today, July 1974, pp. 49-51.

Brown, R., & Hanlon, C. Derivational complexity and order of acquisition in child speech. In J. R. Hayes (Ed.), Cognition and the development of language. New York: Wiley, 1970.

Brunswik, E. Representative design and probabilistic theory in a functional psychology. Psychological Review, 1955, 62, 193-217.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally, 1967.

Cappell, H., & Herman, C. P. Alcohol and tension reduction: A review. Quarterly Journal of Studies on Alcohol, 1972, 33, 33-64.

Deese, J. Psychology as science and art. New York: Harcourt Brace Jovanovich, 1972.

Hecht, S. Vision II: The nature of the photoreceptor process. In C. Murchison (Ed.), Handbook of general experimental psychology. Worcester, Mass.: Clark University Press, 1934.

Henshel, R. L. The purposes of laboratory experimentation and the virtues of deliberate artificiality. Journal of Experimental Social Psychology, 1980, 16, 466-478.

Higgins, R. L., & Marlatt, G. A. Effects of anxiety arousal on the consumption of alcohol by alcoholics and social drinkers. Journal of Consulting and Clinical Psychology, 1973, 41, 426-433.

Kantowitz, B. H., & Roediger, H. L., III. Experimental psychology. Chicago: Rand McNally, 1978.

Mello, N. K., & Mendelson, J. H. Alcohol and human behavior. In L. L. Iverson, S. D. Iverson, & S. H. Snyder (Eds.), Handbook of psychopharmacology: Vol. 12. Drugs of abuse. New York: Plenum Press, 1978.

Milgram, S. Obedience to authority. New York: Harper & Row, 1974.

Mook, D. G. Psychological research: Strategy and tactics. New York: Harper & Row, 1982.

Nisbett, R. E., & Ross, L. Human inference: Strategies and shortcomings of social judgment. Englewood Cliffs, N.J.: Prentice-Hall, 1980.

Orne, M. T., & Evans, F. J. Social control in the psychological experiment: Antisocial behavior and hypnosis. Journal of Personality and Social Psychology, 1965, 1, 189-200.
