DOCUMENT RESUME

ED 198 180                                                    TM 810 149

AUTHOR:          Scriven, Michael
TITLE:           Evaluation Thesaurus. Second Edition.
PUB DATE:        80
NOTE:            157p.; This work is one of a series to come out in 1980-81.
AVAILABLE FROM:  Edgepress, Box 69, Pt. Reyes, CA 94956 ($7.95 plus $.85 postage per single copy).
EDRS PRICE:      MF01 Plus Postage. PC Not Available from EDRS.
DESCRIPTORS:     *Definitions; Educational Assessment; *Evaluation; Evaluation Methods; *Thesauri

ABSTRACT

This thesaurus to the evaluation field is not restricted to educational evaluation or to program evaluation, but also refers to product, personnel, and proposal evaluation, as well as to quality control, the grading of work samples, and to all the other areas in which disciplined evaluation is practiced. It contains many suggestions, procedures, comments, criticisms, definitions and distinctions. Criteria for inclusion of an entry were: (1) at least a few participants in workshops or classes requested it; (2) a short account was possible; (3) the account was found useful; or (4) the author thought it should be included for the edification or amusement of professionals and/or amateurs. More current slang and jargon have been included than is usual, because that is seen as a problem area. The statistics and measurement area is lightly treated because it is believed to be well covered in other works. Many terms from the federal/state contract process appear because evaluation is often funded in this way. References have been kept to a select few. Acronyms (around 100) appear in an appendix. (FL)

***********************************************************************
 Reproductions supplied by EDRS are the best that can be made
 from the original document.
***********************************************************************


EVALUATION

THESAURUS

MICHAEL SCRIVEN

EDGEPRESS, INVERNESS, CALIFORNIA


Copyright © 1980, Michael Scriven

International Standard Book Number 0-913528-08-9
Second edition
First printing September 1980

Library of Congress Cataloging in Progress Number 80-68775

No part of this book may be reproduced by any mechanical, photographic, or electronic process, or in the form of a phonographic recording, nor may it be stored in a retrieval system, transmitted, or otherwise copied for public or private use, without permission from the publisher.

Edgepress
Box 69, Pt. Reyes
California 94956

Printed in the United States of America

Single copies $6.95 + postage etc. (55¢).
Two or more, 20% off price & postage.
California tax for California orders only: 6%
(42¢ per single copy, 34¢ per copy on multiple orders.)
Mastercharge or Visa acceptable.
Above prices good for second edition, all printings.
Third edition projected late 1981.
Free copies of third edition to contributors who identify significant additions and errors.


INTRODUCTION TO THE SECOND EDITION

Evaluation is a new discipline though an old practice. It is not just a science, though there is a point to talking about scientific evaluation by contrast with unsystematic or subjective evaluation. Disciplined evaluation occurs in scholarly book reviews, the Socratic dialogs, social criticism and in the opinions handed down by appellate courts. Its characteristics are the drive for a determination of merit, worth or value; the control of bias; the emphasis on sound logic, factual foundations and comprehensive coverage. That it has become a substantial subject is attested to by the size of this work and of the work to which the entries herein refer. It is a subject in its own right, not to be dissipated in sub-headings under education, health, law-enforcement and so on; one might as well argue that there is no subject of statistics, only agricultural statistics, statistics in biology etc. Nor will it do to classify evaluation under "Social Sciences, Sundry" since evaluation far transcends the social sciences. That the Library of Congress will not recognize the autonomy of the discipline leads to unflattering (evaluative) conclusions about the bureaucracy of bibliography. But scholars who read the dozen journals and the scores of books in the field, as well as the government (whose needs are more practical), are coming to recognize evaluation as a subject that requires certain skills, some knowledge and specific training.

This small work may serve as a kind of miniature text-cum-reference-guide to the field. It developed from a 1977 pamphlet entitled Evaluation Thesaurus, and the dictionary definition of the term "thesaurus" still applies to this much larger, more detailed, and massively rewritten work: "a book containing a store of words or information about a particular field or set of concepts" (Webster III); "treasury or storehouse of knowledge" (Oxford English Dictionary). We already have a couple of encyclopedias in sub-fields of evaluation (educational evaluation and program evaluation), and many texts provide brief glossaries. But for most consumers, the texts and larger compendia contain more than they want to know or care to purchase, for they are indeed expensive. The glossaries, on the other hand, are too brief. Here then is a smaller and cheaper guide than the encyclopedias, yet one that is more comprehensive than the glossaries, and it is not restricted simply to educational evaluation or to program evaluation. It also refers to product and personnel and proposal evaluation, to quality control and the grading of work samples, and to all the other areas in which disciplined evaluation is practiced. It contains many suggestions and procedures, comments and criticisms, as well as definitions and distinctions. Where it functions as a dictionary, it is in the tradition of Samuel Johnson's English Dictionary rather than the mighty OED; academic presses would not have approved his definition of oats ("A grain, which in England is generally given to horses, but in Scotland supports the people") but you and I do. Where this serves as a reference to good practice and not just good usage, it is of course briefer than the special texts or encyclopedias, but it may provide a good starting-point for an instructor who wishes to focus on certain topics in considerable detail and to provide tailored readings on those, while ensuring that students have some source for untangling the rest of the complex conceptual net that covers this field. Students who have sat down and read it cover to cover report on the experience as packing a semester's course into two days.

Smaller than the other texts, yes; more judgmental beyond doubt. But also possibly more open to change; we print short runs at Edgepress so that updating doesn't have to compete with protection of inventory. Send in your corrections or suggestions, and receive a free copy of the next edition. The most substantial or numerous suggestions also earn the choice of a handsome book on evaluation from our stock of spares. (At this writing, we have spares of both encyclopedias and twenty other weighty volumes.)


The criteria for inclusion of an entry were (a) at least a few participants in workshops or classes requested it, (b) a short account was possible, (c) the account was found useful, or (in a few cases) (d) the author thought it should be included for the edification or amusement of professionals and/or amateurs. There is much more current slang and jargon in here than would usually be recognized by a respectable scholarly publication, but that's exactly what gives people the most trouble. (And besides, though some of the slang is unlovely, some of it embodies the poetry and imaginativeness of a new field far better than more pedestrian and technical prose.) There's not much on the solid statistics and measurement material, because that's very well covered elsewhere. (But there's a little, because participants in some inservice workshops for professionals have no statistical background and find these few definitions helpful.) There's a good deal about the federal/state contract process because that's the way much of evaluation is funded (and because its jargon is especially pervasive and mysterious). Some references are provided, but only a few key ones, because too many just leave the readers' problem of selection unanswered. The scholar will usually find more references in the few given; that was one criterion for selection of them. Acronyms, besides a basic few, are in a supplement, to reduce clutter. The list of entries has benefited from comparison with the Encyclopedia of Educational Evaluation (eds. Anderson, Ball, Murphy et al., Jossey-Bass, 1976); but there are over 120 substantive entries here that are not in EEE.

The University of San Francisco, through its support of the Evaluation Institute, deserves first place in a listing of indebtedness. In 1971-72 the U.S. Office of Education (embodied in John Egermeier) was kind enough to support me in developing a training program in what I then called Qualitative Educational Evaluation at the University of California at Berkeley, and there began the glossary from which this work grew. Two contracts with Region IX of HEW, to assist in building staff evaluation capability, have led me from giving workshops there to developing materials which can be more widely distributed, more detailed, and used for later reference more often than seminar notes; this thesaurus is part of those materials. My students and contacts in those courses and workshops, as others at Berkeley, Nova, USF, and elsewhere, have been a constant source of improvement (still needed) in formulating and covering this exploding and explosive field; and my colleagues and clients too. To all of these, many thanks, most especially to Jane Roth for her work on the original Evaluation Thesaurus which she co-authored in 1977, and Howard Levine for many valuable suggestions on the first edition. Thanks, too, to Sienna S'Zell and Nola Lewis for handling the complexities of getting this into and out of our Mergenthaler phototypesetters. They are not to blame for our minor efforts to reform punctuation, e.g. by usually omitting the commas around "e.g." since it provides its own pause in the flow, and cutting down on the use of single quotes, since the U.S. and British practices are reversed.

This work is one of a series to come out in 1980-81. A companion monograph, The Logic of Evaluation, is complete and should be available in September. The Evaluation of Composition Instruction will be available in November. Product Evaluation is typeset and in field readers' hands and should be available by the end of the year. Introduction to Evaluation is scheduled for January '81. Others are projected on personnel evaluation, qualitative research methodology, apportionment, etc. (A note to Edgepress or the Institute will ensure that a descriptive flyer on each will be sent as they become available.) Where more details on a topic referenced in the thesaurus are provided in the first two of these monographs, the abbreviations LE (for Logic of Evaluation) and PE (for Product Evaluation) are used.

Evaluation Institute
University of San Francisco
California 94117

September 1980


Terms are printed in bold type to indicate that they have their own entry; this slightly distracting flag is not waved more than once in any entry.

ACCOUNTABILITY Responsibility for the justification of expenditures or of one's own efforts. Thus program managers and teachers should be, it is often said, accountable for their costs and salaries and time. The term is also used to refer to a movement towards increased expectations, e.g. of more detailed justification of expenditure or efforts. Accountability thus requires some kind of cost-effectiveness evaluation; it is not enough that one be able to explain how one spent the money ("fiscal accountability"), but it is also expected that one be able to justify this in terms of the achieved results. Teachers have sometimes been held (wholly) accountable for their students' achievement scores, which is of course entirely inappropriate since their contribution to these scores is only one of several (support from parents, from peers, and from the rest of the school environment outside the classroom are the most frequently cited other influences). On the other hand, a teacher can appropriately be held accountable for the failure to produce the same kind of learning gains in his or her pupils that other teachers of essentially similar pupils achieve. A common fallacy associated with accountability is to suppose that justice requires the formulation of precise goals and objectives if there is to be any accountability; but in fact one may be held accountable for what one does, within even the most general conception of professional work, e.g. for "teaching social studies in the twelfth grade," where one might be picking a fresh (unprescribed) topic every day, or every hour, in the light of one's best judgment as to what contemporary social events and the class capabilities make appropriate. (Captains of vessels are held accountable for their actions in wholly unforeseen circumstances.) It is true, however, that any testing process has to be very carefully selected and applied if educational accountability is to be enforced in an equitable way; this does not mean that the test must be matched to what is taught (because what is taught may have been wrongly chosen), but it does mean that the test must be very carefully justified, e.g. by reference to reasonable expectations as to what should be (or could justifiably have been) covered, given the need and ability of the students.

ACCREDITATION The award of credentials, in particular the award of membership in one of the regional associations of educational institutions or one of the professional organizations which attempt to maintain certain quality standards for entry. The "accreditation process" is the process whereby these organizations determine eligibility for membership and encourage self-improvement towards achieving and maintaining that status. The accreditation process has two phases; in the first, the institution undertakes a self-study and self-evaluation exercise against its own mission statement. In the second phase the regional accrediting commission sends in a team of people familiar with similar institutions, to examine the self-study and its results, and to look at a very large number of particular features of the institution, using data to be supplied by the institution together with a checklist (Evaluative Criteria is the best known of these, published by The National Society for School Evaluation), which are then pulled together in an informal synthesis process. At the elementary level, schools are typically not visited (although there is one of the handful of regional accrediting commissions that is an exception to this); at the high school level a substantial team visit is involved, and the same is true at the college level. Accrediting of professional schools, particularly law schools and medical schools, is also widespread and done by the relevant professional organizations; it operates in a similar way. Accrediting of schools of education that award credentials, e.g. for teaching in elementary schools, is done by the state; there is also a private organization which evaluates such schools. There are grave problems with the accreditation process as currently practiced, in particular its tendency towards the rejection of innovations simply because they are unfamiliar (naturally this is denied); its use of teams unskilled in evaluation; its disinterest in looking at learning achievements by contrast with process indicators; the inconsistency between its practice and the claim that it accepts the institution's own goals; the brevity of the visits; the institutional veto and middle-of-the-road bias in selecting team members; the lack of concern with costs; and so on (LE). See Institutional Evaluation.


APTITUDE (THE APTITUDE/ACHIEVEMENT DISTINCTION) It's obvious enough that there's a difference between the two; Mozart presumably had more early aptitude for the piano than you or I, even if he'd never been shown one. But statistical testing methodology has always had a hard time over the distinction because statistics isn't subtle enough to cope with the point of the distinction, just as it isn't subtle enough to cope with the distinction between correlation and causation. For no one has achievement who doesn't have aptitude, by definition, so there's a one-way correlation; and it's very hard to show that someone has an aptitude without giving them a test that actually measures (at least embryonic) achievement. Temerarious testing types have thus sometimes been led to deny that there is any real distinction, whereas the fact is only that they lack the tools to detect it. Distinctions only have to be conceptually clear, not statistically simple; and the distinction between a capacity (an aptitude) and a manifested performance (achievement) is conceptually perfectly clear. Empirically, we may never find good tests of aptitude that aren't mini-achievement tests. (Ref. The Aptitude Achievement Distinction, ed. Green, McGraw-Hill.)

ACTION RESEARCH A little-known sub-field in the social sciences that can be seen as a precursor of evaluation.

ACTORS Social science (and now evaluation) jargon term for those participating in an evaluation, typically evaluator, client and evaluee (if a person or his/her program is being evaluated). May also be used to refer to all active stakeholders.

ADMINISTRATOR EVALUATION A species of personnel evaluation which illustrates many of the problems of teacher evaluation in that there is no demonstrably superior administrative style (e.g. with respect to democratic versus authoritarian leadership), where the criterion of merit is effectiveness, rather than enjoyability. The three main components of administrative evaluation should be: (a) anonymous holistic rating of observed performance as an administrator, with an opportunity to give reasons, by all those "significantly interactive" with the individuals in question. Identifying this group is done by a preliminary request for a list from the administrator to be evaluated, to which is attached the comment that the search will also be instigated from the groups at the other end of the interaction; (b) a study of objective measures of effectiveness, e.g. turnaround time on urgently needed materials, output, staff turnover etc.; and (c) paper-and-pencil tests of relevant knowledge and skills, in particular of new knowledge and understanding that has become important since the time of the last review. This kind of evaluation can easily be tied to in-service training, so that it is a productive and supportive experience. The usual farce of administrator evaluation via performance or behavioral objectives is not only a prime opportunity for the con artist to exploit, is not only indefensible because of its lack of input from most of the people that have most of the relevant knowledge, it is also highly destructive of creative management because of the lack of rewards for handling "targets of opportunity"; indeed, there are usually de facto punishments for trying to introduce them as new objectives. (It also has the other weaknesses of any goal-based evaluation.)

Administrators are often nervous about the kind of approach listed as preferable here, because they rightly understand that most of the people with whom they interact have a pretty poor grasp of the administrator's extensive responsibilities and burdens. The questionnaire must of course rather carefully delimit the requested response to rating (holistically) the observed behaviors, and the rest of the objection is taken care of by the comprehensive nature of the group interviews, supplemented by the objective measures.

ADVOCATE-ADVERSARY EVALUATION (THE ADVERSARY APPROACH) A type of evaluation in which, during the process and/or in the final report, presentations are made by individuals or teams whose goal is to provide the strongest possible case for or against a particular view or evaluation of the program (etc.). There may or may not be an attempt at providing a synthesis, perhaps by means of a judge or a jury or both. The techniques were developed very extensively in the early seventies, from the initial example in which Stake and Denny were the advocate and the adversary (the TCITY evaluation), through Bob Wolf, Murray Levine, Tom Owens and others. There are still great difficulties in answering the question, "When does this give a better picture and when does it tend to falsify the picture of a program?" The search for justice (where we rely on the adversary approach) is not the same as the search for truth; nevertheless, there are great advantages about stating and attempting to legitimate radically different appraisals, e.g. the competitive element. One of the most interesting reactive phenomena in evaluation was the effect of the original advocate-adversary evaluation; many members of the "audience" were extremely upset by the fact that the highly critical adversary report had been printed as part of the evaluation. They were unable to temper this reaction by recognition of the equal legitimacy accorded to the advocate position. The significance of this phenomenon is partly that it reveals the enormous pressures towards bland evaluation, whether they are explicit or below the surface. In "purely logical" terms, one might think there wasn't much difference between giving two contradictory viewpoints equal status, and giving a merely neutral presentation. But the effect on the audience shows that this is not the case; and indeed, a more practically oriented logic suggests that important information is conveyed by the former method of presentation that is absent from the latter, namely the range of (reasonably) defensible interpretations. See also Relativism, Judicial Model.

ADVOCATE TEAMS APPROACH (Stufflebeam) Not to be confused with the advocate/adversary approach to evaluation. A procedure for developing in detail the leading options for a decision maker, as a preliminary to an evaluation of them.

AFFECTED POPULATION A program, product etc. impacts the true consumers and its own staff. In program evaluation both effects must be considered though they have quite different ethical standings. At one stage, it looked as if the Headstart program could be justified (only) because of its benefits to those it employed.

AFFECTIVE (Bloom) Original sense: pertaining to the domain of affect. Often taken to be the same as the domain of feelings or attitudes. Since these are sometimes confused with beliefs, it should be remembered that affect should also be distinguished from the cognitive and psychomotor domains. For example, self-esteem and locus of control are often said to be affective variables, but many items or interview questions which are said to measure these actually call for estimates of self-worth and appraisals, or judgments of locus of control, which are straight propositional claims and hence cognitive. Errors such as this often spring from the idea that the realm of valuing is not propositional, but merely attitudinal, a typical fallacy of the value-free ideology in social science. While some personal values are evident in attitudes and hence may be considered affect, some valuations (whether or not they cause certain attitudes) are scientifically testable assertions. Note the difference between "I feel perfectly capable of managing my own life, selecting an appropriate career and mate, etc." and "I am perfectly capable, etc." (Or "I feel this program is really valuable for me." vs. "This program is really valuable for me.") Claims about feelings are autobiographical and the error sources are lying and lack of self-knowledge. Claims about merit are external world claims and verified or falsified by evaluations. The use of affective measures, beyond the simplest expressions of pleasure, is currently extremely dubious because of (a) these conceptual confusions between affect and cognition, (b) deliberate falsification of responses, (c) unconscious misrepresentation, (d) dubious assumptions made by the interpreter, e.g. that increases in self-esteem are desirable (obviously false beyond a certain (unknown) point), (e) invasion of privacy, (f) lack of even basic validation, (g) high lability of much affect, (h) high stability of other affect. Not long ago, I heard an expert say that the only known valid measure of affect relates to locus of control and that is fixed by the age of two. He may have been optimistic.

ANALYTICAL (evaluation) By contrast with holistic evaluation, which might be called macro-evaluation (by analogy with macro-economics), analytical evaluation is micro-evaluation. There are two main varieties: component evaluation and dimensional evaluation. It is often thought that causal analysis or remedial suggestions are part of analytic (typically formative) evaluation, but they are not in fact part of evaluation at all (LE).

ANCHORING (ANCHOR POINTS) Rating scales that use numbers (e.g. 1-6, 1-10) or letters (A-F) should normally provide some translation of the labeled points on the scale, or at least the end-points and mid-point. It is common, in providing these anchors, to confuse grading language with ranking language, e.g. by defining A-F as "Excellent ... Average ... Poor" which has two absolute and one relative descriptors, hence is useless if most of the evaluands are or may be excellent (or poor). Some, probably most, anchors for letter grades create an asymmetrical distribution of merit, e.g. because the range of performances which D (potentially) describes is narrower than the B range; this invalidates (though possibly not seriously) the numerical conversion of letter-grades to grade points (LE). It may be a virtue, if conversion is not essential. In another but related sense of anchoring, it means cross-calibration of e.g. several reading tests, so as to identify (more or less) equivalent scores.
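
A minimal numerical sketch may make the asymmetry point concrete. The band boundaries and point values below are invented for illustration; they are not drawn from any actual grading scheme:

```python
# Hypothetical anchor bands on a 0-100 performance scale; note the
# unequal widths (B spans 20 points, D only 10).
bands = [("A", 90), ("B", 70), ("C", 60), ("D", 50), ("F", 0)]
points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def grade(score):
    # Return the first band whose lower bound the score meets.
    for letter, lo in bands:
        if score >= lo:
            return letter
    return "F"

# Equal 20-point gaps in raw performance...
for s in (95, 75, 55):
    print(s, grade(s), points[grade(s)])
# ...yield unequal grade-point gaps (4 -> 3 -> 1), so averaging the
# converted numbers treats the merit scale as evenly divided when it is not.
```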

ANONYMITY The preservation of the anonymity of respondents sometimes requires very great ingenuity. Although even bulletproof systems do not achieve honest responses from everyone in personnel evaluation (because of secret contract bias), leaky systems get honesty from almost no-one. The new legal requirements for open files have further endangered this crucial source of evaluation input; but not without adequate ethical basis. The use of a "filter" (a person who removes identifying information, usually the person in charge of the evaluation) is usually essential; a suggestion box, a phone with a recorder on it to which respondents can talk (disguising their voice), checklists that avoid the necessity for (recognizable) handwriting, forms that can be photocopied to avoid watermark identifiers, money instead of stamps or reply-paid envelopes (which can be invisibly coded), are all possibilities. Typical further problems: What if you want to provide an incentive for responding (how can you tell whom to reward)? What if, like a vasectomy, you wish to reverse the anonymizing process (e.g. to get help to a respondent in great distress)? There are complex answers, and the questions illustrate the extent to which this issue in evaluation design takes us beyond standard survey techniques.

APPLES & ORANGES ("Comparing apples & oranges") Certain evaluation problems evoke the complaint, particularly from individuals trained in the traditional social sciences, that any solution would be "like comparing apples and oranges." Careful study shows that any true evaluation problem (as opposed to a unidimensional measurement problem) involves the comparison of unlike quantities, with the intent of achieving a synthesis. It is the nature of the beast. On the other hand, far from being impossible, the simile itself suggests the solution; we do of course compare apples and oranges in the market, selecting the one or the other on the basis of various considerations, such as cost, quality relative to the appropriate standards for each fruit, nutritional value, and the preferences of those for whom we are purchasing. Indeed, we commonly consider two or more of these factors and rationally amalgamate the results into an appropriate purchase. While there are occasions on which the considerations just mentioned do not point to a single winner, and the choice may be made arbitrarily, this is typically not the case. Complaining about the apples and oranges difficulty is a pretty good sign that the complainer has not thought very hard about the nature of evaluation (LE).
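
The market comparison can be given a minimal numerical form. The criteria, weights, and scores below are invented for illustration, and a weighted sum is only one of several defensible synthesis rules:

```python
# Amalgamating unlike considerations into one purchase decision.
weights = {"cost": 0.3, "quality": 0.4, "nutrition": 0.2, "preference": 0.1}
fruit_scores = {
    "apples":  {"cost": 8, "quality": 6, "nutrition": 7, "preference": 5},
    "oranges": {"cost": 6, "quality": 8, "nutrition": 8, "preference": 7},
}

def amalgamate(scores, weights):
    # Weighted sum across the (unlike) dimensions.
    return sum(weights[c] * scores[c] for c in weights)

for fruit, scores in fruit_scores.items():
    print(fruit, amalgamate(scores, weights))
# apples 6.7, oranges 7.3: a rational synthesis, not an impossibility.
```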

APPORTIONMENT (ALLOCATION, DISTRIBUTION) The process or result of dividing a given quantity of resources between a set of competing demands, e.g. dividing a budget between programs. This is in fact the defining problem of the science of economics, but one that is usually not addressed directly or not in practical terms within the economic literature, presumably because any solution requires making assumptions about the so-called "interpersonal comparison of utility," i.e., the relative worth of providing goods to different individuals. Thus the value-free conception of the social sciences makes it taboo to provide practical solutions to the apportionment problem. Apportionment is a separate evaluation predicate, distinct from grading and ranking and scoring although all of those are involved in it; it is one, very practical, way of showing one's estimate of relative worth, and of all the evaluation predicates it is probably the closest to the decision makers' modal evaluation process. Various patently inappropriate solutions are quite frequently used, e.g. the "across-the-board cut." This not only rewards the padding of budgets, and hence automatically leads to increased padding the following year, but it also results in some funding at below the "critical mass" level, a complete waste of money. Another inappropriate solution involves asking program managers to make certain levels of cut; this of course results in the blackmail strategy of setting the critical mass levels too high, in order to get more than is absolutely necessary. The only appropriate kind of solution involves some evaluation by a person external to the program, typically in conjunction with the program manager; and the first task of such a review must be to eliminate anything that looks like fat in the budget. Later steps in the process involve segmentation of each program, identification of alternative articulations of the segments, grading of the cost-effectiveness of the progressively larger systems in each sequence of add-ons, and consideration of interactions between program components that may reduce the cost of each at certain points. Given an estimate of the "return value" of the money (the good it would do if not used for this set of programs), and the ethical (or democratic) commitment to prima facie equality of interpersonal worth, one then has an effective algorithm for spending the available budget in the most effective way. It will typically be the case that some funding of each of the programs will occur (unless the critical mass is too large), because of the declining marginal utility of the services to each of the (semi-overlapping) impacted populations, the long-term advisability of retaining capability in each area, and the political considerations involved in reaching larger numbers. The process just described provides a rationale for what has sometimes been called zero-based budgeting, an innovation of which the Carter Administration made a good deal in the first years of his presidency; but serious discussion of the methodology for it never seemed to emerge, and the practice was naturally well behind that. At the informal but highly practical level, apportionment reminds us of one of the most brilliant examples of bias control methodology in all evaluation: the solution to the problem of dividing an irregularly-shaped portion of food or land into two fair shares, "You divide, and I'll choose." (This is a micro-version of the "veil of ignorance" or antecedent probability approach to the justification of justice and ethics in Rawls, A Theory of Justice (1971), and Scriven, Primary Philosophy, McGraw-Hill, 1966.) It is not surprising that ethics and evaluation share a common border here, since justice is often analyzed as a distributional concept. (See LE.)
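
The stepwise procedure just described can be illustrated in miniature. The sketch below is only one greedy reading of it, with invented programs, segment costs, and benefit estimates; it funds segments in order of cost-effectiveness, respecting critical mass, until the remaining money would do more good elsewhere:

```python
# Illustrative apportionment: fund program segments in order of benefit
# per dollar, never funding an add-on before its program's critical
# mass, and never funding below the money's "return value."
segments = [
    # (program, segment, cost, estimated benefit) -- all invented
    ("literacy", "critical mass", 40, 120),
    ("literacy", "add-on 1",      20,  30),
    ("health",   "critical mass", 50, 100),
    ("health",   "add-on 1",      30,  40),
]
budget = 100
return_value = 1.2   # benefit per dollar if the money were spent elsewhere

ranked = sorted(segments, key=lambda s: s[3] / s[2], reverse=True)
funded = []
for program, segment, cost, benefit in ranked:
    if segment != "critical mass" and (program, "critical mass") not in funded:
        continue   # an add-on is useless without its critical mass
    if cost <= budget and benefit / cost >= return_value:
        funded.append((program, segment))
        budget -= cost

print(funded, "unspent:", budget)
# Both critical masses get funded; declining marginal utility prices
# the add-ons out of the remaining budget.
```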

ARCHITECTURAL EVALUATION Like the evaluation of detective stories and many novels (see literary criticism), this field involves a framework of logic and a skin of aesthetics; it is frequently treated as if only one of these components is important. The solution to the problems of traffic flow, the use of durable fixtures that are not overpriced, the provision of adequate floor-space and storage, meeting the requirements of expansion, budget, safety and the law; these are the logical constraints. The aesthetic ones are no less important and no easier to achieve. Unfortunately architecture has a poor record of learning by experience, i.e. poor evaluation commitment; every new school building incorporates errors of the simplest kind (e.g. classroom entries at the front of the room), and colleges of architecture, when designed by their faculty, not only make these errors but are often and widely thought to be the ugliest buildings on the campus. (Cf. evaluators who write reports readable only by evaluators.) It is significant that the Ford Foundation's brilliant conception of a center for school architecture has, after several years' operation, sunk without a trace.

ARCHIVES Repository of records in which e.g. minutes of key meetings, old budgets, prior evaluations and other found data are located.

ARTEFACT (or ARTIFACT) (of an experiment, evaluation, analytical or statistical procedure) An artificial result, one merely due to (created by) the investigatory or analytic procedures used in an experiment, an evaluation, or a statistical analysis, and not a real property of the phenomenon investigated. Typically uncovered (and in good designs guarded against) by using multiple independent methods of investigation/analysis.

ASSESSMENT Often used as a synonym for evaluation, but sometimes used to refer to a process that is more focussed on quantitative and/or testing approaches; the quantity may be money (as in real estate assessment), or numbers and scores (as in National Assessment of Educational Progress). People sometimes suggest that assessment is less of a judgmental and more of a measurement process than certain other kinds of evaluation; but it might be argued that it is simply a case of evaluation in which the judgment is built into the numerical results. Raw scores on a test of no known content or construct validity would not be assessment; it is only when the test is (supposedly) of basic mathematical competence, for example, that reporting the results constitutes assessment in the appropriate sense, and of course the judgment of validity is the key evaluative component in this.

ATTENUATION (Stat.) In the technical sense this refers to the reduction in correlation due to errors of measurement.
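
The classical correction (Spearman's correction for attenuation, a standard result rather than part of the original entry) estimates what the correlation between two measures would be if both were free of measurement error, given their reliabilities:

```latex
% r_xy: observed correlation; r_xx, r_yy: reliabilities of the two measures.
\[
  \hat{r}_{\mathrm{true}} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\]
```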

ATTITUDES The compound of cognitive and affective variables describing a person's mental set towards another person, thing, or state. It may be evaluative or simply preferential; that is, someone may think that running is good for you, or simply enjoy it, or both; enjoying it does not entail thinking it is meritorious, nor vice versa, contrary to many suggested analyses of attitudes. Attitudes are inferred from behavior, including speech behavior, and inner states. No one, including the person whose attitudes we are trying to determine, is in an infallible position with respect to the inference to an attitudinal conclusion, even though that person is in a nearly infallible position with respect to his or her own inner states. Notice that there is no sharp line between attitudes and cognition; many attitudes are evinced through beliefs (which may be true or false), and attitudes can sometimes be evaluated as right or wrong, or good or bad, in an objective way (e.g. attitudes towards "the world owing one a living," work, women (men), etc.). See Affective.

ATTRITION The loss of subjects in the experimental or control/comparison groups during the period of the study. This is often so large as to destroy the experimental design; 60% loss within a year is not uncommon in the schools. Hence all choice of numbers in the groups must be based upon a good estimate of attrition plus a substantial margin for error.
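
The sizing arithmetic this implies is simple. In the sketch below the target group size, attrition estimate, and margin are all invented for illustration:

```python
# Inflate initial enrollment to survive expected attrition plus a margin.
needed_at_posttest = 50    # subjects per group required at the end
expected_attrition = 0.60  # the 60%-in-a-year figure cited above
margin = 0.10              # cushion on the attrition estimate

initial_n = needed_at_posttest / (1 - (expected_attrition + margin))
print(round(initial_n))    # -> 167 enrolled per group
```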

AUDIENCE (in Robert Stake's sense) A group, whether or not they are the client(s), who will or should see and may use or react to an evaluation. Typically there are many such, and typically an evaluation report or presentation will need careful planning in order to serve the several audiences reasonably well.

AUDIT, AUDITOR Apart from the original sense of this term, which refers to a check on the books of an institution by an independent accountant, the evaluation use of the term refers to a third party evaluation or external evaluation, often of an evaluation. Hence (and this is the standard usage in California) an auditor may be a meta-evaluator, typically serving in a formative and summative role. In the more general usage, an auditor may be simply an external evaluator working either for the same client as the primary evaluator or for another client. There are other occasions when the auditor is halfway between the original kind of auditor and an evaluation auditor; for example, the Audit Agency of HEW (now HHS/ED) was originally set up to monitor compliance with fiscal guidelines, but their staff are now frequently looking at the methodology and overall utility of evaluations. The same is true of GAO and OMB "audits."

BALANCE OF POWER A desirable feature of the social environment of an evaluation, summed up in the formula: "The power relation of evaluator, evaluee and client should be as nearly symmetrical as possible." For example, evaluees should have the right to have their reactions to the evaluation appended to it when it goes to the client. Similarly, the client should undertake to be evaluated if the contract identifies someone else as the evaluee. (School administrators who are not being properly evaluated have no right to have teachers critically evaluated.) Meta-evaluation and goal-free evaluation are both part of the Balance of Power concept. Panels used in evaluation should exhibit a balance of power, not a lack of bias. There are both ethical and political/practical reasons for arranging a balance of power.

BASELINE (data or measures) Facts about the condition or performance of subjects prior to treatment. The essential result of the pretest part of the pretest-posttest approach. Gathering baseline data is one of the key reasons for starting an evaluation before a program starts, something that always seems odd to budgetary bureaucrats. See Preformative.

BASIC CHECKLIST The 18-point checklist for evaluating products, programs, etc., to be found under Key Evaluation Checklist.

BEHAVIORAL OBJECTIVES Specific goals of, e.g. a program, stated in terms which will enable their attainment to be checked by observation or test/measurement. An idea which is variously seen as 1984/Skinner/dehumanizing, etc., or as a minimum requirement for the avoidance of empty verbalisms. Some people now use "measurable objectives" to avoid the miasma associated with the connotations of behaviorism. In general, people are now more tolerant of objectives that are somewhat more abstractly specified, provided that leading verification/falsification conditions can be spelled out. This is because the attempt to spell everything out (and skip the statement of intermediate-level goals) produces 7633 behavioral objectives for reading, which is an incomprehensible mess. Thus educational research has rediscovered the reason for the failure of the precisely analogous move by positivist philosophers of science to eliminate all theoretical terms in favor of observational terms. The only legitimate scientific requirement here is that terms have a reliable use and agreed-upon empirical content, not a short translation into observational language; the latter is just one way to the former. Fortunately, scientific training can lead to the reliable (enough) use of theoretical terms, i.e., they can be unpacked into the contextually-relevant measurable indicators upon demand, thereby avoiding the total loss of the main cognitive organizers (above the taxonomical level) and all understanding that would result from the total translation project, even if it were possible. The same conclusion applies to the use of somewhat general goal statements.

BIAS A condition in an evaluation or other design, or in one of its participants, that is likely to produce errors; for example, a sample of the students enrolled in a school is biased against lower economic groups if it is selected from those present on a particular day, since absenteeism rates are usually higher amongst lower economic groups. Hence, if we are investigating an effect that may be related to economic class, using such a sample would be faulty design. It is common and incorrect to suppose that (strong) preferences are biases, e.g. someone who holds strong views against the use of busing to achieve desegregation is often said to be biased. (See the glossary of Evaluation Standards, McGraw-Hill 1980, where bias is wrongly defined as "a consistent alignment with one point of view.") This is true only where the views are unjustified, i.e., involve or will probably lead to errors. It is not true if the views are merely controversial; one would scarcely argue that believers in atoms are biased even though the existence of atoms is denied by Christian Scientists. One sometimes needs a judge in a dispute that is neutral or acceptable to both parties; this should be distinguished from unbiased. Being neutral, etc., is often a sign of error in a given dispute i.e. a sign of bias. Evaluation panels should usually include trained and knowledgeable people with strong commitments both for and against whatever approach, program, etc., is being evaluated (where such factions exist) and no attempt should be made to select only neutral panelists at the usual cost of selecting ignoramuses or cowards and getting superficial, easily dismissed reports. The neutral faction, if equally knowledgeable, should be represented just as any other faction. Selecting a neutral chair may be good psychology or politics, but not because s/he is any more likely to be a good judge.

BIAS CONTROL A key part of evaluation design; it is not an attempt to exclude the influence of definite views but of unjustified, e.g., premature or irrelevant views. For example, the use of (some) external evaluators is a part of good bias control, not because it will eliminate the choice of people with definite views about the type of program being evaluated, but because it tends to eliminate people who are likely to favor it for the irrelevant (and hence error-conducive) reasons of ego-involvement or income-preservation (cf. also Halo Effect). Usually, however, program managers object to the use of an external evaluator with a known negative view of programs like theirs, which is to confuse bias with preference. Enemies are one of the best sources of useful criticism, not that anyone enjoys it. Even if it is politically necessary to take account of a manager's opposition to the use of a negatively-disposed evaluator, it should be done by adding a second evaluator, also knowledgeable, to whom there is no objection, not by finding someone neutral as such, since neutrality is just as likely to be biased; a key point. Other key aspects of bias control involve further separation of the rewards channel from the evaluation reporting, designing or hiring channel, e.g. by never allowing the agency monitor for a program to be the monitor for the evaluation contract on that program, never allowing a program contractor to be responsible for letting the contract to evaluate that program, etc. The ultimate bias of contracted evaluations resides in the fact that the agencies which fund programs fund most or all of their evaluations, hence want favorable ones, a fact of which evaluation contractors are (usually consciously) aware and which does a great deal to explain the vast preponderance of favorable evaluations in a world of rather poor programs. Even GAO, although effectively beyond this influence for most purposes, is not immune enough for Congress to regard them as totally credible, hence (in part) the creation of the CBO (Congressional Budget Office). The possible merits of an evaluation "judiciary," isolated from most pressures by life-time appointment, deserve consideration. Another principle of bias control reminds us of the instability of independence or externality: today's external evaluator is tomorrow's co-author (or spurned contributor). For more details, see "Evaluation Bias and Its Control," in Evaluation Studies Review Annual (Vol. 1, 1976, ed. G. Glass, Sage). The possibility of neat solutions to bias control design problems is kept alive in the face of the above adversities by remembering the Pie-Slicing Principle: "You slice and I'll select."

BIG SHOPS The "big shops" in evaluation are the fiveto ten that carry most of the large evaluation contracts; theyinclude Abt Associates, AIR, ETS, RAND, SDC, SRI, etc.(for translations see the acronym appendix). The tradeoffsbetween the big shops and the small shops run somethinglike this, assuming for the moment that you can affordeither: the big shops have enormous resources of everykind, from personnel to computers; they have an ongoingstability that pretty well ensures the job will be done with atleast a minimum of competence; and their reputation isimportant enough to them that they are likely to meet dead-

15 23

Page 23: DOCUMENT RESUME - ERICED 198 180 AUTHOR TITLE PUB DATE NOTE AVAILABLE FPOM EDRS PRICE DESCRIPTORS DOCUMENT RESUME TM 810 149 Scriven, Michael Evaluation Thesaurus. Second Edition

lines and do othergood things of a 'aper- churning kind likeproducing nicely bound reports, staving within budget andso on. In all of these respects they are a better bet, often amuch better bet, than the small shops. On the other hand,you don't know who you are going to get to work for you ina big shop, because they have to move their project mana-gers around as the press of business ebbs and flows, and astheir people move on to other positions; they are rathermore hidebound by their own bureaucratic procedures thana small shop; and they are likely to be a good deal moreexpensive for the same amount of work, because they arecarrying a large staff through the intervals between fobswhich are inevitable, no matter how well they are run Asmall shop is often carrying a proportionally smaller ox r-head during those times, and may be workingout of a .noremodest establishment, taking some of their payments in thepleasures of independence. It's much easier to get a satis-factory estimate of competence about the large shops than itis about the small shops; but of course what you do learnabout the personnel ofa small shop is more likely to apply tothe people that do your work. There'san essential place forboth of them; small shops simply can't manage the bigprojects competently, although they sometimes try; and thebig shops simply can't handle the small contracts. If somemore serious evaluation of the quality of the work done wasinvolved in government review panelsand the increasingstrength of GAO in meta-evaluation giveF some promise ofthisthen small shops might fit better into the scheme ofthings, rather as they do in the management consultingfield and in the medical specialties. We are buying a lot ofmediocre work for our tax dollar at the moment, because thesystem of rewards and punishments is set up to punishpeople that don't deliver (or get delivered) a report on time;but not to reward those who produce an outstanding reportby comparison with a mediocre one.

BI-MODAL (Stat.) See Mode.

BLACK BOX EVALUATION A term, usually employed pejoratively, that refers to holistic summative evaluation, in which an overall and frequently brief evaluation is provided, without any suggestions for improvements, etc. Black box evaluation is frequently extremely valuable (e.g. a consumer product evaluation); is frequently far more valid than any analytical evaluation that could be done within the same time line and for the same budget; and has the great advantage of brevity. But there are many contexts in which it simply will not provide the needed information e.g. where analytical formative evaluation is required. (Note that black box evaluation may even be extremely useful in the formative situation.) Cf. Engineering Model.

BOILERPLATE Stock paragraphs or sections that are dumped into RFPs or reports (e.g. from storage in a word-processor) to fill them out or fulfill legal requirements. RFPs from some agencies are 90 percent boilerplate; one can scarcely find the specific material in them.

BUDGET Regardless of the form which particular agencies prefer, it's desirable to develop a procedure for project budgeting that remains constant across projects so that your own staff can become familiar with the categories. It can always be converted into a particular required format if it is thoroughly understood. The main categories might be direct labor costs, other direct costs (materials, supplies, etc.), indirect expenses (space and energy costs), other indirect costs (administrative expenses or "general and administrative" expenses (G&A)). The difference between ordinary overhead and G&A is not sharp, but the idea is that ordinary overhead should be those costs that are incurred at a rate proportional to staff salaries on the project, this proportion being the overhead rate, e.g. retirement, insurance, etc. G&A will include indirect costs not directly related to project or staff size (for example, license fees and profit). A number of indirect costs such as accounting services, interest charges, etc., could be justifiably put under either category. See Costs.
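
One way to see how the categories combine is a toy computation; the rates and amounts below are invented, and agencies differ on exactly which base each rate is applied to:

```python
# Toy project budget using the categories described in the entry.
direct_labor  = 60_000
other_direct  = 15_000           # materials, supplies, etc.
overhead_rate = 0.35             # costs proportional to project salaries
ga_rate       = 0.10             # general & administrative

overhead = direct_labor * overhead_rate      # retirement, insurance, etc.
subtotal = direct_labor + other_direct + overhead
ga       = subtotal * ga_rate                # not tied to project/staff size
total    = subtotal + ga
print(f"overhead {overhead:,.0f}  G&A {ga:,.0f}  total {total:,.0f}")
```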

CAI Computer Assisted Instruction. Computer presents the material or at least the tests on it. Cf. CMI.

CALIBRATION Conventionally refers to the process of matching the readings of an instrument against a prior standard. In evaluation it would include identification of the correct cutting scores (which define the grades) on a new version of a test, traditionally done by administering the old and the new test to the same group of students (half getting the old one first, half the new). A less common but equally important use is with respect to the standardization of judges who are on e.g. a site-visit or proposal-reviewing panel. They should always be run through two or three calibration examples, specially constructed to illustrate (a) a wide range of merit, (b) common difficulties e.g. (in proposal evaluation) comparing low probability of a big pay-off with high probability of a modest pay-off. While it is not crucial to get everyone to give the same rating (interjudge reliability), indeed it decreases validity, it is highly desirable to avoid: (a) intra-judge inconsistency; (b) extreme compression of an individual's ratings, e.g. at the top, bottom or middle, unless the implications and alternatives are thoroughly understood; (c) drift of each judge's standards as they "learn on the job" (let them sort out their standards on the calibration examples); (d) the intrusion of the panel's possibly turbulent group dynamics into the first few ratings (let it stabilize during the calibration period). While the time-cost of calibration may appear to be serious, in fact it is not, if the development of suitable scales and anchor points is undertaken when doing the calibration examples, since the use of these (plus e.g. salience scoring) greatly increases speed. And, if anyone really cares about validity, or interpanel reliability (i.e. justice), calibration is an essential step. See also Anchoring.

CASE-STUDY METHOD The case-study method is at the opposite end of the spectrum of methods from the survey method. Both may involve intensive or casual testing and/or interviewing; observing, on the other hand, is more characteristic of the case-study method than of large-scale surveys. The case-study approach is typical of the clinician, as opposed to the pollster; it is nearer to the historian and anthropologist than it is to the demographer. Causation is usually determined in case studies by the modus operandi method, rather than by comparison of an experimental with a control group, although one could in principle do a comparison case study of a matched case. The case-study approach is frequently used as an excuse for substituting rich detail for evaluative conclusions, a risk inherent in responsive evaluation, transactional evaluation and illuminative evaluation. At its best, a case study can uncover causation where no statistical analysis could, and can block or suggest interpretations that are far deeper than survey data can reveal. On the other hand, the patterns that emerge from properly done large-scale quantitative research cannot be detected in case studies, and the two are thus naturally complementary processes for a complete investigation of e.g. the health or law enforcement services in a city. See also Naturalistic.

CAUSATION The relation between mosquitos andmosquito bites. Easily understood by both parties but neversatisfactorily defined by philosophers (or scientists).

CEILING EFFECT The result of scoring near the top of a scale, which makes it harder (even impossible) to improve as easily as from a point further down. Sometimes described as "lack of headroom." Scales on which raters score almost everyone near the top will consequently provide little opportunity for anyone to distinguish themselves by outstanding (comparative) performance. In the language of the stock market, they (the scales plus the raters) provide "all downside risk." (Typical of teacher evaluation forms.) Usually they should be reconstructed to avoid this; but not if they correctly represent the relevant range of the rated variable, since then the "upside" differences would simply be a measurement artefact. After all, if all the students get all the answers right, there shouldn't be any headroom above their grades on your scale. (You might want to use a different test, however, if your task was to get a ranking.)

CENTRAL TENDENCY (Stat.) The misleading technical term for the middle, or average, of a distribution, as opposed to the extent to which it is spread thin or lumped, the latter being the dispersion or variability of the distribution.

CERTIFICATION A term, like credentialing, which refers to the award of some official recognition of status, typically based on a serious or trivial evaluation process. Accreditation is another cognomen. The certification of evaluators has recently been discussed rather extensively, and raises a number of the usual problems: who is going to be the super-evaluator(s) who decide(s) on the rules of the game (or who lost), what would be the enforcement procedures, how would the cost be handled, etc. Certification is a two-faced process which is sometimes represented as a consumer-protection device (which it can be) and sometimes as a turf-protection device for the guild members, i.e. a restraint-of-trade process, which it frequently is. Medical certification was responsible for driving out the midwives, probably at a substantial cost to the consumer; on the other hand, it was also responsible for keeping a large number of complete charlatans from exploiting the public. It certainly contributed to the indefensible magnitude of physicians' and lawyers' salaries/fees, and in this respect is consumer-exploitative. The abuses of the big-league auditors, to take another example, are well documented in Unaccountable Accounting by Abraham Briloff (1973). When the state gets into the act, as it does with the certification of psychologists in many states, and of teachers in most, various political abuses are added to the above. In areas such as architecture, where non-certificated and certificated designers of domestic structures compete against each other, one can see some advantages to both approaches; but there is very little evidence supporting a single overall conclusion as to the direction which is best for the citizenry, or even for the whole group of practitioners. A well set up certification approach would undoubtedly be the best; the catch is always in the political compromises involved in setting it up; in other countries, the process is sometimes handled better and sometimes worse, depending upon variations in the political process.

CERTIFICATION (of evaluators) See Evaluation Registry.

CHECKLIST APPROACH (to evaluation) A checklist identifies all significant relevant dimensions of value, ideally in measurable terms, and may also provide for weighting them according to importance. The checklist provides an extremely versatile instrument for determining the quality of all kinds of educational activities and products. The checklist approach reduces the probability of omitting a crucial factor. It reduces artificial overweighting of certain factors by careful definition of the checklist items, so as to avoid overlap. It also provides a guideline for investigating the thoroughness of implementation procedures, and it reduces possible halo effect and Rorschach effect. It does not require a theory, and should avoid depending on one as much as possible. Checkpoints, if there are many, should be grouped under categories that have commonsense or obvious meaning, to facilitate interpretation. A checklist does not usually embody the appropriate combinatorial procedure for cases where the dimensions are highly interactive, i.e. where the linear or weighted-sum approach fails; such cases are rare.
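As an illustrative sketch only (the checkpoints and weights are invented, not drawn from this entry), the linear weighted-sum combination mentioned above can be computed like this:

    # Weighted-sum checklist scoring: each checkpoint is rated 0-10 and
    # weighted by judged importance; assumes the dimensions do not interact.
    checklist = {
        "content accuracy": (5, 8),   # (weight, rating); hypothetical values
        "ease of use":      (3, 6),
        "durability":       (2, 9),
    }
    total_weight = sum(w for w, _ in checklist.values())
    score = sum(w * r for w, r in checklist.values()) / total_weight
    print(f"weighted checklist score: {score:.2f} out of 10")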

CIPP An evaluation model expounded in Evaluation and Decision-Making by Guba, Stufflebeam et al.; the acronym refers to Context, Input, Process and Product evaluation, the four phases of evaluation they distinguish; it should be noted that these terms are used in a slightly special way. Possibly the most elaborate and carefully thought out model extant; it underemphasizes evaluation for accountability or for scientific interest.

CITATION INDEX The number of times that a publication or person is referenced in other publications. If used for personnel evaluation, this is an example of a spurious quantitative measure of merit since e.g. it depends on the size of the field, discriminates against the young and against those working on unfashionable topics, does not in fact identify a third of the Nobel laureates, etc. Its only possible use is in evaluating the significance of a particular publication within a field, i.e. in history-of-ideas research; significance is very loosely related to merit.

CLIENT The person (or agency, etc.) for whom an evaluation is formally done. Usually to be distinguished from audience and consumer. In social program evaluation "client" may be used to mean "consumer," i.e., the client of the program rather than of the evaluator; it is better to try to use the term "clientele" for that purpose.

CLIENTELE The population directly served by a program.

CLINICAL PERFORMANCE EVALUATION In the health field, and to an increasing extent elsewhere (e.g. teaching evaluation), the term "clinical" is being used to stress a kind of "hands-on" situation which is typically not well tested by anything like paper-and-pencil tests. However, it can be very well tested by appropriate simulations, as we have seen in some of the medical Boards exams. It can also be very well tested by carefully done structured observations by trained and calibrated observers. If one thinks of a paper-and-pencil test as a limiting case of a simulation, one realizes the enormous extent to which it depends, in order to be realistic, upon imagination and role-playing skills that few of us possess. When one turns to look at standard simulations, one finds that these have inherited a great deal of the artificiality of the paper-and-pencil tests. For example, they rarely involve "parallel processing," that is, the necessity of handling two or three tasks simultaneously. A serious clinical simulation would start the candidate on one problem, providing charts and histories, and then, just as this was beginning to make sense, a new problem with emergency overtones would be thrust at them, and just before they reached the point of making a preliminary emergency decision on that, a third and even more pressing problem would be thrown at them. Given that there is some anxiety associated with test-taking for most people, one could probably come close to simulating clinical settings in this respect. We have long since developed simulations which involve the provision of supplementary information when requested by the testee, part of the scoring being tied to the making of appropriate requests. But very few signs of careful job analysis show up in more advanced simulations where a true clinical performance is of interest.

CMI Computer Managed Instruction. Records are kept by the computer, usually on every test item and every student's performance to date. Important for large-scale individualized instruction. The computer may do diagnosis on the basis of test results and instruct the student as to the materials that should be used next. The extent of feedback to the student varies considerably; the main aim is feedback to the course manager(s).

COGNITIVE The domain of the propositionally knowable; consisting of "knowledge-that," or "knowledge-how" to perform intellectual tasks.

COHORT A term used to designate one group among many in a study, e.g. "the first cohort" may be the first group to have been through the training program being evaluated. Cf. Echelon.


COMPETENCY-BASED An approach to teaching or training which focuses on identifying the competencies needed by the trainee, and on teaching to mastery level on these, rather than teaching allegedly relevant academic subjects to various subjectively determined achievement levels. Nice idea, but most attempts at it either fail to specify the mastery level in clearly identifiable terms or fail to show why that level should be regarded as the mastery level. ("Performance-based" is a cognomen.) C-B Teacher Education (CBTE) was a big deal in the mid-70s, but the catch was that no one could validate the competencies, since style research has come up with so little. There is always the subject-matter competency requirement, of course, usually ignored in K-12 teacher training and treated as the only one in the post-secondary domain; but CBTE was talking about pedagogical competencies, i.e. teaching-method skills. See also Minimum Competency, Mastery.

COMPLIANCE (check). An aspect of monitoring.

COMPONENT (evaluation) A component of an evaluand is typically a physically discrete part of it, but more precisely any segment that can be said to relate to others in order to make up the whole evaluand. (Typically, we distinguish between the components and their relationships in talking about the evaluand as a system made up of parts or components.) The holistic evaluation of something does not imply any evaluation of its components; and an evaluation of components does not automatically imply an evaluation of the whole evaluand: excellent components for an amplifier will not make a good amplifier unless they are correctly related by design and assembly relationships. But since components are frequently of variable quality, and since we are frequently looking for diagnoses that will lead to improvement, evaluating the components may be a very useful approach to formative evaluation. If we can also evaluate the relationships, we may have a very helpful kind of formative evaluation; how helpful will depend upon how self-evident or easily determined the "fixes" for defective components are. Component evaluation is distinguished from dimensional evaluation, another kind of analytical evaluation, by the relatively greater likelihood of manipulability, in a constructive way, of components by comparison with dimensions (which may be statistical artefacts).

CONCEPTUAL SCHEME A set of concepts in terms of which one can organize the data/results/observations/evaluations in an area of investigation. Unlike theories, conceptual schemes involve no assertions or generalizations (other than the minute presuppositions of referential constancy), but they do generate hypotheses and descriptive simplicity.

CONCLUSION-ORIENTED RESEARCH Contrasted with decision-oriented. Cronbach and Suppes' distinction between two types of educational research, sometimes thought to illuminate the difference between evaluation research (supposedly decision-oriented) and academic social science research (conclusion-oriented). This view is based on the fallacy of supposing that conclusions about merit and value aren't conclusions, a holdover from the positivist, value-free doctrine that value-judgments are not testable propositions, hence unscientific; and on the fallacy of supposing that all evaluation relates to some decision (the evaluation of many historical phenomena, e.g. a reign or a policy, does not).

CONCURRENT VALIDITY The validity of an instrument which is supposed to inform us about the simultaneous state of another system or variable. Cf. predictive validity, construct validity.

CONFIDENTIALITY One of the requirements that surfaces under the legitimate process considerations in the Key Evaluation Checklist. Confidentiality, as it is presently construed, relates to the protection of data about individuals from casual perusal by other individuals, not to the protection of evaluative judgments on an individual from inspection by that individual. The requirement that individuals be able to inspect an evaluative judgment made about them, or at least summaries of these with some attempt at preserving anonymity of the evaluator, is a relatively recent constraint on personnel evaluation. It is widely thought to have undermined the process quite seriously, since people can no longer say what they think of the candidate if they have any worry about the possibility of the candidate inferring their authorship and taking reprisals, or thinking badly of them, if the evaluation was critical. It should be noted that most large systems of personnel evaluation have long since failed because people were unwilling to do this even when complete anonymity was guaranteed. (This was characteristic of the armed services systems.) There is no doubt that amongst universities of the first rank there has been a negative effect; but this mostly shows a failure of ingenuity on the part of personnel evaluators, since there are several ways to preserve complete anonymity under even the weakest laws, namely those which only blank out the name and title of the evaluator. See also Anonymity.

CONFLICT OF INTEREST (COI) One of many sources of bias. An evaluator evaluating his/her own products is involved in a conflict of interest, but the result may still be better than the evaluation done by an external evaluator, since the latter's lack of ego-involvement may not compensate for the loss of intimate knowledge of and experience with the product. That is, although conflict of interest always hurts credibility, it does not always affect validity. But since it may easily affect validity, it is normally better to use at least a mixture of internal and external evaluation. In choosing panels for evaluation, the effort to pick panelists who have no conflict of interest is usually misplaced or excessive; it is better to choose a panel with a mix (not even an exact balance) of conflicting interests, since they are likely to know more about the area than those with no interests in it or against it. Financial, personal and social ties are no different from intellectual commitment with respect to COI; all can produce better insights as well as worse judgments. The key to managing COI is requiring that the arguments be public and that their validity be scrutinized and voted on by those with other or no relevant COI. See Bias.

CONNOISSEURSHIP MODEL Elliott Eisner's non-traditional method of evaluation is based on the premise that artistic and humanistic considerations are more important in evaluation than scientific ones. No quantitative analysis is used; instead the connoisseur-evaluator observes firsthand the program or product being evaluated. The final report is a detailed descriptive narrative about the subject of the evaluation. Cf. Literary Criticism, Naturalistic, Responsive and Models.


CONSONANCE/DISSONANCE The phenomena of cognitive consonance and dissonance, often associated with the work of the social scientist Leon Festinger, are a major and usually underrated threat to the validity of client satisfaction surveys and follow-up interviews as guides to program or product merit. (The limiting case is the tendency to accept Presidential decisions.) Cognitive consonance, not unrelated to the older notion of rationalization, occurs when the subject's perception of the merit of X is changed by his or her having made a strong commitment to X, e.g. by purchasing it, spending time taking it as therapy, etc. Thus a Ford Pinto may be rated considerably higher by comparison with a VW Rabbit after it has been purchased than before, although no new evidence has emerged which justifies this evaluation shift. This is the conflict-of-interest side of the coin whose other side is increased knowledge of the (e.g.) product. Some approaches to discounting this phenomenon include very careful separation of needs assessment from performance assessment, the selection of subjects having experience with both (or several) options, serious task-analysis by the same trained observers, looking at recent purchasers of both cars, etc. The approval of boot camp by Marines and of cruel initiation rites by fraternity brothers is a striking and important case, called "initiation-justification" bias in LE. (These phenomena also apply at the meta-level, yielding spurious positive evaluations of evaluations by clients.)

CONSTRUCT VALIDITY The validity of an instrument (e.g. a test, or an observer) as an indicator of the presence of (a particular amount of) a theoretical construct. The construct validity of a thermometer as an indicator of temperature is high, if it has been correctly calibrated. The key feature of construct validity is that there can be no simple test of it, since there is no simple test of the presence or absence of a theoretical construct. We can only infer to that presence from the interrelationships between a number of indicators and a theory which has been indirectly confirmed. The contrast is with predictive and concurrent validity, which relate the readings on an instrument to another directly observable variable. Thus, the predictive validity of a test for successful graduation from a college, administered before admission, is visible on graduation day some years later. But the use of a thermometer to test temperature cannot be confirmed by looking at the temperature; in fact, the thermometer is as near as we ever get to the temperature. Over the history of thermodynamics, we have adopted four successive different theoretical definitions of temperature, although you couldn't tell this from looking at thermometers. Thus, what the thermometer has "read" has been four different theoretical constructs, and its validity as an indicator of one of these is not at all the same as its validity as an indicator of another. No thermometer reads anything at all in the region immediately above absolute zero, since all gases and liquids have solidified by that point; nevertheless, this is a temperature range, and we infer what the temperature is there by complicated theoretical calculations from other variables. The validity of almost all tests used for evaluative purposes is construct validity, because the construct towards which they point (e.g. "excellent computational skills") is a complex construct and not observable in itself. This follows from the very nature of evaluation as involving a synthesis of several performance scales. But of course it does not follow that evaluative conclusions are essentially less reliable than those from tests with demonstrated predictive validity, since predictive validities are entirely dependent upon the persistence through time (often long periods of time) of a relationship, a dependency which is often shakier than the inference to an intellectual skill such as computational excellence from a series of observations of a very talented student faced with an array of previously unseen computational tasks. Thermometers are highly accurate though they "only" have construct validity. Construct validity is rather more easily attainable with respect to constructs which figure in a conceptual scheme that does not involve a theory; only the requirements of taxonomical merit (clarity, comprehensiveness, insight, fertility, etc.) need to be met, not confirmation of the axioms and laws of the theory. (Such constructs are still called "theoretical constructs," perhaps because conceptual schemes shade and evolve into theories so fluidly.)

CONSULTANT Consultants are not simply people hired for advice on a short-term basis, as one might suppose from the term; they include a number of people who are essentially regular (but not tenured) staff members of state agencies, where some budgetary or bureaucratic restriction prevents the addition of permanent staff but allows a semi-permanent status to the consultant. Hence an evaluation consultant is not always an external evaluator. The basic problem about being an evaluation consultant, as a career, is that, with the exception of the semi-permanent jobs just mentioned, you have to make enough on the days you're working to carry you through the days when you're not, and in the real world it is highly unlikely that jobs will be kind enough to fill your time exactly. Meanwhile, some of your overhead, e.g. secretarial and rent, will continue, as well as your grocery bills, etc. Consequently, the most cost-effective consultants from the client's point of view tend to be people with full-time jobs who do their consulting as moonlighting. In the management consultant field, where fees are very much higher than in the evaluation consultant field, almost as high as a regular attorney's fees, this is less of a problem; but in the human services program evaluation area, the true cost of the best consultant is usually far beyond the budgetary limits placed on consulting fees by agencies. It is high time that some system of payment by results was allowed as an alternative, so that there would be some incentive for fast and extremely good work by full-timers, instead of spreading the work out and moonlighting it. The big shops have some full-time evaluators on staff, but only for big projects funded by agencies, not as consultants for the average small client.

CONSUMER The "true consumers" are the persons who are being directly or indirectly affected at the using or receiving end of a product or program: the impacted populations. The true consumers are not usually just the target population. The "consumers" of an evaluation are its audiences. The staff of a program are also affected by the program, but at the producing or providing end.

CONSUMER-BASED EVALUATION An approach to the evaluation of (typically) a program, that starts with and focuses on the impact on the consumer or clientele or, to be more exact, the impacted population. It might or might not be done goal-free, though clearly that is the methodology of choice for consumer-based evaluation. It will particularly focus on the identification of non-target populations that are impacted, on unintended effects, on true cost to the consumer, etc.


CONTENT ANALYSIS The evaluative or pre-evaluative process of systematically determining the characteristics of a body of material or practices, e.g. texts, books, courses, jobs. A great many techniques have been developed for doing this, running from frequency counts on words of certain kinds (e.g. personal references) to analysis of plot structure in illustrative stories to determine whether the dominant figure is e.g. male or female, white or non-white. The use of content analysis is just as important in determining whether the evaluand matches the "official" description of it as it is in determining what it is and what it does in other dimensions than those involved in the "truth in packaging" issue. Thus, a social studies chart entitled "Great Americans" could be subject to content analysis in order to determine whether those listed were actually great Americans (truth in labeling); but even if it passed that test, it would be subject to further content analysis for e.g. sexism, because a list that did not contain the names of the great women suffragists would show a deformed sense of values, although it might be too harsh to argue that it was not correctly labeled. Notice that none of this refers to a study of the actual effects (pay-off evaluation), but is a type of legitimate process evaluation. The line between the two is not sharp, since literal falsehoods may be the best pedagogical device for getting the student to remember truths. Although this approach would then violate the requirement of scientific or disciplinary integrity (a process consideration), this would be excused on the grounds that the only point of the work is to produce the right effects and that teaching the correct and much more complicated account leads to less accurate residual learning than teaching the incorrect account. It is not an exaggeration to say that most elementary science courses follow the model of teaching untruths in order to get approximate truths instilled in the brains of the students. A more radical view would hold that human brains in general require knowledge to be presented in the form of rather simple untruths rather than true complexities. An excellent brief discussion of content analysis by Sam Ball will be found on pp. 82-84 of the Encyclopedia of Educational Evaluation, which he co-edited for Jossey-Bass, 1976.

CONTENT VALIDITY The property of tests that, after appropriate content analysis, appear to meet all requirements for congruence between claimed and actual content. Thus a test of net-making ability should contain an adequate (weighted) sampling of all and only those skills which the expert net-maker exhibits. Note that this is an example of a mainly psychomotor domain of skills; content validity is not restricted to the cognitive or verbal areas. Content validity is one step more sophisticated than face validity and one step less sophisticated than construct validity. So it can be seen as a more scientific approach to face validity, or as a less-than-comprehensive approach to construct validity. The kind of evaluation that is involved in and leads to credentialing by the state as a teacher of e.g. mathematics (in the U.S.) is content-invalid because of its grotesque failure to require mathematical skills at anything like a reasonable level (e.g. the same level as the second quartile of college sophomores majoring in mathematics). In general, like other forms of process evaluation, content validity checks are considerably quicker than construct validity approaches, and frequently provide a rather highly reliable negative result, thereby avoiding the necessity for the longer investigation. They cannot provide a positive result so easily, since content validity is a necessary but not a sufficient condition for merit.

CONTEXT (of evaluation or evaluand). The ambient circumstances that do or may influence the outcome.

CONTRACT See Funding.

CONTRACT TYPES The usual categories of contract types (this particular classification comes from the Eckman Center's The Project Manager's Workplan (TPMWP)) are fixed price, time and materials, cost reimbursement, cost plus fixed fee, cost plus incentive fee, cost plus sliding fee, and joint powers of agreement. Explaining the differences beyond those obvious from the terms would be telling you more than you want to know unless you are about to become a large-project manager, in which case you'll need TPMWP, and may be able to afford it (price upward of $30); it can be ordered from The Eckman Center, P.O. Box 621, Woodland Hills, CA 91365. That's the technical stuff; but at the commonsense level, it's a good idea to have something in writing that covers the basics, like when payments are to be made (and under what conditions they will not be made) and who is empowered to release the results (and when). Dan Stufflebeam has the best checklist for this, in his forthcoming (1981) text.

CONTROL GROUP A group which does not receive the "treatment" (e.g. a service or product) being evaluated. (The group which does receive it is the experimental group, though the study may be ex post facto and not experimental.) It is used to check the extent to which the same effect occurs without the treatment, which would tend to show the treatment was not causing whatever changes were observed in the experimental group. To do this, the control group must be "matched," i.e., so chosen as to be closely similar to the experimental group (not identical, just similar). The more carefully the matching is done (e.g. by using "identical twins"), the more sure one can be that differences in outcome are due to the experimental treatment. A great improvement is achieved if you can randomly assign matched subjects to the two groups, and arbitrarily designate one as the experimental and the other as the control group. This is a "true experiment"; other cases are weaker and include ex post facto studies. Matching would ideally cover all environmental variables as well as genetic ones (all variables except the experimental one(s)), but in practice we match only on variables which are likely to affect the results significantly, for example sex, age, schooling. Matching on specific characteristics (stratifying) is not essential, it is only efficient: a perfectly good control group can be set up by using a (much larger) random sample of the population as the control group (and also for the experimental or treatment group). The same degree of confidence in the results can thus be achieved either by comparing small, closely matched groups (experimental and control) or large, entirely randomly selected groups. Of course, if you're likely to be wrong, or if you're in doubt, about which variables to match on, the large random sample is a better bet, even though more expensive and slower. It should be noted that it is sometimes important to run several control groups, and that one could then just as well call them all experimental groups or comparison groups. The classical control group is the "no treatment" group, but it's not usually the most relevant to practical decision-making (see Critical Competitors). Indeed, it's often not even clear what "no treatment" means: e.g. if you withhold your treatment from a control group in evaluating psychotherapy, they create their own, and may change behavior just because you withheld treatment; they may get divorced, change or lose their job, etc. So you finish up comparing psychotherapy with something else, usually a mixture of things, not with nothing, not even with no psychotherapy, only with no psychotherapy of your particular brand. Hence it's better to have control groups that get one or several standard alternative treatments than a "leave them to their own devices" group, into which the "no treatment" group often degenerates. And in evaluation, that's exactly where you bring in the critical competitors. In medicine, that's why the control group gets a placebo.
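A sketch of the random assignment of matched pairs described above, the "true experiment" arrangement (the subject labels are invented for illustration):

    import random

    # Each pair has been matched on the variables thought to matter
    # (e.g. sex, age, schooling); the labels are hypothetical.
    matched_pairs = [["s01", "s02"], ["s03", "s04"], ["s05", "s06"]]

    experimental, control = [], []
    for pair in matched_pairs:
        random.shuffle(pair)          # a coin flip decides who gets the treatment
        experimental.append(pair[0])
        control.append(pair[1])
    print("experimental:", experimental)
    print("control:", control)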

CONVERGENCE GROUP (Stufflebeam) A team whose task is to develop the best version of a treatment from various stakeholder or advocate suggestions. A generalization of the term, to convergence sessions, covers the process that should follow the use of parallel (teams of) evaluators, viz. the comparison of their written reports and an attempt to resolve disagreements. This should be done in the first place by the separate teams, with a referee (group) present to prevent bullying; it may later be best to use a separate convergence (synthesis) group.

CORRECTION FOR GUESSING In multiple-choice exams with n alternatives in each question, the average testee would get 1/n of the marks by guessing alone. Thus if a student fails to complete such an exam, it has been suggested that one should add 1/nth of the number of unanswered questions to his or her score, in order to get a fair comparison with the score of a testee who answers all the questions, guessing the ones they do not have time to do seriously. There are difficulties both with this suggestion ("applying the correction for guessing") and with not using it; the correct procedure will depend on a careful analysis of the exact case. Another version of the correction for guessing involves subtracting the number of answers that one would expect to get by guessing from the total score, whether the test is completed or not. These two approaches give essentially the same results, but their effects may interact differently with different instructions on the test; in general, ethics requires that if such corrections will be used, they be pre-explained to testees.
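A sketch of the two versions just described (the function and variable names are ours, and version B follows one natural reading of the entry): version A credits unanswered items at the chance rate, version B subtracts the expected number of lucky guesses. On a test of T items the two differ only by the constant T/n, which is why they give essentially the same results:

    def corrected_a(raw, unanswered, n):
        # Version A: credit each unanswered item at the chance rate 1/n.
        return raw + unanswered / n

    def corrected_b(raw, answered, n):
        # Version B: subtract the expected number of correct guesses.
        return raw - answered / n

    # 100 items, 4 alternatives: 60 right out of 80 attempted, 20 left blank.
    print(corrected_a(60, 20, 4))   # 65.0
    print(corrected_b(60, 80, 4))   # 40.0, i.e. 65.0 minus 100/4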

CORRELATION The relationship of concomitant occurrence or variation. Its relevance to evaluation is (a) as a hint that a causal relation exists (showing an effect to be present), and (b) to establish the validity of an indicator. The range is from -1 to +1, with 0 showing random relationship, and ±1 showing perfect (100%) correlation (+1) or perfect avoidance (-1).
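For illustration (the data are invented), the most common such measure, the Pearson product-moment correlation, can be computed directly from its definition:

    def pearson_r(xs, ys):
        # Covariance of the two lists divided by the product of
        # their standard deviations; always falls in [-1, +1].
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # +1.0, perfect correlation
    print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0, perfect avoidance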

COSTS, COST-ANALYSIS It is often useful to distinguish initial (start-up) costs from running (maintenance) costs; capital costs from cash flow; discounted from raw costs; direct from indirect costs or overhead, which includes depreciation, maintenance, taxes, some supplies, insurance, some services, repairs, etc.; psychological from tangible costs; outlays from opportunity costs. The "human capital" or "human resources" approach stresses one non-monetary component. "Marginal analysis" looks at the relative add-on costs from a given cost-level, and is often both more relevant to a decision-maker's choices at that basic cost-level, and more easily calculated. Cf. Zero-Based Budgeting.

COST-BENEFIT OR BENEFIT-COST ANALYSIS Cost-benefit analysis goes a step beyond cost-effectiveness analysis (see below) and estimates the overall cost and benefit of each alternative (product or program) in terms of a single quantity, usually money. This analysis will provide an answer to the question: Is this program or product worth its cost? Or, which of the options has the highest benefit/cost ratio? (It is often not possible to do cost-benefit analysis, e.g. when ethical, essential, temporal, or aesthetic elements are at stake.)
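A toy illustration (all figures invented) of why the two questions at the end of the entry can come apart: net benefit and the benefit/cost ratio need not pick out the same option.

    # (monetized benefit, cost) for two hypothetical programs
    options = {"Program A": (200000, 120000),
               "Program B": (90000, 45000)}
    for name, (benefit, cost) in options.items():
        print(f"{name}: net = {benefit - cost}, B/C = {benefit / cost:.2f}")
    # Program A has the larger net benefit (80000 vs 45000), but
    # Program B has the higher benefit/cost ratio (2.00 vs 1.67).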

COST-EFFECTIVENESS ANALYSIS The purpose of this type of analysis is to determine what a program or procedure costs, and what it does (effectiveness), the latter often being described in terms of qualities (pay-offs) which cannot be reduced to money terms, or to any other single dimension of pay-off. This procedure does not provide an automatic answer to the question: Is this program or product worth its cost? The evaluator will have to weight and synthesize the needs data with cost-effectiveness results to get an answer, and even that may not give an unequivocal result.

COST-FEASIBILITY ANALYSIS Determining on a Yes/No basis whether something can be afforded (this means you can afford the initial and the continuing costs).

COST-FREE EVALUATION The doctrine that evaluations should, if properly designed and used, provide a net positive return, on the average. They may do this by leading either to the elimination of ineffective programs or procedures, or to an increase in productivity or quality from existing resources/levels of effort. The equivalence tables between costs and benefits should be set up to match the client's values, and accepted by the client, before the evaluation begins, so as to avoid undue pressure to be cost-free by cost-cutting only, instead of by quality-improvement as well as cost-cutting (if the latter is requested at all).

COST PLUS Another basis for calculating budgets on contracts is the "cost plus" basis, which allows the contractor to charge for costs plus a margin of profit; depending on how "profit" is defined, this may mean the contractor is making less than if the money was in a savings account and s/he was on a salary at some other job, or a good deal more. Sometimes cost plus contracts, since they usually omit any real controls to keep costs down (indeed, sometimes the reverse, since the "plus" is often a percentage of the basic cost), are not ideal for the taxpayer either. This has prompted the introduction of the "cost plus fixed fee" basis, where the fee is fixed and not proportional to the size of the contract. That's sometimes better, but sometimes, when the scope of work is enlarged during the project, by the discovery of difficulties or (subtly) by the agency, it shrinks the profit below a reasonable level. The profit, after all, has to carry the contractor through periods when contracts happen not to abut perfectly, pay the interest on the capital investment, and provide some recompense for high risk. The justification for cost plus contracts is very clear in circumstances where it is difficult to foresee what the costs will be, and no sane contractor is going to undertake something with an unknown cost. Especially if the agency wishes to retain the option of changing the conditions that are to be met, the hardware that is to be used, etc., say in the light of obsolescence of the materials available at the beginning, the cost plus percentage contract can make sense. Competitive bidding is still possible, after all.

CREDIBILITY Evaluations often need to be not only valid but such that their audiences will believe that they are valid (cf. "It is not enough that justice be done, etc."). This may require extra care about avoiding (apparent) conflict of interest, for example, even if in a particular case it does not in fact affect validity.

CRITERION The criterion is whatever is to count as the "pay-off," e.g. success in college is often the "criterion measure" against which we validate a predictive test like a college entrance examination. Ability to balance a checkbook might be one "criterion behavior" against which we evaluate a practical math course.

CRITERION-REFERENCED TEST This type of test provides information about the individual's (or a group's) knowledge or performance on a specific criterion. The test scores are thus interpreted by comparison with pre-determined performance criteria rather than by comparison with a reference group (see Norm-Referenced Test). The merit of such tests depends completely on the (educational) significance of the criterion (trivial criterion, trivial test; theory-impregnated criterion, theory-dependent test) and on the technical soundness of the test. It is not within an amateur's or the usual teacher's domain of competence to construct such tests, and when they do, the results are often uninterpretable because we know neither whether the subject understood the question nor whether s/he should be able to answer it. It is clear that successful construction of such tests is also beyond the capacity or interest of most professionals: we still lack one good functional literacy test, let alone four or five to choose from.

CRITICAL COMPETITORS Critical competitors are those entities with which comparisons need to be made when a program, product, etc., is being evaluated. The critical competitors can be real or hypothetical, e.g. another existing text or one we could easily make with scissors and paste. They bear on the question whether the best use of the money (and other resources) involved is being made, as opposed to the pragmatically less interesting question of whether it's just being thrown away. You don't just want to know whether this $20.00 text is good; you want to know if there's a much better one for $20.00, or one that is just as good for $10.00. Those others are (two of) the critical competitors that should figure in the evaluation of the text. So should a film (if there is one), lectures, TV, a job or internship, etc., where they or an assemblage of them cover similar material. Traditional evaluation design has tended to use a no-treatment control group for the comparison, which is incorrect; "no treatment" is rarely the real option. It's either the old treatment or another innovative one, or both, or a hybrid, or something no one has so far seen as relevant (or perhaps not even put together). These unrecognized or "created" critical competitors are often the most valuable contributions an evaluator makes, and coming up with them requires creativity, local knowledge and realism.

CRITICAL INCIDENT TECHNIQUE (Flanagan) This approach, tied to the analysis of longitudinal records, attempts to identify significant events or times in an individual's life (or an institution's life, etc.) which in some way appear to have altered the whole direction of subsequent events. It offers a way of identifying the effects of e.g. schooling, in circumstances where a full experimental study is impossible. It is, of course, fraught with hazards. (Ref. John Flanagan, Psychological Bulletin, 1954, pp. 327-358.)

CROSS-SECTIONAL (study) If you want to get the results that a longitudinal study would give you, but you can't wait around to do one, then you can use a cross-sectional study as a substitute, whose validity will depend upon certain assumptions about the world. In a cross-sectional study, you look at today's first-year students and today's graduating seniors and infer e.g. that college has produced the difference between them; in a longitudinal study you would look at today's first-year students and wait and see how they change by the time they become graduating seniors. The cross-sectional study substitutes today's graduating seniors for a population which you cannot inspect for another four years, namely the seniors that today's freshman or first-year students will become. The assumptions involved are that no significant changes in the demographics have occurred since the present seniors formed the entering class, and that no significant changes in the college have occurred since that time. (For certain inferences, the assumptions will be in the other direction in time.)

CRYPTO-EVALUATIVE TERM A term which appears to be purely descriptive, but whose meaning necessarily (definitionally) involves evaluative concepts, e.g. intelligent, true, deduction. Cf. Value-imbued.

CULTURE-FAIR/CULTURE-FREE A culture-free test avoids bias for or against certain cultures. Depending upon how generally culture is defined, and how the test is used, this bias may or may not invalidate the test. Certain types of problem-solving tests, involving finding food in an artificial desert to avoid starvation, for example, are about as near to culture-free as makes any sense; but they are a little impractical to use. To discover that a test discriminates between e.g. races with respect to the numbers who pass a given standard has absolutely no relevance to the question of whether the test is culture-fair. If a particular race has been oppressed for a sufficiently long time, then its culture will not provide the kind of support for intellectual exercises (or athletic ones, depending upon the type of oppression); it will probably not provide the dietary prerequisites for full development; and it may not provide the role models that stimulate achievement in that direction. Hence, quite apart from any effects on the gene pool, it is to be expected that that racial group will perform worse on certain types of tests; if it did not, the argument that serious oppression has occurred would be weakened. Systematic procedures are now used to avoid clear cases of cultural bias in test items, but these are poorly understood. Even distinguished educators will sometimes point to the occurrence of a term like "chandelier" in a reading vocabulary test as a sign of cultural bias, on the grounds that oppressed groups are not likely to have chandeliers in their houses. Indeed they are not, but that's irrelevant; the question is whether the term reliably indicates wide reading, and hence whether a sufficient number of the oppressor group in fact picked up the term through labeling an object in the environment rather than through wide reading, to invalidate that inference. That's an empirical question, not an a priori one. A similar point comes up in looking at the use of test scores for admission selection; validation of a cut-off is properly based on prior experience, and may be based on a mainly white population. In such a case, the use of the same cutting scores for minorities will tend to favor them, as a matter of empirical fact (possibly because the later efforts of those individuals get less peer/home support than in the white population).

CURRICULUM EVALUATION Curriculum evaluation can be treated as a kind of product evaluation, with the emphasis on outcome studies of those using the curriculum; or it can be approached in terms of content validity. ("Curriculum" can refer to the content or to the sequencing of courses, etc.) A popular fallacy in the area involves the supposition that good tests used in a curriculum evaluation should match the goals of the curriculum, or at least its content; on the contrary, if they are to be tests of the curriculum, they must be independently constructed, by reference to the needs of the user population and the general domain of the curriculum, without regard to its specific content, goals and objectives. Another issue concerns the extent to which long-term effects should be the decisive ones; since they are usually inaccessible because of time or budget considerations, it is often thought that judgments about curricula cannot be made reliably. But essentially all long-term effects are best predicted by short-term effects, which can be measured. And the causal inferences involved from temporally remote data, even if we could wait to study the long-term situation, are so much less reliable that any gains from the long-term study would likely be illusory. One of the most serious errors in a great deal of curriculum evaluation involves the assumption that curricula are implemented in much the same way by different teachers, or in different schools; even if a quite thorough checklist is used to ensure implementation, there is still a great deal of slippage in the teaching process. In the more general sense of curriculum, which refers to the sequence of courses taken by a student, the slippage occurs via the granting of exceptions, the use of less-than-valid challenge exams, the substitution of different instructors for others on leave, etc. Nevertheless, good curriculum materials and good curriculum sequences should be evaluated for gross differences in their effectiveness and veracity/comprehensiveness/relevance to the needs of the students. The differences between good and bad are so large and common that, despite all the difficulties, very much improved versions and choices can result from even rough-and-ready evaluation of content and teachability. Davis identifies the following components in curriculum evaluation: determining the actual nature of the curriculum (and its support system of counselors, other curricula, catalogs, etc.) as compared with the official descriptions (e.g. via transcript analysis, curriculum analysis of class notes); evaluating its academic quality; examining procedures for its evaluation and revision; assessing student learning; student surveys, including exit or alumni interviews; faculty surveys; surveys of employers and potential employers; reviews by professional curriculum experts; comparison with any standards provided by relevant professional associations; and checking with leading schools or colleges to see if they have improvements/updates that should be considered. Ref. Designing and Evaluating Higher Education Curriculum, Lynn Wood & Barbara Gross Davis, AAHE, 1978.

CUTTING SCORE A score which marks the line between grades, between mastery and non-mastery, etc. Always arbitrary to some degree, it is justifiable in circumstances where a number of such scores will be synthesized eventually. But in a final report, only cutting zones make sense, and the grades should indicate this, e.g. A, A-, AB, B+, ... where the AB indicates a borderline area. Many opponents of minimum competency testing complain about the arbitrariness of any cut-off point; the response should be to use a zone, i.e., three grades (clearly not competent; debatably competent; clearly competent).
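A sketch of the three-zone reporting suggested above (the cutting scores themselves are hypothetical):

    def competency_zone(score, low_cut=60, high_cut=70):
        # Below the lower cut: clearly not competent; above the upper
        # cut: clearly competent; in between lies the borderline zone.
        if score < low_cut:
            return "clearly not competent"
        if score <= high_cut:
            return "debatably competent"
        return "clearly competent"

    for s in (55, 65, 80):
        print(s, competency_zone(s))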

DATA SYNTHESIS The semi-algorithmic, semi-judgmental process of producing comprehensible facts from raw data via descriptive or inferential statistics and interpretation in terms of concepts, hypotheses or theories.

DECILE (Stat.) See Percentile.

DECISION-MAKER It is sometimes important to distinguish between making decisions about the truth of various propositions, and making decisions about the disposition of (or appropriate action about) something. While the scholar automatically falls into the first category, s/he typically only serves as a consultant to a decision-maker of the second type. Most discussion about decision-makers in the evaluation context refers to those with the power to dispose, not merely with the power to propose or draw conclusions.

DECISION-ORIENTED RESEARCH See Conclusion-Oriented Research.

DECISION RULE A link between an evaluation and action, e.g. "Those with a grade below C must repeat the course"; "Hypotheses which are not significant at the .01 level will be abandoned." (The latter example is common but logically improper; see Null Hypothesis.)

DELIVERY SYSTEM The link between a product or service and the population that needs or wants it. Important to distinguish this in evaluation, because it helps avoid the fallacy of supposing that the existence of the need justifies the development of something to meet the need. It does so only if one can either develop a new (or make use of an existing) delivery system.

DELPHI TECHNIQUE A procedure used in group problem solving, involving, for instance, circulating a preliminary version of the problem to all participants, calling for suggested rephrasings (and/or preliminary solutions). The rephrasings are then circulated for a vote on the version that seems most fruitful (and/or the preliminary solutions are circulated for rank ordering). When the rank orderings have been synthesized, these are circulated for another vote. Innumerable variations on this procedure are practiced under the title "Delphi Technique," and there is a considerable literature on it. It is often done in a way that over-constricts the input, hence is ruined before it begins. In any case, the intellect of the organizer must be the equal of the participants' or the best suggestions won't be recognized as such. A phone conference call may be more effective, faster and cheaper, perhaps with one chance at written after-thoughts.

DEMOGRAPHICS The characteristics of a population defined in terms of its macroscopic features: age, sex, level of education, occupation, place of birth, residence, etc., by contrast with micro-features, e.g. IQ, attitude scores.

DEPENDENT VARIABLE One which represents the outcome; the contrast is with the independent variables, which are the ones we (or nature) can manipulate directly. That definition is circular and so are all others; the distinction between dependent and independent variables is an ultimate notion in science, definable only in terms of other such notions, e.g. randomness.

DESCRIPTIVE STATISTICS The part of statistics concerned with providing illuminating perspectives on or reductions of a mass of data (cf. inferential statistics); typically this can be done as a translation, involving no risk. For example, calculating the mean score of a class from its individual scores is straight deduction and no probability is involved. But estimating the mean score of the class by calculating the actual mean of a random sample of the class is of course inferential statistics.
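
The contrast in code, as a minimal sketch (the scores are invented): the class mean is a risk-free translation of the data, while the sample-based estimate carries a standard error:

    import random
    import statistics

    random.seed(0)
    scores = [random.gauss(70, 10) for _ in range(200)]  # a whole class

    # Descriptive: straight deduction from the data, no probability.
    print(statistics.mean(scores))

    # Inferential: estimating that same mean from a random sample.
    sample = random.sample(scores, 25)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / len(sample) ** 0.5
    print(estimate, std_error)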

DESIGN (of evaluation; see Evaluation Design)

DIFFUSION The process of spreading information about (typically) a product (cf. dissemination, with which diffusion is deliberately and somewhat artificially contrasted).

DIMENSIONAL EVALUATION A species of analytical evaluation in which the meritorious performance is broken out into a set of dimensions that have useful statistical properties (e.g. independence) or are familiar from other contexts and easily grasped, etc. Cf. Component Evaluation.

DISCREPANCY EVALUATION (Provus) Evaluation conceived of as identifying the gaps between time-tied objectives and actual performance, on the dimensions of the objectives. A slight elaboration of the simple goal-achievement model of evaluation.

DISPERSION (Stat.) The extent to which a distribution is "spread" across the range of its variables, as opposed to where it is "centered," the latter being described by measures of "central tendency," e.g. mean, median, mode. Dispersion is measured in terms of e.g. standard deviation or semi-interquartile difference.
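
A minimal sketch of both kinds of measure, computed with Python's statistics module (requires Python 3.8+ for quantiles; data invented):

    import statistics

    data = [3, 5, 7, 8, 9, 11, 14, 15, 18, 22]

    # Central tendency:
    print(statistics.mean(data), statistics.median(data))

    # Dispersion:
    print(statistics.stdev(data))              # standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)
    print((q3 - q1) / 2)                       # semi-interquartile range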

DISSEMINATION The process of distributing (typically) a product itself, rather than information about it (cf. diffusion). Also used as a jargon synonym for distribution.

DISSONANCE See Consonance.

DOMAIN-REFERENCED TESTING The purpose of testing is not usually to determine the testee's ability to answer the questions on the test, but to provide a basis for conclusions about the testee's ability with regard to a much wider domain. Criterion-referenced tests identify ability to perform at a certain (criterion) level on, typically, a particular dimension, e.g. two-digit multiplication. DRT is a slight generalization of that to cover cases like social studies education where it seems misleading to suggest that there is a criterion. One can think of a domain as defined by a large set of criteria, from which we sample, just as, at the other end, the test samples from the testee's abilities. The major problem with DRT is defining domains in a useful way. W. J. Popham has a usefully specific discussion in his Educational Evaluation, Prentice-Hall, 1975.

DUMPING The practice of unloading funds rapidly near the end of the fiscal year in order that they will not be returned to the central bureaucracy, which would be taken as a sign that next year's budget could be reduced by that amount since it wasn't needed. This may be done with all the trappings of an RFP, i.e., via a contract, but it's a situation where the difference between a contract and a grant tends to evaporate since the contract is so unspecific (because of lack of time for writing the RFP carefully) that it has essentially the status of a grant.

ECHELON A term like "cohort," sometimes used interchangeably with the latter, but better restricted to a group (or group of groups) that is time-staggered with regard to its entry. If a new group comes on board every four weeks for five months while they are being trained, followed by a three-month gap, and then the whole process begins again, the first three groups are called the first echelon; each of them is a cohort.


EDUCATIONAL ROLE (of the evaluator) It is both empirically and normatively the case that this role is of the greatest importance, at worst second only to the truth-finding role. This is not merely because few people have been properly educated as to either the importance or the techniques of evaluation; it is because the discipline will probably always seem unimportant until it (or its neglect) bites you, and quick education about that particular branch or application of evaluation will then become very important. No professional who is unsophisticated about personnel, product, proposal and program evaluation in their field is a professional; but even when (or if) this sophistication is widespread, application to oneself and one's own programs will not be easy, and the evaluator can help to teach one how to handle the process and its results. When Socrates said, "The unexamined life is not worth living," he was identifying himself as an evaluator; but it is not accidental that he is best-known as a teacher. Nor is it accidental that he was killed for combining the two roles. See also Valuephobia.

EDUCATIONAL OR OVERALL SIGNIFICANCE To get this, the evaluator must examine the data corresponding to each of the prior checkpoints on the Key Evaluation Checklist: educational significance represents a total synthesis of all you know. In particular, the gains attributed to the program or product being evaluated must be educationally significant/valuable and not just be statistically significant, something which may only be the result of using a large sample, or due to irrelevant vocabulary gains, poor test construction, peculiar statistical analysis or some other insignificant variable. (The same applies for medically significant, socially significant, etc.)

EFFECTIVENESS Usually refers to goal-achievement. Various indexes of effectiveness were developed around mid-century, when evaluation was thought of as simply goal-achievement measurement for social action programs.

EIR See Environmental Impact Report.

ENEMIES LIST Worst enemies often make best critics. They have two advantages over friends, in that they are more motivated to prove you wrong, and more experienced with a radically different viewpoint. Hence they will often probe deep enough to uncover assumptions one has not noticed, and destroy complacence about the impregnability of one's inferential structures. Obviously we should use them for metaevaluation, and pay them well. But who enjoys working with, thanking, and paying their enemies? The answer is: a good evaluator. This is a key test of the "evaluation attitude" (see Evaluation Skills). How little we really care about the correct assessment of merit and how much we prefer to make life easy for ourselves shows up nowhere more clearly than on this issue. A good example is the distribution of teaching-evaluation forms to students in a college class, normally done near the end of the semester. But where are your enemies then? Long gone; only the self-selected remain. You should distribute the forms to every warm body that crosses the threshold on the first day and any later date, to be turned in to their seat-neighbor when they decide not to come back. It is the ones who left who can tell you the most; by now you know most of what the stalwarts will say. If you value quality, reach out for suggestions to those who think you lack it.

ENGINEERING MODEL See Medical Model.

ENJOYMENT Although it is an error in educational evaluation to treat enjoyment as primary and learning as not worth direct inspection, there's no justification for not counting enjoyment at all. (Kohlberg once commented on the big early childhood program evaluations that it was too bad no one bothered to check whether at least the kids cried less in Headstart centers than at home.) And the situation in certain cases, e.g. aesthetic education, is much nearer to one where enjoyment is a primary goal. A common fallacy is to argue that since it would be a serious mistake to teach K-3 children some cognitive skills at the expense of making them hate school, we should therefore make sure they enjoy school and try to teach them skills. That prioritization of effort reduces the already meager interest in teaching something valuable, and has never been validated for gains in positive attitude towards school. The teacher is in conflict of interest here, since finger-painting takes less preparation than spatial skill-building.

ENTHUSIASM EFFECT See Hawthorne Effect.


ENVIRONMENTAL IMPACT REPORT (EIR) Often required by law prior to granting building or business permits or variances. A form of evaluation focusing on the ecosystem effects. Currently based mainly on bio-science and/or traffic analysis, these tend to be thin on the evaluation of opportunity costs, indirect costs, ethics, contingency trees, etc.

ERRORS OF MEASUREMENT It is a truism that measurement involves some error; it is more interesting to notice exactly how these errors can get one into trouble in evaluation studies. For example, it is obvious that if we select the low scorers on a test for remedial work, then some of these will be in the group because of errors of measurement (i.e. their performance on the particular test items that were used does not give an accurate picture of their ability). It follows that a remeasurement, using a test of matched difficulty, would immediately place them somewhat higher. Hence on a posttest, which is essentially such a retesting, they will come out looking better, although in fact this is not due to any merit of the intervening treatment, but is simply a statistical artefact due to errors of measurement (specifically, a regression effect). It also follows that matching two groups for their entry level skills, where we plan to use one of them as the control in a quasi-experimental study (i.e. one where the two groups are not created by random assignment), will get us into trouble because the errors of measurement on the two groups cannot be assumed to be the same, and hence the regression effect will be different in size. Another nasty effect of errors of measurement is to reduce correlation coefficients; one may intuitively feel that if the errors of measurement are relatively random, they should "average out" when one comes to look at the correlations, but the fact is that the larger the errors of measurement, the smaller the correlations will appear. See Regression to the Mean.
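
Both troubles are easy to exhibit by simulation; a minimal sketch, with all numbers invented, assuming Python 3.10+ for statistics.correlation:

    import random
    import statistics

    random.seed(1)
    ability = [random.gauss(100, 15) for _ in range(10_000)]

    def measure(error_sd):
        # observed score = true ability + random error of measurement
        return [a + random.gauss(0, error_sd) for a in ability]

    pre, post = measure(10), measure(10)  # matched-difficulty retest

    # Regression effect: pretest low scorers "gain" with no treatment.
    low = [i for i, p in enumerate(pre) if p < 85]
    print(statistics.mean(post[i] - pre[i] for i in low))  # positive

    # Attenuation: larger errors of measurement shrink correlations.
    for sd in (5, 10, 20):
        print(sd, statistics.correlation(measure(sd), measure(sd)))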

ESCROW A neutral individual or secure place where identifying data can be deposited until completion of an evaluation and/or destruction. (Term originated in the law.) See Filter, Anonymity.

ETHICS (in evaluation) See also Responsibility Evaluation. Ethics is the ultimate normative social science, ultimate because it refers to duties (etc.) which transcend all other obligations such as those to prudence, science, and professionalism. It is in one sense a branch of evaluation, in another a discipline which, like history or statistics, contributes a key element to many evaluations. That it is (logically) a social science is of course denied by virtually all social scientists, who have valuephobia about even the suggestion that non-ethical value-judgments have a place in science and hypervaluephobia about importing ethical judgments. But the inexorable consequence of the development of game- and decision-theory, latent function analysis, democratic theory in political science, welfare economics, analytical jurisprudence, behavioral genetics, and the "good reasons" approach to ethical theory, is that all the bricks have been baked for the building, and it's just superstitious to argue that some mysterious force prohibits putting one on top of another. The Constitution and Bill of Rights are essentially ethical propositions, with two properties: first, there are good reasons for adopting them; second, they generate sound laws. The arguments for them (e.g. Mill's "On Liberty") are as good social science as you'll find in a long day's walk through the professional journals, and the inferences to specific laws are well-tested. It follows that all the well-known arguments for law and order are indirectly arguments for the (secular) ethics of the Constitution and for the axiom of equal rights from which they flow, just as the arguments for the existence of atoms are, indirectly, arguments for the existence of electrons. Ethics is just a general social strategy and no more immune to criticism by social science than the death penalty or excise taxes or behavior therapy or police strikes. To act as if some logical barrier prevents science from arguing for or against particular ethical claims such as the immorality of the death penalty, a question of overall social strategy, but not from arguing for or against particular strategies within economics or penology, is to cut the social sciences off from the most important area in which they can make a social contribution. And it leads to ragged edges on and inconsistencies within the sciences themselves. For an excellent discussion of the "ethics-or-else" dilemma for allocation theory, see E. J. Mishan, Cost-Benefit Analysis, 1976, Praeger, Chapter 58, "The Social Rationale of Welfare Economics." Interestingly enough, although a large part of that book is about evaluation (e.g. Chapter 61 is called "Consistency in Project Evaluation"), neither that term nor the author's frequently-used variation "valuation" gets into the index. See Valuephobia.

EVALUABILITY Projects and programs, and the plans for them, are beginning to be scrutinized quite carefully for evaluability. This might be thought of as the first commandment of accountability or as a refinement of Popper's requirement of falsifiability. The underlying principle can be expressed in several ways, e.g. "It is not enough that good works be done; it must be possible to tell that (and, more importantly, when) good works have been done." Or "You can't learn by trial and error if there's no clear way to identify the errors." The bare requirement of an evaluation component in a proposal has been around for a while; what's new is a more serious effort to make it feasible and appropriate. That presupposes more expertise in evaluation than most review panels and project monitors have; but that may come. Evaluability should be checked and improved at the planning and preformative stages. Requiring evaluability of new programs is analogous to requiring serviceability in a new car: obvious enough, but who besides fleet owners (and GSA) knew that there was for many years a 2:1 difference in standard service costs as between Ford and GM? Congress may some day learn that low evaluability has a high price.

EVALUAND Whatever is being evaluated; if it is a person, the term "evaluee" is more appropriate.

EVALUATION The process of determining the merit or worth or value of something; or the product of that process. The special features of evaluation, as a particular kind of investigation (distinguished e.g. from traditional empirical research in the social sciences), include a characteristic concern with cost, comparisons, needs, ethics, and its own political, ethical, presentational, and cost dimensions; and with the supporting and making of sound value judgments, rather than hypothesis-testing. The term is sometimes used more narrowly (as is "science") to mean only systematic and objective evaluation, or only the work of people labeled "evaluators." While evaluation in the broad sense is inescapable for rational behavior or thought, professional evaluation is frequently worthless and expensive. Evaluation properly done can be said to be "a science" in a loose sense, as can, for example, teaching; but it is also an art, an inter-personal skill, something that judges and juries and literary critics and real estate assessors and jewelry appraisers do, and thus not "one of the sciences." See also Formative/Summative, Analytical/Holistic, etc.

EVALUATION EDUCATION Consumer education is still rather weak on training in evaluation, which should be its most important component. And of course there are other contexts than those in which one's role is that of the consumer where evaluation education would be most valuable, notably the manager role, or the service-provider/professional role. Few teachers, for example, have the faintest idea how to evaluate their own work, although this is surely the minimum requirement of professionalism. The last decades have seen considerable federal and state effort to provide reasonable standards of quality that will protect the consumer in a number of areas; they have not yet really understood that the superimposition of standards is a poor substitute for understanding the justification for them. Evaluation training is the training of (mainly professional) evaluators; evaluation education is the training of the citizenry in evaluation techniques, traps, and resource-finding, and is the only satisfactory long-run approach to improving the quality of our lives without extraordinary wastage of resources.

EVALUATION ETHICS AND ETIQUETTE Because evaluation in practice so often involves tricky interpersonal relations it has much to learn from diplomacy, arbitration, mediation, negotiating, and management (especially personnel management). Unfortunately, the wisdom of these areas is poorly encapsulated into learning and training materials, which are mainly truistic or anecdotal. The correct approach would appear to be via the refinement of normative principles and the collateral development of extensive calibration examples, rather as in developing skill in applied ethical analysis (casuistry). An example: you are the only first-timer on a site-visit team to a prestigious institution, and you gradually realize, as the time slips away in socializing and reading or listening to reports from administrators and administration-selected faculty, that no serious evaluation is going to occur unless you do something about it. What should you do? There is a precise (flow-chartable) solution which specifies a sequence of actions and utterances, each contingent upon the particular outcome of the previous act, and which avoids unethical behavior while minimizing distress; mature professionals without evaluation experience never get it right; some very experienced and thoughtful evaluators come very close; a group containing both reaches complete consensus on it after a twenty-minute discussion. Like so much in evaluation, this shows it to meet the standards of common sense, though it is not in our repertoires. It should be. Another example: a write-in response on an anonymous personnel evaluation form accuses the evaluee of sexual harassment. As the person in charge of the evaluation, what exactly should you do? ("Ignore it" is not only ethically wrong, it is obviously impossible.)

EVALUATION OF EVALUATIONS See Metaevaluation.

EVALUATION OF EVALUATORS Track record, not publications, is the key; but how do you get it? See Evaluation Registry.

EVALUATION PREDICATES The distinctively evaluative relations or ascriptions involved in grading, ranking, scoring, and apportioning.

EVALUATION REGISTRY A concept half-way to the certification or licensing of evaluators from complete laissez-faire. This would operate by encouraging evaluators and their clients to file a copy of their joint contract or letter of agreement with the evaluation registry at the beginning of an evaluation; to this would be appended any modifications made along the way and finally a brief standard report by each party, made independently, assessing the quality and utility of the evaluation, and the performance of the client. Each would have a chance to add a brief reaction to the other's evaluation, and the net end result (2 pages) would then be available for inspection, for a fee, by potential clients. This arrangement, it is argued, would be of more use to the client than asking an evaluator to suggest former clients as references or simply looking at a list of publications or reports, but would avoid the key problems with licensing: enforcement, standards, and funding. Start-up costs for such a registry, although small, are not available, possibly because we are in a period of evaluation backlash.

EVALUATION RESEARCH Evaluation done in a serious scientific way; the term is popular amongst supporters of the social science model of evaluation.

EVALUATION-SPECIFIC METHODOLOGY Much of the methodology used in evaluation studies is derived from other disciplines; the special nature of evaluation is the way in which it synthesizes these into an appropriate overall perspective, and brings them to bear on the various kinds of evaluation tasks. But there are some situations where essential variations on the usual procedures in scientific research become appropriate. Two instances will be mentioned. In survey research, sample size is normally predetermined in the light of statistical considerations and prior evidence about population parameters. In evaluation, although there are occasions when a survey of the classical kind is appropriate, surveys are frequently investigatory rather than descriptive, and then the situation is rather different. Suppose that a respondent, in a phone interview evaluation survey of users of a particular service, comes up with a wholly unexpected comment on the service which suggests, let us say, improper behavior by the service-providers. (It might equally well suggest an unexpected and highly beneficial side-effect.) This respondent is the thirtieth interviewee, from a planned sample of a hundred. On the standard survey pattern, one would continue, using the same interview form, through the rest of the sample. In evaluations, one will quite often want to alter the form so as to include an explicit question on this point. Of course, one can no longer report the results of the survey with a sample of a hundred, with respect to this question (and any others with which its presence might interact). But one may very well be able to turn up another twenty people who respond under cueing, who would not have produced this as a free response. That result is much more important than salvaging the survey, in most cases. It also points to another feature of the evaluation situation, namely the desirability of time-sequencing the interviews or questionnaire responses. Hence one should try to avoid using a single mass mailing, a common practice in survey research; by using sequential mailing one can examine the responses for possible modifications of the instrument. The second taboo that we may have good reason to break concerns sample size. If we find ourselves getting a very highly standard kind of response to a fairly elaborate questionnaire, we are discovering that the population has less variability than we had expected, and we should alter our estimate of an appropriate sample size in mid-stream. No point in continuing to fish in the same waters if you don't get a bite after an hour. The generalization of this point is to the use of "emergent," "cascading," or "rolling" designs, where the whole design is varied en route as appropriate. (These terms come from the glossary in Evaluation Standards.) Other evaluation-specific methodology includes the use of parallel teams working independently, calibration of judges, convergence sessions, "blind" judges, synthesis, bias balancing, etc. See also Anonymity, Questionnaires.
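
A minimal sketch of the sample-size point: process responses in sequential batches and stop early once the running estimate is already precise enough. The batch size, precision target, and response model are all invented for illustration:

    import random
    import statistics

    def sequential_survey(respond, planned=100, batch=10, precision=0.5):
        responses = []
        while len(responses) < planned:
            responses += [respond() for _ in range(batch)]
            se = statistics.stdev(responses) / len(responses) ** 0.5
            if len(responses) >= 2 * batch and se < precision:
                break  # population less variable than expected
        return statistics.mean(responses), len(responses)

    random.seed(0)
    mean, n = sequential_survey(lambda: random.gauss(4.0, 1.0))
    print(mean, n)  # stops well short of the planned hundred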

EVALUATION SKILLS There are lists of desirable skills for evaluators (Stufflebeam has one with 234 competencies); as for philosophers, almost any kind of specialized knowledge is advantageous; and the more obvious tool skills alone (see the Key Evaluation Checklist) are far more demanding than in any other discipline: statistics, cost-analysis, ethical analysis, management, teaching, therapy, contract law, graphics, synthesis, dissemination (for the report); and of course there are the evaluation-specific techniques. Here we mention a couple that are less obvious. First, the evaluative attitude or temperament. Unless you are committed to the search for quality, as the best of those in other professions are committed to the search for justice or the search for truth, you are in the wrong game. You will be too easily tempted by the charms of "winning" (e.g. joining the program staff; see Going Native); too unhappy with the outsider's role. The virtue of evaluation must be its own real reward, for the slings and arrows are very real. (Incidentally, this value is a learnable and probably even a teachable characteristic for many people; but some people come by it naturally and others will never acquire it.) The second package of relatively unproclaimed skills are "practical logical analysis" skills, e.g. identifying hidden agendas or unnoticed assumptions about the dissemination process, or mismatches between a goal-statement and a needs-statement, or loopholes in an evaluation design; the ability to provide accurate summaries one fiftieth of the length of the original (precis), or to give a totally non-evaluative, non-interpretive description of a program or treatment. The good news is that no one is good at all the above; there is room for specialists, and also for team members. Partly because of the formidable nature of the relevant skills list, evaluation is a field where teams, if properly employed, are immensely better than soloists. Not only are two heads better than one, six (carefully chosen and appropriately instructed) are better than five.

EVALUATION STANDARDS A set of principles for the guidance of evaluators and their clients. The major effort is the Evaluation Standards (ed. D. Stufflebeam, McGraw-Hill, 1980), but the Evaluation Research Society has also produced a set. There are some shared weaknesses (for example, neither includes needs assessments), but the former is much more explicit about interpretation, giving specific examples of applications etc. In general, these are likely to do good by raising clients' consciousness and general performance, but fears have been expressed by first-rank evaluators that they may rigidify approaches, stifle research, increase costs (cf. "defensive lab tests" in medical practice today), and give a false impression of sophistication. See also Bias.

EVALUATION, THEORY OF The theory of evaluation includes a wide range of topics from the logic of evaluative discourse, general accounts of the nature of evaluation and how it can be justified (axiology), through socio-political theories of its role in particular types of environment, to so-called "models" which are often simply conceptualizations of or procedural recommendations for evaluation. Little work is funded on this; a notable exception is NIE's Research on Evaluation project at NWL, a series of studies on radically different "metaphors" for evaluation.

EVALUATION TRAINING There is essentially no serious support for this at the moment, despite the large demand (and larger need) for trained evaluators, perhaps a sign of evaluation backlash. The best places are probably CIRCE and the Evaluation Center at Western Michigan, with post-doc work at North West Labs. Short courses are more widely available and advertised in Evaluation News. See also Training of Evaluators.

EVALUEE A person being evaluated; the more general term, which covers products and programs, etc., is "evaluand."

EXECUTIVE SUMMARY Abstract of results from an evaluation, in non-technical language.

EXIT INTERVIEWS Interviews with subjects as they leave the (e.g.) training program, clinic, etc., to obtain factual and judgmental data. A very good time for these, with respect to course or teaching evaluation in the school or college setting, is at the time of graduation, when (a) the student will have some perspective on most of the educational experience; (b) fear of retribution is low; (c) response rate can be nearly 100% with careful planning; (d) judgments of effects are relatively uncomplicated. Later than this (alumni surveys) conditions can and do deteriorate, though there is a partial offset because job-relevance can be judged more accurately.

EXPERIMENT See True Experiment.

EXPERIMENTAL GROUP The group (or single person, etc.) that is receiving the treatment being studied.

EXPLANATION By contrast with evaluation, which identifies the value of something, explanation involves answering a Why or How question about it, or other type of request for understanding. Often explanation involves finding the cause of a phenomenon, rather than its effects (which is a major part of evaluation). When it is possible, without jeopardizing the main goals of what may be holistic summative evaluation, a good evaluation design tries to uncover micro-explanations (e.g. by identifying those components of the curriculum package which are producing the major part of the effects, and those which are having little effect). The first priority, however, is to resolve the evaluation issues (Is the package the best available? etc.), but too often the research orientation and training of evaluators leads them to do a poor job on evaluation because they got interested in explanation. The realization that the logical nature and investigatory demands of evaluation are quite different from those of explanation is as important as the corresponding realization with respect to prediction and explanation, which the neo-positivist philosophers of science still think are logically the same under the (temporal) skin.

EX POST FACTO DESIGN One where we identify a control group "after the fact," i.e., after the treatment has occurred. A very much weaker design than the true experiment, since there must have been something different about the subjects that got the treatment without being assigned to it, in order to explain why they got it, and that something means they're not the same as the control group, in some unknown respect that may be related to the treatment.

EXTERNAL (evaluator or evaluation) An external evaluator is someone who is at least not on the project or program regular staff, or, in the case of personnel evaluation, someone other than the individual being evaluated, or their staff. It is better if they are not even paid by the project or by any entity with a prior preference for the success or failure of the project. Where or to whom the external evaluator reports is what determines whether the evaluation is formative or summative, either of which may be done by external or by internal evaluators (contrary to the common view that external is for summative, internal for formative), and both of which should be done by both.

EXTERNAL VALIDITY By contrast with internal validity, this refers to the generalizability of the experimental/evaluation findings. Here the traps to avoid include failure to identify key environmental variables that happen to be constant throughout the experiment, decreased sensitivity of participants to treatment at posttest due to pretest, reactive effects of the experimental arrangement, or biased selection of participants that might affect the generalizability of the treatment's effect to non-participants, thus jeopardizing the external validity. (Ref. Experimental and Quasi-Experimental Designs for Research, D. T. Campbell and J. C. Stanley, Rand McNally & Co., Chicago, 1972.)

EXTRAPOLATE Infer conclusions about ranges of the variables beyond those measured. Cf. Interpolate.


FACE VALIDITY The apparent validity, typically of test items or of tests; there can be skilled and unskilled judgments of face validity, and highly skilled judgments which come pretty close to content validity, which does require systematic substantiation.

FADING Technique used in programmed texts, where a first answer is given completely, the next one in part with gaps, then with just a single cue, then called for without help. A key technique in training and calibrating evaluators.

FAULT TREE ANALYSIS (CAUSE TREE ANALYSIS) These terms emerged about 1965, originally in the literature of management science and sociology. They are sometimes used in a highly technical sense, but are useful in a straightforward sense. Basically, the model to which they refer is the trouble-shooting chart, often to be found in the pages of e.g. a Volkswagen manual. The branches in the tree identify possible causes of the fault (hence the terms "cause" and "fault" in the phrase), and this method of representation, with various refinements, is used as a device for management consultants, for management training, etc. Its main use in evaluation is as a basis for needs assessment.

FIELD INITIATED This refers to proposals or projects for the funding of grants or contracts that originate from workers in the field of study, rather than from a program announcement of the availability of funds by an agency for work in a certain area (which is known as "solicited" research or development).

FIELD TRIAL (OR FIELD TEST) A dry run of a test of a product/program, etc. Absolutely mandatory in any serious evaluation or development activity. It is essential that at least one true field trial should be done in circumstances and with a population that matches the targeted situation and population. Earlier ("hothouse") trials may not meet this standard, for convenience reasons, but the last one must. Unless run by external evaluators (very rare), there is a major risk of bias in the sample or conditions or content or interpretations used by the developer in the final field trials.

FILTER Someone who (or a computer which) removes identifying information from evaluative input, to preserve the anonymity of the respondent.

FISCAL EVALUATION The highly developed subfield that involves looking at the worth or probable worth of e.g. investments, programs, companies. See ROI, Payback, Time Discounting, Profit, etc.

FISHING Colloquialism for the exploratory (phase of) research; or for the true nature of large slices of serious (e.g. program) evaluation; or for visits to Washington in search of funding support.

FLOWCHART A graphic representation of the sequence of decisions, including contingent decisions, that is set up to guide the management of projects (or the design of computer programs), including evaluation projects. Usually looks like a sideways organization diagram, being a series of boxes and triangles ("activity blocks," etc.) connected by lines and symbols that indicate simultaneous or sequential activities/decision points, etc. A PERT chart is a special case.

FORMATIVE EVALUATION Formative evaluation is conducted during the development or improvement of a program or product (or person, etc.). It is an evaluation which is conducted for the in-house staff of the program and normally remains in-house; but it may be done by an internal or an external evaluator or (preferably) a combination. The distinction between formative and summative has been well summed up in a sentence of Bob Stake's: "When the cook tastes the soup, that's formative; when the guests taste the soup, that's summative."

FOUND DATA Data that already exists, prior to the evaluation; the contrast is with experimental data or test and measurement data.

FUGITIVE DOCUMENT One which is not published through the public channels as a book or journal article. Evaluation reports have often been of this kind. ERIC (Educational Resources Information Center) has picked up some of these, but since its criteria for selection are so variable and its selection so limited, time spent in searching it is all too often not cost-effective.

FUNDING (of evaluations) Done in many ways, but a common pattern is described here. The evaluation proposal may be "field-initiated," i.e. unsolicited, or sent in response to (a) a program announcement, (b) an RFP (Request for Proposal), (c) a direct request. Typically (a) results in a grant, (b) in a contract; the former identifies a general charge or mission (e.g. "to develop improved tests for early childhood affective dimensions") and the latter specifies more or less exactly what is to be done, e.g. how many cycles of field tests (and who is to be sampled, how large a sample is to be used, etc.), in a "Scope of Work." The legal difference is that the latter is enforceable for lack of performance, the former is (practically) not. But it scarcely makes sense to use contracts for research (since you usually can't foresee which way it will go), and it is rarely justifiable to use them for the very specific program evaluations required by law. Approach (c), "sole-sourcing," eliminates competitive bidding and can usually only be justified when only one contractor has much the best combination of relevant expertise or equipment or staff resources; but it is much faster, and it does avoid the common absurdity of 40 bidders, each spending 12K ($12,000) to write a proposal worth 300K to the winner. The wastage there (180K) comes out of overhead costs which are eventually paid by the taxpayer, or by bidders going broke because of foolish requirements. A good compromise is the two-tier system, all bidders submitting a two (or five or ten) page preliminary proposal, the best few then getting a small grant to develop a full proposal. Contracts may or may not have to be awarded to the lowest "qualified" bidder; qualification may involve financial resources, stability, prior performance, etc., as well as technical and management expertise. On big contracts there is usually a "bidders' conference" shortly after publication of the RFP (it's often required that federal agencies publish the RFP in the Commerce Business Daily and/or the Federal Register). Such a conference officially serves to clarify the RFP; it may in fact be a cross between a con job and a poker game. If you ask clever questions, others may (a) be scared off, (b) steal your approach, etc. The agency may be sniffing around for a "friendly" evaluator and the evaluators may be trying to look friendly but not so friendly as to reduce credibility, etc. Eventually, perhaps after a second bidders' conference, the most promising bidders will be asked for their Best and Final bid, and on this basis the agency selects one, probably using a (possibly anonymous) external review panel to lend credibility to the selection. After the first conference between the winner and the project officer (the agency's representative used to be called the monitor) it often turns out that the agency wants or can be persuaded to want something done that isn't clearly in the contract; the price will then be renegotiated. Or if the price was too low to get the job done (the RFP will often specify it in terms of "Level of Effort" as N "person-years" of work; this may mean N x 30K or N x 50K in dollar terms, depending on whether overhead is an add-on), the contractor may just go ahead till they run out of money and then ask for more, on the grounds the agency will have sunk so much in and be so irreversibly committed (time-wise) that they have to come through to "save their investment." The contractor of course loses credibility on later bids but that's better than bankruptcy; and the track-records are so badly kept that no one may hold it against them (if indeed they should). In the bad old days, low bids were a facade and renegotiation on trumped-up grounds would often lead to a cost well above that of another and better bidder. Since evaluations are tricky to do in many ways, bidders have to allow a pad in their budget for contingencies, or just cross their fingers, which quickly leads to bankruptcy. Hence another option is to RFP for the best design and per diem and then let the contract for as long as it takes to do it. The form of abuse associated with this cost-plus approach is that the contractor is motivated to string it out. So no overall clear saving is attached to either approach; but the latter is still used where the agency wants to be able to change targets as preliminary results come in (a sensible point), and where it has good monitoring staff to prevent excessive over-runs (from estimates which of course are not binding). A major weakness in all of these approaches is that innovative proposals will often fail because the agency has appointed a review panel of people committed to the traditional approaches who naturally tend to fund "one of their own." Another major weakness is the complexity of all this, which means that big organizations who can afford to open branches in D.C., pay professional proposal-writers and "liaison staff" (i.e., lobbyists), have a tremendous edge (but often do poor work, since most of the best people do no work for them). A third key weakness is that the system described favors the production of timely paper rather than the solution of problems, since that's all the monitoring and managing process can identify. Billions of dollars, millions of jobs, thousands of lives are wasted because we have no reward system for really good work, that produces really important solutions. The reward is for the proposal, not the product; and it is the contract. Once obtained, only unreliability in delivery or gross negligence jeopardizes future awards. You can see the value system this arrangement produces from the way the vice-presidents all move on to work on the next "presentation" as soon as negotiation is complete. It would only cost pennies to reverse this via (partial) contingency awards and expert panels to review work done instead of proposals.

FUTURISM Since many evaluands are designed to serve future populations and not (just) present ones, much evaluation requires estimating future needs and performance. The simpler aspect of this task involves extrapolation of demographic data; even this is poorly done, e.g. the crunch on higher education enrolments was only foreseen by one analyst (Cartter) although the inference was simple enough. The harder task is predicting e.g. vocational patterns twenty years ahead. Here one must fall back on possibility-covering techniques, rather than probability-selection, e.g. by teaching flexibility of attitude or generalizable skills.

GOAL The technical sense of this term restricts its use to rather general descriptions of intended outcome; more specific descriptions are referred to as objectives.

GOAL-ACHIEVEMENT MODEL (of evaluation) The idea that the merit of the program (or person) is to be equated with success in achieving a stated goal. This is the most naive version of goal-based evaluation.

GOAL-BASED EVALUATION (GBE) This type of evaluation is based and focused on knowledge of the goals and objectives of the program, person or product. A goal-based evaluation often does not question the merit of goals; often does not look at cost-effectiveness; often fails to search for or locate the appropriate critical competitors; often does not search for side effects; in short, often does not include a number of important and necessary components of an evaluation. Even if it does include these components, they are referenced to the program's (or to personal) goals and hence run into serious problems such as identifying these goals, handling inconsistencies in them and changes in them over time, dealing with shortfall and overrun results and avoiding the perceptual bias of knowing about them. GBE is manager-oriented evaluation, close to monitoring and far from consumer-oriented evaluation. (See GFE.)

GOAL-FREE EVALUATION (GFE) In this type of evaluation, the evaluator(s) is not told the purpose of the program but enters into the evaluation with the purpose of finding out what the program actually is doing, without detailed cueing as to what it is trying to do. If the program is doing what its stated goals and objectives say, then these achievements should show up (in observation of process and interviews with consumers (not staff)); if not, it is argued, they are irrelevant. Merit is determined by relating program achievements to the needs of the impacted population, rather than to the program (i.e., agency or citizenry or congressional or manager's) goals. It could thus be called "needs-based evaluation" or "consumer-oriented evaluation" by contrast with goal-based or manager-oriented evaluation. It does not substitute the evaluator's goals for the program's goals, nor the goals of the consumer; the evaluation must justify (via the needs assessment) all assignments of merit. GFE is generally disliked by both managers/administrators and evaluators, for fairly obvious reasons. It is said to be less intrusive than GBE, more adaptable to mid-stream goal shifts, better at finding side effects and less prone to social, perceptual and cognitive bias. It is risky, because the client may get a nasty shock when the report comes in (no prior hand-holding) and refuse to pay because embarrassed at the prospect of having to pass the evaluation along to the funding agency. (But if the findings are invalid, the client should simply document this and ask for modifications.) GFE is reversible, a key advantage over GBE; hence an evaluation design should (sometimes) begin GFE, write a preliminary report, then go to GBE to see if serious errors of omission occurred. (Running a parallel GFE effort along with a GBE reduces the time-span.) The shock reaction to GFE in the area of program evaluation (it is the standard procedure used by all consumers evaluating products) suggests that the grip of management bias on program evaluation was very strong, and possibly that managers felt they had achieved considerable control over the outcomes of GBEs. GFE is analogous to double-blind design in medical research; even if the evaluator would like to give a favorable report (e.g. because of being paid by the program, or because hoping for future work from them) it is not (generally) easy to tell how to "cheat" under GFE conditions. The risk of failure by an evaluator is of course greater in GFEs, which is desirable since it increases effort, identifies incompetence, and improves the balance of power.

GOING NATIVE The fate of evaluators that get co-opted by the programs they are evaluating. (Term originated with the Experimental Schools Program evaluation in the mid-'60s.) The co-option was often entirely by choice and well illustrates the pressures on, temptations for, and hence the temperamental requirements for being a good evaluator. It can be a very lonely role and if you start thinking about it in the wrong way you start seeing yourself as a negative force, and who wouldn't rather be a co-author than a (mere) critic? One answer: someone who cares more about quality than kudos. See Evaluation Skills.

GRADE-EQUIVALENT SCORE A well-meant attempt to generate a meaningful index from the results of standardized testing. If a child has a 7.4 grade-equivalent score, that means s/he is scoring at the average level (estimated to be) achieved by students four months into the 7th grade. Use of the concept has often led to an unjustified worship of average scores as a reasonable standard for individuals, and to overlooking the raw scores which may tell a very different story. Suppose a beginning eighth grader is scoring at the 7.4 level; parents may be quite upset unless someone points out that on this particular test the 8.0 level is the same as the 7.4 level (because of summer backsliding). In reading, a deficit of two whole grade equivalents is quite often made up in a few months in junior high school if a teacher succeeds in motivating the student for the first time. Again, a student may be a whole grade-equivalent down and be ahead of most of the class, if the average score is calculated as the mean not the median. Again, a student in the fifth grade scoring 7.2 might flunk the seventh grade reading test completely; 7.2 just means that s/he scores where a seventh grader would score on the fifth grade test. A year's deficit from the 5th grade norm isn't comparable to a year's deficit from the 4th grade norm. And so on; i.e., use with caution.
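
The backsliding caveat in miniature, as a sketch; the norm figures below are invented to make the point that one raw score can sit at more than one grade-equivalent level:

    # Invented norms for one test: grade level -> average raw score.
    average_raw = {"7.4": 31, "8.0": 31, "8.5": 34}

    raw = 31
    levels = [g for g, avg in average_raw.items() if raw >= avg]
    print(levels)  # ['7.4', '8.0']: a "7.4 reader" is at the 8.0 average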

GRADING ("Rating" is sometimes used as a synonym.) Allocating individuals to an ordered (usually small) set of labeled categories, the order corresponding to merit, e.g. A-F for "letter grading." Those within a category are regarded as tied if the letter grade only is used; but if a numerical grade ("scoring") is also used, they may be ranked within grades. The use of plus and minus grades simply amounts to using more categories. Grading provides a partial ranking, but ranking cannot provide grading without a further assumption, e.g. that the best student is good enough for an A, or that "grading on the curve" is justified. That is, the grade labels normally have some independent meaning and cannot be treated as simply a sequenced set that can be distributed by making arbitrary cuts in a ranked sequence of individuals. In short, the grades are normally criterion-referenced and ranking is normally facilitated by norm-referenced testing; that tension frequently results in confusion. For example, grading of students does not imply the necessity for "beating" other students, does not need to engender "destructive competitiveness" as is often thought. Only publicized grading on a curve does that. Pass/Not Pass is a simple form of grading, not a no-grading system. Grades should be treated as estimates by an expert and thus constitute essential feedback to the learner or consumer; corrupting that feedback because the external society misuses the grades is abrogation of duty to the learner or consumer. See Responsibility Evaluation.

GRANT See Funding.

HALO EFFECT The tendency of someone's reaction to part of a stimulus (e.g. part of a test, part of a student's answers to a test, part of someone's personality) to spill over into their reaction to other, especially adjacent, parts of the same stimulus. For example, judges of exams involving several essay answers will tend to grade the second answer by a particular student higher if they graded the first one high than they would if this had been the first answer they had read by this student (the error is often as much as a full grade). Halo effect is avoided by having judges assess all the first components before they look at any of the second components, and by concealing from them their grade on the first component when they come to evaluate the second one. The halo effect gets its name from the tendency to suppose that someone who is saintly in one kind of situation must be saintly (and perhaps also clever) in all kinds of situations. But the halo effect also refers to the illicit transfer of a negative assessment. The Hartshorne & May work (Studies in Deceit, Columbia, 1928) suggests there is no good basis for this transfer.
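
A minimal sketch of the grading schedule just described: all first answers are judged (in a fresh random order) before any second answers, and the judge is never shown earlier grades. The judging function is a hypothetical stand-in:

    import random

    def grade_exams(papers, grade_one):
        # papers: student id -> list of essay answers
        # grade_one(question_index, answer) -> a grade (hypothetical)
        grades = {sid: [] for sid in papers}
        n_questions = len(next(iter(papers.values())))
        for q in range(n_questions):
            order = list(papers)
            random.shuffle(order)  # fresh order for each question
            for sid in order:
                grades[sid].append(grade_one(q, papers[sid][q]))
        return grades

    papers = {"s1": ["essay 1a", "essay 1b"], "s2": ["essay 2a", "essay 2b"]}
    print(grade_exams(papers, lambda q, a: len(a) % 5))  # dummy judge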

HARD vs. SOFT (approaches to evaluation) Colloquial way to refer to the differences between the quantitative/testing/measurement/experimental-design approach to evaluation and the descriptive/observational/narrative/ethnographic/participant-observer kind of approach.

HAWTHORNE EFFECT The tendency of a group or person being investigated, or experimented on, or evaluated to react positively or negatively to the fact that they are being investigated/evaluated, and hence to perform better (or worse) than they would in the absence of the investigation, thereby making it difficult to identify any effects due to the treatment itself. Not the same as the enthusiasm effect, i.e., the effect on the consumer of an enthusiastic service-provider that results simply from the enthusiasm of provider or recipient. The placebo effect is the medical analog of the enthusiasm effect.

HEADROOM See Ceiling Effect.

HIERARCHICAL SYSTEM See Two-Tier.

HOLISTIC SCORING/GRADING/EVALUATING The allocation of a single score/grade/evaluation to the overall performance of an evaluand; by contrast with analytical scoring/grading/evaluating. The holistic/analytical distinction corresponds to the macro/micro distinction in economics and the molar/molecular distinction in psychology.

HYPERCOGNITIVE or TRANSCOGNITIVE The domain beyond the supercognitive, which is the stratosphere of the cognitive; includes meditation and concentration skills; originality; the intellectual dimension of empathic insight (as evidenced in role-playing, acting, etc.); eidetic imaging; near-perfect objectivity, rationality, reasonableness, or "judgment" in the common parlance; moral sensitivity; ESP skills, etc. Some of this is incorrectly included in "affective education."

HYPOTHESIS TESTING The standard model of scientific research in the classical approach to the social sciences, in which a hypothesis is formulated prior to the design of the experiment, the design is arranged so as to test its truth, and the results come out in terms of a probability estimate that only chance was at work. If the probability is extremely low that only chance was at work, the design should make it inductively highly likely that the hypothesis being tested was correct. What is to count as the high degree of improbability that only chance was at work is usually taken to be either the .05 "level of significance" or the .01 "level of significance." When dealing with phenomena whose existence is in doubt, a more appropriate level is .001; where the occurrence of the phenomenon in this particular situation is all that is at stake, the conventional levels are more appropriate. The significance level is thus used as a crude index of the merit of a hypothesis.

An important distinction in hypothesis testing that does carry over to the evaluation context in a useful way is the distinction between Type 1 and Type 2 errors. A Type 1 error is involved when we conclude that the null hypothesis is false although it isn't; a Type 2 error is involved when we conclude that the null hypothesis is true when in fact it's false. Using a .05 significance level means that in about 5% of cases studied, we will make a Type 1 error. As we tighten up on our level of significance, we reduce the chance of a Type 1 error, but correspondingly increase the chance of a Type 2 error (and vice versa). It is a key part of evaluation to look carefully at the relative costs of Type 1 and Type 2 errors. (In evaluation, of course, the conclusion is about merit rather than truth.) A metaevaluation should carefully spell out the costs and benefits of the two kinds of error, and scrutinize the evaluation for its failure or success in taking account of these in the analysis, synthesis, and recording phases. For example, in quality control procedures in drug manufacture (a type of evaluation), it may be fatal to a prospective user to identify a drug sample as satisfactory when in fact it is not; on the other hand, identifying it as unsatisfactory when it is really satisfactory will only cost the manufacturer whatever that sample costs the manufacturer to make. Hence it is obviously in the interest of the public and the manufacturer (given the possibility of damage suits) to set up a system which minimizes the chance of false acceptances, even at the expense of a rather high level of false rejections. Because of the totally non-mnemonic characteristics of the terms "Type 1" and "Type 2," it's always better to use terms like "incorrect acceptance" and "incorrect rejection" of evaluands, rather than of the null hypothesis, the latter concept being likely to prove unenlightening to most audiences.
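
The drug-manufacture example can be made concrete with a little expected-cost arithmetic; the following sketch uses purely hypothetical figures:

def expected_cost(p_bad, alpha, beta, c_reject, c_accept):
    # alpha: chance of incorrectly rejecting a good batch
    # beta:  chance of incorrectly accepting a bad batch
    good_loss = (1 - p_bad) * alpha * c_reject
    bad_loss = p_bad * beta * c_accept
    return good_loss + bad_loss

# Strict rule: many false rejections, almost no false acceptances.
print(expected_cost(0.01, 0.20, 0.001, 1000, 5000000))   # 248.0
# Lenient rule: few false rejections, many false acceptances.
print(expected_cost(0.01, 0.01, 0.10, 1000, 5000000))    # 5009.9

Because the harm from a released bad batch dwarfs the cost of a wasted good one, the strict rule has the lower expected cost despite its high level of false rejections.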

ILLUMINATIVE EVALUATION (Rippey) A type of pure process evaluation, very heavy on multi-perspective description and interpersonal relations, very light on justified tough standards, very easy on valuephobes.

IMPACT EVALUATION An evaluation focussed on outcomes or pay-off, rather than on process, delivery, or implementation evaluation.

IMPACTED POPULATION The population that is crucial in evaluation, by contrast with the target population and even the true consumers.

IMPLEMENTATION EVALUATION Recent reactions to the generally unexciting results of impact evaluations on social action programs have included a shift to mere monitoring of program delivery, i.e. implementation evaluation. You can easily implement; it's harder to improve.

IMPLEMENTATION OF EVALUATIONS The frequent complaint that evaluations have little effect, i.e. are not implemented, refers to four quite different situations. (a) Many evaluations are simply incompetent, and it's most desirable they not be implemented; (b) Some evaluations make (and should make) no immediate recommendations (e.g. accountability evaluations); nevertheless they have a powerful preventive effect and some cumulative long-run effect, but neither is readily measurable; (c) Many evaluations are commissioned in such a way that even when done as well as possible they will not be of any use, because they were set up so as to be irrelevant to the real issues that affect the decision-maker, or are so underfunded that no sound answer can be obtained; again, it is just as well these not be implemented; (d) Some excellent evaluations are ignored because the decision-maker doesn't like (e.g. is threatened by) the results, or won't take on the risks or trouble of implementation. The lack-of-implementation phenomenon thus has little or large implications for the field of evaluation, depending entirely on the distribution of the causes across these four categories. It is hardly something to be unduly concerned about professionally as long as evaluation still has a long way to go in doing its own job well; doctors shouldn't worry that their patients ignore their advice if it's bad. But as a citizen one can scarcely avoid worry about the colossal wastage resulting from the fourth kind of situation; here's a fairly typical quote from the 8/1/80 GAO reports on their (usually very good) evaluations: "The Congress has an excellent opportunity to save billions of dollars by limiting the number of noncombat aircraft to those that can be adequately justified ... Dept. of Defense justifications [were] ... based on unrealistic data and without adequate consideration of more economical alternatives." GAO has been issuing reports on this topic since 1976 without noticeable effect so far.

IMPLEMENTATION OF TREATMENT The degree to which a treatment has been instantiated in a particular situation, typically a field trial of the treatment or an experimental investigation of it. The notion of an "index of implementation," consisting of a set of scales describing the key features of the treatment, and allowing one to measure the extent to which it is manifested in each dimension, is a useful one for checking on implementation, an absolutely fundamental check if we are to find out whether the treatment has merit. This is part of the "purely descriptive" effort in evaluation, and is handled under the description checkpoint and the process checkpoint of the Key Evaluation Checklist. One characteristic situation occurs when the description checkpoint provides a correct account of the treatment that is supposed to be implemented, and the process checkpoint provides a correct description of what is actually occurring; the match between the two is a measure of the implementation, and hence of the extent to which we can generalize from the results of the test to an evaluation of the evaluand which we are supposed to be evaluating.
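
A minimal sketch of such an index, assuming, purely for illustration, that each key feature of the treatment is rated 0 to 1 for how fully the observed process matches the intended description:

def implementation_index(intended, observed):
    # intended, observed: dicts mapping feature name -> 0..1 rating
    # (1 = feature fully present as specified); the index is the
    # mean per-feature match, i.e. 1 minus the average gap.
    gaps = [abs(intended[f] - observed[f]) for f in intended]
    return 1 - sum(gaps) / len(gaps)

print(implementation_index(
    {"daily drill": 1.0, "peer tutoring": 1.0, "weekly test": 1.0},
    {"daily drill": 0.9, "peer tutoring": 0.4, "weekly test": 1.0}))
# about 0.77: low enough to question generalizing from this trial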

INCESTUOUS RELATIONS (in evaluation) Refers to (a) extreme conflict of interest (where the evaluator is "in bed with" the program being evaluated), as is typical of ordinary program monitoring by agencies and foundations, where the monitor is usually the godfather (sic) of the program, sometimes its inventor, and nearly always its advocate at the agency, and a co-author of its modifications as well as, supposedly, its evaluator; (b) incestuous validation of test items occurs when they are selected/rejected on the basis of the correlation of performance on that item with overall score on the test. Many widely-used tests have lowered their construct validity by dumping face-valid items because of this. The correct procedure is to check for other errors (e.g. irrelevance, ambiguity), perhaps by external judge review, or rewriting the item(s), hoping the correlation won't hold up, because then you have tapped into an independent dimension of criterion performance.

INCREMENTAL NEED An unmet need.

INDEPENDENCE Independence is only a relative notion; but by increasing it, we can decrease certain types of bias. Thus, the external evaluator is somewhat more independent than the internal, the consulting medical specialist can provide a more "independent opinion" than the family physician, and so on. But of course both may share certain biases, and there is always the particular bias that the external or "second opinion" is typically hired by the internal one and is thus dependent upon the latter for this or later fees, a not inconsiderable source of bias. The more subtle social connections between members of the same profession, e.g. evaluators, are an ample basis for suspicion about the true independence of the second or meta-evaluator's opinion. The best approach is typically to use more than one second opinion and to sample as widely as possible in selecting these other evaluators, hoping from an inspection of their (independently written) reports to obtain a sense of the variation within the field, from which one can extrapolate to an estimate of probable errors.

INDEPENDENT VARIABLE See Dependent Variable.

INDICATOR A factor, variable, or observation that is empirically or definitionally connected with the criterion; a correlate. For example, the judgment by students that a course has been valuable to them for pre-professional training is a (weak) indicator of that value. Criteria, by contrast, are, or are definitionally connected with, the "criterion" (real pay-off) variable. Indicators thus include but are not limited to criteria. Constructed indicators are variables designed to reflect e.g. the health of the economy (a social indicator) or the effectiveness of a program. They, like course grades, are examples of the frequent need for concise evaluations even at the cost of some accuracy and reliability.

INFERENTIAL STATISTICS That part of statistics concerned with making inferences from characteristics of samples to characteristics of the population from which the sample comes, which of course can only be done with a certain degree of probability (cf. Descriptive Statistics). Significance tests and confidence intervals are devices for indicating the degree of risk involved in the inference (or "estimate"), but they only cover some dimensions of the risk. For example, they cannot measure the risk due to the presence of unusual and possibly relevant circumstances such as freakish weather, an incipient gas shortage, ESP, etc. Judgment thus enters into the final determination of the probability of the inferred condition.
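
As an illustration of such a device, the standard textbook confidence interval for a sample mean; the 1.96 multiplier assumes a normal approximation, and the scores are invented:

import math
import statistics

def mean_ci(sample, z=1.96):
    # 95% interval under a normal approximation; it covers sampling
    # error only, not the freakish circumstances noted above.
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return (m - z * se, m + z * se)

print(mean_ci([62, 71, 68, 74, 65, 70, 69, 73]))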

INITIATION-JUSTIFICATION BIAS See Consonance/Dissonance.

INFORMAL LOGIC Several evaluation theorists consider evaluation to be in some respects or ways a kind of persuasion or argumentation (notably Ernest House, in Evaluating with Validity, Sage, 1980). In terms of this view, it is relevant that there are new movements in logic, law and science which give more play to what have previously been dismissed as "merely psychological" factors, e.g. feelings, understanding, plausibility, credibility. The "informal logic movement" parallels that of the New Rhetoric and naturalistic methodology in the social sciences. Ref. Informal Logic, ed. Johnson and Blair, Edgepress, 1980.

INFORMED CONSENT The state which in conscious adults represents a good start toward discharging one's ethical obligations towards human subjects. The tough cases involve semi-rational semi-conscious semi-adults.

INSTITUTIONAL EVALUATION A complex evaluation, typically involving the evaluation of a set of programs provided by an institution plus an evaluation of the overall management, publicity, personnel policies and so on of the institution. The accreditation of schools and colleges is essentially institutional evaluation, though a very poor example of it. One of the key problems with institutional evaluation is whether to evaluate in terms of the mission of the institution or on some absolute basis. It seems obviously unfair to evaluate an institution against goals that it isn't trying to achieve; on the other hand, the mission statements are usually mostly rhetoric and virtually unusable for generating criteria of merit, and they are at least potentially subject to criticism, e.g. because of inappropriateness to the needs of the clientele, internal inconsistencies, impracticality with respect to the available resources, ethical impropriety, etc. So one must in fact evaluate the goals and the performance relative to these goals, or do goal-free evaluation. Institutional evaluation always involves more than the sum of the component evaluations; for example, a major defect in most universities is departmental dominance, with the attendant costs in rigidifying career tracks, virtually eliminating the role-model of the generalist, blocking new disciplines or programs, and preserving outdated ones (since in steady-state they have to come out of the department's budget), etc. Most evaluations of schools and colleges fail to consider these system features, which may be more important than any components.

INTERACTIVE (evaluation) One in which the evaluees have the opportunity to react to the content of a first draft of an evaluative report, which is reworked in the light of any valid criticisms or additions. A desirable approach whenever feasible, as long as the evaluator has the courage to make the appropriate criticisms and stick to them unless they are repudiated. Very few have, as one can see by looking at site-visit or personnel reports that are not confidential, by comparison with those that are, e.g. verbal supplements by the site visitors.

INTERACTION Two factors or variables interact if the effect of one, on the phenomenon being studied, depends on the magnitude of the other. For example, math education interacts with age, being more or less effective on children depending on their age; and it interacts with math achievement. There are plenty of interactions between variables governing human feelings, thought and behavior, but they are extremely difficult to pin down with any precision. The classic example is the search for aptitude-treatment or trait-treatment interactions in education; everyone knows from their own experience that they learn more from certain teaching styles than from others, and that other people do not respond favorably to the same styles. Hence there's an interaction between the teaching style (treatment) and the learning style (aptitude) with regard to learning. But, despite all our technical armamentarium of tests and measuring instruments, we have virtually no solid results as to the size or even the circumstances under which these ATI's occur. (Ref: The Aptitude-Achievement Distinction, ed. D. R. Green, McGraw Hill, 1974.)

INTERNAL Internal evaluators (or evaluations) are (done by) project staff, even if they are special evaluation staff, i.e., even if they are external to the production/writing/teaching/service part of the project. Usually, internal evaluation is part of the formative evaluation effort, but long-term projects have often had special summative evaluators on their staff, despite the low credibility (and probably low validity) that results. Internal/external is really a difference of degree rather than kind; see Independence.

INTERNAL VALIDITY The kind of validity of an evaluation or experimental design that answers the question: "Does the design prove what it's supposed to prove about the treatment on the subjects actually studied?" (cf. External Validity). In particular, does it prove that the treatment produced the effect in the experimental subjects? Relates to the CAUSATION checkpoint in the Key Evaluation Checklist. Common threats to internal validity include poor instruments, participant maturation, spontaneous change, or assignment bias. (Ref. Experimental and Quasi-Experimental Designs for Research, D. T. Campbell and J. C. Stanley, Rand McNally & Co., Chicago, 1972.)

INTEROCULAR DIFFERENCES Fred Mosteller, the great practical statistician, is fond of saying that he's not interested in statistically significant differences, but only in interocular ones, those that hit you between the eyes. Or that's what he's said to be fond of saying.

INTERPOLATE Infer to conclusions about values of the variables within the range sampled. Cf. Extrapolate.

INTERRUPTED TIME SERIES A type of quasi-experimental design in which the treatment is applied and then withheld in a certain pattern, to the same subjects. The somewhat ambiguous term "self-controlled" used to be used for such cases, since the control group is the same as the experimental group. The simplest version is of course the "aspirin for a headache" design; if the headache goes away, we credit the aspirin. On the other hand, "psychotherapy for a neurosis" provides a weak inference, because the length of the treatment is so great that the chance of the neurosis ending during that interval for reasons other than the psychotherapy is very significant. (Hence short-term psychotherapy is a better bet, ceteris paribus.) The next fancier self-controlled design is the so-called "ABBA" design, where A is the treatment, B the absence of it, or another treatment. Measurements are made at the beginning of each labeled period and at the end. Here we may be able to control for the spontaneous remission possibility and sundry interaction effects. This is quite a good design for experiments on supportive or incremental treatments; e.g. we teach 50 words of vocabulary by method A, then 50 more by method B, and, to eliminate the possibility that B only works when it follows A, we now reverse the order and apply B first, and then A. The classic fallacy in this area is probably that of the Governor of Connecticut who introduced automatic license suspension for the first speeding violation and got a very large reduction in the highway fatality rate immediately, about which he crowed a good deal. But a look at the variability of the fatality rate in previous years would have made a statistician nervous, and sure enough, it soon swung up again in its fairly random way. (Ref. Interrupted Time Series Designs, Glass, et al., University of Colorado.)
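
A toy version of the ABBA logic, with invented gain scores; the point is only that averaging each method's gains across both orders cancels a pure order effect:

# Hypothetical gains (words learned) in each period of an ABBA
# vocabulary experiment: A = new method, B = comparison method.
periods = ["A", "B", "B", "A"]
gains = [42, 35, 33, 40]

def mean_gain(label):
    g = [x for p, x in zip(periods, gains) if p == label]
    return sum(g) / len(g)

# Each method appears once early and once late, so a pure order
# effect (e.g. fatigue or practice) contributes equally to both.
print(mean_gain("A") - mean_gain("B"))   # 7.0, the A-over-B estimate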

JOB ANALYSIS A breakdown of a job into functional components, often necessary in order to provide remedial recommendations and a framework for micro-evaluation or needs assessment. Job analysis is a highly skilled task which, like programming, is usually done badly by those hired to do it, because of the failure of the pay scale to reflect the pay-offs from doing it well.

JOHN HENRY EFFECT (Gary Saretsky's term) The correlative effect to, or in an extended sense a special case of, the Hawthorne effect, i.e., the tendency of the control group to behave differently just because of the realization that they are the control group. For example, a control group of teachers using the traditional math program that is being run against an experimental program may, upon realizing that the honor of defending tradition lies upon them, perform much better during the period of the investigation than they would have otherwise, thus yielding an artificial result. One cannot of course assume that the Hawthorne effect (on the experimental group) cancels out the John Henry effect.

JUDGMENT It is not accidental, though it was erroneous, that the term "value judgment" came to be thought of as the paradigm of evaluative claims; judgment is a very common part of evaluation, as it is of all serious scientific inference. The function of the discipline of evaluation can be seen as largely a matter of reducing the element of judgment in evaluation, or reducing the element of arbitrariness in the necessary judgments, e.g. by reducing the sources of bias in the judges, e.g. by using double-blind designs, teams, parallel teams, convergence sessions, calibration training, etc. The most important fact about judgment is not that it isn't as objective as measurement, but that one can distinguish good judgment from bad judgment (and train good judges).

JUDICIAL OR JURISPRUDENTIAL MODEL (of evaluation) Wolf's preferred term, and a term sometimes used for his version or, rather, extension of advocate-adversary evaluation. He emphasizes that the law as a metaphor for evaluation involves much more than an adversarial debate, e.g. the fact-finding phase, cross-examination, evidentiary and procedural rules, etc. It involves a kind of inquiry process that is markedly different from the social scientific one, one that in several ways is tailored more to the needs of evaluation (the action-related decision, the obligatory simplifications because of time, budget and audience limitations, the dependence on a particular judge and jury, the fate of individuals at stake, etc.). Wolf sees the educational role of the judicial process (teaching the jury the rules of just inquiry) as a key feature of the judicial model, and it is certainly a strong analogy with evaluation.

JURY TRIAL Used in TA and evaluation. See preceding entry.

KEY EVALUATION CHECKLIST (KEC) What follows is not intended to be a full explanation of the key evaluation checklist and its application, something which would be more appropriate for a monograph on the methodology of evaluation. It simply serves to identify the many dimensions that must be explored prior to the final synthesis in an evaluation. The most important of these are given italicized headings in the checklist, but all are usually very important. A few words are given to indicate the sense in which each of the headings is intended, the headings themselves being kept very short in order to make them usable as mnemonics; some are expanded elsewhere in the Thesaurus.

The purpose of exhibiting the KEC here is partly to make the point that evaluation is an extremely complicated discipline, what one might call a multi-discipline. It cannot be seen as a straightforward application of standard methods in the traditional social science repertoire. In fact only seven of the eighteen checkpoints are seriously addressed in that traditional repertoire, and in most cases not very well addressed as far as evaluation needs are concerned.

1. DESCRIPTION. What is to be evaluated? The evaluand, described as objectively as possible. Does it have components? What are their relationships?

2. CLIENT. Who is commissioning the evaluation? The client for the evaluation, who may or may not be the initiator of the request for the evaluation; and may or may not be the instigator of the evaluand, e.g. its manufacturer or funding agency or legislative godparent; and may or may not be its inventor, e.g. designer of a product or program.

3. BACKGROUND & CONTEXT of (a) the evaluand and (b) the evaluation. Includes identification of stakeholders (such as non-clients listed in 2, the monitor, community representatives, etc.); believed nature of the evaluand; expectations from the evaluation; desired type of evaluation (formative vs. summative vs. ritualistic, holistic vs. analytical); reporting system; organization charts; prior efforts, etc.

4. RESOURCES ("Support System" or "Strengths Assessment") (a) available to or for use of the evaluand; (b) available to or for use of the evaluators. These are not what is used up, e.g. purchase or maintenance, but what could be. They include money, expertise, past experience, technology, and flexibility considerations. These define the range of feasibility.

5. FUNCTION. What does the evaluand do? Distinguish what it is supposed to do (intended or alleged function or role) from what it in fact does (actual function(s)), both for the client and the consumer; both could be covered under Description but it's usually best to treat them separately. Are there obvious dimensions or aspects or components of these functions?

6. DELIVERY SYSTEM. How does the evaluand reach the market? How is it maintained (serviced)? How improved (updated)? How are users trained? How is implementation achieved/monitored/improved? Who does all this?

7. CONSUMER. Who is using or receiving the (effects of the) evaluand? Distinguish targeted populations of consumers (the intended market) from actually and potentially directly impacted populations of consumers (the "true market" or customers or recipients, or clients for the evaluand, often called the clientele); these should be distinguished from the total directly or indirectly impacted recipient population which makes up the "true consumers."

Note that the instigator, etc. (see 2 and 3) are also impacted, e.g. by having a job, but this does not make them consumers, in the usual sense. We should, however, consider them when looking at total effects and can describe them as part of the affected, impacted or involved group.

8. NEEDS & VALUES of the impacted and potentially impacted population. This will include wants as well as needs; and also values such as judged or believed standards of merit and ideals (cf. 9); the defined goals of the program where a goal-based evaluation is undertaken; and the needs etc. of the instigator, monitor, inventor etc., since they are indirectly impacted. The relative importance of these often conflicting considerations will depend upon ethical and functional considerations.

9. STANDARDS. Are there any pre-existing objectively validated standards of merit or worth that apply? Can any be inferred from CLIENT plus CONSUMER, FUNCTION and NEEDS/VALUES? (This will include appropriate ideals, cf. the felt ideals in 8.) If goals are being considered, and if they can be validated as appropriate (e.g., from a needs assessment) and legal/ethical etc., they would graduate from being recorded in 8 to being accepted, as one relevant standard, in 9.

10. PROCESS. What constraints/costs/benefits apply to the normal operation of the evaluand (not to its effects or OUTCOMES (11))? In particular, legal/ethical-moral/political/managerial/aesthetic/hedonic/scientific? One managerial process constraint of special significance concerns the "degree of implementation," i.e., the extent to which the actual operation matches the program stipulations or sponsor's beliefs about its operation. One scientific process consideration would be the use of scientifically validated process indicators of eventual outcomes; another would be the use of scientifically (historically etc.) sound material in a textbook/course. One ethical issue would involve the relative weighting of the importance of meeting the needs of needy target-population people and the career or status needs of other impacted-population people, e.g. the program staff.

11. OUTCOMES. What effects are produced by the evaluand? (Intended or unintended.) A matrix of effects is useful: population affected x type of effect (cognitive/affective/psychomotor/health/social/environmental) x size of each x time of onset (immediate/end of "treatment"/later) x duration x each component or dimension (if analytical evaluation is required). For some purposes, the intended effects should be separated from the unintended (e.g. program monitoring, legal accountability); for others, the distinction should not be made (consumer-oriented summative product evaluation).

12. GENERALIZABILITY to other people/places/times/versions. ("People" means staff as well as recipients.) These can be labeled Deliverability and Saleability/Exportability/Durability/Modifiability.

13. COSTS. Dollar vs. Psychological vs. Personnel; Initial vs. Repeated (including Preparation-Maintenance-Improvement); Direct/Indirect vs. Immediate/Delayed/Discounted; by components if appropriate.

14. COMPARISONS with alternative options, including options recognized and unrecognized, those now available and those constructable; the leading contenders in this field are the "critical competitors," and are identified on cost plus effectiveness grounds. They normally include those that produce similar or better effects for less cost, and better effects for a manageable (RESOURCES) extra cost.

15. SIGNIFICANCE. A synthesis of all the above. The validation of the synthesizing procedure is often one of the most difficult tasks in evaluation. It cannot normally be left to the client, who is usually ill-equipped by experience or objectivity to do it; and the formula approaches of e.g. cost-benefit calculations are only rarely adequate. "Flexible weighted-sum with overrides" is often useful.

16. RECOMMENDATIONS. These may or may not be requested, and may or may not follow from the evaluation; even if requested, it may not be feasible to provide any, because the only type that would be appropriate are not such that any scientific evidence for specific ones is available in the relevant field of research. (RESOURCES available for the evaluation are crucial here.)

17. REPORT. Vocabulary, length, format, medium, time, location, and personnel for its (or their) presentation need careful scrutiny, as does protection/privacy/publicity and prior screening or circulation of final and preliminary drafts.

18. METAEVALUATION. The evaluation must be evaluated, preferably prior to (a) implementation, (b) final dissemination of the report. External evaluation is desirable, but first the primary evaluator should apply the Key Evaluation Checklist to the evaluation itself. Results of the metaevaluation should be used formatively, but may also be incorporated in the report or otherwise conveyed (summatively) to the client and other appropriate audiences. ("Audiences" emerge at metacheckpoint 7, since they are the "Market" and "Consumers" of the evaluation.)
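
Readers who want the eighteen checkpoints as a working scaffold for a report or metaevaluation can code them directly; a minimal sketch follows (the data structure and wording of the skeleton are arbitrary choices, not part of the KEC itself):

KEC = ["Description", "Client", "Background & Context", "Resources",
       "Function", "Delivery System", "Consumer", "Needs & Values",
       "Standards", "Process", "Outcomes", "Generalizability",
       "Costs", "Comparisons", "Significance", "Recommendations",
       "Report", "Metaevaluation"]

def report_skeleton(evaluand):
    # One numbered section per checkpoint, as in the entry above.
    for i, point in enumerate(KEC, start=1):
        print(f"{i}. {point.upper()}: findings for {evaluand} ...")

report_skeleton("the evaluand under study")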

KILL THE MESSENGER (phenomenon) The tendency to punish the bearer of bad tidings. One aspect of valuephobia. Much of the current attack on testing is pure KTM, like many of the elaborately rationalized earlier attacks on course grades. The presence of the rationalizations (in both cases) identifies these as examples of a sub-species: Kill the Messenger (After a Fair Trial, of course).

LAISSEZ FAIRE (evaluation) "Let the facts speak for themselves." But do they? What do they say? Do they say the same thing to different listeners? Once in a while this approach is justified, but usually it's simply a cop-out, a refusal to do the hard professional task of synthesis and its justification. The laissez-faire approach is attractive to valuephobes, and to anyone else when the results are going to be controversial. The major risk in the naturalistic approach is sliding into laissez-faire evaluation, i.e., to put it slightly tendentiously, no evaluation at all.

LEARNER VERIFICATION A phrase of Ken Komoski's, president of EPIE; refers to the process of (a) establishing that educational products actually work with the intended audience, and (b) systematically improving them in the light of the results of field tests. Now required by law in e.g. Florida and being considered for that status elsewhere. The first response of publishers was to submit letters from teachers testifying that the materials worked. This is not the R&D process that the term refers to. Some of the early programmed texts were good examples of learner verification. Of course, it's costly, but so are four-color plates and glossy paper. It simply represents the application to educational products of the procedures of quality control and development without which other consumer goods are illegal or dysfunctional.

LEVEL OF EFFORT Level of effort is normally specified in terms of person-years of work, but on a small project might be specified in terms of person-months. It refers to the amount of direct "labor" that will be required, and it is presumed that the labor will be of the appropriate professional level; subsidiary help such as clerical and janitorial is either budgeted independently or regarded as part of the support cost, that is, included in a professional person-year of work. Person-years (originally man-years) is the normal unit for specifying level of effort. RFPs will often not describe the maximum sum in dollars that is countenanced for the proposal, but may instead specify it in terms of person-years. Various translations of a person-year unit into dollars are used; this will depend on the agency, the level of professionalism required, whether or not overhead and clerical support is separately specified, etc. Figures from $30,000 to over $50,000 per person-year are used at times.
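
The translation is simple arithmetic; an illustrative calculation, with the $40,000 rate an assumed figure within the range just quoted:

# An RFP specifying 2.5 person-years, at an assumed loaded rate
# of $40,000 per person-year, implies a dollar ceiling of:
person_years = 2.5
rate = 40000
print(person_years * rate)   # 100000.0, i.e. $100,000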

LICENSING (of evaluators) See Evaluation Registry.

LITERARY CRITICISM The evaluation of works of literature; in many ways an illuminating model for evaluation, and a good corrective for the emphases of the social science model. Various attempts have been made to "tighten up" literary criticism, of which the New Criticism movement is perhaps the best known, but they all involve rather blatant and unjustified preferences of their own (i.e., biases), exactly what they were alleged to avoid. The time is ripe to try again, using what we now know about sensory evaluation, and perhaps responsive and illuminative evaluation, to remind us of how to objectify the objectifiable while clarifying the essentially subjective. Conversely, a good deal can be learnt from a study of the efforts of F. R. Leavis (the doyen of the New Critics) and T. S. Eliot in his critical essays to precisify and objectify criticism. His view that "comparison and analysis are the chief tools of the critic" (Eliot, 1932), and even more his practice of displaying very specific and carefully chosen passages to make points, would find favor with the responsive evaluators today. Ezra Pound and Leavis went even further towards exhibiting the concrete instance (rather than the general principle) to make a point. This idiographic, anti-nomothetic approach is not, contrary to much popular philosophy of science, anti-scientific as such; but in practice it failed to avoid various style or process biases, and too often (e.g. with Empson) became precious at the expense of logic. One can no more forget the logic of plot or the limits of possibility in fiction than the logic of function and the limits of responsibility in program evaluation.

LOCUS OF CONTROL Popular "affective" variable, referring roughly to the location someone feels is appropriate for the center of power in the universe, on a scale from "inside me" to "far far away." A typical item might ask about the extent to which the subject feels s/he controls their own destiny. In fact, this is often a simple test of knowledge about reality and not affective (depending on how much stress is put on the feeling part of the item), and where it is affective, the affect may be judged as appropriate or inappropriate. So these items are usually misinterpreted, e.g. by taking any movement towards internalization of locus of control as a gain, whereas it may be a sign of loss of contact with reality.

LONGITUDINAL STUDY An investigation in which a particular individual or group of individuals is followed over a substantial period of time, in order to discover changes due to the influence of an evaluand, or maturation, or environment. The contrast is with a cross-sectional study. Theoretically, a longitudinal study could also be an experimental study, but none of those done on the effect of smoking on lung cancer are of this kind, although the results are almost as solid. In the human services area, it is very likely that longitudinal studies will be uncontrolled, certainly not experimentally controlled.

LONGTERM EFFECTS In many cases, it is important to examine the effects of the program or product after an extended period of time; often this is the only worthwhile criterion. Bureaucratic arrangements such as the difficulty of carrying funds over from one fiscal year to the next often make investigation of these effects virtually impossible. "Longitudinal studies," where one group is "followed up" over a long period, are more commonly recognized as standard procedure in the medical and drug areas; an important example in education is the PROJECT TALENT study, now in its third decade. See Overlearning.

MAINTENANCE NEED A met but continuing need.

MAN-YEARS (properly, person-years) See Level of Effort.

MARKET The market checkpoint on the Key Evaluation Checklist refers to the disseminability of the product or program. Many needed products, especially educational ones, are unsaleable by available means. It is only possible to argue for developing such products if there is a special, preferably tested, plan for getting them used. No delivery system, no market.

MASSAGING (the data) Irreverent term for (mostly) legitimate synthesis of the raw results.

MASTERY LEVEL The level of performance actually needed on a criterion. Focus on mastery level training does not accept anything less, and does not care about anything more. Closely tied to competency-based approaches. Represents one application of criterion-referenced testing.

MATCHING See Control Group.

MATERIALS (evaluation) See Product Evaluation.

MATRIX SAMPLING If you want to evaluate a new approach to preventive health care (or science education), you do not have to give a complete spectrum of tests (perhaps a total of ten) to all those allegedly affected, or even to a sample of them; you can perfectly well give one or two tests to each in the sample, taking care that each test does get given to a random sub-sample, and preferably that it is randomly associated with each of the others, if they are administered pairwise (in order to reduce any bias due to interactions between tests). This will involve (a) much less cost to you than full testing of the whole sample, (b) less strain on each subject, (c) some contact with each, by contrast with giving all tests to a smaller sample, (d) ensuring that all of a larger pool of items get tried on some students. The cost to each testee is much reduced, and the range of testees and items tested is much greater, both likely to be beneficial. But, the trade-off: you will not be able to say much about each individual. You are only evaluating the treatment's overall value. A good example of the importance of getting the evaluation question clear before doing a design.
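
A sketch of the assignment step for a ten-test battery administered pairwise; dealing the tests out in shuffled rounds is one simple way (an assumption, not the only scheme) to give each test a roughly equal-sized random sub-sample:

import random

tests = ["test_%d" % i for i in range(1, 11)]   # a battery of ten

def assign_pairs(subjects, tests):
    # Deal the tests out in shuffled rounds, two per subject, so
    # each test reaches a roughly equal-sized random sub-sample.
    assignment, pool = {}, []
    for s in subjects:
        if len(pool) < 2:
            pool = random.sample(tests, len(tests))   # new round
        assignment[s] = [pool.pop(), pool.pop()]
    return assignment

print(assign_pairs(["S%d" % i for i in range(1, 8)], tests))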

MBO Management By Objectives, i.e. state what you're trying to do in language that will make it possible to tell whether you succeeded. Not bad as a guide to planning (though it tends to overrigidify the institution), but disastrous as a model for evaluation (though acceptable as one element in an evaluation design). See Goal-Based Evaluation.

MEAN (Stat.) (Cf. Median, Mode) The mean score on a test is that obtained by adding all the scores and dividing by the number of people taking it; one of the several exact senses of "average." The mean is, however, heavily affected by the scores of the top and bottom few in the class, and can thus be non-representative of the majority.

MEASUREMENT Determination of the magnitude of a quantity, not necessarily, though typically, on a criterion-referenced test scale, e.g. feeler gauges, or on a continuous numerical scale. There are various types of measurement scale, in the loose sense, ranging from ordinal (grading or ranking) to cardinal (numerical scoring). The standard scientific use refers to the latter only. Whatever is used to do the measurement, apart (usually) from the experimenter, is called the instrument. It may be a questionnaire or a test or an eye or a piece of apparatus. In certain contexts, we treat the observer as the instrument needing calibration or validation. Measurement is a common and sometimes large component of standardized evaluations, but a very small part of its logic, i.e. of the justification for the evaluative conclusions.

MEDIAN (Stat.) (Cf. Mean, Mode) The median performance on a test is that score which divides the group into two, as nearly as possible; the "middle" performance. It provides one exact sense for the ambiguous term "average." The median is not affected at all by the performance of the few students at the top and bottom of a class (cf. Mean). On the other hand, as with the mean, no one may score at or near the median, so that it doesn't identify a "most representative individual" in the way that the mode does. Scoring at the 50th percentile is (roughly) the same as having the median score, since 50% are below you and 50% above.

MEDIATED EVALUATION A more precise term for what is sometimes called (in a loose sense) process evaluation, meaning evaluation of something by looking at secondary indicators of merit, e.g. name of manufacturer, proportion of Ph.D.s on faculty, where someone went to college. The term "process evaluation" also refers to the direct check on e.g. ethicality of process.

MEDIATION (OR ARBITRATION) MODEL (of evaluation) Little attention has been paid to the interesting social role and skills of the mediator or arbitrator, which in several ways provide a model for the evaluator, e.g. the combination of distancing with considerable dependence upon reaching agreement, the role of logic and persuasion, of ingenuity and empathy.

MEDICAL MODEL (of evaluation) In Sam Messick's version (in the Encyclopedia of Educational Evaluation) the contrast is drawn between the engineering model and the medical model. The engineering model "focuses upon input-output differences, frequently in relation to cost." The medical model, on the other hand (which Messick favors), provides considerably more complex analysis, enough to justify: the treatment's generalization into other field settings; remediation suggestions; and side-effect predictions. The problem here is that we cross the boundaries between evaluation and general causal investigations, thereby diluting the distinctive features of evaluation and so expanding its scope as to make results extremely difficult to obtain. It seems more sensible to appreciate Consumer Reports for what it gives us, rather than complain that it fails to give us explanations of the underlying mechanisms in the products and services that it rates. Cf. Holistic and Analytic Evaluation.

MERIT (Cf. Worth) "Intrinsic" value, as opposed to extrinsic or system-based value/worth. For example, the merit of researchers lies in their skill and originality; their worth (to the institution that employs them) would include the income they generate.

META-ANALYSIS (Gene Glass) The name for a particular approach to synthesizing studies on a common topic, involving the calculation of a special parameter for each ("Effect Size"). Its promise is to pick up something of value from studies which do not meet the usual "minimum standards"; its danger is what is referred to in the computer programming field as the GIGO Principle: Garbage In, Garbage Out. While it is clear that a number of studies, none of which is statistically significant, can be integrated by a meta-analyst into a highly significant result (because the combined N is larger), it is not clear how invalid designs can be integrated. An excellent review of results and methods will be found in Evaluation in Education, Volume 4, No. 1, 1980, a special issue entitled "Research Integration: the State of the Art." Meta-analysis is a special approach to what is called the general problem of research (studies) integration or research synthesis, and this array of terms for it reflects the fact that it is an intellectual activity that lies between data synthesis on the one hand and the evaluation of research on the other. As Light points out (ibid.) there is a residual element of judgment involved at several places in meta-analysis, as in any research synthesis process; clarifying the basis for these judgments is a task for the evaluation methodologist, and Glass' efforts to do so have led to the burgeoning of a very fruitful area of (meta-)research.
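
A sketch of the effect-size calculation in its most common form, the mean difference expressed in control-group standard deviation units, applied to two invented studies:

import statistics

def effect_size(treatment, control):
    # Mean difference in control-group standard deviation units,
    # which puts studies using different tests on a common scale.
    return ((statistics.mean(treatment) - statistics.mean(control))
            / statistics.stdev(control))

# Two invented studies, then the simple average across them.
deltas = [effect_size([75, 80, 72, 78], [70, 74, 69, 71]),
          effect_size([58, 62, 65], [55, 57, 60])]
print(sum(deltas) / len(deltas))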

META-EVALUATION Meta-evaluation is the evaluation of evaluations, and hence typically involves using another evaluator to evaluate a proposed or completed evaluation. This practice puts the primary evaluator in a similar position to the evaluee; both are going to be evaluated on their performance. It can be done formatively or summatively. Reports should go to the original client, copy to the first-level evaluator for reaction. Meta-evaluation then gives the client independent evidence about the technical competence of the primary evaluator. No infinite regress is generated because extrapolation shows it doesn't pay after the first level or two; the first iteration is a professional obligation, the second a luxury. Meta-evaluation is, for evaluators, what psychoanalysis is for psychoanalysts. A dimensional approach might consider the credibility, utility ... of the evaluation. The Key Evaluation Checklist can be applied in two ways: indirectly, treating the (secondary) evaluation of the original evaluation as itself an evaluand, or directly, applying the checklist to the original evaluation as performed. The latter procedure considers the scientific merit of the evaluation process; the former requires us to look at e.g. the cost-effectiveness balance of the evaluation itself, and hence does something to shift the locus of power. It should, for example, include the differential costs of Type 1 and Type 2 errors in the evaluated evaluation. Evaluations must not, however, be evaluated only in terms of their actual consequences ... Besides the KEC, one might use e.g. Bob Gowin's QUEMAC approach ... see Consonance.

MINIMUM COMPETENCY TESTING A basic level of competency, success on a test of which has been tied to graduation, grade promotion, remedial allocation, failure, etc. With so much at stake, for students and (via teacher evaluation and program funding) for educators, this is a political issue and an ethical one ... Introduced with due warning and support, it represents honest schooling; done carelessly, it is a disaster. See Cutting Score.

MISSION BUDGETING A generalization of the notion of program budgeting (see PPBS); the idea is to develop a budgeting system which will answer questions of the type, "How much are we putting into such and such a mission?" (Contrast with the previous kinds of categories, e.g. Personnel, to which budget amounts were tied.) The motivation has been that a good many programs are not identifiable from the budget in terms of the missions they serve and the services they provide, so that we may have a very poor idea of how much we are putting into e.g. welfare or bilingual education by merely looking at agency budgets or even PPBS figures, unless we have an extremely clear picture (which decision makers rarely can have, especially a new Executive Cabinet) of the actual impacted populations and the level of service delivery from each of the programs. This concept, along with zero-based budgeting, was popular with the early Carter administration, but we hear little about it later in that regime, just as McNamara's introduction of PPBS (into DOD, from Ford Motor Company) under an earlier administration has faded considerably.

MODE (Stat.) (Cf. Mean, Median) The mode is the "most popular" (most frequent) score (or score interval). It's more likely that a student about whom you know nothing except their membership in this group scored the "modal" score of the group than any other score. But it may not be very likely; e.g. if every student gets a different score, except two who get 100 out of 100, then the mode is 100, but it's not very "typical." In a "normal" curve, on the other hand, like the (alleged) distribution of IQ scores in the U.S. population, the mean, the median, and the mode are all the same value, corresponding to the highest point of the curve. Some distributions, or curves representing them, are described as bi-modal, etc., which means that there are two (or more) peaks or modes; this is a looser sense of the term mode, but useful.
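
A quick check, with invented scores, of how the three exact senses of "average" diverge on a skewed distribution:

import statistics

scores = [55, 60, 62, 63, 63, 63, 100, 100]   # invented, skewed
print(statistics.mean(scores))     # 70.75 - pulled up by the two 100s
print(statistics.median(scores))   # 63.0  - unaffected by the extremes
print(statistics.mode(scores))     # 63    - the most frequent score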

MODELS (of evaluation) A term loosely used to refer to a conception or approach or sometimes even a method (naturalistic, goal-free) of doing evaluation. Models are to paradigms as hypotheses are to theories, which means less general and some overlaps. Referenced here are the following, frequently referred to as models: advocate-adversary, black box, connoisseurship, CIPP, discrepancy, engineering, judicial, medical, responsive, transactional and social science. The best classification of these and others (many have been attempted) is Stufflebeam and Webster's (forthcoming, 1981).

MODUS OPERANDI METHOD A procedure for identifying the cause of a certain effect by detailed analysis of the chain of events preceding it and of the ambient conditions; it is sometimes feasible when a control group is impossible, and it is useful as a check or strengthening of the design even when a control group is possible. The concept refers to the characteristic pattern of links in the causal chain which the criminalist refers to as the modus operandi of a criminal. These can be quantified and even configurally scored; the problem of identifying the cause can thus be converted into a pattern-recognition task for a computer. The strength of the approach is that it can be applied in individual cases, informally, semi-formally (as in criminalistics), and formally (full computerization). It also leads to MOM-oriented designs which deliberately employ "tracers," i.e. artefactual features of a treatment which will show up in the effects. An example would be the use of a particular sequence of items in a student questionnaire disseminated to faculty for instructional development use. (Details in a section by this title in Evaluation in Education, ed. W. J. Popham, McCutcheon, 1976.)

MONITOR The term "monitor" was the original term for what is now often called by an agency "the project officer," namely the person from the agency staff that is responsible for supervising progress and compliance on a particular contract or grant. "Monitor" was a much clearer term, since "project officer" could equally well refer to somebody whose responsibilities were to the project manager, or to somebody who merely handled the contract paper work (the "contract officer," as the fiscal agent at the agency is sometimes called). But it was apparently thought to have "Big Brother" connotations, or not to reflect adequately the full range of responsibilities, etc. See Monitoring.

MONITORING A monitor (of a project) is usually a representative of the funding agency who watches for proper use of funds, observes progress, provides information to the agency about the project and vice versa. Monitors badly need and rarely have evaluation skills; if they were all even semi-competent formative evaluators, their (at least quasi-) externality could make them extremely valuable, since many projects either lack evaluation staff, or have none worth having, or never supplement them with external evaluation. Monitors have a schizophrenic role which few learn to handle; they have to represent and defend the agency to the project and represent and defend the project to the agency. Can these roles be further complicated by an attempt at evaluation? They already include it, and the only question is whether it should be done reasonably well.

MOTIVATION The disposition of an organism or institution to expend effort in a particular direction. It is best measured by a study of behavior, since self-reports are intrinsically and contextually likely to be unreliable. Cf. Affect.

MOTIVATIONAL EVALUATION The deliberate use of evaluation as a management tool to alter motivation; can be content-dependent or content-independent. If the evaluation recommends a tie between raises and work output which is adopted, it may affect motivation; if it cuts the (supposed or actual) connection, it will be likely to have the opposite effect on motivation. But the mere announcement of an evaluation, even without its occurrence, and certainly the presence of an evaluator, can have very large (good or bad) effects on motivation, as experienced managers well know. Evaluators, on the other hand, are prone to suppose that the contents of their reports are what counts, and tend to forget the reactive effects, while they would be the first to suspect the Hawthorne effect in a study done by someone else.

MULTIPLE-TIER See Two-Tier.

NATURALISTIC (evaluation or methodology) An approach which minimizes much of the paraphernalia of science, e.g. technical jargon, prior technical knowledge, statistical inference, the effort to formulate general laws, the separation of the observer from the subject, the commitment to a single correct perspective, theoretical structures, causes, predictions and propositional knowledge. Instead there is a focus on the use of metaphor, analogy, informal (but valid) inference, vividness of description, reasons, explanations, interactiveness, meanings, multiple (legitimate) perspectives, tacit knowledge. (For an excellent discussion, see Appendix B: Naturalistic Evaluation, in Evaluating with Validity, E. House, Sage, 1980.) The Indiana University group (Guba and Wolf particularly) have paid particular attention to the naturalistic model, and their definition (Wolf, personal communication) stresses (a) more orientation towards "current and spontaneous activities, behaviors and expressions rather than to some statement of pre-stated formal objectives; (b) responds to educators, administrators, learners and the public's interest in different kinds of information; and (c) accounts for the different values and perspectives that exist ..."; the approach stresses contextual factors, unstructured interviewing, observation rather than testing, meanings rather than mere behaviors. Much of the debate about the legitimacy/utility of the naturalistic approach recapitulates the idiographic/nomothetic debate in the methodology of psychology and the debates in the analytical philosophy of history over the role of laws. At this stage the principal exponents of the naturalistic approach (e.g. Stake) have gone too far in the laissez-faire direction (any interpretation the audience makes is allowable); but their example has shown up the impropriety of many of the formalists' assumptions about the applicability of the social science model.

NEEDS ASSESSMENT (NEEDS SENSING is a related recent variant) This term has drifted from its literal meaning to a jargon status in which it refers to any study of the needs, wants, market preferences, values or ideals that might be relevant to e.g. a program. This sloppy sense might be called the "direction-finding" sense (or process), and it is in fact a perfectly legitimate process when one is looking for all possible guidance in planning or justification for continuance of a program. Needs assessment in the literal sense is just part of this and it is the most important part; hence, even if the direction-finding approach is taken, one must then sort out the true needs. Needs provide the first priority for response just because they are in some sense necessary, whereas wants are (merely) desired and ideals are "idealistic," i.e., often impractical. It is therefore very misleading to produce something as a NA (needs assessment) when in fact it is just a market survey, because it suggests that there is a level of urgency or importance about its findings which simply isn't there. True needs are considerably harder to establish than felt wants, because true needs are often unknown to those who have them, possibly even contrary to what they want, as in the case of a boy who needs a certain diet and wants an entirely different one.

The most widely used definition of need, the "discrepancy definition," does not confuse needs with wants but does confuse them with ideals. It defines need as the gap between the actual and the ideal, or whatever is needed to bridge it. This definition has even been built into law in some states. But the gap between your actual income and your ideal income is quite different from (and much larger than) the gap between your actual income and what you absolutely need. So we have to drop the use of the ideal level as the key reference level in the definition of need, which is just as well, because it is very difficult to get much agreement on what the ideal curriculum is like, and if we had to do that before we could argue for any curriculum needs, it would be hard to get started.

A second fatal flaw in the discrepancy definition is its fallacious identification of needs with one particular subset of needs, namely unmet needs. But there are many things we absolutely need, like oxygen in the air, or vitamins in our diet, which are already there. To say we need them is to say they are necessary for e.g. life or health, which distinguishes them from the many inessential things in the environment. Of course, on the discrepancy definition they are not needs at all, because they are part of "the actual," not part of the gap (discrepancy) between that and the ideal. It may be useful to use the dietary terminology for met and unmet needs: maintenance and incremental needs. People sometimes think that it's better to focus on incremental needs because that's where the action is required (so maybe the discrepancy definition doesn't get us into too much trouble). But where will you get the resources for the necessary action? Some of them usually come from redistribution of existing resources, i.e., from robbing Peter's needs to pay for Paul's, where Peter's (the maintenance needs) are just as vital as Paul's (the incremental). This leads to an absurd flip-flop in successive years: it is much better to look at all needs in the NA, prioritize them (using apportioning methods, not grading or ranking) and then act to redistribute old and new resources.

The correct definition of need, which we might call the diagnostic definition, defines need as anything essential for a satisfactory mode of existence, i.e., anything without which that mode of existence or "level of performance" would fall below a satisfactory level. The slippery term in this is of course "satisfactory," and it is context-dependent; satisfactory diets in a nation gripped by famine may be considerably nearer the starvation level than those regarded as satisfactory in a time of plenty. But that is part of the essentially pragmatic component in NA; it is a prioritizing and pragmatic concept. Needs slide along the middle range of the spectrum from disaster to utopia as resources become available. They never cover the ends of the spectrum: no riches, however great, legitimate the claim that everyone needs all possible luxuries.

The next major ambiguity or trap in the concept of need relates to the distinction between what we can call performance needs and treatment needs. When we say that children need to be able to read, we are talking about a needed level of performance. When we say they need classes in reading, or instruction in the phonics approach to reading, we are talking about a needed treatment. The gap between the two is vast, and can only be bridged by an evaluation of the alternative possible treatments that could yield the allegedly needed performance. Children need to be able to converse, but it does not follow they need classes in talking, since they pick it up without any. Even if it can be shown that they do need the "treatment" of reading classes, that's a long way from the conclusion that any particular approach to reading instruction is needed. The essential points are that the kind of NA with which one should begin evaluations is performance NA; and that treatment needs claims essentially require both a performance NA and a full-scale evaluation of the relative merits of the best candidates in the treatment stakes.

Conceptual problems not discussed here include the problem of whether there are needs for what isn't feasible, and the distinction between artificial needs (alcohol) and essential needs (food); methodological problems, including the flaws in the usual procedures for performing NA, are discussed elsewhere (LE).

The crucial perspective to retain on NA is that it is a process for discovering facts about organisms or systems; it's not an opinion survey or a wishing trip. It is a fact about children in this environment that they need Vitamin C and functional literacy skills, whether or not they think so or their parents think so or for that matter witchdoctors or nutritionists or reading specialists think so. What makes it a fact is that the withdrawal of, or failure to provide, these things results in very bad consequences, by any reasonable standards of good or bad. Thus, models for NA must be models for truth-finding, not for achieving political agreement. That they are all too often of the latter kind reflects the tendency of those who design them to think that value judgments are not part of the domain of truth. For NAs are value judgments just as surely as they are matters of fact; indeed, they are the key value judgments in evaluation, the root source of the value that eventually makes the conclusion an evaluative one rather than a purely descriptive one. It's easy to see this if we began with a statement that referred to an ideal, as we (implicitly) do with the discrepancy definition, or if we had a treatment-need statement to start off (since that is an evaluation). And it's easy to see that if we began with mere market surveys, we would not have an evaluative conclusion, just a descriptive one (possibly describing a population's evaluation, but not making evaluations). But diagnostic-definition performance NAs are evaluative because they require the identification of the essential, the important, that which avoids bad results. Of course, these are often relatively uncontroversial value judgments. Evaluations build on NAs like theories build on observations; it's not that observations are infallible, only that they're much less fallible than theoretical speculation.

NORMAL DISTRIBUTION (Stat.) Not the way things are normally distributed, though some are, but an ideal distribution which results in the familiar bell-shaped curve (which, for example, is perfectly symmetrical though few real distributions are). A large part of inferential statistics rests on the assumption that the population from which we are sampling is normally distributed with regard to the variables of interest, and is invalid if this assumption is grossly violated, as it quite often is. Height and eye color are often given as examples of variables that are normally distributed, but neither are well-supported examples. (The term "Gaussian distribution" is sometimes and much less confusingly used for this distribution.)
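
A quick way to see whether the assumption is even roughly plausible for one's own data is to compare the sample against the normal-curve benchmarks (about 68% of cases within one standard deviation of the mean, about 95% within two). A minimal sketch in Python, with invented height data; this is a rule-of-thumb check, not a formal test:

    # Rough normality check using only the standard library: compare the
    # share of observations within 1 and 2 standard deviations of the mean
    # to the normal-curve benchmarks (~68% and ~95%). Data are invented.
    from statistics import mean, stdev

    def coverage(xs, k):
        m, s = mean(xs), stdev(xs)
        return sum(abs(x - m) <= k * s for x in xs) / len(xs)

    heights = [150, 155, 158, 160, 162, 163, 165, 167, 170, 178]
    print(f"within 1 sd: {coverage(heights, 1):.0%} (normal: ~68%)")
    print(f"within 2 sd: {coverage(heights, 2):.0%} (normal: ~95%)")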

NORM-REFERENCED TESTS These are constructed to yield a measure of relative performance of the individual (or group) by comparison with the performance of other individuals (or groups) taking the same test, e.g. in terms of percentile ranking (cf. Criterion-Referenced Tests). Since the simplest and often the best quick way to determine whether a test involves unrealistic standards is by finding out how many students in the state succeed at that level, norm-referencing is a valuable part of any testing program. It is not ideal as a sole basis since it makes discriminating or comparing more important than (or the only meaning of) achieving, and severely weakens the test as an indicator of mastery (or excellence or weakness), which you should also know about. The best compromise is a criterion-referenced test on which the norms are also provided, whose criteria are documented needs.

NULL HYPOTHESIS The hypothesis that results are due to chance. Statistics only tells us about the null hypothesis; it is experimental design that provides the basis for inferences to the truth of the scientific hypothesis of interest. The "significance levels" referred to in experimental design and interpretation are the chances that the null hypothesis is correct. Hence, when results "reach the .01 level of significance," that means there's only one chance in a hundred that they would be due to chance. It does not mean that there's a 99 percent chance that our hypothesis is correct, because, of course, there may be other explanations of the result that we haven't thought of.
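
The distinction is easy to make concrete with a simulation. The sketch below (a hypothetical coin-tossing example, not from this entry) estimates how often a result at least as extreme as the observed one would occur if chance alone were operating; that frequency is the significance level, and it says nothing directly about the merits of one's favored explanation:

    # Estimate P(16 or more heads in 20 fair tosses) by simulation; this
    # is what a significance level reports: the chance of a result this
    # extreme under the null hypothesis, NOT the chance that our own
    # hypothesis is correct.
    import random

    random.seed(0)
    trials = 100_000
    extreme = sum(
        sum(random.random() < 0.5 for _ in range(20)) >= 16
        for _ in range(trials)
    )
    print(f"p = {extreme / trials:.4f}")  # about .006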

NUT ("making the nut") Management consultant jar-On for the basic cost of running the business for the year.After "making the nut" one may become a little choosierabout which jobs to take on, and what rates to set.

OBJECTIVES The technical sense of this term refers to a rather specific description of an intended outcome; the more general description is referred to as a goal.

OBSERVATION The process or product of direct sensory inspection, frequently involving trained observers. The line between observation and its normal antonym "interpretation" is not sharp and is in any case context-dependent, i.e. what counts as an observation in one context ("a very pretty dive") will count as an interpretation in another (where the diving judges' score is appealed). Just as it is very difficult to get trainees in evaluation, even those with considerable scientific training, to write non-evaluative descriptions of something that is to be evaluated, so it is difficult to get observers to see only what's there rather than their inferences from it. The use of checklists and training can produce very great increases in reliability and validity in observers; observation is thus a rather sophisticated process, and not to be equated with the amateur's perceptions or reports on them. It should be clear from the above that there are contexts in which observers, especially trained observers, can correctly report their observations in evaluative terms. (An obvious example, where no special training is involved, is reporting scores at a rifle range.)

OPPORTUNITY COSTS Opportunity costs are what one gives up by engaging in a particular activity. The same concept applies to investing money or any other resource. There are always opportunity costs; one at least has to give up leisure to do something, or give up work to do nothing, i.e., enjoy leisure. Calculating them (like profit) is a conceptual task first, and an arithmetic one later. In the first place, there is always an infinity of alternatives to any action, all of which one gives up. Does it follow that opportunity costs are always infinite? The convention is that the OC is the value of the most valuable of these. So, calculating one OC often involves calculating a great many costs of alternatives.
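
A minimal sketch of that convention, with invented dollar values; the function and figures are purely illustrative:

    # Opportunity cost as the value of the most valuable forgone
    # alternative (the convention stated in this entry). Values invented.
    def opportunity_cost(forgone_values):
        return max(forgone_values.values())

    forgone = {"consulting work": 4000, "teaching": 2500, "leisure": 1500}
    print(opportunity_cost(forgone))  # 4000, not the sum of all three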

OUTCOME EVALUATION See Pay-off Evaluation.

OVERLEARNING Overlearning is learning past the point of 100% recall, and is aimed at generating long-term retention. In order to avoid boredom on the part of the learner, and for other reasons, the best way to do this is through reintroducing the concept (etc.) in a variety of different contexts. One reason that long-term studies, or the follow-up phase of an evaluation, often reveal grave deterioration of learning is that people have forgotten the distinction between learning to criterion at t1 and learning to criterion at t2; in fact, the latter is the correct criterion, where t2 is the time when the knowledge is needed, while t1 is the end of the instructional period.

PAD, PADDING When a bidder comes up with a budget for a proposal, there has to be, one way or another, some allowance in it for unforeseen eventualities, at least if it is to be done according to sound business practices. This is often referred to as "the pad," and the practice of doing this is the legitimate version of "padding the budget." Padding the budget is also used as a term to refer to illegitimate additions to the budget (excessive profits); but it must be realized that the pad is the only recourse that the contractor has for handling the obvious unreliabilities in predicting the ease of implementing some complicated testing program, the ease of designing a questionnaire that will get past the questionnaire review panels, etc.

PARADIGM An extremely general conception of a discipline, which may be very influential in shaping it, e.g. "the classical social science paradigm in evaluation."

PARALLEL FORMS Versions of a test that have been tested for equal difficulty and validity.

PARALLEL PANELS In proposal review, for example, it is important to run independent concurrent panels in order to get some idea of the reliability of the ratings they are producing. On the few occasions this has been done, the results have been extremely disquieting. Unreliability guarantees both invalidity and injustice. One would expect a federal science foundation to have enough commitment to validity and justice to do routine checks of this kind; instead they usually cry poormouth rather than looking for ways to get validity within the same budget. In any case, dispensing funds invalidly and unfairly is not justified by saying it would cost slightly more to do it reasonably well, even if true, since the payoffs would be higher (from the definition of "doing it reasonably well"), and justice is supposed to be worth a little.

PARETO OPTIMAL A tough criterion for changes in e.g. an organization or program which requires that changes be made only if nobody suffers and somebody benefits. The crucial feature is that it appears to avoid the problem of justifying so-called "interpersonal comparisons of utility," i.e., showing that the losses some sustain as a result of a change are less important than the gains made by others. Improving welfare conditions by raising taxes is not Pareto optimal, obviously. But selecting between alternative Pareto optimal changes still involves relative hardship and benefit considerations. A major weakness in Rawls' theory of justice is the commitment to Pareto optimality.

PARETO PRINCIPLE A management maxim possibly more illuminating than the Peter Principle and Parkinson's Law; it is sometimes described as the 80/20 rule, or the "principle of the vital few and the trivial many," and asserts that about 80% of significant achievement e.g. at a meeting is done by about 20% of those present; 80% of the sales come from 20% of the salespeople; 80% of the pay-off from a task-list can be achieved from 20% of the tasks, etc. Worth remembering because it's sometimes true, and often surprising.
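
The claim is easy to check against any list of contributions. A small sketch with invented sales figures:

    # What share of total sales comes from the top 20% of salespeople?
    sales = [95, 88, 12, 9, 8, 7, 6, 5, 4, 3]  # invented figures
    sales.sort(reverse=True)
    top = sales[: max(1, len(sales) // 5)]     # the "vital few": top 20%
    print(f"top 20% produce {sum(top) / sum(sales):.0%} of sales")  # ~77%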

PARKINSON'S LAW "Work (and budgets, timelines and staff size) expands to fill the space, time and funds available." If its converse were only true it would mean we could do everything by allowing no time for it; but it is an insight about large organizations. The fact that bids on RFPs come in close to the estimated limit may not illustrate this, but only that the work could be done at various levels of thoroughness, or that RFP writers aren't dumb.

PASSIVE (evaluation) See Active.

PAYBACK (PERIOD) A term from fiscal evaluation which refers to the time before the initial cost is recovered; the recovery cash flows should of course be time-discounted. Payback analysis is what shows that buying a $12,000 word-processor may be sensible even if the price will probably drop to $8000 in a year; if the payback period is, say, 15 months (typical of many carefully-chosen installations), you will in fact lose several thousand dollars by waiting in the belief that the price drop will save you money.
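
A sketch of the arithmetic, with an invented monthly saving and discount rate; only the $12,000 price comes from the entry:

    # Months until cumulative time-discounted savings recover the cost.
    def discounted_payback(cost, monthly_saving, annual_rate):
        r = annual_rate / 12
        cumulative, month = 0.0, 0
        while cumulative < cost:
            month += 1
            cumulative += monthly_saving / (1 + r) ** month
        return month

    # $12,000 machine, $900/month assumed savings, 10% assumed rate:
    print(discounted_payback(12000, 900, 0.10), "months")  # 15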

PAYOFF EVALUATION Evaluation focused on results; the method of choice apart from costs, delay, and intervening loss of control or responsibility. (See Process Evaluation.) Essentially similar to outcome evaluation.

PERCENTILE (Stat.) If you arrange a large group in the order of their scores on a test, and divide them into 100 equal-sized groups, beginning with those who have the lowest score, the first such group is said to consist of those in the 1st percentile (i.e., they have scores worse than 99% of the group), and so on to the top group, which should be called the 100th percentile. (For boring technical reasons the actual procedure used only distinguishes 99 groups, so the best one can do is get into the 99th percentile.) With smaller numbers, or for cruder estimates, the total group is divided into ten deciles; similarly for four quartiles, etc.
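
A minimal sketch of the ranking procedure, with invented scores:

    # Percentile rank as the share of the group scoring below a person.
    scores = [52, 67, 67, 71, 80, 85, 90, 93, 97, 99]  # invented scores

    def percentile_rank(score, group):
        below = sum(s < score for s in group)
        return round(100 * below / len(group))

    for s in (52, 85, 99):
        print(f"score {s} beats {percentile_rank(s, scores)}% of the group")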

PERFECTIONISM Marks' Principle: "The price of perfection is prohibitive." Never get letters or papers retyped when fully legible corrections can be made by hand; there aren't enough trees, days or dollars for that. Legal documents and typographical works of art may be exceptions, but the Declaration of Independence has two insertions by the scribe, so there's a precedent in a legal document. (Cited by Bliss.)

PERFORMANCE CONTRACTING The system of hiring and paying someone to deliver (e.g. educational) services by results. They might be paid in terms of the number of students times the number of grade equivalents their scores are raised. Widely tried in the 70s, now rare. The usual story is that it didn't work, or worked only by the contractor's staff cheating ("teaching to the test"). The actual situation was that the best contractors did a consistently good job but the pooled results of all contractors were not good. As with most innovations, the total lack of sophistication (in evaluation) of the educational decision makers treated this result as grounds for giving up, instead of for hiring the better contractors, from which we might have gone on to still better teaching methods for everyone. See Regression to the Mean for an example of the need for some sophistication in setting the terms of the contract.

PERSONNEL EVALUATION Personnel evaluation typically involves an assessment of job-related skills, in one or more of five ways: first, judgmental observation of job performance by untrained but well-situated observers, e.g. co-workers; second, judgmental observation by skilled observers, e.g. an experienced supervisor or personnel manager or consultant; third, direct measurement of job performance parameters, by calibrated instruments (human or, usually, other); fourth, observation or measurement of performance on job simulations; fifth, the same on paper and pencil tests which examine job-relevant knowledge or attitudes. Personnel evaluation not only involves ethical constraints upon the way it should be done, it must also involve an ethical dimension on which the performance of the personnel is scored. The importance of that will vary depending upon the amount of authority and interpersonal contact of the individual being evaluated. There are a number of standard traps in personnel evaluation which invalidate most of the common approaches. For example, the failure to provide appropriate levels of anonymity for the raters, consistent with relevant legislation, or a general fear of bad-mouthing others because it involves the sin referred to in "judge not that ye be not judged," leads to an unwillingness to voice criticism even if deserved; this (solvable) problem requires sustained and ingenious attention. The scales used in personnel evaluation are rarely based upon serious job analysis and consequently can hardly give an accurate picture of someone's performance. Another common mistake is to put style variables into evaluation forms or reports, in situations where no satisfactory evidence exists that a particular style is superior to others. Even when style variables have been validated as indicators of superior performance, they typically cannot be used in personnel evaluation because the correlations between their presence and good performance are merely statistical, and are thus as illegitimate in the evaluation of individuals as skin color, which of course does correlate statistically with various desirable and/or undesirable characteristics. "Guilt by association" is as inappropriate when the association is via a common style as when it is via a common friend, race or religion.

PERSON-YEARS See Level of Effort.

PERSPECTIVAL EVALUATION This approach to or part of an evaluation requires the evaluator to attempt various conceptualizations of the program or product being evaluated. Programs and products can be seen from many different perspectives which affect every aspect of the evaluation, including cost analysis. Advocate-adversary is a special case of perspectival evaluation; consumer-based or manager-based evaluations are special perspectives. As in architecture, multiple perspectives are required in order to see something in full depth. Different from illuminative, responsive, etc. in the total commitment to the view that there is an objective reality of which the perspectives are merely views, and inaccurate by themselves. The correct strand in the naturalistic approach stresses this; the weak strand favors the "each perspective is legitimate" approach, which is false if the perspective is claimed to be the reality and not just one aspect of it.

PERT, PERT CHART Stands for Program Evaluation Review Technique; a special type of flow charting, of which perhaps the most interesting feature is the fact that an effort is made to project times at which various points in the project's development will be reached (and outputs at those points) at three levels, namely the maximum likely, the minimum likely and the most probable (date or level). This provides a good approach to contingency planning, in the hands of a skilled manager. As with all these devices, they can become a pointless exercise if not closely tied to reality, and the tie to reality can't be read off the chart.
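
The usual PERT convention, which this entry does not spell out, combines the three estimates into a single expected value by weighting the most probable estimate four times as heavily as either extreme; the milestone and numbers below are invented:

    # Standard PERT three-point estimate: (optimistic + 4 * most_probable
    # + pessimistic) / 6. The weights are the conventional ones, an
    # assumption beyond this entry's text.
    def pert_expected(optimistic, most_probable, pessimistic):
        return (optimistic + 4 * most_probable + pessimistic) / 6

    print(pert_expected(4, 6, 10), "weeks")  # about 6.3 for a 4-10 week task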

PLACEBO EFFECT The effect due to the delivery context of a treatment as opposed to the delivered content. In medicine, the placebo is a dummy pill, given to the control group in exactly the same way as the test drug (or more generally, the experimental treatment) is given to the experimental group, i.e., with the nurses, doctors and patients in ignorance as to whether the pill is a placebo or not. (Notice that there are two errors in this as a valid design for identifying placebo effect, but it's a considerable improvement over giving no placebo to the control group.) "Bedside manner" carries the placebo effect with it, and since it is estimated that prior to the sulfa drugs 90% of all therapeutic results were due to the placebo effect, it's a little unfortunate that bedside manner gets little play in medical practice and training (and, until 1948, no research). Psychotherapy has been said to be entirely placebo effect (Frank); a design to investigate this view presents interesting challenges. In education and other human service areas, the placebo effect is roughly equivalent to the Hawthorne effect, which probably accounts for most successes with innovations. This is as licit as bedside manner, but only if not ascribed to the snake-oil itself. But if we're honest about it being only a placebo, won't the placebo effect evaporate? Not if the charismatic context is preserved; "the heart has its reasons that Reason doesn't know."

PLANNING (evaluation in) See Preformative Evaluation.

POINT CONSTANCY REQUIREMENT (PCR) The requirement on numerical scoring, e.g. of tests, that a point, however earned (i.e. on whatever item and for whatever increment of performance on a particular item), should reflect the same amount of merit. (It is connected with the definition of an interval scale.) If the PCR is violated, additivity fails, i.e. performance A will add up to more points than performance B although it is in fact inferior. PCR is a very severe requirement and rarely even tried for in any serious way; hence one should normally (holistically) grade as well as score tests to provide protection against PCR failure. The key to PCR is the rubric in essay/simulation scoring and item-tailoring on multiple-choice tests.
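
An invented two-candidate example of the additivity failure; the point values and merit judgments are purely illustrative:

    # PCR violation: item points diverge from judged merit, so A collects
    # more points than B while showing less merit. All numbers invented.
    points = {"item1": 10, "item2": 10, "item3": 30}  # rubric's points
    merit  = {"item1": 10, "item2": 10, "item3": 10}  # judged merit

    A = {"item1": 0, "item2": 0, "item3": 1}  # 1 = full credit on item
    B = {"item1": 1, "item2": 1, "item3": 0}

    def total(weights, answers):
        return sum(weights[i] * answers[i] for i in weights)

    print(total(points, A), total(points, B))  # 30 vs 20: A out-scores B
    print(total(merit, A), total(merit, B))    # 10 vs 20: B has more merit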

POINT OF ENTRY The point of entry problem is the problem for the client of when to bring an evaluator in on a project, and the problem for the evaluator of the point in the time flux of decisions when s/he should start evaluating the options (critical competitors). Project directors and program managers often feel that bringing in an external evaluator (and often any evaluator at all) at the very beginning of a project is likely to produce a "chilling effect," and that the staff should have a chance to "run with the ball" in the way they think is most likely to be productive for at least some time, without admonitions about measurability of results, etc. The result is often that the evaluator is brought in too late to be able to determine baseline performance, and too late to set up control groups, and is hence unable to determine either gains or causation, to mention only two of the major problems that occur in trying to do evaluations of projects that were designed without evaluation in mind. This is not to say that evaluators never or rarely exert a chilling effect; they often do. Often they could have avoided it; sometimes not. (GFE is one way to avoid it but impossible in the planning phase.) It's possible on a small project to have an evaluator in for at least one series of discussions during the planning phase, maybe get by without one for a while after that, bring one or more back in after things begin to take shape, and perhaps dispense with most of them again for a second period of "unfettered creativity." However, there are many good evaluators that exert a positive, supportive and helpful effect on projects, in spite of being on board all the time. They will need external evaluation help to avoid the bias of cooption, but on a big project there's really no alternative to an in-house, early-on-board evaluation staff. From the evaluator's point of view, the question is what to consider "fixed," what to consider as beyond second-guessing in doing an evaluation. Suppose that one is brought in very late in a project. For formative evaluation purposes, there's really no point in second-guessing the early decisions about the form of the project, because they're presumably irreversible; for summative evaluation, it will be necessary to second-guess those, and that means that the point of entry of the summative evaluator will be back at the moment when the project design was being determined, a point which presumably antedates the allocation of funds to the project. The formative evaluator, however, should in fact not be restricted to looking at the set of choice points that are seen by the project staff as downstream from the point at which the evaluator is called in. For the formative evaluator, the correct point of entry for evaluation purposes is the last irreversible decision. Even though the staff hasn't thought of the possibility of reversing some earlier decisions, the formative evaluator must look into such possibilities and the cost/value of reversals.

POLITICS OF EVALUATION Depending on one's role and the day of the week, one is likely to think of politics as dirty politics, an intrusion into scientific evaluation, or as part of the ambient reality which evaluators are too often too careless about including as relevant considerations. If one has a favorable attitude towards politics, or uses the term without pejorative connotations, one will include virtually all program background and contextual factors in the political dimension of program evaluation. The jaundiced view simply defines it as the set of pressures that are not related to the truth or merits of the case. The politics of competency-based testing as a requirement for graduating is a good example. The situation in many states is that it has become "politically necessary" to institute such requirements, now or in the near future, although the way in which they have been instituted virtually destroys all the reasons for the requirements. That is, the requirement for graduating from the 12th grade is "basic skills" at the 7th or 8th grade level; no demonstration of other skills; not even any demonstration of application skills on the basics; the exams are set up so that multiple retakes of exactly the same test are possible (hence there is no proof that the skill is present); teachers have access to and teach to the test; other subjects are completely dropped from the 11th and 12th grade curriculum in order to make room for yet more repetitive teaching of drill-level basics, etc. A strong case can be made that this version of MCT does more harm than good, though a genuine version would certainly contribute towards truth-in-packaging of the diplomate. This is politics without pay-off. But on many occasions, the "politics" is what gets equity into personnel evaluation, and racism out of the curriculum, though it also keeps moral education out of the public schools, a terrible handicap for the society. Better education about and in evaluation is the only hope of improvement, short of a political leader with the charisma to persuade us of anything and the brains to persuade us to improve our self-critical skills.

POPULATION (Stat.) The group of entities from which a sample is drawn, or about which a conclusion is inferred. Originally meant people; obvious extension to things (e.g. objects on the production line, the population which is sampled for quality control studies); less obvious extensions to circumstances (a field trial samples the population of circumstances under which a product might be used); still fancier extensions in statistical theory to possible configurations, etc.

PORTRAYAL Semi-technical term for an evaluation-by-(rich)-description, perhaps using pictures, quotes, anecdotes as well as observations. See Responsive Evaluation, Naturalistic Evaluation.

POSTTEST The measurement made after the "treatment," to get absolute or relative gains (depending on whether the comparison is with pretest scores or comparison group scores).

POWER (of a test, design, analysis) An important technical concept involved in the evaluation of experimental designs and methods of statistical analysis, related to efficiency. It is in tension with other desiderata such as small sample size, as is usual with evaluative criteria.

PPBS Program Planning and Budgeting System. The management tool developed by McNamara and others at Ford Motor Company and taken to the Pentagon when McNamara became Secretary of Defense; since then widely adopted in other federal and state agencies. Principal advantage and feature: identifying costs by program and not by conventional categories such as payroll, inventory, etc. Facilitates rational planning with regard to program continuance, increased support, etc. Two problems: first, it's too often (virtually always) instituted as a mere change in bookkeeping procedures, without a program evaluation component worth the name, so the gains in decision validity don't occur. Second, it's often very expensive to implement and unreliable in its distribution of overhead, and it never seems to occur to anyone to evaluate the problem and cost of shifting to PPBS before doing it, a typical example of missing the point of the whole enterprise. Cf. Meta-evaluation, Mission Budgeting.

PRACTICE EFFECT The specific form of practice effect refers to the fact that taking a second test with the same or closely similar items in it will result in improvement in performance even if no additional instruction or learning has occurred between the two tests. After all, one has done all the "organizing of one's thoughts" before the second test. There is a general practice effect, which is particularly important with respect to individuals who have not had much recent experience with test-taking; this practice effect simply refers to improving one's test-taking skills through practice, e.g. one's ability to control the time spent on each question, to understand the way in which various types of multiple-choice questions work, etc. The more speeded the test is, the more serious the practice effect is likely to be. The use of control groups will enable one to estimate the size of the practice effect, but where they're not possible, the use of a posttest-only design for some of the experimental group will do very nicely instead, since the difference between the two sub-groups on the posttest will give an indication of the practice effect, which one then subtracts from the gains of the pretested sub-group in order to get a measure of the gains due to the treatment.
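
The subtraction just described, sketched with invented group means; the labels and figures are assumptions for illustration:

    # Two sub-groups of the experimental (treated) group: one tested
    # twice, one given the posttest only. All figures invented.
    gain_tested_twice = 14        # posttest minus pretest, twice-tested group
    post_mean_tested_twice = 66
    post_mean_posttest_only = 60

    practice_effect = post_mean_tested_twice - post_mean_posttest_only  # 6
    treatment_gain = gain_tested_twice - practice_effect                # 8
    print(practice_effect, treatment_gain)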

PREDICTIVE VALIDITY See Construct Validity.

PREFORMATIVE (evaluation) Evaluation in the planning phase of a program; typically involves gathering baseline data, improving evaluability, designing the evaluation, improving the planned program, etc. See Evaluation.

PRESS RELEASES The rules are: (1) Don't bother to hand (or send) out the technical version, even as a supplement. (2) Don't bother to hand out a summary of the technical document. (3) Don't bother to hand out a statement which says favorable things and then qualifies them; either the qualification or the favorable comment will be dropped. (4) Issue only a basic description of the program itself plus a single overview claim, e.g. "Results do not as yet show any advantages or disadvantages from this approach, because it's much too early to tell. May have a definite conclusion in n months." (That's an interim release; in a final release you drop the second sentence.)

PRETEST Pretests are normally said to be of two kinds: diagnostic and baseline. In a diagnostic pretest, the pedagogical (health etc.) function is to identify the presence or absence of the prerequisite skills, or the places where remedial instruction should be provided. These tests will typically not be like the posttests. In baseline pretesting, on the other hand, we are trying to determine what the level of knowledge (etc.) is on the criterion or pay-off dimensions, and hence it should be matched exactly, for difficulty, with the posttest. Instructors often think that using this kind of pretest will have bad results, because students will have a "failure experience." Properly managed, the reverse is the case; not only does one frequently discover that some or all students are not as ignorant as one had thought about the subject matter of the course, in which case very useful changes can be made in content, or "challenging out" can be allowed, with a reduction in costs to the student and possibly to the instructor. Moreover, the pretest gives an excellent and highly desirable preview of the kind of work that will be expected, and if it is, as it should be, gone over carefully in class, one has provided students with an operational definition of the required standards for passing. Furthermore, one has created a quite useful climate for interesting the students in early discussions, by giving them a chance to try to solve the problems with their native wit, and then explaining how the content of the course helps them to do better. In many subjects, though not all, this constitutes a very desirable proof of the importance of the course. Of course, treating the pretest as defining the early course content is likely to qualify as teaching to the test if one uses many of the items from the pretest in the posttest. But there are times when this is entirely appropriate; and in general it is very sensible to pull items for the posttest out of a pool that includes the items from the pretest, so that at least some of them will be retested. This encourages learning the material covered in the pretest, which should certainly not be excluded from the course just because it has already been tested. Instructors who begin to give pretests also begin to adjust their teaching in a more flexible way to the requirements of a specific class, instead of using exactly the same material repeatedly. Thus the use of a pretest is an excellent example of the integration of evaluation into teaching, and a case of evaluation procedures paying off through side effects as well as through direct effects (which, in this case, would be the discovery that students are not able to learn certain types of material from the text, notes and lectures provided on that topic).

PROCESS EVALUATION Usually refers to evaluation of the treatment (or evaluand) by looking at it instead of the outcome. With exceptions to be mentioned, this is only legitimate if some connection is known (not believed) between process variables and outcome variables, and it is never the best approach because such connections, where they do exist, are relatively weak, transient, and likely to be irrelevant to many new cases. The classic case is evaluation of teachers by classroom observation (the universal procedure K-12), where there are no evaluation-usable connections between classroom behavior and learning outcomes; quite apart from the problem that the observer's presence produces atypical teaching behavior, and the observer is normally someone with other personal relations with the teacher that are highly conducive to bias. (The evaluation of administrators is no better.) Certain aspects of process should be looked at, as part of an evaluation, not as a substitute for inspection of outcomes, e.g. its legality, its morality, its enjoyability, implementation of alleged treatment, and whether it can provide any clues about causation. It is better to use the term mediated evaluation to refer to what is described in the opening sentence of this entry, and allow process evaluation to refer to that and to the direct evaluation of process variables as part of an overall evaluation which involves looking at outcomes.

PRODUCT Interpreted very broadly, e.g. may be used to refer to students, etc., as the "product" of a training program; a pedagogical process might be the product of a Research and Development effort.

PRODUCT EVALUATION The best-developed kind of evaluation; Consumer Reports used to be the paradigm though it has deteriorated significantly in recent years (PE). See Key Evaluation Checklist.

PROFESSIONALISM, PROFESSIONALITY Somewhere above minimum competence in a profession but short of the realm of professional ethics there is a set of obligations, e.g. to keeping current, and to self-evaluation, which should be supported and counted in personnel evaluation. Professional ethics for quarterbacks prohibits kick-backs; professionalism requires kicking practice.

PROFIT This term from fiscal evaluation has unfortunate connotations to the uninformed. The gravity of the misconception becomes clear when a non-profit organization starts doing serious budgeting and discovers that it has to introduce something which it can scarcely call profit, but which does the same job of funding a prudent reserve, new programs and buildings, etc. (It calls it "contribution to margin," instead.) The task of defining profit is essentially a philosophical one. Granted that we should distinguish gross profit from net profit, and that gross "profit" has to cover all overhead (e.g. administrative, amortization, insurance and space expenses), which may leave no (net) profit at all, what should we do about the cost of the money capital and time invested when both are furnished by a proprietor/manager or by donors? Is a proprietor whose "net profit" covers his time at the rate of $5 per hour really making a profit? If ROI on the capital investment is 3% in a market which pays 10% on certificates of deposit, is this "making a profit" or a loss, when s/he could make $20/hour in salary? Using opportunity cost analysis, the answer is No; but the usual analysis says Yes. That's correct for the Internal Revenue Service, but not for employees considering a strike. As usual, cost analysis turns out to be conceptually very complex although few people realize this; consequently serious mistakes are very common. If the buildings (or equipment) have been amortized completely, should one deduct a slice of the eventually-necessary replacement cost down-payment before one has a profit? Should some recompense for risk (or prior losses) be allowed before we get to "profits"? Cost analysis/fiscal evaluation looks precise because it's quantitative, like statistics, but eventually the conceptual/practical problems have to be faced, and most current definitions will give you absurd consequences, e.g. "the business is profitable, but I can't afford to keep it going."

PROGRAM The general effort which marshals staff and projects towards some (often poorly) defined and funded goals.

PROGRAMMED TEXT One in which the material is broken down into small components ("frames"), ranging in length from one sentence to several paragraphs, within which some questions are asked about the material, e.g. by leaving a blank which the reader has to fill in with the correct word, possibly from a set of options provided. This interactive feature was widely proclaimed to have great virtue in itself. It had none, unless very thorough R&D effort was also employed in the process of formulating the exact content and sequence of the frames and choices provided. Since the typographic format does not reveal the extent of the field-testing and rewriting (and hence conceals the total absence of it), lousy programmed texts quickly swamped the market (late 50s) and showed that Gresham's Law is not dead. As usual, the consumers were mostly too naive to require performance data, and the general conclusion was that programmed texts were "just another fad." In fact, the best ones were extremely powerful teaching tools; were in fact "teacher-proof" (a phrase which did not endear them to one group of consumers); and some are still doing well (Sullivan/BRL reading materials, for example). A valiant effort was made by a committee under Art Lumsdaine to set up standards, but the failure of all professional training programs to teach their graduates serious evaluation skills meant there was no audience for the standards. We shall see whether the new Evaluation Standards from the Stufflebeam group suffer a better fate.

PROJECTS Projects are time-bounded efforts, often parts of a program.

PROJECTIVE TESTS These are tests with no right answer; the Rorschach inkblot test is a classic example, where the subject is asked to say what s/he sees in the inkblot. The idea behind projective tests was that they would be useful diagnostic tools, and it seems quite possible that there are clinicians who do make good diagnoses from projective tests. However, the literature on the validity of Rorschach interpretations, i.e. those which can be expressed verbally as unambiguous rules for interpretations, is essentially negative. The same is unfortunately true of many other projective tests, which fail to show even test-retest reliability, let alone interjudge reliability (assuming that shared bias is ruled out by the experimental design), let alone predictive validity. Of course, they're a lot of fun, and very attractive to valuephobes, both testers and testees, since there are no right answers.

PROTOCOL See Evaluation Etiquette.

PSEUDO-NEGATIVE EFFECT An outcome or datum that appears to show that an evaluand is having exactly the wrong kind of effect, whereas in fact it is not. Four paradigm examples are: the Suicide Prevention Bureau whose creation is immediately followed by an increase in the rate of reported suicides; the school intercultural program which results in a sharp rise of interracial violence; the college faculty teaching improvement service whose clients score worse than non-clients; the drug education (or sex education) program which leads to "experimentation." (See text of Introduction to Evaluation, Scriven, for treatment of these examples.)

PSEUDO-POSITIVE EFFECT Typically, an outcome which is consistent with the goals of the program, but in circumstances where either the goals or this way of achieving the goals is in fact harmful, or side effects of an overwhelming and harmful kind have been overlooked. Classic case: "drug education" programs which aim to and do get enrollees off marijuana, and result in getting them on regular cigarettes or alcohol, thereby trading some reduction in (mostly artificial) crimes for far more deaths from lung cancer, cirrhosis of the liver and traffic accidents. (A typical example of ignoring opportunity costs and side effects, i.e. bad GBE.)

PSYCHOMOTOR SKILLS (Bloom) Learnt muscular skills. The distinction from cognitive and affective is not always sharp, e.g. typing looks psychomotor but is highly cognitive as well.

PSYCHOLOGICAL EVALUATION, PSYCHO-EDUCATIONAL EVALUATION Particular examples of practical evaluation, the first often primarily taxonomical, the second often primarily predictive. The usual standards of validity apply, but are rarely checked; the few studies suggest that even the reliability is very low, and what there is may be largely due to shared bias.

QUALITATIVE (evaluation) A great deal of good evaluation (e.g. of personnel and products) is wholly or chiefly qualitative. But the term is sometimes used to mean "non-experimental" or "not using the quantitative methods of the social sciences," and this has confused the issue, since there is a major tradition and component in evaluation which fits the just-quoted descriptions but is quantitative, namely the auditing tradition and the cost analysis component. What has been happening is a gradual convergence of the accountants and the qualitative social scientists towards the use of the others' methods, and the use of some qualitative techniques from humanistic disciplines and low-status social sciences (e.g. ethnography). Obviously evaluation requires all this and more, and the dichotomy between qualitative and quantitative has to be defined clearly and seen in perspective or it is more confusing than enlightening.

QUALITY CONTROL A type of evaluative monitoring, originating in the product manufacturing area, but now used to refer to evaluative monitoring in the human services delivery area. This kind of evaluation is formative in the sense that it is run by the staff responsible for the product, but it is the kind of formative that is essentially "early-warning summative," because one is endeavoring to ensure that the product, when it reaches the consumer, will appear to be highly satisfactory from the consumer's point of view. Thus quality control is not at all like a common type of evaluative monitoring, which is checking on whether the project is on target; that is a form of goal-based evaluation. Quality control should be consumer-oriented evaluation, i.e. goal-free, or needs-based evaluation.

QUANTITATIVE (evaluation) Usually refers to the use of numerical analysis methodology from social science or accounting. Cf. Qualitative.

QUARTILE (Stat.) See Percentile.

QUASI-EXPERIMENTAL DESIGN (Term due to Donald Campbell) When we cannot actually do a random allocation of subjects to the control and experimental groups, or cannot arrange that all subjects receive the treatment at the same time, we settle as next best for quasi-experimental design, where we try to simulate a true experimental design by carefully picking someone or a group for the "control group" (i.e., selecting someone who did not in fact get the treatment but who very closely matches the experimental person/group). Then we study what happens to, and perhaps test, our "experimental" and "control" groups just as if we had set them up randomly. Of course, the catch is that the reasons (causes) why the experimental group did in fact get the treatment may be because they are different in some way that explains the difference in the outcomes (if there is such a difference), whereas we, not having been able to detect that difference, will think the difference in outcome is due to the difference in the treatment. For example, smokers may, it has been argued, have a higher tendency to lung irritability, an irritation which they find is somewhat reduced in the short run by smoking; and it may be this irritability, not smoking, that yields the higher incidence of lung cancer. Only a "true experiment" could exclude this possibility, but that would probably run into moral problems. However, the weight and web of the quasi-experiments has virtually excluded this possibility. See Ex Post Facto.
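
A minimal sketch of the matching step, on invented data; real matching can only use the variables one has measured, which is exactly the catch the entry describes:

    # For each treated subject, pick the closest untreated subject on the
    # observed covariates (squared-distance match). Data invented.
    def nearest(subject, pool, keys):
        return min(pool, key=lambda c: sum((subject[k] - c[k]) ** 2
                                           for k in keys))

    treated = [{"age": 45, "income": 30}, {"age": 60, "income": 22}]
    untreated = [{"age": 44, "income": 31}, {"age": 35, "income": 50},
                 {"age": 59, "income": 20}]

    controls = [nearest(t, untreated, ("age", "income")) for t in treated]
    print(controls)  # the simulated "control group"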

QUEMAC Acronym for an approach to metaevaluation, due to Bob Gowin, a philosopher of education at Cornell, which emphasizes the identification of unquestioned assumptions in the design. (Questions, Unquestioned assumptions, Evaluations, Methods, Answers, Concepts.)

QUESTIONNAIRES The basic instrument for surveys and structured interviews. Usually too long, which reduces response rate as well as validity (because it encourages stereotyped or superficial responses). Must be field-tested; usually even a second field-test still uncovers problems, e.g. of ambiguity. Interesting problems arise with respect to evaluation questionnaires, e.g. what type to use in personnel evaluation when the average response turns out to be a 6 on a 7-point scale, providing inadequate upside discrimination. One can use stronger anchors; or rephrase as a ranking questionnaire; or impose grading-on-the-curve (Q-sort) methodology, by putting limits on the number of allowable 7's or 6's from any one respondent; or provide deflationary instructions or systems. The first and last of these introduce less distortion where merit levels really are high; the U.S. Air Force once ran into a minor rebellion when it adopted the third alternative. See also Rating Scales, Symmetry.

RANDOM A "primitive" or ultimate concept of statis-tics and probability, i.e., one that cannot be defined in termsof any other except circularly. Texts often define a randomsample from a population as one picked in a way that givesevery individual in the population an equal probability ofbeing chosen; but one can't define "equal probability" with-out reference to randomness or a cognate. A distinctly trickynotion. It is not surprising that the first three "tables ofrandom numbers" turned out to have been doctored bytheir authors; although allegedly generated in (completelydifferent) waysby mechanical and mathematical proce-dureswhich met the definition just given, they were ob-viously non-random, e:g. because pages or columns whichheld a substantial preponderance of a particular digit or adeficit of one particular digit-pair were deleted, whereas ofcourse such pages must occur in any complete listing of allpossible combinations. No finite table can be random by thepreceding definition. The best definition is relativistic andpragmatic; a choice is random with regard to the variable X if

110

Page 118: DOCUMENT RESUME - ERICED 198 180 AUTHOR TITLE PUB DATE NOTE AVAILABLE FPOM EDRS PRICE DESCRIPTORS DOCUMENT RESUME TM 810 149 Scriven, Michael Evaluation Thesaurus. Second Edition

it is not significantly affected by variables that significantlyaffect X. Hence a die or cut of cards or turn of the roulettewheel is random with regard to the interests of the players ifthe number that comes up is caused to do so by variableswhich are not under the influence of the players' interests.

RANKING, RANK-ORDERING Placing individuals in an order, usually of merit, on the basis of their relative performance on (typically) a test or measurement or observation. Full ranking does not allow ties, i.e. two or more individuals with the same rank ("equal third"); partial ranking does; it may then, in the limit case, not be different from grading.

RATING Usually same as grading.

RATING SCALES Device for standardizing responses to requests for (typically evaluative) judgments. There has been some attempt in the research literature to identify the ideal number of points on a rating scale. An even number counteracts the tendency of some raters to use the midpoint for everything by forcing them to jump one way or the other; on the other hand, it eliminates what is sometimes the only correct response. Scales with 10 or more points generally prove confusing and drop the reliability; with 3 or less (Pass/Not Pass is a two-point scale), too much information is thrown away. Five- and (especially) seven-point scales usually work well. It should be noted that the A-F scale is semantically asymmetrical with the usual anchor points, i.e. it will not give a normal distribution (in the technical sense) of grades for a population in which talent is normally distributed. With + and - and fence-sitting supplements (A+, A, A-, AB, B+, B, B-, BC . . .), it runs to 19 points, and with the double + (double -), it has 29 points and becomes essentially ritualistic. Note that the translation of letter grades into numbers, e.g. for purposes of computing a grade-point average, involves assumptions about the equality of the intervals (of merit) between the grades, and about the location of the zero point, which are usually not met. See also Questionnaire.
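
The point about grade-point averages can be made concrete. A minimal sketch follows; the 4.0 mapping is the conventional one, not something the entry itself specifies:

```python
# The usual letter-to-number conversion tacitly assumes equal merit
# intervals between adjacent grades and a meaningful zero at F.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def gpa(grades):
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)

print(gpa(["A", "C"]))   # 3.0
print(gpa(["B", "B"]))   # 3.0 -- "equivalent" only if the A-B and B-C
                         # merit intervals really are equal
```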

RATIONALIZATION Pseudo-justifications, usually provided ex post facto. See Consonance.

RATIONALIZATION EVALUATION An evaluation is sometimes performed in order to provide a rationalization for a predetermined decision. This is much easier than it might appear, and a good many managers know very well how to do it. If they want a program canned, they hire a gunslinger; if they want one salvaged or protected, they hire a sweetheart. Every now and again evaluators are brought in by clients who have got them into the wrong category, and the early discussions are likely to be embarrassing, annoying or amusing, depending upon how badly you needed the job.

RAW SCORES The actual score on a test, before it is converted into percentiles, grade equivalents, etc.

R&D Research and Development; the basic cyclic (iterative) process of improvement, e.g. of educational materials or consumer products: research; design and prepare, pilot run, investigate (evaluate) results, design improvements, run improved version, etc.

RDD&E Research, Development, Diffusion (or Dissemination) and Evaluation. A more elaborate acronym for the development process.

REACTIVE EFFECT A phenomenon due to (an artefact of) the measurement procedure used: one species of evaluation or investigation artefact. It has two sub-species, content-reaction effects and process-reaction effects. Evaluation-content reactions include cases where a criticism in a preliminary draft of an evaluation is taken to heart by the evaluee and leads to instant improvement, thereby "invalidating" the evaluation. Evaluation-process reactions include cases where the mere occurrence (or even the prospect) of the evaluation materially affects the behavior of the evaluee(s) so that the assessment to be made will not be typical of the program in its pre-evaluated states. Process reactivity is thus content-independent. Although reactive measurements have not previously been thus sub-divided, the distinction does apply there and not just to evaluation; but it is less significant. In both cases, unobtrusive approaches may be appropriate to avoid process-reactivity; but on the other hand openness may be required on ethical grounds. The openness may be with respect to content or with respect to process or both. See Reasons for Evaluation. Example: Hawthorne Effect.


REASONS FOR EVALUATION The two main reasons for doing evaluation are to improve something (formative evaluation) and to make various practical decisions about something (summative evaluation). Pure interest in determining the merits of something is another kind of summative evaluation. There are also what might be called content-independent reasons for doing evaluation, e.g. as a rationalization or excuse (for a hatchet job or for funding a favorite) or for motivation (to work more carefully or harder). In the excuse case, the general nature of the evaluation's content must be known or arranged in advance, e.g. by hiring a known "killer" or "sweetheart."

RECOMMENDATIONS In a trivial sense, an evaluation involves an implicit recommendation: that the evaluand be viewed/treated in the way appropriate to the value it was determined to have by the evaluation. But in the specific sense often assumed to be appropriate, where "recommendation" is taken to mean "remedial actions," evaluations may not lead to them even if designed so as to do so (which is much more costly). That remediation recommendations are not always possible, even when evaluation is possible, is obvious in medicine and product evaluation; but because the logic has not been well thought out, it is widely supposed to be a sign of bad design or an absence of humanity when personnel or program evaluations do not lead to them. There are some people who are irremediably incompetent at a given complex task, and not even the progress of science will alter that qualitative fact, though it may alter percentages. It is a very grave design decision in evaluation to commit a design to producing remedial suggestions, just as it is to undertake to discover explanations; it may increase cost and the chance of failure by 1000 percent.

RECOIL EFFECTS When a hunter shoots a deer, he (sic) sometimes bruises his shoulder. Programs affect their staff as well as the clientele. The effect is of secondary importance compared to what happens to the deer or the clientele, but must be included in program evaluation.

REGRESSION TO THE MEAN You may have a run of luck in roulette, but it won't last; your success ratio will regress (drop back) to the mean. When a group of subjects is selected for remedial work on the basis of low test scores, some of them will have scored low only through "bad luck," i.e., the sampling of their skills yielded by (the items on) this test is in fact not typical. If they go through the training and are retested, they will score better simply because any second test would (almost certainly) result in their displaying their skills more impressively. This phenomenon gives an automatic but phony boost to the achievements of "performance contractors" if they are paid on the basis of improvement by the low-scorers. If they had to improve the score of a random sample of students, regression down to the mean would offset the regression up to the mean we have just discussed. But they are normally called in to help the students who "need it most," and picking that group by testing will result in including a number who do not need help. (It will also exclude some who do.) Multiple or longer tests, or the addition of teacher (expert judge) evaluations, reduce this source of error.
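
The effect is easy to demonstrate by simulation. The sketch below (illustrative numbers only, not from the original) retests a group selected for low first-test scores and shows their mean rising with no treatment whatever:

```python
import random

random.seed(1)
skills = [random.gauss(50, 10) for _ in range(1000)]   # true skill levels

def test(skill):
    return skill + random.gauss(0, 8)                  # score = skill + "luck"

first = [test(s) for s in skills]
low = [i for i, score in enumerate(first) if score < 40]   # "need it most"
second = [test(skills[i]) for i in low]                    # retest, no training

print(sum(first[i] for i in low) / len(low))   # mean of selected group, test 1
print(sum(second) / len(low))                  # higher on test 2, by luck alone
```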

RELATIVISM/SUBJECTIVISM Roughly speaking, the view that there is no objective reality about which the evaluator is to ascertain the truth, but only various perspectives or approaches or responses, amongst which selection is fairly arbitrary or is dependent upon aesthetic and psychological considerations rather than scientific ones. The contrary point of view would naturally be referred to as absolutism or objectivism; in one technical sense used in philosophy the opposite of subjectivism is called the doctrine of realism. The fundamental logical fallacy that confounds many discussions of this issue is the failure to see the full implications of the fact that relativism is a self-refuting doctrine, i.e., "relativism is true" can be no more true than "relativism is false," and hence relativism can hardly represent a Great Truth, since it is self-refuting. One very important implication of this point for evaluation practice is as follows: in a situation where a number of different approaches, methodologies or perspectives on a particular program (for example) are possible and all are about equally plausible, it does not follow that any one of them would constitute a defensible evaluation. The only thing that follows is that giving all of them, and the statement that all of them are equally defensible, would constitute a defensible evaluation. The moment that one has seen that alternative approaches are equally good, although they yield incompatible results, one has seen that no one of these can be thought of as sound in itself, just because the assertion of any one of them implies the denial of the others and that denial is, in such a case, illegitimate. Hence the assertion of any one of them by itself is illegitimate. If, on the other hand, the different positions are not incompatible, then they must still be given in order to present a comprehensive picture of whatever is being evaluated. In neither case, then, is giving a single one of these perspectives defensible. In short, the great difficulties of establishing one evaluation conclusion by comparison with others cannot be avoided by arbitrarily picking one, but only by proving the superiority of one or including all as perspectives, a term which correctly implies the existence of a reality which is only partly revealed in each view. Thus it converts incompatible reports into complementary ones, i.e. it converts relativism into objectivism. Merely giving several apparently incompatible accounts in an evaluation is incompetent; showing how they can be reconciled, i.e. seen as perspectives, is also required. (Or else a proof that there is no single reality.) The presupposition that there is a single reality is not an arbitrary one, any more than the assumption that the future will be somewhat like the past is arbitrary; these are well-established. Determinism was equally well-established and we have now had to qualify it slightly because of the Uncertainty Principle. We have not yet encountered good reasons for qualifying the assumptions of realism and induction (the technical names for the two previously mentioned).

As the practical end of these considerations, it must be recognized that even evaluations ultimately based on "mere preferences" may still be completely objective. One must distinguish sharply between the fact that the ultimate basis of merit in such cases is mere preference, on which the subject is the ultimate source of authority, and the fallacy of supposing that the subject must therefore be the ultimate source of authority about the merits of whatever is being evaluated. Even in the domain of pure taste, the subject may simply not have researched the range of options properly, or avoided the biasing effects of labels and advertising, or recommendations by friends, so the evaluator may be able to identify critical competitors that outperform the subject's favorite candidate, in terms of the subject's own taste. And of course identifying Best Buys for an individual involves a second dimension (cost) which the evaluator is often able to determine and combine more reliably than the amateur. The moment we move the least step from areas where superiority is unidimensional, instantaneous, and entirely taste-dependent, then we find the subject beginning to make errors of synthesis in putting together two or three dimensions of preference (halo or sequencing effects, for example), or in extrapolating to continued liking, errors that an evaluator can reduce or eliminate by appropriate experimental design, often leading to a conclusion quite different from that which the subject had formed. One step further away, and we find the possibility of the subject making first-level errors of judgment, e.g. about what they need (or even what they want) by contrast with what they like, and these can certainly be reduced or eliminated by appropriate evaluation design. In the general case of the evaluation of consumer goods, the question of whether one can identify "the best" product with complete objectivity, despite a substantial range of different interests and preferences at the basic level by the relevant consumer group, is simply a question of whether the interproduct variations in performance outweigh the interconsumer variations in preference. Enormous variations in preference may be completely blotted out by the tremendous superiority of a single product over another, such that it "scores" so much on several dimensions which are accorded significant value by all the relevant consumers, that even the outlandish tastes (weightings) of some of the consumers with respect to some of the other dimensions cannot elevate any of the competitive products to the same level of total score, even for those with the atypical tastes. Thus huge interpersonal differences in all the relevant preferences do not demonstrate the relativism of evaluations which depend on them.

RELIABILITY (Stat.) Reliability in the technical sense is the consistency with which an instrument or person measures whatever it is designed to assess. If a thermometer always says 90 degrees Centigrade when placed in boiling distilled water at sea-level, it is 100% "reliable," though inaccurate. It is useful to distinguish test-retest reliability (the example just given) from interjudge reliability (which would be exhibited if several thermometers gave the same reading). There are many psychological tests which are test-retest reliable but not interjudge (i.e., inter-administrator) reliable; the reverse is less common. In the everyday sense, reliability means the same as the technical term validity; we'd say that a thermometer which reads 90 degrees Centigrade when it should read 100 degrees Centigrade wasn't very reliable. This confusing situation could easily have been avoided by using the term "consistency" instead of introducing a technical use of "reliability," but that was in the days when jargon was thought to be a sign of scientific sophistication. As it is, reliability is a necessary but not a sufficient condition for validity, hence worth checking first, since in its absence validity can't be there. (There is, unfortunately, a hyper-technical exception to this.)
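
The thermometer example can be put in code: a measure with a large constant error is still highly "reliable" in the technical (consistency) sense. A sketch with made-up numbers, not from the original:

```python
import random, statistics

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

random.seed(0)
trait = [random.gauss(0, 1) for _ in range(500)]
# Same instrument used twice: a big constant bias plus small random noise.
test1 = [t + 2.0 + random.gauss(0, 0.3) for t in trait]
test2 = [t + 2.0 + random.gauss(0, 0.3) for t in trait]
print(round(pearson(test1, test2), 2))   # high test-retest reliability,
                                         # despite systematic inaccuracy
```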

RELIABILITY (of evaluation) A largely unknown quantity, easily obtained by running replications of evaluations, either serially or in parallel. The few data on these make clear that reliability (apart from spurious effects such as shared bias) is not high. The use of calibration exercises and checklists and trained evaluators can improve this enormously.

REMEDIATION A specific recommendation for improvement, characteristic of (and certainly desirable in) formative rather than summative evaluation. But formative evaluation can be useful without any remediation suggestions, and it is in general more difficult (sometimes completely impossible) and more expensive if it aims for remediation. See also Recommendations.

REPLICATION A very rare phenomenon, contrary to reports, mainly because people do not take the notion of serious testing for implementation (e.g. through the use of an index of implementation) as an automatic requirement on any supposed replication. Even the methodology for replication is poorly thought out; for example, should the replicator have any detailed knowledge of the results of the primary site? Such knowledge is seriously biasing; on the other hand, it significantly simplifies the preparations for ranges of measurement, etc. It is probably quite important to arrange at least some replications where the (e.g.) program to be replicated is simply described in operational terms, perhaps with the incidental remark that it has shown "promising results" at the primary site.


REPORT WRITING/GIVING One of several areas in evaluation where creativity and originality are really important, as well as knowledge about diffusion and dissemination. Reports must be tailored to audience as well as client needs, and may require a minor needs assessment of their own. Multiple versions, sometimes using different media, as well as different vocabularies, are often appropriate. Reports are products and should be looked at in terms of the KEC; field-testing them is by no means inappropriate. Who has time and resources for all this? It depends whether you are really interested in implementation of the evaluation. Would you write it in Greek? No; so why assume that you are not writing it in the equivalent of Greek, as far as your audiences are concerned?

RESEARCH The general field of disciplined investigation, covering the humanities, the sciences, jurisprudence, etc. Evaluation research is one subdivision; there is no way to distinguish other research from evaluation (apart from content) except by distorting one or the other. "Evaluation research" is usually just a self-important name for serious evaluation; it would be better used to refer to research on evaluation methodology, or research that pushes out the frontiers of evaluation, or at least research that involves considerable investigatory difficulty or originality. Cf. performing arts vs. creative arts.

RESEARCH INTEGRATION, RESEARCH SYNTHESIS See Meta-analysis.

RESEARCH EVALUATION Evaluating the quality and/or value and/or amount of research (proposed or performed) is crucial for e.g. funding decisions and university personnel evaluation. It involves the worth/merit distinction: "worth" here refers to the social or intellectual payoffs from the research, "merit" to its intrinsic (professional) quality. While some judgment is always involved, that is no excuse for allowing the usual wholly judgmental process; one can quantify and in other ways objectify the merit and worth of almost all research performances to the degree requisite for personnel evaluation.

RESPONSIVE EVALUATION Bob Stake's current approach, which contrasts with what he calls "preordinate" evaluation, where there is a predetermined evaluation design. In responsive evaluation, one picks up whatever turns up and deals with it as seems appropriate, in the light of the known and unfolding interests of the various audiences. The emphasis is on rich description, not testing. The risk is of course a lack of structure or of valid proof, but the trade-off is the avoidance of the risk of a preordinate evaluation: a rigid and narrow outcome of little interest to the audiences. Cf. Evaluation-Specific Methodology, Naturalistic Evaluation.

RESPONSE SET Tendency to respond in a particular way, regardless of the merits of the particular case. Some respondents tend to rate everything very high on a scale of merit, others rate everything low, and yet others put everything in the middle. One can't argue out of context that such patterns are incorrect; there are plenty of situations in which those are exactly the correct responses. When we're talking about response set, however, we mean the cases where these rigid response patterns emerge from general habits and not from well thought-out consistency.

RESPONSIBILITY EVALUATION Evaluation that is oriented to the identification of the responsible person(s) or the degree of responsibility, and hence usually the degree of culpability or merit. Responsibility has causality as a necessary but not a sufficient condition. Culpability similarly presupposes responsibility but involves further conditions from ethics. Social scientists, like most people not trained in the law or casuistry, are typically totally confused about such issues, e.g. supposing that evaluations shouldn't be done (or published) because "they may be abused." The abuse is culpable; but it is failure to publish (assuming it's professional-quality work of some prima facie intellectual or social value) that would be culpable (e.g. the Jensen case). A different kind of example involves keeping really bad teachers on in a school district because the alternative of attempting dismissal involves effort, is unpopular with the union, and usually unsuccessful. The responsibility is to the pupils who are sacrificed at the rate of 30 per annum per bad teacher; and that responsibility is so serious that you (the superintendent or the board) have to try for removal because (a) you may succeed, (b) the effects may be on balance good, (c) you may learn how to do it better next time. The evaluation of schools should (normally) only be done in terms of the variables over which the school has control; in the short run, and often in the long run, this does not include scores on standardized tests. (See SEP.) The evaluation of evaluations should never be done in terms of results, because the evaluator is not responsible for implementation; but it should be done in terms of results if implemented. Ref. Primary Philosophy, Scriven, McGraw-Hill, 1966.

RETURN ON INVESTMENT (ROI) One of the measures of merit or worth in fiscal evaluation; usually quoted as a per annum percentage rate.

RISK(S), EVALUATING The classic expectancy approach, in which the products of the probability of each outcome by its utility are compared, thus converting the two dimensions of risk and utility into the one (of expectancy), has certain weaknesses. For example, it ignores the variable value of risk itself to different individuals; the gambler likes it, many others seek to minimize it. "Risk management" is a topic that has begun to appear with increasing frequency in planning and management training curricula. One reason that evaluations are not implemented is that the evaluator has failed to see that risks have different significance for implementers by contrast with consumers; a program or policy (etc.) which should be implemented, in terms of its probable benefit to the consumers, may be one which carries a high risk for the implementers, because their reward schedule is often radically different from that of the consumer (usually as a result of bad planning and management at a higher level). Two classic examples are the classification of documents as Top Secret and the hiring of personnel about whom there is a breath of suspicion; in each situation, the implementer gets zapped by review panels exercising 100 percent hindsight after a disaster if there is the least trace of a negative indicator, and in neither case is there ever a reward for taking a reasonable risk; in fact, there's never a review panel. Consequently, the public's utilities are not optimized and are often reversed. The present political-plus-media environment in the U.S. may be one in which the risk configuration for the road to the Presidency (or the legislature) is so different from that required to do the job right as to guarantee the election of poor incumbents who were great candidates.
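
For reference, the expectancy calculation the entry starts from is just a probability-weighted sum; the numbers below are invented for illustration:

```python
def expectancy(outcomes):
    """Sum of probability x utility over the possible outcomes."""
    return sum(p * u for p, u in outcomes)

safe_option  = [(0.95, 10), (0.05, -10)]   # near-certain modest benefit
risky_option = [(0.50, 40), (0.50, -15)]   # large payoff, real downside
print(expectancy(safe_option), expectancy(risky_option))   # 9.0 vs 12.5
```

The expectancies rank the risky option higher for everyone, which is exactly the collapse the entry complains of: the value (or disvalue) of the risk itself has disappeared from the calculation.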


RIGHT-TO-KNOW The legal domain of impacted populations' access to information; much increased lately, e.g. through "open file" legislation.

RHETORIC, THE NEW The title of a book by C. Perelman and L. Olbrechts-Tyteca (Notre Dame, 1969), which attempted to develop a new logic of persuasion, reviving the spirit of pre-Ramist efforts. (Since Ramus (1572), the view of rhetoric as the art of empty and illogical persuasion has been dominant; the concept of "logical analysis," as separate from rhetoric, is Ramist.) This area is of the greatest importance to evaluation methodology, as Ernest House has stressed (e.g. in Evaluating with Validity, Sage, 1980), because of the extent to which evaluations have, whether intentionally or not, the function of persuasion and not just reporting. The New Rhetoric emerged from the context of studying legal reasoning, where the same situation obtains and was poorly recognized. The same push for reappraisal and new models has occurred in logic (see Informal Logic, eds. Blair and Johnson, Edgepress, 1980), and in the social sciences with the move towards naturalistic methodology. It is all part of the backlash against neo-positivist philosophy of science and the worship of the Newtonian model of science. Evaluation's fate clearly lies with the new movements.

RITUALISTIC EVALUATION One of the reasons for doing evaluation that has nothing to do with the content of the evaluation (and hence is unlike formative and summative, or rationalization, evaluation) is the ritual function, i.e. the doing of an evaluation because it is required, although nobody has the faintest intention of either doing it well or taking any account of what it says. Evaluators are quite often called in to situations like this, although they may not even be recognized as cases of ritual evaluation by the client. (Evaluation in the bilingual education area is currently mostly ritualistic.) It is an important part of the preliminary discussions in serious evaluation to get clear exactly what kind of implementation is planned, under various hypotheses about what the content of the evaluation report might be; unless, of course, you have time to spare, need the money, and are not misleading any remote audiences. The third condition essentially never applies. See also Motivational Evaluation.


ROBUSTNESS (Stat.) Statistical tests and techniques depend to varying degrees on assumptions, especially about the population of origin. The less they depend on such assumptions, the more robust they are. The t-test assumes normality; non-parametric ("distribution-free") statistics are often considerably more robust. One might translate "robust" as "stable under variation of conditions." The concept is also applicable to, and most important in, the evaluation of experimental designs and meta-evaluation. Designs should be set up to give definite answers to at least some of the most important questions no matter how the data turns out, a matter quite different from their cost-effectiveness, power, or elegance (the latter is a kind of limit case of efficiency or power). Evaluations should be set up so as to "go for the jugular," i.e. an adequately reliable answer to the key evaluative question(s) first, adding the trimmings later if nothing goes wrong with Part One. This affects budget, staff and time-line planning. And it has a cost, as does robustness in statistics; for example, robust approaches will not be maximally elegant if everything goes right. But meta-evaluation will normally show that a minimax approach is called for, which means robust evaluation.

ROLE (of evaluator) The evaluator plays more roles than Olivier, or should. Major ones include therapist/confessor, educator, arbitrator, co-author, "the enemy," trouble-shooter, jury, judge, attorney.

RORSCHACH EFFECT An extremely complex evaluation, if not carefully and rationally synthesized into an executive summary report, provides a confusing mass of positive and negative comments, and the unskilled and/or strongly biased client can easily project onto ("see in," rationalize from) such a backdrop whatever perception s/he originally had.

RUBRIC Scoring or grading or (conceivably) ranking key for a test.

SALIENCE SCORING The practice of requesting respondents to use only those scales which, they felt, most significantly influenced them. It focuses attention on the most important features of whatever is being rated, and it greatly reduces processing time.

SCALES See Measurement.

SCOPE OF WORK This is the part of an RFP or a proposal which describes exactly what is to be done, at the level of description which refers to the activities as they might be seen by a visitor without special methodological skills or insight, rather than to their goals, achievements, process or purpose. In point of fact, scope of work statements tend to drift off into descriptions that are somewhat less than observationally testable. The scope of work statement is an important part of making accountability possible on a contract, and is an important part of the specifications in an RFP or a proposal.

SCORING Assigning numbers to an evaluand (usually a performance), usually from an interval scale, i.e. one in which the points all have equal value. Sometimes numbers are used as grades without commitment to point constancy, but this is misleading; letters should be used instead, and the attempt to convert them to numbers, e.g. to calculate GPAs, should be protested unless point constancy holds at least to an approximation that will not yield errors. Usually tests should be impressionistically graded as well as scored, both to get the cutting scores and to provide insurance against deviations from point constancy. Scoring not only requires point constancy but also serious consideration of the definition of a zero score: no answer? hopelessly bad answer? both? ("both" is a hopelessly bad answer.)

SECONDARY ANALYSIS Reassessment of an experiment or investigation, either by reanalysis of the data or reconsideration of the interpretation. Gathering new data would normally constitute replication; but there are intermediate cases. Sometimes used to refer to reviews of large numbers of studies. See Meta-analysis, Secondary Evaluation.

SECONDARY EVALUATION (Cook) Reanalysis of original (or original plus new) data in order to produce a new evaluation of a particular project (etc.). Russell Sage Foundation commissioned a series of books in which famous evaluations were treated in this way, beginning with Tom Cook's secondary evaluation of Sesame Street. Extremely important because (a) it gives potential clients some basis for estimating the reliability of evaluators (in the case just cited, the estimate would be fairly low); (b) it gives evaluators the chance to identify and learn from their mistakes. Evaluations have all too often been fugitive documents and hence have not received the benefit of later discussion in "the literature" as would a research report published in a journal; a weakness in the field. (A similar problem applies to classified material.) Cf. Metaevaluation.

SECRET CONTRACT BIAS In proposal and personnel review, raters are often too lenient because they know that the roles will be reversed on another occasion, and they think or intuit that if everyone sees that, and acts accordingly, "we'll all come out smelling like roses." Typical unprofessional conduct, typical of the professions. A good counterbalance is to rate everyone on the long-term validity of their ratings.

SEMI-INTERQUARTILE RANGE (Stat.) Half the interval between the score that marks the top of the lowest or first quartile (i.e. the lowest quarter of the group being studied, after they have been ranked according to the variable of interest, e.g. test scores), and the score that marks the top of the third quartile. This is a useful measure of the range of a variable in a population, especially when it is not a normal distribution (where the "standard deviation" would be used). It amounts to averaging the intervals between the median and the individuals who are halfway out to the ends of the distribution, one in each direction. Thus it is not affected by oddities occurring at the extreme ends of the distribution.

SEP (School Evaluation Profile) An instrument for evaluating the performance of schools (and hence districts, principals etc.), which looks only at variables the school (population) controls. Available (in field test form only) from Edgepress. See Responsibility Evaluation.

SEQUENCING EFFECT The influence of the order of items (tests etc.) upon responses; it jeopardizes the test's validity when items are removed, e.g. for racial bias, since the item might have preconditioned the respondent (in a way that has nothing to do with its bias) so as to give a different response to the next (or any later) question.

SES Socio-Economic Status.

SENSORY EVALUATION Wine-tasting when done scientifically, the better restaurant reviews, and the Consumers Union report on bottled water remind us of the important difference between dismissing something as a "mere matter of taste" and doing sensory evaluation, which does not eliminate dependence on preference but improves its reliability and improves the evaluative inference, e.g. by eliminating distractors (such as labels), using multiple independent raters and standardized sets of criteria.

SHARED BIAS The principal problem with using multiple expert opinion for validation of evaluations is that the agreement (if any) may be due to common error; obvious and serious examples occur in peer review of research proposals, where the panelists tend to reflect current fads in the field to the detriment of innovators, and in accreditation. The best antidote is often the use of intellectually and not just institutionally external judges, e.g. radical critics of the field. The inference from reliability to validity must bridge the chasm of shared bias.

SIDE EFFECTS Side effects are the unintended good and bad effects of the program or product being evaluated. Sometimes the term refers to effects that were intended but are not part of the goals of the program, e.g. employment of staff. In either case, they may or may not have been expected, predicted or anticipated (a minor point). In the Key Evaluation Checklist a distinction is made between side effects and standard effects on impacted non-target populations, i.e. side-populations, but both are often called side effects.

SIGNIFICANT, SIGNIFICANCE The overall, synthesized conclusion of an evaluation; may relate to social or professional or intellectual significance. Statistical significance, when relevant at all, is one of a dozen necessary conditions for real significance. The significance of an intervention may be considerable even if it had no effects in the intended directions, which might be cognitive or health gains; it may have employed many people, raised general awareness of problems, produced other gains. The absence of more significant effects may also be due to dilution of good effects in a pool of poor programs producing no effect; one cannot infer from an overall null to individual nulls. For this reason, "lumping designs" are much less desirable than "splitting designs," in which separate studies are made of many sites or sub-treatments (see Replication, Meta-analysis). Omega statistics and Glass's "standardized effect size" are attempts to produce measures that more nearly reflect true significance than does the p level or the absolute size of the results.

SIMULATIONS Re-creations of typical job situations to provide a realistic test of aptitudes or abilities. See Clinical Performance Testing.

SMILES TEST (of a program) People like it.

SOCIAL INDICATOR See Indicator.

SOCIAL SCIENCE MODEL (of evaluation) The (naive) view that evaluation is an application of standard social science methodology. See Evaluation.

SOFT (approach to evaluation) Uses implementation data or the Smiles Test. See Hard.

SOLE SOURCE "Sole-sourcing" a contract is an alternative to "putting it out to bid" via publishing an RFP. Sole sourcing is open to the abuse of the contract officer from the agency letting contracts to his or her buddies without regard to whether the price is excessive or the quality unsatisfactory; on the other hand, it is much faster, it costs less if you take account of the time for preparing RFPs and proposals in cases where a very large number of these would be written for a very complex RFP, and it is sometimes mandatory when it is provable that the skills and/or resources required are available from only one contractor within the necessary time-frame. Simple controls can prevent the kind of abuse mentioned.

SPEEDED (tests) Tests with a time limit (the time taken by each individual is usually not recorded, though it is in timed tests); contrast power tests, which have no time limit. Speeded tests are often better instruments for evaluation or prediction than the same test would be with no time limit, usually because the criterion behavior involves doing something under time pressure, but sometimes, as in IQ tests, just as a matter of empirical fact. A test is sometimes defined as speeded if only 75 percent of the testees finish in time.

SPONSOR (of evaluation) Whoever or whatever funds or arranges it; referred to as "instigator" in the KEC.

STAKEHOLDER An interested party in an evaluation, e.g. a politician who supported the original program.

STANDARD(S) The performance associated with a particular rating or "grade" on a given criterion or dimension of achievement; e.g. 80 percent success may be the standard for passing the written portion (dimension) of the driver's license test. A cutting score defines a standard, but standards can be given in non-quantitative grading contexts, e.g. by providing exemplars, as in holistic grading of composition samples.

STANDARD DEVIATION (Stat.) A technical measure of dispersion; in a normal distribution, about two thirds of the population lies within one standard deviation of the mean, median, or mode (which are the same in this case). The S.D. is the square root of the mean of the squares of the deviations, i.e. distances from the mean.
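
In symbols, for a population of N scores x₁ . . . x_N with mean x̄ (the standard population form of the definition just given):

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^{2}}
```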

STANDARD ERROR OF MEASUREMENT (Stat.) There are several alternative definitions of this term, all of which attempt to give a precise meaning to the notion of the intrinsic inaccuracy of an instrument, typically a test.

STANDARD SCORE Originally, scores defined as deviations from the mean, divided by the standard deviation. (Effect Size is an example.) More casually, various linear transformations of the above (z-scores) aimed to avoid negative scores.
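
In symbols: the original definition, plus one common linear transformation of it (the T-score convention, added here purely for illustration; the entry does not name it):

```latex
z = \frac{x - \bar{x}}{\sigma} \qquad\text{and, e.g.,}\qquad T = 50 + 10z
```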

STANDARDIZED TEST Standardized tests are ones with standardized instructions for administration, use, scoring and interpretation, standard printed forms and content, and often with standardized statistical properties, that have been validated on a large sample of a defined population. They are usually norm-referenced, at the moment, but the terms are not synonymous, since a criterion-referenced test can also be standardized. Having the norms (etc.) on a test does make it standardized in one respect, but it does not mean it is a norm-referenced test in the technical sense; it may (also) be criterion-referenced, which implies a different technical approach to its construction and not just a different purpose.

STANINES (or stanine scores) If you are perverse enough to divide a distribution into nine equal parts instead of ten (see Decile), they are called stanines, and the cutting scores that demarcate them are called stanine scores. They are numbered from the bottom up. See also Percentiles.

STATISTICAL SIGNIFICANCE (Stat.) When the difference between two results is determined to be "statistically significant," the evaluator can conclude that the difference is probably not due to chance. The "level of significance" determines the degree of certainty or confidence with which we can rule out chance (i.e. rule out the "null hypothesis"). Unfortunately, if very large samples are used, even tiny differences become statistically significant though they may have no educational value at all. Omega statistics provide a partial correction for this. Cf. Interocular Difference.
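
The large-sample point is easily verified. A sketch with made-up numbers (a two-sample z-test with equal group sizes and known spread, using only the standard library):

```python
import math

def two_tailed_p(mean_diff, sd, n):
    """p-value for a two-sample z-test, equal group sizes, known sd."""
    z = mean_diff / (sd * math.sqrt(2.0 / n))
    return math.erfc(z / math.sqrt(2.0))

# A fixed, educationally trivial difference of 0.5 points (sd = 15):
for n in (100, 10_000, 1_000_000):
    print(n, two_tailed_p(0.5, 15.0, n))
# The same tiny difference drifts from "not significant" to "highly
# significant" as n grows; its educational value has not changed at all.
```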

STEM The text of a multiple-choice test item that precedes the listing of the possible responses.

STRATIFICATION A sample is said to be stratified if it has been deliberately chosen so as to include an appropriate number of entities from each of several population sub-groups. For example, one usually stratifies the sample of students in K-12 educational evaluations with regard to gender, aiming at 50 percent males and 50 percent females. If one selects a random sample of females to make up half of the experimental and half of the control group, and a random sample of males for the other half, then one has a "stratified random sample." If you stratify on too many variables you may not be able to make a random choice of subjects in a particular stratum; there may be no, or only one, eligible candidate. If one stratifies on very few or no variables, one has to use larger random samples to compensate. Stratification is only justified with regard to variables that probably interact with the treatment variable, and it only increases efficiency, not validity, unless you do it in addition to using large numbers, i.e. abandon the efficiency gains it makes possible. Indeed it runs some risk of reducing validity, because you may not cover a key variable (through ignorance) and your reduced sample size may not take care of it.
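
A minimal sketch of the gender-stratified case described above (the field names and sizes are illustrative, not from the original):

```python
import random

def stratified_sample(population, key, per_stratum, seed=0):
    """Draw a simple random sample of fixed size from each stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

pupils = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(200)]
sample = stratified_sample(pupils, key=lambda p: p["gender"], per_stratum=20)
# 20 female and 20 male pupils: the 50/50 aim of the entry's example.
```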

STRENGTHS ASSESSMENT Looking at resources available, including time; it defines the range of the possible and hence is important in both needs assessment and the identification of critical competitors, as well as in making remediation suggestions.

STYLE RESEARCH Investigations of two kinds: either descriptive investigations of the actual stylistic characteristics of people in e.g. certain professions such as teaching or managing; or investigations of the correlations between certain style characteristics and successful outcomes. The second kind of investigation is of great importance to evaluation, since discoveries of substantial correlations would allow certain types of evaluation to be performed on a process basis, which currently can only be done legitimately by looking at outcomes. (However, personnel evaluation could not be done in that way, even if the correlations were discovered.) The former kind of investigation (a typical example is studying the frequency with which teachers utter questions by comparison with declarative sentences or commands) is pure research, and extremely hard to justify as of either intellectual or social interest unless the second kind of connection can be made. In general, style research has come up with disappointingly few winners. (Actual Learning Time is probably the most important and possibly the only exception.) No doubt the interactions between the personality, the style, the age and type of recipient, and the subject matter prevent any simple results; but the poor results of research on interactions suggest that the interactions are so strong as to obliterate even very limited recommendations. We must instead fall back to treating positive results as possible remedies, not probability indicators of merit.

SUMMATIVE EVALUATION Summative evaluation of a program (etc.) is conducted after completion, and for the benefit of some external audience or decision-maker (e.g. funding agency, or future possible users), though it may be done by either internal or external evaluators or a mixture. For reasons of credibility, it is much more likely to involve external evaluators than is a formative evaluation. Should not be confused with outcome evaluation, which is simply an evaluation focused on outcomes rather than on process; it could be either formative or summative. (This confusion occurs in the introduction to the ERS Evaluation Standards, 1980 Field Edition.) Should not be confused with holistic evaluation; it may be holistic or analytic.

SUPERCOGNITIVE The domain of performance on cognitive (or information/communication) skills that is a quantum jump beyond normal levels, e.g. speed reading, lightning calculating, memory mastery, speedspeak or fastalk, tri-linguality, shorthand. Cf. Hypercognitive.

SURVEY METHODS (in evaluation) See Evaluation-Specific Methodology.

SYMMETRY (of evaluative indicators) It is a common error to suppose (or unwittingly to arrange) that the converse or absence of an indicator of merit is an indicator of demerit. This is illustrated by the assumption that items in evaluative questionnaires can be rewritten positively or negatively to suit the configural requirement of foiling stereotyped responses. But "Frequently lies" is a strong indicator of demerit, while "Does not frequently lie" is not even a weak indicator of (salient) merit. (Salient merit, i.e. commendable behavior, is what one rewards, not "being better than the worst one could possibly be.") The preceding is an epistemological point about symmetry (related to the virtue/supererogation distinction in ethics). There are also methodological asymmetries; for example, an item requesting a report on absences, e.g. "Was sometimes absent without leave," can be answered affirmatively by respondents who were often not there themselves but who observed one or more such absences by the evaluee; but "Was rarely absent without leave" will be checked "Don't know" by the same respondents, since it calls for knowledge they do not possess.

SYNTHESIS (of studies) The integration of multiple research studies into an overall picture is a field which has recently received considerable attention. These "reviews of the literature" are not only evaluations in themselves, with (it turns out) some quite complex methodology and viable alternatives involved on the way to a bottom line; they are also a key element in the evaluator's repertoire, since they provide the basis for identifying e.g. critical competitors and possible side-effects. See Meta-analysis.

SYNTHESIS (in evaluation) The process of combining a set of ratings on several dimensions into an overall evaluation. Usually necessary and defensible; sometimes inappropriate, because it requires a decision on relative weighting which sometimes is impossible. Those occasions require giving just the ratings on the separate dimensions. It is desirable to require an explicit statement and justification of the synthesis procedure, since this will often expose (a) arbitrary assumptions, (b) inconsistent applications. In the evaluation of faculty, for example, the de facto weighting of research vs. teaching is often nearer to 5:1 in institutions whose rhetoric claims parity; but it may vary widely between departments or between successive chairs in the same department. The evaluation of student course work by the letter grade is often cited as an example of indefensible synthesis; in fact it is a perfectly defensible summative evaluation, though it is unjustifiable for formative feedback to the student. "Synthesis by salience summary" illustrates another trap: a teacher is rated on 35 scales by students and the printout only shows cases of statistically significant departures of the ratings from the norms. This seems plausible enough; but since the dimensions have not been independently validated (and are not independent), it not only involves focusing on style characteristics which are being appraised on a priori grounds, but it also involves all the confusions of ranking instead of grading. The importance of correct synthesis is illustrated by a psychiatrist on the staff at the University of Minnesota who became legendary for requesting a grant so that a graduate student could "pull his research results together," his "research results" being a complete set of taped recordings of five years of therapy. Evaluators who are tempted to "turn the facts over to the decision-makers, and let them make the value-judgments" should remember that evaluations are interpretations that require all the professional skills in the repertoire; a scientist's role does not end with observation and measurement. Weighted-sum synthesis is linear synthesis and usually works well. Rarely, as in the evaluation of backgammon board positions or in evaluating patients on the MMPI, we need non-linear synthesis rules. Synthesis is perhaps the key cognitive skill in evaluation; it covers the evaluating invoked by the phrase "balanced judgment" as well as the apples-and-oranges difficulties. Its cousins appear in the core of all intellectual activity; in science, not only in theorizing and identifying the presence of a theoretical construct from the data, but in research synthesis. In evaluation, the wish to avoid it manifests itself in laissez-faire evaluation and in extreme forms of the naturalistic approach. Balking at the final synthesis is often (not always) balking at the value judgment itself, and close to valuephobia.
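
Weighted-sum (linear) synthesis is simple enough to state in a few lines; the weights and ratings below are hypothetical, and exist precisely to make the entry's point that such weights should be explicit and justified:

```python
WEIGHTS = {"research": 5.0, "teaching": 1.0}    # the de facto 5:1 case above

def linear_synthesis(ratings, weights):
    """Weighted average of dimension ratings; weights must be explicit."""
    return sum(weights[d] * ratings[d] for d in weights) / sum(weights.values())

candidate = {"research": 8.0, "teaching": 3.0}
print(round(linear_synthesis(candidate, WEIGHTS), 2))   # 7.17: research dominates
```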

SYSTEMS ANALYSIS The term is generally used interchangeably with "systems approach" and "systems theory." This approach places the product or program being evaluated into the context of some total system. Systems analysis includes an investigation of how the components of the program/product being evaluated interact and how the environment (system) in which the program/product exists affects it. The "total system" is not clearly defined, varying from a particular institution to the universe at large; hence the approach tends to be more an orientation than an exact formula, and the results of its use range from the abysmally trivial to deep insights. (Ref. C.W. Churchman, The Design of Inquiring Systems.)

TA Technology Assessment. An evaluation, particularly with respect to probable impact, of (usually new) technologies. Discussed in more detail under Technology Assessment below.

TARGET POPULATION The intended recipients or consumers. Cf. Impacted Population.

TAXONOMIES Classifications, most notably Bloom's taxonomy of educational objectives; a huge literature has grown up around these taxonomies, which are rather simplistic in their assumptions and excessively complex in their ramifications.

TEACHING TO THE TEST The practice of teaching just or mostly those skills or facts that will be tested, based on illicit prior knowledge of, or inference as to, the test content. If the test is fully comprehensive, e.g. testing knowledge of the "times tables" by calling for all of them, this is simply task-orientation and no crime. But most tests only sample a domain of behavior and generalize from performance on that sample as to overall performance in the domain, and that generalization will be erroneous when teaching to the test has occurred. A serious weakness of teacher-constructed tests is that they create the same situation ex post facto: see Testing to the Teaching.

TECHNOLOGY ASSESSMENT A burgeoning form of evaluation which aims to assess the total impact of (typically) a new technology. A cross between futurism and systems analysis, and consequently done at every level from ludicrously superficial to brilliant. OTA usually scores well above the middle of the possible range. The process remains in need of systematization; predicting that cassette recorders would displace books was clearly fallacious at the time, while predicting that hand-held optical-scanning voice-input/output printing micro-computers will virtually eliminate the necessity for instruction in basic skills by 1990 seems now (1980) to be so certain that the vast restructuring of the educational system which it entails should have long begun. One good feature of TA futurism might seem to be that in the long run we'll know who was right; but so much of it relates to potential that refutation is hard.

TERROR The effect frequently induced by goal-free evaluation (sometimes by the thought of it) in the whole cast of actors: evaluators, managers, evaluees. The "terror test" is the use of this awful threat to determine whether the cast is competent.

TESTS (& TEST ANXIETY) Tests are poor instruments when the subjects are more anxious than they would be in the criterion situation, or when they test a domain poorly matched to the test's alleged domain; but they are better than most observers, including the classroom teacher, in many, many cases.

TESTING TO THE TEACHING Designing tests to measure just what is actually taught, instead of testing learning in the domain about which conclusions will be or need to be drawn. Tests of a reading program that only use words actually covered in class will give a false picture of reading skills. As with "teaching to the test," this situation will not be improper in the extreme case where the teaching covers the whole domain.

TEST WISE Said of a subject who has acquired substantial skills in test-taking, e.g. learning to say False on all items which say "always" or "never," or (to give a sophisticated example) learning not to check answers on items one hasn't time to read carefully if a "correction for guessing" is being used, but to do so if it is not.

THEORIES A luxury for the evaluator, since they are not even essential for explanations, and explanations are not essential for (99 percent of all) evaluations. It is a gross though frequent blunder to suppose that "one needs a theory of learning in order to evaluate teaching." One does not need to know anything at all about electronics in order to evaluate electronic typewriters, even formatively, and having such knowledge often adversely affects a summative evaluation. See Conceptual Scheme.

THERAPEUTIC ROLE (OR MODEL) OF THE EVALUATOR The very nature of the evaluation situation creates pressures that sometimes mold it into a therapist-patient or group therapy interaction; this is particularly, but not only, true with regard to external evaluation. First, there is, in such a case, the client's feeling of having exhausted his/her own resources, needing help badly, perhaps desperately. Second, there is the aura of expertise and esoterica which (sometimes) surrounds the external expert. Third, there are the technical diagnoses and magical rites prescribed by the good doctor. Since it's doubtful that there is in general much more to psychotherapy than this, an amalgam which is enough to generate at least the placebo effect, the analogy is clear, and should be disturbing. The main problem with placebo and Hawthorne effects is their transitory nature, and the evaluator who fades back into the hills after an ecstatic client's testimonial dinner may have to sneak back for a look around a year later if s/he wants to get a good idea of whether the recommendations were (a) solutions to the problems, (b) adopted, (c) supported. Hence follow-up studies, sadly lacking in psychotherapy research (or innovation evaluation) and often devastating when done, are just as important in meta-evaluation.

TIME DISCOUNTING A term from fiscal evaluation which refers to the systematic process of discounting future benefits, e.g. income, for the fact that they are in the future and hence (regardless of the risk, an essentially independent source of value reduction for merely probable future benefits) lose the earnings that those monies would yield if in hand now, in the interval before they will in fact materialize. Time discounting can be done with reference to any past or future moment as base point, but is usually done by calculating everything in terms of "true present value."
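
As a minimal sketch of the calculation (the function name and the 10 percent rate are illustrative assumptions, not taken from this entry):

    def present_value(benefit, years_ahead, rate):
        """Discount a future benefit to "true present value": the sum
        which, invested now at the given annual rate, would grow into
        the benefit by the time it materializes."""
        return benefit / (1 + rate) ** years_ahead

    # At an assumed 10% annual rate, $10,000 due in five years
    # is worth about $6,209 in hand today.
    print(round(present_value(10_000, 5, 0.10), 2))   # 6209.21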

TIME MANAGEMENT An aspect of management consulting with which the "general practitioner" evaluator should be familiar; it ranges from the trivial to the highly valuable. Psychologists from William James to B. F. Skinner are amongst those who have made valuable contributions to it, and it can yield very substantial output gains at very small cost both for the evaluator and for clients or evaluees. It was James who suggested listing tasks to be done in decreasing order of enjoyability and beginning at the bottom, perhaps since that gives you the largest reduction of guilt and the biggest gain in charm for the remaining list. (Ref. McCay, James, The Management of Time, Prentice-Hall.)

TIME SERIES (See Interrupted Time Series Analysis)

TRAINING OF EVALUATORS Evaluators, like philosophers, and unlike virtually every other kind of professional, should be regarded as having a general obligation to know as much as possible about as much as possible. While it is feasible and indeed quite common for evaluators to specialize either in particular methodologies or in particular subject matter areas, the costs of doing this are usually rather obvious in their work. It is probably a consequence of the relative youth of evaluation as a discipline that the search for illuminating analogies from other disciplines is still so productive; but the other reason for versatility will always be with us, namely that it enables one to do better as an evaluator in as wide a range of subject matter areas as possible. Columbia University used to have a requirement that students could not be accepted for the doctorate in philosophy unless they had a Master's degree in another subject, and an analogous requirement might be quite desirable in evaluation. However, it is commonly asserted that the preliminary degree should be in statistics, tests, and measurement. The problem with that requirement is that it leads to a strong methodological bias in the eventual practice of the professional. While skill in the quantitative methodologies is highly desirable, it does not have to be a preliminary to evaluation training; the reverse sequence may be preferable. A simple formula for becoming a good evaluator is to learn how to do everything that is required by the Key Evaluation Checklist. The formula is simple, the task is not; but it may be better to specify the core of evaluation training in this way rather than by listing competencies in terms of their supposed prerequisite status with respect to evaluation. People get to be good evaluators by a large number of routes, and the field would probably benefit by increasing this number rather than standardizing the routes. See Evaluation Skills.

TRAIT-TREATMENT INTERACTION A less widely used term for aptitude-treatment interaction, though it is actually a more accurate term.

TRACERS Artificially added features of a treatment designed to make the identification of its effects easier. See Modus Operandi Method.

TRANSACTIONAL EVALUATION (Rippey) Focuses on the process of improvement, e.g. by encouraging anonymous feedback from those whom a change would affect, and then a group process to resolve differences. Though a potentially useful implementation methodology in some cases, transactional evaluation does not help much with e.g. product evaluation or (in general) with the consumer effects of a program, being mainly staff-oriented.

TRANSCOGNITIVE See Hypercognitive.

TREATMENT A term generalized from medical research to cover whatever it is that we're investigating; in particular, whatever is being applied or supplied to, or done by, the experimental group that is intended to distinguish them from the comparison group(s). Using a particular brand of toothpaste or toothbrush, or reading an advertisement or textbook, or going to school are all examples of treatments. "Evaluand" covers these, but also products, plans, people, etc.

TRIANGULATION Originally the procedure used by surveyors to locate ("fix") a point on a grid. In evaluation, or scientific research in general, it refers to the attempt to get a fix on a phenomenon by approaching it from more than one independently based route. For example, if you want to ascertain the extent of sex stereotyping in a company, you will interview at several levels, you will examine training manuals and interoffice memos, you will observe personnel interviews and files, you will analyze job/sex/qualification matches, job descriptions, advertising, placement and so on. In short, you avoid commitment to the validity of any one source by the process of triangulation.

TRUE CONSUMER Someone who, directly or indirectly, receives the services etc. provided by the evaluand. Does not include the service providers, though they are also part of the impacted population. Is usually a very different group from the target population (intended consumers).

TRUE EXPERIMENT A "true experiment" or "true experimental design" is one in which the subjects are matched in pairs or by groups as closely as possible and then one from each pair or group is randomly assigned to (be) the control and the other to the experimental group. The looser-and-larger-numbers version skips the matching step and just assigns subjects randomly to each group. (Cf. ex post facto design and quasi-experimental design.)
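
A minimal sketch of the matched-pairs version, on the assumption that the pairs have already been formed; the function and subject names are illustrative only:

    import random

    def assign_matched_pairs(pairs, seed=None):
        """For each matched pair, randomly send one member to the
        control group and the other to the experimental group."""
        rng = random.Random(seed)
        control, experimental = [], []
        for a, b in pairs:
            if rng.random() < 0.5:
                control.append(a); experimental.append(b)
            else:
                control.append(b); experimental.append(a)
        return control, experimental

    # e.g. subjects matched on pretest scores:
    control, experimental = assign_matched_pairs([("S1", "S2"), ("S3", "S4")])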

TWO-TIER SYSTEM (also called Multi-Tier System, and Hierarchical System) A system of evaluation, sometimes used in proposal evaluation (but also with considerable potential in personnel evaluation), where an attempt is made to reduce the total social cost of the ordinary RFP system by requiring two rounds of competition. The first, which is the only one RFP'd, involves stringent length restrictions on the proposal, which is supposed to indicate just the general approach and, e.g., personnel available. These brief sketches are then reviewed by panels that can move through them very fast, and a small number of promising ones are identified. Grants are (sometimes) made to the authors of this "short list" of bidders in order to cover their costs in developing full proposals. The relatively small number of full proposals is then reviewed by a (sometimes considerably smaller) group of reviewers or reviewing panels: the second tier of the review system. The mathematics of this varies from case to case, but it's worth looking at an example. Suppose we simply put out the usual kind of RFP for improvement of college science teaching laboratories. We might get back 600 or 1,200 proposals, averaging perhaps 50 or 60 pages in length. For convenience let's say they average 50 pages and we get 1,000 of them. That's 50,000 pages of proposals to be read, and 50,000 pages of proposals to be written. Even if reviewers can "read" 100 or 200 pages an hour, we're still looking at 250-500 hours of proposal reading, which means about 60 person-days of reading, i.e. a panel of 15 working for four days, two panels of 15 working two days, or ten panels of 6 working for one day. The problem is that you can't get good reviewers for four days; and the small panels require more personnel to staff, and then have to face the serious problem of interpanel differences. Now if we go to a two-tier system, then we can place an upper limit of, say, five pages on the first proposal and, although we may get a few more, that's a good result since it means that we'll get some entries from people who don't have the time or resources required to submit massive proposals. So we might start with 1,200 five-pagers, which is 6,000 pages, and we've immediately got a reduction of 88 percent in the amount of reading that's done, with the result that a single panel can reasonably manage it. Then there will be perhaps ten or twenty best proposals coming in at the 50-page length, which can be handled quite quickly, and indeed much more carefully, by the same panel, reconvened for that purpose. Notice also that the reading speed for the first tier of proposals may be higher, since all the readers have to do is to be sure they're not missing a promising proposal, rather than to rank-order for final award. And validity should be higher. Notice the triple savings that are involved: the proposers can save about 90 percent of their costs (it may not be quite so high, because shorter proposals take more than a prorated-by-page amount of resources, but it's still substantial); the agency saves a great deal of cost in paying raters or panelists, and heavy staff work costs; and the reliability of the process, as well as the quality of available judges, goes up significantly. Hence the small subsidy for the second-tier proposal is more than justified, both fiscally and in terms of encouraging entries from people that couldn't otherwise afford it, and better entries from those that can.
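
The arithmetic of the worked example can be checked mechanically. A sketch with the entry's own figures as inputs; the variable names and the eight-hour reading day are our assumptions:

    # One-tier baseline: 1,000 proposals averaging 50 pages each.
    one_tier_pages = 1_000 * 50              # 50,000 pages to read (and write)
    hours_at_200 = one_tier_pages / 200      # 250 hours of reading
    hours_at_100 = one_tier_pages / 100      # 500 hours of reading
    person_days = hours_at_100 / 8           # 62.5, i.e. "about 60 person-days"

    # Two-tier first round: 1,200 sketches capped at 5 pages.
    first_tier_pages = 1_200 * 5             # 6,000 pages
    reduction = 1 - first_tier_pages / one_tier_pages
    print(f"reading load cut by {reduction:.0%}")   # 88%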


TYPE 1/TYPE 2 ERRORS See Hypothesis Testing.

UNANTICIPATED OUTCOMES Often used as a synonym for side-effects, but only loosely equivalent, since: outcomes may be unanticipated by inexperienced planners but readily predictable by experienced ones; effects that are anticipated but not goals are (sometimes) still side-effects, and sometimes not (e.g. having to rent offices).

UNCERTAINTY, Evaluating. See Risk.

UNOBTRUSIVE MEASUREMENT The opposite of reactive measurement. One that produces no reactive effect, e.g. observing the relative amount of wear on the carpet in front of interactive displays in a science museum as a measure of relative amounts of use. Sometimes unethical, and sometimes ethically preferable to obtrusive evaluation. ("Obtrusive" is not necessarily "intrusive"; it may be obvious but not disruptive.)

UTILITY (Econ.) The value of something to someone or to some institution. "Interpersonal comparison of utility" is the stumbling-block of (welfare) economics. Sometimes measured in the hypothetical units of "utiles". See Apportionment.

UTILIZATION (of evaluations) This refers either to the effort to improve implementation of an evaluation's recommendations or to a metaevaluative focus on the extent to which evaluations have been utilized. Utilization/implementation must be planned into evaluations from the first moment; indeed, if the client isn't in a position to utilize the results appropriately, an ethical question arises as to whether the evaluation should be done. Standard procedures include putting representatives of the evaluees on the evaluation team or advisory panel; soliciting and using suggestions from the whole impacted population about design and findings; identifying and focusing on positive benefits of the evaluation if implemented; using appropriate language, length and formats in the report(s); establishing a balance of power to reduce threat; and, most importantly, a heavy emphasis on explaining/teaching about the particular and general advantages of evaluation. See also Implementation of Evaluations.


VALIDITY A test is valid if it really does measure what it purports to measure. It can be reliable (in the technical sense) without being valid, and it can be valid without being credible. But if it's valid it has to be reliable: if the thermometer is valid, it must say 100 degrees Centigrade whenever placed in boiling water, and hence must agree with itself, i.e. be reliable. There are various subspecies of validity in the jargon (especially face, content, construct, and predictive validity), but they represent an inflation of methodological differences into supposed conceptual distinctions, except perhaps "face-valid," which possibly should be distinguished since it only means "looks valid." Serious investigation of validity will identify the appropriate kind for the (e.g.) test being studied; one should not talk about "valid in this sense, but not in that," only about "valid in the appropriate sense."

VALUED PERFORMANCE A value-imbued descriptive variable, imbued with value by the context. For example, in the context of evaluating hot rods, the standing-start quarter-mile time is the principal evaluative measure, the valued performance. On the one hand it's totally factual/descriptive; on the other hand, it is contextually imbued with value and is treated exactly as if it logically involved the concept of merit. Cf. Crypto-evaluative term.

VALUE-FREE CONCEPTION OF SCIENCE The belief that science, and in particular the social sciences, should not or cannot properly infer to evaluative conclusions on the basis of purely scientific considerations. Mistakenly assumed to be a consequence of empiricism, though in fact it requires the further (erroneous) premise that inference from facts to values is impossible; the error is precisely analogous to the error of supposing that one cannot infer to conclusions about theoretical constructs from observations. (Popper's simplistic attack on induction is thus partly responsible for the continued support of the value-free doctrine.) Apart from the logical errors, there is the evidence of one's senses that science is redolent with highly responsible and well-justified scientific evaluation of research designs, of estimates, of fit, of instruments, of explanations, of research quality, of theories. That the value-free position was maintained at all in the face of these considerations requires an explanation in terms of valuephobia. See Needs Assessment and LE.

VALUE-IMBUED TERM See Valued Performance.

VALUEPHOBIA The resistance to evaluation that generated the myth of value-free science, the attacks on properly-used testing or course grading (see Kill the Messenger), on program evaluations for accountability, and on the evaluation of college faculty is often more than any rational explanation can cover. We use the term "valuephobia" to cover it without any implications of neurosis, just irrationality. Of course the natural defensive strategy (attack anything that is a threat) is part of it; but part of it goes deeper, into the unwillingness to face possibly unpleasant facts about oneself even if it means large long-run benefits. (This phenomenon, related to "denial," is seen in people who won't go to a doctor because they don't want to hear about imperfections.) Valuephobia leads to many abuses: e.g. to pathetic guarantees that an evaluation will be done "only to help, not to criticize" (if there are no valid criticisms, there's rarely any justification for help of programs/performances involving professionals); to the substitution of implementation monitoring for outcome-based program evaluations; to the refusal of professional associations to use professional standards in their own accreditation or enforcement procedures; to excessive involvement of evaluation staff with the program staff ("to reduce anxiety" or "to improve implementation"), which frequently produces pablum evaluations; and (via guilt) to the absurd ratio of favorable to unfavorable program evaluations, absurd given what we really know about the proportion of bad programs. The clinical status of valuephobia as a U.S. cultural phenomenon is more obvious to a visitor from e.g. England, where very tough criticism in the academy is not taken personally to the degree it is here; and it is in this country that Consumers Union was listed by the Attorney-General as a subversive organization and (independently) banned from advertising in newspapers. But the ubiquity of valuephobia is more important; Socrates was killed for his teaching and application of evaluative skills, and dictators today seem no less inclined to murder their critics than the Greek "democracy." Humility may best be construed not as the avoidance of self-regard but as the valuing of criticism: the outcome of successful "treatment" (hopefully educational rather than therapeutic) for valuephobia; this should be combined with some capacity to distinguish good from bad criticism. See Educational Role.

VALUES (in evaluation; & measurement of) The values that make evaluations more than mere descriptions can come from a variety of different sources. They may be picked up from a relevant and well-tried set of e.g. professional standards. They may come from a needs assessment, which might show that children become very ill without a particular dietary component (i.e. need it). Or they may come from a logical and pragmatic analysis of the function of something (processing speed in a computer is a virtue, ceteris paribus). They may even come from a study of wants and of the absence of ethical impediments to their fulfilment (e.g. in building a better roller-coaster). In each of these cases, the foundations are factual and the reasoning is logical; nothing comes in that a scientist should be ashamed of. But something hovers in the background that scientists are embarrassingly incompetent to handle, namely ethics. Without doing ethics, however, most evaluations can be validated by just checking for salient ethical considerations that might override the non-ethical reasoning. The values/preferences that sometimes come into the evaluation as the ultimate data range in visibility from obvious (political ballots) to very inaccessible (attitudes towards job security, women supervisors, censorship of pornography). Most instruments for identifying the more subtle ones are of extremely dubious validity; they are best inferred from behavior; although that inference is also difficult, it begins with the kind of event we are (usually) hoping to influence. Some simulations are so good that they probably elicit true values, especially if not very important ones are involved; usually behavior in real situations should be used.

VARIABILITY The extent to which a population is spread out over its range, as opposed to concentrated near one or a few places (or modes); the feature that produces dispersion.
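
The entry stops short of a formula; the standard indices of dispersion are the variance and its square root, the standard deviation. For a population of N values x_1, ..., x_N with mean \bar{x}:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2, \qquad \sigma = \sqrt{\sigma^2}$$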

WHOLISTIC Alternative spelling of Holistic.

WHY DENY A conference with the staff of a funding agency which unsuccessful bidders on an RFP may request and at which they are informed about the reasons why they lost out. One of the consequences of the recent move towards openness. Unfortunately the failure to use salience scoring and other systematic procedures means that reviewer and staff feedback is very difficult to interpret in a useful way.

WIRED A contract or an RFP is said to be "wired" if, either through its design and requirements or through an informal agreement between agency staff and a particular contractor, it is arranged so that it will go to that contractor. Certainly illegal, and nearly always immoral. The mere fact that the RFP, with intrinsic good reasons, pre-determines the contractor, e.g. because the problem can in fact only be handled by an outfit with two Cray computers, does not constitute wiring.

WORTH System value, by contrast with intrinsic value, i.e. merit; e.g. market value is a function of the market, not of the evaluand's own virtues. The worth of a professor is a function of the enrollment in her or his classes, grant-getting, relation to the college's mission, role-modeling function for prospective/actual women or minority students, as well as his/her professional merit. The latter is a necessary but not sufficient condition for the former.

ZERO-BASED BUDGETING (ZBB) A system of budgeting in which all expenditures have to be justified, rather than only additional expenditures (i.e. variations from "level funding").


ACRONYMS & ABBREVIATIONS

AA Audit Agency; a division of HHS that reports directly to the Secretary and does internal audits (cf. GAO) that amount to evaluations of program efforts and contracts, including evaluations. Has moved from CPA orientation to much broader approach and does much very competent work (though spread a little thin); still doesn't look at e.g. validity of test-instruments used.

AAHE American Association for Higher Education

ABT Properly, Abt Associates. Large shop with strong evaluation capability; headquarters Cambridge, Massachusetts.

ACT American College Testing; big Iowa-based educational measurement shop.

AERA American Educational Research Association

AID Agency for International Development

AIR American Institutes for Research, a Northern California-based contractor with some evaluation capability.

ANCOVA Analysis of covariance

ANOVA Analysis of variance

ATI Aptitude-treatment interaction

AV Audiovisual

AVLINE Online audio-visual database maintained by NLM

CAI Computer-assisted instruction

CBO Congressional Budget Office. Provides analysis and evaluation services to Congress, as GAO does for the administration.

CBTE, CBTT, CBTP Competency Based Teacher Education, Training or Preparation

CDC Control Data Corporation; one of the top five computer companies.

CEEB College Entrance Examination Board

CEDR Center for Evaluation, Development and Research (at Phi Delta Kappa)

CFE Cost-free evaluation

CIPP Daniel Stufflebeam and Egon Guba's model, which distinguished four types of evaluation: context, input, process, and product, all designed to delineate, obtain, and provide useful information for the decision-maker.

CIRCE Center for Instructional Research and Curriculum Evaluation, University of Illinois, Urbana, Illinois.

CMHC Community Mental Health Center or Clinic

CN Consultants News, the highly independent newsletter of the management consulting area, run by talented loner Jim Kennedy.

COB Close of business (end of working day; a proposal deadline)

CRT Criterion-referenced test (or cathode ray tube, the display monitor on some computers)

CSE Center for the Study of Evaluation (at UCLA)

CSMP Comprehensive School Mathematics Program

DEd (properly ED) Department of Education (ex-USOE)

DOD Department of Defense


DOE Department of Energy

DRG Division of Research Grants

DRT Domain-referenced test

ED Education Department

EIR Environmental Impact Report

EN Evaluation News, the newsletter of the Evaluation Network

ENet Evaluation Network, an organization of evaluators

EPIE Educational Products Information Exchange

ERIC Educational Resources Information Center; a nationwide information network with its base in Washington, D.C. and 16 clearinghouses at various locations in the U.S.

ERS Evaluation Research Society

ESEA Elementary and Secondary Education Act of 1965

ETS Educational Testing Service; headquarters in Princeton, N.J.; branches in Berkeley, Atlanta, etc.

FRACHE Federation of Regional Accrediting Commissions of Higher Education

FY Fiscal year

G & A General and administrative (expenses, costs)

GAO General Accounting Office. The principal semi-external evaluation agency of the Federal government.

GBE Goal-based evaluation

GFE Goal-free evaluation

GIGO Garbage In, Garbage Out (from computer programming; see meta-analysis)

GPA Grade-point average

GPO Government Printing Office, Washington, D.C.

GRE Graduate Record Examination


HEW Department of Health, Education and Welfare, now divided into ED and HHS.

HHS Department of Health and Human Services

IBM International Business Machines.

IOX Instructional Objectives Exchange in Los Angeles

K $1000 as in "16K for evaluation."

K-12 Kindergarten through high school years

K-6 The domain of elementary education

LE The Logic of Evaluation, a monograph by the present author in this series

LEA Local Education Authority (e.g. school district)

LSAT Law School Admission Test

M Thousand, as in "$16M for evaluation."

MAS Management Advisory Services; term usually refers to subsidiaries of the Big 8 accounting firms.

MBO Management by Objectives

MCT Minimum Competency Testing

MIS Management Information System; usually a computerized database combining fiscal, inventory, and performance data.

MMPI Minnesota Multiphasic Personality Inventory

MOM Modus Operandi Method

NCES National Center for Educational Statistics

NCHCT National Center for Health Care Technology

NIA National Institute on Aging

NICHD National Institute of Child Health and Human Development

NIE National Institute of Education (in ED)

NIH National Institutes of Health (includes NIMH, NIA, etc.), or Not Invented Here (so don't encourage its use, because someone else will get the credit and people will think we can't manage our own affairs.)


NIJ National Institute of Justice

NIMH National Institute of Mental Health

NLM National Library of Medicine

NSF National Science Foundation

NWL Northwest Lab, Portland, Oregon. One of the federal network of labs and R & D centers; currently has strongest evaluation staff.

OE Office of Education

OHDS Office of Human Development Services

OJT On-job training

OMB Office of Management and Budget

OPB Office of Planning and Budgeting

ONR Office of Naval Research; sponsor of e.g. Encyclopedia of Educational Evaluation.

OTA Office of Technology Assessment

P & E Planning and Evaluation; a division of HEW/HHS, including regional offices, where it reports directly to Regional Directors. In ED, currently called OPB.

PBTE Performance Based Teacher Education

PDK Phi Delta Kappa, the influential and quality-oriented educational honorary.

PEC Product Evaluation Checklist, forerunner of KEC.

PERT Program Evaluation and Review Technique

PHS Public Health Service

PLATO The largest CAI project ever; headquarters at the University of Illinois/Champaign. Mostly NSF funded in development phase, now CDC-controlled.

PPBS Planning-Programming-Budgeting-System

PSI Personalized System of Instruction (a.k.a. The Kel-ler Plan)

PT Programmed Text


RAND Big Santa Monica-based contract research and evaluation and policy analysis outfit. Originally a U.S. Air Force "creature" (civilian subsidiary), set up because they couldn't get enough specialized talent from within the ranks; name came from Research And Development. Now independent non-profit, though still does some work for USAF.

RFP Request for proposal

SAT Scholastic Aptitude Test, widely used for college admissions.

SDC Systems Development Corporation in Santa Monica; another large shop like RAND with substantial evaluation capability.

SEA State Education Authority

SEP School Evaluation Profile

SES Socioeconomic status

SMSG School Mathematics Study Group. One of the earliest and most prolific of the federal curriculum reform efforts.

SRI Originally Stanford Research Institute; in Menlo Park, CA; once part-owned by Stanford University, now autonomous. Large "shop" which does some evaluation.

TA Technology Assessment or Technical Assistance orTeaching Assistant

TAT Thematic Apperception Test

TCITY Twin Cities Institute for Talented Youth. Site of the first advocate-adversary evaluation.

USAF United States Air Force. Heavy R & D commitment, like Navy, and unlike Army or Marine Corps.

USDA United States Department of Agriculture

USOE United States Office of Education, now ED or DEd (Department of Education).
