13
Psychological Bulletin 1970, Vol. 74, No. 1, 68-80 HOW WE SHOULD MEASURE "CHANGE"—OR SHOULD WE? 1 LEE J. CRONBACH 2 AND LITA FURBY" Stanford University Procedures previously recommended by various authors for the estimation of "change" scores, "residual" or "basefree" measures of change, and other kinds of difference scores are examined. A procedure proposed by Lord is extended to obtain more precise estimates, and an alternative to the Tucker-Damarin-Messick pro- cedure is offered. A consideration of the purposes for which change measures have been sought in the past leads to a series of recommended procedures which solve re- search and personnel-decision problems without estimation of change scores for individuals. A persistent puzzle in psychometrics has been "the measurement of change." Many in- vestigators have felt, for reasons good or bad, that their substantive questions required a measure of gain in ability or shift in attitude. "Raw change" or "raw gain" scores formed by subtracting pretest scores from posttest scores lead to fallacious conclusions, primarily be- cause such scores are systematically related to any random error of measurement. Although the unsuitability of such scores has long been discussed, they are still employed, even by some otherwise sophisticated investigators. At the end of this paper the authors argue that gain scores are rarely useful, no matter how they may be adjusted or refined. The authors also distinguish four kinds of inquiry for which such scores have been used, and con- clude that only one of these purposes is well served by any kind of gain score. This argu- ment applies not only to changes over time, but also to other differences between two vari- ables. The first part of the paper proposes superior ways of estimating true change and true resid- ual change scores. It may seem pointless to discuss such matters when in the end we rec- ommend against their use (save in a few kinds of investigation). However, the development of formulas clarifies the model, providing a 1 Initial work on this paper was conducted under a contract with the Cooperative Research Program of the United States Office of Education. We thank many colleagues for helpful comments, particularly Janet D. Elashoff, Gene V. Glass, Ingram Olkin, and Julian C. Stanley, Jr. 2 Requests for reprints should be sent to Lee J. Cron- bach, School of Education, Stanford University, Stan- ford, California 94305. 8 Now at the University of Poitiers, France. base for the final recommendations, and allows us to explain the limitations of previous papers on the subject. Furthermore, it develops supe- rior estimators for the kinds of problem where we do recommend using a measure of change. Very likely some investigators will decide to obtain change or difference scores, even for problems where we consider such measures in- appropriate. Such a person will often find one of our estimation formulas better than those now suggested in the literature. Methods of handling data from successive measurements have been offered by several writers (Harris, 1963). We shall be particularly interested in the related proposals of Lord (1956, 1958, 1963) and McNemar (1958), who estimate an individual's "true change"; their approach is clearly superior to conventional techniques. This paper extends the Lord- McNemar reasoning to get a still better estimate. DuBois (1957) and other investigators rec- ommend a "residual gain" score as a substitute for the "raw gain" score. A gain is residualized by expressing the posttest score as a deviation from the posttest-on-pretest regression line. The part of the posttest information that is linearly predictable from the pretest is thus partialled out. Tucker, Damarin, and Messick (1966) draw attention to the "true residual gain," which they refer to as a "basefree mea- sure of change." FORMULATION The Lord-McNemar argument considers measures X and Y, obtained by applying the same operation to the subject on two occasions. The subject has an observed difference score D = Y — X. Scores X x and Y x , represent- 68

HOW WE SHOULD MEASURE CHANGE OR SHOULD WE? · HOW WE SHOULD MEASURE "CHANGE" OR SHOULD WE?1 LEE J. CRONBACH2 AND LITA FURBY" Stanford University ... served by any kind of gain score

Embed Size (px)

Citation preview

Psychological Bulletin1970, Vol. 74, No. 1, 68-80

HOW WE SHOULD MEASURE "CHANGE"—OR SHOULD WE?1

LEE J. CRONBACH2 AND LITA FURBY"

Stanford University

Procedures previously recommended by various authors for the estimation of"change" scores, "residual" or "basefree" measures of change, and other kinds ofdifference scores are examined. A procedure proposed by Lord is extended to obtainmore precise estimates, and an alternative to the Tucker-Damarin-Messick pro-cedure is offered. A consideration of the purposes for which change measures havebeen sought in the past leads to a series of recommended procedures which solve re-search and personnel-decision problems without estimation of change scores forindividuals.

A persistent puzzle in psychometrics hasbeen "the measurement of change." Many in-vestigators have felt, for reasons good or bad,that their substantive questions required ameasure of gain in ability or shift in attitude."Raw change" or "raw gain" scores formed bysubtracting pretest scores from posttest scoreslead to fallacious conclusions, primarily be-cause such scores are systematically related toany random error of measurement. Althoughthe unsuitability of such scores has long beendiscussed, they are still employed, even bysome otherwise sophisticated investigators.

At the end of this paper the authors arguethat gain scores are rarely useful, no matterhow they may be adjusted or refined. Theauthors also distinguish four kinds of inquiryfor which such scores have been used, and con-clude that only one of these purposes is wellserved by any kind of gain score. This argu-ment applies not only to changes over time,but also to other differences between two vari-ables.

The first part of the paper proposes superiorways of estimating true change and true resid-ual change scores. It may seem pointless todiscuss such matters when in the end we rec-ommend against their use (save in a few kindsof investigation). However, the developmentof formulas clarifies the model, providing a

1 Initial work on this paper was conducted under acontract with the Cooperative Research Program of theUnited States Office of Education. We thank manycolleagues for helpful comments, particularly Janet D.Elashoff, Gene V. Glass, Ingram Olkin, and Julian C.Stanley, Jr.

2 Requests for reprints should be sent to Lee J. Cron-bach, School of Education, Stanford University, Stan-ford, California 94305.

8 Now at the University of Poitiers, France.

base for the final recommendations, and allowsus to explain the limitations of previous paperson the subject. Furthermore, it develops supe-rior estimators for the kinds of problem wherewe do recommend using a measure of change.Very likely some investigators will decide toobtain change or difference scores, even forproblems where we consider such measures in-appropriate. Such a person will often find oneof our estimation formulas better than thosenow suggested in the literature.

Methods of handling data from successivemeasurements have been offered by severalwriters (Harris, 1963). We shall be particularlyinterested in the related proposals of Lord(1956, 1958, 1963) and McNemar (1958), whoestimate an individual's "true change"; theirapproach is clearly superior to conventionaltechniques. This paper extends the Lord-McNemar reasoning to get a still betterestimate.

DuBois (1957) and other investigators rec-ommend a "residual gain" score as a substitutefor the "raw gain" score. A gain is residualizedby expressing the posttest score as a deviationfrom the posttest-on-pretest regression line.The part of the posttest information that islinearly predictable from the pretest is thuspartialled out. Tucker, Damarin, and Messick(1966) draw attention to the "true residualgain," which they refer to as a "basefree mea-sure of change."

FORMULATION

The Lord-McNemar argument considersmeasures X and Y, obtained by applying thesame operation to the subject on two occasions.The subject has an observed difference scoreD = Y — X. Scores Xx and Yx, represent-

68

MEASURING "CHANGE" 69

ing the person's "true" status at these times,are postulated. The "true difference" DK equalsFM — XM. The key topic of the McNemarpaper and the Lord papers is the determinationof regression coefficients for an estimator of theform:

Ao = PiX + /3«F + constant [1]

As Lord has written more extensively thanMcNemar on this matter, we shall refer onlyto Lord hereafter, though many of the com-ments also apply to what McNemar has said.

The development holds so long as X and Fare referred to the same numerical scale. Thatis to say, the investigator must be willing tosay what score on F (or F,,) is comparable to agiven score on X (or Xx). The relation must bereciprocal in the sense that, if the given X ismapped into a certain F, that value of F ismapped into the given X. It is particularlyimportant to note that this formulation side-steps the philosophically troublesome question,Are pretest and posttest "measuring the samevariable"? A common metric is the only re-quirement. Even this requirement is dispensedwith when we turn to a regression estimate ofoutcome.

The model applies to any two measures thatcan be sensibly expressed on the same numeri-cal scale. This might be, for example, a stan-dard-score scale or an age-equivalence scale.The data might be ratings of two distinct traitsexpressed on the same reference scale. Hencestatements made about change scores can beextended to any kind of difference score.Among the more famous lines of research em-ploying difference scores are work on "over-achievement," "empathy and insight," "self-concept" versus "ideal self," and "differentialaptitudes." Indices of the same psychometriccharacter are also involved in studies of re-tention and transfer.

Assumptions. There are two variables and XY. For each variable, the expected value overindependent observations of the same persondefines a true score: E(X) = Xa and E(F)= FM. Then Da = F» - Xx. The investigatorwho identifies F with X as "the same opera-tion" must keep the true scores distinct. In astudy of change, Xx is the person's averagescore over measurements of X that might bemade at Time 1; F«, is the average over ob-servations by the same procedure at Time 2.

Errors are uncorrelated with true scores.But we do not assume that all correlations p%Yare equal. Sometimes X and F observations are"linked," as when the two scores are obtainedfrom a single test or battery administered atone sitting, or when observations on differentoccasions are made by the same observer. Thecorrelation between linked observations willordinarily be higher than that between inde-pendent observations. This distinction has notbeen made in the literature on change, but itdoes appear in a paper by Stanley (1967) ondifference scores.

We develop a formal mathematical modelso that some parts of our argument can be ex-plicit rather than intuitive. We retain theclassical concept of strictly parallel observa-tions, but modify the classical concept of in-dependence, as suggested above.

1. Let Xpi represent the observed score ofperson p on the X variable observed undercondition i. The condition i may be, for ex-ample, a particular form of the test that mea-sures X. In studies where there are severalsources of "error" such as test form, observer,and short-term fluctuations in the state of theperson, we assume that these sources are com-pletely confounded in the design used to deter-mine reliability coefficients and correlations.

2. Where the classical theory considers truescore as the sum of observed score and error, weintroduce two random errors: Xpt = X^p+ epi +fpi. Likewise, Ypi = YMP + epi + lpiThis is required to formalize the concept ofindependence adequately.

If, as in the classical concept of parallel mea-sures, one assumes that the several measures ofX (or F) are strictly interchangeable, therecan be no linkage. Machinery for explicitly de-nning independence can be developed withsome concepts from generalizability theory.There is a universe of possible conditions ofobservation of X. (While observations mayvary with respect to test form, occasion, ob-server, etc., we shall avoid here the complica-tions of multifacet theory [Gleser, Cronbach,& Rajaratnam, 1965]. That theory recognizes,as this paper does not, that one may have twomeasurements with the same test form ondifferent occasions, or two measurements ondifferent test forms on the same occasion, or

70 LEE J. CRONBACH AND L1TA FURBY

two measurements where both form and oc-casion differ.) There is a universe of observa-tions of F, and we shall assume that these maybe made under the same set of conditions i thatare used for X observations. Then observationsXi and Yi made under the same condition i aresaid to be linked, and observations X{ and F\>made under different conditions are said to beindependent.

Formally, the model specifies that conditionsof observations are drawn from the universe.If a single i is drawn and used to obtain bothscores Xpi and Ypi for person p, we have link-age; if i and i' are drawn independently to givescores Xpi and Yfi>, we have independence.Errors are random and independent, exceptthat when X and F arc both observed undercondition i, the/ and f components are sampledsimultaneously. It will not be true, in general,that <r(fpi,ipi) = 0. The following assumptionsare made regarding all e components: Theirmean over persons is zero for every condition;their variance over persons is the same forevery condition; their intercorrelations withother components are zero. The e componentssatisfy the same assumptions and a-(epi,epi)— o'C/pi.Cpi) = 0. With regard to the / com-ponents, zero means, equal variances, and zerocorrelations with true scores are assumed(likewise for f). Also, <r(/pi,ipi') = 0 for allpairs where i is not identical to i'.

3. The measures of X made under differentconditions are parallel. It follows from the as-sumptions above that the measures have equalmeans, equal variances, and equal intercor-relations. The same is true for F measures.

4. It follows, now, that<r(Xpi,Ypii) = ^(XaspfYaop),

hence this covariance is the same for all in-dependent observations of X and F.

For linked observations the covariance

o'(Xpi,YPi) = <r(Xxp,Yxp) + a (fpi,i pt)

We assume that <r(fp,-,fjn) is the same for all i,hence that the linked covariance is the samefor all linked X, Y observations. The covari-ance of / with f may be large or small, depend-ing on the extent to which the condition iinfluences the X and F performances.

We shall simplify and compress notation inseveral convenient ways. We write pxx< for areliability coefficient, and pxy for a correlation.

For emphasis, when linked observations arecorrelated or used to form a difference score wemay refer to Xa and Ya; their covariance andcorrelation we shall designate *<TXY and *pxv-We shall similarly write Xa and F& (or the like)for a pair of independently observed scores, andfor the covariance and correlation shall write°<fxY and Opxy-

Ordinarily, | *PXY | > | °PXY | . It is possiblethat *pxv < °PXY, where Xa and Ya have somecomplementary relation. For instance, if rateof reading and comprehension are measured onthe same selections simultaneously, one score islikely to rise at the expense of the other and<r(fpi,tpi) will be negative.

5. It is assumed that the population param-eters are known. All parameters consideredmust be for the same population.

A regression estimate of a true score is usu-ally improved if the data permit one to derivean equation for a subpopulation — for example,for ninth-grade boys in a certain school ratherthan for the national ninth-grade population.The investigator using actual data will oftenhave made no reliability study on his sample.If he uses a reliability coefficient or a value ofPXY from the published study he must adjustit to take into account the variance of hissample.

RELIABILITY OF A DIFFERENCE SCORE

As Stanley (1967) pointed out, linkage mustbe taken into account in defining a reliabilitycoefficient for differences. Classical theory,ignoring linkage, defines a reliability

PDD' — 2 =

where D = Y — X. The reliability coefficientspxx' and pyy are correlations of independentobservations. Likewise, because of the inde-pendence assumption, classical theory can onlyinterpret OXY as what we have called OPXY.

The reliability of a difference score is definedas the correlation of the score with an indepen-dently observed difference. Unlike the classicalpapers, Stanley distinguishes p(Ya[-xa)(Yt-xb)from p(Ya-xb)(Ye-xd')- These are distinct kinds ofreliability, one for a difference of linked X andY and one for a difference of independent XandF.

If the observed difference is D^ = Ya — X^

MEASURING "CHANGE" 71

(i.e., if X and F are experimentally indepen-dent), the covariance with an independentlyobserved difference Dca is

ff^XpXX' + O^YPYY' — 2ffxffY°pXY- [3]

Even if Daa = Ya — Xa (i.e., X and F are ex-perimentally linked), the covariance with asimilar but independently observed linkeddifference DM = Ft — Xt, is the same. HenceFormula 3 is the appropriate numerator forFormula 2 in both cases.

The variance of D for the independent case is

o-2jr + trV — 2<rx<rr°pXY. [4]

For the linked case, however, Daa = Ya — Xa

and the variance equals

cr^x + <rV — 2<rxay*pxY- [5]

Hence for the linked case the reliability co-efficient p(Ya-xa)(Yb-xb) is obtained by dividingFormula 3 by Formula 5 :

*PDD' —PXY

PXY[6]

whereas for the independent case the reliabilitycoefficient p(Ya-xt,)<Y<r-xd'> is Formula 3 dividedby Formula 4:

°PDD' = ' [7]

Since °PXY < *PXY in most instances, °PDD'for the independent case will most likely besmaller than *PDD> for the linked case. Bothreliability coefficients are meaningful. Theydescribe the correlation between differencesobserved according to different experimentaldesigns. Distinctions like that between Equa-tions 6 and 7 have to be made in consideringthe reliability of any composite, weighted orunweighted.

In the discussion that follows formulas arewritten in terms of the linked case; the sub-stitution to fit the fully independent case willbe obvious.

ESTIMATORS OF TRUE CHANGE

Primitive Formulas

Among the possible estimators of Dx arethree simple formulas.

Raw gain. The simplest formula is:

D = Y -X [8] [1]

Correction by simple regression for error in X.If Xx is a regression estimate of X«, from X ,

Ac = F-£, = F-PM.jr+constant [9] [2]

In this and all other equations through Equa-tion 20, the constant is one that makes themean (over persons) of the estimates equal tothe mean raw gain. The foregoing estimate hasoccasionally been suggested (e.g., Trimble &Cronbach, 1943), but it has seen little use.Closely related concepts appear in Lord'scomments on analysis of covariance (1960) andin a series of Swedish papers on the effect ofschooling on intelligence (see Harnqvist, 1968).

Correction by simple regression for error in Xand in Y. To take a further step, let YM be anestimate of Yx from Y. Then

Ac = YM — Xx = pyy Y — pxx-'.X'+constant[10] [3]

This procedure does not take the X, Y correla-tion into account.

The Lord Procedure

Lord pointed out that unless pxaYx = 0,both X and Y yield information about Xx. Amultiple regression procedure can be used toobtain

Xa=pXX'X+/3Xa(Y.X}(Y-X) + (l-pxx,)X [11]

Here Y • X is a partial variate, the deviation ofY from the value predicted by the regression ofY on X in the population to which the otherparameters in Equation 11 apply.

For the linked case we have

Ya-Xa = Ya- *~(Xa -X)-Y [12]

We know that

Now VXXY = <rxaYb = °ffxY (not *<TXY). UsingEquations 12 and 13, the numerator of (8 be-comes

aXx(Ya-Xa) = <rX<Ty(°pXY — *pXYpYY'} [14]

and the denominator becomes

Substituting in the regression equation, we

72 LEE J. CRONBACH AND LITA FURBY

arrive at

•A PXX' —

<rx(°PXY — *PXYPXX')

ery(l — •p'arr)7 + constant [16]

ThenA. = ?,-*«, [17] [4]

where the estimates on the right-hand sidecome from Equation 16 and its analog for Yx.Expanding the expression we have

1 [YM =1 «~2 \GYPYY'—<TY*PXY®PXY

*PXYPXX> — <TX

+constant [17a] [4]

For independent observations, one substitutes°PXY wherever *pxv appears in Equation 17a;this is Lord's estimator of Dx. In an experi-ment, all calculations must be made within atreatment group.

Estimator 4 is as good or better than any ofthose listed ahead of it, giving a smaller meansquare of 0X — Dx) and a larger correlationbetween estimate t>K and true score Dx. Ordi-narily Estimator 3 is better than Estimator 2and both are better than Estimator 1. Ourmain point in presenting Estimators 2 and 3 isto show the Lord estimator as an elaboratedform of a more conventional estimator. Thislays a base for the further refinement.

It may seem anachronistic in Formulas 11and 16 to use a posttest score to "predict" apretest score. But the logic is clear. Within atreatment, persons higher on the posttest thanothers having the same observed pretest scoretend to be those for whom the true pretestscore is higher than the observed score. The Yreceives at least nominal weight in the regres-sion equation when PXY is not zero or one, andpxx1 < 1.00. The weight given to Y increaseswith larger PXY and smaller pxx1-

Taking group membership into account. Pre-vious papers have implicitly assumed that allpersons come from a single population, butoften there are several distinct subgroupsThese groups may be distinguished by demo-

graphic characteristics or by past experience,or they may be groups receiving differenttreatments between the X and Y observations.One could pool all groups and determine asingle within-group value for each parameter inthe equations above, but parameters calcu-lated within subgroups will give a better esti-mate of Xx, Yx, and Dx. There is a limit tohow far subdivision of samples can profitablybe carried, however.

Correlations and Regression Slopes for Dx

Sometimes an investigator wishes to knowthe correlation of Dx with another variable,say Q. He should note, then, that the correla-tion of DM with Q is not a sound estimate of thecorrelation of Dx with Q. The correlation shouldbe determined directly from the covariance ofDX, with the second variable of interest. Thiscovariance takes a form such as

To get the correlation coefficient one divides byO-Q andcr£>M (notaoj. SinceDx equals Yx — XK,its variance is given by Equation 3. All param-eters must be those for the same group.

The investigator is often interested in the re-gression of Dx (or Fa,) on another variable.The slope of the regression of Dx on (e.g.) Xx

Attention must be paid to linkages. Let uswrite Qc to indicate that the observation of Qis independent of Xa, Ya, and F&.

_ ri8~l

This cannot be determined from data where Qis linked to X or F.

A variance-covariance algorithm. A simplecomputational routine can be suggested forproblems of this character. One may form avariance-covariance matrix of observed valuesas in Figure 1. Linked and independent covari-ances are carefully distinguished. The matrixmay be augmented as shown, adding rows andcopying forward covariances. Reliability in-formation is taken into account at certainpoints. Columns may be added to the matrixalso, in a symmetric manner. Thus all entriesin the Xx row can be copied into the Xx col-umn. The full extension carried out in this waygives a square matrix, from which such valuesas VD a and tr2z> can be read out.

MEASURING "CHANGE" 73

Copy independentcovariances corre-sponding to Row 1

Copy independentcovariances corre-sponding to Row 4

Fill in by sub-tracting Row 1from Row 2

Row

Row

Row

Row

Row

Row

Row

Row

1 Xa

2 Y

3 Qa

4 Yt

5 Q.

6 XM

7 Ya

8 Daa

Xa Ya Qa Yb Q, Xn

\ax *<?XY *<TXQ O<TXY O&XQ <rxpxx* t

t&XY <T Y tffyQ uYpYY' QfTYQ Q&XY

^&XQ ^&YQ G Q Q&YQ & QPQQ'

OffXY SYPYY' O,T9 «*Y Off¥Q

OffXQ OffyQ 0 QPQQ' OffYQ ff Q1

<r*xpxx' | O<rxr OyXQ OaxY OfXQ <r*xpxx'11

Otrxy ff^YPYY' OffYQ ffZYPYY' OffYQ OffxY

Fill in by sub-tracting Row 1from Row 4

Fill in by sub-tracting Row 6from Row 7

ROW 9 Dab

Row 10

Copy independent Row 11 ()»covariances fromRowS

FIG. 1. Algorithm for constructing covariances and variances. (Covariances tor linked observations are identifiedby the symbol •»-, and those for independent observations by CHT .The broken line separates the original data fromentries added later.)

A Better Estimate of True Change

Lord's formula uses only X and Y data, butwe shall bring in two further categories ofvariables, W and Z. The W and X are Time-1measures, but need not be simultaneous. Al-though we write W without vector notation,there are, in principle, any number of W vari-ables that can be used singly or in combination.Our statements apply to any W or any weightedcomposite of the W.

A W might be any score describing thesubject as he was prior to the treatment understudy or W might be an index based on his lifehistory. The scores Y and Z are posttreatmentmeasures—again, not necessarily simultane-ous. The Y might, for example, be a measure ofperformance at the end of training, and Z a re-tention test, a month later. Where we are ex-amining a difference score rather than a change

score, no distinction between W and Z vari-ables is needed.

The steps taken in going from Estimate 2 toEstimate 4 above can be extended to make useof W and Z information so as to reach an evenbetter estimate of Dx.

The complete estimator. If W is univariate andthere is no Z information,

= pxx>r-21(Y.X)

<r (Y-X)

+^^^(W-X,Y)+constant [19]

Here W • X, Y is a partial variate, the deviationof W from the value predicted by the regressionof W on X and 7.

If the W information is multivariate, a wholeseries of?partial variates enters. The order ofpartialling is arbitrary; one might write terms

74 LEE J. CRONBACH AND LITA FURBY

for X; W-X; Y-W, X; etc. Where there is Zinformation, one adds further terms, againemploying partial variates; W, as well as Xand F, is partialled out. The estimation pro-cedures must take into account any linkagebetween X, Y, W, and Z variables as in Equa-tion 16.

A similar equation is written for Yx.Finally, one comes to an equation of the form

+ 03W + /34Z + constant [20] [5]

Here /33PF stands for a string of several terms ofthe form PiWi if there are many W (likewisefor ptZ). This estimator is superior to Formula4, provided that the sample size is large enoughto justify assigning a large number of weights.Where sample size is insufficient, the number ofpredictors must be held down, most likely byemploying the first one or more principal com-ponents of the W set as predictors (likewisefor Z).

When the problem becomes complicated, itis better to use efficient computing routinesthan to write out elaborate formulas. Thewithin-treatment covariance matrix for X, F,W, Z is written. Additional rows and columnsfor XK and Yx are formed as in Figure 1, withcare to enter independent or linked covariancesas required. The Xx column is subtracted fromthe Fa, column to form the Dx column. Whenthe symmetric matrix is complete, one appliesa multiple-regression program, treating entriesin the Dx column as "test-criterion covari-ances" and the appropriate X, Y, W, and Z aspredictors. If the observed scores have X andF linked, for example, then Xa and Ya areused as predictors, and covariances for F& areignored.

Demographic information and informationabout experience can and should be consideredas W variables. With a variable such as sex, onehas a choice of entering it directly as a variablecoded 1 and 0, or of performing a separate re-gression analysis for each sex. Both proceduresregress the person's score toward the meanfor his own sex rather than toward the mean forall cases. The second procedure allows for thepossibility that the regression surface for malesdiffers from that for females. Separate within-subgroup regressions would seemingly be pre-ferred when samples are truly large.

The difficulty is that the argument can berepeated for every other noncontinuous vari-able and for all combinations of them. Indeed,it applies to continuous variables also; forexample, a regression surface for more anxiouschildren may differ from that for the less anxi-ous. These remarks amount to entertaining thepossibility of nonlinear relationships. Whilethis possibility is real enough, one can rarelyget usable estimates of nonlinear functionsfrom samples of practical size (Burket, 1964;Goldberg, 1969). Hence, after one has dividedthe sample into a few salient subgroups, eachhaving a suitably large size, the dummy-variable technique seems to be the only feasibleway to handle a variety of nonquantitativeinformation.

RESIDUALIZED GAINS

Developments parallel to those above lead toso-called "residual gains" or "basefree mea-sures of change."

A Iternative Estimators

The raw residual-gain score is defined by

D-X<=Y-E(T)\X = Y-Y-/3y.x(X-X)[21] [!']

We designate this Formula 1' to emphasize thatit is comparable to Formula 1, the raw gain. IfYa and Xa define the difference score, j3y.xequals *pxro'r/o'x- If YI rather than Ya is used,Opxy replaces *PXY- The traditional definitiongiven by Equation 21 is ambiguous; the resid-ual (Ya — Xa)-Xa is conceptually differentfrom the residual (Yb — Xtt)-Xa,

Residualizing removes from the posttestscore, and hence from the gain, the portion thatcould have been predicted linearly from preteststatus. One cannot argue that the residualizedscore is a "corrected" measure of gain, since inmost studies the portion discarded includessome genuine and important change in the per-son. The residualized score is primarily a wayof singling out individuals who changed more(or less) than expected.

True residual gain could be defined either asthe expected value, over many observations onthe same person, of D • X (defined in either ofthe two possible ways), or as the partial variateDx • Xm, the part of the true gain not predict-able from true pretest status. DX-XX is more

MEASURING "CHANGE" 75

likely to be of interest. Certainly if we intendto pick out superior learners or persons whoseself-concept falls far below the self-ideal, wewould like to base the discrimination on a true-score disparity. Likewise, if we have correla-tional questions — for example, Does anxietypredict overachievement? — the variable seemsbetter specified by D X - X X . Hence we consideronly the definition :

Dx •XX=DX — jS^or

= 7«-'rr«'*y y«+constant [22]

No ambiguity arising from possible linkage ofthe observed X and F enters this definition,though linkage must be considered in any esti-mation procedure.

Successive estimators of true residual changecan be constructed on the same principles asbefore, but it will suffice to move directly to theestimator formed by the multiple-regressionprinciple in the manner of Lord:

DX-XPXX'pYY' — °P XY

(pxx'OpjnOCl —

XI F0- ~^Xa +constant [23] [4']<rx /

This and other constants in this section are de-fined to make the mean of estimated residualgain equal zero. This estimate turns out to beproportional to the raw residual gain. (If theestimate is made from independent F and X,°PXY replaces *pxr)-

If there is W or Z information, a still betterestimator is

+(3'3Wr+^'4Z+constant [24] [5']

Only in rare cases will this be proportional tothe observed D-X.

The computational algorithm used beforecan be extended to obtain the desired weightsfor Formula 4' or 5'. Suppose we have filled outthe square matrix for X, Y, Xx, Yx, Dx, . . . .Then we can form the covariance of any vari-able with Dn • Xx very simply. For example,

.« C25]"tw^'A^jy — "JJa^ 2 A«

Hence we may multiply every entry in the Xx

column of the matrix by aojcja^x^, enteringthis in a column to one side. Subtracting theentry in this side column from the entry in thecorresponding row of the Z>M column gives acovariance to be entered in a DX-XX column.This column is now taken as a set of covari-ances of predictors with the criterion, and amultiple-regression program is applied.

The Tucker-Damarin-Messick Proposals

This analysis puts us in a position to reviewand clarify the rather puzzling paper entitled"A base-free measure of change" (Tucker,Damarin, & Messick, 1966; hereafter, referredto as TDM). They start much as we do bynoting that the psychometrics of a change scoreapplies to all kinds of difference scores. Theysuggest, as Lord did, that one should be mostinterested in the true difference score. Theypropose to divide this difference into two com-ponents, "one entirely dependent on the truescore of the first or base-line test" and one "en-tirely independent of it." That is, they areinterested in a true predicted gain and a trueresidual gain. As their abstract says, "equa-tions for estimating both components aregiven." Since we shall recommend against useof their equations, we shall not go into details oftheir argument.

It might appear that TDM are concernedwith estimating E(DJ \XX and Dn - E(DX) \Xx( = DX-XK). The former, of course, is alinear function of Xx. TDM arrive at an equa-tion (their Equation 26) that, in a form con-sistent with the present paper, is

* opXX'ffX

or

OAXTOYi b A0

PXX'fX

•-f-constant [26]

It is rather startling to find that this agreeswith none of our formulas. It differs fromFormula 1' in that X is replaced by X/pxx>- Itdiffers by a further constant of proportionalityfrom Formula 4'. The marked departure fromFormula 4' is made the more puzzling by thefavorable references of TDM to the Lord andMcNemar papers and by their recommenda-tion of Formula 4 for the gain score itself.

76 LEE J. CRONBACH AND LIT A FURBY

Personal communication with the authorsverified that a failure of communication hadoccurred4. As readers, we had given too littleweight to one key phrase: The measures are"primarily intended for correlational work."That is to say, TDM have no intention of inter-preting "basefree" scores for individuals. Suchscores are intended only as an intermediatestep toward correlations. TDM offer an esti-mator that does not give the best least-squaresestimate of individual "basefree" scores be-cause they seek instead estimates that correlatezero with Xx.

Their intention is to determine correlationsso as to learn what kinds of person show gainslarger than would be predicted from the truepretest score. Correlating the estimated trueresidual gain from Equation 23 or 24 with an-other variable does not give the correlationTDM desire (unless pxx- is 1.00). To under-stand this, consider for a moment the simplecorrelation of F with Q. If we do not know Fscores but do know pxv, we might estimate Ffrom X scores by the usual regression equation.Then pfo will not be a good approximation toPYQ ', it will actually equal pxo- In general, onewho wants to interpret correlations, covari-ances, or regression slopes ought not to workfrom estimated scores. TDM intended to rec-ommend that in such a line of research onecalculate special-purpose scores by Formula 26and then determine correlations. This appearsto be unsound. TDM desire to obtain correla-tions with various Q of 7 and f, which in theirnotation are the residual and predictable por-tions of true gain, respectively. But the TDMformulas generate fallible values g and w; gequals 7 plus an error. Obviously POQ < pygand PIVQ < p[Q. As this is not explained byTDM, their paper is likely to mislead thereader. The confusion is reflected, and to somedegree intensified, when Traub (1967) andGlass (1968) comment on the TDM formula.

While one might adapt the TDM statementsto get pyQ, PZQ, etc., this is unnecessary. Astraightforward manipulation of the matrix ofobserved covariances for X, Y, and Q (alongwith pxx1 and pyy) yields the covariance of Qwith DX-XX (i.e., with 7). Thea2

Dai.xa(= v\)needed to reach a correlation is simply the co-

4 L. R. Tucker, F. Damarin, and S. Messick, personalcommunication, September 1968 and April 1969.

variance of Dx with Dx • Xx that we have al-ready obtained. To get covariances for Dx

— DX-XX (i.e., for f), one need only subtractcolumn DX-XX of the covariance matrix fromcolumn Z>M. And = a\ + <r2

A MULTIVARIATE CONCEPTION

The older statement of the problem as "themeasurement of gain" or of "residual gain"implies a special affinity between X and F —they are seen as "the same variable" in somesense. But change is multivariate in nature.

Even when X and F are determined by thesame operation, they often do not represent thesame psychological processes (Lord, 1958). Atdifferent stages of practice or developmentdifferent processes contribute to performanceof a task. Nor is this merely a matter of in-creased complexity; some processes drop out,some remain but contribute nothing to indi-vidual differences within an age group, some arereplaced by qualitatively different processes.This does not rule out purely empirical studiesof changes in the operationally defined variable.To assess such changes, even when one cannotdescribe them qualitatively, may be practicallyimportant. One must be careful not to fall intothe trap of assuming that the changes are in aparticular psychological attribute.

We may illustrate by referring to Fleish-man's well-known studies of psychomotorscores at successive stages of practice (seeFleishman, 1966) . On the first few trials, scorestend to correlate with cognitive measures; theusual speculation is that the high -cognitivesubjects gain fastest because they most rapidlycomprehend the instructions, display, strategy,etc. In a second stage, certain pretests of psy-chomotor ability correlate highest with scores,and the correlation of scores with cognitivepretests becomes rather small. One could easilyconclude that cognitive ability is unimportantto learning in this second stage. But the drop incorrelation (assuming no great rise in SD)demonstrates something more striking: thatcognitive ability is negatively correlated withchange from the first to second stage. And onecan suggest a good reason. If bright persons

6 This matrix-extension procedure is entirely consist-ent with the TDM rationale. In fact, TDM tell us thattheir ̂ formula was arrived at by analytic treatment ofthis extended matrix.

MEASURING "CHANGE" 77

catch on fastest, by some trial / they have com-pleted what can be done intellectually with thetask. On trials t -\- 1 and later, their cognitiveabilities produce no further gains. But by ourhypothesis the dull have not comprehendedfully by trial t, and therefore they have cogni-tive work left to do on trials t + 1, t + 2, . . .During this stage, any gain in performance thatcomes from improved comprehension will be again by those low in cognitive aptitudes, hencethose aptitudes have a negative correlationwith gains. The Fleishman studies have notbeen processed in terms of gain scores, but ithas been commonly said that they indicatecognitive abilities to be "important during theearly stages of practice on psychomotor tasks,"or the like. It seems more accurate to say thatcognitive abilities make their contribution atdifferent times for different persons and thatgains at any one time are due to different pro-cesses for different persons.

Something similar is to be said about the re-lation of mental age (MA) at one age to sub-sequent gains in MA or achievement. A positiverelation would result insofar as the high scorersunderstand new material better, or are moreefficient learners. But there would also be anegative relation, insofar as the high scorers arethose who have already mastered some highlyvaluable technique (e.g., mediation) that thelow scorers have yet to master. As they re-structure their behavior, the persons with lowscores at the start of the period may make largegains—-gains the high scorers had previouslymade. Positive and negative elements are prob-ably both present, which should make us muchless surprised than we have been by reports ofnear-zero correlation of MA (in a constant-agegroup) with subsequent gain in MA (Ander-son, 1939; Bloom, 1964, p. 26 ff., 62 ff.).

We reduce emphasis on the special role of Xas precursor of Y, and regard the whole WXset as a vector describing the person's initialstatus. Then one may ask how Y varies as afunction of the Time-1 data. To single out forintensive study persons who do better (orworse) than predicted, for example, it is wise todefine expected outcome as the forecast of Yon the basis of all Time-1 information. Insteadof DX-XX (= Fao-ZJ one would estimateDX-XX, Wx (= Ya-Xx, Wx). The machinerysuggested above for partialling out Xa would

be used, but extended by partialling the Wx

also out of Dx. The entire set of W and X mea-sures constitute the "base." There are optionaltargets for investigation: Dx; D*,-Xx; D«,-XX,Wx; etc.

Learning or growth is multidimensional;many measures could be taken at each point intime. To select one particular Y as somehowintegrating a variety of subcriteria is to sacri-fice information and possible insight. A per-son's change is better described by a vector oftrue scores Ww Xx; Yx, Zx. Each of these canbe estimated by the methods used in Equations19 and 20. One who wants to examine predictedand residual change will estimate Yx from ffia

and J?w, and also from all variables together.He will obtain these scores:

Yx | WXYZ (estimated true final status)FSO | ]^XXX (predicted true final status)Difference (estimate of unpredicted true

residual)

The estimates of Wx and Xm come from W, X,Y, and Z. There would be estimates like thosefor Y for each Z or for orthogonal componentsof the FM, Zx space.

We can rearrange the vectors W, X and Y, Zin a great variety of ways. Which target tochoose can be decided only in the light of the pur-poses of the study.

PURPOSES OF ESTIMATINGGAINS OR DIFFERENCES

Just why gains or differences are thought tobe worth estimating can perhaps be inferredfrom the studies where estimates of some sorthave been made in the past. The following aimsmay be noted:

1. To provide a dependent variable in anexperiment on instruction, persuasion, or someother attempt to change behavior or beliefs.

2. To provide a measure of growth rate orlearning rate that is to be predicted, as a way ofanswering the question, What kinds of personsgrow (learn) fastest? Here, the change measureis a criterion variable in a correlational study.

3. To provide an indicator of deviant de-velopment, as a basis for identifying individualsto be given special treatment or to be studiedclinically.

4. To provide an indicator of a constructthat is thought to have significance in a certain

78 LEE J. C RON BACH AND LIT A FVRBY

theoretical network. The indicator may beused as an independent variable, covariate,dependent variable, etc. An example is theinterference score needed on the Stroop ColorWord Test to represent the decline in readingrate when color names are printed in colors in-congruous with the names.

Much of the confusion in the literature arisesfrom a failure to distinguish these purposesand to match distinct methodological recom-mendations to them.

Gains as a Consequence of Treatments

There appears to be no need to use measuresof change as dependent variables and no virtuein using them. If one is testing the null hy-pothesis that two treatments have the sameeffect, the essential question is whether posttestYx scores vary from group to group. Assumingthat errors of measurement of F are random, Fis an entirely suitable dependent variable.

The randomized experiment. Suppose thatcases are assigned to treatments in a random orstratified-random manner. The X scores willvary within groups. An analysis of covarianceto take this variation into account is advan-tageous so long as pxv is large. (If p < 0.4, block-ing is probably to be preferred, according toElashoff, 1969.) The usual adjustment esti-mates the F scores expected under the null hy-pothesis and then expresses each observed Fas a deviation from the estimate. Ordinarily itis desirable to base the adjustment, not on X,but on whatever linear combination of X andW best predicts F within groups.

Where within-treatment regressions arelinear but significantly different in slope, thedifference between effects of treatments de-pends on the level of X. The "main effect" isnot interpretable. The most meaningful reportconsists of regression functions for F on theXx, WM space, computed with the aid of thecovariance matrix for true scores within eachgroup in turn.

Nowhere in this section have we made use ofa change score. We consider it likely thatchange will vary systematically with Xx.Where this is the case, the essential result is aregression function, not a mean gain. The ad-justed F score of the significance test is a sortof residual gain, but the procedure does not

involve calculating residual gain scores forindividuals.

Comparison of treatment groups not formed atrandom. When treatments are applied to groupsdifferentiated by a nonrandom process, the Xx

distributions within the subpopulations repre-sented by the groups are generally not thesame. Consequently, the same observed Xscore implies a different level of true pretestability, depending on the group.

If analysis of covariance is to be made, it isadvisable to regress the covariate toward themean of the treatment group before enteringit in the analysis (Lord, 1960). If there is Winformation as well as X, it also contributes tothe estimate. So does F and Z information.Here is a paradox: A proposal to use the post-test score to estimate the pretest true scorewhich will then be used to adjust posttestscores! The crucial point is that the estimatorof the covariate is determined from within-group data. Since the estimate of any linearfunction of Xx and Wx has the same within-group mean as that function of X and W, theprocedure does not introduce bias nor does itreduce any effect truly attributable to thetreatment.

Application of analysis of covariance tostudies where initial assignment was nonran-dom, which was widely recommended 10 yearsago, is now in bad repute. Even the elaboratetechnique just suggested is no more than apalliative. If the treatment groups differedsystematically at the start of the experimentwith respect to any relevant characteristicother than the covariate, even a perfect mea-sure of the covariate cannot remove the con-founding. To quote Lord (1967), "there simplyis no logical or statistical procedure that can becounted on to make proper allowances for un-controlled preexisting differences betweengroups [p. 305]." And Meehl (1970) calls suchcorrections "inherently fallacious [in press]."

The findings of the study can be usefullysummarized by calculating within-group re-gression functions relating Yx to Xx, WK, usingthe covariance matrix for true scores. Whatcannot be done is to "compare treatmenteffects."

One-group designs. A third kind of experi-ment is the simple one-group study where onewishes to learn whether a treatment produces

MEASURING "CHANGE" 79

significant change, or to describe the magnitudeof the effect. An estimate of true gain might ap-pear to be pertinent. But it is not. For if onewere to estimate DK for each individual, andaverage, he would arrive back at the samplemean of observed gain. A significance test needonly ask whether HY is reliably different fromHx- The difference in sample means for X andY is the best available estimate of the mean Dx,

Criteria in Correlational Studies.

Correlational studies are often intended toinvestigate a question such as this : Among per-sons with a given pretest score, what attributesdistinguish those who profit most from thetreatment? This may seem to ask about PDW orpoawa, or perhaps about P(D-X->W or PID^-XJW-It is more straightforward to ask about the re-gression of Y on Xx and Wx, the correspondingcorrelation for Y or YM, or related partial cor-relations. It appears that nothing is gained byreferring to change measures in this context.The relationships of true scores can be investi-gated without estimating true scores for indi-viduals.

Selecting Individuals on the Basis of Gain orDifference Scores.

Many who calculate difference scores areinterested in making decisions about individ-uals — identifying underachievers for clinicalattention or fast learners for special opportuni-ties, for example. One can scarcely defendselecting such individuals on a raw-gain orraw-difference score, especially as these scorestend to show a spurious advantage for personslow on X. Selecting cases whose estimated FM

is higher than that of others with similar Xx

and ~fyx seems more sensible. To do this, re-gression equations should be called into play.That is, one selects persons for whom FM|WXYZ is much larger (or smaller) than YK\

The persons with positive deviations arethose who did better than predicted. Thismeans either that they started with some valu-able attribute the W and X variables did notencompass, that their pretest true scores areunderestimated or their posttest scores areoverestimated, or that their success on F wasan accidental effect arising from some tacticcasually adopted during learning or some se-quence of lucky trials. It is very hard to dispose

of the hypothesis that these unexpected gainswere fortuitous.

Here, the focus of attention is on an esti-mated residual gain: not D • X, not Z)M • Xx, butYX-XXWM or, what is equivalent, DX-XXWM.Where X alone is available as a predictor, theraw residual gain selects the same persons asFormula 23 does. But Formula 24 is to be topreferred.

It is possible of course, given before-and-after scores on the same instrument, to esti-mate true gains of individuals and to identifythose who did and did not gain. But to whatpurpose? This has no clear bearing on decisionsabout the future of these persons, and the de-cision rule for fresh cases is to be inferred fromthe regression surface.

Differences and Gain Scores as Constructs.

One of the most common uses of differencescores is to operationalize a concept: For ex-ample, self-satisfaction is sometimes defined asthe difference between the rating of self andideal-self on an esteem scale. One might like-wise think of a gain score as reflecting "learn-ing ability" on a certain task. Operationaldefinitions will often take the form of linearcombinations of operations.

But there is little a priori basis for pinningone's faith on FM — Xx as distinct from themore general Yx — aXx. Just what weight toassign the "correcting" variable is an empiricalquestion. To arbitrarily confine interest to Dx

(which means that a is fixed at 1.00) is to ruleout possible discoveries. This argues, then, fordiscovering what function of Fw and Xx hasthe strongest relationships with variables thatshould connect with the construct.

The claim that an index has validity as ameasure of some construct carries a consider-able burden of proof. There is little reason tobelieve and much empirical reason to disbe-lieve the contention that some arbitrarilyweighted function of two variables will properlydefine a construct. More often, the profitablestrategy is to use the two variables separatelyin the analysis so as to allow for complex re-lationships.

One example of an "obvious" but question-able use of a subtractive correction is providedby a study in which skin conductance is a vari-able. At the start of the experiment a "base-

80 LEE J. CRONBACH AND LIT A FURBY

line" measure of the subject's galvanic skinresponse is taken. Then stress is applied and asecond measure is taken. During a rest periodthe subject receives a drug or a placebo. Stressis again applied and a third measure taken.Call the measures, in order, W, X, and F.Simple correction would use F — W as de-pendent variable and X — W as covariate.We, however, would prefer to use X and W asseparate covariates, with F as dependent vari-able. This should give a more precise analysiswhen W is unreliable. (As suggested earlier, itwould generally be still better to use Xx\WXY and ty^WXY as covariates.)

SUMMARY

Where true scores for individuals are de-sired, multiple regression procedures outlinedherein make use of more information than doprocedures hitherto advanced. There seems tobe no occasion to estimate true gain scores. Inthe experiment where treatment groups areformed nonrandomly, estimates of true scoreson the covariate can reduce the resulting bias.

Where individuals who have exceptionallyhigh or low residual gains are to be identified,the raw residual gain serves as well as the alter-nate formulas hitherto advanced. To estimatethe individual's true residual gain, however, asuperior formula is available.

Where correlations and regression functionsrelating true gains or true residual gains toother variables are desired, a calculating routineis available that makes it unnecessary to esti-mate gain scores for individuals.

It appears that investigators who ask ques-tions regarding gain scores would ordinarilybe better advised to frame their questions inother ways.

REFERENCES

ANDERSON, J. E. The limitations of infant and preschootests in the measurement of intelligence. Journal ofPsychology, 1939, 8, 351-379.

BLOOM, B. S. Stability and change in human character-istics. New York: Wiley, 1964.

BURKET, G. R. A study of reduced rank models formultiple prediction. Psychometric Monographs, 1964,No. 12.

DuBois, P. H. Multivariate correlational analysis. New

York: Harper, 1957.ELASHOFF, J. D. Analysis of covariance: a delicate in-

strument. American Educational Research Journal,1969, 6, 383-402.

FLEISHMAN, E. A. Human abilities and the acquisitionof skill. In E. A. Bilodeau (Ed.), Acquisition of skill.New York: Academic Press, 1966.

GLASS, G. V. Response to Traub's "Note on the relia-bility of residual change scores." Journal of Educa-tional Measurement, 1968, 5, 265-267.

GLESER, G. C., CRONBACH, L. J., & RAJARATNAM, N.Generalizability of scores influenced by multiplesources of variance. Psychometrika, 1965, 30, 395-418.

GOLDBERG, L. R. The search for configural relation-ships in personality assessment: the diagnosis ofpsychosis vs. neurosis from the MMPI. MultivariateBehawrial Research, 1969, 4, 523-536.

HARNQVIST, K. Relative changes in intelligence from 13to 18. Scandinavian Journal of Psychology, 1968, 9,50-82.

HARRIS, C. W. (Ed.) Problems in measuring change.Madison: University of Wisconsin Press, 1963.

LORD, F. M. The measurement of growth. Educationaland Psychological Measurement, 1956, 16, 421-437.See also Errata, ibid., 1957, 17, 452.

LORD, F. M. Further problems in the measurement ofgrowth. Educational and Psychological Measurement,1958, 18, 437-454.

LORD, F. M. Large-sample covariance analysis when thecontrol variable is fallible. Journal of the AmericanStatistical Association, 1960, 55, 309-321.

LORD, F. M. Elementary models for measuring change.In C. W. Harris (Ed.), Problems in measuring change.Madison: University of Wisconsin Press, 1963.

LORD, F. M. A paradox in the interpretation of groupcomparisons. Psychological Bulletin, 1967, 68, 304^-305.

McNEMAR, Q. On growth measurement. Educationaland Psychological Measurement, 1958, 18, 47-55.

MEEHL, P. E. Nuisance variables and the ex post factodesign. In M. Radner & S. Winokur (Eds.), Minne-sota studies in the philosophy of science, IV. Minne-apolis: University of Minnesota Press, 1970, inpress.

STANLEY, J. C. General and special formulas for reli-ability of differences. Journal of Educational Mea-surement, 1967, 4, 249-252.

TRAUB, R. E. A note on the reliability of residual changescores. Journal of Educational Measurement, 1967,4, 253-256.

TRIMBLE, H. C., & CRONBACH, L. J. A practical pro-cedure for the rigorous interpretation of test-retestscores in terms of pupil growth. Journal of Educa-tional Research, 1943, 35, 481-488.

TUCKER, L. R., DAMARIN, F., & MESSICK, S. A base-free measure of change. Psychometrika, 1966, 31,457-473.

(Received June 24, 1969)