Furr, Special Topics - 1
Chapter 7
Special Topics in Social Psychological Measurement:
Profile Similarity and Difference Scores
This chapter addresses two important but somewhat misunderstood special topics in
social psychological measurement – profile similarity (i.e., profile agreement) and difference
scores. Both are used to reflect important social psychological phenomena. For example,
researchers studying personality similarity and relationship satisfaction might use sets of
psychological scales to obtain profiles of psychological characteristics for wives and husbands,
interpreting the similarity between wife-husband profiles in terms of “personality similarity.”
Similarly, researchers studying attraction and body types might ask participants to rate the
“thinnest” body type they find attractive and to rate the “heaviest” body type they find attractive,
computing the difference between these two ratings, and interpreting the difference in terms of
the “range of body sizes that each participant finds attractive.”
Although such measurement strategies are appealing, they entail complexities that, if not
understood and managed, can compromise researchers’ conclusions. For example, if not based
upon appropriately-conducted analyses, an apparent association between psychological similarity
and relationship satisfaction could reflect the relatively simple and mundane fact that well-
adjusted people generally have good relationships. Such results would say little, if anything,
about the supposed link between psychological similarity and relationship satisfaction. Similarly,
correlations between “desirability of women’s psychological attributes” and “attractiveness
range” difference scores may appear to indicate that men are attracted to a wider range of body
sizes among women who have desirable psychological attributes than among women having less
desirable psychological attributes. However, careful analysis may reveal a simpler, clearer, and
more fundamental message that men’s attraction to heavier women increases in response to
desirable psychological attributes, but their attraction to thinner women is unchanged. The
current chapter addresses these measurement issues, articulating the problems and introducing
solutions.
Profile Similarity
Social psychologists often study profiles of psychological characteristics when examining
phenomena such as psychological similarity, accuracy of social judgments, person-environment
fit, cross-situational behavioral consistency, and developmental stability across time. Such
examinations often arise from analysis of profile similarity – the degree to which two profiles of
characteristics are similar to each other (e.g., Baker & Block, 1957; Bernieri, Zuckerman,
Koestner, & Rosenthal, 1994; Biesanz, West, & Millevoi, 2007; Blackman & Funder, 1998;
Colvin, 1993; Furr, Dougherty, Marsh, & Mathias, 2007; Furr & Funder, 2004; Gonzaga,
Campos, & Bradbury, 2007; Letzring, Wells, & Funder, 2005; No, Hong, Liao, Lee, Wood, &
Chao, 2008; Starzyk, Holden, Fabrigar, & MacDonald, 2006).
For example, Luo and Klohnen (2005) examined associations between couples’
personality similarity and marital satisfaction, evaluating the hypothesis that people select mates
who are relatively similar to themselves. Participants completed self-report personality ratings
across several characteristics, and Luo and Klohnen computed a similarity correlation between
each wife’s profile of ratings and her husband’s profile of ratings. As an index of profile
similarity, a strong positive correlation indicates that a wife’s personality profile is similar to her
husband’s profile.
To illustrate, Figure 7.1a presents hypothetical profiles for a couple. The profiles’ shapes
are quite similar to each other, with corresponding patterns of highs and lows. Thus, the
characteristics that the wife views as relatively descriptive of her personality (as compared to
other characteristics) are the characteristics that the husband views as relatively descriptive of his
personality. The Pearson correlation between the two sets of scores is large, r = .70, quantifying
the considerable similarity between the profiles’ shapes.
--------Insert Figure 7.1----------
Profile analysis is appealing in at least three ways. First, it reflects similarity (or stability,
or consistency, or agreement, or fit) across a wide range of characteristics, providing an
ostensibly “holistic” perspective on similarity. Rather than defining similarity in terms of a
single variable (e.g., the degree to which couples have similar levels of Extraversion), it reflects
similarity across multiple characteristics. Second, it reflects similarity at the couple-level or
person-level, rather than at a sample-level. In the context of wife-husband similarity, it produces
a similarity score for each couple. Third, profile analysis provides a relatively straightforward
method of examining correlates, predictors, or consequences of similarity (or agreement, or
stability, etc). For example, after similarity indices are computed for each couple, they can be
correlated with couples’ relationship satisfaction scores. A positive correlation might be
interpreted as indicating that wife-husband pairs with relatively similar personality profiles are
relatively satisfied with each other. Such questions can be addressed in other ways, but they
require complex analytic strategies (e.g., Furr et al., 2007).
Despite this appeal, at least two complexities potentially obscure analyses of profile
similarity. First is the need to understand and accommodate profile elevation, scatter, and shape.
Second is the need to understand and differentiate between profile normativeness and
distinctiveness.
Elevation, Scatter, and Shape
Thus far, discussion has highlighted one facet of a profile – its shape, in terms of the
pattern of high and low values (see Figure 7.1b). Indeed, for many applications of profile
similarity, profile shape may be the most interesting and psychologically meaningful facet.
However, there are two additional facets of a profile (Cronbach & Gleser, 1953; Furr,
2009b). Profile elevation refers to the average score across all variables in the profile, and
profile scatter refers to the variability across the variables – the degree of spread among the
scores. As shown in Table 7.1 and Figure 7.1b, the wife in couple 1 has an elevation of 3.8 and a
scatter (as indexed by the standard deviation) of 1.47. Setting aside ceiling or floor effects, the
three elements of a profile are independent.
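The three facets can be computed directly from a pair of profiles. The following is a minimal sketch in Python with NumPy; the ratings are hypothetical (they are not the Table 7.1 values), the function names are illustrative, and the use of the sample standard deviation (ddof=1) for scatter is an assumption, since the text does not say which form was used.

```python
import numpy as np

# Hypothetical ratings on five traits; NOT the values from Table 7.1.
wife = np.array([2.0, 5.0, 3.0, 6.0, 4.0])
husband = np.array([1.0, 6.0, 2.0, 5.0, 3.0])

def elevation(profile):
    """Average score across all variables in the profile."""
    return float(np.mean(profile))

def scatter(profile):
    """Spread of the scores, indexed here by the sample SD (an assumption)."""
    return float(np.std(profile, ddof=1))

def shape_similarity(p, q):
    """Pearson correlation between two profiles (pattern of highs and lows)."""
    return float(np.corrcoef(p, q)[0, 1])

print("wife elevation:", elevation(wife), "scatter:", round(scatter(wife), 2))
print("shape similarity:", round(shape_similarity(wife, husband), 2))

# Raising every one of the husband's scores by 2 points changes his elevation
# but leaves his scatter and the shape similarity untouched.
shifted = husband + 2.0
print(elevation(husband), elevation(shifted))
print(round(shape_similarity(wife, shifted), 2))
```

Adding a constant to a profile changes only its elevation, which is one way to see that elevation can vary independently of scatter and shape.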
--------Insert Table 7.1----------
Because each profile has multiple elements, similarity between profiles can be gauged in
multiple ways. That is, profile similarity can be indexed in terms of one, two, or all three
elements. In fact, psychologists have debated the methods of quantifying profile similarity for
more than 60 years (e.g., Carroll & Field, 1974; Cattell, 1949; Cohen, 1969; Cronbach & Gleser,
1953; du Mas, 1946; Furr, 2009b; Kenny, Kashy, & Cook, 2006; McCrae, 1993, 2008; Nunnally,
1962).
Currently, researchers use either of two primary techniques to quantify profile similarity.
The first and simplest is a Pearson correlation between two profiles, as described earlier. The
simple correlation reflects the similarity between profiles’ shapes – that is, it reflects similarity
between profiles’ patterns of highs and lows, being unaffected by similarity or dissimilarity in
the profiles’ elevation and scatter. In contrast, the double-entry intraclass correlation is an
“omnibus” index of profile similarity, affected by all three elements of profile similarity (Furr,
2009b). The double-entry intraclass correlation is computed by: a) creating two “doubly-entered”
profiles by appending each profile of scores to the end of the other, and b) computing a Pearson
correlation between these two doubly-entered profiles (see Furr, 2009b for a detailed example).
Some researchers recommend the double-entry intraclass correlation over the Pearson correlation
because it blends the three elements (McCrae, 2008). However, such recommendations have
been questioned in several ways – in terms of conceptual ambiguity associated with blending of
independent elements, in terms of significant technical problems and potential confusions, and in
terms of a lack of clear empirical benefit over an approach focusing on each element separately
(Furr, 2009b).
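The two-step computation of the double-entry intraclass correlation can be sketched as follows. This is an illustrative Python/NumPy version of steps (a) and (b) as described above, not code from the cited sources; the function names and the example profiles are hypothetical.

```python
import numpy as np

def pearson(p, q):
    """Pearson correlation between two profiles (shape only)."""
    return float(np.corrcoef(p, q)[0, 1])

def double_entry_icc(p, q):
    """Pearson correlation between the two doubly-entered profiles."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    first = np.concatenate([p, q])    # step (a): p with q appended
    second = np.concatenate([q, p])   # step (a): q with p appended
    return float(np.corrcoef(first, second)[0, 1])  # step (b)

# Hypothetical profiles with identical shape and scatter but different elevation.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = a + 2.0

print(round(pearson(a, b), 2))           # shape is identical
print(round(double_entry_icc(a, b), 2))  # lowered by the elevation difference
```

For these two profiles the Pearson correlation is 1 while the double-entry value is about .33, which illustrates how the omnibus index blends elevation differences into a single number.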
Currently, the simple Pearson correlation seems to be the most commonly-used index of
profile similarity. However, researchers should be familiar with profile elevation, scatter, and
shape, and they should understand their effects on any index of profile similarity.
Normativeness and Distinctiveness
Profile similarity’s second complexity is the “normativeness problem” (Furr, 2008). A
profile’s normativeness is the degree to which it reflects an average profile – the similarity
between the shape of an individual profile and the shape of a group’s normative profile. Figure
7.2 illustrates this, with Figure 7.2a presenting three wives’ self-rated personality profiles. There
are clear commonalities across the three individual profiles – all have higher Extraversion than
Neuroticism, and all have lower Openness than Agreeableness and Conscientiousness. Figure
7.2b presents these profiles alongside a “normative” profile. The normative profile reflects scores
for each variable averaged across all wives’ self-ratings. Each wife’s profile is at least somewhat
similar to the normative wife profile, suggesting that each wife is normative to some degree.
--------Insert Figure 7.2----------
Although there has been little empirical examination of normative profiles, there are at
least three likely properties of normativeness (Furr, 2008). First, many individual profiles are
likely to be quite normative, as suggested in Figure 7.2b. Second, two normative profiles in an
analysis are likely to be very similar to each other. For example, the normative wife profile in
Figure 7.2b is likely very similar to a normative husband profile. Third, a normative profile is
probably psychologically meaningful as social desirability, psychological well-being, or
adaptation to an environment (Wood, Gosling, & Potter, 2007). That is, the variables having
relatively high scores within a normative profile are likely to be socially desirable, and those
having relatively low scores are likely to be undesirable.
Together, these properties have implications for profile similarity. First, any two profiles
are likely to be similar, even with no intrinsic connection between them. For example, a wife’s
profile is likely to be somewhat, perhaps very, similar to the profile of a husband from another
couple. Second, profile similarity can represent social desirability, adjustment, or adaptiveness.
For example, high similarity between a wife’s profile and her husband’s profile might indicate
simply that both people are well-adjusted.
These normativeness implications create two problems for analyses of profile similarity.
First, they obscure interpretations of average levels of profile similarity. For example, social
psychologists might wish to interpret the average level of wife-husband similarity (i.e., averaged
across all couples) as indicating the degree to which people marry people with similar
psychological characteristics. Unfortunately, this interpretation is clouded by the fact that any
given wife’s profile is probably similar to any given husband’s profile. That is, researchers
would likely find psychological similarity between women and men in general, even among
people from different couples. The second normativeness problem concerns interpretations of
antecedents, consequences, or correlates of profile similarity. For example, social psychologists
might examine wife-husband similarity and relationship satisfaction, find a positive correlation
between the two, and wish to conclude that psychological similarity contributes to (or at least is
associated with) satisfying relationships. However, wife-husband similarity may partially reflect
psychological adjustment, arising from connections between individual profiles, normativeness,
and desirability or adjustment. Therefore, a correlation between wife-husband similarity and
relationship satisfaction might indicate simply that well-adjusted people generally have
satisfying relationships. This may be an important psychosocial finding, but it reflects no
intrinsic link between psychological similarity and relationship satisfaction.
There are at least two methods for handling normativeness problems (Furr, 2008). The
first is a sample-level method, in which researchers create profile similarity values for random
pairs of profiles, in order to gauge the “normative” level of similarity in a sample. For example,
Luo and Klohnen (2005) examined wife-husband similarity and noted that “individuals, on
average, tend to be more similar than dissimilar” (p. 311). To address this normativeness issue,
they created “random couples” by pairing each wife’s profile with the profile of a husband from
a different couple, and they computed a similarity correlation for each random couple. They then
interpreted the mean “random couple” similarity correlation as reflecting “the average similarity
between men and women” (p. 311). Finally, they computed the average similarity correlation
between “real” couples and contrasted it with the mean “random couple” correlation, interpreting
the difference as the degree to which real couples are more similar than are random pairs of men
and women. In the context of couples, this approach has been referred to as “pseudo-couple
analysis” (Kenny et al., 2006, pp. 335-337). This approach facilitates analysis of average levels of
profile similarity, but it cannot address problems associated with antecedents, consequences, or
correlates of similarity. The second approach addresses both issues.
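The sample-level logic can be sketched in a few lines of Python. The data below are simulated purely for illustration, and the sketch makes one simplifying assumption that departs from the procedure described above: instead of forming a single set of random couples, it averages over every mismatched wife-husband pairing. All function and variable names are hypothetical.

```python
import numpy as np

def profile_r(p, q):
    """Pearson correlation between two profiles."""
    return float(np.corrcoef(p, q)[0, 1])

def real_vs_pseudo(wives, husbands):
    """Mean similarity for real couples and for all mismatched pairings.

    Row i of `wives` and row i of `husbands` belong to the same couple.
    """
    wives = np.asarray(wives, dtype=float)
    husbands = np.asarray(husbands, dtype=float)
    n = wives.shape[0]
    real = [profile_r(wives[i], husbands[i]) for i in range(n)]
    pseudo = [profile_r(wives[i], husbands[j])
              for i in range(n) for j in range(n) if i != j]
    return float(np.mean(real)), float(np.mean(pseudo))

# Simulated (hypothetical) data: every profile shares a normative pattern,
# and spouses additionally share a couple-specific pattern.
rng = np.random.default_rng(0)
normative = np.array([2.0, 5.0, 3.0, 6.0, 5.0])
n_couples, n_traits = 50, normative.size
couple_part = rng.normal(0.0, 1.0, size=(n_couples, n_traits))
wives = normative + couple_part + rng.normal(0.0, 1.0, size=(n_couples, n_traits))
husbands = normative + couple_part + rng.normal(0.0, 1.0, size=(n_couples, n_traits))

mean_real, mean_pseudo = real_vs_pseudo(wives, husbands)
print(f"real couples:   {mean_real:.2f}")
print(f"pseudo couples: {mean_pseudo:.2f}")
print(f"difference:     {mean_real - mean_pseudo:.2f}")
```

In data generated this way the pseudo-couple mean is itself clearly positive, which is the normativeness effect, and only the difference between the two means speaks to similarity that is specific to real couples.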
The second method is a pair-level method in which similarity is decomposed for each
pair of profiles (Cronbach, 1955; Furr, 2008). The association between two profiles can be
partitioned into components representing blends of similarity, normativeness, and distinctiveness.
This decomposition can be done in several ways (Furr, 2008), and Tables 7.1 and 7.2 illustrate
one way applied to three hypothetical couples.
--------Insert Table 7.2----------
The pair-level process begins by decomposing each profile into two component profiles –
a normative profile reflecting group means, and a distinctive profile reflecting differences
between an individual and the group on each variable. As shown in Table 7.1, the normative
wife profile is the mean of all wives (for each variable), and the normative husband profile is the
mean of all husbands. A distinctive profile includes “deviation scores” reflecting the difference
between an individual’s score on a variable and a group’s mean score on the variable. For
example, Marge’s distinctive profile (Table 7.1) reveals that she is exactly as Neurotic as the
average wife (4-4=0), somewhat less Extraverted than the average wife (5-5.67=-.67), and so on.
Following the decomposition of individual profiles, several indices can be computed for
each pair of profiles (see Table 7.2). Overall Similarity is the correlation between two raw,
unadjusted profiles as discussed earlier (e.g., the overall similarity between Marge and Homer
is .70, see Couple 1 in Table 7.2). Distinctive similarity is the correlation between a pair of
distinctive profiles, reflecting the degree to which the pair shares a pattern of non-normativeness.
For example, the correlation between Marge’s distinctive profile and Homer’s distinctive profile
is .19, indicating that, to a slight degree, the ways in which Marge has unusually high or low
levels of specific characteristics are similar to the ways in which Homer has unusually high or low
levels of those characteristics. That is, the ways that Marge is distinctive are somewhat similar
to the ways that Homer is distinctive. Generalized Normative Similarity is the degree of
similarity between two normative profiles, such as the correlation between the normative wife
and normative husband profiles, r = .92 in Table 7.2 (note that this value is the same across all
pairs of profiles). Pair-level normativeness indices can also be computed, for example, between
each wife’s profile and the normative wife profile and between each husband’s profile and the
normative husband profile. For example, Table 7.2 indicates that Marge is extremely like the
average wife (r = .94) and that Homer is very much like the average husband (r = .77). See Furr
(2008) for additional possibilities in decomposing profiles and for psychometric details of these
decompositions.
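The pair-level indices described above can be computed with a short script. The following Python/NumPy sketch uses hypothetical ratings for three couples (they are not the entries of Table 7.1), and the variable names are illustrative; it implements only the one decomposition described in this section.

```python
import numpy as np

def profile_r(p, q):
    """Pearson correlation between two profiles."""
    return float(np.corrcoef(p, q)[0, 1])

# Hypothetical ratings (rows = couples, columns = five traits).
# These are illustrative values, NOT the entries of Table 7.1.
wives = np.array([[3., 6., 2., 5., 6.],
                  [5., 5., 3., 6., 7.],
                  [4., 7., 4., 6., 5.]])
husbands = np.array([[2., 5., 3., 6., 6.],
                     [4., 4., 3., 5., 7.],
                     [5., 6., 2., 6., 5.]])

# Normative profiles: group means on each variable.
norm_wife = wives.mean(axis=0)
norm_husband = husbands.mean(axis=0)

# Distinctive profiles: deviation scores from the group mean.
dist_wives = wives - norm_wife
dist_husbands = husbands - norm_husband

# Generalized normative similarity takes the same value for every couple.
generalized_normative = profile_r(norm_wife, norm_husband)

results = []
for i in range(wives.shape[0]):
    results.append({
        "overall": profile_r(wives[i], husbands[i]),
        "distinctive": profile_r(dist_wives[i], dist_husbands[i]),
        "wife_normativeness": profile_r(wives[i], norm_wife),
        "husband_normativeness": profile_r(husbands[i], norm_husband),
    })

print(f"generalized normative similarity: {generalized_normative:.2f}")
for couple, row in enumerate(results, start=1):
    print(couple, {name: round(value, 2) for name, value in row.items()})
```

Each couple's indices could then be correlated with an outcome such as relationship satisfaction, keeping distinctive similarity and normativeness as separate predictors.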
A pair-level approach handles both normativeness problems – the “average-level”
problem and the “antecedents, consequences, and correlates” problem. By examining distinctive
similarity alongside overall similarity and normativeness, researchers gain insight into general
levels of wife-husband similarity, into the similarity between wives’ and husbands’ distinctive
qualities, and into the possibility that relationship satisfaction is associated either with
normativeness (and thereby potentially adjustment) or with the similarity between wives’ and
husbands’ distinctive qualities. Such differentiated analyses can produce interesting insights,
such as the insight that high acquaintanceship enhances people’s understanding of each other’s
distinctive personality qualities while minimizing their reliance on normative personality
information in social judgments (Biesanz et al., 2007).
Summary
Analysis of profile similarity is an appealing method for assessing and examining social
psychological phenomena such as psychological similarity, judgmental accuracy, cross-
situational behavioral consistency, and person-environment fit. However, to realize the full
potential of this method, researchers must account for important complexities. Specifically, they
should recognize that any index of profile similarity is affected by one or more of the three
elements of similarity – shape, elevation, and/or scatter. Furthermore, they should be aware of
the elements affecting a specific index, and they should understand the costs, benefits, and
meaning of each potential index. Finally, they should understand normativeness and its potential
effects on profile similarity, and they should implement appropriate analytic strategies to account
for these effects. When conducted with appropriate analytic strategies, profile similarity can be a
useful tool for social psychologists.
Difference Scores
Many interesting social psychological phenomena can be seen as differences between
two “component” phenomena. For example, actual-ideal discrepancy might be viewed as the
difference between a participant’s actual standing on a variable and his or her preferred standing
on that variable. Similarly, “attractiveness range” might be viewed as the difference between the
thinnest body type (say as indexed in terms of Body Mass Index) that a person finds attractive
and the heaviest body type that he or she finds attractive. For such phenomena, researchers
might be tempted to use difference scores – measuring each component variable (e.g., BMI of
thinnest body type that a person finds attractive and BMI of the largest body type he or she finds
attractive) and then subtracting one value from another to produce a difference score (also called
change scores, gain scores, and discrepancy scores).
Difference scores are intuitively appealing. They seem to fit well with
phenomena such as actual-ideal discrepancy, psychological change, and attractiveness range, and
they arise from simple subtraction:
$$D_i = X_i - Y_i \qquad \text{(Equation 7.1)}$$
An individual’s difference score is the difference between his or her score on two components –
variable X and variable Y. Given this intuitive appeal and simplicity, difference scores have
been used in many areas of psychology, including social psychology.
Unfortunately, this intuitive appeal masks psychometric issues potentially compromising
psychological conclusions based upon difference scores. These issues have been discussed for
decades (e.g., Collins, 1996; Cronbach & Furby, 1970; Overall & Woodward, 1975; Rogosa,
1995; Zimmerman, Williams, & Zumbo, 1993; Zumbo, 1999), but non-optimal use of difference
scores persists. If these issues are ignored when using difference scores, then research quality
suffers – perhaps producing conclusions that are misinformed or that simply miss more
fundamental phenomena.
This section presents psychometric and statistical properties of difference scores,
important implications of these properties, problems arising from these implications, and
measurement-based recommendations regarding the potential use of difference scores. The
practical take-home message is twofold.
1. Researchers should consider avoiding difference scores, instead focusing on the
component variables from which difference scores are computed.
2. If researchers use difference scores, then they should do so with thorough
examination of the component variables and with serious attention to psychometric
quality.
Properties of Difference Scores
There are at least two fundamental psychometric properties of difference scores –
properties that, we shall see, have implications for the meaning of difference scores and,
ultimately, for the psychological meaning of conclusions based upon difference scores. The
properties concern the reliability and variability of difference scores.
Reliability. Observed difference scores are treated as indicators of “true” psychological
difference scores. That is, the observed difference between measured variable X and measured
variable Y is taken as an indicator of the difference between a person’s true score on variable X
and his or her true score on variable Y. Thus, it is important to recognize the factors affecting
the reliability of observed difference scores – factors affecting the degree to which variability in
observed difference scores reflects variability in true difference scores.
In theoretical terms, the reliability of observed difference scores (rDD) is affected by true
score variability in the component variables, the true correlation between component variables,
and the reliability of the measures of the component variables:
$$r_{DD} = \frac{s^2_{T_X} + s^2_{T_Y} - 2\, r_{T_X T_Y}\, s_{T_X} s_{T_Y}}{\dfrac{s^2_{T_X}}{r_{XX}} + \dfrac{s^2_{T_Y}}{r_{YY}} - 2\, r_{T_X T_Y}\, s_{T_X} s_{T_Y}} \qquad \text{(Equation 7.2)}$$
In this equation, $s^2_{T_X}$, $s^2_{T_Y}$, $s_{T_X}$, and $s_{T_Y}$ are the true score variances and
standard deviations of variables X and Y (again, the two components of the difference score), rXX and
rYY are reliabilities of the measures of X and Y, and $r_{T_X T_Y}$ is the true correlation between X
and Y.¹ As we shall see, this equation has important implications for the meaning and utility of
difference scores.
Variability. Most analyses of difference scores focus on their observed variability. In an
experimental context, one might evaluate whether experimentally-induced differences in one or
more IVs explain variability in participants’ observed difference scores. In a non-experimental
context, one might evaluate whether naturally-occurring differences in one or more predictor
variables are associated with variability in participants’ observed difference scores. The
importance of variability in difference scores requires an understanding of the factors producing
that variability. Specifically, variability in observed difference scores reflects variability in
component measures and the correlation between those measures:
$$s^2_D = s^2_X + s^2_Y - 2\, r_{XY}\, s_X s_Y \qquad \text{(Equation 7.3)}$$
where $s^2_D$ is the variance of observed difference scores, $s^2_X$, $s^2_Y$, $s_X$, and $s_Y$ are the
observed variances and standard deviations of measures of variables X and Y, and rXY is the correlation
between those measures.
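Equation 7.3 can be checked numerically. The sketch below simulates two correlated measures (the data and parameter values are arbitrary and hypothetical) and compares the variance of the difference scores computed directly with the value implied by the component variances and their correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, size=1000)            # hypothetical measure X
y = 0.5 * x + rng.normal(0.0, 2.0, size=1000)   # hypothetical measure Y, correlated with X
d = x - y                                       # difference scores (Equation 7.1)

s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)
r_xy = np.corrcoef(x, y)[0, 1]

var_d_direct = np.var(d, ddof=1)
var_d_formula = s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y   # Equation 7.3

print(round(float(var_d_direct), 4), round(float(var_d_formula), 4))
```

The two values agree up to floating-point error, because Equation 7.3 is an algebraic identity rather than an approximation.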
Implications of These Properties
These two properties reflect fundamental statistical and psychometric qualities of
difference scores, and they have several implications. These implications, in turn, affect the
meaningfulness and utility of difference scores.
Unreliable scales produce unreliable difference scores. There is much debate about the
reliability of difference scores. Some researchers believe that difference scores are inherently
unreliable, while others note that difference scores can, in fact, be reasonably reliable. In reality,
difference scores can indeed be reliable, but they are unreliable when their component measures
are unreliable.
Based upon Equation 7.2, Figure 7.3a reflects the reliability of difference scores as a
function of the reliability of component measures. Values were generated by holding constant
the true correlation between X and Y (arbitrarily) at $r_{T_X T_Y}$ = .50, holding equal the true variances of
X and Y (i.e., $s^2_{T_X} = s^2_{T_Y}$), and assuming that measures of X and Y are equally reliable. As the
Figure shows, components with low reliability produce difference scores with very poor
reliability – e.g., if rXX and rYY are .40, then rDD=.25. However, components with strong
reliability can produce difference scores that are reasonably reliable – e.g., if rXX and rYY are .90,
then rDD=.82. Thus, despite some widespread belief to the contrary, difference scores can be
reliable, but only if one (or both) of the components is highly reliable. Nevertheless, the fact remains
that, when component measures are unreliable, difference scores are unreliable.
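The values quoted above follow directly from Equation 7.2. The function below is a minimal Python sketch of that equation; its name and argument names are illustrative, and the settings mirror those described for Figure 7.3a (equal true variances, true correlation of .50, equally reliable measures).

```python
def difference_score_reliability(var_tx, var_ty, r_t, r_xx, r_yy):
    """Reliability of observed difference scores (Equation 7.2).

    var_tx, var_ty: true score variances of X and Y
    r_t:            true correlation between X and Y
    r_xx, r_yy:     reliabilities of the measures of X and Y
    """
    cov_t = r_t * (var_tx ** 0.5) * (var_ty ** 0.5)
    numerator = var_tx + var_ty - 2 * cov_t
    denominator = var_tx / r_xx + var_ty / r_yy - 2 * cov_t
    return numerator / denominator

for rel in (0.40, 0.60, 0.80, 0.90):
    r_dd = difference_score_reliability(1.0, 1.0, 0.50, rel, rel)
    print(f"rXX = rYY = {rel:.2f} -> rDD = {r_dd:.2f}")
```

With component reliabilities of .40 and .90 this gives .25 and .82, matching the values in the text; the intermediate settings of .60 and .80 give approximately .43 and .67.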
--------Insert Figure 7.3----------
Highly-correlated components produce unreliable difference scores. Perhaps somewhat
counterintuitively, the reliability of difference scores is reduced when components are positively
correlated with each other. Equation 7.2 reflects the reliability of difference scores as a function
of the true correlation between the two components (i.e., $r_{T_X T_Y}$), and Figure 7.3b illustrates this
across a range of true correlations. This figure (which assumes that the measures of X and Y
have reliabilities of .80 and that they have equal true variances) shows that larger inter-
component correlations produce smaller reliabilities of difference scores. For example, it shows
that components truly correlated with each other at only $r_{T_X T_Y}$ = .10 can produce difference scores
with good reliability of rDD=.78; however, components correlated more robustly with each other
at $r_{T_X T_Y}$ = .70 produce difference scores with substantially lower reliability of rDD=.54. That is,
even though both component measures might be highly reliable, difference scores have lower
reliability when components are highly positively correlated. This fact leads some researchers to
question the reliability – perhaps even the utility in general – of difference scores.
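The role of the true correlation is easier to see in a simplified form of Equation 7.2. Under the assumptions stated for Figure 7.3 (equal true variances and equally reliable measures), the equation reduces to an expression in two quantities only. The shorthand r (the common reliability), ρ (the true correlation between X and Y), and s_T² (the common true variance) is introduced here for convenience and does not appear in the original.

```latex
% Shorthand introduced for this sketch:
%   r = r_{XX} = r_{YY}, \quad \rho = r_{T_X T_Y}, \quad s_T^2 = s_{T_X}^2 = s_{T_Y}^2
\[
r_{DD}
  = \frac{2 s_T^2 - 2 \rho\, s_T^2}{\dfrac{2 s_T^2}{r} - 2 \rho\, s_T^2}
  = \frac{r\,(1 - \rho)}{1 - r \rho}
\]
% Worked values for r = .80:
\[
\rho = .10:\quad r_{DD} = \frac{.80 \times .90}{1 - .08} = \frac{.72}{.92} \approx .78
\]
\[
\rho = .70:\quad r_{DD} = \frac{.80 \times .30}{1 - .56} = \frac{.24}{.44} \approx .545
\]
```

As ρ approaches 1 the numerator approaches 0 while the denominator approaches 1 − r, so the reliability of the difference scores falls toward zero whenever the component measures are less than perfectly reliable. The second worked value corresponds to the .54 reported in the text.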
Difference scores can simply reflect one of the components. Because difference scores
arise from two component variables, variability in difference scores can simply reflect variability
in one component. That is, under some circumstances, difference scores reflect – or largely
reflect – one component. This implication is apparent in the correlation (rXD) between the
difference scores and one component (in this case, variable X):
$$r_{XD} = \frac{s_X - r_{XY}\, s_Y}{\sqrt{s^2_X + s^2_Y - 2\, r_{XY}\, s_X s_Y}} \qquad \text{(Equation 7.4)}$$
where $s^2_X$, $s^2_Y$, $s_X$, $s_Y$, and rXY are as defined above. This equation shows that the link
between a difference score and a component is largely a function of the difference between the
variabilities of the two components. That is, for any given level of association between
components (i.e., holding rXY constant), difference scores are more strongly correlated with the
component having greater variability. In sum, if components have different variabilities, then the
one with greater variability will have greater impact on the difference scores (see Equation 7.3)
and thus will be more strongly correlated with difference scores (Equation 7.4).
This effect can be seen in Figure 7.4, which presents correlations between difference
scores and each component variable (i.e., rXD and rYD). Correlations are presented as a function of
the ratio of variability in component X to variability in component Y (i.e., sX/sY) and (arbitrarily)
setting the components correlated with each other at rXY = .40. When component X has less
variability than component Y (e.g., when sX/sY = .2), difference scores are less strongly correlated
with component X than with component Y (i.e., rXD = -.21 and rYD = -.98). In contrast, when
component X has greater variability than component Y (e.g., when sX/sY = 2), difference scores
are more strongly correlated with component X than with component Y (i.e., rXD = .87 and rYD =
-.10). It is only when the components have equal variability that they are equally-correlated with
difference scores (i.e., if sX/sY = 1, then |rXD| = |rYD|).
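The pattern described for Figure 7.4 can be reproduced from Equations 7.3 and 7.4. In the Python sketch below, the function name is illustrative, and the expression for rYD is not printed in the text: it is derived here by the same algebra as Equation 7.4, applied to Y with D = X − Y, and should be read as an assumption of this sketch.

```python
import math

def corr_with_difference(s_x, s_y, r_xy):
    """Return (rXD, rYD) for difference scores D = X - Y."""
    s_d = math.sqrt(s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y)  # Equation 7.3
    r_xd = (s_x - r_xy * s_y) / s_d                          # Equation 7.4
    r_yd = (r_xy * s_x - s_y) / s_d                          # same algebra for Y (derived here)
    return r_xd, r_yd

# Vary the ratio sX/sY while holding rXY at .40, as in Figure 7.4.
for ratio in (0.2, 0.5, 1.0, 2.0):
    r_xd, r_yd = corr_with_difference(s_x=ratio, s_y=1.0, r_xy=0.40)
    print(f"sX/sY = {ratio:.1f}: rXD = {r_xd:.3f}, rYD = {r_yd:.3f}")
```

The results agree with the values quoted in the text to within rounding in the second decimal place, and the two correlations are equal in absolute value only when the ratio is 1.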
--------Insert Figure 7.4----------
Issues in Application of Difference Scores
Thus far, this section has articulated psychometric properties of difference scores along
with three implications of those properties. In addition, my experience as a reviewer, editor, and
reader suggests three observations of the way that difference scores have been applied in social
psychological research.
First, difference scores are derived occasionally from “single-item” component measures,
with little, if any, attention to psychometric implications. This problem compounds an issue
mentioned earlier in this volume – single-item measures are relatively likely to have poor
psychometric quality. Thus, difference scores based upon single-item component measures likely
have very poor psychometric quality, with potentially serious consequences for the quality of
subsequent analyses and psychological interpretation.
Second, despite long-held concerns regarding the reliability of difference scores and
despite the importance of knowing the psychometric quality of any variables being examined,
researchers often seem to ignore the reliability of difference scores. When difference scores are
examined, researchers occasionally focus only on the psychometric quality of component
measures, paying little or no attention to the psychometric quality of the difference scores.
Third, despite the strong dependence of difference scores upon their components,
researchers often seem to ignore the components when examining difference scores. That is,
researchers seem to move quickly to analysis of difference scores, apparently ignoring the fact
that difference scores simply reflect their components to varying degrees.
Potential Problems Arising From Difference Scores
Taken together, the properties, implications, and applied issues produce significant
concerns with the use of difference scores in social psychological research. Two problems are
particularly important, potentially compromising research based upon difference scores.
Poor reliability may obscure real effects. Although reliability might be less pervasively
problematic than sometimes supposed, there are legitimate concerns about the reliability of
difference scores in some research. As mentioned earlier, the reliability of difference scores
suffers when components are correlated positively with each other and when they are measured
with poor reliability. In many, perhaps most, applications of difference scores, components are
likely to be robustly correlated with each other. When combined with the possibility that
difference scores are sometimes derived from components having poor (or unknown) reliability,
this creates significant potential problems with the reliability of difference scores.
If difference scores have poor reliability, then analyses may miss meaningful and real
effects. As discussed earlier (Chapter 4), poor reliability reduces observed effect sizes, which
reduces the power of inferential analyses, which increases the likelihood of Type II errors.
These effects are as true for difference scores as they are for any dependent variable in any social
psychological study.
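The attenuation described above can be illustrated with the standard classical test theory relation, in which the expected observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The following Python sketch is illustrative only: the function name and all numerical values are hypothetical and are not drawn from any study discussed here.

```python
import math

def attenuated_correlation(true_r, rel_dv, rel_predictor):
    """Expected observed correlation under classical test theory attenuation."""
    return true_r * math.sqrt(rel_dv * rel_predictor)

# Hypothetical scenario: a true correlation of .40 between a predictor
# (reliability .80) and a difference score used as the dependent variable.
for rel_diff in (0.90, 0.60, 0.30):
    observed = attenuated_correlation(0.40, rel_diff, 0.80)
    print(f"difference-score reliability {rel_diff:.2f} -> expected observed r = {observed:.3f}")
```

As the reliability of the difference score falls, the expected observed correlation shrinks, which in turn lowers statistical power for any fixed sample size.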
Lack of discriminant validity, producing obscured results. Perhaps the most subtle
problem is that difference scores may lack discriminant validity, potentially obscuring
psychological conclusions based upon their analysis. Because difference scores can be
influenced heavily by a component having relatively large variance, they can simply reflect one
component. This psychometric situation could produce conclusions that poorly reflect
psychological reality – that are misleading, overly complex, obscure, or that simply miss deeper
psychological messages (see Footnote 2).
For example, a recent study of perceived physical attractiveness presented personality
correlates of “attractiveness range” difference scores (Swami et al., in press). In this research,
male participants viewed images of nine increasingly-large female figures, which were
interpreted as a 1-to-9 interval scale of size. Each participant identified the smallest and largest
figures he found attractive, and the difference (in terms of the size-scale values) was interpreted
as a participant’s “attractiveness range” (AR). These AR difference scores were then correlated
with scores from personality scales, producing a significant negative correlation between AR and
Extraversion. This seemingly suggests a potentially-meaningful connection between males’
Extraversion and the range of body sizes they find attractive. Specifically, it suggests that males
with high levels of Extraversion are attracted to a “narrower range of body sizes” than are males
with low levels of Extraversion.
However, deeper analysis clarified and fully explained the apparent “attractiveness range”
findings. Specifically, across all males, the “largest attractive” ratings had much greater
variability than the “smallest attractive” ratings. That is, males varied much more dramatically
in the largest-sized figures they found attractive than in the smallest-sized figures they found
attractive. As discussed earlier, if one component of a difference score has greater variability
than the other, then the difference score largely reflects the one with greater variability.
Consequently, AR difference scores largely reflected the “largest attractive” ratings, and this fact
was verified with an extremely high correlation (r=.86) between the difference scores and the
“largest attractive” ratings (Swami et al., in press). Fortunately, the researchers examined the
components (i.e., smallest attractive and largest attractive ratings) alongside the difference scores
(i.e., AR scores), revealing the confounding of AR scores with one of their components. This
allowed the researchers to make more informative and psychologically-meaningful
interpretations than would have been possible with only the analysis of difference scores.
Reliance upon only the AR scores would have produced conclusions that were limited reflections
of the psychological reality otherwise readily apparent in the analysis of the two components.
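The dominance of the more variable component follows from covariance algebra for a difference score D = X - Y: the correlation of D with X is (sX - rXY*sY)/sD, and the correlation of D with Y is (rXY*sX - sY)/sD, where sD is the standard deviation of D. The Python sketch below evaluates these expressions for hypothetical values (with X playing the role of the “largest attractive” rating); the function name and the numbers are assumptions for illustration and are not the values reported by Swami et al.

```python
import math

def component_difference_correlations(s_x, s_y, r_xy):
    """Correlations of D = X - Y with X and with Y, from covariance algebra."""
    s_d = math.sqrt(s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y)
    r_xd = (s_x - r_xy * s_y) / s_d
    r_yd = (r_xy * s_x - s_y) / s_d
    return r_xd, r_yd

# Hypothetical components correlated .30; the SD of X is 1, 2, or 3 times the SD of Y.
for ratio in (1.0, 2.0, 3.0):
    r_xd, r_yd = component_difference_correlations(s_x=ratio, s_y=1.0, r_xy=0.30)
    print(f"sX/sY = {ratio:.1f}: r_XD = {r_xd:.2f}, r_YD = {r_yd:.2f}")
```

With equal variabilities the two correlations are equal in size and opposite in sign; as the variability of X grows relative to Y, the difference score becomes increasingly indistinguishable from X.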
This potential obscuring effect can be seen in the correlation (rPD) between a predictor or
independent variable (e.g., Extraversion) and a difference score (e.g., AR):
\[
r_{PD} = \frac{r_{PX}\,s_X - r_{PY}\,s_Y}{\sqrt{s_X^2 + s_Y^2 - 2\,r_{XY}\,s_X s_Y}} \qquad \text{(Equation 7.5)}
\]
In this equation, $r_{PX}$ and $r_{PY}$ are correlations between the predictor and component variables, and
$s_X^2$, $s_Y^2$, $s_X$, $s_Y$, and $r_{XY}$ are as defined earlier. The numerator reveals that the association
between a predictor/IV and a difference score is largely a blend of correlations between the
predictor/IV and the components. More deeply, the association largely reflects a link between
the predictor/IV variable and whichever component has greater variability. Thus, in the AR
study, the association between Extraversion and AR largely reflected a link between
Extraversion and the “largest attractive” ratings.
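As an illustration of Equation 7.5, the following Python sketch computes the predictor-difference correlation from component statistics and compares it with the correlation computed directly from difference scores. The data are simulated, and the variable names, sample size, and effect sizes are hypothetical choices made only for this example.

```python
import numpy as np

def predictor_difference_correlation(r_px, r_py, s_x, s_y, r_xy):
    """Equation 7.5: correlation between a predictor P and D = X - Y."""
    numerator = r_px * s_x - r_py * s_y
    denominator = np.sqrt(s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y)
    return numerator / denominator

# Simulated (hypothetical) data: X has much greater variability than Y.
rng = np.random.default_rng(0)
n = 500
p = rng.normal(size=n)
x = 0.5 * p + rng.normal(scale=2.0, size=n)
y = 0.1 * p + rng.normal(scale=0.5, size=n)
d = x - y

# Correlation matrix with rows/columns ordered P, X, Y, D.
r = np.corrcoef([p, x, y, d])
from_formula = predictor_difference_correlation(
    r[0, 1], r[0, 2], x.std(ddof=1), y.std(ddof=1), r[1, 2]
)

print(f"r_PD from Equation 7.5: {from_formula:.4f}")
print(f"r_PD computed directly: {r[0, 3]:.4f}")
print(f"r_PX (more variable component): {r[0, 1]:.4f}")
```

The two values of the predictor-difference correlation agree up to floating-point error, because Equation 7.5 is an algebraic identity rather than an approximation.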
Recommendations and Alternatives
Given the potential problems associated with difference scores, researchers might
consider several alternatives and suggestions. There are at least two alternatives to difference
scores, and there are several recommendations that should be strongly considered if difference
scores are used.
Examine each component as a dependent variable. Researchers might avoid difference
scores altogether, focusing instead on the components. For example, researchers might conduct
analyses twice – once with each component as a dependent variable. Anything potentially
revealed in an analysis of difference scores might be revealed more clearly, simply, and directly
by analysis of the two components.
Consider a regression context, though the same principles apply in an ANOVA context.
Rather than examining a single “difference score model” in which difference scores are predicted
by a predictor variable (i.e., Di=a+bPD(Pi)), researchers could examine two “component models,”
Xi=a+bPX(Pi) and Yi=a+bPY(Pi), predicting each component (X and Y) from the predictor
variable (P). The slopes from the component models reflect the association between each
component and the predictor, and the difference between these slopes is identical to the slope
obtained from the difference score model (i.e., bPD=bPX-bPY). Similarly, as shown earlier
(Equation 7.5), the correlation between a predictor variable and a difference score (rPD) largely
reflects the difference between the two component-predictor correlations (rPX and rPY, weighted
by the components’ standard deviations). Similar ANOVA approaches could be used, or perhaps
even more usefully, an ANOVA approach can be translated into a regression analysis. Important
generalizations of this approach are presented by Edwards (1995).
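The slope identity noted above (bPD = bPX - bPY) can be checked numerically. The Python sketch below fits the difference score model and the two component models to simulated data; the helper name, sample size, and generating coefficients are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

def simple_slope(predictor, outcome):
    """Least-squares slope from regressing an outcome on a single predictor."""
    slope, _intercept = np.polyfit(predictor, outcome, deg=1)
    return slope

# Simulated (hypothetical) predictor P and components X and Y.
rng = np.random.default_rng(1)
p = rng.normal(size=200)
x = 0.6 * p + rng.normal(size=200)
y = 0.2 * p + rng.normal(size=200)

b_px = simple_slope(p, x)      # component model for X
b_py = simple_slope(p, y)      # component model for Y
b_pd = simple_slope(p, x - y)  # difference score model

print(f"b_PX = {b_px:.4f}, b_PY = {b_py:.4f}")
print(f"b_PX - b_PY = {b_px - b_py:.4f}")
print(f"b_PD        = {b_pd:.4f}")
```

Because the difference score slope is fully determined by the two component slopes, the component models carry all of its information and additionally show which component drives the effect.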
Such examination of components rather than difference scores avoids problems
associated with difference scores. For example, if the predictor is related to only one component,
then a separate-component analysis would reveal this fact. Similarly, if components differ in
their variability, then separate-components analysis avoids the discriminant validity problem
arising with difference scores (i.e., the difference score primarily reflecting the component
having greater variability).
Examine both components as predictors in a regression model. Another separate-
component approach is to enter components as predictors in regression models predicting a
variable of interest. As noted by others (e.g., Peter, Churchill, & Brown, 1993), this approach
requires a slight reconceptualization to a question of incremental variance. For example, the
attractiveness example described earlier (Swami et al., in press) could be framed via a
hierarchical set of questions – 1) do male extraverts differ from introverts in the smallest size
figure they find attractive, and 2) after accounting for any such extraversion-related preferences
in small-sized figures, are there any remaining extraversion-related differences in preferences for
larger-sized figures? Such questions could be addressed through hierarchical regression with
two simple models:
1: Extraversion=a+b1(Smallest-size deemed attractive)
2: Extraversion=a+b1(Smallest-size deemed attractive)+b2(Largest-size deemed attractive).
The size and significance of b1 from Model 1 addresses the first question, and the size and
significance of b2 from Model 2 addresses the second. A large and significant b2 from Model 2
indicates that, for two males having identical “small-size” attraction preferences, the one with
higher Extraversion is likely to have a different “large-sized” attraction preference than the one
with lower Extraversion. Conceptually, this is quite similar to a “difference score” type of
finding that Extraversion is correlated with “attractiveness range,” but the two-phase analytic
strategy provides more information of potential importance. For example, it avoids the
possibility that an apparent “difference score” effect simply reflects one component, and it
provides information about the combined effect of the components.
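The two-model strategy can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data are simulated, the variable names and generating values are hypothetical, and the F statistic for the change in R-squared uses the standard formula for comparing nested regression models. It is not a reanalysis of the attractiveness study.

```python
import numpy as np

def fit_ols(predictors, outcome):
    """Return (coefficients, R^2) for an OLS model; an intercept is added first."""
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    residuals = outcome - design @ coefs
    r_squared = 1 - residuals.var() / outcome.var()
    return coefs, r_squared

# Simulated (hypothetical) ratings: Extraversion relates only to "largest" ratings.
rng = np.random.default_rng(2)
n = 300
smallest = rng.normal(loc=3.0, scale=0.5, size=n)
largest = smallest + rng.normal(loc=3.0, scale=1.5, size=n)
extraversion = -0.4 * largest + rng.normal(size=n)

# Model 1: Extraversion predicted from the smallest size deemed attractive.
coefs1, r2_model1 = fit_ols(smallest, extraversion)
# Model 2: adds the largest size deemed attractive.
coefs2, r2_model2 = fit_ols(np.column_stack([smallest, largest]), extraversion)

# Incremental variance explained by the second component (1 added predictor,
# n - 3 residual degrees of freedom in Model 2).
delta_r2 = r2_model2 - r2_model1
f_change = delta_r2 / ((1 - r2_model2) / (n - 3))

print(f"Model 1: b1 = {coefs1[1]:.3f}, R^2 = {r2_model1:.3f}")
print(f"Model 2: b1 = {coefs2[1]:.3f}, b2 = {coefs2[2]:.3f}, R^2 = {r2_model2:.3f}")
print(f"Change in R^2 = {delta_r2:.3f}, F(1, {n - 3}) = {f_change:.2f}")
```

Here b2 and the change in R-squared address the second question in the hierarchy, namely whether the “largest” ratings carry Extraversion-related information beyond the “smallest” ratings.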
Examine difference scores along with their components. After considering alternatives,
some researchers may remain interested in difference scores. In such cases, analysis is more
informative when the analysis of difference scores is added to the analysis of the
components, rather than replacing the components with difference scores. Indeed, if difference
scores are used, then they should be used only when accompanied by careful examination of
component scores.
At a minimum, researchers using difference scores should present fundamental
psychometric information and descriptive statistics of the components and of difference scores.
Specifically, they should present: a) reliability estimates of the components and of difference
scores, b) the means and variabilities of the components and of difference scores, and c) the
correlation between the two components and the correlation between each component and the
difference scores. Researchers can estimate the reliability of difference scores via:
\[
\text{estimated } r_{DD} = \frac{s_X^2\,(\text{estimated } r_{XX}) + s_Y^2\,(\text{estimated } r_{YY}) - 2\,s_X s_Y r_{XY}}{s_X^2 + s_Y^2 - 2\,s_X s_Y r_{XY}} \qquad \text{(Equation 7.6)}
\]
where $s_X^2$, $s_Y^2$, $s_X$, $s_Y$, and $r_{XY}$ are as defined earlier. Thus, researchers can use the components’
basic descriptive and psychometric information to estimate the reliability of difference scores.
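Equation 7.6 is straightforward to compute from values that researchers typically have on hand. The Python sketch below implements it and, as a check, compares it with the commonly presented equal-variance formula given in Footnote 1. The function names and all input values are hypothetical and chosen only for illustration.

```python
def estimated_difference_reliability(s_x, s_y, r_xx, r_yy, r_xy):
    """Equation 7.6: estimated reliability of the difference score D = X - Y."""
    numerator = s_x**2 * r_xx + s_y**2 * r_yy - 2 * s_x * s_y * r_xy
    denominator = s_x**2 + s_y**2 - 2 * s_x * s_y * r_xy
    return numerator / denominator

def equal_variance_reliability(r_xx, r_yy, r_xy):
    """Footnote 1 formula, accurate only when the components have equal variability."""
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)

# Hypothetical components, each measured with reliability .80.
for r_xy in (0.20, 0.50, 0.70):
    general = estimated_difference_reliability(1.0, 1.0, 0.80, 0.80, r_xy)
    simple = equal_variance_reliability(0.80, 0.80, r_xy)
    print(f"r_XY = {r_xy:.2f}: Equation 7.6 = {general:.3f}, equal-variance formula = {simple:.3f}")

# Unequal variabilities (hypothetical): the two formulas no longer agree.
print(estimated_difference_reliability(2.0, 1.0, 0.90, 0.60, 0.50))
print(equal_variance_reliability(0.90, 0.60, 0.50))
```

When the components have equal standard deviations, the two formulas coincide; when they differ, only Equation 7.6 reflects the greater weight of the more variable component.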
Such information allows researchers and readers to gauge potential problems with
reliability and discriminant validity. If reliabilities appear low, then researchers should consider
the resulting limitations upon their ability to detect meaningful results. Further, if one
component has substantially-greater variability than the other, then researchers should recognize
the resulting lack of discriminant validity between the “larger variability” component and the
difference score. A lack of discriminant validity would be apparent also in a large correlation
between that component and the difference score. If such validity-related concerns exist, then
readers and researchers should interpret analysis of difference scores very cautiously. In fact,
such findings might motivate researchers to avoid difference scores altogether, returning to a
separate-component approach.
Going further, researchers interested in using difference scores should strongly consider
running all analyses with both components along with the difference score (i.e., running three
sets of analysis). For example, the “attractiveness range” research presented one set of
ANCOVAs with the difference score as the dependent variable, another ANCOVA with the
“thinnest-size deemed attractive” component as the DV, and another with the “largest-size
deemed attractive” component as the DV (Swami et al., in press). Results revealed no effects for
the “thinnest” component, significantly robust effects for the “largest” component, and
significantly robust effects for the difference score. The latter finding is fully predictable from
the fact that previously-reported correlational analysis revealed an extremely large association
between the “largest” component and the difference scores. That is, the IV’s effects on the
“attractiveness range” difference score simply reflect the IV’s effects on the “largest” component.
The authors noted this important fact when discussing their results.
Summary
The intuitive appeal of difference scores masks a psychometrically-thorny set of
problems. The current section introduced some psychometric properties of difference scores,
noted important implications and potential problems arising from these properties, and presented
recommendations regarding the analysis of difference scores. Two psychometric issues are
particularly relevant – the potential lack of discriminant validity and the potential for low
reliability. The discriminant validity issue is perhaps the more serious, though less-appreciated,
problem with difference scores. That is, many researchers are familiar with concerns about the
reliability of difference scores, but fewer may be aware that a difference score could simply
reflect one of its components. In sum, analysis of difference scores – if conducted at all – should
be conducted only alongside analysis of the components producing the difference score, and only
with careful attention to core psychometric issues.
Footnotes
1. Equation 7.2 differs from some equations articulating the reliability of difference scores, such
as this commonly-presented equation:
\[
r_{DD} = \frac{\tfrac{1}{2}\,(r_{XX} + r_{YY}) - r_{XY}}{1 - r_{XY}}
\]
This equation is accurate only when the two component measures have equal variabilities – an
assumption that, though sometimes valid, bypasses some crucial psychometric facts. In contrast,
Equation 7.2 follows the basic tenets of classical test theory, with no additional assumptions, and
it reveals implications regarding the links between variability and reliability.
2. Interestingly, difference scores are the basis of some familiar statistical procedures, such as the
test of an interaction in a split-plot analysis. Note that such analysis requires attention to the
homogeneity of variances of the within-subjects factor, corresponding to concern about the
similarity of variances of the two components of a difference score. Furthermore, researchers
would rarely limit analysis to a significant interaction, more likely proceeding to decompose the
interaction into simple main effects. Such informative, important, and quite standard follow-up
analysis parallels the examination of the two components of a difference score.
Table 7.1
Example Data for Profile Similarity

Couple 1                                Distinctive  Distinctive
Trait                Marge    Homer     Marge        Homer
Neuroticism          4        3         .00          -1.00
Extraversion         5        7         -.67         1.67
Openness             1        2         -.33         -1.67
Agreeableness        5        4         -.33         -1.00
Conscientiousness    4        3         -1.67        -2.00
Mean                 3.80     3.80      -.60         -.80
Std Dev              1.47     1.72      .57          1.29

Couple 2                                Distinctive  Distinctive
Trait                Wilma    Fred      Wilma        Fred
Neuroticism          2        4         -2.00        .00
Extraversion         5        4         -.67         -1.33
Openness             1        5         -.33         1.33
Agreeableness        6        5         .67          .00
Conscientiousness    7        6         1.33         1.00
Mean                 4.20     4.80      -.20         .20
Std Dev              2.32     .75       1.15         .93

Couple 3                                Distinctive  Distinctive
Trait                Betty    Barney    Betty        Barney
Neuroticism          6        5         2.00         1.00
Extraversion         7        5         1.33         -.33
Openness             2        4         .67          .33
Agreeableness        5        6         -.33         1.00
Conscientiousness    6        6         .33          1.00
Mean                 5.20     5.20      .80          .60
Std Dev              1.72     .75       .81          .53

Norms                Normative  Normative
Trait                Wife       Husband
Neuroticism          4.00       4.00
Extraversion         5.67       5.33
Openness             1.33       3.67
Agreeableness        5.33       5.00
Conscientiousness    5.67       5.00
Mean                 4.40       4.60
Std Dev              1.65       .65
Table 7.2
Profile Similarity Components

                                     Generalized
        Overall      Distinctive     Normative      Wife            Husband
Couple  Similarity   Similarity      Similarity     Normativeness   Normativeness
1       .70          .19             .92            .94             .77
2       .48          .37             .92            .89             .11
3       .59          -.29            .92            .89             .72
Figure 7.1 Hypothetical profiles
7.1a Wife and Husband Personality Profiles (x-axis: Trait [Neur., Ext., Open., Agree., Consc.]; y-axis: Score, 0 to 8; series: Wife, Husband)
7.1b Wife 1 Personality Profile (x-axis: Trait [Neur., Ext., Open., Agree., Consc.]; y-axis: Score, .00 to 8.00; annotations: Elevation, Shape, Scatter)
Figure 7.2
7.2a Wives' self-rated profiles (x-axis: Trait [Neur., Ext., Open., Agree., Consc.]; y-axis: Score, .00 to 8.00; series: Marge, Wilma, Betty)
7.2b Wives' self-rated profiles, with the Normative Wife profile (x-axis: Trait [Neur., Ext., Open., Agree., Consc.]; y-axis: Score, .00 to 8.00; series: Marge, Wilma, Betty, Normative Wife)
Figure 7.3 Reliability of difference scores as a function of: a) reliabilities of components, and b) true correlations between components
7.3a (x-axis: Reliability of measures of X and Y (rXX and rYY), .00 to 1.00; y-axis: Reliability of Difference Scores (rDD), .00 to 1.00)
7.3b (x-axis: True correlation between X and Y (rTXTY), .00 to 1.00; y-axis: Reliability of Difference Scores (rDD), .00 to 1.00)
Figure 7.4 (x-axis: sX/sY, .20 to 2.20; y-axis: rXD or rYD, -1.00 to 1.00; series: rXD, rYD)