
RUNNING HEAD: Reliability and Validity Coefficients

Reliability and Construct Validity Coefficients Adjusted for Correlation within Family Dyads

George J. Knafl,1* Jane K. Dixon,2 Margaret Grey,2 Jean P. O'Malley,1 Kathleen A. Knafl1

1 School of Nursing, Oregon Health & Science University

2 School of Nursing, Yale University

* Corresponding author, [email protected]

Acknowledgements

This work was supported in part by grant R01 NR08048 from the National Institute of Nursing Research (NINR) of the National Institutes of Health (NIH), grant R01 AI057043 from the National Institute of Allergy and Infectious Diseases of the NIH, and by Oregon Health & Science University's Oregon Clinical and Translational Research Institute (OCTRI) through grant UL1 RR024140 from the National Center for Research Resources (NCRR) of the NIH and the NIH Roadmap for Medical Research.


ABSTRACT

The assessment of reliability and construct validity for scales used to evaluate families is potentially limited because standard coefficients cannot be computed appropriately from the combined responses of family members, owing to possible intra-familial correlation among those responses. In this paper, we describe and demonstrate the use of linear mixed modeling to compute generalizations of such coefficients for item and scale responses from family dyads. These include generalizations of the construct validity correlation, the differential reliability index, internal consistency reliability, the item-total correlation, and test-retest reliability. Example analyses are presented using survey responses, focused on family management of the child's condition, from parents of a child with a chronic condition.

Key Words: childhood chronic conditions, construct validity, dyadic data, family, reliability, survey instruments


Reliability and Construct Validity Coefficients Adjusted for Correlation within Family Dyads

In order to address the family as a unit, family researchers collect measurements on multiple family members, often on family dyads (e.g., mothers/fathers, parent/child, ill/healthy child). Measurements for family members are likely to be correlated, and so their combined analysis requires special methods accounting for intra-familial correlation (IFC) (Kashy & Snyder, 1995). Hierarchical linear modeling (HLM), also called multilevel modeling, often is used to model such measurements (Atkins, 2005; Raudenbush & Bryk, 2002), including approaches specialized to family data (Cook & Snyder, 2005). These measurements commonly are based on summated scales whose reliability for the complete study data requires assessment. Standard reliability coefficients, however, are based on the assumption of independence across subjects, and so should be computed only for each family member separately. Standard coefficients do not reflect the composite reliability of combined responses for all family members, but they commonly are reported because of the limitations of current methods. Recent examples include Martini, Grusec, and Bernadini (2001) on satisfaction with help provided to older mothers by their adult daughters, O'Leary and Vidair (2005) on child rearing for fathers and mothers, and Yeh et al. (2006) on sexual satisfaction, marital quality, and marital instability for midlife couples.

The results reported in these and similar articles may provide an accurate assessment of reliability, but that can be verified only through comparison with results that account for IFC, computed using responses for all surveyed family members. Even when results based upon standard reliability coefficients do provide an accurate assessment, they give no indication of the strength of the relationship between family members and thus provide only a partial assessment of the family, addressing individual family members but not the family unit. As Kashy and Snyder (1995) point out in the context of couples, the couple together is more than the aggregate of two individuals.

Reliability has been generalized in several ways. Wang (2002) used HLM (i.e., the method as opposed to the software tool also called HLM) to generalize internal consistency reliability (ICR) with intraclass correlations computed from models conditioned on properties of surveyed individuals (in that case, students). The properties considered there are fixed effects and so would not account for IFC, but HLM could be used to introduce random family effects. However, HLM with covariance structures generated by random effects or coefficients is a special case of linear mixed modeling, which also supports other, directly specified covariance structures, including the ones presented here. Graham, Liu, and Jeziorski (2006) applied the reliability generalization approach of Vacha-Haase (1998) to conduct a meta-analysis combining ICR values for multiple studies. These methods combine reliability values computed separately from multiple samples, rather than computing composite values from combined scale responses for family members within a single sample. Confirmatory factor analysis (CFA) can be used to compute composite ICR values for multiple factors combined together in a single CFA model (Hatcher, 1994). This provides an alternative to Cronbach's alpha for computing reliability for factors combined together rather than for associated scales one at a time, but does not generalize that coefficient to address family data. Raykov and Shrout (2002) used structural equation modeling (SEM) to generalize ICR to the case of scales with more than one underlying latent construct. While their work does not address generalization to handle IFC, SEM could be used to generate models analogous to the ones considered here. However, some popular SEM software tools, such as PROC CALIS of the Statistical Analysis System (SAS), could not be used because they do not support differing numbers of participating family members.

This paper describes how linear mixed modeling (Brown & Prescott, 1999) can be used to compute reliability and construct validity coefficients addressing the family as a composite unit for the special but common case of family dyads. Formulations are provided for generalized coefficients accounting for IFC in item and scale responses from or about family dyads (e.g., from mothers and fathers about the family or from mothers about ill and healthy siblings). With these methods, composite coefficients can be computed using combined responses for both family members (as opposed to computing them with standard formulas one member at a time) while accounting for the family unit through the correlation in responses within families. These methods are demonstrated using data from an instrument development study that surveyed parents of a child with a chronic condition on the family's management of the child's condition.

We begin by generalizing Cronbach's alpha for measuring ICR, the most commonly reported reliability coefficient. The construct validity correlation (CVC) is generalized next; this requires a somewhat more complex formulation than is needed to generalize ICR. Then several other coefficients are generalized, in this order: the test-retest reliability (TRR), the item-total correlation (ITC), and the differential reliability index (DRI). These are generalized using the same kind of formulation as for the CVC. Formulations are presented in separate sections for readers interested in the computational details. SAS macros to support these computations are available from the authors.

Composite Internal Consistency Reliability (ICR)

ICR for a scale commonly is measured using Cronbach's alpha, computed using a well-known formula combining variances for the scale and all of its items (e.g., see DeVellis, 2003, p. 35). These item and scale variances are computed using the standard approach based on the assumption of independence across subjects, which will not hold in general for item and scale responses from members of the same family. Consequently, ICR should be computed in this way using responses for individual family members separately, not responses for all family members combined. However, item and scale variances can be generalized using linear mixed modeling to composite quantities accounting for IFC. These can be inserted into the standard Cronbach's alpha formula to obtain composite ICR coefficients that account for IFC. Composite item-deleted ICR coefficients can be computed in this way as well.
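As a minimal sketch of that substitution (Python; the function names are hypothetical and the variances in the usage lines are made up, not FaMM estimates), Cronbach's alpha depends only on the item variances and the scale variance, so the same function serves whether those inputs are standard estimates or composite estimates from a linear mixed model:

```python
from typing import Sequence


def cronbach_alpha(item_variances: Sequence[float], scale_variance: float) -> float:
    """Cronbach's alpha = k / (k - 1) * (1 - sum of item variances / scale variance).

    The variances may be standard estimates for one family member or composite
    estimates for that member from a linear mixed model; the formula is unchanged.
    """
    k = len(item_variances)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    if scale_variance <= 0:
        raise ValueError("scale variance must be positive")
    return k / (k - 1) * (1.0 - sum(item_variances) / scale_variance)


def item_deleted_alpha(item_variances: Sequence[float], dropped: int,
                       deleted_scale_variance: float) -> float:
    """Alpha with item `dropped` removed. The variance of the item-deleted scale
    must be estimated separately (e.g., from its own mixed model fit)."""
    remaining = [v for i, v in enumerate(item_variances) if i != dropped]
    return cronbach_alpha(remaining, deleted_scale_variance)


# Hypothetical inputs: three items with variance 1.0 and pairwise covariance 0.5,
# so the summated scale has variance 3 * 1.0 + 6 * 0.5 = 6.0.
print(round(cronbach_alpha([1.0, 1.0, 1.0], 6.0), 4))          # 0.75
# Dropping one item leaves variance 2 * 1.0 + 2 * 0.5 = 3.0 for the two-item scale.
print(round(item_deleted_alpha([1.0, 1.0, 1.0], 0, 3.0), 4))   # 0.6667
```

Only the variance inputs change between the standard and composite versions; that is why the generalization reduces to estimating those variances correctly.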

Generalizability theory (or G theory for short) can be used to extend Cronbach's alpha to express ICR for a scale as a function of the number of items used in computing it, not just the actual number of items (Shavelson & Webb, 1991). These G theory ICR extensions are generalized here as well to composite coefficients.

ICR Formulation

General linear mixed models are used in modeling multiple measurements (Brown & Prescott, 1999). Measurements for the same subject are treated as dependent, while measurements for different subjects are treated as independent. Each such model consists of a fixed component modeling expected or mean values for the measurements and a random component modeling the within-subject dependence in measurements. The fixed component can be any special case of the general linear model (e.g., regression, analysis of variance, analysis of covariance). The random component can be any covariance matrix for within-subject measurements (or, equivalently, the variances for all measurements together with correlations for all pairs of measurements). When this covariance matrix is generated through random effects and/or coefficients along with independent, constant-variance error terms, the linear mixed model is equivalent to an associated hierarchical linear or multilevel model (Singer, 1998). For example, a single random intercept generates the covariance structure called compound symmetry because it has constant variances and constant correlations for different measurements. On the other hand, there are valid covariance structures that cannot be generated through random effects/coefficients, so that linear mixed models are more general (Verbeke & Molenberghs, 2000). Parameters of linear mixed models are estimated using full information maximum likelihood (Arminger & Sobel, 1990), allowing subjects to have partial sets of measurements without imputing values for missing measurements. This is especially important for modeling family data, which typically involve families with differing numbers of participating members.
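To make the partial-data point concrete, here is a minimal Python sketch of the log-likelihood that full information maximum likelihood would maximize for a dyad model with means mu1, mu2 and a 2×2 covariance matrix. The function names and data are hypothetical, and this is not the PROC MIXED implementation; an optimizer would maximize the sum over families with respect to the mean and covariance parameters.

```python
import math
from typing import Optional


def normal_logpdf(y: float, mu: float, var: float) -> float:
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)


def family_loglik(y1: Optional[float], y2: Optional[float],
                  mu1: float, mu2: float,
                  var1: float, var2: float, cov12: float) -> float:
    """Log-likelihood contribution of one dyad under a bivariate normal model.

    y1, y2: responses of family members 1 and 2 (None if that member did not
    participate). A missing member is integrated out, which for the normal
    distribution means using the marginal density of the observed member;
    nothing is imputed.
    """
    if y1 is not None and y2 is not None:
        det = var1 * var2 - cov12 ** 2
        if det <= 0:
            raise ValueError("covariance matrix must be positive definite")
        d1, d2 = y1 - mu1, y2 - mu2
        # Quadratic form d' Sigma^{-1} d using the closed-form 2x2 inverse.
        quad = (var2 * d1 ** 2 - 2 * cov12 * d1 * d2 + var1 * d2 ** 2) / det
        return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
    if y1 is not None:
        return normal_logpdf(y1, mu1, var1)
    if y2 is not None:
        return normal_logpdf(y2, mu2, var2)
    return 0.0  # no observed responses: the family contributes nothing


# Hypothetical data: two complete dyads and one mother-only family.
families = [(27.0, 25.0), (30.0, 29.0), (24.0, None)]
total = sum(family_loglik(m, f, 27.0, 27.0, 16.0, 20.0, 10.0) for m, f in families)
print(round(total, 3))
```

A mother-only family simply contributes the univariate term, which is why such families need not be dropped or have the missing father's response imputed.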

The specific linear mixed model used to generate composite variances for ICR computations for family dyads is based on a pair of measurements, one for each family dyad member, for either a scale or one of its items. Families are the "subjects" of the model, rather than the participating family members (the subjects of the study). The fixed component has different mean values for the two family members along with a possible dyad effect (e.g., partnered versus single) on the mean. The covariance matrix has two rows and two columns (and so dimension 2×2). The two diagonal entries correspond to variances, treated as different, for the two family members. The two off-diagonal entries are equal to each other and, together with the variances, determine the IFC in responses within family dyads. This covariance matrix is called unstructured since no structure has been imposed on it to reduce the number of covariance parameters. For this 2×2 case, compound symmetry is the structured alternative with the variances constrained to be equal.
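For illustration only, a small Python sketch (hypothetical helper names, made-up values) of the two 2×2 structures just described and of how the variances and the shared off-diagonal entry determine the IFC:

```python
import math
from typing import List

Matrix = List[List[float]]


def unstructured(var1: float, var2: float, cov12: float) -> Matrix:
    """Unstructured 2x2 covariance: separate variances for the two members
    (3 covariance parameters)."""
    return [[var1, cov12], [cov12, var2]]


def compound_symmetry(var: float, cov12: float) -> Matrix:
    """Compound symmetry: the two variances are constrained to be equal
    (2 covariance parameters)."""
    return [[var, cov12], [cov12, var]]


def ifc(sigma: Matrix) -> float:
    """IFC implied by a 2x2 covariance matrix: cov / (sd1 * sd2)."""
    return sigma[0][1] / math.sqrt(sigma[0][0] * sigma[1][1])


print(ifc(unstructured(4.0, 9.0, 3.0)))   # 3 / (2 * 3) = 0.5
print(ifc(compound_symmetry(4.0, 1.0)))   # 1 / 4 = 0.25
```

The unstructured form is what yields a separate composite variance, and hence a separate composite ICR, for each family member.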

It is important to allow for different mean values for family members since the goal is to compute separate variances for those family members, but it also may be important to account for an additional dyad effect. For example, when the dyad consists of parents from one- or two-parent families, it may be important to allow for a family type effect on mean values. Alternatively, for ill and healthy siblings, it may be important to allow for a gender effect.

Using an unstructured covariance matrix allows different composite variances to be computed for each family member. These composite variances, for a scale and for its items, are inserted into the standard Cronbach's alpha formula to compute the scale's composite ICR coefficients, one for each family member, but based on the combined responses for both family members. A single composite ICR for both family members combined could have been computed instead using the constant variance parameter of the compound symmetry covariance matrix. This has the advantage of providing a single measure of ICR, as would be used in non-family settings, but ignores the individuality of family members. Families are composed of individual family members, and so the assessment of reliability for a scale addressing multiple family members should address the reliability for each of those individuals. On the other hand, the family unit has dimensions not fully described by the individual properties of its members alone. This is represented by the IFC, a dimension not addressed in standard ICR computations.

ICR Example Analysis

The data used in example analyses come from an instrument development study that surveyed parents of children with a chronic condition on family management of the condition. A total of 579 parents were interviewed, including 414 mothers and 165 fathers from 417 families, with both parents responding from 162 families. These survey responses were used to develop the Family Management Measure (FaMM), consisting of six scales representing attributes of family management, including the Child Identity, Concern, Difficulty, Effort, Manageability, and Parental Mutuality Scales (Knafl et al., in review). The Parental Mutuality Scale is appropriate only for partnered parents, but the other scales apply to both partnered and single parents. For example, the Concern Scale, measuring concern parents have about the condition, has 10 items rated from 1 to 5, meaning strongly disagree to strongly agree (e.g., "we think about our child’s condition all the time" and "because of the condition, we worry about our child’s future"). Larger scores indicate higher levels of concern. The Manageability Scale measures how manageable the condition is and has 12 items, also scored from 1 to 5 (e.g., "we often feel unsure about what to do to take care of our child’s condition" and "when something unexpected happens with our child’s condition, we usually know how to handle it"). Larger scores indicate higher levels of manageability. Table 1 contains ICR results along with summary statistics for the Concern and Manageability Scales computed from responses for mothers and fathers both separately and in combination, allowing for an effect on scale means for family type (partnered versus single).

Composite ICR on concern was lower for mothers at .73 than for fathers at .77, but both were at respectable levels (i.e., at least .70 but less than .80; DeVellis, 2003). Separate ICR values were lower, at .72 for mothers and .75 for fathers. For manageability, composite ICR values were also respectable, at .71 for mothers and .73 for fathers, and were the same as the separate values in both cases. ICR values computed separately can be lower than the associated composite values in some cases, but may not be much lower. This suggests that the separate approach can produce reasonably accurate ICR values, at least in cases like this with substantial numbers of observations for each family member. However, the IFCs (in this case, inter-parental correlations) associated with the computation of composite summary statistics for concern and manageability are .57 and .33, respectively, both significantly nonzero (p < .01). Not accounting for this correlation can ignore a distinct dimension of the family as a composite unit.

The estimated composite mean for concern was 26.9 for partnered mothers, 28.4 for single mothers, and 26.9 for fathers, and it was not significantly different for partnered mothers compared to single mothers (p = .08) or for partnered mothers compared to fathers (p = .94). The estimated composite mean for manageability was 49.3 for partnered mothers, 47.5 for single mothers, and 49.4 for fathers, and it was significantly different for partnered mothers compared to single mothers (p = .04) but not for partnered mothers compared to fathers (p = .76). Although it may not always be crucial to allow for different means for partnered and single mothers, as for concern, it may be in some cases, as for manageability. Consequently, an effect on the mean for single versus partnered mothers was incorporated into the analyses of reliability and construct validity for all the FaMM scales reported by Knafl et al. (in review), except for the Parental Mutuality Scale, since it is not completed by single mothers.

Estimates for mean concern and manageability were not affected much by consideration of IFC; the separate approach produced almost the same values as the composite approach. Separate estimates for all partnered mothers and for fathers, however, cannot be compared using standard tests. The standard two-sample t test is not appropriate because of the lack of independence between responses within the same families. The standard paired t test can be appropriate, but only for responses from families with two participating parents; differences cannot be calculated for the other families, with only participating mothers. The test generated by the linear mixed model generalizes the standard paired t test to account for responses from families with either one or two participating members while accounting for correlation in responses within families with two participating members. Separate and composite estimates of the standard deviation were also almost the same, explaining why separate and composite ICR values, which are computed from such standard deviation estimates, were also not very different.

G Theory Extension of ICR

Cronbach's alpha can be extended using G theory to express ICR for a scale as a function of the number of items used in computing it, not just the actual number of items (Shavelson & Webb, 1991). These alternative ICR values can be computed from the ICR value α for the actual number of items (see the following section). Composite values for such generalized ICRs can be generated with the same computation using composite values for α. Figure 1 portrays how composite ICR for the Concern Scale would change with differing numbers of items of similar effect to the 10 available items. One use for these curves would be to assess how many additional items are needed to improve the scale's reliability to .80, considered the minimal very good level by DeVellis (2003). Only two more items are needed to increase ICR for fathers to at least .80, but five more are needed for mothers. This may be an unacceptable increase in subject burden, especially if similar increases are needed for other FaMM scales as well.

G Theory Formulation

Using a one-facet crossed design in terms of persons and items (Shavelson & Webb, 1991), the associated relative decision study generates an ICR coefficient ICR(I′) for an arbitrary number I′ of items. This coefficient equals the standard Cronbach's alpha when I′ is set to the observed number I of items (I = 10 for the Concern Scale), that is, ICR(I) = α. It also can be expressed in the form

$$\mathrm{ICR}(I') = \frac{y}{y + 1/I'} \tag{1}$$

where y is the ratio of the person variance component to the residual variance component (see equation 6.18, Shavelson & Webb, 1991). Setting I′ = I in equation 1 and solving for y gives

$$y = \frac{\alpha}{I\,(1 - \alpha)} \tag{2}$$

and inserting this into equation 1 then gives ICR(I′) for any I′ in terms of α. The coefficients ICR(I′) can be generalized to account for IFC by using composite ICR coefficients α in equation 2 to compute a composite y value and then using it in equation 1 to calculate composite values ICR(I′) over a selection of possible values I′ for the number of items (e.g., as plotted in Figure 1).
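Equations 1 and 2 can be sketched in Python as follows (the function names are hypothetical). Solving equation 1 for the smallest I′ with ICR(I′) ≥ .80 gives the item counts discussed for Figure 1; equation 1 is algebraically the Spearman-Brown prophecy formula, ICR(I′) = mα/(1 + (m − 1)α) with m = I′/I.

```python
import math


def person_to_residual_ratio(alpha: float, n_items: int) -> float:
    """Equation 2: y = alpha / (I * (1 - alpha))."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / (n_items * (1 - alpha))


def icr_for_items(alpha: float, n_items: int, new_n_items: int) -> float:
    """Equation 1: ICR(I') = y / (y + 1 / I'), with y from equation 2."""
    y = person_to_residual_ratio(alpha, n_items)
    return y / (y + 1 / new_n_items)


def items_needed(alpha: float, n_items: int, target: float) -> int:
    """Smallest I' with ICR(I') >= target.

    From equation 1, ICR(I') >= target exactly when I' >= target / ((1 - target) * y).
    """
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    y = person_to_residual_ratio(alpha, n_items)
    k = max(1, math.ceil(target / ((1 - target) * y)))
    # Step back if floating-point error pushed the bound past an exact integer.
    while k > 1 and icr_for_items(alpha, n_items, k - 1) >= target:
        k -= 1
    return k


# Composite alpha values for the 10-item Concern Scale reported in the text.
for member, alpha in [("fathers", 0.77), ("mothers", 0.73)]:
    print(member, items_needed(alpha, 10, 0.80))   # fathers 12, mothers 15
```

With the reported composite values of .77 and .73, this reproduces the two additional items for fathers and five for mothers; evaluating icr_for_items over a range of I′ would trace curves like those in Figure 1.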

Composite Construct Validity Correlation (CVC)

The construct validity of a scale can be addressed by computing correlations between that scale and other established measures related to the scale's underlying latent construct. Construct validity for the scale is supported when these correlations are distinct and in expected directions, thereby supporting hypothesized relationships. However, these correlations also are expected to be limited in strength, indicating that underlying constructs are only related, not equivalent.

CVC Formulation

The standard CVC is computed using a general bivariate approach, treating the scale and the established measure as correlated with different means and different variances. The associated covariance matrix is unstructured with dimension 2×2.

The specific linear mixed model used to generalize this to a composite CVC for family dyads is based on four measurements: the scale and the established measure for each of the two family members. The fixed component is determined by allowing the scale and the established measure each to have different mean values for the two family members along with a possible dyad effect (e.g., partnered versus single). The covariance matrix has dimension 4×4. The first two rows and columns correspond to the scale for the two family members. The last two rows and columns correspond to the established measure for the two family members. The four diagonal elements correspond to the variances for the four measurements, allowing them to have different values. The off-diagonal elements together with the variances determine the correlations. There are six possible correlations corresponding to the possible distinct pairs of the four measurements. However, the model imposes structure on these correlations so that there are only two distinct parameters, an IFC and a CVC. The correlation between scale values for the two family members is set equal to the correlation between established measure values for the two family members; this common value is the IFC parameter. Similarly, the correlation between scale values and established measure values for the first family member is set equal to the correlation between scale values and established measure values for the second family member; this common value is the CVC parameter. The other correlations are determined by multiplying these two parameters together. In other words, the correlation between scale values for one family member and established measure values for the other family member is set equal to the product of the IFC and CVC parameters. This 4×4 covariance matrix is a special case of what is called a Kronecker product matrix (Schott, 2005): it is the Kronecker product of two 2×2 unstructured covariance matrices.
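A minimal Python sketch (hypothetical function names) of the correlation structure just described: the 4×4 correlation matrix is the Kronecker product of a 2×2 matrix for the two measures (holding the CVC) and a 2×2 matrix for the two family members (holding the IFC). The covariance matrix would be the Kronecker product of the corresponding covariance matrices; in SAS PROC MIXED this appears to correspond to a REPEATED statement with TYPE=UN@UN, though the authors' macros are not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker (direct) product of two matrices stored as nested lists."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def cvc_correlation_matrix(cvc: float, ifc: float) -> Matrix:
    """4x4 correlation matrix with rows/columns ordered as
    (scale, member 1), (scale, member 2), (measure, member 1), (measure, member 2)."""
    between_measures = [[1.0, cvc], [cvc, 1.0]]   # scale vs. established measure
    between_members = [[1.0, ifc], [ifc, 1.0]]    # member 1 vs. member 2
    return kron(between_measures, between_members)


# Illustrative values matching the Concern Scale example: CVC = -.32, IFC = .52.
for row in cvc_correlation_matrix(-0.32, 0.52):
    print([round(x, 4) for x in row])
```

Only two correlation parameters appear, and the cross-member, cross-measure entries equal their product (here −.32 × .52 = −.1664), exactly as the model specifies.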

This model provides a common composite CVC parameter, rather than different ones, for the two family members. While different CVC parameters would provide information on individual family member differences, the latent construct underlying the scale would be the same for both family members, as would the latent construct for the established measure. A common CVC for both family members better reflects this homogeneity in latent constructs across family members, and so seems more appropriate. Moreover, while an extended model with different CVC parameters can be formulated, this extension is not supported by SAS PROC MIXED, the software tool used here to estimate linear mixed models, and so is not readily computable. While PROC MIXED does not support this one case, another popular statistical software tool, the Statistical Package for the Social Sciences (SPSS), does not support Kronecker product covariance matrices of any kind for linear mixed modeling. This model could be computed in SAS using PROC NLMIXED, but the extra effort did not seem warranted. An alternative approach supported by PROC MIXED would be to use an unstructured 4×4 covariance matrix with six different correlation parameters, including different CVCs for mothers and fathers and different IFCs for the scale and the established measure. There would also be two other correlations combining aspects of the CVC and the IFC, since they involve two different parents and two different measures. This approach was judged to be too complex.

CVC Example

Several established measures were collected as part of the FaMM Study for construct validity purposes, including the Functional Status II (Stein & Jessop, 1990), which provides a parent's assessment of child functioning in the presence of a chronic condition. This scale is computed as the sum of 14 items scored from 0 to 2. Larger scores indicate better child functional status, which would be expected to correspond to lower concern and higher manageability. Consequently, construct validity for the Concern Scale would be supported by a negative CVC with child functional status and for the Manageability Scale by a positive CVC.

Composite construct validity results for the Concern and Manageability Scales in terms of child functional status are reported in Table 2 with composite ICR results. Child functional status was missing for two parents, and so results used responses for 577 parents. Composite ICR for child functional status was .79 for mothers and .81 for fathers, with an IFC of .58 (p < .01). The composite CVC between concern and child functional status was −.32 (p < .01), a distinct value in the expected direction and of limited strength, thereby supporting construct validity for the Concern Scale. The associated IFC was .52 (p < .01). Construct validity of the Manageability Scale was supported by a positive composite CVC with child functional status of .32 (p < .01), with an associated IFC of .38 (p < .01). For both scales, the impact of the family unit on construct validity was distinct, as indicated by the IFCs, demonstrating the importance of accounting for intra-familial correlation.

Composite Test-Retest Reliability (TRR)

The standard TRR for a scale is computed as the correlation between responses to that scale by the same individual at two points in time. The time points should be far enough apart that subjects cannot recall their original responses, but close enough that the scale's underlying latent construct remains stable. Item TRR can be computed similarly. These computations use the same general bivariate approach as used for the CVC, with different means and variances for the scale at the two time points along with a temporal correlation parameter representing TRR.

These standard coefficients can be generalized to composite coefficients with the same direct (Kronecker) product covariance approach used for the CVC, providing composite TRR estimates for a scale and its items adjusted for the associated IFC. TRR sample sizes are smaller than for general scale development purposes, so it may be important to avoid too complex a model. One natural way to simplify the model is to allow variances to differ across times, but not across family members. These two alternatives can be compared using penalized likelihood criteria (Sclove, 1987). We use Schwarz's Bayesian information criterion (BIC) because its penalty accounts for the sample size as well as the number of parameters, as opposed to the Akaike information criterion (AIC), whose penalty accounts only for the number of parameters. Accounting for the size of the sample in the penalty seems important for TRR since it typically is based on samples of limited size.
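The penalized likelihood comparison described above can be sketched in Python as follows. The log-likelihoods are hypothetical (not FaMM results), the function names are illustrative, and n is assumed here to be the number of families (the model's subjects); software packages differ in what they use for n in BIC, so that choice should be checked. Because both models share the same fixed component, counting only covariance parameters leaves the comparison unchanged.

```python
import math
from typing import Dict, Tuple


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion (smaller is better): -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik: float, n_params: int, n: int) -> float:
    """Schwarz's Bayesian information criterion (smaller is better): -2 log L + k ln(n)."""
    return -2.0 * loglik + n_params * math.log(n)


def choose_by_bic(fits: Dict[str, Tuple[float, int]], n: int) -> str:
    """Return the name of the model with the smallest BIC."""
    return min(fits, key=lambda name: bic(fits[name][0], fits[name][1], n))


# Hypothetical maximized log-likelihoods. Covariance parameters:
# full = 4 variances + IFC + TRR = 6; reduced = 2 variances + IFC + TRR = 4.
fits = {"full": (-500.0, 6), "reduced": (-502.5, 4)}
n_families = 44
for name, (ll, k) in fits.items():
    print(name, aic(ll, k), round(bic(ll, k, n_families), 2))
print("BIC prefers:", choose_by_bic(fits, n_families))
```

With these made-up numbers AIC slightly favors the full model (1012 versus 1013), while BIC, whose penalty per parameter is ln(44) ≈ 3.78 rather than 2, favors the reduced one; this is the behavior that motivates using BIC for small TRR samples.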

Alternate-forms reliability, based on two alternate forms of the items, can be generalized similarly.

Composite TRR Formulation

The specific linear mixed model used to generalize the TRR coefficient has the same structure as the model for the composite CVC. Measurements are either for a scale or for one of its items, measured four ways corresponding to two family members at two time points (the test and the retest), with different means, possibly changing due to a dyad effect, and with different variances. The six possible correlations are determined by an IFC parameter, a temporal correlation parameter representing the TRR, and products of these two parameters. Two alternative versions of this model are considered here: the full model, with all four variances different, and the reduced model, with variances differing only across time points, not also across family members. The reduced model is considered since the full model can be overparameterized for the kinds of sample sizes usually considered for TRR.
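As a rough Python sketch of the full and reduced models just described (hypothetical function names and made-up standard deviations), each covariance entry is the product of two standard deviations and a correlation taken from the product structure. The TRR and IFC values used below (.87 and .32) are illustrative, chosen to match the reduced-model Concern Scale estimates in the TRR example.

```python
from typing import List, Sequence

Matrix = List[List[float]]

# Measurement order: (test, member 1), (test, member 2), (retest, member 1), (retest, member 2)
TIME = (0, 0, 1, 1)
MEMBER = (0, 1, 0, 1)


def trr_covariance(sds: Sequence[float], trr: float, ifc: float) -> Matrix:
    """Covariance with the product correlation structure: IFC between members at
    the same time, TRR for the same member across times, TRR * IFC otherwise."""
    if len(sds) != 4:
        raise ValueError("expected four standard deviations")
    cov = []
    for i in range(4):
        row = []
        for j in range(4):
            corr = ((trr if TIME[i] != TIME[j] else 1.0)
                    * (ifc if MEMBER[i] != MEMBER[j] else 1.0))
            row.append(sds[i] * sds[j] * corr)
        cov.append(row)
    return cov


def full_model(sd_t1, sd_t2, sd_r1, sd_r2, trr, ifc) -> Matrix:
    """Full model: all four variances differ (6 covariance parameters)."""
    return trr_covariance([sd_t1, sd_t2, sd_r1, sd_r2], trr, ifc)


def reduced_model(sd_test, sd_retest, trr, ifc) -> Matrix:
    """Reduced model: variances differ across time only (4 covariance parameters)."""
    return trr_covariance([sd_test, sd_test, sd_retest, sd_retest], trr, ifc)


# Hypothetical standard deviations with illustrative TRR and IFC values.
for row in reduced_model(5.0, 6.0, 0.87, 0.32):
    print([round(x, 2) for x in row])
```

The two models differ only in how many distinct standard deviations they allow, which is what the penalized likelihood comparison between them weighs against fit.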

Each of these alternative models provides a common composite TRR parameter, rather than different ones, for the two family members. Models with different parameters for each family member would be preferable but, as for the CVC, are not readily computable. An individual family member perspective on TRR can be obtained from standard coefficients computed for each family member separately, but there may not be sufficient numbers retested within each family member type for meaningful results. The two alternatives for a common composite TRR coefficient are more accurate because they are computed from responses for all retested subjects.

If at most one member per family is retested, a common TRR value can be computed using

the standard approach, but this would not account fully for the family as a unit composed of

multiple individuals. Composite coefficients account for this through the IFC. However, if the

IFC is not significantly different from zero, it could be more appropriate to compute the common

value for TRR using the standard approach applied to responses for all retested subjects, treating

them as independent across subjects even when for different members of the same family. This

latter approach may not be appropriate, though, unless the IFC is distinctly nonsignificant. For

example, Kashy and Snyder (1995) recommend using a conservative cutoff of .25 for the p-value

for the test of zero IFC before treating responses within families as independent. In any case, this

latter approach treats variances as different across time, but not across family members, and so is

generalized to account for IFC by the reduced linear mixed model.
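The conservative rule attributed above to Kashy and Snyder (1995) reduces to a single comparison. In this sketch the function name is hypothetical. The example p-values (.02 and .27) are the reduced-model IFC p-values reported for the Concern and Manageability Scales in the Composite TRR Example.

```python
def pooled_standard_trr_ok(ifc_p_value, cutoff=0.25):
    """Return True when the test of zero IFC is 'distinctly nonsignificant'.

    Responses within families are treated as independent, so that the
    standard TRR can be computed from all retested subjects pooled, only
    when the p-value exceeds the conservative .25 cutoff.
    """
    return ifc_p_value > cutoff

# Reduced-model IFC p-values from the Composite TRR Example.
decisions = {
    scale: pooled_standard_trr_ok(p)
    for scale, p in [("Concern", 0.02), ("Manageability", 0.27)]
}
print(decisions)
```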

Composite TRR Example

For the FaMM Study, 65 parents (41 mothers and 24 fathers) from 44 families were retested within 2-4 weeks, with both parents retested in 21 of these families. Table 3 contains TRR

results for the Concern and Manageability Scales. The composite TRR for the Concern Scale

was .87 using both the full and reduced models. The associated IFC was .34 for the full model

and .32 for the reduced model, with both significantly nonzero (both at p = .02), so it is

inappropriate to compute a composite value using the standard approach. The standard TRR

computed just for mothers was .89 compared to .84 just for fathers. The Concern Scale has very

good TRR levels from all perspectives.

For the Manageability Scale, composite TRR was at a reasonable level of .79 under both the full and reduced models. The associated IFC was −.16 (p = .29) for the full model and nearly the same, −.17 (p = .27), for the reduced model. These p-values are large enough (over .25) to justify computing an alternative composite value with the standard approach, which gave .78, about the same as the two mixed-model estimates. Alternatively, the standard TRR computed

just for mothers was .83 compared to .64 just for fathers. It is possible that TRR for the

Manageability Scale is at an undesirable level (at least .60 but less than .65; DeVellis, 2003)

for fathers, but at a very good level for mothers. However, these results also could be a

consequence of small subsample sizes, especially for fathers with only 24 retested. It seems quite

possible that the Manageability Scale has reasonable TRR levels, as indicated by its three alternative composite values, a conclusion that in this case is not much affected by correlation in

responses within the same families. In any case, an effective evaluation of TRR for family scales

like the FaMM requires sufficiently large numbers of every family member type, not just a large total number of family members of any type, as was collected in the FaMM Study.
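The verbal labels in this example follow DeVellis's (2003) ranges for reliability coefficients. The text quotes only the undesirable range (at least .60 but less than .65). The other cut points in the sketch below are the commonly cited DeVellis ranges and are an assumption to check against the source. The function name is hypothetical.

```python
def devellis_label(r):
    """Map a reliability coefficient to a DeVellis-style label.

    Intervals are half-open [lower, upper). Apart from the 'undesirable'
    band quoted in the text, the cut points are assumed.
    """
    bands = [
        (0.60, "unacceptable"),          # below .60 (assumed)
        (0.65, "undesirable"),           # .60 to < .65 (quoted in the text)
        (0.70, "minimally acceptable"),  # .65 to < .70 (assumed)
        (0.80, "respectable"),           # .70 to < .80 (assumed)
        (0.90, "very good"),             # .80 to < .90 (assumed)
    ]
    for upper, label in bands:
        if r < upper:
            return label
    return "very high; consider shortening the scale"  # .90 and above (assumed)

# TRR values reported in the Composite TRR Example.
trr = {
    ("Concern", "composite"): 0.87, ("Concern", "mothers"): 0.89,
    ("Concern", "fathers"): 0.84, ("Manageability", "composite"): 0.79,
    ("Manageability", "mothers"): 0.83, ("Manageability", "fathers"): 0.64,
}
labels = {key: devellis_label(value) for key, value in trr.items()}
for key, label in labels.items():
    print(key, label)
```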

Composite TRR results for the full and reduced models differ little for both the Concern and Manageability Scales. In such situations either model could be used, but the reduced model seems preferable because it is more parsimonious. In other situations, estimating more variance parameters than necessary might produce inferior TRR estimates, which again suggests that the reduced model may be preferable. These models can, however, also be compared objectively. Table 3 provides such a comparison using BIC scores, with smaller values indicating better models. The reduced model generates smaller BIC scores for both the Concern and the Manageability Scales. The differences in scores are not large (2.4 and 2.9 points), indicating that either model may be an acceptable choice, but the reduced model is still preferable as the simpler model. The reduced model is thus both more parsimonious and more effective for these scales, and we recommend its use over the full model in general. Furthermore, BIC scores are distinctly larger for separate TRR analyses than for composite TRR analyses, suggesting that the differences in separate TRR results for mothers and fathers, especially for the Manageability Scale, are questionable.
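For readers comparing models outside SAS, the sketch below gives the usual definition, BIC = −2 log L + k ln n. It then applies the smaller-is-better rule to the BIC scores in Table 3. Software differs in how n is counted (for example, number of subjects versus number of observations), so the `bic` helper is only illustrative and is not claimed to reproduce the PROC MIXED values. All names are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n):
    """Schwarz's BIC = -2 log L + k ln n (smaller is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n)

# BIC scores reported in Table 3.
TABLE3_BIC = {
    "Concern": {"separate": 780.8, "full": 771.2, "reduced": 768.8},
    "Manageability": {"separate": 823.2, "full": 817.4, "reduced": 814.5},
}

def best_model(scores):
    """Name of the model with the smallest BIC."""
    return min(scores, key=scores.get)

for scale, scores in TABLE3_BIC.items():
    gap = scores["full"] - scores["reduced"]
    print(f"{scale}: best = {best_model(scores)}, full minus reduced = {gap:.1f}")
```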

Composite Item-Total Correlation (ITC)

The standard ITC is computed as the correlation between an item and the total of a scale with

that item deleted. The scale is corrected by subtracting the item before computing the correlation.

An uncorrected version can be computed similarly but is considered less appropriate, since including the item in the scale can inflate the resulting ITC coefficient (DeVellis, 2003, p. 93). The ITC is also called the discrimination index. Items with values of at least .30 are considered discriminating and those with values over .20 but less than .30 moderately discriminating

(Nunnally & Bernstein, 1994). The standard ITC coefficient can be generalized to a composite

coefficient using the same direct product covariance approach used to generalize the CVC,

providing a composite ITC adjusted for associated IFC.
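Below is a minimal NumPy sketch of the standard corrected ITC and its uncorrected counterpart, together with the Nunnally and Bernstein (1994) labels given above. The function names and the simulated responses are hypothetical. The label for values of .20 or less ("not discriminating") is a placeholder added here and does not come from the text.

```python
import numpy as np

def item_total_correlations(items, corrected=True):
    """Item-total correlations for an (n_subjects, n_items) array.

    corrected=True correlates each item with the total of the other items
    (the scale with that item deleted); corrected=False keeps the item in
    the total, which tends to inflate the coefficient.
    """
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    result = []
    for j in range(items.shape[1]):
        scale = total - items[:, j] if corrected else total
        result.append(np.corrcoef(items[:, j], scale)[0, 1])
    return np.array(result)

def discrimination_label(itc):
    """Labels as stated in the text; the '.20 or less' label is a placeholder."""
    if itc >= 0.30:
        return "discriminating"
    if itc > 0.20:
        return "moderately discriminating"
    return "not discriminating"

# Hypothetical responses: six items driven by one latent trait plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(400, 1))
responses = latent + rng.normal(size=(400, 6))
corrected = item_total_correlations(responses)
uncorrected = item_total_correlations(responses, corrected=False)
print(np.round(corrected, 2))
print(np.round(uncorrected, 2))
```

Because the simulated items are positively correlated, the uncorrected coefficients come out larger than the corrected ones, which illustrates the inflation noted above.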

Composite ITC Formulation

The specific linear mixed model used to generalize the ITC has the same form as the model

for the CVC. There are four measurements corresponding to the item and the corrected scale for

each of the two family members. These four measurements are treated as having different means,

possibly changing due to a dyad effect, and different variances. The six possible correlations are

determined by an IFC parameter, an ITC parameter, and products of these two parameters. This

model provides a common composite ITC parameter, rather than different ones, for the two

family members. A model with different parameters for each family member would be

preferable but, as for the CVC, is not readily computable. An individual family member

perspective on ITC can be obtained from standard coefficients computed for each family

member separately, but those address only the individual family members, not the family unit.
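For the composite ITC, the same direct-product structure applies, with the item versus corrected-scale distinction taking the place of the two time points. The sketch below shows the Gaussian log-likelihood that a maximum likelihood routine would maximize. It makes several simplifying assumptions: complete dyads only (PROC MIXED can also use families with one responding parent, which this sketch does not), unrestricted means for the four measurements, and no dyad effect on the means. The function names are hypothetical. In practice a numerical optimizer would maximize this function jointly over the standard deviations, the IFC, and the ITC. The usage below only runs a one-dimensional grid search over the IFC on simulated data.

```python
import numpy as np

def direct_product_corr(ifc, rho):
    # Kronecker product of the 2x2 family (IFC) and 2x2 item/scale (ITC) correlations.
    return np.kron([[1.0, ifc], [ifc, 1.0]], [[1.0, rho], [rho, 1.0]])

def dyad_loglik(data, sd, ifc, rho):
    """Gaussian log-likelihood for complete dyads under the direct-product model.

    `data` is (n_dyads, 4), ordered (member 1 item, member 1 corrected scale,
    member 2 item, member 2 corrected scale). Means are replaced by sample
    means, the ML estimate for complete data with an unrestricted mean per
    measurement.
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    centered = data - data.mean(axis=0)
    s = centered.T @ centered / n  # ML (divisor n) sample covariance
    sigma = np.diag(sd) @ direct_product_corr(ifc, rho) @ np.diag(sd)
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        return -np.inf  # not a valid covariance matrix
    trace_term = np.trace(np.linalg.solve(sigma, s))
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet + trace_term)

# Usage on simulated complete dyads (illustrative values, not FaMM data).
rng = np.random.default_rng(2)
true_sd = np.array([1.0, 2.0, 1.0, 2.0])
true_cov = np.diag(true_sd) @ direct_product_corr(0.4, 0.5) @ np.diag(true_sd)
sim = rng.multivariate_normal(np.zeros(4), true_cov, size=3000)

# Grid search over the IFC with the other parameters fixed at their true values.
ifc_grid = np.linspace(-0.9, 0.9, 181)
ifc_hat = ifc_grid[np.argmax([dyad_loglik(sim, true_sd, g, 0.5) for g in ifc_grid])]
print(round(float(ifc_hat), 2))
```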

Composite ITC Example

Table 4 contains corrected ITCs for the 10 Concern Scale items, both separate values for mothers and fathers and composite values, along with the associated IFCs. Items are listed in

increasing order of composite ITCs. Composite ITCs range from .24 to .52 with all significantly

nonzero (p < .01). IFCs range from .24 to .53 with all significantly nonzero (p < .01). Seven of

the Concern Scale items are discriminating while the other three are moderately discriminating,

suggesting that it is reasonable from this perspective to retain all the items in the scale. Separate

ITC values can be quite different for mothers and fathers (as much as .11 apart), and can be only

moderately discriminating for one parent when discriminating for the other parent. However,

when items are considered discriminating if discriminating for both parents and moderately

discriminating if moderately discriminating for one parent and moderately discriminating or

better for the other parent, the Concern Scale items are categorized in exactly the same way as by

the composite ITC. This suggests that it is sufficient to compute the composite coefficient,

producing a consistent categorization of items while also providing information on the family

unit.
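The two-parent categorization rule described above amounts to taking the weaker of the two separate categories. The sketch below encodes that rule using the Table 4 values (mothers' separate ITC, fathers' separate ITC, composite ITC, in table order). It then checks that the result agrees with the composite categorization. The function names are hypothetical.

```python
# (mothers' separate ITC, fathers' separate ITC, composite ITC) from Table 4.
TABLE4 = [
    (0.23, 0.33, 0.24), (0.27, 0.35, 0.27), (0.33, 0.25, 0.29),
    (0.31, 0.40, 0.30), (0.37, 0.33, 0.33), (0.39, 0.48, 0.40),
    (0.41, 0.52, 0.41), (0.51, 0.43, 0.47), (0.51, 0.46, 0.48),
    (0.51, 0.59, 0.52),
]

def level(itc):
    """2 = discriminating (>= .30), 1 = moderately discriminating (> .20), 0 = otherwise."""
    if itc >= 0.30:
        return 2
    if itc > 0.20:
        return 1
    return 0

def two_parent_level(mothers, fathers):
    # Discriminating only if discriminating for both parents; moderately
    # discriminating if moderately discriminating for one parent and at least
    # moderately discriminating for the other; i.e., the weaker category.
    return min(level(mothers), level(fathers))

agree = all(two_parent_level(m, f) == level(c) for m, f, c in TABLE4)
n_discriminating = sum(level(c) == 2 for _, _, c in TABLE4)
n_moderate = sum(level(c) == 1 for _, _, c in TABLE4)
print(agree, n_discriminating, n_moderate)  # prints: True 7 3
```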

Composite Differential Reliability Index (DRI)

An item's ITC can be combined with its correlation with a measure of social desirability to

compute its DRI, measuring how much of an effect it has on its scale beyond what is attributable

to its correlation with social desirability (Jackson, 1970; Streiner & Norman, 2003). A new

formulation is not required to generalize DRI to a composite value based on the combined

responses for two family members. It is computed from the composite ITC and the composite

correlation for an item with social desirability. This latter correlation is computed using the same

approach as for the composite CVC, but applied to an item rather than to a scale. These two

correlation values are combined using the standard formula for the DRI (e.g., see Streiner &

Norman, 2003, p. 87).
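The standard DRI formula (Jackson, 1970; Streiner & Norman, 2003) takes the square root of the squared item-total correlation minus the squared item-social-desirability correlation. The sketch below applies it to composite values. The function name is hypothetical. As a check, the Table 5 values for one item (composite ITC .29, composite social desirability correlation .11) give about .27, which matches the reported composite DRI.

```python
import math

def differential_reliability_index(itc, r_sd):
    """DRI = sqrt(ITC**2 - r_SD**2), where r_SD is the item's correlation
    with social desirability. Raises ValueError when the item correlates
    more strongly with social desirability than with its scale."""
    gap = itc ** 2 - r_sd ** 2
    if gap < 0:
        raise ValueError("DRI undefined: |r_SD| exceeds the ITC")
    return math.sqrt(gap)

example_dri = differential_reliability_index(0.29, 0.11)
print(round(example_dri, 2))  # prints 0.27
```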

DRI Example

The short version of the Marlowe-Crowne Social Desirability Scale (Strahan & Gerbasi,

1972) was used in the FaMM Study to assess the effect of social desirability on responses to the

FaMM items and scales. This scale is computed as the sum of 10 items scored as either 0 or 1.

Larger scores indicate a greater influence of social desirability on other responses. Social

desirability was missing for one parent, and so results used responses for 578 parents. Table 5

contains composite DRI values for the Concern Scale items together with associated composite

ITCs (reproduced from Table 4) and composite item-social-desirability correlations used in their

computation. Items are listed in Table 5 in the same order as in Table 4, which is also nondecreasing order in composite DRI. Composite DRI ranges from .24 to .51, and so all Concern

Scale items are at least moderately discriminating in terms of DRI (using the same definition as

for the ITC). Associated composite item-social-desirability correlations are quite small, ranging from −.10 to .11, while their associated IFCs are substantial, ranging from .22 to .49, and all significantly nonzero (p < .01). Three of these item-social-desirability correlations are significant (p < .05), but the composite correlation between concern and social desirability is .00 (p = .96), suggesting that Concern Scale scores are not influenced by social desirability. Social desirability itself has undesirable composite ICR levels of .60 for mothers and .62 for fathers (with associated IFC of .33, p < .01), possibly because its items are binary valued.

A convention does not appear to exist for what DRI values to consider acceptable, but a

small change from the ITC would indicate an acceptable impact of social desirability. For

example, composite DRI values for the Concern Scale items differ from associated composite

ITCs by at most .02, with decreases of at most 6.9%, indicating that social desirability has at

most a limited impact on the effect of items of the Concern Scale.
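The comparison in the preceding paragraph can be reproduced from the two-decimal values in Table 5, as the sketch below shows. The variable names are hypothetical. Because the inputs are rounded, the percentages computed this way are approximate.

```python
# Composite ITC and composite DRI from Table 5, in table order.
itc = [0.24, 0.27, 0.29, 0.30, 0.33, 0.40, 0.41, 0.47, 0.48, 0.52]
dri = [0.24, 0.26, 0.27, 0.28, 0.32, 0.39, 0.40, 0.47, 0.47, 0.51]

diffs = [i - d for i, d in zip(itc, dri)]                  # ITC minus DRI
pct_decrease = [100 * (i - d) / i for i, d in zip(itc, dri)]

max_diff = max(diffs)
max_pct = max(pct_decrease)
print(f"largest ITC - DRI difference: {max_diff:.2f}")
print(f"largest percentage decrease: {max_pct:.1f}%")
```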

Computation

Computations were conducted using the linear mixed modeling procedure PROC MIXED

and the matrix language procedure PROC IML available in SAS Version 9.1 (SAS Institute, Inc.,

Cary, NC). Measurements for some analyses had to be normalized for the maximum likelihood

routines to converge. For example, when computing composite ITCs, item and corrected scale

values were divided by their observed maximum values, so that both normalized variables had

the same upper bound of 1. SAS macros for conducting these computations are available at

http://www.ohsu.edu/son/faculty/knafl/comprel.html (accessed ?/?/08).
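Dividing by a positive constant changes means and variances but not correlations. For this reason the rescaling described above can aid convergence and should still leave the estimated correlation coefficients unchanged. The small sketch below illustrates this with hypothetical data and a hypothetical helper name.

```python
import numpy as np

def normalize_by_max(values):
    """Divide by the observed maximum so the largest value becomes 1.
    Assumes the maximum is positive."""
    values = np.asarray(values, dtype=float)
    return values / values.max()

# Hypothetical item and corrected-scale scores for six respondents.
item = np.array([1, 3, 2, 4, 5, 2])
corrected_scale = np.array([12, 20, 15, 24, 30, 14])

r_raw = np.corrcoef(item, corrected_scale)[0, 1]
r_normalized = np.corrcoef(normalize_by_max(item), normalize_by_max(corrected_scale))[0, 1]
print(round(float(r_raw), 4), round(float(r_normalized), 4))
```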

Discussion

Standard reliability and construct validity coefficients are appropriately computed only for

one family member at a time because their formulations assume independence across subjects.

This paper formulates methods for calculating composite versions of these coefficients from

combined responses for family dyads, which account for the dyad as a unit through the intra-

familial correlation (IFC) in responses for that dyad. Composite coefficients provide more

comprehensive measures of reliability and construct validity based on responses for both

members of a family dyad combined. Their values may, however, differ little from those computed separately one family member at a time, especially if subsample sizes are not too small. This was the case for reported internal consistency reliability (ICR) values, but not for

reported test-retest reliability values based on data for a much smaller set of subjects. Separate

values can provide useful information, but they give an incomplete assessment that addresses only

family members individually, not the dyad as a unit. In our opinion, composite coefficients and

associated IFC values should be reported as standard practice, possibly along with separate

estimates if they provide useful information. The importance of accounting for IFC in these

contexts is supported by the example analyses for which most IFC values were significant and of

substantial strength. Even when the IFC is nonsignificant, not accounting for this possible

dependence ignores an important structural aspect of the relationship for couples (Kashy &

Snyder, 1995). That is, a couple together is more than the aggregate of its two members.

Whether or not composite ICR values are close to those computed separately for each family

member, associated IFC values represent an important aspect of the reliability of scales used to

assess family dyads. This is especially important in development of new scales. A nonsignificant

IFC for a scale intended for use with family dyads and computed from scale development survey

data with subjects purposely recruited to be broadly representative of those dyads might suggest

that the scale is an unacceptable family measure. Even if a new family scale has acceptable

levels of reliability for both members of the dyad, if that is achieved without a substantive IFC,

the scale represents a latent construct that reflects only individual family members, not the family

unit they form. For example, if one of the scales of the Family Management Measure had a

nonsignificant IFC for the survey responses used to develop those scales, its value in measuring a

dimension of the management of childhood chronic conditions by the family as a unit would be

questionable. The IFC is also important for assessing responses to established scales, but a

nonsignificant value in that case may be an interesting result suggesting a weak relationship for

the kinds of family members under study rather than a reliability problem for the associated

scale. For example, Knafl, Knafl, & McCorkle (2005) provide an example involving parental

assessments of school performance for a child with a chronic condition and for a healthy sibling.

There were two IFC dimensions, an inter-parental correlation between assessments by different

parents of the same child and an inter-sibling correlation between assessments of different

children by the same parent. The inter-sibling correlation was nonsignificant (p = .39), indicating

that each parent evaluated the school performance of each child independently from the school

performance of the other child, while the inter-parental correlation was significant (p < .01),

indicating that, as expected, each parent's assessment of each individual child's school

performance was strongly related to the assessment of the other parent for that same child.

Current guidelines for reporting research study results (Wilkinson & The Task Force on

Statistical Inference, 1999) treat ICR as a property of scale responses rather than of the scale

itself, and so recommend that ICR computed from observed responses be reported as common

practice. Analogously, reliability for scale responses from family research studies can be

considered a property of responses for all family members combined rather than responses for

individual types of family members separately. Composite ICR coefficients provide such a

more comprehensive assessment of reliability. However, their computation is not generally

supported in statistical software tools at this time.

Consequently, macros have been developed as part of this research that automate the

computation of composite reliability and construct validity coefficients within SAS, a popular

statistical software tool. These macros are available on the Internet for use by researchers with

access to SAS (when licensed to use PROC MIXED and PROC IML). Familiarity with using the

SAS programming interface is needed to run these macros, but a novice level would be

sufficient. Computation within other statistical software tools is feasible, but would require a

sophisticated level of familiarity with the programming interface of the tool to automate.

This work also applies to simple repeated measurement situations with responses to items

and scales repeated over two arbitrary conditions, like two treatments or two time points, or to

any kind of dyad. The generalization to any dyad is the same as for two family members.

However, further work is needed to extend standard reliability and construct validity coefficients

to arbitrary repeated measurement contexts. Generalization using similar approaches is possible

for repeated measurements over more than two conditions, like multiple family members,

treatments, or time points, but that is an issue for future research.

This work is also only a first step in generalizing psychometric methods to address survey

instruments used in family research. For example, standard factor analysis models assume

independence across subjects and so are appropriately applied only to item responses for one

member in each surveyed family. It would be preferable to factor analyze the complete set of

responses for all surveyed family members in combination while accounting for IFC in item

responses within families, but that is an issue for future research.

References

Arminger, G., & Sobel, M. E. (1990). Pseudo-maximum likelihood estimation of mean and

covariance structures with missing data. Journal of the American Statistical Association, 85,

195-203.

Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data:

Basic and advanced issues. Journal of Family Psychology, 19, 98-110.

Brown, H., & Prescott, R. (1999). Applied mixed models in medicine. New York: Wiley.

Cook, W. L., & Snyder, D. K. (2005). Analyzing nonindependent outcomes in couple therapy

using the Actor-Partner Interdependence Model. Journal of Family Psychology, 19, 133-141.

DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks,

CA: Sage.

Graham, J. M., Liu, Y. J., & Jeziorski, J. L. (2006). The Dyadic Adjustment Scale: A reliability

generalization meta-analysis. Journal of Marriage and Family, 68, 701-717.

Hatcher, L. (1994). A step-by-step approach to using SAS for factor analysis and structural

equation modeling. Cary, NC: SAS Institute.

Jackson, D. N. (1970). A sequential system for personality scale development. In C. D.

Spielberger (Ed.), Current topics in clinical and community psychology, vol. 2 (pp. 61-96).

New York: Academic Press.

Kashy, D. A., & Snyder, D. K. (1995). Measurement and data analytic issues in couples research.

Psychological Assessment, 7, 338-348.

Knafl, G. J., Dixon, J. K., O'Malley, J. P., Grey, M., Deatrick, J., Gallo, A., & Knafl, K. A. (in

review). Scale development supported by likelihood cross-validation.

Knafl, G. J., Knafl, K. A., & McCorkle, R. (2005). Mixed models incorporating intra-familial

correlation through spatial autoregressive methods. Research in Nursing and Health, 28,

348-356.

Martini, T. S., Grusec, J. E., & Bernardini, S. C. (2001). Effects of interpersonal control,

perspective taking, and attributions on older mothers' and adult daughters' satisfaction with

their helping relationship. Journal of Family Psychology, 15, 688-705.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

O'Leary, S. G., & Vidair, H. B. (2005). Marital adjustment, child-rearing disagreements, and

overreactive parenting: Predicting child behavior problems. Journal of Family Psychology,

19, 208-216.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data

analysis methods (2nd ed.). Thousand Oaks, CA: Sage.

Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and

interval estimation using a structural equation modeling approach. Structural Equation

Modeling, 9, 195-212.

SAS Institute Inc. (2004). SAS/STAT 9.1 user's guide. Cary, NC: SAS Institute Inc.

Schott, J. R. (2005). Matrix analysis for statistics (2nd ed.). Hoboken, NJ: J. Wiley & Sons.

Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate

analysis. Psychometrika, 52, 333-343.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

Singer, J. D. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth curves. Journal of Educational and Behavioral Statistics, 23, 323-355.

Stein, R., & Jessop, D. (1990). Functional Status II: A measure of child health status. Medical

Care, 28, 431-438.

Strahan, R., & Gerbasi, K. (1972). Short, homogeneous version of the Marlowe-Crowne Social

Desirability Scale. Journal of Clinical Psychology, 28, 191-193.

Streiner, D. L., & Norman, G. R. (2003). Health measurement scales: A practical guide to their

development and use. Oxford: Oxford University Press.

Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error

affecting score reliability across studies. Educational and Psychological Measurement, 58, 6-20.

Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York:

Springer.

Wang, J. (2002). Reliability generalization: An HLM approach. Journal of Instructional

Psychology, 29, 213-218.

Wilkinson, L., & The Task Force on Statistical Inference. (1999). Statistical methods in

psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Yeh, H.-C., Lorenz, F. O., Wickrama, K. A. S., Conger, R. D., & Elder, G. H., Jr. (2006).

Relationships among sexual satisfaction, marital quality, and marital instability at midlife.

Journal of Family Psychology, 20, 339-343.

Table 1 Summary Statistics and Internal Consistency Reliability (ICR) for the Concern and Manageability Scales of the Family Management Measureᵃ

                              Mean                                     SD                 ICR
Scale          Type           Partnered mothers  Single mothers  Fathers   Mothers  Fathers   Mothers  Fathers
Concern        compositeᵇ     26.9               28.4ᶜ           26.9ᵈ     6.8      7.2       .73      .77
               separate       26.8               28.4            27.1      6.6      7.0       .72      .75
Manageability  compositeᵉ     49.3               47.5ᶠ           49.4ᵍ     6.2      5.9       .71      .73
               separate       49.3               47.5            49.3      6.3      5.9       .71      .73

ᵃ Using responses from 579 parents, 414 mothers and 165 fathers, of 417 families including 162 families with both parents responding.
ᵇ The associated intra-familial correlation (IFC) was .57 and was significantly nonzero at p < .01.
ᶜ Not significantly different from the mean for partnered mothers at p = .08.
ᵈ Not significantly different from the mean for partnered mothers at p = .94.
ᵉ The associated IFC was .33 and was significantly nonzero at p < .01.
ᶠ Significantly different from the mean for partnered mothers at p = .04.
ᵍ Not significantly different from the mean for partnered mothers at p = .76.

Table 2 Support for Composite Construct Validity for the Concern and Manageability Scales of the Family Management Measureᵃ

Scale            Hypothesized Relationship   Correlation with Child Functional Statusᵇ   Intra-Familial Correlation
Concernᶜ         −                           −.32ᵈ                                       .52ᵈ
Manageabilityᶜ   +                           .32ᵈ                                        .38ᵈ

ᵃ Using responses from 577 parents, 413 mothers and 164 fathers, from 415 families including 160 families with both parents responding. Child functional status was missing for the remaining 2 parents.
ᵇ Composite internal consistency reliability (ICR) for child functional status was .79 for mothers and .81 for fathers with associated intra-familial correlation .58 (p < .01).
ᶜ Composite ICR reported in Table 1.
ᵈ All significantly nonzero at p < .01.

Table 3 Test-Retest Reliability (TRR) for the Concern and Manageability Scales of the Family Management Measureᵃ

                 Separate Analysis                  Composite Analysis
                 Separate TRR                       Full Model         Reduced Model
Scale            Mothers   Fathers   BICᵇ,ᶜ         TRR     BICᵇ       TRR     BICᵇ
Concern          .89       .84       780.8          .87ᵈ    771.2      .87ᵉ    768.8
Manageability    .82       .64       823.2          .79ᶠ    817.4      .79ᵍ    814.5

ᵃ Using responses from 65 parents, 41 mothers and 24 fathers, of 44 families including 21 families with both parents responding, that were retested within 2-4 weeks of their initial interviews.
ᵇ BIC is the Bayesian information criterion with smaller values indicating better models.
ᶜ Computed using the combined data for mothers and fathers under the model equivalent to the separate models for each parent while treating IFC as zero.
ᵈ The associated intra-familial correlation (IFC) was .34 and was significantly nonzero at p = .02.
ᵉ The associated IFC was .32 and was significantly nonzero at p = .02.
ᶠ The associated IFC was −.16 and was not significantly different from zero at p = .29. When treated as zero, the composite TRR was .78.
ᵍ The associated IFC was −.17 and was not significantly different from zero at p = .27. When treated as zero, the composite TRR was .78.

Table 4 Composite Item-Total Correlation (ITC) for the Items of the Concern Scale of the Family Management Measureᵃ

Separate ITC, Mothers   Separate ITC, Fathers   Composite ITCᵇ   Intra-Familial Correlationᶜ
.23                     .33                     .24              .34
.27                     .35                     .27              .35
.33                     .25                     .29              .30
.31                     .40                     .30              .47
.37                     .33                     .33              .39
.39                     .48                     .40              .34
.41                     .52                     .41              .53
.51                     .43                     .47              .31
.51                     .46                     .48              .24
.51                     .59                     .52              .38

ᵃ Using responses from 579 parents, 414 mothers and 165 fathers, of 417 families including 162 families with both parents responding.
ᵇ All composite ITCs were significantly nonzero at p < .01.
ᶜ All intra-familial correlations were significantly nonzero at p < .01.

Table 5 Composite Differential Reliability Index for the Items of the Concern Scaleᵃ

Composite Differential   Composite Item-Total   Composite Social Desirability Correlationᵇ   Intra-Familial
Reliability Index        Correlation            value        p                               Correlationᶜ
.24                      .24                    −.03         .44                             .23
.26                      .27                    .06          .16                             .26
.27                      .29                    .11          .01                             .22
.28                      .30                    .10          .02                             .41
.32                      .33                    .07          .12                             .33
.39                      .40                    −.06         .15                             .28
.40                      .41                    −.03         .51                             .49
.47                      .47                    −.03         .47                             .23
.47                      .48                    −.10         .02                             .23
.51                      .52                    −.04         .33                             .32

ᵃ Using responses from 578 parents, 414 mothers and 164 fathers, of 417 families including 162 families with both parents responding. Social desirability was missing for the remaining parent. Items are listed in the same order as in Table 4.
ᵇ Composite internal consistency reliability (ICR) for social desirability was .60 for mothers and .62 for fathers with associated intra-familial correlation .33 (p < .01). The composite correlation between social desirability and concern was .00 (p = .96) with intra-familial correlation (IFC) .44 (p < .01).
ᶜ All intra-familial correlations were significantly nonzero at p < .01.

[Figure: internal consistency reliability (vertical axis, 0 to 1) plotted against number of items (horizontal axis, 0 to 25), with series labeled fathers, mothers, and actual value.]

Figure 1. Composite Generalizability Theory Results for the Concern Scale of the Family Management Measure