How does health psychology measure up?

A critical look at measurement in health psychology

Matthew Hankins

16th September 2011

The empirical basis of Health Psychology

• Why do Health Psychologists collect data?

– Theory generation, esp. identifying constructs

– Theory corroboration

– Measuring outcomes (trials etc.)

• The value of such activities is therefore critically dependent on the quality of the data

Questionnaire measures

• Majority of data collected by Health Psychologists is generated by questionnaire measures (‘scales’)

• Questionnaires vary in the quality of data that they generate

– Validity: extent to which the questionnaire measures what is intended

– Reliability: extent to which variance in data reflects variance in construct measured

• Index of measurement error

Pragmatic approach

• Validity

– Unidimensionality (factor analysis)

– Associations between measures

– Discrimination between known groups

• Reliability

– Estimated by Cronbach’s Alpha

– Or test-retest correlation
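As an illustration of the reliability estimate named above, Cronbach's Alpha can be sketched in a few lines. The data below are hypothetical (5 respondents, 3 items scored 0 to 3) and the function name is an assumption for this example, not part of any scale's published scoring.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's Alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Variance of each item across respondents
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    # Variance of the summed scale score
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data, for illustration only
data = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [2, 3, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(data)
print(round(alpha, 2))
```

A high value here says only that the items covary strongly; it does not test whether the summed score orders people on the construct.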

Scale development

• Combination of these approaches is derived from ‘Classical Test Theory’ (CTT)

– Originated with Spearman (1904)

– Landmark text: Guilford 2nd ed. (1954)

– Fully developed by Lord & Novick (1968)

• Further developments: ‘item-response theory’ (IRT)

– E.g. Rasch model (1960)

• CTT implicit in most empirical Health Psychology research

CTT vs. IRT

• Argument tends to be that IRT is superior to CTT

• In particular, it is argued that IRT is ‘objective’ measurement

• For large samples, differences more apparent than real:

– Strong correlations between CTT data & IRT data

• And differences tend to be smaller than the margin of error

– If data treated as ordinal, perfect correlation between CTT & Rasch data

What is a scale?

• A scale orders people on the construct of interest

• Both CTT & IRT agree that a person’s position on the dimension can be estimated from the item scores

• Strength of IRT is that it does not assume that a set of correlated items forms a scale

• Implicit in CTT: if items load on same factor, we automatically assume that they form a scale

[Figure: Persons A, B, C and D ordered from low to high on the construct]

Scaling problem

• Whether a set of items forms a scale is a hypothesis (Guttman 1950)

– Formally tested whether items formed ‘Guttman scales’

• “In contemporary psychometric practice, it is the rule rather than the exception that two people having the same score on a test will have [endorsed] different items… Such scores are crude empirical devices known to have some predictive efficiency, but they cannot be called measurements in any strict sense” (Loevinger 1948)

• Additionally, there is no rational basis for adding up a set of ordinal Likert scores unless they have been shown to scale

Example: PHQ-9

• Feeling tired + Little interest in doing things + Poor appetite on several days in last 2 weeks

– Scale score = +3

• Thoughts of hurting yourself in some way nearly every day in last 2 weeks

– Scale score = +3

• Are these responses really equivalent?
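The arithmetic behind this example can be made explicit. The sketch below assumes the usual PHQ-9 response coding (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day); the item labels are abbreviated from the slide, and unlisted items are treated as "not at all".

```python
# Assumed PHQ-9 response coding, 0 to 3 per item
SCORES = {
    "not at all": 0,
    "several days": 1,
    "more than half the days": 2,
    "nearly every day": 3,
}

def sum_score(responses):
    """Simple summed scale score; items not listed count as 'not at all'."""
    return sum(SCORES[answer] for answer in responses.values())

# Three low intensity symptoms, each on several days
person_a = {
    "feeling tired": "several days",
    "little interest": "several days",
    "poor appetite": "several days",
}
# One high intensity symptom, nearly every day
person_b = {"thoughts of hurting yourself": "nearly every day"}

print(sum_score(person_a), sum_score(person_b))
```

Both response patterns receive the same summed score, although they are unlikely to reflect the same level of depression.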

Implications

• If a set of items is merely assumed to form a scale, then we cannot be sure that the scale score accurately ranks people on the construct of interest

– People with different positions may be assigned the same score

– People with the same position may be assigned different scores

• Unless we test the hypothesis, assessing reliability & validity is pointless


What we would like: interval scales

What we think we have: ordinal scales

What we probably have: disordered categories

A scale that cannot rank-order people is not a scale


Item ‘difficulty’ (intensity)

• The problem arises because CTT does not account for item difficulty or intensity

• Some items are endorsed at low levels of the construct

– ‘Low intensity item’

– Endorsement may indicate low or high level of construct

• Some items are endorsed at high levels of the construct

– ‘High intensity item’

– Endorsement indicates high level of construct

Example: PHQ-9

• Feeling tired on several days is a low intensity item

– Endorsed at low level of depression

– But may also be endorsed at higher levels of depression

[Figure: depression continuum from low to high; the item is endorsed at every level shown (Yes, Yes, Yes, Yes)]

Example: PHQ-9

• Thoughts of hurting yourself in some way nearly every day in last 2 weeks is a high intensity item

– Endorsed at high level of depression

– But not endorsed at lower levels of depression

[Figure: depression continuum from low to high; the item is endorsed only at the highest level shown (No, No, No, Yes)]

How CTT fails to deal with item intensity

Factor analysis groups items of similar intensity

• Factor analysis of a unidimensional construct will produce more than one ‘factor’

• These ‘factors’ are simply sets of items with similar intensities

Example: GHQ-12

• Many studies report 2- or 3-factor solutions

• ‘Factors’ simply group items by intensity

[Figure: GHQ-12 items (numbered 1 to 12) positioned along the psychiatric morbidity continuum, low to high]

How CTT fails to deal with item intensity

Selecting items on basis of factor analysis exacerbates problem, but simultaneously conceals it

• Items are selected on basis of similar intensities, creating scales with limited range but high reliability


[Figure: the full GHQ-12 item set along the low-to-high continuum, compared with a selected subset (items 7, 4, 1, 12, 8, 3)]


Why Rasch modelling is not the answer

• Rasch modelling explicitly takes into account item intensities

– Stochastic Guttman scale

• Additionally claims to produce interval scaling & ‘objective’ measurement

• Increasingly popular in Health Psychology

Problems

• Rasch models require very large samples to allow estimation of person and item parameters

• Very strong assumptions, e.g. logistic item-response curve

• The data must fit the model, not the other way round

– Discards useful data to fit arbitrary assumptions

• Interval scaling is questionable gain if psychological constructs are not quantitative in the first place

Non-parametric IRT (NPIRT)

• E.g. Mokken (1971)

• Takes into account item intensities

– Stochastic Guttman scale

• Claims only to rank order people

• Very weak assumptions

– Retains data

• Complements CTT

– Uses simple scale score

PROMIS project

• NIH funded project since 2004 ($100m)

• Establish a domain framework and develop candidate items for adult and paediatric Patient Reported Outcome Measures

• Questionnaires developed using published methodology

• Scaling methods include NPIRT and Graded Response Model (GRM)

Summary

• The credibility of Health Psychology research & practice rests on its empirical evidence base

• This evidence base relies on the quality of questionnaire data

• The quality of questionnaire data may be compromised by the use of inappropriate methods

• We should stop relying on factor analysis & reliability coefficients & test the hypothesis that a set of items constitutes a scale

Examples of NPIRT

• Mokken (1971) proposed two models

– Monotone homogeneity model (MH)

– Doubly monotone model (DM)

• Scales fitting the MH model rank order people on the attribute of interest

• Corollary is that scales not fitting the MH model do not rank order people on the attribute of interest

• Select items for the scale based on homogeneity

• Assess whether the resulting scale fits the MH model

• Scaling procedure and the MH model based on the following minimal assumptions:

– For all items, if person A has a higher degree of X than person B, A’s probability of endorsing an item will be equal to or higher than B’s

– Local independence: item scores are uncorrelated for the same degree of attribute

• If the purpose of the scale is to rank order people on a given attribute then the scale must be monotone homogeneous

• Probability of item being endorsed must be monotone nondecreasing against attribute

• i.e. probability of item endorsement does not decrease with an increase in the measured attribute

* - as estimated from the remaining items of the scale
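The monotonicity requirement can be checked roughly by grouping respondents on their score on the remaining items (the "rest score") and looking at the proportion endorsing the item in each group. The following is a minimal sketch under stated assumptions: the data are hypothetical, the function names are invented for this example, and only decreases between adjacent rest-score groups are counted. Dedicated Mokken software also merges small groups and tests violations for significance, which this sketch does not attempt.

```python
from collections import defaultdict

def endorsement_by_rest_score(rows, item):
    """Proportion endorsing `item` (0/1) within each rest-score group."""
    groups = defaultdict(list)
    for r in rows:
        rest = sum(r) - r[item]  # score on the remaining items
        groups[rest].append(r[item])
    return {rest: sum(v) / len(v) for rest, v in sorted(groups.items())}

def monotonicity_violations(rows, item):
    """Count decreases in endorsement between adjacent rest-score groups."""
    props = list(endorsement_by_rest_score(rows, item).values())
    return sum(1 for lower, higher in zip(props, props[1:]) if higher < lower)

# Hypothetical 0/1 responses: 6 respondents, 3 items
rows = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
]
print(endorsement_by_rest_score(rows, 2))
print(monotonicity_violations(rows, 0), monotonicity_violations(rows, 2))
```

In this toy data set the third item is endorsed less often in the highest rest-score group than in the middle one, which is the kind of pattern that counts against monotone homogeneity.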

For this GHQ-12 item the probability of endorsement reaches 50% at a low level of psychological distress

It is therefore a low intensity item: people endorsing this item are signalling a low level of distress

Note that probability (Y-axis) increases with increase in class score (X-axis)

For this GHQ-12 item the probability of endorsement reaches 50% at a high level of psychological distress

It is therefore a high intensity item: people endorsing this item are signalling a high level of distress

Note that probability (Y-axis) also increases with increase in class score (X-axis), but curves:

(a) Do not have the same slope

(b) Are not required to have the same shape

• If two items belong to a unidimensional scale, then:

– Endorsing the more intense item entails that the less intense item also be endorsed

– Endorsing the less intense item does not entail that the more intense item be endorsed

• For a Guttman scale, these are deterministic statements

• For a Mokken scale, these are probabilistic statements

• A Guttman error occurs when the more intense item is endorsed but not the less intense item

• Too many Guttman errors imply that items are not measuring the same attribute
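Counting Guttman errors for a pair of items is straightforward. The sketch below uses hypothetical 0/1 responses and an invented function name; it counts respondents who endorse the more intense item without endorsing the less intense one.

```python
def guttman_errors(less_intense, more_intense):
    """Count respondents endorsing the more intense item but not the less intense one."""
    return sum(
        1
        for less, more in zip(less_intense, more_intense)
        if more == 1 and less == 0
    )

# Hypothetical responses from 8 people (1 = endorsed, 0 = not endorsed)
less_intense = [1, 1, 1, 1, 0, 0, 1, 0]
more_intense = [0, 1, 0, 1, 0, 0, 0, 1]

errors = guttman_errors(less_intense, more_intense)
print(errors)
```

Only the last respondent endorses the more intense item alone, so this pair shows a single Guttman error.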


• This asymmetrical relationship between item pairs can be summarised with Loevinger’s H

– H is the coefficient of homogeneity between two items i and j

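• Ranges from 0.0 to 1.0 for positively associated items (negative values are possible and indicate a negative association)

– 0.0 indicates no association between items

– 1.0 indicates perfect association, given the differences in item intensity

– 1.0 also indicates no Guttman errors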
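One common way to write the pairwise coefficient is H = 1 - F/E, where F is the observed number of Guttman errors and E is the number expected if the two items were independent given their endorsement rates. The sketch below follows that form; the data are hypothetical, the function name is an assumption, and the less intense item is taken to be the one endorsed more often.

```python
def loevinger_h(item_a, item_b):
    """Pairwise homogeneity H = 1 - observed / expected Guttman errors (0/1 items)."""
    n = len(item_a)
    p_a, p_b = sum(item_a) / n, sum(item_b) / n
    # The less intense item is the one endorsed more often
    if p_a >= p_b:
        less, more, p_less, p_more = item_a, item_b, p_a, p_b
    else:
        less, more, p_less, p_more = item_b, item_a, p_b, p_a
    observed = sum(1 for l, m in zip(less, more) if m == 1 and l == 0)
    # Errors expected if the items were independent
    expected = n * p_more * (1 - p_less)
    if expected == 0:
        raise ValueError("H is undefined when an item is endorsed by all or by none")
    return 1 - observed / expected

# Hypothetical responses from 10 people
less_intense = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
more_intense = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

h = loevinger_h(less_intense, more_intense)
print(round(h, 3))
```

Here one error is observed against 1.6 expected under independence, giving H = 0.375. With no errors the function returns 1.0, matching the perfect Guttman case.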

• Mokken (1971) developed H for scale development

– Hij : Homogeneity of pair of items

– Hi : Homogeneity of item i with all items

– H : Homogeneity of scale

• All Hij > 0

• Start with item pair with highest Hij

• Select third item to maximise scale H

• Proceed until H reaches threshold value c

• Produces a unidimensional scale

– c = 0.3; weak scale

– c = 0.4; medium scale

– c = 0.5; strong scale

– c = 1.0; perfect Guttman scale
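The selection steps above can be sketched as a greedy loop. This is a simplified illustration, not the full published procedure: it reads "proceed until H reaches threshold value c" as "keep adding items while scale H stays at or above c", requires a positive pairwise H with every item already selected, and omits the significance tests used by dedicated software. All names and data are hypothetical.

```python
from itertools import combinations

def pair_counts(x, y):
    """Observed and expected Guttman error counts for two 0/1 items."""
    n = len(x)
    p_x, p_y = sum(x) / n, sum(y) / n
    less, more = (x, y) if p_x >= p_y else (y, x)
    observed = sum(1 for l, m in zip(less, more) if m == 1 and l == 0)
    expected = n * min(p_x, p_y) * (1 - max(p_x, p_y))
    return observed, expected

def scale_h(items, subset):
    """Scale H = 1 - (sum of observed errors) / (sum of expected errors) over item pairs."""
    observed = expected = 0.0
    for i, j in combinations(subset, 2):
        o, e = pair_counts(items[i], items[j])
        observed += o
        expected += e
    return 1 - observed / expected

def select_items(items, c=0.3):
    """Greedy item selection: best pair first, then add items while scale H >= c."""
    indices = range(len(items))
    selected = list(max(combinations(indices, 2), key=lambda p: scale_h(items, p)))
    if scale_h(items, selected) < c:
        return []
    while True:
        # Candidates must have positive pairwise H with every selected item
        candidates = [
            k for k in indices
            if k not in selected
            and all(scale_h(items, (k, s)) > 0 for s in selected)
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda k: scale_h(items, selected + [k]))
        if scale_h(items, selected + [best]) < c:
            break
        selected.append(best)
    return selected

# Hypothetical data: each inner list is one item's responses from 8 people.
# Items 0-2 form a perfect Guttman pattern; item 3 is unrelated to them.
items = [
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
chosen = select_items(items, c=0.3)
print(chosen, scale_h(items, chosen))
```

In this toy example the unrelated item is left out because its pairwise H with each selected item is zero, and the three remaining items form a perfect scale.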

Results for GHQ-12

Step Item Scale H

1 p6d 0.79

1 n4d 0.79

2 n6d 0.73

3 n5d 0.68

4 n2d 0.64

5 n3d 0.61

6 p5d 0.59

7 p3d 0.57

8 p4d 0.55

9 n1d 0.53

10 p2d 0.51

11 p1d 0.50

• => the items of the GHQ-12 form a strong unidimensional scale

Monotone homogeneity model: GHQ-12

Item H #vi maxvi zmax #zsig

p1d 0.44 0 0.00 0.00 0

n1d 0.45 0 0.00 0.00 0

p2d 0.43 1 0.06 0.99 0

p3d 0.50 0 0.00 0.00 0

n2d 0.55 0 0.00 0.00 0

n3d 0.51 0 0.00 0.00 0

p4d 0.47 0 0.00 0.00 0

p5d 0.50 1 0.05 0.90 0

n4d 0.56 0 0.00 0.00 0

n5d 0.50 0 0.00 0.00 0

n6d 0.56 1 0.05 0.93 0

p6d 0.53 1 0.04 0.68 0

• Small deviations from MH model but none significant

Conclusion

• The GHQ-12 is a strongly homogeneous unidimensional scale

• Small deviations from monotone homogeneity, none significant

• The GHQ-12 summed score can rank order people by the measured attribute

• i.e. it can serve as an ordinal measure of severity of psychiatric impairment

• Compare to results of EFA/CFA studies

Example: Northwick Park dependency scale

• Item selection from pool of 16 items

Item Scale H

Q8 0.93

Q5 0.93

Q9 0.93

Q2 0.91

Q1 0.88

Q13 0.87

Q7 0.84

Q12 0.82

Q6 0.79

Q14 0.76

Q4 0.74

Q3 0.70

Q11 0.67

Q15 0.62

• 14 items form unidimensional scale

• Two items with serious violations of monotone homogeneity

Item H #vi maxvi zmax #zsig

Q3 0.45 6 0.25 2.88 4

Q11 0.32 5 0.28 3.43 2

Q3: help required using toilet (urination)

Q11: help required with drinking

• These items decrease in probability at the top end of the scale

• With extreme dependency, patients require less help with drinking and emptying bladder

– Because at this extreme, they are more likely to be tube-fed and catheterised

• Hence, for these items, probability of endorsement decreases as dependency increases

– Scale is not monotone homogeneous

• The summed score will not rank order people on the measured attribute
