
Person. individ. Diff. Vol. 9, No. 5, pp. 873-882, 1988. Printed in Great Britain. All rights reserved.

0191-8869/88 $3.00 + 0.00. Copyright © 1988 Pergamon Press plc

PREDICTING CONSISTENT PSYCHOLOGICAL TEST ITEM RESPONSES: A COMPARISON OF MODELS

G. CYNTHIA FEKKEN¹* and DOUGLAS N. JACKSON²
¹Queen's University, Kingston, Ontario K7L 3N6 and ²The University of Western Ontario, London, Ontario N6A 5C2, Canada

(Received 3 October 1987)

Summary-The efficacy of four models for predicting the stability of a given individual’s test item responses on a structured inventory was examined. Two models were based on item characteristics alone and predicted that an individual would be most likely to change responses to items with moderate endorsement probabilities, or with moderate social desirability scale values. Two other prediction models incorporated individual differences in the perception of item characteristics by predicting that unstable items would have relatively long response latencies for an individual, or would be near an individual’s threshold for responding desirably to items. Results from two studies yielded support for the following conclusions: (a) a person’s test item responses are relatively stable over short time intervals; (b) items to which a person will show response changes on retest can be identified to a statistically significant degree; (c) the model based on response latencies constituted in both studies a significantly better predictor than the other models examined. The implications of these results for the threshold model were discussed as were the practical and theoretical applications of the response latency-item stability relationship at the level of an individual’s test protocol.

INTRODUCTION

A concern of clinicians and psychometricians alike has been the reliability of individuals’ psychological test item responses. Clinicians might cite a variety of practical reasons for the need to establish the reliability of a single client’s data. For example, stable responses are required to make adequate treatment or vocational recommendations; to provide a useful baseline for evaluating treatment effects; or, to interpret meaningfully single ‘critical’ items, as is often done in the domain of psychopathology. Psychometricians, too, are concerned about the stability of individuals’ item responses. Practical applications may include constructing tailored tests for individuals with some minimal reliability; developing indices of individual consistency; or simply improving tests by selecting out items which tend to be inconsistent. Theoretical interests would include understanding inconsistent responding in terms of the overall process of responding to psychological test items.

The purpose of this research is to examine the efficacy of four models for predicting the stability of an individual’s responses to a structured inventory. Models are derived from previous empirical findings regarding the relationship of item and of person characteristics to person reliability. Two models are based on item characteristics, namely, endorsement probabilities and social desirability scale values. Two other prediction models incorporate individual differences in the perception of item characteristics. These models are based on an individual’s latencies for responding to items and on the proximity of items to a person’s threshold for responding in terms of some item characteristic. It is hypothesized that: (a) the exact items a given individual will change on retest can be predicted above chance by all four models; and (b) models based on person characteristics, that is, accounting for individual differences in the perception of item properties, will perform better than those models based on item characteristics alone.

Research on the consistency of an individual's response to particular test items dates back to the 1930s. Basically, the research might be divided into two general categories, one focussing on characteristics of the item and the other focussing on characteristics of the person. The empirical work on item characteristics suggested that, while items tended overall to be stable, items having certain properties tended to be somewhat less stable than average. Indeed, an unstable item might be characterized as having many letters (Dunn, Lushene and O'Neil, 1972; Hanley, 1962), a moderate P-value (Goldberg, 1963; Hanley, 1962; Lentz, 1934; Neprash, 1936; Payne, 1974), a moderate social desirability scale value (Frank, 1936; Payne, 1974) and high ratings on ambiguity (Goldberg, 1963; Payne, 1974) and controversiality (Rogers, 1973).

*Reprint requests should be addressed to: G. Cynthia Fekken, Department of Psychology, Queen's University, Kingston, Ontario K7L 3N6, Canada.

Two questions arose from such empirical findings. How precisely do these item characteristics relate to individuals changing their item responses on retest? And, how might individual differences in the actual number and even type of items changed be accounted for? These issues proved to be the impetus for an examination of characteristics associated with the person.

After a brief, perhaps misguided, attempt to study personality characteristics and item response consistency (see Bentler, 1964, and Glaser, 1949, 1952, for a full analysis of the problems associated with this research), researchers turned to conceptualizing person characteristics in terms of the process of responding. Essentially, theoretical models were developed to describe an individual's cognitive processes in responding to a personality test item. Scaling procedures are generally used to estimate the distance in psychological space between persons and items, based on patterns of item endorsement with respect to some item characteristic (e.g. Cliff, 1977; DeBoeck, 1981). Models that specifically incorporate the threshold concept in describing the process of responding (Rasch, 1960; Jackson, 1968, 1982) would seem especially applicable to the item consistency issue. The threshold marks a point along a dimension of personality where items are sufficiently close to the individual's concept of self that the tendency to reject items changes to the tendency to endorse items. For example, Kuncel (1973, 1977; Kuncel and Fiske, 1974) found that person-item distance was negatively related to item inconsistency. That is, the further away in space that an item is from the threshold, the more stable the response to the item. Kuncel's interpretation was that when an item is far from a person's threshold, psychological distance discrimination is easy and hence, the decision to endorse or to reject is easy. However, when an item is close to the threshold, the decision is relatively more difficult. To support the hypothesis that decisions near the threshold were difficult, Kuncel examined response latencies, reporting increased latency of responding to items near the threshold.

The notion of response latency has itself formed an integral part of some conceptualizations of the response process. Some researchers have interpreted response latencies simply as a function of item characteristics (Dunn et al., 1972; Hanley, 1962), noting that long item response latencies are indeed related to unstable responding (Rogers, 1973; Holden, Fekken and Jackson, 1985). However, other formulations have emphasized that individual differences in item response latencies describe the difficulty of responding to an item for an individual. For example, research emphasizing the role of the self in information processing demonstrated significantly faster response latencies for extreme than for moderate self-ratings (Kuiper, 1981; Rogers, Kuiper and Rogers, 1979). Similarly, Markus (1977; Markus and Smith, 1981) reported that individuals with well-developed self-schemata for a construct processed relevant information more quickly than irrelevant information. Persons lacking self-schemata showed no such differences in response latencies when endorsing construct-relevant information or its opposite. Finally, Ebbesen (1980) obtained significantly faster response times for global judgements of personality using trait terms than for ratings of behavioral details, presumably because the latter involve a more particular memory search.

The literature on the relationship of response stability to item and to person characteristics defined in terms of the response process yields two conclusions. First, response instability and the ambiguity of item content appear to be related. For example, items with moderate endorsement and desirability properties or high ambiguity ratings show somewhat less stability across Ss. Second, response instability and the degree to which the cognitive processing of an item is difficult for a particular individual appear to be related. For example, responses are less stable for a person when items are close to his or her threshold for responding or when, for a variety of reasons, the decision process is lengthy. These generalizations reflect separate approaches to the prediction of unstable psychological test item responses. One approach to prediction emphasizes item characteristics; another emphasizes individual differences in the perception of item characteristics, thus accounting for item and person characteristics in interaction. Both approaches take into account item characteristics; however, the latter approach does not treat item characteristics as invariant. Rather, it is sensitive to a person's unique judgments of item characteristics. Specific prediction models based on the item by person characteristics approach would seem to be more accurate, but such an hypothesis has never been explicitly tested. Thus, previous empirical data were synthesized to derive two models incorporating item characteristics alone and two models accounting for both person and item characteristics.


[Fig. 1. Subject operating curve described in the threshold model for responding. Abscissa: social desirability scale value; ordinate: probability of item endorsement.]


The P-Value Model was based on item endorsement properties and predicted that items with relatively moderate P-values would be unstable (Goldberg, 1963; Payne, 1974). The Social Desirability Model was based on group judgements of an item’s tendency to elicit a socially desirable response. An item’s mean judged desirability is referred to as its social desirability scale value (SDSV). The Social Desirability Model leads to the prediction that items with relatively moderate SDSVs would be unstable (Goldberg, 1963; Payne, 1974). The Threshold Model was based on a threshold model for responding (Jackson, 1968, 1982) which describes an individual’s response pattern in terms of a subject operating curve (see Fig. 1). This curve is a plot of the relationship between an item characteristic (e.g. SDSV) and the item endorsement probability for a respondent. The curve variance defines the salience parameter, reflecting the degree to which the item property determines the individual’s responses. The curve mean defines the threshold parameter, marking an individual’s transition from the tendency to reject to the tendency to endorse items relative to some item property. The threshold explicitly takes into account the interaction between items and persons, designating the point for each individual where items have moderate levels of the item characteristic and hence, where responding should be most difficult. Thus, the Threshold Model, with social desirability scale value as the item property, predicted that items falling near an individual’s threshold would be least stable. Finally, the Response Latency Model was based on an individual’s response latencies associated with specific item responses. Individual differences in item response latencies may be interpreted as an index of the difficulty of responding to an item for a person (Ebbesen and Allen, 1979; Kuiper, 1981; Rogers, 1974). The Response Latency Model would lead to the prediction that individuals would be most likely to change their responses on retest to items which have relatively long response latencies.
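To make the threshold and salience parameters concrete, the following Python sketch fits a cumulative-normal subject operating curve to one respondent's endorsements by maximum likelihood. This is a plausible probit-style estimator, not necessarily the exact fitting procedure of Jackson (1968, 1982); the function name and optimization details are illustrative assumptions. The fitted curve mean serves as the threshold; the fitted spread corresponds inversely to salience (a steeper curve means the item property more strongly determines responses).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def fit_subject_operating_curve(sdsv, responses):
    """Fit a cumulative-normal subject operating curve for one respondent.

    sdsv      -- social desirability scale value of each item
    responses -- 0/1 endorsements ('True' = 1) by one respondent

    Returns (threshold, spread): the curve mean, and a spread term whose
    reciprocal plays the role of salience. A sketch, not Jackson's own code.
    """
    sdsv = np.asarray(sdsv, dtype=float)
    r = np.asarray(responses, dtype=float)

    def neg_log_lik(params):
        mu, sigma = params
        # Endorsement probability rises with desirability past the threshold.
        p = norm.cdf((sdsv - mu) / max(abs(sigma), 1e-6))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))

    start = [float(sdsv.mean()), float(sdsv.std()) + 1e-3]
    result = minimize(neg_log_lik, x0=start, method="Nelder-Mead")
    mu, sigma = result.x
    return mu, abs(sigma)
```

Under this parameterization, items whose SDSVs lie closest to the fitted threshold are the ones the Threshold Model would flag as least stable.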

In summary, all four models were hypothesized to predict above chance which responses to a structured inventory an individual will change on retest. Further, the Threshold and Response Latency Models were hypothesized to have greater predictive efficiency than the P-Value and Social Desirability Models.

Study 1

METHOD

Subjects

Ss were 40 introductory psychology students (20 men, 20 women) at a major Ontario university who received experimental course credit for their participation.

Materials

Items were taken from the Personality Research Form-Form E (PRF-E; Jackson, 1984). The PRF-E is a true-false inventory that yields scores on 20 content scales and two validity scales, namely, Infrequency and Desirability. The PRF-E measures dimensions of normal personality in a manner relatively free of response style variance. Altogether, 192 items comprising 10 content scales plus the two validity scales were included in this study. As well, a multiple choice vocabulary measure (French, 1962) was completed by all Ss.

Procedure

Each S was tested twice, with approx. 1 week separating sessions. In the first session, all Ss responded to the 192 PRF items in a manner allowing the collection of response latencies. Items were presented using a slide projector electrically connected to a clock timing in milliseconds and a two-key control panel. As a slide fell into the projection slot, the clock was started. To respond to the item, the S depressed either the right-hand key marked ‘T’ for a ‘True’ response or the left-hand key marked ‘F’ for a ‘False’ response. Depressing either key stopped the clock and illuminated one of two response indicators for the experimenter, who recorded responses and latencies manually. All Ss responded to 15 practice items before beginning the experimental task.
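As a rough modern analogue of this apparatus (a hypothetical, console-based sketch; the original study used a slide projector, a millisecond clock and a two-key panel), item presentation and latency capture could look like the following. The function name is an illustrative assumption.

```python
import time

def administer_item(item_text):
    """Present one true-false item and record its response latency.

    The timer starts when the item appears and stops at the final valid
    keypress, mirroring the projector-and-clock arrangement described
    above. Console input includes typing time, so this is only a crude
    stand-in for dedicated response keys.
    """
    start = time.perf_counter()
    response = ""
    while response not in ("T", "F"):
        response = input(f"{item_text} [T/F]: ").strip().upper()
    latency = time.perf_counter() - start
    return response == "T", latency
```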

In the second session, 20 Ss were randomly selected within sex to respond to the PRF-E items as in the first session. The other 20 Ss responded to the items in the standard paper and pencil format.

RESULTS

Properties of the testing material

The psychometric properties (see Fekken, 1983, for complete data) of the PRF-E scales, including means, standard deviations, internal consistencies and test-retest correlations were comparable to published data (Jackson, 1984). This is taken to indicate that the present sample exhibits characteristics similar to those exhibited by other samples.

Person reliability

An unstable item response was defined as the change from a ‘True’ to a ‘False’ response on retest or vice versa. Overall, Ss changed 16.8% of their responses or a mean of 26.85 (SD 10.68) of 160 responses to the content scale items. There was no significant difference in the total number of items changed by the group tested in the slide-key format versus the standard paper and pencil procedure.
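The computation behind these figures is simple. A minimal sketch of the instability count, using the True-to-False (or vice versa) definition above; the helper name is an illustrative assumption:

```python
def count_response_changes(test, retest):
    """Count unstable items: responses that flipped True<->False on retest.

    test, retest -- equal-length sequences of booleans (True = endorsed).
    Returns (number changed, percentage changed). For example, a S who
    changed 27 of 160 content-scale responses scores (27, 16.875).
    """
    if len(test) != len(retest):
        raise ValueError("both sessions must cover the same items")
    changed = sum(a != b for a, b in zip(test, retest))
    return changed, 100.0 * changed / len(test)
```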

Model parameters

The P-values and social desirability scale values (SDSV) were obtained from Helmes, Reed and Jackson (1977). Mean P-value was 0.51 (SD 0.05) and mean SDSV was 5.20 (SD 1.84), based on a nine-point rating scale. The two sets of item properties correlated r = 0.75. Parameters for the threshold model were calculated using the SDSVs of the PRF-E items. Thresholds and saliences both showed adequate test-retest stability (r = 0.75 and r = 0.85, respectively) and were only moderately intercorrelated for both test (r = 0.33) and retest (r = 0.29).

Mean response latency associated with individual item responses across the 40 Ss was 4.7 sec (SD 1.4 sec). For the 20 Ss who were tested twice, mean latency from the first (X = 4.9; SD = 1.4) to the second (X = 4.0; SD = 1.0) testing session dropped significantly [t(19) = 5.4, P < 0.0001], although the test-retest correlation for mean latency was high (r = 0.87).

Model comparison

The predictive accuracy of models was compared as follows. Given the total number of item responses an individual changed on retest, each model was required to predict which items the individual changed. For example, assume an individual changed 16 of 160 item responses on retest. The P-Value and Social Desirability Models would predict these are the 16 items with the most moderate P-values or SDSVs, respectively. The Threshold Model would predict that the 16 items nearest the person's threshold would be altered. The Response Latency Model would predict that the 16 items with the longest latencies for the person would be changed. Note that for any unselected subset of 16 items, 1.6 correct predictions are expected by chance. That is, given 16 out of 160 items were changed (16/160), any group of 16 items should by chance contain (16/160 x 16 =) 1.6 unstable items. Partial items were always rounded up to the next whole item, which in the current example equals two correct predictions.

Table 1. Mean number of correct predictions for each of the four models (N = 40)

Models                          Mean     SD
P-Value                         5.48     6.21
Social Desirability             6.00*    5.76
Response Latency                7.95**   6.31
Threshold                       5.78*    5.92
Chance                          5.05     5.57
Mean number of item changes     26.85    10.68

*Significantly different from chance at P < 0.05.
**Significantly different from chance at P < 0.001.
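The scoring procedure described above reduces to ranking items on a per-model criterion and counting hits. A minimal Python sketch (hypothetical helper names; the per-model criteria in the comments are illustrative):

```python
import math

def correct_predictions(scores, changed_items):
    """Credit a model with the overlap between the k items it flags as
    most likely to change (k = number the person actually changed) and
    the items actually changed on retest.

    scores -- per-item criterion, higher = predicted less stable, e.g.:
      P-Value / Social Desirability: -abs(p - 0.5) or -abs(sdsv - midpoint)
      Threshold:                     -abs(sdsv - personal_threshold)
      Response Latency:              the person's latency on the item
    """
    k = len(changed_items)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return len(set(ranked[:k]) & set(changed_items))

def chance_expectation(k, n):
    """Expected hits in an unselected k-item subset, (k/n) * k, rounded
    up to the next whole item as in the text."""
    return math.ceil(k * k / n)

print(chance_expectation(16, 160))  # -> 2, the worked example above
```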

The number of correct predictions made by each model is reported in Table 1. A series of t-tests indicated that the Social Desirability, Threshold and Response Latency Models all performed significantly better than chance while the P-Value Model did not [t(39) = 1.65, n.s.]. The predictive efficiency of the four models was compared using a repeated measures analysis of variance. A highly significant effect was obtained for models' ability to predict response changes [F(3,117) = 13.07, P < 0.001]. When models were compared using Dunn's Multiple Comparison Test, the Response Latency Model was found to predict changes significantly better than the P-Value, Social Desirability and Threshold Models, none of which differed significantly from one another.

DISCUSSION

The data support previous empirical findings that, over a 1 week test-retest interval, responses to personality test items tend to be quite stable. More importantly, the results indicated that it is possible to predict exactly which items individuals are most likely to change on retest. As hypothesized, the Social Desirability, Response Latency and Threshold Models were able to predict unstable items above chance level. Thus, instability of item responses does appear to be related to moderate social desirability scale values, long response latencies and proximity to a threshold. The P-Value Model did not predict above chance which items would be changed on retest. Perhaps the reduced range of PRF item P-values restricted the relationship of P-values to consistency. Alternatively, perhaps an actual consideration of the group endorsement probabilities associated with items is not a part of the process of responding consistently. The relationship which P-value has shown to stable responding may indeed be moderated by other item characteristics, such as content or desirability.

When the predictive accuracy of models was compared, the Response Latency Model was found to be significantly better than the other models. This finding supports the hypothesis that taking into account individual differences in the perception of an item’s properties improves prediction of response changes over reliance on item properties alone. The time an individual requires to respond to a single item partly reflects general item parameters, such as length, negativity, content saturation, etc. These parameters contribute to response latency presumably because they make the response decision more difficult. Within a neomentalistic framework, however, decision difficulty may also be a function of the ease of comparing the particular item to the self’s position on the underlying attribute.

The Threshold Model, which takes into account individual differences in the tendency to respond desirably, was expected to perform as well as the Response Latency Model and better than the Social Desirability and P-Value Models. The Threshold Model could readily predict above chance exactly which items individuals changed on retest but was only as successful as the two models based on item characteristics. The Threshold Model incorporated the assumption that individuals were responding to items in terms of social desirability. The degree to which other item properties (e.g. content) determine responses will compromise the efficacy of the Threshold Model. The Response Latency Model of course remains valid regardless of the item property determining the response. To evaluate the Threshold Model adequately, the single item property on which it is based must be specified correctly. In Study 2, the Threshold Model will be examined under conditions where the social desirability of items is made salient.

Study 2

METHOD

Subjects

Ss were 90 student volunteers (30 men, 60 women) at a major Ontario university recruited through summer school classes and through campus advertisements. For their participation, Ss received a computerized personality profile and a one dollar Canadian coin.

Materials

Items were again taken from Jackson’s Personality Research Form-Form E (PRF-E; 1984).

Procedure

Ss were individually tested in a single session on each of three tasks. (1) The Computer Task involved responding to 96 PRF items, comprising six PRF-E content scales (i.e. Abasement, Achievement, Affiliation, Aggression, Autonomy and Change) in a computerized format. Items were presented on a video terminal, and subjects responded using an independent control panel which had one telegraph key marked ‘T’ for ‘True’ and another marked ‘F’ for ‘False’. Response latencies and item responses were readily collected using this procedure. (2) The Rating Task required Ss to make judgements on a nine-point scale of the social desirability of a ‘True’ response to the same 96 PRF-E items. (3) The PRF-E Task simply involved completing the entire 352-item PRF-E using standard instructions and the standard paper and pencil format.

Ss were randomly assigned within sex to one of the following three task orders to examine the effect of the Rating Task on responding: Rating Task, Computer Task and PRF-E Task; Computer Task, Rating Task and PRF-E Task; and Computer Task, PRF-E Task and Rating Task. There were no order effects for the number of items changed [F(2,87) = 2.20, n.s.] or for scores on the PRF-E Desirability scale [F(2,87) = 0.65, n.s.]. Therefore, data were collapsed across groups. Univariate analyses comparing the efficacy of models for each of the three groups yielded no significant group effects.

RESULTS

Properties of the testing material

The psychometric properties of the PRF-E scales, including means, standard deviations and internal consistencies (see Fekken, 1983, for complete data) were comparable to those reported in the PRF manual (Jackson, 1984). Test-retest correlations for the six PRF-E scales which Ss completed twice tended to be very high, no doubt a function of the short retest interval.

Person reliability

As in Study 1, an unstable response was defined as the change from a ‘True’ response to a ‘False’ response or vice versa. The percentage of changed responses was 8.60 (i.e. a mean of 8.26 of 96 item responses were altered with a standard deviation of 2.71 responses).

Model parameters

PRF-E item characteristics published by Helmes et al. (1977) were again used in Study 2. P-values for the subset of 96 items ranged from 0.09 to 0.78, with a mean of 0.51 and a SD of 0.03. The range of social desirability scale values (designated ‘group SDSVs’) was 2.48 to 7.54, judgments having originally been made on a nine-point scale. The mean SDSV was 5.21, with a standard deviation of 1.48. The correlation of P-values and group SDSVs was r = 0.69, P < 0.001. Parameters for the threshold model could be calculated using either the group SDSVs described above or ‘individual SDSVs’, which were provided by every S on the 96 items. Thus, four estimates of threshold and of salience were calculated per person: one based on group SDSVs and one based on individual SDSVs, for each of the test and retest conditions.


Table 2. Intercorrelations among various threshold estimates and among various salience estimates (N = 90)

Threshold                       1      2      3      4
1. Test-Group SDSV             1.00
2. Test-Individual SDSV        0.61   1.00
3. Retest-Group SDSV           0.80   0.56   1.00
4. Retest-Individual SDSV      0.55   0.76   0.67   1.00

Salience                        1      2      3      4
1. Test-Group SDSV             1.00
2. Test-Individual SDSV        0.56   1.00
3. Retest-Group SDSV           0.93   0.60   1.00
4. Retest-Individual SDSV      0.56   0.96   0.64   1.00

Intercorrelations among the various threshold estimates and among the salience estimates are provided in Table 2. Nonsignificant correlations were obtained between threshold and salience regardless of whether estimates were based on group or individual SDSVs and across both test and retest. Salience parameters based on group and on individual SDSVs were significantly different [F(1,87) = 113.06, P < 0.001], with individual SDSVs showing a stronger relationship to responses than independent, group-judged SDSVs. The mean latency for responding to an item across all 90 Ss was 2.98 sec (SD 1.83). Men took a mean of 3.46 sec to respond to an item while women took a mean of 2.73 sec [t(89) = 2.09, P < 0.04]. There was no relationship between age and mean latency [r(88) = -0.10, n.s.].

Model comparison

The predictive accuracy of models was compared as before. Given the total number of item responses an individual changed on retest, each of six prediction models was required to predict exactly which items were changed. Models and the items to which responses are most likely to be altered are as follows: (1) P-Value Model, items with moderate P-values; (2) Social Desirability Model, items with moderate SDSVs; (3) Threshold Model, items closest to an individual’s threshold for responding desirably; (4) Individual Social Desirability Model, items with moderate SDSVs where SDSVs are judged by the individual himself or herself; (5) Individual Threshold Model, items closest to an individual’s threshold for responding desirably, calculated from his or her own ratings of item social desirability; and, (6) Response Latency Model, items with the longest latencies relative to the individual’s mean latency for responding.

With a mean of 8.26 of 96 items changed, by chance one might correctly predict 0.93 items (SD 1.28) in any unselected group of 8.26 items. Thus, the predictive utility of each method was first compared to chance using a series of t-tests. As seen in Table 3, only the Individual Social Desirability and the Response Latency Models performed significantly above chance. A one-way analysis of variance with repeated measures indicated that models differed significantly from one another.

Table 3. Means and standard deviations (in parentheses) for correct predictions made by each model for each experimental group

                                  Individual
                   Social         Social         Response
         P-Value   Desirability   Desirability   Latency    Threshold   Individual Threshold
Group 1  0.60      0.70           0.70           1.63       0.70        0.77
         (0.72)    (0.99)         (0.95)         (1.38)     (1.06)      (1.01)
Group 2  1.23      1.17           1.73           2.07       1.53        1.30
         (1.33)    (1.42)         (1.66)         (2.05)     (1.54)      (1.49)
Group 3  0.90      1.00           1.20           1.57       0.97        1.33
         (1.81)    (1.91)         (2.86)         (2.11)     (1.97)      (1.60)
Total    0.91      0.96           1.21*          1.76**     1.07        1.13
         (1.37)    (1.48)         (2.08)         (1.87)     (1.59)      (1.40)

*Significantly different from chance at P < 0.02.
**Significantly different from chance at P < 0.001.

Dunn’s Multiple Comparison Test indicated that the Response Latency Model outperformed all other models, which did not differ significantly from one another.

In a supplementary analysis, response latencies for each item were converted to standard scores around the item’s respective group mean. Elimination of mean differences among items rules out the possible confound that all individuals changed the same items. When the Response Latency Model was based on standardized response latencies, the mean number of correct predictions for changed items was virtually identical to that obtained for the original Response Latency Model (X = 1.71, SD = 1.51 vs X = 1.76, SD = 1.87). Thus, there appears to be no support for the possibility that a single item characteristic, such as length, accounted for the predictive utility of response latencies.
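A minimal sketch of that standardization, assuming a complete subjects-by-items latency matrix (the function name is an illustrative assumption):

```python
import numpy as np

def standardize_latencies(latency_matrix):
    """z-score each item's latencies across subjects (column-wise).

    Removing the item's group mean and spread strips out general item
    difficulty (length, wording, etc.), leaving only a person's
    *relative* slowness on that item -- the quantity the supplementary
    analysis above fed to the Response Latency Model. Assumes every
    item shows some latency variance across subjects.
    """
    lat = np.asarray(latency_matrix, dtype=float)
    return (lat - lat.mean(axis=0)) / lat.std(axis=0)
```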

DISCUSSION

The present results support previous findings on response stability, in that Ss changed only approx. 8% of their responses. No doubt the short test-retest interval contributed to the strong consistency.

The purpose of this study was to evaluate models for predicting item response changes when the social desirability of items was made salient using individuals’ own ratings of item desirability. Results supported the hypothesis that responding was more strongly determined by individuals’ perceptions of social desirability than by an independent, group consensus of the desirability of a true response. Indeed, one of only two models able to predict significantly above chance which items persons would change on retest was based on individual SDSVs. This finding held despite the lesser reliability necessarily associated with single versus aggregated judgments of an item characteristic. Surprisingly, neither the threshold model based on individual SDSVs nor that based on group SDSVs improved predictive efficacy over models based on the item characteristic, SDSV, alone. What are the potential explanations? The various threshold parameters showed adequate convergence; therefore, unreliable estimation of the threshold parameter is an unlikely explanation. Perhaps all Ss’ thresholds were so moderate that the items predicted by the Threshold Model to be unstable were essentially the same items selected by the Social Desirability Model. This does not appear to be the case: data from Study 1 indicated that the proportion of overlap among the four models tended to be less than 0.20. Another possibility is that social desirability was not a significant determinant of responding to the PRF-E items, which were chosen for their relative neutrality with regards to desirability. But the threshold parameter of Jackson’s (1968, 1982) model provides the most sensitive discrimination among individuals when items are clustered in the moderate range of a characteristic, such as desirability. Nonetheless, the Threshold Model may need to be re-evaluated using another item characteristic, especially content saturation. Even so, it must still be assumed that individuals respond to items in terms of that single item property. To the extent that response strategies vary across items, the predictive accuracy of the Threshold Model will be compromised.

The exciting finding in this study was again with regards to response latency. The Response Latency Model was best able to predict exactly which items individuals were most likely to change on retest. Response latency was assumed to be a function of the difficulty of the response decision for the individual. Decision difficulty may be associated with particular item characteristics, some of which are relevant response determinants (e.g. content), others irrelevant determinants (e.g. negative wording; Holden et al., 1985). Further, decision difficulty may be related to different item properties across items. The decision difficulty engendered by item properties across individuals will be evidenced in long latencies for a certain subgroup of items; that is, long latencies and instability would always be associated with the same group of items for all individuals. However, the present results demonstrate that even when items are standardized to remove a general difficulty component, long latencies still predict which items an individual will change. The key component of latencies appears to be individual decision difficulty which, in turn, may be expected to contribute to item instability.

Various conceptualizations may underlie individual decision difficulty. Response latencies may be a function of the difficulty of a comparison of the item to the self on the basis of their relative positions on some underlying attribute (Ebbesen and Allen, 1979; Kuiper, 1981; Rogers, 1974).


This comparison may involve a relatively long time because distance discrimination is difficult. Alternatively, the self, to which the item is being compared, may be poorly crystallized. Again, others have suggested that the storage of social information in memory may contribute to differential response times. To illustrate, Ebbesen (1980) proposed that personality information may be stored as a global code, made up of trait terms, or a specific code, made up of behavioral details. Comparison to the global code could result in faster decisions than comparison to the specific code, which entails a fine memory search. On the other hand, the necessity of coding specific behavioral details contained in items into a global code might contribute to slower response speeds. Future research might seek to delineate how such cognitive processes are related to consistent responding to personality items. Particularly appropriate would be delineation of components of response latencies analogous to that undertaken in the cognitive domain (Estes, 1982; Hunt, 1978; Sternberg, 1977).

There are practical applications to the Response Latency Model’s success as well. Item selection for structured tests could be improved: items with short latencies could be selected to help maximize test-retest stability of scales or to develop highly reliable scales which are considerably shorter (e.g. eight items) than the scales commonly used. In the age of the microcomputer, an individual’s response latency information could be collected for evaluation of the meaningfulness of interpreting single items and for construction of tailored tests. For example, when an individual exhibits a relatively long latency to a critical item, attaching special significance to the response in the absence of additional information may be contraindicated. In addition, long latencies may provide a more general index of domain articulation. If an individual were to show long response times for one domain (e.g. anxiety) but not for other domains, additional evaluation, perhaps via a tailored testing procedure, may prove valuable.
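As an illustration of the item-selection application (a hypothetical function; the 8-item target simply echoes the figure in the text), short-latency items could be screened as follows:

```python
def select_fast_items(mean_latencies, n_items=8):
    """Return indices of the n_items with the shortest mean response
    latencies -- candidates for a short, highly retest-stable scale,
    per the suggestion above. A sketch: real item selection would also
    weigh content coverage and validity, not latency alone.
    """
    return sorted(range(len(mean_latencies)),
                  key=mean_latencies.__getitem__)[:n_items]
```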

CONCLUSIONS

1. An individual’s psychological test item responses tend to be stable, at least over short time intervals.

2. The specific test items to which an individual will change his or her responses on retest may be predicted.

3. Relatively long item response latencies constitute significantly better predictors of which items an individual will change on retest than other item properties, including moderate endorsement frequencies, moderate social desirability scale values, or proximity to the individual’s threshold for responding desirably.

Acknowledgement-This work was supported in part by Social Sciences and Humanities Research Council of Canada, Grants 410-80-0576 and 410-85-0156.

REFERENCES

Bentler P. M. (1964) Response variability: fact or artifact? Unpublished doctoral dissertation, Stanford University, Stanford, Calif.
Cliff N. (1977) Further study of cognitive processing models for inventory response. Appl. psychol. Meas. 1, 41-49.
DeBoeck P. (1981) Individual differences in the validity of a cognitive processing model for responses to personality inventories. Appl. psychol. Meas. 5, 481-492.
Dunn T. G., Lushene R. E. and O'Neil H. F. Jr (1972) Complete automation of the MMPI and a study of its response latencies. J. consult. clin. Psychol. 39, 381-387.
Ebbesen E. B. (1980) Cognitive processes in understanding ongoing behavior. In Person Memory: The Cognitive Basis of Social Perception (Edited by Hastie R., Ostrom T. M., Ebbesen E. B., Wyer R. S., Hamilton D. L. and Carlston D. E.). Erlbaum, Hillsdale, N.J.
Ebbesen E. B. and Allen R. B. (1979) Cognitive processes in implicit personality trait inferences. J. Person. soc. Psychol. 37, 471-488.
Estes W. K. (1982) Learning, memory, and intelligence. In Handbook of Human Intelligence (Edited by Sternberg R. J.). Cambridge University Press.
Fekken G. C. (1983) Comparison of four models for predicting person reliability. Unpublished doctoral dissertation, The University of Western Ontario, London, Canada.
Frank B. (1936) Stability of questionnaire responses. J. abnorm. soc. Psychol. 30, 320-324.
French J. W. (1962) Extended Range Vocabulary Test. Educational Testing Service, Princeton, N.J.
Glaser R. (1949) A methodological analysis of the inconsistency of response to test items. Educ. psychol. Meas. 9, 727-739.
Glaser R. (1952) The reliability of inconsistency. Educ. psychol. Meas. 12, 60-64.
Goldberg L. R. (1963) A model of item ambiguity in personality assessment. Educ. psychol. Meas. 23, 467-492.
Hanley C. (1962) The "difficulty" of a personality inventory item. Educ. psychol. Meas. 22, 577-584.
Helmes E., Reed P. L. and Jackson D. N. (1977) Desirability and frequency scale values and endorsement properties for items of Personality Research Form-E. Psychol. Rep. 41, 435-444.
Holden R. R., Fekken G. C. and Jackson D. N. (1985) Structured personality test item characteristics and validity. J. Res. Person. 19, 386-394.
Hunt E. B. (1978) Mechanics of verbal ability. Psychol. Rev. 85, 109-130.
Jackson D. N. (September 1968) A threshold model for stylistic responding. Paper presented at the American Psychological Association, San Francisco, Calif.
Jackson D. N. (June 1982) Threshold model for personality assessment. Paper presented at the International Conference on Personality Measurement, Bielefeld, F.R.G.
Jackson D. N. (1984) Personality Research Form Manual, 3rd edn. Research Psychologists Press, Port Huron, Mich.
Kuiper N. A. (1981) Convergent evidence for the self as a prototype: the "inverted-U RT effect" for self and other judgements. Person. soc. Psychol. Bull. 7, 438-443.
Kuncel R. B. (1973) Response processes and relative location of subject and item. Educ. psychol. Meas. 33, 545-563.
Kuncel R. B. (1977) The subject-item interaction in itemmetric research. Educ. psychol. Meas. 37, 665-678.
Kuncel R. B. and Fiske D. W. (1974) Stability of response process and response. Educ. psychol. Meas. 34, 743-755.
Lentz T. F. (1934) Reliability of the opinionaire technique studied by the retest method. J. soc. Psychol. 5, 338-364.
Markus H. (1977) Self-schemata and processing information about the self. J. Person. soc. Psychol. 35, 63-78.
Markus H. and Smith J. (1981) The influence of self-schemata on the perception of others. In Personality, Cognition, and Social Interaction (Edited by Cantor N. and Kihlstrom J. F.). Erlbaum, Hillsdale, N.J.
Neprash J. A. (1936) The reliability of questions in the Thurstone Personality Inventory. J. soc. Psychol. 7, 239-244.
Payne F. D. (1974) Relationships between response stability and item endorsement, social desirability, and ambiguity. Multivar. Behav. Res. 9, 127-148.
Rasch G. (1960) Probabilistic Models for Some Intelligence and Attainment Tests. University of Chicago Press, Chicago, Ill.
Rogers T. B. (1973) Toward a definition of the difficulty of a personality item. Psychol. Rep. 33, 159-166.
Rogers T. B. (1974) An analysis of two central stages underlying responding to personality items: the self-referent decision and response selection. J. Res. Person. 8, 128-138.
Rogers T. B., Kuiper N. A. and Rogers P. J. (1979) Symbolic distance and congruity effects for paired-comparisons judgements of degree of self-reference. J. Res. Person. 13, 433-449.
Sternberg R. J. (1977) Intelligence, Information Processing, and Analogical Reasoning: The Componential Analysis of Human Abilities. Erlbaum, Hillsdale, N.J.