
Items by Design: The Impact of Systematic Feature Variation on Item Statistical Characteristics

Mary K. Enright Mary Morley

Kathleen M. Sheehan

GRE Board Report No. 95-15R

September 1999

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541


********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

********************

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

The modernized ETS logo is a trademark of Educational Testing Service.

Educational Testing Service Princeton, New Jersey 08541

Copyright © 1999 by Educational Testing Service. All rights reserved.


Acknowledgments

We wish to recognize the contribution of the many test development staff members whose advice and cooperation were essential to this project. Special thanks to Jackie Tchomi, Judy Smith, and Jutta Levin. We also appreciate Bob Mislevy’s advice about how to estimate the usefulness of collateral information. Finally, we are grateful to the Graduate Record Examinations Board for supporting this research.


Abstract

This study investigated the impact of systematic item feature variation on item statistical characteristics and the degree to which such information could be used as collateral information to supplement examinee performance data and reduce pretest sample size. Two families of word problem variants for the quantitative section of the Graduate Record Examination (GRE®) General Test were generated by systematically manipulating item features. For rate problems, the item design features affected item difficulty (Adj. R2 = .90), item discrimination (Adj. R2 = .50), and guessing (Adj. R2 = .41). For probability problems, the item design features affected difficulty (Adj. R2 = .61), but not discrimination or guessing. The results demonstrate the enormous potential of systematically creating item variants. However, questions of how best to manage variants in item pools and to implement statistical procedures that use collateral information must still be resolved.

KEY WORDS: Quantitative Reasoning, Graduate Record Examinations, Faceted Item Development, Algebra Word Problems, Item Statistical Characteristics, Assessment of Quantitative Skills


Table of Contents

Introduction
    Research on Word Problems
Method
    Design of Word Problems
    Item Pretesting
    Item Analysis
    Data Analysis
Results
    Summary of Item Statistics
    Impact of Item Design Features on Item Operating Characteristics
    Implications for Reductions in Pretest Sample Sizes
Discussion
    Summary
    Understanding Item Difficulty and Construct Representation
    Implications for Creating Item Variants
    Implications for Reducing Pretest Sample Size
    Concluding Comments
References


List of Tables

TABLE 1. Examples of Rate Items
TABLE 2. Examples of Probability Items
TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items
TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features
TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features
TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants
TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information

List of Figures

FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems.
FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems.
FIGURE 3. Estimated regression tree for the guessing parameter for rate problems.
FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems.
FIGURE 5. The effect of increasing sample sizes with and without collateral information.


Introduction

Because of the continuous nature of computer adaptive testing, the danger of item exposure will increase unless item pools are large enough or can be changed frequently enough to reduce the probability of examinees revealing items that a large number of subsequent examinees may receive. Thus continuous computer adaptive testing has created a demand for more items and greater efficiencies in item development. Improvement in the efficiency of item development will result if methods for generating items systematically are developed or if pretest sample size requirements can be reduced. A particularly critical bottleneck in the item development process at present is the need for item pretesting. The number of items that can be pretested is constrained by the number of examinees on whom the item must be tested in order to obtain reliable estimates of item operating characteristics. Recently, however, methods have been developed that permit the use of collateral information about item features to supplement examinee performance data, so that smaller pretest samples can be used to obtain reliable estimates of item operating characteristics (Mislevy, Sheehan, & Wingersky, 1993). The purpose of this study was to determine if designing systematic variants of quantitative word problems would result in more efficient item development, thus permitting item operating characteristics to be reliably estimated using smaller pretest samples.

Creating variants of existing items as a way of developing more items is not a novel idea and probably has been done informally by item writers for as long as standardized tests have been in existence. While the unsystematic creation of variants contributes to the efficiency of the item development process, there are some dangers associated with this practice, such as overlap among items or inadvertently narrowing the construct being measured.

The ideal alternative would be to create item variants systematically by using a framework that distinguishes construct-relevant and construct-irrelevant sources of item statistical characteristics, as well as incidental item features that are neutral with respect to item statistical characteristics and the underlying construct. Thus item variants with different statistical parameters could be created by manipulating construct-relevant features, and item variants with similar statistical parameters could be created by manipulating incidental features. With this method, overlap among items could be better controlled. Unfortunately, the constructs tapped by most existing tests are not articulated in enough detail to allow the development of construct-driven item design frameworks.

A third approach to generating item variants is to use item design frameworks as a hypothesis-testing tool to assess the impact of different item features on item statistical characteristics. This is the systematic approach that was taken in the present study. Frameworks for creating item variants were developed based on prior correlational analyses of item features that affect problem difficulty and on the hypotheses of experienced item writers. Thus the item development and research processes were integrated so that the degree to which different item features impact item statistical characteristics could be determined and the constructs underlying the creation of item variants could be more clearly articulated.

Research on Word Problems

A body of research about problem features that affect problem difficulty already exists for arithmetic and algebra word problems. This research can serve as a basis for creating systematic item variants and estimating problem difficulty. The relevant research was stimulated by Mayer (1981), who analyzed algebra word problems from secondary school algebra texts. Mayer found that these problems could be classified into eight families based on the problems’ “story line” and source formulas (such as “distance = rate x time” or


“dividend = interest rate x principal”). However, similar story lines may reflect very different quantitative structures (Mayer, 1982). In order to capture this relevant quantitative structure separately from the specific problem content, a number of network notations have been developed (Hall, Kibler, Wenger, & Truxaw, 1989; Reed, 1987; Reed, Dempster, & Ettinger, 1985; Shalin & Bee, 1985).

For example, Shalin and Bee (1985) analyzed the quantitative structure of word problems in terms of elements, relations, and structures. Many word problems consist of one or more triads of elements combined in additive or multiplicative relationships. One of the relationships Shalin and Bee described--a multiplicative relationship among a rate and two quantities--is typical of many arithmetic and algebra word problems, such as those involving travel, interest, cost, and work. For complex problems that involve more than one triad, problem structure describes the way that these triads are linked. Shalin and Bee found that many two-step arithmetic word problems could be classified as exemplars of one of a number of structures (such as hierarchy, shared-whole, and shared-part), and that these problem structures had an effect on problem difficulty. This idea can be extended to other word problems, and the kind of superordinate constraint that allows the subparts of a problem to be composed can be used as one feature in classifying problems (Hall et al., 1989; Sebrechts, Enright, Bennett, & Martin, 1996). For example, round trip problems (the distance on one part of the trip equals the distance on the second part of the trip) exemplify a class of problems in which the superordinate constraint can be described as Distance 1 = Distance 2. Another type of problem involving parts of a trip in the same direction but at different rates might have a superordinate constraint such that Distance 1 + Distance 2 = Total Distance.

Problem features such as those described above can be related theoretically to individual differences in cognition. For example, because of limitations on working memory capacity, the more elements and relationships there are, the more difficult a problem is likely to be. However, knowledge about basic, complementary mathematical relationships among elements (such as “distance, rate, and time” or “dividends, interest, and principal”) should help individuals to group or chunk subparts of a problem. Integrating these chunks into a larger structure requires recognition of the superordinate constraints that are operating in the problem situation. Thus we assume, as pointed out by Embretson (1983), that the “stimulus characteristics of the test items determine the components that are involved in its solution” (p. 181).

In a study of 20 word problems that had appeared on the quantitative section of the Graduate Record Examination (GRE®) General Test, Sebrechts et al. (1996) found that three problem features--the need to apply algebraic concepts (manipulate variables), problem complexity, and content--accounted for 37% to 62% of the variance in two independent estimates of problem difficulty. In addition to this correlational study of a small set of problems, other studies also demonstrate that similar item features are useful in designing word problems (Lane, 1991), in providing substantive understanding of changes in student performance with training (Embretson, 1995), and in accounting for problem difficulty.

To date, researchers have focused on identifying sources of item difficulty because this information is useful for explicating the constructs represented on a test and for developing proficiency descriptors (Embretson, 1983, 1995; Sheehan, 1997). However, information about the problem features that affect item discrimination and guessing parameters as well as item difficulty parameters is also valuable at present because recent advances in measurement theory support the use of collateral information about item features to estimate item operating characteristics using smaller examinee samples (Mislevy et al., 1993; Mislevy, Wingersky, & Sheehan, 1994). Such estimation procedures can reduce the cost of item development.

For word problems on many standardized tests, the kinds of item features described above are varied unsystematically and on an ad hoc basis, and so it is difficult to estimate precisely how much any particular feature contributes to item statistical characteristics. In this study, we developed and pretested items that


varied systematically on some of these features so that we could better estimate the degree to which different manipulations affected item statistical characteristics. The questions we wished to answer were as follows:

1. Were the systematically designed items of an acceptable quality?

2. What impact did the item design features have on item statistical characteristics?

3. How useful would the item design information be for reducing pretest sample sizes?

Method

Design of Word Problems

For the purposes of this study, two families of 48 related word problems were created. For each family, a design matrix specified three item features that were crossed with each other to create eight classes of variants. Six problem variants were written for each class. All items were presented in a five-option multiple-choice format.

Family 1: Rate Problems, Equal Outputs. For the first family of problems, three item features--complexity, context, and using a variable--were selected for manipulation based on the findings of Sebrechts et al. (1996). Some examples of problems typical of this family are provided in Table 1. The basic structure of these problems can be described in terms of three constraints, which can be combined into a simple linear system, as follows:

Rate1 x Unit A1 = Unit B1

Rate2 x Unit A2 = Unit B2

Unit B1 = Unit B2

To increase problem complexity, an additional constraint, or step, was added to half of the problems:

Unit A1 + Unit A2 = Total Unit A.

Thus the less complex problems were composed of three constraints, and the more complex consisted of four constraints. The goal of the less complex problems was to find Unit A2 given Unit A1, Rate1, and Rate2; the goal of the more complex problems was to find Unit A2 and Rate2 given Unit A1, Total Unit A, and Rate1. The narrative context of these problems involved either cost or distance. Finally, to manipulate the algebraic content, one of the elements of the problem was changed from a quantity to a variable: “John bought 6 cans of soda” became “John bought x cans of soda.” This latter manipulation led to a solution that was an algebraic expression rather than a derived quantity.
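To make the constraint structure concrete, the short sketch below (ours, not part of the original report) solves the Level 2 round-trip example from Table 1 by combining the four constraints; the variable names simply mirror the Rate, Unit A, and Unit B notation above.

```python
# A minimal sketch (not from the report) of the four-constraint Level 2 structure,
# applied to the round-trip example in Table 1: total time 15 hours, first leg
# 9 hours at 40 miles per hour; find the return speed and return time.
from sympy import Eq, solve, symbols

rate2, unit_a2 = symbols("rate2 unit_a2", positive=True)

rate1, unit_a1, total_unit_a = 40, 9, 15      # mph, hours, hours
unit_b1 = rate1 * unit_a1                     # Rate1 x Unit A1 = Unit B1 (outbound distance)

constraints = [
    Eq(rate2 * unit_a2, unit_b1),             # Rate2 x Unit A2 = Unit B2, with Unit B1 = Unit B2
    Eq(unit_a1 + unit_a2, total_unit_a),      # the added fourth constraint (Level 2 only)
]
print(solve(constraints, [rate2, unit_a2]))   # return speed 60 mph, return time 6 hours
```

Dropping the fourth constraint (and giving Unit A1 directly) reproduces the simpler three-constraint, Level 1 structure.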


TABLE 1. Examples of Rate Items

Use Variable = No

Cost, Complexity Level 1: Soda that usually costs $6.00 per case is on sale for $4.00 per case. How many cases can Jack buy on sale for the price he usually pays for 6 cases?

DRT, Complexity Level 1: Under normal circumstances, a train travels from City X to City Y in 6 hours at an average speed of 60 miles per hour. When the tracks were being repaired, this train traveled on the same tracks at an average speed of 40 miles per hour. How long did the trip take when the tracks were being repaired?

Cost, Complexity Level 2: As a promotion, a store sold 90 cases of soda of the 150 cases they had in stock at $4.00 per case. To make a profit, the store needs to bring in the same total amount of money when they sell the remaining cases of soda. At what price must the store sell the remaining cases?

DRT, Complexity Level 2: A round trip by train from City X to City Y took 15 hours. The first half of the trip took 9 hours and the train traveled at an average speed of 40 miles per hour. What was the train’s average speed on the return trip?

Use Variable = Yes

Cost, Complexity Level 1: Soda that usually costs $6.00 per case is on sale for $4.00 per case. How many cases can Jack buy on sale for the price he usually pays for x cases?

DRT, Complexity Level 1: Under normal circumstances, a train travels from City X to City Y in t hours at an average speed of 60 miles per hour. When the tracks were being repaired, this train traveled on the same tracks at an average speed of 40 miles per hour. How long did the trip take when the tracks were being repaired?

Cost, Complexity Level 2: As a promotion, a store sold 90 cases of soda of the x cases they had in stock at $4.00 per case. To make a profit, the store needs to bring in the same total amount of money when they sell the remaining cases of soda. At what price must the store sell the remaining cases?

DRT, Complexity Level 2: A round trip by train from City X to City Y took 15 hours. The first half of the trip took t hours and the train traveled at an average speed of 40 miles per hour. What was the train’s average speed on the return trip?

Note. These example items were not used in this study.


Family 2: Probability Problems. The second family of items was made up of variants of probability problems. Examples of problems typical of this family are provided in Table 2. These problems had three components--determining the number of elements in a set, determining the number of elements in a subset, and calculating the proportion of the whole set that was included in the subset. Given a lack of prior research on these types of problems, hypotheses about item features that might affect item difficulty were more speculative and were based on the expert knowledge of item writers.

First, we varied the complexity of counting the elements in the subset. The set always consisted of the integers within a given range. The difficulty of the subset counting tasks was varied as follows:

Complexity Level 1

Numbers in a smaller range

Numbers ending with a certain digit

Numbers with 3 digits the same

Complexity Level 2

Numbers beginning with certain digits and ending with certain digits

Numbers beginning with certain digits and ending with odd digits

Numbers with 2 or 3 digits equal to 1

Second, we speculated that items cast as probability problems would be more difficult than those cast as percent problems. And third, we varied the cover story so that some problems involved a real-life context (phone extensions, room numbers) and others simply referred to sets of integers. Although this latter feature (real versus pure) is a specification that is used to assemble test forms, we did not have a clear sense of how it might affect difficulty for these kinds of problems.


TABLE 2. Examples of Probability Items

Complexity Level 1

Percent, Real: Parking stickers for employees’ cars at a certain company are numbered consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the sales department. What percent of the parking stickers are assigned to the sales department?

Percent, Pure: What percent of the integers between 100 and 999, inclusive, are between 200 and 399, inclusive?

Probability, Real: Parking stickers for employees’ cars at a certain company are numbered consecutively from 100 to 999. Stickers from 200 to 399 are assigned to the sales department. If a parking sticker is chosen at random, what is the probability that it will belong to the sales department?

Probability, Pure: If an integer is chosen at random from the integers between 100 and 999, inclusive, what is the probability that the chosen integer will be between 200 and 399, inclusive?

Complexity Level 2

Percent, Real: Parking stickers for employees’ cars at a certain company are numbered consecutively from 100 to 999. Stickers that begin with the digits 2 or 3 are assigned to the sales department. Stickers that end with the digits 8 or 9 belong to managers. What percent of the parking stickers are assigned to managers in the sales department?

Percent, Pure: What percent of the integers between 100 and 999, inclusive, begin with the digits 2 or 3 and end with the digits 8 or 9?

Probability, Real: Parking stickers for employees’ cars at a certain company are numbered consecutively from 100 to 999. Stickers that begin with the digits 2 or 3 are assigned to the sales department. Stickers that end with the digits 8 or 9 belong to managers. If a parking sticker is chosen at random, what is the probability that it will belong to a manager in the sales department?

Probability, Pure: If an integer is chosen at random from between 100 and 999, inclusive, what is the probability that the chosen integer will begin with the digits 2 or 3 and end with the digits 8 or 9?

Note. These example items were not used in this study.
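As a quick concreteness check (ours, not part of the report), the brute-force computation below verifies the Level 2 counting subtask illustrated in Table 2: among the 900 integers from 100 to 999, those beginning with 2 or 3 and ending with 8 or 9.

```python
# Illustrative only: brute-force the Level 2 counting subtask from Table 2.
# Of the integers 100-999, count those that begin with 2 or 3 and end with 8 or 9.
integers = range(100, 1000)
subset = [n for n in integers if str(n)[0] in "23" and str(n)[-1] in "89"]

print(len(subset), len(integers))                     # 40 of 900
print(f"{100 * len(subset) / len(integers):.2f}%")    # about 4.44 percent
print(len(subset) / len(integers))                    # probability of roughly 0.044
```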


Item Pretesting

Items from the variant families were included in 24 quantitative pretest sections of the GRE General Test. Paper-and-pencil test forms were administered to random samples of 1,000 or more examinees in October and December 1996. Four experimental items with minimal overlap of item features were included in each pretest section, so that each pretest section included one each of the following types of problems: cost, DRT (distance = rate x time), percent, and probability. Within a pretest section, items were positioned in accord with test assembly conventions, which included placing problem-solving items in positions 16 through 30 and roughly ordering them according to expected difficulty. Finally, an experienced test developer was asked to estimate the difficulty of the experimental items on a scale of 1 to 5.

Item Analysis

Item statistics that were generated as a part of the pretest process and entered into a database include the following:

1. Equated delta (E-Delta)--an inverse translation of proportion correct into a scale with a mean of 13 and a standard deviation of 4 (based on the curve for a normal distribution and equated over tests and samples); a rough sketch of this transformation follows the list.

2. R-biserial (Rbis)--the correlation between examinees’ scores on an individual item and their total scores on the operational quantitative measure.

3. DDIF-m/f--a measure of differential difficulty of items for different groups of examinees (in this case, males and females) after controlling for overall performance on a measure (based on Holland and Thayer’s, 1988, adaptation of the Mantel-Haenszel statistic).
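The sketch below (ours) only illustrates the inverse-normal mapping described in item 1; the operational equating step across tests and samples is omitted, so treat it as an approximation rather than the E-Delta procedure itself.

```python
# Illustrative only: map proportion correct through an inverse normal
# transformation onto a scale with mean 13 and SD 4 (harder items get larger
# values). The operational equating over tests and samples is omitted.
from statistics import NormalDist

def approximate_delta(p_correct: float) -> float:
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p_correct)

print(round(approximate_delta(0.50), 1))  # 13.0: an item of middling difficulty
print(round(approximate_delta(0.84), 1))  # about 9.0: an easier item
```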

In addition, item response theory (IRT) parameters were estimated for each item using BILOG (Mislevy & Bock, 1982). In the specific IRT model assumed to be underlying performance on GRE items, the probability that an examinee with ability θi will respond correctly to an item with parameters (aj, bj, cj) is modeled as follows:

P(x_{ij} = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + \frac{1 - c_j}{1 + e^{-1.7 a_j (\theta_i - b_j)}}

In this particular model, the item parameters are interpreted as characterizations of the item’s discrimination (aj), difficulty (bj), and susceptibility to correct response through guessing (cj). Because parameter estimates for some of the experimental items were not included in the test development database, item parameter estimates were also obtained from a second IRT calibration which included the 96 experimental items and a sample of 120 nonexperimental items.
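As a small illustration (ours, not part of the report), the function below evaluates this three-parameter logistic model; the example parameter values are taken from the first row of Table 4 and are meant only to show how the three parameters interact.

```python
# A minimal sketch of the 3PL item response function quoted above, with the 1.7
# scaling constant: probability of a correct response given ability theta and
# item parameters a (discrimination), b (difficulty), and c (guessing).
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# An average examinee (theta = 0) on a hard rate item (a = .98, b = 1.49, c = .24,
# from Table 4) has roughly a 0.30 chance of answering correctly.
print(round(p_correct(0.0, 0.98, 1.49, 0.24), 2))
```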

Data Analysis

To determine whether the items that were systematically designed for this study were of acceptable quality, we compared the item statistics and the attrition rate for the experimental and nonexperimental items, and assessed the impact of the item design features on gender-related differential item difficulty. To assess the impact, if any, that the item design features had on item operating characteristics, the relationship between the item design features and resulting item operating characteristics was analyzed using a combination of tree-based regression and classical least squares regression. Finally, the usefulness of the collateral information about the item features for reducing pretest sample size was examined.

Tree-based Regression. The impact of different item feature manipulations on resulting item parameter estimates was investigated using a tree-based regression technique. Like classical regression models, tree-based regression models provide a rule for estimating the value of a response variable (y) from a set of classification or predictor variables (x). In the particular application described here, y is an (n x 1) vector of item parameter estimates, and x is an (n x k) matrix of item feature classifications. As in the classical regression setting, tree-based prediction rules provide the expected value of the response for clusters of observations having similar values of the predictor variables. Clusters are formed by successively splitting the data into increasingly homogeneous subsets, called nodes, on the basis of the feature classification variables. A locally optimal sequence of splits is selected by using a recursive partitioning algorithm to evaluate all possible splits of all possible predictor variables at each stage of the analysis (Breiman, Friedman, Olshen, & Stone, 1984). Potential splits are evaluated in terms of deviance, a statistical measure of the dissimilarity in the response variable among the observations belonging to a single node. At each stage of splitting, the original subset of observations is referred to as the parent node and the two outcome subsets are referred to as the left and right child nodes. The best split is the one that produces the largest decrease between the deviance of the parent node and the sum of the deviances in the two child nodes. The deviance of the parent node is calculated as the sum of the deviances of all of its members,

D(y, \bar{y}) = \sum_i (y_i - \bar{y})^2

where \bar{y} is the mean value of the response calculated from all of the observations in the node. The deviance of a potential split is calculated as

D_{split} = \sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2

where \bar{y}_L is the mean value of the response in the left child node and \bar{y}_R is the mean value of the response in the right child node. The split that maximizes the change in deviance

\Delta D = D(y, \bar{y}) - D_{split}

is the split chosen at any given node. After each split is defined, the mean value of the response within each child node is taken as the predicted value of the response for each of the items in each of the nodes. The more homogeneous the node, the more accurate the prediction.

The node definitions developed for the current study characterize the impact of specific item feature manipulations on resulting item parameter estimates. This characterization was corroborated by implementing the following two-step procedure: First, the estimated tree model was reexpressed as a linear combination of binary-coded dummy variables; second, the dummy variable model was subjected to a classical least squares regression analysis. The significance probabilities resulting from this procedure indicate whether, in a classical least squares regression analysis, any of the effects included in the estimated tree model would have been deemed “not significant” and any of the effects omitted from the estimated tree model would have been deemed “significant.” When the results obtained in the classical least squares regression analysis replicate those obtained in the tree-based analysis, confidence regarding the validity of the resulting conclusions is enhanced.
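The sketch below (ours, not the authors’ implementation) shows one step of this recursive partitioning: for each candidate feature and level, compute the drop in deviance from the parent node to the two child nodes and keep the split with the largest decrease. The toy item pool and feature names are hypothetical.

```python
# A minimal sketch of one deviance-based split, as described above.
def deviance(values):
    """Sum of squared deviations of the responses in a node about the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(items, features, response):
    """Return (deviance decrease, feature, level) for the best single binary split."""
    parent = deviance([item[response] for item in items])
    best = None
    for feature in features:
        for level in {item[feature] for item in items}:
            left = [item[response] for item in items if item[feature] == level]
            right = [item[response] for item in items if item[feature] != level]
            drop = parent - (deviance(left) + deviance(right))
            if best is None or drop > best[0]:
                best = (drop, feature, level)
    return best

# Hypothetical mini pool: a "use_var" feature and IRT difficulty estimates "b".
pool = [
    {"use_var": "No", "b": -2.5}, {"use_var": "No", "b": -1.0},
    {"use_var": "Yes", "b": 0.9}, {"use_var": "Yes", "b": 1.4},
]
print(best_split(pool, ["use_var"], "b"))  # splitting on use_var removes most deviance
```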


Estimating the usefulness of collateral information. From a Bayesian statistical perspective, the precision of a given item parameter estimate (say, item difficulty) is determined from the amount of information available from two different sources: examinee response vectors and collateral information about item features. The parameter estimates considered in the current study characterize the precision levels achievable under two different scenarios: one in which all of the available information about item operating characteristics is derived from an analysis of approximately 1,000 examinee response vectors, and another in which all of the available information about item operating characteristics is derived from the estimated item feature model. The former scenario is represented by the item parameter estimates obtained from the BILOG calibration, while the latter is represented by the item parameter estimates obtained from the estimated regression models.

The usefulness of the item feature information, as captured in the estimated regression models, can be determined by comparing the precision of the difficulty estimates obtained from the BILOG calibration to the precision of the corresponding estimates obtained from the estimated regression model. Precision is defined as the inverse of the variance of the distribution representing knowledge about an estimated parameter value. For the BILOG difficulty estimates considered in this study, precision is calculated as the inverse of the squared standard error obtained from a calibration with noninformative prior distributions. For the regression estimates considered in this study, precision is calculated as the inverse of the variance estimated for sets of items predicted to have the same level of item difficulty (b).

Because precision is additive in both pretest examinee sample size and in collateral information, the BILOG precision estimates can be divided by the sample size to yield an estimate of the contribution per examinee. The value of the collateral information can then be expressed in terms of an equivalent number of pretest examinees (m), as follows:

m = \frac{P_R}{P_B / n}

where P_R is the precision yielded by the estimated regression model, and P_B / n is the precision per examinee yielded by the BILOG calibration.
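As a numerical check (ours), the short computation below applies this formula to the first rate-problem group reported later in Table 7 and recovers its equivalent sample size of roughly 220 examinees.

```python
# Illustrative only: equivalent sample size m = P_R / (P_B / n), using the first
# rate-problem group in Table 7 (BILOG precision 16.56 from 1,190 response
# vectors; collateral precision 3.07 from the estimated regression model).
def equivalent_examinees(p_regression: float, p_bilog: float, n_examinees: int) -> float:
    return p_regression / (p_bilog / n_examinees)

print(round(equivalent_examinees(3.07, 16.56, 1190)))  # ~221, matching Table 7's 220
```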

Results

Summary of Item Statistics

On the 24 pretests, there were 360 problem solving items, 96 of which were the items written for this study. After pretesting, items are subjected to a final review before being entered into the pool of items suitable for use in future operational tests. About 9% of the experimental items and 24% of the other problem-solving items were dropped from further consideration during final review. Items can be eliminated for a variety of reasons, and no record of why particular items are deemed unusable is kept. However, all the rate items that were eliminated were from one cell of the design and were extremely easy. On the other hand, four of the six probability items that were dropped had a common, difficult counting task--three-digit numbers within a range with two or three digits equal to 1; these may have confused examinees of all ability levels. In our subsequent analysis, we found that the IRT parameters for these items could not be calibrated. There was no obvious reason why the remaining two probability items were eliminated.

The mean item statistics for the experimental and nonexperimental problem solving items that survived the pretest process are presented in Table 3. The experimental rate problems were easier than the nonexperimental items overall, as measured by E-Delta, t(243) = -3.41, p < .001, and by IRT b, t(243) = -1.99, p < .05, but their variability was similar. Thus, this set of rate problems covered as wide a range of difficulty levels as did a heterogeneous mix of other problem solving items. The IRT c parameter was higher for these rate problems than for all nonexperimental items--t(243) = 4.92, p < .001--suggesting that examinees were more successful at guessing the correct answer for the rate problems than they were for other problems. However, the guessing parameter for rate problems did not differ from what might be expected by chance (.20).

The mean difficulty of the experimental probability problems was equal to the mean difficulty of nonexperimental items overall, but the probability problems were less variable in difficulty, as measured by E-Delta (Levene’s test), F(1, 241) = 9.27, p < .003. Probability problems also were more discriminating than nonexperimental items--t(241) = 2.36, p < .02--and were differentially easier for males--t(87.54) = -1.96, p < .05 (t-test for unequal variances). In addition, they were less variable in differential difficulty--Levene’s test, F(1, 241) = 7.95, p < .005. Finally, the correlation of an experienced test developer’s estimates of difficulty with the items’ IRT b parameters was .75 (n = 92, p < .001) for all of the experimental items--.89 (n = 48, p < .001) for the rate problems, and .54 (n = 44, p < .001) for probability problems.

To assess whether the item design features had any impact on differential difficulty for males and females, separate 2 x 2 x 2 ANOVAs were carried out on the DDIF-m/f data for the two experimental item families. For the rate problems, only the main effect for context was significant--F(1, 43) = 23.31, p < .001. The mean DDIF-m/f was .46 (favoring females) for cost items and -.30 (favoring males) for DRT problems. For probability problems, the item design features had no significant impact on DDIF-m/f, although, as noted above, this item set as a whole was slightly easier for males than for females.


TABLE 3. Mean Item Statistics for Experimental and Nonexperimental Problem Solving Items

Item Set                          E-Delta   Rbis   DDIF-m/f   IRT a   IRT b   IRT c
Rate (n = 44)              M       12.07    0.41     0.05      0.98    0.02    0.24
                           SD       2.07    0.14     0.65      0.37    1.20    0.12
Probability (n = 42)       M       13.69    0.40    -0.20      1.00    0.51    0.18
                           SD       1.41    0.12     0.43      0.26    0.98    0.09
Nonexperimental (n = 201)  M       13.27    0.42    -0.04      0.88    0.40    0.15
                           SD       2.14    0.15     0.66      0.32    1.13    0.11


Impact of Item Design Features on Item Operating Characteristics

Separate regression analyses were conducted for each of the three item parameters (difficulty, discrimination, and guessing) and for each of the two variant families (rate problems and probability problems). In each analysis, the dependent variable was one of the item parameters of interest (difficulty, discrimination, or guessing), and the independent variables were the item features.

The item parameter values considered in the analyses are summarized in Tables 4 and 5. Table 4 lists means and standard deviations calculated for the rate problems. Table 5 lists means and standard deviations calculated for the probability problems. The least squares regression results for predicting difficulty, discrimination, and guessing for both the rate problems and the probability problems are summarized in Table 6. The table provides raw (unstandardized) regression coefficients for all main effects and interaction effects that were found to be significant at the .05 significance level. Effects that were significant at the .01 or .001 significance levels are also indicated.


TABLE 4. IRT Item Parameters for Rate Problems with Differing Item Design Features

                  Item Features                              IRT Parameters
Use Variable   Complexity   Context           a                b                c
                                           M     SD         M     SD         M     SD
Yes            Level 2      DRT           .98    .28       1.49    .30       .24    .03
Yes            Level 2      Cost         1.03    .25       1.15    .38       .30    .06
Yes            Level 1      DRT           .77    .22        .53    .42       .27    .04
Yes            Level 1      Cost          .67    .19        .30    .56       .27    .03
No             Level 2      DRT           .83    .19        .13    .25       .24    .04
No             Level 2      Cost          .48    .14      -1.84    .63       .22    .01
No             Level 1      DRT           .76    .15      -1.16    .72       .21    .02
No             Level 1      Cost          .46    .13      -3.09    .57       .22    .01


TABLE 5. IRT Item Parameters for Probability Problems with Differing Item Design Features

                Item Features                                IRT Parameters
Complexity   Context 1      Context 2          a                b                c
                                            M     SD         M     SD         M     SD
Level 2      Probability    Real          .89a   .13       1.70    .53       .21    .06
Level 2      Probability    Pure         1.02a   .20       1.60    .29       .23    .06
Level 2      Percent        Real          .89a   .35       1.62    .86       .22    .05
Level 2      Percent        Pure          .88a   .15       1.14    .54       .23    .07
Level 1      Probability    Real          .96    .13        .37    .55       .20    .06
Level 1      Probability    Pure          .95    .16        .48    .53       .20    .05
Level 1      Percent        Real          .84    .12       -.05    .53       .20    .04
Level 1      Percent        Pure          .91    .13        .09    .53       .18    .04

a n = 5; otherwise n = 6.


TABLE 6. Regression of Item Features on IRT Item Parameters for Two Families of Item Variants

                                           Regression Statistics and Significant Coefficients
Effect                                       Difficulty    Discrimination    Guessing

Rate Problems (n = 48)
Intercept                                     -3.01***        .47***          .22***
Use Var = Yes                                  3.34***        .25**           .03**
Context = Cost                                   --             --              --
Complexity = L2                                1.09***          --              --
Use Var = No and Context = DRT                 1.95***        .32***            --
Use Var = Yes and Complexity = L2                --            .29***           --
Use Var = Yes and Context = Cost                 --             --             .03*
RMSE                                            .50            .19             .03
R2                                              .91            .52             .42
Adj. R2                                         .90            .50             .41

Probability Problems (n = 44)
Intercept                                       .05             --              --
Complexity = L2                                1.29**           --              --
Percent/Probability                             .34*            --              --
Real/Pure                                        --             --              --
RMSE                                            .54             --              --
R2                                              .62             --              --
Adj. R2                                         .61             --              --

*** p < .001. ** p < .01. * p < .05.


Rate Problems. The tree-based analyses of the IRT parameters for rate problems--difficulty, discrimination, and guessing--are summarized in Figures 1, 2, and 3, respectively. In these illustrations, each node is plotted at a horizontal location based on its estimated parameter value; its vertical location is determined by its estimated deviance value, the residual sum of squares for items in the node. The item features selected to define each split are listed on the edges connecting parents to offspring. The number of items assigned to each node is plotted as the node label. The resulting displays illustrate how variation in item feature classifications leads to subsequent variation in IRT parameter estimates.

Figure 1 demonstrates that, among the 48 rate variants, the manipulation that had the greatest impact on item difficulty required students to perform operations on variables as opposed to numbers. As shown in the upper section of Figure 1, the 24 items that did not require students to perform operations on variables (Use Var = No) had an average difficulty of -1.49 (SD = 1.30), and the 24 items that did require students to perform operations on variables (Use Var = Yes) had an average difficulty of .87 (SD = .63). Thus, items that required examinees to use variables were more difficult--by more than 1.5 standard deviation units--than those that did not. The significance of this result can be seen both in the tree and in the table. As shown in Figure 1, this split (Use Var: Yes) produced the largest decrease in deviance. As shown in Table 6, this effect produced the largest coefficient in the regression for difficulty.

Figure 1 also illustrates that, among the subset of rate problems that did not require operations on variables (Use Var = No), the 12 items with a cost context were significantly easier (M = -2.47, SD = .87) than the 12 items with a DRT context (M = -.51, SD = .85). However, among the subset of rate problems that did require operations on variables, the cost and DRT contexts were equally difficult. This interaction is clearly illustrated in the tree and is also evident in Table 6. That is, as indicated in Table 6, the cost/DRT effect was not significant as a main effect, but it was significant when crossed with Use Var = No. Thus, the context results obtained in the least squares regression analysis exactly replicated those obtained in the tree-based analysis. In particular, both analyses indicated that context can be a strong determiner of item difficulty when items do not require proficiency at using variables, but context is not a strong determiner of item difficulty when items do require proficiency at using variables. These results suggest that context effects may have a greater impact on performance among lower performing examinees than among higher performing examinees.

Figure 1 also summarizes the effect of problem complexity on item difficulty. Overall, the 24 items at the higher complexity level were significantly more difficult (M = .24, SD = 1.38) than the 24 items at the lower complexity level (M = -.86, SD = 1.57). In addition, this effect was of similar magnitude for problems involving a cost or a DRT context, and for problems that either included or did not include a variable. That the magnitude of the complexity effect was similar for different types of problems can also be seen in Table 6, which indicates that the main effect for complexity was highly significant (p < .001). Because all of the items at the higher complexity level involved four constraints, and all of the items at the lower complexity level involved only three constraints, this result suggests that the presence of a fourth constraint contributes to additional difficulty at all levels of proficiency.

The tree-based analysis of item discrimination is summarized in Figure 2. The similarity of the difficulty and discrimination trees suggests that the factors used to generate the rate variants affected difficulty and discrimination similarly. Problems that included a variable had better discrimination (M = .86, SD = .27) than those that did not (M = .63, SD = .22). Among items that did not include a variable, DRT problems were more discriminating (M = .79, SD = .17) than cost problems (M = .47, SD = .13). And finally, among problems that did include a variable, more complex items were more discriminating (M = 1.01, SD = .26) than less complex problems (M = .72, SD = .20).


The tree-based analysis of item guessing is summarized in Figure 3. Rate variants that included variables tended to have higher guessing parameters (M = .27, SD = .04) than rate variants that did not (M = .22, SD = .02). In addition, among items that included variables, items with a cost context tended to have slightly higher guessing parameters (M = .29, SD = .05) than items with a DRT context (M = .25, SD = .04).


[Regression tree plot; x-axis: IRT Item Difficulty; first split: Use Var = No vs. Use Var = Yes. R-squared = 0.91, Adj. R-squared = 0.90.]

FIGURE 1. Estimated regression tree for the difficulty parameter for rate problems.


[Regression tree plot; x-axis: IRT Item Discrimination (0.0 to 1.5). R-squared = 0.52, Adj. R-squared = 0.50.]

FIGURE 2. Estimated regression tree for the discrimination parameter for rate problems.


[Regression tree plot; x-axis: IRT Guessing Parameter (0.0 to 0.3); root node contains all 48 rate items. R-squared = 0.42, Adj. R-squared = 0.41.]

FIGURE 3. Estimated regression tree for the guessing parameter for rate problems.


Probability problems. The tree-based analysis of difficulty for the probability problems is summarized in Figure 4, and the related regression statistics are presented in Table 6. Both the tree-based analysis and the classical least squares regression analysis indicate that, among the 44 probability variants, the manipulation that had the greatest impact on item difficulty involved the complexity of the counting subtask. In particular, the 24 items that required a less complex counting subtask were easier (M = .22, SD = .55) than the 20 items that required a more complex counting subtask (M = 1.51, SD = .59).

For probability problems at both complexity levels, the 22 items that were cast as probability problems were slightly more difficult (M = .98, SD = .78) than the 22 that were cast as percent problems (M = .64, SD = .92). Note that this effect is reflected both in the tree and in the regression coefficients shown in Table 6.

For probability problems at both complexity levels, the difficulty of items set in real-life contexts did not differ substantially from similarly configured items that simply referred to sets of integers. This result is indicated by the absence of a real vs. pure split in the estimated regression tree, and by the fact that the real vs. pure effect was not significant in the least squares regression analysis.

As indicated in Table 6, none of the features used to generate the probability variants were useful for explaining variation in item discrimination parameters or in item guessing parameters. A similar result was obtained in the tree-based analysis. That is, the estimated trees yielded no useful splits.


[Regression tree plot; x-axis: IRT Item Difficulty (-3 to 2). The 44 probability variants split first on complexity (Level 1, n = 24, vs. Level 2, n = 20) and then on percent vs. probability format (12/12 and 10/10). R-squared = 0.62, Adj. R-squared = 0.61.]

FIGURE 4. Estimated regression tree for the difficulty parameter for probability problems.


Implications for Reductions in Pretest Sample Sizes

The improvements in posterior precision achievable with the collateral models estimated in this study are summarized in Table 7. Because precision varies with difficulty level, separate estimates are provided for groups of items located at varying points on the underlying difficulty scale. Item groupings correspond to the feature categories identified in the regression trees (Figures 1 through 4). Under the estimated difficulty model, all of the items in each group are predicted to have the same value of item difficulty. This value is listed in the column labeled “Predicted Difficulty.”

Table 7 also lists two precision estimates for each group. The estimates listed in the column labeled “BILOG Precision” incorporate information from approximately 1,000 examinee response vectors, but no information from the estimated item feature model. These estimates were calculated as the inverse square of the average within-group standard error obtained from the BILOG calibration. The estimates listed in the column labeled “Collateral Precision” incorporate information from the estimated item feature model, but no information from examinee response vectors. These estimates were calculated as the inverse of the within-group variance obtained from the estimated regression model.

The right-most column of Table 7 provides an estimate of the value of the collateral information expressed in terms of equivalent numbers of pretest examinees (m). As can be seen, the collateral model for rate variants yielded an equivalent sample size of approximately 215 examinees, and the collateral model for probability variants yielded an equivalent sample size of approximately 128 examinees. In interpreting these results it is important to note that, while precision is additive, the effect of increasing sample sizes is not. Specifically, the posterior standard deviation of item parameters shows diminishing returns as calibration sample size is increased, so that the first 200 examinees reduce posterior standard deviations the most, the next 200 reduce posterior standard deviations by less, and by the time there are 1,000 pretest examinees, another 200 examinees reduce posterior standard deviations only slightly. The relevance of using collateral information that is worth, say, 200 examinees, is that the impact of the collateral information is tantamount to that of the first 200 examinees, not the last 200.

Figure 5 illustrates this phenomenon for the rate variants. The solid curve depicts the effect of increasing sample sizes when collateral information is not included in the calibration. The dashed curve shows the effect of increasing sample sizes when collateral information is included in the calibration. The line from A to B represents the decrease in uncertainty that would be attained if, in addition to collateral information, 10 examinee response vectors were also available. The line from C to D represents the decrease in uncertainty that would be attained if, in addition to collateral information, 250 examinee response vectors were also available. The line at E shows that a calibration that included both collateral information and 250 pretest examinees would yield an effective sample size of about 420 examinees. (These estimates do not reflect the additional improvements achievable through the use of expected response curves, discussed below.)
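The sketch below (ours, using illustrative numbers from one rate-problem group in Table 7) shows why the returns diminish: posterior precision grows linearly with calibration sample size, so the posterior standard deviation shrinks only as one over the square root of total precision, and collateral information acts like a head start of roughly the equivalent sample size.

```python
# Illustrative only: posterior SD as a function of calibration sample size, with
# and without collateral information. Per-examinee precision and collateral
# precision are taken from the first rate-problem group in Table 7.
import math

PER_EXAMINEE_PRECISION = 16.56 / 1190   # BILOG precision divided by sample size
COLLATERAL_PRECISION = 3.07             # precision from the item feature model

def posterior_sd(n_examinees: int, use_collateral: bool) -> float:
    precision = n_examinees * PER_EXAMINEE_PRECISION
    if use_collateral:
        precision += COLLATERAL_PRECISION
    return 1.0 / math.sqrt(precision)

for n in (10, 250, 1000):
    print(n, round(posterior_sd(n, False), 3), round(posterior_sd(n, True), 3))
# The gap between the two columns narrows as n grows, mirroring Figure 5.
```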

How valuable is 200 examinees-worth of information about item parameters from item features? The answer depends on how this information will be used. The current calibration system uses information from pretest examinees only, and treats the resulting estimates as if they were true item parameter values (that is, any remaining uncertainty is ignored). Experience has shown that 1,000 examinees will suffice for this approach. Collateral information worth 200 examinees would be disappointing indeed if all it meant was reducing the pretest sample to 800 with the rest of the current system intact. This would be a reduction of pretest sample size of just 20%.


The preferred alternative addresses not only the source of information about item parameters, but also the way the information is used. The approach, described in Mislevy, Sheehan, and Wingersky (1993), uses expected response curves (ERCs) that incorporate information from both sources (collateral information and pretest examinees); it models uncertainty about these sources as well. The first of these properties means that it is possible to use collateral information about the item features that influence item operating characteristics. The second property means that it is not necessary to have the total amount of information about item parameters so great as to treat them as known. The ERCs reduce biases that arise when estimates are treated as true values in the current system--the phenomenon that kept people from using small calibration samples in that system. Mislevy, Sheehan, and Wingersky found that ERCs based on collateral information, plus responses from 250 pretest examinees, provided measurement of examinees that was as effective as item parameter estimates based on 1,000 pretest examinees. This is a reduction of 750 pretest examinees, or 75%.

TABLE 7. The Precision of Difficulty Estimates Generated With and Without Collateral Information

Item Group^a              n   Predicted    BILOG^b     Collateral^c   Equivalent
                              Difficulty   Precision   Precision      Sample Size

Rate Problems
No Var, Cost, L1          6     -3.09        16.56        3.07           220
No Var, Cost, L2          6     -1.84        15.29        2.54           197
No Var, DRT, L1           6     -1.16        19.75        1.93           117
No Var, DRT, L2           6      0.13        88.23       16.13           218
Var, L1                  12      0.41        39.17        4.23           129
Var, L2                  12      1.32        24.01        7.17           355
Weighted Average^d                                                       215

Probability Problems
L1, Pct                  12      0.02        43.32        3.84           105
L1, Prob                 12      0.43        73.73        3.72            60
L2, Pct                  10      1.38        23.23        1.93            99
L2, Prob                 10      1.65        27.45        6.12           265
Weighted Average^d                                                       128

^a Item groups reflect the combinations of features found to be significant in the regression analysis.
^b BILOG precision = 1 / (Average Standard Error)^2 from a calibration of 1,190 examinee response vectors.
^c Collateral precision = 1 / (Residual Standard Deviation)^2 from the estimated regression equation.
^d Weights are proportional to the numbers of items available in each group.
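Read together, footnotes b and c imply a simple way to reproduce the right-most column, under the assumption that the equivalent sample size scales the 1,190-examinee calibration sample by the ratio of the two precisions. The check below uses the first rate-problem row; the formula is one reading of the footnotes rather than a quotation of the estimation procedure.

    # Equivalent sample size for one item group (first rate-problem row of Table 7).
    calibration_n = 1190           # examinees in the BILOG calibration
    bilog_precision = 16.56        # 1 / (average standard error)^2
    collateral_precision = 3.07    # 1 / (residual standard deviation)^2

    equivalent_n = calibration_n * collateral_precision / bilog_precision
    print(equivalent_n)            # about 220.6, consistent with the tabled value of 220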

FIGURE 5. The effect of increasing sample sizes with and without collateral information. (Horizontal axis: calibration sample size; solid curve: without collateral information; dashed curve: with collateral information.)

Discussion

The attempt to systematically manipulate difficulty was extremely successful for rate problems and moderately successful for probability problems. For rate problems, all the manipulated features affected difficulty, accounting for 90% of the variance in difficulty in the set of problems. This family of items covered a wide difficulty range. One manipulation in particular--using a variable to transform a multistep arithmetic word problem into a multistep algebra word problem--had a very powerful effect on difficulty. In addition, there was an interesting interaction between context and the use of a variable: For easier items that did not involve a variable, cost problems were easier than DRT problems, but this particular context did not affect difficulty for problems that did involve a variable. This suggests that some aspects of context may facilitate or impede problem solution among lower-performing examinees, but not among higher-performing examinees. The item features also had similar effects on item discrimination and guessing.

In contrast with the rate problems, the probability problems were more difficult and covered a narrower difficulty range. Increasing the complexity of the counting task had the greatest impact on difficulty. One aspect of context (whether the problem was cast as a percent or probability problem) did affect difficulty, but another (whether or not the problem narrative involved a real-life context) did not. However, the context interaction for the rate problems serves as a reminder not to dismiss the possibility that such a contrast (real-life versus pure context) may be an important feature for less difficult items. Finally, item design features did not affect the discrimination or guessing parameters for probability problems.

One issue raised by the results for the probability problems is why these problems were so difficult. The items with the simple counting task were not very demanding in terms of the arithmetic involved, and presenting the problem in terms of percent rather than probability facilitated performance. Taken together, these factors suggest that a significant portion of the examinees taking the GRE General Test in 1996 were unfamiliar with basic statistical concepts and procedures.

In the following sections, the implications of this study for articulating the constructs assessed by the GRE quantitative measure, for increasing the efficiency of test development, and for reducing pretest sample size are discussed.

Understanding Item Difficulty and Construct Representation

Among item statistical characteristics, difficulty has received the most attention because of its role in construct validation (Embretson, 1983) and proficiency scaling (Sheehan, 1997). Embretson distinguished between two aspects of test validity--nomothetic span, which refers to the relationship of test scores to other variables, and construct representation, which "is concerned with identifying the theoretical mechanisms that underlie item responses such as information processes, strategies, and knowledge stores" (p. 179). With respect to construct representation, items can be described from either the task perspective (what are the features of the task?) or the examinee perspective (what processes, skills, strategies, and knowledge do people use to solve problems?). Of course, items can be described in many different ways. Difficulty modeling introduces a criterion--the relationship of item features to difficulty--that permits a distinction between critical and incidental features. A basic assumption of this approach is that the features of the task and examinee processes are interdependent.
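As a concrete illustration of difficulty modeling in this sense, the sketch below regresses IRT difficulty estimates on dummy-coded design features (use of a variable, DRT versus cost context, and a higher complexity level), in the spirit of the regression analyses reported above. The feature codes and difficulty values are invented for illustration and are not the item data from this study.

    import numpy as np

    # Columns: intercept, uses a variable, DRT context, higher complexity level.
    # Rows are hypothetical item variants; values are illustrative only.
    features = np.array([
        [1, 0, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 1],
    ], dtype=float)
    difficulty = np.array([-3.1, -1.2, -1.8, 0.4, 1.3, 0.9])

    # Ordinary least squares: how much of the variation in difficulty
    # do the coded features account for?
    coef, _, _, _ = np.linalg.lstsq(features, difficulty, rcond=None)
    predicted = features @ coef
    ss_res = np.sum((difficulty - predicted) ** 2)
    ss_tot = np.sum((difficulty - difficulty.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot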

Although many studies such as the current one focus primarily on one of these perspectives, a complete theory of the task requires both. Some evidence about the relationship between the item features of rate problems that were manipulated in this study and problem solution processes is reported in Sebrechts et al. (1996). Sebrechts et al. categorized the strategies used by college students in solving 20 GRE word problems, and examined the relationships between item features, solution strategies, and errors. The four classes of strategies identified were:

1. following step-by-step mathematical solutions (equation based)

2. setting up and solving ratios

3. modeling the situation by deriving solutions for a set of potential values for a variable and converging on an answer (simulations)

4. using other, unidentified strategies

Most of the successful problem solutions involved equation-based strategies. Nevertheless, when an equation-based strategy would have required actually manipulating variables rather than following an arithmetic, step-by-step solution, students were less likely to use this strategy even though it was highly appropriate. They were more likely to use other, unidentifiable strategies or simulation strategies. It seems that many of these students either lacked appropriate strategies or failed to apply the strategies they possessed to word problems that required the manipulation of variables. Problem complexity, on the other hand, did not have an impact on strategy but was associated with errors of misusing the givens in the problem statement.

In sum, determining which item features impact item difficulty and how these features affect examinee problem solving provides a better explication of the constructs being assessed. This more detailed understanding of constructs is necessary for principled item generation, and can serve as a basis for the development of diagnostic assessment and reporting.

Implications for Creating Item Variants

The results of the current study demonstrate the enormous potential of systematically creating item variants. The systematic generation of item variants can result in a set of items with predictable item characteristics that differ from each other in specified ways and degrees. Efforts to automate some aspects of systematic item generation are currently underway (Singley & Bennett, 1998). In addition to creating items for operational tests, variants can be created for use in diagnostic and practice tests without compromising operational pools. However, there are many issues that need to be addressed before the potential of this approach to item development can be fully realized in the context of large-scale assessment. These issues include the diversity of problems that exist in the GRE quantitative measure, the wide variety of item features that can be manipulated to create variants, how items should be classified, and how similarity among problems should be defined.

The pool of GRE quantitative problems is quite diverse. Rate and probability word problems represent only a small proportion of the item types included in the measure. In a sample of about 340 arithmetic and algebra items in two GRE computer adaptive test pools, only 4% were classified as probability problems and 2% as rate problems. Furthermore, even for these small sets of problems, many features can be manipulated to create variants. Two criteria that might be used to determine which item features to manipulate are the impact of the features on item performance and whether or not the features are deemed construct relevant.

While information about the former criterion can be gleaned from the examination of existing items and from experimental studies, establishing construct relevance requires other kinds of evidence, such as studies of similarities in the processes used to solve assessment items and those used to solve problems in the academic domains of interest.

Finally, if large numbers of item variants were created, methods to manage their distribution among the pools of items used for computerized adaptive testing would need to be developed. This might require revision of the current item classification system. A better understanding of the item features that contribute to examinees' perceptions of item similarity, and to transfer among items, would be helpful here.

Implications for Reducing Pretest Sample Size

Knowledge of the degree to which different features impact item statistics could allow us to create item variants along with estimates of their operating characteristics. Statistical procedures for using collateral information such as this to reduce pretest sample size have been developed (Mislevy et al., 1993). Nevertheless, two barriers block the application of these methods at present, although neither barrier is insurmountable. One of these barriers concerns operational constraints that must be taken into consideration. Currently, sample size is controlled at the section rather than the item level. This means that one would want to have collateral information for all of the diverse items in a section before the sample size could be reduced for that section. A study in which four pretest sections are composed of item variants based on the same set of parent items with known operating characteristics is currently in progress. The second barrier is the lack of a knowledge base that would permit prediction of item operating characteristics for the wide variety of items that exist on the GRE quantitative measure. Over time, this knowledge base could be developed through the examination of existing items and through experimental studies such as this one. In the meantime, the difficulty estimates of experienced item writers are reliable and predictive of actual difficulty, and could be used to reduce pretest sample size (Sheehan & Mislevy, 1994).

Concluding Comments

Construct-driven item generation requires a description of items that can be related to the processes, skills, and strategies used in item solving. The benefits of such an approach are that, if item variants can be created systematically through an understanding of critical problem features, tests can be designed to cover important aspects of a domain, overlap can be controlled, and pretesting requirements can be reduced. A closer integration of research and test development would contribute to the development of the knowledge base needed to support construct-driven item generation. Ideally, every time items are pretested, knowledge about how item features impact item performance could be gained if items were designed to vary systematically on selected features. This kind of knowledge would not only help to improve item development efficiency, but could also provide a basis for the development of new products and services.

References

Breiman, L., Friedman, J. H., Olshen, R., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179-197.

Embretson, S. E. (1995). A measurement model for linking individual learning to processes and knowledge: Application to mathematical reasoning. Journal of Educational Measurement, 32(3), 277-294.

Hall, R., Kibler, D., Wenger, E., & Truxaw, C. (1989). Exploring the episodic structure of algebra story problem solving. Cognition and Instruction, 6(3), 223-283.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Lane, S. (1991). Use of restricted item response models for examining item difficulty ordering and slope uniformity. Journal of Educational Measurement, 28(4), 295-309.

Mayer, R. E. (1981). Frequency norms and structural analysis of algebra story problems into families, categories, and templates. Instructional Science, 10, 135-175.

Mayer, R. E. (1982). Memory for algebra story problems. Journal of Educational Psychology, 74(2), 199-216.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Maximum likelihood item analysis and test scoring with logistic models for binary items. Chicago: International Educational Services.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55-78.

Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item parameters: Expected response functions (ETS Research Report RR-94-28-ONR). Princeton, NJ: Educational Testing Service.

Reed, S. K. (1987). A structure-mapping model for word problems. Journal of Experimental Psychology, 13(1), 124-139.

Reed, S. K., Dempster, A., & Ettinger, M. (1985). Usefulness of analogous solutions for solving algebra word problems. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(1), 106-125.

Sebrechts, M. M., Enright, M., Bennett, R. E., & Martin, K. (1996). Using algebra word problems to assess quantitative ability: Attributes, strategies, and errors. Cognition and Instruction, 14(3), 285-343.

Shalin, V. L., & Bee, N. V. (1985). Structural differences between two-step word problems (Technical Report No. ED-259-949). Pittsburgh, PA: University of Pittsburgh, Learning Research and Development Center.

Sheehan, K. M. (1997). A tree-based approach to proficiency scaling and diagnostic assessment. Journal of Educational Measurement, 34(4), 333-352.

Sheehan, K., & Mislevy, R. J. (1994). A tree-based analysis of items from an assessment of basic mathematics skills (ETS Research Report 94-14). Princeton, NJ: Educational Testing Service.

Singley, M. K., & Bennett, R. E. (1998). Validation and extension of the mathematical expression response type: Applications of schema theory to automatic scoring and item generation in mathematics (ETS Research Report RR-97-19, GRE Report 93-24P). Princeton, NJ: Educational Testing Service.
