EXPLORING THE USE OF IRT EQUATING FOR THE GRE SUBJECT TEST IN MATHEMATICS

Robert McKinley
Neal Kingston

GRE Board Professional Report No. 86-8P
ETS Research Report 87-21

November 1987

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.


Educational Testing Service, Princeton N.J. 08541


Copyright © 1987 by Educational Testing Service. All rights reserved.


ABSTRACT

A study was conducted to investigate the feasibility of using IRT equating for the GRE Subject Test in Mathematics. Two forms of the test were equated using the three-parameter logistic (3PL) model, and the results were compared to the results of the Tucker equating procedure currently used operationally, as well as to equipercentile equating. In addition, the two basic assumptions of the 3PL model were investigated to determine whether they were reasonable within the context of the GRE Mathematics Test. These assumptions were that of unidimensionality, which was investigated using nonlinear, full-information factor analysis, and that the item response function follows a logistic form with three parameters, which was investigated using two different goodness-of-fit procedures.

The results of the study indicate that the GRE Mathematics Test is reasonably unidimensional, and that the three-parameter logistic model does yield reasonable fit to the data. It was found that both the IRT and the equipercentile equating procedures yielded somewhat different results than the Tucker method. The results for the IRT procedure and the equipercentile procedure were quite similar.


INTRODUCTION

Currently, the GRE Subject Test in Mathematics is equated using the common item method described as Design IV in Angoff (1984). Each new form of the test contains at least twenty items from the old form of the test to which it was equated. Examinee scores on these common items are used to place the new form scores on the same scale as the old form scores using the Tucker procedure (Gulliksen, 1950) or the Levine procedure (Levine, 1955).

Every other form of the test is double-part equated to reduce scale drift. Two sets of common items are used, each common to a different old form of the test. Two separate equatings (either Tucker or Levine) are performed, and the bisector of the two equatings is used. In this procedure, then, the new form of the test may contain as many as forty items in common with previous forms. Since the GRE Mathematics Test has only sixty-six items, the fewest of any GRE Subject Tests, as many as 60 percent of the items on a new form may also appear on other forms of the test.

To exacerbate the problem, when an edition of the test is used as the old form in a subsequent equating, twenty more common items are required. Forms that are not double-part equated are used twice as old forms. Thus, at least sixty items of a given form, 90 percent of the test, will eventually appear on other forms of the test.

This extreme overlap of test forms poses two major problems. First, there is a disturbingly high probability that an examinee who takes the GRE Mathematics Test a second time will receive a form of the test containing twenty items the examinee has already seen. Second, if the security of a form of the GRE Mathematics Test were compromised through theft, it could well require major revision not only of that form, but also of three other forms--two older forms and a newer form.

Item response theory (IRT) provides one potential solution to the problem of item overlap. It may be possible, through the use of IRT-based equating procedures, to reduce the number of items required in the common item sets without increasing the risk of scale drift. Even if more sets of common items were required, if they contained fewer items the total amount of overlap could be substantially reduced. For example, if only ten common items were required for IRT-based equating, triple-part equating (the use of three sets of common items to establish a single IRT metric that would then be used for equating) would require only thirty common items, ten fewer than double-part equating. Perhaps more important, a form with items in common with a compromised form would contain only ten items that would require revision, rather than the twenty items now affected.
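The overlap figures quoted above follow from simple arithmetic on the 66-item test length. The short sketch below reproduces them; the function name is illustrative only, and the ten-item set size is the hypothetical value used in the text, not an established requirement.

```python
# Overlap arithmetic for the figures quoted in the text.
# The function name is hypothetical; the counts come from the report.
TEST_LENGTH = 66  # items on the GRE Mathematics Test


def overlap_fraction(n_sets: int, items_per_set: int,
                     test_length: int = TEST_LENGTH) -> float:
    """Fraction of a form's items that also appear on other forms."""
    return n_sets * items_per_set / test_length


# Double-part equating: two 20-item common sets on the new form.
double_part = overlap_fraction(2, 20)
# One further 20-item set once the form is itself used as an old form.
eventual = overlap_fraction(3, 20)
# Hypothetical triple-part IRT equating with ten-item sets.
triple_part_irt = overlap_fraction(3, 10)

for label, value in [("double-part", double_part),
                     ("eventual", eventual),
                     ("triple-part IRT (hypothetical)", triple_part_irt)]:
    print(f"{label}: {value:.3f}")
```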


Of course, the benefits of IRT-based equating just described accrue only if two major assumptions are found to be appropriate. First, it must be true that the IRT model used provides an adequate representation of the data (that is, IRT assumptions such as unidimensionality and postulated form of the item response function must be reasonable). Second, it must be true that adequate IRT-based equating can be performed with common item sets containing fewer items than are adequate for the current equating methods.

The purpose of this research was to evaluate the first of these two assumptions, and to evaluate the adequacy of IRT equating using the same number of common items as are used with current operational procedures. If the results of this study are positive, future research will be proposed to address the feasibility of reducing the number of common items.

METHOD

Design

The general design of this study can be divided into two parts. The first part was an evaluation of the appropriateness of using the three-parameter logistic (3PL) model with the GRE Mathematics Test (see Appendix A for a discussion of the 3PL model). This included an assessment of the dimensionality of the test, as well as an investigation of the fit of the 3PL model to the data. The second part involved the application and comparison of two conventional equating procedures and an IRT equating procedure to equate two active forms of the GRE Mathematics Test using the common item set used operationally.

Data

Forms 3BGR and 3EGR of the GRE Mathematics Test were used in this study. Hereafter these forms will be referred to as form A and form B, respectively. Form B was treated as the new form and form A as the old form. These two forms had been previously equated operationally in December 1982, using the Tucker equations and a double-part equating design. Table 1 describes the new and old forms and the common items in terms of means and standard deviations of the estimated IRT parameters. From this table it can be seen that the distributions of item parameter estimates are quite similar for the three sets of items.


Table 1
Summary of IRT Parameter Estimates

----------------------------------------------------------
Parameter   Statistic    Form A    Form B    Common items
----------------------------------------------------------
a           Mean          1.01      0.96       1.02
            Std. dev.     0.28      0.32       0.27

b           Mean          0.07      0.22      -0.10
            Std. dev.     1.14      1.06       1.24

c           Mean          0.17      0.16       0.14
            Std. dev.     0.10      0.09       0.09
----------------------------------------------------------
Note. Form A and B statistics do not include common items.

Table 2
GRE Math Test Equating Sample Sizes and Scaled Score Means and Standard Deviations

---------------------------------------------------------
Form        Administration       N      Mean    Std. dev.
---------------------------------------------------------
A (3BGR)    December 1979       694     714       157
            June 1980           233     638       157
            October 1980        534     716       164
            December 1981       635     706       160
            April 1983          497     634       148
            February 1984       547     640       134
            Total             3,140     681       161

B (3EGR)    December 1982       774     689       147
            October 1983        636     722       146
            February 1986       507     642       134
            Total             1,917     689       151
---------------------------------------------------------


The samples used for this study are described in Table 2. As can be seen from the table, the data used for form A included response data from six separate administrations of the form over the period from December 1979 to February 1984. The total sample size for form A was 3,140. The data used for form B included all response data from three different administrations of the form over the period from December 1982 to February 1986. The total sample size for form B was 1,917. Also shown in Table 2 are the scaled score means and standard deviations obtained for each administration date. Reported statistics are based on the linear equating methods used operationally. The table shows that the score distributions for the two total samples are quite similar.

Equating Procedures

The two conventional equating procedures used in this study, the Tucker procedure and the equipercentile procedure, are described in detail in Angoff (1984, chapter 3). The Tucker procedure was designed for linear equating of two forms of a test administered to two nonrandomly formed groups through a set of common items administered to both groups. The common items are used to adjust for any group differences. As was mentioned above, this is the procedure used operationally.
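As an illustration of how a Tucker-type linear equating uses the common items to adjust for group differences, the sketch below follows the standard textbook form of the Tucker equations. It is not the operational ETS implementation, which the report does not describe; the function and variable names are hypothetical, and the synthetic-population weights are assumed to be proportional to sample size.

```python
from math import sqrt


def _mean(xs):
    return sum(xs) / len(xs)


def _cov(xs, ys):
    """Population covariance; _cov(xs, xs) is the variance."""
    mx, my = _mean(xs), _mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def tucker_linear(x1, v1, y2, v2):
    """Return a function mapping a new-form score onto the old-form scale.

    x1, v1: total and common-item scores for the group given the new form.
    y2, v2: total and common-item scores for the group given the old form.
    """
    n1, n2 = len(x1), len(y2)
    # Assumption: synthetic-population weights proportional to sample size.
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    g1 = _cov(x1, v1) / _cov(v1, v1)  # slope of X on V in group 1
    g2 = _cov(y2, v2) / _cov(v2, v2)  # slope of Y on V in group 2
    dmu = _mean(v1) - _mean(v2)       # group difference on common items
    dvar = _cov(v1, v1) - _cov(v2, v2)
    # Means and variances of X and Y in the synthetic population.
    mu_x = _mean(x1) - w2 * g1 * dmu
    mu_y = _mean(y2) + w1 * g2 * dmu
    var_x = _cov(x1, x1) - w2 * g1 ** 2 * dvar + w1 * w2 * g1 ** 2 * dmu ** 2
    var_y = _cov(y2, y2) + w1 * g2 ** 2 * dvar + w1 * w2 * g2 ** 2 * dmu ** 2
    slope = sqrt(var_y / var_x)
    return lambda x: mu_y + slope * (x - mu_x)


# Toy check: identical groups, old-form scores an exact linear function
# of new-form scores, so the equating should recover that function.
x1 = [10, 20, 30, 40, 50]
v1 = [3, 5, 9, 11, 16]
y2 = [2 * x + 3 for x in x1]
convert = tucker_linear(x1, v1, y2, v1)
print(convert(10))
```

Because the result is a straight line by construction, any curvature in the true relationship between the forms cannot be represented, which is the limitation the nonlinear procedures below are meant to probe.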

The equipercentile procedure, as employed in this study, was also designed for use with two nonrandomly formed groups, but uses a nonlinear approach to equating. Although this procedure was not used operationally, it was felt that it would provide a useful guide for interpretation of any differences found between the IRT procedure and the operational procedure. Since there is no way of knowing the shape of the true equating function, the equipercentile equating provides information useful in judging the adequacy of the Tucker equating as a criterion. The equipercentile procedure used consisted of first equating the common items to the old form, and then equating the new form to the common items.
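The two-step chain described above (new form to common items, then common items to old form) can be sketched as follows. This is a minimal illustration, assuming integer scores starting at zero, percentile ranks defined with the usual uniform spreading of each integer score over plus or minus one half, and no presmoothing; the report does not state these details, and all function names are hypothetical.

```python
def percentile_rank(freq, x):
    """Percentile rank (0-100) of score x, where freq[k] is the count at score k."""
    n = sum(freq)
    k = min(max(int(x + 0.5), 0), len(freq) - 1)  # nearest integer score, clipped
    below = sum(freq[:k]) / n
    return 100.0 * (below + (x - k + 0.5) * freq[k] / n)


def score_at_percentile(freq, p):
    """Inverse of percentile_rank: the score whose percentile rank is p."""
    target = p / 100.0 * sum(freq)
    cum = 0.0
    for k, f in enumerate(freq):
        if f > 0 and cum + f >= target:
            return (target - cum) / f + k - 0.5
        cum += f
    return len(freq) - 0.5


def chained_equipercentile(freq_new, freq_common_new_group,
                           freq_common_old_group, freq_old):
    """New-form score -> common-item score -> old-form score."""
    def convert(x):
        v = score_at_percentile(freq_common_new_group,
                                percentile_rank(freq_new, x))
        return score_at_percentile(freq_old,
                                   percentile_rank(freq_common_old_group, v))
    return convert


# Toy frequencies: the old form's distribution is the new form's shifted
# up by one point, and the common-item distribution is the same in both groups.
freq_new = [1, 3, 5, 3, 1, 0]
freq_old = [0, 1, 3, 5, 3, 1]
freq_common = [2, 4, 4, 2]
convert = chained_equipercentile(freq_new, freq_common, freq_common, freq_old)
print(convert(2))
```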

The IRT equating procedure employed in this study is described in Lord (1980, chapter 13). With this procedure, equating is defined such that scores on two different test forms are equivalent if they represent true scores yielded by the same ability. It is assumed that the equating function appropriate for true scores is also appropriate for observed scores.

The first step in this procedure was to obtain the item parameter estimates for the 3PL model for both forms using the LOGIST program (Wingersky, 1983). Item parameter estimates for the two forms were placed on the same scale via simultaneous calibration. The second step in this procedure involved using the old form item parameter estimates to estimate, via a Newton process, the ability that would yield an estimated true score equal to each possible raw score on the old form. The resulting ability estimates were then used to compute, with the new form item parameter estimates, estimated true scores on the new form.
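The second and third steps can be illustrated with a short sketch. It assumes the item parameter estimates are already on a common scale, uses the conventional scaling constant D = 1.7 (an assumption; the report does not state it here), and damps the Newton step for stability. The function names are hypothetical, and scores at or below the sum of the c parameters, for which no ability exists under the model, are simply rejected.

```python
import math

D = 1.7  # assumed scaling constant for the logistic metric


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Expected number-right score; items is a list of (a, b, c) tuples."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def true_score_slope(theta, items):
    """Derivative of the true score with respect to theta."""
    total = 0.0
    for a, b, c in items:
        p = p3pl(theta, a, b, c)
        total += D * a * (p - c) * (1.0 - p) / (1.0 - c)
    return total


def theta_for_true_score(score, items, theta=0.0, tol=1e-8, max_iter=100):
    """Newton iteration for the ability whose true score equals `score`."""
    if not sum(c for _, _, c in items) < score < len(items):
        raise ValueError("score must lie between sum of c and number of items")
    for _ in range(max_iter):
        step = (true_score(theta, items) - score) / true_score_slope(theta, items)
        step = max(-1.0, min(1.0, step))  # damp large steps
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton iteration did not converge")


def irt_true_score_equate(raw_score, old_items, new_items):
    """Old-form raw score -> ability -> estimated true score on the new form."""
    return true_score(theta_for_true_score(raw_score, old_items), new_items)


# Toy three-item forms; the "new" form is the old one made uniformly harder.
old_form = [(1.0, -1.0, 0.20), (0.8, 0.0, 0.15), (1.2, 1.0, 0.10)]
new_form = [(a, b + 0.5, c) for a, b, c in old_form]
print(irt_true_score_equate(2.0, old_form, new_form))
```

Repeating the last call for every possible raw score on the old form would give the full table of equivalent true scores.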


Analyses

The analyses performed in this study fall into three categories. The first category includes analyses intended to assess the reasonableness of the IRT assumption of unidimensionality for the GRE Mathematics Test. This involved the use of full-information factor analysis (Bock, Gibbons, & Muraki, 1985; Kingston, 1986) as implemented in the TESTFACT program (Wilson, Wood, & Gibbons, 1984). Full-information factor analysis uses a multidimensional IRT model to perform a nonlinear factor analysis of binary data. Using LOGIST c-parameter estimates as input, a three-factor stepwise solution was obtained for form A. The strengths of the second and third factors were then evaluated using a log-likelihood chi-square test provided by the TESTFACT program. The resulting solutions were interpreted by examining factor loading patterns and eigenvalues in light of the known content structure of the tests.

The second set of analyses performed in this study focused on the goodness-of-fit of the 3PL model to the data. These analyses included the computation and interpretation of chi-square goodness-of-fit statistics (Yen, 1981), as well as item-ability regression plots (Kingston & Dorans, 1985). The interpretation of the chi-square statistics consisted only of testing the statistic for each item for significance using a type I error rate of 0.01. The interpretation of the plots of the item-ability regressions, on the other hand, was more complex and subjective in nature. Basically, the regression plots were visually inspected for observed proportions correct that fell outside an approximate 95 percent confidence interval around the value predicted by the model.
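Both fit checks compare observed and model-predicted proportions correct within groups of examinees of similar estimated ability. The sketch below shows one common way to set this up: a Pearson-type chi-square over ability-ordered cells and a flag for cells falling outside an approximate 95 percent band. The cell count of ten, the D = 1.7 constant, and all function names are assumptions for illustration; the exact statistics used in the report may differ in detail.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def fit_cells(thetas, responses, a, b, c, n_cells=10):
    """Split examinees into ability-ordered cells.

    Returns a list of (n, observed proportion, expected proportion).
    """
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    cells = []
    for j in range(n_cells):
        idx = order[j * len(order) // n_cells:(j + 1) * len(order) // n_cells]
        if not idx:
            continue
        n = len(idx)
        observed = sum(responses[i] for i in idx) / n
        expected = sum(p3pl(thetas[i], a, b, c) for i in idx) / n
        cells.append((n, observed, expected))
    return cells


def chi_square_fit(cells):
    """Pearson-type chi-square summed over cells."""
    return sum(n * (o - e) ** 2 / (e * (1.0 - e)) for n, o, e in cells)


def cells_outside_band(cells, z=1.96):
    """Indices of cells whose observed proportion falls outside the band."""
    return [j for j, (n, o, e) in enumerate(cells)
            if abs(o - e) > z * math.sqrt(e * (1.0 - e) / n)]


# Hand-built cells: the first fits exactly, the second is clearly off.
demo_cells = [(100, 0.5, 0.5), (100, 0.7, 0.5)]
print(chi_square_fit(demo_cells), cells_outside_band(demo_cells))
```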

The final set of analyses involved the comparison of the equatings yielded by the two procedures used. This involved primarily a visual inspection of equating plots to determine whether the two equatings yielded different score transformations.

RESULTS

Dimensionality

The results of the full-information factor analysis are summarized in Tables 3 and 4. In Table 3, promax-rotated factor loadings on each factor are shown for each item for the two-factor solution. The percent of variance accounted for by each factor in the orthogonal solution is shown at the bottom of the table. As an aid to interpretation, the proportion correct for each item is also shown. Table 4 presents the rotated factor loadings for the three-factor solution. Again, the


Table 3
Proportion Correct and Promax-Rotated Factor Loadings for the Two-Factor Solution

---------------------------------------------------------------
                Factor                             Factor
Item       P       I      II      Item       P       I      II
---------------------------------------------------------------
  1    0.75   -0.06    0.62       34    0.72    0.46    0.38
  2    0.90    0.12    0.57       35    0.65    0.64   -0.04
  3    0.96    0.11    0.58       36    0.71    0.20    0.58
  4    0.84    0.00    0.64       37    0.36    0.38    0.49
  5    0.85    0.00    0.47       38    0.59    0.36    0.19
  6    0.82    0.02    0.56       39    0.43    0.18    0.59
  7    0.76    0.36    0.28       40    0.29    0.62    0.29
  8    0.71   -0.03    0.63       41    0.36    0.67    0.09
  9    0.89    0.15    0.32       42    0.41    0.45    0.37
 10    0.59    0.40    0.23       43    0.72    0.32    0.44
 11    0.86   -0.04    0.69       44    0.50    0.53    0.22
 12    0.72    0.37    0.30       45    0.29    0.26    0.65
 13    0.46    0.49    0.32       46    0.49    0.50    0.36
 14    0.76   -0.02    0.60       47    0.47    0.32    0.31
 15    0.63    0.47    0.31       48    0.39    0.56   -0.02
 16    0.55    0.18    0.56       49    0.26    0.70    0.12
 17    0.75    0.62    0.23       50    0.69    0.61    0.02
 18    0.82   -0.09    0.66       51    0.21    0.63    0.02
 19    0.46    0.18    0.56       52    0.20    0.61    0.15
 20    0.40    0.58    0.22       53    0.22    0.40    0.41
 21    0.28    0.62    0.29       54    0.59    0.52    0.28
 22    0.57    0.59    0.32       55    0.47    0.27    0.58
 23    0.66    0.19    0.46       56    0.87    0.22    0.31
 24    0.63    0.38    0.08       57    0.42    0.70    0.09
 25    0.40    0.37    0.09       58    0.23    0.66    0.34
 26    0.24    0.32    0.66       59    0.30    0.70   -0.03
 27    0.77    0.40    0.37       60    0.40    0.77    0.04
 28    0.41    0.48    0.25       61    0.26    0.38    0.58
 29    0.23    0.37    0.28       62    0.30    0.53    0.44
 30    0.14    0.56    0.36       63    0.32    0.61   -0.04
 31    0.42    0.46    0.42       64    0.11    0.60    0.29
 32    0.57    0.59    0.07       65    0.34    0.57    0.13
 33    0.50    0.58    0.30       66    0.49    0.35    0.38
---------------------------------------------------------------
Percent of Variance        Factor I: 47.6      Factor II: 2.3
---------------------------------------------------------------


Table 4
Promax-Rotated Factor Loadings for the Three-Factor Solution

---------------------------------------------------------------
              Factor                           Factor
Item       I      II     III      Item       I      II     III
---------------------------------------------------------------
  1    0.06    0.54    0.02       34    0.75    0.00    0.08
  2    0.58    0.18   -0.09       35    0.59   -0.22    0.25
  3    0.60    0.23   -0.17       36    0.50    0.32   -0.05
  4    0.43    0.34   -0.13       37    0.04    0.62    0.29
  5    0.21    0.32   -0.05       38    0.79   -0.22   -0.07
  6    0.32    0.33   -0.07       39   -0.12    0.77    0.19
  7    0.50    0.03    0.12       40    0.20    0.37    0.41
  8    0.11    0.74    0.05       41    0.21    0.15    0.48
  9    0.33    0.10    0.05       42    0.58    0.17    0.12
 10    0.55    0.02    0.05       43    0.40    0.28    0.11
 11    0.18    0.48    0.05       44    0.65   -0.01    0.13
 12    0.53    0.10    0.05       45   -0.17    0.81    0.32
 13    0.67    0.04    0.11       46    0.54    0.21    0.16
 14    0.04    0.60    0.08       47   -0.16    0.52    0.39
 15    0.71   -0.01    0.07       48   -0.04    0.18    0.50
 16    0.52    0.26   -0.04       49    0.43    0.09    0.35
 17    0.44    0.14    0.34       50    0.54   -0.14    0.26
 18    0.24    0.47   -0.12       51    0.12    0.15    0.45
 19    0.14    0.49    0.16       52    0.33    0.13    0.37
 20    0.66   -0.04    0.21       53    0.48    0.25    0.13
 21    0.26    0.24    0.48       54    0.16    0.30    0.41
 22    0.80   -0.02    0.15       55   -0.02    0.75    0.19
 23    0.07    0.52    0.12       56    0.49    0.02    0.01
 24    0.36   -0.02    0.13       57    0.32    0.09    0.46
 25    0.17    0.13    0.21       58    0.42    0.26    0.40
 26    0.04    0.67    0.32       59    0.21    0.11    0.43
 27    0.60    0.06    0.12       60    0.45   -0.02    0.45
 28    0.80   -0.06    0.01       61   -0.12    0.76    0.36
 29    0.82   -0.21    0.01       62    0.44    0.28    0.30
 30    0.29    0.27    0.42       63    0.41   -0.08    0.29
 31    0.49    0.31    0.15       64    0.12    0.47    0.38
 32    0.83   -0.28    0.10       65    0.03    0.32    0.44
 33    0.11    0.50    0.38       66   -0.20    0.65    0.39
---------------------------------------------------------------


percent of variance accounted for by each factor is shown at the bottom of the table.

The results shown in Tables 3 and 4 strongly support the assumption of unidimensionality. Although the test of significance for the second and third factors provided by the TESTFACT program indicated that both the second and third factors contributed significantly to the reduction in misfit of the model to the data, the proportion of explained variance contributed by the first factor was quite large relative to the second and third factors. The first factor accounted for approximately 46 percent of the variance, while the second and third factors accounted for approximately 4 and 2 percent of the variance, respectively. The ratio of first factor variance to second factor variance is among the largest the authors have seen for binary test data.

The conclusion that the data are unidimensional is supported by two other findings. First, it was found that in the promax-rotated two-factor solution the factors were correlated 0.72. In the three-factor solution, factors 1 and 2 had a correlation of 0.71, factors 1 and 3 had a correlation of 0.59, and factors 2 and 3 had a correlation of 0.49.

Second, it was found in both the two- and three-factor solutions that the extra factors were very difficult to interpret. In the two-factor solution, neither factor was at all interpretable; there was no common thread such as content or item format relating the items that loaded most heavily on the same factor. In the three-factor solution, factor 1 appeared to be a general factor, factor 2 appeared to be related primarily to integration and differentiation, and factor 3 appeared to be related to item difficulty, though there were numerous exceptions to these interpretations. The appearance of what might be an item difficulty factor is troublesome, since the factor analytic method used by TESTFACT is supposed to avoid the finding of artifactual difficulty factors.

Goodness-of-Fit

As was previously indicated, two procedures for assessing the goodness-of-fit of the model to the data were employed. The first procedure, the inspection of item-ability regression plots, indicated appreciable misfit for six items--two from form A, two from form B, and two from the block of common items. The item-ability regression plots for all 112 items are shown in Appendix B. Items 1 through 46 are unique to form A, items 67 to 112 are unique to form B, and items 47 to 66 are the common items. Using this numbering system, the six items for which there was misfit were items 10, 15, 50, 55, 76, and 80. The item parameter estimates obtained for each item are also shown.


The second goodness-of-fit procedure, the chi-square test, indicated significant misfit for ten items--three from each form and four from the block of common items. Using the above numbering system, these items were items 2, 10, 16, 49, 50, 51, 55, 63, 76, and 87. Four of these were included within the group of items for which misfit was identified using the plots. As was stated earlier, significance was tested using a type I error rate of approximately 0.01. The error rate is approximate because the distribution of the statistic is not exactly chi-square. Obtained chi-squares and associated degrees of freedom are shown on the figures contained in Appendix B.

As a follow-up to the goodness-of-fit analyses, the items for which there was misfit were examined to determine whether there was any type of pattern to the misfit. No pattern was found, either in terms of item content or item parameter estimates. Nor was any pattern discernible in the item-ability regression plots.

Equating

Table 5 shows the formula score to scaled score conversions for all three equating methods that were used. The plot shown in Figure 1 provides a visual comparison of these three conversions. As can be seen, the IRT equating yielded a nonlinear conversion. Using the IRT conversion yields scaled scores that are slightly higher than those for the Tucker equating in the formula score range of 16 to 41. In all other segments of the formula score range, the Tucker equated scale scores are higher. As can be seen, the differences between the two sets of equated scores widen near the upper end of the formula score range. At the upper end of the scale, the difference between the two is greater than thirty scaled score points. Note, however, that the GRE Subject Test reported score ceiling is 990, and thus thirty-point differences between IRT and Tucker conversions would occur only at formula scores of 54 through 57.

An examination of the equipercentile conversion indicates that it yields scaled scores quite similar to those yielded by the IRT method. The equipercentile conversion is clearly nonlinear and, while it is not identical to the IRT conversion, it yields scaled scores that never differ from the IRT scaled scores by more than twelve scaled score points.
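The comparisons described above can be checked directly against Table 5. The sketch below copies a handful of rows from that table (not the full conversion) and computes the IRT-minus-Tucker and IRT-minus-equipercentile differences; the names are illustrative only.

```python
# (formula score, IRT, Tucker, equipercentile), copied from Table 5.
ROWS = [
    (57, 961.7, 986.3, 971.5),
    (54, 934.7, 955.0, 943.5),
    (42, 828.5, 829.8, 835.2),
    (41, 819.5, 819.4, 829.1),
    (16, 558.9, 558.6, 554.4),
    (15, 547.5, 548.1, 542.5),
]


def differences(rows):
    """Map formula score -> (IRT - Tucker, IRT - equipercentile)."""
    return {score: (round(irt - tucker, 1), round(irt - equi, 1))
            for score, irt, tucker, equi in rows}


diffs = differences(ROWS)
for score in sorted(diffs, reverse=True):
    irt_vs_tucker, irt_vs_equi = diffs[score]
    print(f"{score:>3}  IRT-Tucker {irt_vs_tucker:+6.1f}  IRT-Equi {irt_vs_equi:+6.1f}")
```

In these rows the sign of the IRT-minus-Tucker difference changes between formula scores 15 and 16 and again between 41 and 42, which matches the range cited in the text.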

DISCUSSION

Item response theory provides a very powerful alternative to traditional test analysis procedures. However, the power of IRT is acquired at the cost of some rather strong assumptions. Among these are the assumption that the test measures a single ability and the assumption that the relationship between examinee ability and item responses follows a particular parametric form--in this case, the three-parameter logistic model.


Table 5
Formula Score to Scaled Score Conversions

---------------------------------------------------------------------------
Formula        Score Conversions        Formula        Score Conversions
Score       IRT    Tucker     Equi.     Score       IRT    Tucker     Equi.
---------------------------------------------------------------------------
  66     1048.3    1080.2    1047.1       30      713.9     704.6     712.3
  65     1040.7    1069.7    1035.7       29      703.6     694.2     701.5
  64     1030.2    1059.3    1024.9       28      693.1     683.8     691.0
  63     1019.6    1048.9    1016.8       27      682.6     673.3     680.4
  62     1009.4    1038.4    1008.6       26      671.8     662.9     669.6
  61      999.4    1028.0    1000.5       25      661.0     652.5     658.5
  60      989.7    1017.6     992.8       24      650.0     642.0     647.5
  59      980.2    1007.1     985.7       23      638.9     631.6     636.5
  58      970.8     996.7     978.6       22      627.6     621.2     625.8
  57      961.7     986.3     971.5       21      616.3     610.7     615.1
  56      952.6     975.8     964.3       20      604.9     600.3     604.4
  55      943.6     965.4     953.9       19      593.4     589.9     591.7
  54      934.7     955.0     943.5       18      581.9     579.4     579.0
  53      925.9     944.5     933.0       17      570.4     569.0     566.4
  52      917.0     934.1     924.0       16      558.9     558.6     554.4
  51      908.2     923.7     915.2       15      547.5     548.1     542.5
  50      899.5     913.2     906.4       14      536.1     537.7     530.9
  49      890.7     902.8     897.7       13      524.9     527.3     520.2
  48      881.9     892.4     889.0       12      513.8     516.9     509.6
  47      873.1     882.0     880.3       11      502.8     506.4     499.0
  46      864.3     871.5     872.1       10      492.0     496.0     488.7
  45      855.4     861.1     863.9        9      481.2     485.6     478.5
  44      846.5     850.7     855.7        8      470.6     475.1     468.2
  43      837.5     840.2     847.2        7      460.1     464.7     457.2
  42      828.5     829.8     835.2        6      449.6     454.3     446.2
  41      819.5     819.4     829.1        5      439.1     443.8     435.8
  40      810.3     808.9     820.1        4      428.5     433.4     425.8
  39      801.1     798.5     810.5        3      417.9     423.0     415.6
  38      791.8     788.1     800.5        2      407.1     412.5     404.5
  37      782.4     777.6     790.4        1      399.2     402.1     393.4
  36      773.0     767.2     780.3        0      385.3     391.7     382.5
  35      763.4     758.8     769.8       -1      374.4     381.2     371.6
  34      753.7     746.3     759.2       -2      363.8     370.8     360.3
  33      743.9     735.9     748.7       -3      353.7     360.4     347.6
  32      734.0     725.5     736.7       -4      344.1     349.9     328.0
  31      724.0     715.1     724.5
---------------------------------------------------------------------------


Figure 1
Formula Score to Scaled Score Conversions

[Plot not reproduced. Horizontal axis: formula true score, form 3EGR. Curves: Tucker, IRT, and equipercentile conversions.]


The assumption of unidimensionality is probably the most restrictive of the IRT assumptions, and it is also probably the most critical, particularly if the three-parameter model is employed. If a test does not meet the unidimensionality requirement (and, precisely speaking, most tests do not), the use of the three-parameter logistic model can result in a shift in the meaning of examinee scores. If a test includes one major component and several incidental minor factors, use of the three-parameter model has the effect of "cleaning up" the test scores by eliminating the secondary factors. However, if the test includes two or more consequential factors, use of the three-parameter model for equating could result in a shift of the score scale if there were a shift in the population with regard to their relative strengths on the various factors.

The GRE Subject Test in Mathematics appears to be quite remarkable with regard to the extent to which the requirement of unidimensionality is met. The size of the first factor relative to the size of the second and third factors, coupled with the almost total lack of interpretability of the second and third factors, provides very strong evidence that the test measures essentially a single ability. Indeed, in the authors' experience, the relative size of the first factor of the Mathematics test is almost without precedent among tests with binary items.

The evidence supporting the goodness-of-fit of the three-parameter model is not so impressive, but it is still satisfactory. The item-ability regression procedure indicated poor fit for six items--roughly 5 percent of the items. The chi-square procedure indicated poor fit for ten items--about 9 percent of the items. The number of items for which misfit was identified using the item-ability regression method was about what would be expected by chance. The number for the chi-square method was probably a little higher than would be expected by chance, although this is not certain, since the exact distribution of the statistic used is not known.
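The percentages quoted above, and the count that the nominal chi-square error rate would imply, can be worked out as follows. This is plain arithmetic on the figures in the report; the expected count assumes the nominal 0.01 rate holds exactly, which the report notes is only approximately true because the statistic is not exactly chi-square distributed.

```python
N_ITEMS = 112          # 46 unique to form A, 46 unique to form B, 20 common
FLAGGED_PLOTS = 6      # items flagged by the item-ability regression plots
FLAGGED_CHISQ = 10     # items flagged by the chi-square test
ALPHA_CHISQ = 0.01     # nominal type I error rate for the chi-square test

share_plots = FLAGGED_PLOTS / N_ITEMS
share_chisq = FLAGGED_CHISQ / N_ITEMS
# Expected number of flagged items if every item fit and the nominal
# rate held exactly (an idealization, as noted in the text).
expected_chisq = N_ITEMS * ALPHA_CHISQ

print(f"plots:      {share_plots:.1%} of items flagged")
print(f"chi-square: {share_chisq:.1%} of items flagged")
print(f"chi-square flags expected at nominal rate: {expected_chisq:.2f}")
```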

The results of the equating comparisons strongly favor the IRT equating method. One of the hazards of operational use of a linear procedure such as the Tucker method is that the true conversion might not be linear. The IRT method has the advantage of allowing but not forcing linearity. Any natural curvilinearity in the conversion is allowed to emerge, as happened in this case.

Based on the results of this study, it seems reasonable to conclude that additional research on the use of IRT equating with the GRE Mathematics Test is warranted. In particular, a study investigating the feasibility of reducing test form overlap through the use of IRT equating seems in order. If such a study yielded positive results, it would be possible to achieve substantial improvement of test security through the reduction of test form overlap, without sacrificing score scale stability. Given the positive results of this study, such a study would seem to be the logical next step.


SUMMARY AND CONCLUSIONS

The purpose of this study was to assess the feasibility of using IRT true-score equating for the GRE Subject Test in Mathematics. The feasibility assessment included an analysis of the factor structure of the test using full-information factor analysis, two different analyses of the goodness-of-fit of the three-parameter logistic model to real response data, and a comparison of the results of IRT equating of the test to the results obtained using Tucker and equipercentile equating.

The results of the analyses indicated that the GRE Mathematics Test is highly unidimensional, and that the three-parameter logistic model yielded satisfactory fit to real response data. Results of the equating analyses strongly support the use of IRT equating with this test. The IRT and equipercentile equatings were quite similar. Both differed from the Tucker equating in that they revealed curvilinearity not indicated by the linear Tucker method.

Based on these results, it was concluded that use of IRT true score equating with the Mathematics Test is feasible, and that further research on the use of the IRT method should be performed. In particular, it was concluded that a simulation study examining the possibility of reducing test form overlap through the use of triple-part IRT equating should be conducted.


REFERENCES

Angoff, W. H. (1984). Scales. norms and equivalent scores. Princeton, NJ: Educational Testing Service.

Bock, R. D., Gibbons, R., & Muraki, E. (1985). Full information factor analysis. MRC Report 85-1. Chicago: National Opinion Research Center.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

Kingston, N. M. (1986). Assessing the dimensionality of the GMAT verbal and quantitative measures using full information factor analysis (ETS Research Report 86-13). Princeton, NJ: Educational Testing Service.

Kingston, N. M., & Dorans, N. J. (1985). The analysis of item-ability regressions: An exploratory IRT model fit tool. Applied Psychological Measurement, 9, 281-288.

Levine, R. S. (1955). Equating the score scales of alternate forms administered to samples of different ability (ETS Research Bulletin No. 23). Princeton, NJ: Educational Testing Service.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Wilson, D., Wood, R., & Gibbons, R. (1984). TESTFACT user guide. Mooresville, IN: Scientific Software.

Wingersky, M. S. (1983). LOGIST: A program for computing maximum likelihood procedures for logistic test models. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.

APPENDIX A

The Three-Parameter Logistic Model

The three-parameter logistic, or 3PL, model postulates that underlying examinees' responses to test items is a single unobservable ability. The probability of an examinee with a particular level of ability (θ) responding correctly to an item depends solely on three parameters associated with the item: a, the ability of the item to differentiate among examinees of different abilities; b, the difficulty level of the item; and c, the probability of an examinee with very low ability responding correctly.

The a, b, and c parameters determine the relationship between examinee ability and the probability of a correct response to an item according to a mathematical model assumed to have a logistic form. This logistic form, referred to as the 3PL model, is given by

$$P(\theta) = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}}$$
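The response function above can be evaluated directly. The following is a minimal sketch in Python, not the estimation software used in the study. The function name `p_correct` and the item parameter values are assumptions chosen only for illustration.

```python
import math

D = 1.7  # scaling constant that appears in the 3PL formula above


def p_correct(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability
    a: item discrimination
    b: item difficulty
    c: lower asymptote (probability of success at very low ability)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item; these values are illustrative, not estimates from the report.
a, b, c = 1.0, 0.0, 0.2
for theta in (-3.0, 0.0, 3.0):
    print(f"theta={theta:+.1f}  P={p_correct(theta, a, b, c):.3f}")
```

When ability equals the item difficulty (θ = b), the probability is c + (1 - c)/2, halfway between the lower asymptote and 1. At very low ability the probability approaches c.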

APPENDIX B

Item-Ability Regression Plots and Goodness-of-Fit Statistics

[Item-ability regression plots, one panel per item, labeled by item number. Each panel is drawn over an ability axis marked at -3, 0, and 3 and is annotated with the item's estimated a, b, and c parameters, its chi-square goodness-of-fit statistic (CHISQ), and the associated degrees of freedom (DF). Panels not reproduced.]