
Applied Measurement in Education. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/hame20

Detecting Gender-Based Differential Item Functioning on a Constructed-Response Science Test. Laura S. Hamilton. Published online: 07 Jun 2010.

To cite this article: Laura S. Hamilton (1999) Detecting Gender-Based Differential Item Functioning on a Constructed-Response Science Test, Applied Measurement in Education, 12:3, 211–235, DOI: 10.1207/S15324818AME1203_1

To link to this article: http://dx.doi.org/10.1207/S15324818AME1203_1


Detecting Gender-Based Differential Item Functioning on a Constructed-Response Science Test

Laura S. Hamilton
RAND
Santa Monica, California

In this study, I explored methods for detecting gender-based differential item functioning on a 12th-grade constructed-response (CR) science test administered as part of the National Education Longitudinal Study of 1988 (NELS:88). The primary difficulty encountered with many CR tests is the absence of a reliable and appropriate measure of ability on which to condition. In this study, several combinations of conditioning variables were explored, and results were supplemented with evidence from interviews of students who completed the test items. The study revealed that 1 item in particular displayed a large male advantage and contributed to the gender difference on total score. Results were similar to those obtained with the NELS:88 multiple-choice test. In both cases, gender differences were largest on items that involved visualization and called on knowledge acquired outside of school. Implications for users of large-scale assessment results are discussed.

Until recently, large-scale testing programs relied almost exclusively on the multiple-choice (MC) item format, primarily due to the need for standardization and inexpensive scoring. In the past several years, however, many assessment programs have adopted constructed-response (CR) item formats to supplement or replace MC items. CR items require students to produce rather than select their answers and are often presumed to measure reasoning in a way that is difficult or impossible with the MC format (Frederiksen, 1984; Resnick & Resnick, 1992; Shavelson, Carey, & Webb, 1990).

APPLIED MEASUREMENT IN EDUCATION, 12(3), 211–235. Copyright © 1999 Lawrence Erlbaum Associates, Inc.

Requests for reprints should be addressed to Laura Hamilton, RAND, 1700 Main Street, P.O. Box 2138, Santa Monica, CA 90407–2138. E-mail: [email protected]


One of the presumed benefits of CR items, especially on science tests, is a reduction in gender differences. Studies have revealed small but potentially important differences in the average measured science achievement of male and female students (e.g., Jones, Mullis, Raizen, Weiss, & Weston, 1992), and some evidence has suggested that, in fact, such differences were larger on MC than on CR assessments (Bolger & Kellaghan, 1990; Mazzeo, Schmitt, & Bleistein, 1993). However, results have been inconsistent, with open-ended items sometimes showing larger differences (e.g., Dunbar, Koretz, & Hoover, 1991; Mullis, Dossey, Owen, & Phillips, 1991). Furthermore, a review and synthesis conducted by the Educational Testing Service (Cole, 1997) revealed no clear format effect.

In contrast, there were fairly consistent findings with regard to the effect of content on gender differences in science achievement. Male students, on average, outperformed female students on physical science items, whereas little or no difference was typically observed on life science items (Becker, 1989; Burkam, Lee, & Smerdon, 1997; Fleming & Malone, 1983; Jovanovic, Solano-Flores, & Shavelson, 1994; Young & Fraser, 1994). On the 1991 International Assessment of Educational Progress (Beller & Gafni, 1996), the largest male advantage occurred for physical science and earth and space science items. Some studies have traced such differences to course-taking patterns or other aspects of opportunity to learn, including participation in extracurricular activities related to science (Johnson, 1987; Linn, 1985; National Assessment of Educational Progress, 1988).

The type of reasoning elicited by different types of items may also have affected the degree to which items exhibited gender differences. In particular, male students tended to outperform female students on measures requiring visual or spatial processing (Halpern, 1997; Lohman, 1993). Although the implications of this difference for achievement in science were not explored extensively, there was some evidence that it affected performance on certain types of mathematics items (e.g., Fennema & Tartre, 1985; Halpern, 1992). Male students tended to perform better on geometry items than did female students who were matched on total test score (O'Neill & McPeek, 1993), a result that may have reflected the spatial demands of geometry. The male advantage in spatial skills may have stemmed in part from differential exposure to activities that helped to develop those skills (Halpern, 1992; Linn & Hyde, 1989). Careful study of the features of items exhibiting gender differences is needed to understand the complex relations among format, content, and reasoning processes and their effects on the performance of male and female students.

This investigation focused on gender differences on CR science items administered as part of the National Education Longitudinal Study of 1988 (NELS:88). Emphasis was placed on identifying characteristics of items that exhibited particularly large differences. The research combined an exploratory differential item functioning (DIF) study with a set of interview data to provide evidence concerning sources of gender differences on the CR items. Implications for users of large-scale achievement test data are discussed.

DIF METHODS FOR CR ASSESSMENTS

Indexes of DIF are typically calculated as part of the test development process. DIF statistics reveal whether members of two groups, matched on the ability measured by the test, have different probabilities of answering an item correctly. Established procedures exist for dichotomous items (e.g., Angoff, 1993), and much work is being done to investigate DIF indexes for polytomously scored items. For example, the generalized Mantel–Haenszel (M–H) statistic (Agresti, 1990; Somes, 1986) extends the commonly used M–H procedure to items with more than two scoring categories (which are treated as unordered). Miller and Spray (1993) described a logistic regression procedure that extended the logistic regression model described by Swaminathan and Rogers (1990). Miller and Spray also discussed a logistic discriminant function analysis (LDFA) procedure in which probabilities of group membership were predicted from item and total test scores. Many CR items are polytomously scored, making these methods good candidates for DIF studies of CR tests.
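In the two-group case considered here, the LDFA model just described can be sketched as a logistic model for group membership. The notation below (z for the matching score, u for the studied item score) is generic shorthand for this setup rather than a quotation of Miller and Spray's formulas:

\[
P(\text{male} \mid z, u) \;=\; \frac{\exp(\beta_0 + \beta_1 z + \beta_2 u + \beta_3 z u)}{1 + \exp(\beta_0 + \beta_1 z + \beta_2 u + \beta_3 z u)}.
\]

Comparing this model with the reduced model containing only the matching score tests whether the item score carries information about group membership beyond the conditioning variable: a nonzero coefficient on the item score signals uniform DIF, and a nonzero coefficient on the interaction term signals nonuniform DIF.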

A difficulty that frequently arises with CR tests is the absence of a suitable conditioning variable. Most CR items require a longer response time than MC items, reducing the number of items that can be administered in a given amount of testing time. If only a few items are administered, as is typical with many performance assessments, total test score may not be an appropriate conditioning variable. Zwick (1990) demonstrated that (a) application of the M–H procedure may lead to incorrect conclusions regarding DIF when the ability distributions for the focal and reference groups differ and (b) this problem is exacerbated when the conditioning variable has low reliability. CR tests that include few items would be subject to this problem. However, including the studied item in the conditioning variable improves the estimation for both dichotomously and polytomously scored items (Donoghue, Holland, & Thayer, 1993; Zwick, Donoghue, & Grima, 1993).

An MC test in the same subject could be used as a conditioning variable, as long as the two formats measure similar abilities. However, if the assumption of construct equivalence between formats does not hold, DIF may be confused with item impact (i.e., differences in item performance due to differences in group means on a relevant ability) because students are not matched on the ability being measured by the studied test (Welch & Miller, 1995). In many cases, it may not be reasonable to assume construct equivalence, and this method, therefore, should be used with caution.

For complex performance tasks that may measure multiple abilities, a multivariate matching procedure may be most appropriate. Studies of test data as well as simulations have demonstrated that matching on more than one ability could reduce the number of items identified as exhibiting DIF on multidimensional tests and could improve interpretations of results (Ackerman, 1992; Clauser, Nungester, Mazor, & Ripkey, 1996; Mazor, Kanjee, & Clauser, 1995). It is also sometimes useful to condition on both ability and an educational background variable, depending on the purpose of the DIF investigation (Clauser, Nungester, & Swaminathan, 1997; Zwick & Ercikan, 1989). Logistic regression and LDFA, because they allow for multiple matching criteria, appear especially promising for DIF analyses of complex, open-ended assessments. In this study, I examined the effects of using a variety of matching criteria on the number and types of items identified as exhibiting DIF on a CR science achievement test.

DESIGN AND METHOD

The NELS:88 High School Effectiveness Study Sample and Science Tests

NELS:88 was the most recent in a series of large-scale, longitudinal surveys sponsored by the National Center for Education Statistics (NCES). It is widely used as a research tool to study student achievement and its correlates. NELS:88 followed a national probability sample of 8th grade students into the 10th and 12th grades using cognitive tests in four subjects as well as questionnaires completed by students, parents, teachers, and school administrators. As a supplementary study, NCES conducted the High School Effectiveness Study (HSES; Pollack & Rock, 1997), in which 10th-grade students from 247 high schools were sampled in 1990 and followed into the 12th grade. These students completed the same questionnaires and tests that were administered to the full NELS:88 sample.

The NELS:88 Grade 12 MC science test included 25 MC items and had a 20-min time limit (for additional information, see Rock & Pollack, 1995). A subsample of 2,204 HSES Grade 12 students also completed four CR science items (Pollack & Rock, 1997). Each item had a 10-min time limit and required students to supply one or more brief written explanations or diagrams. The items included the following:

1. Nuclear and fossil fuels (hereafter fuels): Write a brief essay outlining advantages and disadvantages of each.

2. Eclipses: Produce diagrams of solar and lunar eclipses and explain why one can be seen from a greater geographical area on earth.

3. Rabbit and wolf populations (hereafter populations): Given a graph representing the population of rabbits, produce a graph representing the population of wolves, subject to certain constraints, and explain features of the graph.


4. Heating curve: Explain segments of a graph representing the temperature of a mixture as a function of time (the mixture contains water and ice, and is being heated over an open flame).

The CR items were scored by teams of readers, most of whom taught high school science. Each item was broken down into several components and scored using categories of possible responses. This analytic scoring system preserved information on specific parts of students' responses. After scoring was completed, the readers and test developers created a system for combining the analytic scores into a set of ordered categories for each item. This process resulted in a 6-point scale score for each item, ranging from 0 (apparent absence of understanding) to 5 (complete and correct responses to all parts of the item). Additional information about the items and their scoring, including interrater agreement, can be found in the NCES report by Pollack and Rock (1997).

Statistical Analysis

The LDFA procedure (Miller & Spray, 1993) was the primary method used in this study to investigate DIF on the CR items. This method is more flexible than the chi-square method, permitting the inclusion of multiple conditioning variables and interaction terms. Furthermore, for polytomously scored items, most chi-square methods treat the response categories as unordered, resulting in a loss of information. In the LDFA procedure, probabilities of group membership (in this case, male vs. female) are predicted from total test score, item score, and their interaction, with likelihood ratio tests conducted for main effects and interaction models. In this study, a Type I error rate of .01 was used for the likelihood ratio tests, although magnitude of change was considered in addition to statistical significance. Results of analyses using several sets of conditioning variables were compared, and both uniform and nonuniform DIF were investigated. Conditioning variables included total CR score, total item response theory (IRT) score from the MC science test, several subscores of the MC test, and IRT scores from the NELS:88 reading and math tests. Reading and math scores were included because responses to CR items might have drawn heavily on verbal or quantitative abilities, possibly affecting the relative performance of male and female students. As suggested by Miller and Spray, for items that exhibited DIF, confidence bands were constructed around the estimated logistic discriminant function to assess the practical importance of DIF.
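As an illustration of how these likelihood ratio tests can be carried out, the sketch below uses Python with statsmodels. The data frame and column names (gender coded 0/1, cr_total, and an item score column) are hypothetical stand-ins rather than variables from the original analysis, and the interaction is formed with the first conditioning variable only, for simplicity.

import statsmodels.formula.api as smf
from scipy import stats

def ldfa_dif(df, item, conditioning):
    """Likelihood ratio tests for uniform and nonuniform DIF on one
    polytomously scored item, following the LDFA logic: predict group
    membership from the conditioning variable(s), then test whether the
    item score (uniform DIF) and its interaction with the first
    conditioning variable (nonuniform DIF) improve the fit."""
    cond = " + ".join(conditioning)
    base = smf.logit(f"gender ~ {cond}", data=df).fit(disp=0)
    unif = smf.logit(f"gender ~ {cond} + {item}", data=df).fit(disp=0)
    nonunif = smf.logit(
        f"gender ~ {cond} + {item} + {item}:{conditioning[0]}", data=df
    ).fit(disp=0)

    def lr_test(restricted, full, df_diff=1):
        g2 = 2 * (full.llf - restricted.llf)
        return g2, stats.chi2.sf(g2, df_diff)

    g2_u, p_u = lr_test(base, unif)       # item main effect: uniform DIF
    g2_n, p_n = lr_test(unif, nonunif)    # interaction term: nonuniform DIF
    return {"uniform": (g2_u, p_u, p_u < .01),
            "nonuniform": (g2_n, p_n, p_n < .01)}

# Hypothetical usage: the eclipses item conditioned on CR total score.
# results = ldfa_dif(df, "eclipses", ["cr_total"])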

The subscores of the MC test used in this investigation were derived through full-information item factor analysis (see Bock, Gibbons, & Muraki, 1988). Previous studies of the science MC test have revealed the utility of treating this test as multidimensional, and analyses of several different samples in both Grades 10 and 12 have yielded consistent results (Hamilton, Nussbaum, Kupermintz, Kerkhoven, & Snow, 1995; Nussbaum, Hamilton, & Snow, 1997). The dimensional analyses conducted in these studies were replicated for the HSES sample used here, and factor scores were estimated for the resulting dimensions. These factor scores were estimated a posteriori (EAP) scores, Bayes estimates of the mean of the posterior ability distribution given the observed response pattern over all items on the test.
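For reference, the EAP score for a response pattern u is simply the mean of the posterior ability distribution. The standard definition, stated here in generic notation rather than quoted from Bock, Gibbons, and Muraki (1988), is

\[
\hat{\theta}_{\mathrm{EAP}} \;=\; \frac{\int \theta\, P(\mathbf{u}\mid\theta)\, g(\theta)\, d\theta}{\int P(\mathbf{u}\mid\theta)\, g(\theta)\, d\theta},
\]

where g(θ) is the prior ability distribution and P(u | θ) is the likelihood of the observed responses. For the multidimensional solution used here, θ is a vector and the integrals run over all dimensions, evaluated in practice by quadrature.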

Interview Study Procedures

To supplement the statistical analysis, 25 high school students were interviewed and asked to think aloud as they completed the four CR items and a subset of 16 MC items. Participants also responded to a set of postitem interview questions that elicited additional information concerning solution strategies and sources of knowledge. Interviews were audiotaped and transcribed. The two interviewers used a detailed protocol and a structured observation sheet on which they recorded events not captured on audiotape, such as the use of gestures. The four CR items were scored by two raters using the rubrics provided by the item developers (Pollack & Rock, 1997). Agreement was good, with weighted kappa values (Agresti, 1990) of .93 for fuels and .97 for the other three items. A third rater scored the papers on which the original two raters disagreed, and the final score assigned to each paper was the one on which two of the three raters agreed.
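Interrater agreement of this kind can be computed with a weighted kappa. The short sketch below uses scikit-learn's implementation with quadratic weights; the scores are invented for illustration, and the choice of quadratic weights is an assumption, since the article cites Agresti (1990) without stating the weighting scheme used.

from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-5 scale scores assigned by two raters to the same ten papers.
rater1 = [0, 1, 3, 5, 2, 4, 1, 0, 3, 5]
rater2 = [0, 1, 3, 4, 2, 4, 1, 1, 3, 5]

# Weighted kappa penalizes large disagreements more heavily than adjacent ones.
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")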

Several coding categories were created to capture a range of strategies for responding to MC items and to identify the most common sources of knowledge. Selection was based in part on observations gathered during a previous study (Hamilton, Nussbaum, & Snow, 1997) in which some items tended to evoke particular types of responses, such as gestures and visualization. Four transcripts were randomly selected for coding by a second rater, and kappa values were .80 or higher for each. The raters resolved discrepancies through discussion.

RESULTS

Dimensions of the MC Test

Results of the full-information item factor analysis of the HSES MC data are presented in Table 1. Three dimensions emerged, with patterns of loadings nearly identical to those obtained in the full NELS:88 sample in the studies cited earlier. The dimensions included (a) spatial–mechanical reasoning (SM), with items that required interpretation of visual or spatial relations; (b) quantitative science (QS), involving chemistry and physics content and use of mathematical formulas; and (c) basic knowledge and reasoning (BKR), consisting primarily of items that called for application of concepts and reasoning in biology, astronomy, and, to a lesser degree, chemistry. BKR items also covered some overarching themes in science, such as experimental design and the difference between models and observations. Correlations among factors were 0.67 between SM and QS, 0.73 between SM and BKR, and 0.76 between QS and BKR.

The factor interpretations were based on inspection of item content and on observations of student responses obtained through interviews. The factors reflected content to some extent. Most physical science items loaded on QS or SM, whereas most life science items loaded on BKR. However, the factors were distinguished not only by content but also by the form of reasoning required.


TABLE 1
Factor Loadings From Full-Information Factor Analysis of the National Education Longitudinal Study of 1988 Grade 12 Science Multiple-Choice Test Items After Promax Rotation, High School Effectiveness Study Sample

Master Item No.   Description            SM      QS      BKR     Communality Estimate
S27               Lever                  0.72a   0.06    0.04    .54
S29               Camera lens            0.67a   0.15    0.01    .71
S28               Contour map            0.56a   0.14    0.20    .67
S12               Earth orbit            0.50a  –0.13    0.41    .59
S36               Pendulum               0.43    0.11    0.29    .87
S14               Mix water              0.39a   0.18    0.29    .64
S38               Train                  0.33    0.29    0.26    .52
S37               Hydrogen reaction      0.13    0.77a  –0.13    .57
S35               Uranium decay         –0.10    0.77a   0.14    .70
S30               Half life              0.24    0.67a   0.03    .78
S26               Calculate mass         0.23    0.56a   0.17    .83
S16               Enzyme graph           0.10    0.54a   0.20    .60
S05               Light of the moon      0.25   –0.26    0.72a   .59
S06               Simple reflex          0.08   –0.09    0.72a   .61
S17               Algae                  0.16    0.06    0.66a   .74
S04               Experimental design   –0.07    0.12    0.59a   .40
S10               Classify substances    0.16    0.13    0.56a   .69
S34               Fish population       –0.10   –0.04    0.55a   .22
S33               Tissue                 0.11   –0.01    0.53a   .40
S31               Population graph      –0.12    0.34    0.52a   .53
S19               Chemical change        0.28    0.00    0.50a   .52
S18               Storm                  0.24    0.08    0.43a   .48
S22               Food chain            –0.08    0.29    0.43a   .43
S15               Respiration           –0.08    0.08    0.36a   .14
S24               Model/observation      0.12    0.22    0.31a   .36

Notes. N = 52,240. Correlations among estimated a posteriori scores were .54 for SM and QS, .62 for SM and BKR, and .56 for QS and BKR. SM = spatial–mechanical reasoning; QS = quantitative science; BKR = basic knowledge and reasoning.
a Highest loading for each item.


An interview study (Hamilton, 1997) revealed that SM items elicited reports of visualization and prediction, BKR items depended heavily on verbal reasoning ability, and QS items were most likely to reflect application of specific factual knowledge learned in science class.

MC and CR Performance of Male and Female Students

Descriptive statistics for the standardized EAP scores (M = 0, variance = 1) from the Grade 12 MC test are presented separately for male and female students in Table 2. SM showed a large gender difference, with male students scoring an average of nearly one half standard deviation higher than female students. In contrast, gender differences on QS and BKR were minimal. These findings were consistent with those obtained using the full NELS:88 sample. They revealed that the commonly observed gender difference on physical science items was not, at least in this case, due solely to content. Both SM and QS consisted primarily of physical science items, but no gender difference was observed on QS. Instead, examination of performance across dimensions and inspection of items suggested that visual or spatial reasoning demands contributed to the gap in physical science performance between male and female students.

Table 3 provides frequencies of scores at each scale score level for the CR items, broken down by gender. The totals for each score revealed strong skewness, with relatively few students scoring at the highest levels. Especially noteworthy was the difference in numbers of male and female students at score level 5 on eclipses. Although more students achieved the highest possible score on this item than on the other three, the male:female student ratio was substantial. As with the SM MC items, this difference may reflect visual and spatial reasoning demands. Similar but less extreme results were obtained for score levels 4 and 5 on both the fuels and populations items.


TABLE 2
Descriptive Statistics by Gender on Grade 12 Multiple-Choice Dimensions

                  Female Students                    Male Students
Dimension    M       SD      Q1      Q3        M       SD      Q1      Q3
SM          –0.24    0.984   –1.04   0.56      0.24    0.960   –0.52   1.09
QS          –0.01    0.958   –0.78   0.69      0.01    1.040   –0.85   0.88
BKR         –0.03    0.992   –0.82   0.77      0.03    1.008   –0.61   0.81

Note. Q1 = first quartile; Q3 = third quartile; SM = spatial–mechanical reasoning; QS = quantitative science; BKR = basic knowledge and reasoning.


DIF

DIF detection on the CR test presented two major challenges: (a) the absence of a reliable matching criterion in the form of total score on a test of similar items and (b) the polytomous scoring system. The second problem could be addressed by the use of the LDFA procedure (Miller & Spray, 1993). This study investigated solutions to the first problem by examining the effects of including various sets of conditioning variables.

The LDFA procedure was carried out several times for scale scores on each of the four CR items, using various combinations of science achievement measures as matching criteria. CR total score (which included the studied item) and MC total (IRT) score were each investigated separately. Although it could not be assumed that the MC test measured the same construct as the CR test, it was of interest to discover whether the CR items exhibited gender differences for groups of male and female students who were matched on the kind of science achievement measured by traditional tests.

The effects of conditioning on subscores of the MC test were also explored. Because BKR appeared to be the most similar of the three MC dimensions to general science achievement, it was included alone as a matching variable. Then QS was added, followed by SM, which was the least like a general science achievement measure. The MC scores and CR total score were also entered in combination.


TABLE 3
Frequencies of Constructed-Response Scale Scores by Gender

                                        Score
Item                       0      1      2      3      4      5    Total
Nuclear and fossil fuels
  Male students          299    317    181    126     99     50    1,072
  Female students        435    340    131     82     55     25    1,068
  Total                  734    657    312    208    154     75    2,140
Eclipses
  Male students          124    120    359    267     30    168    1,068
  Female students        229    234    357    174     19     50    1,063
  Total                  353    354    716    441     49    218    2,131
Rabbit and wolf populations
  Male students          292    237    308     87     64     55    1,043
  Female students        366    266    256     81     47     37    1,053
  Total                  658    503    564    168    111     92    2,096
Heating curve
  Male students          188    444    142    202     34     19    1,029
  Female students        180    414    195    208     25     15    1,037
  Total                  368    858    337    410     59     34    2,066


Finally, reading and math scores from NELS:88 were included to test the hypothesis that matching on measures of verbal or quantitative ability would affect DIF results.

Table 4 indicates for which items and matching criteria significant DIF was revealed. Although earlier studies of the MC dimensions revealed differences in their patterns of relations with other variables, use of the dimensions did not affect the DIF results. For fuels and populations, the only factor determining whether DIF was present was the inclusion of CR total score as a matching criterion. When CR total score was excluded, the LDFA procedure suggested the presence of DIF in favor of male students for both items. Eclipses and heating curve showed DIF regardless of matching criteria, with eclipses favoring male students and heating curve favoring female students. The only instance of nonuniform DIF occurred for heating curve when BKR and QS were included as matching criteria. It appears, therefore, that choice of conditioning variable had little effect and that eclipses and heating curve in particular warranted further investigation.

To assess the importance of DIF on these two items, Scheffé-type confidence bands were constructed around the logistic discriminant function curve at each score level of each item. For simplicity, only one conditioning variable was included in each set of plots. Confidence bands were constructed for models that included only CR total score as a conditioning variable and also for those that included only the MC IRT score. This procedure was somewhat conservative (Hauck, 1983), so a Type I error rate of 0.05 was used for these analyses. Each plot shows the null model (e.g., probability of being a male student given CR total score) and the full model (e.g., probability of being a male student given CR total score and item score, with the interaction term included).


TABLE 4
DIF Results for CR Science Test

Matching Criteria              Fuels   Eclipses   Populations   Heating Curve
CR total                        –       M          –             F
IRT                             M       M          M             F
BKR                             M       M          M             F
BKR + QS                        M       M          M             F–NU
BKR + QS + SM                   M       M          M             F
CR + IRT                        –       M          –             F
CR + BKR                        –       M          –             F
CR + BKR + QS                   –       M          –             F
CR + BKR + QS + SM              –       M          –             F
CR + BKR + QS + SM + read       –       M          –             F
CR + BKR + QS + SM + math       –       M          –             F

Note. DIF = differential item functioning; CR = constructed response; M = DIF in favor of male students; F = DIF in favor of female students; – = no significant DIF; IRT = item response theory; BKR = basic knowledge and reasoning; QS = quantitative science; NU = nonuniform DIF; SM = spatial–mechanical reasoning.


Confidence bands are given for the full model. Score regions where the curve for the null model lies outside the confidence bands indicate practically important DIF (Miller & Spray, 1993). Figure 1 shows 95% confidence bands for each score level of eclipses for the total CR score model, and Figure 2 gives confidence bands for eclipses when MC IRT score was used. Figures 3 and 4 show similar plots for heating curve. For illustrative purposes, all six score levels are plotted for Figure 1, but the remaining figures show plots only for the lowest and highest score levels.
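As a sketch of the band construction referred to above (the general Scheffé-type form discussed by Hauck, 1983, written here in generic notation rather than taken from the article), the simultaneous 100(1 − α)% limits at a predictor vector x₀ are

\[
\mathrm{logit}^{-1}\!\left( \mathbf{x}_0^{\top}\hat{\boldsymbol{\beta}} \;\pm\; \sqrt{\chi^2_{k,\,1-\alpha}\; \mathbf{x}_0^{\top}\widehat{\mathbf{V}}\,\mathbf{x}_0} \right),
\]

where β̂ is the vector of estimated logistic coefficients, V̂ its estimated covariance matrix, and k the number of parameters. Plotting these limits across the range of the matching score, with the item score fixed at each level, produces bands of the kind shown in Figures 1 through 4.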

The curve for the null model in Figure 1 shows that the predicted probability of being a male student was strongly related to total CR score. When score on eclipses was included in the model, however, the relation between probability of being a male student and total CR score was greatly diminished. Throughout most of the CR total score range, students receiving low scores on eclipses were significantly less likely to be classified as male than they would have been if eclipses scores were ignored, and students receiving high scores on eclipses were more likely to be classified as male. It should be noted that because of the dependency between eclipses and CR total, some high and low scores in these plots were impossible; for example, students scoring 3 on eclipses could not receive a CR total score higher than 18.

The finding of DIF may result, in part, from the mutual influence of the items eclipses and heating curve, which show DIF in opposite directions (Wang & Lane, 1996). Therefore, it was worthwhile to investigate the effects of using a matching criterion that did not include these two items. Using CR score was not feasible because only two items would remain. Instead, MC IRT score was used alone as a matching criterion to examine the practical importance of DIF on these items. As discussed earlier, this method was less than ideal because MC score could not be interpreted as representing the ability measured by the CR test. Nonetheless, it was informative to show the relation between gender and CR item scores for students matched on MC score. Confidence bands for the logistic discriminant function curve for each score level on eclipses are given in Figure 2.¹ These plots reveal a weaker relation between eclipses and MC than between eclipses and total CR score. This was to be expected because of the format differences but also because of the dependency of CR total on eclipses. Still, students receiving low scores on eclipses were more likely to be classified as female than they would have been based only on their MC scores; this was especially true for those with high MC scores. At higher score levels on eclipses, students receiving low MC scores were more likely to be classified as male than they would have been if eclipses were ignored.


¹The range of item response theory (IRT) scores was from 10 to 35. IRT score provides an estimate of the total number of items the student would have answered correctly if he or she had taken all of the items that appeared on any version of the science test at any grade.




FIGURE 1 Ninety-five percent confidence bands for logistic discriminant function analysis curve, constructed-response (CR) Item 2 (eclipses), using CR total score as matching criterion.


FIGURE 2 Ninety-five percent confidence bands for logistic discriminant function analysis curve, constructed-response (CR) Item 2 (eclipses), using multiple-choice item response theory (MC IRT) score as matching criterion.


FIGURE 3 Ninety-five percent confidence bands for logistic discriminant function analysis curve, constructed-response (CR) Item 4 (heating curve), using CR total score as matching criterion.


FIGURE 4 Ninety-five percent confidence bands for logistic discriminant function analysis curve, constructed-response (CR) Item 4 (heating curve), using multiple-choice item response theory (MC IRT) score as matching criterion.


Confidence bands for heating curve based on total CR score appear in Figure 3. The pattern here is the opposite of that for eclipses. The curve for the null model lies below the lower confidence band at high CR total score levels and low score levels on heating curve, and above the upper confidence band at low CR total levels and high heating curve levels.

Unlike eclipses, however, the results based on MC IRT score did not appear to be practically significant for heating curve. Figure 4 shows that at each level of heating curve, the function for the null model lies within the 95% confidence bands for the full model. In other words, knowing a student's score on heating curve apparently provided no additional information about his or her probability of being male once total MC score was known.

Cumulative logits analysis. Gender differences on CR scale scores might have been larger at some score level transitions than at others. For example, perhaps male and female students were equally likely to reach some minimum level of performance, but more male students than female students achieved scores of 4 and 5. Furthermore, even items that did not exhibit DIF when scale score was studied might have shown important differences at particular score levels. To explore these possibilities, DIF analyses were conducted for each of five possible scale-score splits on each CR item. This procedure is known as cumulative logits analysis (Agresti, 1990; French & Miller, 1996). Scores were divided into two groups for each analysis: those scoring at or below a certain level and those scoring above. Three separate sets of conditioning variables were used for each analysis: CR total score, CR total plus MC IRT, and CR total plus the three MC dimensions. The analyses were conducted using logistic regression, and both uniform and nonuniform DIF were investigated.
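A minimal sketch of one such score-level split is given below, again in Python with statsmodels; the item score is dichotomized at a chosen cut, and nested logistic regressions test uniform and nonuniform DIF. The column names, the 0/1 gender coding, and the use of a single conditioning score are illustrative assumptions, not details taken from the original analysis.

import statsmodels.formula.api as smf
from scipy import stats

def cumulative_split_dif(df, item, cut, conditioning="cr_total"):
    """DIF tests for one split of a polytomous item: code 1 if the student
    scored above `cut`, 0 otherwise, then compare nested logistic models
    for uniform DIF (gender main effect) and nonuniform DIF (gender x
    conditioning-score interaction)."""
    data = df.copy()
    data["above"] = (data[item] > cut).astype(int)

    base = smf.logit(f"above ~ {conditioning}", data=data).fit(disp=0)
    unif = smf.logit(f"above ~ {conditioning} + gender", data=data).fit(disp=0)
    nonunif = smf.logit(f"above ~ {conditioning} * gender", data=data).fit(disp=0)

    def lr(restricted, full):
        g2 = 2 * (full.llf - restricted.llf)
        return g2, stats.chi2.sf(g2, full.df_model - restricted.df_model)

    return {"uniform": lr(base, unif), "nonuniform": lr(unif, nonunif)}

# Hypothetical usage: test every split of the eclipses item against CR total.
# for cut in range(5):   # splits 0 vs. 1-5, 0-1 vs. 2-5, ..., 0-4 vs. 5
#     print(cut, cumulative_split_dif(df, "eclipses", cut))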

Consistent with the findings from the previous analysis, fuels and populations showed no DIF at any score level.² Eclipses showed DIF in favor of male students, regardless of matching criteria, at every score-level split except that between 0 and 1. In other words, equally able (as measured by the various conditioning variables) male and female students did not differ in their probabilities of attempting the problem and providing at least an incomplete explanation about distance (the requirements for a score of 1), but they did differ at all other score levels. Nonuniform DIF was not observed. Heating curve, which showed DIF favoring female students when scale score was used, favored female students at the lowest three splits but not at the highest two. This suggests that female students were more likely than their equally able male counterparts to provide at least a partially correct response but were no more likely to achieve the highest scores. In contrast to eclipses, which showed greater DIF at higher score levels, DIF was observed on heating curve primarily at lower levels of performance. The next section provides more detail on these findings by analyzing categorical subscores separately.


²Tables are not presented for these analyses but are available from the author.


Categorical CR scores. As discussed earlier, scale scores may have masked important information contained in responses to specific parts of an item. The analytical subscores provided some data about these responses. These analytical scores were categorical in nature, but many could have been easily dichotomized (e.g., categories of responses to the solar eclipse diagram could have been divided into two groups: drew a correct diagram or not). DIF analyses for each categorical score were conducted using the conditioning variables described in the previous section. The results were not affected by the choice of conditioning variables. There were two main findings. First, on eclipses, significant DIF in favor of male students was observed for all three parts of the item: drawing the solar eclipse, drawing the lunar eclipse, and writing the explanation about relative distances. In other words, male students were more likely to produce accurate diagrams of both the solar and lunar eclipses than were female students and were more likely to provide a correct explanation. Male and female students were equally likely, however, to provide at least a partial explanation. The other interesting result was for heating curve, in which DIF in favor of female students was observed for only the first part of the item (describing what was happening as the ice began to melt). However, in contrast to the results for eclipses, female students were more likely than equally able male students to supply a response to this item that was at least partially correct but were no more likely to provide a completely correct response. Significant DIF was not observed on the other parts of the item. These results suggest that the relative weighting assigned to various parts of the items could have had a small effect on the extent to which DIF was observed but would have been unlikely to change the basic conclusions about gender differences. Eclipses showed the strongest DIF of the four CR items, and all three of its parts exhibited DIF.

Interview Study

In this section, results from the interview study of the CR items are described briefly, with a focus on shedding light on the gender differences observed on some items. The item of primary interest with regard to the DIF study was eclipses. The major difference between eclipses and the other CR items was that most responses to eclipses included evidence of visualization. Only four students reported no visualization, and their average score was 0.25, compared to a mean of 3.86 for the students who did report visualization. The nonvisualizers said they had no idea how to approach the problem or had never learned about eclipses, so they probably lacked the knowledge needed to form a visual image. Successful students reported forming mental images of the solar system; for example, "I just thought about it and imagined what it looks like." Many students also used gestures when describing their reasoning, and gestures were often observed in conjunction with spatial material (Rauscher, Krauss, & Chen, 1996). These responses suggested that eclipses elicited a form of spatial reasoning similar to that used on the SM MC dimension, which was the only dimension on the MC test to exhibit a large gender difference. Successful responses to eclipses typically involved a combination of knowledge acquired in or outside of school with reasoning of a visual or spatial nature. The gender difference on eclipses was, therefore, not surprising, given the large gender differences on most of the SM items. Of course, SM and eclipses did not function identically, and eclipses showed DIF even for students who were matched on SM. The spatial demands may have been greater on eclipses than on similar MC items because of the unstructured nature of the CR items.

Most students reported using knowledge learned outside of school to complete this item. This was particularly true for the lunar eclipse diagram; 13 of the 16 students who produced a correct diagram said they had never learned about lunar eclipses in school. These students reasoned from other parts of the item (e.g., "I know in a solar eclipse the moon is between the sun and the earth, so a lunar eclipse must be the other way around and have the earth between the moon and the sun") or relied on information they had been exposed to through television, books, newspapers, or actual eclipse viewing. Again, this item was similar to the SM MC items, which also elicited reports of using outside experiences.

After eclipses, fuels was the item that showed the largest gender difference in its raw distribution. Interestingly, this item favored male students even though it consisted solely of a single essay and thus might have been expected to favor female students. However, analysis of the interview results as well as examination of the scoring rubric revealed that writing ability had little or no effect on scores. Students received points for mentioning at least one advantage and one disadvantage of each type of fuel and were not rewarded for organization, mechanics, or style. Interview respondents rarely displayed evidence of the kind of planning that usually accompanies writing tasks. Seventeen of the 25 students started writing immediately after reading the question, and even those who did some planning tended to list the major points quickly and then write them down.

As on eclipses, students reported using outside knowledge more often on this item than on either of the remaining CR items (heating curve and populations). Thus, one consistent feature of the items favoring male students in either format was an apparent need to apply knowledge or skills beyond those currently emphasized in most science classrooms. Further evidence for this conjecture was obtained when performance on the nuclear and fossil fuels parts of the item was studied separately: The gender difference on this item was due to male students' superior performance on the discussion of nuclear fuels; no difference occurred for fossil fuels. Although most students said they had studied fossil fuels in school, knowledge about nuclear fuels was typically obtained through outside reading or television viewing. The difference between scores of male and female students on this item, then, arose solely from differences on the part of the item that tended to call on outside knowledge. This result suggests that efforts to improve the science achievement of female students might profitably consider ways in which such extracurricular experiences could be incorporated into formal science instruction.

Heating curve was the only item that exhibited DIF in favor of female students. Its raw distribution showed virtually no gender difference, in contrast to the other three CR items, which all favored male students. The interviews revealed that the distinguishing feature of this item was its similarity to an activity that students had encountered in class. Eighty percent of participants said they had conducted an experiment similar to the one described in this item, in sharp contrast to the other CR items, which were unfamiliar to most students. Also in contrast to the other items, knowledge acquired outside of school did not appear to enhance performance on this item. In fact, both male and female students who referred to outside experiences, such as boiling water in the kitchen, tended to receive low scores because their written responses did not include the kind of scientific terminology (such as reference to potential and kinetic energy) that high scores required. These results were consistent with other research that demonstrated a relative female advantage on items that resembled textbook material or that were closely tied to the curriculum (Hanna, 1989; O'Neill & McPeek, 1993). The heating curve results provided additional support for the assertion that the items most likely to favor male students were those that were the least closely tied to the school curriculum.

DISCUSSION

This study investigated DIF on a CR test under less than ideal circumstances. In particular, there was no appropriate or reliable total score on which to condition. The results revealed, however, that an exploratory DIF study can be valuable when multiple measures of achievement are available. Flagged items perhaps should not be assumed to be exhibiting DIF in the usual sense (i.e., reflecting differential performance of equally able subgroup members), but it is nonetheless informative to discover that male and female students who were matched on several MC measures displayed unequal probabilities of success on a CR item. More detailed information about the nature of the group difference on a given item can be obtained through procedures such as the cumulative logits analysis or the analysis of categorical subscores. Furthermore, this study revealed the benefits of supplementing quantitative analyses with an investigation of the cognitive processes that items elicit. Efforts to explain the sources of DIF are thus not solely dependent on inspection of item content. Although only a single test was examined here, limiting generalizability of findings to other CR tests, the study does suggest the value of the analytic approach for evaluating group differences on large-scale achievement tests.


This study raises questions concerning the sources of gender differences on particular items. Is the male advantage on SM items and on the eclipses item due to differences in course taking or exposure to science outside of school, or can it be attributed to a more highly developed spatial ability that is formed in the elementary school years or even earlier? For the items discussed here, the interviews provided support for the hypothesis that the SM items and eclipses had some dependence on visual or spatial reasoning in common. The interviews also revealed the importance of knowledge acquired outside of school, particularly for items that favored male students.

The finding of DIF on the heating curve item also raises questions. This was the only CR item on which male and female students performed equally well. Consequently, it was flagged because female students received higher scores than expected based on their CR performance. This item might be interpreted as being unfair to male students, if total CR score is believed to reflect some underlying ability on which they are superior. On the other hand, perhaps this item is the only one on the CR test that is not unfair to female students. If this is the case, total CR score might be an inappropriate criterion for DIF studies. When MC score was used as a conditioning variable, the heating curve item did not exhibit any clear sign of DIF. The interviews suggested that heating curve might function more similarly to some of the MC items than to the other CR items, particularly in its reliance on school-based knowledge and skills versus extracurricular experiences. Although the DIF study did not provide clear guidance concerning which items should be considered biased or whether any should be eliminated, it did reveal that simple rules regarding content or format are insufficient to explain gender differences on science achievement tests.

Achievement tests such as those in the NELS:88 survey are often used to study relations among school performance, group membership, and educational background. The results of this study suggest that users of large-scale assessments should carefully examine their achievement measures before using them to make inferences about group differences. The CR format does not necessarily reduce the male advantage in science and may, in fact, increase it under certain circumstances. The magnitude of gender difference is also affected by the content and reasoning requirements of the test items and is subject to change depending on the nature of the items that make up a particular test. Conclusions about group differences should therefore be informed by careful study of the achievement measure and the items it comprises.

ACKNOWLEDGMENTS

This research was supported by a Spencer Foundation Dissertation Fellowship and National Science Foundation Grant RED–9253068. Parts of the study were also supported by the American Educational Research Association, which receives funds for its AERA Grants Program from the National Science Foundation and the National Center for Education Statistics (U.S. Department of Education) under NSF Grant RED–9452861. Opinions reflect those of the author and do not necessarily reflect those of the granting agencies. I am grateful to Dick Snow, Ed Haertel, and Rich Shavelson for their advice on the design of the study and for their comments on written drafts of the dissertation, to Vi-Nhuan Le and Judy Dauberman for their assistance in interviewing and scoring, and to the editors and two anonymous reviewers for suggestions that improved this article.

REFERENCES

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–23). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Becker, B. J. (1989). Gender and science achievement: A reanalysis of studies from two meta-analyses. Journal of Research in Science Teaching, 26, 141–169.
Beller, M., & Gafni, N. (1996). The 1991 International Assessment of Educational Progress in Mathematics and Sciences: The gender differences perspective. Journal of Educational Psychology, 88, 365–377.
Bock, D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261–280.
Bolger, N., & Kellaghan, T. (1990). Method of measurement and gender differences in scholastic achievement. Journal of Educational Measurement, 27, 165–174.
Burkam, D. T., Lee, V. E., & Smerdon, B. A. (1997). Gender and science learning early in high school: Subject matter and laboratory experiences. American Educational Research Journal, 34, 297–331.
Clauser, B. E., Nungester, R. J., Mazor, K., & Ripkey, D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33, 202–214.
Clauser, B. E., Nungester, R. J., & Swaminathan, H. (1997). Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. Journal of Educational Measurement, 33, 453–464.
Cole, N. S. (1997). The ETS gender study: How males and females perform in educational settings. Princeton, NJ: Educational Testing Service.
Donoghue, J. R., Holland, P. W., & Thayer, D. T. (1993). A Monte Carlo study of factors that affect the Mantel–Haenszel and standardization measures of differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 137–166). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289–303.
Fennema, E., & Tartre, L. A. (1985). The use of spatial visualization in mathematics by girls and boys. Journal for Research in Mathematics Education, 16, 184–206.
Fleming, M. L., & Malone, M. R. (1983). The relationship of student characteristics and student performance in science as viewed by meta-analysis research. Journal of Research in Science Teaching, 20, 481–495.


Frederiksen, N. (1984). The real test bias. American Psychologist, 39, 193–202.
French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315–332.
Halpern, D. F. (1992). Sex differences in cognitive abilities (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Halpern, D. F. (1997). Sex differences in intelligence: Implications for education. American Psychologist, 52, 1091–1102.
Hamilton, L. S. (1997). Construct validity of constructed-response assessments: Male and female high school science performance. Unpublished doctoral dissertation, Stanford University, School of Education, Stanford, CA.
Hamilton, L. S., Nussbaum, E. M., Kupermintz, H., Kerkhoven, J. I. M., & Snow, R. E. (1995). Enhancing the validity and usefulness of large scale educational assessments: II. NELS:88 science achievement. American Educational Research Journal, 32, 555–581.
Hamilton, L. S., Nussbaum, E. M., & Snow, R. E. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10, 181–200.
Hanna, G. (1989). Mathematics achievement of girls and boys in grade eight: Results from twenty countries. Educational Studies in Mathematics, 20, 225–232.
Hauck, W. W. (1983). A note on confidence bands for the logistic response curve. The American Statistician, 37, 158–160.
Johnson, S. (1987). Gender differences in science: Parallels in interest, experience, and performance. International Journal of Science Education, 9, 467–481.
Jones, L. R., Mullis, I. V. S., Raizen, S. A., Weiss, I. R., & Weston, E. A. (1992). The 1990 science report card: NAEP's assessment of fourth, eighth, and twelfth graders. Princeton, NJ: Educational Testing Service.
Jovanovic, J., Solano-Flores, G., & Shavelson, R. J. (1994). Performance-based assessments: Will gender differences in science achievement be eliminated? Education and Urban Society, 26, 352–366.
Linn, M. C. (1985). Fostering equitable consequences from computer learning environments. Sex Roles, 13, 229–240.
Linn, M. C., & Hyde, J. S. (1989). Gender, mathematics, and science. Educational Researcher, 18(8), 17–19, 22–27.
Lohman, D. F. (1993). Spatially gifted, verbally inconvenienced. In Wallace symposium on talent development. Symposium conducted at the University of Iowa, Iowa City.
Mazor, K. M., Kanjee, A., & Clauser, B. E. (1995). Using logistic regression and the Mantel–Haenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32, 131–144.
Mazzeo, J., Schmitt, A. P., & Bleistein, C. A. (1993). Sex-related performance differences on constructed-response and multiple-choice sections of Advanced Placement Examinations (College Board Rep. No. 92–7). New York: College Entrance Examination Board.
Miller, T. R., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107–122.
Mullis, I. V. S., Dossey, J. A., Owen, E. H., & Phillips, G. W. (1991). The state of mathematics achievement: Executive summary (Report No. 21–ST–04). Washington, DC: National Center for Education Statistics.
National Assessment of Educational Progress. (1988). The science report card: Elements of risk and recovery: Trends and achievement levels based on the 1986 National Assessment. Princeton, NJ: Educational Testing Service.
Nussbaum, E. M., Hamilton, L. S., & Snow, R. E. (1997). Enhancing the validity and usefulness of large scale educational assessments: IV. NELS:88 science performance through the twelfth grade. American Educational Research Journal, 34, 151–173.


O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Pollack, J. M., & Rock, D. A. (1997). Constructed-response tests in the NELS:88 High School Effectiveness Study (NCES 97–804). Washington, DC: National Center for Education Statistics.
Rauscher, F. H., Krauss, R. M., & Chen, Y. (1996). Gesture, speech, and lexical access: The role of lexical movements in speech production. Psychological Science, 7, 226–231.
Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement, and instruction (pp. 35–75). Boston: Kluwer.
Rock, D. A., & Pollack, J. M. (1995). Psychometric report for the NELS:88 base year through second follow-up (NCES Rep. No. 95–382). Washington, DC: National Center for Education Statistics.
Shavelson, R. J., Carey, N. B., & Webb, N. M. (1990). Indicators of science achievement: Options for a powerful policy instrument. Phi Delta Kappan, 71, 692–697.
Somes, G. W. (1986). The generalized Mantel–Haenszel statistic. The American Statistician, 40, 106–108.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361–370.
Wang, N., & Lane, S. (1996). Detection of gender-related differential item functioning in a mathematics performance assessment. Applied Measurement in Education, 9, 175–199.
Welch, C. J., & Miller, T. R. (1995). Assessing differential item functioning in direct writing assessments: Problems and an example. Journal of Educational Measurement, 32, 163–178.
Young, D. J., & Fraser, B. J. (1994). Gender differences in science achievement: Do school effects make a difference? Journal of Research in Science Teaching, 31, 857–871.
Zwick, R. (1990). When do item response function and Mantel–Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 185–197.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233–251.
Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP History Assessment. Journal of Educational Measurement, 26, 55–66.
