Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and Other Assessment Activities
Jeffrey Steedle and Steve Ferrara
Center for Next Generation Learning and Assessment
CCSSO National Conference on Student Assessment, June 24, 2015
Which of these essays is of higher quality?
A time when i felt free was, when i finally got released from being in the hospital for four days. The reason i was in the hospital was because i had a kidney stones which hurted really bad that i couldn't eat and stand up straight.So i decided to go to the emergency room to see what was going on.This was before i found out i had kidney stones…
A time I felt like I was free was when I was fifteen years old. At age fifteen, everybody is curious and anxious to do things on there own without parental consent. I was just another one of those fifteen year olds anxious to get my turn at something, but then I learned how to drive. A lot of people enjoy driving around, some people do it because they have to get to their job or because they need to go from one place to another…
Traditional, Rubric-Based Scoring
“Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected” (AERA, APA, & NCME, 2014).
Comparative Judgment
Prompt and responses from http://tea.texas.gov/student.assessment/staar/writing/
Comparative Judgment Background
• Not a new idea (Law of Comparative Judgment; Thurstone, 1927)
• Relative judgments are more accurate than absolute judgments for
  – psychophysical phenomena (Stewart et al., 2005)
  – estimating distances, counting spelling errors (Shah et al., 2014)
  – evaluating physics and history exams (Gill & Bramley, 2008)
• Past uses in educational assessment
  – Comparing the alignment of passing standards over time (Bramley, Bell, & Pollitt, 1998; Curcin et al., 2009)
  – Estimating item difficulty (Walker et al., 2005)
  – Scoring essays, portfolios, and short-answer responses (Pollitt, 2004; Whitehouse & Pollitt, 2012; Kimbell et al., 2009; Pollitt, 2012; Attali, 2014)
Comparative Judgments
Rubric-Based Scoring                                            | Comparative Judgment
Scorers must internalize the definition of each score point     | Judges must internalize the definition of "quality"
Scorers must agree exactly with the trainer and "anchor papers" | Judges must agree with the trainer about the relative quality of responses
Lengthy training and qualification (e.g., 16 hours)             | Brief training and qualification (e.g., 3 hours)
Longer time per evaluation                                      | Shorter time per evaluation
Requires fewer evaluations per response                         | Requires more evaluations per response
Potential Comparative Judgment Advantages

• Eliminating certain scorer biases / increased validity
• Faster time per evaluation
• Reduced cognitive demand
• Minimal training, qualification, and monitoring
• Reduced costs

Research is needed to test these potential advantages.
Potential Applications in Scoring
Field Test Scoring (few responses to a large number of prompts):
• Rubric scoring: many lengthy trainings, shorter overall evaluation time
• Comparative judgment: many brief trainings, longer overall evaluation time; possibly more efficient overall

Educator Scoring (educators get buy-in and professional development):
• Rubric scoring: fewer teachers in lengthy trainings; lower overall productivity, narrow PD reach
• Comparative judgment: more teachers in brief trainings; greater overall productivity, expanded PD reach
Research Questions
1. How closely do comparative judgment measures correspond to rubric scores?
2. Do comparative judgments take less time than rubric scoring decisions?
3. How do comparative judgment measures and rubric scores compare in terms of validity coefficients?
4. How is the reliability of comparative judgment measures associated with the number of judgments per essay response?
Method: Essay Prompts
• Two essay prompts from online administrations of a high school achievement testing program in a large state
• 4-point holistic rubric scoring, at least two scores per response, exact agreement required
• Samples of 200 responses for each prompt
Prompt | Exact Agmt. | Adj. Agmt. | r   | Rubric Score Distribution (1 / 2 / 3 / 4)
1      | 70%         | 29%        | .81 | 25% / 40% / 25% / 10%
2      | 69%         | 30%        | .85 | 25% / 40% / 25% / 10%
Method: Participants
• All with secondary English teaching experience
• No professional scorers to avoid interference between methods of evaluating student responses
• Prompt 1: 4 judges; Prompt 2: 5 judges
Method: Training
• Conducted via web conference by an experienced scoring trainer
• Judges learned rubric criteria (focus, organization, development, etc.), but the rubric was never shown
• Judges practiced making comparative judgments on “anchor pairs” involving “anchor papers” used in rubric-based training
• Qualification test accuracy ranged from 11 to 15 out of 15
• Training durations were 3 and 3.75 hours
Method: Statistical Model

• Multivariate generalization of the Bradley-Terry model (Bradley & Terry, 1952)
• μA is the latent location of response A on a continuum of writing quality

$$P(Y_{AB} = j \mid \mu_A, \mu_B, \boldsymbol{\tau}) = \pi_{ABj} = \frac{\exp\left( \sum_{s=1}^{j} \left[ \mu_A - (\mu_B + \tau_s) \right] \right)}{\sum_{y=1}^{J} \exp\left( \sum_{s=1}^{y} \left[ \mu_A - (\mu_B + \tau_s) \right] \right)}$$

[Figure: Category response curves for Writing Sample 1, showing the probability of "Prefer A," "Options equal," and "Prefer B" as a function of the locations of responses A and B.]

• When μB < μA, "Prefer A" is the most probable judgment
• When μB > μA, "Prefer B" is the most probable judgment
• "Options equal" is never the most probable judgment
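To make the model concrete, here is a minimal Python sketch of the category probabilities defined above. The function name and the example locations and thresholds are illustrative assumptions, not values from the presentation.

```python
import numpy as np

def category_probs(mu_a, mu_b, tau):
    """Category probabilities for one paired comparison under the
    ordered Bradley-Terry extension shown above.

    mu_a, mu_b : latent quality locations of responses A and B
    tau        : sequence of J category thresholds (tau_1, ..., tau_J)
    """
    # The log-numerator for category j is the cumulative sum of
    # [mu_a - (mu_b + tau_s)] over s = 1..j; the denominator normalizes.
    log_num = np.cumsum(mu_a - (mu_b + np.asarray(tau, dtype=float)))
    probs = np.exp(log_num - log_num.max())  # subtract the max for stability
    return probs / probs.sum()

# Illustrative values (assumptions): three ordered categories
# ("Prefer B", "Options equal", "Prefer A"); with this parameterization
# the last category dominates when mu_a exceeds mu_b, matching the slide.
print(category_probs(mu_a=2.5, mu_b=1.0, tau=[-1.0, 0.0, 1.0]))
```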
Method: Pairing Responses
• Note: The most information about a response’s latent location is obtained by comparing it to another response of similar quality.
• The Generalized Grading Model (GGM) provided a predicted score for each response on the 1–4 rubric scale (based on text complexity, coherence, length, spelling, and vocabulary).
• Each response was paired with
  – 16 other responses (with the same or adjacent predicted score)
  – 2 anchor papers
• 2,000 judgments per prompt (200 × 16 / 2 = 1,600 response-to-response comparisons, since each is shared by two responses, plus 200 × 2 = 400 anchor comparisons)
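The pairing rule can be sketched as follows. How the 16 partners were actually sampled is not specified in the presentation, so the function names and sampling logic here are hypothetical.

```python
import random

def candidate_partners(rid, predicted, max_gap=1):
    """Other responses whose predicted rubric score is the same as or
    adjacent to this response's predicted score."""
    return [r for r, p in predicted.items()
            if r != rid and abs(p - predicted[rid]) <= max_gap]

def sample_pairs(predicted, n_partners=16, seed=0):
    """Sample roughly n_partners comparison partners per response from
    its same/adjacent-score candidates (a sketch, not the study's method).

    predicted : dict mapping response id -> predicted score on the 1-4 scale
    """
    rng = random.Random(seed)
    pairs = set()
    for rid in predicted:
        pool = candidate_partners(rid, predicted)
        for p in rng.sample(pool, min(n_partners, len(pool))):
            pairs.add(frozenset((rid, p)))  # frozenset dedups unordered pairs
    return pairs
```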
Method: Data Collection
• Responses were “chained” so that a judge only read one new response per judgment
A vs. B → B vs. C → C vs. D → D vs. E
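A tiny illustrative helper for generating such a chain (names assumed):

```python
def chained_pairs(responses):
    """Pair consecutive responses (A vs. B, B vs. C, ...) so a judge
    reads only one new response per judgment."""
    return list(zip(responses, responses[1:]))

# chained_pairs(["A", "B", "C", "D", "E"])
# -> [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
```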
Results: Parameter Estimation
[Figure: Histograms of estimated measures. Prompt 1: mean = 2.4, SD = 0.99; Prompt 2: mean = 2.13, SD = 1.04.]
Scale anchored by anchor paper scores, so most measures fall between 1.0 and 4.0
Results: Correspondence
Measure        | Prompt 1 Rubric | Prompt 1 Rounded CJ | Prompt 2 Rubric | Prompt 2 Rounded CJ
Mean           | 2.20            | 2.40                | 2.20            | 2.21
Std. Deviation | 0.93            | 0.97                | 0.93            | 0.98

Rubric vs. rounded CJ agreement:
• Prompt 1: exact 60.0%, adjacent 38.5%, correlation .78
• Prompt 2: exact 64.0%, adjacent 33.5%, correlation .76
60.0% exact agreement between rubric scores and rounded comparative judgment scores on Prompt 1
Slight tendency for comparative judgment to overestimate on Prompt 1
Better agreement overall on Prompt 2
Results: Judgment Time
              | Prompt 1 | Prompt 2 | Both
Mean (Rubric) | 121.2 s  | 116.4 s  | 119.4 s
Mean (CJ)     | 116.7 s  | 70.45 s  | 93.5 s
Median (CJ)   | 83.0 s   | 45.0 s   | 62.0 s
Some huge outliers in these data (e.g., 2,760 seconds)
Medians likely provide better measures of central tendency
Results: Validity Coefficients
Correlations with the multiple-choice writing test:
• Rubric score: .63, .69
• Continuous comparative judgment measure: .67, .72
• Rounded comparative judgment measure: .66, .71
Results: Reliability

Reliability = consistency in judgments about the quality of a response relative to other responses.

• In this context, "reliability" reflects judge behavior and is therefore akin to inter-rater reliability.
• High reliability translates into greater precision in estimating the perceived relative quality of responses.
• Reliability does not reflect correspondence between estimated scores and "true" scores. Studying this would require multiple responses from each student.
Results: Reliability

• Remove random samples of judgments, refit the model, and recalculate reliability (this procedure is sketched below).

[Figure: Reliability as a function of the average number of comparisons per object, shown separately for Prompt 1 and Prompt 2.]
Reliability drops below .80 with a 50% reduction (~9 judgments per response)
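A rough sketch of that downsampling check, assuming a hypothetical model-fitting routine. The separation-style reliability index shown is a common choice in the comparative judgment literature and may differ from the exact index used in this study.

```python
import random
import numpy as np

def separation_reliability(measures, std_errors):
    """Rasch-style separation reliability: the proportion of observed
    variance in the measures not attributable to estimation error."""
    obs_var = np.var(measures, ddof=1)
    return (obs_var - np.mean(np.square(std_errors))) / obs_var

def downsampled_reliability(judgments, keep_fraction, fit_model):
    """Refit the model on a random subset of judgments and return the
    reliability of the resulting measures. `fit_model` is a hypothetical
    stand-in that returns (measures, standard_errors)."""
    subset = random.sample(judgments, int(len(judgments) * keep_fraction))
    measures, std_errors = fit_model(subset)
    return separation_reliability(np.asarray(measures), np.asarray(std_errors))
```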
A Note on Number of Judgments
• TRUE or FALSE: If you have 200 responses and you want reliability of .80, you need about 200×9 = 1,800 judgments.
• FALSE: A judgment provides information about 2 responses, so you would need about 900 judgments (or 4.5 judgments per unique response).
Conclusions
• Scores from comparative judgment correspond to rubric scores at a rate similar to that observed between two scorers (60–70% exact agreement; Ferrara & DeMauro, 2006).
• Comparative judgment measures appear to have higher validity coefficients than rubric scores.
• With 3-4 hours of comparative judgment training, judges can consistently judge the relative quality of responses, as reflected by high reliability coefficients.
• Time per comparative judgment appears to be less than time per rubric score.
Future Research
• Agreement might be improved by refining the pairing process
• Accuracy and efficiency might be improved by implementing adaptive comparative judgment (Pollitt, 2012); a toy selection rule is sketched after this list
  – Initial pairings are random
  – Subsequent pairings are based on preliminary score estimates
• Pilot rangefinding study
• Data-free form assembly and equating
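A toy version of the adaptive pairing idea: choose the not-yet-compared pair whose preliminary estimates are closest, where a comparison is most informative. Pollitt's operational algorithm differs in its details, so treat this as illustrative only.

```python
import itertools

def next_adaptive_pair(estimates, seen_pairs):
    """Pick the unseen pair of responses whose preliminary score
    estimates are closest together.

    estimates  : dict mapping response id -> preliminary measure
    seen_pairs : set of frozensets of ids already compared
    """
    best_pair, best_gap = None, float("inf")
    for a, b in itertools.combinations(estimates, 2):
        if frozenset((a, b)) in seen_pairs:
            continue
        gap = abs(estimates[a] - estimates[b])
        if gap < best_gap:
            best_pair, best_gap = (a, b), gap
    return best_pair
```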
Pilot Rangefinding Results
• Six panelists made 106 judgments about 15 responses in 16 minutes (with reliability = .97).
[Caterpillar plot: comparative judgment measures (vertical axis, roughly -6 to 6) for Paper01 through Paper15, grouped by score level (1s through 5s).]
Data-Free Forms Assembly and Equating
• Field testing (especially embedded) is useful for estimating item difficulties for forms assembly and/or pre-equating
• Problems with field testing:
  – It is not permitted or valued in some countries
  – There is backlash against it in the U.S. (i.e., using kids as unpaid laborers)
  – Test security may be compromised because performance tasks and essays are highly memorable
  – Examinees may not be motivated
Which of these items is more difficult?
What single transformation is shown below?
• Reflection
• Rotation
• Translation
• No single transformation is shown.

The masses of two gorillas are given below.
A female gorilla has a mass of 85,000 grams.
A male gorilla has a mass of 220 kilograms.

What is the difference between these two masses in grams?
• 135,000 g
• 84,780 g
• 63,000 g
• 305,000 g
http://tea.texas.gov/Student_Testing_and_Accountability/Testing/State_of_Texas_Assessments_of_Academic_Readiness_(STAAR)/STAAR_Released_Test_Questions/
Data-Free Forms Assembly and Equating
• To the extent that such judgments are accurate, comparative judgment can be used to put items (from different test forms) on a common scale of perceived item difficulty.
• Those measures could be used for
  – Developing test forms of similar difficulty
  – Equating test forms (with no common items or persons)
Example Equating Process
1. Calibrate Form X (prior administration)
2. Calibrate Form Y (current administration)
3. Compare a sample of Form Y items to a sample of Form X "equating" items to calculate an equating constant
4. Apply the constant to all of Form Y
5. Locate the Form X performance standard on Form Y
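A minimal sketch of the constant in step 3, assuming a mean-mean style shift; the presentation does not state the exact formula, and all names here are hypothetical.

```python
import numpy as np

def equating_constant(x_prior_measures, x_new_measures):
    """Shift needed so the Form X equating items, as re-measured in the
    joint comparative judgment calibration with Form Y, recover their
    prior-scale measures (mean-mean style; an assumption, not the
    presenters' stated method)."""
    return np.mean(x_prior_measures) - np.mean(x_new_measures)

def equate_form_y(y_measures, constant):
    # Apply the constant to place all Form Y item measures on the Form X scale.
    return np.asarray(y_measures) + constant
```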
Data-Free Forms Assembly and Equating
• Prior research has demonstrated that comparative judgment measures can be highly correlated with empirical item difficulties (e.g., Heldsinger & Humphry, 2014).
• Our study will focus on the accuracy of the comparative judgment measures and subsequent accuracy of raw-to-theta pre-equating tables, equating of performance standards across forms, and inferences about the relative difficulty of test forms.
THANK YOU!
Center for Next Generation Learning and Assessment
Research and Innovation Network
[email protected]
[email protected]
References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y. (2014). A ranking method for evaluating constructed responses. Educational and Psychological Measurement, Online First, 1-14.
Bradley, R.A., & Terry, M.E. (1952). Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39, 324-345.
Bramley, T., Bell, J.F., & Pollitt, A. (1998). Assessing changes in standards over time using Thurstone paired comparisons. Education Research and Perspectives, 25(2), 1-24.
Curcin, M., Black, B., & Bramley, T. (2009). Standard maintaining by expert judgment on multiple-choice tests: A new use for the rank-ordering method. Paper presented at the British Educational Research Association Annual Conference, Manchester.
Elliot, S., Ferrara, S., Fisher, T., Klein, S., Pitoniak, M., & Steedle, J. (2010). Developing the EdSteps continuum. Washington, DC: Council of Chief State School Officers.
Ferrara, S., & DeMauro, G.E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 579-621). Westport, CT: Praeger.
Gill, T., & Bramley, T. (2008). How accurate are examiners' judgments of script quality? An investigation of absolute and relative judgments in two units, one with a wide and one with a narrow 'zone of uncertainty'. Paper presented at the British Educational Research Association Annual Conference, Edinburgh, Scotland.
Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1-19.
Heldsinger, S., & Humphry, S. (2014). Maintaining consistent metrics in standard setting. Murdoch, Western Australia: Murdoch University.
Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., . . . Whitehouse, G. (2009). E-scape portfolio assessment: Phase 3 report. London: Technology Education Research Unit, Goldsmiths College, University of London.
Pollitt, A. (2004). Let's stop marking exams. Paper presented at the IAEA Conference, Philadelphia, PA.
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281-300.
Shah, N.B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., & Wainwright, M. (2014). When is it better to compare than to score? arXiv. http://arxiv.org/abs/1406.6618
Stewart, N., Brown, G.D.A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112(4), 881-911.
Thurstone, L.L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
Walker, M.E., Dorans, N.J., Kim, S., Vafis, G., & Fecko-Curtis, E. (2005). Alternative methods for obtaining item difficulty information. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester: The Assessment and Qualifications Alliance.
Wolfe, E.W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37.
Zahner, D., & Steedle, J.T. (2014). Evaluating performance task scoring comparability in an international testing program. Paper presented at the National Council on Measurement in Education Annual Meeting, Philadelphia, PA.