Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and Other Assessment Activities
Jeffrey Steedle and Steve Ferrara
Center for Next Generation Learning and Assessment
CCSSO National Conference on Student Assessment, June 24, 2015
Which of these essays is of higher quality?
A time when i felt free was, when i finally got released from being in the hospital for four days. The reason i was in the hospital was because i had a kidney stones which hurted really bad that i couldn't eat and stand up straight.So i decided to go to the emergency room to see what was going on.This was before i found out i had kidney stones…
A time I felt like I was free was when I was fifteen years old. At age fifteen, everybody is curious and anxious to do things on there own without parental consent. I was just another one of those fifteen year olds anxious to get my turn at something, but then I learned how to drive. A lot of people enjoy driving around, some people do it because they have to get to their job or because they need to go from one place to another…
Traditional, Rubric-Based Scoring
“Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected” (AERA, APA, & NCME, 2014).
Comparative Judgment
Prompt and responses from http://tea.texas.gov/student.assessment/staar/writing/
Comparative Judgment Background
• Not a new idea (Law of Comparative Judgment; Thurstone, 1927)
• Relative judgments are more accurate than absolute judgments for
  – psychophysical phenomena (Stewart et al., 2005)
  – estimating distances, counting spelling errors (Shah et al., 2014)
  – evaluating physics and history exams (Gill & Bramley, 2008)
• Past uses in educational assessment
  – Comparing the alignment of passing standards over time (Bramley, Bell, & Pollitt, 1998; Curcin et al., 2009)
  – Estimating item difficulty (Walker et al., 2005)
  – Scoring essays, portfolios, and short-answer responses (Pollitt, 2004; Whitehouse & Pollitt, 2012; Kimbell et al., 2009; Pollitt, 2012; Attali, 2014)
Comparative Judgments
Rubric-Based Scoring                                            | Comparative Judgment
Scorers must internalize the definition of each score point     | Judges must internalize the definition of "quality"
Scorers must agree exactly with the trainer and "anchor papers" | Judges must agree with the trainer about the relative quality of responses
Lengthy training and qualification (e.g., 16 hours)             | Brief training and qualification (e.g., 3 hours)
Longer time per evaluation                                      | Shorter time per evaluation
Requires fewer evaluations per response                         | Requires more evaluations per response
Potential Comparative Judgment Advantages

• Eliminating certain scorer biases / increased validity
• Faster time per evaluation
• Reduced cognitive demand
• Minimal training, qualification, and monitoring
• Reduced costs

Research is needed to test these potential advantages.
Potential Applications in Scoring
Field Test Scoring (few responses to a large number of prompts):
• Rubric scoring: many lengthy trainings, shorter overall evaluation time
• Comparative judgment: many brief trainings, longer overall evaluation time; possibly more efficient overall

Educator Scoring (educators get buy-in and professional development):
• Rubric scoring: fewer teachers in lengthy trainings; lower overall productivity, narrow PD reach
• Comparative judgment: more teachers in brief trainings; greater overall productivity, expanded PD reach
Research Questions
1. How closely do comparative judgment measures correspond to rubric scores?
2. Do comparative judgments take less time than rubric scoring decisions?
3. How do comparative judgment measures and rubric scores compare in terms of validity coefficients?
4. How is the reliability of comparative judgment measures associated with the number of judgments per essay response?
Method: Essay Prompts
• Two essay prompts from online administrations of a high school achievement testing program in a large state
• 4-point holistic rubric scoring, at least two scores per response, exact agreement required
• Samples of 200 responses for each prompt
Prompt | Exact Agmt. | Adj. Agmt. | r   | Rubric Score Distribution (1 / 2 / 3 / 4)
1      | 70%         | 29%        | .81 | 25% / 40% / 25% / 10%
2      | 69%         | 30%        | .85 | 25% / 40% / 25% / 10%
Method: Participants
• All with secondary English teaching experience
• No professional scorers to avoid interference between methods of evaluating student responses
• Prompt 1: 4 judges; Prompt 2: 5 judges
Method: Training
• Conducted via web conference by an experienced scoring trainer
• Judges learned rubric criteria (focus, organization, development, etc.), but the rubric was never shown
• Judges practiced making comparative judgments on “anchor pairs” involving “anchor papers” used in rubric-based training
• Qualification test accuracy ranged from 11 to 15 out of 15
• Training durations were 3 and 3.75 hours
Method: Statistical Model

• Multivariate generalization of the Bradley-Terry model (Bradley & Terry, 1952)
• μA is the latent location of response A on a continuum of writing quality

$$P(Y_{AB} = j \mid \mu_A, \mu_B, \boldsymbol{\tau}) = \pi_{ABj} = \frac{\exp\left( \sum_{s=1}^{j} \left[ \mu_A - (\mu_B + \tau_s) \right] \right)}{\sum_{y=1}^{J} \exp\left( \sum_{s=1}^{y} \left[ \mu_A - (\mu_B + \tau_s) \right] \right)}$$

[Figure: Category response curves for Writing Sample 1, showing the probability of "Prefer A," "Options equal," and "Prefer B" as a function of the locations of responses A and B.]

• When μB < μA, "Prefer A" is the most probable judgment
• When μB > μA, "Prefer B" is the most probable judgment
• "Options equal" is never the most probable judgment
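To make the model concrete, here is a minimal Python sketch of the category probabilities defined above. The function name and the example locations and thresholds are illustrative assumptions, not values from the presentation.

```python
import numpy as np

def category_probs(mu_a, mu_b, tau):
    """Category probabilities for one paired comparison under the
    ordered Bradley-Terry extension shown above.

    mu_a, mu_b : latent quality locations of responses A and B
    tau        : sequence of J category thresholds (tau_1, ..., tau_J)
    """
    # The log-numerator for category j is the cumulative sum of
    # [mu_a - (mu_b + tau_s)] over s = 1..j; the denominator normalizes.
    log_num = np.cumsum(mu_a - (mu_b + np.asarray(tau, dtype=float)))
    probs = np.exp(log_num - log_num.max())  # subtract the max for stability
    return probs / probs.sum()

# Illustrative values (assumptions): three ordered categories
# ("Prefer B", "Options equal", "Prefer A"); with this parameterization
# the last category dominates when mu_a exceeds mu_b, matching the slide.
print(category_probs(mu_a=2.5, mu_b=1.0, tau=[-1.0, 0.0, 1.0]))
```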
Method: Pairing Responses
• Note: The most information about a response’s latent location is obtained by comparing it to another response of similar quality.
• The Generalized Grading Model (GGM) provided a predicted score for each response on the 1–4 rubric scale (based on text complexity, coherence, length, spelling, and vocabulary).
• Each response was paired with
  – 16 other responses (with the same or adjacent predicted score)
  – 2 anchor papers
• 2,000 judgments per prompt (200 × 16 / 2 = 1,600 response-to-response comparisons, since each is shared by two responses, plus 200 × 2 = 400 anchor comparisons)
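The pairing rule can be sketched as follows. How the 16 partners were actually sampled is not specified in the presentation, so the function names and sampling logic here are hypothetical.

```python
import random

def candidate_partners(rid, predicted, max_gap=1):
    """Other responses whose predicted rubric score is the same as or
    adjacent to this response's predicted score."""
    return [r for r, p in predicted.items()
            if r != rid and abs(p - predicted[rid]) <= max_gap]

def sample_pairs(predicted, n_partners=16, seed=0):
    """Sample roughly n_partners comparison partners per response from
    its same/adjacent-score candidates (a sketch, not the study's method).

    predicted : dict mapping response id -> predicted score on the 1-4 scale
    """
    rng = random.Random(seed)
    pairs = set()
    for rid in predicted:
        pool = candidate_partners(rid, predicted)
        for p in rng.sample(pool, min(n_partners, len(pool))):
            pairs.add(frozenset((rid, p)))  # frozenset dedups unordered pairs
    return pairs
```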
Method: Data Collection
• Responses were “chained” so that a judge only read one new response per judgment
A vs. B → B vs. C → C vs. D → D vs. E
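A tiny illustrative helper for generating such a chain (names assumed):

```python
def chained_pairs(responses):
    """Pair consecutive responses (A vs. B, B vs. C, ...) so a judge
    reads only one new response per judgment."""
    return list(zip(responses, responses[1:]))

# chained_pairs(["A", "B", "C", "D", "E"])
# -> [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
```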
Results: Parameter Estimation
[Figure: Histograms of estimated measures. Prompt 1: mean = 2.4, SD = 0.99; Prompt 2: mean = 2.13, SD = 1.04.]
Scale anchored by anchor paper scores, so most measures fall between 1.0 and 4.0
Results: Correspondence
Measure        | Prompt 1 Rubric | Prompt 1 Rounded CJ | Prompt 2 Rubric | Prompt 2 Rounded CJ
Mean           | 2.20            | 2.40                | 2.20            | 2.21
Std. Deviation | 0.93            | 0.97                | 0.93            | 0.98

Rubric vs. rounded CJ agreement:
• Prompt 1: exact 60.0%, adjacent 38.5%, correlation .78
• Prompt 2: exact 64.0%, adjacent 33.5%, correlation .76
60.0% exact agreement between rubric scores and rounded comparative judgment scores on Prompt 1
Slight tendency for comparative judgment to overestimate on Prompt 1
Better agreement overall on Prompt 2
Results: Judgment Time
              | Prompt 1 | Prompt 2 | Both
Mean (Rubric) | 121.2 s  | 116.4 s  | 119.4 s
Mean (CJ)     | 116.7 s  | 70.45 s  | 93.5 s
Median (CJ)   | 83.0 s   | 45.0 s   | 62.0 s
Some huge outliers in these data (e.g., 2,760 seconds)
Medians likely provide better measures of central tendency
Results: Validity Coefficients
Correlations with the multiple-choice writing test:
• Rubric score: .63, .69
• Continuous comparative judgment measure: .67, .72
• Rounded comparative judgment measure: .66, .71
Results: Reliability

Reliability = consistency in judgments about the quality of a response relative to other responses.

• In this context, "reliability" reflects judge behavior and is therefore akin to inter-rater reliability.
• High reliability translates into greater precision in estimating the perceived relative quality of responses.
• Reliability does not reflect correspondence between estimated scores and "true" scores. Studying this would require multiple responses from each student.
Results: Reliability

• Remove random samples of judgments, refit the model, and recalculate reliability (this procedure is sketched below).

[Figure: Reliability as a function of the average number of comparisons per object, shown separately for Prompt 1 and Prompt 2.]
Reliability drops below .80 with a 50% reduction (~9 judgments per response)
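A rough sketch of that downsampling check, assuming a hypothetical model-fitting routine. The separation-style reliability index shown is a common choice in the comparative judgment literature and may differ from the exact index used in this study.

```python
import random
import numpy as np

def separation_reliability(measures, std_errors):
    """Rasch-style separation reliability: the proportion of observed
    variance in the measures not attributable to estimation error."""
    obs_var = np.var(measures, ddof=1)
    return (obs_var - np.mean(np.square(std_errors))) / obs_var

def downsampled_reliability(judgments, keep_fraction, fit_model):
    """Refit the model on a random subset of judgments and return the
    reliability of the resulting measures. `fit_model` is a hypothetical
    stand-in that returns (measures, standard_errors)."""
    subset = random.sample(judgments, int(len(judgments) * keep_fraction))
    measures, std_errors = fit_model(subset)
    return separation_reliability(np.asarray(measures), np.asarray(std_errors))
```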
A Note on Number of Judgments
• TRUE or FALSE: If you have 200 responses and you want reliability of .80, you need about 200×9 = 1,800 judgments.
• FALSE: A judgment provides information about 2 responses, so you would need about 900 judgments (or 4.5 judgments per unique response).
Conclusions
• Scores from comparative judgment correspond to rubric scores at a rate similar to that observed between two scorers (60–70% exact agreement; Ferrara & DeMauro, 2006).
• Comparative judgment measures appear to have higher validity coefficients than rubric scores.
• With 3-4 hours of comparative judgment training, judges can consistently judge the relative quality of responses, as reflected by high reliability coefficients.
• Time per comparative judgment appears to be less than time per rubric score.
Future Research
• Agreement might be improved by refining the pairing process
• Accuracy and efficiency might be improved by implementing adaptive comparative judgment (Pollitt, 2012); a toy selection rule is sketched after this list
  – Initial pairings are random
  – Subsequent pairings are based on preliminary score estimates
• Pilot rangefinding study
• Data-free form assembly and equating
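A toy version of the adaptive pairing idea: choose the not-yet-compared pair whose preliminary estimates are closest, where a comparison is most informative. Pollitt's operational algorithm differs in its details, so treat this as illustrative only.

```python
import itertools

def next_adaptive_pair(estimates, seen_pairs):
    """Pick the unseen pair of responses whose preliminary score
    estimates are closest together.

    estimates  : dict mapping response id -> preliminary measure
    seen_pairs : set of frozensets of ids already compared
    """
    best_pair, best_gap = None, float("inf")
    for a, b in itertools.combinations(estimates, 2):
        if frozenset((a, b)) in seen_pairs:
            continue
        gap = abs(estimates[a] - estimates[b])
        if gap < best_gap:
            best_pair, best_gap = (a, b), gap
    return best_pair
```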
Pilot Rangefinding Results
• Six panelists made 106 judgments about 15 responses in 16 minutes (with reliability = .97).
[Caterpillar plot: comparative judgment measures (vertical axis, roughly -6 to 6) for Paper01 through Paper15, grouped by score level (1s through 5s).]
Data-Free Forms Assembly and Equating
• Field testing (especially embedded) is useful for estimating item difficulties for forms assembly and/or pre-equating
• Problems with field testing:
  – It is not permitted or valued in some countries
  – There is backlash against it in the U.S. (i.e., using kids as unpaid laborers)
  – Test security may be compromised because performance tasks and essays are highly memorable
  – Examinees may not be motivated
Which of these items is more difficult?
What single transformation is shown below?
• Reflection
• Rotation
• Translation
• No single transformation is shown.

The masses of two gorillas are given below.
A female gorilla has a mass of 85,000 grams.
A male gorilla has a mass of 220 kilograms.

What is the difference between these two masses in grams?
• 135,000 g
• 84,780 g
• 63,000 g
• 305,000 g
http://tea.texas.gov/Student_Testing_and_Accountability/Testing/State_of_Texas_Assessments_of_Academic_Readiness_(STAAR)/STAAR_Released_Test_Questions/
Data-Free Forms Assembly and Equating
• To the extent that such judgments are accurate, comparative judgment can be used to put items (from different test forms) on a common scale of perceived item difficulty.
• Those measures could be used for
  – Developing test forms of similar difficulty
  – Equating test forms (with no common items or persons)
Example Equating Process
1. Calibrate Form X (prior administration)
2. Calibrate Form Y (current administration)
3. Compare a sample of Form Y items to a sample of Form X "equating" items to calculate an equating constant
4. Apply the constant to all of Form Y
5. Locate the Form X performance standard on Form Y
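A minimal sketch of the constant in step 3, assuming a mean-mean style shift; the presentation does not state the exact formula, and all names here are hypothetical.

```python
import numpy as np

def equating_constant(x_prior_measures, x_new_measures):
    """Shift needed so the Form X equating items, as re-measured in the
    joint comparative judgment calibration with Form Y, recover their
    prior-scale measures (mean-mean style; an assumption, not the
    presenters' stated method)."""
    return np.mean(x_prior_measures) - np.mean(x_new_measures)

def equate_form_y(y_measures, constant):
    # Apply the constant to place all Form Y item measures on the Form X scale.
    return np.asarray(y_measures) + constant
```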
Data-Free Forms Assembly and Equating
• Prior research has demonstrated that comparative judgment measures can be highly correlated with empirical item difficulties (e.g., Heldsinger & Humphry, 2014).
• Our study will focus on the accuracy of the comparative judgment measures and subsequent accuracy of raw-to-theta pre-equating tables, equating of performance standards across forms, and inferences about the relative difficulty of test forms.
THANK YOU!
Center for Next Generation Learning and Assessment
Research and Innovation Network
[email protected]
[email protected]
References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y. (2014). A ranking method for evaluating constructed responses. Educational and Psychological Measurement, Online First, 1-14.
Bradley, R.A., & Terry, M.E. (1952). Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39, 324-345.
Bramley, T., Bell, J.F., & Pollitt, A. (1998). Assessing changes in standards over time using Thurstone paired comparisons. Education Research and Perspectives, 25(2), 1-24.
Curcin, M., Black, B., & Bramley, T. (2009). Standard maintaining by expert judgment on multiple-choice tests: A new use for the rank-ordering method. Paper presented at the British Educational Research Association Annual Conference, Manchester.
Elliot, S., Ferrara, S., Fisher, T., Klein, S., Pitoniak, M., & Steedle, J. (2010). Developing the EdSteps continuum. Washington, DC: Council of Chief State School Officers.
Ferrara, S., & DeMauro, G.E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 579-621). Westport, CT: Praeger.
Gill, T., & Bramley, T. (2008). How accurate are examiners' judgments of script quality? An investigation of absolute and relative judgments in two units, one with a wide and one with a narrow 'zone of uncertainty'. Paper presented at the British Educational Research Association Annual Conference, Edinburgh, Scotland.
Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1-19.
Heldsinger, S., & Humphry, S. (2014). Maintaining consistent metrics in standard setting. Murdoch, Western Australia: Murdoch University.
Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., . . . Whitehouse, G. (2009). E-scape portfolio assessment: Phase 3 report. London: Technology Education Research Unit, Goldsmiths College, University of London.
Pollitt, A. (2004). Let's stop marking exams. Paper presented at the IAEA Conference, Philadelphia, PA.
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281-300.
Shah, N.B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., & Wainwright, M. (2014). When is it better to compare than to score? arXiv. http://arxiv.org/abs/1406.6618
Stewart, N., Brown, G.D.A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112(4), 881-911.
Thurstone, L.L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
Walker, M.E., Dorans, N.J., Kim, S., Vafis, G., & Fecko-Curtis, E. (2005). Alternative methods for obtaining item difficulty information. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester: The Assessment and Qualifications Alliance.
Wolfe, E.W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37.
Zahner, D., & Steedle, J.T. (2014). Evaluating performance task scoring comparability in an international testing program. Paper presented at the National Council on Measurement in Education Annual Meeting, Philadelphia, PA.