Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring. EALTA, Athens, May 9-11, 2008. Claudia Harsch, IQB; Guido Martin, IEA DPC.
Evaluation and Control of Rater Reliability:
Holistic vs. Analytic Scoring
EALTA, Athens, May 9-11, 2008
Claudia Harsch, IQB
Guido Martin, IEA DPC
Overview
1. Background
- Standards-based assessment in Germany; here: Writing in EFL
- Writing tasks and rating approach
2. Feasibility Studies
- Feasibility Study I, May 2007: trial scales and approach
- Feasibility Study II, June 2007: trial holistic vs. analytic approach
3. Pilot Study, July/August 2007
- Training
- Comparison FS II vs. Pilot Study Training
Background: Assessing ES in Germany
- Evaluation of Educational Standards for grades 9 and 10 by IQB Berlin
- In Foreign Languages, standards are linked to the CEF, targeting
  - A2 for lower track of secondary school
  - B1 for middle track of secondary school
- Assessment of "4 skills": reading, listening, writing and speaking (under development)
- Tasks based on CEF levels A1 to C1; uni-level approach
- Sample task: Keeper, targeting B1
Assessment of Writing Tasks
Criteria of assessment, each defined by descriptors based on CEF, Manual, Into Europe:
- task fulfilment
- organisation
- grammar
- vocabulary
- overall impression
Rating approach
- A uni-level approach to grading the tasks in line with the specific target level
- Performance to be graded on a below / pass / pass plus basis
- "Holistic approach": Ratings are the result of a weighted assessment of several descriptors per criterion
Feasibility Study I, May 2007
Aims
- Trial training / rating approach with student teachers
- Gain insight into scales and criteria
- Get feedback on accessibility of handbooks, benchmarks, coding software
Procedure
- 2 tasks: A2 “Lost dog” / B1 “Keeper for a day”
- 6 raters: student teachers of English, proficient in writing English
- First training session (1 day): introduction to CEF, scales and tasks
- Practice 1: 30 scripts per task (over 1 week)
- Second training session (1 day): evaluation & discussion of practice results
- Practice 2: 28 scripts per task (over 1 week)
- Evaluation of results in terms of rating reliability
Feasibility Study I, May 2007
Evaluation: Assessing Rater Reliability
Index used: Percent Agreement with Mode
- Measures the percentage of agreement with the value most often awarded on the level of individual ratings
- Can be aggregated on item (variable) and rater level
- Easily interpreted
- No assumptions about scale level
- No assumptions about value distributions
- No estimation errors
- Can be interpreted as a proxy for validity
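The index described above can be sketched in a few lines of Python. This is a minimal illustration, not the coding software used in the study: the data layout (one dict of rater values per script), the function name `agreement_with_mode`, the tie-breaking rule and the example ratings are all assumptions made for the sketch.

```python
from collections import Counter


def agreement_with_mode(ratings):
    """Percent agreement with the mode for one item.

    ratings: dict mapping script id -> dict mapping rater id -> awarded value.
    Returns (item_level, per_rater), both as proportions between 0 and 1.
    """
    hits = Counter()    # ratings equal to the modal value, per rater
    totals = Counter()  # ratings given, per rater
    for by_rater in ratings.values():
        # Modal value for this script; ties go to the value counted first
        # (an assumption of this sketch, not stated in the slides).
        mode = Counter(by_rater.values()).most_common(1)[0][0]
        for rater, value in by_rater.items():
            totals[rater] += 1
            hits[rater] += int(value == mode)
    per_rater = {rater: hits[rater] / totals[rater] for rater in totals}
    item_level = sum(hits.values()) / sum(totals.values())
    return item_level, per_rater


# Invented ratings for illustration only (not data from the study).
example = {
    "script_1": {"R01": "pass", "R02": "pass", "R03": "pass", "R04": "below"},
    "script_2": {"R01": "pass plus", "R02": "pass plus", "R03": "pass", "R04": "pass plus"},
    "script_3": {"R01": "pass", "R02": "pass", "R03": "pass", "R04": "pass"},
}
item_level, per_rater = agreement_with_mode(example)
print(f"item level: {item_level:.3f}")
for rater, share in sorted(per_rater.items()):
    print(f"{rater}: {share:.3f}")
```

Because the index only counts matches with the most frequent value, it needs no assumptions about scale level or value distributions, and the same counts can be summed per item or per rater.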
Outcome Feasibility Study I, May 2007
Reliability per Item
ITEM                         REL
TaskFulfilment [Keeper]      0,759
Organisation [Keeper]        0,852
Grammar [Keeper]             0,846
Vocabulary [Keeper]          0,870
Overall [Keeper]             0,858
TaskFulfilment [Lost dog]    0,839
Organisation [Lost dog]      0,863
Grammar [Lost dog]           0,845
Vocabulary [Lost dog]        0,869
Overall [Lost dog]           0,833
Outcome Feasibility Study I, May 2007
Reliability per Rater & Item
ITEM                  R01    R02    R03    R04    R05    R06
Overall [Keeper]      0,852  1,000  0,741  0,889  0,704  0,963
Overall [Lost dog]    0,857  0,929  0,786  0,857  0,571  1,000
REL Average           0,847  0,931  0,770  0,826  0,757  0,931
Outcome Feasibility Study I, May 2007
- Approach appears feasible
- Scales seem to be usable and applicable
- BUT: We do not know what raters do on the sub-criterion level
- Need to further explore behaviour at descriptor level
=> Feasibility Study II
Feasibility Study II, June 2007
Comparison:
- Holistic scores for the five criteria (FS I)
- Scoring each descriptor on its own and in addition scoring the criteria “holistically” (FS II)
Reasons behind:
- “below” – “pass” – “pass plus” in a uni-level approach targeting a specific population: tendency towards the “pass” value
- Similar outcomes can be achieved by purely random value distributions at the descriptor level
- Data on scoring each descriptor show whether raters interpret descriptors uniformly before using them to compile the weighted overall criterion rating
- Reliable usage of descriptors is a precondition for valid ratings on the criterion level
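The point about random value distributions can be illustrated with a small simulation. This is a sketch under invented assumptions, not a reanalysis of the study data: six simulated raters score nine descriptors purely at random with a tendency towards "pass" (probabilities 0.2 / 0.6 / 0.2), and the criterion rating is compiled as the rounded, equally weighted mean. The actual weighting used by the raters is not given in the slides.

```python
import random
from collections import Counter


def share_agreeing_with_mode(values):
    """Proportion of values equal to the most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)


rng = random.Random(0)                  # fixed seed so the run is repeatable
N_SCRIPTS, N_RATERS, N_DESCRIPTORS = 200, 6, 9
VALUES = [0, 1, 2]                      # below / pass / pass plus
WEIGHTS = [0.2, 0.6, 0.2]               # assumed tendency towards "pass"

descriptor_shares, criterion_shares = [], []
for _ in range(N_SCRIPTS):
    # scores[r][d]: random descriptor score of rater r on descriptor d
    scores = [rng.choices(VALUES, weights=WEIGHTS, k=N_DESCRIPTORS)
              for _ in range(N_RATERS)]
    for d in range(N_DESCRIPTORS):
        descriptor_shares.append(
            share_agreeing_with_mode([row[d] for row in scores]))
    # Hypothetical compilation rule: rounded, equally weighted mean.
    criterion = [round(sum(row) / N_DESCRIPTORS) for row in scores]
    criterion_shares.append(share_agreeing_with_mode(criterion))

descriptor_rel = sum(descriptor_shares) / len(descriptor_shares)
criterion_rel = sum(criterion_shares) / len(criterion_shares)
print(f"descriptor-level agreement: {descriptor_rel:.2f}")
print(f"criterion-level agreement:  {criterion_rel:.2f}")
```

Under these assumptions the compiled criterion ratings agree far more often than the descriptor scores they are built from, even though no simulated rater looked at a script.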
Outcome Feasibility Study II, June 2007
CRITERIA                   REL
TaskFulfilment [Keeper]    0,81
Organisation [Keeper]      0,83
Grammar [Keeper]           0,85
Vocabulary [Keeper]        0,84
Overall [Keeper]           0,87
Outcome Feasibility Study II, June 2007
Descriptors/Criterion Organisation     REL
Organisation_01 [Keeper for a day]     0,75
Organisation_02 [Keeper for a day]     0,56
Organisation_03 [Keeper for a day]     0,73
Organisation_04 [Keeper for a day]     0,82
Organisation_05 [Keeper for a day]     0,54
Organisation_06 [Keeper for a day]     0,83
Organisation_07 [Keeper for a day]     0,84
Organisation_08 [Keeper for a day]     0,66
Organisation_09 [Keeper for a day]     0,63
Organisation [Keeper for a day]        0,83
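The gap between descriptor-level and criterion-level agreement in the table above can be summarised with a little arithmetic. The values are transcribed from the table with decimal points; taking the plain mean of the nine descriptor values is a choice made for this sketch, not an index reported in the slides.

```python
# REL values for Organisation_01 .. Organisation_09 [Keeper for a day], FS II
descriptor_rel = [0.75, 0.56, 0.73, 0.82, 0.54, 0.83, 0.84, 0.66, 0.63]
criterion_rel = 0.83  # REL of the compiled Organisation rating

mean_descriptor_rel = sum(descriptor_rel) / len(descriptor_rel)
# 1-based numbers of the descriptors whose REL is below the criterion REL
below_criterion = [i + 1 for i, rel in enumerate(descriptor_rel)
                   if rel < criterion_rel]

print(f"mean descriptor REL: {mean_descriptor_rel:.2f}")
print(f"criterion REL:       {criterion_rel:.2f}")
print(f"descriptors below criterion REL: {below_criterion}")
```

Seven of the nine descriptors show lower agreement than the compiled criterion rating, and two of them (02 and 05) fall below 0,60.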
Outcome Feasibility Study II, June 2007
- Fairly high agreement on criterion-level ratings is NOT the result of uniform interpretation of descriptors, BUT rather results from cancellation of deviations on the descriptor level during the compilation of the criterion ratings
- Rating holistic criteria by evaluation of several pre-defined descriptors can only be valid if descriptors are understood uniformly by all raters
- Descriptors need to be revised
- Training and assessment in the pilot study have to be conducted on the descriptor level in order to be able to control rating behaviour
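A constructed example makes the cancellation effect concrete. The four raters, their scores and the compilation rule (rounded, equally weighted mean of the descriptor scores) are invented for illustration; the slides do not specify how raters weighted the descriptors.

```python
from collections import Counter


def share_agreeing_with_mode(values):
    """Proportion of values equal to the most frequent value."""
    return Counter(values).most_common(1)[0][1] / len(values)


# Scores of four hypothetical raters on four descriptors of one script
# (0 = below, 1 = pass, 2 = pass plus).
scores = {
    "A": [0, 2, 1, 1],
    "B": [2, 0, 1, 1],
    "C": [1, 1, 0, 2],
    "D": [1, 1, 2, 0],
}

# Agreement with the mode, descriptor by descriptor.
n_descriptors = len(next(iter(scores.values())))
descriptor_agreement = [
    share_agreeing_with_mode([row[d] for row in scores.values()])
    for d in range(n_descriptors)
]

# Hypothetical compilation rule: rounded, equally weighted mean.
criterion_ratings = {rater: round(sum(row) / len(row))
                     for rater, row in scores.items()}
criterion_agreement = share_agreeing_with_mode(list(criterion_ratings.values()))

print("descriptor level:", descriptor_agreement)
print("criterion level: ", criterion_agreement)
```

Each descriptor reaches only 50% agreement, yet every rater's deviations cancel out and all four compile the same "pass" rating, so the criterion-level index reads 100%.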
Background Pilot Study
- Sample size: N = 2932
- Number of items:
  - Listening: 349
  - Reading: 391
  - Writing: 19 tasks
- n = 300 – 370 / item (M = 330)
- All Länder
- All school types
- 8th, 9th and 10th graders
Summer Training
- 13 raters, selected on the basis of English language proficiency, study background and DPC coding test
- Challenge of piloting tasks, rating approach and scales simultaneously
First one-week seminar:
- Introduction of CEF, scales and tasks
- Introduction of rating procedures
- Introduction of benchmarks
Summer Training
6 one-day sessions:
- Weekly practice
- Discussion & evaluation of practice results
- Introduction of further tasks / levels
- Revision of scale descriptors
Five levels, 19 tasks: Simultaneous introduction of several levels and tasks necessary in order to control level and task interdependencies
Three rounds of practice per task ideal:
1. Intro – practice
2. Feedback – practice
3. Feedback – practice
4. Evaluation of reliabilities
…
Training Progress "Sports Accident", B1
Criterion/descriptors: Task Fulfilment
ITEM                        REL Practice 4  REL Practice 5  REL Practice 6
TF 1 [Sports Accident]      0,65            0,76            0,88
TF 2 [Sports Accident]      0,66            0,77            0,79
TF 3 [Sports Accident]      0,87            0,85            0,92
TF 4 [Sports Accident]      0,80            0,72            0,77
TF 5 [Sports Accident]      0,70            0,78            0,83
TF gen [Sports Accident]    0,71            0,80            0,80
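The change in agreement over the three practice rounds can be read off the Task Fulfilment figures with a short calculation. The values are transcribed from the table with decimal points; the variable names are arbitrary and the rounding to two decimals is a presentation choice of this sketch.

```python
# REL per practice round (Practice 4, 5, 6) for "Sports Accident", B1
task_fulfilment = {
    "TF 1": (0.65, 0.76, 0.88),
    "TF 2": (0.66, 0.77, 0.79),
    "TF 3": (0.87, 0.85, 0.92),
    "TF 4": (0.80, 0.72, 0.77),
    "TF 5": (0.70, 0.78, 0.83),
    "TF gen": (0.71, 0.80, 0.80),
}

# Change from Practice 4 to Practice 6, rounded to two decimals
change = {item: round(rels[2] - rels[0], 2)
          for item, rels in task_fulfilment.items()}
declined = [item for item, delta in change.items() if delta < 0]

for item, delta in change.items():
    print(f"{item}: {delta:+.2f}")
print("lower in Practice 6 than in Practice 4:", declined)
```

Five of the six items gain between 0,05 and 0,23 from Practice 4 to Practice 6; only TF 4 ends slightly lower than it started.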
Training Progress "Sports Accident", B1
Criterion/descriptors: Organisation
ITEM                       REL Practice 4  REL Practice 5  REL Practice 6
O 1 [Sports Accident]      0,73            0,77            0,85
O 2 [Sports Accident]      0,81 (only one value given)
O 3 [Sports Accident]      0,72            0,71            0,80
O 4 [Sports Accident]      0,77            0,79            0,82
O 5 [Sports Accident]      0,96 (only one value given)
O gen [Sports Accident]    0,71            0,76            0,81
Summer Training
Second one-week seminar:
- Feedback on last round of practice
- Addition of benchmarks for borderline cases
- Addition of detailed justifications for benchmarks
- Finalisation of scale descriptors
- Revision of rating handbooks
Comparison FS II - Training
Criteria                     FS II REL  Practice 4 REL
[Keeper - TaskFulfilment]    0,81       0,71
[Keeper - Organisation]      0,83       0,74
[Keeper - Grammar]           0,85       0,76
[Keeper - Vocabulary]        0,84       0,74
[Keeper - Overall]           0,87       0,77
Comparison FS II - Training
FS II
ITEM              REL
O_01 [Keeper]     0,75
O_02 [Keeper]     0,56
O_03 [Keeper]     0,73
O_04 [Keeper]     0,82
O_05 [Keeper]     0,54
O_06 [Keeper]     0,83
O_07 [Keeper]     0,84
O_08 [Keeper]     0,66
O_09 [Keeper]     0,63
O_gen [Keeper]    0,83
Practice 4
ITEM              REL
O 1 [Keeper]      0,75
O 2 [Keeper]      0,73
skipped
skipped
O 3 [Keeper]      0,72
O 4 [Keeper]      0,74
O 5 [Keeper]      0,95
O gen [Keeper]    0,74
Conclusion
Training concept for the future
- Materials prepared – weekly seminars not necessary
- Training and rating on descriptor level
- Multiple one-day sessions, one per week to give time for practice
  - Introduction
  - Practice: 3 rounds per task ideal
  - Feedback
Thank you for your attention!
Claudia Harsch
Phone + 49 + (0)30 + 2093 - 5508
Fax + 49 + (0)30 + 2093 - 5336
E-mail [email protected]
www.IQB.hu-berlin.de
Mail Address
Humboldt-Universität zu Berlin
Unter den Linden 6
10099 Berlin
GERMANY
Guido Martin
Phone + 49 + (0)40 + 48 500 612
E-mail [email protected]
www.iea-dpc.de
Mail Address
IEA DPC
Mexikoring 37
D-22297 Hamburg
GERMANY