Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring EALTA, Athens May 9-11, 2008 Claudia Harsch, IQB Guido Martin, IEA DPC



Evaluation and Control of Rater Reliability:

Holistic vs. Analytic Scoring

EALTA, Athens, May 9-11, 2008

Claudia Harsch, IQB
Guido Martin, IEA DPC


Overview

1. Background
   - Standards-based assessment in Germany (here: Writing in EFL)
   - Writing tasks and rating approach

2. Feasibility Studies
   - Feasibility Study I, May 2007: trial scales and approach
   - Feasibility Study II, June 2007: trial holistic vs. analytic approach

3. Pilot Study, July/August 2007
   - Training
   - Comparison FS II vs. Pilot Study Training


Background Assessing ES in Germany

Evaluation of Educational Standards for grades 9 and 10 by IQB Berlin

In Foreign Languages, standards are linked to the CEF, targeting
- A2 for lower track of secondary school
- B1 for middle track of secondary school

Assessment of “4 skills”: reading, listening, writing and speaking (under development)

Tasks based on CEF-levels A1 to C1; uni-level approach


Sample task: Keeper, targeting B1


Assessment of Writing Tasks

Criteria of assessment, each defined by descriptors based on CEF, Manual, Into Europe:
- task fulfilment
- organisation
- grammar
- vocabulary
- overall impression

Rating approach
- A uni-level approach to grading the tasks in line with the specific target level
- Performance to be graded on a below / pass / pass plus basis
- "Holistic approach": Ratings are the result of a weighted assessment of several descriptors per criterion


Feasibility Study I May 2007

Aims
- Trial training / rating approach with student teachers
- Gain insight into scales and criteria
- Get feedback on accessibility of handbooks, benchmarks, coding software

Procedure
- 2 tasks: A2 “Lost dog” / B1 “Keeper for a day”
- 6 raters: student teachers of English, proficient in writing English
- First training session (1 day): introduction to CEF, scales and tasks
- Practice 1: 30 scripts per task (over 1 week)
- Second training session (1 day): evaluation & discussion of practice results
- Practice 2: 28 scripts per task (over 1 week)
- Evaluation of results in terms of rating reliability


Feasibility Study I May 2007

Evaluation: Assessing Rater Reliability

Index used: Percent Agreement with Mode
- Measures the percentage of agreement with the value most often awarded on the level of individual ratings
- Can be aggregated on item (variable) and rater level
- Easily interpreted
- No assumptions about scale level
- No assumptions about value distributions
- No estimation errors
- Can be interpreted as a proxy for validity
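The index described above can be sketched in a few lines of Python. This is a minimal illustration and not the software used in the studies: the function name `agreement_with_mode`, the data layout, the tie-breaking rule and the example ratings are all assumptions made for the sketch. Each individual rating counts as a hit if it equals the value most often awarded to the same script, and hits are then aggregated per rater and per item.

```python
from collections import Counter


def agreement_with_mode(ratings):
    """Percent agreement with mode for ONE item.

    ratings: dict mapping script id -> dict mapping rater id -> awarded code.
    Returns (item_level, per_rater), both as proportions between 0 and 1.
    """
    hits = Counter()    # ratings that match the modal value, per rater
    totals = Counter()  # ratings awarded, per rater
    for by_rater in ratings.values():
        # Modal value for this script; ties go to the value seen first
        # (an assumption, the slides do not state a tie-breaking rule).
        mode = Counter(by_rater.values()).most_common(1)[0][0]
        for rater, value in by_rater.items():
            totals[rater] += 1
            hits[rater] += int(value == mode)
    per_rater = {r: hits[r] / totals[r] for r in totals}
    item_level = sum(hits.values()) / sum(totals.values())
    return item_level, per_rater


# Hypothetical ratings: 4 scripts, 3 raters.
example = {
    "script_1": {"R01": "pass", "R02": "pass", "R03": "pass"},
    "script_2": {"R01": "pass", "R02": "below", "R03": "pass"},
    "script_3": {"R01": "pass plus", "R02": "pass plus", "R03": "pass"},
    "script_4": {"R01": "below", "R02": "below", "R03": "below"},
}

item_level, per_rater = agreement_with_mode(example)
print(round(item_level, 3), per_rater)
# 0.833 {'R01': 1.0, 'R02': 0.75, 'R03': 0.75}
```

Because the index only counts matches with the modal category, it needs no assumption about scale level or value distribution, which is the property the slide stresses.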


Outcome Feasibility Study I, May 2007

Reliability per Item

ITEM                         REL
TaskFulfilment [Keeper]      0,759
Organisation [Keeper]        0,852
Grammar [Keeper]             0,846
Vocabulary [Keeper]          0,870
Overall [Keeper]             0,858
TaskFulfilment [Lost dog]    0,839
Organisation [Lost dog]      0,863
Grammar [Lost dog]           0,845
Vocabulary [Lost dog]        0,869
Overall [Lost dog]           0,833


Outcome Feasibility Study I, May 2007

Reliability per Rater & Item

ITEM                 R01    R02    R03    R04    R05    R06
Overall [Keeper]     0,852  1,000  0,741  0,889  0,704  0,963
Overall [Lost dog]   0,857  0,929  0,786  0,857  0,571  1,000
REL Average          0,847  0,931  0,770  0,826  0,757  0,931


Outcome Feasibility Study I, May 2007

- Approach appears feasible
- Scales seem to be usable and applicable
- BUT: We do not know what raters do on the sub-criterion-level
- Need to further explore behaviour at descriptor level => Feasibility Study II


Feasibility Study II, June 2007

Comparison:
- Holistic scores for the five criteria (FS I)
- Scoring each descriptor on its own and in addition scoring the criteria “holistically” (FS II)

Reasons behind:
- “below” – “pass” – “pass plus” in a uni-level approach targeting a specific population: tendency towards the “pass” value
- Similar outcomes can be achieved by purely random value distributions at the descriptor level
- Data on scoring each descriptor show whether raters interpret descriptors uniformly before using them to compile the weighted overall criterion rating
- Reliable usage of descriptors is a precondition for valid ratings on the criterion-level
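The point about random value distributions can be made concrete with a small exact calculation. The sketch below is an illustration under loudly stated assumptions, none of which come from the slides: each descriptor is scored independently at random with probabilities 0.2 / 0.6 / 0.2 for below / pass / pass plus, a criterion has nine descriptors, and the criterion rating is compiled as the median of its descriptor scores (the actual weighting used by the raters is not specified in the source).

```python
from itertools import product
from statistics import median

# Hypothetical per-descriptor probabilities (assumption):
# 0 = below, 1 = pass, 2 = pass plus, with a tendency towards "pass".
P = {0: 0.2, 1: 0.6, 2: 0.2}
N_DESCRIPTORS = 9  # assumed number of descriptors for one criterion


def criterion_distribution(p, n):
    """Exact distribution of the criterion rating when n descriptors are
    scored independently at random and compiled by their median
    (the median rule is an assumption made for this sketch)."""
    dist = {code: 0.0 for code in p}
    for combo in product(p, repeat=n):  # every possible descriptor pattern
        prob = 1.0
        for code in combo:
            prob *= p[code]
        dist[median(combo)] += prob
    return dist


crit = criterion_distribution(P, N_DESCRIPTORS)
print("P(pass) per descriptor:", P[1])
print("P(pass) per criterion: ", round(crit[1], 3))
```

Under these assumptions a single random descriptor score is "pass" in 60% of cases, while the compiled criterion rating is "pass" in roughly 96% of cases. Raters who score descriptors at random would therefore still agree closely on the criterion rating, which is why criterion-level agreement alone says little about how uniformly the descriptors are interpreted.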


Outcome Feasibility Study II, June 2007

CRITERIA                   REL
TaskFulfilment [Keeper]    0,81
Organisation [Keeper]      0,83
Grammar [Keeper]           0,85
Vocabulary [Keeper]        0,84
Overall [Keeper]           0,87


Outcome Feasibility Study II, June 2007

Descriptors/Criterion Organisation      REL
Organisation_01 [Keeper for a day]      0,75
Organisation_02 [Keeper for a day]      0,56
Organisation_03 [Keeper for a day]      0,73
Organisation_04 [Keeper for a day]      0,82
Organisation_05 [Keeper for a day]      0,54
Organisation_06 [Keeper for a day]      0,83
Organisation_07 [Keeper for a day]      0,84
Organisation_08 [Keeper for a day]      0,66
Organisation_09 [Keeper for a day]      0,63
Organisation [Keeper for a day]         0,83


Outcome Feasibility Study II, June 2007

- Fairly high agreement on criterion-level ratings is NOT the result of uniform interpretation of descriptors …
- BUT rather results from cancellation of deviations on the descriptor-level during the compilation of the criterion ratings
- Rating holistic criteria by evaluation of several pre-defined descriptors can only be valid if descriptors are understood uniformly by all raters
- Descriptors need to be revised
- Training and assessment in the pilot study have to be conducted on the descriptor level in order to be able to control rating behaviour
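The cancellation effect named above can be shown with a tiny constructed case. The ratings below are invented for illustration, and compiling the criterion rating as the median of the descriptor scores is an assumption of this sketch, not the weighting the raters actually applied. Three hypothetical raters disagree on most descriptors of one script, yet arrive at the same criterion rating.

```python
from collections import Counter
from statistics import median


def share_agreeing_with_mode(values):
    """Proportion of values equal to the most frequent value
    (ties go to the value seen first, an assumption)."""
    mode = Counter(values).most_common(1)[0][0]
    return sum(v == mode for v in values) / len(values)


# Hypothetical descriptor scores for ONE script (0 = below, 1 = pass, 2 = pass plus).
descriptor_ratings = {
    "R01": [0, 1, 1, 2, 1],
    "R02": [1, 0, 2, 1, 1],
    "R03": [1, 2, 0, 1, 1],
}

# Agreement with mode, computed descriptor by descriptor and then averaged.
per_descriptor = [
    share_agreeing_with_mode(column)
    for column in zip(*descriptor_ratings.values())
]
descriptor_level = sum(per_descriptor) / len(per_descriptor)

# Criterion rating compiled per rater (median rule is an assumption).
criterion_ratings = [median(scores) for scores in descriptor_ratings.values()]
criterion_level = share_agreeing_with_mode(criterion_ratings)

print(round(descriptor_level, 2), criterion_level)
# 0.6 1.0
```

All three raters compile a "pass" for the criterion although they agree with the modal value on only 60% of the individual descriptor ratings, so the deviations cancel out during compilation.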


Background Pilot Study

- Sample Size: N = 2932
- Number of Items:
  - Listening: 349
  - Reading: 391
  - Writing: 19 Tasks
- n = 300 – 370 / item (M = 330)
- All Länder
- All school types
- 8th, 9th and 10th graders


Summer Training

13 Raters, selected on the basis of English language proficiency, study background and DPC coding test

Challenge of piloting tasks, rating approach and scales simultaneously

First one-week seminar:
- Introduction of CEF, scales and tasks
- Introduction of rating procedures
- Introduction of benchmarks


Summer Training

6 one-day sessions:
- Weekly practice
- Discussion & Evaluation of practice results
- Introduction of further tasks / levels
- Revision of scale descriptors

Five levels, 19 tasks: Simultaneous introduction of several levels and tasks necessary in order to control level and task interdependencies

Three rounds of practice per task ideal:
1. Intro – practice
2. Feedback – practice
3. Feedback – practice
4. Evaluation of reliabilities


Training Progress "Sports Accident", B1

Criterion/descriptors       REL         REL         REL
Task Fulfilment             Practice 4  Practice 5  Practice 6
TF 1 [Sports Accident]      0,65        0,76        0,88
TF 2 [Sports Accident]      0,66        0,77        0,79
TF 3 [Sports Accident]      0,87        0,85        0,92
TF 4 [Sports Accident]      0,80        0,72        0,77
TF 5 [Sports Accident]      0,70        0,78        0,83
TF gen [Sports Accident]    0,71        0,80        0,80


Training Progress "Sports Accident", B1

Criterion/descriptors: Organisation (REL Practice 4, REL Practice 5, REL Practice 6)
O 1 [Sports Accident] 0,73 0,77 0,85
O 2 [Sports Accident] 0,81
O 3 [Sports Accident] 0,72 0,71 0,80
O 4 [Sports Accident] 0,77 0,79 0,82
O 5 [Sports Accident] 0,96
O gen [Sports Accident] 0,71 0,76 0,81


Summer Training

Second one-week seminar:
- Feedback on last round of practice
- Addition of benchmarks for borderline cases
- Addition of detailed justifications for benchmarks
- Finalisation of scale descriptors
- Revision of rating handbooks


Comparison FS II - Training

                             FS II   PRACTICE 4
Criteria                     REL     REL
[Keeper - TaskFulfilment]    0,81    0,71
[Keeper - Organisation]      0,83    0,74
[Keeper - Grammar]           0,85    0,76
[Keeper - Vocabulary]        0,84    0,74
[Keeper - Overall]           0,87    0,77


Comparison FS II - Training

FS II

ITEM              REL
O_01 [Keeper]     0,75
O_02 [Keeper]     0,56
O_03 [Keeper]     0,73
O_04 [Keeper]     0,82
O_05 [Keeper]     0,54
O_06 [Keeper]     0,83
O_07 [Keeper]     0,84
O_08 [Keeper]     0,66
O_09 [Keeper]     0,63
O_gen [Keeper]    0,83

Practice 4

ITEM              REL
O 1 [Keeper]      0,75
O 2 [Keeper]      0,73
skipped
skipped
O 3 [Keeper]      0,72
O 4 [Keeper]      0,74
O 5 [Keeper]      0,95
O gen [Keeper]    0,74


Conclusion

Training concept for the future

- Materials prepared – weekly seminars not necessary
- Training and rating on descriptor level
- Multiple one-day sessions, one per week to give time for practice
  - Introduction
  - Practice: 3 rounds per task ideal
  - Feedback


Thank you for your attention!


Claudia Harsch

Phone + 49 + (0)30 + 2093 - 5508
Telefax + 49 + (0)30 + 2093 - 5336
E-mail [email protected]
www.IQB.hu-berlin.de

Mail Address

Humboldt-Universität zu Berlin
Unter den Linden 6
10099 Berlin
GERMANY

Guido Martin

Phone + 49 + (0)40 + 48 500 612
E-mail [email protected]
www.iea-dpc.de

Mail Address

IEA DPC
Mexikoring 37
D-22297 Hamburg
GERMANY