TEST TASKS FOR SPEAKING – BALANCING BETWEEN AUTHENTICITY AND RELIABILITY
Raili Hildén, University of Helsinki, Finland
TBLT 2009, Lancaster: ‘Tasks: context, purpose and use’, 3rd Biennial International Conference on Task-Based Language Teaching, 13-16 September 2009



Raili Hildén 15.9.2009 2

BACKGROUND: HY-TALK PROJECT OF SPEAKING ASSESSMENT
• The project is funded by the University of Helsinki.
• Aim: to validate the illustrative scales of speaking included in the national core curricula for general education and upper secondary level by trialling a prototype test of speaking.
• The scale, with subscales for overall task completion, fluency, pronunciation, range and accuracy, is empirically aligned to relevant scales of the CEFR.
• http://blogs.helsinki.fi/hy-talk/


THE CONCEPTUAL FRAMEWORK
• Validity argumentation scheme for interpretation of the HY-Talk project data (adapted from Kane, 2001; Fulcher & Davidson, 2007, 164–174; Bachman, 2005)
• The claim to be probed: “The illustrative scales of descriptors of oral proficiency included in the national core curricula for language education enable sufficiently valid conclusions on students' oral proficiency in general school education in Finland.”


THE PURPOSE OF THE HY-TALK STUDY
• The validity claim is supported and challenged by warrants and rebuttals regarding
  • relevance
  • utility (intended consequences)
  • sufficiency


WARRANTS
• The tasks used to elicit student performance correspond to pedagogic tasks and target language use tasks of students at the age of general education. (utility)
• Reliability of assessments based on the scale and the tasks to elicit performances is found to be high enough. (sufficiency)


BACKING TO SUPPORT THE UTILITY CLAIM
• Rater and test taker feedback confirm the perceived authenticity of the tasks and appropriateness of administration.
• The level ratings correspond to the target levels in the curricula.


BACKING DATA TO SUPPORT THE SUFFICIENCY CLAIM
• Statistical reliability evidence confirms a sufficient level of consistency across raters, tasks, languages and interlocutors.


COUNTERCLAIMS
• The tasks used to elicit student performance correspond inadequately to pedagogic tasks or TLU tasks of students. (utility)
• The link to the scale descriptors may be weak. (utility)
• The level assignments do not match the target levels set in the curricula.
• Reliability of assessments is not stable, but varies too much across tasks, raters or languages, or is caused by intervening variables or inadequate evidence base. (sufficiency)


REBUTTAL DATA TO SUPPORT THE UTILITY CLAIM
• Statistical evidence challenges the intended utility of the tasks.
• Verbal data from students and teachers question the utility and/or sufficiency of the tasks for the purpose.


RESEARCH QUESTIONS
1. How high is the inter-rater reliability of the judgements?
2. How are the tasks and corresponding salient task features related to target level judgements, assessment criteria and their combination? (numeric data, analysed with Facets)
3. How are the tasks perceived by students and raters? (verbal data based on feedback sheets and audio recorded rating sessions)
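Research question 2 relies on Facets, a program for many-facet Rasch measurement. The slides do not state the exact model specification, so the following is only a sketch of the rating-scale form such an analysis commonly takes, using the facets listed in this study (students, tasks, criteria, raters). All symbols are assumptions introduced here for illustration.

```latex
% Sketch of a many-facet Rasch (rating scale) model; notation is assumed, not taken from the slides.
% P_{nicjk}: probability that rater j awards student n level k (rather than k-1)
%            on task i and criterion c.
\[
  \log\frac{P_{nicjk}}{P_{nicj(k-1)}}
    = \theta_n - \delta_i - \gamma_c - \alpha_j - \tau_k
\]
% \theta_n : proficiency of student n
% \delta_i : difficulty of task i
% \gamma_c : difficulty of criterion c
% \alpha_j : severity of rater j
% \tau_k   : threshold between level k-1 and level k
```

All parameters are expressed in logits, which is why rater severity and task difficulty can be compared on the same scale.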


SPEAKING TASKS
• Tasks were designed to reflect the average target level specified for good mastery of the syllabus
  • English (grade 7: A1.3, grade 1: A2.2)
  • German etc. (grade 7: A1.2, grade 1: A2.1)
• They also draw on the thematic content of the curricula
• Discussed, revised and piloted by the project group


PROTOTYPE TASKS (WITH EXAMPLES)
1. Presentation (A2.2): partly controlled monologue
2. Everyday life (A2.1 – A2.2): rigidly controlled dialogues
  • At the airport, grade 7
  • At home, grade 7
  • Accommodation, grade 1
  • On the way home, grade 1
3. Negotiation: partly controlled dialogue
  • Planning an outing (A2.1 – B1.1)


SPEAKING TASKS
• Prompts in L1
• Time on task 10-15 min
• Conducted in pairs
• Rated by 5-10 language experts


DATA OF THIS STUDY
• Speech samples in English (56)
• Speech samples in German (66)


FACETS EXAMINED IN THIS STUDY
• Raters (5 English, 7 German)
• Tasks 1-4
• Task dimensions
  • Overall task performance
  • Fluency
  • Pronunciation
  • Range
  • Accuracy


RESULTS: RQ1 ENGLISH SAMPLES: OVERALL INTER-RATER AGREEMENT
• The majority of total ratings were placed between levels 5-6 (CEFR A2-B1)
• Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5/6)
  • Average of ratings given by R4: 6.66
  • Average of ratings given by R1: 5.87

For more detailed record please contact the author.
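The comparison of the most lenient and most severe rater can be illustrated with raw averages. The sketch below is not the Facets analysis behind the slide; it only shows, on hypothetical ratings invented for illustration, how per-rater mean levels and the gap between the extremes could be computed. The rater labels, sample identifiers and levels in `ratings` are assumptions, not project data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rater, sample, level) triples on the 10-step scale; NOT project data.
ratings = [
    ("R1", "s1", 5), ("R1", "s2", 6), ("R1", "s3", 6),
    ("R2", "s1", 6), ("R2", "s2", 6), ("R2", "s3", 7),
    ("R4", "s1", 6), ("R4", "s2", 7), ("R4", "s3", 7),
]


def rater_means(rows):
    """Return the mean awarded level for each rater."""
    by_rater = defaultdict(list)
    for rater, _sample, level in rows:
        by_rater[rater].append(level)
    return {rater: mean(levels) for rater, levels in by_rater.items()}


def severity_span(means):
    """Return (most severe rater, most lenient rater, gap in scale levels)."""
    most_severe = min(means, key=means.get)    # lowest average = most severe
    most_lenient = max(means, key=means.get)   # highest average = most lenient
    return most_severe, most_lenient, means[most_lenient] - means[most_severe]


means = rater_means(ratings)
severe, lenient, gap = severity_span(means)
print(severe, lenient, round(gap, 2))

# The same subtraction applied to the two averages reported on the slide.
reported_gap = round(6.66 - 5.87, 2)
print(reported_gap)
```

A gap computed this way is in scale levels, not logits; the 1-logit distance on the slide comes from the Rasch analysis, which also adjusts for which samples and tasks each rater judged.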


RESULTS: RQ1 ENGLISH SAMPLES: OVERALL TASK DIFFICULTY
• “The easiest” task: Presentation was assigned the highest fair average of 6.29
• “The trickiest” task: Everyday life task “Accommodation” was assigned the lowest fair average of 6.21


RESULTS: RQ1 ENGLISH SAMPLES: CRITERIA
• “The easiest” criterion: Pronunciation (fair average 6.39)
• “The trickiest” criterion: Range (fair average 6.02)


RESULTS: RQ1 ENGLISH SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA
• “The easiest” combinations:
  • Presentation + Accuracy
  • Presentation + Fluency
• “The trickiest” combination:
  • Everyday situation: Accommodation + Range


RESULTS: RQ1 GERMAN SAMPLES: OVERALL INTER-RATER AGREEMENT
• The majority of total ratings were placed between levels 5-6/10 (CEFR A2-B1)
• Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5-6)
  • Average of ratings given by R6: 3.96/10
  • Average of ratings given by R2: 3.57/10


RESULTS: RQ1 GERMAN SAMPLES: OVERALL TASK DIFFICULTY
• “The easiest” task: Presentation task was assigned the highest fair average of 4.21/10
• “The trickiest” task: Everyday life task “On the way home” was assigned the lowest fair average of 3.57/10


RESULTS: RQ1 GERMAN SAMPLES: CRITERIA
• “The easiest” criterion: Pronunciation (fair average 4.24/10)
• “The trickiest” criterion: Range (fair average 3.49/10)


RESULTS: RQ1 GERMAN SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA
• “The easiest” combination:
  • Presentation + Pronunciation (level 6 = B1.1)
• “The trickiest” combination:
  • Negotiation (Planning an outing) + Range (level 5 = A2.2 lower band)


RQ2: ENGLISH & GERMAN
• The tasks were conceived as authentic in regard to themes and situations.
• Authenticity (Bachman & Palmer, 1996) was questioned by raters during the sessions because of the high degree of control imposed by the L1 prompts (intended to increase reliability).
• Students regarded the tasks as relevant and highly probable in real life.
• The raters of German discussed the interlocutor impact of the pair setting as a biasing factor.
• The results suggest that the target level requirements set in the Finnish curricula are attained reasonably well.


DISCUSSION
• The utility claim was confirmed with respect to the high level of rater agreement across facets (reliability).
• Sufficiency and relevance were partly questioned because of the claimed lack of authenticity of the tasks (rigid instructions).
• How should this dilemma be addressed in future versions of the test?


REFERENCES
• Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
• Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment. An advanced resource book. Abingdon & New York: Routledge.
• Hildén, R. & Takala, S. (2007). Relating Descriptors of the Finnish School Scale to the CEF Overall Scales for Communicative Activities. In Koskensalo, A., Smeds, J., Kaikkonen, P. & Kohonen, V. (Eds.) Foreign languages and multicultural perspectives in the European context; Fremdsprachen und multikulturelle Perspektiven im europäischen Kontext. Dichtung, Wahrheit und Sprache (pp. 73–88). LIT-Verlag.


BIBLIOGRAPHY
• National Core Curriculum for the Comprehensive School 2004. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/info/ops/
• National Core Curriculum for the Upper Secondary Level 2003. Helsinki: Finnish National Board of Education. In Finnish: http://www.oph.fi/pageLast.asp?path=1,17627,1830,23059
• Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.