TEST TASKS FOR SPEAKING – BALANCING
BETWEEN AUTHENTICITY AND
RELIABILITY
Raili Hildén, University of Helsinki, Finland
TBLT 2009, Lancaster
‘Tasks: context, purpose and use’ 3rd Biennial International Conference on Task-Based Language
Teaching
13-16 September 2009
Raili Hildén 15.9.2009 2
BACKGROUND: HY-TALK PROJECT OF SPEAKING ASSESSMENT
The project is funded by the University of Helsinki.
Aim: to validate the illustrative scales of speaking included in the national core curricula for general education and upper secondary level by trialing a prototype test of speaking.
Subscales: overall task completion, fluency, pronunciation, range and accuracy.
The scale is empirically aligned to relevant scales of the CEFR.
http://blogs.helsinki.fi/hy-talk/
THE CONCEPTUAL FRAMEWORK
Validity argumentation scheme for interpretation of the HY-Talk project data (adapted from Kane, 2001; Fulcher & Davidson, 2007, pp. 164–174; Bachman, 2005)
The claim to be probed: “The illustrative scales of descriptors of oral proficiency included in the national core curricula for language education enable sufficiently valid conclusions on students’ oral proficiency in general school education in Finland.”
THE PURPOSE OF THE HY-TALK STUDY
The validity claim is supported and challenged by warrants and rebuttals regarding relevance, utility (intended consequences) and sufficiency.
WARRANTS
The tasks used to elicit student performance correspond to pedagogic tasks and target language use (TLU) tasks of students at the age of general education. (utility)
The reliability of assessments based on the scale and on the tasks used to elicit performances is found to be high enough. (sufficiency)
BACKING TO SUPPORT THE UTILITY CLAIM
Rater and test taker feedback confirm the perceived authenticity of the tasks and the appropriateness of administration.
The level ratings correspond to the target levels in the curricula.
BACKING DATA TO SUPPORT THE SUFFICIENCY CLAIM
Statistical reliability evidence confirms a sufficient level of consistency across raters, tasks, languages and interlocutors.
COUNTERCLAIMS
The tasks used to elicit student performance correspond inadequately to pedagogic tasks or TLU tasks of students. (utility)
The link to the scale descriptors may be weak. (utility)
The level assignments do not match the target levels set in the curricula.
Reliability of assessments is not stable, but varies too much across tasks, raters or languages, or is caused by intervening variables or inadequate evidence base. (sufficiency)
REBUTTAL DATA TO SUPPORT THE UTILITY CLAIM
Statistical evidence challenges the intended utility of the tasks.
Verbal data from students and teachers question the utility and/or sufficiency of the tasks for the purpose.
RESEARCH QUESTIONS
1. How high is the inter-rater reliability of the judgements?
2. How are the tasks and corresponding salient task features related to target level judgements, assessment criteria and their combination? (numeric data, analysed with Facets)
3. How are the tasks perceived by students and raters? (verbal data based on feedback sheets and audio-recorded rating sessions)
SPEAKING TASKS
Tasks were designed to reflect the average target level specified for good mastery of the syllabus:
English (grade 7: A1.3, grade 1: A2.2)
German etc. (grade 7: A1.2, grade 1: A2.1)
They also draw on the thematic content of the curricula.
Discussed, revised and piloted by the project group.
PROTOTYPE TASKS (WITH EXAMPLES)
1. Presentation (A2.2) partly controlled monologue
2. Everyday life (A2.1 – A2.2) rigidly controlled dialogues
At the airport, grade 7
At home, grade 7
Accommodation, grade 1
On the way home, grade 1
3. Negotiation: partly controlled dialogue
Planning an outing (A2.1 – B1.1)
SPEAKING TASKS
Prompts in L1
Time on task: 10-15 min
Conducted in pairs
Rated by 5-10 language experts
DATA OF THIS STUDY
Speech samples in English (56)
Speech samples in German (66)
FACETS EXAMINED IN THIS STUDY
Raters (5 English, 7 German)
Tasks 1-4
Task dimensions: overall task performance, fluency, pronunciation, range, accuracy
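Raters, tasks and task dimensions (criteria) are the kind of facets that a many-facet Rasch analysis, as run in the Facets program, places on a common logit scale. The slides do not state the model specification used, so the equation below is only a commonly used rating-scale formulation given as an assumption; all symbols are illustrative and not taken from the project.

```latex
\log\!\left(\frac{P_{ntcrk}}{P_{ntcr(k-1)}}\right) = B_n - D_t - E_c - C_r - F_k
```

Here $P_{ntcrk}$ is the probability that student $n$ receives level $k$ on task $t$ and criterion $c$ from rater $r$, $B_n$ is the student's ability, $D_t$ the task difficulty, $E_c$ the criterion difficulty, $C_r$ the rater's severity, and $F_k$ the threshold between levels $k-1$ and $k$, all expressed in logits.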
RESULTS: RQ1 ENGLISH SAMPLES: OVERALL INTER-RATER AGREEMENT
The majority of total ratings were placed between levels 5-6 (CEFR A2-B1)
Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5/6)
Average of ratings given by R4: 6.66
Average of ratings given by R1: 5.87
For a more detailed record, please contact the author.
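To give a feel for what a severity gap of 1 logit means, the sketch below converts a logit difference into an odds ratio and a probability. It is a simplified, dichotomous reading (reaching the higher of two adjacent levels or not), not the project's analysis; the function names and the even-odds starting point are assumptions chosen for illustration.

```python
import math


def odds_ratio_from_logits(gap_in_logits: float) -> float:
    """Odds ratio implied by a difference expressed in logits."""
    return math.exp(gap_in_logits)


def probability_from_logit(logit: float) -> float:
    """Logistic transform: map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


# Severity gap between the most severe and the most lenient rater,
# as reported on the slide.
rater_gap = 1.0
ratio = odds_ratio_from_logits(rater_gap)

# Hypothetical case (an assumption): a performance that has even odds
# (0 logits) of being placed at the higher level by the severe rater.
p_severe = probability_from_logit(0.0)
p_lenient = probability_from_logit(0.0 + rater_gap)

print(f"odds ratio: {ratio:.2f}")
print(f"P(higher level), severe rater:  {p_severe:.2f}")
print(f"P(higher level), lenient rater: {p_lenient:.2f}")
```

Under these assumptions a 1-logit gap multiplies the odds by about 2.72, so a performance with a 0.50 chance of the higher level under the severe rater has roughly a 0.73 chance under the lenient one.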
RESULTS: RQ1 ENGLISH SAMPLES: OVERALL TASK DIFFICULTY
“The easiest” task: Presentation was assigned the highest fair average of 6.29
“The trickiest” task: Everyday life task “Accommodation” was assigned the lowest fair average of 6.21
RESULTS: RQ1 ENGLISH SAMPLES: CRITERIA
“The easiest” criterion: Pronunciation (fair average 6.39)
“The trickiest” criterion: Range (fair average 6.02)
RESULTS: RQ1 ENGLISH SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA
“The easiest” combinations: Presentation + Accuracy, Presentation + Fluency
“The trickiest” combination: Everyday situation “Accommodation” + Range
RESULTS: RQ1 GERMAN SAMPLES: OVERALL INTER-RATER AGREEMENT
The majority of total ratings were placed between levels 5-6/10 (CEFR A2-B1)
Across all facets and raters, the distance between the most severe and the most lenient rater was 1 logit (levels 5-6)
Average of ratings given by R6: 3.96/10
Average of ratings given by R2: 3.57/10
RESULTS: RQ1 GERMAN SAMPLES: OVERALL TASK DIFFICULTY
“The easiest” task: Presentation task was assigned the highest fair average of 4.21/10
“The trickiest” task: Everyday life task “On the way home” was assigned the lowest fair average of 3.57/10
RESULTS: RQ1 GERMAN SAMPLES: CRITERIA
“The easiest” criterion: Pronunciation (fair average 4.24/10)
“The trickiest” criterion: Range (fair average 3.49/10)
RESULTS: RQ1 GERMAN SAMPLES: COMBINED DIFFICULTY = TASK + CRITERIA
“The easiest” combination: Presentation + Pronunciation (level 6 = B1.1)
“The trickiest” combination: Negotiation (Planning an outing) + Range (level 5 = A2.2 lower band)
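The extremes reported on the results slides can be set side by side with a few lines of arithmetic. The sketch below uses only the averages quoted on the slides; the dictionary layout and the helper name `spreads` are illustrative assumptions, not part of the project's analysis. The gaps are raw differences on the numeric level scale, not logits.

```python
# Highest and lowest averages as quoted on the results slides.
# Each entry is (highest value, lowest value) on the numeric level scale.
reported = {
    "English": {
        "rater": (6.66, 5.87),      # R4 vs R1 (average of ratings)
        "task": (6.29, 6.21),       # Presentation vs "Accommodation"
        "criterion": (6.39, 6.02),  # Pronunciation vs Range
    },
    "German": {
        "rater": (3.96, 3.57),      # R6 vs R2 (average of ratings)
        "task": (4.21, 3.57),       # Presentation vs "On the way home"
        "criterion": (4.24, 3.49),  # Pronunciation vs Range
    },
}


def spreads(extremes):
    """Return the gap between highest and lowest average for each facet."""
    return {facet: round(high - low, 2) for facet, (high, low) in extremes.items()}


summary = {language: spreads(extremes) for language, extremes in reported.items()}

for language, by_facet in summary.items():
    for facet, gap in by_facet.items():
        print(f"{language:<8}{facet:<10}{gap:.2f}")
```

On these figures the English task averages differ by only 0.08 of a level, while the German criteria differ by 0.75, the largest of the six gaps.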
RQ2: ENGLISH & GERMAN
The tasks were conceived as authentic with regard to themes and situations.
Authenticity (Bachman & Palmer, 1996) was questioned by raters during the sessions due to the high degree of control imposed by the L1 prompts (to increase reliability).
Students regarded the tasks as relevant and highly probable in real life.
The raters of German discussed the interlocutor impact of the pair setting as a biasing factor.
The results suggest that the target level requirements set in the Finnish curricula are attained reasonably well.
DISCUSSION
The utility claim was confirmed as to the high level of agreement of raters across facets (reliability).
Sufficiency and relevance were partly questioned due to the claimed inauthenticity of the tasks (rigor of instructions).
How should this dilemma be resolved in future versions of the test?
REFERENCES
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34.
Fulcher, G. & Davidson, F. (2007). Language Testing and Assessment: An advanced resource book. Abingdon & New York: Routledge.
Hildén, R. & Takala, S. (2007). Relating Descriptors of the Finnish School Scale to the CEF Overall Scales for Communicative Activities. In Koskensalo, A., Smeds, J., Kaikkonen, P. & Kohonen, V. (Eds.), Foreign languages and multicultural perspectives in the European context; Fremdsprachen und multikulturelle Perspektiven im europäischen Kontext. Dichtung, Wahrheit und Sprache (pp. 73–88). LIT-Verlag.
BIBLIOGRAPHY
National Core Curriculum for the Comprehensive School 2004. Helsinki: Finnish National Board of Education. In Finnish http://www.oph.fi/info/ops/
National Core Curriculum for the Upper Secondary Level 2003. Helsinki: Finnish National Board of Education. In Finnish http://www.oph.fi/pageLast.asp?path=1,17627,1830,23059
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.