Impact, Washback and Consequences of Large-scale Testing
Liying Cheng (Ph.D)
Queen’s University
[email protected]@educ.queensu.ca
Overview
1. Define the research terms – washback, impact and consequences
2. Discuss this phenomenon in relation to test validity and social consequences
3. Argue for gathering further empirical evidence beyond Alderson & Wall (1993) and Cheng et al. (2004)
4. Illustrate a series of empirical studies using different methodologies
Focusing on the influence of testing on students
Impact, washback, and consequences
There is a set of relationships, intended and unintended, positive and negative, between teaching, learning and testing (Alderson & Wall, 1993).
measurement-driven instruction (e.g. Popham, 1987),
test-curriculum alignment (Shepard, 1990), and
consequences (Cizek, 2001) (see Cheng & Curtis, 2004 for a detailed review)
Impact, washback, and consequences
Test impact - the effects of tests on macro-levels of education and society, and
washback - the effects of language tests on micro-levels of language teaching and learning, i.e. inside the classroom (Bachman & Palmer, 1996; McNamara, 2000; Wall, 1997).
A view of test influence falling between the narrow one of washback and the all-encompassing one of impact (Hamp-Lyons, 1997).
Validity (theoretical models)
Washback - ‘only one form of testing consequences that need to be weighted in evaluating validity’ (Messick, 1996, p.243), promoting the examination of the two threats to test validity, construct under-representation and construct-irrelevant variance, to decide the possible consequences that a test can have on teaching and learning.
Bachman (2005) proposes a framework with a set of principles and procedures for linking test scores and score-based inferences to test use and the consequences of test use.
Social consequences for society (philosophical models)
Critical language testing - political uses and abuses of language tests (Shohamy, 2001)
Fairness framework (Kunnan, 2004) - drew on research in ethics to link validity and consequences - tests as instruments of social policy and control.
An encompassing ethics framework to examine the consequences of testing on language learning at the classroom as well as the educational, social and political levels (Hamp-Lyons, 1997).
Model of Washback
Figure 1. An Adapted Model of Washback
Columns: PARTICIPANT, PROCESS, PRODUCT
TEST
Students: Learning
Teachers: Teaching
Textbook writers and curriculum designers: New textbooks and new curricula
Researchers: Research results
Source: Adapted from K. M. Bailey (1996), Working for washback: A review of the washback concept in language testing, Language Testing, 13.
Washback studies
Two areas of washback studies have recently been conducted:
those relating to ‘traditional’ or existing tests which are thought to stifle innovative teaching, and
those relating to cases where a test has been specifically changed in order to encourage innovation in the classroom.
Methods
Survey methods – interviews and questionnaires
Classroom observations
Cheng, L., & Watanabe, Y., with Curtis, A. (Eds.) (2004). Washback in language testing: Research contexts and methods. Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.
Major work on washback
Alderson & Wall (1993) 15 hypotheses
1. A test will influence teaching*.
2. A test will influence learning?.
3. A test will influence what teachers teach*.
4. A test will influence how teachers teach*.
5. A test will influence what learners learn*.
6. A test will influence how learners learn?.
7. A test will influence the rate and sequence of teaching?.
8. A test will influence the rate and sequence of learning?.
9. A test will influence the degree and depth of teaching?.
10. A test will influence the degree and depth of learning?.
11. A test will influence attitudes to the content, method, etc., of teaching and learning*?.
12. Tests that have important consequences will have washback*.
13. Tests that do not have important consequences will have no washback*?.
14. Tests will have washback on all learners and teachers?.
15. Tests will have washback effects for some learners and some teachers, but not for others?.
Why study the impact on students?
… of the many millions of people who will sit down to take (English) tests …, virtually none will have participated in the test’s design, in writing test items, in critiquing the test methods, in setting cut scores or in writing or commenting on the performance descriptions that tie to their all-important score.
Of all stakeholders in testing events, test takers surely have the highest stake of all (Hamp-Lyons, 2000, p. 581).
The Impact Study of Ontario Secondary School Literacy Test
Phase I of the Impact Study - 2002-03 Oct. OSSLT data
o Impact of test types, skills and strategies
o Impact of test formats
o Impact of test-taker characteristics and their test performance
(Non-ESL/ELD N=4068, 5003) (ESL/ELD n=2164, 4311)
Phase II of the Impact Study - 2006-07 Mar. OSSLT data
o Case studies of individual students
o Focus group
o Survey
Impact of test/task types, skills and strategies on students
                   Difficulty            Discrimination
o Text types       Information           Narrative
o Skills           Making connections    Indirect understanding
o Strategies       Syntax                Vocabulary
o Writing tasks    Summary*              News report
Impact of test formats on students
Difficulty   Discrimination
Test formats: MC, CR, CRE
o MC
For the 2003 data, only MC questions best separated the two groups (discriminant coefficient = 1, p < .001).
o CR
For the February 2002 data, CR questions best separated the two groups (discriminant coefficient = .42, p < .001).
MC questions had a discriminant coefficient of .34 (p < .001), and CRE questions had the lowest discriminant coefficient of .30 (p < .001).
Impact of L2 test takers’ characteristics and test performance
Multiple Regression Results
Predictor                               β     t       Sig.   R2
Total reading score (Total=200, M=114.41, SD=33.49)          .18
  E-literacy activities                 .26   18.11   .00
  Literature activities                 .17   12.06   .00
  Non-fiction activities                .13    9.38   .00
  Newspaper and magazine activities     .13    8.83   .00
  Literacy hours                        .11    7.69   .00
  Home language spoken                  .07    4.69   .00
  First language                        .05    3.41   .00
Total writing score (Total=180, M=104.29, SD=38.83)          .11
  E-literacy activities                 .19   12.79   .00
  Literature activities                 .17   11.49   .00
  Non-fiction activities                .09    6.21   .00
  Literacy hours                        .07    4.55   .00
  Newspaper and magazine activities     .08    5.06   .00
  Home language spoken                  .08    4.80   .00
  First language                        .05    3.17   .00
Note. * p<.01
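As an illustration of how standardized coefficients and R2 of the kind reported in such a table relate to each other, here is a minimal sketch for the two-predictor case only. It is not the analysis used in the study: the function names and the small data set are hypothetical, and the closed-form formulas below apply only when there are exactly two predictors.

```python
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def standardized_betas(x1, x2, y):
    """Standardized coefficients and R-squared for y regressed on x1 and x2."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    # With standardized variables, R-squared is the sum of beta * correlation.
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical scores for six students (not data from the study).
e_literacy = [1, 2, 3, 4, 5, 6]
literature = [2, 1, 4, 3, 6, 5]
reading = [50, 55, 70, 72, 90, 88]

b1, b2, r2 = standardized_betas(e_literacy, literature, reading)
print(f"beta(e-literacy)={b1:.2f}  beta(literature)={b2:.2f}  R2={r2:.2f}")
```

With more than two predictors, as in the slide, a general least-squares routine from a statistics package would be needed instead of these closed-form expressions.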
Focus group (Cognitive lab)
The research questions:
• How do L1 and L2 test takers’ accounts of the OSSLT differ?
• Is the construct the same for both L1 and L2 groups or does it change in important ways in relation to language background?
• Do these differences pose a threat to the inferences drawn from the test results?
• What other differences are evident in these test takers’ accounts of the OSSLT?
Fox, J. & Cheng, L. (in press). Did we take the same test? Differing accounts of the Ontario Secondary School Literacy Test by first and second language test takers. Assessment in Education: Principles, Policy & Practice, 14(1).
Key differences between English speakers and ESL/ELD students in behaviour and accounts of:
• Test knowledge
Knowledge of test genre -- formats; space; of what’s expected; what raters want/will reward.
• Test-wiseness
Strategic vs. non-strategic responses; connects with background in test taking.
• The construct or what’s being tested
Language proficiency or writing? Problem -- prompts that rely on a key word for response: “junk food”; “invention”; how L2 test takers engage with the test – the importance of pictures and other cues vs. reliance on texts.
• Affect/investment
Emotional investment – anxiety, sadness, confidence, perceptions of difficulty
Initial Findings
Students’ attitude and their CET performance (Zhao & Cheng, 2006)
What are the attitudes of students toward CET-4?
What relationships exist between students’ attitudes and their performance on the CET-4?
Do sex differences exist in attitudes and their relation to test performance?
What attitudes differentiate high achievers (who score above the 80th percentile) from low achievers (below the 20th percentile)? What is the relationship between the two?
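The high/low achiever grouping described above can be sketched as a percentile split. This is only an illustration under stated assumptions: the function name and the scores are hypothetical, and the "inclusive" interpolation method is one of several ways to define a percentile, which the slides do not specify.

```python
from statistics import quantiles


def split_by_percentile(scores, low_pct=20, high_pct=80):
    """Return (low achievers, high achievers) using percentile cut-offs."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(scores, n=100, method="inclusive")
    low_cut, high_cut = cuts[low_pct - 1], cuts[high_pct - 1]
    low = [s for s in scores if s < low_cut]
    high = [s for s in scores if s > high_cut]
    return low, high


# Hypothetical test scores (not data from the study).
scores = [38, 42, 47, 51, 55, 58, 60, 63, 67, 70, 74, 79, 83, 88, 92]
low, high = split_by_percentile(scores)
print("low achievers:", low)
print("high achievers:", high)
```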
Four Attitudinal Factors
Factor                                          Item                                                       Mean   SD
1 Test-taking Anxiety/Lack of Concentration     2,3,4,8,10,19,20,23,24,26,27,29,32,34,35,36,38             3.87   .72
2 Test-taking Motivation                        5,7,9,22,37                                                2.79   .58
3 Belief in CET-4                               6,13,17,21,25,39                                           2.43   .58
4 Test Ease                                     12,14,30                                                   2.35   .75
N=212
Multiple Regression
Multiple Regression Cont’d
Multiple Regression: females’ attitudes toward CET-4 and their test performance (N=145)
Model   Factor                                       β      t       p       R2
1       Test-taking Anxiety/Lack of Concentration    -.28   -3.44   .001    .070
2       Test-taking Anxiety/Lack of Concentration    -.26   -3.27   .001    .104
        Test-taking Motivation                       -.20   2.55    .012
Multiple Regression: males’ attitudes toward CET-4 and their test performance (N=63)
Model   Factor                                       β      t       p       R2
1       Belief in CET-4                              .46    4.02    <.001   .197
Multiple Regression Cont’d
Multiple Regression: High achievers’ attitudes toward CET-4 and their test performance (N=42)
Factor β t p R2
2 Test-taking Motivation .58 4.18 <.001 .338
Multiple Regression: Low achievers’ attitudes toward CET-4 and their test performance (N=42)
Factor β t p R2
2 Test-taking Motivation .32 2.11 .041 .322
Strategy use (Song & Cheng, 2006)
Descriptive Statistics and Reliability estimates at the Scale Level
Strategy Use Scales                  Mean   SD    Skewness   Kurtosis   Reliabilities
Cognitive strategy use
  Comprehending strategies           2.14   .99   .293       .156       .771
  Generative strategies              2.53   .78   .275       .444       .808
  Memory and retrieval strategies    2.61   .76   .163       .723       .873
  Total                              2.49   .68   .097       .476       .908
Metacognitive strategy use
  Assessment                         2.85   .64   .004       -.178      .875
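The slide does not say which reliability coefficient was reported. Assuming it is an internal-consistency estimate such as Cronbach's alpha (an assumption, not something the source states), a minimal sketch of that computation looks like this. The function name and the ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item ratings."""
    k = len(item_scores[0])
    # Transpose so that each tuple holds all respondents' ratings for one item.
    items = list(zip(*item_scores))
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical ratings: five respondents answering a four-item scale.
ratings = [
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]
print(f"alpha = {cronbach_alpha(ratings):.3f}")
```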
Students’ strategy use and their CET performance
Multiple Regression: Memory and retrieval strategies on the CET-4
                                   R square   B       Beta   t         Sig.
(Constant)                                    59.13          15.79
Memory and retrieval strategies    .086       4.60    .29    3.34 **   .001
** P < .01.
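Read in the usual way, the constant and B value above define a prediction line. The sketch below assumes that B is the unstandardized slope, that the constant is the intercept, and that the outcome is the CET-4 score; none of this is spelled out on the slide, and the function name is hypothetical. The input 2.61 is the mean memory and retrieval strategy score reported in the descriptive statistics.

```python
INTERCEPT = 59.13  # "(Constant)" B value from the slide
SLOPE = 4.60       # B for memory and retrieval strategies from the slide


def predicted_cet4(strategy_score):
    """Predicted test score for a given memory and retrieval strategy score."""
    return INTERCEPT + SLOPE * strategy_score


# Prediction at the reported scale mean of 2.61.
print(round(predicted_cet4(2.61), 2))  # prints 71.14
```

Under the same reading, each one-point increase on the strategy scale corresponds to a predicted increase of 4.60 points, while the R square of .086 indicates that this single predictor accounts for under a tenth of the score variance.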
How to establish the relationship between testing and its impact?
Work backward from the test items (test design)
Explore test-takers’ characteristics over testing
o Learners’ academic background
o L1 (native language), culture, ethnicity
o Gender, age
o Learning strategies
o Learning styles and personality (field in/dependence)
o Test anxiety
o Motivation
Longitudinal/cross-group studies
Linking the effects to student test performance using higher-level analysis
Qualities of language tests
Bachman and Palmer’s test usefulness framework (1996)
Reliability + Construct Validity + Authenticity + Interactiveness + Impact + Practicality
Kunnan’s test fairness framework (2004)
Validity + Absence of Bias + Access + Administration + Social Consequences
Future directions
Washback/impact researchers need to fully analyze the test under study and understand its test use.
‘the extensive research on validity and validation has tended to ignore test use, on the one hand, while discussions of test use and consequences have tended to ignore validity, on the other’. It is, then, essential for us to establish the link between test validity and test consequences (Bachman, 2005, p.7).
Therefore, it is imperative that washback/impact researchers work together with other language testing researchers as well as educational policy makers and test agencies to address the issue of validity, in particular, fairness and ethics of our tests.
Reflections
It is clear that “testing is never a neutral process and always has consequences” (Stobart, 2003, p. 140). Tests are a differentiating ritual for students: “for every one who advances there will be some who stay behind” (Wall, 2000, p. 500). This is particularly true of large-scale language tests.
Assessment (testing) is central to the teaching and learning process.
References
Alderson, J.C. and Wall, D. (1993). Does washback exist? Applied Linguistics 14, 115-129.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1-34.
Bachman, L. F. and Palmer, A.S. (1996). Language Testing in Practice, Oxford University Press, Oxford, England.
Bailey, K. M. (1996). Working for washback: A review of the washback concept in language testing, Language Testing 13, 257-279.
Cheng, L. (2005). Changing language teaching through language testing: A washback study. Studies in Language Testing: Volume 21, Cambridge University Press, Cambridge.
Cheng, L. and Curtis, A. (2004). Washback or backwash: A review of the impact of testing on teaching and learning, in L. Cheng and Y. Watanabe with A. Curtis. (eds.), Washback in Language Testing: Research Contexts and Methods, Lawrence Erlbaum Associates, Mahwah, New Jersey.
Cheng, L., & Watanabe, Y., with Curtis, A. (Eds.) (2004). Washback in language testing: Research contexts and methods. Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc.
Cheng, L., Klinger, D., & Zheng, Y. (2007). The challenges of the Ontario Secondary School Literacy Test for second language students. Language Testing, 24(2).
Cizek, G. J. (2001). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 23(3), 1-17.
Hamp-Lyons, L. (1997). Washback, impact and validity: ethical concerns. Language Testing, 14(3), 295-303.
Hawkey, R. (2006). Impact Theory and Practice: Studies of the IELTS test and Progetto Lingue 2000. Cambridge University Press, Cambridge.
Klinger, D., Cheng, L., & Zheng, Y. (under review). Factors influencing ESL/ELD students’ performance on the Ontario Secondary School Literacy Test. Educational Assessment.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic, C. Weir, & S. Bolton (Eds.). European language testing in a global context: Selected papers from the ALTE conference in Barcelona. Cambridge: Cambridge University Press.
Messick, S. (1996). Validity and washback in language testing, Language Testing 13, 243-256.
Popham, W. J. (1987). The merits of measurement-driven instruction. Phi Delta Kappan, 68, 679-682.
Qi, L. (2005). Stakeholders’ conflicting aims undermine the washback function of a high-stakes test, Language Testing 22, 142-173.
Shepard, L. A. (1990). Inflated test score gains: Is the problem old norms or teaching the test? Educational Measurement: Issues and Practice 9, 15-22.
Shohamy, E. (2001). The Power of Tests: A Critical Perspective on the Uses of Language Tests, Longman, Essex, England.
Song, X., & Cheng, L. (2006). Language learner strategy use and test performance of Chinese learners of English. Language Assessment Quarterly: An International Journal, 3(3), 241-266.
Wall, D. (1997). Impact and washback in language testing. In Clapham, C. & Corson, D. (Eds.). Encyclopedia of Language and Education (p. 291-302).
Zhao, J. & Cheng, L. (2006, May). Exploring the relationship between students’ attitudes toward testing and their test performance. Paper presented at the Canadian Society for Study of Education, Toronto, Ontario.
Zheng, Y., Cheng, L. & Klinger, D. (under review). Do test formats in reading comprehension affect ESL/ELD and non-ESL/ELD students’ test performance differently? TESL Canada.