Final Paper Written Report 1
Running head: FINAL PAPER WRITTEN REPORT
Final Paper Written Report
Dr. Zsuzsa Londe
University of Southern California
EDUC 527
Yu-Ching Yang (Virginia)
December 3, 2008
Introduction
The TOEFL iBT writing section was introduced in 2005 by Educational Testing Service (ETS). Its purpose is to measure test takers' communicative competence in writing for academic settings; accordingly, the target population is non-native speakers of English who want to apply to institutions around the world, such as colleges and universities in the United States, Canada, the U.K., Australia, and New Zealand. To reflect test takers' writing ability in actual academic situations, the test is divided into two tasks that require test takers to combine reading, listening, and writing skills to express their knowledge, thoughts, and ideas effectively. Test takers have twenty minutes to finish task one and thirty minutes to finish task two. Their written responses are sent to ETS's Online Scoring Network, where two human raters score each response. Test takers can see their scores online after 15 business days, and paper reports are mailed later. The testing fee of the TOEFL iBT is $150.
Basically, the two writing tasks of the TOEFL iBT are performance-based: they focus on how well test takers can use English to write in academic settings, and they require test takers to demonstrate "real-life English-language usage in university lectures, classes, and laboratories" (Educational Testing Service, 2007b, p. 6). The first task, which is integrated, simulates a real academic scenario in which students must write a paper that digests and combines information from a class lecture and from assigned reading material. During task one, test takers read a short passage on an academic topic for 3 minutes, taking notes if they wish. Then they listen to a 2-minute lecture that presents a different point of view on the same topic. After processing this different input, test takers must summarize and compare the perspectives of the reading and listening passages and produce their written responses in 20 minutes. The second task is independent and resembles the TOEFL CBT writing tasks. In this task, test takers have 30 minutes to draw on their experiences and knowledge to state, express, and support their opinions on a controversial issue.
Overall, the first task measures test takers' abilities to organize information from different input and to summarize and paraphrase it in a clear and coherent manner. The second task helps test takers to "express information in an organized manner, and to accurately and appropriately use vocabulary, grammar, and idiomatic expressions, as well as to use reasons, examples and details to develop the essay" (Educational Testing Service, 2007b, p. 26).
Qualities of Usefulness
Bachman and Palmer (1996) proposed a framework that includes six different qualities of
test usefulness which can be used to assess a given test. The six qualities are reliability, construct
validity, authenticity, interactiveness, impact, and practicality.
Reliability
The first quality, reliability, refers to the consistency and dependability of the test: "a reliable test score will be consistent across different characteristics of the testing situations" (p. 19). In other words, a reliable test should yield similar results across different sets of test task characteristics (setting, test rubric, input, expected response, and the relationship between input and response). Thus, the same test takers should receive similar scores on the same test task even if they take it several times on different dates or in different settings.
Construct Validity
The second quality, construct validity, refers to "the meaningfulness and appropriateness of the interpretations that we make on the basis of test scores" (p. 21). That is, a test score should be interpretable as an appropriate and meaningful indicator of test takers' real language abilities. Therefore, a valid test score should actually reflect the abilities we want to evaluate.
Authenticity
The third quality, authenticity, is defined as "the degree of correspondence of the characteristics of a given language test task to the feature of a TLU task" (p. 23), where a TLU (target language use) task is a task the test taker faces in the real world. In other words, the test tasks should be similar enough to TLU tasks to elicit test takers' real-world performance.
Interactiveness
The fourth quality, interactiveness, pertains to "the extent and type of involvement of the test taker's individual characteristics in accomplishing a test task" (p. 25). Test tasks should engage test takers' language ability, topical knowledge, affective schemata, and personal characteristics in order to elicit their best performance. Language ability refers to organizational and pragmatic knowledge; topical knowledge pertains to test takers' cultural and knowledge schemata; affective schemata concern the interaction between emotion and topical knowledge; and personal characteristics include age, gender, native language, and so on. In short, the more a test task involves an individual's characteristics, the better the results and performance the test taker can produce.
Impact
The fifth quality, impact, concerns the influence that test tasks may have on individuals, educational systems, and society. For instance, the impact on test takers (individuals) includes three aspects: "the experience of taking and preparing for the test," "the feedback about their performance on the test," and "the decisions that may be made about them on the basis of their test scores" (p. 31). In addition, the values and goals of using various tests should be considered carefully, since language teaching, language programs, and society at large are influenced by test use and test results. Thus, when developing tests, it is important to consider what will happen after they are used.
Practicality

The last quality, practicality, is defined as "the relationship between the resources that will be required in the design, development, and use of the test and the resources that will be available for these activities" (p. 36). The resources fall into three categories: human resources, material resources, and time. Human resources include test writers, raters, and administrators; material resources consist of space, equipment, and materials; and time covers both development time and the time allotted to specific tasks. A test is therefore practical if the available resources meet or exceed the required resources.
Analysis of Qualities: TOEFL iBT Writing
Reliability
According to Brown (2004), several factors contribute to the unreliability of a test: student-related reliability, rater reliability, test administration reliability, and test reliability. First, with respect to student-related reliability, it is impossible to ensure that every test taker is in optimal condition when taking the exam. Next, based on ETS's research, rater reliability is high because two certified raters score each response on "a score scale of 0 to 5 according to the rubric, and the average of the scores on the two writing tasks is converted to a scaled score of 0 to 30" (Educational Testing Service, 2005). The criterion for the first, integrated task is the "quality of writing" and "the completeness and accuracy of the content"; the second, independent task is scored on "the overall quality of the writing" (Educational Testing Service, 2005). In addition, the certified raters are all trained by ETS, and a third rater scores a written response again if the first two raters' ratings differ by more than one point. However, according to Cumming et al. (2001), raters may sometimes score the same responses differently because of their backgrounds and teaching experiences.

Third, test administration reliability is high since ETS "certif[ies] all test centers facilities and equipment for administering TOEFL iBT" (Educational Testing Service, 2008). However, many test takers' experiences suggest the opposite about the conditions of test administration. Finally, test reliability is comparatively low (see Appendix A) because the sample of writing tasks is too small: only two tasks measure test takers' writing abilities (ETS, 2008).
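The rating procedure described above can be sketched in code. This is an illustrative model only, not ETS's actual algorithm: the linear mapping from the 0–5 average to the 0–30 scale, and the assumption that an adjudicated third rating simply stands in for the disputed pair, are simplifications made for the example.

```python
def task_score(rater1, rater2, third=None):
    """Raw score (0-5) for one writing task from two certified raters.

    If the two ratings differ by more than one point, a third rating
    resolves the disagreement. Here the third rating simply replaces
    the average -- an assumption; ETS does not publish the exact rule.
    """
    if abs(rater1 - rater2) > 1:
        if third is None:
            raise ValueError("ratings differ by more than one point; "
                             "a third rating is required")
        return third
    return (rater1 + rater2) / 2


def scaled_writing_score(task1, task2):
    """Convert the average of the two task scores (0-5) to the 0-30 scale.

    A simple linear mapping (average * 6) is assumed for illustration.
    """
    average = (task1 + task2) / 2
    return round(average * 6)
```

For example, ratings of 4 and 5 on task one (raw score 4.5) and 4 and 4 on task two (raw score 4.0) would give a scaled score of round(4.25 × 6) = 26 under this assumed mapping.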
Construct Validity
Research suggests that this writing test reflects test takers' writing skills and their ability to use and understand English in an integrated way in college and university settings (ETS, 2007b). Moreover, the score of the written response can be appropriately interpreted as an indicator of test takers' writing skills in real academic settings, since the writing tasks simulate "real-life writing tasks in academic contexts" (Lee & Kantor, 2005). For instance, the integrated task calls on the different language abilities test takers need in academic settings, such as listening to lectures, writing papers, reading textbooks, and taking notes (Cumming et al., 2005); it presents realistic academic content that simulates a lecture in an academic institution, with reading materials drawn from actual textbooks (ETS, 2007b). Furthermore, the scoring procedure supports the construct validity of these writing tasks: ETS does not expect test takers "to produce a well-researched, comprehensive essay, they can earn a high score with a response that contains some errors" (ETS, 2007b). Hence, test takers' performance is not underestimated because of some errors in their written responses.
However, test takers' language proficiency is judged from only two writing task samples, which reduces the test's validity; as Lee and Kantor (2005) state, "the generalizability of writing scores across tasks and task types is an important issue in evaluating and validating the tasks." Thus, the small sample of test task scores may weaken the test's construct validity. In addition, the scoring rubric may also affect the tasks' validity, since "each of the writing tasks measures a somewhat distinct construct of writing, and separate scores should be reported for each of these distinct constructs" (Lee & Kantor, 2005).
Authenticity
The authenticity of the writing tasks is high, since the tests reflect test takers' writing skills and their ability to use and understand English in an integrated way in college and university settings. In other words, the writing tasks, which require test takers to summarize, compare, and paraphrase different written and audio input, correspond to TLU tasks (ETS, 2007b). In addition, the test tasks and TLU tasks share some procedures: for example, in the integrated task, test takers take notes on written and audio information and then summarize and compare the two sources in one paper. This mirrors the real academic classroom, where students need to take notes on reading materials and on the professor's lectures, develop skills to compare or summarize what they have read and heard, and then write it up as a clear and coherent paper (ETS, 2007b).
Moreover, the rubric ETS provides for evaluating written responses resembles the way teachers and professors evaluate their students' papers in academic settings. For instance, both the test task and the TLU task emphasize that a good paper should "effectively address the topic and task; it is well organized and well developed, using clearly appropriate explanations, exemplifications…; it should display coherence… consistent…, demonstrating syntactic variety, appropriate word choice…" (ETS, 2007b). Besides, the topics and language of the integrated task's reading materials and lecture, and the controversial issues of the independent task, are similar to TLU tasks; note-taking and typing the written responses on a computer are also close to TLU conditions. Overall, the writing tasks of the TOEFL iBT correspond to the tasks of real academic settings.
Interactiveness
To assess the interactiveness of TOEFL iBT writing, we can examine four characteristics. First, regarding test takers' topical knowledge, the new writing tasks generally help them perform better while accomplishing the tasks. For instance, the integrated task, including the simulated lectures and the readings, is close to real university-level material; however, the topics and materials may be appropriate for undergraduate students rather than graduate students, since graduate students "would not necessarily have the language ability needed to deal with these topics, but might have the language ability needed to succeed in their particular fields" (Cumming et al., 2005, p. 23). On the other hand, the interactiveness of the independent task is lower, because some test takers may not perform well owing to limited personal experience or knowledge of the specific topics in the prompts (Cumming et al., 2005).
Second, as to test takers' personal characteristics, the interactiveness of the integrated task is enhanced: all subjects and topics of the task are academic in purpose, giving every non-native speaker an opportunity to demonstrate his or her listening, reading, and writing skills in academic settings. Thus, test takers' personal characteristics do not really interfere with the integrated task's interactiveness, and test takers can perform at their best if they have background knowledge of the task's topics. Nevertheless, the independent task, which draws on test takers' personal experiences and opinions, must take personal characteristics into account. Breland et al. (2004) claim that males may perform better on some specific topics and females on others; their study shows that the prompts with the largest gender differences involve "art and music, roommates, housing, friends, and children," while those with the smallest gender differences involve "research, space travel, factories, and advertising" (p. 21).
Third, the interactiveness between the integrated task and test takers' language knowledge is high, since the task involves a wide range of areas of language knowledge, such as reading, listening, and writing, and also requires "complex cognitive, literate, and language abilities for comprehension, such as apprehending, synthesizing, and presenting source ideas" (Cumming et al., 2006, p. 46). For the independent task, however, interactiveness is lower: because it measures only test takers' ability to write coherent written arguments from personal experience and background knowledge, it does not involve much of their language knowledge (Cumming et al., 2006). Finally, pertaining to test takers' affective schemata, the new integrated task simulates a real classroom environment, provides realistic reading materials and lectures, and corresponds to TLU settings that require test takers to read, listen, take notes, and write papers; test takers will therefore have positive affective responses to this task. In addition, test takers may respond positively when they have topical knowledge relevant to the integrated task and personal experience relevant to the independent task, and negatively when they cannot draw on such knowledge and experience. Moreover, test takers with higher language proficiency may have positive affective responses toward the two writing tasks, while those with lower proficiency may feel threatened by them (Bachman & Palmer, 1996).
Impact
There is a positive washback for students, because they can receive feedback from ETS,
which gives a detailed description about their performance and language proficiency levels, can
help test takers to know how to improve their writing skills. This new system of feedback report
Final Paper Written Report 12
is beneficial for test takers, since they can adjust their learning strategies to improve writing
skills for getting better scores, and also they can develop a meaningful learning experience from
these tests. Besides, the integrated task can help students to be familiar the TLU task in academic
settings. However, these writing tasks may cause negative impact on students. First, during the
test preparation time, students may prepare the test individually for several months, or students
may go to some private institutes where they only provide students with specific test skills to get
higher scores. Second, TOEFL scores may not accurately measure students’ actual language
proficiency, so students may be frustrated or upset while getting lower scores. Lastly, though the
purpose of the integrated task is to measure test takers’ language abilities in academic settings in
the US, the integrated writing task may be not fair for overseas students who didn’t study in
American colleges and universities to receive topical knowledge.
As to the impact on teachers, there is positive washback for teaching strategies. In the past, many teachers thought a writing course should focus only on writing skills, such as grammatical rules and linguistic knowledge, that help students produce high-scoring compositions; after ETS introduced the new writing task, however, many teachers began using a communicative approach to help students develop integrated language skills such as note-taking, summarizing, paraphrasing, and comparing different reading and listening input. In addition, language teachers can obtain various teaching materials that ETS provides online, so they can prepare their lesson plans more appropriately and meaningfully.
For society and educational systems, there is a positive impact. According to Bachman and Palmer (1996), "tests are not developed and used in a value-free psychometric test-tube; they are virtually always intended to serve the needs of an educational system or of society at large" (p. 30). Thus, the new writing task can draw the attention of society and educational systems to the importance of developing students' integrated skills in academic settings. Moreover, with the integrated task in the writing section, "curriculum coordinators, academic directors, and teachers will use communicative approach and provide more authentic materials to teach English and to develop students' integrated skills" (Educational Testing Service, 2007c); in this way, students can become better aware of the importance of communication skills. Nevertheless, TOEFL scores may be misused by companies and government departments, creating a negative impact on society. For example, some companies ask applicants to present TOEFL scores as proof of their language abilities, yet these scores cannot demonstrate actual workplace English skills.
Practicality
Generally, the writing tasks are practical. As for human resources, there are enough professional raters to score large numbers of written responses appropriately and efficiently: two human raters score each response, and if their ratings differ by more than one point, a chief rater evaluates the response again. Besides, test score reports are posted online accurately and on time. As for material resources, they are sufficient: every test administration site around the world is equipped with the necessary facilities, such as pencils, paper, computers, microphones, and headphones.
Conclusion
Overall, the six qualities of TOEFL iBT writing are generally high: the two test tasks are similar to TLU tasks, and by simulating real academic scenarios they help test takers develop integrated skills while creating positive impact on students, teachers, schools, and society. More importantly, the new type of writing task can genuinely reflect test takers' abilities in academic writing and effective communication. Nevertheless, some suggestions can be proposed to make TOEFL iBT writing better.

First, ETS should improve the conditions of the testing environment. Though ETS claims that test takers can wear noise-cancelling headphones during the test to block out noise, test takers still sit close to one another and can easily disturb each other; for example, while one test taker works on the writing section, the test taker sitting next to him or her may be doing the speaking section, and the writer may be distracted by the speaking voice.
Second, although the writing tasks simulate the real academic classroom, where students need to digest different sources to produce a well-formed paper and to write a coherent argument on a controversial issue, the topics of the two test tasks may still be unfair to some test takers. For instance, as mentioned above, the various subjects and topics of the integrated task are more appropriate for undergraduate than graduate students, and the topics of the independent task, which draw on test takers' personal experiences and background knowledge, may introduce bias. Thus, it would be better if test takers had more freedom to choose a topic based on their interests, preferences, and professional fields.
References
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.
Breland, H., Lee, Y.-W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL-CBT writing
prompt difficulty and comparability for different gender groups (TOEFL Research Rep.
No. RR-76). Princeton, NJ: ETS.
Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York, NY: Pearson Education.
Cumming, A., Kantor, R., Baba, K., Eouanzoui, K., Erdosy, U., & James, M. (2006). Analysis of
discourse features and verification of scoring levels for independent and integrated
prototype written tasks for the new TOEFL (TOEFL Monograph No. MS-30). Princeton,
NJ: ETS.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000
prototype writing tasks: An investigation into raters' decision making and development of
a preliminary analytic framework (TOEFL Monograph No. MS-22). Princeton, NJ: ETS.
Cumming, A., Grant, L., Mulcahy-Ernt, P. & Powers, D. (2005). A teacher-verification study of
speaking and writing prototype tasks for a new TOEFL. Educational Testing Service:
Princeton, NJ
Educational Testing Service. (2005). TOEFL iBT scores: Better information about the ability to communicate in an academic setting. Retrieved November 28, 2008, from http://www.ets.org/toefl/50.html
Educational Testing Service. (2007a). TOEFL® iBT Score Reliability and Generalizability
Educational Testing Service. (2007b). TOEFL iBT tips: How to prepare for the TOEFL iBT. Princeton, NJ.
Educational Testing Service (2007c). Validity Evidence Supporting the Interpretation and Use of
TOEFL iBT Scores. Princeton, NJ.
Educational Testing Service. (2008). Reliability and Comparability of TOEFL® iBT Scores.
Lee, Y., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. Princeton, NJ: Educational Testing Service.
Appendix A
Reliabilities and Standard Errors of Measurement
Section     Score Scale   Reliability Estimate   SEM
Reading     0–30          0.85                   3.35
Listening   0–30          0.85                   3.20
Speaking    0–30          0.88                   1.62
Writing     0–30          0.74                   2.76
Total       0–120         0.94                   5.64
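Under classical test theory, the standard error of measurement in the table relates to reliability through SEM = SD × √(1 − reliability), where SD is the standard deviation of observed scores. The sketch below is a rough illustration of this relation rather than ETS's reported computation; the function names are invented for the example.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def implied_sd(sem_value, reliability):
    """Invert the SEM formula to estimate the observed-score SD."""
    return sem_value / math.sqrt(1 - reliability)

# Writing section from Appendix A: reliability 0.74, SEM 2.76.
# The implied observed-score SD is roughly 5.4 scale points.
writing_sd = implied_sd(2.76, 0.74)
```

The formula also makes the paper's reliability argument concrete: at a fixed score spread, a lower reliability (0.74 for writing versus 0.85 for reading) yields a larger SEM, consistent with the claim that only two writing tasks provide a comparatively unreliable measurement.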