Final Paper Written Report 1
Running head: FINAL PAPER WRITTEN REPORT
Final Paper Written Report
Dr. Zsuzsa Londe
University of Southern California
EDUC 527
Yu-Ching Yang (Virginia)
December 3, 2008
Introduction
The TOEFL iBT writing section was introduced in 2005 by Educational Testing Service (ETS). Its purpose is to measure test takers' communicative competence in writing for academic settings; accordingly, the target population is non-native speakers of English who want to apply to institutions around the world, such as colleges and universities in the United States, Canada, the U.K., Australia, and New Zealand. To reflect test takers' writing ability in actual academic situations, the test is divided into two tasks that require test takers to combine reading, listening, and writing skills to express their knowledge, thoughts, and ideas effectively. Test takers have twenty minutes to finish task one and thirty minutes to finish task two. Their written responses are sent to ETS's Online Scoring Network, where two human raters score each response. Test takers can see their scores online after 15 business days, and paper reports are mailed later. The testing fee of the TOEFL iBT is $150.
Basically, the two writing tasks of the TOEFL iBT are performance-based: they focus on how well test takers can use English to write in academic settings, and they require test takers to demonstrate "real-life English-language usage in university lectures, classes, and laboratories" (Educational Testing Service, 2007b, p. 6). The first task, which is integrated, simulates a real academic scenario in which students must write a paper that digests and combines information from a class lecture and from assigned reading material. During task one, test takers read a short passage on an academic topic for 3 minutes, taking notes if they wish. Then they listen to a 2-minute lecture that presents a different point of view on the same topic. After processing this different input, test takers must summarize and compare the perspectives of the reading and listening passages and produce their written responses in 20 minutes. The second task is independent and resembles the TOEFL CBT writing tasks. In this task, test takers have 30 minutes to draw on their experiences and knowledge to state, express, and support their opinions on a controversial issue.
Overall, the first task measures test takers' abilities to organize information from different input and to summarize and paraphrase it in a clear and coherent manner. The second task helps test takers to "express information in an organized manner, and to accurately and appropriately use vocabulary, grammar, and idiomatic expressions, as well as to use reasons, examples and details to develop the essay" (Educational Testing Service, 2007b, p. 26).
Qualities of Usefulness
Bachman and Palmer (1996) proposed a framework that includes six different qualities of
test usefulness which can be used to assess a given test. The six qualities are reliability, construct
validity, authenticity, interactiveness, impact, and practicality.
Reliability
The first quality, reliability, refers to the consistency and dependability of the test: "a reliable test score will be consistent across different characteristics of the testing situations" (p. 19). In other words, a reliable test should yield similar results across different sets of test task characteristics (setting, test rubric, input, expected response, and the relationship between input and response). Thus, the same test takers should receive similar scores on the same test task even if they take it several times on different dates or in different settings.
Construct Validity
The second quality, construct validity, refers to "the meaningfulness and appropriateness of the interpretations that we make on the basis of test scores" (p. 21). That is, a test score should be interpretable as an appropriate and meaningful indicator of test takers' real language abilities. Therefore, a valid test score should actually reflect the abilities we want to evaluate.
Authenticity
The third quality, authenticity, is defined as "the degree of correspondence of the characteristics of a given language test task to the feature of a TLU task" (p. 23), where a TLU (target language use) task is a task the test taker faces in the real world. In other words, the test tasks should be similar enough to TLU tasks to elicit test takers' real-world performance.
Interactiveness
The fourth quality, interactiveness, pertains to "the extent and type of involvement of the test taker's individual characteristics in accomplishing a test task" (p. 25). Test tasks should engage test takers' language ability, topical knowledge, affective schemata, and personal characteristics in order to elicit their best performance. Language ability refers to organizational and pragmatic knowledge; topical knowledge pertains to test takers' cultural and knowledge schemata; affective schemata concern the interaction between emotion and topical knowledge; and personal characteristics include age, gender, native language, and so on. In short, the more a test task involves an individual's characteristics, the better the results and performance the test taker can produce.
Impact
The fifth quality, impact, concerns the influence that test tasks may have on individuals, educational systems, and society. For instance, the impact on test takers (individuals) includes three aspects: "the experience of taking and preparing for the test," "the feedback about their performance on the test," and "the decisions that may be made about them on the basis of their test scores" (p. 31). In addition, the values and goals of using various tests should be considered carefully, since language teaching, language programs, and society at large are influenced by test use and test results. Thus, when developing tests, it is important to consider what will happen after they are used.
Practicality

The last quality, practicality, is defined as "the relationship between the resources that will be required in the design, development, and use of the test and the resources that will be available for these activities" (p. 36). The resources fall into three categories: human resources, material resources, and time. Human resources include test writers, raters, and administrators; material resources consist of space, equipment, and materials; and time covers both development time and the time allotted to specific tasks. A test is therefore practical if the available resources meet or exceed the required resources.
Analysis of Qualities: TOEFL iBT Writing
Reliability
According to Brown (2004), several factors contribute to the unreliability of a test: student-related reliability, rater reliability, test administration reliability, and test reliability. First, with respect to student-related reliability, it is impossible to ensure that every test taker is in optimal condition when taking the exam. Next, based on ETS's research, rater reliability is high because two certified raters score each response on "a score scale of 0 to 5 according to the rubric, and the average of the scores on the two writing tasks is converted to a scaled score of 0 to 30" (Educational Testing Service, 2005). The criterion for the first, integrated task is the "quality of writing" and "the completeness and accuracy of the content"; the second, independent task is scored on "the overall quality of the writing" (Educational Testing Service, 2005). In addition, the certified raters are all trained by ETS, and a third rater scores a written response again if the first two raters' ratings differ by more than one point. However, according to Cumming et al. (2001), raters may sometimes score the same responses differently because of their backgrounds and teaching experiences.

Third, test administration reliability is high since ETS "certif[ies] all test centers facilities and equipment for administering TOEFL iBT" (Educational Testing Service, 2008). However, many test takers' experiences suggest the opposite about the conditions of test administration. Finally, test reliability is comparatively low (see Appendix A) because the sample of writing tasks is too small: only two tasks measure test takers' writing abilities (ETS, 2008).
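The rating procedure described above can be sketched in code. This is an illustrative model only, not ETS's actual algorithm: the linear mapping from the 0–5 average to the 0–30 scale, and the assumption that an adjudicated third rating simply stands in for the disputed pair, are simplifications made for the example.

```python
def task_score(rater1, rater2, third=None):
    """Raw score (0-5) for one writing task from two certified raters.

    If the two ratings differ by more than one point, a third rating
    resolves the disagreement. Here the third rating simply replaces
    the average -- an assumption; ETS does not publish the exact rule.
    """
    if abs(rater1 - rater2) > 1:
        if third is None:
            raise ValueError("ratings differ by more than one point; "
                             "a third rating is required")
        return third
    return (rater1 + rater2) / 2


def scaled_writing_score(task1, task2):
    """Convert the average of the two task scores (0-5) to the 0-30 scale.

    A simple linear mapping (average * 6) is assumed for illustration.
    """
    average = (task1 + task2) / 2
    return round(average * 6)
```

For example, ratings of 4 and 5 on task one (raw score 4.5) and 4 and 4 on task two (raw score 4.0) would give a scaled score of round(4.25 × 6) = 26 under this assumed mapping.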
Construct Validity
Research suggests that this writing test reflects test takers' writing skills and their ability to use and understand English in an integrated way in college and university settings (ETS, 2007b). Moreover, the score of the written response can be appropriately interpreted as an indicator of test takers' writing skills in real academic settings, since the writing tasks simulate "real-life writing tasks in academic contexts" (Lee & Kantor, 2005). For instance, the integrated task calls on the different language abilities test takers need in academic settings, such as listening to lectures, writing papers, reading textbooks, and taking notes (Cumming et al., 2005); it presents realistic academic content that simulates a lecture in an academic institution, with reading materials drawn from actual textbooks (ETS, 2007b). Furthermore, the scoring procedure supports the construct validity of these writing tasks: ETS does not expect test takers "to produce a well-researched, comprehensive essay, they can earn a high score with a response that contains some errors" (ETS, 2007b). Hence, test takers' performance is not underestimated because of some errors in their written responses.
However, test takers' language proficiency is judged from only two writing task samples, which reduces the test's validity; as Lee and Kantor (2005) state, "the generalizability of writing scores across tasks and task types is an important issue in evaluating and validating the tasks." Thus, the small sample of test task scores may weaken the test's construct validity. In addition, the scoring rubric may also affect the tasks' validity, since "each of the writing tasks measures a somewhat distinct construct of writing, and separate scores should be reported for each of these distinct constructs" (Lee & Kantor, 2005).
Authenticity
The authenticity of the writing tasks is high, since the tests reflect test takers' writing skills and their ability to use and understand English in an integrated way in college and university settings. In other words, the writing tasks, which require test takers to summarize, compare, and paraphrase different written and audio input, correspond to TLU tasks (ETS, 2007b). In addition, the test tasks and TLU tasks share some procedures: for example, in the integrated task, test takers take notes on written and audio information and then summarize and compare the two sources in one paper. This mirrors the real academic classroom, where students need to take notes on reading materials and on the professor's lectures, develop skills to compare or summarize what they have read and heard, and then write it up as a clear and coherent paper (ETS, 2007b).
Moreover, the rubric ETS provides for evaluating written responses resembles the way teachers and professors evaluate their students' papers in academic settings. For instance, both the test task and the TLU task emphasize that a good paper should "effectively address the topic and task; it is well organized and well developed, using clearly appropriate explanations, exemplifications…; it should display coherence… consistent…, demonstrating syntactic variety, appropriate word choice…" (ETS, 2007b). Besides, the topics and language of the integrated task's reading materials and lecture, and the controversial issues of the independent task, are similar to TLU tasks; note-taking and typing the written responses on a computer are also close to TLU conditions. Overall, the writing tasks of the TOEFL iBT correspond to the tasks of real academic settings.
Interactiveness
To assess the interactiveness of TOEFL iBT writing, we can examine four characteristics. First, regarding test takers' topical knowledge, the new writing tasks generally help them perform better while accomplishing the tasks. For instance, the integrated task, including the simulated lectures and the readings, is close to real university-level material; however, the topics and materials may be appropriate for undergraduate students rather than graduate students, since graduate students "would not necessarily have the language ability needed to deal with these topics, but might have the language ability needed to succeed in their particular fields" (Cumming et al., 2005, p. 23). On the other hand, the interactiveness of the independent task is lower, because some test takers may not perform well owing to limited personal experience or knowledge of the specific topics in the prompts (Cumming et al., 2005).
Second, as to test takers' personal characteristics, the interactiveness of the integrated task is enhanced: all subjects and topics of the task are academic in purpose, giving every non-native speaker an opportunity to demonstrate his or her listening, reading, and writing skills in academic settings. Thus, test takers' personal characteristics do not really interfere with the integrated task's interactiveness, and test takers can perform at their best if they have background knowledge of the task's topics. Nevertheless, the independent task, which draws on test takers' personal experiences and opinions, must take personal characteristics into account. Breland et al. (2004) claim that males may perform better on some specific topics and females on others; their study shows that the prompts with the largest gender differences involve "art and music, roommates, housing, friends, and children," while those with the smallest gender differences involve "research, space travel, factories, and advertising" (p. 21).
Third, the interactiveness between the integrated task and test takers' language knowledge is high, since the task involves a wide range of areas of language knowledge, such as reading, listening, and writing, and also requires "complex cognitive, literate, and language abilities for comprehension, such as apprehending, synthesizing, and presenting source ideas" (Cumming et al., 2006, p. 46). For the independent task, however, interactiveness is lower: because it measures only test takers' ability to write coherent written arguments from personal experience and background knowledge, it does not involve much of their language knowledge (Cumming et al., 2006). Finally, pertaining to test takers' affective schemata, the new integrated task simulates a real classroom environment, provides realistic reading materials and lectures, and corresponds to TLU settings that require test takers to read, listen, take notes, and write papers; test takers will therefore have positive affective responses to this task. In addition, test takers may respond positively when they have topical knowledge relevant to the integrated task and personal experience relevant to the independent task, and negatively when they cannot draw on such knowledge and experience. Moreover, test takers with higher language proficiency may have positive affective responses toward the two writing tasks, while those with lower proficiency may feel threatened by them (Bachman & Palmer, 1996).
Impact
There is a positive washback for students, because they can receive feedback from ETS,
which gives a detailed description about their performance and language proficiency levels, can
help test takers to know how to improve their writing skills. This new system of feedback report
Final Paper Written Report 12
is beneficial for test takers, since they can adjust their learning strategies to improve writing
skills for getting better scores, and also they can develop a meaningful learning experience from
these tests. Besides, the integrated task can help students to be familiar the TLU task in academic
settings. However, these writing tasks may cause negative impact on students. First, during the
test preparation time, students may prepare the test individually for several months, or students
may go to some private institutes where they only provide students with specific test skills to get
higher scores. Second, TOEFL scores may not accurately measure students’ actual language
proficiency, so students may be frustrated or upset while getting lower scores. Lastly, though the
purpose of the integrated task is to measure test takers’ language abilities in academic settings in
the US, the integrated writing task may be not fair for overseas students who didn’t study in
American colleges and universities to receive topical knowledge.
As to the impact on teachers, there is positive washback for teaching strategies. In the past, many teachers thought a writing course should focus only on writing skills, such as grammatical rules and linguistic knowledge, that help students produce high-scoring compositions; after ETS introduced the new writing task, however, many teachers began using a communicative approach to help students develop integrated language skills such as note-taking, summarizing, paraphrasing, and comparing different reading and listening input. In addition, language teachers can obtain various teaching materials that ETS provides online, so they can prepare their lesson plans more appropriately and meaningfully.
For society and educational systems, there is a positive impact. According to Bachman and Palmer (1996), "tests are not developed and used in a value-free psychometric test-tube; they are virtually always intended to serve the needs of an educational system or of society at large" (p. 30). Thus, the new writing task can draw the attention of society and educational systems to the importance of developing students' integrated skills in academic settings. Moreover, with the integrated task in the writing section, "curriculum coordinators, academic directors, and teachers will use communicative approach and provide more authentic materials to teach English and to develop students' integrated skills" (Educational Testing Service, 2007c); in this way, students can become better aware of the importance of communication skills. Nevertheless, TOEFL scores may be misused by companies and government departments, creating a negative impact on society. For example, some companies ask applicants to present TOEFL scores as proof of their language abilities, yet these scores cannot demonstrate actual workplace English skills.
Practicality
Generally, the writing tasks are practical. As for human resources, there are enough professional raters to score large numbers of written responses appropriately and efficiently: two human raters score each response, and if their ratings differ by more than one point, a chief rater evaluates the response again. Besides, test score reports are posted online accurately and on time. As for material resources, they are sufficient: every test administration site around the world is equipped with the necessary facilities, such as pencils, paper, computers, microphones, and headphones.
Conclusion
Overall, the six qualities of TOEFL iBT writing are generally high: the two test tasks are similar to TLU tasks, and by simulating real academic scenarios they help test takers develop integrated skills while creating positive impact on students, teachers, schools, and society. More importantly, the new type of writing task can genuinely reflect test takers' abilities in academic writing and effective communication. Nevertheless, some suggestions can be proposed to make TOEFL iBT writing better.

First, ETS should improve the conditions of the testing environment. Though ETS claims that test takers can wear noise-cancelling headphones during the test to block out noise, test takers still sit close to one another and can easily disturb each other; for example, while one test taker works on the writing section, the test taker sitting next to him or her may be doing the speaking section, and the writer may be distracted by the speaking voice.
Second, although the writing tasks simulate the real academic classroom, where students need to digest different sources to produce a well-formed paper and to write a coherent argument on a controversial issue, the topics of the two test tasks may still be unfair to some test takers. For instance, as mentioned above, the various subjects and topics of the integrated task are more appropriate for undergraduate than graduate students, and the topics of the independent task, which draw on test takers' personal experiences and background knowledge, may introduce bias. Thus, it would be better if test takers had more freedom to choose a topic based on their interests, preferences, and professional fields.
References
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford: Oxford University Press.
Breland, H., Lee, Y.-W., Najarian, M., & Muraki, E. (2004). An analysis of TOEFL-CBT writing
prompt difficulty and comparability for different gender groups (TOEFL Research Rep.
No. RR-76). Princeton, NJ: ETS.
Brown, H. D. (2004). Language assessment: Principles and classroom practices. New York, NY: Pearson Education.
Cumming, A., Kantor, R., Baba, K., Eouanzoui, K., Erdosy, U., & James, M. (2006). Analysis of
discourse features and verification of scoring levels for independent and integrated
prototype written tasks for the new TOEFL (TOEFL Monograph No. MS-30). Princeton,
NJ: ETS.
Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000
prototype writing tasks: An investigation into raters' decision making and development of
a preliminary analytic framework (TOEFL Monograph No. MS-22). Princeton, NJ: ETS.
Cumming, A., Grant, L., Mulcahy-Ernt, P. & Powers, D. (2005). A teacher-verification study of
speaking and writing prototype tasks for a new TOEFL. Educational Testing Service:
Princeton, NJ
Educational Testing Service. (2005). TOEFL iBT scores: Better information about the ability to communicate in an academic setting. Retrieved November 28, 2008, from http://www.ets.org/toefl/50.html
Educational Testing Service. (2007a). TOEFL® iBT Score Reliability and Generalizability
Educational Testing Service. (2007b). TOEFL iBT tips: How to prepare for the TOEFL iBT. Princeton, NJ.
Educational Testing Service (2007c). Validity Evidence Supporting the Interpretation and Use of
TOEFL iBT Scores. Princeton, NJ.
Educational Testing Service. (2008). Reliability and Comparability of TOEFL® iBT Scores.
Lee, Y., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes. Princeton, NJ: Educational Testing Service.
Appendix A
Reliabilities and Standard Errors of Measurement
Section     Score Scale   Reliability Estimate   SEM
Reading     0–30          0.85                   3.35
Listening   0–30          0.85                   3.20
Speaking    0–30          0.88                   1.62
Writing     0–30          0.74                   2.76
Total       0–120         0.94                   5.64
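Under classical test theory, the standard error of measurement in the table relates to reliability through SEM = SD × √(1 − reliability), where SD is the standard deviation of observed scores. The sketch below is a rough illustration of this relation rather than ETS's reported computation; the function names are invented for the example.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def implied_sd(sem_value, reliability):
    """Invert the SEM formula to estimate the observed-score SD."""
    return sem_value / math.sqrt(1 - reliability)

# Writing section from Appendix A: reliability 0.74, SEM 2.76.
# The implied observed-score SD is roughly 5.4 scale points.
writing_sd = implied_sd(2.76, 0.74)
```

The formula also makes the paper's reliability argument concrete: at a fixed score spread, a lower reliability (0.74 for writing versus 0.85 for reading) yields a larger SEM, consistent with the claim that only two writing tasks provide a comparatively unreliable measurement.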