
A COMPILATION MATERIAL OF

ENGLISH LANGUAGE

TESTING

Compiled By:

BERTARIA SOHNATA HUTAURUK

Prodi Pendidikan Bahasa Inggris

FAKULTAS KEGURUAN DAN ILMU PENDIDIKAN

UNIVERSITAS HKBP NOMMENSEN

PEMATANGSIANTAR

2015

English Language Testing 1


INTRODUCTION

This book is a compilation of material for English Language Testing. It presents general outlines of material, as an introduction to English Language Testing, compiled for students at the S1 (undergraduate) level. The collection consists of Testing, Assessment, Measurement, Evaluation, Kinds of Testing, Validity and Reliability of Tests, and Interpreting the Test Score. Hopefully, this compilation will be useful for the students; it is not yet perfect, so any criticism is welcomed.

Compiled by:

Bertaria Sohnata Hutauruk


CONTENTS

1. What is the difference between assessment and evaluation? …………. 1
2. Testing, Assessment, Measurement and Evaluation …………………... 4
3. Informal vs. Formal Assessments: Tests are not the be-all and end-all of how we assess ………….. 6
4. Norm-referenced test and Criterion-referenced test …………………... 11
5. Discrete Point Testing and Integrative Testing ………………………… 19
6. Communicative Language Testing ……………………………………… 22
7. Testing Communicative Competence …………………………………… 24
8. Testing Reading and Writing ……………………………………………. 30
9. Performance-Based Assessment ……………………………………….... 32
10. Validity and Reliability …………………………………………………... 42
11. Constructing Tests ……………………………………………………….. 61
12. Types of Listening Testing ………………………………………………. 75
13. Testing Grammar ………………………………………………………… 97
14. Interpreting Test Scores …………………………………………………... 103


1

What is the difference between

assessment and evaluation?

There is a lot of confusion over these two terms as well as other terms associated with

assessment, testing, and evaluation. The big difference can be summarized as follows:

assessment is information gathered by the teacher and student to drive instruction, while

evaluation is when a teacher uses some instrument (such as the CMT or an end-of-unit

test) to rate a student so that this information can be used to compare or sort students. 

Assessment is for the student and the teacher in the act of learning while evaluation is

usually for others.

“If mathematics teachers were to focus their efforts on classroom assessment that is

primarily formative in nature, students’ learning gains would be impressive.  These

efforts would include gathering data through classroom questioning and discourse,

using a variety of assessment tasks, and attending primarily to what students know and

understand” (Wilson & Kenney, page 55).

Assessment is a lot more important because it is integral to instruction. Unfortunately, it

is being hampered by the demands of evaluation.  The biggest demand for evaluation is

grading or report cards. There shouldn’t be a problem with that, except that, historically, evaluations (grades) were determined exclusively by computing a student’s numeric average on paper-and-pencil assessments called quizzes or tests.

“Most experienced teachers will say that they know a great deal about their students in

terms of what the students know, how they perform in different situations, their attitudes

and beliefs, and their various levels of skill attainment.  Unfortunately, when it comes to

grades, they often ignore this rich storehouse of information and rely on test scores and

rigid averages that tell only a small fraction of the story.


The myth of grading by statistical number crunching is so firmly ingrained in schooling

at all levels that you may find it hard to abandon.  But it is unfair to students, to parents,

and to you as the teacher to ignore all of the information you get almost daily from a

problem-based approach in favor of a handful of numbers based on tests that usually

focus on low-level skills” (Van de Walle and Lovin, page 35).

The reason this is a problem is that students learn what is valued and they strive to do

well on those things. If the end-of-unit tests are what determine the grade, guess what kids want to do well on: the end-of-unit test! You can do all the great

activities you want, but if the bottom line is the test, then that is what is going to be

valued most by everyone: teachers, students, and parents, alike. 

What we need to get better at is valuing the day-to-day

activities we do and learn how to use them for both

assessment and evaluation. 

This will not be an easy task. 

It is very different from what we are used to doing.  We are used to teaching and then

assessing.  In reality, the line between teaching and assessment should be blurred

(NCTM, 2000).  “Interestingly, in some languages, learning and teaching are the same

word” (Fosnot and Dolk, page 1). We need to assess on a daily basis to give us the

information to make choices about what to teach the next day.  If we just teach the

whole unit and wait until the end-of-unit test to find out what the kids know, we may be

very unhappily surprised.  On the other hand, if we are assessing on a daily basis

throughout the unit, we do not need to average all those assessments to come up with a

final evaluation.  Instead, we could just use the most recent assessments to make that

evaluation. In this way, we do not penalize the student who did not know much at the beginning of the unit and worked hard to learn what we felt were the big ideas. Instead, we rate students on where they are when we finish the unit. This gives a more accurate report or evaluation of where they are performing when the evaluation is made.
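The difference between averaging the whole unit and rating students on their most recent work can be sketched with invented numbers (the scores below are illustrative, not from the text):

```python
# Hypothetical weekly assessment scores for a student who started the unit
# weak and worked hard to learn the big ideas (invented, illustrative numbers).
scores = [40, 55, 70, 85, 90]

# Traditional evaluation: a rigid numeric average over the whole unit,
# which penalizes the weak start.
average_grade = sum(scores) / len(scores)

# Alternative described above: rate the student on the most recent
# assessments, i.e. where they are when the unit is finished.
recent = scores[-2:]
recent_grade = sum(recent) / len(recent)

print(average_grade)  # 68.0
print(recent_grade)   # 87.5
```

The same daily assessments feed both numbers; only the evaluation rule changes.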


2

Testing, Assessment, Measurement and

Evaluation

The definitions are:

Test: a method to determine a student’s ability to complete certain tasks or demonstrate mastery of a skill or knowledge of content. Some types would be multiple-choice tests or a weekly spelling test. While it is commonly used interchangeably with assessment or even evaluation, it can be distinguished by the fact that a test is one form of assessment.

Assessment: the process of gathering information to monitor progress and make educational decisions if necessary. As noted, an assessment may include a test, but it also includes methods such as observations, interviews, behavior monitoring, etc.

Evaluation: procedures used to determine whether the subject (i.e., a student) meets a preset criterion, such as qualifying for special education services. This uses assessment (remember that an assessment may be a test) to make a determination of qualification in accordance with a predetermined criterion.

Measurement, beyond its general definition, refers to the set of procedures, and the principles for how to use those procedures, in educational evaluation. Examples would be raw scores, percentile ranks, derived scores, standard scores, etc.
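As a rough sketch of two of these measurement procedures (the class scores are invented for illustration), a raw score can be converted into a percentile rank or a standard score as follows:

```python
import statistics

# Invented raw scores for a class of ten students (illustrative only).
raw_scores = [52, 61, 61, 70, 74, 78, 83, 88, 91, 95]

def percentile_rank(scores, score):
    """Percentage of scores in the group that fall below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

def standard_score(scores, score):
    """z-score: distance of a raw score from the mean, in SD units."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population standard deviation
    return (score - mean) / sd

print(percentile_rank(raw_scores, 83))           # 60.0: six of ten scores are lower
print(round(standard_score(raw_scores, 83), 2))  # about 0.57 SD above the mean
```

Note that these are norm-based interpretations: both numbers describe a student relative to the group, not relative to a fixed criterion.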


3

Informal vs. Formal Assessments: Tests are not the be-all and end-all of how we assess.

Formal assessment

Formal assessment uses formal tests or structured continuous assessment to evaluate a

learner's level of language. It can be compared to informal assessment, which involves

observing the learners' performance as they learn and evaluating them from the data

gathered.

Example

At the end of the course, the learners have a final exam to see if they pass to the next

course or not. Alternatively, the results of a structured continuous assessment process

are used to make the same decision.

In the classroom

Informal and formal assessments are both useful for making valid and useful assessments of learners' knowledge and performance. Many teachers combine the

two, for example by evaluating one skill using informal assessment such as observing

group work, and another using formal tools, for example a discrete item grammar test.

Formative assessment

Formative assessment is the use of assessment to give the learner and the teacher

information about how well something has been learnt so that they can decide what to

do next. It normally occurs during a course. Formative assessment can be compared


with summative assessment, which evaluates how well something has been learnt in

order to give a learner a grade.

Example

The learners have just finished a project on animals, which had as a language aim better

understanding of the use of the present simple to describe habits. The learners now

prepare gap-fill exercises for each other based on some of their texts. They analyse the

results and give each other feedback.

In the classroom

One of the advantages of formative feedback is that peers can give it. Learners can test each other on language they have been learning, with the additional aim of revising the language themselves. It has often been said that “Everybody is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.” Our students must be assessed relative to what their skills are. This can be done through formal assessments, informal assessments, or a combination of both.

I realized that beyond giving formal assessments (i.e., summative assessments: quizzes, long tests, periodical exams, etc.), our main role as teachers is determined by how we recognize our students’ progress or stagnation through informal assessments (i.e., formative assessments: portfolios, role play, record tracking, etc.). These methods allow the teacher to adjust where and how his or her instruction is going.

The result of a formal test (e.g., a long test) alone does not necessarily reflect the entire academic ability of our students. When a student fails a formal test (e.g., a periodical test), we cannot conclude that his or her entire learning capability for that subject has failed as well.

Assessing students is not monopolized by just doing it formally (e.g. giving out tests,

quizzes, summative exams, etc.), but rather depends on the other informal assessments

(e.g. coaching sessions, reflective logs, fly-by-question and answers, etc.) that reinforce

formal ones.

There are many factors why a student could fail from a test (e.g. lack of sleep,

emotional and family distress, etc.), but there would only be few factors why he/she

would not be able to provide a reflective insight on the lesson. But how do we separate

formal assessments from informal ones?



When are informal assessments useful (versus formal assessments)?

The most applicable time to use informal assessments is when:

1. We want to gauge students' cognitive, affective, and manipulative skills in the

simplest way possible. We ask students to recite or write down essays to easily

determine if they understood a specific lesson well or poorly, if they are enthusiastic or

bored with the lesson, if they are already familiar or completely unfamiliar with the

topic, etc.

2. We deem that the results of the formal examinations are not enough to give a

concluding mark for the students’ performance. If a specific student performs

excellently in class activities but suddenly failed a summative test, it could tell us that

there could be a deviation between our formal and our informal assessments, or

other factors might have been involved with such event (e.g. student factor: did not

review, physically/emotionally troubled, etc.)

How valuable are informal assessments? Can informal assessments be

good replacements for formal assessments?

Although informal assessments provide teachers with a solid basis for judging how students are performing, this does not imply that they can replace formal assessments. The two should work hand in hand and interdependently; one should complement the other. For instance, if we opt to use role plays and recitals in assessing students’

communications skills informally, we should also align our formal exams with the

activities our students previously engaged in. In this way, we can ensure the validity and fairness of our assessments. Moreover, these methods relieve the burden of analyzing, comparing, and understanding our students’ “true” abilities.

We cannot just give (formal) tests or quizzes, in the same way that we cannot consume course time with nothing but (informal) class activities. Arriving at valid and reliable grades for our students means making the most of both formal and informal assessments.


To summarize: informal assessment as systematic observation = knowing what, when, and where we are going to assess + establishing criteria for how to assess students.

4


Norm-referenced test and Criterion-

referenced test

A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields

an estimate of the position of the tested individual in a predefined population, with

respect to the trait being measured. The estimate is derived from the analysis of test

scores and possibly other relevant data from a sample drawn from the population. That is, this type of test identifies whether the test taker performed better or worse than other test takers, not whether the test taker knows more or less material than is necessary for a given purpose. The term normative assessment refers to the process of comparing one test taker to his or her peers. Norm-referenced assessment can be contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-referenced assessment, the score shows whether or not test takers performed well or poorly on a given task, not how that performance compares to other test takers; in an ipsative system, test takers are compared to their own previous performance over time.

By contrast, a test is criterion-referenced when provision is made for translating the test

score into a statement about the behavior to be expected of a person with that score. The

same test can be used in both ways. Robert Glaser originally coined the terms norm-

referenced test and criterion-referenced test.

Standards-based education reform is based on the belief that public education should

establish what every student should know and be able to do. Students should be tested against a fixed yardstick, rather than against each other or sorted onto a mathematical bell curve.

By requiring that every student pass these new, higher standards, education officials believe that all students will achieve a diploma that prepares them for success in the 21st century. Most state achievement tests are criterion-referenced. In other words,


a predetermined level of acceptable performance is developed and students pass or fail

in achieving or not achieving this level. Tests that set goals for students based on the

average student's performance are norm-referenced tests. Tests that set goals for

students based on a set standard (e.g., 80 words spelled correctly) are criterion-

referenced tests.

Many college entrance exams and nationally used school tests use norm-referenced

tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale

for Children (WISC) compare individual student performance to the performance of a

normative sample. Test takers cannot "fail" a norm-referenced test, as each test taker

receives a score that compares the individual to others that have taken the test, usually

given by a percentile. This is useful when there is a wide range of acceptable scores that

is different for each college.

By contrast, nearly two-thirds of US high school students will be required to pass a

criterion-referenced high school graduation examination. One high fixed score is set at a

level adequate for university admission whether the high school graduate is college

bound or not. Each state gives its own test and sets its own passing level, with states like

Massachusetts showing very high pass rates, while in Washington State, even average

students are failing, as well as 80 percent of some minority groups. This practice is

opposed by many in the education community such as Alfie Kohn as unfair to groups

and individuals who score lower than others.

Advantages and limitations

An obvious disadvantage of norm-referenced tests is that they cannot measure the progress of the population as a whole, only where individuals fall within it. Thus,

measuring against only a fixed goal can be used to measure the success of an

educational reform program that seeks to raise the achievement of all students against

new standards that seek to assess skills beyond choosing among multiple choices.

However, while this is attractive in theory, in practice, the bar has often been moved in

the face of excessive failure rates, and improvement sometimes occurs simply because

of familiarity with and teaching to the same test.


With a norm-referenced test, grade level was traditionally set at the level achieved by the middle 50 percent of scores. By contrast, the National Children's Reading Foundation believes it is essential to ensure that virtually all children read at or above grade level by third grade, a goal which cannot be achieved with a norm-referenced definition of grade level.

Advantages to this type of assessment include that students and teachers know what to

expect from the test and just how the test will be conducted and graded. Likewise, all

schools will conduct the exam in the same manner, reducing such inaccuracies as time

differences or environmental differences that may cause distractions to the students.

This also makes these assessments fairly accurate as far as results are concerned, a

major advantage for a test.

Critics of criterion-referenced tests point out that judges set bookmarks around items of varying difficulty without considering whether the items actually comply with grade-level content standards or are developmentally appropriate. Thus, the original 1997 sample problems published for the WASL 4th-grade mathematics test contained items that were difficult for college-educated adults, or easily solved with 10th-grade-level methods such as similar triangles. The difficulty level of the items themselves and the cutscores that determine passing levels are also changed from year to year. Pass rates also vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.

One of the limitations of No Child Left Behind is that each state can choose or construct

its own test, which cannot be compared to any other state. A Rand study of Kentucky

results found indications of artificial inflation of pass rates which were not reflected in

increasing scores in other tests, such as the NAEP or SAT, given to the same student populations over the same period. Graduation test standards are typically set at a level consistent with admission of native-born applicants to four-year universities. An unusual side effect is that while colleges often admit immigrants with very strong math skills who may be deficient in

English, there is no such leeway in high school graduation tests, which usually require

passing all sections, including language. Thus, it is not unusual for institutions like the

University of Washington to admit strong Asian American or Latino students who did


not pass the writing portion of the state WASL test, but such students would not even

receive a diploma once the testing requirement is in place.

Although the tests such as the WASL are intended as a minimal bar for high school, 27

percent of 10th graders applying for Running Start in Washington State failed the math

portion of the WASL. These students applied to take college level courses in high

school, and achieve at a much higher level than average students. The same study

concluded the level of difficulty was comparable to, or greater than that of tests

intended to place students already admitted to the college.

A norm-referenced test has none of these problems because it does not seek to enforce

any expectation of what all students should know or be able to do other than what actual

students demonstrate. Present levels of performance and inequity are taken as fact, not

as defects to be removed by a redesigned system. Goals of student performance are not

raised every year until all are proficient. Scores are not required to show continuous

improvement through Total Quality Management systems. Disadvantages include that norm-referenced assessments measure the level students are currently at by comparing them with where their peers currently are, instead of with the level students should be at.

A rank-based system produces only data that tell which average students perform at an

average level, which students do better, and which students do worse, contradicting

fundamental beliefs, whether optimistic or simply unfounded, that all will perform at

one uniformly high level in a standards based system if enough incentives and

punishments are put into place. This difference in beliefs underlies the most significant

differences between a traditional and a standards based education system.

Examples

1. IQ tests are norm-referenced tests, because their goal is to see which test taker is

more intelligent than the other test takers.


2. Theater auditions and job interviews are norm-referenced tests, because their

goal is to identify the best candidate compared to the other candidates, not to

determine how many of the candidates meet a fixed list of standards.

A criterion-referenced test is one that provides for translating test scores into a statement

about the behavior to be expected of a person with that score or their relationship to a

specified subject matter. Most tests and quizzes that are written by school teachers can

be considered criterion-referenced tests. The objective is simply to see whether the

student has learned the material. Criterion-referenced assessment can be contrasted with

norm-referenced assessment and ipsative assessment.

A common misunderstanding regarding the term is the meaning of criterion. Many, if

not most, criterion-referenced tests involve a cutscore, where the examinee passes if

their score exceeds the cutscore and fails if it does not (often called a mastery test). The

criterion is not the cutscore; the criterion is the domain of subject matter that the test is

designed to assess. For example, the criterion may be "Students should be able to

correctly add two single-digit numbers," and the cutscore may be that students should

correctly answer a minimum of 80% of the questions to pass.
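A minimal sketch of this mastery-test logic (the function name and item counts are illustrative, not from the text): the criterion is the subject domain being tested, while the cutscore is just the passing threshold applied to a score on that domain.

```python
CUTSCORE = 0.80  # minimum proportion of items answered correctly to pass

def mastery_decision(num_correct, num_items, cutscore=CUTSCORE):
    """Pass/fail against a fixed cutscore, independent of how other
    examinees performed (a criterion-referenced interpretation)."""
    return "pass" if num_correct / num_items >= cutscore else "fail"

# A single-digit addition test with 20 items and an 80% cutscore.
print(mastery_decision(17, 20))  # 85% correct -> "pass"
print(mastery_decision(15, 20))  # 75% correct -> "fail"
```

Note that nothing in the decision depends on other examinees' scores, which is exactly what distinguishes it from a norm-referenced interpretation.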

The criterion-referenced interpretation of a test score identifies the relationship to the

subject matter. In the case of a mastery test, this does mean identifying whether the

examinee has "mastered" a specified level of the subject matter by comparing their

score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the

score can simply refer to a person's standing on the subject domain.The ACT is an

example of this; there is no cutscore, it simply is an assessment of the student's

knowledge of high-school level subject matter.Because of this common

misunderstanding, criterion-referenced tests have also been called standards-based

assessments by some education agencies,as students are assessed with regards to

standards that define what they "should" know, as defined by the state.


Comparison of criterion-referenced and norm-referenced tests

Both terms criterion-referenced and norm-referenced were originally coined by Robert

Glaser. Unlike a criterion-referenced test, a norm-referenced test indicates whether the test taker did better or worse than other people who took the test. For example, if the criterion is "Students should be able to correctly add two single-digit numbers," then reasonable test questions would ask the student to add two single-digit numbers. A criterion-referenced test would report the student's performance strictly according to whether the

individual student correctly answered these questions. A norm-referenced test would

report primarily whether this student correctly answered more questions compared to

other students in the group. Even when testing similar topics, a test which is designed to

accurately assess mastery may use different questions than one which is intended to

show relative ranking. This is because some questions are better at reflecting actual

achievement of students, and some test questions are better at differentiating between

the best students and the worst students. (Many questions will do both.) A criterion-

referenced test will use questions which were correctly answered by students who know

the specific material. A norm-referenced test will use questions which were correctly

answered by the "best" students and not correctly answered by the "worst" students (e.g.

Cambridge University's pre-entry 'S' paper). Some tests can provide useful information

about both actual achievement and relative ranking. The ACT provides both a ranking,

and indication of what level is considered necessary to likely success in college. Some

argue that the term "criterion-referenced test" is a misnomer, since it can refer to the

interpretation of the score as well as the test itself. In the previous example, the same

score on the ACT can be interpreted in a norm-referenced or criterion-referenced

manner.

Sample scoring for the history question: What caused World War II?

Student #1: "WWII was caused by Hitler and Germany invading Poland."
Criterion-referenced assessment: This answer is correct.
Norm-referenced assessment: This answer is worse than Student #2's answer, but better than Student #3's answer.

Student #2: "WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland."
Criterion-referenced assessment: This answer is correct.
Norm-referenced assessment: This answer is better than Student #1's and Student #3's answers.

Student #3: "WWII was caused by the assassination of Archduke Ferdinand."
Criterion-referenced assessment: This answer is wrong.
Norm-referenced assessment: This answer is worse than Student #1's and Student #2's answers.

Relationship to high-stakes testing

Many high-profile criterion-referenced tests are also high-stakes tests, where the results

of the test have important implications for the individual examinee. Examples of this

include high school graduation examinations and licensure testing where the test must

be passed to work in a profession, such as to become a physician or attorney. However,

being a high-stakes test is not specifically a feature of a criterion-referenced test. It is

instead a feature of how an educational or government agency chooses to use the results

of the test.

Examples

1. Driving tests are criterion-referenced tests, because their goal is to see whether

the test taker is skilled enough to be granted a driver's license, not to see whether

one test taker is more skilled than another test taker.


2. Citizenship tests are usually criterion-referenced tests, because their goal is to

see whether the test taker is sufficiently familiar with the new country's history

and government, not to see whether one test taker is more knowledgeable than

another test taker.
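The two interpretations above can be sketched in code. This is an illustrative Python sketch, not part of any operational scoring system; the cutoff of 70 and the score list are invented values.

```python
# Sketch: the same raw scores interpreted two ways.
# The cutoff (70) and the score data are made-up illustration values.

def criterion_referenced(score, cutoff=70):
    """Pass/fail against a fixed standard, ignoring other examinees."""
    return "pass" if score >= cutoff else "fail"

def norm_referenced(score, all_scores):
    """Percentile rank: share of examinees scoring strictly below."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

scores = [55, 68, 72, 81, 90]
for s in scores:
    print(s, criterion_referenced(s),
          f"{norm_referenced(s, scores):.0f}th percentile")
```

Note that a criterion-referenced judgment never changes when other examinees' scores change, while a norm-referenced one does.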


5. Discrete Point Testing and Integrative Testing

Electronic quiz tools usually involve a discrete point approach to testing as opposed to

an integrated or authentic approach, such as papers and projects. Discrete point tests are

made up of test questions each of which is meant to measure one content point. Discrete

point testing is associated with multiple choice and true/false formats, which have been

criticized for testing only recognition knowledge and facilitating guessing and cheating.

However, if they are used for an appropriate purpose and if the test questions are

well constructed, discrete point tests can be used for effective teaching and learning.

Should language be tested by discrete points or by integrative testing? Traditionally, language tests have been constructed on the assumption that language can be broken down into its component parts and that those component parts can be duly tested. What, then, is discrete point testing?

Language is segmented into many small linguistic points and into the four language skills of listening, speaking, reading and writing. Test questions are designed to test these skills and linguistic points. A discrete point test consists of many questions on a large number of linguistic points, but each question tests only one linguistic point. Examples of discrete point tests are:

1. Phoneme recognition.
2. Yes/No, True/False answers.
3. Spelling.
4. Word completion.
5. Grammar items.
6. Multiple choice tests.

Such tests have a downside in that they take language out of context and usually bear no relationship to the concept or use of whole language. Discrete point testing met with some criticism, particularly in view of more recent trends toward viewing the units of language in terms of their communicative nature and purpose, rather than viewing language as the arithmetic sum of all its parts. That is why John Oller (1976) introduced "integrative testing".

According to him, "language competence is a unified set of interacting abilities which cannot be separated apart and tested adequately." Oller (1979:37) writes: "Whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits all at the same time, and possibly while exercising several presumed components of a grammatical system, and perhaps more than one of the traditional skills or aspects of skills." Communicative competence, on this view, is so global and requires such "integration" for its "pragmatic" use in the real world that it cannot be captured in additive tests of grammar or reading or vocabulary and other discrete points of language. This emphasizes the simultaneous testing of the testee's multiple linguistic competences from various perspectives. Examples of integrative tests are:

1. Cloze tests
2. Dictation
3. Translation
4. Essays and other coherent writing tasks
5. Oral interviews and conversation
6. Reading, or other extended samples of real text
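A cloze test, the first integrative format listed, is commonly built by deleting every nth word from a passage. A minimal sketch follows; the sample passage and the deletion ratio are illustrative assumptions, and the common refinement of leaving the first sentence intact is omitted.

```python
# Sketch of a fixed-ratio cloze generator: every nth word is replaced
# by a numbered blank, and the deleted words form the answer key.

def make_cloze(text, n=5):
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):  # every nth word (1-indexed)
        answers.append(words[i])
        words[i] = f"({len(answers)})______"
    return " ".join(words), answers

passage = ("Language testers often use cloze passages because filling "
           "each gap requires grammar, vocabulary and discourse knowledge "
           "to operate together rather than in isolation.")
cloze, key = make_cloze(passage, n=5)
```

Because restoring each gap draws on grammar, vocabulary and discourse context at once, the format is integrative rather than discrete point.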

 

Oller (1979:38) has refined the integrative concept further by proposing what he calls the pragmatic test. A pragmatic test is "...any procedure or task that causes the learner to process sequences of elements in a language that conform to the normal contextual constraints of that language and which requires the learner to relate sequences of linguistic elements via pragmatic mappings to extralinguistic contexts."

A step in a positive direction would be to concentrate on tests of communicative competence. The recent direction of linguistic study has been toward viewing language as an integrated and pragmatic skill. Yet because we cannot be certain that a test like a cloze test meets the criterion of predicting or assessing a unified and integrated underlying linguistic competence, we must be cautious in selecting and constructing tests of language. There is nothing wrong with using traditional tests of discrete points of language, especially in achievement and other classroom-oriented testing in which certain discrete points are very important.



6. Communicative Language Testing

The notion of communicative competence is broad and needs to be fully understood before being considered as a basis for a research testing regime. As previously indicated, assessment can be viewed in terms of two distinct paradigms: 1) the Psychometric-Structuralist era, in which testing is based on discrete linguistic points related to the four language skill areas of reading, writing, speaking and listening; and 2) the Psycholinguistic-Sociolinguistic era, in which integrative tests were conceived in response to the language proficiency limitations associated with discrete point testing. According to Oller (in Weir, 1988), integrative testing could measure the ability to integrate disparate language skills in ways that more closely resembled the actual process of language use.

The communicative paradigm is founded on the notion of competence. According to Morrow (in Weir, 1988, p. 8), communicative language testing should be concerned with: 1) what the learner knows about the form of the language and how to use it appropriately in context (competence); and 2) the extent to which the learner is able to demonstrate this knowledge in a meaningful situation (performance), i.e. what he can do with the language. Performance testing should therefore be representative of a real-life situation where an integration of communicative skills is required. The performance test criteria should relate closely to the effective communication of ideas in that context. Weir emphasises the importance of context and related tasks as an important dimension in communicative (performance) language assessment (ibid., p. 11). In conclusion, a variety of different tests are required for a range of different purposes, and the associated instruments are no longer uniform in content or method.

In recognising the broad definitions of communication, Carroll (Testing Communicative

Performance, 1980) adopts a rationalist approach to test requirement definition. The

basis of the methodology therefore is a detailed analysis including the identification of

events and activities (communication functions) that drive the communicative need.

Having identified the test requirements, they are divided between the principal communicative domains of speaking, listening, writing and reading.


This approach is no doubt reminiscent of the requirements definition related to English for Specific Purposes (ESP), i.e. functional language appropriate for tourists, students, lawyers etc. However, this strategy (and its associated methodology) would seem inappropriate in the given research context for the following salient reasons:

1. It is not practical to undertake a meaningful needs analysis for all participants.
2. The entire process is far too complex and labour intensive.
3. ESP is not aimed at marginalised communities or children.

Sabria and Samer (other students) have pointed me in the direction of the Cambridge English exams (conformant with the Common European Framework of Reference for Languages) as a potential basis for communicative testing. The tests are divided into the four principal language dimensions (speaking, listening, writing and reading) and provide tests and marking criteria at all levels of competency, including that for the research context (Young Learners English – YLE Starters).


7. Testing Communicative Competence

Testing language has traditionally taken the form of testing knowledge about language,

usually the testing of knowledge of vocabulary and grammar. However, there is much

more to being able to use language than knowledge about it. Dell Hymes proposed the concept of communicative competence. He argued that a speaker may be able to produce grammatical sentences that are nevertheless completely inappropriate. In communicative competence, he included not only the ability to form correct sentences but also the ability to use them at appropriate times. Since Hymes proposed the idea in the early 1970s, it has been

expanded considerably, and various types of competencies have been proposed.

However, the basic idea of communicative competence remains the ability to use

language appropriately, both receptively and productively, in real situations.

The Communicative Approach to Testing

What Communicative Language Tests Measure

Communicative language tests are intended to be a measure of how the testees are able

to use language in real life situations. In testing productive skills, emphasis is placed on

appropriateness rather than on ability to form grammatically correct sentences. In

testing receptive skills, emphasis is placed on understanding the communicative intent

of the speaker or writer rather than on picking out specific details. And, in fact, the two

are often combined in communicative testing, so that the testee must both comprehend

and respond in real time. In real life, the different skills are not often used entirely in

isolation. Students in a class may listen to a lecture, but they later need to use

information from the lecture in a paper. In taking part in a group discussion, they need

to use both listening and speaking skills. Even reading a book for pleasure may be

followed by recommending it to a friend and telling the friend why you liked it.


The "communicativeness" of a test might be seen as being on a continuum. Few tests

are completely communicative; many tests have some element of communicativeness.

For example, a test in which testees listen to an utterance on a tape and then choose

from among three choices the most appropriate response is more communicative than

one in which the testees answer a question about the meaning of the utterance.

However, it is less communicative than one in which the testees are face-to-face with the interlocutor (rather than listening to a tape) and are required to produce an appropriate response.

Tasks

Communicative tests are often very context-specific. A test for testees who are going to

British universities as students would be very different from one for testees who are

going to their company's branch office in the United States. If at all possible, a

communicative language test should be based on a description of the language that the

testees need to use. Though communicative testing is not limited to English for Specific

Purposes situations, the test should reflect the communicative situation in which the

testees are likely to find themselves. In cases where the testees do not have a specific

purpose, the language that they are tested on can be directed toward general social

situations where they might be in a position to use English.

This basic assumption influences the tasks chosen to test language in communicative situations. A communicative test of listening, then, would not test whether the testee could understand what the utterance "Would you mind putting the groceries away before you leave?" means, but would place it in a context and see if the testee can respond appropriately to it.

If students are going to be tested over communicative tasks in an achievement test

situation, it is necessary that they be prepared for that kind of test, that is, that the course

material cover the sorts of tasks they are being asked to perform. For example, you

cannot expect testees to perform functions such as requests and apologies appropriately, and evaluate them on this, if they have been studying from a structural


syllabus. Similarly, if they have not been studying writing business letters, you cannot

expect them to write a business letter for a test.

Tests intended to test communicative language are judged, then, on the extent to which

they simulate real life communicative situations rather than on how reliable the results

are. In fact, there is an almost inevitable loss of reliability as a result of the loss of

control in a communicative testing situation. If, for example, a test is intended to test the

ability to participate in a group discussion for students who are going to a British

university, it is impossible to control what the other participants in the discussion will

say, so not every testee will be observed in the same situation, which would be ideal for

test reliability. However, according to the basic assumptions of communicative

language testing, this is compensated for by the realism of the situation.

Evaluation

There is necessarily a subjective element to the evaluation of communicative tests. Real

life situations don't always have objectively right or wrong answers, and so band scales

need to be developed to evaluate the results. Each band has a description of the quality

(and sometimes quantity) of the receptive or productive performance of the testee.
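As a rough illustration of how such band scales work, the sketch below defines a four-band holistic scale for an oral task and combines several raters' awards. The descriptors, the number of bands, and the averaging rule are invented for illustration; operational scales are far more detailed.

```python
# Sketch of a simple holistic band scale. All descriptors are
# invented illustration values, not an operational scale.

BANDS = {
    4: "Elicits and gives all required information; opinions clearly expressed.",
    3: "Most information exchanged; occasional breakdowns repaired.",
    2: "Some information exchanged; frequent breakdowns impede the task.",
    1: "Little successful exchange; task largely unfulfilled.",
}

def report(ratings):
    """Average the raters' band awards and round back onto the scale."""
    band = round(sum(ratings) / len(ratings))
    return band, BANDS[band]

band, descriptor = report([3, 3, 4])  # three raters' awards
```

Using more than one rater and averaging their bands is one common way to reduce the subjectivity noted above.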

Examples of Communicative Test Tasks

Speaking/Listening

Information gap. An information gap activity is one in which two or more testees work

together, though it is possible for a confederate of the examiner rather than a testee to

take one of the parts. Each testee is given certain information but also lacks some

necessary information. The task requires the testees to ask for and give information. The

task should provide a context in which it is logical for the testees to be sharing

information.


The following is an example of an information gap activity.

Student A

You are planning to buy a tape recorder. You don't want to spend more than about 80

pounds, but you think that a tape recorder that costs less than 50 pounds is probably not

of good quality. You definitely want a tape recorder with auto reverse, and one with a

radio built in would be nice. You have investigated three models of tape recorder and

your friend has investigated three models. Get the information from him/her and share

your information. You should start the conversation and make the final decision, but

you must get his/her opinion, too.

(information about three kinds of tape recorders)

Student B

Your friend is planning to buy a tape recorder, and each of you investigated three types

of tape recorder. You think it is best to get a small, light tape recorder. Share your

information with your friend, and find out about the three tape recorders that your friend

investigated. Let him/her begin the conversation and make the final decision, but don't

hesitate to express your opinion.

(information about three kinds of tape recorders)

This kind of task would be evaluated using a system of band scales. The band scales

would emphasize the testee's ability to give and receive information, express and elicit

opinions, etc. If its intention were communicative, it would probably not emphasize

pronunciation, grammatical correctness, etc., except to the extent that these might

interfere with communication. The examiner should be an observer and not take part in

the activity, since it is difficult to both take part in the activity and evaluate it. Also, the

activity should be tape recorded, if possible, so that it can be evaluated later rather than in real time.


Role Play. In a role play, the testee is given a situation to play out with another person.

The testee is given in advance information about what his/her role is, what specific

functions he/she needs to carry out, etc. A role play task would be similar to the above

information gap activity, except that it would not involve an information gap. Usually

the examiner or a confederate takes one part of the role play.

The following is an example of a role play activity.

Student

You missed class yesterday. Go to the teacher's office and apologize for having missed

the class. Ask for the handout from the class. Find out what the homework was.

Examiner

You are a teacher. A student who missed your class yesterday comes to your office.

Accept her/his apology, but emphasize the importance of attending classes. You do not

have any extra handouts from the class, so suggest that she/he copy one from a friend.

Tell her/him what the homework was.

Again, if the intention of this test were to test communicative language, the testee would

be assessed on his/her ability to carry out the functions (apologizing, requesting, asking

for information, responding to a suggestion, etc.) required by the role.


8. Testing Reading and Writing

Some tests combine reading and writing in communicative situations. Testees can be given a task in which they are presented with instructions to write a letter, memo, summary, etc., answering certain questions, based on information that they are given.

Letter writing. In many situations, testees might have to write business letters, letters asking for information, etc. The following is an example of such a task.

Your boss has received a letter from a customer complaining about problems with a

coffee maker that he bought six months ago. Your boss has instructed you to check the

company policy on returns and repairs and reply to the letter. Read the letter from the

customer and the statement of the company policy about returns and repairs below and

write a formal business letter to the customer.

(the customer's complaint letter; the company policy)

The letter would be evaluated using a band scale, based on compliance with formal

letter writing layout, the content of the letter, inclusion of correct and relevant

information, etc.

Summarizing. Testees might be given a long passage--for example, 400 words--and be

asked to summarize the main points in less than 100 words. To make this task

communicative, the testees should be given realistic reasons for doing such a task. For

example, the longer text might be an article that their boss would like to have

summarized so that he/she can incorporate the main points into a talk. The summary would be evaluated based on the inclusion of the main points of the longer text.

Testing Listening and Writing/Note Taking


Listening and writing may also be tested in combination. In this case, testees are given a

listening text and they are instructed to write down certain information from the text.

Again, although this is not interactive, it should somehow simulate a situation where

information would be written down from a spoken text.


9. Performance-Based Assessment

Performance-based assessment is an alternative form of assessment that moves away

from traditional paper and pencil tests. Performance-based assessment involves having

the students produce a project, whether it is oral, written or a group performance. The

students are engaged in creating a final project that exhibits their understanding of a

concept they have learned.

     A unique quality of performance-based assessment is that it allows the students to be

assessed based on a process. The teacher is able to see first hand how the students

produce language in real-world situations. In addition, performance-based assessments

tend to have a higher content validity because a process is being measured. The focus

remains on the process, rather than the product in performance-based assessment.

     There are two parts to performance-based assessments. The first part is a clearly

defined task for the students to complete. This is called the product descriptor. The

assessments are either product related, specific to certain content or specific to a given

task. The second part is a list of explicit criteria that are used to assess the students.

Generally this comes in the form of a rubric. The rubrics can either be analytical, meaning that they assess the final product in parts, or holistic, meaning that they assess the final product as a whole.
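The analytic/holistic distinction can be sketched as follows. The criterion names and point values are illustrative assumptions, not a standard rubric.

```python
# Sketch: analytic scoring sums points per criterion; holistic scoring
# is a single judgment of the whole product. Criteria and weights are
# invented illustration values.

ANALYTIC_RUBRIC = {        # criterion -> maximum points
    "content": 10,
    "organization": 5,
    "grammar": 5,
    "vocabulary": 5,
}

def analytic_score(awarded):
    """Sum per-criterion points; the parts can also be reported separately."""
    total = sum(awarded.values())
    maximum = sum(ANALYTIC_RUBRIC.values())
    return total, maximum

def holistic_score(overall_band):
    """A single judgment of the product as a whole (e.g. bands 1-5)."""
    return overall_band

total, maximum = analytic_score({"content": 8, "organization": 4,
                                 "grammar": 3, "vocabulary": 4})
```

An analytic rubric gives students diagnostic feedback per criterion; a holistic one is faster to apply but conceals where the points were lost.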

     Performance-based assessment tasks are generally not as formally structured. There

is room for creativity and student design in performance-based tasks. Generally, these

tasks measure the students when they are actually performing the given task. Due to the

nature of these tasks, performance-based assessment is highly interactive. Students are

interacting with each other in order to complete real-world examples of language tasks.


Also, performance-based assessment tends to integrate many different skills. For

example, reading and writing can be involved in one task or speaking and listening can

be involved in the same task.

As previously mentioned, there are many types of performance-based assessments.

Each type of assessment brings with it different strengths and deficiencies relative to

credible and dependable information. Because it is virtually impossible for a single

assessment tool to adequately assess all aspects of student performance, the real

challenge comes in selecting or developing performance-based assessments that

complement both each other and more traditional assessments to equitably assess

students in physical education and human performance.

The goal for assessment is to accurately determine whether students have learned the

materials or information taught and reveal whether they have complete mastery of the

content with no misunderstandings. Just as researchers use multiple data sources to

determine the truthfulness of the results, teachers can use multiple types of assessment

to evaluate the level of student learning. Because assessments involve the gathering of

data or information, some type of product, performance, or recording sheet must be

generated. The following are some examples of various types of performance-based

assessments used in physical education.

     Performance-based assessment is an opportunity to allow students to produce

language in real-world contexts while being assessed. This type of assessment is unique

because it is not a traditional test format. Some examples of performance-based

assessment tasks are as follows:

Types of Performance-Based Assessment:

1. Journals


Students will write regularly in a journal about anything relevant to their life, school or

thoughts. Their writing will be in the target language. The teacher will collect the

journals periodically and provide feedback to the students. This can serve as a

communication log between the teacher and students. Journals can be used to record

student feelings, thoughts, perceptions, or reflections about actual events or results. The

entries in journals often report social or psychological perspectives, both positive and

negative, and may be used to document the personal meaning associated with one’s

participation (NASPE Standard 6). Journal entries would not be an appropriate

summative assessment by themselves, but might be included as an artifact in a portfolio.

Journal entries are excellent ways for teachers to “take the pulse” of a class and

determine whether students are valuing the content of the class. Teachers must be

careful not to assess affective domain journal entries for the actual content, because

doing so may cause students to write what teachers want to hear (or give credit for)

instead of true and genuine feelings. Teachers could hold students accountable for

completing journal entries. Some teachers use journals as a way to log participation

over time.

2. Letters

 The students will create original language compositions through producing a letter.

They will be asked to write about something relevant to their own life using the target

language. The letter assignment will be accompanied by a rubric for assessment

purposes.

3. Oral Reports

The students will need to do research in groups about a given topic. After they have

completed their research, the students will prepare an oral presentation to present to the

class explaining their research. The main component of this project will be the oral

production of the target language.

4. Original Stories

    The students will write an original fictional story. The students will be asked to

include several specified grammatical structures and vocabulary words. This assignment

will be assessed analytically; each component will have a point value.


5. Oral Interview

    An oral interview will take place between two students. One student will ask the

questions and listen to the responses of the other student. From the given responses,

more questions can be asked. Each student will be responsible for listening and

speaking.

6. Skit

    The students will work in groups in order to create a skit about a real-world situation.

They will use the target language. The vocabulary used should be specific to the

situation. The students will be assessed holistically, based on the overall presentation of

the skit.

7. Poetry Recitations

    After studying poetry, the students will select a poem in the target language of their

choice to recite to the class. The students will be assessed based on their pronunciation,

rhythm and speed. The students will also have an opportunity to share with the class

what they think the poem means. 

8. Portfolios

    Portfolios allow students to compile their work over a period of time. The students

will have a checklist and rubric along with the assignment description. The students will

assemble their best work, including their drafts so that the teacher can assess the

process.

9. Puppet Show

    The students can work in groups or individually to create a short puppet show. The

puppet show can have several characters that are involved in a conversation of real-

world context. These would most likely be assessed holistically.

10. Art Work/ Designs/Drawings

     This is a creative way to assess students. They can choose a short story or piece of

writing, read it and interpret it. Their interpretation can be represented through artistic


expression. The students will present their art work to the class, explaining what they

did and why.

Using Observation in the Assessment Process

Human performance provides many opportunities for students to exhibit behaviors that

may be directly observed by others, a unique advantage of working in the psychomotor

domain. Wiggins (1998) uses physical activity when providing examples to illustrate

complex assessment concepts, as they are easier to visualize than would be the case

with a cognitive example. The nature of performing a motor skill makes assessment

through observational analysis a logical choice for many physical education teachers. In

fact, investigations of measurement practices of physical educators have consistently

shown a reliance on observation and related assessment methods (Hensley and East

1989; Matanin and Tannehill 1994; Mintah 2003).

Observation is a skill used with several performance-based assessments. It is often used

to provide students with feedback to improve performance. However, without some way

to record results, observation alone is not an assessment. Going back to the definition of

assessment provided earlier in the chapter, assessment is the gathering of information,

analyzing the data, and then using the information to make an evaluation. Therefore,

some type of written product must be produced if the task is considered an assessment.

Teachers and peers can assess others using observation. They might use a checklist or

some type of event recording scheme to tally the number of times a behavior occurred.

Keeping game play statistics is an example of recording data using event recording

techniques. Students can self-analyze their own performance and record their

performances using criteria provided on a checklist or a game play rubric. Table 14.1 is

an example of a recording form that could be used for peer assessment. When using

peer assessment, it is best to have the assessor do only the assessment. When the person

recording assessment results is also expected to take part in the assessment (e.g., tossing

the ball to the person being assessed), he or she cannot both toss and do an accurate

observation. In the case of large classes, teachers might even use groups of four, in

which one person is being evaluated, a second person is feeding the ball, the third

person is doing the observation, and a fourth person is recording the results.
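
The tallying described above can be sketched in a few lines of Python; the trial data and the "hit"/"miss" coding labels below are hypothetical, meant only to show how an event-recording sheet becomes a score:

```python
from collections import Counter

# Hypothetical event-recording sheet: each entry is one observed trial of a
# student's toss, coded "hit" or "miss" by the peer doing the observation.
observations = ["hit", "miss", "hit", "hit", "miss", "hit", "hit"]

tally = Counter(observations)
success_rate = tally["hit"] / len(observations)

print(f"hits: {tally['hit']}, misses: {tally['miss']}")
print(f"success rate: {success_rate:.0%}")
```

The same tallying pattern extends directly to game play statistics (e.g., counting successful passes per game).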

Individual or Group Projects

Projects have long been used in education to assess a student’s understanding of a

subject or a particular topic. Projects typically require students to apply their knowledge

and skills while completing the prescribed task, which often calls for creativity, critical

thinking, analysis, and synthesis. Examples of student projects used in physical

education and human performance include the following: demonstrating knowledge of

invasion game strategies by designing a new game; demonstrating knowledge of how to

become an active participant in the community by doing research on obesity and then

developing a brochure for people in the community that presents ideas for developing a

physically active lifestyle; demonstrating knowledge of fitness components and how to

stay fit by designing one’s own fitness program using personal fitness test results;

demonstrating knowledge of how to create a dance by video recording a dance that

members of the group choreographed; and doing research on childhood games and

teaching children from a local elementary school how to play them. Criteria for

evaluating the projects are developed and the results of the project are recorded.

Group projects involve a number of students working together on a complex problem

that requires planning, research, internal discussion, and presentation. Group projects

should include a component that each student completes individually to avoid having a

student receive credit for work that he or she did not do. Another way to avoid this issue

is to have members of the group award paychecks to the various members of the group

(e.g., split a $10,000 check) and provide justifications about the amount given to each

person. To encourage reflections on the contributions of others, students are not allowed

to give an equal amount to everyone. These “checks” are confidential and submitted

directly to the teacher in an envelope that others in the group are not allowed to see.
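
The paycheck rules above (the amounts must use the whole check, and an equal split is not allowed) can be expressed as a small validation routine; the function name and dollar figures are illustrative, not part of the original scheme:

```python
def valid_paycheck_split(amounts, total=10_000):
    """Check one student's confidential 'paycheck' allocation.

    The split must use the whole check, and members may not all
    receive the same amount, to force reflection on contributions.
    """
    if sum(amounts) != total:
        return False
    if len(set(amounts)) == 1:  # an equal split is not allowed
        return False
    return True

print(valid_paycheck_split([4000, 3000, 2000, 1000]))  # True
print(valid_paycheck_split([2500, 2500, 2500, 2500]))  # False (equal split)
print(valid_paycheck_split([5000, 3000, 1000]))        # False (doesn't sum)
```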

The following example of a project designed for middle school or high school students

involves a research component, analysis and synthesis of information, problem solving,

and effective communication.

Portfolios

Portfolios are systematic, purposeful, and meaningful collections of an individual’s

work designed to document learning over time. Since a portfolio provides

documentation of student learning, the knowledge and skills that the teacher desires to

have students document guide the structure of the portfolio. The type of portfolio, its

format, and the general contents are usually prescribed by the teacher. Portfolio

collections may also include input provided by teachers, parents, peers, administrators,

or others.The guidelines used to format a portfolio will be based on the type of learning

that the portfolio is used to document. The following are two basic types of portfolios:

Working portfolio—A repository of portfolio documents that the student accumulates

over a certain period of time. Other types of process information may also be included,

such as drafts of student work or records of student achievement or progress over time.

Showcase or model portfolio—A portfolio consisting of work samples selected by the

student that document the student’s best work. The student has consciously evaluated

his or her work and selected only those products that best represent the type of learning

identified for this assessment. Each artifact selected is accompanied by a reflection, in

which the student explains the significance of the item and the type of learning it

represents.

It’s a good idea to limit the portfolio to a certain number of pieces of work to prevent

the portfolio from becoming a scrapbook that has little meaning to the student and to

avoid giving teachers a monumental evaluation task. This also requires students to

exercise some judgment about which artifacts best fulfill the requirements of the

portfolio task and document their level of achievement. The portfolio itself is usually a

file or folder that contains the student’s collected work. The contents could include

items such as a training log, student journal or diary, written reports, photographs or

sketches, letters, charts or graphs, maps, copies of certificates, computer disks or

computer-generated products, completed rating scales, fitness test results, game

statistics, training plans, report of dietary analyses, and even video- or audio recordings.

Collectively, the artifacts selected will document student growth and learning over time

as well as current levels of achievement. The potential items that could become

portfolio artifacts are almost limitless. Kirk (1997) suggests the following list of

possible portfolio artifacts that may be useful for physical activity settings. A teacher

would never require that a portfolio contain all of these items. The list is offered as a

way to generate ideas for possible artifacts.

A rubric (scoring tool) should be used to evaluate portfolios in much the same manner

as any other product or performance. Providing a rubric to students in advance allows

them to self-assess their work and thus be more likely to produce a portfolio of high

quality. Portfolios, since they are designed to show growth and improvement in student

learning, are evaluated holistically. The reflections that describe the artifact and why the

artifact was selected for inclusion in the portfolio provide insights into levels of student

learning and achievement. Teachers should remember that format is less important than

content and that the rubric should be weighted to reflect this. Table 14.2 illustrates a

qualitative analytic rubric for judging a portfolio along three dimensions.

For additional information about portfolio assessments, Lund and Kirk (2010) have a

chapter on developing portfolio assessments. An article published as part of a JOPERD

feature presents a suggested scoring scale for a portfolio (Kirk 1997). Melograno’s

Assessment Series publication (2000) on portfolios also contains helpful information.

Performances

Student performances can be used as culminating assessments at the completion of an

instructional unit. Teachers might organize a gymnastics or track and field meet at the

conclusion of one of those units to allow students to demonstrate the skills and

knowledge that they gained during instruction. Game play during a tournament is also

considered a student performance. Rubrics for game play can be written so that students

are evaluated on all three learning domains (psychomotor, cognitive, and affective).

Students might demonstrate their skills and learning in one of the following ways:

Performing an aerobics routine for a school assembly

Organizing and performing a jump rope show at the half-time of a basketball game

Performing in a folk dance festival at the county fair

Demonstrating wushu (a Chinese martial art) at the local shopping mall

Training for and participating in a local road race or cycling competition

Although performances do not produce a written product, there are several ways to

gather data to use for assessment purposes. A score sheet can be used to record student

performance using the criteria from a game play rubric. Game play statistics are another

example of a way to document performance. Performances can also be video recorded

to provide evidence of learning. In some cases teachers might want to shorten the time

used to gather evidence of learning from a performance. Event tasks are performances

that are completed in a single class period. Students might demonstrate their knowledge

of net or wall game strategies by playing a scripted game that is video recorded during a

single class. The ability to create movement sequences or a dance that uses different

levels, effort, or relationships could be demonstrated during a single class period with

an event task. Many adventure education activities that demonstrate affective domain

attributes can be assessed using event tasks.

Student Logs

Documenting student participation in physical activity (NASPE Standard 3) is often

difficult. Teachers can assess participation in an activity or skill practice trials

completed outside of class using logs. Practice trials during class that demonstrate

student effort can also be documented with logs. A log records behaviors over a period

of time (see figure 14.1). Often the information recorded shows changes in behavior,

trends in performance, results of participation, progress, or the regularity of physical

activity. A student log is an excellent artifact for use in a portfolio. Because logs are

usually a self-recorded document, they are not used for summative assessments unless

as an artifact in a portfolio or for a project. If teachers wanted to increase the importance

placed on a log, a method of verification by an adult or someone in authority should be

added.

10

VALIDITY AND RELIABILITY

For the statistical consultant working with social science researchers, the estimation of

reliability and validity is a task frequently encountered. Measurement issues differ in

the social sciences in that they are related to the quantification of abstract, intangible

and unobservable constructs. In many instances, then, the meaning of quantities is only

inferred.

Let us begin with a general description of the paradigm that we are dealing with. Most

concepts in the behavioral sciences have meaning within the context of the theory that

they are a part of. Each concept, thus, has an operational definition which is governed

by the overarching theory. If a concept is involved in the testing of hypotheses to

support the theory, it has to be measured. So the first decision that the researcher is faced

with is “how shall the concept be measured?” That is, the type of measure. At a very

broad level the type of measure can be observational, self-report, interview, etc. These

types ultimately take the shape of a more specific form, such as observation of ongoing activity,

observing video-recorded events, self-report measures like questionnaires that can be open-

ended or closed-ended, Likert-type scales, and interviews that are structured, semi-structured,

or unstructured and open-ended or closed-ended. Needless to say, each type of measure

has specific types of issues that need to be addressed to make the measurement

meaningful, accurate, and efficient.

Another important feature is the population for which the measure is intended. This

decision is not entirely dependent on the theoretical paradigm but relates more to the

immediate research question at hand.

A third point that needs mentioning is the purpose of the scale or measure. What is it

that the researcher wants to do with the measure? Is it developed for a specific study or

is it developed with the anticipation of extensive use with similar populations?

Once some of these decisions are made and a measure is developed, which is a careful

and tedious process, the relevant questions to raise are “how do we know that we are

indeed measuring what we want to measure?” since the construct that we are measuring

is abstract, and “can we be sure that if we repeated the measurement we will get the

same result?” The first question is related to validity and the second to reliability. Validity

and reliability are two important characteristics of a behavioral measure and are referred

to as psychometric properties.

It is important to bear in mind that validity and reliability are not an all-or-none issue but

a matter of degree.

Measurement Error

All measurements may contain some element of error; validity and reliability

concern the amount and type of error that typically occurs, and they also show how we

can estimate the amount of error in a measurement.

There are three chief sources of error:

1. in the thing being measured (my weight may fluctuate so it's difficult to get an

accurate picture of it);

2. the observer (on Mondays I may knock a pound off my weight if I binged on my

mother's cooking at the week-end. Obviously the binging doesn't reflect my true

weight!);

3. or in the recording device (our clinic weigh scale has been acting up; we really

should get it recalibrated). And there are two types of error:

Random errors are not attributable to a specific cause. If sufficiently large numbers of

observations are made, random errors average to zero, because some readings over-

estimate and some under-estimate. Systematic errors tend to fall in a particular direction

and are likely due to a specific cause. Because systematic errors fall in one direction

(e.g., I always exaggerate my athletic abilities) they bias a measurement. Random errors

are considered part of the reliability of a measurement. Systematic errors are considered

part of the validity of a measurement.
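
The contrast between the two types of error can be demonstrated with a short simulation; the "true weight", noise size, and bias below are made-up numbers, chosen only to make the difference visible:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
TRUE_WEIGHT = 70.0

# Random error: zero-mean noise on every reading; it averages out.
random_readings = [TRUE_WEIGHT + random.gauss(0, 2.0) for _ in range(10_000)]

# Systematic error: a miscalibrated scale that always reads 1.5 units high;
# averaging more readings does not remove the bias.
biased_readings = [TRUE_WEIGHT + 1.5 + random.gauss(0, 2.0)
                   for _ in range(10_000)]

mean_random = sum(random_readings) / len(random_readings)
mean_biased = sum(biased_readings) / len(biased_readings)

print(f"mean with random error only: {mean_random:.2f}")  # close to 70.0
print(f"mean with systematic bias:   {mean_biased:.2f}")  # close to 71.5
```

With enough readings the random noise cancels, but the systematic 1.5-unit bias survives averaging, which is why it threatens validity rather than reliability.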

Reliability and validity

The reliability of an assessment tool is the extent to which it measures learning

consistently.

The validity of an assessment tool is the extent to which it measures what it was

designed to measure.

Reliability

The reliability of an assessment tool is the extent to which it consistently and accurately

measures learning. When the results of an assessment are reliable, we can be confident

that repeated or equivalent assessments will provide consistent results. This puts us in a

better position to make generalised statements about a student’s level of achievement,

which is especially important when we are using the results of an assessment to make

decisions about teaching and learning, or when we are reporting back to students and

their parents or caregivers. No results, however, can be completely reliable. There is

always some random variation that may affect the assessment, so educators should

always be prepared to question results.

Factors which can affect reliability:

The length of the assessment – a longer assessment generally produces more reliable

results.

The suitability of the questions or tasks for the students being assessed.

The phrasing and terminology of the questions.

The consistency in test administration – for example, the length of time given for the

assessment, instructions given to students before the test.

The design of the marking schedule and moderation of marking procedures.

The readiness of students for the assessment – for example, a hot afternoon or straight

after physical activity might not be the best time for students to be assessed.

How to be sure that a formal assessment tool is reliable

Check in the user manual for evidence of the reliability coefficient. These are measured

between zero and 1. A coefficient of 0.9 or more indicates a high degree of reliability.

Assessment tool manuals contain comprehensive administration guidelines. It is

essential to read the manual thoroughly before conducting the assessment.
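
The 0.9 guideline above can be expressed as a small helper; the "high" cut-off comes from the text, while the lower band labels are illustrative assumptions rather than a standard:

```python
def interpret_reliability(coefficient):
    """Label a reliability coefficient, measured between 0 and 1."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("reliability coefficients fall between 0 and 1")
    if coefficient >= 0.9:
        return "high reliability"      # guideline from the manual check
    if coefficient >= 0.7:
        return "moderate reliability"  # illustrative band
    return "low reliability"           # illustrative band

print(interpret_reliability(0.93))  # high reliability
print(interpret_reliability(0.75))  # moderate reliability
```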

Validity

Educational assessment should always have a clear purpose. Nothing will be gained

from assessment unless the assessment has some validity for the purpose. For that

reason, validity is the most important single attribute of a good test.

The validity of an assessment tool is the extent to which it measures what it was

designed to measure, without contamination from other characteristics. For example, a

test of reading comprehension should not require mathematical ability.

There are several different types of validity:

Face validity: do the assessment items appear to be appropriate?

Content validity: does the assessment content cover what you want to assess?

Criterion-related validity: how well does the test measure what you want it to?

Construct validity: are you measuring what you think you're measuring?

It is fairly obvious that a valid assessment should have a good coverage of the criteria

(concepts, skills and knowledge) relevant to the purpose of the examination. The

important notion here is the purpose. For example:

The PROBE test is a form of reading running record which measures reading

behaviours and includes some comprehension questions. It allows teachers to see the

reading strategies that students are using, and potential problems with decoding. The

test would not, however, provide in-depth information about a student’s comprehension

strategies across a range of texts.

STAR (Supplementary Test of Achievement in Reading) is not designed as a

comprehensive test of reading ability. It focuses on assessing students’ vocabulary

understanding, basic sentence comprehension and paragraph comprehension. It is most

appropriately used for students who don’t score well on more general testing (such as

PAT or e-asTTle) as it provides a more fine grained analysis of basic comprehension

strategies.

There is an important relationship between reliability and validity. An assessment that

has very low reliability will also have low validity; clearly a measurement with very

poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token,

the things required to achieve a very high degree of reliability can impact negatively on

validity. For example, consistency in assessment conditions leads to greater reliability

because it reduces 'noise' (variability) in the results. On the other hand, one of the things

that can improve validity is flexibility in assessment tasks and conditions. Such

flexibility allows assessment to be set appropriate to the learning context and to be

made relevant to particular groups of students. Insisting on highly consistent assessment

conditions to attain high reliability will result in little flexibility, and might therefore

limit validity.

Validity:

Very simply, validity is the extent to which a test measures what it is supposed

to measure. The question of validity is raised in the context of the three points made

above, the form of the test, the purpose of the test and the population for whom it is

intended. Therefore, we cannot ask the general question “Is this a valid test?”. The

question to ask is “how valid is this test for the decision that I need to make?” or “how

valid is the interpretation I propose for the test?” We can divide the types of validity

into logical and empirical.

VALIDITY refers to what conclusions we can draw from the results of a measurement.

Introductory-level definitions are "Does the test measure what we are intending to

measure?", or "How closely do the results of a measurement correspond to the true state

of the phenomenon being measured?"

Nerd's Corner: These ideas of validity fit under a more general conception in terms of

"How can we interpret the test results?" or "What does this measurement actually

mean?" This approach is useful because sometimes information collected for one

purpose can also tell us about something quite different. So, the World Bank records the

gross national product of each country for economic monitoring, but this also gives us a

pretty good idea of how countries will rank in terms of child health.

Nerd's Corner: Putting these ideas together, we get a table showing how validity and

reliability may be assessed:

Random error —

Thing being measured: test-retest reliability

Observer: correlation between observers

Recording device (e.g., screening test): calibration trial (variation with a standard object)

Systematic error —

Thing being measured: record diurnal (etc.) variation (e.g., BP higher on Mondays)

Observer: agreement between observers (e.g., nurses or patients)

Recording device (e.g., screening test): construct & criterion validity; sensitivity & specificity

Validity of a screening test. This can be used to illustrate the way validity is assessed.

Here, it is commonly reported in terms of sensitivity and specificity.

Sensitivity refers to what fraction of all the actual cases of disease a test detects.

If the test is not very good, it may miss cases it should detect. Its sensitivity is low and it

generates "false negatives" (i.e., people score negatively on the test when they should

have scored positive). This can be extremely serious if early treatment would have

saved the person's life.

Mnemonics to help you: The word 'sensitivity' is intuitive: a sensitive test is one that

can identify the disease.

SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity

= few false negatives).

Specificity refers to whether the test identifies only those with the disease, or

whether it mistakenly classifies some healthy people as being sick. Errors of this type are

called "false positives." This can lead to worry and expensive further investigations.
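
Both indices can be computed directly from a 2x2 table of screening results; the counts below are hypothetical:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Screening-test accuracy from a 2x2 table of counts.

    tp: diseased people the test flags (true positives)
    fn: diseased people the test misses (false negatives)
    tn: healthy people the test clears (true negatives)
    fp: healthy people the test wrongly flags (false positives)
    """
    sensitivity = tp / (tp + fn)  # fraction of actual cases detected
    specificity = tn / (tn + fp)  # fraction of healthy people cleared
    return sensitivity, specificity

# Hypothetical screening: 90 of 100 true cases detected, and
# 950 of 1,000 healthy people correctly cleared.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=950, fp=50)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.90, specificity = 0.95
```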

Types of Validity

1. Content Validity:

When we want to find out if the entire content of the behavior/construct/area is

represented in the test we compare the test task with the content of the behavior. This is

a logical method, not an empirical one. For example, if we want to test knowledge of

American geography, it is not fair to have most questions limited to the geography of

New England.

2. Face Validity:

Basically face validity refers to the degree to which a test appears to measure

what it purports to measure. Face Validity ascertains that the measure appears to be

assessing the intended construct under study. The stakeholders can easily assess face

validity. Although this is not a very “scientific” type of validity, it may be an essential

component in enlisting motivation of stakeholders. If the stakeholders do not believe the

measure is an accurate assessment of the ability, they may become disengaged with the

task. Example: If a measure of art appreciation is created all of the items should be

related to the different components and types of art. If the questions are regarding

historical time periods, with no reference to any artistic movement, stakeholders may

not be motivated to give their best effort or invest in this measure because they do not

believe it is a true assessment of art appreciation.

3. Criterion-Oriented or Predictive Validity:

Criterion-Related Validity is used to predict future or current performance - it correlates

test results with another criterion of interest.

Example: Suppose a physics program designed a measure to assess cumulative student

learning throughout the major. The new measure could be correlated with a

standardized measure of ability in this discipline, such as an ETS field test or the GRE

subject test. The higher the correlation between the established measure and new

measure, the more faith stakeholders can have in the new assessment tool.

When you are expecting a future performance based on the scores obtained

currently by the measure, correlate the scores obtained with the performance. The later

performance is called the criterion and the current score is the prediction. This is an

empirical check on the value of the test – a criterion-oriented or predictive validation.

4. Concurrent Validity:

Concurrent validity is the degree to which the scores on a test are related to the

scores on another, already established, test administered at the same time, or to some

other valid criterion available at the same time. For example, when a new, simpler test is to be

used in place of an old, cumbersome one that is considered useful, measurements are

obtained on both at the same time. Logically, predictive and concurrent validation are

the same; the term concurrent validation is used to indicate that no time elapsed between

measures.

5. Construct Validity:

Construct Validity is used to ensure that the measure is actually measuring what it is

intended to measure (i.e. the construct), and not other variables. Using a panel of

“experts” familiar with the construct is a way in which this type of validity can be

assessed. The experts can examine the items and decide what that specific item is

intended to measure. Students can be involved in this process to obtain their feedback.

Example: A women’s studies program may design a cumulative assessment of learning

throughout the major. The questions are written with complicated wording and

phrasing. This can cause the test to inadvertently become a test of reading

comprehension, rather than a test of women’s studies. It is important that the measure is

actually assessing the intended construct, rather than an extraneous factor.

Construct validity is the degree to which a test measures an intended

hypothetical construct. Many times psychologists assess/measure abstract attributes or

constructs. The process of validating the interpretations about that construct as

indicated by the test score is construct validation. This can be done experimentally,

e.g., suppose we want to validate a measure of anxiety. If we have a hypothesis that anxiety

increases when subjects are under the threat of an electric shock, then the threat of an

electric shock should increase anxiety scores (note: not all construct validation is this

dramatic!)

A correlation coefficient is a statistical summary of the relation between two

variables. It is the most common way of reporting the answer to such questions as the

following: Does this test predict performance on the job? Do these two tests measure

the same thing? Do the ranks of these people today agree with their ranks a year ago?

(rank correlation and product-moment correlation)
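
Both coefficients named above can be computed from scratch; the score lists are invented for illustration, and the rank routine ignores ties for simplicity:

```python
def pearson(x, y):
    """Product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank values 1..n (no tie handling; adequate for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Do students' ranks today agree with their ranks a year ago?
this_year = [55, 62, 70, 48, 90, 75]
last_year = [50, 60, 72, 45, 88, 70]
print(f"product-moment r = {pearson(this_year, last_year):.2f}")
print(f"rank correlation = {spearman(this_year, last_year):.2f}")
```

The product-moment coefficient uses the raw scores, while the rank coefficient asks only whether the orderings agree, which is exactly the "ranks today vs. ranks a year ago" question in the text.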

According to Cronbach, to the question “what is a good validity coefficient?”

the only sensible answer is “the best you can get”, and it is unusual for a validity

coefficient to rise above 0.60, though that is far from perfect prediction.

All in all we need to always keep in mind the contextual questions: what is the

test going to be used for? how expensive is it in terms of time, energy and money? what

implications are we intending to draw from test scores?

Formative validity, when applied to outcomes assessment, is used to assess how well a

measure is able to provide information to help improve the program under study.

Example: When designing a rubric for history, one could assess students’ knowledge

across the discipline. If the measure can provide information that students are lacking

knowledge in a certain area, for instance the Civil Rights Movement, then that

assessment tool is providing meaningful information that can be used to improve the

course or program requirements.

Sampling Validity (similar to content validity) ensures that the measure covers

the broad range of areas within the concept under study. Not everything can be

covered, so items need to be sampled from all of the domains. This may need to be

completed using a panel of “experts” to ensure that the content area is adequately

sampled. Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an

individual personally feels are the most important or relevant areas).

Example: When designing an assessment of learning in the theatre department, it would

not be sufficient to only cover issues related to acting. Other areas of theatre such as

lighting, sound, functions of stage managers should all be included. The assessment

should reflect the content area in its entirety.

What are some ways to improve validity?

1. Make sure your goals and objectives are clearly defined and operationalized.

Expectations of students should be written down.

2. Match your assessment measure to your goals and objectives. Additionally, have

the test reviewed by faculty at other schools to obtain feedback from an outside

party who is less invested in the instrument.

3. Get students involved; have the students look over the assessment for

troublesome wording, or other difficulties.

4. If possible, compare your measure with other measures, or data that may be

available.

Reliability:

Research requires dependable measurement (Nunnally). Measurements are

reliable to the extent that they are repeatable and that any random influence which tends

to make measurements different from occasion to occasion or circumstance to

circumstance is a source of measurement error (Gay). Reliability is the degree to which

a test consistently measures whatever it measures. Errors of measurement that affect

reliability are random errors and errors of measurement that affect validity are

systematic or constant errors.

Test-retest, equivalent forms and split-half reliability are all determined through

correlation.

RELIABILITY refers to consistency or dependability. Your patient Jim is

unpredictable; sometimes he comes to his appointment on time, sometimes he's late, and once or twice he was early.

One way to estimate reliability of a measurement is to record its stability: do you

get the same blood pressure reading if you repeat the measurement? This is sometimes

called "test-retest stability" or "intra-rater reliability" and focuses on the observer and

the instrument as potential sources of error. (Note that we must assume that no actual

change in BP occurred between the measurements: there is no error in the thing being

measured).

You can also estimate reliability by comparing the agreement between different people

making a rating (e.g., if several nurses measure a patient's blood pressure, do they get

the same reading?). This can be called "inter-rater reliability" or "inter-rater agreement."

Nerd's Corner: This is a simplification. Sometimes it's difficult to figure out if an

error is random or systematic: the disagreement between the nurses could really be

random, or it could arise because one of them tends to under-record the BP. Further

testing would be needed to trace the origin of the inaccuracy.

Types of Reliability

1. Test-retest Reliability:

Test-retest reliability is the degree to which scores are consistent over time. It

indicates score variation that occurs from testing session to testing session as a result of

errors of measurement. Problems: Memory, Maturation, Learning.

Test-retest reliability is a measure of reliability obtained by administering the same

test twice over a period of time to a group of individuals. The scores from Time 1 and

Time 2 can then be correlated in order to evaluate the test for stability over time.

Example: A test designed to assess student learning in psychology could be given to a

group of students twice, with the second administration perhaps coming a week after the

first. The obtained correlation coefficient would indicate the stability of the scores.
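The correlation step just described can be sketched in a few lines of Python. The score lists are hypothetical, and the Pearson formula is written out in full so the sketch is self-contained:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for the same five students at Time 1 and Time 2
time1 = [78, 85, 62, 90, 71]
time2 = [80, 83, 65, 92, 70]
print(round(pearson(time1, time2), 3))  # 0.981 -> scores are stable over time
```

A coefficient near 1.0 suggests stable scores; a much lower value would point to the memory, maturation, or learning problems noted above.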

2. Equivalent-Forms or Alternate-Forms Reliability

Parallel forms reliability is a measure of reliability obtained by administering

different versions of an assessment tool (both versions must contain items that probe the

same construct, skill, knowledge base, etc.) to the same group of individuals. The

scores from the two versions can then be correlated in order to evaluate the consistency

of results across alternate versions.

Example: If you wanted to evaluate the reliability of a critical thinking assessment, you

might create a large set of items that all pertain to critical thinking and then randomly

split the questions up into two sets, which would represent the parallel forms.

Equivalent-Forms or Alternate-Forms Reliability:

Two tests that are identical in every way except for the actual items included.

Used when it is likely that test takers will recall responses made during the first session

and when alternate forms are available. Correlate the two scores. The obtained

coefficient is called the coefficient of stability or coefficient of equivalence. Problem:

Difficulty of constructing two forms that are essentially equivalent.

Both of the above require two administrations.

3. Inter-rater reliability

Inter-rater reliability is a measure of reliability used to assess the degree to which

different judges or raters agree in their assessment decisions. Inter-rater reliability is

useful because human observers will not necessarily interpret answers the same way;

raters may disagree as to how well certain responses or material demonstrate knowledge

of the construct or skill being assessed.

Example: Inter-rater reliability might be employed when different judges are evaluating

the degree to which art portfolios meet certain standards. Inter-rater reliability is

especially useful when judgments can be considered relatively subjective. Thus, the use

of this type of reliability would probably be more likely when evaluating artwork as

opposed to math problems.
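One simple, if rough, index of inter-rater agreement is the proportion of cases on which two raters assign the same rating. The judges' ratings below are hypothetical:

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters assign the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical ratings of eight portfolios on a 1-4 scale by two judges
judge1 = [4, 3, 2, 4, 1, 3, 3, 2]
judge2 = [4, 3, 3, 4, 1, 2, 3, 2]
print(percent_agreement(judge1, judge2))  # 0.75
```

In practice, chance-corrected indices such as Cohen's kappa are often preferred, since two judges will agree some of the time by luck alone.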

4. Internal consistency reliability

Internal consistency reliability is a measure of reliability used to evaluate the

degree to which different test items that probe the same construct produce similar

results.

A. Average inter-item correlation is a subtype of internal consistency reliability.

It is obtained by taking all of the items on a test that probe the same construct

(e.g., reading comprehension), determining the correlation coefficient for each

pair of items, and finally taking the average of all of these correlation

coefficients. This final step yields the average inter-item correlation.

B. Split-half reliability is another subtype of internal consistency reliability. The

process of obtaining split-half reliability is begun by “splitting in half” all items

of a test that are intended to probe the same area of knowledge (e.g., World War

II) in order to form two “sets” of items. The entire test is administered to a

group of individuals, the total score for each “set” is computed, and finally the

split-half reliability is obtained by determining the correlation between the two

total “set” scores.

Split-Half Reliability:

Requires only one administration. Especially appropriate when the test is very

long. The most commonly used method to split the test into two is using the odd-even

strategy. Since longer tests tend to be more reliable, and since split-half reliability

represents the reliability of a test only half as long as the actual test, a correction

formula must be applied to the coefficient. Spearman-Brown prophecy formula.

Split-half reliability is a form of internal consistency reliability.

Internal Consistency Reliability:

Determining how all items on the test relate to all other items. The Kuder-Richardson approach yields an estimate of reliability that is essentially equivalent to the average of the split-half reliabilities computed for all possible halves.
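For dichotomously scored (right/wrong) items, the Kuder-Richardson formula 20 can be computed directly. The sketch below uses a hypothetical 0/1 response matrix and the population variance of the total scores:

```python
from statistics import pvariance

def kr20(item_matrix):
    """Kuder-Richardson formula 20 for 0/1-scored items.
    item_matrix: one row of item responses per examinee."""
    k = len(item_matrix[0])                       # number of items
    n = len(item_matrix)                          # number of examinees
    totals = [sum(row) for row in item_matrix]    # each examinee's total score
    p = [sum(row[i] for row in item_matrix) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)       # sum of item variances
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

# Hypothetical responses: five examinees (rows) x six items (columns)
responses = [
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(kr20(responses), 2))  # 0.56
```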

Rationale Equivalence Reliability:

Rationale equivalence reliability is not established through correlation but rather

estimates internal consistency by determining how all items on a test relate to all other

items and to the total test.

Standard Error of Measurement:

Reliability can also be expressed in terms of the standard error of measurement.

It is an estimate of how often you can expect errors of a given size.
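A common working formula, assuming the usual classical-test-theory relationship, is SEM = SD × √(1 − reliability); the test statistics below are hypothetical:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability). Roughly 68% of a person's observed
    scores would fall within one SEM of his or her true score."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: standard deviation of 10 points, reliability of 0.91
print(round(standard_error_of_measurement(10, 0.91), 2))  # 3.0
```

So on this hypothetical test, an observed score of 75 would suggest a true score somewhere between about 72 and 78.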

Principles of Language Testing

1. What are the principles of language testing?

2. How can we define them?

3. What factors can influence them?

4. How can we measure them?

5. How do they interrelate?

Three Important Characteristics of Tests:

1. Reliability: consistency and freedom from extraneous sources of error

2. Validity: how well a test measures what it is supposed to measure

Validity refers to measuring what we intend to measure. For example, if math and vocabulary truly represent intelligence, then a math and vocabulary test might be said to have high validity when used as a measure of intelligence.

Estimating the Validity of a Measure:

1. A good measure must not only be reliable, but also valid

2. A valid measure measures what it is intended to measure

3. Validity is not a property of a measure, but an indication of the extent to which

an assessment measures a particular construct in a particular context—thus a

measure may be valid for one purpose but not another

4. A measure cannot be valid unless it is reliable, but a reliable measure may not be

valid

Content Validity:

1. Does the test contain items from the desired “content domain”?

2. Based on assessment by experts in that content domain.

3. Is especially important when a test is designed to have low face validity.

4. Is generally simpler for “other tests” than for “psychological constructs”

For example, it is easier for math experts to agree on an item for an algebra test than it is for psychology experts to agree whether an item belongs in an EI (emotional intelligence) measure or a personality measure.

5. Content Validity is not “tested for”. Rather it is assured by experts in the

domain.

Basic Procedure for Assessing Content Validity:

1. Describe the content domain

2. Determine the areas of the content domain that are measured by each test item

3. Compare the structure of the test with the structure of the content domain

For Example:

In developing a nursing licensure exam, experts on the field of nursing would identify

the information and issues required to be an effective nurse and then choose (or rate)

items that represent those areas of information and skills.

If a test is to measure foreign students' mastery of English sentence structure, an analysis must first be made of the language itself, and decisions made about which matters need to be tested and in what proportions.

Face Validity

1. Face validity refers to the extent to which a measure ‘appears’ to measure what

it is supposed to measure

2. Not statistical—involves the judgment of the researcher (and the participants)

3. A measure has face validity—’if people think it does’

4. Just because a measure has face validity does not ensure that it is a valid

measure (and measures lacking face validity can be valid)

Relationship Between Reliability & Validity

• Reliability and validity jointly determine the usefulness of a test. Though different, they work together. It would not be

beneficial to design a test with good reliability that did not measure what it was

intended to measure. The inverse, accurately measuring what you desire to

measure with a test that is so flawed that results are not reproducible, is

impossible. Reliability is a necessary requirement for validity. This means that

you have to have good reliability in order to have validity. Reliability actually

puts a cap or limit on validity, and if a test is not reliable, it cannot be valid.

Establishing good reliability is only the first part of establishing validity.

Validity has to be established separately. Having good reliability does not mean

you have good validity, it just means you are measuring something consistently.

Now you must establish what it is that you are measuring consistently. The main

point here is that reliability is necessary but not sufficient for validity. Tests that are

reliable are not necessarily valid or predictive. If the reliability of a

psychological measure increases, the validity of the measure is also expected to

increase.

FACTORS THAT INFLUENCE VALIDITY:

1. Inadequate sample

2. Items that do not function as intended

3. Improper arrangement/unclear directions

4. Too few items for interpretation

5. Improper test administration

6. Scoring that is subjective

Reliability is influenced by:

1. the longer the test, the more reliable it is likely to be [though there is a point of

no extra return]

2. items which discriminate will add to reliability, therefore, if the items are too

easy / too difficult, reliability is likely to be lower

3. if there is a wide range of abilities amongst the test takers, test is likely to have

higher reliability

4. the more homogeneous the items are, the higher the reliability is likely to be

Practicality:

The ease with which the test:

1. items can be replicated in terms of resources needed e.g. time, materials, people

2. can be administered

3. can be graded

4. results can be interpreted

Factors which can influence reliability, validity and practicality:

From the TEST:

1. quality of items

2. number of items

3. difficulty level of items

4. level of item discrimination

5. type of test methods

6. number of test methods

7. time allowed

8. clarity of instructions

9. use of the test

10. selection of content

11. sampling of content

12. invalid constructs

From the TEST TAKERS:

1. familiarity with test method

2. attitude towards the test i.e. interest, motivation, emotional/mental state

3. degree of guessing employed

4. level of ability

From the Test Administration

1. consistency of administration procedure

2. degree of interaction between invigilators and test takers

3. time of day the test is administered

4. clarity of instructions

5. test environment – light / heat / noise / space / layout of room

6. quality of equipment used e.g. for listening tests

From the Scoring

1. accuracy of the key e.g. does it include all possible alternatives?

2. inter-rater reliability e.g. in writing, speaking

3. intra-rater reliability e.g. in writing, speaking

4. machine vs. human

How can we measure reliability?

Test-retest: the same test administered to the same test takers following an interval of no more than two weeks

Inter-rater reliability: two or more independent estimates on a test e.g. written scripts

marked by two raters independently and results compared

3. Practicality

11

CONSTRUCTING TESTS

Writing items requires a decision about the nature of the item or question to which we ask students to respond (that is, whether discrete or integrative), how we will score the item (for example, objectively or subjectively), the skill we purport to test, and so on. We

also consider the characteristics of the test takers and the test taking strategies

respondents will need to use. What follows is a short description of these considerations

for constructing items.

Test Items

A test item is a specific task that test takers are asked to perform. Test items can assess one

or more points or objectives, and the actual item itself may take on a different

constellation depending on the context. For example, an item may test one point

(understanding of a given vocabulary word) or several points (the ability to obtain facts

from a passage and then make inferences based on the facts). Likewise, a given

objective may be tested by a series of items. For example, there could be five items all

testing one grammatical point (e.g., tag questions). Items of a similar kind may also be

grouped together to form subtests within a given test.

Classifying Items

Discrete – A completely discrete-point item would test simply one point or objective

such as testing for the meaning of a word in isolation. For example:

Choose the correct meaning of the word paralysis.

(A) inability to move

(B) state of unconsciousness

(C) state of shock

(D) being in pain

Integrative – An integrative item would test more than one point or objective at a time (e.g., comprehension of words and the ability to use them correctly in context). For

example:

Demonstrate your comprehension of the following words by using them together in a

written paragraph: “paralysis,” “accident,” and “skiing.”

Sometimes an integrative item is really more a procedure than an item, as in the case of

a free composition, which could test a number of objectives; for example, use of

appropriate vocabulary, use of sentence level discourse, organization, statement of

thesis and supporting evidence. For example:

Write a one-page essay describing three sports and the relative likelihood of being

injured while playing them competitively.

Objective – A multiple-choice item, for example, is objective in that there is only one

right answer.

Subjective – A free composition may be more subjective in nature if the scorer is not

looking for any one right answer, but rather for a series of factors (creativity, style,

cohesion and coherence, grammar, and mechanics).

The Skill Tested

The language skills that we test include the more receptive skills on a continuum –

listening and reading, and the more productive skills – speaking and writing. There are,

of course, other language skills that cross-cut these four skills, such as vocabulary.

Assessing vocabulary will most likely vary to a certain extent across the four skills, with

assessment of vocabulary in listening and reading – perhaps covering a broader range

than assessment of vocabulary in speaking and writing. We can also assess nonverbal

skills, such as gesturing, and this can be both receptive (interpreting someone else’s

gestures) and productive (making one’s own gestures).

The Intellectual Operation Required

Items may require test takers to employ different levels of intellectual operation in order

to produce a response (Valette, 1969, after Bloom et al., 1956). The following levels of

intellectual operation have been identified:

knowledge (bringing to mind the appropriate material);

comprehension (understanding the basic meaning of the material);

application (applying the knowledge of the elements of language and comprehension to

how they interrelate in the production of a correct oral or written message);

analysis (breaking down a message into its constituent parts in order to make explicit

the relationships between ideas, including tasks like recognizing the connotative

meanings of words and correctly processing a dictation, and making inferences);

synthesis (arranging parts so as to produce a pattern not clearly there before, such as in

effectively organizing ideas in a written composition); and

evaluation (making quantitative and qualitative judgments about material).

It has been popularly held that these levels demand increasingly greater cognitive

control as one moves from knowledge to evaluation – that, for example, effective

operation at more advanced levels, such as synthesis and evaluation, would call for

more advanced control of the second language. Yet this has not necessarily been borne

out by research (see Alderson & Lukmani, 1989). The truth is that what makes items difficult sometimes defies the intuitions of the test constructors.

The Tested Response Behavior

Items can also assess different types of response behavior. Respondents may be tested

for accuracy in pronunciation or grammar. Likewise, they could be assessed for fluency,

for example, without concern for grammatical correctness. Aside from accuracy and

fluency, respondents could also be assessed for speed – namely, how quickly they can

produce a response, to determine how effectively the respondent replies under time

pressure. In recent years, there has also been an increased concern for developing

measures of performance – that is, measures of the ability to perform real-world tasks,

with criteria for successful performance based on a needs analysis for the given task

(Brown, 1998; Norris, Brown, Hudson, & Yoshioka, 1998).

Performance tasks might include “comparing credit card offers and arguing for the best

choice” or “maximizing the benefits from a given dating service.” At the same time that

there is a call for tasks that are more reflective of the real world, there is a

commensurate concern for more authentic language assessment. At least one study,

however, notes that the differences between authentic and pedagogic written and spoken

texts may not be readily apparent, even to an audience specifically listening for

differences (Lewkowicz, 1997). In addition, test takers may not necessarily concern

themselves with task authenticity in a test situation. Test familiarity may be the

overriding factor affecting performance.

Characteristics of Respondents

Items can be designed to be appropriate for groups of test-takers with differing

characteristics. Bachman and Palmer (1996: 64-78) classify these characteristics into

four categories: the personal characteristics of the respondents – for example, their age,

gender, and native language; the knowledge of the topic that they bring to the language

testing situation; their affective schemata (that is, their prior likes and dislikes with

regard to assessment); and their language ability.

Research into the impact of these characteristics continues. For example, with regard to

the age variable, researchers have suggested that educators revisit this issue and perhaps

conceive of new ways to consider the impact of the age variable in assessing language

ability (Marinova-Todd, Marshall, & Snow, 2000). With regard to performance on

language measures, it would appear that age interacts with other variables such as

attitudes, motivation, the length of exposure to the target language, as well as the nature

and quality of language instruction (see García Mayo & García Lecumberri, 2003).

With regard to language ability, both Bachman and Palmer (1996) and Alderson (2000)

detail the many types of knowledge that respondents may need to draw on to perform

well on a given item or task: world knowledge and culturally-specific knowledge,

knowledge of how the specific grammar works, knowledge of different oral and written

text types, knowledge of the subject matter or topic, and knowledge of how to perform

well on the given task.

Item-Elicitation Format

The format for item elicitation has to be determined for any given item. An item can

have a spoken, written, or visual stimulus, as well as any combination of the three.

Thus, while an item or task may ostensibly assess one modality, it may also be testing

some other as well. So, for example, a subtest referred to as “listening” which has

respondents answer oral questions by means of written multiple-choice responses is

testing reading as well as listening. It would be possible to avoid introducing this

reading element by having the multiple-choice alternatives presented orally as well. But

then the tester would be introducing yet another factor, namely, short-term memory

ability, since the respondents would have to remember all the alternatives long enough

to make an informed choice.

Item-Response Format

The item-response format can be fixed, structured, or open-ended. Item responses with a

fixed format include true/false, multiple-choice, and matching items. Item responses that call for a structured format include ordering (where respondents are requested to arrange words to make a sentence, and several orders are possible), duplication – both written (such as dictation) and oral (for example, recitation, repetition, mimicry),

identification (explaining the part of speech of a form), and completion. Those item

responses calling for an open-ended format include composition – both written (for

example, creative fiction, expository essays) and oral (such as a speech) – as well as

other activities, such as free oral response in role-playing situations.

Grammatical competence

According to Canale and Swain (1980, p. 29), grammatical competence includes

phonology, morphology, syntax, knowledge of lexical items, and semantics, as well as

matters of mechanics (spelling, punctuation, capitalization, and handwriting). It would

seem that this definition is perhaps too broad for practical purposes. A truly perplexing

issue is determining what constitutes a grammatical error, as well as determining the

severity of this error. In other words, will the use of the error stigmatize the speaker?

Let us say that we are using a grammatical scale which deals with how acceptably

words, phrases, and sentences are formed and pronounced in the respondents'

utterances. Let us assume that the focus is on both of the following: clear cases of

errors in form, such as the use of the present perfect for an action completed in the past

(e.g., "We have had a great time at your house last night."), and matters of style, such

as the use of a passive verb form in a context where a native would use the active form

(e.g., Question - “What happened to the CD I lent you, Jorge?” Reply - "The CD was

lost." vs. "I lost your CD.").

Major grammatical errors might be considered those that either interfere with

intelligibility or stigmatize the speaker. Minor errors would be those that do not get in

the way of the listener's comprehension nor would they annoy the listener to any

extent.Thus, getting the tense wrong in the above example, "We have had a great time at

your house last night" could be viewed as a minor error, whereas in another case,

producing "I don't have what to say" ("I really have no excuse" by translating directly

from the appropriate Hebrew language) could be considered a major error since it is not

only ungrammatical but also could stigmatize the speaker as rude and unconcerned,

rather than apologetic.

Rationale for Tests:

Measures of student performance (testing) may have as many as five purposes:

Student Placement,

Diagnosis of Difficulties,

Checking Student Progress,

Reports to Student and Superiors,

Evaluation of Instruction.

Unfortunately the most common perception is that tests are designed to statistically rank

all students according to a sampling of their knowledge of a subject and to report that

ranking to superiors or anyone else interested in using that information to adversely

influence the student's feeling of self-worth. It is even more unfortunate that the

perception matches reality in the majority of testing situations. Consequently tests are

highly stressful, anxiety-producing events for most persons.

All too often tests are constructed to determine how much a student knows rather than

determining what he/she must learn. Frequently tests are designed to "trap" the student

and in still other situations tests are designed to ensure a "bell curve" distribution of

results. Most of the other numerous testing designs and strategies fail to help the student

in his learning process and in many cases are quite detrimental to that process.

In a Mastery Based system of instruction the two main reasons for testing are to

determine mastery and to diagnose difficulties. When tests are constructed for these

purposes, the other four purposes will also be satisfied. For example, consider a test

which requires the student to demonstrate mastery and at the same time rigorously

diagnoses learning difficulties. If no difficulties are indicated, it may be safely assumed

that the learner has mastered the concept. That information may then be used to record

student progress and to make reports to the student and superiors. Examining student

performance collectively for a group of students provides information about the quality

of instruction. Examining a single student's performance collectively for a group of

learning objectives may be used to determine proper placement within that group of

learning objectives.

It is therefore important that the instructional developer construct each question so that a

correct response indicates mastery of the learning objective and any incorrect response

provides information about the nature of the student's lack of mastery. Furthermore,

each student should have ample opportunity to "inform" the instructor of any form of

lack of mastery. Unfortunately the mere presence of a test question influences the

student's response to the question. The developer should minimize that influence by

constructing questions which permit the student to make any error he would make in the

absence of such influence. For example, a multiple choice question should have all the

wrong answers the student might want to select and should also have as many correct

answers as the student might want to provide.

True/False Questions:

True/false questions should be written without ambiguity. That is, the statement of the

question should be clear and the decision whether the statement is true or false should

not depend on an obscure interpretation of the statement. A true/false question may

easily be used, and most commonly is used, to determine if the student recalls facts.

However, a true/false question may also be used to determine if the learner has mastered

the learning objective well enough to correctly analyze a statement.

It is important to be aware that only two choices are available to the student and

therefore the nature of the question gives the student a 50% chance of being correct. A

single True/False question therefore is helpful only if the student answers the question

incorrectly and the incorrect response indicates a specific misunderstanding of the

learning objective. A collection of true/false questions, about a single learning

objective, all answered correctly by a student is a much stronger indication of mastery.
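The arithmetic behind this point is simple: a guessing student has probability 0.5 per item, so the chance of a perfect score on n items by luck alone is 0.5 to the power n, which shrinks quickly as n grows. A minimal Python sketch (illustrative only, not part of the original text):

```python
# A student guessing blindly has a 0.5 chance per true/false item, so the
# probability of a perfect score on n items by luck alone is 0.5 ** n.
def chance_of_perfect_guess(n):
    return 0.5 ** n

for n in (1, 5, 10):
    print(n, chance_of_perfect_guess(n))  # 1 item: 0.5; 10 items: under 0.1%
```

With ten true/false items on one objective, a perfect score by guessing alone is less than a one-in-a-thousand event, which is why a collection of items is a much stronger indication of mastery than any single item.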

It is therefore important that the instructional developer construct a "test bank"

containing a large number of true/false questions. It is also important to include

numerous true/false questions on any test which utilizes true/false questions. Ideally a

true/false question should be constructed so that an incorrect response indicates

something about the student's misunderstanding of the learning objective. This may be a

difficult task, especially when constructing a true statement. The instructional developer


should try to accomplish the ideal, but should recognize that in some instances he/she

will not reach that goal.
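In practice, a test bank can be as simple as a list of statement/key pairs from which each test form draws items at random. The sketch below is illustrative only; the statements, keys, and function name are the editor's invention, not items from this book:

```python
import random

# A hypothetical miniature "test bank" of true/false items: each entry
# pairs a statement with its key (the statements are invented examples).
test_bank = [
    ("A raw score alone tells us how well a student performed.", False),
    ("Reliability by itself guarantees validity.", False),
    ("A valid test must also be reliable.", True),
    ("Evaluation involves making value judgments.", True),
    ("Assessment and testing are interchangeable terms.", False),
]

def draw_items(bank, k):
    """Draw k distinct items at random from the bank for one test form."""
    return random.sample(bank, k)

quiz = draw_items(test_bank, 3)
```

Drawing items this way lets the developer build numerous parallel forms from one large bank, in line with the advice above.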

Multiple Choice Questions:

Multiple choice questions should be written without ambiguity. That is, the statement of

the question stem should be clear and should leave no doubt about how to select

choices. Additionally the choices should be written without ambiguity and should

contain all information required to make a decision whether or not to choose it. The

decision whether to select or not select a choice should not depend on an obscure

interpretation of either the stem or the choice. A multiple choice question may easily be

used to determine if the student recalls facts. However, a multiple choice question may

also be used to determine if the student has mastered the learning objective well enough

to correctly analyze a statement.

The instructional developer should not construct multiple choice questions with a

uniform number of choices, a uniform number of valid choices, or any other

recognizable pattern for construction of choices. Instead the instructional developer

should include as many valid and invalid choices as is required to determine the

student's deficiencies with respect to the learning objective. Moreover, each choice

should appear to be a valid choice to some student.

Multiple choice questions should therefore contain any number of choices with one or

more valid choices. The student is of course required to select all valid choices and

failure to select any one of the valid choices will provide information about the student's

misunderstanding of the learning objective in the same way that selection of an invalid

choice reveals the nature of his/her misunderstanding. The nature of the choices

provided in a multiple choice question may be of two types: those which require merely

recall of facts and those which require additional activity such as synthesis, analysis,

computation, comparison, or diagramming. The instructional developer who is seriously

concerned with the student's success will use both types extensively.

Fill-in-the-Blank Questions:


The temptation, when constructing fill-in-the-blank questions, is to construct traps for the

student. The instructional developer should avoid this problem. Ensure that there is only

one acceptable word for the student to provide and that the word (or words) is

significant. Avoid asking the student to supply "minor" words. Avoid fill-in-the-blank

questions with so many blanks that the student is unable to determine what is to be

completed.

Sometime/Always/Never Questions:

The collection of Sometime/Always/Never (referred to as SAN) statements are

statements which are: true sometimes, always true, and never true. The statements used

in these questions must be stated carefully and should contain enough information to

permit the student to decide whether the statement is true sometimes, always, or never.

SAN questions (especially the sometimes statements) are the most difficult to construct

but can be the most significant part of a test. SAN questions should be constructed to

force the student to engage in some critical thinking about the learning objective. When

used properly, SAN questions force the student to consider important details about the

learning objective. Careful use of this type of question and careful analysis of student's

response will provide detailed information about some of the student's deficiencies.

SAN questions are especially appropriate, and easy to construct, for learning objectives

addressing concepts which are "black" or "white" except in a few cases. The true

statements in a collection of true/false questions are of course always true statements

while the set of false statements may be further subdivided into those which are true

sometimes and those which are never true.

Test Construction

Closed-Answer or “Objective” Tests

Although by definition no test can be truly “objective” (existing as an object of fact,

independent of the mind), this handbook refers to tests made up of multiple choice,

matching, fill-in, true/false, or fill-in-the-blank items as objective tests. Objective tests


have the advantages of allowing an instructor to assess a large and potentially

representative sample of course material and allow for reliable and efficient scoring.

The disadvantages of objective tests include a tendency to emphasize only “recognition”

skills, the ease with which correct answers can be guessed on many item types, and the

inability to measure students’ organization and synthesis of material (Adapted with

permission from Yonge, 1977).

Since the practical arguments for giving objective exams are compelling, we offer a few

suggestions for writing multiple-choice items. The first is to find and adapt existing test

items. Teachers’ manuals containing collections of items accompany many textbooks.

(AIs: Your course supervisor or former teachers of the same course may be willing to

share items with you.) However, the general rule is adapt rather than adopt. Existing

items will rarely fit your specific needs; you should tailor them to more adequately

reflect your objectives.

Second, design multiple choice items so that students who know the subject or material

adequately are more likely to choose the correct alternative and students with less

adequate knowledge are more likely to choose a wrong alternative. That sounds simple

enough, but you want to avoid writing items that lead students to choose the right

answer for the wrong reasons. For instance, avoid making the correct alternative the

longest or most qualified one, or the only one that is grammatically appropriate to the

stem. Even a careless shift in tense or verb-subject agreement can often suggest the

correct answer.

Finally, it is very easy to disregard the above advice and slip into writing items which

require only rote recall but are nonetheless difficult because they are taken from obscure

passages (footnotes, for instance). Some items requiring only recall might be

appropriate, but try to design most of the items to tap the students’ understanding of the

subject (Adapted with permission from Farris, 1985). One way to write multiple choice

questions that require more than recall is to develop questions that resemble miniature

“cases” or situations. Provide a small collection of data, such as a description of a

situation, a series of graphs, quotes, a paragraph, or any cluster of the kinds of raw

information that might be appropriate material for the activities of your discipline. Then


develop a series of questions based on that material. These questions might require

students to apply learned concepts to the case, to combine data, to make a prediction on

the outcome of a process, to analyze a relationship between pieces of the information, or

to synthesize pieces of information into a new concept.

Here are a few additional guidelines to keep in mind when writing multiple-choice tests

(Adapted with permission from Yonge, 1977):

The item-stem (the lead-in to the choices) should clearly formulate a problem.

As much of the question as possible should be included in the stem.

Randomize occurrence of the correct response (e.g., you don’t always want “C” to be

the right answer).

Make sure there is only one clearly correct answer (unless you are instructing

students to select more than one).

Make the wording in the response choices consistent with the item stem.

Don’t load down the stem with irrelevant material.

Beware of using answers such as “none of these” or “all of the above.”

Use negatives sparingly in the question or stem; do not use double negatives.

Beware of using sets of opposite answers unless more than one pair is presented (e.g.,

go to work, not go to work).

Beware of providing irrelevant grammatical cues.
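The guideline about randomizing where the correct response appears can be automated when assembling items. The Python sketch below is illustrative only; the helper, the sample stem, and the choices are the editor's inventions, not material from this handbook:

```python
import random

def shuffle_choices(stem, correct, distractors):
    """Shuffle one item's choices and report which letter is now the key.

    Assumes the correct text does not also appear among the distractors.
    """
    options = [correct] + list(distractors)
    random.shuffle(options)
    labelled = list(zip("ABCDEFG", options))  # label choices A, B, C, ...
    key = next(letter for letter, text in labelled if text == correct)
    return stem, labelled, key

# Hypothetical item: "reliability" is the intended correct choice.
stem, options, key = shuffle_choices(
    "Which term refers to the consistency of test scores?",
    "reliability",
    ["validity", "practicality", "washback"],
)
```

Shuffling at assembly time removes any recognizable pattern (such as "C" always being correct) without the item writer having to track positions by hand.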

Grading of multiple choice exams can be done by hand or through the use of computer

scannable answer sheets available from your departmental office. Take completed

answer sheets to IUB Evaluation Services and Testing (BEST) located in Franklin Hall

M014. If you have your test scored by BEST, they will provide statistics on difficulty

and reliability, which will help you to improve your tests.
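Difficulty statistics of the kind a scoring service reports can be approximated from a simple right/wrong matrix. The following minimal sketch (the data, function names, and the simple upper-lower index are the editor's assumptions, not BEST's actual procedure) computes item difficulty as the proportion answering correctly, and a discrimination index comparing the strongest and weakest halves of the class:

```python
# Classical item analysis on a right/wrong (1/0) response matrix.
# Each row is one student; each column is one test item.

def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(responses, item):
    """Upper-lower index: p(correct) in the top half minus the bottom
    half of students, ranked by total score. A value near 0 (or negative)
    suggests the item does not separate strong from weak students."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    p = lambda group: sum(row[item] for row in group) / len(group)
    return p(ranked[:half]) - p(ranked[-half:])

responses = [
    [1, 1, 1],  # one student's record across three items
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
```

Items that almost everyone answers correctly (difficulty near 1.0) or that stronger students miss as often as weaker ones (discrimination near 0) are candidates for revision.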


If you choose the computer-grading route, you must be sure students have number 2

pencils to mark answers on their sheets. These are often available from your

department’s main office. At the time of the exam it is helpful to write on the

chalkboard all pertinent information required on the answer sheet (course name, course

number, section number, instructor’s name, etc.). Also, remind students to fill in their

university identification numbers carefully so that you can have a roster showing the ID

number and grade for each student. If you would like to consult with someone about

developing test items, call the Center for Innovative Teaching and Learning at 855-9023.

If you would like to consult with someone about how to interpret your test results, call

BEST at 855-1595.

Essay Tests

Conventional wisdom accurately portrays short-answer and essay examinations as the

easiest to write and the most difficult to grade, particularly if they are graded well. You

should give students an exam question for each crucial concept that they must

understand.

If you want students to study in both depth and breadth, don't give them a choice among

topics. This allows them to choose not to answer questions about those things they

didn’t study. Instructors generally expect a great deal from students, but remember that

their mastery of a subject depends as much on prior preparation and experience as it

does on diligence and intelligence; even at the end of the semester some students will be

struggling to understand the material. Design your questions so that all students can

answer at their own levels.

The following are some suggestions that may enhance the quality of the essay tests that

you produce (Adapted with permission from Ronkowski, 1986):

1. Have in mind the processes that you want measured (e.g., analysis, synthesis).

2. Start questions with words such as “compare,” “contrast,” “explain why.” Don’t

use “what,” “when,” or “list.” (These latter types are better measured with

objective-type items). Writing Tutorial Services, Ballantine Hall 207, 855-6738,


has a handout for students which defines these terms and explains how to study

for and respond to essay questions.

3. Write items that define the parameters of expected answers as clearly as

possible.

4. Make sure that the essay question is specific enough to invite the level of detail

you expect in the answer. A question such as “Discuss the causes of the

American Civil War,” might get a wide range of answers, and therefore be

impossible to grade reliably. A more controlled question would be, “Explain

how the differing economic systems of the North and South contributed to the

conflicts that led to the Civil War.”

5. Don’t have too many questions for the time available.


TYPES OF LISTENING TESTING

1. DISCRIMINATIVE LISTENING

Discriminative Listening is an awareness of changes in pitch and loudness of

sounds and it is determining if sounds are different or the same. These activities are

designed to enhance this listening skill:

1) Same or different? - Call out two words and have the children determine if they are

the same or different. For example, say bat/ bat, bat/bet.

2) Rhyming words- Practice rhyming discriminative listening skills by calling out a

few rhyming words, such as “hat, bat, rat, cat,” and so on. Have the children take

turns calling out a word that rhymes with “at” as well as other rhyming words you

want to use.

3) What’s the problem? - After reading a storybook to children (one that’s very

familiar to them) have them tell you what the problem is. As you read the story

change things around so the story is different somehow, to see if they catch the

changes and can tell you what the problem is.

4) Musical moods- Play music, but change it up some by changing the pace, make it

fast, slow, loud, soft, high and low. Have the children tell you when a sound change

is made and what the change is.

5) Clap it out- After talking about syllables of words, clap out the syllables of some

words you call out, starting with a two syllable word, then three, and so on. Repeat a

word at least twice (or more if needed) so the concept is fully grasped.

Lastly, we have discriminative listening which has to do with the identification of

different variations in sounds and words in order to understand the different messages.

This is the most important listening and it spans all the other forms of listening. It

involves being sensitive to pitch, volume, emphasis and rate of speech in order to detect

the messages that may be hidden. This form of listening usually requires one to be


proficient in two factors: a good hearing ability and knowledge of sound

structure (Kline, 2010).

Hearing ability

The ability to hear helps in sound differentiation, and therefore if one can hear well, then

there is a high likelihood that they can get the message well (Lengel, 1998).

Knowledge of sound structure

The knowledge of sound structure enables an individual to differentiate different sounds

and be able to tell what is being said. For example the difference between “I would rank

it first” and “I drank it first” requires such kind of ability in order to get the message

clearly. In conclusion, there are various forms of listening and these include listening for

the sake of making critical evaluations, building relationships, making discriminations

and obtaining information or gaining appreciation and each of the needs in listening

calls for a different form of listening. These forms of listening depend on basic factors

such as concentration, attention, memory, perception, experience, presentation style and

the determination of ethos, pathos and logos under the various forms of listening. The

lack of these may mean that no communication takes place at all.

An example of discriminative listening:

Exercise

Identify whether each pair of sounds is the same or different:

1) “I would rank it first” and “I drank it first”

2) bat/ bat, bat/bet.

3) Safe/save

4) Made/mate

5) Age/h


COMPREHENSION LISTENING

The next step beyond discriminating between different sounds and sights is to make

sense of them. To comprehend the meaning requires first having a lexicon of words at

our fingertips and also all rules of grammar and syntax by which we can understand

what others are saying.

The same is true, of course, for the visual components of communication, and an

understanding of body language helps us understand what the other person is really

meaning.

In communication, some words are more important and some less so, and

comprehension often benefits from extraction of key facts and items from a long

spiel. Comprehension listening is also known as content listening, informative listening

and full listening.

Listening Comprehension Sample Questions Transcript

Sample Item A

On the recording, you will hear:

(Narrator): Listen to a high school principal talking to the school's students.

(Man): I have a very special announcement to make. This year, not just

one, but three of our students will be receiving national awards for

their academic achievements. Krista Conner, Martin Chan, and Shriya

Patel have all been chosen for their hard work and consistently high

marks. It is very unusual for one school to have so many students

receive this award in a single year.

(Narrator): What is the subject of the announcement?

In your test book, you will read:

1. What is the subject of the announcement?

A. The school will be adding new classes.

B. Three new teachers will be working at the school.

C. Some students have received an award.

D. The school is getting its own newspaper.


Sample Item B

On the recording, you will hear:

(Narrator): Listen to a teacher making an announcement at the end of the day.

(Man): Remember that a team of painters is coming in tomorrow to paint the

walls. In this box on my desk are sheets of plastic that I want you to

slip over your desks. Make sure you cover your desks completely so

that no paint gets on them. Everything will be finished and the plastic

will be removed by the time we return on Monday.

(Narrator): What does the teacher want the students to do?

 

In your test book, you will read:

2. What does the teacher want the students to do?

A. Take everything out of their desks

B. Put the painting supplies in plastic bags

C. Bring paints with them to school on Monday

D. Put covers on their desks to keep the paint off

Sample Set A

On the recording, you will hear:

(Narrator): Listen to a conversation between two friends at school.

(Boy): Hi, Lisa.

(Girl): Hi, Jeff. Hey, have you been to the art room today?

(Boy): No, why?

(Girl): Well, Mr. Jennings hung up a notice about a big project that's going

on downtown. You know how the city's been doing a lot of work to

fix up Main Street—you know, to make it look nicer? Well, they're

going to create a mural.

(Boy): You mean, like, make a painting on the entire wall of a building?

(Girl): It's that big wall on the side of the public library. And students from


this school are going to do the whole thing ... create a design, and

paint it, and everything. I wish I could be a part of it, but I'm too

busy.

(Boy): [excitedly] Cool! I'd love to help design a mural. Imagine everyone in

town walking past that wall and seeing my artwork, every day.

(Girl): I thought you'd be interested. They want the mural to be about nature,

so I guess all the design ideas students come up with should have a

nature theme.

(Boy): That makes sense—they've been planting so many trees and plants

along the streets and in the park.

(Girl): If you're interested you should talk with Mr. Jennings.

(Boy): [half listening, daydreaming] This could be so much fun. Maybe I'll

try to visit the zoo this weekend ... you know, to see the wild animals

and get some ideas, something to inspire me!

(Girl): [with humor] Well maybe you should go to the art room first to get

more information from Mr. Jennings.

(Boy): [slightly sheepishly] Oh yeah. Good idea. Thanks for letting me

know, Lisa! I'll go there right away.

(Narrator): Now answer the questions.

In your test book, you will read:

3. What are the speakers mainly discussing?

A. A new art project in the city

B. An assignment for their art class

C. An art display inside the public library

D. A painting that the girl saw downtown

4. Why is the boy excited?

A. A famous artist is going to visit his class.


B. His artwork might be seen by many people.

C. His class might visit an art museum.

D. He is getting a good grade in his art class.

5. Where does the boy say he may go this weekend?

A. To the zoo

B. To an art store

C. To Main Street

D. To the public library

6. Why does the girl suggest that the boy go to the art room?

A. So that he can hand in his homework

B. So that he can sign up for a class trip

C. So that he can see a new painting

D. So that he can talk to the teacher

Sample Set B

On the recording, you will hear:

Script Text:

(Narrator): Listen to a teacher talking in a biology class.

(Woman): We've talked before about how ants live and work together in huge

communities. Well, one particular kind of ant community also grows its

own food. So you could say these ants are like people, like farmers. And

what do these ants grow? They grow fungi [FUN-guy]. Fungi are kind

of like plants—mushrooms are a kind of fungi. These ants have gardens,

you could say, in their underground nests. This is where the fungi are

grown.

Now, this particular kind of ant is called a leafcutter ant. Because of

their name, people often think that leafcutter ants eat leaves. If they cut

up leaves they must eat them, right? Well, they don't! They actually use


the leaves as a kind of fertilizer. Leafcutter ants go out of their nests

looking for leaves from plants or trees. They cut the leaves off and carry

them underground . . . and then feed the leaves to the fungi—the fungi

are able to absorb nutrients from the leaves. What the ants eat are the

fungi that they grow. In that way, they are like farmers!

The amazing thing about these ants is that the leaves they get are often

larger and heavier than the ants themselves. If a leaf is too large,

leafcutter ants will often cut it up into smaller pieces—but not all the

time. Some ants carry whole leaves back into the nest. In fact, some

experiments have been done to measure the heaviest leaf a leafcutter ant

can lift without cutting it. It turns out, it depends on the individual ant.

Some are stronger than others. The experiments showed that some

"super ants" can lift leaves about 100 times the weight of their body!

(Narrator): Now answer the questions.

In your test book, you will read:

7. What is the main topic of the talk?

A. A newly discovered type of ant

B. A type of ant with unusual skills

C. An increase in the population of one type of ant

D. A type of ant that could be dangerous to humans

8. According to the teacher, what is one activity that both leafcutter ants and

people do?

A. Clean their food

B. Grow their own food

C. Eat several times a day

D. Feed their young special food

9. What does the teacher say many people think must be true about leafcutter

ants?

A. They eat leaves.

B. They live in plants.


C. They have sharp teeth.

D. They are especially large.

10. What did the experiments show about leafcutter ants?

A. How fast they grow

B. Which plants they eat

C. Where they look for leaves

D. How much weight they can carry

Answer Key for Listening Comprehension

1. C

2. D

3. A

4. B

5. A

6. D

7. B

8. B

9. A

10. D
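As a minimal illustration (not part of the original text), the answer key above can be applied to a student's marked answers with a few lines of Python; the function name and the sample response sheet are hypothetical:

```python
# The ten-item answer key above, applied to a student's marked answers.
answer_key = ["C", "D", "A", "B", "A", "D", "B", "B", "A", "D"]

def score(student_answers, key=answer_key):
    """Count the items where the student's answer matches the key."""
    return sum(given == correct for given, correct in zip(student_answers, key))

perfect = score(list(answer_key))
one_wrong = score(["C", "D", "A", "B", "A", "D", "B", "B", "A", "C"])
```

This is exactly what machine scoring of a scannable answer sheet does: each response is compared to the key and the matches are counted.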


1. CRITICAL LISTENING

Critical listening is listening in order to evaluate and judge, forming opinion about what

is being said. Judgment includes assessing strengths and weaknesses, agreement and

approval.

This form of listening requires significant real-time cognitive effort as the listener

analyzes what is being said, relating it to existing knowledge and rules, whilst

simultaneously listening to the ongoing words from the speaker.

2. BIASED LISTENING

Biased listening happens when the person hears only what they want to hear, typically

misinterpreting what the other person says based on the stereotypes and other biases that

they have. Such biased listening is often very evaluative in nature.

3. EVALUATIVE LISTENING

In evaluative listening, or critical listening, we make judgments about what the other

person is saying. We seek to assess the truth of what is being said. We also judge what

they say against our values, assessing them as good or bad, worthy or unworthy.

Evaluative listening is particularly pertinent when the other person is trying to persuade

us, perhaps to change our behavior and maybe even to change our beliefs. Within this,

we also discriminate between subtleties of language and comprehend the inner meaning

of what is said. Typically also we weigh up the pros and cons of an argument,

determining whether it makes sense logically as well as whether it is helpful to

us.Evaluative listening is also called critical, judgmental or interpretive listening.

4. APPRECIATIVE LISTENING

In appreciative listening, we seek certain information which we will appreciate, for

example that which helps meet our needs and goals. We use appreciative listening when


we are listening to good music, poetry or maybe even the stirring words of a great

leader. Students use appreciative listening when they are listening to poetry and they

seek certain information which they will appreciate.

Adventure Quotient (AQ) Test 77 questions, 30 min

How adventurous are you? Thrill-seeking can come in different forms, whether it's

doing a swan dive bungee jump off the Auckland Harbour Bridge in New Zealand, or

trying that new exotic restaurant around the corner from work. The type of adventure

you enjoy (or avoid) depends a great deal on your personality. Are you more of a

planner or spontaneous? Courageous or careful? Do you have the energy level of a bee

or a sloth? Find out more about your adventure personality with this test!

Examine the following statements and choose the answer option that best applies to you.

There may be some questions describing situations that may not be relevant to you. In

such cases, select the answer you would most likely choose if you ever found yourself

in that type of situation. In order to receive the most accurate results, please answer as

truthfully as possible. After finishing the test, you will receive a Snapshot Report with an

introduction, a graph and a personalized interpretation for one of your test scores. You

will then have the option to purchase the full results.

Adventure Quotient (AQ) Test 50 questions, 30 min

1. I _____ repetitive tasks.

enjoy

don't mind

can't stand

2. I take pride in my appearance and upkeep.

Agree

Somewhat agree

Disagree

3. I have already been or would consider any of the following: skydiving, bungee

jumping, hang gliding, or free climbing.

Definitely

Maybe


No way

4. I see getting away from it all as a chance to:

Connect with people and places

Connect with myself

5. I would travel to a developing country and leave the airport/train station:

With pleasure.

Only with a friend.

Only with a hired guide.

6. I seek new experiences more...

To learn about new places, people, and things.

For the way they make me feel

7. I am more likely to ask myself:

"When is break time?"

"What's next?"

8. I am more likely to get my thrills from:

Doing something physically or emotionally gutsy

Watching someone else do something physically or emotionally gutsy

9. The lowest comfort I would consider for sleeping is:

Outside on the ground.

A tent.

An RV or camper.

A motel.

A bed and breakfast.

A furnished apartment or house.

A 3 or 4 star hotel.

10. Adrenaline is a chemical that:

I avoid

I enjoy from time to time

I seek

I am addicted to

11. Having a daily routine is:


Oppressive and stifling

Annoying and limiting

Sometimes a good thing, sometimes not

Helpful and comforting

Totally necessary

12. Not knowing what the future might hold is:

Terrifying

A little disconcerting, but that's just the way life is

Exhilarating

13. At a theme park, I'll try:

The highest, scariest ride

Something fast, but no upside-down stuff

The kiddie train or merry-go-round

The park bench

14. Having nice things and looking good is important to me

Extremely

Somewhat

Not very

15. A life without luxury is:

Not worth living

Difficult to imagine.

Perfectly acceptable

Expected.

16. Knowing what others think of me is:

Essential

Important

Helpful

Not important

17. When visiting new places, I am more interested in:

Soaking up the environment

Interacting with people

18. Others are more likely to wonder...


Where my energy goes.

Where my energy comes from

19. Life's experiences are most rich and interesting when I contemplate them...

With others

In my own mind

An old friend is in town. Where are you most likely to eat?

A. We'd eat at:

A fast food joint

An ethnic café

B. We'd eat at:

A themed restaurant or dinner theater

At the kitchen table in my house, warming up something in the microwave

C. We'd eat at a:

Chain restaurant

Upscale restaurant

You inherit $100,000 from a distant uncle. What are you more likely to do with it?

D. I'd take my wallet out and:

Go on an epic shopping spree

Donate some, or all, to charity

E. I'd take my wallet out and:

Go on a casino fling

Put it in the bank

F. I'd take my wallet out and:

Go on a dream vacation

Throw a gigantic party

It's time to learn something new. Which class would you be most interested in taking

up?

G. I would rather take:

Acting classes

Creative writing classes


H. I would rather take:

Survival skills classes

Speed reading classes

I. I would rather take:

Kickboxing classes

Tai Chi classes

Which of the following would you rather visit or spend some time in?

J. I would rather go to:

An Inuit igloo

A Buddhist monastery

K. I would rather go to:

An African hut

A European hostel

L. I would rather go to:

A Japanese pagoda

A California spa

Pick your preferred pet

M. I'd rather have a:

Parrot

Hamster

N. I'd rather have a:

Goldfish

Snake

O. I'd rather have a:

Tarantula

Horse

Which is your preferred adrenaline rush?

P. There's nothing like the thrill of:

A looming deadline

A charging rhino

Q. There's nothing like the thrill of:


Running cross-country

Running with the bulls

R. There's nothing like the thrill of:

Swimming with dolphins

Swimming with sharks

Which is your preferred adrenaline rush?

S. There's nothing like the thrill of:

Finding something I really like on sale.

Finding an ancient Egyptian artifact in Valley of the Kings

T. There's nothing like the thrill of:

Cycling or hiking

Taking a scenic drive

U. There's nothing like the thrill of:

Getting a tattoo or piercing

Skydiving or hang gliding

Pick the adjective that best describes you.

V. I am more:

Bold

Timid

W. I am more:

Impulsive

Deliberate

X. I am more:

Of an improviser

Of a planner

Y. What's your favorite way to get from point A to point B?

First class or Business class

The scenic railroad route

Automobile - the classic "road trip"

Budget airline - who needs legroom?

Tour bus - sit back and relax


An all-terrain vehicle. No road? No problem

My bike - and the wind in my hair

Z. What's your comfort zone when it comes to heights?

Top shelf of the bookcase

The 3-meter diving board

A bungee jump

A skydive

A spacewalk

AA. What is the one form of footwear you could never live without?

Skis

Cycling shoes

Stiletto heels

Cross-trainers

Walking shoes

Flip-flops

Hiking boots

Dress shoes

BB. Which voice mail message are you most likely to leave on a friend's phone?

"How about a movie and some take out?"

"Got an extra ticket to a show, let's go!"

"Party of the century! Pick you up at 9."

"Meet me at the airport with a suitcase and your passport."

CC. Which phrase do you agree with more?

"Better safe than sorry."

"Nothing ventured, nothing gained."

DD. How much of Mother Nature's wrath will you endure for adventure?

Monsoon, tornado, ice storm - bring it on!

Thundershowers, extreme hot and cold

Some wind, clouds, and drizzle

If it's not blue skies, forget it

EE. How often do you pick up new fashions?

Daily to weekly.


Monthly to yearly.

Every decade or so

5. SYMPATHETIC LISTENING

In sympathetic listening we care about the other person and show this concern in

the way we pay close attention and express our sorrow for their ills and happiness at

their joys.

EMPATHETIC LISTENING

When we listen empathetically, we go beyond sympathy to seek a truer understanding of how others are feeling. This requires excellent discrimination and close attention to the nuances of emotional signals. When we are being truly empathetic, we actually feel what they are feeling. In order to get others to expose these deep parts of themselves to us, we also need to demonstrate our empathy in our demeanor towards them, asking sensitively and in a way that encourages self-disclosure.

6. THERAPEUTIC LISTENING

In therapeutic listening, the listener has a purpose of not only empathizing with the

speaker but also to use this deep connection in order to help the speaker understand,

change or develop in some way. This not only happens when you go to see a therapist

but also in many social situations, where friends and family seek to both diagnose

problems from listening and also to help the speaker cure themselves, perhaps by some

cathartic process. This also happens in work situations, where managers, HR people,

trainers and coaches seek to help employees learn and develop.

7. DIALOGIC LISTENING


The word 'dialogue' stems from the Greek words 'dia', meaning 'through' and 'logos'

meaning 'words'. Thus dialogic listening means learning through conversation: an engaged interchange of ideas and information in which we actively seek to learn more about the person and how they think. Dialogic listening is sometimes known as 'relational listening'.

An example of dialogic listening:

A : I was working as a training director for a national homelessness foundation. I was

traveling around the country doing a lot of teaching and consulting. I was mostly

the only white male wherever I went. So I was doing big urban shelters and city

governments in Detroit and places like that. I was always coming up against race,

class, and gender issues between myself and the participants.

Q : Because they weren't white males?

A : Right, they were mostly females of color, and I could always deal with it, but it

was by the seat of my pants. So I came to PCP for consultation initially and then I

was accepted into their first workshop back in 1994. I found it to be such a

revolutionary approach to difference, one that I had never experienced before in

all my training in diversity and all that other stuff. I found out after my first class

that I had to do my training in Louisville, Kentucky for the homelessness network

there. The issue there was that the staff of the homeless shelters were mostly

women of color, and the volunteers were mostly affluent white women from the

suburbs and they differed in many ways and had different ideas about each other

as well. So I started doing this training. One of the goals of this group was that

they wanted the people to work more effectively together. 

About half way through the first day, an African American woman stood up and

she was very angry. She said, "You don't know shit about my life, you're a white

man with privilege." I had some choices to make there. But because I had been to

this one PCP class, I decided that I was going to deal with this differently than I

would have dealt with this prior. I said, "You're absolutely right. I am white. I'm a

guy. I have certain level of power. I wear a tie. I live in suburbs. I drive a nice car.

And I imagine that your story has a lot to do with why you're here. I imagine that


a lot of other people's stories have a lot to do with why they're here. I'm

wondering if we can make a choice together as a group to hear your story, and

what it is that you want people to understand about you. Would you be willing to

hear the stories of others?" She said, "Yeah." So I had everyone go around and tell

the group how their personal story connected to why they were there. Everybody

went around the room. Women told these incredible stories.

I remember there was one white woman who told how she had been homeless for

the last two years. That she had been beaten by her husband, but because they

were wealthy and lived in the suburbs, he was basically able to buy off the police,

and she was basically in prison because of her wealth. Finally, when he started

beating the children, she took them. She was cut off completely from his wealth

and lived on the streets for two years. She had just gotten out of shelter. This

tremendous bonding happened among these women. We were all brought to tears

by it. That affected me deeply. I came home and a couple days later my kids were

fighting. I was always the type, and I still give into this temptation, of getting

involved in the middle and trying to referee, thinking I know what's going on. In

this instance, I tried taking what is called a not-knowing attitude. I suggested that

each kid take five minutes to explain what's going on. I was using the "what's-at-

the-heart-of-the-matter-for-you approach," but in a way that was easier for them to

understand because they were younger. So each kid had five minutes. 

Once they spoke, I realized that I certainly didn't have a clue about what their

concerns were. I had a completely different idea about what they were concerned

about, and they had completely different ideas about each other. They were then

able to say, "Oh, so that's all you want," and then move along. Now it doesn't

always happen like that, but it made a really deep impression on me. The biggest

thing for me is being a father, it's the most important thing in my life, and the fact

that I can do it well is my biggest accomplishment. To think that I was doing it so

well, yet I was doing it so ineffectively that I could not know my own kids. I

could be with them ten hours a day but still not know them because I wasn't

listening to them deeply. It blew my mind. I just thought that this is the best thing

since sliced bread. So those two things really catapulted me into the whole PCP

mindset. 


Q : So you were really struck by the real power of letting the parties speak for

themselves, without being the convoy, without being the person who summarizes

and says, "This is what's going on." 

A : Right. Exactly. Yeah, because I could have said, "This is what I hear." I try to

relate it to my own experience in some way, but basically I don't know. Being

asked, "Can we use your wisdom and tap the rest of the wisdom in the room and

make it work for us here?" Then leaving it in their hands afterward was big. It just

was not my style to do that before.

Listen carefully to the dialog between Nick and Jimmy, then complete the conversation.

Nick : I heard (1) .......... as a computer programmer.

Jimmy : Yes, and I had already (2) ..............

Nick : Really? I'm happy (3) ...

Jimmy : Thank you.

Nick : Your parents must be (4) ........

Jimmy : They want me to run their business. They're (5) ......

Nick : That's a pity! Did you explain your reasons?

Jimmy : I did and I hope they'll accept my decision.

Dialog II

Margaret : Look at you! You look so great now. What have you been doing?

Joe : Really? (1) ................. I've been in Canada for two weeks. By the way, how about your job?

Margaret : (2) ............ It's in a big new hospital. My working conditions are much better than in the last place.

Tony : Attention, please. Today, we have a surprise. We've been offered a trip from our boss.

Joe : Really? (3) ........................?

Tony : Bandung.

Joe : (4) .................. But where is it located?

Tony : Aren't you pleased?

Joe : Yes, of course. (5) ........................ But tell me where it is.

Margaret : It's in Indonesia.

Joe : Oh, I see. That's not so good.

Tony : Don't worry, Joe. My friend, Lisa, who lives there, wrote to me about the conditions in Indonesia. Indonesia is safe now, especially in that town. There is no riot. It's just a rumour.

Key Answer

1) I think it’s usual

2) That’s great

3) Where to

4) Marvellous

5) I’m delighted to hear that

8. RELATIONSHIP LISTENING

Sometimes the most important factor in listening is in order to develop or sustain a

relationship. This is why lovers talk for hours and attend closely to what each other has

to say, when the same words from someone else would seem to be rather boring. Relationship listening is also important in areas such as negotiation and sales,

where it is helpful if the other person likes you and trusts you.


13

Testing Grammar

English is a very important language in the world. It plays a very big role in communication and education. Much of what technology serves up is related to English. By and by, English will be the global language in every part of the world. Since English is an international language, people all over the world try to learn as much as possible about it. To develop our skills in English we constantly meet its grammar, and we practice by testing grammar, so that we know how far we understand the language.

A. Definition of grammar

Grammar is the structural foundation of our ability to express ourselves. The more we

are aware of how it works, the more we can monitor the meaning and effectiveness of

the way we and others use language. It can help foster precision, detect ambiguity, and


exploit the richness of expression available in English. And it can help everyone--not

only teachers of English, but teachers of anything, for all teaching is ultimately a

matter of getting to grips with meaning.

1. Descriptive grammar refers to the structure of a language as it is actually

used by speakers and writers.

2. Prescriptive grammar refers to the structure of a language as certain

people think it should be used.

Both kinds of grammar are concerned with rules--but in different ways. Specialists in

descriptive grammar (called linguists) study the rules or patterns that underlie our use of

words, phrases, clauses, and sentences. On the other hand, prescriptive grammarians

(such as most editors and teachers) lay out rules about what they believe to be the

“correct” or “incorrect” use of language.

B. Types of test

Before writing a test it is vital to think about what it is you want to test and what its

purpose is. We must make a distinction here between proficiency tests, achievement

tests, diagnostic tests and prognostic tests.

1. A proficiency test is one that measures a candidate's overall ability in a language; it isn't related to a specific course.

2. An achievement test, on the other hand, tests the students' knowledge of the material that has been taught on a course.

3. A diagnostic test highlights the strong and weak points that a learner may have in a particular area.

4. A prognostic test attempts to predict how a student will perform on a course.

There are of course many other types of tests. It is important to choose elicitation techniques carefully when you prepare one of the aforementioned tests. There are many elicitation techniques that can be used when writing a test. Below are some widely used types, with some guidance on their strengths and weaknesses. Using the right kind of question at the right time can be enormously important in giving us a clear understanding of our students' abilities, but we must also be aware of the limitations of each of these task or question types so that we use each one appropriately.

1. Multiple choice

Choose the correct word to complete the sentence.

Cook is ________________today for being one of Britain's most famous explorers.

a) Recommended b) reminded c) recognized d) remembered

In this question type there is a stem and various options to choose from. The advantages

of this question type are that it is easy to mark and minimizes guess work by having

multiple distracters. The disadvantage is that it

can be very time-consuming to create, effective multiple choice items are surprisingly

difficult to write. Also it takes time for the candidate to process the information which

leads to problems with the validity of the exam. If a low level candidate has to read

through lots of complicated information before they can answer the question, you may

find you are testing their reading skills more than their lexical knowledge.

Multiple choice can be used to test most things such as grammar, vocabulary, reading,

listening etc. but you must remember that it is still possible for students to just 'guess'

without knowing the correct answer. 

2. Transformation

Complete the second sentence so that it has the same meaning as the first.

'Do you know what the time is, John?' asked Dave.

Dave asked John __________ (what) _______________ it was.

This time a candidate has to rewrite a sentence based on an instruction or a key word

given. This type of task is fairly easy to mark, but the problem is that it doesn't test

understanding. A candidate may simply be able to rewrite sentences to a formula. The

fact that a candidate has to paraphrase the whole meaning of the sentence in the

example above however minimizes this drawback.


Transformations are particularly effective for testing grammar and understanding of

form. This wouldn't be an appropriate question type if you wanted to test skills such as

reading or listening.

3. Gap-filling

Complete the sentence.

Check the exchange ______________ to see how much your money is worth.

The candidate fills the gap to complete the sentence. A hint may sometimes be included

such as a root verb that needs to be changed, or the first letter of the word etc. This

usually tests grammar or vocabulary. Again this type of task is easy to mark and

relatively easy to write. The teacher must bear in mind though that in some cases there

may be many possible correct answers.

Gap-fills can be used to test a variety of areas such as vocabulary, grammar and

are very effective at testing listening for specific words

4. True / False

Decide if the statement is true or false.

England won the world cup in 1966. T/F

Here the candidate must decide if a statement is true or false. Again this type is easy to

mark but guessing can result in many correct answers. The best way to counteract this

effect is to have a lot of items.

This question type is mostly used to test listening and reading comprehension

5. Open questions

Answer the questions.


Why did John steal the money?

Here the candidate must answer simple questions after a reading or listening or as part

of an oral interview. It can be used to test anything. If the answer is open-ended it will

be more difficult and time consuming to mark and there may also be a an element of

subjectivity involved in judging how 'complete' the answer is, but it may also be a more

accurate test.

These question types are very useful for testing any of the four skills, but less

useful for testing grammar or vocabulary. 

6. Error Correction

Find the mistakes in the sentence and correct them.

Ipswich Town was the more better team on the night.

Errors must be found and corrected in a sentence or passage. It could be an extra word,

mistakes with verb forms, words missed etc. One problem with this question type is that

some errors can be corrected in more than one way.

Error correction is useful for testing grammar and vocabulary as well as reading

and listening.

 

7. Other Techniques

There are of course many other elicitation techniques such as translation, essays,

dictations, ordering words/phrases into a sequence and sentence construction

(He/go/school/yesterday).

It is important to ask yourself what exactly you are trying to test, which techniques suit

this purpose best and to bear in mind the drawbacks of each technique. Awareness of

this will help you to minimize the problems and produce a more effective test.


C.The Value of Studying Grammar

The study of grammar all by itself will not necessarily make you a better writer. But by

gaining a clearer understanding of how our language works, you should also gain

greater control over the way you shape words into sentences and sentences into

paragraphs. In short, studying grammar may help you become a more effective

writer. Descriptive grammarians generally advise us not to be overly concerned with

matters of correctness: language, they say, isn't good or bad; it simply is. As the history

of the glamorous word grammar demonstrates, the English language is a living system

of communication, a continually evolving affair. Within a generation or two, words and

phrases come into fashion and fall out again. Over centuries, word endings and entire

sentence structures can change or disappear.

Prescriptive grammarians prefer giving practical advice about using language:

straightforward rules to help us avoid making errors. The rules may be over-simplified

at times, but they are meant to keep us out of trouble--the kind of trouble that may

distract or even confuse our readers.


14

INTERPRETING TEST SCORE

Introduction

What does interpret mean? To interpret is to decide what the intended meaning of

something is (Cambridge Advanced Learner’s Dictionary). To interpret is to conceive

the significance of; construe (thefreedictionary.com). Thus, to interpret is to understand

the meaning and the significance of something. Interpreting test scores is to understand the meaning and the significance of test scores, which can then be used to plan the next action: to fix or to retain. There are many ways to do it, but the three most common are frequency distribution, measures of central tendency, and measures of dispersion. Frequency distribution concerns the distribution of scores and the frequency of each category. Measures of central tendency, on the other hand, refer to measures of the "middle" value: the mode, the median, and the mean. Last but not least are the measures of dispersion, which relate to the range or spread of scores. All

three can help teachers interpret the meaning behind test scores.


II. Content

A. Frequency Distribution

Frequency distribution deals with the distribution of scores and the frequency of the

distribution. Each entry in the table contains the frequency or count of the occurrences

of a particular mark, and in this way, the table summarizes the

distribution of scores.

The example case here is: a teacher administers a test of 40 questions to 26 students.

Marks are awarded by counting the number of correct answers on the test scripts. These

are known as raw marks.

Here are the steps to create a table of frequency distribution:

1. Create Table 1 and put the raw mark of every student in it.

TABLE 1

Testee Mark

A 20

B 25

C 33

D 35

E 29

F 25

G 30

H 26

I 19

J 27

K 26

L 32

M 34

N 27

O 27

P 29

Q 25


R 23

S 30

T 26

U 22

V 23

W 33

X 26

Y 24

Z 26

2. Create Table 2. Sort the marks from the highest to the lowest score. This is called

descending sorting. It is easier and faster to use tool like Microsoft Excel to do the

sorting.

TABLE 2

Testee Mark

D 35

M 34

C 33

W 33

L 32

G 30

S 30

E 29

P 29

J 27

N 27

O 27

H 26

K 26

T 26


X 26

Z 26

B 25

F 25

Q 25

Y 24

R 23

V 23

U 22

A 20

I 19

Now, we determine the rank. We start from rank 1 up to rank 26, for there are 26 students.

The problem comes when there are two or more students with the same mark. Here we highlight the same mark to make it easier to distinguish. Then, we write an imaginary rank to the right of the Rank column, from 1 to 26. The imaginary ranks of the same mark are then added and divided by the number of students who got that mark. For example, students C and W have the same mark, 33. Their imaginary ranks are 3 and 4. To get the actual rank, we add 3 and 4 (3 + 4 = 7). The result, 7, is then divided by the number of students with the same score, which is 2 here. The final result is 3.5. Thus, the ranks of both of them are 3.5.

TABLE 2

Testee Mark Rank Imaginary rank
D 35 ? 1
M 34 ? 2
C 33 ? 3    (3+4) / 2 = 3.5
W 33 ? 4
L 32 ? 5
G 30 ? 6    (6+7) / 2 = 6.5
S 30 ? 7
E 29 ? 8    (8+9) / 2 = 8.5
P 29 ? 9
J 27 ? 10   (10+11+12) / 3 = 11
N 27 ? 11
O 27 ? 12
H 26 ? 13   (13+14+15+16+17) / 5 = 15
K 26 ? 14
T 26 ? 15
X 26 ? 16
Z 26 ? 17
B 25 ? 18   (18+19+20) / 3 = 19
F 25 ? 19
Q 25 ? 20
Y 24 ? 21
R 23 ? 22   (22+23) / 2 = 22.5
V 23 ? 23
U 22 ? 24
A 20 ? 25
I 19 ? 26

The result will be like this. Table 2 shows the students' scores in order of merit and their rank as well.

TABLE 2

Testee Mark Rank

D 35 1

M 34 2

C 33 3.5

W 33 3.5

L 32 5

G 30 6.5

S 30 6.5

E 29 8.5

P 29 8.5

J 27 11

N 27 11

O 27 11

H 26 15

K 26 15

T 26 15

X 26 15

Z 26 15

B 25 19

F 25 19

Q 25 19

Y 24 21

R 23 22.5

V 23 22.5

U 22 24

A 20 25

I 19 26
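As a sketch (not part of the original material), the tied-rank procedure above can be automated in Python. The dictionary reproduces the 26 raw marks from Table 1; ties receive the average of their imaginary ranks.

```python
# Average-of-imaginary-ranks procedure from the worked example above.
marks = {
    "A": 20, "B": 25, "C": 33, "D": 35, "E": 29, "F": 25, "G": 30,
    "H": 26, "I": 19, "J": 27, "K": 26, "L": 32, "M": 34, "N": 27,
    "O": 27, "P": 29, "Q": 25, "R": 23, "S": 30, "T": 26, "U": 22,
    "V": 23, "W": 33, "X": 26, "Y": 24, "Z": 26,
}

# Sort descending by mark, as in Table 2.
ordered = sorted(marks.items(), key=lambda kv: kv[1], reverse=True)

ranks = {}
i = 0
while i < len(ordered):
    # Find the group of testees who share the same mark.
    j = i
    while j < len(ordered) and ordered[j][1] == ordered[i][1]:
        j += 1
    # Imaginary ranks are i+1 .. j; the shared rank is their average.
    shared = sum(range(i + 1, j + 1)) / (j - i)
    for k in range(i, j):
        ranks[ordered[k][0]] = shared
    i = j

print(ranks["C"], ranks["W"])   # 3.5 3.5, as in the worked example
print(ranks["H"])               # 15.0 (five testees share mark 26)
```

Each tied group gets one shared rank, so testees C and W both come out at 3.5, exactly as computed by hand above.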

3. Create Table 3, which consists of Mark column, Tally column, and Frequency

column.


In the Mark column, we can expand the range from 40 down to 15, for the highest score is

35 and the lowest score is 19. We usually do this to give more space to enhance

readability.

A tally stroke marks each student who got a certain score. It is simply a method of

counting the frequency of scores.

Frequency column lists the number of students obtaining each score. It is easier to

count due to the tallies.

Table 3 is the table of frequency distribution.

TABLE 3

Mark Tally Frequency

40

39

38

37

36

35 / 1

34 / 1

33 // 2

32 / 1

31

30 // 2

29 // 2

28

27 /// 3

26 //// 5

25 /// 3

24 / 1

23 // 2

22 / 1


21

20 / 1

19 / 1

18

17

16

15

TOTAL 26
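The tally-and-frequency table above can be generated with Python's `collections.Counter` (a sketch, not part of the original material; the list reproduces the raw marks from Table 1):

```python
from collections import Counter

# Raw marks from Table 1 (26 students, testees A to Z).
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34, 27,
         27, 29, 25, 30, 26, 22, 23, 33, 26, 24, 26, 23]

freq = Counter(marks)

# Print the same 40-down-to-15 span as Table 3, with tally strokes.
for mark in range(40, 14, -1):
    f = freq.get(mark, 0)
    print(f"{mark:>2}  {'/' * f:<5}  {f if f else ''}")
print("TOTAL", sum(freq.values()))   # TOTAL 26
```

Checking the output against Table 3, the mark 26 has five strokes and the frequencies sum to 26 students.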

B. Measures of Central Tendency

A measure of central tendency is a measure that tells us where the “middle” of a bunch

of data lies. The three most common measures of central tendency are the mode, the

median, and the mean.

B.1. Mode

Mode refers to the score which most candidates obtained. We can easily spot it from

Table 3. The most frequent score in Table 3 is 26, as five testees have scored this mark.

Thus, the mode is 26.
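A minimal sketch (not part of the original material) of finding the mode from the same raw marks:

```python
from collections import Counter

# The mode is simply the most frequent mark in the frequency table.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34, 27,
         27, 29, 25, 30, 26, 22, 23, 33, 26, 24, 26, 23]

mode, count = Counter(marks).most_common(1)[0]
print(mode, count)   # 26 5
```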


B.2. Median

Median refers to the score gained by the middle candidate after the data is put in order.

We use Table 2, which has been ordered in descending order, to find the median. In the

case of 26 students here, there can obviously be no middle student and thus the score

halfway between the lowest score in the top half and the highest score in the bottom half

is taken as the median. The median score in this case is 26.


TABLE 2

Testee   Mark   Rank
  D       35     1
  M       34     2
  C       33     3.5
  W       33     3.5
  L       32     5
  G       30     6.5
  S       30     6.5
  E       29     8.5
  P       29     8.5
  J       27     11
  N       27     11
  O       27     11
  H       26     15    ← 13th testee: lowest score of the top half
  K       26     15    ← 14th testee: highest score of the bottom half
  T       26     15
  X       26     15
  Z       26     15
  B       25     19

Lowest score of the top half: 26
Highest score of the bottom half: 26
Median = (26 + 26) / 2 = 26
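The halfway-between-the-two-middle-scores rule can be checked in code. This sketch sorts the reconstructed score list (an assumption based on Table 3) in descending order, as in Table 2, and averages the 13th and 14th scores:

```python
# Scores reconstructed from the frequencies in Table 3.
scores = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
          26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

ordered = sorted(scores, reverse=True)
n = len(ordered)                    # 26 testees, so no single middle score
lower_top = ordered[n // 2 - 1]     # lowest score of the top half: 26
upper_bottom = ordered[n // 2]      # highest score of the bottom half: 26
median = (lower_top + upper_bottom) / 2
print(median)  # 26.0
```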


B.3. Mean

The mean, or average score, is the sum of the scores divided by the total number of testees. The mean is the most efficient measure of central tendency, but it is not always appropriate.

Now we are going to create Table 4 to calculate the mean. Note that the symbol x denotes the score, N the number of testees, and m the mean. The symbol f denotes the frequency with which a score occurs, and ∑ means "the sum of".

First, we take the scores and their frequencies from Table 3. Each score (x) is multiplied by its frequency (f), and the result is put in the fx column. The fx column is then totalled to give ∑fx.

TABLE 4

  x   ×  f     fx
 35   ×  1     35
 34   ×  1     34
 33   ×  2     66
 32   ×  1     32
 30   ×  2     60
 29   ×  2     58
 27   ×  3     81
 26   ×  5    130
 25   ×  3     75
 24   ×  1     24
 23   ×  2     46
 22   ×  1     22
 20   ×  1     20
 19   ×  1     19
TOTAL        ∑fx = 702


To get the mean, we use the formula m = ∑fx / N:

m = ∑fx / N = 702 / 26 = 27

Thus, the mean is 27.
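The same computation can be sketched in code starting directly from the frequency table; the dictionary below mirrors Table 4:

```python
# Frequencies taken from Table 4: mark -> number of testees.
freq = {35: 1, 34: 1, 33: 2, 32: 1, 30: 2, 29: 2, 27: 3,
        26: 5, 25: 3, 24: 1, 23: 2, 22: 1, 20: 1, 19: 1}

sum_fx = sum(x * f for x, f in freq.items())  # ∑fx = 702
N = sum(freq.values())                        # 26 testees
m = sum_fx / N                                # 702 / 26
print(sum_fx, N, m)  # 702 26 27.0
```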

C. Measures of Dispersion

Measures of dispersion describe the spread of the scores, i.e. their variation around a central value. Various methods can be used to measure the dispersion of a dataset, but the most common are the range and the standard deviation.

C.1. Range

A simple way of measuring the spread of marks is the difference between the highest and the lowest scores, called the range. From Table 2, the highest score is 35 and the lowest is 19, so the range is 16.

Range = Xmax − Xmin = 35 − 19 = 16
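In code, the range is a one-liner over the same reconstructed score list (an assumption based on Table 3):

```python
# Scores reconstructed from the frequencies in Table 3.
scores = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
          26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

score_range = max(scores) - min(scores)  # 35 - 19
print(score_range)  # 16
```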

C.2. Standard Deviation

The standard deviation (s.d.) is another way of showing the spread of scores. It takes every score into account and so gives a fuller description of the test scores than the range does.

One simple method of calculating the s.d. is shown below:

s.d. = √(Σd² / N)

where N is the number of scores and d is the deviation of each score from the mean.

From the previous calculation, the mean is 27. The steps for calculating the s.d. are as follows:


1. Step 1: Find the amount by which each score deviates from the mean (d).

   Score   d (Score − 27)
    35         8
    34         7
    33         6
    33         6
    32         5
    30         3
    30         3
    29         2
    29         2
    27         0
    27         0
    27         0
    26        -1
    26        -1
    26        -1
    26        -1
    26        -1
    25        -2
    25        -2
    25        -2
    24        -3
    23        -4
    23        -4
    22        -5
    20        -7
    19        -8

2. Step 2: Square each result (d²).


   Score   d (Score − 27)   d²
    35         8            64
    34         7            49
    33         6            36
    33         6            36
    32         5            25
    30         3             9
    30         3             9
    29         2             4
    29         2             4
    27         0             0
    27         0             0
    27         0             0
    26        -1             1
    26        -1             1
    26        -1             1
    26        -1             1
    26        -1             1
    25        -2             4
    25        -2             4
    25        -2             4
    24        -3             9
    23        -4            16
    23        -4            16
    22        -5            25
    20        -7            49
    19        -8            64

3. Step 3: Total all the results (Σd²).

   Σd² = 432

4. Step 4: Divide the total by the number of testees (Σd²/N).

   Σd²/N = 432 / 26 = 16.62

5. Step 5: Take the square root of the result (√(Σd²/N)).

   √(Σd²/N) = √16.62 = 4.077 ≈ 4.08

Thus, the standard deviation (s.d.) is 4.08. This means that, on average, the scores lie about 4 points away from the mean.
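Steps 1–5 above amount to the population standard deviation, which Python's standard library provides as `statistics.pstdev`. The sketch below spells the steps out directly and then checks against the library, again using the score list reconstructed from Table 3 (an assumption):

```python
import statistics

# Scores reconstructed from the frequencies in Table 3.
scores = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
          26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

m = sum(scores) / len(scores)              # mean = 27.0
sq_devs = [(x - m) ** 2 for x in scores]   # Steps 1-2: d and d² for each score
variance = sum(sq_devs) / len(scores)      # Steps 3-4: Σd² / N = 432 / 26
sd = variance ** 0.5                       # Step 5: take the square root

print(round(sd, 2))                         # 4.08
print(round(statistics.pstdev(scores), 2))  # 4.08 (same result from the library)
```

Note that `statistics.pstdev` divides by N, matching the chapter's formula; `statistics.stdev` divides by N − 1 and would give a slightly larger value.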

