
TESTING IN THE CONTEXT OF A LANGUAGE LEARNING EXPERIMENT


Elisabeth Ingram
University of Edinburgh

1. Introduction

I want to consider the problems of constructing and using language tests to evaluate language teaching materials and language teaching techniques.

Language testing is a flourishing business, as this conference shows. Fifteen years ago Lado would have conferred with Lado. Now there are professional language testers in various centres who need to meet and discuss.

So far, language testers have concentrated on meeting the need of reducing educational wastage by developing aptitude and proficiency tests. In particular, the highly professional proficiency tests such as the Michigan and TOEFL are valuable contributions to social engineering.

There are many problems of general and theoretical interest in proficiency testing, but our current problem in Edinburgh with ELBA (English Language Battery) is how to get things to run smoothly now that the test is actually being used, and this is of purely local interest.

But there are other testing needs, more closely related to teaching. Achievement tests, for instance, the internal tests of the English Language Institute, measure how much the students have learnt of what they are supposed to have learnt during a given course. Such tests are commonly used for the purposes of evaluating the students. What I want to consider is the problem of testing for the purpose of evaluating the teaching, not the learners.

There is a need for reasonably objective evaluation of teaching materials and teaching techniques, and the problems of close and detailed testing, week by week, to see if the teaching comes across, are somewhat different from the problems of what I have called the social engineering type of testing, the large scale proficiency and aptitude projects.

I propose to discuss these problems in the light of an experiment carried out in Oslo in April/May of this year.

The theme is language testing needs.


2. Background

Most of the people who argue about language teaching methods are dedicated and successful teachers who outline their own theories and practices, point to the good results they get, and strongly advise everybody to do as they do. Or so it seemed to me when I was brought into applied linguistics from psychology ten years ago. Clearly what was needed was a bit of experimental rigour, with techniques of controlling the variables and measuring the results in language learning situations. Since language laboratory and programming techniques have become fairly widely adopted, it has become possible to minimize the effect of the chief uncontrollable variable in the situation, the teacher, and with language testing techniques it is possible to measure the results.

Patrick Chaffey, a teacher of English at Bedriftsøkonomisk Institutt in Oslo, was a student at the Department of Applied Linguistics in 1966-67. He came to find out how to use language laboratories more effectively. I was going to Oslo on sabbatical leave in the spring term of 1967, and we decided to cooperate in an experiment.

3. The Experiment

The aims of the experiment were:

(i) to find out if the format described below improved the students' ability to understand spoken English; and

(ii) to find out if the format was sufficiently interesting to get a reasonable proportion of the students to turn up for the language laboratory sessions.

The prepared teaching material consisted of ten three-part units. The parts were:

(i) a taped unscripted conversation between 2 or 3 persons, lasting about 4-5 minutes, on some vaguely economic or political topic, accompanied by a multiple choice comprehension test;

(ii) a short written text on roughly the same topic as the conversation. For instance, the first tape, "Running a Summer School," was matched by a newspaper report of the Vice-Chancellor of the University discussing the proposed increase in fees for overseas students; and

(iii) a discussion period, based on returned corrected scripts and a transcript of the tape.


The units were delivered, one per week, in three normal class periods of 40 minutes. In the first period the students were asked to give the gist of each paragraph of the written text, and they were introduced to a very simple logical-syntactic scheme of comprehension analysis.

The second period was spent in the language laboratory. The students listened to the conversation played through once, with the comprehension questions in front of them. Then they played and replayed the tape as they chose, to answer the questions. The third period was a classroom discussion, with a tape recorder to replay the tape, plus the corrected tests and the tape transcripts.

The testing material consisted of ELBA, Part I (Listening) and Part II (Written), the comprehension tests accompanying the tapes, and an extra comprehension test based on a scripted test tape. ELBA is a proficiency test at an advanced level, researched in the usual way. The comprehension tests were written to the tapes; they were not pretested. The experiment was the pretesting, so to speak.

The experimental design was very simple:

1. Pretest the experimental and control groups on ELBA.
2. Pretest the experimental group on the test passage comprehension test.
3. Train the students in the experimental group on the teaching units, without increasing the number of English periods per week.
4. Leave the control group to their normal schedule.
5. Post-test the experimental group on ELBA Part I and Subtest 8 (Reading Comprehension) from Part II.
6. Post-test the experimental group on the test passage comprehension test.
7. Post-test the control group on the test passage comprehension test.

The post-test plan is obviously very skimpy, but in the control group neither the teachers nor the students would tolerate being tested twice on ELBA. If the control and experimental groups were not significantly different on the ELBA pretest, then it would be meaningful to compare the averages of the two groups on the test passage comprehension test alone, provided this correlated with ELBA. The pretesting went according to plan, but the teacher of the control group found himself unable at the last minute to fit in the 20 minute comprehension test. So one half of the design fell away.
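In present-day computational terms the intended comparison might be sketched as follows; purely as an illustration, with invented placeholder scores, not data from the experiment:

```python
# A sketch of the planned comparison, with invented scores: check that
# the experimental and control groups do not differ on the ELBA
# pretest, then compare the groups' comprehension-test means.
from scipy.stats import ttest_ind

elba_exp  = [101, 95, 110, 88, 97, 104, 92]   # hypothetical ELBA totals
elba_ctrl = [ 99, 94, 108, 90, 96, 101, 93]   # hypothetical ELBA totals

t, p = ttest_ind(elba_exp, elba_ctrl)
print(f"pretest difference: t = {t:.2f}, p = {p:.2f}")
# A large p would license comparing the two groups' means on the
# test passage comprehension test alone.
```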


The subjects were the students of the Business School, aged between 18 and 28, predominantly male, mostly with University entrance qualifications. The experimental group was drawn from the British-English oriented classes. There were 60 of them according to the lists. Of the sixty, 49 were pretested, 56 individuals turned up at various times during the experiment, and only 21 were post-tested. The American-oriented classes were to provide the control group. There were 91 according to the lists, 52 were pretested, and none were post-tested.

Only 6 of the teaching units were used. Though there were 10 calendar weeks available, holidays reduced this to 8, and the first and last weeks were set aside for testing.

4. Results

I shall ignore the "American" pretest results since they are not relevant to the experiment.

The relevance of ELBA as the general testing instrument was established by estimating the rank correlations between the Part and Total test scores, and:

(i) the intermediate English examination results set by the class teacher (December 1966); and

(ii) the class teacher's rating for command of spoken English and for command of written English (April 1967).

TABLE 1

Rank correlations between ELBA Part I (Listening), Part II (Written) and Total, and intermediate examination and teacher's ratings (n = 37).

                             TEACHER'S RATINGS
               EXAM MARKS    ORAL    WRITTEN    POOLED
ELBA Part I       .63         .75
ELBA Part II      .68         .76
ELBA Total        .69         .81
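A rank correlation of this kind is easily computed with standard routines; as an illustration only, with invented scores standing in for the Oslo data:

```python
# Spearman rank correlation between ELBA scores and exam marks, as in
# Table 1. Both lists are invented placeholders.
from scipy.stats import spearmanr

elba_part1 = [52, 61, 47, 70, 58, 66, 49, 63]           # hypothetical scores
exam_marks = [2.4, 3.0, 2.2, 3.8, 2.7, 3.3, 2.5, 3.1]   # hypothetical marks

rho, p = spearmanr(elba_part1, exam_marks)
print(f"rho = {rho:.2f}")
```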

The teaching comprehension tests and the test passage comprehension test varied in the number of items per test, and in difficulty. The number of students who took each test also varied.


TABLE 2

Students, items, item analysis, and success rate for all comprehension tests.

                             Test Comp  Test Comp   Comp   Comp   Comp   Comp   Comp   Comp
                              Pretest   Post-test    1      2      3      4      5a     5b

No. of students                 45         21        30     34     46     31     21      5
No. of items                    12         12         9     10     10     19     12      9
No. of items with internal
  validity (r ≥ .30)             9         10         4      5      5      9      8      -
Success rate per item
  per student                 .439       .556      .696   .818   .753   .737   .653      -
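Taking "internal validity" as a corrected item-total correlation of at least .30, the item analysis underlying Table 2 can be sketched as follows; the response matrix is randomly generated, not the experimental data:

```python
# Item analysis of a right/wrong response matrix: count the items whose
# corrected item-total correlation reaches .30, and compute the success
# rate per item per student. The matrix is invented.
import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((45, 12)) < 0.55).astype(float)  # 45 students, 12 items

totals = responses.sum(axis=1)

def item_total_r(j):
    item = responses[:, j]
    rest = totals - item               # keep the item out of its own total
    return np.corrcoef(item, rest)[0, 1]

r = np.array([item_total_r(j) for j in range(responses.shape[1])])
effective = int((r >= 0.30).sum())     # cf. "No. of items with internal validity"
success_rate = responses.mean()        # cf. "Success rate per item per student"
print(effective, round(float(success_rate), 3))
```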

The number of effective items in the teaching tests, Comp 1 - Comp 5b, is too small to make any treatment of the individual subtests meaningful. Luckily the Test Passage Comprehension test turned out to have a high proportion of effective items, but 9 or 10 effective items leave a lot of room for error. However, some estimate of the value of the test must be obtained. Dr. Pilliner suggested a relatively simple way of pooling the scores from each test by converting the raw scores into scaled scores. The scaled scores can then be added and divided by the number of tests taken by each student to obtain the mean scaled score for each student, and these mean scaled scores can be correlated with ELBA scores and teacher's ratings. According to Dr. Pilliner, this procedure is justified if, and only if, the samples of the student population that take each test can be regarded as a random sample of the standardizing population, i.e. the experimental group. If this condition holds, the varying number of items, difficulty level, and subjects is irrelevant. Since I didn't know anything to the contrary, I made the assumption that they were drawn randomly and proceeded to calculate the mean scaled score for each student for all the tests, and then estimated the product moment correlation of these with ELBA scores, and the rank order correlation with the teacher's ratings. Table 3 gives these correlations.
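A minimal sketch of the pooling procedure, with an invented score matrix, and with z-scores standing in for whatever scaling was actually specified:

```python
# Sketch of the pooling procedure: scale each test's raw scores, then
# average each student's scaled scores over the tests he actually took.
# NaN marks a missed test; the matrix is invented.
import numpy as np

raw = np.array([
    [ 8.0, 6.0, np.nan, 7.0],
    [ 5.0, np.nan, 4.0, 6.0],
    [11.0, 9.0, 8.0, np.nan],
])  # rows = students, columns = tests

mu = np.nanmean(raw, axis=0)               # per-test mean of those who sat it
sd = np.nanstd(raw, axis=0)                # per-test spread
scaled = (raw - mu) / sd                   # scaled score for each test taken
mean_scaled = np.nanmean(scaled, axis=1)   # mean scaled score per student
print(mean_scaled)
```

The mean scaled scores would then be correlated with the ELBA scores (product moment) and with the teacher's ratings (rank order).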

Tables 1-3 represent the attempts to establish the usefulness of the tests which were used in the experiment.

Assuming that the tests had a certain amount of validity, I proceeded to try to answer the experimental questions.


TABLE 3

Correlations between Test Comprehension scores, Mean Scaled Comprehension scores and ELBA.

                                Mean Scaled Comp Scores
                                     n         r
ELBA Part I (Listening)             46        .53
ELBA Part II (Written)               -         -
ELBA Total                          46        .64

Teacher's Ratings                    n         r
Oral                                37        .62
Written                             37        .59
Pooled                              37        .64

The first question was: Does the teaching format improve the students' ability to comprehend spoken English? Table 4 shows the mean difference of matched pairs of scores for the subtests common to the pre- and post-testing. These were:

ELBA Part I
  Subtest 1: Phoneme Discrimination (100 items)
  Subtest 2: Intonation Meaning (10 items)
  Subtest 3: Sentence Stress (10 items)
  Subtest 4: Listening Comprehension (30 items)
  Column 5: Subtotal, Part I (150 items)

ELBA Part II
  Subtest 8: Reading Comprehension (22 items)

Test Passage Comprehension Test (12 items)

TABLE 4

Mean difference between matched pairs, t values and significance of differences for pre- and post-test ELBA Part I subtests and Part I Total, ELBA Part II Subtest 8, and Test Passage Comprehension Test.

                              ELBA Part I Subtests          Part I    ELBA Part II   Test Passage
                             1       2       3       4       Total     Subtest 8      Comp Test

Mean diff. between pairs   3.381    .524    .905   2.571     7.381       1.381          1.5
Obtained t values          3.551   1.047   2.219   4.153     5.876       1.766          3.246
Level of significance
  (d.f. 40)                 .01   Not sig.  .05     .01       .01       Not sig.         .01
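The t values in Table 4 are matched-pairs statistics; a minimal sketch with invented pre- and post-test scores:

```python
# Matched-pairs t test as used in Table 4: each student's post-test
# score is paired with his own pretest score. The scores are invented;
# with n pairs the test has n - 1 degrees of freedom.
from scipy.stats import ttest_rel

pre  = [14, 18, 11, 20, 16, 13, 17, 15]
post = [17, 19, 14, 22, 19, 13, 20, 18]

t, p = ttest_rel(post, pre)
print(f"t = {t:.3f}, p = {p:.4f}")
```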


That the students' ability to comprehend spoken English had improved during the treatment seems beyond doubt.

The real question is, however: is the improvement due to the treatment? There is no answer to this, because the control post-testing fell through.

The second aim of the experiment was to find out whether the format was sufficiently attractive to the students to make them come for the language laboratory sessions. The attendance was erratic, with a marked falling off at the end, as Table 2 shows. The experiment ended a fortnight before the examinations. But the baseline of comparison is zero. The teacher of the experimental group had stopped the language laboratory sessions altogether before the experiment started, because nobody turned up for the imitative "Model - Student - Model - Student" format previously available. Table 5 shows the frequency distribution and cumulative count of students turning up for the language laboratory sessions, which were also the comprehension testing sessions.

TABLE 5

Number of sessions attended          1    2    3    4    5    6    7    8

Frequency                            4    5   10   13   11    5    3    5

Cumulative count of students        56   52   47   37   24   13    8    5

(maximum possible: 60)
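The cumulative row is simply the number of students who attended at least a given number of sessions, and can be rebuilt from the frequency row:

```python
# Rebuild the cumulative row of Table 5 from the frequency row: the
# number of students who attended at least k sessions.
freq = {1: 4, 2: 5, 3: 10, 4: 13, 5: 11, 6: 5, 7: 3, 8: 5}

at_least = {k: sum(v for kk, v in freq.items() if kk >= k) for k in freq}
print(at_least)   # {1: 56, 2: 52, 3: 47, 4: 37, 5: 24, 6: 13, 7: 8, 8: 5}
```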

The answer to the second question seems to be a qualified yes. A reasonable number of students were interested, at least during the experiment, until 2 weeks before the examinations.

5. Discussion

Reporting experiments is all very well, but our concern is to find the most useful testing techniques for language learning experiments.

At first sight there is no particular problem. There are perfectly good learning experiment designs and there are perfectly good designs for researching tests. But if you want to get information about the effectiveness of teaching techniques or course content, you have an educational experiment on your hands, and educational experiments differ from laboratory experiments in several important ways:

(i) the learning job will be complex and messy, not clear-cut and directly relatable to theoretical concepts. Comprehension of spoken language is about as messy as you can get;

(ii) the learning job will not be a matter of a single session or two, it will be an ongoing, cumulative affair over a period of time; and

(iii) the learners are not paid subjects, they are students who pay to be taught, who are the responsibility of educational authorities subject to regulations, in the care of teachers who have their own teaching schedules.

Any experiment needs the continued cooperation of the authorities, of the teachers, and of the students.

The first problem for the experimenter is to get in and be given the necessary time, and the second is to have the students turn up regularly. It seems that the ordinary schools, who can enforce attendance, find it impossible to give up the time, at least to outsiders. And those universities and colleges and other places of adult education who are willing to give you the time cannot enforce student attendance.

If you cannot get in, you cannot have an experiment. So that eliminates all further problems. In Oslo we were granted full access and everybody was most cooperative, but student attendance was not enforced. So we had all the other problems.

Ours was a very untidy experiment. I do not believe that all educational experiments have to be equally untidy, but I do believe that all educational experiments are inherently untidy to some degree. In Oslo we had the full support of the administration, and generous cooperation from the teachers. Curiosity brought record student attendance in the beginning, but this soon dropped off to its normal level, and even below. The reasons for this were partly inherent, partly accidental. Inherently, some students will find any project a waste of time because it is not directly useful for exams, or because it is not pitched at the right level for them, so they soon drop out, if they have the choice. In Oslo these included a few of the best students, more of the poorer ones. But it was accidental and bad luck, for the experiment, that it had to be conducted in April/May. May is a lovely month in Norway, full of public holidays, and June is the examination month, so students naturally cut classes, partly from joie de vivre and partly to work for exams. It was further bad luck that one of the experimental classes had extremely low group morale. They had lost their English teacher 6 months previously and had been fostered out among the other teachers ever since. The experiment, involving yet another teacher (me), was the last straw, and they staged a strike half way through. These accidents happened to us; some accident will inevitably happen. And between them, the inherent and the accidental student wastage will seriously muck up any experimental design.

Difficulties with testing are also inherent in the situation. Laboratory experiments are so simple that the measuring of results rarely presents any trouble. Educational experiments need previously researched tests (and we were lucky to have one at the right difficulty level and culturally acceptable to the students) to establish the baselines and to act as criterion for the specially written comprehension tests. These were new and previously untried. The experiment constituted the pretesting, so to speak. The trouble is that I only get a sabbatical every five years. Other people, for instance, Wertheimer and Scherer, have the same difficulties with standard versus tailor-made tests.

Post-testing is bound to be a headache. You can entice the students to the pretesting because they are curious about the experiment. When it comes to the post-test, they know what it is all about and they do not see why they should give up their time and energy for purely research purposes, for something that will do them no good. What is even more serious, the teachers in charge of the control group classes have a built-in negative motivation when it comes to post-testing. Even the most cooperative teacher is understandably, probably unconsciously, reluctant to help to show that another class is superior to his class in some respect or another, which is what the experimenter hopes to show. The difficulties of securing post-testing facilities for control groups are partly Freudian.

These difficulties have to do with people. There is a purely test-technical problem, which I do not know the answer to.

The standardizing test, in this case ELBA, is too crude and all-purpose to provide any detailed information about the efficiency of the teaching units. The comprehension tests accompanying each unit were written in order to provide this information. But these were new and short. Even if they had not been new, if all the items had been made effective, they would still be short, probably containing not more than 8-10 items. This seems to be as many as you can get out of a 4-5 minute conversation. (Our Comprehension 4 had 19 items on paper, but that conversation lasted 8 1/2 minutes, which was much too long. We overran the class period and the students were exhausted.)


With such short tests, it is not very informative to correlate each one separately with the criterion, because the confidence range will be huge. Therefore the comprehension tests have to be pooled, as I did, by means of scaling and averaging, before any meaningful comparison with the criterion can be made.

Now suppose that the rate of progress during the experiment varies for small subsets within the group. One set doesn't learn much, while another set of individuals have been waiting for just this all their lives and improve their command enormously. Suppose further that all the tests are true. The successful students move further and further away from their initial level while the unsuccessful ones stay at much the same level as they started. Under these conditions, the correlation between the pretest criterion and the pooled experiment tests is not going to be very high. So if you do not get very high correlations, you do not know if this is because:

(a) your experiment tests are poor; or

(b) your experiment tests reflect accurately the varying rates of progress within the group.

The problem of testing and correlating divergent rates of progress over time can be illustrated by a simple imaginary example. In the experimental group there are three individuals, A, B and C, who all score the same on the Criterion. A is an overachiever, B makes normal progress, and C is one of the underachievers who learn nothing.

This could be the extract from the list of scores:

                      Weekly successive tests       Pooled
Subject   Criterion    1    2    3    4    5    6    Score

A            10        2    4    6    8   10   12      42
B            10        2    3    4    5    6    7      27
C            10        2    2    2    2    2    2      12

These three Criterion/Pooled Score pairs are going to reduce the correlation quite a bit.

The problem could be minimized by pooling the scores, not right across the board, but in two or more blocks; for instance, block "a," including weeks 1 to 3, and block "b," including weeks 4 to 6.


Then there would be two correlations: the pooled scores of block "a" with the criterion, and the pooled scores of block "b" with the criterion. By reducing the time span, the divergent rates will not have diverged quite so much, and the differences between the pairs will be smaller. Therefore the correlation will not be so much reduced by reason of divergent rates of learning.

Column I shows the values of the pairs for the Criterion/Pooled Score correlation; columns IIa and IIb show the pairs for the correlations of Criterion/block "a" and Criterion/block "b."

             I            IIa           IIb

A        10    42      10    12      10    30
B        10    27      10     9      10    18
C        10    12      10     6      10     6
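The arithmetic of the example, and of the block pooling, can be checked directly; note that for these three students alone no correlation can actually be computed, since the criterion column has no variance. The point is rather that such triplets depress the correlation for the group as a whole:

```python
# The imaginary A/B/C example: pool the six weekly scores right across
# the board, then in two blocks (weeks 1-3 and weeks 4-6). The block
# sums diverge less than the overall sums, which is the point of
# pooling in blocks.
import numpy as np

weekly = np.array([
    [2, 4, 6, 8, 10, 12],   # A, the overachiever
    [2, 3, 4, 5,  6,  7],   # B, normal progress
    [2, 2, 2, 2,  2,  2],   # C, learns nothing
])

pooled  = weekly.sum(axis=1)         # [42 27 12], spread 30
block_a = weekly[:, :3].sum(axis=1)  # [12  9  6], spread  6
block_b = weekly[:, 3:].sum(axis=1)  # [30 18  6], spread 24
print(pooled, block_a, block_b)
```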

If there were more tests than we had, that is, if the experiment were longer and the students turned up reasonably regularly throughout the duration of the experiment, then it would be meaningful to pool the experiment test scores in blocks. But then we are back to the practical difficulties of time and student attendance. Such block pooling cannot be done on the Oslo scores. I had not formulated the problem early enough, and the pattern of attendance over time is not regular enough.

6. Conclusion

Next time I think I'll try the Army.


DISCUSSION

Ingram: I am really quite grateful for your discussion of taxonomy, Mr. Lewis. I don't know if you can solve the problem for me, but certainly by listening to people talking about testing I have had the matter pigeonholed in a way that I certainly didn't have before I came. The problem is: how to test a test which is designed not for placement or selection, but designed for your first purposes: to investigate the learning process, or, which is the same thing in another way, to investigate the efficiency of the teaching material. Now, what one gets taught is how to evaluate a test for placement purposes. All the technique seems to be designed for that. Obviously, the reason why I am so puzzled and don't know what to do is that this is a different kind of purpose, and it therefore presumably requires different kinds of evaluation procedures.

Cooper: There are two problems here. One is the problem that everybody faces in collecting data: getting the people there. And the other is assessing the validity of the test for a particular purpose. The standard way, or one way certainly of doing it for this kind of test, is to determine its content validity: how well does it seem to represent the kinds of behaviors, at least on the face of it, that you are trying to assess? It doesn't seem to me that the kinds of skills you are trying to assess are that unusual.

Ingram: I am sorry to appear too critical about this, but I don't like content validity except when you can specify exactly what your content is. And I don't think talking about comprehension of spoken language is a sufficient content analysis.

Upshur: One of the problems that you are getting into here is that content validity is satisfactory for your test as an achievement test, because you can make the assumption that whoever created the course and the material for use in the course knew exactly what he was doing and you can refer back to that. I think your problem comes in because you want to know if the things that the curriculum writer did are worth the time.

Ingram: I happen to have written both the materials and the test. If the materials turn out to be good, then that's fine. But as a test writer, I couldn't show that my test was any good.

Mackey: Do I understand here that what you are trying to do is to evaluate learning, in other words, achievement, and to use that as an evaluation of the usefulness of what is learned? I don't think you can possibly do that. It is inherently contradictory. Suppose, for example, that your course in English as a Second Language, Lesson 1, included words like pew, wattle and so forth, and sentences like "The big dog wants, but the little dog doesn't." and things like that. You know the sort of thing I am referring to because you have seen some of them. Now you wonder whether this is good or not, and so you are going to find out by testing students on how well they learn it. How good the materials are has nothing to do with how easy they are to learn. You say that you are not really evaluating certain qualities in the materials such as the usefulness. But what precisely are you evaluating in the series of materials by testing how well they are learned? I would like to know what there is in this package of materials that is transmitted to the learner that you are trying to find out by getting the results of the students' learning ability. What are the elements in this?

Ingram: The element is, as I saw it, the skill of understanding, of getting the gist of natural spoken dialogues. So the evaluation of the material is that I feed them these tapes; I show them some techniques which I hope are useful and will enable them to get the gist of spoken passages in a reasonably efficient way. Given the techniques I provide, the material they listen to, and the experience, does it make any difference on a pre- and post-testing of the same passage, using the same questions? In other words, can I show an improvement on my test questions before and after the treatment?

Upshur: You have already demonstrated a significant improvement between pre-test and post-test scores for your experimental group, or for what remained of your experimental group. So, in still other words, you are interested first in whether a control group, had it stayed with you, might have improved as much; and secondly you are interested in whether the test improvement actually, and in some sense accurately, reflects improvement in a yet undefined "skill of understanding." Since your control group didn't stay with the experiment, I can't see how you might answer the first of these questions.

Spolsky: Isn't it like Gene Brière's problem, not wanting to know if they have learned the pattern as taught in the lesson?

Ingram: Exactly. This is not what I want to know.


Brière: Wouldn't that have to be a point of departure for what you want? The materials or the teaching method, it seems to me, should do what they say they're going to do. If it says it is teaching the present progressive, X number of forms, I don't see how you can get around it from the testing angle.

Spolsky: That's not a sufficient condition. Here we're interested specifically in the extra part of the advanced level, in seeing what the student can do creatively. So really we're asking, in a sense, what are the chances of validating a test of this creative ability, this handling of novel, unstated, undescribed situations.

Mackey: I think your protection is that you're operating at a high level of proficiency and you can assume a lot. So the other approach would be simply to regard the materials as samples of language behavior which the learners would have to reproduce in some form or other, and then to test their language behavior. And I think the best way to do this, most obviously, is simply by getting samples of performance.

Ingram: Why samples? I was testing comprehension. That is a receptive thing. And the pre- and post-tests which I constructed were samples of performance, I thought. I can't see where having them speak at me tells me whether they've learned to comprehend better. That is a very indirect way of testing.

Mackey: But then you're breaking it down to a certain extent. You're breaking down language behavior, at least, into comprehension and expression. There's some analogy there already. So that if the course teaches these persons to understand the language and its overall behavior, and that's the purpose of the course, then of course if you give them a passage and you test whether they understand the passage or not, it would be just as valid.

Ingram: Yes, but how do I know that the particular questions that I'm asking to show their comprehension constitute a good test?

Lewis: The puzzle which I have is that you're trying to establish a correlation between groups of students in terms of their yield over a period of time. And there is a diachronic and a synchronic scale in this particular exercise. I think what we've been discussing so far is really a validation on the synchronic level. That is, how do you measure the success of the student on this particular test at this particular point? But there is another dimension which now complicates this beyond measure. If you want to introduce the diachronic scale into this, you are then testing not only achievement but yield. Once you bring in these two and want to correlate achievement at one point with achievement at another point, you are testing yield, and not simple achievement at any particular point. This is the thing which puzzles me and which I'm trying to analyze. You know how to solve the problem of testing achievement at any particular point in time. Your dilemma is the correlation of two points of achievement. This is what I might call yield. You find that you have students who are yielding at different levels, at different rates, and it is the measure of that that is really confusing you. And I'm afraid, at least from my understanding of the discussion, that we've really not attempted to help you on that particular point.