
The Effect of Multiple Choice Item Sequence on EFL Students' Performance and Test Reliability

    Asep Suarman,

    [email protected]

    Abstract

In the 2011 Indonesian national examination, five packages of items were used in relatively random sequence in terms of level of difficulty, which might affect students' performance on the test. This paper reports an investigation of whether sequencing multiple choice items has a significant effect on the performance of junior high school EFL students, and whether the difference in sequencing affects the internal reliability of the items. A multiple choice paper-and-pencil test was run to collect data from 68 students, who were divided into two intact groups: an easy-to-difficult (ED) item group and a difficult-to-easy (DE) item group. The students' answers were analyzed with the ANATESV4 and SPSS 15 software. The results revealed that the students in the ED group outperformed the DE group (t = -2.114, df = 52, p = 0.039), meaning that sequencing the items has a significant effect on the students' performance. In terms of internal reliability, the items in ED sequence performed better than those in DE sequence (ED = .85, DE = .59). Further investigation is necessary to reach a more convincing conclusion.

    Keywords: Multiple Choice, Sequence, Performance, Reliability

    Background

The national examination is still conducted in Indonesia every year. It is administered at the end of the last grade of junior or senior high school, and it functions both as one of the graduation requirements and as a mapping of students' capability in general. It covers the subjects Bahasa Indonesia, Mathematics, English and Science.

In the last few years, there have been some changes in how the multiple choice items of the national examination for junior and senior high schools in Indonesia are packaged. In 2008 and 2009, two packages of items derived from the same test criteria were used. The items were different but assumed to have the same level of difficulty, and they were sequenced roughly from easy to difficult. In 2010, two packages were employed; they had the same items but in a different sequence. In 2011, five packages of items were used. The items in every package derive from the same test criteria, consisting of the same genres of texts, the same material (skills tested) and a similar degree of difficulty, but in a different sequence. At first sight, such items may pose no problems. However, if the items are analyzed and compared, some drawbacks of the sequencing come up.

An analysis of the five packages of the 2011 English national examination items shows that the packages are basically made up of only two sets of items with the same genres of text and the same skills tested. Packages number 39, 25 and 12 consist of the same items, and so do packages number 46 and 54; only the sequence of the items in each package is different. Take, for example, number 1, in which the genre is a notice and the main idea is asked: it is the same in all packages. Numbers 2 and 3 in package 39, in which the genre is an announcement and explicit and implicit information are asked, are placed at numbers 10 and 11 in package 25, and at numbers 12 and 13 in package 12. In the other two packages, numbers 2 and 3 in package 46 are placed at numbers 49 and 50 in package 54.

The analysis shows that the test criteria, or material tested, are the same, but the items differ in some respects. The degree of difficulty of the test items might be the same, but the sequence is very different. One package might consist of items sequenced from easy to difficult (ED), while another might run from difficult to easy (DE). This, of course, may affect students' achievement in the exams since, psychologically, students' motivation and disappointment may influence their scores.

Furthermore, from the fairness point of view (Brown, 2005, p. 26), the difference in item sequence is not fair: the item sequence leads to different results. Although the items are similar in terms of competence, indicators and degree of difficulty, the result of such a test remains rather unobjective. Students, as examinees who have actually received the same treatment, materials, tasks, guidance and feedback, are tested with different items.

Some studies investigating item sequence have indicated that the sequence of a test has a significant effect on the performance of test takers: Jessell and Sullins (1975); Towle and Merrill


(1975); Carsten and McKeag (1982); Carlson and Ostrosky (1992); Hodson (2006); and Soureshjani (2011). Jessell and Sullins (1975) found that the arrangement of multiple choice items significantly affected test performance and reliability. Towle and Merrill (1975) found that students given easy-to-hard (EH) items scored significantly higher than those given hard-to-easy (HE) items, and pointed out that no anxiety effect was found, meaning that difficulty sequencing did not affect anxiety arousal. Carsten and McKeag (1982), Gohmann and Spector (1989) and Carlson and Ostrosky (1992) showed evidence that the distribution of test scores may be influenced by the sequence of the items while the items' validity and reliability remain unaffected. Hodson (2006) and Soureshjani (2011) showed that students taking easy-to-difficult (ED) multiple choice test items outperformed those taking difficult-to-easy (DE) items.

These previous studies indicate that the sequence of items can affect adult learners' performance on their tests, but no such research has been done on teenage or junior high school students. In addition, it is difficult to find research relating the sequence of items to the internal reliability of tests.

Thus, this study attempted to shed light on two questions: whether sequencing multiple choice test items has any significant effect on the performance of junior high school students, and whether the difference in sequencing affects the internal reliability of the items. Hopefully, this research provides additional support for existing theory and findings on the topic.

By and large, many factors affect the score a test taker gains: the testing environment, the test rubric, the nature of the input of the test, the nature of the expected response, and the relationship between input and response (Bachman, 1990, in Soureshjani, 2011). In addition, the test format, such as multiple choice, true-false, cloze procedure, open-ended or other formats, may influence test takers' performance (Alderson, 2000; Bachman & Palmer, 1996; Buck, 2001; cited in Soureshjani, 2011).

With multiple choice items, there are three ways of jumbling, or re-sequencing, the items derived from the test criteria. The first is jumbling only the options of the MC items: the items stay in the same order, but the options are arranged differently for every number of the test. The second is jumbling the sequence of the items: the test has the same items with the same options, but the order of the items is different. The third is jumbling both the items and the options: the items are in a different order, and so are the options within every item (see the sketch below).

On top of that, to know the quality of test items, reliability is one of the criteria, besides practicality and validity. Reliability is defined as the extent to which results can be considered consistent or stable (Brown, 2005, p. 175), or the desired consistency (or reproducibility) of test scores (Crocker and Algina, 1986, in Fulcher and Davidson, 2007, p. 104). A reliable test is consistent and dependable (Brown, 2001; Brown, 2004). The reliability of a test may lie in the test itself, which is generally called test reliability, or in the scoring of the test, which is called rater (scorer) reliability (Brown, 2001). Since a multiple choice test is used in this case, only the former is investigated here.
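The following minimal Python sketch illustrates the three jumbling schemes, plus the easy-to-difficult (ED) and difficult-to-easy (DE) orderings used in this study. The item structure, the function names and the per-item difficulty value are illustrative assumptions of mine, not the national examination's actual packaging procedure.

```python
import random

# Hypothetical item structure: a stem, four options, and an assumed
# per-item difficulty value (higher = harder). Not the actual exam format.
def make_items(n=40):
    return [{"stem": f"Q{i}", "options": ["A", "B", "C", "D"], "difficulty": i}
            for i in range(1, n + 1)]

def shuffle_options(item, rng):
    # Scheme 1: the item stays in place; only its options are reordered.
    options = item["options"][:]
    rng.shuffle(options)
    return {**item, "options": options}

def make_package(items, scheme, seed=0):
    """Derive one package from the source items under one jumbling scheme."""
    rng = random.Random(seed)
    items = [dict(it) for it in items]
    if scheme in ("options", "both"):
        items = [shuffle_options(it, rng) for it in items]
    if scheme in ("order", "both"):
        # Scheme 2 (and part of scheme 3): reorder the items themselves.
        rng.shuffle(items)
    return items

def sequence_by_difficulty(items, easy_first=True):
    # ED vs DE forms as used in this study: same items, opposite ordering.
    return sorted(items, key=lambda it: it["difficulty"], reverse=not easy_first)

source = make_items()
pkg_options_only = make_package(source, "options")  # scheme 1
pkg_order_only = make_package(source, "order")      # scheme 2
pkg_both = make_package(source, "both")             # scheme 3
ed_form = sequence_by_difficulty(source)
de_form = sequence_by_difficulty(source, easy_first=False)
```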

Among the three basic strategies for estimating the reliability of tests, i.e. test-retest, equivalent (or parallel) forms, and internal consistency (Brown, 2005, pp. 175-9; see also Fulcher and Davidson, 2007), this study employed internal consistency using the split-half method, with ANATESV4 as the software.
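Assuming dichotomously scored items, a minimal sketch of the odd-even split-half estimate with the Spearman-Brown correction looks like the following. The helper names and the plain-Python Pearson implementation are mine; this is the standard procedure, not ANATESV4's actual code.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(score_matrix):
    """score_matrix: one row per student, one 0/1 entry per item."""
    odd = [sum(row[0::2]) for row in score_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in score_matrix]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown correction: project the half-test correlation
    # onto the reliability of the full-length test.
    return 2 * r_half / (1 + r_half)

# Placeholder data (4 students x 6 items), not the study's actual responses.
demo = [[1, 1, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1]]
print(split_half_reliability(demo))
```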

    Method

This study was conducted in EFL classes in Serang Regency, involving 74 participants who were divided into two intact groups. The population, students of grade IX of junior high school, were about 14-16 years old and at a beginner-to-intermediate level of proficiency. All of them had studied English for at least two and a half years.

The instrument used in the study was a 40-item reading comprehension test. The test items were adapted from the Prediksi Ujian Nasional 2010 (Prediction of the 2010 National Examination) issued in softcopy form by Depdiknas (the Ministry of National Education). The selected genres were report, procedure, letter and advertisement, which are taught in the odd term of Grade IX. The items were multiple choice with four options, arranged in two sequences: easy to hard and vice versa.


Findings and Discussion

Table 1: Independent samples t-test of the ED and DE groups' scores at the test stage (SPSS output)

                              Levene's Test        t-test for Equality of Means
                              F       Sig.    t       df      Sig. (2-tailed)  Mean Diff.  Std. Error Diff.  95% CI Lower  95% CI Upper
Equal variances assumed       9.824   .003    -2.114  52      .039             -7.15110    3.38352           -13.94063     -.36157
Equal variances not assumed                   -2.159  41.543  .037             -7.15110    3.31207           -13.83731     -.46489

The table above shows the result of Levene's test, which reveals a difference between the variances of the ED and DE groups' test scores at the test stage. The F value for the test score with equal variances assumed is 9.824, with a probability level of .003, which is smaller than .05 (p < .05); the variances of the two populations are therefore different. In addition, the table shows that the t-value for the test score with equal variances assumed is -2.114, with a probability (p) of 0.039 (t = -2.114, df = 52, p = 0.039). This indicates that the null hypothesis (H0) is rejected: the two groups do not perform equally. Since the items are the same for both groups and the groups were previously shown to be homogenous, the difference is most likely due to the sequence of the items, with the ED group performing better than the DE group.
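For readers who want to reproduce this kind of analysis without SPSS, the sketch below uses SciPy instead. The score lists are placeholders rather than the study's actual data; the SPSS values reported above are quoted in the comments for comparison.

```python
# A hedged re-creation of the reported analysis using SciPy rather than SPSS 15.
from scipy import stats

# Placeholder scores for illustration only; not the study's data.
de_scores = [14, 12, 16, 15, 11, 13, 17, 12]  # difficult-to-easy group
ed_scores = [22, 18, 25, 17, 20, 23, 19, 21]  # easy-to-difficult group

# Levene's test for equality of variances (SPSS reported F = 9.824, p = .003).
lev_stat, lev_p = stats.levene(de_scores, ed_scores)

# Independent-samples t-test (SPSS reported t = -2.114, df = 52, p = .039 with
# equal variances assumed). When Levene's test rejects equal variances, the
# Welch version (equal_var=False) is the safer choice.
t_stat, t_p = stats.ttest_ind(de_scores, ed_scores, equal_var=False)
print(f"Levene p = {lev_p:.3f}; Welch t = {t_stat:.3f}, p = {t_p:.3f}")
```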

In summary, the students who did the items in easy-to-difficult sequence (the ED group) outperformed those who did the items in difficult-to-easy sequence (the DE group). This is in line with previous findings that the sequence of a test has a significant effect on the performance of test takers (Towle and Merrill, 1975; Carlson and Ostrosky, 1992; Hodson, 2006). The result also confirms Soureshjani's (2011) finding that students taking easy-to-difficult (ED) test items outperform those taking difficult-to-easy (DE) items.

Besides, the difference in test performance above might be caused by affective factors such as anxiety, motivation or frustration, in line with Munz and Smouse's (1968) claim that different item sequences affect the performance of test takers with different anxiety levels. The DE (difficult-to-easy) sequence might lead students to become frustrated, unmotivated or even disappointed, so that they lose concentration on the rest of the items.

In addition, the data obtained from the ANATESV4 analysis comparing the trial, ED and DE groups reveal that the ED group outperforms the DE group: the mean score, standard deviation, XY correlation and test reliability of the ED group are all higher than those of the DE group. See the table below for details.

Table 2: The Result of ANATESV4 Analysis of the ED and DE Groups

Group              N    Mean Score  Std. Deviation  Correlation of XY  Internal Reliability  Highest Score  Lowest Score
Trial group        36   16.19       4.57            0.42               0.59                  28             10
ED group           30   15.71       6.71            0.73               0.85                  33             9
DE group           29   13.83       3.57            0.42               0.59                  23             8
Difference ED-DE   1    1.88        3.14            0.31               0.26                  10             1

Further, it can be seen that the DE group has almost the same result as the trial group. Despite the slight differences in mean score and standard deviation, the internal reliability and the XY correlation are the same for the trial and DE groups. It can be assumed that the trial group and the DE group have similar capability; they look homogenous.

In terms of internal-consistency reliability (Brown, 2005, pp. 176-8; Hughes, 2003, pp. 38-9), wherein the odd and even items are scored separately and compared, the table above shows that items sequenced from easy to difficult have better internal reliability than items sequenced from difficult to easy. In the DE group, the reliability of the items is .59 (the same as at the trial stage), but in the ED group it is .85, meaning that it has


85% reliability, which is close to the best possible reliability coefficient of 1, or 100% (Hughes, 2003). This suggests that the items have better reliability when they are sequenced from easy to difficult.
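These reliability figures can be sanity-checked with the Spearman-Brown formula, under my reading (not stated explicitly above) that the "Correlation of XY" column in Table 2 is the half-test correlation from the odd-even split and "Internal Reliability" is its full-length correction:

```python
def spearman_brown(r_half):
    """Full-test reliability implied by a split-half correlation."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.73), 2))  # 0.84, close to the .85 reported for the ED group
print(round(spearman_brown(0.42), 2))  # 0.59, the value reported for the trial and DE groups
```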

    Conclusion and Suggestions

In summary, the results of the study confirm previous findings that sequencing the items has a significant effect on students' performance. The SPSS independent t-test analysis shows that the students in the ED (easy-to-difficult) group performed better than those in the DE (difficult-to-easy) group. This possibly happens due to students' affective factors, such as anxiety, motivation and frustration, at the beginning of the test.

Meanwhile, in terms of internal item reliability, the data show that the items in easy-to-difficult (ED) sequence have a better reliability coefficient than those in difficult-to-easy (DE) sequence.

However, as this study may contain some drawbacks, further studies covering non-multiple choice items or a larger population need to be conducted to gather more convincing evidence about the sequencing of test items. It is worth mentioning that since a teacher-team-made test was used here, future studies may utilize the items of a standardized test, from which different results may emerge. It would also be a good idea to investigate a similar question with computer- or internet-based multiple choice tests and to employ other software to analyze the results. Finally, a study of the affective factors influencing test takers is necessary so that tests are valid and reflect test takers' authentic performance.

    Bibliography

Alderson, J. C. 2000. Assessing Reading. Cambridge: Cambridge University Press.

Bachman, L. 1990. Fundamental Considerations in Language Testing. London: OUP. In Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011.

Brown, J. D. 2005. Testing in Language Programs: A Comprehensive Guide to English Language Assessment. Singapore: McGraw-Hill Education.

Brown, H. D. 2001. Teaching by Principles: An Interactive Approach to Language Pedagogy, Third Edition. New York: Addison Wesley Longman, Inc.

Brown, H. D. 2004. Language Assessment: Principles and Classroom Practices. New York: Pearson Education, Inc.

Buck, G. 2001. Assessing Listening. Cambridge: Cambridge University Press. In Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011. Available online at http://www.languagetestingasia.com/. Retrieved November 2011.

Carsten, P. W. and McKeag, R. A. 1982. The Effect of a Change in Item Sequence Order on Performance in a Test, Re-Test Experiment. Online publication. Available at http://www.eric.ed.gov. Retrieved November 2011.

Carlson and Ostrosky. 1992. Item Sequence and Student Performance on Multiple Choice Exams. The Journal of Economic Education, Vol. 23, No. 3 (Summer 1992), pp. 232-235. Available online at http://www.jstor.org/stable/1183225. Retrieved November 2011.

Fulcher, G. and Davidson, F. 2007. Language Testing and Assessment: An Advanced Resource Book. Oxon: Routledge.

Gohmann, Stephan F. and Lee C. Spector. 1989. Test Scrambling and Student Performance. Journal of Economic Education, Summer. In Carlson and Ostrosky. 1992. Item Sequence and Student Performance on Multiple Choice Exams. The Journal of Economic Education, Vol. 23, No. 3 (Summer 1992), pp. 232-235. Available online at http://www.jstor.org/stable/1183225. Retrieved November 2011.

Hatch, E. and Farhady, H. 1982. Research Design and Statistics for Applied Linguistics. Los Angeles, California: Newbury House Publishers, Inc.

Hodson. 2006. The Effect of a Change in Item Sequence on Student Performance in a Multiple-Choice Chemistry Test. Journal of Educational Measurement. National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.

Hughes, A. 2003. Testing for Language Teachers, Second Edition. Cambridge: Cambridge University Press.

Jessell, J. C. and Sullins, W. L. 1975. The Effect of Keyed Response Sequencing of Multiple Choice Items on Performance and Reliability. Journal of Educational Measurement, Volume 12, No. 1. National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.

Soureshjani, H. K. 2011. Item Sequence on Test Performance: Easy Items First? Language Testing in Asia, Volume 1, Issue 3, October 2011. Available online at http://www.languagetestingasia.com/. Retrieved November 2011.

Towle and Merrill. 1975. Effects of Anxiety Type and Item Difficulty Sequencing on Mathematics Test Performance. Journal of Educational Measurement. National Council on Measurement in Education. Available online at http://www.jstor.org. Retrieved November 2011.
