Effects of Item Content Characteristics on Item Difficulty of Multiple Choice Test Items in an EFL Listening Assessment. Ikkyu Choi, University of California, Los Angeles. 2010.10.30. ECOLT 2010.
EFFECTS OF ITEM CONTENT CHARACTERISTICS ON ITEM DIFFICULTY OF MULTIPLE CHOICE TEST ITEMS IN AN EFL LISTENING ASSESSMENT
2010.10.30. ECOLT 2010
Ikkyu Choi
University of California, Los Angeles
Background
Korean College Scholastic Ability Test (CSAT)
one of the main criteria in the selection process for new university students
the highest-stakes test administered in Korea
several characteristics that distinguish it from its predecessors, including the introduction of a dedicated English listening section (consisting of multiple choice items)
Background
One Thorny Problem: Listening Section
much easier than its reading counterpart and than the intended standards (Cha, 1997; Kim, 2001; Lee, 2001)
low item discrimination (Kim, 2001)
-> a need for increasing the difficulty level of the English Listening Comprehension (ELC) items
The Purpose of the Study
To identify the variables and the underlying factor structure that affect the difficulty of multiple choice test items such as the ones adopted in the CSAT listening section
Research Questions
I. What are the characteristics of CSAT-type multiple choice ELC test items, and how are these characteristics related to one another?
II. What relationships exist between item content characteristics and item difficulty?
Review of Literature
In Free-Response Assessment Contexts
Buck and Tatsuoka (1998): identified 15 item content characteristics and 14 interactions among them as meaningful predictors of task difficulty
Brindley and Slatyer (2002): controlled item difficulty by manipulating some of the item content characteristics
Carr (2006): constructed a model that accounts for item difficulty in a reading comprehension context
Review of Literature
In TOEFL Listening Contexts
Freedle and Kostin (1996): 14 variables, including the type of topic, the required degree of inference, and the location of information, were significant in predicting item difficulty
Nissan, DeVincenzi, and Tang (1996): five meaningful predictors of item difficulty, including the frequency of negatives and infrequent vocabulary, and the degree of familiarity of the speakers' roles
Kostin (2004): 14 significant predictors, most of which were found significant in the two earlier studies
Review of Literature
In the CSAT Context
Lee et al. (2003) and Chang (2004): the degree of inference, grammatical competence and time required to answer the item, the number of attractive distracters and their degree of attractiveness, and the level of grammar involved in the item (of the reading section)
Jin and Park (2004): 14 meaningful predictors of CSAT English test item difficulty
Methodology
Participants
Test takers: 1,280 Korean middle- and high-school students
Item content raters: 2 graduate students majoring in English education
Test Items
120 items from 78 CSAT preparation examinations (4 matched formats, each 30 items)
Each item involved a conversation between a male and a female speaker and required test takers to identify specific information from the conversation.
Each item had two sub-questions, which asked test takers to indicate their level of confidence in getting the item right and their degree of comprehension of the stimulus.
Methodology
Item Contents Variables
variables expected or found to influence test taker performance, in theory (e.g., Brown et al., 1984; Rost, 2002) and in relevant empirical studies (e.g., Freedle & Kostin, 1993; Kostin, 2004)
27 item characteristic variables were selected
divided into 6 groups according to their characteristics: Word Level, Sentence Level, Key Sentence, Discourse Level, Item Level, and Item/Stimulus Overlap
Methodology
Content Rating Instruments
taken directly from, or in some cases derived from, those used by Bachman (1990), Bachman, Davidson, Ryan, and Choi (1995), Bachman, Davidson, and Milanovic (1996), Buck and Tatsuoka (1998), Freedle and Kostin (1993), Kostin (2004), Carr (2006), and Nissan, DeVincenzi, and Tang (1996)
classified into three categories (Carr, 2006) according to the appropriate measurement procedure: counting, calculating, and judging
Excerpt from the Rating Instrument
Variable Name | Operational Definition | Category | Rating
WLNIDW | Number of words not listed in middle school English textbooks in stimulus | Word | Counted
WLNWMS | Number of words that contain more than three syllables in stimulus | Word | Counted
WLNIMV | Number of idiomatic/multiword verbs | Word | Counted
WLAWL | Average word length in characters | Word | Calculated
WLDIF | Judged relevance of the words not listed in middle school English textbooks to key information of stimulus | Word | Calculated
SLNDC | Number of dependent clauses in stimulus | Sentence | Counted
SLDIF | The Flesch–Kincaid Grade Level of stimulus | Sentence | Calculated
SLNWCR | Number of within-sentence referential expressions in stimulus | Sentence | Counted
SLNBCR | Number of between-sentence referential expressions in stimulus | Sentence | Counted
KSLOC | Key sentence location (more difficult when it is located in the middle) | Key Sentence | Judged
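The two "Calculated" variables WLAWL and SLDIF in the excerpt can be illustrated with a minimal Python sketch. It uses the standard Flesch–Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The tokenization rules and the vowel-group syllable counter are assumptions made for illustration only; the slides do not say how words, sentences, or syllables were counted, and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of vowel groups (an assumption, not the study's method)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def tokenize_words(text: str) -> list:
    """Split text into alphabetic word tokens (hypothetical tokenization rule)."""
    return re.findall(r"[A-Za-z]+", text)


def average_word_length(text: str) -> float:
    """WLAWL-style measure: mean number of characters per word."""
    words = tokenize_words(text)
    if not words:
        raise ValueError("text contains no words")
    return sum(len(w) for w in words) / len(words)


def flesch_kincaid_grade(text: str) -> float:
    """SLDIF-style measure: Flesch-Kincaid Grade Level of a stimulus transcript."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize_words(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Hypothetical two-sentence stimulus, used only to exercise the functions.
stimulus = "The cat sat. The dog ran."
print(average_word_length(stimulus), flesch_kincaid_grade(stimulus))
```

Very short, simple texts can produce a negative grade level; that is a property of the formula rather than an error in the sketch.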
Data Analysis
Item Contents Analysis
inter-rater reliability for ratings of judged variables: r = .84
descriptive statistics, including means, standard deviations, minimum and maximum values, skewness, and kurtosis
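As a rough illustration of the descriptive statistics listed above, the following Python sketch computes them for one item content variable. The choice of estimators is an assumption: it uses the sample standard deviation (n − 1) and moment-based skewness and excess kurtosis, whereas the slides do not state which definitions or software were used. The function name and the example values are hypothetical.

```python
import math


def describe(values):
    """Return mean, SD, min, max, skewness, and excess kurtosis for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    # Central moments (divide by n), used for the moment-based shape statistics.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values have zero variance")
    return {
        "mean": mean,
        "sd": sd,
        "min": min(values),
        "max": max(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal distribution
    }


# Hypothetical counts of dependent clauses across five stimuli.
print(describe([1, 2, 3, 4, 5]))
```

Values of skewness and excess kurtosis near zero are what one would expect from an approximately normal variable.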
Item Difficulty Estimation
test taker performance: the proportion of test takers who did not provide the correct response
degree of confidence: the average of responses on the first sub-question
degree of comprehension: the average of responses on the second sub-question
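The three difficulty indicators described above amount to one proportion and two means per item. A minimal Python sketch, assuming scored responses coded 1 (correct) and 0 (incorrect) and sub-question ratings on a numeric scale (the slides do not give the rating scale, so the 1 to 5 values below are hypothetical, as are the function names):

```python
def proportion_incorrect(scored_responses):
    """Performance indicator: share of test takers who missed the item (1 = correct, 0 = incorrect)."""
    if not scored_responses:
        raise ValueError("need at least one response")
    return sum(1 for r in scored_responses if r == 0) / len(scored_responses)


def mean_rating(ratings):
    """Average of the ratings given on one sub-question."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


def item_indicators(scored, confidence, comprehension):
    """Collect the three difficulty indicators for a single item."""
    return {
        "performance": proportion_incorrect(scored),
        "confidence": mean_rating(confidence),
        "comprehension": mean_rating(comprehension),
    }


# Hypothetical data for one item answered by four test takers.
example = item_indicators(
    scored=[1, 1, 0, 1],
    confidence=[4, 3, 2, 4],
    comprehension=[4, 4, 2, 3],
)
print(example)
```

Note that the performance indicator rises with difficulty, whereas higher confidence and comprehension averages would be expected for easier items.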
Data Analysis
Initial Models 1, 2, and 3 (model diagrams not reproduced)
Results
Item Content Characteristics
infrequent use of "difficult words" (words not included in the middle school textbooks)
the stems and options in the ELC items showed very limited variability
because of the overlap, a simple count of matches between the options and the stimulus could differ from the difficulty the test takers actually faced
some key sentences were recorded at a high speech rate, but this could be compensated for by the hints and repetitions often found in the stimulus
Results
Item Difficulty
test taker performance: close to the normal distribution
confidence and comprehension indicators: close to the normal distribution
near-linear dependency between the confidence and comprehension indicators (r = .989)
-> In order to avoid multicollinearity, only the comprehension indicator was retained.
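The kind of check that leads to dropping one of two nearly collinear indicators can be sketched in Python as follows. It assumes the reported r is a Pearson correlation; the 0.95 cutoff, the function names, and the data are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(indicators, threshold=0.95):
    """List pairs of indicators whose absolute correlation exceeds the (hypothetical) threshold."""
    return [
        (a, b)
        for a, b in combinations(indicators, 2)
        if abs(pearson_r(indicators[a], indicators[b])) > threshold
    ]


# Hypothetical per-item indicator values for four items.
indicators = {
    "confidence": [3.1, 3.4, 2.8, 3.9],
    "comprehension": [3.0, 3.5, 2.7, 4.0],
    "performance": [0.2, 0.5, 0.3, 0.1],
}
print(collinear_pairs(indicators))
```

When a pair is flagged, keeping only one member of the pair, as was done here with the comprehension indicator, avoids entering two nearly identical variables into the same model.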
Results
Candidate Models 1, 2, and 3 (model diagrams not reproduced)
Results
Model Fit
Model No. | Chi-square (df, sig) | CFI | NNFI | SRMR | RMSEA
1 | 34.22 (29, p = .23) | .99 | .98 | .058 | .026
2 | 39.49 (38, p = .40) | 1.00 | .99 | .055 | .012
3 | 21.00 (17, p = .23) | .99 | .98 | .062 | .039
-> All three models showed good fit to the data. Considering goodness of fit, practicality, and interpretability, the third model, which accounted for item difficulty in terms of stimulus complexity and item/stimulus overlap, was chosen as the final model.
Implications
The frequency of difficult words in a stimulus could be used as an effective means of controlling item difficulty.
While a count of surface matches between a stimulus and its options could indicate high difficulty for a given item, judged ratings of the degree of overlap could point in the opposite direction.
Limitations
a small sample of 120 items made the results of the covariance structure analysis unstable
a small number of raters
a rather simplistic, linear model of the difficulty of the ELC items that does not take test takers into account
Thank You!!!