ONE RULER, MANY TESTS
A Primer on Test Equating
THE PROBLEM
People in APEC economies need to communicate across national, cultural, and economic boundaries.
This requires fluency in an international language, such as English.
Each APEC economy has its own system for teaching and assessing English fluency.
There is no common language scale. There are no common standards.
DOMAIN OF LANGUAGE FLUENCY
SOME OPTIONS
- Use scores from national exams
- Use one of the international English exams:
  - TOEFL (Test of English as a Foreign Language)
  - TOEIC (Test of English for International Communication)
  - IELTS (International English Language Testing System)
- Develop a parallel to the European system (Common European Framework of Reference for Languages)
- Develop new tests specifically for APEC
- Equate existing APEC tests
LIMITATIONS OF EXISTING EXAMS
- Limited supply of seats for international English fluency exams
- Each test is on its own scale
- The TOEFL and TOEIC are proprietary to Educational Testing Service (ETS)
- Appropriate only for adults
- National tests are not comparable across APEC
- None of these scales seems appropriate for establishing a single set of standards
PSYCHOMETRIC SOLUTIONS
- Test Equating
- International Item Banking
- Computer Adaptive Testing
- The Lexile Scale
These methodologies require some background knowledge about Item Response Theory (IRT)
CLASSICAL DEFINITION OF EQUATING
Two tests X and Y are equated when from an examinee’s score on X it is possible to infer his score on Y, and vice versa.
CLASSICAL TEST THEORY
- Same test for everyone
- Representative sample of examinees
- Representative sample of items
- Equate tests through common persons who take both tests
- Do not use item-level data much
- Person measures depend on the items
- Test and item measures depend on the persons
- No way to handle missing data
- No way to assess quality of responses
ITEM RESPONSE THEORY
Famous names: Georg Rasch, Ben Wright, Fred Lord, Allan Birnbaum, 1950s – 1980s
Probability of a Correct Response = function of (Person Ability – Item Difficulty)
IRT models predict values for missing responses.

Competing probability models:
- Rasch model, no extra parameters (Rasch, Wright)
- 2-PL and 3-PL models, which add an item discrimination parameter and a guessing parameter (Lord, Birnbaum)

The Rasch model has a special property when the data fit the model: Objectivity = Invariance = Generalizability
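Written out, the Rasch model gives the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both expressed in logits. A minimal sketch (the function name is mine):

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model probability of a correct response:
    P = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_probability(0.0, 0.0))            # 0.5
# A person 1 logit above an item succeeds about 73% of the time.
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```

Note that only the difference between the two parameters matters, which is what lets persons and items share one ruler.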
OBJECTIVITY
“Objective measurement” is when the relative measures of persons are the same regardless of which items they take, and the relative measures of items are the same regardless of which persons take them.

The Rasch model does not produce objectivity. It requires it as a condition of fit to the data. The model is used to “edit” the data set until objectivity is achieved.

Classical Test Theory does not have this property.

The 2-PL and 3-PL models fit the data better than the Rasch model and do not require editing, but at the expense of objectivity.
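The trade-off is easiest to see in the response functions themselves. A sketch of the 3-PL model, which collapses to the Rasch model when discrimination is 1 and guessing is 0 (parameter names are mine; this is illustrative, not any package's API):

```python
import math

def three_pl_probability(ability, difficulty, discrimination=1.0, guessing=0.0):
    """3-PL model: P = c + (1 - c) / (1 + exp(-a * (ability - difficulty))).
    The discrimination parameter a is the 2-PL addition; the guessing
    parameter c is the 3-PL addition. With a=1, c=0 this is the Rasch model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic
```

Because each item gets its own discrimination, item characteristic curves can cross: which item is "harder" then depends on who you ask, which is exactly the loss of objectivity described above.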
MISSING DATA IS KEY TO TEST EQUATING
         Easy  [Test X]              Questions              [Test Y]  Difficult
                  1    2    3    4    5    6    7    8    9   10   11   12   % Correct
Less     A        0    0    1    1    0    .    .    .    .    .    .    .     0.40
Fluent   B        1    0    1    0    0    .    .    .    .    .    .    .     0.40
         C        0    1    0    0    1    .    .    .    .    .    .    .     0.40
         D        1    1    1    1    0    .    .    .    .    .    .    .     0.80
         E        0    1    1    0    0    .    .    .    .    .    .    .     0.40
Persons  F        1    1    1    0    0    .    .    .    .    .    .    .     0.60
         G        .    .    .    1    0    0    1    1    0    0    0    0     0.33
         H        .    .    .    1    0    1    0    1    1    1    0    0     0.56
         I        .    .    .    1    1    0    1    1    0    0    0    0     0.44
         J        .    .    .    1    1    1    1    1    1    1    0    1     0.89
More     K        .    .    .    1    1    1    0    0    0    1    1    0     0.56
Fluent   L        .    .    .    1    1    0    1    1    1    1    0    1     0.78
         M        .    .    .    1    1    1    0    1    1    1    1    1     0.89
% Incorrect     0.50 0.33 0.17 0.31 0.54 0.43 0.43 0.14 0.43 0.29 0.71 0.57
EQUATING USING IRT
         Easy  [Test X]              Questions              [Test Y]  Difficult
                  1    2    3    4    5    6    7    8    9   10   11   12   % Correct
Less     A     0    0    1    1    0    0.26 0.22 0.17 0.13 0.09 0.04 0.00    0.24
Fluent   B     1    0    1    0    0    0.30 0.26 0.22 0.17 0.13 0.09 0.04    0.27
         C     0    1    0    0    1    0.35 0.30 0.26 0.22 0.17 0.13 0.09    0.29
         D     1    1    1    1    0    0.39 0.35 0.30 0.26 0.22 0.17 0.13    0.49
         E     0    1    1    0    0    0.43 0.39 0.35 0.30 0.26 0.22 0.17    0.34
Persons  F     1    1    1    0    0    0.48 0.43 0.39 0.35 0.30 0.26 0.22    0.45
         G     0.74 0.70 0.65 1    0    0    1    1    0    0    0    0       0.42
         H     0.78 0.74 0.70 1    0    1    0    1    1    1    0    0       0.60
         I     0.83 0.78 0.74 1    1    0    1    1    0    0    0    0       0.53
         J     0.87 0.83 0.78 1    1    1    1    1    1    1    0    1       0.87
More     K     0.91 0.87 0.83 1    1    1    0    0    0    1    1    0       0.63
Fluent   L     0.96 0.91 0.87 1    1    0    1    1    1    1    0    1       0.81
         M     1.00 0.96 0.91 1    1    1    0    1    1    1    1    1       0.91
% Incorrect    0.30 0.25 0.19 0.31 0.54 0.52 0.54 0.41 0.58 0.53 0.78 0.72
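The filled-in cells are what a Rasch prediction produces once person abilities and item difficulties have been estimated. A sketch of the filling step, assuming logit estimates are already in hand (the data structures and the values in the usage example are illustrative, not taken from the table):

```python
import math

def fill_missing_cells(responses, abilities, difficulties):
    """Replace missing responses (None) with Rasch-predicted probabilities.

    `responses` maps (person, item) -> 0, 1, or None; `abilities` and
    `difficulties` hold logit estimates from a prior calibration."""
    filled = {}
    for (person, item), r in responses.items():
        if r is None:
            # Predicted probability of a correct response for this cell.
            p = 1.0 / (1.0 + math.exp(-(abilities[person] - difficulties[item])))
            filled[(person, item)] = round(p, 2)
        else:
            filled[(person, item)] = r
    return filled

# Hypothetical example: person "A" at -1.0 logits never saw item 2 (2.0 logits).
filled = fill_missing_cells({("A", 1): 1, ("A", 2): None},
                            {"A": -1.0}, {1: 0.0, 2: 2.0})
```

The "% Correct" column in the filled table is then just the row mean over observed scores and predicted probabilities together, which is why everyone can be ranked on the full 12-item test.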
THREE RULES OF EQUATING
Valid Comparisons
Examinees can only legitimately be compared when they can be said, either literally or theoretically, to have taken the same test.

IRT Definition of Equating
Two tests X and Y are equated when from the responses on X and the responses on Y it is possible to infer the responses and total score that students would have received on a common test XY composed of the items from both tests.

Test Linking
In order to infer responses on one test based on another, the tests must somehow be linked.
THREE WAYS TO LINK TESTS X AND Y
Common Persons
Tests X and Y must be administered, in their entirety, to a common sample of persons at more or less the same time.

Common Items
Tests X and Y must have items in common. This is called common-item equating.

Common Objective Characteristics
Tests X and Y must have “objective characteristics” in common. There must be some way to infer from the test questions themselves their likely difficulty without recourse to person responses.
TECHNICAL CLARIFICATIONS
Estimation
IRT algorithms do not bother calculating values for missing cells (unless requested). They use more efficient algorithms to calculate person and item measures that are equivalent to filling in the missing cells.

Measurement units
IRT algorithms do not work in “percent correct.” They work in “log-odds units,” or “logits,” which are a way of going back and forth between probabilities and a ruler-like linear metric.

Error
My pseudo-data set has random error added to it. Error is intrinsic to real-world data and binary raw responses. That is why we work in probabilities.
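The logit metric is simply the log-odds transform and its inverse, which is what makes probabilities behave like positions on a ruler. A short sketch (function names are mine):

```python
import math

def probability_to_logit(p):
    """Convert a probability to log-odds units (logits)."""
    return math.log(p / (1.0 - p))

def logit_to_probability(logit):
    """Convert logits back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# 0.5 probability sits at 0 logits, the center of the ruler.
print(probability_to_logit(0.5))  # 0.0
```

Unlike percent correct, equal distances in logits mean the same thing everywhere on the scale, which is the "ruler-like linear metric" referred to above.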
IRT REQUIRES UNIDIMENSIONALITY
Unidimensionality means that all the items in a test are sensitive to the same construct, and only that construct. Items differ from each other only in their difficulty, nothing else.

Example: An examinee is very good at math but poor at English. He is given an easy word problem and gets it wrong.

Question: Should we mark the word problem wrong or not?
MULTIDIMENSIONAL IRT
Models exist, but the field has not matured.

Properties of a true MIRT model:
- Fit multidimensional data
- Predict missing values
- Invariant person positions in n-space
- Invariant item positions in the same n-space
- Misfit when invariance is not achieved
- Transferability of coordinates
- Maximal use of information in the data set
- Standard errors for cell estimates and person/item parameters

Example: My own model, called NOUS
TOOLS RELEVANT TO APEC
- International Item Bank
- Computer Adaptive Testing
- The Lexile scale
INTERNATIONAL ITEM BANK
[Diagram: a bank of 12,000 calibrated items laid out from easy (the Test X end) to difficult (the Test Y end). Person abilities (less fluent AA–FF through more fluent GG–MM) and item difficulties are plotted on the same scale.]

“Predict missing cells” -- everyone in effect takes the same giant test.

Save item difficulties and attributes in the bank; this makes it easy to calculate abilities.
INTERNATIONAL ITEM BANK BENEFITS
How it works
- Participating countries “withdraw” items
- Combine bank items with their own items, administer
- New items, with data, are deposited to the bank
- Items are edited and calibrated, made available

Benefits
- Different tests, a common scale
- Excellent test security
- Test all grades and ability levels
- Freedom of use, not tied to proprietary tests
- Computer adaptive testing
COMPUTER ADAPTIVE TESTING (CAT)
How it works
- At least 12,000 items spread across all ability levels
- Unidimensionality rigorously enforced
- Item difficulties pre-calibrated
- Examinee is administered items matched to his continuously recalculated ability, until a desired precision is achieved

Benefits
- Fast, secure, computerized
- All examinees are measured with equal precision
- None of the problems of paper tests

Problematic with Writing and Speaking
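The steps above can be sketched as a toy CAT loop. This is illustrative only, not an operational algorithm: the grid-search ability estimator, the 0.4-logit default precision target, and the tiny bank in the usage note are my own simplifications (a real CAT draws from thousands of items and uses faster estimation).

```python
import math

def estimate_theta(difficulties, responses):
    """Rasch maximum-likelihood ability estimate by grid search over
    [-4, 4] logits. Slow but robust enough for a sketch."""
    def log_lik(theta):
        ll = 0.0
        for b, r in zip(difficulties, responses):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=log_lik)

def run_cat(bank, answer_fn, se_target=0.4, max_items=30):
    """Minimal CAT loop: administer the unused item whose pre-calibrated
    difficulty is closest to the current ability estimate, re-estimate,
    and stop when the standard error drops below `se_target` or the bank
    runs out. `bank` maps item id -> difficulty in logits; `answer_fn(item)`
    returns the examinee's scored response (0 or 1)."""
    theta, administered, responses = 0.0, [], []
    while len(administered) < max_items:
        remaining = [i for i in bank if i not in administered]
        if not remaining:
            break
        item = min(remaining, key=lambda i: abs(bank[i] - theta))
        administered.append(item)
        responses.append(answer_fn(item))
        theta = estimate_theta([bank[i] for i in administered], responses)
        # Test information; its inverse square root is the standard error.
        info = 0.0
        for i in administered:
            p = 1.0 / (1.0 + math.exp(-(theta - bank[i])))
            info += p * (1.0 - p)
        if 1.0 / math.sqrt(info) < se_target:
            break
    return theta, administered

# Hypothetical seven-item bank and a deterministic examinee who answers
# correctly exactly when an item is easier than 1.0 logits.
bank = {1: -3.0, 2: -2.0, 3: -1.0, 4: 0.0, 5: 1.0, 6: 2.0, 7: 3.0}
theta, administered = run_cat(bank, lambda i: 1 if bank[i] < 1.0 else 0)
```

With so few items the loop exhausts the bank before hitting the precision target, which illustrates why the slides call for 12,000 items: precision comes from accumulated test information near the examinee's ability.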
THE LEXILE SCALE

How it works
- Rasch-based software uses “objective characteristic” equating to measure the relative reading difficulty of text passages
- Two “objective characteristics” are used as predictors:
  - Semantic Difficulty (how frequently each word in a passage is used in general English)
  - Syntactic Complexity (how many words are in each sentence)

How it is used
- Tests are scanned and Lexile difficulties calculated
- Books and curricula are also scanned and “Lexiled”
- Examinees receive Lexile reading ability scores
- Teachers match examinees to materials
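The two predictors can be approximated with simple text statistics. This sketch uses a tiny made-up word-frequency table standing in for a general-English corpus; the actual Lexile Analyzer relies on MetaMetrics' proprietary corpus and calibration equation, which are not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical frequency counts standing in for a general-English corpus.
WORD_FREQUENCIES = Counter({"the": 1000, "on": 800, "cat": 50, "sat": 40,
                            "mat": 10, "photosynthesis": 1})

def lexile_style_predictors(text):
    """Compute the two 'objective characteristics' the slides describe:
    mean log word frequency (semantic difficulty; lower means harder words)
    and mean sentence length in words (syntactic complexity; higher means
    more complex sentences)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    mean_log_freq = sum(math.log(WORD_FREQUENCIES.get(w, 1))
                        for w in words) / len(words)
    mean_sentence_length = len(words) / len(sentences)
    return mean_log_freq, mean_sentence_length

mean_log_freq, mean_sentence_length = lexile_style_predictors(
    "The cat sat on the mat.")
```

Because both predictors come from the text itself, two passages (or two tests) can be placed on the same difficulty scale without any common persons or common items, which is the "objective characteristics" linking method described earlier.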
BENEFITS OF THE LEXILE SCALE
All APEC tests on one English fluency scale
Scale is objective, rigorous, and already applied to many assessments and thousands of books.

Practical, inexpensive, straightforward
MetaMetrics scans tests, assigns Lexiles. No equating studies, no need for item banks.

Usable by teachers
Lots of Lexiled curricular materials and books. (See attached excerpt of the Lexile scale, from MetaMetrics.)

Easy to set performance standards
Each level of the scale has thousands of exemplars.
MORE LEXILE BENEFITS

Transparency
The scale is based on objective test characteristics that can be independently verified and replicated. If MetaMetrics disappears, the Lexile scale can be recreated. Proprietary tests like the TOEFL and TOEIC have scales completely dependent on their creators.

Stability
A scale based on objective characteristics is less likely to change difficulty or distort over time.

Applicable to many kinds of language fluency
- Academic
- Business
- Reading, Writing (not yet Listening and Speaking)
- English and Spanish, extendable to Asian languages
CONCLUSION
The problem of establishing a common scale and common language standards for the APEC economies – particularly as regards the learning, teaching, and usage of English – is eminently solvable.
Psychometric tools have been developed over the last 40 years in response to the need to equate tests.
Three such tools relevant to APEC needs are:
- An International Item Bank
- Computer Adaptive Testing
- The Lexile framework
CONTACT INFORMATION
Personal Information
Mark H. Moulton, Ph.D., Educational Data Systems
[email protected]

Complete “One Ruler…” paper, plus NOUS and multidimensional equating
www.eddata.com/resources/publications/

Rasch Models
www.winsteps.com/

Item banks and CAT
www.nwea.org/

The Lexile framework
www.lexile.com/EntrancePageFlash.html