ONE RULER, MANY TESTS
A Primer on Test Equating
THE PROBLEM
People in APEC economies need to communicate across national, cultural, and economic boundaries.
This requires fluency in an international language, such as English.
Each APEC economy has its own system for teaching and assessing English fluency.
There is no common language scale. There are no common standards.
DOMAIN OF LANGUAGE FLUENCY
SOME OPTIONS
- Use scores from national exams
- Use one of the international English exams:
  - TOEFL (Test of English as a Foreign Language)
  - TOEIC (Test of English for International Communication)
  - IELTS (International English Language Testing System)
- Develop a parallel to the European system (Common European Framework of Reference for Languages)
- Develop new tests specifically for APEC
- Equate existing APEC tests
LIMITATIONS OF EXISTING EXAMS
- Limited supply of seats for international English fluency exams
- Each test is on its own scale
- The TOEFL and TOEIC are proprietary to Educational Testing Service (ETS)
- Appropriate only for adults
- National tests are not comparable across APEC
- None of these scales seems appropriate for establishing a single set of standards
PSYCHOMETRIC SOLUTIONS
- Test Equating
- International Item Banking
- Computer Adaptive Testing
- The Lexile Scale
These methodologies require some background knowledge about Item Response Theory (IRT)
CLASSICAL DEFINITION OF EQUATING
Two tests X and Y are equated when from an examinee’s score on X it is possible to infer his score on Y, and vice versa.
CLASSICAL TEST THEORY
- Same test for everyone
- Representative sample of examinees
- Representative sample of items
- Equate tests through common persons who take both tests
- Do not use item-level data much
- Person measures depend on the items
- Test and item measures depend on the persons
- No way to handle missing data
- No way to assess quality of responses
ITEM RESPONSE THEORY
Famous names: Georg Rasch, Ben Wright, Fred Lord, Allan Birnbaum, 1950s – 1980s
Probability of a Correct Response = function of (Person Ability – Item Difficulty)
IRT models predict values for missing responses.

Competing probability models:
- Rasch model, no extra parameters (Rasch, Wright)
- 2-PL and 3-PL models, which add an item discrimination parameter and a guessing parameter (Lord, Birnbaum)

The Rasch model has a special property when the data fit the model: Objectivity = Invariance = Generalizability
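Written out, the Rasch model gives the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both expressed in logits. A minimal sketch (the function name is mine):

```python
import math

def rasch_probability(ability, difficulty):
    """Rasch model probability of a correct response:
    P = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_probability(0.0, 0.0))            # 0.5
# A person 1 logit above an item succeeds about 73% of the time.
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```

Note that only the difference between the two parameters matters, which is what lets persons and items share one ruler.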
OBJECTIVITY
“Objective measurement” is when the relative measures of persons are the same regardless of which items they take, and the relative measures of items are the same regardless of which persons take them.

The Rasch model does not produce objectivity. It requires it as a condition of fit to the data. The model is used to “edit” the data set until objectivity is achieved.

Classical Test Theory does not have this property.

The 2-PL and 3-PL models fit the data better than the Rasch model and do not require editing, but at the expense of objectivity.
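The trade-off is easiest to see in the response functions themselves. A sketch of the 3-PL model, which collapses to the Rasch model when discrimination is 1 and guessing is 0 (parameter names are mine; this is illustrative, not any package's API):

```python
import math

def three_pl_probability(ability, difficulty, discrimination=1.0, guessing=0.0):
    """3-PL model: P = c + (1 - c) / (1 + exp(-a * (ability - difficulty))).
    The discrimination parameter a is the 2-PL addition; the guessing
    parameter c is the 3-PL addition. With a=1, c=0 this is the Rasch model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic
```

Because each item gets its own discrimination, item characteristic curves can cross: which item is "harder" then depends on who you ask, which is exactly the loss of objectivity described above.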
MISSING DATA IS KEY TO TEST EQUATING
         Easy  [Test X]              Questions              [Test Y]  Difficult
                  1    2    3    4    5    6    7    8    9   10   11   12   % Correct
Less     A        0    0    1    1    0    .    .    .    .    .    .    .     0.40
Fluent   B        1    0    1    0    0    .    .    .    .    .    .    .     0.40
         C        0    1    0    0    1    .    .    .    .    .    .    .     0.40
         D        1    1    1    1    0    .    .    .    .    .    .    .     0.80
         E        0    1    1    0    0    .    .    .    .    .    .    .     0.40
Persons  F        1    1    1    0    0    .    .    .    .    .    .    .     0.60
         G        .    .    .    1    0    0    1    1    0    0    0    0     0.33
         H        .    .    .    1    0    1    0    1    1    1    0    0     0.56
         I        .    .    .    1    1    0    1    1    0    0    0    0     0.44
         J        .    .    .    1    1    1    1    1    1    1    0    1     0.89
More     K        .    .    .    1    1    1    0    0    0    1    1    0     0.56
Fluent   L        .    .    .    1    1    0    1    1    1    1    0    1     0.78
         M        .    .    .    1    1    1    0    1    1    1    1    1     0.89
% Incorrect     0.50 0.33 0.17 0.31 0.54 0.43 0.43 0.14 0.43 0.29 0.71 0.57
EQUATING USING IRT
         Easy  [Test X]              Questions              [Test Y]  Difficult
                  1    2    3    4    5    6    7    8    9   10   11   12   % Correct
Less     A     0    0    1    1    0    0.26 0.22 0.17 0.13 0.09 0.04 0.00    0.24
Fluent   B     1    0    1    0    0    0.30 0.26 0.22 0.17 0.13 0.09 0.04    0.27
         C     0    1    0    0    1    0.35 0.30 0.26 0.22 0.17 0.13 0.09    0.29
         D     1    1    1    1    0    0.39 0.35 0.30 0.26 0.22 0.17 0.13    0.49
         E     0    1    1    0    0    0.43 0.39 0.35 0.30 0.26 0.22 0.17    0.34
Persons  F     1    1    1    0    0    0.48 0.43 0.39 0.35 0.30 0.26 0.22    0.45
         G     0.74 0.70 0.65 1    0    0    1    1    0    0    0    0       0.42
         H     0.78 0.74 0.70 1    0    1    0    1    1    1    0    0       0.60
         I     0.83 0.78 0.74 1    1    0    1    1    0    0    0    0       0.53
         J     0.87 0.83 0.78 1    1    1    1    1    1    1    0    1       0.87
More     K     0.91 0.87 0.83 1    1    1    0    0    0    1    1    0       0.63
Fluent   L     0.96 0.91 0.87 1    1    0    1    1    1    1    0    1       0.81
         M     1.00 0.96 0.91 1    1    1    0    1    1    1    1    1       0.91
% Incorrect    0.30 0.25 0.19 0.31 0.54 0.52 0.54 0.41 0.58 0.53 0.78 0.72
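The filled-in cells are what a Rasch prediction produces once person abilities and item difficulties have been estimated. A sketch of the filling step, assuming logit estimates are already in hand (the data structures and the values in the usage example are illustrative, not taken from the table):

```python
import math

def fill_missing_cells(responses, abilities, difficulties):
    """Replace missing responses (None) with Rasch-predicted probabilities.

    `responses` maps (person, item) -> 0, 1, or None; `abilities` and
    `difficulties` hold logit estimates from a prior calibration."""
    filled = {}
    for (person, item), r in responses.items():
        if r is None:
            # Predicted probability of a correct response for this cell.
            p = 1.0 / (1.0 + math.exp(-(abilities[person] - difficulties[item])))
            filled[(person, item)] = round(p, 2)
        else:
            filled[(person, item)] = r
    return filled

# Hypothetical example: person "A" at -1.0 logits never saw item 2 (2.0 logits).
filled = fill_missing_cells({("A", 1): 1, ("A", 2): None},
                            {"A": -1.0}, {1: 0.0, 2: 2.0})
```

The "% Correct" column in the filled table is then just the row mean over observed scores and predicted probabilities together, which is why everyone can be ranked on the full 12-item test.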
THREE RULES OF EQUATING
Valid Comparisons
Examinees can only legitimately be compared when they can be said, either literally or theoretically, to have taken the same test.

IRT Definition of Equating
Two tests X and Y are equated when from the responses on X and the responses on Y it is possible to infer the responses and total score that students would have received on a common test XY composed of the items from both tests.

Test Linking
In order to infer responses on one test based on another, the tests must somehow be linked.
THREE WAYS TO LINK TESTS X AND Y
Common Persons
Tests X and Y must be administered, in their entirety, to a common sample of persons at more or less the same time.

Common Items
Tests X and Y must have items in common. This is called common-item equating.

Common Objective Characteristics
Tests X and Y must have “objective characteristics” in common. There must be some way to infer from the test questions themselves their likely difficulty without recourse to person responses.
TECHNICAL CLARIFICATIONS
Estimation
IRT algorithms do not bother calculating values for missing cells (unless requested). They use more efficient algorithms to calculate person and item measures that are equivalent to filling in the missing cells.

Measurement units
IRT algorithms do not work in “percent correct.” They work in “log-odds units,” or “logits,” which are a way of going back and forth between probabilities and a ruler-like linear metric.

Error
My pseudo-data set has random error added to it. Error is intrinsic to real-world data and binary raw responses. That is why we work in probabilities.
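The logit metric is simply the log-odds transform and its inverse, which is what makes probabilities behave like positions on a ruler. A short sketch (function names are mine):

```python
import math

def probability_to_logit(p):
    """Convert a probability to log-odds units (logits)."""
    return math.log(p / (1.0 - p))

def logit_to_probability(logit):
    """Convert logits back to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# 0.5 probability sits at 0 logits, the center of the ruler.
print(probability_to_logit(0.5))  # 0.0
```

Unlike percent correct, equal distances in logits mean the same thing everywhere on the scale, which is the "ruler-like linear metric" referred to above.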
IRT REQUIRES UNIDIMENSIONALITY
Unidimensionality means that all the items in a test are sensitive to the same construct, and only that construct. Items differ from each other only in their difficulty, nothing else.

Example: An examinee is very good at math but poor at English. He is given an easy word problem and gets it wrong.

Question: Should we mark the word problem wrong or not?
MULTIDIMENSIONAL IRT
Models exist, but the field has not matured.

Properties of a true MIRT model:
- Fit multidimensional data
- Predict missing values
- Invariant person positions in n-space
- Invariant item positions in the same n-space
- Misfit when invariance is not achieved
- Transferability of coordinates
- Maximal use of information in the data set
- Standard errors for cell estimates and person/item parameters

Example: My own model, called NOUS
TOOLS RELEVANT TO APEC
- International Item Bank
- Computer Adaptive Testing
- The Lexile scale
INTERNATIONAL ITEM BANK
[Diagram: a bank of 12,000 calibrated items laid out from easy (the Test X end) to difficult (the Test Y end). Person abilities (less fluent AA–FF through more fluent GG–MM) and item difficulties are plotted on the same scale.]

“Predict missing cells” -- everyone in effect takes the same giant test.

Save item difficulties and attributes in the bank; this makes it easy to calculate abilities.
INTERNATIONAL ITEM BANK BENEFITS
How it works
- Participating countries “withdraw” items
- Combine bank items with their own items, administer
- New items, with data, are deposited to the bank
- Items are edited and calibrated, made available

Benefits
- Different tests, a common scale
- Excellent test security
- Test all grades and ability levels
- Freedom of use, not tied to proprietary tests
- Computer adaptive testing
COMPUTER ADAPTIVE TESTING (CAT)
How it works
- At least 12,000 items spread across all ability levels
- Unidimensionality rigorously enforced
- Item difficulties pre-calibrated
- Examinee is administered items matched to his continuously recalculated ability, until a desired precision is achieved

Benefits
- Fast, secure, computerized
- All examinees are measured with equal precision
- None of the problems of paper tests

Problematic with Writing and Speaking
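The steps above can be sketched as a toy CAT loop. This is illustrative only, not an operational algorithm: the grid-search ability estimator, the 0.4-logit default precision target, and the tiny bank in the usage note are my own simplifications (a real CAT draws from thousands of items and uses faster estimation).

```python
import math

def estimate_theta(difficulties, responses):
    """Rasch maximum-likelihood ability estimate by grid search over
    [-4, 4] logits. Slow but robust enough for a sketch."""
    def log_lik(theta):
        ll = 0.0
        for b, r in zip(difficulties, responses):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            ll += math.log(p) if r else math.log(1.0 - p)
        return ll
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=log_lik)

def run_cat(bank, answer_fn, se_target=0.4, max_items=30):
    """Minimal CAT loop: administer the unused item whose pre-calibrated
    difficulty is closest to the current ability estimate, re-estimate,
    and stop when the standard error drops below `se_target` or the bank
    runs out. `bank` maps item id -> difficulty in logits; `answer_fn(item)`
    returns the examinee's scored response (0 or 1)."""
    theta, administered, responses = 0.0, [], []
    while len(administered) < max_items:
        remaining = [i for i in bank if i not in administered]
        if not remaining:
            break
        item = min(remaining, key=lambda i: abs(bank[i] - theta))
        administered.append(item)
        responses.append(answer_fn(item))
        theta = estimate_theta([bank[i] for i in administered], responses)
        # Test information; its inverse square root is the standard error.
        info = 0.0
        for i in administered:
            p = 1.0 / (1.0 + math.exp(-(theta - bank[i])))
            info += p * (1.0 - p)
        if 1.0 / math.sqrt(info) < se_target:
            break
    return theta, administered

# Hypothetical seven-item bank and a deterministic examinee who answers
# correctly exactly when an item is easier than 1.0 logits.
bank = {1: -3.0, 2: -2.0, 3: -1.0, 4: 0.0, 5: 1.0, 6: 2.0, 7: 3.0}
theta, administered = run_cat(bank, lambda i: 1 if bank[i] < 1.0 else 0)
```

With so few items the loop exhausts the bank before hitting the precision target, which illustrates why the slides call for 12,000 items: precision comes from accumulated test information near the examinee's ability.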
THE LEXILE SCALE

How it works
- Rasch-based software uses “objective characteristic” equating to measure the relative reading difficulty of text passages
- Two “objective characteristics” are used as predictors:
  - Semantic Difficulty (how frequently each word in a passage is used in general English)
  - Syntactic Complexity (how many words are in each sentence)

How it is used
- Tests are scanned and Lexile difficulties calculated
- Books and curricula are also scanned and “Lexiled”
- Examinees receive Lexile reading ability scores
- Teachers match examinees to materials
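The two predictors can be approximated with simple text statistics. This sketch uses a tiny made-up word-frequency table standing in for a general-English corpus; the actual Lexile Analyzer relies on MetaMetrics' proprietary corpus and calibration equation, which are not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical frequency counts standing in for a general-English corpus.
WORD_FREQUENCIES = Counter({"the": 1000, "on": 800, "cat": 50, "sat": 40,
                            "mat": 10, "photosynthesis": 1})

def lexile_style_predictors(text):
    """Compute the two 'objective characteristics' the slides describe:
    mean log word frequency (semantic difficulty; lower means harder words)
    and mean sentence length in words (syntactic complexity; higher means
    more complex sentences)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    mean_log_freq = sum(math.log(WORD_FREQUENCIES.get(w, 1))
                        for w in words) / len(words)
    mean_sentence_length = len(words) / len(sentences)
    return mean_log_freq, mean_sentence_length

mean_log_freq, mean_sentence_length = lexile_style_predictors(
    "The cat sat on the mat.")
```

Because both predictors come from the text itself, two passages (or two tests) can be placed on the same difficulty scale without any common persons or common items, which is the "objective characteristics" linking method described earlier.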
BENEFITS OF THE LEXILE SCALE
All APEC tests on one English fluency scale
Scale is objective, rigorous, and already applied to many assessments and thousands of books.

Practical, inexpensive, straightforward
MetaMetrics scans tests, assigns Lexiles. No equating studies, no need for item banks.

Usable by teachers
Lots of Lexiled curricular materials and books. (See attached excerpt of the Lexile scale, from MetaMetrics.)

Easy to set performance standards
Each level of the scale has thousands of exemplars.
MORE LEXILE BENEFITS

Transparency
The scale is based on objective test characteristics that can be independently verified and replicated. If MetaMetrics disappears, the Lexile scale can be recreated. Proprietary tests like the TOEFL and TOEIC have scales completely dependent on their creators.

Stability
A scale based on objective characteristics is less likely to change difficulty or distort over time.

Applicable to many kinds of language fluency
- Academic
- Business
- Reading, Writing (not yet Listening and Speaking)
- English and Spanish, extendable to Asian languages
CONCLUSION
The problem of establishing a common scale and common language standards for the APEC economies – particularly as regards the learning, teaching, and usage of English – is eminently solvable.
Psychometric tools have been developed over the last 40 years in response to the need to equate tests.
Three such tools relevant to APEC needs are:
- An International Item Bank
- Computer Adaptive Testing
- The Lexile framework
CONTACT INFORMATION
Personal Information
Mark H. Moulton, Ph.D., Educational Data Systems
[email protected]

Complete “One Ruler…” paper, plus NOUS and multidimensional equating
www.eddata.com/resources/publications/

Rasch Models
www.winsteps.com/

Item banks and CAT
www.nwea.org/

The Lexile framework
www.lexile.com/EntrancePageFlash.html