
  • A Practitioner's Introduction to Equating
    with Primers on Classical Test Theory (CTT) and Item Response Theory (IRT)

    Joseph Ryan, Arizona State University
    Frank Brockmann, Center Point Assessment Solutions

    Workshop: Assessment, Research and Evaluation Colloquium
    Neag School of Education, University of Connecticut
    October 22, 2010

  • Acknowledgments
    Council of Chief State School Officers (CCSSO)

    Technical Issues in Large Scale Assessment (TILSA) and Subcommittee on Equating, part of the State Collaborative on Assessment and Student Standards (SCASS)

    Doug Rindone and Duncan MacQuarrie, CCSSO TILSA Co-Advisers
    Phoebe Winter, Consultant
    Michael Muenks, TILSA Equating Subcommittee Chair

    Technical Special Interest Group of National Assessment of Educational Progress (NAEP) coordinators

    Hariharan Swaminathan, University of Connecticut

    Special thanks to Michael Kolen, University of Iowa

  • Workshop Topics
    The workshop covers the following topics:

    1. Overview: key concepts of assessment, linking, and equating
    2. Measurement Primer: classical test theory (CTT) and item response theory (IRT)
    3. Equating Basics
    4. The Mechanics of Equating
    5. Equating Issues

  • 1. Overview: Key Concepts in Assessment, Linking, and Equating

  • Assessment, Linking, and Equating

    Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment.

    (Messick, 1989, p. 13)

    Validity is the essential motivation for developing and evaluating appropriate linking and equating procedures.

  • Assessment, Linking, and Equating

  • Linking and Equating

    Equating
    Scale aligning
    Predicting/Projecting

    Holland in Dorans, Pommerich and Holland (2007)

  • Misconceptions About Equating

    Equating is...
    a threat to measuring gains. (MYTH)
    a tool for universal applications. (WISHFUL THOUGHT)
    a repair shop. (MISCONCEPTION)
    a semantic misappropriation. (MISUNDERSTANDING)

  • 2. Measurement Primer: Classical Test Theory (CTT) and Item Response Theory (IRT)

  • Classical Test Theory: The Basic Model

    O = T + E

    O = observed score
    T = true score
    E = error (with some MAJOR assumptions)

    Reliability is derived from the ratio of true-score variance to observed-score variance.

    Key item features include:
    Difficulty
    Discrimination
    Distractor Analysis
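
    To make the model concrete, here is a minimal Python sketch (not from the original workshop; all values are made up) that simulates O = T + E and estimates reliability as the ratio of true-score variance to observed-score variance:

```python
# Minimal CTT simulation: O = T + E with hypothetical variances.
import numpy as np

rng = np.random.default_rng(42)
n_students = 10_000

true_scores = rng.normal(loc=50, scale=10, size=n_students)  # T
errors = rng.normal(loc=0, scale=5, size=n_students)         # E: mean zero, independent of T
observed = true_scores + errors                              # O = T + E

# Reliability = var(T) / var(O); here 100 / (100 + 25) = 0.80 in expectation.
reliability = true_scores.var() / observed.var()
print(f"Estimated reliability: {reliability:.3f}")
```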

  • Classical Test Theory

    Reliability reflects the consistency of students' scores:
    Over time (test-retest)
    Over forms (alternate form)
    Within forms (internal consistency)

    Validity reflects the degree to which scores assess what the test is designed to measure, in terms of:
    Content
    Criterion-related measures
    Construct

  • Item Response Theory (IRT): The Concept

    An approach to item and test analysis that estimates students' probable responses to test questions, based on:
    the ability of the students
    one or more characteristics of the test items

  • Item Response Theory (IRT)
    IRT is now used in most large-scale assessment programs.

    IRT models apply to items that use dichotomous scoring, with right (1) or wrong (0) answers, and polytomous scoring, with items scored in ordered categories (1, 2, 3, 4), common with written essays and open-ended constructed-response items.

    IRT is used in addition to procedures from CTT. (INFO)

  • Item Response Theory (IRT): IRT Models

    All IRT models reflect the ability of students. In addition, the most common basic IRT models include:

    The 1-parameter model (aka Rasch model) models item difficulty

    The 2-parameter model models item difficulty and discrimination

    The 3-parameter model models item difficulty, discrimination and pseudo guessing
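
    As an illustration (not part of the original slides), the sketch below implements the item response function shared by these three models; the parameter values are hypothetical. With a = 1 and c = 0 it reduces to the 1-parameter (Rasch) case, and with c = 0 alone to the 2-parameter case:

```python
# Item response function for the 1-, 2-, and 3-parameter logistic models.
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response given ability theta, difficulty b,
    discrimination a, and pseudo-guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A student at theta = 0 facing easy, medium, and hard items (3PL values):
for b in (-1.0, 0.0, 1.0):
    print(f"b = {b:+.1f}: P = {p_correct(0.0, b, a=1.2, c=0.2):.3f}")
```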

  • Item Response Theory (IRT): IRT Assumptions

    Item Response Theory requires major assumptions:

    Unidimensionality

    Item Independence

    Data-Model Fit

    Fixed but arbitrary scale origin

  • Item Response Theory (IRT): A Simple Conceptualization
    [Figure: students and items located on a common scale, with values from -1.5 to +2.25]

  • Item Response Theory (IRT): Probability of a Student Answer

  • Item Response Theory (IRT): Item Characteristic Curve for Item 2

  • Item Response Theory (IRT)

  • IRT and Flexibility
    IRT provides considerable flexibility in terms of:

    constructing alternate test forms
    administering tests well matched or adapted to students' ability level
    building sets of connected tests that span a wide range (perhaps two or more grades)
    inserting or embedding new items into existing test forms for field-testing purposes, so new items can be placed on the measurement scale

    (INFO)

  • 3. Equating Basics
    Basic Terms (Sets 1, 2, and 3)
    Equating Designs (a, b, c)
    Item Banking (a, b, c, d)

  • Basic Terms Set 1 (USEFUL TERMS)
    Match each term in Column A with an option in Column B:

    Column A             Column B
    __ Anchor Items      A. Sleepwear
    __ Appended Items    B. Nautically themed apparel
    __ Embedded Items    C. Vestigial organs
                         D. EMIP learning module

  • Basic Terms Set 2 (USEFUL TERMS)
    For each term, make some notes on your handout:

    Pre-equating -
    Post-equating -

  • Basic Terms Set 3 (USEFUL TERMS)
    For each term, make some notes on your handout:

    Horizontal Equating
    Vertical Equating (Vertical Scaling)
    Form-to-Form (Chained) Equating
    Item Banking

  • Equating Designs
    a. Random Equivalent Groups
    b. Single Group
    c. Anchor Items

  • Equating Designs: a. Random Equivalent Groups

  • Equating Designs: b. Single Group

    CAUTION: The potential for order effects is significant; equating designs that use this data collection method should always be counterbalanced!

  • Equating Designs: b. Single Group with Counterbalance

  • Equating Designs: c. Anchor Item Design
    (Anchor items are not always at the end of the form.)

  • Equating Designs: c. Anchor Item Set

  • Equating Designs: c. Anchor Item Designs (USEFUL TERMS)

    Internal/Embedded
    Internal/Appended
    External

  • Equating Designs: Internal Embedded Anchor Items

  • Equating Designs: Internal Appended Anchor Items

  • Equating Designs: External Anchor Items

  • Equating Designs: Guidelines for Anchor Items (RULES of THUMB)

    Mini-Test
    Similar Location
    No Alterations
    Item Format Representation

  • 3. Equating Basics
    Basic Terms (Sets 1, 2, and 3)
    Equating Designs (a, b, c)
    Item Banking (a, b, c, d)

  • Item Banking
    a. Basic Concepts
    b. Anchor-Item Based Field Test
    c. Matrix Sampling
    d. Spiraling Forms

  • Item Banking: a. Basic Concepts

    An item bank is a large collection of calibrated and scaled test items representing the full range, depth, and detail of the content standards.
    Item bank development is supported by field testing a large number of items, often with one or more anchor item sets.
    Item banks are designed to provide a pool of items from which equivalent test forms can be built.
    Pre-equated forms are based on a large and stable item bank.

  • Item Banking: b. Anchor-Item Based Field Test Design

    RULE of THUMB: Field test items are most appropriately embedded within, not appended to, the common items.

  • Item Banking: c. Matrix Sampling

    Items can be assembled into relatively small blocks (or sets) of items.
    A small number of blocks can be assigned to each test form to reduce test length.
    Blocks may be assigned to multiple forms to enhance equating.
    Blocks need not be assigned to multiple forms if randomly equivalent groups are used.
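
    A minimal sketch of the block-to-form idea (block and form labels are hypothetical, not from the workshop): overlapping block assignments are what link the forms for equating.

```python
# Matrix sampling sketch: the item pool is split into blocks, and each
# (hypothetical) form carries only a couple of blocks to reduce test length.
blocks = {
    "B1": ["item01", "item02", "item03"],
    "B2": ["item04", "item05", "item06"],
    "B3": ["item07", "item08", "item09"],
    "B4": ["item10", "item11", "item12"],
}

# Assigning some blocks to multiple forms creates the overlap used in equating.
form_design = {
    "Form1": ["B1", "B2"],
    "Form2": ["B2", "B3"],
    "Form3": ["B3", "B4"],
}

forms = {form: [item for b in block_ids for item in blocks[b]]
         for form, block_ids in form_design.items()}
print(forms["Form2"])  # the items from blocks B2 and B3
```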

  • Item Banking: c. Matrix Sampling

  • Item Banking: d. Spiraling Forms

    Test forms can be assigned to individual students, or to students grouped in classrooms, schools, districts, or some other units.

    Spiraling at the student level involves assigning different forms to different students within a classroom.

    Spiraling at the classroom level involves assigning different forms to different classrooms within a school.

    Spiraling at the school or district level follows a similar pattern.

  • Item Banking: d. Spiraling Forms

  • Item Banking: d. Spiraling Forms

    Spiraling at the student level is technically desirable:
    provides randomly equivalent groups
    minimizes classroom effects on IRT estimates (most IRT procedures assume independent responses)

    Spiraling at the student level is logistically problematic:
    exposes all items in one location
    requires careful monitoring of test packets and distribution
    requires matching test form to answer key at the student level
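
    A minimal sketch of student-level spiraling (hypothetical student IDs and form names): cyclic distribution hands successive forms to successive students, which approximates randomly equivalent groups.

```python
# Spiraling sketch: forms are distributed cyclically within a classroom.
forms = ["Form1", "Form2", "Form3"]
students = [f"student_{i:03d}" for i in range(9)]

# Student 0 gets Form1, student 1 gets Form2, student 2 gets Form3, repeat.
assignment = {s: forms[i % len(forms)] for i, s in enumerate(students)}
for student, form in assignment.items():
    print(student, form)
```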

  • It's Never Simple!
    Linking and equating procedures are employed in the broader context of educational measurement, which includes at least the following sources of random variation (statistical error variance) or imprecision:

    Content and process representation
    Errors of measurement
    Sampling errors
    Violations of assumptions
    Parameter estimation variance
    Equating estimation variance

    (IMPORTANT CAUTION)

  • 4. The Mechanics of Equating
    The Linking-Equating Continuum
    Classical Test Theory (CTT) Approaches
    Item Response Theory (IRT) Approaches

  • The Linking-Equating Continuum (USEFUL TERMS)

    Linking is the broadest term, used to refer to a collection of procedures through which performance on one assessment is associated or paired with performance on a second assessment.

    Equating is the strongest claim made about the relationship between performance on two assessments; it asserts that the scores that are equated have the same substantive meaning.

  • The Linking-Equating Continuum
    [Figure: a continuum running from the different forms of linking up to equating, the strongest kind of linking]

  • The Linking-Equating Continuum: Frameworks

  • The Linking-Equating Continuum
    In 1992, Mislevy described four typologies for linking test forms: moderation, projection, calibration, and equating (Mislevy, 1992, pp. 21-26). In his model, moderation is the weakest form of linking tests, while equating is considered the strongest type. Thus, equating is done to make scores as interchangeable as possible.

  • The Linking-Equating Continuum (USEFUL TERMS)

    Equating: strongest form of linking; invariant across populations; maintains substantive meaning.
    Calibration: may use equating procedures; not necessarily invariant across populations, and substantive meaning might not be preserved.
    Prediction/Projection: unidirectional statistical procedure for predicting scores or projecting distributions.
    Moderation: weakest form of linking; may be statistical or judgmental (social), based on comparisons of distributions or panel/reviewer decisions.

  • CTT Linking-Equating Approaches

    a. Mean Method
    b. Linear Method
    c. Equipercentile Method

  • CTT Linking-Equating Approaches: a. Mean Method

    Adjusts one set of scores based on the difference in the means of two tests

    Assumes a constant difference in the scales across all scores

    Useful for carefully developed and parallel or close-to-parallel forms

    Simple, but strains assumptions of parallel forms
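
    A minimal sketch of the mean method in Python (hypothetical score vectors; assumes the two groups are randomly equivalent):

```python
# Mean method sketch: Form Y scores move to the Form X scale by adding the
# difference in form means, assumed constant across the whole score range.
import numpy as np

scores_x = np.array([12, 15, 18, 20, 25], dtype=float)  # group taking Form X
scores_y = np.array([10, 13, 16, 18, 23], dtype=float)  # group taking Form Y

constant = scores_x.mean() - scores_y.mean()  # 18.0 - 16.0 = 2.0
y_on_x_scale = scores_y + constant
print(constant, y_on_x_scale)
```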

  • CTT Linking-Equating Approaches: b. Linear Method

    Based on setting standardized deviation scores from two tests equal

    Can be done in raw score scale with simple linear regression
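
    A minimal sketch of the linear method (hypothetical means and SDs): setting (x - mean_x) / sd_x = (y - mean_y) / sd_y and solving for x gives a straight-line conversion of Form Y scores.

```python
# Linear method sketch: equate standardized deviation scores on two forms.
def linear_equate(y, mean_x, sd_x, mean_y, sd_y):
    """Return the Form X equivalent of a Form Y score y."""
    return mean_x + (sd_x / sd_y) * (y - mean_y)

# Hypothetical moments: Form X has mean 50, SD 10; Form Y has mean 45, SD 8.
print(linear_equate(45, 50, 10, 45, 8))  # Form Y mean maps to Form X mean: 50.0
print(linear_equate(53, 50, 10, 45, 8))  # one SD above the mean maps to 60.0
```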

  • CTT Linking-Equating Approaches: b. Linear Method

  • CTT Linking-Equating Approaches: c. Equipercentile Method

    Based on scores that correspond to the same percentile rank position on two tests
    Does not assume a linear relationship between the two tests
    Provides for linking scores across the full range of possible test scores
    May require smoothing of the distributions, especially with small samples
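
    A minimal sketch of the equipercentile idea (hypothetical simulated scores, with no smoothing or continuity corrections, which an operational program would apply):

```python
# Equipercentile sketch: a Form Y score maps to the Form X score that holds
# the same percentile rank.
import numpy as np

rng = np.random.default_rng(0)
scores_x = rng.normal(50, 10, size=2000).round()  # group taking Form X
scores_y = rng.normal(45, 8, size=2000).round()   # group taking Form Y

def equipercentile(y, scores_y, scores_x):
    rank = (scores_y <= y).mean() * 100.0         # percentile rank of y on Form Y
    return np.percentile(scores_x, rank)          # matching quantile on Form X

print(equipercentile(45, scores_y, scores_x))     # near the Form X median, ~50
```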

  • CTT Linking-Equating Approaches: c. Equipercentile Method

  • IRT Linking-Equating Approaches

    a. Common items
    b. Common people, or randomly equivalent groups treated as being the same people

  • IRT Linking-Equating Approaches
    IRT linking and equating approaches:

    provide flexibility and are applicable to many settings
    provide consistency by employing the IRT model being used for calibration and scaling
    provide indices that reveal departures from what is expected (tests of fit)

  • IRT Linking-Equating Approaches: a. Common Items
    Approaches can be based on:

    1. Applying an equating constant
    2. Estimating item parameters with fixed or concurrent/simultaneous calibration
    3. Applying the Test Characteristic Curve (TCC) procedure of Stocking & Lord (1983)

  • IRT Linking-Equating Approaches: a. Common Items, 1. Applying an Equating Constant

    Appropriate when two or more tests have a common set of anchor items and also some items unique to each form
    Requires selecting one form, or some other location on the scale, as the origin of the scale

  • IRT Linking-Equating Approaches: a. Common Items, 1. Applying an Equating Constant

  • IRT Linking-Equating Approaches: a. Common Items, 1. Determining an Equating Constant

    Anchor item difficulties calibrated separately on each form:

    Item       Form Y    Form X    (Y - X)
    Item A       0.5      -1.5        2
    Item B       1.0      -1.0        2
    Item C       1.5      -0.5        2
    Sum          3.0      -3.0        6
    Average      1.0      -1.0        2

    Constant = Form Y - Form X = 2
    If C = Y - X, then Y = X + C.
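
    The same arithmetic as a small Python sketch (difficulty values taken from the table above):

```python
# Equating constant from common anchor items: C = mean(Y - X).
form_y = {"A": 0.5, "B": 1.0, "C": 1.5}     # anchor difficulties on Form Y
form_x = {"A": -1.5, "B": -1.0, "C": -0.5}  # the same anchors on Form X

diffs = [form_y[i] - form_x[i] for i in form_y]
constant = sum(diffs) / len(diffs)           # 2.0

# With C = Y - X, Form X values move onto the Form Y scale as Y = X + C.
x_on_y_scale = {i: b + constant for i, b in form_x.items()}
print(constant, x_on_y_scale)                # 2.0 {'A': 0.5, 'B': 1.0, 'C': 1.5}
```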

  • IRT Linking-Equating Approaches (CLOSER LOOK)

  • IRT Linking-Equating Approaches: a. Common Items, 1. Applying an Equating Constant

    The common items used for equating are the anchor items.
    Generally 15 to 20 items are needed for common-item equating.
    Not all items designated as anchor items will work effectively.
    The anchor items should be in the same location on the tests.
    The anchor items should reflect the content, format, and difficulty range of the whole test.

  • IRT Linking-Equating Approaches: a. Common Items, 2. Fixed Calibration

    Designate Form X as the base form, which defines the scale origin. Calibrate the parameters (difficulty, discrimination, and guessing) of all items, and treat the item parameters of the anchor items (e.g., items 1, 5, 7, 10, etc.) as fixed.

    Use the parameters of the anchor items from Form X for the same items (the anchors) on Form Y.

    Calibrate the Form Y items using the fixed parameter values of the anchor items, treating all other items and their parameters as free to vary.

    The resultant calibration of Form Y will be on the same scale as Form X; it is anchored through the fixed values of the common items.

  • IRT Linking-Equating Approaches: a. Common Items, 2. Concurrent or Simultaneous Calibration

    Consider the following:
    500 students take Form X (25 unique items plus 15 anchor items, 40 items in all)
    500 students take Form Y (25 unique items plus 15 anchor items, 40 items in all)
    All 1,000 students take the anchor items
    The data for all students are stacked as shown below

  • IRT Linking-Equating Approaches: a. Common Items, 2. Concurrent or Simultaneous Calibration

    Data are calibrated on 1,000 students across 65 distinct items. Each student responds to 40 items and is missing data on the 25 unique items of the form they did not take. All students respond to the anchor items.

                             Form X items   Anchor items   Form Y items
                             (25 items)     (15 items)     (25 items)
    500 Form X students      responses      responses      missing data
    500 Form Y students      missing data   responses      responses
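
    A minimal sketch of that stacked layout (hypothetical random responses): NaN marks the items a student never saw.

```python
# Stacked data matrix for concurrent calibration: 1,000 rows x 65 columns.
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y = 500, 500          # students per form
k_unique, k_anchor = 25, 15  # unique items per form, shared anchor items

data = np.full((n_x + n_y, k_unique + k_anchor + k_unique), np.nan)

# Form X students answer their unique items plus the anchors (columns 0-39).
data[:n_x, :k_unique + k_anchor] = rng.integers(0, 2, (n_x, k_unique + k_anchor))
# Form Y students answer the anchors plus their unique items (columns 25-64).
data[n_x:, k_unique:] = rng.integers(0, 2, (n_y, k_anchor + k_unique))

print(data.shape)               # (1000, 65)
print(np.isnan(data[0]).sum())  # a Form X student is missing the 25 Form Y items
```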

  • IRT Linking-Equating Approaches: a. Common Items, 3. Test Characteristic Curve (TCC) Procedures

    Developed by Stocking and Lord (1983)
    Very flexible and widely used
    Commonly applied with the 2- and 3-parameter IRT models

  • IRT Linking-Equating Approaches: a. Common Items, 3. Test Characteristic Curve (TCC) Procedures

    IRT scales have an arbitrary origin and an arbitrary scale spacing (i.e., the size of each unit of measurement):
    The origin is selected and fixed
    The scale spread is expanded or reduced

    Item parameter estimates for the same items from two independent calibrations will differ due to:
    Origin and scale differences
    Characteristics of the other items
    Possibly sampling and estimation error

  • IRT Linking-Equating Approaches: a. Common Items, 3. Test Characteristic Curve (TCC) Procedures

    If two scales differ in origin (location) and spread (variability), a linear transformation can be applied to one scale to re-express or transform it to be on the other scale.
    The choice of which scale to use is informed by considering the intended use of the items, test forms, or item bank.
    The figures on the next slide illustrate the basic idea of the TCC method.

  • IRT Linking-Equating Approaches

  • IRT Linking-Equating Approaches: a. Common Items, 3. Test Characteristic Curve (TCC) Procedures

    Transforms the item parameter values for the common items on one test form to be on the same scale as their corresponding parameter values on the other (target) form
    Requires two constants: the parameters are multiplied by one constant and then added to the second constant
    Begins with carefully chosen initial values for the constants
    Refines the constants to minimize the differences in estimated scores based on the transformed test form and the target form
    Never as simple as the theory
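
    The full Stocking-Lord procedure chooses the two constants by minimizing differences between test characteristic curves. As a simplified stand-in, the sketch below applies the same kind of linear transformation using the mean/sigma method, with hypothetical difficulty values:

```python
# Mean/sigma sketch of the linear transformation step (not full Stocking-Lord).
import numpy as np

b_new = np.array([-0.8, 0.2, 1.0, 1.6])   # anchor difficulties, new form
b_base = np.array([-0.4, 0.6, 1.4, 2.0])  # the same anchors on the base scale

A = b_base.std() / b_new.std()            # slope (scale-spread) constant
B = b_base.mean() - A * b_new.mean()      # intercept (origin) constant

# Difficulties move as b* = A*b + B; discriminations would move as a* = a / A.
print(A, B, A * b_new + B)                # here the anchors land exactly on b_base
```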

  • IRT Linking-Equating Approaches

  • IRT Linking-Equating Approaches: b. Common Persons or Random Equivalent Groups

    The same students, or two groups sampled to be equivalent on critical, relevant characteristics, take Form X and Form Y; the forms do not have any common items.
    Example: students' average ability on Form X is -1.0 (low ability), and their average ability on Form Y is +1.0 (high ability).

    Differences in students' abilities cannot explain the differences in performance on Forms X and Y, since the same students (common students) take both forms.

    QUESTION: How can the same group of students have two different mean abilities?

  • IRT Linking-Equating Approaches: b. Common Persons or Random Equivalent Groups

    The difference in mean performance reflects the difference in the difficulty of the two forms.
    The test forms must differ in difficulty, since the students' abilities were held constant (same students).
    On Form X, students look less able, with a mean of -1; on Form Y, students look more able, with a mean of +1.
    Form X is harder than Form Y in that it makes students look less able; the test forms differ by +2 units.
    The difference of +2 is used as a linking constant to adjust the tests onto a single scale, in the same way as a linking constant derived from common items.
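
    The same arithmetic as a small Python sketch (ability means taken from the slide):

```python
# Common-persons linking: the same group's two mean abilities differ only
# because the forms differ in difficulty.
mean_theta_x = -1.0  # mean ability estimated from Form X (the harder form)
mean_theta_y = +1.0  # mean ability of the same students estimated from Form Y

constant = mean_theta_y - mean_theta_x  # +2.0 linking constant

def x_to_y_scale(theta_x):
    """Place a Form X ability estimate on the Form Y scale."""
    return theta_x + constant

print(constant, x_to_y_scale(-1.0))  # 2.0 1.0
```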

  • 5. Equating Issues
    Substantive Concerns
    Technical Issues
    Quality Control Issues
      Test design, development & administration
      Scoring, analysis and equating
    Technical Documentation
    Accountability Compliance
    Item Formats and Platforms

  • Common Equating Concerns/Issues: Substantive Concerns

    Validity is the central issue.
    Validity evidence must document fairness, absence of bias, and equal access for all students.
    Carefully planned and rigorously monitored item and test form development are the most essential ingredients for successful equating.
    Equating goes bad through items and test forms, not in the psychometrics.

  • Common Equating Concerns/Issues: Technical Issues

    Examining and testing IRT assumptions
    Conducting and documenting IRT tests of fit (data-to-model fit; linking/equating fit)
    Item parameter drift

  • Common Equating Concerns/Issues: Quality Control Issues
    Test design, development, and administration problems:

    Changes in content standards or test specifications
    Item contexts that differ between forms and affect performance on anchor items
    Anchor items that appear in very different locations among forms
    Item misprints/errors
    Unintended accommodations (maps or periodic tables on walls, calculators, etc.)
    All manner of weird and unimaginable stuff and happenings

  • Common Equating Concerns/Issues: Quality Control Issues
    Item scoring, analysis, and equating quality issues:

    Non-standard scoring criteria or changes in scoring procedures
    Redefinition of scoring rubrics, variation in benchmark papers
    Item parameter drift
    Departures from specified equating procedures
    Unreliable and/or inconsistent item performance or score distributions
    Departures from specified data processing protocols
    All manner of weird and unimaginable stuff and happenings

  • Common Equating Concerns/Issues: Technical Documentation

    General technical reports
    Standard-setting reports
    Equating technical reports
    Specify requirements for documentation in RFPs, with TAC reviews and due dates

    QUESTION: Can an independent contractor replicate the equating results?

  • Common Equating Concerns/Issues: Accountability Concerns

    Standard Setting

    Adequate Yearly Progress (AYP)

  • Common Equating Concerns/Issues: Item Formats and Platforms

    Open-ended or Constructed-Response Tasks

    Writing Assessment

    Paper-and-Pencil and Computerized Assessments

  • References
    Dorans, N. J., Pommerich, M., & Holland, P. W. (2007). Linking and aligning scores and scales. Statistics for social and behavioral sciences. New York: Springer.

    Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83-102.

    Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: Educational Testing Service.

    Ryan, J., & Brockmann, F. (2009). A practitioner's introduction to equating with primers on classical test theory and item response theory. Washington, DC: Council of Chief State School Officers (CCSSO).

  • A Practitioner's Introduction to Equating

    Joseph Ryan, Arizona State University
    [email protected]

    Frank Brockmann, Center Point Assessment Solutions
    [email protected]

    END