Empirical Bayes DIF Assessment Rebecca Zwick, UC Santa Barbara Presented at Measured Progress August 2007


Overview

Definition and causes of DIF

Assessing DIF via Mantel-Haenszel

EB enhancement to MH DIF (1994-2002, with D. Thayer & C. Lewis)

Model and Applications

Simulation findings

Discussion

What’s differential item functioning?

DIF occurs when equally skilled members of 2 groups have different probabilities of answering an item correctly.

(Only dichotomous items considered today)

IRT Definition of (absence of) DIF

Lord, 1980: $P(Y_i = 1 \mid \theta, R) = P(Y_i = 1 \mid \theta, F)$ means DIF is absent.

$P(Y_i = 1 \mid \theta, G)$ is the probability of a correct response to item i, given $\theta$, in group G, where G = F (focal) or R (reference).

$\theta$ is a latent ability variable, imperfectly measured by test score S. (More later...)

Reasons for DIF

“Construct-irrelevant difficulty” (e.g., sports content in a math item)

Differential interests or educational background: NAEP History items with DIF favoring Black test-takers were about M. L. King, Harriet Tubman, Underground Railroad (Zwick & Ercikan, 1989)

Often mystifying (e.g., “X + 5 = 10” has DIF; “Y + 8 = 11” doesn’t)

Mini-history of DIF analysis:

DIF research dates back to the 1960s.

In the late 1980s (“Golden Rule”), testing companies started including DIF analysis as a QC procedure.

Mantel-Haenszel (Holland & Thayer, 1988): method of choice for operational DIF analyses

Few assumptions

No complex estimation procedures

Easy to explain

Mantel-Haenszel:

Compare item performance for members of 2 groups, after matching on total test score, S.

Suppose we have K levels of the score used for matching test-takers, s1, s2, …sK

In each of the K levels, data can be represented as a 2 x 2 table (Right/Wrong by Reference/Focal).

Mantel-Haenszel

For each table, compute the conditional odds ratio:

$\dfrac{\text{Odds of correct response} \mid S = s_k,\ G = R}{\text{Odds of correct response} \mid S = s_k,\ G = F}$

A weighted combination of these K values is the MH odds ratio, $\hat{\omega}_{MH}$.

The MH DIF statistic is $-2.35 \ln(\hat{\omega}_{MH})$.
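As a rough illustration of the computation above, the sketch below pools K 2 x 2 tables into an MH odds ratio and converts it to the MH DIF metric. The slides say only "weighted combination"; the specific weights used here (each table's cross-products divided by its total count) are the standard Mantel-Haenszel estimator and are an assumption about what the slide intends. The function name `mh_d_dif` and the example counts are hypothetical.

```python
import math

def mh_d_dif(tables):
    """Return (MH odds ratio, MH DIF statistic).

    tables: one tuple per matching-score level, in the order
    (ref_right, ref_wrong, focal_right, focal_wrong).
    Assumes the standard MH weights; not taken from the slides.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in tables:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    omega_hat = numerator / denominator
    return omega_hat, -2.35 * math.log(omega_hat)

# Hypothetical counts for K = 3 score levels.
tables = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
omega_hat, mh = mh_d_dif(tables)
print(omega_hat, mh)
```

In this made-up example the reference group has higher odds of success at every score level, so the odds ratio exceeds 1 and the MH DIF statistic is negative, i.e. DIF against the focal group.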

Mantel-Haenszel

The MH chi-square tests the hypothesis,

H0: $\omega_k = \omega = 1$, k = 1, 2, …, K versus

H1: $\omega_k = \omega \neq 1$, k = 1, 2, …, K

where $\omega_k$ is the population odds ratio at score level k.

(Above H0 is similar, but not, in general, identical to the IRT H0; see Zwick, 1990 Journal of Educational Statistics)
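The slides do not give the formula for the MH chi-square. The sketch below uses the usual continuity-corrected form (observed minus expected count of correct reference-group responses, summed over score levels, squared and divided by the summed hypergeometric variance), which is an assumption here. The function name and the counts are hypothetical; the p-value uses the identity that a 1-df chi-square tail probability equals erfc(sqrt(x/2)).

```python
import math

def mh_chi_square(tables):
    """Continuity-corrected MH chi-square (1 df) and its p-value.

    tables: tuples (ref_right, ref_wrong, focal_right, focal_wrong),
    one per score level. Standard textbook form, assumed here.
    """
    sum_obs = sum_exp = sum_var = 0.0
    for a, b, c, d in tables:
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        n = n_ref + n_foc
        if n < 2:
            continue  # variance is undefined for such a level
        sum_obs += a
        sum_exp += n_ref * m_right / n
        sum_var += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
    stat = (abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts for K = 3 score levels.
tables = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
stat, p_value = mh_chi_square(tables)
print(stat, p_value)
```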

Mantel-Haenszel

ETS: Size of DIF estimate, plus chi-square results, are used to categorize item:

A: negligible DIF

B: slight to moderate DIF

C: substantial DIF

For B and C, “+” or “-” used to indicate DIF direction: “-” means DIF against focal group.

Designation determines item’s fate.

Drawbacks to usual MH approach

May give impression that DIF status is deterministic or is a fixed property of the item

Reviewers of DIF items often ignore SE

Is unstable in small samples, which may arise in CAT settings

EB enhancement to MH:

Provides more stable results

May allow variability of DIF findings to be represented in a more intuitive way

Can be used in three ways:

Substitute more stable point estimates for MH

Provide probabilistic perspective on true DIF status (A, B, C) and future observed status

[Loss-function-based DIF detection]

Main Empirical Bayes DIF Work (supported by ETS and LSAC)

An EB approach to MH DIF analysis (with Thayer & Lewis). JEM, 1999. [General approach, probabilistic DIF]

Using loss functions for DIF detection: An EB approach (with Thayer & Lewis). JEBS, 2000. [Loss functions]

The assessment of DIF in CATs. In van der Linden & Glas (Eds.) CAT: Theory and Practice, 2000. [review]

Application of an EB enhancement of MH DIF analysis to a CAT (with Thayer). APM, 2002. [simulated CAT-LSAT]

What’s an Empirical Bayes Model? (See Casella (1985), Am. Statistician)

In Bayesian statistics, we assume that parameters have prior distributions that describe parameter “behavior.”

Statistical theory, or past research may inform us about the nature of those distributions.

Combining observed data with the prior distribution yields a posterior (“after the data”) distribution that can be used to obtain improved parameter estimates.

“EB” means prior’s parameters are estimated from data (unlike fully Bayes models).

EB DIF Model

$MH_i$ is the MH statistic for item i.

$\sigma_i^2 \equiv SE^2(MH_i)$ is the squared standard error of $MH_i$ (treated as known).

(Sensitivity analyses revealed no problem with the known-variance assumption.)

EB DIF Model

$f(MH_i \mid \theta_i)$ is $N(\theta_i, \sigma_i^2)$; $\theta_i$ is the unknown DIF parameter (true value of DIF).

Note: The distribution of MH is asymptotically normal (e.g., Agresti, 1990).

EB DIF Model

Prior: $f(\theta_i)$ is $N(\mu, \tau^2)$,

where $\mu$ is the across-item mean of the DIF parameters and $\tau^2$ is the across-item variance. (No DIF implies $\tau^2 = 0$.)

EB DIF Model

$f(\theta_i \mid MH_i) \propto f(MH_i \mid \theta_i)\, f(\theta_i)$ = posterior distribution of $\theta_i$, given $MH_i$.

Bayes model with normal prior, normal likelihood ⇒ normal posterior; posterior mean and variance have simple expressions (see, e.g., Gelman et al., 1995).

EB DIF Model

Posterior mean $= W_i\, MH_i + (1 - W_i)\, \mu$, where $W_i = \dfrac{\tau^2}{\sigma_i^2 + \tau^2}$.

EB DIF statistic = estimated posterior mean: a weighted combination of $MH_i$ and $\hat{\mu}$.

Posterior variance $= W_i \sigma_i^2 \le \sigma_i^2$.
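A small numerical sketch of the posterior mean and variance may make the weighting concrete. It treats the prior mean and variance as already known; the function name `eb_posterior` and all input values are hypothetical.

```python
def eb_posterior(mh, se, mu, tau2):
    """Posterior weight, mean, and variance for one item.

    mh: MH DIF statistic; se: its standard error (treated as known);
    mu, tau2: prior mean and variance of the DIF parameters.
    """
    sigma2 = se ** 2
    w = tau2 / (sigma2 + tau2)
    post_mean = w * mh + (1.0 - w) * mu
    post_var = w * sigma2
    return w, post_mean, post_var

# Same MH value, two hypothetical standard errors.
precise = eb_posterior(mh=2.0, se=0.5, mu=0.0, tau2=1.0)
noisy = eb_posterior(mh=2.0, se=1.0, mu=0.0, tau2=1.0)
print(precise)  # weight 0.8: estimate stays close to MH
print(noisy)    # weight 0.5: estimate is pulled halfway to the prior mean
```

The item with the larger standard error gets the smaller weight, so its EB estimate moves further toward the prior mean.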

Estimation of $\mu$ and $\tau^2$

We estimated $\mu$ and $\tau^2$ from current data:

$\hat{\mu} = \text{Average}(MH_i)$.

$\hat{\tau}^2 = \widehat{\text{Var}}(MH_i) - \text{Average}(SE^2(MH_i))$,

where $\widehat{\text{Var}}(MH_i)$ = observed across-item variance of the $MH_i$ statistics, i.e., $\tau^2$ is estimated by deflating $\widehat{\text{Var}}(MH_i)$ by the average of the estimated squared standard errors.
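The estimation step can be sketched as follows. Two details are assumptions not stated on the slide: the across-item variance is computed as a sample variance (divisor n - 1), and a negative estimate of the prior variance is truncated at zero. The function names and data are hypothetical.

```python
from statistics import mean, variance

def estimate_prior(mh_values, se_values):
    """Estimate prior mean and variance from the current item set."""
    mu_hat = mean(mh_values)
    avg_se2 = mean([se ** 2 for se in se_values])
    # Assumption: sample variance, truncated at 0 if the difference is negative.
    tau2_hat = max(variance(mh_values) - avg_se2, 0.0)
    return mu_hat, tau2_hat

def eb_estimates(mh_values, se_values):
    """EB DIF estimate (estimated posterior mean) for every item."""
    mu_hat, tau2_hat = estimate_prior(mh_values, se_values)
    results = []
    for mh, se in zip(mh_values, se_values):
        w = tau2_hat / (se ** 2 + tau2_hat)
        results.append(w * mh + (1.0 - w) * mu_hat)
    return results

# Hypothetical MH statistics and standard errors for five items.
mh_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
se_values = [0.5, 0.5, 0.5, 0.5, 0.5]
mu_hat, tau2_hat = estimate_prior(mh_values, se_values)
eb = eb_estimates(mh_values, se_values)
print(mu_hat, tau2_hat, eb)
```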

Recall: EB DIF estimate is a weighted combination of MHi and prior mean.

Prior mean will be (near) 0 because MHi

values sum to (about) 0 across items when

we match on number-right score (or similar

scores).

EB DIF estimate is closer to 0 than MH.

Lots of data: little "shrinkage" to 0

Sparse data: lots of shrinkage; prior leads to

more stable estimation.

Next…

Performance of EB DIF estimator

“Probabilistic DIF” idea

How does EB DIF estimator EBi compare to MHi?

Applied to real data, including GRE

Applied to simulated data, including simulated CAT-LSAT (Zwick & Thayer, 2002):

Testlet CAT data simulated, including items with varying amounts of DIF

EB and MH both used to estimate (known) True DIF

Performance compared using RMSR, variance, and bias measures

Design of Simulated CAT

Pool: 30 5-item testlets (150 items total)

10 testlets at each of 3 difficulty levels

Item data generated via 3PL model

CAT algorithm was based on testlet scores

Examinees received 5 testlets (25 items)

Test score (used as DIF matching variable) was expected true score on pool (Zwick, Thayer, & Wingersky, 1994 APM)

Simulation Conditions Differed on Several Factors:

Ability distribution: always N(0,1) in Reference group; Focal group either N(0,1) or N(-1,1)

Initial sample size per group: 1000 or 3000

DIF: Absent or Present (in amounts that vary across items)

600 replications for results shown today

Definition of True DIF for Simulation

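True DIF $= -2.35 \displaystyle\int \ln\left\{ \dfrac{P_{iR}(\alpha) / Q_{iR}(\alpha)}{P_{iF}(\alpha) / Q_{iF}(\alpha)} \right\} f(\alpha)\, d\alpha$,

where $f(\alpha)$ is the reference group ability distribution, $P_{iG}(\alpha)$ is the IRF for group G, and $Q_{iG}(\alpha) = 1 - P_{iG}(\alpha)$.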

Like MH with no measurement or sampling error

Range of True DIF: -2.3 to 2.9, SD ≈ 1.
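The True DIF integral can be approximated numerically. The sketch below uses a 3PL item response function with the conventional 1.7 scaling constant, a N(0,1) reference ability density, and a trapezoid rule on a fixed grid. The scaling constant, the grid, the item parameters, and the choice to introduce DIF as a difference in the difficulty parameter are all illustrative assumptions, not details taken from the simulation.

```python
import math

def p_3pl(alpha, a, b, c):
    """3PL item response function (1.7 scaling assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (alpha - b)))

def normal_pdf(x):
    """Standard normal density, used as the reference ability distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def true_dif(ref_params, foc_params, lo=-6.0, hi=6.0, steps=2400):
    """Trapezoid-rule approximation to the True DIF integral."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        alpha = lo + k * h
        p_r = p_3pl(alpha, *ref_params)
        p_f = p_3pl(alpha, *foc_params)
        log_odds_ratio = math.log((p_r / (1.0 - p_r)) / (p_f / (1.0 - p_f)))
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * log_odds_ratio * normal_pdf(alpha) * h
    return -2.35 * total

# Hypothetical (a, b, c) parameters; the item is harder for the focal group.
no_dif = true_dif((1.0, 0.0, 0.2), (1.0, 0.0, 0.2))
against_focal = true_dif((1.0, 0.0, 0.2), (1.0, 0.5, 0.2))
print(no_dif, against_focal)
```

With identical item response functions the integrand is zero everywhere, so True DIF is zero; making the item harder for the focal group yields a negative value, matching the sign convention of the MH statistic.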

Definition of Root Mean Square Residual

RMSR is average deviation, in MH metric, of DIF estimate from True DIF.

For each item in each condition, get

$\text{RMSR} = \sqrt{\dfrac{1}{R} \displaystyle\sum_{j=1}^{R} \left(\text{EstDIF}(j) - \text{TrueDIF}\right)^2}$

j indexes replications

R = 600 is the number of reps

EstDIF(j) is EB or MH value from jth rep

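MSR = Variance + Squared Bias

$\text{MSR} = \text{RMSR}^2 = \dfrac{1}{R} \displaystyle\sum_{j=1}^{R} \left\{\text{EstDIF}(j) - \text{Avg}(\text{EstDIF})\right\}^2 + \left(\text{Avg}(\text{EstDIF}) - \text{TrueDIF}\right)^2$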
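The decomposition can be checked numerically. In this sketch the replicated estimates are random draws invented for illustration (they are not simulation results from the study), and the function name is hypothetical.

```python
import math
import random

def msr_decomposition(estimates, true_dif):
    """Return (RMSR, variance, squared bias) over replications."""
    r = len(estimates)
    avg = sum(estimates) / r
    variance = sum((e - avg) ** 2 for e in estimates) / r
    squared_bias = (avg - true_dif) ** 2
    msr = sum((e - true_dif) ** 2 for e in estimates) / r
    return math.sqrt(msr), variance, squared_bias

# Invented data: 600 replications of an estimator that is biased toward 0.
random.seed(0)
true_value = 1.0
estimates = [random.gauss(0.8, 0.5) for _ in range(600)]
rmsr, variance, squared_bias = msr_decomposition(estimates, true_value)
print(rmsr ** 2, variance + squared_bias)  # the two agree up to rounding
```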

RMSRs for No-DIF condition, Initial N=1000; Item N’s = 80 to 300

Summary over 150 items    EB      MH
25th %ile                 .068    .543
Median                    .072    .684
75th %ile                 .078    .769

RMSRs - 50 hard items, DIF condition, Focal N(-1,1); Focal N’s = 16 to 67, Reference N’s = 80 to 151

Summary over 50 items     EB      MH
25th %ile                 .514    1.190
Median                    .532    1.252
75th %ile                 .558    1.322

RMSRs for DIF condition, Focal N(-1,1); Initial N=1000; Item N’s = 16 to 307

Summary over 150 items    EB      MH
25th %ile                 .464    .585
Median                    .517    .641
75th %ile                 .560    1.190

Variance and Squared Bias for Same Condition; Initial N=1000; Item N’s = 16 to 307

Summary over 150 items    EB Variance    EB Squared Bias    MH Variance    MH Squared Bias
25th %ile                 .191           .004               .335           .000
Median                    .210           .027               .402           .002
75th %ile                 .242           .088               1.402          .013

Summary-Performance of EB DIF Estimator

RMSRs (and variances) are smaller for EB than for MH, especially in (1) no-DIF case and

(2) very small-sample case.

EB estimates more biased than MH; bias is toward 0.

Above findings are consistent with theory.

Implications to be discussed.

“External” Applications/Elaborations of EB DIF Point Estimation

Defense Dept: CAT-ASVAB (Krass & Segal, 1998)

ACT: Simulated multidimensional CAT data (Miller & Fan, NCME, 1998)

ETS: Fully Bayes DIF model (NCME, 2007) of Sinharay et al: Like EB, but parameters of prior are determined using past data (see ZTL).

Also tried loss function approach.

Probabilistic DIF

In our model, posterior distribution is normal, so is fully determined by mean and variance.

Can use posterior distribution to infer the probability that DIF falls into each of the ETS categories (C-, B-, A, B+, C+), each of which corresponds to a particular DIF magnitude.

(Statistical significance plays no role here.) Can display graphically.

Probabilistic DIF status for an “A” item in LSAT sim. (NR = 101, NF = 23)

MH = 4.7, SE = 2.2, Identified Status = C+

Posterior Mean = EBi = .7, Posterior SD = .8

[Pie chart of posterior probabilities: C- 0%, B- 1%, A 65%, B+ 20%, C+ 14%]
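The category probabilities follow directly from the normal posterior. The sketch below assumes the category boundaries are |DIF| = 1.0 (A versus B) and |DIF| = 1.5 (B versus C); those cutoffs are not stated on the slides and are an assumption. The inputs are the rounded posterior mean and SD reported for the example item, so the output should be close to, but not exactly equal to, the percentages shown for that item.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def category_probabilities(post_mean, post_sd, b_cut=1.0, c_cut=1.5):
    """Posterior probability of each DIF category.

    b_cut and c_cut are assumed magnitude cutoffs, not slide content.
    """
    cuts = [-c_cut, -b_cut, b_cut, c_cut]
    cdf = [0.0] + [normal_cdf(x, post_mean, post_sd) for x in cuts] + [1.0]
    labels = ["C-", "B-", "A", "B+", "C+"]
    return {label: cdf[k + 1] - cdf[k] for k, label in enumerate(labels)}

probs = category_probabilities(post_mean=0.7, post_sd=0.8)
for label, p in probs.items():
    print(f"{label}: {100 * p:.1f}%")
```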

Probabilistic DIF, continued

EB approach can be used to accumulate DIF evidence across administrations.

Prior can be modified each time an item is given: Use former posterior distribution as new prior (Zwick, Thayer & Lewis, 1999).

Pie chart could then be modified to reflect new evidence about an item’s status.
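A minimal sketch of this updating scheme, under the same normal-normal model: after each administration the posterior mean and variance become the prior for the next one. The starting prior, the MH values, and the standard errors are hypothetical.

```python
def update(prior_mean, prior_var, mh, se):
    """One normal-normal update: returns the posterior mean and variance."""
    sigma2 = se ** 2
    w = prior_var / (sigma2 + prior_var)
    post_mean = w * mh + (1.0 - w) * prior_mean
    post_var = w * sigma2
    return post_mean, post_var

# Hypothetical item: prior N(0, 1), then two administrations.
mean, var = 0.0, 1.0
history = []
for mh, se in [(1.2, 1.0), (0.8, 1.0)]:
    mean, var = update(mean, var, mh, se)
    history.append((mean, var))
print(history)
```

In this example the posterior variance shrinks after each administration, which is how evidence about an item's status accumulates.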

Predicting an Item’s Future Status: The Posterior Predictive Distribution

A variation on the above can be used to predict future observed DIF status

Mean of posterior predictive distribution is same as posterior mean, but variance is larger.

For details and an application to GRE items, see Zwick, Thayer, & Lewis, 1999 JEM.
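As a sketch of why the predictive variance is larger: under the normal-normal model, a future observed MH value equals the true DIF plus new sampling error, so its predictive variance is the posterior variance plus the squared standard error expected at the future administration. That the future standard error is known in advance is an assumption made here for illustration; the function name and numbers are hypothetical.

```python
import math

def posterior_predictive(post_mean, post_var, future_se):
    """Mean and variance of a future observed MH statistic."""
    return post_mean, post_var + future_se ** 2

def prob_future_at_least(threshold, post_mean, post_var, future_se):
    """Predictive probability that the future MH value is >= threshold."""
    mean, var = posterior_predictive(post_mean, post_var, future_se)
    z = (threshold - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical posterior and future standard error.
pred_mean, pred_var = posterior_predictive(post_mean=0.7, post_var=0.64, future_se=1.0)
p_large = prob_future_at_least(1.5, post_mean=0.7, post_var=0.64, future_se=1.0)
print(pred_mean, pred_var, p_large)
```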

Discussion

EB point estimates have advantages over MH counterparts

EB approach can be applied to non-MH DIF methods

Advisability of shrinkage estimation for DIF needs to be considered

Reducing Type I error may yield more interpretable results

Degree of shrinkage can be fine-tuned

Probabilistic DIF displays may have value in conveying uncertainty of DIF results.
