Vertical and horizontal test equating in educational research Eveline Gebhardt & Wolfram Schulz

Vert&Hor Equating 111024


DESCRIPTION

Vertical and horizontal equating and measurement invariance


Page 1: Vert&Hor Equating 111024

Vertical and horizontal test equating in educational research

Eveline Gebhardt & Wolfram Schulz

Page 2: Vert&Hor Equating 111024

Method for estimating

• change over time in student abilities

• growth between year levels

Page 3: Vert&Hor Equating 111024

A method based on CLASSICAL TEST THEORY

Page 4: Vert&Hor Equating 111024

Classical test theory

• Student performance: % correct on set of items
– Compare students that respond to identical set of items

• Item difficulty: % of students responding correctly
– Compare items that were administered to the same group of students
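The two classical statistics above can be sketched as follows. This is a minimal illustration, not part of the original slides: the response matrix and the function names are hypothetical.

```python
# Hypothetical scored responses: one row per student, one column per item
# (1 = correct, 0 = incorrect). All students answered the same items.
responses = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]


def percent_correct_per_student(matrix):
    """Student performance: percentage of items each student answered correctly."""
    return [100.0 * sum(row) / len(row) for row in matrix]


def percent_correct_per_item(matrix):
    """Item difficulty: percentage of students who answered each item correctly."""
    n_students = len(matrix)
    return [100.0 * sum(column) / n_students for column in zip(*matrix)]


student_scores = percent_correct_per_student(responses)
item_percent_correct = percent_correct_per_item(responses)
```

Both statistics depend on which items and which students are in the matrix, which is why comparisons are only valid within the same set of items or the same group of students.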

Page 5: Vert&Hor Equating 111024

Constraints

• Limited number of items to measure a domain

• All items need to be kept secure

Page 6: Vert&Hor Equating 111024

Problematic

• Comparing students from different age groups (ceiling or floor effect)

• Comparing student abilities over time when not all items can be kept secure

• Item difficulty and student performance are confounded

Page 7: Vert&Hor Equating 111024

A method based on ITEM RESPONSE THEORY

Page 8: Vert&Hor Equating 111024

Rasch model

Common scale for item difficulties and student abilities
– If ability = difficulty, the student has a 50% chance of responding correctly to that item
– If ability > difficulty, the student is more likely to respond correctly
– If ability < difficulty, the student is more likely to respond incorrectly
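The three cases above follow from the standard Rasch response function, in which the probability of a correct response depends only on the difference between ability and difficulty (both in logits). A minimal sketch; the function name and the example values are illustrative, not taken from the slides:

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values: one item of difficulty 0.5 and three students.
p_equal = rasch_probability(0.5, 0.5)   # ability = difficulty
p_above = rasch_probability(1.5, 0.5)   # ability > difficulty
p_below = rasch_probability(-0.5, 0.5)  # ability < difficulty
```

Here `p_equal` is exactly 0.5, `p_above` is greater than 0.5 and `p_below` is less than 0.5.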

Page 9: Vert&Hor Equating 111024

Example scale – Year 6

[Figure: item-person map on a logit scale from -3 to 3, showing the distribution of Year 6 students against the difficulties of items 1 to 10.]

Page 10: Vert&Hor Equating 111024

Example scale – Year 10

[Figure: item-person map on a logit scale from -3 to 3, showing the distribution of Year 10 students against the difficulties of items 1 to 15.]

Page 11: Vert&Hor Equating 111024

Example scale – Combined

[Figure: item-person map on a logit scale from -3 to 3, showing the Year 6 and Year 10 student distributions on either side of the difficulties of items 1 to 15.]

Page 12: Vert&Hor Equating 111024

Vertical and horizontal equating

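[Figure: equating design for Year 6 and Year 10 across the 2011 and 2014 assessments; vertical equating links the year levels, horizontal equating links the assessment years.]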

Page 13: Vert&Hor Equating 111024

COMMON ITEM EQUATING

Three methods

Page 14: Vert&Hor Equating 111024

Several methods

• Average item difficulty of set of link items needs to be equal in both tests

• Three common methods:
– Shift method (trends)
– Joint scaling (booklets)
– Anchoring item difficulties

Page 15: Vert&Hor Equating 111024

Method 1: SHIFT METHOD

Page 16: Vert&Hor Equating 111024

Shift method

• Test 1 and test 2 are scaled separately

• Average difficulty of items B in test 1 (MN1) and test 2 (MN2) is computed

• Difference between averages (d = MN1 – MN2) is computed

• Difference is added to the student abilities of test 2 (θ2* = θ2 + d)

         Items A   Items B   Items C
Test 1   X         X
Test 2             X         X

Page 17: Vert&Hor Equating 111024

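[Figure: the Test 1 and Test 2 scales, each with its own zero point, showing the link-item means MN1 and MN2 and the shift d between them.]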

Page 18: Vert&Hor Equating 111024

Set        Item   Difficulty T1   Difficulty T2
A          1      -1.1
A          2       1.6
A          3      -2.6
A          4       0.9
A          5      -1.8
B          6       0.8            -1.2
B          7       1.7            -0.3
B          8       0.9            -1.1
B          9      -0.2            -2.2
B          10     -0.9            -2.9
C          11                      2.1
C          12                      1.5
C          13                      0.4
C          14                      2.4
C          15                      1.2
AVG all            0.0             0.0
AVG link           0.5            -1.5

Shift = 0.5 – (–1.5) = 2.0
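The shift computation for this worked example can be sketched in a few lines. The link-item difficulties are the set B values from the table; the test 2 student abilities and all variable names are hypothetical and only illustrate how the shift is applied.

```python
from statistics import mean

# Link-item (set B) difficulties from the worked example, in logits.
link_t1 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9}
link_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9}

mn1 = mean(link_t1.values())  # average link difficulty in test 1
mn2 = mean(link_t2.values())  # average link difficulty in test 2
d = mn1 - mn2                 # equating shift, d = MN1 - MN2

# Hypothetical test 2 abilities, moved onto the test 1 scale.
theta2 = [-0.5, 0.0, 1.2]
theta2_equated = [theta + d for theta in theta2]
```

With these values `mn1` is 0.46 and `mn2` is -1.54, which the slide rounds to 0.5 and -1.5; the shift `d` is 2.0 up to floating-point rounding.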

Page 19: Vert&Hor Equating 111024

Method 2: JOINT SCALING

Page 20: Vert&Hor Equating 111024

Joint scaling

• Data of test 1 and 2 are joined in one data set

• Test 1 and 2 are scaled together

• Difficulties of items B are estimated only once

• Difficulties of items B are identical for test 1 and 2

• Tests are on the same scale

• Also called concurrent equating

Page 21: Vert&Hor Equating 111024

Joint scaling - Data file

Item sets: A = i1 to i5, B = i6 to i10, C = i11 to i15 (n = item not administered)

Std  Year  i1 i2 i3 i4 i5  i6 i7 i8 i9 i10  i11 i12 i13 i14 i15
1    6     0  1  0  0  1   0  0  1  1  1    n   n   n   n   n
2    6     0  0  1  1  1   0  1  0  0  0    n   n   n   n   n
3    6     1  0  1  1  1   0  1  1  0  1    n   n   n   n   n
4    6     0  0  1  1  0   1  1  0  0  1    n   n   n   n   n
5    6     1  1  0  1  1   1  1  1  1  1    n   n   n   n   n
6    10    n  n  n  n  n   0  0  0  0  0    0   1   0   0   0
7    10    n  n  n  n  n   0  1  0  1  0    0   0   0   0   0
8    10    n  n  n  n  n   0  1  1  0  1    1   1   1   1   1
9    10    n  n  n  n  n   1  1  0  0  1    1   1   0   1   1
10   10    n  n  n  n  n   1  1  1  1  1    1   1   0   1   1
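A joint data file of this shape can be assembled by padding each test's responses with a missing-value marker for the items that were not administered. The sketch below uses students 1 and 6 from the data file; the helper name `to_joint_row` and the use of `None` for the slide's "n" are assumptions made for illustration.

```python
# Column names i1..i15; A = i1-i5, B = i6-i10 (link items), C = i11-i15.
ALL_ITEMS = [f"i{k}" for k in range(1, 16)]

# One student per test, copied from the data file.
year6_scores = {1: [0, 1, 0, 0, 1, 0, 0, 1, 1, 1]}   # items i1..i10
year10_scores = {6: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]}  # items i6..i15


def to_joint_row(student, year, scores, first_item):
    """Place one student's scores in the full item layout; None = not administered."""
    row = {"Std": student, "Year": year}
    row.update({item: None for item in ALL_ITEMS})
    for offset, score in enumerate(scores):
        row[f"i{first_item + offset}"] = score
    return row


joint = [to_joint_row(s, 6, r, 1) for s, r in year6_scores.items()]
joint += [to_joint_row(s, 10, r, 6) for s, r in year10_scores.items()]

# Only the link items (set B) have responses from both year levels.
answered_by_all = [i for i in ALL_ITEMS if all(r[i] is not None for r in joint)]
```

In this sketch `answered_by_all` contains exactly i6 to i10, the items whose difficulties are estimated once from both groups when the joined file is scaled.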

Page 22: Vert&Hor Equating 111024

Method 3: ANCHORING

Page 23: Vert&Hor Equating 111024

Anchoring

• Test 1 (items A and B) is scaled

• Difficulties of items B are copied

• Test 2 (items B and C) is scaled, anchoring items B to the same values as test 1

Page 24: Vert&Hor Equating 111024

Set        Item   Difficulty T1   Difficulty T2
A          1      -1.1
A          2       1.6
A          3      -2.6
A          4       0.9
A          5      -1.8
B          6       0.8             0.8*
B          7       1.7             1.7*
B          8       0.9             0.9*
B          9      -0.2            -0.2*
B          10     -0.9            -0.9*
C          11                      4.1
C          12                      3.5
C          13                      2.4
C          14                      4.4
C          15                      3.2
AVG all            0.0             2.5
AVG link           0.5             0.5

* anchored to the test 1 value
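In this worked example the anchored test 2 difficulties coincide with the separately scaled test 2 difficulties from the shift-method example plus the shift of 2.0. The check below illustrates that relationship for these particular numbers only; with real data, anchoring fixes every link item at its test 1 value while the shift method only aligns the link-item means, so the two methods generally agree only approximately. Variable names are illustrative.

```python
shift = 2.0  # equating shift from the shift-method example

# Test 2 item difficulties (items 6-15) when test 2 is scaled on its own.
t2_separate = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9,
               11: 2.1, 12: 1.5, 13: 0.4, 14: 2.4, 15: 1.2}

# Test 2 item difficulties when items 6-10 are anchored to their test 1 values.
t2_anchored = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9,
               11: 4.1, 12: 3.5, 13: 2.4, 14: 4.4, 15: 3.2}

# Largest disagreement between "separate + shift" and "anchored".
max_gap = max(abs(t2_separate[item] + shift - t2_anchored[item])
              for item in t2_separate)
```

Here `max_gap` is zero up to floating-point rounding.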

Page 25: Vert&Hor Equating 111024

EVALUATION OF LINK ITEMS

Before equating tests

Page 26: Vert&Hor Equating 111024

Link item invariance

• Relative item difficulty

• Discrimination

• Differential item functioning (DIF)

Page 27: Vert&Hor Equating 111024

Evaluation of RELATIVE ITEM DIFFICULTY

Page 28: Vert&Hor Equating 111024

Relative item difficulty

[Figure: plot of relative item difficulties; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]

Page 29: Vert&Hor Equating 111024

Evaluation of ITEM DISCRIMINATION

Page 30: Vert&Hor Equating 111024

Item discrimination

• Items discriminate between more able and less able students

• Some items discriminate more than others

• Average abilities of students:

            Item 1   Item 2
Answer A     1.00     0.62
Answer B    -0.22     0.61
Answer C    -0.15     0.81
Answer D    -0.02     0.53
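A table like the one above can be produced by grouping students by the answer they chose and averaging their ability estimates. The sketch below uses hypothetical (answer, ability) records and a hypothetical function name; it is not the software used for the slides.

```python
from collections import defaultdict
from statistics import mean


def mean_ability_by_answer(records):
    """Average ability estimate of the students choosing each response option."""
    groups = defaultdict(list)
    for answer, ability in records:
        groups[answer].append(ability)
    return {answer: mean(values) for answer, values in sorted(groups.items())}


# Hypothetical data for one multiple-choice item: (chosen answer, ability in logits).
records = [("A", 1.0), ("A", 0.5), ("B", -0.5), ("B", 0.0), ("C", -1.0), ("D", 0.25)]
ability_by_answer = mean_ability_by_answer(records)
```

An item discriminates well when the students choosing the correct answer have a clearly higher average ability than those choosing any other option, as Item 1 does in the table; Item 2 shows almost no separation.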

Page 31: Vert&Hor Equating 111024

Slopes

• Level of discrimination is reflected by the slope of the item characteristic curve

Page 32: Vert&Hor Equating 111024

Assumption

• Assumption of the Rasch model: slopes are equal across items

• However, in practice slopes always vary a little within a test

• The expected slope is the average slope of all items in a test

• Steeper average slopes reflect a larger spread in abilities in the population

Page 33: Vert&Hor Equating 111024

Link items & discrimination

• The average discrimination of link items can vary between tests

• Individual link items can vary in discrimination between tests

Page 34: Vert&Hor Equating 111024

Experiment - 1

• Same test with 10 items is used in Year 6 and Year 10

• Spread in abilities is larger in Year 10 than in Year 6

• Items discriminate more in Year 10 than in Year 6

Page 35: Vert&Hor Equating 111024

Results experiment 1

          Average discrimination   Population variance   True variance
          Separate   Joint         Separate   Joint
Year 6    0.25       0.34          0.76       1.07       0.80
Year 10   0.41       0.34          1.89       1.49       2.00

Page 36: Vert&Hor Equating 111024

Evaluation of DIFFERENTIAL ITEM FUNCTIONING

Page 37: Vert&Hor Equating 111024

Differential Item Functioning

• Assumption of the Rasch model: all students with the same ability have the same probability of responding correctly to an item, independent of the subgroup a student belongs to

• The violation of this assumption is called Differential Item Functioning (DIF)

Page 38: Vert&Hor Equating 111024

Example: sex DIF

Page 39: Vert&Hor Equating 111024

Link items & DIF

• Set of link items needs to have the same average DIF as the non-link items in both tests

• The following experiment shows why

Page 40: Vert&Hor Equating 111024

Experiment 2

• Item pool of 105 items for assessment at time 1

• Selection of 55 trend items all favouring boys

• Scale two sets of items on the same set of student responses

Page 41: Vert&Hor Equating 111024

Results experiment 2

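[Figure: bar chart "Abilities by subgroup", comparing the abilities of male (M) and female (F) students when scaled on all items and on the link items favouring boys; bar labels: 0.44, 0.50, 0.60, 0.44.]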

Page 42: Vert&Hor Equating 111024

Conclusion experiment 2

• Selecting link items that on average favour a subgroup of students changes the gap in performance between subgroups

• The average DIF should be as close to 0 as possible
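Checking a candidate link set against this criterion amounts to averaging the item-level DIF estimates over the selected items. The sketch below assumes DIF estimates in logits are already available for each item; the item names, the values and the sign convention (positive favours one subgroup, negative the other) are all hypothetical.

```python
from statistics import mean

# Hypothetical item-level gender DIF estimates in logits.
dif = {"q1": -0.25, "q2": 0.5, "q3": 0.0, "q4": -0.5, "q5": 0.25, "q6": 0.0}


def average_dif(items, dif_estimates):
    """Average DIF over a set of items."""
    return mean(dif_estimates[item] for item in items)


pool_average = average_dif(dif, dif)                   # whole item pool
balanced_links = average_dif(["q1", "q3", "q5"], dif)  # DIF cancels out
biased_links = average_dif(["q2", "q5", "q6"], dif)    # favours one subgroup
```

Here the pool and the balanced link set both average 0.0, while the biased link set averages 0.25; by the conclusion above, the balanced set is the preferable choice.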

Page 43: Vert&Hor Equating 111024

Evaluation of OTHER ITEM CHARACTERISTICS

Page 44: Vert&Hor Equating 111024

Link items & sub-domains

• Equating shift should be based on a set of items that is representative of the whole test

• Equating shifts can be slightly different for different sub-domains

• Best practice to have equal proportions of sub-domains in trend items and in the total item pool

Page 45: Vert&Hor Equating 111024

Link items & item types

• Equating shifts can be slightly different for multiple choice items than for open ended items

• Best practice to have equal proportions of item types in trend items and in the total item pool

Page 46: Vert&Hor Equating 111024

EQUATING EXAMPLE

Horizontal and vertical equating in NAP Civics and Citizenship

Page 47: Vert&Hor Equating 111024

Equating in practice

• NAP-CC survey

• Year 6 and Year 10

• Assessment every 3 years since 2004

Page 48: Vert&Hor Equating 111024

Equating overview

Page 49: Vert&Hor Equating 111024

45 horizontal link items in Year 10

[Figure: relative difficulties of the link items; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]

Page 50: Vert&Hor Equating 111024

Average discrimination

2007 2010

45 link items 0.43 0.45

Page 51: Vert&Hor Equating 111024

Plot discrimination

[Figure: discrimination plot for the link items; axis ticks run from 0.20 to 1.00 and from -0.10 to 0.70.]

Page 52: Vert&Hor Equating 111024

Average gender DIF

2007 2010

45 link items -0.027 -0.014

Page 53: Vert&Hor Equating 111024

Selection of link items

• 32 of 45 items were selected to use as link items based on:
– change in relative difficulty
– change in discrimination
– average gender DIF

Page 54: Vert&Hor Equating 111024

[Figure: two plots of relative difficulties, one for the 45 link items and one for the 32 selected link items; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]

Page 55: Vert&Hor Equating 111024

Average discrimination

2007 2010

45 link items 0.43 0.45

32 link items 0.41 0.42

Page 56: Vert&Hor Equating 111024

[Figure: two discrimination plots, one for the 45 items and one for the 32 selected items; axis ticks run from 0.20 to 1.00 and from -0.10 to 0.70.]

Page 57: Vert&Hor Equating 111024

Average gender DIF

2007 2010

45 link items -0.027 -0.014

32 link items -0.035 -0.023

Page 58: Vert&Hor Equating 111024

Horizontal equating Year 6

• The process for Year 6 was identical

• 24 out of 27 link items could be used for equating from 2010 to 2007

Page 59: Vert&Hor Equating 111024

Equating shifts

Year 6 Year 10

Average difficulty 2010 0.384 0.618

Average difficulty 2007 -0.089 -0.159

Difference (=shift) -0.473 -0.777
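The shifts in the table are the 2007 link-item average minus the 2010 average at each year level, which moves the 2010 results onto the 2007 scale. A short check of that arithmetic, using the values from the table (the variable names are illustrative):

```python
# Average link-item difficulties from the table, in logits.
avg_difficulty = {
    "Year 6": {2010: 0.384, 2007: -0.089},
    "Year 10": {2010: 0.618, 2007: -0.159},
}

# Shift for equating 2010 to 2007: average in 2007 minus average in 2010.
shifts = {
    year_level: round(avg[2007] - avg[2010], 3)
    for year_level, avg in avg_difficulty.items()
}
```

This reproduces the shifts of -0.473 for Year 6 and -0.777 for Year 10.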

Page 60: Vert&Hor Equating 111024

Equating overview

Page 61: Vert&Hor Equating 111024

Related to common item equating is the EQUATING ERROR

Page 62: Vert&Hor Equating 111024

Uncertainty in the link

• The equating shift depends on the change in relative difficulty of each item

• Different sets of items will lead to slightly different shifts

• An uncertainty is associated with equating two tests due to sampling of items

Page 63: Vert&Hor Equating 111024

Equating error

• Expressed as a standard error, just like the student sampling error

• Take into account when estimating change over time

• The equating error is added to the standard error of the difference when comparing across time
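One common way to put these points into practice is sketched below. Two assumptions are made that the slides do not state: the equating error is estimated as the standard error of the mean of the item-level shifts (change in relative difficulty per link item), and "added" means combined in quadrature with the two sampling standard errors. The function names and all numbers are hypothetical.

```python
import math
from statistics import stdev


def equating_error(item_shifts):
    """Assumed estimator: standard error of the mean item-level shift."""
    return stdev(item_shifts) / math.sqrt(len(item_shifts))


def se_of_difference(se_time1, se_time2, eq_error):
    """Assumed combination: sampling errors and equating error added in quadrature."""
    return math.sqrt(se_time1 ** 2 + se_time2 ** 2 + eq_error ** 2)


# Hypothetical change in relative difficulty for four link items (logits).
item_shifts = [-0.5, -0.25, -0.75, -0.5]
eq_err = equating_error(item_shifts)

# Hypothetical sampling standard errors of the two means being compared.
se_without_link = se_of_difference(0.03, 0.04, 0.0)
se_with_link = se_of_difference(0.03, 0.04, eq_err)
```

Including the equating error always widens the standard error of the difference, so a change over time that looks significant on sampling error alone may no longer be significant once the uncertainty in the link is taken into account.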