
Conceptual Issues in Response-Time Modeling

Wim J. van der Linden

CTB/McGraw-Hill

Outline

• Traditions of RT modeling

• RTs fixed or random?

• Item completion, responses, and RTs

• RT and speed

• Speed and ability

Outline Cont’d

• RT and item difficulty

• Dependences between responses and RTs

• Hierarchical model of responses and RTs

• Applications to testing problems

Traditions of RT Modeling

• Four different traditions
  – No model
  – Distinct models for RTs
  – Response models with RT parameters
  – RT models in mathematical psychology

• Alternative
  – Hierarchical model of responses and RTs


Fixed or Random RTs

• Some models treat RTs as fixed quantities:
  – Roskam (1987, 1997); Thurstone (1937)

• RTs treated as random in psychology

• Random responses but fixed RTs seems contradictory

• Conclusion 1: Just as responses, RTs on test items should be treated as realizations of random variables


Item Completion, Response, and RT

• Rasch (1960) models for misreadings and reading speed
  – Poisson-gamma framework
  – Same notation and terminology for parameters in both types of models


“To which extent the two difficulty parameters … and the two ability parameters … run parallel is a question to be answered by empirical results, and at present we shall leave it open.” (Rasch, 1960, p. 42)

Item Completion, Response, and RT Cont’d

• Notion of equivalent scores for speed tests (Gulliksen, 1960; Woodbury, 1951, 1963):
  – Total time on a fixed number of items
  – Number of items correct in a fixed time interval

• Three types of variables required to describe test behavior:
  – Tij: response time (person j and item i)
  – Uij: response


Item Completion, Response, and RT Cont’d

• Three sets of variables (cont’d)
  – Dij: item completion (design variable)

• Uij and Dij have different distributions
  – Same holds for their sums
    • NU: number-correct score
    • ND: number of items completed

• Equivalence only when Pr{Uij=1|Dij=1}=1 for all items and persons


Item Completion, Response, and RT Cont’d

• Distinction between speed and power tests makes no sense; all tests are hybrids

• Conclusion 2: Tij, Uij, and Dij are random variables with different distributions. The same holds for their sums: total time (T), number correct (NU), and number completed (ND). Except for discreteness, T and ND are inversely related. (We’ll assume T and NU to be independent!)


RT and Speed

• Speed and time are not equivalent notions

• Generally, speed is the rate of change of some measure with respect to time, e.g.,


Speed of motion = Distance traveled / Time

RT and Speed Cont’d

• For achievement testing, an appropriate notion of speed is cognitive speed:

  Speed = Amount of cognitive labor / Time

• Fundamental equation:

  tij = βi*/τj*

  with βi* the amount of labor required (“time intensity”) by item i, τj* the speed of person j, and tij the response time of person j on item i
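The fundamental equation can be sketched in a few lines of code; the time intensities and speeds below are hypothetical, chosen only to illustrate the relation.

```python
def response_time(beta_star, tau_star):
    """RT implied by the fundamental equation t = beta*/tau*."""
    return beta_star / tau_star

def implied_speed(beta_star, t):
    """Speed recovered from an observed RT and the item's time intensity."""
    return beta_star / t

# At the same speed, an item requiring twice the labor takes twice the time.
t1 = response_time(beta_star=45.0, tau_star=1.5)   # 30.0 sec
t2 = response_time(beta_star=90.0, tau_star=1.5)   # 60.0 sec
```

The point of the equation: an RT by itself says nothing about speed; only together with the item’s time intensity does it identify how fast a person worked.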

RT and Speed Cont’d

• Lognormal RT model (van der Linden, 2006)
  – Log transformation to remove skewness from RT distributions:

    ln tij = ln βi* − ln τj*

  – Addition of a random term:

    ln tij = βi − τj + εij, with εij ~ N(0, σi²)

RT and Speed Cont’d

• Lognormal RT model:

  f(tij) = [αi/(tij√(2π))] exp{−½[αi(ln tij − (βi − τj))]²}

  with τj the speed of person j, βi the time intensity of item i, and αi the discrimination of item i
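As a quick sanity check, the density above can be coded directly; the parameter values are hypothetical, and the crude rectangle-rule integration only verifies that the function is a proper density.

```python
import math

def lognormal_rt_density(t, alpha, beta, tau):
    """f(t) = alpha/(t*sqrt(2*pi)) * exp(-0.5*(alpha*(ln t - (beta - tau)))**2)."""
    z = alpha * (math.log(t) - (beta - tau))
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)

# Illustrative parameters: time intensity beta = 4 and speed tau = 0.5 give a
# median RT of e^3.5 (about 33 sec); discrimination alpha = 2.
alpha, beta, tau = 2.0, 4.0, 0.5
step = 0.01
area = sum(lognormal_rt_density(0.005 + step * k, alpha, beta, tau) * step
           for k in range(40000))   # integrates the density over (0, 400)
```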

RT and Speed Cont’d

• Conclusion 3: RT and speed are different concepts related through a fundamental equation. RT models with a speed parameter should also have an item parameter for their amount of cognitive labor (or time intensity)

Speed and Ability

• Speed-accuracy tradeoff in psychology is the same as a speed-ability tradeoff in achievement testing
  – Negative within-person correlation between τ and θ
  – Change of speed required for tradeoff to become manifest

• Traditional IRT view of a person’s ability is of θ as a scale point, not as a function θ=θ(τ)
  – Effective ability level


Speed and Ability Cont’d

• At the group level, any correlation between ability and speed may occur

• Basic assumption: constancy of speed during the test
  – Constant speed implies constant ability (ceteris paribus)

• In practice, speed and ability always fluctuate somewhat, but fluctuations should be minor and unsystematic


Speed and Ability Cont’d

• Conclusion 4: Speed and ability are related through a distinct function θ=θ(τ) for each test taker. The function itself need not be incorporated into the response and RT models. But these models do require (fixed) parameters for the effective ability and speed of the test takers.


RT and Item Difficulty

• Descriptive research and speed-accuracy tradeoff suggest a correlation between RT and item difficulty
  – Item difficulty parameter in RT model?
  – Counterexample

• Item parameters in response and RT models are for different item effects (on probability of correct response and time, respectively)


RT and Item Difficulty Cont’d

• Latent vs. manifest effect parameters
  – Danger of reification of latent effects

• Conclusion 5: RT models require item parameters for their time intensity, but difficulty parameters belong in response models


Dependences between Responses and RTs

• Descriptive vs. experimental studies

• However, these studies necessarily involve data aggregation across items and/or persons
  – Spurious correlations due to hidden sources of covariation (item and person parameters)

• Marginal vs. conditional independence between responses and RTs (spurious correlation, Simpson’s paradox, etc.)


Dependences between Responses and RTs Cont’d

• Conclusion 6: Regular test behavior is characterized by three different types of conditional (or “local”) independence, namely
  – between responses on different items
  – between RTs on different items
  – between responses and RTs on the same item


Dependences between Responses and RTs Cont’d

• For these conditional independencies to hold for an entire test, constant speed is a necessary condition

• Empirical results


Hierarchical Model of Responses and RTs

• Distinct models for responses and RTs for a fixed person and item
  – Regular IRT model
  – E.g., lognormal model for RTs
  – Models should have
    • parameters for effective ability and speed
    • parameters for item difficulty and time intensity
    • conditional independence


Hierarchical Model of Responses and RTs Cont’d

• Second-level models for dependences between
  – ability and speed across persons
  – difficulty and time intensity across items

• Multivariate normal distributions (possibly after parameter transformation)
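A minimal sketch of such a second-level model for the person parameters, assuming a bivariate normal for ability θ and speed τ with illustrative means, standard deviations, and correlation:

```python
import math
import random

def draw_person(rng, mu_theta=0.0, mu_tau=0.0,
                sd_theta=1.0, sd_tau=0.3, rho=0.4):
    """Draw (theta, tau) from a bivariate normal via the Cholesky trick."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    theta = mu_theta + sd_theta * z1
    tau = mu_tau + sd_tau * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return theta, tau

rng = random.Random(1)
persons = [draw_person(rng) for _ in range(20000)]
```

Across persons the correlation between effective ability and speed is whatever ρ says; it is entirely separate from the within-person tradeoff θ=θ(τ).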


Hierarchical Model of Responses and RTs Cont’d

• Bayesian treatment of modeling framework
  – Parameter estimation and model fit analysis with MCMC (Gibbs sampler)
  – Plug-and-play approach
  – Calibration of items with respect to RT parameters is straightforward
  – R package available upon request (Fox, Klein Entink, & van der Linden, 2007; Klein Entink, Fox, & van der Linden, 2009)


Applications to Testing Problems

• Test design
  – Fixed tests
  – Adaptive tests
  – Test accommodations

• Adaptive testing
  – Item selection
  – Differential speededness

• Detection of cheating
  – Item memorization and preknowledge
  – Collusion

• And many more

Applications to Testing Problems Cont’d

• Use of RTs as collateral information in parameter estimation

• Cognitive research on problem solving

• Etc.


No RT Model

• Descriptive studies in educational testing
  – Correlation between responses and RTs
  – Regression of RT on item and person attributes
    • Word counts, IRT item parameters, etc.
    • Number-correct scores; ability estimates

• Experimental studies in psychology
  – Manipulation of task or conditions


No RT Model Cont’d

• Experimental reaction-time research (cont’d)
  – Speed-accuracy tradeoff (Luce, 1986)
  – Plot of proportion of correct responses against RT


No RT Model Cont’d

[Figure: proportion of correct responses (p) plotted against RT (t)]

• Problems
  – Spurious correlations between observed RTs
  – Speed-accuracy tradeoff is not a between-person phenomenon


No RT Model Cont’d

Stroop Test

[Color-word stimulus: “Green”]

Stroop Test Cont’d

[Color-word stimulus: “Blue”]

Spurious Relations

• RTs of two arbitrary students on a quantitative reasoning test

Subject 1: 22, 19, 40, 43, 27, 27, 45, 23, 14, …
Subject 2: 26, 38, 101, 57, 37, 21, 116, 44, 10, …

r = .89

• Responses of the same students

Subject 1: 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, …
Subject 2: 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, …

r = .20
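The RT correlation can be checked from the values shown; only the listed leading RTs are used here, so the coefficient comes out slightly below the r = .89 reported for the full test.

```python
rt1 = [22, 19, 40, 43, 27, 27, 45, 23, 14]
rt2 = [26, 38, 101, 57, 37, 21, 116, 44, 10]

def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson(rt1, rt2)   # strongly positive: item time intensities drive both
```

The high correlation is spurious in the sense of the talk: it reflects the items’ varying time intensities, not any dependence between the two students.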

Distinct Models for RTs

• Rasch’s (1960) models for reading speed

• Exponential models
  – Oosterloo (1975); Scheiblechner (1979)

• Gamma models
  – Maris (1993); Pieters & van der Ven (1982)

• Weibull models
  – Tatsuoka & Tatsuoka (1980)

Rasch’s Models

• Poisson distribution of the number of reading errors a in a text of N words:

  Pr{aij} = (λijN)^aij e^(−λijN) / aij!, with λij = δi/ξj

• Gamma distribution of the reading time for a text of N words:

  p(tij) = λij(λijtij)^(N−1) e^(−λijtij) / (N−1)!, with λij = ξj/δi
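Both models can be written down directly; the parameter values used in the check below are illustrative, not Rasch’s.

```python
import math

def poisson_errors_pmf(a, N, delta, xi):
    """Pr{a misreadings in N words}; per-word error rate lam = delta/xi."""
    lam = (delta / xi) * N
    return lam ** a * math.exp(-lam) / math.factorial(a)

def gamma_time_pdf(t, N, delta, xi):
    """Density of the time to read N words; reading rate lam = xi/delta."""
    lam = xi / delta
    return lam * (lam * t) ** (N - 1) * math.exp(-lam * t) / math.factorial(N - 1)

# Sanity check: the error probabilities sum to 1 (illustrative values).
total_p = sum(poisson_errors_pmf(a, N=100, delta=0.02, xi=1.0) for a in range(60))
```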

Response Models with RT Parameters

• This type of model mostly motivated by attempts to build the speed-accuracy tradeoff into the response model

• Response surface in Thurstone (1937)

• Logistic models
  – Roskam (1987, 1997); Verhelst, Verstralen & Janssen (1997)

Response Models with RT Parameters Cont’d

• We also have RT models that incorporate response parameters
  – E.g., lognormal models by Gaviria (2005) and Thissen (1982)

Thurstone’s Response Surface

[Figure: response surface relating ability, item difficulty, and RT; the RT dimension reflects the “speed-accuracy tradeoff”]

Roskam’s Model (1997)

  pi(θj) = 1/(1 + exp[−(θj + ln tij − bi)])
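A sketch of Roskam’s model as a function makes the built-in tradeoff visible: for fixed ability and difficulty, the success probability grows with the time spent. The parameter values below are illustrative.

```python
import math

def p_correct(theta, b, t):
    """Roskam (1997): P(correct) = 1/(1 + exp(-(theta + ln t - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta + math.log(t) - b)))

# More time spent -> higher probability of a correct response.
p_fast = p_correct(theta=0.0, b=0.0, t=1.0)    # 0.5
p_slow = p_correct(theta=0.0, b=0.0, t=10.0)   # about .91
```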

RT Models in Mathematical Psychology

• Models for underlying psychological processes
  – Diffusion models
  – Models for sequential and parallel processing

• Experimental data
  – Standardized task
  – Assumption of exchangeable subjects

• No subject or item parameters

RT and Speed

[Figure: two addition items, the second requiring more labor; Item 1 answered in 9 sec, Item 2 in 12 sec. Speed on each item: ?]

Speed-Ability Tradeoff

[Figure: within-person relation between speed and ability — a decreasing curve, with the curve for a higher-ability person lying above that for a lower-ability person]

Speed-Ability Tradeoff Cont’d

[Figure: a test taker’s effective speed and effective ability as one point on the curve θ=θ(τ)]

Speed-Ability Tradeoff Cont’d

[Figures: scatter plots of effective ability against effective speed across persons, illustrating that any between-person correlation may occur]

RT and Item Difficulty

[Figure: the two addition items (Item 1 and Item 2) shown again as a counterexample]

[Diagram: first level — the response depends on person ability and item difficulty; the RT depends on person speed and item time intensity]

[Diagram: second level — distributions of the person parameters (ability, speed) and of the item parameters (difficulty, time intensity) added on top of the first level]

Test Design

• So far, issues of test speededness have been dealt with intuitively, with post hoc evaluation of time limits

• Alternatively, the time parameters of the items can be used to assemble a test to have a prespecified level of speededness
  – Example for LSAT


New Test Equally Speeded as Reference Test

[Figure: total-time distributions of the new test and the reference test at τ=0, showing matching speededness]

Adaptive Testing

• Application 1: use responses and RTs during the test to select the next item
  – Posterior predictive density of responses on candidate item given previous responses and RTs
  – Example for LSAT (simulation)

• Application 2: select items to prevent speededness of test
  – Example for ASVAB


Responses and RTs in Adaptive Testing

[Diagram: response Uij depends on person ability θj and item parameters ai, bi, ci; RT Tij depends on person speed τj and item parameters αi, βi; population and item-domain distributions tie the person and item parameters together at the second level]

Mean Square Error in Ability Estimates

[Figure: MSE of the ability estimates as a function of θ for n=10 and n=20 items, with curves for no RTs, ρ=.2, and ρ=.8]

Time Used to Complete Test (Without Constraint)

[Figure: distributions of total time for test takers at speeds τ=−2 to τ=2; part of the distributions exceeds the 39-min time limit]

Time Used to Complete Test (With Constraint)

[Figures: with the constraint, the total-time distributions stay within time limits of 39, 34, and 29 min]

Detection of Cheating

• Item memorization and preknowledge
  – Check actual RTs on suspicious items against expectation based on (i) their time parameters and (ii) estimation of speed on other items
  – Bayesian residuals
  – Case study for GMAT


Detection of Cheating Cont’d

• Types of collusion
  – Sign language
  – Intra/internet
  – Wireless communication

• Collusion between test takers may manifest itself as correlation between their response times (RTs)

Detection of Cheating Cont’d

• However, observed RTs always correlate because the time intensity of the items varies from one item to the next (see earlier example of spurious correlation)!

• Therefore, correlation between RTs of pairs of test takers should be analyzed under a model for their bivariate distribution

Detection of Cheating Cont’d

• Bivariate lognormal model for RTs by test takers j and k on item i

  f(tij, tik) = [αi²/(2π tij tik √(1−ρjk²))] exp{−(zij² − 2ρjk zij zik + zik²)/(2(1−ρjk²))}

  with zij = αi[ln tij − (βi − τj)]
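The bivariate density can be coded from the expression above; a quick check (with illustrative parameter values) is that at ρjk = 0 it factors into the product of the two lognormal marginals.

```python
import math

def z(t, alpha, beta, tau):
    """Standardized residual of a log RT under the lognormal model."""
    return alpha * (math.log(t) - (beta - tau))

def marginal(t, alpha, beta, tau):
    """Lognormal RT density for one person on one item."""
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z(t, alpha, beta, tau) ** 2)

def bivariate(t_j, t_k, alpha, beta, tau_j, tau_k, rho):
    """Bivariate lognormal density for the RTs of persons j and k on item i."""
    zj, zk = z(t_j, alpha, beta, tau_j), z(t_k, alpha, beta, tau_k)
    q = (zj * zj - 2.0 * rho * zj * zk + zk * zk) / (1.0 - rho * rho)
    norm = alpha * alpha / (2.0 * math.pi * t_j * t_k * math.sqrt(1.0 - rho * rho))
    return norm * math.exp(-0.5 * q)

# At rho = 0 the joint density equals the product of the marginals.
f0 = bivariate(30.0, 45.0, 2.0, 4.0, 0.5, 0.2, rho=0.0)
prod = marginal(30.0, 2.0, 4.0, 0.5) * marginal(45.0, 2.0, 4.0, 0.2)
```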

Detection of Cheating Cont’d

• Example for test of quantitative reasoning

Case Study for GMAT Cont’d

• Example 1: RT patterns with 15 flagged items
  – Test taker spent most time on Items 1–18 and then rushed through 19–27
  – No cheating but a serious time-management problem
  – Observe the RT on Item 2, which is quite time intensive, but the residual RT is barely aberrant!

RT Pattern with 15 Flagged Items

[Figure: residual log RTs and observed RTs (in min) for Items 1–27]

Case Study for GMAT Cont’d

• Example 2: Observed vs. residual RTs (no flagged items!)
  – This case illustrates the need for RT modeling and analysis of residual RTs
  – Observed RTs suggest the same time-management problem as in the preceding example, but the pattern almost disappears for the residual RTs

Observed vs. Residual RTs

[Figure: residual log RTs and observed RTs (in min) for Items 1–27]

Case Study for GMAT Cont’d

• Example 3: Suspicious item
  – Large negative residual (−4.66) for Item 14
  – RT of 12.3 seconds (expected RT under the model was 88.9 seconds!)
  – Test taker had a correct response but very low estimated ability relative to the item difficulty
  – Four other test takers with the same behavior on the same item!
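The flagging logic in this example amounts to a standardized log-RT residual; the discrimination value below is hypothetical (the talk does not report it), so only the sign and rough magnitude of the residual are meaningful.

```python
import math

def log_rt_residual(t_observed, t_expected, alpha=1.0):
    """Standardized residual alpha*(ln t_obs - ln t_exp); alpha is assumed."""
    return alpha * (math.log(t_observed) - math.log(t_expected))

# Observed 12.3 sec vs. the 88.9 sec expected under the model (Example 3).
res = log_rt_residual(12.3, 88.9)   # strongly negative: answered far too fast
```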

Suspicious Item

[Figure: residual log RTs and observed RTs (in min) for Items 1–25; Item 14 stands out with a large negative residual]
