Richard M. Luecht, Ph.D., UNC-G Matthew J. Burke, Ph.D., ABIM

1

Richard M. Luecht, Ph.D., UNC-G

Matthew J. Burke, Ph.D., ABIM

2

Outline of the Talk

Rationale

Reconceptualizing “items” and test content

Item models and automatic item generation (AIG): mechanisms for mass producing items

Cognitive task models and “item families”: an engineering approach to scale and test development

Quality control (QC) for item families

3

Why Use Automated Item Generation and Principled Item Design Technologies?

“The demand for large numbers of items is challenging to satisfy because the traditional approach to test development uses the item as the fundamental unit of currency. That is, each item is individually hand-crafted—written, reviewed, revised, edited, entered into a computer, and calibrated—as if no other like it had ever been created before.”

Drasgow, Luecht & Bennett, Educational Measurement, 4th Edition, p. 473

4

Traditionally, the quality of test item generation has been dependent on the experience and interpretation of content specifications by item writers (Schmeiser & Welch, 2006)

Principled item design (Bennett, 2001; Irvine, 2002) is rapidly evolving from theory to practical implementation

Layout construct

Develop cognitive

task models

Design templates

Write & tryout

items per template

ReviewGenerate

items

Create Variables

Create Rendering

Models (Shells)

Combine Variables & Models by

Brute Force

NLP Quality Controls

Generate items

AE & ECD

AIG

5

6

Easy Task Model Family

Items

Templates

Moderate Task Model Family

Items

Templates

Difficult Task Model Family

Items

Templates

Lo

w

Hig

h

Population Distribution of

Examinee Proficiency

Task Model Map for the Item Bank

Scoring Evaluator

Data Model

Rendering Components

<Patient.article><Patient.description.age><Patient.description.occupation> “comes to” <Setting.description> “complaining of”<Patient.ailment.symptom1> <Patient.ailment.symptom1.duration><Patient.ailment.symptom2> <Patient.ailment.symptom2.duration><Patient.history.activity.recent><Patient.physicalexam.temp=# C, (convert(C,F))><Patient.physicalexam.pulse=#/min><Patient.physicalexam.respiration=#/min><Patient.physicalexam.bp=#1/#2><Patient.physicalexam.symptom1><Patient.physicalexam.symptom2> “What is the most likely cause of”

<Patient.ailment.prime_symptom> “?”

7

© Pearson

©American Institutes of Research

©American Institutes of Research

©NBME©AICPA

7

© Pearson

Where is the in the cerebral cortex’sparietal lobe? (Click on the appropriate area of the picture, below)

Parietal Lobe

8

Item Types

MCQ & SR

CR & ER

SA

PA

TE

SIM

Simulations

Multiple-Choice and Selected-Response

Constructed- and Extended Response

Technology-Enhanced

Performance Assessments

Short-Answer

• Stimuli, prompts and problem instructions

• Exhibits• Auxiliary tools/resources• Response capturing• Response data• Scoring evaluators

9

10

11

AIG in Three Steps*

The content required for the generated items is identified by test development specialists and defined as a cognitive model

An item model is developed by the test development specialists to specify where content is placed in each generated item

In Step #3, computer-based algorithms are used to place the content specified in Step #1 into the item model developed in Step #2

* Gierl, Lai & Turner (2012). Using automatic item generation to create multiple- choice test items. Medical Education, 46,757–765

12

13

[AGE] (Integer): From 25.0 to 60.0, by 5.0[GENDER] (String): 1: man 2: woman[PAIN] (String): 1: 2: and intense pain 3: and severe pain 4: and mild pain[LOCATION] (String): 1: the left groin 2: right groin 3: the umbilicus 4: an area near a recent surgery[ACUITYOFONSET] (String): 1: a few months ago 2: a few hours ago 3: a few days ago 4: a few days ago after moving a piano[PHYSICALFINDINGS] (String): 1: protruding but with no pain 2: tender 3: tender and exhibiting redness 4: tender and reducible[WBC] (String): 1: normal results 2: normal results 3: elevated white blood cell count 4: normal results

A 25-year-old man presented with a mass in the left groin. It occurred suddenly 2 hours ago while lifting a piano. On examination, the mass is firm and located in the left groin and lab work came back with normal results. Which of the following is the next best step?

A [AGE]-year-old [GENDER] presented with a mass [PAIN] in [LOCATION]. It occurred [ACUITYOFONSET]. On examination, the mass is [PHYSICALFINDINGS] and lab work came back with [WBC]. Which of the following is the next best step?

Contextual features: exploratory surgery; reduction of mass; hernia repair; ice applied to mass

14

Create Model Structures

Create Item Model

Extract Elements & Constraints

Generate Items

Extract Elements & Constraints

Table

Structures

Table

Constraints

Table

Variant Data

Table

Item Model(s)

Automatic Item Generator (Software)

Natural Language Processor Item 1.2

AIG Item Family

Item 1.3Item 1.2

Item 1.1

15

Sample AE Math Task Model and Templates

Lai, H.; Gierl, M. & Alves, C. (2010). Generating Items under the AE Framework. Invited symposium paper at the Annual Meeting of NCME, Denver

16

AE Item Production

Lai, H.; Gierl, M. & Alves, C. (2010). Generating Items under the AE Framework. Invited symposium paper at the Annual Meeting of NCME, Denver

17

Multiple-Language AIG*Human translations add expense and error on top of an already expensive process of artfully crafted items

Translated English-generated medical licensing examination multiple-choice items into Canadian French and Chinese by adding a “language layer” to the item models

Still partly a work in progress since the “art” of translation is seldom exact, given the nuances of language

Sin embargo... el "arte" de la traducción es raramente exacto

However... the 'art' of the translation is rarely accurate

* Gierl, Lai & Turner (2012). Medical Education. Gierl, Fung, Lai & Zheng (2013). National Council

on Measurement in Education Symposium Paper.

18

AIG model premise: number series problems are a convenient format to measure some aspects of numerical reasoning (application of rule-based induction) and are amenable to algorithmic item design

There is a mature “task model” for number series problems

19

AIG less human involvement ($$$)AIG without STRONG quality controls and evaluation criteria is not fruitfulStandards that depend on projected use and quality of evidence

Quality of item contentPredictability of item parametersImpact of item predictability on score reliability

How much should traditional “content blueprints” drive these standards?

20

21

22

23

Easier Tasks(Low Complexity)

Moderately Difficult Tasks(Medium Complexity)

Very Difficult Tasks(High Complexity)

Template 1.3

Item 1,3,n

Item 1,3,2Item 1,3,1

Template 1.1

Item 1,1,n

Item 1,1,2Item 1,1,1

Template 1.2 Template 25.3

Item 25,3,n

Item 25,3,2Item 25,3,1

Template 25.1

Item 25,1,n

Item 25,1,2Item 25,1,1

Template 25.2

Content Area 1

Item Difficulty

Content Area 2

Content Area 3

Content Area 3

24

Scalable & Replicable

Designs

Stable Cross-Platform

Performance

Consistent Data

Standardized Components

25

Task Model Grammars (TMGs) are domain-specific languages that describe the intended cognitive complexity design features for families of assessment tasks—the Task Models

Content and declarative knowledge components

Procedural skills needed

Tools, resourcesContextual conditions

Task Model Maps (TMMs) provide a distribution of Task Models on a scale

26

Skills: Required Examinee

Actions

Knowledge Objects

Relations Among Objects

or Actions

Contextual Complexity

Auxiliary Tools

32112 .,. , ,obj objectis relate contexec ta objec auxaction toolsc on tdt ti

27

Task Model Mapping: Locating IntendedChallenges to Support Evidence-Based Claims

Skill=identifyObjects = one, simple conceptRelations=noneContext=match word definitionTools=none

Skills=identify, compare, evaluateObjects = 3-4 complex propertiesRelations=hierarchical (3 levels) Context=complex text, dense info.Tools=facilitative if used correctly

28

Fixed-specification task models

Number of task models = no. of test items

Each task model is essential

Domain-sampled task models

Multiple task models per location

Task models are considered exchangeable at a particular location

Self-adaptive performance task models

Template components are manipulated to change the information (location)

Optimal reconfiguration of components

29

Rendering

Components

Semi-Passive Graphic Exhibits

Interactive Input

Compo-nents

Helm Controls

Online Tools

Online Resources

Scripts

Style Sheets

Data Model

Passive component rendering

data structures Interactive

component data

structures

Auxiliary tools dB

TMG codingComponent

object properties

Template and item statistical

information

Auxiliary content coding

Scoring Evaluator

Measurement opportunity

response structures

Scoring rubric, rules or

algorithms

Stored answer

components

Scoring QC agent(s)

30

A test has five multiple-choice questions scored correct/incorrect. Each question has four possible options. What will be the expected number-correct score for students who guess the answers to all five of the questions?

A. 0.25B. 0.80C. 1.25D. 3.75E. 5.00

Calculate expected values and use them to solve problems: (S-MD.3.) Develop a probability distribution for a random variable defined for a sample space in which theoretical probabilitiescan be calculated; find the expected value. (CCSS Initiatives Project, www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/)

31

Calculate expected values and use them to solve problems: (S-MD.3.) Develop a probability distribution for a random variable defined for a sample space in which theoretical probabilitiescan be calculated; find the expected value. (CCSS Initiatives Project, www.corestandards.org/the-standards/mathematics/hs-statistics-and-probability/)

1Recall.formula.SRS_uniform.discrete 1i i i

p P u aa

1

Recall.formula.expected_valuen

i iiE y y p u

1 1 2 21Apply.formula.sum_products

n

i i n niy p u p u p u p u

1Apply.formula.simplify_distributive 1/

n

iipu pn p a

maxConstraint.value.discrete_int 0,1, ,i

u u

maxConstraint.value.discrete_int 2, ,n n

Constraint.value.prob 0.0 1.0p

32

A <sample.event> has <n> <description.sample_units> <description.auxiliary_info>. <The/Each> <description.theoretical_event_probability>. What will be the expected <description.value_unit(s)> for <description.objects_using_theoretical_prob_distrib>?

<MCq5.distractor.1=p><MCq5.distractor.2=(1/n)*a><MCq5.distractor.3=n*p=∑xx*px><MCq5.distractor.4=(1/a)∑xx = p∑xx><MCq5.distractor.5=(1/a)*p*n>

Note: p=theoretical_prob_distr.constant=1/a

Scoring Evaluator ui=CAK(i.Selection.MCq.d=i.Key,1 if T,0 if F)

33

Reverse-Engineering Existing Items (Bottom-Up)

Reverse engineering actual items to develop a TMG (propositional) or language to detail required skills and knowledge components

Forward engineer templates and items from the TMG

Iteratively refine of TMG-based families, matching empirical item difficulty ordering

Construct Mapping Approach (Top-Down)

Develop a TMM along a trajectory

Design cognitive task models using challenge schema where skills knowledge | context, tools

Iteratively design and validate templates and item families using a hierarchical QC approach

34

Highly Constrained Templates/Models

Item Writing Guidelines with

Exemplars

Content Topic-Focused

Cognitive Complexity Focused Models

Item Modeling (e.g., Case & Swanson, 2003)

AE Task Models & Templates

AIG

Traditional Item Writing

35

36

AETask

Models

OperationalItems

MeasurementOpportunities

AETemplates

37

1 ; 1if if if if if ifP x c c a b

θ ξ

Geerlings, H., Glas, C. A. W., & van der Linden, W. J. (2011). Modeling rule-based item generation. Psychometrika, 76, 337-359.

First-Level Model

Second-Level Model ,if f fMVNξ μ Σ

1

R

if fr rrd

38

Calibrate individual items (ignore the item family)…Refine templates to reduce variation in item characteristics

Calibrate task models as families…monitor variation over time and “tweak” templates as needed

39

QC Implications(a)Item families are working well(b)Calibrate TM families

QC Implications(a)Item families not working(b)Tighten/repair templates

ESTIMATION and QC

39

40

41

Substantive isomorphism within item families

Cognitively exchangeable tasks in terms of required knowledge and skills

Exchangeable evidence to inform measurement claims

Statistical isomorphism within item families

Sufficiently small variation of all item statistical properties within item families

Exchangeability of items within families for scoring purposes

42

Questions?

“The world is full of magical things patiently waiting for our wits to grow sharper.”

Bertrand Russell

43

Ric Luecht: [email protected] Burke: [email protected]

Documents

Richard M. Luecht, Ph.D., UNC-G Matthew J. Burke, Ph.D., ABIM