
Effect of a limited-enforcement intelligent tutoring system in dermatopathology on student errors, goals and solution paths

Velma L. Payne a, Olga Medvedeva a, Elizabeth Legowski a, Melissa Castine a, Eugene Tseytlin a, Drazen Jukic b,c, Rebecca S. Crowley a,b,d,*

a Department of Biomedical Informatics, University of Pittsburgh School of Medicine, United States
b Department of Pathology, University of Pittsburgh School of Medicine, United States
c Department of Dermatology, University of Pittsburgh School of Medicine, United States
d Intelligent Systems Program, University of Pittsburgh School of Arts and Sciences, United States

Received 28 July 2008; received in revised form 28 July 2009; accepted 28 July 2009

Artificial Intelligence in Medicine (2009) 47, 175—197

http://www.intl.elsevierhealth.com/journals/aiim

KEYWORDS: Intelligent tutoring systems; Diagnostic reasoning; Clinical competence; Cognition; Diagnostic errors; Education, medical; Educational technology; Dermatology; Pathology; Problem solving

Summary

Objectives: Determine effects of a limited-enforcement intelligent tutoring system in dermatopathology on student errors, goals and solution paths. Determine if limited enforcement in a medical tutoring system inhibits students from learning the optimal and most efficient solution path. Describe the type of deviations from the optimal solution path that occur during tutoring, and how these deviations change over time. Determine if the size of the problem-space (domain scope) has an effect on learning gains when using a tutor with limited enforcement.

Methods: Analyzed data mined from 44 pathology residents using SlideTutor, a Medical Intelligent Tutoring System in Dermatopathology that teaches histopathologic diagnosis and reporting skills based on commonly used diagnostic algorithms. Two subdomains were included in the study, representing sub-algorithms of different sizes and complexities. Effects of the tutoring system on student errors, goal states and solution paths were determined.

Results: Students gradually increase the frequency of steps that match the tutoring system's expectation of expert performance. Frequency of errors gradually declines in all categories of error significance. Student performance frequently differs from the tutor-defined optimal path. However, as students continue to be tutored, they approach the optimal solution path. Performance in both subdomains was similar for both errors and goal differences. However, the rate at which students progress toward the optimal solution path differs between the two domains. Tutoring in superficial perivascular dermatitis, the larger and more complex domain, was associated with a slower rate of approximation towards the optimal solution path.

Conclusions: Students benefit from a limited-enforcement tutoring system that leverages diagnostic algorithms but does not prevent alternative strategies. Even with limited enforcement, students converge toward the optimal solution path.
© 2009 Published by Elsevier B.V.

* Corresponding author at: Department of Biomedical Informatics, University of Pittsburgh School of Medicine, UPMC Shadyside Cancer Pavilion, Room 307, 5230 Centre Avenue, Pittsburgh, PA 15232, United States. Tel.: +1 412 623 1752; fax: +1 412 647 5380.
E-mail address: [email protected] (R.S. Crowley).

0933-3657/$ — see front matter © 2009 Published by Elsevier B.V. doi:10.1016/j.artmed.2009.07.002

1. Introduction

1.1. Current practices for training in dermatopathology

Training in dermatopathology poses significant challenges. Dermatopathology encompasses a very large number of diagnostic entities, with significant overlap. Some areas of the domain (e.g., inflammatory diseases) require substantial knowledge of ancillary studies such as immunofluorescence and electron microscopy, while other areas (e.g., melanocytic lesions) present difficult diagnostic challenges and are associated with high error and malpractice litigation rates [1,2]. At present, most training programs provide only two to three months of dermatopathology training. Residents and fellows are trained in the apprenticeship style: individually examining cases that come to the service, and then examining the case a second time with the expert dermatopathologist over a multi-headed microscope. During this interaction, the expert provides instruction, guidance and redirection. But trainees are rarely exposed to a sufficient range of cases to achieve expertise during this brief interval, even when they supplement apprenticeship training with slide study sets and other didactic materials. Furthermore, new pressures in health care and pathology are decreasing the time available for teaching and learning basic skills in dermatopathology and sub-specialty areas. The need for training in new sub-disciplines (for example, molecular diagnostics), the pressure to decrease "turn-around" times on cases, and the significant increase in caseloads and on-service time for attending pathologists pose a growing barrier to training in dermatopathology.

1.2. Need for intelligent medical training systems

Intelligent medical training systems [3] provide a method to address the educational challenges of medical training. In recent years, intelligent medical training systems (IMTSs) have been developed in a wide range of domains [4—9]. Like traditional computer-based learning, IMTSs can represent a much greater range of cases and are thus not bound by place and time [10]. But IMTSs also have significant advantages over traditional computer-based learning. For example, IMTSs that provide immediate feedback can simulate the normal apprenticeship experience [3]. An IMTS that dynamically builds a student model [11] can support case selection that adapts to student mastery, maximizing time spent on cases that have the highest learning value to individual students.

We have developed an IMTS in dermatopathology [6] that builds on the cognitive tutoring system formalism [12,13]. In non-medical domains, students working with a cognitive tutoring system consistently achieve higher performance than those who are trained in a traditional classroom setting alone [14]. Cognitive tutoring systems act as performance trainers. The system requires students to perform some or all of the intermediate steps in solving a problem, and then provides immediate feedback on these intermediate steps by comparing the action to the system's internal model of an expert problem solver. Intermediate steps may include reasoning and inference aspects of problem solving and decision-making (such as identifying evidence or making a hypothesis), as well as procedural actions (such as measuring a tumor, or reporting on a prognostic factor). When students make errors, the system provides explanations of the errors, and when students are lost, the system can tell them the next step to take. A key advantage of this type of system is that the student is made immediately aware of errors in intermediate cognitive steps and does not flounder along non-productive solution paths. However, a disadvantage is that the student must follow an explicit model of reasoning.

1.3. Flexibility vs. enforcement: a fundamental tradeoff

The strategy of providing immediate feedback in cognitive tutors has been controversial. On the one hand, there is strong evidence that students reach mastery level quickly when using directive systems that provide immediate feedback [15]. While working with an intelligent tutoring system that taught LISP programming, Corbett and Anderson assessed the effect of four feedback conditions: (a) immediate feedback and immediate error correction; (b) immediate error flagging and student control of error correction; (c) feedback on demand and student control of error correction; and (d) no tutor step-by-step problem-solving support. The study showed that students in the immediate feedback group (with the greatest tutor control over problem solving) experienced the most efficient learning. Students who received immediate feedback completed the problems fastest, and the three groups who received some level of feedback support performed better on assessment tests than the group that did not receive feedback [15]. On the other hand, there has been continued concern that immediate feedback can be associated with important negative effects. Cognitive processes in the performance of complex tasks may be disrupted by immediate feedback [16—20]. Critics of immediate feedback have suggested that the rigid one-to-one action-feedback cycle prevents students from learning the metacognitive skills needed to evaluate their own problem solving and sense when they are making errors [21,22]. In effect, students may become overly dependent on the immediate feedback, such that they are unable to generate an internal sense of their progress towards a correct answer.

Another style of tutoring approaches student mentoring in a different way. Constraint-Based Tutoring Systems (CBTSs) give the student a great deal of flexibility during problem solving. Constraint-based tutors are based on a theory of learning from performance errors [23,24], which consists of two phases: (a) error recognition and (b) error correction. When using a constraint-based tutor, students are free to perform whatever actions they wish until they make an error. The system determines if the student is fulfilling all the general principles of the domain; if so, the system does not interrupt students with feedback but permits them to continue their current path. In this case, students are not required to follow the path taken by the expert. Constraint-based modeling does not impose any particular strategy, since it evaluates the current state of problem solving instead of the specific action of the student. However, an inherent limitation of CBTSs is that it is typically not possible to prompt the student towards the next-best-step, because CBTSs do not model the entire problem-solving process.

1.4. What do human tutors do that is so effective?

The tradeoff between flexibility and enforcement is a fundamental issue in human tutoring as well. A number of studies have documented the effectiveness of human tutors [25—27] and suggest that this effectiveness results from simultaneously promoting increasing autonomy while providing guidance that prevents frustration and confusion [28]. Techniques used in human training include (1) offering guided learning by doing; (2) providing indirect rather than explicit feedback; (3) basing feedback content and timing on error complexity [28]; and (4) fading support [29].

Guided learning by doing is an effective technique in one-on-one teacher—student interaction, in which students are encouraged to attempt a problem on their own before the teacher offers guidance. There are significant benefits to letting students encounter obstacles, work around them, and realize what worked and did not work. However, allowing students to learn by doing without guidance may result in frustration, confusion and development of poor problem-solving strategies. Human tutors follow student solutions closely and redirect students when they encounter impasses by drawing attention to the error and giving students the opportunity to solve the problem again, rather than giving explicit corrective feedback [28].

Another efficient human tutoring technique is providing indirect rather than explicit feedback. The use of subtle cues to guide and support a student produces improved performance, enhances confidence and sense of accomplishment, and maximizes the motivation to learn [28]. Students seem to solve more difficult problems with an indirect tutoring style than they do with explicit feedback [27].

Basing the content and timing of feedback on the complexity of the error is another effective human tutoring strategy. Merrill et al. found that human tutors modulate intervention based on the potential learning consequences of errors [30]. Littman et al. found that human tutors first address errors that reflect a poor understanding of material and errors that mask other errors, and then address less consequential errors [31].

Finally, coaching students as they practice a skill and gradually withdrawing feedback (fading the scaffolding) as proficiency increases has also been shown to be a highly effective teaching strategy [29].

Human tutors balance the disparate goals of encouraging autonomy and enforcing correct solutions by adjusting both components to the needs of the particular student. But tutoring systems are naturally more rigid than human teachers. Can a limited-enforcement tutoring system still provide sufficient guidance to move students towards optimum solution paths? We mined data of students using a limited-enforcement tutor in dermatopathology to answer this question.

1.5. Limited-enforcement tutoring

An important variable in any training system is the degree to which the system enforces specific sequences of actions and student strategies. A cognitive-based tutoring system is usually a high-enforcement tutoring system that significantly limits the number of acceptable actions at any step, and drives students toward specific sequences of actions. In part, this is because the cost of modeling many strategies is very high. A constraint-based tutoring system is nearly free of enforcement because it allows students to pursue their own problem-solving path. Between these two ends of the spectrum is a limited-enforcement tutoring system that accepts a wider variety of sequences, and thus accommodates many different student strategies. Much like a human tutor, a limited-enforcement tutoring system allows students to pursue their own solution strategy, closely monitoring them and intervening only when they veer too far off a valid solution path. This technique enhances learning by allowing the student to learn from making mistakes while at the same time being monitored and guided by an expert.

Medical education in dermatopathology is an example of a complex domain where the balance between enforcement and flexibility is critical. Novices who lack experience in a domain need direction to quickly learn visual identification skills and properly apply existing diagnostic algorithms. At the same time, the tutoring system must not prevent perfectly acceptable solution paths that do not exactly match the expert model. Failure to accept other orders and strategies in reasoning can produce user resistance and may inadvertently suppress effective alternative problem-solving strategies.

2. Research questions

2.1. Does limited enforcement prevent students from learning the optimum solution path?

Evidence from studies of human tutoring suggests that flexibility in evaluating student actions may be a key component of the success of human tutoring [28]. One obvious reason for concern regarding flexibility is that the student may not learn the optimal solution path at all. This outcome is a disadvantage in medical training systems that attempt to move students towards normative and efficient performance. One objective of this research is to determine if flexibility in accepting variations in solution paths inhibits learning the optimal solution path.

2.2. What kinds of deviations from the optimum solution path are observed, and how do they change over time?

Because cognitive tutors are capable of determining the most appropriate next step for any given problem-state, it is possible to measure the degree to which every student action differs from the expected intermediate step. Differences will include both cognitive errors and differences in goals between the student and the system. Taken together, they provide a measure of how close off-path student actions are to the optimum reasoning path. A second objective of this research is to determine the frequency and distribution of errors and goal differences when using a limited-enforcement tutoring system.

2.3. What is the effect of problem-space size (domain scope) on learning when using limited enforcement?

When human tutors use an indirect tutoring style instead of explicit feedback, students seem to be able to solve more difficult problems [27]. One potential disadvantage of tutoring systems that do not rigidly enforce specific actions is that they may not scale well to larger domains, where efficiency becomes the rate-limiting factor. The final objective of this study is to determine whether such a limited-enforcement style is associated with increased errors and goal differences in larger domains.

3. Methods

3.1. SlideTutor system

SlideTutor is an Intelligent Medical Training System in Dermatopathology that teaches histopathologic diagnosis and reporting skills. Currently, the system has been instantiated for two areas of dermatopathology: inflammatory diseases and melanocytic lesions. SlideTutor is a client-server application in which students examine virtual slides using various magnifications, identify visual features, specify qualities of these features, make hypotheses and diagnoses, and write pathology reports. As a student works through a case, SlideTutor provides feedback including error analysis and confirmation of correct actions. At any time during the solution path, students may request hints. Hints are context-specific to the current problem-state, and provide increasingly more targeted advice based on a system-generated 'best-next-step'. In order to distinguish correct from incorrect intermediate steps, categorize errors, and provide hints, SlideTutor maintains a cognitive model of diagnosis using a Dynamic Solution Graph (DSG) and a set of ontologies that represent relationships between diagnoses and pathologic findings. SlideTutor contains a pedagogic model that describes the appropriate interventions for specific errors, and maintains a probabilistic model of student performance that is used to adapt instruction based on the student model [11]. SlideTutor uses a variety of interfaces, including a diagrammatic interface for reifying diagnostic reasoning and a natural language interface that interprets and evaluates diagnostic reports written by students [32]. For this study, we limited our data analysis to the diagnostic component of the tutoring system.

3.1.1. SlideTutor architecture

The architecture of SlideTutor has been previously described [6]. We provide here only a brief introduction to the system as it pertains to the current work.

Student actions during intermediate problem-solving steps are tested against an expert model (Fig. 1). The expert model consists of a domain model, domain task, case data, and problem-solving methods (PSMs). The domain model defines the relationship between evidence, or feature sets, and disease entities. Within the model, disease entities are associated with a set of features, and feature attribute value sets such as location and quantity that further refine each feature. The domain task represents a variety of cognitive goals implicit in the classification problem-solving process, including identifying a feature; specifying an absent feature; refining a feature by designating an attribute such as location or quantity; asserting hypotheses and diagnoses; asserting a supporting relationship connecting a feature to a hypothesis; asserting a refuting relationship between a feature and a hypothesis; and specifying that a feature distinguishes a hypothesis from a competing hypothesis. The case data is a representation of the name, location, and attributes of the features present in each case. The problem-solving methods utilize all components of the expert model to create a Dynamic Solution Graph (DSG) that models the current problem-space and valid-next-steps for case solution. At every point in every student's problem solving, the DSG maintains a context-specific 'best-next-step', which is used as advice if the student requests a hint.

Figure 1 SlideTutor architecture. (Reprinted from [6], Copyright (2006), with permission from Elsevier.)

The instructional model is composed of a pedagogic model, pedagogic tasks, student model data, and problem-solving methods. The primary objective of the instructional model is to respond to requests for help from the student and to intervene by triggering alerts when the student has made an error. The pedagogic knowledge base contains the declarative knowledge of responses to students' errors and requests for help. Hints are selected according to a hint priority that depends on the state of the problem at the time the student requests help. Hints also have levels of specificity: initially, hints offer general guidance, and as the student continues to seek help, the hints become more specific and directive. The pedagogic task contains information about how the system should help a specific student, given the problem-state and the current state of the student model [11]. The problem-solving methods of the instructional model are used to match the student actions to the help categories in the pedagogic model in order to determine how the system should intervene.

3.1.2. Dynamic Solution Graph

SlideTutor's Dynamic Solution Graph (DSG) is a directed acyclic graph that models the current problem-space and all valid-next-steps, specifies the best-next-step, and is updated with each student action. The initial DSG at the beginning of each problem is generated using knowledge derived from the domain, task, and case models of the expert model described above. After each student action, the graph structure is updated; therefore, the DSG dynamically and continually assesses the current problem-space and regenerates all valid-next-steps and the best-next-step. Based on the student action, the DSG will change by adding new nodes and arcs, or deleting existing nodes and arcs, representing all changes in the state of the problem and the next set of valid-next-steps (goals). Student actions that match any valid-next-step in the DSG result in propagating changes within the graph to produce the next problem-state specific to the case and student. Student actions that do not match any valid-next-step in the DSG are handled by the instructional layer and result in context-specific remediation, including visual and textual explanations. In these cases, the DSG does not advance. The DSG also has an evidence cluster node used to express an integrated relation between features and hypotheses. As the student further refines case features, the evidence cluster will point to fewer and fewer hypothesis nodes, reflecting a more refined disease set and a more precise diagnosis. The dynamic nature of the graph enables the system to reason with the student. Sequential DSG states define the student's path through the problem-space and provide information about the reasoning technique that the student is using.
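The update cycle described above can be summarized in a short sketch. The following Python fragment is only an illustration of the behavior of a dynamic solution graph as characterized in this section, not SlideTutor's actual implementation; the class and function names (Step, DynamicSolutionGraph, regenerate, remediate) are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, Optional, Set, Tuple

@dataclass(frozen=True)
class Step:
    goal_type: str   # e.g. "feature_identification", "attribute_value", "hypothesis"
    name: str        # e.g. "Blister" or "Location: sub-epidermal"

@dataclass
class DynamicSolutionGraph:
    valid_next_steps: Set[Step] = field(default_factory=set)
    best_next_step: Optional[Step] = None

    def matches(self, action: Step) -> bool:
        # A student action is acceptable if it matches any valid-next-step.
        return action in self.valid_next_steps

    def advance(self, action: Step,
                regenerate: Callable[[Step], Tuple[Set[Step], Step]]) -> None:
        # Propagate a correct action: the expert model's problem-solving methods
        # (represented here by `regenerate`) recompute the valid-next-steps and
        # the best-next-step for the new problem-state.
        self.valid_next_steps, self.best_next_step = regenerate(action)

def handle_action(dsg: DynamicSolutionGraph, action: Step, regenerate, remediate) -> str:
    if dsg.matches(action):
        dsg.advance(action, regenerate)   # graph nodes/arcs change; the state advances
        return "Confirm"
    remediate(action)                     # instructional layer gives context-specific remediation
    return "Failure"                      # the DSG does not advance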

3.1.3. Best-next-step derivation

As the student progresses through the case, the DSG is updated to reflect all valid-next-steps, one of which is designated as the best-next-step. The best-next-step is context-specific to the current problem-state and pedagogic model. For an error-free current problem-state, the best-next-step is determined based on the pedagogic model's strategic settings. The strategy for novice users supports forward reasoning from (1) slide exploration, to (2) identification and refinement of all features, moving from low to high microscope zoom level, to (3) triggering hypotheses that are consistent with any identified feature, and (4) making a final diagnosis that is consistent with all correctly identified features.
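As a rough illustration of how a best-next-step might be chosen from the valid-next-steps under the novice forward-reasoning strategy above, consider the sketch below. The goal-type labels and the priority ordering are assumptions paraphrased from the text, not the actual settings of SlideTutor's pedagogic model.

# Forward-reasoning priority for novices, paraphrasing steps (1)-(4) above.
NOVICE_PRIORITY = [
    "slide_exploration",
    "feature_identification",   # identify features, low to high zoom level
    "feature_refinement",       # attribute value identification
    "hypothesis_triggering",    # hypotheses consistent with identified features
    "diagnosis",                # final diagnosis consistent with all correct features
]

def best_next_step(valid_next_steps):
    """Pick the valid-next-step whose goal type comes earliest in the strategy order.
    `valid_next_steps` is an iterable of (goal_type, name) pairs produced by the DSG."""
    return min(valid_next_steps, key=lambda step: NOVICE_PRIORITY.index(step[0]))

# Example: an unrefined feature takes precedence over triggering a hypothesis.
print(best_next_step([("hypothesis_triggering", "Acute burn"),
                      ("feature_refinement", "Location: sub-epidermal")]))
# -> ('feature_refinement', 'Location: sub-epidermal')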

3.2. Student—system interaction data

3.2.1. Data capture

Data from all student actions in the SlideTutor system, together with the system's response to each action, are captured in an Oracle database for later analysis [33]. During the tutoring session, students perform various actions to complete problem-solving goals, including visual feature identification, where the student identifies present and absent features associated with the case; feature refinement or attribute value identification, where the student specifies an attribute of a feature, such as the quantity of eosinophilic dermal inflammatory infiltrate as 'moderate' or the location of a blister as 'sub-epidermal'; hypothesis generation, where the student triggers a hypothesis for the case; and hypothesis evaluation, where the student specifies one or more acceptable diagnoses for the case. Specific student actions (e.g., "Blister") and their corresponding goal class (e.g., "Feature Identification") are saved by the system. For each student action, the system response (confirm, failure, or hint) is also saved. 'Confirm' indicates that the student action corresponds to a valid-next-step. 'Failure' indicates that the student action does not correspond to a valid-next-step, in which case the system also stores the reason for the error based on its categorization of each student error. 'Hint' indicates that the tutor responded to a student request for help, in which case the system also stores the content of the hint that the system provides. Table 1 shows a highly simplified example of student—system interaction data from one student. The hint request was not numbered, in order to keep action numbers in Table 1 aligned with step numbers in Fig. 2 (see Section 3.2.2 for further explanation).
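The logged fields described above (and shown as columns in Table 1) can be pictured as a simple record. The structure below is a hypothetical mirror of those fields for use in the analysis sketches that follow; it is not the actual database schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class InteractionEvent:
    action_number: Optional[int]   # None for hint requests, which are not numbered
    user: str                      # e.g. "User1"
    action_type: str               # goal class, e.g. "Feature identification"
    action_name: str               # e.g. "Blister"
    tutor_response: str            # "Confirm", "Failure", or "Hint"
    error_code: Optional[str]      # e.g. "S3"; present only when tutor_response == "Failure"
    bns_action_type: str           # tutor best-next-step goal class
    bns_name: str                  # tutor best-next-step name

# The first row of Table 1 expressed as a record:
first = InteractionEvent(1, "User1", "Feature identification", "Blister",
                         "Confirm", None, "Attribute value identification",
                         "Location: sub-epidermal")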

3.2.2. Tutoring session data

The interaction of the student and system can be characterized as a sequence of actions that overlap when the student performs the best-next-step, and deviate when the student does not. In Fig. 2, we represent the events in Table 1 as two partially overlapping sequences of student and system actions. The hint request shown in Table 1 is not reflected in Fig. 2 (and is not numbered as a step), since we do not include hint actions in our analysis. The top row of Fig. 2 shows the student actions that are not the best-next-step; the bottom row shows the system-generated best-next-steps that the student did not perform; and the middle row depicts actions where the student action and the best-next-step match. Because the tutor takes into account information based on the current problem-state, each deviation (shown as steps in the sequence where icons are present in the top and bottom rows) can be considered independent of all others in the sequence. For example, if the student has previously identified a feature in a different order than the tutor-defined optimal path, then the best-next-step produced by the system will relate to the feature just identified rather than to the feature that was the previous best-next-step.

Table 1 Student—system interaction data.

Action # | User  | Action type                    | Action name                             | Tutor response | Error code | Tutor best-next-step action type | Tutor best-next-step name
1        | User1 | Feature identification         | Blister                                 | Confirm        |            | Attribute value identification   | Location: sub-epidermal
2        | User1 | Absent feature identification  | Mucin                                   | Confirm        |            | Attribute value identification   | Location: reticular dermis
—        | User1 | Hint request                   | Hint                                    | Hint           |            | Attribute value identification   | Location: reticular dermis
3        | User1 | Attribute value identification | Location: reticular dermis              | Confirm        |            | Feature identification           | Eosinophil rich inflammatory infiltrate
4        | User1 | Feature identification         | Eosinophil rich inflammatory infiltrate | Confirm        |            | Attribute value identification   | Location: dermis
5        | User1 | Attribute value identification | Location: epidermis                     | Failure        | S3         | Attribute value identification   | Location: dermis
6        | User1 | Attribute value identification | Location: dermis                        | Confirm        |            | Attribute value identification   | Quantity: mild
7        | User1 | Hypothesis generation          | Acute burn                              | Failure        | T2         | Attribute value identification   | Quantity: mild

3.2.3. SlideTutor as a limited-enforcement tutor

SlideTutor is a limited-enforcement tutor in that it does not force the student to solve the case by traversing the system-defined optimal solution path. The tutoring system permits the student to solve the case by performing goals in any order, as long as the items identified are applicable to the case. Operating much like a human tutor, SlideTutor monitors student actions and intervenes only when the student performs an action that is inappropriate for the case. Step 2 in Fig. 2 is an example where the student performed an action (identifying an absent feature of mucin) that was not the best-next-step, but was acceptable in this case. SlideTutor permitted the student to continue this solution path even though it did not correspond to the optimal solution path. Steps 5 (identifying the location of the eosinophilic infiltrate as epidermal) and 7 (specifying a hypothesis of acute burn) are examples of student actions that were incorrect for the case. For these actions, SlideTutor intervened by displaying an error message explaining what the student did wrong.

3.3. Data analysis

3.3.1. Domains, cases and subjects

We analyzed student—system interaction events from 44 pathology residents solving dermatopathology cases using SlideTutor (Table 2). Students represented the entire spectrum of post-graduate training, and included individuals who had previously completed a dermatopathology rotation as well as those who had not. The data include tutoring sessions performed for two different experimental studies [34,35], which demonstrated significant learning gains from pre-test to post-test. All tutoring sessions were conducted using the identical tutoring approach (e.g., immediate feedback). The only significant difference between the two groups was the cases and domain knowledge used in the tutoring session.

Table 2 Demographic characteristics of students (number of students in each tutoring group).

Student characteristics                 | Sub-epidermal | Superficial perivascular | Total
Post-graduate year: First               | 6             | 9                        | 15
Post-graduate year: Second              | 7             | 7                        | 14
Post-graduate year: Third               | 5             | 3                        | 8
Post-graduate year: Fourth              | 2             | 3                        | 5
Post-graduate year: Fifth               | 1             | 1                        | 2
Previous dermatopathology rotation: Yes | 10            | 11                       | 21
Previous dermatopathology rotation: No  | 11            | 12                       | 23
Total                                   | 21            | 23                       | 44

Figure 2 Sequences of student and system actions.

In the first group (N = 21), tutoring sessions included cases from the sub-epidermal blistering dermatitis diagnostic algorithm, which models the relationships among 33 diseases, 23 visual features, and 31 attribute-value pairs. Twenty cases were used in a fixed sequence during the tutoring session (4.5 hours), and students who completed the entire set were asked to repeat the entire loop until the required time on task had elapsed. The first group of subjects saw a set of cases during tutoring that represented a large percentage of the entire subdomain studied.

In the second group (N = 23), tutoring sessions included cases from the superficial perivascular dermatitis diagnostic algorithm, which models the relationships among 74 diseases, 52 visual features, and 66 attribute-value pairs. Fifteen cases were used in a fixed sequence during the tutoring session (2.25 hours), and students who completed the entire set were asked to repeat the entire loop until the required time on task had elapsed. The second group of subjects saw a set of cases during tutoring that represented a small percentage of the entire subdomain studied.

Characteristics of the datasets are shown in Table 3. A total of 16,431 student actions were analyzed, including 9873 student actions from tutoring sessions using the smaller sub-epidermal blister dermatitides diagnostic algorithm and 6558 student actions from tutoring sessions using the much larger superficial perivascular dermatitides diagnostic algorithm.

3.3.2. Analysis of student actions

To determine the relative frequency of optimum best-next-steps in contrast to other student actions, we classified student actions into four categories: (a) hint-driven, (b) correct best-next-step (Correct-BNS), (c) correct but not BNS (Correct-not-BNS), and (d) error. Table 4 provides an example of each classification based on the student actions in Table 1. Fig. 2 shows a sequence of seven student actions and their classifications.

We determined total counts and frequencies (%) of each category for each problem solved by each student and then plotted means for all students over time (Fig. 4). Data for the two different domains were analyzed separately. For graphs depicting means of each action category over time, we limited our analysis to the first time that a problem was solved (see Section 3.3.1 for discussion of the problem solution sequence).
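A minimal sketch of this per-problem tally is given below, using records like the hypothetical InteractionEvent of Section 3.2.1. The classification rule is a simplified reading of the four categories defined above (an action is compared with the best-next-step that was in force when it was taken); it is illustrative only.

from collections import Counter

def classify(event, prior_bns, prior_response):
    """Classify one student action.
    prior_bns: (goal_type, name) best-next-step in force when the action was taken.
    prior_response: tutor response to the preceding action ("Hint" marks a hint-driven step)."""
    if prior_response == "Hint":
        return "Hint-driven"
    if event.tutor_response == "Failure":
        return "Error"
    if (event.action_type, event.action_name) == prior_bns:
        return "Correct-BNS"
    return "Correct-not-BNS"

def problem_frequencies(events, initial_bns):
    """Counts and percentages of each action category for one solved problem."""
    counts, bns, response = Counter(), initial_bns, None
    for ev in events:
        counts[classify(ev, bns, response)] += 1
        bns, response = (ev.bns_action_type, ev.bns_name), ev.tutor_response
    total = sum(counts.values())
    return {category: (n, 100.0 * n / total) for category, n in counts.items()}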


Table 4 Student action categories, definitions and examples.

Category | Definition | Example
Hint-driven | Student requested a hint immediately prior to performing this step | Student pressed the "Hint" button on the tutor interface to request a hint, such that the subsequent step (step 3 in Fig. 2) is designated as Hint-driven. Hint-driven steps are not considered in error or goal analyses (see Sections 3.3.3 and 3.3.4 for explanation)
Errors | Student action is evaluated by the tutor to be incorrect | Student identified a hypothesis of Acute Burn but this hypothesis was not applicable for this case (steps 5 and 7 in Fig. 2)
Correct-not-BNS | Student action is acceptable, but the action did not correspond to the system-generated best-next-step | Student identified Mucin as absent when the system-generated best-next-step was to identify an attribute value of location: sub-epidermal for the blister feature (step 2 in Fig. 2)
Correct-BNS | Student action corresponded to the system-generated best-next-step | Student identified Blister, which matched the system-generated best-next-step (steps 1, 4 and 6 in Fig. 2)

Table 3 Dataset characteristics.

                                                | Sub-epidermal | Superficial perivascular
Number of participants                          | 21            | 23
Total problems solved                           | 498           | 287
Mean problems solved                            | 23.7          | 12.5
Total user hours                                | 94.5          | 51.75
Total number of actions                         | 9873          | 6558
  Hint-driven                                   | 2822 (28.6%)  | 1602 (24.4%)
  Correct-BNS                                   | 3366 (34.1%)  | 1810 (27.6%)
  Correct-not-BNS                               | 613 (6.2%)    | 958 (14.6%)
  Errors                                        | 3072 (31.1%)  | 2188 (33.3%)
Total number of actions / total number of goals | 1.56          | 1.79

3.3.3. Analysis of errors

Student actions classified as errors in Section 3.3.2 were further classified based on the tutor response. Errors are actions performed by the student that are incorrect or inappropriate for the case. Student actions that are appropriate for the case but are not designated by the tutor as the BNS are not considered errors. SlideTutor identifies 25 different general classes of errors based on its pedagogic knowledge base [6]. These include errors of identification of present and absent features, feature refinement, hypothesis triggering, hypothesis evaluation and problem completion (Table 5). Two of these errors (I12 and T3) are only applicable to the superficial perivascular domain. Fig. 2 shows examples of two different errors in steps 5 and 7.

We determined counts of each error and frequencies as a percentage of each goal and across all errors. It is important to note that the distribution of goal errors (e.g., feature identification, feature refinement, hypothesis triggering, etc.) is impacted by the total distribution of goals, which is finite and known in advance for each case.
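The error tally can be sketched in the same style. The code-to-process mapping below covers only a few of the 25 error classes, as an example of how the percentages in Table 5 (percentage of the process and percentage of all errors) are computed; it is illustrative, not the tutor's pedagogic knowledge base.

from collections import Counter

PROCESS_OF = {   # subset of the error-code-to-process mapping shown in Table 5
    "I1": "Feature identification", "I2": "Feature identification",
    "I11": "Absent feature identification",
    "S2": "Feature refinement", "S3": "Feature refinement",
    "T1": "Hypothesis specification", "T2": "Hypothesis specification",
    "E10": "Hypothesis evaluation", "C1": "Problem completion",
}

def error_frequencies(error_codes):
    """error_codes: list of error codes (e.g. 'I1', 'S3') of all actions classified as errors."""
    by_code = Counter(error_codes)
    by_process = Counter(PROCESS_OF[code] for code in error_codes)
    total = len(error_codes)
    return {code: (count,
                   100.0 * count / by_process[PROCESS_OF[code]],   # % of process
                   100.0 * count / total)                          # % of all errors
            for code, count in by_code.items()}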

For each error, we then plotted means for all students over time (defined by the problem sequence). Data for the two different domains were analyzed separately (Figs. 5—9). For graphs depicting means of each error over time, we limited our analysis to the first time that a problem was solved (see Section 3.3.1 for discussion of the problem solution sequence).

3.3.4. Analysis of goal differences

Student actions classified as Correct-not-BNS in Section 3.3.2 were further classified based on the difference between the goal that the student is trying to complete and the goal that the tutor considers to be the BNS. The goal difference defines a second dimension for determining where student actions do not correspond with the optimum solution path. Two kinds of differences can be distinguished:

1. When the goal type that the student is completing is the same as the goal type of the BNS, then differences between the student and the tutor are related to a different ordering of goals. The student may search for visual features in a different order than the diagnostic algorithm suggests. For example, the student may identify a blister, an absent feature of mucin, and then identify a feature of eosinophil inflammatory infiltrate. The optimal sequence designated by the tutor is to identify a blister, followed by identifying the eosinophil inflammatory infiltrate and then the absent feature mucin. The student is identifying features in a different order than the system-designated optimal sequence.

2. When the type of goal that the student is completing is different from the goal type of the BNS (see steps 2 and 7 in Fig. 2), then differences between student and tutor are related to either (a) step-skipping, or (b) strategy differences. For example, in step 2 of Fig. 2, the student identifies an absent feature, but the tutor considers refinement of the previous feature (blister) to be the BNS. This is an example of step-skipping of the previous feature refinement, which is a common behavior especially as students become more adept at visual diagnosis. In some cases, goal type differences represent distinctly different strategies that the student is using compared to the strategy the tutor considers optimal. For example, students may assert hypotheses first and then find supportive features (backwards reasoning). Although the tutor permits backward and forward reasoning, the BNS for a novice will favor forward reasoning. In some cases, it can be very difficult to determine whether a given sequence represents step-skipping vs. a different strategy.

We determined counts and frequencies of each category of goal differences using a goal matrix (Table 6a). Data for the two different domains were analyzed separately.

Table 5 Tutor errors by type and frequency. (Process totals are counts and percentages of all errors in each domain; error-type counts are followed by the percentage of the process and the percentage of all errors.)

Process (sub-epidermal; superficial perivascular) | Error code | Error type description | Sub-epidermal | Superficial perivascular
Feature identification: 957 (31.2%); 919 (42.0%) | I1 | Identified feature not present | 555 (58.0%, 18.1%) | 801 (87.2%, 36.6%)
 | I2 | Identified feature exists elsewhere | 373 (39.0%, 12.1%) | 73 (7.9%, 3.3%)
 | I3 | Identified feature exists elsewhere, but second feature present in this location has been missed | 0 | 0
 | I4 | Identified feature is explicitly absent | 8 (0.8%, 0.3%) | 0
 | I5 | Identified feature not present in this case, but can be present for one or more hypotheses under consideration (including correct) | 21 (2.2%, 0.7%) | 3 (0.3%, 0.1%)
 | I12 | Identified feature is present but is not important to current algorithm | N/A | 42 (4.6%, 1.9%)
Absent feature identification: 37 (1.2%); 4 (0.2%) | I6 | Feature identified as absent is present in location currently under consideration | 10 (27.0%, 0.3%) | 1 (25.0%, 0.0%)
 | I7 | Feature identified as absent is explicitly present in location not matching viewer coordinates | 4 (10.8%, 0.1%) | 0 (0.0%, 0.0%)
 | I8 | Feature identified as absent but not important to note | 0 | 0
 | I9 | Magnification used by student is too low to identify absence of feature | 0 | 0
 | I10 | Magnification used by student is too low to identify feature | 0 | 0
 | I11 | Absent feature identified is not important | 23 (62.2%, 0.7%) | 3 (75.0%, 0.1%)
Feature refinement (attribute value identification): 1301 (42.4%); 677 (30.9%) | S1 | Identified attribute is never a goal for this feature in any case | 0 (0.0%, 0.0%) | 8 (1.2%, 0.4%)
 | S2 | Identified attribute is not a goal for this feature in the current case | 415 (31.9%, 13.5%) | 312 (46.1%, 14.3%)
 | S3 | Identified attribute is correct, but identified value is incorrect | 886 (68.1%, 28.8%) | 349 (51.6%, 16.0%)
 | S4 | Identified attribute value is within range acceptable for a hypothesis under consideration but the value is incorrect in this case (used for backwards reasoning) | 0 | 0
 | S5 | Identified attribute value is within range acceptable for a hypothesis not yet under consideration but the value is incorrect in this case (used for forwards reasoning) | 0 | 0
 | S6 | Another value for the attribute has already been identified for this feature | 0 (0.0%, 0.0%) | 8 (1.2%, 0.4%)
Hypothesis specification (triggering): 466 (15.2%); 343 (15.7%) | T1 | No feature has been identified to support asserted hypothesis | 1 (0.2%, 0.0%) | 176 (51.0%, 8.0%)
 | T2 | Asserted hypothesis fits with some features that have been identified but not others | 465 (99.8%, 15.1%) | 110 (32.1%, 5.0%)
 | T3 | Asserted hypothesis is not supported by any current features for current algorithm, but is supported in another algorithm | N/A | 58 (16.9%, 2.7%)
Hypothesis evaluation: 95 (3.1%); 74 (3.4%) | E10 | Diagnosis fits with some features that have been identified but not other features that have been identified | 95 (100.0%, 3.1%) | 74 (100.0%, 3.4%)
 | E11 | Diagnosis now inconsistent with identified feature because new feature added | 0 | 0
 | E12 | Diagnosis now inconsistent with identified feature because new attribute-value pairs of feature added | 0 | 0
Problem completion: 216 (7.0%); 171 (7.8%) | C1 | Student indicates problem done before all required subtasks are completed | 216 (100.0%, 7.0%) | 171 (100.0%, 7.8%)

Note: error types with zero counts in both domains did not occur during the tutoring sessions.

Table 6a Goal state differences with assigned weights. Rows are the tutor best-next-step; columns are the student action, in the order: feature level 1, feature level 2, feature level 3, feature level 4, feature level 5 and greater, feature attribute value levels 1—4, feature attribute value levels 5 and greater, hypothesis specification, hypothesis evaluation, problem completion.

Feature level 1                              | 0   | 1   | 2   | 3   | 4   | 10 | 10 | 6  | 7  | 8
Feature level 2                              | 1   | 0   | 1   | 2   | 3   | 10 | 10 | 5  | 6  | 7
Feature level 3                              | 2   | 1   | 0   | 1   | 2   | 10 | 10 | 4  | 5  | 6
Feature level 4                              | 3   | 2   | 1   | 0   | 1   | 10 | 10 | 3  | 4  | 5
Feature level 5 and greater                  | 4   | 3   | 2   | 1   | 0   | 10 | 10 | 2  | 3  | 4
Feature attribute value levels 1—4           | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 0  | 1  | 2  | 2  | 2
Feature attribute value levels 5 and greater | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 1  | 1  | 2  | 2  | 2
Hypothesis specification                     | 10  | 10  | 10  | 10  | 10  | 10 | 10 | 0  | 2  | 3
Hypothesis evaluation                        | 10  | 10  | 10  | 10  | 10  | 10 | 10 | 10 | 0  | 2
Problem completion                           | 10  | 10  | 10  | 10  | 10  | 10 | 10 | 10 | 10 | 0

Table 6b Error categories with assigned weights.

Weight | Error category | Error types (see Table 5)
0 | No error at goal (corresponds to Correct) | Correct-not-BNS
1 | Error with limited overall significance | I6, I11, I12, S3, T2
2 | Error with indeterminate overall significance | I2, I5, I7, S2, S6
3 | Error with high overall significance | I1, I4, S1, T1, E10, C1

3.3.5. Analysis of solution path

For each case solved, we generated an entire problem-solving trace based on the degree to which the student's solution deviated from the optimal solution path defined by the tutoring system. The trace represented a sum of the deviations at each step in which the student did not perform the optimum action (BNS). For example, in Fig. 2, there are three deviations, at steps 2, 5 and 7. For each step in which the student action deviates from the BNS, two dimensions were considered: (1) goal differences, and (2) errors. Thus, any deviating action represented (1) an error (step 5), (2) a difference in goal (step 2), or (3) both an error and a difference in goal (step 7).

Counts and frequencies alone do not adequately describe the overlap of a particular student's solution with the optimum solution path, because errors and differences in goal states have different meanings and may vary in how far they take the student from the optimum path. Therefore, we assigned weights to each category of error and goal difference based on our assessment of the degree to which they differ from the system-generated best-next-step. Tables 6a and 6b show the weights assigned for goal differences and for errors, respectively.

Errors were classified into four general categories of significance, based on the degree to which the error impacts subsequent reasoning steps (Table 6b). We assign a weight of 0 to actions that were Correct-not-BNS. Errors of minimal significance are assigned a weight of 1. For example, error I12 indicates that the student has identified a feature that is present but not important to the diagnostic algorithm (Table 5). This inefficiency may slow the student down, but will not have propagating effects on subsequent reasoning. Errors of high significance are assigned a weight of 3. For example, error I4 indicates that the student has identified a feature as present that is explicitly absent in the case. Because the student has confused the identified finding with some other finding, this error is likely to affect the hypotheses that can be reached in the present case, and is also likely to impede problem solving in other cases that contain this feature. Errors of indeterminate significance are assigned a weight of 2. For these errors, it is difficult for us to properly assign credit or blame to a specific action. For example, error I2 indicates that the student identified a feature which is not present in the current view, but can be seen elsewhere on the virtual slide. We are uncertain as to whether the student's behavior indicates a perceptual problem, or whether they simply waited until later in their visual search to identify this feature.


Figure 3 Portion of sub-epidermal blister diagnostic algorithm reflecting algorithm levels.

Goal difference weights were assigned based on how many levels in the diagnostic algorithm the student's action differed from the tutor's BNS. Fig. 3 shows a portion of the sub-epidermal blister diagnostic algorithm demonstrating the algorithm levels. Table 6a shows the weights assigned to each goal difference. When a student should have identified a feature at a particular level in the diagnostic algorithm, but instead identified a feature at a higher level, a goal difference of 1 was added for each level by which the student's action deviated from the tutor's expected step. For example, if the BNS was to identify a feature of blister (Fig. 3, level 1), but the student identified a feature of mucin (Fig. 3, level 3), the goal difference weight would be 2.

When the student should have identified a feature attribute value, but instead identified a feature, a goal difference of 1.5 was assigned. For example, in step 2 of Fig. 2, the student should have identified a feature attribute value of sub-epidermal for the blister, but instead identified an absent feature of mucin; this resulted in a deviation of 1.5 for step 2. Attribute-value steps are essentially steps refining features that have already been identified, so these were considered to have lower significance than jumping across levels while identifying features or hypotheses. Consequently, all goal differences for feature attribute value deviations were weighted identically at 1.5.

A weight of 2 was assigned when students jumped from identifying features to hypothesizing, since hypothesizing is a completely different goal type. For example, in step 7 of Fig. 2, the student specified a hypothesis of acute burn, which was not applicable for this case, instead of identifying a quantity feature attribute value of mild for the eosinophilic infiltrate. At this step the student not only jumped levels, but also committed an error. To represent both the error and goal difference dimensions for each deviation, we add the error weight and the goal difference weight to compute a step deviation score. The deviation weight at step 7 included both a goal difference weight of 2 and an error weight of 1, resulting in a total weight of 3.

The highest weight was reserved for the situation in which students did not realize they were done with a particular type of step; for example, the student is identifying another feature when they should be specifying a hypothesis (Table 6a). Hint-driven steps were not considered in this analysis because they provide no information about the student's solution path.

The solution path deviation score is simply the sum of all step deviation scores. For example, the solution path deviation for the student solution in Fig. 2 is 5.5. We plotted the mean for all students over time (Fig. 10). Data for the two different domains were analyzed separately. For graphs depicting means of solution path deviation scores over time, we limited our analysis to the first time that a problem was solved (see Section 3.3.1 for discussion of the problem solution sequence).
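A small sketch of this scoring, using just the weights needed for the Fig. 2 example, is shown below. The two dictionaries reproduce only the relevant subset of Tables 6b and 6a (attribute-value rows of Table 6a carry the same weight regardless of level, so goal categories are collapsed); the function names are illustrative.

ERROR_WEIGHT = {None: 0, "S3": 1, "T2": 1}    # subset of Table 6b (no error; limited significance)

GOAL_DIFF_WEIGHT = {                          # subset of Table 6a, keyed (tutor BNS goal, student goal)
    ("attribute_value", "absent_feature_identification"): 1.5,
    ("attribute_value", "attribute_value"): 0.0,
    ("attribute_value", "hypothesis_specification"): 2.0,
}

def step_deviation(bns_goal, student_goal, error_code):
    # Deviation for one off-path step = goal-difference weight + error weight.
    return GOAL_DIFF_WEIGHT[(bns_goal, student_goal)] + ERROR_WEIGHT[error_code]

def solution_path_deviation(off_path_steps):
    # Sum of step deviations over all steps where the student did not perform the BNS.
    return sum(step_deviation(*step) for step in off_path_steps)

# The three deviations in Fig. 2: step 2 (goal difference only), step 5 (error only),
# step 7 (both a goal difference and an error).
print(solution_path_deviation([
    ("attribute_value", "absent_feature_identification", None),   # step 2 -> 1.5
    ("attribute_value", "attribute_value", "S3"),                  # step 5 -> 1.0
    ("attribute_value", "hypothesis_specification", "T2"),         # step 7 -> 3.0
]))  # prints 5.5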

4. Results

4.1. Student actions

Table 3 shows total counts of each student action category. Hint-driven actions are approximately 30% of the total actions, Correct-BNS actions are between 27% and 34% of total actions, and error actions hover around 30% of all actions, whereas Correct-not-BNS actions are less than 15% for both domains. Correct-not-BNS actions are somewhat more frequent in the larger superficial perivascular dermatitis domain (6.2% vs. 14.6%), and Correct-BNS actions are somewhat less frequent in this domain (34.1% vs. 27.6%). Fig. 4 shows the frequency of student actions over the problem sequence (time). In both domains, we observed a decrease in errors over time from approximately 50% to 20% and an increase in Correct-BNS actions over time from approximately 20% to 50%. Hint-driven and Correct-not-BNS actions remain relatively stable over time at 30% and 20%, respectively.

Figure 4 Student actions over time.

Figure 5 Feature identification errors over time.

4.2. Errors

The frequency of errors identified during the entire tutoring session is shown in Table 5. Many errors that can be identified by the tutoring system were not observed in this data sample. Identification of present features and refinement of existing features comprise the largest percentage of errors, ranging between 30% and 42% for both domains. Hypothesis specification errors comprise a relatively small percentage of total errors for both domains, at approximately 15%. Absent feature identification and hypothesis evaluation errors are rare, at less than 1.5% and 3.5%, respectively. With few exceptions, the distribution of errors is similar between tutoring sessions in the two subdomains.

Feature identification errors (Figs. 5 and 6) are due to identifying features that are not present on the slide (I1, I5); identifying features at the wrong location (I2); identifying a feature as present when it is explicitly absent (I4); specifying features as absent when they are present (I6, I7); and identifying features not important to the case diagnosis (I11, I12). The most common feature identification error was identification of a feature when it was not present (I1), at 18.1% for the sub-epidermal blister domain and 36.6% for the superficial perivascular domain. The other feature identification errors occurred infrequently in the superficial perivascular domain and more frequently in the sub-epidermal blister domain. For both domains, I1 and I2 errors decreased over time (Fig. 5). Absent feature identification errors were nearly non-existent (Fig. 6).

Feature refinement errors (Fig. 7) are due to identifying an incorrect feature attribute value (S3, S4, S5); identifying a feature attribute value when such identification was not a case goal (S1, S2); or identifying multiple attribute values for the same feature (S6). For both domains, identification of an incorrect value (S3) was the most frequent feature refinement error (28.8% for sub-epidermal; 16.0% for superficial perivascular), followed by S2, which is identifying a value when it was not a case goal (13.5% for sub-epidermal; 14.3% for superficial perivascular). Other feature refinement errors were rare or non-existent in both domains. In the sub-epidermal domain, both S2 and S3 decreased over time. In the superficial perivascular domain, however, these errors occurred infrequently in the first case, slightly increased for the next three cases, then decreased to the original frequency of approximately one error per user.

Figure 6 Absent feature identification errors over time.

Figure 7 Feature refinement errors over time.

Figure 8 Hypothesis specification errors over time.

Figure 9 Hypothesis evaluation errors over time.

Hypothesis specification and evaluation errors (Figs. 8 and 9) are due to inconsistency between the hypothesis and the specified features (T1, T2, T3, E10). In the superficial perivascular domain, error T1 (no feature identified to support hypothesis) is the most frequent (8.0%) and decreases over time; this error does not occur in the sub-epidermal domain. For the sub-epidermal domain, error T2

(asserted hypothesis fits only a subset of identified features) is the most frequent (15.1%); this is the second most frequent error in the superficial perivascular domain. The differences observed between domains for these errors are related to the structure of the diagnostic algorithm. In the sub-epidermal algorithm, once the feature ‘‘blister’’ has been identified, any hypothesis in that algorithm can be supported with existing evidence, which is not true of the larger superficial perivascular algorithm. Error T2 decreases over time in the sub-epidermal blister domain, but remains nearly non-existent in the superficial perivascular domain. The other hypothesis specification errors are infrequent in occurrence during the entire tutoring session. Hypothesis evaluation error E10 (diagnosis fits with some features that have been identified but not other features that have been identified) is nearly non-existent at less than 3.5% in both domains (Table 5 and Fig. 9).

Problem completion errors occur when the student indicates the problem has been solved before all goals are completed (C1). This error accounted for 7.0% of errors in the sub-epidermal blistering domain and 7.8% in the superficial perivascular domain (Table 5).

4.3. Goal differences

A matrix of goal differences is shown in Tables 7a and 7b. Goals identified by the tutor are shown in the rows, and goals performed by the student are shown in the columns. The diagonal shows the number of matching goals for the student and the tutor. Off-diagonal cells show goal differences. The first general observation is that goal differences are common across both domains. Students frequently jump ahead in the diagnostic algorithm, as evidenced by differences in feature identification levels. The observation that they are jumping ahead, as opposed to re-ordering, is supported by the fact that goal differences for features are more frequent above the diagonal than below it, in both domains. Once a feature has been identified, the BNS defined by the tutor is to refine that feature before identifying additional features. Students commonly identify all features and then go back and refine those features by specifying attribute values. Such alterations in the sequence are referred to as jumping ahead. This approach, commonly seen in our research studies, may be related to subjects executing a ‘‘breadth-first search’’ strategy rather than a ‘‘depth-first search’’ strategy.
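The goal-difference matrix and the above-versus-below-diagonal comparison used to distinguish jumping ahead from re-ordering can be tallied as sketched below. The goal-type labels and the paired (tutor best-next-step, student action) input format are assumptions for illustration.

```python
import numpy as np

# Abbreviated goal-type labels for the sub-epidermal blister algorithm (assumed).
GOAL_TYPES = [
    "feature_l1", "feature_l2", "feature_l3",
    "attribute_l1", "attribute_l2", "attribute_l3",
    "hypothesis_spec", "hypothesis_eval", "problem_completion",
]
INDEX = {g: i for i, g in enumerate(GOAL_TYPES)}

def goal_difference_matrix(pairs):
    """pairs: iterable of (tutor_goal_type, student_goal_type) label pairs.

    Rows are the tutor's best-next-step, columns the student action;
    matching goals accumulate on the diagonal.
    """
    matrix = np.zeros((len(GOAL_TYPES), len(GOAL_TYPES)), dtype=int)
    for tutor_goal, student_goal in pairs:
        matrix[INDEX[tutor_goal], INDEX[student_goal]] += 1
    return matrix

def jump_ahead_counts(matrix):
    """Return (above-diagonal sum, below-diagonal sum): more mass above the
    diagonal indicates jumping ahead rather than re-ordering."""
    above = np.triu(matrix, k=1).sum()
    below = np.tril(matrix, k=-1).sum()
    return above, below
```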

The second general observation is that there is an enormous amount of variability when students assert hypotheses and attempt to conclude problem solving by making a diagnosis, which is also true in both domains. Finally, we observe little difference in the overall distribution of goal differences between the two domains.

Table 7a Goal state differences by type and frequency (sub-epidermal blister algorithm). Rows give the tutor best-next-step and columns the student action, over feature levels 1-3, feature attribute value levels 1-3, hypothesis specification, hypothesis evaluation and problem completion; matching goals fall on the diagonal.

Table 7b Goal state differences by type and frequency (superficial perivascular dermatitis algorithm).

Tutor best-next-step          Student action
                              FL1  FL2  FL3  FL4  FAV1  FAV2  FAV3  FAV4  Hyp. spec.  Hyp. eval.  Compl.
Feature level 1               148  137   62   32     0     9    15     0          30           3       0
Feature level 2                10  115   26   12     0     4     0     0          29           7       0
Feature level 3                 4    8   17    3     0     2     1     0          40          23       0
Feature level 4                 4    2    1    7     0     0     1     0           8           9       0
Feature attribute level 1      13   10    7    0   183     7     1     0          19           7       0
Feature attribute level 2      38   30   19    2     4   544     8     0          40          18       0
Feature attribute level 3       7   13   12    1     0     5   231     0          22          10       0
Feature attribute level 4       2    2    0    1     0     0     0    37           9           6       0
Hypothesis specification        4    1   14    0     1     3    14     0         164          63       0
Hypothesis evaluation           1    1    2    1     0     0     0     0          17         230       0
Problem completion              0    2    1    0     0     0     0     0           1          13     157
(FL = feature level; FAV = feature attribute value level; Hyp. spec. = hypothesis specification; Hyp. eval. = hypothesis evaluation; Compl. = problem completion.)

Figure 10 Solution path deviation over time.

4.4. Overall deviation of a solution path

Fig. 10 depicts solution path deviation scores over time. A deviation score of zero indicates that the student has solved the case by following the optimal path as designated by the tutor. A low score indicates the student’s solution veered only slightly from the optimal path. A high score indicates that the student’s solution was considerably different from the optimal solution path.

Analysis of the data shows that the deviation of the student solution path decreases toward the optimum in both domains, with accompanying decreases in the standard deviations. Interestingly, we observed a difference between domains with regard to the rate at which the deviation scores drop. In the smaller sub-epidermal blister domain, deviation of the student solution path immediately decreases toward the optimum path after the first problem, indicating students immediately recognize the importance of identifying a sub-epidermal blister, and then remains relatively close to the optimum solution. Tutoring in the larger and more complex superficial perivascular dermatitis domain was associated with a slower rate of approximation towards the optimal solution path.
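The trend shown in Fig. 10 can be reproduced from per-solution deviation scores by averaging over students at each position in the problem sequence, restricted to first solutions. The sketch below assumes a simple per-solution record layout that is not taken from the SlideTutor data files.

```python
from statistics import mean, stdev

# solutions: list of dicts such as
#   {"student": "r07", "problem_seq": 2, "first_solution": True, "deviation": 4.5}
def mean_deviation_by_position(solutions):
    """Return {problem_seq: (mean deviation, standard deviation)},
    computed over first solutions only."""
    by_position = {}
    for s in solutions:
        if s["first_solution"]:
            by_position.setdefault(s["problem_seq"], []).append(s["deviation"])
    return {
        seq: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for seq, scores in sorted(by_position.items())
    }
```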

5. Discussion

The findings of this study suggest that minimal enforcement can be an effective pedagogic strategy in an Intelligent Tutoring System for a complex medical domain. The study resulted in a number of useful observations regarding tutoring in this domain.

First, we found that a limited-enforcement strategy does not prevent students from learning the optimum solution path. Students who were tutored using the limited-enforcement strategy initially pursued a path other than the optimal one. Nevertheless, when given significant flexibility in the intermediate steps that the tutor accepts as correct, students gradually converge towards the optimum solution path encoded by the diagnostic algorithm. To do so, students must be explicitly assembling their own representation of the algorithm by using the Best-Next-Step hints, because this is the only information available to them for self-correction. Thus it appears that students are able to learn from these partial examples, synthesizing the information from multiple problems to develop an approach that closely reproduces the diagnostic algorithm. This is an important finding for medical tutoring systems because it demonstrates that flexibility does not necessarily prevent acquisition of the most efficient approach to case interpretation. In light of the significant benefits of a more flexible approach, our results suggest that limited enforcement merits further study as a pedagogic strategy in our domain.

Second, we described the distribution of errors and goal differences over time when using a limited-enforcement tutoring system. Both goal differences and errors decline over time. We observed that students make fewer and less severe errors over time, progressively relying more on knowledge obtained during the tutoring session and less on tutor feedback. This is reflected by the decrease in the number of hint-driven steps, the decrease in the frequency and severity of errors, and the increase in the number of correct (Correct-not-BNS and Correct-BNS) actions. Since students spend the majority of their time identifying features, that is where they make the majority of their mistakes. Errors made in hypothesis triggering and hypothesis evaluation were fewer in number, although they may have been more significant in consequence. Goal differences were largely due to identifying features and feature attributes in a different order than the optimal sequence. The distribution of errors and goal differences is consistent with our previous empirical observations of problem solving using think-aloud methods [36]. The observation that visual feature identification skills precede development of hypothesis evaluation skills suggests that pedagogic strategies for part-training in visual identification could be effective interventions to complement the performance training approach of our systems.

Third, we observed that larger problem-space size was associated with a longer interval of approach to the optimum solution path, but otherwise there were few differences in the overall distribution of errors and goal state differences. The empirical observation of a relationship between problem-space size and the length of training required to reach the optimal path has strong face validity. Instructors in clinical domains have long known that acquisition of accurate skills in identifying diagnostic entities requires consideration of other entities in the ‘‘differential diagnosis’’. These small sets of similar entities may be considered to be the smallest possible problem-space for teaching skills relevant to identifying a particular entity. Increasing the problem-space beyond this small set is imperative, as students must also learn to place a particular case within a specific differential diagnosis. In effect, they must learn to determine ‘‘What is the question?’’ As the size of the problem-space increases further, it can be argued that the training experience more closely resembles problem solving in its natural task environment: clinical practice. The relationship between problem-space size and tutoring response is an important variable for consideration in pedagogic strategy choice, and again raises the issue of balancing potential drawbacks and benefits. Training in smaller ‘‘chunks’’ of the problem-space will likely produce more rapid response to training, but learning to determine where a single case fits within the larger context may be needed for transfer to real-world problem solving. How and when to manipulate problem-space size remains an important and largely unstudied problem for tutoring systems in medical domains.

Taken together, our observations support the use of limited-enforcement strategies in this domain, but also provide some guidance regarding the most effective use of this approach. Early in tutoring, the need to rapidly gain declarative knowledge could argue for a more rigid, high-enforcement approach. However, one of the difficulties in medical tutoring systems that we have observed over the years is that students typically do not entirely ‘‘trust’’ the tutoring system, favor their own solutions even when they are non-optimal, and are therefore resistant to the more rigid tutoring style that characterizes most high-enforcement systems. Helping students to accept the system is a difficult but important goal for our tutoring systems.

In many ways, our limited-enforcement strategy represents an extreme of the cognitive tutor approach, which requires multiple rule sets to model different problem-solving strategies. In our case, we have purposely allowed for the widest possible variety of alternative problem-solving strategies. Our ability to explicitly encode this wide range of strategies derives directly from our implementation of the tutoring system as a set of abstract problem-solving methods, ontologies, and the Dynamic Solution Graph. Modeling a higher-level abstraction of skilled performance greatly simplifies the task of accounting for such a wide variety of alternative approaches to problem solving. This contrasts with the typical cognitive tutoring system, where a great deal of the content is encoded in the production rules themselves and therefore each alternative series of steps must be explicitly accounted for. Like other cognitive tutors, we are able to distinguish between correct and incorrect steps, and can therefore provide best-next-steps through our hint mechanism, an intervention that we have shown to be effective in this study.

In other ways, our limited-enforcement strategy reproduces some of the intent of the less rigid constraint-based tutoring systems, which permit the student to pursue their own solution path rather than forcing them to perform a specified series of steps. Leaving aside the issue of enforcement, another property of our tutoring systems that makes them more like constraint-based tutors is that our methodology allows us to easily alter the feedback cycle, such that immediate feedback (1:1) can gradually be faded to give feedback at longer intervals of actions, providing students increasing autonomy and the opportunity to learn principally from errors. During this fading, students also have the opportunity to more directly evaluate their own performance and thus build their metacognitive abilities.
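One possible fading schedule of this kind is sketched below: feedback is given after every action at first, and the interval of actions between feedback episodes grows as the student completes more cases. The schedule, parameter names, and values are assumptions for illustration, not the policy actually used in our systems.

```python
# Illustrative feedback-fading schedule (hypothetical parameters).
def feedback_interval(cases_completed, initial=1, growth=1, ceiling=5):
    """Number of student actions between feedback episodes: starts at
    `initial` (immediate 1:1 feedback) and grows with completed cases,
    capped at `ceiling`."""
    return min(initial + growth * cases_completed, ceiling)

def should_give_feedback(actions_since_feedback, cases_completed):
    """Decide whether the tutor intervenes after the current action."""
    return actions_since_feedback >= feedback_interval(cases_completed)
```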

6. Future work

The present observational study points to three potentially useful areas of future work. Further experimental studies are needed to address these questions.

First, we intend to further explore the use of limited-enforcement strategies to determine when and how to best use this approach. The present study shows that limited enforcement does not impose a barrier to learning the optimal path, but leaves many important questions unanswered. How efficient is limited-enforcement training when compared to high-enforcement training? What is the user experience, degree of acceptance, and subjective response to these pedagogic strategies? When is the best time to employ this approach, and how can we detect when the student is ready for less enforcement?

Second, based on our observation of error distributions and frequencies over time, we are interested in further exploring the use of part-training for acquisition of visual feature identification skills and hypothesis evaluation skills. How can we most effectively combine these methods with the performance training aspects of our system? How often should the tutoring system step out of the problem-solving exercises to pursue a more directed approach? What aspects of training are most amenable to part-training? How should part-training be integrated with the existing problem-solving exercises?

Third, we plan to further explore the variable of problem-space size as a parameter in pedagogic strategy. What are the best approaches to increasing problem-space size? When do students benefit from smaller or larger problem-spaces? How should smaller problem-spaces be combined over time to more closely replicate the natural task environment and promote transfer to clinical practice?

New versions of our tutoring system are already using limited-enforcement strategies, especially later in training. Ongoing evaluation of these systems will provide useful additional information for determining the benefits and costs of limited enforcement. As we begin to address the many questions raised by this study, we will focus our efforts on identifying and implementing the most effective pedagogic strategies for intelligent tutoring in this complex medical domain.

7. Conclusion

We conclude that students may benefit from a limited-enforcement tutoring system that leverages diagnostic algorithms but does not prevent alternative strategies. Students trained by this type of system converge towards the optimum solution path specified in the algorithm, but are not forced to do so. The rate at which they are able to approximate the algorithm is affected by the size of the domain that they are presented with.

Acknowledgements

The research described was supported by the National Library of Medicine through grant R01-LM007891. The first author (VP) was supported through the University of Pittsburgh Biomedical Informatics Training Program Grant (T15LM007059-22).

This work was conducted using the Protege resource, which is supported by grant LM007885 from the United States National Library of Medicine. SpaceTree was provided in collaboration with the Human-Computer Interaction Lab (HCIL) at the University of Maryland, College Park.

References

[1] Kronz JD, Westra WH, Epstein JI. Mandatory second opinion surgical pathology at a large referral hospital. Cancer 1999;86:2426—35.
[2] Davenport J. Documenting high-risk cases to avoid malpractice liability: You're at the highest risk of malpractice suits when dealing with these five clinical conditions. Full documentation can help. Family Practice Management 2000;7:33—6.
[3] Crowley R, Gryzbicki D. Intelligent medical training systems. Artificial Intelligence in Medicine 2006;38:1—4.
[4] Woo CW, Evens MW, Freedman R, Glass M, Shim LS, Zhang Y. An intelligent tutoring system that generates a natural language dialogue using dynamic multi-level planning. Artificial Intelligence in Medicine 2006;38:25—46.
[5] Suebnukarn S, Haddawy P. A Bayesian approach to generating tutorial hints in a collaborative medical problem-based learning system. Artificial Intelligence in Medicine 2006;38:5—24.
[6] Crowley RS, Medvedeva O. An intelligent tutoring system for visual classification problem solving. Artificial Intelligence in Medicine 2006;36:85—117.
[7] Day RS. Challenges of biological realism and validation in simulation-based medical education. Artificial Intelligence in Medicine 2006;38:47—66.
[8] Satish U, Streufert S. Value of a cognitive simulation in medicine: towards optimizing decision-making performance of healthcare personnel. Quality and Safety in Health Care 2002;11:163—7.
[9] Romero C, Ventura S, Gibaja EL, Hervas C, Romero F. Web-based adaptive training simulator system for cardiac life support. Artificial Intelligence in Medicine 2006;38:67—78.
[10] Friedman CP. The marvellous medical education machine or how medical education can be ‘‘Unstuck’’ in time. Medical Teacher 2000;22:496—502.
[11] Yudelson MV, Medvedeva OP, Crowley RS. A multifactor approach to student model evaluation. User Modeling and User-Adapted Interaction 2008;18:349—82.
[12] Anderson JR, Boyle CF, Corbett AT, Lewis MW. Cognitive modeling and intelligent tutoring. Artificial Intelligence 1990;42:7—49.
[13] Anderson JR, Corbett AT, Koedinger KR, Pelletier R. Cognitive tutors: lessons learned. The Journal of the Learning Sciences 1995;4:167—207.
[14] Corbett AT, Trask H. Instructional interventions in computer-based tutoring: differential impact on learning time and accuracy. In: Turner T, Szwillus G, Czerwinski M, Paterno F, editors. SIG CHI 2000 Conference on Human Factors in Computing Systems. 2000. p. 97—104.


[15] Corbett AT, Anderson JR. Locus of feedback control in computer-based tutoring: impact on learning rate, achievement and attitudes. In: Jacko JS, Sears A, Beaudouin-Lafon M, Jacob R, editors. Proceedings of the SIG CHI 2001 Conference on Human Factors in Computing Systems. Seattle, Washington. New York, United States: ACM Press; 2001. p. 245—52.
[16] Munro A, Fehling MR, Towne DM. Instruction intrusiveness in dynamic simulation training. Journal of Computer-Based Instruction 1985;12:50—3.
[17] Schmidt RA, Young DE, Swinnen S, Shapiro DC. Summary knowledge of results for skill acquisition: support for the guidance hypothesis. Journal of Experimental Psychology: Learning, Memory and Cognition 1989;15:352—9.
[18] Schooler LJ, Anderson JR. The disruptive potential of immediate feedback. In: Proceedings of the 12th Annual Conference of the Cognitive Science Society; 1990. p. 702—8.
[19] Chi MTH, Bassok M, Lewis M, Reimann P, Glaser R. Self-explanations: how students study and use examples in learning to solve problems. Cognitive Science 1989;13:145—82.
[20] Lepper MR, Woolverton M, Mumme DL, Gurtner J. Motivational techniques of expert human tutors: lessons for the design of computer-based tutors. In: Lajoie SP, Derry SJ, editors. Computers as cognitive tools. Hillsdale, NJ: Erlbaum; 1993. p. 75—105.
[21] Azevedo R, Hadwin AF. Scaffolding self-regulated learning and metacognition: implications for the design of computer-based scaffolds. Instructional Science 2005;33:367—79.
[22] White B, Frederiksen J. A theoretical framework and approach for fostering metacognitive development. Educational Psychologist 2005;40:211—3.
[23] Ohlsson S. Learning from performance errors. Psychological Review 1996;103:241—62.
[24] Mitrovic A, Koedinger KR, Martin B. A comparative analysis of cognitive tutoring and constraint-based modeling. In: Brusilovsky P, Corbett A, Rosis FD, editors. Proceedings of the 9th International Conference on User Modeling UM2003. 2003. p. 313—22.
[25] Bloom BS. The 2 sigma problem: the search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher 1984;13:4—16.
[26] Cohen PA, Kulik JA, Kulik CLC. Educational outcomes of tutoring: a meta-analysis of findings. American Educational Research Journal 1982;19:237—48.
[27] Lepper MR, Aspinwall L, Mumme D, Chabay RW. Self-perception and social perception processes in tutoring: subtle social control strategies of expert tutors. In: Olson JM, Zanna MP, editors. Self-inference processes: the sixth Ontario symposium in social psychology. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.; 1990. p. 217—37.
[28] Merrill DC, Reiser BJ, Ranney M, Trafton JG. Effective tutoring techniques: a comparison of human tutors and intelligent tutoring systems. The Journal of the Learning Sciences 1992;2:277—305.
[29] Collins A, Brown JS, Newman SE. Cognitive apprenticeship: teaching the crafts of reading, writing and mathematics. In: Resnick LB, editor. Knowing, learning and instruction: essays in honor of Robert Glaser. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.; 1989. p. 453—94.
[30] Merrill DC, Reiser BJ, Landes S. Human tutoring: pedagogical strategies and learning outcomes. Annual Meeting of the American Educational Research Association 1992.
[31] Littman D, Pinto P, Soloway E. The knowledge required for tutorial planning: an empirical analysis. Interactive Learning Environments 1990;1:124—51.
[32] Saadawi GME, Tseytlin E, Legowski E, Jukic D, Castine M, Fine J, et al. A natural language intelligent tutoring system for training pathologists: implementation and evaluation. Advances in Health Sciences Education: Theory and Practice 2007;13:709—22.
[33] Medvedeva O, Chavan G, Crowley RS. A data collection framework for capturing ITS data based on an agent communication standard. In: Proceedings of the 20th Annual Meeting of the American Association for Artificial Intelligence; 2005. p. 23—30.
[34] Crowley RS, Legowski E, Medvedeva O, Tseytlin E, Roh E, Jukic D. Evaluation of an intelligent tutoring system in pathology: effects of external representation on performance gains, metacognition and acceptance. Journal of the American Medical Informatics Association 2007;14:182—90.
[35] Saadawi GME, Azevedo R, Castine M, Payne V, Medvedeva O, Tseytlin E, et al. Factors affecting feeling-of-knowing in a medical intelligent tutoring system: the role of immediate feedback as a metacognitive scaffold. Advances in Health Sciences Education 2009; doi:10.1007/s10459-009-9162-6.
[36] Crowley RS, Naus GJ, Stewart J, Friedman CP. Development of visual diagnostic expertise in pathology: an information-processing study. Journal of the American Medical Informatics Association 2003;10:39—51.