Hands-free administration of subjective workload scales: Acceptability in a surgical training environment

C. Melody Carswell a,b,*, Cindy H. Lio a, Russell Grant a,b, Martina I. Klein a,e, Duncan Clarke a, W. Brent Seales a,c, Stephen Strup d

a Center for Visualization and Virtual Environments, University of Kentucky, Lexington, KY 40506, USA
b Department of Psychology, University of Kentucky, Lexington, KY 40506, USA
c Department of Computer Science, University of Kentucky, Lexington, KY 40506, USA
d Department of Urology, University of Kentucky, Lexington, KY 40506, USA
e Department of Psychology, Texas Tech University, Lubbock, TX 79409, USA

Applied Ergonomics 42 (2010) 138-145
doi:10.1016/j.apergo.2010.06.003
Received 23 March 2009; accepted 11 June 2010

Keywords: Minimally invasive surgery; Surgical training; Mental workload; Reliability; Usability metrics; Human factors evaluation; Administration procedures

Abstract

Introduction: Subjective workload measures are usually administered in a visual-manual format, either electronically or by paper and pencil. However, vocal responses to spoken queries may sometimes be preferable, for example when experimental manipulations require continuous manual responding or when participants have certain sensory/motor impairments. In the present study, we evaluated the acceptability of the hands-free administration of two subjective workload questionnaires, the NASA Task Load Index (NASA-TLX) and the Multiple Resources Questionnaire (MRQ), in a surgical training environment where manual responding is often constrained.

Method: Sixty-four undergraduates performed fifteen 90-s trials of laparoscopic training tasks (five replications of 3 tasks: cannulation, ring transfer, and rope manipulation). Half of the participants provided workload ratings using a traditional paper-and-pencil version of the NASA-TLX and MRQ; the remainder used a vocal (hands-free) version of the questionnaires. A follow-up experiment extended the evaluation of the hands-free version to actual medical students in a Minimally Invasive Surgery (MIS) training facility.

Results: The NASA-TLX was scored in 2 ways: (1) the traditional procedure using participant-specific weights to combine its 6 subscales, and (2) a simplified procedure, the NASA Raw Task Load Index (NASA-RTLX), using the unweighted mean of the subscale scores. Comparison of the scores obtained from the hands-free and written administration conditions yielded coefficients of equivalence of r = 0.85 (NASA-TLX) and r = 0.81 (NASA-RTLX). Equivalence estimates for the individual subscales ranged from r = 0.78 ("mental demand") to r = 0.31 ("effort"). Both administration formats and scoring methods were equally sensitive to task and repetition effects. For the MRQ, the coefficient of equivalence for the hands-free and written versions was r = 0.96 when tested on undergraduates. However, the sensitivity of the hands-free MRQ to task demands (ηp² = 0.138) was substantially less than that for the written version (ηp² = 0.252). This potential shortcoming of the hands-free MRQ did not seem to generalize to medical students, who showed robust task effects when using the hands-free MRQ (ηp² = 0.396). A detailed analysis of the MRQ subscales also revealed differences that may be attributable to a "spillover" effect in which participants' judgments about the demands of completing the questionnaires contaminated their judgments about the primary surgical training tasks.

Conclusion: Vocal versions of the NASA-TLX are acceptable alternatives to standard written formats when researchers wish to obtain global workload estimates. However, care should be used when interpreting the individual subscales if the object is to make comparisons between studies or conditions that use different administration modalities. For the MRQ, the vocal version was less sensitive to experimental manipulations than its written counterpart; however, when medical students rather than undergraduates used the vocal version, the instrument's sensitivity increased well beyond that obtained with any other combination of administration modality and instrument in this study. Thus, the vocal version of the MRQ may be an acceptable workload assessment technique for selected populations, and it may even be a suitable substitute for the NASA-TLX.

© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. Department of Psychology, University of Kentucky, Lexington, KY 40506, USA. Tel.: +1 859 333 2492. E-mail address: [email protected] (C.M. Carswell).


1. Introduction

Subjective workload scales have been a familiar part of the human factors and ergonomics toolkit since the 1980s (Tsang and Vidulich, 2006). Although workload metrics can also be obtained from both performance and physiological measures, subjective scales have the advantage of accessibility, ease of use, and direct applicability to situations where the operator's experience is of paramount concern. The adaptability of these measures to many work contexts may also be seen as an advantage; however, minor changes to administration procedure may result in changes in the measures' sensitivities to workload variation (e.g., shifts from numeric to analogue scales, or from hard copy to electronic administration; e.g., Hart and Staveland, 1988). In the present study, we focus on the acceptability of one procedural change that may make typical subjective workload scales more appropriate for some workplaces, tasks, and user populations: hands-free administration. Specifically, we compare spoken and written versions of a well-established and frequently used subjective measure, the NASA Task Load Index (NASA-TLX) (Hart and Staveland, 1988), as well as a relatively new and potentially more diagnostic measure, the Multiple Resources Questionnaire (MRQ) (Boles et al., 2007; Boles and Adair, 2001).

The NASA Task Load Index (NASA-TLX) is among the most frequently used subjective workload measures (Hart, 2006). Although initially designed for use in the aviation sector, the NASA-TLX has been adopted for workload evaluation in many other contexts, including nuclear power (Park and Jung, 2006), commercial and private ground transportation (Otmani et al., 2005; Matthews et al., 2003), manufacturing (Landau et al., 2006), and education (Windell et al., 2006).

Respondents are asked to rate 6 subscales when using the NASA-TLX: (1) mental demand, (2) physical demand, (3) temporal demand, (4) effort, (5) performance, and (6) frustration. These subscales can be used diagnostically to isolate the potential causes of elevated workload, or they can be combined to generate a global workload measure. Traditionally, the global score is produced by taking a weighted average of the subscale scores (Hart and Staveland, 1988). The weights applied to each subscale are participant-specific and are derived from paired-comparison procedures in which participants choose which member of each pair of subscales is the most critical to their overall perception of workload. An alternative, simplified method of administering and scoring the NASA-TLX, referred to as the NASA Raw Task Load Index (NASA-RTLX; Byers et al., 1989), uses the unweighted mean of the subscales and omits the paired-comparison procedure. Several researchers advocate using the simpler procedure after finding strong correlations between the NASA-TLX and NASA-RTLX (Moroney et al., 1995; Moroney et al., 1992; Byers et al., 1989). However, other researchers have found that the original NASA-TLX is more sensitive to changes in task demands (e.g., Liu and Wickens, 1994).
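The two scoring procedures reduce to a weighted versus an unweighted average of the six subscale ratings. A minimal sketch in Python (illustrative only, not the authors' code; the subscale labels, ratings, and pair choices below are hypothetical):

```python
from itertools import combinations

SUBSCALES = ["mental", "physical", "temporal", "effort", "performance", "frustration"]

def tlx_weights(pair_choices):
    """Derive participant-specific weights from the 15 paired comparisons.

    pair_choices maps each unordered pair of subscales to the member the
    participant judged more important; each subscale's weight is its number
    of "wins" divided by 15, so the weights sum to 1.
    """
    wins = {s: 0 for s in SUBSCALES}
    for _pair, chosen in pair_choices.items():
        wins[chosen] += 1
    n_pairs = len(list(combinations(SUBSCALES, 2)))  # 6 choose 2 = 15
    return {s: wins[s] / n_pairs for s in SUBSCALES}

def nasa_tlx(ratings, weights):
    """Traditional global score: weighted average of subscale ratings."""
    return sum(ratings[s] * weights[s] for s in SUBSCALES)

def nasa_rtlx(ratings):
    """Raw TLX (NASA-RTLX): unweighted mean of the six subscale ratings."""
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)
```

Because the weights sum to 1, a participant who rates every subscale equally gets the same global score from both procedures; the two diverge only when subscale ratings differ and the paired comparisons favour some subscales over others.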

The Multiple Resources Questionnaire (MRQ) is a more recent introduction, developed to be diagnostic of the specific cognitive processes (resources) demanded in different situations (Boles et al., 2007; Boles and Adair, 2001). Based on research in cognitive neuroscience, the MRQ was designed primarily to predict the types of additional demands that are most likely to interfere with performance in a target task. If valid, the measure has the potential to be prescriptive in terms of designing interfaces and work procedures to minimize primary-task disruption in dynamic work environments.

The MRQ, the NASA-TLX (and NASA-RTLX), and other subjective workload measures such as the Subjective Workload Assessment Technique (Reid and Nygren, 1988), Workload Profile (Tsang and Velazquez, 1996), and Subjective Workload Dominance Technique (Vidulich, 1989) are typically presented visually (either hard copy or on-screen) and require manual responses (either written, by keyboard, or by mouse). However, there may be times when this format cannot be used, for example when the participant's primary task requires manual control or when data must be collected from populations with visual, mobility, or literacy limitations. Certainly, practitioners in other domains, such as clinical psychologists using the Minnesota Multiphasic Personality Inventory (MMPI), have been motivated to develop and validate alternative administration modalities (Richards et al., 1983). However, within human factors, there seems to have been more concern with determining the equivalence of different visual administration media (i.e., computer vs. paper-and-pencil; e.g., Noyes and Garland, 2008; Noyes and Bruneau, 2007) than with comparing altogether different administration modalities. Although David and Pledger (1995) did compare vocal vs. keyboard reporting of SWAT and Instantaneous Self-Assessment (ISA) responses, these researchers were mostly interested in determining the relative intrusiveness of the two modalities on air traffic control primary tasks. They did find, however, that their participants preferred vocal to keyboard reporting.

Our own research on technology and training enhancements for laparoscopic surgery is an example of an application in which traditional workload measures are frequently intrusive. A surgeon's primary task involves manipulating long-handled instruments inserted through ports into the patient's body, with visual feedback provided on nearby monitors. It is often desirable to administer workload measures after individual training trials without requiring participants to relinquish their instruments and step away from the surgical simulator or training stand, but the use of visual-manual administration techniques requires that they do so. Although vocal administration seemed an obvious solution, we were concerned that this procedure might introduce systematic response biases, for example if some participants felt self-conscious reporting aloud that they found a particular task demanding or frustrating. We were also troubled by the possible introduction of greater unsystematic (error) variance into the data if participants felt greater time pressure when responding vocally. The latter concern is far from trivial in field experiments, where it is rarely the case that large numbers of participants can be recruited to compensate for increased experimental noise.

In the present study we explicitly compared vocal and written administrations of the NASA-TLX, NASA-RTLX, and MRQ for participants learning to perform three laparoscopic training tasks: a cannulation task, a ring transfer task, and a rope manipulation task. In the first experiment, undergraduate students were taught to perform the training tasks and were asked to make workload evaluations using either the vocal or manual format after each training trial. These data allowed us to:

(1) Estimate the equivalent-forms reliability of vocal and written versions of the NASA-TLX, the NASA-RTLX, and MRQ.

(2) Compare the sensitivity of vocal and written versions of the NASA-TLX, the NASA-RTLX, and MRQ to task manipulations (i.e., task type).

In addition to our primary goal of assessing the acceptability of substituting our hands-free version of the measures for the traditional written-manual formats, we were also able to provide data on the relative merits of the three different workload measures themselves. Thus, we also:

(3) Compare the sensitivity of the NASA-TLX and the NASA-RTLX to task manipulations, and

(4) Compare the sensitivity of global workload estimates derived from the NASA-TLX and NASA-RTLX to those of the newer MRQ.


Fig. 1. Laparoscopic training stand used by participants to complete all three experimental tasks.


Finally, in a second experiment, we looked at whether the overall patterns of data obtained from our hands-free instruments in Experiment 1 were replicated in a more job-appropriate population performing under more realistic training conditions. As is the case for research in many work contexts, subject matter experts (SMEs) (surgeons and surgical residents in our case) are a scarce resource that must be tapped only when more readily available populations will not suffice. It may be obvious in some cases when SMEs must be used, for example, when domain knowledge is required for basic experimental task performance (e.g., a surgeon performing a simulated cholecystectomy). However, it is less clear that such highly skilled participants are necessary in situations where those with less specialized training are still capable of successfully completing the experimental task with only a few minutes' instruction. Even when domain knowledge is not a deciding factor in participant selection, it is still possible that factors such as group norms, motivation, academic attainment, and maturity may create significant differences not only in primary task performance, but also in the way rating scales are understood and used. Thus, we felt it was essential to compare our results from a larger "convenience sample" to results from members of our target population.

2. Experiment 1

In order to compare the sensitivity and diagnosticity of the workload ratings provided by traditional and hands-free administration modalities, we had participants provide ratings of three surgical training tasks, each with different quantitative and qualitative demands. These tasks (a simulated cannulation task, a ring transfer task, and a rope manipulation task) were all components of the surgical training program at a minimally invasive surgery training facility at a large U.S. medical center. A mixed-factor experimental design was employed, with administration modality varied as a between-group factor and task type and repetition varied as within-participant factors.

2.1. Method

2.1.1. Participants
Sixty-four students (38 men, 26 women, 18-24 years) were recruited from introductory psychology courses at the University of Kentucky. They received credit in partial fulfillment of a research exposure requirement for participating. Participants were selected based on self-reports that they either (1) frequently played video games or (2) were enrolled in a healthcare-related academic program.

2.1.2. Equipment
A Stryker 888 endoscope, a Stryker Quantum 300 light source, a Stryker 888 zero-degree 10-mm camera, a Sony 19-inch monitor, and a surgical box trainer were used for the experiment (see Fig. 1). Participants used two Covidien Autosuture Endo Dissects, inserted into two 2.5 cm openings on the trainer's vinyl cover, to manipulate a small set of objects. Depending on the specific task being performed, the sets of objects included (1) a rubber tube (4 mm inner diameter × 6 mm outer diameter × 89 mm long) and four 13 cm pipe cleaners, (2) a plywood pegboard (9.8 cm × 7.7 cm) with 1 cm long pegs and a 4.3 cm dish containing approximately 20 foam rings of different colours (6 mm inner diameter × 11 mm outer diameter × 3.5 mm thick), and (3) a 160 cm "cobra" rope with alternating 2.5 cm bands of blue, orange, and white. The rope was approximately 2 mm in diameter.

Each of the three sets of objects was placed on a separate 33 cm × 36 cm poster board. The poster boards helped to (1) keep the objects' placement consistent across trials, (2) stabilize the objects being manipulated, and (3) reduce the between-trial task setup time.

2.1.3. Experimental tasks
Three experimental tasks were designed to simulate the exercises that are used to train incoming surgical residents at the Center for Minimally Invasive Surgery at the University of Kentucky Medical Center. These tasks are illustrated in Fig. 2.

In the cannulation task, participants were instructed to thread pipe cleaners through a rubber tube using both dissectors. The end through which the participant began threading the pipe cleaner (right vs. left) was congruent with the participant's dominant hand. The tube was loosely taped to the cardboard mat horizontally; thus, some movement of the tube usually occurred during threading. Participants were allowed to steady the tube by holding it with one grasper while using the other to hold the pipe cleaner. Participants were told to thread as much of the pipe cleaner through the tube as possible in a trial. If a participant managed to completely thread a pipe cleaner through the tube before the 90-s trial's end, they were told to begin the task again with a new pipe cleaner located next to the tube. Performance was measured by the total length of the pipe cleaner(s) fully threaded through the tube.

During the ring transfer task, participants were instructed to retrieve as many foam rings as possible from a dish and place them on vertical pegs located 2-8 cm from the dish. They were permitted to stack multiple rings onto each peg. The performance criterion was the number of rings placed on the pegs within the 90-s trial.

Fig. 2. Target objects being manipulated by research participants in the cannulation (top), ring transfer (middle), and rope manipulation (bottom) tasks as seen through the endoscope.

The rope manipulation task was designed to simulate the inspection of a section of bowel. Participants held a long, narrow rope with the two dissectors, passing the rope from one dissector to the other, attempting to traverse the rope's full length. The rope was divided into white, blue, and orange segments, each approximately 2.5 cm in length, and participants were told that they should touch only the white sections. They were required to hold each white section with both graspers before reaching for the next white section. Right-handed participants began passing the rope from left to right, while left-handed participants began passing the rope in the opposite direction. The number of white sections successfully passed in the 90-s trial determined the participant's performance score.

Note that for all tasks, the performance criterion was used only as a goal to motivate participants and to help them understand the task; the actual scores were not analysed because our main interest was in comparing perceived load across tasks, and the performance criteria between tasks were not directly comparable.

2.1.4. Workload instruments
The written version of the NASA-TLX consisted of six horizontal graphic scales (7 cm), each divided into 20 sections by tick marks. All scales were printed on the same page. The endpoint anchors for five of the scales (mental demand, physical demand, temporal demand, effort, and frustration) were labelled "low" and "high." The endpoints for the performance subscale were labelled "good" and "bad." On a separate page, all unique pairs of the subscales were listed (e.g., mental demand vs. physical demand, mental demand vs. temporal demand, etc.). Participants circled the member of each pair that seemed more influential to them in determining their overall assessment of workload. The definitions of the six subscale constructs (from Hart and Staveland, 1988) were available to participants at all times; they could refer to an instructional poster mounted by the training stand or to printed instructions available at the desk where they filled out the questionnaires. Participants indicated their responses by drawing a vertical line through each graphic scale.

In the vocal version of the NASA-TLX, participants indicated their choice by picking a number from 1 to 100 that best described their experiences. The experimenter, sitting to the back or side of the participants, read each scale name, reminded participants of the endpoint labels (e.g., "0 = low and 100 = high"), and recorded participants' responses. Then, the experimenter read each possible pair of scale names and asked participants to say which of the scales was more influential in their overall workload judgments. As with the written version, definitions of the subscale constructs were visible to participants at all times.

Data from the administration of the full NASA-TLX were used to obtain NASA-RTLX scores; the latter simply omits the participant-provided weightings of the subscales in computing the global workload score. Thus, there was no separate version of the NASA-RTLX administered to participants in either the written or vocal condition.

The administration of the written and vocal versions of the MRQ followed as closely as possible the procedures for the NASA-TLX. The MRQ contains seventeen scales, each representing participants' usage of a specific mental resource such as "spatial concentrative" or "tactile figural." The complete list of resources (scales) is provided in Table 2. Complete definitions for each scale (from Boles and Adair, 2001) were available as part of the written form, and for the vocal form they were available on poster board throughout the experiment. Participants filled in a rating between 0 and 100 for each subscale, with "0" indicating "no usage" of the indicated mental resource and "100" indicating "extreme usage." Once again, participants simply responded numerically from 1 to 100 in the vocal condition.

Although we did not instruct participants to avoid specific numbers when responding in the vocal condition, the majority restricted their responses to multiples of five. With the written forms, participants' marks on the NASA-TLX were translated to numbers by rounding upward to the nearest tick mark on the scale. This also led to scores in multiples of five, resulting in comparable granularity between the vocal and written versions.

2.1.5. Procedure
After obtaining informed consent, participants were trained to use the dissectors. They were given the opportunity to practise each of the three experimental tasks while being guided by the experimenter. During this orientation phase, the objects being manipulated were in direct view of the participant (i.e., were not viewed via the endoscope and video monitor). Upon initiating data collection, participants could only see the task on the video monitor.

Each participant performed five blocks of trials, with each block containing one trial of each of the three tasks. The order of tasks within each block was randomized. After each trial, participants responded to the NASA-TLX. They completed the MRQ only after each of the three trials in the final block.

2.2. Results and discussion

2.2.1. Equivalent-forms reliability
Correlations were used to calculate the coefficient of equivalence for the workload judgments obtained from the vocal and written versions of the NASA-TLX, NASA-RTLX, and MRQ, as well as for the six NASA-TLX subscales. Because administration modality was manipulated between groups, the correlations were performed on the mean global workload values for each relevant experimental condition, summarised across participants. Thus, for these measures, correlations were obtained over 15 conditions (3 tasks × 5 blocks of trials). For the MRQ, the coefficient of equivalence was obtained by comparing 33 mean scores: the 11 scale values with non-zero medians obtained for each of the three experimental tasks. The method for estimating the equivalence of the MRQ differs from that of the other measures because the goal of the MRQ is distinct. For the NASA-TLX and RTLX, the global means for each experimental condition are of paramount importance; for the MRQ, it is the fluctuation in the subscale profiles that is of concern. Note that MRQ scales with median scores of zero, indicating that participants did not believe the mental resource in question was relevant to the given task, were dropped from subsequent analyses following the procedures of Boles et al. (2007).
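Concretely, the coefficient of equivalence is the Pearson correlation between paired per-condition means from the two administration groups. A minimal sketch (the data passed in below are made up for illustration, not the study's values):

```python
import numpy as np

def coefficient_of_equivalence(written_means, vocal_means):
    """Pearson r between per-condition mean workload scores obtained
    under the written and vocal formats.

    Each argument holds one mean score per experimental condition,
    averaged across the participants in that administration group
    (e.g., 3 tasks x 5 blocks = 15 conditions).
    """
    w = np.asarray(written_means, dtype=float)
    v = np.asarray(vocal_means, dtype=float)
    return float(np.corrcoef(w, v)[0, 1])
```

Because the correlation is computed over condition means rather than individual participants, it indexes how closely the two formats rank-order and scale the conditions, not participant-level agreement.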

As summarised in Table 1, acceptable equivalence coefficients (r > 0.80) were obtained for the MRQ, global NASA-TLX, and global NASA-RTLX. However, as is typically the case when evaluating individual test items rather than overall scale scores, the cross-modality equivalence of the NASA subscales was less than that of the composite scores. Evidence for equivalence was weakest for self-reported effort and performance.

In addition to comparing the vocal and written versions of the three workload instruments, we also compared the global workload scores obtained using the traditional NASA-TLX to those obtained using the simplified NASA-RTLX across our 15 (3 tasks × 5 replications) experimental conditions. We obtained a correlation between these two measures of r = 0.91, consistent with the r = 0.94 obtained by Moroney et al. (1992) and the r = 0.98 obtained by Byers et al. (1989). These data provide further evidence that substituting the simpler NASA-RTLX for the NASA-TLX may be acceptable in many situations.

Table 1
Coefficients of equivalence for written and vocal versions of workload measures.

Workload measure      Coefficient of equivalence (r)
Global NASA-TLX       0.81
Global NASA-RTLX      0.85
Mental demand         0.78
Physical demand       0.76
Temporal demand       0.65
Effort                0.31
Performance           0.53
Frustration           0.70
Global MRQ            0.96

Although the coefficients of equivalence suggested, with the exception of some of the NASA-TLX subscales, strong relationships between scores obtained through written and vocal administration procedures, there might still be systematic biases (e.g., one format resulting in consistently lower workload estimates than the other). To evaluate this possibility, we performed a 2 (format) × 3 (task) × 5 (repetition) mixed-factor ANOVA on both the global NASA-TLX and NASA-RTLX scores. Administration format was a between-group factor, and task and repetition were both within-participant factors. A similar analysis was performed for global MRQ scores, although the repetition factor was dropped from the ANOVA model because the MRQ was only administered once for each task. There were no statistically reliable main effects or interactions involving administration format, although, as indicated in Table 2, there were main effects of task for some of the measures. In short, although the measures were sensitive to task demands, there was no evidence that the vocal administration modality introduced systematic biases for the NASA-TLX, NASA-RTLX, or MRQ.

2.2.2. Sensitivity to experimental manipulations

The results of the ANOVAs were also used to estimate the effect sizes (partial η²) associated with the manipulation of task demands for the global workload metrics and the MRQ subscales. In a typical study that uses workload as a dependent variable, these task manipulations would be the normal factors of interest. Differences in effect size obtained by different administration conditions or computational procedures, even if small, could be important to researchers who must try to find the most sensitive methods practicable when conducting research in settings where access to research participants may be limited.
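For readers unfamiliar with the metric, partial η² is computed from the ANOVA sums of squares as the effect variance relative to effect-plus-error variance; the numbers below are illustrative only:

```python
def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared effect size from ANOVA sums of squares:
    SS_effect / (SS_effect + SS_error). Unlike classical eta-squared,
    the denominator excludes variance explained by other model factors."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares for a task effect:
effect_size = partial_eta_squared(120.0, 750.0)  # 120 / 870, roughly 0.138
```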

Table 2 provides a summary of the sizes of task effects obtained by both written and vocal versions of the NASA-TLX and NASA-RTLX. All means and associated effect sizes shown in bold represent statistically reliable (p < 0.05) task effects. Note that the sensitivity of the NASA-TLX and NASA-RTLX appears similar, as does the sensitivity of the two administration formats.

Table 2 also summarises the effect of task demands on the global MRQ score and its subscales as a function of administration format. Although all of the MRQ subscales are listed in Table 2, it should be noted once more that 6 scales were dropped from the analysis because they received median scores of zero. The retained scales included all six spatial processing subscales, as well as subscales measuring manual, short-term memory, tactile figural, visual temporal, and vocal processing. We interpreted the rather surprising report by participants that vocal resources were taxed by the surgical tasks to indicate some participants' confusion about whether or not to include the effort required to respond to the workload scales as part of the "task" they were rating. This interpretation is supported by the observation that a non-zero median for vocal processing was obtained only in the vocal administration condition. Note that although the median score for vocal processing in the written condition was zero, we listed the means and standard deviations for the written condition in Table 2 in order to allow comparisons with the vocal condition.

Comparisons of the effect sizes (partial η²) obtained from the simple task effects for the two administration conditions revealed evidence of a sensitivity loss when moving from the written to the vocal version of the MRQ, both for the global MRQ score and for some of the subscales. Subscales that revealed reliable task effects included spatial concentrative, spatial emergent, spatial quantitative (for the written format only), and tactile figural. Note that for these subscales, there is consistency in the task judged most difficult across both conditions; however, there was disagreement between participants in the vocal and written conditions about which task was the least difficult.


Table 2
Means (and standard deviations) of workload scores obtained for the three experimental tasks as a function of administration format (vocal vs. written) in Experiment 1. Effect sizes are reported as partial η².

Workload scale             Written administration                                    Spoken administration
                           Ring         Rope         Cannulation  Effect size        Ring         Rope         Cannulation  Effect size
NASA-TLX                   47.8 (16.4)  44.8 (17.3)  50.3 (17.5)  0.157              46.2 (19.7)  45.4 (19.0)  48.6 (19.5)  0.162
NASA-RTLX                  53.4 (18.5)  49.2 (19.4)  55.6 (19.8)  0.126              46.5 (24.5)  46.9 (24.0)  49.0 (24.8)  0.157
MRQ global                 62.5 (17.1)  56.1 (15.6)  60.0 (15.2)  0.252              58.5 (17.4)  58.3 (18.9)  53.9 (17.5)  0.138
MRQ manual                 93.0 (11.8)  91.0 (20.0)  95.4 (8.5)   0.034              89.4 (16.6)  90.0 (16.6)  86.2 (18.8)  0.015
MRQ spatial-attentive      90.8 (15.1)  90.0 (17.9)  88.4 (18.2)  0.034              77.8 (26.0)  80.4 (25.3)  76.8 (26.8)  0.071
MRQ spatial-categorical    87.0 (20.1)  85.3 (24.5)  86.0 (15.1)  0.014              70.6 (31.6)  81.4 (27.8)  74.3 (26.0)  0.036
MRQ spatial-positional     78.0 (28.9)  79.5 (27.2)  73.3 (27.9)  0.025              68.1 (27.4)  73.3 (26.3)  66.0 (26.9)  0.003
MRQ spatial-concentrative  76.9 (27.8)  62.4 (34.7)  75.3 (27.7)  0.206              72.5 (27.0)  69.5 (31.6)  63.9 (33.8)  0.093
MRQ visual-temporal        53.5 (37.0)  62.1 (38.9)  55.3 (36.1)  0.000              59.5 (36.2)  55.9 (35.0)  54.6 (32.7)  0.011
MRQ spatial-emergent       71.2 (34.0)  42.0 (36.2)  52.9 (35.9)  0.289              66.9 (33.9)  46.1 (34.1)  44.3 (35.0)  0.263
MRQ short-term memory      43.4 (32.5)  48.8 (35.1)  50.8 (27.3)  0.026              37.2 (34.6)  51.0 (37.1)  35.1 (31.1)  0.015
MRQ spatial-quantitative   43.0 (37.3)  18.3 (29.0)  31.9 (32.2)  0.191              41.0 (33.5)  32.8 (35.0)  30.7 (31.5)  0.028
MRQ tactile-figural        30.2 (31.7)  19.6 (30.0)  29.6 (33.9)  0.094              32.8 (30.2)  29.8 (29.8)  39.3 (33.8)  0.132
MRQ vocal                  21.0 (30.1)  18.7 (27.4)  20.8 (29.1)  0.024              27.7 (31.6)  30.9 (33.7)  22.4 (27.1)  0.073
MRQ visual-lexical         –            –            –            –                  –            –            –            –
MRQ visual-phonetic        –            –            –            –                  –            –            –            –
MRQ auditory-linguistic    –            –            –            –                  –            –            –            –
MRQ auditory-emotional     –            –            –            –                  –            –            –            –
MRQ facial-figural         –            –            –            –                  –            –            –            –
MRQ facial-motive          –            –            –            –                  –            –            –            –

C.M. Carswell et al. / Applied Ergonomics 42 (2010) 138–145

Where there was a substantial effect size difference (>0.10) between the vocal and written versions, the written version was consistently more sensitive. One explanation for the superiority of the written version is that the vocal version created more time pressure, leading to more errors (an assumption that is consistent with a non-significant trend found with responses to the "temporal demand" subscale of the NASA-TLX). Whatever the cause, it appears that caution is warranted when deciding whether to substitute the vocal version of the MRQ for the traditional written format.

3. Experiment 2

Experiment 1 revealed several differences between the vocal and written versions of our workload measures. Specifically, although the global metrics of workload exceeded our equivalence criterion, we found low coefficients of equivalence for some of the NASA-TLX subscales. More critically, we found a reduction in the MRQ's sensitivity to task demand when we used the vocal version. These differences may be due in part to participants' confusion about whether to include the effort required to answer the workload questionnaires in their overall assessment of "task workload." In addition, some participants may have found the MRQ subscale constructs difficult to understand, a problem that might have been exacerbated in the vocal condition by their perception of greater time stress.

The purpose of Experiment 2 was to determine whether the results we obtained with the vocal administration of our workload measures to first- and second-year college students generalized to post-baccalaureate medical students. This comparison also offered the opportunity to evaluate the source of the format differences found in Experiment 1. Being a medical student in the U.S. represents, in general, a substantially higher level of academic attainment than enrolment as a first- or second-year undergraduate at a state university. In addition to having completed an undergraduate degree with compulsory pre-medical coursework in the biological and physical sciences, medical students face selective admission into medical school based on their collegiate records and standardized tests. We reasoned that the higher level of education among the medical students might translate into greater sensitivity in their use of the workload instruments, both because they may be better able to understand the instructions and, in the case of the MRQ, may be more familiar with the terminology used to describe the mental resources they must rate. If this is the case, we would predict that the vocal MRQs of medical students, but not necessarily their vocal NASA-TLX/RTLXs, would show greater sensitivity to task effects than those obtained with either vocal or written formats in Experiment 1. We would also expect medical students to be less likely to endorse those MRQ subscales that reflect the load imposed by the questionnaire rather than the load imposed by the primary task (e.g., vocal demand should not be endorsed as a contributor to primary-task workload for the heavily perceptual-motor surgical tasks).

3.1. Method

3.1.1. Participants

Thirteen first- and second-year medical students (2 women, 11 men; 23–30 years) were recruited by online announcements and flyers at the University of Kentucky College of Medicine. For their participation, they were offered $20 beverage coupons.

3.1.2. Tasks, materials, and procedures

The three tasks used in Experiment 1 – the cannulation, ring transfer, and rope manipulation tasks – were again the focus of participants' workload evaluations. All materials and procedures were identical, with the following exceptions: (1) only the vocal form of the workload assessment questionnaires was employed, and (2) the NASA-RTLX but not the NASA-TLX was employed (i.e., the paired-comparison procedure used to calculate subscale weights for the NASA-TLX was dropped).

3.2. Results and discussion

Table 3 summarises the vocal NASA-RTLX and MRQ data obtained for the three tasks. Global NASA-RTLX and MRQ data were submitted to one-way (3 tasks) repeated-measures ANOVAs to obtain both p-values and effect sizes (partial η²). Statistically reliable task effects (p < 0.05) are highlighted in Table 3 in bold type. Note that for the global NASA-RTLX, the effect of task was not statistically reliable (F(2,22) = 1.20; MSe = 188.85, p < 0.40), although for the MRQ it was (F(2,21) = 6.89; MSe = 51.86, p < 0.005). The NASA-RTLX effect size obtained in the current experiment (partial η² = 0.100) was also lower than that obtained with the vocal NASA-RTLX in Experiment 1 (partial η² = 0.157). The effect size obtained for the global MRQ in the present experiment (partial η² = 0.396), on the other hand, was substantially larger than the comparable effect size in Experiment 1 (partial η² = 0.138). Thus, the main difference between Experiments 1 and 2 with respect to the global workload measures was a substantial increase in the sensitivity of the MRQ in Experiment 2.

Table 3
Means (and standard deviations) of workload scores obtained for the three experimental tasks in Experiment 2. Effect sizes are reported as partial η².

Workload scale             Vocal administration
                           Ring         Rope         Cannulation  Effect size
NASA-RTLX global           46.2 (20.8)  44.2 (21.9)  46.7 (21.3)  0.100
MRQ global                 73.3 (8.4)   63.7 (9.5)   67.3 (11.9)  0.396
MRQ manual                 96.3 (8.8)   100 (–)      96.0 (7.0)   0.067
MRQ spatial-attentive      95.0 (10.0)  100 (–)      96.7 (8.6)   0.065
MRQ spatial-positional     94.6 (8.9)   89.5 (24.7)  87.4 (15.8)  0.063
MRQ spatial-categorical    88.8 (12.1)  89.5 (16.2)  87.5 (19.0)  0.006
MRQ spatial-concentrative  89.6 (14.8)  55.0 (37.7)  76.7 (30.5)  0.376
MRQ spatial-emergent       70.0 (24.5)  36.8 (32.8)  37.4 (41.0)  0.427
MRQ short-term memory      55.4 (30.7)  58.6 (31.7)  65.0 (30.6)  0.092
MRQ tactile-figural        36.3 (31.6)  31.8 (25.2)  36.3 (29.3)  0.090
MRQ spatial-quantitative   34.2 (30.9)  11.8 (13.8)  22.5 (33.0)  0.096
MRQ visual-temporal        15.8 (26.1)  11.4 (27.2)  16.3 (26.4)  0.000
MRQ vocal                  –            –            –            –
MRQ visual-lexical         –            –            –            –
MRQ visual-phonetic        –            –            –            –
MRQ auditory-linguistic    –            –            –            –
MRQ auditory-emotional     –            –            –            –
MRQ facial-figural         –            –            –            –
MRQ facial-motive          –            –            –            –

The composite (or global) MRQ score was obtained in both Experiments 1 and 2 by taking the mean of those subscales with median scores greater than zero across all participants in each experiment. In Experiment 1, 6 subscales were dropped, whereas in Experiment 2, 7 of the subscales were dropped. Unlike the ratings provided by previous participants, the vocal subscale received a median rating of zero in the current experiment. This finding supports our belief that the medical students were less likely to confuse the effort required to make vocal workload judgments with the workload of the surgical training tasks that were supposed to be the focus of their ratings.
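The dropping rule described above can be expressed directly: a subscale enters the composite only if its median across all participants exceeds zero. A sketch with hypothetical ratings (not data from either experiment):

```python
import numpy as np

def global_mrq(subscale_scores):
    """Composite MRQ score: mean of the subscales whose median across
    all participants is greater than zero; the rest are dropped.

    `subscale_scores` is a participants-by-subscales array of 0-100
    ratings. Returns one global score per participant."""
    scores = np.asarray(subscale_scores, dtype=float)
    retained = np.median(scores, axis=0) > 0   # boolean mask over subscales
    return scores[:, retained].mean(axis=1)

# Hypothetical ratings: 3 participants x 3 subscales; the middle
# subscale has a median of zero and is therefore excluded.
ratings = [[80, 0, 60],
           [70, 0, 40],
           [90, 10, 50]]
composite = global_mrq(ratings)   # means over the first and third columns
```

Because the retention mask is computed per experiment, the same subscale (e.g., vocal) can contribute to the composite in one sample and be excluded in another, exactly as happened between Experiments 1 and 2.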

One-way (3 tasks) repeated-measures ANOVAs for each of the retained MRQ subscales revealed reliable task effects for two of the subscales – spatial concentrative (F(2,21) = 6.32; MSe = 657.79, p < 0.0071) and spatial emergent (F(2,21) = 7.84; MSe = 545.28, p < 0.0029). In general, pairwise comparisons revealed that the rope task was perceived as requiring fewer processing resources than either of the remaining tasks; the ring transfer task required the most. Note that this is consistent with the pattern found for the written but not the vocal version of the MRQ in Experiment 1, again bringing into question the equivalence and acceptability of using a vocal version of the MRQ with the younger student population. Comparisons of effect sizes associated with the manipulation of tasks across the two experiments for each of the subscales also showed that when large differences existed, they typically favoured the sensitivity of the vocal MRQ when administered to the medical students.

4. General discussion

The current experiments provide guidance for researchers wanting to use vocal rather than written versions of the NASA-TLX, the MRQ, and similar subjective state scales for human factors evaluations. Although our findings may be applicable to other domains, we focused on surgical training environments where there is a need for subjective assessment methods that do not require participants to take their hands or eyes away from their primary tasks. There is also a pressing need to standardize administration procedures in this work domain as the use of such measures becomes more commonplace (e.g., Ballester, 2007; Cao, 2006; Crossan et al., 2001; Stefanidis et al., 2007a,b, 2008a,b).

In general, we obtained acceptable estimates of equivalent-forms reliability between vocal and written versions of the NASA-RTLX (r = 0.81), the NASA-TLX (r = 0.85), and the MRQ (r = 0.96) for global workload estimates. However, we should caution that this same level of equivalence was not achieved for some of the NASA-TLX subscales, especially the "effort" subscale. Thus, although global workload assessments may not be altered by a switch to a vocal administration method, it is still possible that the use of the NASA-TLX as a diagnostic instrument may be compromised.

With respect to the sensitivity of our global measures of workload to task demands, we found little difference between the vocal and written versions of the NASA-TLX and NASA-RTLX. However, there was a substantial loss of sensitivity when comparing the global workload obtained using the vocal MRQ to that obtained with the written version. What was a "modest" effect size in the written condition, using Cohen's conventions (Cohen, 1988), became a "small" effect size in the vocal condition. We speculate that this loss in sensitivity for the vocal version was due to the greater difficulty our participants may have had in understanding the MRQ subscales compared to the NASA-TLX subscales. This difficulty may have been exacerbated by the transient nature of the vocal prompts and the greater associated time stress. When we tested more academically advanced research participants (i.e., the medical students in Experiment 2), our assumptions about the source of the reduced sensitivity for the vocal MRQ were supported. The vocal MRQ scores obtained from the more highly educated group were, in fact, more sensitive to task effects than either the written or vocal MRQ scores in Experiment 1 or any version of the NASA-TLX (or RTLX) in either experiment. These data suggest that the MRQ should receive further evaluation as a function of participants' education level or verbal aptitude; our data support the MRQ's use with our target participants but suggest caution in interpreting the results obtained from undergraduates, especially when administered vocally.

It should also be noted that although the MRQ proved to be a sensitive global measure of workload, it was designed primarily to be a diagnostic instrument for use in pinpointing the cognitive resources that are taxed by particular work procedures or tasks (Boles et al., 2007; Boles and Adair, 2001). The diagnostic feature of the MRQ provided some evidence of a potential source of contamination in workload ratings that deserves more attention. Specifically, we found that when comparing the vocal and written versions of the MRQ in Experiment 1, participants endorsed the use of vocal resources when performing three primary tasks that seemed unlikely to require vocal processes, and for which we observed no task-related vocal responses. However, the endorsement of vocal resource demands only occurred when participants were reporting their workload vocally. Thus, participants appeared to be including in their workload estimates the effort required to make those workload estimates as well as the effort required for the experimental tasks. One implication of this finding is that instructions need to be explicit about what constitutes the boundaries of the primary experimental task. Participants may need to know what NOT to rate.

Finally, the present experiments provided an opportunity to compare the reliability and relative sensitivity of the global workload measures produced by the NASA-TLX, NASA-RTLX, and MRQ. As with previous studies (e.g., Moroney et al., 1995; Nygren, 1991), we found few differences in the data produced using the NASA-TLX and NASA-RTLX. We therefore concur with other researchers who argue that a simple subscale mean (NASA-RTLX) can be successfully used in instances where time does not permit collection of the paired-comparison data necessary to develop individualized subscale weights in the traditional NASA-TLX procedure (Hart and Staveland, 1988).

When comparing the global MRQ to either of the NASA metrics, we found that the relative sensitivity may be population dependent. For our undergraduates in Experiment 1, the NASA-TLX and NASA-RTLX were at least as sensitive to task demands, if not slightly more sensitive, than the MRQ. For the medical students in Experiment 2, the MRQ was far more sensitive than was the NASA-RTLX (note that the NASA-TLX was not obtained in the second study). Thus, for future studies in the medical training environment, the current results suggest that the MRQ should be widely adopted, especially when researchers are interested in cognitive decomposition of task demands.

Acknowledgments

This research was conducted as part of Project STITCH (Surgical Technology Integration with Tools for Cognitive Human Factors), funded by DoD TATRC grant W81XWH-06-1-0761 (Brent Seales, PI).

References

Ballester, P., 2007. The Ergonomic Evaluation and Human-centred Design Approach to Robotic Systems in Minimal Invasive Surgery. Unpublished doctoral dissertation, University of Manchester, United Kingdom.

Boles, D.B., Adair, L.P., 2001. The multiple resources questionnaire (MRQ). Proceedings of the Human Factors and Ergonomics Society 45, 1790–1794.

Boles, D.B., Bursk, J.H., Phillips, J.B., Perdelwitz, J.R., 2007. Predicting dual-task performance with the multiple resources questionnaire (MRQ). Human Factors 49 (1), 32–45.

Byers, J.C., Bittner Jr., A.C., Hill, S.G., 1989. Traditional and raw task load index (TLX) correlations: are paired comparisons necessary? In: Mita, A. (Ed.), Advances in Industrial Ergonomics and Safety I. Taylor and Francis, Philadelphia, PA, pp. 481–485.

Cao, C.G.L., 2006. Guiding navigation in colonoscopy. Surgical Endoscopy 21 (3), 480–484.

Cohen, J., 1988. Statistical Power Analysis for the Behavioral Sciences, second ed. Lawrence Erlbaum Associates, Hillsdale, NJ.

Crossan, A., Brewster, S.A., Reid, S., Mellor, D., 2001. Comparison of simulated ovary training over different skill levels. In: Proceedings of the Eurohaptics Conference, Birmingham, UK, pp. 17–21.

David, H., Pledger, S., 1995. Speech recognition and keyboard input for control and stress reporting in ATC simulation. In: Robertson, S.A. (Ed.), Contemporary Ergonomics. Taylor and Francis, London, pp. 175–180.

Hart, S.G., 2006. NASA-task load index (NASA-TLX): 20 years later. In: Proceedings of the Human Factors and Ergonomics Society 50th Annual Meeting. Human Factors and Ergonomics Society, Santa Monica, CA, pp. 904–908.

Hart, S.G., Staveland, L.E., 1988. Development of NASA-TLX (task load index): results of empirical and theoretical research. In: Hancock, P.A., Meshkati, N. (Eds.), Human Mental Workload. North Holland Press, Amsterdam, pp. 239–250.

Landau, K., Wiese, G., Bopp, V., Sinn-Behrendt, A., Winter, G., Salmanzadeh, H., 2006. Integration of inspection tasks into machine operators' jobs in the consumer goods industry. Occupational Ergonomics 6 (3–4), 159–172.

Liu, Y.L., Wickens, C.D., 1994. Mental workload and cognitive task automaticity: an evaluation of subjective and time-estimation metrics. Ergonomics 37 (11), 1843–1854.

Matthews, R., Legg, S., Charlton, S., 2003. The effect of cell phone type on drivers' subjective workload during concurrent driving and conversation. Accident Analysis and Prevention 35, 451–457.

Moroney, W.F., Biers, D.W., Eggemeier, F.T., 1995. Some measurement and methodological considerations in the application of subjective workload measurement techniques. The International Journal of Aviation Psychology 5 (1), 87–106.

Moroney, W.F., Biers, D.W., Eggemeier, F.T., Mitchell, J.A., 1992. A comparison of two scoring procedures with the NASA Task Load Index in a simulated flight task. Proceedings of the Aerospace Electronics Conference, USA 2, 734–740.

Noyes, J.M., Bruneau, D.P.J., 2007. A self-analysis of the NASA-TLX subjective workload measure. Ergonomics 50, 514–519.

Noyes, J.M., Garland, K.J., 2008. Computer- vs. paper-based tasks: are they equivalent? Ergonomics 51, 1352–1375.

Nygren, T.E., 1991. Psychometric properties of subjective workload measurement techniques: implications for their use in the assessment of perceived mental workload. Human Factors 33, 17–33.

Otmani, S., Rogé, J., Muzet, A., 2005. Sleepiness in professional drivers: effect of age and time of day. Accident Analysis and Prevention 37 (5), 930–937.

Park, J., Jung, W., 2006. A study on the validity of a task complexity measure for emergency operating procedures of nuclear power plants – comparing with a subjective workload. IEEE Transactions on Nuclear Science 53 (5), 2962–2970.

Reid, G.B., Nygren, T.E., 1988. The subjective workload assessment technique: a scaling procedure for measuring mental workload. In: Hancock, P.A., Meshkati, N. (Eds.), Human Mental Workload. North-Holland Press, Amsterdam, pp. 185–218.

Richards, J.S., Fine, P.R., Wilson, T.L., Rogers, J.T., 1983. A voice-operated method for administering the MMPI. Journal of Personality Assessment 47 (2), 167–170.

Stefanidis, D., Haluck, R., Pham, T., Dunne, B.J., Reinke, T., Markley, S., et al., 2007a. Construct and face validity and task workload for laparoscopic camera navigation: virtual reality versus videotrainer systems at the SAGES Learning Center. Surgical Endoscopy 21 (7), 1158–1164.

Stefanidis, D., Korndorffer, J.R., Markley, S., Sierra, R., Heniford, B.T., Scott, D.J., 2007b. Closing the gap in operative performance between novices and experts: does harder mean better for laparoscopic simulator training? Journal of the American College of Surgeons 205 (2), 307–313.

Stefanidis, D., Acker, C.E., Swiderski, D., Heniford, T.B., Greene, F.L., 2008a. Challenges during the implementation of a laparoscopic skills curriculum in a busy general surgery residency program. Journal of Surgical Education 65 (1), 4–7.

Stefanidis, D., Scerbo, M.W., Sechrist, C., Mostafavi, A., Heniford, T., 2008b. Do novices display automaticity during simulator training? The American Journal of Surgery 195 (2), 210–213.

Tsang, P.S., Velazquez, V.L., 1996. Diagnosticity and multidimensional workload ratings. Ergonomics 31, 358–381.

Tsang, P.S., Vidulich, M.A., 2006. Mental workload and situation awareness. In: Salvendy, G. (Ed.), Handbook of Human Factors and Ergonomics, third ed. Wiley, New York (Chapter 9).

Vidulich, M.A., 1989. The use of judgment matrices in subjective workload assessment: the subjective workload dominance (SWORD) technique. In: Proceedings of the Human Factors Society 33rd Annual Meeting, pp. 1406–1410.

Windell, D., Wiebe, E.N., Converse-Lane, S.A., Beith, B., 2006. A comparison of two mental workload instruments in multimedia instruction. In: Proceedings of the HFES Annual Meeting, San Francisco, CA, pp. 1764–1768.