[IEEE 2011 IEEE Virtual Reality (VR) - Singapore, Singapore (2011.03.19-2011.03.23)] 2011 IEEE Virtual Reality Conference - An initial exploration of conversational errors as a novel method for evaluating virtual human experiences


<ul><li><p>An Initial Exploration of Conversational Errors as a Novel Method for Evaluating Virtual Human Experiences</p>
<p>Richard Skarbez, University of North Carolina at Chapel Hill</p>
<p>Aaron Kotranza, University of Florida</p>
<p>Frederick P. Brooks, Jr., University of North Carolina at Chapel Hill</p>
<p>Benjamin Lok, University of Florida</p>
<p>Mary C. Whitton, University of North Carolina at Chapel Hill</p>
<p>ABSTRACT</p>
<p>We present a new method for evaluating user experience in interactions with virtual humans (VHs). We code the conversational errors made by the VH. These errors, in addition to the duration of the interaction and the numbers of statements made by the participant and the VH, provide objective, quantitative data about the virtual social interaction. We applied this method to a set of previously collected interactions between medical students and VH patients and present preliminary results. The error metrics do not correlate with traditional measures of the quality of a virtual experience, e.g., presence and copresence questionnaires. The error metrics were significantly correlated with scores on the Maastricht Assessment of Simulated Patients (MaSP), a scenario-appropriate measure of simulation quality, suggesting further investigation is warranted.</p>
<p>Keywords: virtual humans, embodied agents, virtual reality, virtual environments, task performance</p>
<p>Index Terms: H.1.2 [Models and Principles]: User/Machine Systems - Human Factors; H.5.2 [Information Systems and Presentation]: User Interfaces - Evaluation; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism - Virtual Reality</p>
<p>1 INTRODUCTION</p>
<p>An increasingly common use of virtual humans (VHs) is to train social skills for interpersonal interactions. 
As simulations of such interactions are being deployed in medical, law enforcement, and military fields [2, 4, 6], a critical question for VH researchers is how best to evaluate the efficacy of the social interaction between the user and VH, e.g., "Was the VH believable in its role?" or "Was the user engaged?" This is a different construct from a simulation's efficacy for training a domain-specific task. For example, the simulation of a medical interview is likely to be evaluated on domain-specific outcomes such as diagnostic correctness [6].</p>
<p>Our goal is to identify and validate a set of domain-agnostic measures that can be used to evaluate the efficacy of VHs in interpersonal simulations. Our approach here is to score recordings of human-VH interactions with a new set of objective, quantitative, and domain-agnostic measures and test whether these correlate with validated subjective, qualitative, and domain-specific measures (e.g., [12]). Our source data were previously collected recordings of medical students conducting medical interviews with VH patients [5, 8]. We found that domain-agnostic measures of VH conversational errors do correlate with domain-specific measures.</p>
<p>e-mail: skarbez@cs.unc.edu; e-mail: akotranz@cise.ufl.edu; e-mail: brooks@cs.unc.edu; e-mail: lok@cise.ufl.edu; e-mail: whitton@cs.unc.edu</p>
<p>Figure 1: A medical student converses with a VH patient presented at life-size on a projection screen.</p>
<p>2 BACKGROUND</p>
<p>Interpersonal Simulation with Virtual Humans. Success in fields such as medicine, law enforcement, and military leadership relies on mastering communication skills. These interpersonal skills are traditionally taught via practice sessions in which learners role-play with real humans. Experiences in which the learner role-plays with VHs are now augmenting or replacing this approach [2, 4, 6].</p>
<p>A Domain-Specific Evaluation of the Virtual Human: The Maastricht Assessment of Simulated Patients (MaSP). 
The MaSP is used to evaluate social aspects of (human) standardized patients (SPs) [12]. It is administered to medical students after an SP interaction. The MaSP has previously been validated for use in evaluating VH patients [6]. In this work we used it as a domain-specific measure against which to compare our new measures.</p>
<p>Presence, Copresence, and Virtual Humans. It is common for researchers working with human-VH (H-VH) interactions to measure feelings of presence and copresence in the human. In previous work, these measures were typically not affected by altering a wide range of characteristics of the VH and its presentation (e.g., agency, appearance, and display) [3, 5, 7, 13]. This suggests that presence and copresence measures may not be appropriate means of characterizing user experiences with VHs.</p>
<p>3 STUDY DESIGN</p>
<p>We use the number, frequency, and types of conversational errors as objective, quantitative measures of the efficacy of the VH in the interpersonal simulation. This method is inspired by Slater's use of breaks in presence as an objective, quantitative metric for measuring presence in virtual environments [10] and in virtual social scenarios [9]. As an initial evaluation of our approach, we investigated conversational errors in a previously collected set of human-VH interactions.</p>
<p>IEEE Virtual Reality 2011, 19-23 March, Singapore. 978-1-4577-0038-5/11/$26.00 ©2011 IEEE</p></li><li><p>Table 1: Categories of conversational errors</p>
<table>
<tr><th>Category</th><th>Description</th></tr>
<tr><td>No response</td><td>The trainee asked a question; the VH did not respond.</td></tr>
<tr><td>No question</td><td>The VH replied without the trainee speaking.</td></tr>
<tr><td>Wrong answer</td><td>The trainee asks a question of the VH and the VH responds with a nonsensical answer, one meant in response to a different question.</td></tr>
<tr><td>Repeat</td><td>The VH response was given verbatim earlier in the conversation. E.g., the VH responds with "Hi doctor" five minutes into the interview. This category excludes short responses that would naturally be given multiple times in a conversation, e.g., "yes" or "no."</td></tr>
<tr><td>Interrupt</td><td>The VH response cuts off a trainee mid-question.</td></tr>
<tr><td>Answer pileup</td><td>In response to a single question from the trainee, the VH makes several statements sequentially without pausing. E.g., a sequence of: trainee question; no VH response; trainee question; VH responds to both questions.</td></tr>
</table>
<p>We defined a rubric for coding conversational errors, used this rubric to code the H-VH video logs, and computed conversation summary metrics. We then applied a factor analysis to identify the constructs underlying these metrics.</p>
<p>Data Source. Data were available for twenty-five students from the Medical College of Georgia who conducted a medical interview of a VH patient who presented with a breast mass (Figure 1). Video recordings and MaSP, presence (SUS questionnaire [11]), and copresence [1] scores were collected for all participants.</p>
<p>Conversational Error Metrics. In this preliminary study, a single coder watched each video and recorded the time and type of any errors. We considered only errors related to the VH's responses to the participant's speech; Table 1 gives the error codes. Note that it is possible for a single statement to contain multiple errors.</p>
<p>We computed sixteen measurements for each trainee, including the conversation duration, the number of trainee utterances, the number of VH statements containing errors, and the percentage of trainee statements that went unrecognized by the system. These metrics can be broadly grouped into three categories: counting metrics, time-based-rate metrics, and statement-based-rate metrics.</p>
<p>Factor Analysis. We performed a factor analysis on these metrics to determine the underlying constructs contributing to them. The KMO measure of sampling adequacy for the data was greater than 0.5 (0.713), and Bartlett's Test of Sphericity was significant (p &lt; .001). 
These results suggested that a principal-components analysis of the data could be conducted. Inspection of the resulting eigenvalues and scree plot confirmed the extraction of four factors. A Varimax rotation was applied to improve interpretability.</p>
<p>Factor analysis revealed four factors that, together, explain 95.5% of the variance in the collected data. These factors were: Error rate, including No response errors (35.5% of variance); Duration of interaction (32.2%); Likelihood that a VH statement contains an error (17.6%); and Severity of errors (10.3%).</p>
<p>4 RESULTS</p>
<p>Generic Qualitative Reports: Presence and Copresence Questionnaires. No correlation was found between the collected error metrics and the participant-reported presence or copresence ratings.</p>
<p>Scenario-Specific Qualitative Reports: MaSP. Significant relationships between some error metrics and the MaSP were found (Table 2). All of these metrics are negatively loaded onto Factor 2 (Duration), suggesting that the length of the interaction is a key factor in the trainee's evaluation of the VH. This suggests that the lower MaSP scores result from a cumulative negative effect of conversational errors and the relatively high error rate at present.</p>
<p>Table 2: Significant correlations with mean MaSP score</p>
<table>
<tr><th>Metric</th><th>r(23)</th><th>p-value</th></tr>
<tr><td>Length of interview (s)</td><td>-.45</td><td>.025</td></tr>
<tr><td>Erroneous speech events</td><td>-.45</td><td>.026</td></tr>
<tr><td>Number of errors</td><td>-.44</td><td>.027</td></tr>
<tr><td>Number of VH statements</td><td>-.40</td><td>.046</td></tr>
<tr><td>Number of trainee statements</td><td>-.41</td><td>.042</td></tr>
</table>
<p>5 DISCUSSION AND FUTURE WORK</p>
<p>The MaSP is a validated subjective and qualitative measure of the quality of a patient simulation. Finding objective, quantitative measures that correlate with the MaSP score would provide an alternative method for evaluating VH experiences, at least those with VH patients. 
From this initial evaluation, the selected error metrics appear to function as such a measure for VH experiences.</p>
<p>The next step is to apply this technique to more data sets and with multiple independent coders, to better establish the reliability and validity of the metrics. We also hope to confirm the validity of the factors identified in this study, and validate a small set of metrics to reliably measure these factors.</p>
<p>ACKNOWLEDGEMENTS</p>
<p>Andrew Raij, Brent Rossen, and Kyle Johnsen in Benjamin Lok's laboratory designed and conducted the studies that generated these data. They were generous with their time and advice.</p>
<p>REFERENCES</p>
<p>[1] J. N. Bailenson, K. Swinth, C. Hoyt, S. Persky, A. Dimov, and J. Blascovich. The independent and interactive effects of embodied-agent appearance and behavior on self-report, cognitive, and behavioral markers of copresence in immersive virtual environments. 14(4):379-393, 2005.</p>
<p>[2] G. A. Frank, C. I. Guinn, R. C. Hubal, M. A. Stanford, P. Pope, and L. D. Weisel. Just-talk: An application of responsive virtual human technology. In Proc. I/ITSEC, December 2002.</p>
<p>[3] M. Garau, M. Slater, D.-P. Pertaub, and S. Razzaque. The responses of people to virtual humans in an immersive virtual environment. 14:104-116, 2005.</p>
<p>[4] R. W. Hill, Jr., A. W. Hill, J. Gratch, S. Marsella, J. Rickel, W. Swartout, and D. Traum. Virtual humans in the mission rehearsal exercise system. 17:32-38, 2003.</p>
<p>[5] K. Johnsen and B. Lok. An evaluation of immersive displays for virtual human experiences. In Proc. IEEE VR, pages 133-136, 2008.</p>
<p>[6] K. Johnsen, A. Raij, A. Stevens, D. S. Lind, B. Lok, and P. D. The validity of a virtual human experience for interpersonal skills education. In Proc. CHI. ACM Press, 2007.</p>
<p>[7] K. L. Nowak and F. Biocca. The effect of the agency and anthropomorphism of users' sense of telepresence, copresence, and social presence in virtual environments. 12(5):481-494, 2003.</p>
<p>[8] B. Rossen, K. 
Johnsen, A. Deladisma, S. Lind, and B. Lok. Virtual humans elicit skin-tone bias consistent with real-world skin-tone biases. In Proc. IVA, pages 237-244, Berlin, Heidelberg, 2008. Springer-Verlag.</p>
<p>[9] M. Slater, C. Guger, G. Edlinger, R. Leeb, G. Pfurtscheller, A. Antley, M. Garau, A. Brogni, and D. Friedman. Analysis of physiological responses to a social situation in an immersive virtual environment. 15(5):553-569, 2006.</p>
<p>[10] M. Slater and A. Steed. A virtual presence counter. 9(5):413-434, 2000.</p>
<p>[11] M. Slater, A. Steed, and M. Usoh. The virtual treadmill: a naturalistic metaphor for navigation in immersive virtual environments. In Eurographics VE, pages 135-148, London, UK, 1995. Springer-Verlag.</p>
<p>[12] L. A. Wind, J. Van Dalen, A. M. M. Muijtjens, and J.-J. Rethans. Assessing simulated patients in an educational setting: the MaSP (Maastricht Assessment of Simulated Patients). 38(1):39-44, 2004.</p>
<p>[13] C. A. Zanbaka, A. C. Ulinski, P. Goolkasian, and L. F. Hodges. Social responses to virtual humans: implications for future interface design. In Proc. SIGCHI HFCS, pages 1561-1570, New York, NY, USA, 2007. 
ACM.</p></li></ul>
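As an illustration of the three metric categories named under Conversational Error Metrics (counting, time-based-rate, and statement-based-rate), the following minimal Python sketch computes a few such metrics from coded events. The record layout, the identifiers, and the particular metrics shown are assumptions made for illustration only; the paper does not list its sixteen measurements in full, nor does it describe the tooling used. The error codes are those of Table 1.

```python
from collections import Counter
from dataclasses import dataclass

# Error codes from Table 1; the identifier spellings are illustrative.
ERROR_TYPES = {
    "no_response", "no_question", "wrong_answer",
    "repeat", "interrupt", "answer_pileup",
}


@dataclass
class SpeechEvent:
    """One coded event (hypothetical layout, not the authors' format)."""
    time_s: float            # seconds from the start of the interview
    errors: tuple = ()       # zero or more codes; one event may carry several
    vh_spoke: bool = True    # False for a "no response" event (VH said nothing)


def summarize(duration_s, n_trainee_statements, events):
    """Compute example counting, time-based-rate, and statement-based-rate metrics."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    for event in events:
        unknown = set(event.errors) - ERROR_TYPES
        if unknown:
            raise ValueError(f"unknown error codes: {sorted(unknown)}")

    # Counting metrics.
    n_vh_statements = sum(1 for e in events if e.vh_spoke)
    n_errors = sum(len(e.errors) for e in events)
    n_erroneous_events = sum(1 for e in events if e.errors)
    n_erroneous_statements = sum(1 for e in events if e.vh_spoke and e.errors)

    return {
        "duration_s": duration_s,
        "n_trainee_statements": n_trainee_statements,
        "n_vh_statements": n_vh_statements,
        "n_errors": n_errors,
        "n_erroneous_events": n_erroneous_events,
        "errors_by_type": Counter(code for e in events for code in e.errors),
        # Time-based rate: errors per minute of conversation.
        "errors_per_minute": n_errors / (duration_s / 60.0),
        # Statement-based rate: share of VH statements containing an error.
        "erroneous_statement_fraction": (
            n_erroneous_statements / n_vh_statements if n_vh_statements else 0.0
        ),
    }


# Made-up data for a ten-minute interview; not taken from the study.
events = [
    SpeechEvent(12.0),
    SpeechEvent(30.0, ("wrong_answer",)),
    SpeechEvent(55.0, ("no_response",), vh_spoke=False),
    SpeechEvent(300.0, ("repeat", "interrupt")),
    SpeechEvent(310.0),
]
metrics = summarize(duration_s=600.0, n_trainee_statements=40, events=events)
print(metrics)
```

Storing error codes per event, rather than keeping one running tally, lets a single statement carry multiple errors, as the rubric allows, and keeps "number of errors" distinct from "erroneous speech events," two quantities that Table 2 reports separately.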