An Initial Exploration of Conversational Errors as a Novel Method for Evaluating Virtual Human Experiences

Richard Skarbez∗
University of North Carolina at Chapel Hill

Aaron Kotranza†
University of Florida

Frederick P. Brooks, Jr.‡
University of North Carolina at Chapel Hill

Benjamin Lok§
University of Florida

Mary C. Whitton¶
University of North Carolina at Chapel Hill

ABSTRACT

We present a new method for evaluating user experience in interactions with virtual humans (VHs). We code the conversational errors made by the VH. These errors, in addition to the duration of the interaction and the numbers of statements made by the participant and the VH, provide objective, quantitative data about the virtual social interaction. We applied this method to a set of previously collected interactions between medical students and VH patients and present preliminary results. The error metrics do not correlate with traditional measures of the quality of a virtual experience, e.g., presence and copresence questionnaires. The error metrics were significantly correlated with scores on the Maastricht Assessment of Simulated Patients (MaSP), a scenario-appropriate measure of simulation quality, suggesting further investigation is warranted.

Keywords: virtual humans, embodied agents, virtual reality, virtual environments, task performance

Index Terms: H.1.2 [Models and Principles]: User/Machine Systems—Human Factors; H.5.2 [Information Systems and Presentation]: User Interfaces—Evaluation; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual Reality

1 INTRODUCTION

An increasingly common use of virtual humans (VHs) is to train social skills for interpersonal interactions. As simulations of such interactions are deployed in medical, law enforcement, and military fields [2, 4, 6], a critical question for VH researchers is how best to evaluate the efficacy of the social interaction between the user and the VH, e.g., Was the VH believable in its role? Was the user engaged? This is a different construct from a simulation's efficacy for training a domain-specific task. For example, the simulation of a medical interview is likely to be evaluated on domain-specific outcomes such as diagnostic correctness [6].

Our goal is to identify and validate a set of domain-agnostic measures that can be used to evaluate the efficacy of VHs in interpersonal simulations. Our approach here is to score recordings of human-VH interactions with a new set of objective, quantitative, domain-agnostic measures and to test whether these correlate with validated subjective, qualitative, domain-specific measures (e.g., [12]). Our source data were previously collected recordings of medical students conducting medical interviews with VH patients [5, 8]. We found that domain-agnostic measures of VH conversational errors do correlate with domain-specific measures.

∗e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]
§e-mail: [email protected]
¶e-mail: [email protected]

Figure 1: A medical student converses with a VH patient presented at life-size on a projection screen.

2 BACKGROUND

Interpersonal Simulation with Virtual Humans. Success in fields such as medicine, law enforcement, and military leadership relies on mastering communication skills. These interpersonal skills are traditionally taught via practice sessions in which learners role-play with real humans. Experiences in which the learner role-plays with VHs are now augmenting or replacing this approach [2, 4, 6].

A Domain-Specific Evaluation of the Virtual Human: The Maastricht Assessment of Simulated Patients (MaSP). The MaSP is used to evaluate social aspects of (human) standardized patients (SPs) [12]. It is administered to medical students after an SP interaction. The MaSP has previously been validated for use in evaluating VH patients [6]. In this work we used it as a domain-specific measure against which to compare our new measures.

Presence, Copresence, and Virtual Humans. It is common for researchers working with H-VH interactions to measure the human's feelings of presence and copresence. In previous work, these measures have typically not been affected by altering a wide range of characteristics of the VH and its presentation (e.g., agency, appearance, and display) [3, 5, 7, 13]. This suggests that presence and copresence measures may not be an appropriate means of characterizing user experiences with VHs.

3 STUDY DESIGN

We use the number, frequency, and types of conversational errors as objective, quantitative measures of the efficacy of the VH in the interpersonal simulation. This method is inspired by Slater's use of breaks in presence as an objective, quantitative metric for measuring presence in virtual environments [10] and in virtual social scenarios [9]. As an initial evaluation of our approach, we investigated conversational errors in a previously collected set of human-VH interactions.



Table 1: Categories of conversational errors

No response: The trainee asked a question; the VH did not respond.
No question: The VH replied without the trainee speaking.
Wrong answer: The trainee asked a question of the VH and the VH responded with a nonsensical answer, one meant in response to a different question.
Repeat: The VH response was given verbatim earlier in the conversation, e.g., the VH responds with "Hi doctor" five minutes into the interview. This category excludes short responses that would naturally be given multiple times in a conversation, e.g., "yes" or "no."
Interrupt: The VH response cuts off the trainee mid-question.
Answer pileup: In response to a single question from the trainee, the VH makes several statements sequentially without pausing, e.g., the sequence: trainee question; no VH response; trainee question; VH responds to both questions.

We defined a rubric for coding conversational errors, used this rubric to code the H-VH video logs, and computed conversation summary metrics. We then applied a factor analysis to identify the constructs underlying these metrics.

Data Source. Data were available for twenty-five students from the Medical College of Georgia who conducted a medical interview of a VH patient who presented with a breast mass (Figure 1). Video recordings and MaSP, presence (SUS questionnaire [11]), and copresence [1] scores were collected for all participants.

Conversational Error Metrics. In this preliminary study, a single coder watched each video and recorded the time and type of any errors. We considered only errors related to the VH's responses to the participant's speech; Table 1 gives the error codes. Note that a single statement can contain multiple errors.
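For concreteness, the following is a minimal sketch of one way the coded error events and the per-conversation summary metrics could be represented in Python. The ErrorEvent fields, the metric names, and the assumption that "No response" errors mark unrecognized trainee statements are illustrative, not the coding tool used in the study.

# Minimal sketch (not the authors' code) of coded error events and the
# summary metrics computed from them. Names are hypothetical; "No response"
# is assumed to be the marker for unrecognized trainee speech.
from dataclasses import dataclass

@dataclass
class ErrorEvent:
    time_s: float    # seconds into the interview when the error occurred
    category: str    # one of the Table 1 categories, e.g. "No response"

def summary_metrics(duration_s: float, trainee_statements: int,
                    vh_statements: int, errors: list) -> dict:
    """Examples of the three metric categories: counting metrics,
    time-based-rate metrics, and statement-based-rate metrics."""
    n_errors = len(errors)
    unrecognized = sum(e.category == "No response" for e in errors)
    return {
        "num_errors": n_errors,                               # counting
        "errors_per_minute": n_errors / (duration_s / 60.0),  # time-based rate
        "errors_per_vh_statement": n_errors / vh_statements,  # statement-based rate
        "pct_unrecognized": 100.0 * unrecognized / trainee_statements,
    }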

We computed sixteen measurements for each trainee, including the conversation duration, the number of trainee utterances, the number of VH statements containing errors, and the percentage of trainee statements that went unrecognized by the system. These metrics can be broadly grouped into three categories: counting metrics, time-based-rate metrics, and statement-based-rate metrics.

Factor Analysis. We performed a factor analysis on these metrics to determine the underlying constructs contributing to them. The KMO measure of sampling adequacy for the data was greater than 0.5 (0.713), and Bartlett's Test of Sphericity was significant (p < .001). These results indicated that a principal-components analysis of the data could be conducted. Inspection of the resulting eigenvalues and scree plot supported the extraction of four factors. A Varimax rotation was applied to improve interpretability.
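This pipeline is straightforward to reproduce; the sketch below uses the third-party factor_analyzer package on a hypothetical input file with one row per trainee and one column per metric. The file name and the choice of this particular package are our assumptions, not the authors' original tooling.

# Sketch of the factorability checks and rotated principal-components
# extraction described above, using the factor_analyzer package.
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                             calculate_kmo)

metrics = pd.read_csv("error_metrics.csv")  # hypothetical file, n = 25 rows

# Factorability checks: KMO > 0.5 and a significant Bartlett's test indicate
# the correlation matrix is suitable for factor extraction.
chi2, bartlett_p = calculate_bartlett_sphericity(metrics)
_, kmo_total = calculate_kmo(metrics)
print(f"KMO = {kmo_total:.3f}, Bartlett p = {bartlett_p:.4g}")

# Extract four factors (the number suggested by the eigenvalues and scree
# plot) with a Varimax rotation to aid interpretation.
fa = FactorAnalyzer(n_factors=4, rotation="varimax", method="principal")
fa.fit(metrics)
print(fa.loadings_)                 # metric-by-factor loadings
print(fa.get_factor_variance()[1])  # proportion of variance per factor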

Factor analysis revealed four factors that together explain 95.5% of the variance in the collected data. These factors were: error rate, including "No response" errors (35.5% of variance); duration of interaction (32.2%); likelihood that a VH statement contains an error (17.6%); and severity of errors (10.3%).

4 RESULTS

Generic Qualitative Reports: Presence and Copresence Questionnaires. No correlation was found between the collected error metrics and the participant-reported presence or copresence ratings.

Scenario-Specific Qualitative Reports: MaSP. Significant relationships between some error metrics and the MaSP were found (Table 2). All of these metrics are negatively loaded onto Factor 2 (Duration), suggesting that the length of the interaction is a key factor in the trainees' evaluation of the VH. This suggests that the lower MaSP scores result from a cumulative negative effect of conversational errors and from the relatively high error rate at present.
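The per-metric tests reported in Table 2 are simple Pearson correlations across the 25 trainees (df = 23). A minimal sketch, assuming hypothetical file and column names:

# Sketch of the correlation tests behind Table 2; the file and column
# names are hypothetical placeholders.
import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv("study_data.csv")  # one row per trainee (n = 25)
metrics = ["interview_length_s", "erroneous_speech_events", "num_errors",
           "num_vh_statements", "num_trainee_statements"]
for m in metrics:
    r, p = pearsonr(data[m], data["masp_mean"])
    print(f"{m}: r(23) = {r:.2f}, p = {p:.3f}")  # df = n - 2 = 23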

5 DISCUSSION AND FUTURE WORK

The MaSP is a validated subjective, qualitative measure of the quality of a patient simulation.

Table 2: Significant correlations with mean MaSP score

Metric                          r(23)   p-value
Length of interview (s)         -.45    .025
Erroneous speech events         -.45    .026
Number of errors                -.44    .027
Number of VH statements         -.40    .046
Number of trainee statements    -.41    .042

Finding objective, quantitative measures that correlate with the MaSP score would provide an alternative method for evaluating VH experiences, at least those with VH patients. From this initial evaluation, the selected error metrics appear to function as such a measure.

The next step is to apply this technique to more data sets, with multiple independent coders, to better establish the reliability and validity of the metrics. We also hope to confirm the validity of the factors identified in this study and to validate a small set of metrics that reliably measure these factors.

ACKNOWLEDGEMENTS

Andrew Raij, Brent Rossen, and Kyle Johnsen in Benjamin Lok's laboratory designed and conducted the studies that generated these data. They were generous with their time and advice.

REFERENCES

[1] J. N. Bailenson, K. Swinth, C. Hoyt, S. Persky, A. Dimov, and J. Blascovich. The independent and interactive effects of embodied-agent appearance and behavior on self-report, cognitive, and behavioral markers of copresence in immersive virtual environments. Presence: Teleoperators and Virtual Environments, 14(4):379–393, 2005.
[2] G. A. Frank, C. I. Guinn, R. C. Hubal, M. A. Stanford, P. Pope, and L. D. Weisel. Just-Talk: An application of responsive virtual human technology. In Proc. I/ITSEC, December 2002.
[3] M. Garau, M. Slater, D.-P. Pertaub, and S. Razzaque. The responses of people to virtual humans in an immersive virtual environment. Presence: Teleoperators and Virtual Environments, 14:104–116, 2005.
[4] R. W. Hill, Jr., A. W. Hill, J. Gratch, S. Marsella, J. Rickel, W. Swartout, and D. Traum. Virtual humans in the mission rehearsal exercise system. 17:32–38, 2003.
[5] K. Johnsen and B. Lok. An evaluation of immersive displays for virtual human experiences. In Proc. IEEE VR, pages 133–136, 2008.
[6] K. Johnsen, A. Raij, A. Stevens, D. S. Lind, B. Lok, and P. D. The validity of a virtual human experience for interpersonal skills education. In Proc. CHI. ACM Press, 2007.
[7] K. L. Nowak and F. Biocca. The effect of the agency and anthropomorphism on users' sense of telepresence, copresence, and social presence in virtual environments. Presence: Teleoperators and Virtual Environments, 12(5):481–494, 2003.
[8] B. Rossen, K. Johnsen, A. Deladisma, S. Lind, and B. Lok. Virtual humans elicit skin-tone bias consistent with real-world skin-tone biases. In Proc. IVA, pages 237–244, Berlin, Heidelberg, 2008. Springer-Verlag.
[9] M. Slater, C. Guger, G. Edlinger, R. Leeb, G. Pfurtscheller, A. Antley, M. Garau, A. Brogni, and D. Friedman. Analysis of physiological responses to a social situation in an immersive virtual environment. Presence: Teleoperators and Virtual Environments, 15(5):553–569, 2006.
[10] M. Slater and A. Steed. A virtual presence counter. Presence: Teleoperators and Virtual Environments, 9(5):413–434, 2000.
[11] M. Slater, A. Steed, and M. Usoh. The virtual treadmill: A naturalistic metaphor for navigation in immersive virtual environments. In Eurographics VE, pages 135–148, London, UK, 1995. Springer-Verlag.
[12] L. A. Wind, J. Van Dalen, A. M. M. Muijtjens, and J.-J. Rethans. Assessing simulated patients in an educational setting: The MaSP (Maastricht Assessment of Simulated Patients). Medical Education, 38(1):39–44, 2004.
[13] C. A. Zanbaka, A. C. Ulinski, P. Goolkasian, and L. F. Hodges. Social responses to virtual humans: Implications for future interface design. In Proc. SIGCHI HFCS, pages 1561–1570, New York, NY, USA, 2007. ACM.


