Dialog Management for Rapid-Prototyping of Speech-Based Training Agents
Victor Hung, Avelino Gonzalez, Ronald DeMara
University of Central Florida


Page 1

Dialog Management for Rapid-Prototyping of Speech-Based Training Agents

Victor Hung, Avelino Gonzalez, Ronald DeMara

University of Central Florida

Page 2

Introduction

Approach

Evaluation

Results

Conclusions

Agenda

Page 3

Introduction

• General Problem
  – Elevate speech-based discourse to a new level of naturalness in Embodied Conversational Agents (ECAs) carrying open-domain dialog

• Specific Problem
  – Overcome Automatic Speech Recognition (ASR) limitations
  – Domain-independent knowledge management

• Training Agent Design
  – Conversational input that is robust to ASR errors, backed by an adaptable knowledge base

Page 4

Approach

• Build a dialog manager that:
  – Handles ASR limitations
  – Manages domain-independent knowledge
  – Provides open dialog

• CONtext-driven Corpus-based Utterance Robustness (CONCUR)
  – Input Processor
  – Knowledge Manager
  – Discourse Model

[Block diagram: user input enters the Dialog Manager's Input Processor, which feeds the Discourse Model and Knowledge Manager; the Dialog Manager's I/O emits the agent response]
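The turn-level data flow above can be sketched as a minimal loop. Only the block structure (user input, Input Processor, Knowledge Manager, Discourse Model, agent response) comes from the slide; the class name, method names, and keyphrase-matching internals below are hypothetical stand-ins:

```python
class DialogManager:
    """Minimal sketch of CONCUR's three-component turn loop.

    The keyphrase-coverage matching here is an assumed illustration,
    not CONCUR's actual knowledge-retrieval algorithm.
    """

    def __init__(self, corpus):
        # Knowledge Manager's store: keyphrase -> corpus answer.
        self.corpus = corpus

    def process_input(self, utterance):
        # Input Processor: break the user utterance down into tokens.
        return set(utterance.lower().split())

    def lookup(self, tokens):
        # Knowledge Manager: return the first answer whose keyphrase
        # is fully covered by the utterance tokens.
        for keyphrase, answer in self.corpus.items():
            if set(keyphrase.split()) <= tokens:
                return answer
        return None

    def respond(self, utterance):
        # Discourse Model: answer if possible; otherwise treat the
        # turn as an out-of-corpus request and ask for a rephrase.
        answer = self.lookup(self.process_input(utterance))
        return answer or "Could you rephrase that?"
```

The fallback branch is where out-of-corpus misunderstandings (a quality metric used later in the evaluation) would be counted.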

Page 5

CONCUR

• Input Processor
  – Pre-processes the knowledge corpus via keyphrasing
  – Breaks down the user utterance

[Block diagram: corpus data and the user utterance feed the Input Processor, which draws on a Keyphrase Extractor, WordNet, and an NLP toolkit]

• Knowledge Manager
  – Three databases
  – Encyclopedia-entry-style corpus
  – Context-driven
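The slide names a Keyphrase Extractor backed by WordNet and an NLP toolkit but does not give its algorithm, so the following is only a minimal frequency-based sketch of corpus keyphrasing, with an assumed stopword list and an assumed double weighting for repeated bigrams:

```python
import re
from collections import Counter

# Assumed minimal stopword list; a real extractor would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def extract_keyphrases(text, top_n=5):
    """Return the top_n highest-scoring candidate keyphrases in text."""
    # Tokenize and drop stopwords.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    # Score unigrams by frequency; bigrams that repeat score double,
    # so multiword phrases can outrank their component words.
    scores = Counter(words)
    for bigram, count in Counter(zip(words, words[1:])).items():
        if count > 1:
            scores[" ".join(bigram)] = count * 2
    return [phrase for phrase, _ in scores.most_common(top_n)]
```

Run once offline over each corpus entry, this yields the index the Knowledge Manager can match user utterances against at dialog time.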

Page 6

CONCUR

• CxBR Discourse Model
  – Goal Bookkeeper
    • Goal Stack (Branting et al., 2004)
    • Inference Engine
  – Context Topology
    • Agent Goals
    • User Goals
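The Goal Bookkeeper's stack discipline can be illustrated as follows. The slide only names a goal stack (Branting et al., 2004) holding agent and user goals; the operations and their semantics below are assumed for illustration:

```python
class Goal:
    """A discourse goal, owned by either the agent or the user."""

    def __init__(self, name, owner):
        self.name = name    # e.g. "answer_question"
        self.owner = owner  # "agent" or "user"

class GoalStack:
    """Sketch of goal bookkeeping: new goals preempt the current one,
    and earlier goals stay pending until control returns to them."""

    def __init__(self):
        self._stack = []

    def push(self, goal):
        # A new goal (e.g. an incoming user question) becomes active.
        self._stack.append(goal)

    def current(self):
        # The active goal sits on top of the stack.
        return self._stack[-1] if self._stack else None

    def fulfill(self, name):
        # Pop the goal only if it is the active one; a goal buried
        # under a newer one cannot be closed out of order.
        if self._stack and self._stack[-1].name == name:
            return self._stack.pop()
        return None
```

With this discipline, a user question raised mid-greeting is answered first, after which the agent's greeting goal becomes active again.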

Page 7

Detailed CONCUR Block Diagram

Page 8

Evaluation

Evaluation of conversational agents is plagued by subjectivity, so both objective and subjective metrics were gathered.

Qualitative and quantitative metrics:
• Efficiency metrics
  – Total elapsed time
  – Number of user turns
  – Number of system turns
  – Total elapsed time per turn
  – Word Error Rate (WER)
• Quality metrics
  – Out-of-corpus misunderstandings
  – General misunderstandings
  – Errors
  – Total number of user goals
  – Total number of user goals fulfilled
  – Goal completion accuracy
  – Conversational accuracy
• Survey data
  – Naturalness
  – Usefulness
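WER, the efficiency metric reported throughout the results, has a standard definition: word-level edit distance (substitutions + insertions + deletions) divided by the reference length. The study's own scoring tooling is not shown, so this is a generic sketch:

```python
def word_error_rate(reference, hypothesis):
    """Compute WER between a human transcript and an ASR transcript
    using word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that insertions can push WER above 100%, so sustained rates near 60%, as in the speech-based data sets reported later, are plausible for noisy conversational input.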

Page 9

Evaluation Instrument

Nine statements, judged on a 1-to-7 scale based on level of agreement:

Naturalness
• If I told someone the character in this tool was real, they would believe me.
• The character on the screen seemed smart.
• I felt like I was having a conversation with a real person.
• This did not feel like a real interaction with another person.

Usefulness
• I would be more productive if I had this system in my place of work.
• The tool provided me with the information I was looking for.
• I found this to be a useful way to get information.
• This tool made it harder to get information than talking to a person or using a website.
• This does not seem like a reliable way to retrieve information from a database.

Page 10

Data Acquisition

General data set acquisition procedure:
• User asked to interact with agent
  – Natural, information-seeking dialog
  – Voice recording
• User asked to complete survey

Data analysis process:
• Voice transcriptions, ASR transcripts, internal data, and surveys analyzed

Data Set | Dialog Manager | Agent Style     | Domain         | Surveys/Transcripts Collected
1        | AlexDSS        | LifeLike Avatar | NSF I/UCRC     | 30/30
2        | CONCUR         | LifeLike Avatar | NSF I/UCRC     | 30/20
3        | CONCUR         | Chatbot        | NSF I/UCRC     | 0/20
4        | CONCUR         | Chatbot        | Current Events | 20/20

Page 11

Data Acquisition

LifeLike Avatar (ECA) setup:
[Block diagram: the user's voice is captured by a microphone and passed to a speech recognizer; the ASR string goes to the CONCUR Dialog Manager, whose response string drives the agent externals: agent voice through a speaker and agent image on a monitor]

CONCUR Chatbot setup:
[Block diagram: user text input from a keyboard goes to the CONCUR Dialog Manager, which returns agent text output through a Jabber-based agent]

Page 12

Survey Baseline

Agent                        | Naturalness User Rating | Usefulness User Rating
Data Set 1: AlexDSS Avatar   | 4.02 | 4.47
Data Set 2: CONCUR Avatar    | 4.14 | 4.51
Amani (Gandhe et al., 2009)  | 3.09 | 3.24
Hassan (Gandhe et al., 2009) | 3.55 | 4.00

Question 1: What are the expectations of naturalness and usefulness for the conversation agents in this study?

1. Both LifeLike Avatars received user ratings that exceeded those of the other ECA efforts.

Question 2: How differently did users rate the AlexDSS Avatar compared with the CONCUR Avatar?

2. Both avatar-based systems in the speech-based data sets received similar scores for Naturalness and Usefulness.

Page 13

Survey Baseline

Question 3: How differently did users rate the ECA systems compared with the chatbot system?

3. The ECA-based systems were judged similarly, and both were rated better than the chatbot.

Page 14

ASR Resilience

Metric                              | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER (efficiency)                    | 60.85% | 58.48%
Out-of-Corpus Misunderstanding Rate | 0.29%  | 6.37%
Goal Completion Accuracy            | 63.29% | 60.48%

Question 1: Can a speech-based CONCUR Avatar's goal completion accuracy measure up to the AlexDSS Avatar under a high WER?

1. A speech-based CONCUR Avatar's goal completion accuracy measures up to the AlexDSS Avatar's under similarly high WER.

Page 15

ASR Resilience

Metric                              | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER (efficiency)                    | 58.48% | 0.00%
Out-of-Corpus Misunderstanding Rate | 6.37%  | 6.77%
Goal Completion Accuracy            | 60.48% | 68.48%

Question 2: How does improving WER affect CONCUR’s goal completion accuracy?

2. Improved WER does not increase CONCUR’s goal completion accuracy because no new user goals were identified or corrected with the better recognition

Page 16

ASR Resilience

Agent                                   | Average WER | Goal Completion Accuracy
Data Set 2: CONCUR Avatar               | 58.48% | 60.48%
Digital Kyoto (Misu and Kawahara, 2007) | 29.40% | 61.40%

Question 3: Can CONCUR's goal completion accuracy measure up to other conversation agents in spite of high WER?

3. CONCUR's goal completion accuracy is similar to that of the Digital Kyoto system, despite twice the WER.

Page 17

ASR Resilience

Metric                        | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER (efficiency)              | 60.85% | 58.48%
General Misunderstanding Rate | 9.51%  | 14.12%
Error Rate                    | 8.71%  | 21.81%
Conversational Accuracy       | 81.78% | 64.22%

Question 4: Can a speech-based CONCUR Avatar's conversational accuracy measure up to the AlexDSS Avatar under a high WER?

4. A speech-based CONCUR Avatar's conversational accuracy does not measure up to the AlexDSS Avatar's under similarly high WER. This can be attributed to general misunderstandings and errors caused by misheard user requests, and to specific question-answering requests that are uncommon in menu-driven discourse models.

Page 18

ASR Resilience

Metric                        | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER (efficiency)              | 58.48% | 0.00%
General Misunderstanding Rate | 14.12% | 7.48%
Error Rate                    | 21.81% | 16.68%
Goal Completion Accuracy      | 60.48% | 68.48%
Conversational Accuracy       | 64.22% | 75.31%

Question 5: How does improving WER affect CONCUR’s conversational accuracy?

5. Improved WER increases CONCUR’s conversational accuracy by decreasing general misunderstandings

Page 19

ASR Resilience

Agent                         | Average WER | Conversational Accuracy
Data Set 2: CONCUR Avatar     | 58.48% | 64.22%
TARA (Schumaker et al., 2007) | 0.00%  | 54.00%

Question 6: Can CONCUR's conversational accuracy measure up to other conversation agents in spite of high WER?

6. CONCUR's conversational accuracy surpasses that of the TARA system, which is text-based.

Page 20

Domain-Independence

Metric                              | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
Out-of-Corpus Misunderstanding Rate | 6.15%  | 6.77%  | 17.45%
Goal Completion Accuracy            | 60.48% | 68.48% | 48.08%

Question 1: Can CONCUR maintain goal completion accuracy after changing to a less specific domain corpus?

1. CONCUR's goal completion accuracy does not remain consistent after a change to a generalized domain corpus. Changing domain expertise may increase out-of-corpus requests, which decreases goal completion accuracy.

Page 21

Domain-Independence

Metric                        | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
General Misunderstanding Rate | 14.49% | 7.48%  | 0.00%
Error Rate                    | 21.81% | 16.68% | 16.46%
Conversational Accuracy       | 64.22% | 75.34% | 83.54%

Question 2: Can CONCUR maintain conversational accuracy after changing to a less specific domain corpus?

2. After changing to a general domain corpus, CONCUR is capable of maintaining its conversational accuracy

Page 22

Domain-Independence

Dialog System                              | Method                | Turnover Time
CONCUR                                     | Corpus-based          | 3 days
Marve (Babu et al., 2006)                  | Wizard-of-Oz          | 18 days
Amani (Gandhe et al., 2009)                | Question-answer pairs | Weeks
AlexDSS                                    | Expert system         | Weeks
Sergeant Blackwell (Robinson et al., 2008) | Wizard-of-Oz          | 7 months
Sergeant Star (Artstein et al., 2009)      | Question-answer pairs | 1 year
HMIHY (Béchet et al., 2004)                | Hand-modeled          | 2 years
Hassan (Gandhe et al., 2009)               | Question-answer pairs | Years

Question 3: Can CONCUR provide a quick method of providing agent knowledge?

3. CONCUR's Knowledge Manager enables a shorter knowledge-development turnover time than other conversation agent knowledge management systems.

Page 23

Conclusions

• Building Training Agents
  – Agent Design
    • ECAs preferred over the chatbot format
  – ASR
    • ASR improvements lead to better conversation-level processing
    • High ASR error rates are not necessarily an obstacle to ECA design
  – Knowledge Management
    • Tailoring domain expertise to an intended audience is more effective than a generalized corpus
    • Separating domain knowledge from agent discourse helps maintain conversational accuracy and speeds up agent development times