Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow, University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
Identifying Corrections
Most "Reasoning" Approaches
Correction => Violates Task or Belief Constraint
Requires Tight Task and Belief Model
Often Requires Accurate Recognition
This Approach
Uses Acoustic or Lexical Information
Content- and Context-Independent
Accomplishments
Corrections vs Original Inputs
Significant Differences: Duration, Pause, Pitch
Corrections vs Recognizer Models
Contrasts: Phonology and Duration
Correction Recognition
Decision Tree Classifier: 65-77% accuracy
Human Baseline: ~80%
Why Corrections?
Recognizer Error Rates: ~25-40%
Corrections carry the REAL meaning of the utterance: the user's intent
Corrections are misrecognized 2.5x as often as other inputs
Hard to Correct => Poor Quality System
Why it's Necessary
Error Repair Requires Detection
Errors can be very difficult to detect, e.g. misrecognitions
Detection Focuses Repair Efforts
Corrections Decrease Recognition Accuracy
Adaptation Requires Identification
Why is it Hard?
Recognition Failures and Errors
Repetition <> Correction
500 Strings => 6700 Instances (80%)
Speech Recognition Technology: Variation Undesirable, Suppressed
Roadmap
Data Collection and Description
SpeechActs System & Field Trial
Characterizing Corrections
Original-Repeat Pair Data Analysis
Acoustic and Phonological Measures & Results
Recognizing Corrections
Conclusions and Future Work
SpeechActs System
Speech-Only System over the Telephone (Yankelovich, Levow & Marx 1995)
Access to Common Desktop Applications Email, Calendar, Weather, Stock Quotes
BBN's Hark Speech Recognition, Centigram TruVoice Speech Synthesis
In-house: Natural Language Analysis, Back-end Applications, Dialog Manager
System Data Overview
Approximately 60 hours of interactions
Digitized at 8kHz, 8-bit mu-law encoding
18 subjects: 14 novices, 4 experts, single shots
7529 user utterances, 1961 errors
P(error | correct) = 18%; P(error | error) = 44%
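The error statistics above can be read as a two-state Markov chain over utterance outcomes. As a quick sanity check (a sketch for illustration, not part of the study), the chain's stationary error rate can be compared against the observed rate of 1961/7529, and the expected length of an error spiral derived from P(error | error):

```python
# Two-state Markov model of recognition outcomes (assumed reading of the
# reported transition probabilities).
p_err_given_correct = 0.18  # P(error | previous utterance correct)
p_err_given_error = 0.44    # P(error | previous utterance in error)
p_corr_given_error = 1 - p_err_given_error

# Stationary error probability: pi_e = P(e|c) / (P(e|c) + P(c|e))
stationary_error = p_err_given_correct / (p_err_given_correct + p_corr_given_error)
observed_error = 1961 / 7529

print(f"stationary error rate: {stationary_error:.3f}")  # 0.243
print(f"observed error rate:   {observed_error:.3f}")    # 0.260

# Expected number of consecutive errors once an error occurs
expected_run = 1 / (1 - p_err_given_error)
print(f"expected consecutive errors: {expected_run:.2f}")  # 1.79
```

The stationary rate (~24%) lands close to the observed ~26%, consistent with the "~25-40% error rates" cited earlier.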
System: Recognition Error Types
Rejection Errors - Below Recognition Level
U: Switch to Weather
S (Heard): <nothing>
S (Said): Huh?
Misrecognition Errors - Substitution in Text
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have a talk at Chicago.
1250 Rejections (~2/3); 706 Misrecognitions (~1/3)
Analysis: Data
300 Original Input-Repeat Correction Pairs
Lexically Matched, Same Speaker
Example:
S (Said): Please say mail, calendar, weather.
U: Switch to Weather. (Original)
S (Said): Huh?
U: Switch to Weather. (Repeat)
Analysis: Duration
Automatic Forced Alignment, Hand-Edited
Total: Speech Onset to End of Utterance
Speech: Total - Internal Silence
Contrasts: Original Input vs Repeat Correction
Total: Increases 12.5% on average
Speech: Increases 9% on average
Analysis: Pause
Utterance-Internal Silence > 10ms
Not Preceding Unvoiced Stops (t) or Affricates (ch)
Contrasts: Original Input vs Repeat Correction
Absolute Pause Duration: 46% Increase
Ratio of Silence to Total Duration: 58% Increase
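The duration and pause measures from the last two slides could be computed from a word-level forced alignment along these lines. This is a minimal sketch: the (label, start, end) tuples and the "<sil>" silence label are assumed conventions, and the exclusion of pauses preceding unvoiced stops and affricates is omitted for brevity:

```python
# Sketch: duration and pause measures from a forced alignment given as
# (label, start_s, end_s) tuples, with "<sil>" marking internal silence.

def duration_measures(alignment, min_pause=0.010):
    """Return total, speech, and pause durations plus the silence ratio."""
    total = alignment[-1][2] - alignment[0][1]  # speech onset to utterance end
    # Utterance-internal silences longer than 10 ms count as pauses
    pause = sum(end - start
                for label, start, end in alignment[1:-1]
                if label == "<sil>" and (end - start) > min_pause)
    speech = total - pause  # Speech = Total - Internal Silence
    return {"total": total, "speech": speech,
            "pause": pause, "pause_ratio": pause / total}

# Hypothetical alignment for "switch to weather" (times in seconds)
align = [("switch", 0.00, 0.45), ("<sil>", 0.45, 0.50),
         ("to", 0.50, 0.62), ("weather", 0.62, 1.10)]
m = duration_measures(align)
print(m)  # total ≈ 1.10, speech ≈ 1.05, pause ≈ 0.05
```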
Pitch Tracks
Analysis: Pitch I
ESPS/Waves+ Pitch Tracker, Hand-Edited
Normalized Per-Subject: (Value - Subject Mean) / (Subject Std Dev)
Pitch Maximum, Minimum, Range
Whole Utterance & Last Word
Contrasts: Original Input vs Repeat Correction
Significant Decrease in Pitch Minimum: Whole Utterance & Last Word
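The per-subject normalization above is a standard z-score. A minimal sketch, with illustrative pitch values and subject statistics (not data from the study):

```python
# Per-subject pitch normalization: (Value - Subject Mean) / (Subject Std Dev),
# followed by extraction of pitch maximum, minimum, and range.

def normalize_pitch(values, subj_mean, subj_std):
    """Z-score each F0 value against the subject's own statistics."""
    return [(v - subj_mean) / subj_std for v in values]

def pitch_features(norm_values):
    """Pitch max, min, and range over an utterance (or its last word)."""
    return {"max": max(norm_values),
            "min": min(norm_values),
            "range": max(norm_values) - min(norm_values)}

# Hypothetical F0 track (Hz) and subject statistics
track = [180.0, 210.0, 195.0, 160.0]
norm = normalize_pitch(track, subj_mean=190.0, subj_std=20.0)
print(pitch_features(norm))  # max = 1.0, min = -1.5, range = 2.5
```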
Analysis: Pitch II
Analysis: Overview
Significant Differences: Original vs Correction
Duration & Pause: Significant Increases
Pitch: Significant Decrease in Pitch Minimum; Increase in Final Falling Contours
Conversational-to-Clear Speech Shift
Analysis: Phonology
Reduced Form => Citation Form
Schwa to unreduced vowel (~20), e.g. Switch t' mail => Switch to mail
Unreleased or flapped 't' => Released 't' (~50), e.g. Read message tweny => Read message twenty
Citation Form => Hyperclear Form
Extreme lengthening, calling intonation (~20), e.g. Goodbye => Goodba-aye
Durational Model Contrasts
[Chart: departure from SR model mean (in std dev) by number of words, non-final vs final position; compared to SR model (Chung 1995)]
Phrase-final lengthening: words in final position significantly longer than non-final words and than the model prediction
All words significantly longer in correction utterances
Analysis: Overview II
Original vs Correction & Recognizer Model
Phonology: Reduced Form => Citation Form => Hyperclear Form
Conversational to (Hyper)Clear Shift
Duration: Contrast between Final and Non-final Words
Departure from ASR Model: Increased for Corrections, especially Final Words
Automatic Recognition of Spoken Corrections
Machine learning classifier: Decision Trees
Trained on labeled examples
Features: Duration, Pause, Pitch
Evaluation:
Overall: 65% accuracy (incl. text features); key features: absolute and normalized duration
Misrecognitions: 77% accuracy (incl. text features); key features: absolute and normalized duration, pitch
65% accuracy with acoustic features only
Approaches human baseline: 79.4%
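Since normalized duration is the first split in every tree learned in this work, the core mechanism can be illustrated with a one-split decision tree (a "stump") over a single duration feature. This is a toy sketch with invented data, not the study's classifier over 38 features:

```python
# One-split decision tree over a single feature: try each candidate
# threshold and keep the one with the best training accuracy. This
# mirrors a decision tree's first split on normalized duration.

def best_stump(samples):
    """samples: list of (feature_value, label); label 1 = correction."""
    best = None
    values = sorted(v for v, _ in samples)
    # Candidate thresholds: midpoints between adjacent feature values
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in thresholds:
        # Predict "correction" when the duration exceeds the threshold
        correct = sum((v > t) == bool(y) for v, y in samples)
        acc = correct / len(samples)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best  # (threshold, training accuracy)

# Toy normalized durations: corrections (label 1) tend to be longer
data = [(-0.8, 0), (-0.3, 0), (0.1, 0), (0.2, 1), (0.9, 1), (1.4, 1)]
threshold, accuracy = best_stump(data)
print(threshold, accuracy)
```

A real tree learner (e.g. C4.5, as typically used in this era) recursively applies such single-feature splits, which is why the resulting decision boundaries are rectangular and feature combinations are never tested directly.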
Accomplishments
Contrasts between Originals and Corrections
Significant Differences in Duration, Pause, Pitch
Conversational-to-Clear Speech Shifts
Shifts away from Recognizer Models
Corrections Recognized at 65-77% Accuracy: Near-Human Levels
Future Work
Modify ASR Duration Model for Corrections
Reflect Phonological and Durational Change
Identify Locus of Correction for Misrecognitions
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have a talk at Chicago.
U: Switch to WEATHER!
Preliminary tests: 26/28 Corrected Words Detected, 2 False Alarms
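Expressed as standard detection metrics, the preliminary result above (26 of 28 corrected words found, with 2 false alarms) works out to matching precision and recall; a quick check:

```python
# Recall and precision implied by the preliminary locus-detection result
detected, actual, false_alarms = 26, 28, 2
recall = detected / actual                        # fraction of corrected words found
precision = detected / (detected + false_alarms)  # fraction of detections that were right
print(f"recall={recall:.3f} precision={precision:.3f}")  # both ~0.929
```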
Future Work
Identify and Exploit Cues to Discourse and Information Structure
Incorporate Prosodic Features into Model of Spoken Dialogue
Exploit Text and Acoustic Features for Segmentation of Broadcast Audio and Video
Necessary first phase for information retrieval
Assess language independence
First phase: Segmentation of Mandarin and Cantonese Broadcast News (in collaboration with CUHK)
Classification of Spoken Corrections
Decision Trees
+Intelligible, Robust to Irrelevant Attributes
?Rectangular Decision Boundaries; Don't Combine Features
Features (38 total, 15 in best trees)
Duration, Pause, Pitch, and Amplitude
Normalized and Absolute
Training and Testing
50% Original Inputs, 50% Repeat Corrections
7-way Cross-Validation
Recognizer: Results (Overall)
Tree Size: 57 (unpruned), 37 (pruned) Minimum of 10 nodes per branch required
First Split: Normalized Duration (All Trees) Most Important Features:
Normalized & Absolute Duration, Speaking Rate 65% Accuracy - Null Baseline-50%
Example Tree
Classifier Results: Misrecognitions
Most Important Features: Absolute and Normalized Duration; Pitch Minimum and Pitch Slope
77% accuracy (with text features); 65% (acoustic features only)
Null baseline: 50%
Human baseline: 79.4% (Hauptmann & Rudnicky 1987)
Misrecognition Classifier
Background & Related Work
Detecting and Preventing Miscommunication (Smith & Gordon 96; Traum & Dillenbourg 96)
Identifying Discourse Structure in Speech
Prosody: (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
Cue words + prosody: (Taylor et al 96; Hirschberg & Litman 93)
Self-repairs: (Heeman & Allen 94; Bear et al 92)
Acoustic-only: (Nakatani & Hirschberg 94; Shriberg et al 97)
Speaking Modes: (Ostendorf et al 96; Daly & Zue 96)
Spoken Corrections:
Human baseline (Rudnicky & Hauptmann 87)
(Oviatt et al 96, 98; Levow 98, 99; Hirschberg et al 99, 00)
Other languages: (Bell & Gustafson 99; Pirker et al 99; Fischer 99)
Learning Method Options
(K)-Nearest Neighbor
-Needs Commensurable Attribute Values
-Sensitive to Irrelevant Attributes
-Labeling Speed, Training Set Size
Neural Nets
-Hard to Interpret
-Can Require More Computation & Training Data
+Fast, Accurate when Trained
Decision Trees <=
+Intelligible, Robust to Irrelevant Attributes
+Fast, Compact when Trained
?Rectangular Decision Boundaries; Don't Test Feature Combinations