Understanding Spoken Corrections in Human-Computer Dialogue
Gina-Anne Levow, University of Chicago
http://www.cs.uchicago.edu/~levow
MAICS, April 1, 2006
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett-Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
Identifying Corrections
Most "Reasoning" Approaches
Correction => Violates Task or Belief Constraint
Requires Tight Task and Belief Model
Often Requires Accurate Recognition
This Approach
Uses Acoustic or Lexical Information
Content- and Context-Independent
Accomplishments
Corrections vs Original Inputs
Significant Differences: Duration, Pause, Pitch
Corrections vs Recognizer Models
Contrasts: Phonology and Duration
Correction Recognition
Decision Tree Classifier: 65-77% accuracy
Human Baseline: ~80%
Why Corrections?
Recognizer Error Rates: ~25-40%
Corrections carry the REAL meaning of the utterance: the user's intent
Corrections are misrecognized 2.5x as often as other inputs
Hard to Correct => Poor Quality System
Why it's Necessary
Error Repair Requires Detection
Errors can be very difficult to detect, e.g. misrecognitions
Detection Focuses Repair Efforts
Corrections Decrease Recognition Accuracy
Adaptation Requires Identification
Why is it Hard?
Recognition Failures and Errors
Repetition <> Correction
500 Strings => 6700 Instances (80%)
Speech Recognition Technology: Variation Undesirable, Suppressed
Roadmap
Data Collection and Description
SpeechActs System & Field Trial
Characterizing Corrections
Original-Repeat Pair Data Analysis
Acoustic and Phonological Measures & Results
Recognizing Corrections
Conclusions and Future Work
SpeechActs System
Speech-Only System over the Telephone (Yankelovich, Levow & Marx 1995)
Access to Common Desktop Applications Email, Calendar, Weather, Stock Quotes
BBN's Hark Speech Recognition, Centigram TruVoice Speech Synthesis
In-house: Natural Language Analysis, Back-end Applications, Dialog Manager
System Data Overview
Approximately 60 hours of interactions
Digitized at 8kHz, 8-bit mu-law encoding
18 subjects: 14 novices, 4 experts, single shots
7529 user utterances, 1961 errors
P(error | correct) = 18%; P(error | error) = 44%
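The error statistics above can be read as a two-state Markov chain over utterance outcomes. As a quick sanity check (a sketch for illustration, not part of the study), the chain's stationary error rate can be compared against the observed rate of 1961/7529, and the expected length of an error spiral derived from P(error | error):

```python
# Two-state Markov model of recognition outcomes (assumed reading of the
# reported transition probabilities).
p_err_given_correct = 0.18  # P(error | previous utterance correct)
p_err_given_error = 0.44    # P(error | previous utterance in error)
p_corr_given_error = 1 - p_err_given_error

# Stationary error probability: pi_e = P(e|c) / (P(e|c) + P(c|e))
stationary_error = p_err_given_correct / (p_err_given_correct + p_corr_given_error)
observed_error = 1961 / 7529

print(f"stationary error rate: {stationary_error:.3f}")  # 0.243
print(f"observed error rate:   {observed_error:.3f}")    # 0.260

# Expected number of consecutive errors once an error occurs
expected_run = 1 / (1 - p_err_given_error)
print(f"expected consecutive errors: {expected_run:.2f}")  # 1.79
```

The stationary rate (~24%) lands close to the observed ~26%, consistent with the "~25-40% error rates" cited earlier.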
System: Recognition Error Types
Rejection Errors - Below Recognition Level
U: Switch to Weather
S (Heard): <nothing>
S (Said): Huh?
Misrecognition Errors - Substitution in Text
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have a talk at Chicago.
1250 Rejections (~2/3); 706 Misrecognitions (~1/3)
Analysis: Data
300 Original Input-Repeat Correction Pairs
Lexically Matched, Same Speaker
Example:
S (Said): Please say mail, calendar, weather.
U: Switch to Weather. (Original)
S (Said): Huh?
U: Switch to Weather. (Repeat)
Analysis: Duration
Automatic Forced Alignment, Hand-Edited
Total: Speech Onset to End of Utterance
Speech: Total - Internal Silence
Contrasts: Original Input vs Repeat Correction
Total: Increases 12.5% on average
Speech: Increases 9% on average
Analysis: Pause
Utterance-Internal Silence > 10ms
Not Preceding Unvoiced Stops (t) or Affricates (ch)
Contrasts: Original Input vs Repeat Correction
Absolute Pause Duration: 46% Increase
Ratio of Silence to Total Duration: 58% Increase
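The duration and pause measures from the last two slides could be computed from a word-level forced alignment along these lines. This is a minimal sketch: the (label, start, end) tuples and the "<sil>" silence label are assumed conventions, and the exclusion of pauses preceding unvoiced stops and affricates is omitted for brevity:

```python
# Sketch: duration and pause measures from a forced alignment given as
# (label, start_s, end_s) tuples, with "<sil>" marking internal silence.

def duration_measures(alignment, min_pause=0.010):
    """Return total, speech, and pause durations plus the silence ratio."""
    total = alignment[-1][2] - alignment[0][1]  # speech onset to utterance end
    # Utterance-internal silences longer than 10 ms count as pauses
    pause = sum(end - start
                for label, start, end in alignment[1:-1]
                if label == "<sil>" and (end - start) > min_pause)
    speech = total - pause  # Speech = Total - Internal Silence
    return {"total": total, "speech": speech,
            "pause": pause, "pause_ratio": pause / total}

# Hypothetical alignment for "switch to weather" (times in seconds)
align = [("switch", 0.00, 0.45), ("<sil>", 0.45, 0.50),
         ("to", 0.50, 0.62), ("weather", 0.62, 1.10)]
m = duration_measures(align)
print(m)  # total ≈ 1.10, speech ≈ 1.05, pause ≈ 0.05
```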
Pitch Tracks
Analysis: Pitch I
ESPS/Waves+ Pitch Tracker, Hand-Edited
Normalized Per-Subject: (Value - Subject Mean) / (Subject Std Dev)
Pitch Maximum, Minimum, Range
Whole Utterance & Last Word
Contrasts: Original Input vs Repeat Correction
Significant Decrease in Pitch Minimum: Whole Utterance & Last Word
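The per-subject normalization above is a standard z-score. A minimal sketch, with illustrative pitch values and subject statistics (not data from the study):

```python
# Per-subject pitch normalization: (Value - Subject Mean) / (Subject Std Dev),
# followed by extraction of pitch maximum, minimum, and range.

def normalize_pitch(values, subj_mean, subj_std):
    """Z-score each F0 value against the subject's own statistics."""
    return [(v - subj_mean) / subj_std for v in values]

def pitch_features(norm_values):
    """Pitch max, min, and range over an utterance (or its last word)."""
    return {"max": max(norm_values),
            "min": min(norm_values),
            "range": max(norm_values) - min(norm_values)}

# Hypothetical F0 track (Hz) and subject statistics
track = [180.0, 210.0, 195.0, 160.0]
norm = normalize_pitch(track, subj_mean=190.0, subj_std=20.0)
print(pitch_features(norm))  # max = 1.0, min = -1.5, range = 2.5
```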
Analysis: Pitch II
Analysis: Overview
Significant Differences: Original vs Correction
Duration & Pause: Significant Increases
Pitch: Significant Decrease in Pitch Minimum; Increase in Final Falling Contours
Conversational-to-Clear Speech Shift
Analysis: Phonology
Reduced Form => Citation Form
Schwa to unreduced vowel (~20), e.g. Switch t' mail => Switch to mail
Unreleased or flapped 't' => Released 't' (~50), e.g. Read message tweny => Read message twenty
Citation Form => Hyperclear Form
Extreme lengthening, calling intonation (~20), e.g. Goodbye => Goodba-aye
Durational Model Contrasts
[Chart: departure from SR model mean (in std dev) by number of words, non-final vs final position; compared to SR model (Chung 1995)]
Phrase-final lengthening: words in final position significantly longer than non-final words and than the model prediction
All words significantly longer in correction utterances
Analysis: Overview II
Original vs Correction & Recognizer Model
Phonology: Reduced Form => Citation Form => Hyperclear Form
Conversational to (Hyper)Clear Shift
Duration: Contrast between Final and Non-final Words
Departure from ASR Model: Increased for Corrections, especially Final Words
Automatic Recognition of Spoken Corrections
Machine learning classifier: Decision Trees
Trained on labeled examples
Features: Duration, Pause, Pitch
Evaluation:
Overall: 65% accuracy (incl. text features); key features: absolute and normalized duration
Misrecognitions: 77% accuracy (incl. text features); key features: absolute and normalized duration, pitch
65% accuracy with acoustic features only
Approaches human baseline: 79.4%
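Since normalized duration is the first split in every tree learned in this work, the core mechanism can be illustrated with a one-split decision tree (a "stump") over a single duration feature. This is a toy sketch with invented data, not the study's classifier over 38 features:

```python
# One-split decision tree over a single feature: try each candidate
# threshold and keep the one with the best training accuracy. This
# mirrors a decision tree's first split on normalized duration.

def best_stump(samples):
    """samples: list of (feature_value, label); label 1 = correction."""
    best = None
    values = sorted(v for v, _ in samples)
    # Candidate thresholds: midpoints between adjacent feature values
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for t in thresholds:
        # Predict "correction" when the duration exceeds the threshold
        correct = sum((v > t) == bool(y) for v, y in samples)
        acc = correct / len(samples)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best  # (threshold, training accuracy)

# Toy normalized durations: corrections (label 1) tend to be longer
data = [(-0.8, 0), (-0.3, 0), (0.1, 0), (0.2, 1), (0.9, 1), (1.4, 1)]
threshold, accuracy = best_stump(data)
print(threshold, accuracy)
```

A real tree learner (e.g. C4.5, as typically used in this era) recursively applies such single-feature splits, which is why the resulting decision boundaries are rectangular and feature combinations are never tested directly.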
Accomplishments
Contrasts between Originals and Corrections
Significant Differences in Duration, Pause, Pitch
Conversational-to-Clear Speech Shifts
Shifts away from Recognizer Models
Corrections Recognized at 65-77% Accuracy: Near-Human Levels
Future Work
Modify ASR Duration Model for Corrections
Reflect Phonological and Durational Change
Identify Locus of Correction for Misrecognitions
U: Switch to Weather
S (Heard): Switch to Calendar
S (Said): On Friday, May 4, you have a talk at Chicago.
U: Switch to WEATHER!
Preliminary tests: 26/28 Corrected Words Detected, 2 False Alarms
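Expressed as standard detection metrics, the preliminary result above (26 of 28 corrected words found, with 2 false alarms) works out to matching precision and recall; a quick check:

```python
# Recall and precision implied by the preliminary locus-detection result
detected, actual, false_alarms = 26, 28, 2
recall = detected / actual                        # fraction of corrected words found
precision = detected / (detected + false_alarms)  # fraction of detections that were right
print(f"recall={recall:.3f} precision={precision:.3f}")  # both ~0.929
```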
Future Work
Identify and Exploit Cues to Discourse and Information Structure
Incorporate Prosodic Features into Model of Spoken Dialogue
Exploit Text and Acoustic Features for Segmentation of Broadcast Audio and Video
Necessary first phase for information retrieval
Assess language independence
First phase: Segmentation of Mandarin and Cantonese Broadcast News (in collaboration with CUHK)
Classification of Spoken Corrections
Decision Trees
+Intelligible, Robust to Irrelevant Attributes
?Rectangular Decision Boundaries; Don't Combine Features
Features (38 total, 15 in best trees)
Duration, Pause, Pitch, and Amplitude
Normalized and Absolute
Training and Testing
50% Original Inputs, 50% Repeat Corrections
7-way Cross-Validation
Recognizer: Results (Overall)
Tree Size: 57 (unpruned), 37 (pruned) Minimum of 10 nodes per branch required
First Split: Normalized Duration (All Trees) Most Important Features:
Normalized & Absolute Duration, Speaking Rate 65% Accuracy - Null Baseline-50%
Example Tree
Classifier Results: Misrecognitions
Most Important Features: Absolute and Normalized Duration; Pitch Minimum and Pitch Slope
77% accuracy (with text features); 65% (acoustic features only)
Null baseline: 50%
Human baseline: 79.4% (Hauptmann & Rudnicky 1987)
Misrecognition Classifier
Background & Related Work
Detecting and Preventing Miscommunication (Smith & Gordon 96; Traum & Dillenbourg 96)
Identifying Discourse Structure in Speech
Prosody: (Grosz & Hirschberg 92; Swerts & Ostendorf 95)
Cue words + prosody: (Taylor et al 96; Hirschberg & Litman 93)
Self-repairs: (Heeman & Allen 94; Bear et al 92)
Acoustic-only: (Nakatani & Hirschberg 94; Shriberg et al 97)
Speaking Modes: (Ostendorf et al 96; Daly & Zue 96)
Spoken Corrections:
Human baseline (Rudnicky & Hauptmann 87)
(Oviatt et al 96, 98; Levow 98, 99; Hirschberg et al 99, 00)
Other languages: (Bell & Gustafson 99; Pirker et al 99; Fischer 99)
Learning Method Options
(K)-Nearest Neighbor
-Needs Commensurable Attribute Values
-Sensitive to Irrelevant Attributes
-Labeling Speed, Training Set Size
Neural Nets
-Hard to Interpret
-Can Require More Computation & Training Data
+Fast, Accurate when Trained
Decision Trees <=
+Intelligible, Robust to Irrelevant Attributes
+Fast, Compact when Trained
?Rectangular Decision Boundaries; Don't Test Feature Combinations