ECE 259—Lecture 19
Techniques for Speech and Natural Language Recognition
Speech Recognition—2001
The Speech Dialog Circle
[Diagram: ASR converts speech to the words spoken ("I dialed a wrong number"); SLU maps words to meaning ("Billing credit"); DM decides what is next ("Determine correct number"), drawing on data and rules; SLG generates the words of the voice reply to the customer; TTS synthesizes them as speech ("What number did you want to call?").]
Automatic Speech Recognition
• Goal: Accurately and efficiently convert a speech signal into a text message, independent of the device, speaker, or environment.
• Applications: Automation of complex operator-based tasks, e.g., customer care, dictation, form-filling applications, provisioning of new services, customer help lines, e-commerce, etc.
Milestones in Speech and Multimodal Technology Research
[Timeline, 1962–2002:]
• Small vocabulary, acoustic-phonetics based: isolated words; filter-bank analysis, time normalization, dynamic programming.
• Medium vocabulary, template based: isolated words, connected digits, continuous speech; pattern recognition, LPC analysis, clustering algorithms, level building.
• Large vocabulary, statistical based: connected words, continuous speech; hidden Markov models, stochastic language modeling.
• Large vocabulary; syntax, semantics: continuous speech, speech understanding; stochastic language understanding, finite-state machines, statistical learning.
• Very large vocabulary; semantics, multimodal dialog, TTS: spoken dialog, multiple modalities; concatenative synthesis, machine learning, mixed-initiative dialog.
Basic ASR Formulation
The basic equation of Bayes-rule-based speech recognition is

  Ŵ = argmax_W P(W|X)
    = argmax_W P(W) P(X|W) / P(X)
    = argmax_W P(W) P(X|W)

where X = X1, X2, …, XN is the acoustic observation (feature vector) sequence, W = w1 w2 … wM is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.
Speech Recognition Process
[Diagram: input speech s(n) → Feature Analysis (spectral analysis) → feature vectors Xn → Pattern Classification (decoding, search), which draws on the Acoustic Model (HMM), Word Lexicon, and Language Model (N-gram) → Confidence Scoring (utterance verification) → recognized words Ŵ, e.g., "Hello World" with word confidences (0.9) (0.8); ASR shown within the dialog circle (SLU, DM, SLG, TTS).]
Speech Recognition Processes
• Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
  – Text training data set => word lexicon, word grammar (language model), task grammar
  – Speech training data set => acoustic models
• Training algorithm => build models from training set of text and speech
• Testing algorithm => evaluate performance from testing set of speech
• Evaluate performance
  – Speech testing data set
Feature Extraction
Goal: Extract robust features (information) from the speech signal that are relevant for ASR.
Method: Spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
Result: Signal compression, where for each window of speech samples roughly 30 cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Which feature set to use for recognition—cepstral features or those from a high-dimensionality space?
[Block diagram: Feature Extraction → Pattern Classification, supported by the Acoustic Model, Language Model and Word Lexicon, followed by Confidence Scoring.]
What Features to Use?
Short-time spectral analysis:
• Acoustic features: cepstrum (LPC, filterbank, wavelets); formant frequencies, pitch, prosody; zero-crossing rate, energy
• Acoustic-phonetic features: manner of articulation (e.g., stop, nasal, voiced); place of articulation (labial, dental, velar)
• Articulatory features: tongue position, jaw, lips, velum
• Auditory features: ensemble interval histogram (EIH), synchrony
Temporal analysis: approximation of the velocity and acceleration, typically through first- and second-order central differences.
Feature Extraction Process
[Pipeline: s(t) → Sampling and Quantization → s(n) → Noise Removal / Normalization → Segmentation (blocking; parameters M, N) → s_m(n) → Windowing W(n) → Preemphasis → Spectral Analysis (also yielding pitch, formants, energy, zero-crossing) → Cepstral Analysis → c_m(l) → Filtering → Temporal Derivative (delta cepstrum, delta² cepstrum) → Equalization (bias removal or normalization) → final features.]
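The front half of this pipeline (blocking, windowing, spectral analysis, log compression) can be sketched as follows; the frame size, hop, sampling rate, and the crude bin-averaging used in place of a real mel filterbank are all assumptions for illustration:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # 25 ms frames with a 10 ms hop, assuming a 16 kHz sampling rate
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def log_spectral_features(x, n_fft=512, n_bins=30):
    frames = frame_signal(x) * np.hamming(400)       # windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # short-time power spectrum
    # crude stand-in for a filterbank: average contiguous groups of FFT bins
    edges = np.linspace(0, power.shape[1], n_bins + 1, dtype=int)
    feats = np.stack([power[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    return np.log(feats + 1e-10)                     # log compression
```

One second of 16 kHz audio yields 98 frames of 30 log-spectral features each, matching the "roughly 30 features per window" figure above.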
[Key challenges: Robustness, Unlimited Vocabulary, Rejection.]
Methods for Robust Speech Recognition
Problem: A mismatch in the speech signal between the training phase and testing phase can result in performance degradation.
Methods: Traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
[Diagram: Training — signal → features → model; Testing — signal (enhancement) → features (normalization) → model (adaptation).]
Perception approach: Extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
Noise and Channel Distortion
[Diagram: speech s(t) plus noise n(t) passes through a channel h(t), yielding distorted speech y(t); Fourier-transform spectra are shown with SNRs ranging from roughly 50 dB down to 5 dB.]

  y(t) = [s(t) + n(t)] * h(t)
Speaker Variations
• One source of acoustic variation among speakers is their different vocal tract lengths (which vary from roughly 15 to 20 cm).
• An increase in vocal tract length is inversely proportional to the frequency content of the resulting acoustic signal.
• By warping the frequency content of a signal, one can effectively normalize for variations in vocal tract length.

Maximum likelihood speaker normalization chooses the warp factor α that maximizes the likelihood of the warped features:

  α̂ = argmax_α Pr(X_α | W)
Acoustic Model
Goal: Map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
Hidden Markov Model (HMM): A statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone; HMMs are also assigned to model extraneous events.
Advantages: A powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: Understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM—are there data-driven models that work better for some or all vocabularies?
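To make the HMM idea concrete, here is a minimal log-domain forward algorithm that scores an observation sequence against one HMM; the tiny uniform two-state model in the usage note is invented purely to check the sum-over-paths arithmetic:

```python
import math

def logsumexp(xs):
    # numerically stable log(sum(exp(x))) for log-domain probabilities
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

def forward(obs_logp, log_A, log_pi):
    """Log-likelihood of an observation sequence under an HMM.

    obs_logp[t][j] = log b_j(o_t), log_A[i][j] = log transition prob i->j,
    log_pi[j] = log initial prob of state j.
    """
    n = len(log_pi)
    alpha = [log_pi[j] + obs_logp[0][j] for j in range(n)]
    for t in range(1, len(obs_logp)):
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)]) + obs_logp[t][j]
                 for j in range(n)]
    return logsumexp(alpha)
```

With two states and all probabilities 0.5 over three frames, the 8 state paths each contribute 0.5^6, so the total likelihood is 8/64 = 0.125.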
Word Lexicon
Goal: Map legal phone sequences into words according to phonotactic rules. For example,

  David: /d/ /ey/ /v/ /ih/ /d/

Multiple pronunciations: Several words may have multiple pronunciations. For example,

  Data: /d/ /ae/ /t/ /ax/
  Data: /d/ /ey/ /t/ /ax/

Challenges: How do you generate a word lexicon automatically? How do you add new variant dialects and word pronunciations?
The Lexicon
• one entry per word, with at least 100K words needed for dictation (find: /f/ /ay/ /n/ /d/)
• either done by hand or with letter-to-sound (LTS) rules; rules can be automatically trained with decision trees (CART), giving less than 8% errors, but proper nouns are hard!
• multiple pronunciations (ta-mey-to vs. to-mah-to)
[Figure: probabilistic pronunciation network for "tomato" over the phones /t/, /ax/, /ow/, /m/, /ey/, /aa/, /dx/, with arc probabilities such as 0.2/0.8, 0.5/0.5 and 0.3/0.7 on the alternative paths.]
Language Model
Goal: Model "acceptable" spoken phrases, constrained by task syntax.
Rule-based: Deterministic grammars that are knowledge driven. For example,
  flying from $city to $city on $date
Statistical: Compute estimates of word probabilities (N-gram, class-based, CFG). For example,
  flying from Newark to Boston tomorrow
Challenges: How do you build a language model rapidly for a new task?
Formal Grammars
[Parse tree for "Mary loves that person": S → NP VP; NP → NAME → Mary; VP → V NP; V → loves; NP → ADJ NP1 → that person.]

Rewrite rules:
 1. S → NP VP
 2. VP → V NP
 3. VP → AUX VP
 4. NP → ART NP1
 5. NP → ADJ NP1
 6. NP1 → ADJ NP1
 7. NP1 → N
 8. NP → NAME
 9. NP → PRON
10. NAME → Mary
11. V → loves
12. ADJ → that
13. N → person
N-Grams
• Trigram estimation:

  P(W) = P(w1, w2, …, wN)
       = P(w1) P(w2|w1) P(w3|w1, w2) … P(wN|w1, …, wN-1)
       ≈ ∏_{i=1}^{N} P(wi | wi-2, wi-1)

  P(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
Example
• training data: "John read her book", "I read a different book", "John read a book by Mulan"
• bigram estimates (with <s> and </s> marking sentence boundaries):

  P(John|<s>) = C(<s>, John) / C(<s>) = 2/3
  P(read|John) = C(John, read) / C(John) = 2/2
  P(a|read) = C(read, a) / C(read) = 2/3
  P(book|a) = C(a, book) / C(a) = 1/2
  P(</s>|book) = C(book, </s>) / C(book) = 2/3

  P(John read a book)
    = P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book) ≈ 0.148

• but we have a problem here:

  P(Mulan read a book)
    = P(Mulan|<s>) P(read|Mulan) P(a|read) P(book|a) P(</s>|book) = 0
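The worked example above can be reproduced with a few lines of counting; <s> and </s> are the assumed sentence-boundary markers:

```python
from collections import Counter

corpus = ["John read her book",
          "I read a different book",
          "John read a book by Mulan"]

histories, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    histories.update(toks[:-1])               # count each token as a history
    bigrams.update(zip(toks[:-1], toks[1:]))  # count adjacent word pairs

def sent_prob(sent):
    toks = ["<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for h, w in zip(toks[:-1], toks[1:]):
        prob *= bigrams[(h, w)] / histories[h] if histories[h] else 0.0
    return prob
```

sent_prob("John read a book") recovers 8/54 ≈ 0.148, while the unseen bigram (<s>, Mulan) drives P(Mulan read a book) to zero — exactly the sparseness problem the next slide addresses.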
N-Gram Smoothing
• data sparseness: in millions of words, more than 50% of trigrams occur only once
• we can't assign P(wi|wi-1, wi-2) = 0
• solution: assign a non-zero probability to each unseen n-gram by lowering the probability mass of seen n-grams, e.g., by interpolating with the lower-order model:

  P_smooth(wi | wi-n+1 … wi-1) = λ P_ML(wi | wi-n+1 … wi-1) + (1 − λ) P_smooth(wi | wi-n+2 … wi-1)
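A minimal sketch of the interpolation idea for bigrams (Jelinek-Mercer style), with an invented eight-word corpus and an assumed λ = 0.8:

```python
from collections import Counter

tokens = "john read her book john read a book".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def p_smooth(w, h, lam=0.8):
    # interpolate the ML bigram with the lower-order (unigram) model
    p_bi = bi[(h, w)] / uni[h] if uni[h] else 0.0
    p_uni = uni[w] / N
    return lam * p_bi + (1 - lam) * p_uni
```

Unseen bigrams such as ("book", "her") now receive a small non-zero probability from the unigram term, at the cost of shaving mass off seen bigrams.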
Perplexity
• cross-entropy of a language model P on a word sequence W is

  H(W) = −(1/N) log2 P(W)

• and its perplexity is

  PP(W) = 2^{H(W)}

• measures the complexity of the language model (geometric mean of the branching factor).
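In code, given the per-word probabilities the model assigns, the two formulas are a few lines; a uniform model over 10 digits gives PP = 10, mirroring the TIDIGITS figure on the next slide:

```python
import math

def perplexity(word_probs):
    # word_probs[i] = P(w_i | history) assigned by the language model
    N = len(word_probs)
    H = -sum(math.log2(p) for p in word_probs) / N  # cross-entropy, bits/word
    return 2 ** H
```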
Perplexity
• the digit recognition task (TIDIGITS) has 10 words, PP = 10, and a 0.2% error rate
• the Airline Travel Information System (ATIS) has 2,000 words and PP = 20
• the Wall Street Journal task has 5,000 words, PP = 130 with a bigram, and a 5% error rate
• in general, lower perplexity => lower error rate, but perplexity does not take acoustic confusability into account: the E-set (B, C, D, E, G, P, T) has PP = 7 and a 5% error rate.
Bigram Perplexity
[Chart: bigram perplexity (roughly 100–500) vs. vocabulary size (10k–60k). Trained on 500 million words and tested on the Encarta Encyclopedia.]
Out-Of-Vocabulary (OOV) Rate
[Chart: OOV rate (roughly 0–25%) vs. vocabulary size (10k–60k), measured on the Encarta Encyclopedia; trained on 500 million words.]
WSJ Results

  Model               Perplexity   Word Error Rate
  Unigram Katz          1196.45        14.85%
  Unigram Kneser-Ney    1199.59        14.86%
  Bigram Katz            176.31        11.38%
  Bigram Kneser-Ney      176.11        11.34%
  Trigram Katz            95.19         9.69%
  Trigram Kneser-Ney      91.47         9.60%

• Perplexity and word error rate on the 60,000-word Wall Street Journal continuous speech recognition task.
• Unigrams, bigrams and trigrams were trained from 260 million words.
• Smoothing mechanisms: Katz and Kneser-Ney.
Pattern Classification
Goal: Combine information (probabilities) from the acoustic model, language model and word lexicon to generate an "optimal" word sequence (highest probability).
Method: The decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: How do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network down to 10^8 states—14 orders of magnitude more efficient
What is the theoretical limit of efficiency that can be achieved?
Unlimited Vocabulary ASR
• The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:

  Features → HMM states → HMM units → Phones → Words → Sentences

• For the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• State-of-the-art methods, including fast match, multi-pass decoding, the A* stack and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.
Weighted Finite State Transducers (WFST)
• a unified mathematical framework for ASR
• efficiency in time and space
[Diagram: component WFSTs — State:HMM, HMM:Phone, Phone:Word, Word:Phrase — are combined and optimized into a single search network.]
WFSTs can compile the network down to 10^8 states—14 orders of magnitude more efficient.
Weighted Finite State Transducer
Word pronunciation transducer for "data":

  d:ε/1 → { ey:ε/0.4 | ae:ε/0.6 } → { dx:ε/0.8 | t:ε/0.2 } → ax:"data"/1
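Enumerating the paths of such a transducer shows how arc weights multiply into pronunciation probabilities; the stage list below hard-codes the arcs from the figure:

```python
from itertools import product

# Arcs of the "data" pronunciation transducer, one list of (phone, weight)
# alternatives per position.
stages = [[("d", 1.0)],
          [("ey", 0.4), ("ae", 0.6)],
          [("dx", 0.8), ("t", 0.2)],
          [("ax", 1.0)]]

pronunciations = []
for path in product(*stages):          # every path through the lattice
    prob = 1.0
    for _, w in path:
        prob *= w                      # path weight = product of arc weights
    pronunciations.append((" ".join(p for p, _ in path), prob))
```

The four pronunciations' probabilities sum to 1, as they must for a properly normalized pronunciation model.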
Algorithmic Speed-up for Speech Recognition
[Chart: relative speed (roughly 0–30x) vs. year (1994–2002), comparing AT&T algorithmic speed-ups against Moore's Law (hardware), for the community and AT&T. North American Business task—vocabulary: 40,000 words, branching factor: 85.]
Confidence Scoring
Goal: Identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis test is associated with each recognized word. For example:

  Label:      credit please
  Recognized: credit fees
  Confidence: (0.9)  (0.3)

Challenges: Rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
The word posterior serves as a confidence score:

  P(W|X) = P(W) P(X|W) / Σ_{W'} P(W') P(X|W')
Rejection
Problem: Extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.
Effect: Rejection improves the performance of the recognizer, understanding system and dialog manager.
Measure of confidence: Associate word strings with a verification cost that provides an effective measure of confidence (utterance verification).
State-of-the-Art Performance
[Diagram: input speech → Feature Extraction → Pattern Classification (decoding, search), using the Acoustic Model, Language Model and Word Lexicon → Confidence Scoring → recognized sentence; ASR shown within the dialog circle (SLU, DM, SLG, TTS).]
How to Evaluate Performance?
• Dictation applications: insertions, substitutions and deletions

  Word Error Rate = 100% × (#Subs + #Dels + #Ins) / (no. of words in the correct sentence)

• Command-and-control: false rejection and false acceptance => ROC curves.
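The three error counts come from a minimum-edit-distance alignment between the reference and the hypothesis; a compact word-level implementation:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j]: min substitutions + deletions + insertions aligning r[:i] to h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For the earlier confidence-scoring example, wer("credit please", "credit fees") is 50%: one substitution in a two-word reference.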
Word Error Rates

  CORPUS                                     TYPE                      VOCABULARY SIZE      WORD ERROR RATE
  Connected Digit Strings (TI Database)      Spontaneous               11 (zero–nine, oh)   0.3%
  Connected Digit Strings (Mall Recordings)  Spontaneous               11 (zero–nine, oh)   2.0%
  Connected Digit Strings (HMIHY)            Conversational            11 (zero–nine, oh)   5.0%
  RM (Resource Management)                   Read Speech               1,000                2.0%
  ATIS (Airline Travel Information System)   Spontaneous               2,500                2.5%
  NAB (North American Business)              Read Text                 64,000               6.6%
  Broadcast News                             News Show                 210,000              13–17%
  Switchboard                                Conversational Telephone  45,000               25–29%
  Call Home                                  Conversational Telephone  28,000               40%

Note the factor-of-17 increase in digit error rate from read (0.3%) to conversational (5.0%) speech.
NIST Benchmark Performance
[Chart: word error rate by year for the Resource Management, ATIS and NAB benchmarks. NAB task—vocabulary: 40,000 words, branching factor: 85.]
North American Business

Algorithmic Accuracy for Speech Recognition
[Chart: word accuracy (roughly 30–80%) vs. year (1996–2002) on the Switchboard/Call Home task—vocabulary: 40,000 words, perplexity: 85.]
Growth in Effective Recognition Vocabulary Size
[Chart: vocabulary size on a log scale (1 to 10,000,000 words) vs. year, 1960–2010.]
Human Speech Recognition vs ASR
[Chart: machine error (%) vs. human error (%) on log–log axes for Digits, RM-null, RM-LM, NAB-mic, NAB-omni, WSJ, WSJ-22dB and SWBD, with x1, x10 and x100 reference lines; the region below the x1 line, where machines would outperform humans, is empty.]
Challenges in ASR
System performance: accuracy; efficiency (speed, memory); robustness
Operational performance: end-point detection; user barge-in; utterance rejection; confidence scoring
Machines are 10–100 times less accurate than humans.
Large-Vocabulary ASR Demo
Courtesy of Murat Saraclar
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR (speech → words), SLU (words → meaning), DM (meaning → action), SLG (action → words), TTS (words → speech) — supported by data and rules.]
Spoken Language Understanding (SLU)
• Goal: Interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
  – accurate understanding can often be achieved without correctly recognizing every word
  – SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: Exploit task grammar (syntax) and task semantics to restrict the range of meaning associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning
• Performance evaluation: Accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: Automation of complex operator-based tasks, e.g., customer care, catalog ordering, form-filling systems, provisioning of new services, customer help lines, etc.
• Challenges: What goes beyond simple classification systems but below full natural-language voice dialog systems?
SLU Formulation
• let W be a sequence of words and C be its underlying meaning (conceptual structure); then, using Bayes rule, we get

  P(C|W) = P(W|C) P(C) / P(W)

  C* = argmax_C P(W|C) P(C)

• finding the best conceptual structure can be done by parsing and ranking using a combination of acoustic, linguistic and semantic scores.
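A toy instance of C* = argmax_C P(W|C) P(C) is a naive-Bayes concept classifier; the three labeled utterances below are invented training data:

```python
import math
from collections import Counter, defaultdict

# hypothetical (concept, utterance) training pairs
train = [("billing credit", "I dialed a wrong number"),
         ("billing credit", "I was charged for a wrong number"),
         ("rate plan", "what calling plans do you offer")]

word_counts = defaultdict(Counter)
concept_counts = Counter()
vocab = set()
for concept, utt in train:
    concept_counts[concept] += 1
    word_counts[concept].update(utt.split())
    vocab.update(utt.split())

def classify(utt):
    def score(c):                        # log P(C) + sum_w log P(w|C)
        total = sum(word_counts[c].values())
        s = math.log(concept_counts[c] / len(train))
        for w in utt.split():            # add-one smoothing over the vocabulary
            s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        return s
    return max(concept_counts, key=score)
```

Salient words like "wrong number" pull an utterance toward the billing-credit concept even when other words are misrecognized.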
Knowledge Sources for Speech Understanding
[Diagram: ASR draws on acoustic/phonetic knowledge (Acoustic Model: relationship of speech sounds and English phonemes) and phonotactic knowledge (Word Lexicon: rules for phoneme sequences and pronunciation); SLU draws on syntactic knowledge (Language Model: structure of words and phrases in a sentence) and semantic knowledge (Understanding Model: relationships and meanings among words); DM draws on pragmatic knowledge (Dialog Manager: discourse, interaction history, world knowledge).]
DARPA Communicator
• DARPA-sponsored research and development of mixed-initiative dialog systems
• Travel task involving airline, hotel and car information and reservations

  "Yeah I uh I would like to go from New York to Boston tomorrow night with United"

• SLU output (concept decoding):
  Topic: Itinerary
  Origin: New York
  Destination: Boston
  Day of the week: Sunday
  Date: May 25th, 2002
  Time: >6pm
  Airline: United

• XML schema:
  <itinerary>
    <origin><city></city><state></state></origin>
    <destination><city></city><state></state></destination>
    <date></date>
    <time></time>
    <airline></airline>
  </itinerary>
Customer Care IVR and HMIHY
[Diagram comparing call flows:]
• Customer Care IVR: sparkle tone ("Thank you for calling AT&T…") → network menu → LEC misdirect announcement → account verification routine → main menu → LD sub-menu → reverse directory routine, with individual steps of 10–58 seconds. Total time to get to reverse directory lookup: 2:55 minutes!
• HMIHY: sparkle tone ("AT&T, How may I help you?") → account verification routine → reverse directory, with steps of 20 and 8 seconds. Total time to get to reverse directory lookup: 28 seconds!
HMIHY—How Does It Work
• Prompt is "AT&T. How may I help you?"
• User responds with totally unconstrained, fluent speech
• System recognizes the words and determines the meaning of the user's speech, then routes the call
• Dialog technology enables task completion
[Diagram: HMIHY routes calls to Account Balance, Calling Plans, Local Service, Unrecognized Number, …]
HMIHY Example Dialogs
• Irate Customer• Rate Plan• Account Balance• Local Service• Unrecognized Number• Threshold Billing• Billing Credit
Customer Satisfaction
• decreased repeat calls (37%)
• decreased ‘OUTPIC’ rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)
Customer Care Scenario
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR, SLU, DM, SLG, TTS — supported by data and rules.]
Dialog Management (DM)
• Goal: Combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
  – DM makes viable complex services that require multiple exchanges between the system and the customer
  – dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: Exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
• Performance evaluation: Speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: Customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: Is there a science of dialogs—how do you keep a dialog efficient (turns, time, progress towards a goal), and how do you attain goals (get answers)? Is there an art of dialogs? How does the user interface play into the art/science of dialogs—sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak, which motivates multimodal interactions with machines.
Dialog Management (DM)
• the DM can be considered a model that is programmed from rules and/or data to accomplish desired tasks
• the goal of a DM is to manage elaborate exchanges with the user, providing automated access to information and services
Computational Models for DM
• Structural-based approach
  – assumes that there exists a regular structure in dialog that can be modeled as a state transition network or dialog grammars
  – dialog strategies need to be predefined, thus limiting flexibility
  – several real-time deployed systems exist today which have inference engines and knowledge representation
• Plan-based approach
  – considers communication as "acting": dialog acts and actions are oriented toward goal achievement
  – motivated by human/human interaction, in which humans generally have goals and plans when interacting with others
  – aims to account for general models and theories of discourse
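A structural (state-transition) dialog manager can be sketched as a table of states, prompts, and transitions; the states and prompts here are hypothetical:

```python
# state -> (system prompt, {interpreted user input type -> next state})
DIALOG = {
    "get_departure": ("Please say your departure city.", {"city": "get_arrival"}),
    "get_arrival":   ("Please say your arrival city.",   {"city": "confirm"}),
    "confirm":       ("Shall I book this trip?",         {"yes": "done",
                                                          "no": "get_departure"}),
}

def next_state(state, user_input_type):
    _, transitions = DIALOG[state]
    # unexpected input leaves the dialog in place (a simple reprompt strategy)
    return transitions.get(user_input_type, state)
```

The rigidity is visible in the table itself: every strategy (reprompt, confirm, recover) must be wired in ahead of time, which is exactly the limitation noted above.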
Generic Spoken Dialog System
[Diagram: the Dialog Manager coordinates Context Tracking, Dialog Strategy Modules, SL Understanding and Response Generation; it connects to the ASR module, TTS/player, Resource Manager, Databases and the access device interface (telephone, IP).]
Dialog Strategy
• Confirmation: used to ascertain correctness of the recognized utterance or utterances
• Error recovery: used to get the dialog back on track after a user indicates that the system has misunderstood something
• Reprompting: used when the system expected input but did not receive any
• Completion: used to elicit missing input information from the user
• Constraining: used to reduce the scope of the request so that a reasonable amount of information is retrieved, presented to the user, or otherwise acted upon
• Relaxation: used to increase the scope of the request when no information has been retrieved
• Disambiguation: used to resolve inconsistent input from the user
• Greeting/closing: used to maintain social protocol at the beginning and end of an interaction
Building Good Speech-Based Applications
• Good user interfaces—make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
• Good models of dialog—keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
• Matching the task to the technology—be realistic about the capabilities of the technology
  – fail-soft methods
  – error-correcting techniques (data dips)
Mixed-Initiative Dialog
Who manages the dialog?
• System initiative: "Please say collect, calling card, third number."
• User initiative: "How may I help you?"
Example: Mixed-Initiative Dialog
• System Initiative (long dialogs, but easier to design)
  System: Please say just your departure city.
  User: Chicago
  System: Please say just your arrival city.
  User: Newark
• Mixed Initiative
  System: Please say your departure city.
  User: I need to travel from Chicago to Newark tomorrow.
• User Initiative (shorter dialogs and a better user experience, but more difficult to design)
  System: How may I help you?
  User: I need to travel from Chicago to Newark tomorrow.
Dialog Evaluation
Why?
• mining problematic situations (task failure)
• minimize user hang-ups and routing to an operator
• improve user experience
• perform data analysis and monitoring
• conduct data mining for sales and marketing
How?
• log exchange features from ASR, SLU, DM, ANI
• define and automatically optimize a task objective function
• compute service statistics (instant and delayed)
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR, SLU, DM, SLG, TTS — supported by data.]
Spoken Language Generation (SLG)
• translate the action of the dialog manager into a textual representation
  – template-based SLG
  – stochastic-based SLG
Template-Based SLG
• Language generation is achieved by manipulating text strings using
  – concatenation of strings, and
  – substitution of substrings (slots) by other strings
• Number of templates grows exponentially
  – no syntactic generalization; different templates for singular and plural versions
• Not much variation in output
  – no text planning; simple concatenation of strings
• Maintainability
  – expensive and complex to assemble and change with a large number of templates
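Template-based generation is essentially slot substitution; the intent name, template, and slot values below are invented:

```python
# hypothetical template store: one string per communicative intent
TEMPLATES = {
    "confirm_itinerary": "Flying from {origin} to {destination} on {date}.",
}

def generate(intent, **slots):
    # concatenation + substring (slot) substitution, nothing more
    return TEMPLATES[intent].format(**slots)
```

generate("confirm_itinerary", origin="Newark", destination="Boston", date="May 25th") yields the filled sentence; note there is no syntactic generalization, so plural or reordered variants each need their own template.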
Stochastic-Based SLG
• Text planning
  – transform communicative goals into a sequence of elementary communicative goals
• Sentence planning
  – choose linguistic resources to express the communicative goals
• Surface realization
  – produce surface word order according to the grammar of the language
  – combine linguistic information with statistical models derived from annotated corpora
Automatic Creation of Services from Web Sites – WebTalk
• Goal: constructing dialog applications with zero human intervention by automatically mining information from the client's web site
  – possess as much knowledge as the website does
  – synchronize with updates on the website with zero effort
  – synchronization of telephone-based and Internet-based services
• Challenges:
  – robust website understanding
  – a webpage-based language model that reflects human-machine interaction
  – a web-data-driven dialog manager
  – spoken language understanding that requires no hand-written rules or annotation
  – language generation that is grammatically correct without rules
W: Hi, this is AT&T local service. How may I help you?
U: Hi
W: Hello, what is your question?
U: I need help please
W: Let me give you an idea of what you can say. You can ask for solutions, LNP, products services, local solutions and local voice service. How may I help you?
U: What's LNP?
W: LNP stands for local number portability.
U: Thanks
W: You're welcome. Anything else?
U: What products and services do you provide?
W: Let me give you more information about our products, including centrex, AT&T primeplex, AT&T primecodes, T1 access, AT&T prime digital trunk, …
Voice-Enabled System Technology Components
[Diagram: the speech dialog circle — ASR (speech → words spoken), SLU (words → meaning), DM (meaning → action), SLG (action → words to be synthesized), TTS (words → speech) — supported by data and rules.]
Building Spoken Dialog Systems: Rule-Based or Data-Driven

                   Rule-based                     Data-driven
  Task complexity  Simple (e.g., name dialing)    Complex (e.g., customer care)
  Input speech     Read/fluent                    Spontaneous/conversational
  Resources        No data collection             Extensive data collection
  Maintenance      Harder for complex apps        Straightforward if sufficient data is available
  Characteristic   Brittle                        Robust
User Interface
• Good user interfaces
  – make the application easy to use and robust to the kinds of confusion that arise in human-machine communication by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the part of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU, but the UI can make or break a system, even with excellent ASR and NLU
  – effective UI design is based on a set of elementary principles: common widgets, sequenced screen presentations, simple error-trap dialogs, and a user manual
Correction UI
• N-best list
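One way to realize an N-best correction UI: when the recognizer is uncertain, offer the ranked hypotheses one at a time and let the user accept or reject each. The sketch below is a minimal, hypothetical version; function and variable names are invented here, and a real UI would present the list visually or by voice.

```python
# Hypothetical sketch of N-best-list correction.

def nbest_correction(nbest, confirm):
    """nbest: list of (hypothesis, score) pairs from the recognizer.
    confirm: callback returning True when the user accepts a hypothesis."""
    ranked = sorted(nbest, key=lambda hs: hs[1], reverse=True)
    for hypothesis, _score in ranked:
        if confirm(hypothesis):
            return hypothesis
    return None  # nothing accepted: fall back to re-prompting the user

# Example: the second-best hypothesis is what the user actually said.
nbest = [("call Jim Brown", 0.62), ("call Tim Brown", 0.31), ("call him now", 0.07)]
result = nbest_correction(nbest, confirm=lambda h: h == "call Tim Brown")
print(result)  # -> call Tim Brown
```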
Multimodal System Technology Components

[Diagram: the same dialog circle as above – ASR, SLU, DM, SLG, and TTS connected through shared data and rules – extended with visual, pen, and gesture inputs alongside speech.]
Multimodal Experience
• Users need access to automated dialog applications at any time and anywhere, using any access device
• Selection of the "most appropriate" UI mode, or combination of modes, depends on the device, the task, the environment, and the user's abilities and preferences
Multimodal Architecture

• Parsing (language model): what the user might do (say) – P(F | x, Sn-1)
• Understanding (semantic model): what that means – P(Sn | F, Sn-1)
• Rendering (behavior model): what to show (say) to the user – P(A | Sn)

where x = multimodal inputs, F = surface semantics, Sn = discourse semantics, and A = multimedia outputs; application logic connects understanding to rendering.
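The three conditional models can be chained into a turn-taking pipeline. The sketch below is a toy instantiation: the model tables, state names, and events (zoom_request, map_zoomed, and so on) are invented for illustration, and each stage simply takes the argmax of its distribution; a real system would estimate these distributions from data.

```python
# Toy instantiation of the P(F|x,S), P(S'|F,S), P(A|S') decomposition.
# Model tables and state names are assumptions, not from the lecture.

language_model = {  # P(F | x, Sn-1): inputs + prior state -> surface semantics
    (("speech:'zoom here'", "pen:circle"), "map_shown"):
        {"zoom_request": 0.9, "pan_request": 0.1},
}
semantic_model = {  # P(Sn | F, Sn-1): surface semantics -> discourse state
    ("zoom_request", "map_shown"): {"map_zoomed": 0.95, "map_shown": 0.05},
}
behavior_model = {  # P(A | Sn): discourse state -> multimedia output
    "map_zoomed": {"render_zoomed_map": 0.99, "ask_clarification": 0.01},
}

def turn(x, s_prev):
    """One multimodal turn: pick the most probable F, Sn, and A in sequence."""
    F = max(language_model[(x, s_prev)], key=language_model[(x, s_prev)].get)
    s_n = max(semantic_model[(F, s_prev)], key=semantic_model[(F, s_prev)].get)
    a = max(behavior_model[s_n], key=behavior_model[s_n].get)
    return F, s_n, a

F, s_n, a = turn(("speech:'zoom here'", "pen:circle"), "map_shown")
print(F, s_n, a)  # -> zoom_request map_zoomed render_zoomed_map
```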
Microsoft MiPad
• Multimodal Interactive Pad
• Usability studies show double the throughput for English
• Speech is most useful in cases with many alternatives

MiPad Video
AT&T MATCH: Multimodal Access To City Help

"Are there any cheap Italian places in this neighborhood?"

• Access to information through a voice interface, gesture, and GPS
• Multimodal integration: combination of speech, gesture, and meaning using finite-state technology
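The idea behind MATCH-style integration can be illustrated with a much-simplified sketch: deictic words in the speech stream ("this", "here") are unified with gesture referents, loosely imitating what MATCH does with finite-state transducer composition over speech and gesture lattices. All names here are illustrative, not the actual MATCH implementation.

```python
# Simplified stand-in for finite-state multimodal integration:
# align deictic speech tokens with gesture referents, in order.

def integrate(speech_tokens, gesture_referents):
    """Replace each deictic token with the next gesture referent."""
    deictics = {"this", "here", "there"}
    refs = iter(gesture_referents)
    meaning = []
    for tok in speech_tokens:
        if tok in deictics:
            meaning.append(next(refs))  # consume one gesture per deictic
        else:
            meaning.append(tok)
    return meaning

speech = "cheap italian places in this neighborhood".split()
gestures = ["area:chelsea"]  # e.g., a circling gesture drawn on the map
print(integrate(speech, gestures))
# -> ['cheap', 'italian', 'places', 'in', 'area:chelsea', 'neighborhood']
```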
MATCH Video
Courtesy of Michael Johnston
The Speech Advantage
• Reduce costs
– reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
– 24x7 high-quality customer care automation
– access to information without a keyboard or touch-tones
• Customer retention
– provide personal services based on customer preferences
– improve the customer experience
Poor Customer Care

In a recent survey conducted by Mobius Management Systems:
• 54% consider customer service to be poor if they wait over 5 minutes for a rep
• Because of poor customer care:
– 46% of respondents dropped their credit card provider
– 32% dropped their internet provider
– 30% dropped a phone company
– 32% dropped a cellular phone vendor
• 53% of respondents consider touch-tone IVR to be the most frustrating aspect of customer service
Voice-Enabled Services
• Desktop applications – dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, …)
• Agent technology – simple tasks like stock quotes, traffic reports, and weather; access to communications, e.g., voice dialing and voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice portals – 'convert any web page to a voice-enabled site', where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
• E-contact services – call centers, customer care (HMIHY) and help desks, where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics – command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices – control of cellphones and PDAs by voice commands
Customer Care and Help Desk Automation

Major industry sectors:
• Telecommunications
• Banking
• Finance
• Technology
• Retail
• Insurance
• Entertainment

Customer care and help desk costs are expected to be over $100 billion/year worldwide.
It's Just the Beginning…
Armstrong - Telephone should be the most ubiquitous Internet access device
Gates - Speech is key to popularizing the Internet
A Look Into the Future

1990+: keyword spotting, handcrafted grammars, no dialogue – Directory Assistance, VRCP (constrained speech, minimal data collection, manual design)
1995+: medium-size ASR, handcrafted grammars, system initiative – airline reservation, banking (constrained speech, moderate data collection, some automation)
2000+: large-size ASR, limited NLU, mixed initiative – call centers, e-commerce (spontaneous speech, extensive data collection, semi-automation)
2005+: unlimited ASR, deeper NLU, adaptive systems – multimodal, multilingual help desks, e-commerce, e.g., MATCH: Multimodal Access To City Help (spontaneous speech/pen, fully automated systems)
Future of Speech Recognition Technologies

[Chart: projected milestones, 2002–2011 – dialog systems (very large vocabulary, limited tasks, controlled environment); robust systems (very large vocabulary, limited tasks, arbitrary environment); multilingual systems (unlimited vocabulary, unlimited tasks, many languages); multimodal speech-enabled devices.]
Spoken Dialog Interfaces vs. Touch-Tone IVR?
Courtesy of Jay Wilpon