Digital Speech Processing — Lecture 19
Techniques for Speech and Natural Language Recognition and Understanding



Page 1

Digital Speech Processing — Lecture 19

Techniques for Speech and Natural Language Recognition and Understanding

Page 2

Speech Recognition -- 2001 (Stanley Kubrick View in 1968)

Page 3

Apple Navigator -- 1988

Page 4

The Speech Advantage

• Reduce costs
  – reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
  – 24x7 high-quality customer care automation
  – access to information without a keyboard or touch-tones
• Customer retention
  – provide personal services for customer preferences
  – improve customer experience

Page 5

The Speech Dialog Circle

[Figure: the speech dialog circle. Speech from the customer enters ASR (Automatic Speech Recognition), which produces the words spoken ("I dialed a wrong number"). SLU (Spoken Language Understanding) maps the words to a meaning ("Billing credit"); DM (Dialog Management) uses data and rules to decide what's next, i.e., the action ("Determine correct number"); SLG (Spoken Language Generation) produces the words to say ("What number did you want to call?"); and TTS (Text-to-Speech Synthesis) returns a voice reply to the customer.]

Page 6

Automatic Speech Recognition

• Goal: accurately and efficiently convert a speech signal into a text message, independent of the device, speaker or the environment.

• Applications: automation of complex operator-based tasks, e.g., customer care, dictation, form filling applications, provisioning of new services, customer help lines, e-commerce, etc.

Page 7

A Brief History of Speech Recognition

[Figure: speech recognition accuracy vs. time. Early systems relied on heuristics, handcrafted rules and local optimization; later systems moved to mathematical formalization, global optimization and automatic learning from data (HMMs). Cartoon caption: "Christopher C. discovers a new continent."]

Page 8

Basic ASR Formulation (Bayes Method)

[Figure: the speaker's intention W passes through the speech production mechanisms to give the acoustic signal s(n) (the speaker model); the acoustic processor converts s(n) into features X, and the linguistic decoder produces the recognized string Ŵ (the speech recognizer).]

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P_A(X \mid W)\,P_L(W)$$

Step 1 provides the acoustic model $P_A(X \mid W)$, Step 2 the language model $P_L(W)$, and Step 3 the search for the arg max.

Page 9

Steps in Speech Recognition

Step 1 - Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.

Step 2 - Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.

Step 3 - Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W.
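As a minimal sketch of how the three steps combine (the two scoring functions below are hypothetical toy stand-ins, not a real recognizer), the Bayes decision rule is a log-domain arg max over candidate word sequences:

```python
# Toy illustration of W_hat = arg max_W P_A(X|W) * P_L(W).
# A real system computes log P_A with HMMs (Step 1) and log P_L with
# an N-gram model (Step 2) over a vastly larger hypothesis space.
def log_acoustic_score(X, W):      # stand-in for log P_A(X | W)
    return -1.0 * len(W)

def log_lm_score(W):               # stand-in for log P_L(W)
    return -0.5 * len(W)

def recognize(X, candidates):
    # Step 3: search for the MAP word sequence over all candidates
    return max(candidates,
               key=lambda W: log_acoustic_score(X, W) + log_lm_score(W))

print(recognize("features", [("call", "home"), ("call", "grandma", "now")]))
```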

Page 10

Step 1 - The Acoustic Model

• we build acoustic models by learning statistics of the acoustic features X from a training set, where we compute the variability of the acoustic features during the production of the sounds represented by the models
• it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language; it requires too much training data for words in every possible context
• instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
• similarly we build sentences (sequences of words) by concatenating word models

Page 11

Step 2 - The Language Model

• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words, namely an N-gram language model:

$$P_L(W) = P_L(w_1, w_2, \ldots, w_k) = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})$$

where $P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})$ is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
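A minimal sketch of the counting estimate for a bigram (N = 2) model, assuming a toy one-sentence corpus; real language models are trained on far larger text and need smoothing:

```python
from collections import Counter

corpus = "i want to go from boston to denver".split()

bigrams = Counter(zip(corpus, corpus[1:]))      # C(w_{n-1}, w_n)
histories = Counter(corpus[:-1])                # C(w_{n-1})

def p_bigram(w, prev):
    # Relative-frequency estimate P_L(w_n | w_{n-1}) = C(prev, w) / C(prev)
    return bigrams[(prev, w)] / histories[prev] if histories[prev] else 0.0

print(p_bigram("to", "go"))     # C(go, to) / C(go) = 1/1 = 1.0
print(p_bigram("go", "to"))     # C(to, go) / C(to) = 1/2 = 0.5
```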

Page 12

Step 3 - The Search Problem

• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of finite state automata theory provides finite state networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems

Page 13

Speech Recognition Processes

1. Choose the task => sounds, word vocabulary, task syntax (grammar), task semantics
   • Example: isolated digit recognition task
     – sounds to be recognized—whole words
     – word vocabulary—zero, oh, one, two, ..., nine
     – task syntax—any digit is allowed
     – task semantics—sequence of isolated digits must form a valid telephone number

Page 14

Speech Recognition Processes

2. Train the Models => create a method for building acoustic word models from a speech training data set, a word lexicon, a word grammar (language model), and a task grammar from a text training data set
   – speech training set—100 people each speaking the 11 digits 10 times in isolation
   – text data training set—listing of valid telephone numbers (or equivalently an algorithm that generates valid telephone numbers)

Page 15

Speech Recognition Processes

3. Evaluate performance => determine word error rate and task error rate for the recognizer
   – speech testing data set—25 new people each speaking 10 telephone numbers as sequences of isolated digits
   – evaluate digit error rate and telephone number error rate

4. Testing algorithm => method for evaluating recognizer performance from the testing set of speech utterances

Page 16

Speech Recognition Process

[Figure: block diagram. Input speech s(n) passes through feature analysis (spectral analysis) to give features X_n; pattern classification (decoding, search) combines the acoustic model (HMM), the language model (N-gram) and the word lexicon to produce the word string W; confidence scoring (utterance verification) attaches word confidences, e.g., "Hello World" (0.9) (0.8), yielding the final output Ŵ. The ASR box sits inside the larger dialog circle with SLU, DM, SLG and TTS.]

Page 17

Feature Extraction

Goal: extract robust features (information) from the speech that are relevant for ASR.

Method: spectral analysis, through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).

Result: signal compression, where for each window of speech samples about 30 cepstral features are extracted (64,000 b/s -> 5,200 b/s).

Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Feature set for recognition: cepstral features or those from a high dimensionality space.
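A minimal numpy sketch of the cepstral computation for one analysis window (no mel warping, filter bank or liftering; an illustration of the log-spectrum-then-inverse-transform idea, not a production front end):

```python
import numpy as np

def real_cepstrum(frame, n_ceps=13):
    """One analysis window: Hamming window -> |FFT| -> log -> inverse FFT.
    Keeps the low-order (low-quefrency) cepstral coefficients."""
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # avoid log(0)
    return np.fft.irfft(log_spectrum)[:n_ceps]

frame = np.random.randn(400)        # e.g., 25 ms of speech at 16 kHz
print(real_cepstrum(frame).shape)   # (13,)
```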

Page 18

What Features to Use?

Short-time spectral analysis:
Acoustic features:
- mel cepstrum (LPC, filterbank, wavelets)
- formant frequencies, pitch, prosody
- zero-crossing rate, energy
Acoustic-phonetic features:
- manner of articulation (e.g., stop, nasal, voiced)
- place of articulation (labial, dental, velar)
Articulatory features:
- tongue position, jaw, lips, velum
Auditory features:
- ensemble interval histogram (EIH), synchrony

Temporal analysis: approximation of the velocity and acceleration, typically through first and second order central differences.

Page 19

Feature Extraction Process

[Figure: front-end chain. s(t) -> sampling and quantization -> noise removal, filtering, normalization -> preemphasis (α) -> segmentation (blocking into frames; parameters M, N) -> windowing W(n) -> spectral analysis (with side outputs: pitch, formants, energy, zero-crossing) -> cepstral analysis c_m(l) -> equalization (bias removal or normalization) c'_m(l) -> temporal derivatives (delta cepstrum Δc'_m(l) and delta² cepstrum).]

Page 20

Robustness

Problem: a mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.

Methods: traditional techniques for improving system robustness are based on signal enhancement, feature normalization and/or model adaptation.

Perception approach: extract fundamental acoustic information in narrow bands of speech; robust integration of features across time and frequency.

Page 21

Methods for Robust Speech Recognition

[Figure: the training and testing chains (signal -> features -> model) can be matched at three points: signal enhancement, feature normalization and model adaptation.]

Page 22

Acoustic Model

Goal: map acoustic features into distinct phonetic labels (e.g., /s/, /aa/) or word labels.

Hidden Markov Model (HMM): a statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone or a word. HMMs are also assigned for modeling extraneous events.

Advantages: a powerful statistical method for dealing with a wide range of data and reliably recognizing speech.

Challenges: understanding the role of classification models (ML training) versus discriminative models (MMI training); what comes after the HMM: are there data-driven models that work better for some or all vocabularies?

Page 23

Hidden Markov Models

[Figure: a 3-state ergodic HMM whose observations are the symbols up / down / unchanged. Each state carries an output probability vector (P(up), P(down), P(unchanged)), with values such as (0.7, 0.1, 0.2), (0.1, 0.6, 0.3) and (0.5, 0.2, 0.3), and transition probabilities such as (0.3, 0.3, 0.4).]

Ergodic model: can get to any state from any other state.

States are hidden: we observe the effects, not the states.

Page 24

HMM for Speech

• phone model 's': three states s1, s2, s3 in sequence
• word model 'is' (ih-z): concatenate the phone models for 'ih' (states i1, i2, i3) and 'z' (states z1, z2, z3)

Page 25

Isolated Word HMM

[Figure: a 5-state left-right HMM with self-loops a11 ... a55, forward transitions a12, a23, a34, a45, skip transitions a13, a24, a35, and state observation densities b1(Ot) ... b5(Ot).]

Left-right HMM: highly constrained state sequences.
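A minimal sketch of the transition matrix for such a left-right model, assuming illustrative probability values (self-loop, step and skip transitions only; every row sums to one):

```python
import numpy as np

N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = 0.5                        # self-loop a_ii
    if i + 1 < N: A[i, i + 1] = 0.3      # forward transition a_{i,i+1}
    if i + 2 < N: A[i, i + 2] = 0.2      # skip transition a_{i,i+2}
A[-1, -1] = 1.0                          # absorbing final state
A[-2, -2], A[-2, -1] = 0.5, 0.5          # renormalize next-to-last row

assert np.allclose(A.sum(axis=1), 1.0)   # valid stochastic matrix
print(A)
```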

Page 26

Basic Problems in HMMs

Given acoustic observation X and model Φ:
• Evaluation: compute P(X | Φ)
• Decoding: choose the optimal state sequence
• Re-estimation: adjust Φ to maximize P(X | Φ)

Page 27

Evaluation: The Baum-Welch Algorithm

Forward recursion:
$$\alpha_t(i) = P(X_1^t, s_t = i \mid \Phi) = \left[\sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\right] b_i(X_t), \quad 2 \le t \le T,\ 1 \le i \le N$$

Backward recursion:
$$\beta_t(j) = P(X_{t+1}^T \mid s_t = j, \Phi) = \sum_{i=1}^{N} a_{ji}\, b_i(X_{t+1})\, \beta_{t+1}(i), \quad t = T-1, \ldots, 1,\ 1 \le j \le N$$

Transition posterior:
$$\gamma_t(i,j) = P(s_{t-1} = i, s_t = j \mid X_1^T, \Phi) = \frac{P(s_{t-1} = i, s_t = j, X_1^T \mid \Phi)}{P(X_1^T \mid \Phi)} = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)\, \beta_t(j)}{\sum_{k=1}^{N} \alpha_T(k)}$$

[Figure: trellis over states s1 ... sN and times t-2 ... t+1, showing α_{t-1}(i) flowing into state j through a_{ij} b_j(X_t) and continuing as β_t(j).]
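A minimal numpy sketch of the forward recursion above (a toy 2-state model with discrete outputs; the parameter values are illustrative assumptions):

```python
import numpy as np

A = np.array([[0.7, 0.3],        # a_ji: state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # b_i(x): P(symbol x | state i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # initial state distribution

def forward(X):
    """alpha recursion; returns P(X | model) = sum_i alpha_T(i)."""
    alpha = pi * B[:, X[0]]
    for x in X[1:]:
        # sum_j alpha_{t-1}(j) a_ji, then multiply by b_i(x_t)
        alpha = (alpha @ A) * B[:, x]
        # (real systems scale alpha each frame to avoid underflow)
    return alpha.sum()

print(forward([0, 1, 1, 0]))
```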

Page 28

Decoding: The Viterbi Algorithm

[Figure: trellis over states s1 ... sM and frames x1 ... xT showing the optimal alignment between X and S. Each cell keeps the best path score V_t(j), computed from V_{t-1}(j-1) a_{j-1,j}, V_{t-1}(j) a_{j,j} and V_{t-1}(j+1) a_{j+1,j}, multiplied by b_j(x_t).]
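A minimal Viterbi sketch matching the trellis above, reusing the same hypothetical toy model as the forward example:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # illustrative output probabilities
pi = np.array([0.5, 0.5])

def viterbi(X):
    """Best state sequence: V_t(j) = max_i V_{t-1}(i) a_ij, times b_j(x_t)."""
    V = np.log(pi) + np.log(B[:, X[0]])       # log domain avoids underflow
    back = []
    for x in X[1:]:
        scores = V[:, None] + np.log(A)       # V_{t-1}(i) + log a_ij
        back.append(scores.argmax(axis=0))    # best predecessor for each state
        V = scores.max(axis=0) + np.log(B[:, x])
    path = [int(V.argmax())]
    for ptr in reversed(back):                # backtrace the optimal alignment
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

print(viterbi([0, 1, 1, 0]))
```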

Page 29

Reestimation: Training Algorithm

• find the model parameters Φ = (A, B, π) that maximize the probability P(X | Φ) of observing some data X:

$$P(X \mid \Phi) = \sum_{S} P(X, S \mid \Phi) = \sum_{S} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(X_t)$$

• there is no closed-form solution, and thus we use the EM algorithm:
  – given Φ, we compute the model parameters Φ̂ that maximize the following auxiliary function Q:

$$Q(\Phi, \hat{\Phi}) = \sum_{\text{all } S} \frac{P(X, S \mid \Phi)}{P(X \mid \Phi)} \log P(X, S \mid \hat{\Phi}) = Q_a(\Phi, \hat{a}_{ij}) + Q_b(\Phi, \hat{b}_j)$$

  – the update is guaranteed to have increased (or the same) likelihood

Page 30

Training Procedure (Segmental K-Means)

Several iterations of the process:

[Figure: the input speech database and the old (initial) HMM model feed a step that estimates the posteriors ζ_t(j,k) and γ_t(i,j); a maximization step then updates the parameters a_ij, c_jk, μ_jk, Σ_jk to give the updated HMM model, which is fed back for the next iteration.]

Page 31

Isolated Word Recognition Using HMMs

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:

1. for each word v in the vocabulary, we must build an HMM λ_v, i.e., we must re-estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for the v-th word (TRAINING)
2. for each unknown word which is to be recognized, we do the following:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ_v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest:
      v* = arg max_{1 ≤ v ≤ V} P(O | λ_v)

Computation on the order of V · N² · T is required; for V = 100, N = 5, T = 40, about 10^5 computations.
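A minimal sketch of step 2, assuming one forward-likelihood evaluator per vocabulary word; the parameter values below are illustrative, not trained:

```python
import numpy as np

def make_model(A, B, pi):
    # Each vocabulary word v gets its own HMM lambda_v; the closure
    # evaluates P(O | lambda_v) with the forward algorithm.
    def likelihood(O):
        alpha = pi * B[:, O[0]]
        for o in O[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()
    return likelihood

models = {
    "zero": make_model(np.array([[0.8, 0.2], [0.0, 1.0]]),
                       np.array([[0.9, 0.1], [0.1, 0.9]]),
                       np.array([1.0, 0.0])),
    "oh":   make_model(np.array([[0.6, 0.4], [0.0, 1.0]]),
                       np.array([[0.2, 0.8], [0.8, 0.2]]),
                       np.array([1.0, 0.0])),
}

O = [0, 1, 1, 0]                                  # unknown observation sequence
print(max(models, key=lambda v: models[v](O)))    # v* = arg max_v P(O | lambda_v)
```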

Page 32

Continuous Observation Density HMMs

The most general form of pdf with a valid re-estimation procedure is:

$$b_j(x) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(x, \mu_{jm}, U_{jm}), \quad 1 \le j \le N$$

where
• x = observation vector = (x_1, x_2, ..., x_D)
• M = number of mixture densities
• c_{jm} = gain of the m-th mixture in state j
• 𝒩 = any log-concave or elliptically symmetric density (e.g., a Gaussian)
• μ_{jm} = mean vector for mixture m, state j
• U_{jm} = covariance matrix for mixture m, state j
• c_{jm} ≥ 0, 1 ≤ j ≤ N, 1 ≤ m ≤ M
• Σ_{m=1}^{M} c_{jm} = 1, 1 ≤ j ≤ N
• ∫ b_j(x) dx = 1, 1 ≤ j ≤ N
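A minimal sketch of evaluating this mixture density for one state, assuming diagonal covariances and illustrative parameter values:

```python
import numpy as np

def gaussian_diag(x, mu, var):
    # N(x; mu, diag(var)): Gaussian with a diagonal covariance matrix
    norm = np.sqrt((2 * np.pi) ** len(x) * np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def b_j(x, c, mus, vars_):
    # b_j(x) = sum_m c_jm N(x; mu_jm, U_jm), with the gains summing to one
    return sum(cm * gaussian_diag(x, mu, v) for cm, mu, v in zip(c, mus, vars_))

x = np.array([0.2, -0.1])
c = [0.6, 0.4]                                  # mixture gains c_jm
mus = [np.zeros(2), np.array([1.0, -1.0])]      # mean vectors mu_jm
vars_ = [np.ones(2), np.array([0.5, 0.5])]      # diagonal covariances U_jm
print(b_j(x, c, mus, vars_))
```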

Page 33

HMM Feature Vector Densities

Page 34

Isolated Word HMM Recognizer

Page 35

Choice of Model Parameters

1. left-right model preferable to ergodic model (speech is a left-right process)
2. number of states in range 2-40 (from sounds to frames)
   • on the order of the number of distinct sounds in the word
   • on the order of the average number of observations in the word
3. observation vectors
   • cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
   • vector quantized discrete symbols (16-256 codebook sizes)
4. constraints on b_j(O) densities
   • b_j(k) > ε for discrete densities
   • c_jm > δ, U_jm(r,r) > δ for continuous densities

Page 36

Performance vs. Number of States in Model

Page 37

HMM Segmentation for /SIX/

Page 38

Digit Recognition Using HMMs

[Figure: for an unknown spoken digit 'one', panels show the log energy, the frame likelihood scores, the frame cumulative scores and the state segmentation under the competing digit models 'one' and 'nine'.]

Page 39

Digit Recognition Using HMMs

[Figure: for an unknown spoken digit 'seven', panels show the log energy, the frame likelihood scores, the frame cumulative scores and the state segmentation under the competing digit models 'seven' and 'six'.]

Page 40
Page 41

Word Lexicon

Goal: map legal phone sequences into words according to phonotactic rules. For example:

David  /d/ /ey/ /v/ /ih/ /d/

Multiple pronunciations: several words may have multiple pronunciations. For example:

Data  /d/ /ae/ /t/ /ax/
Data  /d/ /ey/ /t/ /ax/

Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations.
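A minimal sketch of a pronunciation lexicon as a map from words to alternative phone sequences (the entries come from the slide; the lookup helper is illustrative):

```python
# Word lexicon: each word maps to one or more legal phone sequences.
lexicon = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data":  [["d", "ae", "t", "ax"],        # multiple pronunciations
              ["d", "ey", "t", "ax"]],
}

def pronunciations(word):
    return lexicon.get(word.lower(), [])

print(pronunciations("data"))
```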

Page 42

Language Model

Goal: model "acceptable" spoken phrases, constrained by task syntax.

Rule-based: deterministic grammars that are knowledge driven. For example:
  flying from $city to $city on $date

Statistical: compute estimates of word probabilities (N-gram, class-based, CFG). For example:
  flying from Newark to Boston tomorrow (with estimated word probabilities, e.g., 0.6 and 0.4, attached to competing choices)

Challenges: how do you build a language model rapidly for a new task.

Page 43

Formal Grammars

Rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. VP -> AUX VP
4. NP -> ART NP1
5. NP -> ADJ NP1
6. NP1 -> ADJ NP1
7. NP1 -> N
8. NP -> NAME
9. NP -> PRON
10. NAME -> Mary
11. V -> loves
12. ADJ -> that
13. N -> person

[Figure: parse tree for "Mary loves that person": S splits into NP (NAME: Mary) and VP (V: loves, NP (ADJ: that, NP1 (N: person))).]

Page 44

N-Grams

$$P(W) = P(w_1, w_2, \ldots, w_N) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) = \prod_{i=1}^{N} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

• Trigram estimation:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

Page 45

Example: Bigrams
(probability of a word given the previous word)

Training sentences:
I want to go from Boston to Denver tomorrow
I want to go to Denver tomorrow from Boston
Tomorrow I want to go from Boston to Denver with United
A flight on United going from Boston to Denver tomorrow
A flight on United
Boston to Denver
Going to Denver
United from Boston
Boston with United tomorrow
Tomorrow a flight to Denver
Going to Denver tomorrow with United
Boston Denver with United
A flight with United tomorrow

Estimated bigram probabilities:
I want 0.02, want to 0.01, to go 0.03, go from 0.05, go to 0.01, from Boston 0.01, Boston to 0.02, to Denver 0.01, ...

P(I want to go from Boston to Denver) ≈ 0.02 × 0.01 × 0.03 × 0.05 × 0.01 × 0.02 × 0.02 ≈ 1.2e-12

Page 46

Generalization for N-Grams (Back-off)

With raw counts, a sentence containing unseen bigrams gets zero probability:

P(I want a flight from Boston to Denver) = 0.02 × 0.0 × 0.0 × 0.0 × 0.01 × 0.02 × 0.02 = 0.0  BAD!!!

If the bigram ab was never observed, we can estimate its probability as a fraction of the probability of word b (the probability of b following a backs off to the unigram probability of b):

P(I want a flight from Boston to Denver) ≈ 0.02 × 0.003 × 0.001 × 0.002 × 0.01 × 0.02 × 0.02 ≈ 4.8e-15  GOOD
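A minimal sketch of this back-off idea with a fixed back-off weight (an assumption for illustration; real systems use discounting schemes such as Katz or Kneser-Ney):

```python
from collections import Counter

corpus = "i want to go from boston to denver".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
total = sum(unigrams.values())

def p_backoff(b, a, alpha=0.1):
    """P(b | a): relative frequency if bigram ab was seen, otherwise
    back off to a fraction of the unigram probability of b."""
    if bigrams[(a, b)]:
        return bigrams[(a, b)] / unigrams[a]
    return alpha * unigrams[b] / total          # unseen bigram: scaled P(b)

print(p_backoff("denver", "to"))      # seen: C(to, denver) / C(to) = 0.5
print(p_backoff("boston", "denver"))  # unseen bigram: alpha * P(boston)
```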

Page 47

Pattern Classification

Goal: combine information (probabilities) from the acoustic model, language model and word lexicon to generate an "optimal" word sequence (highest probability).

Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenges: how do we build efficient structures (FSMs) for decoding and searching large vocabulary, complex language model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states, 14 orders of magnitude more efficient
• what is the theoretical limit of efficiency that can be achieved?

Page 48

Unlimited Vocabulary ASR

• the basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:

Features -> HMM states -> HMM units -> Phones -> Words -> Sentences

• for the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• state-of-the-art methods, including fast match, multi-pass decoding, A* stack and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.

Page 49

Weighted Finite State Transducers (WFST)

• a unified mathematical framework for ASR
• efficiency in time and space

[Figure: transducers at each level (State:HMM, HMM:Phone, Phone:Word, Word:Phrase) are combined and optimized into a single search network.]

WFST methods can compile the network to 10^8 states, 14 orders of magnitude more efficient.

Page 50

Weighted Finite State Transducer

Word pronunciation transducer for "data":

[Figure: a chain d:ε/1 -> {ey:ε/.4 or ae:ε/.6} -> {dx:ε/.8 or t:ε/.2} -> ax:"data"/1, encoding the pronunciation variants of "data" with their weights.]

Page 51

Algorithmic Speed-up for Speech Recognition

[Figure: relative speed vs. year (1994-2002) for the North American Business task (vocabulary: 40,000 words; branching factor: 85). AT&T algorithmic speed-ups outpace both the speech recognition community at large and Moore's law (hardware).]

Page 52

Confidence Scoring

Goal: identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.

Method: a confidence score based on a hypothesis test is associated with each recognized word, using the posterior

$$P(W \mid X) = \frac{P(X \mid W)\,P(W)}{\sum_{W'} P(X \mid W')\,P(W')}$$

For example:
Label:      credit please
Recognized: credit fees
Confidence: (0.9)  (0.3)

Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.

Page 53

Rejection

Problem: extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.

Measure of confidence: associating word strings with a verification cost that provides an effective measure of confidence (utterance verification).

Effect: improvement in the performance of the recognizer, the understanding system and the dialogue manager.

Page 54

State-of-the-Art Performance

[Figure: the recognition pipeline again: input speech passes through feature extraction, pattern classification (decoding, search) with the acoustic model, language model and word lexicon, and confidence scoring, producing the recognized sentence; ASR sits inside the dialog circle with SLU, DM, SLG and TTS.]

Page 55

How to Evaluate Performance?

• Dictation applications: insertions, substitutions and deletions

$$\text{Word Error Rate} = \frac{\#\text{Subs} + \#\text{Dels} + \#\text{Ins}}{\text{no. of words in the correct sentence}} \times 100\%$$

• Command-and-control: false rejection and false acceptance => ROC curves.
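A minimal sketch of word error rate via edit distance (the standard dynamic-programming alignment, shown here for illustration):

```python
def word_error_rate(ref, hyp):
    """WER = (subs + dels + ins) / len(ref), via Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        for j in range(len(h) + 1):
            if i == 0:   d[i][j] = j                  # all insertions
            elif j == 0: d[i][j] = i                  # all deletions
            else:
                sub = d[i-1][j-1] + (r[i-1] != h[j-1])
                d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(word_error_rate("credit please", "credit fees"))  # 50.0
```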

Page 56

Word Error Rates

CORPUS                                     | TYPE                     | VOCABULARY SIZE    | WORD ERROR RATE
Connected Digit Strings (TI Database)      | Spontaneous              | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings (Mall Recordings)  | Spontaneous              | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings (HMIHY)            | Conversational           | 11 (zero-nine, oh) | 5.0%
RM (Resource Management)                   | Read Speech              | 1,000              | 2.0%
ATIS (Airline Travel Information System)   | Spontaneous              | 2,500              | 2.5%
NAB (North American Business)              | Read Text                | 64,000             | 6.6%
Broadcast News                             | News Show                | 210,000            | 13-17%
Switchboard                                | Conversational Telephone | 45,000             | 25-29%
Call Home                                  | Conversational Telephone | 28,000             | 40%

(Note the factor of 17 increase in digit error rate from read TI digit strings to conversational HMIHY digit strings.)

Page 57
Page 58

NIST Benchmark Performance

[Figure: word error rate vs. year for the Resource Management, ATIS and NAB benchmarks.]

Page 59

North American Business

vocabulary: 40,000 words; branching factor: 85

Page 60

Algorithmic Accuracy for Speech Recognition

[Figure: word accuracy vs. year (1996-2002) for the Switchboard/Call Home task (vocabulary: 40,000 words; perplexity: 85), rising from roughly 30-40% toward 70-80%.]

Page 61

Growth in Effective Recognition Vocabulary Size

[Figure: vocabulary size (log scale, from about 10 to 1,000,000 words) vs. year, 1960-2010.]

Page 62

Human Speech Recognition vs. ASR

[Figure: machine error rate (%) vs. human error rate (%) on log-log axes for Digits, RM-LM, RM-null, NAB-mic, NAB-omni, SWBD, WSJ and WSJ-22dB, with reference lines at x1, x10 and x100. Machines are roughly 10-100 times less accurate than humans; the region below the x1 line is where machines would outperform humans.]

Page 63

Challenges in ASR

System performance:
- accuracy
- efficiency (speed, memory)
- robustness

Operational performance:
- end-point detection
- user barge-in
- utterance rejection
- confidence scoring

Machines are 10-100 times less accurate than humans.

Page 64

Large-Vocabulary ASR Demo

Courtesy of Murat Saraclar

Page 65

Voice-Enabled System Technology Components

[Figure: the dialog circle: speech enters ASR (words spoken), SLU extracts meaning, DM decides the action using data and rules, SLG generates words, and TTS returns speech.]

Page 66

Spoken Language Understanding (SLU)

• Goal: interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
  – accurate understanding can often be achieved without correctly recognizing every word
  – SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: exploit the task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning
• Performance evaluation: accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
• Challenges: what goes beyond simple classification systems but below full natural language voice dialogue systems?

Page 67

Knowledge Sources for Speech Understanding

[Figure: knowledge sources aligned with system components. Acoustic/phonetic (relationship of speech sounds and English phonemes) -> acoustic model; phonotactic (rules for phoneme sequences and pronunciation) -> word lexicon; syntactic (structure of words and phrases in a sentence) -> language model; semantic (relationship among words and meanings) -> understanding model; pragmatic (discourse, interaction history, world knowledge) -> dialog manager. ASR covers the first three, SLU the semantic, DM the pragmatic.]

Page 68

DARPA Communicator

• DARPA sponsored research and development of mixed-initiative dialogue systems
• travel task involving airline, hotel and car information and reservation:
  "Yeah I uh I would like to go from New York to Boston tomorrow night with United"
• SLU output (concept decoding):
  Topic: Itinerary
  Origin: New York
  Destination: Boston
  Day of the week: Sunday
  Date: May 25th, 2002
  Time: >6pm
  Airline: United
• XML schema:

<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>

Page 69

Customer Care IVR and HMIHY

[Figure: side-by-side call flows. The conventional customer care IVR passes through a sparkle tone and greeting ("Thank you for calling AT&T..."), a LEC misdirect announcement, an account verification routine, a main menu, an LD sub-menu and a reverse directory routine, with per-stage delays ranging from about 10 to 58 seconds; total time to reach reverse directory lookup: 2:55 minutes!!! HMIHY opens with "AT&T, How may I help you?" and reaches the same lookup in 28 seconds!!!]

Page 70

HMIHY: How Does It Work

• the prompt is "AT&T. How may I help you?"
• the user responds with totally unconstrained fluent speech
• the system recognizes the words and determines the meaning of the user's speech, then routes the call
• dialog technology enables task completion

[Figure: HMIHY routes calls to destinations such as Account Balance, Calling Plans, Local, Unrecognized Number, ...]

Page 71

HMIHY Example Dialogs

Example dialog types:
• irate customer
• rate plan
• account balance
• local service
• unrecognized number
• threshold billing
• billing credit

Customer satisfaction:
• decreased repeat calls (37%)
• decreased 'OUTPIC' rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Page 72

Customer Care Scenario

Page 73

Voice-Enabled System Technology Components

[Figure: the same dialog circle of ASR, SLU, DM, SLG and TTS components.]

Page 74

Dialog Management (DM)

• Goal: combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
  – DM makes viable complex services that require multiple exchanges between the system and the customer
  – dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well understood goal or system interaction
• Performance evaluation: speed and accuracy of attaining a well defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: is there a science of dialogues: how do you keep a dialog efficient (turns, time, progress towards a goal) and how do you attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues: sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak; multimodal interactions with machines.

Page 75

Dialog Strategies

• Completion (continuation): elicits missing input information from the user.
• Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved.
• Relaxation: increases the scope of the request when no information has been retrieved.
• Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure.
• Reprompting: used when the system expected input but did not receive any or did not understand what it received.
• Context help: provides the user with help during periods of misunderstanding of either the system or the user.
• Greeting/closing: maintains social protocol at the beginning and end of an interaction.
• Mixed-initiative: allows users to manage the dialog.

Page 76

Mixed-Initiative Dialog

Who manages the dialog?

[Figure: an initiative spectrum from system to user. System initiative: "Please say collect, calling card, third number." User initiative: "How may I help you?"]

Page 77

Example: Mixed-Initiative Dialog

• System initiative (long dialogs, but easier to design)
  System: Please say just your departure city.
  User: Chicago
  System: Please say just your arrival city.
  User: Newark

• Mixed initiative
  System: Please say your departure city.
  User: I need to travel from Chicago to Newark tomorrow.

• User initiative (shorter dialogs and a better user experience, but more difficult to design)
  System: How may I help you?
  User: I need to travel from Chicago to Newark tomorrow.

Page 78

Voice-Enabled System Technology Components

[Figure: the dialog circle with the data flow labeled: words spoken enter ASR; SLU extracts the meaning; DM decides the action; SLG produces the words to be synthesized; TTS speaks them.]

Page 79

Building Good Speech-Based Applications

• Good user interfaces—make the application easy-to-use and robust to the kinds of confusion that arise in human-machine communications by voice
• Good models of dialog—keep the conversation moving forward, even in periods of great uncertainty on the parts of either the user or the machine
• Matching the task to the technology—be realistic about the capabilities of the technology
  – fail-soft methods
  – error correcting techniques (data dips)

Page 80

User Interface

• Good user interfaces
  – make the application easy-to-use and robust to the kinds of confusion that arise in human-machine communications by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the parts of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU; but the UI can make or break a system, even with excellent ASR & NLU
  – effective UI design is based on a set of elementary principles and common widgets; sequenced screen presentations; simple error-trap dialog; a user manual

Page 81

Multimodal System Technology Components

[Figure: the dialog circle extended with visual, pen and gesture input alongside speech; ASR, SLU, DM, SLG and TTS as before.]

Page 82

Multimodal Experience

• users need to have access to automated dialog applications at anytime and anywhere, using any access device
• selection of the "most appropriate" UI mode or combination of modes depends on the device, the task, the environment, and the user's abilities and preferences

Page 83

Microsoft MiPad

• Multimodal Interactive Pad
• usability studies show double throughput for English
• speech is mostly useful in cases with lots of alternatives

Page 84

MiPad Video

Page 85

AT&T MATCH: Multimodal Access To City Help

• access to information through a voice interface, gesture and GPS
• multimodal integration: combination of speech, gesture and meaning using finite state technology

"Are there any cheap Italian places in this neighborhood?"

Page 86

MATCH Video

Courtesy of Michael Johnston

Page 87

The Speech Advantage

• Reduce costs
  – reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
  – 24x7 high-quality customer care automation
  – access to information without a keyboard or touch-tones
• Customer retention
  – provide personal services for customer preferences
  – improve customer experience

Page 88

Voice-Enabled Services

• Desktop applications—dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, ...)
• Agent technology—simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice portals—"convert any web page to a voice-enabled site", where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
• E-contact services—call centers, customer care (HMIHY) and help desks where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics—command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices—control of cellphones and PDAs from voice commands

Page 89

Milestones in Speech and Multimodal Technology Research

[Figure: timeline, 1962-2002. Small vocabulary, acoustic phonetics-based (isolated words): filter-bank analysis, time normalization, dynamic programming. Medium vocabulary, template-based (isolated words, connected digits, continuous speech): pattern recognition, LPC analysis, clustering algorithms, level building. Large vocabulary, statistical-based (connected words, continuous speech): hidden Markov models, stochastic language modeling. Large vocabulary with syntax and semantics (continuous speech, speech understanding): stochastic language understanding, finite-state machines, statistical learning. Very large vocabulary with semantics, multimodal dialog and TTS (spoken dialog, multiple modalities): concatenative synthesis, machine learning, mixed-initiative dialog.]

Page 90

Future of Speech Recognition Technologies

[Figure: timeline, 2002-2011. Very large vocabulary, limited tasks, controlled environment (robust systems) -> very large vocabulary, limited tasks, arbitrary environment (dialog systems) -> unlimited vocabulary, unlimited tasks, many languages (multilingual systems; multimodal, speech-enabled devices).]

Page 91

We've Come a Long Way -- Or Have We?