Digital Speech Processing — Lecture 19
Techniques for Speech and Natural Language Recognition and Understanding



Page 1

Digital Speech Processing — Lecture 19

Techniques for Speech and Natural Language Recognition and Understanding

Page 2

Speech Recognition -- 2001 (Stanley Kubrick View in 1968)

Page 3

Apple Navigator -- 1988

Page 4

The Speech Advantage

• Reduce costs
  – reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
  – 24x7 high-quality customer care automation
  – access to information without a keyboard or touch-tones
• Customer retention
  – provide personal services for customer preferences
  – improve customer experience

Page 5

The Speech Dialog Circle

[Figure: the speech dialog circle. Speech from the customer enters ASR (Automatic Speech Recognition), which produces the words spoken ("I dialed a wrong number"). SLU (Spoken Language Understanding) maps the words to a meaning ("Billing credit"); DM (Dialog Management) uses data and rules to decide what's next, i.e., the action ("Determine correct number"); SLG (Spoken Language Generation) produces the words to say ("What number did you want to call?"); and TTS (Text-to-Speech Synthesis) returns a voice reply to the customer.]

Page 6

Automatic Speech Recognition

• Goal: accurately and efficiently convert a speech signal into a text message, independent of the device, speaker or the environment.

• Applications: automation of complex operator-based tasks, e.g., customer care, dictation, form filling applications, provisioning of new services, customer help lines, e-commerce, etc.

Page 7

A Brief History of Speech Recognition

[Figure: speech recognition accuracy vs. time. Early systems relied on heuristics, handcrafted rules and local optimization; later systems moved to mathematical formalization, global optimization and automatic learning from data (HMMs). Cartoon caption: "Christopher C. discovers a new continent."]

Page 8

Basic ASR Formulation (Bayes Method)

[Figure: the speaker's intention W passes through the speech production mechanisms to give the acoustic signal s(n) (the speaker model); the acoustic processor converts s(n) into features X, and the linguistic decoder produces the recognized string Ŵ (the speech recognizer).]

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P_A(X \mid W)\,P_L(W)$$

Step 1 provides the acoustic model $P_A(X \mid W)$, Step 2 the language model $P_L(W)$, and Step 3 the search for the arg max.

Page 9

Steps in Speech Recognition

Step 1 - Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (Hidden Markov Models) of acoustic signals and words.

Step 2 - Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.

Step 3 - Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the arg max over W.
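As a minimal sketch of how the three steps combine (the two scoring functions below are hypothetical toy stand-ins, not a real recognizer), the Bayes decision rule is a log-domain arg max over candidate word sequences:

```python
# Toy illustration of W_hat = arg max_W P_A(X|W) * P_L(W).
# A real system computes log P_A with HMMs (Step 1) and log P_L with
# an N-gram model (Step 2) over a vastly larger hypothesis space.
def log_acoustic_score(X, W):      # stand-in for log P_A(X | W)
    return -1.0 * len(W)

def log_lm_score(W):               # stand-in for log P_L(W)
    return -0.5 * len(W)

def recognize(X, candidates):
    # Step 3: search for the MAP word sequence over all candidates
    return max(candidates,
               key=lambda W: log_acoustic_score(X, W) + log_lm_score(W))

print(recognize("features", [("call", "home"), ("call", "grandma", "now")]))
```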

Page 10

Step 1 - The Acoustic Model

• we build acoustic models by learning statistics of the acoustic features X from a training set, where we compute the variability of the acoustic features during the production of the sounds represented by the models
• it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language; it requires too much training data for words in every possible context
• instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
• similarly we build sentences (sequences of words) by concatenating word models

Page 11

Step 2 - The Language Model

• the language model describes the probability of a sequence of words that form a valid sentence in the language
• a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words, namely an N-gram language model:

$$P_L(W) = P_L(w_1, w_2, \ldots, w_k) = \prod_{n=1}^{k} P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})$$

where $P_L(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N})$ is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
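A minimal sketch of the counting estimate for a bigram (N = 2) model, assuming a toy one-sentence corpus; real language models are trained on far larger text and need smoothing:

```python
from collections import Counter

corpus = "i want to go from boston to denver".split()

bigrams = Counter(zip(corpus, corpus[1:]))      # C(w_{n-1}, w_n)
histories = Counter(corpus[:-1])                # C(w_{n-1})

def p_bigram(w, prev):
    # Relative-frequency estimate P_L(w_n | w_{n-1}) = C(prev, w) / C(prev)
    return bigrams[(prev, w)] / histories[prev] if histories[prev] else 0.0

print(p_bigram("to", "go"))     # C(go, to) / C(go) = 1/1 = 1.0
print(p_bigram("go", "to"))     # C(to, go) / C(to) = 1/2 = 0.5
```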

Page 12

Step 3 - The Search Problem

• the search problem is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax and the task constraints, to find the word sequence with the maximum likelihood
• the size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods
• the use of methods from the field of finite state automata theory provides finite state networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems

Page 13

Speech Recognition Processes

1. Choose the task => sounds, word vocabulary, task syntax (grammar), task semantics
   • Example: isolated digit recognition task
     – sounds to be recognized—whole words
     – word vocabulary—zero, oh, one, two, ..., nine
     – task syntax—any digit is allowed
     – task semantics—sequence of isolated digits must form a valid telephone number

Page 14

Speech Recognition Processes

2. Train the Models => create a method for building acoustic word models from a speech training data set, a word lexicon, a word grammar (language model), and a task grammar from a text training data set
   – speech training set—100 people each speaking the 11 digits 10 times in isolation
   – text data training set—listing of valid telephone numbers (or equivalently an algorithm that generates valid telephone numbers)

Page 15

Speech Recognition Processes

3. Evaluate performance => determine word error rate and task error rate for the recognizer
   – speech testing data set—25 new people each speaking 10 telephone numbers as sequences of isolated digits
   – evaluate digit error rate and telephone number error rate

4. Testing algorithm => method for evaluating recognizer performance from the testing set of speech utterances

Page 16

Speech Recognition Process

[Figure: block diagram. Input speech s(n) passes through feature analysis (spectral analysis) to give features X_n; pattern classification (decoding, search) combines the acoustic model (HMM), the language model (N-gram) and the word lexicon to produce the word string W; confidence scoring (utterance verification) attaches word confidences, e.g., "Hello World" (0.9) (0.8), yielding the final output Ŵ. The ASR box sits inside the larger dialog circle with SLU, DM, SLG and TTS.]

Page 17

Feature Extraction

Goal: extract robust features (information) from the speech that are relevant for ASR.

Method: spectral analysis, through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).

Result: signal compression, where for each window of speech samples about 30 cepstral features are extracted (64,000 b/s -> 5,200 b/s).

Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialect, style, speaking defects), noise and echo. Feature set for recognition: cepstral features or those from a high dimensionality space.
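A minimal numpy sketch of the cepstral computation for one analysis window (no mel warping, filter bank or liftering; an illustration of the log-spectrum-then-inverse-transform idea, not a production front end):

```python
import numpy as np

def real_cepstrum(frame, n_ceps=13):
    """One analysis window: Hamming window -> |FFT| -> log -> inverse FFT.
    Keeps the low-order (low-quefrency) cepstral coefficients."""
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # avoid log(0)
    return np.fft.irfft(log_spectrum)[:n_ceps]

frame = np.random.randn(400)        # e.g., 25 ms of speech at 16 kHz
print(real_cepstrum(frame).shape)   # (13,)
```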

Page 18

What Features to Use?

Short-time spectral analysis:
Acoustic features:
- mel cepstrum (LPC, filterbank, wavelets)
- formant frequencies, pitch, prosody
- zero-crossing rate, energy
Acoustic-phonetic features:
- manner of articulation (e.g., stop, nasal, voiced)
- place of articulation (labial, dental, velar)
Articulatory features:
- tongue position, jaw, lips, velum
Auditory features:
- ensemble interval histogram (EIH), synchrony

Temporal analysis: approximation of the velocity and acceleration, typically through first and second order central differences.

Page 19

Feature Extraction Process

[Figure: front-end chain. s(t) -> sampling and quantization -> noise removal, filtering, normalization -> preemphasis (α) -> segmentation (blocking into frames; parameters M, N) -> windowing W(n) -> spectral analysis (with side outputs: pitch, formants, energy, zero-crossing) -> cepstral analysis c_m(l) -> equalization (bias removal or normalization) c'_m(l) -> temporal derivatives (delta cepstrum Δc'_m(l) and delta² cepstrum).]

Page 20

Robustness

Problem: a mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.

Methods: traditional techniques for improving system robustness are based on signal enhancement, feature normalization and/or model adaptation.

Perception approach: extract fundamental acoustic information in narrow bands of speech; robust integration of features across time and frequency.

Page 21

Methods for Robust Speech Recognition

[Figure: the training and testing chains (signal -> features -> model) can be matched at three points: signal enhancement, feature normalization and model adaptation.]

Page 22

Acoustic Model

Goal: map acoustic features into distinct phonetic labels (e.g., /s/, /aa/) or word labels.

Hidden Markov Model (HMM): a statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with a phone or a word. HMMs are also assigned for modeling extraneous events.

Advantages: a powerful statistical method for dealing with a wide range of data and reliably recognizing speech.

Challenges: understanding the role of classification models (ML training) versus discriminative models (MMI training); what comes after the HMM: are there data-driven models that work better for some or all vocabularies?

Page 23

Hidden Markov Models

[Figure: a 3-state ergodic HMM whose observations are the symbols up / down / unchanged. Each state carries an output probability vector (P(up), P(down), P(unchanged)), with values such as (0.7, 0.1, 0.2), (0.1, 0.6, 0.3) and (0.5, 0.2, 0.3), and transition probabilities such as (0.3, 0.3, 0.4).]

Ergodic model: can get to any state from any other state.

States are hidden: we observe the effects, not the states.

Page 24

HMM for Speech

• phone model 's': three states s1, s2, s3 in sequence
• word model 'is' (ih-z): concatenate the phone models for 'ih' (states i1, i2, i3) and 'z' (states z1, z2, z3)

Page 25

Isolated Word HMM

[Figure: a 5-state left-right HMM with self-loops a11 ... a55, forward transitions a12, a23, a34, a45, skip transitions a13, a24, a35, and state observation densities b1(Ot) ... b5(Ot).]

Left-right HMM: highly constrained state sequences.
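A minimal sketch of the transition matrix for such a left-right model, assuming illustrative probability values (self-loop, step and skip transitions only; every row sums to one):

```python
import numpy as np

N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = 0.5                        # self-loop a_ii
    if i + 1 < N: A[i, i + 1] = 0.3      # forward transition a_{i,i+1}
    if i + 2 < N: A[i, i + 2] = 0.2      # skip transition a_{i,i+2}
A[-1, -1] = 1.0                          # absorbing final state
A[-2, -2], A[-2, -1] = 0.5, 0.5          # renormalize next-to-last row

assert np.allclose(A.sum(axis=1), 1.0)   # valid stochastic matrix
print(A)
```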

Page 26

Basic Problems in HMMs

Given acoustic observation X and model Φ:
• Evaluation: compute P(X | Φ)
• Decoding: choose the optimal state sequence
• Re-estimation: adjust Φ to maximize P(X | Φ)

Page 27

Evaluation: The Baum-Welch Algorithm

Forward recursion:
$$\alpha_t(i) = P(X_1^t, s_t = i \mid \Phi) = \left[\sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\right] b_i(X_t), \quad 2 \le t \le T,\ 1 \le i \le N$$

Backward recursion:
$$\beta_t(j) = P(X_{t+1}^T \mid s_t = j, \Phi) = \sum_{i=1}^{N} a_{ji}\, b_i(X_{t+1})\, \beta_{t+1}(i), \quad t = T-1, \ldots, 1,\ 1 \le j \le N$$

Transition posterior:
$$\gamma_t(i,j) = P(s_{t-1} = i, s_t = j \mid X_1^T, \Phi) = \frac{P(s_{t-1} = i, s_t = j, X_1^T \mid \Phi)}{P(X_1^T \mid \Phi)} = \frac{\alpha_{t-1}(i)\, a_{ij}\, b_j(X_t)\, \beta_t(j)}{\sum_{k=1}^{N} \alpha_T(k)}$$

[Figure: trellis over states s1 ... sN and times t-2 ... t+1, showing α_{t-1}(i) flowing into state j through a_{ij} b_j(X_t) and continuing as β_t(j).]
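A minimal numpy sketch of the forward recursion above (a toy 2-state model with discrete outputs; the parameter values are illustrative assumptions):

```python
import numpy as np

A = np.array([[0.7, 0.3],        # a_ji: state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # b_i(x): P(symbol x | state i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])        # initial state distribution

def forward(X):
    """alpha recursion; returns P(X | model) = sum_i alpha_T(i)."""
    alpha = pi * B[:, X[0]]
    for x in X[1:]:
        # sum_j alpha_{t-1}(j) a_ji, then multiply by b_i(x_t)
        alpha = (alpha @ A) * B[:, x]
        # (real systems scale alpha each frame to avoid underflow)
    return alpha.sum()

print(forward([0, 1, 1, 0]))
```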

Page 28

Decoding: The Viterbi Algorithm

[Figure: trellis over states s1 ... sM and frames x1 ... xT showing the optimal alignment between X and S. Each cell keeps the best path score V_t(j), computed from V_{t-1}(j-1) a_{j-1,j}, V_{t-1}(j) a_{j,j} and V_{t-1}(j+1) a_{j+1,j}, multiplied by b_j(x_t).]
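A minimal Viterbi sketch matching the trellis above, reusing the same hypothetical toy model as the forward example:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # illustrative transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # illustrative output probabilities
pi = np.array([0.5, 0.5])

def viterbi(X):
    """Best state sequence: V_t(j) = max_i V_{t-1}(i) a_ij, times b_j(x_t)."""
    V = np.log(pi) + np.log(B[:, X[0]])       # log domain avoids underflow
    back = []
    for x in X[1:]:
        scores = V[:, None] + np.log(A)       # V_{t-1}(i) + log a_ij
        back.append(scores.argmax(axis=0))    # best predecessor for each state
        V = scores.max(axis=0) + np.log(B[:, x])
    path = [int(V.argmax())]
    for ptr in reversed(back):                # backtrace the optimal alignment
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

print(viterbi([0, 1, 1, 0]))
```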

Page 29

Reestimation: Training Algorithm

• find the model parameters Φ = (A, B, π) that maximize the probability P(X | Φ) of observing some data X:

$$P(X \mid \Phi) = \sum_{S} P(X, S \mid \Phi) = \sum_{S} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(X_t)$$

• there is no closed-form solution, and thus we use the EM algorithm:
  – given Φ, we compute the model parameters Φ̂ that maximize the following auxiliary function Q:

$$Q(\Phi, \hat{\Phi}) = \sum_{\text{all } S} \frac{P(X, S \mid \Phi)}{P(X \mid \Phi)} \log P(X, S \mid \hat{\Phi}) = Q_a(\Phi, \hat{a}_{ij}) + Q_b(\Phi, \hat{b}_j)$$

  – the update is guaranteed to have increased (or the same) likelihood

Page 30

Training Procedure (Segmental K-Means)

Several iterations of the process:

[Figure: the input speech database and the old (initial) HMM model feed a step that estimates the posteriors ζ_t(j,k) and γ_t(i,j); a maximization step then updates the parameters a_ij, c_jk, μ_jk, Σ_jk to give the updated HMM model, which is fed back for the next iteration.]

Page 31

Isolated Word Recognition Using HMMs

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:

1. for each word v in the vocabulary, we must build an HMM λ_v, i.e., we must re-estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for the v-th word (TRAINING)
2. for each unknown word which is to be recognized, we do the following:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ_v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest:
      v* = arg max_{1 ≤ v ≤ V} P(O | λ_v)

Computation on the order of V · N² · T is required; for V = 100, N = 5, T = 40, about 10^5 computations.
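A minimal sketch of step 2, assuming one forward-likelihood evaluator per vocabulary word; the parameter values below are illustrative, not trained:

```python
import numpy as np

def make_model(A, B, pi):
    # Each vocabulary word v gets its own HMM lambda_v; the closure
    # evaluates P(O | lambda_v) with the forward algorithm.
    def likelihood(O):
        alpha = pi * B[:, O[0]]
        for o in O[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()
    return likelihood

models = {
    "zero": make_model(np.array([[0.8, 0.2], [0.0, 1.0]]),
                       np.array([[0.9, 0.1], [0.1, 0.9]]),
                       np.array([1.0, 0.0])),
    "oh":   make_model(np.array([[0.6, 0.4], [0.0, 1.0]]),
                       np.array([[0.2, 0.8], [0.8, 0.2]]),
                       np.array([1.0, 0.0])),
}

O = [0, 1, 1, 0]                                  # unknown observation sequence
print(max(models, key=lambda v: models[v](O)))    # v* = arg max_v P(O | lambda_v)
```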

Page 32

Continuous Observation Density HMMs

The most general form of pdf with a valid re-estimation procedure is:

$$b_j(x) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(x, \mu_{jm}, U_{jm}), \quad 1 \le j \le N$$

where
• x = observation vector = (x_1, x_2, ..., x_D)
• M = number of mixture densities
• c_{jm} = gain of the m-th mixture in state j
• 𝒩 = any log-concave or elliptically symmetric density (e.g., a Gaussian)
• μ_{jm} = mean vector for mixture m, state j
• U_{jm} = covariance matrix for mixture m, state j
• c_{jm} ≥ 0, 1 ≤ j ≤ N, 1 ≤ m ≤ M
• Σ_{m=1}^{M} c_{jm} = 1, 1 ≤ j ≤ N
• ∫ b_j(x) dx = 1, 1 ≤ j ≤ N
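A minimal sketch of evaluating this mixture density for one state, assuming diagonal covariances and illustrative parameter values:

```python
import numpy as np

def gaussian_diag(x, mu, var):
    # N(x; mu, diag(var)): Gaussian with a diagonal covariance matrix
    norm = np.sqrt((2 * np.pi) ** len(x) * np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def b_j(x, c, mus, vars_):
    # b_j(x) = sum_m c_jm N(x; mu_jm, U_jm), with the gains summing to one
    return sum(cm * gaussian_diag(x, mu, v) for cm, mu, v in zip(c, mus, vars_))

x = np.array([0.2, -0.1])
c = [0.6, 0.4]                                  # mixture gains c_jm
mus = [np.zeros(2), np.array([1.0, -1.0])]      # mean vectors mu_jm
vars_ = [np.ones(2), np.array([0.5, 0.5])]      # diagonal covariances U_jm
print(b_j(x, c, mus, vars_))
```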

Page 33

HMM Feature Vector Densities

Page 34

Isolated Word HMM Recognizer

Page 35

Choice of Model Parameters

1. left-right model preferable to ergodic model (speech is a left-right process)
2. number of states in range 2-40 (from sounds to frames)
   • on the order of the number of distinct sounds in the word
   • on the order of the average number of observations in the word
3. observation vectors
   • cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
   • vector quantized discrete symbols (16-256 codebook sizes)
4. constraints on b_j(O) densities
   • b_j(k) > ε for discrete densities
   • c_jm > δ, U_jm(r,r) > δ for continuous densities

Page 36

Performance vs. Number of States in Model

Page 37

HMM Segmentation for /SIX/

Page 38

Digit Recognition Using HMMs

[Figure: for an unknown spoken digit 'one', panels show the log energy, the frame likelihood scores, the frame cumulative scores and the state segmentation under the competing digit models 'one' and 'nine'.]

Page 39

Digit Recognition Using HMMs

[Figure: for an unknown spoken digit 'seven', panels show the log energy, the frame likelihood scores, the frame cumulative scores and the state segmentation under the competing digit models 'seven' and 'six'.]

Page 40
Page 41

Word Lexicon

Goal: map legal phone sequences into words according to phonotactic rules. For example:

David  /d/ /ey/ /v/ /ih/ /d/

Multiple pronunciations: several words may have multiple pronunciations. For example:

Data  /d/ /ae/ /t/ /ax/
Data  /d/ /ey/ /t/ /ax/

Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations.
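A minimal sketch of a pronunciation lexicon as a map from words to alternative phone sequences (the entries come from the slide; the lookup helper is illustrative):

```python
# Word lexicon: each word maps to one or more legal phone sequences.
lexicon = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data":  [["d", "ae", "t", "ax"],        # multiple pronunciations
              ["d", "ey", "t", "ax"]],
}

def pronunciations(word):
    return lexicon.get(word.lower(), [])

print(pronunciations("data"))
```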

Page 42

Language Model

Goal: model "acceptable" spoken phrases, constrained by task syntax.

Rule-based: deterministic grammars that are knowledge driven. For example:
  flying from $city to $city on $date

Statistical: compute estimates of word probabilities (N-gram, class-based, CFG). For example:
  flying from Newark to Boston tomorrow (with estimated word probabilities, e.g., 0.6 and 0.4, attached to competing choices)

Challenges: how do you build a language model rapidly for a new task.

Page 43

Formal Grammars

Rewrite rules:
1. S -> NP VP
2. VP -> V NP
3. VP -> AUX VP
4. NP -> ART NP1
5. NP -> ADJ NP1
6. NP1 -> ADJ NP1
7. NP1 -> N
8. NP -> NAME
9. NP -> PRON
10. NAME -> Mary
11. V -> loves
12. ADJ -> that
13. N -> person

[Figure: parse tree for "Mary loves that person": S splits into NP (NAME: Mary) and VP (V: loves, NP (ADJ: that, NP1 (N: person))).]

Page 44

N-Grams

$$P(W) = P(w_1, w_2, \ldots, w_N) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) = \prod_{i=1}^{N} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

• Trigram estimation:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

Page 45

Example: Bigrams
(probability of a word given the previous word)

Training sentences:
I want to go from Boston to Denver tomorrow
I want to go to Denver tomorrow from Boston
Tomorrow I want to go from Boston to Denver with United
A flight on United going from Boston to Denver tomorrow
A flight on United
Boston to Denver
Going to Denver
United from Boston
Boston with United tomorrow
Tomorrow a flight to Denver
Going to Denver tomorrow with United
Boston Denver with United
A flight with United tomorrow

Estimated bigram probabilities:
I want 0.02, want to 0.01, to go 0.03, go from 0.05, go to 0.01, from Boston 0.01, Boston to 0.02, to Denver 0.01, ...

P(I want to go from Boston to Denver) ≈ 0.02 × 0.01 × 0.03 × 0.05 × 0.01 × 0.02 × 0.02 ≈ 1.2e-12

Page 46

Generalization for N-Grams (Back-off)

With raw counts, a sentence containing unseen bigrams gets zero probability:

P(I want a flight from Boston to Denver) = 0.02 × 0.0 × 0.0 × 0.0 × 0.01 × 0.02 × 0.02 = 0.0  BAD!!!

If the bigram ab was never observed, we can estimate its probability as a fraction of the probability of word b (the probability of b following a backs off to the unigram probability of b):

P(I want a flight from Boston to Denver) ≈ 0.02 × 0.003 × 0.001 × 0.002 × 0.01 × 0.02 × 0.02 ≈ 4.8e-15  GOOD
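A minimal sketch of this back-off idea with a fixed back-off weight (an assumption for illustration; real systems use discounting schemes such as Katz or Kneser-Ney):

```python
from collections import Counter

corpus = "i want to go from boston to denver".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
total = sum(unigrams.values())

def p_backoff(b, a, alpha=0.1):
    """P(b | a): relative frequency if bigram ab was seen, otherwise
    back off to a fraction of the unigram probability of b."""
    if bigrams[(a, b)]:
        return bigrams[(a, b)] / unigrams[a]
    return alpha * unigrams[b] / total          # unseen bigram: scaled P(b)

print(p_backoff("denver", "to"))      # seen: C(to, denver) / C(to) = 0.5
print(p_backoff("boston", "denver"))  # unseen bigram: alpha * P(boston)
```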

Page 47

Pattern Classification

Goal: combine information (probabilities) from the acoustic model, language model and word lexicon to generate an "optimal" word sequence (highest probability).

Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.

Challenges: how do we build efficient structures (FSMs) for decoding and searching large vocabulary, complex language model tasks?
• features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
• FSM methods can compile the network to 10^8 states, 14 orders of magnitude more efficient
• what is the theoretical limit of efficiency that can be achieved?

Page 48

Unlimited Vocabulary ASR

• the basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping:

Features -> HMM states -> HMM units -> Phones -> Words -> Sentences

• for the WSJ 20,000-word vocabulary, this results in a network of 10^22 bytes!
• state-of-the-art methods, including fast match, multi-pass decoding, A* stack and finite-state transducers, all provide tremendous speed-up by searching through the network and finding the best path that maximizes the likelihood function.

Page 49

Weighted Finite State Transducers (WFST)

• a unified mathematical framework for ASR
• efficiency in time and space

[Figure: transducers at each level (State:HMM, HMM:Phone, Phone:Word, Word:Phrase) are combined and optimized into a single search network.]

WFST methods can compile the network to 10^8 states, 14 orders of magnitude more efficient.

Page 50

Weighted Finite State Transducer

Word pronunciation transducer for "data":

[Figure: a chain d:ε/1 -> {ey:ε/.4 or ae:ε/.6} -> {dx:ε/.8 or t:ε/.2} -> ax:"data"/1, encoding the pronunciation variants of "data" with their weights.]

Page 51

Algorithmic Speed-up for Speech Recognition

[Figure: relative speed vs. year (1994-2002) for the North American Business task (vocabulary: 40,000 words; branching factor: 85). AT&T algorithmic speed-ups outpace both the speech recognition community at large and Moore's law (hardware).]

Page 52

Confidence Scoring

Goal: identify possible recognition errors and out-of-vocabulary events. Potentially improves the performance of ASR, SLU and DM.

Method: a confidence score based on a hypothesis test is associated with each recognized word, using the posterior

$$P(W \mid X) = \frac{P(X \mid W)\,P(W)}{\sum_{W'} P(X \mid W')\,P(W')}$$

For example:
Label:      credit please
Recognized: credit fees
Confidence: (0.9)  (0.3)

Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.

Page 53

Rejection

Problem: extraneous acoustic events, noise, background speech and out-of-domain speech deteriorate system performance.

Measure of confidence: associating word strings with a verification cost that provides an effective measure of confidence (utterance verification).

Effect: improvement in the performance of the recognizer, the understanding system and the dialogue manager.

Page 54

State-of-the-Art Performance

[Figure: the recognition pipeline again: input speech passes through feature extraction, pattern classification (decoding, search) with the acoustic model, language model and word lexicon, and confidence scoring, producing the recognized sentence; ASR sits inside the dialog circle with SLU, DM, SLG and TTS.]

Page 55

How to Evaluate Performance?

• Dictation applications: insertions, substitutions and deletions

$$\text{Word Error Rate} = \frac{\#\text{Subs} + \#\text{Dels} + \#\text{Ins}}{\text{no. of words in the correct sentence}} \times 100\%$$

• Command-and-control: false rejection and false acceptance => ROC curves.
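A minimal sketch of word error rate via edit distance (the standard dynamic-programming alignment, shown here for illustration):

```python
def word_error_rate(ref, hyp):
    """WER = (subs + dels + ins) / len(ref), via Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        for j in range(len(h) + 1):
            if i == 0:   d[i][j] = j                  # all insertions
            elif j == 0: d[i][j] = i                  # all deletions
            else:
                sub = d[i-1][j-1] + (r[i-1] != h[j-1])
                d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(word_error_rate("credit please", "credit fees"))  # 50.0
```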

Page 56

Word Error Rates

CORPUS                                     | TYPE                     | VOCABULARY SIZE    | WORD ERROR RATE
Connected Digit Strings (TI Database)      | Spontaneous              | 11 (zero-nine, oh) | 0.3%
Connected Digit Strings (Mall Recordings)  | Spontaneous              | 11 (zero-nine, oh) | 2.0%
Connected Digit Strings (HMIHY)            | Conversational           | 11 (zero-nine, oh) | 5.0%
RM (Resource Management)                   | Read Speech              | 1,000              | 2.0%
ATIS (Airline Travel Information System)   | Spontaneous              | 2,500              | 2.5%
NAB (North American Business)              | Read Text                | 64,000             | 6.6%
Broadcast News                             | News Show                | 210,000            | 13-17%
Switchboard                                | Conversational Telephone | 45,000             | 25-29%
Call Home                                  | Conversational Telephone | 28,000             | 40%

(Note the factor of 17 increase in digit error rate from read TI digit strings to conversational HMIHY digit strings.)

Page 57
Page 58

NIST Benchmark Performance

[Figure: word error rate vs. year for the Resource Management, ATIS and NAB benchmarks.]

Page 59

North American Business

vocabulary: 40,000 words; branching factor: 85

Page 60

Algorithmic Accuracy for Speech Recognition

[Figure: word accuracy vs. year (1996-2002) for the Switchboard/Call Home task (vocabulary: 40,000 words; perplexity: 85), rising from roughly 30-40% toward 70-80%.]

Page 61

Growth in Effective Recognition Vocabulary Size

[Figure: vocabulary size (log scale, from about 10 to 1,000,000 words) vs. year, 1960-2010.]

Page 62

Human Speech Recognition vs. ASR

[Figure: machine error rate (%) vs. human error rate (%) on log-log axes for Digits, RM-LM, RM-null, NAB-mic, NAB-omni, SWBD, WSJ and WSJ-22dB, with reference lines at x1, x10 and x100. Machines are roughly 10-100 times less accurate than humans; the region below the x1 line is where machines would outperform humans.]

Page 63

Challenges in ASR

System performance:
- accuracy
- efficiency (speed, memory)
- robustness

Operational performance:
- end-point detection
- user barge-in
- utterance rejection
- confidence scoring

Machines are 10-100 times less accurate than humans.

Page 64

Large-Vocabulary ASR Demo

Courtesy of Murat Saraclar

Page 65

Voice-Enabled System Technology Components

[Figure: the dialog circle: speech enters ASR (words spoken), SLU extracts meaning, DM decides the action using data and rules, SLG generates words, and TTS returns speech.]

Page 66

Spoken Language Understanding (SLU)

• Goal: interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
  – accurate understanding can often be achieved without correctly recognizing every word
  – SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
• Methodology: exploit the task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string; exploit 'salient' words and phrases to map high-information word sequences to the appropriate meaning
• Performance evaluation: accuracy of the speech understanding system on various tasks and in various operating environments
• Applications: automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
• Challenges: what goes beyond simple classification systems but below full natural language voice dialogue systems?

Page 67

Knowledge Sources for Speech Understanding

[Figure: knowledge sources aligned with system components. Acoustic/phonetic (relationship of speech sounds and English phonemes) -> acoustic model; phonotactic (rules for phoneme sequences and pronunciation) -> word lexicon; syntactic (structure of words and phrases in a sentence) -> language model; semantic (relationship among words and meanings) -> understanding model; pragmatic (discourse, interaction history, world knowledge) -> dialog manager. ASR covers the first three, SLU the semantic, DM the pragmatic.]

Page 68

DARPA Communicator

• DARPA sponsored research and development of mixed-initiative dialogue systems
• travel task involving airline, hotel and car information and reservation:
  "Yeah I uh I would like to go from New York to Boston tomorrow night with United"
• SLU output (concept decoding):
  Topic: Itinerary
  Origin: New York
  Destination: Boston
  Day of the week: Sunday
  Date: May 25th, 2002
  Time: >6pm
  Airline: United
• XML schema:

<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
  <time></time>
  <airline></airline>
</itinerary>

Page 69

Customer Care IVR and HMIHY

[Figure: side-by-side call flows. The conventional customer care IVR passes through a sparkle tone and greeting ("Thank you for calling AT&T..."), a LEC misdirect announcement, an account verification routine, a main menu, an LD sub-menu and a reverse directory routine, with per-stage delays ranging from about 10 to 58 seconds; total time to reach reverse directory lookup: 2:55 minutes!!! HMIHY opens with "AT&T, How may I help you?" and reaches the same lookup in 28 seconds!!!]

Page 70

HMIHY: How Does It Work

• the prompt is "AT&T. How may I help you?"
• the user responds with totally unconstrained fluent speech
• the system recognizes the words and determines the meaning of the user's speech, then routes the call
• dialog technology enables task completion

[Figure: HMIHY routes calls to destinations such as Account Balance, Calling Plans, Local, Unrecognized Number, ...]

Page 71

HMIHY Example Dialogs

Example dialog types:
• irate customer
• rate plan
• account balance
• local service
• unrecognized number
• threshold billing
• billing credit

Customer satisfaction:
• decreased repeat calls (37%)
• decreased 'OUTPIC' rate (18%)
• decreased CCA (Call Control Agent) time per call (10%)
• decreased customer complaints (78%)

Page 72

Customer Care Scenario

Page 73

Voice-Enabled System Technology Components

[Figure: the same dialog circle of ASR, SLU, DM, SLG and TTS components.]

Page 74

Dialog Management (DM)

• Goal: combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
  – DM makes viable complex services that require multiple exchanges between the system and the customer
  – dialog systems can handle user-initiated topic switching (within the domain of the application)
• Methodology: exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well understood goal or system interaction
• Performance evaluation: speed and accuracy of attaining a well defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
• Applications: customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
• Challenges: is there a science of dialogues: how do you keep a dialog efficient (turns, time, progress towards a goal) and how do you attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues: sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak; multimodal interactions with machines.

Page 75

Dialog Strategies

• Completion (continuation): elicits missing input information from the user.
• Constraining (disambiguation): reduces the scope of the request when multiple pieces of information have been retrieved.
• Relaxation: increases the scope of the request when no information has been retrieved.
• Confirmation: verifies that the system understood correctly; the strategy may differ depending on the SLU confidence measure.
• Reprompting: used when the system expected input but did not receive any or did not understand what it received.
• Context help: provides the user with help during periods of misunderstanding of either the system or the user.
• Greeting/closing: maintains social protocol at the beginning and end of an interaction.
• Mixed-initiative: allows users to manage the dialog.

Page 76

Mixed-Initiative Dialog

Who manages the dialog?

[Figure: an initiative spectrum from system to user. System initiative: "Please say collect, calling card, third number." User initiative: "How may I help you?"]

Page 77

Example: Mixed-Initiative Dialog

• System initiative (long dialogs, but easier to design)
  System: Please say just your departure city.
  User: Chicago
  System: Please say just your arrival city.
  User: Newark

• Mixed initiative
  System: Please say your departure city.
  User: I need to travel from Chicago to Newark tomorrow.

• User initiative (shorter dialogs and a better user experience, but more difficult to design)
  System: How may I help you?
  User: I need to travel from Chicago to Newark tomorrow.

Page 78

Voice-Enabled System Technology Components

[Figure: the dialog circle with the data flow labeled: words spoken enter ASR; SLU extracts the meaning; DM decides the action; SLG produces the words to be synthesized; TTS speaks them.]

Page 79

Building Good Speech-Based Applications

• Good user interfaces—make the application easy-to-use and robust to the kinds of confusion that arise in human-machine communications by voice
• Good models of dialog—keep the conversation moving forward, even in periods of great uncertainty on the parts of either the user or the machine
• Matching the task to the technology—be realistic about the capabilities of the technology
  – fail-soft methods
  – error correcting techniques (data dips)

Page 80

User Interface

• Good user interfaces
  – make the application easy-to-use and robust to the kinds of confusion that arise in human-machine communications by voice
  – keep the conversation moving forward, even in periods of great uncertainty on the parts of either the user or the machine
  – a great UI cannot save a system with poor ASR and NLU; but the UI can make or break a system, even with excellent ASR & NLU
  – effective UI design is based on a set of elementary principles and common widgets; sequenced screen presentations; simple error-trap dialog; a user manual

Page 81

Multimodal System Technology Components

[Figure: the dialog circle extended with visual, pen and gesture input alongside speech; ASR, SLU, DM, SLG and TTS as before.]

Page 82

Multimodal Experience

• users need to have access to automated dialog applications at anytime and anywhere, using any access device
• selection of the "most appropriate" UI mode or combination of modes depends on the device, the task, the environment, and the user's abilities and preferences

Page 83

Microsoft MiPad

• Multimodal Interactive Pad
• usability studies show double throughput for English
• speech is mostly useful in cases with lots of alternatives

Page 84

MiPad Video

Page 85

AT&T MATCH: Multimodal Access To City Help

• access to information through a voice interface, gesture and GPS
• multimodal integration: combination of speech, gesture and meaning using finite state technology

"Are there any cheap Italian places in this neighborhood?"

Page 86

MATCH Video

Courtesy of Michael Johnston

Page 87

The Speech Advantage

• Reduce costs
  – reduce labor expenses while still providing customers an easy-to-use and natural way to access information and services
• New revenue opportunities
  – 24x7 high-quality customer care automation
  – access to information without a keyboard or touch-tones
• Customer retention
  – provide personal services for customer preferences
  – improve customer experience

Page 88

Voice-Enabled Services

• Desktop applications—dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, ...)
• Agent technology—simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
• Voice portals—"convert any web page to a voice-enabled site", where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
• E-contact services—call centers, customer care (HMIHY) and help desks where calls are triaged and answered appropriately using natural language voice dialogues
• Telematics—command and control of automotive features (comfort systems, radio, windows, sunroof)
• Small devices—control of cellphones and PDAs from voice commands

Page 89

Milestones in Speech and Multimodal Technology Research

[Figure: timeline, 1962-2002. Small vocabulary, acoustic phonetics-based (isolated words): filter-bank analysis, time normalization, dynamic programming. Medium vocabulary, template-based (isolated words, connected digits, continuous speech): pattern recognition, LPC analysis, clustering algorithms, level building. Large vocabulary, statistical-based (connected words, continuous speech): hidden Markov models, stochastic language modeling. Large vocabulary with syntax and semantics (continuous speech, speech understanding): stochastic language understanding, finite-state machines, statistical learning. Very large vocabulary with semantics, multimodal dialog and TTS (spoken dialog, multiple modalities): concatenative synthesis, machine learning, mixed-initiative dialog.]

Page 90

Future of Speech Recognition Technologies

[Figure: timeline, 2002-2011. Very large vocabulary, limited tasks, controlled environment (robust systems) -> very large vocabulary, limited tasks, arbitrary environment (dialog systems) -> unlimited vocabulary, unlimited tasks, many languages (multilingual systems; multimodal, speech-enabled devices).]

Page 91

We've Come a Long Way -- Or Have We?