Universität Potsdam, Institut für Informatik
Chair of Machine Learning
Automatic Speech Recognition
Tobias Scheffer
Scheffer: Language Technology
Motivation
Spoken language is an extremely natural form of human communication.
Does not require a keyboard.
Does not require eyes or hands to be available.
Can be implemented on phones, cars, and embedded devices.
Overview
Speech recognition using hidden Markov models.
Speech recognition using deep recurrent neural networks.
Voice portals.
Statistical Speech Recognition
Components: acoustic model + language model.

argmax_words P(words | signal) = argmax_words P(signal | words) · P(words)
(by Bayes' rule; P(signal | words) is the acoustic model, P(words) the language model)

Posterior: how likely is the word sequence given the signal?
Likelihood: how likely is the signal given the word sequence?
Prior: how likely is the word sequence a priori?
argmax: best word sequence.
Statistical Speech Recognition
Language model has to be trained on domain-specific documents.

argmax_words P(words | signal) = argmax_words P(signal | words) · P(words)
(acoustic model · language model)
Statistical Speech Recognition
Acoustic model:
Trained on annotated voice signal.
Phonetic models for individual phones.
Phones combined into models of words.
argmax_words P(words | signal) = argmax_words P(signal | words) · P(words)
(acoustic model · language model)
Statistical Speech Recognition
Decoding (applying speech recognition):
Search in the space of all word sequences.
Search algorithms: Viterbi beam search.

argmax_words P(words | signal) = argmax_words P(signal | words) · P(words)
(acoustic model · language model)
Acoustic Model: Granularity
One model per word:
Works for small vocabularies (e.g., digit recognition).
Cannot recognize unknown words.
Ignores that distinct words often share phones.
Syllables:
Works for Japanese (50 syllables),
but generally bad (English has 30,000 syllables).
Phone:
Smallest phonetic unit; English has 50 phones.
Phoneme: smallest phonetic unit that distinguishes word meaning.
Pronunciation of phones varies by context.
Acoustic Model
Short-time Fourier transform:
The signal is a time-varying superposition of waves; for each frequency, the transform yields the amplitude of the corresponding sine wave in the signal:
X[k] = Σ_n x[n] e^(−2πikn/N)
For each point in time:
Amplitudes of ~24 frequency bands.
Decorrelation, feature selection → ~20–50 continuous "cepstral attributes".
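A minimal numpy sketch of this feature extraction (the frame length, hop size, and simple band pooling are assumptions for illustration; real systems apply a mel filterbank and a decorrelating transform to obtain cepstral attributes):

```python
import numpy as np

def stft_features(signal, frame_len=400, hop=160, n_bands=24):
    """Short-time Fourier transform: split the signal into overlapping
    frames and compute the amplitude per coarse frequency band per frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))      # amplitude per frequency
        # Pool the spectrum into ~24 coarse frequency bands.
        bands = np.array_split(spectrum, n_bands)
        frames.append([b.mean() for b in bands])
    return np.array(frames)                        # shape: (time, n_bands)

# A 1-second toy signal at 16 kHz: superposition of two sine waves.
t = np.linspace(0, 1, 16000, endpoint=False)
signal = 1.1 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
features = stft_features(signal)
print(features.shape)  # one 24-dimensional feature vector per 10 ms hop
```

With a 160-sample hop at 16 kHz, one feature vector is produced every 10 milliseconds, matching the HMM time step used later.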
Acoustic Model: Variations
One separate model per phone.
Alternatively: triphone class model:
Pronunciation depends on neighboring phones.
For instance, "b" and "p" have a similar influence on the following vowel.
Group such context phones into classes, one acoustic model per phone with each class of neighbors.
Acoustic Model
Each phone (or phone + context) is represented by
a hidden Markov model.
~2–20 states per phone.
Time steps of ~10 milliseconds.
Observations x_t: vectors of cepstral attributes.
Mostly linear structure, but shortcuts and parallel branches reflect the possible pronunciation alternatives.
Acoustic Model: Hidden Markov Model
State sequence cannot be observed, only the sequence of acoustic signals.
Observation probabilities b_j(x_t) modeled as a mixture of Gaussian distributions:
P(x_t | S_j) = b_j(x_t) = Σ_k c_jk N(x_t; μ_k, Σ_k)
Transition probabilities: P(q_t = S_j | q_{t−1} = S_i).
[Figure: five-state HMM with mostly linear transitions, e.g. 90% self-loop, 10% advance, and small probabilities for skips.]
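A minimal sketch of such a mixture-of-Gaussians emission probability b_j(x_t) (the two-component mixture, dimensions, and parameters are toy values for illustration):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def emission_prob(x, weights, means, covs):
    """b_j(x) = sum_k c_jk * N(x; mu_k, Sigma_k): mixture of Gaussians."""
    return sum(c * gaussian_pdf(x, m, S)
               for c, m, S in zip(weights, means, covs))

# Toy state with a two-component mixture over 2-dimensional feature vectors.
weights = [0.7, 0.3]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(emission_prob(np.array([0.0, 0.0]), weights, means, covs))
```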
Hidden Markov Model
Parameter estimation:
From annotated speech data.
Baum-Welch algorithm:
Find argmax_θ P(Data | θ).
Repeat until convergence:
▪ Use forward-backward to infer α_t(i) and β_t(i).
▪ Infer γ_t(i) and ξ_t(i, j).
▪ Estimate π_i = γ_1(i).
▪ Estimate a_ij = Σ_t ξ_t(i, j) / Σ_t γ_t(i).
▪ Estimate b_i(k) = Σ_{t: O_t = k} γ_t(i) / Σ_t γ_t(i).
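The forward-backward pass at the heart of Baum-Welch can be sketched for a discrete-output HMM (toy parameters; real acoustic models use Gaussian-mixture emissions instead of a discrete emission table):

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Forward-backward pass for a discrete-output HMM.
    A[i,j]: transition prob, B[i,k]: emission prob, pi[i]: initial prob.
    Returns gamma[t,i] = P(state i at time t | observations)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy 2-state HMM with 2 output symbols.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
gamma = forward_backward(A, B, pi, [0, 0, 1])
print(gamma.shape)   # (3, 2); each row sums to 1
```

The resulting γ_t(i) values are the quantities re-estimated over in the update equations above.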
Hidden Markov Acoustic Models
Training data: phonetically annotated speech.
Alignment is critical. HMM that models a phone can
only be trained with speech signals of that phone.
Phonetic annotation is a cumbersome process that
cannot easily be scaled.
Pronunciation Networks
Probabilistic finite automaton combines models of phones into models of words.
Each node is an HMM that models a phone.
Transition probabilities are estimated from an annotated corpus.
[Figure: pronunciation network for "tomato" with alternative phone paths (/t/ /ax|ow/ /m/ /ey|aa/ /t|dx/ /ow/) and transition probabilities on the arcs.]
Acoustic Model: Decoding
Decoding the acoustic model:
Find the state sequence given the acoustic signal.
N-best Viterbi algorithm:
Finds the N most likely words given the signal.
[Figure: decoding stack: pronunciation network on top of phonetic HMMs on top of the signal / attributes.]
Language Model
Model of the prior probability P(words) of a word sequence.
Language models in speech recognition:
n-gram models.
Probabilistic context-free grammars (PCFGs).
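A toy maximum-likelihood bigram model, sketched in Python (the corpus and the smoothing-free estimate are illustrative assumptions; real language models are trained on large domain-specific corpora and smoothed):

```python
from collections import Counter

def bigram_model(corpus):
    """Maximum-likelihood bigram model: P(w_t | w_{t-1}) estimated
    from counts in a (toy) domain-specific corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words[:-1])            # left contexts
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return lambda prev, word: bigrams[(prev, word)] / unigrams[prev]

corpus = ["i saw the ships", "i saw nothing", "the ships sailed"]
p = bigram_model(corpus)
print(p("i", "saw"))   # count("i saw") / count("i" as left context)
```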
Joint Decoding Problem
Maximize acoustic model + language model scores:

argmax_words P(words | signal) = argmax_words P(signal | words) · P(words)
(by Bayes' rule; acoustic model · language model)

The acoustic model provides k candidate words for each word in the sequence.
The language model allows inferring the probability of each sequence.
argmax: best word sequence.
Given k candidates for each of n words, there are k^n combinations.
Joint Decoding Problem
Beam search algorithm.
1. Input: candidate words w_t[1…k] for each position t = 1…m, beam width n.
2. Let s_0[1…n] = ∅.
3. For t = 1…m:
   1. Let candidates = {s_{t−1}[i] + w_t[j] | 1 ≤ i ≤ n, 1 ≤ j ≤ k}.
   2. Infer P(w_t[j]) · P(s_{t−1}[i] + w_t[j]) for all candidates and let s_t[1…n] be the n highest-scoring candidates.
4. Return s_m[1].
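The algorithm above can be sketched in Python (the candidate words and the bigram scoring function are invented for illustration; a real system would score with the combined acoustic and language model):

```python
import math

def beam_search(candidates, score, beam_width):
    """Beam search over word-sequence hypotheses.
    candidates[t] is the list of k candidate words at position t;
    score(seq) returns a log-probability for a partial sequence.
    Keeps only the beam_width best partial sequences per step."""
    beam = [()]                                   # start with the empty sequence
    for step in candidates:
        expanded = [seq + (word,) for seq in beam for word in step]
        beam = sorted(expanded, key=score, reverse=True)[:beam_width]
    return beam[0]

# Toy bigram language model as the scoring function (assumed for illustration).
bigram = {("i", "saw"): 0.4, ("i", "sore"): 0.05,
          ("eye", "saw"): 0.1, ("eye", "sore"): 0.2,
          ("saw", "ships"): 0.5, ("saw", "chips"): 0.1,
          ("sore", "ships"): 0.05, ("sore", "chips"): 0.3}

def score(seq):
    return sum(math.log(bigram.get(pair, 1e-6))
               for pair in zip(seq, seq[1:]))

candidates = [["i", "eye"], ["saw", "sore"], ["ships", "chips"]]
print(beam_search(candidates, score, beam_width=2))
```

Note that pruning to the beam width avoids enumerating all k^n combinations, at the cost of possibly discarding the true optimum.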
Limitations of HMM-Based Recognition
Signal preprocessing steps may lose valuable information from the signal.
First-order Markov assumption + assumption of a mixture of Gaussians for b_j(x_t) probably not satisfied: model error even with infinite data.
Exact alignment of speech training data to phonemes needed → data collection not easily scalable.
Overview
Speech recognition using hidden Markov models.
Speech recognition using deep recurrent neural networks.
Voice portals.
Neural Networks for Speech Recognition
For several years, small parts of a complex processing pipeline have been replaced by neural networks.
Trend: train a single large recurrent neural network for (almost) the entire processing pipeline:
From signal processing to word output + separate language model (language model learned from a different data source).
Or even from signal processing to sentence output.
Neural Networks for Speech Recognition
Input: spectrogram (time, frequency, amplitude).
Output: phones, letters, or words (one-hot coded, softmax output units).
Output alphabet + blank symbol "-".
Deep network of stacked bidirectional LSTM layers.
Output string contains "-" and duplicate phones (or letters, words).
Postprocessing: beam search with a language model.
[Figure: stacked bidirectional LSTM layers mapping input frames x_1 … x_T to per-frame softmax outputs, e.g. "- C A T -".]
Neural Networks for Speech Recognition
Each softmax output unit at time t gives a probability for each of the possible discrete output values k:
P_net(k, t | x) = e^(y_t^k) / Σ_{k'} e^(y_t^{k'})
An alignment π is a string of output labels and blanks.
The network output is a distribution over alignments:
P_net(π | x) = Π_{t=1}^T P_net(π_t, t | x)
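A sketch of this per-frame softmax and alignment probability (the logits below are stand-ins for real network outputs; the alphabet is a toy assumption):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def alignment_prob(logits, alignment, labels):
    """P_net(pi | x) = prod over t of P_net(pi_t, t | x): probability of
    one specific alignment string under the per-frame softmax outputs."""
    probs = softmax(logits)                  # shape (T, |alphabet|)
    return float(np.prod([probs[t, labels.index(c)]
                          for t, c in enumerate(alignment)]))

labels = ["-", "C", "A", "T"]                # blank + letters
# Stand-in network outputs: each frame favours one label (logit 2 vs. 0).
logits = np.array([[2., 0., 0., 0.],         # frame 1 favours "-"
                   [0., 2., 0., 0.],         # frame 2 favours "C"
                   [0., 0., 2., 0.],         # frame 3 favours "A"
                   [0., 0., 0., 2.],         # frame 4 favours "T"
                   [2., 0., 0., 0.]])        # frame 5 favours "-"
p = alignment_prob(logits, "-CAT-", labels)
print(p)
```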
Neural Network Decoding
Output alphabet: 1…K (e.g., a–z, "-").
Forward propagation of input x_1, …, x_T → activation of the T softmax output units with K values.
P_net(π | x) = Π_{t=1}^T P_net(π_t, t | x)
Run beam search: in each step, for each of the n most likely initial sequences y[1..t−1],
Append each possible next output π_t,
Calculate the posterior probability P(y[1..t] | x) (this includes the language model, skipping blanks, deduplicating duplicate outputs), and
Store the n most likely ones.
Connectionist Temporal Classification
Example: y = "CAT", T = 7.
Possible alignments π:
"-C-A-T-", "-CAA-TT", "CC-A-TT", …
Let B^{−1}(y) be the set of all possible alignments.
P_net(y | x) = Σ_{π ∈ B^{−1}(y)} P_net(π | x)
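The collapse function B and a brute-force enumeration of B^{−1}(y) can be sketched as follows (toy alphabet and alignment length are assumptions; real decoders never enumerate alignments but use dynamic programming):

```python
from itertools import product

def collapse(alignment):
    """The CTC collapse B: merge repeated labels, then delete blanks."""
    out, prev = [], None
    for c in alignment:
        if c != prev and c != "-":
            out.append(c)
        prev = c
    return "".join(out)

# Brute-force check on a tiny alphabet: all alignments of length 4 over
# {-, A, T} that collapse to "AT".
inverse = [a for a in map("".join, product("-AT", repeat=4))
           if collapse(a) == "AT"]
print(len(inverse), inverse[:3])
```

Summing P_net(π | x) over exactly this set gives P_net(y | x).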
Connectionist Temporal Classification
Training on unaligned pairs (x_i, y_i).
Train the network to output any alignment π ∈ B^{−1}(y_i).
For each example (x_i, y_i):
Run forward propagation.
Find the alignment π_i* = argmax_{π ∈ B^{−1}(y_i)} P_net(π | x_i).
Run back propagation with target π_i*.
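A brute-force sketch of finding the most likely alignment π* for a target (the per-frame probabilities are invented; real CTC training computes this with dynamic programming over B^{−1}(y) rather than enumeration):

```python
import numpy as np
from itertools import product

def collapse(alignment):
    """The CTC collapse B: merge repeated labels, then delete blanks."""
    out, prev = [], None
    for c in alignment:
        if c != prev and c != "-":
            out.append(c)
        prev = c
    return "".join(out)

def best_alignment(probs, target, labels):
    """Brute-force argmax over alignments that collapse to the target
    (feasible only for toy sizes)."""
    best, best_p = None, -1.0
    for a in map("".join, product(labels, repeat=len(probs))):
        if collapse(a) != target:
            continue
        p = np.prod([probs[t, labels.index(c)] for t, c in enumerate(a)])
        if p > best_p:
            best, best_p = a, p
    return best

labels = "-AT"
# Per-frame label probabilities favouring "A", "A", blank, "T".
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.6, 0.2, 0.2],
                  [0.1, 0.1, 0.8]])
print(best_alignment(probs, "AT", labels))
```

The returned alignment string is then used as the frame-wise target for backpropagation.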
Overview
Speech recognition using hidden Markov models.
Speech recognition using deep recurrent neural networks.
Voice portals.
Voice Portals
[Diagram: voice-portal standards, spanning speech recognition and text to speech]
SRGS: Speech Recognition Grammar Specification.
N-GRAM: Stochastic Language Models Specification.
SSML: Speech Synthesis Markup Language.
VoiceXML: dialog control.
SALT: Speech Application Language Tags.
(Layers: application development on top of language technology.)
Speech Recognition Grammar Specification
Syntax for probabilistic, context-free grammar:
Augmented BNF or XML.
Speech and DTMF input.
Basic elements:
Rule definitions,
Rule expansions.
SRGS: Elements
Rule definition:
Associates rule to
identifier.
<rule id="order">
  [rule expansion]
</rule>
SRGS: Elements
Rule definition:
Associates rule to
identifier.
Rule reference:
Reference to a rule or n-gram.
VOID, NULL, GARBAGE.
<rule id="order">
  <ruleref uri="#greeting"/>
  …
</rule>
SRGS: Elements
Rule definition:
Associates rule to
identifier.
Rule reference:
Reference to a rule or n-gram.
VOID, NULL,
GARBAGE.
Alternative:
Accepts any of the
alternatives.
<one-of>
<item>Caipirinha</item>
<item>Mojito</item>
<item>Zombie</item>
</one-of>
SRGS: Elements
Rule reference:
Reference to a rule or n-gram.
VOID, NULL, GARBAGE.
Alternative:
Accepts any of the alternatives.
Weights:
Transformed into probabilities → PCFG.
<one-of>
  <item weight="10">Caipirinha</item>
  <item weight="5">Mojito</item>
  <item weight="1">B52</item>
</one-of>
N-GRAM: Stochastic Language Models
Syntactic scheme for the representation of
Dictionaries,
Counts of n-gram tuples.
Elements:
<lexicon> dictionary declaration,
<token> token declaration,
<tree> occurrence counter,
<interpolation> linear interpolation weights.
SSML: Speech Synthesis Markup
Markup conventions for all stages of speech
generation.
ACSS: Aural Cascading Style Sheets:
Complex markup definitions,
Assignments of speakers to markup tags,
Spatial positioning of speakers.
SSML: Elements
Normalization (expansion of abbreviations, currencies):
<say-as> and <sub> elements.
<p> and <s> elements (paragraph, sentence).
Conversion text → phoneme:
<phoneme> element.
Description in the IPA alphabet.
Prosodic analysis:
<emphasis>, <break>, and <prosody> elements.
Signal generation:
<voice> element, voice selection.
Attributes gender, age, variant, name.
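For illustration, a minimal SSML document combining several of these elements might look like this (a sketch; the concrete text and attribute values are assumptions):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <p>
    <s>
      <voice gender="female">
        Your order of <say-as interpret-as="cardinal">2</say-as> Caipirinhas
        <break time="300ms"/>
        will arrive <emphasis level="strong">today</emphasis>.
      </voice>
    </s>
  </p>
</speak>
```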
VoiceXML
Goal:
Separation of application development and language
technology.
Elements:
Dialog control,
Speech recognition and generation, DTMF,
Recording and playing messages.
<vxml version="2.0">
<form>
<block>
<prompt> Hello world!
</prompt>
</block>
</form> </vxml>
VoiceXML: Concepts
Dialogs and subdialogs:
Menus: branching points
Forms: collect information in fields.
Events: exception handling.
Links:
Grammar that is active in a certain scope.
Triggers event or refers to target URI.
Procedural elements:
<var>, <assign>, <goto>,
<if> <else/> </if>.
<object>: calls a platform-specific element.
VoiceXML: Generation
A VoiceXML file is typically generated,
just like HTML,
for instance using XSLT from XML content:

<cocktail_menu>
  <cocktail>
    <name>Caipirinha</name>
    <descr>National drink of Brazil</descr>
  </cocktail>
  …
</cocktail_menu>

<xsl:template match="cocktail_menu">
  Cocktail menu
  <xsl:template match="cocktail">
    <voice gender="male">
      <xsl:value-of select="name"/>
    </voice>
    <p/>
    <voice gender="female">
      <xsl:value-of select="descr"/>
    </voice>
  </xsl:template>
  …
</xsl:template>

Result:
<voice gender="male">Caipirinha</voice>
<voice gender="female">National drink of Brazil</voice>
VoiceXML Elements: Forms
<form>: Form; <field>: Fillable field.
Attributes and methods:
Identifier to be used for reference.
Variable that can be filled with input.
Scope: local for dialog or global.
Event handling, actions such as call forwarding.
Requirements for fields.
VoiceXML Elements: Forms
Interpretation:
Until every <field> is filled,
Choose <field>, read <prompt>, wait for input.
<form id="order">
  <block>Ready to take your order.</block>
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
    <grammar src="cocktail.grxml" type="application/srgs+xml"/>
  </field>
</form>
VoiceXML Forms: Mixed Initiative
<initial>, initial element: first prompt is read at first
iteration.
User can fill one or several fields.
Interpreter then reads prompt of first field that has
not yet been filled.
> Where would you like that delivered?
To August-Bebel-Str. 89,
my name is Tobias Scheffer.
> What is your Zip code?
VoiceXML - Example