Speech Recognition
Lecture 9: Acoustic Models
Mehryar Mohri
Courant Institute of Mathematical Sciences
Speech Recognition Components
Acoustic and pronunciation model:
• $\Pr(o \mid d)$: observation seq. given distribution seq.
• $\Pr(d \mid c)$: distribution seq. given CD phone seq.
• $\Pr(c \mid p)$: CD phone seq. given phoneme seq.
• $\Pr(p \mid w)$: phoneme seq. given word seq.
Language model: $\Pr(w)$, distribution over word seq.

$$\Pr(o \mid w) = \sum_{d,\,c,\,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w).$$
Context-Dependent Phones
Idea:
• phoneme pronunciation depends on environment (allophones, co-articulation).
• modeling phones in context yields better accuracy.
Context-dependent rules:
• Context-dependent units: $ae/b\_d \to ae_{b,d}$.
• Allophonic rules: $t/V\_V \to dx$ (flapping of $t$ between vowels).
• Complex contexts: regular expressions.
(Lee, 1990; Young et al., 1994)
Acoustic Models
Critical component of a speech recognition system.
Different types:
• context-independent (CI) phones vs. context-dependent (CD) phones.
• speaker-independent vs. speaker-dependent.
Complex design and training techniques in large-vocabulary speech recognition.
This Lecture
Acoustic models
Training algorithms
Continuous Speech Models
Graph topology: a 3-state HMM model for each CD phone.
• Interpretation: beginning, middle, and end of the CD phone.
Continuous case: transition weights based on distributions over feature vectors in $\mathbb{R}^N$, typically with $N = 39$.

[Figure: 3-state HMM transducer for the CD phone $ae_{b,d}$: states $0 \to 1 \to 2 \to 3$; each state $i \in \{0, 1, 2\}$ has a self-loop and a forward arc labeled $d_i{:}\epsilon$, except the final arc into state 3, labeled $d_2{:}ae_{b,d}$.]
(Rabiner and Juang, 1993)
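As a sketch of this topology (illustrative Python, not from the lecture; the 0.6/0.4 weights are arbitrary initial values):

```python
import numpy as np

def make_3state_hmm():
    """Left-to-right 3-state HMM topology for one CD phone.

    States 0, 1, 2 are emitting; state 3 is the exit state. Each
    emitting state has a self-loop and a forward arc, matching the
    beginning/middle/end interpretation above.
    """
    A = np.zeros((4, 4))
    for i in range(3):
        A[i, i] = 0.6      # self-loop: stay in state i
        A[i, i + 1] = 0.4  # forward arc: advance to state i + 1
    return A

A = make_3state_hmm()
assert np.allclose(A[:3].sum(axis=1), 1.0)  # emitting rows are stochastic
```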
Distributions
Simple cases: e.g., single speaker, single Gaussian distribution

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{\top}\Sigma^{-1}(x - \mu)\right).$$

• covariance matrix typically diagonal.
General case: mixtures of Gaussians

$$\sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x; \mu_k, \Sigma_k), \quad \text{with } \lambda_i \geq 0 \text{ and } \sum_{i=1}^{M} \lambda_i = 1. \quad \text{Typically, } M = 16.$$
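These densities are direct to evaluate; a minimal sketch assuming diagonal covariances (illustrative Python; names are mine, not the lecture's):

```python
import numpy as np

def gaussian_density(x, mu, var):
    """N(x; mu, Sigma) with diagonal covariance, Sigma = diag(var)."""
    N = len(x)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def gmm_density(x, lambdas, mus, vars_):
    """Mixture density: sum_k lambda_k N(x; mu_k, Sigma_k)."""
    return sum(l * gaussian_density(x, m, v)
               for l, m, v in zip(lambdas, mus, vars_))

# Example with M = 2 components in R^39 (N = 39, as in the lecture).
rng = np.random.default_rng(0)
x = rng.standard_normal(39)
mus = [np.zeros(39), np.ones(39)]
vars_ = [np.ones(39), 2 * np.ones(39)]
print(gmm_density(x, [0.5, 0.5], mus, vars_))
```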
Parameter Reduction
Problem: too many parameters (> 200M).
• large number of GMMs provides better modeling flexibility.
• but requires much more training data.
Solution: tying mixtures, i.e., equality constraints on distributions (see the sketch below).
• within the same HMM, or across distributions in different HMMs (similar CD phone transitions).
• semi-continuous: same distributions, different mixture weights.
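One way to picture tying (a hypothetical sketch, not the lecture's implementation): transitions in different HMMs hold references to a single shared distribution object, so reestimation updates one parameter set.

```python
import numpy as np

class Gaussian:
    """A single Gaussian whose parameters may be shared (tied)."""
    def __init__(self, dim):
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)

# Two CD phone models with similar transitions share one distribution:
shared = Gaussian(39)
hmm_a_states = [shared, Gaussian(39), Gaussian(39)]
hmm_b_states = [shared, Gaussian(39), Gaussian(39)]

# Updating the tied distribution affects both HMMs at once.
shared.mu += 0.1
assert hmm_a_states[0] is hmm_b_states[0]
```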
Silence Model
Motivation: accounting for pauses between words and sentences.
Model:
• an optional pause symbol between words and at the beginning and end of utterances in the language model.
• a specific silence acoustic model, which can be context-dependent or not.
Composite HMM model
Composite model: obtained by taking the union and closure of all CD phone models.
Illustration:
$$\left(\bigcup_{p=1}^{P} H_p\right)^{\!*}.$$

[Figure: union of the 3-state models for two CD phones $p$ and $p'$ sharing the initial state; arcs are labeled $d_i{:}\epsilon$ and $d'_i{:}\epsilon$, with final arcs $d_2{:}p$ and $d'_2{:}p'$, closed with transitions back to the start.]
Tying can reduce the size.
This Lecture
Acoustic models
Training algorithms
Parameter Estimation
Data: a sample of $m$ sequences of the form:
• Feature vectors: $o_1, o_2, \ldots, o_T \in \mathbb{R}^N$.
• CD phones: $p_1, p_2, \ldots, p_l$, e.g., $ae_{c,t}$.
Parameters:
• means and covariances $\mu_j, \Sigma_j$ of the Gaussians.
• mixture coefficients $\lambda_k$.
Problems:
• segmentation.
• model initialization.
Estimation Algorithm
Baum-Welch algorithm:
• maximum likelihood principle.
• generalizes to continuous case with Gaussians and GMMs.
Questions: segmentation, model initialization.
Univariate Gaussian - ML solution
Problem: find the most likely Gaussian distribution given a sequence of real-valued observations, e.g., $3.18, 2.35, .95, 1.175, \ldots$

Normal distribution:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

Likelihood:

$$l(p) = -\frac{1}{2}\,m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Solution: $l$ is differentiable and concave;

$$\frac{\partial l(p)}{\partial \mu} = 0 \;\Longleftrightarrow\; \mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \;\Longleftrightarrow\; \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \mu^2.$$
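On the example observations above, the closed-form estimates are a few lines of illustrative Python:

```python
import numpy as np

x = np.array([3.18, 2.35, 0.95, 1.175])  # example observations from the slide

m = len(x)
mu = x.sum() / m                       # mu = (1/m) sum_i x_i
sigma2 = (x ** 2).sum() / m - mu ** 2  # sigma^2 = (1/m) sum_i x_i^2 - mu^2

# Same result via the centered form, as a sanity check.
assert np.isclose(sigma2, ((x - mu) ** 2).mean())
print(mu, sigma2)
```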
Gradients
General identities:
• log of determinant: $\nabla_{\Sigma}\,\log\det(\Sigma) = (\Sigma^{-1})^{\top} = \Sigma^{-\top}$.
• bilinear form: $\nabla_{\Sigma}\,(x^{\top}\Sigma\,x) = x\,x^{\top}$.
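Both identities can be checked numerically with finite differences; a small sketch (assuming numpy and a random symmetric positive-definite matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3 * np.eye(3)   # symmetric positive-definite matrix
x = rng.standard_normal(3)

def num_grad(f, S, eps=1e-6):
    """Finite-difference gradient of f with respect to the matrix S."""
    G = np.zeros_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            E = np.zeros_like(S); E[i, j] = eps
            G[i, j] = (f(S + E) - f(S - E)) / (2 * eps)
    return G

# grad_S log det(S) = (S^{-1})^T and grad_S x^T S x = x x^T
assert np.allclose(num_grad(lambda S: np.log(np.linalg.det(S)), S),
                   np.linalg.inv(S).T, atol=1e-4)
assert np.allclose(num_grad(lambda S: x @ S @ x, S), np.outer(x, x), atol=1e-4)
```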
Multivariate Gaussian - ML solution
Log likelihood: for a sample $x_1, \ldots, x_m$,

$$\Pr[x_i] = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^{\top}\Sigma^{-1}(x_i - \mu)\right),$$

$$L = \sum_{i=1}^{m} \log \Pr[x_i] = \sum_{i=1}^{m} -\frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}(x_i - \mu)^{\top}\Sigma^{-1}(x_i - \mu).$$

ML solution:

$$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \Sigma^{-1}(x_i - \mu) = 0 \;\Longleftrightarrow\; \mu = \frac{1}{m}\sum_{i=1}^{m} x_i,$$

$$\frac{\partial L}{\partial \Sigma^{-1}} = \sum_{i=1}^{m} \frac{1}{2}\left(\Sigma^{\top} - (x_i - \mu)(x_i - \mu)^{\top}\right) = 0 \;\Longleftrightarrow\; \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^{\top}.$$
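In code, the ML solution is the sample mean and the (biased, $1/m$) sample covariance; a quick illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))   # m = 1000 samples in R^5

m = X.shape[0]
mu = X.mean(axis=0)                  # mu = (1/m) sum_i x_i
D = X - mu
Sigma = (D.T @ D) / m                # Sigma = (1/m) sum_i (x_i - mu)(x_i - mu)^T

# numpy's cov with bias=True uses the same 1/m normalization.
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```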
GMMs - EM Algorithm
Mixture of $M$ Gaussians:

$$p_{\lambda}[x] = \sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x; \mu_k, \Sigma_k), \qquad L = \sum_{i=1}^{m} \log \sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x_i; \mu_k, \Sigma_k).$$

EM algorithm: let $p^{t}_{i,k} = \mathcal{N}(x_i; \mu^{t}_k, \Sigma^{t}_k)$.
• E-step:

$$q^{t+1}_{i,k} = p_{\lambda^t}[z = k \mid x_i] = \frac{\lambda^{t}_k\, p^{t}_{i,k}}{\sum_{k'=1}^{M} \lambda^{t}_{k'}\, p^{t}_{i,k'}}.$$

• M-step:

$$\mu^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\, x_i}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \Sigma^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\,(x_i - \mu^{t+1}_k)(x_i - \mu^{t+1}_k)^{\top}}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \lambda^{t+1}_k = \frac{1}{m}\sum_{i=1}^{m} q^{t+1}_{i,k}.$$
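A compact sketch of one EM iteration implementing these updates, restricted to diagonal covariances for brevity (illustrative Python; array shapes and names are mine):

```python
import numpy as np

def em_step(X, lambdas, mus, vars_):
    """One EM iteration for a diagonal-covariance GMM.

    X: (m, N) data; lambdas: (M,); mus, vars_: (M, N).
    """
    m = X.shape[0]
    # E-step: responsibilities q[k, i] = lambda_k N(x_i) / sum_k' lambda_k' N(x_i).
    d = X[None, :, :] - mus[:, None, :]                       # (M, m, N)
    logp = (-0.5 * np.sum(d ** 2 / vars_[:, None, :], axis=2)
            - 0.5 * np.sum(np.log(2 * np.pi * vars_), axis=1)[:, None])
    w = np.log(lambdas)[:, None] + logp                       # (M, m)
    q = np.exp(w - w.max(axis=0))
    q /= q.sum(axis=0)
    # M-step: weighted means, variances, and mixture weights.
    Q = q.sum(axis=1)                                         # (M,)
    mus_new = (q @ X) / Q[:, None]
    d = X[None, :, :] - mus_new[:, None, :]
    vars_new = np.einsum('km,kmn->kn', q, d ** 2) / Q[:, None]
    return Q / m, mus_new, vars_new
```

The shift by `w.max(axis=0)` before exponentiation is only a numerical-underflow guard; the responsibilities and updates are exactly the $q^{t+1}_{i,k}$, $\mu^{t+1}_k$, $\Sigma^{t+1}_k$, $\lambda^{t+1}_k$ formulas above.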
GMMs - EM Algorithm
Proof: M-step.
• Auxiliary function:

$$l = \sum_{k=1}^{M}\sum_{i=1}^{m} p_{\lambda}[z = k \mid x_i]\,\log p_{\lambda}[x_i, z = k] = \sum_{k=1}^{M}\sum_{i=1}^{m} q_{i,k}\,\log p_{\lambda}[x_i, z = k]$$

$$= \sum_{k=1}^{M}\sum_{i=1}^{m} q_{i,k}\left(\log \lambda_k - \frac{1}{2}(x_i - \mu_k)^{\top}\Sigma_k^{-1}(x_i - \mu_k) - \frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_k^{-1}|\right).$$

• Optimization for fixed $q$ and the constraint $\sum_{k=1}^{M} \lambda_k = 1$ (Lagrange multiplier $\beta$):

$$\frac{\partial L}{\partial \Sigma^{-1}_k} = \frac{1}{2}\sum_{i=1}^{m} q_{i,k}\left(\Sigma_k^{\top} - (x_i - \mu_k)(x_i - \mu_k)^{\top}\right),$$

$$\frac{\partial L}{\partial \lambda_k} = \frac{1}{\lambda_k}\sum_{i=1}^{m} q_{i,k} - \beta, \qquad \frac{\partial L}{\partial \mu_k} = \Sigma^{-1}_k \sum_{i=1}^{m} q_{i,k}\,(x_i - \mu_k).$$

Setting these gradients to zero yields the M-step updates above.
HMMs - EM Algorithm
Use the EM algorithm in the discrete case to determine the probability $w[e]$ of each transition $e$ at time $t$.
Use the GMM updates combined with the probability for each observation to be emitted from each transition $e$.
Initialization
Selection of model:
• HMM topology (states and transitions).
• number of mixture components ($M$).
Flat start: all distributions share the same average values (mean and variance) computed over the entire training set.
Existing hand-labeled segmentation: used as the basis for a partial segmentation.
Uniform segmentation: equal number of HMM transitions per training example.
Forced Alignment
Viterbi training: an approximate but faster method to determine the HMM path.
Segmental K-means: an approximate but faster method to determine which Gaussian of a mixture each training instance was sampled from.
Viterbi Alignment
Idea: faster alignment based on most likely path.
• use current acoustic model.
• align the sequence of feature vectors $o$ with the most likely path (best-path algorithm, e.g., Viterbi).

[Figure: alignment of feature vectors $o_1, \ldots, o_{10}$ with the transitions of the most likely path: $e_{11}\,e_{11}\,e_{11}\,e_{12}\,e_{22}\,e_{22}\,e_{23}\,e_{33}\,e_{34}\,e_{44}$.]
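A minimal sketch of the best-path computation (illustrative Python over log probabilities; `log_pi`, `log_A`, `log_B` are my names for the initial, transition, and emission log scores, not the lecture's notation). A left-to-right model as above gets $-\infty$ on disallowed transitions, which forces a monotone alignment like the one shown.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence.

    log_pi: (S,) initial log probabilities.
    log_A:  (S, S) transition log probabilities.
    log_B:  (T, S) per-observation emission log probabilities.
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log score ending in each state at t = 0
    psi = np.zeros((T, S), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path ending with i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]                    # state sequence aligned with o_1, ..., o_T
```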
Segmental K-Means
Idea: use a clustering algorithm to determine which Gaussian generated each observation.
Solution: use the K-means clustering algorithm to initialize distribution means (sketched after this list).
• Initialization: select $K$ centroids $c_1, \ldots, c_K$.
• Repeat until no centroid changes:
• for each point $x_i$, find the closest centroid $c_j$ and assign $x_i$ to cluster $j$.
• for each cluster $j$, redefine $c_j$ as the center of mass of its points.
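The loop above in a few lines of illustrative Python (initialization by random sampling is one common choice, not prescribed by the lecture):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: X is (m, N); returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), K, replace=False)]      # initial centroids
    for _ in range(iters):
        # Assign each point to its closest centroid.
        assign = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        # Recompute each centroid as the center of mass of its cluster.
        new_c = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                          else c[j] for j in range(K)])
        if np.allclose(new_c, c):                    # no centroid changed
            break
        c = new_c
    return c, assign
```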
Notes
Convergence rate of K-means: subject of current research.
GMM EM algorithm: soft version of K-means.
Variable Number of Mixtures
Problems:
• the number of mixture components required is not known in advance.
• possible over- or underfitting.
Solution:
• start with a single Gaussian distribution.
• create a two-component mixture with slightly perturbed means $\mu \pm \epsilon$ and the same covariance matrix (see the sketch below).
• reestimate the model parameters; repeat until the desired complexity is reached.
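The splitting step itself is tiny; a hedged sketch (the perturbation scale `eps` is my choice, not the lecture's):

```python
import numpy as np

def split_component(mu, var, eps=0.2):
    """Split one Gaussian into a two-component mixture.

    Means are perturbed to mu +/- eps * std; the covariance is kept.
    The two components start with equal mixture weights.
    """
    delta = eps * np.sqrt(var)
    return [(0.5, mu + delta, var.copy()),
            (0.5, mu - delta, var.copy())]

# Start from a single Gaussian, split, reestimate with EM, and
# repeat until the desired number of components is reached.
mu, var = np.zeros(39), np.ones(39)
components = split_component(mu, var)
```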
Acoustic Modeling
In practice:
• complicated recipes or heuristics with a large number of ad hoc techniques.
• skilled human supervision is key: choice of initial parameters to avoid poor local optima, segmentation, choice and number of parameters.
• computationally very costly: training may take many days on several processors in large-vocabulary speech recognition.
Improvements
Adaptation (VTLN, MLLR).
Ensemble methods (ROVER).
Better features (e.g., LDA, MMI).
Discriminative training.
References
• L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
• Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
• M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.
• Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
• J. A. O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1):95-103, 1983.