
Speech Recognition
Lecture 9: Acoustic Models.

Mehryar Mohri
Courant Institute of Mathematical Sciences

[email protected]


Speech Recognition Components

Acoustic and pronunciation model:

• $\Pr(o \mid d)$: observation seq. → distribution seq.

• $\Pr(d \mid c)$: distribution seq. → CD phone seq.

• $\Pr(c \mid p)$: CD phone seq. → phoneme seq.

• $\Pr(p \mid w)$: phoneme seq. → word seq.

Language model: $\Pr(w)$, distribution over word seq.

$$\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w).$$
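As a toy numeric illustration (ours, not from the slides): if each sequence is collapsed to a single symbol, the marginalization above is just a chain of products of column-stochastic conditional-probability matrices. All names and alphabet sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional(rows: int, cols: int) -> np.ndarray:
    """Random conditional distribution: column j is a distribution over rows."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0)

# Pr(o|d), Pr(d|c), Pr(c|p), Pr(p|w) over tiny symbol alphabets.
P_od = conditional(4, 3)
P_dc = conditional(3, 3)
P_cp = conditional(3, 2)
P_pw = conditional(2, 2)

# Pr(o|w) = sum over d, c, p of the product: a matrix chain.
P_ow = P_od @ P_dc @ P_cp @ P_pw
print(P_ow.sum(axis=0))  # each column sums to 1: a distribution over o
```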


Context-Dependent Phones

Idea:

• phoneme pronunciation depends on the environment (allophones, co-articulation).

• modeling a phone in context → better accuracy.

Context-dependent rules:

• Context-dependent units: $ae / b\_d \to ae_{b,d}$.

• Allophonic rules: $t / V\_V \to dx$ (a $t$ between vowels becomes a flap).

• Complex contexts: regular expressions (see the sketch below).

(Lee, 1990; Young et al., 1994)
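A minimal sketch (ours, not from the slides) of an allophonic rule expressed as a regular expression over a space-separated phone string; the vowel class and the example phone sequence are hypothetical.

```python
import re

# Flapping rule t/V _ V -> dx: a t flanked by vowels is rewritten as dx.
VOWEL = r"(?:aa|ae|ah|ax|iy|uw)"  # hypothetical coarse vowel class
rule = re.compile(rf"({VOWEL}) t ({VOWEL})")

print(rule.sub(r"\1 dx \2", "b ae t ax l"))  # -> "b ae dx ax l"
```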


Acoustic Models

Critical component of a speech recognition system.

Different types:

• context-independent (CI) phones vs. context-dependent (CD) phones.

• speaker-independent vs. speaker-dependent.

Complex design and training techniques in large-vocabulary speech recognition.



This Lecture

Acoustic models

Training algorithms


Continuous Speech Models

Graph topology: a 3-state HMM model for each CD phone, e.g., $ae_{b,d}$.

• Interpretation: beginning, middle, and end of the CD phone.

Continuous case: transition weights based on distributions over feature vectors in $\mathbb{R}^N$, typically with $N = 39$.

[Figure: left-to-right 3-state HMM for $ae_{b,d}$; each state $0, 1, 2$ carries a self-loop and an advance transition with distributions $d_0, d_1, d_2$, and the final transition into state 3 outputs $ae_{b,d}$.]

(Rabiner and Juang, 1993)
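A minimal sketch (ours, not from the slides) of this topology as a transition list; the `Transition` record and state numbering are hypothetical conveniences.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    src: int     # source state
    dst: int     # destination state
    dist: int    # index of the emission distribution (d0, d1, d2)
    output: str  # output label: epsilon or the CD phone

def three_state_hmm(cd_phone: str) -> list[Transition]:
    """Left-to-right 3-state HMM for one CD phone: each state has a
    self-loop and an advance transition sharing the same distribution."""
    eps = ""
    return [
        Transition(0, 0, dist=0, output=eps),       # self-loop: beginning
        Transition(0, 1, dist=0, output=eps),
        Transition(1, 1, dist=1, output=eps),       # self-loop: middle
        Transition(1, 2, dist=1, output=eps),
        Transition(2, 2, dist=2, output=eps),       # self-loop: end
        Transition(2, 3, dist=2, output=cd_phone),  # emit the phone on exit
    ]

print(three_state_hmm("ae_{b,d}"))
```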


Distributions

Simple cases: e.g., single speaker, a single Gaussian distribution

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).$$

• covariance matrix typically diagonal.

General case: mixtures of Gaussians

$$\sum_{k=1}^{M} \lambda_k \, \mathcal{N}(x; \mu_k, \Sigma_k), \quad \text{with } \lambda_k \geq 0 \text{ and } \sum_{k=1}^{M} \lambda_k = 1.$$

Typically, $M = 16$.
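A short sketch (ours) of evaluating such a mixture with diagonal covariances, computed in log space for numerical stability; the array shapes are assumptions.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at x (shape (N,)).
    weights: (M,), means and variances: (M, N)."""
    N = x.shape[0]
    # Per-component Gaussian log-densities.
    log_norm = -0.5 * (N * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    # log sum_k lambda_k N(x; mu_k, Sigma_k), via logsumexp.
    log_comp = np.log(weights) + log_norm + log_exp
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Example: M = 2 components in N = 3 dimensions.
x = np.zeros(3)
w = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(gmm_log_density(x, w, mu, var))
```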


GMMs - Illustration

[Figures: two slides of Gaussian mixture model illustrations.]

Parameter Reduction

Problem: too many parameters (> 200M).

• a large number of GMMs provides better modeling flexibility.

• but requires much more training data.

Solution: tying mixtures, i.e., equality constraints on distributions.

• within the same HMM, or across distributions in different HMMs (similar CD phone transitions).

• semi-continuous: same distributions, different mixture weights (see the sketch below).
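A minimal sketch (ours, not from the slides) of what tying amounts to in code: transitions reference distributions in a shared pool by index, so several transitions or HMMs reuse one parameter set. The dictionaries and placeholder parameters are hypothetical.

```python
# Shared pool of distributions ("codebook"); parameters are placeholders.
pool = [
    {"mean": [0.0], "var": [1.0]},  # distribution 0
    {"mean": [1.0], "var": [1.0]},  # distribution 1
]

# Two HMMs whose transitions are tied to the same pool entries; in the
# semi-continuous case only per-transition mixture weights would differ.
hmm_ae = {"d0": 0, "d1": 1, "d2": 1}
hmm_eh = {"d0": 0, "d1": 0, "d2": 1}

assert hmm_ae["d1"] == hmm_eh["d2"]  # shared parameters: one update affects both
```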


Silence Model

Motivation: accounting for pauses between words and sentences.

Model:

• optional pause symbol between words and at the beginning and end of utterances in the language model.

• specific silence acoustic model, which can be context-dependent or not.


Composite HMM Model

Composite model: obtained by taking the union and closure of all CD phone models.

Illustration:

$$\left( \bigcup_{p=1}^{P} H_p \right)^{*}.$$

[Figure: union of the 3-state HMMs of two CD phones $p$ and $p'$ (distributions $d_0, d_1, d_2$ and $d'_0, d'_1, d'_2$), with $\epsilon$-transitions closing the union into a loop.]

Tying can reduce the size.


This Lecture

Acoustic models

Training algorithms


Parameter Estimation

Data: sample of $m$ sequences of the form:

• Feature vectors: $o_1, o_2, \ldots, o_T \in \mathbb{R}^N$.

• CD phones: $p_1, p_2, \ldots, p_l$, e.g., $ae_{c,t}$.

Parameters:

• mean and covariance of each Gaussian: $\mu_j, \Sigma_j$.

• mixture coefficients: $\lambda_k$.

Problems:

• segmentation.

• model initialization.


Estimation Algorithm

Baum-Welch algorithm:

• maximum likelihood principle.

• generalizes to the continuous case with Gaussians and GMMs.

Questions: segmentation, model initialization.


Univariate Gaussian - ML solution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations, e.g.,

3.18, 2.35, .95, 1.175, . . .

Normal distribution:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

Likelihood:

$$l(p) = -\frac{1}{2} m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Solution: $l$ is differentiable and concave;

$$\frac{\partial l(p)}{\partial \mu} = 0 \iff \mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \iff \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.$$
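A quick numerical check of these closed-form estimates (a sketch; it reuses the sample values shown above, whose trailing "..." is elided in the original).

```python
import numpy as np

# The observations listed on the slide.
x = np.array([3.18, 2.35, 0.95, 1.175])
m = len(x)

mu = x.sum() / m                       # mu = (1/m) sum_i x_i
sigma2 = (x ** 2).sum() / m - mu ** 2  # sigma^2 = (1/m) sum_i x_i^2 - mu^2

# Same result via the centered form (1/m) sum_i (x_i - mu)^2.
assert np.isclose(sigma2, ((x - mu) ** 2).mean())
print(mu, sigma2)
```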


Gradients

General identities:

• log of determinant: $\nabla_{\Sigma} \log\det(\Sigma) = (\Sigma^{-1})^\top = \Sigma^{-\top}$.

• bilinear form: $\nabla_{\Sigma} (x^\top \Sigma x) = x x^\top$.
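A finite-difference sanity check (ours) of the first identity on a random positive-definite matrix; the matrix and tolerances are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3 * np.eye(3)  # positive definite, so log det is well defined

# Central finite differences of log det(S) with respect to each entry S_ij.
eps = 1e-6
grad_fd = np.zeros_like(S)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(S)
        E[i, j] = eps
        grad_fd[i, j] = (np.log(np.linalg.det(S + E))
                         - np.log(np.linalg.det(S - E))) / (2 * eps)

# Matches (S^{-1})^T, per the identity above.
print(np.allclose(grad_fd, np.linalg.inv(S).T, atol=1e-4))  # True
```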


Multivariate Gaussian - ML solution

Log-likelihood: sample $x_1, \ldots, x_m$. For each $x_i$,

$$\Pr[x_i] = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)\right).$$

$$L = \sum_{i=1}^{m} \log \Pr[x_i] = \sum_{i=1}^{m} -\frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu).$$

ML solution:

$$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \Sigma^{-1}(x_i - \mu) = 0 \implies \mu = \frac{1}{m}\sum_{i=1}^{m} x_i.$$

$$\frac{\partial L}{\partial \Sigma^{-1}} = \sum_{i=1}^{m} \frac{1}{2}\left(\Sigma^\top - (x_i - \mu)(x_i - \mu)^\top\right) = 0 \implies \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^\top.$$
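A short numpy check (ours) that these estimates are the sample mean and the $1/m$ (biased) sample covariance; the synthetic data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))  # m = 500 samples in R^3

mu = X.mean(axis=0)            # (1/m) sum_i x_i
D = X - mu
Sigma = (D.T @ D) / len(X)     # (1/m) sum_i (x_i - mu)(x_i - mu)^T

# Agrees with numpy's biased covariance estimator.
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
print(mu, Sigma, sep="\n")
```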


GMMs - EM Algorithm

Mixture of Gaussians:

$$p_\lambda[x] = \sum_{k=1}^{M} \lambda_k \, \mathcal{N}(x; \mu_k, \Sigma_k), \qquad L = \sum_{i=1}^{m} \log \sum_{k=1}^{M} \lambda_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k).$$

EM algorithm: let $p^t_{i,k} = \mathcal{N}(x_i; \mu^t_k, \Sigma^t_k)$.

• E-step:

$$q^{t+1}_{i,k} = p_{\lambda^t}[z = k \mid x_i] = \frac{\lambda^t_k \, p^t_{i,k}}{\sum_{k'=1}^{M} \lambda^t_{k'} \, p^t_{i,k'}}.$$

• M-step:

$$\mu^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k} x_i}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \Sigma^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k} (x_i - \mu^{t+1}_k)(x_i - \mu^{t+1}_k)^\top}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \lambda^{t+1}_k = \frac{1}{m}\sum_{i=1}^{m} q^{t+1}_{i,k}.$$
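One possible implementation of these updates for a full-covariance GMM (a sketch; the initialization choices and toy data are ours, not from the slides).

```python
import numpy as np

def em_gmm(X, M, iters=50, seed=0):
    """EM for a full-covariance GMM, following the updates above."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    lam = np.full(M, 1.0 / M)                 # mixture weights lambda_k
    mu = X[rng.choice(m, M, replace=False)]   # means initialized at data points
    Sigma = np.stack([np.cov(X, rowvar=False, bias=True)] * M)

    for _ in range(iters):
        # E-step: responsibilities q[i, k] = Pr[z = k | x_i].
        q = np.empty((m, M))
        for k in range(M):
            diff = X - mu[k]
            quad = np.sum(diff @ np.linalg.inv(Sigma[k]) * diff, axis=1)
            norm = ((2 * np.pi) ** N * np.linalg.det(Sigma[k])) ** -0.5
            q[:, k] = lam[k] * norm * np.exp(-0.5 * quad)
        q /= q.sum(axis=1, keepdims=True)

        # M-step: weighted means, covariances, and updated weights.
        Nk = q.sum(axis=0)
        mu = (q.T @ X) / Nk[:, None]
        for k in range(M):
            diff = X - mu[k]
            Sigma[k] = (q[:, k, None] * diff).T @ diff / Nk[k]
        lam = Nk / m
    return lam, mu, Sigma

# Toy usage: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
weights, means, covs = em_gmm(X, M=2)
print(weights, means, sep="\n")
```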


GMMs - EM Algorithm

Proof: M-step.

• Auxiliary function:

$$l = \sum_{k=1}^{M} \sum_{i=1}^{m} p_\lambda[z = k \mid x_i] \log p_\lambda[x_i, z = k] = \sum_{k=1}^{M} \sum_{i=1}^{m} q_{i,k} \log p_\lambda[x_i, z = k]$$

$$= \sum_{k=1}^{M} \sum_{i=1}^{m} q_{i,k} \left( \log \lambda_k - \frac{1}{2}(x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k) - \frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_k^{-1}| \right).$$

• Optimization for fixed $q$ and $\sum_{k=1}^{M} \lambda_k = 1$, with Lagrange multiplier $\beta$ for the constraint:

$$\frac{\partial L}{\partial \lambda_k} = \frac{1}{\lambda_k} \sum_{i=1}^{m} q_{i,k} - \beta, \qquad \frac{\partial L}{\partial \mu_k} = \Sigma_k^{-1} \sum_{i=1}^{m} q_{i,k} (x_i - \mu_k),$$

$$\frac{\partial L}{\partial \Sigma_k^{-1}} = \frac{1}{2} \sum_{i=1}^{m} q_{i,k} \left( \Sigma_k^\top - (x_i - \mu_k)(x_i - \mu_k)^\top \right).$$

Setting these gradients to zero yields the M-step updates of the previous slide.


HMMs - EM Algorithm

Use the EM algorithm in the discrete case to determine the probability $w[e]$ of each transition $e$ at time $t$.

Use the GMM updates combined with the probability, for each observation, of being emitted from each transition $e$.


Initialization

Selection of model:

• HMM topology (states and transitions).

• number of mixture components ($M$).

Flat start: all distributions with the same average values (mean and variance) computed over the entire training set (see the sketch below).

Existing hand-labeled segmentation: used as the basis for a partial segmentation.

Uniform segmentation: equal number of HMM transitions per training example.
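A minimal flat-start sketch (ours; diagonal covariances, the array shapes, and the synthetic features are assumptions): every state distribution is initialized with the global statistics of the training features.

```python
import numpy as np

# Hypothetical training features: 1000 frames of N = 39 coefficients.
feats = np.random.default_rng(0).standard_normal((1000, 39))

g_mean = feats.mean(axis=0)  # global mean over the entire training set
g_var = feats.var(axis=0)    # global (diagonal) variance

# Flat start: every distribution starts from the same global statistics.
states = {s: {"mean": g_mean.copy(), "var": g_var.copy()} for s in range(3)}
print(states[0]["mean"].shape, states[0]["var"].shape)
```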


Forced Alignment

Viterbi training: approximate but faster method to determine the HMM path.

Segmental K-means: approximate but faster method to determine which Gaussian of a mixture each training instance was sampled from.


Viterbi Alignment

Idea: faster alignment based on most likely path.

• use current acoustic model.

• align the sequence $o$ of feature vectors with the most likely path (best-path algorithm, e.g., Viterbi).

Example alignment of ten observations to transitions:

time t:        1    2    3    4    5    6    7    8    9    10
transition e:  e11  e11  e11  e12  e22  e22  e23  e33  e34  e44
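A minimal sketch of the best-path computation (the standard Viterbi recursion in log space; the toy model and scores below are hypothetical, not from the slides).

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Best state path for an HMM starting in state 0.
    log_trans: (S, S) log transition scores; log_emit: (T, S) log score
    of each observation under each state's distribution."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_emit[0]
    delta[0, 1:] = -np.inf  # force the start in state 0
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)             # best predecessor of j
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-state left-to-right model with self-loops, six observations.
lt = np.log(np.array([[0.6, 0.4, 0.0],
                      [0.0, 0.6, 0.4],
                      [0.0, 0.0, 1.0]]) + 1e-12)
le = np.log(np.full((6, 3), 1.0 / 3.0))
print(viterbi(lt, le))  # [0, 1, 2, 2, 2, 2]
```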


Segmental K-Means

Idea: use clustering algorithm to determine which Gaussian generated each observation.

Solution: use K-means clustering algorithm to initialize distribution means.

• Initialization: select $K$ centroids $c_1, \ldots, c_K$.

• Repeat until no centroid changes:

• for each point $x_i$, find the closest centroid $c_j$ and assign $x_i$ to cluster $j$.

• for each cluster $j$, redefine $c_j$ as the centre of mass.
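A plain K-means sketch (ours) following the steps above, e.g., to initialize the means of a mixture; the empty-cluster handling and toy data are our choices.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """K-means clustering: assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), K, replace=False)]  # initial centroids
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: closest centroid for each point x_i.
        assign = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        # Update step: each centroid becomes its cluster's centre of mass
        # (kept unchanged if its cluster is empty).
        new_c = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                          else c[j] for j in range(K)])
        if np.allclose(new_c, c):  # no centroid change: converged
            break
        c = new_c
    return c, assign

# Usage: initialize the means of an M = 2 mixture.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
centroids, labels = kmeans(X, K=2)
print(centroids)
```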


Notes

Convergence rate of K-means: subject of current research.

GMM EM algorithm: soft version of K-means.



Variable Number of Mixtures

Problems:

• number of mixture components required is unknown.

• possible over- or underfitting.

Solution:

• start with a single Gaussian distribution.

• create a two-component mixture with slightly perturbed means $\mu \pm \epsilon$ and the same covariance matrix.

• reestimate model parameters and split again until the desired complexity is reached (see the sketch below).
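A minimal sketch of the splitting step. Scaling the perturbation by each dimension's standard deviation is our assumption; the slide only specifies means of $\mu \pm \epsilon$ with the covariance kept.

```python
import numpy as np

def split_component(weight, mean, cov, eps=0.1):
    """Split one Gaussian into a two-component mixture: means mu + e and
    mu - e, same covariance, weight divided evenly."""
    e = eps * np.sqrt(np.diag(cov))  # per-dimension perturbation (our choice)
    return ([weight / 2.0, weight / 2.0],
            [mean + e, mean - e],
            [cov.copy(), cov.copy()])

# Usage: split a single 2-D Gaussian into a two-component mixture.
w, mu, cov = split_component(1.0, np.zeros(2), np.eye(2))
print(w, mu, sep="\n")
```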


Acoustic Modeling

In practice:

• complicated recipes or heuristics with a large number of ad hoc techniques.

• skilled human supervision is key: choice of initial parameters to avoid local minima, segmentation, choice and number of parameters.

• computationally very costly: training may take many days on several processors in large-vocabulary speech recognition.


Improvements

Adaptation (VTLN, MLLR).

Ensemble methods (ROVER).

Better features (e.g., LDA, MMI).

Discriminative training.



References

• L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.

• Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Vol. 39, No. 1, pp. 1-38, 1977.

• M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.

• Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.

• O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.

• C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, Vol. 11, No. 1, pp. 95-103, 1983.