Speech Recognition
Lecture 9: Acoustic Models
Mehryar Mohri
Courant Institute of Mathematical Sciences
Speech Recognition Components
Acoustic and pronunciation model:
• $\Pr(o \mid d)$: observation seq. given distribution seq.
• $\Pr(d \mid c)$: distribution seq. given CD phone seq.
• $\Pr(c \mid p)$: CD phone seq. given phoneme seq.
• $\Pr(p \mid w)$: phoneme seq. given word seq.
Language model: $\Pr(w)$, distribution over word seq.

$$\Pr(o \mid w) = \sum_{d,\,c,\,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w).$$
Context-Dependent Phones
Idea:
• phoneme pronunciation depends on environment (allophones, co-articulation).
• modeling phones in context yields better accuracy.
Context-dependent rules:
• Context-dependent units: $ae/b\_d \to ae_{b,d}$.
• Allophonic rules: $t/V\_V \to dx$ (flapping of $t$ between vowels).
• Complex contexts: regular expressions.
(Lee, 1990; Young et al., 1994)
Acoustic Models
Critical component of a speech recognition system.
Different types:
• context-independent (CI) phones vs. context-dependent (CD) phones.
• speaker-independent vs. speaker-dependent.
Complex design and training techniques in large-vocabulary speech recognition.
This Lecture
Acoustic models
Training algorithms
Continuous Speech Models
Graph topology: a 3-state HMM model for each CD phone.
• Interpretation: beginning, middle, and end of the CD phone.
Continuous case: transition weights based on distributions over feature vectors in $\mathbb{R}^N$, typically with $N = 39$.

[Figure: 3-state HMM transducer for the CD phone $ae_{b,d}$: states $0 \to 1 \to 2 \to 3$; each state $i \in \{0, 1, 2\}$ has a self-loop and a forward arc labeled $d_i{:}\epsilon$, except the final arc into state 3, labeled $d_2{:}ae_{b,d}$.]
(Rabiner and Juang, 1993)
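As a sketch of this topology (illustrative Python, not from the lecture; the 0.6/0.4 weights are arbitrary initial values):

```python
import numpy as np

def make_3state_hmm():
    """Left-to-right 3-state HMM topology for one CD phone.

    States 0, 1, 2 are emitting; state 3 is the exit state. Each
    emitting state has a self-loop and a forward arc, matching the
    beginning/middle/end interpretation above.
    """
    A = np.zeros((4, 4))
    for i in range(3):
        A[i, i] = 0.6      # self-loop: stay in state i
        A[i, i + 1] = 0.4  # forward arc: advance to state i + 1
    return A

A = make_3state_hmm()
assert np.allclose(A[:3].sum(axis=1), 1.0)  # emitting rows are stochastic
```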
Distributions
Simple cases: e.g., single speaker, single Gaussian distribution

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{\top}\Sigma^{-1}(x - \mu)\right).$$

• covariance matrix typically diagonal.
General case: mixtures of Gaussians

$$\sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x; \mu_k, \Sigma_k), \quad \text{with } \lambda_i \geq 0 \text{ and } \sum_{i=1}^{M} \lambda_i = 1. \quad \text{Typically, } M = 16.$$
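These densities are direct to evaluate; a minimal sketch assuming diagonal covariances (illustrative Python; names are mine, not the lecture's):

```python
import numpy as np

def gaussian_density(x, mu, var):
    """N(x; mu, Sigma) with diagonal covariance, Sigma = diag(var)."""
    N = len(x)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def gmm_density(x, lambdas, mus, vars_):
    """Mixture density: sum_k lambda_k N(x; mu_k, Sigma_k)."""
    return sum(l * gaussian_density(x, m, v)
               for l, m, v in zip(lambdas, mus, vars_))

# Example with M = 2 components in R^39 (N = 39, as in the lecture).
rng = np.random.default_rng(0)
x = rng.standard_normal(39)
mus = [np.zeros(39), np.ones(39)]
vars_ = [np.ones(39), 2 * np.ones(39)]
print(gmm_density(x, [0.5, 0.5], mus, vars_))
```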
Parameter Reduction
Problem: too many parameters (> 200M).
• large number of GMMs provides better modeling flexibility.
• but requires much more training data.
Solution: tying mixtures, i.e., equality constraints on distributions (see the sketch below).
• within the same HMM, or across distributions in different HMMs (similar CD phone transitions).
• semi-continuous: same distributions, different mixture weights.
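One way to picture tying (a hypothetical sketch, not the lecture's implementation): transitions in different HMMs hold references to a single shared distribution object, so reestimation updates one parameter set.

```python
import numpy as np

class Gaussian:
    """A single Gaussian whose parameters may be shared (tied)."""
    def __init__(self, dim):
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)

# Two CD phone models with similar transitions share one distribution:
shared = Gaussian(39)
hmm_a_states = [shared, Gaussian(39), Gaussian(39)]
hmm_b_states = [shared, Gaussian(39), Gaussian(39)]

# Updating the tied distribution affects both HMMs at once.
shared.mu += 0.1
assert hmm_a_states[0] is hmm_b_states[0]
```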
Silence Model
Motivation: accounting for pauses between words and sentences.
Model:
• an optional pause symbol between words and at the beginning and end of utterances in the language model.
• a specific silence acoustic model, which can be context-dependent or not.
Composite HMM model
Composite model: obtained by taking the union and closure of all CD phone models.
Illustration:
$$\left(\bigcup_{p=1}^{P} H_p\right)^{\!*}.$$

[Figure: union of the 3-state models for two CD phones $p$ and $p'$ sharing the initial state; arcs are labeled $d_i{:}\epsilon$ and $d'_i{:}\epsilon$, with final arcs $d_2{:}p$ and $d'_2{:}p'$, closed with transitions back to the start.]
Tying can reduce the size.
This Lecture
Acoustic models
Training algorithms
Parameter Estimation
Data: a sample of $m$ sequences of the form:
• Feature vectors: $o_1, o_2, \ldots, o_T \in \mathbb{R}^N$.
• CD phones: $p_1, p_2, \ldots, p_l$, e.g., $ae_{c,t}$.
Parameters:
• means and covariances $\mu_j, \Sigma_j$ of the Gaussians.
• mixture coefficients $\lambda_k$.
Problems:
• segmentation.
• model initialization.
Estimation Algorithm
Baum-Welch algorithm:
• maximum likelihood principle.
• generalizes to continuous case with Gaussians and GMMs.
Questions: segmentation, model initialization.
Univariate Gaussian - ML solution
Problem: find the most likely Gaussian distribution given a sequence of real-valued observations, e.g., $3.18, 2.35, .95, 1.175, \ldots$

Normal distribution:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

Likelihood:

$$l(p) = -\frac{1}{2}\,m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Solution: $l$ is differentiable and concave;

$$\frac{\partial l(p)}{\partial \mu} = 0 \;\Longleftrightarrow\; \mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \;\Longleftrightarrow\; \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \mu^2.$$
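On the example observations above, the closed-form estimates are a few lines of illustrative Python:

```python
import numpy as np

x = np.array([3.18, 2.35, 0.95, 1.175])  # example observations from the slide

m = len(x)
mu = x.sum() / m                       # mu = (1/m) sum_i x_i
sigma2 = (x ** 2).sum() / m - mu ** 2  # sigma^2 = (1/m) sum_i x_i^2 - mu^2

# Same result via the centered form, as a sanity check.
assert np.isclose(sigma2, ((x - mu) ** 2).mean())
print(mu, sigma2)
```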
Gradients
General identities:
• log of determinant: $\nabla_{\Sigma}\,\log\det(\Sigma) = (\Sigma^{-1})^{\top} = \Sigma^{-\top}$.
• bilinear form: $\nabla_{\Sigma}\,(x^{\top}\Sigma\,x) = x\,x^{\top}$.
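Both identities can be checked numerically with finite differences; a small sketch (assuming numpy and a random symmetric positive-definite matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
S = A @ A.T + 3 * np.eye(3)   # symmetric positive-definite matrix
x = rng.standard_normal(3)

def num_grad(f, S, eps=1e-6):
    """Finite-difference gradient of f with respect to the matrix S."""
    G = np.zeros_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            E = np.zeros_like(S); E[i, j] = eps
            G[i, j] = (f(S + E) - f(S - E)) / (2 * eps)
    return G

# grad_S log det(S) = (S^{-1})^T and grad_S x^T S x = x x^T
assert np.allclose(num_grad(lambda S: np.log(np.linalg.det(S)), S),
                   np.linalg.inv(S).T, atol=1e-4)
assert np.allclose(num_grad(lambda S: x @ S @ x, S), np.outer(x, x), atol=1e-4)
```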
Multivariate Gaussian - ML solution
Log likelihood: for a sample $x_1, \ldots, x_m$,

$$\Pr[x_i] = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^{\top}\Sigma^{-1}(x_i - \mu)\right),$$

$$L = \sum_{i=1}^{m} \log \Pr[x_i] = \sum_{i=1}^{m} -\frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}(x_i - \mu)^{\top}\Sigma^{-1}(x_i - \mu).$$

ML solution:

$$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \Sigma^{-1}(x_i - \mu) = 0 \;\Longleftrightarrow\; \mu = \frac{1}{m}\sum_{i=1}^{m} x_i,$$

$$\frac{\partial L}{\partial \Sigma^{-1}} = \sum_{i=1}^{m} \frac{1}{2}\left(\Sigma^{\top} - (x_i - \mu)(x_i - \mu)^{\top}\right) = 0 \;\Longleftrightarrow\; \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^{\top}.$$
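In code, the ML solution is the sample mean and the (biased, $1/m$) sample covariance; a quick illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))   # m = 1000 samples in R^5

m = X.shape[0]
mu = X.mean(axis=0)                  # mu = (1/m) sum_i x_i
D = X - mu
Sigma = (D.T @ D) / m                # Sigma = (1/m) sum_i (x_i - mu)(x_i - mu)^T

# numpy's cov with bias=True uses the same 1/m normalization.
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```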
GMMs - EM Algorithm
Mixture of $M$ Gaussians:

$$p_{\lambda}[x] = \sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x; \mu_k, \Sigma_k), \qquad L = \sum_{i=1}^{m} \log \sum_{k=1}^{M} \lambda_k\,\mathcal{N}(x_i; \mu_k, \Sigma_k).$$

EM algorithm: let $p^{t}_{i,k} = \mathcal{N}(x_i; \mu^{t}_k, \Sigma^{t}_k)$.
• E-step:

$$q^{t+1}_{i,k} = p_{\lambda^t}[z = k \mid x_i] = \frac{\lambda^{t}_k\, p^{t}_{i,k}}{\sum_{k'=1}^{M} \lambda^{t}_{k'}\, p^{t}_{i,k'}}.$$

• M-step:

$$\mu^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\, x_i}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \Sigma^{t+1}_k = \frac{\sum_{i=1}^{m} q^{t+1}_{i,k}\,(x_i - \mu^{t+1}_k)(x_i - \mu^{t+1}_k)^{\top}}{\sum_{i=1}^{m} q^{t+1}_{i,k}}, \qquad \lambda^{t+1}_k = \frac{1}{m}\sum_{i=1}^{m} q^{t+1}_{i,k}.$$
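A compact sketch of one EM iteration implementing these updates, restricted to diagonal covariances for brevity (illustrative Python; array shapes and names are mine):

```python
import numpy as np

def em_step(X, lambdas, mus, vars_):
    """One EM iteration for a diagonal-covariance GMM.

    X: (m, N) data; lambdas: (M,); mus, vars_: (M, N).
    """
    m = X.shape[0]
    # E-step: responsibilities q[k, i] = lambda_k N(x_i) / sum_k' lambda_k' N(x_i).
    d = X[None, :, :] - mus[:, None, :]                       # (M, m, N)
    logp = (-0.5 * np.sum(d ** 2 / vars_[:, None, :], axis=2)
            - 0.5 * np.sum(np.log(2 * np.pi * vars_), axis=1)[:, None])
    w = np.log(lambdas)[:, None] + logp                       # (M, m)
    q = np.exp(w - w.max(axis=0))
    q /= q.sum(axis=0)
    # M-step: weighted means, variances, and mixture weights.
    Q = q.sum(axis=1)                                         # (M,)
    mus_new = (q @ X) / Q[:, None]
    d = X[None, :, :] - mus_new[:, None, :]
    vars_new = np.einsum('km,kmn->kn', q, d ** 2) / Q[:, None]
    return Q / m, mus_new, vars_new
```

The shift by `w.max(axis=0)` before exponentiation is only a numerical-underflow guard; the responsibilities and updates are exactly the $q^{t+1}_{i,k}$, $\mu^{t+1}_k$, $\Sigma^{t+1}_k$, $\lambda^{t+1}_k$ formulas above.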
GMMs - EM Algorithm
Proof: M-step.
• Auxiliary function:

$$l = \sum_{k=1}^{M}\sum_{i=1}^{m} p_{\lambda}[z = k \mid x_i]\,\log p_{\lambda}[x_i, z = k] = \sum_{k=1}^{M}\sum_{i=1}^{m} q_{i,k}\,\log p_{\lambda}[x_i, z = k]$$

$$= \sum_{k=1}^{M}\sum_{i=1}^{m} q_{i,k}\left(\log \lambda_k - \frac{1}{2}(x_i - \mu_k)^{\top}\Sigma_k^{-1}(x_i - \mu_k) - \frac{N}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma_k^{-1}|\right).$$

• Optimization for fixed $q$ and the constraint $\sum_{k=1}^{M} \lambda_k = 1$ (Lagrange multiplier $\beta$):

$$\frac{\partial L}{\partial \Sigma^{-1}_k} = \frac{1}{2}\sum_{i=1}^{m} q_{i,k}\left(\Sigma_k^{\top} - (x_i - \mu_k)(x_i - \mu_k)^{\top}\right),$$

$$\frac{\partial L}{\partial \lambda_k} = \frac{1}{\lambda_k}\sum_{i=1}^{m} q_{i,k} - \beta, \qquad \frac{\partial L}{\partial \mu_k} = \Sigma^{-1}_k \sum_{i=1}^{m} q_{i,k}\,(x_i - \mu_k).$$

Setting these gradients to zero yields the M-step updates above.
HMMs - EM Algorithm
Use the EM algorithm in the discrete case to determine the probability $w[e]$ of each transition $e$ at time $t$.
Use the GMM updates combined with the probability for each observation to be emitted from each transition $e$.
Initialization
Selection of model:
• HMM topology (states and transitions).
• number of mixture components ($M$).
Flat start: all distributions share the same average values (mean and variance) computed over the entire training set.
Existing hand-labeled segmentation: used as the basis for a partial segmentation.
Uniform segmentation: equal number of HMM transitions per training example.
Forced Alignment
Viterbi training: an approximate but faster method to determine the HMM path.
Segmental K-means: an approximate but faster method to determine which Gaussian of a mixture each training instance was sampled from.
Viterbi Alignment
Idea: faster alignment based on most likely path.
• use current acoustic model.
• align the sequence of feature vectors $o$ with the most likely path (best-path algorithm, e.g., Viterbi).

[Figure: alignment of feature vectors $o_1, \ldots, o_{10}$ with the transitions of the most likely path: $e_{11}\,e_{11}\,e_{11}\,e_{12}\,e_{22}\,e_{22}\,e_{23}\,e_{33}\,e_{34}\,e_{44}$.]
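A minimal sketch of the best-path computation (illustrative Python over log probabilities; `log_pi`, `log_A`, `log_B` are my names for the initial, transition, and emission log scores, not the lecture's notation). A left-to-right model as above gets $-\infty$ on disallowed transitions, which forces a monotone alignment like the one shown.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence.

    log_pi: (S,) initial log probabilities.
    log_A:  (S, S) transition log probabilities.
    log_B:  (T, S) per-observation emission log probabilities.
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log score ending in each state at t = 0
    psi = np.zeros((T, S), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path ending with i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]                    # state sequence aligned with o_1, ..., o_T
```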
Segmental K-Means
Idea: use a clustering algorithm to determine which Gaussian generated each observation.
Solution: use the K-means clustering algorithm to initialize distribution means (sketched after this list).
• Initialization: select $K$ centroids $c_1, \ldots, c_K$.
• Repeat until no centroid changes:
• for each point $x_i$, find the closest centroid $c_j$ and assign $x_i$ to cluster $j$.
• for each cluster $j$, redefine $c_j$ as the center of mass of its points.
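The loop above in a few lines of illustrative Python (initialization by random sampling is one common choice, not prescribed by the lecture):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: X is (m, N); returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), K, replace=False)]      # initial centroids
    for _ in range(iters):
        # Assign each point to its closest centroid.
        assign = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        # Recompute each centroid as the center of mass of its cluster.
        new_c = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                          else c[j] for j in range(K)])
        if np.allclose(new_c, c):                    # no centroid changed
            break
        c = new_c
    return c, assign
```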
Notes
Convergence rate of K-means: subject of current research.
GMM EM algorithm: soft version of K-means.
Variable Number of Mixtures
Problems:
• the number of mixture components required is not known in advance.
• possible over- or underfitting.
Solution:
• start with a single Gaussian distribution.
• create a two-component mixture with slightly perturbed means $\mu \pm \epsilon$ and the same covariance matrix (see the sketch below).
• reestimate the model parameters; repeat until the desired complexity is reached.
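The splitting step itself is tiny; a hedged sketch (the perturbation scale `eps` is my choice, not the lecture's):

```python
import numpy as np

def split_component(mu, var, eps=0.2):
    """Split one Gaussian into a two-component mixture.

    Means are perturbed to mu +/- eps * std; the covariance is kept.
    The two components start with equal mixture weights.
    """
    delta = eps * np.sqrt(var)
    return [(0.5, mu + delta, var.copy()),
            (0.5, mu - delta, var.copy())]

# Start from a single Gaussian, split, reestimate with EM, and
# repeat until the desired number of components is reached.
mu, var = np.zeros(39), np.ones(39)
components = split_component(mu, var)
```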
Acoustic Modeling
In practice:
• complicated recipes or heuristics with a large number of ad hoc techniques.
• skilled human supervision is key: choice of initial parameters to avoid poor local optima, segmentation, choice and number of parameters.
• computationally very costly: training may take many days on several processors in large-vocabulary speech recognition.
Improvements
Adaptation (VTLN, MLLR).
Ensemble methods (ROVER).
Better features (e.g., LDA, MMI).
Discriminative training.
References
• L. E. Baum. An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes. Inequalities, 3:1-8, 1972.
• Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
• M. Jamshidian and R. I. Jennrich. Conjugate Gradient Acceleration of the EM Algorithm. Journal of the American Statistical Association, 88(421):221-228, 1993.
• Lawrence Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
• J. A. O'Sullivan. Alternating Minimization Algorithms: From Blahut-Arimoto to Expectation-Maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• C. F. Jeff Wu. On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11(1):95-103, 1983.