Discriminative Learning for Hidden Markov Models
Li Deng
Microsoft Research
EE 516; UW Spring 2009
Slide 2: Minimum Classification Error (MCE)
- The objective function of MCE training is a smoothed recognition error rate.
- Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD).
- In this work we propose a Growth Transformation (GT) based method for MCE model estimation.
Slide 3: Automatic Speech Recognition (ASR)
- Speech signal of the r-th utterance: x1, x2, x3, x4, …, xt, …, xT
  (example transcription: (sil) OH (sil) SIX EIGHT (sil))
- Segment the signal into frames and apply spectrum analysis to obtain the observation sequence: Xr = x1, …, xT
- Speech recognition (decoding):

    sr* = argmax_{sr} log p_Λ(sr | Xr) = argmax_{sr} log p_Λ(Xr, sr)
Slide 4: Models (feature functions) in ASR
ASR in the log-linear framework:

    p_Λ(Xr, sr) ∝ exp( Σ_{m=1}^{3} λm · hm(sr, Xr) )

where
- h1(sr, Xr) = log p(Xr | sr; Λ)  (acoustic model, AM), with λ1 = 1
- h2(sr, Xr) = log p(sr)  (language model, LM), with λ2 = s (LM scale)
- h3(sr, Xr) = |sr|  (word count), with λ3 = p (word insertion penalty)

Λ is the parameter set of the acoustic model (HMM), which is what MCE training estimates in this work.
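The log-linear combination above can be sketched in a few lines; all score values below are made up for illustration, not taken from the slides:

```python
# Sketch of the slide's log-linear ASR score with illustrative numbers:
# score(s, X) = lambda1*h1 + lambda2*h2 + lambda3*h3, with lambda1 fixed to 1.
def combined_score(am_logprob, lm_logprob, n_words, lm_scale=10.0, word_penalty=-0.5):
    return 1.0 * am_logprob + lm_scale * lm_logprob + word_penalty * n_words

# Two competing hypotheses for one utterance (all numbers illustrative):
hyps = {
    "OH EIGHT THREE": (-120.0, -6.9, 3),   # (AM log-lik, LM log-prob, #words)
    "OH EIGHT SIX":   (-118.0, -7.4, 3),
}
best = max(hyps, key=lambda s: combined_score(*hyps[s]))
```

Note how the LM scale lets a slightly worse acoustic score be overridden by a better language-model score, which is why these weights matter in decoding.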
Slide 5: MCE: Misclassification measure
- Observation sequence Xr: x1, x2, x3, x4, …, xt, …, xT
- Correct label Sr: OH EIGHT THREE
- Competitor s_{r,1}: OH EIGHT SIX

Define the misclassification measure (in the case of using the correct token and the top-one incorrect competing token):

    d_r(Xr, Λ) = log p_Λ(Xr, s_{r,1}) − log p_Λ(Xr, Sr)

where s_{r,1} is the top-one incorrect (not equal to Sr) competing string.
Slide 6: MCE: Loss function
Loss function: a smoothed (sigmoid) error-count function,

    l_r(d_r(Xr, Λ)) = 1 / (1 + e^{−α·d_r(Xr, Λ)})

[Figure: sigmoid loss curve, rising from 0 to 1 as d crosses 0.]

Classification rule: s* = argmax_s log p_Λ(Xr, s)
Classification error:
- d_r(Xr, Λ) > 0  →  1 classification error
- d_r(Xr, Λ) < 0  →  0 classification error
Slide 7: MCE: Objective function
MCE objective function:

    L_MCE(Λ) = (1/R) · Σ_{r=1}^{R} l_r(d_r(Xr, Λ))

- L_MCE(Λ) is the smoothed recognition error rate at the string (token) level.
- The acoustic model is trained to minimize L_MCE(Λ), i.e., Λ* = argmin_Λ {L_MCE(Λ)}.
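The sigmoid loss and the smoothed error rate of the last two slides can be sketched directly; the d_r values below are illustrative, not from the deck:

```python
import math

def mce_loss(d, alpha=1.0):
    # Sigmoid loss l(d) = 1/(1 + exp(-alpha*d)):
    # ~0 when d << 0 (correctly classified), ~1 when d >> 0 (error).
    return 1.0 / (1.0 + math.exp(-alpha * d))

def L_MCE(d_values, alpha=1.0):
    # Smoothed string-level error rate: average sigmoid loss over R utterances.
    return sum(mce_loss(d, alpha) for d in d_values) / len(d_values)

# Illustrative misclassification measures d_r for R = 4 utterances:
d_values = [-5.0, -2.0, 0.5, 8.0]
rate = L_MCE(d_values)
```

With a large slope alpha the sigmoid approaches a hard 0/1 error count, which is exactly why L_MCE is called a smoothed error rate.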
Slide 8: MCE: Optimization
Traditional stochastic GD:
- Gradient-descent based online optimization
- Convergence is unstable
- Training process is difficult to parallelize

New Growth Transformation:
- Extended Baum-Welch based batch-mode method
- Stable convergence
- Ready for parallelized processing
Slide 9: MCE: Optimization
Growth Transformation based MCE proceeds through the following chain:
- Minimizing L_MCE(Λ) = Σ l(d(·))
- ⇔ Maximizing P(Λ) = G(Λ) / H(Λ)
- ⇐ Maximizing F(Λ; Λ′) = G(Λ) − P(Λ′)·H(Λ) + D
- Rewriting F(Λ; Λ′) = Σ f(·)
- ⇐ Maximizing U(Λ; Λ′) = Σ f′(·)·log f(·)
- GT formula: ∂U(·)/∂Λ = 0  ⇒  Λ = T(Λ′)

If Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(·) is called a growth transformation of Λ for P(Λ).
Slide 10: MCE: Optimization
Rewrite the MCE loss function as

    l(d_r(Xr, Λ)) = p(Xr, s_{r,1} | Λ) / [ p(Xr, s_{r,1} | Λ) + p(Xr, Sr | Λ) ]

Then minimizing L_MCE(Λ) is equivalent to maximizing Q(Λ) = R·(1 − L_MCE(Λ)), where

    Q(Λ) = Σ_{r=1}^{R} p(Xr, Sr | Λ) / [ p(Xr, s_{r,1} | Λ) + p(Xr, Sr | Λ) ]
         = Σ_{r=1}^{R} [ Σ_{sr ∈ {s_{r,1}, Sr}} p(Xr, sr | Λ)·δ(sr, Sr) ] / [ Σ_{sr ∈ {s_{r,1}, Sr}} p(Xr, sr | Λ) ]

and δ(sr, Sr) = 1 if sr = Sr, 0 otherwise.
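The equivalence between the sigmoid form (slide 6, with α = 1) and the ratio form above can be checked numerically; the two log-likelihood values below are made up:

```python
import math

# Check that 1/(1 + exp(-d)) equals p(X, s_r1) / (p(X, s_r1) + p(X, S_r))
# when d = log p(X, s_r1) - log p(X, S_r).
log_p_comp = -118.0   # log p(X_r, s_r1 | Lambda), illustrative
log_p_ref  = -120.0   # log p(X_r, S_r  | Lambda), illustrative

d = log_p_comp - log_p_ref
loss_sigmoid = 1.0 / (1.0 + math.exp(-d))
loss_ratio = math.exp(log_p_comp) / (math.exp(log_p_comp) + math.exp(log_p_ref))
```

This is why the smoothed loss can be rewritten as a ratio of joint likelihoods, which is the step that turns the MCE objective into a rational function suitable for growth transformation.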
Slide 11: MCE: Optimization
Q(Λ) is further reformulated as a single rational function P(Λ):

    P(Λ) = G(Λ) / H(Λ)

where

    G(Λ) = Σ_{s1} … Σ_{sR} p(X1, …, XR, s1, …, sR | Λ) · Σ_{r=1}^{R} δ(sr, Sr)
    H(Λ) = Σ_{s1} … Σ_{sR} p(X1, …, XR, s1, …, sR | Λ)
Slide 12: MCE: Optimization
Increasing P(Λ) can be achieved by maximizing

    F(Λ; Λ′) = G(Λ) − P(Λ′)·H(Λ) + D

as long as D is a Λ-independent constant, since

    P(Λ) − P(Λ′) = H(Λ)^{−1} · [ F(Λ; Λ′) − F(Λ′; Λ′) ]

(Λ′ is the parameter set obtained from the last iteration.)

Substituting G(Λ) and H(Λ) into F(Λ; Λ′) gives

    F(Λ; Λ′) = Σ_s Σ_q p(χ, q, s | Λ)·[ C(s) − P(Λ′) ] + D

where χ = (X1, …, XR), s = (s1, …, sR), and q is the hidden HMM state sequence.
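The identity that justifies maximizing F in place of P can be verified numerically; the G and H values below are arbitrary positive numbers, not model quantities:

```python
# Numeric sanity check of the auxiliary-function identity:
# with P = G/H and F(L; L') = G(L) - P(L')*H(L) + D,
#   P(L) - P(L') = (F(L; L') - F(L'; L')) / H(L)
# holds for ANY Lambda-independent constant D (D cancels in the difference).
D = 7.0

G_prev, H_prev = 3.0, 4.0   # G(L'), H(L'), illustrative
G_new,  H_new  = 5.0, 6.0   # G(L),  H(L),  illustrative

P_prev = G_prev / H_prev
P_new  = G_new / H_new

F_new  = G_new  - P_prev * H_new  + D
F_prev = G_prev - P_prev * H_prev + D

lhs = P_new - P_prev
rhs = (F_new - F_prev) / H_new
```

Since H(Λ) > 0 (it is a sum of likelihoods), any increase in F over F(Λ′; Λ′) forces P(Λ) > P(Λ′), which is the growth property.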
Slide 13: MCE: Optimization
Reformulate F(Λ; Λ′) to

    F(Λ; Λ′) = Σ_s Σ_q f(χ, q, s, Λ; Λ′)

where

    f(χ, q, s, Λ; Λ′) = p(χ, q, s | Λ)·[ C(s) − P(Λ′) ] + d(s)·p(χ, q | s, Λ′)

with

    C(s) = Σ_{r=1}^{R} δ(sr, Sr)

and the Λ-independent constant D distributed over the terms via d(s), i.e., D = Σ_s d(s)·p(χ | s, Λ′).

F(Λ; Λ′) is ready for EM-style optimization.
Note: Γ(Λ′) = p(χ, q, s | Λ′)·[ C(s) − P(Λ′) ], the weight computed from the last-iteration model, is a constant w.r.t. Λ, and log p(χ, q | s, Λ) is easy to decompose.
Slide 14: MCE: Optimization
Increasing F(Λ; Λ′) can be achieved by maximizing

    U(Λ; Λ′) = Σ_s Σ_q f(χ, q, s, Λ′; Λ′) · log f(χ, q, s, Λ; Λ′)

So the growth transformation of Λ for the CDHMM is obtained from

    ∂U(Λ; Λ′)/∂Λ = 0  ⇒  Λ = T(Λ′)

- Use extended Baum-Welch for the E-step.
- log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute.
Slide 15: MCE: Model estimation formulas
For a Gaussian-mixture CDHMM, the GT of the mean and covariance of Gaussian m is:

    μm = [ Σ_r Σ_t Δγ_{m,r}(t)·x_{r,t} + Dm·μ′m ] / [ Σ_r Σ_t Δγ_{m,r}(t) + Dm ]

    Σm = [ Σ_r Σ_t Δγ_{m,r}(t)·(x_{r,t} − μm)(x_{r,t} − μm)^T + Dm·Σ′m + Dm·(μm − μ′m)(μm − μ′m)^T ] / [ Σ_r Σ_t Δγ_{m,r}(t) + Dm ]

where

    Δγ_{m,r}(t) = p(Sr | Xr, Λ′)·p(s_{r,1} | Xr, Λ′)·[ γ_{m,r,Sr}(t) − γ_{m,r,s_{r,1}}(t) ]

γ_{m,r,s}(t) is the occupation probability of Gaussian m at frame t given string s, and the Gaussian density is

    p(x | μ, Σ) = (2π)^{−d/2}·|Σ|^{−1/2}·exp( −½·(x − μ)^T·Σ^{−1}·(x − μ) )
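The mean update above can be sketched for a single 1-D Gaussian; the frames, weights, and Dm below are toy numbers, not statistics from the experiments:

```python
# Sketch of the GT mean update for one 1-D Gaussian m:
# mu_new = (sum_t dgamma(t)*x_t + D_m*mu_old) / (sum_t dgamma(t) + D_m)
def gt_mean_update(frames, dgammas, mu_old, D_m):
    num = sum(g * x for g, x in zip(dgammas, frames)) + D_m * mu_old
    den = sum(dgammas) + D_m
    return num / den

frames  = [1.0, 2.0, 4.0]     # observation frames x_{r,t} (illustrative)
dgammas = [0.2, -0.1, 0.3]    # MCE weights can be negative (posterior difference)
mu_new = gt_mean_update(frames, dgammas, mu_old=2.0, D_m=5.0)
```

Note the role of Dm as a smoothing anchor: with zero accumulated statistics the update returns the old mean unchanged, and a larger Dm keeps the new mean closer to μ′m.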
Slide 16: MCE: Model estimation formulas
Setting of Dm:
- Theoretically, set Dm so that f(χ, q, s, Λ; Λ′) > 0.
- Empirically,

    Dm = E · Σ_{r=1}^{R} p(s_{r,1} | Xr, Λ′)·p(Sr | Xr, Λ′)·[ Σ_t γ_{m,r,Sr}(t) + Σ_t γ_{m,r,s_{r,1}}(t) ]

where E is an empirical constant (the values E = 1.0, 2.0, 2.5 are compared in the experiments).
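The empirical Dm formula can be computed directly from per-utterance posteriors and occupancies; every number below is made up for illustration:

```python
# Toy computation of the empirical D_m:
# D_m = E * sum_r p(s_r1|X_r) * p(S_r|X_r) * (sum_t gamma_ref(t) + sum_t gamma_comp(t))
def empirical_Dm(stats, E=2.0):
    total = 0.0
    for p_comp, p_ref, gammas_ref, gammas_comp in stats:
        total += p_comp * p_ref * (sum(gammas_ref) + sum(gammas_comp))
    return E * total

stats = [
    # (p(s_r1|X_r), p(S_r|X_r), per-frame gammas under S_r, under s_r1)
    (0.3, 0.7, [0.9, 0.8], [0.1, 0.2]),
    (0.5, 0.5, [0.6, 0.4], [0.4, 0.6]),
]
Dm = empirical_Dm(stats)
```

Since Dm scales with how confusable the reference and competitor are (the product of their posteriors peaks at 0.5·0.5), ambiguous utterances contribute more smoothing, which is consistent with the stable-convergence claim.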
Slide 17: MCE: Workflow
- Recognition: decode the training utterances with the last-iteration model Λ′ to produce competing strings.
- GT-MCE: re-estimate the model from the training transcripts and the competing strings, yielding the new model Λ.
- Feed the new model into the next iteration.
Slide 18: Experiment: TI-DIGITS
- Vocabulary: "1" to "9", plus "oh" and "zero"
- Training set: 8623 utterances / 28329 words
- Test set: 8700 utterances / 28583 words
- 33-dimensional spectrum features: energy + 10 MFCCs, plus ∆ and ∆∆ features
- Model: continuous-density HMMs
- Total number of Gaussian components: 3284
Slide 19: Experiment: TI-DIGITS
GT-MCE vs. the ML (maximum likelihood) baseline:
- Obtained the lowest error rate on this task
- Reduced the recognition word error rate (WER) by 23%
- Fast and stable convergence

[Figure: MCE training on TI-DIGITS with E = 1.0, 2.0, and 2.5. Left: sigmoid loss (smoothed error count) vs. MCE iteration (0-10), y-axis 600-1100. Right: WER (%) vs. MCE iteration, y-axis 0.2-0.4.]
Slide 20: Experiment: Microsoft Tele. ASR
- Microsoft Speech Server (ENUTEL): a telephony speech recognition system
- Training set: 2000 hours of speech / 2.7 million utterances
- 33-dim spectrum features: (E + MFCCs) + ∆ + ∆∆
- Acoustic model: Gaussian-mixture HMM
- Total number of Gaussian components: 100K
- Vocabulary: 120K (delivered vendor lexicon)
- CPU cluster: 100 CPUs @ 1.8GHz-3.4GHz
- Training cost: 4-5 hours per iteration
Slide 21: Experiment: Microsoft Tele. ASR
Evaluation on four corpus-independent test sets:
- Collected from sites other than the training-data providers
- Cover major commercial telephony ASR scenarios

Name | Vocab size | # words | Description
MSCT | 70K        | 4356    | enterprise call-center system (the MS call center we use daily)
SA   | 20K        | 43966   | major commercial applications (includes much cell-phone data)
QSR  | 55K        | 5718    | name-dialing system (many names are OOV; relies on LTS)
ACNT | 20K        | 3219    | foreign-accented speech recognition (designed to test system robustness)
Slide 22: Experiment: Microsoft Tele. ASR

Test set | ML WER | GT-MCE WER | WER reduction
MSCT     | 11.59% | 9.73%      | 16.04%
SA       | 11.24% | 10.07%     | 10.40%
QSR      | 9.55%  | 8.58%      | 10.07%
ACNT     | 32.68% | 29.00%     | 11.25%

- Significant performance improvements across the board
- The first time MCE has been successfully applied to a 2000-hour speech database
- Growth Transformation based MCE training is well suited for large-scale modeling tasks