Discriminative Learning for Hidden Markov Models
Li Deng
Microsoft Research
EE 516; UW Spring 2009
Slide 2: Minimum Classification Error (MCE)
- The objective function of MCE training is a smoothed recognition error rate.
- Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD).
- In this work we propose a Growth Transformation (GT) based method for MCE model estimation.
Slide 3: Automatic Speech Recognition (ASR)
- Speech signal of the r-th utterance: x1, x2, x3, x4, …, xt, …, xT
  (example transcription: (sil) OH (sil) SIX EIGHT (sil))
- Segment the signal into frames and apply spectrum analysis to obtain the observation sequence: Xr = x1, …, xT
- Speech recognition (decoding):

    sr* = argmax_{sr} log p_Λ(sr | Xr) = argmax_{sr} log p_Λ(Xr, sr)
Slide 4: Models (feature functions) in ASR
ASR in the log-linear framework:

    p_Λ(Xr, sr) ∝ exp( Σ_{m=1}^{3} λm · hm(sr, Xr) )

where
- h1(sr, Xr) = log p(Xr | sr; Λ)  (acoustic model, AM), with λ1 = 1
- h2(sr, Xr) = log p(sr)  (language model, LM), with λ2 = s (LM scale)
- h3(sr, Xr) = |sr|  (word count), with λ3 = p (word insertion penalty)

Λ is the parameter set of the acoustic model (HMM), which is what MCE training estimates in this work.
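The log-linear combination above can be sketched in a few lines; all score values below are made up for illustration, not taken from the slides:

```python
# Sketch of the slide's log-linear ASR score with illustrative numbers:
# score(s, X) = lambda1*h1 + lambda2*h2 + lambda3*h3, with lambda1 fixed to 1.
def combined_score(am_logprob, lm_logprob, n_words, lm_scale=10.0, word_penalty=-0.5):
    return 1.0 * am_logprob + lm_scale * lm_logprob + word_penalty * n_words

# Two competing hypotheses for one utterance (all numbers illustrative):
hyps = {
    "OH EIGHT THREE": (-120.0, -6.9, 3),   # (AM log-lik, LM log-prob, #words)
    "OH EIGHT SIX":   (-118.0, -7.4, 3),
}
best = max(hyps, key=lambda s: combined_score(*hyps[s]))
```

Note how the LM scale lets a slightly worse acoustic score be overridden by a better language-model score, which is why these weights matter in decoding.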
Slide 5: MCE: Misclassification measure
- Observation sequence Xr: x1, x2, x3, x4, …, xt, …, xT
- Correct label Sr: OH EIGHT THREE
- Competitor s_{r,1}: OH EIGHT SIX

Define the misclassification measure (in the case of using the correct token and the top-one incorrect competing token):

    d_r(Xr, Λ) = log p_Λ(Xr, s_{r,1}) − log p_Λ(Xr, Sr)

where s_{r,1} is the top-one incorrect (not equal to Sr) competing string.
Slide 6: MCE: Loss function
Loss function: a smoothed (sigmoid) error-count function,

    l_r(d_r(Xr, Λ)) = 1 / (1 + e^{−α·d_r(Xr, Λ)})

[Figure: sigmoid loss curve, rising from 0 to 1 as d crosses 0.]

Classification rule: s* = argmax_s log p_Λ(Xr, s)
Classification error:
- d_r(Xr, Λ) > 0  →  1 classification error
- d_r(Xr, Λ) < 0  →  0 classification error
Slide 7: MCE: Objective function
MCE objective function:

    L_MCE(Λ) = (1/R) · Σ_{r=1}^{R} l_r(d_r(Xr, Λ))

- L_MCE(Λ) is the smoothed recognition error rate at the string (token) level.
- The acoustic model is trained to minimize L_MCE(Λ), i.e., Λ* = argmin_Λ {L_MCE(Λ)}.
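The sigmoid loss and the smoothed error rate of the last two slides can be sketched directly; the d_r values below are illustrative, not from the deck:

```python
import math

def mce_loss(d, alpha=1.0):
    # Sigmoid loss l(d) = 1/(1 + exp(-alpha*d)):
    # ~0 when d << 0 (correctly classified), ~1 when d >> 0 (error).
    return 1.0 / (1.0 + math.exp(-alpha * d))

def L_MCE(d_values, alpha=1.0):
    # Smoothed string-level error rate: average sigmoid loss over R utterances.
    return sum(mce_loss(d, alpha) for d in d_values) / len(d_values)

# Illustrative misclassification measures d_r for R = 4 utterances:
d_values = [-5.0, -2.0, 0.5, 8.0]
rate = L_MCE(d_values)
```

With a large slope alpha the sigmoid approaches a hard 0/1 error count, which is exactly why L_MCE is called a smoothed error rate.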
Slide 8: MCE: Optimization
Traditional stochastic GD:
- Gradient-descent based online optimization
- Convergence is unstable
- Training process is difficult to parallelize

New Growth Transformation:
- Extended Baum-Welch based batch-mode method
- Stable convergence
- Ready for parallelized processing
Slide 9: MCE: Optimization
Growth Transformation based MCE proceeds through the following chain:
- Minimizing L_MCE(Λ) = Σ l(d(·))
- ⇔ Maximizing P(Λ) = G(Λ) / H(Λ)
- ⇐ Maximizing F(Λ; Λ′) = G(Λ) − P(Λ′)·H(Λ) + D
- Rewriting F(Λ; Λ′) = Σ f(·)
- ⇐ Maximizing U(Λ; Λ′) = Σ f′(·)·log f(·)
- GT formula: ∂U(·)/∂Λ = 0  ⇒  Λ = T(Λ′)

If Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(·) is called a growth transformation of Λ for P(Λ).
Slide 10: MCE: Optimization
Rewrite the MCE loss function as

    l(d_r(Xr, Λ)) = p(Xr, s_{r,1} | Λ) / [ p(Xr, s_{r,1} | Λ) + p(Xr, Sr | Λ) ]

Then minimizing L_MCE(Λ) is equivalent to maximizing Q(Λ) = R·(1 − L_MCE(Λ)), where

    Q(Λ) = Σ_{r=1}^{R} p(Xr, Sr | Λ) / [ p(Xr, s_{r,1} | Λ) + p(Xr, Sr | Λ) ]
         = Σ_{r=1}^{R} [ Σ_{sr ∈ {s_{r,1}, Sr}} p(Xr, sr | Λ)·δ(sr, Sr) ] / [ Σ_{sr ∈ {s_{r,1}, Sr}} p(Xr, sr | Λ) ]

and δ(sr, Sr) = 1 if sr = Sr, 0 otherwise.
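The equivalence between the sigmoid form (slide 6, with α = 1) and the ratio form above can be checked numerically; the two log-likelihood values below are made up:

```python
import math

# Check that 1/(1 + exp(-d)) equals p(X, s_r1) / (p(X, s_r1) + p(X, S_r))
# when d = log p(X, s_r1) - log p(X, S_r).
log_p_comp = -118.0   # log p(X_r, s_r1 | Lambda), illustrative
log_p_ref  = -120.0   # log p(X_r, S_r  | Lambda), illustrative

d = log_p_comp - log_p_ref
loss_sigmoid = 1.0 / (1.0 + math.exp(-d))
loss_ratio = math.exp(log_p_comp) / (math.exp(log_p_comp) + math.exp(log_p_ref))
```

This is why the smoothed loss can be rewritten as a ratio of joint likelihoods, which is the step that turns the MCE objective into a rational function suitable for growth transformation.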
Slide 11: MCE: Optimization
Q(Λ) is further reformulated as a single rational function P(Λ):

    P(Λ) = G(Λ) / H(Λ)

where

    G(Λ) = Σ_{s1} … Σ_{sR} p(X1, …, XR, s1, …, sR | Λ) · Σ_{r=1}^{R} δ(sr, Sr)
    H(Λ) = Σ_{s1} … Σ_{sR} p(X1, …, XR, s1, …, sR | Λ)
Slide 12: MCE: Optimization
Increasing P(Λ) can be achieved by maximizing

    F(Λ; Λ′) = G(Λ) − P(Λ′)·H(Λ) + D

as long as D is a Λ-independent constant, since

    P(Λ) − P(Λ′) = H(Λ)^{−1} · [ F(Λ; Λ′) − F(Λ′; Λ′) ]

(Λ′ is the parameter set obtained from the last iteration.)

Substituting G(Λ) and H(Λ) into F(Λ; Λ′) gives

    F(Λ; Λ′) = Σ_s Σ_q p(χ, q, s | Λ)·[ C(s) − P(Λ′) ] + D

where χ = (X1, …, XR), s = (s1, …, sR), and q is the hidden HMM state sequence.
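The identity that justifies maximizing F in place of P can be verified numerically; the G and H values below are arbitrary positive numbers, not model quantities:

```python
# Numeric sanity check of the auxiliary-function identity:
# with P = G/H and F(L; L') = G(L) - P(L')*H(L) + D,
#   P(L) - P(L') = (F(L; L') - F(L'; L')) / H(L)
# holds for ANY Lambda-independent constant D (D cancels in the difference).
D = 7.0

G_prev, H_prev = 3.0, 4.0   # G(L'), H(L'), illustrative
G_new,  H_new  = 5.0, 6.0   # G(L),  H(L),  illustrative

P_prev = G_prev / H_prev
P_new  = G_new / H_new

F_new  = G_new  - P_prev * H_new  + D
F_prev = G_prev - P_prev * H_prev + D

lhs = P_new - P_prev
rhs = (F_new - F_prev) / H_new
```

Since H(Λ) > 0 (it is a sum of likelihoods), any increase in F over F(Λ′; Λ′) forces P(Λ) > P(Λ′), which is the growth property.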
Slide 13: MCE: Optimization
Reformulate F(Λ; Λ′) to

    F(Λ; Λ′) = Σ_s Σ_q f(χ, q, s, Λ; Λ′)

where

    f(χ, q, s, Λ; Λ′) = p(χ, q, s | Λ)·[ C(s) − P(Λ′) ] + d(s)·p(χ, q | s, Λ′)

with

    C(s) = Σ_{r=1}^{R} δ(sr, Sr)

and the Λ-independent constant D distributed over the terms via d(s), i.e., D = Σ_s d(s)·p(χ | s, Λ′).

F(Λ; Λ′) is ready for EM-style optimization.
Note: Γ(Λ′) = p(χ, q, s | Λ′)·[ C(s) − P(Λ′) ], the weight computed from the last-iteration model, is a constant w.r.t. Λ, and log p(χ, q | s, Λ) is easy to decompose.
Slide 14: MCE: Optimization
Increasing F(Λ; Λ′) can be achieved by maximizing

    U(Λ; Λ′) = Σ_s Σ_q f(χ, q, s, Λ′; Λ′) · log f(χ, q, s, Λ; Λ′)

So the growth transformation of Λ for the CDHMM is obtained from

    ∂U(Λ; Λ′)/∂Λ = 0  ⇒  Λ = T(Λ′)

- Use extended Baum-Welch for the E-step.
- log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute.
Slide 15: MCE: Model estimation formulas
For a Gaussian-mixture CDHMM, the GT of the mean and covariance of Gaussian m is:

    μm = [ Σ_r Σ_t Δγ_{m,r}(t)·x_{r,t} + Dm·μ′m ] / [ Σ_r Σ_t Δγ_{m,r}(t) + Dm ]

    Σm = [ Σ_r Σ_t Δγ_{m,r}(t)·(x_{r,t} − μm)(x_{r,t} − μm)^T + Dm·Σ′m + Dm·(μm − μ′m)(μm − μ′m)^T ] / [ Σ_r Σ_t Δγ_{m,r}(t) + Dm ]

where

    Δγ_{m,r}(t) = p(Sr | Xr, Λ′)·p(s_{r,1} | Xr, Λ′)·[ γ_{m,r,Sr}(t) − γ_{m,r,s_{r,1}}(t) ]

γ_{m,r,s}(t) is the occupation probability of Gaussian m at frame t given string s, and the Gaussian density is

    p(x | μ, Σ) = (2π)^{−d/2}·|Σ|^{−1/2}·exp( −½·(x − μ)^T·Σ^{−1}·(x − μ) )
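The mean update above can be sketched for a single 1-D Gaussian; the frames, weights, and Dm below are toy numbers, not statistics from the experiments:

```python
# Sketch of the GT mean update for one 1-D Gaussian m:
# mu_new = (sum_t dgamma(t)*x_t + D_m*mu_old) / (sum_t dgamma(t) + D_m)
def gt_mean_update(frames, dgammas, mu_old, D_m):
    num = sum(g * x for g, x in zip(dgammas, frames)) + D_m * mu_old
    den = sum(dgammas) + D_m
    return num / den

frames  = [1.0, 2.0, 4.0]     # observation frames x_{r,t} (illustrative)
dgammas = [0.2, -0.1, 0.3]    # MCE weights can be negative (posterior difference)
mu_new = gt_mean_update(frames, dgammas, mu_old=2.0, D_m=5.0)
```

Note the role of Dm as a smoothing anchor: with zero accumulated statistics the update returns the old mean unchanged, and a larger Dm keeps the new mean closer to μ′m.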
Slide 16: MCE: Model estimation formulas
Setting of Dm:
- Theoretically, set Dm so that f(χ, q, s, Λ; Λ′) > 0.
- Empirically,

    Dm = E · Σ_{r=1}^{R} p(s_{r,1} | Xr, Λ′)·p(Sr | Xr, Λ′)·[ Σ_t γ_{m,r,Sr}(t) + Σ_t γ_{m,r,s_{r,1}}(t) ]

where E is an empirical constant (the values E = 1.0, 2.0, 2.5 are compared in the experiments).
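The empirical Dm formula can be computed directly from per-utterance posteriors and occupancies; every number below is made up for illustration:

```python
# Toy computation of the empirical D_m:
# D_m = E * sum_r p(s_r1|X_r) * p(S_r|X_r) * (sum_t gamma_ref(t) + sum_t gamma_comp(t))
def empirical_Dm(stats, E=2.0):
    total = 0.0
    for p_comp, p_ref, gammas_ref, gammas_comp in stats:
        total += p_comp * p_ref * (sum(gammas_ref) + sum(gammas_comp))
    return E * total

stats = [
    # (p(s_r1|X_r), p(S_r|X_r), per-frame gammas under S_r, under s_r1)
    (0.3, 0.7, [0.9, 0.8], [0.1, 0.2]),
    (0.5, 0.5, [0.6, 0.4], [0.4, 0.6]),
]
Dm = empirical_Dm(stats)
```

Since Dm scales with how confusable the reference and competitor are (the product of their posteriors peaks at 0.5·0.5), ambiguous utterances contribute more smoothing, which is consistent with the stable-convergence claim.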
Slide 17: MCE: Workflow
- Recognition: decode the training utterances with the last-iteration model Λ′ to produce competing strings.
- GT-MCE: re-estimate the model from the training transcripts and the competing strings, yielding the new model Λ.
- Feed the new model into the next iteration.
Slide 18: Experiment: TI-DIGITS
- Vocabulary: "1" to "9", plus "oh" and "zero"
- Training set: 8623 utterances / 28329 words
- Test set: 8700 utterances / 28583 words
- 33-dimensional spectrum features: energy + 10 MFCCs, plus ∆ and ∆∆ features
- Model: continuous-density HMMs
- Total number of Gaussian components: 3284
Slide 19: Experiment: TI-DIGITS
GT-MCE vs. the ML (maximum likelihood) baseline:
- Obtained the lowest error rate on this task
- Reduced the recognition word error rate (WER) by 23%
- Fast and stable convergence

[Figure: MCE training on TI-DIGITS with E = 1.0, 2.0, and 2.5. Left: sigmoid loss (smoothed error count) vs. MCE iteration (0-10), y-axis 600-1100. Right: WER (%) vs. MCE iteration, y-axis 0.2-0.4.]
Slide 20: Experiment: Microsoft Tele. ASR
- Microsoft Speech Server (ENUTEL): a telephony speech recognition system
- Training set: 2000 hours of speech / 2.7 million utterances
- 33-dim spectrum features: (E + MFCCs) + ∆ + ∆∆
- Acoustic model: Gaussian-mixture HMM
- Total number of Gaussian components: 100K
- Vocabulary: 120K (delivered vendor lexicon)
- CPU cluster: 100 CPUs @ 1.8GHz-3.4GHz
- Training cost: 4-5 hours per iteration
Slide 21: Experiment: Microsoft Tele. ASR
Evaluation on four corpus-independent test sets:
- Collected from sites other than the training-data providers
- Cover major commercial telephony ASR scenarios

Name | Vocab size | # words | Description
MSCT | 70K        | 4356    | enterprise call-center system (the MS call center we use daily)
SA   | 20K        | 43966   | major commercial applications (includes much cell-phone data)
QSR  | 55K        | 5718    | name-dialing system (many names are OOV; relies on LTS)
ACNT | 20K        | 3219    | foreign-accented speech recognition (designed to test system robustness)
Slide 22: Experiment: Microsoft Tele. ASR

Test set | ML WER | GT-MCE WER | WER reduction
MSCT     | 11.59% | 9.73%      | 16.04%
SA       | 11.24% | 10.07%     | 10.40%
QSR      | 9.55%  | 8.58%      | 10.07%
ACNT     | 32.68% | 29.00%     | 11.25%

- Significant performance improvements across the board
- The first time MCE has been successfully applied to a 2000-hour speech database
- Growth Transformation based MCE training is well suited for large-scale modeling tasks