Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis for Speech Recognition
Bing Zhang and Spyros Matsoukas, BBN Technologies, 50 Moulton St., Cambridge
Reporter: Chang Chih Hao
Introduction
• LDA and HLDA
  – Better classification accuracy
  – Some common limitations:
    • Neither assumes any prior knowledge of confusable hypotheses.
    • Their objective functions do not directly relate to the word error rate (WER).
• Minimum Phoneme Error (MPE)
  – Minimizes phoneme errors in lattice-based training frameworks.
  – Since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other projection methods, which makes it potentially better suited for a wider variety of features.
MPE Objective Function
• MPE-HLDA model: the $p \times n$ projection matrix $A$ maps the observations and the Gaussian parameters into the projected space,
  $$\hat{o}_t = A o_t, \qquad \hat{\mu}_m = A \mu_m, \qquad \hat{C}_m = \mathrm{diag}\!\left(A C_m A^T\right)$$
  where $\mu_m$ and $C_m$ are the mean and full covariance of Gaussian $m$ in the original feature space.
• MPE-HLDA aims at minimizing the expected number of phoneme errors introduced by the MPE-HLDA model in a given hypothesis lattice, or equivalently maximizing the function
  $$F_{MPE}(O, A) = \sum_{r=1}^{R} \sum_{w_r} P(w_r \mid O_r)\, \mathrm{Acc}(w_r) \qquad (4)$$
  where
  – $R$ is the total number of training utterances,
  – $O_r$ is the sequence of $p$-dimensional observation vectors in utterance $r$,
  – $\mathrm{Acc}(w_r)$ is the "raw accuracy" score of word hypothesis $w_r$.
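As a sanity check on Eq. (4), here is a minimal sketch for a single utterance, using a hypothetical 3-entry N-best list in place of a lattice. All scores, probabilities, and accuracies below are invented for illustration; the scale factor `k` anticipates the scaled posterior defined on the next slide.

```python
import numpy as np

# Hypothetical hypotheses for one utterance (N-best list instead of a lattice).
log_likelihoods = np.array([-100.0, -102.0, -105.0])  # acoustic log P(O|w)
lm_log_probs = np.array([-10.0, -9.0, -11.0])         # language model log P(w)
raw_accuracy = np.array([5.0, 4.0, 2.0])              # raw accuracy Acc(w) per hypothesis

k = 1.0 / 12.0  # acoustic scale factor (illustrative value)
scores = k * (log_likelihoods + lm_log_probs)
posteriors = np.exp(scores - scores.max())
posteriors /= posteriors.sum()                        # P(w | O), Eq. (5)-style

# This utterance's contribution to F_MPE: expected raw accuracy under the posterior.
f_mpe = float(np.sum(posteriors * raw_accuracy))
```

The objective is bounded by the best and worst hypothesis accuracies, which is a quick invariant to check.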
MPE Objective Function
• $P(w_r \mid O_r)$ is the scaled posterior probability of hypothesis $w_r$ in the lattice:
  $$P(w_r \mid O_r) = \frac{P(O_r \mid w_r)^k\, P(w_r)^k}{\sum_{w'_r} P(O_r \mid w'_r)^k\, P(w'_r)^k}$$
• $P(w_r)$ is the language model probability of hypothesis $w_r$.
• $k$ is a scaling factor used in order to reduce the dynamic range of the acoustic scores, thereby avoiding the concentration of all posterior mass in the top-1 hypothesis of the lattice.
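The effect of the scale factor can be seen in a small sketch: with $k = 1$ nearly all posterior mass lands on the best-scoring hypothesis, while $k \ll 1$ spreads it out. The scores below are invented.

```python
import numpy as np

def scaled_posteriors(log_scores, k):
    """Posterior over hypotheses with acoustic scale k (softmax of k * log score)."""
    s = k * np.asarray(log_scores, dtype=float)
    p = np.exp(s - s.max())
    return p / p.sum()

log_scores = np.array([-100.0, -110.0, -120.0])  # combined acoustic + LM log scores

sharp = scaled_posteriors(log_scores, k=1.0)     # unscaled: mass piles onto the top-1 path
flat = scaled_posteriors(log_scores, k=0.05)     # k << 1: dynamic range reduced
```

With a flatter posterior, competing (confusable) hypotheses keep enough mass to contribute to the MPE gradient.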
MPE Objective Function
• It can be shown that the derivative of (4) with respect to $A$ is
  $$\frac{\partial F_{MPE}(O)}{\partial A} = \sum_{r=1}^{R} \sum_{q} k\, D(q, r)\, \frac{\partial \log P(O_r \mid q, r)}{\partial A} \qquad (6)$$
  where
  $$D(q, r) = P(q \mid O_r)\,\bigl(c(q, r) - c_{avg}(r)\bigr)$$
• $c_{avg}(r)$ is the MPE score of utterance $r$ (average accuracy over all hypotheses), and $c(q, r)$ is the average accuracy over all hypotheses that contain arc $q$.
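The weight $D(q, r)$ is easy to illustrate with toy numbers (the arc posteriors and accuracies below are invented, not from the paper): arcs lying on better-than-average paths get positive weight, so the gradient pushes their likelihood up, and worse-than-average arcs get negative weight.

```python
import numpy as np

# Toy illustration of D(q, r) = P(q | O_r) * (c(q, r) - c_avg(r)).
arc_posterior = np.array([0.7, 0.5, 0.3])  # P(q | O_r) for three arcs (hypothetical)
c_q = np.array([4.2, 3.1, 2.0])            # avg. accuracy of hypotheses through each arc
c_avg = 3.5                                # MPE score of the utterance

D = arc_posterior * (c_q - c_avg)
```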
MPE Objective Function
• The arc log-likelihood derivative decomposes over frames and Gaussians:
  $$\frac{\partial \log P(O_r \mid q, r)}{\partial A} = \sum_{t=S_q}^{E_q} \sum_{m} \gamma_m^{qr}(t)\, \frac{\partial \log P(o_t \mid m)}{\partial A}$$
• $S_q$ and $E_q$ are the begin and end time of arc $q$, and $\gamma_m^{qr}(t)$ denotes the posterior probability of Gaussian $m$ in arc $q$ at time $t$.
• For a single Gaussian,
  $$\frac{\partial \log P(o_t \mid m)}{\partial A} = \hat{C}_m^{-1}\Bigl[\bigl(P_{mt}\,\hat{C}_m^{-1} - I\bigr)\, A\, C_m - A\, R_{mt}\Bigr]$$
  where
  $$P_{mt} = \mathrm{diag}\!\bigl(A (o_t - \mu_m)(o_t - \mu_m)^T A^T\bigr), \qquad R_{mt} = (o_t - \mu_m)(o_t - \mu_m)^T$$
  and $\hat{C}_m = \mathrm{diag}(A C_m A^T)$ as before.
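The per-Gaussian derivative can be verified numerically. The sketch below (toy dimensions, random parameters; the paper's dimensions are far larger) compares the closed form against central finite differences of the projected-Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3                        # toy dimensions for the check
A = rng.standard_normal((p, n))
o = rng.standard_normal(n)
mu = rng.standard_normal(n)
B = rng.standard_normal((n, n))
C = B @ B.T + n * np.eye(n)        # full covariance in the original space

def log_lik(M):
    """log N(M o; M mu, diag(M C M^T)) up to an additive constant."""
    e = M @ (o - mu)
    c_hat = np.diag(M @ C @ M.T)
    return -0.5 * (np.sum(np.log(c_hat)) + np.sum(e**2 / c_hat))

# Analytic gradient: C_hat^{-1} [ (P C_hat^{-1} - I) A C - A R ]
e = o - mu
C_hat_inv = np.diag(1.0 / np.diag(A @ C @ A.T))
P = np.diag((A @ e) ** 2)          # P_mt
R = np.outer(e, e)                 # R_mt
grad = C_hat_inv @ ((P @ C_hat_inv - np.eye(p)) @ A @ C - A @ R)

# Central finite differences, element by element.
num = np.zeros_like(A)
eps = 1e-6
for i in range(p):
    for j in range(n):
        Ap = A.copy(); Ap[i, j] += eps
        Am = A.copy(); Am[i, j] -= eps
        num[i, j] = (log_lik(Ap) - log_lik(Am)) / (2 * eps)
```

Agreement between `grad` and `num` confirms that the reconstructed formula is the gradient of the diagonal-covariance projected Gaussian.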
MPE Objective Function
• Therefore, Eq. (6) can be rewritten as
  $$\frac{\partial F_{MPE}(O)}{\partial A} = k \sum_{m} \hat{C}_m^{-1}\Bigl[\bigl(g_m\,\hat{C}_m^{-1} - k_m I\bigr)\, A\, C_m - A\, J_m\Bigr] \qquad (12)$$
  where
  $$k_m = \sum_{r} \sum_{q} D(q, r) \sum_{t=S_q}^{E_q} \gamma_m^{qr}(t)$$
  $$g_m = \sum_{r} \sum_{q} D(q, r) \sum_{t=S_q}^{E_q} \gamma_m^{qr}(t)\, P_{mt}$$
  $$J_m = \sum_{r} \sum_{q} D(q, r) \sum_{t=S_q}^{E_q} \gamma_m^{qr}(t)\, R_{mt}$$
• (The slide annotates matrix dimensions of 39×39 and 39×162.)
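A toy slice of Eq. (12) can make the per-Gaussian bookkeeping concrete. The sketch below uses one Gaussian and one arc with invented occupancies and $D(q, r)$; the statistic definitions follow the equations above, not BBN's actual accumulator layout.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, T = 4, 2, 6                  # toy dimensions and frame count
A = rng.standard_normal((p, n))
k_scale = 1.0 / 12.0               # acoustic scale k (hypothetical value)

mu = rng.standard_normal(n)
B = rng.standard_normal((n, n)); C = B @ B.T + n * np.eye(n)
obs = rng.standard_normal((T, n))  # frames within the arc's time span
gamma = rng.random(T)              # Gaussian occupancies gamma_m(t) (invented)
D_qr = 0.3                         # D(q, r) for the single arc (invented)

# Accumulate the per-Gaussian statistics k_m, g_m, J_m over the arc's frames.
k_m = D_qr * gamma.sum()
g_m = np.zeros((p, p))
J_m = np.zeros((n, n))
for t in range(T):
    e = obs[t] - mu
    g_m += D_qr * gamma[t] * np.diag((A @ e) ** 2)   # P_mt accumulation
    J_m += D_qr * gamma[t] * np.outer(e, e)          # R_mt accumulation

# Combine the statistics with the full covariance C (single-Gaussian slice of Eq. 12).
c_hat_inv = np.diag(1.0 / np.diag(A @ C @ A.T))
grad = k_scale * c_hat_inv @ ((g_m @ c_hat_inv - k_m * np.eye(p)) @ A @ C - A @ J_m)
```

By linearity, this reproduces the frame-by-frame sum of the per-Gaussian derivatives weighted by $k\, D(q,r)\, \gamma_m(t)$.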
MPE-HLDA Implementation
• In theory, the derivative of the MPE-HLDA objective function can be computed based on Eq. (12), via a single forward-backward pass over the training lattices. In practice, however, it is not possible to fit all the full covariance matrices in memory.
• Two steps:
  – First, run a forward-backward pass over the training lattices to accumulate the per-Gaussian statistics.
  – Second, use these statistics together with the full covariance matrices to synthesize the derivative.
• The paper used gradient descent to update the projection matrix.
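The update step itself is plain gradient ascent on $F_{MPE}$ (stepping along $+\partial F/\partial A$, since the objective is maximized). A minimal sketch with a simple differentiable surrogate standing in for $F_{MPE}$ (the surrogate, step size, and iteration count are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 2, 4
A = rng.standard_normal((p, n))          # projection matrix being trained
target = rng.standard_normal((p, n))     # optimum of the surrogate objective

def objective(M):
    return -np.sum((M - target) ** 2)    # surrogate for F_MPE (hypothetical)

def gradient(M):
    return -2.0 * (M - target)           # dF/dA of the surrogate

eta = 0.1                                # step size (hypothetical)
before = objective(A)
for _ in range(100):
    A = A + eta * gradient(A)            # ascend the objective
after = objective(A)
```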
Experimental Framework
(Diagram: an l×1 concatenated feature vector is first reduced by an n×l matrix, then projected by the p×n MPE-HLDA matrix A to a p×1 feature.)
• Global feature projection
  – There is more useful information in longer contexts.
  – Reduces the computational cost.
Experimentation
• Conversational Telephone Speech (CTS)
  – 2300 hours of training data
    • 800 hours: training the initial ML model
    • 1500 hours: held-out training data
      – Lattice generation
      – Discriminative training
      – MPE-HLDA: only 370 hours
  – Testing sets
    • Eval03
    • Dev04
Experimentation
• Conversational Telephone Speech (CTS)
  – Features
    • Frame-concatenated PLP cepstra: 15 frames, l = 225, n = 130, p = 60
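The stated dimensions (l = 225 from 15 frames implies 15 coefficients per frame) can be wired together as a quick shape check; the projection matrices below are random stand-ins, not trained models.

```python
import numpy as np

rng = np.random.default_rng(3)
frames = rng.standard_normal((15, 15))   # 15 consecutive frames of 15-dim PLP cepstra
x = frames.reshape(-1)                   # concatenated feature vector, l = 225

L = rng.standard_normal((130, 225))      # initial n x l reduction (stand-in)
A = rng.standard_normal((60, 130))       # p x n MPE-HLDA projection (stand-in)

y = A @ (L @ x)                          # final p-dimensional feature
```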
Experimentation
• Broadcast News (BN)
  – 600 hours: training the initial model (Hub4 and TDT)
  – 330 hours: held-out data
Thanks