Large Margin HMMs for Speech Recognition (hj/Talks/MSR_TALK.pdf)

Prof. Hui Jiang

Department of Computer Science and Engineering

York University, Toronto, Ont. M3J 1P3, CANADA

Email: [email protected]

Large Margin HMMs for Speech Recognition

(Joint work with Xinwei Li and Chao-Jun Liu)

Outline

• Background
  – Discriminative Training for ASR
  – Large Margin Classifiers: concept & theory
• Large Margin HMMs for ASR
  – A new estimation criterion for HMM
  – Analysis of Margin in CDHMMs
  – Large Margin Estimation (LME): Constrained minimax optimization
• Optimization Methods
  – Gradient Descent (GD) search
  – Semi-definite Programming (SDP)
• Experiments
  – The ISOLET recognition task
  – The TIDIGITS connected digit string recognition
• Summary

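Automatic Speech Recognition

• Statistical Speech Model

[Figure: a 3-state left-to-right HMM with self-transitions a11, a22, a33 and forward transitions a12, a23, generating the observation sequence X = x1 x2 x3 x4 x5; x1 and x2 are aligned to state 1, x3 to state 2, x4 and x5 to state 3.]

[Figure: noisy-channel view. Word sequence W → noisy channel → speech signal X; speech signal X → channel decoding → word sequence Ŵ.]

• Bayesian Decision Rule:

$$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} P(W)\, p(X \mid W) = \arg\max_{W} F(X \mid \Lambda_W)$$

where P(W) is the language model, p(X | W) is the acoustic model, and F(X | Λ_W) is the discriminant function.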
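The decision rule can be illustrated with a minimal sketch that works in the log domain, so the discriminant is log P(W) + log p(X | W). The candidate words, probabilities, and the function name `decode` are hypothetical values chosen for illustration, not taken from the talk.

```python
import math

def decode(acoustic_loglik, lm_prob):
    """Pick the candidate W maximizing log P(W) + log p(X | W)."""
    scores = {w: math.log(lm_prob[w]) + acoustic_loglik[w]
              for w in acoustic_loglik}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical acoustic log-likelihoods log p(X | W) for one utterance X
acoustic_loglik = {"one": -42.0, "nine": -41.5, "oh": -47.0}
# Hypothetical language model probabilities P(W)
lm_prob = {"one": 0.5, "nine": 0.2, "oh": 0.3}

best, scores = decode(acoustic_loglik, lm_prob)
print(best, scores)
```

In this toy case the acoustic model alone slightly prefers "nine", but the language model prior tips the decision to "one".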

HMM Estimation Methods

• Maximum Likelihood Estimation (MLE)
  – The Baum-Welch algorithm: the EM algorithm for HMM
• Discriminative Training (DT)
  – Maximum Mutual Information Estimation (MMIE)
    • MPE, MWE, etc.
  – Minimum Classification Error (MCE)
• Discriminative training can improve over the standard ML training.
• Can we do better than DT?

Large Margin Classifier: Support Vector Machine (SVM)

[Figure: SVM separating boundaries, with the larger margin annotated.]

Large Margin Classifiers

• Why do large margin classifiers yield better generalization performance?
• Conceptually, a large margin ⇒
  – Robustness w.r.t. data patterns
  – Robustness w.r.t. classifier parameters
• Theoretically …

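Statistical Learning Theory

• In pattern classification, the following generalization upper bound holds with probability 1 − δ (Vapnik et al.):

$$R(\theta) \le R_{\mathrm{emp}}(\theta) + \sqrt{\frac{1}{N}\left( V\left(\log\frac{2N}{V} + 1\right) - \log\frac{\delta}{4} \right)}$$

  – R(θ): test error rate
  – R_emp(θ): training error rate
  – square-root term: VC confidence
  – N: size of training set; V: VC dimension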
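To get a feel for the VC confidence term, the sketch below evaluates it for a few settings, assuming the bound has the standard Vapnik form sqrt((V(log(2N/V) + 1) − log(δ/4)) / N). The function name and the numeric settings are illustrative assumptions.

```python
import math

def vc_confidence(n, v, delta):
    """VC confidence term for training set size n, VC dimension v,
    and confidence level 1 - delta (standard Vapnik form, assumed here)."""
    inner = v * (math.log(2.0 * n / v) + 1.0) - math.log(delta / 4.0)
    return math.sqrt(inner / n)

# Hypothetical settings: the term shrinks with more data, grows with capacity
small_n = vc_confidence(1000, 50, 0.05)
large_n = vc_confidence(10000, 50, 0.05)
large_v = vc_confidence(1000, 100, 0.05)
print(small_n, large_n, large_v)
```

The term decreases as N grows and increases with V, which is why controlling capacity matters when training data is limited.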

Statistical Learning Theory

• In pattern classification, the following generalization upper bound holds with probability 1 − δ (Vapnik et al.):

$$R(\theta) \le R_{d}(\theta) + \sqrt{\frac{C}{N}\left( \frac{V}{d^{2}} \log^{2}\frac{N}{V} + \log\frac{1}{\delta} \right)}$$

  – R(θ): test error
  – R_d(θ): margin error
  – square-root term: VC confidence
  – V: VC dimension; d: margin; N: size of training set; C: universal constant

How about using SVM for Speech Recognition?

• Done in some simple ASR tasks:
  – phoneme recognition; speaker recognition
  – small vocabulary isolated speech recognition
• Hard to extend to large-scale continuous speech recognition.
• No significant improvement is reported.
  – still not a mainstream method
• Why?
  – SVM: binary, static classifier.
  – Lack of a proper kernel function to map speech samples from one dynamic high-dimension space to another high-dimension space that is suitable for linear classifiers.

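Large Margin HMM-based Classifier

[Figure: data from two classes, model Λ1 and model Λ2, with the separation boundary F(X|Λ1) − F(X|Λ2) = 0.]

Large Margin HMM-based Classifier

[Figure: the models are moved from Λ1, Λ2 to Λ'1, Λ'2, shifting the original separation boundary F(X|Λ1) − F(X|Λ2) = 0 to the new separation boundary F(X|Λ'1) − F(X|Λ'2) = 0.]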

How to define separation margin? (1)

• In a 2-class separable problem:
  – For a data token, x1, of class 1:

$$d(x_1) = F(x_1 \mid \Lambda_1) - F(x_1 \mid \Lambda_2) > 0$$

  – For a data token, x2, of class 2:

$$d(x_2) = F(x_2 \mid \Lambda_2) - F(x_2 \mid \Lambda_1) > 0$$

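How to define separation margin? (2)

• Extend to the multiple-class problem:
  – N classes Λ1, Λ2, …, ΛN
  – For a data token, x_i, of class i:

$$d(x_i) = F(x_i \mid \Lambda_i) - \max_{j \ne i} F(x_i \mid \Lambda_j) = \min_{j \ne i} \left[ F(x_i \mid \Lambda_i) - F(x_i \mid \Lambda_j) \right]$$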
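The multi-class margin is the score of the true class minus the best competing score. A minimal sketch, assuming the discriminant scores F(x | Λ_j) have already been computed; the score values and the function name are hypothetical.

```python
def margin(scores, true_class):
    """d(x_i) = F(x_i | model_i) - max over j != i of F(x_i | model_j)."""
    rival = max(s for j, s in enumerate(scores) if j != true_class)
    return scores[true_class] - rival

# Hypothetical log-domain scores of one token under three class models
scores = [-10.0, -12.5, -11.0]

d_correct = margin(scores, 0)  # token really belongs to class 0
d_wrong = margin(scores, 1)    # same scores, but true class is 1
print(d_correct, d_wrong)
```

A positive margin means the token is correctly classified; a negative one means it is a training error.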

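Large Margin Estimation of HMMs

• An N-class problem: each class is represented by an HMM:

$$\Lambda = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_N\}$$

• Given a training set D, define a subset, called the support token set S, as:

$$S = \{ X_i \mid X_i \in D \ \text{and}\ 0 \le d(X_i) \le \epsilon \}$$

• Large-Margin Estimation (LME) of HMMs:

$$\hat{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S} d(X_i) \quad \text{(subject to all } d(X_i) > 0\text{)}$$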
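As a sketch of how the support token set and the LME objective relate, the snippet below selects tokens whose margin lies in [0, ε] and evaluates the minimum margin over that set. The scores, labels, and ε are hypothetical, and the helper names are assumptions for illustration.

```python
def margin(scores, label):
    """Margin of one token: true-class score minus best rival score."""
    rival = max(s for j, s in enumerate(scores) if j != label)
    return scores[label] - rival

def support_tokens(all_scores, labels, eps):
    """Indices of tokens with 0 <= d(X_i) <= eps, plus all margins."""
    margins = [margin(s, y) for s, y in zip(all_scores, labels)]
    support = [i for i, d in enumerate(margins) if 0.0 <= d <= eps]
    return support, margins

def lme_objective(margins, support):
    """Quantity that LME maximizes: the smallest margin in the support set."""
    return min(margins[i] for i in support)

# Hypothetical two-class scores for four training tokens
all_scores = [[-10.0, -12.0], [-10.0, -10.5], [-9.0, -8.8], [-11.0, -10.0]]
labels = [0, 0, 1, 0]

support, margins = support_tokens(all_scores, labels, eps=1.0)
print(support, lme_objective(margins, support))
```

Token 0 is far from the boundary and token 3 is misclassified, so only tokens 1 and 2 enter the support set.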

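Large Margin Estimation of HMMs

• Convert to a minimax optimization problem.
• Assume X_i belongs to class i:

$$\hat{\Lambda} = \arg\min_{\Lambda} \max_{X_i \in S,\ j \ne i} \left[ F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \right]$$

subject to the constraints:

$$F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) < 0 \quad \text{for all } X_i \in S \text{ and } j \ne i.$$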

Analysis of Margins in CDHMM

• The margin in CDHMM is unbounded without additional constraints.
• CDHMM parameters can be adjusted in certain ways to increase the margin without limit.
• Adopt the Viterbi approximation.

Analysis of Margins in CDHMM

• Each dimension: independent
• Linear: same variance
• Quadratic: different variances

[Figure: two panels, "Linear" and "Quadratic".]

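Margin: Linear Dimensions

[Figure: densities of two classes C1 and C2 with the same variance, plotted over x from 0 to 15, with example tokens marked.]

$$d(x) = \frac{\mu_1 - \mu_2}{\sigma^2}\left( x - \frac{\mu_1 + \mu_2}{2} \right)$$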
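For two 1-D Gaussians sharing a variance, the difference of log-likelihoods is linear in x, with slope (μ1 − μ2)/σ². The check below compares that closed form against a direct computation; the numbers are hypothetical and the function names are assumptions.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def linear_margin(x, mu1, mu2, var):
    """Closed-form margin of a class-1 token when both variances are equal."""
    return (mu1 - mu2) / var * (x - (mu1 + mu2) / 2.0)

# Hypothetical class-1 token and model parameters
x, mu1, mu2, var = 4.0, 5.0, 10.0, 4.0

direct = log_gauss(x, mu1, var) - log_gauss(x, mu2, var)
closed = linear_margin(x, mu1, mu2, var)
print(direct, closed)
```

Because the margin scales with the slope, pushing the two means apart (or shrinking the shared variance) increases it without bound, which is the motivation for constraining the slope.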

Margin: Quadratic Dimensions

[Figure: densities of two classes C1 and C2 with different variances, plotted over x from 0 to 15; tokens X1, X2 and X4 are marked, together with the margins d2(X1) and −d1(X2).]

Constraints in LME of CDHMM

• Impose constraints to make LME solvable:
  – Linear part: fix the norm of the slope to a constant

$$R_1(X_i, W_j \mid \Lambda) = g_{ij}^2$$

  – Quadratic part: constrain the vertex to a range

$$R_2(X_i, W_j \mid \Lambda) \le G_{ij}^2$$

Here R_1 and R_2 are sums over the frames t of X_i and over the linear and quadratic dimensions d, respectively.

LME: constrained minimax optimization

• Large Margin Estimation (LME) of CDHMM ⇒ a constrained minimax optimization problem:

$$\hat{\Lambda} = \arg\min_{\Lambda} \max_{X_i \in S,\ j \ne i} \left[ F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \right]$$

subject to the constraints:

$$R_1(X_i, W_j \mid \Lambda) = g_{ij}^2, \qquad R_2(X_i, W_j \mid \Lambda) \le G_{ij}^2 \quad \text{for all } X_i \in S \text{ and } j \ne i.$$

Optimization Methods

• Gradient Descent (GD) Search
  – approximate the objective function with a differentiable one
  – cast constraints as penalty terms
• Semi-definite Programming (SDP)
  – mathematical manipulation
  – relaxation

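LME Optimization: Gradient Descent

• Approximate the objective with a summation of exponentials. Writing d_ij(X_i) = F(X_i | Λ_i) − F(X_i | Λ_j), so that d(X_i) = min_{j≠i} d_ij(X_i):

$$Q(\Lambda) = -\min_{X_i \in S} d(X_i)$$

$$Q(\Lambda) \approx Q_\eta(\Lambda) = \frac{1}{\eta} \log\left[ \sum_{X_i \in S,\ j \ne i} \exp\left[ -\eta\, d_{ij}(X_i) \right] \right]$$

$$\lim_{\eta \to \infty} Q_\eta(\Lambda) = Q(\Lambda) \qquad (\eta > 0)$$

• Constraints ⇒ penalty terms:

$$O(\Lambda) = Q_\eta(\Lambda) + \tau_1 P_1(\Lambda) + \tau_2 P_2(\Lambda)$$

$$P_1(\Lambda) = \sum_{X_i \in S,\ j \ne i} \left( R_1(X_i, W_j \mid \Lambda) - g_{ij}^2 \right)^2$$

$$P_2(\Lambda) = \sum_{X_i \in S,\ j \ne i} \left( \max\left[ 0,\ R_2(X_i, W_j \mid \Lambda) - G_{ij}^2 \right] \right)^2$$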
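The summation-of-exponentials approximation is the familiar log-sum-exp smoothing of a maximum. A minimal sketch, assuming the quantities being maximized are the negated margins; the margin values and the name `smooth_max` are hypothetical.

```python
import math

def smooth_max(values, eta):
    """(1/eta) * log(sum(exp(eta * v))), computed stably by shifting by the max."""
    m = max(values)
    return m + math.log(sum(math.exp(eta * (v - m)) for v in values)) / eta

# Hypothetical negated margins -d for three support tokens
neg_margins = [-0.2, -0.5, -2.0]

exact = max(neg_margins)
approx = [smooth_max(neg_margins, eta) for eta in (1.0, 10.0, 100.0)]
print(exact, approx)
```

The smoothed value always lies at or above the true maximum and tightens as η grows, at the cost of a less well-conditioned objective.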

LME Optimization: Gradient Descent

• The gradient descent update:

$$\hat{\Lambda}^{(n+1)} = \hat{\Lambda}^{(n)} - \epsilon \left. \frac{\partial O(\Lambda)}{\partial \Lambda} \right|_{\Lambda = \hat{\Lambda}^{(n)}}$$

• Gradient descent optimization:
  – Many parameters to be tuned experimentally:
    • step size, penalty coefficients, η, etc.
  – Slow convergence speed.
  – Local optimum.
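A toy illustration of penalized gradient descent, under strong simplifying assumptions: two 1-D unit-variance Gaussian classes, only the means are adjusted, the smoothed objective is the log-sum-exp of negated margins, and a single penalty keeps (μ1 − μ2)² near a fixed g². All data, parameter values, and function names are hypothetical, and the gradient is taken numerically rather than analytically.

```python
import math

def margins(mu1, mu2, class1, class2):
    """Margins of all tokens for two unit-variance 1-D Gaussian classes."""
    mid, slope = (mu1 + mu2) / 2.0, mu1 - mu2
    return ([slope * (x - mid) for x in class1] +
            [-slope * (x - mid) for x in class2])

def objective(params, class1, class2, eta, tau, g2):
    """Smoothed max of negated margins plus a penalty on the slope norm."""
    mu1, mu2 = params
    neg = [-d for d in margins(mu1, mu2, class1, class2)]
    m = max(neg)
    q = m + math.log(sum(math.exp(eta * (v - m)) for v in neg)) / eta
    return q + tau * ((mu1 - mu2) ** 2 - g2) ** 2

def numeric_grad(f, params, h=1e-6):
    """Central-difference gradient (stand-in for the analytic one)."""
    grad = []
    for k in range(len(params)):
        up, dn = list(params), list(params)
        up[k] += h
        dn[k] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad

def gradient_descent(f, params, step, iters):
    """Plain fixed-step gradient descent."""
    params = list(params)
    for _ in range(iters):
        g = numeric_grad(f, params)
        params = [p - step * gk for p, gk in zip(params, g)]
    return params

class1, class2 = [4.0, 5.5], [6.5, 9.0]   # hypothetical tokens
start = [4.75, 7.75]                       # ML estimates (class means)
f = lambda p: objective(p, class1, class2, eta=5.0, tau=1.0, g2=9.0)

final = gradient_descent(f, start, step=0.005, iters=2000)
before = min(margins(start[0], start[1], class1, class2))
after = min(margins(final[0], final[1], class1, class2))
print(final, before, after)
```

In this toy the boundary midpoint moves from 6.25 toward 6.0, halfway between the two closest tokens, and the minimum margin roughly doubles while the mean separation stays close to its constrained value.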

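How to calculate the gradient for continuous density HMM? (1)

With d_ij(X_i) = F(X_i | Λ_i) − F(X_i | Λ_j):

$$\frac{\partial Q_\eta(\Lambda)}{\partial \Lambda} = - \frac{\displaystyle\sum_{X_i \in S,\ j \ne i} \exp\left[ -\eta\, d_{ij}(X_i) \right] \frac{\partial d_{ij}(X_i)}{\partial \Lambda}}{\displaystyle\sum_{X_i \in S,\ j \ne i} \exp\left[ -\eta\, d_{ij}(X_i) \right]}$$

$$\frac{\partial d_{ij}(X_i)}{\partial \Lambda_i} = \frac{\partial F(X_i \mid \Lambda_i)}{\partial \Lambda_i}, \qquad \frac{\partial d_{ij}(X_i)}{\partial \Lambda_j} = - \frac{\partial F(X_i \mid \Lambda_j)}{\partial \Lambda_j}$$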
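The gradient of the smoothed objective with respect to each margin is a negated softmax weight, so tokens with the smallest margins dominate the update. The sketch below checks this against a finite difference; the margin values and function names are hypothetical.

```python
import math

def q_eta(margins, eta):
    """(1/eta) * log(sum(exp(-eta * d))), computed stably."""
    vals = [-eta * d for d in margins]
    m = max(vals)
    return (m + math.log(sum(math.exp(v - m) for v in vals))) / eta

def q_eta_grad(margins, eta):
    """Partial derivatives dQ/dd_k = -exp(-eta*d_k) / sum_l exp(-eta*d_l)."""
    vals = [-eta * d for d in margins]
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    total = sum(exps)
    return [-e / total for e in exps]

margins = [0.2, 0.5, 2.0]   # hypothetical margins of three support tokens
eta = 4.0
analytic = q_eta_grad(margins, eta)

# Finite-difference check of each partial derivative
h = 1e-6
numeric = []
for k in range(len(margins)):
    up, dn = list(margins), list(margins)
    up[k] += h
    dn[k] -= h
    numeric.append((q_eta(up, eta) - q_eta(dn, eta)) / (2.0 * h))
print(analytic, numeric)
```

Chaining these weights with the derivatives of the margins with respect to the model parameters gives the full gradient.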

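How to calculate the gradient for continuous density HMM? (2)

• Assumption 1: adjust CDHMM mean vectors only
• Assumption 2: diagonal precision matrices
• Assumption 3: use the Viterbi approximation

$$F(X_i \mid \Lambda_i) \approx C' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r^{(i)}_{s_t l_t d} \left( X_{itd} - m^{(i)}_{s_t l_t d} \right)^2$$

$$F(X_i \mid \Lambda_j) \approx C'' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r^{(j)}_{s'_t l'_t d} \left( X_{itd} - m^{(j)}_{s'_t l'_t d} \right)^2$$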
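Under the three assumptions, the score along a fixed Viterbi path is a constant minus half a precision-weighted squared distance, and its derivative with respect to a mean is the precision-weighted residual summed over the frames aligned to that Gaussian. The sketch below assumes one Gaussian per state (so the state index stands in for the state and mixture pair); the frames, path, and parameters are hypothetical.

```python
def viterbi_score(frames, path, means, precisions, const=0.0):
    """F ~ const - 1/2 * sum_t sum_d r[s_t][d] * (x[t][d] - m[s_t][d])**2."""
    total = 0.0
    for x, s in zip(frames, path):
        for d, xd in enumerate(x):
            total += precisions[s][d] * (xd - means[s][d]) ** 2
    return const - 0.5 * total

def mean_gradient(frames, path, means, precisions):
    """dF/dm[s][d] = sum over frames aligned to s of r[s][d] * (x[t][d] - m[s][d])."""
    grad = [[0.0] * len(row) for row in means]
    for x, s in zip(frames, path):
        for d, xd in enumerate(x):
            grad[s][d] += precisions[s][d] * (xd - means[s][d])
    return grad

# Hypothetical 2-D utterance of three frames aligned to states 0, 0, 1
frames = [[1.0, 2.0], [1.5, 2.5], [4.0, 0.0]]
path = [0, 0, 1]
means = [[1.0, 2.0], [3.0, 1.0]]
precisions = [[1.0, 1.0], [0.5, 2.0]]   # diagonal precisions per state

score = viterbi_score(frames, path, means, precisions)
grad = mean_gradient(frames, path, means, precisions)
print(score, grad)
```

The gradient of a margin then follows by taking this quantity for the true model and subtracting the corresponding one for the competing model along its own path.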

Semi-definite Programming (SDP)

• An active area in the optimization community nowadays.
• The standard SDP form:

$$\min_{X_1, X_2, \ldots, X_N} \sum_{j=1}^{N} C_j \bullet X_j \quad \text{subject to} \quad \sum_{j=1}^{N} A_{ij} \bullet X_j = b_i \ (i = 1, \ldots, M), \quad X_j \succeq 0$$

• Linear function of symmetric matrices in the semi-definite matrix cone.
• SDP can solve nonlinear optimization if configured properly.
• Convex conic optimization ⇒ global optimal solution.
• New efficient algorithms are developed.
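As a small worked instance of the standard form (a textbook example, not one from the talk), take a single matrix variable X, one constraint with A = I and b = 1, and let C • X denote tr(CX). The optimal value is then the smallest eigenvalue of C:

```latex
\min_{X \succeq 0} \; C \bullet X
\quad \text{subject to} \quad I \bullet X = 1
\qquad \Longrightarrow \qquad
\text{optimal value} = \lambda_{\min}(C), \quad
X^{\star} = v v^{\top},
```

where v is a unit eigenvector of C for its smallest eigenvalue. The objective and the constraint are both linear in X; all of the nonlinearity sits in the cone constraint X ⪰ 0, which is what lets an eigenvalue problem be posed as a convex program.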

LME Optimization: SDP

• LME: convert the constrained minimax optimization ⇒ semi-definite programming (SDP)
• Introduce a new constraint to the minimax optimization problem

LME-SDP: Minimization

• Minimax optimization ⇒ minimization
  – Replace the max with a common upper bound
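Replacing a max with a common upper bound is the usual epigraph reformulation. A sketch of what this step looks like for the LME minimax problem, where the bound variable ρ is a symbol assumed here for illustration (the slide's own symbol is not legible):

```latex
\min_{\Lambda} \; \max_{X_i \in S,\; j \neq i}
  \bigl[ F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \bigr]
\;\; \Longleftrightarrow \;\;
\min_{\Lambda,\; \rho} \; \rho
\quad \text{subject to} \quad
F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \le \rho
\;\; \text{for all } X_i \in S,\ j \neq i .
```

At the optimum ρ equals the largest of the bracketed terms, so the two problems share the same solution, and the non-smooth max has been traded for one extra variable and a set of inequality constraints.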

LME-SDP: Matrix Form

• Transform into matrix form

LME-SDP: Relaxation

• Matrix relaxation: equality ⇒ inequality
• Relaxation to an SDP problem
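The typical way an equality becomes a semi-definite inequality is sketched below, using a generic vector x and matrix Z as assumed symbols (the talk's own variables may differ). A quadratic equality Z = x xᵀ is non-convex; relaxing it to an inequality gives a constraint that, by the Schur complement, is a single positive semi-definite condition:

```latex
Z = x x^{\top}
\quad \longrightarrow \quad
Z - x x^{\top} \succeq 0
\quad \Longleftrightarrow \quad
\begin{pmatrix} Z & x \\ x^{\top} & 1 \end{pmatrix} \succeq 0 .
```

The relaxed feasible set is convex and contains the original one, so the SDP can be solved to a global optimum; whether that optimum is also feasible for the original equality has to be checked separately.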

LME-SDP: Relaxation Analysis

• Relaxation: geometric explanation
  – Augment x and u to a higher dimension
  – Solve the problem in the augmented space

LME-SDP: training procedure

Experiments: overview

• Implemented under the HTK framework.
• Added more training tools:
  – MCE training tool: HMce.c
  – LME-GD training tool: HCLme.c
  – LME-SDP training tool: C programs + Matlab program + dsdp package
• ASR Tasks
  – OGI ISOLET E-set recognition
  – TIDIGITS

LME Training System

Experimental Result: ISOLET

• Feature vector is of 39 dimensions: (12 MFCC + E) + Δ + ΔΔ
• MLE models (14 states per HMM) are trained by HTK.
• MLE ⇒ MCE; MCE ⇒ LME.
• Alphabet (26-letter) recognition (training: 3120 utterances; test: 1560 utterances):
  – OGI (96%), Cambridge (96.73%).
  – Ours: MLE (95.4%), MCE (96.1%), LME (96.92%).

Experimental Result: ISOLET E-set

• ISOLET E-set: {B, C, D, E, G, P, T, V, Z}
• Training: 1080 utterances; Test: 540 utterances
• Word accuracy (in %) on ISOLET E-set test data:

Model    | mix-1 | mix-2 | mix-4
ML       | 85.56 | 90.56 | 91.48
MCE      | 91.48 | 94.07 | 93.89
LME-SDP  | 92.96 | 95.19 | 95.00
LME-GD   | 92.96 | 95.00 | 94.44
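Accuracy differences at this level are easier to compare as relative error reductions. The sketch below does that arithmetic for the 2-mixture E-set figures reported above (MCE 94.07%, LME-SDP 95.19%); the function name is an assumption.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Percentage of the baseline error removed by the new system."""
    err_base = 100.0 - acc_baseline
    err_new = 100.0 - acc_new
    return 100.0 * (err_base - err_new) / err_base

# E-set word accuracies with 2 mixtures per state, from the table above
mce_acc, lme_sdp_acc = 94.07, 95.19

reduction = relative_error_reduction(mce_acc, lme_sdp_acc)
print(round(reduction, 2))
```

Going from a 5.93% to a 4.81% error rate removes a little under a fifth of the MCE errors.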

Experimental Result: TIDIGITS

• Connected digit strings: '1' to '9' plus 'oh' and 'zero'
• Training with 8623 sentences; test with 8700 sentences.
• Feature vector is of 39 dimensions: (12 MFCC + E) + Δ + ΔΔ.
• Unknown length digit string recognition.
• Context-independent whole-word HMM models.
• MLE models (12 states per HMM) are trained by HTK.
• MLE ⇒ MCE; MCE ⇒ LME.
• MCE/LME training: N-best (N=5) based string-level.

String-Level LME Training

1. Identify support tokens

2. LME optimization

3. Converged? If not, repeat from step 1
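The three steps form an outer loop around the optimizer. The skeleton below is a hypothetical sketch of that loop; the "model" is reduced to a single 1-D decision threshold and the optimizer to an exact midpoint rule so the example stays runnable, which is far simpler than string-level HMM training with N-best lists.

```python
def lme_train(model, margins_of, optimize, eps, max_iters=20, tol=1e-6):
    """Alternate support-token selection and optimization until the
    minimum margin over the support set stops changing."""
    history = []
    for _ in range(max_iters):
        margins = margins_of(model)
        # 1. Identify support tokens: correctly classified, margin <= eps
        support = [i for i, d in enumerate(margins) if 0.0 <= d <= eps]
        if not support:
            break
        # 2. LME optimization over the current support set
        model = optimize(model, support)
        # 3. Convergence check on the minimum margin
        history.append(min(margins_of(model)[i] for i in support))
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return model, history

# Toy problem: (value, class) tokens; class 0 should fall below the threshold
tokens = [(1.0, 0), (2.0, 0), (4.0, 1), (5.5, 1)]

def margins_of(b):
    """Signed distance of each token from threshold b, positive if correct."""
    return [(b - x) if y == 0 else (x - b) for x, y in tokens]

def optimize(b, support):
    """Exact max-min solution in 1-D: midpoint of the closest opposing tokens."""
    lows = [tokens[i][0] for i in support if tokens[i][1] == 0]
    highs = [tokens[i][0] for i in support if tokens[i][1] == 1]
    if not lows or not highs:
        return b
    return (max(lows) + min(highs)) / 2.0

boundary, history = lme_train(2.2, margins_of, optimize, eps=3.0)
print(boundary, history)
```

The support set is recomputed on every pass because moving the model changes which tokens sit close to the boundary.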

Experimental Result: TIDIGITS

• String accuracy (in %) on test data (LME-GD)

Experimental Result: TIDIGITS

• WER (in %) on TIDIGITS test set (LME-GD)

TIDIGITS Results: LME-GD vs. LME-SDP

[Figure: two panels, "Gradient Descent" (0 to 80 iterations) and "SDP" (0 to 12 iterations), each plotting string accuracy (%) on training and test data against the number of iterations.]

TIDIGITS results: LME vs. others

Criterion               | Model                             | string error rate (%) | WER (%)
MLE (with HTK)          | context-indep whole word model    | 1.34                  | 0.45
MCE (Juang et al. '97)  | context-indep whole word model    | 0.95                  | n/a
MCE (Juang et al. '97)  | context-dep head-body-tail model  | 0.72                  | 0.24
MMIE (Normandin '94)    | two models per word               |                       |
LME-SDP                 | whole word model                  |                       |
LME-GD                  | whole word model                  |                       |

Summary

• HMM Estimation methods for ASR

Criterion                                                                       | Optimization methods
Maximum Likelihood Estimation (MLE)                                             | EM or Baum-Welch (BW)
Maximum Mutual Information Estimation (MMIE) (Maximum Conditional Likelihood)   | growth-transformation or extended BW (EBW)
Minimum Classification Error (MCE)                                              | gradient descent, GPD, Quickprop, etc.
Large Margin Estimation (LME)                                                   | gradient descent; semi-definite programming (SDP)
Maximum Relative Margin Estimation (MRME)                                       | gradient descent

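Large Margin Estimation (LME) vs. Discriminative Training (DT)

• The MCE or MMIE criterion is only an asymptotic bound on the Bayes error.

Discriminative Training:

$$R(\theta) \le \lim_{N \to \infty} Q_{\mathrm{MCE}}(\theta, N), \qquad R(\theta) \le \lim_{N \to \infty} -Q_{\mathrm{MMI}}(\theta, N)$$

• But Vapnik's generalization bound holds for a finite body of training data.

Large Margin Estimation:

$$R(\theta) \le R_{\mathrm{emp}}(\theta) + \sqrt{\frac{1}{N}\left( V\left(\log\frac{2N}{V} + 1\right) - \log\frac{\delta}{4} \right)}$$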

Ongoing Works

• How to handle training errors?
  – combined objective function: margin + training errors (ICASSP'06)
• Extend to large-scale subword-based speech recognition:
  – WSJ-5K
  – SPINE
