
Standalone Training and Context-Independent Initialisations of Context-Dependent Deep Neural Networks

Chao Zhang & Phil Woodland

University of Cambridge

20 May 2013


Improving Standard CD-DNN Training

• Std. CD-DNN-HMM training relies on GMM-HMMs in two ways:
  ◦ Training labels — state-to-frame alignments
  ◦ Tied CD state targets — GMM-HMM based decision tree state tying

• Can we build CD-DNN-HMMs independently of GMM-HMMs?

• Training CD-DNN-HMMs independently from any GMM-HMMs: standalone training
  ◦ Alignments — by CI-DNN-HMMs trained in a standalone fashion
    • Training starts with a flat start
    • Refine initial alignments in an iterative fashion (see the sketch below)
    • Train CI-DNN-HMMs using discriminative pre-training with realignment and std. fine-tuning
  ◦ Targets — by DNN-HMM based decision tree target clustering
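
To make the interleaved parameter/label updates concrete, here is a self-contained toy of the flat-start-and-realign loop. Per-state means stand in for the CI-DNN, and all names and data are illustrative; the actual systems were trained with QuickNet and HMM-based realignment.

```python
import numpy as np

# Toy flat start + iterative realignment (per-state means stand in
# for the CI-DNN acoustic model to keep the example self-contained).
rng = np.random.default_rng(0)
frames = np.concatenate([rng.normal(m, 0.3, 40) for m in (0.0, 1.0, 2.0)])
num_states = 3

# Flat start: share the utterance's frames evenly among the states.
labels = np.repeat(np.arange(num_states), len(frames) // num_states)

for _ in range(3):
    # "Train": re-estimate the per-state model from the current labels.
    means = np.array([frames[labels == s].mean() for s in range(num_states)])
    # "Realign": relabel every frame with its best-scoring state.
    labels = np.abs(frames[:, None] - means[None, :]).argmin(axis=1)

print(means.round(2))  # approaches the true segment centres (0, 1, 2)
```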

2 of 14


DNN-HMM based Target Clustering

• Assume the output distribution for each target is Gaussian with a common covariance matrix, i.e., p(z|C_k) = N(z; \mu_k, \Sigma). Then

$$p(C_k \mid z) = \frac{\exp\{\mu_k^{\top}\Sigma^{-1}z - \frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \ln P(C_k)\}}{\sum_{k'}\exp\{\mu_{k'}^{\top}\Sigma^{-1}z - \frac{1}{2}\mu_{k'}^{\top}\Sigma^{-1}\mu_{k'} + \ln P(C_{k'})\}}$$

• According to the softmax output activation function,

$$p(C_k \mid z) = \frac{\exp\{w_k^{\top}z + b_k\}}{\sum_{k'}\exp\{w_{k'}^{\top}z + b_{k'}\}}$$

• Matching the two forms gives w_k = \Sigma^{-1}\mu_k and b_k = -\frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \ln P(C_k), so we can convert Gaussians to DNN output layer parameters (see the sketch below).

3 of 14
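
As a minimal sketch of this conversion (the function name and the toy check are mine, not from the slides):

```python
import numpy as np

def gaussians_to_softmax(means, cov, priors):
    """Convert per-class Gaussians N(mu_k, Sigma) with a shared covariance
    and priors P(C_k) into softmax-layer weights W (K x D) and biases b (K,)."""
    prec = np.linalg.inv(cov)                 # Sigma^{-1}
    W = means @ prec                          # row k: w_k = Sigma^{-1} mu_k
    b = -0.5 * np.einsum('kd,kd->k', W, means) + np.log(priors)
    return W, b

# Check: the converted layer reproduces the Gaussian class posteriors.
rng = np.random.default_rng(1)
means, cov, priors = rng.normal(size=(4, 3)), np.eye(3), np.full(4, 0.25)
W, b = gaussians_to_softmax(means, cov, priors)
logits = W @ rng.normal(size=3) + b
post = np.exp(logits - logits.max())
print(post / post.sum())                      # p(C_k | z), sums to 1
```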


Procedure of Building CD-DNN-HMMs

4 of 14


Experiments

• Training set: Wall Street Journal training set WSJ0+1 (SI-284)

• Testing sets: 1994 H1-dev (Dev) and Nov’94 H1-eval (Eval)
  ◦ 65k dictionary and trigram LM

• MPE GMM-HMMs: ((13PLP)D A T Z)HLDA front-end; 5981 tied states, 12 Gaussians/state

• DNN models were trained and tested using an extended version of QuickNet
  ◦ Cross-entropy criterion, sigmoid/softmax hidden/output activation functions

• DNN-HMMs: 9 × (13PLP)D A Z input (351 dimensions); 5 × 1000 hidden layers; 6000 output targets (a minimal forward-pass sketch follows below)
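
For orientation, here is a minimal numpy forward pass with the stated topology; the random weights are stand-ins, since the real models were trained with the extended QuickNet, not this code.

```python
import numpy as np

# 351 inputs (9 frames x 39 PLP-based features), five 1000-unit sigmoid
# hidden layers, 6000-way softmax over tied CD states. Illustrative only.
sizes = [351] + [1000] * 5 + [6000]
rng = np.random.default_rng(0)
params = [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(params):
        a = x @ W + b
        if i < len(params) - 1:
            x = 1.0 / (1.0 + np.exp(-a))   # sigmoid hidden activations
        else:
            e = np.exp(a - a.max())        # softmax output activation
            x = e / e.sum()
    return x

probs = forward(rng.normal(size=351))
print(probs.shape, probs.sum())            # (6000,) ~1.0
```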

5 of 14


CI-DNN-HMM Results

ID  Type          DNN Alignments  Dev WER%  Eval WER%
G2  MPE GMM-HMMs  —               8.0       8.7
I1  CI-DNN-HMMs   G2              10.5      12.0

Baseline GMM-HMM and CI-DNN-HMM results (351 × 1000^5 × 138).

ID  Training Route             Dev WER%  Eval WER%
I3  Realigned                  12.2      14.3
I4  Realigned+Conventional     11.7      13.8
I5  Conventional               12.2      15.0
I6  Conventional+Conventional  12.0      14.6

Different CI-DNN-HMMs trained in a standalone fashion.

6 of 14


CD-DNN-HMM Results

• Baseline CD-DNN-HMMs (D1) were trained with G2 alignments. The WERs on Dev and Eval were 6.7 and 8.0, respectively

• CD-DNN-HMMs with different clustered targets are listed in the table below. The hidden layers and alignments were from I4

ID  Clustering  BP Layers    Dev WER%  Eval WER%
G3  GMM-HMM     Final Layer  7.6       9.0
G4  GMM-HMM     All Layers   6.8       7.9
D2  DNN-HMM     Final Layer  7.7       8.7
D3  DNN-HMM     All Layers   6.8       7.8

CD-DNN-HMM based state tying results (351 × 1000^5 × 6000).

• The CD-DNN-HMM (D3) trained without relying on any GMM-HMMs is comparable to the baseline D1

7 of 14


Conclusion of Standalone Training

• Accomplished training CD-DNN-HMMs without relying on any pre-existing system
  ◦ Train CI-DNN-HMMs by updating the model parameters and the reference labels in an interleaved fashion
  ◦ Decision tree tying in the sigmoidal activation vector space of the CI-DNN
• The experiments on WSJ SI-284 have shown that
  ◦ the proposed training procedure gives comparable performance
  ◦ the methods are very efficient

8 of 14


CI Discriminative Pre-training of CD-DNNs

• Weaknesses of standard CD-DNN pre-training
  ◦ RBM based Generative Pre-training
    • Weight values are not directly optimised for classification purposes
    • Usually uses different settings from fine-tuning
  ◦ Traditional (CD) Discriminative Pre-training
    • Lower layers are over-specific to a particular set of CD states: not generic enough for modelling low-level acoustic features
    • Training speed can be very slow when the target set is big
• Propose CI discriminative pre-training (see the sketch below)
  ◦ Initialise CD-DNNs with parameters discriminatively trained for classifying CI states
    • Improves CD-DNN performance
    • Can be much faster than CD discriminative pre-training
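
A sketch of the initialisation step, assuming it amounts to keeping the CI-DNN's hidden layers and attaching a freshly initialised CD output layer (variable names are mine; shapes follow the deck's 351 × 1000^5 topology):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a CI-DNN trained as on the earlier slides: 351 inputs,
# five 1000-unit hidden layers, 138 CI state targets.
sizes_ci = [351] + [1000] * 5 + [138]
ci_params = [(rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
             for m, n in zip(sizes_ci[:-1], sizes_ci[1:])]

# CI discriminative pre-training of the CD-DNN: reuse the hidden layers,
# replace the 138-way CI output layer with a fresh 6000-way CD layer.
cd_params = [(W.copy(), b.copy()) for W, b in ci_params[:-1]]
cd_params.append((rng.normal(0.0, 0.05, (1000, 6000)), np.zeros(6000)))

# cd_params would then be fine-tuned on CD state targets (cross-entropy).
```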

9 of 14


CI Discriminative Pre-training

10 of 14


Experiments

• All resulting DNNs were evaluated as hybrid acoustic models

• Training set: WSJ0 (SI-84) and WSJ0+1 (SI-284)

• WSJ0 MPE GMM-HMMs: 3007 tied states; 8 Gaussians per state

• WSJ0 CI-/CD-DNN structures: 351 × 1000^5 × 138/3007

• The remaining configurations were the same as before

• All experiments were conducted using an extended version of HTK,which supports DNNs

11 of 14


WSJ0 DNN-HMM Performance

ID   Pre-training        Dev WER%  Eval WER%  CI State CV Acc%
S01  Discriminative      14.6      16.6       67.2
S02  Generative          9.4       10.9       68.9
S03  CD Discriminative   9.6       11.3       68.7
S04  CI Discriminative†  8.9       10.3       69.7
S05  CI Discriminative   8.4       10.0       70.2

WSJ0 DNN-HMM system results. † means CI-DNN fine-tuning is not included. S01 is a CI model; S02-S05 are CD models.

• S05 vs S02: 9.4% relative WER reduction

• S05 vs S03: 12.0% relative WER reduction

• S04 vs S02 (same num of epochs): 5.7% relative WER reduction

• S04 vs S03 (same num of epochs): 8.1% relative WER reduction

• CI state accuracies are consistent with the WERs (see the sketch below for how the relative figures can be computed)

12 of 14
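
The quoted relative reductions look like the Dev and Eval per-set relative WER reductions averaged; this convention is inferred from the numbers rather than stated on the slide. A quick check for S05 vs S03:

```python
# Average of the per-set relative WER reductions (inferred convention).
def avg_rel_reduction(base, new):
    return 100 * sum((b - n) / b for b, n in zip(base, new)) / len(base)

# S05 vs S03: Dev 9.6 -> 8.4, Eval 11.3 -> 10.0
print(round(avg_rel_reduction([9.6, 11.3], [8.4, 10.0]), 1))  # 12.0
```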


WSJ0+1 DNN-HMM Performance

ID   Pre-training        Dev WER%  Eval WER%  CI State CV Acc%
S11  Discriminative      11.1      12.6       70.5
S12  Generative          6.9       8.1        73.4
S13  CD Discriminative   6.7       8.1        72.5
S14  CI Discriminative†  6.3       7.4        73.4
S15  CI Discriminative   6.3       7.4        72.9

WSJ0+1 DNN-HMM system results. † means CI-DNN fine-tuning is not included in pre-training. S11 is a CI model; S12-S15 are CD models.

• S14 vs S12: 8.7% relative WER reduction

• S15 vs S13: 7.4% relative WER reduction

• If sufficient data are available, CI-DNN fine-tuning is less important

• S14 pre-training is 5 times faster than S13 pre-training (on a single K20c GPU)

13 of 14


Conclusion of CI-DNN Pre-training

• We introduced an alternative discriminative pre-training method that initialises CD-DNNs using a DNN with context-independent state targets
  ◦ Resulting CD-DNN hybrid systems reduced the WER by 9.1% and 9.7% relative over the baselines with generative and CD discriminative pre-training
  ◦ Also reduced training time by a factor of five compared to CD discriminative pre-training with 6000 CD state targets
• A way of evaluating CD classification results at the CI level is used to facilitate frame-level DNN comparisons with different targets
  ◦ Frame error CV accuracies correlate well with final WERs in hybrid systems

14 of 14