T.J. Watson Research Center, Human Language Technologies, 3/19/2005

Improvements to fMPE

Dan Povey

EARS progress update

Overview

Review of fMPE
Mean offsets as features
Multiple layer framework
Context expansion in multiple layer framework
Improved way of setting learning rate
Improved way of setting per-dimension scales on learning rate
“Smooth update” – more stable update rule
“Out of the box training”
Diagnostics
Other issues
What is most important?


Review of fMPE (1 of 3, overview)

In fMPE, we train a nonlinear offset to the features: $y_t = x_t + M h_t$

$h_t$ is a high-dimensional vector and a function of $x_t$ (and maybe context frames $x_{t-1}$, $x_{t+1}$ etc).

The transformation parameters M are trained using the MPE objective function, using a modified form of gradient descent.


Review of fMPE (2 of 3, features)

The high-dimensional features $h_t$ are (in the original implementation) a vector of Gaussian posteriors with frame splicing.

Obtain 100,000 Gaussians by clustering HMM set

Calculate Gaussian posteriors (model-free) on each frame

Splice vectors on adjacent frames together to create a larger vector (actually, splice together frames and averages of frames for larger context window).

Vector $h_t$ is very sparse (even though M is not), so calculations are fast.
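A minimal sketch of why sparsity keeps the offset cheap: only the columns of M that match nonzero entries of $h_t$ are touched. The function name and the dictionary-based storage for M and $h_t$ are illustrative assumptions, not the original implementation.

```python
def fmpe_transform(x_t, M_columns, h_sparse):
    """Return y_t = x_t + M h_t for one frame.

    x_t       : list of d floats (original features)
    M_columns : dict mapping column index j -> list of d floats (column j of M)
    h_sparse  : dict mapping index j -> nonzero value of h_t[j]
    (Both dict layouts are assumptions made for this sketch.)
    """
    y_t = list(x_t)
    for j, h_j in h_sparse.items():
        column = M_columns.get(j)
        if column is None:
            # Column not stored: treated as all zeros, so it adds nothing.
            continue
        for i in range(len(y_t)):
            y_t[i] += column[i] * h_j
    return y_t


# Toy example: d = 2, and only two entries of h_t are nonzero.
x_t = [1.0, 2.0]
M_columns = {0: [0.1, 0.0], 3: [0.0, -0.2]}
h_sparse = {0: 0.5, 3: 0.5}
y_t = fmpe_transform(x_t, M_columns, h_sparse)
```

The cost per frame is proportional to the number of nonzero entries of $h_t$ times d, regardless of the total dimension of $h_t$.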


Review of fMPE (3 of 3, training)

Specific learning rates for each parameter $M_{ij}$ are obtained by accumulating the positive and negative contributions to $\partial F / \partial M_{ij}$ and dividing by the sum of the absolute values of both.

Compensate for different dimensions of the feature vector having different average variance.

The differential w.r.t. matrix element $M_{ij}$ contains an “indirect” term reflecting changes that will happen in the means and variances when we re-train the system. This is necessary because the HMM parameters are trained with ML while the matrix is trained with MPE.

Features affect means & vars, means & vars affect objective function → differentiate back through the process.
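The per-parameter learning rate can be sketched as below, assuming it takes the form $\sigma_i / (E (p_{ij} + n_{ij}))$, where $p_{ij}$ and $n_{ij}$ are the accumulated positive and negative contributions, $\sigma_i$ is the per-dimension scale and E is a global constant. That exact form and the function names are assumptions consistent with the description above, not a transcription of the original code.

```python
def accumulate(contributions):
    """Split per-frame gradient contributions into positive and negative sums.

    Returns (p, n), both non-negative.
    """
    p = sum(c for c in contributions if c > 0.0)
    n = sum(-c for c in contributions if c < 0.0)
    return p, n


def update_step(contributions, sigma_i, E):
    """Step for one matrix element, assuming rate = sigma_i / (E * (p + n))."""
    p, n = accumulate(contributions)
    if p + n == 0.0:
        return 0.0  # no statistics, so leave the parameter alone
    gradient = p - n
    rate = sigma_i / (E * (p + n))
    return rate * gradient


# Toy example: three frames contribute to the same matrix element.
step = update_step([0.4, -0.1, 0.2], sigma_i=1.0, E=2.0)
```

Under this assumed form the step can never exceed $\sigma_i / E$ in magnitude, and it shrinks when positive and negative contributions largely cancel.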


Mean offsets as features

Probably the most important change (results already given in last EARS meeting):

Using far fewer Gaussians (e.g. 1000 instead of 100,000) and adding the offsets of the observed features from the mean.

If the posteriors were $[\gamma_1, \gamma_2, \ldots]$, we are now using:

$[\, 5.0\,\gamma_1,\ \gamma_1 (x_t(1)-\mu_1(1))/\sigma_1(1),\ \gamma_1 (x_t(2)-\mu_1(2))/\sigma_1(2),\ \ldots,\ 5.0\,\gamma_2,\ \gamma_2 (x_t(1)-\mu_2(1))/\sigma_2(1),\ \gamma_2 (x_t(2)-\mu_2(2))/\sigma_2(2),\ \ldots \,]$

Each posterior is followed by the offsets of the feature from the mean. Divide by $\sigma_n$ to ensure equal scales on all offsets. 5.0 is a scale to put more weight on the posterior itself.

For 1000 Gaussians, the final dimension of the feature $h_t$ would be $1000(d+1)$ for d-dimensional features (ignoring frame splicing).

Improves both accuracy and speed.
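The layout of the new feature vector can be sketched as follows. The function and argument names are hypothetical, and the sketch builds the vector densely for clarity; a real setup would presumably keep only Gaussians with non-negligible posteriors.

```python
def mean_offset_features(x_t, posteriors, means, stddevs, posterior_scale=5.0):
    """Build h_t for one frame: each posterior followed by scaled mean offsets.

    x_t        : list of d floats
    posteriors : list of N Gaussian posteriors for this frame
    means      : list of N lists of d floats
    stddevs    : list of N lists of d floats
    Returns a list of N * (d + 1) floats.
    """
    h_t = []
    for gamma, mu, sigma in zip(posteriors, means, stddevs):
        h_t.append(posterior_scale * gamma)      # scaled posterior
        for x, m, s in zip(x_t, mu, sigma):
            h_t.append(gamma * (x - m) / s)      # offset from mean, per dimension
    return h_t


# Toy example: N = 2 Gaussians, d = 2 dimensions.
h_t = mean_offset_features(
    x_t=[1.0, 2.0],
    posteriors=[0.75, 0.25],
    means=[[0.0, 0.0], [1.0, 2.0]],
    stddevs=[[1.0, 2.0], [1.0, 1.0]],
)
```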


Multiple layer framework

Motivation: using mean offsets combined with frame averaging and splicing reduces sparsity of $h_t$ to the point where training takes much longer.

Need to reorganize the calculation into multiple stages.

Developed a code framework where features can undergo multiple layers of processing and propagate differentials back to previous layers.

Using multiple modules with a normalized interface (e.g. a layer doing a linear transformation would be called in the same way as a layer calculating Gaussian posteriors)*

Makes it very easy to add new kinds of processing (just copy, rename and modify an existing module).

Setup controlled by a config file.

* Except that some features need to be stored sparsely.


Context expansion in multiple layer framework (1 of 2)

Previously, we would calculate $h_t$ explicitly (including splicing) and then project. But with mean offsets & splicing it is not sparse enough.

Now, calculate the “single-frame” $h_t$ with no splicing (e.g. of size $1000(d+1)$) and project it to a multiple of d, e.g. 9d, then splice and project to d.

$h_t \rightarrow M_1 h_t \rightarrow M_2 (M_1 h_t, M_1 h_{t+1}, M_1 h_{t-1}, \ldots)$

(dimension): $1000(d+1) \rightarrow 9d \rightarrow d$

Splice the 9d-dimensional feature across e.g. 80 frames and project down to d with a projection s.t. each output dimension only “sees” 1/d of the input dimensions. #parameters = 9 * 80 * d.

Initialize projection to be equivalent to original context expansion, so the first of the 9 contexts gets projected from the central frame, the second gets projected only from one frame to the right, etc.
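A small sketch of the second (context expansion) projection and its initialization. It assumes that output dimension i sees only component i of each context block in each spliced frame, which is one reading of the "1/d of the input dimensions" constraint. The weight layout, the function names and the toy offsets are all illustrative assumptions.

```python
def init_context_weights(d, offsets_per_context, window):
    """weights[i][k][w]: weight for output dim i, context block k, window slot w.

    Each context block k starts with a single 1.0 at its own frame offset,
    so the layer initially picks block k from exactly one frame.
    """
    half = window // 2
    weights = []
    for _ in range(d):
        per_dim = []
        for offset in offsets_per_context:
            row = [0.0] * window
            row[half + offset] = 1.0
            per_dim.append(row)
        weights.append(per_dim)
    return weights


def context_expand(z, t, weights):
    """z[t][k][i] is the first-layer output (frame t, block k, dimension i)."""
    d = len(weights)
    window = len(weights[0][0])
    half = window // 2
    out = [0.0] * d
    for i in range(d):
        for k, row in enumerate(weights[i]):
            for w, weight in enumerate(row):
                tau = t + w - half
                if weight != 0.0 and 0 <= tau < len(z):
                    out[i] += weight * z[tau][k][i]
    return out


# Toy sizes: d = 2, 3 context blocks (centre, one right, one left), window of 5.
d, window, offsets = 2, 5, [0, 1, -1]
weights = init_context_weights(d, offsets, window)
num_params = sum(len(row) for per_dim in weights for row in per_dim)
z = [[[100.0 * t + 10.0 * k + i for i in range(d)] for k in range(3)]
     for t in range(4)]
out = context_expand(z, 2, weights)
```

With 9 blocks and an 80-frame window this layout gives 9 * 80 * d parameters, matching the count on the slide.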


Context expansion in multiple layer framework (2 of 2)

I neglected to mention in the paper that…

The context expansion layer is trained with held-out data (one out of every 10 files).

Otherwise, it tries to scale up the fMPE contribution as much as it can to maximize overtraining.

This is a problem with all setups that involve multiplying two fMPE-trained things.

[ Note – I do not bother making sure that the source of the “indirect” contributions to the differential was also held out. ]


Improved method of setting learning rates

When changing setups, the appropriate learning rate (controlled by E) can change.

Set a “target” criterion improvement for the first iteration and set E on the first iteration based on that.

Use the same value of E in subsequent iterations.

Using 0.06 for the main (first) matrix and 0.007 for the second (context expansion) layer. Reduce these values for low-WER domains.

Note that the context expansion layer is trained only from the second iteration, since the differential would be zero on the first iteration.
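One plausible way to turn a target improvement into a value of E is sketched below. It assumes a first-order prediction of the criterion change (gradient times step, summed over parameters) and steps that scale as 1/E, so the predicted improvement is inversely proportional to E. The slides do not spell out the mechanism, so the function names and this derivation are assumptions.

```python
def predicted_improvement(gradients, rates_at_unit_E, E):
    """First-order predicted criterion change: sum of gradient * step.

    The step for each parameter is taken to be (rate_at_unit_E / E) * gradient.
    """
    return sum(g * (r / E) * g for g, r in zip(gradients, rates_at_unit_E))


def choose_E(gradients, rates_at_unit_E, target):
    """Pick E so that the first-iteration predicted improvement equals target."""
    return predicted_improvement(gradients, rates_at_unit_E, 1.0) / target


# Toy example with two parameters and a target improvement of 0.06.
gradients = [0.5, -0.2]
rates_at_unit_E = [1.0, 2.0]
E = choose_E(gradients, rates_at_unit_E, target=0.06)
check = predicted_improvement(gradients, rates_at_unit_E, E)
```

E is then held fixed for later iterations, as described above.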


Improved method of setting per-dimension learning rates

The original per-dimension learning rates included a factor $\sigma_i$ (an average standard deviation) for matrix element $M_{ij}$ to have the appropriate scale for the target dimension being added to.

This did not seem to work well for MFCC parameters: got wide variation in contribution to criterion improvement between dimensions (perhaps broken by extreme values).

Replace $\sigma_i$ with $1/\sqrt{S_i}$, where $S_i$ is the average squared value of the summed positive and negative contributions to each $\partial F / \partial M_{ij}$.

Gives better ratio between learning rates for different dimensions.

If E is set automatically as described above, the overall learning rate will be appropriate.
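A sketch of the replacement scale, assuming that "summed positive and negative contributions" means $p_{ij} + n_{ij}$ for each element of row i, and that the average is taken over the elements j of that row. Both readings, and the function name, are assumptions.

```python
import math


def dimension_scale(p_row, n_row):
    """Return 1 / sqrt(S_i) for one output dimension i.

    p_row, n_row : accumulated positive and negative contributions (both >= 0)
                   for every matrix element M_ij in row i.
    S_i is taken to be the mean of (p_ij + n_ij)^2 over the row.
    """
    s_i = sum((p + n) ** 2 for p, n in zip(p_row, n_row)) / len(p_row)
    if s_i == 0.0:
        return 0.0  # no statistics for this dimension, so do not update it
    return 1.0 / math.sqrt(s_i)


# Toy example: every element of the row has p + n = 4, so S_i = 16.
scale = dimension_scale([3.0, 1.0], [1.0, 3.0])
```

Only the ratio of these scales between dimensions matters if E is set automatically from a target improvement.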


“Smooth update”

When training context expansion, sometimes an instability appeared for certain dimensions.

Developed a method to detect and stop instabilities.

Intuition – if too many parameters are changing direction and moving farther than last time, the learning rate is too high.

(1) Define a set of meaningful subsets of matrix parameters (e.g. matrix rows, columns).

(2) For each subset in decreasing order of size: if for more than 10% of the parameters p in the subset, the value $p_n$ on iteration n is on the opposite side of $p_{n-2}$ from $p_{n-1}$, reduce the learning rate for that subset until this no longer holds (i.e. move the parameters $p_n$ towards $p_{n-1}$).

[Figure: example parameter updates labeled “OK”, “OK” and “Too far”.]
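The two steps above can be sketched as follows. The halving factor, the cap on the number of shrink rounds and the flat parameter indexing are assumptions made for illustration; only the 10% threshold and the ordering by subset size come from the description.

```python
def overshoots(p_prev2, p_prev1, p_new):
    """True if p_new is on the opposite side of p_prev2 from p_prev1."""
    return (p_new - p_prev2) * (p_prev1 - p_prev2) < 0.0


def smooth_update(prev2, prev1, new, subsets,
                  max_fraction=0.1, shrink=0.5, max_rounds=20):
    """Pull proposed values back towards prev1 wherever a subset is unstable.

    prev2, prev1, new : parameter values on iterations n-2, n-1 and n
    subsets           : lists of parameter indices (e.g. rows, columns)
    """
    new = list(new)
    for subset in sorted(subsets, key=len, reverse=True):
        for _ in range(max_rounds):
            bad = sum(overshoots(prev2[k], prev1[k], new[k]) for k in subset)
            if bad <= max_fraction * len(subset):
                break
            # Equivalent to reducing the learning rate for this subset.
            for k in subset:
                new[k] = prev1[k] + shrink * (new[k] - prev1[k])
    return new


# Toy example: two of four parameters reverse and overshoot their n-2 value.
prev2 = [0.0, 0.0, 0.0, 0.0]
prev1 = [1.0, 1.0, 1.0, 1.0]
proposed = [-1.0, -1.0, 1.5, 1.2]
smoothed = smooth_update(prev2, prev1, proposed, subsets=[[0, 1, 2, 3]])
```

Shrinking moves every parameter in the subset, including well-behaved ones, which matches the idea of lowering the learning rate for the whole subset.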


“Out of the box” training

The reason for many of the changes described above is to obtain a setup that will work on different domains without tuning.

E.g. new methods of setting learning rates.

And “smooth update”, which can neutralize the effect of a learning rate that has been set too high.

fMPE reliably gives improvements without tuning.

E.g. recently trained some acoustic models for fast transcription of call center data (no adaptation). fMPE+MPE improved results by 8.5% absolute from a 45% baseline.

For a small-vocabulary task, fMPE+MPE improved results by 30% relative from a 1.20% baseline.

Note – I now always use the same acoustic scales as normal MPE (e.g. 0.1 or 0.05, or inverse of normal LM scale if preferred).


Diagnostics

Always use plenty of diagnostics, e.g.:

Per-dimension measures of predicted criterion improvement and sign changes;
The overall predicted and observed criterion improvement;
Check for indirect & direct differentials canceling overall (see paper);
Look at average size of fMPE contributions to features;
Check distribution of data among Gaussians used to calculate posteriors;
Use measures of difference between HMM sets.

Print out plenty of graphs and histograms where appropriate.

“It doesn’t work” is not enough information to fix it if it’s broken.
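As one example, the first diagnostic in the list might look like the sketch below: a per-dimension report of the first-order predicted improvement and the fraction of parameters whose step changed sign since the previous iteration. The report format and names are hypothetical.

```python
def per_dimension_diagnostics(gradients, steps, previous_steps):
    """Each argument is a list of rows, one row per output dimension i."""
    report = []
    for i, (g_row, s_row, ps_row) in enumerate(
            zip(gradients, steps, previous_steps)):
        # First-order prediction of the criterion change for this dimension.
        predicted = sum(g * s for g, s in zip(g_row, s_row))
        # Count parameters whose step direction flipped since last iteration.
        flips = sum(1 for s, ps in zip(s_row, ps_row) if s * ps < 0.0)
        report.append({
            "dim": i,
            "predicted_improvement": predicted,
            "sign_change_fraction": flips / len(s_row),
        })
    return report


# Toy example: two dimensions with two parameters each.
report = per_dimension_diagnostics(
    gradients=[[1.0, 2.0], [0.5, 0.5]],
    steps=[[0.5, 0.25], [1.0, -1.0]],
    previous_steps=[[0.5, -0.25], [1.0, -1.0]],
)
for row in report:
    print(row)
```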


Other issues investigated

Sigmoid layers – no improvement.

Momentum update rule – no improvement.

Training “variances” on features (a quantity added to $(x-\mu)^2$ quantities in training and test) – this gave some improvement, ~1-2% relative.

Training multiple systems on the same data in parallel, sharing only the fMPE transform – should multiply effective amount of data (seems to help ~1-2% relative).

Note - I don’t know whether the way to obtain the Gaussians is critical. Jasha Droppo (Microsoft) suggests training a GMM on the features with a globally tied variance.


What is most important?

Use an appropriate learning rate for your features (e.g. set target improvement).

Setting the learning rate too high can cause dramatic instability.

Setting it too low can cause very slow convergence.

Use the indirect differential if you want to train on fMPE features.

Use frame splicing for acoustic context.

Need a baseline discriminative training setup that works (e.g. lattice generation).
