Slide 1: Update on WordWave Fisher Transcription
Owen Kimball, Chia-lin Kao, Jeff Ma, Rukmini Iyer, Rich Schwartz, John Makhoul

Slide 2: Outline
- Schedule update
- Investigating WordWave + auto segmentation quality
  - Updated evaluation method
  - Separating effect of transcripts and segmentation
  - Improved segmentation algorithm
- Plans
- Update on using Fisher data in training

Slide 3: Data Schedule
- BBN has received 925 hours from WordWave (WWave)
- Processed and released 478 hours via LDC:
  - 91 hrs on 8/1/03
  - 300 hrs on 9/24/03
  - 87 hrs on 10/21/03
- WWave is currently running more slowly than planned
  - Reason: CTS transcription is hard!
  - They will complete 1600 hrs by the end of Jan 04, with the remaining 200 hrs to follow as quickly as possible

Slide 4: Segmentation Quality as of Sept 03
- Auto segmentation goal: given audio and a transcript but no timing info, break the audio into fairly short segments and align the correct text to each segment
- In September, we compared transcription and segmentation approaches on a 20-hour Swbd set:
  - LDC/MSU careful transcription and manual segmentation vs.
  - LDC fast transcription and manual segmentation vs.
  - WWave transcripts + BBN automatic segmentation
- Compared two different segmentation algorithms (a rough sketch of the rejection step they share follows this list):
  - Alg I: run the recognizer and segment at "reliable" silences; decode using that segmentation and reject segments based on sclite alignment errors
  - Alg II: use the recognizer to get a coarse initial segmentation, then run forced alignment within the coarse segments to find finer segments; final rejection pass as before
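
The deck describes both algorithms only at a high level. As a minimal self-contained sketch of the rejection step they share: decode each candidate segment, align the hypothesis against the segment's reference text, and drop segments that disagree too much. The real system used sclite for the alignment; a plain word-level edit distance stands in for it here, and the 30% threshold is an illustrative guess, not a value from the talk.

    # Sketch of the rejection pass shared by Alg I and Alg II: keep a
    # segment only if the recognizer's decode of it aligns well with the
    # reference text assigned to it. sclite did the alignment in the real
    # system; plain edit-distance WER stands in for it here.

    def wer(ref: list[str], hyp: list[str]) -> float:
        """Word error rate via Levenshtein distance over words."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def keep_segment(ref_text: str, hyp_text: str, max_wer: float = 0.3) -> bool:
        """Reject segments whose decode disagrees too much with their text."""
        return wer(ref_text.split(), hyp_text.split()) <= max_wer

    print(keep_segment("yeah i think so", "yeah i think so"))  # True: keep
    print(keep_segment("yeah i think so", "the uh weather"))   # False: reject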

Slide 5: Performance Comparison in Sept
- Unadapted recognition; acoustic models trained on the 20-hour Swbd1 set, LM trained on full Switchboard
- ML, GI, VTL, HLDA-trained models

Transcripts / Segmentation   Training hours   Eval01 WER
Manual LDC+MSU               19.9             41.1
CTRAN / Alg I                19.4             41.8
Fast Manual LDC              17.9             41.2
WWave / Alg I                19.2             41.4
WWave / Alg II               19.5             41.4

Slide 6: Improving the Evaluation Method
- There were a number of issues and shortcuts in the training and test setup that clouded comparisons
- We therefore:
  - Adopted an improved training sequence, including new binaries
  - Reduced pruning errors in decoding
  - Converted from fast approximate VTL estimation to a more careful approach (one common technique is sketched at the end of this slide)
  - Adopted more stable VTL models
- VTL models trained on 20 hours differed dramatically for small changes in segmentation
  - This is a bug in our VTL model estimation that we need to fix
  - For the following experiments we used stable VTL models from the RT03 eval
- Switched from our historic LDC+MSU baseline to all-MSU for simplicity
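
The deck does not spell out the "more careful" VTL approach. Purely as an illustration of one common technique, the sketch below does a maximum-likelihood grid search over per-speaker warp factors; the 0.88-1.12 grid is a conventional VTLN range, not a value from the talk, and both helper functions are hypothetical stand-ins rather than BBN's recipe.

    # Hypothetical ML grid search for a per-speaker VTL warp factor.
    # warp_features() and log_likelihood() are placeholder stand-ins for
    # recomputing warped features and scoring them against a warp-selection
    # acoustic model; only the grid-search shape is the point of this sketch.

    def warp_features(features, alpha):
        """Stand-in: recompute features with a warped frequency axis."""
        return [[x * alpha for x in frame] for frame in features]

    def log_likelihood(features):
        """Stand-in: score features under the warp-selection model."""
        return -sum(x * x for frame in features for x in frame)

    def estimate_warp(features):
        grid = [round(0.88 + 0.02 * i, 2) for i in range(13)]  # 0.88 .. 1.12
        return max(grid, key=lambda a: log_likelihood(warp_features(features, a)))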

Slide 7: Comparison with Better Train and Test

Transcripts / Segmentation   Training hours   Eval01 WER
LDC+MSU                      19.9             38.5
MSU                          23.4             38.0
Fast LDC                     17.9             39.4
WWave / Alg I                19.6             38.8
WWave / Alg II               19.5             38.8

Slide 8: Separating Effect of Segmentation
- Compare segmentations using identical (MSU) transcripts
- Alg I WER is the same for WWave and MSU transcripts, so segmentation may be the biggest (or only) problem

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8

Slide 9: Segmentation Algorithm III
- Algorithm II used forced alignment within the coarse segments provided by an initial recognition pass, but examination revealed unrecoverable errors (words in the wrong segment) caused by the coarse initial segmentation
- Tried forced alignment of complete conversation sides
- Overcame initial problems of failed alignments by:
  - Pre-chopping out long silences, where our system tends to get confused
    - Used the auto-segmenter developed for the RT03 CTS eval for this
  - Changing the forced alignment program to do much less pruning at the beginning and end of each conversation
    - This accommodated things like beeps, line noise, and words cut off by the recording start and stop
- Forced alignment is followed by a script that breaks segments at silences, then a rejection pass (the segment-breaking step is sketched below)
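
As a concrete illustration of the segment-breaking script, here is a minimal sketch that cuts a force-aligned conversation side wherever the silence gap between consecutive words is long enough. The (word, start, end) tuple format and the 0.5 s gap threshold are assumptions for illustration, not values from the talk.

    # Sketch of Alg III's segment-breaking step: given force-aligned words
    # with start/end times for one conversation side, start a new segment
    # wherever the inter-word silence exceeds a threshold. The 0.5 s gap
    # is an illustrative choice.

    def break_at_silences(aligned_words, min_gap=0.5):
        """aligned_words: time-ordered list of (word, start_sec, end_sec)."""
        segments, current = [], []
        for word in aligned_words:
            if current and word[1] - current[-1][2] >= min_gap:
                segments.append(current)  # silence long enough: cut here
                current = []
            current.append(word)
        if current:
            segments.append(current)
        return segments

    side = [("yeah", 0.0, 0.3), ("i", 0.35, 0.4), ("know", 0.4, 0.7),
            ("so", 1.9, 2.1), ("anyway", 2.2, 2.6)]
    print([[w for w, _, _ in seg] for seg in break_at_silences(side)])
    # [['yeah', 'i', 'know'], ['so', 'anyway']]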

Slide 10: Algorithm III with MSU transcripts

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8
MSU / Alg III                21.8             38.2
- Manually comparing MSU and Alg III segmentations showed that Alg III:
  - had more, shorter segments
  - had less silence padding around utterances
  - allowed utterances longer than 15 seconds when the speaker did not pause
- Modified Alg III to approximate MSU's statistics (a sketch of one possible mechanism follows)
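
The deck does not say how the modification works; one plausible mechanism, sketched below purely as an assumption, is to recursively split any segment longer than a duration cap at its longest internal silence. The 15 s cap mirrors the MSU statistic named above; the splitting rule itself is a guess.

    # Hypothetical sketch of pushing Alg III toward MSU-like statistics:
    # recursively split any segment longer than max_dur at its longest
    # internal silence, yielding more, shorter segments.

    def split_long_segments(segment, max_dur=15.0):
        """segment: time-ordered list of (word, start_sec, end_sec)."""
        if len(segment) < 2 or segment[-1][2] - segment[0][1] <= max_dur:
            return [segment]
        gaps = [segment[i + 1][1] - segment[i][2] for i in range(len(segment) - 1)]
        cut = gaps.index(max(gaps)) + 1  # split at the longest silence
        return (split_long_segments(segment[:cut], max_dur)
                + split_long_segments(segment[cut:], max_dur))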

Slide 11: Improved Algorithm III

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8
MSU / Original Alg III       21.8             38.2
MSU / Improved Alg III       22.5             38.1
- Matching MSU's utterance lengths and silence improves WER slightly
- Alg III seems good enough, at least for this task

Slide 12: Results with WordWave Transcripts
- WWave transcripts seem fine given the improved segmentation

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
Fast LDC                     17.9             39.4
WWave / Alg I                19.6             38.8
WWave / Original Alg III     21.2             38.1

Slide 13: Plans
- Confirm quality of WWave with Alg III segmentation:
  - On the Swbd 20-hour set, train MMI models to compare all-MSU vs. WWave/Alg III
  - On the Swbd + 150-hour Fisher experiment, where we got gains using Alg I segmented data
    - Performance should not degrade
- Improve speed of Alg III
- Resegment and redistribute all data that has been released so far
- Catch up with and continue segmenting the latest WWave transcript deliveries

Slide 14: Update on Adding Fisher Data
- In Martigny, showed a 1.4% gain from adding 150 hrs of Fisher data (Alg I segmented) to RT03 training
- Hoped to have results with 350 hours, but we had bugs in our initial runs
- Did train MMI on RT03 (SW370) vs. RT03 + Fisher150; results below are on the 2nd adaptation pass with POS LM rescoring
- CAVEAT: non-rigorous comparison! The Fisher150 system was optimized (gains 0.1-0.2%) and used a different phone set and faster training (degrades 0.2% in other comparisons)

Training       Eval03 WER
RT03: SW370    23.1
+ Fisher 150   22.5