Slide 1: Update on WordWave Fisher Transcription
Owen Kimball, Chia-lin Kao, Jeff Ma, Rukmini Iyer, Rich Schwartz, John Makhoul

Slide 2: Outline
- Schedule update
- Investigating WordWave + auto segmentation quality
  - Updated evaluation method
  - Separating effect of transcripts and segmentation
  - Improved segmentation algorithm
- Plans
- Update on using Fisher data in training

Slide 3: Data Schedule
- BBN has received 925 hours from WordWave (WWave)
- Processed and released 478 hours via LDC:
  - 91 hrs on 8/1/03
  - 300 hrs on 9/24/03
  - 87 hrs on 10/21/03
- WWave is currently running more slowly than planned
  - Reason: CTS transcription is hard!
  - They will complete 1600 hrs by the end of Jan 04, with the remaining 200 hrs to follow as quickly as possible

Slide 4: Segmentation Quality as of Sept 03
- Auto segmentation goal: given audio and a transcript but no timing info, break the audio into fairly short segments and align the correct text to each segment
- In September, we compared transcription and segmentation approaches on a 20-hour Swbd set:
  - LDC/MSU careful transcription and manual segmentation vs.
  - LDC fast transcription and manual segmentation vs.
  - WWave transcripts + BBN automatic segmentation
- Compared two different segmentation algorithms (a rough sketch of the rejection step they share follows this list):
  - Alg I: run the recognizer and segment at "reliable" silences; decode using that segmentation and reject segments based on sclite alignment errors
  - Alg II: use the recognizer to get a coarse initial segmentation, then run forced alignment within the coarse segments to find finer segments; final rejection pass as before
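
The deck describes both algorithms only at a high level. As a minimal self-contained sketch of the rejection step they share: decode each candidate segment, align the hypothesis against the segment's reference text, and drop segments that disagree too much. The real system used sclite for the alignment; a plain word-level edit distance stands in for it here, and the 30% threshold is an illustrative guess, not a value from the talk.

    # Sketch of the rejection pass shared by Alg I and Alg II: keep a
    # segment only if the recognizer's decode of it aligns well with the
    # reference text assigned to it. sclite did the alignment in the real
    # system; plain edit-distance WER stands in for it here.

    def wer(ref: list[str], hyp: list[str]) -> float:
        """Word error rate via Levenshtein distance over words."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def keep_segment(ref_text: str, hyp_text: str, max_wer: float = 0.3) -> bool:
        """Reject segments whose decode disagrees too much with their text."""
        return wer(ref_text.split(), hyp_text.split()) <= max_wer

    print(keep_segment("yeah i think so", "yeah i think so"))  # True: keep
    print(keep_segment("yeah i think so", "the uh weather"))   # False: reject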

Slide 5: Performance Comparison in Sept
- Unadapted recognition; acoustic models trained on the 20-hour Swbd1 set, LM trained on full Switchboard
- ML, GI, VTL, HLDA-trained models

Transcripts / Segmentation   Training hours   Eval01 WER
Manual LDC+MSU               19.9             41.1
CTRAN / Alg I                19.4             41.8
Fast Manual LDC              17.9             41.2
WWave / Alg I                19.2             41.4
WWave / Alg II               19.5             41.4

Slide 6: Improving the Evaluation Method
- There were a number of issues and shortcuts in the training and test setup that clouded comparisons
- We therefore:
  - Adopted an improved training sequence, including new binaries
  - Reduced pruning errors in decoding
  - Converted from fast approximate VTL estimation to a more careful approach (one common technique is sketched at the end of this slide)
  - Adopted more stable VTL models
- VTL models trained on 20 hours differed dramatically for small changes in segmentation
  - This is a bug in our VTL model estimation that we need to fix
  - For the following experiments we used stable VTL models from the RT03 eval
- Switched from our historic LDC+MSU baseline to all-MSU for simplicity
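
The deck does not spell out the "more careful" VTL approach. Purely as an illustration of one common technique, the sketch below does a maximum-likelihood grid search over per-speaker warp factors; the 0.88-1.12 grid is a conventional VTLN range, not a value from the talk, and both helper functions are hypothetical stand-ins rather than BBN's recipe.

    # Hypothetical ML grid search for a per-speaker VTL warp factor.
    # warp_features() and log_likelihood() are placeholder stand-ins for
    # recomputing warped features and scoring them against a warp-selection
    # acoustic model; only the grid-search shape is the point of this sketch.

    def warp_features(features, alpha):
        """Stand-in: recompute features with a warped frequency axis."""
        return [[x * alpha for x in frame] for frame in features]

    def log_likelihood(features):
        """Stand-in: score features under the warp-selection model."""
        return -sum(x * x for frame in features for x in frame)

    def estimate_warp(features):
        grid = [round(0.88 + 0.02 * i, 2) for i in range(13)]  # 0.88 .. 1.12
        return max(grid, key=lambda a: log_likelihood(warp_features(features, a)))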

Slide 7: Comparison with Better Train and Test

Transcripts / Segmentation   Training hours   Eval01 WER
LDC+MSU                      19.9             38.5
MSU                          23.4             38.0
Fast LDC                     17.9             39.4
WWave / Alg I                19.6             38.8
WWave / Alg II               19.5             38.8

Slide 8: Separating Effect of Segmentation
- Compare segmentations using identical (MSU) transcripts
- Alg I WER is the same for WWave and MSU transcripts, so segmentation may be the biggest (or only) problem

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8

Slide 9: Segmentation Algorithm III
- Algorithm II used forced alignment within the coarse segments provided by an initial recognition pass, but examination revealed unrecoverable errors (words in the wrong segment) caused by the coarse initial segmentation
- Tried forced alignment of complete conversation sides
- Overcame initial problems of failed alignments by:
  - Pre-chopping out long silences, where our system tends to get confused
    - Used the auto-segmenter developed for the RT03 CTS eval for this
  - Changing the forced alignment program to do much less pruning at the beginning and end of each conversation
    - This accommodated things like beeps, line noise, and words cut off by the recording start and stop
- Forced alignment is followed by a script that breaks segments at silences, then a rejection pass (the segment-breaking step is sketched below)
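
As a concrete illustration of the segment-breaking script, here is a minimal sketch that cuts a force-aligned conversation side wherever the silence gap between consecutive words is long enough. The (word, start, end) tuple format and the 0.5 s gap threshold are assumptions for illustration, not values from the talk.

    # Sketch of Alg III's segment-breaking step: given force-aligned words
    # with start/end times for one conversation side, start a new segment
    # wherever the inter-word silence exceeds a threshold. The 0.5 s gap
    # is an illustrative choice.

    def break_at_silences(aligned_words, min_gap=0.5):
        """aligned_words: time-ordered list of (word, start_sec, end_sec)."""
        segments, current = [], []
        for word in aligned_words:
            if current and word[1] - current[-1][2] >= min_gap:
                segments.append(current)  # silence long enough: cut here
                current = []
            current.append(word)
        if current:
            segments.append(current)
        return segments

    side = [("yeah", 0.0, 0.3), ("i", 0.35, 0.4), ("know", 0.4, 0.7),
            ("so", 1.9, 2.1), ("anyway", 2.2, 2.6)]
    print([[w for w, _, _ in seg] for seg in break_at_silences(side)])
    # [['yeah', 'i', 'know'], ['so', 'anyway']]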

Slide 10: Algorithm III with MSU transcripts

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8
MSU / Alg III                21.8             38.2
- Manually comparing MSU and Alg III segmentations showed that Alg III:
  - had more, shorter segments
  - had less silence padding around utterances
  - allowed utterances longer than 15 seconds when the speaker did not pause
- Modified Alg III to approximate MSU's statistics (a sketch of one possible mechanism follows)
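
The deck does not say how the modification works; one plausible mechanism, sketched below purely as an assumption, is to recursively split any segment longer than a duration cap at its longest internal silence. The 15 s cap mirrors the MSU statistic named above; the splitting rule itself is a guess.

    # Hypothetical sketch of pushing Alg III toward MSU-like statistics:
    # recursively split any segment longer than max_dur at its longest
    # internal silence, yielding more, shorter segments.

    def split_long_segments(segment, max_dur=15.0):
        """segment: time-ordered list of (word, start_sec, end_sec)."""
        if len(segment) < 2 or segment[-1][2] - segment[0][1] <= max_dur:
            return [segment]
        gaps = [segment[i + 1][1] - segment[i][2] for i in range(len(segment) - 1)]
        cut = gaps.index(max(gaps)) + 1  # split at the longest silence
        return (split_long_segments(segment[:cut], max_dur)
                + split_long_segments(segment[cut:], max_dur))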

Slide 11: Improved Algorithm III

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
MSU / Alg I                  20.2             38.8
MSU / Original Alg III       21.8             38.2
MSU / Improved Alg III       22.5             38.1
- Matching MSU's utterance lengths and silence improves WER slightly
- Alg III seems good enough, at least for this task

Slide 12: Results with WordWave Transcripts
- WWave transcripts seem fine given the improved segmentation

Transcripts / Segmentation   Training hours   Eval01 WER
MSU / MSU                    23.4             38.0
Fast LDC                     17.9             39.4
WWave / Alg I                19.6             38.8
WWave / Original Alg III     21.2             38.1

Slide 13: Plans
- Confirm quality of WWave with Alg III segmentation:
  - On the Swbd 20-hour set, train MMI models to compare all-MSU vs. WWave/Alg III
  - On the Swbd + 150-hour Fisher experiment, where we got gains using Alg I segmented data
    - Performance should not degrade
- Improve speed of Alg III
- Resegment and redistribute all data that has been released so far
- Catch up with and continue segmenting the latest WWave transcript deliveries

Slide 14: Update on Adding Fisher Data
- In Martigny, showed a 1.4% gain from adding 150 hrs of Fisher data (Alg I segmented) to RT03 training
- Hoped to have results with 350 hours, but we had bugs in our initial runs
- Did train MMI on RT03 (SW370) vs. RT03 + Fisher150; results below are on the 2nd adaptation pass with POS LM rescoring
- CAVEAT: non-rigorous comparison! The Fisher150 system was optimized (gains 0.1-0.2%) and used a different phone set and faster training (degrades 0.2% in other comparisons)

Training       Eval03 WER
RT03: SW370    23.1
+ Fisher 150   22.5