The SRI 2006 Spoken Term Detection System
Dimitra Vergyri, Andreas Stolcke,
Ramana Rao Gadde, Wen Wang
Speech Technology & Research Laboratory
SRI International, Menlo Park, CA
Dec. 14, 2006, STD-06 Workshop
Outline
• STD system overview
  – STT systems
    • BNews system description
    • CTS system description
    • ConfMtg system description
  – Indexing
    • N-gram index from word lattices
    • NNet based posterior estimation
  – Retrieval
• Time and memory requirements
• ATWV Results
• Future work
SRI STD System
[Block diagram. Indexing step: Audio → STT → word lattices → INDEXER → N-gram index with posteriors. Retrieval: search terms + N-gram index → RETRIEVER → terms with times and probabilities.]
English BN STT System
• Single front-end: PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-independent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
• Acoustic training: Hub4, TDT2+TDT4+TDT4a, BNr1234 subset
  – MLE training: 3300 hours, MPE training: 1700 hours
  – 2500 x 200 Gaussians for nonCW triphones
  – 3000 x 160 Gaussians for CW triphones
• Word bigram and 5-gram LM trained on Hub4, TDT, BNr1234 transcripts, Hub4 LM training data, and NABN (cutoff date Nov. 30, 2003)
  – 62k words, 29M bigrams, 27M trigrams, 15M 4-grams, 2.4M 5-grams
• Duration rescoring (word-specific phone durations)
• Two-pass decoding
• First decoding stage: unadapted nonCW model with bigram LM
• CW models adapted to nonCW output after 5-gram LM and duration model lattice rescoring
• Lattice-constrained decoding with MLLR-adapted, SAT, CW model
English BN STT System
[Decoding flow diagram. Decoding stages: PLP MPE nonCW, PLP MPE CW. Lattices: 2-gram, 4-gram, adapted 5-gram. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output.]
• Runtimes:
  – 2.5xRT for unadapted lattices
  – 5.4xRT for adapted lattices
• ~10% relative WER improvement after adaptation
• Both decoding stages use shortlists.
English CTS STT System
• Two front-ends:
  – MFCC + voicing + MLP-features (52 + 10 + 25 → 39 + 25 dim)
  – PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-dependent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
• Acoustic training: all Hub5 + Fisher training
  – 2500 x 128 x 2 Gaussians for nonCW triphones
  – 3000 x 128 x 2 Gaussians for CW triphones
• Prosodic rescoring (word-specific phone durations, pause trigram)
• Word bigram and 4-gram LM
• Interpolated + pruned LM trained on CTS, BN, and Web data
  – 48k words, 16M bigrams, 16M trigrams, 12M 4-grams
• First lattice generation uses phone-loop MLLR nonCW MFCC and 2-gram LM
• Second constrained lattice generation uses cross-adapted CW SAT PLP models.
English Meeting STT System
[Stolcke et al., MLMI’05; Janin et al., MLMI’06]
• Based on CTS system architecture (2-pass system)
• Combination of CTS (narrow-band) and BN (wide-band) base models
• Acoustic models adapted to distant-mic meeting recordings using MMI-MAP
• MLP features adapted for meeting recordings by incremental training
• Mixture language model trained on meetings, CTS, and Web data
• System used in RT-06S meeting evaluation, co-developed with ICSI
English CTS & Confmtg STT Systems
[Decoding flow diagram. Decoding stages: MFCC-MLP MPE nonCW, PLP MPE CW. Lattices: 2-gram, 3-gram, adapted 4-gram. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output.]
• CTS runtime:
  – 1.8xRT for unadapted lattices
  – 2.5xRT for adapted lattices
• Confmtg runtime:
  – 5.4xRT for unadapted lattices
  – 6.8xRT for adapted lattices
• CTS system uses Gaussian shortlists in first pass only
• Confmtg system does not use shortlists.
English STT Result Summary (WER)
System    Test set     WER
BN        eval02       10.7%
BN        eval03       10.5%
BN        STD-dev06    23.2%
CTS       eval02       23.7%
CTS       eval03       24.0%
CTS       dev04        17.0%
CTS       STD-dev06    17.4%
Confmtg   dev04        36.9%
Confmtg   eval04s      37.2%
Confmtg   STD-dev06    44.2%
• STD-dev06 WER measured using references constructed from RTTM files
  – Systematic differences compared to standard STT references
  – For example, BN scoring does not exclude commercial segments
• Note: STT systems were not especially tuned for STD; used configurations inherited from STT evaluations.
Indexing of Word Lattices
• SRILM lattice-tool dumps all word 1-grams to 5-grams in lattices, along with side information
  – Posterior probabilities based on normalized recognizer scores
  – Start/end times, channel, waveform name
  – 0.5 s time tolerance to merge same N-grams with different times
  – Pronunciations (to detect OOV words, not used yet)
  – N-grams with posterior < 0.001 are omitted to keep index size reasonable
• Index = term occurrence table sorted by N-gram
• Indexing function incorporated in SRILM release 1.5.1
  – lattice-tool -write-ngram-index option
  – Downloadable from www.speech.sri.com/projects/srilm/
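The merge-and-prune step described on this slide can be sketched roughly as follows. This is an illustrative Python sketch, not the SRILM implementation: the `Hit` record, the `build_index` function, the rule that merged entries pool their posteriors (capped at 1), and pruning after merging are all assumptions made for the example. Only the 0.5 s tolerance, the 0.001 cutoff, and the sort by N-gram come from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One N-gram occurrence dumped from a lattice (hypothetical record)."""
    ngram: tuple       # words of the N-gram, e.g. ("new", "york")
    waveform: str
    channel: str
    start: float       # seconds
    end: float         # seconds
    posterior: float


def build_index(hits, tolerance=0.5, min_posterior=0.001):
    """Merge same N-grams with nearby times, prune low posteriors, sort."""
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.ngram, hit.waveform, hit.channel)].append(hit)

    index = []
    for group in groups.values():
        merged = []
        for hit in sorted(group, key=lambda h: h.start):
            last = merged[-1] if merged else None
            if (last is not None
                    and abs(hit.start - last.start) <= tolerance
                    and abs(hit.end - last.end) <= tolerance):
                # Assumption: merged occurrences pool their posterior mass.
                pooled = min(1.0, last.posterior + hit.posterior)
                merged[-1] = last._replace(posterior=pooled)
            else:
                merged.append(hit)
        # Assumption: pruning is applied after merging.
        index.extend(m for m in merged if m.posterior >= min_posterior)
    # Tuples sort by their first field, so the table is ordered by N-gram.
    return sorted(index)


hits = [
    Hit(("new", "york"), "w1", "1", 1.00, 1.50, 0.4),
    Hit(("new", "york"), "w1", "1", 1.20, 1.60, 0.3),   # within 0.5 s: merged
    Hit(("new", "york"), "w1", "1", 9.00, 9.50, 0.0005),  # below cutoff: dropped
    Hit(("york",), "w1", "1", 1.20, 1.50, 0.8),
]
index = build_index(hits)
```

Keeping the table sorted by N-gram is what lets retrieval be done with a plain sorted join later on.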
Score Calibration
• Neural net maps posteriors to unbiased STD scores
  – Input features used: audio source (bnews/cts/confmtg), LM joint probability, LM N-gram length, #words, duration, lattice posterior
  – Used LNKnet software for training MLP to predict correctness of hypothesized term (1 hidden layer with 10 nodes)
  – Cross entropy objective function
• Neural net trained using the dev06 term list
• Training on raw data improved Occurrence Weighted Value, not Actual Term-Weighted Value (ATWV)
  – Also required re-tuning the posterior threshold.
• Resample training data to approximate ATWV
  – Downsample/upsample within occurrences of each term to have an equal number of training samples for each term.
  – Posterior threshold 0.5 ended up being optimal for ATWV (at least on the training data).
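The per-term resampling idea can be illustrated with a short Python sketch. The function name `resample_per_term`, the sample layout `(term, features, is_correct)`, and the target count `per_term` are hypothetical; the slides do not say how the target count was chosen or whether sampling was done with replacement, so those choices are assumptions here.

```python
import random
from collections import Counter, defaultdict


def resample_per_term(samples, per_term, seed=0):
    """Return a training set with exactly `per_term` samples for each term.

    samples: list of (term, features, is_correct) tuples.
    """
    rng = random.Random(seed)
    by_term = defaultdict(list)
    for sample in samples:
        by_term[sample[0]].append(sample)

    balanced = []
    for term in sorted(by_term):
        pool = by_term[term]
        if len(pool) >= per_term:
            # Frequent term: downsample without replacement.
            balanced.extend(rng.sample(pool, per_term))
        else:
            # Rare term: keep every sample, then upsample with replacement.
            balanced.extend(pool)
            balanced.extend(rng.choices(pool, k=per_term - len(pool)))
    return balanced


samples = [("the", [0.9], True)] * 5 + [("stolcke", [0.4], False)]
balanced = resample_per_term(samples, per_term=3)
term_counts = Counter(term for term, _, _ in balanced)
```

With equal counts per term, each term contributes equally to the cross-entropy objective, which mirrors the way ATWV averages over terms rather than over occurrences.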
Searching & Retrieval
• Convert the search terms into a sorted list
• Run the Unix “join” command between the index list obtained in indexing and the term list
• YES/NO decision based on the posterior threshold 0.5
• Run time almost independent of the size of the search list (depends on the index size)
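What the sorted join does can be sketched in a few lines of Python. This is an illustration of a merge join over two sorted lists, not the actual retrieval script: the row layout `(ngram, waveform, start, end, score)` and the function name `join_sorted` are assumptions, and whether a score exactly at the threshold counts as YES is also assumed.

```python
def join_sorted(index_rows, terms, threshold=0.5):
    """Merge-join index rows (sorted by N-gram string) with a term list.

    index_rows: iterable of (ngram, waveform, start, end, score) tuples.
    Returns matching rows with a YES/NO decision appended.
    """
    results = []
    term_iter = iter(sorted(set(terms)))
    term = next(term_iter, None)
    for row in index_rows:
        ngram = row[0]
        # Advance the term list until it catches up with the index.
        while term is not None and term < ngram:
            term = next(term_iter, None)
        if term is None:
            break
        if term == ngram:
            decision = "YES" if row[-1] >= threshold else "NO"
            results.append((*row, decision))
    return results


index_rows = [
    ("new york", "w1", 1.0, 1.5, 0.7),
    ("new york", "w2", 3.0, 3.4, 0.2),
    ("york", "w1", 1.2, 1.5, 0.8),
]
matches = join_sorted(index_rows, ["york city", "new york"])
decisions = [m[-1] for m in matches]
```

Both lists are traversed once, so the cost is dominated by the pass over the index, consistent with the observation that run time depends mainly on index size.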
Time and Memory Requirements
• The system was run on a 3.4 GHz Intel hyperthreading CPU with 3 GB of memory
• Both index size and search time can be significantly reduced if we keep only candidates with high posterior
• STT runtimes were incorrectly measured in submitted sysdesc.
                              BN (3h)         CTS (3h)        Confmtg (2h)
STT run time                  58560 s         26760 s         40440 s
Index size (# terms / MB)     944K / 74 MB    602K / 37 MB    530K / 37 MB

Reported as single values:
Indexing, index from lattice        493 s
Indexing, NNet run time             2711 s
Search time needed for all terms    13 s
STD Results
• Extra dev consists of RT02, RT03 (BN+CTS), dev04 (CTS+ConfMtg), RT04s (ConfMtg)
• Difficult to debug eval06 (no references were given), but the result on meetings seems much lower than on dev sets.
• Possibly overtrained neural net on meetings condition.
Each cell is Occ.WV / ATWV.

Source    Score mapper   Thres.   dev06          dryrun06       Extra dev
BN        No NNet        0.3      0.914/0.850    0.906/0.802    0.887/0.801
BN        With NNet      0.5      0.914/0.865    0.905/0.818    0.889/0.817
CTS       No NNet        0.3      0.881/0.692    0.860/0.615    0.792/0.681
CTS       With NNet      0.5      0.881/0.714    0.860/0.660    0.800/0.712
Confmtg   No NNet        0.3      0.585/0.275    0.566/0.205    0.631/0.462
Confmtg   With NNet      0.5      0.515/0.427    0.491/0.358    0.536/0.461
All       No NNet        0.3      0.821/0.787    0.802/0.700    0.790/0.687
All       With NNet      0.5      0.804/0.817    0.782/0.739    0.784/0.718

eval06 (ATWV only, Occ.WV not available): BN --- / 0.824, CTS --- / 0.665, Confmtg --- / 0.255
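For readers unfamiliar with the metric, the following Python sketch shows how a term-weighted value of this kind is computed. It follows the NIST STD 2006 definition as commonly stated (per-term miss and false-alarm probabilities, one non-target trial per second of speech, weight β = 999.9); these details are not given in the slides and should be checked against the evaluation plan. The function name `atwv` and the count layout are hypothetical.

```python
def atwv(counts, speech_seconds, beta=999.9):
    """Term-weighted value at the system's actual decision threshold.

    counts: dict mapping term -> (n_true, n_correct, n_spurious).
    speech_seconds: total duration of the searched speech, in seconds.
    """
    losses = []
    for n_true, n_correct, n_spurious in counts.values():
        if n_true == 0:
            continue  # miss probability is undefined without reference hits
        p_miss = 1.0 - n_correct / n_true
        # Assumption: one non-target trial per second of speech.
        p_false_alarm = n_spurious / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_false_alarm)
    # Every term counts equally, regardless of how often it occurs.
    return 1.0 - sum(losses) / len(losses)


counts = {
    "frequent term": (10, 8, 0),   # 8 of 10 occurrences found
    "rare term": (1, 0, 0),        # single occurrence missed
}
value = atwv(counts, speech_seconds=3600.0)
```

In this toy case 8 of 11 occurrences are found, yet the value is only about 0.4 because the single missed rare term weighs as much as the frequent one. That per-term averaging is why a score mapper tuned on raw occurrences can help the occurrence-weighted value without helping ATWV.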
Future Work
• Current system does not cover detection of terms with OOVs. Possible approaches:
  – Map the unknown search terms to the known vocabulary (OGI work, gave about 2-3% improvement on BNews).
  – Use of phone recognition and phone-based indexing for OOVs
  – Hybrid word+graphone recognizer outputs both words and “graphone” units that can match OOVs (Bisani & Ney 2005)
• Improve the score mapper
  – Bigger devset needed to avoid overtraining
  – Other models (decision tree, logistic regression)
• Found some mismatch between ASR vocabulary and term lists. Apply normalization rules to fix common problems (found about 0.3% relative improvement with a few simple rules)
• Tune STT systems for indexing speed