BUT2012
Brno University of Technology
Faculty of Information Technology
Speech@FIT
Igor Szöke, Michal Fapšo, Karel Veselý
MediaEval 2012 workshop – SWS task, October 4.-5. 2012, Pisa
MediaEval SWS 2012 workshop - 4.-5.10. Pisa
Outline
Systems overview & Underlying technologies
PhnRec, R-AKWS, AKWS – primary system
DTW
(GMM/HMM) – not submitted
Calibration
Results and discussion
System overview
Our internal task was to build a simple, minimalistic, language-dependent Query-by-Example (QbE) system.
Ingredients
Development data, Neural net classifier, Phoneme recognizer, Acoustic keyword spotting, DTW, Calibration
System overview
Sentence mean normalization
Neural network based features
bottle-necks
three state phone posteriors
Query detector
AKWS
DTW
(GMM/HMM) – not submitted to the evals
Detector     Bottle-necks   Posteriors
AKWS         -              X
DTW          X              X
(GMM/HMM)    X              -
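The sentence mean normalization step listed on this slide can be illustrated with a minimal sketch. It assumes the features arrive as a NumPy array with one row per frame; the function name and the toy values are hypothetical and not taken from the BUT system.

```python
import numpy as np

def sentence_mean_normalize(features: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean from every frame.

    features: array of shape (num_frames, feature_dim) for ONE utterance.
    """
    return features - features.mean(axis=0, keepdims=True)

# Toy utterance: 4 frames of 3-dimensional features (made-up values).
utt = np.array([[1.0, 2.0, 3.0],
                [2.0, 2.0, 5.0],
                [3.0, 2.0, 7.0],
                [2.0, 2.0, 5.0]])
normalized = sentence_mean_normalize(utt)
```

After normalization, each feature dimension has zero mean over the utterance, which removes a constant per-sentence offset before the features reach the neural network.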
Underlying technologies
Universal context, bottle-neck neural network based classifier
devC state re-alignment, reduced phone set (50 phonemes)
Trained by Tnet – our tool, publicly available
PhnRec, R-AKWS, AKWS
Phoneme recognizer - free phone loop, 66.02% PAC on devC
R-AKWS - queries extracted from the phone alignment
AKWS - queries extracted from the phoneme recognizer

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600

devQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.653   0.703       0.789
AKWS      0.377   0.429       0.552
DTW
Used as a baseline. Bottle-necks are better than posteriors.

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552

evalQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    -       -           -
AKWS      0.470   0.530       0.672
DTW       0.426   0.488       0.599
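For readers unfamiliar with DTW-based query-by-example, the following is a minimal sketch of subsequence DTW over frame-level feature vectors such as bottle-necks or posteriors. The slides do not state which local distance, path constraints or normalization were used, so the cosine distance, the three standard step moves and the division by query length are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(query: np.ndarray, utterance: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between query frames and utterance frames."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    return 1.0 - q @ u.T

def subsequence_dtw(query: np.ndarray, utterance: np.ndarray):
    """Find the best match of the whole query anywhere inside the utterance.

    Returns (cost, end_frame), where cost is the accumulated distance
    divided by the query length (a crude normalization, assumed here).
    """
    dist = cosine_distance_matrix(query, utterance)
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]  # the match may start at any utterance frame
    for i in range(1, n):
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        for j in range(1, m):
            acc[i, j] = dist[i, j] + min(acc[i - 1, j],      # query advances
                                         acc[i, j - 1],      # utterance advances
                                         acc[i - 1, j - 1])  # both advance
    end_frame = int(np.argmin(acc[-1]))  # the match may end at any frame
    return acc[-1, end_frame] / n, end_frame

# Toy data (made up): the 3-frame query is embedded at frames 2..4.
query = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
utterance = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
cost, end_frame = subsequence_dtw(query, utterance)
```

A lower cost means a better match; a real detector would turn the cost into a detection score and keep several non-overlapping candidates per utterance rather than only the single best one.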
GMM/HMM
Inspired by AKWS; not submitted due to bad results.

          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552
GMM/HMM   0.011   -           0.336
Calibration
TWV - pooled; UBTWV - non-pooled TWV (each term has its own best threshold)
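The difference between a pooled threshold and per-term thresholds can be shown with a small sketch. It uses the usual term-weighted value form, 1 - (P_miss + beta * P_FA) averaged over terms; the data layout, the function names, and the values of `beta` and `n_nontarget` are illustrative assumptions, not the MediaEval SWS 2012 scoring settings.

```python
from statistics import mean

def term_twv(detections, n_true, n_nontarget, beta, threshold):
    """TWV of one term at a threshold: 1 - (P_miss + beta * P_FA).

    detections: list of (score, is_hit) pairs for this term.
    """
    hits = sum(1 for score, is_hit in detections if score >= threshold and is_hit)
    false_alarms = sum(1 for score, is_hit in detections
                       if score >= threshold and not is_hit)
    p_miss = 1.0 - hits / n_true
    p_fa = false_alarms / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)

def candidate_thresholds(detection_lists):
    """All observed scores, plus +inf (reject everything)."""
    scores = {score for dets in detection_lists for score, _ in dets}
    return sorted(scores) + [float("inf")]

def mtwv(terms, n_nontarget, beta):
    """Pooled: ONE threshold shared by all terms, chosen to maximize mean TWV."""
    thresholds = candidate_thresholds([dets for dets, _ in terms.values()])
    return max(
        mean([term_twv(dets, n_true, n_nontarget, beta, thr)
              for dets, n_true in terms.values()])
        for thr in thresholds
    )

def ubtwv(terms, n_nontarget, beta):
    """Non-pooled upper bound: every term gets its own best threshold."""
    return mean([
        max(term_twv(dets, n_true, n_nontarget, beta, thr)
            for thr in candidate_thresholds([dets]))
        for dets, n_true in terms.values()
    ])

# Toy data (made up): two terms whose scores live in different ranges.
terms = {
    "term_a": ([(0.9, True), (0.8, False)], 1),
    "term_b": ([(0.3, True), (0.2, False)], 1),
}
pooled = mtwv(terms, n_nontarget=10, beta=5.0)
per_term = ubtwv(terms, n_nontarget=10, beta=5.0)
```

Because the two terms score in different ranges, no single threshold separates hits from false alarms for both, so the pooled value falls below the per-term upper bound. Calibration aims to shift each term's scores so that one shared threshold works for all of them.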
Calibration of scores: a linear combination of 12 parameters (6 features, each in linear and quadratic form), trained on UBTWV thresholds. The 6 features:
Query length (w/o outer sil)
Length of inner sil
Score average global
Score average by phonemes
Phonemes count
Detections count
After the evals, we found that Detections count and Length of inner sil work best for AKWS.
Parameter                      Training error AKWS   Training error DTW
Detections count               0.1272                0.002115
Length of inner sil            0.1577                0.002687
Query length (w/o outer sil)   0.1626                0.002773
Score average global           0.1635                0.002530
Phonemes count                 0.1656                0.002779
Score average by phonemes      0.1660                0.002746
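One way to read "6 features with linear and quadratic forms, trained on UBTWV thresholds" is as a regression that predicts each query's best threshold from 12 inputs. The sketch below follows that reading. The least-squares fit, the absence of an intercept (to keep exactly 12 parameters) and the subtraction of the predicted threshold from the raw score are all assumptions, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def expand(features: np.ndarray) -> np.ndarray:
    """Stack linear and quadratic forms: (n_queries, 6) -> (n_queries, 12)."""
    return np.hstack([features, features ** 2])

def fit_calibration(features: np.ndarray, best_thresholds: np.ndarray) -> np.ndarray:
    """Least-squares fit of 12 weights to the per-query best thresholds."""
    weights, *_ = np.linalg.lstsq(expand(features), best_thresholds, rcond=None)
    return weights

def calibrate(scores: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Shift each score by its query's predicted threshold (assumed usage)."""
    return scores - expand(features) @ weights

# Synthetic data (made up): 50 queries, 6 features each, with thresholds
# generated from known weights so the fit can be checked.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 6))
true_weights = rng.normal(size=12)
best_thresholds = expand(features) @ true_weights

weights = fit_calibration(features, best_thresholds)
calibrated = calibrate(best_thresholds, features, weights)
```

With this convention, a detection scoring exactly at its query's predicted threshold maps to 0, so a single global threshold near 0 can be applied to all queries.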
Calibration AKWS
Conclusion
• AKWS with new calibration (submitted results in brackets)
• Good and consistent data, enough to train a good PhnRec
• GMM/HMM does not perform well in the in-language condition with 1 example per query (it was our best system last year)
• Number of detections is an important calibration feature (due to TWV)
• Future work: detections calibration, system fusion

        devQ-devC                                  evalQ-evalC
        ATWV            MTWV            UBTWV      ATWV            MTWV            UBTWV
AKWS    0.488 (0.488)   0.502 (0.452)   0.600      0.522 (0.492)   0.553 (0.530)   0.672
DTW     0.443           0.468           0.552      0.448           0.488           0.599
Like / Dislike / Next evals?
Like:
Adapted TWV, real KWS scoring
Phone alignment provided
Good data, great work of organizers
"Dislike":
No test data alignment
No speaker information
Next evals:
More examples per query?
Provide query and the query sentence (adaptation issue)?
Non-pooled scoring metric?
We would like to share our features – more on poster session
Thank you for your attention.