BUT2012 APPROACHES FOR SPOKEN WEB SEARCH - MEDIAEVAL 2012

Brno University of Technology

Faculty of Information Technology, Speech@FIT

Igor Szöke, Michal Fapšo, Karel Veselý

MediaEval 2012 workshop – SWS task, October 4.-5. 2012, Pisa


Outline

Systems overview & Underlying technologies

PhnRec, R-AKWS, AKWS – primary system

DTW

(GMM/HMM) – not submitted

Calibration

Results and discussion



System overview

Our internal task was to build a simple, minimalistic, language-dependent Query-by-Example (QbE) system.

Ingredients

Development data, Neural net classifier, Phoneme recognizer, Acoustic keyword spotting, DTW, Calibration



System overview

Sentence mean normalization

Neural network based features

bottle-necks

three state phone posteriors

Query detector

AKWS

DTW

(GMM/HMM) – not submitted to the evals

             Bottle-Neck   Posteriors
AKWS         -             X
DTW          X             X
(GMM/HMM)    X             -
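Sentence mean normalization, the first step listed above, can be sketched as follows. This is a minimal illustration assuming that each utterance is a list of equal-length feature frames; the function name `sentence_mean_normalize` and the toy data are hypothetical and not taken from the BUT system.

```python
def sentence_mean_normalize(frames):
    """Subtract the per-utterance mean from every feature frame.

    `frames` is a list of equal-length lists of floats, one per frame.
    (Assumed data layout, chosen only for this illustration.)
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension over the whole utterance.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Remove that mean from every frame.
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Toy utterance: 3 frames, 2 feature dimensions.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = sentence_mean_normalize(utterance)
print(normalized)
```

After normalization every feature dimension has zero mean over the utterance, which removes constant channel offsets before the neural network sees the features.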



Underlying technologies

Universal context, bottle-neck neural network based classifier

devC state re-alignment, Reduced phone set (50 phonemes)

Trained by Tnet – our tool, publicly available



Phnrec, R-AKWS, AKWS

Phoneme recognizer - free phone loop, devC 66.02% PAC

R-AKWS - Queries extracted from phone alignment

AKWS - Queries extracted from phone recognizer

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600

devQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.653   0.703       0.789
AKWS      0.377   0.429       0.552



DTW

Used as a baseline. Bottle-necks are better than posteriors.

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552

evalQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    -       -           -
AKWS      0.470   0.530       0.672
DTW       0.426   0.488       0.599
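For readers unfamiliar with DTW-based query detection, the following is a minimal sketch of subsequence DTW, in which the query may start and end anywhere in the utterance. The Euclidean frame distance, the three-way step pattern, the normalization by query length and all function names are assumptions made for illustration; the slides do not specify these details of the BUT system.

```python
import math


def euclidean(a, b):
    """Frame-to-frame distance (assumed here; other distances are possible)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, utterance, dist=euclidean):
    """Find the best match of `query` inside `utterance`.

    Returns (cost normalized by query length, start frame, end frame).
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    # cost[i][j]: best accumulated cost of aligning query[0..i]
    # with a path that ends at utterance[j].
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: utterance frame where that best path began.
    start = [[0] * m for _ in range(n)]

    # Free start: the first query frame may match any utterance frame.
    for j in range(m):
        cost[0][j] = dist(query[0], utterance[j])
        start[0][j] = j

    for i in range(1, n):
        cost[i][0] = cost[i - 1][0] + dist(query[i], utterance[0])
        start[i][0] = start[i - 1][0]
        for j in range(1, m):
            best_cost, best_start = min(
                (cost[i - 1][j - 1], start[i - 1][j - 1]),  # diagonal
                (cost[i - 1][j], start[i - 1][j]),          # query advances
                (cost[i][j - 1], start[i][j - 1]),          # utterance advances
            )
            cost[i][j] = best_cost + dist(query[i], utterance[j])
            start[i][j] = best_start

    # Free end: pick the cheapest end frame in the last query row.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end] / n, start[n - 1][end], end


# Toy 1-dimensional example: the query occurs at frames 2..4.
query = [[0.0], [1.0], [2.0]]
utterance = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
result = subsequence_dtw(query, utterance)
print(result)
```

In a real system the frames would be bottle-neck features or phone posteriors, and every sufficiently low-cost end point would be reported as a detection rather than only the single best one.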



GMM/HMM

Inspired by AKWS, not submitted due to bad results.

          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552
GMM/HMM   0.011   -           0.336



Calibration

TWV - pooled, UBTWV - non-pooled TWV (each term has its best threshold)
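The difference between a pooled threshold and per-term thresholds can be illustrated with a small sketch. It uses the standard term-weighted value form, 1 - (P_miss + beta * P_FA) averaged over terms. The value of `beta`, the number of non-target trials, the data layout and all function names below are assumptions chosen for the example, not the settings of the evaluation.

```python
def term_value(detections, n_true, n_nontarget, threshold, beta):
    """1 - (P_miss + beta * P_FA) for one term at one threshold.

    `detections` is a list of (score, is_hit) pairs (assumed layout).
    """
    hits = sum(1 for s, hit in detections if s >= threshold and hit)
    false_alarms = sum(1 for s, hit in detections if s >= threshold and not hit)
    p_miss = 1.0 - hits / n_true
    p_fa = false_alarms / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)


def pooled_max_twv(terms, n_nontarget, beta):
    """Best TWV reachable with ONE threshold shared by all terms (MTWV)."""
    thresholds = sorted({s for dets, _ in terms.values() for s, _ in dets})
    thresholds.append(float("inf"))  # "accept nothing" is also an option
    return max(
        sum(term_value(d, n, n_nontarget, t, beta) for d, n in terms.values())
        / len(terms)
        for t in thresholds
    )


def upper_bound_twv(terms, n_nontarget, beta):
    """TWV when every term gets its own best threshold (UBTWV)."""
    total = 0.0
    for dets, n_true in terms.values():
        thresholds = [s for s, _ in dets] + [float("inf")]
        total += max(
            term_value(dets, n_true, n_nontarget, t, beta) for t in thresholds
        )
    return total / len(terms)


# Two toy terms whose scores live on different scales.
# Each entry: (list of (score, is_hit), number of true occurrences).
terms = {
    "a": ([(0.9, True), (0.8, False)], 1),
    "b": ([(0.5, True), (0.4, False)], 1),
}
mtwv = pooled_max_twv(terms, n_nontarget=100, beta=10.0)
ubtwv = upper_bound_twv(terms, n_nontarget=100, beta=10.0)
print(round(mtwv, 3), round(ubtwv, 3))
```

In this toy case the shared threshold has to accept a false alarm of term "a" in order to catch the hit of term "b", so the pooled value (about 0.95) stays below the per-term upper bound (1.0). Calibration aims to close that gap by moving the scores of all terms onto a common scale.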

Calibration of scores (linear combination of 12 parameters - 6 features with linear and quadratic forms). Trained on UBTWV thresholds.

Query length (w/o outer sil), Length of inner sil,

Score average global, Score average by phonemes

Phonemes count, Detections count

We found (after the evals) that Detections count and Length of inner sil work best for AKWS.

Parameter                       Training error AKWS   Training error DTW
Detections count                0.1272                0.002115
Length of inner sil             0.1577                0.002687
Query length (w/o outer sil)    0.1626                0.002773
Score average global            0.1635                0.002530
Phonemes count                  0.1656                0.002779
Score average by phonemes       0.1660                0.002746
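The "12 parameters" of the calibration can be read as one weight for each of the 6 features and one for each of their squares. The sketch below shows that expansion and one possible way to apply it: the weighted sum is treated as a predicted per-query threshold and subtracted from the raw detection score. How the BUT system actually applies the combination is not stated on the slides, so the subtraction, the feature keys, the weights and the function names are all assumptions.

```python
# The six calibration features named on the slide (keys are hypothetical).
FEATURE_NAMES = [
    "query_length",           # without outer silence
    "inner_sil_length",
    "score_avg_global",
    "score_avg_by_phonemes",
    "phonemes_count",
    "detections_count",
]


def expand(features):
    """Return the 12-dimensional vector: 6 linear terms, then 6 squares."""
    linear = [features[name] for name in FEATURE_NAMES]
    quadratic = [v * v for v in linear]
    return linear + quadratic


def predicted_threshold(features, weights):
    """Linear combination of the 12 expanded terms (no bias term assumed)."""
    phi = expand(features)
    if len(weights) != len(phi):
        raise ValueError("expected one weight per expanded feature")
    return sum(w * x for w, x in zip(weights, phi))


def calibrate(raw_score, features, weights):
    """Shift a raw score by the predicted per-query threshold (assumption)."""
    return raw_score - predicted_threshold(features, weights)


# Toy query: only the detections count is given a non-zero weight here.
example = {name: 0.0 for name in FEATURE_NAMES}
example["detections_count"] = 2.0
weights = [0.0] * 12
weights[5] = 0.5    # linear weight of detections_count (made-up value)
weights[11] = 0.25  # quadratic weight of detections_count (made-up value)

threshold = predicted_threshold(example, weights)
print(threshold, calibrate(3.0, example, weights))
```

In practice the weights would be fitted, for example by least squares, so that the prediction approximates the per-term thresholds that give UBTWV on the development data.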



Calibration AKWS

Conclusion

• AKWS with new calibration (submitted in brackets)

• Good and consistent data, enough to train a good Phnrec

• GMM/HMM does not perform well on the in-language condition and 1 example per query (our best system last year)

• Number of detections is an important calibration feature (due to TWV)

• Future work: detections calibration, system fusion

        devQ-devC                               evalQ-evalC
        ATWV           MTWV           UBTWV     ATWV           MTWV           UBTWV
AKWS    0.488(0.488)   0.502(0.452)   0.600     0.522(0.492)   0.553(0.530)   0.672
DTW     0.443          0.468          0.552     0.448          0.488          0.599



Like / Dislike / Next evals?

Like:

Adapted TWV, real KWS scoring

Phone alignment provided

Good data, great work of organizers

"Dislike":

No test data alignment

No speaker information

Next evals:

More examples per query?

Provide query and the query sentence (adaptation issue)?

Non-pooled scoring metric?

We would like to share our features – more on poster session



Thank you for your attention.