BUT2012
Brno University of Technology, Faculty of Information Technology, Speech@FIT
Igor Szöke, Michal Fapšo, Karel Veselý
MediaEval 2012 workshop – SWS task, October 4.-5. 2012, Pisa

BUT2012 APPROACHES FOR SPOKEN WEB SEARCH - MEDIAEVAL 2012




MediaEval SWS 2012 workshop - 4.-5.10. Pisa


Outline

Systems overview & Underlying technologies

PhnRec, R-AKWS, AKWS – primary system

DTW

(GMM/HMM) – not submitted

Calibration

Results and discussion


System overview

Our internal task was to build a simple, minimalistic, language-dependent Query-by-Example (QbE) system.

Ingredients

Development data, Neural net classifier, Phoneme recognizer, Acoustic keyword spotting, DTW, Calibration


System overview

Sentence mean normalization

Neural network based features
• bottle-necks
• three-state phone posteriors

Query detector
• AKWS
• DTW
• (GMM/HMM) – not submitted to the evals

Detector    Bottle-Neck   Posteriors
AKWS        -             X
DTW         X             X
(GMM/HMM)   X             -
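The sentence mean normalization step listed above can be illustrated with a minimal sketch. It assumes the usual meaning of the term, i.e. subtracting the per-utterance mean of each feature dimension from every frame; the function name and the toy data are hypothetical and are not taken from the BUT system.

```python
from typing import List


def sentence_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean of each feature dimension from every frame.

    `frames` holds one feature vector per frame (hypothetical layout).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension over the whole utterance ("sentence").
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy utterance with three frames and two feature dimensions.
utterance = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
normalized = sentence_mean_normalize(utterance)
```

After normalization every feature dimension has zero mean over the utterance, which removes constant per-recording offsets before the features reach the neural network.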


Underlying technologies

Universal context, bottle-neck neural network based classifier

devC state re-alignment, reduced phone set (50 phonemes)

Trained with Tnet, our publicly available tool


Phnrec, R-AKWS, AKWS

Phoneme recognizer - free phone loop, devC 66.02% PAC

R-AKWS - Queries extracted from phone alignment

AKWS - Queries extracted from phone recognizer

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600

devQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.653   0.703       0.789
AKWS      0.377   0.429       0.552


DTW

Used as a baseline. Bottle-neck features work better than posteriors.

devQ - devC
          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552

evalQ - evalC
          MTWV    MTWVcalib   UBTWV
R-AKWS    -       -           -
AKWS      0.470   0.530       0.672
DTW       0.426   0.488       0.599
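For readers unfamiliar with DTW-based query-by-example, the following is a minimal sketch of textbook subsequence DTW, which lets the query start and end anywhere in the utterance. It is not the BUT implementation: the Euclidean local distance, the step pattern, the normalization by query length and all names are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Local frame distance (assumed; the slides do not name the distance used)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(
    query: Sequence[Sequence[float]],
    utterance: Sequence[Sequence[float]],
) -> Tuple[float, int]:
    """Return (length-normalized cost, 0-based end frame) of the best match
    of `query` anywhere inside `utterance`."""
    n, m = len(query), len(utterance)
    if n == 0 or m == 0:
        raise ValueError("query and utterance must be non-empty")
    # Row 0 is all zeros: the match may start at any utterance frame for free.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        # Column 0 stays infinite: query frames cannot align to "no frame".
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            d = euclidean(query[i - 1], utterance[j - 1])
            # Steps allowed: vertical, horizontal, diagonal.
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    # The match may end at any utterance frame: take the cheapest end point.
    best_end = min(range(1, m + 1), key=lambda j: prev[j])
    return prev[best_end] / n, best_end - 1


# Toy one-dimensional features: the query occurs at utterance frames 2..4.
query = [[0.0], [1.0], [2.0]]
utterance = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
cost, end_frame = subsequence_dtw(query, utterance)
```

In a detector, the normalized cost would be turned into a score and thresholded; with bottle-neck or posterior features the frames would be the corresponding feature vectors rather than the scalars used here.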


GMM/HMM

Inspired by AKWS, not submitted due to bad results.

          MTWV    MTWVcalib   UBTWV
R-AKWS    0.739   0.786       0.859
AKWS      0.452   0.493       0.600
DTW       0.400   0.468       0.552
GMM/HMM   0.011   -           0.336


Calibration

TWV - pooled; UBTWV - non-pooled TWV (each term has its own best threshold)
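The difference between a pooled and a non-pooled threshold can be made concrete with a small sketch. It assumes the standard term-weighted value, 1 - mean over terms of (P_miss + beta * P_FA), with one threshold shared by all terms for the pooled maximum (MTWV) and a separate best threshold per term for UBTWV. The value of beta, the shared non-target trial count and the toy detections are illustrative assumptions, not values from the evaluation.

```python
import math
from typing import Dict, List, Tuple

Detection = Tuple[float, bool]          # (score, is_correct)
TermData = Tuple[List[Detection], int]  # (detections, number of true occurrences)


def term_value(dets: List[Detection], n_true: int, n_nontarget: int,
               threshold: float, beta: float) -> float:
    """Value of a single term at a given threshold: 1 - (P_miss + beta * P_FA)."""
    accepted = [ok for score, ok in dets if score >= threshold]
    hits = sum(accepted)
    false_alarms = len(accepted) - hits
    p_miss = 1.0 - hits / n_true
    p_fa = false_alarms / n_nontarget
    return 1.0 - (p_miss + beta * p_fa)


def thresholds_for(dets: List[Detection]) -> List[float]:
    """Candidate thresholds: every observed score, plus +inf (accept nothing)."""
    return sorted({score for score, _ in dets}) + [math.inf]


def mtwv(terms: Dict[str, TermData], n_nontarget: int, beta: float) -> float:
    """Pooled: one threshold shared by all terms, chosen to maximize the mean."""
    pooled = [d for dets, _ in terms.values() for d in dets]
    return max(
        sum(term_value(dets, n_true, n_nontarget, th, beta)
            for dets, n_true in terms.values()) / len(terms)
        for th in thresholds_for(pooled)
    )


def ubtwv(terms: Dict[str, TermData], n_nontarget: int, beta: float) -> float:
    """Non-pooled: every term gets its own best threshold."""
    return sum(
        max(term_value(dets, n_true, n_nontarget, th, beta)
            for th in thresholds_for(dets))
        for dets, n_true in terms.values()
    ) / len(terms)


# Two toy terms whose scores live on different scales.
toy_terms = {
    "a": ([(0.9, True), (0.8, False)], 1),
    "b": ([(0.3, True), (0.2, False)], 1),
}
pooled_best = mtwv(toy_terms, n_nontarget=10, beta=1.0)
per_term_best = ubtwv(toy_terms, n_nontarget=10, beta=1.0)
```

Because the two toy terms score on different scales, no single threshold is ideal for both, so the pooled value stays below the per-term upper bound. Closing that gap is what score calibration aims for.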

Calibration of scores (linear combination of 12 parameters - 6 features with linear and quadratic forms). Trained on UBTWV thresholds.

Query length (w/o outer sil), Length of inner sil,

Score average global, Score average by phonemes

Phonemes count, Detections count

We found (after the evals) that Detections count and Length of inner sil work best for AKWS.

Parameter                      Training error AKWS   Training error DTW
Detections count               0.1272                0.002115
Length of inner sil            0.1577                0.002687
Query length (w/o outer sil)   0.1626                0.002773
Score average global           0.1635                0.002530
Phonemes count                 0.1656                0.002779
Score average by phonemes      0.1660                0.002746
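The calibration described on this slide (a linear combination of 12 parameters built from 6 features in linear and quadratic form) can be sketched as follows. This is one reading of the slide, not the authors' code: it assumes the model predicts a per-query threshold that is subtracted from the raw detection score, that there is no bias term (since 6 + 6 = 12), and it uses made-up weights. Fitting the weights to UBTWV thresholds is not shown.

```python
from typing import List, Sequence

# Feature order is an arbitrary choice for this sketch.
FEATURE_NAMES = (
    "query_length",           # without outer silence
    "inner_sil_length",
    "score_avg_global",
    "score_avg_by_phonemes",
    "phoneme_count",
    "detection_count",
)


def expand_features(features: Sequence[float]) -> List[float]:
    """Turn 6 features into 12 regressors: each feature and its square."""
    if len(features) != len(FEATURE_NAMES):
        raise ValueError("expected 6 features")
    return list(features) + [f * f for f in features]


def predict_threshold(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear combination of the 12 regressors (no bias term assumed)."""
    if len(weights) != 2 * len(FEATURE_NAMES):
        raise ValueError("expected 12 weights")
    return sum(w * x for w, x in zip(weights, expand_features(features)))


def calibrate(raw_score: float, features: Sequence[float],
              weights: Sequence[float]) -> float:
    """Shift the raw score by the predicted per-query threshold (assumed usage)."""
    return raw_score - predict_threshold(features, weights)


# Hypothetical weights: only detection_count contributes (linear 0.5, quadratic 0.25).
weights = [0.0] * 12
weights[5] = 0.5
weights[11] = 0.25
query_features = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0]
calibrated = calibrate(3.0, query_features, weights)
```

With scores shifted this way, a single global threshold behaves more like a per-query one, which is the motivation for training the combination on UBTWV thresholds.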


Calibration AKWS


Conclusion

• AKWS with new calibration (submitted results in brackets)
• Good and consistent data, enough to train a good Phnrec
• GMM/HMM does not perform well on the in-language condition with 1 example per query (it was our best system last year)
• Number of detections is an important calibration feature (due to TWV)
• Future work: detections calibration, system fusion

               devQ-devC                               evalQ-evalC
       ATWV           MTWV           UBTWV     ATWV           MTWV           UBTWV
AKWS   0.488 (0.488)  0.502 (0.452)  0.600     0.522 (0.492)  0.553 (0.530)  0.672
DTW    0.443          0.468          0.552     0.448          0.488          0.599


Like / Dislike / Next evals?

Like:

Adapted TWV, real KWS scoring

Phone alignment provided

Good data, great work of organizers

"Dislike":

No test data alignment

No speaker information

Next evals:

More examples per query?

Provide query and the query sentence (adaptation issue)?

Non-pooled scoring metric?

We would like to share our features; more at the poster session.


Thank you for your attention.