Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs
Andrew Rosenberg
Queens College / CUNY
Interspeech 2013, August 26, 2013


Prosody
- Prosody: pitch, intensity, rhythm, silence.
- Prosody carries information about a speaker's intent and identity.
- Here: prosodic recognition of speaking style, nativeness, and speaker.

Approach
- Unsupervised clustering of acoustic/prosodic features.
- Sequence modeling of cluster identities.

K-Means
- K-means is a simple distance-based clustering algorithm.
- Iterative and non-deterministic (sensitive to initialization).
- Must specify K; we evaluate K between 2 and 100, taking the optimal value from cross-validation for each task.

Dirichlet Process GMMs
- Non-parametric infinite mixture model.
- Needs a concentration prior α for the Dirichlet process and a base distribution G0 over component parameters (a zero-mean Gaussian).
- So hyperparameters still need to be set: α and G0.
- Stick-breaking and Chinese Restaurant Process metaphors.
- Inference: variational inference (Blei and Jordan, 2005).
- "Rich get richer" behavior. (Both clustering steps are sketched in code below.)
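For concreteness, both clustering steps can be reproduced with scikit-learn, whose BayesianGaussianMixture implements a truncated stick-breaking approximation to the DPGMM. This is a minimal sketch, not the talk's code: the matrix X stands in for the real 7-dimensional syllable features described later, and all hyperparameter values are illustrative.

    # Minimal sketch: cluster syllable-level prosodic features with k-means
    # and a (truncated) Dirichlet process GMM. X is a placeholder for the
    # real (n_syllables x 7) feature matrix.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import BayesianGaussianMixture

    X = np.random.RandomState(0).randn(1000, 7)

    # K-means: K must be specified; the talk sweeps K from 2 to 100 and
    # picks the best value by cross-validation per task.
    km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

    # DPGMM: n_components is only a truncation level (an upper bound on the
    # number of active clusters); alpha (weight_concentration_prior) and the
    # base distribution still have to be set, as the talk notes.
    dpgmm = BayesianGaussianMixture(
        n_components=100,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.25,  # alpha; illustrative value
        max_iter=500,
        random_state=0,
    ).fit(X)
    dp_labels = dpgmm.predict(X)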

[Figure: plate notation for the DPGMM, from M. Jordan's 2005 NIPS tutorial]

DPGMM: Rich Get Richer
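The rich-get-richer effect falls directly out of the Chinese Restaurant Process metaphor: each new point joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to α. A small illustrative simulation (not from the slides):

    # CRP simulation: larger clusters attract new points preferentially,
    # so a few clusters dominate ("rich get richer"); alpha controls how
    # often new clusters are opened.
    import numpy as np

    def crp_cluster_sizes(n_points, alpha, seed=0):
        rng = np.random.default_rng(seed)
        sizes = []  # sizes[k] = number of points in cluster k
        for _ in range(n_points):
            probs = np.array(sizes + [alpha], dtype=float)
            probs /= probs.sum()
            k = rng.choice(len(probs), p=probs)
            if k == len(sizes):
                sizes.append(1)   # open a new cluster
            else:
                sizes[k] += 1     # join an existing one
        return sorted(sizes, reverse=True)

    print(crp_cluster_sizes(1000, alpha=0.25))
    # Typically one or two very large clusters plus a long tail of tiny ones.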

Artificially omit the largest cluster (α = 0.25); one possible realization is sketched in code below.

Prosodic Event Distribution
- ToBI prosodic labels: pitch accents and phrase accent/boundary tones.
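One plausible realization of the "omit the largest cluster" variant mentioned above (a sketch under the assumption that the variant simply drops tokens from the most populous DPGMM cluster before sequence modeling):

    # Sketch: drop all tokens belonging to the most populous cluster.
    import numpy as np

    def drop_largest_cluster(labels):
        values, counts = np.unique(labels, return_counts=True)
        largest = values[np.argmax(counts)]
        return labels[labels != largest]

    labels = np.array([0, 0, 0, 1, 2, 0, 1])
    print(drop_largest_cluster(labels))  # [1 2 1]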

[Figures: accent type distribution; phrase-ending distribution]

Sequence Modeling
- SRILM 3-gram model with backoff and Good-Turing smoothing.
- Clusters learned over all material.
- Sequence models trained over train sets only (see the SRILM sketch below).
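In practice, each utterance becomes a line of space-separated cluster-ID tokens, and SRILM builds the trigram model from those files. A hedged sketch of the training step (SRILM's ngram-count applies Good-Turing discounting with backoff by default; all file names are placeholders):

    # Sketch: write cluster-ID sequences and train a 3-gram LM with SRILM.
    import subprocess

    def write_sequences(label_seqs, path):
        # One utterance per line, e.g. "c3 c7 c7 c1".
        with open(path, "w") as f:
            for seq in label_seqs:
                f.write(" ".join(f"c{k}" for k in seq) + "\n")

    write_sequences([[3, 7, 7, 1], [2, 2, 5]], "train.txt")
    subprocess.run(
        ["ngram-count", "-order", "3", "-text", "train.txt", "-lm", "style.lm"],
        check=True,
    )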

Experiments
- Tasks: speaking style, nativeness, and speaker recognition.
- Evaluation: 500 samples of 10-100 syllables (~2-20 seconds).
- Representations: ToBI, K-means, DPGMM, and DPGMM with the largest cluster removed.
- 5-fold cross-validation to learn hyperparameters.
- Classification: train one SRILM model per class; classify by lowest perplexity (sketch below).
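Classification by lowest perplexity could look like the following sketch; the helper parses the summary line printed by SRILM's ngram -ppl, and the per-class model file names are hypothetical:

    # Sketch: score a sample against one LM per class and pick the class
    # whose model assigns the lowest perplexity.
    import re
    import subprocess

    def perplexity(lm_path, text_path):
        out = subprocess.run(
            ["ngram", "-order", "3", "-lm", lm_path, "-ppl", text_path],
            capture_output=True, text=True, check=True,
        ).stdout
        return float(re.search(r"ppl= ([0-9.]+)", out).group(1))

    class_lms = ["read.lm", "spon.lm", "bn.lm", "dialog.lm"]  # hypothetical
    predicted = min(class_lms, key=lambda lm: perplexity(lm, "sample.txt"))
    print("predicted class:", predicted)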

Outlier Detection
- Train a single model.
- The classifier learns a perplexity threshold (one possible rule is sketched below).
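The slides do not say how the threshold is chosen; a simple placeholder rule (an assumption, not the talk's method) is to flag any sample whose perplexity exceeds a high percentile of in-class training perplexities:

    # Sketch: perplexity-threshold outlier detection. The percentile rule
    # is a placeholder; the talk only says a threshold is learned.
    import numpy as np

    def fit_threshold(in_class_ppls, percentile=95):
        return np.percentile(in_class_ppls, percentile)

    def is_outlier(ppl, threshold):
        return ppl > threshold

    threshold = fit_threshold([110.0, 95.2, 130.4, 101.7])
    print(is_outlier(180.0, threshold))  # True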

Data
- Boston Directions Corpus: READ and SPONTANEOUS; 4 speakers (used for speaker classification).
- Boston University Radio News Corpus (BURNC): BROADCAST NEWS; 6 speakers.
- Columbia Games Corpus: SPONTANEOUS DIALOG; 13 speakers.
- Native Mandarin Chinese speakers reading BURNC stories; 4 speakers.
- All ToBI labeled.

Features
- Villing (2004) pseudosyllabification; syllables with mean intensity below 10 dB are considered silent.
- 7 features per syllable (assembled in the sketch below):
  - Mean range-normalized intensity
  - Mean range-normalized delta intensity
  - Mean z-score-normalized log f0
  - Mean z-score-normalized delta log f0
  - Syllable duration
  - Duration of preceding silence (if any)
  - Duration of following silence (if any)
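Assembling the 7-dimensional syllable vector might look like this sketch; it assumes the frame-level intensity and log-f0 contours have already been normalized per speaker, and that syllable boundaries and silence durations come from the pseudosyllabifier:

    # Sketch: the 7 syllable-level features listed above. intensity and
    # log_f0 are normalized frame-level arrays for one syllable; durations
    # are in seconds, with 0.0 for absent silences.
    import numpy as np

    def mean_delta(x):
        return float(np.diff(x).mean()) if x.size > 1 else 0.0

    def syllable_features(intensity, log_f0, duration, prev_sil, next_sil):
        return np.array([
            intensity.mean(),       # mean range-normalized intensity
            mean_delta(intensity),  # mean delta intensity
            log_f0.mean(),          # mean z-normalized log f0
            mean_delta(log_f0),     # mean delta log f0
            duration,               # syllable duration
            prev_sil,               # preceding silence (0.0 if none)
            next_sil,               # following silence (0.0 if none)
        ])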

Consistency with ToBI Labels
- V-measure between ToBI accent types and clusters, and between ToBI intonational phrase-ending tones and clusters (see the sketch below).
- [Figure: K-means as a solid line; DPGMM as a gray reference line (it does not vary by more than 0.001)]
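V-measure is available directly in scikit-learn; a minimal sketch with placeholder labels:

    # Sketch: V-measure between ToBI labels and induced cluster IDs.
    from sklearn.metrics import v_measure_score

    tobi_accents = ["H*", "L*", "none", "H*", "none"]  # placeholder labels
    cluster_ids = [3, 1, 0, 3, 0]                      # placeholder clusters
    print(v_measure_score(tobi_accents, cluster_ids))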

[Figures: accenting; phrasing]

Speaking Style Recognition
- 4 styles: READ, SPON, BN, DIALOG.
- Single speaker for evaluation.
- [Figures: classification; outlier detection, DIALOG]

Nativeness Recognition
- Native (BURNC) vs. non-native.
- Single speaker for evaluation.

- [Figures: classification; outlier detection, NATIVE]

Speaker Recognition
- 4 BDC speakers.
- 6 tasks for training, 3 for testing.

- [Figures: classification; outlier detection]
- Outlier detection: 6 BURNC speakers; detect f2b vs. the others.

Conclusions
- K-means works well to represent prosodic information.
- DPGMM does not work so well out of the box: despite being non-parametric, hyperparameter setting is still critically important.

Future Work
- Larger acoustic/prosodic feature set (requires pre-processing).
- Evaluating the universality of prosodic representations.
- Integration of K-means and DPGMM; use one to seed the other.

Thank You
[email protected]
http://speech.cs.qc.cuny.edu