Context and Learning in Multilingual Tone and Pitch Accent Recognition
Gina-Anne Levow
University of Chicago
May 18, 2007
Roadmap
• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in Recognition
• Asides: More tones and features
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion
Challenges: Context
• Tone and Pitch Accent Recognition
  – Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
      – In extreme cases, fall becomes rise
  – Tone is relative
    • To speaker range
      – High for male may be low for female
    • To phrase range, other tones
      – E.g. downstep
Challenges: Training Demands
• Tone and pitch accent recognition
  – Exploit data-intensive machine learning
    • SVMs (Thubthong 01, Levow 05, SLX05)
    • Boosted and Bagged Decision trees (X. Sun, 02)
    • HMMs: (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al, 04,…
  – Can achieve good results with huge sample sets
    • SLX05: ~10K lab syllabic samples -> > 90% accuracy
  – Training data expensive to acquire
    • Time: pitch accent labeling takes 10s of times real-time
    • Money: requires skilled labelers
    • Limits investigation across domains, styles, etc
  – Human language acquisition doesn’t use labels
Strategy: Overall
• Common model across languages
  – Common machine learning classifiers
  – Acoustic-prosodic model
    • No word label, POS, lexical stress info
    • No explicit tone label sequence model
  – English, Mandarin Chinese, isiZulu
    • (also Cantonese)
Strategy: Context
• Exploit contextual information
  – Features from adjacent syllables
    • Height, shape: direct, relative
  – Compensate for phrase contour
  – Analyze impact of
    • Context position, context encoding, context type
    • > 12.5% reduction in error over no context
Data Collections: I
• English: (Ostendorf et al, 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
    • Unaccented, High, Downstepped High, Low
      – (Sun 02, Ross & Ostendorf 95)
Data Collections: II
• Mandarin:
  – TDT2 Voice of America Mandarin Broadcast News
  – Automatically force aligned to anchor scripts
    • Automatically segmented, pinyin pronunciation lexicon
    • Manually constructed pinyin-ARPABET mapping
    • CU Sonic – language porting
  – Tone labels: High, Mid-rising, Low, High falling, Neutral
Data Collections: III
• isiZulu: (Govender et al., 2005)
  – Sentence text collected from Web
    • Selected based on grapheme bigram variation
  – Read by male native speaker
  – Manually aligned, syllabified
  – Tone labels assigned by 2nd native speaker
    • Based only on utterance text
  – Tone labels: High, low
Local Feature Extraction
• Uniform representation for tone, pitch accent
  – Motivated by Pitch Target Approximation Model
    • Tone/pitch accent target exponentially approached
      – Linear target: height, slope (Xu et al, 99)
• Base features:
  – Pitch, Intensity max, mean, min, range
    • (Praat, speaker normalized)
  – Pitch at 5 points across voiced region
  – Duration
  – Initial, final in phrase
• Slope:
  – Linear fit to last half of pitch contour
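
The base features and slope above can be sketched as follows. This is a minimal illustration, assuming the pitch contour of one syllable is already available as a list of speaker-normalized f0 values over the voiced region. The names `linear_slope`, `five_points`, `local_features` and the sample contour are hypothetical and are not taken from the original system, which extracted pitch with Praat.

```python
from statistics import mean

def linear_slope(values):
    """Least-squares slope of the values against their frame index."""
    n = len(values)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:  # a single frame has no slope
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / denom

def five_points(values):
    """Sample the contour at 5 evenly spaced positions (nearest frame)."""
    last = len(values) - 1
    return [values[round(i * last / 4)] for i in range(5)]

def local_features(pitch):
    """Pitch-only subset of the base features, plus the slope feature."""
    second_half = pitch[len(pitch) // 2:]  # slope is fit to the last half only
    return {
        "max": max(pitch),
        "mean": mean(pitch),
        "min": min(pitch),
        "range": max(pitch) - min(pitch),
        "points": five_points(pitch),
        "slope": linear_slope(second_half),
    }

# Hypothetical normalized contour for a falling syllable.
falling = [1.0, 0.9, 0.8, 0.6, 0.4, 0.2, 0.0, -0.2, -0.4]
feats = local_features(falling)
```

Intensity, duration, and phrase-position features would be added to the same dictionary in the same way; they are omitted here to keep the sketch short.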
Context Features
• Local context:
  – Extended features
    • Pitch max, mean, adjacent points of preceding, following syllables
  – Difference features
    • Difference between
      – Pitch max, mean, mid, slope
      – Intensity max, mean
    • Of preceding, following and current syllable
• Phrasal context:
  – Compute collection average phrase slope
  – Compute scalar pitch values, adjusted for slope
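
A rough sketch of the difference features and the phrase-slope adjustment described above. The sign conventions (current minus neighbor), the units (Hz and seconds), and the names `difference_features` and `compensate_phrase_slope` are assumptions made for illustration; the slides do not specify them.

```python
def difference_features(prev, cur, nxt, keys=("pitch_max", "pitch_mean")):
    """Differences between the current syllable and each neighbor."""
    feats = {}
    for key in keys:
        feats["diff_pre_" + key] = cur[key] - prev[key]
        feats["diff_post_" + key] = cur[key] - nxt[key]
    return feats

def compensate_phrase_slope(pitch, seconds_into_phrase, avg_phrase_slope):
    """Remove the average linear phrase trend from a scalar pitch value.

    avg_phrase_slope is the collection-wide average slope (assumed Hz/s),
    typically negative because pitch tends to decline across a phrase.
    """
    return pitch - avg_phrase_slope * seconds_into_phrase

# Hypothetical per-syllable values in Hz.
prev = {"pitch_max": 210.0, "pitch_mean": 190.0}
cur = {"pitch_max": 180.0, "pitch_mean": 170.0}
nxt = {"pitch_max": 175.0, "pitch_mean": 160.0}

diffs = difference_features(prev, cur, nxt)
adjusted = compensate_phrase_slope(170.0, 1.5, -20.0)  # 170 Hz, 1.5 s into the phrase
```

With a declining phrase contour, a syllable late in the phrase is adjusted upward, so that its height can be compared with syllables near the phrase start.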
Classification Experiments
• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
    • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training / test splits
• Experiments: Effects of
  – Context position: preceding, following, none, both
  – Context encoding: Extended/Difference
  – Context type: local, phrasal
Results: Local Context

Context          Mandarin Tone   English Pitch Accent   isiZulu Tone
Full             74.5%           81.3%                  75.9%
Extend PrePost   74%             80.7%                  73.8%
Extend Pre       74%             79.9%                  73.6%
Extend Post      70.5%           76.7%                  72.3%
Diffs PrePost    75.5%           80.7%                  75.8%
Diffs Pre        76.5%           79.5%                  75.5%
Diffs Post       69%             77.3%                  72.8%
Both Pre         76.5%           79.7%                  75.5%
Both Post        71.5%           77.6%                  72.5%
No context       68.5%           75.9%                  72.2%
Discussion: Local Context
• Any context information improves over none
  – Preceding context information consistently improves over none or following context information
    • English/isiZulu: Generally more context features are better
    • Mandarin: Following context can degrade
  – Little difference in encoding (Extend vs Diffs)
• Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory
Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Use of non-linear slope compensation may improve results further
Context: Summary
• Employ common acoustic representation
  – Tone (Mandarin, isiZulu), pitch accent (English)
• SVM classifiers - linear kernel: 76%, 76%, 81%
• Local context effects:
  – Up to > 20% relative reduction in error
  – Preceding context greatest contribution
    • Carryover vs anticipatory
• Phrasal context effects:
  – Compensation for phrasal contour improves recognition
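
The relative error reductions quoted for local context can be checked from the accuracies in the local context results table. The sketch below compares the "No context" row with the best-performing context configuration for each language; the function name `relative_error_reduction` is introduced here only for illustration.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Fraction of the baseline error removed, with accuracies given in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err

# (no-context accuracy, best accuracy with local context), from the results table
results = {
    "Mandarin": (68.5, 76.5),
    "English": (75.9, 81.3),
    "isiZulu": (72.2, 75.9),
}

reductions = {
    lang: round(100 * relative_error_reduction(base, best), 1)
    for lang, (base, best) in results.items()
}
# reductions == {"Mandarin": 25.4, "English": 22.4, "isiZulu": 13.3}
```

The smallest reduction (isiZulu, about 13%) is consistent with the "> 12.5%" figure given earlier, and the Mandarin and English values with "up to > 20%".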
Aside: More Tones
• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction & representation
  – 6 tones:
    • High level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    • Linear kernel: 64%, Gaussian kernel: 68%
      – Tones 3, 6: 50%, mutually indistinguishable (50% pairwise)
        » Human levels: no context: 50%; context: 68%
    • Augment with syllable phone sequence
      – 86% accuracy: for 90% of syllables w/tone 3 or 6, one tone dominates
Aside: Voice Quality & Energy
• w/ Dinoj Surendran
• Assess local voice quality and energy features for tone
  – Not typically associated with tones: Mandarin/isiZulu
• Considered:
  – VQ: NAQ, AQ, etc; Spectral balance; Spectral Tilt; Band energy
• Useful: Band energy significantly improves
  – Mandarin: neutral tone
    • Supports identification of unstressed syllables
      – Spectral balance predicts stress in Dutch
  – isiZulu: Using band energy outperforms pitch
    • In conjunction with pitch -> ~78%
Roadmap
• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in Recognition
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion
Strategy: Training
• Challenge:
  – Can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Substantially outperform baselines
    • Can approach supervised levels
Data Collections & Processing
• English: (as before)
  – Boston University Radio News Corpus, f2b
    • Binary: Unaccented vs accented
    • 4-way: Unaccented, High, Downstepped High, Low
• Mandarin:
  – Lab speech data: (Xu, 1999)
    • 5 syllable utterances: vary tone, focus position
      – In-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
  – 4-way: High, Mid-rising, Low, High falling
• isiZulu: (as before)
  – Read web sentences
    • 2-way: High vs low
Semi-supervised Learning
• Approach:
  – Employ small amount of labeled data
  – Exploit information from additional, presumably more available, unlabeled data
    • Few prior examples: several weakly supervised: (Wong et al, ’05)
• Classifier:
  – Laplacian SVM (Sindhwani, Belkin & Niyogi ’05)
  – Semi-supervised variant of SVM
    • Exploits unlabeled examples
  – RBF kernel, typically 6 nearest neighbors, transductive
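
To illustrate how unlabeled examples enter a Laplacian SVM: the method adds a smoothness penalty of the form f^T L f to the SVM objective, where L is the Laplacian of a nearest-neighbor graph over labeled and unlabeled points together. The sketch below builds only that graph and penalty; it does not train an SVM. Binary edge weights, the toy 2-D points, k = 2 (the slides mention 6 neighbors as typical), and the function names are assumptions for illustration.

```python
import math

def knn_graph_laplacian(points, k):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in neighbors[:k]:
            W[i][j] = W[j][i] = 1.0  # binary weights (an assumption)
    return [
        [(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]

def smoothness_penalty(L, f):
    """f^T L f: the sum over graph edges of (f_i - f_j)^2."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

# Two well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
L = knn_graph_laplacian(points, k=2)

smooth = smoothness_penalty(L, [1, 1, 1, -1, -1, -1])  # constant within each group
rough = smoothness_penalty(L, [1, -1, 1, -1, 1, -1])   # changes sign within groups
```

A labeling that is constant within each dense group incurs no penalty, while one that cuts through a group is penalized. This is how the geometry of the unlabeled data can guide a classifier trained on few labels.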
Experiments
• Pitch accent recognition:
  – Binary classification: Unaccented/Accented
  – 1000 instances, proportionally sampled
    • Labeled training: 200 unacc, 100 acc
  – 80% accuracy (cf. 84% w/15x labeled SVM)
• Mandarin tone recognition:
  – 4-way classification: n(n-1)/2 binary classifiers
  – 400 instances: balanced; 160 labeled
    • Clean lab speech, in-focus: 94%
      – cf. 99% w/SVM, 1000s train; 85% w/SVM 160 training samples
    • Broadcast news: 70%
      – cf. < 50% w/SVM 160 training samples
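
The n(n-1)/2 scheme for 4-way tone classification can be sketched as pairwise voting: one binary classifier per pair of tones (6 for 4 tones), with the most-voted tone as the prediction. In this sketch each binary decision is a nearest-prototype rule over a hypothetical (height, slope) pair, standing in for the trained binary classifiers; the prototypes and names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

TONES = ["High", "Mid-rising", "Low", "High-falling"]

# Hypothetical (height, slope) prototypes, used only as stand-in classifiers.
PROTOTYPES = {
    "High": (1.0, 0.0),
    "Mid-rising": (0.0, 1.0),
    "Low": (-1.0, 0.0),
    "High-falling": (0.0, -1.0),
}

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def binary_decide(tone_a, tone_b, x):
    """Stand-in for one pairwise classifier: pick the nearer prototype."""
    if sq_dist(x, PROTOTYPES[tone_a]) <= sq_dist(x, PROTOTYPES[tone_b]):
        return tone_a
    return tone_b

def one_vs_one_predict(x, classes=TONES):
    """Run every pairwise classifier and return the tone with the most votes."""
    votes = Counter(binary_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]  # ties go to the tone that was voted for first

pairs = list(combinations(TONES, 2))      # 4 * 3 / 2 = 6 pairwise classifiers
prediction = one_vs_one_predict((0.1, 0.9))  # mid height, strongly rising
```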
Unsupervised Learning
• Question:
  – Can we identify the tone structure of a language from the acoustic space without training?
    • Analogous to language acquisition
• Significant recent research in unsupervised clustering
  – Established approaches: k-means
  – Spectral clustering (Shi & Malik ’97, Fischer & Poland 2004): asymmetric k-lines
• Little research for tone
  – Self-organizing maps (Gauthier et al, 2005)
    • Tones identified in lab speech using f0 velocities
  – Cluster-based bootstrapping (Narayanan et al, 2006)
  – Prominence clustering (Tamburini ’05)
Clustering
• Pitch accent clustering:
  – 4 way distinction: 1000 samples, proportional
    • 2-16 clusters constructed
      – Assign most frequent class label to each cluster
  – Classifier:
    • Asymmetric k-lines:
      » context-dependent kernel radii, non-spherical
  – > 78% accuracy:
    • 2 clusters: asymmetric k-lines best
  – Context effects:
    • Vector w/preceding context vs vector with no context comparable
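
The evaluation step above, in which each cluster is assigned its most frequent class label, can be sketched as follows. The cluster assignments and gold labels are made-up toy data, and `majority_label_accuracy` is a name introduced here for illustration.

```python
from collections import Counter, defaultdict

def majority_label_accuracy(cluster_ids, gold_labels):
    """Label each cluster with its most frequent gold class, then score accuracy."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster][label] += 1
    # Samples carrying the majority label of their cluster count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(gold_labels)

# Toy data: 10 syllables in 2 clusters (U = unaccented, H = high, L = low).
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
gold_labels = ["U", "U", "U", "H", "H", "H", "L", "U", "H", "H"]

accuracy = majority_label_accuracy(cluster_ids, gold_labels)
```

With two clusters and four classes, at most two classes can ever be predicted, which is one reason results are reported across a range of cluster counts.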
Contrasting Clustering
• Contrasts:
  – Clustering:
    • 3 Spectral approaches:
      – Perform spectral decomposition of affinity matrix
        » Asymmetric k-lines (Fischer & Poland 2004)
        » Symmetric k-lines (Fischer & Poland 2004)
        » Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004)
        » Binary weights, k-lines clustering
    • K-means: Standard Euclidean distance
  – # of clusters: 2-16
• Best results: > 78%
  – 2 clusters: asymmetric k-lines; > 2 clusters: k-means
    • Larger # clusters: all similar
Contrasting Learners
Tone Clustering: I
• Mandarin four tones:
• 400 samples: balanced
• 2-phase clustering: 2-5 clusters each
• Asymmetric k-lines, k-means clustering
  – Clean read speech:
    • In-focus syllables: 87% (cf. 99% supervised)
    • In-focus and pre-focus: 77% (cf. 93% supervised)
  – Broadcast news: 57% (cf. 74% supervised)
  – K-means requires more clusters to reach k-lines level
Tone Structure
• First phase of clustering splits high/rising from low/falling by slope
• Second phase splits by pitch height
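
A toy sketch of the two-phase structure described above. It is a deliberate simplification: instead of asymmetric k-lines or k-means over full feature vectors with 2-5 clusters per phase, it runs a tiny 1-D 2-means on slope and then on height within each slope group. The sample values and function names are hypothetical.

```python
def two_means_1d(values, iters=20):
    """Minimal 1-D k-means with k=2, initialized at the min and max values.

    Returns 0 for the lower cluster and 1 for the higher cluster.
    """
    lo, hi = min(values), max(values)
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low_group = [v for v, a in zip(values, assign) if a == 0]
        high_group = [v for v, a in zip(values, assign) if a == 1]
        if low_group:
            lo = sum(low_group) / len(low_group)
        if high_group:
            hi = sum(high_group) / len(high_group)
    return assign

def two_phase_cluster(samples):
    """samples: (height, slope) pairs. Phase 1 splits by slope, phase 2 by height."""
    phase1 = two_means_1d([slope for _, slope in samples])
    labels = [None] * len(samples)
    for group in (0, 1):
        idx = [i for i, a in enumerate(phase1) if a == group]
        if not idx:
            continue
        phase2 = two_means_1d([samples[i][0] for i in idx])
        for i, sub in zip(idx, phase2):
            labels[i] = (group, sub)
    return labels

# Hypothetical (height, slope) values: two syllables per tone.
samples = [
    (1.0, 0.1), (0.9, 0.2),      # high
    (0.3, 0.8), (0.4, 0.9),      # rising
    (-0.9, -0.3), (-0.8, -0.2),  # low
    (0.2, -0.9), (0.3, -0.8),    # falling
]
labels = two_phase_cluster(samples)
```

On this toy data the slope phase groups high with rising and low with falling, and the height phase then separates the two tones inside each group, giving four clusters.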
Tone Clustering: II
• isiZulu High/Low tones
• 3225 samples: no labels
• Proportional: ~62% low, 38% high
• K-means clustering: 2 clusters
  – Read speech, web-based sentences
    • 70% accuracy (vs 76% fully-supervised)
Conclusions
• Common prosodic framework for tone and pitch accent recognition
  – Contextual modeling enhances recognition
    • Local context and broad phrase contour
      – Carryover coarticulation has larger effect for Mandarin
  – Exploiting unlabeled examples for recognition
    • Semi- and Un-supervised approaches
      – Best cases approach supervised levels with less training
      – Exploits acoustic structure of tone and accent space
Current and Future Work
• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech & tone learning
• Support for Computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation & modeling
• Feature assessment
  – Band energy and intensity in tone recognition
Thanks
• Dinoj Surendran, Siwei Wang, Yi Xu
• Natasha Govender and Etienne Barnard
• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
• This work supported by NSF Grant #0414919
• http://people.cs.uchicago.edu/~levow/tai