Context in Multilingual Tone and Pitch Accent Recognition
Gina-Anne Levow
University of Chicago
September 7, 2005
Roadmap
• Motivating Context
• Data Collections & Processing
• Modeling Context for Tone and Pitch Accent
• Context in Recognition
• Conclusion
Challenges
• Tone and Pitch Accent Recognition
  – Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
      – In extreme cases, fall becomes rise
  – Tone is relative
    • To speaker range
      – High for male may be low for female
    • To phrase range, other tones
      – E.g. downstep
Strategy
• Common model across languages, SVM classifier
  – Acoustic-prosodic model: no word label, POS, lexical stress info
    • No explicit tone label sequence model
  – English, Mandarin Chinese (also Cantonese)
• Exploit contextual information
  – Features from adjacent syllables
    • Height, shape: direct, relative
  – Compensate for phrase contour
• Analyze impact of
  – Context position, context encoding, context type
  – > 20% relative reduction in error over no context
    • Preceding context gives greater enhancement than following
Data Collection & Processing
• English: (Ostendorf et al, 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
    • Unaccented, High, Downstepped High, Low
      – (Sun 02, Ross & Ostendorf 95)
• Mandarin:
  – TDT2 Voice of America Mandarin Broadcast News
  – Automatically force aligned to anchor scripts (CUSonic)
  – High, Mid-rising, Low, High falling, Neutral
Local Feature Extraction
• Uniform representation for tone, pitch accent
  – Motivated by Pitch Target Approximation Model
    • Tone/pitch accent target exponentially approached
      – Linear target: height, slope (Xu et al, 99)
• Scalar features:
  – Pitch, Intensity max, mean (Praat, speaker normalized)
  – Pitch at 5 points across voiced region
  – Duration
  – Initial, final in phrase
• Slope:
  – Linear fit to last half of pitch contour
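The per-syllable features listed above can be sketched in a few lines. This is a minimal illustration, not the original extraction code: the function names, the dictionary keys, the evenly spaced choice of the 5 pitch points, and the use of frame times to measure duration are all assumptions. Pitch and intensity values are assumed to be already extracted (e.g. with Praat) and speaker normalized.

```python
from statistics import mean


def linear_slope(xs, ys):
    """Least-squares slope of ys against xs (0.0 if xs has no spread)."""
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def sample_points(values, n=5):
    """Pick n evenly spaced samples across values (assumed spacing)."""
    if len(values) == 1:
        return [values[0]] * n
    last = len(values) - 1
    return [values[round(i * last / (n - 1))] for i in range(n)]


def local_features(times, pitch, intensity):
    """Hypothetical feature dictionary for one syllable.

    times, pitch, intensity: equal-length, non-empty lists covering the
    voiced region of the syllable (pitch assumed speaker normalized).
    """
    half = len(pitch) // 2  # slope is fit to the last half of the contour
    return {
        "pitch_max": max(pitch),
        "pitch_mean": mean(pitch),
        "pitch_points": sample_points(pitch, 5),
        "intensity_max": max(intensity),
        "intensity_mean": mean(intensity),
        "duration": times[-1] - times[0],
        "slope": linear_slope(times[half:], pitch[half:]),
    }
```

Fitting the slope only to the last half of the contour fits the idea that the target is approached gradually, so the later part of the syllable is closer to the underlying target than the onset.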
Context Features
• Local context:
  – Extended features
    • Pitch max, mean, adjacent points of preceding, following syllables
  – Difference features
    • Difference between
      – Pitch max, mean, mid, slope
      – Intensity max, mean
    • Of preceding, following and current syllable
• Phrasal context:
  – Compute collection average phrase slope
  – Compute scalar pitch values, adjusted for slope
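A rough sketch of the two context encodings follows. The slides do not specify the sign convention for the difference features or the exact form of the phrase-slope adjustment, so both are assumptions here: differences are taken as current minus neighbor, and the phrasal compensation subtracts the pitch drift predicted by the collection-average slope at the syllable's time offset into the phrase. All function and key names are hypothetical.

```python
from statistics import mean

# Features the slides list for differencing (key names are assumed).
DIFF_KEYS = ("pitch_max", "pitch_mean", "pitch_mid", "slope",
             "intensity_max", "intensity_mean")


def difference_features(current, neighbor, prefix):
    """Difference features between the current syllable and one neighbor.

    Assumed convention: current value minus neighbor value.
    prefix labels the neighbor, e.g. "prev" or "next".
    """
    return {f"{prefix}_diff_{k}": current[k] - neighbor[k] for k in DIFF_KEYS}


def average_phrase_slope(phrase_slopes):
    """Collection-average slope over the per-phrase pitch slopes."""
    return mean(phrase_slopes)


def compensate_phrase_slope(pitch_value, time_in_phrase, avg_slope):
    """Remove the drift an average phrase contour would contribute.

    Assumed linear model: drift = avg_slope * time since phrase start.
    """
    return pitch_value - avg_slope * time_in_phrase
```

With a negative average slope (declination), a syllable late in the phrase is adjusted upward, so a High tone near the end of a phrase is compared on the same scale as one near the start.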
Classification Experiments
• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
    • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training / test splits
• Experiments: Effects of
  – Context position: preceding, following, none, both
  – Context encoding: Extended/Difference
  – Context type: local, phrasal
Results: Local Context
Context       Mandarin Tone   English Pitch Accent
Full          74.5%           81.3%
Extend LR     74.0%           80.7%
Extend L      74.0%           79.9%
Extend R      70.5%           76.7%
Diffs LR      75.5%           80.7%
Diffs L       76.5%           79.5%
Diffs R       69.0%           77.3%
Both L        76.5%           79.7%
Both R        71.5%           77.6%
No context    68.5%           75.9%
(L = preceding context, R = following context)
Discussion: Local Context
• Any context information improves over none
  – Preceding context information consistently improves over none or following context information
    • English: Generally more context features are better
    • Mandarin: Following context can degrade
  – Little difference in encoding (Extend vs Diffs)
• Consistent with phonological analysis (Xu) that coarticulation is carryover, not anticipatory
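The "> 20% relative" figure quoted for context can be checked directly against the local-context results table by converting accuracies to error rates. The helper below is a small worked calculation; its name is hypothetical and it is not part of the original system.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error rate, in percent.

    Both arguments are accuracies in percent (e.g. 68.5 for 68.5%).
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Best local-context result vs. no context, taken from the results table.
mandarin = relative_error_reduction(68.5, 76.5)  # error 31.5% -> 23.5%
english = relative_error_reduction(75.9, 81.3)   # error 24.1% -> 18.7%
```

For Mandarin this gives about 25.4% and for English about 22.4%, both above the 20% mark.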
Results & Discussion: Phrasal Context
Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72.0%           79.9%
• Phrase contour compensation enhances recognition
• Simple strategy
• Use of non-linear slope compensation may improve results
Conclusion
• Employ common acoustic representation
  – Tone (Mandarin), pitch accent (English)
    • Cantonese, recent experiments
• SVM classifiers, linear kernel: 76% (Mandarin tone), 81% (English pitch accent)
• Local context effects:
  – More than 20% relative reduction in error
  – Preceding context greatest contribution
    • Carryover vs anticipatory
• Phrasal context effects:
  – Compensation for phrasal contour improves recognition
Current & Future Work
• Application of model to different languages
  – Cantonese, Dschang (Bantu family)
    • Cantonese: ~65% acoustic only, 85% w/segmental
• Integration of additional contextual influence
  – Topic, turn, discourse structure
  – HMSVM, GHMM models
• http://people.cs.uchicago.edu/~levow/projects/tai
  – Supported by NSF Grant #: 0414919
Confusion Matrix (English)
Rows: Recognized Tone; columns: Manually Labeled Tone
              Unaccented   High    Low     D.S. High
Unaccented    95%          25%     100%    53.5%
High          4.6%         73%     0%      38.5%
Low           0%           0%      0%      0%
D.S. High     0.3%         2%      0%      8%
Confusion Matrix (Mandarin)
Rows: Recognized Tone; columns: Manually Labeled Tone
               High    Mid-Rising   Low    High-Falling   Neutral
High           84%     9%           5%     13%            0%
Mid-Rising     6.7%    78.6%        10%    7%             27.3%
Low            0%      3.6%         70%    7%             27.3%
High-Falling   7.4%    3.6%         10%    70%            0%
Neutral        0%      5.3%         5%     1.5%           45%
Related Work
• Tonal coarticulation:
  – Xu & Sun 02; Xu 97; Shih & Kochanski 00
• English pitch accent
  – X. Sun 02; Hasegawa-Johnson et al 04; Ross & Ostendorf 95
• Lexical tone recognition
  – SVM recognition of Thai tone: Thubthong 01
  – Context-dependent tone models
    • Wang & Seneff 00, Zhou et al 04