Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson
University of Illinois at Urbana-Champaign, USA
Lecture 11. Articulatory Phonology
• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
– The mental lexicon (our mental storage for words) is made of Gestures, not phonemes
– Overlap among the gestures results in inter-gesture competition; competition can result in reduction and/or assimilation
– No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
– Constriction area matters more than non-constriction area
– Motor control model: only control the constrictions
– Tract variables
– Task dynamics
• Prosody
– Units of prosody: phrases and pitch accents
– Prosodic gestures: spatial scaling, time stretching
– Prosodic landmark detection
Pronunciation Variability (Read Speech)
Manner Class Assimilation: /t/ becomes part of the /n/
Vowel Reduction: /iy/ becomes /ix/
Pronunciation Variability (Read Speech)
Syllable Merger: “carry an” becomes “carin”
Vowel Reduction: /iy/ becomes /ax/
Autosegmental Phonology (Goldsmith, 1975)
• Inter-word phonological rules all have a simple form: manner or place assimilation
• Hypothesis: instructions to the speech articulators are arranged in “autosegmental tiers,” i.e., on a kind of musical score with asynchronous rows
• Assimilation = feature spreading
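Example: /s/ followed by /sh/
Before spreading:
/s/: [+strident] [+blade] [+anterior] [-nasal]
/sh/: [+strident] [+blade] [-anterior] [-nasal]
After spreading:
/sh/: [+strident] [-nasal], with [+blade] [-anterior] shared with the following segment
/sh/: [+strident] [+blade] [-anterior] [-nasal]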
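Feature spreading can be illustrated with a minimal Python sketch. It assumes each segment is a plain dictionary of binary features and that spreading copies the named features from the neighboring segment. The function name `spread` and the dictionary layout are illustrative choices, not part of the autosegmental formalism, which would link both segments to one shared feature rather than copy it.

```python
def spread(source, target, feature_names):
    """Return a copy of `target` whose named features take the values in `source`."""
    result = dict(target)  # leave the original segment untouched
    for name in feature_names:
        result[name] = source[name]
    return result

# Feature bundles for /s/ and /sh/ (True = "+", False = "-").
S = {"strident": True, "blade": True, "anterior": True, "nasal": False}
SH = {"strident": True, "blade": True, "anterior": False, "nasal": False}

# Place features spread leftward from /sh/ onto the preceding /s/.
assimilated = spread(SH, S, ["blade", "anterior"])
print(assimilated == SH)
```

After spreading, the first segment carries the same feature values as /sh/, while the stored bundle for /s/ itself is unchanged.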
Articulatory Phonology (Browman and Goldstein, 1990)
• A word is composed of “gestures”
• Gestures are MENTAL speech planning units, but they have close correspondence to articulatory controls
• Example: Mental Lexicon Entry for “she:”
1. TT-OPEN→FRICATIVE (/š/)
2. TT-LOC→PALATAL (/š/)
3. TB-OPEN→NARROW (whole word)
4. TB-LOC→PALATAL (whole word)
5. GLOTTIS-OPEN→WIDE (/š/) then GLOTTIS-OPEN→CRITICAL (/i/)
[Figure: gestural score with tiers LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING]
Articulatory Phonology (Browman and Goldstein, 1990)
• Rule-based Phonologies:
– Reduction and assimilation are CHANGES in the value of a distinctive feature, just like morpho-phonological processes
• Autosegmental Phonologies:
– Reduction and assimilation are SUBSTITUTIONS of a neighboring phone’s features in place of the current phone’s features
• Articulatory Phonology:
– “Frozen” word construction processes may result in the deletion or substitution of gestures in the lexicon, but…
– The process of sequencing words to create a sentence never deletes or changes any gesture; all gestures stay in the mental representation all the time!
• Reduction and Assimilation can be explained by
– Overlap among gestures
– Competition among overlapping gestures, for control of the same articulators
Example: Manner-Class Assimilation
“Don’t Ask:” Careful Speech
/d/ /o/ /n/ /t/ /ae/ /s/ /k/
“Don’t Ask:” Fast Speech
/d/ /o/ /n/ /ae/ /s/ /k/
[Figure: gestural score under each transcription. Tongue-tip tier: TT-CLOSED, TT-CLOSED, TT-FRIC; tongue-body tier: TB-CLOSED, TB-OPEN, TB-OPEN; glottal tier: GL-CLO, GL-OPEN, and GL-CRIT gestures. The tongue-tip and tongue-body tiers carry the same gestures in both scores.]
What’s in the Lexicon? (Browman and Goldstein, 2000)
• Experimental Observation: consonant clusters at the beginning of a syllable (/sp/ in “spat”) show less production variability than consonant clusters at the end of a syllable (/ps/ in “taps”)
• Hypothesis: the mental lexicon includes GESTURES and PAIRWISE COUPLING CONSTRAINTS
– Two kinds of coupling: simultaneous or sequential
– Coda consonants FOLLOW the vowel, e.g., in “taps:” TB-WIDE→→LIP-CLOSED→→TT-CRITICAL
– Onset consonants are produced SIMULTANEOUSLY with the start of the tongue-body vowel gesture; therefore, in “spat,” both TT-CRITICAL→→TB-WIDE and LIP-CLOSED→→TB-WIDE. Competition among them yields reduced variability in production.
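One way to picture a lexicon of gestures plus pairwise coupling constraints is as a small graph. The sketch below is only an illustration under stated assumptions: the tuple layout, the `kind` labels, and the helper names are hypothetical, and only the couplings listed on this slide are encoded.

```python
from collections import defaultdict

# Each constraint couples two gestures: (first, second, kind).
# Coda of "taps": the consonant gestures follow the vowel in a chain.
TAPS_CODA = [
    ("TB-WIDE", "LIP-CLOSED", "sequential"),
    ("LIP-CLOSED", "TT-CRITICAL", "sequential"),
]
# Onset of "spat": both consonant gestures are coupled to the vowel gesture.
SPAT_ONSET = [
    ("TT-CRITICAL", "TB-WIDE", "simultaneous"),
    ("LIP-CLOSED", "TB-WIDE", "simultaneous"),
]

def coupling_partners(constraints):
    """Map each gesture to the set of gestures it is directly coupled to."""
    partners = defaultdict(set)
    for first, second, _kind in constraints:
        partners[first].add(second)
        partners[second].add(first)
    return dict(partners)

def gestures_coupled_to(constraints, anchor):
    """Sorted list of gestures directly coupled to `anchor`."""
    return sorted(coupling_partners(constraints).get(anchor, set()))

print(gestures_coupled_to(SPAT_ONSET, "TB-WIDE"))
print(gestures_coupled_to(TAPS_CODA, "TB-WIDE"))
```

In this encoding the vowel gesture of the onset is coupled directly to both consonant gestures, whereas in the coda it is coupled only to the first consonant gesture of the chain.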
Production Planning: Lexical Entry Turned Into a Gestural Score
“SPAT:”
From Gestural Score to Acoustics
• Perturbation Theory (Chiba and Kajiyama, 1941) showed that $\Delta F_n \propto \Delta \log A(x) \approx \Delta A(x)/A(x)$
• The audibility of a change $\Delta A(x)$ is proportional to $1/A(x)$
– Changes near a constriction (small $A(x)$) are very audible
– Changes elsewhere (large $A(x)$) are not very audible
• Therefore, talkers carefully control $A(x)$ only near a constriction:
– Inter-utterance variability of $A(x)$ is an increasing function of $1/A(x)$ (Perkell and Nelson, JASA 1985): $E[(A(x)-\bar{A}(x))^2] \sim 1/\bar{A}(x)$, where $\bar{A}(x) \equiv E[A(x)]$
– Inter-talker variability of $A(x)$ is an increasing function of $1/A(x)$ (Hasegawa-Johnson et al., JSLHR 2003)
– Inter-talker variability of $\log A(x)$ is independent of $A(x)$ (Hasegawa-Johnson et al., JSLHR 2003): $E[(\log A(x)-\overline{\log A}(x))^2] \sim$ constant
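The first-order relation between a change in log area and a relative change in area can be checked numerically. The sketch below is only an illustration: the two areas and the perturbation size are made-up values, not data from the cited studies, and the function names are hypothetical.

```python
import math

def log_area_change(area, delta_area):
    """Exact change in log A when the area goes from A to A + dA."""
    return math.log(area + delta_area) - math.log(area)

def relative_area_change(area, delta_area):
    """First-order approximation of the same quantity: dA / A."""
    return delta_area / area

DELTA = 0.05  # cm^2, hypothetical perturbation
for area in (0.2, 4.0):  # hypothetical: near a constriction vs. elsewhere
    print(area, log_area_change(area, DELTA), relative_area_change(area, DELTA))
```

The same absolute change in area produces a far larger relative change where the area is small, which is the sense in which changes near a constriction are the most audible.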
Constriction Control as a Model of Speech Motor Control
(Stevens and House, JASA, 1955)
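• Vocal tract shape controlled by just three control parameters:
– $x_{\mathrm{POS}}$ = POSition of tongue constriction
– $r_{\mathrm{CD}}$ = Constriction Degree = radius of the constriction
– $r_{\mathrm{LIP}}$ = effective radius of the lip constriction
• All other vocal tract areas determined by $A(x) = \pi r(x)^2$, where
$r(x) = 0.7 + 0.144x^2$, for $0 \le x \le 2.75$ (larynx)
$r(x) = \min\left(1.6,\ r_{\mathrm{CD}} + 0.025(1.2 - r_{\mathrm{CD}})(x - x_{\mathrm{POS}})^2\right)$, for $2.75 \le x \le x_{\mathrm{POS}}$ (pharynx)
$r(x) = r_{\mathrm{CD}} + 0.025(1.2 - r_{\mathrm{CD}})(x - x_{\mathrm{POS}})^2$, for $x_{\mathrm{POS}} \le x \le 17$ (mouth)
$r(x) = r_{\mathrm{LIP}}$, for $17 \le x \le 18$ (lips)
$x$ and $r(x)$ are in centimeters, $A(x)$ in cm$^2$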
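The piecewise definition can be written out as a short Python sketch. Several choices here are assumptions rather than part of the slide: the radius is taken to widen quadratically away from the constriction (the quadratic term is added to the constriction radius, so that the constriction is the narrowest point), the area is taken as π r², points on a shared boundary are assigned to one region arbitrarily, and the parameter values in the usage lines are hypothetical.

```python
import math

def radius(x, x_pos, r_cd, r_lip):
    """Vocal-tract radius r(x) in cm at distance x (cm) from the glottis."""
    if not 0.0 <= x <= 18.0:
        raise ValueError("x must lie between 0 and 18 cm")
    if x > 17.0:  # lips
        return r_lip
    if x < 2.75:  # larynx
        return 0.7 + 0.144 * x ** 2
    # Assumed: quadratic widening away from the constriction at x_pos.
    r = r_cd + 0.025 * (1.2 - r_cd) * (x - x_pos) ** 2
    if x <= x_pos:  # pharynx: radius capped at 1.6 cm
        return min(1.6, r)
    return r  # mouth

def area(x, x_pos, r_cd, r_lip):
    """Cross-sectional area A(x) in cm^2, assuming a circular section."""
    return math.pi * radius(x, x_pos, r_cd, r_lip) ** 2

# Hypothetical parameters: constriction 12 cm from the glottis.
for x in (0.0, 5.0, 12.0, 17.5):
    print(x, radius(x, x_pos=12.0, r_cd=0.3, r_lip=0.5))
```

With these parameters the radius equals the constriction radius at the constriction itself and the lip radius in the final centimeter.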
Examples: Vowel /a/
Examples: Vowel /i/
Extending the Model: Tract Variables (Saltzman and Munhall, 1989)
• Languages treat tongue tip and tongue body differently, e.g., both can have constrictions at the same time
– Therefore split (xPOS, rCD) → (TTPOS, TTCD, TBPOS, TBCD)
• Talkers can independently control lip area and lip length
– Therefore split (rLIP) → (LIPCD, LIPPOS)
• Soft palate (“velum”) control: open vs. closed
– Therefore we need a control variable VELCD
• Glottis control: open (breathy), critical (voiced), closed (glottal stop)
– Control variable GLOCD
• The tract variable model: speech is controlled by a mental controller with an 8-dimensional control vector:
$a(t) = [\mathrm{LIPCD}, \mathrm{LIPPOS}, \mathrm{TTCD}, \mathrm{TTPOS}, \mathrm{TBCD}, \mathrm{TBPOS}, \mathrm{VELCD}, \mathrm{GLOCD}]^T$
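As a data structure, the control vector can be sketched as a named tuple whose field order follows the vector above. The class name, the snake_case field names, and every numeric value below are placeholders chosen for illustration; the slides do not give units or typical values.

```python
from typing import NamedTuple

class TractVariableTargets(NamedTuple):
    """One sample of the 8-dimensional control vector a(t)."""
    lip_cd: float   # LIPCD
    lip_pos: float  # LIPPOS
    tt_cd: float    # TTCD
    tt_pos: float   # TTPOS
    tb_cd: float    # TBCD
    tb_pos: float   # TBPOS
    vel_cd: float   # VELCD
    glo_cd: float   # GLOCD

# Hypothetical target values, for illustration only.
a_t = TractVariableTargets(
    lip_cd=1.0, lip_pos=0.0, tt_cd=0.1, tt_pos=15.0,
    tb_cd=0.8, tb_pos=10.0, vel_cd=0.0, glo_cd=0.05,
)
print(len(a_t), list(a_t))
```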
Tract Variables
Task Dynamics: Connecting Gestures to Tract Variables
(Saltzman and Munhall, 1989)
• Lexicon Gestures are sequenced into a GESTURAL SCORE
• The Gestural Score is “played” like a musical score. Each Gesture onset is turned into TRACT VARIABLE TARGETS, a(t).
• The relationship between tract variable targets, a(t), and physical articulator positions, x(t), is given by a 2nd-order system:
$M \frac{d^2x}{dt^2} = K(t)\,(a(t) - x(t)) - R \frac{dx}{dt}$
– K(t) = effective tract-variable-stiffness matrix; controlled by the talker, but varies more slowly than a(t)
– M = effective mass matrix
– R = effective damping matrix
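The second-order system can be simulated for a single tract variable to show how the articulator position follows a target. This is a minimal sketch, not the cited model: it assumes scalar M, K, and R in place of matrices, holds the stiffness constant over time, picks critical damping when no R is given, and uses arbitrary numeric values and a simple semi-implicit Euler step.

```python
import math

def simulate(target, x0=0.0, m=1.0, k=400.0, r=None, dt=0.001, duration=1.0):
    """Integrate m*x'' = k*(a(t) - x) - r*x' and return the trajectory of x.

    target: function of time returning the tract variable target a(t).
    """
    if r is None:
        r = 2.0 * math.sqrt(m * k)  # assumed: critical damping
    x, v = x0, 0.0
    trajectory = [x]
    for step in range(int(round(duration / dt))):
        a_t = target(step * dt)
        acceleration = (k * (a_t - x) - r * v) / m
        v += acceleration * dt  # update velocity first (semi-implicit Euler)
        x += v * dt
        trajectory.append(x)
    return trajectory

# A gesture that switches the target from 0 to 1 at time 0.
traj = simulate(lambda t: 1.0)
print(traj[0], traj[100], traj[-1])
```

With these settings the position approaches the target smoothly; raising `k` makes the movement faster, which is one way to picture the talker adjusting stiffness.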
Production Planning: Lexical Entry Turned Into a Gestural Score
“SPAT:”
Speech Motor Control: Gestural Score Drives Task Dynamics
Speech Production: Vocal Tract Shape Determines Acoustics
Prosody: Beyond Words
1. Prosodic Phrases
• Prosodic Phrasing = the PERCEPTUAL grouping of words
• Prosodic phrase boundaries are usually (not always) a subset of SYNTACTIC phrase boundaries
– “I like ginger | chocolate ice cream | and cigars”
– “I like ginger-chocolate ice cream | and cigars”
– “I bought a book from | the old used bookstore downtown”
• A hierarchy of phrases:
– Intonational phrase = 1-5 accent phrases
– Intermediate/Accent phrase = 1-5 prosodic words
– Prosodic word = 1-2 dictionary words, e.g., “the+open | door”
• Acoustic correlates of phrasing
– Phrase-final syllable is MUCH LONGER (typically 50-100%)
– Intonational phrase often followed by a PAUSE
– (Language-dependent): Phrase may end in a PHRASE TONE
  • Intermediate Phrase Tones in English: L-, H- (low and high)
  • Intonational Phrase Tones in English: L-L%, L-H%, H-L%, H-H%
2. Prominence/Pitch Accent
• Prominence: Usually, a listener can tell which syllable in an accent phrase the talker thinks is most important. That syllable is called “prominent.”
• Acoustic correlates of prominence (language-dependent):
– DURATION:
  • English, Dutch, and “stress-timed languages:” prominent syllables are longer
  • French, Japanese, and other “syllable-timed languages:” no
– HYPER-ARTICULATION: prominent syllables often more clearly pronounced
– ENERGY: prominent syllables are louder
– PITCH ACCENT (language-dependent)
  • English: extra high pitch: H*; extra low pitch: L*; various combinations (H*+L, L+H*, L*+H)
  • Swedish: single-peaked accents similar to English; double-peaked accents perhaps unique to Swedish
  • Japanese: F0 is high from beginning of accent phrase until prominent syllable, then drops
  • Chinese: lexical tone is HYPER-ARTICULATED (e.g., 3rd tone dips MORE than usual)
Example: “Massachusetts”
[Figure: two panels, Unaccented and Accented. Accented: /u/ is longer, louder]
Example: “(if they think they can drink and drive, and) get away with it, they’ll pay.”
[Figure: Probability of Voicing and Pitch tracks for “get away with it they’ll pay,” with the labels L* H-H%, HiF0, and H* L-L%]
Do Prominence and Phrasing Affect Tongue Movement?
(Fougeron and Keating, 1997)
• Experiment:
– Design an electropalate for each subject
  • Electropalate = a plastic insert covered with small electrodes.
  • When the tongue touches the palate, the touched electrodes detect contact
  • Keep track of the area and shape of tongue-palate contact as a function of time
– Subjects read carrier sentences, target word in different positions
  • “book” Prominent: “the red book holder, not the red basket holder”
  • “book” Non-prominent: “the red book holder, not the blue book holder”
  • “book” Phrase-final: “the red book, Holbert, not the blue book”
• Result:
– Prominent words: longer + much more tongue-palate contact
– Phrase-final words: longer; little change in tongue-palate contact
Do Prominence and Phrasing Affect the MFCCs?
(Borys, Hasegawa-Johnson, and Cole, 2003)
[Figure: two decision trees for clustering allophones of N]
Clustered Triphones: questions “R Vowel?” and “L Stop?”; leaves N-VOW, N, STOP+N. WER: 36.2%
Prosody-Dependent Allophones: questions “R Vowel?” and “Pitch Accent?”; leaves N-VOW, N, N*. WER: 25.4%
BUT: WER of baseline Monophone system = 25.1%
Prosody-dependent allophones: ASR clustering matches EPG
Consonant Clusters
Phrase Initial
Phrase Medial
Phrase Final
Accented Class 1 Class 3
Unaccented Class 2
Fougeron & Keating (1997) EPG Classes:
1. Strengthened
2. Lengthened
3. Neutral
Why is there a relationship between Prosody and Tongue Movement?
What’s the Scale of a Gestural Score?
[Figure: gestural score with gestures TT-CLOSED, TT-CLOSED, TT-FRIC, TB-CLOSED, TT-OPEN, TT-OPEN, VEL-OPEN; times t1 and t2 are marked on the time axis t]
• What is t2-t1 in seconds?
• How much does the tongue tip open? (How many cm?)
Prosodic Gestures (Byrd and Saltzman)
[Figure: gestural score with gestures TT-CLOSED, TT-FRIC, TB-CLOSED, TT-OPEN, TT-OPEN, VEL-OPEN, plus prosodic gestures SPATIAL-SCALE-LARGE, REDUCED, and TIME-SCALE-STRETCHED]
Convert Gestural Score to Tract Variable Targets
[Figure: block diagram. Labels: Gestural Score “Playback Head”; Time Scale for Gesture Playback; Spatial Scale for Gesture Playback; TT-CLOSED; Relative Time; s Prosodic Gestures; T Prosodic Gestures; Absolute Time; Tract Variable Targets a(t)]
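The playback step can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: that the time scale multiplies relative time to give absolute time, that the spatial scale shrinks or enlarges each target's distance from a neutral value, that both scales are constant over the utterance, and the tuple layout, names, and numbers themselves.

```python
def play_back(score, time_scale=1.0, spatial_scale=1.0, neutral=0.0):
    """Convert a gestural score in relative time to targets in absolute time.

    score: list of (relative_onset, relative_offset, tract_variable, target).
    """
    targets = []
    for onset, offset, variable, target in score:
        targets.append((
            onset * time_scale,                            # absolute onset
            offset * time_scale,                           # absolute offset
            variable,
            neutral + spatial_scale * (target - neutral),  # scaled target
        ))
    return targets

# Hypothetical score: a tongue-tip closure followed by an opening.
SCORE = [(0.0, 1.0, "TTCD", 0.0), (1.0, 2.0, "TTCD", 2.0)]
stretched = play_back(SCORE, time_scale=1.5, spatial_scale=0.5, neutral=1.0)
print(stretched)
```

In this sketch a time scale above 1 lengthens every gesture, and a spatial scale below 1 pulls every target toward the neutral value, which is one simple way to model a reduced articulation.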
Convert Tract Variable Targets to Tract Variables, Then to Acoustics
Prosodic Landmarks: Detecting Pitch Accents from F0 Contour
Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)
[Figure: “Input & Pitch Accent Target.” Curves: F0, Prob. of Voicing, Pitch Accent Target. Vertical axis: -0.2 to 1.2; horizontal axis: 1 to 369.]
The Time-Delay Recurrent Neural Network
(Kim, Neurocomputing, 1998)
[Figure: network diagram with time-delayed inputs (F0, Prob_Voice) passing through delay units (D), a time-delayed internal state, 1st and 2nd hidden layers, and an output layer with two units: Pitch Accented and Pitch Unaccented]
Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)
[Figure: “Activation of State Unit.” Curves: Output of Output Unit, Output of State Unit, Pitch Accent Target. Vertical axis: -0.2 to 1.2; horizontal axis: 1 to 369.]
Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)
Summary
• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
– The mental lexicon (our mental storage for words) is made of Gestures, not phonemes
– No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
– Constriction area matters more than non-constriction area
– Motor control model: only control the constrictions
– Tract variables
– Task dynamics
• Prosody
– Units of prosody: phrases and pitch accents
– Prosodic gestures: spatial scaling, time stretching
– Prosodic landmark detection