Mark Hasegawa-Johnson jhasegaw@uiuc University of Illinois at Urbana-Champaign, USA

Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology


Lecture 11. Articulatory Phonology

• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
– The mental lexicon (our mental storage for words) is made of Gestures, not phonemes
– Overlap among the gestures results in inter-gesture competition; competition can result in reduction and/or assimilation
– No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
– Constriction area matters more than non-constriction area
– Motor control model: only control the constrictions
– Tract variables
– Task dynamics
• Prosody
– Units of prosody: phrases and pitch accents
– Prosodic gestures: spatial scaling, time stretching
– Prosodic landmark detection

Pronunciation Variability (Read Speech)

Manner Class Assimilation: /t/ becomes part of the /n/

Vowel Reduction: /iy/ becomes /ix/

Pronunciation Variability (Read Speech)

Syllable Merger: “carry an” becomes “carin”

Vowel Reduction: /iy/ becomes /ax/

Autosegmental Phonology (Goldsmith, 1975)

• Inter-word phonological rules all have a simple form: manner or place assimilation

• Hypothesis: instructions to the speech articulators are arranged in “autosegmental tiers,” i.e., on a kind of musical score with asynchronous rows

• Assimilation = feature spreading

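[Figure: feature columns illustrating spreading.
/s/: [+strident], [+blade], [+anterior], [-nasal]
/sh/: [+strident], [+blade], [-anterior], [-nasal]
Assimilated pair: /sh/ listed with [+strident], [-nasal] only, followed by /sh/ with [+strident], [+blade], [-anterior], [-nasal]]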
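The claim that assimilation is feature spreading can be sketched in a few lines of Python. This is only an illustration: the function name `spread_place` and the choice of [blade] and [anterior] as the spreading features are assumptions made for the example, not part of the autosegmental formalism itself.

```python
# Each segment is a bundle of binary features, using the values
# listed for /s/ and /sh/ in the feature columns.
S = {"strident": True, "blade": True, "anterior": True, "nasal": False}
SH = {"strident": True, "blade": True, "anterior": False, "nasal": False}

# Assumption for this sketch: these are the features that spread.
PLACE_FEATURES = ("blade", "anterior")


def spread_place(first, second, features=PLACE_FEATURES):
    """Return a copy of `first` whose listed features are taken from `second`."""
    result = dict(first)
    for name in features:
        result[name] = second[name]
    return result


assimilated = spread_place(S, SH)
# After spreading, the first segment carries the same feature values as /sh/.
print(assimilated == SH)
```

The original `S` dictionary is left untouched, which mirrors the idea that spreading changes which features are linked to a segment rather than rewriting the lexical entry.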

Articulatory Phonology (Browman and Goldstein, 1990)

• Word is composed of “gestures”
• Gestures are MENTAL speech planning units, but they have close correspondence to articulatory controls

• Example: Mental Lexicon Entry for “she:”

1. TT-OPEN→FRICATIVE (/š/)

2. TT-LOC→PALATAL (/š/)

3. TB-OPEN→NARROW (whole word)

4. TB-LOC→PALATAL (whole word)

5. GLOTTIS-OPEN→WIDE (/š/) then GLOTTIS-OPEN→CRITICAL (/i/)

[Figure: gestural score with tiers LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING]

Articulatory Phonology (Browman and Goldstein, 1990)

• Rule-based Phonologies:
– Reduction and assimilation are CHANGES in the value of a distinctive feature, just like morpho-phonological processes
• Autosegmental Phonologies:
– Reduction and assimilation are SUBSTITUTIONS of a neighboring phone’s features in place of the current phone’s features
• Articulatory Phonology:
– “Frozen” word construction processes may result in the deletion or substitution of gestures in the lexicon, but…
– The process of sequencing words to create a sentence never deletes or changes any gesture; all gestures stay in the mental representation all the time!!
• Reduction and Assimilation can be explained by
– Overlap among gestures
– Competition among overlapping gestures, for control of the same articulators

Example: Manner-Class Assimilation

[Figure: gestural scores for “Don’t Ask.”
Careful Speech, phones /d/ /o/ /n/ /t/ /ae/ /s/ /k/: tongue tip tier TT-CLOSED, TT-CLOSED, TT-FRIC; tongue body tier TB-CLOSED, TB-OPEN, TB-OPEN.
Fast Speech, phones /d/ /o/ /n/ /ae/ /s/ /k/: tongue body tier TB-CLOSED, TB-OPEN, TB-OPEN.
Glottal tier labels in the two scores: GL-CLO, GL-OPEN, GL-CRIT]

What’s in the Lexicon? (Browman and Goldstein, 2000)

• Experimental Observation: consonant clusters at the beginning of a syllable (/sp/ in “spat”) show less production variability than consonant clusters at the end of a syllable (/ps/ in “taps”)
• Hypothesis: the mental lexicon includes GESTURES and PAIRWISE COUPLING CONSTRAINTS
– Two kinds of coupling: simultaneous or sequential
– Coda consonants FOLLOW the vowel, e.g. in “taps:” TB-WIDE→→LIP-CLOSED→→TT-CRITICAL
– Onset consonants are produced SIMULTANEOUSLY with the start of the tongue body vowel gesture, so in “spat” both TT-CRITICAL→→TB-WIDE and LIP-CLOSED→→TB-WIDE apply. Competition among them yields reduced variability in production.
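One way to picture a lexicon built from gestures plus pairwise coupling constraints is as a small data structure. The sketch below is hypothetical: the `Coupling` class, the `LEXICON` dictionary, and the `competitors` helper are names invented for illustration, and it encodes only the constraints listed for “taps” and “spat” (reading both arrows in “taps” as sequential couplings), not complete lexical entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coupling:
    first: str   # gesture written on the left of the arrow
    second: str  # gesture written on the right of the arrow
    kind: str    # "sequential" or "simultaneous"


# Only the constraints named in the text are encoded here.
LEXICON = {
    "taps": [
        Coupling("TB-WIDE", "LIP-CLOSED", "sequential"),
        Coupling("LIP-CLOSED", "TT-CRITICAL", "sequential"),
    ],
    "spat": [
        Coupling("TT-CRITICAL", "TB-WIDE", "simultaneous"),
        Coupling("LIP-CLOSED", "TB-WIDE", "simultaneous"),
    ],
}


def competitors(word, target):
    """List gestures simultaneously coupled to the same `target` gesture."""
    return [c.first for c in LEXICON[word]
            if c.kind == "simultaneous" and c.second == target]


print(competitors("spat", "TB-WIDE"))  # two onset gestures compete
print(competitors("taps", "TB-WIDE"))  # no simultaneous couplings
```

In this representation the onset cluster of “spat” has two gestures tied to the same vowel gesture, which is the configuration the hypothesis associates with reduced variability.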

Production Planning: Lexical Entry Turned Into a Gestural Score

“SPAT:”

From Gestural Score to Acoustics

• Perturbation Theory (Chiba and Kajiyama, 1941) showed that $\delta F_n \propto \delta \log A(x) \approx \delta A(x)/A(x)$
• The audibility of a change $\delta A(x)$ is proportional to $1/A(x)$
– Changes near a constriction (small $A(x)$) are very audible
– Changes elsewhere (large $A(x)$) are not very audible
• Therefore, talkers carefully control $A(x)$ only near a constriction:
– Inter-utterance variability of $A(x)$ is an increasing function of $1/A(x)$ (Perkell and Nelson, JASA 1985): $E[(A(x)-\bar{A}(x))^2] \sim 1/\bar{A}(x)$, where $\bar{A}(x) \equiv E[A(x)]$
– Inter-talker variability of $A(x)$ is an increasing function of $1/A(x)$ (Hasegawa-Johnson et al., JSLHR 2003)
– Inter-talker variability of $\log A(x)$ is independent of $A(x)$ (Hasegawa-Johnson et al., JSLHR 2003): $E[(\log A(x)-\log \bar{A}(x))^2] \sim \text{constant}$
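The 1/A(x) dependence is easy to check numerically. The sketch below applies the same absolute area change at a narrow and at a wide point of the tract and compares the resulting change in log area with the first-order approximation (change in area divided by area). The area values and the size of the change are hypothetical numbers chosen only for illustration.

```python
import math


def log_area_change(area, delta):
    """Exact change in log A when the area goes from `area` to `area + delta`."""
    return math.log(area + delta) - math.log(area)


delta = 0.05  # cm^2, same absolute change at both places (hypothetical value)
for label, area in (("constriction", 0.2), ("open tract", 4.0)):
    exact = log_area_change(area, delta)
    approx = delta / area  # first-order approximation: delta A / A
    print(f"{label}: A = {area} cm^2, "
          f"change in log A = {exact:.4f}, delta/A = {approx:.4f}")
```

The same 0.05 cm² change produces a far larger change in log area at the constriction than in the open part of the tract, which is the sense in which changes near a constriction are more audible.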

Constriction Control as a Model of Speech Motor Control (Stevens and House, JASA, 1955)

• Vocal tract shape controlled by just three control parameters:
– xPOS = POSition of tongue constriction
– rCD = Constriction Degree = radius of the constriction
– rLIP = effective radius of the lip constriction
• All other vocal tract areas determined by $A(x) = \pi\, r(x)^2$, where
$r(x) = 0.7 + 0.144\,x^2$, for $0 \le x \le 2.75$ (larynx)
$r(x) = \min\!\left(1.6,\ \mathrm{rCD} + 0.025\,(1.2 - \mathrm{rCD})\,(x - \mathrm{xPOS})^2\right)$, for $2.75 \le x \le \mathrm{xPOS}$ (pharynx)
$r(x) = \mathrm{rCD} + 0.025\,(1.2 - \mathrm{rCD})\,(x - \mathrm{xPOS})^2$, for $\mathrm{xPOS} \le x \le 17$ (mouth)
$r(x) = \mathrm{rLIP}$, for $17 \le x \le 18$ (lips)
• x and r(x) are in centimeters, A(x) in cm²
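The piecewise definition of r(x) above translates directly into a short function. The sketch below is a minimal reading of the three-parameter model, under stated assumptions: the radius grows quadratically away from the constriction (a plus sign in front of the quadratic term, so that the 1.6 cm cap is meaningful), the area is that of a circle of radius r(x), the constriction lies between 2.75 and 17 cm, and points shared by two regions are assigned as noted in the comments. The function names and the example parameter values are hypothetical.

```python
import math


def radius(x, x_pos, r_cd, r_lip):
    """Vocal tract radius in cm at distance x (cm) along the tract.

    Assumes 2.75 <= x_pos <= 17.
    """
    if not 0.0 <= x <= 18.0:
        raise ValueError("x must lie between 0 and 18 cm")
    if x < 2.75:  # larynx (x = 2.75 itself is treated as pharynx)
        return 0.7 + 0.144 * x ** 2
    if x >= 17.0:  # lips (x = 17 itself is treated as lips)
        return r_lip
    # Assumption: "+" sign, so the radius is smallest at x_pos.
    r = r_cd + 0.025 * (1.2 - r_cd) * (x - x_pos) ** 2
    if x <= x_pos:  # pharynx: capped at 1.6 cm
        return min(1.6, r)
    return r  # mouth


def area(x, x_pos, r_cd, r_lip):
    """Cross-sectional area in cm^2, assuming a circular cross-section."""
    return math.pi * radius(x, x_pos, r_cd, r_lip) ** 2


# Hypothetical parameter settings, one area sample per centimeter.
for x in range(0, 19):
    print(x, round(area(float(x), x_pos=10.0, r_cd=0.3, r_lip=0.5), 3))
```

Changing only `x_pos`, `r_cd`, and `r_lip` reshapes the whole area function, which is the point of the model: three numbers stand in for the full vocal tract shape.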

Examples: Vowel /a/

Examples: Vowel /i/

Extending the Model: Tract Variables (Saltzman and Munhall, 1989)

• Languages treat tongue tip and tongue body differently, e.g., both can have constrictions at the same time
– Therefore split (xPOS, rCD) → (TTPOS, TTCD, TBPOS, TBCD)
• Talkers can independently control lip area and lip length
– Therefore split (rLIP) → (LIPCD, LIPPOS)
• Soft palate (“velum”) control: open vs. closed
– Therefore we need a control variable VELCD
• Glottis control: open (breathy), critical (voiced), closed (glottal stop)
– Control variable GLOCD
• The tract variable model: speech is controlled by a mental controller with an 8-dimensional control vector:
$a(t) = [\mathrm{LIPCD}, \mathrm{LIPPOS}, \mathrm{TTCD}, \mathrm{TTPOS}, \mathrm{TBCD}, \mathrm{TBPOS}, \mathrm{VELCD}, \mathrm{GLOCD}]^T$

Tract Variables

Task Dynamics: Connecting Gestures to Tract Variables (Saltzman and Munhall, 1989)

• Lexicon Gestures sequenced into a GESTURAL SCORE
• The Gestural Score is “played” like a musical score. Each Gesture onset is turned into TRACT VARIABLE TARGETS, a(t).
• Relationship between tract variable targets, a(t), and physical articulator positions, x(t), given by a 2nd order system:
$M \frac{d^2x}{dt^2} = K(t)\,\bigl(a(t) - x(t)\bigr) - R\,\frac{dx}{dt}$
– K(t) = effective tract-variable-stiffness matrix; controlled by the talker, but varies more slowly than a(t)
– M = effective mass matrix
– R = effective damping matrix
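To see how a target a(t) pulls an articulator position x(t) toward it, the second-order system can be integrated numerically. The sketch below is a simplification under stated assumptions: it treats a single scalar tract variable instead of the full matrix equation, holds the stiffness K constant rather than letting it vary with time, and uses hypothetical values of M, K, and R chosen to give critical damping (R = 2·sqrt(K·M)). The function name `simulate` is invented for the example.

```python
def simulate(targets, stiffness=100.0, mass=1.0, damping=20.0,
             dt=0.001, x0=0.0):
    """Integrate M x'' = K (a - x) - R x' for one scalar tract variable.

    `targets` holds one sample of a(t) per time step of length `dt`.
    """
    x, v = x0, 0.0
    trajectory = []
    for a in targets:
        accel = (stiffness * (a - x) - damping * v) / mass
        v += accel * dt  # semi-implicit Euler: update the velocity first,
        x += v * dt      # then the position, using the new velocity
        trajectory.append(x)
    return trajectory


# Step target: a(t) = 0 for 0.1 s, then a(t) = 1 for 1.9 s.
targets = [0.0] * 100 + [1.0] * 1900
trajectory = simulate(targets)
print(trajectory[99], trajectory[-1])
```

The position stays at rest while the target is zero and then approaches the new target smoothly instead of jumping to it, which is how a discrete gesture onset in the score becomes continuous articulator motion.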

Production Planning: Lexical Entry Turned Into a Gestural Score

“SPAT:”

Speech Motor Control: Gestural Score Drives Task Dynamics

Speech Production: Vocal Tract Shape Determines Acoustics

Prosody: Beyond Words

1. Prosodic Phrases
• Prosodic Phrasing = the PERCEPTUAL grouping of words
• Prosodic phrase boundaries are usually (not always) a subset of SYNTACTIC phrase boundaries
– “I like ginger | chocolate ice cream | and cigars”
– “I like ginger-chocolate ice cream | and cigars”
– “I bought a book from | the old used bookstore downtown”
• A hierarchy of phrases:
– Intonational phrase = 1-5 accent phrases
– Intermediate/Accent phrase = 1-5 prosodic words
– Prosodic word = 1-2 dictionary words, e.g., “the+open | door”
• Acoustic correlates of phrasing
– Phrase-final syllable is MUCH LONGER (typically 50-100%)
– Intonational phrase often followed by a PAUSE
– (Language-dependent): Phrase may end in a PHRASE TONE
    • Intermediate Phrase Tones in English: L-, H- (low and high)
    • Intonational Phrase Tones in English: L-L%, L-H%, H-L%, H-H%

2. Prominence/Pitch Accent
• Prominence: Usually, a listener can tell which syllable in an accent phrase the talker thinks is most important. That syllable is called “prominent.”
• Acoustic correlates of prominence (language-dependent):
– DURATION:
    • English, Dutch, and “stress-timed languages:” prominent syllables are longer
    • French, Japanese, and other “syllable-timed languages:” no
– HYPER-ARTICULATION: prominent syllables often more clearly pronounced
– ENERGY: prominent syllables are louder
– PITCH ACCENT (language-dependent)
    • English:
        – Extra high pitch: H*
        – Extra low pitch: L*
        – Various combinations (H*+L, L+H*, L*+H)
    • Swedish:
        – Single-peaked accents similar to English
        – Double-peaked accents perhaps unique to Swedish
    • Japanese: F0 is high from beginning of accent phrase until prominent syllable, then drops
    • Chinese: Lexical tone is HYPER-ARTICULATED (e.g., 3rd tone dips MORE than usual)

Example: “Massachusetts”
[Figure: unaccented and accented productions. Accented: /u/ is longer, louder]

Example: “(if they think they can drink and drive, and) get away with it, they’ll pay.”
[Figure: probability of voicing and pitch tracks for “get away with it they’ll pay”; tone labels L* H-H%, HiF0, H* L-L%]

Do Prominence and Phrasing Affect Tongue Movement?

(Fougeron and Keating, 1997)

• Experiment:
– Design an electropalate for each subject
    • Electropalate = a plastic insert covered with small electrodes.
    • When the tongue touches the palate, the touched electrodes detect contact
    • Keep track of the area and shape of tongue-palate contact as a function of time
– Subjects read carrier sentences, target word in different positions
    • “book” Prominent: “the red book holder, not the red basket holder”
    • “book” Non-prominent: “the red book holder, not the blue book holder”
    • “book” Phrase-final: “the red book, Holbert, not the blue book”
• Result:
– Prominent words: longer + much more tongue-palate contact
– Phrase-final words: longer; little change in tongue-palate contact

Do Prominence and Phrasing Affect the MFCCs?

(Borys, Hasegawa-Johnson, and Cole, 2003)

[Figure: two decision trees for clustering allophones of N.
Clustered Triphones: questions “R Vowel?” and “L Stop?”; leaves N-VOW, N, STOP+N. WER: 36.2%
Prosody-Dependent Allophones: questions “R Vowel?” and “Pitch Accent?”; leaves N-VOW, N, N*. WER: 25.4%]

BUT: WER of baseline Monophone system = 25.1%

Prosody-dependent allophones: ASR clustering matches EPG

[Table: Consonant Clusters. Columns: Phrase Initial, Phrase Medial, Phrase Final. Rows: Accented (Class 1, Class 3), Unaccented (Class 2)]

Fougeron & Keating (1997) EPG Classes:
1. Strengthened
2. Lengthened
3. Neutral

Why is there a relationship between Prosody and Tongue Movement?

What’s the Scale of a Gestural Score?


[Figure: gestural score with gestures TB-CLOSED, TT-OPEN, TT-OPEN, and VEL-OPEN; time axis t with marks t1 and t2]

• What is t2-t1 in seconds?
• How much does the tongue tip open? (How many cm?)

Prosodic Gestures (Byrd and Saltzman)

[Figure: gestural score with gestures TT-CLOSED, TT-FRIC, TB-CLOSED, TT-OPEN, TT-OPEN, VEL-OPEN, plus prosodic gesture tiers: SPATIAL-SCALE-LARGE, REDUCED; TIME-SCALE-STRETCHED]

Convert Gestural Score to Tract Variable Targets

[Block diagram labels: Gestural Score “Playback Head” (example gesture TT-CLOSED); Time Scale for Gesture Playback; Spatial Scale for Gesture Playback; Relative Time; Absolute Time; s Prosodic Gestures; T Prosodic Gestures; output: Tract Variable Targets a(t)]

Convert Tract Variable Targets to Tract Variables, Then to Acoustics

Prosodic Landmarks: Detecting Pitch Accents from F0 Contour

Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Signal Processing Letters, 2003)

[Figure: Input & Pitch Accent Target. Traces: F0, Prob. of Voicing, Pitch Accent Target. Horizontal axis: 1 to 369; vertical axis: -0.2 to 1.2]

The Time-Delay Recursive Neural Network (Kim, Neurocomputing, 1998)

[Figure: network diagram. Labels: Time-Delayed Inputs (F0, Prob_Voice) with delay units D; Time-Delayed Internal State; 1st Hidden Layer; 2nd Hidden Layer; Output Layer with units Pitch Accented and Pitch Unaccented]


[Figure: Activation of State Unit. Traces: Output of Output Unit, Output of State Unit, Pitch Accent Target. Horizontal axis: 1 to 369; vertical axis: -0.2 to 1.2]


Summary

• Surface phonology problems: reduction, assimilation, deletion
• Articulatory Phonology
– The mental lexicon (our mental storage for words) is made of Gestures, not phonemes
– No mental concept of “sequencing”; instead, the mental representation includes pair-wise coupling constraints between gestures
• Speech motor control
– Constriction area matters more than non-constriction area
– Motor control model: only control the constrictions
– Tract variables
– Task dynamics
• Prosody
– Units of prosody: phrases and pitch accents
– Prosodic gestures: spatial scaling, time stretching
– Prosodic landmark detection
