Time Frames of Spoken Language Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 steveng

Time Frames of

Spoken Language

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704

http://www.icsi.berkeley.edu/[email protected]

In Collaboration with Hannah Carvey, Leah Hitchcock and Shawn Chang

Acknowledgements and Thanks

Statistical Analysis and Automatic Classification: Hannah Carvey, Shawn Chang, Leah Hitchcock

Research Funding: U.S. National Science Foundation, U.S. Department of Defense

For Further Information

Consult the web site:

www.icsi.berkeley.edu/~steveng

OVERTURE

The Central Challenge for Models of Speech Recognition

Language - The Traditional Perspective

The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

Cat = [k] + [ae] + [t]

Cat = /k/ + /ae/ + /t/

The Serial Frame Perspective on Speech

Traditional models of speech recognition assume that the identity of a phonetic segment is derived from a detailed spectral profile of the acoustic signal (provided courtesy of the auditory system) computed for each interval (frame) of speech

(this is literally how automatic speech recognition systems decode the speech signal)

Challenge Number One

Pronunciation Variability

Pronunciation Variability of Real Speech

Pronunciation patterns encountered in everyday life are extremely diverse

There are literally dozens of ways in which common words are pronounced (as the following two slides illustrate for the word “and”, based on manual phonetic annotation of a corpus comprising telephone dialogues)

How Many Pronunciations of “and”?

N   Pronunciation     N   Pronunciation
82  ae n              3   eh
63  eh n              2   ae n dcl
45  ix n              2   ae
35  ax n              2   ax m
34  en                2   ax n d
30  n                 2   ae eh n dcl d
20  ae n dcl d        2   eh n dcl d
17  ih n              2   ax nx
17  q ae n            2   q ae ae n
11  ae n d            2   q ix n
7   q eh n            2   ix n dcl d
7   ae nx             2   ih
6   ae ae n           2   eh eh n
6   ah n              2   q eh nx
5   eh nx             2   ix d n
4   uh n              1   eh m
4   ix nx             1   ax n dcl d
4   q ae n dcl d      1   aw n
3   eh n d            1   ae q
3   q ae nx           1   eh dcl

Canonical pronunciation

How Many Pronunciations of “and”?

N   Pronunciation     N   Pronunciation
1   ah nx             1   m
1   ae n t            1   ae ae n d
1   eh d              1   nx
1   ah n dcl d        1   q ae ae n
1   ey ih n dcl       1   q ae ae n dcl d
1   ae ix n           1   q ae eh n dcl d
1   ae nx ax          1   q ae ih n
1   ax ng             1   aa n
1   ay n              1   q ae n d
1   ih ah n d         1   ? nx
1   ae hh             1   q ae n q
1   ih ng             1   eh n m
1   ix                1   q eh en dcl
1   ae n d dcl        1   eh ng
1   ix dcl d          1   q eh n q
1   ae eh n           1   em
1   hh n              1   q eh ow m
1   ix n t            1   q ih n
1   ae ax n dcl d     1   q ix en
1   iy eh n           1   er

Pronunciation Variability of Real Speech

There are literally dozens of ways in which common words are pronounced, as the following slide illustrates for the 20 most frequent words from the same corpus (Switchboard)

How Many Different Pronunciations?

Rank  Word      N    #Pron  MCP %  Most Common Pronunciation (MCP)
1     I         649  53     53     ay
2     and       521  87     16     ae n
3     the       475  76     27     dh ax
4     you       406  68     20     y ix
5     that      328  117    11     dh ae
6     a         319  28     64     ax
7     to        288  66     14     tcl t uw
8     know      249  34     56     n ow
9     of        242  44     21     ax v
10    it        240  49     22     ih
11    yeah      203  48     43     y ae
12    in        178  22     45     ih n
13    they      152  28     60     dh ey
14    do        131  30     54     dcl d uw
15    so        130  14     74     s ow
16    but       123  45     12     bcl b ah tcl t
17    is        120  24     50     ih z
18    like      119  19     46     l ay kcl k
19    have      116  22     54     hh ae v
20    was       111  24     23     w ah z
21    we        108  13     83     w iy
22    it's      101  14     20     ih tcl s
23    just      101  34     17     jh ix s
24    on        98   18     49     aa n
25    or        94   23     36     er
26    not       92   24     24     m aa q
27    think     92   23     32     th ih ng kcl k
28    for       87   19     46     f er
29    well      84   49     23     w eh l
30    what      82   40     14     w ah dx
31    about     77   46     12     ax bcl b aw
32    all       74   27     24     ao l
33    that's    74   19     16     dh eh s
34    oh        74   17     61     ow
35    really    71   25     45     r ih l iy
36    one       69   8      78     w ah n
37    are       68   19     42     er
38    I'm       67   9      26     q aa m
39    right     61   21     28     r ay
40    uh        60   16     41     ah
41    them      60   18     23     ax m
42    at        59   3 6 8         ae dx
43    there     58   28     22     dh eh r
44    my        58   9      66     m ay
45    mean      56   10     58     m iy n
46    don't     56   21     14     dx ow
47    no        55   8      77     n ow
48    with      55   20     35     w ih th
49    if        55   18     41     ih f
50    when      54   18     31     w eh n
51    can       54   28     15     kcl k ae n
52    then      51   19     38     dh eh n
53    be        50   11     76     bcl b iy
54    as        49   16     18     ae z
55    out       47   19     22     ae dx
56    kind      47   17     21     kcl k ax nx
57    because   46   31     15     kcl k ax z
58    people    45   21     44     pcl p iy pcl l el
59    go        45   5      83     gcl g ow
60    got       45   32     15     gcl g aa
61    this      44   11     47     dh ih s
62    some      43   4      48     s ah m
63    would     41   16     29     w ih dcl
64    things    41   15     52     th ih ng z
65    now       39   11     69     n aw
66    lot       39   9      47     l aa dx
67    had       39   19     24     hh ae dcl
68    how       39   11     53     hh aw
69    good      38   13     27     gcl g uh dcl
70    get       38   20     13     gcl g eh dx
71    see       37   6      80     s iy
72    from      36   10     28     f r ah m
73    he        36   7      39     iy
74    me        35   5      87     m iy
75    don't     35   21     14     dx ow
76    their     33   19     25     dh eh r
77    more      32   11     56     m ao r
78    it's      31   14     20     ih tcl s
79    that's    31   20     16     dh eh s
80    too       31   6      60     tcl t uw
81    okay      31   17     45     ow kcl k ey
82    very      30   11     36     v eh r iy
83    up        30   11     34     ah pcl p
84    been      30   11     51     bcl b ih n
85    guess     29   8      42     gcl g eh s
86    time      29   8      62     tcl t ay m
87    going     29   21     13     gcl g ow ih ng
88    into      28   20     14     ih n tcl t uw
89    those     27   12     42     dh ow z
90    here      27   11     25     hh iy er
91    did       27   13     23     dcl d ih dx
92    work      25   8      66     w er kcl k
93    other     25   14     26     ah dh er
94    an        25   12     28     ax n
95    I've      25   7      46     ay v
96    thing     24   9      52     th ih ng
97    even      24   7      40     iy v ix n
98    our       23   9      33     aa r
99    any       23   11     23     ix n iy
100   we're     23   8      25     w ey r

The 20 most frequent words account for 35% of the tokens

QUESTION

How do listeners decode the speech signal given the large amount of pronunciation variation?

Challenge Number Two

Acoustic Variability

Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

Yet, the intelligibility of speech is remarkably stable (unless the amount of reverberation or background noise is truly extreme)

How can this be so?

QUESTION

Is there some acoustic property that provides a basis for perceptual stability of the speech signal?

An Invariant Property of the Speech Signal?

Low-frequency energy fluctuations of the pressure waveform are largely preserved under many acoustic-interference conditions

[Figure: Modulation Spectrum; based on an illustration by Hynek Hermansky]

In reverberant environments the modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz (but is largely preserved)

(“What is the modulation spectrum?” you ask) Let’s find out!

Modulation Spectrum Computation
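
The computation diagram for this slide did not survive extraction. As a rough sketch only, and not necessarily the procedure used in the presentation, a broadband modulation spectrum can be approximated by taking the frame-level energy envelope of the waveform and computing the magnitude spectrum of that envelope. The function names, the 100-Hz envelope rate, the Hann window and the 20-Hz upper limit are all assumptions chosen for illustration; a full analysis would typically do this separately per frequency band.

```python
import math
import random

def rms_envelope(signal, sample_rate, env_rate=100):
    """Frame-level RMS energy, sampled env_rate times per second."""
    hop = sample_rate // env_rate
    n_frames = len(signal) // hop
    return [
        math.sqrt(sum(x * x for x in signal[i * hop:(i + 1) * hop]) / hop)
        for i in range(n_frames)
    ]

def modulation_spectrum(signal, sample_rate, env_rate=100, max_mod_hz=20.0):
    """Magnitude spectrum of the energy envelope, up to max_mod_hz."""
    env = rms_envelope(signal, sample_rate, env_rate)
    n = len(env)
    mean = sum(env) / n
    # Remove the DC term and apply a Hann window to limit spectral leakage.
    win = [(e - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
           for i, e in enumerate(env)]
    freqs, mags = [], []
    k = 1
    while k * env_rate / n <= max_mod_hz:
        # Direct DFT of the windowed envelope at bin k.
        re = sum(w * math.cos(2 * math.pi * k * i / n) for i, w in enumerate(win))
        im = sum(w * math.sin(2 * math.pi * k * i / n) for i, w in enumerate(win))
        freqs.append(k * env_rate / n)
        mags.append(math.hypot(re, im))
        k += 1
    return freqs, mags

def demo_peak_hz(mod_hz=4.0, seconds=5.0, sample_rate=4000, seed=0):
    """Peak modulation frequency of noise amplitude-modulated at mod_hz."""
    rng = random.Random(seed)
    n = int(seconds * sample_rate)
    signal = [(1.0 + 0.8 * math.sin(2 * math.pi * mod_hz * t / sample_rate))
              * rng.gauss(0.0, 1.0) for t in range(n)]
    freqs, mags = modulation_spectrum(signal, sample_rate)
    return freqs[mags.index(max(mags))]
```

The demo uses synthetic amplitude-modulated noise rather than speech, so it only shows that the envelope spectrum peaks at the imposed modulation rate.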

Intelligibility and the Modulation Spectrum

Significant attenuation (or distortion) of the modulation spectrum results in an appreciable decline in the ability to understand spoken language

Why should this be so?

Greenberg and Arai (1998)

Anatomy of the Modulation Spectrum

Why is the modulation spectrum’s integrity so crucial for intelligibility?

What does it reflect linguistically?

Why is the bandwidth of the modulation spectrum associated with (intelligible) speech so broad?

[Figure: Modulation spectrum of 40 TIMIT sentences (computed across a 6-kHz bandwidth)]

Does the modulation spectrum reflect a unitary property of the speech signal? Or something more complex?


The Modulation Spectrum Reflects Syllables

The peak in the modulation spectrum (for speech) is ca. 5 Hz (a period of 200 ms)

The distribution associated with SYLLABLE DURATION is similar to the pattern of the MODULATION SPECTRUM, suggesting that the latter reflects SYLLABLES

[Figure: Modulation spectrum of a short excerpt from the Switchboard Corpus, overlaid with the syllable duration distribution (in terms of equivalent modulation frequency) associated with a 30-minute subset of Switchboard]
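
Expressing syllable duration "in terms of equivalent modulation frequency" amounts to taking the reciprocal of the duration, so that 200 ms corresponds to 5 Hz. A minimal sketch of that conversion follows; the function names and the 1-Hz bin width are hypothetical, chosen only for illustration.

```python
def equivalent_modulation_hz(duration_ms):
    """Modulation frequency (Hz) whose period equals the given duration."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return 1000.0 / duration_ms

def duration_histogram_hz(durations_ms, bin_width_hz=1.0):
    """Count durations per modulation-frequency bin (keyed by lower bin edge)."""
    counts = {}
    for d in durations_ms:
        edge = int(equivalent_modulation_hz(d) // bin_width_hz) * bin_width_hz
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))
```

A histogram built this way from measured syllable durations can be laid directly over a modulation spectrum, which is the comparison the slide describes.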

The Trouble with Syllables …

The question thus arises …

If the modulation spectrum truly reflects syllables in the speech signal, why is the distribution of syllable duration so broad?

And does this variability in syllable duration reflect something significant?

[Figure: Modulation spectrum of 15 minutes of spontaneous Japanese speech (OGI-TS corpus) compared with the syllable duration distribution (expressed as modulation frequency) for the same material (Arai and Greenberg, 1997)]

PART ONE

What Underlies

Variation in Word Duration?

Word Duration

Most words (81%) in the Switchboard corpus are monosyllabic, and most of the remainder are disyllabic (together comprising 95% of the words)

The distribution of word duration therefore largely parallels that of syllables (plotted in units of duration [ms] on a logarithmic scale)

[Figure: duration distribution for All Words]

What Underlies Word Duration Variability?

Is this distribution of lexical duration of a uniform nature (and source)? Or does it reflect a more complex set of phenomena?

It has been observed for WRITTEN text that the more frequent words tend to be shorter and the less common words longer (i.e., Zipf’s law of abbreviation)

Does such a relationship hold for spoken language?

Let’s find out!

Is Word Duration Related to Word Frequency?

Word duration (derived from the phonetically annotated portion of the Switchboard corpus) can be plotted relative to frequency of occurrence

[Figure: scatter plot of word duration (ms; axis 0-500) against number of occurrences (logarithmic axis, 1-1000); r = -0.42; words with fewer than 5 instances omitted from graph]

Such an exercise shows that there is a WEAK relationship (r = -0.42) between lexical (unigram) frequency and word duration

There is a lot of variability in word duration for any given frequency range, suggesting that lexical frequency, alone, is unlikely to account for variation in word duration
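
A correlation of this kind can be sketched as a Pearson coefficient between word frequency and mean word duration. The slides do not say whether r was computed on raw or log-transformed counts; the sketch below assumes a log10 transform (to match the logarithmic axis) and applies the stated cutoff of 5 instances. The function names and the input layout are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def frequency_duration_r(words, min_count=5):
    """words: iterable of (occurrence_count, mean_duration_ms) pairs.

    Words with fewer than min_count instances are dropped, and the
    count is log10-transformed (an assumption, see the note above).
    """
    kept = [(math.log10(count), dur) for count, dur in words if count >= min_count]
    return pearson_r([x for x, _ in kept], [y for _, y in kept])
```

A value near zero would indicate no linear relationship; the reported r = -0.42 means frequency accounts for only a modest share of the variance in duration (r squared is about 0.18).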


If Not (entirely) Word Frequency, Then What?

One parameter that might be more directly related to word duration (and other durational properties of speech) is STRESS ACCENT

Stress accent is related to the emphasis (or prominence) associated with individual syllables within a word

Although dictionaries list the stress patterns associated with words, this information is but a rough guide to the actual patterns observed (as is the phonetic pronunciation provided in the dictionary)

In order to obtain empirical data pertaining to stress accent, it is necessary to manually annotate a corpus (syllable by syllable)

This manual annotation has been performed for a 45-minute subset of the Switchboard corpus, which has also been labeled with respect to phonetic segments, syllables and words

It is thus possible to ascertain the relationship between stress accent and duration at the level of the word, syllable and phonetic segment

The remainder of this presentation focuses on the statistical relationship between stress accent and duration at these different linguistic tiers

Before examining these data, let’s briefly consider the nature of the annotated material (this is important for evaluating the reliability of the results obtained)

INTERMEZZO

Being Phonetically (and Prosodically)

Annotated

Phonetic Transcription of Spontaneous English

Telephone dialogues of 5-10 minutes' duration, from the SWITCHBOARD corpus, have been phonetically annotated (labeled and segmented)

Most of this Material has been Manually Annotated
4 hours labeled at the phone level and segmented at the syllabic level
1 hour labeled and segmented at the phonetic-segment level
The remaining material has been segmented at the phonetic-segment level using automatic methods
45 minutes of stress-accent-labeled material
An additional four hours of material automatically labeled with respect to accent (this latter material is not used in the current analysis, but will be available soon)

There is a Lot of Diversity in the Material Transcribed
Spans speech of both genders (ca. 50/50%), reflecting a wide range of American dialectal variation, speaking rate and voice quality

Transcription System
A variant of Arpabet (which was also used for transcription of the TIMIT corpus)

The Data are Available at ….

http://www.icsi.berkeley.edu/real/stp

Phonetic Transcription

How was the Labeling and Segmentation Performed?

VERY carefully …. by UC-Berkeley linguistics students, using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (automatic estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations

Additionally, automatic segmentation and labeling of articulatory manner was used as a guide for phonetic labeling and segmentation in recent work

Annotation of Stress Accent

Forty-five minutes of the phonetically annotated portion of the Switchboard corpus was manually labeled with respect to stress accent

Three levels of accent were distinguished: Heavy, Light, None

(In actuality, labelers assigned a “1” to fully accented syllables, a “null” to completely unaccented syllables, and a “0.5” to all others)

An example of the annotation (attached to the vocalic nucleus) is shown below (where the accent levels could not be derived from a dictionary)

[Figure: annotated example]

In this example most of the syllables are unaccented, with two labeled as lightly accented (0.5) (and one other labeled as very lightly accented (0.25))
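
The labeling scheme can be sketched as a small mapping from the labelers' marks to the three accent levels. Two points below are assumptions rather than statements from the slides: any intermediate mark (0.5, or the occasional 0.25) is folded into "light", and a word takes the level of its most accented syllable. The second point generalizes the later definition of heavily accented words as those containing at least one heavily accented syllable. All names are hypothetical.

```python
_RANK = {"none": 0, "light": 1, "heavy": 2}

def syllable_level(label):
    """Map a labeler's mark (1, an intermediate value, or None for 'null')."""
    if label is None:
        return "none"
    if label >= 1.0:
        return "heavy"
    return "light"  # assumption: every intermediate mark counts as light

def word_level(syllable_labels):
    """Accent level of a word = level of its most accented syllable (assumed)."""
    if not syllable_labels:
        raise ValueError("a word needs at least one syllable label")
    levels = [syllable_level(label) for label in syllable_labels]
    return max(levels, key=_RANK.__getitem__)
```

For instance, a disyllabic word labeled (null, 0.5) would count as lightly accented under these assumptions.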

PART TWO

The Relation between

Stress Accent and Word Duration

Back to Stress Accent and Word Duration…

Stress accent is supposed to bear some systematic relation to three principal acoustic parameters of the speech signal: Fundamental Frequency, Amplitude, Duration

In previous studies my colleagues and I have shown that f0-related cues play a relatively small role in stress accent assignment (at least for spontaneous American English material)

Amplitude and duration appear to play a far more important role than f0

Therefore, it is not unreasonable to assume that the stress accent patterns associated with words bear some tangible relation to lexical duration

So …, let’s find out!

Word Duration and Stress Accent Level

Let’s first examine the durational properties of heavily accented words (these are words containing at least one heavily accented syllable)

The mean duration of this subset (36%) is 378 ms (s.d. = 168 ms)

Most of the heavily accented words are longer than 200 ms

Let’s now compare the duration of the heavily accented words with that of their lightly accented counterparts (25% of the total)

The mean duration of this subset is 255 ms (s.d. = 116 ms)

In many respects the durational properties of these two subsets are similar

Let’s now compare the duration of unaccented words with that of their accented counterparts

The mean duration of the unaccented subset (39%) is 149 ms (s.d. = 78 ms)

The unaccented words are generally shorter than 200 ms and have a very different distributional form from their accented counterparts

Let’s now compare the durational properties of ALL WORDS in the corpus with those pertaining to words of varying accent levels

[Figure: word-duration distributions for Heavily Accented, Lightly Accented, Unaccented and All Words]

When we do so, we notice that the left-hand branch of the lexical distribution largely reflects unaccented words, while the right-hand branch reflects mostly accented words (with the peak reflecting both)

Therefore, it appears that the broad distribution of word duration (and, in turn, syllable duration) largely reflects the co-existence of accented and unaccented words within spontaneous speech

What are the implications of this insight?
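
The claim that the broad overall distribution arises from mixing accent classes can be checked with simple arithmetic on the summary figures quoted above. The sketch below treats the three classes as components of a mixture and derives the pooled mean and standard deviation; the mixture formula is standard, but applying it to these rounded slide values is only an illustration, and the names are hypothetical.

```python
import math

# (proportion of words, mean duration in ms, s.d. in ms), as quoted on the slides
ACCENT_CLASSES = {
    "heavy": (0.36, 378.0, 168.0),
    "light": (0.25, 255.0, 116.0),
    "none": (0.39, 149.0, 78.0),
}

def pooled_mean_and_sd(classes):
    """Mean and s.d. of a mixture, from per-class proportions, means and s.d.s."""
    total = sum(p for p, _, _ in classes.values())
    mean = sum(p * m for p, m, _ in classes.values()) / total
    # E[X^2] of the mixture is the weighted sum of (variance + mean^2).
    second_moment = sum(p * (s * s + m * m) for p, m, s in classes.values()) / total
    return mean, math.sqrt(second_moment - mean * mean)
```

Under these figures the pooled standard deviation comes out well above that of the unaccented or lightly accented class alone, which is consistent with the separation of class means contributing to the breadth of the overall distribution.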

Breadth of the Modulation Spectrum

The broad bandwidth of the modulation spectrum, therefore, appears to reflect the heterogeneity in syllabic and lexical duration associated with variation in stress accent level

[Figure: Unaccented, Heavily Accented, All Accents (Convergence)]

Does this insight have implications for the lower tiers of spoken language? (e.g., the phonetic and phonological levels)

Let’s find out!




INTERMEZZO

Anatomy of the Syllable

The Importance of the Syllable

The analyses to follow are all linked, in some fashion, to syllable structure

In order to highlight patterns germane to variation in segmental duration it is necessary to partition the data in terms of syllable position (as well as stress accent level)

As a consequence, we will examine the onsets, codas and nuclei of syllables separately in order to gain insight into the underlying patterns

What is an onset? What is a nucleus? What is a coda?

The following slides provide a brief (and gentle) introduction to syllable structure

Syllable and Phonetic Segment Illustrated

Syllables generally consist of three constituents - ONSET, NUCLEUS, CODA

[Figure legend: “J” = JUNCTURE]

Virtually all syllables contain a NUCLEUS, which is VOCALIC (by definition)

Most (but not all) syllables also contain an ONSET (usually a CONSONANT)

Many syllables contain a CODA (also typically a CONSONANT)

The most common syllable form in English is Onset + Nucleus + Coda (“Nine”)

Followed in popularity by Onset + Nucleus (“Two”)
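
The onset/nucleus/coda partition can be sketched as a split of a syllable's phone sequence around its vocalic nucleus. The phone inventory below is a simplified Arpabet-style set assumed for illustration: it omits the closure labels (such as tcl, dcl) used in the actual transcriptions and treats the syllabic consonants el, em and en as possible nuclei. The function names are hypothetical.

```python
# Simplified, assumed inventory of phones that can serve as a nucleus.
NUCLEI = {
    "aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
    "ih", "ix", "iy", "ow", "oy", "uh", "uw", "el", "em", "en",
}

def split_syllable(phones):
    """Return (onset, nucleus, coda) for the phones of a single syllable."""
    positions = [i for i, p in enumerate(phones) if p in NUCLEI]
    if len(positions) != 1:
        raise ValueError("expected exactly one nucleus in a syllable")
    i = positions[0]
    return phones[:i], phones[i], phones[i + 1:]

def cv_form(phones):
    """Canonical form of a syllable, e.g. 'CVC' or 'CV'."""
    onset, _, coda = split_syllable(phones)
    return "C" * len(onset) + "V" + "C" * len(coda)
```

With this split, “Nine” (n ay n) comes out as CVC and “Two” (t uw) as CV, matching the two most common forms named above.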

PART THREE

Stress Accent and Syllable Position

The Importance of Syllable Structure

Before going into the details of durational variation at the segmental level, we briefly examine some general patterns of pronunciation variation that are conditioned by syllable position and stress accent

These data serve to illustrate the sort of variation observed that is conditioned by position within the syllable

Pronunciation Variation - Syllable and Accent

[Figure: A Coarse Perspective on Pronunciation Variation (at the level of the syllable and stress accent); panels: All Segments, Deletions, Insertions, Substitutions; regions: ONSET Territory, NUCLEUS Territory, CODA Territory]

Pronunciation variation is systematic at the level of the syllable

It’s also systematic when stress accent is taken into account

BOTH syllable structure and accent level are required for a full accounting

Analysis of Durational Properties of Speech

The following analyses are conditioned on stress accent level and (for the most part) syllable position

We will begin with analyses illustrating the patterns associated with three levels of stress accent (heavy, light and none) to show the graded nature of the durational properties pertaining to syllable and segment duration

However, for purposes of illustrative clarity, many of the slides will show only two levels of accent (heavy and none) in order to delineate the differences in duration associated with stress accent level

Under such conditions, the durational properties associated with light accent are generally intermediate between heavy accent and none

Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures observed in spoken English

Together, the V, VC, CV and CVC forms account for 85% of syllables

The CVCC and CCVC forms account for another 10%

Together, the CV and CVC forms cover ca. 60% of the syllables

Syllable Duration - Across Syllable Forms

It is not surprising that syllable duration is largely a function of the number of segments within the syllable (as shown in the graph below)

[Figure: syllable duration by Canonical Syllable Form and accent level; V = Vowel, C = Consonant]

Note the systematic lengthening of the syllable for each form as the accent level increases from none to light to heavy

This pattern is representative of accent’s impact on duration (as we’ll see)



Syllable Duration - Accent Level/Syllable Form

This graph shows the same data as the previous slides, but from the perspective of only two accent levels (heavy and none)

The heavily accented syllables are generally 60-100% longer than their unaccented counterparts

The disparity in duration is most pronounced for syllable forms with one or no consonants (i.e., V, VC, CV)

This pattern implies that accent has the greatest impact on vocalic duration



Nucleus Duration - Accent Level/Syllable Form

The hypothesis delineated on the previous slide (that accent has the most profound impact on vocalic duration) is confirmed in the graph below

Vowels in accented syllables (of all forms) are at least twice as long as their unaccented counterparts

This pattern implies that the syllable nucleus absorbs a major component of accent’s impact (at least as far as duration is concerned)

PART FOUR

Stress Accent and the Vocalic Nucleus

Because the pattern of stress accent’s impact on vocalic duration is relatively uniform across syllable form it is likely that the structure of the syllable has relatively little impact on vocalic duration

Stress Accent’s Impact on the Vocalic Nucleus


As a consequence, the remaining analyses pertaining to accent’s impact on vocalic duration collapse the data across syllable form




We now examine vocalic duration in somewhat greater detail and illustrate how duration, stress accent and vocalic identity interact










But first … a brief primer on vocalic acoustics (which should facilitate digesting the material that follows)


INTERMEZZO

A Brief Primer on Vowel Acoustics

A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a function primarily of two articulatory properties – both related to the motion of the tongue



• The front-back plane is most closely associated with the second formant frequency (or more precisely F2 - F1) and the volume of the front-cavity resonance




• The height parameter is closely linked to the frequency of F1
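As a rough illustration of the two bullets above, the sketch below maps a vowel's first two formants to a (front-back, height) coordinate pair, using F2 - F1 as the front-back proxy and F1 as the height proxy. The formant values are approximate textbook averages for adult male speakers (Peterson and Barney, 1952); they are an assumption brought in for illustration and do not come from this presentation. The function name is also hypothetical.

```python
# Approximate textbook formant values in Hz (F1, F2) for four "corner" vowels.
# These are NOT from the presentation; they are illustrative reference values.
FORMANTS_HZ = {
    "iy": (270, 2290),  # high front
    "ae": (660, 1720),  # low front
    "aa": (730, 1090),  # low back
    "uw": (300, 870),   # high back
}


def vowel_coordinates(f1: float, f2: float) -> tuple:
    """Return (front-back proxy, height proxy) = (F2 - F1, F1).

    A larger F2 - F1 suggests a fronter vowel; a larger F1 suggests a lower vowel.
    """
    return (f2 - f1, f1)


for vowel, (f1, f2) in FORMANTS_HZ.items():
    frontness, lowness = vowel_coordinates(f1, f2)
    print(f"[{vowel}] front-back proxy = {frontness} Hz, height proxy (F1) = {lowness} Hz")
```

With these values the front vowels have a larger F2 - F1 than the back vowels, and the low vowels have a higher F1 than the high vowels.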










In the classic vowel “triangle,” segments are positioned in terms of the tongue positions associated with their production, as follows:


The Spatial Patterning of Duration in Vocalic Nuclei

Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data

Spatial Patterning of Duration





The duration will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain a constant throughout the plots to follow)




The y-axis will serve as the dependent measure expressed in terms of duration or the proportion of fully stressed (or unstressed) nuclei






Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is systematic with respect to duration

Low vowels, be they diphthongs or monophthongs, are longer (on average) than high vowels

[Figure panels: All nuclei, Diphthongs, Monophthongs]











Thus, duration appears to be highly correlated with vowel height

But … the situation is a little more complicated than first appearances would suggest

Durational Differences - Stressed/Unstressed

There is a large dynamic range in duration between accented and unaccented vocalic nuclei

Moreover, diphthongs and tense, low monophthongs tend to exhibit a larger dynamic range than the lax monophthongs

[Figure panel: Lax monophthongs]

Vocalic Identity Among Unstressed Nuclei

The high, lax monophthongs are almost always unstressed

The low vowels, be they monophthongs or diphthongs, are rarely unstressed

The high diphthongs and high/mid, tense monophthongs occupy an intermediate position

Vocalic Identity Among Fully Stressed Nuclei

The high vowels are rarely fully stressed


The low vowels, be they monophthongs or diphthongs, are far more likely to be fully stressed








An intermediate degree of stress accounts for the other vocalic instances (but will not be addressed here)


Duration Appears to Play An Important (but certainly not exclusive) Role in Stress Accent for Spontaneous American English Discourse

Is It Stress? Vocalic Identity? Or What?


For any given vocalic class, stressed segments are longer (on average)

The durational disparity is most pronounced among the low vowels and the diphthongs

Low Vowels Tend to be Much Longer in Duration than High Vowels

This is the case even for diphthongs

Low Vowels are Rarely without Some Measure of Stress Accent

This is true for monophthongs as well as diphthongs

High Vowels are Fully Stressed Extremely Rarely

This is particularly so for monophthongs, but also applies to diphthongs

Thus, Stress Accent Appears to Be Intricately Involved with Vocalic Identity (as illustrated on the next several slides)


The Vowel Space Under (Full) Stress (Accent)

There is a relatively even distribution of segments across the vowel space, with a slight bias towards the front and central vowels

Canonical Vowels Only

The Vowel Space Without (Stress) Accent

In unaccented syllables vowels are confined largely to the high-front and high-central sectors of the articulatory space

The low and mid vowels “get creamed”


Stress accent exerts a profound effect on the character of the vowel space

The Vowel Spaces Compared

Heavily Accented Unaccented



High vowels are largely associated with unaccented syllables












Low vowels are mostly associated with accented forms

This distinction between accented and unaccented syllables is of profound importance for understanding (and modeling) pronunciation variation




PART FIVE

Stress Accent’s Impact on Syllable Onsets

Stress Accent and Syllable Onsets

The onset is often cited as the key syllabic constituent with respect to “lexical access”



It is therefore of interest to ascertain how the onset’s duration behaves as a function of accent level




Because of the onset’s key role in lexical access one might assume that its duration would be relatively stable across accent level





The following slides suggest that this assumption is incorrect, and that the structure of the onset is more complex (and more interesting) than initial intuition would suggest


Onset Duration - Accent Level/Syllable Form

The duration of the syllable onset varies significantly as a function of accent level (though not quite as much as in vocalic constituents)

Onset duration is similar across syllable form (except that segments comprising complex onsets [i.e., CCVC] are slightly shorter)

The duration of unaccented onsets is similar across syllable forms

Onsets of accented syllables are generally 50-60% longer than their unaccented counterparts

Although this durational difference is not quite as large as observed for vocalic nuclei, it is still substantial (and mostly consistent across forms)

Onset Duration and Place of Articulation

It is of interest to examine accent’s impact on duration of onset (and coda) constituents in somewhat greater detail



A convenient means to do so is to partition the data with respect to place of maximum articulatory constriction in order to highlight certain patterns








What is place of articulation? Let’s find out!

Place of Articulation – A Brief Primer

The tongue contacts (or nearly so) the roof of the mouth in producing many of the consonantal sounds in English

Anterior: Labial [p] [b] [m]; Labio-dental [f] [v]; Inter-dental [th] [dh]

Central: Alveolar [t] [d] [n] [s] [z]

Posterior: Palatal [sh] [zh]; Velar [k] [g] [ng]

Chameleon: Rhoticized [r]; Lateral [l]; Approximant [hh]

From Daniloff (1973)

Onset Duration and Place of Articulation

We will examine accent’s impact on the duration of onset (and coda) constituents on the basis of articulatory place



First, we will examine the anterior consonants, followed by the central and posterior onsets




Finally, we will examine those segments whose place of articulation assimilates to that of the following vocalic segment (“place chameleons”)





Although the heavily accented onsets are generally 50-60% longer than their unaccented counterparts …






There is a large disparity in the durational differences due to accent level




We will now examine the specific durational patterns as a function of articulatory place ...

The patterns are revealing

Syllable Onset Duration - ANTERIOR Place


The voiceless consonants ([p] and [f]) are longer than the other segments




The largest durational disparity (as a function of accent level) is exhibited in the glide [y]










The smallest durational disparity is manifest in the voiced fricative [dh]

The other segments exhibit intermediate patterns

Segmental Identity and Stress Accent

It is of interest to compare accent’s impact on segmental duration with its impact on segmental realization (i.e., whether the segment is realized canonically or not …)



Usually, non-canonical realizations are manifest as segmental deletions




The pattern of segmental realization bears some correspondence to durational variation as a function of accent level





But also exhibits some interesting differences (which are potentially significant for models of phonetic organization)

Before we examine the segmental patterns in detail, a brief primer on the interpretation of these data is presented

Road Map - How to Interpret the Data

Accent level (number of segments)

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
p         203    205     153    153     94     94      450    452
b         126    127     227    225     214    190     567    542
m         137    137     211    211     116    110     464    458
f         136    136     104    104     113    103     353    343
v         35     33      58     58      108    93      201    184
th        62     61      102    100     28     26      192    187
dh        95     80      311    257     625    451     1031   788
y         63     72      135    136     193    145     391    353

Can = Canonical form
Trans = Transcribed (i.e., phonetically realized)

Compare the numbers in the YELLOW and ORANGE columns

Most numbers in the YELLOW / ORANGE columns will be similar

Indicating that the phonetic realization of the segment is the canonical form

A large disparity between columns is marked with a blue box

READY? OK, Let’s go!


Syllable Onset Statistics – ANTERIOR Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
p         203    205     153    153     94     94      450    452
b         126    127     227    225     214    190     567    542
m         137    137     211    211     116    110     464    458
f         136    136     104    104     113    103     353    343
v         35     33      58     58      108    93      201    184
th        62     61      102    100     28     26      192    187
dh        95     80      311    257     625    451     1031   788
y         63     72      135    136     193    145     391    353

Stress accent exerts relatively little effect on anterior onset segments

EXCEPT for [dh] and [y]

[dh] (as in “the” and “them”) tends to delete in unaccented syllables, as does [y] (although to a lesser extent)
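The comparison of the Can and Trans columns can be sketched in a few lines. The counts below are copied from the anterior onset table; the ratio Trans/Can and the 0.8 cutoff are assumptions made for this illustration, not a criterion stated in the presentation.

```python
# (canonical, transcribed) counts per accent level, copied from the anterior onset table.
ANTERIOR_ONSETS = {
    "p":  {"heavy": (203, 205), "light": (153, 153), "none": (94, 94)},
    "b":  {"heavy": (126, 127), "light": (227, 225), "none": (214, 190)},
    "m":  {"heavy": (137, 137), "light": (211, 211), "none": (116, 110)},
    "f":  {"heavy": (136, 136), "light": (104, 104), "none": (113, 103)},
    "v":  {"heavy": (35, 33),   "light": (58, 58),   "none": (108, 93)},
    "th": {"heavy": (62, 61),   "light": (102, 100), "none": (28, 26)},
    "dh": {"heavy": (95, 80),   "light": (311, 257), "none": (625, 451)},
    "y":  {"heavy": (63, 72),   "light": (135, 136), "none": (193, 145)},
}


def realization_ratio(canonical: int, transcribed: int) -> float:
    """Transcribed count divided by canonical count (NaN if never canonical)."""
    return transcribed / canonical if canonical else float("nan")


def flag_unstable(table, level="none", threshold=0.8):
    """List segments whose Trans/Can ratio falls below the (assumed) threshold."""
    return [seg for seg, levels in table.items()
            if realization_ratio(*levels[level]) < threshold]


print(flag_unstable(ANTERIOR_ONSETS, level="none"))   # ['dh', 'y']
print(flag_unstable(ANTERIOR_ONSETS, level="heavy"))  # []
```

With this cutoff only [dh] and [y] are flagged in unaccented syllables, and nothing is flagged under heavy accent.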


Syllable Onset Duration - CENTRAL Place

The voiceless consonants ([t] and [s]) are longer than the other segments

The alveolar flap [dx] and nasal flap [nx] are the shortest segments and don’t exhibit a durational disparity as a function of accent level

Syllable Onset Statistics – CENTRAL Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
t         241    245     276    230     513    276     1030   751
d         141    143     149    134     173    128     463    405
dx        0      3       0      62      0      179     0      244
n         133    135     237    196     194    130     564    461
nx        0      2       0      40      0      73      0      115
s         289    290     284    287     187    186     760    763
z         14     13      16     16      43     45      73     74

Central segments tend to “disappear” under (absence of) stress (accent)

There is also a tendency for flaps ([dx] and [nx]) to insert under similar conditions

In heavily accented syllables, central segments maintain their canonical identity
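The "disappear" and "insert" tendencies can be read off as the net change Trans - Can at each accent level. The sketch below uses the counts from the central onset table; the function name is hypothetical, and the net change is only a summary statistic. The counts alone do not say whether a missing [t] was deleted outright or realized as a flap.

```python
LEVELS = ("heavy", "light", "none")

# (canonical, transcribed) counts per accent level, copied from the central onset table.
CENTRAL_ONSETS = {
    "t":  ((241, 245), (276, 230), (513, 276)),
    "d":  ((141, 143), (149, 134), (173, 128)),
    "dx": ((0, 3),     (0, 62),    (0, 179)),
    "n":  ((133, 135), (237, 196), (194, 130)),
    "nx": ((0, 2),     (0, 40),    (0, 73)),
    "s":  ((289, 290), (284, 287), (187, 186)),
    "z":  ((14, 13),   (16, 16),   (43, 45)),
}


def net_change(table):
    """Return Trans - Can per segment and accent level (negative = net loss)."""
    return {
        seg: {lvl: trans - can for lvl, (can, trans) in zip(LEVELS, rows)}
        for seg, rows in table.items()
    }


changes = net_change(CENTRAL_ONSETS)
for seg, by_level in changes.items():
    print(seg, by_level)
```

For [t] the net change goes from +4 (heavy) to -237 (none), while the flaps [dx] and [nx], which never occur canonically, gain 179 and 73 tokens in unaccented syllables.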


Syllable Onset Duration - POSTERIOR Place

CANONICAL Syllable Forms

The voiceless consonants ([k], [sh], [ch]) are longer than the other segments




Most of the segments exhibit a durational disparity between accented and unaccented forms










The duration of the voiced segments in unaccented syllables is ca. 50-60 ms

The glide [w] exhibits a significant disparity between accented and unaccented forms

Syllable Onset Statistics – Posterior Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
k         185    186     189    187     170    168     544    541
g         115    116     138    137     54     51      307    304
ng        0      0       2      3       1      1       3      4
sh        26     26      40     40      73     80      139    146
zh        0      1       2      9       11     17      13     27
ch        32     34      19     27      22     23      73     84
jh        31     30      52     43      58     48      141    121
w         201    209     310    330     276    287     787    826
q         0      33      0      64      0      38      0      135

Posterior segments are remarkably stable in onset position

The only significant “deviation” from canonical representation is the intrusion of the glottal stop [q], which lacks phonemic status in English


Syllable Onset Duration - Place Chameleons

Place chameleon segments exhibit a consistent durational disparity between accented and unaccented forms

In unaccented syllables the duration of these segments is ca. 50-60 ms

Syllable Onset Statistics – Place Chameleons

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
r         272    269     233    215     233    162     738    646
l         184    180     226    212     220    162     630    554
hh        158    156     169    157     67     37      394    350
er        0      0       0      2       0      0       0      2
lg        0      2       0      8       0      21      0      31
el        0      1       0      0       0      0       0      1

“Chameleons” assimilate their place of articulation to the following vowel

They are relatively stable at syllable onset, except in unaccented forms

The reduced form of [l] is [lg], a glide-like element – it tends to assume the functional status of [l] in unaccented syllables


Pronunciation Patterns – Syllable Onsets

The ANTERIOR and POSTERIOR onsets are generally canonically realized (the exceptions typically function as “junctures,” rather than as segments)

The CENTRAL and PLACE CHAMELEON onsets are often non-canonical (and also often function as “junctures”)

C = Canonical realization, N = Non-canonical realization, N0 = Non-canonical in unaccented syllables

[Figure labels: Place of Articulation, Approximants]



PART SIX

Stress Accent’s Impact on Syllable Codas

Stress Accent and Syllable Codas

Stress accent’s impact on syllable codas differs from that of onsets


The disparity in duration between accented and unaccented forms tends to be significantly less for codas than for onsets (at least when deletions are NOT taken into account)



There is a far greater probability of segmental deletion in coda constituents




Accent level exerts a powerful influence on segmental deletion and on segmental duration










To a certain degree segmental deletion and duration interact (or are flip sides of the same phonetic coin)

(for this reason the durational properties of ALL syllables, including those in which coda segments are deleted, are also shown)

Syllable Coda Duration - ANTERIOR Place

The durational disparity between accented and unaccented forms is smaller for codas than for onsets

Certain segments exhibit little if any difference in duration as a function of accent (e.g., [b], [m], [v])

Such segments manifest certain properties of flaps

ALL Syllable Forms

Because of the significant number of deletions in coda constituents, particularly in unaccented syllables, the durational disparity between accented and unaccented syllables is preserved when duration is computed across ALL syllable forms (including those with deletions)

Those segments exhibiting flap-like properties (e.g., [b], [m], [v]) tend to delete the most in unaccented codas

Syllable Coda Statistics – Anterior Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
p         33     32      39     32      17     13      89     77
b         9      6       4      4       1      1       14     11
m         108    96      148    148     112    83      368    327
f         37     36      40     40      36     48      113    124
v         63     55      102    87      172    94      337    236
th        11     10      24     16      34     20      69     46
dh        0      0       0      4       0      5       0      9

Anterior coda segments are relatively stable under stress (accent)

The segments [m] and [v] are exceptions; they often function as “flaps” in this context, and they tend to delete in unaccented syllables


Syllable Coda Duration - CENTRAL Place

The centrally articulated codas exhibit a high probability of deletion, particularly in unaccented syllables (see durational data for ALL syllables)

Many of the coda segments do not exhibit a durational difference across accent levels (when computed for the canonical syllable forms)

Most of the unaccented codas are short in duration

ALL Syllable Forms

Because of the high probability of deletions for central coda consonants the mean durations are quite low relative to other conditions

In some sense the default duration for central codas is very short (more on this point later on in the presentation)

Syllable Coda Statistics – Central Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
t         322    126     575    191     562    172     1459   489
d         200    119     295    127     370    96      865    342
n         311    237     498    381     773    542     1582   1160
s         142    135     202    214     151    155     495    504
z         179    149     258    208     271    221     708    578

Central coda segments are extremely unstable under stress (accent), except for the fricatives [s] and [z]

The segments [t], [d] and [n] tend to delete in coda position, even in heavily accented syllables

The major effect of stress accent is its effect on the probability of segmental deletion (which is appreciably higher in unaccented forms)
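The proportion of central codas that surface canonically can be estimated as Trans/Can for each accent level. The counts below are copied from the central coda table; treating Trans/Can as a "realization rate" is an assumption of this sketch, since the ratio is a net figure and can exceed 1 when a segment is transcribed more often than it is canonically expected (as for [s]).

```python
# (canonical, transcribed) counts per accent level, copied from the central coda table.
CENTRAL_CODAS = {
    "t": {"heavy": (322, 126), "light": (575, 191), "none": (562, 172)},
    "d": {"heavy": (200, 119), "light": (295, 127), "none": (370, 96)},
    "n": {"heavy": (311, 237), "light": (498, 381), "none": (773, 542)},
    "s": {"heavy": (142, 135), "light": (202, 214), "none": (151, 155)},
    "z": {"heavy": (179, 149), "light": (258, 208), "none": (271, 221)},
}


def realization_rate(table):
    """Return Trans / Can per segment and accent level."""
    return {
        seg: {lvl: trans / can for lvl, (can, trans) in levels.items()}
        for seg, levels in table.items()
    }


rates = realization_rate(CENTRAL_CODAS)
for seg, by_level in rates.items():
    print(seg, {lvl: round(rate, 2) for lvl, rate in by_level.items()})
```

For [t], [d] and [n] the rate is well below 1 even under heavy accent and lower still in unaccented syllables, whereas [s] and [z] stay close to their canonical counts.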


Syllable Coda Duration - POSTERIOR Place





Many coda consonants are short in duration

Most segments exhibit relatively little sensitivity to accent level


ALL Syllable Forms

There are relatively few deletions in coda segments, hence the durational patterns are similar for ALL syllable forms relative to the canonical syllable forms

Syllable Coda Statistics – Posterior Place

Segment   Heavy          Light          None           Total
          Can    Trans   Can    Trans   Can    Trans   Can    Trans
k         170    150     196    162     51     39      417    351
g         10     10      8      10      4      5       22     25
q         0      42      0      71      0      54      0      167
ng        63     60      139    126     203    129     405    315
sh        9      9       2      2       4      6       15     17
zh        1      0       0      4       0      2       1      6
ch        26     25      27     25      12     12      65     62
jh        10     10      11     10      15     12      36     32
w         0      4       0      2       0      6       0      12

Posterior coda segments are relatively stable under stress (accent)

The primary exception is [ng], which tends to delete in unaccented syllables

The “infamous” glottal stop [q] tends to insert in this context


Syllable Coda Duration - Place Chameleons

There is a large durational disparity between the accented and unaccented chameleon segments

In unaccented syllables the duration of these segments is ca. 60 ms

ALL Syllable Forms

There are a lot of deletions of coda chameleons in unaccented syllables

Hence the mean duration of these segments in unaccented forms is short

Syllable Coda Statistics – Place Chameleons

Chameleon segments are unstable under stress (accent)

This is particularly true for [l] (for all levels of accent), where many canonical segments transmute into [lg], particularly in accented forms

The segment [r] tends to delete in unaccented syllables, but not otherwise


Pronunciation Patterns – Syllable Codas

The ANTERIOR and POSTERIOR codas are generally canonically realized (the exceptions typically function as “junctures,” rather than segments)

The CENTRAL and PLACE CHAMELEON segments are often non-canonical (and also often function as “junctures”)



PART SEVEN

Onset and Coda Patterns Compared

Comparison of Syllable Onsets and Codas

Onsets tend to be more stable than codas




The centrally articulated segments are highly unstable in both contexts










As are the place chameleons

The unstable anterior and posterior phones are mostly “junctures”
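The onset/coda contrast can be checked directly for the centrally articulated segments by pooling the Total columns of the two central tables. In this sketch, segments that never occur canonically (the flaps [dx] and [nx]) are left out of the pooled ratio; that exclusion, and the pooling itself, are assumptions made for illustration.

```python
# (canonical total, transcribed total), copied from the Total columns of the central tables.
CENTRAL_ONSET_TOTALS = {
    "t": (1030, 751), "d": (463, 405), "dx": (0, 244), "n": (564, 461),
    "nx": (0, 115), "s": (760, 763), "z": (73, 74),
}
CENTRAL_CODA_TOTALS = {
    "t": (1459, 489), "d": (865, 342), "n": (1582, 1160),
    "s": (495, 504), "z": (708, 578),
}


def pooled_ratio(totals):
    """Pooled Trans/Can over segments that have a canonical count."""
    canonical = sum(can for can, _ in totals.values() if can > 0)
    transcribed = sum(trans for can, trans in totals.values() if can > 0)
    return transcribed / canonical


onset_ratio = pooled_ratio(CENTRAL_ONSET_TOTALS)
coda_ratio = pooled_ratio(CENTRAL_CODA_TOTALS)
print(f"central onsets: {onset_ratio:.2f}")  # 0.85
print(f"central codas:  {coda_ratio:.2f}")   # 0.60
```

Under these assumptions roughly 85% of canonical central onsets are matched by a transcribed token, against roughly 60% of canonical central codas.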



PART EIGHT

A Preliminary Juncture-Accent Model

A means of visualizing important properties of the acoustic signal

Road Map to the Juncture-Accent Model


The juncture-accent representation is based on log, critical-band energy across time and frequency








Although it is not intended as an auditory representation, it does represent spectro-temporal properties of the signal in a manner consistent with auditory principles

Let’s take a look at some illustrations – Spectro-Temporal Profiles or “STePs”
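The presentation does not say how the STePs were computed, so the following is only a minimal sketch of one way to obtain log, critical-band energy across time and frequency. Everything specific here is an assumption: the 8 kHz sampling rate, the 25 ms Hann-windowed frames with a 10 ms hop, the conventional Bark-scale band edges (truncated at 4 kHz), and the function names. A direct DFT is used to keep the example dependency-free; it is slow and not meant for real data.

```python
import cmath
import math

# Conventional Bark-scale critical-band edges in Hz, truncated at 4 kHz (assumption).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4000]


def power_spectrum(frame):
    """Power at each non-negative DFT bin (direct DFT, dependency-free)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def log_critical_band_energy(signal, sample_rate=8000, frame_ms=25.0, hop_ms=10.0):
    """Return a list of frames, each a list of log band energies in dB."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")
    # Hann window to limit spectral leakage between bands.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    profile = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        power = power_spectrum(frame)
        row = []
        for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
            # Sum the power of all DFT bins whose centre frequency lies in [lo, hi).
            energy = sum(p for k, p in enumerate(power)
                         if lo <= k * sample_rate / frame_len < hi)
            row.append(10.0 * math.log10(energy + 1e-10))  # small floor avoids log(0)
        profile.append(row)
    return profile


# Usage: a 0.1 s, 1 kHz test tone should peak in the 920-1080 Hz band (index 8).
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(800)]
profile = log_critical_band_energy(tone)
print(len(profile), "frames x", len(profile[0]), "bands")
```

Each row of the result is one time frame and each column one critical band, which is the time-by-frequency layout the profiles are described as using.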


Anatomy of a Spectro-Temporal Profile

[Figure: STeP for “Seven” ([s] [eh] [vx] [en]), full-spectrum perspective, OGI Numbers95; annotations: juncture, accented syllable, unaccented syllable, mean duration]

[Figure: STeP for “Seven” ([s] [eh] [vx] [en]), high-frequency perspective, OGI Numbers95; annotations: unaccented syllable, mean duration]

[Figure: STeP for “Zero” (segment labels [z] [ih] [r] [ax]; transcription line [z] [ih] [r] [ah]), OGI Numbers95; annotations: unaccented syllable, mean duration]

[Figure: STeP for “Zero”, high-frequency perspective, OGI Numbers95; annotations: juncture, accented syllable, unaccented syllable, mean duration]

[Figure: STeP for “Three” ([th] [r] [iy]), OGI Numbers95; annotations: accented syllable, mean duration]

[Figure: STeP for “Three” ([th] [r] [iy]), high-frequency perspective, OGI Numbers95; annotations: accented syllable, mean duration]

Summary and Conclusions (at last!)

Based on a detailed analysis of a manually annotated corpus of spontaneous American English (Switchboard) the following conclusions are drawn:



Stress accent is the primary linguistic property associated with duration at the segmental, syllabic and lexical levels




Stress accent’s impact on duration is most pronounced in the vocalic nucleus





But also affects the duration of the syllable onset






The duration of the syllable coda is less affected by stress accent, however ...







Coda constituents are more prone to deletion as a function of stress accent








Thus, stress accent has an (indirect) impact on duration even for codas (via segmental deletion)









These data are inconsistent with a segmental model of spoken language

But are consistent with a JUNCTURE-ACCENT model based on syllable forms of variable accent level

That’s All, Folks

Many Thanks for Your Time and Attention

What’s Going on in Pronunciation?

What’s Going On? (in pronunciation)

With respect to onset and coda segments (i.e., consonants) there are two basic forms: (1) those that are relatively stable across accent level, and (2) those that are not


Most of the non-continuants (i.e. stops and nasals) are stable when the locus of articulation constriction is either anterior or posterior




The centrally articulated stops and nasals are highly unstable, particularly in coda position and in unaccented syllables





The place chameleons (i.e., the approximants) are not very stable in either onset or coda position






The vowels are divisible into two main groups – accented and unaccented







The accented vowels are generally canonically realized and quasi-evenly distributed across the vowel space








The unaccented forms tend to concentrate in the high-front and high-central regions of the vowel space









Certain segments are actually junctures – e.g., the flaps and the glottal stop




















Many so-called segments are actually junctures (as they are flaps), the most noteworthy examples are [dh] and [v]

None of these properties is consistent with a segmental model of language


Syllable Duration and Number of Segments

For syllables longer than a single segment there is relatively little difference in duration as the number of segments (within a syllable) increases

Suggesting that syllable duration is largely controlled by processes independent of segmental production

