Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Acoustic Properties of Speech Sounds

� Speech production

� Signal processing

� Properties of speech sounds of American English

� Microphone variations

� Spectrographic Examples

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 1

Anatomical Structures for Speech Production

Soft Palate (Velum)

Hyoid Bone

Epiglottis

CricoidCartilage

Esophagus

Nasal Cavity

Hard Palate

Tongue

Thyroid Cartilage

Vocal Cords

Trachea

Lung

Sternum

Nasal Cavity

Hard Palate

Tongue

Thyroid Cartilage

Vocal FoldsTrachea

Lung

Soft Palate(Velum)

Jaw


Sub-Word Linguistic Units

� The phoneme is one of the most basic linguistic units used torepresent pronunciations of words

� ASR systems typically represent words as phoneme sequences

� English contains approximately 40 phonemes which can begrouped by manner and place of articulation

Manner Class NumberVowels 16Fricatives 8Stops 6Semivowels 4Nasals 3Affricates 2Aspirant 1


Phonemes in American English

IPA AB Word IPA AB Word IPA AB Word/i/ iy beat /s/ s see /w/ w wet/I/ ih bit /S/ sh she /r/ r red/e/ ey bait /f/ f fee /l/ l let/E/ eh bet /T/ th thief /y/ y yet/@/ ae bat /z/ z z /m/ m meet/a/ aa bob /Z/ zh Gigi /n/ n neat/O/ ao bought /v/ v v /4/ ng sing/^/ ah but /D/ dh thee /C/ ch church/o/ ow boat /p/ p pea /J/ jh judge/U/ uh book /t/ t tea /h/ hh heat/u/ uw boot /k/ k key/5/ er bird /b/ b bay/a¤ / ay bite /d/ d day/O¤ / oy Boyd /g/ g geese/a⁄ / aw bout/{/ ax about


Places of Articulation for Speech Production

PalatalVelar

Uvular

AlveopalatalAlveolar

Labial

Dental


A Speech Waveform

Two plus seven is less than ten


Spectral Representations

� Speech waveforms are usually sampled at rates varying from 8K(telephone) to 20K (wide-band) samples/sec

� ASR systems typically transform the waveform into a spectrum: asequence of frequency-based analyses usually performed atregular intervals (e.g., 10 ms)

� A short-time Fourier transform (STFT) performs a spectral analysison waveform segments small enough to be able to assume thatthe speech signal is quasi-stationary

� The waveform segment is created by a moving window, whosetype (e.g., Hamming) and duration (e.g., 5-25ms) have asignificant impact on the resulting spectrum

� A spectrogram is an image computed from the resultingspectrum, which is often used to examine the waveform


Short-Time Fourier Transform

w [ 50 - m ] w [ 100 - m ] w [ 200 - m ]

x [ m ]

0 n = 50 n = 100 n = 200

m

Xn(ejω) =+∞∑m=−∞

w[n −m]x[m]e−jωm

� If n is fixed, then it can be shown that:

Xn(ejω) = 12π

∫ π

−πW(ejθ)ejθnX(ej(ω+θ))dθ

� The above equation is meaningful only if we assume that X(ejω)represents the Fourier transform of a signal whose properties continueoutside the window, or simply that the signal is zero outside the window.

� In order for Xn(ejω) to correspond to X(ejω), W(ejω) must resemble animpulse with respect to X(ejω).


Comparison of Windows


Comparison of Windows (cont’d)


A Wide-Band Speech Spectrogram

Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 11

A Narrow-Band Speech Spectrogram

Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 12

Spectral Averages: Corpus and Representation

� TIMIT acoustic-phonetic corpus

– phonetic transcription aligned with waveform

– native speakers of American English (8 dialects)

– 8 sentences/speaker (dialect sentences excluded)

– 136 female, 326 male speakers (NIST train set)

– 3,696 utterances, 142,910 tokens

� Mel-Frequency Spectral Coefficients (MFSC’s)

– Mel-frequency scale (linear < 1 kHz, log > 1 kHz)

– 40 channels (200 Hz - 6.4 kHz)

– 25 ms Hamming window, 5 ms frame-rate

– Average computed over entire phonetic token (for stopsspectral slice at release was used)


Happy Little Vowel Chart"So inaccurate, yet so useful."

SCHWAS:Plain ({) About /{ba⁄t/Front (|) Roses /ro⁄z|z/Retroflex (}) Forever /f}Ev5/

F 2 In

crea

ses

F1 Increases

MID LOWHIGH

FRONT

BACK

uÚ

OUu

ao

^,{

E@

eI

i

TENSE = Towards Edgestends to be longer

LAX = Towards Centertends to be shorter

Rob's

Think F3 is mighty low? Your pal 5 is the way to go!


Friendly Little Consonant Chart"Somewhat more accurate, yet somewhat less useful."

Labial Alveolar Palatal VelarPlace of Articulation

Voicing: Unvoiced Voiced

Nas

alFr

icat

ive

Stop

Man

ner

of

Art

icula

tion

p b

f v

m

Dental

T D

n 4

s z S Z

t d k g

The Semi-vowels:

is like an extremey i

is like an extremew u

is like an extremel o

is like an extremer 5

The Affricates:

is likeC t+S

is likeJ d+Z

The Odds and Ends:

h (unvoiced h)

H (voiced h)

F (flap) FÊ (nasalized flap)

? (glottal stop)

Weak (Non-strident) Strong (Strident)


Vowel Production

� No significant constriction in the vocal tract

� Usually produced with periodic excitation

� Acoustic characteristics depend on the position of the jaw,tongue, and lips

[i] [@] [a] [u]


Vowels of American English

� There are approximately 18 vowels in American English made upof monothongs, diphthongs, and reduced vowels (schwa’s)

� They are often described by the articulatory features: High/Low,Front/Back, Retroflexed, Rounded, and Tense/Lax

/i/ iy beat /O/ ao bought /a¤ / ay bite/I/ ih bit /^/ ah but /O¤ / oy Boyd/e/ ey bait /o/ ow boat /a⁄ / aw bout/E/ eh bet /U/ uh book [{] ax about/@/ ae bat /u/ uw boot [|] ix roses/a/ aa Bob /5/ er Bert [}] axr butter


Vowel Formant Averages

� Vowels are often characterized by F1, F2, and F3

� High/Low is correlated with F1

� Front/Back is correlated with F2

� Retroflexion is marked by a low F3

Female Speakers Male Speakers

i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0

500

1000

1500

2000

2500

3000

3500

Ave

rag

e F

req

uen

cy (

Hz)

Vowel

F1F2F3

i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0

500

1000

1500

2000

2500

3000

3500

Ave

rag

e F

req

uen

cy (

Hz)

Vowel

F1F2F3


Vowel Formant Trajectories

� Diphthongs can have significant formant motion

� Most vowels in American English are somewhat diphthongized


700

900

1100

1300

1500

1700

1900

2100

2300

2500

2700

300 400 500 600 700 800 900

F2

F1

i¤

E

a¤

O¤

{

|

a⁄^

a

e¤

I

O

@

u o⁄

5U

700

900

1100

1300

1500

1700

1900

2100

2300

2500

2700

300 400 500 600 700 800 900

F2

F1

i¤

@

|

EI

e¤

^

O

o⁄

5U

u

O¤

a¤

{

a⁄

a


Vowel Durations

� Each vowel has a different intrinsic duration

� Schwa’s have distinctly shorter durations (50ms)

� /I, E, ^, U/ are the shortest monothongs

� Context can greatly influence vowel duration


i¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u0

50

100

150

200

250

Ave

rag

e D

ura

tio

n (

ms)

Voweli¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u

0

50

100

150

200

250

Ave

rag

e D

ura

tio

n (

ms)

Vowel


Fricative Production

� Turbulence produced at narrow constriction

� Constriction position determines acoustic characteristics

� Can be produced with periodic excitation

[f] [T] [s] [S]


Fricatives of American English

� There are 8 fricatives in American English

� They are often described by the features Strident/Non-Strident(Strong/Weak), Voiced/Unvoiced

� Four places of articulation: Labial, Dental, Alveolar, and Palatal

Type Unvoiced VoicedLabial /f/ f fee /v/ v vDental /T/ th thief /D/ dh theeAlveolar /s/ s see /z/ z zPalatal /S/ sh she /Z/ zh Gigi


Fricative Energy

Average Total Energy

Pro

babi

lity

Den

sity

una

djus

ted

for

freq

uenc

y

-100 -90 -80 -70 -60 -50 -40

0.0

0.02

0.04

0.06

NON-STRIDENTSTRIDENT

Strident fricatives tend to be stronger than non-strident


Fricative Durations

Duration

Pro

babi

lity

Den

sity

una

djus

ted

for

freq

uenc

y

0.0 0.05 0.10 0.15 0.20 0.25 0.30

02

46

810

1214

UNVOICEDVOICED

Voiced fricatives tend to be shorter than unvoiced


Nasal Production

� Velum lowering results in airflow through nasal cavity� Consonants produced with closure in oral cavity� Nasalized vowels have output through oral and nasal cavities� Nasal murmurs have similar spectral characteristics

[m] [n] [4]


Nasal Consonants of American English

� Three places of articulation: Labial, Alveolar, and Velar

� Always attached to a vowel, though can form an entire syllable inunstressed environments ([nÍ ], [mÍ ], [4Í ])

� /4/ is always post-vocalic

� Place identified by neighboring formant transitions

Type NasalLabial /m/ m meDental /n/ n kneeVelar /4/ ng sing


Nasal Durations

Singleton Unvoiced Cluster

Voiced Cluster

0

25

50

75

100

125

150

Dur

atio

n (m

s)

Nasal consonants tend to be shorter in clusters with unvoicedconsonants, and longer with voiced consonants


Semivowel Production

� Constriction in vocal tract, no turbulence

� Slower articulatory motion than other consonants

� Laterals form complete closure with tongue tip,airflow via sides of constriction

[w] [y] [r] [l]


Semivowels of American English

� There are 4 semivowels in American English

� Always attached to a vowel, though /l/ can form an entire syllablein unstressed environments ([lÍ ])

� Extreme articulation of a corresponding vowel

– Similar formant positions

– Generally weaker due to constriction

Type Semivowel Nearest VowelGlides /w/ w wet /u/

/y/ y yet /i/Liquids /r/ r red /5/

/l/ l let /o/


Acoustic Properties of Semivowels

� /w/ is characterized by a very low F1, F2

– Typically a rapid spectral falloff above F2

� /y/ is characterized by very low F1, very high F2

� /r/ is characterized by a very low F3

– Prevocalic F3 < medial F3 < postvocalic F3

� /l/ is characterized by a low F1 and F2

– Often presence of high frequency energy

– Postvocalic /l/ characterized by minimal spectral discontinuity,gradual motion of formants


Aspirant Production

� /h/ in American English

� Turbulence excitation at glottis

� No constriction in the vocal tract, normal formant excitation

� Coupling with subglottal system results in little energy in F1region

� Periodic excitation can be present in medial position


Stop Production

� Complete closure in the vocal tract, pressure build up

� Sudden release of the constriction, turbulence noise

� Can have periodic excitation during closure

[b] [d] [g]


Stops of American English

� There are 6 stop consonants in American English

� Same places of articulation as nasal consonants

� Unvoiced stops are typically aspirated

� Voiced stops usually exhibit a “voice-bar’’ during closure

� Information about formant transitions and release useful forclassification

Type Voiced UnvoicedLabial /b/ b bee /p/ p peaDental /d/ d Dee /t/ t teaVelar /g/ g geese /k/ k key


Singleton Stop Durations

b d g p t k0

10

20

30

40

50

60

70

80

VO

T D

urat

ion

(ms)

The voice onset time (VOT) of unvoiced stopsis longer than that of voiced stops


/s/-Stop Durations

p t k0

10

20

30

40

50

60

70

80

VO

T D

urat

ion

(ms)

Unvoiced stops are unaspirated in /s/ stop sequences


Stop-Semivowel Durations

b d g p t k0

10

20

30

40

50

60

70

80

90

100

VO

T D

urat

ion

(ms)

Singletons

[Stop][Semivowel] Clusters

Semivowels are partially devoiced in stop semivowel sequences


Voicing Cues for Stops

There are many voicing cues for a stop


Affricate Production

� Alveolar-stop palatal-fricative pairs

� Sudden release of the constriction, turbulence noise

� Can have periodic excitation during closure

Affricates of American English

There are two affricates in American English

Voiced Unvoiced/J/ jh judge /C/ ch church


Speech from a Close-Talking Microphone

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8

16 16Zero Crossing Rate

dB dBTotal Energy

dB dBEnergy -- 125 Hz to 750 Hz

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/sennheiser.wavPrinted by jwc on Wed Jul 16 11:58:32 1997

Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 39

Speech from a Omni-Directional Microphone

kHz kHz


kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8


dB dBTotal Energy


Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/bk.wavPrinted by jwc on Wed Jul 16 11:57:43 1997


Speech over a Telephone Channel

kHz kHz


kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8


dB dBTotal Energy


Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/telephone.wavPrinted by jwc on Wed Jul 16 11:59:12 1997


Documents

Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone