21
Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone variations Spectrographic Examples CLSP Workshop 2000 Acoustic Properties of Speech Sounds 1 Anatomical Structures for Speech Production Soft Palate (Velum) Hyoid Bone Epiglottis Cricoid Cartilage Esophagus Nasal Cavity Hard Palate Tongue Thyroid Cartilage Vocal Cords Trachea Lung Sternum Nasal Cavity Hard Palate Tongue Thyroid Cartilage Vocal Folds Trachea Lung Soft Palate (Velum) Jaw CLSP Workshop 2000 Acoustic Properties of Speech Sounds 2

Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

  • Upload
    vohuong

  • View
    227

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Acoustic Properties of Speech Sounds

� Speech production

� Signal processing

� Properties of speech sounds of American English

� Microphone variations

� Spectrographic Examples

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 1

Anatomical Structures for Speech Production

Soft Palate (Velum)

Hyoid Bone

Epiglottis

CricoidCartilage

Esophagus

Nasal Cavity

Hard Palate

Tongue

Thyroid Cartilage

Vocal Cords

Trachea

Lung

Sternum

Nasal Cavity

Hard Palate

Tongue

Thyroid Cartilage

Vocal FoldsTrachea

Lung

Soft Palate(Velum)

Jaw

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 2

Page 2: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Sub-Word Linguistic Units

� The phoneme is one of the most basic linguistic units used torepresent pronunciations of words

� ASR systems typically represent words as phoneme sequences

� English contains approximately 40 phonemes which can begrouped by manner and place of articulation

Manner Class NumberVowels 16Fricatives 8Stops 6Semivowels 4Nasals 3Affricates 2Aspirant 1

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 3

Phonemes in American English

IPA AB Word IPA AB Word IPA AB Word/i/ iy beat /s/ s see /w/ w wet/I/ ih bit /S/ sh she /r/ r red/e/ ey bait /f/ f fee /l/ l let/E/ eh bet /T/ th thief /y/ y yet/@/ ae bat /z/ z z /m/ m meet/a/ aa bob /Z/ zh Gigi /n/ n neat/O/ ao bought /v/ v v /4/ ng sing/^/ ah but /D/ dh thee /C/ ch church/o/ ow boat /p/ p pea /J/ jh judge/U/ uh book /t/ t tea /h/ hh heat/u/ uw boot /k/ k key/5/ er bird /b/ b bay/a¤ / ay bite /d/ d day/O¤ / oy Boyd /g/ g geese/a⁄ / aw bout/{/ ax about

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 4

Page 3: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Places of Articulation for Speech Production

PalatalVelar

Uvular

AlveopalatalAlveolar

Labial

Dental

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 5

A Speech Waveform

Two plus seven is less than ten

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 6

Page 4: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Spectral Representations

� Speech waveforms are usually sampled at rates varying from 8K(telephone) to 20K (wide-band) samples/sec

� ASR systems typically transform the waveform into a spectrum: asequence of frequency-based analyses usually performed atregular intervals (e.g., 10 ms)

� A short-time Fourier transform (STFT) performs a spectral analysison waveform segments small enough to be able to assume thatthe speech signal is quasi-stationary

� The waveform segment is created by a moving window, whosetype (e.g., Hamming) and duration (e.g., 5-25ms) have asignificant impact on the resulting spectrum

� A spectrogram is an image computed from the resultingspectrum, which is often used to examine the waveform

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 7

Short-Time Fourier Transform

w [ 50 - m ] w [ 100 - m ] w [ 200 - m ]

x [ m ]

0 n = 50 n = 100 n = 200

m

Xn(ejω) =+∞∑m=−∞

w[n −m]x[m]e−jωm

� If n is fixed, then it can be shown that:

Xn(ejω) = 12π

∫ π

−πW(ejθ)ejθnX(ej(ω+θ))dθ

� The above equation is meaningful only if we assume that X(ejω)represents the Fourier transform of a signal whose properties continueoutside the window, or simply that the signal is zero outside the window.

� In order for Xn(ejω) to correspond to X(ejω), W(ejω) must resemble animpulse with respect to X(ejω).

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 8

Page 5: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Comparison of Windows

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 9

Comparison of Windows (cont’d)

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 10

Page 6: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

A Wide-Band Speech Spectrogram

Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 11

A Narrow-Band Speech Spectrogram

Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 12

Page 7: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Spectral Averages: Corpus and Representation

� TIMIT acoustic-phonetic corpus

– phonetic transcription aligned with waveform

– native speakers of American English (8 dialects)

– 8 sentences/speaker (dialect sentences excluded)

– 136 female, 326 male speakers (NIST train set)

– 3,696 utterances, 142,910 tokens

� Mel-Frequency Spectral Coefficients (MFSC’s)

– Mel-frequency scale (linear < 1 kHz, log > 1 kHz)

– 40 channels (200 Hz - 6.4 kHz)

– 25 ms Hamming window, 5 ms frame-rate

– Average computed over entire phonetic token (for stopsspectral slice at release was used)

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 13

Happy Little Vowel Chart"So inaccurate, yet so useful."

SCHWAS:Plain ({) About /{ba⁄t/Front (|) Roses /ro⁄z|z/Retroflex (}) Forever /f}Ev5/

F 2 In

crea

ses

F1 Increases

MID LOWHIGH

FRONT

BACK

OUu

ao

^,{

E@

eI

i

TENSE = Towards Edgestends to be longer

LAX = Towards Centertends to be shorter

Rob's

Think F3 is mighty low? Your pal 5 is the way to go!

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 14

Page 8: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Friendly Little Consonant Chart"Somewhat more accurate, yet somewhat less useful."

Labial Alveolar Palatal VelarPlace of Articulation

Voicing: Unvoiced Voiced

Nas

alFr

icat

ive

Stop

Man

ner

of

Art

icula

tion

p b

f v

m

Dental

T D

n 4

s z S Z

t d k g

The Semi-vowels:

is like an extremey i

is like an extremew u

is like an extremel o

is like an extremer 5

The Affricates:

is likeC t+S

is likeJ d+Z

The Odds and Ends:

h (unvoiced h)

H (voiced h)

F (flap) FÊ (nasalized flap)

? (glottal stop)

Weak (Non-strident) Strong (Strident)

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 15

Vowel Production

� No significant constriction in the vocal tract

� Usually produced with periodic excitation

� Acoustic characteristics depend on the position of the jaw,tongue, and lips

[i] [@] [a] [u]

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 16

Page 9: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Vowels of American English

� There are approximately 18 vowels in American English made upof monothongs, diphthongs, and reduced vowels (schwa’s)

� They are often described by the articulatory features: High/Low,Front/Back, Retroflexed, Rounded, and Tense/Lax

/i/ iy beat /O/ ao bought /a¤ / ay bite/I/ ih bit /^/ ah but /O¤ / oy Boyd/e/ ey bait /o/ ow boat /a⁄ / aw bout/E/ eh bet /U/ uh book [{] ax about/@/ ae bat /u/ uw boot [|] ix roses/a/ aa Bob /5/ er Bert [}] axr butter

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 17

Vowel Formant Averages

� Vowels are often characterized by F1, F2, and F3

� High/Low is correlated with F1

� Front/Back is correlated with F2

� Retroflexion is marked by a low F3

Female Speakers Male Speakers

i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0

500

1000

1500

2000

2500

3000

3500

Ave

rag

e F

req

uen

cy (

Hz)

Vowel

F1F2F3

i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0

500

1000

1500

2000

2500

3000

3500

Ave

rag

e F

req

uen

cy (

Hz)

Vowel

F1F2F3

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 18

Page 10: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Vowel Formant Trajectories

� Diphthongs can have significant formant motion

� Most vowels in American English are somewhat diphthongized

Female Speakers Male Speakers

700

900

1100

1300

1500

1700

1900

2100

2300

2500

2700

300 400 500 600 700 800 900

F2

F1

E

{

|

a⁄^

a

I

O

@

u o⁄

5U

700

900

1100

1300

1500

1700

1900

2100

2300

2500

2700

300 400 500 600 700 800 900

F2

F1

@

|

EI

^

O

o⁄

5U

u

{

a⁄

a

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 19

Vowel Durations

� Each vowel has a different intrinsic duration

� Schwa’s have distinctly shorter durations (50ms)

� /I, E, ^, U/ are the shortest monothongs

� Context can greatly influence vowel duration

Female Speakers Male Speakers

i¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u0

50

100

150

200

250

Ave

rag

e D

ura

tio

n (

ms)

Voweli¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u

0

50

100

150

200

250

Ave

rag

e D

ura

tio

n (

ms)

Vowel

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 20

Page 11: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Fricative Production

� Turbulence produced at narrow constriction

� Constriction position determines acoustic characteristics

� Can be produced with periodic excitation

[f] [T] [s] [S]

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 21

Fricatives of American English

� There are 8 fricatives in American English

� They are often described by the features Strident/Non-Strident(Strong/Weak), Voiced/Unvoiced

� Four places of articulation: Labial, Dental, Alveolar, and Palatal

Type Unvoiced VoicedLabial /f/ f fee /v/ v vDental /T/ th thief /D/ dh theeAlveolar /s/ s see /z/ z zPalatal /S/ sh she /Z/ zh Gigi

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 22

Page 12: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Fricative Energy

Average Total Energy

Pro

babi

lity

Den

sity

una

djus

ted

for

freq

uenc

y

-100 -90 -80 -70 -60 -50 -40

0.0

0.02

0.04

0.06

NON-STRIDENTSTRIDENT

Strident fricatives tend to be stronger than non-strident

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 23

Fricative Durations

Duration

Pro

babi

lity

Den

sity

una

djus

ted

for

freq

uenc

y

0.0 0.05 0.10 0.15 0.20 0.25 0.30

02

46

810

1214

UNVOICEDVOICED

Voiced fricatives tend to be shorter than unvoiced

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 24

Page 13: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Nasal Production

� Velum lowering results in airflow through nasal cavity� Consonants produced with closure in oral cavity� Nasalized vowels have output through oral and nasal cavities� Nasal murmurs have similar spectral characteristics

[m] [n] [4]

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 25

Nasal Consonants of American English

� Three places of articulation: Labial, Alveolar, and Velar

� Always attached to a vowel, though can form an entire syllable inunstressed environments ([nÍ ], [mÍ ], [4Í ])

� /4/ is always post-vocalic

� Place identified by neighboring formant transitions

Type NasalLabial /m/ m meDental /n/ n kneeVelar /4/ ng sing

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 26

Page 14: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Nasal Durations

Singleton Unvoiced Cluster

Voiced Cluster

0

25

50

75

100

125

150

Dur

atio

n (m

s)

Nasal consonants tend to be shorter in clusters with unvoicedconsonants, and longer with voiced consonants

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 27

Semivowel Production

� Constriction in vocal tract, no turbulence

� Slower articulatory motion than other consonants

� Laterals form complete closure with tongue tip,airflow via sides of constriction

[w] [y] [r] [l]

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 28

Page 15: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Semivowels of American English

� There are 4 semivowels in American English

� Always attached to a vowel, though /l/ can form an entire syllablein unstressed environments ([lÍ ])

� Extreme articulation of a corresponding vowel

– Similar formant positions

– Generally weaker due to constriction

Type Semivowel Nearest VowelGlides /w/ w wet /u/

/y/ y yet /i/Liquids /r/ r red /5/

/l/ l let /o/

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 29

Acoustic Properties of Semivowels

� /w/ is characterized by a very low F1, F2

– Typically a rapid spectral falloff above F2

� /y/ is characterized by very low F1, very high F2

� /r/ is characterized by a very low F3

– Prevocalic F3 < medial F3 < postvocalic F3

� /l/ is characterized by a low F1 and F2

– Often presence of high frequency energy

– Postvocalic /l/ characterized by minimal spectral discontinuity,gradual motion of formants

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 30

Page 16: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Aspirant Production

� /h/ in American English

� Turbulence excitation at glottis

� No constriction in the vocal tract, normal formant excitation

� Coupling with subglottal system results in little energy in F1region

� Periodic excitation can be present in medial position

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 31

Stop Production

� Complete closure in the vocal tract, pressure build up

� Sudden release of the constriction, turbulence noise

� Can have periodic excitation during closure

[b] [d] [g]

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 32

Page 17: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Stops of American English

� There are 6 stop consonants in American English

� Same places of articulation as nasal consonants

� Unvoiced stops are typically aspirated

� Voiced stops usually exhibit a “voice-bar’’ during closure

� Information about formant transitions and release useful forclassification

Type Voiced UnvoicedLabial /b/ b bee /p/ p peaDental /d/ d Dee /t/ t teaVelar /g/ g geese /k/ k key

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 33

Singleton Stop Durations

b d g p t k0

10

20

30

40

50

60

70

80

VO

T D

urat

ion

(ms)

The voice onset time (VOT) of unvoiced stopsis longer than that of voiced stops

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 34

Page 18: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

/s/-Stop Durations

p t k0

10

20

30

40

50

60

70

80

VO

T D

urat

ion

(ms)

Unvoiced stops are unaspirated in /s/ stop sequences

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 35

Stop-Semivowel Durations

b d g p t k0

10

20

30

40

50

60

70

80

90

100

VO

T D

urat

ion

(ms)

Singletons

[Stop][Semivowel] Clusters

Semivowels are partially devoiced in stop semivowel sequences

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 36

Page 19: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Voicing Cues for Stops

There are many voicing cues for a stop

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 37

Affricate Production

� Alveolar-stop palatal-fricative pairs

� Sudden release of the constriction, turbulence noise

� Can have periodic excitation during closure

Affricates of American English

There are two affricates in American English

Voiced Unvoiced/J/ jh judge /C/ ch church

CLSP Workshop 2000 Acoustic Properties of Speech Sounds 38

Page 20: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Speech from a Close-Talking Microphone

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8

16 16Zero Crossing Rate

dB dBTotal Energy

dB dBEnergy -- 125 Hz to 750 Hz

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/sennheiser.wavPrinted by jwc on Wed Jul 16 11:58:32 1997

Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 39

Speech from a Omni-Directional Microphone

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8

16 16Zero Crossing Rate

dB dBTotal Energy

dB dBEnergy -- 125 Hz to 750 Hz

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/bk.wavPrinted by jwc on Wed Jul 16 11:57:43 1997

Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 40

Page 21: Acoustic Properties of Speech Sounds - clsp.jhu.edu · Acoustic Properties of Speech Sounds Speech production Signal processing Properties of speech sounds of American English Microphone

Speech over a Telephone Channel

kHz kHz

Wide Band Spectrogram

kHz kHz

0

1

2

3

4

5

6

7

8

0

1

2

3

4

5

6

7

8

Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

kHz kHz

0 0

8 8

16 16Zero Crossing Rate

dB dBTotal Energy

dB dBEnergy -- 125 Hz to 750 Hz

Waveform

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

File: /server/users/jwc/latex/sum97/telephone.wavPrinted by jwc on Wed Jul 16 11:59:12 1997

Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 41