Upload
vohuong
View
227
Download
0
Embed Size (px)
Citation preview
Acoustic Properties of Speech Sounds
� Speech production
� Signal processing
� Properties of speech sounds of American English
� Microphone variations
� Spectrographic Examples
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 1
Anatomical Structures for Speech Production
Soft Palate (Velum)
Hyoid Bone
Epiglottis
CricoidCartilage
Esophagus
Nasal Cavity
Hard Palate
Tongue
Thyroid Cartilage
Vocal Cords
Trachea
Lung
Sternum
Nasal Cavity
Hard Palate
Tongue
Thyroid Cartilage
Vocal FoldsTrachea
Lung
Soft Palate(Velum)
Jaw
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 2
Sub-Word Linguistic Units
� The phoneme is one of the most basic linguistic units used torepresent pronunciations of words
� ASR systems typically represent words as phoneme sequences
� English contains approximately 40 phonemes which can begrouped by manner and place of articulation
Manner Class NumberVowels 16Fricatives 8Stops 6Semivowels 4Nasals 3Affricates 2Aspirant 1
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 3
Phonemes in American English
IPA AB Word IPA AB Word IPA AB Word/i/ iy beat /s/ s see /w/ w wet/I/ ih bit /S/ sh she /r/ r red/e/ ey bait /f/ f fee /l/ l let/E/ eh bet /T/ th thief /y/ y yet/@/ ae bat /z/ z z /m/ m meet/a/ aa bob /Z/ zh Gigi /n/ n neat/O/ ao bought /v/ v v /4/ ng sing/^/ ah but /D/ dh thee /C/ ch church/o/ ow boat /p/ p pea /J/ jh judge/U/ uh book /t/ t tea /h/ hh heat/u/ uw boot /k/ k key/5/ er bird /b/ b bay/a¤ / ay bite /d/ d day/O¤ / oy Boyd /g/ g geese/a⁄ / aw bout/{/ ax about
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 4
Places of Articulation for Speech Production
PalatalVelar
Uvular
AlveopalatalAlveolar
Labial
Dental
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 5
A Speech Waveform
Two plus seven is less than ten
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 6
Spectral Representations
� Speech waveforms are usually sampled at rates varying from 8K(telephone) to 20K (wide-band) samples/sec
� ASR systems typically transform the waveform into a spectrum: asequence of frequency-based analyses usually performed atregular intervals (e.g., 10 ms)
� A short-time Fourier transform (STFT) performs a spectral analysison waveform segments small enough to be able to assume thatthe speech signal is quasi-stationary
� The waveform segment is created by a moving window, whosetype (e.g., Hamming) and duration (e.g., 5-25ms) have asignificant impact on the resulting spectrum
� A spectrogram is an image computed from the resultingspectrum, which is often used to examine the waveform
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 7
Short-Time Fourier Transform
w [ 50 - m ] w [ 100 - m ] w [ 200 - m ]
x [ m ]
0 n = 50 n = 100 n = 200
m
Xn(ejω) =+∞∑m=−∞
w[n −m]x[m]e−jωm
� If n is fixed, then it can be shown that:
Xn(ejω) = 12π
∫ π
−πW(ejθ)ejθnX(ej(ω+θ))dθ
� The above equation is meaningful only if we assume that X(ejω)represents the Fourier transform of a signal whose properties continueoutside the window, or simply that the signal is zero outside the window.
� In order for Xn(ejω) to correspond to X(ejω), W(ejω) must resemble animpulse with respect to X(ejω).
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 8
Comparison of Windows
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 9
Comparison of Windows (cont’d)
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 10
A Wide-Band Speech Spectrogram
Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 11
A Narrow-Band Speech Spectrogram
Two plus seven is less than tenCLSP Workshop 2000 Acoustic Properties of Speech Sounds 12
Spectral Averages: Corpus and Representation
� TIMIT acoustic-phonetic corpus
– phonetic transcription aligned with waveform
– native speakers of American English (8 dialects)
– 8 sentences/speaker (dialect sentences excluded)
– 136 female, 326 male speakers (NIST train set)
– 3,696 utterances, 142,910 tokens
� Mel-Frequency Spectral Coefficients (MFSC’s)
– Mel-frequency scale (linear < 1 kHz, log > 1 kHz)
– 40 channels (200 Hz - 6.4 kHz)
– 25 ms Hamming window, 5 ms frame-rate
– Average computed over entire phonetic token (for stopsspectral slice at release was used)
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 13
Happy Little Vowel Chart"So inaccurate, yet so useful."
SCHWAS:Plain ({) About /{ba⁄t/Front (|) Roses /ro⁄z|z/Retroflex (}) Forever /f}Ev5/
F 2 In
crea
ses
F1 Increases
MID LOWHIGH
FRONT
BACK
uÚ
OUu
ao
^,{
E@
eI
i
TENSE = Towards Edgestends to be longer
LAX = Towards Centertends to be shorter
Rob's
Think F3 is mighty low? Your pal 5 is the way to go!
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 14
Friendly Little Consonant Chart"Somewhat more accurate, yet somewhat less useful."
Labial Alveolar Palatal VelarPlace of Articulation
Voicing: Unvoiced Voiced
Nas
alFr
icat
ive
Stop
Man
ner
of
Art
icula
tion
p b
f v
m
Dental
T D
n 4
s z S Z
t d k g
The Semi-vowels:
is like an extremey i
is like an extremew u
is like an extremel o
is like an extremer 5
The Affricates:
is likeC t+S
is likeJ d+Z
The Odds and Ends:
h (unvoiced h)
H (voiced h)
F (flap) FÊ (nasalized flap)
? (glottal stop)
Weak (Non-strident) Strong (Strident)
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 15
Vowel Production
� No significant constriction in the vocal tract
� Usually produced with periodic excitation
� Acoustic characteristics depend on the position of the jaw,tongue, and lips
[i] [@] [a] [u]
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 16
Vowels of American English
� There are approximately 18 vowels in American English made upof monothongs, diphthongs, and reduced vowels (schwa’s)
� They are often described by the articulatory features: High/Low,Front/Back, Retroflexed, Rounded, and Tense/Lax
/i/ iy beat /O/ ao bought /a¤ / ay bite/I/ ih bit /^/ ah but /O¤ / oy Boyd/e/ ey bait /o/ ow boat /a⁄ / aw bout/E/ eh bet /U/ uh book [{] ax about/@/ ae bat /u/ uw boot [|] ix roses/a/ aa Bob /5/ er Bert [}] axr butter
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 17
Vowel Formant Averages
� Vowels are often characterized by F1, F2, and F3
� High/Low is correlated with F1
� Front/Back is correlated with F2
� Retroflexion is marked by a low F3
Female Speakers Male Speakers
i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0
500
1000
1500
2000
2500
3000
3500
Ave
rag
e F
req
uen
cy (
Hz)
Vowel
F1F2F3
i¤ I e¤ E @ a O ^ o⁄ U u 5 { |0
500
1000
1500
2000
2500
3000
3500
Ave
rag
e F
req
uen
cy (
Hz)
Vowel
F1F2F3
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 18
Vowel Formant Trajectories
� Diphthongs can have significant formant motion
� Most vowels in American English are somewhat diphthongized
Female Speakers Male Speakers
700
900
1100
1300
1500
1700
1900
2100
2300
2500
2700
300 400 500 600 700 800 900
F2
F1
i¤
E
a¤
O¤
{
|
a⁄^
a
e¤
I
O
@
u o⁄
5U
700
900
1100
1300
1500
1700
1900
2100
2300
2500
2700
300 400 500 600 700 800 900
F2
F1
i¤
@
|
EI
e¤
^
O
o⁄
5U
u
O¤
a¤
{
a⁄
a
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 19
Vowel Durations
� Each vowel has a different intrinsic duration
� Schwa’s have distinctly shorter durations (50ms)
� /I, E, ^, U/ are the shortest monothongs
� Context can greatly influence vowel duration
Female Speakers Male Speakers
i¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u0
50
100
150
200
250
Ave
rag
e D
ura
tio
n (
ms)
Voweli¤ I e¤ E @ a O ^ o⁄ U u 5 { | a⁄ o¤ a¤ ¤u
0
50
100
150
200
250
Ave
rag
e D
ura
tio
n (
ms)
Vowel
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 20
Fricative Production
� Turbulence produced at narrow constriction
� Constriction position determines acoustic characteristics
� Can be produced with periodic excitation
[f] [T] [s] [S]
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 21
Fricatives of American English
� There are 8 fricatives in American English
� They are often described by the features Strident/Non-Strident(Strong/Weak), Voiced/Unvoiced
� Four places of articulation: Labial, Dental, Alveolar, and Palatal
Type Unvoiced VoicedLabial /f/ f fee /v/ v vDental /T/ th thief /D/ dh theeAlveolar /s/ s see /z/ z zPalatal /S/ sh she /Z/ zh Gigi
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 22
Fricative Energy
Average Total Energy
Pro
babi
lity
Den
sity
una
djus
ted
for
freq
uenc
y
-100 -90 -80 -70 -60 -50 -40
0.0
0.02
0.04
0.06
NON-STRIDENTSTRIDENT
Strident fricatives tend to be stronger than non-strident
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 23
Fricative Durations
Duration
Pro
babi
lity
Den
sity
una
djus
ted
for
freq
uenc
y
0.0 0.05 0.10 0.15 0.20 0.25 0.30
02
46
810
1214
UNVOICEDVOICED
Voiced fricatives tend to be shorter than unvoiced
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 24
Nasal Production
� Velum lowering results in airflow through nasal cavity� Consonants produced with closure in oral cavity� Nasalized vowels have output through oral and nasal cavities� Nasal murmurs have similar spectral characteristics
[m] [n] [4]
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 25
Nasal Consonants of American English
� Three places of articulation: Labial, Alveolar, and Velar
� Always attached to a vowel, though can form an entire syllable inunstressed environments ([nÍ ], [mÍ ], [4Í ])
� /4/ is always post-vocalic
� Place identified by neighboring formant transitions
Type NasalLabial /m/ m meDental /n/ n kneeVelar /4/ ng sing
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 26
Nasal Durations
Singleton Unvoiced Cluster
Voiced Cluster
0
25
50
75
100
125
150
Dur
atio
n (m
s)
Nasal consonants tend to be shorter in clusters with unvoicedconsonants, and longer with voiced consonants
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 27
Semivowel Production
� Constriction in vocal tract, no turbulence
� Slower articulatory motion than other consonants
� Laterals form complete closure with tongue tip,airflow via sides of constriction
[w] [y] [r] [l]
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 28
Semivowels of American English
� There are 4 semivowels in American English
� Always attached to a vowel, though /l/ can form an entire syllablein unstressed environments ([lÍ ])
� Extreme articulation of a corresponding vowel
– Similar formant positions
– Generally weaker due to constriction
Type Semivowel Nearest VowelGlides /w/ w wet /u/
/y/ y yet /i/Liquids /r/ r red /5/
/l/ l let /o/
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 29
Acoustic Properties of Semivowels
� /w/ is characterized by a very low F1, F2
– Typically a rapid spectral falloff above F2
� /y/ is characterized by very low F1, very high F2
� /r/ is characterized by a very low F3
– Prevocalic F3 < medial F3 < postvocalic F3
� /l/ is characterized by a low F1 and F2
– Often presence of high frequency energy
– Postvocalic /l/ characterized by minimal spectral discontinuity,gradual motion of formants
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 30
Aspirant Production
� /h/ in American English
� Turbulence excitation at glottis
� No constriction in the vocal tract, normal formant excitation
� Coupling with subglottal system results in little energy in F1region
� Periodic excitation can be present in medial position
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 31
Stop Production
� Complete closure in the vocal tract, pressure build up
� Sudden release of the constriction, turbulence noise
� Can have periodic excitation during closure
[b] [d] [g]
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 32
Stops of American English
� There are 6 stop consonants in American English
� Same places of articulation as nasal consonants
� Unvoiced stops are typically aspirated
� Voiced stops usually exhibit a “voice-bar’’ during closure
� Information about formant transitions and release useful forclassification
Type Voiced UnvoicedLabial /b/ b bee /p/ p peaDental /d/ d Dee /t/ t teaVelar /g/ g geese /k/ k key
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 33
Singleton Stop Durations
b d g p t k0
10
20
30
40
50
60
70
80
VO
T D
urat
ion
(ms)
The voice onset time (VOT) of unvoiced stopsis longer than that of voiced stops
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 34
/s/-Stop Durations
p t k0
10
20
30
40
50
60
70
80
VO
T D
urat
ion
(ms)
Unvoiced stops are unaspirated in /s/ stop sequences
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 35
Stop-Semivowel Durations
b d g p t k0
10
20
30
40
50
60
70
80
90
100
VO
T D
urat
ion
(ms)
Singletons
[Stop][Semivowel] Clusters
Semivowels are partially devoiced in stop semivowel sequences
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 36
Voicing Cues for Stops
There are many voicing cues for a stop
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 37
Affricate Production
� Alveolar-stop palatal-fricative pairs
� Sudden release of the constriction, turbulence noise
� Can have periodic excitation during closure
Affricates of American English
There are two affricates in American English
Voiced Unvoiced/J/ jh judge /C/ ch church
CLSP Workshop 2000 Acoustic Properties of Speech Sounds 38
Speech from a Close-Talking Microphone
kHz kHz
Wide Band Spectrogram
kHz kHz
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
7
8
Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
kHz kHz
0 0
8 8
16 16Zero Crossing Rate
dB dBTotal Energy
dB dBEnergy -- 125 Hz to 750 Hz
Waveform
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
File: /server/users/jwc/latex/sum97/sennheiser.wavPrinted by jwc on Wed Jul 16 11:58:32 1997
Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 39
Speech from a Omni-Directional Microphone
kHz kHz
Wide Band Spectrogram
kHz kHz
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
7
8
Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
kHz kHz
0 0
8 8
16 16Zero Crossing Rate
dB dBTotal Energy
dB dBEnergy -- 125 Hz to 750 Hz
Waveform
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
File: /server/users/jwc/latex/sum97/bk.wavPrinted by jwc on Wed Jul 16 11:57:43 1997
Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 40
Speech over a Telephone Channel
kHz kHz
Wide Band Spectrogram
kHz kHz
0
1
2
3
4
5
6
7
8
0
1
2
3
4
5
6
7
8
Time (seconds)0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
kHz kHz
0 0
8 8
16 16Zero Crossing Rate
dB dBTotal Energy
dB dBEnergy -- 125 Hz to 750 Hz
Waveform
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
File: /server/users/jwc/latex/sum97/telephone.wavPrinted by jwc on Wed Jul 16 11:59:12 1997
Page: 1The Thinker is a famous sculptureCLSP Workshop 2000 Acoustic Properties of Speech Sounds 41