Feature Extraction from Speech Signals
Abeer Alwan
Speech Processing and Auditory Perception Laboratory (SPAPL) Department of Electrical Engineering, UCLA
http://www.ee.ucla.edu/spapl [email protected]
[Figure: waveform and spectrogram (0-4 kHz) of the utterance "ONE FIVE SEVEN".]

[Figure: short-time spectrum of one frame, annotated across successive slides: harmonics of the pitch (f0 = 250 Hz), the vocal tract transfer envelope, and its formant peaks.]

[Figure: the MFCC front end: Short-Time Fourier Transform → Mel Filter Bank → Cepstrum Transform (log(.), DCT, ∆1/∆2) → MFCC features, which are fed to Hidden Markov Models.]
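The front end on these slides (Short-Time Fourier Transform → Mel filter bank → log(.) → DCT) can be sketched compactly in numpy. This is a minimal illustration, not the deck's exact implementation: the frame length, hop, and filter counts below are illustrative defaults, and the delta coefficients are omitted for brevity.

```python
import numpy as np

def mel(hz):
    # Hz -> Mel scale
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def inv_mel(m):
    # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with center frequencies equally spaced on the Mel scale
    edges = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, frame_len=256, hop=80, n_filters=26, n_ceps=13):
    # 1. frame + window, 2. |STFT|^2, 3. Mel filter bank, 4. log, 5. DCT-II
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len] * np.hamming(frame_len))
    frames = np.array(frames)
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    fb = mel_filterbank(n_filters, frame_len, sr)
    logmel = np.log(power @ fb.T + 1e-10)
    n = np.arange(n_filters)
    # unnormalized DCT-II matrix: dct[k, n] = cos(pi * k * (2n + 1) / (2N))
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return logmel @ dct.T
```

In practice the ∆1/∆2 coefficients shown on the slides would be appended as (regression-smoothed) frame-to-frame differences of these cepstra.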
Variability
• The variability in the way humans produce speech (due to, for example, gender, accent, age, and emotion) necessitates data-driven approaches to capture significant trends in the data.
• The same variability, however, may not be modeled adequately by such systems, especially if data are limited and/or corrupted by noise.
Technological Challenges
• Robustness of automatic speech recognition (ASR) systems to background acoustic and/or channel noise
• ASR robustness to variations due to gender, age, affect, dialect, accent, and speaking rate
• Limited data
Within- and across-speaker variation in voice quality
Motivation
• Voice quality varies both between and within speakers
– Within-speaker variability may not be negligible
– A quantitative model of this variability is needed
Database
• Desired properties: a large number of speakers, multiple recording sessions, multiple speech tasks, high-quality recordings.
• Existing databases (TIMIT, Switchboard, a Japanese corpus, ...) do not satisfy all of these at once.
• A new database is therefore needed: existing databases are not adequate to study between- and within-speaker variability at the same time.
UCLA Database
• Data collected in collaboration with the Linguistics department and Medical school
• Sources of variability:
– Day/time variability (session variability)
– Read speech vs. conversational speech
– Low-affect speech vs. high-affect speech
• Recordings:
– Steady-state vowel /a/ (3 repetitions)
– Reading sentences
– Explaining something to someone they do not know
– Phone call to someone they know
– Telling something unimportant/joyful/annoying
– Speaking to pets
Preliminary Study Overview
• Objective
– Validate, via acoustic analysis, the importance of studying inter- and intra-speaker variability in voice quality
• Preliminary experiments with steady-state vowels
– Motivation: minimal intra-speaker variability; acoustic measures can be estimated with high accuracy
• Stimuli
– Subset of the UCLA database
– 9 female and 9 male speakers (18 speakers)
– 3 repetitions of the vowel /a/ from 3 sessions on different days (9 tokens per speaker)
– Sampling rate: 22,100 Hz
Observations
– The intra-speaker variance of some features is comparable to, or even larger than, the inter-speaker variance

[Figure: feature variances for females and males; labeled features include F0, H-based measures, NR, and the total.]
Experimental Setup
• Classifier
– Support Vector Machine (LIBSVM)
• Training conditions
– Matched: trained on all 3 sessions with the first 2 vowels; tested on all 3 sessions with the remaining vowel
– Mismatched: trained on 2 sessions with all 3 vowels; tested on the remaining session with 3 vowels
Speaker Identification Results

Accuracy (matched / mismatched conditions):
Set 1 (vocal tract):   81.5% / 61.1%
Set 2 (voice source):  96.3% / 79.6%
Set 3 (perceptual):    94.4% / 90.7%
Set 4 (MFCCs):         98.2% / 83.3%

• Observations
– Features that are good for the matched condition are not necessarily good for the mismatched condition
– Perceptual voice quality features are the most robust to session variability
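The matched vs. mismatched protocol above can be sketched end to end. The deck uses an SVM (LIBSVM); to keep this self-contained, the sketch below substitutes a nearest-centroid classifier on synthetic features, so it illustrates only the two train/test splits. All dimensions and noise levels are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 18 speakers x 3 sessions x 3 vowel tokens,
# each token a 10-dim feature vector. Session-dependent offsets mimic
# within-speaker (session) variability.
n_spk, n_sess, n_tok, dim = 18, 3, 3, 10
spk_means = rng.normal(0, 3.0, (n_spk, dim))
sess_shift = rng.normal(0, 1.0, (n_spk, n_sess, dim))
X = (spk_means[:, None, None, :] + sess_shift[:, :, None, :]
     + rng.normal(0, 0.3, (n_spk, n_sess, n_tok, dim)))

def nearest_centroid_acc(train, test):
    # train/test: (n_spk, n_items, dim); classify each test item to the
    # speaker whose training centroid is closest (Euclidean distance).
    centroids = train.mean(axis=1)
    correct, total = 0, 0
    for spk in range(n_spk):
        for item in test[spk]:
            pred = np.argmin(np.linalg.norm(centroids - item, axis=1))
            correct += int(pred == spk)
            total += 1
    return correct / total

# Matched: train on tokens 0-1 of all 3 sessions, test on token 2.
matched = nearest_centroid_acc(X[:, :, :2].reshape(n_spk, -1, dim),
                               X[:, :, 2].reshape(n_spk, -1, dim))
# Mismatched: train on sessions 0-1 (all tokens), test on session 2.
mismatched = nearest_centroid_acc(X[:, :2].reshape(n_spk, -1, dim),
                                  X[:, 2].reshape(n_spk, -1, dim))
```

With real features, session variability unseen in training is what drives the accuracy drop in the mismatched condition.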
Height Estimation and Speaker Adaptation using Subglottal Resonances
The Subglottal System
• The acoustic system below the glottis consists of the trachea, bronchi, and lungs.
• Similar to the supraglottal system (the vocal tract), the subglottal system is characterized by a series of poles and zeros, referred to as subglottal resonances and anti-resonances.
Coupling of the Subglottal System
• Introduces pole-zero pairs in the vocal tract transfer function (Stevens, 1998)
• Theoretically, subglottal resonances are independent of speech sounds (Lulich, 2008b)
• The 'acoustic length' of the subglottal system is proportional to speaker height.
Motivation for studying SGRs
• The vocal-tract shape changes continuously, but the subglottal tract has a fixed configuration.
– SGRs (red) remain constant irrespective of spoken content and language, unlike formants (green).
• SGRs form natural boundaries between vowel categories.

[Figure: vowel space with /iy/, /uw/, and /aa/; axes show increasing vowel height and increasing vowel backness.]
Major goals of this research
• Collect a sizable database of time-synchronized speech and subglottal acoustics for both adults and children.
– Use an accelerometer to record subglottal acoustics.
• Develop automatic methods to extract subglottal information using speech signals only.
• Investigate the role of subglottal features in tasks requiring speaker-specific information.
– Exploit the stationarity of subglottal acoustics for improved performance when speech data are limited.
The WashU-UCLA corpora
• Time-synchronized recordings of speech (microphone) and subglottal acoustics (K&K Sound HotSpot accelerometer).
• Subjects: 50 adults (18 to 25 yrs); 43 children (6 to 17 yrs).
– Native speakers of American English.
– Speaker age and height were recorded.
• Recordings were made in a sound-attenuated booth and were phrases of the form "I said a <CVC> again".
– 10 repetitions of each <CVC> for adults; 4 repetitions for children.
– Vowel beginning, steady state, and end were labeled manually.
• Data (for adults only) will be released for free by the LDC on 4/15.

[Diagram: data (speech & subglottal acoustics) → measurement and analysis → SGR estimation algorithms and estimation of subglottal spectral features → height estimation, speaker normalization, and speaker identification.]
Results: actual SGRs

                    Sg1 (Hz)  Sg2 (Hz)  Sg3 (Hz)
Adults
  Average male         542      1327      2198
  Average female       659      1511      2410
  Overall average      601      1419      2304
Children (6 to 17 yrs)
  Average male         727      1720      2710
  Average female       752      1778      2720
  Overall average      730      1740      2710

• Significant differences between:
– Adults and children
– Adult males and adult females

Results: SGRs versus body height
• Trachea length correlates with height [Griscom & Wohl, 1985].
– This results in negative correlations between SGRs and height.
Evaluation setup
• For adults
– Training: 35 speakers in the WashU-UCLA corpus.
– Evaluation: 14 speakers in the MIT Tracheal Resonance database (with known SGRs).
• For children
– Training: 25 speakers in the WashU-UCLA corpus.
– Evaluation: the remaining 18 speakers.
• Utterances are 2 to 3 seconds long, sampled at 8 kHz.

Results
• RMSEs (in Hz) at 5 dB SNR, for adults (clean, babble, pink, factory, and white noise) and for children (clean, babble, pink, car, and white noise), are on the order of the measurement standard deviations.

[Figure: RMSE bar charts for Sg1, Sg2, and Sg3 under each noise type.]
Height Estimation
• Based on the correlation between SGRs and body height.

[Diagram: actual SGRs and heights from the WashU-UCLA corpus are used to model the relationship between SGRs and speaker height; for an unknown speaker, the automatic SGR estimation algorithm produces estimated SGRs, from which height is estimated.]
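The pipeline above boils down to fitting the SGR-height relationship on speakers with known heights, then applying it to estimated SGRs. A minimal sketch with synthetic data: all numeric constants below are made up for illustration; only the negative sign of the correlation follows the deck.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: Sg1 (Hz) tends to fall as height rises
# (negative correlation, consistent with trachea length vs. height).
heights = rng.uniform(150, 190, 50)                    # cm (synthetic)
sg1 = 1200.0 - 3.5 * heights + rng.normal(0, 15, 50)   # Hz (synthetic)

# Fit height = a * Sg1 + b by least squares.
a, b = np.polyfit(sg1, heights, 1)

def estimate_height(sg1_hz):
    # Map an (automatically estimated) Sg1 value to a height estimate.
    return a * sg1_hz + b
```

A single feature and a simple regression model are enough here, which is exactly the advantage the deck claims over feature-heavy alternatives.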
Height estimation: evaluation
• Training data: SGRs and heights of 50 speakers.
• Evaluation data: speech signals of 604 speakers (TIMIT).

                  Using Sg1   Using Sg2   Ganchev et al.
Mean abs. error     5.3 cm      5.4 cm      5.3 cm
RMS error           6.6 cm      6.7 cm      6.8 cm

Arsikere, Leung, Lulich and Alwan (Speech Communication, 2013)
Results
• RMSEs (cm) for clean speech, comparing the three SGRs: 6.8 (Sg1), 6.9 (Sg2), 7.1 (Sg3).
• RMSEs (cm) for noisy speech (0 dB SNR) using Sg1: clean 6.8, babble 6.9, white 7.2, pink 7.0, factory 6.8.
• Advantages over [Ganchev et al., 2010] (RMSE = 6.8 cm): just 1 feature (as opposed to 50); very little training data (50 speakers vs. 468); no need to retrain models for every scenario.
SGR-based normalization (SGRN)
• Motivation for using SGRs:
– Role of Sg1 and Sg2 as vowel-feature boundaries.
– Phonetic invariance.
– Can be estimated fairly well from noisy speech.

[Figure: F1-F2 vowel space (blue: male adult; red: male child), with each speaker's Sg1 and Sg2 marked as boundaries.]
Evaluation setup
• Task: recognition of connected-digit utterances.
– TIDIGITS database (utterances of 1 to 7 digits each).
– ASR models trained on 112 adult speakers; tested on 50 children.
– Training in the clean condition.
– Testing in clean + babble, pink, white, and car noise (5 to 15 dB).
• Hidden Markov Models (HMMs) are used.
• Features: standard MFCCs extracted at 10 ms intervals.
• Performance metric: word error rate (WER).
– Total number of substitution, insertion, and deletion errors.
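The WER metric above counts substitution, insertion, and deletion errors, normalized by the number of reference words; it is computed with a word-level edit distance:

```python
def word_error_rate(ref, hyp):
    # Levenshtein distance over words: substitutions + insertions +
    # deletions, divided by the number of reference words.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)
```

For example, word_error_rate("one two three", "one three four") is 2/3: one substitution pair, however the alignment is chosen, costs two edits against three reference words.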
Experimental results

WERs (in %) by age group, averaged across noise types:
            6 to 8 yrs   9 to 11 yrs   12 to 15 yrs
Baseline       39.7          26.2          18.9
VTLN           23.2          17.9          14.4
SGRN           16.7          16.1          14.1

WERs (in %) by utterance length, averaged across noise types:
            1 or 2 words   3 to 5 words   6 or 7 words
Baseline        24.7           27.5           27.5
VTLN            18.2           18.4           17.4
SGRN            14.9           16.2           15.6
Summary (speaker normalization)
• Proposed approach: SGR-based normalization (SGRN).
– ML corrections applied to initial SGR estimates, which are fairly robust to noise; frequency-dependent scaling.
• Experiments on children's ASR in quiet and noise: SGRN outperforms VTLN, especially for young speakers and short utterances.
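The slides describe SGRN as frequency-dependent scaling anchored on SGR estimates. As a hedged sketch (not the published algorithm), one can picture a piecewise-linear warp that maps a speaker's Sg1 and Sg2 onto reference values; the function below and its endpoints are illustrative assumptions.

```python
import numpy as np

def sgr_warp(freqs, spk_sgrs, ref_sgrs, f_max=4000.0):
    # Piecewise-linear frequency warping: map the speaker's (Sg1, Sg2)
    # onto reference values, keeping 0 Hz and f_max fixed. This makes
    # the scaling frequency-dependent rather than a single VTLN factor.
    src = np.array([0.0, *spk_sgrs, f_max])
    dst = np.array([0.0, *ref_sgrs, f_max])
    return np.interp(freqs, src, dst)
```

For example, a child's (Sg1, Sg2) = (727, 1720) Hz can be warped toward the adult-male averages (542, 1327) Hz from the SGR table earlier in the deck; frequencies between the anchors move proportionally.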
Speaker Identification (SID)
• SGCCs: computed just like speech MFCCs, but from subglottal acoustics.
Conclusion and future directions
• Subglottal features are useful for: (1) height estimation, (2) speaker normalization for ASR, (3) speaker identification, and (4) cross-language adaptation.
– Effective with limited data.
– Robust to environmental noise.
• Future research directions:
– Data collection and analysis in other languages.
– Pole-zero models for SGR estimation from speech.
– Larger databases and better models for height estimation.
Noise Robust ASR
Speech Perception in Noise
Healthy-hearing adults are remarkably adept at perceiving speech in noise. However:
• Nearly 30 million North Americans are hearing impaired, yet fewer than 6 million use hearing aids. The most common complaint of hearing-aid users is listening to speech in naturally noisy environments.
• The performance of automatic speech recognition (ASR) systems degrades significantly in the presence of noise.
The ‘Robust’ Human Auditory System
The auditory system is extremely robust to noise due to both:
• "Intelligent" high-level processing
• An inherently robust auditory representation (front-end processing)
Knowledge-Based Signal-processing Techniques
• Adaptation (sensitivity to onsets and offsets): modeled after forward-masking experiments
• Peak isolation/spectral sharpening: physiological and perceptual evidence
• Not all 'uniform' segments are equally important (variable frame rate, VFR)
• The vocal tract transfer function (VTTF) changes slowly in time (peak threading)
• Extracting the spectral envelope more precisely (harmonic demodulation)
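Harmonic demodulation, as listed above, extracts the spectral envelope more precisely by sampling the spectrum only at harmonic peaks and interpolating between them. A minimal numpy sketch; the function name and the peak-search window are illustrative assumptions, not the published method.

```python
import numpy as np

def harmonic_envelope(spectrum, f0_bin):
    # Sample the magnitude spectrum near multiples of the pitch bin,
    # refine each to the local maximum, and interpolate between the
    # harmonic peaks. The result is an envelope free of pitch ripple.
    n = len(spectrum)
    peaks = []
    for hb in np.arange(f0_bin, n, f0_bin):
        lo = max(hb - f0_bin // 2, 0)
        hi = min(hb + f0_bin // 2 + 1, n)
        peaks.append(lo + int(np.argmax(spectrum[lo:hi])))
    peaks = np.array(peaks)
    return np.interp(np.arange(n), peaks, spectrum[peaks])
```

On a harmonic spectrum (energy every f0 bins) the envelope bridges the valleys between harmonics instead of following them, which is the point of the technique.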
I. Adaptation: the auditory system adapts as a function of previous input
• Experiments varied masker level, probe delay, and frequency
Strope and Alwan (1997, 1998)

"These schemes will succeed only to the extent that metrics can be found that are sensitive to phonetically relevant spectral differences." (D.H. Klatt, 1981)
II. Peak Isolation
[Figure: peak-isolated spectra in quiet and at 5 dB SNR.]

• Spectral changes are important perceptual cues for discrimination. Such changes can occur over very short time intervals.
• Computing frames every 10 ms, as commonly done in ASR, is not sufficient to capture such dynamic changes.
III. Variable Frame Rate Analysis (VFR)
Zhu and Alwan (2000, 2003); You and Alwan (2002)

An example of VFR: frame selection for the digit string "one two" with silence.

[Figure: (top) speech waveform; (middle) inter-frame distance d(i), based on Euclidean distance or entropy; (bottom) frames selected.]
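The VFR selection rule illustrated above can be sketched as follows: keep a frame only when it has moved far enough from the last selected frame, so steady regions (silence, vowel steady states) contribute few frames while transitions contribute many. The greedy keep-last-selected rule below is one simple variant, not necessarily the exact published algorithm.

```python
import numpy as np

def select_frames(features, threshold):
    # features: (n_frames, dim) array of per-frame feature vectors.
    # Keep frame i only if its Euclidean distance d(i) from the most
    # recently selected frame exceeds the threshold.
    kept = [0]
    for i in range(1, len(features)):
        d = np.linalg.norm(features[i] - features[kept[-1]])
        if d > threshold:
            kept.append(i)
    return kept
```

For instance, a feature track that sits near 0, jumps to 5, then jumps to 10 yields only the first frame of each steady region, thinning out the redundant frames between spectral changes.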
Summary
• Achieved improved performance for rapid speaker normalization and speaker ID with limited data, using subglottal resonance information.
• Estimated speaker height using a single feature.
• Achieved improved ASR noise robustness using adaptation, peak isolation, and variable frame rate analysis.

However, we are yet to achieve human performance on these tasks (except for height estimation)!
Other Research Projects
• High-speed imaging of the vocal cords to improve modeling and synthesis
• Bird song classification using limited data
Acknowledgements
Former and Current Students: S. Wang, H. Arsikere, S. Park, B. Strope
Collaborators: S. Lulich, M. Sommers
Work supported in part by the NSF, NIH, DARPA, SONY Playstation