Feature Extraction from Speech Signals

Abeer Alwan

Speech Processing and Auditory Perception Laboratory (SPAPL)
Department of Electrical Engineering, UCLA
http://www.ee.ucla.edu/spapl
[email protected]



Page 1

Feature Extraction from Speech Signals

Abeer Alwan

Speech Processing and Auditory Perception Laboratory (SPAPL) Department of Electrical Engineering, UCLA

http://www.ee.ucla.edu/spapl [email protected]

Page 2

[Figure: waveform (amplitude vs. sample index) and wideband spectrogram (0–4000 Hz) of the utterance "ONE FIVE SEVEN"]

Page 3

[Figure: spectrogram (0–4000 Hz) of "ONE FIVE SEVEN", with the words "ONE", "FIVE", and "SE-VEN" marked]

Page 4

[Figure: spectrogram of "ONE FIVE SEVEN" alongside the magnitude spectrum of one frame (0–4000 Hz)]

Page 5

[Figure: spectrogram of "ONE FIVE SEVEN" and the magnitude spectrum of one frame, with the harmonics labeled]

Page 6

[Figure: spectrogram and frame spectrum, with harmonics labeled; pitch f0 = 250 Hz]

Page 7

[Figure: spectrogram and frame spectrum; harmonics (pitch f0 = 250 Hz) and the vocal tract transfer envelope labeled]

Page 8

[Figure: spectrogram and frame spectrum; harmonics (pitch f0 = 250 Hz), the vocal tract transfer envelope, and the formants labeled]

Page 9

[Figure-only slide]

Page 10

[Figure: magnitude spectrum of one frame, 0–4000 Hz]

Page 11

[Figure: spectrogram of "ONE FIVE SEVEN"]

Filter Bank

Page 12

[Figure: spectrogram of "ONE FIVE SEVEN"]

Filter Bank

log(.), DCT, ∆1, ∆2

Page 13

[Figure: spectrogram of "ONE FIVE SEVEN"]

Filter Bank

log(.), DCT, ∆1, ∆2

Hidden Markov Models

Page 14

[Figure-only slide]

Page 15

[Figure: MFCC extraction pipeline for the utterance "ONE FIVE SEVEN" — waveform → Short-Time Fourier Transform → Mel Filter Bank → Cepstrum Transform (log(.), DCT, ∆1, ∆2) → MFCCs]
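The MFCC pipeline shown here — short-time Fourier transform, mel filter bank, log(.), DCT, and delta features — can be sketched in NumPy. This is a minimal illustration, not the exact front end used in the experiments; the frame length (25 ms at 8 kHz), hop (10 ms), filter count (26), cepstral count (13), and the simple first-difference delta are all assumed typical values.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=200, hop=80, n_filters=26, n_ceps=13):
    """Toy MFCC extractor: STFT -> mel filter bank -> log -> DCT."""
    # Frame the signal with a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame (short-time Fourier transform).
    nfft = 256
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular mel-spaced filter bank between 0 Hz and fs/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        for k in range(lo, c):
            fbank[j, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fbank[j, k] = (hi - k) / max(hi - c, 1)   # falling slope
    fb_energy = np.log(power @ fbank.T + 1e-10)       # log(.)
    # DCT-II decorrelates the log filter-bank energies (cepstrum transform).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return fb_energy @ dct.T                          # shape (n_frames, n_ceps)

def delta(feats):
    """∆ features approximated by a first difference across frames."""
    return np.vstack([feats[1:] - feats[:-1], np.zeros((1, feats.shape[1]))])
```

Real front ends typically add pre-emphasis, energy terms, and a regression-based delta window; the structure above is only meant to mirror the block diagram on this slide.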

Page 16

Variability

• The variability in the way humans produce speech due to, for example, gender, accent, age, and emotion necessitates data-driven approaches to capture significant trends/behavior in the data.

• The same variability, however, may not be modeled adequately by such data-driven systems, especially if data are limited and/or corrupted by noise.

Page 17

Technological Challenges

• Robustness of automatic speech recognition (ASR) systems to background acoustic and/or channel noise

• ASR robustness to variations due to gender, age, affect, dialect, accent, and speaking rate

• Limited data

Page 18

Within and across speaker variation in voice quality

Page 19

Motivation

• Voice quality varies both between and within speakers
  – Within-speaker variability may not be negligible
  – A quantitative model of this variability is needed

Page 20

Database

• Need a new database: existing databases are not adequate to study between- and within-speaker variability at the same time.

• Desired properties (compared across the TIMIT, Switchboard, Japanese, and UCLA corpora):
  – Large number of speakers
  – Multiple recording sessions
  – Multiple speech tasks
  – High-quality recordings

Page 21

UCLA Database

• Data collected in collaboration with the Linguistics department and the Medical school
• Within-speaker variability:
  – Day/time variability (session variability)
  – Read speech vs. conversational speech
  – Low-affect speech vs. high-affect speech
• Recordings:
  – Steady-state vowel /a/ (3 repetitions)
  – Reading sentences
  – Explaining something to someone they do not know
  – Phone call to someone they know
  – Telling something unimportant/joyful/annoying
  – Speaking to pets

Page 22

Preliminary Study Overview

• Objective
  – Validate the importance of studying inter- and intra-speaker variability in voice quality by acoustic analysis
• Preliminary experiments with steady-state vowels
  – Motivation:
    • Minimal intra-speaker variability
    • Acoustic measures can be estimated with high accuracy
• Stimuli
  – Subset of the UCLA database
  – 9 female and 9 male speakers (18 speakers)
  – 3 repetitions of the vowel /a/ from 3 sessions on different days (9 tokens per speaker)
  – Sampling rate: 22,100 Hz

Page 23

Observations

• The intra-speaker variance of some features is comparable to, or even larger than, the inter-speaker variance.

[Figure: intra- vs. inter-speaker variance by feature (F0, H, NR, total), shown separately for female and male speakers]

Page 24

Experimental Setup

• Classifier
  – Support Vector Machine (LIBSVM)
• Training conditions
  – Matched
    • Trained on all 3 sessions with the first 2 vowels
    • Tested on all 3 sessions with the remaining vowel
  – Mismatched
    • Trained on 2 sessions with all 3 vowels
    • Tested on the remaining session with all 3 vowels
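The matched and mismatched protocols can be sketched end to end on synthetic data. The experiments use an SVM (LIBSVM); to keep this sketch dependency-free it substitutes a nearest-class-mean classifier, and the 18-speaker/3-session/3-token layout mirrors the stimuli while the 8-dimensional Gaussian features, session-drift model, and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the vowel tokens: 18 speakers x 3 sessions x 3
# repetitions of /a/, each token summarized by an 8-dim feature vector.
S, SESS, TOK, D = 18, 3, 3, 8
speaker_mean = rng.normal(0, 3, (S, 1, 1, D))        # between-speaker spread
session_shift = rng.normal(0, 1, (S, SESS, 1, D))    # day-to-day drift
X = speaker_mean + session_shift + rng.normal(0, 0.3, (S, SESS, TOK, D))

def accuracy(train, test):
    """Nearest-class-mean speaker ID; train/test are (S, n, D) arrays
    whose first axis is the speaker (class) label."""
    centroids = train.mean(axis=1)                   # one centroid per speaker
    flat = test.reshape(-1, test.shape[-1])
    labels = np.repeat(np.arange(test.shape[0]), test.shape[1])
    dists = np.linalg.norm(flat[:, None, :] - centroids[None], axis=-1)
    return float((np.argmin(dists, axis=1) == labels).mean())

# Matched: train on the first 2 tokens of every session,
# test on the remaining token of every session.
matched = accuracy(X[:, :, :2].reshape(S, -1, D),
                   X[:, :, 2:].reshape(S, -1, D))

# Mismatched: train on 2 sessions, test on the held-out session.
mismatched = accuracy(X[:, :2].reshape(S, -1, D),
                      X[:, 2:].reshape(S, -1, D))

print(f"matched: {matched:.2f}  mismatched: {mismatched:.2f}")
```

Because the matched split exposes the classifier to every session's drift during training, it typically scores at least as well as the mismatched split, which is the contrast the setup above is designed to measure.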

Page 25

Speaker Identification Results

Feature sets: Set 1 – vocal tract; Set 2 – voice source; Set 3 – perceptual; Set 4 – MFCCs

Accuracy by feature set and condition:

              set 1   set 2   set 3   set 4
  matched     81.5%   96.3%   94.4%   98.2%
  mismatched  61.1%   79.6%   90.7%   83.3%

• Observations
  – Features that are good for the matched condition are not necessarily good for the mismatched condition
  – Perceptual voice quality features are the most robust to session variability

Page 26

Height Estimation and Speaker Adaptation using Subglottal Resonances

Page 27

The Subglottal System

• The acoustic system below the glottis consists of the trachea, bronchi, and lungs.

• Like the supraglottal system (the vocal tract), the subglottal system is characterized by a series of poles and zeros, referred to as subglottal resonances and anti-resonances.

Page 28

Coupling of the Subglottal System

• Introduces pole-zero pairs in the vocal tract transfer function (Stevens, 1998)
• Theoretically, subglottal resonances are independent of speech sounds (Lulich, 2008b)
• The 'acoustic length' of the subglottal system is proportional to speaker height

Page 29

Motivation for studying SGRs

• The vocal-tract shape changes continuously, but the subglottal tract has a fixed configuration.
  – SGRs (red) remain constant irrespective of spoken content and language, unlike formants (green).

Page 30

• SGRs form natural boundaries between vowel categories.

[Figure: vowels /iy/, /uw/, /aa/ arranged by increasing vowel height and increasing vowel backness, with SGRs as category boundaries]

Page 31

Major goals of this research

• Collect a sizable database of time-synchronized speech and subglottal acoustics for both adults and children.
  – Use an accelerometer to record subglottal acoustics.
• Develop automatic methods to extract subglottal information using speech signals only.
• Investigate the role of subglottal features in tasks requiring speaker-specific information.
  – Exploit the stationarity of subglottal acoustics for improved performance when speech data are limited.

Page 32

The WashU-UCLA corpora

• Time-synchronized recordings of speech (microphone) and subglottal acoustics (accelerometer).
• Subjects: 50 adults (18 to 25 yrs); 43 children (6 to 17 yrs).
  – Native speakers of American English.
  – Speaker age and height were recorded.
• Recordings were made in a sound-attenuated booth and were phrases of the form "I said a <CVC> again".
  – 10 repetitions of each <CVC> for adults; 4 repetitions for children.
  – Vowel beginning, steady state, and end were labeled manually.
• Data (for adults only) will be released for free by the LDC on 4/15.

[Photo: K&K Sound HotSpot accelerometer]

Page 33

[Diagram: data (speech & subglottal acoustics) → measurement and analysis → SGR estimation algorithms and estimation of subglottal spectral features → applications: height estimation, speaker normalization, speaker identification]

Page 34

Results: actual SGRs

                     Sg1 (Hz)   Sg2 (Hz)   Sg3 (Hz)
Adults
  Average male          542       1327       2198
  Average female        659       1511       2410
  Overall average       601       1419       2304
Children (6 to 17 yrs)
  Average male          727       1720       2710
  Average female        752       1778       2720
  Overall average       730       1740       2710

• Significant differences between:
  – Adults and children
  – Adult males and adult females

Page 35

Results: SGRs versus body height

• Trachea length correlates with height [Griscom & Wohl, 1985].
  – This results in negative correlations between SGRs and height.

Page 36

Evaluation setup

• For adults
  – Training: 35 speakers in the WashU-UCLA corpus.
  – Evaluation: 14 speakers in the MIT Tracheal Resonance database (with known SGRs).
• For children
  – Training: 25 speakers in the WashU-UCLA corpus.
  – Evaluation: the remaining 18 speakers.
• Utterances are 2 to 3 seconds long, sampled at 8 kHz.

Page 37

Results

• RMSEs (in Hz) of the SGR estimates, for adults (clean and babble, pink, factory, white noise at 5 dB SNR) and for children (clean and babble, pink, car, white noise at 5 dB SNR).
• RMSEs are on the order of the measurement standard deviations.

[Bar charts: RMSE for Sg1, Sg2, Sg3 under each condition — roughly 0–100 Hz for adults, 0–200 Hz for children]

Page 38

Height Estimation

• Based on the correlation between SGRs and body height.

[Diagram: WashU-UCLA corpus (actual SGRs and heights) → model of the relationship between SGRs and speaker height; unknown speaker → automatic SGR estimation algorithm → estimated SGRs → model → estimated height]
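In its simplest form, the flow above — fit a model on actual (SGR, height) pairs, then apply it to automatically estimated SGRs — reduces to a one-feature linear regression. The numbers below are synthetic stand-ins (the real corpus values are not reproduced here); the negative slope reflects the SGR–height correlation described on the previous slide.

```python
import numpy as np

# Hypothetical training data standing in for measured (Sg1, height) pairs
# from 50 speakers; higher Sg1 corresponds to a shorter subglottal tract,
# hence a shorter speaker.
rng = np.random.default_rng(1)
sg1 = rng.uniform(500, 760, 50)                        # Hz, adult-like range
height = 220.0 - 0.08 * sg1 + rng.normal(0, 4, 50)     # cm, synthetic

# Fit height = a * Sg1 + b by least squares (degree-1 polyfit).
a, b = np.polyfit(sg1, height, 1)

def estimate_height(sg1_hz):
    """Map an automatically estimated Sg1 (Hz) to a height estimate (cm)."""
    return a * sg1_hz + b

pred = estimate_height(sg1)
rmse = np.sqrt(np.mean((pred - height) ** 2))
print(f"slope = {a:.3f} cm/Hz, training RMSE = {rmse:.1f} cm")
```

A single feature with a linear model is all this approach needs, which is why (as the later slides note) it requires far less training data than multi-feature regressors.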

Page 39

Height estimation: evaluation

• Training data: SGRs and heights of 50 speakers.
• Evaluation data: speech signals of 604 speakers (TIMIT).

                   Using Sg1   Using Sg2   Ganchev et al.
  mean abs. error     5.3 cm      5.4 cm      5.3 cm
  RMS error           6.6 cm      6.7 cm      6.8 cm

Arsikere, Leung, Lulich and Alwan (Speech Communication, 2013)

Page 40

Results

• RMSEs (cm) for clean speech, comparing the three SGRs:
  Using Sg1: 6.8   Using Sg2: 6.9   Using Sg3: 7.1

• RMSEs (cm) for noisy speech (0 dB SNR) using Sg1:
  Clean: 6.8   Babble: 6.9   White: 7.2   Pink: 7.0   Factory: 6.8

• Advantages over [Ganchev et al., 2010] (RMSE = 6.8 cm):
  – Just 1 feature (as opposed to 50); very little training data (50 speakers vs. 468); no need to retrain models for every scenario.

Page 41

SGR-based normalization (SGRN)

• Motivation for using SGRs:
  – Role of Sg1 and Sg2 as vowel-feature boundaries.
  – Phonetic invariance.
  – Can be estimated fairly well from noisy speech.

[Figure: F1–F2 vowel space (F1: 200–1200 Hz, F2: 1000–3000 Hz) with Sg1 and Sg2 boundaries marked for a male adult (blue) and a male child (red)]

Page 42

Evaluation setup

• Task: recognition of connected digit utterances.
  – TIDIGITS database (utterances of 1 to 7 digits each).
  – ASR models trained on 112 adult speakers; tested on 50 children.
  – Training in the clean condition.
  – Testing in clean + babble, pink, white and car noise (5 to 15 dB).
• Hidden Markov Models (HMMs) are used.
• Features: standard MFCCs extracted at 10 ms intervals.
• Performance metric: word error rate (WER).
  – Total number of substitution, insertion and deletion errors, divided by the number of reference words.
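WER is conventionally computed by aligning the hypothesis to the reference with a word-level Levenshtein dynamic program and dividing the minimum number of edits (substitutions, insertions, deletions) by the reference length. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + insertions + deletions) / len(reference),
    via the standard Levenshtein dynamic program over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("one two three", "one oh two"))  # 2 edits / 3 words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is reported as an error rate rather than as one minus an accuracy.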

Page 43

Experimental results

WERs (in %) by age group, averaged across noise types:

              6 to 8 yrs   9 to 11 yrs   12 to 15 yrs
  Baseline        39.7         26.2          18.9
  VTLN            23.2         17.9          14.4
  SGRN            16.7         16.1          14.1

WERs (in %) by utterance length, averaged across noise types:

              1 or 2 words   3 to 5 words   6 or 7 words
  Baseline        24.7           27.5           27.5
  VTLN            18.2           18.4           17.4
  SGRN            14.9           16.2           15.6

Page 44

Summary (speaker normalization)

• Proposed approach: SGR-based normalization (SGRN).
  – ML corrections applied to initial SGR estimates, which are fairly robust to noise; frequency-dependent scaling.
• Experiments on children's ASR in quiet and noise: SGRN is better than VTLN, especially for young speakers and short utterances.

Page 45

Speaker Identification (SID)

• SGCCs: computed just like speech MFCCs, but from the subglottal acoustics.

Page 46

Conclusion and future directions

• Subglottal features are useful for: (1) height estimation, (2) speaker normalization for ASR, (3) speaker identification, and (4) cross-language adaptation.
  – Effective with limited data.
  – Robust to environmental noise.
• Future research directions:
  – Data collection and analysis in other languages.
  – Pole-zero models for SGR estimation from speech.
  – Larger databases and better models for height estimation.

Page 47

Noise-Robust ASR

Page 48

Speech Perception in Noise

Normal-hearing adults are remarkably adept at perceiving speech in noise. However:

• Nearly 30 million North Americans are hearing impaired, yet fewer than 6 million use hearing aids. The most common complaint of hearing-aid users is listening to speech in naturally noisy environments.

• The performance of automatic speech recognition (ASR) systems degrades significantly in the presence of noise.

Page 49

The 'Robust' Human Auditory System

The auditory system is extremely robust to noise due to both:

• "Intelligent" high-level processing
• An inherently robust auditory representation (front-end processing)

Page 50

[Figure-only slide]

Page 51

Knowledge-Based Signal-Processing Techniques

• Adaptation (sensitivity to onsets and offsets): modeled after forward-masking experiments
• Peak isolation / spectral sharpening: physiological and perceptual evidence
• Not all 'uniform' segments are equally important (variable frame rate analysis, VFR)
• The vocal tract transfer function (VTTF) changes slowly in time (peak threading)
• Extracting the spectral envelope more precisely (harmonic demodulation)

Page 52

I. Adaptation: the auditory system adapts as a function of previous input

• Forward-masking experiments varied the masker level, probe delay, and frequency.

[Figure: panels (A) and (B)]

Strope and Alwan (1997, 1998)

Page 53: Feature Extraction from Speech Signals · 20 40 60 80 100 120 140 160 500 1000 1500 2000 2500 3000 3500 4000. 0 2000 4000 6000 8000 10000 12000 14000-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6

“These schemes will succeed only to the extent that metrics can be found that are sensitive to phonetically relevant spectral

differences..” D.H. Klatt, 1981

II. Peak Isolation

Page 54: Feature Extraction from Speech Signals · 20 40 60 80 100 120 140 160 500 1000 1500 2000 2500 3000 3500 4000. 0 2000 4000 6000 8000 10000 12000 14000-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6

Quiet 5 dB SNR

9 6 1 3 Time

Page 55: Feature Extraction from Speech Signals · 20 40 60 80 100 120 140 160 500 1000 1500 2000 2500 3000 3500 4000. 0 2000 4000 6000 8000 10000 12000 14000-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6

• Spectral changes are important perceptual cues for discrimination. Such changes can occur over very short timeintervals.

• Computing frames every 10 ms, as commonly done in ASR, is not sufficient to capture such dynamic changes.

III. Variable Frame Rate Analysis (VFR)

Zhu and Alwan (2000, 2003)You and Alwan (2002)

Page 56: Feature Extraction from Speech Signals · 20 40 60 80 100 120 140 160 500 1000 1500 2000 2500 3000 3500 4000. 0 2000 4000 6000 8000 10000 12000 14000-0.8-0.6-0.4-0.2 0 0.2 0.4 0.6

Frame selection in VFR for a digit string “one two” with silence

An Example of VFR

2000 4000 6000 8000 10000 12000

-2000

0

2000

4000

Sample

50 100 150 200 250 300 350 4000

10

20

30

Frame

d(i)

Selection

Speech waveform

Inter-frame distance based on Euclidean

distanceor Entropy

Frames selected
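The frame-selection idea in the figure can be sketched as follows: keep a frame only when it has moved far enough, in feature space, from the last frame that was kept, so rapidly changing regions get a high frame rate and steady regions a low one. This is a toy version of the technique; the cited papers also use entropy-based distances and more elaborate selection rules, and the threshold and 2-D features below are made-up values.

```python
import numpy as np

def select_frames(feats, threshold):
    """Variable frame rate selection: keep frame i only when its Euclidean
    distance from the last kept frame exceeds the threshold."""
    kept = [0]                                  # always keep the first frame
    for i in range(1, len(feats)):
        if np.linalg.norm(feats[i] - feats[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Toy feature track: steady segment, rapid 10-frame transition, steady segment
# (a stand-in for silence -> speech onset -> steady vowel).
steady1 = np.zeros((40, 2))
transition = np.linspace([0.0, 0.0], [5.0, 5.0], 10)
steady2 = np.full((40, 2), 5.0)
feats = np.vstack([steady1, transition, steady2])

kept = select_frames(feats, threshold=1.0)
# Nearly all selected frames fall inside the 10-frame transition region.
```

Compared with a fixed 10 ms hop, this concentrates the frame budget on onsets and offsets, which is exactly the motivation given on the previous slide.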

Page 57

Summary

• Achieved improved performance for rapid speaker normalization and speaker ID with limited data, using subglottal resonance information.
• Estimated speaker height using a single feature.
• Achieved improved ASR noise robustness using adaptation, peak isolation, and variable frame rate analysis.

However, we have yet to achieve human performance on these tasks (except for height estimation)!

Page 58

Other Research Projects

• High-speed imaging of the vocal cords to improve modeling and synthesis
• Bird song classification using limited data

Page 59

Acknowledgements

Former and current students: S. Wang, H. Arsikere, S. Park, B. Strope

Collaborators: S. Lulich, M. Sommers

Work supported in part by the NSF, NIH, DARPA, and Sony PlayStation.