Ayako Ikeno and John H.L. Hansen
IAFPA-2006, July 23-26, 2006
Center for Robust Speech Systems (CRSS)
Erik Jonsson School of Engineering & Computer Science
University of Texas at Dallas
Richardson, Texas 75083-0688, U.S.A.
Email: {ikeno, John.Hansen}@utdallas.edu
Slides by John H.L. Hansen, 2006
Slide 3
CRSS & Speech Processing Overview
Previous Studies on Stress & Lombard Effect
Perceptual Speaker ID with Lombard Speech
Speech Corpus - UTScope
Experimental Setup
Results
Summary & Impact
Slide 4
Overview of CRSS-Hansen Research:
Spoken Document Retrieval: http://SpeechFind.utdallas.edu
Speech Under Stress
Speech Enhancement
UTDrive & CU-Move: In-Vehicle Voice Navigation
Dialect & Accent: UAE, Egypt, Palestine, etc.; Cuba, Peru, Puerto Rico; Cambridge, Irish, Welsh, etc.
In-Set / Out-of-Set Speaker Detection
Normalization: Speaker, Environment, Language
Slide 5
Why Speech Systems Break?
ENVIRONMENTAL BASED: Acoustic Noise; Room Reverberation; Physical Task Demands
COMMUNICATION BASED: Microphone; Voice Compression; Channel/Mobile Cellular
SPEAKER BASED PROBLEMS: Stress & Emotion; Lombard Effect/Noise; Psychological Task Demands; Accent/Language; Speaker Differences (age, sex, vocal tract); Spontaneous Speech
CONTEXT BASED EFFECTS:
Homonyms (English +10,000; Japanese 120)
Confusable: (take, stake, straight; cake, Kate)
Ambiguous: "Jeet yet?"; "it's ours" vs. "it sours"; "nice guys" vs. "nice skies"
"Um, I just wanna, I just want to say, I don't know what I want to say."
[Diagram labels: Speech, Stress, Environment Noise, Accent, Language, Speech Recognition, Human (Auditory) Recognition, Voice Communications, Channel Noise, American English Speaker, Lombard Effect, Speaker Recognition]
Slide 6
Speech Production: Phonetics & Acoustics
[Diagram labels: Noise, Stress, Microphone, Speaker, Speech Physiology, Acoustic Speech Waveform (Neutral, Stress)]
Slide 7
DOES STRESS VARIABILITY IMPACT SPEAKER RECOGNITION?
Limited research on speaker recognition over stress, Lombard effect, etc.
The NATO RSG.10 report (NATO, 2000) showed probe experimental results with the SUSAS corpus.
Slide 8
[Figure panels: Pitch, Glottal Spectral Slope, Formant Location]
(Earlier studies by Hansen (1988): 200 speech features, 10,000 statistical tests)
Slide 9
[Figure panels: Phone Duration, RMS Intensity]
Slide 10
STRESS DETECTION USING PITCH
Conditional Gaussian fit (Zhou, Hansen 1997)
Classification error rate:
Neutral vs. Loud: 7.24% (Neutral), 8.28% (Loud)
Neutral vs. Lombard: 20.69% (Neutral), 19.31% (Lombard)
[Figure panels: Probability distribution, Detection (ROC) curves]
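As an illustration of the idea on this slide, the sketch below fits one Gaussian per speaking style to mean-pitch values and labels a new value by the higher likelihood. It is only one plausible reading of a "conditional Gaussian fit", not the method of Zhou and Hansen (1997); the pitch values and the names `fit_gaussian`, `log_likelihood`, and `classify` are made up for the example.

```python
import math
from statistics import mean, stdev


def fit_gaussian(samples):
    """Return (mean, standard deviation) of a list of pitch values in Hz."""
    return mean(samples), stdev(samples)


def log_likelihood(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify(x, models):
    """Pick the label whose Gaussian gives x the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))


# Synthetic mean-pitch values (Hz), invented for illustration only.
neutral_f0 = [110.0, 118.0, 122.0, 115.0, 125.0, 120.0]
lombard_f0 = [140.0, 155.0, 148.0, 160.0, 150.0, 145.0]

models = {
    "neutral": fit_gaussian(neutral_f0),
    "lombard": fit_gaussian(lombard_f0),
}

print(classify(117.0, models))
print(classify(152.0, models))
```

With real data the two pitch distributions overlap, which is where per-class error rates such as those quoted for Neutral vs. Lombard come from.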
Slide 11
ROC CURVES STRESS DETECTION
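For readers unfamiliar with ROC curves, the following minimal sketch shows how such a curve can be traced for a single-feature stress detector: the decision threshold is swept over the scores, and each threshold yields one (false-alarm rate, detection rate) point. The scores are invented, and `roc_points` and `area_under_curve` are hypothetical helper names, not part of the original study.

```python
def roc_points(neutral_scores, stressed_scores):
    """Sweep a threshold; a score >= threshold is declared 'stressed'.

    Returns a list of (false_alarm_rate, detection_rate) points,
    starting at (0, 0) and ending at (1, 1).
    """
    thresholds = sorted(set(neutral_scores) | set(stressed_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        false_alarm = sum(s >= t for s in neutral_scores) / len(neutral_scores)
        detection = sum(s >= t for s in stressed_scores) / len(stressed_scores)
        points.append((false_alarm, detection))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented detector scores: perfectly separated classes give an area of 1.
neutral = [1.0, 2.0, 3.0]
stressed = [4.0, 5.0, 6.0]
curve = roc_points(neutral, stressed)
print(curve[-1], area_under_curve(curve))
```

An area near 0.5 corresponds to a detector that does no better than guessing; overlapping score distributions pull the curve toward that diagonal.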
Slide 12
PAST STRESS DETECTION STUDIES USING TRADITIONAL FEATURES
Individual Feature: Pitch; Glottal Spectral Slope; Intensity; Phone Duration; Formant Location (1st formant, 2nd formant)
Feature Fusion: Duration + Intensity + mean Pitch
Stress/Neutral Error Rates: 6-21%, 18-36%, 28-46%, 38-46%, 50-58%, 0-17%
Slide 13
TEAGER ENERGY OPERATOR
Discrete-time and continuous-time TEO:
$\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$
$\Psi_c[x(t)] = \dot{x}^2(t) - x(t)\,\ddot{x}(t)$
where $\Psi$ is the Teager Energy Operator
TEO-CB-Auto-Env: Critical Band based TEO AUTOcorrelation ENVelope
Critical Frequency 17 Band Partition = based on Auditory Perception
Ref: Zhou, Hansen, Kaiser, IEEE Transactions on Speech & Audio Processing, vol. 9(2): 201-216, March 2001
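A minimal sketch of the discrete-time operator, assuming the standard definition Ψ[x(n)] = x²(n) − x(n+1)·x(n−1). For a pure tone A·cos(Ωn) this definition gives the constant A²·sin²(Ω), which the example checks numerically. The function name `teager_energy` and the test-tone parameters are illustrative; the critical-band filtering and autocorrelation-envelope stages of TEO-CB-Auto-Env are not shown.

```python
import math


def teager_energy(x):
    """Discrete-time Teager energy: psi[n] = x[n]^2 - x[n+1] * x[n-1].

    The first and last samples lack a two-sided neighbourhood, so the
    output is two samples shorter than the input.
    """
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]


# Test tone with assumed amplitude and digital frequency (radians/sample).
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]

energy = teager_energy(tone)
expected = (amplitude * math.sin(omega)) ** 2  # A^2 * sin^2(omega)

print(max(abs(e - expected) for e in energy))  # deviation from the closed form
```

Because the output depends on both amplitude and frequency, the operator responds to changes in either one.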
Slide 14
Neutral HMM Model vs. Stress-trained HMM Model
Assessment for NATO SUSC-0 Military Cockpit Recordings
Slide 15
Detection of Speech Under Stress: WRAIR
GOAL: (1) Identify, Model, and Classify Speech Under Stress in Military-Related Task Conditions, and (2) Improve Automatic Speech Coding under Stress
APPROACH:
Effective Soldier of the Quarter Board Paradigm
Monitor and Track Biometrics of Stress: heart rate, blood pressure, stress hormones, psychometrics
Engineering: Focus on NONLINEAR Air Turbulent Model, Teager Energy Operator; Identify Stress Dependent Performance across Speakers, phonemes
Ref: Rahurkar, Hansen, Meyerhoff, Saviolakis, Koenig, "Frequency Distribution based Weighted Sub-Band Approach for Classification of Emotional/Stressful Content in Speech," Interspeech, pp. 721-724, Geneva, Switzerland, Sept. 2003 (another paper at Interspeech-2005)
Slide 16
First observed by Etienne Lombard in 1911
Change in speech production in response to noise to increase communication performance
Lombard Test: standard test for hearing loss in U.S. (ASHA); measures dB-SPL change in speech production
Hansen (1988): evaluation of 200 features with +10,000 statistical tests on 11 different stressed speech conditions to quantify changes in speech production
Slide 17
IAFPA-06: focus on Lombard Effect
Audio samples for the perceptual experiment were extracted from the UTScope corpus.
UTScope: Speech under COgnitive and Physical stress & Emotion
Consists of 4 Domains:
Lombard Effect: noise levels & types
Physical Stress: stair climbing/stepper
Cognitive Stress: driving (simulator & actual)
Emotion (Angry, Fear, Anxiety, Frustration)
Slide 18
Goal: obtain Lombard Speech at different noise levels
Quantify ground truth with biometric analysis
Lombard Effect Speech: 9 conditions (3 noise, 3 levels), 1 sec. duration
Pink Noise: 65, 75, 85 dB-SPL
Highway Noise (windows open): 70, 80, 90 dB-SPL
Large Crowd Noise: 70, 80, 90 dB-SPL
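The 9 Lombard conditions are the three noise types crossed with their three levels. A small sketch of that design as a configuration table, using the levels listed on this slide; the dictionary keys and the variable names are hypothetical labels, not identifiers from the corpus.

```python
# Noise type -> presentation levels in dB-SPL, as listed on this slide.
noise_levels_db_spl = {
    "pink": (65, 75, 85),
    "highway": (70, 80, 90),
    "large_crowd": (70, 80, 90),
}

# One (noise type, level) pair per Lombard condition.
conditions = [
    (noise, level)
    for noise, levels in noise_levels_db_spl.items()
    for level in levels
]

for noise, level in conditions:
    print(f"{noise}: {level} dB-SPL")
```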
Slide 19
UTScope
PINK NOISE: 65, 75, 86 dB-SPL
HIGHWAY DRIVING, WINDOWS HALF OPEN: 70, 80, 90 dB-SPL
LARGE CROWD NOISE: 70, 80, 90 dB-SPL
PURETONE HEARING SCREENING
OPEN-AIR HEADPHONES FOR SPEECH FEEDBACK
NOISE LEVELS CALIBRATED WITH QUEST SLM
Slide 20
UTScope
20 TIMIT SENTENCES
5 DIGIT STRINGS
1 MINUTE SPONTANEOUS SPEECH
100 SPEAKERS
8-CHANNEL DAT RECORDER
P-MIC
CLOSE-TALKING MIC
FAR-FIELD MIC
Slide 21
The ASHA-certified sound booth and recording equipment
Slide 22
[Figure panels: Male Lombard, Male Neutral]
Lombard Effect impacts Temporal and Spectral Structure (as expected)
Evaluation: Perceptual Experiments to assess Speaker Recognition
Slide 23
Listener Test
Speakers
Corpus: UTScope
Native US English speakers, female speakers only
Speech Conditions
Condition   Reference   Test
NL-LD       Neutral     Lombard
LD-LD       Lombard     Lombard
NL-NL       Neutral     Neutral
Noise Type: Highway driving
Noise Level: 90 dB-SPL
Slide 24
Speech Materials
Read speech
TIMIT sentences: phonetically balanced
3 sentences per audio sample (.wav, 16 kHz)
Ref: Basketball can be an entertaining sport. My problem is, the cat's meow always hurts my ears. The causeway ended abruptly at the shore.
Test: Youngsters commonly love chocolate and candies as treats. December and January are nice months to spend in Miami. There were other farmhouses nearby.
Slide 25
Listener Test
Listeners: 12 (2 female / 10 male) as of May 2006; 41 as of July 2006
India (4), China (1), Korea (1), Mexico (1), Pakistan (1), Thailand (1), Turkey (1), US (1), Vietnam (1)
Task: In-set vs. Out-of-set Speaker Identification
Reference/Training: 12 In-set Female speakers
Test: 8 In-Set speakers, 4 Out-of-Set speakers
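To make the task concrete, here is a hedged sketch of how one listener's in-set vs. out-of-set decisions could be scored. The responses below are invented, and `score_trials` is a hypothetical helper, not the scoring script used in the study; only the 8 in-set / 4 out-of-set split comes from the slide.

```python
def score_trials(trials):
    """Score (truth, response) pairs, each value being 'in' or 'out'.

    Returns overall accuracy plus accuracy on in-set and out-of-set trials.
    """
    def accuracy(subset):
        return sum(truth == resp for truth, resp in subset) / len(subset)

    in_set = [t for t in trials if t[0] == "in"]
    out_of_set = [t for t in trials if t[0] == "out"]
    return {
        "overall": accuracy(trials),
        "in_set": accuracy(in_set),
        "out_of_set": accuracy(out_of_set),
    }


# Invented responses for one listener: 8 in-set and 4 out-of-set test speakers.
trials = (
    [("in", "in")] * 6 + [("in", "out")] * 2
    + [("out", "out")] * 3 + [("out", "in")] * 1
)

print(score_trials(trials))
```

Each trial is a two-way decision, so a listener who guesses at random is expected to score about 50% overall.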
Slide 26
Reference audio: Neutral, Lombard
Test audio: Neutral, Lombard
Slide 27
The effect of speech condition is significant (p = .0024).
Mismatched condition (NL-LD) accuracy is at chance level (52%).
Lombard speech (LD-LD, 79%) yields higher accuracy than neutral speech (NL-NL, 67%).
The Lombard effect may emphasize speech characteristics and improve accuracy in perceptual speaker ID.
Slide 28
Automated System Performance (SUSAS Corpus)
Emotion/stress   Mismatched Training (%)   Matched Training (%)
Neutral          96
Angry            34                        75
Lombard          48                        99
Fast             91                        90
Slow             90                        98
Soft             73                        89
Loud             22                        81
Loss: Angry 62%, Lombard 48%, Loud 74% (5-74% loss)
(See Hansen et al., "The Impact of Speech Under `Stress' on Military Speech Technology," NATO Research & Tech. Org. RTO-TR-10, March 2000.)
The same trend holds for the automated system.
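The loss figures on this slide appear to be the drop from the neutral score (96) to the mismatched-training score for each style, e.g. 96 - 34 = 62 for Angry. The sketch below reproduces that arithmetic from the slide's numbers; reading the loss as a simple difference in percentage points is an assumption, and the variable names are illustrative.

```python
# Scores (%) from the slide: neutral reference and mismatched-training results.
neutral_accuracy = 96
mismatched = {
    "Angry": 34,
    "Lombard": 48,
    "Fast": 91,
    "Slow": 90,
    "Soft": 73,
    "Loud": 22,
}

# Loss = drop relative to neutral, in percentage points.
loss = {style: neutral_accuracy - score for style, score in mismatched.items()}

for style, points in loss.items():
    print(f"{style}: {points}")
print("range:", min(loss.values()), "to", max(loss.values()))
```

Under this reading the smallest loss is for Fast speech (5) and the largest for Loud speech (74), matching the "5-74% LOSS" range quoted on the slide.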
Slide 29
In-Set accuracy : affected by the speech condition significantly
(p