Comparing Audio and Visual Information for Speech Processing
David Dean*, Patrick Lucey*, Sridha Sridharan* and Tim Wark*†
*Speech, Audio, Image and Video Research Laboratory
†e-Health Research Centre / CSIRO ICT Centre
Presented by David Dean
Audio-Visual Speech Processing - Overview
• Speech and speaker recognition have traditionally been audio-only
  – A mature area of research
• Significant problems remain in real-world environments (Wark2001)
  – High acoustic noise
  – Variation of speech
• Audio-visual speech processing adds an additional modality to help alleviate these problems
Audio-Visual Speech Processing - Overview
• Speech and speaker recognition tasks have many overlapping areas
• The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition (see the sketch after this list):
  – Train speaker-dependent word (or sub-word) models
  – Speaker recognition chooses amongst speakers for a particular word, or
  – Word recognition chooses amongst words for a particular speaker
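To make this shared configuration concrete, here is a minimal sketch of the common decision rule; the `loglik` dictionary layout and the function name are illustrative assumptions, not something specified in the slides.

```python
def identify(loglik, mode, known):
    """Shared decision rule over speaker-dependent word models.

    loglik: dict mapping (speaker, word) -> log-likelihood of the test
            utterance under that speaker's model for that word.
    mode='speaker': text-dependent speaker ID (the word is known).
    mode='word':    speaker-dependent word ID (the speaker is known).
    """
    if mode == "speaker":
        # Choose amongst speakers for a particular word
        candidates = {s: ll for (s, w), ll in loglik.items() if w == known}
    else:
        # Choose amongst words for a particular speaker
        candidates = {w: ll for (s, w), ll in loglik.items() if s == known}
    return max(candidates, key=candidates.get)
```

The same pool of models serves both tasks: identify(loglik, "speaker", known="one") and identify(loglik, "word", known="s02m") differ only in which factor is held fixed.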
Audio-Visual Speech Processing - Overview
• Little research has examined how the two applications (speaker vs. speech recognition) differ in areas other than the set of models chosen for recognition
• One area of interest in this research is the reliance on each modality
  – Acoustic features typically work equally well in either application (Young2002)
  – Little consensus has been reached on the suitability of visual features for each application
Experimental Setup
[Block diagram: lip location & tracking feeds visual feature extraction, which feeds the visual speech/speaker models; acoustic feature extraction feeds the acoustic speech/speaker models; both sets of model scores meet in decision fusion to produce the speech/speaker decision.]
Lip location and tracking
Finding Faces
• Manual red, green and blue (RGB) skin thresholds were trained for each speaker
• Faces were located by applying these thresholds to the video frames (a sketch follows below)
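A minimal sketch of this thresholding step, assuming per-speaker (low, high) ranges chosen by hand; the function names and the bounding-box heuristic are illustrative, not from the slides.

```python
import numpy as np

def skin_mask(frame, r_range, g_range, b_range):
    """Binary skin mask from manually trained per-speaker RGB thresholds.

    frame: H x W x 3 uint8 array in (R, G, B) channel order.
    *_range: (low, high) threshold pairs chosen by hand for this speaker.
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    return ((r >= r_range[0]) & (r <= r_range[1]) &
            (g >= g_range[0]) & (g <= g_range[1]) &
            (b >= b_range[0]) & (b <= b_range[1]))

def face_box(mask):
    """Crude face estimate: bounding box of all skin-coloured pixels."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None                      # no skin pixels found in this frame
    return xs.min(), ys.min(), xs.max(), ys.max()
```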
Finding and tracking eyes
• The top half of the face region is searched for eyes
• A shifted version of Cr-Cb thresholding was performed to locate possible eye regions (Butler2003)
• Invalid eye-candidate regions were removed, and the most likely pair of candidates was chosen as the eyes
• The new eye location is compared to the old, and ignored if it is too far away (a sketch of this gating follows below)
• About 40% of sequences had to be manually eye-tracked every 50 frames
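The temporal gating of eye locations might look like the sketch below; the 20-pixel jump threshold is an assumed value, not one given in the slides.

```python
import numpy as np

def gate_eye_update(new_eyes, old_eyes, max_jump=20.0):
    """Reject an eye-detection update that jumps too far from the
    previous frame's location.

    new_eyes, old_eyes: (2, 2) arrays of (x, y) for the left/right eye.
    Returns the eye locations accepted for this frame.
    """
    if old_eyes is None:                 # first frame: nothing to compare against
        return new_eyes
    jump = np.linalg.norm(np.asarray(new_eyes) - np.asarray(old_eyes), axis=1).max()
    return new_eyes if jump <= max_jump else old_eyes
```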
Finding and tracking lips
• Eye locations are used to define a rotation-normalised lip search region (LSR), as sketched below
• The LSR is converted to red/green colour-space and thresholded
• Unlikely lip candidates are removed
• The rectangular area containing the largest amount of lip-candidate area is the lip ROI
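A sketch of deriving the rotation-normalised LSR from the eye locations; the scale factors relating the region to the inter-eye distance are assumptions for illustration only.

```python
import numpy as np

def lip_search_region(left_eye, right_eye, frame_shape,
                      width_scale=1.2, drop_scale=1.0, height_scale=0.8):
    """Rotation-normalised lip search region (LSR) from eye locations.

    The in-plane rotation comes from the angle of the line between the
    eyes; after rotating the frame by -angle, the mouth lies below the
    eye midpoint.  Returns (angle_deg, (x0, y0, x1, y1)).
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))   # rotate the frame by -angle first
    eye_dist = np.hypot(dx, dy)
    cx = (left_eye[0] + right_eye[0]) / 2.0  # eye midpoint (in rotated frame)
    cy = (left_eye[1] + right_eye[1]) / 2.0
    x0, x1 = cx - width_scale * eye_dist / 2, cx + width_scale * eye_dist / 2
    y0 = cy + drop_scale * eye_dist          # mouth sits below the eyes
    y1 = y0 + height_scale * eye_dist
    h, w = frame_shape[:2]
    return angle, (max(0, int(x0)), max(0, int(y0)), min(w, int(x1)), min(h, int(y1)))
```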
Feature Extraction and Datasets
• MFCC: 15 coefficients + 1 energy, plus deltas and accelerations = 48 features
• PCA: 20 eigenlip coefficients, plus deltas and accelerations = 60 features (a sketch follows below)
  – The eigenlip space was trained on the entire data set of lip images
• Stationary speech from CUAVE (Patterson2002)
  – 5 sequences for training, 2 for testing (per speaker)
  – Testing was also performed on speech-babble-corrupted noisy versions
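A sketch of the eigenlip basis and the standard delta/acceleration append that turns 20 static coefficients into 60 features; the regression window size is an assumption.

```python
import numpy as np

def train_eigenlips(lip_images, n_components=20):
    """PCA ('eigenlip') basis from vectorised lip ROIs.

    lip_images: N x D array, one flattened lip ROI per row.
    Returns (mean, basis) with basis of shape n_components x D.
    """
    mean = lip_images.mean(axis=0)
    # Rows of vt are the principal directions of the centred data
    _, _, vt = np.linalg.svd(lip_images - mean, full_matrices=False)
    return mean, vt[:n_components]

def add_deltas(feats, window=2):
    """Append delta and acceleration coefficients via the standard
    regression formula, turning T x D statics into T x 3D features."""
    def deltas(x):
        denom = 2 * sum(i * i for i in range(1, window + 1))
        padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
        t = len(x)
        return sum(i * (padded[window + i:window + i + t] -
                        padded[window - i:window - i + t])
                   for i in range(1, window + 1)) / denom
    d = deltas(feats)
    return np.hstack([feats, d, deltas(d)])
```

Per frame, the static eigenlip coefficients are then (roi - mean) @ basis.T, with add_deltas applied to the resulting T x 20 sequence.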
Training
• Phone transcriptions obtained from earlier research (Lucey2004) were used to train speaker-independent HMM phone models in both the audio and visual domains
• Speaker-dependent models were adapted from the speaker-independent models using MLLR adaptation (a simplified sketch follows below)
• The HMM Toolkit (HTK) was used (Young2002)
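MLLR adapts the speaker-independent Gaussian means with an affine transform, μ̂ = Aμ + b. Below is a toy, occupancy-weighted least-squares stand-in for the maximum-likelihood update that HTK actually performs; the inputs are assumed to have been accumulated from the adaptation data.

```python
import numpy as np

def mllr_adapt_means(means, occupancies, obs_means):
    """Toy global MLLR mean transform: fit W = [b A] so that the adapted
    means A @ mu + b match the adaptation data, weighted by occupancy.

    means:       M x D speaker-independent Gaussian means.
    occupancies: length-M state occupancy counts from adaptation data.
    obs_means:   M x D per-Gaussian means of the adaptation observations.
    Returns the M x D adapted (speaker-dependent) means.
    """
    m = len(means)
    ext = np.hstack([np.ones((m, 1)), means])      # extended means [1, mu]
    w = np.sqrt(occupancies)[:, None]              # weighted least squares
    wt, *_ = np.linalg.lstsq(ext * w, obs_means * w, rcond=None)
    return ext @ wt                                # = A @ mu + b for every Gaussian
```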
Comparing acoustic and visual information for speech processing
• Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models
• Test segments were freely transcribed using all speaker-dependent phoneme models
  – No restriction to a specified user or word
• Confusion tables for speech (phoneme) and speaker recognition were examined to get identification rates (see the sketch after the example)

Example free transcription (speaker s02m):
  Correct:  s02m,/w/   s02m,/ah/  s02m,/n/
  Audio:    s10m,/w/   s02m,/ah/  s02m,/n/
  Video:    s02m,/sp/  s02m,/n/
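From such confusion tables, the identification rate is simply the diagonal mass; a minimal sketch (the function name is assumed):

```python
import numpy as np

def identification_rate(confusion):
    """Identification rate from a confusion table: the fraction of test
    tokens for which the recognised class equals the actual class.

    confusion: C x C count matrix (rows: recognised, columns: actual).
    """
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()
```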
Example Confusion Table (Phonemes in Clean Acoustic Speech)
[Confusion matrix: recognised phonemes (rows) vs. actual phonemes (columns).]
Example Confusion Table (Phonemes in Clean Visual Speech)
[Confusion matrix: recognised phonemes (rows) vs. actual phonemes (columns).]
Likelihood of speaker and phone identification using phoneme models
Fusion
• Because each modality performs differently at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
• For these experiments:
  – Weighted sum fusion of the top 10 normalised scores in each modality:

      ŝ_F = α ŝ_A + (1 − α) ŝ_V

  – α ranges from 0 (video only) to 1 (audio only); a code sketch follows below
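A minimal sketch of this weighted-sum rule; it assumes the two score arrays have already been normalised and aligned so that index i refers to the same candidate in both modalities (the exact normalisation scheme is not given in the slides).

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, alpha):
    """Weighted-sum fusion: s_F = alpha * s_A + (1 - alpha) * s_V.

    alpha = 1.0 is audio-only, alpha = 0.0 is video-only.  Both arrays
    hold normalised scores for the same ordered candidate list (e.g. the
    top 10 candidates in each modality).
    """
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(video_scores, dtype=float)
    return alpha * a + (1.0 - alpha) * v
```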
Speech vs Speaker
• The response of each system to speech-babble noise was compared over a selected range of α values
[Plot, Word Identification: word error rate (0-100%) vs. noise (SNR: -6 to 12 dB, plus clean) for 100% audio, 50% audio, 100% video, and the best fusion weighting.]
Speech vs Speaker
• The response of each system to speech-babble noise was compared over a selected range of α values
[Plot, Speaker Identification: identification error (0-100%) vs. noise (SNR: -6 to 12 dB, plus clean) for 100% audio, 50% audio, 100% video, and the best fusion weighting.]
Speech vs Speaker
• Acoustic performance is essentially equal for both tasks
• Visual performance is clearly better for speaker recognition
• Speech recognition fusion is catastrophic at nearly all noise levels
• Speaker recognition fusion is only catastrophic at high noise levels
• We can also get an idea of the dominance of each modality by looking at the values of α that produce the 'best' lines (ideal adaptive fusion); a sketch of this oracle search follows below
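The 'best' lines can be produced by an oracle search over α; a sketch, under the assumption that per-utterance score matrices and the true labels are available:

```python
import numpy as np

def best_alpha(audio_scores, video_scores, labels, alphas=np.linspace(0, 1, 21)):
    """Ideal adaptive fusion: try each audio weighting and keep the one
    with the lowest identification error.

    audio_scores, video_scores: N x C normalised score matrices with the
    same candidate ordering; labels: length-N correct candidate indices.
    """
    best, best_err = None, np.inf
    for a in alphas:
        fused = a * audio_scores + (1.0 - a) * video_scores
        err = np.mean(fused.argmax(axis=1) != labels)
        if err < best_err:
            best, best_err = a, err
    return best, best_err
```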
‘Best’ Fusion
[Plot: best audio weighting (α, from 0 to 1) vs. noise (SNR: -6 to 12 dB, plus clean), with one curve for Word ID and one for Speaker ID.]
Conclusion and Further Work
• PCA-based visual features are mostly person-dependent
  – They should be used with care in visual speech recognition tasks
• It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
  – Skin colour, facial hair, etc.
• Visual information for speech recognition is only useful in high-noise situations
Conclusion and Further Work
• Even at very low levels of acoustic noise, visual speech information can provide similar performance to acoustic information for speaker recognition
• Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance
• Further study needs to be performed on methods of improving the visual modality for speech recognition by focusing more on the dynamic speech-related information
  – Mean-image removal, optical flow, contour representations (a sketch of mean-image removal follows below)
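Of these, mean-image removal is the simplest; a minimal sketch, assuming a per-utterance sequence of lip ROIs:

```python
import numpy as np

def remove_mean_image(lip_sequence):
    """Subtract the per-utterance mean lip image, suppressing static
    person-specific appearance (skin colour, facial hair) while keeping
    the dynamic, speech-related variation.

    lip_sequence: T x H x W array of lip ROIs for one utterance.
    """
    return lip_sequence - lip_sequence.mean(axis=0, keepdims=True)
```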
References
(Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," in Proceedings of the Seventh International Conference on Digital Image Computing: Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, 2003.
(Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," in Proceedings of SST 2004, Sydney, Australia, 2004.
(Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A New Audio-Visual Database for Multimodal Human-Computer Interface Research," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
(Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.
(Wark2001) T. Wark and S. Sridharan, "Adaptive Fusion of Speech and Lip Information for Robust Speaker Identification," Digital Signal Processing, vol. 11, pp. 169-186, 2001.
Questions?