
Page 1:

CRICOS No. 000213J

†e-Health Research Centre/ CSIRO ICT Centre

*Speech, Audio, Image and Video Research Laboratory

Comparing Audio and Visual Information for Speech Processing

David Dean*, Patrick Lucey*, Sridha Sridharan* and Tim Wark*†

Presented by David Dean

Page 2:

Audio-Visual Speech Processing - Overview

• Speech and speaker recognition have traditionally been audio-only
  – A mature area of research
• Significant problems remain in real-world environments (Wark2001):
  – High acoustic noise
  – Variation of speech

• Audio-visual speech processing adds an additional modality to help alleviate these problems

Page 3:

Audio-Visual Speech Processing - Overview

• Speech and speaker recognition tasks have many overlapping areas

• The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition:
  – Train speaker-dependent word (or sub-word) models
  – Speaker recognition chooses amongst speakers for a particular word, or
  – Word recognition chooses amongst words for a particular speaker.

Page 4:

Audio-Visual Speech Processing - Overview

• Little research has been done into how the two applications (speaker vs. speech) differ in areas other than the set of models chosen for recognition

• One area of interest in this research is the reliance on each modality:
  – Acoustic features typically work equally well in either application (Young2002)
  – Little consensus has been reached on the suitability of visual features for each application

Page 5:

Experimental Setup

[Figure: block diagram of the experimental setup. Video is processed by lip location & tracking followed by visual feature extraction, feeding the visual speech/speaker models; audio is processed by acoustic feature extraction, feeding the acoustic speech/speaker models. The two model outputs are combined by decision fusion to produce the speech/speaker decision.]

Page 6:

Lip location and tracking

Page 7:

Finding Faces

• Manual Red, Green and Blue skin thresholds were trained for each speaker

• Faces were located by applying these thresholds to the video frames
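As a sketch, this kind of per-speaker skin thresholding amounts to a per-channel range test followed by a bounding box around the surviving pixels. The function names and threshold ranges below are illustrative, not the original implementation:

```python
import numpy as np

def skin_mask(frame, r_range, g_range, b_range):
    """Boolean mask of pixels whose R, G and B values fall inside the
    manually trained per-speaker thresholds (hypothetical ranges)."""
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    return ((r >= r_range[0]) & (r <= r_range[1])
            & (g >= g_range[0]) & (g <= g_range[1])
            & (b >= b_range[0]) & (b <= b_range[1]))

def face_bounding_box(mask):
    """Bounding box (top, bottom, left, right) of the skin-coloured
    region, or None when no pixel passes the thresholds."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]
```

In practice the mask would also be cleaned with morphological operations before the box is taken, but the range test is the core of the method described on the slide.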

Page 8:

Finding and tracking eyes

• Top half of face region is searched for eyes

• A shifted version of Cr-Cb thresholding was performed to locate possible eye regions (Butler2003)

• Invalid eye candidate regions were removed, and the most likely pair of candidates chosen as the eyes

• The new eye locations are compared to the previous ones, and ignored if they have moved too far

• About 40% of sequences had to be manually eye-tracked every 50 frames.
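The temporal-consistency check in the tracking step can be sketched as a simple distance gate; the function name and pixel threshold below are hypothetical:

```python
import math

def gate_eye_location(new_eyes, old_eyes, max_shift):
    """Keep the previous eye locations when any new detection jumps
    further than max_shift pixels from its old position; otherwise
    accept the new detections."""
    for new, old in zip(new_eyes, old_eyes):
        if math.dist(new, old) > max_shift:
            return old_eyes  # reject the implausible jump
    return new_eyes
```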

Page 9:

Finding and tracking lips

• Eye locations are used to define rotation-normalised lip search region (LSR)

• LSR converted to Red/Green colour-space and thresholded

• Unlikely lip-candidates are removed
• The rectangular area containing the largest amount of lip-candidate area is taken as the lip ROI
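Finding the rectangle with the most lip-candidate pixels can be done efficiently with a 2-D cumulative sum (integral image). This is a sketch under the assumption of a fixed ROI size; `best_roi` is an illustrative name:

```python
import numpy as np

def best_roi(mask, h, w):
    """Return (top, left) of the h-by-w rectangle containing the most
    True pixels in a boolean lip-candidate mask, using an integral
    image so each rectangle sum costs four lookups."""
    ii = np.zeros((mask.shape[0] + 1, mask.shape[1] + 1), dtype=int)
    ii[1:, 1:] = np.cumsum(np.cumsum(mask.astype(int), axis=0), axis=1)
    best, best_pos = -1, (0, 0)
    for top in range(mask.shape[0] - h + 1):
        for left in range(mask.shape[1] - w + 1):
            count = (ii[top + h, left + w] - ii[top, left + w]
                     - ii[top + h, left] + ii[top, left])
            if count > best:
                best, best_pos = count, (top, left)
    return best_pos
```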

Page 10:

Feature Extraction and Datasets

• MFCC: 15 coefficients + 1 energy, plus deltas and accelerations = 48 features
• PCA: 20 eigenlip coefficients, plus deltas and accelerations = 60 features
  – Eigenlip-space trained on the entire data set of lip images
• Stationary speech from CUAVE (Patterson2002)
  – 5 sequences for training, 2 for testing (per speaker)
  – Testing was also performed on speech-babble-corrupted noisy versions
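The feature counts above follow from tripling the static dimension with deltas and accelerations: 16 static MFCCs (15 + energy) give 48 features, and 20 eigenlip coefficients give 60. A minimal sketch using a first-difference approximation (HTK uses a regression formula instead):

```python
import numpy as np

def add_deltas(features):
    """Append delta and acceleration (delta-delta) coefficients to a
    (frames x dims) feature matrix, tripling the feature dimension."""
    deltas = np.gradient(features, axis=0)   # frame-to-frame slope
    accels = np.gradient(deltas, axis=0)     # slope of the slope
    return np.hstack([features, deltas, accels])
```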

Page 11:

Training

• Phone transcriptions obtained from earlier research (Lucey2004) were used to train speaker-independent HMM phone models in both the audio and visual domains

• Speaker-dependent models were adapted from the speaker-independent models using MLLR adaptation

• The HMM Toolkit (HTK) was used (Young2002)

Page 12:

Comparing acoustic and visual information for speech processing

• Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models

• Test segments were freely transcribed using all speaker-dependent phoneme models
  – No restriction to a specified speaker or word

• Confusion tables for speech (phoneme) and speaker recognition were examined to obtain identification rates

An example of recognised (speaker, phoneme) labels against the correct transcription:

  Correct: s02m /w/,  s02m /ah/, s02m /n/
  Audio:   s10m /w/,  s02m /ah/, s02m /n/
  Video:   s02m /sp/, s02m /n/
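Extracting an identification rate from such a confusion table is just the diagonal mass over the total count. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def identification_rate(confusion):
    """Fraction of test segments whose recognised label matches the
    actual label: the trace of the confusion table over its total."""
    confusion = np.asarray(confusion)
    return confusion.trace() / confusion.sum()
```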

Page 13:

Example Confusion Table (Phonemes in Clean Acoustic Speech)

[Figure: confusion matrix of recognised phonemes against actual phonemes]

Page 14:

Example Confusion Table (Phonemes in Clean Visual Speech)

[Figure: confusion matrix of recognised phonemes against actual phonemes]

Page 15:

Likelihood of speaker and phone identification using phoneme models

Page 16:

Fusion

• Because each modality performs differently at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
• For these experiments:
  – Weighted-sum fusion of the top 10 normalised scores in each modality:

      ŝ_F = α ŝ_A + (1 − α) ŝ_V

  – The audio weighting α ranges from 0 (video only) to 1 (audio only)
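The weighted-sum fusion rule can be sketched as follows; `fuse_scores` and `decide` are illustrative names, and the audio/video scores are assumed to be already normalised:

```python
import numpy as np

def fuse_scores(audio_scores, video_scores, alpha):
    """Weighted-sum fusion of normalised per-class scores:
    fused = alpha * audio + (1 - alpha) * video, with alpha in [0, 1]
    (0 = video only, 1 = audio only)."""
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(video_scores, dtype=float)
    return alpha * a + (1.0 - alpha) * v

def decide(audio_scores, video_scores, alpha, labels):
    """Pick the label (word or speaker) with the highest fused score."""
    fused = fuse_scores(audio_scores, video_scores, alpha)
    return labels[int(np.argmax(fused))]
```

The same rule serves both tasks; only the label set (words vs. speakers) and the chosen weighting change.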

Page 17:

Speech vs Speaker: Word Identification

• The response of each system to speech-babble noise was compared over a selected range of SNR values.

[Figure: word error rate (0–100%) against noise level (−6, −3, 0, 3, 6, 9, 12 dB SNR, and clean) for 100% audio, 50% audio, 100% video, and the best fusion weighting]

Page 18:

Speech vs Speaker: Speaker Identification

• The response of each system to speech-babble noise was compared over a selected range of SNR values.

[Figure: speaker identification error (0–100%) against noise level (−6, −3, 0, 3, 6, 9, 12 dB SNR, and clean) for 100% audio, 50% audio, 100% video, and the best fusion weighting]

Page 19:

Speech vs Speaker

• Acoustic performance is essentially equal for both tasks
• Visual performance is clearly better for speaker recognition
• Fusion for speech recognition is catastrophic at nearly all noise levels
• Fusion for speaker recognition is catastrophic only at high noise levels
• We can also gauge the dominance of each modality from the audio weightings that produce the ‘best’ lines (ideal adaptive fusion)
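The ‘best’ lines correspond to an oracle search over the audio weighting in each noise condition. A minimal sketch, where `error_fn` is a hypothetical callable mapping a weighting to the measured error rate for one condition:

```python
import numpy as np

def best_audio_weight(error_fn, alphas=np.linspace(0.0, 1.0, 11)):
    """Sweep the audio weighting over a grid and return the value that
    minimises the recognition error for one noise condition."""
    errors = [error_fn(a) for a in alphas]
    return float(alphas[int(np.argmin(errors))])
```

Repeating this per SNR level traces out the weighting curves compared on the next slide.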

Page 20:

‘Best’ Fusion

[Figure: best audio weighting (0 = video only, 1 = audio only) against noise level (−6, −3, 0, 3, 6, 9, 12 dB SNR, and clean) for word ID and speaker ID]

Page 21:

Conclusion and Further Work

• PCA-based visual features are mostly person-dependent
  – They should be used with care in visual speech recognition tasks

• It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
  – Skin colour, facial hair, etc.

• Visual information for speech recognition is only useful in high-noise situations

Page 22:

Conclusion and Further Work

• Even at very low levels of acoustic noise, visual speech information can provide performance similar to acoustic information for speaker recognition

• Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance

• Further study is needed into methods of improving the visual modality for speech recognition by focusing more on the dynamic speech-related information
  – Mean-image removal, optical flow, contour representations

Page 23:

References

(Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," presented at Proceedings of the Seventh International Conference on Digital Image Computing: Techniques and Applications, DICTA 2003, Macquarie University, Sydney, Australia, 2003.

(Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," presented at SST 2004, Sydney, Australia, 2004.

(Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: a new audio-visual database for multimodal human-computer interface research," presented at ICASSP 2002, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.

(Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.

(Wark2001) T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, vol. 11, pp. 169-186, 2001.

Page 24:

Questions?