Speech is essentially bimodal.

• Acoustic and visual cues.

H. McGurk and J. MacDonald, ''Hearing lips and seeing voices'', Nature, pp. 746-748, December 1976.

T. Chen and R. Rao, ''Audio-visual integration in multimodal communication'', Proceedings of the IEEE, Special Issue on Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998.

D.B. Stork and M.E. Hennecke, editors, ''Speechreading by Humans and Machines''. Springer, Berlin, Germany, 1996.

Through their integration (fusion), the aim is:

• To increase robustness and performance.

There are too many papers to cover them all, so let's look at a few, most of them from the point of view of integration.

Audio-Visual Speech Recognition (AVSR)

● Tutorials.

G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior, ''Recent advances in the automatic recognition of audio-visual speech'', Proceedings of the IEEE, vol. 91(9), pp. 1306-1326, September 2003.

G. Potamianos, C. Neti, J. Luettin and I. Matthews, ''Audio-visual automatic speech recognition: An overview'', In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.

● Real Conditions.

G. Potamianos and C. Neti, ''Audio-visual speech recognition in challenging environments'', In Proc. European Conference on Speech Technology, pp. 1293-1296, 2003.

G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, ''Towards practical deployment of audio-visual speech recognition'', ICASSP'04, vol. 3, pp. 777-780, Montreal, Canada, 2004.


Increase Robustness and Performance

Based on the facts that:

• The visual modality is unaffected by most losses of acoustic quality.

• The visual and acoustic modalities work in a complementary manner.

B. Dodd and R. Campbell, eds., ''Hearing by Eye: The Psychology of Lipreading''. London, England. Lawrence Erlbaum Associates Ltd., 1987.

• But if the integration is not done well: catastrophic fusion (the fused system performs worse than either modality alone).

J.R. Movellan and P. Mineiro, ''Modularity and catastrophic fusion: A Bayesian approach with applications to audio-visual speech recognition'', Tech. Rep. 97.01, Department of Cognitive Science, UCSD, San Diego, CA, 1997.


AVSR Integration (Fusion)

Early Integration (EI):

• At the feature level, concatenate the features.

• But the features are not synchronous (voice onset time, VOT)!

S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuous speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

C.C. Chibelushi, J.S. Mason and F. Deravi, ''Integration of acoustic and visual speech for speaker recognition'', Eurospeech'93, Berlin, pp. 157-160, September 1993.
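A minimal sketch of feature-level fusion, not taken from the cited papers. It assumes audio features at 100 frames/s and visual features at 25 frames/s (illustrative rates), repeats each visual frame to match the audio rate, and concatenates the aligned frames. The name `early_integration` and the toy feature values are hypothetical.

```python
from typing import List

Frame = List[float]


def early_integration(audio: List[Frame], visual: List[Frame],
                      audio_rate: int = 100, visual_rate: int = 25) -> List[Frame]:
    """Concatenate each audio frame with the visual frame covering the same instant.

    The frame rates (frames per second) are illustrative assumptions.
    """
    fused = []
    for i, a in enumerate(audio):
        # Index of the visual frame that covers audio frame i (frame repetition).
        j = min(i * visual_rate // audio_rate, len(visual) - 1)
        fused.append(a + visual[j])
    return fused


# Toy data: 8 audio frames with 2 features each, 2 visual frames with 1 feature each.
audio = [[float(i), float(i) + 0.5] for i in range(8)]
visual = [[10.0], [20.0]]

fused = early_integration(audio, visual)
print(len(fused), len(fused[0]))  # 8 frames, 3 features per frame
print(fused[3], fused[4])         # the visual feature changes at audio frame 4
```

Because every fused vector is scored by a single model state, any lag between the acoustic and the visual event ends up inside the same observation, which is one way to read the synchrony problem noted above.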

Late Integration (LI):

• At the decision level, combine the scores.

• All temporal information is lost!

A. Adjoudani and C. Benoit, ''Audio-visual speech recognition compared across two architectures'', Eurospeech'95, Madrid, Spain, pp. 1563-1566, September 1995.

S. Dupont and J. Luettin, ''Audio-visual speech modeling for continuous speech recognition'', IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September 2000.

M. Heckmann, F. Berthommier and K. Kroschel, ''Noise adaptive stream weighting in audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 1, pp. 1260-1273, November 2002.
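A minimal sketch of decision-level fusion, assuming each modality has its own recognizer that returns one log-likelihood per candidate word for the whole utterance. The scores, the labels and the function `late_integration` are made up for illustration. The weight is fixed by hand here; how to choose it is not covered by this sketch.

```python
from typing import Dict


def late_integration(audio_scores: Dict[str, float],
                     visual_scores: Dict[str, float],
                     audio_weight: float) -> str:
    """Return the label with the highest weighted sum of log-likelihoods."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    combined = {
        label: audio_weight * audio_scores[label]
        + (1.0 - audio_weight) * visual_scores[label]
        for label in audio_scores
    }
    return max(combined, key=combined.get)


# Made-up utterance-level log-likelihoods for three candidate words.
audio_scores = {"ba": -12.0, "da": -11.0, "ga": -15.0}
visual_scores = {"ba": -4.0, "da": -9.0, "ga": -8.0}

clean = late_integration(audio_scores, visual_scores, audio_weight=0.9)
noisy = late_integration(audio_scores, visual_scores, audio_weight=0.3)
print(clean, noisy)  # trusting audio picks "da", trusting video picks "ba"
```

Only one score per word and per modality reaches the combiner, so the frame-by-frame timing of the two streams plays no role in the decision.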


Middle Integration (MI) allows:

• Specific word or sub-word models.

• Synchronous continuous speech recognition.

J. Luettin, G. Potamianos and C. Neti, ''Asynchronous stream modeling for large vocabulary audio-visual speech recognition'', ICASSP'01, vol. 1, pp. 169-172, Salt Lake City, USA, May 2001.

G. Potamianos, J. Luettin and C. Neti, ''Hierarchical discriminant features for audio-visual LVCSR'', ICASSP'01, vol. 1, pp. 165-168, Salt Lake City, USA, May 2001.


Multistream HMM

• State synchrony

• Weighting the observations

A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

G. Potamianos, C. Neti, J. Luettin and I. Matthews, ''Audio-visual automatic speech recognition: An overview'', In G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press, 2004.
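A sketch of how observation weighting can look in a state-synchronous multistream model, assuming the common form in which the emission log-likelihood of one shared state is a weighted sum of the per-stream log-likelihoods. One-dimensional Gaussians stand in for the real stream models, and all names and numbers are illustrative.

```python
import math
from typing import Sequence


def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def multistream_log_emission(obs: Sequence[float],
                             means: Sequence[float],
                             variances: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Weighted sum of per-stream log-likelihoods for ONE shared HMM state.

    obs, means, variances and weights each hold one entry per stream
    (here: index 0 = audio, index 1 = visual).
    """
    return sum(
        w * log_gaussian(x, m, v)
        for x, m, v, w in zip(obs, means, variances, weights)
    )


obs = (0.0, 3.0)          # audio sample fits the state, visual sample does not
means = (0.0, 1.0)
variances = (1.0, 1.0)

balanced = multistream_log_emission(obs, means, variances, weights=(0.5, 0.5))
audio_only = multistream_log_emission(obs, means, variances, weights=(1.0, 0.0))
print(balanced, audio_only)
```

Both streams are scored against the same state at every frame, which is the state synchrony named in the first bullet; the weights only change how much each stream contributes.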

AVSR Integration, Dynamic Bayesian Networks

Product HMM

[Figure: graphical model over time slices t=0, t=1, t=2, ..., t=T]

• Asynchrony between the streams

• Too many parameters

Note: I am not sure about this graphical representation.

G. Gravier, G. Potamianos and C. Neti, ''Asynchrony modeling for audio-visual speech recognition'', In Proc. Human Language Technology Conference, 2002.
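One way to see the parameter growth, sketched under the assumption that a product HMM uses composite (audio state, visual state) pairs with a full, unconstrained transition matrix. The `max_async` limit is a hypothetical knob added here to show how restricting the allowed drift between the streams shrinks the state space; it is not claimed to be the exact scheme of the cited paper.

```python
from itertools import product
from typing import List, Optional, Tuple


def composite_states(n_audio: int, n_visual: int,
                     max_async: Optional[int] = None) -> List[Tuple[int, int]]:
    """Enumerate (audio_state, visual_state) pairs of a product HMM.

    max_async is a hypothetical limit: pairs whose state indices differ by
    more than max_async are dropped, bounding how far the streams may drift.
    """
    pairs = product(range(n_audio), range(n_visual))
    return [(a, v) for a, v in pairs
            if max_async is None or abs(a - v) <= max_async]


def transition_entries(n_states: int) -> int:
    """Entries in a full (unconstrained) n_states x n_states transition matrix."""
    return n_states * n_states


full = composite_states(3, 3)
limited = composite_states(3, 3, max_async=1)
separate_chains = transition_entries(3) + transition_entries(3)

print(len(full), transition_entries(len(full)))        # 9 composite states, 81 entries
print(len(limited), transition_entries(len(limited)))  # 7 composite states, 49 entries
print(separate_chains)                                 # 18 entries for two separate chains
```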


Factorial HMM

[Figure: graphical model over time slices t=0, t=1, t=2, ..., t=T]

• Transition probabilities are independent for each stream.

Z. Ghahramani and M.I. Jordan, ''Factorial hidden Markov models'', In Proc. Advances in Neural Information Processing Systems, vol. 8, pp. 472-478, 1995.
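A small sketch of what independent transitions mean for the joint state, assuming two chains (audio and visual) with two states each. The joint transition probability factorizes into the product of the per-stream probabilities. The matrices are made-up examples.

```python
from itertools import product
from typing import List, Tuple

Matrix = List[List[float]]
State = Tuple[int, int]  # (audio_state, visual_state)


def factorial_transition(audio_trans: Matrix, visual_trans: Matrix,
                         prev: State, nxt: State) -> float:
    """P(nxt | prev) when each chain moves independently of the other."""
    a_prev, v_prev = prev
    a_next, v_next = nxt
    return audio_trans[a_prev][a_next] * visual_trans[v_prev][v_next]


# Illustrative left-to-right transition matrices (rows sum to 1).
audio_trans = [[0.6, 0.4], [0.0, 1.0]]
visual_trans = [[0.7, 0.3], [0.0, 1.0]]

# Audio advances while video stays in place: 0.4 * 0.7.
p = factorial_transition(audio_trans, visual_trans, (0, 0), (1, 0))

# The probabilities of all joint successors of (0, 0) still sum to 1.
total = sum(
    factorial_transition(audio_trans, visual_trans, (0, 0), nxt)
    for nxt in product(range(2), range(2))
)
print(p, total)
```

Only the two small per-stream matrices are stored; the joint table over state pairs is never built explicitly.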


Coupled HMM (1/2)

[Figure: graphical model over time slices t=0, t=1, t=2, ..., t=T]

• The backbones (the hidden state chains) depend on each other.

M. Brand, N. Oliver and A. Pentland, ''Coupled hidden Markov models for complex action recognition'', In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 994-999, 1997.

S. Chu and T. Huang, ''Audio-visual speech modeling using coupled hidden Markov models'', ICASSP'02, pp. 2009-2012, 2002.
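A sketch of the dependence between the backbones, assuming the usual coupled-HMM form in which the next state of each chain is conditioned on the previous states of both chains. The conditional tables are invented for illustration and the names are hypothetical.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]             # (audio_state, visual_state)
CondTable = Dict[State, List[float]]  # previous joint state -> next-state distribution


def coupled_transition(audio_cpt: CondTable, visual_cpt: CondTable,
                       prev: State, nxt: State) -> float:
    """P(nxt | prev) = P(a_next | a_prev, v_prev) * P(v_next | a_prev, v_prev)."""
    a_next, v_next = nxt
    return audio_cpt[prev][a_next] * visual_cpt[prev][v_next]


# Illustrative tables: each row is a distribution over the chain's next state.
audio_cpt = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.3, 0.7],  # video is already ahead, so audio is pushed to follow
    (1, 0): [0.0, 1.0],
    (1, 1): [0.0, 1.0],
}
visual_cpt = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.0, 1.0],
    (1, 0): [0.4, 0.6],  # audio is already ahead, so video is pushed to follow
    (1, 1): [0.0, 1.0],
}

p_in_step = coupled_transition(audio_cpt, visual_cpt, (0, 0), (1, 0))   # 0.2 * 0.9
p_catch_up = coupled_transition(audio_cpt, visual_cpt, (0, 1), (1, 1))  # 0.7 * 1.0
print(p_in_step, p_catch_up)
```

In this toy setting the audio chain is more likely to advance once the visual chain has moved on, a coupling that independent per-stream transitions cannot express.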


Coupled HMM (2/2)

[Figure: graphical model over time slices t=0, t=1, t=2, ..., t=T]

A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, ''Dynamic Bayesian Networks for audio-visual speech recognition'', EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.

A. Subramanya, S. Gurbuz, E. Patterson and J.N. Gowdy, ''Audiovisual speech integration using coupled hidden Markov models for continuous speech recognition'', ICASSP'03, 2003.


Implicit Modeling

J.N. Gowdy, A. Subramanya, C. Bartels and J. Bilmes, ''DBN based multi-stream models for audio-visual speech recognition'', ICASSP'04, Montreal, Canada, 2004.

X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, ''DBN based multi-stream models for Mandarin toneme recognition'', ICASSP'05, Philadelphia, USA, 2005.

