Speech is essentially bimodal.
• Acoustic and visual cues.
H. McGurk and J. MacDonald, "Hearing lips and seeing voices", Nature,
pp. 746-748, December 1976.
T. Chen and R. Rao, "Audio-visual integration in multimodal
communication", Proceedings of the IEEE, Special Issue on Multimedia
Signal Processing, vol. 86, pp. 837-852, May 1998.
D.B. Stork and M.E. Hennecke, eds., "Speechreading by Humans and
Machines". Springer, Berlin, Germany, 1996.
Through their integration (fusion), the aim is:
• To increase robustness and performance.
There are too many papers to cover, so let's look at a few, most of them
chosen from the point of view of integration.
AudioVisual-SpeechRecognition (AVSR)
● Tutorials.
G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior, "Recent advances in
the automatic recognition of audio-visual speech", Proceedings of the IEEE, vol.
91(9), pp. 1306-1326, September 2003.
G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech
recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds.,
Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT Press,
2004.
● Real conditions.
G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging
environments", in Proc. European Conference on Speech Technology,
pp. 1293-1296, 2003.
G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N.
Haas and J. Jiang, "Towards practical deployment of audio-visual speech
recognition", ICASSP'04, vol. 3, pp. 777-780, Montreal, Canada, 2004.
Increased robustness and performance, based on the facts that:
• The visual modality is unaffected by most losses of acoustic quality.
• The visual and acoustic modalities work in a complementary manner.
B. Dodd and R. Campbell, eds., "Hearing by Eye: The Psychology of Lipreading".
London, England. Lawrence Erlbaum Associates Ltd., 1987.
• But if the integration is not done well: catastrophic fusion (the fused
system performs worse than one of the modalities alone).
J.R. Movellan and P. Mineiro, "Modularity and catastrophic fusion: A Bayesian
approach with applications to audio-visual speech recognition", Tech. Rep.
97.01, Department of Cognitive Science, UCSD, San Diego, CA, 1997.
Early Integration (EI):
• At the feature level: concatenate the features.
• But the features are not synchronous (VOT)!
S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech
recognition", IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September
2000.
C.C. Chibelushi, J.S. Mason and F. Deravi, "Integration of acoustic and visual
speech for speaker recognition", Eurospeech'93, Berlin, pp. 157-160, September
1993.
AVSR Integration (Fusion)
VOT: voice onset time.
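A minimal sketch of early integration, to make the feature-level concatenation concrete. It assumes the visual stream has a lower frame rate than the audio stream and is upsampled by simple frame repetition; the function names, the toy data and the 2:1 rate ratio are illustrative assumptions, not taken from the cited papers.

```python
def upsample_repeat(frames, factor):
    """Repeat each frame `factor` times (nearest-neighbour upsampling)."""
    return [frame for frame in frames for _ in range(factor)]


def early_integration(audio_frames, visual_frames, factor):
    """Concatenate audio and upsampled visual features frame by frame."""
    visual_up = upsample_repeat(visual_frames, factor)
    n_frames = min(len(audio_frames), len(visual_up))
    # Each fused frame is the audio vector followed by the visual vector.
    return [audio_frames[t] + visual_up[t] for t in range(n_frames)]


# Toy data (assumed): 4 audio frames of dimension 2, 2 visual frames of
# dimension 1, i.e. the audio rate is twice the visual rate.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
visual = [[1.0], [2.0]]
fused = early_integration(audio, visual, factor=2)
print(fused)
```

The fused vectors are then fed to a single recognizer. Note that the concatenation forces both streams onto one time axis, which is exactly where the asynchrony between acoustic and visual cues becomes a problem.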
Late Integration (LI):
• At the decision level: combine the scores.
• Loss of all temporal information!
A. Adjoudani and C. Benoit, "Audio-visual speech recognition compared across
two architectures", Eurospeech'95, Madrid, Spain, pp. 1563-1566, September
1995.
S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech
recognition", IEEE Transactions on Multimedia, vol. 2, pp. 141-151, September
2000.
M. Heckmann, F. Berthommier and K. Kroschel, "Noise adaptive stream weighting
in audio-visual speech recognition", EURASIP Journal on Applied Signal
Processing, vol. 1, pp. 1260-1273, November 2002.
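A minimal sketch of late integration, assuming each single-modality recognizer outputs one score per class and that the scores are combined as a weighted sum of log-scores. The weighting rule, the class labels and the numbers are illustrative assumptions; the cited papers each use their own combination and weight-estimation schemes.

```python
import math


def late_integration(audio_scores, visual_scores, audio_weight):
    """Combine per-class scores from two recognizers at the decision level.

    The combined score is audio_weight * log(audio) +
    (1 - audio_weight) * log(visual); the best class is returned.
    """
    combined = {}
    for label in audio_scores:
        combined[label] = (
            audio_weight * math.log(audio_scores[label])
            + (1.0 - audio_weight) * math.log(visual_scores[label])
        )
    best = max(combined, key=combined.get)
    return best, combined


# Toy class scores (assumed): the recognizers disagree.
audio_post = {"ba": 0.6, "ga": 0.4}
visual_post = {"ba": 0.2, "ga": 0.8}

# Trust the audio in clean conditions, the video in noisy conditions.
clean_decision, _ = late_integration(audio_post, visual_post, audio_weight=0.9)
noisy_decision, _ = late_integration(audio_post, visual_post, audio_weight=0.3)
print(clean_decision, noisy_decision)
```

Only the final scores are combined, so nothing about the timing of the two streams within the utterance survives. A badly chosen weight can also make the fused decision worse than the better single modality.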
Middle Integration (MI) allows:
• Specific word or sub-word models.
• Synchronous continuous speech recognition.
J. Luettin, G. Potamianos and C. Neti, "Asynchronous stream modeling for large
vocabulary audio-visual speech recognition", ICASSP'01, vol. 1, pp. 169-172,
Salt Lake City, USA, May 2001.
G. Potamianos, J. Luettin and C. Neti, "Hierarchical discriminant features for
audio-visual LVCSR", ICASSP'01, vol. 1, pp. 165-168, Salt Lake City, USA, May
2001.
Multistream HMM
• State synchrony.
• Weighting of the observations.
A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for
audio-visual speech recognition", EURASIP Journal on Applied Signal
Processing, vol. 11, pp. 1-15, 2002.
G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech
recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier,
eds., Issues in Visual and Audio-visual Speech Processing, Chapter 10. MIT
Press, 2004.
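A minimal sketch of the observation weighting in a state-synchronous multistream HMM. It assumes one shared state with a separate one-dimensional Gaussian per stream and stream weights that sum to one; the function names, parameters and the choice of Gaussians are illustrative assumptions.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def multistream_loglik(obs_audio, obs_visual, state, audio_weight):
    """Weighted emission log-likelihood of one shared HMM state.

    Both streams are scored in the same state (state synchrony); the
    stream log-likelihoods are combined with exponent weights.
    """
    loglik_audio = gaussian_loglik(obs_audio, *state["audio"])
    loglik_visual = gaussian_loglik(obs_visual, *state["visual"])
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_visual


# Hypothetical state: (mean, variance) for each stream.
state = {"audio": (0.0, 1.0), "visual": (1.0, 4.0)}
loglik = multistream_loglik(0.5, 1.5, state, audio_weight=0.7)
print(loglik)
```

With `audio_weight=1.0` the state behaves like an audio-only HMM state, and with `audio_weight=0.0` like a visual-only one.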
AVSR Integration, Dynamic Bayesian Networks
Product HMM
[Figure: DBN unrolled over time slices t = 0, 1, 2, ..., T]
• Asynchrony between the streams.
• Too many parameters.
(Note: I am not sure about this graphical representation.)
G. Gravier, G. Potamianos and C. Neti, "Asynchrony modeling for audio-visual
speech recognition", in Human Language Technology Conference, 2002.
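A minimal sketch of why the product HMM grows quickly. It assumes both streams have the same number of states per model, that a composite state is a pair (audio state, visual state), and that asynchrony is limited to a maximum state lag. The dense transition count is an upper bound, since left-to-right topologies use fewer entries; all names are illustrative.

```python
def composite_states(n_states, max_async):
    """Enumerate (audio, visual) state pairs within the allowed lag."""
    return [
        (a, v)
        for a in range(n_states)
        for v in range(n_states)
        if abs(a - v) <= max_async
    ]


def dense_transition_entries(states):
    """Number of entries in a dense transition matrix over the states."""
    return len(states) ** 2


n = 3  # states per single-stream model (assumed)
synchronous = composite_states(n, max_async=0)
one_state_lag = composite_states(n, max_async=1)
unconstrained = composite_states(n, max_async=n)

for name, states in [("synchronous", synchronous),
                     ("one-state lag", one_state_lag),
                     ("unconstrained", unconstrained)]:
    print(name, len(states), dense_transition_entries(states))
```

Allowing more asynchrony enlarges the composite state space, and every composite state also needs its own observation parameters for both streams.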
Factorial HMM
[Figure: DBN unrolled over time slices t = 0, 1, 2, ..., T]
• Transition probabilities are independent for each stream.
Z. Ghahramani and M.I. Jordan, "Factorial hidden Markov models", in Proc.
Advances in Neural Information Processing Systems, vol. 8, pp. 472-478, 1995.
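A minimal sketch of the independence assumption in a factorial HMM: the joint transition probability factorizes into one term per stream, P(a', v' | a, v) = P(a' | a) P(v' | v). The two-state transition matrices are made-up toy values.

```python
def factorial_transition(trans_audio, trans_visual):
    """Joint transition table built from two independent chains.

    joint[(a, v)][(a2, v2)] = trans_audio[a][a2] * trans_visual[v][v2]
    """
    joint = {}
    for a, row_a in enumerate(trans_audio):
        for v, row_v in enumerate(trans_visual):
            joint[(a, v)] = {
                (a2, v2): p_a * p_v
                for a2, p_a in enumerate(row_a)
                for v2, p_v in enumerate(row_v)
            }
    return joint


# Toy left-to-right chains (assumed values).
trans_audio = [[0.9, 0.1], [0.0, 1.0]]
trans_visual = [[0.5, 0.5], [0.0, 1.0]]
joint = factorial_transition(trans_audio, trans_visual)
print(joint[(0, 0)])
```

Only the two small matrices are free parameters; the joint table is derived from them. The streams interact through the shared observation model rather than through the transitions.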
Coupled HMM (1/2)
[Figure: DBN unrolled over time slices t = 0, 1, 2, ..., T]
• The backbones (the hidden state chains of the two streams) depend on each other.
M. Brand, N. Oliver and A. Pentland, "Coupled hidden Markov models for complex
action recognition", in Proc. IEEE Conf. on Computer Vision and Pattern
Recognition, pp. 994-999, 1997.
S. Chu and T. Huang, "Audio-visual speech modeling using coupled hidden Markov
models", ICASSP'02, pp. 2009-2012, 2002.
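A minimal sketch of the dependence between the backbones in a coupled HMM: the next state of each chain is conditioned on the previous states of both chains, P(a', v' | a, v) = P(a' | a, v) P(v' | a, v). The conditional probability tables hold made-up toy values and the names are illustrative.

```python
def coupled_transition(cpt_audio, cpt_visual, prev):
    """Distribution over the next (audio, visual) state pair.

    cpt_audio[(a, v)] and cpt_visual[(a, v)] are distributions over the
    next state of the respective chain, given both previous states.
    """
    return {
        (a2, v2): p_a * p_v
        for a2, p_a in enumerate(cpt_audio[prev])
        for v2, p_v in enumerate(cpt_visual[prev])
    }


# Toy tables (assumed): the audio chain is more likely to advance once
# the visual chain has already moved to state 1.
cpt_audio = {(0, 0): [0.8, 0.2], (0, 1): [0.5, 0.5],
             (1, 0): [0.0, 1.0], (1, 1): [0.0, 1.0]}
cpt_visual = {(0, 0): [0.9, 0.1], (0, 1): [0.0, 1.0],
              (1, 0): [0.4, 0.6], (1, 1): [0.0, 1.0]}

next_dist = coupled_transition(cpt_audio, cpt_visual, prev=(0, 1))
print(next_dist)
```

Compared with a factorial HMM, each chain's table is indexed by both previous states, which is what lets one stream pull the other along while still allowing some asynchrony.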
Coupled HMM (2/2)
[Figure: DBN unrolled over time slices t = 0, 1, 2, ..., T]
A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for
audio-visual speech recognition", EURASIP Journal on Applied Signal
Processing, vol. 11, pp. 1-15, 2002.
A. Subramanya, S. Gurbuz, E. Patterson and J.N. Gowdy, "Audiovisual speech
integration using coupled hidden Markov models for continuous speech recognition",
ICASSP'03, 2003.
Implicit Modeling
J.N. Gowdy, A. Subramanya, C. Bartels and J. Bilmes, "DBN based Multi-stream
models for audio-visual speech recognition", ICASSP'04, Montreal, Canada, 2004.
X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, "DBN based Multi-stream for Mandarin
Toneme Recognition", ICASSP'05, Philadelphia, USA, 2005.