Audiovisual-to-articulatory speech inversion using Hidden Markov Models
Athanassios Katsamanis, George Papandreou, Petros Maragos
School of E.C.E., National Technical University of Athens, Athens 15773, Greece
Acknowledgements
This research was co-financed partially by the E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%) under Grant ΠΕΝΕΔ-2003ΕΔ866, and partially by the European research project ASPI under Grant FP6-021324. We would also like to thank O. Engwall from KTH for providing us with the QSMT database.
Evaluation
Measured (black) and predicted (light color) articulatory trajectories
Speech inversion: recover the vocal tract geometry from the speech signal and the speaker's face.
Applications in language tutoring and speech therapy.
Zero states correspond to the case of a global linear model.
Qualisys-Movetrack database
[Figure: Determination of the multistream HMM state sequence for a phone sequence /p1/ /p2/. Recorded modalities: Electromagnetic Articulography (EMA); Video, giving visual features y_v (3-D marker coordinates) with stream weight w_v; Audio, giving acoustic features y_a (spectral characteristics/MFCC) with stream weight w_a.]
Why Canonical Correlation Analysis (CCA)? It leads to optimal reduced-rank linear regression models and improves predictive performance when training data are limited.
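The reduced-rank regression idea above can be sketched numerically: whiten the two variable sets, take the SVD of the whitened cross-covariance (whose singular values are the canonical correlations), truncate it to the desired rank, and map back. This is a minimal NumPy illustration under standard CCA/reduced-rank-regression assumptions; the function names are ours, not from the poster.

```python
import numpy as np

def mat_pow(S, p):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

def rrr_via_cca(X, Y, rank):
    """Rank-constrained linear regression Y ~ X B in canonical (CCA) coordinates."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    # Whitened cross-covariance; its singular values are the canonical correlations.
    M = mat_pow(Sxx, -0.5) @ Sxy @ mat_pow(Syy, -0.5)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Mr = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-r approximation
    return mat_pow(Sxx, -0.5) @ Mr @ mat_pow(Syy, 0.5)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
Y = X @ rng.standard_normal((6, 4)) + 0.05 * rng.standard_normal((500, 4))
B_full = rrr_via_cca(X, Y, rank=4)  # at full rank this coincides with OLS
B_ols, *_ = np.linalg.lstsq(X - X.mean(0), Y - Y.mean(0), rcond=None)
print(np.allclose(B_full, B_ols, atol=1e-6))
```

At full rank the CCA-based estimate reduces to ordinary least squares; the gain claimed on the poster comes from truncating the rank when training data are scarce.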
Generalization error of the linear regression model vs. model order for varying training set sizes. Upper row: tongue position from face expression. Lower row: face expression from tongue position.
We use multistream HMMs. The visual-to-articulatory mapping is expected to be nonlinear, and the visual stream is incorporated following the audiovisual ASR paradigm.
We apply CCA: a linear mapping between audiovisual and articulatory data is trained at each HMM state.
Performance is improved compared to a global linear model and to audio-only or visual-only HMMs.
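The per-state training step can be sketched as follows: group frames by their aligned HMM state and fit one linear mapping A_i per state (here by plain least squares on noise-free synthetic data; the poster's actual mappings are rank-constrained via CCA, and the state alignment would come from decoding). All names below are illustrative.

```python
import numpy as np

def fit_state_mappings(X, Y, states):
    """Fit one least-squares mapping A_i (features -> articulation) per HMM state."""
    A = {}
    for i in np.unique(states):
        idx = states == i
        A[i], *_ = np.linalg.lstsq(X[idx], Y[idx], rcond=None)
    return A

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 5))                 # audiovisual feature frames
states = np.repeat([0, 1], 200)                   # state alignment of each frame
A0 = rng.standard_normal((5, 3))                  # true per-state mappings
A1 = rng.standard_normal((5, 3))
Y = np.vstack([X[:200] @ A0, X[200:] @ A1])       # articulatory frames
A_hat = fit_state_mappings(X, Y, states)
print(np.allclose(A_hat[0], A0) and np.allclose(A_hat[1], A1))
```

With enough frames per state, each A_hat[i] recovers the mapping that generated that state's data, which is why a per-state model can outperform a single global linear one.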
Time t, state i:

y_t = A_i x_t + ε_t

Maximum A Posteriori articulatory parameter estimate:

x̂ = (Σ_x^{-1} + A_i^T Q_i^{-1} A_i)^{-1} (Σ_x^{-1} x̄ + A_i^T Q_i^{-1} y),

where Q_i is the covariance of the approximation error ε_t and the prior of x is considered to be Gaussian, with mean x̄ and covariance Σ_x determined at the training phase.
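The MAP estimate above is a standard linear-Gaussian computation and can be checked numerically: x̂ must make the prior pull Σ_x^{-1}(x̂ - x̄) balance the data pull A_i^T Q_i^{-1}(y - A_i x̂). A minimal NumPy sketch, with illustrative dimensions and names:

```python
import numpy as np

def map_articulatory_estimate(y, A, Q, x_bar, Sigma_x):
    """MAP estimate of x for y = A x + eps, eps ~ N(0, Q), prior x ~ N(x_bar, Sigma_x)."""
    Qi = np.linalg.inv(Q)
    Si = np.linalg.inv(Sigma_x)
    P = np.linalg.inv(Si + A.T @ Qi @ A)          # posterior covariance
    return P @ (Si @ x_bar + A.T @ Qi @ y)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4))        # state-dependent mapping A_i
Q = 0.01 * np.eye(8)                   # approximation-error covariance Q_i
Sigma_x, x_bar = np.eye(4), np.zeros(4)
y = A @ rng.standard_normal(4)         # observed audiovisual features
x_hat = map_articulatory_estimate(y, A, Q, x_bar, Sigma_x)
# Stationarity of the posterior: Sigma_x^{-1}(x_hat - x_bar) = A^T Q^{-1} (y - A x_hat)
lhs = np.linalg.inv(Sigma_x) @ (x_hat - x_bar)
rhs = A.T @ np.linalg.inv(Q) @ (y - A @ x_hat)
print(np.allclose(lhs, rhs))
```

The check confirms that x̂ is exactly the point where the Gaussian prior and the state-conditional observation model agree, i.e. the posterior mode.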