Carmen Magariños (1), Carmen García-Mateo (2), Eduardo R. Banga (2)
(1) 'Ramón Piñeiro' Centre for Research in Humanities, Xunta de Galicia, SPAIN
(2) Multimedia Technologies Group, Universidade de Vigo, SPAIN
Voice Transformation in the HTS* Framework
A brief overview
* http://hts.sp.nitech.ac.jp/
COST IC 1206 meeting, Mataró (Spain), November 25-26th, 2013
Speech Synthesis Techniques
● Two main synthesis techniques:
– Unit Selection
  ● Large corpus of speech units.
  ● Good average speech quality, but with fluctuations.
– HMM synthesis
  ● Small memory and computational footprints.
  ● Rather good, stable speech quality.
Phone models [4]
Hidden Markov Model (HMM)
Hidden Semi-Markov Model (HSMM)
Phone models [2, 3, 5]
HSMM speech synthesis [1, 2, 3, 5, 8]
Parameter generation from concatenated HSMMs, given the sequence length T:
1) Find the most likely state sequence.
2) Given the state sequence, find the most likely observation sequence.
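The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the HTS implementation. It assumes Gaussian state-duration densities, for which the durations that maximise the duration likelihood while summing to T are d_k = m_k + rho * var_k, with rho = (T - sum of m_k) / (sum of var_k). For step 2 it ignores the dynamic (delta) features, so the most likely observation sequence is simply each state's mean repeated for its duration; the algorithms in [1] instead solve a linear system that yields smooth trajectories. All function names and numbers are hypothetical.

```python
def state_durations(dur_means, dur_vars, total_frames):
    """Step 1: real-valued state durations that maximise the Gaussian
    duration likelihood subject to sum(d) == total_frames."""
    rho = (total_frames - sum(dur_means)) / sum(dur_vars)
    return [m + rho * v for m, v in zip(dur_means, dur_vars)]


def to_frames(durations):
    """Convert real-valued durations to whole frames by rounding the
    cumulative state boundaries, so the total length is preserved."""
    frames, cum, prev = [], 0.0, 0
    for d in durations:
        cum += d
        edge = round(cum)
        frames.append(edge - prev)
        prev = edge
    return frames


def mean_trajectory(state_means, frames):
    """Step 2 (simplified, no delta features): repeat each state's
    output mean for the number of frames assigned to that state."""
    trajectory = []
    for mu, n in zip(state_means, frames):
        trajectory.extend([mu] * n)
    return trajectory


# Hypothetical three-state model and a requested length of T = 16 frames.
dur_means = [3.0, 5.0, 4.0]
dur_vars = [1.0, 2.0, 1.0]
out_means = [1.0, 2.0, 0.5]

durations = state_durations(dur_means, dur_vars, 16)
frames = to_frames(durations)
trajectory = mean_trajectory(out_means, frames)
```

States with a larger duration variance absorb more of the difference between T and the sum of the mean durations, which is why the second state stretches the most in this example.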
Decision Tree-based Context Clustering [2, 3, 5]
HSMM Synthesis with Speaker Adaptation [5, 7]
Average Voice Model [5]
Problem: Does every node have data from every speaker?
Solution: Shared-Decision Tree-Based Context Clustering [5]
Shared-Decision Tree-based Context Clustering* [5]
(*) Not available in HTS
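The idea behind a shared decision tree can be illustrated with a small Python sketch. It assumes, following our reading of [5], that a context question is only accepted for a split when both resulting child nodes still contain data from every training speaker. The data layout, the function names and the example questions are hypothetical, and the likelihood or MDL criterion used to rank the admissible questions is not shown.

```python
def split(samples, question):
    """Partition the samples at a node by a yes/no context question."""
    yes = [s for s in samples if question(s["context"])]
    no = [s for s in samples if not question(s["context"])]
    return yes, no


def speakers_in(samples):
    """Set of speakers that contribute data to a node."""
    return {s["speaker"] for s in samples}


def split_keeps_all_speakers(samples, question, all_speakers):
    """True if both child nodes contain data from every speaker."""
    yes, no = split(samples, question)
    return speakers_in(yes) == all_speakers and speakers_in(no) == all_speakers


# Hypothetical node with data from two speakers, A and B.
node = [
    {"speaker": "A", "context": {"left": "a"}},
    {"speaker": "A", "context": {"left": "t"}},
    {"speaker": "B", "context": {"left": "a"}},
    {"speaker": "B", "context": {"left": "k"}},
]
all_speakers = {"A", "B"}


def left_is_vowel(ctx):
    return ctx["left"] in "aeiou"


def left_is_t(ctx):
    return ctx["left"] == "t"


vowel_ok = split_keeps_all_speakers(node, left_is_vowel, all_speakers)
t_ok = split_keeps_all_speakers(node, left_is_t, all_speakers)
```

Here the vowel question is admissible because both speakers appear on each side of the split, whereas the "left phone is t" question is rejected: only speaker A has such a context, so the "yes" child would lack data from speaker B.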
HSMM-based MLLR Adaptation [5]
Further improvements [5, 6, 7]
● Speaker Adaptive Training (SAT) alleviates speaker bias when training the average voice model.
● Structural Maximum A Posteriori (SMAP) adaptation takes the tree structure into account: the bias vectors, ϵ and ν, of the parent nodes are used as prior distributions when estimating the bias vectors of their child nodes.
● The Global Variance (GV) method tries to prevent oversmoothing when generating the parameters for speech synthesis.
● Constrained MLLR (CMLLR) adaptation: mean vectors and covariance matrices are transformed simultaneously, using the same transformation matrix.
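The difference between MLLR and CMLLR adaptation can be made concrete with a scalar (one-dimensional) Python sketch. It assumes a single Gaussian with mean and variance, and an already estimated affine transform given by a scale and a bias; estimating that transform from the adaptation data by maximum likelihood is not shown. The function and parameter names are hypothetical.

```python
def mllr_adapt(mean, var, scale, bias):
    """MLLR-style adaptation: only the mean is transformed,
    the variance of the average voice model is kept."""
    return scale * mean + bias, var


def cmllr_adapt(mean, var, scale, bias):
    """CMLLR-style adaptation: the same transform is applied to the
    mean and to the variance (scale * var * scale in one dimension)."""
    return scale * mean + bias, scale * scale * var


# Hypothetical average-voice Gaussian and speaker transform.
avg_mean, avg_var = 2.0, 0.5
scale, bias = 2.0, 1.0

mllr_mean, mllr_var = mllr_adapt(avg_mean, avg_var, scale, bias)
cmllr_mean, cmllr_var = cmllr_adapt(avg_mean, avg_var, scale, bias)
```

Both variants move the mean to the same place; only the constrained variant also rescales the variance, which is what ties the mean and covariance transforms together.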
Speaker Adaptation Examples
● Test 1: Training with 3 Galician speakers
– Sabela ≈ 1300 sentences
– Iago ≈ 1300 sentences
– Concha ≈ 1300 sentences
– Adaptation to Paulino (100 sent.)
● Test 2: Training with 3 Galician speakers
– Sabela ≈ 1300 sentences
– Iago ≈ 1300 sentences
– Paulino ≈ 100 sentences
– Adaptation to Concha (40 sent.)
[Audio samples: Original, Synthetic, Adapted]
HTS for Speaker De-identification?
● Direct application of HTS Voice Transformation (HTS VT) would require a prior phonetic ASR step.
● HTS VT could be used to fool speaker identification (SID) systems.
● HTS could be used to create (partially) synthetic parallel corpora.
● Other uses of HTS-related techniques?
References
[1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” Proc. ICASSP-2000, pp. 1315–1318, June 2000.
[2] T. Yoshimura, “Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems,” Ph.D. Thesis, Nagoya Institute of Technology, Jan. 2002.
[3] K. Tokuda, H. Zen, and A. W. Black, “An HMM-Based Approach to Multilingual Speech Synthesis” (Chapter 7), in “Text-to-Speech Synthesis: New Paradigms and Advances,” S. Narayanan and A. Alwan (Eds.), Prentice Hall, pp. 135–153, Aug. 2004. (ISBN 978-0131456617)
[4] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Hidden semi-Markov model based speech synthesis,” Proc. ICSLP 2004, vol. II, pp. 1397–1400, Oct. 2004.
[5] J. Yamagishi, “Average-Voice-Based Speech Synthesis,” Ph.D. Thesis, Tokyo Institute of Technology, March 2006.
[6] J. Yamagishi, T. Kobayashi, S. Renals, S. King, H. Zen, T. Toda, and K. Tokuda, “Improved Average-Voice-based Speech Synthesis using Gender-Mixed Modeling and a Parameter Generation Algorithm considering GV,” Proc. ISCA SSW6, Aug. 2007.
[7] J. Yamagishi, T. Nose, H. Zen, Z.-H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals, “Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis,” IEEE Transactions on Audio, Speech & Language Processing, vol. 17, pp. 1208–1230, 2009.
[8] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1154, 2009.
Lots of additional related references at:
http://www.sp.nitech.ac.jp/~tokuda/gyoseki/
http://hts.sp.nitech.ac.jp/?Publications