

Carmen Magariños (1), Carmen García-Mateo (2), Eduardo R. Banga (2)

(1) 'Ramón Piñeiro' Centre for Research in Humanities, Xunta de Galicia, SPAIN

(2) Multimedia Technologies Group, Universidade de Vigo, SPAIN

Voice Transformation in the HTS* Framework

A brief overview

* http://hts.sp.nitech.ac.jp/

COST IC 1206 meeting, Mataró (Spain), November 25-26th, 2013


Speech Synthesis Techniques

● Two main synthesis techniques:

– Unit Selection
    ● Large corpus of speech units.
    ● Good average speech quality, but with fluctuations.

– HMM synthesis
    ● Small memory and computational footprints.
    ● Rather good, stable speech quality.


Phone models [4]



Hidden Markov Model (HMM)

Hidden Semi-Markov Model (HSMM)



Phone models [2, 3, 5]




HSMM speech synthesis [1, 2, 3, 5, 8]



Parameter generation from concatenated HSMMs, given the sequence length T:

1) Find the most likely state sequence.

2) Given the state sequence, find the most likely observation sequence.
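
The two steps above can be sketched in Python for a toy case. This is a minimal illustration under stated assumptions, not the HTS implementation: it assumes Gaussian state-duration densities, a single one-dimensional feature with static and delta components, diagonal variances, and a delta window of 0.5·(c[t+1] − c[t−1]) applied to interior frames only. The function names (`state_durations`, `delta_window_matrix`, `generate_trajectory`) and all numbers are hypothetical.

```python
import numpy as np

def state_durations(dur_mean, dur_var, total_frames):
    """Step 1: most likely state durations under Gaussian duration densities.

    Durations are shifted from their means in proportion to their variances
    so that they sum to total_frames before rounding (rounding may change
    the sum slightly in general).
    """
    dur_mean = np.asarray(dur_mean, dtype=float)
    dur_var = np.asarray(dur_var, dtype=float)
    rho = (total_frames - dur_mean.sum()) / dur_var.sum()
    d = dur_mean + rho * dur_var
    return np.maximum(1, np.rint(d)).astype(int)

def delta_window_matrix(num_frames):
    """Build W such that o = W c stacks [static_t, delta_t] for every frame.

    Assumption: boundary frames carry no delta constraint (their rows are zero).
    """
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0                    # static component
        if 0 < t < num_frames - 1:           # delta component, interior frames
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    return W

def generate_trajectory(state_mean, state_var, durations):
    """Step 2: most likely static trajectory c given the state sequence.

    state_mean and state_var have shape (num_states, 2): [static, delta].
    Solves (W' S^-1 W) c = W' S^-1 mu, where S is the diagonal covariance.
    """
    mu = np.repeat(state_mean, durations, axis=0).reshape(-1)
    prec = 1.0 / np.repeat(state_var, durations, axis=0).reshape(-1)
    W = delta_window_matrix(int(np.sum(durations)))
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# Toy usage: two states, ten frames in total.
durations = state_durations([3, 5], [1, 1], total_frames=10)
trajectory = generate_trajectory(
    np.array([[0.0, 0.0], [1.0, 0.0]]),  # state means [static, delta]
    np.ones((2, 2)),                     # state variances
    durations,
)
print(durations, np.round(trajectory, 3))
```

Because the delta constraint ties neighbouring frames together, the generated trajectory moves smoothly between the state means instead of jumping at the state boundary.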



Decision Tree-based Context Clustering [2, 3, 5]




HSMM Synthesis with Speaker Adaptation [5, 7]




Average Voice Model [5]




Problem: Do all nodes have data from every speaker?





Solution: Shared-Decision Tree-based Context Clustering [5]



Shared-Decision Tree-based Context Clustering* [5]



(*) Not available in HTS



HSMM-based MLLR Adaptation [5]




Further improvements [5, 6, 7]

● Speaker Adaptive Training (SAT) alleviates speaker bias when training the average voice model.

● Structural Maximum A Posteriori (SMAP) adaptation takes the tree structure into account. The bias vectors, ϵ and ν, of the parent nodes are used as prior distributions when estimating the bias vectors of their child nodes.

● The Global Variance (GV) method tries to prevent oversmoothing when generating the parameters for speech synthesis.

● Constrained MLLR (CMLLR) adaptation. Mean vectors and covariance matrices are transformed simultaneously, using the same matrix.
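
To make the MLLR and CMLLR adaptation named above more concrete, here is a small Python sketch of how an already estimated affine transform would be applied to the average voice model. It is an illustration under assumptions: the transform matrix `zeta`, the scalar `chi`, the additive sign convention for the bias terms, and all function names are hypothetical, and estimating the transforms from adaptation data is not shown.

```python
import numpy as np

def mllr_adapt_means(means, zeta, eps):
    """MLLR-style mean adaptation: mu_hat_i = zeta @ mu_i + eps for each state i."""
    return means @ zeta.T + eps

def cmllr_adapt(means, covs, zeta, eps):
    """Constrained variant: the same matrix moves means and covariances together.

    means has shape (num_states, dim); covs has shape (num_states, dim, dim).
    """
    new_means = means @ zeta.T + eps
    new_covs = zeta @ covs @ zeta.T      # broadcasts over the state axis
    return new_means, new_covs

def adapt_duration_means(dur_means, chi, nu):
    """HSMM duration means get their own scalar affine transform with bias nu."""
    return chi * dur_means + nu

# Toy usage with made-up numbers.
means = np.array([[1.0, 2.0], [3.0, 4.0]])
covs = np.stack([np.eye(2), 2.0 * np.eye(2)])
zeta = 2.0 * np.eye(2)
eps = np.array([1.0, -1.0])

adapted_means, adapted_covs = cmllr_adapt(means, covs, zeta, eps)
adapted_durations = adapt_duration_means(np.array([3.0, 5.0]), chi=1.5, nu=0.5)
print(adapted_means, adapted_covs, adapted_durations, sep="\n")
```

In practice one transform is shared by many states (for example, all states under a regression-tree or decision-tree node), which is what lets a small amount of target-speaker data adapt the whole model.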




Speaker Adaptation Examples



● Test 1: Training with 3 Galician speakers

– Sabela ≈ 1300 sentences

– Iago ≈ 1300 sentences

– Concha ≈ 1300 sentences

– Adaptation to Paulino (100 sentences)

● Test 2: Training with 3 Galician speakers

– Sabela ≈ 1300 sentences

– Iago ≈ 1300 sentences

– Paulino ≈ 100 sentences

– Adaptation to Concha (40 sentences)

[Audio samples for each test: original, synthetic, and adapted voices]


HTS for Speaker De-identification?



● Direct application of HTS Voice Transformation (HTS VT) would require a prior phonetic ASR stage.

● HTS VT could be used to fool speaker identification (SID) systems.

● HTS could be used to create (partially) synthetic parallel corpora.

● Other uses of HTS-related techniques?



References

[1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis”, Proc. ICASSP-2000, pp. 1315–1318, June 2000.

[2] T. Yoshimura, “Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems”, Ph.D. Thesis, Nagoya Institute of Technology, Jan. 2002.

[3] K. Tokuda, H. Zen, and A. W. Black, “An HMM-Based Approach to Multilingual Speech Synthesis” (Chapter 7), in “Text-to-Speech Synthesis: New Paradigms and Advances”, S. Narayanan and A. Alwan (Eds.), Prentice Hall, pp. 135-153, Aug. 2004. (ISBN 978-0131456617)

[4] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Hidden semi-Markov model based speech synthesis”, Proc. ICSLP 2004, vol. II, pp. 1397-1400, Oct. 2004.

[5] J. Yamagishi, “Average-Voice-Based Speech Synthesis”, Ph.D. Thesis, Tokyo Institute of Technology, March 2006.

[6] J. Yamagishi, T. Kobayashi, S. Renals, S. King, H. Zen, T. Toda, and K. Tokuda, “Improved Average-Voice-based Speech Synthesis using Gender-Mixed Modeling and A Parameter Generation Algorithm considering GV”, Proc. ISCA SSW6, Aug. 2007.

[7] J. Yamagishi, T. Nose, H. Zen, Z.-H. Ling, T. Toda, K. Tokuda, S. King, and S. Renals, “Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis”, IEEE Transactions on Audio, Speech & Language Processing, 17, pp. 1208-1230, 2009.

[8] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis”, Speech Communication, vol. 51, no. 11, pp. 1039-1154, 2009.



Lots of additional related references at:

http://www.sp.nitech.ac.jp/~tokuda/gyoseki/

http://hts.sp.nitech.ac.jp/?Publications

