A Bayesian Approach to HMM-Based Speech Synthesis


Kei Hashimoto (1), Heiga Zen (1), Yoshihiko Nankaku (1), Takashi Masuko (2), and Keiichi Tokuda (1)

(1) Nagoya Institute of Technology
(2) Tokyo Institute of Technology

Background

HMM-based speech synthesis system
- Spectrum, excitation, and duration are modeled
- Speech parameter sequences are generated

Maximum likelihood (ML) criterion
- Train HMMs and generate speech parameters
- Point estimate ⇒ the over-fitting problem

Bayesian approach
- Estimate the posterior distribution of the model parameters
- Prior information can be used
⇒ Alleviates the over-fitting problem

Outline

Bayesian speech synthesis
- Variational Bayesian method
- Speech parameter generation

Bayesian context clustering

Prior distribution using cross validation

Experiments

Conclusion & Future work

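Bayesian speech synthesis (1/2)

Model training and speech synthesis

Quantities involved
- Model parameters
- Label sequence for synthesis
- Label sequence for training
- Training data sequence
- Synthesis data sequence

ML: the model parameters are first estimated from the training data as a point estimate, then used to generate the synthesis data

Bayes: the synthesis data are generated directly from the predictive distribution given the training data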
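The two criteria can be written out as follows. This is a sketch under notation assumed here, not taken from the slide: Λ denotes the model parameters, O and S the training data and training label sequences, and o and s the synthesis data and synthesis label sequences.

```latex
% ML: two separate steps with a point estimate of the model parameters
\hat{\Lambda}_{\mathrm{ML}} = \arg\max_{\Lambda} \, p(O \mid S, \Lambda), \qquad
\hat{o}_{\mathrm{ML}} = \arg\max_{o} \, p(o \mid s, \hat{\Lambda}_{\mathrm{ML}})

% Bayes: a single step based on the predictive distribution
\hat{o}_{\mathrm{Bayes}} = \arg\max_{o} \, p(o \mid s, O, S)
```

In the Bayesian form no single parameter value is committed to; the model parameters are integrated out instead.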

Bayesian speech synthesis (2/2)

Predictive distribution (marginal likelihood), expressed in terms of
- HMM state sequence for synthesis data
- HMM state sequence for training data
- Likelihood of synthesis data
- Likelihood of training data
- Prior distribution for model parameters

Variational Bayesian method [Attias; ’99]
- Estimate an approximate posterior distribution ⇒ maximize a lower bound
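A possible written-out form of the predictive distribution, using assumed notation: o and O are the synthesis and training data, s and S their label sequences, z and Z their HMM state sequences, and Λ the model parameters.

```latex
% Predictive distribution as a ratio of marginal likelihoods
p(o \mid s, O, S) = \frac{p(o, O \mid s, S)}{p(O \mid S)}

% Marginal likelihood: sum over state sequences, integrate over model parameters
p(o, O \mid s, S) = \sum_{z} \sum_{Z} \int
    p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)\, d\Lambda
```

The sums and the integral cannot be evaluated in closed form for HMMs, which is why an approximation such as the variational Bayesian method is needed.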

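Variational Bayesian method (1/2)

Lower bound on the log marginal likelihood (Jensen’s inequality)
- The expectation is taken with respect to the approximate distribution
- The approximate distribution stands in for the true posterior distribution

Random variables are assumed to be statistically independent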
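The bound and the independence assumption can be sketched as below. The notation is assumed for illustration: Q is the approximate posterior, F the lower bound, o and O the synthesis and training data, s and S their label sequences, z and Z their state sequences, and Λ the model parameters.

```latex
% Jensen's inequality applied to the log marginal likelihood
\log p(o, O \mid s, S)
  = \log \sum_{z} \sum_{Z} \int Q(z, Z, \Lambda)\,
    \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
         {Q(z, Z, \Lambda)}\, d\Lambda
  \ge \left\langle \log
    \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
         {Q(z, Z, \Lambda)}
    \right\rangle_{Q(z, Z, \Lambda)}
  \equiv \mathcal{F}

% Independence (mean-field) assumption on the approximate posterior
Q(z, Z, \Lambda) = Q(z)\, Q(Z)\, Q(\Lambda)
```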

Variational Bayesian method (2/2)

Optimal posterior distributions
- Each includes a normalization term
- Updated iteratively, as in the EM algorithm
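Under a factorized approximate posterior Q(z)Q(Z)Q(Λ), the standard mean-field result gives updates of the following shape. The symbols are assumptions made for this sketch: C_Λ, C_z, and C_Z are the normalization terms, o and O the synthesis and training data, s and S their label sequences, z and Z their state sequences, and Λ the model parameters.

```latex
Q(\Lambda) = C_{\Lambda}\, p(\Lambda)\, \exp\!\Big\{
    \big\langle \log p(o, z \mid s, \Lambda) \big\rangle_{Q(z)}
  + \big\langle \log p(O, Z \mid S, \Lambda) \big\rangle_{Q(Z)} \Big\}

Q(z) = C_{z}\, \exp \big\langle \log p(o, z \mid s, \Lambda) \big\rangle_{Q(\Lambda)}

Q(Z) = C_{Z}\, \exp \big\langle \log p(O, Z \mid S, \Lambda) \big\rangle_{Q(\Lambda)}
```

Each distribution depends on expectations under the others, so the three are updated in turn until the lower bound stops increasing.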

Approximation for speech synthesis

The posterior distribution of the model parameters depends on the synthesis data
⇒ Huge computational cost in the synthesis part

Ignore the dependency on the synthesis data
⇒ Estimate from the training data only
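In symbols, the approximation amounts to dropping the synthesis-data term from the posterior over the model parameters. This sketch assumes the notation Q(Λ) for that posterior, C_Λ for its normalization term, O, S, and Z for the training data, labels, and state sequence, and o, s, and z for their synthesis counterparts.

```latex
% Exact variational posterior: depends on the synthesis data o through Q(z)
Q(\Lambda) = C_{\Lambda}\, p(\Lambda)\, \exp\!\Big\{
    \big\langle \log p(o, z \mid s, \Lambda) \big\rangle_{Q(z)}
  + \big\langle \log p(O, Z \mid S, \Lambda) \big\rangle_{Q(Z)} \Big\}

% Approximation: training data only, so it can be computed once before synthesis
Q(\Lambda) \approx C_{\Lambda}\, p(\Lambda)\, \exp
    \big\langle \log p(O, Z \mid S, \Lambda) \big\rangle_{Q(Z)}
```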

Prior distribution

Likelihood function and conjugate prior distribution
⇒ The posterior distribution belongs to the same family of distributions as the prior distribution

Determination using statistics of prior data
- Dimension of the feature vector
- Number of prior data
- Mean of prior data
- Covariance of prior data
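As an illustration of how a conjugate prior built from prior-data statistics behaves, the sketch below uses a Normal-Wishart prior for a Gaussian likelihood with the standard conjugate update. The mapping from statistics to hyperparameters (pseudo-count equal to the number of prior data, degrees of freedom equal to that count plus the dimension, scatter equal to count times covariance) is an assumption made here for the example, and the function names are hypothetical.

```python
import numpy as np


def prior_from_statistics(mean, cov, count):
    """Build Normal-Wishart hyperparameters from prior-data statistics.

    The mapping used here is an assumed, illustrative choice.
    """
    dim = mean.shape[0]
    return {
        "mu": mean.copy(),         # prior mean of the Gaussian mean
        "kappa": float(count),     # pseudo-count attached to the mean
        "nu": float(count + dim),  # Wishart degrees of freedom (assumed)
        "scatter": count * cov,    # scatter matrix = count * covariance
    }


def posterior_update(prior, data):
    """Standard conjugate update: the posterior stays Normal-Wishart."""
    n = data.shape[0]
    xbar = data.mean(axis=0)
    centered = data - xbar
    data_scatter = centered.T @ centered
    kappa_n = prior["kappa"] + n
    diff = (xbar - prior["mu"])[:, None]
    return {
        "mu": (prior["kappa"] * prior["mu"] + n * xbar) / kappa_n,
        "kappa": kappa_n,
        "nu": prior["nu"] + n,
        "scatter": prior["scatter"] + data_scatter
        + (prior["kappa"] * n / kappa_n) * (diff @ diff.T),
    }


# Toy usage with synthetic data (not from the experiments)
rng = np.random.default_rng(0)
prior_data = rng.normal(size=(50, 3))
new_data = rng.normal(loc=1.0, size=(20, 3))
prior = prior_from_statistics(
    prior_data.mean(axis=0), np.cov(prior_data, rowvar=False, bias=True), 50
)
posterior = posterior_update(prior, new_data)
```

With this mapping the prior data act as pseudo-observations: the posterior mean equals the mean of the pooled prior and new data, which shows how strongly the amount of prior data influences the result.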

Speech parameter generation

Speech parameters
- Consist of static and dynamic features
⇒ Only the static feature sequence is generated

Speech parameter generation based on the Bayesian approach
⇒ Maximize the lower bound
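Generating only the static features under a static-plus-dynamic constraint reduces to a linear system. The sketch below assumes a fixed state sequence, a one-dimensional feature, diagonal precisions, and a delta window of (-0.5, 0, 0.5) with clamped edges; none of these details come from the slides, and the function names are hypothetical. In the Bayesian case the means and precisions passed in would be expectations under the posterior rather than point estimates.

```python
import numpy as np


def build_window_matrix(num_frames):
    """Stack static and delta rows: W maps static features c to [static; delta]."""
    static = np.eye(num_frames)
    delta = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        prev_t = max(t - 1, 0)               # clamp at the sequence edges
        next_t = min(t + 1, num_frames - 1)
        delta[t, next_t] += 0.5
        delta[t, prev_t] -= 0.5
    return np.vstack([static, delta])


def generate_static_sequence(mean, precision):
    """Solve (W^T P W) c = W^T P mu for the static sequence c.

    mean and precision hold the per-frame static values followed by the
    per-frame delta values; P is the diagonal matrix diag(precision).
    """
    num_frames = mean.shape[0] // 2
    w = build_window_matrix(num_frames)
    wtp = w.T * precision                    # W^T P via broadcasting
    return np.linalg.solve(wtp @ w, wtp @ mean)


# Toy usage: constant static means and zero delta means give a flat trajectory
toy_mean = np.concatenate([np.full(5, 2.0), np.zeros(5)])
toy_trajectory = generate_static_sequence(toy_mean, np.ones(10))
```

Because the delta rows tie neighbouring frames together, the solution is a smooth trajectory rather than a sequence of per-state means.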

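Relation between Bayes and ML

Compared with the ML criterion
- Expectations of the model parameters are used
- Can be solved in the same fashion as ML

Output distribution
- ML ⇒ evaluated at the point estimates of the model parameters
- Bayes ⇒ evaluated using expectations over the model parameters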
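One way to see why the Bayesian case can be solved like ML is to compare the two output terms. The sketch assumes a Gaussian output distribution with mean μ and precision S, an observation o_t at frame t, and a Gauss-Wishart approximate posterior Q(Λ) in which μ has posterior mean ν̄; all of this notation is assumed here rather than taken from the slides.

```latex
% ML: log output distribution at the point estimates
\text{ML:} \quad \log \mathcal{N}\big(o_t \mid \hat{\mu}, \hat{S}^{-1}\big)
  = -\tfrac{1}{2} (o_t - \hat{\mu})^{\top} \hat{S}\, (o_t - \hat{\mu}) + \text{const.}

% Bayes: expectation of the log output distribution under Q(\Lambda)
\text{Bayes:} \quad
  \big\langle \log \mathcal{N}\big(o_t \mid \mu, S^{-1}\big) \big\rangle_{Q(\Lambda)}
  = -\tfrac{1}{2} (o_t - \bar{\nu})^{\top} \langle S \rangle\, (o_t - \bar{\nu}) + \text{const.}
```

Both expressions are quadratic in o_t, so the same linear solver applies once the point estimates are replaced by the expectations; the constants do not depend on o_t.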





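Bayesian context clustering

Context clustering based on maximizing the lower bound
- Select a question (e.g., "Is this phoneme a vowel?") that splits a node into "yes" and "no" child nodes
- Compute the gain of the lower bound
- Check the stopping condition
⇒ Split the node based on the gain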
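The question selection step can be sketched as a greedy search over candidate questions. The score function below is a simple stand-in (negative squared error minus a fixed per-node penalty), not the variational lower bound used in the Bayesian approach; the data, question set, and names are all hypothetical.

```python
def best_split(items, questions, score, threshold=0.0):
    """Pick the question with the largest gain, or None if no gain passes.

    Returns (gain, question name, yes items, no items) for the best split.
    """
    base = score(items)
    best = None
    for name, predicate in questions:
        yes = [x for x in items if predicate(x)]
        no = [x for x in items if not predicate(x)]
        if not yes or not no:
            continue  # the question does not divide this node
        gain = score(yes) + score(no) - base
        if gain > threshold and (best is None or gain > best[0]):
            best = (gain, name, yes, no)
    return best


def toy_score(items):
    """Stand-in objective: negative squared error minus a per-node penalty."""
    values = [x["value"] for x in items]
    mean = sum(values) / len(values)
    return -sum((v - mean) ** 2 for v in values) - 1.0


toy_items = [
    {"phoneme": "a", "value": 5.0},
    {"phoneme": "i", "value": 5.2},
    {"phoneme": "k", "value": 1.0},
    {"phoneme": "s", "value": 1.1},
]
toy_questions = [
    ("Is this phoneme a vowel?", lambda x: x["phoneme"] in "aiueo"),
    ("Is this phoneme 'a'?", lambda x: x["phoneme"] == "a"),
]

root_split = best_split(toy_items, toy_questions, toy_score)
# Splitting the vowel node again yields no positive gain, so clustering stops there
vowel_split = best_split(root_split[2], toy_questions, toy_score)
```

The stopping condition here is simply "no question yields a gain above the threshold"; with a Bayesian objective the gain depends on the prior distribution, which is why the choice of prior affects the resulting tree size.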

Impact of prior distribution

Prior distributions affect model selection, acting as tuning parameters
⇒ A technique for determining the prior distribution is required

Conventional approach: maximize the marginal likelihood
- Leads to the over-fitting problem, as with ML
- Tuning parameters are still required

Technique for determining the prior distribution using cross validation [Hashimoto; ’08]

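Bayesian approach using CV

Prior distribution based on cross validation
- Training data is randomly divided into K groups
- For each group, a cross-valid prior distribution is determined from the remaining groups (for K = 3: groups 2,3; groups 1,3; groups 1,2)
- Posterior distributions are estimated and the likelihood is calculated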
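The grouping step can be sketched as follows. The code only covers the random division into K groups and the statistics of the remaining groups for each fold; how those statistics become a prior distribution and how the likelihood is then evaluated are not shown. The function names and the returned fields are hypothetical.

```python
import numpy as np


def cross_validation_groups(num_items, num_groups, seed=0):
    """Randomly divide item indices into num_groups groups."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_items)
    return [np.sort(group) for group in np.array_split(order, num_groups)]


def cross_valid_prior_statistics(data, groups):
    """For each held-out group, compute statistics of the remaining groups."""
    stats = []
    for k, held_out in enumerate(groups):
        rest = np.concatenate([g for j, g in enumerate(groups) if j != k])
        subset = data[rest]
        stats.append({
            "held_out": held_out,                       # indices left out
            "count": subset.shape[0],                   # number of prior data
            "mean": subset.mean(axis=0),                # mean of prior data
            "cov": np.cov(subset, rowvar=False, bias=True),  # covariance
        })
    return stats


# Toy usage with K = 3 on synthetic data
toy_data = np.arange(18.0).reshape(9, 2)
toy_groups = cross_validation_groups(num_items=9, num_groups=3)
toy_stats = cross_valid_prior_statistics(toy_data, toy_groups)
```

Because each set of prior statistics never includes the group it is evaluated against, the resulting likelihood is less prone to the over-fitting seen when the same data determine both the prior and the evaluation.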





Experimental conditions (1/2)

Database             ATR Japanese speech database B-set
Speaker              MHT
Training data        450 utterances
Test data            53 utterances
Sampling rate        16 kHz
Window               Blackman window
Frame size / shift   25 ms / 5 ms
Feature vector       24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM                  5-state left-to-right HMM without skip transition
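The analysis settings translate into sample counts and a feature dimension as shown below. The dimension of 78 is reproduced under the assumption that the mel-cepstrum includes the 0th coefficient (25 static coefficients); the slides do not state this explicitly.

```python
SAMPLING_RATE_HZ = 16_000
FRAME_SIZE_MS = 25
FRAME_SHIFT_MS = 5

# Window length and hop size in samples
frame_size_samples = SAMPLING_RATE_HZ * FRAME_SIZE_MS // 1000
frame_shift_samples = SAMPLING_RATE_HZ * FRAME_SHIFT_MS // 1000

# Assumption: order-24 mel-cepstrum including the 0th coefficient, plus log F0
MEL_CEPSTRUM_ORDER = 24
static_dim = (MEL_CEPSTRUM_ORDER + 1) + 1

# Static, delta, and delta-delta features
feature_dim = static_dim * 3

print(frame_size_samples, frame_shift_samples, feature_dim)
```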

Experimental conditions (2/2)

Compared approaches

Approach      Training   Context clustering                    # of states
ML-MDL        ML         MDL                                   2,491
Bayes-Bayes   Bayes      Bayes using CV                        25,911
Bayes-MDL     Bayes      Bayes using CV (adjusted threshold)   2,553
ML-Bayes      ML         MDL (adjusted threshold)              27,106

Mean Opinion Score (MOS) test
- Subjects were 10 Japanese students
- 20 sentences were chosen at random

Subjective listening test

[Figure: mean opinion score for each approach, labeled by number of states (2,491; 25,911; 27,106; 2,553)]

Conclusions and future work

A new framework based on the Bayesian approach
- All processes are derived from a single predictive distribution
- Improves the naturalness of synthesized speech

Future work
- Introduce HSMM instead of HMM
- Investigate the relation between speech quality and model structures
