A Bayesian Approach to HMM-Based Speech Synthesis
Kei Hashimoto (1), Heiga Zen (1), Yoshihiko Nankaku (1), Takashi Masuko (2), and Keiichi Tokuda (1)
(1) Nagoya Institute of Technology
(2) Tokyo Institute of Technology
Background
HMM-based speech synthesis system
  Spectrum, excitation and duration are modeled
  Speech parameter seqs. are generated
Maximum likelihood (ML) criterion
  Train HMMs and generate speech parameters
  Point estimate ⇒ The over-fitting problem
Bayesian approach
  Estimate posterior dist. of model parameters
  Prior information can be used
  ⇒ Alleviate the over-fitting problem
Outline
Bayesian speech synthesis
  Variational Bayesian method
  Speech parameter generation
Bayesian context clustering
  Prior distribution using cross validation
Experiments
Conclusion & Future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis under two criteria: ML and Bayes
Notation:
  Model parameters
  Label seq. for synthesis
  Label seq. for training
  Training data seq.
  Synthesis data seq.
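The two criteria can be written out as follows. The symbols are assumed here for illustration: $\Lambda$ for the model parameters, $O$ and $S$ for the training data seq. and its label seq., $o$ and $s$ for the synthesis data seq. and its label seq. Under ML, training and synthesis are two separate maximizations linked by a point estimate; under Bayes, the synthesis data is chosen directly from the predictive distribution.

```latex
\begin{align}
\text{ML:}\quad
  & \hat{\Lambda} = \arg\max_{\Lambda}\, p(O \mid S, \Lambda), \qquad
    \hat{o} = \arg\max_{o}\, p(o \mid s, \hat{\Lambda}) \\
\text{Bayes:}\quad
  & \hat{o} = \arg\max_{o}\, p(o \mid s, O, S)
\end{align}
```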
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood)
Notation:
  HMM state seq. for synthesis data
  HMM state seq. for training data
  Likelihood of synthesis data
  Likelihood of training data
  Prior distribution for model parameters
Variational Bayesian method [Attias; ’99]
  Estimate approximate posterior dist. ⇒ Maximize a lower bound
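A sketch of how the listed terms fit together, under assumed notation: $o, s$ are the synthesis data and label seqs., $O, S$ the training data and label seqs., $z, Z$ the HMM state seqs. for synthesis and training data, and $\Lambda$ the model parameters. The marginal likelihood sums over both state sequences and integrates out the model parameters, which is what makes it intractable and motivates the variational approximation.

```latex
\begin{align}
p(o \mid s, O, S) &= \frac{p(o, O \mid s, S)}{p(O \mid S)} \\
p(o, O \mid s, S) &= \sum_{z} \sum_{Z} \int
    p(o, z \mid s, \Lambda)\; p(O, Z \mid S, \Lambda)\; p(\Lambda)\; d\Lambda
\end{align}
```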
Variational Bayesian method (1/2)
Lower bound on the log marginal likelihood (Jensen’s inequality)
  Defined as an expectation w.r.t. an approximate distribution of the true posterior distribution
Random variables are assumed to be statistically independent
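The bound can be sketched as below, with assumed symbols: $Q(z, Z, \Lambda)$ is the approximate posterior over the state seqs. $z, Z$ and model parameters $\Lambda$, $\langle \cdot \rangle_{Q}$ is the expectation w.r.t. $Q$, and $\mathcal{F}$ is the lower bound. Jensen's inequality moves the logarithm inside the expectation, and the independence assumption factorizes $Q$.

```latex
\begin{align}
\log p(o, O \mid s, S)
  &= \log \sum_{z} \sum_{Z} \int Q(z, Z, \Lambda)\,
     \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
          {Q(z, Z, \Lambda)}\, d\Lambda \\
  &\geq \left\langle \log
     \frac{p(o, z \mid s, \Lambda)\, p(O, Z \mid S, \Lambda)\, p(\Lambda)}
          {Q(z, Z, \Lambda)} \right\rangle_{Q(z, Z, \Lambda)}
   \equiv \mathcal{F} \\
Q(z, Z, \Lambda) &= Q(z)\, Q(Z)\, Q(\Lambda)
\end{align}
```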
Variational Bayesian method (2/2)
Optimal posterior distributions
  Defined up to normalization terms
  Iterative updates, as in the EM algorithm
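The alternating updates can be illustrated on a model far smaller than an HMM. The sketch below applies variational Bayes to a 1-D Gaussian with unknown mean and precision, assuming a factorized posterior q(mu) q(tau), a Normal prior on the mean and a Gamma prior on the precision. The function name, hyperparameter names and default values are hypothetical and are not taken from the slides.

```python
def vb_gaussian(data, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Variational Bayes for a 1-D Gaussian (toy stand-in for the HMM case).

    Assumed model: x ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0). The posterior is approximated as q(mu) q(tau).
    """
    n = len(data)
    mean = sum(data) / n
    # Parameters that do not change across iterations.
    mu_n = (lam0 * mu0 + n * mean) / (lam0 + n)   # mean of q(mu)
    a_n = a0 + (n + 1) / 2.0                      # shape of q(tau)
    e_tau = a0 / b0                               # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current expectation of tau.
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu): expected squared deviations under q(mu).
        exp_sq = sum((x - mu_n) ** 2 + 1.0 / lam_n for x in data)
        exp_sq += lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * exp_sq
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n


mu_n, lam_n, a_n, b_n = vb_gaussian([1.0, 2.0, 3.0, 4.0])
print(mu_n, a_n / b_n)  # posterior mean of mu and expected precision
```

Each pass updates one factor while holding the other fixed, which is the same alternating pattern as the E and M steps of EM, except that both steps produce distributions rather than point estimates.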
Approximation for speech synthesis
Posterior dist. of model parameters is dependent on synthesis data
  ⇒ Huge computational cost in the synthesis part
Ignore the dependency on synthesis data
  ⇒ Estimation from only training data
Prior distribution
Conjugate prior distribution
  ⇒ Posterior dist. belongs to the same family of dist. as the prior dist.
  [Equations: likelihood function and conjugate prior distribution]
Determination using statistics of prior data
  Dimension of feature
  Covariance of prior data
  # of prior data
  Mean of prior data
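A minimal sketch of both ideas for a single feature dimension, assuming a Normal-Gamma prior on the mean and precision of a Gaussian (the diagonal-covariance counterpart of a Gauss-Wishart prior). The mapping from prior-data statistics to hyperparameters is one common convention and is an assumption here; the slides' exact setting, which also involves the feature dimension, may differ. All names are hypothetical.

```python
def prior_from_statistics(mean, var, count):
    """Build Normal-Gamma hyperparameters from statistics of prior data.

    Assumed mapping: the prior behaves like `count` pseudo-observations
    with the given mean and variance.
    """
    return {"mu": mean, "kappa": float(count),
            "a": count / 2.0, "b": count * var / 2.0}


def posterior(prior, data):
    """Conjugate update: the result has the same form as the prior."""
    n = len(data)
    mean = sum(data) / n
    scatter = sum((x - mean) ** 2 for x in data)
    kappa_n = prior["kappa"] + n
    shift = prior["kappa"] * n * (mean - prior["mu"]) ** 2 / (2.0 * kappa_n)
    return {
        "mu": (prior["kappa"] * prior["mu"] + n * mean) / kappa_n,
        "kappa": kappa_n,
        "a": prior["a"] + n / 2.0,
        "b": prior["b"] + 0.5 * scatter + shift,
    }


prior = prior_from_statistics(mean=0.0, var=1.0, count=2)
post = posterior(prior, [1.0, 3.0])
print(post)  # same keys as the prior, so it can serve as a prior again
```

Because the posterior has the same parametric form as the prior, the # of prior data acts as a weight: a larger count pulls the posterior mean further toward the mean of prior data.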
Speech parameter generation
Speech parameter
  Consists of static and dynamic features
  ⇒ Only static feature seq. is generated
Speech parameter generation based on Bayesian approach
  ⇒ Maximize the lower bound
Relation between Bayes and ML
Compare with the ML criterion
  Output dist. under ML ⇒ uses point estimates of model parameters
  Output dist. under Bayes ⇒ uses expectations of model parameters
  Can be solved in the same fashion as ML
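Since only the per-frame means and precisions of the output dist. enter the generation step, the same solver can serve both criteria. The sketch below generates a 1-D static feature sequence from per-frame (static, delta) means and precisions by solving the weighted least-squares system (W' P W) c = W' P m. The delta window (half the difference of neighbouring frames, clamped at the edges) and all function names are assumptions for illustration. Under ML the inputs would be point estimates; under Bayes they would be posterior expectations.

```python
def build_window_matrix(T):
    """Rows 2t and 2t+1 map a static sequence c to [c_t, delta c_t]."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        # Assumed delta: 0.5 * (c[t+1] - c[t-1]), indices clamped at edges.
        W[2 * t + 1][max(t - 1, 0)] -= 0.5
        W[2 * t + 1][min(t + 1, T - 1)] += 0.5
    return W


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def generate_static(means, precisions):
    """means, precisions: one (static, delta) pair per frame."""
    T = len(means)
    W = build_window_matrix(T)
    m = [v for pair in means for v in pair]
    p = [v for pair in precisions for v in pair]
    rows = range(2 * T)
    A = [[sum(W[r][i] * p[r] * W[r][j] for r in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[r][i] * p[r] * m[r] for r in rows) for i in range(T)]
    return solve_linear(A, b)


# Step-like static means with zero delta means: the delta terms smooth the step.
trajectory = generate_static([(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0)],
                             [(1.0, 4.0)] * 4)
print(trajectory)
```

A dense solver keeps the sketch short; practical systems exploit the band structure of W' P W instead.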
Bayesian context clustering
Context clustering based on maximizing the lower bound
  Select question (e.g. "Is this phoneme a vowel?", with yes / no branches)
  Gain of the lower bound from splitting a node
  Stopping condition on the gain
  ⇒ Split node based on gain
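The split decision can be sketched for a single node as follows. As a simplification, the exact log marginal likelihood of 1-D data under a Normal-Gamma prior stands in for the variational lower bound used with HMMs. The question names, context fields, data values and prior hyperparameters are all hypothetical.

```python
import math


def log_marginal(values, mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0):
    """Log marginal likelihood of 1-D data under a Normal-Gamma prior."""
    n = len(values)
    mean = sum(values) / n
    scatter = sum((x - mean) ** 2 for x in values)
    kappa_n = kappa0 + n
    a_n = a0 + n / 2.0
    b_n = (b0 + 0.5 * scatter
           + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n))
    return (math.lgamma(a_n) - math.lgamma(a0)
            + a0 * math.log(b0) - a_n * math.log(b_n)
            + 0.5 * math.log(kappa0 / kappa_n)
            - 0.5 * n * math.log(2.0 * math.pi))


def best_split(items, questions):
    """items: list of (context, value); questions: name -> predicate.

    Returns (question name, gain) for the largest positive gain,
    or None when no question improves the score (stopping condition).
    """
    parent = log_marginal([v for _, v in items])
    best = None
    for name, ask in questions.items():
        yes = [v for c, v in items if ask(c)]
        no = [v for c, v in items if not ask(c)]
        if not yes or not no:
            continue  # a question must actually divide the node
        gain = log_marginal(yes) + log_marginal(no) - parent
        if gain > 0 and (best is None or gain > best[1]):
            best = (name, gain)
    return best


items = [
    ({"vowel": True, "voiced": True}, 5.0),
    ({"vowel": True, "voiced": False}, 5.2),
    ({"vowel": True, "voiced": True}, 4.8),
    ({"vowel": True, "voiced": False}, 5.1),
    ({"vowel": False, "voiced": True}, -5.0),
    ({"vowel": False, "voiced": False}, -4.9),
    ({"vowel": False, "voiced": True}, -5.2),
    ({"vowel": False, "voiced": False}, -5.1),
]
questions = {
    "is_vowel": lambda c: c["vowel"],
    "is_voiced": lambda c: c["voiced"],
}
print(best_split(items, questions))
```

Unlike a likelihood-based criterion, this score needs no separate penalty term to stop splitting, but the result depends on the prior hyperparameters passed to `log_marginal`.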
Impact of prior distribution
Affects model selection as tuning parameters
  ⇒ Requires a determination technique for the prior dist.
Conventional: maximize the marginal likelihood
  Leads to the over-fitting problem, as with ML
  Tuning parameters are still required
Determination technique of prior distribution using cross validation [Hashimoto; ’08]
Bayesian approach using CV
Prior distribution based on cross validation
  Training data is randomly divided into K groups
  Cross valid prior dist. for each group is built from the other groups (for K = 3: groups 2,3 / 1,3 / 1,2)
  Calculate likelihood
  Posterior dist.
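The data partitioning can be sketched as follows, assuming scalar observations and a prior summarized by the mean, variance and count of the complementary groups. The function and field names are hypothetical, and the likelihood calculation on each held-out group is left out.

```python
import random


def cross_valid_priors(data, k, seed=0):
    """Randomly divide data into k groups (k >= 2, len(data) >= k) and pair
    each group with prior statistics computed from the other k - 1 groups."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]
    folds = []
    for i, held_out in enumerate(groups):
        rest = [x for j, g in enumerate(groups) if j != i for x in g]
        count = len(rest)
        mean = sum(rest) / count
        var = sum((x - mean) ** 2 for x in rest) / count
        folds.append({"group": held_out, "prior_mean": mean,
                      "prior_var": var, "prior_count": count})
    return folds


data = [float(i) for i in range(9)]
for fold in cross_valid_priors(data, k=3):
    print(fold["group"], fold["prior_mean"], fold["prior_count"])
```

Each group is scored only under a prior that never saw it, so the prior is determined from the training data itself without rewarding over-fitted settings.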
Experimental conditions (1/2)

Database            ATR Japanese speech database B-set
Speaker             MHT
Training data       450 utterances
Test data           53 utterances
Sampling rate       16 kHz
Window              Blackman window
Frame size / shift  25 ms / 5 ms
Feature vector      24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM                 5-state left-to-right HMM without skip transition
Experimental conditions (2/2)
Compared approaches

Approach      Training  Context clustering                   # of states
ML-MDL        ML        MDL                                  2,491
Bayes-Bayes   Bayes     Bayes using CV                       25,911
Bayes-MDL     Bayes     Bayes using CV (adjusted threshold)  2,553
ML-Bayes      ML        MDL (adjusted threshold)             27,106

Mean Opinion Score (MOS) test
  Subjects were 10 Japanese students
  20 sentences were chosen at random
Subjective listening test
[Figure: mean opinion score of each approach, labeled by # of states: 2,491 / 25,911 / 27,106 / 2,553]
Conclusions and future work
A new framework based on the Bayesian approach
  All processes are derived from a single predictive distribution
  Improves the naturalness of synthesized speech
Future work
  Introduce HSMM instead of HMM
  Investigate the relation between the speech quality and model structures